On the interplay between noise and curvature and its effect on optimization and generalization
Abstract
This work revisits the notion of \textit{information criterion} to characterize generalization for modern deep learning models. In particular, we empirically demonstrate the effectiveness of the Takeuchi Information Criterion (TIC), an extension of the Akaike Information Criterion (AIC) for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss $\rmH$ and the covariance matrix of the gradients $\rmSS$. By exploring the semantic and numerical similarities and differences between these two matrices, as well as the Fisher information matrix $\rmF$, we provide further evidence that flatness cannot in itself predict generalization. We also address the question of when $\rmSS$ is a reasonable approximation to $\rmF$, as is commonly assumed.
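For reference, a minimal sketch of the classical Takeuchi penalty in the notation above, assuming $\rmSS$ denotes the (uncentered) covariance of per-sample gradients at the minimizer and writing $n$ for the number of training samples; these conventions are assumptions for illustration, not definitions taken from the paper:
\[
    \widehat{\mathrm{gap}} \;\approx\; \frac{1}{n}\,\operatorname{Tr}\!\left(\rmH^{-1}\rmSS\right),
\]
so the TIC estimate of the test loss adds this correction to the training loss. In the well-specified case $\rmSS = \rmH = \rmF$, and the penalty reduces to the AIC term $k/n$, with $k$ the number of parameters.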