Adaptive Gradient Methods at the Edge of Stability

Behrooz Ghorbani
David Cardoze
Jeremy Cohen
Justin Gilmer
Shankar Krishnan
NeuRIPS 2022 (2022) (to appear)

Abstract

Little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we show that during full-batch training, the maximum eigenvalue of the \emph{preconditioned} Hessian typically equilibrates at the stability threshold of a related non-adaptive algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the “Edge of Stability,” their behavior in this regime differs in a crucial way from that of their non-adaptive counterparts. Whereas non-adaptive algorithms are forced to remain in low-curvature regions of the loss landscape, we demonstrate that adaptive gradient methods often advance into high-curvature regions, while adapting the preconditioner to compensate. We believe that our findings will serve as a foundation for the community’s future understanding of adaptive gradient methods in deep learning.

Research Areas