Benign Underfitting of SGD in Stochastic Convex Optimization

Roi Livni
uri sherman
Yishay Mansour
NeurIPS 2022


A natural, seemingly undisputed understanding is that a learning algorithm must obtain a good fit to the training data in order to perform well on independent test data. This view is reflected both in the traditional bias-complexity trade-off, and in the newly emerging theory of generalization of interpolating learning rules. In this work, we ask to what extent may stochastic gradient descent (SGD) be similarly understood to generalize by means of fit to the training data. We consider the fundamental stochastic convex optimization framework, and seek bounds on the empirical risk and generalization gap of one-pass SGD. Surprisingly, we discover there exist convex learning problems where the output of SGD exhibits empirical risk and generalization gap both $\Omega(1)$, but generalizes well nonetheless with population risk of $O(1/\sqrt n)$, as guaranteed by the classical analysis. Consequently, it turns out SGD is not algorithmically stable in \emph{any} sense, and cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the variant of SGD given by sampling \emph{with}-replacement from the training set, for which we prove that in counter to what might be expected, overfitting does not occur and the population risk converges at the optimal rate. Finally, we study empirical risk guaranties of multiple epochs of without-replacement SGD, and derive nearly tight upper and lower bounds significantly improving over previously known results in this setting.

Research Areas