The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

Daniel S. Park; Jascha Sohl-Dickstein; Quoc V. Le; Sam Smith

ICML (2019) (to appear)

Abstract

We investigate how the behavior of stochastic gradient descent is influenced by model size. By studying families of models obtained by increasing the number of channels in a base network, we examine how the optimal hyperparameters (the batch size and learning rate at which the test error is minimized) correlate with the network width. We find that the optimal "normalized noise scale," which we define to be a function of the batch size, learning rate and the initialization conditions, is proportional to the number of channels (in the absence of batch normalization). This conclusion holds for MLPs, ConvNets and ResNets. A surprising consequence is that if we wish to maintain optimal performance as the network width increases, we must use increasingly small batch sizes. Based on our experiments, we also conjecture that there may be a critical width, beyond which the optimal performance of networks trained with constant SGD ceases to improve unless additional regularization is introduced.

Research Areas

Machine intelligence
