Large scale distributed neural network training through online distillation

Rohan Anil

Gabriel Pereyra

Alexandre Tachard Passos

Robert Ormandi

George Dahl

Geoffrey Hinton

ICLR(2018)

Download Google Scholar

Abstract

While techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model they are seldom used as the multi-stage training setups they require are cumbersome and the extra hyperparameters introduced make the process of tuning even more expensive. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup. We also show that distillation can be used as a meaningful distributed learning algorithm: instead of independent workers exchanging gradients, which requires worrying about delays and synchronization, independent workers can exchange full model checkpoints. This can be done far less frequently than exchanging gradients, breaking one of the scalability barriers of stochastic gradient descent. We have experiments on Criteo clickthrough rate, and the largest to-date dataset used for neural language modeling, based on Common Crawl and containing $6\times 10^{11}$ tokens. In these experiments we show we can scale at least $2\times$ as well as the maximum limit of distributed stochastic gradient descent. Finally, we also show that online distillation can dramatically reduce the churn in the predictions between different versions of a model.

Research Areas

Machine Intelligence

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations  & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Large scale distributed neural network training through online distillation

Abstract

Research Areas

Learn more about how we conduct our research

Defining the technology of today and tomorrow.

Philosophy

People

Teams

AI/ML Foundations & Capabilities

Algorithms & Optimization

Computing Paradigms

Responsible Human-Centric Technology

Science & Societal Impact

Projects

Publications

Resources

Shaping the future, together.

Student programs

Faculty programs

Conferences & events

Large scale distributed neural network training through online distillation

Abstract

Research Areas

Learn more about how we conduct our research

AI/ML Foundations  & Capabilities