Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Yujun Lin; Song Han; Huizi Mao; Yu Wang; William Dally

ICLR (2018)

Abstract

Large-scale distributed training requires significant communication bandwidth to exchange gradients. This intensive gradient communication limits the scalability of multi-machine, multi-GPU training and requires expensive high-bandwidth network switches. In this paper, we find that 99.9\% of the gradient exchange is redundant and can be safely removed without affecting convergence accuracy. We propose "Deep Gradient Compression", which reduces communication bandwidth by up to 600$\times$ (after taking metadata into account). Deep Gradient Compression has four components that together fully preserve convergence accuracy: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We evaluated Deep Gradient Compression extensively on multiple types of machine learning tasks, including image classification, speech recognition, and language modeling, and on multiple datasets, including CIFAR-10, ImageNet, Penn Treebank, and the Librispeech Corpus. In all of these scenarios, Deep Gradient Compression with only 0.1\% gradient exchange achieved the same accuracy and the same learning curves as the conventional dense update. These techniques enable distributed training on inexpensive commodity 1 Gbps Ethernet.
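The abstract names the components without spelling out the mechanics, so the following is only a minimal sketch of how one local step might look. It assumes each worker keeps the largest-magnitude 0.1\% of its accumulated update and holds the rest back locally. The function name `sparsify_step`, the state lists `velocity` and `residual`, and the parameters `momentum` and `keep_ratio` are hypothetical names chosen for illustration and are not taken from the paper. Local gradient clipping and warm-up training are omitted for brevity.

```python
def sparsify_step(grad, velocity, residual, momentum=0.9, keep_ratio=0.001):
    """One local sparsification step (illustrative sketch, not the authors' code).

    grad, velocity and residual are equal-length lists of floats.
    velocity and residual are updated in place.
    Returns a dict {index: value} holding the entries a worker would send.
    """
    n = len(grad)
    for i in range(n):
        # Momentum correction (assumed form): accumulate the momentum-updated
        # velocity locally instead of the raw gradient.
        velocity[i] = momentum * velocity[i] + grad[i]
        residual[i] += velocity[i]

    # Select the k entries with the largest accumulated magnitude.
    k = max(1, int(n * keep_ratio))
    top = sorted(range(n), key=lambda i: abs(residual[i]), reverse=True)[:k]

    sparse_update = {}
    for i in top:
        sparse_update[i] = residual[i]
        # Momentum factor masking (assumed form): clear the local state of
        # every coordinate that has just been sent.
        residual[i] = 0.0
        velocity[i] = 0.0
    return sparse_update


# Usage with a synthetic 2000-element gradient (made-up values).
grad = [0.01 * ((i * 37) % 101 - 50) for i in range(2000)]
velocity = [0.0] * 2000
residual = [0.0] * 2000
update = sparsify_step(grad, velocity, residual)
print(len(update), "of", len(grad), "entries would be sent")
```

Entries that are not sent stay in `residual` and keep growing across steps until they are large enough to be selected, which is one way small gradients can be delayed rather than discarded.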

Research Areas

Machine intelligence
