- James Patrick Lee-Thorp
- Joshua Ainslie
We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of mixing transformations to design the Sparse Mixer encoder model. Sparse Mixer slightly (<1%) outperforms BERT on GLUE and SuperGLUE, but, more importantly, trains 65% faster and runs inference 61% faster. We also present a faster variant, Fast Sparse Mixer, that very slightly (<0.2%) under-performs BERT on SuperGLUE but trains and runs nearly twice as fast: 89% faster training and 98% faster inference. We justify the design of these two models by carefully ablating over various mixing mechanisms, MoE configurations, and model hyperparameters. Sparse Mixer overcomes the speed and stability concerns of MoE models and shows that smaller sparse models may be served out of the box, without resorting to distilling them into dense student models.
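The abstract describes combining two building blocks: a mixing transformation that replaces attention for token mixing, and a sparsely gated MoE sublayer that routes each token to a subset of expert feed-forward networks. The sketch below is an illustrative NumPy toy, not the authors' implementation: it uses a 2D DFT as the mixing transformation and top-1 gating over hypothetical dense ReLU experts, with residual connections around each sublayer; all names and dimensions are assumptions for the example.

```python
import numpy as np

def mixing_sublayer(x):
    """Token mixing via a 2D DFT, keeping the real part.
    An illustrative stand-in for the paper's mixing transformations."""
    return np.fft.fft2(x).real

def top1_moe_sublayer(x, w_gate, experts):
    """Sparsely gated MoE: route each token to its single
    highest-scoring expert, scaled by its softmax gate value."""
    logits = x @ w_gate                          # (tokens, num_experts)
    choice = logits.argmax(axis=-1)              # chosen expert per token
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = expert(x[mask]) * probs[mask, e:e + 1]
    return out

rng = np.random.default_rng(0)
seq_len, d_model, num_experts = 8, 16, 4
x = rng.standard_normal((seq_len, d_model))
w_gate = rng.standard_normal((d_model, num_experts))
# Each toy "expert" is a single-layer ReLU feed-forward block.
weights = [rng.standard_normal((d_model, d_model)) * 0.1
           for _ in range(num_experts)]
experts = [lambda h, w=w: np.maximum(h @ w, 0.0) for w in weights]

y = x + mixing_sublayer(x)                       # mixing sublayer + residual
y = y + top1_moe_sublayer(y, w_gate, experts)    # sparse MoE sublayer + residual
print(y.shape)  # (8, 16)
```

Because each token activates only one expert, the MoE sublayer's compute per token stays constant as the number of experts (and hence model capacity) grows, which is the capacity-for-speed trade the abstract refers to.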