Google Research

Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT

  • James Patrick Lee-Thorp
  • Joshua Ainslie
Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, pp. 58–75

Abstract

We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of mixing transformations to design the Sparse Mixer encoder model. The Sparse Mixer slightly (<1%) outperforms BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster. We also present a faster variant, Fast Sparse Mixer, that marginally (<0.2%) underperforms BERT on SuperGLUE, but trains and runs nearly twice as fast: 89% faster training and 98% faster inference. We justify the design of these two models by carefully ablating over a range of mixing mechanisms, MoE configurations, and model hyperparameters. The Sparse Mixer overcomes the speed and stability concerns of MoE models and shows that smaller sparse models may be served out of the box, without distilling them into dense student models.
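
The abstract describes an encoder block that pairs a mixing sublayer with a sparsely gated MoE feed-forward. The sketch below is only an illustration of that combination under simplifying assumptions: an FNet-style Fourier transform stands in for the paper's mixing transformations, a top-1 gate routes each token to a single expert, and all names, shapes, and expert counts are illustrative rather than the Sparse Mixer's actual configuration.

    # Illustrative Sparse Mixer-style encoder block (not the paper's exact design):
    # a parameter-free mixing sublayer followed by a top-1 gated MoE feed-forward.
    import numpy as np

    def layer_norm(x, eps=1e-6):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def mixing_sublayer(x):
        # Token/hidden mixing via a 2D Fourier transform; keep only the real part.
        return np.fft.fft2(x).real

    def moe_ffn(x, w_gate, experts):
        # experts: list of (w_in, w_out) dense feed-forward parameters.
        logits = x @ w_gate                 # (seq, num_experts)
        choice = logits.argmax(-1)          # top-1 expert per token
        out = np.zeros_like(x)
        for e, (w_in, w_out) in enumerate(experts):
            mask = choice == e
            if mask.any():
                h = np.maximum(x[mask] @ w_in, 0.0)   # ReLU expert FFN
                out[mask] = h @ w_out
        return out

    def sparse_mixer_block(x, w_gate, experts):
        x = layer_norm(x + mixing_sublayer(x))          # mixing sublayer + residual
        x = layer_norm(x + moe_ffn(x, w_gate, experts)) # MoE sublayer + residual
        return x

    # Usage with illustrative sizes.
    rng = np.random.default_rng(0)
    seq_len, d_model, d_ff, num_experts = 128, 64, 256, 4
    x = rng.standard_normal((seq_len, d_model))
    w_gate = 0.02 * rng.standard_normal((d_model, num_experts))
    experts = [(0.02 * rng.standard_normal((d_model, d_ff)),
                0.02 * rng.standard_normal((d_ff, d_model)))
               for _ in range(num_experts)]
    print(sparse_mixer_block(x, w_gate, experts).shape)  # (128, 64)

Because each token activates only one expert, the per-token compute of the MoE sublayer matches a single dense feed-forward while total parameter capacity scales with the number of experts, which is the trade-off the abstract leans on for its speed claims.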
