- James Patrick Lee-Thorp
- Joshua Ainslie
We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of mixing transformations to design the Sparse Mixer encoder model. Sparse Mixer slightly (<1%) outperforms BERT on GLUE and SuperGLUE, but, more importantly, trains 65% faster and runs inference 61% faster. We also present a faster variant, Fast Sparse Mixer, that very slightly (<0.2%) under-performs BERT on SuperGLUE but trains and runs nearly twice as fast: 89% faster training and 98% faster inference. We justify the design of these two models by carefully ablating over various mixing mechanisms, MoE configurations, and model hyperparameters. Sparse Mixer overcomes the speed and stability concerns of MoE models and shows that smaller sparse models may be served out of the box, without resorting to distilling them into dense student models.
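The abstract describes combining two building blocks: a mixing transformation that replaces attention for token mixing, and a sparsely gated MoE sublayer that routes each token to a subset of expert feed-forward networks. The sketch below is an illustrative NumPy toy, not the authors' implementation: it uses a 2D DFT as the mixing transformation and top-1 gating over hypothetical dense ReLU experts, with residual connections around each sublayer; all names and dimensions are assumptions for the example.

```python
import numpy as np

def mixing_sublayer(x):
    """Token mixing via a 2D DFT, keeping the real part.
    An illustrative stand-in for the paper's mixing transformations."""
    return np.fft.fft2(x).real

def top1_moe_sublayer(x, w_gate, experts):
    """Sparsely gated MoE: route each token to its single
    highest-scoring expert, scaled by its softmax gate value."""
    logits = x @ w_gate                          # (tokens, num_experts)
    choice = logits.argmax(axis=-1)              # chosen expert per token
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = expert(x[mask]) * probs[mask, e:e + 1]
    return out

rng = np.random.default_rng(0)
seq_len, d_model, num_experts = 8, 16, 4
x = rng.standard_normal((seq_len, d_model))
w_gate = rng.standard_normal((d_model, num_experts))
# Each toy "expert" is a single-layer ReLU feed-forward block.
weights = [rng.standard_normal((d_model, d_model)) * 0.1
           for _ in range(num_experts)]
experts = [lambda h, w=w: np.maximum(h @ w, 0.0) for w in weights]

y = x + mixing_sublayer(x)                       # mixing sublayer + residual
y = y + top1_moe_sublayer(y, w_gate, experts)    # sparse MoE sublayer + residual
print(y.shape)  # (8, 16)
```

Because each token activates only one expert, the MoE sublayer's compute per token stays constant as the number of experts (and hence model capacity) grows, which is the capacity-for-speed trade the abstract refers to.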