Google Research

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

  • Ashish Teku Vaswani
  • Dani Yogatama
  • Don Metzler
  • Hyung Won Chung
  • Jinfeng Rao
  • Liam B. Fedus
  • Mostafa Dehghani
  • Samira Abnar
  • Sharan Narang
  • Yi Tay
ICLR (2022)


Kaplan et al. argue that the performance of a Transformer model depends strongly on model size but only weakly on model shape. Our work empirically confirms their result for upstream pre-training, but then reveals a striking discrepancy when fine-tuning: downstream task performance is strongly influenced by model shape (e.g., depth and width). We find that widely adopted models including T5-base, T5-large, and T5-XL/XXL (Raffel et al., 2019) lie off the compute-performance Pareto frontier. Accordingly, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality with 50% fewer parameters while training 40% faster. We conclude by demonstrating that our improved scaling protocol also holds in other domains.
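The shape-matters observation above can be made concrete with a rough parameter count. Using the standard approximation for non-embedding Transformer parameters (per layer: roughly 4·d_model² for attention projections plus 8·d_model² for a feed-forward block with d_ff = 4·d_model, ignoring biases, layer norms, and embeddings), two models of very different shape can have an identical budget. The sketch below is illustrative arithmetic, not the paper's exact protocol; the function name and configs are hypothetical:

```python
def transformer_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count for a Transformer stack.

    Per layer: 4*d_model^2 (Q, K, V, output projections)
             + 8*d_model^2 (feed-forward with d_ff = 4*d_model),
    ignoring biases, layer norms, and embedding tables.
    """
    return n_layers * 12 * d_model * d_model

# A wide-shallow shape vs. a deep-narrow shape with the same budget:
# quadrupling depth while halving width leaves 12 * L * d^2 unchanged.
wide_shallow = transformer_params(n_layers=12, d_model=768)   # 84,934,656
deep_narrow = transformer_params(n_layers=48, d_model=384)    # 84,934,656
print(wide_shallow, deep_narrow, wide_shallow == deep_narrow)
```

At equal parameter count the two shapes are indistinguishable to a size-only scaling law, yet the abstract's claim is precisely that their downstream fine-tuning quality can differ; the paper's improved protocol preferentially scales depth over width.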
