Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

Ashish Teku Vaswani
Dani Yogatama
Hyung Won Chung
Jinfeng Rao
Liam B. Fedus
Samira Abnar
Sharan Narang
Yi Tay
ICLR (2022)

Abstract

Kaplan et al. (2020) argue that the performance of a Transformer model depends strongly on model size but only weakly on model shape. Our work empirically confirms this result for upstream pre-training, but reveals a striking discrepancy when fine-tuning: downstream task performance is strongly influenced by model shape (e.g., depth and width). We find that widely adopted models, including T5-base, T5-large, and T5-XL/XXL (Raffel et al. 2019), are inefficient on the compute-performance Pareto frontier. To this end, we present an improved scaling protocol whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster than the widely adopted T5-base model. We conclude by demonstrating that our improved scaling protocol also holds in other domains.
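
The abstract's central contrast is between model size (total parameters) and model shape (how those parameters are allocated across depth and width). As a minimal, hypothetical sketch of that distinction, the Python snippet below compares two Transformer stacks with roughly equal parameter counts but different depth/width trade-offs; the simplified count covers only per-layer attention and feed-forward weights (ignoring embeddings, biases, and layer norms), and the specific configurations are illustrative rather than the paper's actual models.

# Hypothetical back-of-the-envelope parameter count for Transformer stacks.
# Simplification: only attention and feed-forward weights per layer are
# counted; embeddings, biases, and layer norms are ignored.

def layer_params(d_model: int, d_ff: int) -> int:
    """Approximate weights in one Transformer layer."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    feed_forward = 2 * d_model * d_ff   # the two dense layers of the FFN
    return attention + feed_forward

def stack_params(num_layers: int, d_model: int, d_ff: int) -> int:
    """Approximate weights in a stack of identical layers."""
    return num_layers * layer_params(d_model, d_ff)

# Two illustrative shapes with matched parameter budgets: a shallow-wide
# stack and a deep-narrow one (numbers chosen only for this example).
shallow_wide = stack_params(num_layers=12, d_model=1024, d_ff=4096)
deep_narrow = stack_params(num_layers=24, d_model=768, d_ff=2560)

print(f"shallow-wide: {shallow_wide / 1e6:.0f}M parameters")
print(f"deep-narrow:  {deep_narrow / 1e6:.0f}M parameters")

Under this simplified count, both stacks land at roughly 151M parameters despite having very different shapes; the paper's observation is that such shape differences matter little for pre-training quality but substantially for downstream fine-tuning quality.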