Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Abstract
Kaplan et al. argue that the performance of a Transformer model depends strongly on model size, but only weakly on model shape. Our work empirically confirms their results for upstream (pre-training) performance, but reveals a striking discrepancy when fine-tuning: downstream task performance is strongly influenced by model shape (e.g., depth and width). We find that widely adopted models, including T5-base, T5-large, and T5-XL/XXL (Raffel et al., 2019), are inefficient with respect to the compute-performance Pareto curve. In light of these findings, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality with 50% fewer parameters and 40% faster training.
We conclude by demonstrating that our improved scaling protocol also holds in other domains.