Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

Ashish Teku Vaswani; Dani Yogatama; Don Metzler; Hyung Won Chung; Jinfeng Rao; Liam B. Fedus; Mostafa Dehghani; Samira Abnar; Sharan Narang; Yi Tay

ICLR (2022)

Abstract

Kaplan et al. argue that the performance of a Transformer model depends strongly on model size but only weakly on model shape. Our work empirically confirms their results for upstream training, but reveals a striking discrepancy when fine-tuning: downstream task performance is strongly influenced by model shape (e.g., depth and width). We find that widely adopted models, including T5-base, T5-large, and T5-XL/XXL (Raffel et al. 2019), are inefficient on a compute-performance Pareto curve. To address this, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster.
We conclude by demonstrating that our improved scaling protocol also holds in other domains.
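
To make the notion of a compute-performance Pareto curve concrete, the following is a minimal sketch of how one might pick out Pareto-efficient configurations from a set of candidates. The model names, cost values, and quality scores are hypothetical placeholders chosen for illustration; they are not results from the paper, and the function `pareto_frontier` is an assumed helper, not the authors' method.

```python
from typing import List, Tuple

# (name, relative compute cost, downstream quality).
# All values below are invented for illustration only.
Model = Tuple[str, float, float]


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Return the models that no other model dominates.

    A model is dominated if another model costs no more and scores
    no worse, and is strictly better on at least one of the two.
    """
    frontier = []
    for name, cost, quality in models:
        dominated = any(
            (c <= cost and q >= quality) and (c < cost or q > quality)
            for _, c, q in models
        )
        if not dominated:
            frontier.append((name, cost, quality))
    # Sort by cost so the result traces the curve from cheap to expensive.
    return sorted(frontier, key=lambda m: m[1])


# Hypothetical candidates: two shapes of similar quality but different cost,
# plus a smaller and a larger model.
candidates = [
    ("shallow-wide", 1.0, 70.0),
    ("deep-narrow", 0.6, 70.5),
    ("small", 0.3, 65.0),
    ("large", 2.0, 74.0),
]

frontier = pareto_frontier(candidates)
for name, cost, quality in frontier:
    print(f"{name}: cost={cost}, quality={quality}")
```

In this toy data the "shallow-wide" entry falls off the frontier because "deep-narrow" reaches at least the same quality at lower cost, which mirrors the kind of inefficiency the abstract describes for some widely adopted configurations.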

Research Areas

Natural language processing
