- Adams Yu
- Andrew Dai
- Barret Richard Zoph
- Claire Cui
- Dmitry (Dima) Lepikhin
- Emma Wang
- Kathy Meier-Hellstern
- Kellie Webster
- Kevin Robinson
- Kun Zhang
- Liam B. Fedus
- Lucas Dixon
- Maarten Paul Bosma
- Marie Pellat
- Maxim Krikun
- Nan Du
- Orhan Firat
- Quoc V. Le
- Simon Tong
- Tao Wang
- Toju Duke
- Yanping Huang
- Yanqi Zhou
- Yonghui Wu
- Yuanzhong Xu
- Zhifeng Chen
- Zongwei Zhou
Abstract
Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-experts language models named GLaM (Generalist Language Model), which can have many more parameters but require significantly less training cost than dense models. The largest GLaM has 1.2 trillion parameters, approximately 7x more than GPT-3, yet it can be trained more efficiently. Using only 1/3 of the energy consumed to train GPT-3, GLaM achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, GLaM reaches 75.0% one-shot exact-match accuracy on the TriviaQA test server, a significant improvement over the 68.0% obtained by GPT-3.
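To illustrate what "sparsely activated mixture-of-experts" means, the sketch below shows a minimal MoE layer in which a gating network routes each token to only a small number of experts, so compute per token stays roughly constant even as the total parameter count grows. This is an illustrative assumption-laden example, not the GLaM implementation: the function name `moe_layer`, the array shapes, and the use of NumPy are choices made here for clarity.

```python
import numpy as np

def moe_layer(x, gate_w, expert_ws, k=2):
    """Minimal sketch of a sparsely activated mixture-of-experts layer.

    x:         [d_model] input token representation
    gate_w:    [d_model, n_experts] gating weights (illustrative)
    expert_ws: list of n_experts matrices, each [d_model, d_model]
    k:         number of experts activated per token
    """
    logits = x @ gate_w                      # [n_experts] gating scores
    top_k = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only k experts run for this token, so the per-token compute is small
    # even when the total number of experts (and parameters) is large.
    return sum(w * (x @ expert_ws[i]) for w, i in zip(weights, top_k))

# Example: 8 experts in the layer, but each token only activates 2 of them.
rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
y = moe_layer(rng.normal(size=d_model),
              rng.normal(size=(d_model, n_experts)),
              [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)])
```

The key design point the example tries to convey is that parameter count and per-token training cost are decoupled: adding experts grows the model's capacity without proportionally growing the computation spent on each token.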