UL2: Unifying Language Learning Paradigms

Yi Tay; Mostafa Dehghani; Vinh Tran; Xavier Garcia; Jason Wei; Xuezhi Wang; Hyung Won Chung; Dara Bahri; Tal Schuster; Steven Zheng; Denny Zhou; Neil Houlsby; Don Metzler

UL2: Unifying Language Learning Paradigms

Yi Tay

Mostafa Dehghani

Vinh Tran

Xavier Garcia

Jason Wei

Xuezhi Wang

Hyung Won Chung

Dara Bahri

Tal Schuster

Steven Zheng

Denny Zhou

Neil Houlsby

Don Metzler

ICLR (2023)

Download Google Scholar

Abstract

Existing pre-trained models are generally geared towards a particular class of
problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for
pre-training models that are universally effective across datasets and setups. We
begin by disentangling architectural archetypes with pre-training objectives – two
concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training
objectives can be cast as one another and how interpolating between different
objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pretraining objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning
is associated with specific pre-training schemes. We conduct extensive ablative
experiments to compare multiple pre-training objectives and find that our method
pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across
multiple diverse setups. Finally, by scaling our model up to 20B parameters, we
achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language
understanding, text classification, question answering, commonsense reasoning,
long text reasoning, structured knowledge grounding and information retrieval.
Our model also achieve strong results at in-context learning, outperforming 175B
GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on oneshot summarization. Finally, we show that UL2 20B works well with chain-ofthought prompting and reasoning tasks, making it an appealing choice for research
into reasoning at a small to medium scale of 20B parameters. We publicly release
Flax-based T5X model checkpoints for the 20B model.

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

UL2: Unifying Language Learning Paradigms

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs