Denny Zhou
Founded and lead the Reasoning team in Google Brain. Revolutionize machine learning via introducing reasoning to solve long-standing challenges: learning from only a few examples or even instructions only, interpretability, and out-of-distribution/domain generalization. Our solution is teaching language models to reason. For more information, please check my personal homepage.
Research Areas
Authored Publications
Sort By
UL2: Unifying Language Learning Paradigms
Yi Tay
Xavier Garcia
Jason Wei
Hyung Won Chung
Steven Zheng
Neil Houlsby
ICLR (2023)
Preview abstract
Existing pre-trained models are generally geared towards a particular class of
problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for
pre-training models that are universally effective across datasets and setups. We
begin by disentangling architectural archetypes with pre-training objectives – two
concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training
objectives can be cast as one another and how interpolating between different
objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pretraining objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning
is associated with specific pre-training schemes. We conduct extensive ablative
experiments to compare multiple pre-training objectives and find that our method
pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across
multiple diverse setups. Finally, by scaling our model up to 20B parameters, we
achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language
understanding, text classification, question answering, commonsense reasoning,
long text reasoning, structured knowledge grounding and information retrieval.
Our model also achieve strong results at in-context learning, outperforming 175B
GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on oneshot summarization. Finally, we show that UL2 20B works well with chain-ofthought prompting and reasoning tasks, making it an appealing choice for research
into reasoning at a small to medium scale of 20B parameters. We publicly release
Flax-based T5X model checkpoints for the 20B model.
View details
Preview abstract
We propose a new paradigm to help Large Language Models (LLMs) generate more accurate factual knowledge without retrieving from an external corpus, called RECITation-augmented gEneration (RECITE). Different from retrieval-augmented language models that retrieve relevant documents before generating the outputs, given an input, RECITE first recites one or several relevant passages from LLMs' own memory via sampling, and then produces the final answers. We show that RECITE is a powerful paradigm for knowledge-intensive NLP tasks. Specifically, we show that by utilizing recitation as the intermediate step, a recite-and-answer scheme can achieve new state-of-the-art performance in various closed-book question answering (CBQA) tasks. In experiments, we verify the effectiveness of RECITE on three pre-trained models (PaLM, UL2, and OPT) and three CBQA tasks (Natural Questions, TriviaQA, and HotpotQA).
View details
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Shayne Longpre
Le Hou
Tu Vu
Albert Webson
Hyung Won Chung
Yi Tay
Barret Zoph
Jason Wei
Proceedings of the 40th International Conference on Machine Learning, PMLR (2023), pp. 22631-22648
Preview abstract
We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.
View details
Preview abstract
Careful prompt design is critical to the use of large language models in zero-shot or few-shot learning. As a consequence, there is a growing interest in automated methods to design optimal prompts. In this work, we propose Test-time Prompt Editing using Reinforcement learning (TEMPERA). In contrast to prior prompt generation methods, TEMPERA can efficiently leverage prior knowledge, is adaptive to different queries and provides an interpretable prompt for every query. To achieve this, we design a novel action space that allows flexible editing of the initial prompts covering a wide set of commonly-used components like instructions, few-shot exemplars, and verbalizers. The proposed method achieves significant gains compared with recent SoTA approaches like prompt tuning, AutoPrompt, and RLPrompt, across a variety of tasks including sentiment analysis, topic classification, natural language inference, and reading comprehension. Our method achieves 5.33x on average improvement in sample efficiency when compared to the traditional fine-tuning methods.
View details
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Jason Wei
Sharan Narang
Aakanksha Chowdhery
ICLR 2023 (to appear)
Preview abstract
Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).
View details
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Kathy Meier-Hellstern
arxiv:2204.02311 (2022)
Preview abstract
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
View details
Mind's Eye: Grounded Language Model Reasoning through Simulation
Jason Wei
Shixiang Shane Gu
Soroush Vosoughi
ICLR 2023 (2022)
Preview abstract
Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real-world—their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind’s MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100× larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
View details
Emergent abilities of large language models
Barret Zoph
Colin Raffel
Dani Yogatama
Jason Wei
Liam B. Fedus
Maarten Paul Bosma
Percy Liang
Sebastian Borgeaud
Tatsunori B. Hashimoto
Yi Tay
TMLR (2022)
Preview abstract
Scaling up language models has been shown to predictably confer a range of benefits such as improved performance and sample efficiency. This paper discusses an unpredictable phenomenon that we call emergent abilities of large language models. Such emergent abilities have close to random performance until evaluated on a model of sufficiently large scale, and hence their emergence cannot be predicted by extrapolating a scaling law based on small-scale models. The emergence of such abilities suggests that additional scaling could further expand the range of tasks that language models can perform. We discuss the implications of these phenomena and suggest directions for future research.
View details
LEGO: Latent Execution-Guided Reasoning for Multi-Hop Question Answering on Knowledge Graphs
Hongyu Ren
Michihiro Yasunaga
Haitian Sun
Jure Leskovec
ICML 2021
Preview abstract
Answering complex natural language questions on knowledge graphs (KGQA) is a challenging task. It requires reasoning with the input natural language questions as well as a massive, incomplete heterogeneous KG. Prior methods obtain an abstract structured query graph/tree from the input question and traverse the KG for answers following the query tree. However, they inherently cannot deal with missing links in the KG. Here we present LEGO, a Latent Execution-Guided reasOning framework to handle this challenge in KGQA. LEGO works in an iterative way, which alternates between (1) a Query Synthesizer, which synthesizes a reasoning action and grows the query tree step-by-step, and (2) a Latent Space Executor that executes the reasoning action in the latent embedding space to combat against the missing information in KG. To learn the synthesizer without step-wise supervision, we design a generic latent execution guided bottom-up search procedure to find good execution traces efficiently in the vast query space. Experimental results on several KGQA benchmarks demonstrate the effectiveness of our framework compared with previous state of the art.
View details
SpreadsheetCoder: Formula Prediction from Semi-structured Context
Rishabh Singh
Proceedings of the 38th International Conference on Machine Learning (ICML) (2021)
Preview abstract
Spreadsheet formula prediction has been an important
program synthesis problem with many
real-world applications. Previous works typically
utilize input-output examples as the specification
for spreadsheet formula synthesis, where each
input-output pair simulates a separate row in the
spreadsheet. However, this formulation does not
fully capture the rich context in real-world spreadsheets.
First, spreadsheet data entries are organized
as tables, thus rows and columns are not necessarily
independent from each other. In addition,
many spreadsheet tables include headers, which
provide high-level descriptions of the cell data.
However, previous synthesis approaches do not
consider headers as part of the specification. In
this work, we present the first approach for synthesizing
spreadsheet formulas from tabular context,
which includes both headers and semi-structured
tabular data. In particular, we propose SpreadsheetCoder,
a BERT-based model architecture
to represent the tabular context in both row-based
and column-based formats. We train our model on
a large dataset of spreadsheets, and demonstrate
that SpreadsheetCoder achieves top-1 prediction
accuracy of 42:51%, which is a considerable
improvement over baselines that do not employ
rich tabular context. Compared to a rule-based
system, SpreadsheetCoder assists 82% more
users in composing formulas on Google Sheets.
View details