Noam Shazeer
Noam Shazeer is currently co-Tech-Lead for Gemini.
Authored Publications
LaMDA: Language Models for Dialog Applications
Aaron Daniel Cohen
Alena Butryna
Alicia Jin
Apoorv Kulshreshtha
Ben Zevenbergen
Chung-ching Chang
Cosmo Du
Daniel De Freitas Adiwardana
Dehao Chen
Dmitry (Dima) Lepikhin
Erin Hoffman-John
Igor Krivokon
James Qin
Jamie Hall
Joe Fenton
Johnny Soraker
Kathy Meier-Hellstern
Maarten Paul Bosma
Marc Joseph Pickett
Marcelo Amorim Menegali
Marian Croak
Maxim Krikun
Rachel Bernstein
Ravi Rajakumar
Ray Kurzweil
Romal Thoppilan
Steven Zheng
Taylor Bos
Toju Duke
Tulsee Doshi
Vincent Y. Zhao
Will Rusch
Yanping Huang
Yuanzhong Xu
Zhifeng Chen
arXiv (2022)
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows fewer improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
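The candidate-filtering step described in the abstract can be pictured as a small rerank-and-threshold loop. The sketch below is purely illustrative: generate_candidates, safety_score, and quality_score are hypothetical placeholders for the LaMDA generator, the fine-tuned safety classifier, and a quality/groundedness scorer, none of which are public APIs.

```python
# Hypothetical sketch of filtering sampled responses with a safety classifier.
# generate_candidates(), safety_score() and quality_score() are placeholders,
# not real LaMDA APIs.
def respond(context, generate_candidates, safety_score, quality_score,
            threshold=0.8, num_candidates=16):
    candidates = generate_candidates(context, n=num_candidates)
    # Discard candidates the safety classifier scores below the threshold.
    safe = [c for c in candidates if safety_score(context, c) >= threshold]
    if not safe:
        return "I'm not sure how to respond to that."  # fallback (assumption)
    # Rank the remaining candidates by the quality/groundedness score.
    return max(safe, key=lambda c: quality_score(context, c))
```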
Primer: Searching for Efficient Transformers for Language Modeling
David Richard So
Wojciech Andrzej Mańke
Hanxiao Liu
Zihang Dai
Conference on Neural Information Processing Systems (2021)
Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility.
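The two modifications credited with most of Primer's gains are small enough to sketch directly. The numpy snippet below shows a squared-ReLU activation and a per-channel (depthwise) causal convolution of the kind Primer adds after the Q, K, and V projections; it is an illustrative reconstruction, not the authors' code, and the kernel size of 3 is an assumption.

```python
import numpy as np

def squared_relu(x):
    # Primer's activation: relu(x) ** 2
    return np.maximum(x, 0.0) ** 2

def depthwise_causal_conv(x, kernel):
    # x: (seq_len, channels); kernel: (k, channels), one filter per channel.
    # Each output position sees only the current and previous k-1 positions.
    k = kernel.shape[0]
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = np.sum(padded[t:t + k] * kernel, axis=0)
    return out

seq_len, d = 8, 4
q = np.random.randn(seq_len, d)                       # e.g. the query projection output
q = depthwise_causal_conv(q, np.random.randn(3, d))   # Primer-style depthwise conv
h = squared_relu(np.random.randn(seq_len, d))         # feed-forward activation
```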
Faster Transformer Decoding: N-gram Masked Self-Attention
Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence S = s_1, ..., s_S, we propose truncating the target-side context used for incremental predictions by making a Markov (N-gram) assumption. Experiments on WMT EnDe and EnFr data sets show that the N-gram masked self-attention model loses very little in BLEU score for N values in the range 4, ..., 8, depending on the task.
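The N-gram restriction amounts to a banded causal mask in the decoder's self-attention: position i may attend only to itself and the previous N-1 target tokens rather than the full prefix. A minimal sketch of that mask construction, under the assumption that the mask is applied as a boolean validity pattern on the attention logits:

```python
import numpy as np

def ngram_causal_mask(seq_len, n):
    # mask[i, j] = True if position i may attend to position j:
    # causal (j <= i) and within the last n tokens (i - j < n).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < n)

print(ngram_causal_mask(6, 3).astype(int))
# each row has at most 3 ones: the current token and the 2 before it
```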
GShard: Scaling Giant Models With Conditional Computation and Automatic Sharding
Dehao Chen
Dmitry (Dima) Lepikhin
HyoukJoong Lee
Maxim Krikun
Yanping Huang
Yuanzhong Xu
Zhifeng Chen
ICLR (2021)
Neural network scaling has been critical for improving the model quality in many real world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes of existing model code. It enabled us to scale up multilingual machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can be easily trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
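The Sparsely-Gated Mixture-of-Experts layer scaled in this work routes each token to a small number of experts chosen by a learned gate. The snippet below is a minimal single-device illustration of top-2 gating in numpy; GShard's actual contribution, sharding the experts across accelerators via its annotation APIs and the XLA compiler, is not shown, and the expert/gate shapes here are arbitrary.

```python
import numpy as np

def top2_moe(x, w_gate, experts):
    # x: (tokens, d); w_gate: (d, num_experts); experts: list of callables.
    logits = x @ w_gate
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    top2 = np.argsort(-probs, axis=-1)[:, :2]           # two experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top2[t]:
            out[t] += probs[t, e] * experts[e](x[t])     # gate-weighted mixture
    return out

d, n_exp = 4, 8
experts = [(lambda W: (lambda v: v @ W))(np.random.randn(d, d)) for _ in range(n_exp)]
y = top2_moe(np.random.randn(10, d), np.random.randn(d, n_exp), experts)
```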
Music Transformer: Generating Music with Long-Term Structure
Ashish Vaswani
Jakob Uszkoreit
Curtis Hawthorne
Andrew Dai
Matt Hoffman
Monica Dinculescu
ICLR (2019)
Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.
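The memory reduction comes from a reshaping ("skewing") trick: rather than materializing an O(L^2) tensor of per-pair relative embeddings, the model multiplies the queries by the L relative embeddings once and shifts the result into position. The following numpy sketch is my own reconstruction of that skewing step, with rel_emb row r taken to hold the embedding for relative distance r-(L-1).

```python
import numpy as np

def relative_logits(q, rel_emb):
    # q: (L, d); rel_emb: (L, d), row r = embedding for relative distance r-(L-1).
    L = q.shape[0]
    qe = q @ rel_emb.T                                        # (L, L)
    padded = np.concatenate([np.zeros((L, 1)), qe], axis=1)   # (L, L+1)
    skewed = padded.reshape(L + 1, L)[1:]                     # (L, L)
    # For j <= i, skewed[i, j] = q_i . rel_emb[j - i + L - 1],
    # i.e. the relative-position logit for attending from i back to j.
    return skewed

L, d = 5, 8
q, e = np.random.randn(L, d), np.random.randn(L, d)
s = relative_logits(q, e)
i, j = 4, 1
assert np.isclose(s[i, j], q[i] @ e[j - i + L - 1])
```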
Corpora Generation for Grammatical Error Correction
Jared Lichtarge
Niki J. Parmar
Simon Tong
(2019)
Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics, while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similar sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpora give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL-2014 benchmark and the JFLEG task. We provide systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.
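The round-trip-translation corpus can be sketched as a short loop: translate a clean Wikipedia sentence into a bridge language and back, then treat the noisier round-trip output as the source and the original sentence as the target. In the sketch below, translate() is a placeholder for whatever MT system is available, and the bridge-language set is illustrative rather than the one used in the paper.

```python
# Hypothetical sketch of building GEC training pairs by round-trip translation.
# translate(text, src, tgt) is a placeholder for any machine translation system;
# the bridge languages listed are illustrative.
def round_trip_pairs(sentences, translate, bridges=("fr", "de", "ja", "ru")):
    pairs = []
    for s in sentences:
        for b in bridges:
            noisy = translate(translate(s, src="en", tgt=b), src=b, tgt="en")
            if noisy != s:                 # keep only pairs that introduce noise
                pairs.append((noisy, s))   # (corrupted source, clean target)
    return pairs
```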
High Resolution Medical Image Analysis with Spatial Partitioning
Le Hou
Niki J. Parmar
Xiaodan Song
Youlong Cheng
(2019)
Medical images such as 3D computerized tomography (CT) scans have a typical resolution of 512×512×512 voxels, three orders of magnitude more pixel data than ImageNet images. It is impossible to train CNN models directly on such high resolution images, because the feature maps of a single image do not fit in the memory of a single GPU/TPU. Existing image analysis approaches alleviate this problem by dividing (e.g. taking 2D slices of 3D scans) or down-sampling input images, which leads to complicated implementation and sub-optimal performance due to information loss. In this paper, we implement spatial partitioning, which internally distributes the input and output of convolution operations across GPUs/TPUs. Our implementation is based on the Mesh-TensorFlow framework and is transparent to end users. To the best of our knowledge, this is the first work on training networks on 512×512×512 resolution CT scans end-to-end, without significant computational overhead.
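The core idea, splitting a convolution's spatial dimension across devices, only requires each shard to see a small halo of its neighbour's data so the stitched result matches the unpartitioned computation. Below is a one-dimensional numpy analogue of that halo exchange (my own illustration, not the Mesh-TensorFlow implementation).

```python
import numpy as np

def conv1d_valid(x, k):
    # 'valid' correlation with a short kernel
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(len(x) - len(k) + 1)])

x = np.arange(16, dtype=float)
k = np.array([0.25, 0.5, 0.25])

full = conv1d_valid(x, k)          # reference: conv over the whole signal

# "Spatially partitioned" version: each shard holds half the signal plus a
# 1-element halo copied from its neighbour, so the per-shard convs stitch
# back into the full result.
halo = len(k) // 2
left = x[: 8 + halo]               # shard 0 plus right halo
right = x[8 - halo:]               # shard 1 plus left halo
stitched = np.concatenate([conv1d_valid(left, k), conv1d_valid(right, k)])

assert np.allclose(full, stitched)
```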
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Katherine Lee
Michael Matena
Peter J. Liu
Sharan Narang
Wei Li
Google (2019)
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a lower-resource downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning for NLP by introducing a unified framework which casts every language problem as a text-to-text task. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of text understanding tasks. By combining the insights gained in our exploration with scale and a new giant unlabeled text dataset, we achieve state-of-the-art results in most of the tasks we consider. To facilitate future work on text understanding, we release our dataset, pre-trained models, and code.
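In practice, the unified framework serializes every task as plain text with a task prefix, so one model, objective, and decoding procedure cover translation, classification, and summarization alike. The schematic examples below paraphrase the paper's illustration; the exact prefix strings and targets here should be treated as approximations.

```python
# Every task becomes an (input text, target text) pair; a task prefix tells the
# model what to do. The prefix wording and targets are illustrative.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "authorities dispatched emergency crews ..."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```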
Generating Wikipedia by Summarizing Long Sequences
Peter J. Liu
Ben Goodrich
Ryan Sepassi
Lukasz Kaiser
ICLR (2018)
We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.
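The two-stage pipeline can be sketched as: rank source paragraphs by relevance to the article title, keep the best ones up to a token budget, and feed the concatenation to a single decoder-only model that continues it with the article text. The ranking below uses simple word overlap as a crude stand-in for the tf-idf-style extractive scorers studied in the paper, and the delimiter token is an invented placeholder.

```python
def extract_salient(title, paragraphs, budget=500):
    # Rank paragraphs by word overlap with the title (a stand-in for tf-idf),
    # then keep the top ones until a token budget is reached.
    title_words = set(title.lower().split())
    ranked = sorted(paragraphs,
                    key=lambda p: -len(title_words & set(p.lower().split())))
    picked, used = [], 0
    for p in ranked:
        n = len(p.split())
        if used + n > budget:
            break
        picked.append(p)
        used += n
    return picked

def decoder_only_example(title, paragraphs, article):
    # Decoder-only training example: extracted input and target article share one
    # token stream. "<extract_end>" is an illustrative delimiter, not the paper's.
    return " ".join(extract_salient(title, paragraphs)) + " <extract_end> " + article
```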
Fast Decoding in Sequence Models Using Discrete Latent Variables
Lukasz Kaiser
Aurko Roy
Ashish Vaswani
Niki J. Parmar
Samy Bengio
Jakob Uszkoreit
ICML (2018)
Auto-regressive sequence models based on deep neural networks, such as RNNs, WaveNet and Transformer, are the state of the art on many tasks. However, they lack parallelism and are thus slow for long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and Transformer are much more parallel during training, but still lack parallelism during decoding. We present a method to extend sequence models using discrete latent variables that makes decoding much more parallel. The main idea behind this approach is to first autoencode the target sequence into a shorter discrete latent sequence, which is generated auto-regressively, and finally decode the full sequence from this shorter latent sequence in a parallel manner. We verify that our method works on the task of neural machine translation, where our models are an order of magnitude faster than comparable auto-regressive models. We also introduce a new method for constructing discrete latent variables that allows us to obtain good BLEU scores.
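At decoding time the recipe looks roughly like the pseudocode-style sketch below: a short autoregressive loop over the latents followed by one parallel pass over the full target. latent_prior_step and parallel_decoder are placeholders for the learned components, the compression factor of 8 is only an example, and the autoencoder that maps targets to latents (needed for training) is omitted.

```python
# Hypothetical sketch of decoding with discrete latent variables.
# latent_prior_step() and parallel_decoder() are placeholders for the learned
# components; factor=8 is an illustrative compression ratio.
def fast_decode(source, latent_prior_step, parallel_decoder, target_len, factor=8):
    latents = []
    for _ in range(target_len // factor):
        # Short autoregressive loop: one latent summarises ~`factor` target tokens.
        latents.append(latent_prior_step(source, latents))
    # A single parallel pass reconstructs the full-length target from the latents.
    return parallel_decoder(source, latents, target_len)
```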