Katherine Lee
Authored Publications
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Kathy Meier-Hellstern
arxiv:2204.02311 (2022)
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
Deduplicating Training Data Makes Language Models Better
Andrew Nystrom
Chiyuan Zhang
Chris Callison-Burch
(2022) (to appear)
As large language models scale up, researchers and engineers have chosen to use larger datasets of loosely-filtered internet text instead of curated texts.
We find that existing NLP datasets are highly repetitive and contain duplicated examples.
For example, there is an example in the training dataset C4 that has over 200,000 near duplicates.
Overall, we find that 1.68% of C4 consists of near-duplicates.
Worse, we find a 1% overlap between the training and testing sets in these datasets.
Duplicate examples in training data inappropriately bias the distribution of rare and common sequences.
Models trained on non-deduplicated datasets are more likely to generate "memorized" examples.
Additionally, if those models are used for downstream applications, such as scoring the likelihoods of given sequences, we find that models trained on non-deduplicated and deduplicated datasets differ in accuracy by TODO on average.
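As a rough illustration of what "near-duplicate" and "train/test overlap" can mean in practice, the sketch below flags pairs of documents whose word 3-gram sets have high Jaccard similarity. This is an assumption chosen for clarity, not necessarily the method used in the paper; the function names, the n-gram size, and the thresholds are all hypothetical, and the pairwise comparison is quadratic, so it only suits small samples.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts get a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.8, n=3):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    sets = [shingles(d, n) for d in docs]
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


def train_test_overlap(train, test, threshold=0.8, n=3):
    """Fraction of test examples that have a near-duplicate in the training set."""
    train_sets = [shingles(d, n) for d in train]
    hits = 0
    for example in test:
        s = shingles(example, n)
        if any(jaccard(s, t) >= threshold for t in train_sets):
            hits += 1
    return hits / len(test) if test else 0.0
```

At the scale of a dataset like C4, an exhaustive pairwise comparison of this kind would be impractical; approximate techniques such as hashing-based similarity search are the usual substitute.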
View details
Preview abstract
Our language reflects who we are. The words and phrases we use as well as the contextual information in our conversations disclose our personal life. As humans we learn how to communicate about ourselves and others, while delicately concealing private information depending on the context of conversations. Language models, however, totally lack the ability to understand the context and analyze the sensitivity of text, and tend to memorize phrases and remember information about their training sets. Thus, inference attacks are shown to be alarmingly successful at extracting sensitive data from language models. In this paper, we discuss the privacy expectations from language models, and provide a critical analysis of major data protection techniques: data redaction (scrubbing) and differential privacy. We show that these protection methods can guarantee, at best, a very limited form of privacy which does not account for correlations and other nuances in human communication. We finally argue that language models need to be trained on data which is intended to be produced for public use with proper consent forms and authorization from authors.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Michael Matena
Noam Shazeer
Peter J. Liu
Sharan Narang
Wei Li
Google (2019)
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a lower-resource downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning for NLP by introducing a unified framework which casts every language problem as a text-to-text task. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of text understanding tasks. By combining the insights gained in our exploration with scale and a new giant unlabeled text dataset, we achieve state-of-the-art results in most of the tasks we consider. To facilitate future work on text understanding, we release our dataset, pre-trained models, and code.
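To make the "every language problem as a text-to-text task" framing concrete, here is a minimal sketch that casts labeled examples from different tasks into (input text, target text) string pairs. The function name, the dictionary keys, and the exact prefix strings are illustrative assumptions and are not taken from the abstract above.

```python
from typing import Dict, Tuple


def to_text_pair(task: str, example: Dict[str, str]) -> Tuple[str, str]:
    """Cast one labeled example into an (input text, target text) pair.

    The task prefixes below are hypothetical; the point is only that
    inputs and outputs for every task are plain strings.
    """
    if task == "translation":
        source = (
            f"translate {example['src_lang']} to {example['tgt_lang']}: "
            f"{example['text']}"
        )
        return source, example["translation"]
    if task == "classification":
        # The class label itself is emitted as text, not as a class index.
        return f"classify sentiment: {example['text']}", example["label"]
    if task == "summarization":
        return f"summarize: {example['text']}", example["summary"]
    raise ValueError(f"unknown task: {task}")
```

Because every task shares the same string-in, string-out interface, a single model, loss, and decoding procedure can in principle be reused across all of them.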
Hallucinations in Neural Machine Translation
Ashish Agarwal
Clara Wong-Fannjiang
David Sussillo
ICLR (2018) (to appear)
Neural machine translation (NMT) systems have reached state-of-the-art performance in translating text and are in wide deployment. Yet little is understood about how these systems function or how they break. Here we show that NMT systems are susceptible to producing highly pathological translations that are completely untethered from the source material, which we term hallucinations. Such pathological translations are problematic because they deeply undermine user trust and are easy to find with a simple search. We describe a method to generate hallucinations and show that many common variations of the NMT architecture are susceptible to them. We study a variety of approaches to reduce the frequency of hallucinations, including data augmentation, dynamical systems and regularization techniques, showing that a data augmentation technique significantly reduces hallucination frequency. Finally, we analyze networks that produce hallucinations and show that there are signatures in the attention matrix as well as in the stability measures of the decoder.