Kellie Webster
Research Areas
Authored Publications
Sort By
Sparsely Activated Language Models are Efficient In-Context Learners
Barret Richard Zoph
Dmitry (Dima) Lepikhin
Emma Wang
Kathy Meier-Hellstern
Kun Zhang
Liam B. Fedus
Maarten Paul Bosma
Marie Pellat
Maxim Krikun
Nan Du
Simon Tong
Tao Wang
Toju Duke
Yuanzhong Xu
Zongwei Zhou
(2022)
Preview abstract
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models require significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-expert language models named \glam (\textbf{G}eneralist \textbf{La}nguage \textbf{M}odel), which can have many more parameters but require significant less training cost than dense models. The largest \glam has 1.2 trillion parameters, which is approximately 7x larger than GPT-3 but can be trained more efficiently. With only 1/3 of energy consumption to train GPT-3, \glam achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, \glam gets 75.0\% one-shot exact match accuracy on the TriviaQA test server, a significant improvement over 68.0\% obtained by GPT-3.
View details
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Pat Verga
Jianmo Ni
arXiv (2022)
Preview abstract
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducable evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
View details
Query Refinement Prompts for Closed-Book Long-Form Question Answering
Reinald Kim Amplayo
arXiv submission (2022)
Preview abstract
Large language models (LLMs) have been shown to perform well in answering questions and in producing long-form texts such as stories and explanations, both in few-shot closed-book settings. While the former can be validated using well-known evaluation metrics, the latter is difficult to evaluate. To this end, we investigate the ability of LLMs to do both tasks at once -- to do question answering that requires long-form answers. Such questions tend to be multifaceted, i.e., they may have ambiguities and/or require information from multiple sources. To this end, we define query refinement prompts that encourage LLMs to explicitly express the multifacetedness in questions and generate long-form answers covering multiple facets of the question. Our experiments on two long-form question answering datasets, ASQA and AQuAMuSe, show that using our prompts allows us to outperform fully finetuned models in the closed book setting, as well as achieve results comparable to retrieve-then-generate open-book models.
View details
Preview abstract
Research in natural language processing that focuses solely on binary genders can pose the serious danger of excluding communities and behaviors that are gender nonconforming. In this paper, we highlight the use of gender-inclusive language by proposing the task of rewriting gendered sentences in English to be gender-neutral using the \textit{singular they}. To this end, we train a Seq2Seq model for this task by creating a rewriting algorithm to generate a parallel dataset and evaluate performance on an annotated test set of 500 sentence-pairs (gendered to gender-neutral). Impressively, we are able to achieve over 99 BLEU and less than 1\% word error rate for both the algorithm and the model. Finally, we give some practical applications for this task, including machine translation and augmented writing.
View details
Preview abstract
Question Answering (QA) tasks are used as benchmarks of general machine intelligence. Therefore, robust QA evaluation is critical, and metrics should indicate how models will answer _any_ question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite that skew, models generalize---we find little evidence that accuracy is lower for people based on gender or nationality. Instead, there is more variation in question topic and question ambiguity. Adequately accessing the generalization of \abr{qa} systems requires more representative datasets.
View details
Social Biases in NLP Models as Barriers for Persons with Disabilities
Stephen Craig Denuyl
Proceedings of ACL 2020, ACL (to appear)
Preview abstract
Building equitable and inclusive technologies
demands paying attention to how social attitudes towards persons with disabilities are
represented within technology. Representations perpetuated by NLP models often inadvertently encode undesirable social biases
from the data on which they are trained. In this
paper, first we present evidence of such undesirable biases towards mentions of disability in
two different NLP models: toxicity prediction
and sentiment analysis. Next, we demonstrate
that neural embeddings that are critical first
steps in most NLP pipelines also contain undesirable biases towards mentions of disabilities.
We then expose the topical biases in the social
discourse about some disabilities which may
explain such biases in the models; for instance,
terms related to gun violence, homelessness,
and drug addiction are over-represented in discussions about mental illness.
View details
Measuring and Reducing Gendered Correlations in Pre-trained Models
Alex Beutel
Emily Pitler
arXiv (2020)
Preview abstract
Large pre-trained models have revolutionized natural language understanding.
However, researchers have found they can encode correlations undesired in many applications, like \emph{surgeon} being associated more with \emph{he} than \emph{she}.
We explore such \emph{gendered correlations} as a case study, to learn how we can configure and train models to mitigate the risk of encoding unintended associations.
We find that it is important to define correlation metrics, since they can reveal differences among models with similar accuracy.
Large models have more capacity to encode gendered correlations, but this can be mitigated with general dropout regularization.
Counterfactual data augmentation is also effective, and can even reduce correlations not explicitly targeted for mitigation, potentially making it useful beyond gender too.
Both techniques yield models with comparable accuracy to unmitigated analogues, and still resist re-learning correlations in fine-tuning.
View details
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Dan Moldovan
Ben Adlam
Babak Alipanahi
Alex Beutel
Christina Chen
Jon Deaton
Matthew D. Hoffman
Shaobo Hou
Neil Houlsby
Ghassen Jerfel
Yian Ma
Diana Mincu
Akinori Mitani
Andrea Montanari
Christopher Nielsen
Thomas Osborne
Rajiv Raman
Kim Ramasamy
Martin Gamunu Seneviratne
Shannon Sequeira
Harini Suresh
Victor Veitch
Steve Yadlowsky
Journal of Machine Learning Research (2020)
Preview abstract
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
View details
Preview abstract
Machine translation systems with inadequate document understanding can make errors when translating dropped or neutral pronouns into languages with gendered pronouns (e.g., English). Predicting the underlying gender of these pronouns is difficult since it is not marked textually and must instead be inferred from coreferent mentions in the context. We propose a novel cross-lingual pivoting technique for automatically producing high-quality gender labels, and show that this data can be used to fine-tune a BERT classifier with 92% F1 for Spanish dropped feminine pronouns, compared with 30-51% for neural machine translation models and 54-71% for a non-fine-tuned BERT model. We augment a neural machine translation model with labels from our classifier to improve pronoun translation, while still having parallelizable translation models that translate a sentence at a time.
View details
Preview abstract
Gender bias has been shown to affect many tasks applications in NLU.
In the setting of machine translation (MT), research has primarily focused on measuring bias via synthetic datasets.
We present an automatic method for identifying gender biases in MT using a novel-application of BERT-generated sentence perturbations.
Using this method, we compile a dataset to serve as a benchmark for evaluating gender bias in MT across a diverse range of languages.
Our dataset further serves to highlight the limitations of the current task definition which requires a single translation be produced, even in the presence of underspecified input.
View details