Kellie Webster
Authored Publications
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Pat Verga
Jianmo Ni
arXiv (2022)
Abstract:
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution? How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
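As a rough illustration of what an automatic attribution check can look like, the sketch below asks whether a cited passage entails a system's answer using an off-the-shelf NLI model. The model choice, the hypothesis template, and the is_attributed helper are illustrative assumptions, not the evaluation framework or metric from the paper.

```python
# Minimal sketch of an automatic attribution check, assuming an off-the-shelf
# NLI model can stand in for a tuned attribution metric.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def is_attributed(question: str, answer: str, passage: str) -> bool:
    """Treat the answer as attributed if the cited passage entails it."""
    hypothesis = f"The answer to the question '{question}' is '{answer}'."
    prediction = nli({"text": passage, "text_pair": hypothesis})[0]
    return prediction["label"] == "ENTAILMENT"

print(is_attributed(
    "Who wrote Pride and Prejudice?",
    "Jane Austen",
    "Pride and Prejudice is an 1813 novel by the English author Jane Austen.",
))
```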
Query Refinement Prompts for Closed-Book Long-Form Question Answering
Reinald Kim Amplayo
arXiv submission (2022)
Abstract:
Large language models (LLMs) have been shown to perform well both at answering questions and at producing long-form texts such as stories and explanations, even in few-shot closed-book settings. While the former can be validated using well-known evaluation metrics, the latter is difficult to evaluate. In this work, we investigate the ability of LLMs to do both tasks at once: question answering that requires long-form answers. Such questions tend to be multifaceted, i.e., they may have ambiguities and/or require information from multiple sources. We therefore define query refinement prompts that encourage LLMs to explicitly express the multifacetedness of a question and to generate long-form answers covering its multiple facets. Our experiments on two long-form question answering datasets, ASQA and AQuAMuSe, show that our prompts outperform fully finetuned models in the closed-book setting and achieve results comparable to retrieve-then-generate open-book models.
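To make the two-step prompting idea concrete, the sketch below first asks a model to spell out the facets of a potentially ambiguous question and then asks for a long-form answer covering each facet. The generate callable and the exact prompt wording are assumptions for illustration, not the prompts used in the paper.

```python
# Illustrative two-step "query refinement" prompting: refine, then answer.
# `generate` stands in for any LLM text-completion call.
from typing import Callable

def answer_with_query_refinement(question: str,
                                 generate: Callable[[str], str]) -> str:
    refine_prompt = (
        "The following question may be ambiguous or have several facets.\n"
        f"Question: {question}\n"
        "List the distinct interpretations or sub-questions it contains:"
    )
    facets = generate(refine_prompt)

    answer_prompt = (
        f"Question: {question}\n"
        f"Facets to cover:\n{facets}\n"
        "Write a comprehensive long-form answer that addresses every facet:"
    )
    return generate(answer_prompt)
```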
Sparsely Activated Language Models are Efficient In-Context Learners
Barret Richard Zoph
Dmitry (Dima) Lepikhin
Emma Wang
Kathy Meier-Hellstern
Kun Zhang
Liam B. Fedus
Maarten Paul Bosma
Marie Pellat
Maxim Krikun
Nan Du
Simon Tong
Tao Wang
Toju Duke
Yuanzhong Xu
Zongwei Zhou
(2022)
Abstract:
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models requires significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-experts language models named GLaM (Generalist Language Model), which can have many more parameters but require significantly less training cost than dense models. The largest GLaM has 1.2 trillion parameters, approximately 7x larger than GPT-3, yet it can be trained more efficiently. With only 1/3 of the energy consumed to train GPT-3, GLaM achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, GLaM reaches 75.0% one-shot exact-match accuracy on the TriviaQA test server, a significant improvement over the 68.0% obtained by GPT-3.
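To show why sparse activation keeps compute low even with many parameters, here is a toy top-2 gated mixture-of-experts feed-forward layer: the layer holds parameters for all experts, but each token is routed to only two of them, so compute grows much more slowly than parameter count. The shapes, the softmax gating, and the numpy implementation are simplified assumptions, not the GLaM architecture.

```python
# Toy top-2 gated mixture-of-experts feed-forward layer (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, num_experts, top_k = 16, 64, 8, 2

# Per-expert feed-forward weights and a routing (gating) matrix.
w_in = rng.normal(size=(num_experts, d_model, d_hidden)) * 0.02
w_out = rng.normal(size=(num_experts, d_hidden, d_model)) * 0.02
w_gate = rng.normal(size=(d_model, num_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: [tokens, d_model] -> [tokens, d_model], running only 2 experts per token."""
    logits = x @ w_gate                              # [tokens, num_experts]
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of the 2 best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        probs = np.exp(logits[t] - logits[t].max())
        probs /= probs.sum()
        for e in top[t]:
            h = np.maximum(x[t] @ w_in[e], 0.0)      # expert FFN with ReLU
            out[t] += probs[e] * (h @ w_out[e])
    return out

print(moe_layer(rng.normal(size=(4, d_model))).shape)  # (4, 16)
```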
Abstract:
Research in natural language processing that focuses solely on binary genders can pose the serious danger of excluding communities and behaviors that are gender nonconforming. In this paper, we highlight the use of gender-inclusive language by proposing the task of rewriting gendered sentences in English to be gender-neutral using the singular "they". To this end, we train a Seq2Seq model for this task by creating a rewriting algorithm to generate a parallel dataset, and we evaluate performance on an annotated test set of 500 sentence pairs (gendered to gender-neutral). Impressively, we achieve over 99 BLEU and less than 1% word error rate for both the algorithm and the model. Finally, we give some practical applications for this task, including machine translation and augmented writing.
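As a much simplified illustration of the kind of rule-based rewriting that can generate such parallel data, the sketch below maps third-person gendered pronouns to singular they. The pronoun table and helper are illustrative only; a complete rewriter also has to handle verb agreement ("she is" to "they are") and ambiguous forms such as "her", which this sketch does not.

```python
# Tiny rule-based gendered -> gender-neutral rewriter (illustrative subset).
import re

PRONOUN_MAP = {
    "he": "they", "she": "they",
    "him": "them",
    "his": "their",
    "himself": "themself", "herself": "themself",
}

def rewrite_gender_neutral(sentence: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        neutral = PRONOUN_MAP.get(word.lower(), word)
        return neutral.capitalize() if word[0].isupper() else neutral
    return re.sub(r"\b\w+\b", repl, sentence)

print(rewrite_gender_neutral("He packed his bag himself."))
# They packed their bag themself.
```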
Abstract:
Question answering (QA) tasks are used as benchmarks of general machine intelligence, so robust QA evaluation is critical: metrics should indicate how models will answer any question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite that skew, models generalize; we find little evidence that accuracy is lower for people of a particular gender or nationality. Instead, accuracy varies more with question topic and question ambiguity. Adequately assessing the generalization of QA systems requires more representative datasets.
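One way to act on this kind of finding is to report accuracy disaggregated by an attribute of the question's subject rather than as a single number. The sketch below does this for a hypothetical subject_gender annotation; the field names and examples are invented for illustration.

```python
# Disaggregated accuracy over a hypothetical per-example attribute.
from collections import defaultdict

def accuracy_by_attribute(examples, attribute="subject_gender"):
    """examples: iterable of dicts with 'correct' (bool) and the attribute key."""
    totals, hits = defaultdict(int), defaultdict(int)
    for ex in examples:
        group = ex.get(attribute, "unknown")
        totals[group] += 1
        hits[group] += int(ex["correct"])
    return {group: hits[group] / totals[group] for group in totals}

print(accuracy_by_attribute([
    {"correct": True, "subject_gender": "female"},
    {"correct": False, "subject_gender": "female"},
    {"correct": True, "subject_gender": "male"},
]))
# {'female': 0.5, 'male': 1.0}
```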
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Dan Moldovan
Ben Adlam
Babak Alipanahi
Alex Beutel
Christina Chen
Jon Deaton
Matthew D. Hoffman
Shaobo Hou
Neil Houlsby
Ghassen Jerfel
Yian Ma
Diana Mincu
Akinori Mitani
Andrea Montanari
Christopher Nielsen
Thomas Osborne
Rajiv Raman
Kim Ramasamy
Martin Gamunu Seneviratne
Shannon Sequeira
Harini Suresh
Victor Veitch
Steve Yadlowsky
Xiaohua Zhai
Journal of Machine Learning Research (2020)
Abstract:
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
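A small synthetic illustration of the phenomenon: predictors trained with different random seeds reach near-identical held-out accuracy on the training distribution, yet they may rely on a spurious proxy feature to different degrees and so score differently once that feature decouples from the label. The dataset, the shift, and the use of scikit-learn here are stand-ins for illustration, not the paper's case studies.

```python
# Synthetic underspecification demo: equivalent in-domain, not under shift.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)                         # causally reliable feature
x2 = x1 + rng.normal(scale=0.1, size=n)         # spurious proxy for x1
y = (x1 + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.column_stack([x1, x2])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Shifted domain: the proxy feature no longer tracks the label.
X_shift = np.column_stack([X_te[:, 0], rng.normal(size=len(X_te))])

for seed in range(3):
    clf = SGDClassifier(loss="log_loss", random_state=seed).fit(X_tr, y_tr)
    print(f"seed {seed}: held-out {clf.score(X_te, y_te):.3f}, "
          f"shifted {clf.score(X_shift, y_te):.3f}")
```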
How to Write a Bias Statement: Recommendations for Submissions to the Workshop on Gender Bias in NLP
Abstract:
At the Workshop on Gender Bias in NLP (GeBNLP), we'd like to encourage authors to give explicit consideration to the wider aspects of bias and its social implications. For the 2020 edition of the workshop, we therefore requested that all authors include an explicit bias statement in their work to clarify how their work relates to the social context in which NLP systems are used.
The programme committee of the workshop included a number of reviewers with a background in the humanities and social sciences, in addition to the NLP experts doing the bulk of the reviewing. Each paper was assigned one of those reviewers, who was asked to pay specific attention to the provided bias statement in their review. This initiative was well received by the authors who submitted papers to the workshop, several of whom said they received useful suggestions and pointers to relevant literature from the bias reviewers. We are therefore planning to keep this feature of the review process in future editions of the workshop.
This document was originally published as a blog post on the web site of GeBNLP 2020.
Abstract:
Gender bias has been shown to affect many tasks and applications in NLU. In the setting of machine translation (MT), research has primarily focused on measuring bias via synthetic datasets. We present an automatic method for identifying gender biases in MT using a novel application of BERT-generated sentence perturbations. Using this method, we compile a dataset to serve as a benchmark for evaluating gender bias in MT across a diverse range of languages. Our dataset further serves to highlight the limitations of the current task definition, which requires that a single translation be produced even in the presence of underspecified input.
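To show the flavour of masked-LM sentence perturbation, the sketch below masks one slot in a source sentence and keeps the model's highest-probability fillers, producing minimal variants that could then be translated and compared. The model choice, the template, and the perturb helper are assumptions for illustration, not the paper's pipeline.

```python
# Generate minimal sentence variants with a masked language model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

def perturb(sentence_with_mask: str, top_k: int = 5):
    """sentence_with_mask contains the literal [MASK] token to be filled."""
    return [p["sequence"] for p in fill(sentence_with_mask, top_k=top_k)]

for variant in perturb("The [MASK] finished her shift and went home."):
    print(variant)
```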
Measuring and Reducing Gendered Correlations in Pre-trained Models
Alex Beutel
Emily Pitler
arXiv (2020)
Abstract:
Large pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode correlations that are undesired in many applications, like "surgeon" being associated more with "he" than with "she". We explore such gendered correlations as a case study, to learn how we can configure and train models to mitigate the risk of encoding unintended associations. We find that it is important to define correlation metrics, since they can reveal differences among models with similar accuracy. Large models have more capacity to encode gendered correlations, but this can be mitigated with general dropout regularization. Counterfactual data augmentation is also effective, and can even reduce correlations not explicitly targeted for mitigation, potentially making it useful beyond gender too. Both techniques yield models with accuracy comparable to unmitigated analogues, and these models still resist re-learning correlations during fine-tuning.
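As a rough sketch of counterfactual data augmentation, the snippet below adds a gender-swapped copy of each training sentence so that, for example, "surgeon ... he" and "surgeon ... she" are equally represented. The word list is a tiny illustrative subset; real pipelines use much larger lists and part-of-speech information to handle ambiguous forms such as "her", which is deliberately left out here.

```python
# Minimal counterfactual data augmentation via word swaps (illustrative subset).
GENDER_SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "his": "her",
    "man": "woman", "woman": "man",
}

def augment(sentences):
    """Return the original sentences plus gender-swapped counterfactual copies."""
    augmented = list(sentences)
    for sentence in sentences:
        tokens = []
        for token in sentence.split():
            core = token.strip(".,!?")
            swap = GENDER_SWAPS.get(core.lower())
            if swap is not None:
                swap = swap.capitalize() if core[0].isupper() else swap
                token = token.replace(core, swap)
            tokens.append(token)
        augmented.append(" ".join(tokens))
    return augmented

print(augment(["The surgeon said he was ready."]))
# ['The surgeon said he was ready.', 'The surgeon said she was ready.']
```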
Social Biases in NLP Models as Barriers for Persons with Disabilities
Stephen Craig Denuyl
Proceedings of ACL 2020, ACL (to appear)
Abstract:
Building equitable and inclusive technologies demands paying attention to how social attitudes towards persons with disabilities are represented within technology. Representations perpetuated by NLP models often inadvertently encode undesirable social biases from the data on which they are trained. In this paper, we first present evidence of such undesirable biases towards mentions of disability in two different NLP models: toxicity prediction and sentiment analysis. Next, we demonstrate that neural embeddings that are critical first steps in most NLP pipelines also contain undesirable biases towards mentions of disabilities. We then expose the topical biases in the social discourse about some disabilities which may explain such biases in the models; for instance, terms related to gun violence, homelessness, and drug addiction are over-represented in discussions about mental illness.
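To show the flavour of this kind of perturbation analysis, the sketch below scores otherwise similar sentences that differ only in how a person is described and compares the model's outputs. A generic off-the-shelf sentiment pipeline and hand-written example sentences stand in for the toxicity and sentiment models and the perturbed data studied in the paper.

```python
# Compare model scores across sentences that differ only in a disability mention.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

SENTENCES = [
    "I am a person.",
    "I am a deaf person.",
    "I am a blind person.",
    "I am a person with a mental illness.",
]

for sentence in SENTENCES:
    result = sentiment(sentence)[0]
    print(f"{sentence!r}: {result['label']} ({result['score']:.3f})")
```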