Jacob Eisenstein
I work on computational linguistics and natural language processing. One of my main research focuses is language variation and change: making NLP systems robust to it, and using computational techniques to measure and understand it.
Authored Publications
MD3: The Multi-Dialect Dataset of Dialogues
Clara Rivera
Dora Demszky
Devyani Sharma
InterSpeech (2023) (to appear)
Abstract
We introduce a new dataset of conversational speech representing English from India, Nigeria, and the United States. Unlike prior datasets, the Multi-Dialect Dataset of Dialogues (MD3) strikes a balance between open-ended conversational speech and task-oriented dialogue by prompting participants to perform a series of short information-sharing tasks.
This facilitates quantitative cross-dialectal comparison, while avoiding the imposition of a restrictive task structure that might inhibit the expression of dialect features.
Preliminary analysis of the dataset reveals significant differences in syntax and in the use of discourse markers. The dataset includes more than 20 hours of audio and more than 200,000 orthographically transcribed tokens, and is made publicly available at https://www.kaggle.com/datasets/jacobeis99/md3en.
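For readers who want to inspect the corpus, a minimal sketch of fetching it from Kaggle with the kagglehub package is shown below; the package and credentials setup are assumptions, and the exact directory layout inside the dataset is not assumed here.

```python
# Minimal sketch: download the MD3 dataset from Kaggle and list its files.
# Assumes the `kagglehub` package is installed and Kaggle credentials are
# configured; the internal file layout of the dataset is not shown here.
import os
import kagglehub

path = kagglehub.dataset_download("jacobeis99/md3en")
for root, _, files in os.walk(path):
    for name in files:
        print(os.path.join(root, name))
```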
Dialect-robust Evaluation of Generated Text
Jiao Sun
Elizabeth Clark
Tu Vu
Sebastian Gehrmann
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada (2023), pp. 6010-6028
Abstract
Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, there currently exists no way to quantify how metrics respond to a change in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics. We introduce a suite of methods and corresponding statistical tests one can use to assess metrics in light of the two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust and that semantic perturbations frequently lead to smaller decreases in a metric than the introduction of dialect features. As a first step to overcome this limitation, we propose a training schema, NANO, which introduces regional and language information to the pretraining process of a metric. We demonstrate that NANO provides a size-efficient way for models to improve dialect robustness while simultaneously improving their performance on the standard metric benchmark.
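As an illustration of the robustness criterion described above, the hedged sketch below checks whether a metric scores a dialect rewrite of a reference at least as high as a semantically perturbed sentence. The metric interface, the example sentences, and the toy token-overlap scorer are hypothetical placeholders, not the paper's test suite or NANO.

```python
# Hypothetical sketch of the dialect-robustness check: a robust metric should
# score a dialect rewrite of the reference at least as high as a semantically
# perturbed sentence. `metric` is a placeholder for any reference-based NLG
# metric.
def is_dialect_robust(metric, reference, dialect_rewrite, semantic_perturbation):
    score_dialect = metric(reference, dialect_rewrite)
    score_perturbed = metric(reference, semantic_perturbation)
    return score_dialect >= score_perturbed

# Toy token-overlap "metric" standing in for a real one.
def overlap(ref, hyp):
    ref_tokens, hyp_tokens = set(ref.split()), set(hyp.split())
    return len(ref_tokens & hyp_tokens) / max(len(ref_tokens | hyp_tokens), 1)

print(is_dialect_robust(
    overlap,
    reference="she was not feeling well yesterday",
    dialect_rewrite="she wasn't feeling too well yesterday",
    semantic_perturbation="she was feeling well yesterday",
))
# Prints False: the surface-overlap stand-in penalizes the dialect rewrite
# more than the meaning-changing perturbation, i.e. it is not dialect-robust.
```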
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Pat Verga
Jianmo Ni
arXiv (2022)
Abstract
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution? How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
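The sketch below illustrates, under stated assumptions, the shape of the task: a system output pairs an answer with a cited passage, and evaluation asks whether the passage supports the answer. The AttributedAnswer structure and the supports checker are hypothetical stand-ins, not the paper's evaluation framework.

```python
# Hypothetical sketch of the Attributed QA setup: the system returns an answer
# together with an attribution (a cited passage), and evaluation asks whether
# the passage actually supports the answer.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AttributedAnswer:
    question: str
    answer: str
    attribution: str  # passage cited as evidence for the answer

def attribution_score(example: AttributedAnswer,
                      supports: Callable[[str, str], bool]) -> int:
    """Return 1 if the cited passage supports the answer, else 0.

    `supports` is a placeholder for a human judgment or an automatic
    entailment-style checker; it is not the paper's exact metric.
    """
    hypothesis = f"The answer to '{example.question}' is {example.answer}."
    return int(supports(example.attribution, hypothesis))
```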
The MultiBERTs: BERT Reproductions for Robustness Analysis
Steve Yadlowsky
Jason Wei
Naomi Saphra
Iulia Raluca Turc
Abstract
Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with hyper-parameters similar to those of the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics.
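As a rough illustration of the kind of seed-level analysis this release is meant to support (not the released code library or its recommended tests), the sketch below bootstraps over per-seed evaluation scores to compare two pretraining procedures; the accuracy values are placeholders.

```python
# Illustrative sketch: compare two pretraining procedures by bootstrapping
# over per-seed evaluation scores, in the spirit of seed-level analysis.
import random

def bootstrap_prob_worse(scores_a, scores_b, n_boot=10_000, seed=0):
    """Estimate P(mean of procedure A <= mean of procedure B) by resampling seeds."""
    rng = random.Random(seed)
    n = len(scores_a)
    worse = 0
    for _ in range(n_boot):
        sample_a = [rng.choice(scores_a) for _ in range(n)]
        sample_b = [rng.choice(scores_b) for _ in range(n)]
        if sum(sample_a) / n <= sum(sample_b) / n:
            worse += 1
    return worse / n_boot

# Hypothetical per-seed dev accuracies for two procedures (placeholder values).
proc_a = [0.842, 0.838, 0.845, 0.840, 0.839]
proc_b = [0.836, 0.841, 0.837, 0.835, 0.838]
print(bootstrap_prob_worse(proc_a, proc_b))
```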
Adapting Language Models to Temporal Knowledge
Bhuwan Dhingra
Transactions of the ACL (2021)
Abstract
It is only a matter of time before facts become out of date: from the name of the POTUS to the basketball team LeBron James plays for. This continuously limits the usefulness of previously collected datasets and language models (LMs) trained on them. The problem is exacerbated when LMs are used in the closed-book question answering setting, where the pretraining data must contain the facts for the model to remember within its fixed parameters. A frequent paradigm is to update or refresh the dataset every so often, then retrain models with the new data: this is costly, but does it work? In this paper, we introduce a diagnostic dataset for probing LMs for factual knowledge that changes over time. Using it, we show that models trained only on the most recent slice of data perform worse on questions about the past than models trained on data spread uniformly across time, while performing better on current and future questions. Moreover, we propose jointly modeling text with the time it was created, and show that this improves memorization of previous facts, as well as reasoning about the uncertainty around future facts. We also show that models trained with temporal context allow for efficient refreshes as new data arrives, without the need to retrain from scratch.
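A minimal sketch of the "jointly modeling text with the time it was created" idea is given below: each training example is prefixed with its creation year so the LM can condition on time. The prefix format and the example sentences are illustrative, not necessarily those used in the paper.

```python
# Minimal sketch of conditioning an LM on document time by prepending the
# creation year to each training example; the prefix format is illustrative.
def add_time_prefix(text: str, year: int) -> str:
    return f"year: {year} text: {text}"

examples = [
    ("LeBron James plays for the Cleveland Cavaliers.", 2017),
    ("LeBron James plays for the Los Angeles Lakers.", 2020),
]
for text, year in examples:
    print(add_time_prefix(text, year))
```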
AdvAug: Robust Adversarial Augmentation for Neural Machine Translation
Abstract
In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT). The main idea is to minimize the vicinal risk over virtual sentences sampled from two vicinity distributions, of which the crucial one is a novel vicinity distribution for adversarial sentences that describes a smooth interpolated embedding space centered around observed training sentence pairs. We then discuss our approach, AdvAug, to train NMT models using the embeddings of virtual sentences in sequence-to-sequence learning. Experiments on Chinese-English, English-French, and English-German translation benchmarks show that AdvAug achieves significant improvements over the Transformer (up to 4.9 BLEU points), and substantially outperforms other data augmentation techniques (e.g., back-translation) without using extra corpora.
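The sketch below illustrates the vicinity idea in mixup style: a virtual example is formed by convex interpolation of the embeddings of two observed sentences. The shapes, the Beta parameter, and the random embeddings are placeholders; this is not the AdvAug implementation.

```python
# Illustrative mixup-style sketch of the vicinity idea: a virtual example is a
# convex interpolation of the embeddings of two observed sentences.
import numpy as np

def interpolate_embeddings(emb_a, emb_b, alpha=0.4, rng=None):
    """Sample a virtual sentence embedding between two observed ones."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # interpolation coefficient in [0, 1]
    return lam * emb_a + (1.0 - lam) * emb_b

# Two toy sentence embeddings (sequence length 3, hidden size 4).
emb_a = np.random.default_rng(1).normal(size=(3, 4))
emb_b = np.random.default_rng(2).normal(size=(3, 4))
virtual = interpolate_embeddings(emb_a, emb_b)
print(virtual.shape)  # (3, 4)
```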
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Dan Moldovan
Ben Adlam
Babak Alipanahi
Alex Beutel
Christina Chen
Jon Deaton
Matthew D. Hoffman
Shaobo Hou
Neil Houlsby
Ghassen Jerfel
Yian Ma
Diana Mincu
Akinori Mitani
Andrea Montanari
Christopher Nielsen
Thomas Osborne
Rajiv Raman
Kim Ramasamy
Martin Gamunu Seneviratne
Shannon Sequeira
Harini Suresh
Victor Veitch
Steve Yadlowsky
Xiaohua Zhai
Journal of Machine Learning Research (2020)
Abstract
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
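As a rough operational illustration of underspecification (not drawn from the paper's experiments), the sketch below retrains the same pipeline under several random seeds and compares the spread of scores in-domain versus on a crudely shifted stress set; the data, model, and shift are synthetic placeholders.

```python
# Illustrative sketch: retrain one pipeline under several random seeds, then
# compare score spread in-domain vs. on a shifted "stress" set. Predictors that
# look equivalent in-domain may diverge under shift.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_iid, y_iid = X[1000:1500], y[1000:1500]
# Crude stand-in for a deployment-domain shift: perturb held-out features.
X_shift = X[1500:] + np.random.default_rng(0).normal(0, 1.5, X[1500:].shape)
y_shift = y[1500:]

iid_scores, shift_scores = [], []
for seed in range(5):
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                          random_state=seed).fit(X_train, y_train)
    iid_scores.append(model.score(X_iid, y_iid))
    shift_scores.append(model.score(X_shift, y_shift))

print("in-domain accuracy per seed:", np.round(iid_scores, 3))
print("shifted-domain accuracy per seed:", np.round(shift_scores, 3))
```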