Jacob Eisenstein

I work on computational linguistics and natural language processing. One of my main research focuses is language variation and change: making NLP systems robust to it, and using computational techniques to measure and understand it.
Authored Publications
    MD3: The Multi-Dialect Dataset of Dialogues
    Clara Rivera
    Dora Demszky
    Devyani Sharma
    Interspeech (2023), to appear
    Abstract: We introduce a new dataset of conversational speech representing English from India, Nigeria, and the United States. Unlike prior datasets, the Multi-Dialect Dataset of Dialogues (MD3) strikes a balance between open-ended conversational speech and task-oriented dialogue by prompting participants to perform a series of short information-sharing tasks. This facilitates quantitative cross-dialectal comparison, while avoiding the imposition of a restrictive task structure that might inhibit the expression of dialect features. Preliminary analysis of the dataset reveals significant differences in syntax and in the use of discourse markers. The dataset includes more than 20 hours of audio and more than 200,000 orthographically transcribed tokens, and is made publicly available at https://www.kaggle.com/datasets/jacobeis99/md3en.
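To make the pointer above concrete, here is a minimal sketch of pulling the MD3 release from Kaggle and listing its contents. It assumes the third-party kagglehub package and configured Kaggle credentials; the internal file layout of the dataset is not described here, so the sketch only inspects whatever files are present.

```python
# Minimal sketch: download the MD3 dataset from Kaggle and list its files.
# Assumes the `kagglehub` package is installed and Kaggle credentials are set up.
import os

import kagglehub

# Download (or reuse a cached copy of) the dataset referenced in the abstract.
path = kagglehub.dataset_download("jacobeis99/md3en")

# The file layout is not documented here, so just walk the download directory.
for root, _dirs, files in os.walk(path):
    for name in files:
        print(os.path.join(root, name))
```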
    Dialect-robust Evaluation of Generated Text
    Jiao Sun
    Elizabeth Clark
    Tu Vu
    Sebastian Gehrmann
    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada (2023), pp. 6010-6028
    Abstract: Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, there is currently no way to quantify how metrics respond to changes in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics. We introduce a suite of methods and corresponding statistical tests one can use to assess metrics in light of the two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust and that semantic perturbations frequently lead to smaller decreases in a metric than the introduction of dialect features. As a first step toward overcoming this limitation, we propose a training schema, NANO, which introduces regional and language information to the pretraining process of a metric. We demonstrate that NANO provides a size-efficient way for models to improve dialect robustness while simultaneously improving their performance on the standard metric benchmark.
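The robustness notion above can be illustrated with a small sketch: compare how much a metric's score drops when a reference is rewritten into another dialect versus when it is semantically perturbed. The `metric_score` callable is a hypothetical stand-in for any learned NLG metric; this is not the paper's NANO model or its released test suite.

```python
# Sketch of a dialect-robustness check for an NLG metric. `metric_score` is a
# hypothetical stand-in for any learned metric that scores a hypothesis against
# a reference; higher is better.
from typing import Callable, List, Tuple


def dialect_robustness_gap(
    metric_score: Callable[[str, str], float],
    examples: List[Tuple[str, str, str]],
) -> float:
    """Average (semantic drop - dialect drop) over examples.

    Each example is (reference, dialect_rewrite, semantic_perturbation).
    A dialect-robust metric should yield a non-negative gap: semantic changes
    should hurt the score at least as much as dialect rewrites do.
    """
    gaps = []
    for reference, dialect_rewrite, semantic_perturbation in examples:
        base = metric_score(reference, reference)
        dialect_drop = base - metric_score(dialect_rewrite, reference)
        semantic_drop = base - metric_score(semantic_perturbation, reference)
        gaps.append(semantic_drop - dialect_drop)
    return sum(gaps) / len(gaps)
```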
    Abstract: Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (how to measure attribution, and how well current state-of-the-art methods perform on attribution), and give some hints as to how to address a third key question (how to build LLMs with attribution).
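As a rough illustration of automatic attribution scoring, the sketch below checks whether an attributed passage entails a declarative statement of the question-answer pair using an off-the-shelf NLI model. The model name, hypothesis template, and use of the entailment probability are assumptions for illustration; this is not necessarily the automatic metric studied in the paper.

```python
# Sketch of an NLI-based attribution check: does the attributed passage entail
# a declarative statement of the question-answer pair? The model name and the
# hypothesis template are illustrative assumptions.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")


def attribution_score(question: str, answer: str, passage: str) -> float:
    """Return the NLI entailment probability that `passage` supports the answer."""
    hypothesis = f"The answer to the question '{question}' is {answer}."
    result = nli({"text": passage, "text_pair": hypothesis}, top_k=None)
    # Depending on the transformers version, a single input may come back as a
    # flat list of label scores or nested inside an outer list.
    scores = result[0] if isinstance(result[0], list) else result
    return next(s["score"] for s in scores if s["label"].lower() == "entailment")
```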
    Abstract: Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this need, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with hyper-parameters similar to those of the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics.
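The kind of seed-level analysis this release is meant to support can be sketched with a simple bootstrap over per-seed scores: given evaluation results for two pretraining procedures, estimate how often one procedure's mean score fails to beat the other's under resampling. This is an illustration of the idea only, not the released statistics library, and the example scores are made up.

```python
# Sketch of a seed-level bootstrap: given per-seed scores for two pretraining
# procedures, estimate how often procedure A fails to beat procedure B when
# seeds are resampled. Illustration only, not the released MultiBERTs library.
import numpy as np


def seed_bootstrap_pvalue(scores_a, scores_b, n_boot=10_000, seed=0) -> float:
    """Approximate P(mean(A) <= mean(B)) under resampling of seeds."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    wins_for_b = 0
    for _ in range(n_boot):
        a = rng.choice(scores_a, size=scores_a.size, replace=True)
        b = rng.choice(scores_b, size=scores_b.size, replace=True)
        wins_for_b += a.mean() <= b.mean()
    return wins_for_b / n_boot


# Hypothetical per-seed scores for two procedures on some downstream task.
print(seed_bootstrap_pvalue([84.1, 83.7, 84.5, 83.9, 84.2],
                            [83.2, 83.8, 83.5, 83.0, 83.6]))
```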
    Abstract: It is only a matter of time before facts become out of date: from the name of the POTUS to the basketball team LeBron James plays for. This continuously limits the usefulness of previously collected datasets and of the language models (LMs) trained on them. The problem is exacerbated when LMs are used for closed-book question answering, where the pretraining data must contain the facts for the model to remember within its fixed parameters. A frequent paradigm is to update or refresh the dataset every so often and then retrain models on the new data: this is costly, but does it work? In this paper, we introduce a diagnostic dataset for probing LMs for factual knowledge that changes over time. Using it, we show that models trained only on the most recent slice of data perform worse on questions about the past than models trained on data sampled uniformly across time, while performing better on current and future questions. Moreover, we propose jointly modeling text with the time it was created, and show that this improves memorization of previous facts as well as reasoning about the uncertainty around future facts. We also show that models trained with temporal context allow for efficient refreshes as new data arrives, without the need to retrain from scratch.
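One simple way to realize "jointly modeling text with the time it was created" is to prepend a textual timestamp to each training example so the model can condition on it. The prefix format below is an assumption for illustration, not necessarily the one used in the paper.

```python
# Sketch of time-conditioned training inputs: prepend a textual timestamp to
# each example so the model can condition on when the text was written. The
# prefix format is an assumption for illustration.
from typing import Iterable, List, Tuple


def add_time_prefix(examples: Iterable[Tuple[int, str]]) -> List[str]:
    """Turn (year, text) pairs into time-conditioned training strings."""
    return [f"year: {year} text: {text}" for year, text in examples]


corpus = [
    (2017, "The president of the United States is ..."),
    (2021, "The president of the United States is ..."),
]
print(add_time_prefix(corpus))
```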
    Abstract: In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT). The main idea is to minimize the vicinal risk over virtual sentences sampled from two vicinity distributions, of which the crucial one is a novel vicinity distribution for adversarial sentences that describes a smooth interpolated embedding space centered around observed training sentence pairs. We then discuss our approach, AdvAug, to train NMT models using the embeddings of virtual sentences in sequence-to-sequence learning. Experiments on Chinese-English, English-French, and English-German translation benchmarks show that AdvAug achieves significant improvements over the Transformer (up to 4.9 BLEU points), and substantially outperforms other data augmentation techniques (e.g., back-translation) without using extra corpora.
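The interpolation at the core of the vicinity distribution described above can be sketched as a mixup-style blend of two sentences' token-embedding sequences with a Beta-distributed coefficient. The sketch below illustrates only that step; it omits adversarial sentence generation and the full AdvAug training objective.

```python
# Sketch of mixup-style interpolation in embedding space: blend the token
# embeddings of two sentences (already padded to the same length) using a
# Beta-distributed coefficient, yielding a "virtual sentence" embedding.
import numpy as np


def interpolate_embeddings(emb_a: np.ndarray, emb_b: np.ndarray,
                           alpha: float = 0.2, rng=None) -> np.ndarray:
    """Return an interpolated embedding sequence of shape (seq_len, d_model)."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * emb_a + (1.0 - lam) * emb_b


# Example with random stand-ins for two sentences' embeddings (length 7, dim 16).
rng = np.random.default_rng(0)
virtual = interpolate_embeddings(rng.normal(size=(7, 16)),
                                 rng.normal(size=(7, 16)), rng=rng)
print(virtual.shape)
```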
    Underspecification Presents Challenges for Credibility in Modern Machine Learning
    Dan Moldovan
    Ben Adlam
    Babak Alipanahi
    Alex Beutel
    Christina Chen
    Jon Deaton
    Matthew D. Hoffman
    Shaobo Hou
    Neil Houlsby
    Ghassen Jerfel
    Yian Ma
    Diana Mincu
    Akinori Mitani
    Andrea Montanari
    Christopher Nielsen
    Thomas Osborne
    Rajiv Raman
    Kim Ramasamy
    Martin Gamunu Seneviratne
    Shannon Sequeira
    Harini Suresh
    Victor Veitch
    Steve Yadlowsky
    Xiaohua Zhai
    Journal of Machine Learning Research (2020)
    Abstract: ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
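A toy illustration of underspecification, under stated assumptions: train several models that differ only in random seed on synthetic data where a spurious feature is predictive in the training domain but not under a shift. The models may look interchangeable on in-domain held-out data while diverging on the shifted "stress test"; this is purely illustrative and unrelated to the paper's actual experiments.

```python
# Toy illustration of underspecification: models that differ only in random
# seed can look equivalent on in-domain held-out data yet behave differently
# when a spurious feature stops correlating with the label.
import numpy as np
from sklearn.neural_network import MLPClassifier


def make_data(n, rng, spurious_corr):
    """x0 carries a real (noisy) signal; x1 agrees with y with prob spurious_corr."""
    y = rng.integers(0, 2, size=n)
    x0 = y + rng.normal(scale=1.0, size=n)
    flip = rng.random(n) > spurious_corr
    x1 = np.where(flip, 1 - y, y) + rng.normal(scale=0.1, size=n)
    return np.stack([x0, x1], axis=1), y


rng = np.random.default_rng(0)
X_train, y_train = make_data(2000, rng, spurious_corr=0.95)  # shortcut works here
X_iid, y_iid = make_data(2000, rng, spurious_corr=0.95)      # same domain
X_shift, y_shift = make_data(2000, rng, spurious_corr=0.50)  # shortcut breaks

for seed in range(5):
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=seed)
    clf.fit(X_train, y_train)
    print(f"seed={seed}  iid_acc={clf.score(X_iid, y_iid):.3f}  "
          f"shift_acc={clf.score(X_shift, y_shift):.3f}")
```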