Ellie Pavlick

Ellie Pavlick

Ellie is an Assistant Professor at Brown University and a member of Google AI based in NYC. Her research focuses on computational models of semantics, pragmatics, natural language inference, and grounded language.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this question, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with similar hyper-parameters as the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics. View details
    Preview abstract Many Question-Answering (QA) datasets contain unanswerable questions, but their treatment in QA systems remains primitive. Our analysis of the Natural Questions (Kwiatkowski et al., 2019) dataset reveals that a substantial portion of unanswerable questions (∼21%) can be explained based on the presence of unverifiable presuppositions. Through a user preference study, we demonstrate that the oracle behavior of our proposed system—which provides responses based on presupposition failure—is preferred over the oracle behavior of existing QA systems. Then, we present a novel framework for implementing such a system in three steps: presupposition generation, presupposition verification, and explanation generation, reporting progress on each. Finally, we show that a simple modification of adding presuppositions and their verifiability to the input of a competitive end-to-end QA system yields modest gains in QA performance and unanswerability detection, demonstrating the promise of our approach. View details
    Frequency Effects on Syntactic Rule Learning in Transformers
    Jason Wei
    Conference on Empirical Methods in Natural Language Processing (2021)
    Preview abstract Pre-trained language models perform well on a variety of linguistic tasks that require symbolic reasoning, raising the question of whether such models implicitly represent abstract symbols and rules. We investigate this question using the case study of BERT's performance on English subject-verb agreement. Unlike prior work, we train multiple instances of BERT from scratch, allowing us to perform a series of controlled interventions at pre-training time. We show that BERT often generalizes well to subject-verb pairs that never occurred in training, suggesting a degree of rule-governed behavior. We also find, however, that performance is heavily influenced by word frequency, with experiments showing that both the absolute frequency of a verb form, as well as the frequency relative to the alternate inflection, are causally implicated in the predictions BERT makes at inference time. Closer analysis of these frequency effects reveals that BERT's behavior is consistent with a system that correctly applies the SVA rule in general but struggles to overcome strong training priors and to estimate agreement features (singular vs. plural) on infrequent lexical items. View details
    What Happens To BERT Embeddings During Fine-tuning?
    Amil Merchant
    Elahe Rahimtoroghi
    Proceedings of the 2020 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics (to appear)
    Preview abstract While there has been much recent work studying how linguistic information is encoded in pre-trained sentence representations, comparatively little is understood about how these models change when adapted to solve downstream tasks. Using a suite of analysis techniques (probing classifiers, Representational Similarity Analysis, and model ablations), we investigate how fine-tuning affects the representations of the BERT model. We find that while fine-tuning necessarily makes significant changes, it does not lead to catastrophic forgetting of linguistic phenomena. We instead find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing. Finally, we also find that fine-tuning has a weaker effect on representations of out-of-domain sentences, suggesting room for improvement in model generalization. View details
    Preview abstract Large pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode correlations undesired in many applications, like \emph{surgeon} being associated more with \emph{he} than \emph{she}. We explore such \emph{gendered correlations} as a case study, to learn how we can configure and train models to mitigate the risk of encoding unintended associations. We find that it is important to define correlation metrics, since they can reveal differences among models with similar accuracy. Large models have more capacity to encode gendered correlations, but this can be mitigated with general dropout regularization. Counterfactual data augmentation is also effective, and can even reduce correlations not explicitly targeted for mitigation, potentially making it useful beyond gender too. Both techniques yield models with comparable accuracy to unmitigated analogues, and still resist re-learning correlations in fine-tuning. View details
    BERT Rediscovers the Classical NLP Pipeline
    Association for Computational Linguistics (2019) (to appear)
    Preview abstract Pre-trained sentence encoders such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have rapidly advanced the state-of-theart on many NLP tasks, and have been shown to encode contextual information that can resolve many aspects of language structure. We extend the edge probing suite of Tenney et al. (2019) to explore the computation performed at each layer of the BERT model, and find that tasks derived from the traditional NLP pipeline appear in a natural progression: part-of-speech tags are processed earliest, followed by constituents, dependencies, semantic roles, and coreference. We trace individual examples through the encoder and find that while this order holds on average, the encoder occasionally inverts the order, revising low-level decisions after deciding higher-level contextual relations. View details
    What do you learn from context? Probing for sentence structure in contextualized word representations
    Patrick Xia
    Berlin Chen
    Alex Wang
    Adam Poliak
    R. Thomas McCoy
    Najoung Kim
    Benjamin Van Durme
    Samuel R. Bowman
    International Conference on Learning Representations (2019)
    Preview abstract Contextualized representation models such as CoVe (McCann et al., 2017) and ELMo (Peters et al., 2018a) have recently achieved state-of-the-art results on a broad suite of downstream NLP tasks. Building on recent token-level probing work (Peters et al., 2018a; Blevins et al., 2018; Belinkov et al., 2017b; Shi et al., 2016), we introduce a broad suite of sub-sentence probing tasks derived from the traditional structured-prediction pipeline, including parsing, semantic role labeling, and coreference, and covering a range of syntactic, semantic, local, and long-range phenomena. We use these tasks to examine the word-level contextual representations and investigate how they encode information about the structure of the sentence in which they appear. We probe three recently-released contextual encoder models, and find that ELMo better encodes linguistic structure at the word level than do other comparable models. We find that the existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer small improvements on semantic tasks over a non-contextual baseline. View details
    Preview abstract We release a corpus of atomic insertion ed-its: instances in which a human editor has inserted a single contiguous span of text into an existing sentence. Our corpus is derived fromWikipedia edit history and contains 43 million sentences across 8 different languages. We argue that the signal contained in these edits is valuable for research in semantics and dis-course, and that such signal differs from that found in conventional language modeling corpora. We provide experimental evidence from both a corpus linguistics and a language modeling perspective to support these claims. View details
    Identifying 1950s American Jazz Musicians: Fine-Grained IsA Extraction via Modifier Composition
    Marius Pasca
    Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL-2017), Vancouver, Canada, pp. 2099-2109
    Preview