Daphne Ippolito

Daphne Ippolito

I'm a senior research scientist at Google Brain. I work on understanding the limitations of neural language models, including their propensity for memorizing their training data. I also work on using natural language generation systems to build tools for creative writing. I received my PhD in 2022 from University of Pennsylvania, and I will be starting as an assistant professor at Carnegie Mellon University in fall 2023.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract AI researchers have posited Dungeons and Dragons (D&D) as a challenge problem to test systems on various language-related capabilities. In this paper, we frame D&D specifically as a dialogue system challenge, where the tasks are to both generate the next conversational turn in the game and predict the state of the game given the dialogue history. We create a gameplay dataset consisting of nearly 900 games, with a total of 7,000 players, 800,000 dialogue turns, 500,000 dice rolls, and 58 million words. We automatically annotate the data with partial state information about the game play. We train a large language model to generate the next game turn, conditioning it on different information. The LM can respond as a particular character or as the player who runs the game—i.e., the Dungeon Master (DM). It is trained to produce dialogue that is either in-character (roleplaying in the fictional world) or out-of-character (discussing rules or strategy). We perform a human evaluation to determine what factors make the generated output plausible and interesting. We further perform an automatic evaluation to determine how well the model can predict the game state given the history and examine how well tracking the game state improves its ability to produce plausible conversational output. View details
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    Kathy Meier-Hellstern
    arxiv:2204.02311 (2022)
    Preview abstract Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. View details
    Preview abstract As large language models scale up, researchers and engineers have chosen to use larger datasets of loosely-filtered internet text instead of curated texts. We find that existing NLP datasets are highly repetitive and contain duplicated examples. For example, there is an example in the training dataset C4 that has over 200,000 near duplicates. As a whole, we find that 1.68% of the C4 are near-duplicates. Worse, we find a 1% overlap between the training and testing sets in these datasets. Duplicate examples in training data inappropriately biases the distribution of rare/common sequences. Models trained with non-deduplicated datasets are more likely to generate ``memorized" examples. Additionally, if those models are used for downstream applications, such as scoring likelihoods of given sequences, we find that models trained on non-deduplicated and deduplicated datasets have a difference in accuracy of on average TODO. View details
    Automatic Detection of Generated Text is Easiest when Humans are Fooled
    Chris Callison-Burch
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), pp. 1808-1822
    Preview abstract Recent advancements in neural language modelling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategies—top-_k_, nucleus sampling, and untruncated random sampling—and show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30% of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems. View details
    Towards Better Storylines with Sentence-Level Language Models
    David Grangier
    Chris Callison-Burch
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), pp. 1808-1822
    Preview abstract This work proposes a sentence-level language model which predicts the next sentence in a story given the embeddings of the previous sentences. The model operates at the sentence-level and selects the next sentence within a fine set of fluent alternatives. By working with sentence embeddings instead of word embeddings, our model is able to efficiently consider a large number of alternative sentences. By considering only fluent sentences, our model is relieved from modeling fluency and can focus on longer range dependencies. Our method achieves state-of-the-art accuracy on the StoryCloze task in the unsupervised setting. View details
    Unsupervised Hierarchical Story Infilling
    David Grangier
    Chris Callison-Burch
    NAACL 2019 Workshop on Narrative Understanding, Minneapolis, MN (2019)
    Preview abstract Story infilling involves predicting words to go into a missing span from a story. This challenging task has the potential to transform interactive tools for creative writing. However, state-of-the-art conditional language models have trouble balancing fluency and coherence with novelty and diversity. We address this limitation with a hierarchical model which first selects a set of rare words and then generates text conditioned on that set. By relegating the high entropy task of picking rare words to a word-sampling model, the second-stage model conditioned on those words can achieve high fluency and coherence by searching for likely sentences, without sacrificing diversity. View details