Joshua Maynez

Authored Publications
    Abstract: We introduce Seahorse (SummariEs Annotated with Human Ratings in Six languagEs), a dataset of 96K summaries with ratings along 6 dimensions (comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness). The summaries are generated from 8 different models, conditioned on source text from 4 datasets in 6 languages (German, English, Spanish, Russian, Turkish, and Vietnamese). We release the annotated summaries as a resource for developing better summarization models and automatic metrics. We present an analysis of the dataset's composition and quality, and we demonstrate the potential of this dataset for building better summarization metrics, showing that metrics finetuned with Seahorse data outperform baseline metrics.
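A minimal Python sketch of what one annotated record might look like; the field names, the boolean rating type, and the example values are illustrative assumptions, not the released Seahorse schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one annotated summary; the actual Seahorse
# release may use different field names, rating scales, and file formats.
@dataclass
class SeahorseExample:
    source_text: str         # article the summary was conditioned on
    summary: str             # output of one of the 8 summarization models
    language: str            # one of: de, en, es, ru, tr, vi
    comprehensibility: bool  # the 6 quality dimensions rated by annotators
    repetition: bool
    grammar: bool
    attribution: bool        # is the summary grounded in the source?
    main_ideas: bool
    conciseness: bool

example = SeahorseExample(
    source_text="Die Regierung hat heute ...",
    summary="The government announced ...",
    language="de",
    comprehensibility=True, repetition=True, grammar=True,
    attribution=False, main_ideas=True, conciseness=True,
)

# A learned metric could then be finetuned to predict one dimension
# (e.g. attribution) from (source_text, summary) pairs.
print(example.attribution)
```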
    Text-Blueprint: An Interactive Platform for Plan-based Conditional Generation
    Fantine Huot
    Reinald Kim Amplayo
    Mirella Lapata
    Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (2023)
    Abstract: While conditional generation models can now generate natural language well enough to create fluent text, it is still difficult to control the generation process, leading to irrelevant, repetitive, and hallucinated content. Recent work shows that planning can be a useful intermediate step to render conditional generation less opaque and more grounded. We present a web browser-based demonstration for query-focused summarization that uses a sequence of question-answer pairs as a blueprint plan for guiding text generation (i.e., what to say and in what order). We illustrate how users may interact with the generated text and associated plan visualizations, e.g., by editing and modifying the blueprint in order to improve or control the generated output.
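As a rough illustration of the blueprint idea, the sketch below represents a plan as an ordered list of question-answer pairs and serializes it together with the query and the source document; the tag strings and the `build_model_input` helper are illustrative assumptions, not the demo's actual interface.

```python
from typing import List, Tuple

QAPair = Tuple[str, str]

def build_model_input(query: str, source: str, blueprint: List[QAPair]) -> str:
    """Serialize a query, an editable QA-pair plan, and the source text
    into a single string that a seq2seq model could condition on."""
    plan = " ".join(f"Q: {q} A: {a}" for q, a in blueprint)
    return f"query: {query} plan: {plan} document: {source}"

# A user of such a demo could edit this plan (reorder, delete, or rewrite
# pairs) to steer what the generated summary says and in what order.
blueprint = [
    ("Who announced the policy?", "the transport ministry"),
    ("When does it take effect?", "in January"),
]
print(build_model_input("new cycling policy",
                        "The transport ministry said ...", blueprint))
```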
    Conditional Generation with a Question-Answering Blueprint
    Reinald Kim Amplayo
    Fantine Huot
    Mirella Lapata
    Transactions of the Association for Computational Linguistics (2023) (to appear)
    Abstract: The ability to convey relevant and faithful information is critical for many tasks in conditional generation and yet remains elusive for neural seq-to-seq models whose outputs often reveal hallucinations and fail to correctly cover important details. In this work, we advocate planning as a useful intermediate representation for rendering conditional generation less opaque and more grounded. Our work proposes a new conceptualization of text plans as a sequence of question-answer (QA) pairs. We enhance existing datasets (e.g., for summarization) with a QA blueprint operating as a proxy for both content selection (i.e., what to say) and planning (i.e., in what order). We obtain blueprints automatically by exploiting state-of-the-art question generation technology and convert input-output pairs into input-blueprint-output tuples. We develop Transformer-based models, each varying in how they incorporate the blueprint in the generated output (e.g., as a global plan or iteratively). Evaluation across metrics and datasets demonstrates that blueprint models are more factual than alternatives which do not resort to planning and allow tighter control of the generation output.
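A hedged sketch of how an (input, output) pair might be converted into an (input, blueprint, output) tuple; the `generate_question` stub stands in for the question generation model, and the answer spans are chosen by hand here rather than by the paper's selection procedure.

```python
from typing import List, Tuple

def generate_question(answer: str, context: str) -> str:
    # Stub: in practice a question generation model would produce a
    # question whose answer is `answer` given `context`.
    return f"What does the summary say about {answer}?"

def make_blueprint(summary: str, answer_spans: List[str]) -> List[Tuple[str, str]]:
    """Turn a reference summary into a QA blueprint: one (question, answer)
    pair per selected answer span, in the order the spans appear."""
    return [(generate_question(a, summary), a) for a in answer_spans]

summary = "The museum reopened in May after a two-year renovation."
blueprint = make_blueprint(summary, ["the museum", "May", "a two-year renovation"])
# Training tuple: (document, blueprint, summary); the model learns to emit
# the blueprint first (the plan) and then the summary conditioned on it.
print(blueprint)
```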
    Abstract: A typical product or place often has hundreds of reviews, and summarizing these texts is an important and challenging problem. Recent progress on abstractive summarization in domains such as news has been driven by supervised systems trained on hundreds of thousands of news articles paired with human-written summaries. For opinion texts, however, such large-scale datasets are rarely available; unsupervised methods, self-training, and few-shot learning approaches bridge that gap. In this work, we present OpineSum, a novel self-training approach for abstractive opinion summarization. The summaries in this approach are built using a novel application of textual entailment and capture the consensus of opinions across the various reviews for an item. This method can be used to obtain silver-standard summaries on a large scale and to train both unsupervised and few-shot abstractive summarization systems. OpineSum achieves state-of-the-art performance in both settings.
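A minimal sketch of the consensus idea behind this kind of entailment-based self-training: a candidate statement is kept for the silver summary only if enough reviews entail it. The `entails` function below is a crude lexical-overlap stand-in for a real NLI model, and the threshold and support count are illustrative.

```python
import re
from typing import List

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))

def entails(premise: str, hypothesis: str) -> bool:
    # Crude word-overlap proxy for a textual entailment (NLI) model,
    # so this sketch runs without any model downloads.
    p, h = _tokens(premise), _tokens(hypothesis)
    return len(p & h) / max(len(h), 1) >= 0.5

def consensus_statements(reviews: List[str], candidates: List[str],
                         min_support: int = 2) -> List[str]:
    """Keep candidate statements entailed by at least `min_support` reviews;
    the retained statements form a silver-standard summary for self-training."""
    return [c for c in candidates
            if sum(entails(r, c) for r in reviews) >= min_support]

reviews = ["The rooms were clean and the staff friendly.",
           "Friendly staff, spotless and clean rooms.",
           "Breakfast was disappointing."]
print(consensus_statements(reviews, ["The rooms were clean.",
                                     "Breakfast was great."]))
# -> ['The rooms were clean.']  (the only claim supported by multiple reviews)
```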
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    Kathy Meier-Hellstern
    arXiv:2204.02311 (2022)
    Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state of the art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
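As a small illustration of few-shot learning, the sketch below concatenates a handful of solved examples with a new query so the model can complete the pattern with no gradient updates; the prompt format and exemplars are invented for this sketch and are not PaLM's evaluation setup.

```python
from typing import List, Tuple

def few_shot_prompt(exemplars: List[Tuple[str, str]], query: str) -> str:
    """Build a prompt from a few solved examples followed by a new query;
    the language model is expected to continue the pattern."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\nQ: {query}\nA:"

exemplars = [
    ("A farmer has 3 pens with 4 sheep each. How many sheep?",
     "3 * 4 = 12. The answer is 12."),
    ("Tom read 5 pages a day for 6 days. How many pages?",
     "5 * 6 = 30. The answer is 30."),
]
prompt = few_shot_prompt(exemplars, "A box holds 7 rows of 8 apples. How many apples?")
print(prompt)  # this string would be sent to the language model for completion
```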
    A Well-Composed Text is Half Done! Composition Sampling for Diverse Conditional Generation
    Yao Zhao
    Mirella Lapata
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Association for Computational Linguistics, pp. 21
    Abstract: We propose Composition Sampling, a simple but effective method to generate diverse outputs for conditional generation of higher quality compared to previous stochastic decoding strategies. It builds on recently proposed plan-based neural generation models (Narayan et al., 2021) that are trained to first create a composition of the output and then generate by conditioning on it and the input. Our approach avoids text degeneration by first sampling a composition in the form of an entity chain and then using beam search to generate the best possible text grounded to this entity chain. Experiments on summarization (CNN/DailyMail and XSum) and question generation (SQuAD), using existing and newly proposed automatic metrics together with human-based evaluation, demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse meaningful outputs.
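A hedged sketch of the two-stage decoding described above: diversity comes from sampling different entity-chain compositions, while fluency comes from deterministic beam search given each composition. Both `sample_composition` and `beam_search_text` are toy stand-ins for the plan-based model, not its actual decoding code.

```python
import random
from typing import List

def sample_composition(document: str, seed: int) -> str:
    # Stub: in practice the plan-based model samples an entity chain
    # (e.g. with nucleus sampling) conditioned on the document.
    entities = [w.strip(".,") for w in document.split() if w.istitle()]
    random.seed(seed)
    random.shuffle(entities)
    return " | ".join(entities[:3])

def beam_search_text(document: str, composition: str) -> str:
    # Stub: the same model then decodes the summary with beam search,
    # conditioned on both the document and the sampled entity chain.
    return f"[PLAN {composition}] summary grounded to this plan ..."

def composition_sampling(document: str, num_outputs: int = 3) -> List[str]:
    """Each output uses a different sampled plan but deterministic decoding."""
    return [beam_search_text(document, sample_composition(document, s))
            for s in range(num_outputs)]

doc = "Paris Mayor Anne Hidalgo announced a new Seine swimming site on Tuesday."
for out in composition_sampling(doc):
    print(out)
```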
    Abstract: The availability of large, high-quality datasets has been one of the main drivers of recent progress in question answering (QA). Such annotated datasets however are difficult and costly to collect, and rarely exist in languages other than English, rendering QA technology inaccessible to underrepresented languages. An alternative to building large monolingual training datasets is to leverage pre-trained language models (PLMs) under a few-shot learning setting. Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained, thus avoiding costly annotation. Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines, bridges nearly 60% of the gap between an English-only baseline and a fully supervised upper bound trained on almost 50,000 hand labeled examples, and always leads to substantial improvements compared to fine-tuning a QA model directly on labeled examples in low resource settings. Experiments on the TyDiQA-GoldP and MLQA benchmarks show that few-shot prompt tuning for data synthesis scales across languages and is a viable alternative to large-scale annotation.
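A rough sketch of the data-synthesis loop, assuming a prompt-tuned PLM exposed as `plm_generate` and a simple answer-in-passage filter; both are illustrative stand-ins rather than QAmeleon's actual pipeline.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str, str]  # (passage, question, answer)

def plm_generate(prompt: str) -> Example:
    # Stub for the prompt-tuned PLM; in the paper a soft prompt tuned on
    # about five labeled examples per language drives this generation step.
    return ("Berlin ist die Hauptstadt von Deutschland.",
            "Was ist die Hauptstadt von Deutschland?",
            "Berlin")

def synthesize_qa_data(seeds: Dict[str, List[Example]],
                       n_per_lang: int = 100) -> Dict[str, List[Example]]:
    """Generate multilingual QA training data from a handful of seed examples,
    keeping only triples whose answer is a span of the generated passage."""
    out: Dict[str, List[Example]] = {}
    for lang, shots in seeds.items():
        prompt = "\n".join(f"passage: {p}\nquestion: {q}\nanswer: {a}"
                           for p, q, a in shots)
        triples = [plm_generate(prompt) for _ in range(n_per_lang)]
        out[lang] = [(p, q, a) for p, q, a in triples if a in p]
    return out

seeds = {"de": [("Paris ist die Hauptstadt von Frankreich.",
                 "Was ist die Hauptstadt von Frankreich?", "Paris")]}
synthetic = synthesize_qa_data(seeds, n_per_lang=3)
print(len(synthetic["de"]))  # this data would then finetune a multilingual QA model
```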
    Abstract: In this paper we introduce a Focus Attention MEchanism to two popular Seq2Seq architectures: RoBERTaS2S and Pegasus. Both RoBERTaS2S and Pegasus use a Transformer-based encoder-decoder architecture; at each decoding step the decoder learns a single contextual representation necessary to predict the next token by attending to the input sequence and the sequence that has been predicted so far. The focus attention mechanism takes inspiration from human-written text and augments this contextual representation through dynamic vocabulary biasing to proactively generate tokens that are similar or topical to the input sequence. When evaluated on the BBC extreme summarization task, both RoBERTaS2S and Pegasus with focus attention generate summaries that are more faithful to their input documents than their counterparts. Models with focus attention can holistically learn abstract-level properties, such as mostly extractive, mostly abstractive, or text-editing only, embodied in the target texts, without introducing any task-specific architectural priors. Finally, focus attention supports Focus Sampling, a technique to sample topically relevant tokens to generate diverse yet topically consistent and faithful outputs.
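A toy numpy sketch of the dynamic vocabulary biasing idea: the decoder's next-token logits are shifted toward tokens judged topical for the source before the softmax, and Focus Sampling restricts sampling to the most topical tokens. The binary topical scores and the bias strength here are assumptions; in the paper a learned component predicts soft relevance scores.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cup", "final", "goal", "keeper", "parliament", "tax"]
logits = rng.normal(size=len(vocab))          # decoder's next-token logits

# Toy topical scores: 1.0 for tokens occurring in the source, else 0.0.
source = "the cup final was decided by a late goal"
topical = np.array([1.0 if tok in source.split() else 0.0 for tok in vocab])

beta = 2.0                                    # strength of the vocabulary bias
biased_logits = logits + beta * topical
probs = np.exp(biased_logits) / np.exp(biased_logits).sum()

# Focus Sampling (sketch): only sample from the top-k most topical tokens.
k = 4
allowed = np.argsort(topical)[-k:]
print([vocab[i] for i in allowed], probs.round(3))
```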
    Planning with Learned Entity Prompts for Abstractive Summarization
    Yao Zhao
    Ryan McDonald
    Transactions of the Association for Computational Linguistics, 9 (2021), 1475–1492
    Abstract: We investigate the Entity Chain -- a chain of related entities in the summary -- as an intermediate summary representation to better plan and ground the generation of abstractive summaries. In particular, we achieve this by augmenting the target, prepending it with an entity chain extracted from the target. We experiment with Transformer-based encoder-decoder models: a Transformer encoder first encodes the input, and a Transformer decoder generates an intermediate summary representation in the form of an entity chain and then continues generating the summary conditioned on the entity chain and the input. We evaluate our approach on a diverse set of text summarization tasks and show that Pegasus models finetuned with entity chains clearly outperform regular finetuning in terms of entity accuracy. We further demonstrate that our simple method can be easily used for pretraining summarization models to do entity-level content planning and summary generation, and we see further gains with pretraining.
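A minimal sketch of how a training target could be augmented with its entity chain; the `[ENTITYCHAIN]` and `[SUMMARY]` markers and the capitalization heuristic are simplifications of the paper's setup, which would use a proper entity recognizer.

```python
from typing import List

def extract_entities(summary: str) -> List[str]:
    # Toy stand-in for named entity recognition: treat capitalized,
    # non-sentence-initial words as entities, preserving their order.
    words = summary.split()
    return [w.strip(".,") for i, w in enumerate(words)
            if w[0].isupper() and i > 0]

def augment_target(summary: str) -> str:
    """Prefix the summary with its entity chain so the decoder learns to
    plan (emit the chain) before generating the summary itself."""
    chain = " | ".join(extract_entities(summary))
    return f"[ENTITYCHAIN] {chain} [SUMMARY] {summary}"

print(augment_target("Frozen 2 broke box-office records for Disney in November."))
# -> [ENTITYCHAIN] Disney | November [SUMMARY] Frozen 2 broke box-office ...
```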
    On Faithfulness and Factuality in Abstractive Summarization
    Ryan Thomas McDonald
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL) (2020)
    Abstract: It is well known that the standard likelihood training and approximate decoding objectives in neural text generation models are fundamentally flawed and lead to dull and repetitive responses. We found that these models, when tested on abstractive summarization, are highly prone to hallucinate content that is either unfaithful to the input document, completely irrelevant, or gibberish. We conduct a large-scale human evaluation of several state-of-the-art neural abstractive summarization systems, including pretrained language models, to better understand the types of hallucinations. Furthermore, we study the extent to which the hallucinated content (i) co-occurs with common linguistic irregularities such as repetition and incoherence, and (ii) can be measured by NLU measures such as textual entailment, question answering, and OpenIE-based fact checking.
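A minimal sketch of one of the automatic checks mentioned above: score each summary sentence by whether the source document entails it. `nli_entailment_prob` is a crude word-overlap stand-in for a trained textual entailment model, and the 0.5 threshold is an illustrative choice.

```python
from typing import List

def nli_entailment_prob(premise: str, hypothesis: str) -> float:
    # Stand-in for a textual entailment model; a word-overlap proxy so the
    # sketch runs without any model downloads.
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)

def faithfulness_score(document: str, summary_sentences: List[str]) -> float:
    """Fraction of summary sentences the document (weakly) entails;
    lower scores flag likely hallucinated content."""
    scores = [nli_entailment_prob(document, s) for s in summary_sentences]
    return sum(score >= 0.5 for score in scores) / len(scores)

doc = "the council approved the new library budget on monday"
summary = ["the council approved the library budget",
           "the mayor resigned in protest"]
print(faithfulness_score(doc, summary))  # 0.5: the second sentence is unsupported
```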