Joshua Maynez
Authored Publications
SEAHORSE: A Dataset of Summaries Annotated with Human Ratings in Six Languages
Elizabeth Clark
Shruti Rijhwani
Sebastian Gehrmann
EMNLP 2023, Association for Computational Linguistics (2023)
We introduce Seahorse (SummariEs Annotated with Human Ratings in Six languagEs), a dataset of 96K summaries with ratings along 6 dimensions (comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness). The summaries are generated from 8 different models, conditioned on source text from 4 datasets in 6 languages (German, English, Spanish, Russian, Turkish, and Vietnamese). We release the annotated summaries as a resource for developing better summarization models and automatic metrics. We present an analysis of the dataset's composition and quality, and we demonstrate the potential of this dataset for building better summarization metrics, showing that metrics finetuned with Seahorse data outperform baseline metrics.
Text-Blueprint: An Interactive Platform for Plan-based Conditional Generation
Fantine Huot
Reinald Kim Amplayo
Mirella Lapata
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (2023)
While conditional generation models can now generate natural language well enough to create fluent text, it is still difficult to control the generation process, leading to irrelevant, repetitive, and hallucinated content. Recent work shows that planning can be a useful intermediate step to render conditional generation less opaque and more grounded. We present a web browser-based demonstration for query-focused summarization that uses a sequence of question-answer pairs as a blueprint plan for guiding text generation (i.e., what to say and in what order). We illustrate how users may interact with the generated text and associated plan visualizations, e.g., by editing and modifying the blueprint in order to improve or control the generated output.
Conditional Generation with a Question-Answering Blueprint
Reinald Kim Amplayo
Fantine Huot
Mirella Lapata
Transactions of the Association for Computational Linguistics (2023) (to appear)
The ability to convey relevant and faithful information is critical for many tasks in conditional generation and yet remains elusive for neural seq-to-seq models whose outputs often reveal hallucinations and fail to correctly cover important details. In this work, we advocate planning as a useful intermediate representation for rendering conditional generation less opaque and more grounded. Our work proposes a new conceptualization of text plans as a sequence of question-answer (QA) pairs. We enhance existing datasets (e.g., for summarization) with a QA blueprint operating as a proxy for both content selection (i.e., what to say) and planning (i.e., in what order). We obtain blueprints automatically by exploiting state-of-the-art question generation technology and convert input-output pairs into input-blueprint-output tuples. We develop Transformer-based models, each varying in how they incorporate the blueprint in the generated output (e.g., as a global plan or iteratively). Evaluation across metrics and datasets demonstrates that blueprint models are more factual than alternatives which do not resort to planning and allow tighter control of the generation output.
View details
Preview abstract
A typical product or place often has hundreds of reviews, and summarization of these texts is an important and challenging problem. Recent progress on abstractive summarization in domains such as news has been driven by supervised systems trained on hundreds of thousands of news articles paired with human-written summaries. However, for opinion texts, such large-scale datasets are rarely available. Unsupervised methods, self-training, and few-shot learning approaches bridge that gap. In this work, we present a novel self-training approach, OpineSum, for abstractive opinion summarization. The summaries in this approach are built using a novel application of textual entailment and capture the consensus of opinions across the various reviews for an item. This method can be used to obtain silver-standard summaries on a large scale and train both unsupervised and few-shot abstractive summarization systems. OpineSum achieves state-of-the-art performance in both settings.
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Kathy Meier-Hellstern
arXiv:2204.02311 (2022)
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
A Well-Composed Text is Half Done! Composition Sampling for Diverse Conditional Generation
Yao Zhao
Mirella Lapata
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Association for Computational Linguistics, pp. 21
We propose Composition Sampling, a simple but effective method to generate diverse outputs for conditional generation of higher quality compared to previous stochastic decoding strategies. It builds on recently proposed plan-based neural generation models (Narayan et al., 2021) that are trained to first create a composition of the output and then generate by conditioning on it and the input. Our approach avoids text degeneration by first sampling a composition in the form of an entity chain and then using beam search to generate the best possible text grounded to this entity chain. Experiments on summarization (CNN/DailyMail and XSum) and question generation (SQuAD), using existing and newly proposed automatic metrics together with human-based evaluation, demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse meaningful outputs.
QAmeleon: Multilingual QA with Only 5 Examples
Fantine Huot
Sebastian Ruder
Mirella Lapata
arXiv (2022)
The availability of large, high-quality datasets has been one of the main drivers of recent progress in question answering (QA). Such annotated datasets, however, are difficult and costly to collect, and rarely exist in languages other than English, rendering QA technology inaccessible to underrepresented languages. An alternative to building large monolingual training datasets is to leverage pre-trained language models (PLMs) under a few-shot learning setting. Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained, thus avoiding costly annotation. Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines, bridges nearly 60% of the gap between an English-only baseline and a fully supervised upper bound trained on almost 50,000 hand-labeled examples, and always leads to substantial improvements compared to fine-tuning a QA model directly on labeled examples in low-resource settings. Experiments on the TyDiQA-GoldP and MLQA benchmarks show that few-shot prompt tuning for data synthesis scales across languages and is a viable alternative to large-scale annotation.
View details
Preview abstract
In this paper we introduce a Focus Attention MEchanism to two popular Seq2Seq architectures: RoBERTaS2S and Pegasus. Both RoBERTaS2S and Pegasus use a Transformer-based encoder-decoder architecture; at each decoding step, the decoder learns a single contextual representation necessary to predict the next token by attending to the input sequence and the sequence that has been predicted so far. The focus attention takes inspiration from human-written text and augments this contextual representation through a dynamic vocabulary biasing to proactively generate tokens that are similar or topical to the input sequence. When evaluated on the BBC extreme summarization task, both RoBERTaS2S and Pegasus with Focus Attention generate summaries that are more faithful to their input documents than their counterparts without it. Models with focus attention can holistically learn any abstract-level properties, such as mostly extractive, mostly abstractive or text-editing only, embodied in the target texts, without introducing any task-specific architectural priors. Finally, by its nature, focus attention supports Focus Sampling -- a technique to sample topically relevant tokens to generate diverse, yet topically consistent and faithful outputs.
Planning with Learned Entity Prompts for Abstractive Summarization
Yao Zhao
Ryan McDonald
Transactions of the Association for Computational Linguistics, 9 (2021), 1475–1492
We investigate Entity Chain -- a chain of related entities in the summary -- as an intermediate summary representation to better plan and ground the generation of abstractive summaries. In particular, we achieve this by augmenting the target by prepending it with an entity chain extracted from the target. We experiment with Transformer-based encoder-decoder models; a Transformer encoder first encodes the input and a Transformer decoder generates an intermediate summary representation in the form of an entity chain and then continues generating the summary conditioned on the entity chain and the input. We evaluate our approach on a diverse set of text summarization tasks and show that Pegasus finetuned models with entity chains clearly outperform regular finetuning in terms of entity accuracy. We further demonstrate that our simple method can be easily used for pretraining summarization models to do entity-level content planning and summary generation. We see further gains with pretraining.
On Faithfulness and Factuality in Abstractive Summarization
Ryan Thomas McDonald
Proceedings of The 58th Annual Meeting of the Association for Computational Linguistics (ACL) (2020)
It is well known that the standard likelihood training and approximate decoding objectives in neural text generation models are fundamentally flawed and lead to dull and repetitive responses. We found that these models, when tested on abstractive summarization, are highly prone to hallucinate content that is either unfaithful to the input document, completely irrelevant, or gibberish. We conduct a large-scale human evaluation of several state-of-the-art neural abstractive summarization systems, including pretrained language models, to better understand the types of hallucinations. Furthermore, we study the extent to which the hallucinated content (i) co-occurs with common linguistic irregularities such as repetition and incoherence, and (ii) can be measured by NLU measures such as textual entailment, question answering and OpenIE-based fact checking.