Google Research

PEGASUS: Pretraining with Extracted Gap-sentences for Abstractive Summarization by Sequence-to-sequence Models



Progress in abstractive summarization has been constrained by the need for large-scale, high-quality supervised summarization datasets. Recent work on the Transformer model and pretraining techniques has shown great success on various NLP tasks, including text summarization. However, none of that work has explored pretraining techniques tailored specifically to abstractive text summarization; furthermore, there is a lack of systematic evaluation of abstractive summarization across broad domains. In this work, we propose Pretraining using Extracted Gap-sentences for Abstractive SUmmarization by Sequence-to-sequence models (PEGASUS). Specifically, we propose extractive strategies to select and mask principal sentences in a document, and the sequence-to-sequence model is pretrained to generate the masked sentences. We evaluate PEGASUS on 12 downstream summarization datasets spanning the news, science, technology, medical, social networking, instructions, corporate email, and legal domains. Experiments demonstrate that PEGASUS achieves state-of-the-art performance on all 12 downstream summarization datasets as measured by ROUGE scores. PEGASUS also shows surprising capability in low-resource settings, achieving SOTA or near-SOTA results on x out of 12 tasks using only 100 finetuning examples.
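To make the gap-sentence idea concrete, the following is a minimal sketch of how one pretraining example could be built from a document. The abstract does not specify how principal sentences are scored, so the unigram-overlap F1 used here is an assumption standing in for an importance measure, and the names `MASK`, `gap_ratio`, and `make_gap_example` are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

MASK = "<mask_1>"  # hypothetical placeholder token for a removed sentence


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


def overlap_f1(candidate, reference):
    # Unigram-overlap F1, used here as an assumed proxy for sentence importance.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def make_gap_example(document, gap_ratio=0.3):
    """Return (source, target): the masked document and the removed sentences."""
    sents = split_sentences(document)
    n_gaps = max(1, round(len(sents) * gap_ratio))
    scores = []
    for i, sent in enumerate(sents):
        # Score each sentence against the rest of the document.
        rest = [t for j, other in enumerate(sents) if j != i for t in tokens(other)]
        scores.append(overlap_f1(tokens(sent), rest))
    ranked = sorted(range(len(sents)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_gaps])  # keep document order in the target
    source = " ".join(MASK if i in chosen else s for i, s in enumerate(sents))
    target = " ".join(sents[i] for i in chosen)
    return source, target
```

Calling `make_gap_example(doc)` yields a source string in which the selected sentences are replaced by the mask token and a target string containing those sentences; a sequence-to-sequence model would then be trained to map the former to the latter.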
