Peter J. Liu

I joined the Google Brain team in 2015. Earlier at Google, I developed the first deep learning models for Gmail spam detection, YouTube comments, and the Perspective API. I first got into machine learning as an undergraduate doing research with Radford Neal and Brendan Frey (both of whom also advised me in graduate school) and as a member of the University of Toronto machine learning group. Find my personal site at peterjliu.com.

Authored Publications
    Previous development of abstractive summarization was constrained by the need for large-scale, high-quality supervised summarization datasets. Recent work on the Transformer model and pretraining techniques has shown great success in various NLP tasks, including text summarization. However, none of that work has explored pretraining techniques tailored specifically for abstractive text summarization; furthermore, there is a lack of systematic evaluation of abstractive summarization across broad domains. In this work, we propose Pretraining using Extracted Gap-sentences for Abstractive SUmmarization by Sequence-to-sequence models (PEGASUS): we propose extractive strategies to select and mask principal sentences, and the sequence-to-sequence model is pretrained to generate the masked sentences. We evaluate PEGASUS on 12 downstream summarization datasets spanning news, science, technology, medical, social networking, instructions, corporate emails, and legal domains. Experiments demonstrate that PEGASUS achieves state-of-the-art performance on all 12 downstream summarization datasets as measured by ROUGE scores. PEGASUS also shows surprising capability in low-resource settings, achieving SOTA or near-SOTA results on x out of 12 tasks using only 100 fine-tuning examples.
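The gap-sentence objective can be illustrated with a minimal sketch. It assumes the document is already split into sentences. Each sentence is scored by unigram-overlap F1 against the rest of the document, used here as a rough stand-in for ROUGE. The top-scoring sentences are then replaced with a placeholder token and become the generation target. The names `unigram_f1` and `gap_sentence_example` and the `<mask_1>` token are hypothetical, not the PEGASUS code.

```python
import re
from collections import Counter

MASK_TOKEN = "<mask_1>"  # hypothetical sentence-mask placeholder


def tokenize(text):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists (a rough ROUGE-1 stand-in)."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(sentences, num_gaps=1):
    """Mask the sentences most similar to the rest of the document.

    Returns (input_text, target_text). The input keeps unselected sentences and
    replaces selected ones with MASK_TOKEN. The target is the selected sentences
    in document order.
    """
    tokens = [tokenize(s) for s in sentences]
    scores = []
    for i, sent in enumerate(tokens):
        rest = [tok for j, other in enumerate(tokens) if j != i for tok in other]
        scores.append(unigram_f1(sent, rest))
    # Score every sentence independently and keep the top `num_gaps`.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:num_gaps])
    source = [MASK_TOKEN if i in selected else s for i, s in enumerate(sentences)]
    target = [s for i, s in enumerate(sentences) if i in selected]
    return " ".join(source), " ".join(target)
```

For example, `gap_sentence_example(["The storm hit the coast.", "The storm flooded homes.", "Homes on the coast were damaged."])` masks the first sentence, because it shares the most words with the other two.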
    Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may be not only erroneous but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challenging application is bacteria identification based on genomic sequences, which holds the promise of early detection of diseases. However, it requires a model that can output low-confidence predictions on OOD genomic sequences from new bacteria that were not present in the training data. We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. We investigate OOD detection approaches based on deep generative models and observe that the likelihood score is heavily affected by population-level background statistics. We propose a likelihood ratio method for deep generative models that effectively corrects for these confounding background statistics. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. We demonstrate the generality of the proposed method by showing that it significantly improves OOD detection when applied to deep generative models of images.
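The likelihood-ratio score can be written as LLR(x) = log p_θ(x) − log p_θ0(x). Here p_θ is trained on in-distribution data and p_θ0 is a background model trained on randomly perturbed copies of the same data, so that shared background statistics cancel. The sketch below replaces the deep generative models with tiny first-order Markov models over nucleotides so that it runs with only the standard library. The model choice, the mutation `rate`, and all function names are assumptions for illustration.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"


def fit_markov(seqs, alpha=1.0):
    """First-order Markov model P(next | prev) with add-alpha smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in ALPHABET:
        total = sum(counts[prev].values()) + alpha * len(ALPHABET)
        model[prev] = {n: (counts[prev][n] + alpha) / total for n in ALPHABET}
    return model


def log_likelihood(model, seq):
    # The first symbol is scored as uniform; transitions use the fitted model.
    ll = math.log(1.0 / len(ALPHABET))
    for prev, nxt in zip(seq, seq[1:]):
        ll += math.log(model[prev][nxt])
    return ll


def perturb(seq, rate, rng):
    """Replace each position with a random symbol with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else c for c in seq)


def likelihood_ratio(full, background, seq):
    # Higher scores suggest in-distribution inputs.
    return log_likelihood(full, seq) - log_likelihood(background, seq)


rng = random.Random(0)
train = ["ACGT" * 25 for _ in range(20)]
full = fit_markov(train)
background = fit_markov([perturb(s, 0.3, rng) for s in train])
```

With these toy models, `likelihood_ratio(full, background, "ACGTACGTACGT")` is positive, while a sequence with unfamiliar transitions such as `"AAGGTTCCAAGG"` scores negative.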
    We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on a subset of ROCStories for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries, and we describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training and evaluation code, and best model weights are open-sourced.
    Assessing The Factual Accuracy of Text Generation
    Ben Goodrich
    Vinay Rao
    The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'19) (2019) (to appear)
    We propose an automatic metric to reflect the factual accuracy of generated text as an alternative to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We consider models that can extract fact triplets from text and then use them to define a metric that compares triplets extracted from generated summaries and reference texts. We show that this metric correlates with human evaluation of factual accuracy better than ROUGE does. To build these models, we introduce a new Wikidata-based dataset for fact extraction. We show that a transformer-based attention model can learn to predict structured fact triplets and performs favorably compared to more traditional two-stage approaches (entity recognition and relationship classification).
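Once triplets have been extracted from both texts, the comparison step can be as simple as a set overlap. The sketch below assumes a precision-style score: the fraction of (subject, relation, object) triplets from the generated text that also appear in the reference. The exact formula used in the paper may differ, for example in which triplets are eligible for comparison. The extraction model itself is not shown, and the helper names are hypothetical.

```python
from typing import Iterable, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase and strip each field so trivial formatting differences don't count."""
    subject, relation, obj = (part.strip().lower() for part in triplet)
    return (subject, relation, obj)


def fact_precision(generated: Iterable[Triplet], reference: Iterable[Triplet]) -> float:
    """Fraction of generated triplets that also appear among the reference triplets."""
    gen = {normalize(t) for t in generated}
    ref = {normalize(t) for t in reference}
    if not gen:
        return 0.0  # assumption: a text with no extracted facts scores zero
    return len(gen & ref) / len(gen)
```

For example, if the reference yields `("Marie Curie", "born in", "Warsaw")` and the generated summary yields that triplet plus an incorrect `("Marie Curie", "born in", "Paris")`, the score is 0.5.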
    Abstractive summarization has been studied using neural sequence transduction methods trained on large datasets of paired document-summary examples. However, such datasets are rare, and the models trained from them do not generalize to other domains. Recently, some progress has been made in learning sequence-to-sequence mappings with only unpaired examples. In our work, we consider the setting where only documents are provided, with no summaries, and propose an end-to-end neural model architecture to perform unsupervised abstractive summarization. Our proposed model consists of an auto-encoder trained so that the mean of the representations of the input documents decodes to a reasonable summary. We consider variants of the proposed architecture and perform an ablation study to show the importance of specific components. We apply our model to the summarization of business and product reviews. Compared to baselines, the generated summaries are fluent, are relevant in terms of word overlap, and are representative of the average sentiment of the input documents.
    Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a lower-resource downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning for NLP by introducing a unified framework which casts every language problem as a text-to-text task. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of text understanding tasks. By combining the insights gained in our exploration with scale and a new giant unlabeled text dataset, we achieve state-of-the-art results in most of the tasks we consider. To facilitate future work on text understanding, we release our dataset, pre-trained models, and code.
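Casting every problem as text-to-text means rendering each task's input and label as plain strings, with a task prefix telling the model what to do. A minimal sketch, assuming prefixes in the style the T5 paper describes and rounding a regression score (STS-B) to the nearest 0.2 so it can be emitted as text. The task keys, the `example` field names, and the helper name are hypothetical.

```python
def to_text_to_text(task, example):
    """Cast a labeled example into an (input_text, target_text) pair."""
    if task == "translate_en_de":
        return f"translate English to German: {example['en']}", example["de"]
    if task == "summarize":
        return f"summarize: {example['document']}", example["summary"]
    if task == "cola":
        # Classification labels become words the model must generate.
        label = "acceptable" if example["label"] == 1 else "unacceptable"
        return f"cola sentence: {example['sentence']}", label
    if task == "stsb":
        # Regression target rounded to the nearest 0.2 and rendered as text.
        score = round(example["score"] * 5) / 5
        source = (f"stsb sentence1: {example['sentence1']} "
                  f"sentence2: {example['sentence2']}")
        return source, f"{score:.1f}"
    raise ValueError(f"unknown task: {task}")
```

Because every target is a string, one model, loss, and decoding procedure can serve classification, regression, translation, and summarization alike.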
    Clinicians spend a significant amount of time inputting free-form textual notes into Electronic Health Records (EHR) systems. Much of this documentation work is seen as a burden, reducing time spent with patients and contributing to clinician burnout. With the aspiration of AI-assisted note-writing, we propose a new language modeling task predicting the content of notes conditioned on past data from a patient's medical record, including patient demographics, labs, medications, and past notes. We train generative models using the public, de-identified MIMIC-III dataset and compare generated notes with those in the dataset on multiple measures. We find that much of the content can be predicted, and that many common templates found in notes can be learned. We discuss how such models can be useful in supporting assistive note-writing features such as error-detection and auto-complete.
    Scalable and accurate deep learning for electronic health records
    Alvin Rishi Rajkomar
    Eyal Oren
    Nissan Hajaj
    Mila Hardt
    Xiaobing Liu
    Jake Marcus
    Patrik Per Sundberg
    Kun Zhang
    Yi Zhang
    Gerardo Flores
    Gavin Duggan
    Jamie Irvine
    Kurt Litsch
    Alex Mossin
    Justin Jesada Tansuwan
    De Wang
    Dana Ludwig
    Samuel Volchenboum
    Kat Chou
    Michael Pearson
    Srinivasan Madabushi
    Nigam Shah
    Atul Butte
    npj Digital Medicine (2018)
    Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient’s chart.
    We show that generating English Wikipedia articles can be approached as multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores, and human evaluations.
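One simple way to realize a coarse extractive stage is to rank source paragraphs by tf-idf similarity to the article title and keep the top few as input to the abstractive model. The sketch below assumes that ranking signal. The function names, the smoothed-idf formula, and `top_k` are illustrative choices, not a description of the paper's exact pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def rank_paragraphs(title, paragraphs, top_k=2):
    """Return the `top_k` paragraphs with the highest tf-idf cosine similarity to the title."""
    docs = [tokenize(p) for p in paragraphs]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf so terms appearing in every paragraph still get a small weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}

    def vector(tokens):
        tf = Counter(tokens)
        # Query terms never seen in the paragraphs get weight 0.
        return {t: c * idf.get(t, 0.0) for t, c in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = vector(tokenize(title))
    scores = [cosine(query, vector(doc)) for doc in docs]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [paragraphs[i] for i in order[:top_k]]
```

The selected paragraphs would then be concatenated, truncated to the model's input budget, and passed to the abstractive model.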
    The driving force behind the recent success of LSTMs has been their ability to learn complex and non-linear relationships. Consequently, our inability to describe these relationships has led to LSTMs being characterized as black boxes. To this end, we introduce contextual decomposition (CD), a novel algorithm for capturing the contributions of combinations of words or variables in terms of CD scores. On the task of sentiment analysis with the Yelp and SST data sets, we show that CD is able to reliably identify words and phrases of contrasting sentiment, and how they are combined to yield the LSTM’s final prediction. Using the phrase-level labels in SST, we also demonstrate that CD is able to successfully extract positive and negative negations from an LSTM, something which has not previously been done.
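Decomposing an LSTM's output requires splitting a nonlinearity applied to a sum of terms (for example, a relevant part, an irrelevant part, and a bias) into per-term contributions. The sketch below shows a Shapley-style linearization of that kind, in which each term's marginal effect is averaged over every order of adding the terms. It is meant as an illustration of the idea only. The full CD algorithm also propagates relevant and irrelevant parts through the gates and cell state, which is not attempted here. The function name is hypothetical.

```python
import itertools
import math


def shapley_linearize(f, components):
    """Split f(sum(components)) - f(0) into one additive share per component.

    Each share is the component's marginal contribution to f, averaged over
    every order in which the components can be added.
    """
    n = len(components)
    shares = [0.0] * n
    perms = list(itertools.permutations(range(n)))
    for order in perms:
        running = 0.0
        for k in order:
            before = f(running)
            running += components[k]
            shares[k] += f(running) - before
    return [s / len(perms) for s in shares]


beta, gamma, bias = 1.0, -0.5, 0.2  # toy relevant, irrelevant, and bias terms
shares = shapley_linearize(math.tanh, [beta, gamma, bias])
```

Within each ordering the marginal contributions telescope, so the shares always sum exactly to `tanh(beta + gamma + bias) - tanh(0)`. For a linear `f`, each share reduces to that term's own contribution.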
    Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems. However, because soft attention mechanisms perform a pass over the entire input sequence when producing each element of the output sequence, they cannot be used in online settings and incur quadratic time complexity. Based on the insight that the alignment between input and output sequence elements is monotonic in many problems of interest, we propose an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. We validate our approach on sentence summarization, machine translation, and online speech recognition problems and achieve results competitive with existing sequence-to-sequence models.
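At test time, a monotonic alignment can be decoded greedily in one left-to-right pass, which is where the linear-time, online property comes from. A minimal sketch, assuming a caller-supplied `energy(i, j)` scoring function (hypothetical). For each output step, it attends to the first input position, at or after the previous one, whose selection probability reaches 0.5. Training instead uses an expected (soft) alignment so that the model stays differentiable; that part is not shown.

```python
import math
from typing import Callable, List, Optional


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hard_monotonic_attention(
    energy: Callable[[int, int], float],
    num_outputs: int,
    input_len: int,
    threshold: float = 0.5,
) -> List[Optional[int]]:
    """Greedy test-time monotonic alignment.

    For output step i, scan input positions starting where the previous step
    stopped and attend to the first j with sigmoid(energy(i, j)) >= threshold.
    The pointer never moves backward, so `energy` is called at most
    num_outputs + input_len times.
    """
    alignments: List[Optional[int]] = []
    j = 0
    for i in range(num_outputs):
        chosen = None
        while j < input_len:
            if sigmoid(energy(i, j)) >= threshold:
                chosen = j  # keep j here so the next step may re-attend to it
                break
            j += 1
        # Once the scan reaches the end, this and all later steps attend to nothing.
        alignments.append(chosen)
    return alignments
```

Because each output step only needs the input up to its chosen position, decoding can start before the whole input has arrived.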
    Get To The Point: Summarization with Pointer-Generator Networks
    Abigail See
    Christopher Manning
    Association for Computational Linguistics (2017)
    Neural sequence-to-sequence models have provided a new viable approach to abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attention model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art ROUGE scores with statistical significance.
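Both mechanisms can be written compactly. The copy/generate mixture is P(w) = p_gen·P_vocab(w) + (1 − p_gen)·Σ_{i: w_i = w} a_i. The coverage vector is the running sum of past attention distributions, and its overlap with the current attention is penalized as Σ_i min(a_i, c_i). A minimal sketch over plain Python lists and dicts, assuming `p_gen`, the vocabulary distribution, and the attention weights come from the model. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def final_distribution(
    p_gen: float,
    vocab_dist: Dict[str, float],
    attention: List[float],
    source_tokens: List[str],
) -> Dict[str, float]:
    """Mix generating from the vocabulary with copying from the source.

    Source words missing from the vocabulary can still receive probability
    through the copy term.
    """
    dist: Dict[str, float] = defaultdict(float)
    for word, prob in vocab_dist.items():
        dist[word] += p_gen * prob
    for weight, word in zip(attention, source_tokens):
        dist[word] += (1.0 - p_gen) * weight
    return dict(dist)


def update_coverage(coverage: List[float], attention: List[float]) -> List[float]:
    """Coverage after this step: the sum of all attention distributions so far."""
    return [c + a for c, a in zip(coverage, attention)]


def coverage_loss(attention: List[float], coverage: List[float]) -> float:
    """Penalize attending again to already-covered positions: sum_i min(a_i, c_i)."""
    return sum(min(a, c) for a, c in zip(attention, coverage))
```

For example, with `p_gen = 0.8`, an out-of-vocabulary source word that receives attention weight 0.6 ends up with probability 0.2 × 0.6 = 0.12 in the final distribution.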
    Sequence-to-sequence models are successful tools for supervised sequence learning tasks, such as machine translation. Despite their success, these models still require large amounts of labeled data, and it is unclear how to improve them using unlabeled data, which is much less expensive to obtain. In this paper, we present simple changes that lead to a significant improvement in the accuracy of seq2seq models when the labeled set is small. Our method initializes the encoder and decoder of the seq2seq model with the trained weights of two language models, and then all weights are jointly fine-tuned with labeled data. An additional language modeling loss can be used to regularize the model during fine-tuning. We apply this method to low-resource tasks in machine translation and abstractive summarization and find that it significantly improves the subsequent supervised models. Our main finding is that pretraining accelerates training and improves generalization of seq2seq models, achieving state-of-the-art results on the WMT English→German task. Our model obtains an improvement of 1.3 BLEU over the previous best models on both WMT'14 and WMT'15 English→German. Our ablation study shows that pretraining helps seq2seq models in different ways depending on the nature of the task: translation benefits from the improved generalization, whereas summarization benefits from the improved optimization.