Slav Petrov

Slav Petrov is Vice President, Research at Google DeepMind. He is a co-lead on Gemini, Google's large models effort. His work has been recognized with multiple Best Paper Awards (ACL'11, NAACL'12, ACL'16, 10-year Test-of-Time Award at ACL'23) and provides better language capabilities to billions of users in a variety of Google products spanning Web Search, Assistant, Ads, Translate & Cloud. Slav is the recipient of the 2014 John Atanasoff Award by the President of Bulgaria and a World Champion at RoboCup 2004. For many years, Slav taught Statistical Natural Language Processing at New York University. He holds a PhD from the University of California at Berkeley.

Slav has spent roughly equal parts of his life in Bulgaria, Germany and the US. Whenever Bulgaria plays Germany in soccer, he supports Bulgaria.

See also his personal webpage for more information (including presentation slides).
Authored Publications
    Abstract: With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models, when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline for allowing annotators to appropriately evaluate model output according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies that suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.
    PaLM: Scaling Language Modeling with Pathways
    Aakanksha Chowdhery
    Sharan Narang
    Jacob Devlin
    Maarten Bosma
    Hyung Won Chung
    Sebastian Gehrmann
    Parker Schuh
    Sasha Tsvyashchenko
    Abhishek Rao
    Yi Tay
    Noam Shazeer
    Nan Du
    Reiner Pope
    James Bradbury
    Guy Gur-Ari
    Toju Duke
    Henryk Michalewski
    Xavier Garcia
    Liam Fedus
    David Luan
    Barret Zoph
    Ryan Sepassi
    David Dohan
    Shivani Agrawal
    Mark Omernick
    Marie Pellat
    Aitor Lewkowycz
    Erica Moreira
    Rewon Child
    Oleksandr Polozov
    Zongwei Zhou
    Brennan Saeta
    Michele Catasta
    Jason Wei
    Kathy Meier-Hellstern
    arxiv:2204.02311 (2022)
    Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
    Abstract: Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
    Abstract: Large pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode correlations undesired in many applications, like "surgeon" being associated more with "he" than "she". We explore such gendered correlations as a case study, to learn how we can configure and train models to mitigate the risk of encoding unintended associations. We find that it is important to define correlation metrics, since they can reveal differences among models with similar accuracy. Large models have more capacity to encode gendered correlations, but this can be mitigated with general dropout regularization. Counterfactual data augmentation is also effective, and can even reduce correlations not explicitly targeted for mitigation, potentially making it useful beyond gender too. Both techniques yield models with comparable accuracy to unmitigated analogues, and still resist re-learning correlations in fine-tuning.
    Natural Questions: a Benchmark for Question Answering Research
    Olivia Redfield
    Danielle Epstein
    Illia Polosukhin
    Matthew Kelcey
    Jacob Devlin
    Llion Jones
    Ming-Wei Chang
    Jakob Uszkoreit
    Transactions of the Association of Computational Linguistics (2019) (to appear)
    Abstract: We present the Natural Questions corpus, a question answering dataset. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 examples 5-way annotated sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
    Universal Semantic Parsing
    Siva Reddy
    Oscar Tackstrom
    Mark Steedman
    Mirella Lapata
    Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)
    Natural Language Processing with Small Feed-Forward Networks
    Jan A. Botha
    Emily Pitler
    Anton Bakalov
    Alex Salcianu
    Ryan McDonald
    Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2879–2885
    Abstract: We show that small and shallow feedforward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different tradeoffs when deciding how to allocate a small memory budget.
    CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
    Daniel Zeman
    Martin Popel
    Milan Straka
    Jan Hajic
    Joakim Nivre
    Filip Ginter
    Juhani Luotolahti
    Sampo Pyysalo
    Martin Potthast
    Francis Tyers
    Elena Badmaeva
    Memduh Gokirmak
    Anna Nedoluzhko
    Silvie Cinkova
    Jan Hajic jr.
    Jaroslava Hlavacova
    Václava Kettnerová
    Zdenka Uresova
    Jenna Kanerva
    Stina Ojala
    Anna Missilä
    Christopher D. Manning
    Sebastian Schuster
    Siva Reddy
    Dima Taji
    Nizar Habash
    Herman Leung
    Marie-Catherine de Marneffe
    Manuela Sanguinetti
    Maria Simi
    Hiroshi Kanayama
    Valeria de Paiva
    Kira Droganova
    Héctor Martínez Alonso
    Çagrı Çöltekin
    Umut Sulubacak
    Hans Uszkoreit
    Vivien Macketanz
    Aljoscha Burchardt
    Kim Harris
    Katrin Marheinecke
    Georg Rehm
    Tolga Kayadelen
    Ali Elkahky
    Zhuoran Yu
    Emily Pitler
    Saran Lertpradit
    Michael Mandl
    Jesse Kirchner
    Hector Fernandez Alcalde
    Esha Banerjee
    Antonio Stella
    Atsuko Shimada
    Sookyoung Kwak
    Gustavo Mendonca
    Tatiana Lando
    Rattima Nitisaroj
    Josie Li
    Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
    Abstract: The aim of this document is to provide a list of dependency tags that are to be used for the Arabic dependency annotation task, with examples provided for each tag. The dependency representation is a simple description of the grammatical relationships in a sentence. It represents all sentence relations uniformly typed as dependency relations. The dependencies are all binary relations between a governor (also known as the head) and a dependant (any complement of or modifier to the head).
    Universal Dependencies v1: A Multilingual Treebank Collection
    Joakim Nivre
    Marie-Catherine de Marneffe
    Filip Ginter
    Yoav Goldberg
    Jan Hajic
    Christopher D. Manning
    Ryan McDonald
    Sampo Pyysalo
    Natalia Silveira
    Reut Tsarfaty
    Daniel Zeman
    Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)