Slav Petrov
Slav Petrov is Vice President, Research at Google DeepMind. He is a co-lead on Gemini, Google's large models effort. His work has been recognized with multiple Best Paper Awards (ACL'11, NAACL'12, ACL'16, 10-year Test-of-Time Award at ACL'23) and provides better language capabilities to billions of users in a variety of Google products spanning Web Search, Assistant, Ads, Translate & Cloud. Slav is the recipient of the 2014 John Atanasoff Award by the President of Bulgaria and a World Champion at RoboCup 2004. For many years, Slav taught Statistical Natural Language Processing at New York University. He holds a PhD from the University of California at Berkeley.
Slav has spent roughly equal parts of his life in Bulgaria, Germany and the US. Whenever Bulgaria plays Germany in soccer, he supports Bulgaria.
See also his personal webpage for more information (including presentation slides).
Authored Publications
Measuring Attribution in Natural Language Generation Models
Iulia Turc
Computational Linguistics, 49 (2023), pp. 777-840
With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models, when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline for allowing annotators to appropriately evaluate model output according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies that suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Hyung Won Chung
Sebastian Gehrmann
Parker Schuh
Sasha Tsvyashchenko
Abhishek Rao
Yi Tay
Noam Shazeer
Nan Du
Reiner Pope
James Bradbury
Guy Gur-Ari
Toju Duke
Henryk Michalewski
Xavier Garcia
Liam Fedus
David Luan
Barret Zoph
Ryan Sepassi
David Dohan
Shivani Agrawal
Mark Omernick
Marie Pellat
Aitor Lewkowycz
Erica Moreira
Rewon Child
Oleksandr Polozov
Zongwei Zhou
Brennan Saeta
Michele Catasta
Jason Wei
Kathy Meier-Hellstern
arxiv:2204.02311 (2022)
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Pat Verga
Jianmo Ni
arXiv (2022)
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
Measuring and Reducing Gendered Correlations in Pre-trained Models
Alex Beutel
Emily Pitler
arXiv (2020)
Large pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode correlations undesired in many applications, like "surgeon" being associated more with "he" than "she". We explore such gendered correlations as a case study, to learn how we can configure and train models to mitigate the risk of encoding unintended associations. We find that it is important to define correlation metrics, since they can reveal differences among models with similar accuracy. Large models have more capacity to encode gendered correlations, but this can be mitigated with general dropout regularization. Counterfactual data augmentation is also effective, and can even reduce correlations not explicitly targeted for mitigation, potentially making it useful beyond gender too. Both techniques yield models with comparable accuracy to unmitigated analogues, and still resist re-learning correlations in fine-tuning.
Natural Questions: a Benchmark for Question Answering Research
Olivia Redfield
Danielle Epstein
Illia Polosukhin
Matthew Kelcey
Jacob Devlin
Llion Jones
Ming-Wei Chang
Jakob Uszkoreit
Transactions of the Association for Computational Linguistics (2019) (to appear)
We present the Natural Questions corpus, a question answering dataset. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 examples 5-way annotated sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
Universal Semantic Parsing
Siva Reddy
Oscar Tackstrom
Mark Steedman
Mirella Lapata
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Natural Language Processing with Small Feed-Forward Networks
Jan A. Botha
Emily Pitler
Anton Bakalov
Alex Salcianu
Ryan McDonald
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2879–2885
We show that small and shallow feedforward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different tradeoffs when deciding how to allocate a small memory budget.
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman
Martin Popel
Milan Straka
Jan Hajic
Joakim Nivre
Filip Ginter
Juhani Luotolahti
Sampo Pyysalo
Martin Potthast
Francis Tyers
Elena Badmaeva
Memduh Gokirmak
Anna Nedoluzhko
Silvie Cinkova
Jan Hajic jr.
Jaroslava Hlavacova
Václava Kettnerová
Zdenka Uresova
Jenna Kanerva
Stina Ojala
Anna Missilä
Christopher D. Manning
Sebastian Schuster
Siva Reddy
Dima Taji
Nizar Habash
Herman Leung
Marie-Catherine de Marneffe
Manuela Sanguinetti
Maria Simi
Hiroshi Kanayama
Valeria de Paiva
Kira Droganova
Héctor Martínez Alonso
Çağrı Çöltekin
Umut Sulubacak
Hans Uszkoreit
Vivien Macketanz
Aljoscha Burchardt
Kim Harris
Katrin Marheinecke
Georg Rehm
Tolga Kayadelen
Ali Elkahky
Zhuoran Yu
Emily Pitler
Saran Lertpradit
Michael Mandl
Jesse Kirchner
Hector Fernandez Alcalde
Esha Banerjee
Antonio Stella
Atsuko Shimada
Sookyoung Kwak
Gustavo Mendonca
Tatiana Lando
Rattima Nitisaroj
Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
The aim of this document is to provide a list of dependency tags that are to be used for the Arabic dependency annotation task, with examples provided for each tag. The dependency representation is a simple description of the grammatical relationships in a sentence. It represents all sentence relations uniformly typed as dependency relations. The dependencies are all binary relations between a governor (also known as the head) and a dependent (any complement of or modifier to the head).
Universal Dependencies v1: A Multilingual Treebank Collection
Joakim Nivre
Marie-Catherine de Marneffe
Filip Ginter
Yoav Goldberg
Jan Hajic
Christopher D. Manning
Ryan McDonald
Sampo Pyysalo
Natalia Silveira
Reut Tsarfaty
Daniel Zeman
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)