We advance the state of the art in natural language technologies and build systems that learn to understand and generate language in context.

street view


street view

About the team

Our team comprises multiple research groups working on a wide range of natural language understanding and generation projects. We pursue long-term research to develop novel capabilities that can address the needs of current and future Google products. We publish frequently and evaluate our methods on established scientific benchmarks (e.g., SQuAD, GLUE, SuperGlue) or develop new ones for measuring progress (e.g., Conceptual Captions, Natural Questions, TyDiQA). We collaborate with other teams across Google to deploy our research to the benefit of our users. Our product contributions often stretch the boundaries of what is technically possible. Applications of our research have resulted in better language capabilities across all major Google products.

Our researchers are experts in natural language processing and machine learning with varied backgrounds and a passion for language. Computer scientists and linguists work hand-in-hand to provide insight into ways to define language tasks, collect valuable data, and assist in enabling internationalization. Researchers and engineers work together to develop new neural network models that are sensitive to the nuances of language while taking advantage of the latest advances in specialized compute hardware (e.g., TPUs) to produce scalable solutions that can be used by billions of users.

Team focus summaries

Language representations

Learn contextual language representations that capture meaning at various levels of granularity and are transferable across tasks.

Question answering

Learn end-to-end models for real world question answering that requires complex reasoning about concepts, entities, relations, and causality in the world.

Document understanding

Learn document representations from geometric features and spatial relations, multi-modal content features, syntactic, semantic and pragmatic signals.


Advance next generation dialogue systems in human-machine and multi-human-machine interactions to achieve natural user interactions and enrich conversations between human users.


Produce natural and fluent output for spoken and written text for different domains and styles.


Learning high-quality models that scale to all languages and locales and are robust to multilingual inputs, transliterations, and regional variants.

Language & vision

Understand visual inputs (image & video) and express that understanding using fluent natural language (phrases, sentences, paragraphs).


Use state-of-the-art machine learning techniques and large-scale infrastructure to break language barriers and offer human quality translations across many languages to make it possible to easily explore the multilingual world.


Learn to summarize single and multiple documents into cohesive and concise summaries that accurately represent the documents.


Learn end-to-end models that classify the semantics of text, such as topic, sentiment or sensitive content (i.e., offensive, inappropriate, or controversial content).

Speech and language algorithms

Represent, combine, and optimize models for speech to text and text to speech.

Entities, relations, and reasoning

Learn models that infer entities (people, places, things) from text and that can perform reasoning based on their relationships.

Grounded language understanding

Use and learn representations that span language and other modalities, such as vision, space and time, and adapt and use them for problems requiring language-conditioned action in real or simulated environments (i.e., vision-and-language navigation).

Semantic parsing

Learn models for predicting executable logical forms given text in varying domains and languages, situated within diverse task contexts.

Sentiment analysis

Learn models that can detect sentiment attribution and changes in narrative, conversation, and other text or spoken scenarios.


Learn models of language that are predictable and understandable, perform well across the broadest possible range of linguistic settings and applications, and adhere to our principles of responsible practices in AI.

Featured publications

Preview abstract We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement). View details
Natural Questions: a Benchmark for Question Answering Research
Olivia Redfield
Danielle Epstein
Illia Polosukhin
Matthew Kelcey
Jacob Devlin
Llion Jones
Ming-Wei Chang
Jakob Uszkoreit
Slav Petrov
Transactions of the Association of Computational Linguistics(2019) (to appear)
Preview abstract We present the Natural Questions corpus, a question answering dataset. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 examples 5-way annotated sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature. View details
Preview abstract Pre-trained sentence encoders such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have rapidly advanced the state-of-theart on many NLP tasks, and have been shown to encode contextual information that can resolve many aspects of language structure. We extend the edge probing suite of Tenney et al. (2019) to explore the computation performed at each layer of the BERT model, and find that tasks derived from the traditional NLP pipeline appear in a natural progression: part-of-speech tags are processed earliest, followed by constituents, dependencies, semantic roles, and coreference. We trace individual examples through the encoder and find that while this order holds on average, the encoder occasionally inverts the order, revising low-level decisions after deciding higher-level contextual relations. View details
Preview abstract We present a new dataset of image caption annotations, CHIA, which contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both image and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of Internet webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 CNN for image-feature extraction and Transformer for sequence modeling achieves best performance when trained on the CHIA dataset. We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 for image-feature extraction and Transformer for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset. View details
Preview abstract We frame Question Answering (QA) as a Reinforcement Learning task, an approach that we call Active Question Answering. We propose an agent that sits between the user and a black box QA system and learns to reformulate questions to elicit the best possible answers. The agent probes the system with, potentially many, natural language reformulations of an initial question and aggregates the returned evidence to yield the best answer. The reformulation system is trained end-to-end to maximize answer quality using policy gradient. We evaluate on SearchQA, a dataset of complex questions extracted from Jeopardy!. The agent outperforms a state-of-the-art base model, playing the role of the environment, and other benchmarks. We also analyze the language that the agent has learned while interacting with the question answering system. We find that successful question reformulations look quite different from natural language paraphrases. The agent is able to discover non-trivial reformulation strategies that resemble classic information retrieval techniques such as term re-weighting (tf-idf) and stemming. View details
Massively Multilingual Neural Machine Translation
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp. 3874-3884 (to appear)
Preview abstract Multilingual Neural Machine Translation enables training a single model that supports translation from multiple source languages into multiple target languages. We perform extensive experiments in training massively multilingual NMT models, involving up to 103 distinct languages and 204 translation directions simultaneously. We explore different setups for training such models and analyze the trade-offs between translation quality and various modeling decisions. We report results on the publicly available TED talks multilingual corpus where we show that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages in 116 translation directions in a single model. Our experiments on a large-scale dataset with 103 languages, 204 trained directions and one million examples per direction also show promising results, surpassing strong bilingual baselines and encouraging future work on massively multilingual NMT. View details
Preview abstract Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns a longstanding challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To address this, we present and release GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun–name pairs sampled to provide diverse coverage of challenges posed by real-world text. We explore a range of baselines that demonstrate the complexity of the challenge, the best achieving just 66.9% F1. We show that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge. View details
Matching the Blanks: Distributional Similarity for Relation Learning
Jeffrey Ling
ACL 2019 - The 57th Annual Meeting of the Association for Computational Linguistics(2019) (to appear)
Preview abstract General purpose relation extractors, which can model arbitrary relations, are a core aspiration in information extraction. Efforts have been made to build general purpose extractors that represent relations with their surface forms, or which jointly embed surface forms with relations from an existing knowledge graph. How ever, both of these approaches are limited in their ability to generalize. In this paper, we build on extensions of Harris’ distributional hypothesis to relations, as well as recent advances in learning text representations (specifically, BERT), to build task agnostic relation representations solely from entity-linked text. We show that these representations significantly outperform previous work on exemplar based relation extraction (FewRel) even without using any of that task’s training data. We also show that models initialized with our task agnostic representations, and then tuned on supervised relation extraction datasets, significantly outperform the previous methods on SemEval 2010 Task 8, KBP37, and TACRED. View details
Counterfactual Fairness in Text Classification through Robustness
Sahaj Garg
Nicole Limtiaco
Ankur Taly
Alex Beutel
AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES)(2019)
Preview abstract In this paper, we study counterfactual fairness in text classification, which asks the question: How would the prediction change if the sensitive attribute referenced in the example were different? Toxicity classifiers demonstrate a counterfactual fairness issue by predicting that "Some people are gay'' is toxic while "Some people are straight'' is nontoxic. We offer a metric, counterfactual token fairness (CTF), for measuring this particular form of fairness in text classifiers, and describe its relationship with group fairness. Further, we offer three approaches, blindness, counterfactual augmentation, and counterfactual logit pairing (CLP), for optimizing counterfactual token fairness during training, bridging the robustness and fairness literature. Empirically, we find that blindness and CLP address counterfactual token fairness. The methods do not harm classifier performance, and have varying tradeoffs with group fairness. These approaches, both for measurement and optimization, provide a new path forward for addressing fairness concerns in text classification. View details
Monotonic Infinite Lookback Attention for Simultaneous Machine Translation
Naveen Ari
Colin Andrew Cherry
Chung-Cheng Chiu
Semih Yavuz
Ruoming Pang
Wei Li
Colin Raffel
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, Florence, Italy(2019), pp. 1313-1323
Preview abstract Simultaneous machine translation begins to translate each source sentence before the source speaker is finished speaking, with applications to live and streaming scenarios. Simultaneous systems must carefully schedule their reading of the source sentence to balance quality against latency. We present the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far. We do so by introducing Monotonic Infinite Lookback (MILk) attention, which maintains both a hard,monotonic attention head to schedule the read-ing of the source sentence, and a soft attention head that extends from the monotonic head back to the beginning of the source. We show that MILk’s adaptive schedule allows it to arrive at latency-quality trade-offs that are favorable to those of a recently proposed wait-k strategy for many latency values. View details

Highlighted work

Some of our people