Ankur Parikh

Ankur is a Research Scientist at Google NYC; his primary interests are natural language processing and machine learning. He received his PhD from Carnegie Mellon University in 2015 (advised by Prof. Eric Xing) and his B.S.E. from Princeton University in 2009. He received a best paper runner-up award at EMNLP 2014 and a best paper award in translational bioinformatics at ISMB 2011.
Authored Publications
    We introduce Seahorse (SummariEs Annotated with Human Ratings in Six languagEs), a dataset of 96K summaries with ratings along 6 dimensions (comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness). The summaries are generated from 8 different models, conditioned on source text from 4 datasets in 6 languages (German, English, Spanish, Russian, Turkish, and Vietnamese). We release the annotated summaries as a resource for developing better summarization models and automatic metrics. We present an analysis of the dataset's composition and quality, and we demonstrate the potential of this dataset for building better summarization metrics, showing that metrics finetuned with Seahorse data outperform baseline metrics.
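    As a minimal illustration of how such rated summaries can be organized and filtered for metric training, the sketch below uses hypothetical field names (RatedSummary, attribution_positives); it is not the released Seahorse schema.

        from dataclasses import dataclass

        # Illustrative record for a human-rated summary; the field names are
        # hypothetical, not the official Seahorse data format.
        @dataclass
        class RatedSummary:
            source: str     # source document text
            summary: str    # model-generated summary
            language: str   # e.g. "de", "en", "es", "ru", "tr", "vi"
            model: str      # which generation model produced the summary
            ratings: dict   # dimension name -> human rating (e.g. 0 or 1)

        def attribution_positives(data):
            """Keep summaries rated as fully attributable to their source."""
            return [ex for ex in data if ex.ratings.get("attribution") == 1]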
    Improving Compositional Generalization with Self-Training for Data-to-Text Generation
    Jinfeng Rao
    Yi Tay
    Mihir Sanjay Kale
    Emma Strubell
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland (2022), pp. 4205-4219
    Data-to-text generation focuses on generating fluent natural language responses from structured meaning representations (MRs). Such representations are compositional and it is costly to collect responses for all possible combinations of atomic meaning schemata, thereby necessitating few-shot generalization to novel MRs. In this work, we systematically study the compositional generalization of the state-of-the-art T5 models in few-shot data-to-text tasks. We show that T5 models fail to generalize to unseen MRs, and we propose a template-based input representation that considerably improves the model's generalization capability. To further improve the model's performance, we propose an approach based on self-training using fine-tuned BLEURT for pseudo-response selection. On the commonly used SGD and Weather benchmarks, the proposed self-training approach improves tree accuracy by 46%+ and reduces the slot error rates by 73%+ over the strong T5 baselines in few-shot settings.
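    A minimal sketch of the pseudo-response selection step, assuming a hypothetical generate() function for a fine-tuned generator and a hypothetical score() function for a fine-tuned BLEURT-style metric; this illustrates the idea rather than the paper's exact procedure.

        def self_training_round(unlabeled_mrs, generate, score,
                                threshold=0.9, num_candidates=8):
            """One round of self-training: generate candidate responses for
            unlabeled meaning representations (MRs) and keep only the
            best-scoring ones as pseudo-labeled pairs for further fine-tuning."""
            pseudo_pairs = []
            for mr in unlabeled_mrs:
                candidates = generate(mr, num_candidates)         # sample several responses
                scored = [(score(mr, c), c) for c in candidates]  # metric-based quality scores
                best_score, best = max(scored)
                if best_score >= threshold:                       # keep only confident pairs
                    pseudo_pairs.append((mr, best))
            return pseudo_pairs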
    We study the problem of model extraction in natural language processing, in which an adversary with query access to a victim model attempts to reconstruct a local copy of that model. We show that when both the adversary and the victim model fine-tune existing pretrained models such as BERT, the adversary does not need access to any training data to mount the attack. Indeed, we show that randomly sampled sequences of words, which do not satisfy grammar structures, make effective queries for extracting textual models. This is true even for complex tasks such as natural language inference or question answering. Our attacks can be mounted with a modest query budget of less than $400. The accuracy of the extracted model can be further improved using a large textual corpus like Wikipedia, or with intuitive heuristics we introduce. Finally, we measure the effectiveness of two potential defense strategies: membership classification and API watermarking. While these defenses mitigate certain adversaries and come at a low overhead because they do not require re-training of the victim model, fully coping with model extraction remains an open problem.
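    A rough sketch of the query side of such an attack, assuming black-box access through a hypothetical victim_predict() function; random word sequences stand in for real training data.

        import random

        def random_queries(vocab, num_queries, max_len=12):
            """Build nonsensical word-sequence queries by sampling words uniformly."""
            return [" ".join(random.choices(vocab, k=random.randint(3, max_len)))
                    for _ in range(num_queries)]

        def collect_extraction_data(vocab, victim_predict, budget=10000):
            """Label random queries with the victim's outputs; the resulting
            (query, prediction) pairs can train a local copy of the model."""
            queries = random_queries(vocab, budget)
            return [(q, victim_predict(q)) for q in queries]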
    Despite significant advances in text generation in recent years, evaluation metrics have lagged behind, with n-gram overlap metrics such as BLEU or ROUGE still remaining popular. In this work, we introduce BLEURT, a learned evaluation metric based on BERT that achieves state-of-the-art performance on recent years of the WMT Metrics Shared Task and the WebNLG challenge. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetically constructed examples to increase generalization. We show that, in contrast to a vanilla BERT fine-tuning approach, BLEURT yields superior results even in the presence of scarce, skewed, or out-of-domain training data.
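    Conceptually, a learned metric of this kind is a regression model over (reference, candidate) pairs. The sketch below illustrates that idea with the Hugging Face transformers library; it is an assumption-laden illustration, not the released BLEURT code or checkpoints, and the encoder shown would still need fine-tuning on human ratings.

        import torch
        from transformers import AutoTokenizer, AutoModelForSequenceClassification

        # Any BERT-style encoder with a single regression head can play this role;
        # in practice it would be fine-tuned on (reference, candidate, rating) data.
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-uncased", num_labels=1)  # one output = predicted quality score

        def score(reference: str, candidate: str) -> float:
            inputs = tokenizer(reference, candidate, return_tensors="pt", truncation=True)
            with torch.no_grad():
                return model(**inputs).logits.item()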
    We present a probabilistic framework for multilingual neural machine translation that encompasses supervised and unsupervised setups, focusing on unsupervised translation. In addition to studying the vanilla case where there is only monolingual data available, we propose a novel setup where one language in the (source, target) pair is not associated with any parallel data, but there may exist auxiliary parallel data that contains the other. This auxiliary data can naturally be utilized in our probabilistic framework via a novel cross-translation loss term. Empirically, we show that our approach results in higher BLEU scores over state-of-the-art unsupervised models on the WMT'14 English-French, WMT'16 English-German, and WMT'16 English-Romanian datasets in most directions. In particular, we obtain a +1.65 BLEU advantage over the best-performing unsupervised model in the Romanian-English direction.
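    One plausible reading of how auxiliary parallel data can help the pair that lacks it, sketched with hypothetical translate() and train_step() functions; this is an illustration of the general idea, not the paper's exact cross-translation loss.

        def cross_translation_step(aux_batch, translate, train_step, missing_lang):
            """aux_batch holds parallel pairs (z, y), where y is in one language of
            the pair of interest and z is in an auxiliary language. Translating z
            into the missing language with the current model yields a pseudo-parallel
            pair for a direction that has no real parallel data."""
            for z, y in aux_batch:
                x_hat = translate(z, target_lang=missing_lang)  # synthesize the missing side
                train_step(source=x_hat, target=y)              # supervised update on the pseudo-pair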
    ToTTo: A Controlled Table-To-Text Generation Dataset
    Sebastian Gehrmann
    Bhuwan Dhingra
    Manaal Faruqui
    Diyi Yang
    EMNLP (2020)
    We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. We present systematic analyses of our dataset and annotation process as well as results achieved by several state-of-the-art baselines. While usually fluent, existing methods often hallucinate phrases that are not supported by the table, suggesting that this dataset can serve as a useful research benchmark for high-precision conditional text generation.
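    For illustration only (the field names below are assumptions, not the released ToTTo schema), a controlled generation example pairs a table, a set of highlighted cells, and a one-sentence target.

        # Hypothetical example structure for the controlled table-to-text task.
        example = {
            "table": [
                ["Year", "Team", "Goals"],       # header row
                ["2014", "FC Example", "12"],
                ["2015", "FC Example", "17"],
            ],
            "highlighted_cells": [(2, 0), (2, 2)],  # (row, column) indices the sentence must cover
            "target": "In 2015 the player scored 17 goals.",
        }

        def highlighted_values(ex):
            """Pull out just the cell values a faithful description should mention."""
            return [ex["table"][r][c] for r, c in ex["highlighted_cells"]]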
    We propose a novel conditioned text generation model. It draws inspiration from traditional template-based text generation techniques, where the source provides the content (i.e., what to say) and the template influences how to say it. Building on the successful encoder-decoder paradigm, it first encodes the content representation from the given input text; to produce the output, it retrieves exemplar text from the training data as "soft templates," which are then used to construct an exemplar-specific decoder. We evaluate the proposed model on abstractive text summarization and data-to-text generation. Empirical results show that this model achieves strong performance and outperforms comparable baselines.
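    A minimal sketch of the retrieve-then-generate idea, assuming hypothetical encode() and decode_with_exemplar() functions; plain cosine similarity over training inputs stands in for whatever retrieval the model actually uses.

        import numpy as np

        def retrieve_exemplar(source_vec, train_inputs, train_outputs):
            """Pick the training output whose input is most similar to the source;
            that output serves as a 'soft template' for the decoder."""
            sims = [np.dot(source_vec, v) /
                    (np.linalg.norm(source_vec) * np.linalg.norm(v))
                    for v in train_inputs]
            return train_outputs[int(np.argmax(sims))]

        def generate(source_text, encode, decode_with_exemplar,
                     train_inputs, train_outputs):
            src = encode(source_text)                                       # content: what to say
            exemplar = retrieve_exemplar(src, train_inputs, train_outputs)  # template: how to say it
            return decode_with_exemplar(src, exemplar)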
    Natural Questions: a Benchmark for Question Answering Research
    Olivia Redfield
    Danielle Epstein
    Illia Polosukhin
    Matthew Kelcey
    Jacob Devlin
    Llion Jones
    Ming-Wei Chang
    Jakob Uszkoreit
    Transactions of the Association for Computational Linguistics (2019)
    We present the Natural Questions corpus, a question answering dataset. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 examples with 5-way annotations sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
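    For illustration (not the official release format), an annotated example can be thought of as a query, a Wikipedia page, and optional long/short answer fields; the record layout below is hypothetical.

        # Hypothetical record structure for a Natural Questions-style annotation.
        example = {
            "question": "who wrote the declaration of independence",
            "wikipedia_page": "United States Declaration of Independence",
            "long_answer": "The Declaration was primarily drafted by ...",  # a paragraph, or None
            "short_answers": ["Thomas Jefferson"],                          # entities, possibly empty
        }

        def has_answer(ex):
            """Null annotations mark pages that do not contain an answer."""
            return ex["long_answer"] is not None or bool(ex["short_answers"])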
    Generalization and reliability of multilingual translation systems often depend heavily on the available parallel data for each language pair of interest. In this paper, we focus on zero-shot generalization, a challenging setup that tests systems on translation directions they have never been optimized for at training time. To solve the problem, we (i) reformulate multilingual translation as probabilistic inference and show that standard training is ad hoc and often results in models unsuitable for zero-shot tasks, (ii) introduce an agreement-based training method that encourages the model to produce equivalent translations of parallel sentences in an auxiliary third language, and (iii) make a simple change to the decoder to make the agreement losses end-to-end differentiable. We test our multilingual NMT architectures on multiple public zero-shot translation benchmarks and show that agreement-based learning often results in 2-3 BLEU point improvements over strong baselines without any loss in performance on supervised directions.
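    A rough sketch of the agreement idea, assuming hypothetical translate() and distance() functions; the key step is comparing the model's translations of the two sides of a parallel pair into an auxiliary third language. This illustrates the training signal, not the paper's exact differentiable formulation.

        def agreement_loss(parallel_batch, translate, distance, aux_lang):
            """Encourage the model to translate both sides of a parallel (x, y)
            pair into the same auxiliary-language sentence, which exercises
            directions (x -> aux, y -> aux) that may lack direct supervision."""
            total = 0.0
            for x, y in parallel_batch:
                zx = translate(x, target_lang=aux_lang)  # translation of the source side
                zy = translate(y, target_lang=aux_lang)  # translation of the target side
                total += distance(zx, zy)                # penalize disagreement between them
            return total / len(parallel_batch)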
    Automatically constructed datasets for generating text from semi-structured data (tables), such as WikiBio, often contain reference texts that diverge from the information in the corresponding semi-structured data. We show that metrics which rely solely on the reference texts, such as BLEU and ROUGE, show poor correlation with human judgments when those references diverge. We propose a new metric, PARENT, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall. Through a large-scale human evaluation study of table-to-text models for WikiBio, we show that PARENT correlates with human judgments better than existing text generation metrics. We also adapt and evaluate the information extraction based evaluation proposed by Wiseman et al. (2017), and show that PARENT has comparable correlation to it, while being easier to use. We show that PARENT is also applicable when the reference texts are elicited from humans using the data from the WebNLG challenge.
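    A simplified sketch of the table-grounded precision idea: a generated token counts as correct if it is supported by the reference or by the table, so divergent references are not unfairly penalized. Unigram overlap stands in for the metric's actual entailment model, and this is not the full PARENT definition.

        def table_words(table):
            """All tokens appearing anywhere in the table's values."""
            return {tok for row in table for cell in row
                    for tok in str(cell).lower().split()}

        def grounded_precision(generated, reference, table):
            """Fraction of generated tokens supported by the reference or the table."""
            support = set(reference.lower().split()) | table_words(table)
            tokens = generated.lower().split()
            if not tokens:
                return 0.0
            return sum(tok in support for tok in tokens) / len(tokens)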