Aditya Siddhant
I work on open research problems in multilingual machine translation and cross-lingual representation learning, and I apply my research to enable Google Assistant in multiple languages. Before joining Google, I completed my master's at the Language Technologies Institute at Carnegie Mellon University and my bachelor's at the Indian Institute of Technology Guwahati. Outside of work, I like to play tennis and I absolutely love cars.
Research Areas
Authored Publications
MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task
Jurik Juraska
Mara Finkelstein
Mahdi Mirzazadeh
Conference on Machine Translation (2023)
This report details the MetricX-23 submission to the Workshop on Machine Translation's 2023 Metrics Shared Task and provides an overview of the experiments that informed which metrics were submitted. Our three submissions, each with a quality estimation (or reference-free) version, are all learned regression-based metrics that vary in the data used for training and which pretrained language model was used for initialization. We report results related to understanding (1) which supervised training data to use, (2) the impact of how the training labels are normalized, (3) the amount of synthetic training data to use, (4) how metric performance is related to model size, and (5) the effect of initializing the metrics with different pretrained language models. The training recipes that we found to be most successful are detailed in this report.
SEAHORSE: A Dataset of Summaries Annotated with Human Ratings in Six Languages
Elizabeth Clark
Shruti Rijhwani
Sebastian Gehrmann
EMNLP 2023, Association for Computational Linguistics (2023)
We introduce Seahorse (SummariEs Annotated with Human Ratings in Six languagEs), a dataset of 96K summaries with ratings along 6 dimensions (comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness). The summaries are generated from 8 different models, conditioned on source text from 4 datasets in 6 languages (German, English, Spanish, Russian, Turkish, and Vietnamese). We release the annotated summaries as a resource for developing better summarization models and automatic metrics. We present an analysis of the dataset's composition and quality, and we demonstrate the potential of this dataset for building better summarization metrics, showing that metrics finetuned with Seahorse data outperform baseline metrics.
Dialect-robust Evaluation of Generated Text
Jiao Sun
Elizabeth Clark
Tu Vu
Sebastian Gehrmann
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada (2023), pp. 6010-6028
Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, currently, there exists no way to quantify how metrics respond to change in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics. We introduce a suite of methods and corresponding statistical tests one can use to assess metrics in light of the two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust and that semantic perturbations frequently lead to smaller decreases in a metric than the introduction of dialect features. As a first step to overcome this limitation, we propose a training schema, NANO, which introduces regional and language information to the pretraining process of a metric. We demonstrate that NANO provides a size-efficient way for models to improve the dialect robustness while simultaneously improving their performance on the standard metric benchmark.
Building Machine Translation Systems for the Next Thousand Languages
Julia Kreutzer
Mengmeng Niu
Pallavi Nikhil Baljekar
Xavier Garcia
Maxim Krikun
Pidong Wang
Apu Shah
Macduff Richard Hughes
Google Research (2022)
mT5: A massively multilingual pre-trained text-to-text transformer
Linting Xue
Mihir Sanjay Kale
Rami Al-Rfou
Aditya Barua
Colin Raffel
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2021), Association for Computational Linguistics, Online, pp. 483-498
The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.
nmT5 - Is parallel data still relevant for pre-training massively multilingual language models?
Linting Xue
Mihir Sanjay Kale
Rami Al-Rfou
Annual Meeting of the Association for Computational Linguistics (ACL) (2021) (to appear)
Recently, mT5 - a massively multilingual version of T5 - leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that simply multi-tasking language modeling with objectives such as machine translation during pre-training leads to improved performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime.
Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation
Henry Tsai
Naveen Ari
AAAI 2020 (2020)
The recently proposed massively multilingual neural machine translation system has been shown to be capable of translating 102 languages to and from English within a single model. In this paper, we evaluate the cross-lingual effectiveness of representations from the encoder of such a model on 5 downstream classification and sequence tagging tasks spanning more than 50 languages. We compare our results to a strong multilingual baseline, BERT, and show modest gains on zero-shot cross-lingual transfer in 4 out of these 5 tasks. Our results provide strong insight into how applicable the representations learned from multilingual machine translation are across languages and tasks.
Preview abstract
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing.
To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We will release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
Naveen Ari
ACL 2020 (2020)
Over the last few years, two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, reaching up to 28 BLEU on ro-en translation without any parallel data or back-translation.