Abhirut Gupta
I work as a Software Engineer on the Advertisement Sciences team in Google Research, Bangalore. At Google, my work broadly encompasses various aspects of query understanding. My research interests include Natural Language Processing and Machine Translation, with a focus on low-resource scenarios.
I finished my Master's from IIT Bombay in 2014, and my Bachelor's in Computer Science from NIT Nagpur in 2012. From 2014 to February 2020, I worked as a Software Engineer at IBM Research in Bangalore, where I worked on a variety of language understanding problems and received two Outstanding Technical Achievement awards for my work during that period.
Authored Publications
Bi-Phone: Modeling Inter Language Phonetic Influences in Text
Ananya B. Sai
Richard William Sproat
Yuri Vasilevski
James Ren
Ambarish Jash
Sukhdeep Sodhi
ACL, Association for Computational Linguistics, Toronto, Canada (2023), 2580–2592
Preview abstract
A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1).
We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2.
These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text.
Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web.
We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE, for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly on it.
We also introduce a new phoneme prediction pre-training task that helps byte models recover performance close to their SuperGLUE levels. Finally, we release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
View details
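The abstract above describes mining L1-conditioned phoneme confusions and plugging them into a generative model that corrupts clean L2 text. The snippet below is a minimal sketch of that corruption step only; the confusion probabilities, toy pronunciation lexicon, and respelling map are illustrative stand-ins, not the mined tables or the released code from the paper.

```python
# Illustrative sketch only (not the paper's released code): a table of
# L1-conditioned phoneme confusions drives a simple generative corruption of
# clean L2 (English) words. All values below are toy stand-ins.
import random

# Hypothetical confusion table for one L1: phoneme -> [(confusable phoneme, probability), ...]
CONFUSIONS = {
    "V": [("B", 0.4)],  # e.g. /v/ realised as /b/
    "Z": [("S", 0.3)],  # /z/ devoiced to /s/
}

# Toy pronunciation lexicon and a naive phoneme-to-spelling map.
LEXICON = {"very": ["V", "EH", "R", "IY"], "zoo": ["Z", "UW"]}
RESPELL = {"V": "v", "B": "b", "Z": "z", "S": "s", "EH": "e", "R": "r", "IY": "y", "UW": "oo"}

def corrupt_word(word, rng=random):
    """Sample phoneme confusions for a word, then respell it as corrupted text."""
    phones = LEXICON.get(word.lower())
    if phones is None:
        return word  # out-of-lexicon words pass through unchanged
    corrupted = []
    for p in phones:
        for q, prob in CONFUSIONS.get(p, []):
            if rng.random() < prob:
                p = q
                break
        corrupted.append(p)
    return "".join(RESPELL.get(p, p.lower()) for p in corrupted)

print(corrupt_word("very"))  # prints "very" or, when the V->B confusion fires, "bery"
```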
Adapting Multilingual Models for Code-Mixed Translation
Aditya Vavre
Sunita Sarawagi
Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics (2022), 7133–7141
Preview abstract
The scarcity of gold-standard code-mixed to pure-language parallel data makes it difficult to train translation models reliably. Prior work has addressed the paucity of parallel data with data augmentation techniques. Such methods rely heavily on external resources, making systems difficult to train and scale effectively for multiple languages. We present a simple yet highly effective two-stage back-translation based training scheme for adapting multilingual models to the task of code-mixed translation, which eliminates dependence on external resources. We show a substantial improvement in translation quality (measured through BLEU), beating existing prior work by up to +3.8 BLEU on code-mixed Hi→En, Mr→En, and Bn→En tasks. On the LinCE Machine Translation leaderboard, we achieve the highest score for code-mixed Es→En, beating the existing best baseline by +6.5 BLEU and our own stronger baseline by +1.1 BLEU.
View details
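As a rough illustration of the back-translation core of the two-stage scheme described above, the sketch below uses the standard Hugging Face seq2seq interface to translate monolingual English sentences "backwards" into synthetic code-mixed sources, yielding (synthetic source, gold target) pairs for training the forward model. The checkpoint name is a placeholder, and the paper's actual two-stage recipe, data selection, and fine-tuning details are not reproduced here.

```python
# Minimal back-translation sketch. The checkpoint path is a placeholder for a
# reverse (English -> code-mixed) model; this is not the paper's training code.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

REVERSE_CKPT = "path/to/reverse-en-to-codemixed-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(REVERSE_CKPT)
reverse_model = AutoModelForSeq2SeqLM.from_pretrained(REVERSE_CKPT)

def back_translate(english_sentences):
    """Generate synthetic code-mixed sources paired with the gold English targets."""
    batch = tok(english_sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = reverse_model.generate(**batch, num_beams=4, max_new_tokens=64)
    synthetic_sources = tok.batch_decode(outputs, skip_special_tokens=True)
    # Each (synthetic code-mixed source, real English target) pair can then be
    # used to fine-tune the forward code-mixed -> English model.
    return list(zip(synthetic_sources, english_sentences))
```

Fine-tuning the forward model on these synthetic pairs is the second stage; the point made in the abstract is that this removes the dependence on external resources such as lexicons or transliteration tools.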
Training Data Augmentation for Code-Mixed Translation
Aditya Vavre
Sunita Sarawagi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online (2021), 5760–5766
Preview abstract
Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications like web search and targeted advertising. We address the scarcity of parallel data for training such models by designing a strategy for converting existing non-code-mixed parallel data sources to code-mixed parallel data. We present an m-BERT based procedure whose core learnable component is a ternary sequence labeling model, which can be trained with a limited code-mixed corpus alone. We show a 5.8-point increase in BLEU on heavily code-mixed sentences by training a translation model with our data augmentation strategy on a Hindi-English code-mixed translation task.
View details
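To make the "ternary sequence labeling" idea concrete, here is a hedged sketch of an m-BERT token classifier with a three-way tag set that decides, per source token, whether to keep it or switch it to English. The label names are assumptions made for illustration, and the classification head below is untrained, so its predictions are meaningless until fine-tuned on a (small) code-mixed corpus.

```python
# Sketch of a ternary token labeler on top of m-BERT; not the paper's exact
# tag set or training setup.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "bert-base-multilingual-cased"
LABELS = ["KEEP", "SWITCH_TO_EN", "OTHER"]  # hypothetical three-way tag set
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))

def predict_switch_labels(tokens):
    """Return one label per input word (taken from its first subword)."""
    enc = tok(tokens, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]
    labels = {}
    for position, word_id in enumerate(enc.word_ids()):
        if word_id is not None and word_id not in labels:
            labels[word_id] = LABELS[int(logits[position].argmax())]
    return [labels[i] for i in range(len(tokens))]

print(predict_switch_labels(["मैं", "कल", "office", "जाऊँगा"]))
```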
Preview abstract
Back-translation (BT) of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT), especially for low-resource language pairs. To improve the effectiveness of the available BT data, we introduce HintedBT, a family of techniques that provides hints (through tags) to the encoder and decoder. First, we propose a novel method of using both high- and low-quality BT data by providing hints (as encoder tags) to the model about the quality of each source-target pair. We don't filter out low-quality data but instead show that these hints enable the model to learn effectively from noisy data.
Second, we address the problem of predicting whether a source token needs to be translated or transliterated to the target language, which is common in cross-script translation tasks (i.e., where source and target do not share the written script).
For such cases, we propose training the model with additional hints (as decoder tags) that provide information about the operation required on the source (translation, or both translation and transliteration). We conduct experiments and detailed analyses on standard WMT benchmarks for three cross-script low/medium-resource language pairs: {Hindi, Gujarati, Tamil}→English.
Our methods compare favorably with five strong and well-established baselines. We show that using these hints, both separately and together, significantly improves translation quality and leads to state-of-the-art performance in all three language pairs in the corresponding bilingual settings.
View details
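The hints described in the HintedBT abstract are simply tags attached to training pairs, so a minimal sketch is enough to show the mechanics: an encoder-side tag encodes the quality bin of a back-translated pair, and a decoder-side tag encodes the operation required on the source. The tag strings, the quality threshold, and the example sentence below are assumptions for illustration, not the paper's actual tag vocabulary or scoring.

```python
# Illustrative "hints as tags" helper; all specifics here are assumed, not
# taken from the paper.
def tag_bt_pair(source, target, quality_score, needs_transliteration):
    encoder_tag = "<bt_high>" if quality_score >= 0.7 else "<bt_low>"         # quality hint
    decoder_tag = "<trans+translit>" if needs_transliteration else "<trans>"  # operation hint
    return f"{encoder_tag} {source}", f"{decoder_tag} {target}"

src, tgt = tag_bt_pair("mera naam Abhirut hai", "My name is Abhirut",
                       quality_score=0.82, needs_transliteration=True)
# src == "<bt_high> mera naam Abhirut hai"
# tgt == "<trans+translit> My name is Abhirut"
```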