Partha Talukdar
Partha is a Senior Staff Research Scientist at Google Research, Bangalore, where he leads a group focused on Natural Language Understanding. He is also an Associate Professor (on leave) at IISc Bangalore. Partha founded KENOME, an enterprise knowledge graph company with the mission to help enterprises make sense of unstructured data. Previously, Partha was a Postdoctoral Fellow in the Machine Learning Department at Carnegie Mellon University, working with Tom Mitchell on the NELL project. He received his PhD (2010) in CIS from the University of Pennsylvania. Partha is broadly interested in Natural Language Processing, Machine Learning, and Knowledge Graphs. He is a recipient of several awards, including an Outstanding Paper Award at ACL 2019 and the ACM India Early Career Award 2022. He is a co-author of a book on Graph-based Semi-Supervised Learning. Homepage: https://parthatalukdar.github.io
Authored Publications
Multimodal Modeling for Spoken Language Identification
Shikhar Bharadwaj
Sriram (Sri) Ganapathy
Sid Dalmia
Wei Han
Yu Zhang
Proceedings of 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024) (2024)
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however, in the case of video data, there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description, and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.
UGIF-DataSet: A New Dataset for Cross-lingual, Cross-modal Sequential actions on the UI
Findings of the Association for Computational Linguistics: NAACL 2024
Help documents are supposed to aid smartphone users in resolving queries such as "How to block calls from unknown numbers?". However, given a query, identifying the right help document, understanding instructions from the document, and using them to resolve the issue at hand is challenging. The user experience may be enhanced by converting the instructions in the help document to a step-by-step tutorial overlaid on the phone UI. Successful execution of this task requires overcoming research challenges in retrieval, parsing, and grounding in the multilingual-multimodal setting. For example, user queries in one language may have to be matched against instructions in another language, which in turn needs to be grounded in a multimodal UI in yet another language. Moreover, there isn’t any relevant dataset for such a task. In order to bridge this gap, we introduce UGIF-DataSet, a multi-lingual, multi-modal UI grounded dataset for step-by-step task completion on the smartphone, containing 4,184 tasks across 8 languages. The instruction steps in UGIF-DataSet are available only in English, so the challenge involves operations in the cross-modal, cross-lingual setting. We compare the performance of different large language models for this task and find that the end-to-end task completion rate drops from 48% in English to 32% for other languages, demonstrating significant overall headroom for improvement. We are hopeful that UGIF-DataSet and our analysis will aid further research on the important problem of sequential task completion in the multilingual and multimodal setting.
Evaluating Inclusivity, Equity, and Accessibility of NLP Technology: A Case Study for Indian Languages
Simran Khanuja
Sebastian Ruder
Findings of the Association for Computational Linguistics: EACL 2023
In order for NLP technology to be widely applicable and useful, it needs to be **inclusive** of users across the world's languages, **equitable**, i.e., not unduly biased towards any particular language, and **accessible** to users, particularly in low-resource settings where compute constraints are common. In this paper, we propose an evaluation paradigm that assesses NLP technologies across all three dimensions, hence quantifying the diversity of users they can serve. While inclusion and accessibility have received attention in recent literature, quantifying equity is relatively unexplored. We propose to address this gap using the *Gini coefficient*, a well-established metric used for estimating societal wealth inequality. Using our paradigm, we highlight the distressed state of utility and equity of current technologies for Indian (IN) languages. Our focus on IN is motivated by their linguistic diversity and their large, varied speaker population. To improve upon these metrics, we demonstrate the importance of region-specific choices in model building and dataset creation and also propose a novel approach to optimal resource allocation in pursuit of building linguistically diverse, equitable technologies.
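As a rough illustration of the equity measure named in this abstract, the sketch below computes a Gini coefficient over per-language scores. The language codes and scores are hypothetical placeholders, not results from the paper, and the rank-weighted formula shown is the standard textbook form rather than a confirmed detail of the authors' setup.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0.0 means every language gets the same score; values near 1.0 mean
    the utility is concentrated in very few languages.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form over ascending values x_1..x_n:
    # G = 2 * sum_i(i * x_i) / (n * sum_i(x_i)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-language task scores (illustrative only).
scores = {"hi": 0.82, "bn": 0.74, "ta": 0.61, "mr": 0.58, "kn": 0.33}
equity_gap = gini(scores.values())
```

A lower value indicates that performance is spread more evenly across the languages considered.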
Self-influence Guided Data Reweighting for Language Model Pre-training
Megh Thakkar
Sarath Chandar
Sriram (Sri) Ganapathy
EMNLP (2023)
We explore a fundamental question in language model pre-training with huge amounts of unlabeled and randomly sampled text data: should every data sample contribute equally to model learning? To this end, we use self-influence (SI) scores as an indicator of sample importance, analyzing the relationship of self-influence scores with the sample quality and probing the efficacy of SI scores for offline pre-training dataset filtering. Building upon this, we propose PRESENCE: Pre-training data REweighting with Self-influENCE, an online and adaptive pre-training data re-weighting strategy using self-influence scores. PRESENCE is a two-phased learning method: in the first phase of learning, the data samples with higher SI scores are emphasized more, while in the subsequent phase of learning, the data samples with higher SI scores are de-emphasized to limit the impact of noisy and unreliable samples. We validate PRESENCE over 2 model sizes of multilingual-t5 with 5 datasets across 3 tasks, obtaining significant performance improvements over the baseline methods considered. Through extensive ablations and qualitative analyses, we put forward a new research direction for language model pre-training.
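The two-phase idea described in this abstract can be sketched as follows. This is a minimal illustration only: the softmax form, the `temperature` parameter, and the function name `presence_style_weights` are assumptions made for the example, and the abstract does not specify how SI scores are turned into weights or how they are computed.

```python
import math


def presence_style_weights(si_scores, phase, temperature=1.0):
    """Turn self-influence (SI) scores into per-sample weights for one batch.

    Assumed scheme (not taken from the paper): a softmax over the scores,
    with the sign flipped in phase 2 so that high-SI samples are
    emphasized in phase 1 and de-emphasized in phase 2.
    """
    if phase not in (1, 2):
        raise ValueError("phase must be 1 or 2")
    sign = 1.0 if phase == 1 else -1.0
    logits = [sign * s / temperature for s in si_scores]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical SI scores for three samples in a batch.
batch_si = [0.1, 0.5, 2.0]
early_weights = presence_style_weights(batch_si, phase=1)
late_weights = presence_style_weights(batch_si, phase=2)
```

In a training loop, such weights would multiply the per-sample losses before they are summed; when to switch phases is a further design choice not covered here.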
MASR: Multi-Label Aware Speech Representation
Anjali Raj
Shikhar Bharadwaj
Sriram Ganapathy
2023 Workshop on Automatic Speech Recognition and Understanding (ASRU) (2023)
In recent years, speech representation learning has been framed primarily as a self-supervised learning (SSL) task that uses the raw audio signal alone, ignoring the side information that is often available for a given speech recording. Incorporation of side information in existing techniques is constrained to a specific category of meta-data, thereby imposing limitations. Furthermore, these approaches exhibit inefficiencies in their utilization of such information. In this paper, we propose MASR, a Multi-label Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables the inclusion of external knowledge sources to enhance the utilization of meta-data information. Using MASR representations, we perform evaluation on several downstream tasks such as language identification and speech recognition. In these experiments, we illustrate significant performance improvements for MASR over other established benchmarks. A key advantage of MASR is that it can be combined with any choice of SSL method. We perform a detailed analysis on the language identification task which illustrates how the proposed loss function enables the representations to separate closely related languages. We also investigate the application of the proposed approach for other non-semantic tasks such as speaker and emotion recognition.
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Sebastian Ruder
Mihir Sanjay Kale
Shruti Rijhwani
Jean-Michel Sarr
Cindy Wang
John Wieting
Christo Kirov
Dana L. Dickinson
Bidisha Samanta
Connie Tao
David Adelani
Reeve Ingle
Dmitry Panteleev
Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 1856-1884
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) — languages for which NLP research is particularly far behind in meeting user needs — it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks — tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
Label Aware Speech Representation Learning For Language Identification
Shikhar Bharadwaj
Sriram Ganapathy
Wei Han
Proceedings of Interspeech 2023, pp. 5351-5355
Speech representation learning approaches for non-semantic tasks like language recognition have explored either supervised embedding extraction methods using a classifier model or self-supervised representation learning using raw data. In this paper, we propose a novel framework that combines self-supervised representation learning with language label information for the pre-training task. This framework, termed label aware speech representation learning (LASR), uses a triplet-based objective function to incorporate the language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the identification task. The language recognition experiments are performed on two public datasets, FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over the state-of-the-art systems in terms of recognition performance. We also report an analysis of the robustness of the LASR approach to noisy/missing labels as well as the application of the LASR model for downstream multi-lingual speech recognition tasks.
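To make the triplet-based objective in this abstract concrete, here is a minimal sketch of a hinge-style triplet loss added to a self-supervised loss. The Euclidean distance, the `margin` and `weight` parameters, and the simple additive combination are assumptions for illustration; the abstract does not state these details.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull same-language pairs together, push others apart.

    `positive` shares the anchor's language label, `negative` does not.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def combined_loss(ssl_loss, anchor, positive, negative, weight=1.0, margin=1.0):
    """Assumed combination: self-supervised loss plus a weighted triplet term."""
    return ssl_loss + weight * triplet_loss(anchor, positive, negative, margin)


# Hypothetical 2-d utterance embeddings.
anchor_emb = [0.0, 0.0]
same_language_emb = [0.0, 1.0]
other_language_emb = [3.0, 4.0]
loss = combined_loss(2.0, anchor_emb, same_language_emb, other_language_emb, weight=0.5)
```

When the same-language embedding is already closer to the anchor than the other-language embedding by at least the margin, the triplet term vanishes and only the self-supervised loss remains.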
Pretrained multilingual models such as mBERT and multilingual T5 (mT5) have been successful at many Natural Language Processing tasks. The shared representations learned by these models facilitate cross-lingual transfer in low-resource settings. In this work, we study the usability of these models for morphology analysis tasks such as root word extraction and morphological feature tagging for Indian languages. In particular, we use the mT5 model to train a gender, number and person (GNP) tagger for languages from 2 Indian families. We use data from 6 Indian languages (Marathi, Hindi, Bengali, Tamil, Telugu and Kannada) to fine-tune a multilingual GNP tagger and root word extractor.
We demonstrate the usability of multilingual models for few-shot cross-lingual transfer through an average 7% increase in GNP tagging in cross-lingual settings as compared to a monolingual setting, and through controlled experiments. We also provide insights into the cross-lingual transfer of morphological tags for verbs and nouns, which serves as a proxy for the quality of the multilingual representations of word markers learned by the model.
Parameter-Efficient Finetuning for Robust Continual Multilingual Learning
Findings of the Association for Computational Linguistics: ACL 2023
We introduce and study the problem of Continual Multilingual Learning (CML), where a previously trained multilingual model is periodically updated using new data arriving in stages. If the new data is present only in a subset of languages, we find that the resulting model shows improved performance only on the languages included in the latest update (and a few closely related languages), while its performance on all the remaining languages degrades significantly. We address this challenge by proposing LAFT-URIEL, a parameter-efficient finetuning strategy which aims to increase the number of languages on which the model improves after an update, while reducing the magnitude of loss in performance for the remaining languages. LAFT-URIEL uses linguistic knowledge to balance overfitting and knowledge sharing across languages, resulting in a 25% increase in the number of languages whose performance improves during an update and a 78% relative decrease in the average magnitude of losses on the remaining languages.
Bootstrapping Multilingual Semantic Parsers using Large Language Models
Abhijeet Awasthi
Bidisha Samanta
Sunita Sarawagi
Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2023)
Despite the cross-lingual generalization demonstrated by pre-trained multilingual models, the translate-and-train paradigm of transferring English datasets across multiple languages remains the key ingredient for training task-specific multilingual models. However, for many low-resource languages, the availability of a reliable translation service entails significant amounts of costly human-annotated translation pairs. Further, the translation services for low-resource languages may continue to be brittle due to domain mismatch between the task-specific input text and the general-purpose text used while training the translation models. We consider the task of multilingual semantic parsing, and demonstrate the effectiveness and the flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting. We provide (i) extensive comparisons with prior translate-and-train methods across 50 languages, demonstrating that LLMs can serve as highly effective data translators, outperforming prior translation-based methods on 40 out of 50 languages; (ii) a comprehensive study of the key design choices that enable effective data translation via prompted LLMs.