Partha Talukdar

Partha is a Senior Staff Research Scientist at Google Research, Bangalore, where he leads a group focused on Natural Language Understanding. He is also an Associate Professor (on leave) at IISc Bangalore. Partha founded KENOME, an enterprise knowledge graph company with the mission of helping enterprises make sense of unstructured data. Previously, Partha was a Postdoctoral Fellow in the Machine Learning Department at Carnegie Mellon University, working with Tom Mitchell on the NELL project. He received his PhD (2010) in CIS from the University of Pennsylvania. Partha is broadly interested in Natural Language Processing, Machine Learning, and Knowledge Graphs. He is a recipient of several awards, including an Outstanding Paper Award at ACL 2019 and the ACM India Early Career Award 2022. He is a co-author of a book on Graph-based Semi-Supervised Learning. Homepage: https://parthatalukdar.github.io
Authored Publications
    We explore a fundamental question in language model pre-training with huge amounts of unlabeled and randomly sampled text data: should every data sample contribute equally to model learning? To this end, we use self-influence (SI) scores as an indicator of sample importance, analyzing the relationship of self-influence scores with the sample quality and probing the efficacy of SI scores for offline pre-training dataset filtering. Building upon this, we propose PRESENCE: Pre-training data REweighting with Self-influENCE, an online and adaptive pre-training data re-weighting strategy using self-influence scores. PRESENCE is a two-phased learning method: in the first phase of learning, the data samples with higher SI scores are emphasized more, while in the subsequent phase of learning, the data samples with higher SI scores are de-emphasized to limit the impact of noisy and unreliable samples. We validate PRESENCE over 2 model sizes of multilingual-t5 with 5 datasets across 3 tasks, obtaining significant performance improvements over the baseline methods considered. Through extensive ablations and qualitative analyses, we put forward a new research direction for language model pre-training.
    We introduce and study the problem of Continual Multilingual Learning (CML), where a previously trained multilingual model is periodically updated using new data arriving in stages. If the new data is present only in a subset of languages, we find that the resulting model shows improved performance only on the languages included in the latest update (and a few closely related languages), while its performance on all the remaining languages degrades significantly. We address this challenge by proposing LAFT-URIEL, a parameter-efficient finetuning strategy which aims to increase the number of languages on which the model improves after an update, while reducing the magnitude of loss in performance for the remaining languages. LAFT-URIEL uses linguistic knowledge to balance overfitting and knowledge sharing across languages, resulting in a 25% increase in the number of languages whose performance improves during an update and a 78% relative decrease in the average magnitude of losses on the remaining languages.
    Speech representation learning approaches for non-semantic tasks such as language recognition have explored either supervised embedding extraction using a classifier model or self-supervised representation learning using raw data. In this paper, we propose a novel framework that combines self-supervised representation learning with language label information for the pre-training task. This framework, termed label aware speech representation learning (LASR), uses a triplet-based objective function to incorporate the language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the identification task. The language recognition experiments are performed on two public datasets, FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over the state-of-the-art systems in terms of recognition performance. We also report an analysis of the robustness of the LASR approach to noisy/missing labels as well as the application of the LASR model for downstream multilingual speech recognition tasks.
    Bootstrapping Multilingual Semantic Parsers using Large Language Models
    Abhijeet Awasthi
    Bidisha Samanta
    Sunita Sarawagi
    Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2023)
    Despite cross-lingual generalization demonstrated by pre-trained multilingual models, the translate-and-train paradigm of transferring English datasets across multiple languages remains the key ingredient for training task-specific multilingual models. However, for many low-resource languages, the availability of a reliable translation service entails significant amounts of costly human-annotated translation pairs. Further, the translation services for low-resource languages may continue to be brittle due to domain mismatch between the task-specific input text and the general-purpose text used while training the translation models. We consider the task of multilingual semantic parsing, and demonstrate the effectiveness and the flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting. We provide (i) extensive comparisons with prior translate-and-train methods across 50 languages, demonstrating that LLMs can serve as highly effective data translators, outperforming prior translation-based methods on 40 out of 50 languages; (ii) a comprehensive study of the key design choices that enable effective data translation via prompted LLMs.
    Evaluating Inclusivity, Equity, and Accessibility of NLP Technology: A Case Study for Indian Languages
    Simran Khanuja
    Sebastian Ruder
    Findings of the Association for Computational Linguistics: EACL 2023
    In order for NLP technology to be widely applicable and useful, it needs to be **inclusive** of users across the world's languages, **equitable**, i.e., not unduly biased towards any particular language, and **accessible** to users, particularly in low-resource settings where compute constraints are common. In this paper, we propose an evaluation paradigm that assesses NLP technologies across all three dimensions, hence quantifying the diversity of users they can serve. While inclusion and accessibility have received attention in recent literature, quantifying equity is relatively unexplored. We propose to address this gap using the *Gini coefficient*, a well-established metric used for estimating societal wealth inequality. Using our paradigm, we highlight the distressed state of utility and equity of current technologies for Indian (IN) languages. Our focus on IN is motivated by their linguistic diversity and their large, varied speaker population. To improve upon these metrics, we demonstrate the importance of region-specific choices in model building and dataset creation and also propose a novel approach to optimal resource allocation in pursuit of building linguistically diverse, equitable technologies.
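    As a rough illustration of the Gini coefficient mentioned in the abstract above, the sketch below computes the standard discrete Gini coefficient over a list of non-negative per-language scores. The function name `gini` and the example scores are hypothetical, and the paper's exact formulation (for instance, any weighting by speaker population) is not given here, so this should be read as the textbook definition rather than the authors' method.

```python
def gini(values):
    """Standard discrete Gini coefficient for non-negative values.

    Returns 0.0 for perfect equality and (n - 1) / n when a single
    item holds the entire total. Illustrative only; not taken from
    the paper.
    """
    xs = sorted(values)
    if any(x < 0 for x in xs):
        raise ValueError("values must be non-negative")
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


# Hypothetical per-language task scores, for illustration only.
equal_scores = [1.0, 1.0, 1.0, 1.0]
skewed_scores = [0.0, 0.0, 0.0, 1.0]
print(gini(equal_scores), gini(skewed_scores))
```

    A value near 0 would indicate that performance is spread evenly across languages, while a value near 1 would indicate that it is concentrated in a few.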
    XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
    Sebastian Ruder
    Shruti Rijhwani
    Jean-Michel Sarr
    Cindy Wang
    John Wieting
    Christo Kirov
    Dana L. Dickinson
    Bidisha Samanta
    Connie Tao
    David Adelani
    Reeve Ingle
    Dmitry Panteleev
    Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 1856-1884
    Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs), languages for which NLP research is particularly far behind in meeting user needs, it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks, i.e., tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
    Multimodal Language Identification
    Shikhar Bharadwaj
    Sid Dalmia
    Sriram (Sri) Ganapathy
    Yu Zhang
    2024 IEEE International Conference on Acoustics, Speech and Signal Processing (2023) (to appear)
    Language identification (LangID) of video data, the task of determining the spoken language in a given multimedia file, is primarily treated as a speech based language recognition task. On the other hand, text based language recognition is employed for written language content. In this work, we present a multimodal LangID system for video data that combines speech and text features to achieve state-of-the-art performance. We show that title and description of the video along with other meta-data, like geographic upload location of the video, contain substantial information regarding the language identity of the video recording. With a single multimodal model that can encode speech and text data, we build a language recognition system that can combine the information from speech, text and geographic location data. We experiment on public language recognition tasks with the Dhwani (22 language) dataset and the VoxLingua (107 language) dataset. In these settings, the proposed system achieves an absolute improvement of 6.6% and 5.6% in F1 score over the speech only baseline, respectively. We also provide an ablation study highlighting the contribution of different modalities for the language recognition task.
    Pretrained multilingual models such as mBERT and multilingual T5 (mT5) have been successful at many Natural Language Processing tasks. The shared representations learned by these models facilitate cross-lingual transfer in low-resource settings. In this work, we study the usability of these models for morphology analysis tasks such as root word extraction and morphological feature tagging for Indian languages. In particular, we use the mT5 model to train a gender, number and person (GNP) tagger for languages from 2 Indian language families. We use data from 6 Indian languages (Marathi, Hindi, Bengali, Tamil, Telugu and Kannada) to fine-tune a multilingual GNP tagger and root word extractor. We demonstrate the usability of multilingual models for few-shot cross-lingual transfer through an average 7% increase in GNP tagging in cross-lingual settings as compared to a monolingual setting, and through controlled experiments. We also provide insights into cross-lingual transfer of morphological tags for verbs and nouns, which also provides a proxy for the quality of the multilingual representations of word markers learned by the model.
    MASR: Multi-Label Aware Speech Representation
    Anjali Raj
    Shikhar Bharadwaj
    Sriram Ganapathy
    2023 Workshop on Automatic Speech Recognition and Understanding (ASRU) (2023)
    In recent years, speech representation learning has been framed primarily as a self-supervised learning (SSL) task that uses the raw audio signal alone, ignoring the side information that is often available for a given speech recording. Incorporation of side information in existing techniques is constrained to a specific category of meta-data, thereby imposing limitations. Furthermore, these approaches exhibit inefficiencies in their utilization of such information. In this paper, we propose MASR, a Multi-label Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables the inclusion of external knowledge sources to enhance the utilization of meta-data information. Using MASR representations, we perform evaluation on several downstream tasks such as language identification and speech recognition. In these experiments, we illustrate significant performance improvements for MASR over other established benchmarks. A key advantage of MASR is that it can be combined with any choice of SSL method. We perform a detailed analysis on the language identification task which illustrates how the proposed loss function enables the representations to separate closely related languages. We also investigate the application of the proposed approach for other non-semantic tasks such as speaker and emotion recognition.
    Few-shot Controllable Style Transfer for Low-Resource Languages
    Kalpesh Krishna
    Deepak Nathani
    Xavier Garcia
    Bidisha Samanta
    Few-shot Controllable Style Transfer for Low-Resource Languages (2022)
    Few-shot style transfer is the task of rewriting an input sentence using the stylistic properties extracted from a few (3-10) exemplar sentences, while approximately preserving the input content. This is especially useful in low resource settings where no large style-transfer datasets are available. We push the state-of-the-art for few-shot style transfer with a new method that models the stylistic difference between paraphrases. When compared to prior work our method achieves better performance in , output diversity and style transfer magnitude control for five Indian Languages.
    Recent research has revealed undesirable biases in NLP data and models. However, these efforts focus on social disparities in the West, and are not directly portable to other geo-cultural contexts. In this position paper, we outline a holistic research agenda to re-contextualize NLP fairness research for the Indian context, accounting for Indian societal context, bridging technological gaps in capability and resources, and adapting to Indian cultural values. We also report high-level findings from an empirical study on various social stereotypes for Region and Religion axes in the Indian context, demonstrating their prevalence in corpora and models.
    Walking with PACE — Personalized and Automated Coaching Engine
    Deepak Nathani
    Eshan Motwani
    Karina Lorenzana Livingston
    Madhurima Vardhan
    Martin Gamunu Seneviratne
    Nur Muhammad
    Rahul Singh
    Shantanu Prabhat
    Srujana Merugu
    UMAP: 30th ACM Conference on User Modeling, Adaptation and Personalization (2022)
    Fitness coaching is effective in helping individuals to develop and maintain healthy lifestyle habits. However, there is a significant shortage of fitness coaches, particularly in low-resource communities. Automated coaching assistants may help to improve the accessibility of personalized fitness coaching. Although a variety of automated nudge systems have been developed, few make use of formal behavior science principles and they are limited in their level of personalization. In this work, we introduce a computational framework leveraging Fogg's behavioral science model, which serves as a Personalized and Automated Coaching Engine (PACE). PACE is a rule-based system that infers user state and suggests appropriate text nudges. We compared the effectiveness of PACE to human coaches in a Wizard-of-Oz deployment study with 33 participants over 21 days. Participants were randomized to either a human coach ('human' arm, n=18) or the PACE framework handled by a human coach ('wizard' arm, n=15). Coaches and participants interacted via a chat interface. We tracked coach-participant conversations, step counts and qualitative survey feedback. Our findings indicate that the PACE framework strongly emulated human coaching, with no significant differences in the overall number of active days (PACE: 85.33% vs human: 92%, [p=NS]) and step count (PACE: 6674 vs human: 6605, [p=NS]) of participants from both groups. The qualitative user feedback suggests that PACE cultivated a coach-like experience, offering barrier resolution, motivation and educational support. As a post-hoc analysis, we annotated the conversation logs from the human coaching arm based on the Fogg framework, and then trained machine learning (ML) models on these data sets to predict the next coach action (AUC 0.73±0.02). This suggests that an ML-driven approach may be a viable alternative to a rule-based system in suggesting personalized nudges. In the future, such an ML system could be made increasingly personalized and adaptive based on user behaviors.
    Re-contextualizing Fairness in NLP: The Case of India
    Shaily Bhatt
    In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP) (2022)
    Recent research has revealed undesirable biases in NLP data and models. However, these efforts focus on social disparities in the West, and are not directly portable to other geo-cultural contexts. In this paper, we focus on NLP fairness in the context of India. We start with a brief account of the prominent axes of social disparities in India. We build resources for fairness evaluation in the Indian context and use them to demonstrate prediction biases along some of the axes. We then delve deeper into social stereotypes for Region and Religion, demonstrating their prevalence in corpora and models. Finally, we outline a holistic research agenda to re-contextualize NLP fairness research for the Indian context, accounting for Indian societal context, bridging technological gaps in NLP capabilities and resources, and adapting to Indian cultural values. While we focus on India, this framework can be generalized to other geo-cultural contexts.
    While recent work on multilingual language models has demonstrated their capacity for cross-lingual zero-shot transfer on downstream tasks, there is a lack of consensus in the community as to what shared properties between languages enable such transfer. Analyses involving pairs of natural languages are often inconclusive and contradictory since languages simultaneously differ in many linguistic aspects. In this paper, we perform a large-scale empirical study to isolate the effects of various linguistic properties by measuring zero-shot transfer between four diverse natural languages and their counterparts constructed by modifying aspects such as the script, word order, and syntax. Among other things, our experiments show that the absence of sub-word overlap significantly affects zero-shot transfer when languages differ in their word order, and there is a strong correlation between transfer performance and word embedding alignment between languages (e.g., R=0.94 on the task of NLI). Our results call for focus in multilingual models on explicitly improving word embedding alignment between languages rather than relying on its implicit emergence.
    MergeDistill: Merging Pre-trained Language Models using Distillation
    Simran Khanuja
    Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
    Pre-trained multilingual language models (LMs) have achieved state-of-the-art results in cross-lingual transfer, but they often lead to an inequitable representation of languages due to limited capacity, skewed pre-training data, and sub-optimal vocabularies. This has prompted the creation of an ever-growing pre-trained model universe, where each model is trained on large amounts of language or domain specific data with a carefully curated, linguistically informed vocabulary. However, doing so brings us back full circle and prevents one from leveraging the benefits of multilinguality. To address the gaps at both ends of the spectrum, we propose MergeDistill, a framework to merge pre-trained LMs in a way that can best leverage their assets with minimal dependencies, using task-agnostic knowledge distillation. We demonstrate the applicability of our framework in a practical setting by leveraging pre-existing teacher LMs and training student LMs that perform competitively with or even outperform teacher LMs trained on several orders of magnitude more data and with a fixed model capacity. We also highlight the importance of teacher selection and its impact on student model performance.
    Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study
    Yash Khemchandani
    Sarvesh Mehtani
    Vaidehi Patil
    Abhijeet Awasthi
    Sunita Sarawagi
    ACL 2021 (to appear)
    Recent research in multilingual language models (LM) has demonstrated their ability to effectively handle multiple languages in a single model. This holds promise for low web-resource languages (LRL) as multilingual models can enable transfer of supervision from high resource languages to LRLs. However, incorporating a new language in an LM still remains a challenge, particularly for languages with limited corpora and in unseen scripts. In this paper we argue that relatedness among languages in a language family may be exploited to overcome some of the corpora limitations of LRLs, and propose RelateLM. We focus on Indian languages, and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script), and (2) sentence structure. RelateLM uses transliteration to convert the unseen script of limited LRL text into the script of a Related Prominent Language (RPL) (Hindi in our case). While exploiting similar sentence structures, RelateLM utilizes readily available bilingual dictionaries to pseudo translate RPL text into LRL corpora. Experiments on multiple real-world benchmark datasets provide validation to our hypothesis that using a related language as pivot, along with transliteration and pseudo translation based data augmentation, can be an effective way to adapt LMs for LRLs, rather than direct training or pivoting through English.
    Question Answering over Temporal Knowledge Graphs
    Apoorv Saxena
    Soumen Chakrabarti
    ACL 2021 (to appear)
    Temporal Knowledge Graphs (Temporal KGs) extend regular Knowledge Graphs by providing temporal scopes (start and end times) on each edge in the KG. While Question Answering over KG (KGQA) has received some attention from the research community, QA over Temporal KGs (Temporal KGQA) is a relatively unexplored area. Lack of broad coverage datasets has been another factor limiting progress in this area. We address this challenge by presenting CRONQUESTIONS, the largest known Temporal KGQA dataset, clearly stratified into buckets of structural complexity. CRONQUESTIONS expands the only known previous dataset by a factor of 340x. We find that various state-of-the-art KGQA methods fall far short of the desired performance on this new dataset. In response, we also propose CRONKGQA, a transformer-based solution that exploits recent advances in Temporal KG embeddings, and achieves performance superior to all baselines, with an increase of 120% in accuracy over the next best performing method. Through extensive experiments, we give detailed insights into the workings of CRONKGQA, as well as situations where significant further improvements appear possible. In addition to the dataset, we have released our code as well.
    Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks
    Joseph Reisinger
    Rahul Bhagat
    Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2008), Association for Computational Linguistics, Honolulu, Hawaii, pp. 582-590
    Frustratingly Hard Domain Adaptation for Dependency Parsing
    Mark Dredze
    João V. Graça
    Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pp. 1051-1055
    Learning to Create Data-Integrating Queries
    Marie Jacob
    M. Salman Mehmood
    Koby Crammer
    Zachary Ives
    Sudipto Guha
    VLDB (2008)