Shikhar Vashishth

Shikhar

Shikhar Vashishth is a research scientist at Google focusing on building multimodal language models for Indian languages. Prior to this, he was a postdoc at Microsoft Research and Language Technologies Institute, Carnegie Mellon University. He completed his Ph.D. from Indian Insitute of Science under the guidance of Partha Pratim Talukdar, Chiranjib Bhattacharyya, and Manaal Faruqui. He has been a recipient of the prestigious ACM India Doctoral Dissertation Award and Google PhD Fellowship. He completed his graduation from BITS Pilani, Pilani in 2016.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Multimodal Modeling for Spoken Language Identification
    Shikhar Bharadwaj
    Sriram (Sri) Ganapathy
    Sid Dalmia
    Wei Han
    Yu Zhang
    Proceedings of 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024) (2024)
    Preview abstract Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition. View details
    LinguaMeta: Unified Metadata for Thousands of Languages
    Uche Okonkwo
    Emily Drummond
    Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
    Preview abstract We introduce LinguaMeta, a unified resource for language metadata for thousands of languages, including language codes, names, number of speakers, writing systems, countries, official status, coordinates, and language varieties. The resources are drawn from various existing repositories and supplemented with our own research. Each data point is tagged for its origin, allowing us to easily trace back to and improve existing resources with more up-to-date and complete metadata. The resource is intended for use by researchers and organizations who aim to extend technology to thousands of languages. View details
    Preview abstract We explore a fundamental question in language model pre-training with huge amounts of unlabeled and randomly sampled text data - should every data sample have equal contribution to the model learning. To this end, we use self-influence (SI) scores as an indicator of sample importance, analyzing the relationship of self-influence scores with the sample quality and probing the efficacy of SI scores for offline pre-training dataset filtering. Building upon this, we propose PRESENCE: Pre-training data REweighting with Self-influENCE, an online and adaptive pre-training data re-weighting strategy using self-influence scores. PRESENCE is a two-phased learning method: In the first phase of learning, the data samples with higher SI scores are emphasized more, while in the subsequent phase of learning, the data samples with higher SI scores are de-emphasized to limit the impact of noisy and unreliable samples. We validate PRESENCE over $2$ model sizes of multilingual-t5 with $5$ datasets across $3$ tasks, obtaining significant performance improvements over the baseline methods considered. Through extensive ablations and qualitative analyses, we put forward a new research direction for language model pre-training. View details
    Preview abstract The speech representation learning approaches, for nonsemantic tasks like language recognition, have either explored supervised embedding extraction methods using a classifier model or the self-supervised representation learning approach using raw data. In this paper, we propose a novel framework of combining the self-supervised representation learning with the language label information for the pre-training task. This framework, termed as label aware speech representation learning (LASR), uses a triplet based objective function to incorporate the language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the identification task. The language recognition experiments are performed on two public datasets - FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over the state-of-art systems in terms of recognition performance. We also report an analysis of the robustness of the LASR approach to noisy/missing labels as well as the application of the LASR model for downstream multi-lingual speech recognition tasks. View details
    MASR: Multi-Label Aware Speech Representation
    Anjali Raj
    Shikhar Bharadwaj
    Sriram Ganapathy
    2023 Workshop on Automatic Speech Recognition and Understanding (ASRU) (2023)
    Preview abstract In the recent years, speech representation learning is constructed primarily as a self-supervised learning (SSL) task, using the raw audio signal alone, while ignoring the sideinformation that is often available for a given speech recording. Incorporation of side information in existing techniques is constrained to a specific category of meta-data, thereby imposing limitations. Furthermore, these approaches exhibit inefficiencies in their utilization of such information. In this paper, we propose MASR , a Multi-label Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables the inclusion of external knowledge sources to enhance the utilization of meta-data information. Using MASR representations, we perform evaluation on several downstream tasks such as language identification and speech recognition. In these experiments, we illustrate significant performance improvements for the MASR over other established benchmarks. A key advantage of the MASR is that it can be combined with any choice of SSL method. We perform a detailed analysis on the language identification task which illustrates how the proposed loss function enables the representations to separate closely related languages. We also investigate the application of the proposed approach for other non-semantic tasks such as speaker and emotion recognition. View details