Approaches for Neural-Network Language Model Adaptation
My work focuses on research and development of automatic speech recognition, large language modeling, multimodal multilingual modeling, etc.
Research Areas
Authored Publications
Google Publications
Other Publications
Sort By
Multimodal Language Identification
Shikhar Bharadwaj
Sid Dalmia
Sriram (Sri) Ganapathy
Yu Zhang
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (2023) (to appear)
Preview abstract
Language identification (LangID) of video data, the task of determining the spoken language in a given multimedia file, is primarily treated as a speech based language recognition task. On the other hand, text based language recognition is employed for written language content. In this work, we present a multimodal LangID system for video data that combines speech and text features to achieve state-of-the-art performance. We show that title and description of the video along with other meta-data, like geographic upload location of the video, contain substantial information regarding the language identity of the video recording. With a single multimodal model that can encode speech and text data, we build a language recognition system that can combine the information from speech, text and geographic location data. We experiment on public language recognition tasks with the Dhwani (22 language) dataset and the VoxLingua (107 language) dataset. In these settings, the proposed system achieves an absolute improvement of 6.6% and 5.6% in F1 score over the speech only baseline, respectively. We also provide an ablation study highlighting the contribution of different modalities for the language recognition task.
View details
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Sebastian Ruder
Shruti Rijhwani
Jean-Michel Sarr
Cindy Wang
John Wieting
Christo Kirov
Dana L. Dickinson
Bidisha Samanta
Connie Tao
David Adelani
Reeve Ingle
Dmitry Panteleev
Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 1856-1884
Preview abstract
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) — languages for which NLP research is particularly far behind in meeting user needs — it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks — tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
View details
MASR: Multi-Label Aware Speech Representation
Anjali Raj
Shikhar Bharadwaj
Sriram Ganapathy
2023 Workshop on Automatic Speech Recognition and Understanding (ASRU) (2023)
Preview abstract
In the recent years, speech representation learning is constructed primarily as a self-supervised learning (SSL) task, using the raw audio signal alone, while ignoring the sideinformation that is often available for a given speech recording. Incorporation of side information in existing techniques is constrained to a specific category of meta-data, thereby imposing limitations. Furthermore, these approaches exhibit inefficiencies in their utilization of such information. In this paper, we propose MASR , a Multi-label Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables the inclusion of external knowledge sources to enhance the utilization of meta-data information. Using MASR representations, we perform evaluation on several downstream tasks such as language identification and speech recognition. In these experiments, we illustrate significant performance improvements for the MASR over other established benchmarks. A key advantage of the MASR is that it can be combined with any choice of SSL method. We perform a detailed analysis on the language identification task which illustrates how the proposed loss function enables the representations to separate closely related languages. We also investigate the application of the proposed approach for other non-semantic tasks such as speaker and emotion recognition.
View details
Label Aware Speech Representation Learning For Language Identification
Shikhar Bharadwaj
Sriram Ganapathy
Wei Han
Proceedings of Interspeech 2023, pp. 5351-5355
Preview abstract
The speech representation learning approaches, for nonsemantic tasks like language recognition, have either explored supervised embedding extraction methods using a classifier model or the self-supervised representation learning approach using raw data. In this paper, we propose a novel framework of combining the self-supervised representation learning with the language label information for the pre-training task. This framework, termed as label aware speech representation learning (LASR), uses a triplet based objective function to incorporate the language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the identification task. The language recognition experiments are performed on two public datasets - FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over the state-of-art systems in terms of recognition performance. We also report an analysis of the robustness of the LASR approach to noisy/missing labels as well as the application of the LASR model for downstream multi-lingual speech recognition tasks.
View details
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau
Simran Khanuja
Yu Zhang
Siddharth Dalmia
Clara Rivera
IEEE Spoken Language Technology Workshop (SLT) (2022)
Preview abstract
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.
View details
XTREME-S: Evaluating Cross-lingual Speech Representations
Clara E. Rivera
Sebastian Ruder
Simran Khanuja
Ye Jia
Yu Zhang
Proc. Interspeech 2022
Preview abstract
We introduce \xtremes, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, retrieval and speech-to-text translation. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in ``universal'' speech representation learning. This paper describes the new benchmark and establishes the first speech-only and speech-text baselines using XLS-R and mSLAM on all downstream tasks. We motivate the design choices and detail how to use the benchmark. The code and pre-processing scripts will be made publicly available.\footnote{\small\url{https://huggingface.co/datasets/google/xtreme_s}}
View details
Improving Streaming ASR with Non-streaming Model Distillation on Unsupervised Data
Chung-Cheng Chiu
Liangliang Cao
Ruoming Pang
Thibault Doutre
Wei Han
Yu Zhang
Zhiyun Lu
ICASSP 2021 (to appear)
Preview abstract
Streaming end-to-end Automatic Speech Recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Streaming models almost always perform worse than non-streaming models.
We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher, generating transcripts on an arbitrary large data set, to better distill knowledge into streaming ASR models. This way, we are able to scale the training of streaming models to 3M hours of YouTube audio. Experiments show that our approach can significantly reduce the Word Error Rate (WER) of RNN-T models in four languages trained from YouTube data.
View details
Transliteration based approaches to improve code-switched speech recognition performance
Jesse Emond
IEEE Spoken Language Technology Workshop (SLT) (2018), pp. 448-455
Preview abstract
Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER) that smoothes out many of these irregularities by mapping all text to one writing system and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline.
View details
No Results Found