Bhuvana Ramabhadran

Bhuvana Ramabhadran

Bhuvana Ramabhadran (IEEE Fellow, 2017, ISCA Fellow 2017) currently leads a team of researchers at Google, focusing on semi-supervised learning for speech recognition and multilingual speech recognition. Previously, she was a Distinguished Research Staff Member and Manager in IBM Research AI, at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM’s world wide laboratories in the areas of speech recognition, synthesis, and spoken term detection. She has served as an elected member of the IEEE SPS Speech and Language Technical Committee (SLTC), for two terms since 2010 and as its elected Vice Chair and Chair (2014–2016) and currently serves as an Advisory Member. She has served as the Area Chair for ICASSP (2011–2018), on the editorial board of the IEEE Transactions on Audio, Speech, and Language Processing (2011–2015), and on the IEEE SPS conference board (2017-2018) during which she also served as the conference board’s liaison with the ICASSP organizing committees , and as Regional Director-At-Large (2018-2020), where she coordinated work across all US IEEE chapters . She currently serves as the Chair of the IEEE Flanagan Speech & Audio Award Committee and currently serves as a Member-at-Large of the IEEE SPS Board of Governors. She serves on the International Speech Communication Association (ISCA) board and has served as the area chair for Interspeech conferences since 2012. In addition to organizing several workshops at ICML, HLT-NAACL, NeurIPS and ICML, she has also served as an adjunct professor at Columbia University, where she co-taught a graduate course on speech recognition. She has served as the (Co/-)Principal Investigator on several projects funded by the National Science Foundation, EU and iARPA, spanning speech recognition, information retrieval from spoken archives, keyword spotting in many languages. She has published over 150 papers and been granted over 40 U.S. patents. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on the use of speech synthesis to improve core speech recognition performance and self-supervised learning.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract This paper discusses a method to inject text when training an ASR system without the need for up sampling the text sequence to match the length of the speech sequence. View details
    Preview abstract This paper proposes Virtuoso, a massive multilingual speech–text joint learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech–text paired data in low-resource languages. This study extends Maestro, which is a speech–text semi-supervised joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline TTS models in seen languages, and 2) these models can synthesize reasonably good speech for unseen languages where no paired TTS data is available. View details
    Preview abstract Hard and soft distillation are two popular approaches for knowledge distillation from a teacher to student ASR model. Despite soft distillation being better than hard distillation, it has several limitations. First, training convergence depends on the match between the teacher and student alignments. Second, soft distillation suffers quality regressions when using teacher and student models with different architectures. Third, in case of non-causal teacher models, soft distillation requires tuning of the shift in teacher alignments to the right. Finally, soft distillation requires both the teacher and student models to have the same temporal sampling rates. In this work, we propose a novel knowledge distillation method for RNN-T models that tackles limitations of both hard and soft distillation approaches. We call our method Full-sum distillation, which simply distills the sequence posterior probability of the teacher model to the student model. Thus, this method does not depend directly on the noisy labels to distill knowledge as well as it does not depend on time dimension. We also propose a variant of Full-sum distillation to distill the sequence discriminative knowledge of the teacher model to the student model to further improve performance. Using full-sum distillation, we achieve significant improvements when training with strong and weak teacher models on public data as well as on in-house production data. View details
    Twenty-Five Years of Evolution in Speech and Language Processing
    Michael Picheny
    Dilek Hakkani-Tur
    IEEE Signal Processing Magazine, 40(2023), pp. 27-39
    Preview
    Ask2Mask: Guided Data Selection for Masked Speech Modeling
    Pedro Jose Moreno Mengibar
    Yu Zhang
    IEEE Journal of Selected Topics in Signal Processing(2022)
    Preview abstract Masked speech modeling (MSM) pre-training methods such as wav2vec2 or w2v-BERT randomly mask speech frames in an utterance and compute losses on the masked instances. While these methods improve performance of Automated Speech Recognition (ASR) systems, they have one major limitation. They generally perform best under matched conditions, i.e., when the data used for pre-training is matched to the data used for fine-tuning. Using out-of-domain (OOD) pre-training data with limited in-domain fine-tuning data from the target domain results in reduced gains. The relative value of in-domain data within a MSM pre-training corpus has not been well-explored in the literature. In this work, we address precisely this limitation. We propose ask2mask, a novel approach to focus on samples relevant to the target domain (in-domain) during pre-training with OOD or any available data. To perform this fine-grained data selection, ATM applies masking only to input frames with high confidence scores obtained from an external classification model. This allows the model to achieve meaningful in-domain representations and simultaneously discard low-confidence frames which could lead to learning erroneous representations. The ATM approach is further extended to focus on utterances with high confidences by scaling the final MSM loss computed for each masked input frame with the utterance-level confidence score. We conduct experiments on two well-benchmarked read speech corpus (Librispeech) and conversational speech corpus (AMI). The results substantiate the efficacy of ATM on significantly improving target domain performance under mismatched conditions while still yielding modest improvements under matched conditions. View details
    Preview abstract Second-pass rescoring is a well known technique to improve the performance of Automatic Speech Recognition (ASR) systems. Neural oracle search (NOS), which selects the most likely hypothesis from N-best hypothesis list by integrating in-formation from multiple sources, such as the input acoustic representations, N-best hypotheses, additional first-pass statistics,and unpaired textual information through an external language model, has shown success in re-scoring for RNN-T first-pass models. Multilingual first-pass speech recognition models of-ten outperform their monolingual counterparts when trained on related or low-resource languages. In this paper, we investigate making the second-pass model multilingual and apply rescoring on a multilingual first-pass. We conduct experiments on Nordic languages including Danish, Dutch, Finnish, Norwegian and Swedish. View details
    Preview abstract Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in~\cite{zhehuai2021} can be leveraged to train a massively multilingual ASR model without any transcribed speech. In most zero resource conditions, lack of transcribed speech also implies lack of lexicons. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero transcribed speech, real-world setting to expand the set of languages covered by ASR models with only unlabeled speech and text in the target languages. We define the task to cover $102$ languages, where transcribed speech is available in $52$ of these languages and can be used to improve end-to-end ASR quality on the remaining $50$. First, we show that by combining speech representations with byte-level text representations coupled with the effective use of language embeddings, we can dramatically reduce the resource requirements for deploying an ASR model to a new language. On the FLEURS dataset, this approach is able to reduce the CER on languages with no transcribed speech from 64.1\% to 29.6\%, a relative reduction of 54\%. Second, using a subset of Indic languages we show that the proposed method can learn effectively from languages with transcribed speech even when there is limited to no graphemeic overlap with the target languages, reducing the average CER of the target languages from 56.3 to 17.2. We believe this is the first demonstration that competitive ASR performance can be achieved for an unseen language using no language resources other than text and untranscribed speech. View details
    Preview abstract Building inclusive speech recognition systems is a crucial step towards developing technologies that speakers of all language varieties can use. Therefore, ASR systems must work for everybody independently of the way they speak. To accomplish this goal, there should be available data sets representing language varieties, and also an understanding of model configuration that is the most helpful in achieving robust understanding of all types of speech. However, there are not enough data sets for accented speech, and for the ones that are already available, more training approaches need to be explored to improve the quality of accented speech recognition. In this paper, we discuss recent progress towards developing more inclusive ASR systems, namely, the importance of building new data sets representing linguistic diversity, and exploring novel training approaches to improve performance for all users. We address recent directions within benchmarking ASR systems for accented speech, measure the effects of wav2vec 2.0 pre-training on accented speech recognition, and highlight corpora relevant for diverse ASR evaluations. View details
    Preview abstract Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve performance of Automatic Speech Recognition (ASR) systems, they have one major limitation. They treat all unsupervised speech samples with equal weight, which hinders learning as not all samples have relevant information to learn meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training. ATM employs an external ASR model or scorer to weight unsupervised input samples in two different ways: 1) A fine-grained data selection is performed by masking over the highly confident input frames as chosen by the scorer. This allows the model to learn meaningful representations. 2) ATM is further extended to focus at utterancelevel by weighting the final MSM loss with the utterancelevel confidence score. We conduct fine-tuning experiments on two well-benchmarked corpora: LibriSpeech (matching the pretraining data) and Commonvoice, TED-LIUM, AMI and CHiME6 (not matching the pre-training data). The results substantiate the efficacy of ATM on significantly improving the recognition performance under mismatched conditions (up to 11.6% relative over published results and upto 4.46% relative over our internal baseline) while still yielding modest improvements under matched conditions. View details
    Preview abstract This paper explores ways to improve a two-pass speech recognition system when the first-pass is hybrid autoregressive transducer model and the second-pass is a neural language model. The main focus is on the scores provided by each of these models, their quantitative analysis, how to improve them and the best way to integrate them with the objective of better recognition accuracy. Several analysis are presented to show the importance of the choice of the integration weights for combining the first-pass and the second-pass scores. A sequence level weight estimation model along with four training criteria are proposed which allow adaptive integration of the scores per acoustic sequence. The effectiveness of this algorithm is demonstrated by constructing and analyzing models on the Librispeech data set. View details