Rohit Prabhavalkar
Rohit Prabhavalkar received his PhD in Computer Science and Engineering from The Ohio State University, USA, in 2013. Following his PhD, Rohit joined the Speech Technologies group at Google where he is currently a Staff Research Scientist. At Google, his research has focused primarily on developing compact acoustic models which can run efficiently on mobile devices, and on developing improved end-to-end automatic speech recognition systems. Rohit has co-authored over 70 refereed papers, which have received two best paper awards (ASRU 2017; ICASSP 2018). He has previously served as an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2021-2024), and currently serves as a member of the IEEE Speech and Language Processing Technical Committee (2018-2021; 2021-2024).
Authored Publications
Improving Deliberation by Text-Only and Semi-Supervised Training
Kevin Hu
Weiran Wang
Interspeech 2022 (2022) (to appear)
Abstract
Text-only training, and semi-supervised training based on audio-only data, have recently gained popularity due to the wide availability of unlabeled text and speech data. In this work, we propose text-only and semi-supervised training for attention-decoder-based deliberation. By incorporating text-only data to train a Bidirectional Encoder Representations from Transformers (BERT) model for the deliberation text encoder, applying joint acoustic and text decoder (JATD) training, and using semi-supervised training with a conventional model as a teacher, we achieve up to 11.7% WER reduction compared to the baseline deliberation model. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces WER by 8% relative on Google Voice Search with reasonable endpointing latencies. We also show that the deliberation model achieves a positive human side-by-side evaluation compared to LM rescoring.
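As a rough illustration of the training setup described above (a sketch under assumptions, not the authors' implementation), the snippet below mixes three data sources for training a deliberation rescorer: supervised audio-text pairs, unlabeled audio pseudo-labeled by a conventional teacher model, and text-only sentences that exercise only the text path. The names Example, build_training_mix, and teacher_transcribe are hypothetical.

import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

@dataclass
class Example:
    audio: Optional[Sequence[float]]  # acoustic frames; None for text-only data
    text: str                         # reference or pseudo-label transcript

def build_training_mix(
    supervised: List[Tuple[Sequence[float], str]],
    unlabeled_audio: List[Sequence[float]],
    text_only: List[str],
    teacher_transcribe: Callable[[Sequence[float]], str],
) -> List[Example]:
    """Assemble a mixed training pool for deliberation training (illustrative only)."""
    pool: List[Example] = []
    # 1) Supervised audio-text pairs are used as-is.
    pool += [Example(audio=a, text=t) for a, t in supervised]
    # 2) Semi-supervised: a conventional teacher model provides pseudo-labels
    #    for unlabeled audio.
    pool += [Example(audio=a, text=teacher_transcribe(a)) for a in unlabeled_audio]
    # 3) Text-only data trains just the text encoder / decoder path
    #    (e.g. BERT-style initialization or JATD-style training).
    pool += [Example(audio=None, text=s) for s in text_only]
    random.shuffle(pool)
    return pool

# Toy usage with a dummy "teacher".
mix = build_training_mix(
    supervised=[([0.1, 0.2], "play some music")],
    unlabeled_audio=[[0.3, 0.4]],
    text_only=["set a timer for ten minutes"],
    teacher_transcribe=lambda audio: "pseudo label from teacher",
)
print(len(mix), "training examples")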
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
David Rybach
Cal Peyser
Zhiyun Lu
Interspeech 2022 (2022) (to appear)
Abstract
Improving the performance of end-to-end ASR models on long utterances of minutes to hours is an ongoing problem in speech recognition.
A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundaries based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock").
Here, we propose replacing the VAD with an end-to-end ASR model capable of predicting segment boundaries, allowing the segmentation to be conditioned not only on deeper acoustic features but also on linguistic features from the decoded text, while requiring negligible extra compute.
In experiments on real-world long-form audio (YouTube) up to 30 minutes long, we demonstrate WER gains of 5% relative to the VAD baseline on a state-of-the-art Conformer RNN-T setup.
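The decoding idea can be sketched as follows, assuming a streaming model whose output vocabulary contains an end-of-segment token (the function names and the <eos> label are illustrative, not the paper's API): instead of a separate VAD deciding boundaries from acoustics alone, the recognizer itself emits the boundary, so the decision can also depend on the decoded text.

from typing import Callable, Iterable, List, Sequence, Tuple

EOS_SEGMENT = "<eos>"  # hypothetical end-of-segment label in the output vocabulary

def segment_long_form(
    frames: Iterable[Sequence[float]],
    step: Callable[[Sequence[float], object], Tuple[List[str], object]],
    initial_state: object = None,
) -> List[List[str]]:
    """Decode long-form audio, starting a new segment whenever the model
    itself predicts an end-of-segment token (replacing a separate VAD)."""
    segments: List[List[str]] = []
    current: List[str] = []
    state = initial_state
    for frame in frames:
        tokens, state = step(frame, state)  # one streaming decode step
        for tok in tokens:
            if tok == EOS_SEGMENT:
                if current:
                    segments.append(current)
                current, state = [], initial_state  # reset for the next segment
            else:
                current.append(tok)
    if current:
        segments.append(current)
    return segments

# Toy model: emits one word per frame and a boundary after "o'clock".
def toy_step(frame, state):
    word = frame[0]
    return ([word, EOS_SEGMENT] if word == "o'clock" else [word]), state

audio = [["set"], ["an"], ["alarm"], ["for"], ["5"], ["o'clock"], ["and"], ["play"], ["music"]]
print(segment_long_form(audio, toy_step))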
Replacing Human-Recorded Audio with Synthetic Audio for On-Device Unspoken Punctuation Prediction
Bogdan Prisacari
Daria Soboleva
Felix Weissenberger
Justin Lu
Márius Šajgalík
ICASSP 2021: International Conference on Acoustics, Speech and Signal Processing (2021) (to appear)
Abstract
We present a novel multi-modal unspoken punctuation prediction system for the English language, which relies on Quasi-Recurrent Neural Networks (QRNNs) applied jointly to the text output from automatic speech recognition and to acoustic features.
We show significant improvements from adding acoustic features compared to the text-only baseline. Because annotated acoustic data is hard to obtain, we demonstrate that relying on only 20% of the human-annotated audio and replacing the rest with synthetic text-to-speech (TTS) predictions incurs no quality loss on the LibriTTS corpus.
Furthermore, we demonstrate that through data augmentation using TTS models, we can remove human-recorded audio completely and outperform models trained on it.
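A rough sketch of the augmentation idea (the synthesize stand-in for a TTS system and the label scheme are illustrative assumptions, not the paper's pipeline): punctuated text supplies the labels, and synthetic audio supplies acoustic features when human recordings are scarce.

import random
from typing import Callable, Dict, List, Sequence, Tuple

def build_punctuation_data(
    punctuated_sentences: List[str],
    human_audio: Dict[str, Sequence[float]],      # sentence -> recorded audio features
    synthesize: Callable[[str], Sequence[float]], # hypothetical TTS front end
    human_fraction: float = 0.2,
) -> List[Tuple[List[str], Sequence[float], List[str]]]:
    """Create (words, acoustic_features, punctuation_labels) examples,
    using human recordings for a small fraction and TTS audio for the rest."""
    examples = []
    for sent in punctuated_sentences:
        words, labels = [], []
        for token in sent.split():
            punct = token[-1] if token[-1] in ".,?!" else ""
            words.append(token.rstrip(".,?!").lower())
            labels.append(punct or "<none>")
        use_human = sent in human_audio and random.random() < human_fraction
        audio = human_audio[sent] if use_human else synthesize(sent)
        examples.append((words, audio, labels))
    return examples

data = build_punctuation_data(
    ["set an alarm for five, then play music."],
    human_audio={},
    synthesize=lambda s: [0.0] * len(s),  # dummy "synthetic audio"
)
print(data[0][0], data[0][2])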
Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging
David Johannes Rybach
Sean Campbell
ICASSP 2021, IEEE
Abstract
End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct model states, such models are decoded using an approximate beam search which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model's accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or by up to 15.7% when lattice rescoring is applied.
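To make the path-merging idea concrete, here is a toy sketch (not the paper's decoder): with a context of the last N labels, two beam hypotheses whose recent label histories match lead to identical model states, so only the better-scoring one needs further model evaluations while the other is retained for the lattice.

from typing import Dict, List, Tuple

def merge_paths(
    beam: List[Tuple[List[str], float]],  # (label sequence, log-prob score)
    context_size: int = 4,
) -> Tuple[List[Tuple[List[str], float]], List[Tuple[List[str], float]]]:
    """Merge hypotheses whose last `context_size` labels are identical.
    Returns (active hypotheses to expand further, merged-away paths for the lattice)."""
    best_by_context: Dict[Tuple[str, ...], Tuple[List[str], float]] = {}
    lattice_only: List[Tuple[List[str], float]] = []
    for labels, score in beam:
        key = tuple(labels[-context_size:])
        kept = best_by_context.get(key)
        if kept is None or score > kept[1]:
            if kept is not None:
                lattice_only.append(kept)          # demote the previous best
            best_by_context[key] = (labels, score)
        else:
            lattice_only.append((labels, score))   # redundant path: lattice only
    return list(best_by_context.values()), lattice_only

beam = [
    (["play", "some", "rock", "music"], -1.2),
    (["hey", "play", "some", "rock", "music"], -1.5),  # same last-4 label context
    (["play", "some", "folk", "music"], -2.0),
]
active, merged = merge_paths(beam, context_size=4)
print(len(active), "active,", len(merged), "merged into the lattice")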
Learning Word-Level Confidence for Subword End-to-End ASR
David Qiu
Yu Zhang
Liangliang Cao
Deepti Bhatia
Wei Li
Ke Hu
ICASSP (2021)
Abstract
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show substantial improvements in standard metrics (e.g., NCE, AUC, RMSE). The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
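As a simple illustration of the word-piece-to-word aggregation issue (one common heuristic, not the learned confidence model proposed in the paper), word-piece confidences can be collapsed to word level by taking the confidence of each word's final piece:

from typing import List, Tuple

def word_confidences(
    pieces: List[str],        # word pieces; "▁" marks the start of a new word
    piece_conf: List[float],  # per-piece confidence scores from the model
) -> List[Tuple[str, float]]:
    """Aggregate word-piece confidences to word level by taking the
    confidence of the last piece of each word (a simple heuristic)."""
    words: List[Tuple[str, float]] = []
    current, conf = "", 0.0
    for p, c in zip(pieces, piece_conf):
        if p.startswith("▁") and current:
            words.append((current, conf))
            current = ""
        current += p.lstrip("▁")
        conf = c  # keep the confidence of the most recent (i.e. last) piece
    if current:
        words.append((current, conf))
    return words

# "morningside" is split into two pieces; the word inherits the last piece's score.
print(word_confidences(["▁morning", "side", "▁park"], [0.92, 0.41, 0.88]))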
A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
Ruoming Pang
Antoine Bruguier
Wei Li
Raziel Alvarez
Chung-Cheng Chiu
David Garcia
Kevin Hu
Minho Jin
Qiao Liang
Cal Peyser
David Rybach
(June) Yuan Shangguan
Yash Sheth
Mirkó Visontai
Yu Zhang
Ding Zhao
ICASSP (2020)
Abstract
Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time it takes to finalize the hypothesis after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that together surpass a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, at the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
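A bare-bones sketch of the two-pass setup (the scoring functions and the interpolation weight lam are illustrative assumptions, not the paper's exact recipe): the first pass streams out an n-best list, and the second pass rescores each hypothesis using the full audio before the final result is selected.

from typing import Callable, List, Sequence, Tuple

def two_pass_decode(
    audio: Sequence[float],
    first_pass_nbest: Callable[[Sequence[float]], List[Tuple[str, float]]],
    second_pass_score: Callable[[Sequence[float], str], float],
    lam: float = 0.5,
) -> str:
    """Rescore the streaming first pass's n-best list with a non-streaming
    second pass and return the hypothesis with the best combined score."""
    nbest = first_pass_nbest(audio)  # [(hypothesis, first-pass log-prob), ...]
    rescored = [
        (hyp, (1.0 - lam) * fp_score + lam * second_pass_score(audio, hyp))
        for hyp, fp_score in nbest
    ]
    return max(rescored, key=lambda x: x[1])[0]

# Toy example with hard-coded scores.
best = two_pass_decode(
    audio=[0.0],
    first_pass_nbest=lambda a: [("call jon", -1.0), ("call john", -1.1)],
    second_pass_score=lambda a, h: -0.2 if h == "call john" else -0.9,
)
print(best)  # the second pass flips the ranking to "call john"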
Abstract
End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively with conventional models. To further improve the quality of an E2E model, two-pass decoding has been proposed to rescore streamed hypotheses using a non-streaming E2E model while maintaining a reasonable latency. However, the rescoring model uses only acoustics to rerank hypotheses. On the other hand, a class of neural correction models uses only first-pass hypotheses for second-pass decoding. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from the first-pass hypotheses. The proposed deliberation model achieves a 25% relative WER reduction compared to a recurrent neural network transducer, and 12% compared to LAS rescoring, on Google Voice Search tasks. The improvement on a proper noun test set is even larger: 23% compared to LAS rescoring. The proposed model has latency similar to LAS rescoring when decoding Voice Search utterances.
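A minimal PyTorch sketch of the core deliberation idea (dimensions, layer counts, and names are arbitrary choices for illustration, not the paper's configuration): a decoder step attends to both the acoustic encoding and a bidirectional encoding of the first-pass hypothesis text, and the two context vectors are combined before prediction.

import torch
import torch.nn as nn

class DeliberationStep(nn.Module):
    """One decoder step that attends to acoustics AND first-pass hypotheses."""

    def __init__(self, dim: int = 256, vocab: int = 4096, heads: int = 4):
        super().__init__()
        self.attend_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attend_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(2 * dim, vocab)  # combine the two context vectors

    def forward(self, query, audio_enc, hyp_enc):
        # query:     [B, 1, D]  current decoder state
        # audio_enc: [B, T, D]  acoustic encoder output
        # hyp_enc:   [B, U, D]  bidirectional encoding of first-pass hypotheses
        audio_ctx, _ = self.attend_audio(query, audio_enc, audio_enc)
        text_ctx, _ = self.attend_text(query, hyp_enc, hyp_enc)
        return self.out(torch.cat([audio_ctx, text_ctx], dim=-1))

step = DeliberationStep()
logits = step(torch.randn(2, 1, 256), torch.randn(2, 50, 256), torch.randn(2, 12, 256))
print(logits.shape)  # torch.Size([2, 1, 4096])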
Abstract
End-to-End (E2E) automatic speech recognition (ASR) systems learn word spellings directly from text-audio pairs, in contrast to traditional ASR systems which incorporate a separate pronunciation lexicon. The lexicon allows a traditional system to correctly spell rare words unobserved in training, if their phonetic pronunciation is known during inference. E2E systems, however, are more likely to misspell rare words.
In this work, we propose an E2E model which benefits from the best of both worlds: it outputs graphemes, and thus learns to spell words directly, while also being able to leverage pronunciations for words which might be likely in a given context. Our model, which we name Phoebe, is based on the recently proposed Contextual Listen, Attend and Spell (CLAS) model. As in CLAS, our model accepts a set of bias phrases and learns an embedding for them which is jointly optimized with the rest of the ASR system. In contrast to CLAS, which accepts only the textual form of the bias phrases, the proposed model also has access to phonetic embeddings, which, as we show, improves performance on challenging test sets which include words unseen in training. The proposed model provides a 16% relative word error rate reduction over CLAS when both the phonetic and written representations of the context bias phrases are used.
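The bias-phrase representation described above can be sketched roughly as follows (sizes and names are illustrative assumptions, not the CLAS or Phoebe implementation): each bias phrase is embedded from both its written form and its pronunciation, and the decoder can then attend over the resulting set of bias embeddings.

import torch
import torch.nn as nn

class BiasPhraseEncoder(nn.Module):
    """Embed bias phrases from both graphemes and phonemes (illustrative)."""

    def __init__(self, n_graphemes=64, n_phonemes=48, dim=128):
        super().__init__()
        self.g_emb = nn.Embedding(n_graphemes, dim)
        self.p_emb = nn.Embedding(n_phonemes, dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, grapheme_ids, phoneme_ids):
        # grapheme_ids: [N, Lg], phoneme_ids: [N, Lp] for N bias phrases
        g = self.g_emb(grapheme_ids).mean(dim=1)  # pool over characters
        p = self.p_emb(phoneme_ids).mean(dim=1)   # pool over phonemes
        return self.proj(torch.cat([g, p], dim=-1))  # one vector per phrase

enc = BiasPhraseEncoder()
bias_vectors = enc(torch.randint(0, 64, (3, 10)), torch.randint(0, 48, (3, 8)))
print(bias_vectors.shape)  # torch.Size([3, 128]): one embedding per bias phrase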
Abstract
Contextual biasing in end-to-end (E2E) models is challenging because E2E models do poorly on proper nouns and only a limited number of candidates are kept during beam-search decoding. This problem is exacerbated when biasing towards proper nouns in foreign languages, such as geographic location names, which are virtually unseen in training and are thus out-of-vocabulary (OOV). While a grapheme or wordpiece E2E model might have a difficult time spelling OOV words, phonemes are more acoustically oriented, and past work has shown that E2E models can better predict phonemes for such words. In this work, we address the OOV issue by incorporating phonemes in a wordpiece E2E model, and perform contextual biasing at the phoneme level to recognize foreign words. Phonemes are mapped from the source language to the foreign language and subsequently transduced to foreign words using pronunciations. We show that phoneme-based biasing performs 16% better than a grapheme-only biasing model, and 8% better than a wordpiece-only biasing model, on a foreign place name recognition task, while causing only slight degradation on regular English tasks.
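A toy sketch of the phoneme-level biasing flow (the phoneme inventories, mapping table, and pronunciation below are made-up examples, not the paper's data): decoded source-language phonemes are mapped into the foreign language's phoneme inventory and then matched against pronunciations of the contextual bias words.

from typing import Dict, List, Tuple

# Hypothetical mapping from source-language phonemes to the foreign inventory.
PHONE_MAP: Dict[str, str] = {"m": "m", "y": "ʏ", "n": "n", "i": "ɪ", "k": "ç"}

# Hypothetical pronunciation lexicon for foreign bias words (e.g. place names).
BIAS_LEXICON: Dict[Tuple[str, ...], str] = {
    ("m", "ʏ", "n", "ɪ", "ç"): "München",
}

def bias_to_foreign_word(source_phones: List[str]) -> str:
    """Map decoded phonemes into the foreign inventory, then transduce them
    to a contextual bias word via its pronunciation if one matches."""
    mapped = tuple(PHONE_MAP.get(p, p) for p in source_phones)
    return BIAS_LEXICON.get(mapped, " ".join(source_phones))

print(bias_to_foreign_word(["m", "y", "n", "i", "k"]))  # -> München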
Abstract
The tradeoff between word error rate (WER) and latency is very important for online automatic speech recognition (ASR) applications. We want the system to endpoint and close the microphone as quickly as possible, without degrading WER. For conventional ASR systems, the endpointer is a separate model from the acoustic, pronunciation and language models (AM, PM, LM), which can often cause endpointing problems resulting in either a higher WER or a larger latency. In keeping with the all-neural spirit of end-to-end (E2E) models, which fold the AM, PM and LM into one neural network, in this work we look at folding the endpointer into the E2E model as well. On a large vocabulary Voice Search task, we show that joint optimization of the endpointer with the E2E model results in no quality degradation and reduces latency by more than a factor of 2 compared to having a separate endpointer with the E2E model.
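A small sketch of the joint endpointing decision (the <eoq> symbol and the threshold are illustrative assumptions, not the exact mechanism in the paper): once the E2E model assigns a high enough probability to an end-of-query label, the microphone is closed and the hypothesis is finalized, with no separate endpointer model.

from typing import Callable, Dict, Iterable

def decode_with_endpointer(
    frames: Iterable,
    step: Callable[..., Dict[str, float]],  # per-frame label posteriors
    eoq_label: str = "<eoq>",
    threshold: float = 0.7,
) -> str:
    """Run the streaming model and close the microphone as soon as the
    end-of-query label becomes sufficiently probable."""
    hypothesis = []
    for frame in frames:
        posteriors = step(frame)
        if posteriors.get(eoq_label, 0.0) > threshold:
            break  # microphone closed: the ASR model itself decided the endpoint
        hypothesis.append(max(posteriors, key=posteriors.get))
    return " ".join(hypothesis)

# Toy posteriors: the model becomes confident about <eoq> on the fourth frame.
fake_posteriors = [
    {"play": 0.9, "<eoq>": 0.0},
    {"some": 0.8, "<eoq>": 0.1},
    {"music": 0.7, "<eoq>": 0.2},
    {"music": 0.1, "<eoq>": 0.8},
    {"pad": 0.5, "<eoq>": 0.1},
]
print(decode_with_endpointer(range(5), lambda i: fake_posteriors[i]))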