Rohit Prabhavalkar
Rohit Prabhavalkar received his PhD in Computer Science and Engineering from The Ohio State University, USA, in 2013. Following his PhD, Rohit joined the Speech Technologies group at Google where he is currently a Staff Research Scientist. At Google, his research has focused primarily on developing compact acoustic models which can run efficiently on mobile devices, and on developing improved end-to-end automatic speech recognition systems. Rohit has co-authored over 70 refereed papers, which have received two best paper awards (ASRU 2017; ICASSP 2018). He has previously served as an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2021-2024), and currently serves as a member of the IEEE Speech and Language Processing Technical Committee (2018-2021; 2021-2024).
Research Areas
Authored Publications
Sort By
Improving Deliberation by Text-Only and Semi-Supervised Training
Kevin Hu
Weiran Wang
Interspeech 2022 (2022) (to appear)
Preview abstract
Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text or speech data. In this work, we propose text-only and semi-supervised training for attention-decoder based deliberation. By incorporating text-only data in training a bidirectional encoder representation from transformer (BERT) for the deliberation text encoder, joint acoustic and text decoder (JATD) training, and semi-supervised training based on a conventional model as a teacher, we achieved up to 11.7% WER reduction compared to the baseline deliberation. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the WER by 8% relative for Google Voice Search with reasonable endpointing latencies. We show that the deliberation has achieved a positive human side-by-side evaluation compared to LM rescoring.
View details
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
David Rybach
Cal Peyser
Zhiyun Lu
Interspeech 2022 (2022) (to appear)
Preview abstract
Improving the performance of end-to-end ASR models on long utterances of minutes to hours is an ongoing problem in speech recognition.
A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundaries based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set a alarm for... 5 o'clock").
Here, we propose replacing the VAD with an end-to-end ASR model capable of predicting segment boundaries, allowing the segmentation to be conditioned not only on deeper acoustic features but also on linguistic features from the decoded text, while requiring negligible extra compute.
In experiments on real world long-form audio (YouTube) of up to 30 minutes long, we demonstrate WER gains of 5\% relative to the VAD baseline on a state-of-the-art Conformer RNN-T setup.
View details
REPLACING HUMAN-RECORDED AUDIO WITH SYNTHETIC AUDIOFOR ON-DEVICE UNSPOKEN PUNCTUATION PREDICTION
Bogdan Prisacari
Daria Soboleva
Felix Weissenberger
Justin Lu
Márius Šajgalík
ICASSP 2021: International Conference on Acoustics, Speech and Signal Processing (2021) (to appear)
Preview abstract
We present a novel multi-modal unspoken punctuation prediction system for the English language, which relies on Quasi-Recurrent Neural Networks (QRNNs) applied jointly on the text output from automatic speech recognition and acoustic features.
%
We show significant improvements from adding acoustic features compared to the text-only baseline. Because annotated acoustic data is hard to obtain, we demonstrate that relying on only 20% of human-annotated audio and replacing the rest with synthetic text-to-speech (TTS) predictions, does not suffer from quality loss on LibriTTS corpus.
%
Furthermore, we demonstrate that through data augmentation using TTS models, we can remove human-recorded audio completely and outperform models trained on it.
View details
Learning Word-Level Confidence for Subword End-to-End ASR
David Qiu
Yu Zhang
Liangliang Cao
Deepti Bhatia
Wei Li
Ke Hu
ICASSP (2021)
Preview abstract
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
View details
Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging
David Johannes Rybach
Sean Campbell
ICASSP 2021, IEEE
Preview abstract
End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct models states, such models are decoded using an approximate beam-search which produces a tree of hypotheses.In this work, we study the influence of the amount of label context on the model’s accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam, and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or up to 15.7% when lattice rescoring is applied.
View details
Preview abstract
End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively to conventional models. To further improve the quality of an E2E model, two-pass decoding has been proposed to rescore streamed hypotheses using a non-streaming E2E model while maintaining a reasonable latency. However, the rescoring model uses only acoustics to rerank hypotheses. On the other hand, a class of neural correction models use only first-pass hypotheses for second-pass decoding. In this work, we propose to attend to both acoustics and first-pass hypotheses using the deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 25% relatively WER reduction compared to a recurrent neural network transducer, and 12% to LAS rescoring in Google Voice Search tasks. The improvement on a proper noun test set is even larger: 23% compared to LAS rescoring. The proposed model has a similar latency compared to LAS rescoring in decoding Voice Search utterances.
View details
A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
Ruoming Pang
Antoine Bruguier
Wei Li
Raziel Alvarez
Chung-Cheng Chiu
David Garcia
Kevin Hu
Minho Jin
Qiao Liang
Cal Peyser
David Rybach
(June) Yuan Shangguan
Yash Sheth
Mirkó Visontai
Yu Zhang
Ding Zhao
ICASSP (2020)
Preview abstract
Thus far, end-to-end (E2E) models have not shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains a 8% relative improvement in WER, while being more than 400-times smaller in model size.
View details
STREAMING END-TO-END SPEECH RECOGNITION FOR MOBILE DEVICES
Raziel Alvarez
Ding Zhao
David Rybach
Ruoming Pang
Qiao Liang
Deepti Bhatia
Yuan Shangguan
ICASSP (2019)
Preview abstract
End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
View details
Two-Pass End-to-End Speech Recognition
Ruoming Pang
David Rybach
Wei Li
Mirkó Visontai
Qiao Liang
Chung-Cheng Chiu
Interspeech (2019)
Preview abstract
The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models. However, this model still lags behind a large state-of-the-art conventional model in quality. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by a small fraction over RNN-T.
View details
On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
Kazuki Irie
Antoine Bruguier
David Rybach
Patrick Nguyen
Interspeech (2019)
Preview abstract
In conventional speech recognition, phoneme-based models outperform grapheme-based models for non-phonetic languages such as English. The performance gap between the two typically reduces as the amount of training data is increased. In this work, we examine the impact of the choice of modeling unit for attention-based encoder-decoder models. We conduct experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various target units (phoneme, grapheme, and word-piece); across all tasks, we find that grapheme or word-piece models consistently outperform phoneme-based models, even though they are evaluated without a lexicon or an external language model. We also investigate model complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from a strong word-piece based baseline with either the phoneme or the grapheme model. Rescoring an N-best list generated by the phonemic system, however, provides limited improvements. Further analysis shows that the word-piece-based models produce more diverse N-best hypotheses, and thus lower oracle WERs, than phonemic models.
View details