Rohit Prabhavalkar

Rohit Prabhavalkar

Rohit Prabhavalkar received his PhD in Computer Science and Engineering from The Ohio State University, USA, in 2013. Following his PhD, Rohit joined the Speech Technologies group at Google where he is currently a Staff Research Scientist. At Google, his research has focused primarily on developing compact acoustic models which can run efficiently on mobile devices, and on developing improved end-to-end automatic speech recognition systems. Rohit has co-authored over 70 refereed papers, which have received two best paper awards (ASRU 2017; ICASSP 2018). He has previously served as an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2021-2024), and currently serves as a member of the IEEE Speech and Language Processing Technical Committee (2018-2021; 2021-2024).
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text or speech data. In this work, we propose text-only and semi-supervised training for attention-decoder based deliberation. By incorporating text-only data in training a bidirectional encoder representation from transformer (BERT) for the deliberation text encoder, joint acoustic and text decoder (JATD) training, and semi-supervised training based on a conventional model as a teacher, we achieved up to 11.7% WER reduction compared to the baseline deliberation. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the WER by 8% relative for Google Voice Search with reasonable endpointing latencies. We show that the deliberation has achieved a positive human side-by-side evaluation compared to LM rescoring. View details
    Preview abstract Improving the performance of end-to-end ASR models on long utterances of minutes to hours is an ongoing problem in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundaries based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set a alarm for... 5 o'clock"). Here, we propose replacing the VAD with an end-to-end ASR model capable of predicting segment boundaries, allowing the segmentation to be conditioned not only on deeper acoustic features but also on linguistic features from the decoded text, while requiring negligible extra compute. In experiments on real world long-form audio (YouTube) of up to 30 minutes long, we demonstrate WER gains of 5\% relative to the VAD baseline on a state-of-the-art Conformer RNN-T setup. View details
    Preview abstract End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct models states, such models are decoded using an approximate beam-search which produces a tree of hypotheses.In this work, we study the influence of the amount of label context on the model’s accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam, and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or up to 15.7% when lattice rescoring is applied. View details
    Preview abstract We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model. View details
    REPLACING HUMAN-RECORDED AUDIO WITH SYNTHETIC AUDIOFOR ON-DEVICE UNSPOKEN PUNCTUATION PREDICTION
    Bogdan Prisacari
    Daria Soboleva
    Felix Weissenberger
    Justin Lu
    Márius Šajgalík
    ICASSP 2021: International Conference on Acoustics, Speech and Signal Processing(2021) (to appear)
    Preview abstract We present a novel multi-modal unspoken punctuation prediction system for the English language, which relies on Quasi-Recurrent Neural Networks (QRNNs) applied jointly on the text output from automatic speech recognition and acoustic features. % We show significant improvements from adding acoustic features compared to the text-only baseline. Because annotated acoustic data is hard to obtain, we demonstrate that relying on only 20% of human-annotated audio and replacing the rest with synthetic text-to-speech (TTS) predictions, does not suffer from quality loss on LibriTTS corpus. % Furthermore, we demonstrate that through data augmentation using TTS models, we can remove human-recorded audio completely and outperform models trained on it. View details
    Preview abstract Thus far, end-to-end (E2E) models have not shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains a 8% relative improvement in WER, while being more than 400-times smaller in model size. View details
    Preview abstract End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively to conventional models. To further improve the quality of an E2E model, two-pass decoding has been proposed to rescore streamed hypotheses using a non-streaming E2E model while maintaining a reasonable latency. However, the rescoring model uses only acoustics to rerank hypotheses. On the other hand, a class of neural correction models use only first-pass hypotheses for second-pass decoding. In this work, we propose to attend to both acoustics and first-pass hypotheses using the deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 25% relatively WER reduction compared to a recurrent neural network transducer, and 12% to LAS rescoring in Google Voice Search tasks. The improvement on a proper noun test set is even larger: 23% compared to LAS rescoring. The proposed model has a similar latency compared to LAS rescoring in decoding Voice Search utterances. View details
    Preview abstract Contextual biasing in end-to-end (E2E) models is challenging because E2E models do poorly in proper nouns and a limited number of candidates are kept for beam search decoding. This problem is exacerbated when biasing towards proper nouns in foreign languages, such as geographic location names, which are virtually unseen in training and are thus out-of-vocabulary (OOV). While a grapheme or wordpiece E2E model might have a difficult time spelling OOV words, phonemes are more acoustically oriented, and past work has shown that E2E models can better predict phonemes for such words. In this work, we address the OOV issue by incorporating phonemes in a wordpiece E2E model, and perform contextual biasing at the phoneme level to recognize foreign words. Phonemes are mapped from the source language to the foreign language and subsequently transduced to foreign words using pronunciations. We show that phoneme-based biasing performs 16% better than a grapheme-only biasing model, and 8% better than the wordpiece-only biasing model on a foreign place name recognition task, while causing slight degradation on regular English tasks. View details
    Preview abstract End-to-End (E2E) automatic speech recognition (ASR) systems learn word spellings directly from text-audio pairs, in contrast to traditional ASR systems which incorporate a separate pronunciation lexicon. The lexicon allows a traditional system to correctly spell rare words unobserved in training, if their phonetic pronunciation is known during inference. E2E systems, however, are more likely to misspell rare words. In this work we propose an E2E model which benefits from the best of both worlds: it outputs graphemes, and thus learns to spell words directly, while also being able to leverage pronunciations for words which might be likely in a given context. Our model, which we name Phoebe, is based on the recently proposed Contextual Listen Attend and Spell model (CLAS). As in CLAS, our model accepts a set of bias phrases and learns an embedding for them which is jointly optimized with the rest of the ASR system. In contrast to CLAS, which accepts only the textual form of the bias phrases, the proposed model also has access to phonetic embeddings, which as we show improves performance on challenging test sets which include words unseen in training. The proposed model provides a 16% relative word error rate reduction over CLAS when both the phonetic and written representation of the context bias phrases are used. View details
    Preview abstract In conventional speech recognition, phoneme-based models outperform grapheme-based models for non-phonetic languages such as English. The performance gap between the two typically reduces as the amount of training data is increased. In this work, we examine the impact of the choice of modeling unit for attention-based encoder-decoder models. We conduct experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various target units (phoneme, grapheme, and word-piece); across all tasks, we find that grapheme or word-piece models consistently outperform phoneme-based models, even though they are evaluated without a lexicon or an external language model. We also investigate model complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from a strong word-piece based baseline with either the phoneme or the grapheme model. Rescoring an N-best list generated by the phonemic system, however, provides limited improvements. Further analysis shows that the word-piece-based models produce more diverse N-best hypotheses, and thus lower oracle WERs, than phonemic models. View details