Rohit Prabhavalkar
Rohit Prabhavalkar received his PhD in Computer Science and Engineering from The Ohio State University, USA, in 2013. Following his PhD, Rohit joined the Speech Technologies group at Google where he is currently a Staff Research Scientist. At Google, his research has focused primarily on developing compact acoustic models which can run efficiently on mobile devices, and on developing improved end-to-end automatic speech recognition systems. Rohit has co-authored over 70 refereed papers, which have received two best paper awards (ASRU 2017; ICASSP 2018). He currently serves as a member of the IEEE Speech and Language Processing Technical Committee (2018-2021; 2021-2024) and as an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Authored Publications
Google Publications
Other Publications
Improving Deliberation by Text-Only and Semi-Supervised Training
Kevin Hu
Weiran Wang
Interspeech 2022 (2022) (to appear)
Preview abstract
Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text or speech data. In this work, we propose text-only and semi-supervised training for attention-decoder based deliberation. By incorporating text-only data in training a bidirectional encoder representation from transformer (BERT) for the deliberation text encoder, joint acoustic and text decoder (JATD) training, and semi-supervised training based on a conventional model as a teacher, we achieve up to 11.7% WER reduction compared to the baseline deliberation model. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the WER by 8% relative for Google Voice Search with reasonable endpointing latencies. We show that the deliberation model achieves a positive human side-by-side evaluation compared to LM rescoring.
View details
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
Zhiyun Lu
Interspeech 2022 (2022) (to appear)
Preview abstract
Improving the performance of end-to-end ASR models on long utterances of minutes to hours is an ongoing problem in speech recognition.
A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundaries based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set a alarm for... 5 o'clock").
Here, we propose replacing the VAD with an end-to-end ASR model capable of predicting segment boundaries, allowing the segmentation to be conditioned not only on deeper acoustic features but also on linguistic features from the decoded text, while requiring negligible extra compute.
In experiments on real-world long-form audio (YouTube) of up to 30 minutes in length, we demonstrate WER gains of 5% relative to the VAD baseline on a state-of-the-art Conformer RNN-T setup.
View details
Replacing Human-Recorded Audio with Synthetic Audio for On-Device Unspoken Punctuation Prediction
Bogdan Prisacari
Daria Soboleva
Felix Weissenberger
Justin Lu
Márius Šajgalík
ICASSP 2021: International Conference on Acoustics, Speech and Signal Processing (2021) (to appear)
Preview abstract
We present a novel multi-modal unspoken punctuation prediction system for the English language, which relies on Quasi-Recurrent Neural Networks (QRNNs) applied jointly on the text output from automatic speech recognition and acoustic features.
We show significant improvements from adding acoustic features compared to the text-only baseline. Because annotated acoustic data is hard to obtain, we demonstrate that a model relying on only 20% human-annotated audio, with the rest replaced by synthetic text-to-speech (TTS) predictions, suffers no quality loss on the LibriTTS corpus.
Furthermore, we demonstrate that through data augmentation using TTS models, we can remove human-recorded audio completely and outperform models trained on it.
View details
Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging
Sean Campbell
ICASSP 2021, IEEE
Preview abstract
End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct model states, such models are decoded using an approximate beam search which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model’s accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam, and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or by up to 15.7% when lattice rescoring is applied.
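A minimal sketch of the path-merging idea (the beam contents and the value of k are illustrative, not from the paper): hypotheses whose last k labels agree are indistinguishable to a limited-context model, so only the best-scoring one needs to stay on the active beam.

```python
def merge_paths(beam, k):
    """beam: list of (labels, log_prob) hypotheses.
    Keep only the best hypothesis per length-k label suffix, since a
    limited-context model conditions only on the last k labels."""
    best = {}
    for labels, log_prob in beam:
        key = tuple(labels[-k:])
        if key not in best or log_prob > best[key][1]:
            best[key] = (labels, log_prob)
    return sorted(best.values(), key=lambda h: -h[1])

beam = [
    (["play", "some", "jazz"], -1.2),
    (["play", "sum", "jazz"], -3.4),   # shares the last label with the above
    (["pay", "some", "jazz"], -2.0),   # shares the last two labels with the first
]
merged = merge_paths(beam, k=1)  # all three share the suffix ("jazz",)
```

With k=1 the whole beam collapses to the single best path; with k=2 the hypothesis ending in ("sum", "jazz") survives as a distinct state. The merged-away paths could be kept in a lattice rather than discarded.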
View details
Learning Word-Level Confidence for Subword End-to-End ASR
David Qiu
Yu Zhang
Liangliang Cao
Deepti Bhatia
Wei Li
Ke Hu
ICASSP (2021)
Preview abstract
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
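A toy illustration of turning word-piece scores into word-level confidence (the "▁" word-boundary marker follows the SentencePiece convention; taking the minimum over a word's pieces is one simple heuristic, not necessarily the paper's learned model):

```python
def word_confidence(wordpieces, confidences):
    """wordpieces: subword tokens where a leading '▁' marks a word start.
    confidences: per-piece scores in [0, 1].
    Returns one score per word: the minimum over that word's pieces."""
    words, current = [], []
    for piece, conf in zip(wordpieces, confidences):
        if piece.startswith("▁") and current:
            words.append(min(current))  # close the previous word
            current = []
        current.append(conf)
    if current:
        words.append(min(current))
    return words

scores = word_confidence(["▁hel", "lo", "▁world"], [0.9, 0.6, 0.8])
# "hello" inherits the weakest of its two pieces.
```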
View details
Preview abstract
End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively with conventional models. To further improve the quality of an E2E model, two-pass decoding has been proposed to rescore streamed hypotheses using a non-streaming E2E model while maintaining a reasonable latency. However, the rescoring model uses only acoustics to rerank hypotheses. On the other hand, a class of neural correction models use only first-pass hypotheses for second-pass decoding. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves a 25% relative WER reduction compared to a recurrent neural network transducer, and a 12% reduction compared to LAS rescoring, on Google Voice Search tasks. The improvement on a proper noun test set is even larger: 23% compared to LAS rescoring. The proposed model has similar latency to LAS rescoring when decoding Voice Search utterances.
View details
A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
Ruoming Pang
Antoine Bruguier
Wei Li
Raziel Alvarez
Chung-Cheng Chiu
David Garcia
Kevin Hu
Minho Jin
Qiao Liang
(June) Yuan Shangguan
Yash Sheth
Mirkó Visontai
Yu Zhang
Ding Zhao
ICASSP (2020)
Preview abstract
Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpass a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
View details
Preview abstract
Contextual biasing in end-to-end (E2E) models is challenging because E2E models do poorly in proper nouns and a limited number of candidates are kept for beam search decoding. This problem is exacerbated when biasing towards proper nouns in foreign languages, such as geographic location names, which are virtually unseen in training and are thus out-of-vocabulary (OOV). While a grapheme or wordpiece E2E model might have a difficult time spelling OOV words, phonemes are more acoustically oriented, and past work has shown that E2E models can better predict phonemes for such words. In this work, we address the OOV issue by incorporating phonemes in a wordpiece E2E model, and perform contextual biasing at the phoneme level to recognize foreign words. Phonemes are mapped from the source language to the foreign language and subsequently transduced to foreign words using pronunciations. We show that phoneme-based biasing performs 16% better than a grapheme-only biasing model, and 8% better than the wordpiece-only biasing model on a foreign place name recognition task, while causing slight degradation on regular English tasks.
View details
Preview abstract
End-to-End (E2E) automatic speech recognition (ASR) systems learn word spellings directly from text-audio pairs, in contrast to traditional ASR systems which incorporate a separate pronunciation lexicon. The lexicon allows a traditional system to correctly spell rare words unobserved in training, if their phonetic pronunciation is known during inference. E2E systems, however, are more likely to misspell rare words.
In this work we propose an E2E model which benefits from the best of both worlds: it outputs graphemes, and thus learns to spell words directly, while also being able to leverage pronunciations for words which might be likely in a given context. Our model, which we name Phoebe, is based on the recently proposed Contextual Listen Attend and Spell model (CLAS). As in CLAS, our model accepts a set of bias phrases and learns an embedding for them which is jointly optimized with the rest of the ASR system. In contrast to CLAS, which accepts only the textual form of the bias phrases, the proposed model also has access to phonetic embeddings, which as we show improves performance on challenging test sets which include words unseen in training. The proposed model provides a 16% relative word error rate reduction over CLAS when both the phonetic and written representation of the context bias phrases are used.
View details
On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
Kazuki Irie
Antoine Bruguier
Patrick Nguyen
Interspeech (2019)
Preview abstract
In conventional speech recognition, phoneme-based models outperform grapheme-based models for non-phonetic languages such as English. The performance gap between the two typically reduces as the amount of training data is increased. In this work, we examine the impact of the choice of modeling unit for attention-based encoder-decoder models. We conduct experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various target units (phoneme, grapheme, and word-piece); across all tasks, we find that grapheme or word-piece models consistently outperform phoneme-based models, even though they are evaluated without a lexicon or an external language model. We also investigate model complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from a strong word-piece based baseline with either the phoneme or the grapheme model. Rescoring an N-best list generated by the phonemic system, however, provides limited improvements. Further analysis shows that the word-piece-based models produce more diverse N-best hypotheses, and thus lower oracle WERs, than phonemic models.
View details
Two-Pass End-to-End Speech Recognition
Ruoming Pang
Wei Li
Mirkó Visontai
Qiao Liang
Chung-Cheng Chiu
Interspeech (2019)
Preview abstract
The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models. However, this model still lags behind a large state-of-the-art conventional model in quality. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown quality comparable to large conventional models. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by only a small fraction over RNN-T.
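A minimal sketch of second-pass rescoring (hypotheses, scores, and the interpolation weight `lam` are all illustrative): each first-pass hypothesis is rescored by a log-linear combination of its streaming score and a second-pass score.

```python
def rescore(nbest, lam=0.5):
    """nbest: list of (hypothesis, first_pass_logp, second_pass_logp).
    Returns the hypothesis with the best log-linear combination of the
    streaming first-pass score and the second-pass rescorer score."""
    return max(nbest, key=lambda h: (1 - lam) * h[1] + lam * h[2])

nbest = [
    ("call mom", -1.5, -4.0),  # first pass prefers this hypothesis...
    ("call tom", -2.0, -1.0),  # ...but the second pass strongly disagrees
]
best = rescore(nbest)[0]
```

With `lam=0.5` the second pass flips the ranking; with `lam=0.0` the first-pass order is kept, so `lam` trades off the two models and would be tuned on a development set.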
View details
Preview abstract
The tradeoff between word error rate (WER) and latency is very important for online automatic speech recognition (ASR) applications. We want the system to endpoint and close the microphone as quickly as possible, without degrading WER. For conventional ASR systems, the endpointer is a separate model from the acoustic, pronunciation and language models (AM, PM, LM), which can often cause endpointer problems, with either a higher WER or a larger latency. In keeping with the all-neural spirit of end-to-end (E2E) models, which fold the AM, PM and LM into one neural network, in this work we look at folding the endpointer into the model. On a large vocabulary Voice Search task, we show that joint optimization of the endpointer with the E2E model results in no quality degradation and reduces latency by more than a factor of 2 compared to having a separate endpointer with the E2E model.
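A toy sketch of endpointing on the model's end-of-sentence posterior (the per-frame probabilities and the threshold are invented for illustration): the microphone is closed at the first frame where the joint model's end-of-sentence probability crosses a threshold.

```python
def endpoint_frame(eos_probs, threshold=0.8):
    """eos_probs: per-frame probability of the end-of-sentence token
    emitted by the joint E2E model. Returns the first frame at which
    we would close the microphone, or None if we never endpoint."""
    for t, p in enumerate(eos_probs):
        if p >= threshold:
            return t
    return None

# The joint model's end-of-sentence posterior ramps up as speech ends;
# here it crosses the threshold at frame 4.
frame = endpoint_frame([0.0, 0.1, 0.2, 0.5, 0.9, 0.95], threshold=0.8)
```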
View details
Streaming End-to-End Speech Recognition for Mobile Devices
Raziel Alvarez
Ding Zhao
Ruoming Pang
Qiao Liang
Deepti Bhatia
Yuan Shangguan
ICASSP (2019)
Preview abstract
End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
View details
No Need For A Lexicon? Evaluating The Value Of The Pronunciation Lexica In End-To-End Models
Seungji Lee
Vlad Schogol
Patrick Nguyen
Chung-Cheng Chiu
ICASSP (2018)
Preview abstract
For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has begun to be challenged recently by end-to-end models which seek to combine acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since they remove the need for a separate expert-curated pronunciation lexicon to map from phoneme-based units to words. However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or to the joint learning of the various components with grapheme-based units.
In this work, we conduct detailed experiments which are aimed at quantifying the value of phoneme-based pronunciation lexica in the context of end-to-end models. We examine phoneme-based end-to-end models, which are contrasted against grapheme-based ones on a large vocabulary English Voice Search task, where we find that graphemes do indeed outperform phoneme-based models. We also compare grapheme- and phoneme-based end-to-end approaches on a multi-dialect English task, which once again confirms the superiority of graphemes, greatly simplifying the system for recognizing multiple dialects.
View details
State-of-the-art Speech Recognition With Sequence-to-Sequence Models
Chung-Cheng Chiu
Patrick Nguyen
Katya Gonina
Navdeep Jaitly
Jan Chorowski
ICASSP (2018) (to appear)
Preview abstract
Attention-based encoder-decoder architectures, such as Listen, Attend, and Spell (LAS), subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In our previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore techniques such as synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500 hour voice search task, we find that the proposed changes improve the WER of the LAS system from 9.2% to 5.6%, while the best conventional system achieves 6.7% WER. We also test both models on a dictation dataset, where our model provides 4.1% WER while the conventional system provides 5% WER.
View details
Improving the Performance of Online Neural Transducer models
Chung-Cheng Chiu
Patrick Nguyen
Proc. ICASSP (2018)
Preview abstract
Having a sequence-to-sequence model which can operate in an online fashion is important for streaming applications such as Voice Search. The neural transducer (NT) is a streaming sequence-to-sequence model, but it has been shown to degrade significantly in performance compared to non-streaming models such as Listen, Attend and Spell (LAS). In this paper, we present various improvements to NT. Specifically, we look at increasing the window over which NT computes attention, mainly by looking backwards in time so that the model still remains online. In addition, we explore initializing an NT model from a LAS-trained model so that it is guided with a better alignment. Finally, we explore including stronger language models, such as using wordpiece models and applying an external LM during the beam search. On a Voice Search task, we find that with these improvements we can get NT to match the performance of LAS.
View details
Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models
Patrick Nguyen
Chung-Cheng Chiu
ICASSP 2018 (to appear)
Preview abstract
Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER. In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we find that the proposed training procedure improves performance by up to 8.2% relative to the baseline system. This allows us to train grapheme-based, uni-directional attention-based models which match the performance of a traditional, state-of-the-art, discriminative sequence-trained system on a mobile voice-search task.
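The N-best approximation of the expected word-error loss can be sketched as follows (the renormalization over the N-best list and the subtraction of the mean error as a baseline follow the abstract; the exact numbers are illustrative):

```python
import math

def mwer_loss(nbest):
    """nbest: list of (log_prob, num_word_errors) for N-best hypotheses.
    Returns the expected number of word errors under the distribution
    renormalized over the N-best list, with the mean error subtracted
    as a baseline for variance reduction."""
    logps = [lp for lp, _ in nbest]
    m = max(logps)
    exps = [math.exp(lp - m) for lp in logps]     # stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    errs = [we for _, we in nbest]
    mean_err = sum(errs) / len(errs)
    return sum(p * (we - mean_err) for p, we in zip(probs, errs))
```

Minimizing this quantity pushes probability mass toward hypotheses with fewer word errors than the average over the list; if all hypotheses are equally likely, better and worse hypotheses cancel and the loss is zero.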
View details
Compression of End-to-End Models
Ruoming Pang
Suyog Gupta
Shuyuan Zhang
Chung-Cheng Chiu
Interspeech (2018)
Preview abstract
End-to-end models which are trained to directly output grapheme or word-piece targets have been demonstrated to be competitive with conventional speech recognition models. Such models do not require additional resources for decoding, and are typically much smaller than conventional models, which makes them particularly attractive in the context of on-device speech recognition, where both a small memory footprint and low power consumption are critical. With these constraints in mind, in this work we consider the problem of compressing end-to-end models with the goal of minimizing the number of model parameters without sacrificing model accuracy. We explore matrix factorization, knowledge distillation and parameter sparsity to determine the most effective method given a fixed parameter budget.
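One of the techniques named above, matrix factorization, can be sketched with a truncated SVD (shapes, the rank, and the synthetic matrix are illustrative, not the paper's setup): an m×n weight matrix is replaced by the product of an m×r and an r×n matrix, shrinking the parameter count when r is small.

```python
import numpy as np

def factorize(W, rank):
    """Low-rank factorization of W via truncated SVD: W ≈ A @ B,
    where A is m x rank and B is rank x n."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
# A matrix that is genuinely rank 2, so rank-2 truncation is lossless.
W = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 32))
A, B = factorize(W, rank=2)
# Parameter count: 64*32 = 2048 before, 64*2 + 2*32 = 192 after.
```

For real (full-rank) weight matrices the truncation is lossy, and the rank becomes a knob trading accuracy against size within the fixed parameter budget.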
View details
From audio to semantics: Approaches to end-to-end spoken language understanding
Galen Chuang
Delia Qu
Spoken Language Technology Workshop (SLT), 2018 IEEE
Preview abstract
Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio-to-semantic understanding as a sequence-to-sequence problem. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that having an intermediate text representation while jointly optimizing the full system improves prediction accuracy.
View details
Preview abstract
We investigate the effectiveness of generative adversarial networks (GANs) for speech enhancement, in the context of improving noise robustness of automatic speech recognition (ASR) systems. Prior work demonstrates that GANs can effectively suppress additive noise in raw waveform speech signals, improving perceptual quality metrics; however, this technique was not justified in the context of ASR. In this work, we conduct a detailed study to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise. Motivated by recent advances in image processing, we propose operating GANs on log-Mel filterbank spectra instead of waveforms, which requires less computation and is more robust to reverberant noise. While GAN enhancement improves the performance of an out-of-the-box ASR system on noisy speech, it falls short of the performance achieved by conventional multi-style training (MTR). By appending the GAN-enhanced features to the noisy inputs and retraining, we achieve a 7% WER improvement relative to the MTR system.
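The feature-stacking setup in the last sentence can be illustrated as follows (all shapes are hypothetical): the retrained acoustic model consumes the noisy log-Mel features concatenated, frame by frame, with the GAN-enhanced ones.

```python
import numpy as np

# Hypothetical shapes: 100 frames of 80-dimensional log-Mel features.
noisy = np.random.randn(100, 80)     # features of the noisy utterance
enhanced = np.random.randn(100, 80)  # GAN-enhanced features, same shape

# Append the enhanced features to the noisy inputs along the feature axis,
# doubling the per-frame input dimension for retraining.
stacked = np.concatenate([noisy, enhanced], axis=-1)
```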
View details
An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model
Patrick Nguyen
ICASSP (2018)
Preview abstract
Attention-based sequence-to-sequence models for automatic speech recognition jointly train an acoustic model, language model, and alignment mechanism. Thus, the language model component is only trained on transcribed audio-text pairs. This leads to the use of shallow fusion with an external language model at inference time. Shallow fusion refers to log-linear interpolation with a separately trained language model at each step of the beam search. In this work, we investigate the behavior of shallow fusion across a range of conditions: different types of language models, different decoding units, and different tasks. On Google Voice Search, we demonstrate that the use of shallow fusion with a neural LM with wordpieces yields a 9.1% relative word error rate reduction (WERR) over our competitive attention-based sequence-to-sequence model, obviating the need for second-pass rescoring.
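Shallow fusion as described above reduces to one line per beam-search step (the tokens, scores, and fusion weight `lam` below are invented for illustration): each candidate extension is scored by the ASR model's log-probability plus a weighted external-LM log-probability.

```python
def fused_score(asr_logp, lm_logp, lam=0.3):
    """Shallow fusion: log-linear interpolation of the ASR score with an
    external LM score; lam is a fusion weight tuned on a dev set."""
    return asr_logp + lam * lm_logp

def beam_step(candidates, lam=0.3, beam_size=2):
    """candidates: list of (token, asr_logp, lm_logp) extensions.
    Returns the top-scoring extensions under the fused score."""
    scored = [(tok, fused_score(a, l, lam)) for tok, a, l in candidates]
    return sorted(scored, key=lambda x: -x[1])[:beam_size]

# The ASR model slightly prefers the misspelling, but the external LM
# assigns it a much lower score, so fusion flips the ranking.
candidates = [("nickel", -1.0, -0.5), ("nickle", -0.9, -3.0)]
top = beam_step(candidates, lam=0.3, beam_size=1)
```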
View details
Preview abstract
In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow-fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.
View details
Preview abstract
In this paper, we conduct a detailed investigation of attention-based models for automatic speech recognition (ASR). First, we explore different types of attention, including online and full-sequence attention. Second, we explore different sub-word units to see how much of the end-to-end ASR process can reasonably be captured by an attention model. In experimental evaluations, we find that although attention is typically focused over a small region of the acoustics during each step of next label prediction, full-sequence attention outperforms “online” attention, although this gap can be significantly reduced by increasing the length of the segments over which attention is computed. Furthermore, we find that context-independent phonemes are a reasonable sub-word unit for attention models; when used in the second pass to rescore N-best hypotheses these models provide over a 10% relative improvement in word error rate.
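The online-versus-full-sequence contrast can be sketched as a masked softmax (window size, center, and scores are illustrative): restricting the softmax to a local segment gives an "online"-style attention, while leaving it unrestricted gives full-sequence attention.

```python
import math

def attention_weights(scores, window=None, center=None):
    """Softmax attention over encoder frames. If window/center are set,
    attention is restricted to frames within `window` of `center`
    (online-style); otherwise it spans the full sequence."""
    if window is not None:
        lo = max(0, center - window)
        hi = min(len(scores), center + window + 1)
        masked = [s if lo <= i < hi else float("-inf")
                  for i, s in enumerate(scores)]
    else:
        masked = scores
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

w_full = attention_weights([1.0, 2.0, 3.0])               # full-sequence
w_online = attention_weights([1.0, 2.0, 3.0], window=0, center=1)
```

Growing `window` interpolates between the two regimes, mirroring the observation that longer segments shrink the gap to full-sequence attention.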
View details
Streaming Small-Footprint Keyword Spotting Using Sequence-to-Sequence Models
Wei Li
Anton Bakhtin
Automatic Speech Recognition and Understanding (ASRU), 2017 IEEE Workshop on
Preview abstract
We develop streaming keyword spotting systems using a recurrent neural network transducer (RNN-T) model: an all-neural, end-to-end trained, sequence-to-sequence model which jointly learns acoustic and language model components. Our models are trained to predict either phonemes or graphemes as subword units, thus allowing us to detect arbitrary keyword phrases, without any out-of-vocabulary words. In order to adapt the models to the requirements of keyword spotting, we propose a novel technique which biases the RNN-T system towards a specific keyword of interest. Our systems are compared against a strong sequence-trained, connectionist temporal classification (CTC) based “keyword-filler” baseline, which is augmented with a separate phoneme language model. Overall, our RNN-T system with the proposed biasing technique significantly improves performance over the baseline system.
View details
A Comparison of Sequence-to-Sequence Models for Speech Recognition
Navdeep Jaitly
Interspeech 2017, ISCA (2017)
Preview abstract
In this work, we conduct a detailed evaluation of various all-neural, end-to-end trained, sequence-to-sequence models applied to the task of speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon, or a separate language model. We examine several sequence-to-sequence models including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attention-based model, and a model which augments the RNN-transducer with an attention mechanism.
We find that end-to-end models are capable of learning all components of the speech recognition process: acoustic, pronunciation, and language models, directly outputting words in the written form (e.g., “one hundred dollars” to “$100”), in a single jointly-optimized neural network. Furthermore, the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline outperforms these models on voice-search test sets.
View details
Preview abstract
We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an “encoder”, which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a “decoder” which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units (“wordpieces”) which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, is comparable in performance to a state-of-the-art baseline on dictation and voice-search tasks.
View details
Personalized Speech Recognition On Mobile Devices
Raziel Alvarez
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
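The SVD-based compression mentioned above can be illustrated with a minimal numpy sketch. The function name and the use of a plain truncated SVD are illustrative assumptions, not the exact scheme used in the paper:

```python
import numpy as np

def svd_compress(W, rank):
    """Approximate an m x n weight matrix by two low-rank factors.

    Keeping only the top-`rank` singular values replaces m * n parameters
    with rank * (m + n), a net saving whenever rank < m * n / (m + n).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # m x rank (singular values folded in)
    B = Vt[:rank, :]             # rank x n
    return A, B

rng = np.random.default_rng(0)
# A weight matrix with genuine low-rank structure compresses well.
W = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 256))
A, B = svd_compress(W, rank=8)
error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

For this rank-8 matrix the factorization is essentially exact while storing 4,096 parameters instead of 65,536.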
On the Efficient Representation and Execution of Deep Acoustic Models
Raziel Alvarez
Anton Bakhtin
Proceedings of Annual Conference of the International Speech Communication Association (Interspeech) (2016)
In this paper we present a simple and computationally efficient quantization scheme that enables us to reduce the resolution of the parameters of a neural network from 32-bit floating point values to 8-bit integer values. The proposed quantization scheme leads to significant memory savings and enables the use of optimized hardware instructions for integer arithmetic, thus significantly reducing the cost of inference. Finally, we propose a 'quantization aware' training process that applies the proposed scheme during network training and find that it allows us to recover most of the loss in accuracy introduced by quantization. We validate the proposed techniques by applying them to a long short-term memory-based acoustic model on an open-ended large vocabulary speech recognition task.
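A minimal sketch of uniform 8-bit quantization in the spirit described above; the exact scheme in the paper may differ, and the helper names are illustrative:

```python
import numpy as np

def quantize(w, num_bits=8):
    """Uniform (linear) quantization of float weights to unsigned integers.

    Only the integer codes plus a per-tensor (scale, offset) pair are
    stored, cutting memory 4x relative to 32-bit floats and enabling
    integer arithmetic at inference time.
    """
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2 ** num_bits - 1) or 1.0  # avoid zero scale
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(1)
w = rng.uniform(-0.5, 0.5, size=(64, 64)).astype(np.float32)
q, scale, lo = quantize(w)
w_hat = dequantize(q, scale, lo)
max_err = np.abs(w - w_hat).max()   # rounding error bounded by scale / 2
```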
On The Compression Of Recurrent Neural Networks With An Application To LVCSR Acoustic Modeling For Embedded Speech Recognition
Antoine Bruguier
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
We study the problem of compressing recurrent neural networks (RNNs). In particular, we focus on the compression of RNN acoustic models, motivated by the goal of building compact and accurate speech recognition systems that can run efficiently on mobile devices. In this work, we present a technique for general recurrent model compression that jointly compresses both recurrent and non-recurrent inter-layer weight matrices. We find that the proposed technique allows us to reduce the size of our Long Short-Term Memory (LSTM) acoustic model to a third of its original size with negligible loss in accuracy.
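The idea of compressing the recurrent and inter-layer matrices jointly, rather than independently, can be sketched by giving them a shared low-rank factor. Stacking the matrices before a truncated SVD is one simple way to obtain such a shared factor and is an illustrative assumption, not the paper's exact procedure:

```python
import numpy as np

def joint_low_rank(W_rec, W_inter, rank):
    """Jointly compress a layer's recurrent and inter-layer weights.

    Stacking the two matrices and truncating the SVD of the stack yields a
    single shared factor B, so both matrices pass through the same low-rank
    bottleneck instead of being factored independently.
    """
    stacked = np.vstack([W_rec, W_inter])          # (m1 + m2) x n
    U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    A = U[:, :rank] * s[:rank]                     # (m1 + m2) x rank
    B = Vt[:rank, :]                               # rank x n, shared
    return A[: W_rec.shape[0]], A[W_rec.shape[0]:], B

rng = np.random.default_rng(0)
shared = rng.standard_normal((4, 128))             # shared low-rank structure
W_rec = rng.standard_normal((128, 4)) @ shared
W_inter = rng.standard_normal((128, 4)) @ shared
A_rec, A_inter, B = joint_low_rank(W_rec, W_inter, rank=4)
err = (np.linalg.norm(W_rec - A_rec @ B)
       + np.linalg.norm(W_inter - A_inter @ B))
```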
Compressing Deep Neural Networks using a Rank-Constrained Topology
Preetum Nakkiran
Raziel Alvarez
Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), ISCA (2015), pp. 1473-1477
We present a general approach to reduce the size of feed-forward deep neural networks (DNNs). We propose a rank-constrained topology, which factors the weights in the input layer of the DNN in terms of a low-rank representation: unlike previous work, our technique is applied at the level of the filters learned at individual hidden layer nodes, and exploits the natural two-dimensional time-frequency structure in the input. These techniques are applied on a small-footprint DNN-based keyword spotting task, where we find that we can reduce model size by 75% relative to the baseline, without any loss in performance. Furthermore, we find that the proposed approach is more effective at improving model performance compared to other popular dimensionality reduction techniques, when evaluated with a comparable number of parameters.
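The rank-constrained idea, applied per hidden unit over the two-dimensional time-frequency input, can be illustrated with a rank-one filter; the sizes and names below are hypothetical:

```python
import numpy as np

def rank1_filter(time_vec, freq_vec):
    """A hidden unit's input filter constrained to rank one.

    Instead of learning a full T x F weight patch over the time-frequency
    input (T * F parameters), the unit learns one vector per axis
    (T + F parameters) and uses their outer product as the filter.
    """
    return np.outer(time_vec, freq_vec)

T, F = 40, 40                  # e.g. 40 frames x 40 filterbank channels
rng = np.random.default_rng(2)
t, f = rng.standard_normal(T), rng.standard_normal(F)
W = rank1_filter(t, f)

full_params = T * F            # 1600 weights per hidden unit
factored_params = T + F        # 80 weights per hidden unit
```

For these sizes the factored filter uses 5% of the parameters of the full patch, in line with the kind of reduction the abstract reports for the whole model.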
Automatic Gain Control and Multi-style Training for Robust Small-Footprint Keyword Spotting with Deep Neural Networks
Raziel Alvarez
Preetum Nakkiran
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2015), pp. 4704-4708
We explore techniques to improve the robustness of small-footprint keyword spotting models based on deep neural networks (DNNs) in the presence of background noise and in far-field conditions. We find that system performance can be improved significantly, with relative improvements up to 75% in far-field conditions, by employing a combination of multi-style training and a proposed novel formulation of automatic gain control (AGC) that estimates the levels of both speech and background noise. Further, we find that these techniques allow us to achieve competitive performance, even when applied to DNNs with an order of magnitude fewer parameters than our baseline.
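A toy illustration of an AGC that estimates separate speech and background-noise levels from frame energies; the percentile-based level estimation is an illustrative simplification, not the paper's formulation:

```python
import numpy as np

def automatic_gain_control(x, frame_len=400, target_db=-20.0):
    """Toy AGC: estimate speech and noise levels, then normalize the speech.

    Frame energies are computed in dB; the noise floor is taken as a low
    percentile and the speech level as a high percentile of those energies.
    A single gain then maps the estimated speech level to `target_db`.
    """
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    noise_db = np.percentile(energy_db, 10)    # background level estimate
    speech_db = np.percentile(energy_db, 90)   # speech level estimate
    gain = 10.0 ** ((target_db - speech_db) / 20.0)
    return x * gain, speech_db, noise_db

rng = np.random.default_rng(3)
# Quiet, far-field-like signal: a low-level "speech" burst over a noise floor.
noise = 0.001 * rng.standard_normal(16000)
speech = np.zeros(16000)
speech[4000:8000] = 0.01 * np.sin(2 * np.pi * 440 * np.arange(4000) / 16000)
y, speech_db, noise_db = automatic_gain_control(noise + speech)
```

Because the estimated speech level sits well below the target here, the computed gain boosts the quiet signal, which is the behavior one wants in far-field conditions.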
Conditional Random Fields in Speech, Audio, and Language Processing
Eric Fosler-Lussier
Yanzhang He
Preethi Jyothi
Proceedings of the IEEE, vol. 101 (2013), pp. 1054-1075
Discriminative Articulatory Models for Spoken Term Detection in Low-Resource Conversational Settings
Karen Livescu
Eric Fosler-Lussier
Joseph Keshet
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2013), pp. 8287-8291
Discriminative Articulatory Feature-based Pronunciation Models with Application to Spoken Term Detection
Ph.D. Thesis, The Ohio State University, Department of Computer Science and Engineering (2013)
An Evaluation of Posterior Modeling Techniques for Phonetic Recognition
David Nahamoo
Bhuvana Ramabhadran
Dimitri Kanevsky
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2013), pp. 7165-7169
Discriminative Spoken Term Detection with Limited Data
Joseph Keshet
Karen Livescu
Eric Fosler-Lussier
Proceedings of Symposium on Machine Learning in Speech and Language Processing (MLSLP) (2012), pp. 22-25
A Chunk-Based Phonetic Score for Mobile Voice Search
Jasha Droppo
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2012), pp. 4729-4732
Articulatory Feature Classification Using Nearest Neighbors
Arild Brandrud Næss
Karen Livescu
Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), ISCA (2011), pp. 2301-2304
A Factored Conditional Random Field Model for Articulatory Feature Forced Transcription
Eric Fosler-Lussier
Karen Livescu
Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE (2011), pp. 77-82
Investigations into the Crandem Approach to Word Recognition
Preethi Jyothi
William Hartmann
J. J. Morris
Eric Fosler-Lussier
Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), ACL (2010), pp. 725-728
Backpropagation Training for Multilayer Conditional Random Field Based Phone Recognition
Eric Fosler-Lussier
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2010), pp. 5534-5537
Combining Monaural and Binaural Evidence for Reverberant Speech Segregation
John Woodruff
Eric Fosler-Lussier
DeLiang Wang
Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), ISCA (2010), pp. 406-409
Monaural Segregation of Voiced Speech using Discriminative Random Fields
Zhaozhang Jin
Eric Fosler-Lussier
Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), ISCA (2009), pp. 856-859