Tara N. Sainath
Tara Sainath received her S.B., M.Eng., and Ph.D. in Electrical Engineering and Computer Science (EECS) from MIT. After her Ph.D., she spent five years in the Speech and Language Algorithms group at the IBM T.J. Watson Research Center before joining Google Research. She served as a Program Chair for ICLR in 2017 and 2018, and has co-organized numerous special sessions and workshops, including at Interspeech 2010, ICML 2013, Interspeech 2016, ICML 2017, Interspeech 2019, and NeurIPS 2020. She has also served as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) and as an Associate Editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing. She is an IEEE and ISCA Fellow, and the recipient of the 2021 IEEE SPS Industrial Innovation Award and the 2022 IEEE SPS Signal Processing Magazine Best Paper Award. She is currently a Principal Research Scientist at Google, working on applications of deep neural networks to automatic speech recognition.
Authored Publications
Google Publications
Speech data from different domains has distinct acoustic and linguistic characteristics. It is common to train a single multidomain model, such as a Conformer transducer for speech recognition, on a mixture of data from all domains. However, changing the data in one domain, or adding a new domain, requires the multidomain model to be retrained. To this end, we propose a framework called modular domain adaptation (MDA) that enables a single model to process multidomain data while keeping all parameters domain-specific, i.e., each parameter is trained on data from only one domain. Starting from a streaming Conformer transducer trained only on video caption data, experimental results show that an MDA-based model can reach performance similar to that of the multidomain model on other domains, such as voice search and dictation, by adding per-domain adapters and per-domain feed-forward networks in the Conformer encoder.
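The per-domain routing described above can be sketched in a few lines; the adapter form, the numbers, and the domain names here are illustrative assumptions, not the paper's actual parameterization:

```python
# Toy sketch of modular-domain-adaptation-style routing (illustrative):
# the shared backbone stands in for the Conformer encoder trained on
# video captions, and each domain owns its own adapter, so training one
# domain's adapter never touches another domain's parameters.

def backbone(x):
    # Stand-in for a frozen encoder layer.
    return [2.0 * v for v in x]

def make_adapter(scale, bias):
    # A per-domain adapter: a tiny residual transform on the features.
    def adapter(h):
        return [v + scale * v + bias for v in h]
    return adapter

adapters = {
    "video_caption": make_adapter(0.0, 0.0),  # source domain: identity
    "voice_search": make_adapter(0.1, 0.5),   # trained only on its domain
    "dictation": make_adapter(-0.2, 0.1),     # trained only on its domain
}

def forward(x, domain):
    # Every parameter used here belongs either to the frozen backbone
    # or to exactly one domain's adapter.
    return adapters[domain](backbone(x))
```

Adding a new domain then amounts to adding one entry to `adapters`; the backbone and the other domains' adapters are untouched.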
Text Injection for Capitalization and Turn-taking Prediction in ASR Models
Weiran Wang
Interspeech 2023 (2023)
Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall.
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
Ruoming Pang
Submitted to Interspeech 2022 (2022) (to appear)
Language model fusion can help smart assistants recognize tail words which are rare in acoustic data but abundant in text-only corpora.
However, large-scale text corpora sourced from typed chat or search logs are often (1) prohibitively expensive to train on, (2) beset with content that is mismatched to the voice domain, and (3) heavy-headed rather than heavy-tailed (e.g., too many common search queries such as "weather"), hindering downstream performance gains.
We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance.
First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high frequency (head) sentences.
Second, to encourage rare-word accuracy, we explicitly filter for sentences with words which are rare in the acoustic data.
Finally, we tackle domain mismatch by applying perplexity-based contrastive selection to filter for examples which are matched to the target domain.
We downselect a large corpus of web search queries by a factor of over 50x to train an LM, achieving better perplexities on the target acoustic domain than without downselection.
When used with shallow fusion on a production-grade speech engine, it achieves a WER reduction of up to 24% on rare-word sentences (without changing the overall WER) relative to a baseline LM trained on an unfiltered corpus.
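The head-downsampling strategy above might look like the following sketch; the exact soft-log form and the exponent are illustrative assumptions, not the paper's formula:

```python
import math
import random

def keep_probability(count, alpha=0.5):
    # Soft-log damping (illustrative form): frequent "head" sentences are
    # kept with low probability, rare "tail" sentences with high probability.
    return min(1.0, (math.log1p(count) / count) ** alpha)

def downsample(sentences_with_counts, seed=0):
    # Keep each sentence independently according to its damped frequency.
    rng = random.Random(seed)
    return [s for s, c in sentences_with_counts
            if rng.random() < keep_probability(c)]
```

Under this toy form, a query with count 10,000 (e.g., "weather") is kept about 3% of the time, while a count-1 sentence is kept about 83% of the time, flattening the head while preserving the tail.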
Improving Deliberation by Text-Only and Semi-Supervised Training
Weiran Wang
Interspeech 2022 (2022) (to appear)
Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose text-only and semi-supervised training for attention-decoder based deliberation. By incorporating text-only data in training a bidirectional encoder representation from transformer (BERT) for the deliberation text encoder, joint acoustic and text decoder (JATD) training, and semi-supervised training based on a conventional model as a teacher, we achieve up to 11.7% WER reduction compared to the baseline deliberation. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the WER by 8% relative for Google Voice Search with reasonable endpointing latencies. We also show that the deliberation model achieves a positive human side-by-side evaluation against LM rescoring.
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
Zhiyun Lu
Interspeech 2022 (2022) (to appear)
Improving the performance of end-to-end ASR models on long utterances of minutes to hours is an ongoing problem in speech recognition.
A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundaries based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set a alarm for... 5 o'clock").
Here, we propose replacing the VAD with an end-to-end ASR model capable of predicting segment boundaries, allowing the segmentation to be conditioned not only on deeper acoustic features but also on linguistic features from the decoded text, while requiring negligible extra compute.
In experiments on real-world long-form audio (YouTube) up to 30 minutes long, we demonstrate WER gains of 5% relative to the VAD baseline on a state-of-the-art Conformer RNN-T setup.
Multilingual end-to-end automatic speech recognition models are attractive due to their simplicity in training and deployment. Recent work on large-scale training of such models has shown promising results compared to monolingual models. However, such work often focuses on the structure of the multilingual models themselves in a single-pass decoding setup. In this work, we investigate second-pass deliberation for multilingual speech recognition. Our proposed deliberation is multilingual, i.e., the text encoder encodes hypothesis text from multiple languages, and the deliberation decoder attends to encoded text and audio from multiple languages without explicitly using language information. We investigate scaling different components of the multilingual deliberation model, such as the text encoder and the deliberation decoder, and also compare scaling the second-pass deliberation decoder with scaling the first-pass cascaded encoder. We show that deliberation improves the average WER on 9 languages by 4% relative compared to the single-pass model in a truly multilingual setup. By increasing the size of the deliberation model up to 1B parameters, the average WER improvement increases to 9%, with up to 14% for certain languages.
Transducer-Based Streaming Deliberation For A Cascaded Encoder Model
Ruoming Pang
ICASSP 2022 (2022) (to appear)
Previous research on deliberation networks has achieved excellent recognition quality. Attention-decoder-based deliberation models often work as rescorers to improve first-pass recognition results, and often require the full first-pass hypothesis before second-pass deliberation can begin. In this work, we propose a streaming transducer-based deliberation model. The joint network of a transducer decoder usually takes inputs from the encoder and the prediction network. We propose to use attention over the first-pass text hypotheses as a third input to the joint network. The proposed transducer-based deliberation model naturally streams, making it more desirable for on-device applications. We also show that the model improves rare word recognition, with relative WER reductions ranging from 3.6% to 10.4% across a variety of test sets. Our model does not use any additional text data for training.
Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems
Chao Zhang
IEEE Spoken Language Technology Workshop (2022)
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. This EP model strongly affects latency, but is subject to computational constraints, which limits prediction accuracy. We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This allows flexibility during inference to produce a low-cost prediction or a higher quality prediction if ASR computation is ongoing. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 130ms (33.3% reduction), and 90th percentile latency by 160ms (22.2% reduction), without regressing word-error rate. For continuous recognition, WER improves by 10.6% (relative).
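The "switch" connection above amounts to routing the endpointer's input between two feature sources depending on whether ASR computation is ongoing. In this sketch, the feature functions and the EP decision rule are illustrative stand-ins, not the trained models:

```python
# Toy sketch of the "switch" connection: the endpointer (EP) consumes
# either raw acoustic frames (cheap path, ASR idle) or low-level latent
# features from the ASR encoder (higher-quality path, ASR already running).

def asr_encoder_lowlevel(frames):
    # Stand-in for low-level latent representations from the ASR encoder.
    return [2.0 * f + 1.0 for f in frames]

def endpointer(features):
    # Stand-in EP decision rule over whichever features it was handed;
    # the real EP is trained to consume both feature types.
    return sum(features) / len(features) < 0.5

def predict_endpoint(frames, asr_running):
    # The switch: pick the feature source based on whether ASR compute
    # is already ongoing, then run the same EP on the result.
    features = asr_encoder_lowlevel(frames) if asr_running else frames
    return endpointer(features)
```

Because the same EP consumes both feature types during training, inference can choose the cheap path or the higher-quality path per request without swapping models.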
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
Jiahui Yu
Wei Han
Anmol Gulati
Ruoming Pang
ICLR 2021
Streaming automatic speech recognition (ASR) aims to emit each recognized word shortly after it is spoken, while full-context ASR encodes an entire speech sequence before decoding text. In this work, we propose a unified framework, Universal ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. More importantly, we show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training with full-context ASR, especially with in-place knowledge distillation. The Universal ASR framework is network-agnostic and can be applied to recent state-of-the-art convolution-based and transformer-based end-to-end ASR networks. We present extensive experiments on both the research dataset LibriSpeech and the large-scale internal dataset MultiDomain with two state-of-the-art ASR networks, ContextNet and Conformer. Experiments and ablation studies demonstrate that Universal ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both the emission latency and the recognition accuracy of streaming ASR.
We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder in a listen, attend, and spell model by multitask training on both audio-text and text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method leads to an 11% relative performance improvement over the baseline and is comparable to language model shallow fusion, without requiring an additional neural network during decoding. We observe a similar trend on the whole 960-hour LibriSpeech training set. Analyses of sample output sentences demonstrate that the proposed method can incorporate language-level information, suggesting its effectiveness in real-world applications.
Learning Word-Level Confidence for Subword End-to-End ASR
David Qiu
Yu Zhang
Liangliang Cao
Deepti Bhatia
Wei Li
ICASSP (2021)
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
We propose to deliberate the hypothesis alignment of a streaming RNN-T model with the previously proposed Align-Refine non-autoregressive decoding method and its improved versions. The method performs a few refinement steps, where each step shares a transformer decoder that attends to both text features (extracted from alignments) and audio features, and outputs complete updated alignments. The transformer decoder is trained with the CTC loss, which facilitates parallel greedy decoding, and performs full-context attention to capture label dependencies. We improve Align-Refine by introducing a cascaded encoder, which captures more audio context before refinement, and alignment augmentation, which enforces learning label dependencies. We show that, conditioned on hypothesis alignments of a streaming RNN-T model, our method obtains significantly more accurate recognition results than the first-pass RNN-T, with only a small number of additional model parameters.
Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging
Sean Campbell
ICASSP 2021, IEEE
End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct model states, such models are decoded using an approximate beam search which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model's accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or by up to 15.7% when lattice rescoring is applied.
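The path-merging idea can be sketched directly: once the decoder conditions on only the last few labels, hypotheses that agree on that window drive identical model states and can be merged (a 2-token window here for brevity; the paper limits context to four word-pieces):

```python
# Merge beam hypotheses whose recent label context matches: with a
# limited-context model such hypotheses share a model state, so only
# the best-scoring one needs to stay on the active beam.

CONTEXT = 2  # the paper uses 4 word-piece labels

def merge_paths(beam):
    # beam: list of (token_tuple, log_prob) hypotheses.
    best = {}
    for tokens, logp in beam:
        key = tokens[-CONTEXT:]  # the model state depends only on this
        if key not in best or logp > best[key][1]:
            best[key] = (tokens, logp)
    return sorted(best.values(), key=lambda h: -h[1])

beam = [
    (("the", "cat", "sat"), -1.2),
    (("a", "cat", "sat"), -1.5),  # same last-2 context: merged away
    (("the", "cat", "ran"), -2.0),
]
merged = merge_paths(beam)
```

In the full scheme, merged-away paths are retained in the final lattice rather than discarded, which is what improves the oracle WER.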
Lookup-Table Recurrent Language Models for Long Tail Speech Recognition
Interspeech (2021) (to appear)
We introduce Lookup-Table Language Models (LookupLM), a method for scaling up the size of RNN language models with only a constant increase in floating point operations, by increasing the expressivity of the embedding table. In particular, we instantiate an (additional) embedding table which embeds the previous n-gram token sequence, rather than a single token. This allows the embedding table to be scaled up arbitrarily -- with a commensurate increase in performance -- without changing the token vocabulary. Since embeddings are sparsely retrieved from the table via a lookup, increasing the size of the table adds neither extra operations to each forward pass nor extra parameters that need to be stored in limited GPU/TPU memory. We explore scaling n-gram embedding tables up to nearly a billion parameters. When trained on a 3-billion-sentence corpus, we find that LookupLM improves long-tail log perplexity by 2.44 and long-tail WER by 23.4% on a downstream speech recognition task over a standard RNN language model baseline, an improvement comparable to scaling up the baseline by 6.2x the number of floating point operations.
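A minimal sketch of the constant-cost lookup; the hashing scheme and sizes are illustrative assumptions, not the paper's exact construction:

```python
import zlib

# Toy n-gram lookup embedding: the previous n-gram is mapped to one row
# of a (potentially huge) table, so retrieval cost per step is a single
# lookup regardless of how large the table grows.

TABLE_ROWS, EMB_DIM = 1 << 20, 4  # a real LookupLM table is far larger

def ngram_row(tokens):
    # Hash the n-gram token sequence to a row index.
    return zlib.crc32("\x1f".join(tokens).encode()) % TABLE_ROWS

def embed(tokens, table):
    # table maps row index -> embedding vector; a sparse dict stands in
    # for the dense table here, with missing rows acting as zeros.
    return table.get(ngram_row(tokens), [0.0] * EMB_DIM)
```

Growing `TABLE_ROWS` adds parameters but no extra floating point operations per forward pass, which is the property the abstract highlights.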
We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) has shown promising results through the use of text data during model training and the recently introduced deliberation architecture has reduced recognition errors by leveraging first-pass decoding results. Our method, dubbed Deliberation-JATD, combines the spelling correcting abilities of deliberation with JATD’s use of unpaired text data to further improve performance. The proposed model produces substantial gains across multiple test sets, especially those focused on rare words, where it reduces word error rate (WER) by between 12% and 22.5% relative. This is done without increasing model size or requiring multi-stage training, making Deliberation-JATD an efficient candidate for on-device applications.
FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
Jiahui Yu
Wei Han
Anmol Gulati
Ruoming Pang
ICASSP 2021
Streaming automatic speech recognition (ASR) aims to output each hypothesized word as quickly and accurately as possible. However, reducing latency while retaining accuracy is highly challenging. Existing approaches, including early and late penalties and constrained alignment, penalize emission delay by manipulating per-token or per-frame RNN-T output logits. While successful in reducing latency, these approaches lead to significant accuracy degradation. In this work, we propose a sequence-level emission regularization technique, named FastEmit, that applies emission latency regularization directly on the transducer forward-backward probabilities. We demonstrate that FastEmit is better suited to the sequence-level transducer training objective for streaming ASR networks. We apply FastEmit to various end-to-end (E2E) ASR networks, including RNN-Transducer, Transformer-Transducer, ConvNet-Transducer, and Conformer-Transducer, and achieve 150-300 ms latency reduction over previous art without accuracy degradation on a Voice Search test set. FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th-percentile latency from 210 ms to only 30 ms on LibriSpeech.
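A sketch of the regularization as we summarize it here (a paraphrase, not the paper's verbatim equations): with ŷ(t,u) the label-emission probability and b(t,u) the blank probability at transducer lattice node (t,u), and λ ≥ 0 the emission weight, only the label-emission gradient is scaled, so larger λ pushes probability mass toward emitting labels earlier:

```latex
% FastEmit gradient sketch: the label-emission gradient is scaled by
% (1 + lambda); the blank gradient is left untouched.
\frac{\partial \tilde{\mathcal{L}}}{\partial \hat{y}(t,u)}
  = (1+\lambda)\,\frac{\partial \mathcal{L}}{\partial \hat{y}(t,u)},
\qquad
\frac{\partial \tilde{\mathcal{L}}}{\partial b(t,u)}
  = \frac{\partial \mathcal{L}}{\partial b(t,u)} .
```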
Transformer Based Deliberation for Two-Pass Speech Recognition
Ruoming Pang
IEEE Spoken Language Technology Workshop (2021)
Interactive speech recognition systems must generate words quickly while also producing accurate results. Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate. Previous work has established that deliberation networks can be effective second-pass models. These models accept two kinds of inputs at once: encoded audio frames and the hypothesis text from the first-pass model. In this work, we explore using transformer layers instead of long short-term memory (LSTM) layers for deliberation rescoring. In transformer layers, we generalize the "encoder-decoder" attention to attend to both encoded audio and first-pass text hypotheses. The output context vectors are then combined by a merger layer. Compared to LSTM-based deliberation, our best transformer deliberation achieves a 7% relative word error rate (WER) improvement along with a 38% reduction in computation. We also compare against non-deliberation transformer rescoring, and find a 9% relative improvement.
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Rami Botros
Ruoming Pang
James Qin
Quoc-Nam Le-The
Anmol Gulati
Emmanuel Guzman
Jiahui Yu
Qiao Liang
Wei Li
Yu Zhang
Interspeech (2021) (to appear)
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with, so E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to improve decoder latency we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, while being 318x smaller.
Low Latency Speech Recognition using End-to-End Prefetching
Wei Li
Interspeech 2020 (to appear)
Latency is a crucial metric for streaming speech recognition systems. In this paper, we reduce latency by fetching responses early based on the partial recognition results and refer to it as prefetching. Specifically, prefetching works by submitting partial recognition results for subsequent processing such as obtaining assistant server responses or second-pass rescoring before the recognition result is finalized. If the partial result matches the final recognition result, the early fetched response can be delivered to the user instantly. This effectively speeds up the system by saving the execution latency that typically happens after recognition is completed.
Prefetching can be triggered multiple times for a single query, but this leads to multiple rounds of downstream processing and increases computation costs. It is hence desirable to fetch the result sooner while limiting the number of prefetches. To achieve the best trade-off between latency and computation cost, we investigate a series of prefetching decision models, including decoder-silence-based prefetching, acoustic-silence-based prefetching, and end-to-end prefetching.
In this paper, we demonstrate that the proposed prefetching mechanism reduces latency by 200 ms for a system that consists of a streaming first-pass model using a recurrent neural network transducer (RNN-T) and a non-streaming second-pass rescoring model using Listen, Attend and Spell (LAS) [1]. We observe that end-to-end prefetching provides the best trade-off between cost and latency, and is 100 ms faster than silence-based prefetching at a fixed prefetch rate.
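A prefetching decision policy can be sketched as follows; the stability heuristic here is an illustrative stand-in for the decoder-silence, acoustic-silence, and end-to-end models compared in the paper:

```python
# Fire a prefetch when the partial hypothesis has been identical for
# `patience` consecutive decoder updates, submitting each distinct
# hypothesis for downstream processing at most once.

def prefetch_points(partials, patience=2):
    fired, prefetched = [], set()
    stable, last = 0, None
    for i, hyp in enumerate(partials):
        stable = stable + 1 if hyp == last else 1
        last = hyp
        if stable >= patience and hyp not in prefetched:
            fired.append(i)  # submit hyp for downstream processing here
            prefetched.add(hyp)
    return fired

partials = ["play", "play some", "play some jazz",
            "play some jazz", "play some jazz"]
```

With this toy policy, only the stabilized hypothesis "play some jazz" triggers a prefetch, bounding the number of downstream calls per query; if it matches the final result, the prefetched response can be delivered immediately.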
End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively with conventional models. To further improve the quality of an E2E model, two-pass decoding has been proposed to rescore streamed hypotheses using a non-streaming E2E model while maintaining a reasonable latency. However, the rescoring model uses only acoustics to rerank hypotheses. On the other hand, a class of neural correction models uses only first-pass hypotheses for second-pass decoding. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 25% relative WER reduction compared to a recurrent neural network transducer, and 12% relative compared to LAS rescoring, on Google Voice Search tasks. The improvement on a proper noun test set is even larger: 23% compared to LAS rescoring. The proposed model has latency similar to LAS rescoring when decoding Voice Search utterances.
Towards fast and accurate streaming end-to-end ASR
Ruoming Pang
Proc. ICASSP (2020)
End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition system into one neural network with a much smaller number of parameters than a conventional ASR system, thus making them suitable for on-device applications. For example, the recurrent neural network transducer (RNN-T) is a streaming E2E model that has shown promising potential for on-device ASR. For such applications, quality and latency are two critical factors. We propose to reduce the E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique, we achieve an 8.0% relative word error rate (WER) reduction and a 130 ms 90th-percentile latency reduction on a Voice Search test set. We also experimented with a second-pass Listen, Attend and Spell (LAS) rescorer for the RNN-T EP model. Although it cannot directly improve the first-pass latency, the large WER reduction gives us more room to trade WER for latency. RNN-T+LAS, together with EMBR training, brings a 17.3% relative WER reduction while maintaining a similar 120 ms 90th-percentile latency reduction.
Recurrent Neural Network Transducer (RNN-T) models [1] for automatic speech recognition (ASR) provide high-accuracy speech recognition. Such end-to-end (E2E) models combine the acoustic, pronunciation and language models (AM, PM, LM) of a conventional ASR system into a single neural network, dramatically reducing complexity and model size. In this paper, we propose a technique for incorporating contextual signals, such as intelligent assistant device state or dialog state, directly into RNN-T models. We explore different encoding methods and demonstrate that RNN-T models can effectively utilize such context. Our technique results in a reduction in Word Error Rate (WER) of up to 10.4% relative on a variety of contextual recognition tasks. We also demonstrate that proper regularization can be used to model context independently for improved overall quality.
Recently, we introduced a two-pass on-device E2E model, which runs RNN-T in the first pass and then rescores/redecodes the result with a LAS decoder. This on-device model was similar in performance to a state-of-the-art conventional model. However, like many E2E models it is trained on supervised audio-text pairs and thus did poorly on rare words compared to a conventional model trained on a much larger text corpus. In this work, we introduce a joint acoustic and text-only decoder (JATD) into the LAS decoder, which allows the LAS decoder to be trained on a much larger text corpus. We find that the JATD model provides between a 3-10% relative improvement in WER compared to a LAS decoder trained only on supervised audio-text pairs, across a variety of proper noun test sets.
A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
Ruoming Pang
Antoine Bruguier
Wei Li
Raziel Alvarez
David Garcia
Minho Jin
Qiao Liang
(June) Yuan Shangguan
Yash Sheth
Mirkó Visontai
Yu Zhang
Ding Zhao
ICASSP (2020)
Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpass a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff than a conventional model. For example, at the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
End-to-end (E2E) models are a promising research direction in speech recognition, as the single all-neural E2E system offers a much simpler and more compact solution compared to a conventional model, which has a separate acoustic (AM), pronunciation (PM) and language model (LM).
However, it has been noted that E2E models perform poorly on tail words and proper nouns, likely because the training requires joint audio-text pairs, and does not take advantage of a large amount of text-only data used to train the LMs in conventional models.
There have been numerous efforts to train an RNN-LM on text-only data and fuse it into the end-to-end model.
In this work, we contrast this approach to training the E2E model with audio-text pairs generated from unsupervised speech data.
To target the proper noun issue specifically, we adopt a Part-of-Speech (POS) tagger to filter the unsupervised data to use only those with proper nouns.
We show that training with filtered unsupervised-data provides up to a 13% relative reduction in word-error-rate (WER), and when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.
Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component of the end-to-end model is only trained on transcribed audio-text pairs, which leads to performance degradation, especially on rare words. While a variety of work has looked at incorporating an external LM trained on text-only data into the end-to-end framework, none of it has taken into account the characteristic error distribution made by the model. In this paper, we propose a novel approach to utilizing text-only data, by training a spelling correction (SC) model to explicitly correct those errors. On the LibriSpeech dataset, we demonstrate that the proposed model results in an 18.6% relative improvement in WER over the baseline model when directly correcting the top ASR hypothesis, and a 29.0% relative improvement when further rescoring an expanded n-best list using an external LM.
View details
Two-Pass End-to-End Speech Recognition
Ruoming Pang
Wei Li
Mirkó Visontai
Qiao Liang
Interspeech (2019)
Preview abstract
The requirements for many applications of state-of-the-art speech recognition systems include not only low word error rate (WER) but also low latency. Specifically, for many use-cases, the system must be able to decode utterances in a streaming fashion and faster than real-time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency metrics compared to conventional on-device models. However, this model still lags behind a large state-of-the-art conventional model in quality. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown comparable quality to large conventional models. This work aims to bring the quality of an E2E streaming model closer to that of a conventional system by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. Our proposed two-pass model achieves a 17%-22% relative reduction in WER compared to RNN-T alone and increases latency by a small fraction over RNN-T.
View details
Preview abstract
End-to-End (E2E) automatic speech recognition (ASR) systems learn word spellings directly from text-audio pairs, in contrast to traditional ASR systems which incorporate a separate pronunciation lexicon. The lexicon allows a traditional system to correctly spell rare words unobserved in training, if their phonetic pronunciation is known during inference. E2E systems, however, are more likely to misspell rare words.
In this work we propose an E2E model which benefits from the best of both worlds: it outputs graphemes, and thus learns to spell words directly, while also being able to leverage pronunciations for words which might be likely in a given context. Our model, which we name Phoebe, is based on the recently proposed Contextual Listen Attend and Spell model (CLAS). As in CLAS, our model accepts a set of bias phrases and learns an embedding for them which is jointly optimized with the rest of the ASR system. In contrast to CLAS, which accepts only the textual form of the bias phrases, the proposed model also has access to phonetic embeddings, which as we show improves performance on challenging test sets which include words unseen in training. The proposed model provides a 16% relative word error rate reduction over CLAS when both the phonetic and written representation of the context bias phrases are used.
View details
Preview abstract
Recognizing written domain numeric utterances (e.g., I need $1.25.) can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g., I need one dollar and twenty five cents.), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR models are not suitable for the low memory setting of on-device speech recognition. E2E models such as RNN-T are attractive for on-device ASR, as they fold the AM, PM and LM of a conventional model into one neural network. However, in the on-device setting the large memory footprint of an FST denormer makes spoken domain training more difficult, forcing training back into the written domain and resulting in poor model performance on numeric sequences. In this paper, we investigate different techniques to improve E2E model performance on numeric data. We find that by using a text-to-speech system to generate additional training data that emphasizes difficult numeric utterances, as well as by using an independently-trained small-footprint neural network to perform spoken-to-written domain denorming, we achieve strong results in several numeric classes. In the case of the longest numeric sequences, for which the OOV issue is most prevalent, we see reduction of WER by up to a factor of 7.
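As a heavily simplified illustration of spoken-to-written denorming, the toy rule-based converter below handles simple dollar amounts. The paper's denormer is a trained neural network; the coverage, table contents, and function names here are hypothetical stand-ins.

```python
# A rule-based stand-in for spoken-to-written domain denorming
# ("one dollar and twenty five cents" -> "$1.25"). Toy coverage only.

UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
         "eighteen": 18, "nineteen": 19}

def words_to_number(words):
    """Sum a short number-word sequence, e.g. ['twenty', 'five'] -> 25."""
    return sum(UNITS.get(w, 0) + TENS.get(w, 0) + TEENS.get(w, 0)
               for w in words)

def denorm_dollars(spoken: str) -> str:
    """Convert a spoken dollar amount to written form; pass through else."""
    toks = spoken.split()
    if "dollar" in toks or "dollars" in toks:
        cut = toks.index("dollar") if "dollar" in toks else toks.index("dollars")
        dollars = words_to_number(toks[:cut])
        rest = [t for t in toks[cut + 1:] if t not in ("and", "cents", "cent")]
        return f"${dollars}.{words_to_number(rest):02d}"
    return spoken

print(denorm_dollars("one dollar and twenty five cents"))  # -> $1.25
```

A learned denormer generalizes far beyond such rules; the sketch only makes the spoken-to-written mapping concrete.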
View details
Streaming End-to-End Speech Recognition for Mobile Devices
Raziel Alvarez
Ding Zhao
Ruoming Pang
Qiao Liang
Deepti Bhatia
Yuan Shangguan
ICASSP (2019)
Preview abstract
End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
View details
Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model
Interspeech 2019 (2019) (to appear)
Preview abstract
Multilingual end-to-end (E2E) models have shown great promise as a means to expand coverage of the world's languages by automatic speech recognition systems. They improve over monolingual E2E systems, especially on low-resource languages, and simplify training and serving by eliminating language-specific acoustic, pronunciation, and language models. This work aims to develop an E2E multilingual system which is equipped to operate in low-latency interactive applications as well as handle the challenges of real-world imbalanced data. First, we present a streaming E2E multilingual model. Second, we compare techniques to deal with imbalance across languages. We find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual system achieves lower word error rate (WER) than state-of-the-art conventional monolingual models by at least 10% relative on every language.
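A per-language residual adapter of the kind described above can be sketched as a small bottleneck network added around an encoder activation, with separate adapter weights per language. Dimensions, weights, and language codes below are made up for illustration.

```python
# Sketch of a per-language residual adapter: x + W_up . relu(W_down . x),
# with one (W_down, W_up) pair per language. Sizes are illustrative.
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def adapter(x, W_down, W_up):
    """Residual bottleneck adapter applied to one activation vector."""
    return [a + b for a, b in zip(x, matvec(W_up, relu(matvec(W_down, x))))]

random.seed(0)
d, bottleneck = 8, 2  # encoder dim and adapter bottleneck (illustrative)
# one adapter per language; only the selected language's weights are used
adapters = {
    lang: ([[random.gauss(0, 0.1) for _ in range(d)] for _ in range(bottleneck)],
           [[random.gauss(0, 0.1) for _ in range(bottleneck)] for _ in range(d)])
    for lang in ("hi", "bn")
}
x = [1.0] * d
W_down, W_up = adapters["hi"]
y = adapter(x, W_down, W_up)
print(len(y))  # same dimensionality as the input
```

Because the adapter is residual and low-rank, it adds few parameters per language and reduces to the identity when its weights are zero.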
View details
Preview abstract
The tradeoff between word error rate (WER) and latency is very important for online automatic speech recognition (ASR) applications. We want the system to endpoint and close the microphone as quickly as possible, without degrading WER. For conventional ASR systems, endpointing is a separate model from the acoustic, pronunciation and language models (AM, PM, LM), which can often cause endpointer problems, with either a higher WER or larger latency. In keeping with the all-neural spirit of end-to-end (E2E) models, which fold the AM, PM and LM into one neural network, in this work we look at folding the endpointer into the model. On a large vocabulary Voice Search task, we show that joint optimization of the endpointer with the E2E model results in no quality degradation and reduces latency by more than a factor of 2 compared to having a separate endpointer with the E2E model.
View details
Preview abstract
Contextual biasing in end-to-end (E2E) models is challenging because E2E models perform poorly on proper nouns and only a limited number of candidates are kept during beam search decoding. This problem is exacerbated when biasing towards proper nouns in foreign languages, such as geographic location names, which are virtually unseen in training and are thus out-of-vocabulary (OOV). While a grapheme or wordpiece E2E model might have a difficult time spelling OOV words, phonemes are more acoustically oriented, and past work has shown that E2E models can better predict phonemes for such words. In this work, we address the OOV issue by incorporating phonemes in a wordpiece E2E model, and perform contextual biasing at the phoneme level to recognize foreign words. Phonemes are mapped from the source language to the foreign language and subsequently transduced to foreign words using pronunciations. We show that phoneme-based biasing performs 16% better than a grapheme-only biasing model, and 8% better than the wordpiece-only biasing model on a foreign place name recognition task, while causing slight degradation on regular English tasks.
View details
Preview abstract
We present two end-to-end models: Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words or words as the unit of choice to model text. These units are difficult to scale to languages with large vocabularies, particularly in the multilingual case. In this work, we model text via a sequence of unicode bytes. Bytes allow us to avoid large softmaxes in languages with large vocabularies, and to share representations in multilingual models. We show that bytes are superior to grapheme characters over a wide variety of languages in end-to-end speech recognition. We also present an end-to-end multilingual model using unicode byte representations, which outperforms each respective single language baseline by 4-5% relative. Finally, we present an end-to-end multilingual speech synthesis model using unicode byte representations which also achieves state-of-the-art performance.
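The byte-level representation is easy to make concrete: any string, in any script, encodes to a sequence of UTF-8 byte values in 0..255, so a 256-way output softmax suffices regardless of vocabulary size. A minimal sketch:

```python
# Modeling text as UTF-8 bytes: every string in every language maps to
# values in 0..255, so the output vocabulary is fixed at 256 symbols.

def to_bytes(text: str):
    """Text -> list of byte ids (the model's target sequence)."""
    return list(text.encode("utf-8"))

def from_bytes(ids):
    """Byte ids -> text (inverting a predicted sequence)."""
    return bytes(ids).decode("utf-8")

for s in ["hello", "héllo", "こんにちは"]:
    ids = to_bytes(s)
    assert max(ids) < 256 and from_bytes(ids) == s
    print(s, "->", len(ids), "bytes")
```

The tradeoff is longer target sequences: a non-ASCII character costs two to four byte tokens instead of one character token.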
View details
Improving the Performance of Online Neural Transducer models
Patrick Nguyen
Proc. ICASSP (2018)
Preview abstract
Having a sequence-to-sequence model which can operate in an online fashion is important for streaming applications such as Voice Search. The neural transducer (NT) is a streaming sequence-to-sequence model, but has been shown to degrade significantly in performance compared to non-streaming models such as Listen, Attend and Spell (LAS). In this paper, we present various improvements to NT. Specifically, we look at increasing the window over which NT computes attention, mainly by looking backwards in time so the model still remains online. In addition, we explore initializing a NT model from a LAS-trained model so that it is guided with a better alignment. Finally, we explore including stronger language models, such as using wordpiece models and applying an external LM during the beam search. On a Voice Search task, we find that with these improvements we can get NT to match the performance of LAS.
View details
A comparison of techniques for language model integration in encoder-decoder speech recognition
Shubham Toshniwal
Karen Livescu
IEEE SLT (2018)
Preview abstract
Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and requires only a parallel corpus of speech and text for training. However, unlike in conventional approaches that combine separate acoustic and language models, it is not clear how to use additional (unpaired) text. While there has been previous work on methods addressing this problem, a thorough comparison among methods is still lacking. In this paper, we compare a suite of past methods and some of our own proposed methods for using unpaired text data to improve encoder-decoder models. For evaluation, we use the medium-sized Switchboard data set and the large-scale Google voice search and dictation data sets. Our results confirm the benefits of using unpaired text across a range of methods and data sets. Surprisingly, for first-pass decoding, the rather simple approach of shallow fusion performs best across data sets. However, for Google data sets we find that cold fusion has a lower oracle error rate and outperforms other approaches after second-pass rescoring on the Google voice search data set.
View details
Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models
Patrick Nguyen
ICASSP 2018 (to appear)
Preview abstract
Sequence-to-sequence models, such as attention-based models in automatic speech recognition (ASR), are typically trained to optimize the cross-entropy criterion which corresponds to improving the log-likelihood of the data. However, system performance is usually measured in terms of word error rate (WER), not log-likelihood. Traditional ASR systems benefit from discriminative sequence training which optimizes criteria such as the state-level minimum Bayes risk (sMBR) which are more closely related to WER. In the present work, we explore techniques to train attention-based models to directly minimize expected word error rate. We consider two loss functions which approximate the expected number of word errors: either by sampling from the model, or by using N-best lists of decoded hypotheses, which we find to be more effective than the sampling-based method. In experimental evaluations, we find that the proposed training procedure improves performance by up to 8.2% relative to the baseline system. This allows us to train grapheme-based, uni-directional attention-based models which match the performance of a traditional, state-of-the-art, discriminative sequence-trained system on a mobile voice-search task.
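The N-best approximation to the expected word-error criterion can be sketched directly: renormalize the hypothesis scores into a distribution over the N-best list and take the expectation of word-level edit distance. The scores and hypotheses below are made up, and the sketch omits training-time details of the full loss.

```python
# Sketch of the N-best expected word-error loss: E[errors] under the
# model's renormalized distribution over decoded hypotheses.
import math

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two transcripts."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1]

def expected_errors(ref, nbest):
    """nbest: list of (hypothesis, log_score). Expected word errors."""
    m = max(s for _, s in nbest)
    probs = [math.exp(s - m) for _, s in nbest]  # renormalize over N-best
    z = sum(probs)
    errs = [edit_distance(ref, h) for h, _ in nbest]
    return sum((p / z) * e for p, e in zip(probs, errs))

nbest = [("turn on the lights", -1.0), ("turn on the light", -1.5)]
print(expected_errors("turn on the lights", nbest))
```

Because the expectation is differentiable in the hypothesis log-scores, gradients push probability mass toward low-error hypotheses, which is the point of the criterion.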
View details
Compression of End-to-End Models
Ruoming Pang
Suyog Gupta
Shuyuan Zhang
Interspeech (2018)
Preview abstract
End-to-end models which are trained to directly output grapheme or word-piece targets have been demonstrated to be competitive with conventional speech recognition models. Such models do not require additional resources for decoding, and are typically much smaller than conventional models, which makes them particularly attractive in the context of on-device speech recognition, where both small memory footprint and low power consumption are critical. With these constraints in mind, in this work we consider the problem of compressing end-to-end models with the goal of minimizing the number of model parameters without sacrificing model accuracy. We explore matrix factorization, knowledge distillation and parameter sparsity to determine the most effective method given a fixed parameter budget.
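For the matrix-factorization option, the parameter arithmetic is simple: replacing an m x n weight matrix with a rank-r product of m x r and r x n factors saves parameters whenever r < m*n / (m + n). The sizes below are illustrative, not from the paper.

```python
# Back-of-the-envelope parameter counts for low-rank factorization of a
# weight matrix: full m x n vs. factored (m x r)(r x n).

def full_params(m, n):
    return m * n

def factored_params(m, n, r):
    return m * r + r * n

m, n, r = 1024, 1024, 128  # illustrative layer sizes and rank
print(full_params(m, n))             # 1048576
print(factored_params(m, n, r))      # 262144
print(factored_params(m, n, r) / full_params(m, n))  # 0.25 of the original
```

The same counting generalizes to any layer; the open question the paper studies empirically is how small r can get before accuracy suffers.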
View details
Multilingual Speech Recognition with a Single End-to-End Model
Shubham Toshniwal
ICASSP (2018)
Preview abstract
Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we further improve performance by an additional 7% relative and eliminate confusion between different languages.
View details
Preview abstract
Recent work has shown that end-to-end (E2E) speech recognition architectures such as Listen, Attend and Spell (LAS) can achieve state-of-the-art quality results in LVCSR tasks. One benefit of this architecture is that it does not require a separately trained pronunciation model, language model, and acoustic model. However, this property also introduces a drawback: it is not possible to adjust language model contributions separately from the system as a whole. As a result, inclusion of dynamic, contextual information (such as nearby restaurants or upcoming events) into recognition requires a different approach from what has been applied in conventional systems. We introduce a technique to adapt the inference process to take advantage of contextual signals by adjusting the output likelihoods of the neural network at each step in the beam search. We apply the proposed method to a LAS E2E model and show its effectiveness in experiments on a voice search task with both artificial and real contextual information. Given optimal context, our system reduces WER from 9.2% to 3.8%. The results show that this technique is effective at incorporating context into the prediction of an E2E system.
Index Terms: speech recognition, end-to-end, contextual speech recognition, neural network
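The likelihood-adjustment idea can be sketched as a score bonus applied whenever extending a hypothesis keeps it on a prefix of a context phrase. The bonus value, phrase set, and helper names below are invented for illustration.

```python
# Sketch of contextual biasing during beam search: add a bonus to the
# log-likelihood of extensions that stay on a prefix of a bias phrase.

BIAS_PHRASES = ["call mom", "navigate home"]  # illustrative context
BONUS = 2.0  # illustrative log-likelihood bonus for a matching extension

def on_bias_prefix(words):
    """True if the word sequence is a prefix of some bias phrase."""
    joined = " ".join(words)
    return any(p == joined or p.startswith(joined + " ")
               for p in BIAS_PHRASES)

def biased_score(base_log_prob, hyp_words, next_word):
    """Adjusted score for extending a beam hypothesis by one word."""
    extended = hyp_words + [next_word]
    return base_log_prob + (BONUS if on_bias_prefix(extended) else 0.0)

# extending "call" with "mom" gets the bonus; "call dad" does not
print(biased_score(-1.0, ["call"], "mom"))  # -1.0 + 2.0 = 1.0
print(biased_score(-1.0, ["call"], "dad"))  # -1.0
```

A production system would compile the phrase set into a weighted automaton and subtract the bonus if a partial match later fails; the prefix check above only shows where the adjustment enters the search.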
View details
Domain Adaptation Using Factorized Hidden Layer for Robust Automatic Speech Recognition
Interspeech (2018), pp. 892-896
Preview abstract
Domain robustness is a challenging problem for automatic speech recognition (ASR). In this paper, we consider speech data collected for different applications as separate domains and investigate the robustness of acoustic models trained on multi-domain data on unseen domains. Specifically, we use Factorized Hidden Layer (FHL) as a compact low-rank representation to adapt a multi-domain ASR system to unseen domains. Experimental results on two unseen domains show that FHL is a more effective adaptation method compared to selectively fine-tuning part of the network, without dramatically increasing the model parameters. Furthermore, we found that using singular value decomposition to initialize the low-rank bases of an FHL model leads to a faster convergence and improved performance.
View details
Spectral distortion model for training phase-sensitive deep-neural networks for far-field speech recognition
Chanwoo Kim
Rajeev Nongpiur
ICASSP 2018 (2018)
Preview abstract
In this paper, we present an algorithm which introduces phase perturbation to the training database when training phase-sensitive deep neural-network models. Traditional features such as log-mel or cepstral features do not have any phase-relevant information. However, more recent features such as raw-waveform or complex spectra features do contain phase-relevant information. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortions such as variations in microphone characteristics, reverberation, and so on. For traditional magnitude-based features, it is widely known that adding noise or reverberation, often called Multistyle TRaining (MTR), improves robustness. In a similar spirit, we propose an algorithm which introduces spectral distortion to make the deep-learning model more robust against phase distortion. We call these approaches Spectral-Distortion TRaining (SDTR) and Phase-Distortion TRaining (PDTR). In our experiments using a training set consisting of 22 million utterances, this approach has proved to be quite successful in reducing word error rates on test sets recorded with real microphones on Google Home.
View details
Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection
Gabor Simko
Aäron van den Oord
ICASSP 2018
Preview abstract
Voice-activity-detection (VAD) is the task of predicting which parts of an utterance contain speech versus background noise. It is an important first step in determining when to open the microphone (i.e., start-of-speech) and close the microphone (i.e., end-of-speech) for streaming speech recognition applications such as Voice Search. Long short-term memory neural networks (LSTMs) have been a popular architecture for sequential modeling of acoustic signals, and have been successfully used for many VAD applications. However, it has been observed that LSTMs suffer from state saturation problems when the utterance is long (i.e., for voice dictation tasks), and thus require the LSTM state to be periodically reset. In this paper, we propose an alternative architecture that does not suffer from saturation problems, modeling temporal variations through a stateless dilated convolutional neural network (CNN). The proposed architecture differs from conventional CNNs in three respects: (1) dilated causal convolution, (2) gated activations, and (3) residual connections. Results on a Google Voice Typing task show that the proposed architecture achieves a 14% relative false accept (FA) improvement at a false reject (FR) rate of 1% over state-of-the-art LSTMs for the VAD task. We also include detailed experiments investigating the factors that distinguish the proposed architecture from conventional convolution.
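The dilated causal convolution at the heart of the proposed architecture can be sketched in plain Python (gating and residual connections omitted); the weights and input signal below are illustrative.

```python
# Sketch of one dilated causal convolution: each output depends only on
# current and past inputs, with taps spaced `dilation` frames apart, so
# no per-utterance state needs to be carried or reset.

def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k*dilation], zero-padded past."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dilated_causal_conv(x, [0.5, 0.5], dilation=2))
```

Stacking such layers with doubling dilation grows the receptive field exponentially in depth, which is how the stateless network covers long temporal context.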
View details
Preview abstract
In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.
View details
State-of-the-art Speech Recognition With Sequence-to-Sequence Models
Patrick Nguyen
Katya Gonina
Navdeep Jaitly
Jan Chorowski
ICASSP (2018) (to appear)
Preview abstract
Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In our previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore techniques such as synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500 hour voice search task, we find that the proposed changes improve the WER of the LAS system from 9.2% to 5.6%, while the best conventional system achieves 6.7% WER. We also test both models on a dictation dataset, where our model provides 4.1% WER while the conventional system provides 5% WER.
View details
No Need For A Lexicon? Evaluating The Value Of The Pronunciation Lexica In End-To-End Models
Seungji Lee
Vlad Schogol
Patrick Nguyen
ICASSP (2018)
Preview abstract
For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has begun to be challenged recently by end-to-end models which seek to combine acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since they remove the need for a separate expert-curated pronunciation lexicon to map from phoneme-based units to words. However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or from the joint learning of the various components with grapheme-based units.
In this work, we conduct detailed experiments which are aimed at quantifying the value of phoneme-based pronunciation lexica in the context of end-to-end models. We examine phoneme-based end-to-end models, which are contrasted against grapheme-based ones on a large vocabulary English Voice-search task, where we find that graphemes do indeed outperform phoneme-based models. We also compare grapheme and phoneme-based end-to-end approaches on a multi-dialect English task, which once again confirms the superiority of graphemes, greatly simplifying the system for recognizing multiple dialects.
View details
An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model
Patrick Nguyen
ICASSP (2018)
Preview abstract
Attention-based sequence-to-sequence models for automatic speech recognition jointly train an acoustic model, language model, and alignment mechanism. Thus, the language model component is only trained on transcribed audio-text pairs. This leads to the use of shallow fusion with an external language model at inference time. Shallow fusion refers to log-linear interpolation with a separately trained language model at each step of the beam search. In this work, we investigate the behavior of shallow fusion across a range of conditions: different types of language models, different decoding units, and different tasks. On Google Voice Search, we demonstrate that the use of shallow fusion with a neural LM with wordpieces yields a 9.1% relative word error rate reduction (WERR) over our competitive attention-based sequence-to-sequence model, obviating the need for second-pass rescoring.
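The log-linear interpolation that defines shallow fusion is compact enough to write out: at each beam-search step the fused score is log p_asr(y) + lambda * log p_lm(y). The toy token distributions and interpolation weight below are made up.

```python
# Sketch of shallow fusion at one beam-search step: combine the ASR
# model's log-probabilities with an external LM's, per candidate token.
import math

def shallow_fusion_scores(asr_log_probs, lm_log_probs, lam):
    """score(y) = log p_asr(y) + lam * log p_lm(y) for each candidate."""
    return {tok: asr_log_probs[tok] + lam * lm_log_probs[tok]
            for tok in asr_log_probs}

asr = {"fair": math.log(0.6), "fare": math.log(0.4)}
lm = {"fair": math.log(0.2), "fare": math.log(0.8)}  # context favors "fare"
fused = shallow_fusion_scores(asr, lm, lam=0.5)
best = max(fused, key=fused.get)
print(best)  # the LM flips the decision to "fare"
```

With lam=0 the fused score reduces to the ASR score alone; the weight trades off acoustic evidence against the text-only LM.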
View details
Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home
Chanwoo Kim
Kean Chin
Thad Hughes
Interspeech 2017 (2017), pp. 379-383
Preview abstract
We describe the structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition. The system simulates millions of different room dimensions, a wide distribution of reverberation times and signal-to-noise ratios, and a range of microphone and sound source locations. We start with a relatively clean training set as the source and artificially create simulated data by randomly sampling a noise configuration for every new training example. As a result, the acoustic model is trained using examples that are virtually never repeated. We evaluate performance of this approach based on room simulation using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in our earlier work, which uses CFFT layers and LSTM AMs for joint multichannel processing and acoustic modeling. Results show that the simulator-driven approach is quite effective in obtaining large improvements not only in simulated test conditions, but also in real/rerecorded conditions. This room simulation system has been employed in training acoustic models including the ones for the recently released Google Home.
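One ingredient of such a simulator, mixing noise into clean speech at a randomly sampled target SNR, follows directly from the standard definition SNR_dB = 10*log10(P_signal / P_noise). The signals and SNR range below are toy values, not the simulator's actual configuration.

```python
# Sketch of noise mixing at a sampled SNR: scale the noise so the
# mixture hits the requested signal-to-noise ratio, then add it in.
import math, random

def power(x):
    """Mean power of a sample sequence."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` to the requested SNR relative to `speech`, then add."""
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

random.seed(0)
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [random.gauss(0, 1) for _ in range(1000)]
snr_db = random.uniform(0, 20)  # a fresh noise configuration per example
mixed = mix_at_snr(speech, noise, snr_db)
print(len(mixed))
```

Sampling a fresh noise signal, SNR, and (in the full simulator) room impulse response per example is what makes training examples virtually never repeat.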
View details
Preview abstract
Recently, very deep networks, with as many as hundreds of layers, have shown great success in image classification tasks. One key component that has enabled such deep models is the use of "skip connections", including either residual or highway connections, to alleviate the vanishing and exploding gradient problems. While these connections have been explored for speech, they have mainly been explored for feed-forward networks. Since recurrent structures, such as LSTMs, have produced state-of-the-art results on many of our Voice Search tasks, the goal of this work is to thoroughly investigate different approaches to adding depth to recurrent structures. Specifically, we experiment with novel Highway-LSTM models with bottleneck skip connections and show that a 10-layer model can outperform a state-of-the-art 5-layer LSTM model with the same number of parameters by 2% relative WER. In addition, we experiment with Recurrent Highway layers and find these to be on par with Highway-LSTM models, when given sufficient depth.
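A highway connection of the kind investigated here can be sketched on a single activation vector: a learned gate g interpolates between the layer's transform h(x) and the input, y = g * h(x) + (1 - g) * x. The transform and gate values below are illustrative.

```python
# Sketch of a highway connection: a sigmoid gate decides, per dimension,
# how much of the layer transform vs. the raw input to pass through.
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def highway(x, transform, gate_pre_activations):
    """y = g * h(x) + (1 - g) * x with g = sigmoid(gate pre-activations)."""
    h = transform(x)
    g = [sigmoid(a) for a in gate_pre_activations]
    return [gi * hi + (1 - gi) * xi for gi, hi, xi in zip(g, h, x)]

x = [1.0, -2.0, 0.5]
y = highway(x, lambda v: [math.tanh(a) for a in v], [-10.0, -10.0, -10.0])
print(y)  # gate ~ 0 everywhere, so the input passes through almost unchanged
```

When the gate saturates toward zero the layer acts as an identity, which is what lets gradients flow through very deep stacks during training.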
View details
Raw Multichannel Processing Using Deep Neural Networks
Kean Chin
Chanwoo Kim
New Era for Robust Speech Recognition: Exploiting Deep Learning, Springer (2017)
Preview abstract
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this chapter, we perform multi-channel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.
View details
Preview abstract
The task of endpointing is to determine when the user has finished speaking, which is important for interactive speech applications such as voice search and Google Home. In this paper, we propose a GLDNN-based (grid long short-term memory, deep neural network) endpointer model and show that it provides significant improvements over a state-of-the-art CLDNN (convolutional, long short-term memory, deep neural networks) model. Specifically, we replace the convolution layer with a grid LSTM layer that models both spectral and temporal variations through recurrent connections. Results show that the GLDNN achieves 39% relative improvement in false alarm rate at a fixed false reject rate of 2%, and reduces median latency by 11%. We also include detailed experiments investigating why grid LSTMs offer better performance than CLDNNs. Analysis reveals that the recurrent connection along the frequency axis is an important factor that greatly contributes to the performance of grid LSTMs, especially in the presence of background noise. Finally, we also show that multichannel input further increases robustness to background speech. Overall, we achieved 16% (100 ms) endpointer latency improvement relative to our previous best model.
View details
Preview abstract
In this paper, we conduct a detailed investigation of attention-based models for automatic speech recognition (ASR). First, we explore different types of attention, including online and full-sequence attention. Second, we explore different sub-word units to see how much of the end-to-end ASR process can reasonably be captured by an attention model. In experimental evaluations, we find that although attention is typically focused over a small region of the acoustics during each step of next label prediction, full sequence attention outperforms "online" attention, although this gap can be significantly reduced by increasing the length of the segments over which attention is computed. Furthermore, we find that content-independent phonemes are a reasonable sub-word unit for attention models; when used in the second-pass to rescore N-best hypotheses these models provide over a 10% relative improvement in word error rate.
View details
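The attention mechanisms the abstract compares can be illustrated with a minimal content-based attention step: score each encoder state against a decoder query, normalize with a softmax, and form a weighted context vector. This is a hedged sketch with hypothetical names (`attend`, `window`); the `window` argument is only a crude stand-in for the paper's notion of "online" attention over a limited segment, not its exact formulation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, encoder_states, window=None):
    """Content-based attention: dot-product scores over encoder states.
    If `window` is given, restrict attention to the last `window` states
    (a crude stand-in for online attention over a limited segment)."""
    states = encoder_states if window is None else encoder_states[-window:]
    scores = [sum(q * s for q, s in zip(query, st)) for st in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * st[d] for w, st in zip(weights, states))
               for d in range(dim)]
    return weights, context
```

Full-sequence attention passes all encoder states; online attention restricts the set, which is why lengthening the attended segment narrows the gap reported above.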
A Comparison of Sequence-to-Sequence Models for Speech Recognition
Navdeep Jaitly
Interspeech 2017, ISCA (2017)
Preview abstract
In this work, we conduct a detailed evaluation of various all-neural, end-to-end trained, sequence-to-sequence models applied to the task of speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon, or a separate language model. We examine several sequence-to-sequence models including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attention-based model, and a model which augments the RNN-transducer with an attention mechanism.
We find that end-to-end models are capable of learning all components of the speech recognition process: acoustic, pronunciation, and language models, directly outputting words in the written form (e.g., “one hundred dollars” to “$100”), in a single jointly-optimized neural network. Furthermore, the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline outperforms these models on voice-search test sets.
View details
Acoustic Modeling for Google Home
Joe Caroselli
Kean Chin
Chanwoo Kim
Mitchel Weintraub
Erik McDermott
INTERSPEECH 2017 (2017)
Preview abstract
This paper describes the technical and system building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel sets. The combination of technical and system advances results in a reduction of WER of over 18% relative compared to the current production system.
View details
Multichannel Signal Processing with Deep Neural Networks for Automatic Speech Recognition
Kean Chin
Chanwoo Kim
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25 (2017), pp. 965-979
Preview abstract
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single-channel waveform model.
View details
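The multichannel filtering performed in the network's first layer is inspired by classical time-domain filter-and-sum beamforming: each microphone channel is convolved with its own FIR filter, and the filtered channels are summed. The sketch below shows only this classical operation with hypothetical names (`filter_and_sum`), not the paper's learned layer.

```python
def filter_and_sum(channels, filters):
    """Time-domain filter-and-sum beamforming: convolve each channel
    with its own FIR filter (causal convolution), then sum across channels.
    channels: list of C signals (lists of samples, equal length T)
    filters:  list of C FIR tap lists (possibly different lengths)
    returns:  single-channel output of length T."""
    T = len(channels[0])
    out = [0.0] * T
    for x, h in zip(channels, filters):
        for t in range(T):
            acc = 0.0
            for k, hk in enumerate(h):
                if t - k >= 0:
                    acc += hk * x[t - k]
            out[t] += acc
    return out
```

For example, if the second channel is a one-sample-delayed copy of the first, delaying the first channel with taps `[0, 1]` time-aligns the two channels so they add constructively, which is exactly the steering effect the learned first layer can mimic.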
Preview abstract
Various neural network architectures have been proposed in the literature to model 2D correlations in the input signal, including convolutional layers, frequency LSTMs and 2D LSTMs such as time-frequency LSTMs, grid LSTMs and ReNet LSTMs. It has been argued that frequency LSTMs can model translational variations similar to CNNs, and 2D LSTMs can model even more variations [1], but no proper comparison has been done for speech tasks. While convolutional layers have been a popular technique in speech tasks, this paper compares convolutional and LSTM architectures to model time-frequency patterns as the first layer in an LDNN [2] architecture. This comparison is particularly interesting when the convolutional layer degrades performance, such as in noisy conditions or when the learned filterbank is not constant-Q [3]. We find that grid-LDNNs offer the best performance of all techniques, and provide between a 1-4% relative improvement over an LDNN and CLDNN on 3 different large vocabulary Voice Search tasks.
View details
Learning Compact Recurrent Neural Networks
Preview
Zhiyun Lu
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2016
Preview abstract
State-of-the-art automatic speech recognition (ASR) systems typically rely on pre-processed features. This paper studies the time-frequency duality in ASR feature extraction methods and proposes extending the standard acoustic model with a complex-valued linear projection layer to learn and optimize features that minimize standard cost functions such as cross entropy. The proposed Complex Linear Projection (CLP) features achieve superior performance compared to pre-processed Log Mel features.
View details
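The core operation behind CLP features can be sketched in a few lines: take the complex spectrum of a frame, project it through a complex-valued weight matrix, and keep log-magnitudes. This is a hedged illustration with hypothetical names (`clp_features`, `dft`); the trained layer in the paper learns the complex weights jointly with the acoustic model rather than taking them as input.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

def clp_features(frame, weights, eps=1e-8):
    """Complex Linear Projection sketch: project the complex spectrum
    through a complex-valued weight matrix, then take log-magnitudes.
    weights: F rows of N complex (or real) coefficients, giving F features."""
    spec = dft(frame)
    feats = []
    for row in weights:
        z = sum(w * s for w, s in zip(row, spec))  # complex projection
        feats.append(math.log(abs(z) + eps))
    return feats
```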
Preview abstract
Recently, neural network acoustic models trained with Connectionist Temporal Classification (CTC) were proposed as an alternative to conventional cross-entropy trained neural network acoustic models, which output frame-level decisions every 10 ms [senior15asru]. As opposed to conventional models, CTC learns an alignment jointly with the acoustic model, and outputs a blank symbol in addition to the regular acoustic state units. This allows the CTC model to run with a lower frame rate, outputting decisions every 30 ms rather than 10 ms as in conventional models, thus improving overall system latency. In this work, we explore how conventional models behave with lower frame rates. On a large vocabulary Voice Search task, we show that with conventional models, we can slow the frame rate to 40 ms while improving WER by 3% relative over a CTC-based model.
View details
Preview abstract
Voice Activity Detection (VAD) is an important preprocessing step in any state-of-the-art speech recognition system. Choosing the right set of features and model architecture can be challenging and is an active area of research. In this paper we propose a novel approach to VAD that tackles feature and model selection jointly. The proposed method is based on a CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Networks) architecture fed directly with the raw waveform. We show that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially in noisy environments. In addition, a CLDNN, which takes advantage of both frequency modeling with the CNN and temporal modeling with the LSTM, is a much better model for VAD than the DNN. The proposed system achieves over 78% relative improvement in False Alarms (FA) at the operating point of 2% False Rejects (FR) in both clean and noisy conditions compared to a DNN of comparable size trained with log-mel features. In addition, we study the impact of the model size and the learned features to provide a better understanding of the proposed architecture.
View details
Preview abstract
Joint multichannel enhancement and acoustic modeling using neural networks has shown promise over the past few years. However, one shortcoming of previous work [1,2,3] is that the filters learned during training are fixed for decoding, potentially limiting the ability of these models to adapt to previously unseen or changing conditions. In this paper we explore a neural network adaptive beamforming (NAB) technique to address this issue. Specifically, we use LSTM layers to predict time domain beamforming filter coefficients at each input frame. These filters are convolved with the framed time domain input signal and summed across channels, essentially performing FIR filter-and-sum beamforming using the dynamically adapted filter. The beamformer output is passed into a waveform CLDNN acoustic model [4] which is trained jointly with the filter prediction LSTM layers. We find that the proposed NAB model achieves a 12.7% relative improvement in WER over a single channel model [4] and reaches similar performance to a "factored" model architecture which utilizes several fixed spatial filters [3] on a 2,000-hour Voice Search task, with a 17.9% decrease in computational cost.
View details
Factored Spatial and Spectral Multichannel Raw Waveform CLDNNs
Preview
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
Automatic Gain Control and Multi-style Training for Robust Small-Footprint Keyword Spotting with Deep Neural Networks
Raziel Alvarez
Preetum Nakkiran
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2015), pp. 4704-4708
Preview abstract
We explore techniques to improve the robustness of small-footprint keyword spotting models based on deep neural networks (DNNs) in the presence of background noise and in far-field conditions. We find that system performance can be improved significantly, with relative improvements up to 75% in far-field conditions, by employing a combination of multi-style training and a proposed novel formulation of automatic gain control (AGC) that estimates the levels of both speech and background noise. Further, we find that these techniques allow us to achieve competitive performance, even when applied to DNNs with an order of magnitude fewer parameters than our baseline.
View details
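The AGC idea in the abstract — tracking signal levels and normalizing the gain accordingly — can be sketched with a simple peak-tracking gain loop. This is a hedged, minimal illustration with hypothetical names (`agc`, `attack`); the paper's formulation additionally estimates speech and background noise levels separately, which this single-level tracker does not attempt.

```python
def agc(signal, target_level=0.5, attack=0.9, floor=1e-4):
    """Minimal automatic gain control sketch: track a smoothed peak level
    of the input and scale each sample so the tracked level maps to
    `target_level`. Rises quickly on louder input, decays slowly."""
    level = floor
    out = []
    for x in signal:
        mag = abs(x)
        # fast rise toward louder input, slow exponential decay otherwise
        level = max(mag, attack * level + (1 - attack) * mag)
        out.append(x * (target_level / max(level, floor)))
    return out
```

Both a loud and a quiet steady input end up at the same output level, which is the property that helps far-field (quieter) keyword spotting in the experiments above.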
Locally-Connected and Convolutional Neural Networks for Small Footprint Speaker Recognition
Preview
Yu-hsin Chen
Mirkó Visontai
Raziel Alvarez
Interspeech (2015)
Preview abstract
This paper describes a series of experiments to extend the application of Context-Dependent (CD) long short-term memory (LSTM) recurrent neural networks (RNNs) trained with Connectionist Temporal Classification (CTC) and sMBR loss. Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations, application to child speech recognition, combination of multiple models, and convolutional input layers. We also investigate the latency of CTC models and show that constraining forward-backward alignment in training can reduce the delay for a real-time streaming speech recognition system. Finally, we investigate transferring knowledge from one network to another through alignments.
View details
Preview abstract
We consider the task of building compact deep learning pipelines suitable for deployment on storage- and power-constrained mobile devices. We propose a unified framework to learn a broad family of structured parameter matrices that are characterized by the notion of low displacement rank. Our structured transforms admit fast function and gradient evaluation, and span a rich range of parameter-sharing configurations whose statistical modeling capacity can be explicitly tuned along a continuum from structured to unstructured. Experimental results show that these transforms can significantly accelerate inference and forward/backward passes during training, and offer superior accuracy-compactness-speed tradeoffs in comparison to a number of existing techniques. In keyword spotting applications in mobile speech recognition, our methods are much more effective than standard linear low-rank bottleneck layers and nearly retain the performance of state-of-the-art models, while providing more than 3.5-fold compression.
View details
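One concrete member of the low-displacement-rank family is the Toeplitz matrix, which is constant along diagonals and therefore fully described by its first row and first column — 2n-1 parameters instead of n². A hedged sketch of a Toeplitz matrix-vector product that never materializes the full matrix (the name `toeplitz_matvec` is illustrative; the paper covers a broader family than this single example):

```python
def toeplitz_matvec(first_col, first_row, x):
    """Multiply a Toeplitz matrix by a vector without building the matrix.
    T[i][j] = first_col[i-j] if i >= j else first_row[j-i], so the whole
    n x n matrix is parameterized by just 2n-1 numbers. Toeplitz matrices
    are a classic low-displacement-rank structure."""
    n = len(x)
    y = []
    for i in range(n):
        acc = 0.0
        for j in range(n):
            t = first_col[i - j] if i >= j else first_row[j - i]
            acc += t * x[j]
        y.append(acc)
    return y
```

The compression claim follows directly from the parameter count: a dense layer of this shape stores n² weights, while the structured version stores O(n) and still admits fast multiplication (FFT-based in practice).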
Preview abstract
Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
View details
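The CLDNN stacking order described above — convolution over frequency, LSTM over time, then a DNN on top — can be sketched at toy scale. This is a hedged, single-unit illustration with hypothetical names (`cldnn_forward`, `lstm_cell`); the real models use many filters, cells, and layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1d(frame, kernel):
    """Valid 1-D convolution along the frequency axis of one frame."""
    k = len(kernel)
    return [sum(kernel[j] * frame[i + j] for j in range(k))
            for i in range(len(frame) - k + 1)]

def lstm_cell(x, h, c, w):
    """Single-unit LSTM step; w maps each gate name to
    (input weights, recurrent weight, bias)."""
    gates = {}
    for g in ("i", "f", "o", "g"):
        wi, wh, b = w[g]
        pre = sum(a * v for a, v in zip(wi, x)) + wh * h + b
        gates[g] = math.tanh(pre) if g == "g" else sigmoid(pre)
    c = gates["f"] * c + gates["i"] * gates["g"]
    h = gates["o"] * math.tanh(c)
    return h, c

def cldnn_forward(frames, kernel, lstm_w, dnn_w):
    """CLDNN sketch: convolution over frequency per frame, an LSTM over
    time, then a DNN (here a single linear unit) on the final state."""
    h, c = 0.0, 0.0
    for frame in frames:
        feat = conv1d(frame, kernel)  # C: reduce frequency variation
        h, c = lstm_cell(feat, h, c, lstm_w)  # L: temporal modeling
    return dnn_w[0] * h + dnn_w[1]  # D: map to output space
```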
Large Vocabulary Automatic Speech Recognition for Children
Melissa Carroll
Noah Coccaro
Qi-Ming Jiang
Interspeech (2015)
Preview abstract
Recently, Google launched YouTube Kids, a mobile application for children, that uses a speech recognizer built specifically for recognizing children’s speech. In this paper we present techniques we explored to build such a system. We describe the use of a neural network classifier to identify matched acoustic training data, filtering data for language modeling to reduce the chance of producing offensive results. We also compare long short-term memory (LSTM) recurrent networks to convolutional, LSTM, deep neural networks (CLDNN). We found that a CLDNN acoustic model outperforms an LSTM across a variety of different conditions, but does not model child speech relatively better than adult speech. Overall, these findings allow us to build a successful, state-of-the-art large vocabulary speech recognizer for both children and adults.
View details
Exemplar-Based Processing for Speech Recognition: An Overview
Preview
Bhuvana Ramabhadran
David Nahamoo
Dimitri Kanevsky
Dirk Van Compernolle
Kris Demuynck
Jort F. Gemmeke
Jerome R. Bellegarda
Shiva Sundaram
IEEE Signal Process. Mag., vol. 29 (2012), pp. 98-113
Deep Neural Networks for Acoustic Modeling in Speech Recognition
Geoffrey Hinton
Li Deng
Dong Yu
George Dahl
Abdel-rahman Mohamed
Navdeep Jaitly
Patrick Nguyen
Brian Kingsbury
Signal Processing Magazine (2012)
Preview abstract
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
View details
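In the hybrid DNN/HMM setup the abstract describes, the network's posteriors p(state|frame) are converted into the scaled likelihoods the HMM decoder needs by dividing out the state priors, since p(frame|state) ∝ p(state|frame) / p(state). A minimal sketch of that conversion (function names are illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax: network outputs -> posteriors."""
    m = max(logits)
    es = [math.exp(l - m) for l in logits]
    s = sum(es)
    return [e / s for e in es]

def scaled_log_likelihoods(logits, state_priors, eps=1e-12):
    """Hybrid DNN/HMM conversion: dividing posteriors p(state|frame) by
    priors p(state) yields likelihoods scaled by the constant p(frame),
    which is sufficient for HMM decoding. Done in log space."""
    post = softmax(logits)
    return [math.log(p + eps) - math.log(q + eps)
            for p, q in zip(post, state_priors)]
```

With uniform priors the conversion preserves the posterior ordering; in practice the priors come from state frequencies in the training alignments.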
Deep Convolutional Neural Networks for Large-scale Speech Tasks
Brian Kingsbury
George Saon
Hagen Soltau
Abdel-rahman Mohamed
George E. Dahl
Bhuvana Ramabhadran
Neural Networks, vol. 64 (2015), pp. 39-48
Improvements to filterbank and delta learning within a deep neural network framework
Brian Kingsbury
Abdel-rahman Mohamed
George Saon
Bhuvana Ramabhadran
ICASSP (2014), pp. 6839-6843
Parallel deep neural network training for LVCSR tasks using blue gene/Q
I-Hsin Chung
Bhuvana Ramabhadran
Michael Picheny
John A. Gunnels
Brian Kingsbury
George Saon
Vernon Austel
Upendra V. Chaudhari
INTERSPEECH (2014), pp. 1048-1052
Joint training of convolutional and non-convolutional neural networks
Deep Scattering Spectrum with deep neural networks
Vijayaditya Peddinti
Shay Maymon
Bhuvana Ramabhadran
David Nahamoo
Vaibhava Goel
ICASSP (2014), pp. 210-214
Parallel Deep Neural Network Training for Big Data on Blue Gene/Q
I-Hsin Chung
Bhuvana Ramabhadran
Michael Picheny
John A. Gunnels
Vernon Austel
Upendra V. Chaudhari
Brian Kingsbury
SC (2014), pp. 745-753
Deep scattering spectra with deep neural networks for LVCSR tasks
Vijayaditya Peddinti
Brian Kingsbury
Petr Fousek
Bhuvana Ramabhadran
David Nahamoo
INTERSPEECH (2014), pp. 900-904
An Evaluation of Posterior Modeling Techniques for Phonetic Recognition
David Nahamoo
Bhuvana Ramabhadran
Dimitri Kanevsky
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2013), pp. 7165-7169
Deep convolutional neural networks for LVCSR
Abdel-rahman Mohamed
Brian Kingsbury
Bhuvana Ramabhadran
ICASSP (2013), pp. 8614-8618
Optimization Techniques to Improve Training Speed of Deep Neural Networks for Large Speech Tasks
Brian Kingsbury
Hagen Soltau
Bhuvana Ramabhadran
IEEE Transactions on Audio, Speech & Language Processing, vol. 21 (2013), pp. 2267-2276
Developing speech recognition systems for corpus indexing under the IARPA Babel program
Jia Cui
Xiaodong Cui
Bhuvana Ramabhadran
Janice Kim
Brian Kingsbury
Jonathan Mamou
Lidia Mangu
Michael Picheny
Abhinav Sethy
ICASSP (2013), pp. 6753-6757
Improving deep neural networks for LVCSR using rectified linear units and dropout
Learning filter banks within a deep neural network framework
Improvements to deep convolutional neural networks for LVCSR
Brian Kingsbury
Abdel-rahman Mohamed
George E. Dahl
George Saon
Hagen Soltau
Tomás Beran
Aleksandr Y. Aravkin
Bhuvana Ramabhadran
CoRR, vol. abs/1309.1501 (2013)
Improvements to Deep Convolutional Neural Networks for LVCSR
Brian Kingsbury
Abdel-rahman Mohamed
George E. Dahl
George Saon
Hagen Soltau
Tomás Beran
Aleksandr Y. Aravkin
Bhuvana Ramabhadran
ASRU (2013), pp. 315-320
Improving training time of Hessian-free optimization for deep neural networks using preconditioning and sampling
Lior Horesh
Brian Kingsbury
Aleksandr Y. Aravkin
Bhuvana Ramabhadran
CoRR, vol. abs/1309.1508 (2013)
Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling
Lior Horesh
Brian Kingsbury
Aleksandr Y. Aravkin
Bhuvana Ramabhadran
ASRU (2013), pp. 303-308
Improved pre-training of Deep Belief Networks using Sparse Encoding Symmetric Machines
Auto-encoder bottleneck features using deep belief networks
Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization
Enhancing Exemplar-Based Posteriors for Speech Recognition Tasks
N-best entropy based data selection for acoustic modeling
Nobuyasu Itoh
Dan-Ning Jiang
Jie Zhou
Bhuvana Ramabhadran
ICASSP (2012), pp. 4133-4136
Application specific loss minimization using gradient boosting
Reducing Computational Complexities of Exemplar-Based Sparse Representations with Applications to Large Vocabulary Speech Recognition
Convergence of Line Search A-Function Methods
A convex hull approach to sparse representations for exemplar-based speech recognition
David Nahamoo
Dimitri Kanevsky
Bhuvana Ramabhadran
Parikshit M. Shah
ASRU (2011), pp. 59-64
Making Deep Belief Networks effective for large vocabulary continuous speech recognition
Brian Kingsbury
Bhuvana Ramabhadran
Petr Fousek
Petr Novák
Abdel-rahman Mohamed
ASRU (2011), pp. 30-35
Deep Belief Networks using discriminative features for phone recognition
Abdel-rahman Mohamed
George E. Dahl
Bhuvana Ramabhadran
Geoffrey E. Hinton
Michael A. Picheny
ICASSP (2011), pp. 5060-5063
A-Functions: A generalization of Extended Baum-Welch transformations to convex optimization
Dimitri Kanevsky
David Nahamoo
Bhuvana Ramabhadran
Peder A. Olsen
ICASSP (2011), pp. 5164-5167
Exemplar-based Sparse Representation phone identification features
David Nahamoo
Bhuvana Ramabhadran
Dimitri Kanevsky
Vaibhava Goel
Parikshit M. Shah
ICASSP (2011), pp. 4492-4495
Data selection for language modeling using sparse representations
Abhinav Sethy
Bhuvana Ramabhadran
Dimitri Kanevsky
INTERSPEECH (2010), pp. 2258-2261
Bayesian compressive sensing for phonetic classification
Sparse representations for text categorization
Sameer Maskey
Dimitri Kanevsky
Bhuvana Ramabhadran
David Nahamoo
Julia Hirschberg
INTERSPEECH (2010), pp. 2266-2269
The Use of isometric transformations and bayesian estimation in compressive sensing for fMRI classification
Avishy Carmi
Pini Gurfil
Dimitri Kanevsky
David Nahamoo
Bhuvana Ramabhadran
ICASSP (2010), pp. 493-496
An analysis of sparseness and regularization in exemplar-based methods for speech classification
Dimitri Kanevsky
Bhuvana Ramabhadran
David Nahamoo
INTERSPEECH (2010), pp. 2842-2845
A voice-commandable robotic forklift working alongside humans in minimally-prepared outdoor environments
Seth J. Teller
Matthew R. Walter
Matthew E. Antone
Andrew Correa
Randall Davis
Luke Fletcher
Emilio Frazzoli
Jim Glass
Jonathan P. How
Albert S. Huang
Jeong Hwan Jeon
Sertac Karaman
Brandon Luders
Nicholas Roy
ICRA (2010), pp. 526-533
Incorporating sparse representation phone identification features in automatic speech recognition using exponential families
Vaibhava Goel
Bhuvana Ramabhadran
Peder A. Olsen
David Nahamoo
Dimitri Kanevsky
INTERSPEECH (2010), pp. 1345-1348
Kalman filtering for compressed sensing
Dimitri Kanevsky
Avishy Carmi
Lior Horesh
Pini Gurfil
Bhuvana Ramabhadran
FUSION (2010), pp. 1-8
Sparse representation features for speech recognition
Bhuvana Ramabhadran
David Nahamoo
Dimitri Kanevsky
Abhinav Sethy
INTERSPEECH (2010), pp. 2254-2257
An exploration of large vocabulary tools for small vocabulary phonetic recognition
Island-driven search using broad phonetic classes
ASRU (2009), pp. 287-292
A generalized family of parameter estimation techniques
Gradient steepness metrics using extended Baum-Welch transformations for universal pattern recognition tasks
Generalization of extended baum-welch parameter estimation for discriminative training and decoding
A comparison of broad phonetic and acoustic units for noise robust segment-based phonetic recognition
Broad phonetic class recognition in a Hidden Markov model framework using extended Baum-Welch transformations
Unsupervised Audio Segmentation using Extended Baum-Welch Transformations
Audio classification using extended baum-welch transformations
A Sinusoidal Model Approach to Acoustic Landmark Detection and Segmentation for Robust Segment-Based Speech Recognition