Golan Pundak
Authored Publications
A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
Ruoming Pang
Antoine Bruguier
Wei Li
Raziel Alvarez
Chung-Cheng Chiu
David Garcia
Kevin Hu
Minho Jin
Qiao Liang
Cal Peyser
David Rybach
(June) Yuan Shangguan
Yash Sheth
Mirkó Visontai
Yu Zhang
Ding Zhao
ICASSP (2020)
Abstract
Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time it takes to finalize the hypothesis after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that together surpass a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff than a conventional model. For example, at the same latency, RNN-T+LAS obtains an 8% relative improvement in WER while being more than 400 times smaller in model size.
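As a rough illustration of the two-pass setup described in this abstract, the sketch below (not the paper's implementation) re-ranks an RNN-T n-best list with a second-pass rescoring score; rnn_t_decode, las_score, and the interpolation weight are hypothetical stand-ins.

```python
# A minimal sketch of two-pass decoding: a streaming first pass produces
# an n-best list, and a second-pass rescorer re-ranks it. The decode and
# scoring callables are hypothetical placeholders for the actual models.
from typing import Callable, List, Tuple

def two_pass_decode(
    audio_features,
    rnn_t_decode: Callable[[object], List[Tuple[str, float]]],
    las_score: Callable[[object, str], float],
    rescore_weight: float = 0.5,
) -> str:
    """Return the best hypothesis after second-pass rescoring of the RNN-T n-best."""
    nbest = rnn_t_decode(audio_features)          # [(hypothesis, first-pass log prob), ...]
    rescored = []
    for hyp, first_pass_lp in nbest:
        second_pass_lp = las_score(audio_features, hyp)
        combined = (1.0 - rescore_weight) * first_pass_lp + rescore_weight * second_pass_lp
        rescored.append((combined, hyp))
    return max(rescored)[1]                       # highest combined score wins
```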
Streaming End-to-End Speech Recognition for Mobile Devices
Raziel Alvarez
Ding Zhao
David Rybach
Ruoming Pang
Qiao Liang
Deepti Bhatia
Yuan Shangguan
ICASSP (2019)
Abstract
End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
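The abstract above centers on a recurrent neural network transducer; the PyTorch sketch below shows the general shape of an RNN-T (streaming encoder, prediction network, joint layer). Layer sizes and vocabulary are illustrative, not those of the system described in the paper.

```python
# A minimal PyTorch sketch of an RNN-T model suited to streaming:
# a unidirectional LSTM encoder, a prediction network over previously
# emitted labels, and a joint network producing logits over labels plus blank.
import torch
import torch.nn as nn

class TinyRNNT(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=4096, hidden=640):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.embed = nn.Embedding(vocab_size + 1, 64)        # +1 for the blank label
        self.prediction = nn.LSTM(64, hidden, batch_first=True)
        self.joint = nn.Linear(2 * hidden, vocab_size + 1)

    def forward(self, feats, labels):
        # feats: (B, T, feat_dim) acoustic frames; labels: (B, U) emitted label ids
        enc, _ = self.encoder(feats)                          # (B, T, H)
        pred, _ = self.prediction(self.embed(labels))         # (B, U, H)
        # Joint network: combine every (t, u) pair of encoder/prediction states.
        joint_in = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
             pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)],
            dim=-1)                                           # (B, T, U, 2H)
        return self.joint(joint_in)                           # (B, T, U, vocab+blank) logits
```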
Abstract
Contextual biasing in end-to-end (E2E) models is challenging because E2E models do poorly on proper nouns and only a limited number of candidates are kept during beam-search decoding. This problem is exacerbated when biasing towards proper nouns in foreign languages, such as geographic location names, which are virtually unseen in training and are thus out-of-vocabulary (OOV). While a grapheme or wordpiece E2E model might have a difficult time spelling OOV words, phonemes are more acoustically oriented, and past work has shown that E2E models can better predict phonemes for such words. In this work, we address the OOV issue by incorporating phonemes in a wordpiece E2E model and performing contextual biasing at the phoneme level to recognize foreign words. Phonemes are mapped from the source language to the foreign language and subsequently transduced to foreign words using pronunciations. We show that phoneme-based biasing performs 16% better than a grapheme-only biasing model and 8% better than a wordpiece-only biasing model on a foreign place name recognition task, while causing only slight degradation on regular English tasks.
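A toy illustration of phoneme-level biasing, under the assumption that beam-search hypotheses can be scored against the pronunciations of contextual entities; the lexicon, phoneme symbols, and boost value below are made up.

```python
# A minimal, illustrative sketch of contextual biasing at the phoneme level:
# a hypothesis whose phoneme suffix matches the pronunciation of a contextual
# entity receives a score boost during beam search.
from typing import Dict, List, Tuple

def bias_score(
    hyp_phonemes: List[str],
    context_pronunciations: Dict[str, List[str]],  # word -> phoneme sequence
    boost: float = 2.0,
) -> Tuple[float, str]:
    """Return (score_bonus, matched_word) if the hypothesis ends in a
    contextual pronunciation, else (0.0, '')."""
    for word, prons in context_pronunciations.items():
        n = len(prons)
        if n and hyp_phonemes[-n:] == prons:
            return boost, word
    return 0.0, ""

# Example with a made-up pronunciation for a foreign place name.
context = {"Creteil": ["k", "r", "eh", "t", "eh", "y"]}
print(bias_score(["g", "ow", "t", "uw", "k", "r", "eh", "t", "eh", "y"], context))
```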
Abstract
End-to-End (E2E) automatic speech recognition (ASR) systems learn word spellings directly from text-audio pairs, in contrast to traditional ASR systems which incorporate a separate pronunciation lexicon. The lexicon allows a traditional system to correctly spell rare words unobserved in training, if their phonetic pronunciation is known during inference. E2E systems, however, are more likely to misspell rare words.
In this work we propose an E2E model which benefits from the best of both worlds: it outputs graphemes, and thus learns to spell words directly, while also being able to leverage pronunciations for words which might be likely in a given context. Our model, which we name Phoebe, is based on the recently proposed Contextual Listen Attend and Spell model (CLAS). As in CLAS, our model accepts a set of bias phrases and learns an embedding for them which is jointly optimized with the rest of the ASR system. In contrast to CLAS, which accepts only the textual form of the bias phrases, the proposed model also has access to phonetic embeddings, which as we show improves performance on challenging test sets which include words unseen in training. The proposed model provides a 16% relative word error rate reduction over CLAS when both the phonetic and written representation of the context bias phrases are used.
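A minimal PyTorch sketch of the idea of giving each bias phrase both a written and a phonetic embedding; the encoder sizes, last-state pooling, and concatenation below are assumptions rather than Phoebe's exact architecture.

```python
# A sketch of a bias-phrase encoder that combines grapheme and phoneme
# information: each phrase is encoded from both representations and the
# two embeddings are fused before being attended to by the decoder.
import torch
import torch.nn as nn

class BiasPhraseEncoder(nn.Module):
    def __init__(self, n_graphemes=256, n_phonemes=64, dim=128):
        super().__init__()
        self.g_embed = nn.Embedding(n_graphemes, dim)
        self.p_embed = nn.Embedding(n_phonemes, dim)
        self.g_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.p_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.project = nn.Linear(2 * dim, dim)

    def forward(self, graphemes, phonemes):
        # graphemes: (N, Lg) ids, phonemes: (N, Lp) ids, one row per bias phrase.
        _, (g_h, _) = self.g_rnn(self.g_embed(graphemes))
        _, (p_h, _) = self.p_rnn(self.p_embed(phonemes))
        # Fuse the final grapheme and phoneme states into one phrase embedding.
        return self.project(torch.cat([g_h[-1], p_h[-1]], dim=-1))  # (N, dim)
```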
Domain Adaptation Using Factorized Hidden Layer for Robust Automatic Speech Recognition
Interspeech (2018), pp. 892-896
Abstract
Domain robustness is a challenging problem for automatic speech recognition (ASR). In this paper, we treat speech data collected for different applications as separate domains and investigate how robust acoustic models trained on multi-domain data are to unseen domains. Specifically, we use the Factorized Hidden Layer (FHL) as a compact low-rank representation to adapt a multi-domain ASR system to unseen domains. Experimental results on two unseen domains show that FHL is a more effective adaptation method than selectively fine-tuning part of the network, without dramatically increasing the number of model parameters. Furthermore, we find that using singular value decomposition to initialize the low-rank bases of an FHL model leads to faster convergence and improved performance.
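A small numpy sketch of the factorized-hidden-layer idea, i.e., a shared weight matrix plus a low-rank, domain-dependent correction W_d = W + B diag(d) A; the dimensions and random initialization below are illustrative (the paper initializes the bases with SVD).

```python
# A minimal sketch of a factorized hidden layer: the effective layer
# weights are the shared weights plus a low-rank correction controlled
# by a small domain-dependent vector d.
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, rank = 512, 512, 32

W = rng.standard_normal((out_dim, in_dim)) * 0.01   # shared layer weights
B = rng.standard_normal((out_dim, rank)) * 0.01     # low-rank bases (left)
A = rng.standard_normal((rank, in_dim)) * 0.01      # low-rank bases (right)

def adapted_weights(d: np.ndarray) -> np.ndarray:
    """Domain-adapted weights for a domain vector d of length `rank`."""
    return W + B @ np.diag(d) @ A

x = rng.standard_normal(in_dim)
d_unseen = rng.standard_normal(rank) * 0.1          # learned per (unseen) domain
h = np.tanh(adapted_weights(d_unseen) @ x)          # adapted hidden activation
```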
Abstract
In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow-fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.
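A minimal PyTorch sketch of CLAS-style bias attention, in which the decoder state attends over the embeddings of the context phrases plus a learned no-bias option; the sizes and the dot-product attention form are assumptions, not the paper's exact formulation.

```python
# A sketch of bias attention: the decoder state queries the context-phrase
# embeddings (plus a learned "ignore context" vector), and the resulting
# context vector can be fed to the decoder alongside the acoustic context.
import torch
import torch.nn as nn

class BiasAttention(nn.Module):
    def __init__(self, dec_dim=320, bias_dim=128):
        super().__init__()
        self.query = nn.Linear(dec_dim, bias_dim)
        self.no_bias = nn.Parameter(torch.zeros(1, bias_dim))      # "no bias" option

    def forward(self, dec_state, bias_embeddings):
        # dec_state: (B, dec_dim); bias_embeddings: (N, bias_dim), one per phrase.
        keys = torch.cat([self.no_bias, bias_embeddings], dim=0)   # (N+1, bias_dim)
        scores = self.query(dec_state) @ keys.t()                  # (B, N+1)
        weights = torch.softmax(scores, dim=-1)
        return weights @ keys                                      # (B, bias_dim) context
```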
Toward Domain-Invariant Speech Recognition via Large Scale Training
Mohamed (Mo) Elfeky
SLT, IEEE (2018)
Abstract
Current state-of-the-art automatic speech recognition systems are trained to work in specific ‘domains’, defined by factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance drops significantly. In this paper, we explore the idea of building a single domain-invariant model that works well across varied use-cases. We do this by combining large-scale training data from multiple application domains. Our final system is trained on 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and varying sampling rates. Our results show that, even at such a scale, a model trained this way works almost as well as models fine-tuned to specific subsets: a single model can be trained to be robust to multiple application domains, and to other variations like codecs and noise. Such models also generalize better to unseen conditions and allow for rapid adaptation to new domains; we show that by using as little as 10 hours of data to adapt a domain-invariant model to a new domain, we can match the performance of a domain-specific model trained from scratch with roughly 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work.
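A small numpy sketch of the kind of multi-condition distortion mentioned above, mixing an utterance with noise at a randomly chosen SNR; codec and sampling-rate simulation are omitted and all values are illustrative.

```python
# A minimal sketch of noise-based training distortion: mix a noise signal
# into a training utterance at a requested signal-to-noise ratio.
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the requested SNR (in dB)."""
    noise = np.resize(noise, speech.shape)                 # loop/trim noise to length
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)                     # 1 s of fake 16 kHz audio
babble = rng.standard_normal(16000)
noisy = add_noise(utterance, babble, snr_db=rng.uniform(5, 25))
```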
Abstract
Recently, very deep networks, with as many as hundreds of layers, have shown great success in image classification tasks. One key component that has enabled such deep models is the use of “skip connections”, including either residual or highway connections, to alleviate the vanishing and exploding gradient problems. While these connections have been explored for speech, they have mainly been explored for feed-forward networks. Since recurrent structures, such as LSTMs, have produced state-of-the-art results on many of our Voice Search tasks, the goal of this work is to thoroughly investigate different approaches to adding depth to recurrent structures. Specifically, we experiment with novel Highway-LSTM models with bottleneck skip connections and show that a 10-layer model can outperform a state-of-the-art 5-layer LSTM model with the same number of parameters by 2% relative WER. In addition, we experiment with Recurrent Highway layers and find these to be on par with Highway-LSTM models when given sufficient depth.
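A minimal PyTorch sketch of a Highway-LSTM layer in the spirit of this abstract: the LSTM output is gated with the layer input through a learned transform gate, with a bottleneck projection between stacked layers; the exact gating form and dimensions are assumptions.

```python
# A sketch of a highway-gated LSTM layer with a bottleneck projection,
# intended to illustrate skip connections in deep recurrent stacks.
import torch
import torch.nn as nn

class HighwayLSTMLayer(nn.Module):
    def __init__(self, dim=512, bottleneck=128):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.gate = nn.Linear(dim, dim)           # transform gate T(x)
        self.down = nn.Linear(dim, bottleneck)    # bottleneck projection
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # x: (B, T, dim) input to this layer
        h, _ = self.lstm(x)
        t = torch.sigmoid(self.gate(x))
        h = t * h + (1.0 - t) * x                 # highway: carry the input when the gate is closed
        return self.up(torch.relu(self.down(h)))  # bottleneck between stacked layers
```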
Acoustic Modeling for Google Home
Joe Caroselli
Kean Chin
Chanwoo Kim
Mitchel Weintraub
Erik McDermott
INTERSPEECH 2017 (2017)
Abstract
This paper describes the technical and system-building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel sets. The combination of technical and system advances results in a WER reduction of over 18% relative compared to the current production system.
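As a rough illustration of feeding multichannel input jointly to the acoustic model rather than beamforming in a separate frontend, the numpy sketch below simply stacks per-channel feature frames along the feature axis; the actual system learns spatial filtering inside the network, so this only conveys the joint-input idea, and the shapes are illustrative.

```python
# A minimal sketch of joint multichannel input: per-channel feature frames
# are stacked into one feature vector per frame before the acoustic model.
import numpy as np

num_channels, num_frames, feat_dim = 2, 100, 80
per_channel = np.random.randn(num_channels, num_frames, feat_dim)

# (C, T, F) -> (T, C * F): one stacked feature vector per frame.
stacked = np.transpose(per_channel, (1, 0, 2)).reshape(num_frames, num_channels * feat_dim)
assert stacked.shape == (100, 160)
```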
Abstract
Recently, neural network acoustic models trained with Connectionist Temporal Classification (CTC) were proposed as an alternative to conventional cross-entropy trained neural network acoustic models, which output frame-level decisions every 10ms (Senior et al., ASRU 2015). As opposed to conventional models, CTC learns an alignment jointly with the acoustic model and outputs a blank symbol in addition to the regular acoustic state units. This allows the CTC model to run at a lower frame rate, outputting decisions every 30ms rather than 10ms as in conventional models, thus improving overall system latency. In this work, we explore how conventional models behave at lower frame rates. On a large vocabulary Voice Search task, we show that with conventional models we can lower the frame rate to 40ms while improving WER by 3% relative over a CTC-based model.
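A small numpy sketch of lowering the frame rate by stacking consecutive 10ms feature frames so the acoustic model emits decisions every 30ms; the stacking factor and feature dimension are illustrative.

```python
# A minimal sketch of frame stacking/subsampling to reduce the frame rate.
import numpy as np

def lower_frame_rate(feats: np.ndarray, factor: int = 3) -> np.ndarray:
    """Stack `factor` consecutive frames: (T, F) -> (T // factor, factor * F)."""
    T, F = feats.shape
    T = (T // factor) * factor                    # drop any trailing remainder
    return feats[:T].reshape(T // factor, factor * F)

frames_10ms = np.random.randn(300, 40)            # 3 s of 10 ms frames
frames_30ms = lower_frame_rate(frames_10ms, 3)    # one decision every 30 ms
assert frames_30ms.shape == (100, 120)
```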