Ehsan Variani
Authored Publications
On Weight Interpolation of the Hybrid Autoregressive Transducer Model
David Rybach
Interspeech 2022 (to appear)
Abstract
This paper explores ways to improve a two-pass speech recognition system in which the first pass is a hybrid autoregressive transducer (HAT) model and the second pass is a neural language model. The main focus is on the scores provided by each of these models, their quantitative analysis, how to improve them, and how best to integrate them for better recognition accuracy. Several analyses are presented to show the importance of the choice of integration weights for combining the first-pass and second-pass scores. A sequence-level weight estimation model along with four training criteria are proposed, allowing adaptive integration of the scores per acoustic sequence. The effectiveness of this approach is demonstrated by constructing and analyzing models on the LibriSpeech data set.
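A minimal sketch of the score combination discussed above, assuming a log-linear (here convex) mix; the notation is illustrative, with S_1 the first-pass HAT log-score, S_2 the second-pass neural LM log-score, and \lambda(x) the weight predicted per acoustic sequence by the estimation model:

    \[
      S(y \mid x) \;=\; \bigl(1 - \lambda(x)\bigr)\, S_{1}(y \mid x) \;+\; \lambda(x)\, S_{2}(y),
      \qquad
      \hat{y} \;=\; \arg\max_{y \in \mathcal{N}(x)} S(y \mid x),
    \]

where \mathcal{N}(x) denotes the first-pass N-best list for utterance x.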
Multilingual Second-Pass Rescoring for Automatic Speech Recognition Systems
Pedro Moreno Mengibar
ICASSP (2022)
Abstract
Second-pass rescoring is a well-known technique for improving the performance of Automatic Speech Recognition (ASR) systems. Neural oracle search (NOS), which selects the most likely hypothesis from an N-best list by integrating information from multiple sources, such as the input acoustic representations, the N-best hypotheses, additional first-pass statistics, and unpaired textual information through an external language model, has shown success in rescoring for RNN-T first-pass models. Multilingual first-pass speech recognition models often outperform their monolingual counterparts when trained on related or low-resource languages. In this paper, we investigate making the second-pass model multilingual and apply rescoring on top of a multilingual first pass. We conduct experiments on Nordic languages including Danish, Dutch, Finnish, Norwegian, and Swedish.
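A hedged sketch of the hypothesis-selection step described above; the feature names and the toy scorer are stand-ins for the neural oracle search model, not its actual interface:

    # Hypothetical second-pass N-best selection in the NOS style.
    # The scorer is a placeholder for the learned multilingual rescoring model;
    # its inputs (first-pass log-probability, acoustic summary, hypothesis text)
    # are assumed features, not the paper's exact interface.
    def second_pass_select(nbest, scorer):
        """nbest: list of dicts with 'text', 'first_pass_logp', 'acoustic_summary'."""
        return max(nbest, key=scorer)

    def toy_scorer(hyp):
        # Crude placeholder: first-pass score plus a small length bonus.
        return hyp["first_pass_logp"] + 0.5 * len(hyp["text"].split())

    nbest = [
        {"text": "hvor er du", "first_pass_logp": -9.2, "acoustic_summary": None},
        {"text": "hvor er nu", "first_pass_logp": -9.0, "acoustic_summary": None},
    ]
    print(second_pass_select(nbest, toy_scorer)["text"])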
Abstract
Personalization of speech models on mobile devices (on-device personalization) is an active area of research, but more often than not, mobile devices have more text-only data than paired audio-text data. We explore training a personalized language model on text-only data, used during inference to improve speech recognition performance for that user. We experiment on a user-clustered LibriSpeech corpus, supplemented with personalized text-only data for each user from Project Gutenberg. We release this User-Specific LibriSpeech (UserLibri) dataset to aid future personalization research. LibriSpeech audio-transcript pairs are grouped into 55 users from the test-clean dataset and 52 users from test-other. We are able to lower the average word error rate per user across both sets in streaming and nonstreaming models, including an improvement of 2.5 for the harder set of test-other users when streaming.
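One way to use such a personal, text-only language model at inference time is shallow fusion of per-token scores; the sketch below assumes that setup, with the fusion weight and function names chosen for illustration rather than taken from the paper:

    # Hedged sketch of shallow fusion with a personal LM (illustrative names).
    def fused_token_logprob(asr_logprob, personal_lm_logprob, lam=0.2):
        """Combine per-token ASR and personal-LM log-probabilities."""
        return asr_logprob + lam * personal_lm_logprob

    # During beam search, each candidate token's score would be replaced by the
    # fused score, so user-specific words that are rare in the global training
    # data get a boost from the LM trained on that user's text.
    print(fused_token_logprob(asr_logprob=-2.1, personal_lm_logprob=-0.7))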
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Rami Botros
Ruoming Pang
David Johannes Rybach
James Qin
Quoc-Nam Le-The
Anmol Gulati
Cal Peyser
Chung-Cheng Chiu
Emmanuel Guzman
Jiahui Yu
Qiao Liang
Wei Li
Yu Zhang
Interspeech (2021) (to appear)
Abstract
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with, so E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. To further improve decoder latency, we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
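One common reading of HAT-style external LM integration, sketched with assumed notation (the interpolation weights are illustrative, not values from the paper): the model's internal LM estimate is discounted before the score of the external LM trained on text-only data is added,

    \[
      \log p(y \mid x) \;\approx\; \log p_{\mathrm{E2E}}(y \mid x)
      \;-\; \lambda_{\mathrm{ILM}} \log p_{\mathrm{ILM}}(y)
      \;+\; \lambda_{\mathrm{LM}} \log p_{\mathrm{LM}}(y).
    \]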
Hybrid Autoregressive Transducer (HAT)
David Rybach
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 6139-6143
Abstract
This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model that can be used to decide whether inference with an external language model is beneficial or not. We evaluate our proposed model on a large-scale voice search task. Our experiments show significant improvements in WER compared to the state-of-the-art approaches.
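A sketch of the HAT local posterior in commonly used notation (the symbols below are illustrative): a separate Bernoulli blank, i.e. duration, model alongside a label distribution over the vocabulary,

    \[
      P(\langle\mathrm{blank}\rangle \mid t, u) \;=\; b_{t,u} \;=\; \sigma(\beta_{t,u}),
      \qquad
      P(v \mid t, u, \neg\,\mathrm{blank}) \;=\; \mathrm{Softmax}(f_{t,u})_{v}.
    \]

This separation is what lets the internal language model be scored on its own, for example by evaluating the label distribution with the acoustic contribution removed.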
Abstract
Modeling tasks that use a large vocabulary require two word-to-vector maps, one for the embedding layer and one for the softmax layer. For such tasks, the majority of model parameters are in the embedding and softmax layers, while only a small fraction of the parameters are used in the core of the model, e.g., recurrent structures such as LSTMs. When trained on small to medium corpora, these models are subject to overfitting as well as large storage and memory footprint requirements. We propose to compress the embedding and softmax matrices by imposing structure on the parameter space. The embedding and softmax matrices are factored as the product of a sparse matrix and a structured dense matrix. Without compromising performance, we achieve a significant compression rate for the embedding layer and a moderate compression rate for the softmax layer. Factoring the embedding and softmax matrices before training allows us to jointly train the factor values to optimize the training objective. Being able to compress the embedding and softmax layers lets us use the saved memory for a larger recurrent unit size, which results in improved performance at an uncompressed memory footprint. We report results of this compression technique on standard datasets and a state-of-the-art on-device automatic speech recognition system.
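A minimal numerical sketch of the factorization described above, with illustrative shapes and a random sparsity pattern (the exact construction of the sparse and structured dense factors in the paper is not reproduced here):

    import numpy as np

    # Toy factorization: the |V| x d embedding matrix is the product of a
    # sparse mixing matrix S (|V| x k) and a small dense basis D (k x d).
    vocab_size, embed_dim, num_basis = 10000, 512, 1000
    rng = np.random.default_rng(0)

    S = np.zeros((vocab_size, num_basis))
    for w in range(vocab_size):
        idx = rng.choice(num_basis, size=4, replace=False)  # 4 nonzeros per word
        S[w, idx] = rng.standard_normal(4)
    D = rng.standard_normal((num_basis, embed_dim))

    E = S @ D  # reconstructed |V| x d embedding matrix

    dense_params = vocab_size * embed_dim
    factored_params = np.count_nonzero(S) + num_basis * embed_dim
    print(f"dense: {dense_params:,}  factored: {factored_params:,}")

In this toy configuration the factored form stores roughly a tenth of the parameters of the dense embedding; during training, the nonzero entries of S and all of D would be learned jointly, as the abstract describes.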
Abstract
This article introduces and evaluates Sampled Connectionist Temporal Classification (CTC), which connects the CTC criterion to the Cross Entropy (CE) objective through sampling. Instead of computing the logarithm of the sum of the alignment path likelihoods, at each training step sampled CTC computes only the CE loss between a sampled alignment path and the model posteriors. It is shown that the sampled CTC objective is an unbiased estimator of an upper bound on the CTC loss, so minimizing sampled CTC is equivalent to minimizing that upper bound of the CTC objective. By construction, the sampled CTC objective scales computationally to massive datasets on hardware accelerators. Sampled CTC is compared with CTC on two large-scale speech recognition tasks, and it is shown that sampled CTC can achieve WER performance similar to the best CTC baseline in about one fourth of the CTC baseline's training time.
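A sketch of why the sampled objective bounds the CTC loss, in generic notation not taken from the paper: with alignment paths \pi drawn from a proposal distribution q whose support covers the valid alignments \mathcal{A}(y), Jensen's inequality gives

    \[
      \mathcal{L}_{\mathrm{CTC}}
      \;=\; -\log \sum_{\pi \in \mathcal{A}(y)} p(\pi \mid x)
      \;=\; -\log \mathbb{E}_{\pi \sim q}\!\left[\frac{p(\pi \mid x)}{q(\pi)}\right]
      \;\le\; \mathbb{E}_{\pi \sim q}\!\left[-\log \frac{p(\pi \mid x)}{q(\pi)}\right],
    \]

so the per-sample cross-entropy term is an unbiased estimator of an upper bound on the CTC loss, and minimizing its expectation minimizes that bound.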
Multichannel Signal Processing with Deep Neural Networks for Automatic Speech Recognition
Kean Chin
Chanwoo Kim
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25 (2017), pp. 965-979
Abstract
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single-channel waveform model.
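A minimal sketch of the factored first layer described above, with illustrative filter sizes and random weights standing in for learned ones: per-channel spatial filters summed across microphones (filter-and-sum for each look direction), followed by a shared single-channel filterbank.

    import numpy as np

    def factored_multichannel_layer(waveforms, spatial_filters, filterbank):
        """waveforms: (channels, samples); spatial_filters: (looks, channels, taps);
        filterbank: (features, taps). Returns (looks, features, frames)."""
        looks = []
        for p in range(spatial_filters.shape[0]):
            # Spatial filtering: filter each channel, then sum across channels,
            # mimicking filter-and-sum beamforming toward one look direction.
            y = sum(np.convolve(waveforms[c], spatial_filters[p, c], mode="valid")
                    for c in range(waveforms.shape[0]))
            # Shared single-channel filterbank producing a frequency-like
            # decomposition of the spatially filtered signal.
            feats = np.stack([np.convolve(y, f, mode="valid") for f in filterbank])
            looks.append(feats)
        return np.stack(looks)

    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 1600))          # 2 mics, 100 ms at 16 kHz
    spatial = rng.standard_normal((3, 2, 81))   # 3 look directions, 81 taps
    fbank = rng.standard_normal((8, 25))        # 8 filterbank channels
    print(factored_multichannel_layer(x, spatial, fbank).shape)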
Acoustic Modeling for Google Home
Joe Caroselli
Kean Chin
Chanwoo Kim
Mitchel Weintraub
Erik McDermott
INTERSPEECH 2017 (2017)
Abstract
This paper describes the technical and system-building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home-specific data. We present results on a variety of multichannel sets. The combination of technical and system advances results in a reduction of WER of over 18% relative compared to the current production system.
Raw Multichannel Processing Using Deep Neural Networks
Kean Chin
Chanwoo Kim
New Era for Robust Speech Recognition: Exploiting Deep Learning, Springer (2017)
Abstract
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this chapter, we perform multi-channel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.