Yanzhang He
Authored Publications
Closing the Gap between Single-User and Multi-User VoiceFilter-Lite
Qiao Liang
Rajeev Vijay Rikhye
Odyssey 2022 (to appear)
VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from the non-target speaker. One limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating dual learning rates and using feature-wise linear modulation (FiLM) to condition the model with the attended embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.
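As a concrete illustration of the FiLM conditioning described above, here is a minimal numpy sketch: the attended speaker embedding is projected to a per-dimension scale (gamma) and shift (beta) that modulate the acoustic features. All names, shapes, and weights are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def film_condition(features, speaker_emb, w_gamma, w_beta):
    """FiLM: scale and shift each feature dimension using linear
    projections of the (attended) speaker embedding."""
    gamma = speaker_emb @ w_gamma   # per-dimension scale
    beta = speaker_emb @ w_beta     # per-dimension shift
    return gamma * features + beta  # broadcasts over time frames

# Toy shapes: 100 frames of 128-dim features, a 256-dim d-vector.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 128))
emb = rng.standard_normal(256)
w_g = rng.standard_normal((256, 128)) * 0.01
w_b = rng.standard_normal((256, 128)) * 0.01
print(film_condition(feats, emb, w_g, w_b).shape)  # (100, 128)
```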
Large-scale ASR Domain Adaptation by Self- and Semi-supervised Learning
David Qiu
Dongseong Hwang
ICASSP (2022) (to appear)
Self- and semi-supervised learning methods have been actively investigated to reduce the amount of labeled training data or to enhance model performance. However, these approaches mostly focus on in-domain performance for public datasets. In this study, we utilize a combination of self- and semi-supervised learning methods to solve the unseen-domain adaptation problem in a large-scale production setting for an online ASR model. This approach demonstrates that using the source domain data with a small fraction of the target domain data (3%) can recover the performance gap compared to a full data baseline: a relative 13.5% WER improvement on target domain data.
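The data-mixing step of such a recipe lends itself to a short sketch. Below is a hedged, self-contained outline, assuming a hypothetical `teacher_transcribe` stub in place of the actual teacher model; the 97/3 split simply mirrors the 3% target-domain fraction quoted above.

```python
# Hypothetical pipeline; `teacher_transcribe` is a stand-in stub,
# not the paper's actual code.
def teacher_transcribe(audio):
    """Stub teacher producing a pseudo-label for unlabeled audio."""
    return "pseudo transcript for " + audio

source_data = [("src_audio_%d" % i, "human transcript") for i in range(97)]
target_audio = ["tgt_audio_%d" % i for i in range(3)]  # ~3% of the mix

# Pseudo-label the small target-domain slice with the teacher,
# then train the student on the combined pool.
pseudo_target = [(a, teacher_transcribe(a)) for a in target_audio]
training_pool = source_data + pseudo_target
print(len(training_pool), "utterances,",
      len(pseudo_target), "pseudo-labeled target-domain")
```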
Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems
Chao Zhang
IEEE Spoken Language Technology Workshop (2022)
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. This EP model strongly affects latency, but is subject to computational constraints, which limits prediction accuracy. We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This allows flexibility during inference to produce a low-cost prediction or a higher quality prediction if ASR computation is ongoing. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 130ms (33.3% reduction), and 90th percentile latency by 160ms (22.2% reduction), without regressing word-error rate. For continuous recognition, WER improves by 10.6% (relative).
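The "switch" connection can be pictured with a small numpy sketch: the endpointer consumes either raw audio features (a low-cost standalone path) or low-level ASR encoder latents (a higher-quality path that reuses ongoing ASR computation). Shapes and weight names here are assumptions for illustration only.

```python
import numpy as np

def endpointer_logits(frames, asr_latents, w_audio, w_latent,
                      use_asr_latents):
    """'Switch' connection (sketch): the endpointer reads either raw
    audio features or low-level ASR encoder latents, depending on
    whether ASR computation is already running."""
    if use_asr_latents and asr_latents is not None:
        x = asr_latents @ w_latent  # higher quality, reuses ASR work
    else:
        x = frames @ w_audio        # low-cost standalone path
    return x  # per-frame speech/endpoint logits in a real model

rng = np.random.default_rng(1)
frames = rng.standard_normal((50, 80))    # 50 frames of log-mel
latents = rng.standard_normal((50, 512))  # an ASR encoder layer output
cheap = endpointer_logits(frames, None,
                          rng.standard_normal((80, 2)), None,
                          use_asr_latents=False)
rich = endpointer_logits(frames, latents, None,
                         rng.standard_normal((512, 2)),
                         use_asr_latents=True)
print(cheap.shape, rich.shape)  # (50, 2) (50, 2)
```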
Improving Deliberation by Text-Only and Semi-Supervised Training
Kevin Hu
Weiran Wang
Interspeech 2022 (to appear)
Text-only and semi-supervised training based on audio-only data have gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose text-only and semi-supervised training for attention-decoder based deliberation. By incorporating text-only data in training a Bidirectional Encoder Representations from Transformers (BERT) model for the deliberation text encoder, applying joint acoustic and text decoder (JATD) training, and performing semi-supervised training with a conventional model as a teacher, we achieve up to 11.7% WER reduction compared to the baseline deliberation model. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces WER by 8% relative for Google Voice Search with reasonable endpointing latencies. We also show that the deliberation model achieves a positive human side-by-side evaluation compared to LM rescoring.
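For intuition, here is a rough sketch of a deliberation attention step, under the assumption of simple dot-product attention: the second-pass decoder attends over both the acoustic encodings and the encoded first-pass hypothesis text (the text encoder being the component initialized from BERT in the paper). Everything here is illustrative, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def deliberation_attend(query, acoustic_enc, hyp_text_enc):
    """Sketch of a deliberation step: the second-pass decoder attends
    over the acoustic encoding AND the encoded first-pass hypothesis."""
    memory = np.concatenate([acoustic_enc, hyp_text_enc], axis=0)
    scores = memory @ query          # dot-product attention scores
    return softmax(scores) @ memory  # context vector for decoding

rng = np.random.default_rng(2)
ctx = deliberation_attend(rng.standard_normal(64),
                          rng.standard_normal((120, 64)),  # audio frames
                          rng.standard_normal((12, 64)))   # hyp. tokens
print(ctx.shape)  # (64,)
```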
Multi-user VoiceFilter-Lite via Attentive Speaker Embedding
In this paper, we propose a solution to allow speaker-conditioned speech models, such as VoiceFilter-Lite, to support an arbitrary number of enrolled users in a single pass. This is achieved by using an attention mechanism on multiple speaker embeddings to compute a single attentive embedding, which is then used as a side input to the model. We implemented multi-user VoiceFilter-Lite and evaluated it for two tasks: (1) a standard text-independent speaker verification task, where the input audio may contain overlapped speech; (2) a personalized keyphrase detection task, where ASR has to detect keyphrases from multiple enrolled users in a noisy environment. Our experiments show that with up to four enrolled users, multi-user VoiceFilter-Lite is able to significantly reduce speaker verification errors when there is overlapped speech, without hurting the performance under other acoustic conditions. This attentive speaker embedding approach can also be easily applied to other speaker-conditioned models such as personal VAD and personalized ASR.
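A minimal numpy sketch of the attentive speaker-embedding mechanism described above, assuming simple dot-product attention with an input-derived key; names, shapes, and the key projection are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_embedding(enrolled_embs, frame_feature, w_key):
    """Sketch: attend over N enrolled d-vectors with a key derived
    from the current input, yielding one attentive embedding that
    serves as the model's side input."""
    key = frame_feature @ w_key             # (emb_dim,)
    weights = softmax(enrolled_embs @ key)  # one weight per user
    return weights @ enrolled_embs          # weighted sum -> (emb_dim,)

rng = np.random.default_rng(3)
embs = rng.standard_normal((4, 256))        # up to four enrolled users
attended = attentive_embedding(embs, rng.standard_normal(128),
                               rng.standard_normal((128, 256)))
print(attended.shape)  # (256,)
```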
Multi-Task Learning for E2E ASR Word and Utterance Confidence
David Qiu
Yu Zhang
Liangliang Cao
Interspeech (2021)
Confidence scores are very useful for downstream applications of automatic speech recognition (ASR) systems. Recent works have proposed using neural attention models to learn word or utterance confidence scores for end-to-end (E2E) ASR. By themselves, word confidence does not model deletions, and utterance confidence discards much of the useful word-level training signals. This paper studies the effect of adding utterance-level loss and individual deletion loss to the framework proposed in [1]. Empirical results show that multi-task learning with all three objectives improves confidence metrics (NCE, AUC, RMSE) without the need for increasing the model size of the transformer feature extractor. Using the utterance-level confidence for rescoring also decreases the word error rates on Google's Voice Search and long-tail datasets by 3-5% relative.
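A compact sketch of what a three-objective confidence loss of this shape could look like, with binary cross-entropy for the word and utterance terms and a simple regression stand-in for the deletion term; the weights and exact loss forms are placeholders, not the paper's.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy between confidence p and 0/1 label y."""
    eps = 1e-7
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def multitask_confidence_loss(word_conf, word_labels,
                              utt_conf, utt_label,
                              del_pred, del_label,
                              w_utt=0.1, w_del=0.1):
    """Sketch: per-word confidence BCE plus utterance-level and
    deletion terms (weights here are illustrative placeholders)."""
    word_loss = bce(word_conf, word_labels).mean()
    utt_loss = bce(utt_conf, utt_label)
    del_loss = (del_pred - del_label) ** 2  # regression stand-in
    return word_loss + w_utt * utt_loss + w_del * del_loss

loss = multitask_confidence_loss(
    word_conf=np.array([0.9, 0.8, 0.4]), word_labels=np.array([1, 1, 0]),
    utt_conf=0.7, utt_label=1.0, del_pred=0.2, del_label=0.0)
print(round(float(loss), 4))
```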
Learning Word-Level Confidence for Subword End-to-End ASR
David Qiu
Yu Zhang
Liangliang Cao
Deepti Bhatia
Wei Li
Ke Hu
ICASSP (2021)
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
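To see why non-unique word-piece tokenization makes word-level labels awkward, here is a sketch of the naive aggregation the paper's final model avoids: mapping word-piece confidences to word confidences by taking the word-final piece's score (the paper instead learns word confidence directly with self-attention). The "_" word-boundary convention is an illustrative, SentencePiece-style assumption.

```python
def word_confidence_from_wordpieces(pieces, wp_conf):
    """Sketch: fold word-piece confidences into word confidences,
    keeping the score of each word-final piece. '_' marks a
    word-initial piece (SentencePiece-style, for illustration)."""
    words, confs = [], []
    for piece, conf in zip(pieces, wp_conf):
        if piece.startswith("_"):
            words.append(piece[1:])   # start a new word
            confs.append(conf)
        else:
            words[-1] += piece        # continue current word
            confs[-1] = conf          # keep word-final piece's score
    return list(zip(words, confs))

print(word_confidence_from_wordpieces(
    ["_morn", "ing", "_news"], [0.9, 0.6, 0.95]))
# [('morning', 0.6), ('news', 0.95)]
```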
Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging
David Johannes Rybach
Sean Campbell
ICASSP 2021, IEEE
End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct model states, such models are decoded using an approximate beam-search which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model's accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam, and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or up to 15.7% when lattice rescoring is applied.
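The path-merging idea reduces to a few lines: hypotheses that agree on their last K labels share a model state under a limited-context RNN-T, so only the best-scoring one needs to stay in the active beam. A minimal sketch, with K=4 as in the paper and labels/scores invented for the example:

```python
def merge_paths(beam, context=4):
    """Sketch: hypotheses whose last `context` labels match share a
    model state, so keep only the best-scoring one in the beam (the
    rest could be retained in the final lattice instead)."""
    merged = {}
    for labels, score in beam:
        key = tuple(labels[-context:])
        if key not in merged or score > merged[key][1]:
            merged[key] = (labels, score)
    return list(merged.values())

beam = [
    (["hey", "play", "the", "morning", "news"], -1.2),
    (["a", "play", "the", "morning", "news"], -0.9),
    (["play", "the", "evening", "news"], -2.0),
]
# The first two share their last four labels, so they merge and the
# higher-scoring (-0.9) hypothesis survives.
print(merge_paths(beam))
```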
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Rami Botros
Ruoming Pang
David Johannes Rybach
James Qin
Quoc-Nam Le-The
Anmol Gulati
Cal Peyser
Chung-Cheng Chiu
Emmanuel Guzman
Jiahui Yu
Qiao Liang
Wei Li
Yu Zhang
Interspeech (2021) (to appear)
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus, E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to reduce decoder latency we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
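The non-recurrent embedding decoder can be sketched simply: rather than carrying an LSTM state, the prediction network looks only at the embeddings of the last few labels. A minimal numpy illustration, with a two-label context and all shapes invented:

```python
import numpy as np

def embedding_decoder_state(label_history, embedding_table, n_context=2):
    """Sketch of a non-recurrent embedding decoder: concatenate the
    embeddings of the last `n_context` labels, padding with a zero
    'blank' embedding at the start of the sequence."""
    emb_dim = embedding_table.shape[1]
    recent = label_history[-n_context:]
    pads = n_context - len(recent)
    vecs = [np.zeros(emb_dim)] * pads + [embedding_table[i] for i in recent]
    return np.concatenate(vecs)

rng = np.random.default_rng(4)
table = rng.standard_normal((4096, 64))  # toy word-piece embeddings
state = embedding_decoder_state([17, 905, 3], table)
print(state.shape)  # (128,) = 2 context labels x 64 dims
```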
Personalized Keyphrase Detection using Speaker and Environment Information
Rajeev Vijay Rikhye
Qiao Liang
Ding Zhao
Yiteng (Arden) Huang
Interspeech 2021
In this paper, we introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary. The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model. To address the challenge of detecting these keyphrases under various noisy conditions, a speaker separation model is added to the feature frontend of the speaker verification model, and an adaptive noise cancellation (ANC) algorithm is included to exploit the cross-microphone noise coherence. Our experiments show that the text-independent speaker recognition model largely reduces the false triggering rate of the keyphrase detection, while the speaker separation model and adaptive noise cancellation largely reduce false rejections.
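The gating logic of such a system can be caricatured in a few lines: a keyphrase fires only if the ASR transcript (produced from separation-enhanced features) contains it AND the text-independent speaker verification score accepts the talker. Thresholds and names below are purely illustrative:

```python
def detect_keyphrase(transcript, speaker_score, keyphrases,
                     speaker_threshold=0.8):
    """Sketch: trigger only when ASR produces a keyphrase AND
    speaker verification accepts the talker (values illustrative)."""
    if speaker_score < speaker_threshold:
        return None  # reject: not an enrolled speaker
    for phrase in keyphrases:
        if phrase in transcript:
            return phrase
    return None

print(detect_keyphrase("ok please stop the music", 0.93,
                       ["stop the music", "next song"]))  # "stop the music"
print(detect_keyphrase("stop the music", 0.42,
                       ["stop the music"]))               # None (rejected)
```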