Google Research

Improving Streaming ASR with Non-streaming Model Distillation on Unsupervised Data

ICASSP 2021 (to appear)

Abstract

Streaming end-to-end Automatic Speech Recognition (ASR) models are widely used on smart speakers and in on-device applications. Because these models are expected to transcribe speech with minimal latency, they are constrained to be causal, with no access to future context, unlike their non-streaming counterparts. As a result, streaming models almost always perform worse than non-streaming models. We propose a novel and effective learning method that uses a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, and then distills that knowledge into streaming ASR models. In this way, we are able to scale the training of streaming models to 3M hours of YouTube audio. Experiments show that our approach can significantly reduce the Word Error Rate (WER) of RNN-T models in four languages trained from YouTube data.
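The teacher-labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `pseudo_label`, `toy_teacher`, and the `Audio` type are hypothetical names, the teacher is a stand-in function rather than a real non-streaming ASR model, and dropping segments with empty transcripts is an assumption made only for this example.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Audio = List[float]  # hypothetical placeholder for a waveform segment


def pseudo_label(
    teacher: Callable[[Audio], str],
    unlabeled_audio: Iterable[Audio],
) -> Iterator[Tuple[Audio, str]]:
    """Yield (audio, transcript) pairs whose transcripts come from the teacher."""
    for audio in unlabeled_audio:
        transcript = teacher(audio)
        # Assumption for this sketch: skip segments where the teacher emits nothing.
        if transcript:
            yield audio, transcript


def toy_teacher(audio: Audio) -> str:
    # Stand-in for a non-streaming model that sees the whole utterance at once.
    return "speech" if sum(abs(x) for x in audio) > 1.0 else ""


segments = [[0.9, 0.8], [0.0, 0.1], [0.5, 0.7]]
# These teacher-labeled pairs would serve as training data for the streaming student.
student_training_set = list(pseudo_label(toy_teacher, segments))
```

Because the teacher only needs audio, the same loop applies to any amount of unlabeled data; the streaming student is then trained on the resulting pairs as if they were supervised examples.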
