Improving Streaming ASR with Non-streaming Model Distillation on Unsupervised Data

Ananya Misra; Arun Narayanan; Chung-Cheng Chiu; Liangliang Cao; Min Ma; Ruoming Pang; Thibault Doutre; Wei Han; Yu Zhang; Zhiyun Lu

ICASSP 2021 (to appear)

Abstract

Streaming end-to-end Automatic Speech Recognition (ASR) models are widely used on smart speakers and in on-device applications. Because these models are expected to transcribe speech with minimal latency, they are constrained to be causal, with no access to future context, unlike their non-streaming counterparts. As a result, streaming models almost always perform worse than non-streaming models.
We propose a novel and effective learning method that uses a non-streaming ASR model as a teacher: the teacher generates transcripts on an arbitrarily large dataset, and these transcripts are used to distill knowledge into streaming ASR models. In this way, we are able to scale the training of streaming models to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the Word Error Rate (WER) of RNN-T models in four languages trained from YouTube data.
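
The teacher-student idea in the abstract can be illustrated with a minimal sketch: a non-streaming teacher transcribes unlabeled audio, and the resulting (audio, transcript) pairs become training data for the streaming student. Everything named here is hypothetical and not taken from the paper: `pseudo_label`, `toy_teacher`, and the `Audio` type are placeholders, and dropping empty transcripts is an assumption made only for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Placeholder waveform type; a real system would use feature tensors.
Audio = List[float]


def pseudo_label(
    teacher: Callable[[Audio], str],
    unlabeled: Iterable[Audio],
) -> List[Tuple[Audio, str]]:
    """Pair each unlabeled utterance with the teacher's transcript."""
    pairs = []
    for audio in unlabeled:
        transcript = teacher(audio)
        # Assumption for this sketch: skip utterances with an empty hypothesis.
        if transcript:
            pairs.append((audio, transcript))
    return pairs


def toy_teacher(audio: Audio) -> str:
    """Stand-in for a non-streaming ASR model (hypothetical)."""
    return "speech" if sum(abs(x) for x in audio) > 1.0 else ""


unlabeled_audio = [[0.9, 0.8], [0.0, 0.1], [1.5]]
training_pairs = pseudo_label(toy_teacher, unlabeled_audio)
# training_pairs would then be fed to the streaming student's usual
# supervised training loop in place of human-labeled data.
```

Because the teacher needs only audio as input, the size of the student's training set is limited by the amount of available audio rather than by the amount of human transcription.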
