Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

Jiahui Yu; Wei Han; Anmol Gulati; Chung-Cheng Chiu; Bo Li; Tara N Sainath; Yonghui Wu; Ruoming Pang

ICLR 2021

Abstract

Streaming automatic speech recognition (ASR) aims to emit each recognized word shortly after it is spoken, while full-context ASR encodes an entire speech sequence before decoding text. In this work, we propose a unified framework, Universal ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. More importantly, we show that the latency and accuracy of streaming ASR benefit significantly from weight sharing and joint training with full-context ASR, especially with inplace knowledge distillation. The Universal ASR framework is network-agnostic and can be applied to recent state-of-the-art convolution-based and transformer-based end-to-end ASR networks. We present extensive experiments on both the research dataset LibriSpeech and the mega-scale internal dataset MultiDomain with two state-of-the-art ASR networks, ContextNet and Conformer. Experiments and ablation studies demonstrate that Universal ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both the emission latency and the recognition accuracy of streaming ASR.
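
The abstract names three ingredients: one set of weights used in two modes, joint training of both modes, and inplace knowledge distillation. The toy sketch below illustrates how these could fit together; it is not the authors' implementation. The causal masking of a shared convolution kernel, the KL-divergence form of the distillation term, the unweighted sum of losses, and every function and parameter name here (`shared_conv1d`, `dual_mode_loss`, `distill_weight`) are assumptions made for illustration only.

```python
import math


def shared_conv1d(x, kernel, streaming):
    """Apply one shared, centered kernel in either mode (toy 1-D example).

    Full-context mode uses every tap. Streaming mode skips the taps that
    would read future frames, so both modes reuse the same weights.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            offset = k - half  # negative = past frame, positive = future frame
            if streaming and offset > 0:
                continue  # causal mask: no access to future frames
            idx = t + offset
            if 0 <= idx < len(x):  # implicit zero padding at the edges
                acc += w * x[idx]
        out.append(acc)
    return out


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) for two discrete distributions."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def dual_mode_loss(loss_streaming, loss_full, teacher_logits, student_logits,
                   distill_weight=1.0):
    """Joint objective: both mode losses plus a distillation term.

    The full-context output plays the teacher and the streaming output the
    student. In a real framework the teacher would be wrapped in a
    stop-gradient; here the values are plain floats, so nothing propagates.
    """
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    distill = kl_divergence(teacher, student)
    return loss_streaming + loss_full + distill_weight * distill


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0]
    kernel = [1.0, 1.0, 1.0]  # one weight vector shared by both modes
    print(shared_conv1d(frames, kernel, streaming=False))
    print(shared_conv1d(frames, kernel, streaming=True))
```

In this sketch the two modes differ only in which kernel taps are active, so a single forward pass per mode over the same parameters yields the teacher and student outputs needed for the distillation term within one training step.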
