Towards fast and accurate streaming end-to-end ASR

Bo Li; Shuo-yiin Chang; Tara Sainath; Ruoming Pang; Yanzhang (Ryan) He; Trevor Strohman; Yonghui Wu

Proc. ICASSP (2020)

Abstract

End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition system into one neural network with far fewer parameters, making them suitable for on-device applications. For example, the recurrent neural network transducer (RNN-T) is a streaming E2E model that has shown promising potential for on-device ASR. For such applications, quality and latency are two critical factors. We propose to reduce the E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique, we achieved an 8.0% relative word error rate (WER) reduction and a 130 ms reduction in 90th-percentile latency on a Voice Search test set. We also experimented with a second-pass Listen, Attend and Spell (LAS) rescorer for the RNN-T EP model. Although it cannot directly improve the first-pass latency, the large WER reduction gives us more room to trade WER for latency. RNN-T+LAS, together with EMBR training, brings a 17.3% relative WER reduction while maintaining a similar 120 ms 90th-percentile latency reduction.
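
The abstract reports results as a relative WER reduction and a 90th-percentile latency. As a rough illustration of how such figures are commonly computed, here is a minimal Python sketch. It is not the authors' evaluation code: the function names, the nearest-rank percentile definition, and all sample values are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


def percentile_nearest_rank(values: list, pct: int):
    """Nearest-rank percentile (an assumed definition) for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


# Hypothetical values, not taken from the paper.
wer = word_error_rate("play some jazz music", "play some jazz")
rel = relative_reduction(baseline=10.0, improved=9.2)
p90 = percentile_nearest_rank([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000], 90)
```

A relative reduction is measured against the baseline WER, so moving from a hypothetical 10.0% WER to 9.2% is an 8% relative reduction but only a 0.8-point absolute one. The 90th-percentile latency summarizes the slow tail of utterances rather than the average case.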
