Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition

Wei Li
James Qin
Chung-Cheng Chiu
Ruoming Pang
Yanzhang (Ryan) He


Two-pass models have improved quality for on-device speech recognition: a first-pass recurrent neural network transducer (RNN-T) model generates hypotheses in a streaming fashion, and a second-pass Listen, Attend and Spell (LAS) model rescores those hypotheses with the full audio sequence as context. Such models offer both the fast responsiveness of the first pass and the better quality of the second pass. The computation latency of the second-pass model is critical, however, because it cannot start until the speech and the first-pass hypotheses are complete. This rescoring latency is constrained by the recurrent nature of the LSTM, which must process the tokens of each hypothesis one step at a time. In this work we explore replacing the LSTM layers in the second-pass rescorer with Transformer layers, which can process entire hypothesis sequences in parallel and can therefore use on-device computation resources more efficiently. Compared with an LAS-based baseline, our proposed Transformer rescorer achieves more than 50% latency reduction with quality improvement.
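To make the two-pass idea concrete, the following is a minimal sketch of second-pass rescoring over an N-best list. It is not the authors' implementation: the names `rescore` and `toy_scorer`, the linear interpolation rule, and the `weight` parameter are all assumptions made for illustration, and the toy scorer stands in for a real second-pass model. The point it illustrates is that every token of a hypothesis is already known at rescoring time, so a non-recurrent scorer can return all per-token scores from one call.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a hypothesis is a token list plus its first-pass log-probability.
Hypothesis = Tuple[List[str], float]


def rescore(
    hypotheses: Sequence[Hypothesis],
    second_pass_logprobs: Callable[[List[str]], List[float]],
    weight: float = 0.5,
) -> List[Tuple[List[str], float]]:
    """Rerank first-pass hypotheses by an interpolated score (assumed rule)."""
    rescored = []
    for tokens, first_pass_score in hypotheses:
        # All tokens are already known, so a non-recurrent scorer can return
        # every per-token log-probability from a single call.
        token_scores = second_pass_logprobs(tokens)
        second_pass_score = sum(token_scores)
        combined = (1.0 - weight) * first_pass_score + weight * second_pass_score
        rescored.append((tokens, combined))
    # Highest combined log-probability first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


def toy_scorer(tokens: List[str]) -> List[float]:
    """Stand-in for a second-pass model: favors tokens from a tiny vocabulary."""
    known = {"play", "some", "music"}
    return [-0.1 if tok in known else -2.0 for tok in tokens]


# Made-up N-best list; the first pass slightly prefers the wrong transcript.
n_best = [
    (["play", "sum", "music"], -1.0),
    (["play", "some", "music"], -1.2),
]
best_tokens, best_score = rescore(n_best, toy_scorer)[0]
print(" ".join(best_tokens))
```

In this toy example the second pass overturns the first-pass ranking. With a recurrent scorer, the call inside the loop would itself require one sequential step per token, which is the latency bottleneck the abstract describes.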