Google Research

Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition


Two-pass models have improved the quality of on-device speech recognition: a 1st-pass recurrent neural network transducer (RNN-T) model generates hypotheses in a streaming fashion, and a 2nd-pass Listen, Attend and Spell (LAS) model rescores those hypotheses with the context of the full audio sequence. Such models combine the fast responsiveness of the 1st-pass model with the better quality of the 2nd-pass model. The computation latency of the 2nd-pass model is a critical problem, because the model has to wait until the speech and the 1st-pass hypotheses are complete. The rescoring latency is further constrained by the recurrent nature of the LSTM, which must process each sequence one step at a time. In this work we explore replacing the LSTM layers in the 2nd-pass rescorer with Transformer layers, which can process entire hypothesis sequences in parallel and can therefore use on-device computation resources more efficiently. Compared with an LAS-based baseline, our proposed Transformer rescorer reduces latency by more than 50% while also improving quality.
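
As a rough illustration of the rescoring step described above, the sketch below picks the best hypothesis from an N-best list by combining a 1st-pass score with summed per-token scores from a 2nd-pass rescorer. The `Hypothesis` class, the `rescore` function, the interpolation `weight`, and all numeric scores are hypothetical and chosen only for illustration; the abstract does not specify how the two passes are combined. Because the hypothesis tokens are already known at rescoring time, a Transformer can produce the whole list of per-token scores in one parallel pass, whereas an LSTM has to produce them one step after another.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    text: str
    # Log-probability assigned by the streaming 1st pass (made-up value).
    first_pass_logprob: float
    # Per-token log-probabilities from the 2nd-pass rescorer (made-up values).
    second_pass_token_logprobs: List[float]


def rescore(hyps: List[Hypothesis], weight: float = 0.5) -> Hypothesis:
    """Return the hypothesis with the highest interpolated score.

    `weight` is an assumed interpolation factor, not a value from the paper.
    """
    def combined(h: Hypothesis) -> float:
        second_pass = sum(h.second_pass_token_logprobs)
        return (1.0 - weight) * h.first_pass_logprob + weight * second_pass

    return max(hyps, key=combined)


nbest = [
    Hypothesis("play the beetles", -1.0, [-0.9, -0.2, -2.5]),
    Hypothesis("play the beatles", -1.2, [-0.9, -0.2, -0.4]),
]

best = rescore(nbest)
print(best.text)
```

In this toy list the 1st pass slightly prefers the first hypothesis, and the 2nd-pass scores reverse that ranking. Setting `weight=0.0` ignores the rescorer and returns the 1st-pass choice.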
