Google Research

Two-Pass End-to-End Speech Recognition


A state-of-the-art commercial speech recognition system must not only have a low word error rate (WER), but also abide by latency constraints. Specifically, the model must decode utterances in a streaming fashion and faster than real time. Recently, a streaming recurrent neural network transducer (RNN-T) end-to-end (E2E) model has been shown to be a good candidate for on-device speech recognition, with improved WER and latency compared to conventional on-device models~\cite{Ryan19}. However, this model still lags behind a large state-of-the-art conventional model in quality~\cite{Golan16}. On the other hand, a non-streaming E2E Listen, Attend and Spell (LAS) model has shown quality comparable to that of the large conventional model~\cite{CC18}. This work aims to bring the quality of an E2E streaming model closer to that of the large conventional model by incorporating a LAS network as a second-pass component, while still abiding by latency constraints. We find that the proposed two-pass model achieves a 19\% relative reduction in WER compared to RNN-T alone, while increasing latency by only a small fraction over RNN-T.
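
The 19\% figure is a relative reduction, measured against the RNN-T baseline rather than in absolute WER points. As a brief illustration, using symbols introduced here and not taken from the paper, let $\mathrm{WER}_{\mathrm{RNN\mbox{-}T}}$ denote the WER of the first-pass model alone and $\mathrm{WER}_{\mathrm{2pass}}$ the WER of the two-pass model. The reported result then reads
\begin{equation}
\frac{\mathrm{WER}_{\mathrm{RNN\mbox{-}T}} - \mathrm{WER}_{\mathrm{2pass}}}{\mathrm{WER}_{\mathrm{RNN\mbox{-}T}}} \approx 0.19,
\qquad \mbox{equivalently} \qquad
\mathrm{WER}_{\mathrm{2pass}} \approx 0.81 \, \mathrm{WER}_{\mathrm{RNN\mbox{-}T}} .
\end{equation}
The absolute WER values are not stated in this abstract, so none are assumed here.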
