Deliberation Model Based Two-Pass End-to-End Speech Recognition

Kevin Hu; Rohit Prabhavalkar; Ruoming Pang; Tara Sainath

Deliberation Model Based Two-Pass End-to-End Speech Recognition

Kevin Hu

Rohit Prabhavalkar

Ruoming Pang

Tara Sainath

(2020)

Google Scholar

Abstract

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively to conventional models. To further improve the quality of an E2E model, two-pass decoding has been proposed to rescore streamed hypotheses using a non-streaming E2E model while maintaining a reasonable latency. However, the rescoring model uses only acoustics to rerank hypotheses. On the other hand, a class of neural correction models use only first-pass hypotheses for second-pass decoding. In this work, we propose to attend to both acoustics and first-pass hypotheses using the deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 25% relatively WER reduction compared to a recurrent neural network transducer, and 12% to LAS rescoring in Google Voice Search tasks. The improvement on a proper noun test set is even larger: 23% compared to LAS rescoring. The proposed model has a similar latency compared to LAS rescoring in decoding Voice Search utterances.

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Deliberation Model Based Two-Pass End-to-End Speech Recognition

Abstract

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs