Deliberation Model Based Two-Pass End-to-End Speech Recognition

Kevin Hu
Ruoming Pang
(2020)

Abstract

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively with conventional models. To further improve the quality of an E2E model, two-pass decoding has been proposed to rescore streamed hypotheses using a non-streaming E2E model while maintaining reasonable latency. However, the rescoring model uses only acoustics to rerank hypotheses. On the other hand, a class of neural correction models uses only first-pass hypotheses for second-pass decoding. In this work, we propose to attend to both acoustics and first-pass hypotheses using the deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves a 25% relative word error rate (WER) reduction compared to a recurrent neural network transducer, and a 12% relative reduction compared to Listen, Attend and Spell (LAS) rescoring, on Google Voice Search tasks. The improvement on a proper noun test set is even larger: 23% compared to LAS rescoring. The proposed model has latency similar to that of LAS rescoring when decoding Voice Search utterances.
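
The improvements above are relative WER reductions, that is, the drop in WER expressed as a fraction of the baseline WER rather than as an absolute difference in percentage points. The following is a minimal sketch of that arithmetic. The function name `relative_wer_reduction` and the WER values are hypothetical and chosen only for illustration; they are not results from this work.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent (illustrative only, not taken from the paper).
baseline = 8.0  # assumed first-pass WER
improved = 6.0  # assumed second-pass WER

reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.0%} relative reduction")  # prints "25% relative reduction"
```

With these assumed numbers, an absolute drop of 2 percentage points corresponds to a 25% relative reduction, which is why relative figures should always be read alongside the baseline they refer to.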
