Bridging the gap between streaming and non-streaming automatic speech recognition systems through distillation of an ensemble of models
Abstract
Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real time. Their small size and low latency make them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal, with no access to future context. Nevertheless, non-streaming models can serve as teachers to improve streaming ASR systems: they can transcribe arbitrarily large sets of unsupervised utterances, and streaming student models can then be trained on these generated labels. However, the gap between teacher and student word error rates (WER) remains high. In this paper, we propose to reduce this gap by using a diversified set of non-streaming teacher models and combining them with Recognizer Output Voting Error Reduction (ROVER). Fusing RNN-T and CTC models yields stronger teachers, which in turn improve the performance of streaming student models. We outperform a baseline streaming RNN-T trained from non-streaming RNN-T teachers by 27\% to 42\%, depending on the language.
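To make the fusion step concrete, the sketch below illustrates ROVER-style word-level voting in Python. It is a simplified illustration rather than the full ROVER algorithm: real ROVER incrementally builds a word transition network and can weight votes by recognizer confidence scores, whereas this sketch simply aligns every hypothesis to the first one via edit-distance alignment and takes a per-slot majority vote (insertions relative to the first hypothesis are dropped). All function names here are hypothetical.

```python
from collections import Counter

def align(ref, hyp):
    """Levenshtein-align hyp to ref; yield (ref_index, hyp_word) pairs.

    ref_index is None for insertions; hyp_word is None for deletions.
    """
    n, m = len(ref), len(hyp)
    # Standard edit-distance DP table.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    # Backtrace to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((i - 1, hyp[j - 1])); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            pairs.append((i - 1, None)); i -= 1   # deletion in hyp
        else:
            pairs.append((None, hyp[j - 1])); j -= 1  # insertion in hyp
    pairs.reverse()
    return pairs

def rover_vote(hypotheses):
    """Simplified ROVER: majority vote over word slots of the first hypothesis."""
    ref = hypotheses[0]
    slots = [[w] for w in ref]  # votes collected per reference position
    for hyp in hypotheses[1:]:
        for ref_idx, word in align(ref, hyp):
            if ref_idx is not None:
                slots[ref_idx].append(word)  # word may be None (deletion vote)
    out = []
    for votes in slots:
        word, _ = Counter(votes).most_common(1)[0]
        if word is not None:  # a None majority deletes the word
            out.append(word)
    return " ".join(out)

# Example: three teacher hypotheses for the same utterance.
hyps = [
    "the cat sat on the mat".split(),
    "the cat sat on a mat".split(),
    "a cat sat on the mat".split(),
]
print(rover_vote(hyps))  # -> "the cat sat on the mat"
```

In the actual pipeline described above, the voted transcript would serve as the pseudo-label for training the streaming student model on unsupervised audio.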