Multi-Task Learning for E2E ASR Word and Utterance Confidence

David Qiu
Yu Zhang
Liangliang Cao
Interspeech (2021)

Abstract

Confidence scores are very useful for downstream applicationsof automatic speech recognition (ASR) systems. Recent workshave proposed using neural attention models to learn word or ut-terance confidence scores for end-to-end (E2E) ASR. By them-selves, word confidence does not model deletions, and utteranceconfidence discards much of the useful word-level training sig-nals. This paper studies the effect of adding utterance-level lossand individual deletion loss to the framework proposed in [1].Empirical results show that multi-task learning with all threeobjectives improves confidence metrics (NCE, AUC, RMSE)without the need for increasing the model size of the trans-former feature extractor. Using the utterance-level confidencefor rescoring also decreases the word error rates on Google’sVoice Search and long-tail datasets by 3-5% relative.