Google Research

Multi-Task Learning for E2E ASR Word and Utterance Confidence

Interspeech (2021)

Abstract

Confidence scores are very useful for downstream applications of automatic speech recognition (ASR) systems. Recent works have proposed using neural attention models to learn word or utterance confidence scores for end-to-end (E2E) ASR. By themselves, word confidence does not model deletions, and utterance confidence discards much of the useful word-level training signal. This paper studies the effect of adding an utterance-level loss and an individual deletion loss to the framework proposed in [1]. Empirical results show that multi-task learning with all three objectives improves confidence metrics (NCE, AUC, RMSE) without the need to increase the model size of the transformer feature extractor. Using the utterance-level confidence for rescoring also decreases the word error rates on Google's Voice Search and long-tail datasets by 3-5% relative.
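As a rough illustration of the multi-task idea described above, the sketch below combines a word-level confidence loss, an utterance-level loss, and a deletion loss into a single training objective. This is a minimal sketch, not the paper's formulation: the per-task binary cross-entropy losses, the interpolation weights alpha and beta, and the helper multitask_confidence_loss are all illustrative assumptions, written here in PyTorch for concreteness.

```python
# Minimal sketch of a three-way multi-task confidence objective.
# The loss definitions and weights below are assumptions for illustration,
# not the formulation used in the paper.
import torch
import torch.nn.functional as F

def multitask_confidence_loss(word_logits, word_labels,
                              utt_logit, utt_label,
                              del_logits, del_labels,
                              alpha=0.5, beta=0.5):
    """Combine word-, utterance-, and deletion-level confidence losses."""
    # Word-level: one binary target per hypothesized word (correct vs. incorrect).
    word_loss = F.binary_cross_entropy_with_logits(word_logits, word_labels)
    # Utterance-level: a single binary target for the whole hypothesis.
    utt_loss = F.binary_cross_entropy_with_logits(utt_logit, utt_label)
    # Deletion: a binary target per position marking whether a deletion occurred.
    del_loss = F.binary_cross_entropy_with_logits(del_logits, del_labels)
    # Weighted sum of the three objectives (weights are illustrative).
    return word_loss + alpha * utt_loss + beta * del_loss

# Toy usage with random tensors standing in for model outputs and alignment labels.
if __name__ == "__main__":
    torch.manual_seed(0)
    num_words = 6
    loss = multitask_confidence_loss(
        word_logits=torch.randn(num_words),
        word_labels=torch.randint(0, 2, (num_words,)).float(),
        utt_logit=torch.randn(1),
        utt_label=torch.tensor([1.0]),
        del_logits=torch.randn(num_words),
        del_labels=torch.randint(0, 2, (num_words,)).float(),
    )
    print(f"combined multi-task loss: {loss.item():.4f}")
```

The point of the sketch is only that all three signals contribute to one gradient for the shared confidence model, which is what lets the word-level estimator benefit from utterance- and deletion-level supervision without a larger feature extractor.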
