- David Qiu
- Yanzhang (Ryan) He
- Qiujia Li
- Yu Zhang
- Liangliang Cao
- Ian Carmichael McGraw
Abstract
Confidence scores are very useful for downstream applications of automatic speech recognition (ASR) systems. Recent works have proposed using neural attention models to learn word or utterance confidence scores for end-to-end (E2E) ASR. By themselves, word confidence does not model deletions, and utterance confidence discards much of the useful word-level training signals. This paper studies the effect of adding utterance-level loss and individual deletion loss to the framework proposed in [1]. Empirical results show that multi-task learning with all three objectives improves confidence metrics (NCE, AUC, RMSE) without the need for increasing the model size of the transformer feature extractor. Using the utterance-level confidence for rescoring also decreases the word error rates on Google’s Voice Search and long-tail datasets by 3-5% relative.
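The abstract names NCE and RMSE as confidence metrics without defining them. The sketch below assumes the commonly used definitions: normalized cross entropy (NCE) compares the cross entropy of the predicted confidences against the entropy of the binary correctness labels, and RMSE is the root mean squared error between confidences and those labels. The function names `nce` and `rmse` and the `eps` clipping constant are illustrative choices, not taken from the paper, and the paper's own evaluation code may differ.

```python
import math


def nce(confidences, labels, eps=1e-12):
    """Normalized cross entropy of confidences against 0/1 correctness labels.

    Assumes the labels contain at least one 0 and one 1, so the prior
    entropy is non-zero. Returns 1.0 for perfect confidences and 0.0 when
    every confidence equals the empirical fraction of correct items.
    """
    n = len(labels)
    p_correct = sum(labels) / n
    # Entropy of the labels under the empirical prior p_correct.
    h_prior = -(p_correct * math.log(p_correct)
                + (1.0 - p_correct) * math.log(1.0 - p_correct))
    # Average cross entropy of the predicted confidences; eps avoids log(0).
    h_model = -sum(
        math.log(max(c, eps)) if y == 1 else math.log(max(1.0 - c, eps))
        for c, y in zip(confidences, labels)
    ) / n
    return (h_prior - h_model) / h_prior


def rmse(confidences, labels):
    """Root mean squared error between confidences and 0/1 labels."""
    n = len(labels)
    return math.sqrt(sum((c - y) ** 2 for c, y in zip(confidences, labels)) / n)
```

Under these definitions a higher NCE and a lower RMSE both indicate better-calibrated confidence scores; a negative NCE means the scores are less informative than always predicting the prior.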