RED-ACE: Robust Error Detection for ASR using Confidence Embeddings
Abstract
ASR Error Detection (AED) models aim to post-process the output of Automatic Speech Recognition (ASR) systems in order to detect transcription errors. Modern approaches typically rely on text-based input, consisting of the ASR transcription hypothesis, without leveraging additional signals from the ASR model, resulting in the loss of important acoustic information. In this work, we propose to utilize the ASR system's word-level confidence scores to improve AED performance. Specifically, we propose to add an ASR Confidence Embedding (ACE) layer to the AED model's encoder, allowing us to jointly encode the confidence scores and the transcribed text into a contextualized representation. Our experiments show the benefits of ASR confidence scores for AED, their complementary effect with the textual signal, and the effectiveness and robustness of our approach for combining these signals. To foster further research, we curate and publish a novel AED dataset consisting of ASR outputs on the LibriSpeech corpus with annotated transcription errors.
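To make the abstract's description of the ACE layer concrete, the following is a minimal, illustrative sketch of how word-level confidence scores could be injected into a Transformer encoder by adding a learned confidence embedding to the token embeddings. The class and parameter names, the confidence bucketing scheme, and the encoder configuration are assumptions for illustration only, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class ConfidenceEmbeddingEncoder(nn.Module):
    """Sketch of an encoder with an ASR Confidence Embedding (ACE) layer.

    Confidence scores in [0, 1] are quantized into buckets, mapped to learned
    vectors, and summed with the token embeddings before the Transformer
    encoder, yielding a joint contextualized representation of text and
    confidence. (Hypothetical implementation; details are assumed.)
    """

    def __init__(self, vocab_size, hidden_size=768, num_confidence_bins=10,
                 num_layers=6, num_heads=12):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        # ACE layer: one embedding vector per quantized confidence bucket.
        self.confidence_embedding = nn.Embedding(num_confidence_bins, hidden_size)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.num_confidence_bins = num_confidence_bins

    def forward(self, token_ids, confidence_scores):
        # token_ids: (batch, seq_len) int64 subword/word ids
        # confidence_scores: (batch, seq_len) floats in [0, 1] from the ASR model
        bins = (confidence_scores * (self.num_confidence_bins - 1)).long()
        x = self.token_embedding(token_ids) + self.confidence_embedding(bins)
        return self.encoder(x)  # contextualized text + confidence representation
```

A token-level error-detection head (e.g., a linear layer followed by a softmax over correct/error labels) would then be applied to the encoder outputs.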