- Golan Pundak
- Tara Sainath
- Rohit Prabhavalkar
- Anjuli Kannan
- Ding Zhao
Abstract
In automatic speech recognition (ASR), what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow-fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.
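To make the contextualization idea concrete, the sketch below illustrates one decoder step of CLAS-style bias attention: each context phrase is embedded into a fixed vector (here, random placeholders standing in for bias-encoder outputs), and the decoder attends over those embeddings, plus a learned "no-bias" option, to form a context vector. This is a minimal illustration, not the authors' implementation; the use of additive attention, the weight names (W_q, W_k, v), and all dimensions are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bias_attention(dec_state, bias_embs, W_q, W_k, v):
    """Additive attention of the decoder state over bias-phrase embeddings.

    dec_state : (d_dec,)   current decoder hidden state
    bias_embs : (N+1, d_b) row 0 is a learned no-bias embedding;
                           rows 1..N embed the N context phrases
    Returns the bias context vector (d_b,) and attention weights (N+1,).
    """
    # Project phrases and decoder state into a shared attention space,
    # then score each phrase against the current decoder state.
    scores = np.tanh(bias_embs @ W_k.T + dec_state @ W_q.T) @ v  # (N+1,)
    alpha = softmax(scores)
    # Weighted sum of phrase embeddings; fed to the decoder alongside
    # the usual audio context vector.
    return alpha @ bias_embs, alpha

# Toy example: three context phrases plus the no-bias option.
d_dec, d_b, d_att, n_phrases = 8, 6, 5, 3
W_q = rng.standard_normal((d_att, d_dec))
W_k = rng.standard_normal((d_att, d_b))
v = rng.standard_normal(d_att)

bias_embs = rng.standard_normal((n_phrases + 1, d_b))  # stand-in bias-encoder outputs
dec_state = rng.standard_normal(d_dec)

context_vec, alpha = bias_attention(dec_state, bias_embs, W_q, W_k, v)
print("attention over [no-bias, phrase 1..3]:", np.round(alpha, 3))
```

Because the attention operates over phrase embeddings rather than a fixed vocabulary, phrases supplied at inference time may contain OOV terms never seen during training, which is the property the abstract highlights.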