DEEP CONTEXT: END-TO-END CONTEXTUAL SPEECH RECOGNITION
Abstract
In automatic speech recognition (ASR), what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.
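To make the shallow-fusion baseline mentioned above concrete, the following is a minimal sketch of how an externally trained contextual score can be interpolated with the E2E model score during beam search. The function name, the toy hypotheses, and the interpolation weight `lam` are illustrative assumptions, not values or APIs from the paper.

```python
import math

def shallow_fusion_score(las_log_prob: float,
                         context_log_prob: float,
                         lam: float = 0.3) -> float:
    """Combine the E2E (LAS) score with an external contextual score.

    Shallow fusion rescores each beam-search expansion as
        score = log P_LAS(y | x, y_prev) + lam * log P_context(y | y_prev),
    where the two models are trained independently. `lam` is a tuning
    parameter chosen on held-out data (assumed value here).
    """
    return las_log_prob + lam * context_log_prob


# Toy usage: the contextual model boosts the hypothesis that matches an
# in-context phrase (hypothetical scores, for illustration only).
hyps = {
    "call joe": {"las": math.log(0.20), "ctx": math.log(0.9)},
    "call jo":  {"las": math.log(0.25), "ctx": math.log(0.1)},
}
for text, s in hyps.items():
    print(text, shallow_fusion_score(s["las"], s["ctx"]))
```

In contrast to this two-model interpolation, CLAS embeds the context phrases and attends over them inside a single network, so the biasing behavior is learned jointly with the rest of the recognizer.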