Contextual speech recognition in end-to-end neural network systems using beam search
Abstract
Recent work has shown that end-to-end (E2E) speech
recognition architectures such as Listen, Attend and Spell (LAS)
can achieve state-of-the-art results on large vocabulary continuous
speech recognition (LVCSR) tasks. One benefit of this architecture
is that it does not require separately trained pronunciation,
language, and acoustic models. However, this property also introduces a drawback:
it is not possible to adjust language model contributions separately
from the system as a whole. As a result, inclusion of
dynamic, contextual information (such as nearby restaurants or
upcoming events) into recognition requires a different approach
from what has been applied in conventional systems.
We introduce a technique to adapt the inference process
to take advantage of contextual signals by adjusting the output
likelihoods of the neural network at each step in the beam
search. We apply the proposed method to a LAS E2E model
and show its effectiveness in experiments on a voice search task
with both artificial and real contextual information. Given optimal
context, our system reduces WER from 9.2% to 3.8% (a relative reduction of about 59%).
The results show that this technique is effective at incorporating
context into the prediction of an E2E system.
Index Terms: speech recognition, end-to-end, contextual
speech recognition, neural network
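
To make the idea of adjusting output likelihoods at each beam search step concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that the adjustment is an additive bonus in the log domain, scaled by a weight `lam`, for any token that continues a context phrase. The names `context_bonus`, `biased_beam_step`, and `toy_model`, the bonus value, and the toy probabilities are all hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A hypothesis is (tokens so far, cumulative log-score).
Hypothesis = Tuple[Tuple[str, ...], float]


def context_bonus(prefix: Sequence[str], token: str,
                  phrases: Sequence[Sequence[str]], bonus: float) -> float:
    """Return `bonus` if appending `token` continues some context phrase.

    ASSUMPTION: a phrase is "continued" when the tail of the hypothesis,
    including the new token, matches the beginning of that phrase.
    """
    candidate = tuple(prefix) + (token,)
    for phrase in phrases:
        for k in range(1, min(len(phrase), len(candidate)) + 1):
            if candidate[-k:] == tuple(phrase[:k]):
                return bonus
    return 0.0


def biased_beam_step(beam: List[Hypothesis],
                     step_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
                     phrases: Sequence[Sequence[str]],
                     lam: float = 1.0,
                     beam_width: int = 2) -> List[Hypothesis]:
    """Expand every hypothesis by one token and keep the best `beam_width`.

    The network's log-probability for each token is adjusted by
    lam * context_bonus(...) before hypotheses are ranked (ASSUMPTION:
    additive log-domain adjustment with a fixed bonus of 1.0).
    """
    expanded: List[Hypothesis] = []
    for tokens, score in beam:
        for token, log_p in step_log_probs(tokens).items():
            adjusted = log_p + lam * context_bonus(tokens, token, phrases, 1.0)
            expanded.append((tokens + (token,), score + adjusted))
    expanded.sort(key=lambda hyp: hyp[1], reverse=True)
    return expanded[:beam_width]


def toy_model(tokens: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical stand-in for the network: slightly prefers "jon"."""
    return {"jon": math.log(0.6), "john": math.log(0.4)}


start: List[Hypothesis] = [(("call",), 0.0)]

# Without context, the acoustically preferred spelling wins.
unbiased = biased_beam_step(start, toy_model, phrases=[])
# With a contact named "john" supplied as context, the ranking flips.
biased = biased_beam_step(start, toy_model, phrases=[("john",)])

print(unbiased[0][0])  # ('call', 'jon')
print(biased[0][0])    # ('call', 'john')
```

In this sketch the weight `lam` controls how strongly context can override the network: with `lam=0.0` the step reduces to ordinary beam search, while a very large value would let context phrases win regardless of the audio evidence.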