An Analysis of "Attention" in Sequence-to-Sequence Models
Abstract
In this paper, we conduct a detailed investigation of attention-based models for automatic speech recognition (ASR). First, we explore different types of attention, including online and full-sequence attention. Second, we explore different sub-word units to see how much of the end-to-end ASR process can reasonably be captured by an attention model. In experimental evaluations, we find that although attention is typically focussed over a small region of the acoustics during each step of next-label prediction, full-sequence attention outperforms online attention; this gap can, however, be significantly reduced by increasing the length of the segments over which attention is computed. Furthermore, we find that context-independent phonemes are a reasonable sub-word unit for attention models: when used in the second pass to rescore N-best hypotheses, these models provide over a 10% relative improvement in word error rate.
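
To make the distinction between full-sequence and online attention concrete, the following is a minimal sketch and not the models evaluated in this paper. It assumes plain dot-product scoring, a toy six-frame encoder output with made-up values, and a hypothetical helper `attend` whose `start` and `end` arguments stand in for the segment over which online attention is computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frames, start=0, end=None):
    """Dot-product attention of `query` over frames[start:end].

    Hypothetical helper for illustration only. Returns (weights, context),
    where the weights cover just the attended span and sum to 1.
    """
    span = frames[start:end]
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in span]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * frame[d] for w, frame in zip(weights, span)) for d in range(dim)
    ]
    return weights, context


# Toy encoder output: 6 frames of 2-dimensional features (made-up values).
frames = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0], [0.1, 0.8], [0.0, 1.0]]
query = [1.0, 0.0]  # assumed decoder state for one label-prediction step

# Full-sequence attention: every frame competes for weight.
full_weights, full_context = attend(query, frames)

# Online-style attention: only a fixed-length segment (frames 2 and 3) is visible.
seg_weights, seg_context = attend(query, frames, start=2, end=4)
```

Widening the `start`/`end` span in this sketch moves the segment-restricted result toward the full-sequence one, which is the intuition behind lengthening the segments over which attention is computed.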