In this work, we conduct a detailed evaluation of several all-neural, end-to-end trained sequence-to-sequence models applied to speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon or a separate language model. We examine connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attention-based model, and a model that augments the RNN transducer with an attention mechanism.
We find that end-to-end models can learn all components of the speech recognition process (acoustic, pronunciation, and language models) in a single jointly optimized neural network, and can directly output words in written form (e.g., rendering the spoken “one hundred dollars” as “$100”). Furthermore, the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the traditional baseline outperforms them on voice-search test sets.