Flatstart-CTC: a new acoustic model training procedure for speech recognition
Abstract
We present a new procedure to train acoustic models from scratch for large-vocabulary speech
recognition that requires no previous model for alignments or bootstrapping.
We augment the Connectionist Temporal Classification (CTC) objective function to allow training of acoustic models directly
from a parallel corpus of audio data and transcripts. With this augmented CTC function
we train a phoneme recognition acoustic model directly from the written-domain transcript. Further,
we outline a mechanism to generate context-dependent phonemes from a CTC model trained to predict phonemes,
and ultimately train a second CTC model to predict these context-dependent phonemes. Since this approach does not
require training of any previous non-CTC model, it drastically reduces the overall data-to-model training time from
30 days to 10 days. Additionally, models obtained from this flatstart-CTC procedure outperform the state-of-the-art by XX-XX\%.