Flatstart-CTC: a new acoustic model training procedure for speech recognition

Andrew Senior; Hasim Sak; Kanishka Rao

ICASSP 2016

Abstract

We present a new procedure for training acoustic models from scratch for large-vocabulary speech
recognition that requires no previous model for alignment or bootstrapping.
We augment the Connectionist Temporal Classification (CTC) objective function to allow acoustic models
to be trained directly from a parallel corpus of audio and transcribed data. With this augmented CTC function,
we train a phoneme-recognition acoustic model directly from the written-domain transcript. Further,
we outline a mechanism for generating context-dependent phonemes from a CTC model trained to predict phonemes,
and ultimately train a second CTC model to predict these context-dependent phonemes. Since this approach does not
require training any previous non-CTC model, it drastically reduces the overall data-to-model training time, from
30 days to 10 days. Additionally, models obtained from this flatstart-CTC procedure outperform the state of the art by XX-XX%.
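
For readers unfamiliar with the objective being augmented, the following is a minimal sketch of the standard CTC negative log-likelihood, computed with the usual forward recursion over a blank-interleaved label sequence. It is not the augmented objective described above, whose details the abstract does not give. The function names, the choice of index 0 for the blank symbol, and the list-of-lists input layout are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-domain values."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """Standard CTC loss for one utterance (illustrative sketch).

    log_probs: list of T frames, each a list of per-symbol log posteriors.
    labels:    target symbol indices (e.g. phonemes), without blanks.
    blank:     index of the blank symbol (assumed to be 0 here).
    """
    # Interleave blanks around every label: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    T, S = len(log_probs), len(ext)

    # alpha[s] is the log probability of all partial alignments that end
    # in extended-sequence position s at the current frame.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new_alpha = [NEG_INF] * S
        for s in range(S):
            candidates = [alpha[s]]              # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])  # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = log_sum_exp(candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or on the trailing blank.
    final = [alpha[S - 1]]
    if S > 1:
        final.append(alpha[S - 2])
    return -log_sum_exp(final)
```

Because the recursion sums over every frame-level alignment consistent with the label sequence, no pre-computed alignment from an earlier model is needed as a training target, which is the property the flat-start procedure relies on.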
