Discriminative pronunciation modeling for dialectal speech recognition

Maider Lehr
Proc. Interspeech (2014) (to appear)

Abstract

Speech recognizers are typically trained with data from a standard
dialect and do not generalize to non-standard dialects. Mismatch
mainly occurs in the acoustic realization of words, which is represented
by the acoustic models and the pronunciation lexicon. Standard techniques for
addressing this mismatch are generative in nature and include acoustic
model adaptation and expansion of the lexicon with pronunciation variants,
both of which have limited effectiveness. We present a discriminative
pronunciation model whose parameters are learned jointly with
the parameters of the language model. We tease apart the
gains from modeling the transitions of canonical phones, the
transduction from surface to canonical phones, and the language
model. We report experiments on African American Vernacular English
(AAVE) using NPR's StoryCorps corpus. Our models improve
performance over the baseline by about 2.1% on AAVE, of which 0.6%
is attributable to the pronunciation model. The model learns the most
relevant phonetic transformations for AAVE speech.
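The three components the abstract teases apart (canonical-phone transitions, surface-to-canonical phone transduction, and the language model score) can be illustrated with a minimal discriminative (structured-perceptron) rescoring sketch. This is not the paper's implementation; all feature names, the hypothesis data layout, and the training loop below are assumptions made for illustration only.

```python
# Illustrative sketch only: a log-linear rescoring model over n-best
# hypotheses, with hypothetical feature groups loosely mirroring the
# abstract: canonical-phone bigrams, surface-to-canonical substitutions,
# and a language-model score, all weighted jointly.
from collections import defaultdict

def features(hyp):
    """Map a hypothesis dict to sparse features (field names are assumptions)."""
    f = defaultdict(float)
    phones = hyp["canonical"]
    for a, b in zip(phones, phones[1:]):   # canonical-phone transitions
        f[("bigram", a, b)] += 1.0
    for s, c in hyp["align"]:              # surface -> canonical transduction
        f[("sub", s, c)] += 1.0
    f[("lm",)] = hyp["lm_score"]           # language-model score as a feature
    return f

def score(w, hyp):
    """Linear score of a hypothesis under weight vector w."""
    return sum(w[k] * v for k, v in features(hyp).items())

def perceptron_train(nbests, oracles, epochs=5):
    """Structured-perceptron training: move weights toward oracle hypotheses."""
    w = defaultdict(float)
    for _ in range(epochs):
        for nbest, oracle in zip(nbests, oracles):
            best = max(nbest, key=lambda h: score(w, h))
            if best is not oracle:
                for k, v in features(oracle).items():
                    w[k] += v
                for k, v in features(best).items():
                    w[k] -= v
    return w
```

For example, if the oracle hypothesis for AAVE speech aligns a surface [d] to canonical [dh] (as in "that" pronounced "dat"), training raises the weight of that substitution feature, so the model learns to prefer hypotheses containing it; this is the kind of dialect-specific phonetic transformation the abstract says the model picks up.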