Discriminative Articulatory Feature-based Pronunciation Models with Application to Spoken Term Detection

Ph.D. Thesis, The Ohio State University, Department of Computer Science and Engineering (2013)

Abstract

Conversational speech is characterized by large amounts of variability: variation in accent and pronunciation, along with disfluencies, continues to present challenges for speech recognition systems. These systems must account for this variability if they are to be successfully deployed in real-world environments. Traditional approaches, based on the so-called 'beads-on-a-string' phone-sequence representation, have a number of drawbacks when it comes to modeling the variability in conversational speech. In recent work, articulatory feature-based pronunciation models have been proposed as alternatives to phone-based representations and have been shown to improve performance in various studies. These models are grounded in linguistic theory and attempt to explain the variation observed in conversational speech by hypothesizing that it arises in part from the relative asynchrony between the speech articulators. The main contributions of this thesis are the development of discriminative articulatory feature-based pronunciation models, and the application of these models to the task of detecting words or phrases in conversational speech. We first develop factored conditional random field models of the articulatory feature streams, which explicitly account for the ability of the speech articulators to desynchronize in conversational speech. We also describe how exact inference can be performed efficiently in the proposed models by exploiting deterministic task-specific constraints. In experimental evaluations, we find that the proposed discriminative conditional random field models outperform previously proposed generative dynamic Bayesian network models for the task. We then apply the proposed articulatory feature-based pronunciation models to the problem of spoken term detection: detecting whether and where specific words or phrases are uttered in conversational speech.
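To make the asynchrony and constraint ideas concrete, here is a toy sketch, not the thesis's actual factored CRF: two feature streams each advance monotonically through a canonical sequence of positions, and a hard asynchrony bound keeps the two position indices within a fixed distance of each other. That deterministic constraint prunes the joint state space, so exact Viterbi decoding over the factored state remains cheap. All names, the scoring scheme, and the per-frame log-score inputs are illustrative assumptions.

```python
import itertools

def decode_factored(frame_scores_a, frame_scores_b, len_a, len_b, max_async=1):
    """Exact Viterbi decoding over two monotone feature streams.

    frame_scores_a[t][i] is a log-score for stream A sitting at canonical
    position i at frame t (likewise for stream B).  Joint states (i, j)
    with |i - j| > max_async are disallowed outright, which is how a
    deterministic asynchrony constraint shrinks the search space.
    Returns (best log-score, best joint-state path).
    """
    T = len(frame_scores_a)
    NEG = float("-inf")
    # best[(i, j)] = score of the best path ending in joint state (i, j)
    best = {(0, 0): frame_scores_a[0][0] + frame_scores_b[0][0]}
    back = {}
    for t in range(1, T):
        new_best = {}
        for i, j in itertools.product(range(len_a), range(len_b)):
            if abs(i - j) > max_async:
                continue  # asynchrony constraint: joint state disallowed
            score, prev = NEG, None
            # each stream independently either stays put or advances by one
            for pi, pj in itertools.product((i, i - 1), (j, j - 1)):
                s = best.get((pi, pj), NEG)
                if s > score:
                    score, prev = s, (pi, pj)
            if prev is not None and score > NEG:
                new_best[(i, j)] = score + frame_scores_a[t][i] + frame_scores_b[t][j]
                back[(t, i, j)] = prev
        best = new_best
    # backtrace from the fully advanced joint state
    state = (len_a - 1, len_b - 1)
    path = [state]
    for t in range(T - 1, 0, -1):
        state = back[(t,) + state]
        path.append(state)
    return best[(len_a - 1, len_b - 1)], list(reversed(path))
```

With scores that favor stream B advancing one frame before stream A, the decoder recovers a path in which the streams desynchronize by one position and then re-align, something a single fused state sequence could not represent without enlarging its state inventory.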
We conduct detailed evaluations to determine the effectiveness of the proposed techniques in low-resource settings where transcribed training data are limited, and find that the proposed articulatory feature-based models improve performance over phone-based models in a number of settings. Moreover, in many instances the information contained in the articulatory feature-based pronunciation models appears to be complementary to that in the phone-based models, allowing us to improve performance further through model combination. The thesis concludes by describing how the proposed spoken term detection approach can be adapted to leverage existing spoken term detection systems based on large-vocabulary continuous speech recognizers, when available, in order to improve both system running time and detection performance.
