Uncertainty Decoding for Noise Robust Speech Recognition

Ph.D. Thesis, University of Cambridge (2007)

Abstract

It is well known that the performance of automatic speech recognition (ASR) degrades in noisy conditions. To address this, typically either the noise is removed from the features or the models are compensated for the noise condition. The former is usually quite efficient but less effective than the latter, which is often computationally expensive. This thesis examines a hybrid form of noise compensation called uncertainty decoding, which is characterised by transforming the features and applying a simple acoustic model update that increases the model variances in proportion to the noise level. In particular, a novel approach called joint uncertainty decoding (JUD) is introduced. JUD compensation parameters are derived from the joint distribution between the training and test conditions. Two forms of uncertainty decoding are presented: front-end and model-based joint uncertainty decoding (FE-Joint and M-Joint). An important contribution is showing that front-end uncertainty decoding forms, such as SPLICE with uncertainty and FE-Joint, can exhibit problems at low SNR that do not occur with model-based forms. Furthermore, M-Joint is as efficient as FE-Joint for the same number of transforms. Thus JUD provides forms that are fast like feature compensation, yet more efficient than standard model-based techniques. Some common shortcomings of noise robustness techniques are that they only work with stereo data, only work on small vocabulary systems, are difficult to integrate with other acoustic modelling techniques, and are evaluated on artificial data. These are all addressed in this work for JUD. An EM-based ML noise model estimation technique allows JUD transforms to be generated given a sample of the noisy speech from the test environment. An ML approach can update the noise model during speech, can be optimised for the noise compensation type, and can provide a suitable noise model for multistyle-trained acoustic models.
In addition, it is shown how JUD can be combined with CMLLR or semi-tied covariance modelling. The last main contribution is a form of noise adaptive training using JUD transforms, called joint adaptive training (JAT). Instead of forcing the acoustic models to represent extraneous variability introduced by noise in the training data, as is the case for multistyle training, the noise effect is modelled by JUD transforms. Adaptive training with CMLLR or normalisation updates the features and subsequently treats cleaner observations the same as noisier ones. In contrast, during acoustic model training, JAT directly takes into account the noise level of observations by de-weighting them in proportion to the uncertainty. In this way, noisier observations contribute less to the estimation of the canonical model parameters than clean ones. The resulting acoustic models are then purer representations of the speech variability. JUD is evaluated on small, medium, and large vocabulary tasks, over a wide range of SNRs, and on artificially corrupted databases as well as actual recorded noisy speech data. The results show that JUD is a flexible, fast, yet powerful noise robustness technique for ASR.

Keywords: speech recognition; noise robustness; hidden Markov models; uncertainty decoding; model-based noise compensation; adaptation; adaptive training.
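
The central mechanism described in the abstract, a feature transform combined with a model variance increase that grows with the noise level, can be illustrated with a minimal sketch. The abstract gives no equations, so everything here is an assumption made for illustration: the diagonal-covariance Gaussian, the element-wise affine transform, and the names `a`, `b`, and `var_bias` are hypothetical and are not taken from the thesis.

```python
import numpy as np


def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def uncertainty_decoding_loglik(y, mean, var, a, b, var_bias):
    """Score a noisy observation y against one clean-speech Gaussian component.

    Assumed form (illustrative only):
      1. transform the noisy feature: x_hat = a * y + b  (element-wise)
      2. add a variance bias to the model variance: var + var_bias
      3. add the log-Jacobian of the feature transform
    A larger var_bias stands in for a higher noise level.
    """
    x_hat = a * y + b
    log_jacobian = np.sum(np.log(np.abs(a)))
    return log_jacobian + diag_gauss_logpdf(x_hat, mean, var + var_bias)


# Two hypothetical one-dimensional components with unit variance.
y = np.array([0.0])
mean_1, mean_2 = np.array([0.0]), np.array([2.0])
var = np.array([1.0])
a, b = np.array([1.0]), np.array([0.0])

for var_bias in (np.array([0.0]), np.array([3.0])):
    gap = (uncertainty_decoding_loglik(y, mean_1, var, a, b, var_bias)
           - uncertainty_decoding_loglik(y, mean_2, var, a, b, var_bias))
    print(f"variance bias {var_bias[0]:.1f}: log-likelihood gap {gap:.2f}")
```

In this toy setting, raising the variance bias from 0 to 3 shrinks the log-likelihood gap between the two components from 2.0 to 0.5, so a frame judged to be noisier discriminates less between components. That is the intuition behind letting uncertain observations count for less. How the transform and bias are actually estimated from the joint distribution is not covered by this sketch.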

Research Areas