Uncertainty Decoding for Noise Robust Speech Recognition
Abstract
It is well known that the performance of automatic speech recognition
degrades in noisy conditions. To address this, typically either the
noise is removed from the features or the models are compensated for
the noise condition. The former is usually efficient, but not as
effective as the latter, which is often computationally expensive.
This thesis examines a hybrid form of noise compensation called
uncertainty decoding, which is characterised by a feature
transformation combined with a simple acoustic model update that
increases the model variances in proportion to the noise level. In
particular, a novel approach called joint uncertainty decoding (JUD)
is introduced. JUD compensation
parameters are derived from the joint distribution between the
training and test conditions. Two forms of uncertainty decoding are
presented: front-end and model-based joint uncertainty decoding
(FE-Joint and M-Joint). An important contribution is to show that
front-end uncertainty decoding forms, such as SPLICE with uncertainty
and FE-Joint, can exhibit problems at low SNR that do not occur with
model-based forms. Furthermore, M-Joint is as efficient as FE-Joint
for the same number of transforms. Thus JUD provides forms that are
fast like feature compensation, yet more effective, and far cheaper
than standard model-based techniques.
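To make the model-based form concrete, a minimal sketch of the M-Joint
likelihood calculation follows; the notation (regression class r with
transform A^{(r)}, bias b^{(r)} and uncertainty variance Σ_b^{(r)}) is
illustrative rather than the exact symbols used in the thesis. A
Gaussian component m assigned to class r scores the corrupted
observation y_t as
\[
  p(\mathbf{y}_t \mid m) \simeq |\mathbf{A}^{(r)}|\,
  \mathcal{N}\!\left(\mathbf{A}^{(r)}\mathbf{y}_t + \mathbf{b}^{(r)};\;
  \boldsymbol{\mu}_m,\; \boldsymbol{\Sigma}_m + \boldsymbol{\Sigma}_b^{(r)}\right)
\]
so the features are transformed once per regression class, while each
model variance is inflated by a class-dependent uncertainty bias that
grows with the noise level. Since this variance bias is tied to the
model class rather than to the individual observation, the low-SNR
problems of the front-end forms do not arise.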
Noise robustness techniques often suffer from common shortcomings:
they require stereo data, are only demonstrated on small vocabulary
systems, are difficult to integrate with other acoustic modelling
techniques, or are evaluated only on artificially corrupted data.
These are all addressed in this work for JUD. An EM-based ML noise
model estimation technique allows JUD transforms to be generated from
only a sample of the noisy speech from the test environment. An ML
approach can update the noise model during speech, be optimised for
the type of noise compensation used, and provide a suitable noise
model for multistyle-trained acoustic models. In
addition, it is shown how JUD can be combined with CMLLR or semi-tied
covariance modelling.
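A rough sketch of the estimation problem may help; the symbols here
are illustrative. The noise model typically comprises the additive
noise mean and variance, μ_n and Σ_n, and a convolutional noise mean
μ_h, related to the clean speech x in the cepstral domain through a
mismatch function such as
\[
  \mathbf{y} = \mathbf{x} + \mathbf{h}
  + \mathbf{C}\ln\!\left(\mathbf{1}
  + e^{\mathbf{C}^{-1}(\mathbf{n}-\mathbf{x}-\mathbf{h})}\right)
\]
where C is the DCT matrix. EM then alternates between aligning the
noisy data against the compensated models (E-step) and updating the
noise model parameters to increase the likelihood of that data
(M-step), so no clean or stereo reference is required.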
The last main contribution is noise adaptive training using JUD
transforms, called joint adaptive training (JAT). Instead of forcing
the acoustic models to represent extraneous variability introduced by
noise in the training data, as is the case for multistyle training,
the noise effect is modelled by JUD transforms. Adaptive training with
CMLLR or normalisation updates the features and subsequently treats
cleaner observations the same as noisier ones. In contrast, during
acoustic model training, JAT directly takes into account the noise
level of observations by de-weighting them in proportion to the
uncertainty. In this way, noisier observations contribute less to the
estimation of the canonical model parameters than clean ones. The
resulting acoustic models are then purer representations of speech
variability.
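The de-weighting can be sketched as follows, again with illustrative
notation rather than the thesis's exact derivation. Writing the
transformed observation as ŷ_t = A^{(r)} y_t + b^{(r)} and the
component posterior as γ_m(t), the ML update of a canonical mean under
the M-Joint likelihood has the form
\[
  \hat{\boldsymbol{\mu}}_m =
  \Big[\sum_t \gamma_m(t)\big(\boldsymbol{\Sigma}_m
  + \boldsymbol{\Sigma}_b^{(r_t)}\big)^{-1}\Big]^{-1}
  \sum_t \gamma_m(t)\big(\boldsymbol{\Sigma}_m
  + \boldsymbol{\Sigma}_b^{(r_t)}\big)^{-1}\hat{\mathbf{y}}_t
\]
where Σ_b^{(r_t)} is the uncertainty bias of the transform active at
time t. A frame from a noisy utterance carries a large bias and hence
a small precision weight, so it contributes little to the canonical
model.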
JUD is evaluated on small, medium, and large vocabulary tasks, over a
wide range of SNRs, on both artificially corrupted databases and
actual recorded noisy speech data. The results show that JUD is a
flexible, fast, yet powerful noise robustness technique for ASR.
Keywords:
speech recognition; noise robustness; hidden Markov models;
uncertainty decoding; model-based noise compensation; adaptation;
adaptive training.