Adaptive Training with Joint Uncertainty Decoding for Robust Recognition of Noisy Data
Abstract
Standard noise compensation techniques for automatic speech recognition assume a clean-trained acoustic model. Yet what is thought of as "clean" data may still contain a variety of speakers, different channels and varying noise conditions, so it may be more reasonable to treat such data as multi-condition and suited to multistyle training. This paper shows that multistyle models benefit from VTS compensation or joint uncertainty decoding, which reduce the mismatch between training and test conditions. An EM-based noise estimation procedure that produces ML VTS or joint noise models is also described. Alternatively, adaptive training with joint uncertainty transforms factors the noise out of the data. The uncertainty variance bias de-weights observations in the training data where the SNR is low. This property allows data with a wide SNR range to be used and produces canonical models that truly represent clean speech, whereas multistyle-trained models must account for all the acoustic variation associated with different noise conditions. This paper presents joint adaptive training, including formulae for estimating the transforms and the canonical model parameters. Experiments are conducted on the Resource Management and Broadcast News corpora.
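The de-weighting effect of the uncertainty variance bias can be illustrated with a minimal sketch. This is not code from the paper: the scalar single-Gaussian form, the function names, and the specific parameter values are illustrative assumptions. It shows how inflating a model variance by an SNR-dependent bias flattens the Gaussian, shrinking the mismatch penalty for outlying (noisy) frames so they contribute less sharply to the likelihood.

```python
import math

def log_gauss(y, mu, var):
    """Log-likelihood of a scalar observation y under N(mu, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mu) ** 2 / var)

def jud_log_likelihood(y, mu, var, a=1.0, b=0.0, var_bias=0.0):
    """Joint-uncertainty-decoding style likelihood for one Gaussian
    (scalar sketch): the observation is linearly transformed (a*y + b)
    and the model variance is inflated by a noise-dependent bias."""
    return math.log(abs(a)) + log_gauss(a * y + b, mu, var + var_bias)

mu, var = 0.0, 1.0

# A badly mismatched (low-SNR) frame: with a large variance bias the
# penalty (y - mu)^2 / (var + var_bias) shrinks, so the frame is
# effectively de-weighted rather than dominating the statistics.
mismatch_plain = jud_log_likelihood(5.0, mu, var, var_bias=0.0)
mismatch_bias = jud_log_likelihood(5.0, mu, var, var_bias=9.0)

# A well-matched frame: the flattened Gaussian assigns it a lower
# peak likelihood, reflecting the reduced confidence in noisy data.
match_plain = jud_log_likelihood(0.0, mu, var, var_bias=0.0)
match_bias = jud_log_likelihood(0.0, mu, var, var_bias=9.0)
```

Here `mismatch_bias > mismatch_plain` while `match_bias < match_plain`: the bias compresses the dynamic range of the likelihood, which is the property that lets training data spanning a wide SNR range be absorbed without corrupting the canonical clean-speech models.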