Spectral distortion model for training phase-sensitive deep-neural networks for far-field speech recognition

Chanwoo Kim
Rajeev Nongpiur
ICASSP 2018 (2018)


In this paper, we present an algorithm which introduces phase perturbation to the training database when training phase-sensitive deep neural-network models. Traditional features such as log-mel or cepstral features do not contain any phase-relevant information. However, more recent features such as raw-waveform or complex-spectra features do contain phase-relevant information. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortion, such as variations in microphone characteristics and reverberation. For traditional magnitude-based features, it is widely known that adding noise or reverberation to the training data, often called Multistyle-TRaining (MTR), improves robustness. In a similar spirit, we propose an algorithm which introduces spectral distortion to make the deep-learning model more robust against phase distortion. We call this approach Spectral-Distortion TRaining (SDTR) and Phase-Distortion TRaining (PDTR). In our experiments using a training set consisting of 22 million utterances, this approach proved to be quite successful in reducing Word Error Rates on test sets obtained with real microphones on Google Home.
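To make the idea of phase perturbation concrete, the following is a minimal sketch and not the authors' implementation: the abstract does not specify the distortion model, so the uniform random phase offset, the `max_phase_rad` parameter, and the helper names `naive_rfft` and `perturb_phase` are all assumptions chosen for illustration. It rotates each bin of a one-sided complex spectrum by a small random angle, leaving magnitudes untouched.

```python
import cmath
import math
import random


def naive_rfft(frame):
    """One-sided DFT of a real frame (bins 0..N/2); O(N^2), for clarity only."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def perturb_phase(spectrum, max_phase_rad, rng):
    """Rotate each bin by a random angle in [-max_phase_rad, max_phase_rad].

    Hypothetical distortion model: the uniform distribution and the
    per-bin independence are illustrative assumptions, not the paper's.
    Magnitudes are unchanged; only the phase is distorted.
    """
    return [
        c * cmath.exp(1j * rng.uniform(-max_phase_rad, max_phase_rad))
        for c in spectrum
    ]


# Toy frame: a sinusoid that falls exactly on bin 3 of a 16-point DFT.
rng = random.Random(0)
frame = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
clean = naive_rfft(frame)
distorted = perturb_phase(clean, max_phase_rad=0.4, rng=rng)
```

In a training pipeline, a perturbation of this kind would presumably be drawn afresh for each utterance, in the same way MTR draws a new noise or reverberation configuration, so that the model sees many phase responses for the same underlying speech.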