Ideal ratio mask estimation using deep neural networks for robust speech recognition

Arun Narayanan; DeLiang Wang

Ideal ratio mask estimation using deep neural networks for robust speech recognition

Arun Narayanan

DeLiang Wang

Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2013), pp. 7092-7096

Google Scholar

Abstract

We propose a feature enhancement algorithm to improve robust automatic speech recognition (ASR). The algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask. The estimated IRM is used to filter out noise from a noisy Mel spectrogram before performing cepstral feature extraction for ASR. On the noisy subset of the Aurora-4 robust ASR corpus, the proposed enhancement obtains a relative improvement of over 38% in terms of word error rates using ASR models trained in clean conditions, and an improvement of over 14% when the models are trained using the multi-condition training data. In terms of instantaneous SNR estimation performance, the proposed system obtains a mean absolute error of less than 4 dB in most frequency channels.

Research Areas

Machine intelligence

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Ideal ratio mask estimation using deep neural networks for robust speech recognition

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs