Arun Narayanan

Authored Publications
    Preview abstract Speech enhancement (SE) is used as a frontend in speech applications including automatic speech recognition (ASR) and telecommunication. A difficulty in using the SE frontend is that the appropriate noise reduction level differs depending on applications and/or noise characteristics. In this study, we propose "signal-to-noise ratio improvement (SNRi) target training"; the SE frontend is trained to output a signal whose SNRi is controlled by an auxiliary scalar input. In joint training with a backend, the target SNRi value is estimated by an auxiliary network. By training all networks to minimize the backend task loss, we can estimate the appropriate noise reduction level for each noisy input in a data-driven scheme. Our experiments showed that the SNRi target training enables control of the output SNRi. In addition, the proposed joint training relatively reduces word error rate by 4.0% and 5.7% compared to a Conformer-based standard ASR model and conventional SE-ASR joint training model, respectively. Furthermore, by analyzing the predicted target SNRi, we observed the jointly trained network automatically controls the target SNRi according to noise characteristics. Audio demos are available on our demo page. View details
    Preview abstract Previous research on deliberation networks has achieved excellent recognition quality. The attention decoder based deliberation model often works as a rescorer to improve first-pass recognition results, and often requires the full first-pass hypothesis for second-pass deliberation. In this work, we propose a streaming transducer-based deliberation model. The joint network of a transducer decoder often consists of inputs from the encoder and the prediction network. We propose to use attention to the first-pass text hypotheses as the third input to the joint network. The proposed transducer-based deliberation model naturally streams, making it more desirable for on-device applications. We also show that the model improves rare word recognition, with relative WER reductions ranging from 3.6% to 10.4% for a variety of test sets. Our model does not use any additional text data for training. View details
    Preview abstract Recent work has designed methods to demonstrate that model updates in ASR training can leak potentially sensitive attributes of the utterances used in computing the updates. In this work, we design the first method to demonstrate information leakage about training data from trained ASR models. We design Noise Masking, a fill-in-the-blank style method for extracting targeted parts of training data from trained ASR models. We demonstrate the success of Noise Masking by using it in four settings for extracting names from the LibriSpeech dataset used for training a state-of-the-art Conformer model. In particular, we show that we are able to extract the correct names from masked training utterances with 11.8% accuracy, while the model outputs some name from the train set 55.2% of the time. Further, we show that even in a setting that uses synthetic audio and partial transcripts from the test set, our method achieves 2.5% correct name accuracy (47.7% any name success rate). Lastly, we design Word Dropout, a data augmentation method that we show when used in training along with Multistyle TRaining (MTR), provides comparable utility as the baseline, along with significantly mitigating extraction via Noise Masking across the four evaluated settings. View details
    Preview abstract End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct model states, such models are decoded using an approximate beam-search which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model’s accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve decoding efficiency by removing redundant paths from the active beam, and instead retaining them in the final lattice. This path-merging scheme can also be applied when decoding the baseline full-context model through an approximation. Overall, we find that the proposed path-merging scheme is extremely effective, allowing us to improve oracle WERs by up to 36% over the baseline, while simultaneously reducing the number of model evaluations by up to 5.3% without any degradation in WER, or up to 15.7% when lattice rescoring is applied. View details
    Improving Streaming ASR with Non-streaming Model Distillation on Unsupervised Data
    Chung-Cheng Chiu
    Liangliang Cao
    Min Ma
    Ruoming Pang
    Thibault Doutre
    Wei Han
    Yu Zhang
    Zhiyun Lu
    ICASSP 2021 (to appear)
    Preview abstract Streaming end-to-end Automatic Speech Recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Streaming models almost always perform worse than non-streaming models. We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher, generating transcripts on an arbitrary large data set, to better distill knowledge into streaming ASR models. This way, we are able to scale the training of streaming models to 3M hours of YouTube audio. Experiments show that our approach can significantly reduce the Word Error Rate (WER) of RNN-T models in four languages trained from YouTube data. View details
    Preview abstract Streaming automatic speech recognition (ASR) aims to output each hypothesized word as quickly and accurately as possible. However, reducing latency while retaining accuracy is highly challenging. Existing approaches including Early and Late Penalties and Constrained Alignment penalize emission delay by manipulating per-token or per-frame RNN-T output logits. While being successful in reducing latency, these approaches lead to significant accuracy degradation. In this work, we propose a sequence-level emission regularization technique, named FastEmit, that applies emission latency regularization directly on the transducer forward-backward probabilities. We demonstrate that FastEmit is more suitable to the sequence-level transducer training objective for streaming ASR networks. We apply FastEmit on various end-to-end (E2E) ASR networks including RNN-Transducer, Transformer-Transducer, ConvNet-Transducer and Conformer-Transducer, and achieve 150-300 ms latency reduction over previous art without accuracy degradation on a Voice Search test set. FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th percentile latency from 210 ms to only 30 ms on LibriSpeech. View details
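Several abstracts on this page report both absolute and relative WER changes. As a small illustration of how the two relate, the sketch below computes the relative reduction implied by the LibriSpeech WER pairs quoted above (4.4% to 3.1% and 8.9% to 7.5%). The function name `relative_wer_reduction` is hypothetical, and the derived percentages are plain arithmetic on the quoted numbers, not figures reported by the authors.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: (baseline - new) / baseline * 100."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# WER pairs quoted in the abstract above (baseline, with FastEmit).
first = relative_wer_reduction(4.4, 3.1)
second = relative_wer_reduction(8.9, 7.5)
print(f"{first:.1f}% and {second:.1f}% relative reduction")
```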
    Personalized Keyphrase Detection using Speaker and Environment Information
    Rajeev Vijay Rikhye
    Qiao Liang
    Yanzhang (Ryan) He
    Ding Zhao
    Yiteng (Arden) Huang
    Interspeech 2021
    Preview abstract In this paper, we introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary. The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model. To address the challenge of detecting these keyphrases under various noisy conditions, a speaker separation model is added to the feature frontend of the speaker verification model, and an adaptive noise cancellation (ANC) algorithm is included to exploit the cross-microphone noise coherence. Our experiments show that the text-independent speaker recognition model largely reduces the false triggering rate of the keyphrase detection, while the speaker separation model and adaptive noise cancellation largely reduce false rejections. View details
    Preview abstract On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. In addition, to further improve decoder latency we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller. View details
    A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
    Yanzhang (Ryan) He
    Bo Li
    Ruoming Pang
    Antoine Bruguier
    Wei Li
    Raziel Alvarez
    Chung-Cheng Chiu
    David Garcia
    Kevin Hu
    Minho Jin
    Qiao Liang
    (June) Yuan Shangguan
    Yash Sheth
    Mirkó Visontai
    Yu Zhang
    Ding Zhao
    Preview abstract Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size. View details
    Preview abstract Current state-of-the-art automatic speech recognition systems are trained to work in specific ‘domains’, defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. In this paper, we explore the idea of building a single domain-invariant model that works well for varied use-cases. We do this by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: A single model can be trained to be robust to multiple application domains, and other variations like codecs and noise. Such models also generalize better to unseen conditions and allow for rapid adaptation to new domains – we show that by using as little as 10 hours of data for adapting a domain-invariant model to a new domain, we can match performance of a domain-specific model trained from scratch using roughly 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work. View details
    Preview abstract Domain robustness is a challenging problem for automatic speech recognition (ASR). In this paper, we consider speech data collected for different applications as separate domains and investigate the robustness of acoustic models trained on multi-domain data on unseen domains. Specifically, we use Factorized Hidden Layer (FHL) as a compact low-rank representation to adapt a multi-domain ASR system to unseen domains. Experimental results on two unseen domains show that FHL is a more effective adaptation method compared to selectively fine-tuning part of the network, without dramatically increasing the model parameters. Furthermore, we found that using singular value decomposition to initialize the low-rank bases of an FHL model leads to a faster convergence and improved performance. View details
    Preview abstract Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio to semantic understanding as a sequence-to-sequence problem. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that having an intermediate text representation while jointly optimizing the full system improves accuracy of prediction. View details
    Preview abstract In this paper, we present an algorithm which introduces phase perturbation to the training database when training phase-sensitive deep neural-network models. Traditional features such as log-mel or cepstral features do not have any phase-relevant information. However, more recent features such as raw-waveform or complex spectra features contain phase-relevant information. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortions such as variations in microphone characteristics, reverberation, and so on. For traditional magnitude-based features, it is widely known that adding noise or reverberation, often called Multistyle TRaining (MTR), improves robustness. In a similar spirit, we propose an algorithm which introduces spectral distortion to make the deep-learning model more robust against phase distortion. We call this approach Spectral-Distortion TRaining (SDTR) and Phase-Distortion TRaining (PDTR). In our experiments using a training set consisting of 22 million utterances, this approach has proved to be quite successful in reducing Word Error Rates in test sets obtained with real microphones on Google Home. View details
    Preview abstract We describe the structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition. The system simulates millions of different room dimensions, a wide distribution of reverberation time and signal-to-noise ratios, and a range of microphone and sound source locations. We start with a relatively clean training set as the source and artificially create simulated data by randomly sampling a noise configuration for every new training example. As a result, the acoustic model is trained using examples that are virtually never repeated. We evaluate performance of this approach based on room simulation using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in our earlier work, which uses CFFT layers and LSTM AMs for joint multichannel processing and acoustic modeling. Results show that the simulator-driven approach is quite effective in obtaining large improvements not only in simulated test conditions, but also in real / rerecorded conditions. This room simulation system has been employed in training acoustic models including the ones for the recently released Google Home. View details
    Preview abstract In this paper, we describe how to efficiently implement an acoustic room simulator to generate large-scale simulated data for training deep neural networks. Even though Google Room Simulator in [1] was shown to be quite effective in reducing the Word Error Rates (WERs) for far-field applications by generating simulated far-field training sets, it requires a very large number of Fast Fourier Transforms (FFTs) of large size. Room Simulator in [1] used approximately 80 percent of Central Processing Unit (CPU) usage in our CPU + Graphics Processing Unit (GPU) training architecture [2]. In this work, we implement an efficient OverLap Addition (OLA) based filtering using the open-source FFTW3 library. Further, we investigate the effects of the Room Impulse Response (RIR) lengths. Experimentally, we conclude that we can cut the tail portions of RIRs whose power is less than 20 dB below the maximum power without sacrificing the speech recognition accuracy. However, we observe that cutting RIR tail more than this threshold harms the speech recognition accuracy for rerecorded test sets. Using these approaches, we were able to reduce CPU usage for the room simulator portion down to 9.69 percent in CPU/GPU training architecture. Profiling result shows that we obtain 22.4 times speed-up on a single machine and 37.3 times speed up on Google's distributed training infrastructure. View details
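The two ideas in the abstract above, overlap-add filtering and cutting the RIR tail at 20 dB below the maximum power, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper uses FFT-based filtering via FFTW3, whereas this sketch swaps in direct time-domain convolution per block to stay dependency-free; the per-sample reading of "power" and the names `truncate_rir`, `overlap_add`, and `block_len` are assumptions made for illustration.

```python
def truncate_rir(rir, threshold_db=20.0):
    """Drop the tail after the last sample whose power is within
    threshold_db of the peak sample power (assumed interpretation)."""
    peak_power = max(x * x for x in rir)
    if peak_power == 0.0:
        return list(rir)
    # Linear power floor that lies threshold_db below the peak.
    floor = peak_power * 10.0 ** (-threshold_db / 10.0)
    last = 0
    for i, x in enumerate(rir):
        if x * x >= floor:
            last = i
    return list(rir[: last + 1])


def convolve(x, h):
    """Direct linear convolution (stand-in for an FFT-based product)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


def overlap_add(x, h, block_len=4):
    """Filter x with h block by block, adding the overlapping tails."""
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block_len):
        part = convolve(x[start:start + block_len], h)
        for k, v in enumerate(part):
            y[start + k] += v
    return y


rir = [0.0, 1.0, 0.5, 0.2, 0.05, 0.01]
short_rir = truncate_rir(rir)           # keeps samples up to the 0.2 tap
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
blocked = overlap_add(signal, short_rir)
direct = convolve(signal, short_rir)    # reference: one-shot convolution
```

A shorter RIR reduces the per-block filtering cost, which is why truncation and block-wise filtering are natural to combine; an FFT-based version would replace `convolve` inside `overlap_add` with a frequency-domain product.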
    Preview abstract Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this chapter, we perform multi-channel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model. View details
    Preview abstract This paper describes the technical and system building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and Grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel sets. The combination of technical and system advances results in a reduction of WER of over 18% relative compared to the current production system. View details
    Preview abstract Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model. View details
    Preview abstract Recently, it was shown that the performance of supervised time-frequency masking based robust automatic speech recognition techniques can be improved by training them jointly with the acoustic model [1]. The system in [1], termed deep neural network based joint adaptive training, used fully-connected feed-forward deep neural networks for estimating time-frequency masks and for acoustic modeling; stacked log mel spectra was used as features and training minimized cross entropy loss. In this work, we extend such jointly trained systems in several ways. First, we use recurrent neural networks based on long short-term memory (LSTM) units — this allows the use of unstacked features, simplifying joint optimization. Next, we use a sequence discriminative training criterion for optimizing parameters. Finally, we conduct experiments on large scale data and show that joint adaptive training can provide gains over a strong baseline. Systematic evaluations on noisy voice-search data show relative improvements ranging from 2% at 15 dB to 5.4% at -5 dB over a sequence discriminative, multi-condition trained LSTM acoustic model. View details
    Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training
    DeLiang Wang
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(2015), pp. 92-101
    Preview abstract Although deep neural network (DNN) acoustic models are known to be inherently noise robust, especially with matched training and testing data, the use of speech separation as a frontend and for deriving alternative feature representations has been shown to improve performance in challenging environments. We first present a supervised speech separation system that significantly improves automatic speech recognition (ASR) performance in realistic noise conditions. The system performs separation via ratio time-frequency masking; the ideal ratio mask (IRM) is estimated using DNNs. We then propose a framework that unifies separation and acoustic modeling via joint adaptive training. Since the modules for acoustic modeling and speech separation are implemented using DNNs, unification is done by introducing additional hidden layers with fixed weights and appropriate network architecture. On the CHiME-2 medium-large vocabulary ASR task, and with log mel spectral features as input to the acoustic model, an independently trained ratio masking frontend improves word error rates by 10.9% (relative) compared to the noisy baseline. In comparison, the jointly trained system improves performance by 14.4%. We also experiment with alternative feature representations to augment the standard log mel features, like the noise and speech estimates obtained from the separation module, and the standard feature set used for IRM estimation. Our best system obtains a word error rate of 15.4% (absolute), an improvement of 4.6 percentage points over the next best result on this corpus. View details
    On training targets for supervised speech separation
    Yuxuan Wang
    DeLiang Wang
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2014), pp. 1849-1858
    Preview abstract Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking based targets, in general, are significantly better than spectral envelope based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation. View details
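To make two of the training targets named above concrete, here is a minimal sketch of the ideal binary mask (IBM) and ideal ratio mask (IRM) computed from premixed speech and noise power per time-frequency unit. The exact definitions are assumptions for illustration: the IBM thresholds the local SNR at a local criterion `lc_db`, and the IRM is the speech-to-mixture power ratio raised to an exponent `beta` (0.5 here). Function and parameter names are hypothetical.

```python
import math

EPS = 1e-12  # guards against division by zero and log of zero


def ideal_binary_mask(speech_pow, noise_pow, lc_db=0.0):
    """1 where the local SNR in dB exceeds lc_db, else 0 (per T-F unit)."""
    mask = []
    for s_row, n_row in zip(speech_pow, noise_pow):
        row = []
        for s, n in zip(s_row, n_row):
            snr_db = 10.0 * math.log10((s + EPS) / (n + EPS))
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def ideal_ratio_mask(speech_pow, noise_pow, beta=0.5):
    """(S / (S + N)) ** beta per T-F unit; a soft mask in [0, 1]."""
    return [
        [(s / (s + n + EPS)) ** beta for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_pow, noise_pow)
    ]


# Toy 2x2 power "spectrograms" (rows: time frames, columns: channels).
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[1.0, 1.0], [2.0, 1.0]]
ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
```

The binary mask makes a hard keep-or-discard decision per unit, while the ratio mask attenuates each unit in proportion to its estimated speech content.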
    Joint noise adaptive training for robust automatic speech recognition
    DeLiang Wang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE(2014), pp. 2523-2527
    Preview abstract We explore time-frequency masking to improve noise robust automatic speech recognition. Apart from its use as a frontend, we use it for providing smooth estimates of speech and noise which are then passed as additional features to a deep neural network (DNN) based acoustic model. Such a system improves performance on the Aurora-4 dataset by 10.5% (relative) compared to the previous best published results. By formulating separation as a supervised mask estimation problem, we develop a unified DNN framework that jointly improves separation and acoustic modeling. Our final system outperforms the previous best system on CHiME-2 corpus by 22.1% (relative). View details
    Analysis by synthesis feature estimation for robust automatic speech recognition using spectral masks
    Michael I Mandel
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE(2014), pp. 2528-2532
    Preview abstract Spectral masking is a promising method for noise suppression in which regions of the spectrogram that are dominated by noise are attenuated while regions dominated by speech are preserved. It is not clear, however, how best to combine spectral masking with the non-linear processing necessary to compute automatic speech recognition features. We propose an analysis-by-synthesis approach to automatic speech recognition, which, given a spectral mask, poses the estimation of mel frequency cepstral coefficients (MFCCs) of the clean speech as an optimization problem. MFCCs are found that minimize a combination of the distance from the resynthesized clean power spectrum to the regions of the noisy spectrum selected by the mask and the negative log likelihood under an unmodified large vocabulary continuous speech recognizer. In evaluations on the Aurora4 noisy speech recognition task with both ideal and estimated masks, analysis-by-synthesis decreases both word error rates and distances to clean speech as compared to traditional approaches. View details
    Computational auditory scene analysis and robust automatic speech recognition
    Ph.D. Thesis, Ohio State University(2014)
    Investigation of speech separation as a front-end for noise robust speech recognition
    DeLiang Wang
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2014), pp. 826-835
    Preview abstract We perform an in-depth evaluation of such techniques as a front-end for noise-robust automatic speech recognition (ASR). The proposed separation front-end consists of two stages. The first stage removes additive noise via time-frequency masking. The second stage addresses channel mismatch and the distortions introduced by the first stage; a non-linear function is learned that maps the masked spectral features to their clean counterpart. Results show that the proposed front-end substantially improves ASR performance when the acoustic models are trained in clean conditions. We also propose a diagonal feature discriminant linear regression (dFDLR) adaptation that can be performed on a per-utterance basis for ASR systems employing deep neural networks and HMM. Results show that dFDLR consistently improves performance in all test conditions. Surprisingly, the best average results are obtained when dFDLR is applied to models trained using noisy log-Mel spectral features from the multi-condition training set. With no channel mismatch, the best results are obtained when the proposed speech separation front-end is used along with multi-condition training using log-Mel features followed by dFDLR adaptation. Both these results are among the best on the Aurora-4 dataset. View details
    Coupling binary masking and robust ASR
    DeLiang Wang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE(2013), pp. 6817-6821
    Preview abstract We present a novel framework for performing speech separation and robust automatic speech recognition (ASR) in a unified fashion. Separation is performed by estimating the ideal binary mask (IBM), which identifies speech dominant and noise dominant units in a time-frequency (T-F) representation of the noisy signal. ASR is performed on extracted cepstral features after binary masking. Previous systems perform these steps in a sequential fashion - separation followed by recognition. The proposed framework, which we call bidirectional speech decoding (BSD), unifies these two stages. It does this by using multiple IBM estimators each of which is designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On the Aurora-4 robust ASR task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also obtains significant improvements in the quality of the estimated IBM. View details
    Ideal ratio mask estimation using deep neural networks for robust speech recognition
    DeLiang Wang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2013), pp. 7092-7096
    We propose a feature enhancement algorithm to improve robust automatic speech recognition (ASR). The algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask. The estimated IRM is used to filter out noise from a noisy Mel spectrogram before performing cepstral feature extraction for ASR. On the noisy subset of the Aurora-4 robust ASR corpus, the proposed enhancement obtains a relative improvement of over 38% in terms of word error rates using ASR models trained in clean conditions, and an improvement of over 14% when the models are trained using the multi-condition training data. In terms of instantaneous SNR estimation performance, the proposed system obtains a mean absolute error of less than 4 dB in most frequency channels.
    A direct masking approach to robust ASR
    William Hartmann
    Eric Fosler-Lussier
    DeLiang Wang
    IEEE Transactions on Audio, Speech, and Language Processing, 21 (2013), pp. 1993-2005
    Recently, much work has been devoted to the computation of binary masks for speech segregation. Conventional wisdom in the field of ASR holds that these binary masks cannot be used directly; the missing energy significantly affects the calculation of the cepstral features commonly used in ASR. We show that this commonly held belief may be a misconception; we demonstrate the effectiveness of directly using the masked data on both a small-vocabulary and a large-vocabulary dataset. In fact, this approach, which we term the direct masking approach, performs comparably to two previously proposed missing feature techniques. We also investigate the reasons why other researchers may not have come to this conclusion; variance normalization of the features is a significant factor in performance. This work suggests a much better baseline than unenhanced speech for future work in missing feature ASR.
    The role of binary mask patterns in automatic speech recognition in background noise
    DeLiang Wang
    Journal of the Acoustical Society of America, 133 (2013), pp. 3083-3093
    Processing noisy signals using the ideal binary mask improves automatic speech recognition (ASR) performance. This paper presents the first study that investigates the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes. Binary masks are computed either by comparing the SNR within a time-frequency unit of a mixture signal with a local criterion (LC), or by comparing the local target energy with the long-term average spectral energy of speech. ASR results show that (1) akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as −60 dB; (2) the ASR performance profiles are qualitatively similar to those obtained in human intelligibility experiments; (3) the difference between the LC and mixture SNR is more correlated to the recognition accuracy than LC; (4) the LC at which the performance peaks is lower than 0 dB, which is the threshold that maximizes the SNR gain of processed signals. This broad agreement with human performance is rather surprising. The results also indicate that maximizing the SNR gain is probably not an appropriate goal for improving either human or machine recognition of noisy speech.
    Computational auditory scene analysis and automatic speech recognition
    DeLiang Wang
    Techniques for Noise Robustness in Automatic Speech Recognition, John Wiley & Sons (2012), pp. 433-462
    On the role of binary mask pattern in automatic speech recognition
    DeLiang Wang
    INTERSPEECH-2012, ISCA, pp. 1239-1242
    Processing noisy signals using the ideal binary mask has been shown to improve automatic speech recognition (ASR) performance. In this paper, we present the first study that investigates the role of mask patterns in ASR under varying signal-to-noise ratios (SNR), noise conditions and mask definitions. Binary masks are typically computed either by comparing the local SNR within a time-frequency unit of a mixture signal with a threshold termed the local criterion (LC), or by comparing the local target energy with the long-term average energy of speech. Results show that: (i) akin to human speech recognition, binary masking can significantly improve ASR even when the mixture SNR is as low as -60 dB; (ii) the difference between the LC and the mixture SNR is more correlated to the recognition accuracy than LC; (iii) the performance profiles in ASR are qualitatively similar to those obtained for human speech recognition; (iv) the LC at which the peak performance is obtained is lower than 0 dB, which is the optimal threshold as far as the SNR gain of processed signals is concerned. This indicates that maximizing SNR gain may not be the optimal criterion to improve either human or machine recognition of noisy speech.
    A CASA based system for long-term SNR estimation
    DeLiang Wang
    IEEE Transactions on Audio, Speech, and Language Processing, 20 (2012), pp. 2518-2527
    We present a system for robust signal-to-noise ratio (SNR) estimation based on computational auditory scene analysis (CASA). The proposed algorithm uses an estimate of the ideal binary mask to segregate a time-frequency representation of the noisy signal into speech-dominated and noise-dominated regions. Energy within each of these regions is summed to derive the filtered global SNR. An SNR transform is introduced to convert the estimated filtered SNR to the true broadband SNR of the noisy signal. The algorithm is further extended to estimate subband SNRs. Evaluations are done using the TIMIT speech corpus and the NOISEX-92 noise database. Results indicate that both global and subband SNR estimates are superior to those of existing methods, especially in low-SNR conditions.
    On the use of ideal binary masks for improving phonetic classification
    DeLiang Wang
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2011), pp. 5212-5215
    Ideal binary masks are binary patterns that encode the masking characteristics of speech in noise. Recent evidence in speech perception suggests that such binary patterns provide sufficient information for human speech recognition. Motivated by these findings, we propose to use ideal binary masks to improve phonetic modeling. We show that by combining the outputs of classifiers trained on the traditional MFCC features and this novel speech pattern, statistically significant improvements over the baseline MFCC-based classifier can be achieved for the task of phonetic classification. Using the combined classifiers, we achieve an error rate of 19.5% on the TIMIT phonetic classification task using multilayer perceptrons as the underlying classifier.
    Robust speech recognition using multiple prior models for speech reconstruction
    Xiaojia Zhao
    DeLiang Wang
    Eric Fosler-Lussier
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2011), pp. 4800-4803
    Prior models of speech have been used in robust automatic speech recognition to enhance noisy speech. Typically, a single prior model is trained by pooling the entire training data. In this paper we propose to train multiple prior models of speech instead of a single prior model. The prior models can be trained based on distinct characteristics of speech. In this study, they are trained based on voicing characteristics. The trained prior models are then used to reconstruct noisy speech. Significant improvements are obtained on the Aurora-4 robust speech recognition task when multiple priors are used; in conjunction with an uncertainty transform technique, multiple priors yield a 13.7% absolute improvement in the average word error rate over directly recognizing noisy speech.
    Robust speech recognition from binary masks
    DeLiang Wang
    Journal of the Acoustical Society of America, 128 (2010), EL217-222
    Inspired by recent evidence that a binary pattern may provide sufficient information for human speech recognition, this letter proposes a fundamentally different approach to robust automatic speech recognition. Specifically, recognition is performed by classifying binary masks corresponding to a word utterance. The proposed method is evaluated using a subset of the TIDigits corpus to perform isolated digit recognition. Despite the dramatic reduction of speech information encoded in a binary mask, the proposed system performs surprisingly well. The system is compared with a traditional HMM-based approach and is shown to perform well under low-SNR conditions.