Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection
Abstract
Voice Activity Detection (VAD) is an important preprocessing
step in any state-of-the-art speech recognition system.
Choosing the right set of features and model architecture can
be challenging and is an active area of research. In this paper
we propose a novel approach to VAD to tackle both feature
and model selection jointly. The proposed method is based
on a CLDNN (Convolutional, Long Short-Term Memory, Deep
Neural Networks) architecture fed directly with the raw waveform.
We show that using the raw waveform allows the neural
network to learn features directly for the task at hand, which is
more powerful than using log-mel features, specially for noisy
environments. In addition, using a CLDNN, which takes advantage
of both frequency modeling with the CNN and temporal
modeling with LSTM, is a much better model for VAD compared
to the DNN. The proposed system achieves over 78% relative
improvement in False Alarms (FA) at the operating point
of 2% False Rejects (FR) on both clean and noisy conditions
compared to a DNN of comparable size trained with log-mel
features. In addition, we study the impact of the model size
and the learned features to provide a better understanding of the
proposed architecture
step in any state-of-the-art speech recognition system.
Choosing the right set of features and model architecture can
be challenging and is an active area of research. In this paper
we propose a novel approach to VAD to tackle both feature
and model selection jointly. The proposed method is based
on a CLDNN (Convolutional, Long Short-Term Memory, Deep
Neural Networks) architecture fed directly with the raw waveform.
We show that using the raw waveform allows the neural
network to learn features directly for the task at hand, which is
more powerful than using log-mel features, specially for noisy
environments. In addition, using a CLDNN, which takes advantage
of both frequency modeling with the CNN and temporal
modeling with LSTM, is a much better model for VAD compared
to the DNN. The proposed system achieves over 78% relative
improvement in False Alarms (FA) at the operating point
of 2% False Rejects (FR) on both clean and noisy conditions
compared to a DNN of comparable size trained with log-mel
features. In addition, we study the impact of the model size
and the learned features to provide a better understanding of the
proposed architecture