EXPLORING TRADEOFFS IN MODELS FOR LOW-LATENCY SPEECH ENHANCEMENT
Abstract
We explore a variety of configurations of neural networks for one- and
two-channel spectrogram-mask-based speech enhancement. Our best model improves on
state-of-the-art performance on the CHiME2 speech enhancement task.
We examine trade-offs among non-causal lookahead, compute work, and parameter count versus enhancement performance. We find that zero-lookahead models perform, on average, only 0.5 dB worse than our best bidirectional model. Further, we find that 200 milliseconds of lookahead is sufficient to come within about 0.2 dB of our best bidirectional model.