Phase-sensitive Joint Learning Algorithms for Deep Learning-based Speech Enhancement
Abstract
This letter presents a phase-sensitive joint learning
algorithm for single-channel speech enhancement. Although a
deep learning framework that estimates time-frequency (T-F)
domain ideal ratio masks demonstrates strong performance,
it is limited in that the enhancement process is performed
only in the magnitude domain, while the phase spectra remain
unchanged. Thus, recent studies have been conducted to involve
phase spectra in speech enhancement systems. A phase-sensitive
mask (PSM) is a T-F mask that implicitly represents phase-related
information. However, since the PSM has an unbounded
value, the networks are trained to target its truncated values
rather than directly estimating it. To effectively train the PSM,
we first approximate it to have a bounded dynamic range under
the assumption that speech and noise are uncorrelated. We then
propose a joint learning algorithm that trains the approximated
value through its parameterized variables to minimize the
inevitable error caused by the truncation process. Specifically,
we design a network that explicitly targets three parameterized
variables: speech magnitude spectra, noise magnitude spectra,
and the phase difference between the clean and noisy spectra. To further improve
the performance, we also investigate how the dynamic range
of magnitude spectra controlled by a warping function affects
the final performance in joint learning algorithms. Finally, we
examine how the proposed additional constraint that preserves
the sum of the estimated speech and noise power spectra affects
the overall system performance. The experimental results show
that the proposed learning algorithm outperforms the conventional
learning algorithm with the truncated phase-sensitive
approximation.
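As background, the PSM discussed above is commonly defined as the ratio of clean to noisy magnitude spectra scaled by the cosine of their phase difference, and the conventional approach clips it to a bounded range before training. The sketch below is a minimal NumPy illustration of that standard definition and its truncation, not the authors' implementation; the function name, the `eps` guard, and the [0, 1] clipping range are assumptions for illustration.

```python
import numpy as np

def phase_sensitive_mask(clean_stft, noisy_stft, truncate=True):
    """Compute the phase-sensitive mask PSM = |S|/|Y| * cos(theta_S - theta_Y).

    clean_stft, noisy_stft: complex STFT arrays of the same shape.
    truncate: if True, clip to [0, 1] (the conventional, lossy truncation
    that the joint learning algorithm in the letter aims to avoid).
    """
    eps = 1e-8  # guard against division by zero in silent T-F bins (assumed value)
    mag_ratio = np.abs(clean_stft) / (np.abs(noisy_stft) + eps)
    phase_diff = np.angle(clean_stft) - np.angle(noisy_stft)
    psm = mag_ratio * np.cos(phase_diff)
    if truncate:
        psm = np.clip(psm, 0.0, 1.0)
    return psm
```

Because the untruncated PSM is unbounded (the magnitude ratio can exceed 1, and the cosine term can be negative), clipping discards information; this is the "inevitable error caused by the truncation process" that motivates estimating the mask through its parameterized variables instead.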