DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement

Lion Jones
Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), 2021


The combination of a trainable filterbank and a mask prediction network is a strong framework for single-channel speech enhancement (SE). Because denoising performance and computational efficiency depend mainly on the structure of the mask prediction network, we aim to improve this network. In this study, motivated by the structural similarity between Conv-TasNet and the Conformer, we integrate the Conformer into SE as the mask prediction network to benefit from its powerful sequential modeling ability. To reduce computational complexity and improve local sequential modeling, we extend the Conformer with linear-complexity attention and stacked 1-D dilated depthwise convolution layers. Experimental results show that (i) linear-complexity attention avoids high computational complexity, and (ii) our model achieves a higher scale-invariant signal-to-noise ratio than the improved time-dilated convolution network (TDCN++), an extended version of Conv-TasNet.
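For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of the scale-invariant signal-to-noise ratio in its commonly used form: the estimate is projected onto the clean reference, and the energy of that projection is compared with the energy of the residual. The function name `si_snr`, the mean removal step, and the `eps` stabilizer are assumptions made for illustration; the abstract does not specify the exact evaluation code used in the paper.

```python
import math


def si_snr(estimate, reference, eps=1e-12):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Illustrative sketch only; `eps` and the mean removal are assumed
    conventions, not details taken from the paper.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean of each signal (a common convention, assumed here).
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Everything in the estimate that is not the scaled target is residual.
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    enhanced = [1.1, -0.9, 0.9, -1.1]
    print(round(si_snr(enhanced, clean), 3))
    # Rescaling the estimate leaves the value essentially unchanged.
    print(round(si_snr([2.0 * x for x in enhanced], clean), 3))
```

Because the target is obtained by projection, multiplying the estimate by any nonzero gain leaves the result essentially unchanged, which is why this metric is preferred over plain SNR when the output level of an enhancement model is arbitrary.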