Universal Paralinguistic Speech Representations Using Self-Supervised Conformers

Daniel S. Park
Joel Shor
Wei Han
Yu Zhang
ICASSP 2022 (2022)


Many speech applications require understanding aspects other than content, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. Generally useful paralinguistic speech representations offer one solution to these kinds of problems. In this work, we introduce a new state-of-the-art paralinguistic speech representation based on self-supervised training of a 600M+ parameter Conformer-based architecture. Linear classifiers trained on top of our best representation outperform previous results on 7 of the 8 tasks we evaluate. Our comparison is larger than previous ones in both the number of embeddings compared and the number of downstream datasets evaluated. Our analysis of the role of time demonstrates the importance of context window size for many downstream tasks. Furthermore, while the best-performing representation is extracted from an internal layer of the network, we demonstrate stable high performance across several layers, allowing a single universal representation to reach near-optimal performance on all tasks.
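
To make the evaluation protocol concrete, the following is a minimal sketch of a linear probe: a linear classifier trained on top of fixed embeddings, with the encoder itself left untouched. It is an illustration under stated assumptions, not the setup used in the paper. The frame embeddings are synthetic stand-ins generated by a hypothetical `make_clip` helper, averaging frames over time (`mean_pool`) is an assumed pooling choice, and the task is reduced to a binary label trained with plain gradient descent.

```python
import math
import random


def mean_pool(frames):
    """Average frame-level embeddings (equal-length vectors) over time."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression on fixed features by batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the predicted class (0 or 1) for one pooled embedding."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def make_clip(label, rng, num_frames=20, dim=8):
    """Hypothetical stand-in for frame embeddings from a frozen encoder.

    Synthetic data only: each class is a Gaussian cloud around +1 or -1.
    """
    shift = 1.0 if label == 1 else -1.0
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(num_frames)]


rng = random.Random(0)
train_labels = [i % 2 for i in range(100)]
test_labels = [i % 2 for i in range(40)]
train_x = [mean_pool(make_clip(y, rng)) for y in train_labels]
test_x = [mean_pool(make_clip(y, rng)) for y in test_labels]

# Only the probe's weights are learned; the embeddings never change.
w, b = train_linear_probe(train_x, train_labels)
correct = sum(predict(w, b, x) == y for x, y in zip(test_x, test_labels))
accuracy = correct / len(test_labels)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the probe has so little capacity, its held-out accuracy mostly reflects how linearly separable the task is in the embedding space, which is why this kind of classifier is a common way to compare representations. Varying `num_frames` in the sketch is a rough analogue of changing the context window over which embeddings are pooled.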
