LanSER: Language-Model Supported Speech Emotion Recognition

Taesik Gong
Josh Belanich
Brian Eoff
INTERSPEECH 2023 (to appear)


Speech emotion recognition (SER) studies typically rely on costly emotion-labeled speech for training, which makes it difficult to scale methods to large speech datasets and nuanced emotion taxonomies. We present LanSER, which enables the use of unlabeled data by generating weak emotion labels via pre-trained large language models; these labels are then used for weakly-supervised learning. For weak label generation, we use a textual entailment approach that selects the emotion label with the highest entailment score, given a transcript extracted from speech via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and exhibit much greater label efficiency. Although the models are pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech.
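
The weak label generation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `weak_emotion_label`, the hypothesis template "The speaker feels {}.", and the keyword-based `toy_entailment_score` are all assumptions made for illustration. In practice the scorer would be replaced by a pre-trained entailment model, and the transcript would come from an automatic speech recognition system.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical cue words used only by the toy scorer below.
_TOY_CUES: Dict[str, set] = {
    "happy": {"great", "love", "wonderful"},
    "sad": {"miss", "lost", "cry"},
    "angry": {"hate", "furious", "unfair"},
}


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an entailment model (assumption, not a real NLI model).

    Returns the fraction of cue words for the emotion named in the
    hypothesis that also occur in the premise.
    """
    cleaned = premise.lower()
    for ch in ".,!?":
        cleaned = cleaned.replace(ch, " ")
    words = set(cleaned.split())
    for emotion, cues in _TOY_CUES.items():
        if emotion in hypothesis.lower():
            return len(words & cues) / len(cues)
    return 0.0


def weak_emotion_label(
    transcript: str,
    emotions: Sequence[str],
    entailment_score: Callable[[str, str], float],
    template: str = "The speaker feels {}.",  # assumed prompt, not from the paper
) -> Tuple[str, Dict[str, float]]:
    """Pick the emotion whose hypothesis is most entailed by the transcript."""
    scores = {
        emotion: entailment_score(transcript, template.format(emotion))
        for emotion in emotions
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    label, scores = weak_emotion_label(
        "I love this, it is wonderful!",
        ["happy", "sad", "angry"],
        toy_entailment_score,
    )
    print(label, scores)
```

Passing the scorer as a callable keeps the selection logic independent of any particular language model, so the emotion taxonomy can be changed simply by passing a different list of labels.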