LanSER: Language-Model Supported Speech Emotion Recognition
Abstract
Speech emotion recognition (SER) studies typically rely on costly emotion-labeled speech for training, making it difficult to scale methods to large speech datasets and nuanced emotion taxonomies. We present LanSER, a method that enables the use of unlabeled data by generating weak emotion labels via pre-trained large language models, which are then used for weakly-supervised learning. For weak label generation, we utilize a textual entailment approach that selects the emotion label with the highest entailment score for a transcript extracted from speech via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and exhibit much greater label efficiency. Despite being pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech.
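The entailment-based weak labeling described in the abstract can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the hypothesis template, the three-emotion taxonomy, and the `toy_scorer` are all assumptions made for the example. In practice the scorer would be a pre-trained natural language inference model that returns an entailment probability for a (premise, hypothesis) pair.

```python
# Sketch of textual-entailment weak labeling: given an ASR transcript,
# score each candidate emotion as an entailment hypothesis and select the argmax.
from typing import Callable, Dict, List

def weak_emotion_label(
    transcript: str,
    taxonomy: List[str],
    entailment_score: Callable[[str, str], float],
    template: str = "This person feels {emotion}.",  # hypothetical template
) -> str:
    """Return the emotion whose hypothesis is most entailed by the transcript."""
    scores: Dict[str, float] = {
        emotion: entailment_score(transcript, template.format(emotion=emotion))
        for emotion in taxonomy
    }
    return max(scores, key=scores.get)

# Toy stand-in scorer based on keyword overlap, purely for demonstration;
# a real system would compute premise -> hypothesis entailment with an NLI model.
def toy_scorer(premise: str, hypothesis: str) -> float:
    emotion_word = hypothesis.rstrip(".").split()[-1]
    cues = {"happy": ["great", "love"], "sad": ["miss", "alone"], "angry": ["hate"]}
    return float(sum(premise.lower().count(w) for w in cues.get(emotion_word, [])))

print(weak_emotion_label("I love this, it's great!", ["happy", "sad", "angry"], toy_scorer))
# prints "happy"
```

The selected label can then serve as a weak target for pre-training a speech model on otherwise unlabeled audio.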