TRILLSSON: DISTILLING UNIVERSAL PARALINGUISTIC SPEECH REPRESENTATIONS

Joel Shor
Interspeech 2022 (2022)

Abstract

Recent advances in self-supervision have dramatically improved the quality of speech representations. However, wide deployment of state-of-the-art embedding models on devices has been severely restricted by their limited public availability and large resource footprint. Our work addresses these limitations by publicly releasing a collection of paralinguistic speech models that are small and near state-of-the-art in performance. Our approach is based on knowledge distillation, and our models are distilled only on public data. We explore different architectures and thoroughly evaluate our models on the Non-Semantic Speech (NOSS) benchmark. Our largest distilled model is less than 16% the size of the original model (340MB vs. 2.2GB) and achieves over 94% of its accuracy on 6 of 7 tasks. The smallest model is less than 0.3% of the original size (22MB) and achieves over 90% of its accuracy on 6 of 7 tasks.
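
The abstract names knowledge distillation as the core approach: a small student network is trained to reproduce the embeddings of a large frozen teacher. The sketch below illustrates that idea only; the student architecture, embedding dimension, and MSE objective are illustrative assumptions, not the paper's exact recipe.

    # Minimal sketch of embedding-level knowledge distillation, assuming a
    # frozen teacher that maps audio to a fixed-size embedding and a smaller
    # trainable student. Architecture, dimensions, and loss are assumptions.
    import torch
    import torch.nn as nn

    class StudentEncoder(nn.Module):
        """Small convolutional encoder mapping raw audio to an embedding."""
        def __init__(self, embedding_dim: int = 1024):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
                nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed vector
            )
            self.proj = nn.Linear(128, embedding_dim)

        def forward(self, audio: torch.Tensor) -> torch.Tensor:
            # audio: (batch, samples) -> (batch, embedding_dim)
            x = self.features(audio.unsqueeze(1)).squeeze(-1)
            return self.proj(x)

    def distillation_step(student, teacher, audio, optimizer):
        """One training step: match the frozen teacher's embedding with MSE."""
        with torch.no_grad():
            target = teacher(audio)   # teacher embedding, no gradients
        pred = student(audio)         # student embedding
        loss = nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The appeal of this setup is that once the student matches the teacher's embeddings, any downstream classifier built on the teacher's representation can be served from the much smaller student.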
