FRILL: A Non-Semantic Speech Embedding for Mobile Devices
Abstract
Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility
in mobile settings where run-time performance can be a significant bottleneck. In this work, we propose a class of lightweight speech embedding models that run efficiently on mobile devices
based on the recently proposed TRILL speech embedding. We combine novel architectural modifications with existing speedup techniques to create embedding models that are fast enough to run in real time on a mobile device and exhibit minimal performance degradation on a benchmark of non-semantic speech tasks. One such model (FRILL) is 32x faster on a Pixel 1 smartphone and 40% of the size of TRILL, with an average decrease in accuracy of only 2%. To our knowledge, FRILL is the highest-quality non-semantic embedding designed for use on mobile devices. Furthermore, we demonstrate that these representations are useful for mobile health tasks such as non-speech human sound detection and face-masked speech detection. Our training and evaluation code is publicly available.