Google Research

Training Keyword Spotters with Limited and Synthesized Speech Data

International Conference on Acoustics, Speech, and Signal Processing, IEEE, Barcelona, Spain (2020)

Abstract

With the rise of low-power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts of the model creation process is obtaining a sufficient amount of high-quality training data. In this paper, we explore the effectiveness of synthesized speech data in training small spoken term detection models. Instead of training such models directly on the audio or on low-level features such as MFCCs, we use a small speech embedding model trained to extract useful features for keyword spotting models. Using this embedding, we show that a model for detecting 10 keywords, when trained only on synthetic speech, is equivalent to a model trained on over 50 real examples, and to a model trained on 4000 real examples without the speech embeddings.
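The setup described in the abstract, a small keyword classifier trained on top of fixed speech embeddings, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's architecture or data: the embedding model is replaced by random toy vectors clustered around one center per keyword, the classifier is a plain softmax layer, and the names `train_keyword_head` and `predict` are hypothetical.

```python
import numpy as np


def train_keyword_head(embeddings, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax classifier on fixed embedding vectors with gradient descent."""
    n, d = embeddings.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = embeddings @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (embeddings.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def predict(embeddings, weights, bias):
    """Return the index of the highest-scoring keyword for each embedding."""
    return np.argmax(embeddings @ weights + bias, axis=1)


# Toy stand-in for embeddings of synthesized speech (assumption: not real data).
rng = np.random.default_rng(0)
num_keywords, dim, per_keyword = 10, 32, 20
centers = rng.normal(size=(num_keywords, dim))
synthetic_labels = np.repeat(np.arange(num_keywords), per_keyword)
noise = 0.1 * rng.normal(size=(num_keywords * per_keyword, dim))
synthetic_embeddings = centers[synthetic_labels] + noise

weights, bias = train_keyword_head(synthetic_embeddings, synthetic_labels, num_keywords)
```

In a real pipeline the toy vectors would be replaced by the output of a pretrained speech embedding model applied to synthesized utterances, and the head would be evaluated on real recordings rather than on its own training set.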
