Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Abstract
In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained in a completely unsupervised manner, and yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of surprising results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and modifying speaking style – independently of the text content. The labels can also be used for style transfer, replicating the speaking style of one “seed” phrase across an entire long-form text corpus. Perhaps most surprisingly, when trained on noisy, unlabelled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
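To make the idea of a token bank producing soft “labels” concrete, the following is a minimal illustrative sketch, not the model described in this paper. It assumes the labels are softmax weights over the bank and that the style embedding is the weighted sum of the tokens. The scaled dot-product scoring, the sizes (10 tokens of dimension 8), the random values, and the names `style_embedding`, `token_bank`, and `query` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def style_embedding(query, token_bank):
    """Combine a bank of style tokens into a single style embedding.

    query:      (d,) vector summarizing a reference utterance (assumed).
    token_bank: (num_tokens, d) matrix of token embeddings (assumed).
    Returns the soft weights over tokens and their weighted sum.
    """
    # Scaled dot-product scoring is an assumption made for this sketch.
    scores = token_bank @ query / np.sqrt(token_bank.shape[1])
    weights = softmax(scores)      # soft "labels", one per token
    return weights, weights @ token_bank


rng = np.random.default_rng(0)
token_bank = rng.standard_normal((10, 8))  # hypothetical: 10 tokens, dim 8
query = rng.standard_normal(8)             # hypothetical reference summary

weights, style = style_embedding(query, token_bank)

# Control without a reference: choose the weights by hand, e.g. one token.
one_hot = np.eye(10)[3]
manual_style = one_hot @ token_bank
```

In this sketch, style transfer would correspond to computing `weights` from one seed utterance and reusing the resulting `style` vector for every sentence of a longer text, while manual control corresponds to setting the weights directly.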