Google's next-generation real-time unit-selection synthesizer using sequence-to-sequence LSTM-based autoencoders

Vincent Wan; Yannis Agiomyrgiannakis; Hanna Silen; Jakub Vit

Interspeech (2017)

Abstract

A neural network model that significantly improves unit-selection-based Text-To-Speech synthesis is presented. The model employs a sequence-to-sequence LSTM-based autoencoder that compresses the acoustic and linguistic features of each unit to a fixed-size vector referred to as an embedding. Unit selection is facilitated by formulating the target cost as an L2 distance in the embedding space. In open-domain speech synthesis the method achieves a 0.2 improvement in the MOS, while for limited-domain synthesis it reaches the cap of 4.5 MOS. Furthermore, the new TTS system halves the gap between the previous unit-selection system and WaveNet in terms of quality while retaining low computational cost and latency.
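
As a rough illustration of the target cost described in the abstract, the sketch below ranks candidate units by their L2 distance to a target embedding. It is a minimal sketch under stated assumptions, not the system's implementation: the function names `target_cost` and `rank_candidates`, the unit identifiers, and the three-dimensional toy embeddings are all hypothetical, and how the embeddings are produced (the autoencoder itself) is outside its scope.

```python
import math
from typing import Dict, List, Sequence, Tuple


def target_cost(target_embedding: Sequence[float],
                unit_embedding: Sequence[float]) -> float:
    """L2 (Euclidean) distance between two embeddings of equal size."""
    if len(target_embedding) != len(unit_embedding):
        raise ValueError("embeddings must have the same dimensionality")
    return math.sqrt(sum((t - u) ** 2
                         for t, u in zip(target_embedding, unit_embedding)))


def rank_candidates(target_embedding: Sequence[float],
                    candidates: Dict[str, Sequence[float]]
                    ) -> List[Tuple[str, float]]:
    """Return (unit_id, cost) pairs sorted by increasing target cost."""
    costs = [(unit_id, target_cost(target_embedding, embedding))
             for unit_id, embedding in candidates.items()]
    return sorted(costs, key=lambda pair: pair[1])


# Hypothetical toy data: real embeddings would come from the trained model.
target = [0.0, 0.0, 0.0]
candidates = {
    "unit_a": [3.0, 4.0, 0.0],
    "unit_b": [1.0, 0.0, 0.0],
    "unit_c": [0.0, 2.0, 0.0],
}
ranked = rank_candidates(target, candidates)
```

Here `ranked` lists the candidates from lowest to highest target cost. A full unit-selection system would typically combine such a cost with other terms during search; this sketch covers only the distance computation named in the abstract.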

Research Areas

Machine intelligence
