Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

RJ Skerry-Ryan; Eric Battenberg; Ying Xiao; Yuxuan Wang; Daisy Stanton; Joel Shor; Ron J. Weiss; Rob Clark; Rif A. Saurous

International Conference on Machine Learning (2018)

Abstract

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the reference signal’s prosody with fine time detail. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results and audio samples from a single-speaker and 44-speaker Tacotron model on a prosody transfer task.
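
The conditioning idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the embedding is computed or how it is fed to Tacotron, so the sketch assumes a fixed-length embedding that is copied to every text-encoder timestep and concatenated with the encoder output. The names `reference_encoder` and `condition_on_prosody`, the mean-pooling step, and all dimensions are hypothetical choices made for illustration only.

```python
import math
import random

def reference_encoder(ref_frames, proj):
    """Hypothetical stand-in for a learned reference encoder.

    Mean-pools the reference acoustic frames over time, then applies a
    linear projection followed by tanh to get a fixed-length embedding.
    """
    n_frames = len(ref_frames)
    n_feats = len(ref_frames[0])
    pooled = [sum(f[i] for f in ref_frames) / n_frames for i in range(n_feats)]
    # proj holds one weight vector (length n_feats) per embedding dimension.
    return [math.tanh(sum(p * w for p, w in zip(pooled, row))) for row in proj]

def condition_on_prosody(text_enc, prosody_emb):
    """Assumed conditioning scheme: append the same embedding to every
    text-encoder timestep."""
    return [list(step) + list(prosody_emb) for step in text_enc]

random.seed(0)
# Toy sizes (assumptions): 20 reference frames x 8 features,
# 5 text timesteps x 6 encoder dims, 4-dim prosody embedding.
ref_frames = [[random.gauss(0, 1) for _ in range(8)] for _ in range(20)]
proj = [[random.gauss(0, 0.1) for _ in range(8)] for _ in range(4)]
text_enc = [[random.gauss(0, 1) for _ in range(6)] for _ in range(5)]

prosody_emb = reference_encoder(ref_frames, proj)
conditioned = condition_on_prosody(text_enc, prosody_emb)
print(len(conditioned), len(conditioned[0]))  # 5 10
```

In this sketch the decoder would attend over `conditioned` instead of `text_enc`, so every decoding step has access to the same prosody summary of the reference signal.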

Research Areas

Machine intelligence
