Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Ying Xiao
Yuxuan Wang
Joel Shor
International Conference on Machine Learning (2018)

Abstract

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the reference signal’s prosody in fine temporal detail. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results and audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
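
The conditioning idea described in the abstract can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration rather than the paper's architecture: the function names `reference_embedding` and `condition_on_prosody`, the array sizes, the random projection, and the use of mean pooling in place of a learned reference encoder are all hypothetical. It shows only the data flow: a reference spectrogram is summarized as an embedding, which is then attached to every text-encoder state.

```python
import numpy as np

def reference_embedding(ref_mel, proj):
    """Summarize a reference spectrogram (frames x mel_bins) as one vector.

    Assumption: mean pooling plus a linear projection stands in for a
    learned reference encoder; it is not the paper's encoder.
    """
    pooled = ref_mel.mean(axis=0)      # average over time: (mel_bins,)
    return np.tanh(pooled @ proj)      # bounded embedding: (embed_dim,)

def condition_on_prosody(text_states, prosody):
    """Repeat the prosody vector for each text step and concatenate it."""
    tiled = np.tile(prosody, (text_states.shape[0], 1))  # (text_len, embed_dim)
    return np.concatenate([text_states, tiled], axis=1)

# Hypothetical sizes, chosen only for the example.
rng = np.random.default_rng(0)
ref_mel = rng.normal(size=(200, 80))       # 200 frames, 80 mel bins
text_states = rng.normal(size=(30, 256))   # 30 text steps, 256-dim states
proj = rng.normal(size=(80, 128)) * 0.1    # untrained projection weights

prosody = reference_embedding(ref_mel, proj)
conditioned = condition_on_prosody(text_states, prosody)
print(conditioned.shape)  # (30, 384)
```

In a trained system the projection would be learned jointly with the synthesizer, so that the embedding carries whatever the decoder needs to reproduce the reference prosody; here the weights are random and the output is meaningful only as a shape check.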