Google Research

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Abstract

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the reference signal’s prosody with fine time detail. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results and audio samples from a single-speaker and a 44-speaker Tacotron model on a prosody transfer task.
