- RJ Skerry-Ryan
- Eric Battenberg
- Ying Xiao
- Yuxuan Wang
- Daisy Stanton
- Joel Shor
- Ron J. Weiss
- Rob Clark
- Rif A. Saurous
International Conference on Machine Learning (2018)
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the reference signal’s prosody with fine time detail. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results and audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.