Speaker Generation

Daisy Stanton; David Teh-Hwa Kao; Eric Battenberg; Matt Shannon; RJ Skerry-Ryan; Soroosh Mariooryad; Tom Bagby

ICASSP (2022)

Abstract

This work explores the task of synthesizing speech in human-sounding voices unseen in any training set. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a deep generative text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity.
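The core idea of sampling novel speakers from a distribution over a speaker embedding space can be illustrated with a toy sketch. It is not the TacoSpawn implementation: the abstract does not specify the form of the learned distribution, so the diagonal Gaussian here is an assumption, as are the placeholder random embeddings and the helper names `fit_diagonal_gaussian` and `sample_speakers`. In a real system the sampled vectors would condition a text-to-speech model, which is not shown.

```python
import numpy as np


def fit_diagonal_gaussian(embeddings):
    """Fit a diagonal Gaussian to a (num_speakers, dim) array of embeddings.

    Assumption: a single diagonal Gaussian stands in for whatever
    distribution the real model learns over the embedding space.
    """
    mean = embeddings.mean(axis=0)
    std = embeddings.std(axis=0) + 1e-6  # small floor keeps std strictly positive
    return mean, std


def sample_speakers(mean, std, num_samples, rng):
    """Draw new speaker embeddings from the fitted distribution."""
    noise = rng.standard_normal((num_samples, mean.shape[0]))
    return mean + std * noise


rng = np.random.default_rng(0)

# Placeholder data: 100 hypothetical training speakers with 16-dim embeddings.
train_embeddings = rng.standard_normal((100, 16))

mean, std = fit_diagonal_gaussian(train_embeddings)
new_speakers = sample_speakers(mean, std, num_samples=5, rng=rng)

print(new_speakers.shape)  # (5, 16)
```

Each row of `new_speakers` is an embedding that does not correspond to any training speaker but is drawn from the same region of the space, which is the property the task of speaker generation relies on.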

Research Areas

Machine intelligence
