More than Words: In-the-Wild Visually-Driven Text-to-Speech

Brendan Shillingford; Miaosen Wang; Michael Eyov Hassid; Michelle Tadmor Ramanovich; Tal Remez; Ye Jia

CVPR, CVF CVPR-2022 (2022)

Abstract

In this paper we present VDTTS, a visually-driven text-to-speech (TTS) model. Most recent text-to-speech methods are limited in their ability to generate speech with pauses, emotion, prosody and pitch; VDTTS is able to do so by taking advantage of an additional silent video as input. Our method is composed of video and text encoders that are combined via a multi-source attention layer. Speech is generated by a mel-spectrogram decoder followed by a vocoder. We evaluate our method on several challenging benchmarks, including VoxCeleb2. To the best of our knowledge, this is the first time such a method has been trained and evaluated on in-the-wild examples that include unseen speakers. Through a rigorous evaluation, we demonstrate the superior performance of our method with respect to other recent work, in terms of both objective measures and human listening studies.
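
To make the idea of combining two encoders through a multi-source attention layer more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the model described in the paper: the abstract does not specify the attention variant or how the per-source contexts are merged, so this sketch assumes scaled dot-product attention over each source and element-wise summation of the two context vectors. All names (`attend`, `multi_source_attention`, `text_states`, `video_states`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Scaled dot-product attention of one query over one source.

    Assumption: the encoder states serve as both keys and values.
    Returns the context vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * s for q, s in zip(query, state)) / scale for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
    return context, weights


def multi_source_attention(query, text_states, video_states):
    """Attend to each source separately, then merge the contexts.

    Assumption: the two contexts are merged by summation; concatenation
    would be another plausible choice.
    """
    text_ctx, text_w = attend(query, text_states)
    video_ctx, video_w = attend(query, video_states)
    combined = [t + v for t, v in zip(text_ctx, video_ctx)]
    return combined, text_w, video_w


# Toy inputs: one decoder query, three text-encoder states, two video-encoder states.
query = [1.0, 0.0]
text_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
video_states = [[0.0, 1.0], [1.0, 1.0]]
combined, text_w, video_w = multi_source_attention(query, text_states, video_states)
```

In a full system of the kind the abstract outlines, a vector like `combined` would feed a mel-spectrogram decoder at each output step, and a vocoder would turn the resulting spectrogram into a waveform; those stages are omitted here.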
