More than Words: In-the-Wild Visually-Driven Text-to-Speech

Brendan Shillingford
Michael Eyov Hassid
Tal Remez
Ye Jia
CVPR, CVF CVPR-2022(2022)
Google Scholar


In this paper we present VDTTS, a visual-driven TTS model. Unlike most recent text-to-speech methods which are limited by their lack of ability to generate speech with pauses, emotions, prosody and pitch, is able to do so by taking advantage of an additional silent video as an input.Our method is composed of video and text encoders that are combined via a multi-source attention layer. Speech is generated by a mel-spectrogram decoder followed by a vocoder. We evaluate our method on several challenging benchmarks including VoxCeleb2. To the best of our knowledge this is the first time such a method is trained and evaluated on in-the-wild examples that include unseen speakers.Through a rigorous evaluation we demonstrate the superior performance of our method with respect to other recent work both in terms of objective measures as well as human listening studies.