Miaosen Wang
Authored Publications
Sort By
More than Words: In-the-Wild Visually-Driven Text-to-Speech
Brendan Shillingford
Michael Eyov Hassid
Tal Remez
Ye Jia
CVPR, CVF CVPR-2022 (2022)
Preview abstract
In this paper we present VDTTS, a visual-driven TTS model. Unlike most recent text-to-speech methods which are limited by their lack of ability to generate speech with pauses, emotions, prosody and pitch, is able to do so by taking advantage of an additional silent video as an input.Our method is composed of video and text encoders that are combined via a multi-source attention layer. Speech is generated by a mel-spectrogram decoder followed by a vocoder. We evaluate our method on several challenging benchmarks including VoxCeleb2. To the best of our knowledge this is the first time such a method is trained and evaluated on in-the-wild examples that include unseen speakers.Through a rigorous evaluation we demonstrate the superior performance of our method with respect to other recent work both in terms of objective measures as well as human listening studies.
View details
Preview abstract
We present a large-scale dataset for the task of rewriting an ill-formed natural language question to a well-formed one. Our multi-domain question rewriting (MQR) dataset is constructed from human contributed Stack Exchange question edit histories. The dataset contains 427,719 question pairs which come from 303 domains. We provide human annotations for a subset of the dataset as a quality estimate. When moving from ill-formed to well-formed questions, the question quality improves by an average of 45 points across three aspects. We train sequence-to-sequence neural models on the constructed dataset and obtain an improvement of 13.2% in BLEU-4 over baseline methods built from other data resources.
View details