Jump to Content
Michelle Tadmor Ramanovich

Michelle Tadmor Ramanovich

Michelle received her B.Sc from Tel Aviv University in Mathematics and later her M.S from Columbia University in Computer Science. She joined Google in 2016. At Google, she has worked on speech and translation related products and published research in areas that relate to automatic speech translation and dubbing.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained endto-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text pairs, enabling a ‘cross-modal’ chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM as demonstrated through spoken QA datasets. We release our audio samples and spoken QA dataset via our website. View details
    Preview abstract We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of translation speeches are provided: 1) CVSS-C: All the translation speeches are in a single high-quality canonical voice; 2) CVSS-T: The translation speeches are in voices transferred from the corresponding source speeches. In addition, CVSS provides normalized translation text which matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state-of-the-art trained on the corpus without extra data by 5.8 BLEU. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, and with only 0.1 or 0.7 BLEU difference on ASR transcribed translation when initialized from matching ST models. View details
    More than Words: In-the-Wild Visually-Driven Text-to-Speech
    Brendan Shillingford
    Michael Eyov Hassid
    Tal Remez
    Ye Jia
    CVPR, CVF CVPR-2022 (2022)
    Preview abstract In this paper we present VDTTS, a visual-driven TTS model. Unlike most recent text-to-speech methods which are limited by their lack of ability to generate speech with pauses, emotions, prosody and pitch, is able to do so by taking advantage of an additional silent video as an input.Our method is composed of video and text encoders that are combined via a multi-source attention layer. Speech is generated by a mel-spectrogram decoder followed by a vocoder. We evaluate our method on several challenging benchmarks including VoxCeleb2. To the best of our knowledge this is the first time such a method is trained and evaluated on in-the-wild examples that include unseen speakers.Through a rigorous evaluation we demonstrate the superior performance of our method with respect to other recent work both in terms of objective measures as well as human listening studies. View details
    No Results Found