Michelle Tadmor Ramanovich

Michelle received her B.Sc. in Mathematics from Tel Aviv University and later her M.S. in Computer Science from Columbia University. She joined Google in 2016, where she has worked on speech- and translation-related products and published research in areas related to automatic speech translation and dubbing.
Authored Publications
    We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text data, enabling a 'cross-modal' chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM, as demonstrated through spoken QA datasets. We release our audio samples and spoken QA dataset via our website.
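
    The joint objective described above can be pictured as one decoding pass that is supervised on the transcript, the text continuation, and the continuation's spectrogram. Below is a minimal PyTorch sketch of such a loss, assuming hypothetical speech_encoder, llm_decoder, and spec_head modules; it illustrates the idea and is not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def spectron_style_loss(speech_encoder, llm_decoder, spec_head,
                            input_spec, transcript_ids, continuation_ids, target_spec):
        """Joint supervision of ASR, text continuation, and speech synthesis (sketch)."""
        # Encode the input spectrogram into a prefix for the language model.
        speech_feats = speech_encoder(input_spec)              # (B, T_in, D)

        # One decoding pass over transcript tokens followed by continuation tokens
        # ("cross-modal chain of thought"); hidden states also drive a spectrogram head.
        text_targets = torch.cat([transcript_ids, continuation_ids], dim=1)
        text_logits, hidden = llm_decoder(speech_feats, text_targets)

        asr_len = transcript_ids.size(1)
        loss_asr = F.cross_entropy(text_logits[:, :asr_len].transpose(1, 2), transcript_ids)
        loss_cont = F.cross_entropy(text_logits[:, asr_len:].transpose(1, 2), continuation_ids)

        # Predict the spectrogram of the spoken continuation from the decoder states.
        pred_spec = spec_head(hidden)                           # (B, T_out, n_mels)
        loss_synth = F.l1_loss(pred_spec, target_spec)

        return loss_asr + loss_cont + loss_synth
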
    This paper presents a novel approach to training a direct speech-to-speech translation model from monolingual datasets only, in a fully unsupervised manner. The proposed approach combines back-translation, denoising autoencoder, and unsupervised embedding mapping techniques to achieve this goal. We demonstrate the effectiveness of the proposed approach by comparing it against a cascaded baseline using two Spanish and English datasets. The proposed approach achieved a significant improvement over the cascaded baseline on the synthesized unpaired conversational and synthesized Common Voice 11 datasets.
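
    As a rough illustration of two of the unsupervised training signals named above, denoising auto-encoding and back-translation, here is a PyTorch sketch assuming a hypothetical model with a translate(speech, tgt=...) method; the embedding-mapping component and sequence-length handling are omitted, and this is not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def add_noise(spec, drop_prob=0.1):
        # Denoising-autoencoder corruption: randomly drop spectrogram frames.
        keep = (torch.rand_like(spec[:, :, :1]) > drop_prob).float()
        return spec * keep

    def unsupervised_s2st_step(model, batch_es, batch_en):
        """One training step using only monolingual Spanish and English speech (sketch)."""
        # 1) Denoising auto-encoding within each language.
        loss_dae = (F.l1_loss(model.translate(add_noise(batch_es), tgt="es"), batch_es)
                    + F.l1_loss(model.translate(add_noise(batch_en), tgt="en"), batch_en))

        # 2) Back-translation: produce pseudo-translations with the current model
        #    (no gradients), then learn to map them back to the original inputs.
        with torch.no_grad():
            pseudo_en = model.translate(batch_es, tgt="en")
            pseudo_es = model.translate(batch_en, tgt="es")
        loss_bt = (F.l1_loss(model.translate(pseudo_en, tgt="es"), batch_es)
                   + F.l1_loss(model.translate(pseudo_es, tgt="en"), batch_en))

        return loss_dae + loss_bt
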
    More than Words: In-the-Wild Visually-Driven Text-to-Speech
    Brendan Shillingford
    Michael Eyov Hassid
    Tal Remez
    Ye Jia
    CVPR 2022
    In this paper we present VDTTS, a visually-driven TTS model. Unlike most recent text-to-speech methods, which are limited by their inability to generate speech with pauses, emotion, prosody, and pitch, VDTTS is able to do so by taking advantage of an additional silent video as input. Our method is composed of video and text encoders that are combined via a multi-source attention layer. Speech is generated by a mel-spectrogram decoder followed by a vocoder. We evaluate our method on several challenging benchmarks, including VoxCeleb2. To the best of our knowledge, this is the first time such a method is trained and evaluated on in-the-wild examples that include unseen speakers. Through a rigorous evaluation we demonstrate the superior performance of our method with respect to other recent work, both in terms of objective measures and human listening studies.
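
    To illustrate the multi-source attention layer mentioned above, here is a small PyTorch sketch in which decoder states attend separately to text and video encodings and the two contexts are fused; the module names and the fusion step are assumptions, not the VDTTS implementation.

    import torch
    import torch.nn as nn

    class MultiSourceAttention(nn.Module):
        def __init__(self, d_model: int, n_heads: int = 4):
            super().__init__()
            self.attn_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.attn_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.fuse = nn.Linear(2 * d_model, d_model)

        def forward(self, decoder_state, text_enc, video_enc):
            # decoder_state: (B, T_dec, D); text_enc: (B, T_txt, D); video_enc: (B, T_vid, D)
            ctx_text, _ = self.attn_text(decoder_state, text_enc, text_enc)
            ctx_video, _ = self.attn_video(decoder_state, video_enc, video_enc)
            return self.fuse(torch.cat([ctx_text, ctx_video], dim=-1))

    # Example shapes: decoder steps attending over text tokens and silent-video frames.
    layer = MultiSourceAttention(d_model=256)
    out = layer(torch.randn(2, 10, 256), torch.randn(2, 40, 256), torch.randn(2, 75, 256))
    print(out.shape)  # torch.Size([2, 10, 256])
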
    We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of translation speech are provided: 1) CVSS-C: all the translation speech is in a single high-quality canonical voice; 2) CVSS-T: the translation speech is in voices transferred from the corresponding source speech. In addition, CVSS provides normalized translation text that matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art trained on the corpus without extra data by 5.8 BLEU. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, and is within only 0.1 or 0.7 BLEU on ASR-transcribed translation when initialized from matching ST models.
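
    The corpus construction described above, source speech paired with TTS-synthesized translation speech in either a canonical voice (CVSS-C) or a source-transferred voice (CVSS-T), could be sketched in Python as follows; the TTS functions are hypothetical placeholders, not the systems actually used to build CVSS.

    from dataclasses import dataclass

    @dataclass
    class S2STExample:
        source_audio: bytes        # source-language speech (Common Voice clip)
        translation_text: str      # English translation text (from CoVoST 2)
        translation_audio: bytes   # synthesized English speech

    def tts_canonical(text: str) -> bytes:
        # Placeholder for a high-quality single-voice TTS system (assumption).
        raise NotImplementedError

    def tts_voice_transfer(text: str, reference: bytes) -> bytes:
        # Placeholder for a TTS system that transfers the source speaker's voice (assumption).
        raise NotImplementedError

    def build_pair(source_audio: bytes, translation_text: str,
                   version: str = "CVSS-C") -> S2STExample:
        if version == "CVSS-C":   # single canonical voice
            audio = tts_canonical(translation_text)
        else:                     # CVSS-T: voice transferred from the source speech
            audio = tts_voice_transfer(translation_text, reference=source_audio)
        return S2STExample(source_audio, translation_text, audio)
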
    We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retaining the source speaker's voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker's voice, making the model more robust for production deployment by mitigating potential misuse for creating "deepfake" artifacts. When the new method is used together with a simple concatenation data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker switching.
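
    A compact PyTorch sketch of the component layout named above (speech encoder, phoneme decoder, mel-spectrogram synthesizer, and an attention module connecting them), using hypothetical sub-module interfaces rather than the actual Translatotron 2 code:

    import torch.nn as nn

    class Translatotron2Sketch(nn.Module):
        def __init__(self, speech_encoder, phoneme_decoder, synthesizer,
                     d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.speech_encoder = speech_encoder      # source speech -> hidden states
            self.phoneme_decoder = phoneme_decoder    # hidden states -> phonemes + decoder states
            self.synthesizer = synthesizer            # decoder states + context -> mel-spectrogram
            self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, source_spec):
            enc = self.speech_encoder(source_spec)               # (B, T_src, D)
            phonemes, dec_states = self.phoneme_decoder(enc)     # target phoneme sequence
            # The attention module connects the components: decoder states query the
            # encoder output, and the resulting context conditions the synthesizer.
            ctx, _ = self.attention(dec_states, enc, enc)
            mel = self.synthesizer(phonemes, dec_states + ctx)   # (B, T_tgt, n_mels)
            return phonemes, mel
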