Direct speech-to-speech translation with a sequence-to-sequence model

Ye Jia; Ron J. Weiss; Fadi Biadsy; Wolfgang Macherey; Melvin Johnson; Zhifeng Chen; Yonghui Wu

Interspeech (2019)

Abstract

We present an attention-based sequence-to-sequence neural network that can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map source speech spectrograms to target spectrograms in another language that correspond to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets. Although the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, the results demonstrate the feasibility of the approach on this very challenging task.
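
As a rough illustration of the attention mechanism the abstract refers to, the sketch below computes a single attention step over a handful of encoded source frames. It is not the authors' architecture: the abstract does not specify the form of attention, so scaled dot-product scoring is assumed here, and the function names (`softmax`, `attend`) and the toy frame values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_frames):
    """One attention step (assumed scaled dot-product scoring).

    query: decoder state for the output frame being generated.
    encoder_frames: encoded source-spectrogram frames, one vector per frame.
    Returns the attention weights and the weighted average of the frames.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, frame)) / scale
        for frame in encoder_frames
    ]
    weights = softmax(scores)
    dim = len(encoder_frames[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, encoder_frames))
        for d in range(dim)
    ]
    return weights, context


# Toy encoded source: 3 frames with 2 features each (made-up values).
encoder_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
weights, context = attend(query, encoder_frames)
```

In a full sequence-to-sequence model, a decoder would repeat a step like this for each output frame, using the context vector to help predict the next target spectrogram frame.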
