Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes

Bo Li; Tara Sainath; Will Chan; Yonghui Wu; Yu Zhang

Proc. ICASSP 2019, IEEE

Abstract

We present two end-to-end models, Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words, or words as the unit of choice to model text. These units are difficult to scale to languages with large vocabularies, a problem that is particularly acute in multilingual processing. In this work, we model text as a sequence of Unicode bytes. Bytes allow us to avoid large softmaxes in languages with large vocabularies and to share representations in multilingual models. We show that bytes are superior to grapheme characters over a wide variety of languages in end-to-end speech recognition. We also present an end-to-end multilingual model using Unicode byte representations, which outperforms each respective single-language baseline by 4-5% relative. Finally, we present an end-to-end multilingual speech synthesis model using Unicode byte representations, which also achieves state-of-the-art performance.
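To make the idea of modeling text as a byte sequence concrete, the following is a minimal sketch of how a transcript might be converted to and from byte targets. It assumes UTF-8 as the byte encoding (the abstract says only "unicode bytes"), and the helper names `text_to_bytes` and `bytes_to_text` are hypothetical, introduced here for illustration; this is not the authors' implementation.

```python
from typing import List

# Assumption: UTF-8 is the byte encoding. Every byte value lies in 0..255,
# so the output layer needs at most 256 classes (plus any special symbols),
# regardless of how many characters the language has.
BYTE_VOCAB_SIZE = 256


def text_to_bytes(text: str) -> List[int]:
    """Encode a transcript as a list of byte values (hypothetical helper)."""
    return list(text.encode("utf-8"))


def bytes_to_text(byte_ids: List[int]) -> str:
    """Decode byte values back to text (hypothetical helper).

    A model can emit byte sequences that are not valid UTF-8, so invalid
    sequences are replaced with U+FFFD instead of raising an error.
    """
    return bytes(byte_ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    for sample in ["hello", "こんにちは", "안녕"]:
        ids = text_to_bytes(sample)
        # Round trip: decoding the bytes recovers the original text.
        assert bytes_to_text(ids) == sample
        print(f"{sample!r}: {len(sample)} characters -> {len(ids)} bytes")
```

Under this assumption, the same 256-value inventory covers every script, which is what lets a multilingual model share one output representation. The trade-off is longer target sequences for scripts whose characters occupy several bytes each.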
