
Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes

Will Chan
Yu Zhang
Proc. ICASSP 2019, IEEE

Abstract

We present two end-to-end models, Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words, or words as the unit for modeling text. These units are difficult to scale to languages with large vocabularies, especially in multilingual processing. In this work, we model text as a sequence of Unicode bytes. Bytes allow us to avoid large softmaxes in languages with large vocabularies and to share representations across languages in multilingual models. We show that bytes are superior to grapheme characters over a wide variety of languages in end-to-end speech recognition. We also present an end-to-end multilingual model using Unicode byte representations, which outperforms each respective single-language baseline by 4-5% relative. Finally, we present an end-to-end multilingual speech synthesis model using Unicode byte representations, which also achieves state-of-the-art performance.
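
To make the byte representation concrete, here is a minimal sketch of how text in different scripts could be mapped to a shared sequence of byte IDs. It assumes UTF-8 as the byte encoding, which the abstract does not specify. The helper names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical and are not taken from the paper. The point it illustrates is that every language uses the same 256 possible symbols, so the output layer stays small however large the character inventory is.

```python
from typing import List


def text_to_byte_ids(text: str) -> List[int]:
    """Encode text as a list of byte values in [0, 255] (UTF-8 assumed)."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: List[int]) -> str:
    """Decode byte values back to text.

    A model may emit byte sequences that are not valid UTF-8, so invalid
    bytes are replaced with U+FFFD instead of raising an error.
    """
    return bytes(ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Illustrative strings only; these are not data from the paper.
    samples = {"English": "hello", "Japanese": "こんにちは", "Korean": "안녕"}
    for language, text in samples.items():
        ids = text_to_byte_ids(text)
        # Character count and byte count differ for non-ASCII scripts.
        print(language, len(text), len(ids), ids)
```

One trade-off this sketch makes visible: scripts outside ASCII need several bytes per character under UTF-8, so byte sequences are longer than the corresponding character sequences even though the symbol inventory is much smaller.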
