Improving Speech Recognition using GAN-based Speech Synthesis and Contrastive Unspoken Text Selection

Zhehuai Chen; Andrew Rosenberg; Yu Zhang; Gary Wang; Bhuvana Ramabhadran; Pedro Jose Moreno Mengibar

Interspeech 2020

Abstract

Text-to-Speech synthesis (TTS) based data augmentation is a relatively new mechanism for utilizing text-only data to improve automatic speech recognition (ASR) training without parameter or inference architecture changes. However, efforts to train speech recognition systems on synthesized utterances suffer from the limited acoustic diversity of TTS outputs. Additionally, the text-only corpus is typically several orders of magnitude larger than the transcribed speech corpus, which makes speech synthesis of all the text data impractical. In this work, we propose to combine a generative adversarial network (GAN) with multi-style training (MTR) to increase acoustic diversity in the synthesized data. We also present a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text. We demonstrate the ability of our proposed method to enable efficient, large-scale unspoken text learning while achieving a 32.7% relative WER reduction on a voice-search task.
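The abstract does not spell out how the contrastive language model-based selection scores text, so the following is only a minimal sketch of one common way such a selection could work: rank each unspoken sentence by the per-word log-probability difference between a target-domain language model and a background language model, then keep the highest-scoring sentences up to a synthesis budget. The add-one-smoothed unigram models, the function names (`train_unigram`, `contrastive_score`, `select_text`), the `budget` parameter, and the toy sentences are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return a function giving add-one-smoothed unigram log-probabilities.

    A stand-in for a real language model (assumption, not the paper's model).
    """
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))

    return logprob


def contrastive_score(sentence, target_lm, background_lm):
    """Average per-word log-probability gap between the two models."""
    words = sentence.split()
    if not words:
        return float("-inf")
    return sum(target_lm(w) - background_lm(w) for w in words) / len(words)


def select_text(candidates, target_lm, background_lm, budget):
    """Keep the `budget` sentences that look most target-like."""
    ranked = sorted(
        candidates,
        key=lambda s: contrastive_score(s, target_lm, background_lm),
        reverse=True,
    )
    return ranked[:budget]


# Toy corpora (hypothetical): voice-search-like text vs. general text.
target_corpus = [
    "weather in boston today",
    "directions to the nearest coffee shop",
    "weather forecast tomorrow",
]
background_corpus = [
    "the committee approved the annual budget",
    "the report was published in the journal",
    "the weather was discussed by the committee",
]
candidates = [
    "weather in new york",
    "the committee published the report",
    "coffee shop near me",
]

target_lm = train_unigram(target_corpus)
background_lm = train_unigram(background_corpus)
selected = select_text(candidates, target_lm, background_lm, budget=2)
print(selected)
```

In this toy setup, only the selected sentences would be passed on to speech synthesis, which is how a selection step can keep the amount of synthesized audio manageable when the text corpus is far larger than the transcribed speech.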
