Improving Speech Recognition Using Consistent Predictions on Synthesized Speech
Abstract
Speech synthesis has advanced to the point of being nearly indistinguishable from human speech. However, efforts to train speech recognition systems on synthesized utterances have so far failed to show that synthesized data can effectively augment or replace human speech.
In this work, we demonstrate that promoting consistent predictions on real and synthesized speech significantly improves speech recognition performance.
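A minimal sketch of this idea, where the divergence $D$ and weight $\lambda$ are illustrative assumptions rather than the exact training objective: the standard ASR loss on real audio is combined with a penalty on disagreement between the model's predictions on a real utterance and on a synthesized rendering of the same transcript,
\[
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{ASR}}\big(x_{\mathrm{real}}, y; \theta\big) + \lambda\, D\big(p_\theta(\cdot \mid x_{\mathrm{real}}) \,\|\, p_\theta(\cdot \mid x_{\mathrm{synth}})\big),
\]
where $x_{\mathrm{synth}}$ is TTS audio generated from the transcript $y$ of the real utterance $x_{\mathrm{real}}$.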
We also find that a system trained on 460 hours of LibriSpeech audio, augmented with the transcripts of an additional 500 hours (with no corresponding audio), achieves a WER within 0.2\% of a system trained on the full 960 hours of transcribed audio. This suggests that, when sufficient text is available, this approach can cut reliance on transcribed audio nearly in half.