StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes
Abstract
Recently, WaveNet has become a popular choice of neural network to synthesize speech audio. Autoregressive WaveNet is capable of producing high-fidelity audio, but is too slow for real-time synthesis. As a remedy, Parallel WaveNet was proposed, which can produce audio faster than real time through distillation of an autoregressive teacher into a feedforward student network. A shortcoming of this approach, however, is that a large amount of recorded speech data is required to produce high-quality student models, and this data is not always available. In this paper, we propose StrawNet: a self-training approach to train a Parallel WaveNet. Self-training is performed using the synthetic examples generated by the autoregressive WaveNet teacher. We show that, in low-data regimes, training on high-fidelity synthetic data from an autoregressive teacher model is superior to training the student model on (much fewer) examples of recorded speech. We compare StrawNet to a baseline Parallel WaveNet, using both side-by-side tests and Mean Opinion Score evaluations. To our knowledge, synthetic speech has not been used to train neural text-to-speech before.