Directly Modeling Voiced and Unvoiced Components in Speech Waveforms by Neural Networks

Keiichi Tokuda
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE(2016), pp. 5640-5644


This paper proposes a novel acoustic model based on neural networks for statistical parametric speech synthesis. The neural network outputs parameters of a non-zero mean Gaussian process, which defines a probability density function of a speech waveform given linguistic features. The mean and covariance functions of the Gaussian process represent deterministic (voiced) and stochastic (unvoiced) components of a speech waveform, whereas the previous approach considered the unvoiced component only. Experimental results show that the proposed approach can generate speech waveforms approximating natural speech waveforms.

Research Areas