- Heiga Zen
- Yannis Agiomyrgiannakis
- Niels Egberts
- Fergus Henderson
- Przemysław Szczepaniak
Abstract
Acoustic models based on long short-term memory recurrent neural network (LSTM-RNN) were applied to statistical parametric speech synthesis (SPSS) and showed significant improvements in naturalness and latency over those based on hidden Markov models (HMMs). This paper describes further optimizations of LSTM-RNN-based SPSS to deploy it to mobile devices; weight quantization, multi-frame inference, and robust inference using an ε-contaminated Gaussian loss function. Experimental results in subjective listening tests show that these optimizations can make LSTM-RNN-based SPSS comparable to HMM-based SPSS in runtime speed while maintaining naturalness. Evaluations between LSTM-RNN-based SPSS and HMM-driven unit selection speech synthesis are also presented.
Research Areas
Learn more about how we do research
We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work