- Ye Jia
- Heiga Zen (Byungha Chun)
- Jonathan Shen
- Yu Zhang
- Yonghui Wu
Abstract
This paper introduces a new encoder model for neural TTS. The proposed model, called PnG BERT, is augmented from the original BERT model, but takes both the phoneme and grapheme representations of a text, as well as the word-level alignment between them, as its input. It can be pre-trained on a large text corpus in a self-supervised manner, and then fine-tuned on a TTS task. The experimental results suggest that PnG BERT can further improve the performance of a state-of-the-art neural TTS model by producing more appropriate prosody and more accurate pronunciation. In subjective side-by-side preference evaluations, raters showed no statistically significant preference between the synthesized speech and ground-truth recordings from professional speakers.
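The abstract describes the model's input as a combination of phoneme and grapheme representations of the same text, tied together by word-level alignment. Below is a minimal sketch of how such an input could be assembled. The tokenization scheme, vocabularies, separator token, and the use of shared word indices as the alignment signal are all illustrative assumptions, not the paper's exact specification.

```python
# Sketch: building a PnG BERT-style input from word-aligned phonemes and
# graphemes. Vocabularies and the build_png_input helper are hypothetical.

PHONEMES = {"DH": 0, "AH": 1, "K": 2, "AE": 3, "T": 4, "<sep>": 5}
GRAPHEMES = {"the": 100, "cat": 101}

def build_png_input(words):
    """words: list of (grapheme, [phonemes]) pairs, one per word."""
    tokens, word_ids = [], []
    # Phoneme segment: one token per phoneme, tagged with its word index.
    for w_idx, (_, phones) in enumerate(words):
        for p in phones:
            tokens.append(PHONEMES[p])
            word_ids.append(w_idx)
    tokens.append(PHONEMES["<sep>"])
    word_ids.append(-1)  # separator carries no word alignment
    # Grapheme segment: reuses the same word indices, so both segments
    # share a word-level position signal (the "alignment" in the abstract).
    for w_idx, (graph, _) in enumerate(words):
        tokens.append(GRAPHEMES[graph])
        word_ids.append(w_idx)
    return tokens, word_ids

tokens, word_ids = build_png_input([("the", ["DH", "AH"]),
                                    ("cat", ["K", "AE", "T"])])
print(tokens)    # phoneme ids, <sep>, grapheme ids
print(word_ids)  # shared word indices aligning the two segments
```

In this reading, a single self-attention stack can attend across both segments, with the shared word indices letting phoneme and grapheme tokens of the same word inform each other.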