PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
Abstract
This paper introduces a new encoder model for neural TTS. The proposed model, called PnG BERT, is augmented from the original BERT model by taking both phoneme and grapheme representations of a text, as well as the word-level alignment between them, as its input. It can be pre-trained on a large text corpus in a self-supervised manner, and then fine-tuned in a TTS task. The experimental results suggest that PnG BERT can further significantly improve the performance of a state-of-the-art neural TTS model by producing more appropriate prosody and more accurate pronunciation. Subjective side-by-side preference evaluations showed that raters had no statistically significant preference between the synthesized speech and the ground truth recordings from professional speakers.
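To make the input representation concrete, the following is a minimal sketch of how a combined phoneme-grapheme input with word-level alignment could be assembled: the phoneme and grapheme token sequences are concatenated into one sequence, segment IDs distinguish the two halves, and shared word-position IDs encode which tokens belong to the same word. All names here (e.g. build_png_bert_input) are illustrative assumptions, not the paper's implementation.

```python
def build_png_bert_input(words_phonemes, words_graphemes):
    """Assemble one PnG-BERT-style input sequence (illustrative sketch).

    words_phonemes: list of phoneme-token lists, one per word.
    words_graphemes: list of grapheme/subword-token lists, one per word.
    Returns (tokens, segment_ids, word_positions).
    """
    tokens, segments, word_pos = ["[CLS]"], [0], [0]

    # Phoneme sub-sequence (segment 0); word positions start at 1.
    for i, phonemes in enumerate(words_phonemes, start=1):
        for p in phonemes:
            tokens.append(p)
            segments.append(0)
            word_pos.append(i)
    tokens.append("[SEP]")
    segments.append(0)
    word_pos.append(0)

    # Grapheme sub-sequence (segment 1); reusing the same word-position IDs
    # is what expresses the word-level alignment between the two halves.
    for i, graphemes in enumerate(words_graphemes, start=1):
        for g in graphemes:
            tokens.append(g)
            segments.append(1)
            word_pos.append(i)
    tokens.append("[SEP]")
    segments.append(1)
    word_pos.append(0)

    return tokens, segments, word_pos


# Example: "hello world" as phonemes and subword graphemes.
tokens, segments, word_pos = build_png_bert_input(
    [["HH", "AH0", "L", "OW1"], ["W", "ER1", "L", "D"]],
    [["hello"], ["wor", "##ld"]],
)
```

In this sketch the token, segment, and word-position sequences would each be mapped to embeddings and summed, in the style of BERT's token, segment, and position embeddings.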