Improving Prosody of RNN-based English Text-To-Speech Synthesis by Incorporating a BERT model
Abstract
The prosody of currently available speech synthesis systems can be unnatural because these systems have access only to the text, possibly enriched by linguistic information such as part-of-speech tags and parse trees. We show that incorporating a BERT model into an RNN-based speech synthesis model, where the BERT model is pretrained on large amounts of unlabeled data and fine-tuned to the speech domain, improves prosody. Additionally, we propose a way of handling arbitrarily long sequences with BERT. Our findings indicate that small BERT models work better than big ones, and that fine-tuning the BERT part of the model is pivotal for getting good results.
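The abstract only asserts that arbitrarily long sequences can be handled; one common way to do this, sketched below, is to split the token sequence into overlapping fixed-size windows and, in each overlap, keep the embedding from the window in which the token is more central. This is an illustrative sketch, not necessarily the paper's exact mechanism; `bert_encode`, `max_len`, and `stride` are hypothetical names and defaults.

```python
from typing import Callable, List

Embedding = List[float]

def encode_long_sequence(
    token_ids: List[int],
    bert_encode: Callable[[List[int]], List[Embedding]],  # hypothetical BERT forward pass
    max_len: int = 512,
    stride: int = 256,
) -> List[Embedding]:
    """Encode an arbitrarily long token sequence with a fixed-window encoder.

    `bert_encode` maps a list of at most `max_len` token ids to one
    embedding per token. The sequence is split into overlapping windows;
    within the overlap of two windows, each token keeps the embedding from
    the window in which it sits closer to the centre, so every position
    gets bidirectional context on both sides where possible.
    """
    n = len(token_ids)
    if n <= max_len:
        return bert_encode(token_ids)

    # Window start offsets; the last window is clamped so it ends exactly at n.
    starts = list(range(0, n - max_len + 1, stride))
    if starts[-1] != n - max_len:
        starts.append(n - max_len)

    embeddings: List[Embedding] = [[] for _ in range(n)]
    prev_start = None
    for start in starts:
        window_emb = bert_encode(token_ids[start : start + max_len])
        # Skip the first half of the overlap with the previous window:
        # those positions were closer to the previous window's centre.
        keep_from = 0 if prev_start is None else (prev_start + max_len - start) // 2
        for i in range(keep_from, max_len):
            embeddings[start + i] = window_emb[i]
        prev_start = start
    return embeddings
```

Because the encoder is passed in as a callable, the same windowing logic applies whether the per-token embeddings come from the final BERT layer or from an intermediate one.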