Improving Prosody of RNN-based English Text-To-Speech Synthesis by Incorporating a BERT model

Tom Kenter; Manish Kumar Sharma; Rob Clark

INTERSPEECH 2020

Abstract

The prosody of currently available speech synthesis systems can be unnatural due to the systems only having access to the text, possibly enriched by linguistic information such as part-of-speech tags and parse trees. We show that incorporating a BERT model in an RNN-based speech synthesis model, where the BERT model is pretrained on large amounts of unlabeled data and fine-tuned to the speech domain, improves prosody. Additionally, we propose a way of handling arbitrarily long sequences with BERT. Our findings indicate that small BERT models work better than big ones, and that fine-tuning the BERT part of the model is pivotal for getting good results.

Research Areas

Natural language processing
