
Improving phonetic realizations in TTS by using phoneme-aligned graphemes

Manish Kumar Sharma
Siamak Tazari
Yizhi Hong


Most text-to-speech acoustic models, such as WaveNet, Tacotron, and ClariNet, use either a phoneme sequence or a letter sequence as the foundational unit of speech. Although the letter (or grapheme) sequence more closely matches the actual runtime input of the TTS system, it often fails to represent the fine-grained and often plentiful grapheme-to-phoneme relationships of the target language. A purely phonemic input seems to perform better in practice, but it depends heavily on a scrupulous phonology and lexicon to provide the model with the phoneme sequences. This reliance poses issues, namely with quality and consistency, which can force a trade-off between quality and scalability. To overcome this, we propose using a mix of the two inputs, providing both phonemic and graphemic identities to the model. In this paper, we show that this approach can help the model learn to disambiguate some of the more subtle phonemic variations (such as the realization of reduced vowels), and that this effect improves the fidelity to the accent of the original voice talent. We present a way of generating an unbiased targeted test using phoneme spectral diffs and, using that test, show improvement over the baseline approach. Since different types of neural networks build on top of the same input feature space, we show that the improvement scales to multiple voice technologies and to several languages.
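To make the idea of phoneme-aligned graphemes concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a grapheme-to-phoneme alignment is already available as (phoneme, grapheme) pairs; the alignments, the helper names `build_vocab` and `encode`, and the integer-ID encoding are all hypothetical choices made for illustration. It shows how two reduced vowels that share a phoneme identity can still be distinguished by their aligned graphemes.

```python
from typing import Dict, Iterable, List, Tuple

# One word as a list of (phoneme, aligned grapheme span) pairs.
# These alignments are hand-written, hypothetical examples.
AlignedWord = List[Tuple[str, str]]


def build_vocab(symbols: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct symbol an integer ID in order of first appearance."""
    vocab: Dict[str, int] = {}
    for symbol in symbols:
        vocab.setdefault(symbol, len(vocab))
    return vocab


def encode(
    aligned: AlignedWord,
    phoneme_vocab: Dict[str, int],
    grapheme_vocab: Dict[str, int],
) -> List[Tuple[int, int]]:
    """Map each phoneme to a (phoneme ID, grapheme ID) pair.

    A model consuming this input sees both identities at every step,
    instead of the phoneme identity alone.
    """
    return [(phoneme_vocab[p], grapheme_vocab[g]) for p, g in aligned]


# Both words contain a schwa, but it is spelled with different letters.
about: AlignedWord = [("ə", "a"), ("b", "b"), ("aʊ", "ou"), ("t", "t")]
lemon: AlignedWord = [("l", "l"), ("ɛ", "e"), ("m", "m"), ("ə", "o"), ("n", "n")]

words = (about, lemon)
phoneme_vocab = build_vocab(p for word in words for p, _ in word)
grapheme_vocab = build_vocab(g for word in words for _, g in word)

about_ids = encode(about, phoneme_vocab, grapheme_vocab)
lemon_ids = encode(lemon, phoneme_vocab, grapheme_vocab)

# The schwa in "about" (index 0) and in "lemon" (index 3) share a phoneme ID
# but differ in grapheme ID.
print(about_ids[0], lemon_ids[3])
```

In this sketch, a phoneme-only input would present the two schwas identically, while the paired input gives the model a cue it could use to learn different realizations. How the two identities are actually combined inside the acoustic model (for example, via separate embeddings) is not specified in the abstract and is left open here.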