Modelling Intonation in Spectrograms for Neural Vocoder based Text-to-Speech

Jonathan Shen
Hanna Silen
Speech Prosody 2020

Abstract

Intonation is characterized by rises and falls in pitch and energy. In previous work, we explicitly modelled these prosodic features using a Clockwork Hierarchical Variational Autoencoder (CHiVE), showing that we can generate multiple intonation contours for any text. However, recent advances in text-to-speech synthesis generate spectrograms, which are inverted by neural vocoders to produce waveforms. Spectrograms encode intonation in a complex way; there is no simple, explicit representation analogous to pitch (fundamental frequency) and energy. In this paper, we extend CHiVE to model intonation within a spectrogram. Compared to the original model, the spectrogram extension achieves higher mean opinion scores in subjective listening tests. We show that the intonation in the generated spectrograms matches the intonation represented by the generated pitch curves.