Experiments with training corpora for statistical text-to-speech systems
Abstract
Common text-to-speech (TTS) systems rely on training data for modelling human speech. The quality of this data can range from professional voice actors recording hand-curated sentences in high-quality studio conditions, to found voice data representing arbitrary domains. For years, the unit selection technology dominant in the field required many hours of data that was expensive and time-consuming to collect. With the advancement of statistical methods of waveform generation, there have been experiments with noisier and often much larger datasets (“big data”), testing the inherent flexibility of such systems.
In this paper we examine the relationship between training data and speech synthesis quality. We hypothesise that statistical text-to-speech benefits from corpora of high acoustic quality with a high level of prosodic variation, but that beyond the first few hours of training data no further quality gains are observed. We then describe how we engineered a training dataset with an optimized distribution of features, and how these features were defined. Lastly, we present results from a series of evaluation tests. These confirm our hypothesis and show that a smaller, carefully engineered training corpus yields the same speech quality as much larger datasets, particularly for voices that use WaveNet.
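The corpus-engineering idea summarised above can be pictured as a coverage-driven selection problem. The sketch below is a minimal illustration rather than the procedure used in this work: it assumes each candidate sentence is annotated with a duration and a set of feature labels (e.g. diphones or prosodic categories, both hypothetical placeholders here) and greedily selects sentences that add the most unseen features per second until a duration budget is reached.

```python
# Illustrative sketch only (assumed setup, not this paper's pipeline):
# greedily build a compact training corpus that maximises feature
# coverage per second of recorded speech.

from typing import List, Set, Tuple

Candidate = Tuple[str, float, Set[str]]  # (sentence_id, duration_seconds, feature_labels)


def greedy_corpus_selection(candidates: List[Candidate], target_hours: float) -> List[str]:
    """Select sentences whose features add the most new coverage per second,
    stopping once the duration budget for the engineered corpus is reached."""
    budget = target_hours * 3600.0
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = list(candidates)
    total_duration = 0.0

    while remaining and total_duration < budget:
        # Score each remaining sentence by new features gained per second.
        def gain(item: Candidate) -> float:
            _, duration, features = item
            return len(features - covered) / max(duration, 1e-6)

        best = max(remaining, key=gain)
        if gain(best) == 0.0:
            break  # nothing new left to cover
        sentence_id, duration, features = best
        selected.append(sentence_id)
        covered |= features
        total_duration += duration
        remaining.remove(best)

    return selected


# Example usage with toy data:
# corpus = greedy_corpus_selection(
#     [("s1", 3.2, {"d_ae_t", "falling_intonation"}),
#      ("s2", 2.8, {"d_ae_t", "question_rise"})],
#     target_hours=5.0,
# )
```

The per-second normalisation in the scoring step reflects the trade-off implied by the abstract: prosodic and phonetic variety is what matters, so longer sentences are only preferred when they actually contribute proportionally more new material.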