AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech

Brian Patton; Yannis Agiomyrgiannakis; Michael Terry; Kevin Wilson; Rif A. Saurous; D. Sculley

NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop (to appear)

Abstract

Developers of text-to-speech (TTS) synthesizers often make use of
human raters to assess the quality of synthesized speech. We
demonstrate that we can model human raters' mean opinion scores
(MOS) of synthesized speech using a deep recurrent neural network
whose inputs consist solely of a raw waveform. Our best models
provide utterance-level estimates of MOS only moderately inferior to
sampled human ratings, as shown by Pearson and Spearman
correlations. When multiple utterances are scored and averaged,
a scenario common in synthesizer quality assessment,
we achieve correlations comparable to those of human raters.
This model has a number of applications, such as the
ability to automatically explore the parameter space of a speech
synthesizer without requiring a human-in-the-loop.
We also explore a method of probing what the models have learned.
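
The evaluation described in the abstract compares model predictions with human ratings using Pearson and Spearman correlations, first per utterance and then after averaging over multiple utterances. The following is a minimal sketch of that comparison, not the authors' evaluation code. The system labels, the scores, and the helper names (`pearson`, `average_ranks`, `spearman`, `mean_by_group`) are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_by_group(groups, scores):
    """Average the scores within each group; groups are returned sorted."""
    buckets = defaultdict(list)
    for group, score in zip(groups, scores):
        buckets[group].append(score)
    keys = sorted(buckets)
    return keys, [sum(buckets[k]) / len(buckets[k]) for k in keys]


# Hypothetical data: three synthesizers, three utterances each.
systems = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
human_mos = [4.2, 3.8, 4.5, 3.1, 3.6, 2.9, 2.2, 2.8, 2.0]
predicted_mos = [3.9, 4.1, 4.0, 3.4, 3.0, 3.3, 2.6, 2.1, 2.5]

# Utterance-level agreement between predictions and human ratings.
print("utterance Pearson: ", pearson(human_mos, predicted_mos))
print("utterance Spearman:", spearman(human_mos, predicted_mos))

# Agreement after averaging the scores of each synthesizer.
_, human_avg = mean_by_group(systems, human_mos)
_, predicted_avg = mean_by_group(systems, predicted_mos)
print("averaged Pearson:  ", pearson(human_avg, predicted_avg))
print("averaged Spearman: ", spearman(human_avg, predicted_avg))
```

Averaging reduces per-utterance noise in both the human and the predicted scores, which is one reason correlations computed on averages can be higher than utterance-level ones.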

Research Areas

Machine intelligence
