Partial Monotonic Speech Quality Estimation in ViSQOL with Deep Lattice Networks

Andrew Hines; Jan Skoglund; Michael Chinen

Partial Monotonic Speech Quality Estimation in ViSQOL with Deep Lattice Networks

Andrew Hines

Jan Skoglund

Michael Chinen

Journal of the Acoustical Society of America, 149 (2021), pp. 3851-3861

Google Scholar

Abstract

When predicting subjective quality as mean opinion score (MOS) for speech, a raw similarity score is often mapped onto the score dimension with a mapping function. Virtual Speech Quality Objective Listener (ViSQOL) uses monotonic one-dimensional mappings to evaluate speech. More recent models such as support vector regression (SVR) or deep neural networks (DNNs) use multidimensional input, which allows for a more accurate prediction, but do not provide the monotonic property that is expected. We propose to integrate a multi-dimensional mapping function using deep lattice networks (DLNs) into ViSQOL. DLNs also provide some insight into model interpretation and are robust to overfitting, leading to better out-of-sample performance. With the DLN, ViSQOL improved the speech mapping from the previous exponential mapping's .58 MSE to .24 MSE on a mixture of datasets, outperforming the 1-D fitted functions, SVR, as well as PESQ and POLQA. Additionally, we show that the DLN can be used to learn a quantile function that is well calibrated and a useful measure of uncertainty. With this quantile function, the model is able to provide useful quantile intervals for predictions instead of point intervals.

Research Areas

Machine intelligence

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Partial Monotonic Speech Quality Estimation in ViSQOL with Deep Lattice Networks

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs