Google Research

Partial Monotonic Speech Quality Estimation in ViSQOL with Deep Lattice Networks

Journal of the Acoustical Society of America, vol. 149 (2021), pp. 3851-3861


When predicting subjective speech quality as a mean opinion score (MOS), a raw similarity score is often mapped onto the score dimension with a mapping function. Virtual Speech Quality Objective Listener (ViSQOL) uses monotonic one-dimensional mappings to evaluate speech. More recent models such as support vector regression (SVR) or deep neural networks (DNNs) use multidimensional input, which allows for a more accurate prediction, but they do not guarantee the monotonic property that is expected. We propose to integrate a multidimensional mapping function using deep lattice networks (DLNs) into ViSQOL. DLNs also provide some insight into model interpretation and are robust to overfitting, leading to better out-of-sample performance. With the DLN, ViSQOL improved the speech mapping from the previous exponential mapping's 0.58 MSE to 0.24 MSE on a mixture of datasets, outperforming the 1-D fitted functions, SVR, as well as PESQ and POLQA. Additionally, we show that the DLN can be used to learn a quantile function that is well calibrated and a useful measure of uncertainty. With this quantile function, the model is able to provide useful quantile intervals for predictions instead of point estimates.
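
To make the terms in the abstract concrete, the following is a minimal Python sketch of three ideas it relies on: a monotonic 1-D mapping from a similarity score to a MOS-like value, the mean squared error (MSE) used to compare mappings, and the pinball loss commonly used to train quantile predictors. It is not the DLN from the paper and not ViSQOL's code. The function names, the exponential form a + b * exp(c * x), and the parameter values are all assumptions chosen for illustration.

```python
import math


def exponential_mapping(similarity, a=1.0, b=0.5, c=2.0):
    """Hypothetical 1-D map from a similarity score in [0, 1] to a MOS-like value.

    The form a + b * exp(c * x) is increasing whenever b > 0 and c > 0.
    The parameter values are illustrative, not fitted ViSQOL values.
    """
    return a + b * math.exp(c * similarity)


def is_monotonic_increasing(fn, lo=0.0, hi=1.0, steps=100):
    """Check on a grid that fn never decreases between lo and hi."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [fn(x) for x in xs]
    return all(y1 <= y2 for y1, y2 in zip(ys, ys[1:]))


def mean_squared_error(targets, predictions):
    """MSE between subjective scores and predicted scores."""
    errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(errors) / len(errors)


def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for a quantile level q in (0, 1).

    Under-prediction is weighted by q and over-prediction by 1 - q,
    so minimising it pushes y_pred towards the q-th quantile.
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)
```

A monotonic mapping guarantees that a higher similarity never yields a lower predicted score, which is the property the abstract says SVR and DNN mappings do not provide. Predicting, for example, the 0.05 and 0.95 quantiles with a pinball-style loss gives an interval around each prediction rather than a single point estimate.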
