WARP-Q: Quality Prediction For Generative Neural Speech Codecs

Andrew Hines
Michael Chinen
Wissam Jassim
ICASSP 2021 (2021)


Speech coding has been shown to achieve good speech quality using either waveform matching or parametric reconstruction. For very low bitrate streams, recently developed generative speech models can reconstruct high-quality wideband speech from the bit streams of standard parametric encoders at less than 3 kb/s. Generative codecs produce high-quality coded speech by synthesising it with a DNN driven by the parametric input. Existing objective speech quality models (e.g. ViSQOL, POLQA) cannot accurately evaluate the quality of generatively coded speech, because they penalise it for signal differences that are not apparent in subjective listening test results. This paper presents WARP-Q, a full-reference objective speech quality metric that uses the dynamic time warping cost between MFCC representations of the reference and coded signals. It is robust to the codec changes introduced by low-bitrate neural vocoders. Evaluation using waveform matching, parametric and generative neural vocoder based codecs, as well as channel and environmental noise, shows that WARP-Q has better correlation and codec quality ranking for novel codecs than traditional metrics, and that it shows versatility and potential for additive noise and channel degradations.
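The core idea of scoring a coded signal by its dynamic time warping (DTW) cost against a reference can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the MFCC frames have already been computed elsewhere and are passed in as lists of numbers, it uses plain Euclidean frame distance, and it runs classic full-sequence DTW. The names `frame_distance` and `dtw_cost` are hypothetical and chosen only for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(ref, deg):
    """Accumulated cost of the cheapest DTW alignment of two frame sequences.

    ref, deg: lists of frames (assumed here to be precomputed MFCC vectors)
    for the reference and the degraded (coded) signal.
    """
    n, m = len(ref), len(deg)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # acc[i][j] holds the minimal cost of aligning ref[:i] with deg[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(ref[i - 1], deg[j - 1])
            # Allowed steps: advance ref only, deg only, or both together.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy two-dimensional "frames" standing in for MFCC vectors.
reference = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
stretched = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [2.0, 2.0]]
distorted = [[0.0, 0.0], [1.0, 2.0], [2.0, 2.0]]
```

In this toy setting a purely time-stretched copy of the reference aligns at zero cost, while a change in the frame content raises the cost. That is the property that makes a warping cost tolerant of small timing differences between the reference and the coded signal while still responding to differences in the features themselves.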
