Google Research

Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech Representation and Linguistic Features

WASPAA 2023 (to appear)


Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher and apply it to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the web to studio quality. To make our SR model robust against various types of degradation, we use (i) a speech representation extracted from w2v-BERT as the input feature and (ii) linguistic features extracted from transcripts with PnG-BERT as conditioning features. Experiments show that the proposed model (i) is robust against various types of audio degradation, (ii) can restore samples in the LJSpeech dataset and improve the quality of text-to-speech (TTS) outputs without changing the model or hyper-parameters, and (iii) enables us to train a high-quality TTS model from restored speech samples collected from the web.
