Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech Representation and Linguistic Features

Yuma Koizumi

Heiga Zen

Shigeki Karita

Yifan Ding

Kohei Yatabe

Nobuyuki Morioka

Yu Zhang

Wei Han

Ankur Bapna

Michiel Adriaan Unico Bacchiani

WASPAA 2023 (2023) (to appear)

Download Google Scholar

Abstract

Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the web to studio-quality. To make our SR model robust against various degradation, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) linguistic features extracted from transcripts and PnG-BERT for conditioning features. Experiments show that the proposed model (i) is robust against various audio degradation, (ii) can restore samples in the LJspeech dataset and improves the quality of text-to-speech (TTS) outputs without changing the model and hyper-parameters, and (iii) enable us to train a high-quality TTS model from restored speech samples collected from the web.

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech Representation and Linguistic Features

Abstract

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs