Heiga Zen
Heiga Zen received his AE from Suzuka National College of Technology, Suzuka, Japan, in 1999, and PhD from the Nagoya Institute of Technology, Nagoya, Japan, in 2006. He was an Intern/Co-Op researcher at the IBM T.J. Watson Research Center, Yorktown Heights, NY (2004--2005), and a Research Engineer at Toshiba Research Europe Ltd. Cambridge Research Laboratory, Cambridge, UK (2008--2011). At Google, he was with the Speech team from July 2011 to July 2018, then joined the Brain team from August 2018. From June 2023, he is a Principal Scientist at Google DeepMind, Japan. His research interests include speech technology and machine learning. He was one of the original authors and the first maintainer of the HMM-based speech synthesis system (HTS).
Authored Publications
Sort By
Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech Representation and Linguistic Features
Yifan Ding
Kohei Yatabe
Nobuyuki Morioka
Yu Zhang
Wei Han
WASPAA 2023 (2023) (to appear)
Preview abstract
Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the web to studio-quality. To make our SR model robust against various degradation, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) linguistic features extracted from transcripts and PnG-BERT for conditioning features. Experiments show that the proposed model (i) is robust against various audio degradation, (ii) can restore samples in the LJspeech dataset and improves the quality of text-to-speech (TTS) outputs without changing the model and hyper-parameters, and (iii) enable us to train a high-quality TTS model from restored speech samples collected from the web.
View details
Twenty-Five Years of Evolution in Speech and Language Processing
Preview
Michael Picheny
Dilek Hakkani-Tur
IEEE Signal Processing Magazine, 40 (2023), pp. 27-39
Extracting Representative Subset from Massive Raw Texts for Training Pre-trained Neural Language Models
Jun Suzuki
Information Processing & Management Conference, 60 (2023) (to appear)
Preview abstract
This paper explores the research question of whether training neural language models using a
small subset of representative data selected from a large training dataset can achieve the same level of performance obtained using all the original training data. We explore the likelihood-based scoring for the purpose of obtaining representative subsets, which we call RepSet. Our experiments confirm that the representative subset obtained by a likelihood difference-based score can achieve the 90% performance level even when the dataset is reduced to about 1,000th of the original data. We also show that the performance of the random selection method deteriorates significantly when the amount of data is reduced.
View details
LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus
Yifan Ding
Kohei Yatabe
Nobuyuki Morioka
Yu Zhang
Wei Han
Interspeech 2023 (2023)
Preview abstract
This paper introduces a new speech dataset called ``LibriTTS-R'' designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the LibriTTS-R ground-truth samples showed significantly improved sound quality compared to those in LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from [URL-HERE]
View details
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech
Takaaki Saeki
Zhehuai Chen
Nobuyuki Morioka
Yu Zhang
ICASSP (2023)
Preview abstract
This paper proposes Virtuoso, a massive multilingual speech–text joint learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech–text paired data in low-resource languages. This study extends Maestro, which is a speech–text semi-supervised joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline TTS models in seen languages, and 2) these models can synthesize reasonably good speech for unseen languages where no paired TTS data is available.
View details
MAESTRO: Matched Speech Text Representations through Modality Matching
Pedro Jose Moreno Mengibar
Yu Zhang
Zhehuai Chen
interspeech 2022 (2022) (to appear)
Preview abstract
We present Maestro, a self-supervised training method to
unify representations learnt from speech and text modalities.
Self-supervised learning from speech signals aims to learn the
latent structure inherent in the signal, while self-supervised
learning from text attempts to capture lexical information.
Learning aligned representations from unpaired speech and
text sequences is a challenging task. Previous work either
implicitly enforced the representations learnt from these two
modalities to be aligned in the latent space through multi-
tasking and parameter sharing or explicitly through conversion
of modalities via speech synthesis. While the former suffers
from interference between the two modalities, the latter
introduces additional complexity. In this paper, we propose
Maestro, a novel algorithm to learn unified representations from
both these modalities simultaneously that can transfer to diverse
downstream tasks such as Automated Speech Recognition
(ASR) and Speech Translation (ST). Maestro learns unified
representations through sequence alignment, duration predic-
tion and matching embeddings in the learned space through
an aligned masked-language model loss. We establish a new
state-of-the-art (SOTA) on VoxPopuli multilingual ASR with
a 8% relative reduction in Word Error Rate (WER), multi-
domain SpeechStew ASR (3.7% relative) and 21 languages to
English multilingual ST on CoVoST 2 with an improvement of
2.8 BLEU averaged over 21 languages.
View details
WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration
Kohei Yatabe
Proc. IEEE Spoken Language Technology Workshop (SLT) (2022) (to appear)
Preview abstract
Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. The DDPMs and GANs can be characterized by the iterative denoising framework and adversarial training, respectively. This study proposes a fast and high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal, and trains a deep neural network (DNN) for minimizing an adversarial loss calculated from intermediate outputs at all iterations. Subjective (side-by-side) listening tests showed no statistically significant differences in naturalness between human natural speech and those synthesized by WaveFit with five iterations. Furthermore, the inference speed of WaveFit was more than 240 times faster than WaveRNN. Audio demos are available at google.github.io/df-conformer/wavefit/.
View details
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
Kohei Yatabe
Nanxin Chen
Proc. Interspeech (2022) (to appear)
Preview abstract
Neural vocoder using denoising diffusion probabilistic model (DDPM) has been improved by adaptation of the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad that adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality especially in the high-frequency bands. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveform than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios. Audio demos are available at [wavegrad.github.io/specgrad/].
View details
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein
Norman Casagrande
Ye Jia
Alexey Petelin
Jonathan Shen
Yu Zhang
Interspeech (2022)
Preview abstract
Transfer tasks in text-to-speech (TTS) synthesis — where one
or more aspects of the speech of one set of speakers is transferred
to another set of speakers that do not feature these aspects originally —
remains a challenging task. One of the challenges is that models
that have high-quality transfer capabilities can have issues in stability,
making them impractical for user-facing critical tasks. This paper
demonstrates that transfer can be obtained by training an robust TTS
system on data generated by a less robust TTS system designed for a high-quality
transfer task; In particular, a CHiVE-BERT monolingual TTS
system is trained on the output of a Tacotron model designed
for accent transfer. While some quality loss is inevitable with
this approach, experimental results show that the models trained
on synthetic data this way can produce high quality audio displaying accent
transfer, while preserving speaker characteristics such as speaking style.
View details
Preview abstract
This paper explores the research question of whether training neural language models using a small subset of representative data selected from a large training dataset can achieve the same level of performance that obtained using all the original training data.
In our experiments, we confirm that the representative subset obtained by the likelihood-difference-based method can maintain the same performance level even when the dataset is reduced to about 10th or 100th of the original data.
We also show that the performance of the random selection method deteriorates significantly when the amount of data is reduced.
View details