Heiga Zen

Heiga Zen received his AE from Suzuka National College of Technology, Suzuka, Japan, in 1999, and his PhD from the Nagoya Institute of Technology, Nagoya, Japan, in 2006. He was an Intern/Co-Op researcher at the IBM T.J. Watson Research Center, Yorktown Heights, NY (2004--2005), and a Research Engineer at the Toshiba Research Europe Ltd. Cambridge Research Laboratory, Cambridge, UK (2008--2011). At Google, he was with the Speech team from July 2011 to July 2018, then joined the Brain team in August 2018. Since June 2023, he has been a Principal Scientist at Google DeepMind, Japan. His research interests include speech technology and machine learning. He was one of the original authors and the first maintainer of the HMM-based speech synthesis system (HTS).
Authored Publications
    This paper presents a novel approach to train a direct speech-to-speech translation model from monolingual datasets only, in a fully unsupervised manner. The proposed approach combines back-translation, denoising autoencoder, and unsupervised embedding mapping techniques to achieve this goal. We demonstrate the effectiveness of the proposed approach by comparing it against a cascaded baseline using two Spanish and English datasets. The proposed approach achieved a significant improvement over the cascaded baseline on synthesized unpaired conversational and synthesized Common Voice 11 datasets.
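A minimal, hypothetical sketch of the back-translation round trip underlying this kind of unsupervised training, shown on toy text data: target-language sentences are translated "backwards" to create synthetic pairs, which then supervise the forward model, alongside a denoising objective. The corpora, lexicon, and training functions below are illustrative stand-ins, not the paper's speech-to-speech networks.

```python
# Hypothetical sketch of back-translation + denoising for unsupervised translation.
# All models below are toy stand-ins; only the data flow is the point.
import random

# Toy monolingual corpora (no paired data available).
english_only = ["hello world", "good morning", "see you later"]
spanish_only = ["hola mundo", "buenos dias", "hasta luego"]


def backward_model(es_sentence: str) -> str:
    """Stand-in for a Spanish->English model (here: a fixed word-level lookup)."""
    toy_lexicon = {"hola": "hello", "mundo": "world", "buenos": "good",
                   "dias": "morning", "hasta": "see you", "luego": "later"}
    return " ".join(toy_lexicon.get(w, w) for w in es_sentence.split())


def train_step(pairs):
    """Stand-in for one gradient step on (input, target) pairs."""
    for src, tgt in pairs:
        print(f"train on pair: '{src}' -> '{tgt}'")


def corrupt(sentence: str) -> str:
    """Noise function for the denoising-autoencoder objective (word dropout)."""
    words = [w for w in sentence.split() if random.random() > 0.3]
    return " ".join(words) if words else sentence


for epoch in range(2):
    # 1) Back-translation: synthesize sources for real target sentences.
    pseudo_pairs = [(backward_model(es), es) for es in spanish_only]
    train_step(pseudo_pairs)
    # 2) Denoising autoencoding: reconstruct clean sentences from corrupted ones.
    denoising_pairs = [(corrupt(en), en) for en in english_only]
    train_step(denoising_pairs)
```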
    This paper proposes Virtuoso, a massive multilingual speech–text joint learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech–text paired data in low-resource languages. This study extends Maestro, a speech–text semi-supervised joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline TTS models in seen languages, and 2) these models can synthesize reasonably good speech for unseen languages for which no paired TTS data is available.
    Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher and apply it to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the web to studio quality. To make our SR model robust against various forms of degradation, we use (i) a speech representation extracted by w2v-BERT as the input feature, and (ii) linguistic features extracted from transcripts via PnG-BERT as conditioning features. Experiments show that the proposed model (i) is robust against various types of audio degradation, (ii) can restore samples in the LJSpeech dataset and improve the quality of text-to-speech (TTS) outputs without changing the model or hyper-parameters, and (iii) enables us to train a high-quality TTS model from restored speech samples collected from the web.
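A heavily simplified, hypothetical sketch of the general pattern of conditioning a feature cleaner on both a degraded speech representation and a text-derived linguistic feature. The tensor shapes, dimensions, and FiLM-style conditioning are illustrative assumptions, not the architecture described in the paper.

```python
# Hypothetical sketch: map degraded speech features to clean ones, conditioned on
# text-derived linguistic features (FiLM-style scale/shift). Dimensions are assumed.
import torch
import torch.nn as nn


class ConditionedCleaner(nn.Module):
    def __init__(self, speech_dim=1024, text_dim=512, hidden=256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, hidden)
        self.film = nn.Linear(text_dim, 2 * hidden)  # per-frame scale and shift
        self.out = nn.Linear(hidden, speech_dim)

    def forward(self, degraded_speech_feats, text_feats):
        # degraded_speech_feats: (batch, frames, speech_dim), e.g. w2v-BERT-like
        # text_feats:            (batch, frames, text_dim), frame-aligned
        h = torch.relu(self.speech_proj(degraded_speech_feats))
        scale, shift = self.film(text_feats).chunk(2, dim=-1)
        h = scale * h + shift          # inject linguistic conditioning
        return self.out(h)             # predicted clean speech features


model = ConditionedCleaner()
noisy = torch.randn(2, 100, 1024)        # stand-in for degraded speech features
text = torch.randn(2, 100, 512)          # stand-in for linguistic features
clean_target = torch.randn(2, 100, 1024)
loss = nn.functional.l1_loss(model(noisy, text), clean_target)
loss.backward()
```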
    Twenty-Five Years of Evolution in Speech and Language Processing
    Michael Picheny
    Dilek Hakkani-Tur
    IEEE Signal Processing Magazine, 40 (2023), pp. 27-39
    This paper explores the research question of whether training neural language models on a small, representative subset selected from a large training dataset can achieve the same level of performance as using all the original training data. We explore likelihood-based scoring for obtaining representative subsets, which we call RepSet. Our experiments confirm that the representative subset obtained with a likelihood-difference-based score can reach the 90% performance level even when the dataset is reduced to about 1/1,000th of the original data. We also show that the performance of random selection deteriorates significantly as the amount of data is reduced.
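A minimal sketch of likelihood-based subset selection in general terms: each example is scored by a difference of log-likelihoods from two scoring models, and only the top-scoring fraction is kept. The two scoring models and the exact score definition here are assumptions for illustration, not the paper's recipe.

```python
# Hypothetical sketch of likelihood-difference scoring for data selection.
import numpy as np

rng = np.random.default_rng(0)
num_examples = 100_000

# Stand-ins for per-example log-likelihoods under two models (in practice these
# would come from trained language models, e.g. an in-domain and a general one).
loglik_model_a = rng.normal(loc=-50.0, scale=5.0, size=num_examples)
loglik_model_b = rng.normal(loc=-55.0, scale=5.0, size=num_examples)

score = loglik_model_a - loglik_model_b      # likelihood-difference score
keep_fraction = 1 / 1000                     # shrink to ~1/1,000th of the data
k = max(1, int(num_examples * keep_fraction))
representative_idx = np.argsort(score)[-k:]  # indices of the kept subset

print(f"kept {len(representative_idx)} of {num_examples} examples")
```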
    This paper introduces a new speech dataset called "LibriTTS-R", designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers, together with the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the ground-truth samples of LibriTTS-R have significantly improved sound quality compared to those of LibriTTS, and that a neural end-to-end TTS model trained on LibriTTS-R achieves speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from [URL-HERE].
    Dramatic Advances in Text-to-Speech Synthesis through Deep Learning (深層学習によるテキスト音声合成の飛躍的発展)
    電子情報通信学会誌 (Journal of the IEICE), 105-5 (2022), pp. 413-417
    In text-to-speech synthesis, the mainstream approach had long been concatenative synthesis, in which recorded speech waveforms are automatically cut and spliced to produce speech for a desired text. In contrast, generative-model-based synthesis learns the relationship between text and speech with a conditional generative model and synthesizes speech for arbitrary text; it offers advantages such as converting voice characteristics using only a small amount of speech, but the naturalness of its synthesized speech was a long-standing problem. Over roughly the past decade, deep learning has been introduced into generative-model-based synthesis and performance has improved dramatically, making it possible to flexibly control speaker identity and prosody while maintaining high naturalness. This article discusses the impact that the introduction of deep generative models has had on text-to-speech synthesis.
    Neural vocoders based on denoising diffusion probabilistic models (DDPMs) have been improved by adapting the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad, which adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality, especially in the high-frequency bands. It is processed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech-enhancement scenarios. Audio demos are available at wavegrad.github.io/specgrad/.
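A rough, hypothetical sketch of the underlying idea of time-varying filtering in the time-frequency domain: white Gaussian noise is transformed with an STFT, multiplied by a time-varying spectral envelope, and transformed back. The envelope below is a random positive surface standing in for one derived from a conditioning log-mel spectrogram; this is not the SpecGrad algorithm itself.

```python
# Hypothetical sketch: shape white noise with a time-varying spectral envelope
# in the STFT domain, then return to the time domain.
import numpy as np
from scipy.signal import stft, istft

fs, n_fft, hop = 24_000, 1024, 256
noise = np.random.randn(fs)                      # 1 second of white noise

# STFT of the noise: complex array of shape (freq_bins, frames).
_, _, N = stft(noise, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)

# Stand-in for a time-varying spectral envelope derived from the conditioning
# log-mel spectrogram (here: a strictly positive random surface).
envelope = np.exp(0.5 * np.random.randn(*N.shape))
envelope = np.maximum(envelope, 1e-3)

shaped = N * envelope                            # time-varying filtering
_, shaped_noise = istft(shaped, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)

print(shaped_noise.shape)                        # filtered noise waveform
```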
    Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders; they can be characterized by an iterative denoising framework and adversarial training, respectively. This study proposes a fast, high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal and trains a deep neural network (DNN) to minimize an adversarial loss calculated from the intermediate outputs of all iterations. Subjective (side-by-side) listening tests showed no statistically significant difference in naturalness between natural human speech and speech synthesized by WaveFit with five iterations. Furthermore, the inference speed of WaveFit was more than 240 times faster than that of WaveRNN. Audio demos are available at google.github.io/df-conformer/wavefit/.
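A minimal sketch of the iterative-refinement training pattern described above: the same denoising network is applied for a fixed number of iterations and a loss is accumulated over every intermediate output. The tiny network and the simple L1 loss are illustrative assumptions (no adversarial discriminator is included here), so this is a sketch of the training pattern, not WaveFit itself.

```python
# Hypothetical sketch: fixed-point style iterative denoising with a loss
# accumulated over the intermediate output of every iteration.
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Linear(256, 256), nn.Tanh(), nn.Linear(256, 256))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
num_iterations = 5

target = torch.randn(8, 256)              # stand-in for clean waveform frames
initial_noise = torch.randn(8, 256)       # stand-in for the initial noisy input

optimizer.zero_grad()
signal = initial_noise
losses = []
for _ in range(num_iterations):
    signal = signal - denoiser(signal)    # one refinement (denoising) step
    losses.append(nn.functional.l1_loss(signal, target))

loss = torch.stack(losses).sum()          # supervise every intermediate output
loss.backward()
optimizer.step()
```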
    We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of the translation speech are provided: 1) CVSS-C, in which all translation speech is in a single high-quality canonical voice, and 2) CVSS-T, in which the translation speech is in voices transferred from the corresponding source speech. In addition, CVSS provides normalized translation text that matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2 that outperforms the previous state of the art trained on the corpus without extra data by 5.8 BLEU. The performance of the direct S2ST models approaches that of the strong cascade baselines when trained from scratch, and is within only 0.1 or 0.7 BLEU on ASR-transcribed translation when initialized from matching ST models.