Restoring speaker voices with zero-shot cross-lingual voice transfer for TTS

August 21, 2024

Fadi Biadsy, Senior Staff Research Scientist, and Youzheng (Joseph) Chen, Senior Software Engineer, Google Assistant

We present a new zero-shot voice transfer module for text-to-speech (TTS) systems that can restore the voices of individuals with dysarthria who have lost or never had a typical voice.

Vocal characteristics contribute significantly to the construction and perception of individual identity. The loss of one's voice, caused by physical or neurological conditions, can result in a profound sense of loss, striking at the very heart of one's identity. Speakers with degenerative neural diseases, such as amyotrophic lateral sclerosis (ALS), Parkinson's, and multiple sclerosis, may experience a degradation of some of the unique characteristics of their voice over time. Some individuals are born with conditions, like muscular dystrophy, that affect the articulatory system and limit their ability to produce certain sounds. Profound deafness also impacts vocal and articulatory patterns due to the absence of auditory input and feedback. These conditions can present lifelong challenges in producing speech that matches typical speech patterns.

In recent years, there have been new advances in voice transfer (VT) technology, integrated into text-to-speech (TTS), voice conversion (VC), and speech-to-speech translation models. For example, in our previous work, we built a VC model that converts atypical speech directly into a predetermined synthesized typical voice that can be more easily understood by others. For many individuals with dysarthria, however, VT extends these speech technologies further, helping them regain their original voice and potentially predict speech patterns they have lost.

A VT module can be designed for a given speaker using either few-shot or zero-shot training. In few-shot training for VT, a sample of speech from a given speaker is used to adapt a pre-trained model to transfer or clone their voice. This approach typically produces high quality speech with high speaker-voice fidelity, depending on the amount and quality of the training samples. A more challenging approach is zero-shot, which requires no training; instead, audio reference samples (e.g., 10 seconds) from a given speaker are fed to the system during generation to transfer their voice into the output synthesized speech. These systems vary significantly in quality and are not guaranteed to produce voices with high fidelity to the reference voice. Few-shot approaches can be effective for speakers who once had typical speech and banked a set of high quality samples of their voice before an etiology progressed (or a physical injury occurred). On the other hand, zero-shot is more appropriate for dysarthric speakers who have not banked sufficient samples of their voice or have never had a typical voice. Moreover, a zero-shot system can be easily scaled and deployed.

In this blog post, we describe a zero-shot VT module that can be easily plugged into a state-of-the-art TTS system to restore the voices of input speakers. It can be used both when speakers have banked a small set of recordings of their voice and when atypical speech is the only data available. We add this module to our TTS system and use it to restore the voices of speakers who banked their typical speech. We also show that the same model produces high quality speech with high fidelity voice preservation even when the input reference is atypical, which is useful for those who have not banked their voice or never had typical speech. Finally, we demonstrate that such a module is capable of transferring voice across languages, even when the language of the input reference speech differs from the intended target language.

The text-to-speech model with voice transfer module

The TTS inference system consists of a text encoder that transforms the linguistic information into a sequence of hidden representations. These representations are then fed into a token duration predictor and upsampler, which generate a longer sequence proportional to the predicted output duration. This expanded sequence is passed to a feature decoder to generate hidden features corresponding to the synthesized acoustic features. Finally, a WaveFit vocoder converts these features into an output time-domain waveform.
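
To make the data flow concrete, here is a toy sketch of that pipeline. All module internals below are illustrative placeholders (random projections and arbitrary dimensions), not the actual model; only the order of operations mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(token_ids):
    # Map each input token to a 512-dim hidden representation (random here).
    return rng.normal(size=(len(token_ids), 512))

def duration_predictor(hidden):
    # Predict an integer number of output frames per token.
    return np.maximum(1, (np.abs(hidden[:, 0]) * 10).astype(int))

def upsample(hidden, durations):
    # Repeat each token's representation according to its predicted duration.
    return np.repeat(hidden, durations, axis=0)

def feature_decoder(frames):
    # Decode the upsampled sequence into acoustic features (128 dims per frame).
    return frames[:, :128]

def wavefit_vocoder(features, hop_length=256):
    # Stand-in for the WaveFit vocoder: acoustic features -> time-domain waveform.
    return rng.normal(size=features.shape[0] * hop_length)

token_ids = [12, 7, 42, 3]                       # toy tokenized input text
hidden = text_encoder(token_ids)
frames = upsample(hidden, duration_predictor(hidden))
waveform = wavefit_vocoder(feature_decoder(frames))
print(waveform.shape)                            # length grows with predicted durations
```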

Our new VT module is an extension to this TTS system that takes an input reference speech sample and enables the TTS model to generate synthesized speech in the voice of the reference speaker. This extension is shown in the yellow components in the figure below. The VT module is composed of a speaker encoder, bottleneck modules, and residual adapters.

TTS model architecture with voice transfer.

The VT speaker encoder takes a spectrogram computed from 2–14 seconds of reference speech. It extracts a high-level representation that summarizes the acoustic-phonetic and prosodic characteristics of the input reference speech into an embedding vector. This vector is then passed to all layers of the duration and feature decoders.
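
As a rough illustration of the encoder's role, the sketch below mean-pools projected spectrogram frames into a single embedding; the projection, pooling strategy, and dimensions are assumptions rather than the actual encoder architecture.

```python
import numpy as np

def speaker_encoder(reference_spectrogram, embedding_dim=256, seed=0):
    # reference_spectrogram: [num_frames, num_mel_bins] computed from 2-14 s
    # of reference audio. A real encoder would use learned neural layers; here
    # we project each frame and mean-pool over time to get one fixed-size
    # speaker embedding that summarizes voice characteristics.
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(reference_spectrogram.shape[1], embedding_dim))
    return (reference_spectrogram @ projection).mean(axis=0)

# ~8 s of audio at 100 frames/s with 128 mel bins (placeholder values).
reference = np.abs(np.random.default_rng(1).normal(size=(800, 128)))
speaker_embedding = speaker_encoder(reference)
# This single vector is then passed to every duration- and feature-decoder layer.
```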

In each of those layers, we add a 1024-dimensional simplex bottleneck layer (based on global style tokens) to constrain the embedding vectors to lie within a simplex. This layer also helps ensure the continuity and completeness of the embedding space. The simplex itself can be learned during training. We find that this choice of bottleneck is crucial to the model's zero-shot capability for unseen voices, especially for speakers with atypical speech.
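
The sketch below shows one way such a bottleneck can constrain an embedding to a simplex, following the general global-style-token recipe: the embedding is re-expressed as a convex combination of a learned token bank. The parameterization and dimensions are assumptions; the post does not specify the exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SimplexBottleneck:
    """Toy simplex bottleneck: the speaker embedding is re-expressed as a
    convex combination of a bank of learned token vectors, so the mixing
    weights lie on a 1024-dimensional simplex."""

    def __init__(self, num_tokens=1024, dim=256, seed=0):
        rng = np.random.default_rng(seed)
        self.tokens = rng.normal(size=(num_tokens, dim))       # learned in training
        self.query_proj = rng.normal(size=(dim, dim)) * 0.01   # learned in training

    def __call__(self, speaker_embedding):
        query = speaker_embedding @ self.query_proj
        weights = softmax(self.tokens @ query)    # 1024 non-negative weights summing to 1
        return weights @ self.tokens              # embedding constrained to the simplex

bottleneck = SimplexBottleneck()
constrained = bottleneck(np.random.default_rng(1).normal(size=256))
```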

Finally, the output of the bottleneck layer is concatenated with the previous layer's output and given to a residual adapter added between every two consecutive layers. Using residual adapters makes the model modular, i.e., one can plug this VT module into (or unplug it from) a pre-trained TTS model without impacting the quality of the original TTS model. Additionally, they enable us to adapt and dedicate a very small set of parameters to each target voice to perform few-shot training, to perfect a target voice when banked speech is available. Given their parameter-efficient nature, these adapters can be loaded on demand.
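
Here is a minimal sketch of a residual adapter conditioned on the bottleneck output. The down-/up-projection structure and dimensions are assumptions; the point is that with small weights the adapter stays close to an identity mapping, which is what makes it pluggable.

```python
import numpy as np

class ResidualAdapter:
    """Toy residual adapter: a small two-layer projection whose output is added
    back to the layer activation. With near-zero weights it is close to the
    identity, so it can be added to (or removed from) a pre-trained TTS model
    without disturbing the original model."""

    def __init__(self, layer_dim=512, cond_dim=256, adapter_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(size=(layer_dim + cond_dim, adapter_dim)) * 0.01
        self.up = rng.normal(size=(adapter_dim, layer_dim)) * 0.01

    def __call__(self, layer_output, bottleneck_output):
        # Concatenate the previous layer's output with the bottleneck output,
        # pass it through the small adapter, and add the result as a residual.
        x = np.concatenate([layer_output, bottleneck_output], axis=-1)
        return layer_output + np.maximum(x @ self.down, 0.0) @ self.up

adapter = ResidualAdapter()
out = adapter(np.zeros(512), np.zeros(256))      # near-identity behavior
```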

Model training

We follow a previously defined procedure using multilingual training data to obtain a multilingual TTS system that includes our VT extension. With the model now accepting both text and reference speech, we pass a random consecutive chunk (2–14 seconds) from the target speech as a reference in each training sample. Using random chunks helps prevent leakage of durational and linguistic information. Recall that since the model is trained on multilingual data, the VT module is multilingual.
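
A simple way to implement this chunk selection is sketched below; the sample rate and the handling of short utterances are assumptions.

```python
import numpy as np

def sample_reference_chunk(target_waveform, sample_rate=24000,
                           min_sec=2.0, max_sec=14.0, rng=None):
    # Pick a random consecutive chunk (2-14 s) of the target speaker's audio to
    # use as the VT reference, so the reference carries voice characteristics
    # but not the exact durations or words of the training target.
    rng = rng or np.random.default_rng()
    max_len = min(len(target_waveform), int(max_sec * sample_rate))
    min_len = min(int(min_sec * sample_rate), max_len)
    chunk_len = int(rng.integers(min_len, max_len + 1))
    start = int(rng.integers(0, len(target_waveform) - chunk_len + 1))
    return target_waveform[start:start + chunk_len]

# Example: a 10 s placeholder waveform (the sample rate is an assumption).
waveform = np.zeros(10 * 24000)
reference_chunk = sample_reference_chunk(waveform)
```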

Experiments

Typical voice samples

Below are zero-shot examples using typical reference speech, demonstrating the scenario in which the speaker's voice was recorded before any voice degradation occurred. We demonstrate the zero-shot capability using samples from the VCTK corpus (the full list of audio samples is on our GitHub repo):

 

| Reference | TTS with Zero-shot VT |
| --- | --- |
| Female (P257) [audio] | [audio] |
| Male (P256) [audio] | [audio] |
| Female (P244) [audio] | [audio] |
| Male (P243) [audio] | [audio] |

We conducted a subjective similarity evaluation on 500 pairs of audio, where each pair includes a reference audio sample and synthesized speech of a sentence generated by an LLM (using the Gemini API). Each pair is presented to five human raters who are asked whether the two samples are spoken by the same person. We found that, on average, 76% (± 3.6) of raters indicated that the two samples came from the same speaker.
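
For concreteness, here is one plausible way to aggregate such binary ratings into a rate with an uncertainty estimate. The placeholder data and the normal-approximation interval are assumptions; the post does not state how its interval was computed.

```python
import numpy as np

def same_speaker_rate(ratings):
    # ratings: [num_pairs, num_raters] array with 1 = "same speaker", 0 = "different".
    # Returns the mean rate and a 95% confidence half-width over pairs
    # (normal approximation).
    per_pair = ratings.mean(axis=1)              # fraction of raters saying "same"
    mean_rate = per_pair.mean()
    half_width = 1.96 * per_pair.std(ddof=1) / np.sqrt(len(per_pair))
    return mean_rate, half_width

rng = np.random.default_rng(0)
fake_ratings = rng.integers(0, 2, size=(500, 5))  # placeholder ratings only
print(same_speaker_rate(fake_ratings))
```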

Case study: Atypical speech as a reference

To demonstrate the system’s performance when atypical speech is the only reference available, we worked with our fellow Google research scientist Dimitri Kanevsky, who has been profoundly deaf since a young age and learned to speak English using Russian phonetics. As a result, Dimitri has never had a typical voice. His speech patterns are unique and may be difficult for unfamiliar listeners to understand. Using only 12 seconds of Dimitri’s atypical voice as a reference, we synthesize the transcript of the original video to generate the output video presented below.

Left: Original; Right: VT Output.

The transcript was obtained using Dimitri’s personalized ASR model described in our previous work. Since Dimitri can’t hear this video, we asked 10 subjects who have a working relationship with Dimitri to score (from 1 to 10) how similar the output voice is to Dimitri’s. We obtained an average score of 8.1/10 (± 1.1).

As a second case study, we worked with Aubrie Lee, a Googler and advocate for disability inclusion who has muscular dystrophy, a condition that causes progressive muscle weakness and sometimes impacts speech production. Like Dimitri, Aubrie has never had typical speech. Following the same procedure as above, we give our zero-shot model about 14 seconds of atypical reference speech and use it to synthesize the automatic transcript of the following video.

Left: Original; Right: VT Output.

We asked Aubrie to score how similar she thinks the output voice is to hers and she gave it 8/10.

Cross-lingual experiments

As discussed above, the TTS system with the VT module is trained on multilingual data. We test whether our model, using the same atypical English reference speech from Dimitri and Aubrie, can generalize and transfer their voices to other languages, given non-English text. To do so, we first call the Gemini API to translate the English video transcripts into the target languages. We then use our model and their English speech references to synthesize the translations. Feeding the synthesized output to our video generation tool, we automatically produce the voice-transferred videos below.
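
At a high level, the procedure is a loop over target languages: translate the English transcript, then synthesize the translation with the same English reference audio. The function names below are hypothetical placeholders standing in for the translation and TTS calls, not actual APIs.

```python
def cross_lingual_voice_transfer(english_transcript, reference_audio,
                                 target_languages, translate_fn, tts_with_vt_fn):
    # translate_fn: (text, language_code) -> translated text (e.g., via Gemini).
    # tts_with_vt_fn: (text, reference_audio) -> waveform in the reference
    # speaker's voice, regardless of the text's language.
    outputs = {}
    for lang in target_languages:
        translated = translate_fn(english_transcript, lang)
        outputs[lang] = tts_with_vt_fn(translated, reference_audio)
    return outputs
```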

Below are VT outputs in six different languages for Dimitri (French, Spanish, Italian, Arabic, German, Russian):

Videos, left to right and top to bottom: French, Spanish; Italian, Arabic; German, Russian.

Below are VT outputs in six different languages for Aubrie (French, Spanish, Italian, Arabic, Hindi, Norwegian):

Videos, left to right and top to bottom: French, Spanish; Italian, Arabic; Hindi, Norwegian.

We also test the cross-lingual capability of our zero-shot TTS model using typical English reference speech from speakers in the VCTK corpus, synthesizing in English and six other languages. The transcripts and their translations were automatically generated using Gemini.

We conducted another set of cross-lingual subjective evaluations and found that, on average, 73.1% (± 4.7) of the human raters judged the given English reference and the automatically translated synthesized utterance to be spoken by the same speaker. All our raters are native speakers of the corresponding tested language. We also report a mean opinion score (MOS; from 1–5) for each language, measuring the naturalness and quality of the output audio of our TTS system.

 

|  | Reference | English | Chinese Mandarin | Spanish | Arabic | French | Japanese | German |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Male (P246) | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |
| Female (P303) | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |
| Male (P285) | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |
| Female (P244) | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |
| Speaker Similarity | 85% ± 6.3% | 76% ± 3.6% | 70% ± 14.4% | 62% ± 12.0% | 74% ± 11.1% | 75% ± 6.9% | 70% ± 5.5% | 70% ± 5.5% |
| MOS | 3.300 ± 0.29 | 3.621 ± 0.05 | 3.921 ± 0.05 | 3.606 ± 0.06 | 4.242 ± 0.03 | 4.058 ± 0.04 | 3.616 ± 0.05 | 3.985 ± 0.04 |

Voice transfer concerns

We recognize that in the context of voice transfer technology, the potential for misuse of synthesized speech is a growing concern. To address this, we use audio watermarking so that synthesized speech from our model can be detected. This technique embeds imperceptible information within the synthesized audio waveform. This hidden data can be detected using specialized software, enabling the identification of potentially manipulated or misused audio content. It's important to note that the risk of misuse is significantly lower for individuals who have never had typical speech patterns. In such cases, the synthesized nature of the output would be readily apparent, minimizing the potential for deception.

Acknowledgments

We would like to express our gratitude to Aubrie Lee, Dimitri Kanevsky, Kyle Kastner, Isaac Elias, Gary Wang, Andrew Rosenberg, Takaaki Saeki, Yuma Koizumi, Bhuvana Ramabhadran, Françoise Beaufays, RJ Skerry-Ryan, Ron Weiss, Zoe Ortiz, and the rest of Google’s speech team for their helpful feedback and contributions to this project.