ON USING BACKPROPAGATION FOR SPEECH TEXTURE GENERATION AND VOICE CONVERSION

Jan Chorowski
Samy Bengio
ICASSP (2018)

Abstract

Inspired by recent work on neural network image generation, which
relies on backpropagation towards the network inputs, we present
a proof-of-concept system for speech texture synthesis and voice
conversion based on two mechanisms: approximate inversion of
the representation learned by a speech recognition neural network,
and matching statistics of neuron activations between source and
target utterances. As in image texture synthesis and
neural style transfer, the system works by optimizing a cost function
with respect to the input waveform samples. To this end we use a
differentiable mel-filterbank feature extraction pipeline and train a
convolutional CTC speech recognition network. Our system is able
to extract speaker characteristics from very limited amounts of target
speaker data, as little as a few seconds, and can be used to generate
realistic speech babble or reconstruct an utterance in a different voice.
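To make the mechanism concrete, the sketch below shows the core optimization loop in the style the abstract describes: a differentiable log-mel feature pipeline feeds a frozen convolutional network, and the waveform samples themselves are updated by gradient descent to match time-averaged activation statistics of a target utterance. This is a minimal illustration, not the authors' system: the tiny two-layer network stands in for their convolutional CTC recognizer, and the file name, layer sizes, and hyperparameters are all hypothetical.

```python
import torch
import torchaudio

sr = 16000
# Differentiable mel-filterbank feature extraction (analogous in spirit
# to the pipeline described in the abstract).
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=400, hop_length=160, n_mels=40, power=2.0)

# Stand-in for the paper's pretrained convolutional CTC recognizer:
# any frozen, differentiable network over log-mel features works here.
net = torch.nn.Sequential(
    torch.nn.Conv1d(40, 64, kernel_size=5, padding=2),
    torch.nn.ReLU(),
    torch.nn.Conv1d(64, 64, kernel_size=5, padding=2),
    torch.nn.ReLU(),
).eval()
for p in net.parameters():
    p.requires_grad_(False)

def stats(wave):
    """Time-averaged activation statistics of the recognizer."""
    feats = torch.log(melspec(wave) + 1e-6)  # (batch, n_mels, frames)
    h = net(feats)
    return h.mean(dim=-1)                    # average over time

# A few seconds of target-speaker audio (hypothetical file name).
target, _ = torchaudio.load("target_speaker.wav")
with torch.no_grad():
    target_stats = stats(target)

# The waveform samples are the optimization variables.
x = 1e-3 * torch.randn_like(target)
x.requires_grad_(True)
opt = torch.optim.Adam([x], lr=1e-2)

for step in range(1000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(stats(x), target_stats)
    loss.backward()  # gradients flow through the mel pipeline to the samples
    opt.step()
```

Matching only time-averaged statistics, rather than the full activation sequence, is what lets a few seconds of target audio suffice: the loss constrains speaker-dependent texture while leaving the temporal content free, which is the babble-generation case; adding an inversion term on a source utterance's activations would steer the result toward voice conversion.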
