Google Research

Salient Speech Representations Based on Cloned Networks

Interspeech (2019)


We define salient features as features that are shared by signals that a system designer defines as equivalent. This definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receives as input a different signal from a set of equivalent signals. The objective function encourages the network clones to map their inputs into a set of unit-variance features that is identical across the clones. Training can be unsupervised, or supervised with a decoder that attempts to reconstruct a desired target signal. As an application, we train a system that extracts a time sequence of feature vectors from speech and uses it as conditioning for a WaveNet generative system, facilitating both coding and enhancement.
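The abstract does not spell out the exact objective. What follows is a minimal NumPy sketch of the cloned-encoder idea under our own assumptions, not the paper's method. `encode` is a hypothetical one-layer encoder, and every clone shares its weights `W`. `clone_agreement_loss` is an assumed objective. It standardizes each feature dimension to unit variance, which prevents the trivial solution of mapping every input to a constant. It then penalizes how far each clone's features deviate from the across-clone mean.

```python
import numpy as np

def encode(x, W):
    # Hypothetical one-layer encoder; all clones share the same weights W.
    return np.tanh(x @ W)

def clone_agreement_loss(signals, W, eps=1e-8):
    """signals: shape (K, B, D), i.e. K equivalent versions of B examples.

    Returns the mean squared deviation of each clone's standardized
    features from the across-clone mean (an assumed objective).
    """
    # Clone k encodes its own version of each example with the shared weights.
    feats = np.stack([encode(s, W) for s in signals])  # (K, B, F)
    # Standardize every feature dimension over clones and batch so the
    # features have unit variance; eps guards against a zero std.
    mean = feats.mean(axis=(0, 1), keepdims=True)
    std = feats.std(axis=(0, 1), keepdims=True)
    z = (feats - mean) / (std + eps)
    # The "identical across clones" target: agreement with the clone average.
    consensus = z.mean(axis=0, keepdims=True)  # (1, B, F)
    return float(np.mean((z - consensus) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))
x = rng.normal(size=(64, 16))
noisy = np.stack([x + 0.1 * rng.normal(size=x.shape) for _ in range(3)])
print(clone_agreement_loss(noisy, W))
```

Slightly perturbed copies of the same batch give a low loss, and unrelated signals give a high one. In an actual training loop, `W` would be updated by gradient descent on this loss. In the supervised variant, a decoder reconstruction term would be added. Both of those steps are omitted here.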

