Marco Tagliasacchi
Authored Publications
LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models
Bastiaan Kleijn
Michael Chinen
Neil Zeghidour
Teerapat Jenrungrot
ICASSP 2023 (2023)
We introduce LMCodec, a fully-causal neural speech codec that provides high quality at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec first trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://google.github.io/chrome-media-audio-papers/publications/lmcodec.
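The conditional entropy coding step can be made concrete with a back-of-envelope sketch: under an ideal entropy coder, transmitting a token costs -log2 p(token) bits under the model's predicted distribution, so sharper predictions from the second Transformer translate directly into a lower bitrate. The function below is a hypothetical illustration, not the LMCodec implementation.

```python
import math

def code_length_bits(step_probs, tokens):
    """Ideal entropy-coding cost in bits of a token sequence, given the
    model's predicted probability distribution at each step."""
    return sum(-math.log2(p[t]) for p, t in zip(step_probs, tokens))

# A confident model (p = 0.9 on the true token) spends ~0.15 bits per token;
# a uniform model over 1024 codes always spends 10 bits per token.
confident = code_length_bits([[0.1, 0.9]] * 4, [1, 1, 1, 1])
uniform = code_length_bits([[1 / 1024] * 1024] * 4, [7, 42, 0, 999])
```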
MusicLM: Generating Music From Text
Andrea Agostinelli
Mauro Verzetti
Antoine Caillon
Qingqing Huang
Neil Zeghidour
Christian Frank
under review (2023)
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
Further links: samples, MusicCaps dataset
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition
Xuankai Chang
Neil Zeghidour
Interspeech 2023
We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We also present a "refinement" version of the model that predicts enhanced audio tokens from the audio tokens of speech separated by a conventional separation model. Using both objective metrics and subjective MUSHRA listening tests, we show that our model achieves excellent performance in terms of separation, both with and without transcript conditioning. We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
Disentangling speech from surroundings with neural embeddings
Malcolm Slaney
Neil Zeghidour
ICASSP 2023 (2023)
We present a method to separate speech signals from noisy environments in the embedding space of a neural audio codec. We introduce a new training procedure that allows our model to produce structured encodings of audio waveforms given by embedding vectors, where one part of the embedding vector represents the speech signal, and the rest represent the environment. We achieve this by partitioning the embeddings of different input waveforms and training the model to faithfully reconstruct audio from mixed partitions, thereby ensuring each partition encodes a separate audio attribute. As use cases, we demonstrate the separation of speech from background noise or from reverberation characteristics. Our method also allows for targeted adjustments of the audio output characteristics.
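The partitioning idea can be sketched in a few lines. The dimensions, the two-way split, and the helper names below are illustrative assumptions, not the trained codec: one slice of the embedding stands for speech, the rest for the environment, and recombining slices from different inputs corresponds to the targeted adjustments described above.

```python
def split_embedding(vec, speech_dims):
    """Partition an embedding vector into a speech part and an environment part."""
    return vec[:speech_dims], vec[speech_dims:]

def recombine(speech_part, env_part):
    """Build a new embedding pairing one input's speech with another's
    environment, e.g. to strip background noise or reverberation."""
    return speech_part + env_part

# Toy 4-dim embeddings: first two dims encode speech, last two the environment.
noisy = [0.5, -1.2, 3.0, 0.7]
silence_env = [0.0, 0.0]
denoised = recombine(split_embedding(noisy, 2)[0], silence_env)
```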
We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions, and generalizing to unseen speakers. Our approach significantly outperforms baselines constructed using adaptive TTS, as judged by human raters in side-by-side preference and MOS tests.
Real-time Speech Frequency Bandwidth Extension
Dominik Roblek
2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
In this paper we propose a lightweight model that performs frequency bandwidth extension of speech signals, increasing the sampling frequency from 8 kHz to 16 kHz, while restoring the high frequency content to a level that is indistinguishable from the original samples at 16 kHz. The model architecture is based on SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the input speech. In addition, we propose a version of SEANet that can be deployed on device in streaming mode, achieving an architecture latency of 16 ms. When profiled on a single mobile CPU, processing one 16 ms frame takes only 1.5 ms, so that the total latency is compatible with deployment in bi-directional voice communication systems.
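The real-time claim works out with simple arithmetic: one 16 ms frame takes 1.5 ms to process, so compute is well under real time and the end-to-end budget stays close to the 16 ms architecture latency.

```python
frame_ms = 16.0      # architecture latency: the model emits one 16 ms frame
compute_ms = 1.5     # measured processing time per frame on a single mobile CPU

real_time_factor = compute_ms / frame_ms   # < 1.0 means faster than real time
total_latency_ms = frame_ms + compute_ms   # frame buffering plus processing
```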
MicAugment: One-shot Microphone Style Transfer
2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
A critical aspect for the successful deployment of audio-based models "in-the-wild" is the robustness to the transformations introduced by heterogeneous microphones. In this work we propose a method that is able to perform one-shot microphone style transfer. Given only a few seconds of audio recorded by a target device, MicAugment identifies the transformations associated with the microphone and uses the learned transformations to synthesize audio as if it were recorded by that device. We show that our method can successfully apply the style of a target microphone and that it significantly increases model robustness to microphone variability when used as data augmentation in downstream tasks.
Semi-supervised batch active learning via bilevel optimization
Andreas Krause
2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
Active learning is an effective technique for reducing the labeling cost by improving data efficiency. In this work, we propose a novel batch acquisition strategy for active learning in the setting where the model training is performed in a semi-supervised manner. We formulate our approach as a data summarization problem via bilevel optimization, where the queried batch consists of the points that best summarize the unlabeled data pool. We show that our method is highly effective in keyword detection tasks in the regime where only a few labeled samples are available.
SoundStream: An End-to-End Neural Audio Codec
Neil Zeghidour
Alejandro Luebs
Transactions on Audio, Speech and Language Processing (2021)
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.
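The residual vector quantizer admits a compact toy sketch (hand-picked two-entry codebooks, no learning, pure Python): each stage quantizes the residual left by the previous stages, so successive codes refine the reconstruction from coarse to fine.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(codebooks, vec):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    residual, codes = list(vec), []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Sum the selected entries across stages to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for cb, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],    # coarse stage
    [[0.1, 0.0], [0.0, 0.1]],    # fine stage, quantizing the residual
]
codes = rvq_encode(codebooks, [1.05, 1.0])
```

Dropping the fine code still yields a usable coarse reconstruction, which is what makes structured dropout over quantizer layers and variable bitrates possible.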
Self-Supervised Learning from Automatically Separated Sound Scenes
Xavier Serra
WASPAA 2021 (2021)
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically-constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
One-shot conditional audio filtering of arbitrary sounds
Dominik Roblek
2021 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE (to appear)
We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source. Using a waveform-to-waveform neural network architecture, we are able to train a model in an entirely unsupervised way. Using a sound source encoder model which is learned jointly with the source separation network, the trained model can be "configured" to filter arbitrary sound sources, even ones that it has not seen during training. Evaluated on the FSD50k dataset, our model obtains an SI-SDR improvement of 9.6 dB for mixtures of two sounds. When trained on Librispeech, our model achieves an SI-SDR improvement of 12.3 dB when separating one voice from a mixture of two speakers. Moreover, we show that the representation learned by the sound source encoder clusters acoustically similar sounds together in the embedding space, even if it is trained without using any labels.
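SI-SDR, the metric reported above, is standard and short enough to state inline; this is the textbook formulation (project the estimate onto the target to cancel gain, then compare energies), not code from the paper.

```python
import math

def si_sdr(estimate, target):
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(e * t for e, t in zip(estimate, target))
    alpha = dot / sum(t * t for t in target)    # optimal scaling of the target
    scaled = [alpha * t for t in target]
    noise = [e - s for e, s in zip(estimate, scaled)]
    s_pow = sum(s * s for s in scaled)
    n_pow = sum(n * n for n in noise)
    return 10.0 * math.log10(s_pow / n_pow)
```

An SI-SDR improvement of 9.6 dB means the separated output scores 9.6 dB higher against the reference than the unprocessed mixture does.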
Pre-training audio representations with self-supervision
Dominik Roblek
IEEE Signal Processing Letters, vol. 27 (2020), pp. 600-604
We explore self-supervision as a way to learn general purpose audio representations. Specifically, we propose two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip. We evaluate how the representations learned via self-supervision transfer to different downstream tasks, either training a task-specific linear classifier on top of the pretrained embeddings, or fine-tuning a model end-to-end for each downstream task. Our results show that the representations learned with Audio2Vec transfer better than those learned by fully-supervised training on AudioSet. In addition, by fine-tuning Audio2Vec representations it is possible to outperform fully-supervised models trained from scratch on each task when limited data is available, thus improving label efficiency.
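The TemporalGap pretext task needs no labels beyond the clip itself; a hypothetical sampler for its training triples (indices only, no audio I/O) might look like:

```python
import random

def temporal_gap_example(clip_len, seg_len, rng=random):
    """Sample two windows from the same clip, labeled with the distance
    between their start positions (the self-supervised regression target)."""
    a = rng.randrange(clip_len - seg_len + 1)
    b = rng.randrange(clip_len - seg_len + 1)
    return (a, a + seg_len), (b, b + seg_len), abs(a - b)
```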
Learning to Denoise Historical Music
Dominik Roblek
ISMIR 2020 - 21st International Society for Music Information Retrieval Conference
We propose SEANet (Sound Enhancement Adversarial Network), an audio-to-audio generative model that learns to denoise and enhance old music recordings. Our model internally converts its input into time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting spectrogram using a convolutional neural network. The network is trained with both reconstructive and adversarial objectives on a synthetic noisy music dataset, which is created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method both quantitatively on held-out test examples of the synthetic dataset, and qualitatively by human rating on samples of actual historical recordings. Our results show that the proposed method is effective in removing noise, while preserving the musical quality and details of the original.
We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although it is possible to only partially reconstruct the user's speech from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the user's speech. We trained our model with data collected by sensors mounted on an earbud and synthetically noisified by superimposing different kinds of noise sources on the audio signal. Our experimental results demonstrate that it is possible to achieve very high quality results, even in the case of interfering speech at the same level of loudness.
SPICE: Self-supervised pitch estimation
Christian Frank
Dominik Roblek
Mihajlo Velimirović
IEEE Transactions on Audio Speech and Language Processing (to appear) (2020)
We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate pitch without any form of supervision. The key observation is that pitch shift maps to a simple translation when the audio signal is analysed through the lens of the constant-Q transform (CQT). We design a self-supervised task by feeding two shifted slices of the CQT to the same convolutional encoder, and require that the difference in the outputs is proportional to the corresponding difference in pitch. In addition, we introduce a small model head on top of the encoder, which is able to determine the confidence of the pitch estimate, so as to distinguish between voiced and unvoiced audio. Our results show that the proposed method is able to estimate pitch at a level of accuracy comparable to fully supervised models, both on clean and noisy audio samples, although it does not require access to large labeled datasets.
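The pitch-shift-as-translation objective can be sketched with a stand-in encoder. Everything below is a toy assumption: the real model uses a convolutional encoder on CQT frames, while here the "encoder" is just the center of mass of the bins, which moves linearly under translation, so the shifted-pair constraint holds exactly.

```python
import random

def encoder(cqt_slice):
    """Stand-in for the convolutional encoder: center of mass of the bins
    (assumes the slice has nonzero energy)."""
    total = sum(cqt_slice)
    return sum(i * v for i, v in enumerate(cqt_slice)) / total

def shifted_pair_loss(cqt, max_shift=4, sigma=1.0):
    """Self-supervised objective: crop two slices at random offsets k1, k2
    and require the encoder outputs to differ by sigma * (k2 - k1)."""
    k1 = random.randint(0, max_shift)
    k2 = random.randint(0, max_shift)
    width = len(cqt) - max_shift
    y1 = encoder(cqt[k1:k1 + width])
    y2 = encoder(cqt[k2:k2 + width])
    err = (y1 - y2) - sigma * (k2 - k1)
    return err * err
```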
Towards Learning a Universal Non-Semantic Representation of Speech
Ronnie Zvi Maor
Ira Shavitt
Proc. Interspeech 2020 (2020)
The ultimate goal of transfer learning is to enable learning with a small amount of data, by using a strong embedding. While significant progress has been made in the visual and language domains, the speech domain does not have such a universal method. This paper presents a new representation of speech signals based on an unsupervised triplet-loss objective, which outperforms both the existing state of the art and other representations on a number of transfer learning tasks in the non-semantic speech domain. The embedding is learned on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and the medical domain. The model will be publicly released.
Multi-Task Adapters for On-Device Audio Inference
Dominik Roblek
IEEE Signal Processing Letters, vol. 27, pp. 630-634
The deployment of deep networks on mobile devices requires efficient use of the scarce computational resources, expressed as either available memory or computing cost. When addressing multiple tasks simultaneously, it is extremely important to share resources across tasks, especially when they all consume the same input data, e.g., audio samples captured by the on-board microphones. In this paper we propose a multi-task model architecture that consists of a shared encoder and multiple task-specific adapters. During training, we learn the model parameters as well as the allocation of the task-specific additional resources across both tasks and layers. A global tuning parameter can be used to obtain different multi-task network configurations, finding the desired trade-off between cost and the level of accuracy across tasks. Our results show that this solution significantly outperforms a multi-head model baseline. Interestingly, we observe that the optimal resource allocation depends on both the task intrinsic characteristics as well as on the targeted cost measure (e.g., memory or computing cost).
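The shared-encoder/per-task-adapter pattern can be sketched minimally. The names and the linear adapters below are illustrative assumptions; the paper's adapters additionally learn how much capacity each task gets per layer.

```python
def shared_encoder(x):
    """Stand-in for the shared backbone: computed once per input."""
    return [x, x * x]

def make_adapter(weights):
    """Task-specific adapter: a small linear head on the shared features."""
    def adapter(features):
        return sum(w * f for w, f in zip(weights, features))
    return adapter

task_a = make_adapter([1.0, 0.0])    # e.g. a keyword-spotting head
task_b = make_adapter([0.0, 1.0])    # e.g. an event-detection head

features = shared_encoder(3.0)       # one forward pass feeds both tasks
```

Because the encoder runs once and only the lightweight adapters are task-specific, adding a task costs far less than duplicating the whole network per task.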