Marco Tagliasacchi

Authored Publications
    We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. It operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder built on the Transformer architecture. We also present a "refinement" version of the model that predicts enhanced audio tokens from the audio tokens of speech separated by a conventional separation model. Using both objective metrics and subjective MUSHRA listening tests, we show that our model achieves excellent separation performance, both with and without transcript conditioning. We also measure automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
    MusicLM: Generating Music From Text
    Andrea Agostinelli
    Mauro Verzetti
    Antoine Caillon
    Qingqing Huang
    Neil Zeghidour
    Christian Frank
    under review (2023)
    We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody, in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs with rich text descriptions provided by human experts.
    LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models
    Bastiaan Kleijn
    Michael Chinen
    Neil Zeghidour
    Teerapat Jenrungrot
    ICASSP 2023 (2023)
    We introduce LMCodec, a fully causal neural speech codec that provides high quality at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec first trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://google.github.io/chrome-media-audio-papers/publications/lmcodec.
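    The bitrate savings in LMCodec come from predictive entropy coding: the better a model predicts the next token, the fewer bits an ideal entropy coder spends on it (-log2 p per token). A minimal sketch of that accounting, with a hypothetical `predict_probs` callback standing in for the predictive Transformer (not the LMCodec implementation):

    ```python
    import math

    def entropy_coded_bits(token_stream, predict_probs):
        """Estimate the bits needed to transmit tokens with a predictive model.

        predict_probs(context) returns a probability distribution over the
        next token given the past tokens; an ideal entropy coder spends
        -log2 p(token) bits on each transmitted token.
        """
        bits = 0.0
        context = []
        for tok in token_stream:
            p = predict_probs(context)[tok]   # model's probability of the true token
            bits += -math.log2(p)             # ideal code length for this token
            context.append(tok)               # next prediction conditions on it
        return bits
    ```

    With a uniform predictor over 4 tokens this reduces to 2 bits per token; a sharper predictor spends correspondingly less.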
    We present a method to separate speech signals from noisy environments in the embedding space of a neural audio codec. We introduce a new training procedure that allows our model to produce structured encodings of audio waveforms given by embedding vectors, where one part of the embedding vector represents the speech signal and the rest represents the environment. We achieve this by partitioning the embeddings of different input waveforms and training the model to faithfully reconstruct audio from mixed partitions, thereby ensuring that each partition encodes a separate audio attribute. As use cases, we demonstrate the separation of speech from background noise or from reverberation characteristics. Our method also allows for targeted adjustments of the audio output characteristics.
    We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions, and generalizing to unseen speakers. Our approach significantly outperforms baselines constructed using adaptive TTS, as judged by human raters in side-by-side preference and MOS tests.
    Real-time Speech Frequency Bandwidth Extension
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
    In this paper we propose a lightweight model that performs frequency bandwidth extension of speech signals, increasing the sampling frequency from 8 kHz to 16 kHz while restoring the high-frequency content to a level that is indistinguishable from original 16 kHz samples. The model architecture is based on SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model that adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the input speech. In addition, we propose a version of SEANet that can be deployed on device in streaming mode, achieving an architecture latency of 16 ms. When profiled on a single mobile CPU, processing one 16 ms frame takes only 1.5 ms, so the total latency is compatible with deployment in bi-directional voice communication systems.
    Semi-supervised batch active learning via bilevel optimization
    Andreas Krause
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
    Active learning is an effective technique for reducing labeling cost by improving data efficiency. In this work, we propose a novel batch acquisition strategy for active learning in the setting where model training is performed in a semi-supervised manner. We formulate our approach as a data summarization problem via bilevel optimization, where the queried batch consists of the points that best summarize the unlabeled data pool. We show that our method is highly effective on keyword detection tasks when only a few labeled samples are available.
    One-shot conditional audio filtering of arbitrary sounds
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE (to appear)
    We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source. Using a waveform-to-waveform neural network architecture, we are able to train a model in an entirely unsupervised way. Using a sound source encoder model which is learned jointly with the source separation network, the trained model can be "configured" to filter arbitrary sound sources, even ones that it has not seen during training. Evaluated on the FSD50k dataset, our model obtains an SI-SDR improvement of 9.6 dB for mixtures of two sounds. When trained on Librispeech, our model achieves an SI-SDR improvement of 12.3 dB when separating one voice from a mixture of two speakers. Moreover, we show that the representation learned by the sound source encoder clusters acoustically similar sounds together in the embedding space, even though it is trained without using any labels.
    MicAugment: One-shot Microphone Style Transfer
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
    A critical aspect for the successful deployment of audio-based models "in the wild" is robustness to the transformations introduced by heterogeneous microphones. In this work we propose a method that is able to perform one-shot microphone style transfer. Given only a few seconds of audio recorded by a target device, MicAugment identifies the transformations associated with the microphone and uses the learned transformations to synthesize audio as if it were recorded by that device. We show that our method can successfully apply the style of a target microphone, and that it significantly increases model robustness to microphone variability when used as data augmentation in downstream tasks.
    SoundStream: An End-to-End Neural Audio Codec
    Neil Zeghidour
    Alejandro Luebs
    Transactions on Audio, Speech and Language Processing (2021)
    We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with negligible quality loss compared with models trained at fixed bitrates. In addition, the model is amenable to a low-latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at a 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.
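    The residual vector quantizer at the heart of SoundStream produces the coarse-to-fine code hierarchy mentioned above: each stage quantizes whatever residual the previous stage left behind, so early codes carry the coarse structure and later ones refine it. A toy NumPy sketch of that cascade (random codebooks for illustration; not the trained SoundStream quantizer):

    ```python
    import numpy as np

    def residual_vector_quantize(x, codebooks):
        """Quantize vector x with a cascade of codebooks (coarse-to-fine).

        Each stage snaps the current residual to its nearest codebook entry
        and passes the remaining residual to the next stage.
        """
        residual = np.asarray(x, dtype=float)
        codes = []
        for cb in codebooks:                             # cb: (num_entries, dim)
            dists = np.linalg.norm(cb - residual, axis=1)
            idx = int(np.argmin(dists))                  # nearest entry
            codes.append(idx)
            residual = residual - cb[idx]                # leftover for next stage
        quantized = np.asarray(x, dtype=float) - residual  # sum of chosen entries
        return codes, quantized
    ```

    Dropping later codes degrades the reconstruction gracefully, which is what lets a single model serve multiple bitrates under the structured dropout described in the abstract.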
    Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
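    The core idea above, treating a mixture and its automatically separated outputs as positive views of the same scene, can be dropped into a standard contrastive objective. A generic NT-Xent-style sketch over a batch of paired embeddings (the paper's exact objectives, which also include coincidence prediction, differ):

    ```python
    import numpy as np

    def nt_xent_pair_loss(z_mix, z_sep, temperature=0.5):
        """Contrastive loss for paired embeddings, NT-Xent style.

        Row i of z_mix (mixture embedding) and row i of z_sep (embedding of
        its separated output) form a positive pair; all other rows in the
        batch act as negatives.
        """
        a = z_mix / np.linalg.norm(z_mix, axis=1, keepdims=True)  # unit rows
        b = z_sep / np.linalg.norm(z_sep, axis=1, keepdims=True)
        sim = a @ b.T / temperature                    # cosine similarities
        logsum = np.log(np.exp(sim).sum(axis=1))       # log partition per row
        return float(np.mean(logsum - np.diag(sim)))   # -log softmax of positives
    ```

    Minimizing this pulls each mixture toward its own separated view and away from the other examples in the batch.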
    Learning to Denoise Historical Music
    ISMIR 2020 - 21st International Society for Music Information Retrieval Conference
    We propose SEANet (Sound Enhancement Adversarial Network), an audio-to-audio generative model that learns to denoise and enhance old music recordings. Our model internally converts its input into a time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting spectrogram using a convolutional neural network. The network is trained with both reconstructive and adversarial objectives on a synthetic noisy music dataset, created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method both quantitatively on held-out test examples of the synthetic dataset, and qualitatively through human ratings on samples of actual historical recordings. Our results show that the proposed method is effective in removing noise while preserving the musical quality and details of the original.
    We explore self-supervision as a way to learn general-purpose audio representations. Specifically, we propose two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip. We evaluate how the representations learned via self-supervision transfer to different downstream tasks, either by training a task-specific linear classifier on top of the pretrained embeddings, or by fine-tuning a model end-to-end for each downstream task. Our results show that the representations learned with Audio2Vec transfer better than those learned by fully supervised training on AudioSet. In addition, by fine-tuning Audio2Vec representations it is possible to outperform fully supervised models trained from scratch on each task when limited data is available, thus improving label efficiency.
    SPICE: Self-supervised pitch estimation
    Christian Frank
    Mihajlo Velimirović
    IEEE Transactions on Audio, Speech and Language Processing (to appear) (2020)
    We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate pitch without any form of supervision. The key observation is that pitch shift maps to a simple translation when the audio signal is analysed through the lens of the constant-Q transform (CQT). We design a self-supervised task by feeding two shifted slices of the CQT to the same convolutional encoder, and require that the difference in the outputs is proportional to the corresponding difference in pitch. In addition, we introduce a small model head on top of the encoder, which is able to determine the confidence of the pitch estimate, so as to distinguish between voiced and unvoiced audio. Our results show that the proposed method is able to estimate pitch at a level of accuracy comparable to fully supervised models, both on clean and noisy audio samples, although it does not require access to large labeled datasets.
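    The key observation above, that a pitch shift becomes a translation of the CQT, yields a loss that needs no labels: feed two shifted slices of the same frame to one encoder and penalize any mismatch between the difference of its outputs and the known shift. A toy sketch of that self-supervised task (the proportionality constant `sigma` and the squared penalty are assumptions for illustration; SPICE uses a Huber loss):

    ```python
    import numpy as np

    def pitch_shift_loss(encoder, cqt_frame, shift_bins, sigma=0.1):
        """Self-supervised pitch loss in the spirit of SPICE.

        Two views of the same CQT frame, one translated by a known number of
        bins, go through the same encoder; the difference of the two scalar
        outputs must be proportional to the known shift.
        """
        view_a = np.roll(cqt_frame, shift_bins)  # pitch shift = CQT translation
        view_b = cqt_frame
        y_a, y_b = encoder(view_a), encoder(view_b)
        return float(((y_a - y_b) - sigma * shift_bins) ** 2)
    ```

    Any encoder whose output moves linearly with the position of the spectral content drives this loss to zero, which is exactly the behavior a pitch estimator needs.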
    The deployment of deep networks on mobile devices requires efficient use of scarce computational resources, expressed as either available memory or computing cost. When addressing multiple tasks simultaneously, it is extremely important to share resources across tasks, especially when they all consume the same input data, e.g., audio samples captured by the on-board microphones. In this paper we propose a multi-task model architecture that consists of a shared encoder and multiple task-specific adapters. During training, we learn the model parameters as well as the allocation of the task-specific additional resources across both tasks and layers. A global tuning parameter can be used to obtain different multi-task network configurations, finding the desired trade-off between cost and the level of accuracy across tasks. Our results show that this solution significantly outperforms a multi-head model baseline. Interestingly, we observe that the optimal resource allocation depends on both the intrinsic characteristics of the tasks and the targeted cost measure (e.g., memory or computing cost).
    We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although it is possible to only partially reconstruct the user's speech from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the user's speech. We trained our model with data collected by sensors mounted on an earbud and synthetically corrupted by superimposing different kinds of noise sources on the audio signal. Our experimental results demonstrate that it is possible to achieve very high quality results, even in the case of interfering speech at the same level of loudness.
    The ultimate goal of transfer learning is to enable learning with a small amount of data by using a strong embedding. While significant progress has been made in the visual and language domains, the speech domain lacks such a universal method. This paper presents a new representation of speech signals based on an unsupervised triplet-loss objective, which outperforms both the existing state of the art and other representations on a number of transfer learning tasks in the non-semantic speech domain. The embedding is learned on a publicly available dataset and tested on a variety of low-resource downstream tasks, including personalization tasks and tasks in the medical domain. The model will be publicly released.