Zalan Borsos

Authored Publications
    We present a method to separate speech signals from noisy environments in the embedding space of a neural audio codec. We introduce a new training procedure that allows our model to produce structured encodings of audio waveforms given by embedding vectors, where one part of the embedding vector represents the speech signal and the rest represents the environment. We achieve this by partitioning the embeddings of different input waveforms and training the model to faithfully reconstruct audio from mixed partitions, thereby ensuring that each partition encodes a separate audio attribute. As use cases, we demonstrate the separation of speech from background noise or from reverberation characteristics. Our method also allows for targeted adjustments of the audio output characteristics.
    We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separating and transcribing each speech source, and generating speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We also present a "refinement" version of the model that predicts enhanced audio tokens from the audio tokens of speech separated by a conventional separation model. Using both objective metrics and subjective MUSHRA listening tests, we show that our model achieves excellent performance in terms of separation, both with and without transcript conditioning. We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
    Bastiaan Kleijn
    Michael Chinen
    Neil Zeghidour
    Teerapat Jenrungrot
    ICASSP 2023 (2023)
    We introduce LMCodec, a fully-causal neural speech codec that provides high quality at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec first trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://google.github.io/chrome-media-audio-papers/publications/lmcodec.
    MusicLM: Generating Music From Text
    Andrea Agostinelli
    Mauro Verzetti
    Antoine Caillon
    Qingqing Huang
    Neil Zeghidour
    Christian Frank
    under review (2023)
    We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
    We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions, and generalizing to unseen speakers. Our approach significantly outperforms baselines constructed using adaptive TTS, as judged by human raters in side-by-side preference and MOS tests.
    Semi-supervised batch active learning via bilevel optimization
    Andreas Krause
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
    Active learning is an effective technique for reducing the labeling cost by improving data efficiency. In this work, we propose a novel batch acquisition strategy for active learning in the setting where the model training is performed in a semi-supervised manner. We formulate our approach as a data summarization problem via bilevel optimization, where the queried batch consists of the points that best summarize the unlabeled data pool. We show that our method is highly effective in keyword detection tasks in the regime where only a few labeled samples are available.
    MicAugment: One-shot Microphone Style Transfer
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
    A critical aspect for the successful deployment of audio-based models "in the wild" is the robustness to the transformations introduced by heterogeneous microphones. In this work we propose a method that is able to perform one-shot microphone style transfer. Given only a few seconds of audio recorded by a target device, MicAugment identifies the transformations associated with the microphone and uses the learned transformations to synthesize audio as if it were recorded by that device. We show that our method can successfully apply the style of a target microphone and that it significantly increases model robustness to microphone variability when used as data augmentation in downstream tasks.