Jump to Content

Beat Gfeller

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    One-shot conditional audio filtering of arbitrary sounds
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE (to appear)
    Preview abstract We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source. Using \tuneenv, a waveform-to-waveform neural network architecture, we are able to train a model in an entirely unsupervised way. Using a sound source encoder model which is learned jointly with the source separation network, the trained model can be ``configured'' to filter arbitrary sound sources, even ones that it has not seen during training. Evaluated on the FSD50k dataset, our model obtains an SI-SDR improvement of 9.6 dB, for mixtures of two sounds. When trained on Librispeech, our model achieves an SI-SDR improvement of 12.3 dB when separating one voice from a mixture of two speakers. Moreover, we show that the representation learned by the sound source encoder clusters acoustically similar sounds together in the embedding space, even if it is trained without using any labels. View details
    MicAugment: One-shot Microphone Style Transfer
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
    Preview abstract A critical aspect for the successful deployment of audio-based models ``in-the-wild'' is the robustness to the transformations introduced by heterogeneous microphones. In this work we propose a method that is able to perform \emph{one-shot microphone style} transfer. Given only a \emph{few seconds} of audio recorded by a target device, \emph{MicAugment} identifies the transformations associated to the microphone and uses the learned transformations to synthesize audio as if it were recorded by that device. We show that our method can successfully apply the style of a target microphone and that it significantly increases model robustness to microphone variability when used as \emph{data augmentation} in downstream tasks. View details
    Preview abstract We explore self-supervision as a way to learn general purpose audio representations. Specifically, we propose two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip. We evaluate how the representations learned via self-supervision transfer to different downstream tasks, either training a task-specific linear classifier on top of the pretrained embeddings, or fine-tuning a model end-to-end for each downstream task. Our results show that the representations learned with Audio2Vec transfer better than those learned by fully-supervised training on Audioset. In addition, by fine-tuning Audio2Vec representations it is possible to outperform fully-supervised models trained from scratch on each task, when limited data is available, thus improving label efficiency. View details
    SPICE: Self-supervised pitch estimation
    Christian Frank
    Mihajlo Velimirović
    IEEE Transactions on Audio Speech and Language Processing (to appear) (2020)
    Preview abstract We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate pitch without any form of supervision. The key observation is that pitch shift maps to a simple translation when the audio signal is analysed through the lens of the constant-Q transform (CQT). We design a self-supervised task by feeding two shifted slices of the CQT to the same convolutional encoder, and require that the difference in the outputs is proportional to the corresponding difference in pitch. In addition, we introduce a small model head on top of the encoder, which is able to determine the confidence of the pitch estimate, so as to distinguish between voiced and unvoiced audio. Our results show that the proposed method is able to estimate pitch at a level of accuracy comparable to fully supervised models, both on clean and noisy audio samples, although it does not require access to large labeled datasets. View details
    Learning to Denoise Historical Music
    ISMIR 2020 - 21st International Society for Music Information Retrieval Conference
    Preview abstract We propose SEANet (Sound Enhancement Adversarial Network), an audio-to-audio generative model that learns to denoise and enhance old music recordings. Our model internally converts its input into time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting spectrogram using a convolutional neural network. The network is trained with both reconstructive and adversarial objectives on a synthetic noisy music dataset, which is created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method both quantitatively on held-out test examples of the synthetic dataset, and qualitatively by human rating on samples of actual historical recordings. Our results show that the proposed method is effective in removing noise, while preserving the musical quality and details of the original. View details
    Preview abstract Existing music recognition applications require both user activation and a connection to a server that performs the actual recognition. In this paper we present a low power music recognizer that runs entirely on a mobile phone and automatically recognizes music without requiring any user activation. A small music detector runs continuously on the mobile phone’s DSP (digital signal processor) chip and only wakes main the processor when it is confident that music is present. Once woken the detector on the main processor is provided with an 8s buffer of audio which is then fingerprinted and compared to the stored fingerprints in the on-device fingerprint database of over 70000 songs. View details
    Preview abstract This paper presents a cross-lingual projection technique for training class-based language models. We borrow from previous success in projecting POS tags and NER mentions to that of a trained classbased language model. We use a CRF to train a model to predict when a sequence of words is a member of a given class and use this to label our language model training data. We show that we can successfully project the contextual cues for these classes across pairs of languages and retain a high quality class model in languages with no supervised class data. We present empirical results that show the quality of the projected models as well as their effect on the down-stream speech recognition objective. We are able to achieve over half the reduction of WER when using the projected class models as compared to models trained on human annotations. View details
    No Results Found