Beat Gfeller
Authored Publications
One-shot conditional audio filtering of arbitrary sounds
Dominik Roblek
2021 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE (to appear)
Abstract
We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source. Using SoundFilter, a waveform-to-waveform neural network architecture, we are able to train a model in an entirely unsupervised way. Using a sound source encoder model which is learned jointly with the source separation network, the trained model can be "configured" to filter arbitrary sound sources, even ones that it has not seen during training. Evaluated on the FSD50K dataset, our model obtains an SI-SDR improvement of 9.6 dB for mixtures of two sounds. When trained on LibriSpeech, our model achieves an SI-SDR improvement of 12.3 dB when separating one voice from a mixture of two speakers. Moreover, we show that the representation learned by the sound source encoder clusters acoustically similar sounds together in the embedding space, even though it is trained without using any labels.
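The SI-SDR figures quoted above are a standard source-separation metric; as a reference point, here is a minimal NumPy sketch of the commonly used scale-invariant SDR computation (not code from the paper):

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB, in the commonly used form (Le Roux et al., 2019)."""
    # Remove any DC offset, then project the estimate onto the reference.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # scaled reference ("signal" part)
    noise = estimate - target           # everything else ("distortion" part)
    return float(10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2)))

# Toy usage: a slightly noisy copy of a 440 Hz tone.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
reference = np.sin(2.0 * np.pi * 440.0 * t)
estimate = reference + 0.05 * np.random.default_rng(0).standard_normal(reference.size)
print(f"SI-SDR: {si_sdr(estimate, reference):.1f} dB")
```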
MicAugment: One-shot Microphone Style Transfer
2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
Abstract
A critical aspect for the successful deployment of audio-based models "in the wild" is robustness to the transformations introduced by heterogeneous microphones. In this work we propose a method that is able to perform one-shot microphone style transfer. Given only a few seconds of audio recorded by a target device, MicAugment identifies the transformations associated with the microphone and uses the learned transformations to synthesize audio as if it were recorded by that device. We show that our method can successfully apply the style of a target microphone, and that it significantly increases model robustness to microphone variability when used as data augmentation in downstream tasks.
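The abstract does not spell out the transformation model, so the following is only a simplified stand-in: a generic microphone-style augmentation that convolves audio with an (assumed, already-estimated) impulse response and applies a soft clipping nonlinearity. The function name and the toy impulse response are hypothetical and are not MicAugment itself:

```python
import numpy as np

def apply_mic_style(audio: np.ndarray, mic_ir: np.ndarray, clip_level: float = 0.9) -> np.ndarray:
    """Simplified stand-in for microphone style transfer: convolve with an
    already-estimated impulse response, then apply a soft clipping
    nonlinearity. This is NOT the MicAugment model, just an illustration of
    device-style data augmentation."""
    styled = np.convolve(audio, mic_ir, mode="same")
    return clip_level * np.tanh(styled / clip_level)

# Hypothetical usage as data augmentation before training.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                                # placeholder waveform
mic_ir = rng.standard_normal(64) * np.exp(-np.arange(64) / 8.0)   # toy impulse response
augmented = apply_mic_style(clean, mic_ir)
```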
SPICE: Self-supervised pitch estimation
Christian Frank
Dominik Roblek
Mihajlo Velimirović
IEEE Transactions on Audio, Speech and Language Processing, 2020 (to appear)
Abstract
We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate pitch without any form of supervision. The key observation is that pitch shift maps to a simple translation when the audio signal is analysed through the lens of the constant-Q transform (CQT). We design a self-supervised task by feeding two shifted slices of the CQT to the same convolutional encoder, and require that the difference in the outputs is proportional to the corresponding difference in pitch. In addition, we introduce a small model head on top of the encoder, which is able to determine the confidence of the pitch estimate, so as to distinguish between voiced and unvoiced audio. Our results show that the proposed method is able to estimate pitch at a level of accuracy comparable to fully supervised models, both on clean and noisy audio samples, although it does not require access to large labeled datasets.
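To make the self-supervised objective concrete, here is a minimal sketch of the idea described above: the same encoder sees two CQT slices offset by a known number of bins, and the difference of its outputs is pushed towards a value proportional to that offset. The encoder, the scaling constant sigma, and the slicing scheme are placeholders, not the paper's exact configuration:

```python
import numpy as np

def spice_style_loss(encoder, cqt_frame: np.ndarray, k1: int, k2: int,
                     sigma: float = 0.05, num_bins: int = 128) -> float:
    """Sketch of the self-supervised pitch objective: the same encoder maps two
    CQT slices, offset by (k1 - k2) bins, to scalar outputs whose difference
    should be proportional to that offset. `encoder`, `sigma` and the slicing
    scheme are placeholders, not the paper's exact configuration."""
    slice_1 = cqt_frame[k1:k1 + num_bins]
    slice_2 = cqt_frame[k2:k2 + num_bins]
    y1, y2 = encoder(slice_1), encoder(slice_2)   # scalar pitch-related outputs
    target = sigma * (k1 - k2)                    # pitch difference implied by the shift
    return float(abs((y1 - y2) - target))         # L1 penalty here; the paper's loss may differ

# Toy usage with a trivial "encoder" that just averages its input slice.
rng = np.random.default_rng(0)
frame = rng.random(190)                           # one CQT frame (placeholder size)
print(spice_style_loss(lambda s: float(s.mean()), frame, k1=10, k2=4))
```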
Pre-training audio representations with self-supervision
Dominik Roblek
IEEE Signal Processing Letters, 27 (2020), pp. 600-604
Abstract
We explore self-supervision as a way to learn general purpose audio representations. Specifically, we propose two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip. We evaluate how the representations learned via self-supervision transfer to different downstream tasks, either by training a task-specific linear classifier on top of the pretrained embeddings, or by fine-tuning a model end-to-end for each downstream task. Our results show that the representations learned with Audio2Vec transfer better than those learned by fully supervised training on AudioSet. In addition, by fine-tuning Audio2Vec representations it is possible to outperform fully supervised models trained from scratch on each task when limited data is available, thus improving label efficiency.
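As an illustration of the TemporalGap task described above, the following sketch builds one training example: two random segments from the same clip, plus the (normalized) gap between their start points as the regression target. The normalization and segment length are assumptions, not details from the paper:

```python
import numpy as np

def temporal_gap_example(clip: np.ndarray, segment_len: int, rng: np.random.Generator):
    """Sketch of a TemporalGap training pair: two random segments from the same
    clip, with the normalized time gap between their start points as the
    regression target. Normalization scheme is an assumption."""
    max_start = len(clip) - segment_len
    s1, s2 = rng.integers(0, max_start, size=2)
    gap = abs(int(s1) - int(s2)) / max_start      # normalized gap in [0, 1]
    return clip[s1:s1 + segment_len], clip[s2:s2 + segment_len], gap

rng = np.random.default_rng(0)
clip = rng.standard_normal(16000 * 10)            # a 10 s placeholder clip at 16 kHz
seg_a, seg_b, gap = temporal_gap_example(clip, segment_len=16000, rng=rng)
print(f"target gap: {gap:.2f}")
```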
Learning to Denoise Historical Music
Dominik Roblek
ISMIR 2020 - 21st International Society for Music Information Retrieval Conference
Abstract
We propose SEANet (Sound Enhancement Adversarial Network), an audio-to-audio generative model that learns to denoise and enhance old music recordings. Our model internally converts its input into a time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting spectrogram using a convolutional neural network. The network is trained with both reconstructive and adversarial objectives on a synthetic noisy music dataset, which is created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method both quantitatively on held-out test examples of the synthetic dataset, and qualitatively by human rating on samples of actual historical recordings. Our results show that the proposed method is effective in removing noise, while preserving the musical quality and details of the original.
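The synthetic-data step described above can be illustrated with a short sketch that mixes clean music with a noise sample at a chosen SNR; the SNR value and the scaling policy are assumptions rather than the paper's recipe:

```python
import numpy as np

def mix_with_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Sketch of synthetic training-data creation: mix clean music with a real
    noise sample at a chosen SNR. SNR range and scaling policy are assumptions."""
    noise = np.resize(noise, clean.shape)                  # loop/trim noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean_music = rng.standard_normal(16000 * 4)               # placeholder "clean music"
old_record_noise = 0.1 * rng.standard_normal(16000)        # placeholder noise from a quiet segment
noisy_input = mix_with_noise(clean_music, old_record_noise, snr_db=10.0)
```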
Now Playing: Continuous low-power music recognition
Dominik Roblek
James David Lyon
Julian James Odell
Mihajlo Velimirović
NIPS 2017 Workshop: Machine Learning on the Phone
Abstract
Existing music recognition applications require both user activation and a connection to a server that performs the actual recognition. In this paper we present a low-power music recognizer that runs entirely on a mobile phone and automatically recognizes music without requiring any user activation. A small music detector runs continuously on the mobile phone's DSP (digital signal processor) chip and only wakes the main processor when it is confident that music is present. Once woken, the detector on the main processor is provided with an 8-second buffer of audio, which is then fingerprinted and compared to the stored fingerprints in the on-device fingerprint database of over 70,000 songs.
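The second stage described above (the main processor fingerprints an 8-second buffer and matches it against an on-device database) can be sketched as follows; the fingerprint function and the dictionary database are toy stand-ins, not the actual Now Playing fingerprinter or index:

```python
import numpy as np

def fingerprint(buffer: np.ndarray, n_hashes: int = 64) -> np.ndarray:
    """Toy stand-in for an audio fingerprinter: NOT the paper's fingerprint,
    just a fixed-size spectral summary used to illustrate the pipeline."""
    spectrum = np.abs(np.fft.rfft(buffer))
    chunks = np.array_split(spectrum, n_hashes)
    return np.array([c.mean() for c in chunks])

def on_music_detected(audio_8s: np.ndarray, database: dict) -> str:
    """Sketch of the on-device matching step: fingerprint the 8 s buffer and
    return the closest entry in the database (here a plain dict; the real
    system uses a compact fingerprint index)."""
    query = fingerprint(audio_8s)
    # Nearest-neighbour match by Euclidean distance (illustrative only).
    return min(database, key=lambda song: float(np.linalg.norm(database[song] - query)))

rng = np.random.default_rng(0)
db = {f"song_{i}": rng.random(64) for i in range(3)}       # placeholder fingerprint DB
buffer = rng.standard_normal(8 * 16000)                    # 8 s of audio at 16 kHz
print(on_music_detected(buffer, db))
```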
Abstract
This paper presents a cross-lingual projection technique for training class-based language models. We borrow from previous success in projecting POS tags and NER mentions to that of a trained class-based language model. We use a CRF to train a model to predict when a sequence of words is a member of a given class, and use this to label our language model training data. We show that we can successfully project the contextual cues for these classes across pairs of languages and retain a high-quality class model in languages with no supervised class data. We present empirical results that show the quality of the projected models as well as their effect on the downstream speech recognition objective. We are able to achieve over half the reduction of WER when using the projected class models as compared to models trained on human annotations.
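The labelling step described above (using a tagger's predicted spans to rewrite language-model training sentences with class tokens) can be sketched as follows; the span format, class names, and example sentence are hypothetical, and the tagger itself (a CRF in the paper) is assumed to have already produced the spans:

```python
def replace_class_spans(tokens: list, spans: list) -> list:
    """Sketch of the labelling step: given (start, end, label) spans predicted
    by a tagger, rewrite the LM training sentence with class tokens in place of
    the tagged words. Span format and token names are illustrative assumptions."""
    out, i = [], 0
    for start, end, label in sorted(spans):
        out.extend(tokens[i:start])
        out.append(f"<{label}>")      # e.g. "<song>" replacing a song title
        i = end
    out.extend(tokens[i:])
    return out

sentence = "play bohemian rhapsody by queen".split()
predicted_spans = [(1, 3, "song"), (4, 5, "artist")]   # hypothetical tagger output
print(" ".join(replace_class_spans(sentence, predicted_spans)))
# -> "play <song> by <artist>"
```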