Dominik Roblek

I am a software engineer, industrial researcher and manager at Google AI. I have extensive experience in using machine intelligence for low-power machine hearing, sound processing and ambient audio understanding. I've also been working on big data analytics, image understanding and generative models for audio and video. I hold numerous patents in these areas.

I earned a Master's degree (Dipl.-Ing.) in Mathematics from the University of Ljubljana in 1998 and a Master's degree in Computer Science from Trinity College Dublin in 2006. I grew up in Slovenia and, over the course of my career, have lived and worked in Slovenia, Ireland, California, and Switzerland.

Authored Publications
    One-shot conditional audio filtering of arbitrary sounds
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE (to appear)
    We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source. Using \tuneenv, a waveform-to-waveform neural network architecture, we are able to train a model in an entirely unsupervised way. Using a sound source encoder model which is learned jointly with the source separation network, the trained model can be "configured" to filter arbitrary sound sources, even ones that it has not seen during training. Evaluated on the FSD50k dataset, our model obtains an SI-SDR improvement of 9.6 dB for mixtures of two sounds. When trained on Librispeech, our model achieves an SI-SDR improvement of 12.3 dB when separating one voice from a mixture of two speakers. Moreover, we show that the representation learned by the sound source encoder clusters acoustically similar sounds together in the embedding space, even if it is trained without using any labels.
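The SI-SDR improvement quoted in the abstract above is the SI-SDR of the separated output minus the SI-SDR of the unprocessed mixture, both measured against the clean target. The following is a minimal sketch of the standard scale-invariant SDR definition in plain Python. The function names and the toy signals are illustrative assumptions, not code from the paper.

```python
import math

def si_sdr(estimate, target):
    """Scale-invariant signal-to-distortion ratio in dB.

    Both arguments are equal-length sequences of samples. The estimate is
    projected onto the target, so rescaling the estimate does not change
    the score. Raises ZeroDivisionError if the estimate is a perfect
    (scaled) copy of the target, because the residual energy is then zero.
    """
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / target_energy                      # optimal scaling factor
    projection = [alpha * t for t in target]         # part explained by target
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(r * r for r in residual)
    return 10.0 * math.log10(signal_energy / noise_energy)

def si_sdr_improvement(estimate, mixture, target):
    """Gain in SI-SDR of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

# Toy example (hypothetical signals): the interference is orthogonal to the
# target, and the "separated" estimate keeps one tenth of its amplitude.
target = [1.0, 0.0, -1.0, 0.0]
interference = [0.0, 1.0, 0.0, -1.0]
mixture = [t + i for t, i in zip(target, interference)]
estimate = [t + 0.1 * i for t, i in zip(target, interference)]
improvement = si_sdr_improvement(estimate, mixture, target)
```

In this toy case the mixture scores 0 dB and the estimate about 20 dB, so the improvement is about 20 dB.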
    Real-time Speech Frequency Bandwidth Extension
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
    In this paper we propose a lightweight model that performs frequency bandwidth extension of speech signals, increasing the sampling frequency from 8 kHz to 16 kHz, while restoring the high frequency content to a level that is indistinguishable from the original samples at 16 kHz. The model architecture is based on SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the input speech. In addition, we propose a version of SEANet that can be deployed on device in streaming mode, achieving an architecture latency of 16 ms. When profiled on a single mobile CPU, processing one 16 ms frame takes only 1.5 ms, so that the total latency is compatible with a deployment in bi-directional voice communication systems.
    Training Keyword Spotters with Limited and Synthesized Speech Data
    James Lin
    International Conference on Acoustics, Speech, and Signal Processing, IEEE, Barcelona, Spain (2020)
    With the rise of low-power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts of the model creation process is obtaining a sufficient amount of high-quality training data. In this paper, we explore the effectiveness of synthesized speech data in training small spoken term detection models. Instead of training such models directly on the audio or on low-level features such as MFCCs, we use a small speech embedding model trained to extract useful features for keyword spotting models. Using this embedding, we show that a model for detecting 10 keywords, when trained on only synthetic speech, is equivalent to a model trained on over 50 real examples, and to a model trained on 4000 real examples if we do not use the speech embeddings.
    Learning to Denoise Historical Music
    ISMIR 2020 - 21st International Society for Music Information Retrieval Conference
    We propose SEANet (Sound Enhancement Adversarial Network), an audio-to-audio generative model that learns to denoise and enhance old music recordings. Our model internally converts its input into a time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting spectrogram using a convolutional neural network. The network is trained with both reconstructive and adversarial objectives on a synthetic noisy music dataset, which is created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method both quantitatively on held-out test examples of the synthetic dataset, and qualitatively by human rating on samples of actual historical recordings. Our results show that the proposed method is effective in removing noise, while preserving the musical quality and details of the original.
    We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although the user's speech can be only partially reconstructed from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the user's speech. We trained our model with data collected by sensors mounted on an earbud and synthetically noisified by superimposing different kinds of noise sources on the audio signal. Our experimental results demonstrate that it is possible to achieve very high quality results, even in the case of interfering speech at the same level of loudness.
    We explore self-supervision as a way to learn general purpose audio representations. Specifically, we propose two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip. We evaluate how the representations learned via self-supervision transfer to different downstream tasks, either training a task-specific linear classifier on top of the pretrained embeddings, or fine-tuning a model end-to-end for each downstream task. Our results show that the representations learned with Audio2Vec transfer better than those learned by fully-supervised training on Audioset. In addition, by fine-tuning Audio2Vec representations it is possible to outperform fully-supervised models trained from scratch on each task, when limited data is available, thus improving label efficiency.
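To illustrate how a TemporalGap-style training example could be built without labels, the sketch below draws two segments at random from one clip and uses their normalized time distance as the regression target. This is only an assumption-laden illustration: the function name, the use of raw sample lists instead of spectrogram slices, and the normalization by the largest possible gap are hypothetical choices, not details confirmed by the abstract.

```python
import random

def sample_temporal_gap_pair(clip, segment_len, rng=random):
    """Return (segment_a, segment_b, gap) for one self-supervised example.

    clip is a sequence of samples or frames from a single recording.
    gap is the absolute distance between the two segment start positions,
    divided by the largest possible distance so that it lies in [0, 1]
    (this normalization is an assumption made for the sketch).
    """
    max_start = len(clip) - segment_len
    if max_start < 1:
        raise ValueError("clip must be longer than segment_len")
    start_a = rng.randint(0, max_start)   # randint is inclusive on both ends
    start_b = rng.randint(0, max_start)
    gap = abs(start_a - start_b) / max_start
    segment_a = clip[start_a:start_a + segment_len]
    segment_b = clip[start_b:start_b + segment_len]
    return segment_a, segment_b, gap

# Usage with a synthetic clip of 1000 frames and a seeded generator.
clip = list(range(1000))
seg_a, seg_b, gap = sample_temporal_gap_pair(clip, 96, random.Random(0))
```

A model trained on such triples would take the two segments as input and regress the gap, so the supervision signal comes from the clip itself.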
    The deployment of deep networks on mobile devices requires efficient use of scarce computational resources, expressed as either available memory or computing cost. When addressing multiple tasks simultaneously, it is extremely important to share resources across tasks, especially when they all consume the same input data, e.g., audio samples captured by the on-board microphones. In this paper we propose a multi-task model architecture that consists of a shared encoder and multiple task-specific adapters. During training, we learn the model parameters as well as the allocation of the task-specific additional resources across both tasks and layers. A global tuning parameter can be used to obtain different multi-task network configurations, finding the desired trade-off between cost and the level of accuracy across tasks. Our results show that this solution significantly outperforms a multi-head model baseline. Interestingly, we observe that the optimal resource allocation depends both on the intrinsic characteristics of each task and on the targeted cost measure (e.g., memory or computing cost).
    SPICE: Self-supervised pitch estimation
    Christian Frank
    Mihajlo Velimirović
    IEEE Transactions on Audio Speech and Language Processing (to appear) (2020)
    We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate pitch without any form of supervision. The key observation is that pitch shift maps to a simple translation when the audio signal is analysed through the lens of the constant-Q transform (CQT). We design a self-supervised task by feeding two shifted slices of the CQT to the same convolutional encoder, and require that the difference in the outputs is proportional to the corresponding difference in pitch. In addition, we introduce a small model head on top of the encoder, which is able to determine the confidence of the pitch estimate, so as to distinguish between voiced and unvoiced audio. Our results show that the proposed method is able to estimate pitch at a level of accuracy comparable to fully supervised models, both on clean and noisy audio samples, although it does not require access to large labeled datasets.
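The key observation in the abstract above can be checked numerically. CQT bins are spaced logarithmically in frequency, so multiplying a pitch by a fixed ratio moves it by the same number of bins wherever it starts. The sketch below assumes a lowest bin at 32.70 Hz and 24 bins per octave purely for illustration; these values and the function names are not taken from the abstract.

```python
import math

def cqt_bin(freq_hz, f_min_hz=32.70, bins_per_octave=24):
    """Fractional CQT bin index of a frequency (bins are log-spaced)."""
    return bins_per_octave * math.log2(freq_hz / f_min_hz)

def bin_shift(freq_hz, semitones, **cqt_params):
    """How many CQT bins a pitch moves when transposed by `semitones`."""
    shifted_hz = freq_hz * 2.0 ** (semitones / 12.0)
    return cqt_bin(shifted_hz, **cqt_params) - cqt_bin(freq_hz, **cqt_params)

# A one-semitone transposition moves every pitch by bins_per_octave / 12
# bins, independent of the starting frequency.
shifts = [bin_shift(f, semitones=1) for f in (110.0, 220.0, 440.0)]
```

Because the shift in bins is known when the two CQT slices are cut, it can act as the training target for the difference between the two encoder outputs, with no pitch annotations required.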
    We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide variety of distortions. As an alternative, we propose adapting the Fréchet Inception Distance (FID) metric used to evaluate generative image models to the audio domain. FAD is validated using a wide variety of artificial distortions and is compared to the signal-based metrics signal-to-distortion ratio (SDR), cosine distance and magnitude L2 distance. We show that, with a correlation coefficient of 0.52, FAD correlates more closely with human perception than SDR, cosine distance or magnitude L2 distance, which have correlation coefficients of 0.39, -0.15 and -0.01 respectively.
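Like FID, a Fréchet-style metric fits a Gaussian to each of two sets of embeddings and measures the distance between the two Gaussians. The sketch below is a deliberately simplified version that assumes diagonal covariances, in which case the distance reduces to a per-dimension sum. The full metric uses complete covariance matrices and a matrix square root. The embeddings are assumed to be precomputed, and the function name is hypothetical.

```python
import math
from statistics import fmean, pvariance

def frechet_distance_diag(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a and emb_b are lists of equal-dimension embedding vectors.
    Simplifying assumption: covariances are diagonal, so each dimension
    contributes (mu_a - mu_b)^2 + var_a + var_b - 2 * sqrt(var_a * var_b).
    """
    dims = len(emb_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [vec[d] for vec in emb_a]
        col_b = [vec[d] for vec in emb_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        var_a, var_b = pvariance(col_a, mu_a), pvariance(col_b, mu_b)
        total += (mu_a - mu_b) ** 2 + var_a + var_b - 2.0 * math.sqrt(var_a * var_b)
    return total

# Hypothetical one-dimensional embeddings: same spread, means 3 apart.
reference = [[0.0], [2.0]]
evaluated = [[3.0], [5.0]]
distance = frechet_distance_diag(reference, evaluated)
```

In this sketch the reference set would hold embeddings of clean music and the evaluated set embeddings of the processed audio, so no per-clip reference signal is needed.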
    Existing music recognition applications require both user activation and a connection to a server that performs the actual recognition. In this paper we present a low-power music recognizer that runs entirely on a mobile phone and automatically recognizes music without requiring any user activation. A small music detector runs continuously on the mobile phone's DSP (digital signal processor) chip and only wakes the main processor when it is confident that music is present. Once woken, the detector on the main processor is provided with an 8 s buffer of audio, which is then fingerprinted and compared to the stored fingerprints in the on-device fingerprint database of over 70,000 songs.