Félix de Chaumont Quitry
Authored Publications
Disentangling speech from surroundings with neural embeddings
Malcolm Slaney
Neil Zeghidour
ICASSP 2023 (2023)
Abstract
We present a method to separate speech signals from noisy environments in the embedding space of a neural audio codec. We introduce a new training procedure that allows our model to produce structured encodings of audio waveforms given by embedding vectors, where one part of the embedding vector represents the speech signal and the rest represents the environment. We achieve this by partitioning the embeddings of different input waveforms and training the model to faithfully reconstruct audio from mixed partitions, thereby ensuring each partition encodes a separate audio attribute. As use cases, we demonstrate the separation of speech from background noise or from reverberation characteristics. Our method also allows for targeted adjustments of the audio output characteristics.
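To make the training procedure concrete, here is a minimal PyTorch sketch of one training step. It assumes additive speech-plus-noise mixing, an encoder that outputs a flat [batch, dim] embedding, a waveform decoder, and an L1 reconstruction loss; all of these names and choices are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn.functional as F

def disentangle_step(encoder, decoder, speech_a, noise_a, speech_b, noise_b, split):
    # Hypothetical setup: two synthetic mixtures of speech and environment.
    mix_a = speech_a + noise_a
    mix_b = speech_b + noise_b
    z_a = encoder(mix_a)  # [batch, dim] embedding
    z_b = encoder(mix_b)
    # Partition each embedding: the first `split` dimensions encode speech,
    # the remaining dimensions encode the environment.
    z_ab = torch.cat([z_a[:, :split], z_b[:, split:]], dim=1)
    z_ba = torch.cat([z_b[:, :split], z_a[:, split:]], dim=1)
    # Reconstruction targets are the re-mixed waveforms: speech A in
    # environment B, and speech B in environment A.
    loss = F.l1_loss(decoder(z_ab), speech_a + noise_b)
    return loss + F.l1_loss(decoder(z_ba), speech_b + noise_a)

Swapping the environment partition at inference time is what enables the targeted adjustments of the output mentioned above.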
Towards Learning a Universal Non-Semantic Representation of Speech
Joel Shor
Ronnie Zvi Maor
Ira Shavitt
Proc. Interspeech 2020 (2020)
Abstract
The ultimate goal of transfer learning is to enable learning with a small amount of data, by using a strong embedding. While significant progress has been made in the visual and language domains, the speech domain does not have such a universal method. This paper presents a new representation of speech signals based on an unsupervised triplet-loss objective, which outperforms both the existing state of the art and other representations on a number of transfer learning tasks in the non-semantic speech domain. The embedding is learned on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and tasks from the medical domain. The model will be publicly released.
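The triplet objective can be sketched as follows: segments that are close in time within the same clip (anchor and positive) should embed closer together than a segment drawn from elsewhere (negative). This PyTorch sketch assumes an arbitrary encoder `embed` and a margin of 0.1; both are illustrative, not the paper's exact configuration.

import torch.nn.functional as F

def triplet_loss(embed, anchor, positive, negative, margin=0.1):
    za, zp, zn = embed(anchor), embed(positive), embed(negative)
    d_pos = (za - zp).pow(2).sum(dim=1)  # squared distance anchor-positive
    d_neg = (za - zn).pow(2).sum(dim=1)  # squared distance anchor-negative
    # Hinge: the positive must be closer than the negative by at least `margin`.
    return F.relu(d_pos - d_neg + margin).mean()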
Multi-Task Adapters for On-Device Audio Inference
Dominik Roblek
IEEE Signal Processing Letters, 27 (2020), pp. 630-634
Abstract
The deployment of deep networks on mobile devices requires efficient use of scarce computational resources, whether expressed as available memory or as computing cost. When addressing multiple tasks simultaneously, it is extremely important to share resources across tasks, especially when they all consume the same input data, e.g., audio samples captured by the on-board microphones. In this paper we propose a multi-task model architecture that consists of a shared encoder and multiple task-specific adapters. During training, we learn the model parameters as well as the allocation of the task-specific additional resources across both tasks and layers. A global tuning parameter can be used to obtain different multi-task network configurations, finding the desired trade-off between cost and the level of accuracy across tasks. Our results show that this solution significantly outperforms a multi-head model baseline. Interestingly, we observe that the optimal resource allocation depends both on the intrinsic characteristics of each task and on the targeted cost measure (e.g., memory or computing cost).
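The architecture can be illustrated with a minimal PyTorch sketch: a shared encoder, plus one small residual adapter and classification head per task. The bottleneck width of each adapter stands in for the per-task resource allocation that the paper learns; all layer sizes and the residual form below are assumptions for illustration.

import torch.nn as nn

class MultiTaskAdapterNet(nn.Module):
    def __init__(self, in_dim=64, hidden=256, adapter_dims=(16, 32), n_classes=(10, 5)):
        super().__init__()
        # Shared encoder, used by every task on the same input features.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # One bottleneck adapter per task; its width is the task-specific
        # resource budget that the global tuning parameter would trade off.
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, d), nn.ReLU(), nn.Linear(d, hidden))
            for d in adapter_dims)
        self.heads = nn.ModuleList(nn.Linear(hidden, c) for c in n_classes)

    def forward(self, x, task):
        h = self.encoder(x)
        h = h + self.adapters[task](h)  # residual task-specific adaptation
        return self.heads[task](h)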
Pre-training audio representations with self-supervision
Dominik Roblek
IEEE Signal Processing Letters, 27 (2020), pp. 600-604
Abstract
We explore self-supervision as a way to learn general-purpose audio representations. Specifically, we propose two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip. We evaluate how the representations learned via self-supervision transfer to different downstream tasks, either training a task-specific linear classifier on top of the pretrained embeddings, or fine-tuning a model end-to-end for each downstream task. Our results show that the representations learned with Audio2Vec transfer better than those learned by fully supervised training on AudioSet. In addition, by fine-tuning Audio2Vec representations it is possible to outperform fully supervised models trained from scratch on each task when limited data is available, thus improving label efficiency.
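Both pretext tasks are easy to sketch. The PyTorch fragment below assumes spectrograms shaped [batch, bins, frames], generic encoder/decoder/regressor modules, and mean-squared-error losses; these are illustrative assumptions rather than the paper's exact setup.

import torch
import torch.nn.functional as F

def audio2vec_loss(encoder, decoder, spec, t, width):
    # Reconstruct the slice at frame t from `width` past and future slices.
    context = torch.cat([spec[:, :, t - width:t],
                         spec[:, :, t + 1:t + 1 + width]], dim=2)
    return F.mse_loss(decoder(encoder(context)), spec[:, :, t])

def temporal_gap_loss(encoder, regressor, clip, seg_len):
    # Predict the normalized time gap between two random segments of a clip.
    T = clip.shape[-1]
    i = int(torch.randint(0, T - seg_len, (1,)))
    j = int(torch.randint(0, T - seg_len, (1,)))
    za = encoder(clip[:, i:i + seg_len])  # [batch, dim]
    zb = encoder(clip[:, j:j + seg_len])
    target = torch.full((clip.shape[0], 1), abs(i - j) / T)
    return F.mse_loss(regressor(torch.cat([za, zb], dim=1)), target)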
High quality agreement-based semi-supervised training data for acoustic modeling
Asa Oines
Pedro Moreno
2016 IEEE Spoken Language Technology Workshop (SLT)
Abstract
This paper describes a new technique to automatically obtain large high-quality training speech corpora for acoustic modeling. Traditional approaches select utterances based on confidence thresholds and other heuristics. We propose instead to use an ensemble approach: we transcribe each utterance using several recognizers, and only keep those on which they agree. The recognizers we use are trained on data from different dialects of the same language, and this diversity leads them to make different mistakes in transcribing speech utterances. In this work we show, however, that when they agree, this is an extremely strong signal that the transcript is correct. This allows us to produce automatically transcribed speech corpora that are superior in transcript correctness even to those manually transcribed by humans. Furthermore, we show that using the produced semi-supervised data sets, we can train new acoustic models which outperform those trained solely on previously available data sets.
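The selection rule itself is simple. A sketch in plain Python, assuming each recognizer is a callable mapping audio to a transcript string (an illustrative interface, not the production system):

def agreement_filter(utterances, recognizers):
    # Keep an utterance only if every recognizer produces the same transcript.
    selected = []
    for audio in utterances:
        hyps = [asr(audio) for asr in recognizers]
        if all(h == hyps[0] for h in hyps):
            selected.append((audio, hyps[0]))  # unanimous transcript
    return selected

Because the recognizers are trained on different dialects, their errors are largely uncorrelated, which is why unanimous agreement is such a strong correctness signal.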
Abstract
This paper describes a series of experiments to extend the application of Context-Dependent (CD) long short-term memory (LSTM) recurrent neural networks (RNNs) trained with Connectionist Temporal Classification (CTC) and sMBR loss. Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations, application to child speech recognition, combination of multiple models, and convolutional input layers. We also investigate the latency of CTC models and show that constraining the forward-backward alignment in training can reduce the delay of a real-time streaming speech recognition system. Finally, we investigate transferring knowledge from one network to another through alignments.
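The final experiment, transferring knowledge through alignments, amounts to using one network's frame-level label alignment as the training target for another. A minimal PyTorch sketch, with shapes chosen for illustration only:

import torch.nn.functional as F

def alignment_transfer_loss(student_logits, teacher_alignment):
    # student_logits: [batch, frames, classes] from the network being trained.
    # teacher_alignment: [batch, frames] integer labels from the donor network.
    # cross_entropy expects logits as [batch, classes, frames].
    return F.cross_entropy(student_logits.transpose(1, 2), teacher_alignment)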