Ignacio Lopez Moreno
Ignacio Lopez-Moreno received his M.S. degree in Electrical Engineering in 2009 from Universidad Politecnica de Madrid (UPM). He is currently a software engineer at Google in New York, with a particular interest in speech processing, and is pursuing his PhD degree with the Biometric Recognition Group - ATVS at Universidad Autonoma de Madrid. His research interests include speech recognition, speaker verification, language identification, pattern recognition, and forensic evaluation of evidence. He has received several awards and distinctions, such as the IBM Research Best Student Paper award in 2009.
Authored Publications
FedAQT: Accurate Quantized Training with Federated Learning
Renkun Ni
Oleg Rybakov
Phoenix Meadowlark
Tom Goldstein
Federated learning has been widely used to train automatic speech recognition models, where the training procedure is decentralized to client devices and the training data is kept local to avoid privacy concerns. However, the limited computational resources on client devices prevent training with large models. Recently, quantization-aware training has shown the potential to train a quantized neural network with performance comparable to the full-precision model while keeping the model size small and inference fast. However, these quantization methods do not save memory during training, since they still keep a full-precision copy of the model. To address this issue, we propose a new quantized training framework for federated learning that saves memory by training with quantized variables directly on local devices. We empirically show that our method achieves comparable WER while using only 60% of the memory of the full-precision model.
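The memory saving comes from never keeping a persistent full-precision copy of the weights on the client. Below is a minimal numpy sketch of that idea, not the paper's actual FedAQT recipe: weights are stored as int8 codes plus a scale, dequantized transiently for a gradient step, and immediately re-quantized. The linear model, the symmetric quantizer, and all hyperparameters are illustrative assumptions.

import numpy as np

def quantize(w, num_bits=8):
    # Symmetric per-tensor quantization: int8 codes plus a float scale.
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def local_training_step(q_weights, scale, x, y, lr=0.01):
    # One quantized client step; only the int8 codes are stored between steps.
    w = dequantize(q_weights, scale)          # transient float view
    grad = 2 * x.T @ (x @ w - y) / len(x)     # gradient of a squared-error loss
    w -= lr * grad
    return quantize(w)                        # re-quantize before storing

# Toy usage on a single simulated client (linear regression stand-in).
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = x @ np.array([[1.0], [-2.0], [0.5], [3.0]])
q_w, s = quantize(rng.normal(size=(4, 1)))
for _ in range(200):
    q_w, s = local_training_step(q_w, s, x, y)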
Augmenting Transformer-Transducer Based Speaker Change Detection With Token-Level Training Loss
Han Lu
Yiling Huang
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this work we propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance. The conventional T-T based SCD loss optimizes all output tokens equally; because speaker changes are sparse in the training data, this leads to sub-optimal detection accuracy. To mitigate this issue, we use a customized edit-distance algorithm to estimate the SCD false accept (FA) and false reject (FR) rates during training and optimize model parameters to minimize a weighted combination of the FA and FR, focusing the model on accurately predicting speaker changes. Experiments on a group of challenging real-world datasets show that the proposed training method can significantly improve the overall performance of the SCD model with the same number of parameters.
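As a rough illustration of optimizing a weighted FA/FR combination rather than treating all tokens equally, the sketch below computes expected FA and FR counts from per-position speaker-change posteriors. It assumes reference and hypothesis are already aligned position by position (the paper instead uses a customized edit-distance during training), and the weights are made up.

import numpy as np

def weighted_fa_fr_loss(p_change, is_change, fa_weight=1.0, fr_weight=5.0):
    # p_change:  model posterior of the speaker-change token per position.
    # is_change: 1 where the reference marks a speaker change, else 0.
    # Expected-count surrogate and weights are illustrative assumptions.
    p_change = np.asarray(p_change, dtype=np.float64)
    is_change = np.asarray(is_change, dtype=np.float64)
    exp_fa = np.sum(p_change * (1.0 - is_change))   # changes hypothesized where none exist
    exp_fr = np.sum((1.0 - p_change) * is_change)   # true changes the model misses
    num_neg = max(np.sum(1.0 - is_change), 1.0)
    num_pos = max(np.sum(is_change), 1.0)
    return fa_weight * exp_fa / num_neg + fr_weight * exp_fr / num_pos

# Sparse changes: 2 true change points among 12 positions.
ref = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
post = [0.1, 0.0, 0.2, 0.7, 0.1, 0.0, 0.1, 0.3, 0.4, 0.0, 0.1, 0.0]
print(weighted_fa_fr_loss(post, ref))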
Exploring sequence-to-sequence Transformer-Transducer models for keyword spotting
Beltrán Labrador
Angelo Scorza Scarpati
Liam Fowl
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this paper, we present a novel approach to adapt a sequence-to-sequence Transformer-Transducer ASR system to the keyword spotting (KWS) task. We achieve this by replacing the keyword in the text transcription with a special token kw and training the system to detect the kw token in an audio stream. At inference time, we create a decision function inspired by conventional KWS approaches to make our approach more suitable for the KWS task. Furthermore, we introduce a specific keyword spotting loss by adapting the sequence-discriminative Minimum Bayes-Risk training technique. We find that our approach significantly outperforms ASR-based KWS systems. Compared with a conventional keyword spotting system, our proposal achieves similar performance while bringing the advantages and flexibility of sequence-to-sequence training. Additionally, when combined with the conventional KWS system, our approach can improve the performance at any operating point.
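The two ingredients, replacing the keyword text with a special token and adding a KWS-style decision function at inference, could be wired up roughly as in the sketch below. The <kw> spelling, the sliding-window maximum rule, and the threshold are illustrative assumptions rather than the paper's exact design.

import re
import numpy as np

KW_TOKEN = "<kw>"   # special token standing in for the keyword text

def replace_keyword(transcript, keyword):
    # Replace the spoken keyword in a training transcript with the special token.
    return re.sub(re.escape(keyword), KW_TOKEN, transcript, flags=re.IGNORECASE)

def kws_decision(kw_posteriors, threshold=0.6, window=20):
    # Fire if the maximum <kw>-token posterior within a sliding window
    # exceeds a threshold (hypothetical decision rule for illustration).
    p = np.asarray(kw_posteriors)
    for start in range(0, max(len(p) - window + 1, 1)):
        if p[start:start + window].max() >= threshold:
            return True, start
    return False, None

print(replace_keyword("ok google play some music", "ok google"))
print(kws_decision([0.01, 0.05, 0.72, 0.3, 0.1]))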
Turn-To-Diarize: Online speaker diarization constrained by transformer transducer speaker turn detection
Han Lu
Wei Xia
Submitted to ICASSP 2022, IEEE (2021)
In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect speaker turns, represent each speaker turn by a speaker embedding, and then cluster these embeddings with constraints from the detected speaker turns.
Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems, which require annotations of timestamped speaker labels, our system only requires speaker turn tokens to be included during transcription, which largely reduces the human effort involved in data collection.
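One way to picture the constrained clustering step: each detected speaker turn contributes one embedding, and consecutive turns separated by a detected change receive a cannot-link constraint. The greedy agglomerative scheme below is an illustrative stand-in for the constrained clustering actually used in the paper.

import numpy as np

def cluster_turns(embeddings, cannot_link, threshold=0.5):
    # Greedy agglomerative clustering of per-turn speaker embeddings with
    # cannot-link constraints between consecutive turns (a detected speaker
    # change means turns i and i+1 should not share a label).
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusters = [{i} for i in range(len(emb))]
    while True:
        best, best_sim = None, threshold
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any((i, j) in cannot_link or (j, i) in cannot_link
                       for i in clusters[a] for j in clusters[b]):
                    continue
                ca = emb[list(clusters[a])].mean(axis=0)
                cb = emb[list(clusters[b])].mean(axis=0)
                sim = float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))
                if sim > best_sim:
                    best, best_sim = (a, b), sim
        if best is None:
            break
        a, b = best
        clusters[a] |= clusters.pop(b)
    labels = np.empty(len(emb), dtype=int)
    for k, members in enumerate(clusters):
        labels[list(members)] = k
    return labels

# Four turns from two speakers; every consecutive pair had a detected change.
turns = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
print(cluster_turns(turns, cannot_link={(0, 1), (1, 2), (2, 3)}))   # [0 1 0 1]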
SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System
Roza Chojnacka
Jason Pelecanos
arXiv preprint arXiv:2104.02125 (2021)
In this paper, we describe SpeakerStew, a hybrid system to perform speaker verification on 46 languages. Two core ideas were explored in this system: (1) pooling training data of different languages together for multilingual generalization and reduced development cycles; (2) a triage mechanism between text-dependent and text-independent models to reduce runtime cost and expected latency. To the best of our knowledge, this is the first study of speaker verification systems at the scale of 46 languages. The problem is framed from the perspective of using a smart speaker device with interactions consisting of a wake-up keyword (text-dependent) followed by a speech query (text-independent). Experimental evidence suggests that training on multiple languages can generalize to unseen varieties while maintaining performance on seen varieties. We also found that it can reduce computational requirements for training models by an order of magnitude. Furthermore, during model inference on English data, we observe that leveraging a triage framework can reduce the number of calls to the more computationally expensive text-independent system by 73% (and reduce latency by 60%) while maintaining an EER no worse than the text-independent setup.
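A sketch of how such a triage could look in code: the cheap text-dependent scorer sees only the wake-up keyword, and the expensive text-independent scorer is invoked only when the first score is inconclusive. The two-sided thresholds and the toy scorers are assumptions for illustration, not the system's actual decision logic.

def verify_speaker(keyword_audio, query_audio, td_model, ti_model,
                   accept_thresh=0.9, reject_thresh=0.1):
    # Triage between a cheap text-dependent (TD) scorer and an expensive
    # text-independent (TI) scorer; thresholds are hypothetical.
    td_score = td_model(keyword_audio)          # cheap: keyword only
    if td_score >= accept_thresh:
        return True, "td"                       # confident accept, skip TI
    if td_score <= reject_thresh:
        return False, "td"                      # confident reject, skip TI
    ti_score = ti_model(query_audio)            # expensive: full query
    return ti_score >= 0.5, "ti"

# Toy scorers standing in for the real models.
print(verify_speaker("<keyword>", "<query>",
                     td_model=lambda audio: 0.95,
                     ti_model=lambda audio: 0.4))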
Training Keyword Spotting Models on Non-IID Data with Federated Learning
Aishanee Shah
Cameron Nguyen
Niranjan Subrahmanya
Pai Zhu
Interspeech (2020)
We demonstrate that a production-quality keyword-spotting model can be trained on-device using federated learning and achieve false accept and false reject rates comparable to a centrally trained model. To overcome the algorithmic constraints associated with fitting on-device data (which is inherently non-IID), we conduct thorough empirical studies of optimization algorithms and hyperparameter configurations using large-scale federated simulations. We also explore techniques for utterance augmentation and data labeling to overcome the physical limitations of on-device training.
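For intuition, the sketch below simulates one ingredient of this setup: federated averaging over clients whose label distributions are deliberately skewed (non-IID). The logistic-regression stand-in, local epoch count, and learning rate are illustrative assumptions, not the production keyword-spotting configuration.

import numpy as np

def client_update(w, x, y, lr=0.1, epochs=5):
    # Local SGD on one client's (non-IID) examples; logistic-regression stand-in.
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w = w - lr * x.T @ (p - y) / len(x)
    return w

def federated_round(w_global, clients):
    # One round of federated averaging, weighted by client dataset size.
    updates = [(client_update(w_global.copy(), x, y), len(x)) for x, y in clients]
    total = sum(n for _, n in updates)
    return sum(n * w for w, n in updates) / total

# Two toy clients with skewed label distributions.
rng = np.random.default_rng(1)
clients = [
    (rng.normal(size=(40, 3)), (rng.random(40) < 0.9).astype(float)),   # mostly positives
    (rng.normal(size=(60, 3)), (rng.random(60) < 0.1).astype(float)),   # mostly negatives
]
w = np.zeros(3)
for _ in range(10):
    w = federated_round(w, clients)
print(w)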
This paper discusses one of the most challenging practical engineering problems in speaker recognition systems: the version control of models and user profiles. A typical speaker recognition system consists of two stages: the enrollment stage, where a profile is generated from user-provided enrollment audio; and the runtime stage, where the voice identity of the runtime audio is compared against the stored profiles. As technology advances, the speaker recognition system needs to be updated for better performance. However, if the stored user profiles are not updated accordingly, version mismatch will result in meaningless recognition results. In this paper, we describe different version control strategies for different types of speaker recognition systems, according to how they are deployed in the production environment.
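A toy example of the kind of version check this implies is shown below. The profile fields, the silent-migration path via retained enrollment audio, and the re-enrollment fallback are hypothetical illustrations, not the specific strategies evaluated in the paper.

from dataclasses import dataclass, field

@dataclass
class Profile:
    user_id: str
    model_version: int
    embedding: list                          # speaker embedding computed at enrollment
    enrollment_audio: list = field(default_factory=list)  # kept only if the product retains it

MODEL_VERSION = 4                            # version of the currently deployed recognizer

def load_profile(profile, recompute_embedding):
    # Comparing a v3 profile against a v4 model would produce meaningless
    # scores, so either migrate the profile or force re-enrollment.
    if profile.model_version == MODEL_VERSION:
        return profile
    if profile.enrollment_audio:             # strategy 1: silent migration
        profile.embedding = recompute_embedding(profile.enrollment_audio)
        profile.model_version = MODEL_VERSION
        return profile
    raise RuntimeError("Profile/model version mismatch: ask the user to re-enroll")

old = Profile("user_1", model_version=3, embedding=[0.1, 0.2], enrollment_audio=["utt1.wav"])
migrated = load_profile(old, recompute_embedding=lambda audio: [0.3, 0.4])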
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Mert Saglam
Alan Chiao
Renjie Liu
Wei Li
Jason Pelecanos
Marily Nika
Interspeech (2020)
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on-device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: it should improve the performance when the input signal consists of overlapped speech, and it must not hurt the speech recognition performance under all other acoustic conditions. In addition, the model must be tiny and fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery, and latency. We propose novel techniques to meet these multi-faceted requirements, including using a new asymmetric loss and adopting adaptive runtime suppression strength. We also show that such a model can be quantized as an 8-bit integer model and run in real time.
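The asymmetric loss can be pictured as weighting the two error directions differently, so that removing speech that should be kept costs more than leaving residual noise. The sketch below shows one plausible form; the exact formulation and the value of alpha are assumptions rather than the precise loss used in VoiceFilter-Lite.

import numpy as np

def asymmetric_l2_loss(clean, enhanced, alpha=4.0):
    # Over-suppression (output magnitude below the clean target) is penalized
    # alpha times more heavily than residual noise; alpha is illustrative.
    diff = np.asarray(clean) - np.asarray(enhanced)   # > 0 where the output over-suppresses
    weight = np.where(diff > 0, alpha, 1.0)
    return float(np.mean(weight * diff ** 2))

# Toy magnitude spectra: the first output over-suppresses, the second leaves residual noise.
clean = np.array([1.0, 2.0, 3.0])
print(asymmetric_l2_loss(clean, enhanced=np.array([0.5, 1.5, 2.5])))
print(asymmetric_l2_loss(clean, enhanced=np.array([1.5, 2.5, 3.5])))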
In many scenarios of a language identification task, the user specifies, ahead of real-time identification, the set of languages he or she speaks out of a large set of all supported languages. We model this prior knowledge in the way we train our neural networks by replacing the commonly used softmax loss function with a novel loss function named tuplemax loss. For example, a language identification system launched in North America may have 95% of users speaking at most two languages. Together with a sliding-window LSTM inference approach, our language identification system achieves a 2.33% error rate, which is a relative 48.5% improvement over the 4.50% error rate of the standard softmax loss method.
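The sketch below illustrates the intuition by restricting the softmax normalization to the user's declared language set; the paper's tuplemax loss additionally averages over possible tuples, so this single-tuple form, along with the toy logits, is a simplifying assumption.

import numpy as np

def tuple_softmax_loss(logits, target, candidates):
    # Cross-entropy restricted to the user's declared language set: at run time
    # the model only needs to discriminate among the few languages the user
    # selected, so the loss normalizes over that tuple rather than all languages.
    z = np.asarray(logits, dtype=np.float64)[list(candidates)]
    z = z - z.max()                                   # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[list(candidates).index(target)]

# 4-language model; the user speaks languages 0 and 2 only.
logits = [2.0, 5.0, 1.5, -1.0]     # a full softmax would be dominated by language 1
print(tuple_softmax_loss(logits, target=0, candidates=[0, 2]))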
In this paper, we propose Personal VAD, a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-like neural network that is conditioned on the target speaker embedding or the speaker verification score. For every frame, Personal VAD outputs scores for three classes: non-speech, target speaker speech, and non-target speaker speech. With our optimal setup, we are able to train a 130KB model that outperforms a baseline system in which individually trained standard VAD and speaker recognition networks are combined to perform the same task.
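A minimal sketch of the conditioning idea, concatenating the target speaker's embedding to every frame's features before a small classifier over the three classes, is shown below. The feed-forward architecture and dimensions are illustrative assumptions; the actual 130KB model described above is a VAD-like network conditioned on the speaker embedding or verification score.

import numpy as np

NUM_CLASSES = 3   # 0: non-speech, 1: target speaker speech, 2: non-target speaker speech

def personal_vad_frame_scores(frame_features, target_dvector, w1, b1, w2, b2):
    # Score every frame with a tiny feed-forward net conditioned on the target
    # speaker's embedding (concatenated to each frame's acoustic features).
    dvec = np.broadcast_to(target_dvector, (len(frame_features), len(target_dvector)))
    x = np.concatenate([frame_features, dvec], axis=1)      # [T, feat + dvec]
    h = np.maximum(x @ w1 + b1, 0.0)                        # ReLU hidden layer
    logits = h @ w2 + b2                                    # [T, NUM_CLASSES]
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)                 # per-frame class posteriors

# Toy dimensions: 40-dim features, 256-dim d-vector, 64 hidden units, 5 frames.
rng = np.random.default_rng(0)
scores = personal_vad_frame_scores(
    rng.normal(size=(5, 40)), rng.normal(size=256),
    rng.normal(size=(296, 64)) * 0.1, np.zeros(64),
    rng.normal(size=(64, NUM_CLASSES)) * 0.1, np.zeros(NUM_CLASSES))
print(scores.shape, scores.sum(axis=1))   # (5, 3), each row sums to 1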