Fadi Biadsy
Authored Publications
Extending Parrotron: An End-to-End, Speech Conversion and Speech Recognition Model for Atypical Speech
Rohan Doshi
Youzheng Chen
Liyang Jiang
Xia Zhang
Andrea Chu
Pedro Jose Moreno Mengibar
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Preview abstract
We present an extended Parrotron model: a single, end-to-end model that enables voice conversion and recognition simultaneously. Input spectrograms are transformed to output spectrograms in the voice of a predetermined target speaker while also generating hypotheses in the target vocabulary. We study the performance of this novel architecture that jointly predicts speech and text on atypical (‘dysarthric’) speech. We show that with as little as an hour of atypical speech, speaker adaptation can yield up to 67% relative reduction in Word Error Rate (WER). We also show that data augmentation using a customized synthesizer built on the atypical speech can provide an additional 10% relative improvement over the best speaker-adapted model. Finally, we show that these methods generalize across 8 dysarthria etiologies with a range of severities.
View details
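The abstract above describes a single model that jointly predicts a converted spectrogram and a text hypothesis. Below is a minimal sketch of the kind of joint objective such a model could be trained with, combining a spectrogram-reconstruction loss with a cross-entropy loss on the text decoder; the loss shapes, the L1+L2 choice, and the weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def joint_parrotron_loss(pred_spec, target_spec, text_logits, target_tokens,
                         text_weight=1.0, pad_id=0):
    """Hypothetical joint loss: spectrogram conversion + text recognition.

    pred_spec:     (batch, frames, mel_bins) predicted target-voice spectrogram
    target_spec:   (batch, frames, mel_bins) reference target-voice spectrogram
    text_logits:   (batch, tokens, vocab) decoder logits over the target vocabulary
    target_tokens: (batch, tokens) reference transcript token ids
    """
    # L1 + L2 reconstruction of the output spectrogram (one common choice).
    spec_loss = F.l1_loss(pred_spec, target_spec) + F.mse_loss(pred_spec, target_spec)
    # Cross-entropy over the text hypotheses, ignoring padding.
    ce_loss = F.cross_entropy(text_logits.transpose(1, 2), target_tokens,
                              ignore_index=pad_id)
    return spec_loss + text_weight * ce_loss
```

Speaker adaptation as described in the abstract would then fine-tune a model trained with an objective of this kind on roughly an hour of the target speaker's recordings, optionally mixed with utterances from a customized synthesizer for augmentation.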
Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech
Zhehuai Chen
Xia Zhang
Youzheng Chen
Liyang Jiang
Andrea Chu
Rohan Doshi
Pedro Jose Moreno Mengibar
Interspeech 2021 (2021)
Preview abstract
Parrotron is an end-to-end personalizable model that enables many-to-one voice conversion and Automatic Speech Recognition (ASR) simultaneously for atypical speech. In this work, we present the next-generation Parrotron model with improvements in overall performance and in training and inference speed. The proposed architecture builds on the recently popularized Conformer encoder, comprising convolution- and attention-based blocks used in ASR. We introduce architectural modifications that sub-sample encoder activations to achieve speed-ups in training and inference. We show that jointly improving ASR and voice conversion quality requires a corresponding up-sampling in the decoder network. We provide an in-depth analysis of how the proposed approach can maximize the efficiency of a speech-to-speech conversion model in the context of atypical speech. Experiments on both many-to-one and one-to-one dysarthric speech conversion tasks show that we can achieve up to a 7x speedup and a 35% relative reduction in WER over the previous best Transformer-based Parrotron model. We also show that these techniques are general enough to provide similar wins on the Transformer-based Parrotron model.
View details
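A minimal sketch of the sub-sample-then-up-sample idea described in the abstract above: a strided convolution reduces the encoder's time resolution so the heavy blocks run over fewer frames, and a transposed convolution restores the frame rate before the decoder. The layer sizes, the factor of 4, and the module names are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class SubsampleUpsample(nn.Module):
    """Illustrative time sub-sampling / up-sampling around an encoder stack."""

    def __init__(self, dim=256, factor=4):
        super().__init__()
        # Strided 1-D convolution: (batch, dim, frames) -> (batch, dim, frames / factor)
        self.down = nn.Conv1d(dim, dim, kernel_size=factor, stride=factor)
        # Transposed convolution restores the original frame rate for the decoder.
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=factor, stride=factor)

    def forward(self, encoder_activations):
        # encoder_activations: (batch, frames, dim)
        x = encoder_activations.transpose(1, 2)   # -> (batch, dim, frames)
        x = self.down(x)                          # fewer frames: cheaper attention and decoding
        # ... Conformer-style blocks would operate on the shorter sequence here ...
        x = self.up(x)                            # back to the frame rate the decoder expects
        return x.transpose(1, 2)                  # -> (batch, frames, dim)
```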
Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech
Vicky Zayats
Dirk Padfield
Proceedings of EMNLP 2021 (2021)
Preview abstract
Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve similar adaptation gains compared to model fine-tuning, while only updating a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
View details
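A minimal sketch of a residual adapter of the kind the abstract above refers to: a small bottleneck network added to a layer's output, with the pre-trained model frozen so only the adapters are updated per speaker. The dimensions and the freezing helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck adapter: x + up_project(relu(down_project(layer_norm(x))))."""

    def __init__(self, model_dim=512, bottleneck_dim=16):
        super().__init__()
        self.norm = nn.LayerNorm(model_dim)
        self.down = nn.Linear(model_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, model_dim)

    def forward(self, x):
        # The residual connection keeps the pre-trained behavior when the adapter output is near zero.
        return x + self.up(torch.relu(self.down(self.norm(x))))

def freeze_all_but_adapters(model):
    """Per-speaker adaptation updates only parameters living inside adapter modules."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```

With a 512-dimensional encoder and a bottleneck of 16, each adapter adds only a few thousand weights per layer, which is how the per-speaker overhead stays a tiny fraction of the full model.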
Preview abstract
We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speech from any speaker regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from a deaf speaker, resulting in significant improvements in intelligibility and naturalness, measured via a speech recognizer and listening tests. Finally, demonstrating the utility of this model on other speech tasks, we show that the same model architecture can be trained to perform a speech separation task.
View details
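A minimal structural sketch of the composition the abstract above names (encoder, spectrogram decoder, auxiliary phoneme decoder, vocoder); the module interfaces are placeholders assumed for illustration, not the published network.

```python
import torch
import torch.nn as nn

class ParrotronLikeModel(nn.Module):
    """Illustrative speech-to-speech conversion skeleton (not the published network)."""

    def __init__(self, encoder, spectrogram_decoder, phoneme_decoder, vocoder):
        super().__init__()
        self.encoder = encoder                          # input spectrogram -> hidden states
        self.spectrogram_decoder = spectrogram_decoder  # hidden states -> target-voice spectrogram
        self.phoneme_decoder = phoneme_decoder          # auxiliary decoder, typically used only in training
        self.vocoder = vocoder                          # spectrogram -> time-domain waveform

    def forward(self, input_spectrogram):
        hidden = self.encoder(input_spectrogram)
        out_spec = self.spectrogram_decoder(hidden)
        phoneme_logits = self.phoneme_decoder(hidden)   # encourages linguistic content to be preserved
        waveform = self.vocoder(out_spec)
        return out_spec, phoneme_logits, waveform
```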
Direct speech-to-speech translation with a sequence-to-sequence model
Ye Jia
Interspeech (2019)
Preview abstract
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.
View details
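The voice-preservation result in the abstract above relies on conditioning the synthesis on an embedding of the source speaker. Below is a minimal sketch of one such conditioning scheme; the speaker-encoder interface and the concatenation approach are assumptions for illustration, not the paper's exact design.

```python
import torch

def condition_on_speaker(encoder_states, speaker_embedding):
    """Broadcast a per-utterance speaker embedding across time and concatenate it
    to the encoder states, so the decoder can synthesize in the source voice.

    encoder_states:    (batch, frames, dim)
    speaker_embedding: (batch, speaker_dim), e.g. from a pre-trained speaker encoder
    """
    frames = encoder_states.size(1)
    tiled = speaker_embedding.unsqueeze(1).expand(-1, frames, -1)
    return torch.cat([encoder_states, tiled], dim=-1)
```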
Sparse Non-negative Matrix Language Modeling: Maximum Entropy Flexibility on the Cheap
The 18th Annual Conference of the International Speech Communication Association (Interspeech), Stockholm, Sweden, pp. 2725-2729 (2017)
Preview abstract
We present a new method for estimating the sparse non-negative model (SNM) using a small amount of held-out data and the multinomial loss that is natural for language modeling; we validate it experimentally against the previous estimation method, which uses leave-one-out on training data and a binary loss function, and show that it performs equally well. Being able to train on held-out data is very important in practical situations where the training data is mismatched with the held-out/test data. We find that fairly small amounts of held-out data (on the order of 30-70 thousand words) are sufficient for training the adjustment model, which is the only model component estimated using gradient descent; the bulk of the model parameters are relative frequencies counted on the training data.
A second contribution is a comparison between SNM and the related class of Maximum Entropy language models. We show that, while much cheaper computationally, SNM achieves slightly better perplexity results for the same feature set and the same speech recognition accuracy on voice search and short message dictation.
View details
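A highly simplified sketch of the division of labor described in the abstract above: the bulk of the parameters are relative frequencies counted on training data, while a small adjustment model (here reduced to one weight per feature template, a deliberate simplification) is the only part trained by gradient descent, using the multinomial loss evaluated on held-out data. The feature and template definitions are illustrative assumptions.

```python
import math
from collections import defaultdict

# Bulk of the parameters: relative frequencies counted on training data,
# keyed by (feature, word); a "feature" here is e.g. an n-gram context.
rel_freq = defaultdict(float)     # rel_freq[(feature, word)] ~ count(feature, word) / count(feature)

# Tiny adjustment model: one log-scale weight per feature template,
# trained by gradient descent on held-out data (the simplification assumed here).
adjustment = defaultdict(float)   # adjustment[template(feature)]

def template(feature):
    """Illustrative metafeature: the feature's order (unigram, bigram, ...)."""
    return len(feature)

def prob(word, active_features, vocab):
    """SNM-style unnormalized scores, normalized into a multinomial probability."""
    y = {w: sum(rel_freq[(f, w)] * math.exp(adjustment[template(f)])
                for f in active_features)
         for w in vocab}
    total = sum(y.values()) or 1.0
    return y[word] / total

def heldout_multinomial_loss(heldout_events, vocab):
    """Average cross-entropy over held-out (active_features, word) events."""
    return -sum(math.log(max(prob(w, feats, vocab), 1e-12))
                for feats, w in heldout_events) / max(len(heldout_events), 1)
```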
Preview abstract
Maximum Entropy (MaxEnt) Language Models (LMs) are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework by optimizing a convex loss function. In addition to their flexibility, a key advantage is their scalability, in terms of model size and the amount of data that can be used during training. We present the following two contributions to MaxEnt training: (1) by leveraging smaller amounts of transcribed data, we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted to better match the test distribution of speech recognition; (2) a novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model. We test the impact of these approaches on Google's state-of-the-art speech recognizer for the tasks of voice-search transcription and dictation. Training 10B-parameter models on a corpus of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Human evaluations also show significant improvements on a wide range of domains from using non-linguistic signals. For example, adapting to geographical domains (e.g., US states and cities) affects about 4% of test utterances, with a 2:1 win-to-loss ratio.
View details
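A toy sketch of how a non-linguistic signal can enter a log-linear (MaxEnt) LM as extra sparse features crossed with the linguistic context, in the spirit of the abstract above; the feature naming, the bigram-only context, and the geo signal are illustrative assumptions, not the production feature set.

```python
import math
from collections import defaultdict

weights = defaultdict(float)   # one learned weight per sparse feature string

def features(word, history, geo=None):
    """Linguistic n-gram features plus optional non-linguistic (geo-crossed) features."""
    prev = history[-1] if history else "<s>"
    feats = [f"unigram:{word}", f"bigram:{prev}_{word}"]
    if geo is not None:
        # Non-linguistic signal crossed with the prediction, as a separate feature space.
        feats.append(f"geo:{geo}|unigram:{word}")
        feats.append(f"geo:{geo}|bigram:{prev}_{word}")
    return feats

def maxent_prob(word, history, vocab, geo=None):
    """P(word | history, geo) under a log-linear model over sparse features."""
    def logit(w):
        return sum(weights[f] for f in features(w, history, geo))
    denom = sum(math.exp(logit(w)) for w in vocab)
    return math.exp(logit(word)) / denom
```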
Approaches for Neural-Network Language Model Adaptation
Michael Alexander Nirschl
Min Ma
Interspeech 2017, Stockholm, Sweden (2017)
Preview abstract
Language Models (LMs) for Automatic Speech Recognition (ASR) are typically trained on large text corpora from news articles, books and web documents. These types of corpora, however, are unlikely to match the test distribution of ASR systems, which expect spoken utterances. Therefore, the LM is typically adapted to a smaller held-out in-domain dataset that is drawn from the test distribution. We present three LM adaptation approaches for deep neural network (NN) and Long Short-Term Memory (LSTM) models: (1) adapting the softmax layer in the NN; (2) adding a non-linear adaptation layer before the softmax layer that is trained only in the adaptation phase; (3) training the extra non-linear adaptation layer in both the pre-training and adaptation phases. Aiming to improve upon a hierarchical Maximum Entropy (MaxEnt) second-pass LM baseline, which factors the model into word-cluster and word models, we build an NN LM that predicts only word clusters. Adapting the LSTM LM by training the adaptation layer in both the training and adaptation phases (Approach 3), we reduce the cluster perplexity by 30% compared to an unadapted LSTM model. Initial experiments using a state-of-the-art ASR system show a 2.3% relative reduction in WER on top of an adapted MaxEnt LM.
View details
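A minimal sketch of Approach 2 from the abstract above: insert a non-linear adaptation layer just before the softmax and, during the adaptation phase, update only that layer. The layer sizes, the Tanh non-linearity, and the freezing helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptableClusterLM(nn.Module):
    """LSTM LM over word clusters with an extra adaptation layer before the softmax."""

    def __init__(self, num_clusters=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(num_clusters, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Non-linear adaptation layer (Approaches 2 and 3 in the abstract).
        self.adapt = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh())
        self.softmax_layer = nn.Linear(hidden_dim, num_clusters)

    def forward(self, cluster_ids):
        h, _ = self.lstm(self.embed(cluster_ids))
        return self.softmax_layer(self.adapt(h))

def prepare_for_adaptation(model):
    """Approach 2: freeze everything, then train only the adaptation layer
    on the held-out in-domain data."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.adapt.parameters():
        p.requires_grad = True
```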
Preview abstract
Maximum Entropy (MaxEnt) language models are linear models that are typically regularized via well-known L1 or L2 terms in the likelihood objective, hence avoiding the need for the kinds of backoff or mixture weights used in smoothed n-gram language models using Katz backoff and similar techniques. Even though a backoff cost is not required to regularize the model, we investigate the use of backoff features in MaxEnt models, as well as some backoff-inspired variants. These features are shown to improve model quality substantially, as measured by perplexity and word error rate reductions, even in very large scale training scenarios of tens or hundreds of billions of words and hundreds of millions of features.
View details
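A small sketch of what backoff-inspired features might look like alongside standard n-gram features: in addition to the full context, the model receives features for successively shorter suffixes of the history, plus an indicator that a longer context was not observed, letting the model learn something analogous to a backoff cost. The exact feature inventory here is an assumption for illustration, not the paper's definition.

```python
def ngram_and_backoff_features(word, history, max_order=3, seen_contexts=frozenset()):
    """Standard n-gram features plus backoff-inspired ones (illustrative)."""
    feats = []
    for order in range(1, max_order + 1):
        context = tuple(history[-(order - 1):]) if order > 1 else ()
        feats.append(("ngram", context, word))
        # Backoff-inspired indicator: this context was never observed in training,
        # so the model can learn a penalty/bonus analogous to a backoff cost.
        if order > 1 and context not in seen_contexts:
            feats.append(("backoff", order, word))
    return feats

# Example: trigram context "play some", predicting "music"
print(ngram_and_backoff_features("music", ["play", "some"]))
```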
Preview abstract
In this paper we introduce JustSpeak, a universal voice control solution for non-visual access to the Android operating system. JustSpeak offers two contributions as compared to existing systems. First, it enables system-wide voice control on Android that can accommodate any application. JustSpeak constructs the set of available voice commands based on application context; these commands are directly synthesized from on-screen labels and accessibility metadata, and require no further intervention from the application developer. Second, it provides more efficient and natural interaction with support for multiple voice commands in the same utterance. We present the system design of JustSpeak and describe its utility in various use cases. We then discuss the system-level support required by a service like JustSpeak on other platforms. By eliminating the target locating and pointing tasks, JustSpeak can significantly improve the experience of graphical interface interaction for blind and motion-impaired users.
View details
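A toy sketch of the core mechanism described in the abstract above: the set of available voice commands is derived from on-screen labels and accessibility metadata, and a single utterance may chain several commands. The node structure and the greedy matching rule are assumptions for illustration, not the JustSpeak implementation.

```python
from dataclasses import dataclass

@dataclass
class AccessibilityNode:
    label: str          # on-screen label or accessibility description
    action: str         # e.g. "click", "toggle"

def build_commands(nodes):
    """Derive the currently available voice commands from screen metadata."""
    return {node.label.lower(): node for node in nodes if node.label}

def match_utterance(utterance, commands):
    """Greedy left-to-right match allowing multiple commands per utterance."""
    matched, words = [], utterance.lower().split()
    i = 0
    while i < len(words):
        # Try the longest label starting at position i.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in commands:
                matched.append(commands[phrase])
                i = j
                break
        else:
            i += 1  # skip filler words like "and" or "then"
    return matched

screen = [AccessibilityNode("Compose", "click"), AccessibilityNode("Send", "click")]
print([n.label for n in match_utterance("compose and send", build_commands(screen))])
```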