Fadi Biadsy

Authored Publications
    Extending Parrotron: An End-to-End, Speech Conversion and Speech Recognition Model for Atypical Speech
    Rohan Doshi
    Youzheng Chen
    Liyang Jiang
    Xia Zhang
    Andrea Chu
    Pedro Jose Moreno Mengibar
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
    We present an extended Parrotron model: a single, end-to-end model that enables voice conversion and recognition simultaneously. Input spectrograms are transformed to output spectrograms in the voice of a predetermined target speaker while also generating hypotheses in the target vocabulary. We study the performance of this novel architecture that jointly predicts speech and text on atypical (‘dysarthric’) speech. We show that with as little as an hour of atypical speech, speaker adaptation can yield up to 67% relative reduction in Word Error Rate (WER). We also show that data augmentation using a customized synthesizer built on the atypical speech can provide an additional 10% relative improvement over the best speaker-adapted model. Finally, we show that these methods generalize across 8 dysarthria etiologies with a range of severities.
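    The "relative reduction in WER" figures quoted in these abstracts compare an adapted model's error rate with a baseline's. The sketch below shows only the arithmetic; the WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the fraction by which WER drops relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the calculation.
baseline = 30.0  # WER (%) before speaker adaptation
adapted = 9.9    # WER (%) after speaker adaptation

reduction = relative_wer_reduction(baseline, adapted)
print(f"{reduction:.0%} relative reduction")
```

    A relative reduction is measured against the baseline error rate, so the same absolute drop in WER counts for more when the baseline is already low.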
    Conformer Parrotron: a Faster and Stronger End-to-end Speech Conversion and Recognition Model for Atypical Speech
    Zhehuai Chen
    Xia Zhang
    Youzheng Chen
    Liyang Jiang
    Andrea Chu
    Rohan Doshi
    Pedro Jose Moreno Mengibar
    Interspeech 2021 (2021)
    Parrotron is an end-to-end personalizable model that enables many-to-one voice conversion and Automatic Speech Recognition (ASR) simultaneously for atypical speech. In this work, we present the next-generation Parrotron model with improvements in overall performance and in training and inference speeds. The proposed architecture builds on the recently popularized Conformer encoder, which comprises convolution- and attention-based blocks used in ASR. We introduce architectural modifications that sub-sample encoder activations to achieve speed-ups in training and inference. We show that jointly improving ASR and voice conversion quality requires a corresponding up-sampling in the decoder network. We provide an in-depth analysis of how the proposed approach can maximize the efficiency of a speech-to-speech conversion model in the context of atypical speech. Experiments on both many-to-one and one-to-one dysarthric speech conversion tasks show that we can achieve up to a 7x speedup and a 35% relative reduction in WER over the previous best Transformer-based Parrotron model. We also show that these techniques are general enough to provide similar wins on the Transformer-based Parrotron model.
    Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve adaptation gains similar to those of model fine-tuning while updating only a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
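    As a rough illustration of the residual-adapter idea, the sketch below assumes a common bottleneck form: a down-projection, a ReLU, an up-projection, and a residual addition. The abstract does not specify the adapter's internals, so the structure, the dimensions, the layer count, and the base model size here are all hypothetical.

```python
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def residual_adapter(x, w_down, w_up):
    """Assumed bottleneck adapter: x + W_up @ relu(W_down @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_down, x)]
    delta = matvec(w_up, hidden)
    return [xi + di for xi, di in zip(x, delta)]


def adapter_param_count(d_model, d_bottleneck):
    """Weights in one adapter (two projection matrices, biases omitted)."""
    return 2 * d_model * d_bottleneck


# Tiny hypothetical sizes so the example runs instantly.
d_model, d_bottleneck = 8, 2
rng = random.Random(0)
w_down = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_bottleneck)]
w_up = [[0.0] * d_bottleneck for _ in range(d_model)]  # zero init: adapter starts as identity
x = [rng.gauss(0, 1) for _ in range(d_model)]
y = residual_adapter(x, w_down, w_up)

# Hypothetical budget: 12 encoder layers, 512-dim activations, 16-dim
# bottleneck, and a 100M-parameter base model that stays frozen.
added = 12 * adapter_param_count(512, 16)
fraction = added / 100_000_000
print(f"adapter parameters: {added} ({fraction:.3%} of the base model)")
```

    With the up-projection initialized to zero, the adapter leaves activations unchanged until it is trained, so only the small per-speaker adapter weights need to be stored while the base model is shared.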
    We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speech from any speaker regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from a deaf speaker, resulting in significant improvements in intelligibility and naturalness, measured via a speech recognizer and listening tests. Finally, demonstrating the utility of this model on other speech tasks, we show that the same model architecture can be trained to perform a speech separation task.
    We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.
    Sparse Non-negative Matrix Language Modeling: Maximum Entropy Flexibility on the Cheap
    The 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, pp. 2725-2729 (to appear)
    We present a new method for estimating the sparse non-negative model (SNM) by using a small amount of held-out data and the multinomial loss that is natural for language modeling; we validate it experimentally against the previous estimation method which uses leave-one-out on training data and a binary loss function and show that it performs equally well. Being able to train on held-out data is very important in practical situations where training data is mismatched from held-out/test data. We find that fairly small amounts of held-out data (on the order of 30-70 thousand words) are sufficient for training the adjustment model, which is the only model component estimated using gradient descent; the bulk of model parameters are relative frequencies counted on training data. A second contribution is a comparison between SNM and the related class of Maximum Entropy language models. While much cheaper computationally, we show that SNM achieves slightly better perplexity results for the same feature set and same speech recognition accuracy on voice search and short message dictation.
    Maximum Entropy (MaxEnt) Language Models (LMs) are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework, by optimizing a convex loss function. In addition to their flexibility, a key advantage is their scalability, in terms of model size and the amount of data that can be used during training. We present the following two contributions to MaxEnt training: (1) By leveraging smaller amounts of transcribed data, we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted to better match the test distribution of speech recognition; (2) A novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model. We test the impact of these approaches on Google's state-of-the-art speech recognizer for the task of voice-search transcription and dictation. Training 10B parameter models utilizing a corpus of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Also, human evaluations show strong, significant improvements on a wide range of domains from using non-linguistic signals. For example, adapting to geographical domains (e.g., US States and cities) affects about 4% of test utterances, with a 2:1 win-to-loss ratio.
    Approaches for Neural-Network Language Model Adaptation
    Michael Alexander Nirschl
    Min Ma
    Interspeech 2017, Stockholm, Sweden (2017)
    Language Models (LMs) for Automatic Speech Recognition (ASR) are typically trained on large text corpora from news articles, books and web documents. These types of corpora, however, are unlikely to match the test distribution of ASR systems, which expect spoken utterances. Therefore, the LM is typically adapted to a smaller held-out in-domain dataset that is drawn from the test distribution. We present three LM adaptation approaches for Deep NN and Long Short-Term Memory (LSTM): (1) Adapting the softmax layer in the NN; (2) Adding a non-linear adaptation layer before the softmax layer that is trained only in the adaptation phase; (3) Training the extra non-linear adaptation layer in pre-training and adaptation phases. Aiming to improve upon a hierarchical Maximum Entropy (MaxEnt) second-pass LM baseline, which factors the model into word-cluster and word models, we build an NN LM that predicts only word clusters. Adapting the LSTM LM by training the adaptation layer in both training and adaptation phases (Approach 3), we reduce the cluster perplexity by 30% compared to an unadapted LSTM model. Initial experiments using a state-of-the-art ASR system show a 2.3% relative reduction in WER on top of an adapted MaxEnt LM.
    Backoff Inspired Features for Maximum Entropy Language Models
    Keith Hall
    Pedro Moreno
    Proceedings of Interspeech, ISCA (2014)
    Maximum Entropy (MaxEnt) language models are linear models that are typically regularized via well-known L1 or L2 terms in the likelihood objective, hence avoiding the need for the kinds of backoff or mixture weights used in smoothed n-gram language models using Katz backoff and similar techniques. Even though backoff cost is not required to regularize the model, we investigate the use of backoff features in MaxEnt models, as well as some backoff-inspired variants. These features improve model quality substantially, as reflected in perplexity and word-error rate reductions, even in very large scale training scenarios of tens or hundreds of billions of words and hundreds of millions of features.
    In this paper we introduce JustSpeak, a universal voice control solution for non-visual access to the Android operating system. JustSpeak offers two contributions as compared to existing systems. First, it enables system-wide voice control on Android that can accommodate any application. JustSpeak constructs the set of available voice commands based on application context; these commands are directly synthesized from on-screen labels and accessibility metadata, and require no further intervention from the application developer. Second, it provides more efficient and natural interaction by supporting multiple voice commands in the same utterance. We present the system design of JustSpeak and describe its utility in various use cases. We then discuss the system-level support required by a service like JustSpeak on other platforms. By eliminating the target locating and pointing tasks, JustSpeak can significantly improve the experience of graphical interface interaction for blind and motion-impaired users.