Jump to Content

Eugene Weinstein

Research Areas

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    RNN-Transducer with stateless prediction network
    James Apfel
    Rodrigo Cabrera
    Xiaofeng Liu
    ICASSP 2020, IEEE, pp. 7049-7053
    Preview abstract The RNN-Transducer (RNNT) outperforms classic Automatic Speech Recognition (ASR) systems when a large amount of supervised training data is available. For low-resource languages, the RNNT models overfit, and can not directly take advantage of additional large text corpora as in classic ASR systems. We focus on the prediction network of the RNNT, since it is believed to be analogous to the Language Model (LM) in the classic ASR systems. We pre-train the prediction network with text-only data, which is not helpful. Moreover, removing the recurrent layers from the prediction network, which makes the prediction network stateless, performs virtually as well as the original RNNT model, when using wordpieces. The stateless prediction network does not depend on the previous output symbols, except the last one. Therefore it simplifies the RNNT architectures and the inference. Our results suggest that the RNNT prediction network does not function as the LM in classical ASR. Instead, it merely helps the model align to the input audio, while the RNNT encoder and joint networks capture both the acoustic and the linguistic information. View details
    Preview abstract Multilingual end-to-end (E2E) models have shown great promise as a means to expand coverage of the world’s lan- guages by automatic speech recognition systems. They im- prove over monolingual E2E systems, especially on low re- source languages, and simplify training and serving by elimi- nating language-specific acoustic, pronunciation, and language models. This work aims to develop an E2E multilingual system which is equipped to operate in low-latency interactive applica- tions as well as handle the challenges of real world imbalanced data. First, we present a streaming E2E multilingual model. Second, we compare techniques to deal with imbalance across languages. We find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual model system achieves lower word error rate (WER) than state-of-the- art conventional monolingual models by at least 10% relative on every language. View details
    Preview abstract Multidialectal languages can pose challenges for acoustic modeling. Past research has shown that with a large training corpus but without explicit modeling of inter-dialect variability, training individual per-dialect models yields superior performance to that of a single model trained on the combined data [1, 2]. In this work, we were motivated by the idea that adaptation techniques can allow the models to learn dialect-independent features and in turn leverage the power of the larger training corpus sizes afforded when pooling data across dialects. Our goal was thus to create a single multidialect acoustic model that would rival the performance of the dialect-specific models.Working in the context of deep Long-Short Term Memory (LSTM) acoustic models trained on up to 40K hours of speech, we explored several methods for training and incorporating dialect-specific information into the model, including 12 variants of interpolation-of-bases techniques related to Cluster Adaptive Training (CAT) [3] and Factorized Hidden Layer (FHL) [4] techniques. We found that with our model topology and large training corpus, simply appending the dialect-specific information to the feature vector resulted in a more accurate model than any of the more complex interpolation-of-bases techniques, while requiring less model complexity and fewer parameters. This simple adaptation yielded a single unified model for all dialects that, in most cases, outperformed individual models which had been trained per-dialect. View details
    Preview abstract This paper describes a series of experiments with neural networks containing long short-term memory (LSTM) [1] and feedforward sequential memory network (FSMN) [2, 3, 4] layers trained with the connectionist temporal classification (CTC) [5] criteria for acoustic modeling. We propose using a hybrid LSTM/FSMN (FLMN) architecture as an enhancement to conventional LSTM-only acoustic models. The addition of FSMN layers allows the network to model a fixed size representation of future context suitable for online speech recognition. Our experiments show that FLMN acoustic models significantly outperform conventional LSTM. We also compare the FLMN architecture with other methods of modeling future context. Finally, we present a modification of the FSMN architecture that improves performance by reducing the width of the FSMN output. View details
    Preview abstract Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we further improve performance by an additional 7% relative and eliminate confusion between different languages. View details
    Preview abstract We explore the feasibility of training long short-term memory (LSTM) recurrent neural networks (RNNs) with syllables, rather than phonemes, as outputs. Syllables are a natural choice of linguistic unit for modeling the acoustics of languages such as Mandarin Chinese, due to the inherent nature of the syllable as an elemental pronunciation construct and the limited size of the syllable set for such languages (around 1400 syllables for Mandarin). Our models are trained with Connectionist Temporal Classification (CTC) and sMBR loss using asynchronous stochastic gradient descent (ASGD) utilizing a parallel computation infrastructure for large-scale training. With feature frames computed every 30ms, our acoustic models are well suited to syllable-level modeling as compared to phonemes which can have a shorter duration. Additionally, when compared to word-level modeling, syllables have the advantage of avoiding out-of-vocabulary (OOV) model outputs. Our experiments on a Mandarin voice search task show that syllable-output models can perform as well as context-independent (CI) phone-output models, and, under certain circumstances can beat the performance of our state-of-the-art context-dependent (CD) models. Additionally, decoding with syllable-output models is substantially faster than that with CI models, and vastly faster than with CD models. We demonstrate that these improvements are maintained when the model is trained to recognize both Mandarin syllables and English phonemes. View details
    Preview abstract This paper describes a new technique to automatically obtain large high-quality training speech corpora for acoustic modeling. Traditional approaches select utterances based on confidence thresholds and other heuristics. We propose instead to use an ensemble approach: we transcribe each utterance using several recognizers, and only keep those on which they agree. The recognizers we use are trained on data from different dialects of the same language, and this diversity leads them to make different mistakes in transcribing speech utterances. In this work we show, however, that when they agree, this is an extremely strong signal that the transcript is correct. This allows us to produce automatically transcribed speech corpora that are superior in transcript correctness even to those manually transcribed by humans. Further more, we show that using the produced semi-supervised data sets, we can train new acoustic models which outperform those trained solely on previously available data sets. View details
    A big data approach to acoustic model training corpus selection
    John Alex
    Conference of the International Speech Communication Association (Interspeech) (2014)
    Preview abstract Deep neural networks (DNNs) have recently become the state of the art technology in speech recognition systems. In this paper we propose a new approach to constructing large high quality unsupervised sets to train DNN models for large vocabulary speech recognition. The core of our technique consists of two steps. We first redecode speech logged by our production recognizer with a very accurate (and hence too slow for real-time usage) set of speech models to improve the quality of ground truth transcripts used for training alignments. Using confidence scores, transcript length and transcript flattening heuristics designed to cull salient utterances from three decades of speech per language, we then carefully select training data sets consisting of up to 15K hours of speech to be used to train acoustic models without any reliance on manual transcription. We show that this approach yields models with approximately 18K context dependent states that achieve 10% relative improvement in large vocabulary dictation and voice-search systems for Brazilian Portuguese, French, Italian and Russian languages. View details
    Mobile Music Modeling, Analysis and Recognition
    Pavel Golik
    Boulos Harb
    Alex Rudnick
    International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2012)
    Preview abstract We present an analysis of music modeling and recognition techniques in the context of mobile music matching, substantially improving on the techniques presented in [Mohri et al., 2010]. We accomplish this by adapting the features specifically to this task, and by introducing new modeling techniques that enable using a corpus of noisy and channel-distorted data to improve mobile music recognition quality. We report the results of an extensive empirical investigation of the system's robustness under realistic channel effects and distortions. We show an improvement of recognition accuracy by explicit duration modeling of music phonemes and by integrating the expected noise environment into the training process. Finally, we propose the use of frame-to-phoneme alignment for high-level structure analysis of polyphonic music. View details
    Discriminative Topic Segmentation of Text and Speech
    International Conference on Artificial Intelligence and Statistics (AISTATS) (2010)
    Preview
    Factor Automata of Automata and Applications
    Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA2007), July, CIAA 2007Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA2007), Prague, Czech Republic.
    Preview
    Music Identification with Weighted Finite-State Transducers
    Proceedings of the International Conference in Acoustics, Speech and Signal Processing (ICASSP) (2007)
    Preview
    Robust music identification, detection, and analysis
    Proceedings of the International Conference on Music Information Retrieval (ISMIR) (2007)
    Preview
    No Results Found