Pedro J. Moreno

Authored Publications
    Preview abstract Speech data from different domains has distinct acoustic and linguistic characteristics. It is common to train a single multidomain model such as a Conformer transducer for speech recognition on a mixture of data from all domains. However, changing data in one domain or adding a new domain would require the multidomain model to be retrained. To this end, we propose a framework called modular domain adaptation (MDA) that enables a single model to process multidomain data while keeping all parameters domain-specific, i.e., each parameter is only trained by data from one domain. On a streaming Conformer transducer trained only on video caption data, experimental results show that an MDA-based model can reach similar performance as the multidomain model on other domains such as voice search and dictation by adding per-domain adapters and per-domain feed-forward networks in the Conformer encoder. View details
    Preview abstract Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in [zhehuai2021] can be leveraged to train a massively multilingual ASR model without any transcribed speech. In most zero-resource conditions, lack of transcribed speech also implies lack of lexicons. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero transcribed speech, real-world setting to expand the set of languages covered by ASR models with only unlabeled speech and text in the target languages. We define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations coupled with the effective use of language embeddings, we can dramatically reduce the resource requirements for deploying an ASR model to a new language. On the FLEURS dataset, this approach is able to reduce the CER on languages with no transcribed speech from 64.1% to 29.6%, a relative reduction of 54%. Second, using a subset of Indic languages we show that the proposed method can learn effectively from languages with transcribed speech even when there is limited to no graphemic overlap with the target languages, reducing the average CER of the target languages from 56.3 to 17.2. We believe this is the first demonstration that competitive ASR performance can be achieved for an unseen language using no language resources other than text and untranscribed speech. View details
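The relative reductions quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration; the helper name `relative_reduction` is hypothetical and not taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from baseline to improved, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# CER figures quoted in the abstract above.
fleurs = relative_reduction(64.1, 29.6)
indic = relative_reduction(56.3, 17.2)

print(f"FLEURS: {fleurs:.1f}% relative reduction")
print(f"Indic subset: {indic:.1f}% relative reduction")
```

The first figure rounds to the 54% stated in the abstract. The second is not stated there and is derived here purely from the two quoted CER values.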
    Preview abstract Second-pass rescoring is a well-known technique to improve the performance of Automatic Speech Recognition (ASR) systems. Neural oracle search (NOS), which selects the most likely hypothesis from an N-best hypothesis list by integrating information from multiple sources, such as the input acoustic representations, N-best hypotheses, additional first-pass statistics, and unpaired textual information through an external language model, has shown success in rescoring for RNN-T first-pass models. Multilingual first-pass speech recognition models often outperform their monolingual counterparts when trained on related or low-resource languages. In this paper, we investigate making the second-pass model multilingual and apply rescoring on a multilingual first-pass. We conduct experiments on Nordic languages including Danish, Dutch, Finnish, Norwegian and Swedish. View details
    Preview abstract We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced the representations learnt from these two modalities to be aligned in the latent space through multi-tasking and parameter sharing or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between the two modalities, the latter introduces additional complexity. In this paper, we propose Maestro, a novel algorithm to learn unified representations from both these modalities simultaneously that can transfer to diverse downstream tasks such as Automated Speech Recognition (ASR) and Speech Translation (ST). Maestro learns unified representations through sequence alignment, duration prediction and matching embeddings in the learned space through an aligned masked-language model loss. We establish a new state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 8% relative reduction in Word Error Rate (WER), multi-domain SpeechStew ASR (3.7% relative) and 21 languages to English multilingual ST on CoVoST 2 with an improvement of 2.8 BLEU averaged over 21 languages. View details
    Preview abstract Masked speech modeling (MSM) pre-training methods such as wav2vec2 or w2v-BERT randomly mask speech frames in an utterance and compute losses on the masked instances. While these methods improve performance of Automated Speech Recognition (ASR) systems, they have one major limitation. They generally perform best under matched conditions, i.e., when the data used for pre-training is matched to the data used for fine-tuning. Using out-of-domain (OOD) pre-training data with limited in-domain fine-tuning data from the target domain results in reduced gains. The relative value of in-domain data within an MSM pre-training corpus has not been well-explored in the literature. In this work, we address precisely this limitation. We propose ask2mask (ATM), a novel approach to focus on samples relevant to the target domain (in-domain) during pre-training with OOD or any available data. To perform this fine-grained data selection, ATM applies masking only to input frames with high confidence scores obtained from an external classification model. This allows the model to achieve meaningful in-domain representations and simultaneously discard low-confidence frames which could lead to learning erroneous representations. The ATM approach is further extended to focus on utterances with high confidences by scaling the final MSM loss computed for each masked input frame with the utterance-level confidence score. We conduct experiments on two well-benchmarked corpora: a read speech corpus (Librispeech) and a conversational speech corpus (AMI). The results substantiate the efficacy of ATM in significantly improving target domain performance under mismatched conditions while still yielding modest improvements under matched conditions. View details
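The two ideas in this abstract, masking only high-confidence frames and scaling the loss by an utterance-level confidence, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's procedure: the function names, the "top half by confidence" candidate rule, and the `mask_ratio` value are all hypothetical.

```python
import random


def select_mask_frames(confidences, mask_ratio=0.4, seed=0):
    """Pick frame indices to mask, restricted to high-confidence frames.

    Assumption: candidates are the top half of frames by confidence, and
    the mask is sampled uniformly from those candidates.
    """
    n = len(confidences)
    ranked = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    candidates = ranked[: max(1, n // 2)]
    num_masked = min(len(candidates), max(1, int(mask_ratio * n)))
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, num_masked))


def scaled_loss(per_frame_losses, masked, utterance_confidence):
    """Sum the loss over masked frames, scaled by utterance-level confidence."""
    return utterance_confidence * sum(per_frame_losses[i] for i in masked)


# Hypothetical per-frame confidences from an external classifier.
conf = [0.9, 0.2, 0.8, 0.1, 0.95, 0.3, 0.7, 0.4, 0.85, 0.05]
masked = select_mask_frames(conf)
print(masked)
print(scaled_loss([1.0] * len(conf), masked, utterance_confidence=0.5))
```

Low-confidence frames never enter the candidate set, so they contribute nothing to the masked loss in this sketch.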
    Preview abstract Multilingual speech recognition models are capable of recognizing speech in multiple different languages. When trained on related or low-resource languages, these models often outperform their monolingual counterparts. Similar to other forms of multi-task models, when the languages in the group are unrelated, or when large amounts of training data are available, multilingual models can suffer from performance loss. We investigate the use of a mixture-of-expert approach to assign per-language parameters in the model to increase network capacity in a structured fashion. We introduce a novel variant of this approach, 'informed experts', which attempts to tackle inter-task conflicts by eliminating gradients from other tasks in these task-specific parameters. We conduct experiments on a real-world task on English, French and four dialects of Arabic to show the effectiveness of our approach. View details
    Preview abstract With a large population of the world speaking more than one language, multilingual automatic speech recognition (ASR) has gained popularity in recent years. While lower resource languages can benefit from quality improvements in a multilingual ASR system, including unrelated or higher resource languages in the mix often results in performance degradation. In this paper, we propose distilling from multiple teachers, with each language using its best teacher during training, to tackle this problem. We introduce self-adaptive distillation, a novel technique for automatic weighting of the distillation loss that uses the student and teacher confidences. We analyze the effectiveness of the proposed techniques on two real world use-cases and show that the performance of the multilingual ASR models can be improved by up to 11.5% without any increase in model capacity. Furthermore, we show that when our methods are combined with an increase in model capacity, we can achieve quality gains of up to 20.7%. View details
    Preview abstract Parrotron is an end-to-end personalizable model that enables many-to-one voice conversion and Automated Speech Recognition (ASR) simultaneously for atypical speech. In this work, we present the next-generation Parrotron model with improvements in overall performance and training and inference speeds. The proposed architecture builds on the recently popularized conformer encoder comprising convolution and attention layer based blocks used in ASR. We introduce architectural modifications that sub-sample encoder activations to achieve speed-ups in training and inference. In order to jointly improve ASR and voice conversion quality, we show that this requires a corresponding up-sampling in the decoder network. We provide an in-depth analysis on how the proposed approach can maximize the efficiency of a speech-to-speech conversion model in the context of atypical speech. Experiments on both many-to-one and one-to-one dysarthric speech conversion tasks show that we can achieve up to 7X speedup and 35% relative reduction in WER over the previous best Transformer-based Parrotron model. We also show that these techniques are general enough and can provide similar wins on the transformer based Parrotron model. View details
    Preview abstract Semi- and self-supervised training techniques have the potential to improve performance of speech recognition systems without additional transcribed speech data. In this work, we demonstrate the efficacy of two approaches to semi-supervision for automated speech recognition. The two approaches leverage vast amounts of available unspoken text and untranscribed audio. First, we present factorized multilingual speech synthesis to improve data augmentation on unspoken text. Next, we present an online implementation of Noisy Student Training to incorporate untranscribed audio. We propose a modified Sequential MixMatch algorithm with iterative learning to learn from untranscribed speech. We demonstrate the compatibility of these techniques yielding a relative reduction of word error rate of up to 14.4% on the voice search task. View details
    Preview abstract Streaming automatic speech recognition (ASR) hypothesizes words as soon as the input audio arrives, whereas non-streaming ASR can potentially wait for the completion of the entire utterance to hypothesize words. Streaming and non-streaming ASR systems have typically used different acoustic encoders. Recent work has attempted to unify them by either jointly training a fixed stack of streaming and non-streaming layers or using knowledge distillation during training to ensure consistency between the streaming and non-streaming predictions. We propose mixture model (MiMo) attention as a simpler and theoretically-motivated alternative that replaces only the attention mechanism, requires no change to the training loss, and allows greater flexibility of switching between streaming and non-streaming mode during inference. Our experiments on the public Librispeech data set and a few Indic language data sets show that MiMo attention endows a single ASR model with the ability to operate in both streaming and non-streaming modes without any overhead and without significant loss in accuracy compared to separately-trained streaming and non-streaming models. View details
    Extending Parrotron: An End-to-End, Speech Conversion and Speech Recognition Model for Atypical Speech
    Rohan Doshi
    Youzheng Chen
    Liyang Jiang
    Xia Zhang
    Andrea Chu
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
    Preview abstract We present an extended Parrotron model: a single, end-to-end model that enables voice conversion and recognition simultaneously. Input spectrograms are transformed to output spectrograms in the voice of a predetermined target speaker while also generating hypotheses in the target vocabulary. We study the performance of this novel architecture that jointly predicts speech and text on atypical (‘dysarthric’) speech. We show that with as little as an hour of atypical speech, speaker adaptation can yield up to 67% relative reduction in Word Error Rate (WER). We also show that data augmentation using a customized synthesizer built on the atypical speech can provide an additional 10% relative improvement over the best speaker-adapted model. Finally, we show that these methods generalize across 8 dysarthria etiologies with a range of severities. View details
    Preview abstract Recent developments in data augmentation have brought great gains for automatic speech recognition (ASR). Parallel developments in augmentation policy search in the computer vision domain have shown improvements in model performance and robustness. In addition, recent developments in semi-supervised learning have shown that consistency measures are crucial for performance and robustness. In this work, we demonstrate that combining augmentation policies with consistency measures and model regularization can greatly improve speech recognition performance. Using the Librispeech task, we show: 1) symmetric consistency measures such as the Jensen-Shannon Divergence provide 11% relative improvements in ASR performance; 2) augmented adversarial inputs using Virtual Adversarial Noise (VAT) provide an 8.9% relative win; and 3) random sampling from arbitrary combinations of augmentation policies yields the best policy. These contributions result in an overall reduction in Word Error Rate (WER) of 18% relative on the Librispeech task presented in this paper. View details
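The Jensen-Shannon divergence named in this abstract is symmetric in its two arguments, which is what makes it usable as a symmetric consistency measure between two model outputs. The sketch below computes it for discrete distributions; how the paper applies it to ASR outputs is not described in the abstract, so the example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical output distributions for two augmented views of one input.
clean_view = [0.7, 0.2, 0.1]
augmented_view = [0.5, 0.3, 0.2]

print(js_divergence(clean_view, augmented_view))
print(js_divergence(augmented_view, clean_view))  # same value: JSD is symmetric
```

Unlike the KL divergence, the value is bounded (by ln 2 in nats) and stays finite even when the two distributions have disjoint support.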
    Preview abstract Speech synthesis has advanced to the point of being close to indistinguishable from human speech. However, efforts to train speech recognition systems on synthesized utterances have not been able to show that synthesized data can be effectively used to augment or replace human speech. In this work, we demonstrate that promoting consistent predictions in response to real and synthesized speech enables significantly improved speech recognition performance. We also find that a system trained on 460 hours of LibriSpeech augmented with 500 hours of transcripts (without audio) performs within 0.2% WER of a system trained on 960 hours of transcribed audio. This suggests that with this approach, when there is sufficient text available, reliance on transcribed audio can be cut nearly in half. View details
    Multilingual Speech Recognition with Self-Attention Structured Parameterization
    Yun Zhu
    Brian Farris
    Hainan Xu
    Han Lu
    Qian Zhang
    Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, ISCA
    Preview abstract Multilingual automatic speech recognition systems can transcribe utterances from different languages. These systems are attractive from different perspectives: they can provide quality improvements, especially for lower resource languages, and simplify the training and deployment procedure. End-to-end speech recognition has further simplified multilingual modeling as one model, instead of several components of a classical system, has to be unified. In this paper, we investigate a streamable end-to-end multilingual system based on the Transformer Transducer. We propose several techniques for adapting the self-attention architecture based on the language id. We analyze the trade-offs of each method with regards to quality gains and number of additional parameters introduced. We conduct experiments in a real-world task consisting of five languages. Our experimental results demonstrate ~10% and ~15% relative gain over the baseline multilingual model. View details
    Preview abstract Text-to-Speech synthesis (TTS) based data augmentation is a relatively new mechanism for utilizing text-only data to improve automatic speech recognition (ASR) training without parameter or inference architecture changes. However, efforts to train speech recognition systems on synthesized utterances suffer from limited acoustic diversity of TTS outputs. Additionally, the text-only corpus is always much larger than the transcribed speech corpus by several orders of magnitude, which makes speech synthesis of all the text data impractical. In this work, we propose to combine generative adversarial network (GAN) and multi-style training (MTR) to increase acoustic diversity in the synthesized data. We also present a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text. We demonstrate the ability of our proposed method to enable efficient, large-scale unspoken text learning while achieving a 32.7% relative WER reduction on a voice-search task. View details
    Preview abstract Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and style variations derived from input acoustic representations thereby allowing for manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition performance with speech synthesis, using two corpora from different domains. We explore algorithms to provide the necessary acoustic and lexical diversity needed for robust speech recognition. Finally, we demonstrate the feasibility of this approach as a data augmentation strategy for domain-transfer. View details
    Preview abstract We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speech from any speaker regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from a deaf speaker, resulting in significant improvements in intelligibility and naturalness, measured via a speech recognizer and listening tests. Finally, demonstrating the utility of this model on other speech tasks, we show that the same model architecture can be trained to perform a speech separation task. View details
    Preview abstract Multilingual speech recognition models are capable of recognizing speech in multiple different languages. Depending on the amount of training data, and the relatedness of languages, these models can outperform their monolingual counterparts. However, the performance of these models heavily relies on an externally provided language-id which is used to augment the input features or modulate the neural network's per-layer outputs using a language-gate. In this paper, we introduce a novel technique for inferring the language-id in a streaming fashion using the RNN-T loss that eliminates reliance on knowing the utterance's language. We conduct experiments on two sets of languages, Arabic and Nordic, and show the effectiveness of our approach. View details
    Preview abstract Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring the performance of code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER) that smoothes out many of these irregularities by mapping all text to one writing system and demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline. View details
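The idea behind toWER, mapping all text to one writing system before scoring, can be sketched with a word-level edit distance. This is a toy illustration only: the one-entry transliteration table, the choice of Devanagari as the common script, and all function names are hypothetical stand-ins for whatever transliteration model is actually used.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Conventional word error rate; ref must be non-empty."""
    return edit_distance(ref, hyp) / len(ref)


def to_wer(ref, hyp, transliterate):
    """WER after mapping every token to a single writing system."""
    return wer([transliterate(w) for w in ref], [transliterate(w) for w in hyp])


# Hypothetical transliteration table (Latin script to Devanagari).
TABLE = {"office": "ऑफिस"}


def to_devanagari(word):
    return TABLE.get(word, word)


ref = ["मैं", "ऑफिस", "जा", "रहा", "हूँ"]
hyp = ["मैं", "office", "जा", "रहा", "हूँ"]

print(wer(ref, hyp))                    # the script mismatch counts as an error
print(to_wer(ref, hyp, to_devanagari))  # the rendering difference is normalized away
```

In this example the hypothesis differs from the reference only in the script used for one borrowed word, so the conventional WER penalizes it while the transliterated score does not.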
    Preview abstract This paper describes a series of experiments with neural networks containing long short-term memory (LSTM) [1] and feedforward sequential memory network (FSMN) [2, 3, 4] layers trained with the connectionist temporal classification (CTC) [5] criteria for acoustic modeling. We propose using a hybrid LSTM/FSMN (FLMN) architecture as an enhancement to conventional LSTM-only acoustic models. The addition of FSMN layers allows the network to model a fixed size representation of future context suitable for online speech recognition. Our experiments show that FLMN acoustic models significantly outperform conventional LSTM. We also compare the FLMN architecture with other methods of modeling future context. Finally, we present a modification of the FSMN architecture that improves performance by reducing the width of the FSMN output. View details
    Preview abstract Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio to semantic understanding as a sequence-to-sequence problem. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that having an intermediate text representation while jointly optimizing the full system improves accuracy of prediction. View details
    Preview abstract Recent interest in intelligent assistants has increased demand for Automatic Speech Recognition (ASR) systems that can utilize contextual information to adapt to the user’s preferences or the current device state. For example, a user might be more likely to refer to their favorite songs when giving a “music playing” command or request to watch a movie starring a particular favorite actor when giving a “movie playing” command. Similarly, when a device is in a “music playing” state, a user is more likely to give volume control commands. In this paper, we explore using semantic information inside the ASR word lattice by employing Named Entity Recognition (NER) to identify and boost contextually relevant paths in order to improve speech recognition accuracy. We use broad semantic classes comprising millions of entities, such as songs and musical artists, to tag relevant semantic entities in the lattice. We show that our method reduces Word Error Rate (WER) by 12.0% relative on a Google Assistant “media playing” commands test set, while not affecting WER on a test set containing commands unrelated to media. View details
    Preview abstract Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we further improve performance by an additional 7% relative and eliminate confusion between different languages. View details
    Preview abstract We explore the feasibility of training long short-term memory (LSTM) recurrent neural networks (RNNs) with syllables, rather than phonemes, as outputs. Syllables are a natural choice of linguistic unit for modeling the acoustics of languages such as Mandarin Chinese, due to the inherent nature of the syllable as an elemental pronunciation construct and the limited size of the syllable set for such languages (around 1400 syllables for Mandarin). Our models are trained with Connectionist Temporal Classification (CTC) and sMBR loss using asynchronous stochastic gradient descent (ASGD) utilizing a parallel computation infrastructure for large-scale training. With feature frames computed every 30ms, our acoustic models are well suited to syllable-level modeling as compared to phonemes which can have a shorter duration. Additionally, when compared to word-level modeling, syllables have the advantage of avoiding out-of-vocabulary (OOV) model outputs. Our experiments on a Mandarin voice search task show that syllable-output models can perform as well as context-independent (CI) phone-output models, and, under certain circumstances can beat the performance of our state-of-the-art context-dependent (CD) models. Additionally, decoding with syllable-output models is substantially faster than that with CI models, and vastly faster than with CD models. We demonstrate that these improvements are maintained when the model is trained to recognize both Mandarin syllables and English phonemes. View details
    Towards Acoustic Model Unification Across Dialects
    Meysam Bastani
    Mohamed G. Elfeky
    2016 IEEE Workshop on Spoken Language Technology
    Preview abstract Research has shown that acoustic model performance typically decreases when evaluated on a dialectal variation of the same language that was not used during training. Similarly, models simultaneously trained on a group of dialects tend to under-perform when compared to dialect-specific models. In this paper, we report on our efforts towards building a unified acoustic model that can serve a multi-dialectal language. Two techniques are presented: Distillation and MTL. In Distillation, we use an ensemble of dialect-specific acoustic models and distill its knowledge in a single model. In MTL, we utilize MultiTask Learning to train a unified acoustic model that learns to distinguish dialects as a side task. We show that both techniques are superior to the naive model that is trained on all dialectal data, reducing word error rates by 4.2% and 0.6%, respectively. And, while achieving this improvement, neither technique degrades the performance of the dialect-specific models by more than 3.4%. View details
    Preview abstract While research has often shown that building dialect-specific Automatic Speech Recognizers is the optimal approach to dealing with dialectal variations of the same language, we have observed that dialect-specific recognizers do not always output the best recognitions. Often enough, another dialectal recognizer outputs a better recognition than the dialect-specific one. In this paper, we present two methods to select and combine the best decoded hypothesis from a pool of dialectal recognizers. We follow a Machine Learning approach and extract features from the Speech Recognition output along with Word Embeddings and use Shallow Neural Networks for classification. Our experiments using Dictation and Voice Search data from the main four Arabic dialects show good WER improvements for the hypothesis selection scheme, reducing the WER by 2.1 to 12.1% depending on the test set, and promising results for the hypotheses combination scheme. View details
    Preview abstract This paper describes a new technique to automatically obtain large high-quality training speech corpora for acoustic modeling. Traditional approaches select utterances based on confidence thresholds and other heuristics. We propose instead to use an ensemble approach: we transcribe each utterance using several recognizers, and only keep those on which they agree. The recognizers we use are trained on data from different dialects of the same language, and this diversity leads them to make different mistakes in transcribing speech utterances. In this work we show, however, that when they agree, this is an extremely strong signal that the transcript is correct. This allows us to produce automatically transcribed speech corpora that are superior in transcript correctness even to those manually transcribed by humans. Furthermore, we show that using the produced semi-supervised data sets, we can train new acoustic models which outperform those trained solely on previously available data sets. View details
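The ensemble selection rule described in this abstract (transcribe with several recognizers, keep only utterances on which they agree) can be sketched as a simple filter. The recognizers below are toy lookup tables standing in for dialect-specific systems; their names and outputs are hypothetical.

```python
def agreed_transcripts(utterances, recognizers):
    """Keep (utterance, transcript) pairs on which every recognizer agrees."""
    kept = []
    for utt in utterances:
        hypotheses = {recognize(utt) for recognize in recognizers}
        if len(hypotheses) == 1:
            kept.append((utt, hypotheses.pop()))
    return kept


# Toy stand-ins for dialect-specific recognizers (hypothetical outputs).
dialect_a = {"utt1": "play music", "utt2": "call home"}.get
dialect_b = {"utt1": "play music", "utt2": "call hone"}.get
dialect_c = {"utt1": "play music", "utt2": "call home"}.get

corpus = agreed_transcripts(["utt1", "utt2"], [dialect_a, dialect_b, dialect_c])
print(corpus)  # [('utt1', 'play music')]
```

Exact string agreement is the strictest possible rule; whether the paper applies any text normalization before comparing transcripts is not stated in the abstract.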
    Multi-Dialectical Languages Effect on Speech Recognition
    Mohamed Elfeky
    Victor Soto
    International Conference on Natural Language and Speech Processing (2015)
    Preview abstract Maximum Entropy (MaxEnt) language models are linear models that are typically regularized via well-known L1 or L2 terms in the likelihood objective, hence avoiding the need for the kinds of backoff or mixture weights used in smoothed n-gram language models using Katz backoff and similar techniques. Even though backoff cost is not required to regularize the model, we investigate the use of backoff features in MaxEnt models, as well as some backoff-inspired variants. These features are shown to improve model quality substantially, as shown in perplexity and word-error rate reductions, even in very large scale training scenarios of tens or hundreds of billions of words and hundreds of millions of features. View details
    Frame by Frame Language Identification in Short Utterances using Deep Neural Networks
    Javier Gonzalez-Dominguez
    Joaquin Gonzalez-Rodriguez
    Neural Networks Special Issue: Neural Network Learning in Big Data (2014)
    Preview abstract This work addresses the use of deep neural networks (DNNs) in automatic language identification (LID) focused on short test utterances. Motivated by their recent success in acoustic modelling for speech recognition, we adapt DNNs to the problem of identifying the language in a given utterance from the short-term acoustic features. We show how DNNs are particularly suitable to perform LID in real-time applications, due to their capacity to emit a language identification posterior at each new frame of the test utterance. We then analyse different aspects of the system, such as the amount of required training data, the number of hidden layers, the relevance of contextual information and the effect of the test utterance duration. Finally, we propose several methods to combine frame-by-frame posteriors. Experiments are conducted on two different datasets: the public NIST Language Recognition Evaluation 2009 (3 seconds task) and a much larger corpus (of 5 million utterances) known as Google 5M LID, obtained from different Google Services. Reported results show relative improvements of DNNs versus the i-vector system of 40% in LRE09 3 second task and 76% in Google 5M LID. View details
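One simple way to turn per-frame language posteriors into an utterance-level decision is to average the log posteriors over frames and pick the highest-scoring language. The abstract says only that several combination methods are proposed, so the rule below is an assumption chosen for illustration, and the function name and `eps` floor are hypothetical.

```python
import math


def combine_frame_posteriors(frame_posteriors, eps=1e-10):
    """Average log posteriors over frames; return (best language index, scores)."""
    num_frames = len(frame_posteriors)
    num_langs = len(frame_posteriors[0])
    scores = [
        sum(math.log(max(frame[k], eps)) for frame in frame_posteriors) / num_frames
        for k in range(num_langs)
    ]
    best = max(range(num_langs), key=lambda k: scores[k])
    return best, scores


# Hypothetical posteriors over two languages for three frames.
frames = [[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]]
best, scores = combine_frame_posteriors(frames)
print(best, scores)
```

Because a score is available after every new frame, the same running average could be updated incrementally, which matches the real-time use case the abstract highlights.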
    A big data approach to acoustic model training corpus selection
    John Alex
    Conference of the International Speech Communication Association (Interspeech) (2014)
    Preview abstract Deep neural networks (DNNs) have recently become the state of the art technology in speech recognition systems. In this paper we propose a new approach to constructing large high quality unsupervised sets to train DNN models for large vocabulary speech recognition. The core of our technique consists of two steps. We first redecode speech logged by our production recognizer with a very accurate (and hence too slow for real-time usage) set of speech models to improve the quality of ground truth transcripts used for training alignments. Using confidence scores, transcript length and transcript flattening heuristics designed to cull salient utterances from three decades of speech per language, we then carefully select training data sets consisting of up to 15K hours of speech to be used to train acoustic models without any reliance on manual transcription. We show that this approach yields models with approximately 18K context dependent states that achieve 10% relative improvement in large vocabulary dictation and voice-search systems for Brazilian Portuguese, French, Italian and Russian languages. View details
    Google's Cross-Dialect Arabic Voice Search
    Martin Jansche
    IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2012), pp. 4441-4444
    Preview abstract We present a large scale effort to build a commercial Automatic Speech Recognition (ASR) product for Arabic. Our goal is to support voice search, dictation, and voice control for the general Arabic-speaking public, including support for multiple Arabic dialects. We describe our ASR system design and compare recognizers for five Arabic dialects, with the potential to reach more than 125 million people in Egypt, Jordan, Lebanon, Saudi Arabia, and the United Arab Emirates (UAE). We compare systems built on diacritized vs. non-diacritized text. We also conduct cross-dialect experiments, where we train on one dialect and test on the others. Our average word error rate (WER) is 24.8% for voice search. View details
    Deploying Google Search by Voice in Cantonese
    Martin Jansche
    12th Annual Conference of the International Speech Communication Association (Interspeech 2011), pp. 2865-2868
    Preview abstract We describe our efforts in deploying Google search by voice for Cantonese, a southern Chinese dialect widely spoken in and around Hong Kong and Guangzhou. We collected audio data from local Cantonese speakers in Hong Kong and Guangzhou by using our DataHound smartphone application. This data was used to create appropriate acoustic models. Language models were trained on anonymized query logs from Google Web Search for Hong Kong. Because users in Hong Kong frequently mix English and Cantonese in their queries, we designed our system from the ground up to handle both languages. We report on experiments with different techniques for mapping the phoneme inventories for both languages into a common space. Based on extensive experiments we report word error rates and web scores for both Hong Kong and Guangzhou data. Cantonese Google search by voice was launched in December 2010. View details
    Building Transcribed Speech Corpora Quickly and Cheaply for Many Languages
    Thad Hughes
    Kaisuke Nakajima
    Linne Ha
    Atul Vasu
    Mike LeBeau
    Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), International Speech Communication Association, pp. 1914-1917
    Preview abstract We present a system for quickly and cheaply building transcribed speech corpora containing utterances from many speakers in a variety of acoustic conditions. The system consists of a client application running on an Android mobile device with an intermittent Internet connection to a server. The client application collects demographic information about the speaker, fetches textual prompts from the server for the speaker to read, records the speaker’s voice, and uploads the audio and associated metadata to the server. The system has so far been used to collect over 3000 hours of transcribed audio in 17 languages around the world. View details
    Discriminative Topic Segmentation of Text and Speech
    International Conference on Artificial Intelligence and Statistics (AISTATS) (2010)
    Voice Search for Development
    Etienne Barnard
    Johan Schalkwyk
    Charl van Heerden
    Interspeech 2010
    Preview abstract In light of the serious problems with both illiteracy and information access in the developing world, there is a widespread belief that speech technology can play a significant role in improving the quality of life of developing-world citizens. We review the main reasons why this impact has not occurred to date, and propose that voice-search systems may be a useful tool in delivering on the original promise. The challenges that must be addressed to realize this vision are analyzed, and initial experimental results in developing voice search for two languages of South Africa (Zulu and Afrikaans) are summarized. View details
    Search by Voice in Mandarin Chinese
    Jiulong Shan
    Genqing Wu
    Zhihong Hu
    Xiliu Tang
    Martin Jansche
    Interspeech 2010, pp. 354-357
    Preview abstract In this paper we describe our efforts to build a Mandarin Chinese voice search system. We describe our strategies for data collection, language, lexicon and acoustic modeling, as well as issues related to text normalization that are an integral part of building voice search systems. We show excellent performance on typical spoken search queries under a variety of accents and acoustic conditions. The system has been in operation since October 2009 and has received very positive user reviews. View details
    An Audio Indexing System for Election Video Material
    Christopher Alberti
    Ari Bezman
    Anastassia Drofa
    Ted Power
    Arnaud Sahuguet
    Maria Shugrina
    Proceedings of ICASSP (2009), pp. 4873-4876
    Preview abstract In the 2008 presidential election race in the United States, the prospective candidates made extensive use of YouTube to post video material. We developed a scalable system that transcribes this material and makes the content searchable (by indexing the meta-data and transcripts of the videos) and allows the user to navigate through the video material based on content. The system is available as an iGoogle gadget as well as a Labs product. Given the large exposure, special emphasis was put on the scalability and reliability of the system. This paper describes the design and implementation of this system. View details
    Robust music identification, detection, and analysis
    Proceedings of the International Conference on Music Information Retrieval (ISMIR) (2007)
    Music Identification with Weighted Finite-State Transducers
    Proceedings of the International Conference in Acoustics, Speech and Signal Processing (ICASSP) (2007)
    Supervised Learning of Semantic Classes for Image Annotation and Retrieval
    Gustavo Carneiro
    Antoni B. Chan
    Nuno Vasconcelos
    IEEE Transactions on Pattern Analysis and Machine Intelligence (2007), pp. 394-410
    Factor Automata of Automata and Applications
    Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA 2007), Prague, Czech Republic, July 2007
    Query by Semantic Example
    Nikhil Rasiwasia
    Nuno Vasconcelos
    CIVR (2006), pp. 51-60
    Approaches to reduce the effects of OOV queries on indexed spoken audio
    Beth Logan
    Jean-Manuel Van Thong
    IEEE Transactions on Multimedia, vol. 7 (2005), pp. 899-906
    The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition
    Nuno Vasconcelos
    Purdy Ho
    ECCV (3) (2004), pp. 430-441
    News Tuner: a simple interface for searching and browsing radio archives
    J. Marston
    G. MacCarthy
    Beth Logan
    Jean-Manuel Van Thong
    ICME (2004), pp. 1531-1534
    A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
    Purdy Ho
    Nuno Vasconcelos
    NIPS (2003)
    From Multimedia Retrieval to Knowledge Management
    Jean-Manuel Van Thong
    Beth Logan
    Gareth J. F. Jones
    IEEE Computer, vol. 35 (2002), pp. 58-66
    Speechbot: an experimental speech-based search engine for multimedia content on the web
    Jean-Manuel Van Thong
    Beth Logan
    Blair Fidler
    K. Maffey
    M. Moores
    IEEE Transactions on Multimedia, vol. 4 (2002), pp. 88-96
    Topic Segmentation with an Aspect Hidden Markov Model
    David M. Blei
    SIGIR (2001), pp. 343-348
    Indexing Multimedia for the Internet
    Brian S. Eberman
    Blair Fidler
    Robert A. Iannucci
    Christopher F. Joerg
    Leonidas I. Kontothanassis
    David E. Kovalcin
    Michael J. Swain
    Jean-Manuel Van Thong
    VISUAL (1999), pp. 195-202
    Efficient Grammar Processing for a Spoken Language Translation System
    David B. Roe
    Alejandro Macarrón
    Proceedings of ICASSP, IEEE, San Francisco, California (1992), pp. 213-216
    A spoken language translator for restricted-domain context-free languages
    David B. Roe
    Alejandro Macarrón
    Speech Communication, vol. 11 (1992), pp. 311-319
    Toward a Spoken Language Translator for Restricted-Domain Context-Free Languages
    David B. Roe
    Alejandro Macarrón
    EUROSPEECH 91 -- 2nd European Conference on Speech Communication and Technology, Genova, Italy (1991), pp. 1063-1066