Pedro J. Moreno
Authored Publications
Preview abstract
Speech data from different domains has distinct acoustic and linguistic characteristics. It is common to train a single multidomain model such as a Conformer transducer for speech recognition on a mixture of data from all domains. However, changing data in one domain or adding a new domain would require the multidomain model to be retrained. To this end, we propose a framework called modular domain adaptation (MDA) that enables a single model to process multidomain data while keeping all parameters domain-specific, i.e., each parameter is only trained by data from one domain. On a streaming Conformer transducer trained only on video caption data, experimental results show that an MDA-based model can reach performance similar to that of the multidomain model on other domains such as voice search and dictation by adding per-domain adapters and per-domain feed-forward networks in the Conformer encoder.
Ask2Mask: Guided Data Selection for Masked Speech Modeling
Yu Zhang
IEEE Journal of Selected Topics in Signal Processing (2022)
Preview abstract
Masked speech modeling (MSM) pre-training methods such as wav2vec2 or w2v-BERT randomly mask speech frames in an utterance and compute losses on the masked instances. While these methods improve performance of Automated Speech Recognition (ASR) systems, they have one major limitation. They generally perform best under matched conditions, i.e., when the data used for pre-training is matched to the data used for fine-tuning. Using out-of-domain (OOD) pre-training data with limited in-domain fine-tuning data from the target domain results in reduced gains. The relative value of in-domain data within an MSM pre-training corpus has not been well-explored in the literature. In this work, we address precisely this limitation. We propose Ask2Mask (ATM), a novel approach to focus on samples relevant to the target domain (in-domain) during pre-training with OOD or any available data. To perform this fine-grained data selection, ATM applies masking only to input frames with high confidence scores obtained from an external classification model. This allows the model to achieve meaningful in-domain representations and simultaneously discard low-confidence frames which could lead to learning erroneous representations. The ATM approach is further extended to focus on utterances with high confidences by scaling the final MSM loss computed for each masked input frame with the utterance-level confidence score. We conduct experiments on two well-benchmarked corpora: a read speech corpus (Librispeech) and a conversational speech corpus (AMI). The results substantiate the efficacy of ATM in significantly improving target domain performance under mismatched conditions while still yielding modest improvements under matched conditions.
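The two selection ideas in this abstract (masking only high-confidence frames, then scaling the masked-frame loss by an utterance-level confidence) can be illustrated with a minimal sketch. The function names, the `mask_fraction` parameter, the top-k selection rule, and the averaging of per-frame losses are all assumptions made for illustration; the abstract does not specify them.

```python
def select_mask_frames(confidences, mask_fraction=0.4):
    """Return sorted indices of the highest-confidence frames to mask.

    Top-k selection is an assumed rule, not taken from the abstract.
    """
    num_to_mask = int(len(confidences) * mask_fraction)
    ranked = sorted(range(len(confidences)),
                    key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:num_to_mask])


def scaled_msm_loss(frame_losses, masked_indices, utterance_confidence):
    """Average the losses of the masked frames, scaled by utterance confidence."""
    if not masked_indices:
        return 0.0
    total = sum(frame_losses[i] for i in masked_indices)
    return utterance_confidence * total / len(masked_indices)


# Hypothetical per-frame confidences from an external classifier.
frame_conf = [0.9, 0.2, 0.8, 0.1, 0.7]
masked = select_mask_frames(frame_conf, mask_fraction=0.4)

# Hypothetical per-frame losses and utterance-level confidence.
loss = scaled_msm_loss([1.0, 5.0, 3.0, 5.0, 5.0], masked, utterance_confidence=0.5)
```

With these toy inputs, only the two most confident frames contribute to the loss, and a low-confidence utterance contributes proportionally less to training.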
MAESTRO: Matched Speech Text Representations through Modality Matching
Yu Zhang
Zhehuai Chen
Interspeech 2022 (to appear)
Preview abstract
We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced the representations learnt from these two modalities to be aligned in the latent space through multi-tasking and parameter sharing or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between the two modalities, the latter introduces additional complexity. In this paper, we propose Maestro, a novel algorithm to learn unified representations from both these modalities simultaneously that can transfer to diverse downstream tasks such as Automated Speech Recognition (ASR) and Speech Translation (ST). Maestro learns unified representations through sequence alignment, duration prediction and matching embeddings in the learned space through an aligned masked-language model loss. We establish a new state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 8% relative reduction in Word Error Rate (WER), multi-domain SpeechStew ASR (3.7% relative) and 21 languages to English multilingual ST on CoVoST 2 with an improvement of 2.8 BLEU averaged over 21 languages.
Preview abstract
Second-pass rescoring is a well-known technique to improve the performance of Automatic Speech Recognition (ASR) systems. Neural oracle search (NOS), which selects the most likely hypothesis from an N-best hypothesis list by integrating information from multiple sources, such as the input acoustic representations, N-best hypotheses, additional first-pass statistics, and unpaired textual information through an external language model, has shown success in rescoring for RNN-T first-pass models. Multilingual first-pass speech recognition models often outperform their monolingual counterparts when trained on related or low-resource languages. In this paper, we investigate making the second-pass model multilingual and apply rescoring on a multilingual first-pass. We conduct experiments on Nordic languages including Danish, Dutch, Finnish, Norwegian and Swedish.
Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition
Proceedings of Interspeech, 2021 (to appear)
Preview abstract
Streaming automatic speech recognition (ASR) hypothesizes words as soon as the input audio arrives, whereas non-streaming ASR can potentially wait for the completion of the entire utterance to hypothesize words.
Streaming and non-streaming ASR systems have typically used different acoustic encoders.
Recent work has attempted to unify them by either jointly training a fixed stack of streaming and non-streaming layers or using knowledge distillation during training to ensure consistency between the streaming and non-streaming predictions.
We propose mixture model (MiMo) attention as a simpler and theoretically-motivated alternative that replaces only the attention mechanism, requires no change to the training loss, and allows greater flexibility of switching between streaming and non-streaming mode during inference.
Our experiments on the public Librispeech data set and a few Indic language data sets show that MiMo attention endows a single ASR model with the ability to operate in both streaming and non-streaming modes without any overhead and without significant loss in accuracy compared to separately-trained streaming and non-streaming models.
Semi-Supervision in ASR: Sequential Mixmatch and Factorized TTS-Based Augmentation
Zhehuai Chen
Yu Zhang
Yinghui Huang
Jesse Emond
(2021)
Preview abstract
Semi- and self-supervised training techniques have the potential to improve performance of speech recognition systems without additional transcribed speech data. In this work, we demonstrate the efficacy of two approaches to semi-supervision for automated speech recognition. The two approaches leverage vast amounts of available unspoken text and untranscribed audio. First, we present factorized multilingual speech synthesis to improve data augmentation on unspoken text. Next, we present an online implementation of Noisy Student Training to incorporate untranscribed audio. We propose a modified Sequential MixMatch algorithm with iterative learning to learn from untranscribed speech. We demonstrate the compatibility of these techniques yielding a relative reduction of word error rate of up to 14.4% on the voice search task.
Conformer Parrotron: a Faster and Stronger End-to-end Speech Conversion and Recognition Model for Atypical Speech
Zhehuai Chen
Xia Zhang
Youzheng Chen
Liyang Jiang
Andrea Chu
Rohan Doshi
Interspeech 2021
Preview abstract
Parrotron is an end-to-end personalizable model that enables many-to-one voice conversion and Automated Speech Recognition (ASR) simultaneously for atypical speech. In this work, we present the next-generation Parrotron model with improvements in overall performance and training and inference speeds. The proposed architecture builds on the recently popularized conformer encoder, comprising convolution- and attention-based blocks used in ASR. We introduce architectural modifications that sub-sample encoder activations to achieve speed-ups in training and inference. In order to jointly improve ASR and voice conversion quality, we show that this requires a corresponding up-sampling in the decoder network. We provide an in-depth analysis on how the proposed approach can maximize the efficiency of a speech-to-speech conversion model in the context of atypical speech. Experiments on both many-to-one and one-to-one dysarthric speech conversion tasks show that we can achieve up to 7X speedup and 35% relative reduction in WER over the previous best Transformer-based Parrotron model. We also show that these techniques are general enough and can provide similar wins on the Transformer-based Parrotron model.
Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence
Brian Farris
Yun Zhu
Interspeech 2021 (to appear)
Preview abstract
With a large population of the world speaking more than one language, multilingual automatic speech recognition (ASR) has gained popularity in recent years. While lower resource languages can benefit from quality improvements in a multilingual ASR system, including unrelated or higher resource languages in the mix often results in performance degradation. In this paper, we propose distilling from multiple teachers, with each language using its best teacher during training, to tackle this problem. We introduce self-adaptive distillation, a novel technique for automatic weighting of the distillation loss that uses the student and teacher confidences. We analyze the effectiveness of the proposed techniques on two real-world use cases and show that the performance of the multilingual ASR models can be improved by up to 11.5% without any increase in model capacity. Furthermore, we show that when our methods are combined with an increase in model capacity, we can achieve quality gains of up to 20.7%.
Extending Parrotron: An End-to-End, Speech Conversion and Speech Recognition Model for Atypical Speech
Rohan Doshi
Youzheng Chen
Liyang Jiang
Xia Zhang
Andrea Chu
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Preview abstract
We present an extended Parrotron model: a single, end-to-end model that enables voice conversion and recognition simultaneously. Input spectrograms are transformed to output spectrograms in the voice of a predetermined target speaker while also generating hypotheses in the target vocabulary. We study the performance of this novel architecture that jointly predicts speech and text on atypical (‘dysarthric’) speech. We show that with as little as an hour of atypical speech, speaker adaptation can yield up to 67% relative reduction in Word Error Rate (WER). We also show that data augmentation using a customized synthesizer built on the atypical speech can provide an additional 10% relative improvement over the best speaker-adapted model. Finally, we show that these methods generalize across 8 dysarthria etiologies with a range of severities.
Mixture of Informed Experts for Multilingual Speech Recognition
Brian Farris
Yun Zhu
ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
Preview abstract
Multilingual speech recognition models are capable of recognizing speech in multiple different languages. When trained on related or low-resource languages, these models often outperform their monolingual counterparts. Similar to other forms of multi-task models, when the languages in the group are unrelated, or when large amounts of training data are available, multilingual models can suffer from performance loss. We investigate the use of a mixture-of-experts approach to assign per-language parameters in the model to increase network capacity in a structured fashion. We introduce a novel variant of this approach, 'informed experts', which attempts to tackle inter-task conflicts by eliminating gradients from other tasks in these task-specific parameters. We conduct experiments on a real-world task on English, French and four dialects of Arabic to show the effectiveness of our approach.
SCADA: Stochastic, Consistent and Adversarial Data Augmentation to Improve ASR
Zhehuai Chen
Yu Zhang
Proceedings of Interspeech 2020, pp. 2832-2836
Preview abstract
Recent developments in data augmentation have brought great gains for automatic speech recognition (ASR). Parallel developments in augmentation policy search in the computer vision domain have shown improvements in model performance and robustness. In addition, recent developments in semi-supervised learning have shown that consistency measures are crucial for performance and robustness. In this work, we demonstrate that combining augmentation policies with consistency measures and model regularization can greatly improve speech recognition performance. Using the Librispeech task, we show: 1) symmetric consistency measures such as the Jensen-Shannon Divergence provide 11% relative improvements in ASR performance; 2) augmented adversarial inputs using Virtual Adversarial Noise (VAT) provide an 8.9% relative win; and 3) random sampling from arbitrary combinations of augmentation policies yields the best policy. These contributions result in an overall reduction in Word Error Rate (WER) of 18% relative on the Librispeech task presented in this paper.
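The symmetric consistency measure named in this abstract, the Jensen-Shannon divergence, can be written in a few lines. The sketch below only computes the divergence between two discrete distributions; treating them as model posteriors for two augmented views of the same input, and how the term enters the training loss, are assumptions not stated in the abstract.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical posteriors over three output symbols for two augmented views.
view_a = [0.7, 0.2, 0.1]
view_b = [0.5, 0.3, 0.2]
consistency_penalty = js_divergence(view_a, view_b)
```

Unlike plain KL divergence, this measure gives the same value regardless of argument order and stays finite even when one distribution assigns zero probability to a symbol.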
Improving Speech Recognition using GAN-based Speech Synthesis and Contrastive Unspoken Text Selection
Zhehuai Chen
Yu Zhang
Interspeech 2020
Preview abstract
Text-to-Speech synthesis (TTS) based data augmentation is a relatively new mechanism for utilizing text-only data to improve automatic speech recognition (ASR) training without parameter or inference architecture changes. However, efforts to train speech recognition systems on synthesized utterances suffer from limited acoustic diversity of TTS outputs. Additionally, the text-only corpus is always larger than the transcribed speech corpus by several orders of magnitude, which makes speech synthesis of all the text data impractical. In this work, we propose to combine generative adversarial network (GAN) and multi-style training (MTR) to increase acoustic diversity in the synthesized data. We also present a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text. We demonstrate the ability of our proposed method to enable efficient, large-scale unspoken text learning while achieving a 32.7% relative WER reduction on a voice-search task.
Improving Speech Recognition Using Consistent Predictions on Synthesized Speech
Zhehuai Chen
Yu Zhang
IEEE ICASSP 2020
Preview abstract
Speech synthesis has advanced to the point of being close to indistinguishable from human speech. However, efforts to train speech recognition systems on synthesized utterances have not been able to show that synthesized data can be effectively used to augment or replace human speech.
In this work, we demonstrate that promoting consistent predictions in response to real and synthesized speech enables significantly improved speech recognition performance.
We also find that a system trained on 460 hours of LibriSpeech augmented with 500 hours of transcripts (without audio) performs within 0.2% WER of a system trained on 960 hours of transcribed audio. This suggests that with this approach, when there is sufficient text available, reliance on transcribed audio can be cut nearly in half.
Multilingual Speech Recognition with Self-Attention Structured Parameterization
Yun Zhu
Brian Farris
Hainan Xu
Han Lu
Qian Zhang
Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, ISCA
Preview abstract
Multilingual automatic speech recognition systems can transcribe utterances from different languages. These systems are attractive from different perspectives: they can provide quality improvements, especially for lower resource languages, and simplify the training and deployment procedure. End-to-end speech recognition has further simplified multilingual modeling, as only one model, instead of several components of a classical system, has to be unified. In this paper, we investigate a streamable end-to-end multilingual system based on the Transformer Transducer. We propose several techniques for adapting the self-attention architecture based on the language id. We analyze the trade-offs of each method with regard to quality gains and number of additional parameters introduced. We conduct experiments in a real-world task consisting of five languages. Our experimental results demonstrate ~10% and ~15% relative gain over the baseline multilingual model.
Leveraging Language ID in Multilingual End-to-End Speech Recognition
Delia Qu
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019 (2019)
Preview abstract
Multilingual speech recognition models are capable of recognizing speech in multiple different languages. Depending on the amount of training data, and the relatedness of languages, these models can outperform their monolingual counterparts. However, the performance of these models heavily relies on an externally provided language-id which is used to augment the input features or modulate the neural network's per-layer outputs using a language-gate. In this paper, we introduce a novel technique for inferring the language-id in a streaming fashion using the RNN-T loss that eliminates reliance on knowing the utterance's language. We conduct experiments on two sets of languages, Arabic and Nordic, and show the effectiveness of our approach.
Preview abstract
We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speech from any speaker regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from a deaf speaker, resulting in significant improvements in intelligibility and naturalness, measured via a speech recognizer and listening tests. Finally, demonstrating the utility of this model on other speech tasks, we show that the same model architecture can be trained to perform a speech separation task.
Speech Recognition with Augmented Synthesized Speech
Ye Jia
Yu Zhang
ASRU 2019 (to appear)
Preview abstract
Recent success of the Tacotron speech synthesis architecture and its variants in producing natural-sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and style variations derived from input acoustic representations, thereby allowing for manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition performance through speech synthesis, using two corpora from different domains. We explore algorithms to provide the necessary acoustic and lexical diversity needed for robust speech recognition. Finally, we demonstrate the feasibility of this approach as a data augmentation strategy for domain-transfer.
Preview abstract
This paper describes a series of experiments with neural networks containing long short-term memory (LSTM) [1] and feedforward sequential memory network (FSMN) [2, 3, 4] layers trained with the connectionist temporal classification (CTC) [5] criterion for acoustic modeling. We propose using a hybrid LSTM/FSMN (FLMN) architecture as an enhancement to conventional LSTM-only acoustic models. The addition of FSMN layers allows the network to model a fixed-size representation of future context suitable for online speech recognition. Our experiments show that FLMN acoustic models significantly outperform conventional LSTM. We also compare the FLMN architecture with other methods of modeling future context. Finally, we present a modification of the FSMN architecture that improves performance by reducing the width of the FSMN output.
Transliteration based approaches to improve code-switched speech recognition performance
Jesse Emond
IEEE Spoken Language Technology Workshop (SLT) (2018), pp. 448-455
Preview abstract
Code-switching is a commonly occurring phenomenon in many multilingual communities, wherein a speaker switches between languages within a single utterance. Conventional Word Error Rate (WER) is not sufficient for measuring performance on code-mixed languages due to ambiguities in transcription, misspellings and borrowing of words from two different writing systems. These rendering errors artificially inflate the WER of an Automated Speech Recognition (ASR) system and complicate its evaluation. Furthermore, these errors make it harder to accurately evaluate modeling errors originating from code-switched language and acoustic models. In this work, we propose the use of a new metric, transliteration-optimized Word Error Rate (toWER), that smooths out many of these irregularities by mapping all text to one writing system, and we demonstrate a correlation with the amount of code-switching present in a language. We also present a novel approach to acoustic and language modeling for bilingual code-switched Indic languages using the same transliteration approach to normalize the data for three types of language models, namely, a conventional n-gram language model, a maximum entropy based language model and a Long Short Term Memory (LSTM) language model, and a state-of-the-art Connectionist Temporal Classification (CTC) acoustic model. We demonstrate the robustness of the proposed approach on several Indic languages from Google Voice Search traffic with significant gains in ASR performance up to 10% relative over the state-of-the-art baseline.
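The idea behind a transliteration-optimized WER can be sketched by mapping reference and hypothesis words to a single writing system before computing ordinary WER. The tiny lookup table, the choice of Latin script as the target, and all function names below are hypothetical and for illustration only; the abstract does not describe the actual transliteration model or the script used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref_words, hyp_words):
    """Word error rate; assumes a non-empty reference."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Hypothetical toy transliteration table (a real system would use a model).
TO_LATIN = {"गाना": "gaana", "बजाओ": "bajao"}


def transliterate(words, table):
    """Map each word to the target script, leaving unknown words unchanged."""
    return [table.get(w, w) for w in words]


def to_wer(ref_words, hyp_words, table):
    """WER computed after mapping both sides to one writing system."""
    return wer(transliterate(ref_words, table), transliterate(hyp_words, table))


reference = ["play", "गाना"]
hypothesis = ["play", "gaana"]
plain = wer(reference, hypothesis)
normalized = to_wer(reference, hypothesis, TO_LATIN)
```

In this toy case the hypothesis differs from the reference only in script, so plain WER counts an error that disappears once both sides are rendered in the same writing system.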
Multilingual Speech Recognition with a Single End-to-End Model
Shubham Toshniwal
ICASSP (2018)
Preview abstract
Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we further improve performance by an additional 7% relative and eliminate confusion between different languages.
From audio to semantics: Approaches to end-to-end spoken language understanding
Galen Chuang
Delia Qu
Spoken Language Technology Workshop (SLT), 2018 IEEE
Preview abstract
Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio to semantic understanding as a sequence-to-sequence problem. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that having an intermediate text representation while jointly optimizing the full system improves accuracy of prediction.
Semantic Lattice Processing in Contextual Automatic Speech Recognition for Google Assistant
Ian Williams
Justin Scheiner
Interspeech 2018, ISCA (2018), pp. 2222-2226
Preview abstract
Recent interest in intelligent assistants has increased demand for Automatic Speech Recognition (ASR) systems that can utilize contextual information to adapt to the user’s preferences or the current device state. For example, a user might be more likely to refer to their favorite songs when giving a “music playing” command or request to watch a movie starring a particular favorite actor when giving a “movie playing” command. Similarly, when a device is in a “music playing” state, a user is more likely to give volume control commands.
In this paper, we explore using semantic information inside the ASR word lattice by employing Named Entity Recognition (NER) to identify and boost contextually relevant paths in order to improve speech recognition accuracy. We use broad semantic classes comprising millions of entities, such as songs and musical artists, to tag relevant semantic entities in the lattice. We show that our method reduces Word Error Rate (WER) by 12.0% relative on a Google Assistant “media playing” commands test set, while not affecting WER on a test set containing commands unrelated to media.
Preview abstract
We explore the feasibility of training long short-term memory (LSTM) recurrent neural networks (RNNs) with syllables, rather than phonemes, as outputs. Syllables are a natural choice of linguistic unit for modeling the acoustics of languages such as Mandarin Chinese, due to the inherent nature of the syllable as an elemental pronunciation construct and the limited size of the syllable set for such languages (around 1400 syllables for Mandarin). Our models are trained with Connectionist Temporal Classification (CTC) and sMBR loss using asynchronous stochastic gradient descent (ASGD) utilizing a parallel computation infrastructure for large-scale training. With feature frames computed every 30ms, our acoustic models are well suited to syllable-level modeling as compared to phonemes which can have a shorter duration. Additionally, when compared to word-level modeling, syllables have the advantage of avoiding out-of-vocabulary (OOV) model outputs. Our experiments on a Mandarin voice search task show that syllable-output models can perform as well as context-independent (CI) phone-output models, and, under certain circumstances can beat the performance of our state-of-the-art context-dependent (CD) models. Additionally, decoding with syllable-output models is substantially faster than that with CI models, and vastly faster than with CD models. We demonstrate that these improvements are maintained when the model is trained to recognize both Mandarin syllables and English phonemes.
Towards Acoustic Model Unification Across Dialects
Meysam Bastani
Mohamed G. Elfeky
2016 IEEE Workshop on Spoken Language Technology
Preview abstract
Research has shown that acoustic model performance typically decreases when evaluated on a dialectal variation of the same language that was not used during training. Similarly, models simultaneously trained on a group of dialects tend to under-perform when compared to dialect-specific models. In this paper, we report on our efforts towards building a unified acoustic model that can serve a multi-dialectal language. Two techniques are presented: Distillation and MTL. In Distillation, we use an ensemble of dialect-specific acoustic models and distill its knowledge in a single model. In MTL, we utilize MultiTask Learning to train a unified acoustic model that learns to distinguish dialects as a side task. We show that both techniques are superior to the naive model that is trained on all dialectal data, reducing word error rates by 4.2% and 0.6%, respectively. And, while achieving this improvement, neither technique degrades the performance of the dialect-specific models by more than 3.4%.
High quality agreement-based semi-supervised training data for acoustic modeling
Asa Oines
2016 IEEE Workshop on Spoken Language Technology
Preview abstract
This paper describes a new technique to automatically obtain large high-quality training speech corpora for acoustic modeling. Traditional approaches select utterances based on confidence thresholds and other heuristics. We propose instead to use an ensemble approach: we transcribe each utterance using several recognizers, and only keep those on which they agree. The recognizers we use are trained on data from different dialects of the same language, and this diversity leads them to make different mistakes in transcribing speech utterances. In this work we show, however, that when they agree, this is an extremely strong signal that the transcript is correct. This allows us to produce automatically transcribed speech corpora that are superior in transcript correctness even to those manually transcribed by humans. Furthermore, we show that using the produced semi-supervised data sets, we can train new acoustic models which outperform those trained solely on previously available data sets.
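The agreement filter described in this abstract amounts to keeping an utterance only when every recognizer produces the same transcript. A minimal sketch follows; the data layout, the function names, and the case and whitespace normalization applied before comparison are assumptions, not details from the abstract.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparing (an assumed step)."""
    return " ".join(text.lower().split())


def select_agreed(transcripts_by_utterance):
    """Keep utterances whose recognizer outputs all match after normalization."""
    selected = {}
    for utt_id, hypotheses in transcripts_by_utterance.items():
        distinct = {normalize(h) for h in hypotheses}
        if hypotheses and len(distinct) == 1:
            selected[utt_id] = distinct.pop()
    return selected


# Hypothetical outputs from three dialect-specific recognizers.
decoded = {
    "utt1": ["call mom", "Call  mom", "call mom"],
    "utt2": ["play jazz", "play chess", "play jazz"],
}
training_set = select_agreed(decoded)
```

Here only `utt1` survives, since the recognizers disagree on `utt2`; a stricter or looser notion of agreement would change how much data is retained.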
Preview abstract
While research has often shown that building dialect-specific Automatic Speech Recognizers is the optimal approach to dealing with dialectal variations of the same language, we have observed that dialect-specific recognizers do not always output the best recognitions. Often enough, another dialectal recognizer outputs a better recognition than the dialect-specific one. In this paper, we present two methods to select and combine the best decoded hypothesis from a pool of dialectal recognizers. We follow a Machine Learning approach and extract features from the Speech Recognition output along with Word Embeddings and use Shallow Neural Networks for classification. Our experiments using Dictation and Voice Search data from the four main Arabic dialects show good WER improvements for the hypothesis selection scheme, reducing the WER by 2.1% to 12.1% depending on the test set, and promising results for the hypotheses combination scheme.
Bringing Contextual Information to Google Speech Recognition
Preview
Keith Hall
Interspeech 2015, International Speech Communications Association
Improved recognition of contact names in voice commands
Preview
David Elson
Aleks Kracun
Diego Melendo Casado
ICASSP 2015
Multi-Dialectical Languages Effect on Speech Recognition
Preview
Mohamed Elfeky
Victor Soto
International Conference on Natural Language and Speech Processing (2015)
A big data approach to acoustic model training corpus selection
John Alex
Conference of the International Speech Communication Association (Interspeech) (2014)
Preview abstract
Deep neural networks (DNNs) have recently become the state-of-the-art technology in speech recognition systems. In this paper we propose a new approach to constructing large high quality unsupervised sets to train DNN models for large vocabulary speech recognition. The core of our technique consists of two steps. We first redecode speech logged by our production recognizer with a very accurate (and hence too slow for real-time usage) set of speech models to improve the quality of ground truth transcripts used for training alignments. Using confidence scores, transcript length and transcript flattening heuristics designed to cull salient utterances from three decades of speech per language, we then carefully select training data sets consisting of up to 15K hours of speech to be used to train acoustic models without any reliance on manual transcription. We show that this approach yields models with approximately 18K context dependent states that achieve 10% relative improvement in large vocabulary dictation and voice-search systems for Brazilian Portuguese, French, Italian and Russian languages.
View details
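The selection step described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the thresholds, the dictionary field names, and the function `select_utterances` are all hypothetical, and "transcript flattening" is interpreted here (as an assumption) as capping how many utterances may share the same transcript.

```python
from collections import defaultdict


def select_utterances(utterances, min_confidence=0.9, min_words=2,
                      max_per_transcript=3):
    """Filter automatically transcribed utterances for training.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    seen = defaultdict(int)  # how many utterances were kept per transcript
    # Visit the most confident utterances first so the cap keeps the best ones.
    for utt in sorted(utterances, key=lambda u: u["confidence"], reverse=True):
        text = utt["transcript"]
        if utt["confidence"] < min_confidence:
            continue  # recognizer was not confident enough
        if len(text.split()) < min_words:
            continue  # transcript too short
        if seen[text] >= max_per_transcript:
            continue  # assumed "flattening": cap repeats of a transcript
        seen[text] += 1
        kept.append(utt)
    return kept


logged = [
    {"id": "a", "transcript": "weather in paris", "confidence": 0.97},
    {"id": "b", "transcript": "weather in paris", "confidence": 0.95},
    {"id": "c", "transcript": "weather in paris", "confidence": 0.93},
    {"id": "d", "transcript": "yes", "confidence": 0.99},
    {"id": "e", "transcript": "call my mother", "confidence": 0.60},
    {"id": "f", "transcript": "navigate to the airport", "confidence": 0.91},
]
selected_ids = [u["id"] for u in select_utterances(logged, max_per_transcript=2)]
```

In this toy run, utterance `d` is dropped for length, `e` for low confidence, and `c` because two copies of its transcript were already kept.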
Preview abstract
Maximum Entropy (MaxEnt) language models are linear models that are typically regularized via well-known L1 or L2 terms in the likelihood objective, hence avoiding the need for the kinds of backoff or mixture weights used in smoothed n-gram language models using Katz backoff and similar techniques. Even though backoff cost is not required to regularize the model, we investigate the use of backoff features in MaxEnt models, as well as some backoff-inspired variants. These features improve model quality substantially, as measured by perplexity and word-error-rate reductions, even in very large scale training scenarios of tens or hundreds of billions of words and hundreds of millions of features.
View details
Frame by Frame Language Identification in Short Utterances using Deep Neural Networks
Javier Gonzalez-Dominguez
Joaquin Gonzalez-Rodriguez
Neural Networks Special Issue: Neural Network Learning in Big Data (2014)
Preview abstract
This work addresses the use of deep neural networks (DNNs) in automatic language identification (LID) focused on short test utterances. Motivated by their recent success in acoustic modelling for speech recognition, we adapt DNNs to the problem of identifying the language in a given utterance from the short-term acoustic features. We show how DNNs are particularly suitable to perform LID in real-time applications, due to their capacity to emit a language identification posterior at each new frame of the test utterance. We then analyse different aspects of the system, such as the amount of required training data, the number of hidden layers, the relevance of contextual information and the effect of the test utterance duration. Finally, we propose several methods to combine frame-by-frame posteriors. Experiments are conducted on two different datasets: the public NIST Language Recognition Evaluation 2009 (3 seconds task) and a much larger corpus (of 5 million utterances) known as Google 5M LID, obtained from different Google Services. Reported results show relative improvements of DNNs versus the i-vector system of 40% in LRE09 3 second task and 76% in Google 5M LID.
View details
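The abstract above mentions combining frame-by-frame posteriors without naming the methods. As a hedged sketch, two common options are averaging per-frame log posteriors and majority voting over per-frame decisions. Both are assumptions chosen for illustration and are not confirmed to be the paper's methods; the function names and toy posteriors are hypothetical.

```python
import math
from collections import Counter


def combine_mean_log(posteriors, eps=1e-10):
    """Average per-frame log posteriors; return (best language index, scores)."""
    num_langs = len(posteriors[0])
    totals = [0.0] * num_langs
    for frame in posteriors:
        for k, p in enumerate(frame):
            totals[k] += math.log(max(p, eps))  # eps guards against log(0)
    scores = [t / len(posteriors) for t in totals]
    best = max(range(num_langs), key=scores.__getitem__)
    return best, scores


def combine_majority_vote(posteriors):
    """Return the language index that wins the most individual frames."""
    votes = Counter(
        max(range(len(frame)), key=frame.__getitem__) for frame in posteriors
    )
    return votes.most_common(1)[0][0]


# Toy posteriors over three languages for four frames (made-up numbers).
frames = [
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
    [0.50, 0.40, 0.10],
    [0.55, 0.35, 0.10],
]
best_avg, avg_scores = combine_mean_log(frames)
best_vote = combine_majority_vote(frames)
```

Averaging in the log domain lets a single very unlikely frame pull a language's score down sharply, whereas voting ignores how confident each frame was; which behaviour is preferable depends on the data.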
Google's Cross-Dialect Arabic Voice Search
Martin Jansche
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2012), pp. 4441-4444
Preview abstract
We present a large scale effort to build a commercial Automatic Speech Recognition (ASR) product for Arabic. Our goal is to support voice search, dictation, and voice control for the general Arabic-speaking public, including support for multiple Arabic dialects. We describe our ASR system design and compare recognizers for five Arabic dialects, with the potential to reach more than 125 million people in Egypt, Jordan, Lebanon, Saudi Arabia, and the United Arab Emirates (UAE). We compare systems built on diacritized vs. non-diacritized text. We also conduct cross-dialect experiments, where we train on one dialect and test on the others. Our average word error rate (WER) is 24.8% for voice search.
View details
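For readers unfamiliar with the word error rate (WER) reported above, the standard definition is the word-level edit distance between reference and hypothesis divided by the number of reference words. The following is a minimal sketch of that definition; the function name and example sentences are illustrative, and it omits any text normalization a production scorer would apply.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("the") and one substitution ("song" -> "songs") over 5 words.
wer = word_error_rate("play the next song please", "play next songs please")
```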
Deploying Google Search by Voice in Cantonese
Martin Jansche
12th Annual Conference of the International Speech Communication Association (Interspeech 2011), pp. 2865-2868
Preview abstract
We describe our efforts in deploying Google search by voice for Cantonese, a southern Chinese dialect widely spoken in and around Hong Kong and Guangzhou. We collected audio data from local Cantonese speakers in Hong Kong and Guangzhou by using our DataHound smartphone application. This data was used to create appropriate acoustic models. Language models were trained on anonymized query logs from Google Web Search for Hong Kong. Because users in Hong Kong frequently mix English and Cantonese in their queries, we designed our system from the ground up to handle both languages. We report on experiments with different techniques for mapping the phoneme inventories for both languages into a common space. Based on extensive experiments we report word error rates and web scores for both Hong Kong and Guangzhou data. Cantonese Google search by voice was launched in December 2010.
View details
Search by Voice in Mandarin Chinese
Jiulong Shan
Genqing Wu
Zhihong Hu
Xiliu Tang
Martin Jansche
Interspeech 2010, pp. 354-357
Preview abstract
In this paper we describe our efforts to build a Mandarin Chinese voice search system. We describe our strategies for data collection, language, lexicon and acoustic modeling, as well as issues related to text normalization that are an integral part of building voice search systems. We show excellent performance on typical spoken search queries under a variety of accents and acoustic conditions. The system has been in operation since October 2009 and has received very positive user reviews.
View details
Discriminative Topic Segmentation of Text and Speech
Preview
International Conference on Artificial Intelligence and Statistics (AISTATS) (2010)
Preview abstract
In light of the serious problems with both illiteracy and information access in the developing world, there is a widespread belief that speech technology can play a significant role in improving the quality of life of developing-world citizens. We review the main reasons why this impact has not occurred to date, and propose that voice-search systems may be a useful tool in delivering on the original promise. The challenges that must be addressed to realize this vision are analyzed, and initial experimental results in developing voice search for two languages of South Africa (Zulu and Afrikaans) are summarized.
View details
Building Transcribed Speech Corpora Quickly and Cheaply for Many Languages
Thad Hughes
Kaisuke Nakajima
Linne Ha
Atul Vasu
Mike LeBeau
Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), International Speech Communication Association, pp. 1914-1917
Preview abstract
We present a system for quickly and cheaply building transcribed speech corpora containing utterances from many speakers in a variety of acoustic conditions. The system consists of a client application running on an Android mobile device with an intermittent Internet connection to a server. The client application collects demographic information about the speaker, fetches textual prompts from the server for the speaker to read, records the speaker’s voice, and uploads the audio and associated metadata to the server. The system has so far been used to collect over 3000 hours of transcribed audio in 17 languages around the world.
View details
Efficient and Robust Music Identification with Weighted Finite-State Transducers
Preview
IEEE Transactions on Audio, Speech, and Language Processing, vol. to appear (2009)
Audiovisual Celebrity Recognition in Unconstrained Web Videos
Preview
Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2009)
An Audio Indexing System for Election Video Material
Christopher Alberti
Ari Bezman
Anastassia Drofa
Ted Power
Arnaud Sahuguet
Maria Shugrina
Proceedings of ICASSP (2009), pp. 4873-4876
Preview abstract
In the 2008 presidential election race in the United States, the prospective candidates made extensive use of YouTube to post video material. We developed a scalable system that transcribes this material and makes the content searchable (by indexing the meta-data and transcripts of the videos) and allows the user to navigate through the video material based on content. The system is available as an iGoogle gadget as well as a Labs product. Given the large exposure, special emphasis was put on the scalability and reliability of the system. This paper describes the design and implementation of this system.
View details
A new quality measure for topic segmentation of text and speech
Preview
Conference of the International Speech Communication Association (Interspeech) (2009)
Music Identification with Weighted Finite-State Transducers
Preview
Proceedings of the International Conference in Acoustics, Speech and Signal Processing (ICASSP) (2007)
Factor Automata of Automata and Applications
Preview
Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA 2007), Prague, Czech Republic, July 2007
Robust music identification, detection, and analysis
Preview
Proceedings of the International Conference on Music Information Retrieval (ISMIR) (2007)
Supervised Learning of Semantic Classes for Image Annotation and Retrieval
Preview
Gustavo Carneiro
Antoni B. Chan
Nuno Vasconcelos
IEEE Transactions on Pattern Analysis and Machine Intelligence (2007), pp. 394-410
Approaches to reduce the effects of OOV queries on indexed spoken audio
Beth Logan
Jean-Manuel Van Thong
IEEE Transactions on Multimedia, vol. 7 (2005), pp. 899-906
News Tuner: a simple interface for searching and browsing radio archives
The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition
A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
Speechbot: an experimental speech-based search engine for multimedia content on the web
Jean-Manuel Van Thong
Beth Logan
Blair Fidler
K. Maffey
M. Moores
IEEE Transactions on Multimedia, vol. 4 (2002), pp. 88-96
From Multimedia Retrieval to Knowledge Management
Jean-Manuel Van Thong
Beth Logan
Gareth J. F. Jones
IEEE Computer, vol. 35 (2002), pp. 58-66
Topic Segmentation with an Aspect Hidden Markov Model
Indexing Multimedia for the Internet
Brian S. Eberman
Blair Fidler
Robert A. Iannucci
Christopher F. Joerg
Leonidas I. Kontothanassis
David E. Kovalcin
Michael J. Swain
Jean-Manuel Van Thong
VISUAL (1999), pp. 195-202
A spoken language translator for restricted-domain context-free languages
David B. Roe
Alejandro Macarrón
Speech Communication, vol. 11 (1992), pp. 311-319
Efficient Grammar Processing for a Spoken Language Translation System
David B. Roe
Alejandro Macarrón
Proceedings of ICASSP, IEEE, San Francisco, California (1992), pp. 213-216
Toward a Spoken Language Translator for Restricted-Domain Context-Free Languages
David B. Roe
Alejandro Macarrón
EUROSPEECH 91 -- 2nd European Conference on Speech Communication and Technology, Genova, Italy (1991), pp. 1063-1066