Ciprian Chelba

Ciprian Chelba is a Research Scientist with Google. Previously he worked as a Researcher in the Speech Technology Group at Microsoft Research.

His research interests are in statistical modeling of natural language and speech. Recent projects include Google Audio Indexing; indexing, ranking and snippeting of speech content; language modeling for Google Search by Voice; and the Android IME predictive keyboard.

Authored Publications
The paper presents an approach to semantic grounding of language models (LMs) that conceptualizes the LM as a conditional model generating text given a desired semantic message. It embeds the LM in an auto-encoder by feeding its output to a semantic parser whose output is in the same representation domain as the input message. Compared to a baseline that generates text using greedy search, we demonstrate two techniques that improve the fluency and semantic accuracy of the generated text: The first technique samples multiple candidate text sequences from which the semantic parser chooses. The second trains the language model while keeping the semantic parser frozen to improve the semantic accuracy of the auto-encoder. We carry out experiments on the English WebNLG 3.0 data set, using BLEU to measure the fluency of generated text and standard parsing metrics to measure semantic accuracy. We show that our proposed approaches significantly improve on the greedy search baseline. Human evaluation corroborates the results of the automatic evaluation experiments.
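A minimal sketch of the first technique described above (sample several candidates and let the frozen semantic parser pick the one that best matches the input message); the callables `sample_text` and `parse`, and the triple-overlap score, are illustrative assumptions rather than the paper's actual components.

```python
def triple_f1(predicted: set, reference: set) -> float:
    """F1 between two sets of semantic triples (one possible match score)."""
    if not predicted and not reference:
        return 1.0
    tp = len(predicted & reference)
    prec = tp / len(predicted) if predicted else 0.0
    rec = tp / len(reference) if reference else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def generate_grounded(sample_text, parse, message_triples, num_samples=8):
    """sample_text() -> one candidate string; parse(text) -> set of triples
    produced by the frozen semantic parser.  Keep the candidate whose parse
    best matches the input message."""
    best_text, best_score = None, float("-inf")
    for _ in range(num_samples):
        text = sample_text()
        score = triple_f1(parse(text), message_triples)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```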
The paper investigates the feasibility of confidence estimation for neural machine translation models operating at the high end of the performance spectrum. As a side product of the data annotation process necessary for building such models we propose sentence-level accuracy $SACC$ as a simple, self-explanatory evaluation metric for quality of translation. Experiments on two different annotator pools, one comprised of non-expert (crowd-sourced) and one of expert (professional) translators, show that $SACC$ can vary greatly depending on the translation proficiency of the annotators, despite the fact that both pools are about equally reliable according to Krippendorff's alpha metric; the relatively low values of inter-annotator agreement confirm the expectation that sentence-level binary labeling $good$ / $needs\ work$ for translation out of context is very hard. For an English-Spanish translation model operating at $SACC = 0.89$ according to a non-expert annotator pool we can derive a confidence estimate that labels 0.5-0.6 of the $good$ translations in an "in-domain" test set with 0.95 Precision. Switching to an expert annotator pool decreases $SACC$ dramatically: $0.61$ for English-Spanish, measured on the exact same data as above. This forces us to lower the CE model operating point to 0.9 Precision while labeling correctly about 0.20-0.25 of the $good$ translations in the data. We find surprising the extent to which CE depends on the level of proficiency of the annotator pool used for labeling the data. This leads to an important recommendation we wish to make when tackling CE modeling in practice: it is critical to match the end-user expectation for translation quality in the desired domain with the demands of annotators assigning binary quality labels to CE training data.
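For concreteness, a small sketch of how the sentence-level accuracy ($SACC$) of an annotated set and the precision/yield of a confidence-estimation threshold might be computed from binary good/needs-work labels; the function names and the thresholding scheme are assumptions, not the paper's implementation.

```python
def sacc(labels):
    """Sentence-level accuracy: fraction of translations labeled 'good' (1)."""
    return sum(labels) / len(labels)

def precision_at_threshold(confidences, labels, threshold):
    """Of the sentences the CE model accepts (confidence >= threshold),
    what fraction are actually labeled 'good'?"""
    accepted = [g for c, g in zip(confidences, labels) if c >= threshold]
    return sum(accepted) / len(accepted) if accepted else 0.0

def yield_of_good(confidences, labels, threshold):
    """Fraction of all 'good' translations that the CE model accepts."""
    good = [c >= threshold for c, g in zip(confidences, labels) if g == 1]
    return sum(good) / len(good) if good else 0.0
```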
Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S = s_1, \ldots, s_{|S|}$, we propose truncating the target-side context used for incremental predictions by making a Markov (N-gram) assumption. Experiments on WMT EnDe and EnFr data sets show that the N-gram masked self-attention model loses very little in BLEU score for N values in the range 4-8, depending on the task.
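The Markov assumption on the target side can be implemented as a banded causal attention mask. A minimal NumPy sketch (not the paper's code) is shown below.

```python
import numpy as np

def ngram_causal_mask(length: int, n: int) -> np.ndarray:
    """Boolean decoder self-attention mask: position i may attend only to
    positions i-(n-1) .. i, i.e. a causal mask restricted to an N-gram
    (Markov) window over the target side."""
    i = np.arange(length)[:, None]   # query positions
    j = np.arange(length)[None, :]   # key positions
    return (j <= i) & (j >= i - (n - 1))

# Example: with n=4, position 7 attends to positions 4..7 only.
mask = ngram_causal_mask(length=10, n=4)
print(mask[7].nonzero()[0])  # -> [4 5 6 7]
```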
Noise and domain are important aspects of data quality for neural machine translation. Existing research focuses separately on domain-data selection, clean-data selection, or their static combination, leaving the dynamic interaction between them not explicitly examined. This paper introduces a "co-curricular learning" method to compose dynamic domain-data selection with dynamic clean-data selection, for transfer learning across both capabilities. We apply an EM-style optimization procedure to further refine the "co-curriculum". Experimental results and analysis with two domains demonstrate the viability of the method and the properties of data scheduled by the co-curriculum.
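As a rough illustration of dynamic data selection, the sketch below scores sentence pairs with a contrastive (Moore-Lewis-style) domain term plus a trusted-vs-noisy cleanness term and keeps the top fraction; the paper's actual selection criteria, curriculum schedule and EM refinement differ, and all callables here are hypothetical.

```python
def selection_score(pair, domain_lm, base_lm, trusted_model, noisy_model):
    """Hypothetical combined score: higher means more in-domain AND cleaner.
    Each callable returns a (per-token) log-probability."""
    src, tgt = pair
    domain = domain_lm(tgt) - base_lm(tgt)              # domain relevance
    clean = trusted_model(src, tgt) - noisy_model(src, tgt)  # data cleanness
    return domain + clean

def co_curriculum(pairs, score_fn, keep_fraction):
    """Keep the top fraction of pairs by score; re-running this as the
    selection models improve yields a dynamic ('co-curricular') schedule."""
    ranked = sorted(pairs, key=score_fn, reverse=True)
    return ranked[: max(1, int(keep_fraction * len(ranked)))]
```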
Recent work in Neural Machine Translation (NMT) has shown significant quality gains from noised-beam decoding during back-translation, a method to generate synthetic parallel data. We show that the main role of such synthetic noise is not to diversify the source side, as previously suggested, but simply to indicate to the model that the given source is synthetic. We propose a simpler alternative to noising techniques, consisting of tagging back-translated source sentences with an extra token. Our results on WMT outperform noised back-translation in English-Romanian and match performance on English-German, re-defining state-of-the-art in the former.
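The proposed tagging is straightforward to implement: prepend a reserved token to every back-translated source sentence before mixing it with genuine bitext. A small sketch, with the tag string `<BT>` chosen here as an assumption:

```python
BT_TAG = "<BT>"  # reserved token marking synthetic (back-translated) sources

def tag_back_translated(source_sentence: str) -> str:
    """Prepend the tag so the model can tell synthetic sources from genuine
    bitext; genuine sources are left untouched."""
    return f"{BT_TAG} {source_sentence}"

def build_training_corpus(bitext, back_translated):
    """Mix genuine pairs with tagged back-translated pairs."""
    data = [(src, tgt) for src, tgt in bitext]
    data += [(tag_back_translated(src), tgt) for src, tgt in back_translated]
    return data
```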
Model compression is essential for serving large deep neural nets on devices with limited resources or applications that require real-time responses. For advanced NLP problems, a neural language model usually consists of recurrent layers (e.g., using LSTM cells), an embedding matrix for representing input tokens, and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and the softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800k words, and its word embedding and softmax matrices use more than 6 GB of space and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on vocabulary-partition (block) low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). We start by grouping words into c blocks based on their frequency, and then refine the clustering iteratively by constructing weighted low-rank approximations for each block, where the weights are based on the frequencies of the words in the block. The experimental results show our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieved a 6.6x compression rate for the embedding and softmax matrices, and when combined with quantization, our method can achieve a 26x compression rate without losing prediction accuracy.
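A simplified sketch of the frequency-blocked, weighted low-rank idea behind GroupReduce, using a plain SVD with row weights proportional to the square root of word frequency; the paper's iterative refinement and exact weighting are omitted, and the helper below is illustrative only.

```python
import numpy as np

def grouped_lowrank(embedding: np.ndarray, word_freq: np.ndarray,
                    num_blocks: int = 4, rank: int = 32):
    """Sort words by frequency, split the embedding matrix into blocks, and
    approximate each block with a frequency-weighted rank-r factorization."""
    order = np.argsort(-word_freq)                  # most frequent words first
    blocks = np.array_split(order, num_blocks)
    factors = []
    for idx in blocks:
        block = embedding[idx]
        w = np.sqrt(np.maximum(word_freq[idx], 1.0))[:, None]  # row weights
        u, s, vt = np.linalg.svd(w * block, full_matrices=False)
        r = min(rank, len(s))
        u_r = (u[:, :r] * s[:r]) / w                # undo the row weighting
        factors.append((idx, u_r, vt[:r]))          # block ≈ u_r @ vt[:r]
    return factors
```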
    Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection
    Wei Wang
    Taro Watanabe
    Macduff Hughes
    Tetsuji Nakagawa
    Third Conference on Machine Translation (WMT18) (2018)
Measuring domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not yet as well studied. Denoising is concerned with a different type of data quality and tries to reduce the negative impact of data noise on MT training, in particular neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations of the approach show its significant effectiveness for NMT training on data with severe noise.
We investigate the effective memory depth of RNN models by using them for $n$-gram language model (LM) smoothing. Experiments on a small corpus (UPenn Treebank, one million words of training data and 10k vocabulary) have found the LSTM cell with dropout to be the best model for encoding the $n$-gram state when compared with feed-forward and vanilla RNN models. When preserving the sentence independence assumption the LSTM $n$-gram matches the LSTM LM performance for $n=9$ and slightly outperforms it for $n=13$. When allowing dependencies across sentence boundaries, the LSTM $13$-gram almost matches the perplexity of the unlimited-history LSTM LM. LSTM $n$-gram smoothing also has the desirable property of improving with increasing $n$-gram order, unlike the Katz or Kneser-Ney back-off estimators. Using multinomial distributions as targets in training instead of the usual one-hot target is only slightly beneficial for low $n$-gram orders. Experiments on the One Billion Words benchmark show that the results hold at larger scale. Building LSTM $n$-gram LMs may be appealing for some practical situations: the state in an $n$-gram LM can be succinctly represented with $(n-1)*4$ bytes storing the identity of the words in the context, and batches of $n$-gram contexts can be processed in parallel. On the downside, the $n$-gram context encoding computed by the LSTM is discarded, making the model more expensive than a regular recurrent LSTM LM.
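The compact state representation mentioned above is easy to picture: an $n$-gram LM only needs the identities of the $n-1$ context words, for example as 32-bit ids. A small sketch (the padding id and byte layout are assumptions):

```python
import struct

def pack_ngram_state(context_word_ids, n):
    """Encode the (n-1)-word context as (n-1) 32-bit word ids, i.e.
    (n-1)*4 bytes.  Contexts shorter than n-1 words are left-padded with a
    reserved id (0 here, by assumption)."""
    ctx = list(context_word_ids)[-(n - 1):]
    ctx = [0] * (n - 1 - len(ctx)) + ctx
    return struct.pack(f"<{n - 1}I", *ctx)

state = pack_ngram_state([17, 4093, 256], n=4)
assert len(state) == (4 - 1) * 4      # 12 bytes for a 4-gram context
```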
    Language Modeling in the Era of Abundant Data
    AI With the Best online conference. (2017)
Overview of N-gram language modeling on large amounts of data, anchored in the reality of the speech recognition team at Google.
    Sparse Non-negative Matrix Language Modeling: Maximum Entropy Flexibility on the Cheap
The 18th Annual Conference of the International Speech Communication Association (Interspeech 2017), Stockholm, Sweden, pp. 2725-2729
We present a new method for estimating the sparse non-negative model (SNM) by using a small amount of held-out data and the multinomial loss that is natural for language modeling; we validate it experimentally against the previous estimation method which uses leave-one-out on training data and a binary loss function and show that it performs equally well. Being able to train on held-out data is very important in practical situations where training data is mismatched from held-out/test data. We find that fairly small amounts of held-out data (on the order of 30-70 thousand words) are sufficient for training the adjustment model, which is the only model component estimated using gradient descent; the bulk of model parameters are relative frequencies counted on training data. A second contribution is a comparison between SNM and the related class of Maximum Entropy language models. While much cheaper computationally, we show that SNM achieves slightly better perplexity results for the same feature set and same speech recognition accuracy on voice search and short message dictation.
    Sparse Non-negative Matrix Language Modeling (EMNLP presentation)
    Joris Pelemans
    Noam Shazeer
    Association for Computational Linguistics
We present Sparse Non-negative Matrix (SNM) estimation, a novel probability estimation technique for language modeling that can efficiently incorporate arbitrary features. We evaluate SNM language models on two corpora: the One Billion Word Benchmark and a subset of the LDC English Gigaword corpus. Results show that SNM language models trained with n-gram features are a close match for the well-established Kneser-Ney models. The addition of skip-gram features yields a model that is in the same league as the state-of-the-art recurrent neural network language models, as well as complementary: combining the two modeling techniques yields the best known result on the One Billion Word Benchmark. On the Gigaword corpus further improvements are observed using features that cross sentence boundaries. The computational advantages of SNM estimation over both maximum entropy and neural network estimation are probably its main strength, promising an approach that has large flexibility in combining arbitrary features and yet scales gracefully to large amounts of data.
    Sparse Non-negative Matrix Language Modeling
    Joris Pelemans
    Noam Shazeer
    Transactions of the Association for Computational Linguistics, vol. 4 (2016), pp. 329-342
We present Sparse Non-negative Matrix (SNM) estimation, a novel probability estimation technique for language modeling that can efficiently incorporate arbitrary features. We evaluate SNM language models on two corpora: the One Billion Word Benchmark and a subset of the LDC English Gigaword corpus. Results show that SNM language models trained with n-gram features are a close match for the well-established Kneser-Ney models. The addition of skip-gram features yields a model that is in the same league as the state-of-the-art recurrent neural network language models, as well as complementary: combining the two modeling techniques yields the best known result on the One Billion Word Benchmark. On the Gigaword corpus further improvements are observed using features that cross sentence boundaries. The computational advantages of SNM estimation over both maximum entropy and neural network estimation are probably its main strength, promising an approach that has large flexibility in combining arbitrary features and yet scales gracefully to large amounts of data. Presented at EMNLP 2016 (Austin, Texas); see slide deck: http://research.google.com/pubs/pub45647.html.
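A heavily simplified sketch of how an SNM model scores a word: the probability is a normalized sum, over the context's active features (n-grams, skip-grams, ...), of non-negative entries built from training-set relative frequencies scaled by an exponentiated adjustment model learned on held-out data. Feature extraction, meta-features and adjustment-model training are omitted, and the class below is illustrative rather than the published implementation.

```python
import math

class SparseNonNegativeLM:
    """Sketch of SNM scoring with non-negative feature-word entries."""

    def __init__(self, counts, feature_totals, adjust=lambda feat, word: 0.0):
        self.counts = counts          # counts[(feature, word)] from training data
        self.totals = feature_totals  # totals[feature]
        self.adjust = adjust          # adjustment model, learned on held-out data

    def entry(self, feature, word):
        rel_freq = self.counts.get((feature, word), 0.0) / self.totals.get(feature, 1.0)
        return rel_freq * math.exp(self.adjust(feature, word))

    def prob(self, word, active_features, vocab):
        numer = sum(self.entry(f, word) for f in active_features)
        denom = sum(self.entry(f, v) for f in active_features for v in vocab)
        return numer / denom if denom > 0 else 1.0 / len(vocab)
```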
    Geo-location for Voice Search Language Modeling
    Xuedong Zhang
    Keith Hall
    Interspeech 2015, International Speech Communications Association, pp. 1438-1442
    Effects of Language Modeling and its Personalization on Touchscreen Typing Performance
    Andrew Fowler
    Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2015), ACM, New York, NY, USA, pp. 649-658
Modern smartphones correct typing errors and learn user-specific words (such as proper names). Both techniques are useful, yet little has been published about their technical specifics and concrete benefits. One reason is that typing accuracy is difficult to measure empirically on a large scale. We describe a closed-loop, smart touch keyboard (STK) evaluation system that we have implemented to solve this problem. It includes a principled typing simulator for generating human-like noisy touch input, a simple-yet-effective decoder for reconstructing typed words from such spatial data, a large web-scale background language model (LM), and a method for incorporating LM personalization. Using the Enron email corpus as a personalization test set, we show for the first time at this scale that a combined spatial/language model reduces word error rate from a pre-model baseline of 38.4% down to 5.7%, and that LM personalization can improve this further to 4.6%.
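A toy sketch of the combined spatial/language-model scoring idea: each candidate word gets a spatial log-likelihood of the touch points (here an isotropic Gaussian around each key center, an assumption) plus a weighted LM log-probability, and the decoder keeps the best candidate. The actual decoder in the paper is more sophisticated.

```python
def spatial_logprob(touch_points, word, key_centers, sigma=0.35):
    """Toy spatial model: each touch point is a 2-D Gaussian around the
    center of the intended key (key widths as the distance unit)."""
    if len(touch_points) != len(word):
        return float("-inf")
    lp = 0.0
    for (x, y), ch in zip(touch_points, word):
        cx, cy = key_centers[ch]
        lp += -((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2)
    return lp

def decode_word(touch_points, candidates, key_centers, lm_logprob, lm_weight=1.0):
    """Pick the candidate maximizing spatial score + weighted LM score;
    lm_logprob and the weight are assumptions, not the paper's decoder."""
    return max(candidates,
               key=lambda w: spatial_logprob(touch_points, w, key_centers)
                             + lm_weight * lm_logprob(w))
```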
We describe Sparse Non-negative Matrix (SNM) language model estimation using multinomial loss on held-out data. Being able to train on held-out data is important in practical situations where the training data is usually mismatched from the held-out/test data. It is also less constrained than the previous training algorithm using leave-one-out on training data: it allows the use of richer meta-features in the adjustment model, e.g. the diversity counts used by Kneser-Ney smoothing which would be difficult to deal with correctly in leave-one-out training. In experiments on the one billion words language modeling benchmark, we are able to slightly improve on our previous results which use a different loss function, and employ leave-one-out training on a subset of the main training set. Surprisingly, an adjustment model with meta-features that discard all lexical information can perform as well as lexicalized meta-features. We find that fairly small amounts of held-out data (on the order of 30-70 thousand words) are sufficient for training the adjustment model. In a real-life scenario where the training data is a mix of data sources that are imbalanced in size, and of different degrees of relevance to the held-out and test data, taking into account the data source for a given skip-/n-gram feature and combining them for best performance on held-out/test data improves over skip-/n-gram SNM models trained on pooled data by about 8% in the SMT setup, or as much as 15% in the ASR/IME setup. The ability to mix various data sources based on how relevant they are to a mismatched held-out set is probably the most attractive feature of the new estimation method for SNM LM.
The talk presents an overview of statistical language modeling as applied to real-world problems: speech recognition, machine translation, spelling correction, and soft keyboards, to name a few prominent ones. We summarize the most successful estimation techniques, and examine how they fare for applications with abundant data, e.g. voice search. We conclude by highlighting a few open problems: getting an accurate estimate for the entropy of text produced by a very specific source (e.g., the query stream); optimally leveraging data that is of different degrees of relevance to a given "domain"; does a bound on the size of a "good" model for a given source exist?
    Sparse Non-negative Matrix Language Modeling for Geo-annotated Query Session Data
    Noam M. Shazeer
Automatic Speech Recognition and Understanding Workshop (ASRU 2015) Proceedings, IEEE (2015)
The paper investigates the impact on query language modeling when using skip-grams within a query as well as across queries in a given search session, in conjunction with the geo-annotation available for the query stream data. As modeling tool we use the recently proposed sparse non-negative matrix estimation technique, since it offers the same expressive power as the well-established maximum entropy approach in combining arbitrary context features. Experiments on the google.com query stream show that using session-level and geo-location context we can expect reductions in perplexity of 34% relative over the Kneser-Ney N-gram baseline; when evaluating on the "local" subset of the query stream, the relative reduction in PPL is 51%---more than a bit. Both sources of context information (geo-location, and previous queries in session) are about equally valuable in building a language model for the query stream.
    Sparse Non-negative Matrix Language Modeling For Skip-grams
    Noam M. Shazeer
    Joris Pelemans
    Proceedings of Interspeech 2015, ISCA, pp. 1428-1432
We present a novel family of language model (LM) estimation techniques named Sparse Non-negative Matrix (SNM) estimation. A first set of experiments empirically evaluating these techniques on the One Billion Word Benchmark [3] shows that with skip-gram features SNM LMs are able to match the state-of-the-art recurrent neural network (RNN) LMs; combining the two modeling techniques yields the best known result on the benchmark. The computational advantages of SNM over both maximum entropy and RNN LM estimation are probably its main strength, promising an approach that has the same flexibility in combining arbitrary features effectively and yet should scale to very large amounts of data as gracefully as n-gram LMs do.
    Pruning Sparse Non-negative Matrix N-gram Language Models
    Joris Pelemans
    Noam M. Shazeer
    Proceedings of Interspeech 2015, ISCA, pp. 1433-1437
In this paper we present a pruning algorithm and experimental results for our recently proposed Sparse Non-negative Matrix (SNM) family of language models (LMs). We have uncovered a bug in the experimental setup for SNM pruning; see the Errata section for corrected results. We also illustrate a method for converting an SNM LM to ARPA back-off format which can be readily used in a single-pass decoder for Automatic Speech Recognition.
We present a novel family of language model (LM) estimation techniques named Sparse Non-negative Matrix (SNM) estimation. A first set of experiments empirically evaluating it on the One Billion Word Benchmark shows that SNM n-gram LMs perform almost as well as the well-established Kneser-Ney (KN) models. When using skip-gram features the models are able to match the state-of-the-art recurrent neural network (RNN) LMs; combining the two modeling techniques yields the best known result on the benchmark. The computational advantages of SNM over both maximum entropy and RNN LM estimation are probably its main strength, promising an approach that has the same flexibility in combining arbitrary features effectively and yet should scale to very large amounts of data as gracefully as n-gram LMs do.
    One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
    Tomas Mikolov
    Mike Schuster
    Qi Ge
    Thorsten Brants
    Phillipp Koehn
    Tony Robinson
    ArXiv, Google (2013)
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project at https://code.google.com/p/1-billion-word-language-modeling-benchmark/; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
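The quoted numbers can be checked directly, since perplexity is two to the power of the cross-entropy in bits per word; the short calculation below reproduces the statement that a 35% perplexity reduction corresponds to roughly a 10% reduction in bits.

```python
import math

ppl_baseline = 67.6
ppl_combined = 0.65 * ppl_baseline          # a 35% relative reduction in PPL

bits_baseline = math.log2(ppl_baseline)     # ~6.08 bits per word
bits_combined = math.log2(ppl_combined)     # ~5.46 bits per word

relative_bits = 1 - bits_combined / bits_baseline
print(f"{bits_baseline:.2f} -> {bits_combined:.2f} bits "
      f"({100 * relative_bits:.1f}% reduction in cross-entropy)")
# prints roughly a 10% reduction in bits, matching the abstract
```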
    Large Scale Distributed Acoustic Modeling With Back-off N-grams
    Peng Xu
    Thomas Richardson
    IEEE Transactions on Audio, Speech and Language Processing, vol. 21 (2013), pp. 1158-1169
The paper revives an older approach to acoustic modeling that borrows from n-gram language modeling in an attempt to scale up both the amount of training data and model size (as measured by the number of parameters in the model), to approximately 100 times larger than current sizes used in automatic speech recognition. In such a data-rich setting, we can expand the phonetic context significantly beyond triphones, as well as increase the number of Gaussian mixture components for the context-dependent states that allow it. We have experimented with contexts that span seven or more context-independent phones, and up to 620 mixture components per state. Dealing with unseen phonetic contexts is accomplished using the familiar back-off technique used in language modeling, due to implementation simplicity. The back-off acoustic model is estimated, stored and served using MapReduce distributed computing infrastructure. Speech recognition experiments are carried out in an N-best list rescoring framework for Google Voice Search. Training big models on large amounts of data proves to be an effective way to increase the accuracy of a state-of-the-art automatic speech recognition system. We use 87,000 hours of training data (speech along with transcription) obtained by filtering utterances in Voice Search logs on automatic speech recognition confidence. Models ranging in size between 20-40 million Gaussians are estimated using maximum likelihood training. They achieve relative reductions in word-error-rate of 11% and 6% when combined with first-pass models trained using maximum likelihood, and boosted maximum mutual information, respectively. Increasing the context size beyond five phones (quinphones) does not help.
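A schematic sketch of the back-off lookup for phonetic contexts: try the longest available context first and fall back to progressively shorter contexts until a trained state is found. The data structures and the way a back-off penalty is accumulated here are assumptions for illustration.

```python
def score_state(context, phone, models, backoff_weights):
    """Back-off context lookup borrowed from n-gram LMs: return the acoustic
    model for the longest trained context, plus the accumulated back-off
    penalty incurred while shortening the context."""
    penalty = 0.0
    ctx = list(context)
    while True:
        key = (tuple(ctx), phone)
        if key in models:
            return models[key], penalty
        if not ctx:
            raise KeyError(f"no model for context-independent phone {phone!r}")
        penalty += backoff_weights.get((tuple(ctx), phone), 0.0)
        ctx = ctx[1:]   # back off to a shorter context (drop the outermost phone)
```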
Slides from a presentation on an invited panel at the Mobile Voice Conference 2013, San Francisco.
Google Voice Search is an application that provides a data-rich setup for both language and acoustic modeling research. The approach we take revives an older approach to acoustic modeling that borrows from n-gram language modeling in an attempt to scale up both the amount of training data, and the model size (as measured by the number of parameters in the model), to approximately 100 times larger than current sizes used in automatic speech recognition. Speech recognition experiments are carried out in an N-best list rescoring framework for Google Voice Search. We use 87,000 hours of training data (speech along with transcription) obtained by filtering utterances in Voice Search logs on automatic speech recognition confidence. Models ranging in size between 20-40 million Gaussians are estimated using maximum likelihood training. They achieve relative reductions in word-error-rate of 11% and 6% when combined with first-pass models trained using maximum likelihood, and boosted maximum mutual information, respectively. Increasing the context size beyond five phones (quinphones) does not help.
    Empirical Exploration of Language Modeling for the google.com Query Stream as Applied to Mobile Voice Search
    Johan Schalkwyk
    Mobile Speech and Advanced Natural Language Solutions, Springer Science+Business Media, New York (2013), pp. 197-229
Mobile is poised to become the predominant platform over which people are accessing the World Wide Web. Recent developments in speech recognition and understanding, backed by high bandwidth coverage and high quality speech signal acquisition on smartphones and tablets, are presenting users with the choice of speaking their web search queries instead of typing them. A critical component of a speech recognition system targeting web search is the language model. The chapter presents an empirical exploration of the google.com query stream with the end goal of high quality statistical language modeling for mobile voice search. Our experiments show that after text normalization the query stream is not as "wild" as it seems at first sight. One can achieve out-of-vocabulary rates below 1% using a one million word vocabulary, and excellent n-gram hit ratios of 77/88% even at high orders such as n=5/4, respectively. A more careful analysis shows that a significantly larger vocabulary (approx. 10 million words) may be required to guarantee at most 1% out-of-vocabulary rate for a large percentage (95%) of users. Using large scale, distributed language models can improve performance significantly---up to 10% relative reductions in word-error-rate over conventional models used in speech recognition. We also find that the query stream is non-stationary, which means that adding more past training data beyond a certain point provides diminishing returns, and may even degrade performance slightly. Perhaps less surprisingly, we have shown that locale matters significantly for English query data across USA, Great Britain and Australia. In an attempt to leverage the speech data in voice search logs, we successfully build large-scale discriminative N-gram language models and derive small but significant gains in recognition performance.
    Bimanual gesture keyboard
Proceedings of UIST 2012 – The ACM Symposium on User Interface Software and Technology, ACM, New York, NY, USA, pp. 137-146
Gesture keyboards represent an increasingly popular way to input text on mobile devices today. However, current gesture keyboards are exclusively unimanual. To take advantage of the capability of modern multi-touch screens, we created a novel bimanual gesture text entry system, extending the gesture keyboard paradigm from one finger to multiple fingers. To address the complexity of recognizing bimanual gestures, we designed and implemented two related interaction methods, finger-release and space-required, both based on a new multi-stroke gesture recognition algorithm. A formal experiment showed that bimanual gesture behaviors were easy to learn. They improved comfort and reduced the physical demand relative to unimanual gestures on tablets. The results indicated that these new gesture keyboards were valuable complements to unimanual gesture and regular typing keyboards.
    Distributed Discriminative Language Models for Google Voice Search
    Preethi Jyothi
    Brian Strope
    Proceedings of ICASSP 2012, IEEE, pp. 5017-5021
This paper considers large-scale linear discriminative language models trained using a distributed perceptron algorithm. The algorithm is implemented efficiently using a MapReduce/SSTable framework. This work also introduces the use of large amounts of unsupervised data (confidence filtered Google voice-search logs) in conjunction with a novel training procedure that regenerates word lattices for the given data with a weaker acoustic model than the one used to generate the unsupervised transcriptions for the logged data. We observe small but statistically significant improvements in recognition performance after reranking N-best lists of a standard Google voice-search data set.
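A minimal sketch of one perceptron reranking step on an N-best list, using word n-gram count features on top of a first-pass score; the distributed (per-shard plus averaging) part of training and the exact feature set are omitted, and all names are illustrative.

```python
from collections import Counter

def ngram_features(words, n=2):
    """Count features over word n-grams of a hypothesis."""
    feats = Counter()
    for order in range(1, n + 1):
        for i in range(len(words) - order + 1):
            feats[tuple(words[i:i + order])] += 1
    return feats

def perceptron_rerank_update(weights, nbest, base_scores, oracle_index, eta=1.0):
    """Structured-perceptron step: move weights toward the oracle
    (lowest-error) hypothesis and away from the current top-scoring one."""
    def total(i):
        return base_scores[i] + sum(weights.get(f, 0.0) * v
                                    for f, v in ngram_features(nbest[i]).items())
    best = max(range(len(nbest)), key=total)
    if best == oracle_index:
        return weights
    for f, v in ngram_features(nbest[oracle_index]).items():
        weights[f] = weights.get(f, 0.0) + eta * v
    for f, v in ngram_features(nbest[best]).items():
        weights[f] = weights.get(f, 0.0) - eta * v
    return weights
```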
    Distributed Acoustic Modeling with Back-off N-grams
    Peng Xu
    Thomas Richardson
    Proceedings of ICASSP 2012, IEEE, pp. 4129-4132
The paper proposes an approach to acoustic modeling that borrows from n-gram language modeling in an attempt to scale up both the amount of training data and model size (as measured by the number of parameters in the model) to approximately 100 times larger than current sizes used in ASR. Dealing with unseen phonetic contexts is accomplished using the familiar back-off technique used in language modeling, due to implementation simplicity. The new acoustic model is estimated and stored using the MapReduce distributed computing infrastructure. Speech recognition experiments are carried out in an N-best rescoring framework for Google Voice Search. 87,000 hours of training data is obtained in an unsupervised fashion by filtering utterances in Voice Search logs on ASR confidence. The resulting models are trained using maximum likelihood and contain 20-40 million Gaussians. They achieve relative reductions in WER of 11% and 6% over first-pass models trained using maximum likelihood, and boosted MMI, respectively.
    Large-scale Discriminative Language Model Reranking for Voice Search
    Preethi Jyothi
    Brian Strope
    Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, Association for Computational Linguistics, pp. 41-49
We present a distributed framework for large-scale discriminative language models that can be integrated within a large vocabulary continuous speech recognition (LVCSR) system using lattice rescoring. We intentionally use a weakened acoustic model in a baseline LVCSR system to generate candidate hypotheses for voice-search data; this allows us to utilize large amounts of unsupervised data to train our models. We propose an efficient and scalable MapReduce framework that uses a perceptron-style distributed training strategy to handle these large amounts of data. We report small but significant improvements in recognition accuracies on a standard voice-search data set using our discriminative reranking model. We also provide an analysis of the various parameters of our models, including model size, types of features, and size of partitions in the MapReduce framework, with the help of supporting experiments.
In this paper, we investigate how to optimize the vocabulary for a voice search language model. The metric we optimize over is the out-of-vocabulary (OoV) rate since it is a strong indicator of user experience. In a departure from the usual way of measuring OoV rates, web search logs allow us to compute the per-session OoV rate and thus estimate the percentage of users that experience a given OoV rate. Under very conservative text normalization, we find that a voice search vocabulary consisting of 2 to 2.5M words extracted from 1 week of search query data will result in an aggregate OoV rate of 0.01; at that size, the same OoV rate will also be experienced by 90% of users. The number of words included in the vocabulary is a stable indicator of the OoV rate. Altering the freshness of the vocabulary or the duration of the time window over which the training data is gathered does not significantly change the OoV rate. Surprisingly, a significantly larger vocabulary (approx. 10 million words) is required to guarantee OoV rates below 0.01 (1%) for 95% of the users.
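A small sketch of the per-session OoV computation described above: tokenize each session's queries, measure the fraction of tokens outside the vocabulary, and then report what share of sessions (a proxy for users) stay at or below a target rate. The tokenization and data layout are assumptions.

```python
def per_session_oov(sessions, vocabulary):
    """sessions: list of sessions, each a list of query strings.
    Returns one OoV rate per non-empty session."""
    rates = []
    for queries in sessions:
        tokens = [tok for q in queries for tok in q.split()]
        if tokens:
            oov = sum(tok not in vocabulary for tok in tokens)
            rates.append(oov / len(tokens))
    return rates

def share_of_sessions_within(rates, target=0.01):
    """Fraction of sessions whose OoV rate is at most the target."""
    return sum(r <= target for r in rates) / len(rates) if rates else 0.0
```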
    Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
    Johan Schalkwyk
    Boulos Harb
    Peng Xu
    Preethi Jyothi
    Thorsten Brants
    Vida Ha
    Will Neveitt
    University of Toronto (2012)
A critical component of a speech recognition system targeting web search is the language model. The talk presents an empirical exploration of the google.com query stream with the end goal of high quality statistical language modeling for mobile voice search. Our experiments show that after text normalization the query stream is not as "wild" as it seems at first sight. One can achieve out-of-vocabulary rates below 1% using a one million word vocabulary, and excellent n-gram hit ratios of 77/88% even at high orders such as n=5/4, respectively. Using large scale, distributed language models can improve performance significantly---up to 10% relative reductions in word-error-rate over conventional models used in speech recognition. We also find that the query stream is non-stationary, which means that adding more past training data beyond a certain point provides diminishing returns, and may even degrade performance slightly. Perhaps less surprisingly, we have shown that locale matters significantly for English query data across USA, Great Britain and Australia. In an attempt to leverage the speech data in voice search logs, we successfully build large-scale discriminative N-gram language models and derive small but significant gains in recognition performance.
Large language models have been proven quite beneficial for a variety of automatic speech recognition tasks at Google. We summarize results on Voice Search and a few YouTube speech transcription tasks to highlight the impact that one can expect from increasing both the amount of training data and the size of the language model estimated from such data. Depending on the task, the availability and amount of training data used, the language model size, and the amount of work and care put into integrating them in the lattice rescoring step, we observe reductions in word error rate between 6% and 10% relative, for systems on a wide range of operating points between 17% and 52% word error rate.
    Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
    Johan Schalkwyk
    Boulos Harb
    Peng Xu
    Thorsten Brants
    Vida Ha
    Will Neveitt
    OGI/OHSU Seminar Series, Portland, Oregon, USA (2011)
The talk presents key aspects faced when building language models (LM) for the google.com query stream, and their use for automatic speech recognition (ASR). Distributed LM tools enable us to handle a huge amount of data, and experiment with LMs that are two orders of magnitude larger than usual. An empirical exploration of the problem led us to re-discovering a less known interaction between Kneser-Ney smoothing and entropy pruning, possible non-stationarity of the query stream, as well as strong dependence on various English locales---USA, Britain and Australia. LM compression techniques allowed us to use one billion n-gram LMs in the first pass of an ASR system built on FST technology, and evaluate empirically whether a two-pass system architecture has any losses over one pass.
    Speech Retrieval
    Timothy J. Hazen
    Bhuvana Ramabhadran
    Murat Saraçlar
    Spoken Language Understanding, John Wiley and Sons, Ltd (2011), pp. 417-446
    Study on Interaction between Entropy Pruning and Kneser-Ney Smoothing
    Thorsten Brants
    Will Neveitt
    Peng Xu
    Proceedings of Interspeech (2010), pp. 2242-2245
The paper presents an in-depth analysis of a less known interaction between Kneser-Ney smoothing and entropy pruning that leads to severe degradation in language model performance under aggressive pruning regimes. Experiments in a data-rich setup such as google.com voice search show a significant impact in WER as well: pruning Kneser-Ney and Katz models to 0.1% of their original size impacts speech recognition accuracy significantly, approx. 10% relative. Any third party with LDC membership should be able to reproduce our experiments using the scripts available at http://code.google.com/p/kneser-ney-pruning-experiments.
    Model Combination for Machine Translation
    John DeNero
    Franz Och
    Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) (2010), pp. 975-983
    Statistical Language Modeling
    The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ United Kingdom (2010), pp. 74-104
Many practical applications such as automatic speech recognition, statistical machine translation, and spelling correction resort to variants of the well-established source-channel model for producing the correct string of words W given an input speech signal, sentence in a foreign language, or typed text with possible mistakes, respectively. A basic component of such systems is a statistical language model which estimates the prior probability values for strings of words W.
ISCA student panel presentation slides.
    Query Language Modeling for Voice Search
    Johan Schalkwyk
    Thorsten Brants
    Vida Ha
    Boulos Harb
    Will Neveitt
    Peng Xu
    Proceedings of the 2010 IEEE Workshop on Spoken Language Technology, IEEE, pp. 127-132
The paper presents an empirical exploration of google.com query stream language modeling. We describe the normalization of the typed query stream resulting in out-of-vocabulary (OoV) rates below 1% for a one million word vocabulary. We present a comprehensive set of experiments that guided the design decisions for a voice search service. In the process we re-discovered a less known interaction between Kneser-Ney smoothing and entropy pruning, and found empirical evidence that hints at non-stationarity of the query stream, as well as strong dependence on various English locales---USA, Britain and Australia.
    Google Search by Voice: A Case Study
    Johan Schalkwyk
    Doug Beeferman
    Mike Cohen
    Brian Strope
    Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, Springer (2010), pp. 61-90
    An Audio Indexing System for Election Video Material
    Christopher Alberti
    Ari Bezman
    Anastassia Drofa
    Ted Power
    Arnaud Sahuguet
    Maria Shugrina
    Proceedings of ICASSP (2009), pp. 4873-4876
In the 2008 presidential election race in the United States, the prospective candidates made extensive use of YouTube to post video material. We developed a scalable system that transcribes this material and makes the content searchable (by indexing the meta-data and transcripts of the videos) and allows the user to navigate through the video material based on content. The system is available as an iGoogle gadget as well as a Labs product. Given the large exposure, special emphasis was put on the scalability and reliability of the system. This paper describes the design and implementation of this system.
    Back-off Language Model Compression
    Boulos Harb
    Proceedings of Interspeech 2009, International Speech Communication Association (ISCA), pp. 325-355
With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed one. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off-weights quantized to 1 byte, respectively. The best compression algorithm achieves 2.6 bytes/n-gram at ≈18X slower than uncompressed. For faster LM operation we found it feasible to represent the LM at ≈4.0 bytes/n-gram, and ≈3X slower than the uncompressed LM. The memory footprint of an LM containing one billion n-grams can thus be reduced to 3-4 Gbytes without impacting its speed too much. See the presentation material from a talk about this paper.
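A minimal sketch of the kind of block compression the abstract refers to, applied to one sorted integer array of the LM representation: each block stores its first word id raw and the remaining ids as varint-coded deltas, so that only the block containing a query needs to be decoded. The block size, varint coding and layout are illustrative choices, not the formats evaluated in the paper.

```python
def varint(value: int) -> bytes:
    """Standard base-128 varint encoding of a non-negative integer."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        out.append(byte | (0x80 if value else 0x00))
        if not value:
            return bytes(out)

def compress_blocks(sorted_word_ids, block_size=64):
    """Delta + varint encode a sorted integer array in fixed-size blocks;
    per-block byte offsets allow random access at query time."""
    blocks, offsets, pos = bytearray(), [], 0
    for start in range(0, len(sorted_word_ids), block_size):
        offsets.append(pos)
        block = sorted_word_ids[start:start + block_size]
        prev = 0
        for i, wid in enumerate(block):
            enc = varint(wid if i == 0 else wid - prev)
            blocks.extend(enc)
            pos += len(enc)
            prev = wid
    return bytes(blocks), offsets
```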
    Retrieval and Browsing of Spoken Content
    Timothy J. Hazen
    Murat Saraçlar
    Signal Processing Magazine, IEEE, vol. 25 (2008), pp. 39-49
    Adaptation of Maximum Entropy Capitalizer: Little Data Can Help a Lot
    Alex Acero
    Computer Speech and Language, vol. 20 (2006), pp. 382-399
    Integration of Metadata in Spoken Document Search Using Position Specific Posterior Lattices
    Jorge Silva
    Alex Acero
    Proceedings of the IEEE International Workshop on Spoken Language Technology, IEEE, Palm Beach, Aruba (2006), pp. 46-49
    Acoustic Sensitive Language Model Perplexity for Automatic Speech Recognition
    Proceedings of Machine Learning Workshop, Snowbird, UT (2006)
    Towards Spoken-Document Retrieval for the Internet: Lattice Indexing For Large-Scale Web-Search Architectures
    Zheng-Yu Zhou
    Peng Yu
    Frank Seide
    Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, Association for Computational Linguistics, New York City, USA (2006), pp. 415-422
    Soft Indexing of Speech Content for Search in Spoken Documents
    Jorge Silva
    Alex Acero
    Computer Speech and Language (2006), pp. 458-478
    Pruning Analysis of the Position Specific Posterior Lattices for Spoken Document Search
    Jorge Silva Sanchez
    Alex Acero
    Proceedings of ICASSP'06, IEEE, Toulouse, France (2006), pp. 945-948
    Indexing Uncertainty for Spoken Document Search
    Alex Acero
    Proceedings of Eurospeech, ISCA, Lisbon, Portugal (2005), pp. 61-64
    SPEECH OGLE: Indexing Uncertainty for Spoken Document Search
    Alex Acero
    Proceedings of the ACL Interactive Poster and Demonstration Sessions, Association for Computational Linguistics, Ann Arbor, Michigan (2005), pp. 41-44
    Position Specific Posterior Lattices for Indexing Speech
    Alex Acero
    Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Association for Computational Linguistics, Ann Arbor, Michigan (2005), pp. 443-450
    Adaptation of Maximum Entropy Capitalizer: Little Data Can Help a Lot
    Alex Acero
    Proceedings of EMNLP, Barcelona, Spain (2004), pp. 285-292
    Conditional Maximum Likelihood Estimation of Naive Bayes Probability Models Using Rational Function Growth Transform
    Alex Acero
    Proceedings of Machine Learning Workshop, Snowbird, UT (2004)
    Parsing Conversational Speech Using Enhanced Segmentation
    Jeremy G. Kahn
    Mari Ostendorf
    HLT-NAACL 2004: Short Papers, Association for Computational Linguistics, Boston, Massachusetts, USA, pp. 125-128
    Conditional Maximum Likelihood Estimation of Naive Bayes Probability Models
    Alex Acero
    Microsoft Research, Redmond, WA (2004)
    Discriminative Training of N-gram Classifiers for Speech and Text Routing
    Alex Acero
    Proceedings of Eurospeech 2003, Geneva, Switzerland, pp. 1-4
    Speech Utterance Classification
    M. Mahajan
    A. Acero
    Proceedings of ICASSP, Hong Kong (2003), pp. 280-283
    A Study on Richer Syntactic Dependencies for Structured Language Modeling
    Peng Xu
    Frederick Jelinek
    ACL, http://www.aclweb.org/ (2002), pp. 191-198
    Growth Transform for Conditional Maximum Likelihood Estimation of Log-linear Models
    Milind Mahajan
    Microsoft Research, Redmond, WA (2002)
    Mutual Information Phone Clustering for Decision Tree Induction
    R. Morton
    Proc. Int. Conf. on Spoken Language Processing, Denver, Colorado (2002)
    Portability of Syntactic Structure for Language Modeling
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, www.ieee.org (2001)
    Richer Syntactic Dependencies for Structured Language Modeling
    P. Xu
    Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding, Madonna di Campiglio, Italy (2001)
    Information Extraction Using the Structured Language Model
    Milind Mahajan
    Proceedings of EMNLP, Pittsburgh, Pennsylvania (2001), pp. 74-81
    Structured Language Modeling
    Frederick Jelinek
    Computer Speech and Language, vol. 14 (2000), pp. 283-332
    Exploiting Syntactic Structure for Natural Language Modeling
    The Johns Hopkins University, www.jhu.edu (2000)
    Recognition performance of a structured language model
    F. Jelinek
    Proceedings of Eurospeech, Budapest, Hungary (1999)
    Structured Language Modeling for Speech Recognition
    Frederick Jelinek
    Proceedings of NLDB (1999)
    Putting Language into Language Modeling
    Frederick Jelinek
    Proceedings of Eurospeech'99, Budapest, Hungary (1999)
    Exploiting Syntactic Structure for Language Modeling
    Frederick Jelinek
    Proceedings of COLING-ACL (1998), pp. 225-231
    Refinement of a Structured Language Model
    Frederick Jelinek
    Proceedings of ICAPR (1998)
    Structure and Performance of a Dependency Language Model
    D. Engle
    F. Jelinek
    V. Jimenez
    S. Khudanpur
    L. Mangu
    H. Printz
    E. S. Ristad
    R. Rosenfeld
    A. Stolcke
    D. Wu
    Proceedings of Eurospeech, Rhodes, Greece (1997), pp. 2775-2778
    A Structured Language Model
Proceedings of ACL-EACL, Madrid, Spain (1997), pp. 498-500, student section