Cyril Allauzen

Cyril Allauzen is a research scientist at Google in New York. His main research interests are in finite-state methods and their applications to text, speech and natural language processing and machine learning. Before joining Google, he worked as a researcher at AT&T Labs Research and at NYU's Courant Institute of Mathematical Sciences. Cyril received his Ph.D. in computer science from the Université de Marne-la-Vallée in 2001.

Cyril is an author of the OpenFst Library, the OpenKernel Library and the GRM Library.

Authored Publications
    Abstract: Improving the performance of end-to-end ASR models on long utterances of minutes to hours is an ongoing problem in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundaries based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set a alarm for... 5 o'clock"). Here, we propose replacing the VAD with an end-to-end ASR model capable of predicting segment boundaries, allowing the segmentation to be conditioned not only on deeper acoustic features but also on linguistic features from the decoded text, while requiring negligible extra compute. In experiments on real-world long-form audio (YouTube) of up to 30 minutes long, we demonstrate WER gains of 5% relative to the VAD baseline on a state-of-the-art Conformer RNN-T setup.
    Abstract: This paper explores ways to improve a two-pass speech recognition system in which the first pass is a hybrid autoregressive transducer model and the second pass is a neural language model. The main focus is on the scores provided by each of these models, their quantitative analysis, how to improve them, and the best way to integrate them with the objective of better recognition accuracy. Several analyses are presented to show the importance of the choice of the integration weights for combining the first-pass and second-pass scores. A sequence-level weight estimation model, along with four training criteria, is proposed to allow adaptive integration of the scores per acoustic sequence. The effectiveness of this algorithm is demonstrated by constructing and analyzing models on the LibriSpeech data set.
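The integration-weight idea can be illustrated with a minimal sketch (illustrative code, not the paper's implementation; `combine_scores`, `rescore`, and the fixed scalar weight `lam` are assumptions standing in for the proposed sequence-level weight estimator):

```python
def combine_scores(first_pass_logp, second_pass_logp, lam):
    """Log-linearly combine per-hypothesis log-probabilities; `lam` is
    the integration weight given to the second-pass language model."""
    return [(1.0 - lam) * f + lam * s
            for f, s in zip(first_pass_logp, second_pass_logp)]

def rescore(hyps, first_pass_logp, second_pass_logp, lam):
    """Rerank hypotheses by the combined score, best first."""
    combined = combine_scores(first_pass_logp, second_pass_logp, lam)
    return [h for _, h in sorted(zip(combined, hyps), reverse=True)]
```

The paper's point is precisely that a single global `lam` is sub-optimal: it estimates the weight adaptively for each acoustic sequence.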
    Abstract: On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with, so E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to improve decoder latency we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
    Hybrid Autoregressive Transducer (HAT)
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 6139-6143
    Abstract: This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model, which can be used to decide whether inference with an external language model is beneficial or not. We evaluate our proposed model on a large-scale voice search task. Our experiments show significant improvements in WER compared to the state-of-the-art approaches.
    Abstract: As voice-driven intelligent assistants become commonplace, adaptation to user context becomes critical for Automatic Speech Recognition (ASR) systems. For example, ASR systems may be expected to recognize a user's contact names containing improbable or out-of-vocabulary (OOV) words. We introduce a method to identify contextual cues in a first-pass ASR system's output and to recover out-of-lattice hypotheses that are contextually relevant. Our proposed module is agnostic to the architecture of the underlying recognizer, provided it generates a word lattice of hypotheses; it is sufficiently compact for use on device. The module identifies subgraphs in the lattice likely to contain named entities (NEs), recovers phoneme hypotheses over the corresponding time spans, and inserts NEs that are phonetically close to those hypotheses. We measure a decrease in the mean word error rate (WER) of word lattices from 11.5% to 4.9% on a test set of NEs.
    Federated Learning of N-gram Language Models
    Adeline Wong
    Ananda Theertha Suresh
    Francoise Beaufays
    The SIGNLL Conference on Computational Natural Language Learning (2019)
    Abstract: We propose algorithms to train production-quality n-gram language models using federated learning. Federated learning is a machine learning technique to train global models to be used on portable devices such as smartphones, without the users' data ever leaving their devices. This is especially relevant for applications handling privacy-sensitive data, such as virtual keyboards. While the principles of federated learning are fairly generic, its methodology assumes that the underlying models are neural networks. However, virtual keyboards are typically powered by n-gram language models, mostly for latency reasons. We propose to train a recurrent neural network language model using the decentralized "FederatedAveraging" algorithm directly on client devices, and then to approximate this federated model server-side with an n-gram model that can be deployed to devices for fast inference. Our technical contributions include novel ways of handling large vocabularies, algorithms to correct capitalization errors in user data, and efficient finite-state transducer algorithms to convert word language models to word-piece language models and vice versa. The n-gram language models trained with federated learning are compared to n-grams trained with traditional server-based algorithms using A/B tests on tens of millions of users of a virtual keyboard. Results are presented for two languages, American English and Brazilian Portuguese. This work demonstrates that high-quality n-gram language models can be trained directly on client mobile devices without sensitive training data ever leaving the device.
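The aggregation step of FederatedAveraging can be sketched as follows (a simplified, illustrative version: real federated training repeats this over many rounds of local training, and the model here is reduced to a flat parameter list):

```python
def federated_average(client_params, client_sizes):
    """One FederatedAveraging aggregation step: average each parameter
    across clients, weighting each client by its number of local
    training examples, so no raw user data leaves the device."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]
```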
    Algorithms for Weighted Finite Automata with Failure Transitions
    International Conference on Implementation and Application of Automata (CIAA) (2018), pp. 46-58
    Abstract: In this paper we extend some key weighted finite automaton (WFA) algorithms to automata with failure transitions (phi-WFAs). Failure transitions, which are taken only when no immediate match is possible at a given state, are used to compactly represent automata and have many applications. An efficient intersection algorithm and a shortest-distance algorithm (over R+) are presented, as well as a related algorithm to remove failure transitions from a phi-WFA.
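The failure-transition semantics can be illustrated with a small sketch (illustrative only, not the paper's algorithms; `phi_advance` is a hypothetical helper showing how a phi-WFA consumes one symbol, with weights in the tropical semiring, i.e. added along a path):

```python
def phi_advance(arcs, phi, state, symbol):
    """Follow failure (phi) transitions until `symbol` can be matched.
    `arcs` maps (state, symbol) -> (next_state, weight); `phi` maps
    state -> (next_state, weight) for its failure arc, if any.
    Returns (next_state, accumulated_weight), or None if no match."""
    weight = 0.0  # tropical semiring: weights add along a path
    while (state, symbol) not in arcs:
        if state not in phi:
            return None  # no matching arc and no failure arc to take
        state, w = phi[state]  # take the failure arc without consuming input
        weight += w
    nxt, w = arcs[(state, symbol)]
    return nxt, weight + w
```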
    Transliterated mobile keyboard input via weighted finite-state transducers
    Lars Hellsten
    Prasoon Goyal
    Francoise Beaufays
    Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP) (2017)
    Abstract: We present an extension to a mobile keyboard input decoder based on finite-state transducers that provides general transliteration support, and demonstrate its use for input of South Asian languages using a QWERTY keyboard. On-device keyboard decoders must operate under strict latency and memory constraints, and we present several transducer optimizations that allow for high-accuracy decoding under such constraints. Our methods yield substantial accuracy improvements and latency reductions over an existing baseline transliteration keyboard approach. The resulting system was launched for 22 languages in Google Gboard in the first half of 2017.
    Distributed representation and estimation of WFST-based n-gram models
    Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (StatFSM) (2016), pp. 32-41
    Abstract: We present methods for partitioning a weighted finite-state transducer (WFST) representation of an n-gram language model into multiple shards, each of which is a stand-alone WFST n-gram model in its own right, allowing processing with existing algorithms. After independent estimation, including normalization, smoothing and pruning on each shard, the shards can be merged into a single WFST that is identical to the model that would have resulted from estimation without sharding. We then present an approach that uses data partitions in conjunction with WFST sharding to estimate models on orders-of-magnitude more data than would otherwise have been feasible with a single process. We present some numbers on shard characteristics when large models are trained from a very large data set. Functionality to support distributed n-gram modeling has been added to the OpenGrm library.
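A toy version of the sharding idea (purely illustrative; the paper's scheme partitions the WFST itself so that backoff structure is respected, whereas this sketch merely hashes each n-gram on its first word so related n-grams co-locate):

```python
import zlib
from collections import Counter

def shard_ngrams(ngram_counts, num_shards):
    """Partition n-gram counts into shards by hashing the first word,
    so all n-grams starting with the same word land in one shard and
    each shard can be smoothed and normalized independently."""
    shards = [Counter() for _ in range(num_shards)]
    for ngram, count in ngram_counts.items():
        # crc32 gives a deterministic hash, unlike Python's built-in hash()
        shard_id = zlib.crc32(ngram[0].encode()) % num_shards
        shards[shard_id][ngram] += count
    return shards
```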
    Composition-based on-the-fly rescoring for salient n-gram biasing
    Keith Hall
    Eunjoon Cho
    Francoise Beaufays
    Noah Coccaro
    Kaisuke Nakajima
    Linda Zhang
    Interspeech 2015, International Speech Communication Association
    Improved recognition of contact names in voice commands
    David Elson
    Aleks Kracun
    Diego Melendo Casado
    Pedro J. Moreno
    ICASSP 2015
    Pushdown automata in statistical machine translation
    Bill Byrne
    Adrià de Gispert
    Gonzalo Iglesias
    Computational Linguistics, vol. 40 (2014), pp. 687-723
    Smoothed marginal distribution constraints for language modeling
    Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL) (2013), pp. 43-52
    Abstract: We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney smoothing. Unlike Kneser-Ney, our approach is designed to be applied to any given smoothed backoff model, including models that have already been heavily pruned. As a result, the algorithm avoids issues observed when pruning Kneser-Ney models (Siivola et al., 2007; Chelba et al., 2010), while retaining the benefits of such marginal distribution constraints. We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. An open-source version of the algorithm has been released as part of the OpenGrm ngram library.
    Mixture of mixture n-gram language models
    Kaisuke Nakajima
    Françoise Beaufays
    ASRU (2013), pp. 31-36
    Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
    Johan Schalkwyk
    Boulos Harb
    Peng Xu
    Preethi Jyothi
    Thorsten Brants
    Vida Ha
    Will Neveitt
    University of Toronto (2012)
    Abstract: A critical component of a speech recognition system targeting web search is the language model. The talk presents an empirical exploration of the google.com query stream with the end goal of high-quality statistical language modeling for mobile voice search. Our experiments show that after text normalization the query stream is not as "wild" as it seems at first sight. One can achieve out-of-vocabulary rates below 1% using a one-million-word vocabulary, and excellent n-gram hit ratios of 77/88% even at high orders such as n=5/4, respectively. Using large-scale, distributed language models can improve performance significantly, with up to 10% relative reductions in word error rate over conventional models used in speech recognition. We also find that the query stream is non-stationary, which means that adding more past training data beyond a certain point provides diminishing returns, and may even degrade performance slightly. Perhaps less surprisingly, we have shown that locale matters significantly for English query data across USA, Great Britain and Australia. In an attempt to leverage the speech data in voice search logs, we successfully build large-scale discriminative n-gram language models and derive small but significant gains in recognition performance.
    Abstract: This paper explores various static interpolation methods for approximating a single dynamically-interpolated language model used for a variety of recognition tasks on the Google Android platform. The goal is to find the statically-interpolated first-pass LM that best reduces search errors in a two-pass system, or that even allows eliminating the more complex dynamic second pass entirely. Static interpolation weights that are uniform, prior-weighted, and the maximum likelihood, maximum a posteriori, and Bayesian solutions are considered. Analysis argues, and recognition experiments on Android test data show, that a Bayesian interpolation approach performs best.
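Static interpolation itself is simple to state in code (a minimal sketch; the paper's actual subject, how the weights are chosen, is not shown here):

```python
def interpolate(prob_dists, weights):
    """Mix several language-model distributions: the interpolated
    probability of each word is the weighted sum of the component
    probabilities. `weights` should sum to 1 so the result is a
    proper distribution."""
    vocab = set().union(*prob_dists)
    return {w: sum(lam * p.get(w, 0.0)
                   for lam, p in zip(weights, prob_dists))
            for w in vocab}
```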
    Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
    Johan Schalkwyk
    Boulos Harb
    Peng Xu
    Thorsten Brants
    Vida Ha
    Will Neveitt
    OGI/OHSU Seminar Series, Portland, Oregon, USA (2011)
    Abstract: The talk presents key aspects faced when building language models (LMs) for the google.com query stream, and their use for automatic speech recognition (ASR). Distributed LM tools enable us to handle a huge amount of data and to experiment with LMs that are two orders of magnitude larger than usual. An empirical exploration of the problem led us to re-discover a lesser-known interaction between Kneser-Ney smoothing and entropy pruning, possible non-stationarity of the query stream, as well as a strong dependence on various English locales: USA, Britain and Australia. LM compression techniques allowed us to use one-billion n-gram LMs in the first pass of an ASR system built on FST technology, and to evaluate empirically whether a two-pass system architecture has any losses over one pass.
    Unary Data Structures for Language Models
    Interspeech 2011, International Speech Communication Association, pp. 1425-1428
    Abstract: Language models are important components of speech recognition and machine translation systems. Trained on billions of words, and consisting of billions of parameters, language models are often the single largest components of these systems. There have been many proposed techniques to reduce the storage requirements for language models. A technique based upon pointer-free compact storage of ordinal trees shows compression competitive with the best proposed systems, while retaining the full finite-state structure, and without using computationally expensive block compression schemes or lossy quantization techniques.
    Hierarchical Phrase-Based Translation Representations
    Gonzalo Iglesias
    William Byrne
    Adrià de Gispert
    Proceedings of EMNLP 2011
    A Filter-based Algorithm for Efficient Composition of Finite-State Transducers
    Johan Schalkwyk
    International Journal of Foundations of Computer Science, vol. 22 (2011), pp. 1781-1795
    Abstract: Google offers several speech features on the Android mobile operating system: search by voice, voice input to any text field, and an API for application developers. As a result, our speech recognition service must support a wide range of usage scenarios and speaking styles: relatively short search queries, addresses, business names, dictated SMS and e-mail messages, and a long tail of spoken input to any of the applications users may install. We present a method of on-demand language model interpolation in which contextual information about each utterance determines interpolation weights among a number of n-gram language models. On-demand interpolation results in an 11.2% relative reduction in WER compared to using a single language model to handle all traffic.
    N-Way Composition of Weighted Finite-State Transducers
    International Journal of Foundations of Computer Science, vol. 20 (2009), pp. 613-627
    OpenFst: An Open-Source, Weighted Finite-State Transducer Library and its Applications to Speech and Language
    Martin Jansche
    Proceedings of the North American Chapter of the Association for Computational Linguistics -- Human Language Technologies (NAACL HLT) 2009 conference, Tutorials
    Abstract: Finite-state methods are well established in language and speech processing. OpenFst (available from www.openfst.org) is a free and open-source software library for building and using finite automata, in particular, weighted finite-state transducers (FSTs). This tutorial is an introduction to weighted finite-state transducers and their uses in speech and language processing. While there are other weighted finite-state transducer libraries, OpenFst (a) offers, we believe, the most comprehensive, general and efficient set of operations; (b) makes available full source code; (c) exposes high- and low-level C++ APIs that make it easy to embed and extend; and (d) is a platform for active research and use among many colleagues.
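The library's core object, a weighted transducer, can be illustrated with a minimal sketch (this is not the OpenFst API; a toy deterministic, epsilon-free transducer in the tropical semiring, where weights add along a path):

```python
def transduce(arcs, start, finals, inputs):
    """Apply a deterministic, epsilon-free weighted transducer to an
    input sequence. `arcs` maps (state, in_label) -> (out_label,
    next_state, weight); `finals` maps final states to final weights.
    Returns (output_sequence, total_weight), or None if rejected."""
    state, outputs, weight = start, [], 0.0
    for sym in inputs:
        if (state, sym) not in arcs:
            return None  # no arc for this input symbol: reject
        out, state, w = arcs[(state, sym)]
        outputs.append(out)
        weight += w  # tropical semiring: accumulate by addition
    if state not in finals:
        return None  # input consumed but not in a final state
    return outputs, weight + finals[state]
```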
    Abstract: This paper describes a weighted finite-state transducer composition algorithm that generalizes the notion of the composition filter, and presents filters that remove useless epsilon paths and push forward labels and weights along epsilon paths. This filtering allows us to compose large speech recognition context-dependent lexicons and language models much more efficiently in time and space than previously possible. We present experiments on Broadcast News and Google Search by Voice that demonstrate a 5% to 10% overhead for dynamic, runtime composition compared to a static, offline composition of the recognition transducer. To our knowledge, this is the first such system with such a small overhead.
    3-Way Composition of Weighted Finite-State Transducers
    Proceedings of the 13th International Conference on Implementation and Application of Automata (CIAA 2008), Springer-Verlag, Heidelberg, Germany, San Francisco, California, pp. 262-273
    General Algorithms for Testing the Ambiguity of Finite Automata
    Ashish Rastogi
    Proceedings of Twelfth International Conference Developments in Language Theory (DLT 2008), Springer, Heidelberg, Germany, Kyoto, Japan
    Abstract: The problem of identifying the minimal gene set required to sustain life is of crucial importance in understanding cellular mechanisms and designing therapeutic drugs. This work describes several kernel-based solutions for predicting essential genes that outperform existing models while using less training data. Our first solution is based on a semi-manually designed kernel derived from the Pfam database, which includes several Pfam domains. We then present novel and general domain-based sequence kernels that capture sequence similarity with respect to several domains made of large sets of protein sequences. We show how to deal with the large size of the problem (several thousands of domains, with individual domains sometimes containing thousands of sequences) by representing and efficiently computing these kernels using automata. We report results of extensive experiments demonstrating that they compare favorably with the Pfam kernel in predicting protein essentiality, while requiring no manual tuning.
    OpenFst: a General and Efficient Weighted Finite-State Transducer Library
    Johan Schalkwyk
    Wojciech Skut
    Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA 2007), Springer-Verlag, Heidelberg, Germany, Prague, Czech Republic
    A Unified Construction of the Glushkov, Follow, and Antimirov Automata
    Proceedings of the 31st International Symposium on Mathematical Foundations of Computer Science (MFCS 2006), Springer-Verlag, Heidelberg, Germany, Stará Lesná, Slovakia, pp. 110-121
    A General Weighted Grammar Library
    Ninth International Conference on Automata (CIAA 2004), Kingston, Canada, July 22-24, 2004, Springer-Verlag, Berlin-NY (2005)
    Statistical Modeling for Unit Selection in Speech Synthesis
    42nd Meeting of the Association for Computational Linguistics (ACL 2004), Proceedings of the Conference, Barcelona, Spain
    General Indexation of Weighted Automata -- Application to Spoken Utterance Retrieval
    Murat Saraclar
    Proceedings of the annual meeting of the Human Language Technology conference and North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2004), Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, Boston, Massachusetts
    A Generalized Construction of Integrated Speech Recognition Transducers
    Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Canada
    An optimal pre-determinization algorithm for weighted transducers
    Theor. Comput. Sci., vol. 328 (2004), pp. 3-18
    Generalized Optimization Algorithm for Speech Recognition Transducers
    Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), Hong Kong
    Efficient Algorithms for Testing the Twins Property
    Journal of Automata, Languages and Combinatorics, vol. 8 (2003)
    Finitely Subsequential Transducers
    International Journal of Foundations of Computer Science, vol. 14 (2003), pp. 983-994
    An Efficient Pre-Determinization Algorithm
    Eighth International Conference on Automata (CIAA 2003), Santa Barbara, CA, Springer, Berlin-NY, pp. 83-95
    Generalized Algorithms for Constructing Statistical Language Models
    41st Meeting of the Association for Computational Linguistics (ACL 2003), Proceedings of the Conference, Sapporo, Japan
    p-Subsequentiable Transducers
    Seventh International Conference on Automata (CIAA 2002), Tours, France, Springer, Berlin-NY (2003), pp. 24-34
    On the Determinizability of Weighted Automata and Transducers
    Proceedings of the workshop Weighted Automata: Theory and Applications (WATA), Dresden, Germany (2002)