 
                Michael Riley
            Michael Riley has a B.S., M.S., and Ph.D. from MIT, all in computer science. He began his career at Bell Labs and AT&T Labs where he, together with Mehryar Mohri and Fernando Pereira, introduced and developed the theory and use of weighted finite-state transducers (WFSTs) in speech and language. He is currently a distinguished research scientist at Google, Inc. His interests include speech and natural language processing, machine learning, and information retrieval. He is a principal author of the OpenFst library. He manages a group with expertise that includes speech recognition and synthesis, NLP, information retrieval, image processing, algorithms, machine learning, and privacy. He is an IEEE and ISCA Fellow.
          
        
        
      Authored Publications
    
  
  
  
    
    
  
      
        
        
    
    
        
        
          
          
          
              We introduce a framework for adapting a virtual keyboard to individual user behavior by modifying a Gaussian spatial model to use personalized key center offset means and, optionally, learned covariances. Through numerous real-world studies, we determine the importance of training data quantity and weights, as well as the number of clusters into which to group keys to avoid overfitting. While past research has shown the potential of this technique using artificially simple virtual keyboards and games or fixed typing prompts, we demonstrate effectiveness using the highly tuned Gboard app with a representative set of users and their real typing behaviors. Across a variety of top languages, we achieve small but significant improvements in both typing speed and decoder accuracy.
              
  
          
        
      
    
        
          
            
              On Weight Interpolation of the Hybrid Autoregressive Transducer Model
            
          
        
        
          
            
              
                
                  
                    
    
    
    
    
    
                      
                        Bhuvana Ramabhadran
                      
                    
                
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        David Rybach
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
          
          
          
          
            Interspeech 2022 (to appear)
          
          
        
        
        
          
          
          
              This paper explores ways to improve a two-pass speech recognition system when the first pass
is a hybrid autoregressive transducer model and the second pass is a neural language model.
The main focus is on the scores provided by each of these models, their quantitative analysis,
how to improve them, and the best way to integrate them with the objective of better recognition
accuracy. Several analyses are presented to show the importance of the choice of the
integration weights for combining the first-pass and second-pass scores. A sequence-level weight
estimation model, along with four training criteria, is proposed, which allows adaptive integration
of the scores per acoustic sequence.
The effectiveness of this algorithm is demonstrated by constructing and analyzing
models on the Librispeech data set.
              
  
          
        
      
    
        
        
          
          
          
              We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech-recognition models can be greatly reduced (by more than 50% on the Librispeech dataset). This model is developed in a modular framework which encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and creation of new models.
              
  
          
        
      
    
        
          
            
              An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
            
          
        
        
          
            
              
                
                  
                    
                
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
    
    
    
    
    
                      
                        Rami Botros
                      
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Ruoming Pang
                      
                    
                  
              
            
              
                
                  
                    
                    
                      
                        David Johannes Rybach
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        James Qin
                      
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Quoc-Nam Le-The
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Anmol Gulati
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Cal Peyser
                      
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Chung-Cheng Chiu
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Emmanuel Guzman
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Jiahui Yu
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Qiao Liang
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Wei Li
                      
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Yonghui Wu
                      
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Yu Zhang
                      
                    
                  
              
            
          
          
          
          
            Interspeech (2021) (to appear)
          
          
        
        
        
          
          
          
              On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to improve decoder latency, we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
              
  
          
        
      
    
        
          
            
              Approximating probabilistic models as weighted finite automata
            
          
        
        
          
            
              
                
                  
                    
                
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
    
    
    
    
    
                      
                        Vlad Schogol
                      
                    
                  
              
            
          
          
          
          
            Computational Linguistics, 47 (2021), pp. 221-254
          
          
        
        
        
          
          
          
              Weighted finite automata (WFA) are often used to represent probabilistic models, such as
n-gram language models, since they are efficient for recognition tasks in time and space. The
probabilistic source to be represented as a WFA, however, may come in many forms. Given
a generic probabilistic model over sequences, we propose an algorithm to approximate it as a
weighted finite automaton such that the Kullback-Leibler divergence between the source model
and the WFA target model is minimized. The proposed algorithm involves a counting step and a
difference of convex optimization step, both of which can be performed efficiently. We demonstrate
the usefulness of our approach on various tasks, including distilling n-gram models from neural
models, building compact language models, and building open-vocabulary character models. The
algorithms used for these experiments are available in an open-source software library.
              
  
          
        
      
    
        
          
            
              Hybrid Autoregressive Transducer (HAT)
            
          
        
        
          
            
              
                
                  
                    
                
              
            
              
                
                  
                    
                    
    
    
    
    
    
                      
                        David Rybach
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
          
          
          
          
            ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 6139-6143
          
          
        
        
        
          
          
          
              This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model that can be used to decide whether inference with an external language model is beneficial or not. We evaluate our proposed model on a large-scale voice search task. Our experiments show significant improvements in WER compared to the state-of-the-art approaches.
              
  
          
        
      
    
        
          
            
              Distilling weighted finite automata from arbitrary probabilistic models
            
          
        
        
          
            
              
                
                  
                    
                
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
    
    
    
    
    
                      
                        Vlad Schogol
                      
                    
                  
              
            
          
          
          
          
            Proceedings of FSMNLP (2019), pp. 87-97
          
          
        
        
        
          
          
          
              Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization, both of which can be performed efficiently. We demonstrate the usefulness of our approach on some tasks including distilling n-gram models from neural models.
              
  
          
        
      
    
        
          
            
              Latin script keyboards for South Asian languages with finite-state normalization
            
          
        
        
          
            
              
                
                  
                    
                
              
            
              
                
                  
                    
                    
    
    
    
    
    
                      
                        Vlad Schogol
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
          
          
          
          
            Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing, Association for Computational Linguistics, Dresden, Germany (2019), pp. 108-117
          
          
        
        
        
          
          
          
              The use of the Latin script for text entry of South Asian languages is common, even though there is no standard orthography for these languages in the script. We explore several compact finite-state architectures that permit variable spellings of words during mobile text entry. We find that approaches making use of transliteration transducers provide large accuracy improvements over baselines, but that simpler approaches involving a compact representation of many attested alternatives yield much of the accuracy gain. This is particularly important when operating under constraints on model size (e.g., on inexpensive mobile devices with limited storage and memory for keyboard models), and on speed of inference, since people typing on mobile keyboards expect no perceptual delay in keyboard responsiveness.
              
  
          
        
      
    
        
          
            
              Federated Learning of N-gram Language Models
            
          
        
        
          
            
              
                
                  
                    
    
    
    
    
    
                      
                        Adeline Wong
                      
                    
                
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Francoise Beaufays
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
          
          
          
          
            The SIGNLL Conference on Computational Natural Language Learning (2019)
          
          
        
        
        
          
          
          
              We propose algorithms to train production-quality n-gram language models using federated learning. Federated learning is a machine learning technique to train global models to be used on portable devices such as smart phones, without the users' data ever leaving their devices. This is especially relevant for applications handling privacy-sensitive data, such as virtual keyboards. While the principles of federated learning are fairly generic, its methodology assumes that the underlying models are neural networks. However, virtual keyboards are typically powered by n-gram language models, mostly for latency reasons.
We propose to train a recurrent neural network language model using the decentralized "FederatedAveraging" algorithm and to approximate this federated model server-side with an n-gram model that can be deployed to devices for fast inference.
Our technical contributions include novel ways of handling large vocabularies, algorithms to correct capitalization errors in user data, and efficient finite state transducer algorithms to convert word language models to word-piece language models and vice versa.
The n-gram language models trained with federated learning are compared to n-grams trained with traditional server-based algorithms using A/B tests on tens of millions of users of a virtual keyboard.
Results are presented for two languages, American English and Brazilian Portuguese. This work demonstrates that high-quality n-gram language models can be trained directly on client mobile devices without sensitive training data ever leaving the device.
              
  
          
        
      
    
        
          
            
              Semantic Lattice Processing in Contextual Automatic Speech Recognition for Google Assistant
            
          
        
        
          
            
              
                
                  
                    
                
              
            
              
                
                  
                    
                    
    
    
    
    
    
                      
                        Ian Williams
                      
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Justin Scheiner
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
              
                
                  
                    
                    
                      
                        Pedro Moreno
                      
                    
                  
              
            
              
                
                  
                    
                    
                  
              
            
          
          
          
          
            Interspeech 2018, ISCA (2018), pp. 2222-2226
          
          
        
        
        
          
          
          
              Recent interest in intelligent assistants has increased demand for Automatic Speech Recognition (ASR) systems that can utilize contextual information to adapt to the user’s preferences or the current device state. For example, a user might be more likely to refer to their favorite songs when giving a “music playing” command or request to watch a movie starring a particular favorite actor when giving a “movie playing” command. Similarly, when a device is in a “music playing” state, a user is more likely to give volume control commands.
In this paper, we explore using semantic information inside the ASR word lattice by employing Named Entity Recognition (NER) to identify and boost contextually relevant paths in order to improve speech recognition accuracy. We use broad semantic classes comprising millions of entities, such as songs and musical artists, to tag relevant semantic entities in the lattice. We show that our method reduces Word Error Rate (WER) by 12.0% relative on a Google Assistant “media playing” commands test set, while not affecting WER on a test set containing commands unrelated to media.
              
  
          
        
      
    