Kyle Gorman
I am a computational linguist working on speech and language processing. I am also an assistant professor at the Graduate Center, City University of New York, where I direct the master's program in computational linguistics. Before joining Google, I was a postdoctoral researcher, and then an assistant professor, at the Center for Spoken Language Understanding at Oregon Health & Science University. I received a Ph.D. in linguistics from the University of Pennsylvania in 2013.
At Google, I contribute to the OpenFst and OpenGrm libraries, and I am the principal author of Pynini, a weighted finite-state grammar compilation library for Python. In my copious free time, I also participate in ongoing collaborations in linguistics, language acquisition, and language disorders.
More information, including a complete list of publications, can be found at my external website.
Authored Publications
Neural Models of Text Normalization for Speech Applications
Felix Stahlberg
Ke Wu
Richard Sproat
Xiaochang Peng
Computational Linguistics, 45(2) (2019) (to appear)
Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that "123" is verbalized as "one hundred twenty three" in "123 pages" but as "one twenty three" in "123 King Ave". For this task, state-of-the-art industrial systems depend heavily on hand-written language-specific grammars.
In this paper we present neural network models which treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context, and the output is the verbalization of that token. We find that the most effective model (in terms of efficiency and accuracy) is a model where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to compute the verbalization. This model allows for a great deal of flexibility in terms of representing the context, and also allows us to integrate tagging and segmentation into the process.
The neural models perform very well overall, but there is one problem, namely that occasionally they will predict inappropriate verbalizations, such as reading "3cm" as "three kilometers". While rare, such verbalizations are a major issue for TTS applications. To deal with such cases, we develop an approach based on finite-state "covering grammars", which can be used to guide the neural models (either during training and decoding, or just during decoding) away from such "silly" verbalizations. These covering grammars can also largely be learned from data.
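The covering-grammar idea can be sketched in a few lines of Pynini; the miniature grammar and the licensed() helper below are illustrative assumptions rather than the grammars used in the paper. The covering grammar deliberately overgenerates acceptable readings of a token, and a neural hypothesis is kept only if it corresponds to a path through that grammar.

    import pynini

    # Toy covering grammar for measure tokens such as "3cm" (illustrative only).
    number = pynini.union(pynini.cross("3", "three"), pynini.cross("5", "five"))
    unit = pynini.union(pynini.cross("cm", "centimeters"),
                        pynini.cross("cm", "centimeter"))  # overgenerates on purpose
    covering = (number + pynini.cross("", " ") + unit).optimize()

    def licensed(token: str, hypothesis: str) -> bool:
        """Returns True iff the covering grammar licenses this verbalization."""
        lattice = pynini.compose(pynini.compose(token, covering), hypothesis)
        return pynini.shortestpath(lattice).num_states() > 0

    print(licensed("3cm", "three centimeters"))  # True
    print(licensed("3cm", "three kilometers"))   # False: rejected as a "silly" reading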
Unified Verbalization for Speech Recognition & Synthesis Across Languages
Richard Sproat
Christian Schallhart
Nikos Bampounis
Jonas Fromseier Mortensen
Millie Holt
Proceedings of Interspeech 2019
We describe a new approach to converting written tokens to their spoken form, which can be used across automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems. Both ASR and TTS systems need to map from the written to the spoken domain, and we present an approach that enables us to share verbalization grammars between the two systems. We also describe improvements to an induction system for number name grammars. With these shared ASR/TTS verbalization systems and the improved induction system for number name grammars, we see significant gains in development time and scalability across languages.
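As a sketch of the shared-grammar idea (toy entries, not the production grammars): a written-to-spoken verbalizer compiled as a weighted transducer serves TTS directly, and the very same grammar, inverted, serves ASR.

    import pynini
    from pynini.lib import rewrite

    # Hypothetical verbalizer fragment: written form -> spoken form.
    verbalizer = pynini.union(pynini.cross("1", "one"),
                              pynini.cross("2", "two")).optimize()

    # TTS direction: written -> spoken.
    print(rewrite.top_rewrite("2", verbalizer))  # "two"

    # ASR direction: reuse the same grammar by inverting it (spoken -> written).
    recognizer_side = verbalizer.copy()
    recognizer_side.invert()
    print(rewrite.top_rewrite("two", recognizer_side))  # "2"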
What Kind of Language Is Hard to Language-Model?
Sabrina J. Mielke
Ryan Cotterell
Jason Eisner
Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (2019), pp. 4975-4989
How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that "translationese" is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.
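As a rough illustration of the modeling idea (synthetic data and a simplified estimator, not the paper's): a multiplicative model of per-verse surprisal becomes additive after a log transform, so per-language difficulty coefficients can be read off a mixed-effects fit with a random intercept for each parallel verse.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in for per-verse surprisals from one language model per language.
    rng = np.random.default_rng(0)
    rows = []
    for verse in range(200):
        verse_effect = rng.normal(0.0, 0.3)  # shared information content of the verse
        for language, difficulty in [("eng", 0.0), ("deu", 0.1), ("fin", 0.25)]:
            rows.append({"verse": verse, "language": language,
                         "log_bits": 5.0 + verse_effect + difficulty + rng.normal(0.0, 0.1)})
    df = pd.DataFrame(rows)

    # Multiplicative on bits = additive on log-bits: one difficulty coefficient per
    # language, plus a random intercept for each parallel verse.
    fit = smf.mixedlm("log_bits ~ 0 + language", df, groups=df["verse"]).fit()
    print(fit.params.filter(like="language"))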
Improving Homograph Disambiguation with Supervised Machine Learning
Gleb Mazovetskiy
Vitaly Nikolaev
Proceedings of LREC 2018
We describe a pre-existing rule-based homograph disambiguation system used for text-to-speech synthesis at Google, and compare it to a novel system which performs disambiguation using classifiers trained on a small amount of labeled data. An evaluation of these systems, using a new, freely available English data set, finds that hybrid systems (making use of both rules and machine learning) are significantly more accurate than either hand-written rules or machine learning alone. The evaluation also finds minimal performance degradation when the hybrid system is configured to run on limited-resource mobile devices rather than on production servers. The two best systems described here are used for homograph disambiguation on all US English text-to-speech traffic at Google.
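The hybrid architecture can be sketched as a rule that fires when it is confident and otherwise defers to a trained classifier. The rule, features, and training examples below are toy assumptions, not the production system.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical rule for the homograph "bass": angling vocabulary implies /bæs/.
    def bass_rule(context: str):
        if any(w in context for w in ("fishing", "lake", "caught")):
            return "b ae s"
        return None  # abstain and fall back to the classifier

    # Classifier over bag-of-context-words features (toy labeled data).
    contexts = ["played the bass guitar on stage", "caught a bass in the lake"]
    labels = ["b ey s", "b ae s"]
    classifier = make_pipeline(CountVectorizer(), LogisticRegression())
    classifier.fit(contexts, labels)

    def disambiguate(context: str) -> str:
        return bass_rule(context) or classifier.predict([context])[0]

    print(disambiguate("she plays bass in a jazz band"))  # classifier decides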
Minimally Supervised Number Normalization
Richard Sproat
Transactions of the Association for Computational Linguistics, 4 (2016), pp. 507-519
We propose two models for verbalizing numbers, a key component in speech recognition and synthesis systems. The first model uses an end-to-end recurrent neural network. The second model, drawing inspiration from the linguistics literature, uses finite-state transducers constructed with a minimal amount of training data. While both models achieve near-perfect performance, the latter model can be trained using several orders of magnitude less data than the former, making it particularly useful for low-resource languages.
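The finite-state approach can be sketched in Pynini. The miniature grammar below illustrates the general technique of composing factors for tens and ones; it is not the induced grammar described in the paper.

    import pynini
    from pynini.lib import rewrite

    # Toy number-name grammar covering a handful of values (illustration only).
    ones = pynini.union(pynini.cross("1", "one"), pynini.cross("2", "two"),
                        pynini.cross("3", "three"), pynini.cross("7", "seven"))
    teens = pynini.union(pynini.cross("11", "eleven"), pynini.cross("12", "twelve"))
    tens = pynini.union(pynini.cross("2", "twenty"), pynini.cross("3", "thirty"))
    tens_ones = tens + pynini.cross("", " ") + ones
    verbalizer = pynini.union(ones, teens, tens_ones).optimize()

    print(rewrite.top_rewrite("23", verbalizer))  # "twenty three"
    print(rewrite.top_rewrite("12", verbalizer))  # "twelve"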
Pynini: A Python library for weighted finite-state grammar compilation
Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (2016), pp. 75-80
We present Pynini, an open-source library allowing users to compile weighted finite-state transducers (FSTs) and pushdown transducers from strings, context-dependent rewrite rules, and recursive transition networks. Pynini uses the OpenFst library for encoding, modifying, and applying WFSTs, as well as a powerful generic optimization routine. We describe the design of this library and the algorithms and interfaces used for FST and PDT compilation and optimization, and illustrate its use for a natural language processing application.
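For a flavor of the interface (a minimal example over a toy alphabet, not taken from the paper), here is a context-dependent rewrite rule compiled and applied with Pynini.

    import pynini
    from pynini.lib import rewrite

    # Closure over a toy alphabet, used as the rule's sigma-star argument.
    sigma_star = pynini.union(*"abcdefghijklmnopqrstuvwxyz ").closure().optimize()

    # Context-dependent rewrite: "a" becomes "b", but only when "c" follows.
    rule = pynini.cdrewrite(pynini.cross("a", "b"), "", "c", sigma_star)

    print(rewrite.top_rewrite("taco", rule))  # "tbco"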
Discriminative Pronunciation Modeling for Dialectal Speech Recognition
Maider Lehr
Izhak Shafran
Proceedings of Interspeech 2014
Speech recognizers are typically trained with data from a standard dialect and do not generalize to non-standard dialects. The mismatch mainly occurs in the acoustic realization of words, which is represented by the acoustic models and the pronunciation lexicon. Standard techniques for addressing this mismatch are generative in nature and include acoustic model adaptation and expansion of the lexicon with pronunciation variants, both of which have limited effectiveness. We present a discriminative pronunciation model whose parameters are learned jointly with the parameters of the language model. We tease apart the gains from modeling the transitions of canonical phones, the transduction from surface to canonical phones, and the language model. We report experiments on African American Vernacular English (AAVE) using NPR's StoryCorps corpus. Our models improve performance over the baseline by about 2.1% on AAVE, of which 0.6% can be attributed to the pronunciation model. The model learns the most relevant phonetic transformations for AAVE speech.
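One of the phonetic transformations at issue can be written as a small weighted rule; the symbols, weight, and rule below are illustrative assumptions rather than the model in the paper. Optional deletion of word-final /t/ after /s/ models a familiar consonant-cluster reduction, and inverting the rule maps surface phones back to candidate canonical phones.

    import pynini
    from pynini.lib import pynutil, rewrite

    # Toy phone alphabet, with each phone written as a single character.
    sigma_star = pynini.union(*"abdefghijklmnoprstuvwz").closure().optimize()

    # Canonical -> surface: optional, weighted deletion of word-final "t" after "s".
    deletion = pynutil.add_weight(pynutil.delete("t"), 1.0)
    rule = pynini.cdrewrite(deletion, "s", "[EOS]", sigma_star, mode="opt")
    print(rewrite.rewrites("test", rule))  # both "test" and "tes"

    # Surface -> canonical: invert the same rule to recover deleted phones.
    inverse = rule.copy()
    inverse.invert()
    print(rewrite.rewrites("tes", inverse))  # both "tes" and "test"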