Richard Sproat

From January 2009 through October 2012, I was a professor at the Center for Spoken Language Understanding at the Oregon Health and Science University.

Prior to going to OHSU, I was a professor in the departments of Linguistics and Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. I was also a full-time faculty member at the Beckman Institute. I still hold adjunct positions in Linguistics and ECE at UIUC.

Before joining the faculty at UIUC, I worked in the Information Systems and Analysis Research Department, headed by Ken Church, at AT&T Labs --- Research. There I worked on Speech and Text Data Mining: extracting potentially useful information from large speech or text databases using a combination of speech/NLP technology and data mining techniques.

Before joining Ken's department, I worked in the Human/Computer Interaction Research Department headed by Candy Kamm. My most recent project in that department was WordsEye, an automatic text-to-scene conversion system. The WordsEye technology is now being developed at Semantic Light, LLC. WordsEye is particularly good for creating surrealistic images that I can easily conceive of but that are well beyond my artistic ability to execute.

More info --- and many more publications --- on my external website here.

Authored Publications
    Helpful Neighbors: Leveraging Neighbors in Geographic Feature Pronunciation
    Lion Jones
    Haruko Ishikawa
    Transactions of the Association for Computational Linguistics, 11(2023), 85–101
    Preview abstract If one sees the place name Houston Mercer Dog Run in New York, how does one know how to pronounce it? Assuming one knows that Houston in New York is pronounced ˈhaʊstən and not like the Texas city (ˈhjuːstən), then one can probably guess that ˈhaʊstən is also used in the name of the dog park. We present a novel architecture that learns to use the pronunciations of neighboring names in order to guess the pronunciation of a given target feature. Applied to Japanese place names, we demonstrate the utility of the model for finding and proposing corrections for errors in Google Maps. To demonstrate the utility of this approach to structurally similar problems, we also report on an application to a totally different task: cognate reflex prediction in comparative historical linguistics. A version of the code has been open-sourced. View details
    Preview abstract For millennia humans have used visible marks to communicate information. Modern examples of conventional graphical symbols include traffic signs, mathematical notation, and written language such as the text you are currently reading. This book presents the first systematic study of graphical symbol systems, including a taxonomy of non-linguistic systems—systems such as mathematical and musical notation that are not tied to spoken language. An important point is that non-linguistic symbol systems may have complex syntax, if the information encoded by the system itself has a complex structure. Writing systems are a special instance of graphical symbol system where the symbols represent linguistic, and in particular phonological information. I review the properties of writing and how it relates to non-linguistic symbols. I also review how writing is processed in the brain, and compare that to the neural processing of non-linguistic symbols. Writing first appeared in Mesopotamia about 5,000 years ago and is believed to have evolved from a previous non-linguistic accounting system. The exact mechanism is unknown, but crucial was the discovery that symbols can represent the sounds of words, not just the meanings. I present the novel hypothesis that writing evolved in an institutional context in which accounts were dictated, thus driving an association between symbol and sound. I provide a computational simulation to support this hypothesis. Human language has syntactic structure, and writing inherits the structure that language has. This leads to a common fallacy when it comes to undeciphered ancient symbols, namely that the presence of structure in a system favors the conclusion that the system was writing. I review recent instances of this fallacy, pointing out that known non-linguistic systems also have structure, so that the presence of structure is not very informative. 
    The book ends with some thoughts about the future of graphical symbol systems. View details
    Bi-Phone: Modeling Inter Language Phonetic Influences in Text
    Ananya B. Sai
    Yuri Vasilevski
    James Ren
    Ambarish Jash
    Sukhdeep Sodhi
    ACL, Association for Computational Linguistics, Toronto, Canada(2023), 2580–2592
    Preview abstract A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly. We also introduce a new phoneme prediction pre-training task which helps byte models to recover performance close to SuperGLUE. Finally, we also release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text. View details
    Graphemic Normalization of the Perso-Arabic Script
    Raiomond Doctor
    Proceedings of Grapholinguistics in the 21st Century, 2022 (G21C, Grafematik), Paris, France
    Preview abstract Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions (Unicode Consortium, 2021). This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community (ICANN, 2011, 2015). We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies (e.g., Aazim et al., 2009; Iyengar, 2018), insufficient literacy, and loss or lack of orthographic tradition (Jahani and Korn, 2013; Liljegren, 2018). We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques (Ponti et al., 2019; Conneau et al., 2020; Muller et al., 2021) especially for languages with a paucity of resources. View details
    Mockingbird at the SIGTYP 2022 Shared Task: Two Types of Models for Prediction of Cognate Reflexes
    Christo Kirov
    Proceedings of the 4th Workshop on Research in Computational Typology and Multilingual NLP (SIGTYP 2022) at NAACL, Association for Computational Linguistics (ACL), Seattle, WA, pp. 70-79
    Preview abstract The SIGTYP 2022 shared task concerns the problem of word reflex generation in a target language, given cognate words from a subset of related languages. We present two systems to tackle this problem, covering two very different modeling approaches. The first model extends transformer-based encoder-decoder sequence-to-sequence modeling, by encoding all available input cognates in parallel, and having the decoder attend to the resulting joint representation during inference. The second approach takes inspiration from the field of image restoration, where models are tasked with recovering pixels in an image that have been masked out. For reflex generation, the missing reflexes are treated as “masked pixels” in an “image” which is a representation of an entire cognate set across a language family. As in the image restoration case, cognate restoration is performed with a convolutional network. View details
    Preview abstract In a recent position paper, Turing Award winners Yoshua Bengio, Geoffrey Hinton and Yann LeCun make the case that symbolic methods are not needed in AI and that, while there are still many issues to be resolved, AI will be solved using purely neural methods. In this piece I issue a challenge: demonstrate that a purely neural approach to the problem of text normalization is possible. Various groups have tried, but so far nobody has eliminated the problem of unrecoverable errors, errors where, due to insufficient training data or faulty generalization, the system substitutes some other reading for the correct one. Solutions have been proposed that involve a marriage of traditional finite-state methods with neural models, but thus far nobody has shown that the problem can be solved using neural methods alone. Though text normalization is hardly an "exciting" problem, I argue that until one can solve "boring" problems like that using purely AI methods, one cannot claim that AI is a success. View details
    Beyond Arabic: Software for Perso-Arabic Script Manipulation
    Raiomond Doctor
    Proceedings of the 7th Arabic Natural Language Processing Workshop (WANLP2022) at EMNLP, Association for Computational Linguistics (ACL), Abu Dhabi, United Arab Emirates (Hybrid), pp. 381-387
    Preview abstract This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for ten contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people. View details
    Preview abstract Taxonomies of writing systems since Gelb (1952) have classified systems based on what the written symbols represent: if they represent words or morphemes, they are logographic; if syllables, syllabic; if segments, alphabetic; etc. Sproat (2000) and Rogers (2005) broke with tradition by splitting the logographic and phonographic aspects into two dimensions, with logography being graded rather than a categorical distinction. A system could be syllabic, and highly logographic; or alphabetic, and mostly non-logographic. This accords better with how writing systems actually work, but neither author proposed a method for measuring logography. In this article we propose a novel measure of the degree of logography that uses an attention based sequence-to-sequence model trained to predict the spelling of a token from its pronunciation in context. In an ideal phonographic system, the model should need to attend to only the current token in order to compute how to spell it, and this would show in the attention matrix activations. In contrast, with a logographic system, where a given pronunciation might correspond to several different spellings, the model would need to attend to a broader context. The ratio of the activation outside the token and the total activation forms the basis of our measure. We compare this with a simple lexical measure, and an entropic measure, as well as several other neural models, and argue that on balance our attention-based measure accords best with intuition about how logographic various systems are. Our work provides the first quantifiable measure of the notion of logography that accords with linguistic intuition and, we argue, provides better insight into what this notion means. View details
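The ratio described in the abstract above can be illustrated with a minimal sketch. It assumes the attention weights are available as a plain matrix in which `attention[i][j]` is the weight output step `i` places on input position `j`, and that the current token occupies a known span of input positions. The function name, the matrix layout, and the toy numbers are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def outside_attention_ratio(
    attention: Sequence[Sequence[float]],
    token_start: int,
    token_end: int,
) -> float:
    """Return the share of attention mass outside input positions [token_start, token_end).

    attention[i][j] is the weight that output step i places on input position j
    (an assumed layout, chosen only for this illustration).
    """
    total = sum(sum(row) for row in attention)
    if total == 0:
        raise ValueError("attention matrix has no mass")
    # Mass that falls on the positions of the token being spelled.
    inside = sum(sum(row[token_start:token_end]) for row in attention)
    return (total - inside) / total


# Toy example: two output steps over four input positions;
# the current token occupies input positions 1 and 2.
toy_attention = [
    [0.1, 0.6, 0.2, 0.1],
    [0.0, 0.3, 0.5, 0.2],
]
ratio = outside_attention_ratio(toy_attention, 1, 3)
```

Under this reading, a value near 0 means the model looks almost only at the current token (more phonographic behavior), while larger values mean it relies on wider context (more logographic behavior).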
    Preview abstract This paper describes the NEMO submission to SIGTYP 2020 shared task (Bjerva et al., 2020) which deals with prediction of linguistic typological features for multiple languages using the data derived from World Atlas of Language Structures (WALS). We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features. We describe two submitted ridge regression-based configurations which ranked second and third overall in the constrained task. Our best configuration achieved the micro-averaged accuracy score of 0.66 on 149 test languages. View details
    Preview abstract Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) using characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names in a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%. View details
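To make the framing of segmentation as character tagging concrete, here is a small sketch of the data representation only, not of the RNN itself. It assumes a two-label scheme in which `B` marks the first character of a word and `I` marks a continuation; the abstract does not say which tag set was used, so the labels and both function names are hypothetical.

```python
from typing import List


def segmentation_to_tags(words: List[str]) -> str:
    """Encode a segmentation as one tag per character: B = word start, I = inside."""
    tags = []
    for word in words:
        if not word:
            raise ValueError("empty word in segmentation")
        tags.append("B" + "I" * (len(word) - 1))
    return "".join(tags)


def tags_to_segmentation(text: str, tags: str) -> List[str]:
    """Decode per-character tags (for example, a tagger's output) back into words."""
    if len(text) != len(tags):
        raise ValueError("text and tags must have the same length")
    words: List[str] = []
    for char, tag in zip(text, tags):
        # Start a new word on B, or when nothing has been started yet.
        if tag == "B" or not words:
            words.append(char)
        else:
            words[-1] += char
    return words


tags = segmentation_to_tags(["open", "research"])
words = tags_to_segmentation("openresearch", tags)
```

A tagging model would be trained to predict the tag string from the character string; decoding its output with `tags_to_segmentation` then yields the component words.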
    Neural Models of Text Normalization for Speech Applications
    Felix Stahlberg
    Hao Zhang
    Ke Wu
    Xiaochang Peng
    Computational Linguistics, 45(2)(2019) (to appear)
    Preview abstract Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that "123" is verbalized as "one hundred twenty three" in "123 pages" but "one twenty three" in "123 King Ave". For this task, state-of-the-art industrial systems depend heavily on hand-written language-specific grammars. In this paper we present neural network models which treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context, and the output is the verbalization of that token. We find that the most effective model (in terms of efficiency and accuracy) is a model where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to compute the verbalization. This model allows for a great deal of flexibility in terms of representing the context, and also allows us to integrate tagging and segmentation into the process. The neural models perform very well overall, but there is one problem, namely that occasionally they will predict inappropriate verbalizations, such as reading "3cm" as "three kilometers". While rare, such verbalizations are a major issue for TTS applications. To deal with such cases, we develop an approach based on finite-state "covering grammars", which can be used to guide the neural models (either during training and decoding, or just during decoding) away from such "silly" verbalizations. These covering grammars can also largely be learned from data. View details
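The idea of steering a model away from readings such as "three kilometers" for "3cm" can be sketched as a decoding-time filter. This is only an illustration: a dictionary lookup stands in for the finite-state covering grammar, and the table `ALLOWED_UNIT_READINGS`, the function `first_allowed_reading`, and the candidate lists are all hypothetical rather than part of the published system.

```python
from typing import List, Optional

# Hypothetical stand-in for a covering grammar: for each written unit,
# the set of spoken unit words that a verbalization may end in.
ALLOWED_UNIT_READINGS = {
    "cm": {"centimeter", "centimeters"},
    "km": {"kilometer", "kilometers"},
    "kg": {"kilogram", "kilograms"},
}


def first_allowed_reading(unit: str, candidates: List[str]) -> Optional[str]:
    """Return the highest-ranked candidate whose unit word is permitted.

    `candidates` is assumed to be ordered best-first, as a model's n-best list
    would be. Units with no table entry are left unconstrained.
    """
    allowed = ALLOWED_UNIT_READINGS.get(unit)
    if allowed is None:
        return candidates[0] if candidates else None
    for candidate in candidates:
        tokens = candidate.split()
        if tokens and tokens[-1] in allowed:
            return candidate
    return None  # nothing acceptable; a caller might fall back to a grammar


# The model's top guess is wrong, but the filter recovers the second candidate.
choice = first_allowed_reading("cm", ["three kilometers", "three centimeters"])
```

Returning `None` when no candidate passes keeps the failure visible instead of silently emitting a misleading reading.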
    Preview abstract Neural text normalization systems achieve high accuracy, but the errors they do make can include not only “acceptable” errors (such as reading $3 as three dollar) but also unacceptable errors (reading $3 as three euros). We explore ways of training dual encoder classifiers with both positive and negative data to then use as soft constraints in neural text normalization in order to decrease the number of unacceptable errors. Already-low error rates and high variability in performance on the evaluation set make it difficult to determine when improvement is significant, but qualitative analysis suggests that certain types of dual encoder constraints yield systems that make fewer unacceptable errors. View details
    Preview abstract We describe a new approach to converting written tokens to their spoken form, which can be used across automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems. Both ASR and TTS systems need to map from the written to the spoken domain, and we present an approach that enables us to share verbalization grammars between the two systems. We also describe improvements to an induction system for number name grammars. Between these shared ASR/TTS verbalization systems and the improved induction system for number name grammars, we see significant gains in development time and scalability across languages. View details
    Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
    Alena Butryna
    Shan Hui Cathy Chu
    Linne Ha
    Fei He
    Martin Jansche
    Chen Fang Li
    Tatiana Merkulova
    Yin May Oo
    Knot Pipatsrisawat
    Clara E. Rivera
    Supheakmungkol Sarin
    Pasindu De Silva
    Keshan Sodimana
    Jaka Aris Eko Wibawa
    2019 UNESCO International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, 4--6 December, Paris, France, pp. 91-94
    Preview abstract This paper presents an overview of a program designed to address the growing need for developing free speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities. View details
    Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala, and Sundanese TTS Systems
    Keshan Sodimana
    Pasindu De Silva
    Chen Fang Li
    Supheakmungkol Sarin
    Knot Pipatsrisawat
    6th International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2018), International Speech Communication Association (ISCA), 29--31 August, Gurugram, India, pp. 147-151
    Preview abstract Text normalization is the process of converting non-standard words (NSWs) such as numbers, abbreviations, and time expressions into standard words so that their pronunciations can be derived either through lexicon lookup or by utilizing a program to predict pronunciations from spellings. Text normalization is, thus, an important component of any Text-to-Speech (TTS) system. Without such a component, the resulting voice, no matter how good the quality is, may sound unintelligent. Such a component is often built manually by translating language-specific knowledge into rules that can be utilized by TTS pipelines. In this paper, we describe an approach to develop a rule-based text normalization component for many low-resourced languages. We also describe our open source repository containing text normalization grammars for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese, and present a recipe for utilizing them in a TTS system. View details
    Preview abstract Attention-based sequence-to-sequence neural network models learn to jointly align and translate. The quadratic-time attention mechanism is powerful as it is capable of handling arbitrary long-distance reordering, but computationally expensive. In this paper, towards making neural translation both accurate and efficient, we follow the traditional pre-reordering approach to decouple reordering from translation. We add a reordering RNN that shares the input encoder with the decoder. The RNNs are trained jointly with a multi-task loss function and applied sequentially at inference time. The task of the reordering model is to predict the permutation of the input words following the target language word order. After reordering, the attention in the decoder becomes more peaked and monotonic. For reordering, we adopt the Inversion Transduction Grammars (ITG) and propose a transition system to parse input to trees for reordering. We harness the ITG transition system with RNN. With the modeling power of RNN, we achieve superior reordering accuracy without any feature engineering. In experiments, we apply the model to the task of text normalization. Compared to a strong baseline of attention-based RNN, our ITG RNN reordering model can reach the same reordering accuracy with only 1/10 of the training data and is 2.5x faster in decoding. View details
    An RNN Model of Text Normalization
    Navdeep Jaitly
    Interspeech 2017(2017)
    Preview abstract This paper presents a challenge to the community: given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. We present a data set of general text where the normalizations were generated using an existing text normalization component of a text-to-speech system. This data set will be released open-source in the near future. We also present our own experiments with this data set with a variety of different RNN architectures. While some of the architectures do in fact produce very good results when measured in terms of overall accuracy, the errors that are produced are problematic, since they would convey completely the wrong message if such a system were deployed in a speech application. On the other hand, we show that a simple FST-based filter can mitigate those errors, and achieve a level of accuracy not achievable by the RNN alone. Though our conclusions are largely negative on this point, we are actually not arguing that the text normalization problem is intractable using a pure RNN approach, merely that it is not going to be something that can be solved merely by having huge amounts of annotated text data and feeding that to a general RNN model. And when we open-source our data, we will be providing a novel data set for sequence-to-sequence modeling in the hopes that the community can find better solutions. View details
    Areal and Phylogenetic Features for Multilingual Speech Synthesis
    Proc. of Interspeech 2017, International Speech Communication Association (ISCA), August 20–24, 2017, Stockholm, Sweden, pp. 2078-2082
    Preview abstract We introduce phylogenetic and areal language features to the domain of multilingual text-to-speech (TTS) synthesis. Intuitively, enriching the existing universal phonetic features with such cross-language shared representations should benefit the multilingual acoustic models and help to address issues like data scarcity for low-resource languages. We investigate these representations using the acoustic models based on long short-term memory (LSTM) recurrent neural networks (RNN). Subjective evaluations conducted on eight languages from diverse language families show that sometimes phylogenetic and areal representations lead to significant multilingual synthesis quality improvements. View details
    Preview abstract We describe an expanded taxonomy of semiotic classes for text normalization, building upon the work in Sproat (2001). We add a large number of categories of non-standard words (NSWs) that we believe a robust real-world text normalization system will have to be able to process. Our new categories are based upon empirical findings encountered while building text normalization systems across many languages, for both Speech Recognition and Speech Synthesis purposes. We believe our new taxonomy is useful both for ensuring high coverage when writing manual grammars, as well as for eliciting training data to build machine learning-based text normalization systems. View details
    TTS for Low Resource Languages: A Bangla Synthesizer
    Linne Ha
    Martin Jansche
    Knot Pipatsrisawat
    10th edition of the Language Resources and Evaluation Conference, 23-28 May 2016, European Language Resources Association (ELRA), Portorož, Slovenia, pp. 2005-2010
    Preview abstract We present a text-to-speech (TTS) system designed for the dialect of Bengali spoken in Bangladesh. This work is part of an ongoing effort to address the needs of under-resourced languages. We propose a process for streamlining the bootstrapping of TTS systems for under-resourced languages. First, we use crowdsourcing to collect the data from multiple ordinary speakers, each speaker recording a small number of sentences. Second, we leverage an existing text normalization system for a related language (Hindi) to bootstrap a linguistic front-end for Bangla. Third, we employ statistical techniques to construct multi-speaker acoustic models using Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and Hidden Markov Model (HMM) approaches. We then describe our experiments that show that the resulting TTS voices score well in terms of their perceived quality as measured by Mean Opinion Score (MOS) evaluations. View details
    Building Statistical Parametric Multi-speaker Synthesis for Bangladeshi Bangla
    Linne Ha
    Martin Jansche
    Knot Pipatsrisawat
    5th Workshop on Spoken Language Technologies for Under-resourced languages (SLTU-2016), Procedia Computer Science (Elsevier B.V.), 09--12 May 2016, Yogyakarta, Indonesia, pp. 194-200
    Preview abstract We present a text-to-speech (TTS) system designed for the dialect of Bengali spoken in Bangladesh. This work is part of an ongoing effort to address the needs of new under-resourced languages. We propose a process for streamlining the bootstrapping of TTS systems for under-resourced languages. First, we use crowdsourcing to collect the data from multiple ordinary speakers, each speaker recording a small number of sentences. Second, we leverage an existing text normalization system for a related language (Hindi) to bootstrap a linguistic front-end for Bangla. Third, we employ statistical techniques to construct multi-speaker acoustic models using Long Short-term Memory Recurrent Neural Network (LSTM-RNN) and Hidden Markov Model (HMM) approaches. We then describe our experiments that show that the resulting TTS voices score well in terms of their perceived quality as measured by Mean Opinion Score (MOS) evaluations. View details
    Minimally Supervised Number Normalization
    Transactions of the Association for Computational Linguistics, 4(2016), pp. 507-519
    Preview abstract We propose two models for verbalizing numbers, a key component in speech recognition and synthesis systems. The first model uses an end-to-end recurrent neural network. The second model, drawing inspiration from the linguistics literature, uses finite-state transducers constructed with a minimal amount of training data. While both models achieve near-perfect performance, the latter model can be trained using several orders of magnitude less data than the former, making it particularly useful for low-resource languages. View details
    Preview abstract Incorrect normalization of text can be particularly damaging for applications like text-to-speech synthesis (TTS) or typing auto-correction, where the resulting normalization is directly presented to the user, versus feeding downstream applications. In this paper, we focus on abbreviation expansion for TTS, which requires a "do no harm", high precision approach yielding few expansion errors at the cost of leaving relatively many abbreviations unexpanded. In the context of a large-scale, real-world TTS scenario, we present methods for training classifiers to establish whether a particular expansion is apt. We achieve a large increase in correct abbreviation expansion when combined with the baseline text normalization component of the TTS system, together with a substantial reduction in incorrect expansions. View details
    Applications of Maximum Entropy Rankers to Problems in Spoken Language Processing
    Keith Hall
    Interspeech 2014, International Speech Communications Association
    Preview abstract We report on two applications of Maximum Entropy-based ranking models to problems of relevance to automatic speech recognition and text-to-speech synthesis. The first is stress prediction in Russian, a language with notoriously complex morphology and stress rules. The second is the classification of alphabetic non-standard words, which may be read as words (NATO), as letter sequences (USA), or as a mixture of the two (mymsn). For this second task we report results on English and five other European languages. View details
    A Database for Measuring Linguistic Information Content.
    David Huynh
    Linne Ha
    Ravindran Rajakumar
    Evelyn Wenzel-Grondie
    Language Resources and Evaluation Conference, ELDA, 330 W 58th St(2014)
    Preview abstract Which languages convey the most information in a given amount of space? This is a question often asked of linguists, especially by engineers who often have some information theoretic measure of "information" in mind, but rarely define exactly how they would measure that information. The question is, in fact, remarkably hard to answer, and many linguists consider it unanswerable. But it is a question that seems as if it ought to have an answer. If one had a database of close translations between a set of typologically diverse languages, with detailed marking of morphosyntactic and morphosemantic features, one could hope to quantify the differences between how these different languages convey information. Since no appropriate database exists we decided to construct one. The purpose of this paper is to present our work on the database, along with some preliminary results. We plan to release the dataset once complete. View details
    Lightly Supervised Learning of Text Normalization: Russian Number Names
    IEEE Workshop on Spoken Language Technology, Berkeley, CA(2010) (to appear)
    Named Entity Transcription with Pair n-Gram Models
    Martin Jansche
    2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), ACL-IJCNLP 2009, pp. 32-35
    Preview abstract We submitted results for each of the eight shared tasks. Except for Japanese name kanji restoration, which uses a noisy channel model, our Standard Run submissions were produced by generative long-range pair ngram models, which we mostly augmented with publicly available data (either from LDC datasets or mined from Wikipedia) for the Non-Standard Runs. View details
    On a Common Fallacy in Computational Linguistics
    A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday, SKY Journal of Linguistics, Volume 19(2006), pp. 432-439
    Applications of Lexicographic Semirings to Problems in Speech and Language Processing
    Mahsa Yarmohammadi
    Computational Linguistics, 40(2014)
    Preview abstract This paper explores lexicographic semirings and their application to problems in speech and language processing. Specifically, we present two instantiations of binary lexicographic semirings, one involving a pair of tropical weights, and the other a tropical weight paired with a novel string semiring we term the categorial semiring. The first of these is used to yield an exact encoding of backoff models with epsilon transitions. This lexicographic language model semiring allows for off-line optimization of exact models represented as large weighted finite-state transducers in contrast to implicit (on-line) failure transition representations. We present empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection. The second of these lexicographic semirings is applied to the problem of extracting, from a lattice of word sequences tagged for part of speech, only the single best-scoring part of speech tagging for each word sequence. We do this by incorporating the tags as a categorial weight in the second component of a ⟨Tropical, Categorial⟩ lexicographic semiring, determinizing the resulting word lattice acceptor in that semiring, and then mapping the tags back as output labels of the word lattice transducer. We compare our approach to a competing method due to Povey et al. (2012). View details
    A Statistical Comparison of Written Language and Nonlinguistic Symbol Systems
    Language, 90(2014), pp. 457-481
    Preview abstract Are statistical methods useful in distinguishing written language from nonlinguistic symbol systems? Some recent articles (Rao et al. 2009a, Lee et al. 2010a) have claimed so. Both of these previous articles use measures based at least in part on bigram conditional entropy, and subsequent work by one of the authors (Rao) has used other entropic measures. In both cases the authors have argued that the methods proposed either are useful for discriminating between linguistic and nonlinguistic systems (Lee et al.), or at least count as evidence of a more ‘inductive’ kind for the status of a system (Rao et al.). Using a larger set of nonlinguistic and comparison linguistic corpora than were used in these and other studies, I show that none of the previously proposed methods are useful as published. However, one of the measures proposed by Lee and colleagues (2010a) (with a different cut-off value) and a novel measure based on repetition turn out to be good measures for classifying symbol systems into the two categories. For the two ancient symbol systems of interest to Rao and colleagues (2009a) and Lee and colleagues (2010a)—Indus Valley inscriptions and Pictish symbols, respectively—both of these measures classify them as nonlinguistic, contradicting the findings of those previous works. View details
    Thrax: An Open Source Grammar Compiler Built on OpenFst
    Terry Tai
    Wojciech Skut
    Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, IEEE, Piscataway, NJ(2011)
    MAP adaptation of stochastic grammars
    Computer Speech and Language, 20(2006), pp. 41-68
    Improved name recognition with meta-data dependent name networks
    S. Maskey
    Proceedings of the International Conference on Acoustics, Speech and Signal Processing(2004)
    An Efficient Compiler for Weighted Rewrite Rules
    CoRR, cmp-lg/9606026(1996)
    Algorithms for Speech Recognition and Language Processing
    Compilation of Weighted Finite-State Transducers from Decision Trees
    ACL(1996), pp. 215-222
    Finite-State Transducers in Language and Speech Processing
    Tutorial at the 16th International Conference on Computational Linguistics (COLING-96), COLING, Copenhagen, Denmark(1996)
    An Efficient Compiler for Weighted Rewrite Rules
    ACL(1996), pp. 231-238
    Compilation of Weighted Finite-State Transducers from Decision Trees
    CoRR, cmp-lg/9606018(1996)
    An Efficient Compiler for Weighted Rewrite Rules
    34th Meeting of the Association for Computational Linguistics (ACL '96), Proceedings of the Conference, Santa Cruz, California(1996)
    Weighted Rational Transductions and their Application to Human Language Processing
    Human Language Technology Workshop, Morgan Kaufmann, San Francisco, California(1994), pp. 262-267
    A spoken language translator for restricted-domain context-free languages
    David B. Roe
    Pedro J. Moreno
    Alejandro Macarrón
    Speech Communication, 11(1992), pp. 311-319
    Efficient Grammar Processing for a Spoken Language Translation System
    David B. Roe
    Pedro J. Moreno
    Alejandro Macarrón
    Proceedings of ICASSP, IEEE, San Francisco, California(1992), pp. 213-216
    Toward a Spoken Language Translator for Restricted-Domain Context-Free Languages
    David B. Roe
    Pedro J. Moreno
    Alejandro Macarrón
    EUROSPEECH 91 -- 2nd European Conference on Speech Communication and Technology, Genova, Italy(1991), pp. 1063-1066