Richard Sproat

From January 2009 through October 2012, I was a professor at the Center for Spoken Language Understanding at the Oregon Health and Science University.

Prior to going to OHSU, I was a professor in the departments of Linguistics and Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. I was also a full-time faculty member at the Beckman Institute. I still hold adjunct positions in Linguistics and ECE at UIUC.

Before joining the faculty at UIUC, I worked in the Information Systems and Analysis Research Department, headed by Ken Church, at AT&T Labs --- Research. There I worked on Speech and Text Data Mining: extracting potentially useful information from large speech or text databases using a combination of speech/NLP technology and data mining techniques.

Before joining Ken's department, I worked in the Human/Computer Interaction Research Department headed by Candy Kamm. My most recent project in that department was WordsEye, an automatic text-to-scene conversion system. The WordsEye technology is now being developed at Semantic Light, LLC. WordsEye is particularly good for creating surrealistic images that I can easily conceive of but that are well beyond my artistic ability to execute.

More info --- and many more publications --- on my external website here.

Authored Publications
    Bi-Phone: Modeling Inter Language Phonetic Influences in Text
    Ananya B. Sai
    Yuri Vasilevski
    James Ren
    Ambarish Jash
    Sukhdeep Sodhi
    ACL, Association for Computational Linguistics, Toronto, Canada (2023), 2580–2592
    A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly. We also introduce a new phoneme prediction pre-training task which helps byte models to recover performance close to SuperGLUE. Finally, we also release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
    For millennia humans have used visible marks to communicate information. Modern examples of conventional graphical symbols include traffic signs, mathematical notation, and written language such as the text you are currently reading. This book presents the first systematic study of graphical symbol systems, including a taxonomy of non-linguistic systems: systems such as mathematical and musical notation that are not tied to spoken language. An important point is that non-linguistic symbol systems may have complex syntax, if the information encoded by the system itself has a complex structure. Writing systems are a special instance of graphical symbol system where the symbols represent linguistic, and in particular phonological, information. I review the properties of writing and how it relates to non-linguistic symbols. I also review how writing is processed in the brain, and compare that to the neural processing of non-linguistic symbols. Writing first appeared in Mesopotamia about 5,000 years ago and is believed to have evolved from a previous non-linguistic accounting system. The exact mechanism is unknown, but crucial was the discovery that symbols can represent the sounds of words, not just the meanings. I present the novel hypothesis that writing evolved in an institutional context in which accounts were dictated, thus driving an association between symbol and sound. I provide a computational simulation to support this hypothesis. Human language has syntactic structure, and writing inherits the structure that language has. This leads to a common fallacy when it comes to undeciphered ancient symbols, namely that the presence of structure in a system favors the conclusion that the system was writing. I review recent instances of this fallacy, pointing out that known non-linguistic systems also have structure, so that the presence of structure is not very informative. The book ends with some thoughts about the future of graphical symbol systems.
    Helpful Neighbors: Leveraging Neighbors in Geographic Feature Pronunciation
    Lion Jones
    Haruko Ishikawa
    Transactions of the Association for Computational Linguistics, 11 (2023), 85–101
    If one sees the place name Houston Mercer Dog Run in New York, how does one know how to pronounce it? Assuming one knows that Houston in New York is pronounced ˈhaʊstən and not like the Texas city (ˈhjuːstən), then one can probably guess that ˈhaʊstən is also used in the name of the dog park. We present a novel architecture that learns to use the pronunciations of neighboring names in order to guess the pronunciation of a given target feature. Applied to Japanese place names, we demonstrate the utility of the model for finding and proposing corrections for errors in Google Maps. To demonstrate the utility of this approach to structurally similar problems, we also report on an application to a totally different task: cognate reflex prediction in comparative historical linguistics. A version of the code has been open-sourced.
    Graphemic Normalization of the Perso-Arabic Script
    Raiomond Doctor
    Proceedings of Grapholinguistics in the 21st Century, 2022 (G21C, Grafematik), Paris, France
    Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions (Unicode Consortium, 2021). This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community (ICANN, 2011, 2015). We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies (e.g., Aazim et al., 2009; Iyengar, 2018), insufficient literacy, and loss or lack of orthographic tradition (Jahani and Korn, 2013; Liljegren, 2018). We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques (Ponti et al., 2019; Conneau et al., 2020; Muller et al., 2021) especially for languages with a paucity of resources.
    Mockingbird at the SIGTYP 2022 Shared Task: Two Types of Models for Prediction of Cognate Reflexes
    Christo Kirov
    Proceedings of the 4th Workshop on Research in Computational Typology and Multilingual NLP (SIGTYP 2022) at NAACL, Association for Computational Linguistics (ACL), Seattle, WA, pp. 70-79
    The SIGTYP 2022 shared task concerns the problem of word reflex generation in a target language, given cognate words from a subset of related languages. We present two systems to tackle this problem, covering two very different modeling approaches. The first model extends transformer-based encoder-decoder sequence-to-sequence modeling, by encoding all available input cognates in parallel, and having the decoder attend to the resulting joint representation during inference. The second approach takes inspiration from the field of image restoration, where models are tasked with recovering pixels in an image that have been masked out. For reflex generation, the missing reflexes are treated as “masked pixels” in an “image” which is a representation of an entire cognate set across a language family. As in the image restoration case, cognate restoration is performed with a convolutional network.
    In a recent position paper, Turing Award winners Yoshua Bengio, Geoffrey Hinton, and Yann LeCun make the case that symbolic methods are not needed in AI and that, while there are still many issues to be resolved, AI will be solved using purely neural methods. In this piece I issue a challenge: demonstrate that a purely neural approach to the problem of text normalization is possible. Various groups have tried, but so far nobody has eliminated the problem of unrecoverable errors, errors where, due to insufficient training data or faulty generalization, the system substitutes some other reading for the correct one. Solutions have been proposed that involve a marriage of traditional finite-state methods with neural models, but thus far nobody has shown that the problem can be solved using neural methods alone. Though text normalization is hardly an "exciting" problem, I argue that until one can solve "boring" problems like that using purely AI methods, one cannot claim that AI is a success.
    Beyond Arabic: Software for Perso-Arabic Script Manipulation
    Raiomond Doctor
    Proceedings of the 7th Arabic Natural Language Processing Workshop (WANLP2022) at EMNLP, Association for Computational Linguistics (ACL), Abu Dhabi, United Arab Emirates (Hybrid), pp. 381-387
    This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for ten contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.
    Taxonomies of writing systems since Gelb (1952) have classified systems based on what the written symbols represent: if they represent words or morphemes, they are logographic; if syllables, syllabic; if segments, alphabetic; etc. Sproat (2000) and Rogers (2005) broke with tradition by splitting the logographic and phonographic aspects into two dimensions, with logography being graded rather than a categorical distinction. A system could be syllabic, and highly logographic; or alphabetic, and mostly non-logographic. This accords better with how writing systems actually work, but neither author proposed a method for measuring logography. In this article we propose a novel measure of the degree of logography that uses an attention-based sequence-to-sequence model trained to predict the spelling of a token from its pronunciation in context. In an ideal phonographic system, the model should need to attend to only the current token in order to compute how to spell it, and this would show in the attention matrix activations. In contrast, with a logographic system, where a given pronunciation might correspond to several different spellings, the model would need to attend to a broader context. The ratio of the activation outside the token and the total activation forms the basis of our measure. We compare this with a simple lexical measure, and an entropic measure, as well as several other neural models, and argue that on balance our attention-based measure accords best with intuition about how logographic various systems are. Our work provides the first quantifiable measure of the notion of logography that accords with linguistic intuition and, we argue, provides better insight into what this notion means.
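    As a rough illustration of the ratio this abstract describes, the Python sketch below computes the share of attention mass that falls outside the target token's own input positions. The function name, the toy attention matrix, and the choice to sum over all output steps are assumptions made for illustration; this is not the paper's implementation.

```python
def outside_attention_ratio(attention, token_start, token_end):
    """Share of attention mass outside input positions [token_start, token_end).

    `attention` is a list of rows, one per output step; each row holds
    non-negative weights over the input positions. (Hypothetical layout.)
    """
    total = sum(sum(row) for row in attention)
    if total == 0:
        raise ValueError("attention matrix has no mass")
    # Mass that lands on the positions belonging to the token itself.
    inside = sum(sum(row[token_start:token_end]) for row in attention)
    return (total - inside) / total


# Toy example: 2 output steps attending over 4 input positions,
# where the token being spelled occupies input positions 1 and 2.
toy_attention = [
    [0.1, 0.7, 0.1, 0.1],
    [0.0, 0.5, 0.5, 0.0],
]
ratio = outside_attention_ratio(toy_attention, 1, 3)
print(f"outside-token attention ratio: {ratio:.3f}")
```

    Under this reading, a value near 0 means the model attends almost entirely to the token itself (more phonographic), while larger values mean it draws on wider context (more logographic).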
    Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) using characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names in a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%.
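    To make the tagging formulation concrete, the Python sketch below decodes per-character tags into words. The B/I tag scheme (B marks the first character of a word, I a continuation) and the function name are assumptions; the abstract does not state the tag set, and the tags here are written by hand rather than predicted by an RNN.

```python
def decode_segmentation(text, tags):
    """Split `text` into words using one tag per character.

    Assumed tag scheme: "B" begins a new word, "I" continues the current one.
    """
    if len(text) != len(tags):
        raise ValueError("need exactly one tag per character")
    words = []
    for char, tag in zip(text, tags):
        if tag not in ("B", "I"):
            raise ValueError(f"unknown tag: {tag!r}")
        if tag == "B" or not words:
            words.append(char)   # start a new word
        else:
            words[-1] += char    # extend the current word
    return words


# Hand-written tags for the example from the abstract.
domain = "openresearch"
tags = list("BIIIBIIIIIII")
print(decode_segmentation(domain, tags))
```

    Framing segmentation this way reduces it to sequence labeling, so any character-level tagger can stand in for the model that produces the tags.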
    This paper describes the NEMO submission to the SIGTYP 2020 shared task (Bjerva et al., 2020), which deals with prediction of linguistic typological features for multiple languages using data derived from the World Atlas of Language Structures (WALS). We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features. We describe two submitted ridge regression-based configurations which ranked second and third overall in the constrained task. Our best configuration achieved a micro-averaged accuracy score of 0.66 on 149 test languages.