Sandy Ritchie

I work on internationalization for speech technology at Google. My research interests are in how we can scale language technology to a more diverse range of languages all around the world. Before joining Google, I received a Ph.D. and conducted postdoctoral research at SOAS, University of London.
Authored Publications
    Chimane-Mosetén
    Jeanette Sakel
    Amazonian Languages: An International Handbook, De Gruyter Mouton (2023)
    Chimane-Mosetén (also known as Mosetenan; ISO 639-3: cas; Glottocode: mose1249) is a dialect continuum spoken by 13,500–16,000 people in the Amazonian region of northern Bolivia. It has not been convincingly shown to be related to any other language. Its status as an isolate makes it unique in many respects, not least in its combination of features typical of both Amazonian and Andean languages. Like its closer geographical neighbors in Amazonian Bolivia, including Movima, Tacana, Reyesano, and Cavineña, it exhibits contrastive nasality in the vowel system and is head marking and predominantly agglutinative. Bound pronominal forms marking arguments in the clause have the same form as bound pronominals marking possessors. Subordinate clauses typically involve nominalized verbs. Unlike most of its Amazonian neighbors, on the other hand, it does not have a semantically-based classifier or gender system but instead features arbitrarily assigned masculine or feminine gender. It also does not feature any incorporation of nouns, adverbs, or adpositions. It has an extensive oblique case-marking system, though core case-marking does not occur. More similar to Quechua and other Andean languages, it features a complex predicate-argument agreement system in which one or more agreement suffixes cross-reference the subject and object arguments of a transitive verb. It also has a large class of lexical numbers following a decimal numeral system.
    Multimodal Language Identification
    Shikhar Bharadwaj
    Sid Dalmia
    Sriram (Sri) Ganapathy
    Yu Zhang
    2024 IEEE International Conference on Acoustics, Speech and Signal Processing (2023) (to appear)
    Language identification (LangID) of video data, the task of determining the spoken language in a given multimedia file, is primarily treated as a speech-based language recognition task. On the other hand, text-based language recognition is employed for written language content. In this work, we present a multimodal LangID system for video data that combines speech and text features to achieve state-of-the-art performance. We show that the title and description of the video, along with other metadata such as the geographic upload location of the video, contain substantial information regarding the language identity of the video recording. With a single multimodal model that can encode speech and text data, we build a language recognition system that can combine the information from speech, text and geographic location data. We experiment on public language recognition tasks with the Dhwani (22-language) dataset and the VoxLingua (107-language) dataset. In these settings, the proposed system achieves an absolute improvement of 6.6% and 5.6% in F1 score over the speech-only baseline, respectively. We also provide an ablation study highlighting the contribution of different modalities for the language recognition task.
    We introduce XTREME-S, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, retrieval and speech-to-text translation. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in "universal" speech representation learning. This paper describes the new benchmark and establishes the first speech-only and speech-text baselines using XLS-R and mSLAM on all downstream tasks. We motivate the design choices and detail how to use the benchmark. The code and pre-processing scripts will be made publicly available (https://huggingface.co/datasets/google/xtreme_s).
    Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data and collected data for 15 languages, and trained experimental models using these techniques. Our results show that pooling the small amounts of data available in multilingual end-to-end models and pre-training on unsupervised data can help improve speech recognition quality for many African languages.
    Pronunciation modeling is a key task for building speech technology in new languages, and while solid grapheme-to-phoneme (G2P) mapping systems exist, language coverage can stand to be improved. The information needed to build G2P models for many more languages can easily be found on Wikipedia, but unfortunately, it is stored in disparate formats. We report on a system we built to mine a pronunciation data set in 819 languages from loosely structured tables within Wikipedia. The data includes phoneme inventories, and for 63 low-resource languages, also includes the grapheme-to-phoneme (G2P) mapping. 54 of these languages do not have easily findable G2P mappings online otherwise. We turned the information from Wikipedia into a structured, machine-readable TSV format, and make the resulting data set publicly available so it can be improved further and used in a variety of applications involving low-resource languages.
    Text Normalization for Low-Resource Languages of Africa
    Andrew Zupon
    Evan Elizabeth Crew
    AfricaNLP (2021)
    Training data for machine learning models can come from many different sources, which can be of dubious quality. For resource-rich languages like English, there is a lot of data available, so we can afford to throw out the dubious data. For low-resource languages where there is much less data available, we can't necessarily afford to throw out the dubious data, in case we end up with a training set which is too small to train a model. In this study, we examine the effects of text normalization and data set quality for a set of low-resource languages of Africa -- Afrikaans, Amharic, Hausa, Igbo, Malagasy, Somali, Swahili, and Zulu. We describe our text normalizer which we built in the Pynini framework, a Python library for finite state transducers, and our experiments in training language models for African languages using the Natural Language Toolkit (NLTK), an open-source Python library for NLP.
    Data-Driven Parametric Text Normalization: Rapidly Scaling Finite-State Transduction Verbalizers to New Languages
    Kim Anne Heiligenstein
    Nikos Bampounis
    Christian Schallhart
    Jonas Fromseier Mortensen
    Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020), Language Resources and Evaluation Conference (LREC 2020), Marseille, 218–225
    This paper presents a methodology for rapidly generating FST-based verbalizers for ASR and TTS systems by efficiently sourcing language-specific data. We describe a questionnaire which collects the necessary data to bootstrap the number grammar induction system and parameterize the verbalizer templates described in Ritchie et al. (2019), and a machine-readable data store which allows the data collected through the questionnaire to be supplemented by additional data from other sources. We also discuss the benefits of this system for low-resource languages.
    We describe a new approach to converting written tokens to their spoken form, which can be used across automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems. Both ASR and TTS systems need to map from the written to the spoken domain, and we present an approach that enables us to share verbalization grammars between the two systems. We also describe improvements to an induction system for number name grammars. Between these shared ASR/TTS verbalization systems and the improved induction system for number name grammars, we see significant gains in development time and scalability across languages.
    We discuss two methods that let us easily create grapheme-to-phoneme (G2P) conversion systems for languages without any human-curated pronunciation lexicons, as long as we know the phoneme inventory of the target language and as long as we have some pronunciation lexicons for other languages written in the same script. We use these resources to infer what grapheme-to-phoneme correspondences we would expect, and predict pronunciations for words in the target language with minimal or no language-specific human work. Our first approach uses finite-state transducers, while our second approach uses a sequence-to-sequence neural network. Our G2P models reach high degrees of accuracy, and can be used for various applications, e.g. in developing an Automatic Speech Recognition system. Our methods greatly simplify a task that has historically required extensive manual labor.
    When building automatic speech recognition (ASR) systems, typically some amount of audio and text data in the target language is needed. While text data can be obtained relatively easily across many languages, transcribed audio data is challenging to obtain. This presents a barrier to making voice technologies available in more languages of the world. In this paper, we present a way to build an ASR system for a language even in the absence of any audio training data in that language at all. We do this by simply re-using an existing acoustic model from a phonologically similar language, without any kind of modification or adaptation towards the target language. The basic insight is that, if two languages are sufficiently similar in terms of their phonological system, an acoustic model should hold up relatively well when used for another language. We describe how we tailor our pronunciation models to enable such re-use, and show experimental results across a number of languages from various language families. We also provide a theoretical analysis of situations in which this approach is likely to work. Our results show that it is possible to achieve less than 20% word error rate (WER) using this method.
    Disjoint and reflexive prominent internal possessor constructions in Chimane
    Prominent Internal Possessors, Oxford University Press (2019), pp. 107-130
    The syntax of possessor prominence in Maithili
    Yogendra P. Yadava
    Oliver Bond
    Irina Nikolaeva
    Prominent Internal Possessors, Oxford University Press (2019), pp. 39-79
    Agreement with the internal possessor in Chimane: A mediated locality approach
    Studies in Language, vol. 41(3) (2017), pp. 660-716
    Two cases of prominent internal possessor constructions
    Proceedings of the Joint 2016 Conference on Head-Driven Phrase Structure Grammar and Lexical Functional Grammar, CSLI Publications, pp. 620-640