Alexander Gutkin


Authored Publications
    XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
    Sebastian Ruder
    Min Ma
    Shruti Rijhwani
    Parker Riley
    Jean-Michel Sarr
    Cindy Wang
    John Wieting
    Christo Kirov
    Dana L. Dickinson
    Bidisha Samanta
    Connie Tao
    David Adelani
    Colin Cherry
    Reeve Ingle
    Dmitry Panteleev
    Partha Talukdar
    Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 1856-1884
    Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs), languages for which NLP research is particularly far behind in meeting user needs, it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks (tasks with broad adoption by speakers of high-resource languages); and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
    Helpful Neighbors: Leveraging Neighbors in Geographic Feature Pronunciation
    Lion Jones
    Haruko Ishikawa
    Transactions of the Association for Computational Linguistics, vol. 11 (2023), pp. 85-101
    If one sees the place name Houston Mercer Dog Run in New York, how does one know how to pronounce it? Assuming one knows that Houston in New York is pronounced ˈhaʊstən and not like the Texas city (ˈhjuːstən), then one can probably guess that ˈhaʊstən is also used in the name of the dog park. We present a novel architecture that learns to use the pronunciations of neighboring names in order to guess the pronunciation of a given target feature. Applied to Japanese place names, we demonstrate the utility of the model in finding and proposing corrections for errors in Google Maps. To demonstrate the utility of this approach to structurally similar problems, we also report on an application to a totally different task: cognate reflex prediction in comparative historical linguistics. A version of the code has been open-sourced.
    Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities
    Raiomond Doctor
    Lawrence Wolf-Sonkin
    Proceedings of the 13th Language Resources and Evaluation Conference (LREC), European Language Resources Association (ELRA), 20-25 June, Marseille, France (2022), pp. 6450-6460
    The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems. In this work, we present several substantial extensions to Brahmic script functionality within the open-source Nisaba library of finite-state script normalization and processing utilities (Johny et al., 2021). First, we extend coverage from the original ten scripts to an additional ten scripts of South Asia and beyond, including some used to record endangered languages such as Dogri. Second, we augment the language layer so that scripts used by multiple languages in distinct ways can be processed correctly for more languages, such as the Bengali script when used for the low-resource language Santali. We document key changes to the finite-state engine required to support these new languages and scripts. Finally, we add new script processing utilities, including lightweight script-level reading normalization that (unlike existing visual normalization) does not preserve visual invariance, and a fixed-input transliteration mechanism specifically tailored to Brahmic text entry with ASCII characters.
    In this paper we share findings from our effort towards building practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results across three research domains: (i) Building clean, web-mined datasets by leveraging semi-supervised pre-training for language-id and developing data-driven filtering techniques; (ii) Leveraging massively multilingual MT models trained with supervised parallel data for over 100 languages and small monolingual datasets for over 1000 languages to enable translation for several previously under-studied languages; and (iii) Studying the limitations of evaluation metrics for long tail languages and conducting qualitative analysis of the outputs from our MT models. We hope that our work provides useful insights to practitioners working towards building MT systems for long tail languages, and highlights research directions that can complement the weaknesses of massively multilingual pre-trained models in data-sparse settings.
    Design principles of an open-source language modeling microservice package for AAC text-entry applications
    9th Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022), Association for Computational Linguistics (ACL), Dublin, Ireland, pp. 1-16
    We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library. The intent of the library is to allow the ensembling of multiple diverse language models without requiring the clients (user interface designers, system users or speech-language pathologists) to attend to the formats of the models. Issues around privacy, security, dynamic versus static models, and methods of model combination are explored and specific design choices motivated. Some simulation experiments demonstrating the benefits of personalized language model ensembling via the library are presented.
    Mockingbird at the SIGTYP 2022 Shared Task: Two Types of Models for Prediction of Cognate Reflexes
    Christo Kirov
    Proceedings of the 4th Workshop on Research in Computational Typology and Multilingual NLP (SIGTYP 2022) at NAACL, Association for Computational Linguistics (ACL), Seattle, WA, pp. 70-79
    The SIGTYP 2022 shared task concerns the problem of word reflex generation in a target language, given cognate words from a subset of related languages. We present two systems to tackle this problem, covering two very different modeling approaches. The first model extends transformer-based encoder-decoder sequence-to-sequence modeling, by encoding all available input cognates in parallel, and having the decoder attend to the resulting joint representation during inference. The second approach takes inspiration from the field of image restoration, where models are tasked with recovering pixels in an image that have been masked out. For reflex generation, the missing reflexes are treated as “masked pixels” in an “image” which is a representation of an entire cognate set across a language family. As in the image restoration case, cognate restoration is performed with a convolutional network.
    Criteria for Useful Automatic Romanization in South Asian Languages
    Proceedings of the 13th Language Resources and Evaluation Conference (LREC), European Language Resources Association (ELRA), 20-25 June, Marseille, France (2022), pp. 6662-6673
    This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script. This process is also known as romanization. These criteria are related to either fidelity to human linguistic behavior (pronunciation transparency, naturalness and conventionality) or processing utility for people (ease of input) as well as under-the-hood in systems (invertibility and stability across languages and scripts). When addressing these differing criteria, several linguistic considerations, such as modeling of prominent phonological processes and their relation to orthography, need to be taken into account. We discuss these key linguistic details in the context of Brahmic scripts and languages that use them, such as Hindi and Malayalam. We then present the core features of several romanization algorithms, implemented in the finite-state transducer (FST) formalism, that address differing criteria. Implementation of these algorithms will be released as part of the Nisaba finite-state script processing library.
    Graphemic Normalization of the Perso-Arabic Script
    Raiomond Doctor
    Proceedings of Grapholinguistics in the 21st Century, 2022 (G21C, Grafematik), Paris, France
    Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions (Unicode Consortium, 2021). This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community (ICANN, 2011, 2015). We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies (e.g., Aazim et al., 2009; Iyengar, 2018), insufficient literacy, and loss or lack of orthographic tradition (Jahani and Korn, 2013; Liljegren, 2018). We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques (Ponti et al., 2019; Conneau et al., 2020; Muller et al., 2021) especially for languages with a paucity of resources.
    Beyond Arabic: Software for Perso-Arabic Script Manipulation
    Raiomond Doctor
    Proceedings of the 7th Arabic Natural Language Processing Workshop (WANLP2022) at EMNLP, Association for Computational Linguistics (ACL), Abu Dhabi, United Arab Emirates (Hybrid), pp. 381-387
    This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for ten contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.
    Taxonomies of writing systems since Gelb (1952) have classified systems based on what the written symbols represent: if they represent words or morphemes, they are logographic; if syllables, syllabic; if segments, alphabetic; etc. Sproat (2000) and Rogers (2005) broke with tradition by splitting the logographic and phonographic aspects into two dimensions, with logography being graded rather than a categorical distinction. A system could be syllabic, and highly logographic; or alphabetic, and mostly non-logographic. This accords better with how writing systems actually work, but neither author proposed a method for measuring logography. In this article we propose a novel measure of the degree of logography that uses an attention-based sequence-to-sequence model trained to predict the spelling of a token from its pronunciation in context. In an ideal phonographic system, the model should need to attend to only the current token in order to compute how to spell it, and this would show in the attention matrix activations. In contrast, with a logographic system, where a given pronunciation might correspond to several different spellings, the model would need to attend to a broader context. The ratio of the activation outside the token to the total activation forms the basis of our measure. We compare this with a simple lexical measure, and an entropic measure, as well as several other neural models, and argue that on balance our attention-based measure accords best with intuition about how logographic various systems are. Our work provides the first quantifiable measure of the notion of logography that accords with linguistic intuition and, we argue, provides better insight into what this notion means.
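    The ratio described in this abstract can be illustrated with a minimal sketch. It assumes a toy attention matrix with one row per output (spelling) step and one column per input (pronunciation) position, and a half-open index range marking the current token. The function name, the matrix layout and the span convention are hypothetical choices made for illustration; the article's actual model and measure may differ in detail.

```python
from typing import List, Tuple


def outside_attention_ratio(attention: List[List[float]],
                            token_span: Tuple[int, int]) -> float:
    """Share of attention mass that falls outside the current token.

    attention: one row per output step, one column per input position;
        entries are non-negative weights.
    token_span: half-open [start, end) range of input positions that
        belong to the token being spelled (hypothetical convention).
    """
    start, end = token_span
    total = sum(sum(row) for row in attention)
    if total == 0:
        raise ValueError("attention matrix has no mass")
    inside = sum(sum(row[start:end]) for row in attention)
    return (total - inside) / total


# Toy data: two output steps over five input positions; the token
# occupies input positions 1 and 2.
phonographic = [[0.0, 0.5, 0.5, 0.0, 0.0],
                [0.0, 0.25, 0.75, 0.0, 0.0]]
logographic = [[0.25, 0.25, 0.25, 0.25, 0.0],
               [0.0, 0.25, 0.25, 0.25, 0.25]]

print(outside_attention_ratio(phonographic, (1, 3)))  # 0.0
print(outside_attention_ratio(logographic, (1, 3)))   # 0.5
```

    A ratio near 0 means the spelling is predictable from the token alone, while larger values indicate reliance on surrounding context.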
    Finite-state script normalization and processing utilities: The Nisaba Brahmic library
    Lawrence Wolf-Sonkin
    The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations, Association for Computational Linguistics, [Online], Kyiv, Ukraine, April, 2021, pp. 14-23
    This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts. The library provides a flexible and extensible framework for supporting crucial operations on Brahmic scripts, such as NFC, visual normalization, reversible transliteration, and validity checks, implemented in Python within a finite-state transducer formalism. We survey some common Brahmic script issues that may adversely affect the performance of downstream NLP tasks, and provide the rationale for finite-state design and system implementation details.
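    As a small illustration of why NFC matters for Brahmic text, the sketch below uses only Python's standard `unicodedata` module. It does not use the Nisaba API, whose interface is not described here; the choice of Devanagari examples is an assumption made for illustration.

```python
import unicodedata

# Devanagari NA + NUKTA versus the precomposed letter NNNA (U+0929).
decomposed = "\u0928\u093C"
precomposed = "\u0929"

# The two strings render alike but are different code point sequences.
print(decomposed == precomposed)  # False

# NFC composes the pair into the single precomposed code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)  # True

# QA (U+0958) is a composition exclusion: NFC leaves it decomposed.
qa = unicodedata.normalize("NFC", "\u0958")
print([f"U+{ord(c):04X}" for c in qa])  # ['U+0915', 'U+093C']
```

    The second case shows that NFC alone does not always yield a single code point per visible letter, which is one reason script-specific normalization beyond the standard forms can be useful.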
    Towards Induction of Structured Phoneme Inventories
    Martin Jansche
    Lucy Skidmore
    Association for Computational Linguistics (ACL), 19th November, Online
    This extended abstract was presented at "SIGTYP 2020: The Second Workshop on Computational Research in Linguistic Typology" at EMNLP 2020.
    Developing an Open-Source Corpus of Yoruba Speech
    Clara E. Rivera
    Kólá Túbòsún
    Proc. of Interspeech 2020, International Speech Communication Association (ISCA), October 25-29, Shanghai, China, pp. 404-408
    This paper introduces an open-source speech dataset for Yoruba, one of the largest low-resource West African languages, spoken by at least 22 million people. Yoruba is one of the official languages of Nigeria, Benin and Togo, and is spoken in other neighboring African countries and beyond. The corpus consists of over four hours of 48 kHz recordings from 36 male and female volunteers and the corresponding transcriptions that include disfluency annotation. The transcriptions have full diacritization, which is vital for pronunciation and lexical disambiguation. The annotated speech dataset described in this paper is primarily intended for use in text-to-speech systems, to serve as adaptation data in automatic speech recognition and speech-to-speech translation, and to provide insights in West African corpus linguistics. We demonstrate the use of this corpus in a simple statistical parametric speech synthesis (SPSS) scenario, evaluating it against the related languages from the CMU Wilderness dataset and the Yoruba Lagos-NWU corpus.
    Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech
    Yin May Oo
    Chen Fang Li
    Pasindu De Silva
    Supheakmungkol Sarin
    Knot Pipatsrisawat
    Martin Jansche
    Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11-16 May, Marseille, France, pp. 6328-6339
    This paper introduces an open-source crowd-sourced multi-speaker speech corpus along with a comprehensive set of finite-state transducer (FST) grammars for performing text normalization for the Burmese (Myanmar) language. We also introduce the open-source finite-state grammars for performing grapheme-to-phoneme (G2P) conversion for Burmese. These three components are necessary (but not sufficient) for building a high-quality text-to-speech (TTS) system for Burmese, a tonal Southeast Asian language from the Sino-Tibetan family which presents several linguistic challenges. We describe the corpus acquisition process and provide the details of our finite-state-based approach to Burmese text normalization and G2P. Our experiments involve building a multi-speaker TTS system based on long short-term memory (LSTM) recurrent neural network (RNN) models, which were previously shown to perform well for other languages in a low-resource setting. Our results indicate that the data and grammars that we are announcing are sufficient to build reasonably high-quality models comparable to other systems. We hope these resources will facilitate speech and language research on the Burmese language, which is considered by many to be low-resource due to the limited availability of free linguistic data.
    Eidos: An Open-Source Auditory Periphery Modeling Toolkit and Evaluation of Cross-Lingual Phonemic Contrasts
    Proc. of 1st Joint Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop (SLTU-CCURL 2020), European Language Resources Association (ELRA), 11-12 May, Marseille, France, pp. 9-20
    Many analytical models that mimic, in varying degrees of detail, the basic auditory processes involved in human hearing have been developed over the past decades. While the auditory periphery mechanisms responsible for transducing the sound pressure wave into the auditory nerve discharge are relatively well understood, the models that describe them are usually very complex because they try to faithfully simulate the behavior of several functionally distinct biological units involved in hearing. Because of this, there is a relative scarcity of toolkits that support combining publicly available auditory models from multiple sources. We address this shortcoming by presenting an open-source auditory toolkit that integrates multiple models of various stages of human auditory processing into a simple and easily configurable pipeline, which supports easy switching between ten available models. The auditory representations that the pipeline produces can serve as machine learning features and provide an analytical benchmark for comparing against auditory filters learned from the data. Given a low- and high-resource language pair, we evaluate several auditory representations on a simple multilingual phonemic contrast task to determine whether contrasts that are meaningful within a language are also empirically robust across languages.
    Does A Priori Phonological Knowledge Improve Cross-Lingual Robustness of Phonemic Contrasts?
    Lucy Skidmore
    22nd International Conference on Speech and Computer (SPECOM 2020), Springer, St. Petersburg, Russia, pp. 530-543
    For speech models that depend on sharing between phonological representations, an often overlooked issue is that phonological contrasts that are succinctly described language-internally by the phonemes and their respective featurizations are not necessarily robust across languages. This paper extends a recently proposed method for assessing the cross-linguistic consistency of phonological features in phoneme inventories. The original method employs binary neural classifiers for individual phonological contrasts trained solely on audio. This method cannot resolve some important phonological contrasts, such as retroflex consonants, cross-linguistically. We extend this approach by leveraging prior phonological knowledge during classifier training. We observe that since phonemic descriptions are articulatory rather than acoustic, the model input space needs to be grounded in phonology to better capture phonemic correlations between the training samples. The cross-linguistic consistency of the proposed method is evaluated in a multilingual setting on held-out low-resource languages and classification quality is reported. We observe modest gains over the baseline for difficult cases, such as cross-lingual detection of aspiration, and discuss multiple confounding factors that explain the dimensions of the difficulty for this task.
    Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech
    Fei He
    Shan Hui Cathy Chu
    Supheakmungkol Sarin
    Knot Pipatsrisawat
    Alena Butryna
    Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11-16 May, Marseille, France, pp. 6504-6513
    In this paper we present a multidialectal corpus approach for building a text-to-speech voice for a new dialect in a language with existing resources, focusing on various South American dialects of Spanish. We first present public speech datasets for Argentinian, Chilean, Colombian, Peruvian, Puerto Rican and Venezuelan Spanish specifically constructed with text-to-speech applications in mind using crowd-sourcing. We then compare the monodialectal voices built with minimal data to a multidialectal model built by pooling all the resources from all dialects. Our results show that the multidialectal model outperforms the monodialectal baseline models. We also experiment with a "zero-resource" dialect scenario where we build a multidialectal voice for a dialect while holding out target dialect recordings from the training data.
    This paper describes the NEMO submission to the SIGTYP 2020 shared task (Bjerva et al., 2020), which deals with prediction of linguistic typological features for multiple languages using the data derived from the World Atlas of Language Structures (WALS). We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features. We describe two submitted ridge regression-based configurations which ranked second and third overall in the constrained task. Our best configuration achieved a micro-averaged accuracy score of 0.66 on 149 test languages.
    Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems
    Fei He
    Shan Hui Cathy Chu
    Clara E. Rivera
    Martin Jansche
    Supheakmungkol Sarin
    Knot Pipatsrisawat
    Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11-16 May, Marseille, France, pp. 6494-6503
    We present free, high-quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty-two official languages of India. The corpora are primarily intended for use in text-to-speech (TTS) applications, such as constructing multilingual voices or being used for speaker or language adaptation. The data can also be useful for automatic speech recognition (ASR) in various multilingual scenarios. Most of the corpora (apart from Marathi, which is a female-only database) consist of at least 2,000 recorded lines from female and male native speakers of the language. We present the methodological details behind corpora acquisition, which can be scaled to acquiring the data for more languages of interest. We describe the experiments in building a multilingual text-to-speech model that is constructed by combining our corpora. Our results indicate that using these corpora results in good quality voices, with Mean Opinion Scores (MOS) > 3.6, for all the languages tested. We believe that these resources, released with an open-source license, and the described methodology will help in developing speech applications for the Indic languages and aid corpora development for other, smaller, languages of India and beyond.
    Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
    Alena Butryna
    Clara E. Rivera
    Proc. of 1st Joint Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop (SLTU-CCURL 2020), European Language Resources Association (ELRA), 11-12 May, Marseille, France, pp. 21-27
    This paper introduces three new open speech datasets for Basque, Catalan and Galician, which are languages of Spain, where Catalan is furthermore the official language of the Principality of Andorra. The datasets consist of high-quality multi-speaker recordings of the three languages along with the associated transcriptions. The resulting corpora include over 33 hours of crowd-sourced recordings of 132 male and female native speakers. The recording scripts also include material for elicitation of global and local place names, personal and business names. The datasets are released under a permissive license and are available for free download for commercial, academic and personal use. The high-quality annotated speech datasets described in this paper can be used to, among other things, build text-to-speech systems, serve as adaptation data in automatic speech recognition and provide useful phonetic and phonological insights in corpus linguistics.
    Open-source Multi-speaker Corpora of the English Accents in the British Isles
    Clara E. Rivera
    Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11-16 May, Marseille, France, pp. 6532-6541
    This paper presents a dataset of transcribed high-quality audio of English sentences recorded by volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words. Overlapping lines for all speakers were included for idiolect elicitation; these include lines that are the same as or similar to those in other existing resources, such as the CSTR VCTK corpus and the Speech Accent Archive, to allow for easy comparison of personal and regional accents. The resulting corpora include over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.
    Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
    Alena Butryna
    Shan Hui Cathy Chu
    Linne Ha
    Fei He
    Martin Jansche
    Chen Fang Li
    Tatiana Merkulova
    Yin May Oo
    Knot Pipatsrisawat
    Clara E. Rivera
    Supheakmungkol Sarin
    Pasindu De Silva
    Keshan Sodimana
    Jaka Aris Eko Wibawa
    2019 UNESCO International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, 4-6 December, Paris, France, pp. 91-94
    This paper presents an overview of a program designed to address the growing need for developing free speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
    Sampling from Stochastic Finite Automata with Applications to CTC Decoding
    Martin Jansche
    Proc. of Interspeech 2019 (20th Annual Conference of the International Speech Communication Association), International Speech Communication Association (ISCA), September 15-19, Graz, Austria, pp. 2230-2234
    Stochastic finite automata arise naturally in many language and speech processing tasks. They include stochastic acceptors, which represent certain probability distributions over random strings. We consider the problem of efficient sampling: drawing random string variates from the probability distribution represented by stochastic automata and transformations of those. We show that path-sampling is effective and can be efficient if the epsilon-graph of a finite automaton is acyclic. We provide an algorithm that ensures this by conflating epsilon-cycles within strongly connected components. Sampling is also effective in the presence of non-injective transformations of strings. We illustrate this in the context of decoding for Connectionist Temporal Classification (CTC), where the predictive probabilities yield auxiliary sequences which are transformed into shorter labeling strings. We can sample efficiently from the transformed labeling distribution and use this in two different strategies for finding the most probable CTC labeling.
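    The CTC setting mentioned in this abstract can be illustrated with a minimal Monte Carlo sketch. It assumes independent per-frame distributions given as plain dictionaries and a blank symbol written `_`; the names `ctc_collapse` and `sample_labelings` are hypothetical. This is plain path sampling followed by the many-to-one CTC collapse, not the automata-based algorithm of the paper.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence

BLANK = "_"  # hypothetical blank symbol


def ctc_collapse(path: Sequence[str]) -> str:
    """Merge adjacent repeats, then drop blanks (the usual CTC mapping)."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def sample_labelings(frames: List[Dict[str, float]], n: int,
                     rng: random.Random) -> Counter:
    """Draw n frame-level paths and count the labelings they collapse to."""
    counts: Counter = Counter()
    for _ in range(n):
        path = [rng.choices(list(d), weights=list(d.values()))[0]
                for d in frames]
        counts[ctc_collapse(path)] += 1
    return counts


# Two frames, each favouring the blank. The single best path is "__",
# which collapses to the empty labeling with probability 0.36, but the
# paths "aa", "a_" and "_a" all collapse to "a" (total probability 0.64).
frames = [{"a": 0.4, BLANK: 0.6}, {"a": 0.4, BLANK: 0.6}]
counts = sample_labelings(frames, 2000, random.Random(0))
print(counts.most_common())
```

    Because the collapse is non-injective, the most probable labeling can differ from the labeling of the most probable path, which is why sampling from the labeling distribution is of interest.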
    Cross-Lingual Consistency of Phonological Features: An Empirical Study
    Martin Jansche
    Proc. of Interspeech 2019 (20th Annual Conference of the International Speech Communication Association), International Speech Communication Association (ISCA), September 15-19, Graz, Austria, pp. 1741-1745
    The concept of a phoneme arose historically as a theoretical abstraction that applies language-internally. Using phonemes and phonological features in cross-linguistic settings raises an important question of conceptual validity: Are contrasts that are meaningful within a language also empirically robust across languages? This paper develops a method for assessing the cross-linguistic consistency of phonological features in phoneme inventories. The method involves training separate binary neural classifiers for several phonological contrasts in audio spans centered on particular segments within continuous speech. To assess cross-linguistic consistency, these classifiers are evaluated on held-out languages and classification quality is reported. We apply this method to several common phonological contrasts, including vowel height, vowel frontness, and retroflex consonants, in the context of multi-speaker corpora for ten languages from three language families (Indo-Aryan, Dravidian, and Malayo-Polynesian). We empirically evaluate and discuss the consistency of phonological contrasts derived from features found in phonological ontologies such as PanPhon and PHOIBLE.
    Predicting the Features of World Atlas of Language Structures from Speech
    Tatiana Merkulova
    Martin Jansche
    Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), International Speech Communication Association (ISCA), 29--31 August, Gurugram, India (2018), pp. 243-247
    Preview abstract We present a novel task that involves prediction of linguistic typological features from the World Atlas of Language Structures (WALS) from multilingual speech. We frame this task as a multi-label classification involving predicting the set of non-mutually exclusive and extremely sparse multi-valued WALS features. We investigate whether the speech modality has enough signals for an RNN to reliably discriminate between the typological features for languages which are included in the training data as well as languages withheld from the training. We show that the proposed approach can identify typological features with the overall accuracy of 91.6% for the 16 in-domain and 71.1% for 19 held-out languages. In addition, our approach outperforms language identification-based baselines on all the languages. Also, we show that correctly identifying all the typological features for an unseen language is still a distant goal: for 14 languages out of 19 the prediction error is well above 30%. View details
    Preview abstract The use of linguistic typological resources in natural language processing has been steadily gaining more popularity. It has been observed that the use of typological information, often combined with distributed language representations, leads to significantly more powerful models. While linguistic typology representations from various resources have mostly been used for conditioning the models, there has been relatively little attention on predicting features from these resources from the input data. In this paper we investigate whether the various linguistic features from World Atlas of Language Structures (WALS) can be reliably inferred from multi-lingual text. Such a predictor can be used to infer structural features for a language never observed in training data. We frame this task as a multi-label classification involving predicting the set of non-mutually exclusive and extremely sparse multi-valued labels (WALS features). We construct a recurrent neural network predictor based on byte embeddings and convolutional layers and test its performance on 556 languages, providing analysis for various linguistic types, macro-areas, language families and individual features. We show that some features from various linguistic types can be predicted reliably. View details
    FonBund: A Library for Combining Cross-lingual Phonological Segment Data
    Martin Jansche
    Tatiana Merkulova
    Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 7-12 May 2018, Miyazaki, Japan, pp. 2236-2240
    Preview abstract In this paper, we present an open-source library that provides a way of mapping sequences of arbitrary phonetic segments in International Phonetic Association (IPA) alphabet into multiple articulatory feature representations. The library interfaces with several existing linguistic typology resources providing phonological segment inventories and their corresponding articulatory feature systems. Our first goal was to facilitate the derivation of articulatory features without giving a special preference to any particular phonological segment inventory provided by freely available linguistic typology resources. The second goal was to build a very light-weight library that can be easily modified to support new phonological segment inventories. In order to support IPA segments unsupported by the freely available resources, the library provides a simple configuration language for performing segment rewrites and adding custom segments with the corresponding feature structures. In addition to introducing the library and the corresponding linguistic resources, we also describe some of the practical uses of this library (multilingual speech synthesis) in the hope that this software will help facilitate multilingual speech research. View details
    A Unified Phonological Representation of South Asian Languages for Multilingual Text-to-Speech
    Martin Jansche
    Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), International Speech Communication Association (ISCA), 29--31 August, Gurugram, India (2018), pp. 80-84
    Preview abstract We present a multilingual phoneme inventory and inclusion mappings from the native inventories of several major South Asian languages for multilingual parametric text-to-speech synthesis (TTS). Our goal is to reduce the need for training data when building new TTS voices by leveraging available data for similar languages within a common feature design. For West Bengali, Gujarati, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu we compare TTS voices trained only on monolingual data with voices trained on multilingual data from 12 languages. In subjective evaluations multilingually trained voices outperform (or in a few cases are statistically tied with) the corresponding monolingual voices. The multilingual setup can further be used to synthesize speech for languages not seen in the training data; preliminary evaluations lean towards good. Our results indicate that pooling data from different languages in a single acoustic model can be beneficial, opening up new uses and research questions. View details
    Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech
    Jaka Aris Eko Wibawa
    Supheakmungkol Sarin
    Chen Fang Li
    Knot Pipatsrisawat
    Keshan Sodimana
    Martin Jansche
    Linne Ha
    Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 7-12 May 2018, Miyazaki, Japan, pp. 1610-1614
    Preview abstract We present the multi-speaker text-to-speech corpora for Javanese and Sundanese languages, the second and third biggest languages of Indonesia spoken by well over a hundred million people. The key objectives were to collect the high-quality data in an affordable way and to share the data publicly with the speech community. To achieve this, we collaborated with two local universities in Java and streamlined our recording and crowdsourcing processes to produce the corpora consisting of 5.8 thousand (Javanese) and 4.2 thousand (Sundanese) mixed-gender recordings. We used these corpora to build several configurations of multi-speaker neural network-based text-to-speech systems for Javanese and Sundanese. Subjective evaluations performed on these configurations demonstrate that multilingual configurations for which Javanese and Sundanese are trained jointly with a larger Indonesian corpus significantly outperform the systems constructed from a single language. We hope that sharing these corpora publicly and presenting our multilingual approach to text-to-speech will help the community to scale up the text-to-speech technologies to other lesser resourced languages of Indonesia. View details
    Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala, and Sundanese TTS Systems
    Keshan Sodimana
    Pasindu De Silva
    Chen Fang Li
    Supheakmungkol Sarin
    Knot Pipatsrisawat
    6th International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2018), International Speech Communication Association (ISCA), 29--31 August, Gurugram, India, pp. 147-151
    Preview abstract Text normalization is the process of converting non-standard words (NSWs) such as numbers, abbreviations, and time expressions into standard words so that their pronunciations can be derived either through lexicon lookup or by utilizing a program to predict pronunciations from spellings. Text normalization is, thus, an important component of any Text-to-Speech (TTS) system. Without such a component, the resulting voice, no matter how good the quality is, may sound unintelligent. Such a component is often built manually by translating language-specific knowledge into rules that can be utilized by TTS pipelines. In this paper, we describe an approach to develop a rule-based text normalization component for many low-resourced languages. We also describe our open source repository containing text normalization grammars for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese, and present a recipe for utilizing them in a TTS system. View details
    Areal and Phylogenetic Features for Multilingual Speech Synthesis
    Proc. of Interspeech 2017, International Speech Communication Association (ISCA), August 20–24, 2017, Stockholm, Sweden, pp. 2078-2082
    Preview abstract We introduce phylogenetic and areal language features to the domain of multilingual text-to-speech (TTS) synthesis. Intuitively, enriching the existing universal phonetic features with such cross-language shared representations should benefit the multilingual acoustic models and help to address issues like data scarcity for low-resource languages. We investigate these representations using the acoustic models based on long short-term memory (LSTM) recurrent neural networks (RNN). Subjective evaluations conducted on eight languages from diverse language families show that sometimes phylogenetic and areal representations lead to significant multilingual synthesis quality improvements. View details
    Uniform Multilingual Multi-Speaker Acoustic Model for Statistical Parametric Speech Synthesis of Low-Resourced Languages
    Proc. of Interspeech 2017, International Speech Communication Association (ISCA), August 20--24, Stockholm, Sweden, pp. 2183-2187
    Preview abstract Acquiring data for text-to-speech (TTS) systems is expensive. This typically requires large amounts of training data, which is not available for low-resourced languages. Sometimes small amounts of data can be collected, while often no data may be available at all. This paper presents an acoustic modeling approach utilizing long short-term memory (LSTM) recurrent neural network (RNN) aimed at partially addressing the language data scarcity problem. Unlike speaker-adaptation systems that aim to preserve speaker similarity across languages, the salient feature of the proposed approach is that, once constructed, the resulting system does not need retraining to cope with the previously unseen languages. This is due to language and speaker-agnostic model topology and universal linguistic feature set. Experiments on twelve languages show that the system is able to produce intelligible and sometimes natural output when language is unseen. We also show that, when small amounts of training data are available, pooling the data sometimes improves the overall intelligibility and naturalness. Finally, we show that sometimes having a multilingual system with no prior exposure to the language is better than building a single-speaker system from small amounts of data for that language. View details
    Recent Advances in Google Real-time HMM-driven Unit Selection Synthesizer
    Siamak Tazari
    Hanna Silen
    International Speech Communication Association (ISCA), Sep 8--12, San Francisco, USA, pp. 2238-2242
    Preview abstract This paper presents advances in Google's hidden Markov model (HMM)-driven unit selection speech synthesis system. We describe several improvements to the run-time system; these include minimal latency, high quality and a fast refresh cycle for new voices. Traditionally, unit selection synthesizers are limited in terms of the amount of data they can handle and the real applications they are built for. That is even more critical for real-life large-scale applications where high quality is expected and low latency is required given the available computational resources. In this paper we present an optimized engine to handle a large database at runtime and a composite unit search approach for combining diphones and phrase-based units. In addition, a new voice building strategy for handling big databases and keeping the building times low is presented. View details
    Building Statistical Parametric Multi-speaker Synthesis for Bangladeshi Bangla
    Linne Ha
    Martin Jansche
    Knot Pipatsrisawat
    5th Workshop on Spoken Language Technologies for Under-resourced languages (SLTU-2016), Procedia Computer Science (Elsevier B.V.), 09--12 May 2016, Yogyakarta, Indonesia, pp. 194-200
    Preview abstract We present a text-to-speech (TTS) system designed for the dialect of Bengali spoken in Bangladesh. This work is part of an ongoing effort to address the needs of new under-resourced languages. We propose a process for streamlining the bootstrapping of TTS systems for under-resourced languages. First, we use crowdsourcing to collect the data from multiple ordinary speakers, each speaker recording a small number of sentences. Second, we leverage an existing text normalization system for a related language (Hindi) to bootstrap a linguistic front-end for Bangla. Third, we employ statistical techniques to construct multi-speaker acoustic models using Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and Hidden Markov Model (HMM) approaches. We then describe our experiments that show that the resulting TTS voices score well in terms of their perceived quality as measured by Mean Opinion Score (MOS) evaluations. View details
    TTS for Low Resource Languages: A Bangla Synthesizer
    Linne Ha
    Martin Jansche
    Knot Pipatsrisawat
    10th edition of the Language Resources and Evaluation Conference, 23-28 May 2016, European Language Resources Association (ELRA), Portorož, Slovenia, pp. 2005-2010
    Preview abstract We present a text-to-speech (TTS) system designed for the dialect of Bengali spoken in Bangladesh. This work is part of an ongoing effort to address the needs of under-resourced languages. We propose a process for streamlining the bootstrapping of TTS systems for under-resourced languages. First, we use crowdsourcing to collect the data from multiple ordinary speakers, each speaker recording a small number of sentences. Second, we leverage an existing text normalization system for a related language (Hindi) to bootstrap a linguistic front-end for Bangla. Third, we employ statistical techniques to construct multi-speaker acoustic models using Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and Hidden Markov Model (HMM) approaches. We then describe our experiments that show that the resulting TTS voices score well in terms of their perceived quality as measured by Mean Opinion Score (MOS) evaluations. View details