Alexander Gutkin
Authored Publications
Helpful Neighbors: Leveraging Neighbors in Geographic Feature Pronunciation
Llion Jones
Haruko Ishikawa
Transactions of the Association for Computational Linguistics, vol. 11 (2023), 85–101
If one sees the place name Houston Mercer Dog Run in New York, how does one know how to pronounce it? Assuming one knows that Houston in New York is pronounced ˈhaʊstən and not like the Texas city (ˈhjuːstən), then one can probably guess that ˈhaʊstən is also used in the name of the dog park. We present a novel architecture that learns to use the pronunciations of neighboring names in order to guess the pronunciation of a given target feature. Applied to Japanese place names, we demonstrate the utility of the model for finding and proposing corrections for errors in Google Maps.
To demonstrate the utility of this approach to structurally similar problems, we also report on an application to a totally different task: Cognate reflex prediction in comparative historical linguistics. A version of the code has been open-sourced.
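The paper's model is neural, but the intuition in the abstract (reuse a verified neighbor's pronunciation of a shared token) can be illustrated with a toy voting heuristic. This is our illustrative sketch, not the paper's architecture; the function and data names are invented.

```python
from collections import Counter

def guess_token_pron(token, neighbors, default=None):
    """Toy nearest-neighbor heuristic: pick the pronunciation of
    `token` most frequently attested among nearby geographic features
    whose pronunciations are already known."""
    votes = Counter()
    for name_tokens, pron_tokens in neighbors:
        for t, p in zip(name_tokens, pron_tokens):
            if t == token:
                votes[p] += 1
    return votes.most_common(1)[0][0] if votes else default

# Nearby Manhattan features where "Houston" is already verified:
neighbors = [
    (["Houston", "Street"], ["ˈhaʊstən", "striːt"]),
    (["East", "Houston", "Street"], ["iːst", "ˈhaʊstən", "striːt"]),
]
print(guess_token_pron("Houston", neighbors))  # ˈhaʊstən
```

The actual model learns this kind of evidence-sharing end to end rather than by literal voting.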
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Sebastian Ruder
Shruti Rijhwani
Jean-Michel Sarr
Cindy Wang
John Wieting
Christo Kirov
Dana L. Dickinson
Bidisha Samanta
Connie Tao
David Adelani
Reeve Ingle
Dmitry Panteleev
Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, pp. 1856-1884
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) — languages for which NLP research is particularly far behind in meeting user needs — it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks — tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
Building Machine Translation Systems for the Next Thousand Languages
Julia Kreutzer
Mengmeng Niu
Pallavi Nikhil Baljekar
Xavier Garcia
Maxim Krikun
Pidong Wang
Apu Shah
Macduff Richard Hughes
Google Research (2022)
Design principles of an open-source language modeling microservice package for AAC text-entry applications
9th Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022), Association for Computational Linguistics (ACL), Dublin, Ireland, pp. 1-16
We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library. The intent of the library is to allow the ensembling of multiple diverse language models without requiring the clients (user interface designers, system users or speech-language pathologists) to attend to the formats of the models. Issues around privacy, security, dynamic versus static models, and methods of model combination are explored and specific design choices motivated. Some simulation experiments demonstrating the benefits of personalized language model ensembling via the library are presented.
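The ensembling idea in the MozoLM abstract (combine diverse language models behind a format-agnostic interface) can be sketched as linear interpolation over a common callable signature. This is a simplified illustration under our own naming; the actual library exposes a gRPC microservice, which is not shown here.

```python
def ensemble_next_prob(models, weights, context, symbol):
    """Mix next-symbol probabilities from several language models by
    linear interpolation. Each model is hidden behind the same
    callable interface, so clients need not know its storage format."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(w * m(context, symbol) for m, w in zip(models, weights))

# Two toy "models": a uniform model over 26 letters plus space, and a
# unigram model with a few estimated letter frequencies.
uniform = lambda ctx, s: 1.0 / 27
unigram = lambda ctx, s: {"e": 0.12, "t": 0.09}.get(s, 0.03)

p = ensemble_next_prob([uniform, unigram], [0.3, 0.7], "th", "e")
```

A personalized ensemble would simply add the user's own model to `models` with its own weight.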
Mockingbird at the SIGTYP 2022 Shared Task: Two Types of Models for Prediction of Cognate Reflexes
Christo Kirov
Proceedings of the 4th Workshop on Research in Computational Typology and Multilingual NLP (SIGTYP 2022) at NAACL, Association for Computational Linguistics (ACL), Seattle, WA, pp. 70-79
Criteria for Useful Automatic Romanization in South Asian Languages
Proceedings of the 13th Language Resources and Evaluation Conference (LREC), European Language Resources Association (ELRA), 20-25 June, Marseille, France (2022), pp. 6662-6673
This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script. This process is also known as romanization. These criteria relate either to fidelity to human linguistic behavior (pronunciation transparency, naturalness and conventionality) or to processing utility, whether for people (ease of input) or under the hood in systems (invertibility and stability across languages and scripts). When addressing these differing criteria, several linguistic considerations, such as the modeling of prominent phonological processes and their relation to orthography, need to be taken into account. We discuss these key linguistic details in the context of Brahmic scripts and languages that use them, such as Hindi and Malayalam. We then present the core features of several romanization algorithms, implemented in finite state transducer (FST) formalism, that address differing criteria. Implementation of these algorithms will be released as part of the Nisaba finite-state script processing library.
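One of the "prominent phonological processes" at issue for Hindi romanization is schwa deletion. As a hedged illustration, here is a tiny hand-rolled romanizer for a few Devanagari characters; it is not Nisaba's FST implementation, and the mapping tables are deliberately abridged.

```python
CONS = {"क": "k", "म": "m", "ल": "l", "त": "t"}
VOWEL_SIGNS = {"ा": "ā", "ि": "i", "ी": "ī"}
VIRAMA = "\u094D"  # suppresses the inherent vowel

def romanize(word, delete_final_schwa=True):
    """Toy Devanagari-to-Latin romanizer showing the inherent vowel
    and word-final schwa deletion (Hindi-style)."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in CONS:
            out.append(CONS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else None
            if nxt in VOWEL_SIGNS:        # explicit vowel sign
                out.append(VOWEL_SIGNS[nxt]); i += 2; continue
            if nxt == VIRAMA:             # vowel suppressed
                i += 2; continue
            out.append("a")               # inherent vowel
        i += 1
    if delete_final_schwa and out and out[-1] == "a":
        out.pop()
    return "".join(out)

print(romanize("कमल"))                            # kamal
print(romanize("कमल", delete_final_schwa=False))  # kamala
```

A pronunciation-transparent scheme would delete the final schwa as Hindi speakers do; a fully invertible scheme would keep it, which is exactly the tension between the criteria discussed above.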
Beyond Arabic: Software for Perso-Arabic Script Manipulation
Raiomond Doctor
Proceedings of the 7th Arabic Natural Language Processing Workshop (WANLP2022) at EMNLP, Association for Computational Linguistics (ACL), Abu Dhabi, United Arab Emirates (Hybrid), pp. 381-387
This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for ten contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.
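A concrete instance of the regional-orthography transformations described above: Urdu conventionally uses FARSI YEH and KEHEH where text typed on an Arabic keyboard may contain the visually confusable Arabic-block letters. This sketch covers just two mappings in one direction; the actual library handles many more characters and ten languages.

```python
# Map visually confusable Arabic-block letters to the forms
# conventional in Urdu orthography.
TO_URDU = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

def normalize_urdu(text):
    return text.translate(TO_URDU)

# "Pakistani" typed with Arabic-keyboard KAF and YEH, vs. the
# normalized Urdu spelling:
raw = "\u067E\u0627\u0643\u0633\u062A\u0627\u0646\u064A"
urdu = "\u067E\u0627\u06A9\u0633\u062A\u0627\u0646\u06CC"
print(normalize_urdu(raw) == urdu)  # True
```

These letter pairs are canonically non-equivalent in Unicode despite being visually near-identical in most positions, which is why plain Unicode normalization does not resolve them.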
Graphemic Normalization of the Perso-Arabic Script
Raiomond Doctor
Proceedings of Grapholinguistics in the 21st Century, 2022 (G21C, Grafematik), Paris, France
Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions (Unicode Consortium, 2021). This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community (ICANN, 2011, 2015). We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies (e.g., Aazim et al., 2009; Iyengar, 2018), insufficient literacy, and loss or lack of orthographic tradition (Jahani and Korn, 2013; Liljegren, 2018). We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques (Ponti et al., 2019; Conneau et al., 2020; Muller et al., 2021) especially for languages with a paucity of resources.
Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities
Raiomond Doctor
Lawrence Wolf-Sonkin
Proceedings of the 13th Language Resources and Evaluation Conference (LREC), European Language Resources Association (ELRA), 20-25 June, Marseille, France (2022), pp. 6450-6460
The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems. In this work, we present several substantial extensions to Brahmic script functionality within the open-source Nisaba library of finite-state script normalization and processing utilities (Johny et al., 2021). First, we extend coverage from the original ten scripts to an additional ten scripts of South Asia and beyond, including some used to record endangered languages such as Dogri. Second, we augment the language layer so that scripts used by multiple languages in distinct ways can be processed correctly for more languages, such as the Bengali script when used for the low-resource language Santali. We document key changes to the finite-state engine required to support these new languages and scripts. Finally, we add new script processing utilities, including lightweight script-level reading normalization that (unlike existing visual normalization) does not preserve visual invariance, and a fixed-input transliteration mechanism specifically tailored to Brahmic text entry with ASCII characters.
Taxonomies of writing systems since Gelb (1952) have classified systems based on what the written symbols represent: if they represent words or morphemes, they are logographic; if syllables, syllabic; if segments, alphabetic; etc. Sproat (2000) and Rogers (2005) broke with tradition by splitting the logographic and phonographic aspects into two dimensions, with logography being graded rather than a categorical distinction. A system could be syllabic, and highly logographic; or alphabetic, and mostly non-logographic. This accords better with how writing systems actually work, but neither author proposed a method for measuring logography.
In this article we propose a novel measure of the degree of logography that uses an attention-based sequence-to-sequence model trained to predict the spelling of a token from its pronunciation in context. In an ideal phonographic system, the model should need to attend only to the current token in order to compute how to spell it, and this would show in the attention matrix activations. In contrast, with a logographic system, where a given pronunciation might correspond to several different spellings, the model would need to attend to a broader context. The ratio of the activation outside the token to the total activation forms the basis of our measure. We compare this with a simple lexical measure and an entropic measure, as well as several other neural models, and argue that on balance our attention-based measure accords best with intuition about how logographic various systems are.
Our work provides the first quantifiable measure of the notion of logography that accords with linguistic intuition and, we argue, provides better insight into what this notion means.
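The core of the measure, the ratio of attention activation outside the current token to total activation, is straightforward to compute once a trained model has produced an attention matrix. A minimal sketch (variable names are ours):

```python
def logography_score(attention, token_span):
    """Fraction of total attention mass falling outside the current
    token's input span.

    attention:  rows = output (spelling) steps, columns = input
                (pronunciation-in-context) steps; non-negative weights.
    token_span: (start, end) column indices of the current token.
    """
    start, end = token_span
    total = sum(sum(row) for row in attention)
    inside = sum(sum(row[start:end]) for row in attention)
    return (total - inside) / total

# A model that attends mostly inside the token's span behaves
# phonographically, so the score is near 0:
att = [[0.05, 0.9, 0.05],
       [0.10, 0.8, 0.10]]
print(logography_score(att, (1, 2)))  # ~0.15
```

A highly logographic system would push this ratio up, since disambiguating among competing spellings requires attending to surrounding context.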
Finite-state script normalization and processing utilities: The Nisaba Brahmic library
Lawrence Wolf-Sonkin
The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations, Association for Computational Linguistics, [Online], Kyiv, Ukraine, April, 2021, pp. 14-23
This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts. The library provides a flexible and extensible framework for supporting crucial operations on Brahmic scripts, such as NFC, visual normalization, reversible transliteration, and validity checks, implemented in Python within a finite-state transducer formalism. We survey some common Brahmic script issues that may adversely affect the performance of downstream NLP tasks, and provide the rationale for finite-state design and system implementation details.
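One concrete reason the abstract distinguishes NFC from visual normalization: some visually identical Devanagari pairs are composition exclusions in Unicode, so NFC actively keeps them decomposed rather than unifying them. This is checkable with Python's standard library (the example is ours, not taken from the Nisaba codebase):

```python
import unicodedata

# U+0958 DEVANAGARI LETTER QA is a Unicode composition exclusion:
# NFC *decomposes* it to KA (U+0915) + NUKTA (U+093C) instead of
# composing the pair. Both forms render identically, so a
# script-aware visual normalization layer is still needed on top
# of plain NFC.
qa_precomposed = "\u0958"
qa_sequence = "\u0915\u093C"

print(unicodedata.normalize("NFC", qa_precomposed) == qa_sequence)  # True
print(unicodedata.normalize("NFC", qa_sequence) == qa_sequence)     # True
```

Operations like reversible transliteration and validity checks then run over the normalized form, so downstream NLP components see one canonical spelling per visually identical string.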
Eidos: An Open-Source Auditory Periphery Modeling Toolkit and Evaluation of Cross-Lingual Phonemic Contrasts
Proc. of 1st Joint Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop (SLTU-CCURL 2020), European Language Resources Association (ELRA), 11-12 May, Marseille, France, pp. 9-20
Many analytical models that mimic, in varying degrees of detail, the basic auditory processes involved in human hearing have been developed over the past decades. While the auditory periphery mechanisms responsible for transducing the sound pressure wave into the auditory nerve discharge are relatively well understood, the models that describe them are usually very complex because they try to faithfully simulate the behavior of several functionally distinct biological units involved in hearing. Because of this, there is a relative scarcity of toolkits that support combining publicly-available auditory models from multiple sources. We address this shortcoming by presenting an open-source auditory toolkit that integrates multiple models of various stages of human auditory processing into a simple and easily configurable pipeline, which supports easy switching between ten available models. The auditory representations that the pipeline produces can serve as machine learning features and provide an analytical benchmark for comparing against auditory filters learned from the data. Given a low- and high-resource language pair, we evaluate several auditory representations on a simple multilingual phonemic contrast task to determine whether contrasts that are meaningful within a language are also empirically robust across languages.
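The "easily configurable pipeline" design amounts to chaining interchangeable processing stages. A schematic sketch with made-up stage names, not Eidos's actual API:

```python
def make_pipeline(*stages):
    """Chain auditory-processing stages into a single callable, so
    swapping one model for another means swapping one stage."""
    def run(signal):
        for stage in stages:
            signal = stage(signal)
        return signal
    return run

# Toy stages standing in for, e.g., outer/middle-ear filtering and a
# hair-cell transduction model:
pre_emphasis = lambda xs: [x * 0.97 for x in xs]
half_wave_rectify = lambda xs: [max(x, 0.0) for x in xs]

pipeline = make_pipeline(pre_emphasis, half_wave_rectify)
print(pipeline([1.0, -1.0]))  # [0.97, 0.0]
```

Switching among the ten supported models is then a matter of configuration, i.e. choosing which stage implementations to compose.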
Towards Induction of Structured Phoneme Inventories
Martin Jansche
Lucy Skidmore
Association for Computational Linguistics (ACL), 19th November, Online
This extended abstract was presented at "SIGTYP 2020: The Second Workshop on Computational Research in Linguistic Typology" at EMNLP 2020.
Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech
Fei He
Shan Hui Cathy Chu
Supheakmungkol Sarin
Knot Pipatsrisawat
Alena Butryna
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11-16 May, Marseille, France, pp. 6504-6513
In this paper we present a multidialectal corpus approach for building a text-to-speech voice for a new dialect in a language with existing resources, focusing on various South American dialects of Spanish. We first present public speech datasets for Argentinian, Chilean, Colombian, Peruvian, Puerto Rican and Venezuelan Spanish specifically constructed with text-to-speech applications in mind using crowd-sourcing. We then compare the monodialectal voices built with minimal data to a multidialectal model built by pooling all the resources from all dialects. Our results show that the multidialectal model outperforms the monodialectal baseline models. We also experiment with a "zero-resource" dialect scenario where we build a multidialectal voice for a dialect while holding out target dialect recordings from the training data.
Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems
Fei He
Shan Hui Cathy Chu
Clara E. Rivera
Martin Jansche
Supheakmungkol Sarin
Knot Pipatsrisawat
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11-16 May, Marseille, France, pp. 6494-6503
We present free, high-quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty-two official languages of India. The corpora are primarily intended for use in text-to-speech (TTS) applications, such as constructing multilingual voices or performing speaker or language adaptation. The data can also be useful for automatic speech recognition (ASR) in various multilingual scenarios. Most of the corpora (apart from Marathi, which is a female-only database) consist of at least 2,000 recorded lines from female and male native speakers of the language. We present the methodological details behind corpora acquisition, which can be scaled to acquiring data for more languages of interest. We describe the experiments in building a multilingual text-to-speech model that is constructed by combining our corpora. Our results indicate that using these corpora yields good quality voices, with Mean Opinion Scores (MOS) above 3.6, for all the languages tested. We believe that these resources, released with an open-source license, and the described methodology will help in developing speech applications for the Indic languages and aid corpora development for other, smaller, languages of India and beyond.
Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
Alena Butryna
Clara E. Rivera
Proc. of 1st Joint Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop (SLTU-CCURL 2020), European Language Resources Association (ELRA), 11-12 May, Marseille, France, pp. 21-27
This paper introduces three new open speech datasets for Basque, Catalan and Galician, which are languages of Spain, where Catalan is furthermore the official language of the Principality of Andorra. The datasets consist of high-quality multi-speaker recordings of the three languages along with the associated transcriptions. The resulting corpora include over 33 hours of crowd-sourced recordings
of 132 male and female native speakers. The recording scripts also include material for elicitation of global and local place names, personal and business names. The datasets are released under a permissive license and are available for free download for commercial, academic and personal use. The high-quality annotated speech datasets described in this paper can be used to, among other things, build text-to-speech systems, serve as adaptation data in automatic speech recognition and provide useful phonetic and phonological insights in corpus linguistics.
Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech
Yin May Oo
Chen Fang Li
Pasindu De Silva
Supheakmungkol Sarin
Knot Pipatsrisawat
Martin Jansche
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11-16 May, Marseille, France, pp. 6328-6339
This paper introduces an open-source crowd-sourced multi-speaker speech corpus along with a comprehensive set of finite-state transducer (FST) grammars for performing text normalization for the Burmese (Myanmar) language. We also introduce open-source finite-state grammars for performing grapheme-to-phoneme (G2P) conversion for Burmese. These three components are necessary (but not sufficient) for building a high-quality text-to-speech (TTS) system for Burmese, a tonal Southeast Asian language from the Sino-Tibetan family which presents several linguistic challenges. We describe the corpus acquisition process and provide the details of our finite-state-based approach to Burmese text normalization and G2P. Our experiments involve building a multi-speaker TTS system based on long short-term memory (LSTM) recurrent neural network (RNN) models, which were previously shown to perform well for other languages in a low-resource setting. Our results indicate that the data and grammars that we are announcing are sufficient to build reasonably high-quality models comparable to other systems. We hope these resources will facilitate speech and language research on the Burmese language, which is considered by many to be low-resource due to the limited availability of free linguistic data.
Developing an Open-Source Corpus of Yoruba Speech
Clara E. Rivera
Kólá Túbòsún
Proc. of Interspeech 2020, International Speech Communication Association (ISCA), October 25-29, Shanghai, China, 2020, pp. 404-408
This paper introduces an open-source speech dataset for Yoruba, one of the largest low-resource West African languages, spoken by at least 22 million people. Yoruba is one of the official languages of Nigeria, Benin and Togo, and is spoken in other neighboring African countries and beyond. The corpus consists of over four hours of 48 kHz recordings from 36 male and female volunteers and the corresponding transcriptions, which include disfluency annotation. The transcriptions have full diacritization, which is vital for pronunciation and lexical disambiguation. The annotated speech dataset described in this paper is primarily intended for use in text-to-speech systems, for use as adaptation data in automatic speech recognition and speech-to-speech translation, and to provide insights in West African corpus linguistics. We demonstrate the use of this corpus in a simple statistical parametric speech synthesis (SPSS) scenario, evaluating it against the related languages from the CMU Wilderness dataset and the Yoruba Lagos-NWU corpus.
Open-source Multi-speaker Corpora of the English Accents in the British Isles
Clara E. Rivera
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11-16 May, Marseille, France, pp. 6532-6541
This paper presents a dataset of transcribed high-quality audio of English sentences recorded by volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use in speech technologies. The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal names in different accents, as well as native speaker pronunciations of local words. Overlapping lines for all speakers were included for idiolect elicitation; these include lines that are the same as or similar to those in other existing resources, such as the CSTR VCTK corpus and the Speech Accent Archive, to allow for easy comparison of personal and regional accents. The resulting corpora include over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.
NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task
Association for Computational Linguistics (ACL), 19th November, Online, pp. 17-28
This paper describes the NEMO submission to the SIGTYP 2020 shared task (Bjerva et al., 2020), which deals with prediction of linguistic typological features for multiple languages using data derived from the World Atlas of Language Structures (WALS). We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features. We describe two submitted ridge regression-based configurations which ranked second and third overall in the constrained task. Our best configuration achieved a micro-averaged accuracy of 0.66 on 149 test languages.
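As a hedged illustration of the ridge-regression estimators mentioned above, here is the closed form stripped down to a single feature dimension (the actual submission predicts multi-valued WALS features from many correlated features; all data here is invented):

```python
def ridge_fit(xs, ys, alpha):
    """One-dimensional ridge regression in closed form:
    w = sum(x*y) / (sum(x*x) + alpha). The penalty alpha shrinks the
    fitted weight toward zero, trading fit for stability."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Toy data: one numeric feature value predicting another.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w = ridge_fit(xs, ys, alpha=0.0)        # exact fit: w = 2.0
w_shrunk = ridge_fit(xs, ys, alpha=14.0)  # shrunk: w = 1.0
```

In the constrained-task setting, regularization of this kind is what keeps the simple estimators from overfitting the sparse WALS feature matrix.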
Does A Priori Phonological Knowledge Improve Cross-Lingual Robustness of Phonemic Contrasts?
Lucy Skidmore
22nd International Conference on Speech and Computer (SPECOM 2020), Springer, St. Petersburg, Russia, pp. 530-543
For speech models that depend on sharing between phonological representations, an often overlooked issue is that phonological contrasts that are succinctly described language-internally by the phonemes and their respective featurizations are not necessarily robust across languages. This paper extends a recently proposed method for assessing the cross-linguistic consistency of phonological features in phoneme inventories. The original method employs binary neural classifiers for individual phonological contrasts trained solely on audio. This method cannot resolve some important phonological contrasts, such as retroflex consonants, cross-linguistically. We extend this approach by leveraging prior phonological knowledge during classifier training. We observe that since phonemic descriptions are articulatory rather than acoustic, the model input space needs to be grounded in phonology to better capture phonemic correlations between the training samples. The cross-linguistic consistency of the proposed method is evaluated in a multilingual setting on held-out low-resource languages and classification quality is reported. We observe modest gains over the baseline for difficult cases, such as cross-lingual detection of aspiration, and discuss multiple confounding factors that explain the dimensions of the difficulty for this task.
Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
Alena Butryna
Shan Hui Cathy Chu
Linne Ha
Fei He
Martin Jansche
Chen Fang Li
Tatiana Merkulova
Yin May Oo
Knot Pipatsrisawat
Clara E. Rivera
Supheakmungkol Sarin
Pasindu De Silva
Keshan Sodimana
Jaka Aris Eko Wibawa
2019 UNESCO International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, 4-6 December, Paris, France, pp. 91-94
This paper presents an overview of a program designed to address the growing need for developing free speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
Sampling from Stochastic Finite Automata with Applications to CTC Decoding
Martin Jansche
Proc. of Interspeech 2019 (20th Annual Conference of the International Speech Communication Association), International Speech Communication Association (ISCA), September 15-19, Graz, Austria, pp. 2230-2234
Stochastic finite automata arise naturally in many language and speech processing tasks. They include stochastic acceptors, which represent certain probability distributions over random strings. We consider the problem of efficient sampling: drawing random string variates from the probability distribution represented by stochastic automata and transformations of those. We show that path-sampling is effective and can be efficient if the epsilon-graph of a finite automaton is acyclic. We provide an algorithm that ensures this by conflating epsilon-cycles within strongly connected components. Sampling is also effective in the presence of non-injective transformations of strings. We illustrate this in the context of decoding for Connectionist Temporal Classification (CTC), where the predictive probabilities yield auxiliary sequences which are transformed into shorter labeling strings. We can sample efficiently from the transformed labeling distribution and use this in two different strategies for finding the most probable CTC labeling.
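The CTC side of this abstract can be illustrated with a toy Monte-Carlo sketch of our own (the paper samples from stochastic automata rather than raw per-frame tables): sample frame-level paths, collapse each path to its labeling, and tally which labeling occurs most often.

```python
import random

def ctc_collapse(path, blank="-"):
    """Map a frame-level CTC path to its labeling: merge runs of
    repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

def sample_most_probable_labeling(frame_dists, n=1000, seed=0):
    """Estimate the most probable CTC labeling by sampling n paths
    from independent per-frame distributions and counting the
    collapsed labelings."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        path = [rng.choices(list(d), weights=list(d.values()))[0]
                for d in frame_dists]
        lab = ctc_collapse(path)
        counts[lab] = counts.get(lab, 0) + 1
    return max(counts, key=counts.get)

# Three frames over the symbols {a, b, blank}:
frames = [{"a": 0.6, "-": 0.4}, {"a": 0.5, "-": 0.5}, {"b": 0.9, "-": 0.1}]
print(sample_most_probable_labeling(frames))  # most likely "ab"
```

The non-injectivity the abstract mentions shows up here directly: the distinct paths "aab", "a-b" and "-ab" all collapse to the same labeling "ab", so their probabilities are pooled by the tally.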
Cross-Lingual Consistency of Phonological Features: An Empirical Study
Martin Jansche
Proc. of Interspeech 2019 (20th Annual Conference of the International Speech Communication Association), International Speech Communication Association (ISCA), September 15-19, Graz, Austria, pp. 1741-1745
The concept of a phoneme arose historically as a theoretical abstraction that applies language-internally. Using phonemes and phonological features in cross-linguistic settings raises an important question of conceptual validity: Are contrasts that are meaningful within a language also empirically robust across languages? This paper develops a method for assessing the cross-linguistic consistency of phonological features in phoneme inventories. The method involves training separate binary neural classifiers for several phonological contrasts on audio spans centered on particular segments within continuous speech. To assess cross-linguistic consistency, these classifiers are evaluated on held-out languages and classification quality is reported. We apply this method to several common phonological contrasts, including vowel height, vowel frontness, and retroflex consonants, in the context of multi-speaker corpora for ten languages from three language families (Indo-Aryan, Dravidian, and Malayo-Polynesian). We empirically evaluate and discuss the consistency of phonological contrasts derived from features found in phonological ontologies such as PanPhon and PHOIBLE.
The use of linguistic typological resources in natural language processing has been steadily gaining popularity. It has been observed that the use of typological information, often combined with distributed language representations, leads to significantly more powerful models. While linguistic typology representations from various resources have mostly been used for conditioning models, relatively little attention has been paid to predicting the features in these resources from input data. In this paper we investigate whether the various linguistic features from the World Atlas of Language Structures (WALS) can be reliably inferred from multi-lingual text. Such a predictor can be used to infer structural features for a language never observed in training data. We frame this task as multi-label classification involving prediction of a set of non-mutually exclusive and extremely sparse multi-valued labels (WALS features). We construct a recurrent neural network predictor based on byte embeddings and convolutional layers and test its performance on 556 languages, providing analysis for various linguistic types, macro-areas, language families and individual features. We show that some features from various linguistic types can be predicted reliably.
Predicting the Features of World Atlas of Language Structures from Speech
Tatiana Merkulova
Martin Jansche
Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), International Speech Communication Association (ISCA), 29-31 August, Gurugram, India (2018), pp. 243-247
We present a novel task that involves prediction of linguistic typological features from the World Atlas of Language Structures (WALS) from multilingual speech. We frame this task as multi-label classification involving prediction of the set of non-mutually exclusive and extremely sparse multi-valued WALS features. We investigate whether the speech modality carries enough signal for an RNN to reliably discriminate between the typological features for languages included in the training data as well as for languages withheld from training. We show that the proposed approach can identify typological features with an overall accuracy of 91.6% for the 16 in-domain and 71.1% for the 19 held-out languages. In addition, our approach outperforms language identification-based baselines on all the languages. We also show that correctly identifying all the typological features for an unseen language remains a distant goal: for 14 languages out of 19 the prediction error is well above 30%.
View details
A Unified Phonological Representation of South Asian Languages for Multilingual Text-to-Speech
Martin Jansche
Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), International Speech Communication Association (ISCA), 29--31 August, Gurugram, India (2018), pp. 80-84
Preview abstract
We present a multilingual phoneme inventory, together with inclusion mappings from the native inventories of several major South Asian languages, for multilingual parametric text-to-speech synthesis (TTS). Our goal is to reduce the need for training data when building new TTS voices by leveraging available data for similar languages within a common feature design. For West Bengali, Gujarati, Kannada, Malayalam, Marathi, Tamil, Telugu, and Urdu we compare TTS voices trained only on monolingual data with voices trained on multilingual data from 12 languages. In subjective evaluations the multilingually trained voices outperform (or in a few cases are statistically tied with) the corresponding monolingual voices. The multilingual setup can further be used to synthesize speech for languages not seen in the training data; preliminary evaluations of such synthesis are encouraging. Our results indicate that pooling data from different languages in a single acoustic model can be beneficial, opening up new uses and research questions.
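The "inclusion mapping" idea can be sketched as a per-language table that maps native phoneme labels into a shared inventory, so that data from all languages lands in one symbol space. The symbols, language codes, and mappings below are illustrative assumptions, not the paper's actual inventory:

```python
# Hypothetical shared South Asian phoneme inventory (IPA-style symbols).
SHARED = {"a", "aː", "i", "k", "kʰ", "t̪", "ʈ", "n", "ɳ"}

# Toy inclusion mappings from native inventories into the shared one.
NATIVE_TO_SHARED = {
    "bn": {"a": "a", "k": "k", "kh": "kʰ", "t": "t̪"},  # Bengali (sketch)
    "ta": {"a": "a", "k": "k", "t": "t̪", "T": "ʈ"},    # Tamil (sketch)
}

def to_shared(lang, phonemes):
    """Map a native transcription into the shared inventory, flagging gaps
    so that inventory-coverage problems surface at data-preparation time."""
    out = []
    for p in phonemes:
        shared = NATIVE_TO_SHARED[lang].get(p)
        if shared is None or shared not in SHARED:
            raise KeyError(f"{lang}: no shared-inventory mapping for {p!r}")
        out.append(shared)
    return out
```

Once all training transcriptions pass through such a mapping, a single acoustic model can be trained on the pooled, consistently labeled data.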
View details
FonBund: A Library for Combining Cross-lingual Phonological Segment Data
Martin Jansche
Tatiana Merkulova
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 7-12 May 2018, Miyazaki, Japan, pp. 2236-2240
Preview abstract
In this paper, we present an open-source library that maps sequences of arbitrary phonetic segments in the International Phonetic Association (IPA) alphabet into multiple articulatory feature representations. The library interfaces with several existing linguistic typology resources that provide phonological segment inventories and their corresponding articulatory feature systems. Our first goal was to facilitate the derivation of articulatory features without giving special preference to any particular phonological segment inventory provided by freely available linguistic typology resources. The second goal was to build a very lightweight library that can be easily modified to support new phonological segment inventories. In order to support IPA segments unsupported by the freely available resources, the library provides a simple configuration language for performing segment rewrites and adding custom segments with the corresponding feature structures. In addition to introducing the library and the corresponding linguistic resources, we describe some practical uses of this library (multilingual speech synthesis), in the hope that this software will help facilitate multilingual speech research.
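The core lookup-plus-rewrite mechanism described above can be sketched in a few lines. The feature table and rewrite rule here are toy stand-ins, not FonBund's actual data or configuration syntax:

```python
# Minimal sketch of the FonBund idea: look up articulatory features for IPA
# segments, with user-supplied rewrites for segments a resource lacks.
FEATURES = {
    "p": {"voice": "-", "place": "bilabial", "manner": "stop"},
    "b": {"voice": "+", "place": "bilabial", "manner": "stop"},
    "a": {"voice": "+", "height": "open", "backness": "front"},
}

# Rewrite unsupported segments onto supported ones, in the spirit of the
# library's configuration language for segment rewrites.
REWRITES = {"ɑ": "a"}

def segment_features(segments):
    """Map each IPA segment to its articulatory feature structure."""
    feats = []
    for seg in segments:
        seg = REWRITES.get(seg, seg)  # apply rewrite rules first
        if seg not in FEATURES:
            raise KeyError(f"unsupported IPA segment: {seg!r}")
        feats.append(FEATURES[seg])
    return feats
```

Supporting a new segment inventory then amounts to swapping in a different feature table, which is what keeps such a design lightweight.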
View details
Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala, and Sundanese TTS Systems
Keshan Sodimana
Pasindu De Silva
Chen Fang Li
Supheakmungkol Sarin
Knot Pipatsrisawat
6th International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2018), International Speech Communication Association (ISCA), 29--31 August, Gurugram, India, pp. 147-151
Preview abstract
Text normalization is the process of converting non-standard words (NSWs) such as numbers, abbreviations, and time expressions into standard words so that their pronunciations can be derived either through lexicon lookup or by a program that predicts pronunciations from spellings. Text normalization is thus an important component of any text-to-speech (TTS) system. Without such a component, the resulting voice, no matter how good its quality, may sound unintelligent. Such a component is often built manually by translating language-specific knowledge into rules that can be utilized by TTS pipelines. In this paper, we describe an approach to developing a rule-based text normalization component for many low-resourced languages. We also describe our open-source repository containing text normalization grammars for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese, and present a recipe for utilizing them in a TTS system.
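The flavor of rule-based NSW expansion can be illustrated with a tiny sketch. Production systems typically compile such rules into finite-state grammars; the English abbreviation and digit rules below are illustrative stand-ins, not the open-sourced grammars themselves:

```python
import re

# Toy rule set: expand abbreviations and spell out standalone digits.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}

NUMBER_NAMES = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]

def expand_digits(text):
    """Spell out standalone single digits as number words."""
    return re.sub(r"\b\d\b", lambda m: NUMBER_NAMES[int(m.group())], text)

def normalize(text):
    """Apply abbreviation rules, then digit rules, yielding standard words
    whose pronunciations a lexicon or G2P model can handle."""
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    return expand_digits(text)
```

Real grammars must additionally resolve context-dependent ambiguity (e.g. "St." as "Street" vs. "Saint"), which is exactly the language-specific knowledge the rules encode.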
View details
Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech
Jaka Aris Eko Wibawa
Supheakmungkol Sarin
Chen Fang Li
Knot Pipatsrisawat
Keshan Sodimana
Martin Jansche
Linne Ha
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 7-12 May 2018, Miyazaki, Japan, pp. 1610-1614
Preview abstract
We present multi-speaker text-to-speech corpora for Javanese and Sundanese, the second- and third-largest languages of Indonesia, spoken by well over a hundred million people combined. The key objectives were to collect high-quality data in an affordable way and to share the data publicly with the speech community. To achieve this, we collaborated with two local universities in Java and streamlined our recording and crowdsourcing processes to produce corpora consisting of 5.8 thousand (Javanese) and 4.2 thousand (Sundanese) mixed-gender recordings. We used these corpora to build several configurations of multi-speaker neural network-based text-to-speech systems for Javanese and Sundanese. Subjective evaluations of these configurations demonstrate that multilingual configurations, in which Javanese and Sundanese are trained jointly with a larger Indonesian corpus, significantly outperform the systems constructed from a single language. We hope that sharing these corpora publicly and presenting our multilingual approach to text-to-speech will help the community scale up text-to-speech technologies to other lesser-resourced languages of Indonesia.
View details
Uniform Multilingual Multi-Speaker Acoustic Model for Statistical Parametric Speech Synthesis of Low-Resourced Languages
Proc. of Interspeech 2017, International Speech Communication Association (ISCA), August 20--24, Stockholm, Sweden, pp. 2183-2187
Preview abstract
Acquiring data for text-to-speech (TTS) systems is expensive: such systems typically require large amounts of training data, which are not available for low-resourced languages. Sometimes small amounts of data can be collected, while often no data may be available at all. This paper presents an acoustic modeling approach based on long short-term memory (LSTM) recurrent neural networks (RNNs), aimed at partially addressing the language data scarcity problem. Unlike speaker-adaptation systems that aim to preserve speaker similarity across languages, the salient feature of the proposed approach is that, once constructed, the resulting system does not need retraining to cope with previously unseen languages. This is due to its language- and speaker-agnostic model topology and universal linguistic feature set. Experiments on twelve languages show that the system is able to produce intelligible, and sometimes natural, output for an unseen language. We also show that, when small amounts of training data are available, pooling the data sometimes improves overall intelligibility and naturalness. Finally, we show that a multilingual system with no prior exposure to a language can sometimes outperform a single-speaker system built from small amounts of data for that language.
View details
Areal and Phylogenetic Features for Multilingual Speech Synthesis
Proc. of Interspeech 2017, International Speech Communication Association (ISCA), August 20–24, 2017, Stockholm, Sweden, pp. 2078-2082
Preview abstract
We introduce phylogenetic and areal language features to the domain of multilingual text-to-speech (TTS) synthesis. Intuitively, enriching the existing universal phonetic features with such cross-language shared representations should benefit the multilingual acoustic models and help to address issues like data scarcity for low-resource languages. We investigate these representations using acoustic models based on long short-term memory (LSTM) recurrent neural networks (RNNs). Subjective evaluations conducted on eight languages from diverse language families show that phylogenetic and areal representations sometimes lead to significant improvements in multilingual synthesis quality.
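One simple way to realize this enrichment is to append one-hot codes for a language's family (phylogenetic) and region (areal) to each phonetic input vector, letting related languages share statistics. The groupings and encoding below are illustrative assumptions, not the paper's exact feature design:

```python
# Sketch: append phylogenetic (family) and areal (region) one-hot codes
# to the per-phone phonetic feature vector.
FAMILIES = ["indo-european", "dravidian", "austronesian"]
AREAS = ["europe", "south-asia", "southeast-asia"]

LANG_INFO = {
    "bn": ("indo-european", "south-asia"),  # Bengali
    "ta": ("dravidian", "south-asia"),      # Tamil
}

def enrich(phonetic_vec, lang):
    """Concatenate family and area one-hot vectors onto the phonetic input."""
    family, area = LANG_INFO[lang]
    family_onehot = [1 if f == family else 0 for f in FAMILIES]
    area_onehot = [1 if a == area else 0 for a in AREAS]
    return phonetic_vec + family_onehot + area_onehot
```

Under this encoding Bengali and Tamil share the areal code but differ in the phylogenetic one, which is exactly the kind of partial overlap that lets the acoustic model pool data along both axes.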
View details
Recent Advances in Google Real-time HMM-driven Unit Selection Synthesizer
Siamak Tazari
Hanna Silen
Proc. of Interspeech 2016, International Speech Communication Association (ISCA), September 8--12, San Francisco, USA, pp. 2238-2242
Preview abstract
This paper presents advances in Google's hidden Markov model (HMM)-driven unit selection speech synthesis system. We describe several improvements to the run-time system, including minimal latency, high quality, and a fast refresh cycle for new voices. Traditionally, unit selection synthesizers are limited in the amount of data they can handle and in the real applications they are built for. This is even more critical for real-life large-scale applications, where high quality is expected and low latency is required given the available computational resources. In this paper we present an optimized engine for handling a large database at runtime and a composite unit search approach for combining diphones and phrase-based units. In addition, we present a new voice-building strategy that handles large databases while keeping build times low.
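At its core, a unit search of this kind is a dynamic program over candidate units, minimizing the sum of target costs (how well a unit matches the requested specification) and join costs (how smoothly adjacent units concatenate). The toy sketch below shows the generic search, with placeholder cost functions rather than the system's actual costs or its diphone/phrase-unit mix:

```python
# Toy dynamic-programming unit search: pick one candidate unit per target
# slot, minimizing target cost plus join cost, as in unit selection TTS.
def select_units(candidates, target_cost, join_cost):
    """candidates: list of per-slot candidate lists.
    Returns (total_cost, path) for the cheapest unit sequence."""
    # best[u] = (cost, path) for the cheapest path ending in unit u.
    best = {u: (target_cost(u), [u]) for u in candidates[0]}
    for slot in candidates[1:]:
        new_best = {}
        for u in slot:
            # Cheapest predecessor when joining onto u.
            prev, (cost, path) = min(
                best.items(),
                key=lambda kv: kv[1][0] + join_cost(kv[0], u))
            new_best[u] = (cost + join_cost(prev, u) + target_cost(u),
                           path + [u])
        best = new_best
    return min(best.values(), key=lambda cp: cp[0])
```

A composite search over mixed unit types can reuse the same recurrence, with longer phrase-based units spanning several slots; handling that span bookkeeping efficiently at scale is where the engineering effort described above goes.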
View details
TTS for Low Resource Languages: A Bangla Synthesizer
Linne Ha
Martin Jansche
Knot Pipatsrisawat
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), 23-28 May 2016, Portorož, Slovenia, pp. 2005-2010
Preview abstract
We present a text-to-speech (TTS) system designed for the dialect of Bengali spoken in Bangladesh. This work is part of an ongoing effort to address the needs of under-resourced languages. We propose a process for streamlining the bootstrapping of TTS systems for under-resourced languages. First, we use crowdsourcing to collect data from multiple ordinary speakers, each recording a small number of sentences. Second, we leverage an existing text normalization system for a related language (Hindi) to bootstrap a linguistic front-end for Bangla. Third, we employ statistical techniques to construct multi-speaker acoustic models using Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and Hidden Markov Model (HMM) approaches. We then describe experiments showing that the resulting TTS voices score well in terms of perceived quality, as measured by Mean Opinion Score (MOS) evaluations.
View details
Building Statistical Parametric Multi-speaker Synthesis for Bangladeshi Bangla
Linne Ha
Martin Jansche
Knot Pipatsrisawat
5th Workshop on Spoken Language Technologies for Under-resourced languages (SLTU-2016), Procedia Computer Science (Elsevier B.V.), 09--12 May 2016, Yogyakarta, Indonesia, pp. 194-200
Preview abstract
We present a text-to-speech (TTS) system designed for the dialect of Bengali spoken in Bangladesh. This work is part of an ongoing effort to address the needs of new under-resourced languages. We propose a process for streamlining the bootstrapping of TTS systems for under-resourced languages. First, we use crowdsourcing to collect data from multiple ordinary speakers, each recording a small number of sentences. Second, we leverage an existing text normalization system for a related language (Hindi) to bootstrap a linguistic front-end for Bangla. Third, we employ statistical techniques to construct multi-speaker acoustic models using Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and Hidden Markov Model (HMM) approaches. We then describe experiments showing that the resulting TTS voices score well in terms of perceived quality, as measured by Mean Opinion Score (MOS) evaluations.
View details