Oddur Kjartansson

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract As we move towards large-scale models capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a dataset's origins, development, intent, ethical considerations and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of its documentation. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation, and consistency and comparability across the documentation of all datasets involved, and as such documentation must be treated as a user-centric product in and of itself. In this paper, we propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of processes and rationales that shape the data and consequently the models—such as upstream sources, data collection and annotation methods; training and evaluation methods, intended use, or decisions affecting model performance. We also present evaluative frameworks that ground Data Cards in real-world utility and human-centricity. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over twenty Data Cards. View details
    Preview abstract Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However the datasets which empower machine learning are often used, shared and re-used with little visibility into the processes of deliberation which led to their creation. Which stakeholder groups had their perspectives included when the dataset was conceived? Which domain experts were consulted regarding how to model subgroups and other phenomena? How were questions of representational biases measured and addressed? Who labeled the data? In this paper, we introduce a rigorous framework for dataset development transparency which supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields a set of documents that facilitate improved communication and decision-making, as well as drawing attention the value and necessity of careful data work. The proposed framework is intended to contribute to closing the accountability gap in artificial intelligence systems, by making visible the often overlooked work that goes into dataset creation. View details
    Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech
    Fei He
    Shan Hui Cathy Chu
    Supheakmungkol Sarin
    Knot Pipatsrisawat
    Alena Butryna
    Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, pp. 6504-6513
    Preview abstract In this paper we present a multidialectal corpus approach for building a text-to-speech voice for a new dialect in a language with existing resources, focusing on various South American dialects of Spanish. We first present public speech datasets for Argentinian, Chilean, Colombian, Peruvian, Puerto Rican and Venezuelan Spanish specifically constructed with text-to-speech applications in mind using crowd-sourcing. We then compare the monodialectal voices built with minimal data to a multidialectal model built by pooling all the resources from all dialects. Our results show that the multidialectal model outperforms the monodialectal baseline models. We also experiment with a ``zero-resource'' dialect scenario where we build a multidialectal voice for a dialect while holding out target dialect recordings from the training data. View details
    Open-source Multi-speaker Corpora of the English Accents in the British Isles
    Clara E. Rivera
    Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, 6532‑-6541
    Preview abstract This paper presents a dataset of transcribed high-quality audio of English sentences recorded by volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words. Overlapping lines for all speakers were included for idiolect elicitation which include the same or similar lines with other existing resources such as the CSTR VCTK corpus and the Speech Accent Archive to allow for easy comparison of personal and regional accents. The resulting corpora include over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English. View details
    Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems
    Fei He
    Shan Hui Cathy Chu
    Clara E. Rivera
    Martin Jansche
    Supheakmungkol Sarin
    Knot Pipatsrisawat
    Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, 6494‑-6503
    Preview abstract We present free high quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty two official languages of India. The corpora is primarily intended for use in text-to-speech (TTS) applications, such as constructing multilingual voices or being used for speaker or language adaptation. The data can also be useful for automatic speech recognition (ASR) in various multilingual scenarios. Most of the corpora (apart from Marathi, which is a female-only database) consist of at least 2,000 recorded lines from female and male native speakers of the language. We present the methodological details behind corpora acquisition, which can be scaled to acquiring the data for more languages of interest. We describe the experiments in building a multilingual text-to-speech model that is constructed by combining our corpora. Our results indicate that using these corpora results in good quality voices, with Mean Opinion Scores (MOS) $>$ 3.6, for all the languages tested. We believe that these resources, released with an open-source license, and the described methodology will help developing speech applications for the Indic languages and aid corpora development for other, smaller, languages of India and beyond. View details
    Burmese Speech Corpus, Finite­-State Text Normalization and Pronunciation Grammars with an Application to Text-­to-­Speech
    Yin May Oo
    Chen Fang Li
    Pasindu De Silva
    Supheakmungkol Sarin
    Knot Pipatsrisawat
    Martin Jansche
    Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, pp. 6328-6339
    Preview abstract This paper introduces an open-­source crowd­-sourced multi­-speaker speech corpus along with the comprehensive set of finite-­state transducer (FST) grammars for performing text normalization for the Burmese (Myanmar) language. We also introduce the open­-source finite­-state grammars for performing grapheme­-to­-phoneme (G2P) conversion for Burmese. These three components are necessary (but not sufficient) for building a high­-quality text-­to-­speech (TTS) system for Burmese, a tonal Southeast Asian language from the Sino­-Tibetan family which presents several linguistic challenges. We describe the corpus acquisition process and provide the details of our finite state­based approach to Burmese text normalization and G2P. Our experiments involve building a multi­speaker TTS system based on long short term memory (LSTM) recurrent neural network (RNN) models, which were previously shown to perform well for other languages in a low­-resource setting. Our results indicate that the data and grammars that we are announcing are sufficient to build reasonably high­-quality models comparable to other systems. We hope these resources will facilitate speech and language research on the Burmese language, which is considered by many to be low­resource due to the limited availability of free linguistic data. View details
    Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
    Alena Butryna
    Clara E. Rivera
    Proc. of 1st Joint Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop (SLTU-CCURL 2020), European Language Resources Association (ELRA), 11--12 May, Marseille, France, pp. 21-27
    Preview abstract This paper introduces three new open speech datasets for Basque, Catalan and Galician, which are languages of Spain, where Catalan is furthermore the official language of the Principality of Andorra. The datasets consist of high-quality multi-speaker recordings of the three languages along with the associated transcriptions. The resulting corpora include over 33 hours of crowd-sourced recordings of 132 male and female native speakers. The recording scripts also include material for elicitation of global and local place names, personal and business names. The datasets are released under a permissive license and are available for free download for commercial, academic and personal use. The high-quality annotated speech datasets described in this paper can be used to, among other things, build text-to-speech systems, serve as adaptation data in automatic speech recognition and provide useful phonetic and phonological insights in corpus linguistics. View details
    Developing an Open-Source Corpus of Yoruba Speech
    Clara E. Rivera
    Kólá Túbòsún
    Proc. of Interspeech 2020, International Speech Communication Association (ISCA), October 25--29, Shanghai, China, 2020., pp. 404-408
    Preview abstract This paper introduces an open-source speech dataset for Yoruba - one of the largest low-resource West African languages spoken by at least 22 million people. Yoruba is one of the official languages of Nigeria, Benin and Togo, and is spoken in other neighboring African countries and beyond. The corpus consists of over four hours of 48 kHz recordings from 36 male and female volunteers and the corresponding transcriptions that include disfluency annotation. The transcriptions have full diacritization, which is vital for pronunciation and lexical disambiguation. The annotated speech dataset described in this paper is primarily intended for use in text-to-speech systems, serve as adaptation data in automatic speech recognition and speech-to-speech translation, and provide insights in West African corpus linguistics. We demonstrate the use of this corpus in a simple statistical parametric speech synthesis (SPSS) scenario evaluating it against the related languages from the CMU Wilderness dataset and the Yoruba Lagos-NWU corpus. View details
    Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
    Alena Butryna
    Shan Hui Cathy Chu
    Linne Ha
    Fei He
    Martin Jansche
    Chen Fang Li
    Tatiana Merkulova
    Yin May Oo
    Knot Pipatsrisawat
    Clara E. Rivera
    Supheakmungkol Sarin
    Pasindu De Silva
    Keshan Sodimana
    Richard Sproat
    Jaka Aris Eko Wibawa
    2019 UNESCO International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, 4--6 December, Paris, France, pp. 91-94
    Preview abstract This paper presents an overview of a program designed to address the growing need for developing free speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language community. View details
    A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese
    Keshan Sodimana
    Knot Pipatsrisawat
    Linne Ha
    Martin Jansche
    Pasindu De Silva
    Supheakmungkol Sarin
    Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (2018), pp. 66-70
    Preview abstract The availability of language resources is vital for the development of text-to-speech (TTS) systems. Thus, open source data and tools are highly beneficial for research communities, especially those focusing on low-resourced languages. In this paper, we present data sets for 6 low-resourced languages that we open sourced to the public. The data sets consist of audio files, pronunciation lexicons, and phonology definitions of Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese. These data sets are sufficient for building TTS voices in these languages. We also describe a recipe for building a new TTS voice using our data together with openly available resources and tools. View details