Oddur Kjartansson
Authored Publications
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
Association for Computing Machinery, New York, NY, United States (2022)
As we move towards large-scale models capable of numerous downstream tasks, the complexity of understanding the multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a dataset's origins, development, intent, ethical considerations and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation, and on consistency and comparability across the documentation of all datasets involved; as such, documentation must be treated as a user-centric product in and of itself. In this paper, we propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of processes and rationales that shape the data and consequently the models, such as upstream sources, data collection and annotation methods, training and evaluation methods, intended use, and decisions affecting model performance. We also present evaluative frameworks that ground Data Cards in real-world utility and human-centricity. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over twenty Data Cards.
Towards Accountability for Machine Learning Datasets
Alex Hanna
Christina Greer
Margaret Mitchell
Proceedings of FAccT 2021 (2021)
Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However, the datasets which empower machine learning are often used, shared and re-used with little visibility into the processes of deliberation which led to their creation. Which stakeholder groups had their perspectives included when the dataset was conceived? Which domain experts were consulted regarding how to model subgroups and other phenomena? How were questions of representational biases measured and addressed? Who labeled the data? In this paper, we introduce a rigorous framework for dataset development transparency which supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields a set of documents that facilitate improved communication and decision-making, as well as drawing attention to the value and necessity of careful data work. The proposed framework is intended to contribute to closing the accountability gap in artificial intelligence systems by making visible the often overlooked work that goes into dataset creation.
Open-source Multi-speaker Corpora of the English Accents in the British Isles
Clara E. Rivera
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, pp. 6532-6541
This paper presents a dataset of transcribed high-quality audio of English sentences recorded by volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as for use in speech technologies. The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal names in different accents, as well as native speaker pronunciations of local words. Overlapping lines for all speakers were included for idiolect elicitation; these share the same or similar lines with other existing resources, such as the CSTR VCTK corpus and the Speech Accent Archive, to allow for easy comparison of personal and regional accents. The resulting corpora include over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.
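The phoneme-coverage goal described in this abstract can be made concrete with a small sketch: given a pronunciation lexicon and a target phoneme inventory, measure what fraction of the inventory a candidate recording script elicits. The lexicon, phoneme symbols and inventory below are hypothetical toy data for illustration, not taken from the released corpus.

```python
def phoneme_coverage(script_words, lexicon, inventory):
    """Return the fraction of a phoneme inventory covered by a recording script.

    script_words: words appearing in the candidate script.
    lexicon: maps each word to its phoneme sequence; unknown words contribute nothing.
    inventory: the full set of phonemes the script should elicit.
    """
    covered = set()
    for word in script_words:
        covered.update(lexicon.get(word, []))
    return len(covered & set(inventory)) / len(inventory)

# Hypothetical toy lexicon and inventory (illustration only).
lexicon = {"bath": ["b", "A:", "T"], "trap": ["t", "r", "ae", "p"]}
inventory = ["b", "A:", "T", "t", "r", "ae", "p", "s"]
print(phoneme_coverage(["bath", "trap"], lexicon, inventory))  # 7 of 8 phonemes -> 0.875
```

In practice a script curator would iterate: add or swap sentences until coverage approaches 1.0 while keeping the script a natural read for volunteers.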
Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems
Fei He
Shan Hui Cathy Chu
Clara E. Rivera
Martin Jansche
Supheakmungkol Sarin
Knot Pipatsrisawat
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, pp. 6494-6503
We present free high-quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty-two official languages of India. The corpora are primarily intended for use in text-to-speech (TTS) applications, such as constructing multilingual voices or serving as data for speaker or language adaptation. The data can also be useful for automatic speech recognition (ASR) in various multilingual scenarios. Most of the corpora (apart from Marathi, which is a female-only database) consist of at least 2,000 recorded lines from female and male native speakers of the language. We present the methodological details behind corpora acquisition, which can be scaled to acquiring data for more languages of interest. We describe the experiments in building a multilingual text-to-speech model constructed by combining our corpora. Our results indicate that using these corpora results in good quality voices, with Mean Opinion Scores (MOS) > 3.6, for all the languages tested. We believe that these resources, released with an open-source license, and the described methodology will help develop speech applications for the Indic languages and aid corpora development for other, smaller, languages of India and beyond.
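The Mean Opinion Score threshold cited above is simply the arithmetic mean of listener ratings, conventionally on a 1 (bad) to 5 (excellent) scale. A minimal sketch, with ratings invented purely for illustration:

```python
def mean_opinion_score(ratings):
    """Average listener ratings on the conventional 1-5 opinion scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    return sum(ratings) / len(ratings)

# Hypothetical ratings for one synthesized utterance (illustration only).
print(mean_opinion_score([4, 4, 3, 5, 4]))  # 4.0
```

A reported corpus-level MOS is then the mean over many rated utterances, usually accompanied by a confidence interval.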
Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech
Yin May Oo
Chen Fang Li
Pasindu De Silva
Supheakmungkol Sarin
Knot Pipatsrisawat
Martin Jansche
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, pp. 6328-6339
This paper introduces an open-source crowd-sourced multi-speaker speech corpus along with a comprehensive set of finite-state transducer (FST) grammars for performing text normalization for the Burmese (Myanmar) language. We also introduce open-source finite-state grammars for performing grapheme-to-phoneme (G2P) conversion for Burmese. These three components are necessary (but not sufficient) for building a high-quality text-to-speech (TTS) system for Burmese, a tonal Southeast Asian language from the Sino-Tibetan family which presents several linguistic challenges. We describe the corpus acquisition process and provide the details of our finite-state-based approach to Burmese text normalization and G2P. Our experiments involve building a multi-speaker TTS system based on long short-term memory (LSTM) recurrent neural network (RNN) models, which were previously shown to perform well for other languages in a low-resource setting. Our results indicate that the data and grammars that we are announcing are sufficient to build reasonably high-quality models comparable to other systems. We hope these resources will facilitate speech and language research on the Burmese language, which is considered by many to be low-resource due to the limited availability of free linguistic data.
Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
Alena Butryna
Clara E. Rivera
Proc. of 1st Joint Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop (SLTU-CCURL 2020), European Language Resources Association (ELRA), 11--12 May, Marseille, France, pp. 21-27
This paper introduces three new open speech datasets for Basque, Catalan and Galician, which are languages of Spain; Catalan is also the official language of the Principality of Andorra. The datasets consist of high-quality multi-speaker recordings of the three languages along with the associated transcriptions. The resulting corpora include over 33 hours of crowd-sourced recordings of 132 male and female native speakers. The recording scripts also include material for elicitation of global and local place names, personal and business names. The datasets are released under a permissive license and are available for free download for commercial, academic and personal use. The high-quality annotated speech datasets described in this paper can be used to, among other things, build text-to-speech systems, serve as adaptation data in automatic speech recognition and provide useful phonetic and phonological insights in corpus linguistics.
Developing an Open-Source Corpus of Yoruba Speech
Clara E. Rivera
Kólá Túbòsún
Proc. of Interspeech 2020, International Speech Communication Association (ISCA), October 25--29, 2020, Shanghai, China, pp. 404-408
This paper introduces an open-source speech dataset for Yoruba, one of the largest low-resource West African languages, spoken by at least 22 million people. Yoruba is one of the official languages of Nigeria, Benin and Togo, and is spoken in other neighboring African countries and beyond. The corpus consists of over four hours of 48 kHz recordings from 36 male and female volunteers and the corresponding transcriptions, which include disfluency annotation. The transcriptions have full diacritization, which is vital for pronunciation and lexical disambiguation. The annotated speech dataset described in this paper is primarily intended for use in text-to-speech systems, as adaptation data in automatic speech recognition and speech-to-speech translation, and as a source of insights for West African corpus linguistics. We demonstrate the use of this corpus in a simple statistical parametric speech synthesis (SPSS) scenario, evaluating it against the related languages from the CMU Wilderness dataset and the Yoruba Lagos-NWU corpus.
Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech
Fei He
Shan Hui Cathy Chu
Supheakmungkol Sarin
Knot Pipatsrisawat
Alena Butryna
Proc. 12th Language Resources and Evaluation Conference (LREC 2020), European Language Resources Association (ELRA), 11--16 May, Marseille, France, pp. 6504-6513
In this paper we present a multidialectal corpus approach for building a text-to-speech voice for a new dialect in a language with existing resources, focusing on various South American dialects of Spanish. We first present public speech datasets for Argentinian, Chilean, Colombian, Peruvian, Puerto Rican and Venezuelan Spanish specifically constructed with text-to-speech applications in mind using crowd-sourcing. We then compare the monodialectal voices built with minimal data to a multidialectal model built by pooling all the resources from all dialects. Our results show that the multidialectal model outperforms the monodialectal baseline models. We also experiment with a "zero-resource" dialect scenario where we build a multidialectal voice for a dialect while holding out target dialect recordings from the training data.
Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
Alena Butryna
Shan Hui Cathy Chu
Linne Ha
Fei He
Martin Jansche
Chen Fang Li
Tatiana Merkulova
Yin May Oo
Knot Pipatsrisawat
Clara E. Rivera
Supheakmungkol Sarin
Pasindu De Silva
Keshan Sodimana
Richard Sproat
Jaka Aris Eko Wibawa
2019 UNESCO International Conference Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide, 4--6 December, Paris, France, pp. 91-94
This paper presents an overview of a program designed to address the growing need for developing free speech resources for under-represented languages. At present we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali
Supheakmungkol Sarin
Knot Pipatsrisawat
Martin Jansche
Linne Ha
Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (2018), pp. 52-55
We present speech corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali. Each corpus consists of an average of approximately 200k recorded utterances that were provided by native-speaker volunteers in the respective region. Recordings were made using portable consumer electronics in reasonably quiet environments. For each recorded utterance the textual prompt and an anonymized hexadecimal identifier of the speaker are available. Biographical information of the speakers is unavailable. In particular, the speakers come from an unspecified mix of genders. The recordings are suitable for research on acoustic modeling for speech recognition, for example. To validate the integrity of the corpora and their suitability for speech recognition research, we provide simple recipes that illustrate how they can be used with the open-source Kaldi speech recognition toolkit. The corpora are being made available under a Creative Commons license in the hope that they will stimulate further research on these languages.
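The per-utterance metadata described above, a textual prompt plus an anonymized hexadecimal speaker identifier, lends itself to simple grouping by speaker (e.g. for train/test splits that keep speakers disjoint). A minimal sketch, assuming a tab-separated index file; the layout, field order, IDs and prompts below are hypothetical, not the corpus's documented format:

```python
import io
from collections import defaultdict

def utterances_by_speaker(index_file):
    """Group (utterance_id, prompt) pairs by anonymized speaker ID.

    Assumes each line has the hypothetical form:
        utterance_id <TAB> speaker_hex_id <TAB> prompt
    """
    by_speaker = defaultdict(list)
    for line in index_file:
        utt_id, speaker, prompt = line.rstrip("\n").split("\t", 2)
        by_speaker[speaker].append((utt_id, prompt))
    return dict(by_speaker)

# Hypothetical three-line index (illustration only).
sample = io.StringIO(
    "utt0001\t3f9a\tfirst prompt\n"
    "utt0002\t3f9a\tsecond prompt\n"
    "utt0003\t7bc2\tthird prompt\n"
)
groups = utterances_by_speaker(sample)
print(len(groups["3f9a"]))  # 2
```

Keeping all of a speaker's utterances on one side of a split avoids the acoustic-model evaluation leaking speaker identity between train and test.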