Developing an Open-Source Corpus of Yoruba Speech

Alexander Gutkin; Isin Demirsahin; Oddur Kjartansson; Clara E. Rivera; Kólá Túbòsún

Developing an Open-Source Corpus of Yoruba Speech

Alexander Gutkin

Isin Demirsahin

Oddur Kjartansson

Clara E. Rivera

Kólá Túbòsún

Proc. of Interspeech 2020, International Speech Communication Association (ISCA), October 25--29, Shanghai, China, 2020., pp. 404-408

Download Google Scholar

Abstract

This paper introduces an open-source speech dataset for Yoruba - one of the largest low-resource West African languages spoken by at least 22 million people. Yoruba is one of the official languages of Nigeria, Benin and Togo, and is spoken in other neighboring African countries and beyond. The corpus consists of over four hours of 48 kHz recordings from 36 male and female volunteers and the corresponding transcriptions that include disfluency annotation. The transcriptions have full diacritization, which is vital for pronunciation and lexical disambiguation. The annotated speech dataset described in this paper is primarily intended for use in text-to-speech systems, serve as adaptation data in automatic speech recognition and speech-to-speech translation, and provide insights in West African corpus linguistics. We demonstrate the use of this corpus in a simple statistical parametric speech synthesis (SPSS) scenario evaluating it against the related languages from the CMU Wilderness dataset and the Yoruba Lagos-NWU corpus.

Research Areas

Natural language processing

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Developing an Open-Source Corpus of Yoruba Speech

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs