Open-Source High Quality Speech Datasets for Basque, Catalan and Galician

Oddur Kjartansson; Alexander Gutkin; Alena Butryna; Isin Demirsahin; Clara E. Rivera

Proc. of the 1st Joint Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL) Workshop (SLTU-CCURL 2020), European Language Resources Association (ELRA), 11-12 May, Marseille, France, pp. 21-27

Abstract

This paper introduces three new open speech datasets for Basque, Catalan and Galician, three languages of Spain; Catalan is also the official language of the Principality of Andorra. The datasets consist of high-quality multi-speaker recordings in each language, together with the associated transcriptions. The resulting corpora include over 33 hours of crowd-sourced recordings from 132 male and female native speakers. The recording scripts also include material for eliciting global and local place names, as well as personal and business names. The datasets are released under a permissive license and are available for free download for commercial, academic and personal use. The high-quality annotated speech datasets described in this paper can be used, among other things, to build text-to-speech systems, to serve as adaptation data in automatic speech recognition, and to provide useful phonetic and phonological insights in corpus linguistics.

Research Areas

Natural language processing
