Massively Multilingual Pronunciation Mining with WikiPron

Jackson L. Lee
Lucas F. E. Ashby
M. Elizabeth Garza
Yeonju Lee-Sikka
Sean Miller
Alan Wong
2020

Abstract

We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a free online multilingual dictionary. We first describe the design and use of the library. We then discuss the challenges faced in scaling this tool to create an ever-growing database of pronunciations, currently containing 1.7 million pronunciations from 160 languages, both living and dead, natural and constructed. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. Software, pronunciation data, and models are all made available under permissive open-source licenses.