Massively Multilingual Pronunciation Mining with WikiPron
Abstract
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a free online multilingual dictionary. We first describe the design and use of the library. We then discuss the challenges faced in scaling this tool to create an ever-growing database of pronunciations, currently containing 1.7 million pronunciations from 160 languages, both living and dead, natural and constructed. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. Software, pronunciation data, and models are all made available under permissive open-source licenses.