Data-Driven Parametric Text Normalization: Rapidly Scaling Finite-State Transduction Verbalizers to New Languages

Kim Anne Heiligenstein
Nikos Bampounis
Christian Schallhart
Jonas Fromseier Mortensen
Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020), Language Resources and Evaluation Conference (LREC 2020), Marseille, 218–225

Abstract

This paper presents a methodology for rapidly generating FST-based verbalizers for ASR and TTS systems by efficiently sourcing language-specific data. We describe a questionnaire which collects the necessary data to bootstrap the number grammar induction system and parameterize the verbalizer templates described in Ritchie et al. (2019), and a machine-readable data store which allows the data collected through the questionnaire to be supplemented by additional data from other sources. We also discuss the benefits of this system for low-resource languages.