- Keshan Sodimana
- Pasindu De Silva
- Richard Sproat
- A Theeraphol
- Chen Fang Li
- Alexander Gutkin
- Supheakmungkol Sarin
- Knot Pipatsrisawat
Abstract
Text normalization is the process of converting non-standard words (NSWs) such as numbers, abbreviations, and time expressions into standard words so that their pronunciations can be derived either through lexicon lookup or by utilizing a program to predict pronunciations from spellings. Text normalization is, thus, an important component of any Text-to-Speech (TTS) system. Without such component, the resulting voice, no matter how good the quality is, may sound unintelligent. Such a component is often built manually by translating language-specific knowledge into rules that can be utilized by TTS pipelines. In this paper, we describe an approach to develop a rule-based text normalization component for many low-resourced languages. We also describe our open source repository containing text normalization grammars for Bangla, Javanese, Khmer, Nepali, Sinhala, Sundanese and present a recipe for utilizing them in a TTS system.
Research Areas
Learn more about how we do research
We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work