Jump to Content

Criteria for Useful Automatic Romanization in South Asian Languages

Proceedings of the 13th Language Resources and Evaluation Conference.(LREC), European Language Resources Association (ELRA), 20-25 June, Marseille, France (2022), 6662‑6673

Abstract

This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script. This process is also known as romanization. These criteria are related to either fidelity to human linguistic behavior (pronunciation transparency, naturalness and conventionality) or processing utility for people (ease of input) as well as under-the-hood in systems (invertibility and stability across languages and scripts). When addressing these differing criteria several linguistic considerations, such as modeling of prominent phonological processes and their relation to orthography, need to be taken into account. We discuss these key linguistic details in the context of Brahmic scripts and languages that use them, such as Hindi and Malayalam. We then present the core features of several romanization algorithms, implemented in finite state transducer (FST) formalism, that address differing criteria. Implementation of these algorithms will be released as part of the Nisaba finite-state script processing library.