Criteria for Useful Automatic Romanization in South Asian Languages

Isin Demirsahin; Cibu Johny; Alexander Gutkin; Brian Edward Roark

Proceedings of the 13th Language Resources and Evaluation Conference (LREC), European Language Resources Association (ELRA), 20-25 June, Marseille, France (2022), 6662‑6673

Abstract

This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script, a process also known as romanization. These criteria relate either to fidelity to human linguistic behavior (pronunciation transparency, naturalness, and conventionality) or to processing utility, both for people (ease of input) and under the hood in systems (invertibility and stability across languages and scripts). Addressing these differing criteria requires taking several linguistic considerations into account, such as the modeling of prominent phonological processes and their relation to orthography. We discuss these key linguistic details in the context of Brahmic scripts and the languages that use them, such as Hindi and Malayalam. We then present the core features of several romanization algorithms, implemented in the finite-state transducer (FST) formalism, that address differing criteria. Implementations of these algorithms will be released as part of the Nisaba finite-state script processing library.
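
The tension between invertibility and conventionality can be made concrete with a toy sketch. The two mapping tables below are hypothetical illustrations, not taken from the paper or from the Nisaba library, and they cover only two Devanagari consonant letters. The first keeps the dental and retroflex letters distinct with a diacritic, in the style of ISO 15919; the second collapses both to a plain "t", as informal romanization often does.

```python
# Toy illustration only: these tables and helper names are hypothetical and
# are not the paper's algorithms or the Nisaba API.

TA = "\u0924"   # DEVANAGARI LETTER TA (dental)
TTA = "\u091f"  # DEVANAGARI LETTER TTA (retroflex)

# Distinct Latin outputs; "\u1e6d" is "t with dot below", as in ISO 15919.
REVERSIBLE = {TA: "t", TTA: "\u1e6d"}
# Both letters collapse to the same plain "t".
CONVENTIONAL = {TA: "t", TTA: "t"}


def romanize(text, table):
    """Replace each character found in the table; pass others through."""
    return "".join(table.get(ch, ch) for ch in text)


def is_invertible(table):
    """True when no two source symbols share an output.

    This check is sufficient here only because every output is a single
    code point; multi-character outputs would need a stronger test.
    """
    outputs = list(table.values())
    return len(outputs) == len(set(outputs))


def invert(table):
    """Build the Latin-to-native table, refusing lossy mappings."""
    if not is_invertible(table):
        raise ValueError("mapping is lossy and cannot be inverted")
    return {latin: native for native, latin in table.items()}


if __name__ == "__main__":
    word = TA + TTA
    latin = romanize(word, REVERSIBLE)
    # Round trip succeeds for the reversible table.
    print(romanize(latin, invert(REVERSIBLE)) == word)
    # The conventional table loses the dental/retroflex contrast.
    print(is_invertible(CONVENTIONAL))
```

A real system has to handle far more than a character table, for example inherent vowels and the other phonological processes the abstract mentions, which is one reason a finite-state formalism is a natural fit.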

Research Areas

Natural language processing
