Finite-state script normalization and processing utilities: The Nisaba Brahmic library

Lawrence Wolf-Sonkin
The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations, Association for Computational Linguistics, [Online], Kyiv, Ukraine, April, 2021, pp. 14-23


This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts. The library provides a flexible and extensible framework for supporting crucial operations on Brahmic scripts, such as NFC, visual normalization, reversible transliteration, and validity checks, implemented in Python within a finite-state transducer formalism. We survey some common Brahmic script issues that may adversely affect the performance of downstream NLP tasks, and provide the rationale for finite-state design and system implementation details.