Graphemic Normalization of the Perso-Arabic Script

Raiomond Doctor; Alexander Gutkin; Cibu Johny; Brian Roark; Richard Sproat

Graphemic Normalization of the Perso-Arabic Script

Raiomond Doctor

Alexander Gutkin

Cibu Johny

Brian Roark

Richard Sproat

Proceedings of Grapholinguistics in the 21st Century, 2022 (G21C, Grafematik), Fluxus Editions, Brest, France, pp. 315-376

Download Google Scholar

Abstract

Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions (Unicode Consortium, 2021). This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages, such as Arabic and Persian, building on earlier work by the expert community (ICANN, 2011, 2015). We particularly focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues such as the use of visually ambiguous yet canonically nonequivalent letters and the mixing of letters from different orthographies. Among the contributing conflating factors are the lack of input methods, the instability of modern orthographies (e.g., Aazim et al., 2009; Iyengar, 2018), insufficient literacy, and loss or lack of orthographic tradition (Jahani and Korn, 2013; Liljegren, 2018). We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks. Our results indicate statistically significant improvements in performance in most conditions for all the languages considered when normalization is applied. We argue that better understanding and representation of Perso-Arabic script variation within regional orthographic traditions, where those are present, is crucial for further progress of modern computational NLP techniques (Ponti et al., 2019; Conneau et al., 2020; Muller et al., 2021) especially for languages with a paucity of resources.

Research Areas

Natural language processing

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Graphemic Normalization of the Perso-Arabic Script

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs