Jump to Content

Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach

Kareem Darwish
Ahmed Abdelali
Hamdy Mubarak
Younes Samih
Mohammed Attia
The 3rd Workshop on Open-Source Arabic Corpora and Processing Tools in the Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018)


Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted. Diacritization attempts to recover short vowels and is an essential step for Text-to-Speech (TTS) systems. Though Automatic diacritization of Modern Standard Arabic (MSA) has received significant attention, limited research has been conducted on dialectal Arabic (DA) diacritization. Phonemic patterns of DA vary greatly from MSA and even from one another, which accounts for the noted difficulty with mutual intelligibility between dialects. With the recent advent of spoken dialog systems (or intelligent personal assistants), dialect vowel restoration is crucial to allow systems to speak back to the users in their own language variant. In this paper we present our research and benchmark results on the automatic diacritization of Tunisian and Moroccan using linear Conditional Random Fields.