Google Research

A Neural Architecture for Dialectal Arabic Segmentation

  • Younes Samih
  • Mohammed Attia
  • Mohamed Eldesouki
  • Hamdy Mubarak
  • Ahmed Abdelali
  • Laura Kallmeyer
  • Kareem Darwish
The Third Arabic Natural Language Processing Workshop (WANLP), Valencia, Spain (2017), pp. 46-54


The automated processing of Arabic dialects is challenging due to the lack of spelling standards and the scarcity of annotated data and resources in general. Segmenting words into their constituent tokens is an important step in natural language processing. In this paper, we show how a segmenter can be trained on only 350 annotated tweets using neural networks, without any normalization or reliance on lexical features or linguistic resources. We treat segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that depend heavily on additional resources.
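
To illustrate what casting segmentation as character-level sequence labeling can look like, here is a minimal sketch of the data encoding only, not of the neural model. The B/M/E/S tag set, the `+` delimiter, the transliterated example word, and the function names `encode` and `decode` are all assumptions made for illustration; the abstract does not specify the labeling scheme the authors used.

```python
from typing import List, Tuple

SEP = "+"  # hypothetical segment delimiter, assumed for this sketch


def encode(segmented: str) -> Tuple[str, List[str]]:
    """Convert a segmented word (e.g. 'w+b+ktAb') into its raw character
    string and one tag per character.

    Assumed tag set: B = segment begin, M = segment middle,
    E = segment end, S = single-character segment.
    """
    chars: List[str] = []
    tags: List[str] = []
    for segment in segmented.split(SEP):
        if not segment:
            continue  # skip empty pieces from doubled or trailing delimiters
        if len(segment) == 1:
            seg_tags = ["S"]
        else:
            seg_tags = ["B"] + ["M"] * (len(segment) - 2) + ["E"]
        chars.extend(segment)
        tags.extend(seg_tags)
    return "".join(chars), tags


def decode(word: str, tags: List[str]) -> str:
    """Rebuild the segmented form from characters and predicted tags."""
    if len(word) != len(tags):
        raise ValueError("word and tags must have the same length")
    out: List[str] = []
    for i, (ch, tag) in enumerate(zip(word, tags)):
        # A new segment starts at every B or S tag except at position 0.
        if i > 0 and tag in ("B", "S"):
            out.append(SEP)
        out.append(ch)
    return "".join(out)


word, tags = encode("w+b+ktAb")
print(word, tags)
print(decode(word, tags))
```

Under this framing, a model only needs to predict one tag per character of the unsegmented word, and `decode` turns those predictions back into segments, so no lexicon or morphological analyzer is required at inference time.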
