Google Research

Effective Multi-Dialectal Arabic POS Tagging

  • Kareem Darwish
  • Mohammed Attia
  • Hamdy Mubarak
  • Younes Samih
  • Ahmed Abdelali
  • Lluís Màrquez
  • Mohamed Eldesouki
  • Laura Kallmeyer
Natural Language Engineering (NLE) (2020)

Abstract

This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated dataset of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses Conditional Random Fields (CRF), while the second combines word-based and character-based representations in a deep neural network that stacks convolutional and recurrent layers under a CRF output layer. We successfully exploit a variety of features that help our models generalize, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. These results are obtained with a train/dev/test split of 70/10/20 on a dataset of 350 tweets per dialect.
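To make the data setup concrete, the following is a minimal sketch of a 70/10/20 split over 350 tweets per dialect, with the per-dialect training portions pooled for a joint model. The abstract does not say how the split was drawn, so the seeded random shuffle, the function name `split_70_10_20`, and the placeholder tweet IDs are all assumptions made for illustration.

```python
import random

DIALECTS = ["Egyptian", "Levantine", "Gulf", "Maghrebi"]
TWEETS_PER_DIALECT = 350


def split_70_10_20(items, seed=0):
    """Shuffle items and split them into train/dev/test at 70/10/20.

    The shuffling strategy is an assumption; the abstract only gives the ratios.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 70 // 100
    n_dev = len(items) * 10 // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


# Placeholder tweet IDs stand in for the annotated tweets (hypothetical data).
corpus = {
    dialect: [f"{dialect}-{i}" for i in range(TWEETS_PER_DIALECT)]
    for dialect in DIALECTS
}

# One split per dialect, usable for uni-dialectal taggers.
splits = {dialect: split_70_10_20(tweets) for dialect, tweets in corpus.items()}

# Pool the training portions of all dialects for a joint multi-dialectal model.
joint_train = [tweet for dialect in DIALECTS for tweet in splits[dialect][0]]

for dialect in DIALECTS:
    train, dev, test = splits[dialect]
    print(dialect, len(train), len(dev), len(test))
print("joint train size:", len(joint_train))
```

Under these ratios each dialect contributes 245 training, 35 development, and 70 test tweets, so a joint model would train on 980 tweets in total.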
