Effective Multi Dialectal Arabic POS Tagging

Kareem Darwish; Mohammed Attia; Hamdy Mubarak; Younes Samih; Ahmed Abdelali; Lluís Màrquez; Mohamed Eldesouki; Laura Kallmeyer

Natural Language Engineering (NLE) (2020)

Abstract

This work introduces robust multi-dialectal part-of-speech taggers trained on an annotated dataset of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two sequence tagging approaches. The first uses Conditional Random Fields (CRF), while the second combines word- and character-based representations in a deep neural network with stacked convolutional and recurrent layers and a CRF output layer. We exploit a variety of features that help our models generalize, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. We obtain these results using a train/dev/test split of 70/10/20 on a dataset of 350 tweets per dialect.
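
To make the data sizes concrete, the following is a minimal sketch of a 70/10/20 split applied to 350 tweets per dialect. It is illustrative only: the abstract does not say whether the split was shuffled, stratified, or made per dialect, so the function name `split_dialect`, the fixed seed, and the placeholder tweet IDs are all assumptions rather than the authors' procedure.

```python
import random

def split_dialect(tweets, seed=0):
    """Shuffle one dialect's tweets and split them 70/10/20 (hypothetical helper)."""
    shuffled = list(tweets)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_dev = n * 10 // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder goes to test
    return train, dev, test

dialects = ["Egyptian", "Levantine", "Gulf", "Maghrebi"]

# Placeholder IDs standing in for annotated tweets; not real data.
corpus = {d: [f"{d}-{i}" for i in range(350)] for d in dialects}

splits = {d: split_dialect(tweets) for d, tweets in corpus.items()}
sizes = {d: tuple(len(part) for part in parts) for d, parts in splits.items()}

# A joint multi-dialectal training set pools the per-dialect training portions.
joint_train = [t for d in dialects for t in splits[d][0]]
```

Under these assumptions each dialect contributes 245 training, 35 development, and 70 test tweets, and a joint model would train on 980 tweets in total.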

Research Areas

Natural language processing
