Mohammed Attia

Authored Publications
    Effective Multi Dialectal Arabic POS Tagging
    Kareem Darwish
    Hamdy Mubarak
    Younes Samih
    Ahmed Abdelali
    Lluís Màrquez
    Mohamed Eldesouki
    Laura Kallmeyer
    Natural Language Engineering (NLE) (2020)
    This work introduces robust multi-dialectal part of speech tagging trained on an annotated dataset of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses Conditional Random Fields (CRF), while the second combines word and character-based representations in a Deep Neural Network with stacked layers of convolutional and recurrent networks with a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. Also, we develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per dialect results ranging between 90.2% and 95.4%. We obtained the results using a train/dev/test split of 70/10/20 for a dataset of 350 tweets per dialect.
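The 70/10/20 split mentioned in the abstract above works out to 245/35/70 tweets per dialect. The following is a minimal sketch of such a split, not the authors' procedure: the function name `split_dataset`, the fixed seed, and the placeholder tweet strings are all assumptions made for illustration.

```python
import random


def split_dataset(items, train_pct=70, dev_pct=10, seed=0):
    """Shuffle items and split them into train/dev/test lists.

    Percentages are integers so the sizes are computed without
    floating-point rounding; whatever is left over goes to test.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_dev = len(shuffled) * dev_pct // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Hypothetical stand-in for the 350 annotated tweets of one dialect.
tweets = [f"tweet_{i}" for i in range(350)]
train_set, dev_set, test_set = split_dataset(tweets)
print(len(train_set), len(dev_set), len(test_set))  # 245 35 70
```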
    POS Tagging for Improving Code-Switching Identification in Arabic
    Ahmed Abdelali
    Ali Elkahky
    Hamdy Mubarak
    Kareem Darwish
    Younes Samih
    Workshop on Arabic Natural Language Processing -- ACL 2019, Florence, Italy (2019)
    When speakers code-switch between their native language and a second language or language variant, they follow a syntactic pattern where words and phrases from the embedded language are inserted into the matrix language. This paper explores the possibility of utilizing this pattern in improving code-switching identification between Modern Standard Arabic (MSA) and Egyptian Arabic (EA). We try to answer the question of how strong the POS signal is in word-level code-switching identification. We build a deep learning model enriched with linguistic features (including POS tags) that outperforms the state-of-the-art results by 1.9% on the development set and 1.0% on the test set. We also show that in intra-sentential code-switching, the selection of lexical items is constrained by POS categories, where function words tend to come more often from the dialectal language while the majority of content words come from the standard language.
    Segmentation for Domain Adaptation in Arabic
    Ali Elkahky
    Workshop on Arabic Natural Language Processing -- ACL 2019, Florence, Italy (2019)
    Segmentation serves as an integral part in many NLP applications including Machine Translation, Parsing, and Information Retrieval. When a model trained on the standard language is applied to dialects, the accuracy drops dramatically. However, there are more lexical items shared by the standard language and dialects than can be found by mere surface word matching. This shared lexicon is obscured by a lot of cliticization, gemination, and character repetition. In this paper, we prove that segmentation and base normalization of dialects can help in domain adaptation by reducing data sparseness. Segmentation improves a system's performance by reducing the number of OOVs, helps isolate the differences, and allows better utilization of the commonalities. We show that adding a small amount of dialectal segmentation training data reduces OOVs by 5% and remarkably improves POS tagging for dialects by 7.37% f-score, even though no dialect-specific POS training data is included.
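To illustrate why segmentation can lower the out-of-vocabulary (OOV) rate, here is a minimal sketch that compares the OOV rate of unsegmented tokens with that of the same tokens split into clitics and stems. Everything in it is an assumption for illustration: the romanized toy tokens, the hand-written segmentation, and the function name `oov_rate` are hypothetical and are not taken from the paper.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


# Hypothetical vocabulary of segments seen in training data.
train_vocab = {"w", "Al", "ktAb", "byt", "fy"}

# The same toy sentence, first as surface words, then hand-segmented.
surface_tokens = ["wAlktAb", "fy", "Albyt"]
segmented_tokens = ["w", "Al", "ktAb", "fy", "Al", "byt"]

print(oov_rate(surface_tokens, train_vocab))
print(oov_rate(segmented_tokens, train_vocab))
```

In this toy case two of the three surface words are unknown, while every segment is covered by the vocabulary, which is the sparseness effect the abstract describes.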
    QC-GO Submission for MADAR Shared Task: Arabic Fine-Grained Dialect Identification
    Ahmed Abdelali
    Hamdy Mubarak
    Kareem Darwish
    Mohamed Eldesouki
    Younes Samih
    MADAR Shared Task on Dialect Identification -- ACL 2019 (2019)
    This paper describes the QC-GO team submission to the MADAR Shared Task Subtask 1 (travel domain dialect identification) and Subtask 2 (Twitter user location identification). In our participation in both subtasks, we explored a number of approaches and system combinations to obtain the best performance for both tasks. These include deep neural nets and heuristics. Since individual approaches suffer from various shortcomings, the combination of different approaches was able to fill some of these gaps. Our system achieves F1-Scores of 66.1% and 67.0% on the development sets for Subtasks 1 and 2 respectively.
    GHHT at CALCS 2018: Named Entity Recognition for Dialectal Arabic Using Neural Networks
    Younes Samih
    Wolfgang Maier
    Third Workshop on Computational Approaches to Linguistic Code-switching in ACL 2018 (2018)
    This paper describes our system submission to the CALCS 2018 shared task on named entity recognition on code-switched data for the language variant pair of Modern Standard Arabic and Egyptian dialectal Arabic. We build a Deep Neural Network that combines word and character-based representations in convolutional and recurrent networks with a CRF layer. The model is augmented with stacked layers of enriched information such as pre-trained embeddings, Brown clusters and named entity gazetteers. Our system is ranked second among those participating in the shared task, achieving an FB1 average of 70.09%.
    GHH at SemEval-2018 Task 10: Discovering Discriminative Attributes in Distributional Semantics
    Younes Samih
    Wolfgang Maier
    SemEval 2018 Task 10 on Capturing Discriminative Attributes (2018)
    This paper describes our system submission to the SemEval 2018 Task 10 on Capturing Discriminative Attributes. Given two concepts and an attribute, the task is to determine whether the attribute is semantically related to one concept and not the other. In this work we assume that discriminative attributes can be detected by discovering the association (or lack of association) between a pair of words. The hypothesis we test in this contribution is whether the semantic difference between two pairs of concepts can be treated in terms of measuring the distance between words in a vector space, or can simply be obtained as a by-product of word co-occurrence counts.
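The vector-space side of the hypothesis above can be pictured with a small sketch: an attribute counts as discriminative when it is noticeably closer to the first concept than to the second. This is not the submitted system. The toy three-dimensional vectors, the `margin` threshold, and the function names are assumptions chosen only to make the idea concrete.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_discriminative(vectors, concept1, concept2, attribute, margin=0.1):
    """True if the attribute is closer to concept1 than to concept2 by more than margin."""
    sim1 = cosine(vectors[concept1], vectors[attribute])
    sim2 = cosine(vectors[concept2], vectors[attribute])
    return sim1 - sim2 > margin


# Hypothetical toy embeddings; real systems would use trained vectors.
toy_vectors = {
    "banana": [0.9, 0.1, 0.0],
    "apple": [0.1, 0.9, 0.0],
    "yellow": [0.8, 0.2, 0.1],
    "fruit": [0.5, 0.5, 0.1],
}

print(is_discriminative(toy_vectors, "banana", "apple", "yellow"))  # True
print(is_discriminative(toy_vectors, "banana", "apple", "fruit"))   # False
```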
    Multi-Dialect Arabic POS Tagging: A CRF Approach
    Kareem Darwish
    Hamdy Mubarak
    Ahmed Abdelali
    Mohamed Eldesouki
    Younes Samih
    Randah Alharbi
    Walid Magdy
    Laura Kallmeyer
    Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 93-98
    This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects along with tagging guidelines. The data, which we are releasing publicly, includes tweets in Egyptian, Levantine, Gulf, and Maghrebi, with 350 tweets for each dialect with appropriate train/test/development splits for 5-fold cross validation. We use a Conditional Random Fields (CRF) sequence labeler to train POS taggers for each dialect and examine the effect of cross and joint dialect training, and give benchmark results for the datasets. Using clitic n-grams, clitic metatypes, and stem templates as features, we were able to train a joint model that can correctly tag four different dialects with an average accuracy of 89.3%.
    Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach
    Kareem Darwish
    Ahmed Abdelali
    Hamdy Mubarak
    Younes Samih
    The 3rd Workshop on Open-Source Arabic Corpora and Processing Tools in the Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018)
    Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted. Diacritization attempts to recover short vowels and is an essential step for Text-to-Speech (TTS) systems. Though automatic diacritization of Modern Standard Arabic (MSA) has received significant attention, limited research has been conducted on dialectal Arabic (DA) diacritization. Phonemic patterns of DA vary greatly from MSA and even from one another, which accounts for the noted difficulty with mutual intelligibility between dialects. With the recent advent of spoken dialog systems (or intelligent personal assistants), dialect vowel restoration is crucial to allow systems to speak back to the users in their own language variant. In this paper we present our research and benchmark results on the automatic diacritization of Tunisian and Moroccan using linear Conditional Random Fields.
    The Morpho-syntactic Annotation of Animacy for a Dependency Parser
    Ali Elkahky
    Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 2607-2615
    In this paper we present the annotation scheme and parser results of the animacy feature in Russian and Arabic, two morphologically rich languages, in the spirit of the universal dependency framework (McDonald et al., 2013; de Marneffe et al., 2014). We explain the animacy hierarchies in both languages and make the case for the existence of five animacy types. We train a morphological analyzer on the annotated data and the results show a prediction f-measure for animacy of 95.39% for Russian and 92.71% for Arabic. We also use animacy along with other morphological tags as features to train a dependency parser, and the results show a slight improvement gained from animacy. We compare the impact of animacy on improving the dependency parser to other features found in nouns, namely, ‘gender’, ‘number’, and ‘case’. To our knowledge this is the first contrastive study of the impact of morphological features on the accuracy of a transition parser. A portion of our data (1,000 sentences for Arabic and Russian each, along with other languages) annotated according to the scheme described in this paper is made publicly available (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1983) as part of the CoNLL 2017 Shared Task on Multilingual Parsing (Zeman et al., 2017).
    Multilingual Multi-class Sentiment Classification Using Convolutional Neural Networks
    Younes Samih
    Ali Elkahky
    Laura Kallmeyer
    Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 635-640
    This paper describes a language-independent model for multi-class sentiment analysis using a simple neural network architecture of five layers (Embedding, Conv1D, GlobalMaxPooling and two Fully-Connected). The advantage of the proposed model is that it does not rely on language-specific features such as ontologies, dictionaries, or morphological or syntactic pre-processing. Equally important, our system does not use pre-trained word2vec embeddings which can be costly to obtain and train for some languages. In this research, we also demonstrate that oversampling can be an effective approach for correcting class imbalance in the data. We evaluate our methods on three publicly available datasets for English, German and Arabic, and the results show that our system’s performance is comparable to, or even better than, the state of the art for these datasets. We make our source-code publicly available.
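The abstract above mentions oversampling as a remedy for class imbalance without saying which variant was used. As one possible reading, the sketch below shows plain random oversampling, where minority-class examples are duplicated until every class matches the largest one. The function name `oversample`, the fixed seed, and the toy sentences and labels are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical imbalanced toy data: 4 positive, 2 neutral, 1 negative.
texts = ["p1", "p2", "p3", "p4", "n1", "n2", "x1"]
tags = ["positive"] * 4 + ["neutral"] * 2 + ["negative"]

balanced_texts, balanced_tags = oversample(texts, tags)
print(Counter(balanced_tags))
```

Oversampling is normally applied to the training portion only, so that duplicated examples do not leak into the evaluation data.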