Mohammed Attia
Authored Publications
Effective Multi-Dialectal Arabic POS Tagging
Kareem Darwish
Hamdy Mubarak
Younes Samih
Ahmed Abdelali
Lluís Màrquez
Mohamed Eldesouki
Laura Kallmeyer
Natural Language Engineering (NLE) (2020)
This work introduces robust multi-dialectal part-of-speech (POS) tagging trained on an annotated dataset of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses Conditional Random Fields (CRF), while the second combines word- and character-based representations in a deep neural network with stacked convolutional and recurrent layers and a CRF output layer. We successfully exploit a variety of features that help our models generalize, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. The results were obtained using a 70/10/20 train/dev/test split of a dataset of 350 tweets per dialect.
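As a rough illustration of the kind of hand-crafted features such CRF taggers consume, the sketch below builds a per-token feature dictionary with affix, context, and Brown-cluster features; the cluster table and transliterated tokens are hypothetical placeholders, not the paper's actual resources.

```python
# Sketch of a per-token feature extractor of the kind fed to a CRF tagger.
# The Brown-cluster lookup below is an illustrative placeholder.
BROWN_CLUSTERS = {"ktb": "0110", "yktb": "0111"}  # hypothetical cluster IDs

def token_features(tokens, i):
    """Features for tokens[i]: surface form, affix n-grams, cluster, context."""
    w = tokens[i]
    return {
        "word": w,
        "prefix1": w[:1],
        "suffix2": w[-2:],
        "brown": BROWN_CLUSTERS.get(w, "UNK"),
        "prev": tokens[i - 1] if i > 0 else "<s>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "</s>",
    }

feats = token_features(["w", "yktb", "ha"], 1)
print(feats["brown"])  # cluster ID for "yktb"
```

A real tagger would add many more templates (clitic n-grams, stem templates) and feed one such dictionary per token to the CRF.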
Segmentation for Domain Adaptation in Arabic
Ali Elkahky
Workshop on Arabic Natural Language Processing -- ACL 2019, Florence, Italy (2019)
Segmentation is an integral part of many NLP applications, including machine translation, parsing, and information retrieval. When a model trained on the standard language is applied to dialects, accuracy drops dramatically. However, the standard language and dialects share more lexical items than mere surface word matching can find; this shared lexicon is obscured by extensive cliticization, gemination, and character repetition. In this paper, we show that segmentation and base normalization of dialects can aid domain adaptation by reducing data sparseness. Segmentation improves system performance by reducing the number of out-of-vocabulary words (OOVs), helping isolate the differences, and allowing better utilization of the commonalities. We show that adding a small amount of dialectal segmentation training data reduces OOVs by 5% and markedly improves POS tagging for dialects by 7.37% F-score, even though no dialect-specific POS training data is included.
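A toy sketch of the OOV-reduction effect described above, assuming a hypothetical miniature vocabulary and a deliberately simplistic clitic-stripping rule (not the paper's segmenter): an unseen dialect word can decompose into segments that are all known.

```python
# Toy illustration of how clitic segmentation reduces the OOV rate.
train_vocab = {"w", "ktb", "ha", "yktb"}  # hypothetical training vocabulary

def segment(word):
    """Greedily strip one known proclitic and one known enclitic (toy rule)."""
    segs = [word]
    if word.startswith("w") and len(word) > 1:
        segs = ["w", word[1:]]
    if segs[-1].endswith("ha") and len(segs[-1]) > 2:
        last = segs.pop()
        segs += [last[:-2], "ha"]
    return segs

def oov_rate(tokens, vocab):
    return sum(t not in vocab for t in tokens) / len(tokens)

test_tokens = ["wyktbha", "ktb"]
before = oov_rate(test_tokens, train_vocab)
after = oov_rate([s for t in test_tokens for s in segment(t)], train_vocab)
print(before, after)  # the OOV rate drops after segmentation
```

Here "wyktbha" is itself OOV, but its segments "w", "yktb", "ha" are all in the training vocabulary, so segmentation eliminates the OOV.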
POS Tagging for Improving Code-Switching Identification in Arabic
Ahmed Abdelali
Ali Elkahky
Hamdy Mubarak
Kareem Darwish
Younes Samih
Workshop on Arabic Natural Language Processing -- ACL 2019, Florence, Italy (2019)
When speakers code-switch between their native language and a second language or language variant, they follow a syntactic pattern where words and phrases from the embedded language are inserted into the matrix language. This paper explores the possibility of utilizing this pattern to improve code-switching identification between Modern Standard Arabic (MSA) and Egyptian Arabic (EA). We try to answer the question of how strong the POS signal is in word-level code-switching identification. We build a deep learning model enriched with linguistic features (including POS tags) that outperforms state-of-the-art results by 1.9% on the development set and 1.0% on the test set. We also show that in intra-sentential code-switching, the selection of lexical items is constrained by POS categories: function words tend to come more often from the dialectal language, while the majority of content words come from the standard language.
QC-GO Submission for MADAR Shared Task: Arabic Fine-Grained Dialect Identification
Ahmed Abdelali
Hamdy Mubarak
Kareem Darwish
Mohamed Eldesouki
Younes Samih
MADAR Shared Task on Dialect Identification -- ACL 2019 (2019)
This paper describes the QC-GO team submission to the MADAR Shared Task: Subtask 1 (travel-domain dialect identification) and Subtask 2 (Twitter user location identification). For both subtasks, we explored a number of approaches and system combinations, including deep neural networks and heuristics, to obtain the best performance. Since individual approaches suffer from various shortcomings, combining different approaches was able to fill some of these gaps. Our system achieves F1 scores of 66.1% and 67.0% on the development sets for Subtasks 1 and 2, respectively.
Multi-Dialect Arabic POS Tagging: A CRF Approach
Kareem Darwish
Hamdy Mubarak
Ahmed Abdelali
Mohamed Eldesouki
Younes Samih
Randah Alharbi
Walid Magdy
Laura Kallmeyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 93-98
This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects, along with tagging guidelines. The data, which we are releasing publicly, includes 350 tweets for each of Egyptian, Levantine, Gulf, and Maghrebi, with appropriate train/test/development splits for 5-fold cross-validation. We use a Conditional Random Fields (CRF) sequence labeler to train POS taggers for each dialect, examine the effect of cross-dialect and joint training, and give benchmark results for the datasets. Using clitic n-grams, clitic metatypes, and stem templates as features, we were able to train a joint model that correctly tags four different dialects with an average accuracy of 89.3%.
Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach
Kareem Darwish
Ahmed Abdelali
Hamdy Mubarak
Younes Samih
The 3rd Workshop on Open-Source Arabic Corpora and Processing Tools in the Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018)
Arabic is written as a sequence of consonants and long vowels, with short vowels normally omitted. Diacritization attempts to recover short vowels and is an essential step for Text-to-Speech (TTS) systems. Though automatic diacritization of Modern Standard Arabic (MSA) has received significant attention, limited research has been conducted on dialectal Arabic (DA) diacritization. Phonemic patterns of DA vary greatly from those of MSA, and even from one another, which accounts for the noted difficulty of mutual intelligibility between dialects. With the recent advent of spoken dialog systems (or intelligent personal assistants), dialect vowel restoration is crucial to allow systems to speak back to users in their own language variant. In this paper we present our research and benchmark results on the automatic diacritization of Tunisian and Moroccan Arabic using linear-chain Conditional Random Fields.
GHH at SemEval-2018 Task 10: Discovering Discriminative Attributes in Distributional Semantics
Younes Samih
Wolfgang Maier
SemEval 2018 Task 10 on Capturing Discriminative Attributes (2018)
This paper describes our system submission to the SemEval 2018 Task 10 on Capturing Discriminative Attributes. Given two concepts and an attribute, the task is to determine whether the attribute is semantically related to one concept and not the other. In this work we assume that discriminative attributes can be detected by discovering the association (or lack of association) between a pair of words. The hypothesis we test in this contribution is whether the semantic difference between two pairs of concepts can be treated in terms of measuring the distance between words in a vector space, or can simply be obtained as a by-product of word co-occurrence counts.
The Morpho-syntactic Annotation of Animacy for a Dependency Parser
Ali Elkahky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 2607-2615
In this paper we present the annotation scheme and parser results for the animacy feature in Russian and Arabic, two morphologically rich languages, in the spirit of the Universal Dependencies framework (McDonald et al., 2013; de Marneffe et al., 2014). We explain the animacy hierarchies in both languages and make the case for the existence of five animacy types. We train a morphological analyzer on the annotated data, and the results show a prediction F-measure for animacy of 95.39% for Russian and 92.71% for Arabic. We also use animacy, along with other morphological tags, as features to train a dependency parser; the results show a slight improvement gained from animacy. We compare the impact of animacy on improving the dependency parser to that of other nominal features, namely ‘gender’, ‘number’, and ‘case’. To our knowledge this is the first contrastive study of the impact of morphological features on the accuracy of a transition parser. A portion of our data (1,000 sentences each for Arabic and Russian, along with other languages), annotated according to the scheme described in this paper, is made publicly available (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1983) as part of the CoNLL 2017 Shared Task on Multilingual Parsing (Zeman et al., 2017).
GHHT at CALCS 2018: Named Entity Recognition for Dialectal Arabic Using Neural Networks
Younes Samih
Wolfgang Maier
Third Workshop on Computational Approaches to Linguistic Code-switching in ACL 2018 (2018)
This paper describes our system submission to the CALCS 2018 shared task on named entity recognition on code-switched data for the language variant pair of Modern Standard Arabic and Egyptian dialectal Arabic. We build a deep neural network that combines word- and character-based representations in convolutional and recurrent networks with a CRF layer. The model is augmented with stacked layers of enriched information such as pre-trained embeddings, Brown clusters, and named entity gazetteers. Our system is ranked second among those participating in the shared task, achieving an average FB1 score of 70.09%.
Multilingual Multi-class Sentiment Classification Using Convolutional Neural Networks
Younes Samih
Ali Elkahky
Laura Kallmeyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan (2018), pp. 635-640
This paper describes a language-independent model for multi-class sentiment analysis using a simple neural network architecture of five layers (Embedding, Conv1D, GlobalMaxPooling, and two Fully-Connected). The advantage of the proposed model is that it does not rely on language-specific features such as ontologies, dictionaries, or morphological or syntactic pre-processing. Equally important, our system does not use pre-trained word2vec embeddings, which can be costly to obtain and train for some languages. In this research, we also demonstrate that oversampling can be an effective approach for correcting class imbalance in the data. We evaluate our methods on three publicly available datasets for English, German and Arabic, and the results show that our system’s performance is comparable to, or even better than, the state of the art for these datasets. We make our source code publicly available.
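Random oversampling is one common way to realize the class-imbalance correction mentioned above; the few lines below sketch it with made-up toy data, and the paper's exact resampling scheme may differ.

```python
import random
from collections import Counter

# Minimal random-oversampling sketch: duplicate minority-class examples
# until every class matches the majority-class count.
def oversample(examples):
    """examples: list of (text, label) pairs. Returns a class-balanced copy."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # sample with replacement to fill the gap up to the majority size
        balanced.extend(random.choices(items, k=target - len(items)))
    return balanced

data = [("a", "pos"), ("b", "pos"), ("c", "pos"), ("d", "neg")]
balanced = oversample(data)
print(Counter(label for _, label in balanced))  # both classes now have 3 examples
```

Duplicating examples (rather than undersampling the majority class) keeps all training data, at the cost of repeated minority examples per epoch.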
The aim of this document is to provide a list of dependency tags to be used for the Arabic dependency annotation task, with examples for each tag. The dependency representation is a simple description of the grammatical relationships in a sentence. It represents all sentence relations uniformly as typed dependency relations, each a binary relation between a governor (also known as the head) and a dependent (any complement of or modifier to the head).
A Neural Architecture for Dialectal Arabic Segmentation
Younes Samih
Mohamed Eldesouki
Hamdy Mubarak
Ahmed Abdelali
Laura Kallmeyer
Kareem Darwish
The Third Arabic Natural Language Processing Workshop (WANLP), Valencia, Spain (2017), pp. 46-54
The automated processing of Arabic dialects is challenging due to the lack of spelling standards and the scarcity of annotated data and resources in general. Segmentation of words into their constituent tokens is an important processing step for natural language processing. In this paper, we show how a segmenter can be trained on only 350 annotated tweets using neural networks without any normalization or reliance on lexical features or linguistic resources. We deal with segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that heavily depend on additional resources.
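The character-level sequence-labeling framing can be sketched as a simple B/I encoding of segment boundaries, which a tagger then predicts per character; the transliterated segments below are illustrative, not taken from the paper's data.

```python
# Casting segmentation as character-level sequence labeling: each character
# gets a B (begins a segment) or I (inside a segment) tag, so predicting
# B/I per character recovers the segmentation.
def to_char_tags(segments):
    """["w", "yktb", "ha"] -> parallel lists of characters and B/I tags."""
    chars, tags = [], []
    for seg in segments:
        for j, ch in enumerate(seg):
            chars.append(ch)
            tags.append("B" if j == 0 else "I")
    return chars, tags

def from_char_tags(chars, tags):
    """Invert the encoding: rebuild segments from predicted B/I tags."""
    segments = []
    for ch, tag in zip(chars, tags):
        if tag == "B":
            segments.append(ch)
        else:
            segments[-1] += ch
    return segments

chars, tags = to_char_tags(["w", "yktb", "ha"])
print("".join(tags))             # BBIIIBI
print(from_char_tags(chars, tags))
```

Because the encoding is lossless, a model that tags characters correctly yields the exact segmentation, with no lexicon or normalization required.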
Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM
Mohamed Eldesouki
Younes Samih
Ahmed Abdelali
Hamdy Mubarak
Kareem Darwish
Laura Kallmeyer
arXiv preprint (2017)
Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes, and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples per dialect. The two approaches pose the problem either as a ranking problem, where an SVM ranker picks the best segmentation, or as a sequence labeling problem, where a bi-LSTM RNN coupled with a CRF determines where best to segment words. We achieve solid segmentation results for all dialects using rather limited training data. We also show that employing Modern Standard Arabic data for domain adaptation and assuming context independence improve overall results.
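The ranking formulation can be illustrated by enumerating candidate splits against small, hypothetical clitic and stem inventories and letting a scorer, here a trivial stand-in for the SVM ranker, pick among them.

```python
# Toy sketch of segmentation as ranking: generate candidate splits of a
# word, then score them. Inventories and tokens are hypothetical.
PREFIXES = {"", "w", "wy"}
SUFFIXES = {"", "ha", "h"}
STEMS = {"yktb", "ktb"}

def candidates(word):
    """All (prefix, stem, suffix) splits consistent with the inventories."""
    out = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            p, s, x = word[:i], word[i:j], word[j:]
            if p in PREFIXES and s in STEMS and x in SUFFIXES:
                out.append([seg for seg in (p, s, x) if seg])
    return out

def score(cand):
    # Stand-in scorer; a real ranker would weight learned features.
    return len(cand)

best = max(candidates("wyktbha"), key=score)
print(best)
```

A real system would generate far richer candidate sets and replace `score` with the trained SVM ranker's decision function.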
Learning from Relatives: Unified Dialectal Arabic Segmentation
Younes Samih
Mohamed Eldesouki
Ahmed Abdelali
Hamdy Mubarak
Kareem Darwish
Laura Kallmeyer
CoNLL 2017, Vancouver, Canada
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman
Martin Popel
Milan Straka
Jan Hajic
Joakim Nivre
Filip Ginter
Juhani Luotolahti
Sampo Pyysalo
Martin Potthast
Francis Tyers
Elena Badmaeva
Memduh Gokirmak
Anna Nedoluzhko
Silvie Cinkova
Jan Hajic jr.
Jaroslava Hlavacova
Václava Kettnerová
Zdenka Uresova
Jenna Kanerva
Stina Ojala
Anna Missilä
Christopher D. Manning
Sebastian Schuster
Siva Reddy
Dima Taji
Nizar Habash
Herman Leung
Marie-Catherine de Marneffe
Manuela Sanguinetti
Maria Simi
Hiroshi Kanayama
Valeria de Paiva
Kira Droganova
Héctor Martínez Alonso
Çagrı Çöltekin
Umut Sulubacak
Hans Uszkoreit
Vivien Macketanz
Aljoscha Burchardt
Kim Harris
Katrin Marheinecke
Georg Rehm
Tolga Kayadelen
Ali Elkahky
Zhuoran Yu
Emily Pitler
Saran Lertpradit
Michael Mandl
Jesse Kirchner
Hector Fernandez Alcalde
Esha Banerjee
Antonio Stella
Atsuko Shimada
Sookyoung Kwak
Gustavo Mendonca
Tatiana Lando
Rattima Nitisaroj
Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
The Power of Language Music: Arabic Lemmatization through Patterns
Ayah Zirikly
Mona Diab
Proceedings of the Workshop on Cognitive Aspects of the Lexicon, Osaka, Japan (2016), pp. 40-50
Patterns play a pivotal role in Arabic morphological processing, whether related to derivation or inflection, yet they have not been adequately and fully utilized in computational processing of the language. The novel contribution of this paper is performing lemmatization (a high-level lexical processing task) without relying on a lookup dictionary. We use a machine learning classifier to predict the lemma pattern for a given stem, and use mapping rules to convert stems to their respective lemmas.
Idafa in traditional Arabic grammar is an umbrella construction that covers several phenomena including what is expressed in English as noun-noun compounds and Saxon & Norman genitives. Additionally, Idafa participates in some other constructions, such as quantifiers, quasi-prepositions, and adjectives. Identifying the various types of the Idafa construction (IC) is of importance to Natural Language Processing (NLP) applications. Noun-Noun compounds exhibit special behaviour in most languages impacting their semantic interpretation. Hence distinguishing them could have an impact on downstream NLP applications. The most comprehensive computational syntactic representation of the Arabic language is found in the LDC Arabic Treebank (ATB). Despite its coverage, ICs are not explicitly labeled in the ATB and furthermore, there is no clear distinction between ICs of noun-noun relations and other traditional ICs. Hence, we devise a detailed syntactic and semantic typification process of the IC phenomenon in Arabic. We target the ATB as a platform for this classification. We render the ATB annotated with explicit IC labels in addition to further semantic characterization which is useful for syntactic, semantic and cross language processing. Our typification of IC comprises 3 main syntactic IC types: False Idafas (FIC), Grammatical Idafas (GIC), and True Idafas (TIC), which are further divided into 10 syntactic subclasses. The TIC group is further classified into semantic relations. We devise a method for automatic IC labeling and compare its yield against the CATiB Treebank. Our evaluation shows that we achieve the same level of accuracy, but with the additional fine-grained classification into the various syntactic and semantic types.
Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
Younes Samih
Suraj Maharjan
Laura Kallmeyer
Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching, Austin, TX (2016), pp. 50-59
This paper describes the HHU-UH-G system submitted to the EMNLP 2016 Second Workshop on Computational Approaches to Code Switching. Our system ranked first for Arabic (MSA-Egyptian) with an F1-score of 0.83 and second for Spanish-English with an F1-score of 0.90. The HHU-UH-G system introduces a novel unified neural network architecture for language identification in code-switched tweets for both the Spanish-English and MSA-Egyptian pairs. The system makes use of word- and character-level representations to identify code-switching. For the MSA-Egyptian pair, the system does not rely on any language-specific knowledge or linguistic resources, such as Part-of-Speech (POS) taggers, morphological analyzers, gazetteers, or word lists, to obtain state-of-the-art performance.
CogALex-V Shared Task: GHHH - Detecting Semantic Relations via Word Embeddings
Suraj Maharjan
Younes Samih
Laura Kallmeyer
Thamar Solorio
CogALex-2016 Shared Task on the Corpus-Based Identification of Semantic Relations, Osaka, Japan (2016), pp. 86-91
This paper describes our system submitted to the CogALex-2016 Shared Task on the Corpus-Based Identification of Semantic Relations. On the test set, our system achieves an F-measure of 88.1% (79.0% for TRUE only) for Task 1, detecting semantic similarity, and 76.0% (42.3% when excluding RANDOM) for Task 2, identifying finer-grained semantic relations. In our experiments, we try word analogy, linear regression, and multi-task Convolutional Neural Networks (CNNs) with word embeddings from publicly available word vectors. We found that linear regression performs better in binary classification (Task 1), while the CNN performs better in multi-class semantic classification (Task 2).
We assume that word analogy is more suited for deterministic answers rather than handling the ambiguity of one-to-many and many-to-many relationships. We also show that classifier performance could benefit from balancing the frequency of labels in the training data.
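Measuring similarity between words in a vector space typically reduces to cosine similarity between their embeddings; the sketch below uses tiny made-up 3-dimensional vectors rather than real embeddings.

```python
import math

# Minimal cosine-similarity sketch for scoring relatedness between word
# vectors, the basic operation behind embedding-based relation detection.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not real embeddings.
vec = {
    "banana": [0.9, 0.8, 0.1],
    "yellow": [0.8, 0.9, 0.0],
    "blue":   [0.0, 0.1, 0.9],
}
# "yellow" should be closer to "banana" than "blue" is:
print(cosine(vec["banana"], vec["yellow"]) > cosine(vec["banana"], vec["blue"]))
```

A discriminative-attribute decision can then be made by thresholding such similarity scores for the two concepts.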