A Morphologically Annotated Corpus of Emirati Arabic

Salam Khalifa, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim†, Meera Al Kaabi‡
Computational Approaches to Modeling Language Lab, New York University Abu Dhabi, UAE
†University of Bahrain, Bahrain
‡The United Arab Emirates University, UAE
{salamkhalifa@nyu.edu, nizar.habash@nyu.edu, darahim@uob.edu.bh}

Abstract

We present an ongoing effort to build the first large-scale, manually morphologically annotated corpus of Emirati Arabic. The corpus includes about 200,000 words selected from eight Gumar corpus novels written in the Emirati Arabic variety. The selected texts are being annotated for tokenization, part-of-speech, lemmatization, English glosses and dialect identification. The orthography of the text is also corrected for errors and inconsistencies. We discuss the guidelines for each of the annotation components, as well as the annotation interface we use, and we report on the quality of the annotation through an inter-annotator agreement measure.

Keywords: Gulf Arabic, Part-of-Speech Tagging, Morphology, Annotation

1. Introduction

There has been an increasing number of natural language processing (NLP) efforts focusing on dialectal Arabic, especially with the increasing amounts of written material on the web. However, resources for dialectal Arabic NLP tasks such as part-of-speech (POS) tagging, morphological analysis and disambiguation are still lacking compared to those for Modern Standard Arabic (MSA). MSA is the official language in more than 20 countries, where it is used in official communications, news, and education. Yet, it is not the commonly spoken variety of Arabic; the dialectal varieties of Arabic are what is used in day-to-day communication. Dialectal Arabic is also commonly used in written form on social media platforms, forums and blogs.

Using available resources developed for MSA, such as POS taggers and tokenizers, gives limited performance when they are applied to dialectal Arabic (Habash and Rambow, 2006; Jarrar et al., 2014; Khalifa et al., 2016a). Many researchers have therefore moved in the direction of creating tools and resources that target the dialects specifically. Egyptian Arabic is one of the dialects that received the earliest efforts in developing tools and resources. More resources are being developed for other dialects such as Levantine, Tunisian, Moroccan and Yemeni Arabic. Gulf Arabic, which we define as the native spoken variety of the Gulf Cooperation Council countries, is still lagging behind other Arabic dialects with respect to resource and tool creation, despite the considerable amount of dialectal content online.

In this paper, we present an ongoing project for creating a manually annotated corpus of about 200,000 words of the Gulf Arabic of the United Arab Emirates – Emirati Arabic. The corpus is annotated for tokenization, POS, lemmas and English glosses, in addition to spelling conventionalization and dialect identification. This resource will support the development of Arabic dialect enabling technologies, such as automatic POS tagging and morphological disambiguation, which in turn will facilitate work on other NLP tasks such as machine translation.

The rest of this paper is organized as follows. We discuss related work on dialectal corpora in Section 2. In Section 3, we describe the corpus used in this effort. We then present the annotation guidelines used to annotate the corpus in Section 4. Finally, we discuss the annotation process and the annotation quality results in Section 5.

2. Related Work

In this section we review a number of efforts on Arabic corpus creation that have significantly supported research and tool development for Arabic NLP.

2.1. Modern Standard Arabic Resources

The Penn Arabic Treebank (PATB) (Maamouri et al., 2004) has been a central resource for developing MSA resources. It was developed at the Linguistic Data Consortium (LDC), and it mainly consists of newswire text from different news sources. The PATB corpus is annotated for tokenization, segmentation, POS tagging, lemmatization, diacritization, English gloss and syntactic structure. The PATB has 12 parts totaling more than 1.3 million words. The annotated data has been the backbone of many state-of-the-art tools, such as analyzers and disambiguators, including MADAMIRA (Pasha et al., 2014) and its predecessor MADA (Habash et al., 2009), in addition to YAMAMA (Khalifa et al., 2016b), and most recently a neural morphological disambiguator (Zalmout and Habash, 2017) and a fine-grained POS tagger (Inoue et al., 2017). In addition, the PATB guidelines (Maamouri et al., 2009) have inspired the creation of similar guidelines for the dialects, including our own.

2.2. Dialectal Arabic Resources

In the scope of dialectal Arabic, there have been many recent contributions to the development and creation of resources. Below, we discuss the highlights of those contributions.

Egyptian Arabic Resources   Egyptian Arabic (EGY) was one of the first dialects to receive the attention of the NLP community. The earliest effort, to the best of our knowledge, is the Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany et al., 2002), which was developed as part of the CALLHOME Egypt corpus (Gadalla et al., 1997). The ECAL served as the seed of the EGY morphological analyzer CALIMA (Habash et al., 2012a). Later on, the Egyptian Arabic Treebank (ARZATB) (Maamouri et al., 2012a; Maamouri et al., 2014) was created by the LDC, using CALIMA to provide analysis options for the annotation process. The ARZATB currently has 400,000 words in eight parts, annotated in a similar fashion to the PATB. The annotation guidelines for the ARZATB (Maamouri et al., 2012b) followed those of the PATB, with decisions specific to the dialect. Since its release, the ARZATB has been used extensively for developing EGY resources such as the EGY components of MADAMIRA, MADA and YAMAMA, in addition to a noise-robust morphological disambiguator for EGY (Zalmout et al., 2018). Other developed corpora and POS taggers for EGY include the work of Al-Sabbagh and Girju (2012), who created their own POS tagset and corpus with the intention of facilitating NLP applications such as subjectivity and sentiment analysis.

Levantine Arabic and Other Dialectal Arabic Resources   Levantine Arabic (LEV) received some notable efforts, including the Levantine Arabic Treebank (LATB) of Jordanian Arabic (Maamouri et al., 2006), which contains around 27,000 words annotated in a similar fashion to the ARZATB. A more recent resource is Curras, an annotated corpus of Palestinian Arabic (Jarrar et al., 2014; Jarrar et al., 2016). The ARZATB and Curras were used to create morphological analyzers and disambiguators (Eskander et al., 2016). Other dialects such as Yemeni and Moroccan Arabic followed the same approach (Al-Shargi et al., 2016). In addition to the dialects mentioned above, there have been recent efforts on creating corpora for other dialects, namely Tunisian and Algerian (McNeil and Faiza, 2011; Masmoudi et al., 2014; Zribi et al., 2015; Smaïli et al., 2014). Other works targeted multi-dialect corpora (Diab et al., 2010; Zaidan and Callison-Burch, 2011; Diab et al., Forthcoming 2013; Bouamor et al., 2014; Cotterell and Callison-Burch, 2014), and, most recently, the ongoing Multi Arabic Dialect Applications and Resources (MADAR) project (Bouamor et al., 2018), which includes corpora for 25 different city dialects.

Gulf Arabic Resources   As far as Gulf Arabic (GLF) is concerned, the only existing annotated corpora are the Emirati Arabic Corpus (EAC) (Halefom et al., 2013) and the Emirati Arabic Language Acquisition Corpus (EMALAC) (Ntelitheos and Idrissi, 2017), which were created by linguists with an emphasis on the phonological and morphosyntactic phenomena of Emirati Arabic. We recently collected a large-scale corpus of Gulf Arabic (Khalifa et al., 2016a) containing more than 100 million words and covering six Gulf Arabic varieties. As for other tools and resources, we recently developed a morphological analyzer for Gulf Arabic verbs, CALIMA_GLF (Khalifa et al., 2017). We are also aware of a previously developed rule-based stemmer for the Arabic Gulf dialect (Abuata and Al-Omari, 2015).

In this work, we use about 200,000 words from the Emirati Arabic portion of the Gumar corpus, which we manually annotate for tokenization, POS tagging, lemmas, English glosses and dialect identification. Additionally, we conventionalize the spelling in accordance with the Conventional Orthography for Dialectal Arabic (CODA) rules (Habash et al., 2012b; Habash et al., 2018).

For recent surveys on Arabic resources for NLP, see Zaghouani (2014), Shoufan and Al-Ameri (2015), and Zeroual and Lakhouaja (2018).

3. Annotating the Gumar Corpus

We next discuss the Gumar Corpus and the portion of it that we annotate in this effort.

3.1. Gumar Corpus

The Gumar corpus is a large-scale corpus of Gulf Arabic containing more than 100 million words. The corpus consists mainly of documents of long conversational novels, also known as 'Internet Novels'. This type of literature is very popular among female teenagers in the Gulf area. The novels are written mostly in dialectal Arabic: the lengthy conversations between the characters of the story are in the dialect, while the narration in between the conversations can sometimes be in MSA.

The writers of the novels remain anonymous and use noms de plume. The novels are publicly available online, and most of the writers ask for their pen name to be mentioned if the novel is to be published on a platform other than the original one. The genre of the novels is mainly romance, but they also feature tragedy and drama. The corpus can be browsed online,1 and it is currently annotated using MADAMIRA in EGY mode.

At the document level, Gulf Arabic text makes up more than 90% of the corpus; the rest consists of other Arabic dialects in addition to MSA. Emirati Arabic text covers around 11% of the Gumar corpus.

3.2. The Annotated Gumar Corpus

We chose a set of 200,110 word tokens for the annotation task. The text consists of the first 25,000 words (rounded up to the nearest full sentence) from each of eight different novels by eight different authors. This allows us to cover different writing styles. The text comprises 15,277 sentences with an average of 13 words per sentence. Table 1 lists the novels from which the text is selected. We name this subset of the corpus the Annotated Gumar Corpus. In the future we plan to continue adding annotations to it from other Gumar novels, including ones in different dialects.

Additionally, a total of about 12,000 words – 1,500 words from each of the eight parts, rounded up to the nearest full sentence – were chosen to evaluate Inter-Annotator Agreement (IAA) throughout the annotation process. Thus, the total number of words to be annotated is about 212,000 words.

1 Please visit https://camel.abudhabi.nyu.edu/gumar/

Table 1: The eight parts of the Annotated Gumar Corpus, listing for each selected novel the number of tokens and sentences, its Arabic title, an English translation of the title, and the author's pen name.

Pronominal enclitics get a similar tag format to the baseword (i.e., 'PRON.features').

CAMEL POS provides a full array of features: (i) Aspect, with the values Perfective, Imperfective and Command; (ii) Person, with the values 1st, 2nd and 3rd; (iii) Gender, with the values Masculine and Feminine; (iv) Number, with the values Singular, Dual and Plural; (v) State, with the values Definite, Indefinite and Construct; (vi) Case, with the values Nominative, Genitive and Accusative; (vii) Voice, with the values Active and Passive; and (viii) Mood, with the values Subjunctive, Indicative and Jussive. Not all the features mentioned are necessarily relevant to the dialects. In the full POS tag, the specified values of the different features appear in the following order:

    POS.[Aspect][Person][Gender][Number].[State][Case][Voice][Mood]

The second period is not necessary if none of the last four features is specified.

Table 2 shows the POS tagset used in this annotation effort compared with the one used in the ARZATB. The tagset is divided into three categories according to the tokenization scheme we follow: proclitics (14 tags), enclitics (2 tags) and basewords (39 tags). Together with the features, the CAMEL POS tagset maps to the ARZATB tagset and retains backward compatibility. It also offers an intuitive Arabic naming scheme that is suitable for use in annotation.

    CAMEL POS          ARZATB POS
    PROCLITIC tags
    PART_DET           DET
    CONJ               CONJ
    PREP               PREP
    PART_NEG           NEG_PART
    PART_FUT           FUT_PART
    PART_PROG          PROG_PART
    CONJ_SUB           SUB_CONJ
    PRON_DEM           DEM_PRON
    PRON_INTERROG      INTERROG_PRON
    PART               PART
    PART_CONNECT       CONNEC_PART
    PART_EMPHATIC      EMPHATIC_PART
    PART_RC            RC_PART
    PART_VOC           VOC_PART
    ENCLITIC tags
    PART_NEG           NEG_PART
    PRON               *SUFF_DO:[PGN]
    PRON               POSS_PRON_[PGN]
    PRON               PRON_[PGN]
    BASEWORD tags
    NOUN               NOUN
    NOUN_NUM           NOUN_NUM
    NOUN_PROP          NOUN_PROP
    NOUN_QUANT         NOUN_QUANT
    ADJ                ADJ
    ADJ_NUM            ADJ_NUM
    ADJ_COMP           ADJ_COMP
    ADV                ADV
    ADV_INTERROG       INTERROG_ADV
    ADV_REL            REL_ADV
    VERB               IV/PV/CV
    VERB_PSEUDO        PSEUDO_VERB
    VERB_NOM           VERB
    PRON               PRON_[PGN]
    PRON_DEM           DEM_PRON_[GN]
    PRON_INTERROG      INTERROG_PRON
    PRON_EXCLAM        EXCLAM_PRON
    PRON_REL           REL_PRON
    PART               PART
    PART_DET           DET
    PART_NEG           NEG_PART
    PART_FUT           FUT_PART
    PART_PROG          PROG_PART
    PART_VERB          VERB_PART
    PART_VOC           VOC_PART
    PART_INTERROG      INTERROG_PART
    PART_RESTRICT      RESTRIC_PART
    PART_FOCUS         FOCUS_PART
    PART_EMPHATIC      EMPHATIC_PART
    PART_RC            RC_PART
    CONJ_SUB           SUB_CONJ
    PREP               PREP
    CONJ               CONJ
    PART_CONNECT       CONNEC_PART
    DIGIT              NOUN_NUM
    ABBREV             ABBREV
    INTERJ             INTERJ
    FOREIGN            FOREIGN
    PUNC               PUNC

Table 2: The CAMEL POS tagset used in the annotation of the Annotated Gumar Corpus compared to the POS tagset of the ARZATB.

For a subset of POS tags in the baseword category, each POS tag has a limited number of possible feature combinations paired with it. Below is the list of the POS tags that take features and their possible ordered combinations.

  • NOUN, NOUN_*, ADJ, ADJ_*  All nominals take the combination of Gender and Number. For example, jAls 'sitting' is tagged ADJ.MS. In the occasional uses of State, such as TabςAã 'of course', the tag would be NOUN.MS.I.
  • VERB  All verbs take the combination of Aspect, Person, Gender and Number. For example, yqTς 'cut' is tagged VERB.I3MS.
  • PRON  All pronouns take the combination of Person, Gender and Number. For example, Anty 'you [fs]' is tagged PRON.2FS.
  • PRON_DEM  All demonstrative pronouns take the combination of Gender and Number. For example, hAðA 'this' is tagged PRON_DEM.3MS.

In cases where a feature is not present, such as gender in first person verb inflections, the gender feature is simply dropped and does not require a placeholder, since the possible feature values are ordered and unique. For example, the imperfective 1st person verb Aqwl 'I say' is tagged VERB.I1S.
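As an illustration of the tag format described above, the following minimal Python sketch composes a full CAMEL POS tag from its feature values in the prescribed order; the function name and keyword arguments are ours and purely illustrative, not part of the released guidelines or tools.

```python
# Illustrative sketch: composing a CAMEL POS tag in the order
# POS.[Aspect][Person][Gender][Number].[State][Case][Voice][Mood],
# dropping unspecified features, and dropping the second period
# when none of the last four features is specified.

def compose_tag(pos, aspect="", person="", gender="", number="",
                state="", case="", voice="", mood=""):
    first = aspect + person + gender + number
    second = state + case + voice + mood
    tag = pos
    if first:
        tag += "." + first
    if second:
        tag += "." + second
    return tag

# Examples matching the guidelines above:
print(compose_tag("VERB", aspect="I", person="3", gender="M", number="S"))  # VERB.I3MS
print(compose_tag("NOUN", gender="M", number="S", state="I"))               # NOUN.MS.I
print(compose_tag("VERB", aspect="I", person="1", number="S"))              # VERB.I1S
```
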
4.4. Lemma Guidelines

The lemma is the citation form of the word. We follow the lemma specification guidelines of Graff et al. (2009), where nominals are cited using the masculine singular form of the word, or the feminine singular form if no masculine form exists.
For example, the lemma for the noun syAyyr 'cars' (NOUN.FP) is say∼Arah̄, which is feminine singular, since there is no masculine singular form of the word. Verbs are cited using the perfective 3rd person masculine singular form. For example, the lemma for the verb yšwfn 'they see [f.p.]' (VERB.I3FP) is šAf. For all other tags (i.e., particles, adverbs, etc.), the lemma is the same form as the baseword. In this annotation effort, the lemma is the only form we require to be manually diacritized.

4.5. English Gloss Guidelines

The English gloss in this context refers to the semantic translation of the Arabic lemma. For nominals we use the singular form, and for verbs we use the infinitive form. An Arabic lemma can have multiple synonymous English glosses. For example, kbyr would have the following English glosses: 'large; great; important; major; senior'.

4.6. Word Level Dialect Identification

Dialect identification is the task of tagging a certain context with a given dialect tag. Deciding the dialect tag depends on the context of the sentence and/or the document. This can be challenging since many words in their written form may be shared by many dialects and MSA. Additionally, it is not uncommon to find code switching between MSA and a dialect, and even, though less commonly, between one dialect and another (Elfardy and Diab, 2012). Hence we tag per word, but rely on the context of the sentence and even the document to identify the dialect.

In Table 3 we show an example of a fully annotated sentence, with the corresponding ARZATB POS tags for comparison. For a full description of each of the annotation tasks with examples, the full guidelines can be accessed online.

5. Annotation Process

In this section, we discuss the details of the annotation process, the tool we used, and some annotation quality evaluation results.

5.1. MADARi Interface

We used a newly developed interface for morphological annotation and spelling correction called MADARi (Obeid et al., 2018). MADARi is a web-based interface that supports joint morphological annotation (tokenization, POS tagging, lemmatization) and spelling correction at any point of the annotation process, which minimizes error propagation. English glossing and dialect identification are also supported in the interface. MADARi assigns initial answers to the new text using MADAMIRA in EGY mode, whose databases we extended with CALIMA_GLF for more coverage. MADARi has many utilities that facilitate the annotation process and that we use for greater efficiency; examples are discussed in the next subsection. Figure 1 shows a screenshot of the annotation view in MADARi.

5.2. Manual Annotation

The annotator starts from an automatically pre-annotated document. They carefully examine the spelling of each word and all its analysis choices in context, with reference to the raw text at all times. For each word, the annotator faces one of the following scenarios:

  • All annotation tasks are correct: the annotator only has to validate the answer.
  • Correct analysis but wrong spelling: the annotator has to adjust the spelling and then validate the answer.
  • Wrong analysis (wholly or partially) but correct spelling: the annotator can manually adjust the analysis, or can use the 'analysis search' utility provided by MADARi to retrieve the analysis of a word with a similar structure, in which case they only have to change the lemma and gloss entries. Finally, they validate the answer.
  • Wrong analysis and wrong spelling: the annotator has to adjust the spelling and then follow the previous step.

At any point of the annotation process, the annotator is able to apply mass changes to spelling and/or analysis across the document they are working on. However, the annotator must ensure that all the words affected by the change are in similar contexts. The annotator can also modify their answers at any time during the annotation, based on the feedback they receive on any inquiries they raise. This allows the annotator to skip over words they are not confident about and leave the answer unvalidated.

Once the annotation task is fully completed, the annotator may 'submit' the finished document to be exported later. This makes all the analyses made by the annotator accessible to the other annotators when they look up the analysis of similar words.

5.3. Inter-Annotator Agreement

We evaluated the quality of the annotation using the Inter-Annotator Agreement (IAA) measure between two annotators on a selected text of 1,500 words. We measured the agreement on: (i) word boundaries, that is, whether the word boundaries are the same (no splits/merges); (ii) CODA spelling; (iii) baseword form; (iv) baseword POS; (v) baseword features; (vi) clitic form (averaged across all clitic positions); and (vii) clitic POS (averaged across all clitic positions). To align a pair of annotations, we perform a word-level alignment within the sentences. We use a weighted Levenshtein distance to maximize the alignment, where insertions and deletions are weighted as 1 and substitutions are weighted as follows:

    W_edit(t1, t2) = 2 · Lev(t1, t2) / max(|t1|, |t2|)        (1)

Above, t1 and t2 are the two word tokens, and Lev is the Levenshtein distance at the character level. We employ this character-based weighting scheme to encourage the alignment of words with spelling changes.
process that we utilize for more efficiency, of which exam-       ment of words with spelling changes. Using the same IAA
ples are discussed in the next subsection. Figure 1 shows a       measure, we measured the similarity between each annota-
screenshot of the annotation view in MADARi.                      tor and the initial answers from the CALIMAGLF -extended
                                                                  MADAMIRA.
5.2.   Manual Annotation                                            The results are presented in Table 4 in terms of percent
The annotator starts on an automatically pre-annotated doc-       agreement. MADAMIRA provided a very helpful starting
ument. They carefully examine the spelling of each word           point. In at least 75% of the case, annotators agreed with

ranging between 94.7% and more than 99% for the different annotation tasks. In the future, we plan to expand the annotated text to include other genres and dialects. We are also interested in using the annotations to improve the quality of Arabic dialect POS tagging and morphological disambiguation.

7. Acknowledgements

This project is funded by a New York University Abu Dhabi Research Enhancement Fund. We would like to thank Ramy Eskander and the team of annotators at Ramitechs. We also thank the creators of MADARi from the MADAR project. Finally, we thank Sondos Krouna for insightful discussions on POS decisions.

Bibliographical References

Abuata, B. and Al-Omari, A. (2015). A Rule-based Stemmer for Arabic Gulf Dialect. Journal of King Saud University - Computer and Information Sciences, 27(2):104–112.

Al-Sabbagh, R. and Girju, R. (2012). A Supervised POS Tagger for Written Arabic Social Networking Corpora. In Jeremy Jancsary, editor, Proceedings of KONVENS 2012, pages 39–52. ÖGAI, September. Main track: oral presentations.

Al-Shargi, F., Kaplan, A., Eskander, R., Habash, N., and Rambow, O. (2016). A Morphologically Annotated Corpus and a Morphological Analyzer for Moroccan and Sanaani Yemeni Arabic. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia.

Alkuhlani, S. and Habash, N. (2011). A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, Oregon, USA.

Bouamor, H., Habash, N., and Oflazer, K. (2014). A Multidialectal Parallel Corpus of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).

Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., Obeid, O., Khalifa, S., Eryani, F., Erdmann, A., and Oflazer, K. (2018). The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), May.

Cotterell, R. and Callison-Burch, C. (2014). A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic. In LREC, pages 241–245.

Diab, M., Habash, N., Rambow, O., AlTantawy, M., and Benajiba, Y. (2010). COLABA: Arabic Dialect Annotation and Processing. In Proceedings of the LREC Workshop for Language Resources (LRs) and Human Language Technologies (HLT) for Semitic Languages: Status, Updates, and Prospects.

Diab, M., Hawwari, A., Elfardy, H., Dasigi, P., Al-Badrashiny, M., Eskander, R., and Habash, N. (Forthcoming 2013). Tharwa: A Multi-Dialectal Multi-Lingual Machine Readable Dictionary.

Elfardy, H. and Diab, M. (2012). Token Level Identification of Linguistic Code Switching. In Proceedings of COLING 2012: Posters, pages 287–296.

Eskander, R., Habash, N., Rambow, O., and Pasha, A. (2016). Creating Resources for Dialectal Arabic from a Single Annotation: A Case Study on Egyptian and Levantine. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3455–3465, Osaka, Japan.

Gadalla, H., Kilany, H., Arram, H., Yacoub, A., El-Habashi, A., Shalaby, A., Karins, K., Rowson, E., MacIntyre, R., Kingsbury, P., Graff, D., and McLemore, C. (1997). CALLHOME Egyptian Arabic Transcripts. Linguistic Data Consortium, Philadelphia.

Graff, D., Maamouri, M., Bouziri, B., Krouna, S., Kulick, S., and Buckwalter, T. (2009). Standard Arabic Morphological Analyzer (SAMA) Version 3.1. Linguistic Data Consortium LDC2009E73.

Habash, N. and Rambow, O. (2006). MAGEAD: A Morphological Analyzer and Generator for the Arabic Dialects. In Proceedings of ACL, pages 681–688, Sydney, Australia.

Habash, N., Soudi, A., and Buckwalter, T. (2007). On Arabic Transliteration. In A. van den Bosch et al., editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.

Habash, N., Rambow, O., and Roth, R. (2009). MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization. In Khalid Choukri et al., editors, Proceedings of the Second International Conference on Arabic Language Resources and Tools. The MEDAR Consortium, April.

Habash, N., Eskander, R., and Hawwari, A. (2012a). A Morphological Analyzer for Egyptian Arabic. In NAACL-HLT 2012 Workshop on Computational Morphology and Phonology (SIGMORPHON 2012), pages 1–9.

Habash, N., Diab, M. T., and Rambow, O. (2012b). Conventional Orthography for Dialectal Arabic. In LREC, pages 711–718.

Habash, N., Khalifa, S., Eryani, F., Rambow, O., Abdulrahim, D., Erdmann, A., Faraj, R., Zaghouani, W., Bouamor, H., Zalmout, N., Hassan, S., Al-Shargi, F., Alkhereyf, S., Abdulkareem, B., Eskander, R., Salameh, M., and Saddiki, H. (2018). Unified Guidelines and Resources for Arabic Dialect Orthography. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), May.

Habash, N. (2010). Introduction to Arabic Natural Language Processing. Morgan & Claypool Publishers.

Halefom, G., Leung, T., and Ntelitheos, D. (2013). A Corpus of Emirati Arabic. Technical Report NRF Grant (31 H001), United Arab Emirates University.
Inoue, G., Shindo, H., and Matsumoto, Y. (2017). Joint Prediction of Morphosyntactic Categories for Fine-Grained Arabic Part-of-Speech Tagging Exploiting Tag Dictionary Information. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 421–431, Vancouver, Canada, August. Association for Computational Linguistics.

Jarrar, M., Habash, N., Akra, D., and Zalmout, N. (2014). Building a Corpus for Palestinian Arabic: a Preliminary Study. ANLP 2014, page 18.

Jarrar, M., Habash, N., Alrimawi, F., Akra, D., and Zalmout, N. (2016). Curras: An Annotated Corpus for the Palestinian Arabic Dialect. Language Resources and Evaluation, pages 1–31.

Khalifa, S., Habash, N., Abdulrahim, D., and Hassan, S. (2016a). A Large Scale Corpus of Gulf Arabic. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia.

Khalifa, S., Zalmout, N., and Habash, N. (2016b). YAMAMA: Yet Another Multi-Dialect Arabic Morphological Analyzer. In Proceedings of the International Conference on Computational Linguistics (COLING): System Demonstrations, pages 223–227.

Khalifa, S., Hassan, S., and Habash, N. (2017). A Morphological Analyzer for Gulf Arabic Verbs. WANLP 2017 (co-located with EACL 2017), page 35.

Kilany, H., Gadalla, H., Arram, H., Yacoub, A., El-Habashi, A., and McLemore, C. (2002). Egyptian Colloquial Arabic Lexicon. LDC catalog number LDC99L22.

Maamouri, M., Bies, A., Buckwalter, T., and Mekki, W. (2004). The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. In NEMLAR Conference on Arabic Language Resources and Tools, pages 102–109, Cairo, Egypt.

Maamouri, M., Bies, A., Buckwalter, T., Diab, M., Habash, N., Rambow, O., and Tabessi, D. (2006). Developing and Using a Pilot Dialectal Arabic Treebank. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC 2006.

Maamouri, M., Bies, A., Krouna, S., Gaddeche, F., and Bouziri, B. (2009). Penn Arabic Treebank Guidelines. Linguistic Data Consortium.

Maamouri, M., Bies, A., Kulick, S., Tabessi, D., and Krouna, S. (2012a). Egyptian Arabic Treebank DF Parts 1-8 V2.0 - LDC catalog numbers LDC2012E93, LDC2012E98, LDC2012E89, LDC2012E99, LDC2012E107, LDC2012E125, LDC2013E12, LDC2013E21.

Maamouri, M., Krouna, S., Tabessi, D., Hamrouni, N., and Habash, N. (2012b). Egyptian Arabic Morphological Annotation Guidelines.

Maamouri, M., Bies, A., Kulick, S., Ciul, M., Habash, N., and Eskander, R. (2014). Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).

Masmoudi, A., Ellouze Khmekhem, M., Esteve, Y., Hadrich Belguith, L., and Habash, N. (2014). A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).

McNeil, K. and Faiza, M. (2011). Tunisian Arabic Corpus: Creating a Written Corpus of an "Unwritten" Language.

Ntelitheos, D. and Idrissi, A. (2017). Language Growth in Child Emirati Arabic. In Hamid Ouali, editor, Perspectives on Arabic Linguistics 29, pages 229–248.

Obeid, O., Khalifa, S., Habash, N., Bouamor, H., Zaghouani, W., and Oflazer, K. (2018). MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), May.

Pasha, A., Al-Badrashiny, M., Kholy, A. E., Eskander, R., Diab, M., Habash, N., Pooleery, M., Rambow, O., and Roth, R. (2014). MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic. In Proceedings of LREC, Reykjavik, Iceland.

Shoufan, A. and Al-Ameri, S. (2015). Natural Language Processing for Dialectical Arabic: A Survey. In ANLP Workshop 2015, page 36.

Smaïli, K., Abbas, M., Meftouh, K., and Harrat, S. (2014). Building Resources for Algerian Arabic Dialects. In 15th Annual Conference of the International Communication Association Interspeech.

Smrž, O. (2007). ElixirFM – Implementation of Functional Arabic Morphology. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pages 1–8, Prague, Czech Republic, June. ACL.

Zaghouani, W. (2014). Critical Survey of the Freely Available Arabic Corpora. In Proceedings of the Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools, LREC, pages 1–8.

Zaidan, O. F. and Callison-Burch, C. (2011). The Arabic Online Commentary Dataset: An Annotated Dataset of Informal Arabic with High Dialectal Content. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 37–41.

Zalmout, N. and Habash, N. (2017). Don't Throw Those Morphological Analyzers Away Just Yet: Neural Morphological Disambiguation for Arabic. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 704–713.

Zalmout, N., Erdmann, A., and Habash, N. (2018). Noise-Robust Morphological Disambiguation for Dialectal Arabic. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018).

Zeroual, I. and Lakhouaja, A. (2018). Arabic Corpus Linguistics: Major Progress, But Still A Long Way to Go, pages 613–636. Springer International Publishing, Cham.

Zribi, I., Ellouze, M., Belguith, L. H., and Blache, P. (2015). Spoken Tunisian Arabic Corpus "STAC": Transcription and Annotation. Research in Computing Science, 90:123–135.
