A Morphologically Annotated Corpus of Emirati Arabic
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
A Morphologically Annotated Corpus of Emirati Arabic Salam Khalifa, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim,† Meera Al Kaabi‡ Computational Approaches to Modeling Language Lab, New York University Abu Dhabi, UAE University of Bahrain, Bahrain† The United Arab Emirates University, UAE‡ {salamkhalifa@nyu.edu, nizar.habash@nyu.edu, darahim@uob.edu.bh} Abstract We present an ongoing effort on the first large-scale morphologically manually annotated corpus of Emirati Arabic. This corpus includes about 200,000 words selected from eight Gumar corpus novels in the Emirati Arabic variety. The selected texts are being annotated for tokenization, part-of-speech, lemmatization, English glosses and dialect identification. The orthography of the text is also adjusted for errors and inconsistencies. We discuss the guidelines for each part of the annotation components, and the annotation interface we use. We report on the quality of the annotation through an inter-annotator agreement measure. Keywords: Gulf Arabic, Part-of-Speech Tagging, Morphology, Annotation 1. Introduction The rest of this paper is organized as follows. We discuss related work on dialectal corpora in Section 2. In Section 3. There has been an increasing number of natural language we describe the corpus used in this effort. We then present processing (NLP) efforts focusing on dialectal Arabic, es- the annotation guidelines that are used to annotate the cor- pecially with the increasing amounts of written material on pus in Section 4. We discuss the annotation process and the the web. However, resources for dialectal Arabic NLP tasks annotation quality results in Section 5. such as part-of-speech (POS) tagging, morphological anal- ysis and disambiguation are still lacking compared to those 2. Related Work for Modern Standard Arabic (MSA). MSA is the official language in more than 20 countries, where it is used in offi- In this section we review a number of efforts on Arabic cial communications, news, and education. Yet, it is not the corpus creation, that significantly supported research and commonly spoken variety of Arabic; the dialectal varieties tool development for Arabic NLP. of Arabic are what is used in the day-to-day communica- tion. Dialectal Arabic is also commonly used in written 2.1. Modern Standard Arabic Resources form on social media platforms, forums and blogs. The Penn Arabic Treebank (PATB) (Maamouri et al., 2004) has been a central resource for developing MSA resources. Using available resources developed for MSA such as It was developed at the Linguistic Data Consortium (LDC), POS taggers and tokenizers gives limited performance and it mainly consists of newswire text from different news when used on dialectal Arabic (Habash and Rambow, 2006; sources. The PATB corpus is annotated for tokenization, Jarrar et al., 2014; Khalifa et al., 2016a). Many researchers segmentation, POS tagging, lemmatization, diacritization, moved into the direction of creating tools and resources English gloss and syntactic structure. The PATB has 12 targeting the dialects specifically. Egyptian Arabic is one parts of more than 1.3 million words. The annotated data of the dialects that received earlier efforts for developing has been a backbone of many state-of-the-art tools such tools and resources. More resources are being developed as analyzers and disambiguators including MADAMIRA for other dialects such as Levantine, Tunisian, Moroccan (Pasha et al., 2014) and its predecessor MADA (Habash et and Yemeni Arabic. Gulf Arabic, as we define it to be the al., 2009), in addition to YAMAMA (Khalifa et al., 2016b), native spoken variety in the Gulf Cooperation Council, is and most recently a neural morphological disambiguatior still lagging behind other Arabic dialects with respect to re- (Zalmout and Habash, 2017) and a fine grained POS tag- source and tool creation, given the considerable amount of ger (Inoue et al., 2017). In addition, the PATB guidelines dialectal content online. (Maamouri et al., 2009) have inspired the creation of simi- lar guidelines for the dialects including our own. In this paper, we present an ongoing project for creating a manually annotated corpus of about 200,000 words of the 2.2. Dialectal Arabic Resources Gulf Arabic of the United Arab Emirates – Emirati Arabic. The corpus is annotated for tokenization, POS, lemmas and In the scope of dialectal Arabic, there have been many re- English glosses in addition to spelling conventionalization cent contributions to the development and creation of re- and dialect identification. This resource will support the de- sources. Below, we discuss the highlights of those contri- velopment of Arabic dialect enabling technologies, such as butions. automatic POS tagging and morphological disambiguation, Egyptian Arabic Resources Egyptian Arabic (EGY) which in turn will facilitate efforts on different NLP tasks was one of the first dialects that received the attention of such as machine translation. the NLP community. The earliest effort, to the best of 3839
our knowledge, is the Egyptian Colloquial Arabic Lexicon tate for tokenization, POS tagging, lemma, English gloss (ECAL) (Kilany et al., 2002) which was developed as part and dialect identification. Additionally we conventionalize of the CALLHOME Egypt corpus (Gadalla et al., 1997). the spelling in accordance with the Conventional Orthog- The ECAL served as the seed to the EGY morphological raphy for Dialectal Arabic (CODA) rules (Habash et al., analyzer (CALIMA) (Habash et al., 2012a). Later on, the 2012b; Habash et al., 2018). Egyptian Arabic Treebank (ARZATB) (Maamouri et al., For recent surveys on Arabic resources for NLP, see Za- 2012a; Maamouri et al., 2014) was created by the LDC ghouani (2014), Shoufan and Al-Ameri (2015) and Zeroual using CALIMA to provide analysis options for the anno- and Lakhouaja (2018). tation process. The ARZATB has currently 400,000 words in eight parts annotated in a similar fashion to the PATB. The annotation guidelines for the ARZATB (Maamouri et 3. Annotating the Gumar Corpus al., 2012b) followed that of the PATB with decisions spe- We discuss next the Gumar Corpus and the portion of it we cific to the dialect. Since the release, the ARZATB has been use to annotate in this effort. used extensively for developing EGY resources such as the EGY part of MADAMIRA, MADA and YAMAMA, in ad- dition to a noise-robust morphological disambiguator for 3.1. Gumar Corpus EGY (Zalmout et al., 2018). Other developed corpora and The Gumar corpus is a large-scale corpus of Gulf Arabic POS taggers for EGY include the work of Al-Sabbagh and containing more than 100 million words. The corpus con- Girju (2012) where they created their own POS tagset and sists mainly of documents of long conversational novels corpus with the intention to facilitate certain NLP applica- also known as I JË@ HAK @ð P ‘Internet Novels’. This type tions like subjectivity and sentiment analysis. of literature is very popular among female teenagers in the Gulf area. These novels are written mostly in dialectal Ara- Levantine Arabic and and Other Dialectal Arabic Re- bic, where the lengthy conversations between the characters sources Levantine Arabic (LEV) received some notable of the story are in the dialect and the narration in between efforts including the Levantine Arabic Treebank (LATB) the conversations can sometimes be in MSA. of Jordanian Arabic (Maamouri et al., 2006) which con- tains around 27,000 annotated words in a similar fashion to The writers of the novels remain anonymous and use ARZATB. A more recent resource is the annotated corpus noms de plume. The novels are publicly available online, of Palestinian Arabic (Curras) (Jarrar et al., 2014; Jarrar et where most of the writers ask for their pen name to be al., 2016). ARZATB and Curras were used to create mor- mentioned if the novel is to be published in a different plat- phological analyzers and disambiguators (Eskander et al., form than the original. The genre of the novels is mainly 2016). Other dialects such as Yemeni and Moroccan Arabic romantic, but also features tragedy and drama. The cor- followed the same approach (Al-Shargi et al., 2016). In ad- pus can be browsed online,1 it is currently annotated using dition to the dialects mentioned above, there were recent ef- MADAMIRA in EGY mode. forts on creating corpora for other dialects, namely Tunisian and Algerian (McNeil and Faiza, 2011; Masmoudi et al., On the document level, Gulf Arabic text makes up more 2014; Zribi et al., 2015; Smaïli et al., 2014). Other works than 90% of the corpus, the rest of the corpus consists of targeted multi-dialect corpora (Diab et al., 2010; Zaidan other Arabic dialects in addition to MSA. Emirati Arabic and Callison-Burch, 2011; Diab et al., Forthcoming 2013; text covers around 11% of the Gumar corpus. Bouamor et al., 2014; Cotterell and Callison-Burch, 2014), and, most recently, the ongoing Multi Arabic Dialect and 3.2. The Annotated Gumar Corpus Application Resources project (MADAR) (Bouamor et al., We chose a set of 200,110 word tokens for the annotation 2018) which includes corpora for 25 different city dialects. task. The text consists of the first 25,000 words (rounded Gulf Arabic Resources As far as Gulf Arabic (GLF) up to the nearest full sentence) from eight different novels is concerned, the only existing annotated corpora in- by eight different authors. This allows us to cover different clude the Emirati Arabic Corpus (EAC) (Halefom et al., writing styles. The text is comprised of 15,277 sentences 2013) and the Emirati Arabic Language Acquisition Corpus with an average of 13 words per sentence. Table 1 shows (EMALAC) (Ntelitheos and Idrissi, 2017) that were created the list of the novels from which the text is selected. We by linguists with emphasis on the phonological and mor- name this subset of the corpus the Annotated Gumar Cor- phosyntactic phenomena of Emirati Arabic. We recently pus. In the future we plan to continue adding annotations collected a large-scale corpus of Gulf Arabic (Khalifa et to it from other Gumar novels including different dialects. al., 2016a) containing more than 100 million words cover- Additionally, a total of about 12,000 words – 1,500 ing six Gulf Arabic varieties. In regards to other tools and words from each of the eight parts rounded up to the near- resources, we recently developed a morphological analyzer est full sentence – are chosen to evaluate Inter-Annotator for Gulf Arabic verbs (CALIMAGLF ) (Khalifa et al., 2017). Agreement (IAA) throughout the annotation process. Thus, We are also aware of the previously developed rule-based the total number of words to be annotated is about 212,000 stemmer for Arabic Gulf dialect (Abuata and Al-Omari, words. 2015). In this work, we use about 200,000 words from the Emi- 1 Please visit https://camel.abudhabi.nyu.edu/ rati Arabic portion of the Gumar corpus to manually anno- gumar/ 3840
Parts Tokens Sentences à@ñJªË@ Title Transltation ËñÖÏ@ Author Part 1 25,022 1,387 ÉJÓ éJK Q£ B ¯@ñmÌ '@ HA Hñ ¯X ‘The sound of the beating hearts b3thra had2a ‘Calm scatter’ ÐAJmÌ '@ ñ ¯ Q¢ÖÏ@ HAJ .k when I remember it, is like the drops of rain on the tents’ Part 2 25,009 3,176 J úæJ . ¯P úΫ ñË ½J.k @ é
inal enclitics get a similar tag format as the baseword (i.e. ‘PRON.features’). CAMEL POS Arabic CAMEL POS ARZATB POS C AMEL POS provides full array of features: (i) Aspect PROCLITIC tags with the values Perfective, Imperfective and Command; K QªK_ è@X @ PART_DET DET ¢«_ ¬Qk CONJ CONJ (ii) Person with the values 1st, 2nd, 3rd; (iii) Gender Qk._ ¬Qk PREP PREP with values Masculine and Feminine; (iv) Number with ù®K_ è@X @ PART_NEG NEG_PART values Simgular, Dual and Plural and (v) State with val- ÈAJ.® J@_ è@X @ PART_FUT FUT_PART ues Definite, Indefinite and Construct; (vi) Case with val- Ó_ è@X @ é«PA PART_PROG PROG_PART ues Nominative, Genitive and Accusative; (vii) Voice with ¡ P_ è@X @ CONJ_SUB SUB_CONJ èPA@. _ QÖÞ values Active and Passive and (viii) Mood with values PRON_DEM DEM_PRON Subjunctive, Indicative and Jussive. Not all the features ÐAê®J@_QÖÞ PRON_INTERROG INTERROG_PRON è@X @ PART PART mentioned are necessarily relevant to the dialects. In the ¡. P_ ¬Qk PART_CONNECT CONNEC_PART full POS tag, the specified values of the different features YJ»ñK_ è@X @ PART_EMPHATIC EMPHATIC_PART will appear in the following order: Qå_ H. @ñk. PART_RC RC_PART .. Z@YK_ è@X @ PART_VOC VOC_PART ENCLITIC tags The second period is not necessary if none of the last four ù®K_ è@X @ PART_NEG NEG_PART features is specified. QÖÞ PRON *SUFF_DO:[PGN] QÖÞ PRON POSS_PRON_[PGN] Table 2 shows the list of POS tagset used in this anno- QÖÞ PRON PRON_[PGN] tation effort compared with the ones used ARZATB. The BASEWORD tags tagset is divided into three categories according to the tok- Õæ @ NOUN NOUN enization scheme we follow: proclitics (14 tags), enclitics XY«_ Õæ @ NOUN_NUM NOUN_NUM ÕΫ_ Õæ @ NOUN_PROP NOUN_PROP (2 tags) and baseword (39 tags). Together with the fea- Õ»_ Õæ @ NOUN_QUANT NOUN_QUANT tures, C AMEL POS tagset maps to ARZATB and retains é® ADJ ADJ backward compatibility. It also offers an intuitive Arabic XY«_ é® ADJ_NUM ADJ_NUM scheme that is suitable to use for annotation. éKPA®Ó _ é® ADJ_COMP ADJ_COMP ¬Q£ ADV ADV For a subset of POS tags in the baseword category, each ÐAê®J@_ ¬Q£ ADV_INTERROG INTERROG_ADV ÈññÓ_ ¬Q£ ADV_REL REL_ADV POS tag has a limited number of possible feature combina- ɪ¯ VERB IV/PV/CV tions that is paired with it. Below is the list of the POS tags ɪ¯_ éJ. VERB_PSEUDO PSEUDO_VERB that take features and their possible ordered combination. ɪ¯_ Õæ @ VERB_NOM VERB QÖÞ PRON PRON_[PGN] èPA@ _ QÖÞ • NOUN, NOUN_*, ADJ, ADJ_* All nominals take PRON_DEM DEM_PRON_[GN] the combination of Gender, Number. For example ÐAê®J@_QÖÞ PRON_INTERROG INTERROG_PRON I.j.ªK_QÖÞ PRON_EXCLAM EXCLAM_PRON ËAg. jAls ‘sitting’ is tagged ADJ.MS ; In the occa- ÈññÓ_QÖÞ PRON_REL REL_PRON è@X @ PART PART sional uses of State, such as AªJ.£ TabςAã ‘of course’ K QªK_ è@X @ PART_DET DET the tag would be NOUN.MS.I ù®K_ è@X @ PART_NEG NEG_PART • VERB All verbs take the combination of Aspect, ÈAJ.® J@_ è@X @ PART_FUT FUT_PART Person, Gender and Number. For example ©¢®K yqTς Ó_ è@X @ é«PA PART_PROG PROG_PART ‘cut’ is tagged as VERB.I3MS ɪ¯_ è@X @ PART_VERB VERB_PART • PRON All pronouns take the combination of Person, Z@YK_ è@X @ PART_VOC VOC_PART Gender and Number. For example úæK@ Anty ‘you [fs]’ ÐAê®J@_ è@X @ PART_INTERROG INTERROG_PART _ è@X @ ZAJJ@ PART_RESTRICT RESTRIC_PART is tagged as PRON.2FS ÉJ®K_ è@X @ PART_FOCUS FOCUS_PART • PRON_DEM All demonstrative pronouns take the YJ»ñK_ è@X @ PART_EMPHATIC EMPHATIC_PART combination of Gender and Number. For example @ XAë Qå_ H. @ñk. PART_RC RC_PART ¡. P_ è@X @ CONJ_SUB SUB_CONJ hAðA ‘this’ is tagged as PRON_DEM.3MS Qk._ ¬Qk PREP PREP ¢«_ ¬Qk CONJ CONJ In cases where a feature is not present, such as gender in ¡. P_ ¬Qk PART_CONNECT CONNEC_PART verbs of first person inflections, the gender feature is simply Õ¯P DIGIT NOUN_NUM dropped and does not require a placeholder since the pos- PAJk@ ABBREV ABBREV I.j.ªK INTERJ INTERJ sible feature values are ordered and unique. For example FORIEGN FOREIGN the imperfective 1st person verb Èñ¯@ Aqwl ‘I say’ will be úæ.Jk. @ Õæ¯Q K_ éÓC« PUNC PUNC tagged as VERB.I1S Table 2: Table shows the C AMEL POS tagset used in the 4.4. Lemma Guidelines annotation of Annotated Gumar Corpus compared to the The lemma is the citation form of the the word. We fol- POS tagset in ARZATB. C AMEL POS Arabic shows the low the same guidelines of the lemma specification from Arabic name of the tag. Graff et al. (2009), where nominals are cited using the masculine singular form of the word or the feminine sin- gular form if no masculine form exists. For example, the 3842
lemma for the noun QK AJ syAyyr ‘cars’ (NOUN.FP) is èPAJ and all its analysis choices in context with reference to the say∼Arah̄ which is feminine singular since there is no mas- raw text at all times. For each word the annotator faces one culine singular form of the word. The verbs are cited using of the following scenarios: the perfective 3rd person masculine singular form. For ex- yšwfn ‘they see [f.p]’ • All annotation tasks are correct: the annotator has to ample, the lemma for the verb á¯ñ only validate the answer. (VERB.I3FP) is ¬A šAf . For all other tags (i.e. particles, • Correct analysis but wrong spelling: the annotator has adverbs, ... etc) the lemma is the same form of the base- to adjust the spelling and then validate the answer. word. In this annotation effort, the lemma is the only form • Wrong analysis (wholly or partially) but correct we require to be manually diacritized. spelling: the annotator can manually adjust the analy- sis or can use the ‘analysis search’ utility provided by 4.5. English Gloss Guidelines MADARi to get an analysis for a word with similar The English gloss in this context refers to the semantic structure and then they would only have to change the translation of the Arabic lemma. For nominals we use the lemma and the gloss entries. Finally they validate the singular form, and for verbs we use the infinitive form. An answer. Arabic lemma could have multiple synonymous English • Wrong analysis and spelling: the annotator has to ad- glosses. For example QJ.» kbyr would have the following just the spelling and follow the previous step. English glosses ‘large; great; important; major; senior’. At any point of the annotation process, the annotator 4.6. Word Level Dialect Identification is able to apply mass changes to spelling and/or analysis across the document they are working on. However, the an- Dialect identification is the task of tagging a certain context notator must insure that all the words affected by the change with a given dialect tag. Deciding the dialect tag depends are in similar contexts. The annotator can also modify their on the context of the sentence and/or the document. This answers any time during the annotation through feedback can be challenging since many words in their written form they get if they have any inquiries. This allows the anno- may be shared by many dialects and MSA. Additionally, it tator to skip over words they are not confident about and is not uncommon to find dialect code switching between leave the answer unvalidated. MSA and a dialect, and even a dialect with another dialect (less commonly) (Elfardy and Diab, 2012). Hence we tag Once the annotation task is fully completed, the annota- per word, but rely on the context of the sentence and even tor may ‘submit’ the finished document to be later exported. the document to identify the dialect. This will allow all the analyses made by the annotator to be accessible to all the other annotators when they look up the In Table 3 we show an example of a fully annotated sen- analysis for similar words. tence and the POS tag in ARZATB for comparison. For 5.3. Inter Annotator Agreement full description of each of the annotation tasks and exam- ples, the full guidelines can be accessed online. We evaluated the quality of the annotation using the Inter Annotator Agreement (IAA) measure between two anno- 5. Annotation Process tators on a selected text of 1,500 words. We measured the agreement on: (i) word boundary, that is the agree- In this section, we discuss the annotation process details, ment on whether word boundaries are the same (no split- the tool we used, and some annotation quality evaluation s/merges); (ii) CODA spelling; (iii) baseword form; (iv) results. baseword POS; (v) baseword features; (vi) clitic form (av- eraged across all clitic positions) and (vii) clitic POS (av- 5.1. MADARi Interface eraged across all clitic positions). To align the pair of an- We used a newly developed interface for morphological an- notations, we perform a word level alignment within the notation and spelling correction called MADARi (Obeid et sentences. We use a weighted Levenshtein distance to al., 2018). MADARi is a web-based interface that sup- maximize alignment, where insertions and deletions are ports joint morphological annotation (tokenization, POS weighted as 1 and substitutions are weighted as follows: tagging, lemmatization) and spelling correction at any point of the annotation process, which minimizes error propaga- 2Lev(t1 , t2 ) Wedit (t1 , t2 ) = (1) tion. English glossing and dialect identification are also max(|t1 |, |t2 |) supported in the interface. MADARi assigns initial answers to the new text using MADAMIRA in EGY mode, whose Above, t1 , t2 are the two word tokens, and Lev is the Lev- databases we extended with CALIMAGLF for more cover- enshtein distance at the character level. We employ this age. MADARi has many utilities to facilitate the annotation character-based weighing scheme to encourage the align- process that we utilize for more efficiency, of which exam- ment of words with spelling changes. Using the same IAA ples are discussed in the next subsection. Figure 1 shows a measure, we measured the similarity between each annota- screenshot of the annotation view in MADARi. tor and the initial answers from the CALIMAGLF -extended MADAMIRA. 5.2. Manual Annotation The results are presented in Table 4 in terms of percent The annotator starts on an automatically pre-annotated doc- agreement. MADAMIRA provided a very helpful starting ument. They carefully examine the spelling of each word point. In at least 75% of the case, annotators agreed with 3843
Raw sentence ½J®JK. ð úæ úk. AK. AÓ é
ranging between 94.7% to more than 99% for different an- Elfardy, H. and Diab, M. (2012). Token Level Identifica- notation tasks. In the future, we plan to expand the anno- tion of Linguistic Code Switching. Proceedings of COL- tated text to include other genres and dialects. We are also ING 2012: Posters, pages 287–296. interested in using the annotations to improve the quality Eskander, R., Habash, N., Rambow, O., and Pasha, A. of Arabic dialect POS tagging and morphological disam- (2016). Creating resources for Dialectal Arabic from a biguation. single annotation: A case study on Egyptian and Levan- tine. In Proceedings of COLING 2016, the 26th Interna- 7. Acknowledgements tional Conference on Computational Linguistics: Tech- This project is funded by a New York University Abu Dhabi nical Papers, pages 3455–3465, Osaka, Japan. Research Enhancement Fund. We would like to thank Gadalla, H., Kilany, H., Arram, H., Yacoub, A., El- Ramy Eskander and the team of annotators at Ramitechs. Habashi, A., Shalaby, A., Karins, K., Rowson, E., Mac- We also thank the creators of MADARi from the MADAR Intyre, R., Kingsbury, P., Graff, D., and McLemore, C. project. Finally we thank Sondos Krouna for insightful dis- (1997). CALLHOME Egyptian Arabic Transcripts. In cussions on POS decisions. Linguistic Data Consortium, Philadelphia. Graff, D., Maamouri, M., Bouziri, B., Krouna, S., Kulick, Bibliographical References S., and Buckwalter, T. (2009). Standard Arabic Morpho- Abuata, B. and Al-Omari, A. (2015). A Rule-based Stem- logical Analyzer (SAMA) Version 3.1. Linguistic Data mer for Arabic Gulf Dialect. Journal of King Saud Uni- Consortium LDC2009E73. versity - Computer and Information Sciences, 27(2):104 Habash, N. and Rambow, O. (2006). MAGEAD: A Mor- – 112. phological Analyzer and Generator for the Arabic Di- Al-Sabbagh, R. and Girju, R. (2012). A Supervised POS alects. In Proceedings of ACL, pages 681–688, Sydney, Tagger for Written Arabic Social Networking Corpora. Australia. In Jeremy Jancsary, editor, Proceedings of KONVENS Habash, N., Soudi, A., and Buckwalter, T. (2007). On Ara- 2012, pages 39–52. ÖGAI, September. Main track: oral bic Transliteration. In A. van den Bosch et al., editors, presentations. Arabic Computational Morphology: Knowledge-based Al-Shargi, F., Kaplan, A., Eskander, R., Habash, N., and and Empirical Methods. Springer. Rambow, O. (2016). A Morphologically Annotated Habash, N., Rambow, O., and Roth, R. (2009). Corpus and a Morphological Analyzer for Moroccan and MADA+TOKAN: A toolkit for Arabic tokenization, dia- Sanaani Yemeni Arabic. In Proceedings of the Interna- critization, morphological disambiguation, POS tagging, tional Conference on Language Resources and Evalua- stemming and lemmatization. In Khalid Choukri et al., tion (LREC), Portorož, Slovenia. editors, Proceedings of the Second International Con- Alkuhlani, S. and Habash, N. (2011). A Corpus for Mod- ference on Arabic Language Resources and Tools. The eling Morpho-Syntactic Agreement in Arabic: Gender, MEDAR Consortium, April. Number and Rationality. In Proceedings of the 49th An- nual Meeting of the Association for Computational Lin- Habash, N., Eskander, R., and Hawwari, A. (2012a). guistics (ACL’11), Portland, Oregon, USA. A Morphological Analyzer for Egyptian Arabic. In Bouamor, H., Habash, N., and Oflazer, K. (2014). A NAACL-HLT 2012 Workshop on Computational Mor- Multidialectal Parallel Corpus of Arabic. In Proceed- phology and Phonology (SIGMORPHON2012), pages ings of the Ninth International Conference on Language 1–9. Resources and Evaluation (LREC-2014). European Lan- Habash, N., Diab, M. T., and Rambow, O. (2012b). Con- guage Resources Association (ELRA). ventional Orthography for Dialectal Arabic. In LREC, Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., pages 711–718. Rambow, O., Abdulrahim, D., Obeid, O., Khalifa, S., Habash, N., Khalifa, S., Eryani, F., Rambow, O., Ab- Eryani, F., Erdmann, A., and Oflazer, K. (2018). The dulrahim, D., Erdmann, A., Faraj, R., Zaghouani, W., MADAR Arabic Dialect Corpus and Lexicon. In Pro- Bouamor, H., Zalmout, N., Hassan, S., shargi, F. A., ceedings of the International Conference on Language Alkhereyf, S., Abdulkareem, B., Eskander, R., Salameh, Resources and Evaluation (LREC 2018), May. M., and Saddiki, H. (2018). Unified Guidelines and Re- Cotterell, R. and Callison-Burch, C. (2014). A multi- sources for Arabic Dialect Orthography. In Proceedings dialect, multi-genre corpus of informal written Arabic. of the International Conference on Language Resources In LREC, pages 241–245. and Evaluation (LREC 2018), May. Diab, M., Habash, N., Rambow, O., AlTantawy, M., and Habash, N. (2010). Introduction to Arabic Natural Lan- Benajiba, Y. (2010). Colaba: Arabic dialect annotation guage Processing. Morgan & Claypool Publishers. and processing. In Proceedings of the LREC Workshop Halefom, G., Leung, T., and Ntelitheos, D. (2013). A cor- for Language Resources (LRs) and Human Language pus of Emirati Arabic. Technical Report NRF Grant (31 Technologies (HLT) for Semitic Languages: Status, Up- H001), United Arab Emirates University. dates, and Prospects. Inoue, G., Shindo, H., and Matsumoto, Y. (2017). Diab, M., Hawwari, A., Elfardy, H., Dasigi, P., Al- Joint Prediction of Morphosyntactic Categories for Fine- Badrashiny, M., Eskander, R., and Habash, N. (Forth- Grained Arabic Part-of-Speech Tagging Exploiting Tag coming – 2013). Tharwa: A Multi-Dialectal Multi- Dictionary Information. In Proceedings of the 21st Con- Lingual Machine Readable Dictionary. ference on Computational Natural Language Learning 3845
(CoNLL 2017), pages 421–431, Vancouver, Canada, Au- (LREC-2014). European Language Resources Associa- gust. Association for Computational Linguistics. tion (ELRA). Jarrar, M., Habash, N., Akra, D., and Zalmout, N. (2014). McNeil, K. and Faiza, M. (2011). Tunisian Arabic Corpus Building a Corpus for Palestinian Arabic: a Preliminary : Creating a Written Corpus of an "Unwritten" Language. Study. ANLP 2014, page 18. Ntelitheos, D. and Idrissi, A. (2017). Language Growth in Jarrar, M., Habash, N., Alrimawi, F., Akra, D., and Zal- Child Emirati Arabic. Hamid Ouali (ed.) Perspectives on mout, N. (2016). Curras: An Annotated Corpus for Arabic Linguistics 29, pages 229–248. the Palestinian Arabic dialect. Language Resources and Obeid, O., Khalifa, S., Habash, N., Bouamor, H., Za- Evaluation, pages 1–31. ghouani, W., and Oflazer, K. (2018). MADARi: A Khalifa, S., Habash, N., Abdulrahim, D., and Hassan, S. Web Interface for Joint Arabic Morphological Anno- (2016a). A Large Scale Corpus of Gulf Arabic. In Pro- tation and Spelling Correction. In Proceedings of the ceedings of the International Conference on Language International Conference on Language Resources and Resources and Evaluation (LREC), Portorož, Slovenia. Evaluation (LREC 2018), May. Khalifa, S., Zalmout, N., and Habash, N. (2016b). YA- Pasha, A., Al-Badrashiny, M., Kholy, A. E., Eskander, R., MAMA: Yet Another Multi-Dialect Arabic Morpholog- Diab, M., Habash, N., Pooleery, M., Rambow, O., and ical Analyzer. In Proceedings of the International Con- Roth, R. (2014). MADAMIRA: A Fast, Comprehensive ference on Computational Linguistics (COLING): Sys- Tool for Morphological Analysis and Disambiguation of tem Demonstrations, pages 223–227. Arabic. In In Proceedings of LREC, Reykjavik, Iceland. Khalifa, S., Hassan, S., and Habash, N. (2017). A Morpho- Shoufan, A. and Al-Ameri, S. (2015). Natural Language logical Analyzer for Gulf Arabic Verbs. WANLP 2017 Processing for Dialectical Arabic: A Survey. In ANLP (co-located with EACL 2017), page 35. Workshop 2015, page 36. Kilany, H., Gadalla, H., Arram, H., Yacoub, A., El- Smaïli, K., Abbas, M., Meftouh, K., and Harrat, S. (2014). Habashi, A., and McLemore, C. (2002). Egyp- Building resources for Algerian Arabic dialects. In 15th tian Colloquial Arabic Lexicon. LDC catalog number Annual Conference of the International Communication LDC99L22. Association Interspeech. Smrž, O. (2007). ElixirFM — Implementation of Func- Maamouri, M., Bies, A., Buckwalter, T., and Mekki, W. tional Arabic Morphology. In Proceedings of the 2007 (2004). The Penn Arabic Treebank: Building a Large- Workshop on Computational Approaches to Semitic Lan- Scale Annotated Arabic Corpus. In NEMLAR Confer- guages: Common Issues and Resources, pages 1–8, ence on Arabic Language Resources and Tools, pages Prague, Czech Republic, June. ACL. 102–109, Cairo, Egypt. Zaghouani, W. (2014). Critical Survey of the Freely Avail- Maamouri, M., Bies, A., Buckwalter, T., Diab, M., Habash, able Arabic Corpora. In Proceedings of the Workshop N., Rambow, O., and Tabessi, D. (2006). Developing on Free/Open-Source Arabic Corpora and Corpora Pro- and Using a Pilot Dialectal Arabic Treebank. In Pro- cessing Tools, LREC, pages 1–8. ceedings of the Fifth International Conference on Lan- Zaidan, O. F. and Callison-Burch, C. (2011). The Arabic guage Resources and Evaluation, LREC 2006. Online Commentary Dataset: an Annotated Dataset of Maamouri, M., Bies, A., Krouna, S., Gaddeche, F., and Informal Arabic With High Dialectal Content. In Pro- Bouziri, B., (2009). Penn Arabic Treebank Guidelines. ceedings of the 49th Annual Meeting of the Association Linguistic Data Consortium. for Computational Linguistics: Human Language Tech- Maamouri, M., Bies, A., Kulick, S., Tabessi, D., nologies: short papers-Volume 2, pages 37–41. and Krouna, S. (2012a). Egyptian Arabic Tree- Zalmout, N. and Habash, N. (2017). Don’t Throw Those bank DF Parts 1-8 V2.0 - LDC catalog num- Morphological Analyzers Away Just Yet: Neural Mor- bers LDC2012E93, LDC2012E98, LDC2012E89, phological Disambiguation for Arabic. In Proceedings LDC2012E99, LDC2012E107, LDC2012E125, of the 2017 Conference on Empirical Methods in Natu- LDC2013E12, LDC2013E21. ral Language Processing, pages 704–713. Maamouri, M., Krouna, S., Tabessi, D., Hamrouni, N., and Zalmout, N., Erdmann, A., and Habash, N. (2018). Habash, N. (2012b). Egyptian Arabic Morphological Noise-Robust Morphological Disambiguation for Di- Annotation Guidelines. alectal Arabic. In Proceedings of the 26th Annual Con- Maamouri, M., Bies, A., Kulick, S., Ciul, M., Habash, N., ference of the North American Chapter of the Associa- and Eskander, R. (2014). Developing an egyptian arabic tion for Computational Linguistics: Human Language treebank: Impact of dialectal morphology on annotation Technologies Conference (HLT-NAACL2018). and tool development. In Proceedings of the Ninth Inter- Zeroual, I. and Lakhouaja, A., (2018). Arabic Corpus national Conference on Language Resources and Evalu- Linguistics: Major Progress, But Still A Long Way to ation (LREC-2014). European Language Resources As- Go, pages 613–636. Springer International Publishing, sociation (ELRA). Cham. Masmoudi, A., Ellouze Khmekhem, M., Esteve, Y., Zribi, I., Ellouze, M., Belguith, L. H., and Blache, P. Hadrich Belguith, L., and Habash, N. (2014). A cor- (2015). Spoken Tunisian Arabic Corpus" STAC": Tran- pus and phonetic dictionary for tunisian arabic speech scription and Annotation. Research in computing sci- recognition. In Proceedings of the Ninth International ence, 90:123–135. Conference on Language Resources and Evaluation 3846
You can also read