CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 7022–7032, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann, Nizar Habash
Computational Approaches to Modeling Language (CAMeL) Lab, New York University Abu Dhabi, UAE
{oobeid,nasser.zalmout,salamkhalifa,dima.taji,mai.oudah,alhafni,go.inoue,fadhl.eryani,ae1541,nizar.habash}@nyu.edu

Abstract

We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. In this paper, we describe the design of CAMeL Tools and the functionalities it provides.

Keywords: Arabic, Open Source, Morphology, Named Entity Recognition, Sentiment Analysis, Dialect Identification

1. Introduction

Over the last two decades, there have been many efforts to develop resources to support Arabic natural language processing (NLP). Some of these resources target specific NLP tasks such as tokenization, diacritization or sentiment analysis, while others attempt to target multiple tasks jointly. Most resources focus on a specific variety of Arabic, such as Modern Standard Arabic (MSA) or Egyptian Arabic. These resources also vary in terms of the programming languages they use, the types of interfaces they provide, the data representations and standards they utilize, and the degree of public availability (e.g., open-source, commercial, unavailable). These variations make it difficult to write new applications that combine multiple resources.

To address many of these issues, we present CAMeL Tools, an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.

The rest of the paper is organized as follows. We present some background on the difficulty of processing Arabic text (Section 2), and then discuss previous work on a variety of Arabic NLP tasks (Section 3). We describe the design and implementation of CAMeL Tools (Section 4) and provide more details about each of its components (Sections 6 to 9). Finally, we discuss some future additions to CAMeL Tools (Section 10).
2. Arabic Linguistics Background

Aside from obvious issues such as resource poverty, Arabic poses a number of challenges to NLP: orthographic ambiguity, morphological richness, dialectal variation, and orthographic noise (Habash, 2010). While these issues are not necessarily unique to Arabic, their combination makes Arabic processing particularly challenging.

Orthographic Ambiguity. Arabic is generally written using the Arabic Abjad script, which uses optional diacritical marks for short vowels and consonantal gemination. While the missing diacritics are not a major challenge to literate native adults, their absence is the main source of ambiguity in Arabic NLP.

Morphological Richness. Arabic has a rich inflectional morphology system involving gender, number, person, aspect, mood, case, state and voice, in addition to the cliticization of a number of pronouns and particles (conjunctions, prepositions, the definite article, etc.). This richness leads to MSA verbs with upwards of 5,400 forms.

Dialectal Variation. While MSA is the official language of culture and education in the Arab World, it is no one's native tongue. A number of different local dialects (such as Egyptian, Levantine, and Gulf) are the de facto daily languages of the Arab World, both off-line and on-line. Arabic dialects differ significantly in terms of their phonology, morphology and lexicon from each other and from MSA, to the point that using MSA tools and resources for processing dialects is not sufficient. For example, Khalifa et al. (2016a) report that using a state-of-the-art tool for MSA morphological disambiguation on Gulf Arabic returns a POS tag accuracy of about 72%, compared to 96% on MSA (Pasha et al., 2014).

Orthographic Inconsistency. MSA and Arabic dialects, as encountered online, show a lot of spelling inconsistency. Zaghouani et al. (2014) report that 32% of words in MSA comments online have spelling errors. Habash et al. (2018) presented 27 encountered ways to write an Egyptian Arabic word meaning 'he does not say it', e.g., mbyqwlhAš (≈ 26,000 times), mbyqlhAš (≈ 1,000), and mbŵlhAš (less than 10).[1] Habash et al. (2018) proposed a conventional orthography for dialectal Arabic (CODA), a set of guidelines for consistent spelling of Arabic dialects for NLP. In addition, dialectal Arabic text is also known to appear on social media in a non-standard romanization called Arabizi (Darwish, 2014).

[1] Arabic transliteration is presented in the Habash-Soudi-Buckwalter scheme (Habash et al., 2007).
Table 1: An example showing the outputs of the different CAMeL Tools components. The Arabic input is shown here in Habash-Soudi-Buckwalter transliteration; the original Arabic-script rendering is not reproduced.

  (a) Input           AlHmd llh ǍyTAlyA xrjt mn tSfyAt kÂs AlςAlm, Aðn dý frStnA w mAln___Aš Hjh̄ ...
  (b) English         Thank God Italy is out of the world cup, I think this is our chance and we do not have an excuse
  (c) Pre-processing  AlHmd llh AyTAlyA xrjt mn tSfyAt kAs AlEAlm , AZn dy frStnA wmAlnA$ Hjh ...
  (d) Morphological analysis of frStnA (three of its readings):
        Diac:          furSitnA       faraS∼atnA       furSatunA
        Gloss:         Opportunity    Compress         Holiday
        Lemma:         furSah̄         raS∼             furSah̄
        Tokenization:  furSah̄ +nA     fa+ raS∼at +nA   furSah̄u +nA
        POS:           noun           verb             noun
        Dialect:       Egyptian       MSA              MSA
  (e) Disambiguation  ✓ (the first reading, furSitnA, is selected)
  (f) Sentiment       positive
  (g) NER             LOC: ǍyTAlyA 'Italy'
  (h) Dialect ID      Cairo: 95.0%, Doha: 1.9%, Jeddah: 1.5%, Aswan: 0.8%, Khartoum: 0.3%, Other: 0.5%, MSA: 0.0%

The above phenomena lead to highly complex readings that vary across many dimensions. Table 1(d) shows three analyses with different lemma readings and features for the word frStnA. Two of the three readings have the same POS (noun) but vary in terms of the English gloss and dialect. The third reading, a verb, has a different meaning, morphological analysis and tokenization.

3. Previous Work

Arabic NLP, and NLP in general, involves many tasks, some of which may be related to others. Tasks like morphological modeling, parsing, tokenization, segmentation, lemmatization, stemming and named entity recognition (NER) tend to serve as building blocks for higher level applications such as translation, spelling correction, and question-answering systems. Other tasks, such as dialect identification (DID) and sentiment analysis (SA), aim to compute different characteristics of input text, either for analytical purposes or as additional input for some of the previously mentioned tasks. In this section we discuss those tasks currently addressed by CAMeL Tools. We also discuss some of the software suites that have inspired the design of CAMeL Tools.

3.1. NLP Tasks

We briefly survey the different efforts for a variety of Arabic NLP tasks. Specifically, we discuss those tasks that are relevant to CAMeL Tools at the time of publishing. These include morphological modeling, NER, DID and SA.

Morphological Modeling. Efforts on Arabic morphological modeling have varied over a number of dimensions, ranging from very abstract and linguistically rich representations paired with word-form derivation rules (Beesley et al., 1989; Beesley, 1996; Habash and Rambow, 2006; Smrž, 2007) to pre-compiled representations of the different components needed for morphological analysis (Buckwalter, 2002; Graff et al., 2009; Habash, 2007). Previous efforts also show a wide range in the depth of information that morphological analyzers can produce, from very shallow analyzers on one end (Vanni and Zajac, 1996) to analyzers producing form-based, functional, and lexical features as well as morpheme forms (Buckwalter, 2002; Smrž, 2007; Boudlal et al., 2010; Alkuhlani and Habash, 2011; Boudchiche et al., 2017).
Systems such as MADA (Habash et al., 2009) and MADAMIRA (Pasha et al., 2014) disambiguate the analyses that are produced by a morphological analyzer. Zalmout and Habash (2017) outline a neural model that follows the same approach. Alternatively, Farasa (Abdelali et al., 2016) relies on probabilistic models to produce high tokenization accuracy, and YAMAMA (Khalifa et al., 2016b) uses the same approach to produce MADAMIRA-like disambiguated analyses. For Arabic dialects, Zalmout et al. (2018) present a neural system that does morphological tagging and disambiguation for Egyptian Arabic.

Named Entity Recognition. While some Arabic NER systems have used rule-based approaches (Shaalan and Raza, 2009; Zaghouani, 2012; Aboaoga and Aziz, 2013), most recent Arabic NER systems integrate machine learning in their architectures (Benajiba and Rosso, 2008; AbdelRahman et al., 2010; Mohammed and Omar, 2012; Oudah and Shaalan, 2012; Darwish, 2013; El Bazi and Laachfoubi, 2019). Benajiba and Rosso (2007) developed ANERsys, one of the earliest NER systems for Arabic. They built their own linguistic resources, which have become standard in the Arabic NER literature: ANERcorp (an annotated corpus for Person, Location and Organization names) and ANERgazet (Person, Location and Organization gazetteers). In CAMeL Tools, we make use of all publicly available training data and compare our system to the work of Benajiba and Rosso (2008).

Dialect Identification. Salameh et al. (2018) introduced a fine-grained DID system covering the dialects of 25 cities from several countries across the Arab world. Elfardy and Diab (2012) presented a set of guidelines for token-level identification of dialectness. They later proposed a supervised approach for identifying whether a given sentence is prevalently MSA or Egyptian (Elfardy and Diab, 2013) using the Arabic Online Commentary Dataset (Zaidan and Callison-Burch, 2011). Their system (Elfardy and Diab, 2012) combines a token-level DID approach with other features to train a Naive Bayes classifier. Sadat et al. (2014) presented a bi-gram character-level model to identify the dialect of sentences in the social media context among dialects of 18 Arab countries. More recently, discriminating between Arabic dialects has been the goal of a dedicated shared task (Zampieri et al., 2017), encouraging researchers to submit systems that recognize the dialect of speech transcripts, along with acoustic features, for dialects of four main regions (Egyptian, Gulf, Levantine and North African) in addition to MSA. The dataset used in these tasks differs from the dataset we use in this work in its genre, size and the dialects covered.
Sentiment Analysis. Arabic SA is a well studied problem with many proposed solutions. These solutions span a wide array of methods, from lexicon-based conventional machine learning models (Badaro et al., 2014; Abdul-Mageed and Diab, 2014) to deep learning models (Abu Farha and Magdy, 2019; Badaro et al., 2018; Baly et al., 2017). More recently, fine-tuning large pre-trained language models has achieved state-of-the-art results on various NLP tasks. ElJundi et al. (2019) developed a universal language model for Arabic (hULMonA), which was pre-trained on a large corpus of Wikipedia sentences, and compared its performance to multilingual BERT (mBERT) (Devlin et al., 2018) on the task of Arabic SA after fine-tuning. They showed that although mBERT was only trained on Modern Standard Arabic (MSA), it can still achieve state-of-the-art results on some datasets if it is fine-tuned on dialectal Arabic data. Furthermore, Antoun et al. (2020) pre-trained an Arabic-specific BERT (AraBERT) and were able to achieve state-of-the-art results on several downstream tasks.

3.2. Software Suites

While most efforts focus on individual tasks, there are a few that provide multiple capabilities in the form of a unified toolkit. These can be classified as Arabic-specific, such as MADAMIRA (Pasha et al., 2014) and Farasa (Abdelali et al., 2016; Darwish and Mubarak, 2016), or multilingual, such as Stanford CoreNLP (Manning et al., 2014). Below we discuss the capabilities of these tools and provide a rough comparison in Table 2.

Table 2: Feature comparison of CAMeL Tools, MADAMIRA, Stanford CoreNLP and Farasa.

                                        CAMeL Tools        MADAMIRA         Stanford CoreNLP            Farasa
  Suite Type                            Arabic Specific    Arabic Specific  Multilingual                Arabic Specific
  Language                              Python             Java             Java with Python bindings   Java
  CLI                                   ✓                  ✓                ✓                           ✓
  API                                   ✓                  ✓                ✓                           ✓
  Exposed Pre-processing                ✓                  –                –                           –
  Morphological Modeling                ✓                  ✓                –                           –
  Morphological Disambiguation          ✓                  ✓                ✓                           ✓
  POS Tagging                           ✓                  ✓                ✓                           ✓
  Diacritization                        ✓                  ✓                –                           ✓
  Tokenization/Segmentation/Stemming    ✓                  ✓                ✓                           ✓
  Lemmatization                         ✓                  ✓                –                           ✓
  Named Entity Recognition              ✓                  ✓                ✓                           ✓
  Sentiment Analysis                    ✓                  –                –                           –
  Dialect ID                            ✓                  –                –                           –
  Parsing                               work in progress   –                ✓                           ✓

MADAMIRA provides a single, yet configurable, mode of operation revolving around its morphological analyzer. When disambiguation is enabled, features such as POS tagging, tokenization, segmentation, lemmatization, NER and base-phrase chunking become available. MADAMIRA was designed specifically for Arabic, supports both MSA and Egyptian, and primarily provides a CLI, a server mode, and a Java API.

Farasa is a collection of Java libraries and CLIs for MSA.[2] These include separate tools for diacritization, segmentation, POS tagging, parsing, and NER. This allows for a more flexible usage of components compared to MADAMIRA.

Stanford CoreNLP is a multilingual Java library, CLI and server providing multiple NLP components with varying support for different languages. Arabic support is provided for parsing, tokenization, POS tagging, sentence splitting and NER (a rule-based system using regular expressions). An official Python library is also available (Qi et al., 2018), which provides official bindings to CoreNLP as well as neural implementations of some of its components. It is worth noting that the commonly used Natural Language Toolkit (NLTK) (Loper and Bird, 2002) provides Arabic support through unofficial bindings to Stanford CoreNLP.

[2] http://qatsdemo.cloudapp.net/farasa/
4. Design and Implementation

CAMeL Tools is an open-source package consisting of a set of Python APIs with accompanying command-line tools that thinly wrap these APIs. In this section we discuss the design decisions behind CAMeL Tools and how it was implemented.

4.1. Motivation

There are two main issues with currently available tools that prompted us to develop CAMeL Tools and influenced its design:
Fragmented Packages and Standards. Tools tend to focus on one task with no easy way to glue together different packages. Some tools do provide APIs, making this easier, but others only provide command-line tools. This leads to overhead writing glue code, including interfaces to different packages and parsers for different output formats.

Lack of Flexibility. Most tools don't expose intermediate functionality. This makes it difficult to reuse intermediate output to reduce redundant computation. Additionally, functions for pre-processing input don't tend to be exposed to users, which makes it difficult to replicate any pre-processing performed internally.

4.2. Design Philosophy

Here we identify the driving principles behind the design of CAMeL Tools. These have been largely inspired by the designs of MADAMIRA, Farasa, CoreNLP and NLTK, and include our own requirements.

Arabic Specific. We want a toolkit that focuses solely on Arabic NLP. Language-agnostic tools don't tend to model the complexities of Arabic text, or indeed of other morphologically complex languages, very well; doing so would increase software complexity and development time.

Flexibility. Provide flexible components that allow mixing and matching rather than large monolithic applications. We want CAMeL Tools to be usable both by application developers requiring only high-level access to components and by researchers who might want to experiment with new implementations of components or their sub-components.

Modularity. Different implementations of the same component type should be slot-in replacements for one another. This allows for easier experimentation as well as allowing users to provide custom implementations as replacements for built-in components.

Performance. Components should provide close to state-of-the-art results within reasonable run-time performance and memory usage. However, we don't want this to come at the cost of code readability and maintainability, and we are willing to sacrifice some performance benefits to this end.

Ease of Use. Using individual components shouldn't take more than a few lines of code to get started with. We aim to provide meaningful default configurations that are generally applicable in most situations.

4.3. Implementation

CAMeL Tools is implemented in Python 3 and can be installed through pip.[3] We chose Python due to its ease of use and its pervasiveness in NLP and machine learning, along with libraries such as TensorFlow, PyTorch, and scikit-learn. We aim to be compatible with Python 3.5 and later, running on Linux, macOS, and Windows. CAMeL Tools is in continuous development, with new features being added and older ones being improved. As such, it is difficult to accurately report on the performance of each component. In this paper, we report the state of CAMeL Tools at the time of publishing; updates will be published on a dedicated web page.[4]

At the moment, CAMeL Tools provides utilities for pre-processing, morphological analysis and disambiguation, DID, NER and sentiment analysis. In the following sections, we discuss these components in more detail.

[3] Installation instructions and documentation can be found on https://github.com/CAMeL-Lab/camel_tools
[4] https://camel-lab.github.io/camel_tools_updates/
5. Pre-processing Utilities

CAMeL Tools provides a set of pre-processing utilities that are common in Arabic NLP but get re-implemented frequently. Different tools and packages tend to do slightly different pre-processing steps that are often not well documented or exposed for use on their own. By providing these utilities as part of the package, we hope to reduce the overhead of writing Arabic NLP applications and ensure that pre-processing is consistent from one project to another. In this section, we discuss the different pre-processing utilities that come with CAMeL Tools.

Transliteration. When working with Arabic text, it is sometimes convenient or even necessary to use alternate transliteration schemes. Buckwalter transliteration (Buckwalter, 2002) and its variants, Safe Buckwalter and XML Buckwalter, for example, map the Arabic character set into ASCII representations using a one-to-one map. These transliterations are used as input or output formats in Arabic NLP tools such as MADAMIRA and SAMA, and as the chosen encoding of various corpora such as the SAMA databases and the Columbia Arabic Treebank (CATiB) (Habash and Roth, 2009). Habash-Soudi-Buckwalter (HSB) (Habash et al., 2007) is another variant of the Buckwalter transliteration scheme that includes some non-ASCII characters whose pronunciation is easier to remember for non-Arabic speakers. We provide transliteration functions to and from the following schemes: Arabic script, Buckwalter, Safe Buckwalter, XML Buckwalter, and Habash-Soudi-Buckwalter. Additionally, we provide utilities for users to define their own transliteration schemes.
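Because these schemes are one-to-one character maps, a transliterator essentially reduces to a table lookup. The following is a minimal sketch of that idea; it is not the actual CAMeL Tools API, and the mapping table shown is deliberately partial and for illustration only:

# Minimal sketch of a one-to-one transliteration map (Arabic script -> Buckwalter).
# Partial, illustrative table; the real CAMeL Tools transliteration utilities and
# their full character tables may differ.
AR2BW = {
    '\u0627': 'A',   # bare alif
    '\u0628': 'b',   # beh
    '\u062A': 't',   # teh
    '\u0629': 'p',   # teh marbuta
    '\u062D': 'H',   # hah
    '\u062F': 'd',   # dal
    '\u0644': 'l',   # lam
    '\u0645': 'm',   # meem
    '\u0646': 'n',   # noon
    '\u0647': 'h',   # heh
    '\u064E': 'a',   # fatha (short vowel diacritic)
    '\u064F': 'u',   # damma
    '\u0650': 'i',   # kasra
}

def transliterate(text, table=AR2BW):
    """Map each character through the table, leaving unknown characters unchanged."""
    return ''.join(table.get(c, c) for c in text)

print(transliterate('\u0627\u0644\u062D\u0645\u062F'))  # -> 'AlHmd'

Defining a custom scheme in this style amounts to supplying a different mapping table, which is the kind of user-defined transliteration the toolkit is described as supporting.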
Orthographic Normalization. Due to Arabic's complex morphology, it is necessary to normalize text in various ways in order to reduce noise and sparsity (Habash, 2010). The most common of these normalizations are:

• Unicode normalization, which includes breaking up combined sequences (e.g., the Lam-Alif ligature into Lam and Alif), converting character variant forms to a single canonical form (e.g., the positional presentation forms of Ain to a single Ain), and converting extensions of the Arabic character set used for Persian and Urdu to the closest Arabic character (e.g., Gaf to Kaf).

• Dediacritization, which removes Arabic diacritics; these occur infrequently in Arabic text and tend to be considered noise. They include the short vowels, the shadda (gemination marker), and the dagger alif (e.g., mudar∼isah̄u to mdrsh̄).

• Removal of unnecessary characters, including those with no phonetic value, such as the tatweel (kashida) character.

• Letter variant normalization for letters that are often misspelled. This includes normalizing all forms of Hamzated Alif (Â, Ǎ, Ā) to bare Alif (A), the Alif-Maqsura (ý) to Ya (y), the Ta-Marbuta (h̄) to Ha (h), and the non-Alif forms of Hamza (ŵ, ŷ) to the Hamza letter (’).

Table 1(c) illustrates how these pre-processing utilities can be applied to an Arabic sentence. Specifically, we apply Arabic-to-Buckwalter transliteration, letter variant normalization, and tatweel removal.
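As a rough illustration of what these steps involve, the sketch below implements dediacritization, tatweel removal, and the letter variant normalizations listed above directly with Python string operations. It is illustrative only and is not the CAMeL Tools implementation:

import re

# Illustrative normalization helpers; not the actual CAMeL Tools code.
# Arabic diacritics (fathatan through sukun) plus the dagger alif.
_DIACRITICS = re.compile(r'[\u064B-\u0652\u0670]')
_TATWEEL = '\u0640'

# Letter variant normalization: Hamzated Alifs -> bare Alif, Alif-Maqsura -> Ya,
# Ta-Marbuta -> Ha, Waw/Ya carrying Hamza -> bare Hamza.
_LETTER_VARIANTS = str.maketrans({
    '\u0622': '\u0627',  # Alif with madda above  -> Alif
    '\u0623': '\u0627',  # Alif with hamza above  -> Alif
    '\u0625': '\u0627',  # Alif with hamza below  -> Alif
    '\u0649': '\u064A',  # Alif-Maqsura           -> Ya
    '\u0629': '\u0647',  # Ta-Marbuta             -> Ha
    '\u0624': '\u0621',  # Waw with hamza         -> Hamza
    '\u0626': '\u0621',  # Ya with hamza          -> Hamza
})

def normalize(text: str) -> str:
    text = _DIACRITICS.sub('', text)         # dediacritization
    text = text.replace(_TATWEEL, '')        # tatweel (kashida) removal
    return text.translate(_LETTER_VARIANTS)  # letter variant normalization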
6. Morphological Modeling

Morphology is one of the most studied aspects of the Arabic language. The richness of Arabic makes the modeling of morphology a complex task, justifying the focus put on morphological analysis functionalities.

6.1. Analysis, Generation, and Reinflection

As part of CAMeL Tools, we currently provide implementations of the CALIMAStar analyzer, generator, and reinflector described by Taji et al. (2018). These components operate on databases that are in the same format as the ALMORGEANA database (Habash, 2007), which contains three lexicon tables for prefixes, suffixes, and stems, and three compatibility tables for prefix-suffix, prefix-stem, and stem-suffix combinations. CAMeL Tools includes databases for MSA as well as the Egyptian and Gulf dialects.

Analysis. For our tool's purposes, we define analysis as the identification of all the different readings (analyses) of a word out of context. Each of these readings is defined by lexical features such as lemma, gloss, and stem, and morphological features such as POS tag, gender, number, case, and mood. The analyzer expects a word as input and tries to match a prefix, stem, and suffix from the database lexicon tables to the surface form of the word, while maintaining the constraints imposed by the compatibility tables. This follows the algorithm described by Buckwalter (2002) with some enhancements. These enhancements, along with the switch of encoding to Arabic script, allow our analyzer to better detect and handle foreign words, punctuation, and digits.

Generation. Generation is the task of inflecting a lemma for a set of morphological features. The algorithm we use for generation follows the description of the generation component in ALMOR (Habash, 2007). The generator expects, minimally, a lemma and a POS tag, and produces a list of inflected words with their full analyses. If inflectional features, such as person, gender, and case, are specified in the input, we limit the generated output to those values; we generate all possible values for the inflectional features that are not specified. Clitics, being optional features, are only generated when they are specified.

Reinflection. Reinflection differs from generation in that the input to the reinflector is a fully inflected word, along with a list of morphological features. The reinflector is not limited to a specific lemma or POS tag, but rather uses the analyzer component to generate all possible analyses of the input word, including the possible lemmas and POS tags. Next, the reinflector uses the list of features it received as input, along with the features from the analyzer, to create a new list of features that can be fed to the generator to produce a reinflection for every possible analysis of the input word.

In Table 1(d), we present an example of the morphological analysis which CAMeL Tools can provide. The analysis includes, but is not limited to, diacritization, the CAMeL Arabic Phonetic Inventory (CAPHI) representation, English gloss, tokenization, lemmatization, and POS tagging.

6.2. Disambiguation and Annotation

Morphological disambiguation is the task of choosing an analysis of a word in context. Tasks such as diacritization, POS tagging, tokenization and lemmatization can be considered forms of disambiguation. When dealing with Arabic, however, each of these tasks on its own does not fully disambiguate a word. In order to avoid confusion, we refer to the individual tasks on their own as annotation tasks. Traditionally, annotation tasks would be implemented independently of each other. We, however, chose to use the one-fell-swoop approach of Habash and Rambow (2005). This approach entails predicting a set of features for a given word, ranking its out-of-context analyses based on the predicted features, and finally choosing the top-ranked analysis. Annotation can then be achieved by retrieving the relevant feature from the top-ranked analysis.

We provide two built-in disambiguators: a simple low-cost disambiguator based on a maximum likelihood estimation (MLE) model, and a neural network disambiguator that provides improved disambiguation accuracy using multitask learning. Table 1(e) shows how these morphological disambiguators can disambiguate a word.

MLE Disambiguator. We built a simple disambiguation model based on YAMAMA (Khalifa et al., 2016b), where the main component is a lookup model mapping a word to its most probable full morphological analysis. For a word that does not appear in the lookup model, we use the morphological analyzer from Section 6.1 to generate all possible analyses for the word and choose the top-ranked analysis based on a pre-computed probabilistic score for the joint lemma and POS frequency. Both the lookup model and the pre-computed scores are based on the same training dataset for the given dialect.

Multitask Learning Disambiguator. We provide a simplified implementation of the neural multitask learning approach to disambiguation by Zalmout and Habash (2019). Instead of the LSTM-based (long short-term memory) language models used in the original system, we use unigram language models, without sacrificing much accuracy. This is done to increase run-time performance and decrease memory usage.
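The control flow of the MLE disambiguator can be sketched as follows. This is an illustrative re-implementation of the lookup-with-analyzer-fallback idea described above, not the CAMeL Tools code; the analyzer object and the score table are stand-ins for the real components:

# Illustrative sketch of YAMAMA-style MLE disambiguation (not the CAMeL Tools code).
# `lookup` maps a word to its most frequent full analysis seen in training data;
# `lemma_pos_score` maps (lemma, pos) pairs to pre-computed joint frequencies;
# `analyzer` is any object with an analyses(word) method returning feature dicts.

def mle_disambiguate(word, lookup, lemma_pos_score, analyzer):
    # Case 1: the word was seen in training; return its most probable stored analysis.
    if word in lookup:
        return lookup[word]
    # Case 2: back off to the morphological analyzer and rank its analyses
    # by the pre-computed joint lemma+POS frequency.
    analyses = analyzer.analyses(word)
    if not analyses:
        return None
    return max(analyses,
               key=lambda a: lemma_pos_score.get((a['lemma'], a['pos']), 0.0))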
We evaluated the CAMeL Tools disambiguators against MADAMIRA.[5] For the accuracy evaluation we used the Dev sets recommended by Diab et al. (2013) of the Penn Arabic Treebank (PATB parts 1, 2 and 3) (Maamouri et al., 2004) for MSA and the ARZ Treebank (Maamouri et al., 2014) for EGY.

Table 3 provides the evaluation on MSA, comparing MADAMIRA using the SAMA database (Graff et al., 2009), the CAMeL Tools multitask learning system,[6] and the CAMeL Tools MLE system. Table 4 provides the evaluation on EGY, comparing MADAMIRA using the CALIMA ARZ database (Habash et al., 2012), following the work described by Pasha et al. (2015) and Khalifa et al. (2016b), and the CAMeL Tools MLE system.

Table 3: Comparison of the performance of the CAMeL Tools multitask learning and MLE systems on MSA to MADAMIRA. The systems are evaluated on their accuracy in correctly predicting diacritics (DIAC), lemmas (LEX), part-of-speech (POS), the full set of predicted features (FULL), and the ATB tokenization.

  MSA        MADAMIRA   CAMeL Tools Multitask   CAMeL Tools MLE
  DIAC       87.7%      90.9%                   78.4%
  LEX        96.4%      95.4%                   95.7%
  POS        97.1%      97.2%                   95.5%
  FULL       85.6%      89.0%                   70.0%
  ATB TOK    99.0%      99.4%                   99.0%

Table 4: Comparison of the performance of the CAMeL Tools MLE system on Egyptian to MADAMIRA. The systems are evaluated on their accuracy in correctly predicting diacritics (DIAC), lemmas (LEX), part-of-speech (POS), the full set of predicted features (FULL), and the Buckwalter tag tokenization (BW TOK) (Khalifa et al., 2016b).

  EGY        MADAMIRA   CAMeL Tools MLE
  DIAC       82.8%      78.9%
  LEX        86.6%      87.8%
  POS        91.7%      91.8%
  FULL       76.4%      73.0%
  BW TOK     93.5%      92.8%

We also compare to Farasa's tokenizer (Abdelali et al., 2016); the accuracy of its ATB tokenization on the same MSA Dev set was 98%. However, this number is not fair to report for the sake of this evaluation because Farasa follows slightly different tokenization conventions. For example, the Ta (t) in words such as mdrsthA 'her school' is not reverted to its original Ta-Marbuta (h̄) form, unlike the convention we have adopted for ATB tokenization.

[5] MADAMIRA: released on April 03, 2017, version 2.1.
[6] The implementation we report on is an older version that uses TensorFlow 1.8. We are currently working on a new implementation using TensorFlow 2.1.

7. Dialect Identification

We provide an implementation of the dialect identification (DID) system described by Salameh et al. (2018). This system is able to distinguish between 25 Arabic city dialects and MSA. Table 1(h) shows an example of our system's output.

Approach. We train a Multinomial Naive Bayes (MNB) classifier that outputs 26 probability scores referring to the 25 cities and MSA. The model is fed with a suite of features covering word unigrams and character unigrams, bigrams and trigrams weighted by their Term Frequency-Inverse Document Frequency (TF-IDF) scores, combined with word-level and character-level language model scores for each dialect. We also train a secondary MNB model using this same setup that outputs six probability scores referring to the five city dialects and MSA for which the MADAR corpus provides additional training data. The output of the secondary model is used as additional features to the primary model.

Datasets. We train our system using the MADAR corpus (Bouamor et al., 2018), a large-scale collection of parallel sentences built to cover the dialects of 25 cities from the Arab World, in addition to MSA. It is composed of two sub-corpora: the first consists of 1,600 training sentences covering all 25 city dialects and MSA, and the other consists of 9,000 training sentences covering 5 city dialects and MSA. The first sub-corpus is used to train our primary model while the latter is used to train our secondary model. We use the full corpus to train our language models.

Experiments and Results. We evaluate our system using the test split of the MADAR corpus. Our system is able to distinguish between the 25 Arabic city dialects and MSA with an accuracy of 67.9% for sentences with an average length of 7 words, rising to 90% for sentences consisting of 16 words. Our DID system is used in the back-end component of ADIDA (Obeid et al., 2019) to compute the dialect probabilities of a given input and display them as a point-map or heat-map on top of a geographical map of the Arab world.
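A stripped-down version of the primary classifier described above can be assembled with scikit-learn as follows. The sketch keeps only the TF-IDF word and character n-gram features and omits the language-model scores and the secondary six-way model, and the toy data is a stand-in for the MADAR training sentences; it is an illustration under those assumptions, not the CAMeL Tools implementation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union

# Word unigrams plus character 1-3 grams, both TF-IDF weighted.
features = make_union(
    TfidfVectorizer(analyzer='word', ngram_range=(1, 1)),
    TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3)),
)
did_model = make_pipeline(features, MultinomialNB())

# Toy stand-in data; the real system is trained on MADAR corpus sentences
# labelled with one of the 25 city dialects or MSA (26 classes in total).
train_sents = ['izayyak eh el akhbar', 'shlonak shaku maku', 'kayfa haluka alyawma']
train_labels = ['CAIRO', 'BAGHDAD', 'MSA']
did_model.fit(train_sents, train_labels)

print(did_model.predict_proba(['izayyak ya basha']))  # one probability per dialect class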
8. Named Entity Recognition

Our NER system targets the ENAMEX classes of Person, Location and Organization. We employ a light-weight machine learning approach and exploit large annotated datasets with the intention of being generalizable to different contexts.

Approach. We adopt the common IOB (inside, outside, beginning) NER tagging format, with a number of empirically determined features including morphological features and POS tags, in addition to word-level, gazetteer, and contextual features. Random Forests empirically outperformed all the other algorithms we explored.

Datasets. We train on a combination of Arabic NER annotated datasets, including ANERcorp (∼150K words) (Benajiba and Rosso, 2007) and the ACE Multilingual Training Corpora (2003-2005, 2007) (∼200K words) (Mitchell et al., 2003; Mitchell et al., 2005; Walker et al., 2006; Song et al., 2014). We also employ three sets of gazetteers totaling around 10K unique entries, which were gathered from free online resources/tools and ANERgazet (Benajiba and Rosso, 2007).
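As a rough illustration of this kind of token-level IOB classification, the sketch below builds hand-crafted per-token features and trains a Random Forest with scikit-learn. The feature set and the tiny training sample are placeholders for the morphological, gazetteer, and contextual features described above; this is not the CAMeL Tools feature set or training pipeline:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    """Toy per-token features; the real system adds morphological, POS,
    gazetteer, and wider contextual features."""
    word = tokens[i]
    return {
        'word': word,
        'prefix1': word[:1],
        'suffix2': word[-2:],
        'prev_word': tokens[i - 1] if i > 0 else '<S>',
        'next_word': tokens[i + 1] if i < len(tokens) - 1 else '</S>',
    }

# Toy training data in IOB format: one (tokens, tags) pair, Buckwalter-transliterated.
sentences = [(['xrjt', 'AyTAlyA', 'mn', 'kAs', 'AlEAlm'],
              ['O', 'B-LOC', 'O', 'O', 'O'])]

X = [token_features(toks, i) for toks, tags in sentences for i in range(len(toks))]
y = [tag for _, tags in sentences for tag in tags]

ner_model = make_pipeline(DictVectorizer(), RandomForestClassifier(n_estimators=100))
ner_model.fit(X, y)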
Experiments and Results. We conducted a number of experiments to evaluate the performance of the proposed NER architecture, with Random Forests as the adopted ML algorithm. For training purposes, we use standard NE annotated datasets, including ANERcorp (∼150K words) and the ACE Multilingual Training Corpora (2003-2005, 2007)[7] (∼200K words). Each experiment includes a reference (i.e., gold-standard) dataset used for testing, including a subset of ANERcorp and ACE 2003 Newswire (NW).

Table 5 illustrates the results of training the model on 5/6 of ANERcorp and testing on the remaining 1/6, compared to the results of Benajiba and Rosso (2008)'s CRF-based system, which uses the same data split. The results show that our classifier significantly outperforms the CRF-based system in terms of average precision, recall and F-measure. Moreover, we train on the ACE corpora (w/o ACE 2003 NW) and ANERcorp and then test on ACE 2003 NW (∼28K words), and compare against the performance of the ANERcorp model, as illustrated in Table 6. The results show a dramatic improvement in performance.

Table 5: The results of the proposed system when trained and tested on the ANERcorp dataset vs. the CRF-based system.

              Per.   Loc.   Org.   Avg.   CRF-based Sys.
  Precision   91%    90%    79%    87%    86%
  Recall      93%    83%    74%    83%    69%
  F-measure   92%    86%    76%    85%    76%

Table 6: The results of the proposed system when trained on the ACE corpora + ANERcorp datasets (w/o ACE 2003 NW) and tested on ACE 2003 NW.

              Per.   Loc.   Org.   Avg.   ANERcorp Model
  Precision   97%    94%    93%    95%    61%
  Recall      95%    88%    81%    88%    40%
  F-measure   96%    91%    86%    91%    47%

[7] Available under licence from the LDC with catalogue numbers LDC2004T09, LDC2005T09, LDC2006T06 and LDC2014T18, respectively.
The re- ASTD dataset and applied 80/20 random sampling to split sults show a dramatic improvement in the performance. the dataset into train/test respectively. We also discarded the mixed class from the ArSAS dataset and kept the tweets 9. Sentiment Analysis with an annotation confidence of over 50% and applied We leverage large pre-trained language models to build an 80/20 random sampling to split the dataset into train/test Arabic sentiment analyzer as part of CAMeL Tools. Our respectively. Additionally, we reduced the number of la- analyzer classifies an Arabic sentence into being positive, bels in the ArSenTD-Lev dataset to positive, negative, and negative, or neutral. Table 1(f), shows an example of how neutral by turning the very positive labels to positive and CAMeL Tools sentiment analyzer takes an Arabic sentence the very negative labels to negative. Both mBERT and as an input and outputs its sentiment. AraBERT were fine-tuned on ArSenTD-Lev and the train Approach We used HuggingFace’s Transformers (Wolf splits from SemEval, ASTD, and ArSAS (24,827 tweets) et al., 2019) to fine-tune multilingual BERT (mBERT) (De- on a single GPU for 3 epochs with a learning rate of 3e-5, vlin et al., 2018) and AraBERT (Antoun et al., 2020) on the batch size of 32, and a maximum sequence length of 128. task of Arabic SA. The fine-tuning was done by adding a For evaluation, we use the F1P N score which was defined by SemEval-2017 and used by Majazak; F1P N is the macro 7 Available under licence from LDC with catalogue numbers F1 score over the positive and negative classes only while LDC2004T09, LDC2005T09, LDC2006T06 and LDC2014T18, neglecting the neutral class. Table 7 shows our results com- respectively. pared to Mazajak. 7028
10. Conclusion and Future Work

We presented CAMeL Tools, an open source set of tools for Arabic NLP providing utilities for pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis.

CAMeL Tools is in a state of active development. We will continue to add new API components and command-line tools while also improving existing functionality. Additionally, we will be adding more datasets and models to support more dialects. Some features we are currently working on adding to CAMeL Tools include:

• A transliterator to and from Arabic script and Arabizi inspired by Al-Badrashiny et al. (2014).

• A spelling correction component supporting MSA and DA text inspired by the work of Watson et al. (2018) and Eryani et al. (2020).

• Additional morphological disambiguators for Arabic dialects, building on the work of Khalifa et al. (2020).

• A dependency parser based on the CAMeL Parser by Shahrour et al. (2016).

• More dialect ID options, including more cities and a new model for country/region based identification.

Bibliographical References

Abdelali, A., Darwish, K., Durrani, N., and Mubarak, H. (2016). Farasa: A fast and furious segmenter for Arabic. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 11–16, San Diego, California.
AbdelRahim Elmadany, H. M. and Magdy, W. (2018). ArSAS: An Arabic speech-act and sentiment corpus of tweets. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).
AbdelRahman, S., Elarnaoty, M., Magdy, M., and Fahmy, A. (2010). Integrated machine learning techniques for Arabic named entity recognition. International Journal of Computer Science Issues (IJCSI), 7:27–36.
Abdul-Mageed, M. and Diab, M. (2014). SANA: A large scale multi-genre, multi-dialect lexicon for Arabic subjectivity and sentiment analysis. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 1162–1169, Reykjavik, Iceland.
Aboaoga, M. and Aziz, M. J. A. (2013). Arabic person names recognition by using a rule based approach. Journal of Computer Science, 9:922–927.
Abu Farha, I. and Magdy, W. (2019). Mazajak: An online Arabic sentiment analyser. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 192–198, Florence, Italy. Association for Computational Linguistics.
Al-Badrashiny, M., Eskander, R., Habash, N., and Rambow, O. (2014). Automatic transliteration of romanized dialectal Arabic. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), pages 30–38, Ann Arbor, Michigan.
Alkuhlani, S. and Habash, N. (2011). A corpus for modeling morpho-syntactic agreement in Arabic: Gender, number and rationality. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), Portland, Oregon, USA.
Antoun, W., Baly, F., and Hajj, H. (2020). AraBERT: Transformer-based model for Arabic language understanding.
Badaro, G., Baly, R., Hajj, H., Habash, N., and El-Hajj, W. (2014). A large scale Arabic sentiment lexicon for Arabic opinion mining. In Proceedings of the Workshop for Arabic Natural Language Processing (WANLP), pages 165–173, Doha, Qatar.
Badaro, G., El Jundi, O., Khaddaj, A., Maarouf, A., Kain, R., Hajj, H., and El-Hajj, W. (2018). EMA at SemEval-2018 task 1: Emotion mining for Arabic. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 236–244, New Orleans, Louisiana. Association for Computational Linguistics.
Baly, R., Hajj, H., Habash, N., Shaban, K. B., and El-Hajj, W. (2017). A sentiment treebank and morphologically enriched recursive deep models for effective sentiment analysis in Arabic. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 16(4):23.
Baly, R., Khaddaj, A., Hajj, H., El-Hajj, W., and Shaban, K. B. (2019). ArSenTD-Lev: A multi-topic corpus for target-based sentiment analysis in Arabic Levantine tweets.
Beesley, K., Buckwalter, T., and Newton, S. (1989). Two-level finite-state analysis of Arabic morphology. In Proceedings of the Seminar on Bilingual Computing in Arabic and English.
Beesley, K. (1996). Arabic finite-state morphological analysis and generation. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 89–94, Copenhagen, Denmark.
Benajiba, Y. and Rosso, P. (2007). ANERsys 2.0: Conquering the NER task for the Arabic language by combining the maximum entropy with POS-tag information. In Proceedings of the Workshop on Natural Language-Independent Engineering, 3rd Indian International Conference on Artificial Intelligence (IICAI-2007), pages 1814–1823.
Benajiba, Y. and Rosso, P. (2008). Arabic named entity recognition using conditional random fields. In Proceedings of the Workshop on HLT & NLP within the Arabic World (LREC).
Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., Obeid, O., Khalifa, S., Eryani, F., Erdmann, A., and Oflazer, K. (2018). The MADAR Arabic dialect corpus and lexicon. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.
Boudchiche, M., Mazroui, A., Bebah, M. O. A. O., Lakhouaja, A., and Boudlal, A. (2017). AlKhalil Morpho Sys 2: A robust Arabic morpho-syntactic analyzer. Journal of King Saud University - Computer and Information Sciences, 29(2):141–146.
Boudlal, A., Lakhouaja, A., Mazroui, A., Meziane, A., Bebah, M., and Shoul, M. (2010). Alkhalil Morpho Sys 1: A morphosyntactic analysis system for Arabic texts. In Proceedings of the International Arab Conference on Information Technology, pages 1–6.
Buckwalter, T. (2002). Buckwalter Arabic morphological analyzer version 1.0. Linguistic Data Consortium (LDC) catalog number LDC2002L49, ISBN 1-58563-257-0.
Darwish, K. and Mubarak, H. (2016). Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the Language Resources and Evaluation Conference (LREC), Portorož, Slovenia.
Darwish, K. (2013). Named entity recognition using cross-lingual resources: Arabic as an example. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1558–1567.
Darwish, K. (2014). Arabizi detection and conversion to Arabic. In Proceedings of the Workshop for Arabic Natural Language Processing (WANLP), pages 217–224, Doha, Qatar.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
Diab, M., Habash, N., Rambow, O., and Roth, R. (2013). LDC Arabic treebanks and associated corpora: Data divisions manual. arXiv preprint arXiv:1309.5652.
El Bazi, I. and Laachfoubi, N. (2019). Arabic named entity recognition using deep learning approach. International Journal of Electrical and Computer Engineering, 9:2025–2032.
Elfardy, H. and Diab, M. (2012). Simplified guidelines for the creation of large scale dialectal Arabic annotations. In Proceedings of the Language Resources and Evaluation Conference (LREC), Istanbul, Turkey.
Elfardy, H. and Diab, M. (2013). Sentence level dialect identification in Arabic. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), pages 456–461, Sofia, Bulgaria.
ElJundi, O., Antoun, W., El Droubi, N., Hajj, H., El-Hajj, W., and Shaban, K. (2019). hULMonA: The universal language model in Arabic. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 68–77, Florence, Italy. Association for Computational Linguistics.
Eryani, F., Habash, N., Bouamor, H., and Khalifa, S. (2020). A spelling correction corpus for multiple Arabic dialects. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marseille, France.
Graff, D., Maamouri, M., Bouziri, B., Krouna, S., Kulick, S., and Buckwalter, T. (2009). Standard Arabic Morphological Analyzer (SAMA) version 3.1. Linguistic Data Consortium LDC2009E73.
Habash, N. and Rambow, O. (2005). Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), pages 573–580, Ann Arbor, Michigan.
Habash, N. and Rambow, O. (2006). MAGEAD: A morphological analyzer and generator for the Arabic dialects. In Proceedings of the International Conference on Computational Linguistics and the Conference of the Association for Computational Linguistics (COLING-ACL), pages 681–688, Sydney, Australia.
Habash, N. and Roth, R. (2009). CATiB: The Columbia Arabic Treebank. In Proceedings of the Joint Conference of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 221–224, Suntec, Singapore.
Habash, N., Soudi, A., and Buckwalter, T. (2007). On Arabic transliteration. In A. van den Bosch et al., editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods, pages 15–22. Springer, Netherlands.
Habash, N., Rambow, O., and Roth, R. (2009). MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Khalid Choukri et al., editors, Proceedings of the International Conference on Arabic Language Resources and Tools, Cairo, Egypt. The MEDAR Consortium.
Habash, N., Eskander, R., and Hawwari, A. (2012). A morphological analyzer for Egyptian Arabic. In Proceedings of the Workshop of the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON), pages 1–9, Montréal, Canada.
Habash, N., Eryani, F., Khalifa, S., Rambow, O., Abdulrahim, D., Erdmann, A., Faraj, R., Zaghouani, W., Bouamor, H., Zalmout, N., Hassan, S., Al-Shargi, F., Alkhereyf, S., Abdulkareem, B., Eskander, R., Salameh, M., and Saddiki, H. (2018). Unified guidelines and resources for Arabic dialect orthography. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.
Habash, N. (2007). Arabic morphological representations for machine translation. In Antal van den Bosch et al., editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Kluwer/Springer.
Habash, N. Y. (2010). Introduction to Arabic Natural Language Processing, volume 3. Morgan & Claypool Publishers.
Khalifa, S., Habash, N., Abdulrahim, D., and Hassan, S. (2016a). A large scale corpus of Gulf Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), Portorož, Slovenia.
Khalifa, S., Zalmout, N., and Habash, N. (2016b). YAMAMA: Yet another multi-dialect Arabic morphological analyzer. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 223–227, Osaka, Japan.
Khalifa, S., Zalmout, N., and Habash, N. (2020). Morphological disambiguation for Gulf Arabic: The interplay between resources and methods. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marseille, France.
Loper, E. and Bird, S. (2002). NLTK: The Natural Language Toolkit. In Proceedings of the Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 63–70, Philadelphia, Pennsylvania.
Maamouri, M., Bies, A., Buckwalter, T., and Mekki, W. (2004). The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. In Proceedings of the International Conference on Arabic Language Resources and Tools, pages 102–109, Cairo, Egypt.
Maamouri, M., Bies, A., Kulick, S., Ciul, M., Habash, N., and Eskander, R. (2014). Developing an Egyptian Arabic treebank: Impact of dialectal morphology on annotation and tool development. In Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), pages 55–60.
Mitchell, A., Strassel, S., Przybocki, M., Davis, J., Doddington, G., and Grishman, R. (2003). TIDES Extraction (ACE) 2003 multilingual training data. LDC2004T09. Linguistic Data Consortium.
Mitchell, A., Strassel, S., Huang, S., and Zakhary, R. (2005). ACE 2004 multilingual training corpus. LDC2005T09. Linguistic Data Consortium.
Mohammed, N. and Omar, N. (2012). Arabic named entity recognition using artificial neural network. Journal of Computer Science, 8:1285–1293.
Nabil, M., Aly, M., and Atiya, A. (2015). ASTD: Arabic sentiment tweets dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2515–2519, Lisbon, Portugal. Association for Computational Linguistics.
Obeid, O., Salameh, M., Bouamor, H., and Habash, N. (2019). ADIDA: Automatic dialect identification for Arabic. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 6–11, Minneapolis, Minnesota. Association for Computational Linguistics.
Oudah, M. and Shaalan, K. (2012). A pipeline Arabic named entity recognition using a hybrid approach. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2159–2176.
Pasha, A., Al-Badrashiny, M., Diab, M., Kholy, A. E., Eskander, R., Habash, N., Pooleery, M., Rambow, O., and Roth, R. (2014). MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 1094–1101, Reykjavik, Iceland.
Pasha, A., Al-Badrashiny, M., Diab, M., Habash, N., Pooleery, M., and Rambow, O. (2015). MADAMIRA v2.0 User Manual.
Qi, P., Dozat, T., Zhang, Y., and Manning, C. D. (2018). Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 160–170, Brussels, Belgium. Association for Computational Linguistics.
Rosenthal, S., Farra, N., and Nakov, P. (2017). SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the International Workshop on Semantic Evaluation (SemEval), pages 501–516, Vancouver, Canada.
Sadat, F., Kazemi, F., and Farzindar, A. (2014). Automatic identification of Arabic dialects in social media. In Proceedings of the Workshop on Natural Language Processing for Social Media (SocialNLP), pages 22–27, Dublin, Ireland.
Salameh, M., Bouamor, H., and Habash, N. (2018). Fine-grained Arabic dialect identification. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 1332–1344, Santa Fe, New Mexico, USA.
Shaalan, K. and Raza, H. (2009). NERA: Named entity recognition for Arabic. Journal of the American Society for Information Science and Technology, 60:1652–1663.
Shahrour, A., Khalifa, S., Taji, D., and Habash, N. (2016). CamelParser: A system for Arabic syntactic analysis and morphological disambiguation. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 228–232.
Smrž, O. (2007). ElixirFM — Implementation of functional Arabic morphology. In Proceedings of the Workshop on Computational Approaches to Semitic Languages (CASL), pages 1–8, Prague, Czech Republic. ACL.
Song, Z., Maeda, K., Walker, C., and Strassel, S. (2014). ACE 2007 multilingual training corpus. LDC2014T18. Linguistic Data Consortium.
Taji, D., Khalifa, S., Obeid, O., Eryani, F., and Habash, N. (2018). An Arabic morphological analyzer and generator with copious features. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology (SIGMORPHON), pages 140–150.
Vanni, M. and Zajac, R. (1996). The Temple translator's workstation project. In Proceedings of the TIPSTER Text Program Workshop, pages 101–106, Vienna, Virginia.
Walker, C., Strassel, S., Medero, J., and Maeda, K. (2006). ACE 2005 multilingual training corpus. LDC2006T06. Linguistic Data Consortium.
Watson, D., Zalmout, N., and Habash, N. (2018). Utilizing character and word embeddings for text normalization with sequence-to-sequence models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 837–843, Brussels, Belgium. Association for Computational Linguistics.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Zaghouani, W., Mohit, B., Habash, N., Obeid, O., Tomeh, N., Rozovskaya, A., Farra, N., Alkuhlani, S., and Oflazer, K. (2014). Large scale Arabic error annotation: Guidelines and framework. In Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland.
Zaghouani, W. (2012). RENAR: A rule-based Arabic named entity recognition system. ACM Transactions on Asian Language Information Processing, 11:1–13.
Zaidan, O. F. and Callison-Burch, C. (2011). The Arabic Online Commentary dataset: An annotated dataset of informal Arabic with high dialectal content. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), pages 37–41.
Zalmout, N. and Habash, N. (2017). Don't throw those morphological analyzers away just yet: Neural morphological disambiguation for Arabic. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 704–713, Copenhagen, Denmark.
Zalmout, N. and Habash, N. (2019). Adversarial multitask learning for joint multi-feature and multi-dialect morphological modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1775–1786, Florence, Italy. Association for Computational Linguistics.
Zalmout, N., Erdmann, A., and Habash, N. (2018). Noise-robust morphological disambiguation for dialectal Arabic. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, Louisiana, USA.
Zampieri, M., Malmasi, S., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J., Scherrer, Y., and Aepli, N. (2017). Findings of the VarDial evaluation campaign 2017. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 1–15, Valencia, Spain.