Arabic Offensive Language on Twitter: Analysis and Experiments
Hamdy Mubarak (1), Ammar Rashed (2), Kareem Darwish (1), Younes Samih (1), Ahmed Abdelali (1)
(1) Qatar Computing Research Institute, HBKU  (2) Özyeğin University
{hmubarak, kdarwish, ysamih, aabdelali}@hbku.edu.qa, ammar.rasid@ozu.edu.tr

Abstract

Detecting offensive language on Twitter has many applications ranging from detecting/predicting bullying to measuring polarization. In this paper, we focus on building a large Arabic offensive tweet dataset. We introduce a method for building a dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech. We thoroughly analyze the dataset to determine which topics, dialects, and gender are most associated with offensive tweets and how Arabic speakers use offensive language. Lastly, we conduct many experiments to produce strong results (F1 = 83.2) on the dataset using SOTA techniques.

1 Introduction

Disclaimer: Due to the nature of the paper, some examples herein contain highly offensive language and hate speech. They don't reflect the views of the authors in any way. This work is an attempt to help fight such speech.

Much recent interest has focused on the detection of offensive language and hate speech in online social media. Offensiveness is often associated with undesirable behaviors such as trolling, cyberbullying, online extremism, political polarization, and propaganda. Thus, offensive language detection is instrumental for a variety of applications such as: quantifying polarization (Barberá and Sood, 2015; Conover et al., 2011), detecting troll and propaganda accounts (Darwish et al., 2017), estimating the likelihood of hate crimes (Waseem and Hovy, 2016), and predicting conflicts (Chadefaux, 2014).

In this paper, we describe our methodology for building a large dataset of Arabic offensive tweets. Given that roughly 1-2% of all Arabic tweets are offensive (Mubarak and Darwish, 2019), targeted annotation is essential to efficiently build a large dataset. Since our methodology does not use a seed list of offensive words, it is not biased by topic, target, or dialect. Using our methodology, we tagged a 10,000 Arabic tweet dataset for offensiveness, where offensive tweets account for roughly 19% of the tweets. Further, we labeled tweets as vulgar or hate speech. To date, this is the largest available dataset, which we plan to make publicly available along with annotation guidelines. We use this dataset to characterize Arabic offensive language and to ascertain the topics, dialects, and user genders most associated with its use. Though we suspect that there are common features that span different languages and cultures, some characteristics of Arabic offensive language are language and culture specific. Thus, we conduct a thorough analysis of how Arab users use offensive language. Next, we use the dataset to train strong Arabic offensive language classifiers using state-of-the-art representations and classification techniques. Specifically, we experiment with static and contextualized embeddings for representation, along with a variety of classifiers such as Transformer-based and Support Vector Machine (SVM) classifiers.

The contributions of this paper are as follows:

• We built the largest Arabic offensive language dataset to date that is also labeled for vulgar language and hate speech and is not biased by topic or dialect. We describe the methodology for building it along with annotation guidelines.

• We performed a thorough analysis to describe the peculiarities of Arabic offensive language.

• We experimented with SOTA classification techniques to provide strong results on detecting offensive language.
2 Related Work

Many recent papers have focused on the detection of offensive language, including hate speech (Agrawal and Awekar, 2018; Badjatiya et al., 2017; Davidson et al., 2017; Djuric et al., 2015; Kwok and Wang, 2013; Malmasi and Zampieri, 2017; Nobata et al., 2016; Yin et al., 2009). Offensive language can be categorized as: vulgar, which includes explicit and rude sexual references and pornography, and hateful, which includes offensive remarks concerning people's race, religion, country, etc. (Jay and Janschewitz, 2008). Prior works have concentrated on building annotated corpora and training classification models. Concerning corpora, hatespeechdata.com attempts to maintain an updated list of hate speech corpora for multiple languages including Arabic and English. Further, SemEval 2019 ran an evaluation task targeted at detecting offensive language, which focused exclusively on English (Zampieri et al., 2019). For SemEval 2020, the task was extended to include other languages, including Arabic (Zampieri et al., 2020). As for classification models, most studies used supervised classification at the word level (Kwok and Wang, 2013), at the character sequence level (Malmasi and Zampieri, 2017), or with word embeddings (Djuric et al., 2015). The studies used different classification techniques including Naïve Bayes (Kwok and Wang, 2013), Support Vector Machines (SVM) (Malmasi and Zampieri, 2017), and deep learning (Agrawal and Awekar, 2018; Badjatiya et al., 2017; Nobata et al., 2016). The accuracy of the aforementioned systems ranged between 76% and 90%. Earlier work looked at the use of sentiment words as features as well as contextual features (Yin et al., 2009).

The work on Arabic offensive language detection is relatively nascent (Abozinadah, 2017; Alakrot et al., 2018; Albadi et al., 2018; Mubarak et al., 2017; Mubarak and Darwish, 2019). Mubarak et al. (2017) suggested that certain users are more likely than others to use offensive language, and they used this insight to build a list of offensive Arabic words and to construct a labeled set of 1,100 tweets. Abozinadah (2017) used supervised classification based on a variety of features including user profile features, textual features, and network features, reporting an accuracy of nearly 90%. Alakrot et al. (2018) used supervised classification based on word n-grams to detect offensive language in YouTube comments; they improved classification with stemming and achieved a precision of 88%. Albadi et al. (2018) focused on detecting religious hate speech using a recurrent neural network.

Arabic is a morphologically rich language with a standard variety called Modern Standard Arabic (MSA), which is typically used in formal communication, and many dialectal varieties that differ from MSA in lexical selection, morphology, phonology, and syntactic structures. In MSA, words are typically derived from a set of thousands of roots by fitting a root into a stem template, and the resulting stem may accept a variety of prefixes and suffixes. Though word segmentation, which greatly improves word matching, is quite accurate for MSA (Abdelali et al., 2016), with accuracy approaching 99%, dialectal segmentation is not sufficiently reliable, with accuracy ranging between 91% and 95% for different dialects (Samih et al., 2017). Since dialectal Arabic is ubiquitous in Arabic tweets and many tweets have creative spellings of words, recent work on Arabic offensive language detection has used character-level models (Mubarak and Darwish, 2019).

3 Data Collection

3.1 Collecting Arabic Offensive Tweets

Our target is to build a large Arabic offensive language dataset that is representative of its appearance on Twitter and is hopefully not biased toward specific dialects, topics, or targets. One of the main challenges is that offensive tweets constitute a very small portion of overall tweets. To quantify their proportion, we took 3 random samples of tweets from different days, each composed of 1,000 tweets, and found that only 1-2% of them were offensive (including pornographic advertisements). This percentage is consistent with previously reported percentages (Mubarak et al., 2017). Thus, annotating random tweets is grossly inefficient. One way to overcome this problem is to use a seed list of offensive words to filter tweets. However, doing so is problematic, as it would skew the dataset toward particular types of offensive language or specific dialects; offensiveness is often dialect and country specific.
After inspecting many tweets, we observed that many offensive tweets contain the vocative particle "yA" (meaning "O"; Arabic words are given here in their Buckwalter transliteration), which is mainly used to direct speech at a specific person or group. The ratio of offensive tweets increases to 5% if a tweet contains one vocative particle and to 19% if it has at least two vocative particles. Users often repeat this particle for emphasis, as in "yA Amy yA Hnwnp" ("O my mother, O kind one"), which is endearing and non-offensive, and "yA klb yA q*r" ("O dog, O dirty one"), which is offensive. We decided to use this pattern to increase our chances of finding offensive tweets. One of the main advantages of the pattern "yA ... yA" is that it is not associated with any specific topic or genre, and it appears in all Arabic dialects. Though the use of offensive language does not necessitate the appearance of the vocative particle, the particle does not favor any specific offensive expressions and greatly improves our chances of finding offensive tweets.

Using Twitter APIs, we collected 660k Arabic tweets containing this pattern between April 15 and May 6, 2019. To increase diversity, we sorted the word sequences appearing between the vocative particles and took the most frequent 10,000 unique sequences. For each word sequence, we took a random tweet containing that sequence. We then annotated those tweets, ending up with 1,915 offensive tweets, which represent roughly 19% of all tweets. Using this pattern allowed us to find offensive tweets, which constitute a small percentage of tweets in general, while being far more generic than using a seed list of offensive words, which may greatly skew the distribution of offensive tweets. For future work, we plan to explore other methods for identifying offensive tweets with greater stylistic diversity.
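To make the collection step concrete, the following is a minimal Python sketch of the pattern-based sampling described above, not the authors' released code: it extracts the word sequence between two vocative particles, keeps the 10,000 most frequent sequences, and draws one random tweet per sequence. All function and variable names are ours, and tweets are assumed to be available as plain strings.

```python
import random
import re
from collections import Counter, defaultdict
from typing import Optional

# The vocative pattern "yA ... yA": two occurrences of the particle يا
# with a word sequence in between. Per the paper, tweets matching it are
# roughly 19% offensive, versus 1-2% for random tweets.
VOCATIVE = re.compile(r"\bيا\b(.+?)\bيا\b")

def between_particles(tweet: str) -> Optional[str]:
    """Word sequence between the first two vocative particles, if any."""
    m = VOCATIVE.search(tweet)
    return m.group(1).strip() if m else None

def sample_for_annotation(tweets, k=10_000):
    """Keep the k most frequent in-between sequences, then draw one
    random tweet per sequence (the diversity step described above)."""
    by_seq = defaultdict(list)
    for t in tweets:
        seq = between_particles(t)
        if seq:
            by_seq[seq].append(t)
    top = Counter({s: len(ts) for s, ts in by_seq.items()}).most_common(k)
    return [random.choice(by_seq[s]) for s, _ in top]
```

Sampling one tweet per frequent sequence, rather than taking all matches, is what keeps any single expression from dominating the annotation pool.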
3.2 Annotating Tweets

We developed annotation guidelines jointly with an experienced annotator, who is a native Arabic speaker with good knowledge of various Arabic dialects, in accordance with the OffensEval2019 guidelines (Zampieri et al., 2019). Tweets were given one or more of the following four labels: offensive, vulgar, hate speech, or clean. Since the offensive label covers both vulgar and hate speech, and since vulgarity and hate speech are not mutually exclusive, a tweet can be just offensive, or offensive and vulgar and/or hate speech. For example, if a tweet has insults or threats targeting a group based on their nationality, ethnicity, gender, political affiliation, religious belief, or other common characteristics, it is considered hate speech. The annotation adhered to the following guidelines:

OFFENSIVE (OFF): Offensive tweets contain explicit or implicit insults or attacks against other people, or inappropriate language, such as: direct threats or incitement, ex: "AHrqwA mqrAt AlmEArDp" ("burn opposition headquarters") and "h*A AlmnAfq yjb qtlh" ("kill this hypocrite"); and insults and expressions of contempt, which include animal analogies, ex: "yA klb" ("O dog") and "kl tbn" ("eat hay"); insults to family, ex: "yA rwH Amk" ("O mother's soul"); sexually-related insults, ex: "yA dywv" ("O cuckold"); and damnation, ex: "dynk Alq*r" ("your filthy religion").

CLEAN (CLN): Clean tweets do not contain vulgar or offensive language. We noticed that some tweets contain offensive words, yet the tweet as a whole should not be considered offensive given the intention of the user. This suggests that plain string matching that ignores context may fail in some cases. Examples of such ambiguous cases include: humor, ex: "yA Edwp AlfrHp hhh" ("O enemy of happiness hahaha"); advice, ex: "lA tql lSAHbk yA xnzyr" ("don't say to your friend: O pig"); condition, ex: "A*A EArDthm yqwlwn yA Emyl" ("if you disagree with them, they call you a spy"); condemnation, ex: "lmA*A nsb bqwl: yA bqrp?" ("Why do we insult others by saying: O cow?"); self offense, ex: "tEbt mn lsAny Alq*r" ("I am tired of my dirty tongue"); non-human targets, ex: "yA bnt Almjnwnp yA kwrp" ("O daughter of the crazy one, O football"); and quotations from movies or stories, ex: "tAny yA zky! tAny yA fA$l" ("again smarty! again O loser"). For ambiguous expressions, the annotator searched Twitter to observe real sample usages.

Table 1 shows the distribution of the annotated tweets. There are 1,915 offensive tweets, including 225 vulgar tweets and 506 hate speech tweets, and 8,085 clean tweets.

Label            Tweets    Words
Offensive         1,915      38k
– Vulgar            225       4k
– Hate speech       506      13k
Clean             8,085     151k
Total            10,000     193k

Table 1: Distribution of offensive and clean tweets.

To validate annotation quality, we asked three additional annotators to annotate two tweet sample sets. The first was a random sample of 100 tweets containing 50 offensive and 50 non-offensive tweets; the Inter-Annotator Agreement (IAA) between the annotators, measured using Fleiss's Kappa coefficient (Fleiss, 1971), was 0.92. The second consisted of general random samples of 100 tweets each from the dataset, and the IAA of each annotator with the dataset labels was 0.97, 0.96, and 0.97 respectively. This high level of agreement gives us confidence in the quality of the annotation. The data can be downloaded from: https://alt.qcri.org/resources/OSACT2020-sharedTask-CodaLab-Train-Dev-Test.zip
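For reference, Fleiss's Kappa can be computed from an item-by-category count table. Below is a small self-contained sketch of the statistic (our own implementation, independent of the paper's tooling); the `table` at the bottom is a toy example with three annotators and binary offensive/clean labels.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa (Fleiss, 1971).

    ratings: (n_items, n_categories) matrix where ratings[i, j] is the
    number of annotators who assigned category j to item i, with the
    same number of annotators rating every item.
    """
    n_items, _ = ratings.shape
    n_raters = ratings.sum(axis=1)[0]                   # raters per item
    p_j = ratings.sum(axis=0) / (n_items * n_raters)    # category shares
    # Per-item agreement: fraction of agreeing annotator pairs.
    P_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar = P_i.mean()                                  # observed agreement
    P_e = np.square(p_j).sum()                          # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy check: 3 annotators labeling 4 tweets as offensive/clean.
table = np.array([[3, 0], [0, 3], [2, 1], [3, 0]])
print(round(fleiss_kappa(table), 3))  # 0.625 on this toy table
```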
3.3 Statistics and User Demographics

Given the annotated tweets, we wanted to ascertain the distribution of the types of offensive language, the genres or topics where it is used, the dialects used, and the gender of the users using such language. Accordingly, the annotator manually examined and tagged all the offensive tweets.

Topic: Figure 1 shows the distribution of topics associated with offensive tweets. As the figure shows, sports and politics are the most dominant topics for offensive language, including vulgar and hate speech.

[Figure 1: Topic distribution for offensive language and its sub-categories.]

Dialect: We looked at MSA and four major dialects, namely Egyptian (EGY), Leventine (LEV), Maghrebi (MGR), and Gulf (GLF). Figure 2 shows that 71% of vulgar tweets were written in EGY, followed by GLF, which accounted for 13% of vulgar tweets. MSA was not used in any vulgar tweets. As for offensive tweets in general, EGY and GLF were used in 36% and 35% of the offensive tweets respectively. Unlike the case of vulgar language, 15% of the offensive tweets were written in MSA. For hate speech, GLF and EGY were again dominant, and MSA constituted 21% of the tweets. This is consistent with findings for other languages, e.g., English and Italian, where vulgarity was more frequently associated with colloquial language (Mattiello, 2005; Maisto et al., 2017).

[Figure 2: Dialect distribution for offensive language and its sub-categories.]

Gender: Figure 3 shows that the vast majority of offensive tweets, including vulgar and hate speech, were authored by males. Female Twitter users accounted for 14% of offensive tweets in general, and for 6% and 9% of vulgar and hate speech tweets respectively.

[Figure 3: Gender distribution for offensive language and its sub-categories.]

Figure 4 shows a detailed categorization of hate speech types, where the top three include insulting groups based on their political ideology, origin, and sport affiliation. Religious hate speech appeared in only 15% of all hate speech tweets.

[Figure 4: Distribution of hate speech types (political ideology; origin (race, ethnicity, nationality); sport affiliation; religion; social class/job; gender; disability/diseases). Note: a tweet may have more than one type.]

Next, we analyzed all tweets labeled as offensive to better understand how Arabic speakers use offensive language. Here is a breakdown of usage:

Direct name calling: The most frequent attack is to call a person an animal name, and the most used animals were "klb" ("dog"), "HmAr" ("donkey"), and "bhym" ("beast"). The second most common was insulting mental abilities using words such as "gby" ("stupid") and "EbyT" ("idiot"). Culturally, not all animal names are used as insults. For example, animals such as "Sqr" ("falcon"), "Asd" ("lion"), and "gzAl" ("gazelle") are typically used for praise. For other insults, people use: some bird names such as "djAjp" ("chicken"), "bwmp" ("owl"), and "grAb" ("crow"); insects such as "*bAbp" ("fly"), "SrSwr" ("cockroach"), and "H$rp" ("insect"); microorganisms such as "jrvwmp" ("microbe") and "THAlb" ("algae"); and inanimate objects such as "jzmp" ("shoes") and "sTl" ("bucket"), among other usages.

[Figure 5: Tag cloud for words with top valence score among the offensive class, e.g., name calling (animals), curses, insults, etc.]

Simile and metaphor: Users use simile and metaphor, where they compare a person to: an animal, as in "zy Alvwr" ("like a bull"), "smEny nhyqk" ("let me hear your braying"), and "hz dylk" ("wag your tail"); a person with a mental or physical disability, such as "mngwly" ("Mongolian (Down syndrome)"), "mEwq" ("disabled"), and "qzm" ("dwarf"); and the opposite gender, such as "jy$ nwAl" ("Nawal's army"; Nawal is a female name) and "nAdy zyzy" ("Zizi's club"; Zizi is a female nickname).
Indirect speech: This includes: sarcasm, such as "A*kY AxwAtk" ("smartest one of your siblings") and "fylswf AlHmyr" ("the donkeys' philosopher"); questions, such as "Ayh kl AlgbA dh" ("what is all this stupidity"); and indirect speech, such as "AlnqA$ mE AlbhAym gyr mvmr" ("no use arguing with cattle").

Wishing evil: This entails wishing death or major harm to befall someone, such as "rbnA yAxdk" ("May God take (kill) you").

4 Experiments

We conducted an extensive battery of experiments on the dataset to establish strong Arabic offensive language classification results. Though offensive tweets have finer-grained labels, where an offensive tweet could also be vulgar and/or hate speech, we conducted coarser-grained classification to determine whether a tweet was offensive or not. For classification, we experimented with several tweet representation and classification models. For tweet representations, we used: the counts of positive and negative terms, based on a polarity lexicon; static embeddings, namely fastText and skip-gram; and deep contextualized embeddings, namely BERTbase-multilingual and AraBERT (Antoun et al., 2020).
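As an illustration of the two simpler representations, the sketch below extracts lexicon-count features and mean-pooled static sentence embeddings. The lexicon file format, file paths, and function names are our assumptions (the paper does not specify them), and the word vectors are assumed to be distributed in word2vec text format, which gensim can load.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical inputs: a polarity lexicon (word<TAB>pos|neg per line) and
# static skip-gram vectors (e.g., Mazajak) in word2vec text format.
LEXICON_PATH = "polarity_lexicon.tsv"
vectors = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)

positive, negative = set(), set()
with open(LEXICON_PATH, encoding="utf-8") as f:
    for line in f:
        word, polarity = line.rstrip("\n").split("\t")
        (positive if polarity == "pos" else negative).add(word)

def lexical_features(tweet: str) -> np.ndarray:
    """Counts of positive and negative lexicon terms in the tweet."""
    tokens = tweet.split()
    return np.array([sum(t in positive for t in tokens),
                     sum(t in negative for t in tokens)], dtype=float)

def sentence_embedding(tweet: str) -> np.ndarray:
    """Mean of the static embeddings of the tweet's in-vocabulary tokens."""
    vecs = [vectors[t] for t in tweet.split() if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)
```

Either feature vector can then be fed to the SVM classifier described in Section 4.3.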
Static Embeddings We experimented with static embeddings trained on different corpora and with different vector dimensionality, comparing pre-trained embeddings to embeddings trained on our dataset. For pre-trained embeddings, we used: fastText Egyptian Arabic pre-trained embeddings (Bojanowski et al., 2017) with a vector dimensionality of 300; AraVec skip-gram embeddings (Mohammad et al., 2017), trained on 66.9M Arabic tweets with 100-dimensional vectors; and Mazajak skip-gram embeddings (Abu Farha and Magdy, 2019), trained on 250M Arabic tweets with 300-dimensional vectors. Sentence embeddings were calculated by taking the mean of the embeddings of their tokens. The importance of testing a character-level n-gram model like fastText lies in the agglutinative nature of the Arabic language. We trained a new fastText text classification model (Joulin et al., 2017) on our dataset with vectors of 40 dimensions, a 0.5 learning rate, and character n-grams of length 2-10 as features, for 30 epochs. These hyper-parameters were tuned using a 5-fold cross-validated grid search.
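The hyper-parameters above map directly onto the fastText Python API's `train_supervised` function. This sketch assumes a training file in fastText's `__label__` format; the file name and label strings are ours, not the paper's.

```python
import fasttext

# Training file: one tweet per line, prefixed with "__label__OFF" or
# "__label__CLEAN" (label scheme assumed for illustration).
model = fasttext.train_supervised(
    input="train.txt",
    dim=40,    # vector dimensionality (from the paper's grid search)
    lr=0.5,    # learning rate
    minn=2,    # smallest character n-gram
    maxn=10,   # largest character n-gram
    epoch=30,
)

labels, probabilities = model.predict("يا كلب")  # classify a single tweet
print(labels[0], probabilities[0])
```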
Deep Contextualized Embeddings We also experimented with pre-trained contextualized embeddings, fine-tuned for the downstream task. Recently, deep contextualized language models such as BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), ULMFiT (Howard and Ruder, 2018), and OpenAI GPT (Radford et al., 2018) have achieved ground-breaking results in many NLP classification and language understanding tasks. In this paper, we fine-tuned BERTbase-multilingual (or simply BERT) and AraBERT to classify Arabic offensive language on Twitter, as doing so eliminates the need for feature engineering. Although Robustly Optimized BERT (RoBERTa) embeddings perform better than BERTlarge on the GLUE (Wang et al., 2018), RACE (Lai et al., 2017), and SQuAD (Rajpurkar et al., 2016) tasks, pre-trained multilingual RoBERTa models are not available. BERT is pre-trained on Wikipedia text from 104 languages, and AraBERT is trained on a large Arabic news corpus containing 8.5M articles composed of roughly 2.5B tokens. Both use identical architectures and come with hundreds of millions of parameters: each contains an encoder with 12 Transformer blocks, a hidden size of 768, and 12 self-attention heads, and both use BPE sub-word segments. Following Devlin et al. (2019), classification consists of introducing a dense layer over the final hidden state h corresponding to the first token of the sequence, [CLS], with a softmax activation on top of BERT to predict the probability of label l: p(l|h) = softmax(Wh), where W is the task-specific weight matrix. During fine-tuning, all BERT/AraBERT parameters together with W are optimized end-to-end to maximize the log-probability of the correct labels.
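The classification head described above (a dense layer W over the [CLS] hidden state, trained end-to-end with a softmax) can be sketched with HuggingFace Transformers as follows. This is our illustrative reconstruction, not the authors' code; the AraBERT checkpoint name is one plausible choice, and the training loop is reduced to a single step.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "aubmindlab/bert-base-arabert"  # or "bert-base-multilingual-cased"

class OffensiveClassifier(nn.Module):
    """Dense layer over the final hidden state of the first token ([CLS])."""
    def __init__(self, n_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        self.W = nn.Linear(self.encoder.config.hidden_size, n_labels)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.W(h)  # logits; softmax is applied inside the loss

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = OffensiveClassifier()
batch = tokenizer(["يا كلب"], return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
# Cross-entropy = softmax + negative log-likelihood, so minimizing it
# maximizes the log-probability of the correct labels, as described above.
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))  # 1 = offensive
loss.backward()  # gradients flow into both the encoder and W (end-to-end)
```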
4.3 Classification Models

We explored different classifiers. When using lexical features and pre-trained static embeddings, we primarily used an SVM classifier with a radial basis function kernel. Only when using the Mazajak embeddings did we experiment with other classifiers, such as AdaBoost and logistic regression. The SVM classifier performed the best on static embeddings, and we picked the Mazajak embeddings because they yielded the best results among all static embeddings. We used the scikit-learn implementations of all the classifiers, e.g., libsvm for the SVM classifier. We also experimented with fastText, which trained embeddings on our data. When using contextualized embeddings, we fine-tuned BERT and AraBERT by adding a fully-connected dense layer followed by a softmax classifier, minimizing the binary cross-entropy loss on the training data. For all experiments, we used the PyTorch (https://pytorch.org/) implementation by HuggingFace (https://github.com/huggingface/transformers), as it provides pre-trained weights and vocabularies.

4.4 Evaluation

For all of our experiments, we used 5-fold cross validation with identical folds. Table 2 reports the results of using lexical features, static pre-trained embeddings with an SVM classifier, embeddings trained on our data with the fastText classifier, and BERT and AraBERT with a dense layer and softmax activation.

Model/classifier                  Prec.   Recall   F1
Lexical features
  SVM                              68.5    35.3    46.6
Pre-trained static embeddings
  fastText/SVM                     76.7    43.5    55.5
  AraVec/SVM                       85.5    69.2    76.4
  Mazajak/SVM                      88.6    72.4    79.7
Embeddings trained on our data
  fastText/fastText                82.1    68.1    74.4
Contextualized embeddings
  BERTbase-multilingual            78.3    74.0    76.0
  AraBERT                          84.6    82.4    83.2

Table 2: Classification performance with different features and models.

As the results show, using fine-tuned AraBERT yielded the best results overall, followed closely by Mazajak/SVM, with large improvements in precision over using BERT. The success of AraBERT was surprising given that it was not trained on social media text. Perhaps pre-training a Transformer model on social media text may improve results further. We suspect that the Mazajak/SVM combination performed better than BERT because the Mazajak embeddings, though static, were trained on in-domain data, as opposed to BERT. For completeness, we compared 7 other classifiers with SVM using the Mazajak embeddings. As the results in Table 3 show, the SVM yielded the best results.

Model                  Prec.   Recall   F1
Decision Tree           51.2    53.8    52.4
Random Forest           82.4    42.4    56.0
Gaussian NB             44.9    86.0    59.0
Perceptron              75.6    67.7    66.8
AdaBoost                74.3    67.0    70.4
Gradient Boosting       84.2    63.0    72.1
Logistic Regression     84.7    69.5    76.3
SVM                     88.6    72.4    79.7

Table 3: Performance of different classification models on Mazajak embeddings.

4.5 Error Analysis

We inspected the tweets of one fold that were misclassified by the Mazajak/SVM model (36 false positives/121 false negatives) to determine the most common errors. They were as follows:

Four false positive types:

• Gloating: ex. "yA hbydp" ("O you delusional"), referring to fans of a rival sports team for thinking they could win.

• Quoting: ex. "lmA Hd ysb wyqwl yA klb" ("when someone swears and says: O dog").

• Idioms: ex. "yA fATr rmDAn yA xAsr dynk" ("O you who does not fast Ramadan, you have lost your faith"), which is a colloquial idiom.

• Implicit sarcasm: ex. "yA xAyn Ant EAwz t$kk fy Hb Al$Eb llrys" ("O traitor, (you) want to question the people's love for the president"), where the author is mocking the president's popularity.

Two false negative types:

• Mixture of offensiveness and admiration: ex. calling a girl a puppy, "yA klbwbp" ("O puppy"), in a flirtatious manner.

• Implicit offensiveness: ex. calling for a cure while implying insanity: "wt$fy HkAm bldk mn AlmrD" ("and cure the rulers of your country from illness").

5 Conclusion and Future Work

In this paper we presented a systematic method for building an Arabic offensive language tweet dataset that does not favor specific dialects, topics, or genres. We developed detailed guidelines for tagging the tweets as clean or offensive, including special tags for vulgar tweets and hate speech. We tagged 10,000 tweets, which we plan to release publicly and which would constitute the largest available Arabic offensive language dataset. We characterized the offensive tweets in the dataset to determine the topics that elicit such language, the dialects that are most often used, the common modes of offensiveness, and the gender distribution of their authors. We performed this breakdown for offensive tweets in general and for vulgar and hate speech tweets separately. We believe that this is the first detailed analysis of its kind. Lastly, we conducted a large battery of experiments on the dataset, using cross-validation, to establish a strong system for Arabic offensive language detection. We showed that using an Arabic-specific BERT model (AraBERT) and static embeddings trained on tweets produced competitive results on the dataset.

For future work, we plan to pursue several directions. First, we want to explore target-specific offensive language, where attacks against an entity or a group may employ certain expressions that are only offensive within the context of that target and completely innocuous otherwise.
Second, we plan to examine the effectiveness of cross-dialectal and cross-lingual learning of offensive language.
References

Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 11–16.

Ehab Abozinadah. 2017. Detecting Abusive Arabic Language Twitter Accounts Using a Multidimensional Analysis Model. Ph.D. thesis, George Mason University.

Ibrahim Abu Farha and Walid Magdy. 2019. Mazajak: An online Arabic sentiment analyser. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 192–198, Florence, Italy. Association for Computational Linguistics.

Sweta Agrawal and Amit Awekar. 2018. Deep learning for detecting cyberbullying across multiple social media platforms. In European Conference on Information Retrieval, pages 141–153. Springer.

Azalden Alakrot, Liam Murray, and Nikola S. Nikolov. 2018. Towards accurate detection of offensive language in online communication in Arabic. Procedia Computer Science, 142:315–320.

Nuha Albadi, Maram Kurdi, and Shivakant Mishra. 2018. Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76. IEEE.

Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15.

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Pablo Barberá and Gaurav Sood. 2015. Follow your ideology: Measuring media ideology on social networks. In Annual Meeting of the European Political Science Association, Vienna, Austria.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Thomas Chadefaux. 2014. Early warning signals for war in the news. Journal of Peace Research, 51(1):5–18.

Michael Conover, Jacob Ratkiewicz, Matthew R. Francisco, Bruno Gonçalves, Filippo Menczer, and Alessandro Flammini. 2011. Political polarization on Twitter. ICWSM, 133:89–96.

Kareem Darwish, Dimitar Alexandrov, Preslav Nakov, and Yelena Mejova. 2017. Seminar users in the Arabic Twitter sphere. In International Conference on Social Informatics, pages 91–108. Springer.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Eleventh International Conference on Web and Social Media (ICWSM), pages 512–515.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. 2015. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, pages 29–30. ACM.

Samhaa R. El-Beltagy. 2016. NileULex: A phrase and word level sentiment lexicon for Egyptian and modern standard Arabic. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2900–2905, Portorož, Slovenia. European Language Resources Association (ELRA).

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics.

Timothy Jay and Kristin Janschewitz. 2008. The pragmatics of swearing. Journal of Politeness Research. Language, Behaviour, Culture, 4(2):267–288.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.

Irene Kwok and Yuzhou Wang. 2013. Locate the hate: Detecting tweets against blacks. In Twenty-Seventh AAAI Conference on Artificial Intelligence.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations.

Alessandro Maisto, Serena Pelosi, Simonetta Vietri, and Pierluigi Vitale. 2017. Mining offensive language on social media. CLiC-it 2017, 11-12 December 2017, Rome, page 252.

Shervin Malmasi and Marcos Zampieri. 2017. Detecting hate speech in social media. arXiv preprint arXiv:1712.06427.

Elisa Mattiello. 2005. The pervasiveness of slang in standard and non-standard English. Mots Palabras Words, 5:7–41.

Abu Bakr Mohammad, Kareem Eissa, and Samhaa El-Beltagy. 2017. AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science, 117:256–265.

Hamdy Mubarak and Kareem Darwish. 2019. Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269–276. Springer.

Hamdy Mubarak, Kareem Darwish, and Walid Magdy. 2017. Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153. International World Wide Web Conferences Steering Committee.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text.

Younes Samih, Mohamed Eldesouki, Mohammed Attia, Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, and Laura Kallmeyer. 2017. Learning from relatives: Unified dialectal Arabic segmentation. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 432–441.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Dawei Yin, Zhenzhen Xue, Liangjie Hong, Brian D. Davison, April Kontostathis, and Lynne Edwards. 2009. Detection of harassment on Web 2.0. Proceedings of the Content Analysis in the WEB, 2:1–7.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. SemEval-2019 Task 6: Identifying and categorizing offensive language in social media (OffensEval). arXiv preprint arXiv:1903.08983.

Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. 2020. SemEval-2020 Task 12: Multilingual offensive language identification in social media (OffensEval 2020). arXiv preprint arXiv:2006.07235.