AraFacts: The First Large Arabic Dataset of Naturally-Occurring Professionally-Verified Claims
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
AraFacts: The First Large Arabic Dataset of Naturally-Occurring Professionally-Verified Claims Zien Sheikh Ali, Watheq Mansour, Tamer Elsayed, and Abdulaziz Al-Ali Computer Science and Engineering Department, Qatar University {zs1407404, wm1900793, telsayed, a.alali}@qu.edu.qa Abstract a dataset that can be used to develop automated systems. In this paper, we introduce AraFacts, the We introduce AraFacts, the first large Arabic first large Arabic Fact-checking dataset of naturally- dataset of naturally-occurring claims collected occurring and professionally-verified claims. We from 5 Arabic fact-checking websites, e.g., Fa- tabyyano and Misbar, covering claims since crawl 6,222 claims along with their factual labels, 2016. Our dataset consists of 6,222 claims description, and additional metadata, from 5 differ- along with their factual labels and additional ent fact-checking websites, and make them avail- metadata, such as fact-checking article content, able to the research community in a normalized topical category, and links to posts or Web form. The contributions of our work are two-fold: pages spreading the claim. Since the data is obtained from various fact-checking websites, • We extract and publicly share1 about 6k we standardize the original claim labels to pro- claims with their original and normalized fac- vide a unified label rating for all claims. More- tual labels along with their metadata. over, we provide revealing dataset statistics and motivate its use by suggesting possible re- • We propose several research applications for search applications. The dataset is made pub- which our dataset can be utilized. licly available for the research community. The rest of the paper is organized as follows. Sec- 1 Introduction tion 2 summarizes related work. Section 3 de- scribes the data collection process. Section 4 pro- Fake news and misinformation are considered vides data analysis. Section 5 outlines some re- among the greatest threats to nations. The spread of search applications, and Section 6 concludes. fake news can cause manipulation in public opin- ion, which has adverse consequences to politics 2 Related Work and journalism. Moreover, the recent COVID- 19 pandemic revealed how medical misinforma- Several datasets of naturally occurring claims have tion could easily harm the health of people (Is- been previously proposed, but mostly in English. lam et al., 2020). Notable development has been For example, MultiFC dataset (Augenstein et al., made in automated fact-checking systems over the 2019) includes 38,918 claims and their metadata past years. However, one of the main limitations from 26 different fact-checking websites. Similarly, of Arabic automated fact-checking systems is the Liar dataset (Wang, 2017) consists of 12,836 claims lack of Arabic datasets. Recently several Arabic collected from PolitiFact.2 FakeNewsNet (Shu fact-checking websites (e.g., Fatabyyano and Mis- et al., 2017a,b) is a mutli-dimensional data reposi- bar) have emerged to help combat the spread of ru- tory that contains 23,196 fact-checked articles from mors and fake news, especially over social media. PolitiFact and GossipCop.3 It also includes addi- They constitute a valuable resource of naturally- tional social context related to the checked claims. occurring fact-checked claims in the Arab world. Other datasets that focus on topic-specific The claims are annotated and verified by profes- claims were published recently. FakeCovid (Shahi sional fact-checkers and journalists, making them and Nandini, 2020) dataset covers 5,182 multi- a reliable source of information. While the veri- lingual fact-checked claims from 92 different fact- fied information about those rumors is posted on 1 https://gitlab.com/bigirqu/AraFacts/ those websites and their corresponding social me- 2 https://www.politifact.com/ 3 dia accounts, they are not gathered and unified as https://www.gossipcop.com/ 231 Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 231–236 Kyiv, Ukraine (Virtual), April 19, 2021.
checking websites, all related to COVID-19. More- fake-news, rumors, and conspiracy theories using over, PUBHEALTH dataset Kotonya and Toni Facebook’s claim rating system.5 2) FactuelAFP (2020) consists of 11,832 claims public health- Arabic6 is the Arabic bureau of the French press related claims attained from 8 different websites. news service. Their team consists of journalists Over the past few years, there has been an in- and fact-checkers. It aims to debunk false state- creased interest in task-specific Arabic applications ments, videos, or images that appear online. It is leading to the release of several Arabic datasets. certified by IFCN and collaborates with Facebook. Most of these are built for tasks such as check- 3) Misbar7 is a popular independent Arabic fac- worthiness of tweets, evidence retrieval, and claim t-checking platform. It uses an 8-point claim rating verification (Barrón-Cedeno et al., 2020; ?). system to label claims online. 4) Maharat-news Additionally, the Arabic News Stance dataset fact-o-meter8 is an IFCN-certified fact-checking (ANS) (Khouja, 2020) supports claim verification website that focuses on investigating rumors online and stance prediction. The dataset is generated and in real-life using a 3-point claim rating system. using existing Arabic news titles from ANT cor- 5) Verify-Sy9 is a media platform that specializes pus (Chouigui et al., 2017). The limitation with in detecting and debunking false news and media ANS is that it requires manual annotation to gener- by experienced journalists. ate claims from news titles. Elmadany et al. (2020) address this limitation with the AraNews dataset. 3.2 Data Extraction AraNews uses Arabic articles from online news From every fact-checking website, we crawled all data to automatically generate Arabic manipulated fact-checking articles and extracted their claims, text. labels, and metadata. We also extracted the claim Baly et al. (2018) proposed a corpus consisting type using some indicative search keywords. The of 422 fact-checked Arabic claims and documents claim type identifies if the claim refers to textual associated with the claims. The corpus supports information or if the claim is paired with visual multiple tasks such as stance detection and fact- information (such as a video or an image). For checking. AraFacts differs from their work in two example; The claim MISB 2941 is referring to a ways; it covers a larger number of claims and all fake image shown in Figure 2. claims are professionally fact-checked. While parsing the content of the HTML pages, While several datasets were constructed for dif- some challenges were faced. One challenge was ferent tasks that are related to fact-checking, to that websites like Verify-Sy do not explicitly label the best of our knowledge, AraFacts is the largest the veracity of the claim. To address this challenge, Arabic dataset that leverages emerging Arabic we reviewed the website’s editorial policy and in- fact-checking websites as a source of annotated spected a subset of the claims manually. We con- professionally-verified information. cluded that all claims are, in fact, false. Moreover, not all metadata fields were available on all web- 3 Data Collection sites. Table 1 shows the number of crawled claims In this section, we describe the process of selecting per website and a summary of available metadata. fact-checking websites, crawling the verification 3.3 Veracity and Category Normalization articles, and constructing the AraFacts dataset. Another faced challenge was the varying claim rat- 3.1 Fact-checking websites ing system adopted by the different websites. We We chose fact-checking websites that are either ver- proposed a normalized claim rating to achieve a ified by the International Fact-Checking Network standard rating for all claims in AraFacts. We also (IFCN) or popular in the Arab region. IFCN cer- used a normalized rating for the categories of the tification indicates that the website complies with claims. IFCN’s code of ethics. The following websites To normalize the veracity label of the claim, we were selected as data sources: 1) Fatabyyano4 is used the label normalization method presented by an IFCN-certified fact-checking organization that 5 http://bit.ly/2LnD1Rk launched in 2016. Fatabyyano collaborates with 6 https://factcheck.afp.com/ar/list 7 Facebook as a third-party fact-checker to debunk https://misbar.com/ 8 https://maharat-news.com/fact-o-meter 4 9 https://Fatabyyano.net/ https://www.verify-sy.com/ 232
Table 1: Summary of the amount of claims extracted from each website and some of their metadata Number Image or Extracted Metadata Website of Video Pages with Avg. no. of Pages with Avg. no. of Claims Claims Claim URLs claim URLs Evidence URLs Evidence URLs Misbar 2,952 1,974 2,946 6.4 2,945 2.6 Fatabyyano 1,503 930 905 2.2 1,449 4.7 FactuelAFP 973 824 514 2.4 461 3.8 Verify-sy 707 403 179 1.1 86 0.3 Maharat-news 87 10 62 1.1 0 0.0 Khouja (2020) with some variation. We set our checking article and included it in AraFacts. Fig- own claim rating scheme consisting of four labels ure 1 shows an example claim with some attributes. (False, Partly-False, Sarcasm, True) and mapped original labels to them. To do so, we referred to the ClaimID: MIS_2550 source websites’ methodology, manually inspected Claim: واالستمتاع بالعرض،كانون األول الجاري/ترامب خالل مقطع فيديو يدعو إلى ترقب مفاجأة بداية يناير a subset of the claims, then finally did the mapping Description: of 27 distinct original labels to four normalized مقطع فيديو،كانون األول المنصرم/ ديسمبر29 منذ تاريخ،تداولت حسابات وصفحات على موقع التواصل االجتماعي تويتر تم التوقيع2021 كانون الثاني عام/ ادعت فيه بأنّه قال هناك مفاجأة ستحدث بداية يناير،للرئيس األميركي دونالد ترامب labels. We adopted a similar mapping technique وأوضحت المنشورات أن هناك قنبلة سيفجرها ترامب مع بداية شهر، وعلى الجميع انتظارها واالستمتاع بالعرض،عليها يناير الجاري to map 35 distinct topical categories to 8 normal- Source: Misbar ized categories: (Politics, News, Health, Social, Date: 2021-01-02 Source_label: مضلل Religion, General Sciences, Arts & Culture, and Normalized_label: Partly-false Other). More details on our claim and category nor- Source_category: سياسة Normalized_category: Politics malization can be found in AraFacts repository. Source_url: https://misbar.com/factcheck/2021/01/02/ االنتخابية-حملته-وعود-من-كانت-ترامب-عنها-أعلن-التي-يناير-فاجأة The normalization process was performed by . two authors of this work independently; then, dis- Figure 1: An example claim from AraFacts. agreements were discussed and resolved. 3.4 Dataset Construction After extracting claim information and normaliz- ing the veracity labels and categories, we assigned a unique ID to each claim. We finally merged the crawled information for each website into one database with the following schema: 1) Claim-ID: ID of the claim. 2) Claim: Text of the claim. 3) Source: Name of the fact-checking website from which the claim was crawled. 4) Description: Detailed description of the claim. 5) Source-la- bel: The veracity label of the claim as it appears in the fact-checking website. 6) Normalized-la- bel: Normalized claim label. 7) Source-category: Topical category of the claim as it appears in the Figure 2: An example claim from AraFacts that has fact-checking website. 8) Normalized-category: a fake image. Claim MISB 2941 refers to this im- Normalized topical category of the claim. 9) Date: age claiming that UN decided to open applications for refugee resettlement in Greece. Article publication date. 10) Source URL: URL of the article. 11) Claim URLs: URLs to web pages spreading the claim. 12) Evidence URLs: URLs 4 Data Analysis referenced by the fact-checker to justify their an- notation. 13) Claim type: Indicates whether the In this section, we perform further analysis on the claim refers to text, an image or a video. collected claims. Table 1 gives an overall sum- Additionally, we extracted the content of the fact- mary of the collected claims and some of the ex- 233
Figure 4: Claims distribution over time. Figure 3: Distribution of normalized labels and cate- gories. Table 2: Top 5 source domains of claims. Domain % of URLs Figure 5: Most frequent 100 words in claims. facebook.com 54.5% twitter.com 25.0% perma.cc 4.1% fact the most fact-checking websites in our dataset archive.vn 3.1% were launched in the past two years. Figure 5 youtube.com 3.1% shows the most frequent 100 words in the claim text. We notice a mix of political figures, religious entities, and country names among others. tracted metadata. In addition to the number of Finally, we analyze the types of extracted claims. claims, for each website, we report the number Figure 6 shows the frequency of the different types of claims that have a refer to visual information of claims (Text claims, Image claims, and Video (image or video), number of claims that have em- claims) for each fact-checking website. Interest- bedded claim URLs, average number of embedded ingly, the two most contributing fact-checking web- URLs per claim, number of claims that have em- sites (Misbar and Fatabyyano) contain more visual bedded evidence URLs, and average number of claims than textual. evidence URLs per claim. 5 Example Use Cases Figure 3 illustrates the distribution of the normal- ized labels over the normalized categories. Most This section provides example use cases and claims are political or news, and false claims are research problems that can be supported by expectedly dominating in all categories. AraFacts. We also examined the original source domains of the claims from the embedded claim URLs. Ta- 5.1 Claim Verification ble 2 lists the most frequent 5 claim URLs. It is One of the main automated fake news detection worthwhile noting that social media platforms con- tasks is claim verification. The task is defined as fol- stitute the main source of claims and rumors in the lows: given a claim, predict its veracity. The meta- dataset. data can be used as features to classify the claim, Figure 4 shows the distribution of the claims over such as claim source URLs. While we provide four publication time. We notice that the majority of the normalized veracity labels of the claim, the orig- claims were published in 2020; this is due to the inal veracity labels can still be utilized following 234
6 Conclusion In this paper, we introduce AraFacts, the first large Arabic dataset of naturally-occurring claims cover- ing about 6k claims published since 2016 and al- ready annotated by professional fact-checkers from 5 different Arabic fact-checking websites over sev- eral topical categories. The dataset supports many research tasks such as claim verification, claim retrieval, also evidence retrieval. We made our dataset publicly available to the research commu- nity, and we plan for periodic crawling to augment the newly verified claims from existing and new Figure 6: Type of claims published by each fact- fact-checking websites. We believe that AraFacts checking website will serve as a valuable resource for future stud- ies on fact-checking, such as fake image detection. Our future work will focus on evaluating and uti- a Multi-Task Learning (MLT) approach similar to lizing AraFacts for several fact-checking tasks to Augenstein et al. (2019) MLT claim verification contribute the Arabic-based work in that domain. approach. Acknowledgments 5.2 Claim Retrieval This work was made possible by NPRP grant No.: The claim retrieval task is defined as follows: given NPRP11S-1204-170060 from the Qatar National a claim, check whether this claim has already been Research Fund (a member of Qatar Foundation). checked. The importance of this problem stems The statements made herein are solely the respon- from the fact that many posted claims over social sibility of the authors. media platforms are just repetitions of previously fact-checked old claims. Having a system that ad- dresses this problem can help combat the spread of References those circulating claims over social media. Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Chris- 5.3 Evidence Retrieval tian Hansen, and Jakob Grue Simonsen. 2019. Mul- tifc: A real-world multi-domain dataset for evidence- The task is defined as follows: given a claim, re- based fact checking of claims. In Proceedings of trieve evidential sentences that help in verifying or the 2019 Conference on Empirical Methods in Nat- debunking the claim. As we provide the content ural Language Processing and the 9th International Joint Conference on Natural Language Processing of the fact-checking articles and the URLs of evi- (EMNLP-IJCNLP), pages 4677–4691. dence pages of claims, AraFacts can be extended by further annotating evidence sentences from the Ramy Baly, Mitra Mohtarami, and James Glass. 2018. evidence pages to support such task. Integrating stance detection and fact checking in a unified corpus. In Proceedings of NAACL-HLT, pages 21–27. 5.4 Image-based Fact-checking Alberto Barrón-Cedeno, Tamer Elsayed, Preslav The task is defined as follows; given a claim that Nakov, Giovanni Da San Martino, Maram Hasanain, is mainly about an image, predict the factuality of Reem Suwaileh, Fatima Haouari, Nikolay Bab- the claim using the image and claim. This task ulkov, Bayan Hamdan, Alex Nikolov, et al. 2020. Overview of checkthat! 2020: Automatic identifica- has been explored for English claims (Zlatkova tion and verification of claims in social media. In In- et al., 2019), but it has not been studied in Arabic ternational Conference of the Cross-Language Eval- yet. While AraFacts does not include the images uation Forum for European Languages, pages 215– related to the claims, we indicate the claim type 236. Springer. and provide the URL to the fact-checking articles Amina Chouigui, Oussama Ben Khiroun, and Bilel where the images can be extracted and paired with Elayeb. 2017. Ant corpus: an arabic news text col- the claims. lection for textual classification. In 2017 IEEE/ACS 235
14th International Conference on Computer Systems and Applications (AICCSA), pages 135–142. IEEE. AbdelRahim Elmadany, Muhammad Abdul-Mageed, Tariq Alhindi, et al. 2020. Machine generation and detection of arabic manipulated and fake news. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 69–84. Md Saiful Islam, Tonmoy Sarkar, Sazzad Hossain Khan, Abu-Hena Mostofa Kamal, SM Murshid Hasan, Alamgir Kabir, Dalia Yeasmin, Moham- mad Ariful Islam, Kamal Ibne Amin Chowdhury, Kazi Selim Anwar, et al. 2020. Covid-19–related in- fodemic and its impact on public health: A global so- cial media analysis. The American Journal of Tropi- cal Medicine and Hygiene, 103(4):1621. Jude Khouja. 2020. Stance prediction and claim veri- fication: An arabic perspective. In Proceedings of the Third Workshop on Fact Extraction and VERifi- cation (FEVER), pages 8–17. Neema Kotonya and Francesca Toni. 2020. Ex- plainable automated fact-checking for public health claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Process- ing (EMNLP), pages 7740–7754, Online. Associa- tion for Computational Linguistics. Gautam Kishore Shahi and Durgesh Nandini. 2020. FakeCovid – a multilingual cross-domain fact check news dataset for covid-19. In Workshop Proceed- ings of the 14th International AAAI Conference on Web and Social Media. Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017a. Fake news detection on social me- dia: A data mining perspective. ACM SIGKDD Ex- plorations Newsletter, 19(1):22–36. Kai Shu, Suhang Wang, and Huan Liu. 2017b. Exploit- ing tri-relationship for fake news detection. arXiv preprint arXiv:1712.07709. William Yang Wang. 2017. “liar, liar pants on fire”: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 422–426. Dimitrina Zlatkova, Preslav Nakov, and Ivan Koychev. 2019. Fact-checking meets fauxtography: Verify- ing claims about images. In Proceedings of the 2019 Conference on Empirical Methods in Natu- ral Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2099–2108, Hong Kong, China. Association for Computational Linguistics. 236
You can also read