ARCOV-19: THE FIRST ARABIC COVID-19 TWITTER DATASET - ARXIV
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
ArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks Fatima Haouari, Maram Hasanain, Reem Suwaileh, Tamer Elsayed Computer Science and Engineering Department Qatar University Doha, Qatar {200159617, maram.hasanain, rs081123, telsayed}@qu.edu.qa Abstract In this paper, we present ArCOV-19, an Arabic COVID-19 Twitter dataset that covers the period from 27th of January till 31st of March 2020. ArCOV-19 is the first publicly-available Arabic Twitter dataset covering COVID-19 pandemic that includes around 748k popular tweets (according to Twitter search criterion) alongside the propagation networks of the most-popular subset of them. The propagation networks include both retweets and conversational threads (i.e., threads of replies). ArCOV-19 is designed to enable research under several domains including natural language processing, arXiv:2004.05861v2 [cs.CL] 18 Apr 2020 information retrieval, and social computing, among others. Preliminary analysis shows that ArCOV-19 captures rising discussions associated with the first reported cases of the disease as they appeared in the Arab world. In addition to the source tweets and the propagation networks, we also release the search queries and the language-independent crawler used to collect the tweets to encourage the curation of similar datasets. Keywords: Coronavirus pandemic, popular tweets, spread analysis, misinformation, conversational threads, retweets, social analytics. 1. Introduction dominant languages in Twitter (Alshaabi et al., 2020a), yet still under-studied in general. Twitter streams hundreds of millions of tweets daily. ArCOV-19 is the first Arabic Twitter dataset designed In addition to being a medium for the spread and con- to capture popular tweets (according to Twitter search sumption of news, it has been shown to capture the criterion) discussing COVID-19 starting from January dynamics of real-world events including the spread of 27th till the end of March 2020, constituting about diseases such as the seasonal influenza (Kagashe et al., 748k tweets, alongside the propagation networks of 2017) or more severe epidemics like Zika (Vijaykumar the most-popular subset of them.1 To our knowledge, et al., 2018), Ebola (Roy et al., 2020), and H1N1 (Mc- there is no publicly-available Arabic Twitter dataset Neill et al., 2016). Moreover, collective conversations for COVID-19 that includes the propagation network on Twitter about an event can have a great influence on of a good subset of its tweets. Some existing efforts the event’s outcomes, e.g., US 2016 presidential elec- have already started to curate COVID-19 datasets in- tions (Grover et al., 2019). Analyzing tweets about an cluding Arabic tweets, but Arabic is severely under- event, as it evolves, offers a great opportunity to un- represented (e.g., (Chen et al., 2020; Singh et al., derstanding its structure and characteristics, inform- 2020)) or represented by a random sample that is not ing decisions based on its development, and anticipat- specifically focused on COVID-19 (e.g., (Alshaabi et ing its outcomes as represented in Twitter and, more al., 2020b)), or the dataset is limited in the period it importantly, in the real world. covers and does not include the propagation networks Since the first reported case of Novel Coronavirus (e.g., (Alqurashi et al., 2020)). Furthermore, ArCOV-19 (COVID-19) in China, in November 2019, the COVID- includes only popular tweets as we hypothesize that 19 topic has drawn the interest of many Arab users over such tweets are more likely to be influential. Twitter. Their interest, reflected in the Arabic content The contribution of this paper is three-fold: on the platform, has reached a peak after two months • We release ArCOV-19, the first Arabic Twitter when the first case was reported in the United Arab dataset about COVID-19 that comprises 748k Emirates late in January 2020. This ongoing pandemic popular tweets per Twitter search criterion.2 In has, unsurprisingly, spiked discussions on Twitter cov- addition to the tweets, ArCOV-19 includes propa- ering a wide range of topics such as general informa- gation networks of a subset of 65k tweets, search tion about the disease, preventive measures, proce- queries, and documented implementation of our dures and newly-enforced decisions by governments, language-independent tweets crawler. up-to-date statistics of the spread in the world, and • We present a preliminary analysis on ArCOV-19, even the change in our daily habits and work styles. which reveals insights from the dataset concerning In this work, we aim to facilitate future research on so- temporal, geographical, and topical aspects. cial media during this complex and historical period of our history by curating an Arabic dataset (ArCOV-19) 1 The dataset will be continuously augmented with new that exclusively covers tweets about COVID-19. We tweets over the coming few months. 2 limit the dataset to Arabic since it is one of the most https://gitlab.com/bigirqu/ArCOV-19 1
• We suggest several use cases to enable research on 2.2 Propagation Networks Collection Arabic tweets in different research areas including, In addition to the source tweets, we also collected the but not limited to, emergency management, mis- propagation networks (i.e., retweets and conversa- information detection, and social analytics. tional threads) of the top 1000 most popular tweets The reminder of the paper is organized as follows. each day. To our knowledge, this is the first Arabic We present the crawling process followed to acquire tweet dataset to include such propagation networks. ArCOV-19 in Section 2. We then thoroughly discuss the Before getting the most popular tweets on any day, preliminary analysis that we conducted on the dataset we started from source tweets collected in that day in Section 3. We suggest some use cases to enable re- and applied a quality qualification pipeline. We first search on Arabic tweets in Section 4. We finally con- excluded tweets containing any inappropriate word clude in Section 5. (using a list of inappropriate words we constructed). Next, tweets with more than two URLs or four hash- 2. Data Collection tags, or shorter than four tokens (all are potentially ArCOV-19 includes two major components: the source spam) were dropped. Additionally, duplicate tweets tweets (i.e., tweets collected via Twitter search API that have exact textual content are also dropped to every day) and the propagation networks (i.e., avoid redundancy; only the most popular of them (ac- retweets and conversational threads of a subset of the cording to our scoring criterion below) is kept. Qual- source tweets). In this section, we present how we col- ified tweets are then scored by popularity defined by lected each in Sections 2.1, and 2.2 respectively, and the sum of the tweet’s retweet and favorite counts. We summarize the released data in Section 2.3. finally sort the qualified tweets by their scores and se- lect the most popular 1,000 tweets. We denote the set 2.1 Tweets Collection of all such tweets over all days as the top subset. To collect the source tweets, we implemented a For those 1,000 tweets per day, we then collected popularity-based tweet crawler.3 The crawler takes a all retweets and conversational threads (i.e., direct set of manually-crafted queries, comprising a target and indirect replies). We collected the retweets using topic, as input. At the end of each day, the crawler Pickaw,8 a platform for organizing contests on social issues a search request for each of those queries via media, and the replies9 using PHEME (Zubiaga et al., Twitter search API4 to get the top popular tweets as 2016) Twitter conversation collection script.10 defined by Twitter API5 on that day. Queries can be keywords (e.g., “Corona”), phrases (e.g., “the killing 2.3 Data Release virus”), or hashtags (e.g., “#the new coronavirus”). In summary, we release the following resources as Twitter returns a maximum of 3,200 tweets per query. ArCOV-19 dataset, taking into consideration Twitter We customized the search requests to return only Ara- content redistribution policy:11 bic tweets,6 and to exclude all retweets to avoid having • Source Tweets: tweet IDs of the tweets crawled copies of the same popular tweet. Additionally, dupli- in each day. cate tweets (returned by different queries) are removed • Search Queries: the list of search queries, in- in each day. Finally, tweets are sorted chronologically. cluding keywords, phrases, and hashtags, used in We started collecting our data since the 27th of Jan- each period to collect our source tweets. uary 2020 using a set of queries that we manually- • Top Subset: tweet IDs of the top 1,000 most updated based on our daily tracking of trending key- popular tweets for each day. words and hashtags. The full list of queries used in • Propagation Networks: the propagation net- each day is released alongside our dataset. We denote works for the top subset which include for each all collected tweets as source tweets. tweet in the top subset: Due to technical reasons, we missed collecting tweets for a few days. Due to Twitter search API restric- – Retweets: tweet IDs of the full retweet set. tion that limits the search results to the past 7 days, – Conversational Threads: tweet IDs of the we missed the old tweets. To overcome that, we full reply thread (including direct and indi- used GetOldTweets37 python library to download the rect replies). search results using the same trending keywords and hashtags we selected around those days. Along with the dataset, we provide, in our repository, some pointers to publicly-available crawlers that users 3 https://gitlab.com/bigirqu/ArCOV-19/- can easily use to crawl the tweets given their IDs. /tree/master/code/crawler 4 8 https://developer.twitter.com/en/docs/tweets/ https://pickaw.com/en/twrench-becomes-pickaw 9 search/api-reference/get-search-tweets By the time of writing, we have collected the replies 5 We are not aware of the criterion used by Twitter to for the first 22 days of the dataset and are still collecting retrieve the top tweets, but we believe it is based on a mix the rest; we will make all available in the near future. 10 of likes, retweets, impressions, etc. https://github.com/azubiaga/pheme-twitter- 6 The language of tweets to return is a configurable pa- conversation-collection 11 rameter. https://developer.twitter.com/en/developer- 7 https://github.com/Mottl/GetOldTweets3 terms/agreement-and-policy 2
Tweets number across days (verified/unverified) 22,000 3. ArCOV-19 in Numbers 20,000 Posted by verified user Posted by unverified user In this section, we present a statistical summary and 18,000 conduct a preliminary analysis on ArCOV-19 to shed 16,000 Number of tweets 14,000 some light on its major characteristics. 12,000 10,000 3.1 Tweets and Users Distribution 8,000 Table 1 presents an overall summary of the tweets and 6,000 users statistics in ArCOV-19. It indicates that The to- 4,000 tal number of source tweets in the dataset is about 2,000 748k posted by about 249k unique users. We note that 0 01-Mar 03-Mar 05-Mar 07-Mar 09-Mar 11-Mar 13-Mar 15-Mar 17-Mar 19-Mar 21-Mar 23-Mar 25-Mar 27-Mar 29-Mar 31-Mar 27-Jan 29-Jan 31-Jan 02-Feb 04-Feb 06-Feb 08-Feb 10-Feb 12-Feb 14-Feb 16-Feb 18-Feb 20-Feb 22-Feb 24-Feb 26-Feb 28-Feb 14.7% of the tweets were posted by verified users (who constitute only 1.42% of the unique users). This is a relatively large percentage, showing that a good por- Figure 1: Distribution of source tweets in ArCOV-19 tion of the source tweets are indeed popular. The aver- over time. Dark blue indicates tweets posted by veri- age numbers of followers, friends, and statuses of users fied users, while light blue indicates tweets posted by are also relatively large, showing that users observed unverified users. in ArCOV-19 are also popular (and possibly more in- Akhbar | أخبار اآلن 2613 صحيفة البيان 1855 fluential). The table also indicates that 21.3% of the CGTN Arabic 1473 عبدالعزيز الملحم 1189 tweets include URLs. We anticipate that this is due to Mulhak 1177 الخليج أونالين 1150 the extensive spreading of news (linked from tweets) NEWS KTV 1146 Bawabaa News 1133 during this period. The numbers of tweets that are قناة الغد 1106 Media News Agency 1053 geotagged and geolocated are also indicated in the ta- اليوم السابع 1043 صدى البلد 1036 ble; however, we defer the discussion of such types of الجوازات السعودية 955 العربية 954 tweets to Section 3.3. الحدث 924 القبس اإللكتروني 924 Figure 1 illustrates the distribution of tweets over صحيفة الخليج 911 بوابة الوطن 900 time in ArCOV-19. It is clearly non-uniform. More- الغد الكويتية 897 ترند 895 over, the volume of tweets hugely increases when the أخبار كورونا 829 جريدة األنباء 822 virus started to spread in several countries in the Arab صحيفة الشرق األوسط 810 جريدة النهار الكويتية 794 world. The figure also shows the distribution of the AlMamlakaTV 787 0 500 1000 1500 2000 2500 tweets posted by verified users vs. unverified users, Number of tweets showing a similar pattern. Figure 2 presents the top 25 tweeting users in Figure 2: Number of tweets posted by the top 25 tweet- ArCOV-19. We notice that several of them are news ers in ArCOV-19. sources, and 20 are verified Twitter accounts. This is again consistent with our crawling criteria focusing discussed over Twitter during that period. To help ex- only on popular tweets. amine this hypothesis, we analyzed the textual content of the tweets; in particular, we identified the most- Table 1: Tweets and users statistics of ArCOV-19. frequent words, hashtags, and Arab country names in Tweets Statistics the entire dataset, then tracked their frequency over time. Figure 3 shows the time series (over days) of the Source Tweets 747,599 three types of keywords.12 Geotagged 486 (0.07%) The 10 most-frequent words shown in Figure 3(a) in- Geolocated 20,031 (2.68%) dicate two different types of words: those that are di- Posted by verified users 110,141 (14.7%) rectly related to COVID-19 (e.g., “health”) and those Include URL 159,239 (21.3%) that are not but related to prayers and supplications Top Subset 65,000 (8.69%) (e.g., “Allah/God”)). It is interesting to see the word Users Statistics “Allah” is very frequent early on when the news about the virus started to spread (probably over discussions Unique 248,513 around whether the pandemic is a punishment from Verified 3,537 (1.42%) God or not, we believe), then declines over time only Average followers count 8,334 to become frequent again when the virus started to Average friends count 1,274 widely spread in the Arab world. Average statuses count 11,201 Figure 3(b) demonstrates how “#China” hashtag was the most trending one from 27 of January until 20 of February, as COVID-19 was prevalent only in China 3.2 Tweets Content & Topics and still not widely spread (at that time) in other It is important to demonstrate that the tweets in ArCOV-19 constitute a good representative sample of 12 When identifying the most frequent words and hash- the Arabic tweets posted during the target period on tags, we excluded the ones we used in our search queries COVID-19, and that they cover the prevalent topics since they are expected to be very frequent by definition. 3
30% هللا هللا حالة بفيروس الصحة بسبب 25% اصابة العالم انتشار اللهم عدد اصابة 20% بفيروس حالة % tweets الصحة بسبب 15% اللهم انتشار 10% العالم 5% 0% عدد 01-Mar 03-Mar 05-Mar 07-Mar 09-Mar 11-Mar 13-Mar 15-Mar 17-Mar 19-Mar 21-Mar 23-Mar 25-Mar 27-Mar 29-Mar 31-Mar 27-Jan 29-Jan 31-Jan 02-Feb 04-Feb 06-Feb 08-Feb 10-Feb 12-Feb 14-Feb 16-Feb 18-Feb 20-Feb 22-Feb 24-Feb 26-Feb 28-Feb (a) Top 10 most frequent words 20% _كورونا# #الكويت #الصين الكويت _كورونا# #إيران #عاجل الصين# _كورونا# في_الكويت #كورونا_في_الكويت #السعودية 15% لبنان #كورونا_الكويت #كورونا_لبنان #لبنان #covid19 لبنان# % tweets الكويت# عاجل# 10% إيران# #covid19 5% السعودية# 0% 11-Mar 13-Mar 15-Mar 17-Mar 19-Mar 21-Mar 23-Mar 25-Mar 27-Mar 29-Mar 31-Mar 2-Feb 4-Feb 6-Feb 8-Feb 27-Jan 29-Jan 31-Jan 10-Feb 12-Feb 14-Feb 16-Feb 18-Feb 20-Feb 22-Feb 24-Feb 26-Feb 28-Feb 1-Mar 3-Mar 5-Mar 7-Mar 9-Mar (b) Top 10 most frequent hashtags Time of first reported infected case ليبيا مصر لبنان الكويت عمان البحرين العراق قطر السعودية تونس المغرب األردن سوريا اإلمارات الجزائر السودان موريتانيا فلسطين 12% مصر لبنان الكويت الكويت مصر لبنان السعودية االمارات العراق البحرين قطر 10% البحرين العراق اليمن االردن الجزائر سوريا 8% المغرب السودان تونس فلسطين % tweets ليبيا عمان موريتانيا الصومال 6% الجزائر اإلمارات عمان 4% السعودية قطر ليبيا 2% 0% 17-Mar 31-Mar 01-Mar 03-Mar 05-Mar 07-Mar 09-Mar 11-Mar 13-Mar 15-Mar 19-Mar 21-Mar 23-Mar 25-Mar 27-Mar 29-Mar 27-Jan 29-Jan 31-Jan 02-Feb 04-Feb 06-Feb 08-Feb 10-Feb 12-Feb 14-Feb 16-Feb 18-Feb 20-Feb 22-Feb 24-Feb 26-Feb 28-Feb (c) Word frequency of Arab countries Figure 3: Time series of the frequency (in percentage of tweets) of top general keywords, hashtags, and Arab county words over the period of ArCOV-19. The time of first reported cases in Arab countries is also indicated and aligned with the time series. 4
youm7.com 1044 countries. We can see how it started to be less trend- alarabiya.net 872 youtube.com 627 ing as the number of cases started to decline in China cnn.com 508 sabq.org 441 by that date.13 On the other hand, when COVID- cgtn.com 433 skynewsarabia.com 431 19 started to spread in the Arab world, other hash- rt.com 404 aa.com.tr 389 tags started to become viral. Furthermore, Figure 3(b) akhbaralaan.net 372 elwatannews.com 360 shows that frequency spikes of “#Iran”, “#Lebanon”, aawsat.com 305 almasryalyoum.com 290 and “#Kuwait” exactly match the confirmation dates euronews.com 276 oaurl.com of first reported cases in Iran,14 Lebanon,15 and 275 spa.gov.sa 261 alhurra.com Kuwait,16 respectively. 250 24.ae 241 okaz.com.sa 226 To further analyze trending topics, Figure 3 features pal4all.live 212 argaam.com 206 the timeline of the first reported cases in the Arab albayan.ae 195 alqabas.com 185 countries and aligns them with the time series through- shorouknews.com 182 alhadath.net 178 out the figure. Aligning the timeline with the series 0 200 400 600 800 1000 in Figure 3(c) reveals a significant match between the Number of occurrences frequency peaks of several country names and the cor- responding dates of first reported cases in those coun- Figure 4: Most frequently-linked domains in the top tries, most notably in UAE, Egypt, Lebanon, Kuwait, tweets subset. Oman, Bahrain, Algeria, and Libya. Figure 3 also demonstrates the power of ArCOV-19 We observe that these news websites mainly origi- in capturing further controversial and trending top- nate from three Arab countries, namely, Egypt, Saudi ics. Table 2 shows a timeline covering dates of specific Arabia, and United Arab Emirates. Interestingly, topics of discussion trending on social media around URLs from the China “Global Television Network” times of peaks in tweeting frequency in Figure 3(c). (cgtn.com) are commonly shared as well, which can be Country Date Topic Related News explained by the fact that COVID-19 started in and UAE Feb 19 Yemeni Foreign http://tiny.cc/71qtmz heavily affected China. Affairs Minister thanks UAE for 3.3 Geographic Distribution of Tweets evacuating Yemeni students from Although Twitter provides automatic geo-reference China Iraq Feb 20 Iraq announces http://tiny.cc/yzrtmz functionally, few users (solely around 1-3% (Murdock, closure of borders 2011; Jurgens et al., 2015)) opt to enable it due to pri- with Iran vacy and safety reasons. Alternatively, to have an in- Egypt Mar 2 Egyptian health http://tiny.cc/7iqtmz minister announces sight about the geographical distribution and diversity a visit to China of the tweets in our dataset, we examined the place and Mar 7 Kuwait bans http://tiny.cc/lqstmz coordinates attributes of the Tweet object.17 We note travels with Egypt including entry of that the place attribute is an optional attribute that Egyptian residents allows the user to select a location from a list provided KSA Mar 4-5 Closure of The http://tiny.cc/iqrtmz by Twitter (therefore, the location might not necessar- Grand Mosque in Mecca ily show the actual location from where the tweet is posted), whereas the coordinates attribute represents Table 2: Examples of trending topics in social media the geographic location of the Tweet as reported by matched by spikes in tweeting frequency in ArCOV-19 the user or client application. Table 1 shows that ArCOV-19 has 20,031 geolocated We further explore the topics discussed in ArCOV-19 by tweets (i.e., having values in the place attribute) and considering the domains of URLs shared in the tweets. 486 geotagged tweets (i.e., having values in the co- We focused on the top tweets subset of ArCOV-19 (con- ordinates attribute). Those tweets were posted by structed as detailed in Section 2.2) and identified the 9,543 and 82 unique users respectively. The geolo- URLs posted in those tweets. We then expanded them cated tweets constitute about 2.68% of the total source (since Twitter URLs are shortened) and dropped URLs tweets, which is slightly higher than what is reported of images and videos to focus only on URLs referring to in previous studies on disaster datasets (Anderson et external sources. We extracted 17,308 URLs from 759 al., 2019). We anticipate that this is due to the cri- unique domains. Figure 4 shows that URLs from news teria we followed to collect the tweets, which relies on websites are dominantly the most-commonly shared. Twitter’s popularity criteria. 13 Focusing our attention on the geolocated tweets, Fig- https://www.worldometers.info/coronavirus/ ures 5 and 6 depict the overall and daily distribu- country/china/ 14 tions (respectively) of those tweets over Arab and other https://en.wikipedia.org/wiki/2020_ coronavirus_pandemic_in_Iran#February_2020 countries. The geolocated tweets were indeed posted 15 https://www.the961.com/first-coronavirus- by users from 81 countries from around the globe. We case-confirmed-in-lebanon/ 16 17 https://www.kuna.net.kw/ArticleDetails.aspx? https://developer.twitter.com/en/docs/tweets/ id=2865550&language=en data-dictionary/overview/geo-objects 5
Bahrain 1.8% 3.4 Propagation Networks Libya 2.0% As discussed earlier, we collected the propagation net- Jordan works of the top subset. Overall, the collected number 2.6% Qatar KSA of retweets is 2,697,951 for the entire subset, and the 2.6% 35.3% Iraq collected number of replies is 301,535 for only 22 days 3.7% Egypt of the time period.19 Both are released in ArCOV-19. 5.2% At the time of collecting the retweets, we were not Oman 6.1% able to get the retweets of some of the tweets either because they were deleted or they were posted from UAE 6.8% private accounts. For those collected successfully, Fig- ure 7 illustrates the distribution of retweets per day Lebanon using boxplots. It shows that the average (and me- 8.2% Kuwait Other 12.6% dian) number of retweets follows a similar pattern to 9.0% what was shown in Figure 1 (notice that the Y-axis here is in log scale). We also notice that a good num- Figure 5: Distribution of geolocated tweets. ber of tweets got more than 100 retweets; some of them got even larger numbers reaching about 10k retweets 800 Somalia Palestine or more, showing highly propagated content. Mauritania Syria 105 Yemen 600 Tunisia Sudan Algeria 104 Morocco Number of tweets Bahrain Number of retweets 400 Libya 103 Jordan Qatar Iraq Egypt 102 200 Oman UAE Lebanon 101 Other 0 Kuwait 27/01 29/01 31/01 02/02 04/02 06/02 08/02 10/02 12/02 14/02 16/02 18/02 20/02 22/02 24/02 26/02 28/02 01/03 03/03 05/03 07/03 09/03 11/03 13/03 15/03 17/03 19/03 21/03 23/03 25/03 27/03 29/03 31/03 KSA 100 27/01 29/01 31/01 02/02 04/02 06/02 08/02 10/02 12/02 14/02 16/02 18/02 20/02 22/02 24/02 26/02 28/02 01/03 03/03 05/03 07/03 09/03 11/03 13/03 15/03 17/03 19/03 21/03 23/03 25/03 27/03 29/03 31/03 Figure 6: Distribution of geolocated tweets per day. Figure 7: Distribution of retweets per day for the top found that 91% of them were posted from the Arab subset. world. The largest contribution of the geolocated con- tent (about 35.3%) comes from Saudi Arabia; this is somewhat expected as Saudi users represent the high- 104 est number of active Twitter users in the Arab world.18 Surprisingly, Kuwait comes second with 12.6% of the geolocated content. We believe the rationale is that it 103 Number of replies was among the first countries that reported the cases in the Middle East. Since then, Kuwait started a se- 102 ries of strict precautions such as a wide lock-down in many vital facilities until the government had imposed 101 a nationwide curfew. We think, after the curfew, the people become more active on Twitter as a platform 100 to break news and discuss developments of the virus. 27/01 29/01 31/01 02/02 04/02 06/02 08/02 10/02 12/02 14/02 16/02 18/02 20/02 22/02 24/02 26/02 28/02 01/03 03/03 05/03 07/03 09/03 11/03 13/03 15/03 17/03 19/03 21/03 23/03 25/03 27/03 29/03 31/03 Additionally, we used related phrases and hashtags to “Kuwait”, “Lebanon”, and “UAE” among the tracking keywords in the period between 22 of February and Figure 8: Distribution of replies per day for the top 13 of March which demonstrates the reason behind subset, for only 22 days. their high percentages during that period, as shown in Figure 6. Furthermore, it is not surprising to see the Figure 8 illustrates the distribution of replies per day countries that have a few or no cases have the smallest using boxplots, but only for 22 days that we managed portions of contribution to the content. Others have to crawl by the time of writing. However, the smaller small audience userbases on Twitter (e.g., Tunisia). period shows similar pattern, with a bit lower numbers, to the pattern shown in Figure 7. 18 https://www.statista.com/statistics/242606/ 19 number-of-active-twitter-users-in-selected- Crawling is still going on for the rest of the days and countries/ the repository will be updated accordingly. 6
4. Enabling Research versational threads available in ArCOV-19 provide a With the spread of Novel Coronavirus in several Arab valuable resource for early detection of fake news and countries and the subsequent procedural measures identification of malicious rumor spreading accounts. taken by the local governments, it continued to domi- 4.3 Social Analytics nate discussions over social media (and Twitter in par- ticular) to the time of this writing. In addition to sharing reports and awareness, people tend to express their emotions during the outbreak To enable research on different tasks on Arabic Tweets, (e.g., big change of lifestyle, social distancing, losing we make ArCOV-19 publicly-available for the research their beloved ones, distance learning, etc.). Therefore, community. ArCOV-19 is composed of the popu- analyzing the sentiment of the tweets to understand lar tweets, according to Twitter popularity criteria. the effects of the outbreak on people is of interest to Therefore, we envision that it contains the tweets that many stakeholders. had received the maximum social interactions (e.g., Additionally, lots of discussions focused on the causes impressions) which implies they are useful tweets to of Novel Coronavirus and its rapid spread, which can analyze as they got the most attention of the com- be studied to understand the stance of users toward dif- munity. We further provide propagation networks of ferent aspects of the situation such as claims, and the daily highly popular subset (the top 1,000 most pop- consequences of the spread. For example, the catas- ular) to ensure the quality of tweets. This sample is trophic situation caused by the rapid spread of the drawn after filtering out potential low-quality tweets virus made hospitals go beyond their capacity, which (e.g., spam) and content duplicates. forced doctors to deal with agonizing life-death deci- This careful design and architecture of ArCOV-19 sions triggering heated discussions on such issue. makes it suitable for diverse natural language pro- Moreover, in some countries, people are demanding to cessing, information retrieval, and computational so- return resident labors back to their countries to re- cial science research tasks including, but not limited duce the spread of the virus and limit the country’s to, emergency management, misinformation detection, resources to only citizens. This stance sparked contro- and social analytics, as we discuss below. We antici- versial discussion on social media on the legal, human- pate ArCOV-19 to support these tasks on the popular itarian, and ethical aspects of these demands. This re- Arabic tweets that we assume have the most effect on quires analytical tools to detect hate speech and, gen- shaping public opinion and understanding. erally, offensive language. 4.1 Emergency Management As COVID-19 is an international pandemic, national 5. Conclusion and international health organizations need to analyze In this paper, we presented ArCOV-19, the first Arabic the effects of the outbreak. We envision ArCOV-19 Twitter dataset about the Novel Coronavirus (COVID- to support several tasks such as filtering informative 19) that includes popular tweets and propagation net- content, summarization, eyewitness identification, ge- works, focusing on popular discussions on Twitter. We olocation, and studying information (and situational release all source tweets, top subset, search queries, awareness) propagation. The popular tweets provide and the propagation networks. Preliminary analysis key insights into different aspects that are of inter- showed that ArCOV-19 captured spikes in tweeting fre- est to many users. For example, recovered users write quency for country-specific tweets that are consistent about the symptoms of the virus they experienced, in- with the first reported cases of COVID-19 in several fected users search for possible medications, and non- Arab countries. We also found dominance of news infected people look for protective measures. Other agencies among top tweeting users and among most types of information of interest that help for situa- shared URLs. ArCOV-19 enables research under many tional awareness include reports (tolls of cases, deaths, domains including natural language processing, infor- recovered), development of precautions (e.g., quaran- mation retrieval, and social computing. We plan to tine, lockdown, curfew, etc.), among others. continue collecting tweets for the foreseeable future and the dataset will be continuously updated with 4.2 Misinformation Detection newly collected tweets and propagation networks. With the sheer amount of information shared about COVID-19, many rumors are disseminated and get- Acknowledgments ting high attention from the community, which causes The work of Tamer Elsayed and Maram Hasanain a fast propagation of misinformation. This hinders the was made possible by NPRP grant# NPRP 11S-1204- efforts of the health and governmental organizations 170060 from the Qatar National Research Fund (a on fighting the pandemic as such rumors spread panic member of Qatar Foundation). The work of Reem and may lead to undesirable consequences (e.g., in- Suwaileh was supported by GSRA grant# GSRA5-1- crease of cases and mortality rate, or lack of supplies 0527-18082 from the Qatar National Research Fund due to hoarding). ArCOV-19 supports studying infor- and the work of Fatima Haouari was supported by mation/claims propagation, claims check-worthiness GSRA grant# GSRA6-1-0611-19074 from the Qatar detection and verification tasks on the most popular National Research Fund. The statements made herein tweets. Furthermore, the retweet networks and con- are solely the responsibility of the authors. 7
References American journal of infection control, 46(5):549– 557. Alqurashi, S., Alhindi, A., and Alanazi, E. (2020). Zubiaga, A., Liakata, M., Procter, R., Hoi, G. W. S., Large arabic twitter dataset on covid-19. and Tolmie, P. (2016). Analysing how people orient Alshaabi, T., Dewhurst, D. R., Minot, J. R., Arnold, to and spread rumours in social media by looking at M. V., Adams, J. L., Danforth, C. M., and Dodds, conversational threads. PloS one, 11(3). P. S. (2020a). The growing echo chamber of social media: Measuring temporal and social contagion dy- namics for over 150 languages on twitter for 2009– 2020. Alshaabi, T., Minot, J., Arnold, M., Adams, J. L., Dewhurst, D. R., Reagan, A. J., Muhamad, R., Danforth, C. M., and Dodds, P. S. (2020b). How the world’s collective attention is being paid to a pandemic: Covid-19 related 1-gram time se- ries for 24 languages on twitter. arXiv preprint arXiv:2003.12614. Anderson, J., Casas Saez, G., Anderson, K. M., Palen, L., and Morss, R. (2019). Incorporating Context and Location Into Social Media Analysis: A Scal- able, Cloud-Based Approach for More Powerful Data Science, volume 6. IEEE. Chen, E., Lerman, K., and Ferrara, E. (2020). Covid- 19: The first public coronavirus twitter dataset. arXiv preprint arXiv:2003.07372. Grover, P., Kar, A. K., Dwivedi, Y. K., and Janssen, M. (2019). Polarization and acculturation in us elec- tion 2016 outcomes âĂŞ can twitter analytics predict changes in voting preferences. Technological Fore- casting and Social Change, 145:438 – 460. Jurgens, D., Finethy, T., McCorriston, J., Xu, Y. T., and Ruths, D. (2015). Geolocation prediction in twitter using social networks: A critical analysis and review of current practice. Kagashe, I., Yan, Z., and Suheryani, I. (2017). En- hancing seasonal influenza surveillance: Topic analy- sis of widely used medicinal drugs using twitter data. J Med Internet Res, 19(9):e315. McNeill, A., Harris, P. R., and Briggs, P. (2016). Twit- ter influence on uk vaccination and antiviral uptake during the 2009 h1n1 pandemic. Frontiers in Public Health, 4:26. Murdock, V. (2011). Your mileage may vary: on the limits of social media. SIGSPATIAL Special, 3(2):62–66. Roy, M., Moreau, N., Rousseau, C., Mercier, A., Wil- son, A., and Atlani-Duault, L. (2020). Ebola and localized blame on social media: analysis of twit- ter and facebook conversations during the 2014–2015 ebola epidemic. Culture, Medicine, and Psychiatry, 44(1):56–79. Singh, L., Bansal, S., Bode, L., Budak, C., Chi, G., Kawintiranon, K., Padden, C., Vanarsdall, R., Vraga, E., and Wang, Y. (2020). A first look at covid-19 information and misinformation sharing on twitter. arXiv preprint arXiv:2003.13907. Vijaykumar, S., Nowak, G., Himelboim, I., and Jin, Y. (2018). Virtual zika transmission after the first us case: who said what and how it spread on twitter. 8
You can also read