ARCOV-19: THE FIRST ARABIC COVID-19 TWITTER DATASET WITH PROPAGATION NETWORKS
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
ArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks Fatima Haouari, Maram Hasanain, Reem Suwaileh, Tamer Elsayed Computer Science and Engineering Department, Qatar University {200159617, maram.hasanain, rs081123, telsayed}@qu.edu.qa Abstract Since the first reported case of Novel Coron- avirus (COVID-19) in China, in November 2019, In this paper, we present ArCOV-19, an Arabic COVID-19 Twitter dataset that spans one year, the COVID-19 topic has drawn the interest of many covering the period from 27th of January 2020 Arab users over Twitter. Their interest, reflected till 31st of January 2021. ArCOV-19 is the first in the Arabic content on the platform, has reached publicly-available Arabic Twitter dataset cov- a peak after two months when the first case was ering COVID-19 pandemic that includes about reported in the United Arab Emirates late in Jan- 2.7M tweets alongside the propagation net- uary 2020. This ongoing pandemic has, unsur- works of the most-popular subset of them (i.e., prisingly, spiked discussions on Twitter covering a most-retweeted and -liked). The propagation networks include both retweets and conversa- wide range of topics such as general information tional threads (i.e., threads of replies). ArCOV- about the disease, preventive measures, procedures 19 is designed to enable research under several and newly-enforced decisions by governments, up- domains including natural language process- to-date statistics of the spread in the world, and ing, information retrieval, and social comput- even the change in our daily habits and work styles. ing. Preliminary analysis shows that ArCOV- In this work, we aim to support future research 19 captures rising discussions associated with on social media during this historical period of our the first reported cases of the disease as they time by curating an Arabic dataset (ArCOV-19) appeared in the Arab world. In addition to the source tweets and propagation networks, we that exclusively covers tweets about COVID-19. also release the search queries and language- We limit the dataset to Arabic since it is among independent crawler used to collect the tweets the most dominant languages in Twitter (Alshaabi to encourage the curation of similar datasets. et al., 2020a), yet under-studied in general. ArCOV-19 is the first Arabic Twitter dataset 1 Introduction designed to capture tweets discussing COVID-19 Twitter streams hundreds of millions of tweets starting from January 27th 2020 till end of January daily. In addition to being a medium for the spread 20211 , constituting about 2.7M tweets, alongside and consumption of news, it has been shown to the propagation networks of the most-popular sub- capture the dynamics of real-world events includ- set of them. To our knowledge, there is no publicly- ing the spread of diseases such as the seasonal available Arabic Twitter dataset for COVID-19 that influenza (Kagashe et al., 2017) or more severe includes the propagation networks of a good subset epidemics like Zika (Vijaykumar et al., 2018), of its tweets. Some existing efforts have already Ebola (Roy et al., 2020), and H1N1 (McNeill et al., started to curate COVID-19 datasets including Ara- 2016). Moreover, collective conversations on Twit- bic tweets, but Arabic is severely under-represented ter about an event can have a great influence on the (e.g., (Chen et al., 2020a; Singh et al., 2020)) or event’s outcomes, e.g., US 2016 presidential elec- represented by a random sample that is not specifi- tions (Grover et al., 2019). Analyzing tweets about cally focused on COVID-19 (e.g., (Alshaabi et al., an event, as it evolves, offers a great opportunity 2020b)), or the dataset is limited in the period it to understanding its structure and characteristics, covers and does not include the propagation net- informing decisions based on its development, and works (e.g., (Alqurashi et al., 2020)). anticipating its outcomes as represented in Twitter 1 The dataset will be continuously augmented with new and, more importantly, in the real world. tweets over the coming months.
The contribution of this paper is three-fold: constituting 65% and 12% of the tweets respec- tively. Chen et al. (2020b) are continously collect- • We release ArCOV-19,2 the first Arabic Twit- ing tweets since 21st of January 2020, where at ter dataset about COVID-19 that comprises the time of writing, around 66% and 11% of the about 2.7M tweets collected via Twitter search tweets are in English and Spanish respectively. An- API. It covers a full year allowing for captur- other multilingual ongoing collection is released ing discussions on the topic since it started to by Lopez et al. (2020). Qazi et al. (2020) enriched be popular in the Arab world. In addition to their large-scale multilingual Twitter dataset with the tweets, ArCOV-19 includes propagation geolocation information. networks of the most popular subset, search The only raw Arabic Twitter dataset available is queries, and documented implementation of the one released by Alqurashi et al. (2020), how- our language-independent tweets crawler. ever it covers the period from 1st of January to 15th • We present a preliminary analysis on ArCOV- April 2020 only. Moreover, it does not include the 19, which reveals insights from the dataset propagation networks of tweets. concerning temporal, geographical, and topi- Compared to existing datasets, we release an cal aspects. only-Arabic collection of tweets, where we exclude • We suggest several use cases to enable re- retweets to avoid having copies of the same tweet. search on Arabic tweets in different research Moreover, we also release the most-popular sub- areas including, but not limited to, emergency set of them (i.e., most-retweeted and -liked and management, misinformation detection, and spam free tweets). Furthermore, we collect the social analytics. propagation networks including both retweets and conversational threads (i.e., threads of replies) for The reminder of the paper is organized as fol- the most-popular subset. lows. We present the crawling process followed to acquire ArCOV-19 in Section 3. We then thor- 3 Data Collection oughly discuss the analysis that we conducted on the dataset in Section 4. We suggest some use cases ArCOV-19 includes two major components: the to enable research on Arabic tweets in Section 5. source tweets (i.e., tweets collected via Twitter We finally conclude in Section 6. search API every day) and the propagation net- works (i.e., retweets and conversational threads 2 Related Work of a subset of the source tweets). In this section, we present how we collected each in Sections 3.1, Social media platforms and Twitter specifically and 3.2 respectively, and summarize the released showed to be an indispensable medium for sharing data in Section 3.3. information and discussions about the COVID-19 pandemic since December 2019. A considerable 3.1 Tweets Collection body of raw social media datasets were released To collect the source tweets, we used our tweet to facilitate analyzing these discussions including crawler3 that uses Twitter search API4 . The crawler Twitter datasets (Banda et al., 2020; Lopez et al., takes a set of manually-crafted queries, comprising 2020; Chen et al., 2020b; Qazi et al., 2020; Gao a target topic, as input. At the end of each day, the et al., 2020; Alqurashi et al., 2020; Dashtian and crawler issues a search request for each of those Murthy, 2021), Instagram (Zarei et al., 2020), or queries. Queries can be keywords (e.g., “Corona”), the Chinese social media platform Weibo (Hu et al., phrases (e.g., “the killing virus”), or hashtags (e.g., 2020; Gao et al., 2020). “#the new coronavirus”). Twitter returns a maxi- Banda et al. (2020) released a multilingual mum of 3,200 tweets per query. We customized dataset that includes over 800M tweets, the data is the search requests to return only Arabic tweets,5 collected since 1st of January 2020 using Twitter and to exclude all retweets to avoid having copies streaming API. Dashtian and Murthy (2021), col- 3 lected multilingual tweets posted between March https://gitlab.com/bigirqu/ArCOV-19/ -/tree/master/code/crawler and July 2020. The most dominant languages 4 https://developer.twitter.com/en/ covered by their dataset are English and Spanish docs/tweets/search/api-reference/ get-search-tweets 2 5 https://gitlab.com/bigirqu/ArCOV-19/ Language of tweets to return is a configurable parameter.
of the same tweet. Additionally, duplicate tweets retweets using Pickaw,8 a platform for organiz- (returned by different queries) are removed in each ing contests on social media, and the replies using day. Finally, tweets are sorted chronologically. PHEME (Zubiaga et al., 2016) Twitter conversation We started collecting our data since 27th of Jan- collection script.9 uary 2020 using a set of queries that we manually- updated based on our daily tracking of trending 3.3 Data Release keywords and hashtags. For example, starting from In summary, we release the following resources 16th of December 2020, we extended our queries as ArCOV-19 dataset, taking into consideration to cover tweets related to COVID-19 vaccines as Twitter content redistribution policy:10 that sub-topic was gaining interest on Arabic social • Source Tweets: IDs of the tweets crawled in media. The full list of queries used in each day each day. is released alongside our dataset. We denote all • Search Queries: the list of search queries, in- collected tweets as source tweets. cluding keywords, phrases, and hashtags, used Due to technical reasons, we missed collecting in each period to collect our source tweets. tweets for a few days. Since Twitter search API • Top Subset: IDs of the top 1,000 most popu- limits the search results to the past 7 days, we lar tweets for each day. missed the old tweets. To overcome that, we used • Propagation Networks: the propagation net- GetOldTweets36 python library to download the works for the top subset which include for search results using the same trending keywords each tweet in the top subset: and hashtags we selected around those days. – Retweets: IDs of the full retweet set. – Conversational Threads: tweet IDs of 3.2 Propagation Networks Collection the full reply thread (including direct and In addition to the source tweets, we also collected indirect replies). the propagation networks (i.e., retweets and con- Along with the dataset, we provide some pointers versational threads) of the top 1000 most popular to publicly-available crawlers that users can easily (i.e., top-retweeted & -liked) tweets each day. To use to crawl the tweets given their IDs. our knowledge, this is the first Arabic tweet dataset to include such propagation networks. 4 ArCOV-19 in Numbers Before getting the most popular tweets on any In this section, we present a statistical summary and day, we started from source tweets collected in conduct an analysis on ArCOV-19 to shed some that day and applied a qualification pipeline. We light on its major characteristics. first excluded tweets containing any inappropriate word (using a list of inappropriate words we con- 4.1 Tweets and Users Distribution structed). Next, tweets with more than two URLs or four hashtags, or shorter than four tokens (all Table 1 presents an overall summary of the tweets are potentially spam) were dropped. Additionally, and users statistics in ArCOV-19. It indicates that duplicate tweets that have exact textual content are the total number of source tweets in the dataset also dropped to avoid redundancy; only the most is about 2.7M posted by over 690k unique users. popular of them (according to our scoring criterion We note that 18.66% of the tweets were posted by below) is kept. Qualified tweets are then scored by verified users (who constitute only 0.81% of the popularity defined by the sum of the tweet’s retweet unique users). This is a relatively large percentage, and favorite counts. We finally sort the qualified showing that a good portion of the source tweets tweets by their scores and select the most popular are popular. The average numbers of followers, 1,000 tweets. We denote the set of all such tweets friends, and statuses of users are also relatively over all days as the top subset. large, showing that users observed in ArCOV-19 For those 1K top tweets per day, we then col- are also popular (and possibly more influential). lected all retweets and conversational threads (i.e., rest; we will make all available in the near future. direct and indirect replies).7 We collected the 8 https://pickaw.com/en/ twrench-becomes-pickaw 6 9 https://github.com/Mottl/ https://github.com/azubiaga/ GetOldTweets3 pheme-twitter-conversation-collection 7 10 At the time of writing, we have collected the replies for https://developer.twitter.com/en/ tweets until end of April 2020 and we are still collecting the developer-terms/agreement-and-policy
550,000 The table also indicates that 25.40% of the tweets 500,000 Posted by non-verified users Posted by verified users include URLs. We anticipate that this is due to the 450,000 400,000 extensive spreading of news (linked from tweets) 350,000 during this period. The numbers of tweets that are 300,000 250,000 geotagged and geolocated are also indicated in the 200,000 table; however, we defer the discussion of such 150,000 100,000 types of tweets to Section 4.3. 50,000 Figure 1 illustrates the monthly distribution of 0 Dec 2020 Mar 2020 May 2020 Apr 2020 Nov 2020 Aug 2020 Jan 2020 Sep 2020 Jan 2021 Oct 2020 Jul 2020 Jun 2020 Feb 2020 tweets in ArCOV-19. The volume of tweets dras- tically increased in March when the virus started to spread in several Arab countries. Then, tweets Figure 1: Monthly distribution of source tweets of number started to decrease monthly. We speculate ArCOV-19. this is due to the fact that the topic was not heavily Akhbar | أﺧﺑﺎر اﻵن 11,086 discussed as in the first days. In December, we ﺻﺣﯾﻔﺔ اﻟﺑﯾﺎن 10,223 Shorouk News 9,031 expanded our queries list to include hashtags and Bawabaa News | ﺑواﺑﺔ اﻷﺧﺑﺎر 8,453 Assawsana News 6,962 keyword related to more current sub-topics such as اﻟﯾوم اﻟﺳﺎﺑﻊ ﺑواﺑﺔ اﻟوطن 6,224 5,810 vaccines and the new variant of the virus, which NEWS KTV ﻛوروﻧﺎ 5,517 5,263 resulted in an increase in the volume of tweets. Mulhak ﻗﻧﺎة اﻟﻐد 5,019 4,953 AlMamlakaTV 4,890 Figure 2 presents the top 25 tweeting users in ﺻﺣﯾﻔﺔ اﻟﺧﻠﯾﺞ 4,844 ALAYAM - اﻷﯾﺎم 4,699 ArCOV-19. We notice that several of them are news 24اﻟﺷﺎرﻗﺔ 4,584 ﺟرﯾدة اﻟدﺳﺗور 4,536 sources, and 20 are verified Twitter accounts. MENAFN.com Arabic ﺳﻛﺎي ﻧﯾوز ﻋرﺑﯾﺔ 4,137 4,124 اﻟﻌرﺑﻲ اﻟﺟدﯾد 4,068 ﺻﺣﯾﻔﺔ اﻻﺗﺣﺎد 4,051 Tweets Statistics ﺟرﯾدة اﻟوطن Independent ﻋرﺑﯾﺔ 4,021 3,989 ﺻدى اﻟﺑﻠد 3,975 Source Tweets 2,675,049 اﻟوﻗﺎﺋﻊ اﻻﺧﺑﺎرﯾﺔ ﺻﺣﯾﻔﺔ ﻋﻛﺎظ 3,968 3,912 Geotagged 2,078 (0.08%) 0 2,000 4,000 6,000 8,000 10,000 12,000 Number of tweets Geolocated 60,873 (2.28%) Posted by verified users 499,351 (18.66%) Figure 2: Number of tweets posted by the top 25 tweet- Include URL 679,482 (25.40%) ers in ArCOV-19 for the period of 27th Jan 2020 to 31st Include Media 973,952 (36.41%) Jan 2021. Top Subset 370,132 (13.84%) Retweets of Top Subset 7,925,821 Replies of Top Subset 1,476,950 country names for the tweets posted in the period of 27th Jan to 31st Mar 2020. We focus on this Users Statistics critical period since the virus started to spread in Unique 690,339 the Arab world during that time. We then tracked Verified 5,575 (0.81%) their frequency over time. Figure 3 shows the time Average followers count 4,630 series (over days) of the three types of keywords.11 Average friends count 918 The 10 most-frequent words shown in Fig- Average statuses count 9,054 ure 3(a) indicate two different types of words: those that are directly related to COVID-19 (e.g., Table 1: Tweets and users statistics of ArCOV-19. Stats “health”) and those that are not but related to prayers for replies are for tweets until end of April 2020. and supplications (e.g., “Allah/God”)). It is inter- esting to see the word “Allah” is very frequent 4.2 Tweets Content & Topics early on when the news about the virus started to It is important to demonstrate that the tweets in spread (probably over discussions around whether ArCOV-19 constitute a good representative sample the pandemic is a punishment from God or not, we of the Arabic tweets posted during the target period believe), then declines over time only to become on COVID-19, and that they cover the prevalent frequent again when the virus started to widely topics discussed over Twitter during that period. spread in the Arab world. To help examine this hypothesis, we analyzed the 11 When identifying the most frequent words and hashtags, textual content of the tweets; in particular, we iden- we excluded the ones we used in our search queries since they tified the most-frequent words, hashtags, and Arab are expected to be very frequent by definition.
30% هللا هللا حالة بفيروس الصحة بسبب 25% اصابة العالم انتشار اللهم عدد اصابة 20% بفيروس حالة % tweets الصحة بسبب 15% اللهم انتشار 10% العالم 5% 0% عدد 01-Mar 03-Mar 05-Mar 07-Mar 09-Mar 11-Mar 13-Mar 15-Mar 17-Mar 19-Mar 21-Mar 23-Mar 25-Mar 27-Mar 29-Mar 31-Mar 27-Jan 29-Jan 31-Jan 02-Feb 04-Feb 06-Feb 08-Feb 10-Feb 12-Feb 14-Feb 16-Feb 18-Feb 20-Feb 22-Feb 24-Feb 26-Feb 28-Feb (a) Top 10 most frequent words 20% _كورونا# #الكويت #الصين الكويت _كورونا# #إيران #عاجل الصين# _كورونا# في_الكويت #كورونا_في_الكويت #السعودية 15% لبنان #كورونا_الكويت #كورونا_لبنان #لبنان #covid19 لبنان# % tweets الكويت# عاجل# 10% إيران# #covid19 5% السعودية# 0% 11-Mar 13-Mar 15-Mar 17-Mar 19-Mar 21-Mar 23-Mar 25-Mar 27-Mar 29-Mar 31-Mar 2-Feb 4-Feb 6-Feb 8-Feb 27-Jan 29-Jan 31-Jan 10-Feb 12-Feb 14-Feb 16-Feb 18-Feb 20-Feb 22-Feb 24-Feb 26-Feb 28-Feb 1-Mar 3-Mar 5-Mar 7-Mar 9-Mar (b) Top 10 most frequent hashtags Time of first reported infected case ليبيا مصر لبنان الكويت عمان البحرين العراق قطر السعودية تونس المغرب األردن سوريا اإلمارات الجزائر السودان موريتانيا فلسطين 12% مصر لبنان الكويت الكويت مصر لبنان السعودية االمارات العراق البحرين قطر 10% البحرين العراق اليمن االردن الجزائر سوريا 8% المغرب السودان تونس فلسطين % tweets ليبيا عمان موريتانيا الصومال 6% الجزائر اإلمارات عمان 4% السعودية قطر ليبيا 2% 0% 17-Mar 31-Mar 01-Mar 03-Mar 05-Mar 07-Mar 09-Mar 11-Mar 13-Mar 15-Mar 19-Mar 21-Mar 23-Mar 25-Mar 27-Mar 29-Mar 27-Jan 29-Jan 31-Jan 02-Feb 04-Feb 06-Feb 08-Feb 10-Feb 12-Feb 14-Feb 16-Feb 18-Feb 20-Feb 22-Feb 24-Feb 26-Feb 28-Feb (c) Word frequency of Arab countries Figure 3: Time series of the frequency (in percentage of tweets) of top general keywords, hashtags, and Arab country words for the period of 27th Jan to 31st Mar. The time of first reported cases in Arab countries is also indicated and aligned with the time series.
youm7.com 5420 Figure 3(b) demonstrates how “#China” hashtag alarabiya.net youtube.com 2984 3405 elwatannews.com was the most trending one from 27 of January until alaraby.co.uk 2769 2569 aawsat.com 2348 20 of February, as COVID-19 was prevalent only sabq.org independentarabia.com 2342 2143 in China and still not widely spread (at that time) in shorouknews.com cnn.com 2041 1989 akhbaralaan.net 1944 other countries. We can see how it started to be less rt.com alittihad.ae 1913 1901 trending as the number of cases started to decline skynewsarabia.com argaam.com 1792 1705 ajel.sa 1432 in China by that date.12 On the other hand, when royanews.tv aa.com.tr 1405 1379 COVID-19 started to spread in the Arab world, ow.ly euronews.com 1346 1306 24.ae 1239 other hashtags started to become viral. Further- ampproject.org al-ain.com 1231 1218 more, Figure 3(b) shows that frequency spikes of pscp.tv forbesmiddleeast.com 1200 1185 “#Iran”, “#Lebanon”, and “#Kuwait” exactly match 0 1000 2000 3000 4000 5000 6000 Number of occurrences the confirmation dates of first reported cases in Iran,13 Lebanon,14 and Kuwait,15 respectively. Figure 4: Most frequently-linked domains in the top To further analyze trending topics, Figure 3 fea- tweets subset for the period of 27th Jan 2020 to 31st tures the timeline of the first reported cases in the Jan 2021. Arab countries and aligns them with the time se- ries throughout the figure. Aligning the timeline ArCOV-19 by considering the domains of URLs with the series in Figure 3(c) reveals a significant shared in the tweets. We focused on the top tweets match between the frequency peaks of several coun- subset of ArCOV-19 (constructed as detailed in Sec- try names and the corresponding dates of first re- tion 3.2) and identified the URLs posted in those ported cases in those countries, most notably in tweets. We then expanded them (since Twitter UAE, Egypt, Lebanon, Kuwait, Oman, Bahrain, URLs are shortened) and dropped URLs of images Algeria, and Libya. and videos uploaded to Twitter, to focus only on Figure 3 also demonstrates the power of ArCOV- URLs referring to external sources. We extracted 19 in capturing further controversial and trending 110,220 URLs from 2,617 unique domains. Fig- topics. Table 2 shows a timeline covering dates ure 4 shows that URLs from news websites are of specific topics of discussion trending on social dominantly the most-commonly shared. We ob- media around times of peaks in tweeting frequency serve that these news websites mainly originate in Figure 3(c). from three Arab countries, namely, Egypt, Saudi Country Date Topic Related News Arabia, and United Arab Emirates. Interestingly, UAE Feb 19 Yemeni Foreign http://tiny.cc/71qtmz videos from YouTube are among the most com- Affairs Minister thanks UAE for monly shared media. Coupled with the fact that evacuating Yemeni 36% of the tweets in ArCOV-19 include embedded students from China images and videos (Table 1), we believe this en- Iraq Feb 20 Iraq announces http://tiny.cc/yzrtmz ables ArCOV-19 to be a potential dataset to further closure of borders with Iran support the evaluation of multi-modal retrieval and Egypt Mar 2 Egyptian health http://tiny.cc/7iqtmz classification systems. minister announces a visit to China Mar 7 Kuwait bans travels http://tiny.cc/lqstmz 4.3 Geographic Distribution of Tweets with Egypt including entry of Although Twitter provides automatic geo-reference Egyptian residents functionally, few users (solely around 1-3% (Mur- KSA Mar 4-5 Closure of The http://tiny.cc/iqrtmz Grand Mosque in dock, 2011; Jurgens et al., 2015)) opt to enable it Mecca due to privacy and safety reasons. Alternatively, to have an insight about the geographical distribution Table 2: Examples of trending topics in social media matched by spikes in tweeting frequency in ArCOV-19 and diversity of the tweets in our dataset, we ex- amined the place and coordinates attributes of the We further explore the topics discussed in Tweet object.16 We note that the place attribute is an optional attribute that allows the user to select a 12 https://tinyurl.com/rfj5khn 13 16 https://tinyurl.com/yykgwbox https://developer.twitter.com/en/ 14 https://tinyurl.com/yxkg78g5 docs/tweets/data-dictionary/overview/ 15 https://tinyurl.com/y4nlgz5h geo-objects
105 location from a list provided by Twitter (therefore, the location might not necessarily show the actual 104 location from where the tweet is posted), whereas Number of retweets the coordinates attribute represents the geographic 103 location of the tweet as reported by the user or client application. 102 Table 1 shows that ArCOV-19 has 60,873 ge- 101 olocated tweets (i.e., having values in the place attribute) and 2,078 geotagged tweets (i.e., having 100 values in the coordinates attribute). Those tweets 27/01 29/01 31/01 02/02 04/02 06/02 08/02 10/02 12/02 14/02 16/02 18/02 20/02 22/02 24/02 26/02 28/02 01/03 03/03 05/03 07/03 09/03 11/03 13/03 15/03 17/03 19/03 21/03 23/03 25/03 27/03 29/03 31/03 02/04 04/04 06/04 08/04 10/04 12/04 14/04 16/04 18/04 20/04 22/04 24/04 26/04 28/04 30/04 were posted by 24,072 and 256 unique users re- spectively. The geolocated tweets constitute about Figure 6: Distribution of retweets per day for the top 2.28% of the total source tweets, which is con- subset for the period of 27th Jan 2020 to 30th April sistent with previous studies (Huang and Carley, 2020. 2019). among the tracking keywords in the period between Libya 1.7% Bahrain 22 of February and 13 of March. Furthermore, it is 1.9% Qatar not surprising to see the countries that have a few 2.1% Jordan cases have the smallest portions of contribution to 3.0% Iraq KSA the content. Others have small audience userbases 4.2% 41.7% Lebanon on Twitter (e.g., Tunisia). 5.3% Egypt 5.4% 4.4 Propagation Networks Oman 6.6% As discussed earlier, we collected the propagation UAE networks of the top subset. Overall, the number of 6.8% retweets is 7,925,821 for the entire subset, and the Other Kuwait 7.3% 9.6% number of replies is 1,476,950 18 . At the time of collecting the retweets, we were Figure 5: Distribution of geolocated tweets for the pe- not able to get the retweets of some of the tweets ei- riod of 27th Jan 2020 to 31st Jan 2021. ther because they were deleted or they were posted from private accounts. For those collected suc- The geolocated tweets were indeed posted by cessfully, Figure 6 illustrates the distribution of users from 102 countries from around the globe. retweets per day using boxplots. It shows that the We found that 92.75% of them were posted from average (and median) number of retweets follows the Arab world. The largest contribution of the ge- a similar pattern to what was shown in Figure 1 olocated content (about 41.7%) comes from Saudi (notice that the Y-axis here is in log scale). We also Arabia; this is somewhat expected as Saudi users notice that a good number of tweets got more than represent the highest number of active Twitter users 100 retweets; some of them got even larger num- in the Arab world.17 Surprisingly, Kuwait comes bers reaching about 10k retweets or more, showing second with 9.6% of the geolocated content. We highly propagated content. As shown in Figure 7, believe the rationale is that it was among the first we applied a similar analysis to the replies per day, countries that reported COVID-19 cases in the Mid- and found similar patterns to the retweets. dle East. Since then, Kuwait started a series of strict precautions such as a wide lock-down in many vi- 5 Enabling Research tal facilities until the government had imposed a With the spread of Novel Coronavirus in several nationwide curfew. We think, after the curfew, the Arab countries and the subsequent procedural mea- people become more active on Twitter as a plat- sures taken by the local governments, it continued form to break news and discuss developments of to dominate discussions over social media (and the virus. Additionally, we used related phrases 18 and hashtags to “Kuwait”, “Lebanon”, and “UAE” The stats are for the replies for data until end of April 2020 and we are still collecting the rest; we will make all 17 https://tinyurl.com/jmkttt3 available in the near future.
Misinformation Detection: With the sheer 104 amount of information shared about COVID-19, many rumors are disseminated and getting high at- 103 Number of replies tention from the community, which causes a fast propagation of misinformation. This hinders the 102 efforts of the health and governmental organiza- 101 tions on fighting the pandemic as such rumors spread panic and may lead to undesirable conse- 100 quences (e.g., increase of cases and mortality rate, or lack of supplies due to hoarding). ArCOV-19 27/01 29/01 31/01 02/02 04/02 06/02 08/02 10/02 12/02 14/02 16/02 18/02 20/02 22/02 24/02 26/02 28/02 01/03 03/03 05/03 07/03 09/03 11/03 13/03 15/03 17/03 19/03 21/03 23/03 25/03 27/03 29/03 31/03 02/04 04/04 06/04 08/04 10/04 12/04 14/04 16/04 18/04 20/04 22/04 24/04 26/04 28/04 30/04 supports studying information/claims propagation, Figure 7: Distribution of replies per day for the top sub- claims check-worthiness detection and verification set for the period of 27th Jan 2020 to 30th April 2020. tasks on the most popular tweets. Furthermore, the retweet networks and conversational threads provide a valuable resource for early detection of Twitter in particular) to the time of this writing. To fake news and identification of malicious rumor- enable research on different tasks on Arabic Tweets, spreading accounts. To further facilitate work in we make ArCOV-19 publicly-available for the re- this domain of problems, we recently constructed search community. Having quantified the topical, ArCOV19-Rumors, an annotated dataset on top of geographical, and community representativeness ArCOV-19 to support claim and tweet verification of ArCOV-19, we envision that it is suitable for (Haouari et al., 2021). diverse natural language processing, information retrieval, and computational social science research 6 Conclusion tasks including, but not limited to, emergency man- agement, misinformation detection, and social ana- In this paper, we presented ArCOV-19, the first lytics, as we discuss below. Furthermore, we pro- Arabic Twitter dataset about the Novel Coronavirus vide propagation networks of daily highly popular (COVID-19) that includes propagation networks subset (the most popular 1K) to ensure the quality of a large subset of tweets. We release all source of tweets. This sample is drawn after filtering out tweets, top subset, search queries, and the propa- potential low-quality tweets (e.g., spam) and con- gation networks. Preliminary analysis showed that tent duplicates. We anticipate the top subset and ArCOV-19 captured spikes in tweeting frequency the network to support these tasks on the popular for country-specific tweets that are consistent with Arabic tweets that we assume have the most effect the first reported cases of COVID-19 in several on shaping public opinion and understanding. Arab countries. We also found dominance of news Emergency Management: As COVID-19 is an agencies among top tweeting users and among most international pandemic, national and international shared URLs. ArCOV-19 enables research under health organizations need to analyze the effects of many domains including natural language process- the outbreak. We believe ArCOV-19 can support ing, information retrieval, and social computing. several tasks in that domain, such as filtering of in- We plan to continue collecting tweets for the fore- formative content, events and sub-events detection, seeable future and the dataset will be continuously summarization, identification of eyewitnesses, ge- updated with newly collected tweets and propaga- olocation, and studying information and situational tion networks. awareness propagation, to name a few. Acknowledgments Social Analytics: In addition to sharing reports and awareness during the outbreak, people tend to The work of Tamer Elsayed and Maram Hasanain discuss their opinions (e.g., stance towards the situ- was made possible by NPRP grant# NPRP 11S- ation and its consequences) and express their emo- 1204-170060 from the Qatar National Research tions (e.g., big changes in lifestyle, social distanc- Fund (a member of Qatar Foundation). The work ing, loss of their beloved ones, etc.). Therefore, an- of Reem Suwaileh was supported by GSRA grant# alyzing the tweets to detect sentiment, stance, hate GSRA5-1-0527-18082 from the Qatar National Re- speech, and generally, offensive language, among search Fund and the work of Fatima Haouari was other aspects, is of interest to many stakeholders. supported by GSRA grant# GSRA6-1-0611-19074
from the Qatar National Research Fund. The state- Yong Hu, He-Yan Huang, Anfan Chen, and Xian-Ling ments made herein are solely the responsibility of Mao. 2020. Weibo-cov: A large-scale covid-19 so- cial media dataset from weibo. In Proceedings of the authors. the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020. References Binxuan Huang and Kathleen M Carley. 2019. A large-scale empirical study of geotagging behavior Sarah Alqurashi, Ahmad Alhindi, and Eisa Alanazi. on Twitter. In Proceedings of the 2019 IEEE/ACM 2020. Large Arabic Twitter Dataset on COVID-19. International Conference on Advances in Social Net- arXiv preprint arXiv:2004.04315. works Analysis and Mining, pages 365–373. Thayer Alshaabi, David R Dewhurst, Joshua R Minot, David Jurgens, Tyler Finethy, James McCorriston, Michael V Arnold, Jane L Adams, Christopher M Yi Tian Xu, and Derek Ruths. 2015. Geolocation Danforth, and Peter Sheridan Dodds. 2020a. The prediction in twitter using social networks: A critical growing echo chamber of social media: Measuring analysis and review of current practice. In Proceed- temporal and social contagion dynamics for over ings of the Ninth International AAAI Conference on 150 languages on twitter for 2009–2020. arXiv Web and Social Media. preprint arXiv:2003.03667. Ireneus Kagashe, Zhijun Yan, and Imran Suheryani. Thayer Alshaabi, JR Minot, MV Arnold, Jane Lydia 2017. Enhancing seasonal influenza surveillance: Adams, David Rushing Dewhurst, Andrew J Reagan, topic analysis of widely used medicinal drugs using Roby Muhamad, Christopher M Danforth, and Pe- twitter data. Journal of medical Internet research, ter Sheridan Dodds. 2020b. How the world’s collec- 19(9):e315. tive attention is being paid to a pandemic: Covid-19 related 1-gram time series for 24 languages on twit- Christian E Lopez, Malolan Vasu, and Caleb Galle- ter. arXiv preprint arXiv:2003.12614. more. 2020. Understanding the perception of covid-19 policies by mining a multilanguage twitter Juan M Banda, Ramya Tekumalla, Guanyu Wang, dataset. arXiv preprint arXiv:2003.10359. Jingyuan Yu, Tuo Liu, Yuning Ding, and Ger- ardo Chowell. 2020. A large-scale covid-19 twit- Andrew McNeill, Peter R. Harris, and Pam Briggs. ter chatter dataset for open scientific research– 2016. Twitter influence on uk vaccination and an- an international collaboration. arXiv preprint tiviral uptake during the 2009 h1n1 pandemic. Fron- arXiv:2004.03688. tiers in Public Health, 4:26. Emily Chen, Kristina Lerman, and Emilio Ferrara. Vanessa Murdock. 2011. Your mileage may vary: On 2020a. Covid-19: The first public coronavirus twit- the limits of social media. SIGSPATIAL Special, ter dataset. arXiv preprint arXiv:2003.07372. 3(2):62–66. Emily Chen, Kristina Lerman, and Emilio Ferrara. Umair Qazi, Muhammad Imran, and Ferda Ofli. 2020. 2020b. Tracking social media discourse about the Geocov19: a dataset of hundreds of millions of mul- covid-19 pandemic: Development of a public coro- tilingual covid-19 tweets with location information. navirus twitter data set. JMIR Public Health and SIGSPATIAL Special, 12(1):6–15. Surveillance, 6(2):e19273. Melissa Roy, Nicolas Moreau, Cécile Rousseau, Ar- Hassan Dashtian and Dhiraj Murthy. 2021. Cml-covid: naud Mercier, Andrew Wilson, and Laëtitia Atlani- A large-scale covid-19 twitter dataset with latent Duault. 2020. Ebola and localized blame on social topics, sentiment and location information. arXiv media: Analysis of twitter and facebook conversa- preprint arXiv:2101.12202. tions during the 2014–2015 ebola epidemic. Culture, Medicine, and Psychiatry, 44(1):56–79. Zhiwei Gao, Shuntaro Yada, Shoko Wakamiya, and Eiji Aramaki. 2020. Naist covid: Multilingual Lisa Singh, Shweta Bansal, Leticia Bode, Ceren Budak, covid-19 twitter and weibo dataset. arXiv preprint Guangqing Chi, Kornraphop Kawintiranon, Colton arXiv:2004.08145. Padden, Rebecca Vanarsdall, Emily Vraga, and Yanchen Wang. 2020. A first look at covid-19 infor- Purva Grover, Arpan Kumar Kar, Yogesh K. Dwivedi, mation and misinformation sharing on twitter. arXiv and Marijn Janssen. 2019. Polarization and accultur- preprint arXiv:2003.13907. ation in us election 2016 outcomes – can twitter an- alytics predict changes in voting preferences. Tech- Santosh Vijaykumar, Glen Nowak, Itai Himelboim, and nological Forecasting and Social Change, 145:438 – Yan Jin. 2018. Virtual zika transmission after the 460. first us case: who said what and how it spread on twitter. American journal of infection control, Fatima Haouari, Maram Hasanain, Reem Suwaileh, 46(5):549–557. and Tamer Elsayed. 2021. Arcov19-rumors: Ara- bic covid-19 twitter dataset for misinformation de- Koosha Zarei, Reza Farahbakhsh, Noel Crespi, and tection. In Proceedings of the Sixth Arabic Natural Gareth Tyson. 2020. A first instagram dataset on Language Processing Workshop, WANLP 2021. covid-19. arXiv preprint arXiv:2004.12226.
Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geral- dine Wong Sak Hoi, and Peter Tolmie. 2016. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PloS one, 11(3).
You can also read