ARCOV-19: THE FIRST ARABIC COVID-19 TWITTER DATASET WITH PROPAGATION NETWORKS

Page created by Sylvia Jacobs
 
CONTINUE READING
ArCOV-19: The First Arabic COVID-19 Twitter Dataset
                         with Propagation Networks

          Fatima Haouari, Maram Hasanain, Reem Suwaileh, Tamer Elsayed
             Computer Science and Engineering Department, Qatar University
      {200159617, maram.hasanain, rs081123, telsayed}@qu.edu.qa

                      Abstract                                   Since the first reported case of Novel Coron-
                                                              avirus (COVID-19) in China, in November 2019,
    In this paper, we present ArCOV-19, an Arabic
    COVID-19 Twitter dataset that spans one year,             the COVID-19 topic has drawn the interest of many
    covering the period from 27th of January 2020             Arab users over Twitter. Their interest, reflected
    till 31st of January 2021. ArCOV-19 is the first          in the Arabic content on the platform, has reached
    publicly-available Arabic Twitter dataset cov-            a peak after two months when the first case was
    ering COVID-19 pandemic that includes about               reported in the United Arab Emirates late in Jan-
    2.7M tweets alongside the propagation net-                uary 2020. This ongoing pandemic has, unsur-
    works of the most-popular subset of them (i.e.,
                                                              prisingly, spiked discussions on Twitter covering a
    most-retweeted and -liked). The propagation
    networks include both retweets and conversa-
                                                              wide range of topics such as general information
    tional threads (i.e., threads of replies). ArCOV-         about the disease, preventive measures, procedures
    19 is designed to enable research under several           and newly-enforced decisions by governments, up-
    domains including natural language process-               to-date statistics of the spread in the world, and
    ing, information retrieval, and social comput-            even the change in our daily habits and work styles.
    ing. Preliminary analysis shows that ArCOV-                  In this work, we aim to support future research
    19 captures rising discussions associated with            on social media during this historical period of our
    the first reported cases of the disease as they
                                                              time by curating an Arabic dataset (ArCOV-19)
    appeared in the Arab world. In addition to the
    source tweets and propagation networks, we                that exclusively covers tweets about COVID-19.
    also release the search queries and language-             We limit the dataset to Arabic since it is among
    independent crawler used to collect the tweets            the most dominant languages in Twitter (Alshaabi
    to encourage the curation of similar datasets.            et al., 2020a), yet under-studied in general.
                                                                 ArCOV-19 is the first Arabic Twitter dataset
1   Introduction                                              designed to capture tweets discussing COVID-19
Twitter streams hundreds of millions of tweets                starting from January 27th 2020 till end of January
daily. In addition to being a medium for the spread           20211 , constituting about 2.7M tweets, alongside
and consumption of news, it has been shown to                 the propagation networks of the most-popular sub-
capture the dynamics of real-world events includ-             set of them. To our knowledge, there is no publicly-
ing the spread of diseases such as the seasonal               available Arabic Twitter dataset for COVID-19 that
influenza (Kagashe et al., 2017) or more severe               includes the propagation networks of a good subset
epidemics like Zika (Vijaykumar et al., 2018),                of its tweets. Some existing efforts have already
Ebola (Roy et al., 2020), and H1N1 (McNeill et al.,           started to curate COVID-19 datasets including Ara-
2016). Moreover, collective conversations on Twit-            bic tweets, but Arabic is severely under-represented
ter about an event can have a great influence on the          (e.g., (Chen et al., 2020a; Singh et al., 2020)) or
event’s outcomes, e.g., US 2016 presidential elec-            represented by a random sample that is not specifi-
tions (Grover et al., 2019). Analyzing tweets about           cally focused on COVID-19 (e.g., (Alshaabi et al.,
an event, as it evolves, offers a great opportunity           2020b)), or the dataset is limited in the period it
to understanding its structure and characteristics,           covers and does not include the propagation net-
informing decisions based on its development, and             works (e.g., (Alqurashi et al., 2020)).
anticipating its outcomes as represented in Twitter              1
                                                                   The dataset will be continuously augmented with new
and, more importantly, in the real world.                     tweets over the coming months.

                                                         82
                  Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 82–91
                                         Kyiv, Ukraine (Virtual), April 19, 2021.
The contribution of this paper is three-fold:             constituting 65% and 12% of the tweets respec-
                                                              tively. Chen et al. (2020b) are continously collect-
    • We release ArCOV-19,2 the first Arabic Twit-            ing tweets since 21st of January 2020, where at
      ter dataset about COVID-19 that comprises               the time of writing, around 66% and 11% of the
      about 2.7M tweets collected via Twitter search          tweets are in English and Spanish respectively. An-
      API. It covers a full year allowing for captur-         other multilingual ongoing collection is released
      ing discussions on the topic since it started to        by Lopez et al. (2020). Qazi et al. (2020) enriched
      be popular in the Arab world. In addition to            their large-scale multilingual Twitter dataset with
      the tweets, ArCOV-19 includes propagation               geolocation information.
      networks of the most popular subset, search                The only raw Arabic Twitter dataset available is
      queries, and documented implementation of               the one released by Alqurashi et al. (2020), how-
      our language-independent tweets crawler.                ever it covers the period from 1st of January to 15th
    • We present a preliminary analysis on ArCOV-             April 2020 only. Moreover, it does not include the
      19, which reveals insights from the dataset             propagation networks of tweets.
      concerning temporal, geographical, and topi-               Compared to existing datasets, we release an
      cal aspects.                                            only-Arabic collection of tweets, where we exclude
    • We suggest several use cases to enable re-              retweets to avoid having copies of the same tweet.
      search on Arabic tweets in different research           Moreover, we also release the most-popular sub-
      areas including, but not limited to, emergency          set of them (i.e., most-retweeted and -liked and
      management, misinformation detection, and               spam free tweets). Furthermore, we collect the
      social analytics.                                       propagation networks including both retweets and
                                                              conversational threads (i.e., threads of replies) for
   The reminder of the paper is organized as fol-             the most-popular subset.
lows. We present the crawling process followed
to acquire ArCOV-19 in Section 3. We then thor-               3       Data Collection
oughly discuss the analysis that we conducted on
the dataset in Section 4. We suggest some use cases           ArCOV-19 includes two major components: the
to enable research on Arabic tweets in Section 5.             source tweets (i.e., tweets collected via Twitter
We finally conclude in Section 6.                             search API every day) and the propagation net-
                                                              works (i.e., retweets and conversational threads
2       Related Work                                          of a subset of the source tweets). In this section,
                                                              we present how we collected each in Sections 3.1,
Social media platforms and Twitter specifically               and 3.2 respectively, and summarize the released
showed to be an indispensable medium for sharing              data in Section 3.3.
information and discussions about the COVID-19
pandemic since December 2019. A considerable                  3.1      Tweets Collection
body of raw social media datasets were released               To collect the source tweets, we used our tweet
to facilitate analyzing these discussions including           crawler3 that uses Twitter search API4 . The crawler
Twitter datasets (Banda et al., 2020; Lopez et al.,           takes a set of manually-crafted queries, comprising
2020; Chen et al., 2020b; Qazi et al., 2020; Gao              a target topic, as input. At the end of each day, the
et al., 2020; Alqurashi et al., 2020; Dashtian and            crawler issues a search request for each of those
Murthy, 2021), Instagram (Zarei et al., 2020), or             queries. Queries can be keywords (e.g., “Corona”),
the Chinese social media platform Weibo (Hu et al.,           phrases (e.g., “the killing virus”), or hashtags (e.g.,
2020; Gao et al., 2020).                                      “#the new coronavirus”). Twitter returns a maxi-
   Banda et al. (2020) released a multilingual                mum of 3,200 tweets per query. We customized
dataset that includes over 800M tweets, the data is           the search requests to return only Arabic tweets,5
collected since 1st of January 2020 using Twitter             and to exclude all retweets to avoid having copies
streaming API. Dashtian and Murthy (2021), col-
                                                                  3
lected multilingual tweets posted between March                    https://gitlab.com/bigirqu/ArCOV-19/
                                                              -/tree/master/code/crawler
and July 2020. The most dominant languages                       4
                                                                   https://developer.twitter.com/en/
covered by their dataset are English and Spanish              docs/tweets/search/api-reference/
                                                              get-search-tweets
    2                                                            5
        https://gitlab.com/bigirqu/ArCOV-19/                       Language of tweets to return is a configurable parameter.

                                                         83
of the same tweet. Additionally, duplicate tweets                     retweets using Pickaw,8 a platform for organiz-
(returned by different queries) are removed in each                   ing contests on social media, and the replies using
day. Finally, tweets are sorted chronologically.                      PHEME (Zubiaga et al., 2016) Twitter conversation
   We started collecting our data since 27th of Jan-                  collection script.9
uary 2020 using a set of queries that we manually-
updated based on our daily tracking of trending                       3.3    Data Release
keywords and hashtags. For example, starting from                     In summary, we release the following resources
16th of December 2020, we extended our queries                        as ArCOV-19 dataset, taking into consideration
to cover tweets related to COVID-19 vaccines as                       Twitter content redistribution policy:10
that sub-topic was gaining interest on Arabic social                     • Source Tweets: IDs of the tweets crawled in
media. The full list of queries used in each day                           each day.
is released alongside our dataset. We denote all                         • Search Queries: the list of search queries, in-
collected tweets as source tweets.                                         cluding keywords, phrases, and hashtags, used
   Due to technical reasons, we missed collecting                          in each period to collect our source tweets.
tweets for a few days. Since Twitter search API                          • Top Subset: IDs of the top 1,000 most popu-
limits the search results to the past 7 days, we                           lar tweets for each day.
missed the old tweets. To overcome that, we used                         • Propagation Networks: the propagation net-
GetOldTweets36 python library to download the                              works for the top subset which include for
search results using the same trending keywords                            each tweet in the top subset:
and hashtags we selected around those days.                                   – Retweets: IDs of the full retweet set.
                                                                              – Conversational Threads: tweet IDs of
3.2   Propagation Networks Collection
                                                                                 the full reply thread (including direct and
In addition to the source tweets, we also collected                              indirect replies).
the propagation networks (i.e., retweets and con-                     Along with the dataset, we provide some pointers
versational threads) of the top 1000 most popular                     to publicly-available crawlers that users can easily
(i.e., top-retweeted & -liked) tweets each day. To                    use to crawl the tweets given their IDs.
our knowledge, this is the first Arabic tweet dataset
to include such propagation networks.                                 4     ArCOV-19 in Numbers
   Before getting the most popular tweets on any
                                                                      In this section, we present a statistical summary and
day, we started from source tweets collected in
                                                                      conduct an analysis on ArCOV-19 to shed some
that day and applied a qualification pipeline. We
                                                                      light on its major characteristics.
first excluded tweets containing any inappropriate
word (using a list of inappropriate words we con-                     4.1    Tweets and Users Distribution
structed). Next, tweets with more than two URLs
or four hashtags, or shorter than four tokens (all                    Table 1 presents an overall summary of the tweets
are potentially spam) were dropped. Additionally,                     and users statistics in ArCOV-19. It indicates that
duplicate tweets that have exact textual content are                  the total number of source tweets in the dataset
also dropped to avoid redundancy; only the most                       is about 2.7M posted by over 690k unique users.
popular of them (according to our scoring criterion                   We note that 18.66% of the tweets were posted by
below) is kept. Qualified tweets are then scored by                   verified users (who constitute only 0.81% of the
popularity defined by the sum of the tweet’s retweet                  unique users). This is a relatively large percentage,
and favorite counts. We finally sort the qualified                    showing that a good portion of the source tweets
tweets by their scores and select the most popular                    are popular. The average numbers of followers,
1,000 tweets. We denote the set of all such tweets                    friends, and statuses of users are also relatively
over all days as the top subset.                                      large, showing that users observed in ArCOV-19
   For those 1K top tweets per day, we then col-                      are also popular (and possibly more influential).
lected all retweets and conversational threads (i.e.,                 rest; we will make all available in the near future.
direct and indirect replies).7 We collected the                           8
                                                                            https://pickaw.com/en/
                                                                      twrench-becomes-pickaw
   6                                                                      9
     https://github.com/Mottl/                                              https://github.com/azubiaga/
GetOldTweets3                                                         pheme-twitter-conversation-collection
   7                                                                     10
     At the time of writing, we have collected the replies for              https://developer.twitter.com/en/
tweets until end of April 2020 and we are still collecting the        developer-terms/agreement-and-policy

                                                                 84
550,000
The table also indicates that 25.40% of the tweets              500,000
                                                                                                                                       Posted by non-verified users                          Posted by verified users

include URLs. We anticipate that this is due to the             450,000
                                                                400,000
extensive spreading of news (linked from tweets)                350,000
during this period. The numbers of tweets that are              300,000
                                                                250,000
geotagged and geolocated are also indicated in the              200,000

table; however, we defer the discussion of such                 150,000
                                                                100,000
types of tweets to Section 4.3.                                  50,000

   Figure 1 illustrates the monthly distribution of                      0

                                                                                                                                                                                                                         Dec 2020
                                                                                                         Mar 2020

                                                                                                                                   May 2020
                                                                                                                    Apr 2020

                                                                                                                                                                                                           Nov 2020
                                                                                                                                                                       Aug 2020
                                                                               Jan 2020

                                                                                                                                                                                  Sep 2020

                                                                                                                                                                                                                                    Jan 2021
                                                                                                                                                                                              Oct 2020
                                                                                                                                                            Jul 2020
                                                                                                                                              Jun 2020
                                                                                          Feb 2020
tweets in ArCOV-19. The volume of tweets dras-
tically increased in March when the virus started
to spread in several Arab countries. Then, tweets              Figure 1: Monthly distribution of source tweets of
number started to decrease monthly. We speculate               ArCOV-19.
this is due to the fact that the topic was not heavily
                                                                       Akhbar | ‫أﺧﺑﺎر اﻵن‬                                                                                                                                           11,086
discussed as in the first days. In December, we                               ‫ﺻﺣﯾﻔﺔ اﻟﺑﯾﺎن‬                                                                                                                                     10,223
                                                                        Shorouk News                                                                                                                         9,031
expanded our queries list to include hashtags and              Bawabaa News | ‫ﺑواﺑﺔ اﻷﺧﺑﺎر‬                                                                                                               8,453
                                                                     Assawsana News                                                                                           6,962
keyword related to more current sub-topics such as                              ‫اﻟﯾوم اﻟﺳﺎﺑﻊ‬
                                                                                ‫ﺑواﺑﺔ اﻟوطن‬
                                                                                                                                                                          6,224
                                                                                                                                                                       5,810
vaccines and the new variant of the virus, which                           NEWS KTV
                                                                                     ‫ﻛوروﻧﺎ‬                                                                          5,517
                                                                                                                                                                   5,263

resulted in an increase in the volume of tweets.                                 Mulhak
                                                                                    ‫ﻗﻧﺎة اﻟﻐد‬
                                                                                                                                                                 5,019
                                                                                                                                                                 4,953
                                                                         AlMamlakaTV                                                                            4,890
   Figure 2 presents the top 25 tweeting users in                            ‫ﺻﺣﯾﻔﺔ اﻟﺧﻠﯾﺞ‬                                                                       4,844
                                                                         ALAYAM - ‫اﻷﯾﺎم‬                                                                        4,699
ArCOV-19. We notice that several of them are news                                24‫اﻟﺷﺎرﻗﺔ‬                                                                    4,584
                                                                             ‫ﺟرﯾدة اﻟدﺳﺗور‬                                                                    4,536
sources, and 20 are verified Twitter accounts.                    MENAFN.com Arabic
                                                                          ‫ﺳﻛﺎي ﻧﯾوز ﻋرﺑﯾﺔ‬
                                                                                                                                                           4,137
                                                                                                                                                           4,124
                                                                              ‫اﻟﻌرﺑﻲ اﻟﺟدﯾد‬                                                               4,068
                                                                             ‫ﺻﺣﯾﻔﺔ اﻻﺗﺣﺎد‬                                                                 4,051
 Tweets Statistics                                                             ‫ﺟرﯾدة اﻟوطن‬
                                                                     Independent ‫ﻋرﺑﯾﺔ‬
                                                                                                                                                          4,021
                                                                                                                                                         3,989
                                                                                 ‫ﺻدى اﻟﺑﻠد‬                                                               3,975
 Source Tweets                2,675,049                                    ‫اﻟوﻗﺎﺋﻊ اﻻﺧﺑﺎرﯾﺔ‬
                                                                              ‫ﺻﺣﯾﻔﺔ ﻋﻛﺎظ‬
                                                                                                                                                         3,968
                                                                                                                                                         3,912
 Geotagged                        2,078    (0.08%)                                                   0                         2,000             4,000                 6,000                 8,000                    10,000             12,000
                                                                                                                                                                  Number of tweets
 Geolocated                      60,873    (2.28%)
 Posted by verified users       499,351    (18.66%)            Figure 2: Number of tweets posted by the top 25 tweet-
 Include URL                    679,482    (25.40%)            ers in ArCOV-19 for the period of 27th Jan 2020 to 31st
 Include Media                  973,952    (36.41%)            Jan 2021.
 Top Subset                     370,132    (13.84%)
 Retweets of Top Subset       7,925,821
 Replies of Top Subset        1,476,950                        country names for the tweets posted in the period
                                                               of 27th Jan to 31st Mar 2020. We focus on this
 Users Statistics                                              critical period since the virus started to spread in
 Unique                         690,339                        the Arab world during that time. We then tracked
 Verified                         5,575    (0.81%)             their frequency over time. Figure 3 shows the time
 Average followers count          4,630                        series (over days) of the three types of keywords.11
 Average friends count              918                           The 10 most-frequent words shown in Fig-
 Average statuses count           9,054                        ure 3(a) indicate two different types of words:
                                                               those that are directly related to COVID-19 (e.g.,
Table 1: Tweets and users statistics of ArCOV-19. Stats
                                                               “health”) and those that are not but related to prayers
for replies are for tweets until end of April 2020.
                                                               and supplications (e.g., “Allah/God”)). It is inter-
                                                               esting to see the word “Allah” is very frequent
4.2   Tweets Content & Topics                                  early on when the news about the virus started to
It is important to demonstrate that the tweets in              spread (probably over discussions around whether
ArCOV-19 constitute a good representative sample               the pandemic is a punishment from God or not, we
of the Arabic tweets posted during the target period           believe), then declines over time only to become
on COVID-19, and that they cover the prevalent                 frequent again when the virus started to widely
topics discussed over Twitter during that period.              spread in the Arab world.
To help examine this hypothesis, we analyzed the                    11
                                                                    When identifying the most frequent words and hashtags,
textual content of the tweets; in particular, we iden-         we excluded the ones we used in our search queries since they
tified the most-frequent words, hashtags, and Arab             are expected to be very frequent by definition.

                                                          85
30%
                                                                 ‫هللا‬                                                                                                                                                                     ‫هللا‬                       ‫حالة‬                             ‫بفيروس‬                   ‫الصحة‬                  ‫بسبب‬
                    25%                                                                                                                                                                                                                   ‫اصابة‬                      ‫العالم‬                           ‫انتشار‬                   ‫اللهم‬                  ‫عدد‬

                                                                                                                                                       ‫اصابة‬
                    20%                                                                               ‫بفيروس‬
                                                                                                      ‫حالة‬
         % tweets

                                                                                                                ‫الصحة‬                      ‫بسبب‬
                    15%
                                                                                                                                                                                                                                                                                                             ‫اللهم‬                       ‫انتشار‬
                    10%
                                                                                                                                                                                                                                                                                                                                                        ‫العالم‬

                    5%

                    0%                                                                                                                                                                                                                    ‫عدد‬

                                                                                                                                                                                                        01-Mar
                                                                                                                                                                                                                       03-Mar
                                                                                                                                                                                                                                 05-Mar
                                                                                                                                                                                                                                          07-Mar
                                                                                                                                                                                                                                                   09-Mar
                                                                                                                                                                                                                                                            11-Mar
                                                                                                                                                                                                                                                                        13-Mar
                                                                                                                                                                                                                                                                                           15-Mar
                                                                                                                                                                                                                                                                                                    17-Mar
                                                                                                                                                                                                                                                                                                             19-Mar
                                                                                                                                                                                                                                                                                                                      21-Mar
                                                                                                                                                                                                                                                                                                                                23-Mar
                                                                                                                                                                                                                                                                                                                                         25-Mar
                                                                                                                                                                                                                                                                                                                                                  27-Mar
                                                                                                                                                                                                                                                                                                                                                            29-Mar
                                                                                                                                                                                                                                                                                                                                                                     31-Mar
                          27-Jan
                                    29-Jan
                                               31-Jan
                                                        02-Feb
                                                                  04-Feb
                                                                           06-Feb
                                                                                    08-Feb
                                                                                             10-Feb
                                                                                                      12-Feb
                                                                                                                14-Feb
                                                                                                                         16-Feb
                                                                                                                                  18-Feb
                                                                                                                                           20-Feb
                                                                                                                                                    22-Feb
                                                                                                                                                               24-Feb
                                                                                                                                                                                26-Feb
                                                                                                                                                                                            28-Feb
                                                                                                                         (a) Top 10 most frequent words

                    20%
                                                                                                                              _‫كورونا‬#                                                                                                                                           #‫الكويت‬                                           #‫الصين‬
                                                                                                                                 ‫الكويت‬                                                                     _‫كورونا‬#                                                             #‫إيران‬                                            #‫عاجل‬
                                                   ‫الصين‬#
                                                                                                                                   _‫كورونا‬#                                                                 ‫في_الكويت‬                                                            #‫كورونا_في_الكويت‬                                 #‫السعودية‬
                    15%
                                                                                                                                        ‫لبنان‬                                                                                                                                    #‫كورونا_الكويت‬                                    #‫كورونا_لبنان‬
                                                                                                                                                                                                                                                                                 #‫لبنان‬                                            #covid19
                                                                                                                            ‫لبنان‬#
         % tweets

                                                                                                                                                                                                        ‫الكويت‬#                                                                                     ‫عاجل‬#
                    10%
                                                                                                                    ‫إيران‬#
                                                                                                                                                                                                                                                                                   #covid19
                    5%
                                             ‫السعودية‬#

                    0%
                                                                                                                                                                                                                                                            11-Mar
                                                                                                                                                                                                                                                                        13-Mar
                                                                                                                                                                                                                                                                                           15-Mar
                                                                                                                                                                                                                                                                                                    17-Mar
                                                                                                                                                                                                                                                                                                             19-Mar
                                                                                                                                                                                                                                                                                                                      21-Mar
                                                                                                                                                                                                                                                                                                                                23-Mar
                                                                                                                                                                                                                                                                                                                                         25-Mar
                                                                                                                                                                                                                                                                                                                                                   27-Mar
                                                                                                                                                                                                                                                                                                                                                            29-Mar
                                                                                                                                                                                                                                                                                                                                                                     31-Mar
                                                        2-Feb
                                                                  4-Feb
                                                                           6-Feb
                                                                                    8-Feb
                          27-Jan
                                    29-Jan
                                               31-Jan

                                                                                             10-Feb
                                                                                                      12-Feb
                                                                                                                14-Feb
                                                                                                                         16-Feb
                                                                                                                                  18-Feb
                                                                                                                                           20-Feb
                                                                                                                                                    22-Feb
                                                                                                                                                               24-Feb
                                                                                                                                                                                 26-Feb
                                                                                                                                                                                            28-Feb
                                                                                                                                                                                                        1-Mar
                                                                                                                                                                                                                       3-Mar
                                                                                                                                                                                                                                 5-Mar
                                                                                                                                                                                                                                          7-Mar
                                                                                                                                                                                                                                                   9-Mar

                                                                                                                   (b) Top 10 most frequent hashtags
                                                                                                                         Time of first reported infected case
                                                                                                                                                                                                                                                                                                                                    ‫ليبيا‬
                                                                                                                ‫مصر‬

                                                                                                                                              ‫لبنان‬

                                                                                                                                                             ‫الكويت عمان البحرين العراق‬

                                                                                                                                                                                                                      ‫قطر‬
                                                                                                                                                                                               ‫السعودية تونس المغرب األردن‬

                                                                                                                                                                                                                                                                                                                         ‫سوريا‬
                                    ‫اإلمارات‬

                                                                                                                                                                                 ‫الجزائر‬

                                                                                                                                                                                                                                                                       ‫السودان موريتانيا‬
                                                                                                                                                                                                                                 ‫فلسطين‬

                    12%

                                                                                     ‫مصر‬                             ‫لبنان‬                                                                 ‫الكويت‬                                                                     ‫الكويت‬                           ‫مصر‬                      ‫لبنان‬                   ‫السعودية‬
                                                                                                                                                                                                                                                                      ‫االمارات‬                         ‫العراق‬                   ‫البحرين‬                 ‫قطر‬
                    10%                                                                                                                                                                    ‫البحرين‬
                                                                                                                         ‫العراق‬                                                                                                                                       ‫اليمن‬                            ‫االردن‬                   ‫الجزائر‬                 ‫سوريا‬

                     8%                                                                                                                                                                                                                                               ‫المغرب‬                           ‫السودان‬                  ‫تونس‬                    ‫فلسطين‬
         % tweets

                                                                                                                                                                                                                                                                      ‫ليبيا‬                            ‫عمان‬                     ‫موريتانيا‬               ‫الصومال‬

                     6%                                                                                                                                                                                                         ‫الجزائر‬
                                                    ‫اإلمارات‬                                                                                                                       ‫عمان‬
                     4%                                              ‫السعودية‬                                                                                                                                                                                                                          ‫قطر‬
                                                                                                                                                                                                                                                                                                                                                  ‫ليبيا‬
                     2%

                     0%
                                                                                                                                                                                                                                                                                                    17-Mar

                                                                                                                                                                                                                                                                                                                                                                     31-Mar
                                                                                                                                                                                                        01-Mar
                                                                                                                                                                                                                       03-Mar
                                                                                                                                                                                                                                 05-Mar
                                                                                                                                                                                                                                          07-Mar
                                                                                                                                                                                                                                                   09-Mar
                                                                                                                                                                                                                                                            11-Mar
                                                                                                                                                                                                                                                                        13-Mar
                                                                                                                                                                                                                                                                                           15-Mar

                                                                                                                                                                                                                                                                                                             19-Mar
                                                                                                                                                                                                                                                                                                                      21-Mar
                                                                                                                                                                                                                                                                                                                                23-Mar
                                                                                                                                                                                                                                                                                                                                         25-Mar
                                                                                                                                                                                                                                                                                                                                                  27-Mar
                                                                                                                                                                                                                                                                                                                                                            29-Mar
                           27-Jan
                                     29-Jan
                                               31-Jan
                                                        02-Feb
                                                                  04-Feb
                                                                           06-Feb
                                                                                    08-Feb
                                                                                             10-Feb
                                                                                                       12-Feb
                                                                                                                14-Feb
                                                                                                                         16-Feb
                                                                                                                                  18-Feb
                                                                                                                                           20-Feb
                                                                                                                                                    22-Feb
                                                                                                                                                                24-Feb
                                                                                                                                                                                 26-Feb
                                                                                                                                                                                            28-Feb

                                                                                                                (c) Word frequency of Arab countries

Figure 3: Time series of the frequency (in percentage of tweets) of top general keywords, hashtags, and Arab
country words for the period of 27th Jan to 31st Mar. The time of first reported cases in Arab countries is also
indicated and aligned with the time series.

                                                                                                                                                                                       86
youm7.com                                                                5420
   Figure 3(b) demonstrates how “#China” hashtag                                   alarabiya.net
                                                                                   youtube.com                                        2984
                                                                                                                                             3405

                                                                              elwatannews.com
was the most trending one from 27 of January until                                 alaraby.co.uk
                                                                                                                                   2769
                                                                                                                                 2569
                                                                                    aawsat.com                                2348
20 of February, as COVID-19 was prevalent only                                          sabq.org
                                                                        independentarabia.com
                                                                                                                              2342
                                                                                                                            2143
in China and still not widely spread (at that time) in                        shorouknews.com
                                                                                         cnn.com
                                                                                                                           2041
                                                                                                                          1989
                                                                                akhbaralaan.net                          1944
other countries. We can see how it started to be less                                      rt.com
                                                                                     alittihad.ae
                                                                                                                         1913
                                                                                                                         1901
trending as the number of cases started to decline                           skynewsarabia.com
                                                                                    argaam.com
                                                                                                                        1792
                                                                                                                       1705
                                                                                           ajel.sa                  1432
in China by that date.12 On the other hand, when                                    royanews.tv
                                                                                       aa.com.tr
                                                                                                                   1405
                                                                                                                   1379
COVID-19 started to spread in the Arab world,                                               ow.ly
                                                                                  euronews.com
                                                                                                                  1346
                                                                                                                  1306
                                                                                            24.ae                1239
other hashtags started to become viral. Further-                                 ampproject.org
                                                                                      al-ain.com
                                                                                                                1231
                                                                                                                1218
more, Figure 3(b) shows that frequency spikes of                                          pscp.tv
                                                                          forbesmiddleeast.com
                                                                                                                1200
                                                                                                                1185

“#Iran”, “#Lebanon”, and “#Kuwait” exactly match                                                     0   1000         2000        3000          4000   5000     6000
                                                                                                                       Number of occurrences
the confirmation dates of first reported cases in
Iran,13 Lebanon,14 and Kuwait,15 respectively.                          Figure 4: Most frequently-linked domains in the top
   To further analyze trending topics, Figure 3 fea-                    tweets subset for the period of 27th Jan 2020 to 31st
tures the timeline of the first reported cases in the                   Jan 2021.
Arab countries and aligns them with the time se-
ries throughout the figure. Aligning the timeline                       ArCOV-19 by considering the domains of URLs
with the series in Figure 3(c) reveals a significant                    shared in the tweets. We focused on the top tweets
match between the frequency peaks of several coun-                      subset of ArCOV-19 (constructed as detailed in Sec-
try names and the corresponding dates of first re-                      tion 3.2) and identified the URLs posted in those
ported cases in those countries, most notably in                        tweets. We then expanded them (since Twitter
UAE, Egypt, Lebanon, Kuwait, Oman, Bahrain,                             URLs are shortened) and dropped URLs of images
Algeria, and Libya.                                                     and videos uploaded to Twitter, to focus only on
   Figure 3 also demonstrates the power of ArCOV-                       URLs referring to external sources. We extracted
19 in capturing further controversial and trending                      110,220 URLs from 2,617 unique domains. Fig-
topics. Table 2 shows a timeline covering dates                         ure 4 shows that URLs from news websites are
of specific topics of discussion trending on social                     dominantly the most-commonly shared. We ob-
media around times of peaks in tweeting frequency                       serve that these news websites mainly originate
in Figure 3(c).                                                         from three Arab countries, namely, Egypt, Saudi
 Country   Date      Topic                 Related News                 Arabia, and United Arab Emirates. Interestingly,
 UAE       Feb 19    Yemeni Foreign        http://tiny.cc/71qtmz        videos from YouTube are among the most com-
                     Affairs Minister
                     thanks UAE for
                                                                        monly shared media. Coupled with the fact that
                     evacuating Yemeni                                  36% of the tweets in ArCOV-19 include embedded
                     students from
                     China
                                                                        images and videos (Table 1), we believe this en-
 Iraq      Feb 20    Iraq announces        http://tiny.cc/yzrtmz        ables ArCOV-19 to be a potential dataset to further
                     closure of borders
                     with Iran
                                                                        support the evaluation of multi-modal retrieval and
 Egypt     Mar 2     Egyptian health       http://tiny.cc/7iqtmz        classification systems.
                     minister announces
                     a visit to China
           Mar 7     Kuwait bans travels   http://tiny.cc/lqstmz        4.3         Geographic Distribution of Tweets
                     with Egypt
                     including entry of                                 Although Twitter provides automatic geo-reference
                     Egyptian residents                                 functionally, few users (solely around 1-3% (Mur-
 KSA       Mar 4-5   Closure of The        http://tiny.cc/iqrtmz
                     Grand Mosque in                                    dock, 2011; Jurgens et al., 2015)) opt to enable it
                     Mecca                                              due to privacy and safety reasons. Alternatively, to
                                                                        have an insight about the geographical distribution
Table 2: Examples of trending topics in social media
matched by spikes in tweeting frequency in ArCOV-19                     and diversity of the tweets in our dataset, we ex-
                                                                        amined the place and coordinates attributes of the
   We further explore the topics discussed in                           Tweet object.16 We note that the place attribute is
                                                                        an optional attribute that allows the user to select a
  12
     https://tinyurl.com/rfj5khn
  13                                                                        16
     https://tinyurl.com/yykgwbox                                          https://developer.twitter.com/en/
  14
     https://tinyurl.com/yxkg78g5                                       docs/tweets/data-dictionary/overview/
  15
     https://tinyurl.com/y4nlgz5h                                       geo-objects

                                                                   87
105
location from a list provided by Twitter (therefore,
the location might not necessarily show the actual                                  104
location from where the tweet is posted), whereas

                                                               Number of retweets
the coordinates attribute represents the geographic                                 103

location of the tweet as reported by the user or
client application.                                                                 102

   Table 1 shows that ArCOV-19 has 60,873 ge-
                                                                                    101
olocated tweets (i.e., having values in the place
attribute) and 2,078 geotagged tweets (i.e., having                                 100
values in the coordinates attribute). Those tweets

                                                                                      27/01
                                                                                      29/01
                                                                                      31/01
                                                                                      02/02
                                                                                      04/02
                                                                                      06/02
                                                                                      08/02
                                                                                      10/02
                                                                                      12/02
                                                                                      14/02
                                                                                      16/02
                                                                                      18/02
                                                                                      20/02
                                                                                      22/02
                                                                                      24/02
                                                                                      26/02
                                                                                      28/02
                                                                                      01/03
                                                                                      03/03
                                                                                      05/03
                                                                                      07/03
                                                                                      09/03
                                                                                      11/03
                                                                                      13/03
                                                                                      15/03
                                                                                      17/03
                                                                                      19/03
                                                                                      21/03
                                                                                      23/03
                                                                                      25/03
                                                                                      27/03
                                                                                      29/03
                                                                                      31/03
                                                                                      02/04
                                                                                      04/04
                                                                                      06/04
                                                                                      08/04
                                                                                      10/04
                                                                                      12/04
                                                                                      14/04
                                                                                      16/04
                                                                                      18/04
                                                                                      20/04
                                                                                      22/04
                                                                                      24/04
                                                                                      26/04
                                                                                      28/04
                                                                                      30/04
were posted by 24,072 and 256 unique users re-
spectively. The geolocated tweets constitute about                Figure 6: Distribution of retweets per day for the top
2.28% of the total source tweets, which is con-                   subset for the period of 27th Jan 2020 to 30th April
sistent with previous studies (Huang and Carley,                  2020.
2019).
                                                                  among the tracking keywords in the period between
  Libya
  1.7%
  Bahrain
                                                                  22 of February and 13 of March. Furthermore, it is
  1.9%
  Qatar
                                                                  not surprising to see the countries that have a few
  2.1%
  Jordan
                                                                  cases have the smallest portions of contribution to
  3.0%
  Iraq                                           KSA              the content. Others have small audience userbases
  4.2%                                          41.7%
  Lebanon                                                         on Twitter (e.g., Tunisia).
  5.3%
  Egypt
  5.4%                                                            4.4                      Propagation Networks
  Oman
  6.6%                                                          As discussed earlier, we collected the propagation
  UAE
                                                                networks of the top subset. Overall, the number of
  6.8%                                                          retweets is 7,925,821 for the entire subset, and the
  Other                                        Kuwait
  7.3%                                          9.6%            number of replies is 1,476,950 18 .
                                                                   At the time of collecting the retweets, we were
Figure 5: Distribution of geolocated tweets for the pe-         not able to get the retweets of some of the tweets ei-
riod of 27th Jan 2020 to 31st Jan 2021.                         ther because they were deleted or they were posted
                                                                from private accounts. For those collected suc-
   The geolocated tweets were indeed posted by                  cessfully, Figure 6 illustrates the distribution of
users from 102 countries from around the globe.                 retweets per day using boxplots. It shows that the
We found that 92.75% of them were posted from                   average (and median) number of retweets follows
the Arab world. The largest contribution of the ge-             a similar pattern to what was shown in Figure 1
olocated content (about 41.7%) comes from Saudi                 (notice that the Y-axis here is in log scale). We also
Arabia; this is somewhat expected as Saudi users                notice that a good number of tweets got more than
represent the highest number of active Twitter users            100 retweets; some of them got even larger num-
in the Arab world.17 Surprisingly, Kuwait comes                 bers reaching about 10k retweets or more, showing
second with 9.6% of the geolocated content. We                  highly propagated content. As shown in Figure 7,
believe the rationale is that it was among the first            we applied a similar analysis to the replies per day,
countries that reported COVID-19 cases in the Mid-              and found similar patterns to the retweets.
dle East. Since then, Kuwait started a series of strict
precautions such as a wide lock-down in many vi-                  5                       Enabling Research
tal facilities until the government had imposed a
                                                                 With the spread of Novel Coronavirus in several
nationwide curfew. We think, after the curfew, the
                                                                 Arab countries and the subsequent procedural mea-
people become more active on Twitter as a plat-
                                                                 sures taken by the local governments, it continued
form to break news and discuss developments of
                                                                 to dominate discussions over social media (and
the virus. Additionally, we used related phrases
                                                                                     18
and hashtags to “Kuwait”, “Lebanon”, and “UAE”                         The stats are for the replies for data until end of April
                                                                  2020 and we are still collecting the rest; we will make all
  17
       https://tinyurl.com/jmkttt3                                available in the near future.

                                                          88
Misinformation Detection: With the sheer
                    104
                                                                      amount of information shared about COVID-19,
                                                                      many rumors are disseminated and getting high at-
                    103
Number of replies

                                                                      tention from the community, which causes a fast
                                                                      propagation of misinformation. This hinders the
                    102
                                                                      efforts of the health and governmental organiza-
                    101
                                                                      tions on fighting the pandemic as such rumors
                                                                      spread panic and may lead to undesirable conse-
                    100
                                                                      quences (e.g., increase of cases and mortality rate,
                                                                      or lack of supplies due to hoarding). ArCOV-19
                      27/01
                      29/01
                      31/01
                      02/02
                      04/02
                      06/02
                      08/02
                      10/02
                      12/02
                      14/02
                      16/02
                      18/02
                      20/02
                      22/02
                      24/02
                      26/02
                      28/02
                      01/03
                      03/03
                      05/03
                      07/03
                      09/03
                      11/03
                      13/03
                      15/03
                      17/03
                      19/03
                      21/03
                      23/03
                      25/03
                      27/03
                      29/03
                      31/03
                      02/04
                      04/04
                      06/04
                      08/04
                      10/04
                      12/04
                      14/04
                      16/04
                      18/04
                      20/04
                      22/04
                      24/04
                      26/04
                      28/04
                      30/04
                                                                      supports studying information/claims propagation,
    Figure 7: Distribution of replies per day for the top sub-        claims check-worthiness detection and verification
    set for the period of 27th Jan 2020 to 30th April 2020.           tasks on the most popular tweets. Furthermore,
                                                                      the retweet networks and conversational threads
                                                                      provide a valuable resource for early detection of
   Twitter in particular) to the time of this writing. To             fake news and identification of malicious rumor-
   enable research on different tasks on Arabic Tweets,               spreading accounts. To further facilitate work in
   we make ArCOV-19 publicly-available for the re-                    this domain of problems, we recently constructed
   search community. Having quantified the topical,                   ArCOV19-Rumors, an annotated dataset on top of
   geographical, and community representativeness                     ArCOV-19 to support claim and tweet verification
   of ArCOV-19, we envision that it is suitable for                   (Haouari et al., 2021).
   diverse natural language processing, information
   retrieval, and computational social science research               6   Conclusion
   tasks including, but not limited to, emergency man-
   agement, misinformation detection, and social ana-                 In this paper, we presented ArCOV-19, the first
   lytics, as we discuss below. Furthermore, we pro-                  Arabic Twitter dataset about the Novel Coronavirus
   vide propagation networks of daily highly popular                  (COVID-19) that includes propagation networks
   subset (the most popular 1K) to ensure the quality                 of a large subset of tweets. We release all source
   of tweets. This sample is drawn after filtering out                tweets, top subset, search queries, and the propa-
   potential low-quality tweets (e.g., spam) and con-                 gation networks. Preliminary analysis showed that
   tent duplicates. We anticipate the top subset and                  ArCOV-19 captured spikes in tweeting frequency
   the network to support these tasks on the popular                  for country-specific tweets that are consistent with
   Arabic tweets that we assume have the most effect                  the first reported cases of COVID-19 in several
   on shaping public opinion and understanding.                       Arab countries. We also found dominance of news
      Emergency Management: As COVID-19 is an                         agencies among top tweeting users and among most
   international pandemic, national and international                 shared URLs. ArCOV-19 enables research under
   health organizations need to analyze the effects of                many domains including natural language process-
   the outbreak. We believe ArCOV-19 can support                      ing, information retrieval, and social computing.
   several tasks in that domain, such as filtering of in-             We plan to continue collecting tweets for the fore-
   formative content, events and sub-events detection,                seeable future and the dataset will be continuously
   summarization, identification of eyewitnesses, ge-                 updated with newly collected tweets and propaga-
   olocation, and studying information and situational                tion networks.
   awareness propagation, to name a few.
                                                                      Acknowledgments
      Social Analytics: In addition to sharing reports
   and awareness during the outbreak, people tend to                  The work of Tamer Elsayed and Maram Hasanain
   discuss their opinions (e.g., stance towards the situ-             was made possible by NPRP grant# NPRP 11S-
   ation and its consequences) and express their emo-                 1204-170060 from the Qatar National Research
   tions (e.g., big changes in lifestyle, social distanc-             Fund (a member of Qatar Foundation). The work
   ing, loss of their beloved ones, etc.). Therefore, an-             of Reem Suwaileh was supported by GSRA grant#
   alyzing the tweets to detect sentiment, stance, hate               GSRA5-1-0527-18082 from the Qatar National Re-
   speech, and generally, offensive language, among                   search Fund and the work of Fatima Haouari was
   other aspects, is of interest to many stakeholders.                supported by GSRA grant# GSRA6-1-0611-19074

                                                                 89
from the Qatar National Research Fund. The state-             Yong Hu, He-Yan Huang, Anfan Chen, and Xian-Ling
ments made herein are solely the responsibility of              Mao. 2020. Weibo-cov: A large-scale covid-19 so-
                                                                cial media dataset from weibo. In Proceedings of
the authors.
                                                                the 1st Workshop on NLP for COVID-19 (Part 2) at
                                                                EMNLP 2020.
References                                                    Binxuan Huang and Kathleen M Carley. 2019. A
                                                                large-scale empirical study of geotagging behavior
Sarah Alqurashi, Ahmad Alhindi, and Eisa Alanazi.               on Twitter. In Proceedings of the 2019 IEEE/ACM
  2020. Large Arabic Twitter Dataset on COVID-19.               International Conference on Advances in Social Net-
  arXiv preprint arXiv:2004.04315.                              works Analysis and Mining, pages 365–373.
Thayer Alshaabi, David R Dewhurst, Joshua R Minot,            David Jurgens, Tyler Finethy, James McCorriston,
  Michael V Arnold, Jane L Adams, Christopher M                 Yi Tian Xu, and Derek Ruths. 2015. Geolocation
  Danforth, and Peter Sheridan Dodds. 2020a. The                prediction in twitter using social networks: A critical
  growing echo chamber of social media: Measuring               analysis and review of current practice. In Proceed-
  temporal and social contagion dynamics for over               ings of the Ninth International AAAI Conference on
  150 languages on twitter for 2009–2020. arXiv                 Web and Social Media.
  preprint arXiv:2003.03667.
                                                              Ireneus Kagashe, Zhijun Yan, and Imran Suheryani.
Thayer Alshaabi, JR Minot, MV Arnold, Jane Lydia                 2017. Enhancing seasonal influenza surveillance:
  Adams, David Rushing Dewhurst, Andrew J Reagan,                topic analysis of widely used medicinal drugs using
  Roby Muhamad, Christopher M Danforth, and Pe-                  twitter data. Journal of medical Internet research,
  ter Sheridan Dodds. 2020b. How the world’s collec-             19(9):e315.
  tive attention is being paid to a pandemic: Covid-19
  related 1-gram time series for 24 languages on twit-        Christian E Lopez, Malolan Vasu, and Caleb Galle-
  ter. arXiv preprint arXiv:2003.12614.                         more. 2020.      Understanding the perception of
                                                                covid-19 policies by mining a multilanguage twitter
Juan M Banda, Ramya Tekumalla, Guanyu Wang,                     dataset. arXiv preprint arXiv:2003.10359.
  Jingyuan Yu, Tuo Liu, Yuning Ding, and Ger-
  ardo Chowell. 2020. A large-scale covid-19 twit-            Andrew McNeill, Peter R. Harris, and Pam Briggs.
  ter chatter dataset for open scientific research–             2016. Twitter influence on uk vaccination and an-
  an international collaboration.  arXiv preprint               tiviral uptake during the 2009 h1n1 pandemic. Fron-
  arXiv:2004.03688.                                             tiers in Public Health, 4:26.
Emily Chen, Kristina Lerman, and Emilio Ferrara.              Vanessa Murdock. 2011. Your mileage may vary: On
  2020a. Covid-19: The first public coronavirus twit-           the limits of social media. SIGSPATIAL Special,
  ter dataset. arXiv preprint arXiv:2003.07372.                 3(2):62–66.
Emily Chen, Kristina Lerman, and Emilio Ferrara.              Umair Qazi, Muhammad Imran, and Ferda Ofli. 2020.
  2020b. Tracking social media discourse about the             Geocov19: a dataset of hundreds of millions of mul-
  covid-19 pandemic: Development of a public coro-             tilingual covid-19 tweets with location information.
  navirus twitter data set. JMIR Public Health and             SIGSPATIAL Special, 12(1):6–15.
  Surveillance, 6(2):e19273.
                                                              Melissa Roy, Nicolas Moreau, Cécile Rousseau, Ar-
Hassan Dashtian and Dhiraj Murthy. 2021. Cml-covid:            naud Mercier, Andrew Wilson, and Laëtitia Atlani-
  A large-scale covid-19 twitter dataset with latent           Duault. 2020. Ebola and localized blame on social
  topics, sentiment and location information. arXiv            media: Analysis of twitter and facebook conversa-
  preprint arXiv:2101.12202.                                   tions during the 2014–2015 ebola epidemic. Culture,
                                                               Medicine, and Psychiatry, 44(1):56–79.
Zhiwei Gao, Shuntaro Yada, Shoko Wakamiya, and
  Eiji Aramaki. 2020. Naist covid: Multilingual               Lisa Singh, Shweta Bansal, Leticia Bode, Ceren Budak,
  covid-19 twitter and weibo dataset. arXiv preprint             Guangqing Chi, Kornraphop Kawintiranon, Colton
  arXiv:2004.08145.                                              Padden, Rebecca Vanarsdall, Emily Vraga, and
                                                                Yanchen Wang. 2020. A first look at covid-19 infor-
Purva Grover, Arpan Kumar Kar, Yogesh K. Dwivedi,                mation and misinformation sharing on twitter. arXiv
  and Marijn Janssen. 2019. Polarization and accultur-           preprint arXiv:2003.13907.
  ation in us election 2016 outcomes – can twitter an-
  alytics predict changes in voting preferences. Tech-        Santosh Vijaykumar, Glen Nowak, Itai Himelboim, and
  nological Forecasting and Social Change, 145:438 –            Yan Jin. 2018. Virtual zika transmission after the
  460.                                                          first us case: who said what and how it spread
                                                                on twitter. American journal of infection control,
Fatima Haouari, Maram Hasanain, Reem Suwaileh,                  46(5):549–557.
  and Tamer Elsayed. 2021. Arcov19-rumors: Ara-
  bic covid-19 twitter dataset for misinformation de-         Koosha Zarei, Reza Farahbakhsh, Noel Crespi, and
  tection. In Proceedings of the Sixth Arabic Natural           Gareth Tyson. 2020. A first instagram dataset on
  Language Processing Workshop, WANLP 2021.                     covid-19. arXiv preprint arXiv:2004.12226.

                                                         90
Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geral-
  dine Wong Sak Hoi, and Peter Tolmie. 2016.
  Analysing how people orient to and spread rumours
  in social media by looking at conversational threads.
  PloS one, 11(3).

                                                          91
You can also read