Data Mining and Content Analysis of the Chinese Social Media Platform Weibo During the Early COVID-19 Outbreak: Retrospective Observational ...

Page created by Kathleen Sparks
 
CONTINUE READING
Data Mining and Content Analysis of the Chinese Social Media Platform Weibo During the Early COVID-19 Outbreak: Retrospective Observational ...
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                                   Li et al

     Original Paper

     Data Mining and Content Analysis of the Chinese Social Media
     Platform Weibo During the Early COVID-19 Outbreak:
     Retrospective Observational Infoveillance Study

     Jiawei Li1,2,3,4, MS; Qing Xu1,2,3,4, MAS; Raphael Cuomo1,4, MPH, PhD; Vidya Purushothaman1,4, MBBS, MAS; Tim
     Mackey1,2,3,4, MAS, PhD
     1
      Department of Anesthesiology and Division of Infectious Diseases and Global Public Health, University of California San Diego School of Medicine,
     La Jolla, CA, United States
     2
      S-3 Research LLC, San Diego, CA, United States
     3
      Department of Healthcare Research and Policy, University of California San Diego Extension, La Jolla, CA, United States
     4
      Global Health Policy Institute, San Diego, CA, United States

     Corresponding Author:
     Tim Mackey, MAS, PhD
     Department of Anesthesiology and Division of Infectious Diseases and Global Public Health
     University of California San Diego School of Medicine
     8950 Villa La Jolla Drive
     A124
     La Jolla, CA, 92037
     United States
     Phone: 1 9514914161
     Email: tmackey@ucsd.edu

     Abstract
     Background: The coronavirus disease (COVID-19) pandemic, which began in Wuhan, China in December 2019, is rapidly
     spreading worldwide with over 1.9 million cases as of mid-April 2020. Infoveillance approaches using social media can help
     characterize disease distribution and public knowledge, attitudes, and behaviors critical to the early stages of an outbreak.
     Objective: The aim of this study is to conduct a quantitative and qualitative assessment of Chinese social media posts originating
     in Wuhan City on the Chinese microblogging platform Weibo during the early stages of the COVID-19 outbreak.
     Methods: Chinese-language messages from Wuhan were collected for 39 days between December 23, 2019, and January 30,
     2020, on Weibo. For quantitative analysis, the total daily cases of COVID-19 in Wuhan were obtained from the Chinese National
     Health Commission, and a linear regression model was used to determine if Weibo COVID-19 posts were predictive of the number
     of cases reported. Qualitative content analysis and an inductive manual coding approach were used to identify parent classifications
     of news and user-generated COVID-19 topics.
     Results: A total of 115,299 Weibo posts were collected during the study time frame consisting of an average of 2956 posts per
     day (minimum 0, maximum 13,587). Quantitative analysis found a positive correlation between the number of Weibo posts and
     the number of reported cases from Wuhan, with approximately 10 more COVID-19 cases per 40 social media posts (P
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                          Li et al

     (JMIR Public Health Surveill 2020;6(2):e18700) doi: 10.2196/18700

     KEYWORDS
     COVID-19; coronavirus; infectious disease; social media, surveillance; infoveillance; infodemiology

                                                                         key trends that may also correlate with outbreak incidence data
     Introduction                                                        [18].
     The coronavirus disease (COVID-19) is a rapidly emerging            Leveraging infoveillance approaches, we conducted a
     infectious disease caused by a novel coronavirus named severe       retrospective observational study for COVID-19 on one of the
     acute respiratory syndrome (SARS) coronavirus 2 [1]. The            largest Chinese social media platforms, Sina Weibo [新浪微
     COVID-19 outbreak began in late December 2019 in Wuhan,             博]. Sina Weibo is a microblogging website (also known as the
     Hubei Province, China, with a cluster of patients presenting        Chinese equivalent to Twitter) and one of the most influential
     with pneumonia of unknown origin and reported exposure to a         social media platforms in China. According to its own press
     seafood and live animal market in the same city [2]. The World      release in August 2020, it had over 486 million active users
     Health Organization (WHO) confirmed 41 cases and 1 death            [19]. Users can publish content such as messages in microblogs
     due to the novel coronavirus by January 12, 2020 [3]. Since this    and share text, pictures, videos, and music. Compared with
     initial reporting, COVID-19 has rapidly spread within China         WeChat, another popular social media platform in China, Weibo
     and internationally, with the WHO declaring a Public Health         posts are generally more publicly visible; with WeChat, posts
     Emergency of International Concern (PHEIC) under the revised        are generally more private and only visible to certain people
     International Health Regulations on January 30, 2020 [4].           selected by users. Due to the public nature of the platform, we
     Since the PHEIC declaration, COVID-19 has spread to every           attempted to assess whether Weibo posts about COVID-19 were
     continent except Antarctica, becoming a highly infectious global    predictive of the number of reported cases during the outbreak’s
     pandemic with sustained community transmission [5]. The             early stages and conducted a qualitative analysis of
     severity of the COVID-19 outbreak, with approximately 1.8           COVID-19–related themes detected and discussed by users
     million cases worldwide as of mid-April 2020 [6], has far           located in Wuhan.
     surpassed past coronavirus events such as the Middle East
     respiratory syndrome (MERS)–related coronavirus, which had          Methods
     2494 cases as of November 2019, and the 2003 SARS
     coronavirus, which had more than 8000 cases and affected 26
                                                                         Study Design
     countries. It is unknown whether viral mutations will result in     This observational infoveillance study was conducted in two
     patterns of annual re-emergence as seen with influenza strains.     phases: data collection using an automated Python (Python
                                                                         Software Foundation) programming script to collect
     Attempts to predict epidemiological features (eg, prevalence,       COVID-19-related posts on Weibo, and quantitative and
     attack rate, replication or reproduction rate, morbidity, and       qualitative analysis to identify trends and characterize key
     mortality) of an outbreak to inform infection control and public    themes discussed by Chinese users.
     health countermeasures are critical. This can be challenging
     during the earlier stages of an outbreak when there is a lack of    Data Collection
     sufficient information regarding the etiology of the disease,       Programming scripts were written in the Python programming
     inadequate diagnostic and testing capabilities, and incomplete      language to extract posts on Weibo in the Chinese language
     epidemiological data regarding confirmed cases [7]. In the          (traditional and simplified Chinese) from users self-reporting
     absence of such data, the use of information in an electronic       their location in Wuhan. Weibo users can post messages limited
     medium such as social media conversations can enable                to 2000 characters with or without images, videos, and other
     syndromic surveillance approaches to characterize disease           multimedia, and can repost messages equivalent to the retweet
     distribution and provide accurate case counts more rapidly [8].     function on Twitter. Python scripts were set to continuously
     These “infoveillance” approaches have been used to characterize     collect data filtered for COVID-19–related keywords from
     a host of public health issues including topics related to mental   December 23, 2019, to January 30, 2020. Keywords included
     health, substance abuse behavior, the spread of foodborne           the Chinese-language terms: [冠状病毒] (coronavirus), [新型
     illness, and the monitoring of infectious disease outbreaks (eg     肺炎] (novel pneumonia), [武汉肺炎] (Wuhan pneumonia),
     pertussis, influenza, HIV/AIDS, dengue, West Nile virus, Zika       [疫情] (epidemic situation), [非典] (severe acute respiratory
     virus, H1N1, and Ebola) [8-12]. Specifically, the now ubiquitous    syndrome), [华南海鲜市场] (Wuhan Seafood Wholesale
     nature of social media means it represents an important,            Market). These keywords were chosen on the basis of manual
     “nontraditional” source for disease surveillance. Specifically,     searches via the platform’s public search function to detect a
     user-generated social media data can be mined to assess the         baseline of user conversations related to the outbreak for our
     public’s knowledge, attitudes, and behaviors toward the disease,    systematic data collection processes. The variation in selected
     and can be particularly informative when cross-validated with       keywords was also necessary, as the official name of the disease
     traditional disease surveillance data [13-17]. Others have also     was not announced until February 11, 2020 [20]. These data
     used global social media platforms such as Twitter to examine       collection methods are consistent with other analyses of
     the 2009 H1N1 pandemic, conduct content analysis, and identify      health-related posts (including flu-related topics) on the Weibo
                                                                         platform not related to COVID-19 [21].
     http://publichealth.jmir.org/2020/2/e18700/                                           JMIR Public Health Surveill 2020 | vol. 6 | iss. 2 | e18700 | p. 2
                                                                                                                (page number not for citation purposes)
XSL• FO
RenderX
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                            Li et al

     Posts were filtered for geographic location, thereby identifying     Third, we identified parent classifications to select prevalent
     users specifically within the city of Wuhan, China. Posts with       topics and then tagged and grouped these classifications with
     geographic locations outside of Wuhan City were not included         supporting qualitative data (eg, Weibo posts). We primarily
     in this study, as the aim was to focus on the early origins of the   relied on inductive coding approaches starting with Weibo
     outbreak in this region. Posts were collected from all account       COVID-19 posts identified but also informed this inductive
     types including personal accounts, media accounts, and               coding based on themes detected in existing literature from prior
     government accounts. Technical limitations of the Weibo              disease outbreaks [18,21]. Coders individually selected parent
     platform and Python script used to collect data limited our data     topic classifications to represent different thematic areas and
     collection to a maximum of 2000 posts per hour. However,             collapsed infrequent categories into parent classifications. We
     during the 936 hours of data collection, we did not reach this       then combined the related topics, removed duplicate topics, and
     limit for any hour of collection.                                    evaluated thematic concurrence by independently coding the
                                                                          entire sample of posts collected from the early period of the
     The number of COVID-19 cases in all of mainland China and
                                                                          outbreak with detected COVID-19-related posts (December 31,
     Hubei Province were collected for each calendar day between
                                                                          2019, until January 20, 2020.)
     December 23, 2019, and January 30, 2020, inclusively. Case
     counts were made publicly available on the internet by the
     Chinese National Health Commission, a cabinet-level executive
                                                                          Results
     department of the Chinese central government, headquartered          Data Availability and Ethics Approval
     in Beijing [20].
                                                                          Data collected on social media platforms is available on request
     Quantitative Analysis                                                from authors subject to appropriate deidentification. Ethics
     Weibo posts with COVID-19-related keywords were binned               approval was not required for this study. All information
     into each calendar day to calculate posts per day. Longitudinal      collected from this study was from the public domain, and the
     trends were then visually conveyed using line graphs. Regression     study did not involve any interaction with users. Indefinable
     analysis was conducted to understand the predictive value of         user information was removed from the study results.
     social media posts on the number of confirmed cases reported         Data Collection
     by the Chinese government. Simple linear regression was
     performed between social media posts per day and the number          There were 115,299 posts collected during the study time frame,
     of cases reported within mainland China, excluding Hubei             with an average of 2956 Weibo posts per day. There was a high
     Province, and the cases in Hubei Province alone. Also, a simple      degree of variation in the number of posts depending on the
     linear regression model was computed wherein the number of           date of collection, with 0 posts collected on one day (December
     posts per day was used to predict percent daily change in cases      26, 2019) and the highest number of posts (13,740) collected
     from Hubei, and a separate model was computed to predict             on January 27, 2020. During this same time period, China
     percent daily change in cases from mainland China (excluding         reported 36,456 confirmed cases of COVID-19. COVID-19
     Hubei). Modeling of daily changes was conducted using the            cases were reported starting on January 16, 2020, when 45 cases
     final 20 days of data, as prior days did not exhibit daily changes   were reported. The number of cases then increased rapidly,
     in posts, cases from Hubei, or cases from the remainder of           reaching 9692 cases on January 30, 2020, the final day of data
     China. A P value
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                                     Li et al

     reported in Hubei among all incident cases in mainland China                outbreak-related Weibo posts are reactive to local disease
     (Wuhan cases proportion=–0.003*posts+100.2; P
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                             Li et al

     with the outbreak. This period of initial uncertainty was followed    behaviors that could potentially increase the risk of transmission
     by the information disclosure about the causative agent by the        (eg, going to New Year celebration events and self-evacuation
     Chinese National Health Commission and other official                 from Wuhan). New user-response topics began to emerge toward
     government and academic sources. As the outbreak progressed,          the end of the early outbreak period, including criticism of the
     a higher volume of more precise terms including “novel                Wuhan Red Cross response and user uncertainty related to news
     coronavirus” or “COVID-19” were detected, with a decline and          about quarantines, travel restrictions, and new hospital
     shift from posts mentioning colloquial terms such as “unknown         construction projects.
     reason pneumonia” and “Wuhan pneumonia,” similar to
                                                                           Another subtheme detected in user responses was the discussion
     terminology adoption during the H1N1 pandemic [18].
                                                                           about COVID-19-related symptoms that appeared in both news
     The presence of parent classifications also changed over the          and individual posts. At the beginning of the outbreak period,
     time period. For example, at the onset of the outbreak, many          most symptom-related posts were posted by official accounts
     users discussed what the causative agent of the outbreak might        or consisted of news posts that were reposted and shared by the
     be, including how seasonal influenza, avian influenza, MERS,          public that included describing symptoms of fever and shortness
     and SARS were ruled out. From January 16-20, posts about the          of breath (January 1) and coughing and weakness or fatigue
     causative agent also increased with most conversations                (January 10). Separate from these news-related posts, some
     discussing a novel coronavirus strain. There were also early          individual accounts also self-reported other symptoms including
     conversations about the South China Seafood Market and                headache, diarrhea, sore throat, and perspiration during sleep.
     wildlife trafficking, reflecting uncertainty regarding the zoonotic   After January 20, 2020, human-to-human transmission was
     origins and possible transmission vectors of the disease. The         confirmed, and we detected an increase in posts related to user
     proportion of posts with reference to the South China Seafood         self-reported symptoms along with posts that provided
     Market were much higher prior to January 6.                           second-hand reporting of symptoms from other people during
                                                                           our coding of the random stratified sample of the entire 39-day
     The detection of posts related to the epidemiological
                                                                           data study period. Though not fully coded for this study, our
     characteristics of the outbreak were relatively consistent
                                                                           random selection detected a few users who reported other
     throughout the entire 21-day period assessed. However,
                                                                           disputed symptoms, including a loss of taste and lack of appetite,
     following confirmation of the outbreak as a novel coronavirus,
                                                                           in the last 10 days of our overall data collection (January 21-30).
     there was a spike of discussions about whether COVID-19 was
     transmittable human-to-human. Relatedly, at the beginning of          Importantly, both the nature of the content and the volume of
     the outbreak, there were some posts where users expressed their       posts were likely driven by a combination of release of
     own personal reaction and concerns about a potential outbreak.        government information and news events. For example,
     From January 14, after 3 cases were confirmed outside of China,       December 31, 2019 had the second highest volume of posts
     there was a notable increase in the number of posts related to        related to the COVID-19 term “pneumonia of unknown cause,”
     the public's reaction to the outbreak and its associated risks and    which corresponded to the confirmation from an official source
     control and response measures.                                        (the Wuhan Health Commission) of cases of pneumonia of
                                                                           unknown etiology detected in Wuhan city. There was also an
     More specific subthemes regarding users’ knowledge, attitudes,
                                                                           increase in COVID-19 Weibo posts on January 6, 2020, likely
     and responses to COVID-19 changed as more information about
                                                                           driven by news on January 5 that laboratory test results had
     the underlining epidemiology became available. Specifically,
                                                                           ruled out other pathogens (eg, influenza, avian influenza,
     the terms “confirmed case,” “suspected case,” “death case,”
                                                                           adenovirus, MERS, and SARS) as the cause of the outbreak.
     “human-to-human transmission,” “monitored,” “public health
                                                                           Additionally, on January 15, the WHO announced that the
     supervision,” and “quarantine” became more frequent as the
                                                                           possibility of limited human-to-human transmission could not
     outbreak progressed. Accompanying this shift in terminology,
                                                                           be excluded. This drove a second increase in the overall number
     we also observed wide variation in user reactions as information
                                                                           of posts on January 16. Finally, on January 20, human-to-human
     from government sources was disseminated and the outbreak
                                                                           transmission was confirmed, generating a large number of
     worsened. This included posts conveying protective behaviors
                                                                           conversations from both media accounts and private Weibo
     (eg, cleaning hands, staying away from crowds, wearing medical
                                                                           users, resulting in the largest increase of posts observed during
     masks in public areas), while others conveyed attitudes and
                                                                           this time period.

     http://publichealth.jmir.org/2020/2/e18700/                                              JMIR Public Health Surveill 2020 | vol. 6 | iss. 2 | e18700 | p. 5
                                                                                                                   (page number not for citation purposes)
XSL• FO
RenderX
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                                     Li et al

     Table 1. Posts on the Weibo social media platform organized according to themes detected in qualitative analysis.
         Theme, Description of topic (English)                                  Example topics (Chinese with English translation)
         Causative agent

             Concerns about potential SARSa re-emergence                        [希望不是非典,大家安心过年]
                                                                                “I wish this is not SARS, hope everyone can have a peaceful New Year
                                                                                Eve”
             Wuhan Seafood Wholesale Market considered the source of causative [华南海鲜市场可能是病毒来源]
             agent                                                             “Wuhan Seafood Wholesale Market might be the source of the causative
                                                                               agent”
             Pneumonia caused by unknown reason                                 [武汉出现不明原因肺炎]
                                                                                “cases of pneumonia of unknown etiology detected in Wuhan City”
             Novel coronavirus has been confirmed as causative agent            [初步认定为新型冠状病毒]
                                                                                “Causative agent preliminary identification of a novel coronavirus”
         Epidemiological characteristics of the outbreak
             No evidence of human-to-human transmission                         [目前没有证据证明人传人]
                                                                                “There is no evidence of human-to-human transmission”
             New confirmed cases and statistics on mortality                    [截至昨日病例已增至44例,其中11例重症,均接受隔离治疗]
                                                                                “Until yesterday, there were 44 confirm cases, 11 severe cases, all have
                                                                                been isolated and treated”
             Public Health supervision for people who had close contact with pa- [与病患接触者进行医学观察]
             tients with confirmed cases                                         “People who have close connection with confirmed cases are under med-
                                                                                 ical supervision”
         Public reaction to outbreak control and response
             Wear masks, keep away from crowds                                  [虽然不知道是不是人传人,还是戴上口罩,远离人群]
                                                                                “Not sure if it is human to human transmissible, but wear masks and keep
                                                                                away from crowds would help. Just in case.”
             Self-evacuating Wuhan                                              [太可怕了,快点离开这儿吧]
                                                                                “It’s scary, let’s leave here (Wuhan)”
             Will still attend New Year celebration event                       [不明原因肺炎丝毫不影响大家戴口罩出来跨年]
                                                                                “The pneumonia of unknown cause seems not to impact people’s New
                                                                                Year Eve celebration event”
         Other topics
             Wuhan Red Cross under scrutiny                                     [红十字会遭到社会指责]
                                                                                “The Red Cross was accused by society”
             Travel restrictions                                                [从1月23日10时起,武汉市和周边的鄂州市、仙桃市、潜江市、黄
                                                                                冈市、荆门市等相续宣布暂停运营城市公交、地铁、轮渡、长途客
                                                                                运,暂时关闭机场、火车站、高速公路等离开通道,严防武汉新型
                                                                                冠状病毒疫情扩散。] “From 10:00 on January 23, Wuhan and surround-
                                                                                ing Ezhou, Xiantao, Qianjiang, Huanggang, Jingmen, etc. have successively
                                                                                announced the suspension of operation of city buses, subways, ferries,
                                                                                long-distance passenger transport, and temporarily closed airports and
                                                                                trains. Stations, highways, etc. leave the passage to strictly prevent the
                                                                                spread of new coronavirus epidemics in Wuhan.”
             New hospital construction projects                                 [新建武汉“小汤山]
                                                                                “Will build new hospitals in Wuhan (Lei Shen Shan & Huo Shen Shan),
                                                                                also called Wuhan Xiao Tang Shan”

     a
         SARS: severe acute respiratory syndrome.

                                                                                 microblogging site Twitter or Google trends data, yet only a
     Discussion                                                                  few have examined foreign-language platforms. Due to the
     Principal Findings                                                          COVID-19 outbreak originating in Wuhan, China, this study
                                                                                 sought to identify, characterize, and assess the potential
     The vast majority of infoveillance studies have analyzed data               relationship between Chinese social media conversations taking
     from English-language social media platforms such as the
     http://publichealth.jmir.org/2020/2/e18700/                                                      JMIR Public Health Surveill 2020 | vol. 6 | iss. 2 | e18700 | p. 6
                                                                                                                           (page number not for citation purposes)
XSL• FO
RenderX
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                             Li et al

     place in Wuhan and the number of COVID-19 confirmed cases             Public perception about the origins and transmission patterns
     at the early stages of the outbreak, which has now transitioned       of COVID-19 changed over the study period, with initial
     into a worldwide pandemic. It also sought to understand how           conversations focusing on a possible link with the seafood
     user perception changed as additional information became              market in Wuhan, later changing to discussions about the
     available from government and media sources as the outbreak           increased number of cases reporting no exposure to the seafood
     progressed while also attempting to identify parent                   market, and then leading to discussions of possible
     classifications of predominant user-generated themes that             human-to-human transmission, which was later confirmed by
     emerged as the outbreak accelerated. These study objectives           the Chinese government. Overall, we observed a wide variation
     and some of its general findings are consistent with prior studies,   in user reactions to information, with some users expressing a
     including a 2010 infoveillance study by Chew and Eysenbach            willingness to undertake protective behavior and other users
     [18] assessing the 2009 H1N1 outbreak. In that study, posts           downplaying the risks and engaging in behaviors that could
     from Twitter were collected, thematically assessed, and found         have exacerbated disease spread (eg, leaving Wuhan, attending
     to be significantly correlated with weekly H1N1 incidence             New Year events). Importantly, these observed attitudes and
     during the outbreak, with the absolute increase in H1N1-related       behaviors happened prior to the Chinese government announcing
     tweet volume coinciding with major news events.                       a lockdown of Wuhan and other cities in Hubei on January 23,
                                                                           2020. After the announcement of these quarantine measures,
     Based on our analysis, there appears to be a positive correlation
                                                                           there was a sizable increase in the volume of Wuhan Weibo
     between the number of COVID-19-related Weibo posts from
                                                                           posts we collected.
     Wuhan and the number of cases officially reported in Wuhan
     during the early stages of this outbreak. This effect size was        Though the mixed quantitative and qualitative results of this
     larger than what was observed for the rest of China excluding         study are primarily exploratory, they provide important insight
     Hubei Province (where Wuhan is the capital city) and held when        on the changing knowledge, attitudes, and behaviors of Chinese
     comparing the number of Weibo posts to the incidence                  social media users who were at the epicenter of what is now a
     proportion of cases in Hubei. However, any potential predictive       rapidly expanding pandemic that has impacted all facets of
     value of using social media data as a proxy for real world public     global society. More research is needed to better understand the
     health surveillance statistics needs more rigor and added data        effectiveness of health communication strategies during evolving
     layers to confirm possible associations, particularly in the          outbreaks such as COVID-19, particularly in the context of how
     context of user reactions to news events as discussed in this and     information is understood, shared, and acted upon by users in
     other studies. Despite these limitations, qualitative analysis        the face of uncertainty and changing information. Specifically,
     characterized the early stages of the outbreak as having different    we need to better understand how social media platforms can
     degrees of the Chinese public’s uncertainty regarding the risks       influence the public’s risk perception, their trust and credibility
     posed by COVID-19. As information emerged about the disease,          of different information sources, and, ultimately, how it changes
     users expressed new concerns, which also led to changing              real-world behavior that can have an impact on control measures
     knowledge, attitudes, and behaviors among Chinese social media        enacted to mitigate an outbreak.
     users, some protective and some that may have introduced
                                                                           Early reports indicated that major social media platforms are
     increased risks of disease spread.
                                                                           struggling with the volume of COVID-19 information and
     In response to public uncertainty, the Chinese government issued      user-generated content flooding their platforms, some of which
     a series of announcements on the Weibo “official” accounts in         is helpful and accurate and some of which are rumors and
     an attempt to clarify the characteristics of the disease as they      misinformation [22]. In fact, popular social media platforms
     became known, including an official warning to hospitals on           TikTok, Facebook, and Twitter have all announced measures
     December 30, 2019, about how to report potential cases and a          to better ensure access to credible and accurate information
     subsequent announcement on January 8, 2020, confirming a              about COVID-19, though whether these platforms are up to this
     novel coronavirus as the causative disease agent (Figure 2).          task remains an open question [22].Evaluating whether social
     These events evidence that social media was used as an outbreak       media can act as a positive tool to promote global health
     communication tool by the government, media, and users (who           objectives, particularly in the context of health emergencies,
     reposted content) and led to the dissemination of and user            will be tested by COVID-19, along with its utility as a modern
     reaction to information on an outbreak whose trajectory would         approach to public health surveillance.
     take it global.

     http://publichealth.jmir.org/2020/2/e18700/                                              JMIR Public Health Surveill 2020 | vol. 6 | iss. 2 | e18700 | p. 7
                                                                                                                   (page number not for citation purposes)
XSL• FO
RenderX
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                                Li et al

     Figure 2. Timeline of COVID-19 events, themes detected in Weibo posts, and examples of posts. CCTV: China Central Television; CDC: Centers for
     Disease Control and Prevention; COVID-19: coronavirus disease.

                                                                             simple linear regression showed a positive correlation between
     Limitations                                                             COVID-19 Weibo posts and the number of Wuhan cases in the
     This study has certain limitations. First, our data collection was      study time frame, but we did not control for other potential
     limited to one Chinese social media platform over a prescribed          confounders. Further, it is unclear whether our observed
     period of time. Hence, it is not generalizable to all COVID-19          predictive trend line would continue or if the trend line is
     social media conversations occurring among Chinese users. We            generalizable to other instances of COVID-19 outbreaks in other
     did not examine Chinese private communication apps (eg QQ,              countries or communities. It is more likely that only under
     WeChat) due to the difficulty of collecting data on these               specific disease transmission circumstances would this
     platforms. Future studies should assess a broader scope of              correlation occur, namely a lack of knowledge and reporting of
     conversational data from multiple platforms and use natural             an outbreak in its early stages, a novel virus with high
     language processing and machine learning approaches to help             transmission and sustained community spread, and high social
     classify larger volumes of conversations. Second, our data              media engagement involving outbreak conversations. Further,
     collection started from the reported early stages of the                as previously stated, it is highly likely that the volume of posts
     COVID-19 outbreak. During parts of this time period, the                are associated with user reactions to news events and
     causative agent had not been confirmed and there was no official        government announcements. Finally, due to censorship in China,
     name for the disease. Due to early inconsistency in terminology,        posts may have been deleted before data collection. In fact, a
     Weibo users may have used other keywords to describe                    few messages detected included associated comments that had
     COVID-19 that were not collected in this study. Third, the              been deleted and were not retrievable.

     Authors' Contributions
     JL collected the data; all authors designed the study, conducted the data analyses, wrote the manuscript, and approved the final
     version of the manuscript.

     Conflicts of Interest
     JL, QX, and TKM are employees of the startup company S-3 Research LLC. S-3 Research is a startup funded and currently
     supported by the National Institutes of Health – National Institute on Drug Abuse through a Small Business Innovation and
     Research contract for opioid-related social media research and technology commercialization. Authors report no other conflicts
     of interest associated with this manuscript.

     http://publichealth.jmir.org/2020/2/e18700/                                                 JMIR Public Health Surveill 2020 | vol. 6 | iss. 2 | e18700 | p. 8
                                                                                                                      (page number not for citation purposes)
XSL• FO
RenderX
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                         Li et al

     References
     1.      Centers for Disease Control and Prevention. 2019 Novel Coronavirus (2019-nCoV) Situation Summary URL: https://www.
             cdc.gov/coronavirus/2019-nCoV/summary.html#background [accessed 2020-02-26]
     2.      Wuhan Municipal Health Commission. [Wuhan Municipal Health Commission on the current situation of pneumonia in
             our city] URL: http://wjw.wuhan.gov.cn/front/web/showDetail/2019123108989 [accessed 2020-02-26]
     3.      World Health Organization. 2020 Jan 12. Novel Coronavirus – China URL: https://www.who.int/csr/don/
             12-january-2020-novel-coronavirus-china/en/ [accessed 2020-02-26]
     4.      World Health Organization. 2020 Jan 31. Novel Coronavirus (2019-nCoV) situation report - 11 URL: https://www.who.int/
             docs/default-source/coronaviruse/situation-reports/20200131-sitrep-11-ncov.pdf?sfvrsn=de7c0f7_4 [accessed 2020-02-26]
     5.      CCDC Weekly. 2020. The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19)
             — China, 2020 URL: http://weekly.chinacdc.cn/en/article/id/e53946e2-c6c4-41e9-9a9b-fea8db1a8f51 [accessed 2020-02-26]
     6.      Wang C, Horby PW, Hayden FG, Gao GF. A novel coronavirus outbreak of global health concern. Lancet 2020 Feb
             15;395(10223):470-473. [doi: 10.1016/S0140-6736(20)30185-9] [Medline: 31986257]
     7.      Kelly-Cirino CD, Nkengasong J, Kettler H, Tongio I, Gay-Andrieu F, Escadafal C, et al. Importance of diagnostics in
             epidemic and pandemic preparedness. BMJ Glob Health 2019;4(Suppl 2):e001179 [FREE Full text] [doi:
             10.1136/bmjgh-2018-001179] [Medline: 30815287]
     8.      Eysenbach G. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to
             analyze search, communication and publication behavior on the Internet. J Med Internet Res 2009 Mar 27;11(1):e11. [doi:
             10.2196/jmir.1157] [Medline: 19329408]
     9.      Kim SJ, Marsch LA, Hancock JT, Das AK. Scaling up research on drug abuse and addiction through social media big data.
             J Med Internet Res 2017 Oct 31;19(10):e353. [doi: 10.2196/jmir.6426] [Medline: 29089287]
     10.     Gianfredi V, Bragazzi N, Mahamid M, Bisharat B, Mahroum N, Amital H, et al. Monitoring public interest toward pertussis
             outbreaks: an extensive Google Trends-based analysis. Public Health 2018 Dec;165:9-15. [doi: 10.1016/j.puhe.2018.09.001]
             [Medline: 30342281]
     11.     Bragazzi NL, Bacigaluppi S, Robba C, Siri A, Canepa G, Brigo F. Infodemiological data of West-Nile virus disease in Italy
             in the study period 2004-2015. Data Brief 2016 Dec;9:839-845. [doi: 10.1016/j.dib.2016.10.022] [Medline: 27872881]
     12.     Mamidi R, Miller M, Banerjee T, Romine W, Sheth A. Identifying key topics bearing negative sentiment on Twitter: insights
             concerning the 2015-2016 Zika epidemic. JMIR Public Health Surveill 2019 Jun 04;5(2):e11036. [doi: 10.2196/11036]
             [Medline: 31165711]
     13.     Majumder MS, Santillana M, Mekaru SR, McGinnis DP, Khan K, Brownstein JS. Utilizing nontraditional data sources for
             near real-time estimation of transmission dynamics during the 2015-2016 Colombian Zika virus disease outbreak. JMIR
             Public Health Surveill 2016 Jun 01;2(1):e30. [doi: 10.2196/publichealth.5814] [Medline: 27251981]
     14.     Santillana M, Nguyen AT, Dredze M, Paul MJ, Nsoesie EO, Brownstein JS. Combining search, social media, and traditional
             data sources to improve influenza surveillance. PLoS Comput Biol 2015 Oct;11(10):e1004513. [doi:
             10.1371/journal.pcbi.1004513] [Medline: 26513245]
     15.     Signorini A, Segre AM, Polgreen PM. The use of Twitter to track levels of disease activity and public concern in the U.S.
             during the influenza A H1N1 pandemic. PLoS One 2011 May 04;6(5):e19467. [doi: 10.1371/journal.pone.0019467]
             [Medline: 21573238]
     16.     Nsoesie EO, Brownstein JS. Computational approaches to influenza surveillance: beyond timeliness. Cell Host Microbe
             2015 Mar 11;17(3):275-278. [doi: 10.1016/j.chom.2015.02.004] [Medline: 25766284]
     17.     World Health Organization. Naming the coronavirus disease (COVID-19) and the virus that causes it URL: https://www.
             who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/
             naming-the-coronavirus-disease-(covid-2019)-and-the-virus-that-causes-it [accessed 2020-02-26]
     18.     Chew C, Eysenbach G. Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak. PLoS
             One 2010 Nov 29;5(11):e14118. [doi: 10.1371/journal.pone.0014118] [Medline: 21124761]
     19.     Xinhua. 2019 Aug 20. Weibo reports robust Q2 user growth URL: http://www.xinhuanet.com/english/2019-08/20/
             c_138323288.htm [accessed 2020-04-13]
     20.     Muniz-Rodriguez K, Chowell G, Cheung C, Jia D, Lai P, Lee Y, et al. Epidemic doubling time of the COVID-19 epidemic
             by Chinese province. medRxiv 2020 Feb 28. [doi: 10.1101/2020.02.05.20020750]
     21.     Wang S, Paul M, Dredze M. Exploring health topics in Chinese social media: an analysis of Sina Weibo. 2014 Presented
             at: Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence; 2014 June 18; Québec City, Québec.
     22.     Convertino J. ABC News. 2020 Mar 05. Social media companies partnering with health authorities to combat misinformation
             on coronavirus URL: https://abcnews.go.com/Technology/
             social-media-companies-partnering-health-authorities-combat-misinformation/story?id=69389222 [accessed 2020-03-10]

     Abbreviations
               COVID-19: coronavirus disease
               MERS: Middle East respiratory syndrome

     http://publichealth.jmir.org/2020/2/e18700/                                          JMIR Public Health Surveill 2020 | vol. 6 | iss. 2 | e18700 | p. 9
                                                                                                               (page number not for citation purposes)
XSL• FO
RenderX
JMIR PUBLIC HEALTH AND SURVEILLANCE                                                                                                                  Li et al

               PHEIC: Public Health Emergency of International Concern
               SARS: severe acute respiratory syndrome
               WHO: World Health Organization

               Edited by T Sanchez, G Eysenbach; submitted 12.03.20; peer-reviewed by A Daughton, X Huang; comments to author 07.04.20;
               revised version received 14.04.20; accepted 14.04.20; published 21.04.20
               Please cite as:
               Li J, Xu Q, Cuomo R, Purushothaman V, Mackey T
               Data Mining and Content Analysis of the Chinese Social Media Platform Weibo During the Early COVID-19 Outbreak: Retrospective
               Observational Infoveillance Study
               JMIR Public Health Surveill 2020;6(2):e18700
               URL: http://publichealth.jmir.org/2020/2/e18700/
               doi: 10.2196/18700
               PMID:

     ©Jiawei Ken Li, Qing Xu, Raphael Cuomo, Vidya Purushothaman, Tim Mackey. Originally published in JMIR Public Health
     and Surveillance (http://publichealth.jmir.org), 21.04.2020. This is an open-access article distributed under the terms of the
     Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution,
     and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly
     cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this
     copyright and license information must be included.

     http://publichealth.jmir.org/2020/2/e18700/                                                  JMIR Public Health Surveill 2020 | vol. 6 | iss. 2 | e18700 | p. 10
                                                                                                                        (page number not for citation purposes)
XSL• FO
RenderX
You can also read