COVID-19 Epidemic Analysis in India with Multi-Source State-Level Datasets

Page created by Brett Sharp
 
CONTINUE READING
COVID-19 Epidemic Analysis in India with Multi-Source State-Level Datasets
Hindawi
BioMed Research International
Volume 2022, Article ID 2601149, 12 pages
https://doi.org/10.1155/2022/2601149

Research Article
COVID-19 Epidemic Analysis in India with Multi-Source State-
Level Datasets

 Qirui Wang
 Fok Ying Tung Graduate School, The Hong Kong University of Science and Technology, Hong Kong 999077, China

 Correspondence should be addressed to Qirui Wang; qwangcw@connect.ust.hk

 Received 18 January 2022; Revised 8 March 2022; Accepted 4 April 2022; Published 25 April 2022

 Academic Editor: Yuvaraja Teekaraman

 Copyright © 2022 Qirui Wang. This is an open access article distributed under the Creative Commons Attribution License, which
 permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

 The COVID-19 pandemic has been a global crisis affecting billions of people and causing countless economic losses. Different
 approaches have been proposed for combating this crisis, including both medical measures and technical innovations, e.g.,
 artificial intelligence technologies to diagnose and predict COVID-19 cases. While there is much attention being paid to the
 USA and China, little research attention has been drawn to less developed countries, e.g., India. In this study, I conduct an
 analysis of the COVID-19 epidemic in India, with datasets collected from different sources. Several machine learning models
 have been built to predict the COVID-19 spread, with different combinations of input features, in which the Transformer is
 proven as the most precise one. I also find that the Facebook mobility dataset is the most useful for predicting the number of
 confirmed cases. However, I find that the datasets from different sources are not very effective when predicting the number of
 deaths caused by the COVID-19 infection.

1. Introduction Indian epidemic datasets. Some related institute in India does
 not public the historical data but daily data, and it will be
Reoccurring outbreaks of the COVID-19 epidemic in the time-costing to collect data by researchers. The missing
world remind us that the coronavirus is becoming more values in the collection of the epidemic datasets also disturb
dreadful. Especially the various variants, which are threaten- the researches. Also, the short of attention of the Indian epi-
ing the current protection systems and affecting the reliabil- demic is also a problem.
ity of vaccination. Among various variants, Delta and Delta- In this research, I mainly did the following contribu-
plus, which were first discovered and initially spread in tions: (I) Collecting an Indian with the statewise COVID-
India, showed breathtaking infectivity. On May 5, over 400 19 dataset covering from October 1, 2020, to July 15, 2021,
thousand people have been confirmed in a day in India. with medical statistics, population mobility, and census data.
 Compared with other countries, the consistency of coro- Considering the complex situation of Indian society and
navirus in India is purer, and the related studies have a higher geography, a statewise dataset will be beneficial for evaluat-
value as the reference for Indian and other governments pre- ing the importance of features and show the influences at
venting the Delta and Delta-plus variant. This is a merit the geography level. The dimensions of my dataset are much
chance to disclose more information about the new variants, more than previous, and the dataset will be public on
including whether the current government’s controlling GitHub (the website of data: http://github.com/vividricky/
methods still work, could the mobility data still be helpful IndiaCovid19StateDataset) and convenient for the following
for predicting the trend of epidemic, or determine the effect researchers. (II) Six different models have been implemented
of Indian vaccination for the new situation. However, there and compared in the research, including traditional statistic
is little research on the Indian epidemic and rarely adopts models, logistic regression, and multiple linear regression,
complex and state-of-art machine learning models. One of the data-driven time-series machine learning models: LSTM,
the reasons for the situation is insufficient public or complete Transformer, DeepAR, and TCN for predicting the trend of
COVID-19 Epidemic Analysis in India with Multi-Source State-Level Datasets
2 BioMed Research International

COVID-19 in India. Specified notes that up to the research in related studies. Ghosh and Malavika’s teams successively
finished, few studies have tried the last three models in the used the logistics regression model to predict the trend of
Indian epidemic. This work fills the gap of state-of-art COVID-19 [4, 5]. Ghosh et al. separately developed a logistic
time-series machine learning in this field. According to the regression model and an exponential regression model [4].
experiment, the Transformer model showed the best perfor- They used the exponential regression model as the upper
mance during six models, and mobility data contributes bound and a linear combination model of the above two
most in predicting the trend of the epidemic. models with DIR value as the lower bound to predict the
 Based on my observations, I demonstrate that the collec- range of confirmed numbers in India. Foy et al. focused on
tion of multisource datasets is valuable for the prediction and the priority group of vaccination. They considered the age
further control of the COVID-19 epidemic, although its cost structure and implemented the SEIR model to simulate [6].
is a new burden for the government. The second corollary is As the result, the elder, who are the most vulnerable to
that human mobility does contribute significantly to the COVID-19, should be inoculated immediately. Malavika
spread of viruses, including COVID-19 and other influenza et al. used a modified logistic growth model, which involved
and diseases. Maintaining social distance and lockdown pol- the population of the region to predict the number of con-
icies are necessary even with the potential economic losses. firmed, achieving a good performance. Shrivastav and Jha
 The rest of this paper is organized as follows. Related discussed the impact of temperature and humidity on the
work is discussed in Section 2. Multi-source datasets are transmission of COVID-19 [7]. They constructed a gradient
described in Section 3. The models used in this study are dis- boosting model (GBM) with maximum and minimum tem-
cussed in Section 4. The experimental results are presented perature and humidity data of Indian states to forecast. In
in Section 5. A short conclusion is given in Section 6. addition, they also provided ANOVA analysis of atmo-
 spheric data and COVID-19 data. Chandra et al. imple-
2. Related Work mented the LSTM models to predict the trend based on
 the time-series data [8]. According to experiments, LSTM
Since the first break out of the COVID-19 epidemic in India, had a better performance than ED-LSTM and BD-LSTM
many researchers have contributed to identifying factors with limited data, and the authors also believed that the
that may influence the spread of COVID-19 and construct- model can be improved by introducing more data. More-
ing models to predict the number of confirmations, deaths, over, their work also shows the possibility for the long-
and recoveries in India. These models cover medical, math- term prediction of Indian epidemic situation.
ematical, and machine learning types. Also, some studies for According to related studies, on the one hand, the pre-
other countries provide a good reference for studying the sit- diction of the spread of the COVID-19 is a complex task
uation in India. involving multiple aspects of society. Based on this under-
 Mele and Magazzino investigated the relationship standing, introducing more features in the model may con-
between economic growth, air pollution, and transmission tribute to the accuracy and stability of the model [1–3, 6, 7].
of COVID-19 in India [1]. They first used stationarity and In my study, more medical, population density, and popula-
Toda-Yamamoto causality to prove that PM2.5, CO2, and tion mobility data were collected for the model, which are
NO2 increase with the development of cities. Then, a D2C more direct to the spread of the virus. These characteristics
algorithm is adopted to verify a casual line between PM2.5 were introduced for the first time in the COVID-19 out-
and the number of deaths of COVID-19. Roy et. al did the break in India before the study was completed. On the other
disease risk analysis of Indian states [2]. They forecast the hand, the previous researches have proved the feasibility of
risk of COVID-19 using Autoregressive Integrated Moving their selecting models but have their own limitations. Tradi-
Average (ARIMA). They introduce the GIS data of India tional mathematical models are interpretable, but lack the
into the model, including the population density and ability to effectively predict epidemic trends [1–3]. The
regional status. ARIMA captures the pattern in two parts, time-series deep learning models describe the spread of
in which AR calculates based on the past values and MA COVID-19 [8] better than regression models [3–5]. How-
computes the difference of current and previous knowledge. ever, there are few studies addressing related research. In
However, ARIMA only considered the time-dependent data this study, a more advanced time-series deep learning model
and cannot understand the breaking events, such as commu- will be implemented to fill the gap of these studies for the
nity infections. Kumari et al. built a multiple linear regres- Indian epidemic. Moreover, previous researchers have col-
sion model of social policies to predict the spread, lected data only for some cities or the whole country. It
recovery, and deaths in India [3]. Multiple linear regression did not take into account the complexities of Indian society,
is an explained method that can tell the importance of fac- and with the advancement, my study does experiments at
tors by showing weights. Moreover, autoregression and the national level.
autocorrelation have been imported to the model to increase
the accuracy of the model and finally generate a good perfor- 3. Data Collection and Processing
mance. As shown in the research of [3], the lockdown and
social distance policies contribute the most to slow the 3.1. Raw Data Sources. This study analyzed and constructed
spread of the virus. models using multiple sources, including medical statistics,
 Besides the traditional mathematic models, medicinal population mobility, and census data under COVID-19.
models and machine learning models are also widely used Considering the reliability of the data sources, the datasets
COVID-19 Epidemic Analysis in India with Multi-Source State-Level Datasets
BioMed Research International 3

 Figure 1: The COVID-19 India dataset read in CSV format.

selected for this study meet at least one requirement: (1) data dardized the storage format as CSV. For the MoHFW Vacci-
collected and published by official agencies (MoHFW Vac- nation data, I imported the Tabula package to convert the
cines and Indian Census 2011) and (2) research datasets that data from PDF to CSV format. Also, I added a step to verify
are widely cited in academic papers (COVID-19 India, Goo- the correction of converting by checking the equality of the
gle mobility, and Facebook mobility). sum of the first dose number and second dose number to
 The medical statistics consist of two datasets. As shown the total dose number. (ii) Second is the standardized the
in Figure 1, the COVID-19 India dataset records the num- state names through different data sources. (iii) Third is
ber of confirmations, recoveries, and deaths per state per the standardized the district names. Special notes that some
day, with each state abbreviated, e.g., ap a for Andhra Pra- districts in India do not have a sole office spell. My dataset
desh. As shown in Figure 2, the Ministry of Health and has standardized the district names by referring to the office
Family Welfare vaccination data is derived from the daily Indian directory website (http://www.goidirectory.gov.in/).
data published by the Ministry of Health and Family Wel- (iv) Fourth is by adding the new districts created after the
fare (MoHFW), Government of India. It contains the cumu- 2011 to Indian census 2011 dataset, the population numbers
lative number of vaccinated beneficiaries, including first, were estimated based on original regions. The whole work-
second, and total doses. Special note, data became available flow of preprocessing can refer in Figure 3.
on February 24. In the state-level Facebook mobility, since the target data-
 Population mobility data includes Google mobility and set is focused on the state level but Facebook mobility records
Facebook mobility data. Google community mobility collects the data at the district level, we need to aggregate these data for
trends in population mobility in parks, workplaces, bus stops, further handling. In this case, the district contributes mobility
retail, entertainment, grocery stores, pharmacies, and resi- values based on the ratio of the district’s population to the
dences. Meanwhile, the Facebook Data for Good project quan- state’s population. The state population is calculated based
tifies mobility trends and residency values in the area. on the grouping of districts in the India 2011 Census dataset.
 The Indian census 2011 dataset (Data source website: This is shown in the following equation:
http://www.census2011.co.in/district.php) contains the related
information of India census 2011 which will be used to calcu- S = 〠 αi Mi , ð1Þ
late the immune rate and state-level mobility in 3.2 part. i=1
 The range of dates is mainly covering from October 1,
2020, to July 15, 2021, and the time period is day. All data where S represents the State’s mobility and M i and αi sepa-
are collected from available public sources. rately represent the ith district’s mobility and ratio population
 The raw data sources are shown in Table 1. of the state.
 Please note that the website of MoHFW does not record In handing the missing value, the dataset contains two
the historical data. The daily data need to be crawled by the types of missing values. (1) Part of the features cannot cover
user. The crawled data in this study will also be shared in the the whole-time range which is vaccination data in this case.
GitHub later. Considering the limited ratio of beneficiaries vaccinated in
 The raw dataset description is given in Table 2. the starting. I used 0 to fill these missing values. (2) Some
 missing values are generated by statistical work and data col-
3.2. Data Preprocessing. Standardization and merging of lection. For the gaps which are small, the data would be filled
datasets is challenging due to the diverse data sources and with the previous time step data to fulfill the missing value.
complex social conditions in India. Finally, the following For the gaps which are obvious, the current districts or
issues were addressed in the pre-processing process: (i) stan- states’ data will be dropped.
COVID-19 Epidemic Analysis in India with Multi-Source State-Level Datasets
4 BioMed Research International

 Cumulative coverage report of COVID-19 vaccination (As on 01 Jul’21 at 7:00 AM)

 Beneficiaries vaccinated
 1st Dose 2nd Dose T ota l Doses
 27, 60, 99, 880 5, 96, 16, 139 33, 57, 16, 019
 India
 (20,04,587 in last 24 Hours) (7,55,758 in last 24 Hours) (27,60,3 45 in last 24 Hours)

 Beneficiaries vaccinated
 S. No. State/UT
 1st Dose 2nd Dose Total Doses
 1 A & N Islands 1 ,5 6 ,4 7 0 1 9 ,8 2 3 1,76,293
 2 Andhra Pradesh 1 ,2 5 ,8 8 ,3 6 9 31,16,201 1 ,57 ,04 ,5 7 0
 3 Arunachal Pradesh 5 ,0 9 ,2 7 5 8 4 ,5 3 0 5,93,805
 4 Assam 5 7 ,9 2 ,1 1 4 12,30,867 70,22,981
 5 Bihar 1 ,3 6 ,8 4 ,1 1 2 22,15,545 1 ,58 ,99 ,65 7
 6 Chandigarh 4 ,3 9 ,3 0 1 8 7 ,5 7 6 5,26,877
 7 Chhattisgarh⁎ 7 9 ,2 6 ,4 2 0 16,00,677 95,27,097
 8 Dadra & Nagar Haveli 1 ,8 5 ,8 1 3 1 9 ,2 2 8 2,05,041
 9 Daman & Diu 2 ,0 1 ,2 6 9 2 1 ,0 0 6 2,22,275
 10 Delhi 6 0 ,8 8 ,8 8 1 18,04,655 78,93,536
 11 Goa 8 ,1 7 ,6 1 4 1 ,1 6 ,5 6 6 9 ,3 4 ,1 8 0
 12 Gujarat 2 ,0 0 ,7 0 ,9 8 6 56,21,236 2 ,56 ,92 ,22 2
 13 Haryana 7 4 ,3 3 ,7 5 9 14,23,139 88,56,898
 14 Himachal Pradesh 3 3 ,5 1 ,5 0 1 5 ,1 4 ,7 8 6 38,66,287
 15 Jammu & Kashmir 3 8 ,4 5 ,2 9 2 6 ,8 0 ,8 7 0 45,26,162
 16 Jharkhand 5 7 ,9 0 ,2 4 7 10,64,300 68,54,547
 17 Karnataka 1 ,8 9 ,7 3 ,2 6 6 37,39,413 2 ,2 7 ,1 2 ,6 7 9
 18 Kerala 1 ,0 8 ,4 2 ,6 0 1 32,54,223 1 ,40 ,96 ,82 4
 19 Ladakh 1 ,7 0 ,9 0 2 5 6 ,0 3 4 2,26,936
 20 Lakshadweep 4 6 ,7 7 9 7 ,9 2 1 54,700
 21 Madhya Pradesh 1 ,7 9 ,1 7 ,8 0 4 23,96,391 2 ,03 ,14 ,19 5
 22 Maharashtra 2 ,6 0 ,6 7 ,9 1 9 63,71,846 3 ,24 ,39 ,76 5
 23 Manipur 5 ,8 1 ,2 6 1 7 4 ,1 4 1 6,55,402
 24 Meghalaya 6 ,1 9 ,7 4 5 7 9 ,4 8 1 6,9 9,2 2 6
 25 Mizoram 5 ,1 8 ,6 8 6 5 4 ,3 6 9 5,73,055
 26 Nagaland 4 ,4 0 ,3 4 4 5 9 ,8 7 4 5,00,218
 27 Odisha 9 8 ,6 8 ,2 7 9 21,48,249 1 ,2 0 ,1 6 ,5 2 8
 28 Puducherry 4 ,4 0 ,3 2 6 6 7 ,0 6 8 5,07,394
 29 Punjab* 6 1 ,1 2 ,3 6 2 10,02,728 71,15,090
 30 Rajasthan 2 ,0 7 ,5 1 ,1 6 1 39,80,702 2 ,47 ,31 ,86 3
 31 Sikkim 4 ,0 1 ,5 0 6 7 2 ,3 0 4 4,73,810
 32 Tamil Nadu 1 ,3 0 ,7 3 ,4 4 9 25,69,324 1 ,5 6 ,4 2 ,7 7 3
 33 Telangana 9 4 ,9 3 ,7 7 0 15,77,710 1 ,1 0 ,7 1 ,4 8 0
 34 Tripura 1 9 ,5 5 ,9 3 1 5 ,9 4 ,4 2 0 25,50,351
 35 Uttar Pradesh 2 ,6 7 ,9 2 ,8 3 0 44,88,619 3 ,1 2 ,8 1 ,4 4 9
 36 Uttarakhand 3 5 ,7 9 ,8 4 0 8 ,2 1 ,9 7 4 44,01,814
 37 West Bengal 1 ,6 8 ,3 3 ,4 1 5 50,56,919 2 ,18 ,90 ,33 4
 38 Miscellaneous 1 7 ,3 6 ,2 8 1 1 5,2 1,4 24 32,57,705

 Figure 2: The sample of MoHFW Vaccination data.

 Table 1: Raw data sources.
 3.3. Data Description. The final dataset holds data for each of
Raw data sources the 33 states or union territories, covering the dates from
COVID-19 Indiaa October 1, 2020, to June 30, 2021. The four union territories
MoHFW Vaccinationb of Andaman and Nicobar Islands, Dardala and Nagar
Indian census 2011 Haveli, and Ladakh and Lakshadweep have huge gaps in
 data collection and will not be used for model construction.
Google mobility [9]
 Although Telangana has been separated from Andhra Pra-
Facebook mobility [10] desh, these two states will be recorded together in order to
a
 Data source website: http://www.covid19india.org/. bMinistry of Health minimize errors in manual division of data.
and Family Welfare: https://www.mohfw.gov.in/. Each state data has 273 rows with 14 columns. The
 detailed information of features is explained in Table 3.
COVID-19 Epidemic Analysis in India with Multi-Source State-Level Datasets
BioMed Research International 5

 Table 2: Raw dataset description.

Raw data sources
 Level State
COVID-19 India
 Storage format Json/csv
 Level State
MoHFW Vaccination
 Storage format PDF
 Level District
 Storage format Website
 Variable Description
 District The name of district
Indian census 2011 State The state which the district belongs to
 Population The population number of the district
 Growth The growth of population number since last census
 Sex ration The ratio of number of male and female
 Literacy The percentage of literacy in the state
 Level State
 Storage format CSV
 Variable Description
 Grocery and pharmacy Mobility trends for places like grocery markets
Google mobility Parks Mobility trends for parks
 Transit stations Mobility trends for public transport hubs
 Retail and recreation Mobility trends for retail and recreation
 Residential Mobility trends for places of residence
 Workplaces Mobility trends for places of work
 Level District
 Storage format CSV
Facebook mobility Variable Description
 all_day_bing_tiles_visited_relative_change Positive or negative change in movement relative to baseline
 all_day_ratio_single_tile_users Positive proportion of users staying put within a single location

3.4. Shortage of Dataset. Except four union territories data 4.1. Logistic Regression. The logistic regression model [14] is
has not been recorded, the population of some districts cre- popular in relevant studies, as it is capable of capturing the
ated after 2011 may also been calculated repeated. These effect of the government’s preventive measures [4] and it is
new districts’ population are estimated based on the original believed that logistics regression follows the trend of Coro-
region’s population that are statistic in 2011, and these navirus outbreak [5].
regions may belong to different districts before they are The logistic regression model is formulated as follows:
independent. It may lead to double counting in the sum of
population. Meanwhile, the Telangana also covers some
 1
region past belong to other state but not only Andhra Pra- y= , ð2Þ
desh. However, the sum of these potentially confounding 1 + e−wT x
population is smaller than the three percent of the country
population, which can generate limited effect to the model. where y is the prediction number of current state’s confirmed
 cases, x is the current input, and w is the model parameter.
4. Methodology
 4.2. Multiple Linear Regression. Multiple linear regression
The research of [8] showed the feasibility of time-series deep [14] is a basic and popular statistic model. It only requires
learning models in predicting the Indian epidemic. In this minimum computing resources to construct the model and
section, I will present different models used to predict the can be explained by comparing the weights of parameters.
number of confirmed cases and deaths of COVID-19. These The multiple linear regression model can be stated as
models were chosen because they have been shown to be follows:
effective in a range of problems in different domains
[11–13]. I wanted to validate their performance in the new
crown outbreak prediction task. y = b0 + b1 x1 + b2 x2 ⋯ bn xn , ð3Þ
6 BioMed Research International

 1.Standardize
 districts names
 Table 1 Table 1 Table 4
 with Table 5
 facebook
 covid 19 India Standardize covid 19 India 2.Encoding the
 1.Handling the data
 state names Handling the missing cities with the
 with Table 2 missing values⁎ values⁎ same name
 Table 6
 2. Calculating
 state-level
 mobility changing
 Calculating
 Table 2 PDF CSV Table 2 the immune rate 1.Standardize
 vaccination Tabula vaccination districts names Table 5
 with Table 4 indian
 census
 2.Adding new
 Verify the correction 2011
 districts
 of transforming created
 after 2011

 Table 3
 google
 data

 Figure 3: Procedure of data preprocessing.

 Table 3: Feature explained.

Variable Description
dateymd The date of tuple of data
confirmed The number of daily confirmed
recovered The number of daily recovered
deaths The number of daily deaths
Dose1 The cumulative ratio of population injected first vaccination
Dose2 The cumulative ratio of population injected second vaccination
Grocery & pharmacy Mobility trends for places like grocery markets
Parks Mobility trends for parks
Transit stations Mobility trends for public transport hubs
Retail & recreation Mobility trends for retail and recreation
Residential Mobility trends for places of residence
Workplaces Mobility trends for places of work
visit Positive or negative change in movement relative to baseline
staying Positive proportion of users staying put within a single location

where bi is the weight of current input item xi and is esti- f t = σðWf xt + Uf ht−1 + bf Þ, ð4Þ
mated by the least squares method.

 where xt denotes the input features and ht denotes the
4.3. Long Short Term Memory (LSTM). Compared with both hidden states. W ∈ Rh×d and U ∈ Rh×h are weight parameter
traditional time-series models and standard recurrent neural matrices, and b ∈ Rh is the basis parameter vector.
networks (RNNs), long short term memory (LSTM) [15] The input gate decides which values will be updated in
shows a better performance in handling the long-term depen- the cell states and be denoted as follows:
dency relationship by using the LSTM cell shown in Figure 4.
 Figure 4 shows the structure of a LSTM cell, which
would be noted as L in Figure 5. Three red dotted bordered It = σðWi xt + Ui ht−1 + bi Þ,
boxes represent three different gates in LSTM, namely, for-
get gate, input gate, and output gate from left to right. The  t = tan hðW C xt + U C ht−1 + bC Þ,
 C ð5Þ
forget gate controls the keeping or throwing of input infor-
  t,
 Ct = f t ○Ct−1 + It ○C
mation from the cell state as follows:
BioMed Research International 7

 Ct–1 X + Ct

 Ct
 X Tan h

 It Ot
 ft
 Tan h X Ht

 Ht–1

 Xt

 Figure 4: The structure of a typical LSTM cell.

 X (t + 1) X (t + 2) X (t + 3) X (t + m)
 Output layer

 h0 h1 h2 hk–1

 L L L L

 Input layer X (t) X (t–1) X (t–2) X (t–k + 1)

 Figure 5: (LSTM) neural networks; L represents the cell of Figure 4.

where C t restores the candidate values which may be added matrix vectors Q, K, and V, representing query, key, and
into cell states and Ct represents the internal memory of a value, respectively. First, these matrices are used for scaled
LSTM unit. dot product attention with the following equation:
 The output gate controls the output of the cell states as
follows: !
 QKT
 AttentionðQ, K, VÞ = softmax pffiffiffiffiffi V, ð7Þ
 dk
 Ot = σðWo xt + Uo ht−1 + bo Þ,
 ð6Þ
 ht = Ot ∗ tan hðC t Þ: where d k denotes the number of dimensions of Q and K.
 Meanwhile, multiple scaled dot-product attention will be
 calculated parallelly in multi-head attention as follows:
4.4. Transformer. The mission of Transformer [16] is to
decide what parts of input should be focused on by introduc-
ing the attention mechanism. Another important feature of MultiHeadðQ, K, VÞ = Concatðhead1 , head2 ⋯ headh ÞWO ,
the Transformer is the encoder-decoder structure. Different ð8Þ
from traditional time-series models, which encode the input
one-time step at a time, the Transformer encodes all the
input simultaneously. where head i = AttentionðQW Qi , KW Ki , VW Vi Þ, h is the num-
 The architecture of the Transformer is shown in ber of parallel scaled dot-product attention, and W O ∈
Figure 6. The left part represents the encoder block, and Rhdv ×dmodel is the parameter matrix.
the right part represents the decoder block. After the embed- Mask multi-head attention is similar to the multi-head
ding operation is completed, the inputs are fed to the self- attention but only keeps the current and previous for every
attentive layer. In the self-attentive case, each input has three input row.
8 BioMed Research International

 Output probability

 Softmax

 Linear

 Add & Norm

 Feed
 forward

 Add & Norm
 Add & Norm
 Feed
 forward Multi-head
 attention

 ×N
 ×N Add & Norm Add & Norm

 Multi-head Masked
 attention multi-head
 attention

 Positional Positional
 + + encoding
 encoding

 Embedding Embedding

 Input Output
 (shift right)

 Figure 6: The architecture of Transformer.

 Output

 Hidden layer 2

 Hidden layer 1

 Input

 Figure 7: The architecture of TCN.
BioMed Research International 9

 Figure 8: The training process of DeepAR. L is the cell of LSTM.

 Figure 9: The prediction process of DeepAR.

 Table 4: Results of different models.
 The feed forward network is a fully connected neural
network for every position with a ReLU layer, and a linear Model RMSEavg Training time (hours)
layer and can be denoted as follows: LSTM 584.74 2.33
 Transformer 460.08 11.1
 Logistic regression 971.89 0.03
 F FN ðxÞ = max ð0, xW 1 + b1 ÞW 2 + b2 , ð9Þ
 Multiple linear regression 1885.12 0.01
 TCN 1605.52 1.12
 The add and norm layer will normalize the result of the DeepAR 927.79 4.36
last layer’s output and construct a residual block with the last
layer.
 The convolution operation in TCN is based on the fol-
 lowing formula:
4.5. Temporal Convolutional Network (TCN). Temporal con-
volutional network (TCN) [17] implements the convolu- k−1
tional network for time series. TCN has an excellent FðsÞ = 〠 f ðiÞ∙xs−d∗i , ð10Þ
performance in capturing the long-term dependency rela- i=0
tions and local information. It also keeps the advantage of
a convolutional network by extracting features with limited where FðsÞ represents the dilated convolution operation F
parameters. The structure of TCN is shown in Figure 7. on element s, d is the dilation factor, and k is the filter size.
10 BioMed Research International

 Table 5: The explanation of feature selection models.

Model Description
Baseline (Transformer) The transformer model training with all features
 The transformer model training without dose feature group, which are Dose1 and Dose2 as
Transformer_without_dose
 shown in Table 3
 The transformer model training without Google mobility feature group, which are grocery and
Transformer_without_google
 pharmacy parks, transit stations, retail and recreation, residential, and workplaces as shown in Table 3
 The transformer model training without Facebook mobility feature group, which are visit and staying
Transformer_without_facebook
 as shown in Table 3

4.6. Deep Autoregressive Recurrent Neural Networks Table 6: Transformer model with feature selection for predicting
(DeepAR). DeepAR [18] combines recurrent neural net- the number of confirmed cases.
works and the autoregressive model. Instead of a determined
value, it will output a probability distribution of the predic- Transformer model with feature selection (confirm)
tion value. Regrading to the feature of autoregressive regres- Model RMSEavg
sion, DeepAR has a better performance on data with noises. Baseline (transformer) 460.08
 DeepAR can be represented as follows: Transformer_without_dose 426.47
 Transformer_without_google 519.28
 Transformer_without_facebook 1105.17
    YT   
 QΘ = zi,t0 :T zi,1:t0 −1, xi,1:T = QΘ zi,t zi,1:t−1, xi,1:T
 t=t0
 ð11Þ
 Y
 T    Table 7: Transformer model with feature selection for predicting
 = ℓ zi,t θðhi,t , ΘÞ , the number of deaths.
 t=t0
 Transformer_deaths model with feature selection
 Model RMSE
where xi,t is the current input covariates, z i,t−1 denotes the Baseline (transformer) 103.80
target value of last time step, 1 : t 0 − 1 is the conditional Transformer_without_dose 31.22
range, and t 0 : T represents the prediction range. hi,t = Transformer_without_google 45.09
hðhi,t−1,zi,t−1 ,xi,t , ΘÞ is the output of the autoregressive recurrent
 Transformer_without_facebook 26.89
network.
 The training process and prediction process are sepa-
rately shown in Figures 8 and 9. The difference between
the two figures is that the prediction range data is unknown The root mean square error (RMSE) has been used as the
in the prediction. The model cannot receive the real value of evaluation metric, which is calculated as follows:
the last time step but estimates the Z ~ i,t−1 from the sample.
 vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
 u N
 u1
5. Results RMSE = t 〠 ðyi − ̂yi Þ2 , ð12Þ
 N i=1
5.1. Experiment. The experiment was based on two research
hypotheses: (1) The state-of-art time-series deep learning where yi and ̂yi are observed and predicted value and N is
model will have the better performance on predicting the the number of data samples in the test set.
trend of Indian COVID-19 epidemic, and (2) the vaccina- In this research, all the models will predict 15-days con-
tion data and population mobility data will be helpful for firmed number for every state, and the average RMSE value
the accuracy of the model. is used as the main performance measure:
 The dataset was divided into a training dataset, covering
from October 1, 2020, to June 30, 2021, and a testing dataset,
 1 S
covering from July 1 to July 15, 2021. As a preprocessing RMSEavg = 〠 RMSEi , ð13Þ
step, the min-max normalization has been implemented in S i=1
the training process. The models are fit with the training
dataset and then evaluated with the testing dataset. For the where S in the number of states in the dataset and RMSEi is
logistic regression and multiple linear regression, a single the current state’s RMSE value.
model is trained for a single state, with all features as input The machine learning model is implemented with the
and the current state’s number of confirmed cases as the open-source Python library dart. Another open source Python
label. For other models, all the states share a single and com- library called hyperopt was used for hyperparameter optimiza-
mon model. tion with a stochastic search strategy. The optimization was
BioMed Research International 11

performed for 100 turns for each model, and the optimal ing the number of confirmed cases, but not as useful in pre-
hyperparameters were selected. dicting the number of deaths.
 Although the focus area of our study is India, our analy-
5.2. Result Table. Table 4 shows the result of the experiment. sis process can be extended to other countries and regions as
The training time in hours is also recorded and listed in long as the original dataset can be collected and used.
Table 4. Another research direction is to find out about COVID-19
 We can see that the Transformer, as the state-of-art algo- between different countries, which may interact with each
rithm, performed the best among six models. However, the other.
cost is that the Transformer also took the most time in train-
ing. LSTM is less accurate than the Transformer but costs Data Availability
much less time. The traditional statistic methods are not
effective in predicting the trend of COVID-19. The datasets used and/or analyzed during the current study
 are available from the corresponding author on reasonable
5.3. Feature Selection. Recall that three feature groups have
 request.
been used in this research, which are Dose, Google mobility,
and Facebook mobility. Considering the contribution of
three feature groups to the prediction result, three Trans- Conflicts of Interest
former models are trained with two feature groups for pre- The authors declare that they have no conflicts of interest.
dicting the number of confirmed and deaths daily number
of COVID-19. More specific explanation is given in Table 5.
 The results have been shown in Table 6 and Table 7.
 References
 As shown in Table 6, mobility data are helpful in pre- [1] M. Mele and C. Magazzino, “Pollution, economic growth, and
dicting, and Facebook mobility dataset contributes the most COVID-19 deaths in India: a machine learning evidence,”
to the model. It is unexpected that the vaccination rate did Environmental Science and Pollution Research, vol. 28, no. 3,
not have a positive effect on the prediction. It may cause pp. 2669–2677, 2020.
by the limited immune rate in India; up to July 15, about [2] S. Roy, G. S. Bhunia, and P. K. Shit, “Spatial prediction of
23% of the population has been vaccinated one dose, and COVID-19 epidemic using ARIMA techniques in India,”
only 5.8% of the whole population are fully vaccinated. Modeling Earth Systems and Environment, vol. 16, pp. 1–7,
Another potential reason is that the Indian vaccination does 2021.
not show a stable effect for the new variant virus. [3] R. Kumari, S. Kumar, R. C. Poonia et al., “Analysis and predic-
 According to Table 7, we can find that the dose and tions of spread, recovery, and death caused by COVID-19 in
mobility features bring limited effect in forecasting the num- India,” Big Data Mining and Analytics, vol. 4, no. 2, pp. 65–
ber of daily deaths of COVID-19. Considering the lagging of 75, 2021.
vaccination effect, the dead may not be vaccinated or fully [4] P. Ghosh, R. Ghosh, and B. Chakraborty, “COVID-19 in India:
vaccinated, and the effect may need to be observed in the statewise analysis and prediction,” JMIR Public Health Surveill,
 vol. 6, no. 3, article e20341, 2020.
longer term.
 [5] B. Malavika, S. Marimuthu, M. Joy, A. Nadaraj, E. S. Asirva-
5.4. Discussion. According to the experiment, there are two tham, and L. Jeyaseelan, “Forecasting COVID-19 epidemic in
research hypotheses: (1) The Transformer as the state-of- India and high incidence states using SIR and logistic growth
art time-series deep learning model has a significant better models,” Clinical Epidemiology and Global Health, vol. 9,
 pp. 26–33, 2021.
performance than other regression and deep learning
models and (2) the population mobility data are helpful for [6] B. H. Foy, B. Wahl, K. Mehta, A. Shet, G. I. Menon, and
 C. Britto, “Comparing COVID-19 vaccine allocation strategies
the prediction of spread of epidemic. However, the vaccina-
 in India: a mathematical modelling study,” International Jour-
tion data did not have positive effect in predicting. nal of Infectious Diseases, vol. 103, pp. 431–438, 2021.
 For the transmission of novel coronavirus, the effect of
 [7] L. K. Shrivastav and S. K. Jha, “A gradient boosting machine
vaccination is still under the question. Considering the learning approach in modeling the impact of temperature
potential variation of the virus, the government should not and humidity on the transmission rate of COVID-19 in India,”
overlay rely on the vaccination. Meanwhile, the population Applied Intelligence, vol. 51, no. 5, pp. 2727–2739, 2021.
mobility is an important factor for controlling the spread [8] R. Chandra, A. Jain, and D. S. Chauhan, “Deep learning via
of epidemic. LSTM models for COVID-19 infection forecasting in India,”
 2022, https://arxiv.org/abs/2101.11881.
6. Conclusion [9] L. L. C. Google, “Google COVID-19 community mobility
 reports,” July 2021, https://www.google.com/covid19/
In this study, the COVID-19 epidemic in India was analyzed mobility/.
using datasets collected from different sources. Among the [10] “Facebook. Facebook Data for Good: Movement Range Map.
various machine learning models for predicting COVID-19 Facebook Data for Good,” July 2021, https://dataforgood.fb
transmission, the Transformer proved to have the best per- .com/.
formance. The Transformer model may be a new baseline [11] W. Jiang, “Applications of deep learning in stock market pre-
for future researchers. The Facebook mobile dataset was diction: recent progress,” Expert Systems with Applications,
found to be most useful along with other datasets in predict- vol. 184, p. 115537, 2021.
12 BioMed Research International

[12] W. Jiang and L. Zhang, “Geospatial data to images: a deep-
 learning framework for traffic forecasting,” Tsinghua Science
 and Technology, vol. 24, no. 1, pp. 52–64, 2019.
[13] W. Jiang, “Internet traffic prediction with deep neural net-
 works,” Internet Technology Letters., vol. 5, no. 2, article
 e314, 2022.
[14] Y. Anzai, Pattern Recognition and Machine Learning, Elsevier,
 2012.
[15] S. Hochreiter and J. Schmidhuber, “Long short-term mem-
 ory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you
 need,” Advances in Neural Information Processing Systems,
 vol. 30, pp. 5998–6008, 2017.
[17] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager,
 “Temporal convolutional networks for action segmentation
 and detection,” in IEEE Conference on Computer Vision and
 Pattern Recognition, pp. 156–165, Hawaii Convention Center,
 2017.
[18] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski,
 “DeepAR: probabilistic forecasting with autoregressive recur-
 rent networks,” International Journal of Forecasting, vol. 36,
 no. 3, pp. 1181–1191, 2020.
You can also read