Fake News Classification and Disaster in case of Pandemic: COVID-19
Tooba Qazi (toobaqazi1@gmail.com)
 FAST - National University of Computer and Emerging Sciences, https://orcid.org/0000-0002-4978-5748
Dr. Rauf Ahmed Shams
 FAST - National University of Computer and Emerging Sciences

Research Article

Keywords: Fake News, network based, COVID-19, Pandemic, Machine Learning

Posted Date: January 24th, 2022

DOI: https://doi.org/10.21203/rs.3.rs-1287367/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract
Presently, one of the most common and prevalent means of consuming and spreading news is social media. Unfortunately, this includes not only news with merit but also a wide array of false news, leaving negative imprints and influences on the world. Especially now, when a new story and new news are born every second, the widespread circulation of false news is a matter of concern. These issues and concerns have made it undeniably essential to limit the spread of incorrect data by analyzing and detecting it. In this kind of tragedy, the existing knowledge base is not adequate to classify the news as false or true, since the topic is unique and no prior information is available. To stop this misguidance on matters of human health, this work aims to examine the imperative aspects of social networking and the fake news data and information it provides. To confront this issue, we build a data set of COVID-19 news collected from social networking sites and detect the visible patterns and characteristics of both fake and real news using social analysis tools, applying different machine learning algorithms in collaboration with deep learning techniques. Wherever a problem involving fake news detection is concerned, deep learning techniques have been applied, as they skip feature engineering and represent features automatically.

1. Introduction
As far as history is concerned, fake news has always been a leading cause of crises in human communication, inducing agitation, conflict, and misconceptions among people. In this digital age, social media platforms have become the primary source for relaying news, as they are easy to access and allow interaction with people holding varying views. The current widespread outbreak of COVID-19 is distressing society, including through the information revolving around it. True facts help diminish the crisis, whereas false facts inflame the issues, adding fuel to the fire. Consuming such information has resulted in skepticism and frenzy in society, which is deadly wherever human health is concerned.

Since the evolution of social media platforms, users have had easy access to create, share, and react to any social media content. With this ease comes a high risk of spreading fake news on social platforms. Such fake news content has a highly threatening effect on individuals and, collectively, on a country's democracy, political campaigns, stock market fluctuations, and massive trades [16]. Fake news can be created and published online faster and more cheaply than through traditional news media such as newspapers and television; it is therefore currently impossible to validate whether every piece of shared social content is fake or real before it spreads across social networks. This has drawn the attention of researchers to addressing the spread of fake news on social media platforms. Because social media platforms are the primary source for consuming news, it has become necessary to analyze the COVID-19-related data circulating on these platforms. The intake of news generated via social media comes with its own advantages and disadvantages. If the social media app Twitter is considered, this platform makes spreading fake news much easier, affecting a massive number of users at a time. Repeatedly, it has been seen that such tactics have the potential to be used for financial motives as well as political means on a large scale, whether for organizational or personal motives [2].

The information shared by users plays an important role and can help bring improvement to this health crisis. Since April 23, 2020, the International Fact-Checking Network (IFCN) [17] has been working closely with 100 different fact-checking organizations and has classified more than 4,000 pieces of false information regarding COVID-19. Trends of spreading misinformation were also observed during other epidemics such as Ebola [18], yellow fever [19], and Zika [20]. The COVID-19 pandemic has caused the world to grieve for half a million dead. Along with the loss of precious lives, the pandemic has brought unforeseen circumstances related to health issues. A new concept combining "epidemic" and "information" has come into existence, meaning that in such a situation both correct and incorrect information are widespread in abundance. On the one hand, correct information can help deal with a crisis at the right level of priority; on the other hand, incorrect information worsens the situation and increases the magnitude of a crisis [4].

The development of automatic detection of false news has progressed in recent years, and the methods presently used can be categorized into two groups, "content-based" and "propagation-based" methods [1]. Among these two categories, content-based methods detect false news by evaluating the news from numerous angles; secondary information, such as who propagates the news, is not considered. When the news content is relied upon heavily, the methodologies can be sensitive to it: by concealing an individual's style of expression, false entries can be introduced that may divert the results of the approach. Because of this issue, the detection of fake/false news using network-based approaches has recently grown in popularity [7]. There are unique challenges in the process of ascertaining and alleviating the spread of fake news. To deal with these unprecedented challenges, various sorts of data along with network features have been used extensively in numerous existing research works. As the spread of information is fleeting on social media platforms, this project aims to provide a detection mechanism that stops the circulation of fake news as early as possible by observing the patterns of fake and real news.

2. Literature Review
Many researchers have been working to defeat the evil that is fake news. The methods vary from study to study, with some using deep learning and others focusing on machine learning models. In addition, there are different aspects of fake news detection that rely on feature extraction. For accurate fake news classification and detection, a theory-driven model for the early identification of false news has been proposed. The proposed approach efficiently analyzes and represents news data at four levels of language to anticipate false news on social networking sites before it becomes disseminated, and the efficiency of the proposed system is compared with other state-of-the-art approaches. Such techniques identify fake news by (a) assessing news content, i.e., content-based detection; (b) exploring how the news is distributed on the social network, i.e., propagation-based detection; or (c) using both news propagation data and news content data [1]. This line of work explores the potential patterns and relations among fake news, recurring misinformation, and deception/disinformation, along with enhancing the interpretability of fake news feature engineering [1]. Content-based fake news detection attempts to spot fake news by evaluating the information in the relevant news posts, articles, titles, headlines, and blogs [1].

Propagation-based methods, which rely on Recurrent Neural Networks (RNNs), have gained popularity, as these models perform well by investigating the associations between news stories, writers, and consumers, i.e., spreaders [1, 10]. A selection of features has been derived from deception/disinformation patterns to reflect news material, which contributes to analyzing the connections between fake news and deception/disinformation [1]. Dealing with fake news from the propagation and spreading perspective relates to the dissemination of fake news, e.g., how it propagates and which users spread it. Recent research on fake news propagation provides several techniques to address this problem, including empirical patterns, mathematical models, and social network models [21]. Another line of recent research addresses fake news detection in the context of social network analysis through cascade-based and network-based approaches. Cascade-based detection follows the propagation paths and news cascade patterns using cascade similarity and cascade representations of the social network [22]. On the other hand, network-based detection constructs a flexible network from cascades in order to indirectly capture fake news propagation; the constructed networks can be homogeneous, heterogeneous, or hierarchical. Fake news is tweeted twice as much as real news, and each follows a different propagation pattern [2]. Gradient boosting, with the best ROC precision score, achieved the maximum accuracy; hybrid and profile-based models were stronger, while the content-based model is best for early detection [2].

Most spreaders of fake news share similar attributes; hence, studying specific user behavior allows one to discriminate fake news from actual news [7]. Network-based models inspect the information revealed in news propagation, while hybrid and propagation models incorporate user and network information [7]. Content-based (linguistic) models can use spatiotemporal information: for the spatial information, the location of the user is noted, and for the temporal information, the timestamps of user engagements are recorded [11]. These facts and figures show how fake news pieces propagate on social media and how hot topics change over time [11]. There are four perspectives in fake news inspection: (a) knowledge-based, focusing on misleading knowledge; (b) style-based, focusing on how the news is posted; (c) spread-based, focusing on how it continues to spread; and (d) credibility-based, examining the legitimacy of its writers and spreaders [3]. Propagation-based studies also focus on user experiences, whereas knowledge-based analysis focuses on multi-dimensional encounters between individuals and items generated from news material [3]. The convergence of machine learning and knowledge engineering could be valuable in identifying false news.

One approach differentiates between fake and genuine news reports using data-driven methods and knowledge engineering. A new experimental algorithmic solution is to be developed to achieve this target, one that can label the content as quickly as possible once the news is released on the internet. Both behavioral and social structures in the collected data have to be embedded so that knowledge and data can be combined to tackle this problem head on. Combining the data-driven (machine learning) and knowledge-based approaches, a classification method divides the news into labels, for example fake news, non-fake news, and unclear news, which is the data-driven part. In the field of text mining, classification of the content is the main issue to consider. Analyzing the text and content of the news and matching this content against prior facts through fact checking is called knowledge learning [6]. The data side includes text classification and stance identification, while the knowledge side involves fact checks that enable us to optimize the results. The task is categorized into three parts, and the findings are merged at the end to verify whether or not the news is incorrect [6]. Feature-based studies concentrate on manually creating a range of observed and/or latent features representing fake news and using these features within a machine learning system to identify or block fake news [3]. To figure out the themes of fake news, its content styles, origins, dissemination, and the motives behind misleading events, researchers observed and checked social networking sites for possible fake news. The following steps are followed: (a) study the misinformation; (b) produce a list of false information; (c) compile related news data; and (d) transcribe the information collected. Fake news on social sites falls into four major categories: (a) text, (b) image, (c) recording, and (d) clip; it can also appear in more than one medium at a time [4]. In distributing fake news during COVID-19, internet media has done more harm than mass media; YouTube, WhatsApp, and Twitter are among the social media sites that generate the highest share of fake news [4]. It would be intriguing and, in fact, useful if the origin of messages could be checked and sifted so that phony messages were isolated from valid ones. The information that individuals listen to and share on web-based media is to a great extent affected by the groups of friends and connections they form on the web [12]. Misinformation about an infectious disease that has been declared a pandemic is riskier than the virus itself. Studies show that the spread of misinformation is not only present in the COVID-19 pandemic but was also observed during other epidemics such as Ebola [18], yellow fever [19], and Zika [20]; these studies analyze the spread of misinformation on social media platforms with respect to each disease.

Another study put forward the notion that users of social media share counterfeit news more than substantial, fact-based news with regard to COVID-19, a practice that becomes a danger to society [8]. That study also dissects data sourced from numerous social networking platforms: 1,923 forged and/or groundless publications about COVID-19 uploaded to Twitter and Sina Weibo were recovered, arranged, and analyzed. The outcomes revealed that, compared to Sina Weibo, Twitter has more unsubstantiated and false news distributed on the platform [8]. The prevailing subjects of unverified COVID-19 news on media platforms include politics, terrorism, faith, health, religious politics, and miscellaneous culture; out of all the mentioned subjects, the highest ranked is counterfeit health news revolving around medical prescriptions and clinical services [9]. Over the last few years, numerous computational procedures have been created in order to alleviate these issues and recognize different types of false information in diverse genres. Nonetheless, exposing COVID-19-related falsehood presents its own set of special difficulties [5]. Tweets that are totally impartial are eliminated, and the sentiment polarity of tweets relevant to false news and substantial news sources is mapped, to better dissect the opinions in tweets associated with phony and genuine news stories [5]. A model has been proposed for recognizing forged news messages in Twitter posts by learning to predict accuracy appraisals, with a view to automating forged news identification in Twitter datasets. Neural networks are a form of machine learning method that have been found to exhibit high accuracy and precision in clustering and classification of fake news [13].

3. Methodology
3.1 Data Set
The data set consists of brief COVID-19 news statements and labels obtained from fact-checking websites, i.e., PolitiFact and GossipCop. In PolitiFact, journalists and domain experts review political news and provide fact-checking evaluation results to label news articles as fake or real. These news items are classified by the fact-checking websites into different categories of misinformation such as false news, fake news, partially false, misleading, and so on. The extracted COVID-19 news that the fact-checking websites classify as false, mostly false, misleading, inaccurate, or pants on fire is considered fake news in our study. On the other hand, the news extracted and classified as true or mostly true by the fact-checking websites is considered true news in our study. In addition, news published by the World Health Organization or the United Nations is also considered true news in our study.
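
As a minimal sketch of this labelling rule (the helper name and the exact verdict strings below are illustrative assumptions, not part of the original pipeline), the mapping from fact-checker verdicts to binary labels can be written as:

```python
# Illustrative sketch of the verdict-to-label rule described above.
FAKE_VERDICTS = {"false", "mostly false", "misleading", "inaccurate", "pants on fire"}
TRUE_VERDICTS = {"true", "mostly true"}
TRUSTED_SOURCES = {"world health organization", "united nations"}

def to_binary_label(verdict: str, source: str = "") -> str:
    """Collapse a fact-checking verdict (or a trusted publisher) into fake/real."""
    v = verdict.strip().lower()
    if v in FAKE_VERDICTS:
        return "fake"
    if v in TRUE_VERDICTS or source.strip().lower() in TRUSTED_SOURCES:
        return "real"
    return "unlabelled"  # ambiguous verdicts (e.g. "partially false") are not used

print(to_binary_label("Pants on Fire"))                         # -> fake
print(to_binary_label("", source="World Health Organization"))  # -> real
```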

In total, 75 news items (40 fake and 35 true) about COVID-19 were extracted. Against these 75 news items, we managed to extract 46,852 tweets posted by 18,940 users. The breakdown of news and tweets is shown in Tab. 2.

3.2 Data Collection Method
The extracted COVID-19 news was supplied to Hoaxy [14]. An article search was performed for every news item on Hoaxy to explore all articles matching the keywords; relevant articles were selected and their spread was then analyzed. Hoaxy gives a partial tweet network for the spread of the selected articles on Twitter. In accordance with Twitter's sharing policy, Hoaxy shares limited details of every original tweet (related to the articles) and all the retweets derived from it. The data extracted from Hoaxy contain the tweet ID, publishing date-time, creator user ID, retweeter user ID, retweet ID, retweeting date-time, and bot scores. More than one tweet is observed for each news item, each with its own specific retweeters.

Since the data extracted from Hoaxy are not sufficient for exploratory analysis, Tweepy [15] was used to extract further data from Twitter. Given the user IDs, we extracted the follower and friend counts of the retweeters, which support the later analysis. The data extracted from Hoaxy do not contain the interconnections between users; Tweepy therefore helped us extract the directed edges between followers and followees.
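
A hedged sketch of this Tweepy step is shown below; the credential placeholders are assumptions, and the call names follow the Tweepy 3.x API (newer Tweepy versions rename some of these methods).

```python
# Sketch of the follower/friend extraction described above (Tweepy 3.x names).
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")   # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def user_profile_features(user_id):
    """Follower and friend counts for a retweeter ID returned by Hoaxy."""
    user = api.get_user(user_id=user_id)
    return {"user_id": user_id,
            "followers_count": user.followers_count,
            "friends_count": user.friends_count}

def follower_edges(user_id):
    """Directed follower -> followee edges used to interconnect the users."""
    return [(follower, user_id) for follower in api.followers_ids(user_id=user_id)]
```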

3.3 Proposed Framework
In the proposed framework shown in Fig. 1, we implement user features and node-based features and integrate them to train machine learning models, including SVM, Logistic Regression, Random Forest, AdaBoost, Decision Tree, ensemble learning (voting classifier), CatBoost, LightGBM, Gradient Boosting, KNN, and Naive Bayes, as well as deep learning models, namely CNN and RNN (LSTM), details of which are given below; to the best of our knowledge, such features have not been combined in this way before. We perform social network analysis on the COVID-19 news items and, using a social analysis tool, create the follower-following graph of every tweet. Each news item then has a network of all the users involved in that news, constructed by connecting followers and friends. Our approach to this problem is to find the dominant social network features and user features and combine them, so that they can be used to train different machine learning and deep learning algorithms.

4. Experimentation
This chapter takes a deep dive into the experimental setup used in this study. As mentioned in the previous section, the features are split into two parts: network-based features and user-based features. In this paper, experimentation is done to classify fake and true news based on the network and user features, using different machine learning and deep learning algorithms.

4.1 Data set
4.1.1 Network Based Features
The following set of network attributes is analyzed:

    number_of_edges,
    avg_degree,
    avg_shortest_path_length,
    degree_centrality,
    betweenness_centrality,
    average_clustering_coefficient,
    number_of_communities,
    closeness_centrality,
    eigenvector_centrality,
    Page Rank,
    InDegree, and
    outDegree

These network attributes are analyzed to provide a comparison between the network attributes of fake news and those of true news.
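
The sketch below (an assumed script, not the authors' exact code) shows how these attributes can be computed per news item with networkx; how the per-node scores are aggregated into a single value per network (here, a simple mean) is also an assumption.

```python
# Computing the listed network attributes for one news item with networkx.
import networkx as nx
from networkx.algorithms import community

def network_features(edge_list):
    """edge_list: iterable of (retweeter, original_tweeter) directed edges."""
    G = nx.DiGraph(edge_list)
    U = G.to_undirected()
    giant = U.subgraph(max(nx.connected_components(U), key=len))  # path length needs connectivity
    n = G.number_of_nodes()
    mean = lambda d: sum(d.values()) / n                          # per-node scores -> one value
    return {
        "number_of_edges": G.number_of_edges(),
        "avg_degree": sum(deg for _, deg in G.degree()) / n,
        "avg_shortest_path_length": nx.average_shortest_path_length(giant),
        "degree_centrality": mean(nx.degree_centrality(G)),
        "betweenness_centrality": mean(nx.betweenness_centrality(G)),
        "closeness_centrality": mean(nx.closeness_centrality(G)),
        "eigenvector_centrality": mean(nx.eigenvector_centrality_numpy(U)),
        "page_rank": mean(nx.pagerank(G)),
        "average_clustering_coefficient": nx.average_clustering(U),
        "number_of_communities": len(community.greedy_modularity_communities(U)),
        "in_degree": sum(deg for _, deg in G.in_degree()) / n,
        "out_degree": sum(deg for _, deg in G.out_degree()) / n,
    }
```
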
4.1.2 User Based Features
Following the social network analysis, we extracted from Twitter the features of the users present in the graph of each particular news item:

    User botScore,
    Account Age,
    Is verified,
    Followers count,
    Friends count,
    Favorite count,
    Status count, and
    Listed count.

4.2 Social Network Graphs
The data set extracted in the previous section is used to construct the social networks for fake and true news. The tweet and retweet users and the associations between them generate the edge list. Each of the 75 news items has a CSV file containing all the relevant attributes that we discovered. In addition, 75 graphs were generated using Cytoscape, showing the links between the users who tweeted a particular news item. Snapshots of two graphs are attached below in Fig. 14 (fake) and Fig. 15 (true), respectively.

4.3 Feature Analysis
4.3.1 Exploratory Analysis
The initial phase of the study is exploratory data analysis (EDA). In this phase, we work out how to interpret the data we have, which questions we want to address and how to frame them, and how to manipulate the data sources to obtain the insights we require. We did this using different techniques such as box plots, correlation analysis, and ECDF plots.

We use box plots for the graphical representation of the data; a box plot shows how the extreme values differ from the majority of the data, and is built from the minimum, maximum, and quartile values. Real and fake news have different characteristics that can be important when analyzing a tweet using box plots. Fig. 16 shows the difference in the distribution of the network features for the two sets of news. The mean of the average shortest path length is 2 in fake news compared to 1.81 in real news; betweenness centrality is 0.34 in fake news compared to 0.32 in real news; closeness centrality is 0.68 compared to 0.59; eigenvector centrality is 0.32 compared to 0.37; PageRank is 0.06 compared to 0.08; and the average number of communities is 7.86 in fake news whereas it is 9.18 in real news.
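
A short sketch of this box-plot comparison is given below; the file and column names are assumptions about how the per-news feature table is stored.

```python
# Box plots of one network feature for fake vs real news (assumed file/columns).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("network_features.csv")       # one row per news item, with a "label" column
feature = "avg_shortest_path_length"

sns.boxplot(x="label", y=feature, data=df)     # label is "fake" or "real"
plt.title(f"Distribution of {feature} for fake and real news")
plt.show()

print(df.groupby("label")[feature].mean())     # e.g. roughly 2.00 (fake) vs 1.81 (real)
```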

4.3.2 Empirical Cumulative Distribution
For this statistical exploration in Python, an empirical distribution function is convenient. A Python library offers an ECDF class for fitting an empirical cumulative distribution function and computing the cumulative probabilities used for data attribute selection; the ECDF provides an efficient way to plot the data and apply percentile thresholds during exploration. Figs. 17 and 18 show the summary of the ECDF analysis for the network-based features and the user-based features of true and fake news, respectively.
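
The sketch below assumes the ECDF class from statsmodels (the text does not name the library, so this is an assumed choice) and the same per-news feature table as in the previous sketch.

```python
# ECDF comparison of one network feature for fake vs real news.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF

df = pd.read_csv("network_features.csv")            # assumed file, one row per news item

def plot_ecdf(values, label):
    ecdf = ECDF(values)                              # fit the empirical CDF
    x = np.sort(values)
    plt.step(x, ecdf(x), where="post", label=label)

plot_ecdf(df.loc[df.label == "fake", "betweenness_centrality"], "fake")
plot_ecdf(df.loc[df.label == "real", "betweenness_centrality"], "real")
plt.xlabel("betweenness_centrality")
plt.ylabel("cumulative probability")
plt.legend()
plt.show()
```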

4.3.3 Correlation Analysis
We measured the correlations between features to determine whether the network-based features we identified depend on established characteristics and whether they are independent of each other. For each data set, we computed a correlation matrix that quantifies the degree of association between features, and we used the seaborn (sns) library in Python to visualize the correlations. Fig. 19 presents the correlation matrix for the network-based features, and Fig. 20 presents the corresponding matrix for the user-based features. The diagonal values, as can be seen, are 1, and the upper and lower triangular regions contain the same information, so only one of them needs to be shown to evaluate the correlation. Darker shades indicate a higher correlation value, whereas lighter colors indicate a weaker correlation. Based on these correlations, we selected the best features from both the network-based and user-based sets, so that the trained models perform better.
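
A sketch of the correlation matrix and heatmap follows (the file and column names are assumptions); only one triangle is drawn, since the matrix is symmetric with ones on the diagonal.

```python
# Feature-feature correlation matrix visualised with seaborn.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("network_features.csv")                          # assumed per-news feature table
corr = df.drop(columns=["label", "news_id"], errors="ignore").select_dtypes("number").corr()

mask = np.triu(np.ones_like(corr, dtype=bool), k=1)               # hide the redundant upper triangle
sns.heatmap(corr, mask=mask, cmap="Blues", annot=False)
plt.show()

# One feature of each highly correlated pair can then be dropped before training.
```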

4.3.4 Feature Importance
The feature importance attribute of a model can be used to obtain the importance of each attribute in the data. Feature importance assigns a value to every feature of the data set; the greater the value, the more important that feature is to the output variable. Tree-based classifiers expose a built-in feature importance attribute. The initial feature selection, based on correlation and value distribution, is shown in Tab. 3, and based on this feature selection we chose the features used by the machine learning models.
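
A minimal sketch of the tree-based importance computation is shown below; ExtraTreesClassifier is an assumed choice of tree ensemble, and the file name is illustrative.

```python
# Feature importance from a tree-based classifier (values sum to 1).
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

df = pd.read_csv("combined_features.csv")          # assumed network + user feature table
X = df.drop(columns=["label", "news_id"], errors="ignore").select_dtypes("number")
y = df["label"]

model = ExtraTreesClassifier(n_estimators=100, random_state=1).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```
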
5. Implementation
Our approach to this problem is to find the dominant social network features and user features and combine them, so that they can be used to train different machine learning and deep learning algorithms. In our model, we used several types of machine learning algorithms, and for the implementation we used Python as our programming language. We test the features on different learning algorithms and choose the one that achieves the best performance. For machine learning, the algorithms include SVM, Logistic Regression, Random Forest, AdaBoost, Decision Tree, ensemble learning (voting classifier), CatBoost, LightGBM, Gradient Boosting, KNN, and Naive Bayes. As per the framework, the data set is divided into two phases: a training phase, in which a model is constructed from the training set and the values (class labels) of a classifying attribute, and a testing phase, in which the model is used to classify data that were not part of the training set and to predict their class labels.
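
A minimal sketch of this training/testing split (the split ratio and file name are assumptions):

```python
# Splitting the combined feature table into a training and a testing phase.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("combined_features.csv")                    # assumed feature table
X = df.drop(columns=["label", "news_id"], errors="ignore").select_dtypes("number")
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)          # 80/20 split is an assumption
```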

5.1 Machine Learning Models Implementation
The dependent variable is always binary when we are dealing with logistic regression, whose main applications are prediction and calculating the probability of success. For the logistic regression classifier, we use a random state of 1 and the Broyden–Fletcher–Goldfarb–Shanno algorithm as the solver. A Support Vector Machine creates a line or hyperplane that separates the data points into distinct classes; the training data are mapped into a kernel space by the SVM, and out of the many possibilities we used the linear kernel for our SVM classifier. Naïve Bayes classifiers are highly scalable; for a given learning problem they require a number of parameters linear in the number of features. The purpose of the k-nearest-neighbors algorithm is to store all available cases and classify new cases according to a distance function. Random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the data set; the predictive accuracy is improved and overfitting is controlled by averaging. The maximum depth of the tree is 1 and the split criterion is "gini". Decision trees classify instances based on feature values: each node of the tree represents a feature of the instance to be classified, and each branch represents a value that the node can take. To predict the class, the ensemble voting classifier (EVC) is used with three estimators: logistic regression, decision tree, and SVM. Gradient boosting classifiers are a group of machine learning algorithms in which many weak learning models are combined to form a strong predictive model; fifty estimators with a depth of 2 are utilized. LightGBM handles huge amounts of data with ease; it uses a gradient boosting framework that in turn uses tree-based algorithms, allowing the tree to grow vertically, with a default learning rate of 0.1. CatBoost is a first-rate classifier that does not require parameter tuning and has an extensible GPU version; this classifier helped reduce overfitting and increased the accuracy of the model.
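
The sketch below instantiates the classifier set described in this section. Only the hyper-parameters clearly stated in the text are set explicitly (logistic regression with random_state=1 and the BFGS-family "lbfgs" solver, and a linear SVM kernel); everything else is a library default or an assumption, and LightGBM/CatBoost are only noted in a comment since they plug into the same fit/predict loop.

```python
# Training and scoring the machine learning classifiers on the combined features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, VotingClassifier)

df = pd.read_csv("combined_features.csv")                     # assumed feature table
X = df.drop(columns=["label", "news_id"], errors="ignore").select_dtypes("number")
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

models = {
    "logistic_regression": LogisticRegression(random_state=1, solver="lbfgs", max_iter=1000),
    "svm_linear": SVC(kernel="linear"),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(criterion="gini"),
    "random_forest": RandomForestClassifier(),
    "adaboost": AdaBoostClassifier(),
    "gradient_boosting": GradientBoostingClassifier(),
}
models["voting"] = VotingClassifier(estimators=[
    ("lr", LogisticRegression(random_state=1, max_iter=1000)),
    ("dt", DecisionTreeClassifier()),
    ("svm", SVC(kernel="linear"))])

# lightgbm.LGBMClassifier (default learning_rate=0.1) and catboost.CatBoostClassifier
# fit into the same loop below.
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```
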
5.2 Deep Learning Models Implementation
After testing the data with several supervised methods, deep learning models were also applied, as they produce an efficient output layer based on the weights of the input layers. The two most popular deep learning methods, the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN), were implemented to see how well our data fit these models. Such algorithms are suited to various classification tasks on different data sets, each with its own properties and performance. We used Keras as the neural network library; it provides easy and reliable high-level APIs and follows best practices to reduce the cognitive burden on users. Unlike ordinary neural networks, RNNs rely on knowledge of prior outputs to predict upcoming information, which is extremely helpful when working with sequential data. An LSTM network is a kind of recurrent neural network that has LSTM cell blocks in place of the usual neural network layers. For the CNN model, the architecture and its number of parameters are given below; comparing the feature-learning part of the network with the fully connected part, most of the parameters come from the fully connected component. In terms of computing cost, the CNN is very efficient. The architecture diagrams of the RNN and CNN are shown in Figs. 22 and 23, and a summary of the two models is shown in Fig. 24.
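
A hedged Keras sketch of the two deep models follows; the layer sizes are assumptions (the actual architectures are in Figs. 22–24), and the tabular feature vector is reshaped into a length-F "sequence" of one-dimensional steps so that LSTM and Conv1D layers can consume it.

```python
# Minimal RNN (LSTM) and CNN built with the Keras API on the combined features.
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Conv1D, MaxPooling1D, Flatten, Dense

df = pd.read_csv("combined_features.csv")                    # assumed feature table
X = df.drop(columns=["label", "news_id"], errors="ignore").select_dtypes("number").to_numpy("float32")
y = (df["label"] == "fake").astype("float32").to_numpy()
F = X.shape[1]
X_seq = X.reshape(-1, F, 1)                                   # treat the features as a sequence

rnn = Sequential([LSTM(64, input_shape=(F, 1)),
                  Dense(32, activation="relu"),
                  Dense(1, activation="sigmoid")])

cnn = Sequential([Conv1D(32, kernel_size=3, activation="relu", input_shape=(F, 1)),
                  MaxPooling1D(pool_size=2),
                  Flatten(),
                  Dense(32, activation="relu"),
                  Dense(1, activation="sigmoid")])

for model in (rnn, cnn):
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_seq, y, epochs=20, batch_size=8, validation_split=0.2, verbose=0)
    model.summary()                                           # per-layer parameter counts
```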

6. Results And Analysis
This chapter presents the results obtained by executing our proposed methodology on the aforementioned data sets. We list the performance of all the different models in our methodology and also provide insight into the results.

6.1 Feature Selection
Our approach to this problem is to find the dominant social network features and user features and combine them, so that they can be used to train different machine learning algorithms. We determined the optimal set of hyper-parameters by testing the performance of our models on the development set for different parameter combinations. We analyze the importance of the features in the different machine learning models, using a feature selection library for feature importance and feature engineering. For evaluation we used the feature importance module, which calculates the relative weight of each feature in the model, with the weights summing to 1. We started with both the network-based and user-based features combined and trained the machine learning models on them. Fig. 13 shows the best features of the top-performing algorithms.

6.2 Results of Network-based and User-based Features Separately
We test the features on different learning algorithms and choose the one that achieves the best performance. We used the ROC-AUC curve to validate the performance of the machine learning algorithms; metrics such as accuracy, precision, recall, and F1 score were also calculated.

All models were tested on the network-based features, the user-based features, and combinations of the two (the Cartesian product of both feature sets, and the mean, median, and mode of the user features combined with the network features). Tab. 5 shows the results for the network features and the user features. Among the mean, median, and mode, the median of the user features combined with the network features performed best; Tab. 6 shows the result of the median of the user features combined with the network features.
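
The best-performing combination (per-news medians of the user features joined to the network features) can be sketched as follows, with assumed file and column names:

```python
# Combining the median of the user features with the network features per news item.
import pandas as pd

users = pd.read_csv("user_features.csv")        # one row per engaged user, with a news_id column
network = pd.read_csv("network_features.csv")   # one row per news item, with news_id and label

user_median = users.groupby("news_id").median(numeric_only=True).reset_index()
combined = network.merge(user_median, on="news_id", suffixes=("", "_user_median"))
combined.to_csv("combined_features.csv", index=False)
```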

6.3 Results for Combined Features
In our experiment, we extract the user features and the network-based features and combine them to train the ML models. The KNN and Naïve Bayes models were tested on the features mentioned above, giving 76% and 79% accuracy respectively, while the SVM and logistic regression models give 72% and 81% respectively. The decision tree and EVC models were then applied; this time the performance was slightly better than before, with 80% accuracy for both. Thirdly, random forest was applied to check for any further improvement over the previous results, and a 6-point improvement was observed, as the testing accuracy of RF is 86%; AdaBoost achieved 91%. For the best performance we then applied the more recent boosting algorithms Gradient Boosting, LightGBM, and CatBoost, which performed well, with testing accuracies of 97%, 97%, and 98% respectively. CatBoost clearly has the highest test accuracy at 98%. The overall comparison of all the machine learning algorithms is shown in Tab. 7; the highest accuracy is achieved by CatBoost.

6.4 Comparison of Results and Performance
We list the performance of the different models using graphs in order to analyze them. We consider a scheme to be the top performer in an experiment when it yields better accuracy than all the other models involved in the experimentation. We also applied deep learning techniques to our data set; the accuracies achieved by the deep learning algorithms RNN and CNN are 99% and 98% respectively, as shown in Fig. 25. We can see that the deep learning models and CatBoost performed well among all the models.

7. Evaluation
7.1 Precision, Recall and F1 Score
These measures are widely used in the field of machine learning and allow us to assess a classifier's performance from various perspectives. In particular, precision captures the agreement between predicted fake news and actual fake news: it measures the fraction of all items identified as fake news that are annotated as fake news, addressing the important problem of recognizing fake news. Recall measures the sensitivity, i.e., the fraction of annotated fake news stories that are predicted to be fake news. F1 combines precision and recall and provides an overall measure of prediction performance for detecting false news. Note that for precision, recall, F1, and accuracy, higher values mean better output. For the machine learning algorithms, Tab. 8 and Tab. 9 compare the precision, recall, and F1-score on the network features and the combined features.
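
A small sketch of the metric computation with scikit-learn, treating "fake" as the positive class (the toy label vectors below are purely illustrative; in the study they come from the held-out test split):

```python
# Accuracy, precision, recall, and F1 with "fake" as the positive class.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["fake", "fake", "real", "real", "fake", "real"]   # toy ground truth
y_pred = ["fake", "real", "real", "real", "fake", "fake"]   # toy predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, pos_label="fake"))
print("recall   :", recall_score(y_true, y_pred, pos_label="fake"))
print("f1       :", f1_score(y_true, y_pred, pos_label="fake"))
```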

7.2 Metric TP, TN, FN and FP
Tab. 10 and Tab. 11 show the evaluation metrics for the machine learning models on the network-based features and the combined features. We treat fake news detection as a classification problem in which we predict whether a news piece is fake or real. True Positive (TP): a fake news item is correctly marked as fake news; True Negative (TN): a true news item is correctly marked as true news; False Negative (FN): a fake news item is incorrectly marked as real news; False Positive (FP): a true news item is incorrectly marked as fake news.

7.3 Hyper Parameters for models
We determined the optimal set of hyper-parameters by testing the performance of our models on the development set for different parameter combinations. Tab. 12 shows the parameters that gave the best results for the different algorithms on the network-based features, and Tab. 13 shows those for the combined features.
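
The paper only states that parameter combinations were compared on a development set, so the GridSearchCV sketch below is an assumed stand-in for that search, with illustrative grid values:

```python
# Hypothetical hyper-parameter search over a gradient boosting classifier.
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("combined_features.csv")                    # assumed feature table
X = df.drop(columns=["label", "news_id"], errors="ignore").select_dtypes("number")
y = (df["label"] == "fake").astype(int)

grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [50, 100, 200],
                "max_depth": [1, 2, 3],
                "learning_rate": [0.05, 0.1]},
    scoring="roc_auc", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```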

7.4 ROC – Accuracy Score Graph
The ROC curve is a graph that plots the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis and is used to assess how well all true and false cases are classified; an ideal classifier has an FPR of 0 (no real news flagged) and a TPR of 1 (all fake news detected). ROC curves for all the machine learning models are shown in Fig. 26, which clearly shows that gradient boosting has the highest test accuracy, with the best ROC accuracy score on the network-based features, while CatBoost performs best on the combined features, as shown in Fig. 27.
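
A short sketch of plotting one ROC curve and its AUC with scikit-learn; the score vector would normally be clf.predict_proba(X_test)[:, 1] from a fitted classifier, and toy values are used here for illustration:

```python
# ROC curve (FPR vs TPR) and AUC for one classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 0, 0, 1, 0, 1, 0]                     # 1 = fake, 0 = real (toy labels)
scores = [0.9, 0.8, 0.35, 0.4, 0.65, 0.2, 0.55, 0.6]  # predicted probability of "fake"

fpr, tpr, _ = roc_curve(y_true, scores)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")              # chance-level diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```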

8. Conclusion
There is a lot of misrepresentative information on the internet and on social media platforms, and this has become a very prominent present-day problem.

In this research work, we have considered the spread of false information/news along with the detection and classification of such false information. We performed social network analysis, extracted important node-based features, and combined them with Twitter's user-based features into a single CSV file, with each row representing an individual that appeared in fake or real news and every row labelled as fake or real. We obtained relatively good results with CatBoost, LightGBM, and the deep learning models, since they perform better, as shown in the results.

9. References
  1. Zhou, Xinyi, Atishay Jain, Vir V. Phoha, and Reza Zafarani. "Fake news early detection: A theory-driven model." Digital Threats: Research and Practice 1, no. 2 (2020): 1–25.
  2. Qureshi, Khubaib Ahmed, Rauf Ahmed Shams Mallick, and Muhammad Sabih. "Fake News Detection." 2020. (Submitted)
  3. Zhou, Xinyi, and Reza Zafarani. "Fake news: A survey of research, detection methods, and opportunities." arXiv preprint arXiv:1812.00315 (2018).
  4. Al-Zaman, Md. "COVID-19-related Fake News in Social Media." (2020).
  5. Cui, Limeng, and Dongwon Lee. "CoAID: COVID-19 Healthcare Misinformation Dataset." arXiv preprint arXiv:2006.00885 (2020).
  6. Ahmed, Sajjad, Knut Hinkelmann, and Flavio Corradini. "Combining machine learning with knowledge engineering to detect fake news in social networks: A survey." In Proceedings of the AAAI 2019 Spring Symposium, vol. 12. 2019.
  7. Zhou, Xinyi, and Reza Zafarani. "Network-based fake news detection: A pattern-driven approach." ACM SIGKDD Explorations Newsletter 21, no. 2 (2019): 48–60.
  8. Pulido Rodríguez, Cristina, Beatriz Villarejo Carballido, Gisela Redondo-Sama, Mengna Guo, Mimar Ramis, and Ramon Flecha. "False news around COVID-19 circulated less on Sina Weibo than on Twitter. How to overcome false information?" International and Multidisciplinary Journal of Social Sciences (2020): 1–22.
  9. Erku, Daniel A., Sewunet A. Belachew, Solomon Abrha, Mahipal Sinnollareddy, Jackson Thomas, Kathryn J. Steadman, and Wubshet H. Tesfaye. "When fear and misinformation go viral: Pharmacists' role in deterring medication misinformation during the 'infodemic' surrounding COVID-19." Research in Social and Administrative Pharmacy (2020).
 10. Zhang, Jiawei, Bowen Dong, and Philip S. Yu. "FakeDetector: Effective fake news detection with deep diffusive neural network." In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1826–1829. IEEE, 2020.
 11. Shu, Kai, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. "FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media." Big Data 8, no. 3 (2020): 171–188.
 12. Ajao, Oluwaseun, Deepayan Bhowmik, and Shahrzad Zargari. "Fake news identification on Twitter with hybrid CNN and RNN models." In Proceedings of the 9th International Conference on Social Media and Society, pp. 226–230. 2018.
 13. Mahir, Ehesas Mia, Saima Akhter, and Mohammad Rezwanul Huq. "Detecting fake news using machine learning and deep learning algorithms." In 2019 7th International Conference on Smart Computing & Communications (ICSCC), pp. 1–5. IEEE, 2019.
 14. Hoaxy by Indiana University. URL: https://hoaxy.iuni.iu.edu/
 15. Tweepy: A Python library for Twitter. URL: https://www.tweepy.org/
 16. Pogue, David. "How to Stamp Out Fake News." Scientific American 316, no. 2 (2017): 24–24.
 17. Poynter Institute, 2020. The International Fact-Checking Network. URL: https://www.poynter.org/ifcn/
 18. Oyeyemi, Sunday Oluwafemi, Elia Gabarron, and Rolf Wynn. "Ebola, Twitter, and misinformation: A dangerous combination?" BMJ 349 (2014).
 19. Glowacki, Elizabeth M., Allison J. Lazard, Gary B. Wilcox, Michael Mackert, and Jay M. Bernhardt. "Identifying the public's concerns and the Centers for Disease Control and Prevention's reactions during a health crisis: An analysis of a Zika live Twitter chat." American Journal of Infection Control 44, no. 12 (2016): 1709–1711.
 20. Miller, Michele, Tanvi Banerjee, Roopteja Muppalla, William Romine, and Amit Sheth. "What are people tweeting about Zika? An exploratory study concerning its symptoms, treatment, transmission, and prevention." JMIR Public Health and Surveillance 3, no. 2 (2017): e38.
 21. Zhou, Xinyi, and Reza Zafarani. "Fake news: A survey of research, detection methods, and opportunities." arXiv preprint arXiv:1812.00315 (2018).
 22. Ma, Jing, Wei Gao, and Kam-Fai Wong. "Rumor detection on Twitter with tree-structured recursive neural networks." Association for Computational Linguistics, 2018.

Declarations
Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Tables
Tables are available in the Supplementary Files section.

Figures

Figure 1: Proposed Framework

Figure 14: Fake news propagation graph

Figure 15: Real news propagation graph

Figure 16: Box plots demonstrating the differences in the distribution of network-based features for true news and fake news

Figure 17: Empirical Cumulative Distribution Function (ECDF) for network-based features

Figure 18: Empirical Cumulative Distribution Function (ECDF) for user-based features

Figure 19: The correlation between network-based features in the real and fake data sets

Figure 20: The correlation between user-based features in the real and fake data sets

Figure 21: Network feature correlations for true and false news

Figure 22: RNN architecture diagram

Figure 23: Architecture diagram of CNN

Figure 24: Summary of the RNN and CNN models

Figure 25: The overall comparison of all the ML and DL models

Figure 26: ROC for all ML models

Figure 27: Gradient boosting for network-based features and CatBoost for combined features
Supplementary Files
This is a list of supplementary files associated with this preprint.

    Tables.docx
