Fake News Classification and Disaster in case of Pandemic: COVID-19

Tooba Qazi (toobaqazi1@gmail.com)
FAST - National University of Computer and Emerging Sciences, https://orcid.org/0000-0002-4978-5748
Dr. Rauf Ahmed Shams
FAST - National University of Computer and Emerging Sciences

Research Article

Keywords: Fake News, network based, COVID-19, Pandemic, Machine Learning

Posted Date: January 24th, 2022

DOI: https://doi.org/10.21203/rs.3.rs-1287367/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract

Social media is presently one of the most common and prevalent means of consuming and spreading news. Unfortunately, this includes not only credible news but also a wide array of false news, leaving negative imprints and influences on the world. Especially now, when a new story is born every second, the widespread circulation of false news is a matter of concern. These concerns have made it essential to limit the spread of incorrect information by analyzing and detecting it. In a tragedy of this kind, the existing knowledge base is not sufficient to classify news as false or true, since the topic is new and no prior information is available. To stop this misguidance on issues of human health, this work aims to examine the imperative aspects of social networking and the fake news data and information it provides. To confront this issue, we build a COVID-19 data set collected from social networking sites and detect the visible patterns and characteristics of both fake and real news using social analysis tools, applying different machine learning algorithms in combination with deep learning techniques. Deep learning techniques are frequently applied to fake news detection because they skip manual feature engineering and learn feature representations automatically.

1. Introduction

As far as history is concerned, fake news has always been a top driver of crises in human communication, inducing agitation, conflict, and misconception among people. In this digital age, social media platforms are the primary channel for relaying news, as they are easy to access and allow interaction among people with varying views. The current spread of COVID-19 is distressing society, including the information revolving around it. True facts help diminish the crisis, whereas false facts inflame the issues, adding fuel to the fire. Consuming such information has resulted in skepticism and frenzy in society, which is deadly wherever human health is concerned.

Since the evolution of social media platforms, users have easy access to create, share, and react to any social media content. With this ease comes a high risk of spreading fake news on social platforms. Such fake content can severely threaten individuals as well as a country's democracy, political campaigns, the stock market, and large trades [16]. Fake news can be created and published online faster and more cheaply than through traditional news media such as newspapers and television, so it is currently impossible to validate whether every piece of shared content is fake or real before it spreads on social networks. This has drawn researchers' attention to addressing the spread of fake news on social media platforms.

Because social media platforms are a primary source for consuming news, it is necessary to analyze the COVID-19-related data circulating on them. News generated via social media comes with its own advantages and disadvantages. On Twitter, for example, the platform makes the spread of fake news much easier, affecting a massive number of users at a time.
Repeatedly, it has been seen that such tactics have the potential to be used for financial as well as political gain
on a large scale, by organizations or individuals [2]. The information shared by users plays an important role and can help bring improvement to this health crisis. Since April 23, 2020, the International Fact-Checking Network (IFCN) [17] has been working closely with around 100 fact-checking organizations and has classified more than 4,000 pieces of false information regarding COVID-19. Similar trends of spreading misinformation were also observed during other epidemics such as Ebola [18], yellow fever [19], and Zika [20].

The COVID-19 pandemic has caused the world to grieve for half a million dead. Along with the loss of precious lives, the pandemic has brought unforeseen circumstances related to health information. A new concept combining "epidemic" and "information" has come into existence: in such a situation, both correct and incorrect information circulate in abundance. On one hand, correct information helps deal with a crisis at the right level of priority; on the other hand, incorrect information worsens the situation and increases the magnitude of the crisis [4].

Automatic detection of false news has been developing over recent years, and present methods can be categorized into two groups: "content-based" and "propagation-based" methods [1]. Content-based methods detect false news by evaluating the news itself from numerous angles; secondary information, such as who propagates the news, is not considered. When news content is heavily relied upon, these methodologies become sensitive to it: by concealing an individual writing style, false entries can be introduced that divert the results of the approach. Because of this issue, network-based approaches to fake news detection have recently gained popularity [7]. There are unique challenges in ascertaining and alleviating the spread of fake news, and numerous existing works have used various kinds of data along with network features to deal with them. Since the spread of information on social media platforms is fleeting, this project aims to provide a detection mechanism that stops the circulation of fake news as early as possible by observing the patterns of fake and real news.

2. Literature Review

Many researchers have been working to defeat the evil that is fake news. The methods vary from study to study, with some using deep learning and others focusing on machine learning models; different aspects of fake news detection also rely on different feature extraction. In [1], a theory-driven model for early detection of false news is presented. The proposed approach analyses and represents news data at four levels of language to anticipate false news on social networking sites before it is disseminated, and compares the efficiency of the proposed system with other state-of-the-art approaches. Such techniques identify fake news by (a) assessing news content (content-based detection), (b) exploring how news is distributed over the social network (propagation-based detection), or (c) using both news propagation data and news content data [1].
The work also explores potential patterns and relations among fake news, recurring misinformation, and deception/disinformation, while enhancing interpretability in fake news feature engineering [1]. Content-based fake news detection aims to detect
fake news by evaluating the content of the relevant news posts, articles, titles, headlines, and blogs [1]. Propagation-based methods, which often rely on Recurrent Neural Networks (RNNs), have gained popularity as these models exhibit good performance by investigating associations between news stories, writers, and consumers, i.e., spreaders [1, 10]. A selection of features has been derived from deception/disinformation patterns to represent news material, which contributes to the analysis of the connections between fake news and deception/disinformation [1].

Dealing with fake news from the propagation perspective relates to its dissemination, e.g., how it propagates and who spreads it. Recent research proposes several techniques to address this problem, including empirical patterns, mathematical models, and social network models [21]. Another line of recent work addresses fake news detection in the context of social network analysis as cascade-based or network-based. Cascade-based detection follows the propagation paths and news cascade patterns using cascade similarity and cascade representation of the social network [22]. Network-based detection, on the other hand, constructs a flexible network from cascades in order to indirectly capture fake news propagation; the constructed networks can be homogeneous, heterogeneous, or hierarchical. Fake news is tweeted twice as much as real news, with each following a different propagation pattern [2]. Gradient boosting, with the best ROC precision score, produced the maximum accuracy results; hybrid and profile-based models were stronger, while the content-based model is best for early detection [2]. Most spreaders of fake news share similar attributes; hence, studying specific user behavior allows one to discriminate fake news from actual news [7]. Network-based models inspect information revealed in the news, while hybrid and propagation models incorporate user and network information [7]. Content-based (linguistic) models can also use spatiotemporal information: the location of the user is noted as spatial information, and the timestamps of user engagements are recorded as temporal information [11]. These facts and figures show how fake news pieces propagate on social media and how hot topics change over time [11].

There are four perspectives in fake news inspection: (a) knowledge-based, focusing on misleading knowledge; (b) style-based, focusing on how the news is posted; (c) spread-based, focusing on how it continues to spread; and (d) credibility-based, examining the legitimacy of its writers and spreaders [3]. Propagation-based analyses also focus on user experiences, whereas knowledge-based analysis focuses on multi-dimensional encounters between individuals and items generated from news material [3]. The convergence of machine learning and knowledge engineering could be valuable in identifying false news, differentiating between false and true news reports using data-driven and knowledge-engineering approaches.
A new experimental algorithmic solution can be developed to achieve this target, one that labels content as quickly as possible after the news is released on the internet. Both behavioral and social
structures must be embedded, combining the collected knowledge and data to tackle this problem head on. The data-driven (machine learning) side and the knowledge side can be combined through classification, with news divided into labels such as fake news, non-fake news, and unclear news. In the field of text mining, classification of the content is the main issue to consider; analyzing the text and content of the news and matching this content with prior facts through fact checking is called knowledge learning [6]. The data side includes text classification and stance identification, while the knowledge side involves fact checks that help optimize the results. The task is categorized into three parts, and the findings are merged at the end to verify whether or not a news item is incorrect [6]. Feature-based studies concentrate on manually creating a range of observed and/or latent features that characterize fake news and using these features within a machine learning system to identify or block fake news [3].

To figure out the themes, content styles, origins, dissemination, and motives of misleading events, researchers have observed and checked social networking sites for possible fake news, following these steps: (a) study misinformation; (b) produce a list of false information; (c) compile related news data; and (d) transcribe the information collected. Fake news on social sites has four major categories, (a) text, (b) image, (c) recording, and (d) clip, and it can appear in more than one medium at a time [4]. In distributing fake news during COVID-19, internet media has done more harm than mass media; social media sites such as YouTube, WhatsApp, and Twitter generate the highest share of fake news [4]. It would be intriguing and, in fact, useful if messages could be checked and sifted at the source so that phony messages were isolated from valid ones. The information that individuals tune in to and share on web-based media is to a great extent affected by the groups of friends and connections they form online [12].

Misinformation about an infectious disease that has been declared a pandemic is riskier than the virus itself. Studies show that the spread of misinformation is not unique to the COVID-19 pandemic; it was also observed during other epidemics such as Ebola [18], yellow fever [19], and Zika [20], and these studies analyze the spread of misinformation on social media platforms for each disease. Another study put forward the notion that social media users share counterfeit news more than substantial, fact-based news with regard to COVID-19, making such practice a danger to society [8]. That study also dissects data sourced from numerous social networking platforms: it recovered, arranged, and analyzed 1,923 forged and/or groundless COVID-19 publications uploaded to Twitter and Sina Weibo, and revealed that Twitter has more unsubstantiated and false news distributed on the platform than Sina Weibo [8]. Several prevailing subjects are relevant to unverified COVID-19 news on media platforms, including politics, terrorism, faith, fitness, religious politics, and miscellaneous culture; of these, the highest rank belongs to counterfeit health news revolving around medical prescriptions and clinical services [9].
Over the last few years, numerous computational procedures have been created to alleviate these issues and recognize different types of false information in diverse genres. Nonetheless, exposing COVID-19-related falsehood presents its own set of special difficulties [5]. In [5], tweets that are totally neutral are eliminated, and the sentiment polarity of tweets related to false and substantiated news sources is mapped to better dissect opinions in tweets associated with fake
and genuine news stories [5]. Another work builds a model for recognizing forged news messages from Twitter posts by learning to predict accuracy assessments, with a view to automating forged news identification in Twitter datasets. Neural networks are a form of machine learning method that has been found to exhibit high accuracy and precision in clustering and classification of fake news [13].

3. Methodology

3.1 Data Set

The data collection comprises brief COVID-19 news statements and labels obtained from fact-checking websites, i.e., PolitiFact and GossipCop. On PolitiFact, journalists and domain experts review political news and provide fact-checking evaluation results that claim news articles as fake or real. The news is classified by the fact-checking websites into different categories of misinformation such as false news, fake news, partially false, misleading, and so on. The extracted COVID-19 news classified by the fact-checking websites as false, mostly false, misleading, inaccurate, or pants-on-fire is considered fake news in our study. On the other hand, news extracted and classified as true or mostly true by the fact-checking websites is considered true news in our study. In addition, news published by the World Health Organization or the United Nations is also considered true news. In total, 75 news items (40 fake and 35 true) concerning COVID-19 were extracted. Against these 75 news items, we managed to extract 46,852 tweets posted by 18,940 users. The breakdown of news and tweets is shown in Tab 2.

3.2 Data Collection Method

The extracted COVID-19 news was supplied to Hoaxy [14]. An article search for every news item was performed on Hoaxy to explore all articles matching the keywords; relevant articles were selected and their spread was analyzed. Hoaxy gives a partial tweet network for the spread of the selected articles on Twitter. In agreement with the Twitter sharing policy, Hoaxy shares limited details of every original tweet (related to the articles) and all retweets of it. The data extracted from Hoaxy contains the tweet ID, publishing date-time, creator user ID, retweeter user ID, retweet ID, retweeting date-time, and bot scores. More than one tweet, each with its own retweeters, is observed for each news item. Since the data extracted from Hoaxy is not enough for exploratory analysis, Tweepy [15] was used to extract additional data from Twitter. Given the user IDs, we extracted the follower and friend counts of retweeters, which support further analysis. The data extracted from Hoaxy does not include the interconnection of users; Tweepy helped us extract the directed edges between followers and followees.
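As an illustration of this step, the following is a minimal sketch assuming Tweepy v4 naming with Twitter API v1.1 credentials and a hypothetical list of retweeter IDs taken from the Hoaxy export; it is not the exact collection script used in this study.

```python
import tweepy

# Hypothetical credentials and user IDs taken from the Hoaxy export (illustrative only).
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

retweeter_ids = [123456789, 987654321]  # placeholder IDs from the Hoaxy data
id_set = set(retweeter_ids)

profiles, edges = {}, []
for uid in retweeter_ids:
    user = api.get_user(user_id=uid)
    # Follower/friend counts and related fields used later as user-based features.
    profiles[uid] = {
        "followers_count": user.followers_count,
        "friends_count": user.friends_count,
        "verified": user.verified,
        "statuses_count": user.statuses_count,
    }
    # Directed follower -> followee edges restricted to the users of interest.
    follower_ids = api.get_follower_ids(user_id=uid)
    edges.extend((fid, uid) for fid in follower_ids if fid in id_set)
```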
3.3 Proposed Framework

In the proposed framework shown in Fig. 1, we implement user features and node-based features and integrate them to train machine learning models, including SVM, Logistic Regression, Random Forest, AdaBoost, Decision Tree, ensemble learning (voting classifier), CatBoost, LightGBM, Gradient Boosting, KNN, and Naive Bayes, as well as deep learning models including CNN and RNN-LSTM (details are given below), because such features have not been combined in this way before. We perform social network analysis on the COVID-19 news items and, using the social analysis tool, create the follower-following graph of every tweet. Each news item thus has a network of all users involved in that news, constructed by connecting followers and friends. Our approach is to find the dominant social network features and user features and combine them, so that they can then be used to train different machine learning and deep learning algorithms.

4. Experimentation

This chapter takes a deep dive into the experimental setup used in this study. As mentioned in the previous section, the features are split into two parts: network-based features and user-based features. Experimentation is done to classify fake and true news based on network and user features using different machine learning and deep learning algorithms.

4.1 Data set

4.1.1 Network Based Features

The following set of experiments provides the analysis of network attributes such as number_of_edges, avg_degree, avg_shortest_path_length, degree_centrality, betweenness_centrality, average_clustering_coefficient, number_of_communities, closeness_centrality, eigenvector_centrality, PageRank, in-degree, and out-degree. These network attributes are analyzed and provide a comparison between the network attributes of fake news and those of true news.
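As a sketch of how such attributes can be computed, the following uses NetworkX on a per-news follower-following graph; the toy edge list and the averaging of node-level centralities into per-news scalars are illustrative assumptions, not the exact code of this study.

```python
import networkx as nx
from networkx.algorithms import community
from statistics import mean

# Hypothetical directed follower -> followee edges for one news item.
edges = [(1, 2), (2, 3), (3, 1), (4, 2)]
G = nx.DiGraph(edges)
UG = G.to_undirected()

# Per-news scalar attributes; node-level centralities are averaged here for illustration.
features = {
    "number_of_edges": G.number_of_edges(),
    "avg_degree": mean(d for _, d in G.degree()),
    "in_degree": mean(d for _, d in G.in_degree()),
    "out_degree": mean(d for _, d in G.out_degree()),
    "degree_centrality": mean(nx.degree_centrality(G).values()),
    "betweenness_centrality": mean(nx.betweenness_centrality(G).values()),
    "closeness_centrality": mean(nx.closeness_centrality(G).values()),
    "eigenvector_centrality": mean(nx.eigenvector_centrality(UG, max_iter=1000).values()),
    "page_rank": mean(nx.pagerank(G).values()),
    "average_clustering_coefficient": nx.average_clustering(UG),
    "number_of_communities": len(community.greedy_modularity_communities(UG)),
}
# Average shortest path length is only defined on a connected (undirected) graph.
if nx.is_connected(UG):
    features["avg_shortest_path_length"] = nx.average_shortest_path_length(UG)
```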
4.1.2 User Based Features

Following the social network analysis, we extracted from Twitter the user features of the users present in the graph of a particular news item: user bot score, account age, is-verified flag, followers count, friends count, favorites count, statuses count, and listed count.

4.2 Social Network Graphs

The dataset extracted in the above section helps construct the social networks for fake and true news. The tweet and retweet users and their relationship associations generated the edge list. Each of the 75 news items has a CSV file containing all the relevant attributes we discovered. In addition, 75 graphs were generated using Cytoscape showing the links between the users who tweeted a particular news item. Snapshots of two graphs are attached in Fig. 14 [fake] and Fig. 15 [true], respectively.

4.3 Feature Analysis

4.3.1 Exploratory Analysis

The initial phase of the study is exploratory data analysis (EDA). In this phase, we determine how to interpret the data we have, what questions we would like to address and how to structure them, and how to efficiently handle the datasets to get the insights required. We did this using different techniques such as box plots, correlation, and ECDF. A box plot gives a graphical representation of the data and shows how extreme values differ from the majority of the data; it is built from the minimum, maximum, and quartile values. Real and fake news have different characteristics that become apparent when analyzing tweets with box plots. Fig. 16 shows the difference in the distribution of the network features of both sets of news. The average shortest path length in fake news averages 2, compared to 1.81 in real news. Betweenness centrality in fake news is 0.34 compared to 0.32 in real news. Closeness centrality in fake news is 0.68 compared to 0.59 in real news. Eigenvector centrality in fake news is 0.32 compared to 0.37 in real news. PageRank in fake news is 0.06 compared to 0.08 in real news. The average number of communities is 7.86 in fake news, whereas it is 9.18 in real news.
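A minimal sketch of this kind of comparison with pandas and seaborn is shown below, assuming the per-news features have been collected into a data frame with a fake/real label column; the sample values and column names are illustrative.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical per-news feature table: one row per news item, labelled fake or real.
df = pd.DataFrame({
    "label": ["fake", "fake", "real", "real"],
    "betweenness_centrality": [0.35, 0.33, 0.31, 0.32],
    "closeness_centrality": [0.70, 0.66, 0.60, 0.58],
})

# One box plot per network feature, split by label, mirroring the layout of Fig. 16.
features = ["betweenness_centrality", "closeness_centrality"]
fig, axes = plt.subplots(1, len(features), figsize=(8, 3))
for ax, feature in zip(axes, features):
    sns.boxplot(data=df, x="label", y=feature, ax=ax)
plt.tight_layout()
plt.show()
```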
4.3.2 Empirical Cumulative Distribution

For statistical modelling in Python, an empirical distribution function is preferable. The statsmodels library offers the ECDF class for fitting an empirical cumulative distribution function and computing cumulative probabilities, which is useful for data attribute selection and for efficiently plotting and applying percentile thresholds during data exploration. Fig. 17 and Fig. 18 show the summary of the ECDF-based analysis for the network-based features and user-based features of both true and fake news, respectively.

4.3.3 Correlation Analysis

We measured the correlations between features to determine whether the network-based features we identified depend on established characteristics and are independent of each other. For each data set, we computed a correlation matrix that determines the degree of association between features, and used the seaborn (sns) library in Python to visualize the correlations. Fig. 19 [network-based] and Fig. 20 [user-based] present the correlation matrices of the network-based and user-based features, respectively. As can be seen, the diagonal values are 1, and the upper and lower triangular regions contain the same information, so only one of them needs to be presented. A higher correlation value is shown by a darker shade, whereas a lighter color means a weaker correlation. Based on the correlations, we selected the best features from both the network-based and user-based sets so that the models perform better.

4.3.4 Feature Importance

The feature importance attribute of a model can be used to obtain the importance of each attribute in the data. Feature importance assigns a value to every feature in the dataset; the greater the value, the more essential that feature is to the output variable. Tree-based classifiers expose this through a built-in feature importance attribute. The initial feature selection through correlation and value distribution is shown in Tab 3. Based on this feature selection, we selected our features for the machine learning models.
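A compact illustration of these three analyses is sketched below, assuming a pandas data frame of per-news features with a binary label column; the column names and sample values are illustrative, not the exact code of the study.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical feature table; in the study this would hold the per-news features.
df = pd.DataFrame({
    "label": [1, 1, 0, 0, 1, 0],
    "betweenness_centrality": [0.35, 0.33, 0.31, 0.32, 0.36, 0.30],
    "closeness_centrality": [0.70, 0.66, 0.60, 0.58, 0.69, 0.61],
    "followers_count": [120, 45, 300, 210, 80, 150],
})

# 4.3.2: empirical cumulative distribution of one feature for fake vs. real news.
for lbl, grp in df.groupby("label"):
    ecdf = ECDF(grp["betweenness_centrality"])
    plt.step(ecdf.x, ecdf.y, where="post", label=f"label={lbl}")
plt.legend(); plt.show()

# 4.3.3: correlation matrix visualized as a heatmap (darker = stronger correlation).
sns.heatmap(df.drop(columns="label").corr(), annot=True, cmap="Greys")
plt.show()

# 4.3.4: feature importances from a tree-based classifier (values sum to 1).
X, y = df.drop(columns="label"), df["label"]
model = ExtraTreesClassifier(n_estimators=100, random_state=1).fit(X, y)
print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))
```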
5. Implementation

Our approach to this problem is to find the dominant social network features and user features and combine them, so that they can then be used to train different machine learning and deep learning algorithms. We used different types of machine learning algorithms in our model, and the implementation work was done in Python. We test the features on different learning algorithms and choose the one that achieves the best performance. For machine learning, the algorithms include SVM, Logistic Regression, Random Forest, AdaBoost, Decision Tree, ensemble learning (voting classifier), CatBoost, LightGBM, Gradient Boosting, KNN, and Naive Bayes. As per the framework, the dataset is divided into two phases: a training phase, which classifies the data (constructs a model) based on the training set and the values (class labels) of a classifying attribute and uses it to classify new data, and a testing phase, in which we check data that is not part of the training set and obtain the predicted class labels.

5.1 Machine Learning Models Implementation

The dependent variable is always binary when we are dealing with Logistic Regression; its main applications are prediction and calculating the probability of success. For the regression classifier we use a random state of 1 and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm as the solver. A Support Vector Machine creates a line or a hyperplane that separates the data points into distinct classes; SVM maps the training data into a kernel space, and out of the many possibilities we used the linear kernel for our SVM classifier. Naïve Bayes classifiers are highly scalable; a requirement for using them on a learning problem is that the number of parameters is linear in the number of features. A k-nearest-neighbors algorithm stores all available cases and classifies new cases according to a distance function. Random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset; predictive accuracy is improved and overfitting is controlled by averaging. The maximum tree depth is 1 and the split criterion is "gini". Decision trees classify instances according to feature values: each node of the tree represents a feature to be tested in an instance, and each branch represents a value that the node can take. To predict the class, an ensemble voting classifier (EVC) is used with three estimators: logistic regression, decision tree, and SVM. Gradient boosting classifiers combine many weak learners into a strong predictive model. Fifty estimators with a depth of 2 are utilized in the random forest. LightGBM handles huge amounts of data with ease; it uses a gradient boosting framework based on tree algorithms that grow the trees leaf-wise (vertically), with the default learning rate of 0.1. CatBoost is a first-rate classifier that requires little parameter tuning and has an extensible GPU version; it helped reduce overfitting and increased the accuracy of the model.
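The following is a minimal sketch of this training setup using scikit-learn (LightGBM and CatBoost would come from their own packages); the file name, split ratio, and most hyper-parameters are illustrative assumptions rather than the exact values used in the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# X: combined network-based and user-based features, y: fake (1) / real (0) labels.
df = pd.read_csv("combined_features.csv")      # hypothetical file name
X, y = df.drop(columns="label"), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "LogisticRegression": LogisticRegression(random_state=1, solver="lbfgs"),
    "SVM": SVC(kernel="linear"),
    "NaiveBayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=50, max_depth=2, criterion="gini"),
    "AdaBoost": AdaBoostClassifier(),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, max_depth=2),
    "EVC": VotingClassifier(estimators=[
        ("lr", LogisticRegression(random_state=1)),
        ("dt", DecisionTreeClassifier()),
        ("svm", SVC(kernel="linear")),
    ]),
}

# Fit every model and report its test accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```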
5.2 Deep Learning Models Implementation

After testing the data with several supervised methods, deep learning models were also chosen, as they produce an efficient output layer based on the weights of the input layers. The two most famous deep learning methods, the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN), were implemented to see how well our data fit these models. Such algorithms are suited to various classification tasks on different datasets and have their own properties and performance. We use Keras as the neural network library, since it provides easy and reliable high-level APIs and follows best practices to reduce the cognitive burden on users. Unlike normal feed-forward neural networks, RNNs rely on prior outputs to predict upcoming information, which is extremely helpful when dealing with sequential data. An LSTM network is a kind of recurrent neural network that has LSTM cell blocks instead of the usual neural network layers.
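As an illustration, a minimal Keras LSTM classifier over the combined feature vectors might look like the sketch below; the layer sizes, the placeholder data, and the reshaping of each feature vector into a single-step sequence are assumptions for illustration, not the exact architecture summarized in Fig. 24.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 16                                               # assumed size of the combined feature vector
X_train = np.random.rand(100, n_features).astype("float32")   # placeholder data
y_train = np.random.randint(0, 2, size=(100,))                # placeholder fake/real labels

# Treat each feature vector as a length-1 sequence so it can feed an LSTM layer.
model = keras.Sequential([
    layers.Input(shape=(1, n_features)),
    layers.LSTM(32),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # binary fake/real output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train.reshape(-1, 1, n_features), y_train, epochs=10, batch_size=16, verbose=0)
```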
For the CNN model, the architectural structure with its number of parameters is given; comparing the number of parameters in the feature learning part of the network and in the fully connected part, most of the parameters come from the fully connected component. In terms of computing cost, the CNN is very effective. The architectural diagrams for the RNN and CNN are shown in Fig. 22 and Fig. 23, and the summary of the two models is shown in Fig. 24.

6. Results And Analysis

This chapter lists the results obtained by executing our proposed methodology on the aforementioned dataset. We list the performance of all the different models of our methodology and provide insight into the results.

6.1 Feature Selection

Our approach to this problem is to find the dominant social network features and user features and combine them, so that they can then be used to train different machine learning algorithms. We determined the optimal set of hyper-parameters by testing the performance of our models on the development set for different parameter combinations. We analyze the importance of the features in the different machine learning models, using a feature selection library for feature importance and feature engineering. For evaluation we used the feature importance module, which calculates the relative weight of each feature in the model; the weights sum to 1. We started with both the network-based and user-based features combined together and trained the machine learning models on them. Fig. 13 shows the best features of the top-performing algorithms.

6.2 Result of Network based and User based features separately

We test the features on different learning algorithms and choose the one that achieves the best performance. We used the ROC-AUC curve to validate the performance of the machine learning algorithms, and metrics such as accuracy, precision, recall, and F1 score were also calculated. All models were tested on network-based features, on user-based features, and on combinations of the two (the Cartesian product of both feature sets, and the mean, median, and mode of the user features combined with the network features). Tab 5 shows the results for the network features and the user features. Among the mean, median, and mode combinations, the median of the user features combined with the network features performed best; Tab 6 shows its results.
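One way to read the feature combination described above is the following pandas sketch, which aggregates the per-user features of each news item by their median and joins them with the per-news network features; the file and column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical inputs: one row per (news item, user) and one row per news item.
user_df = pd.read_csv("user_features.csv")        # columns: news_id, followers_count, friends_count, ...
network_df = pd.read_csv("network_features.csv")  # columns: news_id, avg_degree, betweenness_centrality, ..., label

# Median of each user-based feature over all users involved in a news item.
user_median = (user_df.groupby("news_id")
               .median(numeric_only=True)
               .add_prefix("median_")
               .reset_index())

# Join the aggregated user features with the per-news network features.
combined = network_df.merge(user_median, on="news_id", how="left")
combined.to_csv("combined_features.csv", index=False)   # table used to train the ML models
```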
6.3 Result for Combined Features

In our experiment, we extract user features and network-based features and combine them to train the ML models. The KNN and Naïve Bayes models were tested on the features mentioned above and give 76% and 79% accuracy, respectively. The SVM and Logistic Regression models give 72% and 81%, respectively. Then the Decision Tree and EVC models were run; this time the performance was slightly better than before, with 80% accuracy for both. Thirdly, Random Forest was run to observe any further enhancement over the previous results; a 6% improvement was observed at this stage, as the testing accuracy of RF is 86%. AdaBoost reached 91%. Finally, we applied Gradient Boosting, LightGBM, and CatBoost, which performed well, with testing accuracies of 97%, 97%, and 98%, respectively. Clearly, CatBoost has the highest test accuracy at 98%. An overall comparison of all the machine learning algorithms is shown in Tab 7; the highest accuracy is shown by CatBoost.

6.4 Comparison of Results and Performance

We list the performance of the different models using graphs in order to analyze them. We consider a scheme to be the top performer in an experiment when it yields an accuracy better than all the other models involved in the experimentation. We also applied deep learning techniques to our dataset; the accuracies achieved by the deep learning algorithms RNN and CNN are 99% and 98%, respectively, as shown in Fig. 25. We can see that the deep learning models and CatBoost performed best among all models.

7. Evaluation

7.1 Precision, Recall and F1 Score

These measures are widely used in machine learning and allow us to assess a classifier's performance from various perspectives. Precision measures the fraction of all news identified as fake that is actually annotated as fake news, addressing the significant issue of recognizing fake news. Recall measures the sensitivity, i.e., the fraction of annotated fake news stories that are correctly identified as fake news. F1 combines precision and recall and provides an overall measure of prediction efficiency for detecting false news. Note that for precision, recall, F1, and accuracy, the higher the value, the better the output. For the machine learning algorithms, Tab 8 and Tab 9 compare the precision, recall, and F1-score on network features and combined features.

7.2 Metrics TP, TN, FN and FP

Tab 10 and Tab 11 show the evaluation metrics for the machine learning models on network-based features and combined features. We treat fake news detection as a classification problem in which we forecast whether a news piece is fake or real. True Positive (TP): fake news items correctly marked as fake news; True Negative (TN): true news items correctly marked as true news; False Negative (FN): fake news items incorrectly marked as true news; False Positive (FP): true news items incorrectly marked as fake news.
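These quantities can be computed directly with scikit-learn, as in the following sketch; the fitted `model`, `X_test`, and `y_test` are assumed to come from the training setup sketched earlier, with fake news encoded as the positive class (1).

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, confusion_matrix)

# Assumed to exist from the earlier training sketch.
y_pred = model.predict(X_test)

precision = precision_score(y_test, y_pred)   # fraction of predicted fake that is truly fake
recall = recall_score(y_test, y_pred)         # fraction of true fake that is detected
f1 = f1_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Confusion matrix is laid out as [[TN, FP], [FN, TP]] for labels (0 = real, 1 = fake).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(precision, recall, f1, accuracy, (tp, tn, fn, fp))
```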
7.3 Hyper Parameters for Models

We determined the optimal set of hyper-parameters by testing the performance of our models on the development set for different parameter combinations. Tab 12 shows the parameters that gave the best results for the different algorithms on network-based features, and Tab 13 shows them for the combined features.

7.4 ROC – Accuracy Score Graph

The ROC curve is a graph showing the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis; it is used to assess the classification of all true and false cases. Both rates range from 0, meaning none, to 1, meaning all. The ROC curves for all machine learning models are attached in Fig. 26, which clearly shows that gradient boosting has the highest test accuracy, with the best ROC accuracy score on network-based features, while CatBoost performed best on the combined features, as shown in Fig. 27.

8. Conclusion

There is a lot of misrepresentative information on the internet and social media platforms, and this has become a very prominent present-day problem. In this research work, we have considered the issue of false information/news spread along with the detection and classification of such false information. We performed social network analysis, extracted important node-based features, and combined them with Twitter's user-based features into a single file, with each row in the CSV file representing an individual that appeared in a fake or real news network and all rows labelled as fake or real. We obtained relatively good results with CatBoost, LightGBM, and the deep learning models, since they perform better, as shown in the results.

9. References

1. Zhou, Xinyi, Atishay Jain, Vir V. Phoha, and Reza Zafarani. "Fake news early detection: A theory-driven model." Digital Threats: Research and Practice 1, no. 2 (2020): 1–25.
2. Qureshi, Khubaib Ahmed, Rauf Ahmed Shams Mallick, and Muhammad Sabih. "Fake News Detection." 2020. (Submitted)
3. Zhou, Xinyi, and Reza Zafarani. "Fake news: A survey of research, detection methods, and opportunities." arXiv preprint arXiv:1812.00315 (2018).
4. Al-Zaman, Md. "COVID-19-related Fake News in Social Media." (2020).
5. Cui, Limeng, and Dongwon Lee. "CoAID: COVID-19 Healthcare Misinformation Dataset." arXiv preprint arXiv:2006.00885 (2020).
6. Ahmed, Sajjad, Knut Hinkelmann, and Flavio Corradini. "Combining machine learning with knowledge engineering to detect fake news in social networks - a survey." In Proceedings of the AAAI 2019 Spring
Symposium, vol. 12. 2019.
7. Zhou, Xinyi, and Reza Zafarani. "Network-based fake news detection: A pattern-driven approach." ACM SIGKDD Explorations Newsletter 21, no. 2 (2019): 48–60.
8. Rodríguez, Cristina Pulido, Beatriz Villarejo Carballido, Gisela Redondo-Sama, Mengna Guo, Mimar Ramis, and Ramon Flecha. "False news around COVID-19 circulated less on Sina Weibo than on Twitter. How to overcome false information?" International and Multidisciplinary Journal of Social Sciences (2020): 1–22.
9. Erku, Daniel A., Sewunet A. Belachew, Solomon Abrha, Mahipal Sinnollareddy, Jackson Thomas, Kathryn J. Steadman, and Wubshet H. Tesfaye. "When fear and misinformation go viral: Pharmacists' role in deterring medication misinformation during the 'infodemic' surrounding COVID-19." Research in Social and Administrative Pharmacy (2020).
10. Zhang, Jiawei, Bowen Dong, and S. Yu Philip. "Fakedetector: Effective fake news detection with deep diffusive neural network." In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1826–1829. IEEE, 2020.
11. Shu, Kai, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. "FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media." Big Data 8, no. 3 (2020): 171–188.
12. Ajao, Oluwaseun, Deepayan Bhowmik, and Shahrzad Zargari. "Fake news identification on Twitter with hybrid CNN and RNN models." In Proceedings of the 9th International Conference on Social Media and Society, pp. 226–230. 2018.
13. Mahir, Ehesas Mia, Saima Akhter, and Mohammad Rezwanul Huq. "Detecting Fake News using Machine Learning and Deep Learning Algorithms." In 2019 7th International Conference on Smart Computing & Communications (ICSCC), pp. 1–5. IEEE, 2019.
14. Hoaxy by Indiana University. URL: https://hoaxy.iuni.iu.edu/
15. A Python library for Twitter: Tweepy. URL: https://www.tweepy.org/
16. Pogue, David. "How to Stamp Out Fake News." Scientific American 316, no. 2 (2017): 24–24.
17. Poynter Institute, 2020. The International Fact-Checking Network. URL: https://www.poynter.org/ifcn/
18. Oyeyemi, Sunday Oluwafemi, Elia Gabarron, and Rolf Wynn. "Ebola, Twitter, and misinformation: a dangerous combination?" BMJ 349 (2014).
19. Glowacki, Elizabeth M., Allison J. Lazard, Gary B. Wilcox, Michael Mackert, and Jay M. Bernhardt. "Identifying the public's concerns and the Centers for Disease Control and Prevention's reactions during a health crisis: An analysis of a Zika live Twitter chat." American Journal of Infection Control 44, no. 12 (2016): 1709–1711.
20. Miller, Michele, Tanvi Banerjee, Roopteja Muppalla, William Romine, and Amit Sheth. "What are people tweeting about Zika? An exploratory study concerning its symptoms, treatment, transmission, and prevention." JMIR Public Health and Surveillance 3, no. 2 (2017): e38.
21. Zhou, Xinyi, and Reza Zafarani. "Fake news: A survey of research, detection methods, and opportunities." arXiv preprint arXiv:1812.00315 (2018).
22. Ma, Jing, Wei Gao, and Kam-Fai Wong. "Rumor detection on Twitter with tree-structured recursive neural networks." Association for Computational Linguistics, 2018.

Declarations

Conflict of Interest: On behalf of all authors, the corresponding author states that there is no conflict of interest.

Tables

Tables are available in the Supplementary Files section.

Figures

Figure 1: Proposed Framework
Figure 14: Fake News propagation graph
Figure 15: Real News propagation graph
Figure 16: Box plots demonstrating the differences in the distribution of network-based features for true news and fake news
Figure 17: Empirical Cumulative Distribution Function (ECDF) for network-based features
Figure 18: Empirical Cumulative Distribution Function (ECDF) for user-based features
Figure 19: The correlation between network-based features in the real and fake data sets
Figure 20: The correlation between user-based features in the real and fake data sets
Figure 21: Network features correlation for true and false news
Figure 22: RNN architecture diagram
Figure 23: Architecture diagram of CNN
Figure 24: Summary of RNN and CNN models
Figure 25: The overall comparison of all the ML and DL models
Figure 26: ROC for all ML models
Figure 27: Gradient boosting for network-based features and CatBoost for combined features
Supplementary Files

This is a list of supplementary files associated with this preprint: Tables.docx