EARLY DETECTION OF SIMILAR FAKE ACCOUNTS ON TWITTER USING THE RANDOM FOREST ALGORITHM
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
International Journal of Advanced Research in Engineering and Technology (IJARET) Volume 11, Issue 12, December 2020, pp.611-620, Article ID: IJARET_11_12_064 Available online at http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=11&IType=12 ISSN Print: 0976-6480 and ISSN Online: 0976-6499 DOI: 10.34218/IJARET.11.12.2020.064 © IAEME Publication Scopus Indexed EARLY DETECTION OF SIMILAR FAKE ACCOUNTS ON TWITTER USING THE RANDOM FOREST ALGORITHM Dr. Mohammed Ali Alhariri College of Computing and Information Technology, Taif University, Saudi Arabia ABSTRACT A major issue for social-media platforms is the problem of fake accounts with different aims and objectives. A similar fake account is like having access to someone’s specific identity, which may impact that person’s life in the real world. Artificial intelligence is leading in dealing with these issues, because its machine-learning methodology can provide early detection of similar fake accounts on Twitter. In this work, we analyze the early detection of similar fake accounts on Twitter using the Twitter API Application Programming Interface mainly uses the following features based the confusion matrix: default_profile, default_profile_image, friends_count, statuses_count, followers_count, listed_count, listed_count, profile_background_image, verified, name, and id. These are the data features we chose to enable early detection of similar fake accounts, and we used the random forest algorithm in the model. We find that overall the model works better than other approaches, and the random forest algorithm provides impressive results even in the validation phase. The random forest results depend upon the features selected to identify the similar fake accounts. The model produced impressive results in the early detection of similar fake accounts on Twitter. Key words: Twitter, API, Confusion Matrix, Random forest algorithm and Fake accounts Cite this Article: Mohammed Ali Alhariri, Early Detection of Similar Fake Accounts on Twitter Using the Random Forest Algorithm, International Journal of Advanced Research in Engineering and Technology,11(12), 2020, pp. 611-620. http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=11&IType=12 1. INTRODUCTION The increased use of social-media platforms requires increased focus on security, the integrity of the systems, the management of traffic by the platform architecture, and reliability. While different features combine to make a social-media platform ultimately a source of joy, various unpleasant elements create alarming situations (Jianqiang, and Xiaolin2017). For example, the http://www.iaeme.com/IJAERT/index.asp 611 editor@iaeme.com
Early Detection of Similar Fake Accounts on Twitter Using the Random Forest Algorithm use of social-media platforms requires that the data in an account retain its integrity. The account holder thus needs a secure platform that cannot be accessed by a similar fake account. However, the numbers of similar fake accounts are increasing on social-media platforms, so that there is a need to implement machine-learning techniques to overcome this issue (Asghar, et al. 2018). Here, we have analyzed Twitter as a social-media platform, as it is one of the fastest growing social-media platforms. We have used its Application Programming Interface is accessed for this task to implement a machine-learning technique using some selected features. In the early development of social-media, before machine-learning algorithms and feature selection were implemented, it was hard to detect similar fake accounts. Today, machine-learning algorithms provide promising results in the detection of fake similar accounts (Caruccio et al. 2018, December). Selecting specific features to assist in detecting similar fake accounts on Twitter using the API improves the results. In this work, we used the Twitter API to collect the data and employed multiple features to detect similar accounts. We then incorporated the multiple features in a machine-learning algorithm (Zimbra, et al. 2018). We construct the model to detect similar accounts using random forest, and in that model we performed supervised learning. 2. LITERATURE REVIEW In the early years of social-media platforms some two decades ago, such platforms had fewer data-handling issues even the number of users was less at that time. However, the number is now increasing exponentially, with the leading global-communication applications tend to affect the daily routine of every single person. Communication is increasingly turning away from other means and tending toward social-media platforms. Among the main social-media platforms, Twitter is one of the top leading platforms, and its traffic is growing as the number of users increases. Users of this platform require stability in the form of data reliability (Karakaşlı, et al. 2019). However, the number of fake accounts on the social-media platforms is also increasing. There are different types of fake accounts with different aims and objectives. Similarly, fake accounts have different attributes, and they may contain facts and figures as well that contribute to miscommunication over the social-media platform. Data redundancy is also observed in fake accounts on social-media platforms (Rostami, and Karbasi 2020). Thus, fake accounts on social-media platforms are leading source of false information, and in fact similar fake accounts have been developed using celebrities to convey false communication. The misuse of similar accounts on social-media platforms is increasing, and it is impacting the real world by promoting miscommunication, which may also include criminal aspects. It is this necessary to eliminate fake accounts that cause such diverse impacts on social-media platforms. The rules from Twitter are also changing rapidly, and there are different methods for accessing the data from Twitter. One can use the dataset of tweets, or access it through the API, or use some other tools to obtain the data (Gurajala, et al. 2016). Artificial-intelligence-based solutions using machine-learning algorithms perform better and produce better results than other methods. In the early stages of this approach, the clustering method was used; however, later on semi-supervised methods were used to produce machine-learning and artificial- intelligence-based solutions. Feature selection is an important phase in utilizing artificial intelligence (El-Mawass, et al. 2020). The selection of specific features depends upon many factors, the most important being what to include and why to include it on the bases of those features (Çıtlak, et al. 2019). The increasing daily impact of fake accounts on social media leads to the false information. To eliminate such accounts and reduce the impact of fake accounts on social media, machine http://www.iaeme.com/IJARET/index.asp 612 editor@iaeme.com
Dr. Mohammed Ali Alhariri learning is used to obtain improved results. In generating the machine-learning model, feature extraction plays a vital role, as the features considered drive the results (Singh et al. 2018). Before features were used, machine-learning-based models used clustering techniques to obtain results. However, clustering techniques produce less accurate results than feature-based models. Spam accounts were originally developed unethical tasks. In the early stages spam accounts were just used for marketing purposes, but later some users made random fake accounts. However, later on, such accounts were developed and used unethically, on social-media platforms, even sometimes to place denial-of-service attacks, and accounts were developed to use information that impacts the lives of individuals (Rahman et al. 2019). These issues can be resolved using machine learning and artificial-intelligence-based tasks. 3. METHODOLOGY To address the issue accurately, it is necessary to have clear knowledge of similar fake accounts on social-media platforms, because this matters in choosing the selection criteria. There are different phases on which all such work depends. First, it requires getting access to the Twitter API, which remains a difficult real-time task (Burgess, and Bruns 2012). Our work is based on feature-based analysis and comparisons of features to provide early detection of similar fake accounts on Twitter (Kim et al. 2020). 4. DATA ACCESS The first step is to gain access to the database of the social-media platform that is Twitter. However, getting access to the Twitter database requires providing some information, so that after authenticating it and knowing the purpose of having access to the Twitter database, access is granted (Bruns 2020). The first two times, we tried this, access was denied, but later the access was allowed to enable us to perform the research-based tasks. The access policies of Twitter may differ according to the situation and time. Twitter previously allowed API access easily, but nowadays it is quite a difficult task to gain access to the Twitter API (Sharma, et al. 2020). 5. THE TWITTER API The general method of obtaining permission and accessing the Twitter API is to log in with a developer account using the URL developer.twitter.com. To avoid any unlawful activity on Twitter, it continually changes the rules, sometimes even on a daily basis (Karami, et al. 2020). After describing the reason for and getting access to Twitter, a unique key is assigned that allows access to the Twitter data from twitter and allows to perform the early based detection of similar Twitter account. 6. FEATURE SELECTION The four main objects on Twitter that are most likely to be accessed are entities, places, users, and tweets. Selecting specific features to use in order to obtain good results requires knowledge of how to detect a similar Twitter account (Srivastava, et al. 2019). The data accessed using the Twitter API mainly uses the following Features default_profile, default_profile_image, friends_count, statuses_count, followers_count, listed_count, listed_count, profile_background_image, verified, name, and id. These are the data features we chose to enable early detection of similar fake accounts. 7. CONFUSION MATRIX The confusion matrix is a tool for choosing between the features. It is later normalized, and the most relevant features are chosen to enable early detection of similar fake accounts. Figure.1 http://www.iaeme.com/IJARET/index.asp 613 editor@iaeme.com
Early Detection of Similar Fake Accounts on Twitter Using the Random Forest Algorithm shows the confusion matrix for the training data used to distinguish between fake and genuine accounts for the detection of similar fake accounts (Hino, and Fahey 2019). The confusion matrix in this work is used to analyze training data using features, through which later the decision was generated between fake and genuine (Safari, and Sanner 2019). Figure 1. The confusion matrix of the decision between fake and genuine before normalization The Confusion matrix has four basic rules or attributes, which are True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) (Alperin, et al. 2019). These columns are represented graphically in Fig. 1, with the color showing the probability of being false or Genuine of the twitter accounts on the bases of selected features. The normalized confusion matrix, with the scale ranging from 0.00 to 1.00, is shown in Fig2. (Zeng 2020). The confusion matrix basically describes the performance of the model relative to the test data, because in this case the true values are known. This eases the decision making in the early detection of similar fake accounts on Twitter (Xu, et al. 2020). A further explanation of the confusion matrix is provided below. Table 1 confusion matrix Predicted False Predicted Genuine Actual False TN FP Actual Genuine FN TP 7.1. True Positive In this case, the predicted is Yes these are similar fake accounts and Yes they actually are similar fake accounts. 7.2. True Negative In this case, the prediction is that these are not similar fake accounts, and they actually are not similar fake accounts. http://www.iaeme.com/IJARET/index.asp 614 editor@iaeme.com
Dr. Mohammed Ali Alhariri 7.3. False Positive In this case, the prediction is Yes, these are similar fake accounts, but they actually are not similar fake accounts. 7.4. False Negative In this case, the prediction is that these are not similar fake accounts Yes, actually they are similar fake accounts. Figure 2 The normalized confusion matrix for early detection of similar fake accounts on Twitter The normalized confusion matrix shown in fig 2 is well discussed (Luque et al. 2019) in the context of the early detection of similar fake accounts. Blue 8. THE RANDOM FOREST ALGORITHM Gaining successful access to the Twitter data using the API, is an way towards an improved approach of an early based detection of similar fake accounts. We chose the random forest machine-learning algorithm for use in the model in order to obtain high accuracy (Yuan, et al. 2019). The reason for this choice are that the random forest algorithm produces high-accuracy results, and it also handles missing values in the data, which leads to successful results. the random forest algorithm involves the convergence of many decision trees, and it is so-named because it uses random sampling to train the data points while building the decision tree (Breuer, et al. 2020, April). 1 = ∑( − )2 =1 In the equal-representation random forest equation (1) above, N represents the number of features used to detect similar accounts, where Fi is the value returned from Twitter, and yi is the original value used for feature i. This simple equation was used to develop a random forest model for the early detection of similar fake accounts on Twitter (Balaanand, et al. 2019). This model can be used to obtain fast results, and it provides improved results, and accuracy as well. Overall, the random forest algorithm is a fast processing approach that can be used with a feature-selection model to identify related data using features. It has proven its ability by providing high accuracy (up to 95% in accessing results from the training example. Fig3 shows that the training score is 1, while the cross-validation score achieved up to 95% accuracy, which is quite good in overall comparisons (Jiang, et al. 2019).. http://www.iaeme.com/IJARET/index.asp 615 editor@iaeme.com
Early Detection of Similar Fake Accounts on Twitter Using the Random Forest Algorithm Figure.3 Learning curve for Cross-Validation Figure-3 shows the cross-validation scores achieved when the model was prepared and tested on the limited Twitter data obtained using Twiiter API based on selected features the tool used Anaconda and Python as a programming language. The fig3 explains the output value as 0.94 in the cross validation represented in the green color, whereas the red color represents the maximum training score. This figure shows that the maximum training score was achieved on the training data. The maximum training score is obtained using selected features for training data (Elsayed, et al. 2019). This figure also shows the accuracy achieved using the data source accessing it through API access. On the basis, we conclude that for early detection of similar fake account on twitter, the cross-validation produces results with 94% accuracy. 9. RESULTS In the first step, we selected features having the highest probability for being involved in the the detection of similar fake accounts on Twitter. These features have a direct impact on reaching a decision based on similarities between the accounts (Sahoo, and Gupta 2021). The overall model produced impressive results with an accuracy of 0.94, which is far better than most other approaches. http://www.iaeme.com/IJARET/index.asp 616 editor@iaeme.com
Dr. Mohammed Ali Alhariri Figure 4 The accuracy of the random forest algorithm The Figure 4 shows the accuracy obtained after accessing the data using the Twitter API as the data source. The blue line in the figure.4 represents the Accuracy (ACU) obtained in the results against the selected features. Initially, while starting the process of early detection of a similar fake account, the model shows a true-positive rate around 0.8 while the maximum accuracy achieved is 0.94. This is impressive accuracy for the early detection of similar fake account on Twitter (Elsayed, at al. 2019). The measurement of accuracy shown in Figure 4 on the y-axis is the true-positive rate from the confusion matrix, while the x-axis shows the False Positive Rate. After applying the validation the results obtained are 0.98% and the accuracy remains on 0.94%. The overall working capability of this model for the early detection of the fake accounts on Twitter is better than that achieved with previous methods (Zervopoulos, et al. 2020). Figure 5 Accuracy validation http://www.iaeme.com/IJARET/index.asp 617 editor@iaeme.com
Early Detection of Similar Fake Accounts on Twitter Using the Random Forest Algorithm We analyzed selected features using the Twitter API with the goal of achieving early detection of similar fake accounts. In validation accuracy and the accuracy remains parallel in the initial phase (Pakaya, et al. 2019). However, in the final phase, the validation accuracy is 0.98, and while the accuracy is 0.94. Figure 5 shows the results obtained for the early detection of similar fake accounts using the random forest algorithm. 10. CONCLUSION The whole task, of detecting similar fake accounts on Twitter includes several different phases. In the first phase, the data is accessed using the Twitter API, which includes data having different features. The different features are then summed up together to obtain the normalized confusion matrix. Dealing with and increasing the understandability of feature selection is an important part of the process used in the model for machine learning, as the results depend upon the selection of the features. In this work, we selected the features using the Twitter API, and we accessed the data and compared them using the random forest algorithm. The results provide improved accuracy and enable the early detection of similar fake accounts on Twitter. REFERENCES [1] Jianqiang, Z., & Xiaolin, G. (2017). Comparison research on text pre-processing methods on twitter sentiment analysis. IEEE Access, 5, 2870-2879. [2] Asghar, M. Z., Kundi, F. M., Ahmad, S., Khan, A., & Khan, F. (2018). T‐SAF: Twitter sentiment analysis framework using a hybrid classification scheme. Expert Systems, 35(1), e12233. [3] Caruccio, L., Desiato, D., & Polese, G. (2018, December). Fake account identification in social networks. In 2018 IEEE International Conference on Big Data (Big Data) (pp. 5078-5085). IEEE. [4] Zimbra, D., Abbasi, A., Zeng, D., & Chen, H. (2018). The state-of-the-art in Twitter sentiment analysis: A review and benchmark evaluation. ACM Transactions on Management Information Systems (TMIS), 9(2), 1-29. [5] Karakaşlı, M. S., Aydin, M. A., Yarkan, S., & Boyaci, A. (2019). Dynamic Feature Selection for Spam Detection in Twitter. In International Telecommunications Conference (pp. 239-250). Springer, Singapore. [6] Rostami, R. R., & Karbasi, S. (2020). Detecting Fake Accounts on Twitter Social Network Using Multi-Objective Hybrid Feature Selection Approach. Webology, 17(1). [7] Gurajala, S., White, J. S., Hudson, B., Voter, B. R., & Matthews, J. N. (2016). Profile characteristics of fake Twitter accounts. Big Data & Society, 3(2), 2053951716674236. [8] Çıtlak, O., Dörterler, M., & Doğru, İ. A. (2019). A survey on detecting spam accounts on Twitter network. Social Network Analysis and Mining, 9(1), 35. [9] Singh, N., Sharma, T., Thakral, A., & Choudhury, T. (2018, June). Detection of fake profile in online social networks using machine learning. In 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE) (pp. 231-234). IEEE. [10] Oxford Analytica. Advertisers up scrutiny of social media fake activity. Emerald Expert Briefings, (oxan-db). http://www.iaeme.com/IJARET/index.asp 618 editor@iaeme.com
Dr. Mohammed Ali Alhariri [11] Burgess, J., & Bruns, A. (2012). Twitter archives and the challenges of" Big Social Data" for media and communication research. M/C Journal, 15(5). [12] Kim, Y., Nordgren, R., & Emery, S. (2020). The Story of Goldilocks and Three Twitter’s APIs: A Pilot Study on Twitter Data Sources and Disclosure. International Journal of Environmental Research and Public Health, 17(3), 864. [13] Bruns, A. (2020). Big social data approaches in Internet studies: The case of Twitter. Second international handbook of Internet research, 65-81. [14] Sharma, V., Sharma, V., Shukla, D., Tanwar, P., & Kumar, B. (2020). Live Twitter Sentiment Analysis. Available at SSRN 3609792. [15] Karami, A., Lundy, M., Webb, F., & Dwivedi, Y. K. (2020). Twitter and research: a systematic literature review through text mining. IEEE Access, 8, 67698-67717. [16] Srivastava, A., Singh, V., & Drall, G. S. (2019). Sentiment Analysis of Twitter Data: A Hybrid Approach. International Journal of Healthcare Information Systems and Informatics (IJHISI), 14(2), 1-16. [17] Hino, A., & Fahey, R. A. (2019). Representing the Twittersphere: Archiving a representative sample of Twitter data under resource constraints. International journal of information management, 48, 175-184. [18] Alperin, J. P., Gomez, C. J., & Haustein, S. (2019). Identifying diffusion patterns of research articles on Twitter: A case study of online engagement with open access articles. Public Understanding of Science, 28(1), 2-18. [19] Safari, K., & Sanner, S. (2019). Optimizing Search API Queries for Twitter Topic Classifiers Using a Maximum Set Coverage Approach. arXiv preprint arXiv:1904.10403. [20] Yuan, D., Miao, Y., Gong, N. Z., Yang, Z., Li, Q., Song, D., ... & Liang, X. (2019, November). Detecting fake accounts in online social networks at the time of registrations. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (pp. 1423- 1438). [21] Breuer, A., Eilat, R., & Weinsberg, U. (2020, April). Friend or Faux: Graph-Based Early Detection of Fake Accounts on Social Networks. In Proceedings of The Web Conference 2020 (pp. 1287-1297). [22] Balaanand, M., Karthikeyan, N., Karthik, S., Varatharajan, R., Manogaran, G., & Sivaparthipan, C. B. (2019). An enhanced graph-based semi-supervised learning algorithm to detect fake users on Twitter. The Journal of Supercomputing, 75(9), 6085-6105. [23] Jiang, X., Li, Q., Ma, Z., Dong, M., Wu, J., & Guo, D. (2019). QuickSquad: A new single- machine graph computing framework for detecting fake accounts in large-scale social networks. Peer-to-Peer Networking and Applications, 12(5), 1385-1402. [24] Sahoo, S. R., & Gupta, B. B. (2020). Real-Time Detection of Fake Account in Twitter Using Machine-Learning Approach. In Advances in Computational Intelligence and Communication Technology (pp. 149-159). Springer, Singapore. [25] Pakaya, F. N., Ibrohim, M. O., & Budi, I. (2019, October). Malicious Account Detection on Twitter Based on Tweet Account Features using Machine Learning. In 2019 Fourth International Conference on Informatics and Computing (ICIC) (pp. 1-5). IEEE. http://www.iaeme.com/IJARET/index.asp 619 editor@iaeme.com
Early Detection of Similar Fake Accounts on Twitter Using the Random Forest Algorithm [26] Zervopoulos, A., Alvanou, A. G., Bezas, K., Papamichail, A., Maragoudakis, M., & Kermanidis, K. (2020, June). Hong Kong Protests: Using Natural Language Processing for Fake News Detection on Twitter. In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 408-419). Springer, Cham. [27] El-Mawass, N., Honeine, P., & Vercouter, L. (2020). SimilCatch: Enhanced social spammers detection on Twitter using Markov Random Fields. Information Processing & Management, 57(6), 102317. [28] Rahman, M. D., Likhon, A. M., Rahman, A. S., & Choudhury, M. H. (2019). Detection of fake identities on twitter using supervised machine learning (Doctoral dissertation, Brac University). [29] Luque, A., Carrasco, A., Martín, A., & de las Heras, A. (2019). The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition, 91, 216-231. [30] Xu, J., Zhang, Y., & Miao, D. (2020). Three-way confusion matrix for classification: a measure driven view. Information Sciences, 507, 772-794. [31] Zeng, G. (2020). On the confusion matrix in credit scoring and its analytical properties. Communications in Statistics-Theory and Methods, 49(9), 2080-2093. [32] Elsayed, G., Kornblith, S., & Le, Q. V. (2019). Saccader: improving accuracy of hard attention models for vision. In Advances in Neural Information Processing Systems (pp. 702-714). http://www.iaeme.com/IJARET/index.asp 620 editor@iaeme.com
You can also read