International Journal of Advanced Research in Engineering and Technology (IJARET)
Volume 11, Issue 12, December 2020, pp.611-620, Article ID: IJARET_11_12_064
Available online at http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=11&IType=12
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
DOI: 10.34218/IJARET.11.12.2020.064

EARLY DETECTION OF SIMILAR FAKE
ACCOUNTS ON TWITTER USING THE
RANDOM FOREST ALGORITHM
Dr. Mohammed Ali Alhariri
College of Computing and Information Technology,
Taif University, Saudi Arabia

ABSTRACT
A major issue for social-media platforms is the problem of fake accounts with
different aims and objectives. A similar fake account is like having access to someone’s
specific identity, which may impact that person’s life in the real world. Artificial
intelligence is leading in dealing with these issues, because its machine-learning
methodology can provide early detection of similar fake accounts on Twitter. In this
work, we analyze the early detection of similar fake accounts on Twitter using data
accessed through the Twitter Application Programming Interface (API). The analysis
mainly uses the following features, selected on the basis of the confusion matrix:
default_profile, default_profile_image, friends_count, statuses_count, followers_count,
listed_count, profile_background_image, verified, name, and id. These are the data
features we chose to enable early detection of similar fake accounts, and we used the
random forest algorithm in the model. We find that overall the model works better than
other approaches, and the random forest algorithm provides impressive results even in
the validation phase. The random forest results depend upon the features selected to
identify the similar fake accounts. The model produced impressive results in the early
detection of similar fake accounts on Twitter.
Key words: Twitter, API, Confusion Matrix, Random forest algorithm and Fake
accounts
Cite this Article: Mohammed Ali Alhariri, Early Detection of Similar Fake Accounts
on Twitter Using the Random Forest Algorithm, International Journal of Advanced
Research in Engineering and Technology,11(12), 2020, pp. 611-620.
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=11&IType=12

1. INTRODUCTION
The increased use of social-media platforms requires increased focus on security, the integrity
of the systems, the management of traffic by the platform architecture, and reliability. While
different features combine to make a social-media platform ultimately a source of joy, various
unpleasant elements create alarming situations (Jianqiang, and Xiaolin 2017). For example, the
use of social-media platforms requires that the data in an account retain its integrity. The
account holder thus needs a secure platform that cannot be accessed by a similar fake account.
However, the number of similar fake accounts is increasing on social-media platforms, so
there is a need to implement machine-learning techniques to overcome this issue (Asghar,
et al. 2018).
Here, we analyze Twitter, as it is one of the fastest-growing social-media platforms. We
accessed its Application Programming Interface (API) for this task in order to implement a
machine-learning technique using some selected features. In the early development of social
media, before machine-learning algorithms and feature selection were implemented, it was hard
to detect similar fake accounts. Today, machine-learning algorithms provide promising results
in the detection of similar fake accounts (Caruccio et al. 2018, December). Selecting specific
features to assist in detecting similar fake accounts on Twitter using the API improves the
results.
In this work, we used the Twitter API to collect the data and employed multiple features to
detect similar accounts. We then incorporated the multiple features in a machine-learning
algorithm (Zimbra, et al. 2018). We constructed the model to detect similar accounts using the
random forest algorithm, and in that model we performed supervised learning.

2. LITERATURE REVIEW
In the early years of social-media platforms some two decades ago, such platforms had fewer
data-handling issues, as the number of users was also smaller at that time. However, the number
is now increasing exponentially, and the leading global-communication applications tend to
affect the daily routine of every single person. Communication is increasingly turning away
from other means and tending toward social-media platforms. Among the main social-media
platforms, Twitter is one of the leading platforms, and its traffic is growing as the number
of users increases. Users of this platform require stability in the form of data reliability
(Karakaşlı, et al. 2019). However, the number of fake accounts on social-media platforms
is also increasing.
There are different types of fake accounts with different aims and objectives. Similarly, fake
accounts have different attributes, and they may contain facts and figures as well that contribute
to miscommunication over the social-media platform. Data redundancy is also observed in fake
accounts on social-media platforms (Rostami, and Karbasi 2020). Thus, fake accounts on
social-media platforms are a leading source of false information, and in fact similar fake
accounts have been developed using celebrities to convey false communication. The misuse of
similar accounts on social-media platforms is increasing, and it is impacting the real world by
promoting miscommunication, which may also include criminal aspects. It is thus necessary to
eliminate fake accounts that cause such diverse impacts on social-media platforms.
The rules from Twitter are also changing rapidly, and there are different methods for
accessing the data from Twitter. One can use a dataset of tweets, access the data through the
API, or use some other tools to obtain the data (Gurajala, et al. 2016).
Artificial-intelligence-based solutions using machine-learning algorithms perform better and
produce better results than other methods. In the early stages of this approach, the clustering
method was used; later, semi-supervised methods were used to produce machine-learning and
artificial-intelligence-based solutions. Feature selection is an important phase in utilizing
artificial intelligence (El-Mawass, et al. 2020). The selection of specific features depends
upon many factors, the most important being what to include and why to include it on the basis
of those features (Çıtlak, et al. 2019).
The increasing daily impact of fake accounts on social media leads to the spread of false
information. To eliminate such accounts and reduce their impact on social media, machine
learning is used to obtain improved results. In generating the machine-learning model, feature
extraction plays a vital role, as the features considered drive the results (Singh et al. 2018).
Before features were used, machine-learning-based models used clustering techniques to obtain
results. However, clustering techniques produce less accurate results than feature-based models.
Spam accounts were originally developed for unethical tasks. In the early stages, spam accounts
were used only for marketing purposes, but later some users made random fake accounts.
Later still, such accounts were developed and used unethically on social-media platforms,
sometimes even to mount denial-of-service attacks, and accounts were developed to
use information that impacts the lives of individuals (Rahman et al. 2019). These issues can be
resolved using machine-learning and artificial-intelligence-based methods.

3. METHODOLOGY
To address the issue accurately, it is necessary to have clear knowledge of similar fake accounts
on social-media platforms, because this matters in choosing the selection criteria. There are
different phases on which all such work depends. First, it requires getting access to the Twitter
API, which remains a difficult real-time task (Burgess, and Bruns 2012). Our work is based on
feature-based analysis and comparisons of features to provide early detection of similar fake
accounts on Twitter (Kim et al. 2020).

4. DATA ACCESS
The first step is to gain access to the database of the social-media platform, that is, Twitter.
However, getting access to the Twitter database requires providing some information; access is
granted only after this information has been authenticated and the purpose of the access is
known (Bruns 2020). The first two times we tried this, access was denied, but access was later
granted to enable us to perform the research-based tasks. The access policies of
Twitter may differ according to the situation and time. Twitter previously allowed API access
easily, but nowadays it is quite difficult to gain access to the Twitter API (Sharma, et al.
2020).

5. THE TWITTER API
The general method of obtaining permission and accessing the Twitter API is to log in with a
developer account using the URL developer.twitter.com. To avoid any unlawful activity,
Twitter continually changes the rules, sometimes even on a daily basis (Karami, et al. 2020).
After the reason for the request has been described and access has been granted, a unique key
is assigned that allows access to the Twitter data and makes it possible to perform the early
detection of similar Twitter accounts.

6. FEATURE SELECTION
The four main objects on Twitter that are most likely to be accessed are entities, places, users,
and tweets. Selecting specific features to use in order to obtain good results requires knowledge
of how to detect a similar Twitter account (Srivastava, et al. 2019). The data accessed using the
Twitter API mainly comprise the following features: default_profile, default_profile_image,
friends_count, statuses_count, followers_count, listed_count,
profile_background_image, verified, name, and id. These are the data features we chose to
enable early detection of similar fake accounts.
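As a minimal sketch of how the listed features might be turned into model inputs, the Python snippet below flattens one user record into numeric values. The paper does not state how each feature was encoded, so the encodings here (booleans as 0/1, the background image as a present/absent flag, the name reduced to its length) and the helper name extract_features are assumptions made for illustration, and the sample record is invented.

```python
# Feature names follow the list above; the encodings are illustrative assumptions.
NUMERIC_FEATURES = ["friends_count", "statuses_count", "followers_count", "listed_count"]
BOOLEAN_FEATURES = ["default_profile", "default_profile_image", "verified"]


def extract_features(user):
    """Turn one user record (a plain dict) into a flat dict of numeric features."""
    row = {"id": user["id"]}  # kept as an identifier, not as a predictive feature
    for key in NUMERIC_FEATURES:
        row[key] = int(user.get(key) or 0)
    for key in BOOLEAN_FEATURES:
        row[key] = int(bool(user.get(key, False)))
    # Assumption: the background image is reduced to a present/absent flag.
    row["profile_background_image"] = int(bool(user.get("profile_background_image")))
    # Assumption: the display name is reduced to its character length.
    row["name_length"] = len(user.get("name") or "")
    return row


# Invented record, used only to show the shape of the output.
sample_user = {
    "id": 1001,
    "name": "Example User",
    "default_profile": True,
    "default_profile_image": False,
    "friends_count": 150,
    "statuses_count": 42,
    "followers_count": 12,
    "listed_count": 0,
    "profile_background_image": None,
    "verified": False,
}
features = extract_features(sample_user)
print(features)
```

Keeping the encoding in one small function makes it easy to swap in a different representation for any single feature without touching the rest of the pipeline.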

7. CONFUSION MATRIX
The confusion matrix is a tool for choosing between the features. It is later normalized, and the
most relevant features are chosen to enable early detection of similar fake accounts. Figure 1
shows the confusion matrix for the training data used to distinguish between fake and genuine
accounts for the detection of similar fake accounts (Hino, and Fahey 2019). In this work, the
confusion matrix is used to analyze the training data by feature, and the decision between
fake and genuine is then made on that basis (Safari, and Sanner 2019).

Figure 1. The confusion matrix of the decision between fake and genuine before normalization
The confusion matrix has four basic entries: True Positive (TP), True Negative (TN), False
Positive (FP), and False Negative (FN) (Alperin, et al. 2019). These entries are represented
graphically in Fig. 1, with the color showing the probability that a Twitter account is fake
or genuine on the basis of the selected features. The normalized confusion matrix, with the
scale ranging from 0.00 to 1.00, is shown in Fig. 2 (Zeng 2020).
The confusion matrix basically describes the performance of the model relative to the test
data, because in this case the true values are known. This eases the decision making in the early
detection of similar fake accounts on Twitter (Xu, et al. 2020). A further explanation of the
confusion matrix is provided below.

Table 1 Confusion matrix

                 Predicted False   Predicted Genuine
Actual False     TN                FP
Actual Genuine   FN                TP

7.1. True Positive
In this case, the prediction is that these are similar fake accounts, and they actually are
similar fake accounts.

7.2. True Negative
In this case, the prediction is that these are not similar fake accounts, and they actually are not
similar fake accounts.

7.3. False Positive
In this case, the prediction is Yes, these are similar fake accounts, but they actually are not
similar fake accounts.

7.4. False Negative
In this case, the prediction is that these are not similar fake accounts, but they actually are
similar fake accounts.
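The four cases in Sections 7.1 to 7.4 can be illustrated with a short Python sketch that counts them from paired actual and predicted labels. It follows those subsections in treating "fake" as the positive class. The labels are invented, and normalizing each cell by the size of its actual class is an assumption, since the normalization method is not stated here.

```python
def confusion_counts(actual, predicted, positive="fake"):
    """Count TP, TN, FP and FN, treating `positive` as the class being detected."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def normalize_by_actual(counts):
    """Assumed normalization: divide each cell by the size of its actual class."""
    actual_pos = counts["TP"] + counts["FN"]
    actual_neg = counts["TN"] + counts["FP"]
    return {
        "TP": counts["TP"] / actual_pos,
        "FN": counts["FN"] / actual_pos,
        "TN": counts["TN"] / actual_neg,
        "FP": counts["FP"] / actual_neg,
    }


# Invented labels for eight accounts, used only to exercise the functions.
actual = ["fake", "fake", "fake", "genuine", "genuine", "genuine", "genuine", "fake"]
predicted = ["fake", "fake", "genuine", "genuine", "genuine", "fake", "genuine", "fake"]

counts = confusion_counts(actual, predicted)
normalized = normalize_by_actual(counts)
print(counts)
print(normalized)
```

With this normalization every value lies between 0.00 and 1.00, and the two cells belonging to the same actual class sum to 1.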

Figure 2 The normalized confusion matrix for early detection of similar fake accounts on Twitter
The normalized confusion matrix shown in Fig. 2 is well discussed (Luque et al. 2019) in the
context of the early detection of similar fake accounts.

8. THE RANDOM FOREST ALGORITHM
Gaining successful access to the Twitter data using the API is a step toward an improved
approach to the early detection of similar fake accounts. We chose the random forest
machine-learning algorithm for use in the model in order to obtain high accuracy (Yuan, et al.
2019). The reasons for this choice are that the random forest algorithm produces high-accuracy
results and that it also handles missing values in the data, which leads to successful results.
The random forest algorithm combines many decision trees, and it is so named
because it uses random sampling of the training data points while building each decision tree
(Breuer, et al. 2020, April).
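To make the description above concrete, the following is a small from-scratch sketch of the idea in Python: each tree is trained on a bootstrap sample of the data, each node considers a random subset of the features, and the forest predicts by majority vote. It is not the implementation used in this work. The Gini split criterion, the tree depth, the number of trees, and the toy data (columns followers_count, friends_count, statuses_count, default_profile_image; label 1 for fake, 0 for genuine) are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return (score, feature, threshold) with the lowest weighted Gini impurity."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def build_tree(rows, labels, rng, max_depth, n_sub):
    """Grow one tree; every node looks at a random subset of n_sub features."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority  # leaf: a plain class label
    feature_ids = rng.sample(range(len(rows[0])), n_sub)
    split = best_split(rows, labels, feature_ids)
    if split is None:
        return majority
    _, f, threshold = split
    left = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right = [i for i, row in enumerate(rows) if row[f] > threshold]
    return (
        f,
        threshold,
        build_tree([rows[i] for i in left], [labels[i] for i in left], rng, max_depth - 1, n_sub),
        build_tree([rows[i] for i in right], [labels[i] for i in right], rng, max_depth - 1, n_sub),
    )


def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


def fit_forest(rows, labels, n_trees=25, max_depth=3, seed=0):
    """Train n_trees trees, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n_sub = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(
            build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng, max_depth, n_sub)
        )
    return forest


def predict_forest(forest, row):
    """Majority vote over the individual trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


# Synthetic toy data, invented for this sketch (1 = fake, 0 = genuine).
rows = [
    [2, 900, 1, 1], [5, 1200, 0, 1], [8, 1500, 4, 1], [3, 2000, 2, 1], [10, 1100, 3, 1],
    [150, 180, 340, 0], [300, 210, 1200, 0], [220, 95, 75, 0], [510, 400, 980, 0], [95, 150, 410, 0],
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

forest = fit_forest(rows, labels)
print(predict_forest(forest, [1, 2500, 0, 1]))
print(predict_forest(forest, [800, 50, 5000, 0]))
```

In practice a library implementation would normally be preferred; the from-scratch version is shown only because it exposes the two sources of randomness (bootstrap rows and per-node feature subsets) that give the algorithm its name.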

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(F_i - y_i\right)^2 \qquad (1)$$
In the random forest equation (1) above, N represents the number of features used to detect
similar accounts, F_i is the value returned from Twitter, and y_i is the original value used
for feature i. This simple equation was used to develop a random forest
model for the early detection of similar fake accounts on Twitter (Balaanand, et al. 2019). This
model produces results quickly and provides improved accuracy.
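Assuming equation (1) is the usual mean of the squared differences between the returned values F_i and the original values y_i, a short worked example in Python looks as follows. The function name and the numbers are invented for illustration.

```python
def mean_squared_error(returned, original):
    """Mean of the squared differences between paired values F_i and y_i."""
    if not returned or len(returned) != len(original):
        raise ValueError("both sequences must be non-empty and of equal length")
    n = len(returned)
    return sum((f - y) ** 2 for f, y in zip(returned, original)) / n


# Invented values for N = 4 features.
F = [1.0, 0.0, 3.0, 2.0]
y = [1.0, 1.0, 2.0, 4.0]

# Squared differences are 0, 1, 1 and 4, so the mean is 6 / 4.
mse = mean_squared_error(F, y)
print(mse)
```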
Overall, the random forest algorithm is a fast-processing approach that can be used with a
feature-selection model to identify related data using features. It has proven its ability by
providing high accuracy (up to 95%) in the results obtained from the training example. Fig. 3
shows that the training score is 1, while the cross-validation score reached up to 95% accuracy,
which is quite good in overall comparisons (Jiang, et al. 2019).

Figure 3 Learning curve for cross-validation
Figure 3 shows the cross-validation scores achieved when the model was prepared and
tested on the limited Twitter data obtained using the Twitter API, based on the selected
features. The tools used were Anaconda and the Python programming language. Figure 3 shows a
cross-validation score of 0.94, represented in green, whereas the red color represents the
maximum training score. This figure shows that the maximum training score was achieved on
the training data. The maximum training score is obtained using the selected features for the
training data (Elsayed, et al. 2019). This figure also shows the accuracy achieved using the
data source accessed through the API. On this basis, we conclude that for early detection of
similar fake accounts on Twitter, cross-validation produces results with 94% accuracy.
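The cross-validation settings are not stated here, so the following Python snippet is only a minimal sketch of k-fold cross-validation under assumed settings: five contiguous folds, a single invented followers_count column, and a deliberately simple threshold model standing in for the random forest. The function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(rows, labels, fit, predict, k=5):
    """Average accuracy on the held-out fold over k train/test splits."""
    scores = []
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, rows[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def fit_threshold(rows, labels):
    """Stand-in model (not a random forest): midpoint between the class means."""
    fake = [r[0] for r, y in zip(rows, labels) if y == 1]
    genuine = [r[0] for r, y in zip(rows, labels) if y == 0]
    return (sum(fake) / len(fake) + sum(genuine) / len(genuine)) / 2.0


def predict_threshold(threshold, row):
    """Label an account as fake (1) when its value falls below the threshold."""
    return 1 if row[0] < threshold else 0


# Invented followers_count values; labels alternate so every fold has both classes.
rows = [[3], [250], [8], [120], [5], [400], [12], [90], [2], [310]]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

cv_score = cross_val_accuracy(rows, labels, fit_threshold, predict_threshold, k=5)
print(cv_score)
```

Because every account is scored by a model that never saw it during training, the cross-validation score is usually lower than the training score, which matches the gap between the two curves described for Figure 3.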

9. RESULTS
In the first step, we selected the features having the highest probability of being involved in
the detection of similar fake accounts on Twitter. These features have a direct impact on
reaching a decision based on similarities between the accounts (Sahoo, and Gupta 2021). The
overall model produced impressive results with an accuracy of 0.94, which is far better than
most other approaches.

Figure 4 The accuracy of the random forest algorithm
Figure 4 shows the accuracy obtained after accessing the data using the Twitter API as
the data source. The blue line in Figure 4 represents the Accuracy (ACU) obtained in the
results against the selected features. Initially, at the start of the process of early
detection of a similar fake account, the model shows a true-positive rate of around 0.8, while
the maximum accuracy achieved is 0.94. This is impressive accuracy for the early detection of
similar fake accounts on Twitter (Elsayed, et al. 2019).
The y-axis of Figure 4 shows the true-positive rate from the confusion matrix, while the
x-axis shows the false-positive rate. After validation, the validation accuracy obtained is
0.98, while the accuracy remains at 0.94. The overall
working capability of this model for the early detection of fake accounts on Twitter is better
than that achieved with previous methods (Zervopoulos, et al. 2020).
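A plot of true-positive rate against false-positive rate is commonly called a ROC curve. As a hedged illustration of how the points of such a curve can be computed, the Python sketch below sweeps a decision threshold over classifier scores and sums the area under the resulting curve with the trapezoidal rule. The labels and scores are invented and are not the values behind Figure 4.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) pairs for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under (x, y) points that are ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented labels (1 = fake, 0 = genuine) and invented classifier scores.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

Each point corresponds to one choice of threshold, so the curve shows how the true-positive rate can be traded against the false-positive rate for the same model.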

 Figure 5 Accuracy validation

We analyzed selected features using the Twitter API with the goal of achieving early
detection of similar fake accounts. The validation accuracy and the accuracy remain parallel in
the initial phase (Pakaya, et al. 2019). However, in the final phase, the validation accuracy is
0.98, while the accuracy is 0.94. Figure 5 shows the results obtained for the early detection
of similar fake accounts using the random forest algorithm.

10. CONCLUSION
The whole task of detecting similar fake accounts on Twitter includes several different phases.
In the first phase, the data are accessed using the Twitter API, which provides data having
different features. The different features are then taken together to obtain the normalized
confusion matrix. Understanding feature selection is an important part of the process used in
the machine-learning model, as the results depend upon
the selection of the features. In this work, we selected the features using the Twitter API, and
we accessed the data and compared them using the random forest algorithm. The results provide
improved accuracy and enable the early detection of similar fake accounts on Twitter.

REFERENCES
[1] Jianqiang, Z., & Xiaolin, G. (2017). Comparison research on text pre-processing methods on
 twitter sentiment analysis. IEEE Access, 5, 2870-2879.

[2] Asghar, M. Z., Kundi, F. M., Ahmad, S., Khan, A., & Khan, F. (2018). T‐SAF: Twitter
 sentiment analysis framework using a hybrid classification scheme. Expert Systems, 35(1),
 e12233.

[3] Caruccio, L., Desiato, D., & Polese, G. (2018, December). Fake account identification in social
 networks. In 2018 IEEE International Conference on Big Data (Big Data) (pp. 5078-5085).
 IEEE.

[4] Zimbra, D., Abbasi, A., Zeng, D., & Chen, H. (2018). The state-of-the-art in Twitter sentiment
 analysis: A review and benchmark evaluation. ACM Transactions on Management Information
 Systems (TMIS), 9(2), 1-29.

[5] Karakaşlı, M. S., Aydin, M. A., Yarkan, S., & Boyaci, A. (2019). Dynamic Feature Selection
 for Spam Detection in Twitter. In International Telecommunications Conference (pp. 239-250).
 Springer, Singapore.

[6] Rostami, R. R., & Karbasi, S. (2020). Detecting Fake Accounts on Twitter Social Network
 Using Multi-Objective Hybrid Feature Selection Approach. Webology, 17(1).

[7] Gurajala, S., White, J. S., Hudson, B., Voter, B. R., & Matthews, J. N. (2016). Profile
 characteristics of fake Twitter accounts. Big Data & Society, 3(2), 2053951716674236.

[8] Çıtlak, O., Dörterler, M., & Doğru, İ. A. (2019). A survey on detecting spam accounts on Twitter
 network. Social Network Analysis and Mining, 9(1), 35.

[9] Singh, N., Sharma, T., Thakral, A., & Choudhury, T. (2018, June). Detection of fake profile in
 online social networks using machine learning. In 2018 International Conference on Advances
 in Computing and Communication Engineering (ICACCE) (pp. 231-234). IEEE.

[10] Oxford Analytica. Advertisers up scrutiny of social media fake activity. Emerald Expert
 Briefings, (oxan-db).

[11] Burgess, J., & Bruns, A. (2012). Twitter archives and the challenges of "Big Social Data" for
 media and communication research. M/C Journal, 15(5).

[12] Kim, Y., Nordgren, R., & Emery, S. (2020). The Story of Goldilocks and Three Twitter’s APIs:
 A Pilot Study on Twitter Data Sources and Disclosure. International Journal of Environmental
 Research and Public Health, 17(3), 864.

[13] Bruns, A. (2020). Big social data approaches in Internet studies: The case of Twitter. Second
 international handbook of Internet research, 65-81.

[14] Sharma, V., Sharma, V., Shukla, D., Tanwar, P., & Kumar, B. (2020). Live Twitter Sentiment
 Analysis. Available at SSRN 3609792.

[15] Karami, A., Lundy, M., Webb, F., & Dwivedi, Y. K. (2020). Twitter and research: a systematic
 literature review through text mining. IEEE Access, 8, 67698-67717.

[16] Srivastava, A., Singh, V., & Drall, G. S. (2019). Sentiment Analysis of Twitter Data: A Hybrid
 Approach. International Journal of Healthcare Information Systems and Informatics
 (IJHISI), 14(2), 1-16.

[17] Hino, A., & Fahey, R. A. (2019). Representing the Twittersphere: Archiving a representative
 sample of Twitter data under resource constraints. International journal of information
 management, 48, 175-184.

[18] Alperin, J. P., Gomez, C. J., & Haustein, S. (2019). Identifying diffusion patterns of research
 articles on Twitter: A case study of online engagement with open access articles. Public
 Understanding of Science, 28(1), 2-18.

[19] Safari, K., & Sanner, S. (2019). Optimizing Search API Queries for Twitter Topic Classifiers
 Using a Maximum Set Coverage Approach. arXiv preprint arXiv:1904.10403.

[20] Yuan, D., Miao, Y., Gong, N. Z., Yang, Z., Li, Q., Song, D., ... & Liang, X. (2019, November).
 Detecting fake accounts in online social networks at the time of registrations. In Proceedings of
 the 2019 ACM SIGSAC Conference on Computer and Communications Security (pp. 1423-
 1438).

[21] Breuer, A., Eilat, R., & Weinsberg, U. (2020, April). Friend or Faux: Graph-Based Early
 Detection of Fake Accounts on Social Networks. In Proceedings of The Web Conference
 2020 (pp. 1287-1297).

[22] Balaanand, M., Karthikeyan, N., Karthik, S., Varatharajan, R., Manogaran, G., & Sivaparthipan,
 C. B. (2019). An enhanced graph-based semi-supervised learning algorithm to detect fake users
 on Twitter. The Journal of Supercomputing, 75(9), 6085-6105.

[23] Jiang, X., Li, Q., Ma, Z., Dong, M., Wu, J., & Guo, D. (2019). QuickSquad: A new single-
 machine graph computing framework for detecting fake accounts in large-scale social
 networks. Peer-to-Peer Networking and Applications, 12(5), 1385-1402.

[24] Sahoo, S. R., & Gupta, B. B. (2020). Real-Time Detection of Fake Account in Twitter Using
 Machine-Learning Approach. In Advances in Computational Intelligence and Communication
 Technology (pp. 149-159). Springer, Singapore.

[25] Pakaya, F. N., Ibrohim, M. O., & Budi, I. (2019, October). Malicious Account Detection on
 Twitter Based on Tweet Account Features using Machine Learning. In 2019 Fourth
 International Conference on Informatics and Computing (ICIC) (pp. 1-5). IEEE.

[26] Zervopoulos, A., Alvanou, A. G., Bezas, K., Papamichail, A., Maragoudakis, M., & Kermanidis,
 K. (2020, June). Hong Kong Protests: Using Natural Language Processing for Fake News
 Detection on Twitter. In IFIP International Conference on Artificial Intelligence Applications
 and Innovations (pp. 408-419). Springer, Cham.

[27] El-Mawass, N., Honeine, P., & Vercouter, L. (2020). SimilCatch: Enhanced social spammers
 detection on Twitter using Markov Random Fields. Information Processing &
 Management, 57(6), 102317.

[28] Rahman, M. D., Likhon, A. M., Rahman, A. S., & Choudhury, M. H. (2019). Detection of fake
 identities on twitter using supervised machine learning (Doctoral dissertation, Brac University).

[29] Luque, A., Carrasco, A., Martín, A., & de las Heras, A. (2019). The impact of class imbalance
 in classification performance metrics based on the binary confusion matrix. Pattern
 Recognition, 91, 216-231.

[30] Xu, J., Zhang, Y., & Miao, D. (2020). Three-way confusion matrix for classification: a measure
 driven view. Information Sciences, 507, 772-794.

[31] Zeng, G. (2020). On the confusion matrix in credit scoring and its analytical
 properties. Communications in Statistics-Theory and Methods, 49(9), 2080-2093.

[32] Elsayed, G., Kornblith, S., & Le, Q. V. (2019). Saccader: improving accuracy of hard attention
 models for vision. In Advances in Neural Information Processing Systems (pp. 702-714).
