Received July 5, 2021, accepted July 16, 2021, date of publication July 19, 2021, date of current version August 13, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3098470

Seeing and Believing: Evaluating the
Trustworthiness of Twitter Users
Department of Computing Sciences, Tampere University, 33720 Tampere, Finland
Corresponding author: Tanveer Khan (

  ABSTRACT Social networking and micro-blogging services, such as Twitter, play an important role in
  sharing digital information. Despite the popularity and usefulness of social media, there have been many
  instances where corrupted users found ways to abuse it, for instance, by raising or lowering a user’s
  credibility. As a result, while social media facilitates an unprecedented ease of access to information, it also
  introduces a new challenge - that of ascertaining the credibility of shared information. Currently, there is
  no automated way of determining which news or users are credible and which are not. Hence, establishing
  a system that can measure the social media user’s credibility has become an issue of great importance.
  Assigning a credibility score to a user has piqued the interest of not only the research community but also
  most of the big players on both sides - such as Facebook, on the side of industry, and political parties on
  the societal one. In this work, we created a model which, we hope, will ultimately facilitate and support the
  increase of trust in the social network communities. Our model collected data and analysed the behaviour
  of 50,000 politicians on Twitter. An influence score, based on several chosen features, was assigned to each
  evaluated user. Further, we classified the political Twitter users as either trusted or untrusted using random
  forest, multilayer perceptron, and support vector machine. An active learning model was used to classify any
  unlabelled ambiguous records from our dataset. Finally, to measure the performance of the proposed model,
  we used precision, recall, F1 score, and accuracy as the main evaluation metrics.

  INDEX TERMS Active learning, influence score, credibility, trust, sentiment analysis, fake news, twitter,
  machine learning.

   The associate editor coordinating the review of this manuscript and approving it for
publication was Muhammad Khurram Khan.

I. INTRODUCTION
The ever-increasing use and popularity of social media platforms has become a sign of our
times: close to half of the world’s population is connected through social media platforms.
The dynamics of communication in all spheres of life have changed. Social media provide a
platform through which users can freely share information simultaneously with a significantly
larger audience than traditional media.
   As social media became ubiquitous in our daily lives, both its positive and negative impacts
have become more pronounced. Successive studies have shown that extensive distribution of
misinformation can play a significant role in the success or failure of an important event or
a cause [1], [2]. Besides the dissemination and circulation of misleading information, social
networks also provide the mechanisms for corrupted users to perform an extensive range of
illegitimate actions such as spam and political astroturfing [3], [4]. As a result, measuring
the credibility of both the user and the text itself has become a major issue. In this work,
we assign a credibility score to each Twitter user based on certain extracted features.
   Twitter is currently one of the most popular social media platforms with an average of
10,000 tweets per second [5]. Twitter-enabled analytics not only constitute a valuable source
of information but also provide an uncomplicated extraction and dissemination of
subject-specific information for government agencies, businesses, political parties, financial
institutions, fundraisers and many others.
   In a recent study [6], 10 million tweets from 700,000 Twitter accounts were examined. The
collected accounts were linked to 600 fake news and conspiracy sites. Surprisingly, the
authors found that clusters of Twitter accounts are repeatedly linked back to these sites in a
coordinated and automated manner. A similar study [7] showed that 6.6 million fake news tweets
were distributed prior to the 2016 US elections.
   Globally, a number of social and political events in the last three years have been marred
by an ever-growing presence

T. Khan, A. Michalas: Seeing and Believing: Evaluating Trustworthiness of Twitter Users

of misleading information provoking an increasing concern about its impact on society. This
concern translated into an immediate need for the design, implementation, and adoption of new
systems and algorithms that will have the ability to measure the credibility of a source or a
piece of news. Notwithstanding, the seemingly unencumbered growth of social media users is
continuing.1 Coupled with the growth in user numbers, the generated content is growing
exponentially, thus producing a body of information where it is becoming increasingly
difficult to identify fabricated stories [9]. Consequently, we are facing a situation where a
considerable number of unverified pieces of information could be misconstrued and ultimately
misused. The research in the field is therefore currently focusing on defining the credibility
of the tweets and/or assigning scores to users based on the information they have been
sharing [10]–[17].

A. OUR CONTRIBUTION AND DIFFERENCES WITH PREVIOUS WORKS
We would like to draw your attention to the areas in which this work builds on our previous
one [18] and where, we believe, it extends it and offers new insights. In this work we used
additional ML models, such as Multi-Layer Perceptron (MLP) and Logistic Regression (LR). Since
the MLP model outperformed the LR, we only present the findings for the MLP model. For MLP, we
performed the experiments for the Tanh, ReLU and Logistic activation functions. Moreover,
unlike [18], where just one evaluation metric, “Accuracy”, was used to evaluate the model’s
performance, in this work we measure the model’s performance by using four evaluation metrics:
“Precision”, “Recall”, “F1” score, and “Accuracy” (see Table 5). Furthermore, we provide the
descriptive statistics of the features (see Table 4) as well as their correlation with the
target (see Figure 3) and compare our work with other similar works such as SybilTrap [19]
(see Table 2). Finally, we conduct a comparative review of the user characteristics primarily
used in the literature so far and the ones used in our model, and provide supplementary
information to help with stratifying trusted and untrusted users (see Table 3).
   Our main contributions can be summarized as follows:
   • First, we gathered a dataset of 50,000 Twitter users where, for each user, we built a
      unique profile with 19 features (discussed in Section III). Our dataset included only
      users whose tweets are public and who have non-zero friends and followers. Furthermore,
      each Twitter user account was classified as either trusted or untrusted by attaching
      the trusted and untrusted flag based on different features. These features are
      discussed in detail in Section IV.
   • We measured the social reputation score (Section III-C1), a sentiment score
      (Section III-C3.a), an h-index score (Section III-C2), tweets credibility
      (Section III-C3.b) and the influence score (Section III-D) for each of the analyzed
      Twitter users.
   • To classify a large pool of unlabelled data, we used an active learning model, a
      technique best suited to the situation where the unlabelled data is abundant but
      manual labelling is expensive [20], [21]. In addition, we evaluated the performance of
      various ML classifiers.
   We hope that this work will inspire others to further research this problem and
simultaneously kick-start a period of greater trust in social media.

B. ORGANIZATION
The rest of the paper is organized as follows: Related work is discussed in Section II,
followed by a detailed discussion of our proposed approach in Section III. In Section IV, the
active learning method and the type of classifier used are discussed. The data collection and
experimental results are presented in Section V. Finally, in Section VI, we conclude the paper.

II. RELATED WORK
Twitter is one of the most popular Online-Social-Networks (OSNs). As a data aggregator, it
provides data that can be used in research of both historical and current events. Twitter, in
relation to other popular OSNs, attracts significant attention in the research community due
to its open policy on data sharing and distinctive features [22]. Although openness and
vulnerability don’t necessarily go hand in hand, on multiple occasions malicious users misused
Twitter’s openness and exploited the service (e.g. political astroturfing, spammers sending
unsolicited messages, posting malicious links, etc.).
   In contrast to mounting evidence of the negative impact of fake news dissemination, so far,
only a few techniques for identifying it in social media have been proposed [3], [4],
[22]–[24].
   Among the most popular and promising ones is evaluating Twitter users and assigning them a
reputation score. Authors in [3] explored the posting of duplicate tweets and pointed out that
this behaviour, usually not followed by a legitimate user, affects the reputation score.
Posting the same tweet several times has a negative effect on the user’s overall reputation
score. The authors presented research that supports the above by calculating the edit distance
to detect duplications between two tweets posted from the same account.
   Furthermore, users have used an immense amount of exchanged messages and information on
Twitter to hijack trending topics [25] and send unsolicited messages to legitimate users.
Additionally, there are Twitter accounts whose only purpose is to artificially boost the
popularity of a specific hashtag, thus increasing its popularity and eventually making the
underlying topic a trend. The BBC investigated an instance where £150 was paid to Twitter
users to increase the popularity of a hashtag and promote it into a trend.2
   In an attempt to address these problems, researchers have used several ways to detect the
trustworthiness of tweets and assign an overall rank to users [24]. Castillo et al. [26]
measured the credibility of tweets based on Twitter features
  1 In 2020, an estimated 3.23 billion people were using social media world-

by using an automated classification technique. Alex Hai Wang [3] used the followers and
friends features to calculate the reputation score. Additionally, Preussler and Kerres [27]
considered the same metrics while assigning a rank to Twitter users. In [28], authors analysed
the tweets relevant to Mumbai attacks.3 Their analysis showed most of the information
providers were unknown while the reputation of the remaining ones was very low. In another
study [29] that examined the same event, the information retrieval technique and ML algorithm
used found that a mere 17% of the tweets were credibly related to the underlying attacks.
   According to Gilani et al. [30], when compared to normal users, bots and fake accounts use
a large number of external links in their tweets. Hence, analysing other Twitter features such
as URLs is crucial for correctly evaluating the overall credibility of a user. Although Twitter
has included tools to filter out such URLs, several masking techniques can effectively bypass
Twitter’s existing
   In this work, we evaluate the users’ trustworthiness and credibility [31], [32] by analysing
a wide range of features (see Table 1). In comparison to similar works in the field, our model
explores a number of factors that could be signs of possible malicious behaviours and makes
honest, fair, and precise judgements about the users’ credibility.

In this section, we discuss the model and main algorithms we used to calculate the user’s
influence score. Our first goal is to enable the users to identify certain attributes and
assess a political Twitter user by considering the influence score that is the outcome of a
proper run of our algorithms. Figure 1 illustrates the main features we used to calculate
users’ influence score. We also compare our work with state-of-the-art work in this domain
(see Table 2). Secondly, the political Twitter users are classified into either trusted or
untrusted based on features such as social reputation, the credibility of tweets, sentiment
score, the h-index score, influence score etc. Accounts containing abusive and/or harassing
tweets, low social reputation and h-index score, and low influence score are grouped into
untrusted users. The trusted users category comprises the more reputable users with a high
h-index score, more credible tweets and a high influence score. We will discuss this in more
detail in Section IV.
   In addition, we also present the approach used to calculate the Twitter users’ influence
score based on both their context and content features. For the user evaluation we took into
consideration only the Twitter features that can be extracted through Twitter API. We used the
outcome of that evaluation and derived more features to help us provide a better rounded and
fair evaluation (Section III-C). The features, as well as the relevant notation used
throughout the paper, are given in Table 1.

TABLE 1. Features considered to calculate the influence score.

A. FEATURES SELECTION AND COMPARISON WITH PREVIOUS MODELS
The features used for calculating the influence score were based on extensive study of the
existing literature. The selected features were used for detection purposes [33]–[35],
assigning a score [24] or classification purposes [36]. We used the features given in Table 1
to assign an influence score to a $u_i$. Table 2 provides a comparative overview of existing
models based on feature selection.

B. TWITTER FEATURES EXTRACTION
The pivotal step in the process of assigning a score to a Twitter user is to extract the
features linked to their accounts. The features can be either user account specific, such as
the number of followers, friends, etc., or user tweet specific, such as the number of likes,
retweets, URLs, etc. In our model, we considered both and used them to calculate some
additional features. We then combined them all to assign an influence score to a Twitter user.
Below we provide more detailed information on features used in our model.
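
As a minimal sketch of how the two kinds of features could be held together before any score is computed, the record below separates user-account features from tweet-level features. All class and field names are illustrative assumptions; they are neither the notation of Table 1 nor field names of the Twitter API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TweetFeatures:
    """Tweet-specific features (hypothetical field names)."""
    likes: int
    retweets: int
    has_url: bool
    is_retweet: bool  # True if the post is a retweet of another user's tweet


@dataclass
class UserFeatures:
    """User-account features plus the user's collected tweets (hypothetical names)."""
    followers: int
    friends: int
    listed_count: int
    tweets: List[TweetFeatures] = field(default_factory=list)

    def original_tweets(self) -> List[TweetFeatures]:
        """Return only the tweets the user authored, excluding retweets."""
        return [t for t in self.tweets if not t.is_retweet]


# Example record with made-up numbers, for illustration only.
user = UserFeatures(
    followers=1200,
    friends=300,
    listed_count=15,
    tweets=[
        TweetFeatures(likes=10, retweets=4, has_url=True, is_retweet=False),
        TweetFeatures(likes=0, retweets=0, has_url=False, is_retweet=True),
    ],
)
print(len(user.tweets), len(user.original_tweets()))
```

Keeping the tweet-level records inside the user record is only one possible layout; it makes the derived per-user ratios easy to compute from a single object.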
1) NUMBER OF FRIENDS
Friend is a user account feature indicating that a Twitter user ($u_i$) has subscribed to the
updates of another $u_i$ [37].

  3
jul/13/mumbai-blasts


 FIGURE 1. Twitter users influence score calculation.

TABLE 2. Models comparison using features.

Following users who are not part of interpersonal ties yields a lot of novel information. One
of the important indicators for calculating the $Inf(u_i)$ is the follower/following ratio.
The follower/following ratio compares the number of $u_i$’s subscribers to the number of users
that $u_i$ is following. Users are more interested in updates if the follower/following ratio
is high [38]. The ideal follower/following ratio is 1 or close to 1. In our model, we use the
$N_{fri}(u_i)$ as one of the indicators for assigning $R_s(u_i)$.

2) NUMBER OF FOLLOWERS
$N_{fol}(u_i)$ is another user account feature showing the number of people interested in the
specific $u_i$’s tweets. As discussed in [39], $N_{fol}(u_i)$ is one of the most important
parameters for measuring $u_i$’s influence. The more followers a $u_i$ has, the more influence
they exert [40]. Preussler and Kerres [27] correlate the $N_{fol}(u_i)$ with the reputation of
a $u_i$. According to their study, the credibility of a $u_i$ increases as the $N_{fol}(u_i)$
increases. Based on the above we consider the $N_{fol}(u_i)$ an important parameter and use it
as input to calculate the $R_s(u_i)$.

3) NUMBER OF RETWEETS
A tweet is considered important when it receives many positive reactions from other accounts.
The reactions may take the form of likes or retweets. Retweets act as a form of endorsement,
allowing $u_i$ to forward the content generated by other users, thus raising the content’s
visibility. It is a way of promoting a topic and is associated with the reputation of the
$u_i$ [41]. Since retweeting is linked to popular topics and directly affects the $u_i$’s
reputation, it is a key parameter for identifying possible fake account holders. As described
in [30], bots or fake accounts depend more on retweets of existing content than posting new
ones. In our model, we consider the $N_{ret}$ as one of the main parameters for assigning the
$Inf(u_i)$. We calculate the $R_{ret}(u_i)$ (used by Twitter grader) for each tweet by
considering $N_{ret}$ divided by $N_T(u_i)$, as given in equation 1.

\[
R_{ret}(u_i) = \frac{N_{ret}}{N_T(u_i)} \tag{1}
\]

4) NUMBER OF LIKES
The $N_{lik}$ is considered a reasonable proxy for evaluating the quality of a tweet. Authors
in [36] showed that humans receive more likes per tweet when compared to bots. In [42], the
authors used likes as one of the metrics to classify Twitter accounts as a human user or
automated agent. As mentioned in [5], if a specific tweet receives a large $N_{lik}$, it can
be safely concluded that other $u_i$’s are interested in the tweets of the underlying $u_i$.
Based on this observation, we calculate the $R_{lik}(u_i)$ by using the $N_{lik}$ for each
tweet and dividing it by $N_T(u_i)$ as shown in equation 2.

\[
R_{lik}(u_i) = \frac{N_{lik}}{N_T(u_i)} \tag{2}
\]

5) URLs
URL is a content level feature some $u_i$’s include in their tweets [43]. As tweets are
limited to a maximum of 280 characters, it is common that $u_i$’s cannot include all relevant
information in their tweets. To overcome this issue, $u_i$’s often populate tweets with URLs
pointing to a source where more information can be found. In our model, we consider the URL as
an independent variable for the engagement measurements [44]. We count the tweets that include
a URL and calculate the $R_{url}(u_i)$ by considering the $U_R(u_i)$ over the $N_T(u_i)$ as
given in equation 3.

\[
R_{url}(u_i) = \frac{U_R(u_i)}{N_T(u_i)} \tag{3}
\]
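
The three ratios in equations 1 to 3 share the same denominator, the number of tweets of the user, so they can be sketched together. The snippet below is an illustration under stated assumptions: the function name and the dictionary keys are hypothetical, $N_{ret}$ and $N_{lik}$ are taken to be totals summed over the user's collected tweets, and a user with no tweets is given ratios of zero, a convention the text does not specify.

```python
from typing import Dict, Sequence


def engagement_ratios(
    retweets: Sequence[int],
    likes: Sequence[int],
    has_url: Sequence[bool],
) -> Dict[str, float]:
    """Compute R_ret, R_lik and R_url (equations 1-3) for one user.

    Each argument holds one entry per collected tweet of the user.
    Assumption: N_ret and N_lik are totals over those tweets.
    """
    n_tweets = len(retweets)
    if not (n_tweets == len(likes) == len(has_url)):
        raise ValueError("all inputs must have one entry per tweet")
    if n_tweets == 0:
        # Assumed convention: no tweets means no measurable engagement.
        return {"R_ret": 0.0, "R_lik": 0.0, "R_url": 0.0}
    return {
        "R_ret": sum(retweets) / n_tweets,                    # equation 1
        "R_lik": sum(likes) / n_tweets,                       # equation 2
        "R_url": sum(1 for f in has_url if f) / n_tweets,     # equation 3
    }


# Made-up counts for four tweets, for illustration only.
ratios = engagement_ratios(
    retweets=[4, 0, 2, 2],
    likes=[10, 2, 0, 0],
    has_url=[True, False, False, True],
)
print(ratios)
```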

6) LISTED COUNT
In Twitter, a $u_i$ has the option to form several groups by creating lists of different
$u_i$’s (e.g. competitors, followers etc.). Twitter lists are mostly used to keep track of the
most influential people.4 The simplest way to measure the $u_i$’s influence is by checking the
$L(u_i)$ that the $u_i$ is placed on. Being present in a large number of lists is an indicator
that the $u_i$ is considered as important by others. Based on this assumption, we also
considered the number of lists that each $u_i$ belongs to.

7) STATUSES COUNT
Compared to the other popular OSNs, Twitter is considered as a service that is less social.5
This is mainly due to the large number of inactive $u_i$’s or users who show low motivation in
participating in an online discussion. Twitter announced a new feature “Status availability”,
that checks the $N_T(u_i)$.6 The status count is an important feature closely related to
reporting credibility. If a user is active on Twitter for a longer period, the likelihood of
producing more tweets increases, which in turn may affect the author’s credibility [45], [46].
To this end, for the calculation of the $Inf(u_i)$, we also took into account how active users
are by measuring how often a $u_i$ performs a new activity.7

8) ORIGINAL CONTENT RATIO
It has been observed that instead of posting original content, most $u_i$’s retweet posts by
others [38]. As a result, Twitter is changing into a pool of constantly updating information
streams. For $u_i$’s with high influence in the network, the best strategy is to use the
30/30/30 rule: 30% retweets, 30% original content, and 30% engagement [47]. Having this in
mind, in our model, we look for $u_i$’s original tweets and add them to their corresponding
influence score. We calculate the $R_{ori}(u_i)$ by subtracting the posts retweeted from
others from the total tweets of $u_i$, as given in equation 4.

\[
R_{ori}(u_i) = \frac{N_T(u_i) - \text{Retweeted other tweets}}{N_T(u_i)} \tag{4}
\]

Following the considerations for the selection of the basic features for calculating the
$Inf(u_i)$, in this section we elaborate on the extraction of the extra ones. Additionally, we
discuss the sentiment analysis technique used to analyse $u_i$’s tweets.
   By using the basic features described earlier, we calculated the following features for
each $u_i$:
   • Social reputation of a user;
   • Retweet h-index score and liked h-index score;
   • Sentiment score of a user;
   • Credibility of Tweets;
   • Influence score of a user.

1) USER’s SOCIAL REPUTATION
The main factor for calculating the $R_s(u_i)$ is the number of users interested in $u_i$’s
updates. Hence, $R_s(u_i)$ is based on the $N_{fol}(u_i)$, $N_{fri}(u_i)$ and $N_T(u_i)$ [3],
[38].

\[
R_s(u_i) = 2\log(1 + N_{fol}(u_i)) + \log(1 + N_T(u_i)) - \log(1 + N_{fri}(u_i)) \tag{5}
\]

   In equation 5 we utilized the log property to make the distribution smoother and minimize
the impact of outliers. In addition to that, since $\log 0$ is undefined, we added 1 wherever
log appears in equation 5. In equation 5, $R_s(u_i)$ increases with $N_{fol}(u_i)$ and
$N_T(u_i)$. Based on several studies [3], [5], [38], $R_s(u_i)$ is more dependent on
$N_{fol}(u_i)$, hence we give more importance to $N_{fol}(u_i)$ in comparison to $N_T(u_i)$
and $N_{fri}(u_i)$. If a $u_i$ has a large $N_{fol}(u_i)$ then the $u_i$ is more reputable. In
addition, if a $u_i$ is more active in updating his/her $N_T(u_i)$ there are more chances that
$u_i$’s tweets receive more likes and get retweeted. While $N_{fol}(u_i)$ and $N_T(u_i)$
increase, $R_s(u_i)$ also increases and vice versa. Alternatively, if a $u_i$ has a smaller
$N_{fol}(u_i)$ in comparison to the $N_{fri}(u_i)$ then the $R_s(u_i)$ is smaller. As can be
seen from equation 5, there is an inverse relation between $R_s(u_i)$ and $N_{fri}(u_i)$.

2) H-INDEX SCORE
The $h_{ind}$ score is most commonly used to measure the productivity and impact of a scholar
or scientist in the research community. It is based on the number of publications as well as
the number of citations for each publication [48]. In our work, we use the $h_{ind}$ score for
a more accurate calculation of $Inf(u_i)$. The $h_{ind}$ of a $u_i$ is calculated considering
$N_{lik}$ and $N_{ret}$ for each tweet. To find the $h_{ind}$,8 we sort the tweets based on
the $N_{lik}$ and $N_{ret}$ (in decreasing order).
   Algorithm 1 describes the main steps for calculating the $h_{ind}$ of a $u_i$ based on the
$N_{ret}$. The same algorithm is used for calculating the $h_{ind}$ of a $u_i$ based on
$N_{lik}$ by replacing $N_{ret}$ with $N_{lik}$.
   $R_{hind}(u_i)$ and $L_{hind}(u_i)$ are novel features used for measuring the relative
importance of a $u_i$. A tweet that has been retweeted many times and liked by many users is
considered as attractive for the readers [5], [49]. For this reason, we use $R_{hind}(u_i)$
and $L_{hind}(u_i)$ for measuring the $Inf(u_i)$. The higher the $R_{hind}(u_i)$ and
$L_{hind}(u_i)$ score of a $u_i$, the higher will be the $Inf(u_i)$.
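
Equations 4 and 5 are simple enough to check numerically. The sketch below is a hedged illustration, not the authors' implementation: the function and parameter names are hypothetical, the natural logarithm is assumed because the text does not state the base of the log, and a user with no tweets is assigned an original content ratio of zero by convention.

```python
import math


def original_content_ratio(total_tweets: int, retweeted_from_others: int) -> float:
    """Equation 4: share of a user's tweets that are not retweets of others."""
    if total_tweets == 0:
        return 0.0  # assumed convention; the text does not cover this case
    return (total_tweets - retweeted_from_others) / total_tweets


def social_reputation(followers: int, friends: int, tweets: int) -> float:
    """Equation 5, assuming the natural logarithm.

    Followers are weighted twice, tweets once, and friends enter negatively;
    adding 1 inside each log keeps the value defined for zero counts.
    """
    return (
        2 * math.log(1 + followers)
        + math.log(1 + tweets)
        - math.log(1 + friends)
    )


# Made-up accounts: same activity, opposite follower/friend balance.
popular = social_reputation(followers=1000, friends=10, tweets=50)
follow_heavy = social_reputation(followers=10, friends=1000, tweets=50)
print(popular > follow_heavy, original_content_ratio(10, 4))
```

Swapping the follower and friend counts in the usage example shows the asymmetry the text describes: the score rises with followers and falls with friends.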
lists-to-always-be-engaging/
  5
social-study/
  7
most-active-twitter-user-data/
  8
Computing-Influence-Score-for-Twitter-Users-1

3) TWITTER USER CREDIBILITY
Credibility is essentially believability [26], that is, providing reasonable grounds for being
believed. The credibility of a $u_i$ can be assessed by using the information available


Algorithm 1 Calculating h-Index Score Based on Retweets                      These four features are linked with $R_{ori}(u_i)$: $R_{ret}(u_i)$
 1: procedure h-Index score($h_{ind}$)                                       and $R_{lik}(u_i)$ specify the number of times the $R_{ori}(u_i)$ has been
 2:    Arrange $N_{ret}$ for each tweet of a $u_i$ in decreasing order       retweeted and liked, while $R_{has}(u_i)$ and $R_{url}(u_i)$ return only
 3:    for It = 1 to number of tweets do                                     $R_{ori}(u_i)$ having hashtags and URLs. Hence, to calculate the
 4:        if $N_{ret}$ of the It-th tweet < It then                         credibility of tweets, we first calculate the average of these
 5:            return It - 1                                                 four parameters and then multiply it with $R_{ori}(u_i)$.
 6:        end if
 7:    end for                                                               D. INFLUENCE SCORE
 8:    return number of tweets                                               The $Inf(u_i)$ is calculated based on the evaluation of both content
 9: end procedure                                                            and context features. More precisely, we consider the following
                                                                             features described earlier: $R_s(u_i)$, $Sen_s(u_i)$, $Twt_{cr}(u_i)$
                                                                             and $h_{ind}(u_i)$. After calculating the values of all of these
on the Twitter platform. In our approach, we use both the                    features we use them as input to Algorithm 2 line 7 which
$Sen_s(u_i)$ and $Twt_{cr}(u_i)$ to find a credible $u_i$.                   calculates the $Inf(u_i)$.
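
The h-index computation that Algorithm 1 describes can be sketched in a few lines of Python. This is an illustration that assumes the standard h-index definition (the largest h such that h tweets each received at least h retweets); the function name is hypothetical, and the same function applies to likes by passing like counts instead of retweet counts.

```python
from typing import Iterable


def h_index(counts: Iterable[int]) -> int:
    """Largest h such that at least h tweets have a count of at least h.

    `counts` holds one retweet (or like) count per tweet of the user.
    """
    ranked = sorted(counts, reverse=True)  # decreasing order, as in Algorithm 1
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break  # the tweet at this rank no longer supports a larger h
        h = rank
    return h


# Made-up retweet counts for five tweets, for illustration only.
retweet_h = h_index([10, 8, 5, 4, 3])
print(retweet_h)
```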

a: SENTIMENT SCORE                                                                            1) EQUATION FORMULATION
It has been observed that OSNs are a breeding ground for the                                  In order to ascertain how influential a ui is, researchers have
distribution of fake news. In many cases even a single Twitter                                taken into consideration one, two or more of the following
post significantly impacted [50] and affected the outcome of                                  characteristics:
an event.                                                                                         • Social reputation [54] and weight-age of his tweets [5];;
   Having this in mind, we used sentiment analysis and the                                        • Tweets credibility [5], [53];
TextBlob [51] library, to analyze tweets with the main aim                                        • His ability to formulate new ideas, as well as his active
to identify certain patterns that could facilitate identification                                    participation in follow-up events and discussions [55].
of credible news. The sentiment analysis returns a score                                          An influential ui must be highly active (have ideas that
using polarity values ranging from 1 to -1 and helps in tweet                                 impact others’ behaviours, able to start new discussions etc.,).
classification. We classified the collected tweets as (1) Pos-                                Additionally, the tweets must be relevant, credible and highly
itive (2) Neutral, and (3) Negative based on the number of                                    influential (retweeted and liked by a large number of other
positive, neutral and negative words in a tweet. According to                                 ui ’s). If the tweets of highly influential ui ’s are credible
Morozov and Sen [52], the least credible tweets have more                                     and the polarity of their tweets’ content is positive, they are
negative sentiment words and opinions and are associated                                      considered as highly acknowledged and recognized by the
with negative social events, while credible tweets, have more                                 community. In short, for a ui to be considered influential,
positive ones. Hence we classified positive tweets as being                                   we combine the efforts of [5], [53]–[55] and calculate the
the most credible followed by the neutral, and finally the least                              Inf (ui ) using equation 8.
credible negative tweets.
                                                                                                               Sens (ui )+Twtcr (ui )+Rs (ui )+Rhind (ui )+Lhind (ui )
   Following the tweets classification we assign a Sens (ui ) to                              Inf (ui ) =                                5
each ui [5] using the following equation:                                                                                                                                (8)
                        P               P
                            Nneu (ui ) + Npos (ui )
    Sens (ui ) = P              P              P              (6)
                    Nneu (ui ) + Npos (ui ) + Nneg (ui )                                      Algorithm 2 Influence Score Calculation
                                                                                               1: procedure Influence Score(Inf (ui ))
b: TWEETS CREDIBILITY                                                                          2:    For ith User
ODonovan et al. [53] focused on finding the most suitable                                      3:    Calculate Rhind (ui ) and Lhind (ui ), using Algorithm 1
indicators for credibility. According to their findings, prime                                 4:    Calculate Rs (ui ) using equation 5
indicators for a tweet’s credibility are mentions, URLs, tweet                                 5:    Calculate Sens (ui ) using equation 6
length and retweets. Gupta and Kumaraguru [29] ranked                                          6:    Calculate Twtcr (ui ) using equation 7
tweets based on tweets credibility. The parameters used as an                                  7:    Compute Inf (ui ) using equation 8
input for the ranking algorithm were: tweets, retweets, total                                  8: end procedure
unique users, trending topics, tweets with URLs, start and
end date. Based on the existing literature, we compute the
                                                                                               IV. ACTIVE LEARNING AND ML MODELS
Twtcr (ui ) by considering Rret (ui ), Rlik (ui ), Rhas (ui ), Rurl (ui )
and Rori (ui ) (see equation 7):                                                              In line with the existing literature, the classification of a ui
                                                                                            is performed on a manually annotated dataset. The manu-
                     Rret (ui )+Rlik (ui )+Rhas (ui )+Rurl (ui )                              ally annotated dataset gives a ground truth, however, manual
Twtcr (ui ) =                             4                            · Rori (ui )
                                                                                              labelling is an expensive and time-consuming task. In our
                                                                                              approach, we used active learning, a semi-supervised ML
  To begin, we consider the Rori (ui ) (tweet) by a ui and for                                model that helps in classification when the amount of avail-
each Rori (ui ) we collect Rret (ui ), Rlik (ui ), Rhas (ui ) and Rurl (ui ).                 able labelled data is small. In this model, the classifier


FIGURE 2. Active learning flow.

is trained using a small amount of training data (labelled
instances). Next, the instances in the large pool of unlabelled
data that are most ambiguous to the classifier are labelled and
added to the training set [21]. This process is repeated until all
the ambiguous instances have been queried or the model performance
no longer improves beyond a certain threshold. The basic flow
of the active learning approach is shown in Figure 2. Following
this model, we first trained our classifier on a
small dataset of human-annotated data. The trained classifier
was then used to classify a large pool of unlabelled instances
efficiently and accurately.
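The features fed to this classifier come from the scoring pipeline of Section III (Algorithm 1, equations 6 to 8 and Algorithm 2). As a minimal sketch of that pipeline, and not the authors' implementation, the scores could be computed as below. The function and parameter names are hypothetical, the h-index follows the standard definition, a user with no tweets is assumed to score 0 for sentiment, and all inputs to the influence score are assumed to be on comparable scales already.

```python
from typing import Sequence


def h_index(counts: Sequence[int]) -> int:
    """Standard h-index: the largest h such that h items have a count >= h."""
    ordered = sorted(counts, reverse=True)
    for position, count in enumerate(ordered, start=1):
        if count < position:
            return position - 1
    return len(ordered)


def sentiment_score(n_pos: int, n_neu: int, n_neg: int) -> float:
    """Equation 6: share of a user's tweets that are neutral or positive."""
    total = n_pos + n_neu + n_neg
    # Assumption: a user without any classified tweet gets a score of 0.
    return (n_neu + n_pos) / total if total else 0.0


def tweet_credibility(r_ret: float, r_lik: float, r_has: float,
                      r_url: float, r_ori: float) -> float:
    """Equation 7: mean of the four tweet features, multiplied by R_ori."""
    return (r_ret + r_lik + r_has + r_url) / 4 * r_ori


def influence_score(sen_s: float, twt_cr: float, r_s: float,
                    r_hind: float, l_hind: float) -> float:
    """Equation 8: unweighted mean of the five component scores."""
    return (sen_s + twt_cr + r_s + r_hind + l_hind) / 5
```

Calling `h_index` once on a user's per-tweet retweet counts and once on the per-tweet like counts gives the two h-index inputs named in Algorithm 2, line 3.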
FIGURE 3. Features correlation.

TABLE 3. Feature engineering: All values greater than or equal to 0.5 are considered high,
whereas those below 0.5 are considered low.

    The steps in our active learning process were as follows:
    • Data Gathering: We gathered unlabelled data for 50,000 ui's. The unlabelled data was
      then split into a seed (a small dataset of 1000 manually annotated instances) and a
      large pool of unlabelled data. The seed was then used to train the classifier just like
      a normal ML model. Using the seed dataset, we classified each political ui as either
      trusted or untrusted.
    • Classification of Twitter Users: Two manual annotators in the field classified
      1000 ui's as trusted or untrusted based on certain features. Out of 1000 ui's, 582 were
      classified as trusted and the remaining 418 as untrusted. For feature selection, we
      employed feature engineering and selected the most important features among those
      presented in Table 1. Based on the existing literature [56]–[59] and the correlation
      among features, certain features were considered the most discriminatory for ui's
      classification. We did not include the discriminatory features because they act as
      outliers and are biased. In addition, certain features were distributed almost equally
      between the trusted and untrusted users, as shown in Table 3. We discarded both as they
      do not add any value to classification. However, certain features were good candidates
      for differentiating trusted and untrusted users, such as high Rhind(ui), Lhind(ui),
      Inf(ui), Sens(ui), Twtcr(ui) and Rs(ui). In Table 3, the features marked with ∗ were
      used for classification in the existing literature [3], [57], [60], while the features
      marked with ∩ were based on the correlation among the features. The impact of the
      individual features is shown in Figure 3. The figure indicates that among the features,
      the Lhind(ui) and Nfol(ui) are very relevant for assessing Inf(ui). In addition, all the
      features except Rret(ui) and Rhas(ui) have a positive impact on the user's Inf(ui)
      (see Figure 3).
    • Choosing Unlabelled Instances: Pool-based sampling with a batch size of 100 was used,
      in which 100 ambiguous instances from the unlabelled dataset were labelled and added to
      the labelled dataset. Different sampling techniques were employed to select the
      instances from the unlabelled dataset. The classifier was re-trained on the new labelled
      dataset, and then the next batch of ambiguous unlabelled instances to be labelled was
      selected. The process was repeated until the model performance did not improve above a
      certain threshold.
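The batch selection in the last step can be illustrated with three standard ambiguity scores (least confidence, margin and entropy) computed from a classifier's predicted class probabilities. This is a minimal sketch, not the modAL implementation used in the paper; the function names and the `strategy` labels are hypothetical, and `pool_probs` is assumed to hold one probability vector per unlabelled instance.

```python
import math
from typing import List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """1 - P(most likely class); a higher value means a more ambiguous instance."""
    return 1.0 - max(probs)


def margin(probs: Sequence[float]) -> float:
    """P(top class) - P(second class); a lower value means more ambiguous."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def entropy(probs: Sequence[float]) -> float:
    """-sum_k p_k log p_k; a higher value means a more ambiguous instance."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Each key function is arranged so that sorting in ascending order
# puts the most ambiguous instances first.
_RANKING_KEYS = {
    "uncertainty": lambda p: -least_confidence(p),
    "margin": margin,
    "entropy": lambda p: -entropy(p),
}


def select_batch(pool_probs: Sequence[Sequence[float]],
                 strategy: str = "uncertainty",
                 batch_size: int = 100) -> List[int]:
    """Return the pool indices of the batch_size most ambiguous instances."""
    key = _RANKING_KEYS[strategy]
    order = sorted(range(len(pool_probs)), key=lambda i: key(pool_probs[i]))
    return order[:batch_size]
```

The returned indices identify the instances that would be handed to the human annotators before the classifier is re-trained. For a two-class problem such as trusted versus untrusted, all three scores are monotone in the top class probability, so they rank the pool identically apart from ties.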


   Among unlabelled instances, active learning finds the most useful ones to be labelled by
human annotators. In general, the unlabelled instance which confuses the ML model the most
will be the most valuable instance. The following sampling techniques were employed to select
instances from the unlabelled dataset10:
   • Uncertainty Sampling: The most common method, which calculates the difference between
     the most confident prediction and 100% confidence:
     $$ U(x) = 1 - P(\hat{x} \mid x) $$
     where $\hat{x}$ is the most likely prediction and $x$ is the instance to be predicted.
     This sampling technique selects the sample with the greatest uncertainty.
   • Margin Sampling: In margin sampling, the probability difference between the first and
     second most likely predictions is calculated using the equation:
     $$ M(x) = P(\hat{x}_1 \mid x) - P(\hat{x}_2 \mid x) $$
     where $\hat{x}_1$ and $\hat{x}_2$ are the first and second most likely predictions. As
     the decision is unsure for smaller margins, this sampling technique selects the instance
     with the smallest margin.
   • Entropy Sampling: It is the measure of entropy and is defined by the equation:
     $$ H(x) = -\sum_k p_k \log(p_k) $$
     where $p_k$ is the probability of a sample belonging to class $k$. Entropy sampling
     measures the difference between all the predictions.
   Details of the three classifiers we used and their performance characteristics are given
below:
   • Random Forest Classifier (RFC): An ensemble tree-based learning algorithm [61] that
     aggregates the votes from various decision trees to determine the output class of the
     instance. RFC runs efficiently on large datasets and is capable of handling thousands of
     input variables. In addition, RFC measures the relative importance of each feature and
     produces highly accurate classifications.
   • Support Vector Machine (SVM): SVM models are commonly used in classification tasks as
     they achieve high accuracy with less computation power. The SVM finds a hyperplane in
     N-dimensional space (N represents the number of features) to classify an instance [62].
     The goal of SVM is to improve classification accuracy by locating the hyperplane that
     separates the two classes.
   • Multilayer Perceptron (MLP): A supervised ML algorithm that learns a nonlinear function
     by training on a dataset. The MLP network is divided into an input layer, hidden
     layer(s), and an output layer [63]. Each layer consists of interconnected neurons
     transferring information to each other. In our proposed model, the MLP consisted of one
     input and output layer and 50 hidden layers. In addition, the activation functions used
     in the MLP are Tanh, ReLU and Logistic. We do not provide the plots for the ReLU
     activation function as its performance is not as good as Tanh and Logistic (see
     Table 5).

  10 content/query_strategies/uncertainty_sampling.html

 V. EXPERIMENTAL RESULTS AND MODEL EVALUATION
Experimental Setup: We used Python 3.5 for feature extraction and dataset generation. The
Python script was executed locally on a machine with the following configuration: Intel
Core i7, 2.80 GHz, 32 GB, Ubuntu 16.04 LTS 64 bit. For training and evaluating the ML models,
Google Colab was used. In addition, the modAL framework [64], an active learning framework
for Python, was used for manually labelling the Twitter users. It is a scikit-learn based
platform that is modular, flexible and extensible. We used the pool-based sampling technique
for the learner to query the labels of instances, and different sampling techniques for the
query strategy. For classification purposes, we used different classifiers, implemented using
the scikit-learn library.

 A. DATASET AND DATA COLLECTION
We used Tweepy, a Python library for accessing the Twitter API, to collect ui's tweets and
features. Tweepy has certain limitations, as it only allows the collection of a certain
number of features. Additionally, a data rate cap is in place, which prevents the collection
of information above a certain threshold. Our main concern was to select a sufficient number
of users for our dataset. In our dataset, we analysed the Twitter accounts belonging to
50,000 politicians. This dataset was generated in 2020.
   The main reason for choosing to evaluate politicians' profiles is their intrinsic potential
to influence public opinion. The content of such tweets originates and exists in the sphere
of political life which is, unfortunately, often surrounded by controversial events and
outcomes. During the selection, we only considered politicians with a public profile. Users
that seemed to be inactive (e.g. limited number of followers and activities) were omitted. In
addition, because duplicate data might influence model accuracy, we used the "max ID"
parameter to exclude duplicates from the dataset. Firstly, we requested the most recent tweets
from each user (200 tweets at a time) and kept the smallest ID (i.e. the ID of the oldest
tweet). Next, we iterated through the tweets and set the value of the max ID to the ID of the
oldest tweet minus one. This means that in the next requests (for tweet collection), we got
all the tweets having an ID less than or equal to a specific ID (the max ID parameter). For
all the subsequent requests, we used the max ID parameter to avoid tweet duplication.
   For each ui, we extracted all the features required by our model. Using the extracted
features and tweets, we calculated Inf(ui). Furthermore, we collected data that included
19 features, including the influence score, for 50,000 ui's. Table 4 summarizes the statistics
of some of the features examined in the dataset. For features which have no upper bound
defined and may have outlier values, such as the number of followers, likes, etc., we used a
percentile clip. We then normalized


TABLE 4. Dataset descriptive statistics of only four features.

our features using min-max normalization, with 0 being the
smallest and 1 being the largest value.
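Two steps of this subsection, the max ID paging and the clip-then-normalize preprocessing, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for the dataset: `fetch_page` is a hypothetical stand-in for a timeline request (it is not a Tweepy call), tweets are assumed to be dictionaries with an integer `"id"` field, and the 99th percentile cut-off is an assumed value because the text does not state which percentile was used.

```python
import math
from typing import Callable, Dict, List, Optional

Tweet = Dict[str, int]  # hypothetical minimal tweet record with an "id" field


def collect_tweets(fetch_page: Callable[[Optional[int]], List[Tweet]]) -> List[Tweet]:
    """Page backwards through a timeline without fetching any tweet twice."""
    collected: List[Tweet] = []
    max_id: Optional[int] = None  # None asks for the most recent tweets
    while True:
        page = fetch_page(max_id)
        if not page:
            break
        collected.extend(page)
        # The next request must stop just below the oldest tweet seen so far.
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected


def clip_and_normalize(values: List[float], upper_pct: float = 99.0) -> List[float]:
    """Clip values at an upper percentile, then min-max scale them to [0, 1]."""
    if not values:
        return []
    ordered = sorted(values)
    # Nearest-rank percentile; the default cut-off is an assumption.
    rank = max(1, math.ceil(upper_pct / 100 * len(ordered)))
    cap = ordered[rank - 1]
    clipped = [min(v, cap) for v in values]
    low, high = min(clipped), max(clipped)
    if high == low:
        return [0.0 for _ in clipped]
    return [(v - low) / (high - low) for v in clipped]
```

Clipping before scaling keeps a single extreme follower or like count from compressing every other user's value towards 0.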

We gathered 50,000 unlabelled instances of ui ’s and divided
our dataset into three subsets: training, testing, and unla-
belled data pools. For the training and testing cohorts,
we had 1000 manually annotated data instances. The rest of
the data was unlabelled (49,000 instances). The model was
trained on the labelled training dataset while the performance
of the model was measured on the testing dataset.
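Measuring performance on the testing dataset comes down to comparing predicted labels with the annotated ones. As a minimal sketch, assuming trusted users are encoded as 1 (positive) and untrusted users as 0 (negative), the four evaluation metrics can be derived from the confusion counts as below. The function name `evaluate` and the choice to return 0 when a denominator is empty are assumptions made for illustration.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy with trusted (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # trusted, predicted trusted
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # untrusted, predicted untrusted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # untrusted, predicted trusted
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # trusted, predicted untrusted

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

Scores for the untrusted class can be obtained from the same function by swapping the 0 and 1 labels in both sequences before the call.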
   For the classification, we used different classifiers (all classifiers were trained on the
labelled dataset and predictions are reported using 10-fold cross-validation). Precision,
recall, F1 score and accuracy were used as the main evaluation metrics for the model
performance. Precision is the ratio of true positives to all instances predicted as positive,
while recall is the ratio of true positives to all actual positive examples. The F1 score is
the harmonic mean of precision and recall, while accuracy measures the percentage of correctly
classified instances. Precision, recall and F1 score are based on the counts of true
positives, true negatives, false positives and false negatives. To define these terms, we
first considered that the trusted users are positive (labelled as 1), while the untrusted
users are negative (labelled as 0). When the model predicts the actual labels, we categorize
the predictions as true positives and true negatives, otherwise as false positives and false
negatives. If the model predicts that the user is trusted but the user is not, it is a false
positive, and if the model predicts that the user is untrusted but the user is in fact
trusted, it is a false negative. The performance of the model (precision, recall, and
F1 score) was calculated on the testing dataset. To improve the model accuracy, the active
learner selected ambiguous data instances from the unlabelled data pool using three different
sampling techniques. These ambiguous data instances were then manually labelled by human
annotators. The annotated data was added to the labelled dataset. In our model, the human
annotators labelled the 100 most ambiguous instances from the unlabelled dataset returned by
the active learner. The respective sampling techniques and the accuracy obtained for the top
three classifiers (RFC, SVM and MLP) are discussed below.

1) UNCERTAINTY SAMPLING
In uncertainty sampling, the instance with the least confident prediction is the most likely
to be selected. In this type of sampling method, only the most probable label is considered
and the rest are discarded. The RFC obtained an accuracy of 96% (Figure 4a), the SVM obtained
an accuracy of 90.8% (Figure 4b), while the MLP obtained an accuracy of 90% (Figure 4c) for
Tanh and 84% for Logistic, as given in Figure 4d.

FIGURE 4. Accuracy using uncertainty sampling.

2) MARGIN SAMPLING
In margin sampling, instances with the smallest difference between the first and second most
probable labels were considered. The accuracy for RFC, SVM and the two MLP variants using
margin sampling was 96%, 91.2%, 87% and 88.4%, as shown in Figure 5a, Figure 5b, Figure 5c
and Figure 5d respectively.

3) ENTROPY SAMPLING
Lastly, the entropy sampling method obtained an accuracy of 95% for RFC, 88% for SVM, almost
90% for MLP (Tanh) and 90% for MLP (Logistic). The results obtained for the RFC, SVM and MLP
are shown in Figure 6a, 6b, 6c and 6d.
   A comparison of the performance of our models and the different sampling techniques used
can be found in Table 5. Precision, recall, F1 score, and accuracy were used to evaluate the
results. Trusted users are represented by 1 while untrusted users are represented by 0 (see
Table 5). RFC outperforms the other models in uncertainty sampling, with an F1 score of 96%
for both trusted and untrusted users. Similarly, for margin sampling, RFC received an F1 score
of 95% for untrustworthy users and 97% for trustworthy users and again outperformed other
models.


TABLE 5. Comparison of various models using different sampling techniques.

FIGURE 5. Accuracy using margin sampling.

FIGURE 6. Accuracy using entropy sampling.

Finally, RFC outperforms the other models in entropy sampling as well, obtaining an F1 score
of 95% for both trusted and untrusted users. Overall, RFC was the best performing algorithm,
while MLP (ReLU) had the worst performance. The results obtained by RFC were the best due to
its superior accuracy and better record when it comes to low-dimensional datasets. Similarly,
the improved performance in the case of margin sampling can be attributed to the fact that it
considers the probabilities of the most probable labels, unlike the other sampling methods.

4) OPEN SCIENCE AND REPRODUCIBLE RESEARCH
As a way to support open science and reproducible research, and to give other researchers the
opportunity to use, test and hopefully extend/enhance our models, we plan to make both our
datasets and the code for our models available through the Zenodo research artifacts portal.
This does not violate Twitter's developer terms. However, in case the paper gets accepted and
in order to keep our anonymity, we will make this available in the camera-ready version.

VI. CONCLUSION
Contemplating the momentous impact unreliable information has on our lives and the intrinsic
issue of trust in OSNs, our work focused on finding ways to identify this kind of information
and notifying users of the possibility that a specific Twitter user is not credible.

   To do so, we designed a model that analyses Twitter users                        [18] T. Khan and A. Michalas, ‘‘Trust and believe–should we? Evaluating the
and assigns each a calculated score based on their social                                trustworthiness of Twitter users,’’ in Proc. IEEE 19th Int. Conf. Trust,
                                                                                         Secur. Privacy Comput. Commun. (TrustCom), Dec. 2020, pp. 1791–1800.
profiles, tweets credibility, sentiment score, and h-indexing                       [19] M. Al-Qurishi, S. M. M. Rahman, A. Alamri, M. A. Mostafa,
score. Users with a higher score are not only considered as                              M. Al-Rubaian, M. S. Hossain, and B. B. Gupta, ‘‘SybilTrap: A graph-
more influential but also, as having a greater credibility. To                           based semi-supervised Sybil defense scheme for online social networks,’’
                                                                                         Concurrency Comput., Pract. Exper., vol. 30, no. 5, p. e4276, Mar. 2018.
test our approach, we first generated a dataset of 50,000 Twit-                     [20] S. Tong and D. Koller, ‘‘Support vector machine active learning with
ter users along with a set of 19 features for each user. Then,                           applications to text classification,’’ J. Mach. Learn. Res., vol. 2, pp. 45–66,
we classified the Twitter users into trusted or untrusted using                          Nov. 2001.
three different classifiers. Further, we employed the active                        [21] B. Settles, ‘‘Active learning literature survey,’’ Dept. Comput. Sci., Univ.
                                                                                         Wisconsin-Madison, Madison, WI, USA, 2009.
learner approach to label the ambiguous unlabelled instances.                       [22] J. Ratkiewicz, M. D. Conover, M. Meiss, B. Gonçalves, A. Flammini, and
During the evaluation of our model, we conducted extensive                               F. M. Menczer, ‘‘Detecting and tracking political abuse in social media,’’
experiments using three sampling methods. The best results                               in Proc. 5th Int. AAAI Conf. Weblogs Social Media. Princeton, NJ, USA:
                                                                                         Citeseer, 2011, pp. 297–304.
were achieved by using RFC with the margin sampling.                                [23] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, ‘‘Fake news detection
We believe this work is an important step towards automating                             on social media: A data mining perspective,’’ ACM SIGKDD Explor.
the users’ credibility assessment, re-establishing their trust                           Newslett., vol. 19, no. 1, pp. 22–36, 2017.
in social networks, and building new bonds of trust between                         [24] A. Gupta, P. Kumaraguru, C. Castillo, and P. Meier, ‘‘TweetCred: Real-
                                                                                         time credibility assessment of content on Twitter,’’ in Proc. Int. Conf.
them.                                                                                    Social Informat. Barcelona, Spain: Springer, 2014, pp. 228–243.
                                                                                    [25] N. Jain, P. Agarwal, and J. Pruthi, ‘‘HashJacker-detection and analysis
REFERENCES                                                                               of hashtag hijacking on Twitter,’’ Int. J. Comput. Appl., vol. 114, no. 19,
                                                                                         pp. 17–20, Mar. 2015.
 [1] H. Allcott and M.Gentzkow, ‘‘Social media and fake news in the 2016 elec-      [26] C. Castillo, M. Mendoza, and B. Poblete, ‘‘Information credibility on
     tion,’’ J. Econ. Perspect., vol. 31, no. 2, pp. 211–236, 2017.                      Twitter,’’ in Proc. 20th Int. Conf. World Wide Web, 2011, pp. 675–684.
 [2] E. Metzgar and A. Maruggi, ‘‘Social media and the 2008 US presidential         [27] A. Preussler and M. Kerres, ‘‘Managing reputation by generating followers
     election,’’ J. New Commun. Res., vol. 4, no. 1, pp. 141–165, 2009.                  on Twitter,’’ in Medien–Wissen–Bildung Explorationen Visualisierter und
 [3] A. H. Wang, ‘‘Don’t follow me: Spam detection in Twitter,’’ in Proc. Int.           Kollaborativer Wissensräume. Innsbruck, Austria: Innsbruck Univ. Press,
     Conf. Secur. Cryptogr. (SECRYPT), 2010, pp. 1–10.                                   2010, pp. 129–143.
 [4] C. Grier, K. Thomas, V. Paxson, and M. Zhang, ‘‘@spam: The underground on 140 characters
     or less,’’ in Proc. 17th ACM Conf. Comput. Commun. Secur., 2010, pp. 27–37.
 [5] M. Alrubaian, M. Al-Qurishi, M. Al-Rakhami, M. M. Hassan, and A. Alamri, ‘‘Reputation-based
     credibility analysis of Twitter social network users,’’ Concurrency Comput., Pract. Exper.,
     vol. 29, no. 7, p. e3873, Apr. 2017.
 [6] M. Hindman and V. Barash, ‘‘Disinformation, and influence campaigns on Twitter,’’
     Knight Found., George Washington Univ., 2018.
 [7] A. Bovet and H. A. Makse, ‘‘Influence of fake news in Twitter during the 2016 US
     presidential election,’’ Nature Commun., vol. 10, no. 1, pp. 1–14, Dec. 2019.
 [8] J. Clement. (Jul. 2020). Number of Global Social Network Users 2017–2025. [Online].
     Available:
 [9] M. Al-Qurishi, R. Aldrees, M. AlRubaian, M. Al-Rakhami, S. M. M. Rahman, and A. Alamri,
     ‘‘A new model for classifying social media users according to their behaviors,’’ in
     Proc. 2nd World Symp. Web Appl. Netw. (WSWAN), Mar. 2015, pp. 1–5.
[10] Y. Liu, C. Kliman-Silver, and A. Mislove, ‘‘The tweets they are a-changin: Evolution of
     Twitter users and behavior,’’ in Proc. ICWSM, vol. 30, 2014, pp. 305–314.
[11] K. R. Canini, B. Suh, and P. L. Pirolli, ‘‘Finding credible information sources in social
     networks based on content and social structure,’’ in Proc. IEEE 3rd Int. Conf. Privacy,
     Secur., Risk Trust, IEEE 3rd Int. Conf. Social Comput., Oct. 2011, pp. 1–8.
[12] R. Tinati, L. Carr, W. Hall, and J. Bentwood, ‘‘Identifying communicator roles in
     Twitter,’’ in Proc. 21st Int. Conf. Companion World Wide Web, 2012, pp. 1161–1168.
[13] M. Gupta, P. Zhao, and J. Han, ‘‘Evaluating event credibility on Twitter,’’ in
     Proc. SIAM Int. Conf. Data Mining, Apr. 2012, pp. 153–164.
[14] M.-F. Moens, J. Li, and T.-S. Chua, Mining User Generated Content. Boca Raton, FL, USA:
     CRC Press, 2014.
[15] D. Rao, D. Yarowsky, A. Shreevats, and M. Gupta, ‘‘Classifying latent user attributes in
     Twitter,’’ in Proc. 2nd Int. Workshop Search Mining User-Generated Contents, 2010,
     pp. 37–44.
[16] H. S. Al-Khalifa and R. M. Al-Eidan, ‘‘An experimental system for measuring the
     credibility of news content in Twitter,’’ Int. J. Web Inf. Syst., vol. 7, no. 2,
     pp. 130–151, Jun. 2011.
[17] M. M. Uddin, M. Imran, and H. Sajjad, ‘‘Understanding types of users on Twitter,’’ 2014,
     arXiv:1406.1335. [Online]. Available: http://arxiv.org/abs/1406.1335
[28] A. Gupta and P. Kumaraguru, ‘‘Twitter explodes with activity in mumbai blasts! A lifeline
     or an unmonitored daemon in the lurking?’’ IIIT, New Delhi, India,
     Tech. Rep. IIITD-TR-2011-005, 2011.
[29] A. Gupta and P. Kumaraguru, ‘‘Credibility ranking of tweets during high impact events,’’
     in Proc. 1st Workshop Privacy Secur. Online Social Media, 2012, pp. 2–8.
[30] Z. Gilani, R. Farahbakhsh, G. Tyson, L. Wang, and J. Crowcroft, ‘‘Of bots and humans
     (on Twitter),’’ in Proc. IEEE/ACM Int. Conf. Adv. Social Netw. Anal. Mining, Jul. 2017,
     pp. 349–354.
[31] T. Dimitriou and A. Michalas, ‘‘Multi-party trust computation in decentralized
     environments,’’ in Proc. 5th Int. Conf. New Technol., Mobility Secur. (NTMS), May 2012,
     pp. 1–5.
[32] T. Dimitriou and A. Michalas, ‘‘Multi-party trust computation in decentralized
     environments in the presence of malicious adversaries,’’ Ad Hoc Netw., vol. 15,
     pp. 53–66, Apr. 2014.
[33] M. Fazil and M. Abulaish, ‘‘A hybrid approach for detecting automated spammers in
     Twitter,’’ IEEE Trans. Inf. Forensics Security, vol. 13, no. 11, pp. 2707–2719,
     Nov. 2018.
[34] A. A. Amleshwaram, N. Reddy, S. Yadav, G. Gu, and C. Yang, ‘‘CATS: Characterizing
     automation of Twitter spammers,’’ in Proc. COMSNETS, Jan. 2013, pp. 1–10.
[35] C. Yang, R. Harkreader, and G. Gu, ‘‘Empirical evaluation and new design for fighting
     evolving Twitter spammers,’’ IEEE Trans. Inf. Forensics Security, vol. 8, no. 8,
     pp. 1280–1293, Aug. 2013.
[36] Z. Gilani, R. Farahbakhsh, G. Tyson, L. Wang, and J. Crowcroft, ‘‘An in-depth
     characterisation of bots and humans on Twitter,’’ 2017, arXiv:1704.01508. [Online].
     Available:
[37] M. S. Granovetter, ‘‘The strength of weak ties,’’ in Social Networks. Amsterdam,
     The Netherlands: Elsevier, 1977, pp. 347–367.
[38] I. Anger and C. Kittl, ‘‘Measuring influence on Twitter,’’ in Proc. 11th Int. Conf.
     Knowl. Manage. Knowl. Technol., 2011, pp. 1–4.
[39] C. G. McCoy, M. L. Nelson, and M. C. Weigle, ‘‘University Twitter engagement: Using
     Twitter followers to rank universities,’’ 2017, arXiv:1708.05790. [Online]. Available:
[40] A. Leavitt, E. Burchard, D. Fisher, and S. Gilbert, ‘‘The influentials: New approaches
     for analyzing influence on Twitter,’’ Web Ecol. Project, vol. 4, no. 2, pp. 1–18, 2009.
[41] H. S. Dutta, A. Chetan, B. Joshi, and T. Chakraborty, ‘‘Retweet us, we will retweet you:
     Spotting collusive retweeters involved in blackmarket services,’’ in Proc. IEEE/ACM
     Int. Conf. Adv. Social Netw. Anal. Mining (ASONAM), Aug. 2018, pp. 242–249.

[42] Z. Gilani, E. Kochmar, and J. Crowcroft, ‘‘Classification of Twitter accounts into
     automated agents and human users,’’ in Proc. IEEE/ACM Int. Conf. Adv. Social Netw.
     Anal. Mining, Jul. 2017, pp. 489–496.
[43] A. L. Hughes and L. Palen, ‘‘Twitter adoption and use in mass convergence and emergency
     events,’’ Int. J. Emergency Manage., vol. 6, nos. 3–4, pp. 248–260, 2009.
[44] X. Han, X. Gu, and S. Peng, ‘‘Analysis of tweet form’s effect on users’ engagement on
     Twitter,’’ Cogent Bus. Manage., vol. 6, no. 1, Jan. 2019, Art. no. 1564168.
[45] B. Kang, J. O’Donovan, and T. Höllerer, ‘‘Modeling topic specific credibility on
     Twitter,’’ in Proc. ACM Int. Conf. Intell. User Interfaces, 2012, pp. 179–188.
[46] J. Ross and K. Thirunarayan, ‘‘Features for ranking tweets based on credibility and
     newsworthiness,’’ in Proc. Int. Conf. Collaboration Technol. Syst. (CTS), Oct. 2016,
     pp. 18–25.
[47] M. Soltau. (Jan. 2020). Twitter News is a Popularity Contest. [Online]. Available:
     popularity-contest-aa2705433d65
[48] J. E. Hirsch, ‘‘An index to quantify an individual’s scientific research output,’’
     Proc. Nat. Acad. Sci. USA, vol. 102, no. 46, pp. 16569–16572,
[49] F. Riquelme and P. González-Cantergiani, ‘‘Measuring user influence on Twitter:
     A survey,’’ Inf. Process. Manage., vol. 52, no. 5, pp. 949–975, 2016.
[50] G. Wolfsfeld, E. Segev, and T. Sheafer, ‘‘Social media and the Arab spring: Politics
     comes first,’’ Int. J. Press/Politics, vol. 18, no. 2, pp. 115–137, 2013.
[51] S. Loria et al., ‘‘TextBlob: Simplified text processing,’’ Secondary TextBlob,
     Simplified Text Process., vol. 3, 2014.
[52] E. Morozov and M. Sen, ‘‘Analysing the Twitter social graph: Whom can we trust?’’
     M.S. thesis, Dept. Comput. Sci., Univ. Nice Sophia Antipolis, Nice, France, 2014.
[53] J. O’Donovan, B. Kang, G. Meyer, T. Höllerer, and S. Adali, ‘‘Credibility in context:
     An analysis of feature distributions in Twitter,’’ in Proc. Int. Conf. Privacy, Secur.,
     Risk Trust Int. Conf. Social Comput., Sep. 2012, pp. 293–301.
[54] D. Garcia, P. Mavrodiev, D. Casati, and F. Schweitzer, ‘‘Understanding popularity,
     reputation, and social influence in the Twitter society,’’ Policy Internet, vol. 9,
     no. 3, pp. 343–364, 2017.
[55] G. M. Chen, ‘‘Tweet this: A uses and gratifications perspective on how active Twitter
     use gratifies a need to connect with others,’’ Comput. Hum. Behav., vol. 27, no. 2,
     pp. 755–762, Mar. 2011.
[56] K. E. Daouadi, R. Z. Rebaï, and I. Amous, ‘‘Organization vs. individual: Twitter user
     classification,’’ in Proc. LPKM, 2018, pp. 1–8.
[57] A. Pritzkau, S. Winandy, and T. Krumbiegel, ‘‘Finding a line between trusted and
     untrusted information on tweets through sequence classification,’’ in Proc. Int. Conf.
     Mil. Commun. Inf. Syst., May 2021, pp. 1–6.
[58] E. Ferrara, ‘‘Disinformation and social bot operations in the run up to the 2017 French
     presidential election,’’ 2017, arXiv:1707.00086. [Online]. Available:
[59] S. Gurajala, J. S. White, B. Hudson, B. R. Voter, and J. N. Matthews, ‘‘Profile
     characteristics of fake Twitter accounts,’’ Big Data Soc., vol. 3, no. 2, 2016,
     Art. no. 2053951716674236.
[60] P. M. Nyein Myo, M. Swe, and N. Myo, ‘‘Fake accounts classification on Twitter,’’
     Int. J. Latest Eng. Manage. Res., vol. 3, no. 6, pp. 141–146, 2018.
[61] G. Biau, ‘‘Analysis of a random forests model,’’ J. Mach. Learn. Res., vol. 13,
     pp. 1063–1095, 2012.
[62] W. S. Noble, ‘‘What is a support vector machine?’’ Nature Biotechnol., vol. 24, no. 12,
     pp. 1565–1567, 2006.
[63] M.-C. Popescu, V. E. Balas, L. Perescu-Popescu, and N. Mastorakis, ‘‘Multilayer
     perceptron and neural networks,’’ WSEAS Trans. Circuits Syst., vol. 8, no. 7,
     pp. 579–588, 2009.
[64] T. Danka and P. Horvath, ‘‘ModAL: A modular active learning framework for Python,’’
     2018, arXiv:1805.00979. [Online]. Available: http://arxiv.org/abs/1805.00979

TANVEER KHAN received the master’s degree in information security from COMSATS University
Islamabad, Islamabad, Pakistan. After his master’s degree, he worked as a Data Analyst on the
Cyber Threat Intelligence Platform project at COMSATS University Islamabad. He also worked as
a Junior Analyst at Trillium Infosec, Pakistan. He is currently working as a Ph.D. Researcher
with the Department of Computing Sciences, Tampere University, Finland, where he is also a
member of the Network and Information Security Group (NISEC). His research interests include
privacy-preserving machine learning, fake news detection in social networks, cyber security,
digital forensics, and malware analysis.

ANTONIS MICHALAS received the Ph.D. degree in network security from Aalborg University,
Denmark. He is currently working as an Assistant Professor with the Department of Computing
Sciences, Tampere University, Finland, where he also co-leads the Network and Information
Security Group (NISEC). The group comprises Ph.D. students, professors, and researchers. Group
members conduct research in areas spanning from the theoretical foundations of cryptography to
the design and implementation of leading-edge efficient and secure communication protocols.
Apart from his research work at NISEC, as an Assistant Professor, he is actively involved in
the teaching activities of the university. His role also extends to student supervision and
the coordination of research projects. He has published a significant number of papers in
field-related journals and conferences and has participated as a speaker in various
conferences and workshops. His research interests include private and secure e-voting systems,
reputation systems, privacy in decentralized environments, cloud computing, trusted computing
and privacy-preserving protocols in eHealth, and participatory sensing applications.

