Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin

Page created by Dustin Bennett
 
CONTINUE READING
Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin
Topic Modeling Based on
                                        Two-Step Flow Theory:
                                        Application to Tweets about
                                        Bitcoin
                                        Aos Mulahuwaish∗ , Matthew Loucks∗ , Basheer Qolomany† , and Ala Al-Fuqaha‡
arXiv:2303.02032v1 [cs.SI] 3 Mar 2023

                                        ∗ Department of Computer Science and Information Systems, Saginaw Valley State University,
                                        University Center, USA
                                        † Cyber Systems Department, University of Nebraska at Kearney, Kearney, USA
                                        ‡ Information and Computing Technologies (ICT) Division, College of Science and Engineering (CSE),
                                        Hamad Bin Khalifa University, Doha, Qatar

                                        Abstract—Digital cryptocurrencies such as Bitcoin have exploded in recent years in both
                                        popularity and value. By their novelty, cryptocurrencies tend to be both volatile and highly
                                        speculative. The capricious nature of these coins is helped facilitated by social media networks
                                        such as Twitter. However, not everyone’s opinion matters equally, with most posts garnering little
                                        to no attention. Additionally, the majority of tweets are retweeted from popular posts. We must
                                        determine whose opinion matters and the difference between influential and non-influential
                                        users. This study separates these two groups and analyzes the differences between them. It
                                        uses Hypertext-induced Topic Selection (HITS) algorithm, which segregates the dataset based
                                        on influence. Topic modeling is then employed to uncover differences in each group’s speech
                                        types and what group may best represent the entire community. We found differences in
                                        language and interest between these two groups regarding Bitcoin and that the opinion leaders
                                        of Twitter are not aligned with the majority of users. There were 2559 opinion leaders (0.72% of
                                        users) who accounted for 80% of the authority and the majority (99.28%) users for the remaining
                                        20% out of a total of 355,139 users.

                                                                                           [1, 2, 3, 4, 5, 6]. These networks, however, can
                                        Index Terms: Network Analysis, Twitter,            comprise billions of users, with an unfathomable
                                        Machine Learning, Topic Modeling, Bitcoin.         number of connections between them. To run any
                                                                                           topic modeling or sentiment analysis algorithms
                                                                                           across the entire network would be extraordinarily
                                                                                           costly. This begs the question — would it be
                                           T HE INTRODUCTION Due to the virtual nature     possible to run topic modeling on a significantly
                                        of cryptocurrencies, much discussion occurs in     smaller subset of the network and still yield
                                        online forums or social media platforms such as    similar results? It has been well established that
                                        Twitter and Facebook. Places like this establish   authority, particularly on digital networks, tends
                                        the general public opinion of cryptocurrencies     to follow a Pareto distribution [7]. Additionally,
                                        and, consequently, to some extent, their price

                                        IT Professional
                                                                                                                                                1
Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin
most users tend to mimic opinion leaders, par-           We have seen how famous public figures such
    ticularly on Twitter, where the built-in mech-           as Elon Musk have a tremendous capacity to
    anism of retweeting facilitates the replication          influence and even embody public opinion. Ad-
    of one’s opinions. By focusing on these hyper-           ditionally, given the difference in platform struc-
    concentrated authority centers on social media           ture, Twitter may be more susceptible to a few’s
    networks, it may be possible to use the opinion          opinions than other platforms. For these reasons,
    of the few to represent the entire community.            we seek to replicate this original study using a
        When lower-end media users follow active             Github fork and a few other modifications.
    media users who interpret the meaning of media               Also, in this paper, we used a different
    messages and content, these are known as opinion         methodology for topic modeling techniques and
    leaders. These leaders are generally held in high        the process in which a comparison is drawn
    esteem by those who accept their opinions. Opin-         between individual topics. In [10], the Non-
    ion leadership originated from the theory founded        negative Matrix Factorization (NMF) has been
    by Elihu Katz and Paul Lazarsfeld, where it              used, while in this paper, we utilized the La-
    is theorized that there is a two-step flow of            tent Dirichlet Allocation (LDA) due to its pop-
    communication [8]. Significant contributors to the       ularity and effectiveness in modeling the top-
    opinion leader concept include Berelson et al.           ics compared to the NMF. Additionally, in the
    [9]. The theory of opinion leadership is one of          [10], the Hungarian Algorithm (also called the
    the multiple models that attempt to explain the          Munkres assignment algorithm or Kuhn-Munkres
    diffusion of ideas, innovations, and commercial          algorithm) was employed to individually match
    products.                                                together the most closely related topics between
        Our research focused on analyzing the con-           opinion leaders and majority users. While in our
    structed network and finding the opinion leaders         paper, we reasoned, however, that not every topic
    on a large and broader Twitter dataset, as it            extracted from the first group would have a 1-1
    is more susceptible to a few’s opinions than             corresponding topic to the second group; thus,
    other platforms, analyzing the differences in lan-       the matching process would be redundant or,
    guage and interest between the opinion leaders           at the very least, unproductive. Thus, we only
    and majority users regarding Bitcoin, and finding        considered qualitatively whether or not the topics
    whether the opinion leaders of Twitter are not           from group to group were coherent. In addition,
    aligned with the majority of users, so that a            our methodology would be easily applicable to
    small number of highly influential users (opinion        various social networks, such as finding the ef-
    leaders) can effectively represent an entire com-        fectiveness of the influencers over social media
    munity’s opinions.                                       for various aspects of the real world, such as
        This paper does not consider the effects of          political sentiment, divisiveness, fashion trends,
    mimicry and the extent to which it exists within         music taste, or language changes.
    these networks. For instance, after an opinion               This study involves a few steps. In the first
    leader releases a tweet that corresponds to a            step, data is collected from Twitter using their
    specific topic, what is the precise effect on the        API and preprocessed; the network is constructed
    community? From speculation, we would expect             using comments and retweets. In the second
    those users following them to begin to mimic             step, the network is segregated into two different
    their opinions — though we do not know for               groups using Hyper-text-induced Topic Selection
    certain as our efforts were not focused on that          (HITS) [11]; these two groups are the opinion
    question.                                                leaders and the majority users. In the third step,
        This paper and its motivations were heavily          topic modeling is run on these two groups and
    influenced by a previous study by Kang et al. [10]       the entire community. Finally, in the fourth step,
    on a relative niche forum called bitcointalk.org.        we calculated the topic similarities between the
    Given the recent hype surrounding cryptocur-             two groups and the entire community to under-
    rency on mainstream social media platforms such          stand how well each constituent group represents
    as Twitter, it would be of great interest to replicate   the entire community. Figure 1 shows the steps
    this previous study on a larger, broader dataset.        conducted for our proposed analysis approach.

2                                                                                                  IT Professional
Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin
Figure 1. Steps of the proposed analysis approach.

    This paper is organized as follows. Section       events. The focus of this study was identifying
1 discusses related works. Section 2 presents         major cryptocurrency events and catastrophes in-
the data collection and preprocessing. Section 3      volving some sort of Ponzi scheme causing the
presents network construction and analysis. Sec-      market to crash. 30 was chosen for the number
tion 4 presents the topic of modeling. Section 5      of topics due to their semantic coherence score.
presents the results of the work.Section 6 presents   The dataset was obtained from scraping web data
the lessons learned. Finally, Section 7 provides      from bitcointalk.org, a cryptocurrency discussion
our conclusions.                                      forum. About 15 million posts were collected in
                                                      total (2012 through 2016). All data was scrapped
1. RELATED WORK                                       from a single cryptocurrency forum which most
    In this section, we will discuss some relevant,   likely exhibits a bias towards technologically
related works.                                        savvy miners with a vested interest in the technol-
    Within prior studies on user opinions in on-      ogy. No consideration was given to the influence
line discussion forums and price fluctuation, sen-    of more prominent Bitcoin community members.
timental and qualitative analyses were imple-             Abraham et al. [13] used sentiment analysis
mented. In addition, some of the said studies         to predict price changes in cryptocurrencies (like
focused on particular phases where cryptocur-         Bitcoin and Ethereum). VADER algorithm is used
rencies price rose. Multiple studies attempted to     to classify whether the text is positive or negative
investigate how cryptocurrencies, such as Bitcoin     and its polarity. The datasets used in this work
price, were associated with users’ feelings and       were collected from Google Trends and Twitter;
opinions by analyzing cryptocurrency forums.          as a result, there was no clear relationship be-
Some of these research works are the following:       tween the general sentiments of tweets and price.
    Linton et al. [12] used the LDA to see how        Most tweets regarding cryptocurrency tend to be
opinions are connected to large cryptocurrency        positive, and many papers use datasets in which

May/June 2019
                                                                                                             3
Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin
the price of crypto increases, resulting in a biased   15 topics total, with an R squared value of 0.28
    dataset.                                               in predicting bitcoin prices. The researchers also
        Rubio [14] shifted focus away from exclu-          found adding LDA topic weights as a feature in-
    sively Bitcoin and price to other cryptocurrencies     creases accuracy. Adding polarity and subjectivity
    and technical markers such as transaction speeds,      scores from sentiment analysis also improved the
    smart contracts, and privacy. The researcher gath-     R-value.
    ered tweets relating to over 25 top global cryp-            Bibi [17] used LDA to measure sentiment
    tocurrencies from May 2018 to August 2018,             towards cryptocurrency worldwide. The Twitter
    checked if a categorization fits well for a group,     dataset was used in this work; not much informa-
    detected prominent themes, and ran a prediction        tion was given regarding the dataset, algorithms,
    to see what would be used in the future, the           and methodology. As a result, Sweden and Den-
    LDA and MALLET algorithms were used in this            mark showed tremendously positive sentiment
    work. As a result, the coherence scores plateaued      towards cryptocurrency, with Canada and South
    at 18 topics, which gave a value of 18. The            Korea being the most negative.
    researcher then found the seven most common                 Kim et al. [18] analyzed the current user
    words per topic and visualized their frequencies       opinions from online forums in Massive Mul-
    in bar graphs. There were several themes between       tiplayer Online Games (MMOG) setting widely
    groups, including marketing, emotions, and group       used around the world. This led to the proposition
    relations.                                             of a method for predicting the next day’s fall
        Nizzoli et al. [15] present a chart of online      and rise of the currency used in an MMOG
    cryptocurrencies across multiple platforms such        environment. The prediction of the daily price
    as Twitter, Telegram, and Discord. They used a         fluctuations used in an MMOG environment was
    semi-supervised model to classify various social       found by analyzing online forum users’ opinions.
    media platforms. They checked how prevalent            The viability of predicting the fluctuation in the
    pump-and-dump and Ponzi schemes are in each            value of virtual currencies was shown by focusing
    bot activity and mapped their extent using a           on one of the most widely used MMOGs, the
    popular Twitter bot detection framework. The           World of Warcraft game.
    authors used the Twitter Streaming API to gather            Kim et al. [19] proposed a method founded on
    tweets from 3822 hashtags. They then searched          deep learning and established on user opinions on
    for additional Discord/Telegram links to crawl         online forums to predict the fluctuation in the Bit-
    and scrape more data from those platforms. In          coin price and transactions. This method is viable
    total, they had 50M messages from March to May         for understanding a variety of cryptocurrencies in
    2019. Data collected from Discord and Telegram         addition to Bitcoin and increasing the usability of
    was derived from Twitter, meaning any crypto           these cryptocurrencies.
    communities on those sites that do not have                 Additionally, multiple studies sought to iden-
    any Twitter presence would not be included. As         tify opinion leaders and/or find the main theme
    a result, Telegram showed to be replete with           and maximize their network marketing effec-
    cryptocurrency scams; 56% of crypto channels           tiveness by analyzing community networks and
    were fraudulent. Suspended ac-counts were also         users’ opinions. Some of these research works are
    much more likely to be involved with crypto            the following:
    manipulation.                                               Choi [20] examined how a piece of informa-
        Atashian and Hrachya [16] used the LDA             tion flowed in public forums that were social me-
    to construct a predictive model for Bitcoin us-        dia based, whether opinion leaders appeared from
    ing sentiment analysis, as the internet primar-        this flow of information, and the characteristics
    ily influences cryptocurrency. Data scraped from       that opinion leaders had in such forums. Two
    bitcointalk.org, a bitcoin forum, from April 23,       Twitter-based discussion groups focused on po-
    2011, to May 05, 2018. Used GDAX Python                litical discussions in South Korea were examined
    library to gather Bitcoin HOCV data from 2014          using network analysis and statistical measures,
    to 2018. Data was limited to niche esoteric cryp-      where it was concluded that opinion leaders were
    tocurrency forums. This work Built a model with        influential but not content creators.

4                                                                                                 IT Professional
Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin
Ho et al. [21] proposed an efficient approach     agents. The results from this simulation con-
to identify the opinion leader from a group dis-      cluded that enhancing opinion leaders’ credibility
cussion without analyzing syntactic and semantic      is key to maximizing the influence power within
features (which can add additional computing          e-commerce.
effort). The researchers proposed algorithms that         In summary, from the literature review, a
evaluate the magnitude of participation and emo-      significant number of studies identified the opin-
tional expression during the speaking of each         ion leaders on the network or analyzed users’
group member during the discussion. A well-           sentiments on social media to predict the price
trained model was tested on the single dataset,       fluctuation of the cryptocurrencies (like Bitcoin);
and a cross-dataset was obtained to recognize the     these studies more likely used online materials
opinion leader. This testing showed an accuracy       to understand the fluctuation of Bitcoin prices
of 94.68% on the Berlin dataset, 76% on the           and the volume of transactions to determine any
YouTube dataset, and 73.33% on live discussion        relation. While in our research, we primarily
groups for identifying the opinion leader.            focused on using network analysis and analyzing
    Jiang et al. [22] designed a method for de-       divergent behaviors and interests between opinion
tecting an opinion leader based on an improved        leaders and other majority users to find opinion
PageRank algorithm. This improved algorithm           leaders. In contrast to some of the aforementioned
used link relevance to determine the degree of the    research works, our research did not emphasize
link between users. Data was crawled from on-         predicting prices or volumes using causality anal-
line communities and preprocessed, and then the       ysis between price and word frequencies.
weight matrix was constructed by calculating link
relevance between users. The improved PageRank        2. DATA COLLECTION AND
algorithm then ranked users and detected opinion      PREPROCESSING
leaders. Compared to the baselines, the results           We collected 8 million cryptocurrency-related
from this experiment showed that the proposed         tweets for more than 100k users, scrapped from
method effectively identified opinion leaders in      2016-01-01 to 2019-03-29, leveraging the Twitter
online communities.                                   standard search application programming inter-
    Wang et al. [23] proposed a TopicSimilar-         face (API) and Tweepy Python library. A set of
Rank algorithm for opinion leaders mining and         predefined search English keywords used such
interactive information based on the similarity of    as Bitcoin or BTC, cryptocurrency. We extracted
topics. This algorithm took into account text char-   and stored the text and metadata of the tweets,
acteristics and user attributions in microblogs.      such as timestamp, number of likes and retweets,
This built links between users formed on user         hashtags, language, and user profile information,
interaction information combined with topic sim-      including user id, username, user location, num-
ilarity to construct a directed-weighted network.     ber of followers, and number of friends. Figure 2
The idea of the vote in the PageRank algorithm        shows the data collection process. The first data
to mine opinion leaders were also considered in       collection phase is registering a Twitter applica-
this algorithm. Sina Weibo datasets were used for     tion and obtaining a Twitter access key and token.
testing this algorithm, and the results showed that   The second phase is to import the Tweepy Python
this algorithm had a better performance.              package and write the python script for accessing
    Zhao et al. [24] researched the influence         Twitter API. The third phase is connecting to
power of opinion leaders and the interaction          Twitter search API and retrieving tweets using
mechanism of a group of autonomous agents             some cryptocurrency-related keywords. The last
in an e-commerce community during forming             phase reads and processes the data to extract
group opinions. The social agents within a so-        information on tags, agents, and locations for net-
cial network were divided into two subgroups          work construction and analysis. From this dataset,
opinion leaders and opinion followers. A new          a subset of four million tweets was used. The
bounded confidence-based dynamic model for            specific break-down is as follows — 2,753,808
opinion leaders and followers was founded to          posts, 533,924 comments, and 37,460 retweets.
simulate the opinion evolution of the group of            Many modifications were made to the existing

May/June 2019
                                                                                                            5
Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin
dataset before it was fed to the topic modeling       fine relationships between pages. Hubs are highly
    algorithm. This process is crucial to ensure coher-   valued lists for a given query. An authoritative
    ent results at the topic modeling stage. All non-     page is one that many hubs link to, and a hub
    English words were stripped as well as links and      is a page that links to many authorities. We
    special characters. All text was lemmatized using     utilized the HITS algorithm to partition the total
    WordNet and stripped of any stop words using the      community network into two groups: majority
    natural language toolkit (NLTK) library and a few     users and opinion leaders. More specifically, this
    additional custom stop words [25]. All characters     was done using the authority score from the HITS
    were converted to lowercase. Additionally, words      algorithm, which effectively shows the influence
    with less than three characters and posts with less   on neighboring nodes. The authority of a given
    than five words were omitted.                         node is defined as the sum of the hub scores from
                                                          neighboring incoming nodes. The equations (1)
                                                          and (2) for both authority and hubs are shown
                                                          below:

                                                                                            X
                                                            Authority (vui ) =                          hub (vuj ),     (1)
                                                                                   vuj In(vu        )
                                                                                                i

                                                                              X
                                                            Hub (vui ) =                    authority (vuj ), (2)
                                                                           vuj Out(vu   )
                                                                                    i

                                                              We may define the authority of any given
                                                          node (vui ) as the sum of the hub scores from
                                                          the set of adjacent nodes with incoming connec-
                                                          tions. Similarly, the hub score of a node (vui )
                                                          is defined as the sum of authority scores from
                                                          the set of nodes with outgoing connections. In
    Figure 2. Process for collecting tweet data.          practice, nodes with high authority are referenced
                                                          frequently, and nodes with high hub scores link
                                                          to many high authority nodes. For this work, we
    3. NETWORK CONSTRUCTION AND                           will focus on the authority score.
    ANALYSIS                                                  We define the opinion leader group as nodes
        For each of our networks, a node u is added       whose authority sums to 80% when considering
    to the graph for each user if and only if they        the authorities of all users sorted in descending
    have a post. For instance, if a user makes a          order. All other users are then to be considered
    comment but has no posts, they are not given a        the majority. As mentioned above, self-loops were
    node. An edge V is drawn between two nodes            omitted as they do not offer meaningful infor-
    ui and uj if the user corresponding with node ui      mation about their relationship with other nodes.
    comments any post corresponding with node uj.         The content of retweets contained the text of
    If a user comments on their post, no edge will        the original tweet when fed through to the topic
    be drawn — in the context of graphs, no self-         modeling algorithm. As far as implementation is
    loops will be present. We did this because self-      concerned, we employed the Python NetworkX
    comments do not give significant insight into the     library to calculate the values for HITS. This
    relationship between other nodes and will thus        is the converging algorithm and thus only ends
    only serve to add noise. Following this procedure,    when convergence is reached or a predefined limit
    we constructed the entire community network.          is reached — in our case, we set the maximum
        Next, the HITS algorithm was used; the HITS       number of iterations to be 200.
    algorithm works as a search engine to rank pages,         The distribution of authority for the users
    and the algorithm uses hubs and authority to de-      in the network followed a Pareto distribution.

6                                                                                                            IT Professional
Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin
Figure 3. Graph of the distribution of authority (y-axis) sorted in descending order for both linear (A) and log
scale (B). This figure includes both opinion leaders and majority users.

The Pareto distribution, commonly referred to            algorithm.
as the ”80/20” rule,” is commonly found in the               Also, Figure 5 represents a subset of the entire
allocation of capital and this case, influence. From     network. Vector graphics, such as a person wear-
this statistical observation, this work is based on      ing a suit, show the nodes of the opinion leaders,
— most users hold little to no authority and thus        while the other vector graphics show the nodes
have a negligible impact on the topics discussed         of the majority users; the numbers represent the
on the network. This distribution can be seen            account ID of each node. Black edges represent
as visualized in Figure 3 . This figure points           comments, and blue edges represent retweets.
out the nature of the distribution of authority.         This example shows that opinion leaders garner
We can see here that very few users possess the          substantial attention from other users, much more
vast majority of the authority in this community.        than most users.
Also, we can see from the graph that most users
have next to none. This distribution was plotted         4. TOPIC MODELING
using linear and logarithmic scales for Figure 3             Topic modeling is the process of deriving
(A) and (B). This figure plots the distribution of       a statistical topic model from a collection of
authorities of users in descending order, where the      documents. It is a machine learning technique
x-axis is simply the index of each user in the list.     and is generally unsupervised. For example, topic
The y-axis represents the authority. There were          modeling includes popular methods, including
2559 opinion leaders (0.72% of users) who ac-            Latent Dirichlet Allocation (LDA) [27], Non-
counted for 80% of the authority and the majority        negative Matrix Factorization (NMF) [28], and
(99.28%) users for the remaining 20% out of a            Latent Semantic Analysis (LSA) [29]. In this
total of 355,139 users. These percentages were           paper, we decided to utilize LDA, a probabilistic
found by dividing the number of users in each            algorithm, due to its popularity and effectiveness.
subgroup by the total number of users.                   The theory behind topic modeling is that each
    Additionally, we produced a network visual-          document is comprised of a mixture of various
ization using Gephi software [26]. Figure 4 visu-        topics; the number of the topics is a mixture of
alize the entire community network. Red nodes            different words. By categorizing documents by
represent the majority of users, and blue nodes          category, we can accurately and objectively group
represent opinion leaders. There is a gradient           documents and gain the meaning behind them.
between the two based on authority scores. Black             The LDA algorithm considers three hyperpa-
edges are comments, and blue edges are retweets.         rameters: α, β , and K. The α hyperparameter
This was distributed using the ForceAtlas2 layout        encodes the number of topics expected in each

May/June 2019
                                                                                                                   7
Figure 4. Gephi visualization of complete network. This visualization helps illustrate the concentration of
    authority, demonstrated by the tiny cluster of blue central nodes surrounded by connecting majority users.

    Table 1. Word frequencies for the opinion leaders and the majority users. The relative difference column shows by a
    factor how much larger the majority user percentage is than the opinion leaders.
                                                    Word Frequencies
                                                    Opinion Leaders  Majority Users
                                           price         0.85%          1.89%          1.22
                                            buy          0.39%          1.52%          2.87
                              Market        sell         0.19%          1.29%          5.74
                                          profit         0.12%          1.14%          8.17
                                          invest         0.04%          0.06%          0.53
                                           core          0.07%          0.02%          -0.70
                                          miner          0.09%          0.04%          -0.53
                             Technical   network         0.16%          0.09%          -0.46
                                           node          0.09%          0.03%          -0.67
                                         protocol        0.05%          0.02%          -0.67

    document. The β hyperparameter lets it know                outputs a distribution of topics of what each
    the distribution of words for each topic in the            document (tweet) is comprised of. We performed
    document, and K defines the number of topics               the LDA with a dataset of 3069704 tweets; the
    to use. Picking an accurate K value is crucial             dataset timeframe is from 2015-06-01 to 2019-
    to an interpretable output from the algorithm.             09-25 (YYYY-MM-DD). Table 1 shows the dif-
    Choosing a high value may result in an output              ferences in the word frequencies between opinion
    with topics with substantial overlap. Conversely,          leaders and majority users.
    a low K would generalize too strongly and place                As we can see in Table 1, opinion leaders
    most of the meaning behind a document into a               generally concern themselves more with cryp-
    smaller pool of topics.                                    tocurrency’s technical aspects than the majority
                                                               of users, who seem to be more interested in the
    5. RESULTS AND DISCUSSION                                  price and profit. This result would make sense,
        The topic modeling algorithm method, Latent            as we would expect those with high levels of
    Dirichlet Allocation (LDA), derives the topics             authority to possess greater expertise in the more
    from our corpus of text. The implementation of             technical aspects of cryptocurrencies. The words
    this algorithm is handled by the popular data              in Table 1 are the most used in their respective
    science package Scikit-learn [30]. This algorithm          categories. Also, we may define market as words

8                                                                                                        IT Professional
dropped to -0.11. Finally, between January 2019
                                                       and May 2019, it went to -0.80. Topic 3 was
                                                       chosen as an example due to its high correlation
                                                       with the Bitcoin price. However, the other top-
                                                       ics followed this pattern as well. We observed
                                                       a general decline in correlation between topic
                                                       weights and the price of Bitcoin over time. This
                                                       may potentially be due to the rising shrewdness
                                                       of investors — as time goes on, investors may
                                                       become less amenable to a suggestion by social
                                                       media networks. Its mainstream adaption would
                                                       lessen the influence of the subset of users with
                                                       special technical knowledge.
                                                           The most important aspect of topic modeling
                                                       is choosing the correct K value, or the number of
                                                       topics to be selected. For this work, we opted
                                                       not to use any heuristic-based approaches that
                                                       utilize coherence scores and instead ran the model
                                                       on varying topics, including 12, 8, and 4. We
                                                       found the most interpretable results came from
                                                       the output using four topics due to the minimized
                                                       inter-topic similarity. Additionally, the human in-
Figure 5. A subset of a network illustrating the dy-
                                                       terpretability of each topic significantly improved
namics between opinion leaders. We can see from
                                                       upon lowering the number of topics.
this example that opinion leaders are generally only
                                                           After running LDA on the entire community,
hubs for other opinion leaders. Seldom is there an
                                                       opinion leaders, and majority users, we used
outgoing connection from an opinion leader to a ma-
                                                       the cosine difference to calculate the degree of
jority user.
                                                       similarity between the groups. This allows us to
                                                       numerically quantify the similarity between the
pertaining to the financial aspects of Bitcoin, such   groups by treating the corpus of text for each
as profit or price. Technical words, on the other      group as a word vector. A similarity score was
hand, are regarded as words pertaining to the          calculated between the total community and opin-
internal working or function of Bitcoin itself as a    ion leaders and between the total community and
technology.                                            majority users. Given these similarities, we could
    Figure 6 shows a graph of the Bitcoin price        see which group was most similar to the total
graphed alongside a scaled version of the topic        community and, thus, which one could represent
weights smoothed out with a 60-day rolling av-         it more appropriately.
erage from the output of the total community               Table 2 shows the output of the LDA algo-
LDA. Each document belongs to each topic to            rithm for four topics for all three groups with
a certain extent — this is what the weight shows.      the similarity scores. Additionally, Figure 7 shows
Each color represents one of three top topics and      the graph of the similarities for each topic. From
runs from 2015 to 2019. The three topics are           the graph, it is evident that the opinion leader
related to crypto, price, and trade, respectively.     group’s similarity closely matches and surpasses
Each correlation contained a p-value and was cal-      that of the majority user group. The similarity
culated using the Pearson R score. As we might         of the opinion leaders to the entire community
expect, different epochs of time yielded stronger      is greater than that of the majority of users.
correlations with a Bitcoin price. Between the         Therefore, a small number of highly influential
period of December 2017 and April 2018, topic          users (opinion leaders) can effectively represent
3 yielded a correlation of -0.14. Between April        an entire community’s opinions.
2018 and August 2018, the correlation of topic 3           Table 3 shows the values for the similarities

May/June 2019
                                                                                                             9
Figure 6. Bitcoin price and topics vs. time, the x-axis represents the time, and the y-axis represents the price
     of Bitcoin. The topic weights have been scaled to fit the graph instead of being on a 0-1 scale.

     between the groups. Similarity scores for each           with the assumption that there indeed exists some
     topic have been calculated for both opinion lead-        level of causality [32].
     ers and majority users. The average score has                In addition, if Twitter were to lose public
     also been computed and is seen to be higher              trust, the methodology in the paper would still
     for the opinion leaders group, suggesting that, on       be easily applicable to different social networks.
     average, the opinion leader group bears greater          Our approach to this problem is not incompatible
     semblance to the entire community than the ma-           with the structure of other networks — we could
     jority of users. Thus, this result also shows that       have easily applied this to Facebook, Instagram,
     the opinion leaders are more able to represent the       Reddit, etc., albeit with different results. Even if
     entire community than the majority of users.             Twitter were to meet its demise, we would expect
                                                              an equivalent to rise and fill its shoes. Our interest
         From the above discussions, we may notice            in Twitter, in particular, stems from its popularity
     that regarding the connection between BTC price          amongst crypto influencers and its unique retweet
     and Tweets content, as with any matter con-              mechanic. This makes it easier to differentiate
     cerning correlation followed by a claim, it is           between opinion leaders and majority users as
     important first to ascertain that there is indeed        there is a more stark dichotomy. If Twitter were
     a causal relationship that is not the product of         to disappear, these opinion leaders would simply
     mere happenstance — for it could be the case             find a different platform on which we could
     that the price of Bitcoin correlates strongly with       easily reproduce the paper. Furthermore, while
     any number of arbitrary datasets, so, how is this        the paper’s thesis (opinion leaders can effectively
     dataset of Bitcoin-related tweets from Twitter           represent majority users) only extends so far as
     any different, we believe this topic has enough          Twitter, replicating this study on other networks
     merit to warrant a separate study. It is evident         would confirm or disprove the thesis on a broader
     that the influence of social media networks has          level which would shed greater insights on differ-
     tremendous potential to change markets. Look             ences between these networks.
     no further than the recent retail investor interest
     in GameStop or the unbelievable overvaluation            6. Lessons learned
     of Tesla to get a hint of how much the value                 We can conclude the following based on the
     of some assets is driven by the current social           results presented in this paper:
     media [31]. Given Bitcoin’s speculative nature,
     its popularity amongst young adults (the primary         • According to the words shown in Table 1,
     consumers of social media), and the sheer volume           opinion leaders use more technical words than
     of discussion that occurs about it online, we run          the majority users, where the majority use

10                                                                                                      IT Professional
Table 2. LDA output for four topics for all three groups and similarity scores.

  Topics    Topics Keywords of Entire           Topics Keywords of Opinion           Opinion      Topics Keywords of Majority           Majority
            Community                           Leaders                              Leaders      Users                                 Users
                                                                                     Similarity                                         Similarity
 Topic 0    bitcoin, price, usd, last, vol-     bitcoin, price, doge, current,       0.90195      bitcoin, crypto, blockchain,          0.82172
            ume, exchange, market, hour,        day, year, crypto, dogecoin,                      cryptocurrency,       ethereum,
            min, pair, profit, current, bi-     bcash, month, amp, time, read,                    new, money, exchange, news,
            nance, arb, change, ranging,        week, bch, market, since, hour,                   amp, get, mining, bank, world,
            yielding, opps, spanning, eur       first, buy, get, last, back, good,                currency, time, free, project,
                                                volume, high, transaction, ev-                    trading,       cryptocurrencies,
                                                ery, cryptocurrency, fee, eth,                    coin, future, ico, wallet,
                                                new, low, hit, tweet, next, ago,                  make, litecoin, gold, eth,
                                                ltc, bcashsv, block                               token, digital, network, know,
                                                                                                  first, libra, good, payment,
                                                                                                  transaction, great, binance,
                                                                                                  facebook
 Topic 1    bitcoin, crypto, money, new,        bitcoin,     xrp,      ethereum,     0.78528      bitcoin, eth, price, xrp, ltc,        0.80462
            get, cryptocurrency, year, time,    blockchain,               crypto,                 bch, volume, crypto, usd,
            blockchain, amp, gold, make,        time,       litecoin,       alert,                cryptocurrency,       ethereum,
            know, day, world, good, future,     cryptocurrencies,            top,                 binance, hour, top, eos,
            bank, currency, news                cryptocurrency, bitcoincash,                      litecoin, last, ripple, market,
                                                ltc, trading, eth, price,                         blockchain, change, trx, doge,
                                                ripple,      binance,        last,                bnb, etc, neo, sat, dash, ada,
                                                daysnews, vaultmex, market,                       xlm, xmr, forecast, trading,
                                                cryptocurrencymarket, fintech,                    bittrex, coin, average, link,
                                                trx, altcoins, ico, buy, eos,                     news, eur, bitcoincash
                                                altcoin, analysis, change,
                                                paypal, news, cryptonews,
                                                coinbase,       newsoftheweek,
                                                monero, skrill, usd
 Topic 2    bitcoin, price, buy, sell, mar-     bitcoin, usd, wallet, money,         0.92394      bitcoin, market, price, worth,        0.95417
            ket, high, profit, worth, inr,      get, gold, eth, price, coinbase,                  high, low, cap, crypto, trade,
            current, low, crypto, btce,         hardware, value, eur, make,                       last, btcusd, time, alt, usd,
            cap, time, bitfinex, btcusd, ex-    buy, amp, currency, altcoin,                      short, long, year, bull, next,
            change, vircurex, analysis          network, world, bank, know,                       day, bitmex, gmt, back, avg,
                                                time, apompliano, fiat, smart,                    bcoin, whale, hit, chart,
                                                work, year, mining, real, order,                  bullish, get, look, week,
                                                coin, good, secure, new, better,                  bought, month, right, break,
                                                trezor, bsv, crypto, asset, digi-                 change, trading, good, move
                                                tal
 Topic 3    bitcoin, crypto, blockchain,        bitcoin, crypto, price, market,      0.83673      bitcoin, buy, sell, profit, price,    0.64126
            cryptocurrency, eth, ethereum,      cryptocurrency, read, news, bt-                   inr, current, exchange, last,
            xrp, trading, ltc, bch, litecoin,   cusd, blockchain, bull, alt,                      pair, min, ranging, arb, yield-
            free, binance, top, amp, ico,       new, bullish, support, trading,                   ing, opps, spanning, btce,
            coin, ripple, join, cryptocur-      long, next, time, exchange,                       bitfinex, vircurex, btcusd, in-
            rencies                             short, chart, high, amp, resis-                   dia, buysellbitcoin, buysell-
                                                tance, trader, back, future, usd,                 bitco, btceur, block, mtgox,
                                                facebook, break, analyst, look,                   edt, cest, utc, alert, hitbtc, est,
                                                move, newsoftheweek, libra,                       cet, market, instantly, campbx,
                                                analysis, bear, level, investor,                  free, kraken, technical, trading
                                                craig

Table 3. Average scores and topic similarities for opinion             • From Figure 6, we can realize the correlation
leaders and majority users.                                              between the topic weights and the price of Bit-
     Group             0         1       2       3      Average          coin decreased over time. This may be because
 Opinion Leader       0.90     0.77     0.92    0.85     0.86            the investors become less interested in the
 Majority Users       0.83     0.81     0.96    0.65     0.81
                                                                         suggestions from the social media networks.
                                                                       • We can see from Figure 7 that despite only
                                                                         accounting for a minute segment of the entire
   words related to Bitcoin prices and profits. It                       community, the opinion leaders possess an
   implies that the opinion leaders would likely                         almost identical degree of similarity to the
   have a deeper understanding and experience of                         entire community as the majority users. We
   cryptocurrencies and Bitcoin.                                         hypothesize that this is due to the influence

May/June 2019
                                                                                                                                               11
Figure 7. Graph of similarities to the total community for both opinion leaders (blue) and majority users
     (orange), the x-axis represents the topics, and the y-axis represents the cosine similarity to the topics of the
     entire community.

         of the opinion leaders on the rest of the com-         Given individual voices such as Musk can have
         munity. This, coupled with Twitter’s retweet           such a significant individual influence on the
         feature, allows individuals’ voices to propagate       price of Bitcoin; we may conclude that the
         tremendously.                                          composition of all such opinion leaders has
     •   Based on the average scores and topic sim-             a causal influence. Because of this, it is our
         ilarities shown in Table 3, we realized it is          assumption then that the confluence of these
         possible that a minuscule subset of a network          powerful voices does indeed, to some extent,
         may represent an entire community much more            influence the price of Bitcoin — though, to
         effectively than the majority users.                   be clear, it is not certain to what extent this
     •   The impact of any individual majority user             causality reaches.
         is so low to almost considered non-existent          • Given what we have learned studying this
         compared to the effectiveness of the opinion           topic, it would be interesting to see future
         leaders in the network.                                studies applying the implied results of this
     •   Measures of efficiency could improve certain           paper. For instance, observing the opinion
         analysis methods for user networks such as             leaders of social media networks exclusively
         Twitter by only concentrating on the most              to observe their influence on various aspects
         influential voices.                                    of the real world, such as political sentiment,
     •   It is difficult to ascertain whether the price of      divisiveness, fashion trends, music taste, or
         Bitcoin drives tweets or tweets drive the price        language changes. It would be our assumption
         of Bitcoin. We know from isolated examples             from the thesis of this paper that only the top
         that tweets have the power to drive the price          voices would be relevant to the influence of
         of a cryptocurrency, with the most salient             these domains. It would also be interesting to
         examples being from figures such as Elon               observe situations where this theory falls apart
         Musk, who routinely manipulates cryptocur-             — where the majority’s voice outweighs that
         rency markets and, according to Ante [33],             of opinion leaders. This dichotomy of opin-
         “non-negative tweets from Musk lead to sig-            ion leader and majority user has the potential
         nificantly positive abnormal Bitcoin returns.”         to provide an interesting lens through which

12                                                                                                      IT Professional
to view the direction of future trends, not         5. V. Marella, “Bitcoin: A social movement
   necessarily the price of a certain commodity.          under attack,” 2017.
   With a better understanding of how these two        6. F. Reid and M. Harrigan, “An analysis of
   groups color the public narrative, we can better       anonymity in the bitcoin system,” in Security
   understand what lies ahead.                            and privacy in social networks. Springer,
                                                          2013, pp. 197–223.
7. CONCLUSION                                          7. R. Koch, The 80/20 Principle: The Secret
    This paper looked at the similarity between           of Achieving More with Less: Updated 20th
opinion leaders and majority users regarding Bit-         anniversary edition of the productivity and
coin on the social media platform Twitter. We             business classic. Hachette UK, 2011.
conclude that a small number of highly influential     8. E. Katz and P. F. Lazarsfeld, Personal influ-
users can effectively represent an entire commu-          ence: The part played by people in the flow
nity’s opinions.                                          of mass communications. Routledge, 2017.
    A more rigorous and deep-dive approach to          9. D. Riesman, N. Glazer, and R. Denney, “The
the number of topics employed would be of great           lonely crowd,” in The Lonely Crowd. Yale
interest to improve upon this study. The results          University Press, 2020.
of this study were settled subjectively and did not   10. K. Kang, J. Choo, and Y. Kim, “Whose opin-
incorporate any heuristic measures. Additionally,         ion matters? analyzing relationships between
this study using specific languages or geographic         bitcoin prices and user groups in online com-
locations may yield interesting results due to            munity,” Social Science Computer Review,
cultural differences.                                     vol. 38, no. 6, pp. 686–702, 2020.
    Future research would benefit from a more         11. J. M. Kleinberg, “Authoritative sources in
rigorous form of topic-matching to ascertain a            a hyperlinked environment,” Journal of the
more concrete comparison. Also, discovering a             ACM (JACM), vol. 46, no. 5, pp. 604–632,
quantifiable degree of mimicry between opinion            1999.
leaders and majority users would be a great route     12. M. Linton, E. G. S. Teo, E. Bommes,
to pursue. We may also expect opinion leaders             C. Chen, and W. K. Härdle, “Dynamic
to be influenced by their followers, following            topic modelling for cryptocurrency commu-
collectivized mimicry.                                    nity forums,” in Applied quantitative finance.
                                                          Springer, 2017, pp. 355–372.
REFERENCES                                            13. J. Abraham, D. Higdon, J. Nelson, and
 1. S. Barber, X. Boyen, E. Shi, and E. Uzun,             J. Ibarra, “Cryptocurrency price prediction
    “Bitter to better—how to make bitcoin a               using tweet volumes and sentiment analysis,”
    better currency,” in International conference         SMU Data Science Review, vol. 1, no. 3, p. 1,
    on financial cryptography and data security.          2018.
    Springer, 2012, pp. 399–414.                      14. B. Rubio, “Analyzing cryptocurrency groups
 2. R. Grinberg, “Bitcoin: An innovative alter-           using topic modeling on twitter posts,” Mas-
    native digital currency,” Hastings Science &          ter’s thesis, 2019.
    Technology Law Journal, vol. 4, p. 160, 2011.     15. L. Nizzoli, S. Tardelli, M. Avvenuti,
 3. Y. B. Kim, J. G. Kim, W. Kim, J. H. Im, T. H.         S. Cresci, M. Tesconi, and E. Ferrara,
    Kim, S. J. Kang, and C. H. Kim, “Predicting           “Charting the landscape of online
    fluctuations in cryptocurrency transactions           cryptocurrency manipulation,” IEEE Access,
    based on user comments and replies,” PloS             vol. 8, pp. 113 230–113 245, 2020.
    one, vol. 11, no. 8, p. e0161197, 2016.           16. G. Atashian and H. Khachatryan, “Senti-
 4. M. Kim, K. Kang, D. Park, J. Choo, and                ment analysis to predict global cryptocur-
    N. Elmqvist, “Topiclens: Efficient multi-level        rency trends,” 2018.
    visual topic exploration of large-scale docu-     17. S. Bibi, “Cryptocurrency world identification
    ment collections,” IEEE transactions on vi-           and public concerns detection via social me-
    sualization and computer graphics, vol. 23,           dia: student research abstract,” in Proceed-
    no. 1, pp. 151–160, 2016.

May/June 2019
                                                                                                           13
ings of the 34th ACM/SIGAPP Symposium on              pp. 361–362.
           Applied Computing, 2019, pp. 550–552.             27. D. M. Blei, A. Y. Ng, and M. I. Jordan, “La-
     18.   Y. B. Kim, K. Kang, J. Choo, S. J. Kang,              tent dirichlet allocation,” Journal of machine
           T. Kim, J. Im, J.-H. Kim, and C. H. Kim,              Learning research, vol. 3, no. Jan, pp. 993–
           “Predicting the currency market in online             1022, 2003.
           gaming via lexicon-based analysis on its on-      28. D. Lee and H. S. Seung, “Algorithms for
           line forum,” Complexity, vol. 2017, 2017.             non-negative matrix factorization,” Advances
     19.   Y. B. Kim, J. Lee, N. Park, J. Choo, J.-H.            in neural information processing systems,
           Kim, and C. H. Kim, “When bitcoin encoun-             vol. 13, 2000.
           ters information in an online forum: Using        29. T. K. Landauer, P. W. Foltz, and D. Laham,
           text mining to analyse user opinions and              “An introduction to latent semantic analysis,”
           predict value fluctuation,” PloS one, vol. 12,        Discourse processes, vol. 25, no. 2-3, pp.
           no. 5, p. e0177630, 2017.                             259–284, 1998.
     20.   S. Choi, “The two-step flow of communica-         30. F. Pedregosa, G. Varoquaux, A. Gramfort,
           tion in twitter-based public forums,” Social          V. Michel, B. Thirion, O. Grisel, M. Blondel,
           science computer review, vol. 33, no. 6, pp.          P. Prettenhofer, R. Weiss, V. Dubourg et al.,
           696–711, 2015.                                        “Scikit-learn: Machine learning in python,”
     21.   Y.-C. Ho, H.-M. Liu, H.-H. Hsu, C.-H. Lin,            the Journal of machine Learning research,
           Y.-H. Ho, and L.-J. Chen, “Automatic opin-            vol. 12, pp. 2825–2830, 2011.
           ion leader recognition in group discussions,”     31. J. E. Fisch, “Gamestop and the reemergence
           in 2016 Conference on Technologies and                of the retail investor,” U of Penn, Inst for Law
           Applications of Artificial Intelligence (TAAI).       & Econ Research Paper, no. 22-16, 2022.
           IEEE, 2016, pp. 138–145.                          32. A. Perrin, “Social media usage,” Pew re-
     22.   L. C. Jiang, F. F. Li, B. Ge, W. D. Xiao,             search center, vol. 125, pp. 52–68, 2015.
           J. Y. Tang, and Y. L. Hu, “Detecting opinion      33. L. Ante, “How elon musk’s twitter activity
           leaders in online communities based on an             moves cryptocurrency markets,” Technologi-
           improved pagerank algorithm,” in Applied              cal Forecasting and Social Change, vol. 186,
           Mechanics and Materials, vol. 543. Trans              p. 122112, 2023.
           Tech Publ, 2014, pp. 3524–3527.
     23.   C. Wang, Y. J. Du, and M. W. Tang, “Opinion
                                                             ACKNOWLEDGMENT
           leader mining algorithm in microblog plat-
                                                                This work was in part supported by Saginaw
           form based on topic similarity,” in 2016 2nd
                                                             Valley State University.
           IEEE International Conference on Computer
           and Communications (ICCC). IEEE, 2016,            Aos Mulahuwaish received a Ph.D. in computer
           pp. 160–165.                                      science from McMaster University, Hamilton, ON,
     24.   Y. Zhao, G. Kou, Y. Peng, and Y. Chen,            Canada. He is currently an Assistant Professor in the
           “Understanding influence power of opinion         computer science and information systems depart-
           leaders in e-commerce networks: An opinion        ment at Saginaw Valley State University, MI, USA.
           dynamics theory perspective,” Information         His research interests include social media analytics,
           Sciences, vol. 426, pp. 131–147, 2018.            social cybersecurity, machine, and deep learning,
     25.   S. Bird, “Nltk: the natural language toolkit,”    metaheuristics, and fault tolerance system. He is a
           in Proceedings of the COLING/ACL 2006             member of IEEE and ACM.
           Interactive Presentation Sessions, 2006, pp.
                                                             Matthew Loucks is an undergraduate student cur-
           69–72.
                                                             rently studying computer science at Saginaw Valley
     26.   M. Bastian, S. Heymann, and M. Jacomy,            State University. In his free time, he works as a
           “Gephi: an open source software for explor-       part-time software developer and participates in his
           ing and manipulating networks,” in Proceed-       school’s philosophy club. He enjoys reading, play-
           ings of the international AAAI conference on      ing piano, and going to the gym. After graduating,
           web and social media, vol. 3, no. 1, 2009,        Matthew is considering pursuing a master’s either in
                                                             computer science or artificial intelligence.

14                                                                                                    IT Professional
Basheer Qolomany [S’17-M’19] received the B.Sc.
and M.Sc. degrees in Computer Science from the
University of Mosul, Iraq, in 2008 and 2011, respec-
tively, and the Ph.D. and second master’s en-route
to Ph.D. degrees in Computer Science from Western
Michigan University (WMU), Kalamazoo, MI, USA, in
2018. He has worked as a Lecturer with the Depart-
ment of Computer Science, University of Duhok, Kur-
distan region of Iraq, from 2011 to 2013; a Graduate
Doctoral Assistant with the Department of Computer
Science, WMU, from 2016 to 2018; and a visiting
Assistant Professor at the Department of Computer
Science, Kennesaw State University (KSU), Marietta,
GA, USA, from 2018 to 2019. He is currently an Assis-
tant Professor with the Department of Cyber Systems,
University of Nebraska at Kearney (UNK), Kearney,
NE, USA. His research interests include evolutionary
computation, machine learning, deep learning, and
big data analytics in support of population health and
smart services. He is a member of IEEE and ACM.

Ala Al-Fuqaha [S’00-M’04-SM’09] received Ph.D. de-
gree in Computer Engineering and Networking from
the University of Missouri-Kansas City, Kansas City,
MO, USA, in 2004. He is currently a professor at
Hamad Bin Khalifa University (HBKU). His research
interests include the use of machine learning in
general and deep learning in particular in support
of the data-driven and self-driven management of
large-scale deployments of IoT and smart city infras-
tructure and services, Wireless Vehicular Networks
(VANETs), cooperation and spectrum access eti-
quette in cognitive radio networks, and management
and planning of software defined networks (SDN).
He is a senior member of the IEEE and an ABET
Program Evaluator (PEV). He serves on editorial
boards of multiple journals including IEEE Communi-
cations Letter and IEEE Network Magazine. He also
served as chair, co-chair, and technical program com-
mittee member of multiple international conferences
including IEEE VTC, IEEE Globecom, IEEE ICC, and
IWCMC.

May/June 2019
                                                         15
You can also read