Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin

Page created by Dustin Bennett

Society

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Topic Modeling Based on Two-Step Flow Theory: Application to Tweets about Bitcoin

Topic Modeling Based on
                                        Two-Step Flow Theory:
                                        Application to Tweets about
                                        Bitcoin
                                        Aos Mulahuwaish∗ , Matthew Loucks∗ , Basheer Qolomany† , and Ala Al-Fuqaha‡
arXiv:2303.02032v1 [cs.SI] 3 Mar 2023

                                        ∗ Department of Computer Science and Information Systems, Saginaw Valley State University,
                                        University Center, USA
                                        † Cyber Systems Department, University of Nebraska at Kearney, Kearney, USA
                                        ‡ Information and Computing Technologies (ICT) Division, College of Science and Engineering (CSE),
                                        Hamad Bin Khalifa University, Doha, Qatar

                                        Abstract—Digital cryptocurrencies such as Bitcoin have exploded in recent years in both
                                        popularity and value. By their novelty, cryptocurrencies tend to be both volatile and highly
                                        speculative. The capricious nature of these coins is helped facilitated by social media networks
                                        such as Twitter. However, not everyone’s opinion matters equally, with most posts garnering little
                                        to no attention. Additionally, the majority of tweets are retweeted from popular posts. We must
                                        determine whose opinion matters and the difference between influential and non-influential
                                        users. This study separates these two groups and analyzes the differences between them. It
                                        uses Hypertext-induced Topic Selection (HITS) algorithm, which segregates the dataset based
                                        on influence. Topic modeling is then employed to uncover differences in each group’s speech
                                        types and what group may best represent the entire community. We found differences in
                                        language and interest between these two groups regarding Bitcoin and that the opinion leaders
                                        of Twitter are not aligned with the majority of users. There were 2559 opinion leaders (0.72% of
                                        users) who accounted for 80% of the authority and the majority (99.28%) users for the remaining
                                        20% out of a total of 355,139 users.

                                                                                           [1, 2, 3, 4, 5, 6]. These networks, however, can
                                        Index Terms: Network Analysis, Twitter,            comprise billions of users, with an unfathomable
                                        Machine Learning, Topic Modeling, Bitcoin.         number of connections between them. To run any
                                                                                           topic modeling or sentiment analysis algorithms
                                                                                           across the entire network would be extraordinarily
                                                                                           costly. This begs the question — would it be
                                           T HE INTRODUCTION Due to the virtual nature     possible to run topic modeling on a significantly
                                        of cryptocurrencies, much discussion occurs in     smaller subset of the network and still yield
                                        online forums or social media platforms such as    similar results? It has been well established that
                                        Twitter and Facebook. Places like this establish   authority, particularly on digital networks, tends
                                        the general public opinion of cryptocurrencies     to follow a Pareto distribution [7]. Additionally,
                                        and, consequently, to some extent, their price

                                        IT Professional
                                                                                                                                                1

most users tend to mimic opinion leaders, par- We have seen how famous public figures such
ticularly on Twitter, where the built-in mech- as Elon Musk have a tremendous capacity to
anism of retweeting facilitates the replication influence and even embody public opinion. Ad-
of one’s opinions. By focusing on these hyper- ditionally, given the difference in platform struc-
concentrated authority centers on social media ture, Twitter may be more susceptible to a few’s
networks, it may be possible to use the opinion opinions than other platforms. For these reasons,
of the few to represent the entire community. we seek to replicate this original study using a
When lower-end media users follow active Github fork and a few other modifications.
media users who interpret the meaning of media Also, in this paper, we used a different
messages and content, these are known as opinion methodology for topic modeling techniques and
leaders. These leaders are generally held in high the process in which a comparison is drawn
esteem by those who accept their opinions. Opin- between individual topics. In [10], the Non-
ion leadership originated from the theory founded negative Matrix Factorization (NMF) has been
by Elihu Katz and Paul Lazarsfeld, where it used, while in this paper, we utilized the La-
is theorized that there is a two-step flow of tent Dirichlet Allocation (LDA) due to its pop-
communication [8]. Significant contributors to the ularity and effectiveness in modeling the top-
opinion leader concept include Berelson et al. ics compared to the NMF. Additionally, in the
[9]. The theory of opinion leadership is one of [10], the Hungarian Algorithm (also called the
the multiple models that attempt to explain the Munkres assignment algorithm or Kuhn-Munkres
diffusion of ideas, innovations, and commercial algorithm) was employed to individually match
products. together the most closely related topics between
Our research focused on analyzing the con- opinion leaders and majority users. While in our
structed network and finding the opinion leaders paper, we reasoned, however, that not every topic
on a large and broader Twitter dataset, as it extracted from the first group would have a 1-1
is more susceptible to a few’s opinions than corresponding topic to the second group; thus,
other platforms, analyzing the differences in lan- the matching process would be redundant or,
guage and interest between the opinion leaders at the very least, unproductive. Thus, we only
and majority users regarding Bitcoin, and finding considered qualitatively whether or not the topics
whether the opinion leaders of Twitter are not from group to group were coherent. In addition,
aligned with the majority of users, so that a our methodology would be easily applicable to
small number of highly influential users (opinion various social networks, such as finding the ef-
leaders) can effectively represent an entire com- fectiveness of the influencers over social media
munity’s opinions. for various aspects of the real world, such as
This paper does not consider the effects of political sentiment, divisiveness, fashion trends,
mimicry and the extent to which it exists within music taste, or language changes.
these networks. For instance, after an opinion This study involves a few steps. In the first
leader releases a tweet that corresponds to a step, data is collected from Twitter using their
specific topic, what is the precise effect on the API and preprocessed; the network is constructed
community? From speculation, we would expect using comments and retweets. In the second
those users following them to begin to mimic step, the network is segregated into two different
their opinions — though we do not know for groups using Hyper-text-induced Topic Selection
certain as our efforts were not focused on that (HITS) [11]; these two groups are the opinion
question. leaders and the majority users. In the third step,
This paper and its motivations were heavily topic modeling is run on these two groups and
influenced by a previous study by Kang et al. [10] the entire community. Finally, in the fourth step,
on a relative niche forum called bitcointalk.org. we calculated the topic similarities between the
Given the recent hype surrounding cryptocur- two groups and the entire community to under-
rency on mainstream social media platforms such stand how well each constituent group represents
as Twitter, it would be of great interest to replicate the entire community. Figure 1 shows the steps
this previous study on a larger, broader dataset. conducted for our proposed analysis approach.

2 IT Professional

Figure 1. Steps of the proposed analysis approach.

This paper is organized as follows. Section events. The focus of this study was identifying
1 discusses related works. Section 2 presents major cryptocurrency events and catastrophes in-
the data collection and preprocessing. Section 3 volving some sort of Ponzi scheme causing the
presents network construction and analysis. Sec- market to crash. 30 was chosen for the number
tion 4 presents the topic of modeling. Section 5 of topics due to their semantic coherence score.
presents the results of the work.Section 6 presents The dataset was obtained from scraping web data
the lessons learned. Finally, Section 7 provides from bitcointalk.org, a cryptocurrency discussion
our conclusions. forum. About 15 million posts were collected in
total (2012 through 2016). All data was scrapped
1. RELATED WORK from a single cryptocurrency forum which most
In this section, we will discuss some relevant, likely exhibits a bias towards technologically
related works. savvy miners with a vested interest in the technol-
Within prior studies on user opinions in on- ogy. No consideration was given to the influence
line discussion forums and price fluctuation, sen- of more prominent Bitcoin community members.
timental and qualitative analyses were imple- Abraham et al. [13] used sentiment analysis
mented. In addition, some of the said studies to predict price changes in cryptocurrencies (like
focused on particular phases where cryptocur- Bitcoin and Ethereum). VADER algorithm is used
rencies price rose. Multiple studies attempted to to classify whether the text is positive or negative
investigate how cryptocurrencies, such as Bitcoin and its polarity. The datasets used in this work
price, were associated with users’ feelings and were collected from Google Trends and Twitter;
opinions by analyzing cryptocurrency forums. as a result, there was no clear relationship be-
Some of these research works are the following: tween the general sentiments of tweets and price.
Linton et al. [12] used the LDA to see how Most tweets regarding cryptocurrency tend to be
opinions are connected to large cryptocurrency positive, and many papers use datasets in which

May/June 2019
3

the price of crypto increases, resulting in a biased 15 topics total, with an R squared value of 0.28
dataset. in predicting bitcoin prices. The researchers also
Rubio [14] shifted focus away from exclu- found adding LDA topic weights as a feature in-
sively Bitcoin and price to other cryptocurrencies creases accuracy. Adding polarity and subjectivity
and technical markers such as transaction speeds, scores from sentiment analysis also improved the
smart contracts, and privacy. The researcher gath- R-value.
ered tweets relating to over 25 top global cryp- Bibi [17] used LDA to measure sentiment
tocurrencies from May 2018 to August 2018, towards cryptocurrency worldwide. The Twitter
checked if a categorization fits well for a group, dataset was used in this work; not much informa-
detected prominent themes, and ran a prediction tion was given regarding the dataset, algorithms,
to see what would be used in the future, the and methodology. As a result, Sweden and Den-
LDA and MALLET algorithms were used in this mark showed tremendously positive sentiment
work. As a result, the coherence scores plateaued towards cryptocurrency, with Canada and South
at 18 topics, which gave a value of 18. The Korea being the most negative.
researcher then found the seven most common Kim et al. [18] analyzed the current user
words per topic and visualized their frequencies opinions from online forums in Massive Mul-
in bar graphs. There were several themes between tiplayer Online Games (MMOG) setting widely
groups, including marketing, emotions, and group used around the world. This led to the proposition
relations. of a method for predicting the next day’s fall
Nizzoli et al. [15] present a chart of online and rise of the currency used in an MMOG
cryptocurrencies across multiple platforms such environment. The prediction of the daily price
as Twitter, Telegram, and Discord. They used a fluctuations used in an MMOG environment was
semi-supervised model to classify various social found by analyzing online forum users’ opinions.
media platforms. They checked how prevalent The viability of predicting the fluctuation in the
pump-and-dump and Ponzi schemes are in each value of virtual currencies was shown by focusing
bot activity and mapped their extent using a on one of the most widely used MMOGs, the
popular Twitter bot detection framework. The World of Warcraft game.
authors used the Twitter Streaming API to gather Kim et al. [19] proposed a method founded on
tweets from 3822 hashtags. They then searched deep learning and established on user opinions on
for additional Discord/Telegram links to crawl online forums to predict the fluctuation in the Bit-
and scrape more data from those platforms. In coin price and transactions. This method is viable
total, they had 50M messages from March to May for understanding a variety of cryptocurrencies in
2019. Data collected from Discord and Telegram addition to Bitcoin and increasing the usability of
was derived from Twitter, meaning any crypto these cryptocurrencies.
communities on those sites that do not have Additionally, multiple studies sought to iden-
any Twitter presence would not be included. As tify opinion leaders and/or find the main theme
a result, Telegram showed to be replete with and maximize their network marketing effec-
cryptocurrency scams; 56% of crypto channels tiveness by analyzing community networks and
were fraudulent. Suspended ac-counts were also users’ opinions. Some of these research works are
much more likely to be involved with crypto the following:
manipulation. Choi [20] examined how a piece of informa-
Atashian and Hrachya [16] used the LDA tion flowed in public forums that were social me-
to construct a predictive model for Bitcoin us- dia based, whether opinion leaders appeared from
ing sentiment analysis, as the internet primar- this flow of information, and the characteristics
ily influences cryptocurrency. Data scraped from that opinion leaders had in such forums. Two
bitcointalk.org, a bitcoin forum, from April 23, Twitter-based discussion groups focused on po-
2011, to May 05, 2018. Used GDAX Python litical discussions in South Korea were examined
library to gather Bitcoin HOCV data from 2014 using network analysis and statistical measures,
to 2018. Data was limited to niche esoteric cryp- where it was concluded that opinion leaders were
tocurrency forums. This work Built a model with influential but not content creators.

4 IT Professional

Ho et al. [21] proposed an efficient approach agents. The results from this simulation con-
to identify the opinion leader from a group dis- cluded that enhancing opinion leaders’ credibility
cussion without analyzing syntactic and semantic is key to maximizing the influence power within
features (which can add additional computing e-commerce.
effort). The researchers proposed algorithms that In summary, from the literature review, a
evaluate the magnitude of participation and emo- significant number of studies identified the opin-
tional expression during the speaking of each ion leaders on the network or analyzed users’
group member during the discussion. A well- sentiments on social media to predict the price
trained model was tested on the single dataset, fluctuation of the cryptocurrencies (like Bitcoin);
and a cross-dataset was obtained to recognize the these studies more likely used online materials
opinion leader. This testing showed an accuracy to understand the fluctuation of Bitcoin prices
of 94.68% on the Berlin dataset, 76% on the and the volume of transactions to determine any
YouTube dataset, and 73.33% on live discussion relation. While in our research, we primarily
groups for identifying the opinion leader. focused on using network analysis and analyzing
Jiang et al. [22] designed a method for de- divergent behaviors and interests between opinion
tecting an opinion leader based on an improved leaders and other majority users to find opinion
PageRank algorithm. This improved algorithm leaders. In contrast to some of the aforementioned
used link relevance to determine the degree of the research works, our research did not emphasize
link between users. Data was crawled from on- predicting prices or volumes using causality anal-
line communities and preprocessed, and then the ysis between price and word frequencies.
weight matrix was constructed by calculating link
relevance between users. The improved PageRank 2. DATA COLLECTION AND
algorithm then ranked users and detected opinion PREPROCESSING
leaders. Compared to the baselines, the results We collected 8 million cryptocurrency-related
from this experiment showed that the proposed tweets for more than 100k users, scrapped from
method effectively identified opinion leaders in 2016-01-01 to 2019-03-29, leveraging the Twitter
online communities. standard search application programming inter-
Wang et al. [23] proposed a TopicSimilar- face (API) and Tweepy Python library. A set of
Rank algorithm for opinion leaders mining and predefined search English keywords used such
interactive information based on the similarity of as Bitcoin or BTC, cryptocurrency. We extracted
topics. This algorithm took into account text char- and stored the text and metadata of the tweets,
acteristics and user attributions in microblogs. such as timestamp, number of likes and retweets,
This built links between users formed on user hashtags, language, and user profile information,
interaction information combined with topic sim- including user id, username, user location, num-
ilarity to construct a directed-weighted network. ber of followers, and number of friends. Figure 2
The idea of the vote in the PageRank algorithm shows the data collection process. The first data
to mine opinion leaders were also considered in collection phase is registering a Twitter applica-
this algorithm. Sina Weibo datasets were used for tion and obtaining a Twitter access key and token.
testing this algorithm, and the results showed that The second phase is to import the Tweepy Python
this algorithm had a better performance. package and write the python script for accessing
Zhao et al. [24] researched the influence Twitter API. The third phase is connecting to
power of opinion leaders and the interaction Twitter search API and retrieving tweets using
mechanism of a group of autonomous agents some cryptocurrency-related keywords. The last
in an e-commerce community during forming phase reads and processes the data to extract
group opinions. The social agents within a so- information on tags, agents, and locations for net-
cial network were divided into two subgroups work construction and analysis. From this dataset,
opinion leaders and opinion followers. A new a subset of four million tweets was used. The
bounded confidence-based dynamic model for specific break-down is as follows — 2,753,808
opinion leaders and followers was founded to posts, 533,924 comments, and 37,460 retweets.
simulate the opinion evolution of the group of Many modifications were made to the existing

May/June 2019
5

dataset before it was fed to the topic modeling fine relationships between pages. Hubs are highly
algorithm. This process is crucial to ensure coher- valued lists for a given query. An authoritative
ent results at the topic modeling stage. All non- page is one that many hubs link to, and a hub
English words were stripped as well as links and is a page that links to many authorities. We
special characters. All text was lemmatized using utilized the HITS algorithm to partition the total
WordNet and stripped of any stop words using the community network into two groups: majority
natural language toolkit (NLTK) library and a few users and opinion leaders. More specifically, this
additional custom stop words [25]. All characters was done using the authority score from the HITS
were converted to lowercase. Additionally, words algorithm, which effectively shows the influence
with less than three characters and posts with less on neighboring nodes. The authority of a given
than five words were omitted. node is defined as the sum of the hub scores from
neighboring incoming nodes. The equations (1)
and (2) for both authority and hubs are shown
below:

X
Authority (vui ) = hub (vuj ), (1)
vuj In(vu )
i

X
Hub (vui ) = authority (vuj ), (2)
vuj Out(vu )
i

We may define the authority of any given
node (vui ) as the sum of the hub scores from
the set of adjacent nodes with incoming connec-
tions. Similarly, the hub score of a node (vui )
is defined as the sum of authority scores from
the set of nodes with outgoing connections. In
Figure 2. Process for collecting tweet data. practice, nodes with high authority are referenced
frequently, and nodes with high hub scores link
to many high authority nodes. For this work, we
3. NETWORK CONSTRUCTION AND will focus on the authority score.
ANALYSIS We define the opinion leader group as nodes
For each of our networks, a node u is added whose authority sums to 80% when considering
to the graph for each user if and only if they the authorities of all users sorted in descending
have a post. For instance, if a user makes a order. All other users are then to be considered
comment but has no posts, they are not given a the majority. As mentioned above, self-loops were
node. An edge V is drawn between two nodes omitted as they do not offer meaningful infor-
ui and uj if the user corresponding with node ui mation about their relationship with other nodes.
comments any post corresponding with node uj. The content of retweets contained the text of
If a user comments on their post, no edge will the original tweet when fed through to the topic
be drawn — in the context of graphs, no self- modeling algorithm. As far as implementation is
loops will be present. We did this because self- concerned, we employed the Python NetworkX
comments do not give significant insight into the library to calculate the values for HITS. This
relationship between other nodes and will thus is the converging algorithm and thus only ends
only serve to add noise. Following this procedure, when convergence is reached or a predefined limit
we constructed the entire community network. is reached — in our case, we set the maximum
Next, the HITS algorithm was used; the HITS number of iterations to be 200.
algorithm works as a search engine to rank pages, The distribution of authority for the users
and the algorithm uses hubs and authority to de- in the network followed a Pareto distribution.

6 IT Professional

Figure 3. Graph of the distribution of authority (y-axis) sorted in descending order for both linear (A) and log
scale (B). This figure includes both opinion leaders and majority users.

The Pareto distribution, commonly referred to algorithm.
as the ”80/20” rule,” is commonly found in the Also, Figure 5 represents a subset of the entire
allocation of capital and this case, influence. From network. Vector graphics, such as a person wear-
this statistical observation, this work is based on ing a suit, show the nodes of the opinion leaders,
— most users hold little to no authority and thus while the other vector graphics show the nodes
have a negligible impact on the topics discussed of the majority users; the numbers represent the
on the network. This distribution can be seen account ID of each node. Black edges represent
as visualized in Figure 3 . This figure points comments, and blue edges represent retweets.
out the nature of the distribution of authority. This example shows that opinion leaders garner
We can see here that very few users possess the substantial attention from other users, much more
vast majority of the authority in this community. than most users.
Also, we can see from the graph that most users
have next to none. This distribution was plotted 4. TOPIC MODELING
using linear and logarithmic scales for Figure 3 Topic modeling is the process of deriving
(A) and (B). This figure plots the distribution of a statistical topic model from a collection of
authorities of users in descending order, where the documents. It is a machine learning technique
x-axis is simply the index of each user in the list. and is generally unsupervised. For example, topic
The y-axis represents the authority. There were modeling includes popular methods, including
2559 opinion leaders (0.72% of users) who ac- Latent Dirichlet Allocation (LDA) [27], Non-
counted for 80% of the authority and the majority negative Matrix Factorization (NMF) [28], and
(99.28%) users for the remaining 20% out of a Latent Semantic Analysis (LSA) [29]. In this
total of 355,139 users. These percentages were paper, we decided to utilize LDA, a probabilistic
found by dividing the number of users in each algorithm, due to its popularity and effectiveness.
subgroup by the total number of users. The theory behind topic modeling is that each
Additionally, we produced a network visual- document is comprised of a mixture of various
ization using Gephi software [26]. Figure 4 visu- topics; the number of the topics is a mixture of
alize the entire community network. Red nodes different words. By categorizing documents by
represent the majority of users, and blue nodes category, we can accurately and objectively group
represent opinion leaders. There is a gradient documents and gain the meaning behind them.
between the two based on authority scores. Black The LDA algorithm considers three hyperpa-
edges are comments, and blue edges are retweets. rameters: α, β , and K. The α hyperparameter
This was distributed using the ForceAtlas2 layout encodes the number of topics expected in each

May/June 2019
7

Figure 4. Gephi visualization of complete network. This visualization helps illustrate the concentration of
authority, demonstrated by the tiny cluster of blue central nodes surrounded by connecting majority users.

Table 1. Word frequencies for the opinion leaders and the majority users. The relative difference column shows by a
factor how much larger the majority user percentage is than the opinion leaders.
Word Frequencies
Opinion Leaders Majority Users
price 0.85% 1.89% 1.22
buy 0.39% 1.52% 2.87
Market sell 0.19% 1.29% 5.74
profit 0.12% 1.14% 8.17
invest 0.04% 0.06% 0.53
core 0.07% 0.02% -0.70
miner 0.09% 0.04% -0.53
Technical network 0.16% 0.09% -0.46
node 0.09% 0.03% -0.67
protocol 0.05% 0.02% -0.67

document. The β hyperparameter lets it know outputs a distribution of topics of what each
the distribution of words for each topic in the document (tweet) is comprised of. We performed
document, and K defines the number of topics the LDA with a dataset of 3069704 tweets; the
to use. Picking an accurate K value is crucial dataset timeframe is from 2015-06-01 to 2019-
to an interpretable output from the algorithm. 09-25 (YYYY-MM-DD). Table 1 shows the dif-
Choosing a high value may result in an output ferences in the word frequencies between opinion
with topics with substantial overlap. Conversely, leaders and majority users.
a low K would generalize too strongly and place As we can see in Table 1, opinion leaders
most of the meaning behind a document into a generally concern themselves more with cryp-
smaller pool of topics. tocurrency’s technical aspects than the majority
of users, who seem to be more interested in the
5. RESULTS AND DISCUSSION price and profit. This result would make sense,
The topic modeling algorithm method, Latent as we would expect those with high levels of
Dirichlet Allocation (LDA), derives the topics authority to possess greater expertise in the more
from our corpus of text. The implementation of technical aspects of cryptocurrencies. The words
this algorithm is handled by the popular data in Table 1 are the most used in their respective
science package Scikit-learn [30]. This algorithm categories. Also, we may define market as words

8 IT Professional

dropped to -0.11. Finally, between January 2019
and May 2019, it went to -0.80. Topic 3 was
chosen as an example due to its high correlation
with the Bitcoin price. However, the other top-
ics followed this pattern as well. We observed
a general decline in correlation between topic
weights and the price of Bitcoin over time. This
may potentially be due to the rising shrewdness
of investors — as time goes on, investors may
become less amenable to a suggestion by social
media networks. Its mainstream adaption would
lessen the influence of the subset of users with
special technical knowledge.
The most important aspect of topic modeling
is choosing the correct K value, or the number of
topics to be selected. For this work, we opted
not to use any heuristic-based approaches that
utilize coherence scores and instead ran the model
on varying topics, including 12, 8, and 4. We
found the most interpretable results came from
the output using four topics due to the minimized
inter-topic similarity. Additionally, the human in-
Figure 5. A subset of a network illustrating the dy-
terpretability of each topic significantly improved
namics between opinion leaders. We can see from
upon lowering the number of topics.
this example that opinion leaders are generally only
After running LDA on the entire community,
hubs for other opinion leaders. Seldom is there an
opinion leaders, and majority users, we used
outgoing connection from an opinion leader to a ma-
the cosine difference to calculate the degree of
jority user.
similarity between the groups. This allows us to
numerically quantify the similarity between the
pertaining to the financial aspects of Bitcoin, such groups by treating the corpus of text for each
as profit or price. Technical words, on the other group as a word vector. A similarity score was
hand, are regarded as words pertaining to the calculated between the total community and opin-
internal working or function of Bitcoin itself as a ion leaders and between the total community and
technology. majority users. Given these similarities, we could
Figure 6 shows a graph of the Bitcoin price see which group was most similar to the total
graphed alongside a scaled version of the topic community and, thus, which one could represent
weights smoothed out with a 60-day rolling av- it more appropriately.
erage from the output of the total community Table 2 shows the output of the LDA algo-
LDA. Each document belongs to each topic to rithm for four topics for all three groups with
a certain extent — this is what the weight shows. the similarity scores. Additionally, Figure 7 shows
Each color represents one of three top topics and the graph of the similarities for each topic. From
runs from 2015 to 2019. The three topics are the graph, it is evident that the opinion leader
related to crypto, price, and trade, respectively. group’s similarity closely matches and surpasses
Each correlation contained a p-value and was cal- that of the majority user group. The similarity
culated using the Pearson R score. As we might of the opinion leaders to the entire community
expect, different epochs of time yielded stronger is greater than that of the majority of users.
correlations with a Bitcoin price. Between the Therefore, a small number of highly influential
period of December 2017 and April 2018, topic users (opinion leaders) can effectively represent
3 yielded a correlation of -0.14. Between April an entire community’s opinions.
2018 and August 2018, the correlation of topic 3 Table 3 shows the values for the similarities

May/June 2019
9

Figure 6. Bitcoin price and topics vs. time, the x-axis represents the time, and the y-axis represents the price
of Bitcoin. The topic weights have been scaled to fit the graph instead of being on a 0-1 scale.

between the groups. Similarity scores for each with the assumption that there indeed exists some
topic have been calculated for both opinion lead- level of causality [32].
ers and majority users. The average score has In addition, if Twitter were to lose public
also been computed and is seen to be higher trust, the methodology in the paper would still
for the opinion leaders group, suggesting that, on be easily applicable to different social networks.
average, the opinion leader group bears greater Our approach to this problem is not incompatible
semblance to the entire community than the ma- with the structure of other networks — we could
jority of users. Thus, this result also shows that have easily applied this to Facebook, Instagram,
the opinion leaders are more able to represent the Reddit, etc., albeit with different results. Even if
entire community than the majority of users. Twitter were to meet its demise, we would expect
an equivalent to rise and fill its shoes. Our interest
From the above discussions, we may notice in Twitter, in particular, stems from its popularity
that regarding the connection between BTC price amongst crypto influencers and its unique retweet
and Tweets content, as with any matter con- mechanic. This makes it easier to differentiate
cerning correlation followed by a claim, it is between opinion leaders and majority users as
important first to ascertain that there is indeed there is a more stark dichotomy. If Twitter were
a causal relationship that is not the product of to disappear, these opinion leaders would simply
mere happenstance — for it could be the case find a different platform on which we could
that the price of Bitcoin correlates strongly with easily reproduce the paper. Furthermore, while
any number of arbitrary datasets, so, how is this the paper’s thesis (opinion leaders can effectively
dataset of Bitcoin-related tweets from Twitter represent majority users) only extends so far as
any different, we believe this topic has enough Twitter, replicating this study on other networks
merit to warrant a separate study. It is evident would confirm or disprove the thesis on a broader
that the influence of social media networks has level which would shed greater insights on differ-
tremendous potential to change markets. Look ences between these networks.
no further than the recent retail investor interest
in GameStop or the unbelievable overvaluation 6. Lessons learned
of Tesla to get a hint of how much the value We can conclude the following based on the
of some assets is driven by the current social results presented in this paper:
media [31]. Given Bitcoin’s speculative nature,
its popularity amongst young adults (the primary • According to the words shown in Table 1,
consumers of social media), and the sheer volume opinion leaders use more technical words than
of discussion that occurs about it online, we run the majority users, where the majority use

10 IT Professional

Table 2. LDA output for four topics for all three groups and similarity scores.

  Topics    Topics Keywords of Entire           Topics Keywords of Opinion           Opinion      Topics Keywords of Majority           Majority
            Community                           Leaders                              Leaders      Users                                 Users
                                                                                     Similarity                                         Similarity
 Topic 0    bitcoin, price, usd, last, vol-     bitcoin, price, doge, current,       0.90195      bitcoin, crypto, blockchain,          0.82172
            ume, exchange, market, hour,        day, year, crypto, dogecoin,                      cryptocurrency,       ethereum,
            min, pair, profit, current, bi-     bcash, month, amp, time, read,                    new, money, exchange, news,
            nance, arb, change, ranging,        week, bch, market, since, hour,                   amp, get, mining, bank, world,
            yielding, opps, spanning, eur       first, buy, get, last, back, good,                currency, time, free, project,
                                                volume, high, transaction, ev-                    trading,       cryptocurrencies,
                                                ery, cryptocurrency, fee, eth,                    coin, future, ico, wallet,
                                                new, low, hit, tweet, next, ago,                  make, litecoin, gold, eth,
                                                ltc, bcashsv, block                               token, digital, network, know,
                                                                                                  first, libra, good, payment,
                                                                                                  transaction, great, binance,
                                                                                                  facebook
 Topic 1    bitcoin, crypto, money, new,        bitcoin,     xrp,      ethereum,     0.78528      bitcoin, eth, price, xrp, ltc,        0.80462
            get, cryptocurrency, year, time,    blockchain,               crypto,                 bch, volume, crypto, usd,
            blockchain, amp, gold, make,        time,       litecoin,       alert,                cryptocurrency,       ethereum,
            know, day, world, good, future,     cryptocurrencies,            top,                 binance, hour, top, eos,
            bank, currency, news                cryptocurrency, bitcoincash,                      litecoin, last, ripple, market,
                                                ltc, trading, eth, price,                         blockchain, change, trx, doge,
                                                ripple,      binance,        last,                bnb, etc, neo, sat, dash, ada,
                                                daysnews, vaultmex, market,                       xlm, xmr, forecast, trading,
                                                cryptocurrencymarket, fintech,                    bittrex, coin, average, link,
                                                trx, altcoins, ico, buy, eos,                     news, eur, bitcoincash
                                                altcoin, analysis, change,
                                                paypal, news, cryptonews,
                                                coinbase,       newsoftheweek,
                                                monero, skrill, usd
 Topic 2    bitcoin, price, buy, sell, mar-     bitcoin, usd, wallet, money,         0.92394      bitcoin, market, price, worth,        0.95417
            ket, high, profit, worth, inr,      get, gold, eth, price, coinbase,                  high, low, cap, crypto, trade,
            current, low, crypto, btce,         hardware, value, eur, make,                       last, btcusd, time, alt, usd,
            cap, time, bitfinex, btcusd, ex-    buy, amp, currency, altcoin,                      short, long, year, bull, next,
            change, vircurex, analysis          network, world, bank, know,                       day, bitmex, gmt, back, avg,
                                                time, apompliano, fiat, smart,                    bcoin, whale, hit, chart,
                                                work, year, mining, real, order,                  bullish, get, look, week,
                                                coin, good, secure, new, better,                  bought, month, right, break,
                                                trezor, bsv, crypto, asset, digi-                 change, trading, good, move
                                                tal
 Topic 3    bitcoin, crypto, blockchain,        bitcoin, crypto, price, market,      0.83673      bitcoin, buy, sell, profit, price,    0.64126
            cryptocurrency, eth, ethereum,      cryptocurrency, read, news, bt-                   inr, current, exchange, last,
            xrp, trading, ltc, bch, litecoin,   cusd, blockchain, bull, alt,                      pair, min, ranging, arb, yield-
            free, binance, top, amp, ico,       new, bullish, support, trading,                   ing, opps, spanning, btce,
            coin, ripple, join, cryptocur-      long, next, time, exchange,                       bitfinex, vircurex, btcusd, in-
            rencies                             short, chart, high, amp, resis-                   dia, buysellbitcoin, buysell-
                                                tance, trader, back, future, usd,                 bitco, btceur, block, mtgox,
                                                facebook, break, analyst, look,                   edt, cest, utc, alert, hitbtc, est,
                                                move, newsoftheweek, libra,                       cet, market, instantly, campbx,
                                                analysis, bear, level, investor,                  free, kraken, technical, trading
                                                craig

Table 3. Average scores and topic similarities for opinion             • From Figure 6, we can realize the correlation
leaders and majority users.                                              between the topic weights and the price of Bit-
     Group             0         1       2       3      Average          coin decreased over time. This may be because
 Opinion Leader       0.90     0.77     0.92    0.85     0.86            the investors become less interested in the
 Majority Users       0.83     0.81     0.96    0.65     0.81
                                                                         suggestions from the social media networks.
                                                                       • We can see from Figure 7 that despite only
                                                                         accounting for a minute segment of the entire
   words related to Bitcoin prices and profits. It                       community, the opinion leaders possess an
   implies that the opinion leaders would likely                         almost identical degree of similarity to the
   have a deeper understanding and experience of                         entire community as the majority users. We
   cryptocurrencies and Bitcoin.                                         hypothesize that this is due to the influence

May/June 2019
                                                                                                                                               11

Figure 7. Graph of similarities to the total community for both opinion leaders (blue) and majority users
     (orange), the x-axis represents the topics, and the y-axis represents the cosine similarity to the topics of the
     entire community.

         of the opinion leaders on the rest of the com-         Given individual voices such as Musk can have
         munity. This, coupled with Twitter’s retweet           such a significant individual influence on the
         feature, allows individuals’ voices to propagate       price of Bitcoin; we may conclude that the
         tremendously.                                          composition of all such opinion leaders has
     •   Based on the average scores and topic sim-             a causal influence. Because of this, it is our
         ilarities shown in Table 3, we realized it is          assumption then that the confluence of these
         possible that a minuscule subset of a network          powerful voices does indeed, to some extent,
         may represent an entire community much more            influence the price of Bitcoin — though, to
         effectively than the majority users.                   be clear, it is not certain to what extent this
     •   The impact of any individual majority user             causality reaches.
         is so low to almost considered non-existent          • Given what we have learned studying this
         compared to the effectiveness of the opinion           topic, it would be interesting to see future
         leaders in the network.                                studies applying the implied results of this
     •   Measures of efficiency could improve certain           paper. For instance, observing the opinion
         analysis methods for user networks such as             leaders of social media networks exclusively
         Twitter by only concentrating on the most              to observe their influence on various aspects
         influential voices.                                    of the real world, such as political sentiment,
     •   It is difficult to ascertain whether the price of      divisiveness, fashion trends, music taste, or
         Bitcoin drives tweets or tweets drive the price        language changes. It would be our assumption
         of Bitcoin. We know from isolated examples             from the thesis of this paper that only the top
         that tweets have the power to drive the price          voices would be relevant to the influence of
         of a cryptocurrency, with the most salient             these domains. It would also be interesting to
         examples being from figures such as Elon               observe situations where this theory falls apart
         Musk, who routinely manipulates cryptocur-             — where the majority’s voice outweighs that
         rency markets and, according to Ante [33],             of opinion leaders. This dichotomy of opin-
         “non-negative tweets from Musk lead to sig-            ion leader and majority user has the potential
         nificantly positive abnormal Bitcoin returns.”         to provide an interesting lens through which

12                                                                                                      IT Professional

to view the direction of future trends, not         5. V. Marella, “Bitcoin: A social movement
   necessarily the price of a certain commodity.          under attack,” 2017.
   With a better understanding of how these two        6. F. Reid and M. Harrigan, “An analysis of
   groups color the public narrative, we can better       anonymity in the bitcoin system,” in Security
   understand what lies ahead.                            and privacy in social networks. Springer,
                                                          2013, pp. 197–223.
7. CONCLUSION                                          7. R. Koch, The 80/20 Principle: The Secret
    This paper looked at the similarity between           of Achieving More with Less: Updated 20th
opinion leaders and majority users regarding Bit-         anniversary edition of the productivity and
coin on the social media platform Twitter. We             business classic. Hachette UK, 2011.
conclude that a small number of highly influential     8. E. Katz and P. F. Lazarsfeld, Personal influ-
users can effectively represent an entire commu-          ence: The part played by people in the flow
nity’s opinions.                                          of mass communications. Routledge, 2017.
    A more rigorous and deep-dive approach to          9. D. Riesman, N. Glazer, and R. Denney, “The
the number of topics employed would be of great           lonely crowd,” in The Lonely Crowd. Yale
interest to improve upon this study. The results          University Press, 2020.
of this study were settled subjectively and did not   10. K. Kang, J. Choo, and Y. Kim, “Whose opin-
incorporate any heuristic measures. Additionally,         ion matters? analyzing relationships between
this study using specific languages or geographic         bitcoin prices and user groups in online com-
locations may yield interesting results due to            munity,” Social Science Computer Review,
cultural differences.                                     vol. 38, no. 6, pp. 686–702, 2020.
    Future research would benefit from a more         11. J. M. Kleinberg, “Authoritative sources in
rigorous form of topic-matching to ascertain a            a hyperlinked environment,” Journal of the
more concrete comparison. Also, discovering a             ACM (JACM), vol. 46, no. 5, pp. 604–632,
quantifiable degree of mimicry between opinion            1999.
leaders and majority users would be a great route     12. M. Linton, E. G. S. Teo, E. Bommes,
to pursue. We may also expect opinion leaders             C. Chen, and W. K. Härdle, “Dynamic
to be influenced by their followers, following            topic modelling for cryptocurrency commu-
collectivized mimicry.                                    nity forums,” in Applied quantitative finance.
                                                          Springer, 2017, pp. 355–372.
REFERENCES                                            13. J. Abraham, D. Higdon, J. Nelson, and
 1. S. Barber, X. Boyen, E. Shi, and E. Uzun,             J. Ibarra, “Cryptocurrency price prediction
    “Bitter to better—how to make bitcoin a               using tweet volumes and sentiment analysis,”
    better currency,” in International conference         SMU Data Science Review, vol. 1, no. 3, p. 1,
    on financial cryptography and data security.          2018.
    Springer, 2012, pp. 399–414.                      14. B. Rubio, “Analyzing cryptocurrency groups
 2. R. Grinberg, “Bitcoin: An innovative alter-           using topic modeling on twitter posts,” Mas-
    native digital currency,” Hastings Science &          ter’s thesis, 2019.
    Technology Law Journal, vol. 4, p. 160, 2011.     15. L. Nizzoli, S. Tardelli, M. Avvenuti,
 3. Y. B. Kim, J. G. Kim, W. Kim, J. H. Im, T. H.         S. Cresci, M. Tesconi, and E. Ferrara,
    Kim, S. J. Kang, and C. H. Kim, “Predicting           “Charting the landscape of online
    fluctuations in cryptocurrency transactions           cryptocurrency manipulation,” IEEE Access,
    based on user comments and replies,” PloS             vol. 8, pp. 113 230–113 245, 2020.
    one, vol. 11, no. 8, p. e0161197, 2016.           16. G. Atashian and H. Khachatryan, “Senti-
 4. M. Kim, K. Kang, D. Park, J. Choo, and                ment analysis to predict global cryptocur-
    N. Elmqvist, “Topiclens: Efficient multi-level        rency trends,” 2018.
    visual topic exploration of large-scale docu-     17. S. Bibi, “Cryptocurrency world identification
    ment collections,” IEEE transactions on vi-           and public concerns detection via social me-
    sualization and computer graphics, vol. 23,           dia: student research abstract,” in Proceed-
    no. 1, pp. 151–160, 2016.

May/June 2019
                                                                                                           13

ings of the 34th ACM/SIGAPP Symposium on              pp. 361–362.
           Applied Computing, 2019, pp. 550–552.             27. D. M. Blei, A. Y. Ng, and M. I. Jordan, “La-
     18.   Y. B. Kim, K. Kang, J. Choo, S. J. Kang,              tent dirichlet allocation,” Journal of machine
           T. Kim, J. Im, J.-H. Kim, and C. H. Kim,              Learning research, vol. 3, no. Jan, pp. 993–
           “Predicting the currency market in online             1022, 2003.
           gaming via lexicon-based analysis on its on-      28. D. Lee and H. S. Seung, “Algorithms for
           line forum,” Complexity, vol. 2017, 2017.             non-negative matrix factorization,” Advances
     19.   Y. B. Kim, J. Lee, N. Park, J. Choo, J.-H.            in neural information processing systems,
           Kim, and C. H. Kim, “When bitcoin encoun-             vol. 13, 2000.
           ters information in an online forum: Using        29. T. K. Landauer, P. W. Foltz, and D. Laham,
           text mining to analyse user opinions and              “An introduction to latent semantic analysis,”
           predict value fluctuation,” PloS one, vol. 12,        Discourse processes, vol. 25, no. 2-3, pp.
           no. 5, p. e0177630, 2017.                             259–284, 1998.
     20.   S. Choi, “The two-step flow of communica-         30. F. Pedregosa, G. Varoquaux, A. Gramfort,
           tion in twitter-based public forums,” Social          V. Michel, B. Thirion, O. Grisel, M. Blondel,
           science computer review, vol. 33, no. 6, pp.          P. Prettenhofer, R. Weiss, V. Dubourg et al.,
           696–711, 2015.                                        “Scikit-learn: Machine learning in python,”
     21.   Y.-C. Ho, H.-M. Liu, H.-H. Hsu, C.-H. Lin,            the Journal of machine Learning research,
           Y.-H. Ho, and L.-J. Chen, “Automatic opin-            vol. 12, pp. 2825–2830, 2011.
           ion leader recognition in group discussions,”     31. J. E. Fisch, “Gamestop and the reemergence
           in 2016 Conference on Technologies and                of the retail investor,” U of Penn, Inst for Law
           Applications of Artificial Intelligence (TAAI).       & Econ Research Paper, no. 22-16, 2022.
           IEEE, 2016, pp. 138–145.                          32. A. Perrin, “Social media usage,” Pew re-
     22.   L. C. Jiang, F. F. Li, B. Ge, W. D. Xiao,             search center, vol. 125, pp. 52–68, 2015.
           J. Y. Tang, and Y. L. Hu, “Detecting opinion      33. L. Ante, “How elon musk’s twitter activity
           leaders in online communities based on an             moves cryptocurrency markets,” Technologi-
           improved pagerank algorithm,” in Applied              cal Forecasting and Social Change, vol. 186,
           Mechanics and Materials, vol. 543. Trans              p. 122112, 2023.
           Tech Publ, 2014, pp. 3524–3527.
     23.   C. Wang, Y. J. Du, and M. W. Tang, “Opinion
                                                             ACKNOWLEDGMENT
           leader mining algorithm in microblog plat-
                                                                This work was in part supported by Saginaw
           form based on topic similarity,” in 2016 2nd
                                                             Valley State University.
           IEEE International Conference on Computer
           and Communications (ICCC). IEEE, 2016,            Aos Mulahuwaish received a Ph.D. in computer
           pp. 160–165.                                      science from McMaster University, Hamilton, ON,
     24.   Y. Zhao, G. Kou, Y. Peng, and Y. Chen,            Canada. He is currently an Assistant Professor in the
           “Understanding influence power of opinion         computer science and information systems depart-
           leaders in e-commerce networks: An opinion        ment at Saginaw Valley State University, MI, USA.
           dynamics theory perspective,” Information         His research interests include social media analytics,
           Sciences, vol. 426, pp. 131–147, 2018.            social cybersecurity, machine, and deep learning,
     25.   S. Bird, “Nltk: the natural language toolkit,”    metaheuristics, and fault tolerance system. He is a
           in Proceedings of the COLING/ACL 2006             member of IEEE and ACM.
           Interactive Presentation Sessions, 2006, pp.
                                                             Matthew Loucks is an undergraduate student cur-
           69–72.
                                                             rently studying computer science at Saginaw Valley
     26.   M. Bastian, S. Heymann, and M. Jacomy,            State University. In his free time, he works as a
           “Gephi: an open source software for explor-       part-time software developer and participates in his
           ing and manipulating networks,” in Proceed-       school’s philosophy club. He enjoys reading, play-
           ings of the international AAAI conference on      ing piano, and going to the gym. After graduating,
           web and social media, vol. 3, no. 1, 2009,        Matthew is considering pursuing a master’s either in
                                                             computer science or artificial intelligence.

14                                                                                                    IT Professional

Basheer Qolomany [S’17-M’19] received the B.Sc.
and M.Sc. degrees in Computer Science from the
University of Mosul, Iraq, in 2008 and 2011, respec-
tively, and the Ph.D. and second master’s en-route
to Ph.D. degrees in Computer Science from Western
Michigan University (WMU), Kalamazoo, MI, USA, in
2018. He has worked as a Lecturer with the Depart-
ment of Computer Science, University of Duhok, Kur-
distan region of Iraq, from 2011 to 2013; a Graduate
Doctoral Assistant with the Department of Computer
Science, WMU, from 2016 to 2018; and a visiting
Assistant Professor at the Department of Computer
Science, Kennesaw State University (KSU), Marietta,
GA, USA, from 2018 to 2019. He is currently an Assis-
tant Professor with the Department of Cyber Systems,
University of Nebraska at Kearney (UNK), Kearney,
NE, USA. His research interests include evolutionary
computation, machine learning, deep learning, and
big data analytics in support of population health and
smart services. He is a member of IEEE and ACM.

Ala Al-Fuqaha [S’00-M’04-SM’09] received Ph.D. de-
gree in Computer Engineering and Networking from
the University of Missouri-Kansas City, Kansas City,
MO, USA, in 2004. He is currently a professor at
Hamad Bin Khalifa University (HBKU). His research
interests include the use of machine learning in
general and deep learning in particular in support
of the data-driven and self-driven management of
large-scale deployments of IoT and smart city infras-
tructure and services, Wireless Vehicular Networks
(VANETs), cooperation and spectrum access eti-
quette in cognitive radio networks, and management
and planning of software defined networks (SDN).
He is a senior member of the IEEE and an ABET
Program Evaluator (PEV). He serves on editorial
boards of multiple journals including IEEE Communi-
cations Letter and IEEE Network Magazine. He also
served as chair, co-chair, and technical program com-
mittee member of multiple international conferences
including IEEE VTC, IEEE Globecom, IEEE ICC, and
IWCMC.

May/June 2019
                                                         15

You can also read