UCLA UCLA Electronic Theses and Dissertations - eScholarship
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
UCLA UCLA Electronic Theses and Dissertations Title Unsupervised Classification and Network Analysis of the Reddit Communities with Spiking Neural Network and Exponential-Family Random Graph Model Permalink https://escholarship.org/uc/item/2mr761pv Author HE, JIE Publication Date 2021 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California
UNIVERSITY OF CALIFORNIA Los Angeles Unsupervised Classification and Network Analysis of the Reddit Communities with Spiking Neural Network and Exponential-Family Random Graph Model A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Statistics by Jie He 2021
ABSTRACT OF THE THESIS Unsupervised Classification and Network Analysis of the Reddit Communities with Spiking Neural Network and Exponential-Family Random Graph Model by Jie He Master of Science in Statistics University of California, Los Angeles, 2021 Professor Yingnian Wu, Chair The spiking neural networks (SNNs) are often described as the “third generation” of neural networks, and they are expected to improve the existing deep neural networks. Recent advancements of SNNs mainly focused on processing and learning the visual signals, while SNNs’ potential in classifying non-image data is rarely tested. In this thesis, we extended the functionality of BindsNET, a popular SNN simulation software, to allow it to process and classify non-image data. We built an SNN that can efficiently classify the embedding data of 51,278 online communities (“subreddits”) on Reddit.com in an unsupervised fashion. With the classification result, we further analyzed the social network structure of the subreddit clusters of video games, using the exponential-family random graph model (ERGM). We discovered that communities of the same video game genre or same platform are more likely to be hostile towards each other. The number of subscribers and the availability of online mode are also significant factors in the hostility of a subreddit. ii
The thesis of Jie He is approved. Qing Zhou Guido Fra Montufar Cuartas Yingnian Wu, Committee Chair University of California, Los Angeles 2021 iii
TABLE OF CONTENTS 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview of the Spiking Neural Networks . . . . . . . . . . . . . . . . . . . . 1 1.2 Our Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Neuron Dynamics and Network Structure . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Leaky Integrate-and-Fire Model . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Learning Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Network and Software . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Data Description and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.2 Video Game Subreddit Attributes . . . . . . . . . . . . . . . . . . . . 12 2.3 Overview of Social Network and ERGM . . . . . . . . . . . . . . . . . . . . 13 3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1 SNN Training Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 Social Network Analysis on the Video Game Subreddits . . . . . . . . . . . . 28 3.2.1 Undirected Network Analysis . . . . . . . . . . . . . . . . . . . . . . 28 3.2.2 Analysis of Directed Network of Hostile Hyperlinks . . . . . . . . . . 32 3.2.3 ERGM analysis of Directed Hostile Hyperlink Network . . . . . . . . 33 3.2.4 MCMC Diagnostic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 v
LIST OF FIGURES 2.1 Simplified illustration of the SNN. There are equal numbers of excitatory and inhibitory neurons. An excitatory neuron receives all input spike trains, and sends a spike to a unique inhibitory neuron. The corresponding inhibitory neuron performs lateral inhibition by sending signals to the rest of excitatory neurons. . 8 2.2 Visualization of the subreddit embeddings. Each subreddit is colored based on the clustering result of K-means algorithm. Point size reflects the number of subscribers of a subreddit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.1 Top: training set with multiplier = 1; Bottom: training set with multiplier = 10 18 3.2 Top: training set with multiplier = 30; Bottom: training set with multiplier = 40 19 3.3 Training accuracy over three epochs. . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4 Three-epoch class distribution results with multiplier = 10. The proportion of class 7 grew very fast, while the proportions of class 1 and class 8 quickly decreased to zero. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5 Three-epoch class distribution results with multiplier = 20. The proportion of class 7 grew slower than the case for multiplier = 10. . . . . . . . . . . . . . . . 22 3.6 Three-epoch class distribution results with multiplier = 30. The network didn’t overfit to class 7 until the later part of the second epoch. . . . . . . . . . . . . . 23 3.7 Training accuracy with different time setting. A 200ms observation time can reduce overfitting, at the cost of a much slower training time. . . . . . . . . . . 25 3.8 Training accuracy of 400-neuron SNN. The result shows increasing the number of neurons make SNN more resistant to overfitting, at the cost of more training time and more GPU usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 vii
3.9 Top: confusion matrix of test result after one epoch. Bottom: confusion matrix of test result after three epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.10 Clustering result from trained two-layer SNN. . . . . . . . . . . . . . . . . . . . 28 3.11 Undirected network of video game subreddits. Left: Vertices colored by genere (ACT: red; RPG: green; SIM: blue; STR: cyan); Right: Vertices colored by platforms (console: red; mobile: green; pc:blue). Size of a vertex indicates the amount of users. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.12 Left: distributions of degree centrality. Right: histogram of eigenvector centrality. 31 3.13 Visualization of the hostile hyperlink history network. Left: The entire network. Right: the largest strong component. Vertices are colored by genre. . . . . . . . 32 3.14 Visualization of components of the hostile hyperlink history network. Left: the largest weak component. Right: the second largest weak component. Vertices are colored by genre. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.15 Traid Census. Figure from the lecture notes of Prof. Mark S. Handcock at UCLA [Pro] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.16 MCMC diagnositic of the final model. The chains are mixing well and the simu- lated statistics are distributed in a bell-shape around the target values. . . . . . 39 viii
LIST OF TABLES 3.1 Top 10 video game subreddits sorted by eigenvector centrality. . . . . . . . . . . 31 3.2 A summary table of the solution path. We built the model with increasing com- plexity by separately examining the significance of each attributes and then in- cluding the significant terms into our models. Note that with triad census in- cluded, the model became worse. *: p-value is significant at 5%; **: significant at 1%; ***: significant at 0.1%. . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3 Triad distribution of the hostile subreddit hyperlinks data . . . . . . . . . . . . 37 ix
CHAPTER 1 Introduction 1.1 Overview of the Spiking Neural Networks Since the debut of AlexNet [KSH12] in the ImageNet image recognition competition in 2012, the convolutional neural networks (CNNs) have become one of the focal points in machine learning research. A typical convolutional neural network takes a real-valued input vector, and passes it through multiple hidden layers of neuronal units. At each layer, the neurons process the information based on the output of its previous layer and a connection weight matrix. Then, a neuron uses a non-linear function such as ReLU and sigmoid to compute the final output. As the number of layer increases, a CNN demonstrates impressive ability to recognize different patterns of images. This architecture is an example of the “second generation” of neural networks, according to Maass [Maa97]. On the other hand, a “third generation” of neural network marks a step closer to the neuronal dynamics of human brains: it integrates discrete input signals called spike trains, and fires its own spike trains to other neurons. The integrate-and-fire model was first proposed more than a century ago [Lap07]. Since part of the success of the CNN can be attributed to its semblance to the visual receptors on the human retina, it is reasonable to assume that SNNs, when properly implemented, should outperform many state-of-the-art deep learning frameworks. Indeed, there are many advantages of an SNN. For example, the first one is energy efficiency. Spike trains are sparse signals in nature, and neurons in an SNN, unlike those in a CNN, can remain inactive unless they receive a large amount of spikes in a short time window. Another benefit is the additional temporal information encoded in the spike train. It has been observed that even 1
a slight variation of the timings of incoming spikes can result in different responses of an SNN [GKV96]. Consequently, an SNN can process complicated input data at a small cost of energy, which draws a distinction with the current generation of continuous-real-value-based neural networks. Despite the promising outlook, the research and development on the SNN are still in the early stage. The software options for SNN simulations remain limited when compared to the support for CNN, and there is a lack of effective training algorithms that leverage the power of the integrate-and-fire dynamics of spiking neurons. BindsNET [HSK18] is a recently developed software package that incorporates many powerful PyTorch functionalities into the SNNs. Compared with other simulation softwares such as BRIAN [SGB14] and AURYN[ZG14], BindsNET offers a more user-friendly interface that allow researchers to build and test different SNN architectures. More importantly, it utilizes the PyTorch libraries to transfer the data to GPUs, resulting in more efficient computations. As for training an SNN, the famous backpropagation [GBC16] algorithm cannot be directly applied to an SNN, since the discrete signals are non-differentiable. Furthermore, the backpropagation implies that higher layers in a neural network propagate the error signals through the same connection used for propagating input signals. This “symmetric feedback” assumption may not be valid for biological neural networks [LSM20]. Nevertheless, we have seen successes in converting the discrete spike trains to continuous functions and performing supervised learning on SNNs [ZG18]. Alternatively, the unsupervised learning rules like the spike- timing-dependent-plasticity (STDP) [BP98] demonstrated its ability in training a simple two- layer SNN that can classify the MNIST dataset in 2015 [DC15]. A recent re-implementation of this network, called DiehlandCook network, is also available in BindsNET. 2
1.2 Our Work The first part of our work is extending the functionality of BindsNET, allowing it to use non- image data as training sets. Currently, BindsNET only supports benchmark datasets in the popular TorchVision library. We built a new dataloader for BindsNET, which makes any .csv files compatible with BindsNET. In addition, we provide a series of pre-processing functions to handle the non-image data before they are sent to BindsNET. While SNNs are mostly benchmarked on image recognition tasks, they are rarely used for classifying non-image datasets such as word embeddings. For images, the pixel intensities can be proportional to the frequencies when an image is encoded to Poisson spike trains, but as we will present in the next section, the same process cannot be directly applied to the non-image data. Specifically, we need a series of pre-processing steps, including clustering, cleaning and signal amplification to make the data ready for training. We performed the pre-processing steps on the embedding data of 51,278 online communities(“subreddits”) on Reddit.com [KHL18]. Then, we built and trained an SNN based on the DielandCook network, and achieved a clustering result comparable to the K-mean clustering algorithm. In addition to our work in data preprocessing and training an SNN, the classification result is valuable in its own right. Combining our result with the Reddit Hyperlink Network data [KHL18], we explored the social network structure of the subreddit clusters of video games.1 Specifically, we are interested in the hostile relations among the video game sub- reddit clusters. Reddit.com is a popular online discussion website, where registered users can create and participate in discussions, or “posts”, within a subreddit. Every major video game has its subreddit, and the interactions among different video game subreddits can be complicated and often volatile. These interactions can be identified through the hyperlinks. When a post contains a hyperlink, it is created within a source subreddit, and it points to- ward the target subreddit. If the overall sentiment of the post is negative, then the hyperlink 1 This part is based on our work in STATS 218 at UCLA, taught by Professor Mark S. Handcock in the spring of 2020 [Pro]. 3
indicates an attack from the source to the target subreddit. Otherwise, it may be considered a peaceful interaction. The Reddit Hyperlink Network data contains the hyperlink history across 36,000 subreddits from 2014 to 2017. We first selected the top 80 largest video game subreddits from our clustering results. Next, we create a list of manually-examined attributes for each subreddit. Then, based on the hostile hyperlinks data, we created a network among these subreddits and modeled the network using a family of methods called the exponential random graph models (ERGM) [HHB08]. 4
CHAPTER 2 Materials and Methods 2.1 Neuron Dynamics and Network Structure 2.1.1 Leaky Integrate-and-Fire Model Unlike neurons in conventional neural networks, a spiking neuron in an SNN integrates incoming spikes from connections called synapses. Each spike changes the membrane voltage of the neuron, and upon reaching a voltage threshold Vthres , the neuron fires a spike to other neurons. The membrane voltage is then reset to the resting potential Erest and enters a refractory period during which no more spikes can be triggered. An improvement of such neuronal model is called the leaky integrate-and-fire (LIF) model [DC15]: Definition 2.1.1 (LIF model). A leaky integrate-and-fire (LIF) model is a type of biological neuron model. The membrane voltage dynamics of a LIF model can be described as follows: dV τ = (Erest − V ) + ge (Eexc − V ) + gi (Einh − V ) dt Erest is the resting potential. ge and Eexec are the conductance and equilibrium potential of excitatory synapses. Similarly, gi and Einh are the conductance and equilibrium potential of inhibitory synapses. τ is the time-constant. Here, the neuron is “leaky” in the sense that the membrane voltage decays exponentially down to the resting potential if there is no input current. A difference of the potentials in the excitatory synapse causes an input current to increase the membrane voltage, and the inhibitory synapse causes the opposite effect. 5
The spiking neurons are connected by synapses. When no spike is present, there is no current in the synapse, since the potentials at both ends of the synapse are in equilibrium state. However, when a spike arrives at “sender” side of the synapse (the presynaptic neuron), the conductance g is instantaneously increased by the presynaptic weight w. The increased conductance and a change in membrane potential result in an input current to the “receiver” of the synapse, or the postsynaptic neuron. By default, the conductance decays exponentially to zero: dg τg= −g dt where the choice of g and τg depends on whether the synapse is excitatory or inhibitory. The overall dynamics of the spiking neurons can be describes as follows: input current raises (or decreases) the membrane potential, the membrane potential causes the neuron to fire a spike, the spike increases the synaptic conductance and makes neuron output a current to other neurons. 2.1.2 Learning Rules The learning of an SNN happens at updating the synaptic weights w of the network. A famous unsupervised learning rule is the spike-timing-dependent-plasticity (STDP) [BP98]. In essence, the synaptic weights adjust automatically according to the relative timings of the input and output spikes. If an input spike arrives right before an output spike, then the synapse that connects the input and output neurons should be made stronger and vice versa. A modification of the STDP, which was used in building the network in [DC15], can be described as: ∆w = η(xpre − xtar )(wmax − w)µ where each synapse contains additional information of recent spike history, or the synaptic trace x. xpre is the presynaptic trace, recording the number of incoming spikes in the past. xtar is the target value of synaptic trace when a postsynaptic spike is fired. By default, xpre decays exponentially and is increased by one whenever a new presynaptic spike arrives. η is 6
the learning rate, wmax is the maximum weight, and µ adjusts how much the weight should update according to the previous weight. The weights are updated when a postsynaptic spike arrives. If few recent presynaptic spikes are observed before a postsynaptic spike, then the weight decreases. Hence, this update rule also tends to keep the connection sparse if the postsynaptic spikes are rarely observed. 2.1.3 Network and Software Here, we used a similar two-layer network as in [DC15]. As shown in Fig 2.1, there are two hidden layers of LIF neurons. The first layer contains excitatory neurons and the second layer contains an equal amount of inhibitory neurons. The excitatory layer is connected to the input layer in an all-to-all fashion. This design is particularly useful for classifying our non-image data, as a neuron in a typical CNN or a convolutional-style SNN [KGT18] often has a much smaller receptive field. Each excitatory neuron sends signal to a unique inhibitory neuron in a one-to-one fashion, and each inhibitory neuron sends signal to all excitatory neurons except the one it receives signal from. The inhibitory neurons can prevent excitatory neurons from being responsive to only one type of patterns in the input signal. This mechanism is called lateral inhibition. During the training period, each sample is presented to the network for a period of time (100ms in our case). A batch of 32 samples are processed at the same time, and the network is updated regularly. During the update period, each excitatory neuron is assigned a label based on its the highest response to the different classes of inputs. The learning is unsupervised, but we still need to provide labels for the training samples since they are used to label the neurons in the SNN. Here, the word “unsupervised” means that at the final layer, no error or loss function is computed, and no such information is sent back to the hidden layers to adjust their connection weights. On the other hand, “supervised” means the difference between predictions and true values help the hidden layers to adjust their weights, usually by backpropagation. Through training, each neuron itself can learn various 7
x1 x2 x3 x4 x5 Figure 2.1: Simplified illustration of the SNN. There are equal numbers of excitatory and inhibitory neurons. An excitatory neuron receives all input spike trains, and sends a spike to a unique inhibitory neuron. The corresponding inhibitory neuron performs lateral inhibition by sending signals to the rest of excitatory neurons. patterns by responding with different spike frequencies. The labels are used only to find which pattern it receives can cause the most frequent spikes. To predict a class, the learning is turned off, and the SNN classifies the input based on the highest response of different classes of excitatory neurons. We built our SNN in BindsNET, [HSK18] a recently developed SNN simulation software that incorporates the strong functionalities of the PyTorch machine learning library. One major improvement of the BindsNET is that it can transform the data into tensors and use GPU to speed up the computation, while traditional SNN simulators only use CPU. Additionally, BindsNET offers a more user-friendly interface and streamlines the process of encoding image data to spike trains. Currently, it only supports loading the existing datasets such as MNIST [LBB98] and CIFAR-10 [KH09] from the TorchVision library. We built a custom dataloader that allows any data saved in .csv format to be processed into spike trains and used for training an SNN. 8
2.2 Data Description and Preprocessing We used two sets of data in our thesis. The 300-dimensional subreddit embedding data were first introduced by Kumar et al. [KHL18] in their paper investigating the conflicts and interactions in the web. Embedding is a feature learning technique that can transform objects into vectors in high-dimensional spaces. For text data, a common embedding method is the GloVe word-embeddings [PSM14]. Here, the author used the information from both the users and communities to generate the embeddings of subreddits. The data can be then visualized as in Figure 2.2 by principle component analysis(PCA) and t-SNE [MH08]. We chose 70 principle components in the PCA step, which captured 95% of the total variance of the data. Then, we performed t-SNE with perplexity equal to 100. The data points were grouped and colored into 10 clusters based on the K-means algorithm [Llo82]. We can observe from Figure 2.2 that the subreddits are naturally clustered into groups. The clustering results are also used to help training the SNN. The Reddit Hyperlink data were also created by Kumar et al. The data contain 137,113 hyperlinks between 36,000 subreddits from January 2014 to April 2017 [Pus]. The sentiment of each hyperlink is classified by a random forest classifier into either hostile or neutral. There are very few posts that show a clear friendly intent and as a consequence, the original authors decided to merge the neutral and positive posts together into the neutral category. Therefore, posts that are neutral are labelled as 1 and the direct attacks are labelled as −1. 2.2.1 Data Preprocessing Preprocessing non-image data into spike trains suitable for our SNN model built in BindsNET is a non-trivial task. Here, we describe a general scheme for preprocessing our subreddit embedding data: 1. Performs a pre-training clustering. While the learning process is unsupervised, we still need the labels to assign neurons in our network. 9
Figure 2.2: Visualization of the subreddit embeddings. Each subreddit is colored based on the clustering result of K-means algorithm. Point size reflects the number of subscribers of a subreddit. 10
2. If the data contain both negative and positive values, create a positive part and a negative part that only store non-negative values. This means the dimension of the input is doubled. 3. For each cluster, examine the distribution of numerical values by creating a histogram. Remove the cluster that contains only very small values. A sample in this cluster is similar to a “white image” and our SNN cannot identify it. 4. For the remaining clusters, multiply the values by a number n, and convert them into integers of range [0, 255]. Any value greater than 255 is set to be 255. n can be arbitrary, but we should expect to see around 30% of values in most clusters are greater than 200. 5. Check for imbalance in the training data, and apply balancing techniques such as oversampling and undersampling. 6. Encode the data into Poisson spike trains and train the network. As noted before, we still need labels to train the SNN, despite the fact that the learning process is unsupervised. Hence, the first step is necessary. The second step is required because the existing Poisson encoder in BindsNET only supports non-negative data. In fact, one of many challenges facing the SNN today is that it can only read non-negative inputs [PP18]. During our research, we discovered that simply adding a value to make all values positive resulted in very poor performance of classification. Consequently, we traded space for preserving more information of the data, and split each dimension into the positive and negative parts. The third step is important because we found that our SNN couldn’t identify input that contains no signal. While our SNN architecture is proved to be able to recognize patterns in images, there is not a default label for “nothing”. If the input contains essentially no signal, then no spike is fed into the network, and hence no classification can be done. Such 11
a sample is useless, and can be easily detected if it is an image. When it is not image, we can use histogram to find which cluster of inputs contains no information. For the subreddit embedding data, we identified that one cluster of size 32,336 contains very small values. Including this cluster severely impacted the accuracy of our network, since all inputs from this cluster were given a default label of class 7, according to the random initialization of our network. During training the network, we also discovered that the SNN required strong and sparse signals to make a prediction. Strong and non-sparse input spike trains result in a random guess of our network, since too many neurons in the hidden layers are activated. As for the bound of [0, 255], this is the requirement of the toTensor function in the PyTorch library. The toTensor function converts a dataset with integer values in [0, 255] to a tensor with real values in [0, 1]. The tensor can then be sent to the GPU device for fast computation. As a result, we have to multiply the remaining data by a large value, so that the desired distribution of values is obtained. Unfortunately, doing so causes a loss of information, but it turns out that by removing the cluster in step 3, we can multiply the inputs by a much smaller number, and then preserve more information of the data. If the training data are imbalanced, we can use standard techniques for data balancing. Here, we used a simple resampling method to generate enough training samples for each class. The Poisson encoder is available in BindsNET, and we incorporated it into our dataloader. 2.2.2 Video Game Subreddit Attributes Once the data were classified, we focused our analysis on a particular cluster: the cluster of video game subreddits. It’s important to include as much information about each video game as possible. We selected the top 80 largest video game subreddits and added four main attributes for each subreddit. All values were manually examined: 1. Count: numerical. The amount of registered users within the subreddit. 12
2. Genere: categorical. We classify the games into four main categories: action game(ACT), role-playing game(RPG), simulation game(SIM) and strategy game(STR). The classi- fication is based on the popular tags related to each game in major gaming websites such as steam, metacritics and ign. When a game has multiple tags, we chose the most popular tag. 3. Score: numerical. We used the review scores from metacritics. When a score is absent, we estimate the score based on other review platforms such as google store, steam or ign. 4. Platform: categorical. The platform each game supports. There are three categories: console, pc and mobile. When a game has multiple platforms, we chose the earliest platform a game supports. 5. Multiplayer: categorical whether a game supports multiplayer mode. 2.3 Overview of Social Network and ERGM A network, or graph, can be described by two main components: an edge list and a vertex set. In our networks, a vertex corresponds to a subreddit, and an edge corresponds to a hyperlink between two subreddits. If all edges have directions, then the graph is directed. Otherwise, it is undirected. We can better describe each vertex within a network by the centrality measures. There are four common centrality measures for each vertex: degree centrality, closeness centrality, betweenness centrality and eigenvector centrality: 1. Degree centrality: the number of edges connected to a vertex. When a graph is directed, each vertex has an indegree and an outdegree, corresponding to the number of edges pointing towards it or out from it. High degree indicates that a vertex is connected to many other vertices. 13
2. Betweenness: the number of paths a pair of vertices must pass through a given vertex. High betweenness indicates that a vertex lies in the shortest paths of many pairs of vertices and it may have a strong control over the flow of information in a network. However, this measure is less important in our network since the flow of information among the gaming subreddits does not depend on the edges we defined. 3. Closeness: the distance of between a vertex and other vertices. High closeness value indicates that a vertex is overall far from other vertices. 4. Eigenvector centrality: the eigenvector centrality measures the relative influence of a vertex compared to other vertices. It is computed in a recursive fashion such that if a vertex is connected to vertices with high eigenvector centralities, then it has high eigenvector centrality as well. High eigenvector centrality value indicates that a vertex is more important to the network, or receives more attention. The ERGM, or exponential random graphical model, is a method that models network data. Specifically, the model assumes that the probability of a given graph y, over the set of all possible graphs Y is determined by PK exp( 1 θk gk (y)) P (Y |θ) = c(θ) where θk are parameters, gk (y) are statistics of the graph, and c(θ) is a normalizing constant. To interpret θ, we can look at the log-odds ratio of a tie (or an edge) between vertices i and j: K P (Yij = 1) = exp( θk (gk (yij+ ) − gk (yij− ))) X P (Yij = 0) 1 where yij+ is the graph that has an edge between yi and yj , with everything else fixed, and yij− is the graph where there is no edge between yi and yj . Therefore, θk implies the impact on the log-odds of a tie. Once the formula is defined, we can use tools such as maximum likelihood estimation to obtain the estimated values for each θk . However, a common problem in estimating ERGM 14
is the so-called model degeneracy [HRS03]. Sometimes, we may obtain a set of parameters that can’t correctly recreate the desired graph. Instead, only a small set of graphs with extreme values are simulated. Without simulation we may fail to capture such problem because the means of summary statistics can be close to the ones for the target graph. The problem of degeneracy also shows the importance of simulation and verification for ERGMs. One solution for the model degeneracy is the tapered ERNM [FH12]. In addition to ERGM, penalty terms are included: PK PK exp( 1 θk gk (y) − 1 βk2 [µk (θ, β) − gk (y)]2 ) P (Y = y|θ, β) = c(θ) where β > 0 are vectors of hyperparmeters. When β = ∞, the model is the same as ERGM. In practice, the tapered version is almost always preferred. We used the R packages ergm and ergm.tapered for our analysis. 15
CHAPTER 3 Results 3.1 SNN Training Results We performed a K-means clustering on the 300-dimensional embeddings of 51,278 subreddits. The data were clustered into ten classes, covering topics such as politics, entertainment and gaming. The 300-dimensional embeddings were then converted to 600-dimensional vectors containing only non-negative values. One cluster, which contains 32,336 samples which contained values all very close to zero, was removed. For comparison, when the data are scaled to real values in [0, 5], all values in this cluster are less than 1, while other clusters contain values greater than 1. A clearer illustration can be seen in Figure 2.2. Data points in the large cluster on the right side of the figure are loosely grouped, while the left portion of the figure shows clear signs of clustering. Next, we reserved 10% of the remaining data as our test set, and balanced the training set, resulting in a train set of 45,000 samples. Some samples from the smaller clusters may repeat nine, or ten times in the balanced training set. Next, we multiplied the training data by 1, 2, 5, 10, 15, 20, 30, 40 to obtain eight training sets of different signal strengths. The class-wise distribution of some training sets are shown in Figure 3.1 and Figure 3.2. From the figures, it is immediately obvious that the original data, when scaled directly to integers in [0, 255], will result in poor classification accuracy since none of the input sample can generate enough spikes to trigger responses from the excitatory neurons. With a scaling factor of 10, some classes, namely class 2 and class 3 begin to show moderate signals, while the values in the rest of the classes are still small. When we increases the scaling factor to 30, most of the classes have around 30% of their values greater than 16
200. However, increasing it further to 40 can be counter-productive, as strong signals (those values greater than 200) are too dominant. After obtaining the training sets, we trained our two-layer SNN with lateral inhibition with a batch size of 32. The update interval is 256 batches. The neurons learn the data from every batch, but the labels are assigned to them only at the update interval so that they have seen enough data to adjust their weights. There are 100 excitatory and 100 inhibitory neurons in the SNN. We ran the SNN on each training sets for three epochs. During each epoch, a training accuracy was also computed after every 256 batches. As a result, five training accuracy scores were computed for each epoch. We also enabled the SNN to output the classification distribution along with the training accuracy to monitor the learning process more closely. The training accuracy is shown in Figure 3.3. Since the first accuracy update was computed before the SNN could assign labels to the excitatory neurons, all of our results started at the same place. From the figure, we can see that the SNN trained on the original data without any multiplication (or scaling) indeed performed poorly. On the other hand, the figure indicates that increasing the multiplier (or scaling factor) can significantly improve the training accuracy. The cost of information loss from multiplication only kicks in when the multiplier is around 40. This result confirms our previous analysis on the histogram. Based on the change of training accuracy, the two-layer SNN architecture performs the best when there are around 30% strong signals, with the rest being close to zero. Additionally, a multiplier equal to 30 implies that, when the data is scaled to [0, 1], any value greater than 1/30 is set to maximum. This may seem to be very aggressive in scaling the data, but according to the histogram in Figure 3.1, those values greater than 1/30 are so few that ignoring the variance of those values hardly causes any loss of information. Instead, the small variations of the samples, which are crucial in classification, are sufficiently amplified so that the SNN can detect them. One concern from the history of training accuracy is overfitting. Since the learning is unsupervised, the training accuracy decreased across all datasets since the network overfit 17
Figure 3.1: Top: training set with multiplier = 1; Bottom: training set with multiplier = 10 18
Figure 3.2: Top: training set with multiplier = 30; Bottom: training set with multiplier = 40 19
Figure 3.3: Training accuracy over three epochs. 20
Figure 3.4: Three-epoch class distribution results with multiplier = 10. The proportion of class 7 grew very fast, while the proportions of class 1 and class 8 quickly decreased to zero. to a certain pattern in our training data instead of the labels. To further investigate the problem, we collected the class distribution data at each accuracy update. We present the distribution data for multipliers equal to 10, 20 and 30, as shown in Figure 3.4, 3.5 and 3.6. Since our training set is balanced, the proportion of each class should stay around 11%. An important observation from the distribution is that a larger scaling factor n helps pre- venting the overfitting. The overfitting of the two-layer SNN happens when some classes are not showing enough differences. This problem is apparent in histograms of Figure 3.1. Class 1, 7 and 8 all contain few large values, making them seem similar to the SNN. Recall 21
Figure 3.5: Three-epoch class distribution results with multiplier = 20. The proportion of class 7 grew slower than the case for multiplier = 10. 22
Figure 3.6: Three-epoch class distribution results with multiplier = 30. The network didn’t overfit to class 7 until the later part of the second epoch. 23
that all neurons were initialized to be class 7 by default. Initially, a slight difference among these three classes can still be observed, and the excitatory neurons are labelled correctly. However, due to the unsupervised STDP learning rule, as the training progresses, the strong signals from other classes further enhance their corresponding synaptic weights between the input and excitatory layers. On the contrary, the small signals from class 1,7 and 8 decrease the synaptic weights to the point where excitatory neurons are no longer able to detect the patterns in class 1 and 8 which they once can. When no sufficient signal is present in a sample, the neuron will simply classify it to class 7 by default. Multiplying the small signals in class 1 and 8 helps SNN detecting the pattern, but it seems the overfitting is inevitable. There is a unique trade-off of overfitting and information loss in the case of our SNN: mul- tiplying the value too little results in overfitting, but multiplying the value too much can also cause a drop in accuracy. Our finding also disagrees with the classification result of the MNIST data obtained by Diehl and Cook [DC15], in which the authors stated that such SNN is quite robust to the overfitting problem. We attribute the problem to the nature of data. Each sample in the MNIST contains sparse and strong signals thanks to its pixel intensity values in a black-and-white image. When the pictures get more complicated, for example, three-channel RGB images, SNNs with unsupervised learning rules perform much worse than CNNs in classification task [PP18]. In order to address the overfitting problem, we also tried a few different settings in training our network. The first parameter is intensity. The intensity parameter controls the maximum frequency of input spike trains. Larger intensity results in more input spikes, which is similar effect of our multiplication method. However, increasing the parameter can reduce the training speed. No upper bound is set for the intensity, and therefore, when the intensity is too high, the network receives too many input spikes to properly classify the data. Consequently, instead of tuning the intensity, we recommend multiplying the data before they are converted to tensors. The second parameter is time. Time controls how long the neurons in the hidden layer can 24
Figure 3.7: Training accuracy with different time setting. A 200ms observation time can reduce overfitting, at the cost of a much slower training time. observe a sample before they update their synaptic weights. For comparison, we trained our SNN on the training set with multiplier = 50 first with time = 100ms (the default setting), and then with time = 200ms. The result is shown in Figure 3.7. Indeed, the accuracy increased slightly, and the network is more resistant to overfitting. However, by doubling the observation time we essentially double the training time. A better device can simulate an SNN faster, and give each neuron more time for each sample without drastically slowing down the training. However, another challenge facing the current development of SNN is the lack of specialized devices. It has been stated that, without the support of specialized neuromorphic devices, increasing the simulation speed of SNN to more than one-tenth of real time is difficult [ZG14]. Finally, we tested training the network with different numbers of neurons. Due to the limit of our GPU device (NVIDIA GTX 1060), we can only train a network with at most 400 neurons in each layer. When comparing with the default setting of 100 neurons, a SNN with more neurons takes more epochs to train, but are also more resistant to overfitting. The accuracy history is given in Figure 3.8. Combining the above settings, we found that 25
Figure 3.8: Training accuracy of 400-neuron SNN. The result shows increasing the number of neurons make SNN more resistant to overfitting, at the cost of more training time and more GPU usage. the scaling factor n = 30 is the optimal choice for our subreddit embedding data, since it gave the highest training accuracy throughout the training epochs. Fig 3.2, also suggests that the factor n = 30 can sufficiently amplify the signals in the data without losing too much information, as most classes have at least 20% strong signals (values ≥ 200). To reduce overfitting, we only train the network for one epoch. Intensity, time, and number of neurons are kept to default settings due to the limitation of devices. The training takes about 6 minutes to complete on a laptop with NVIDIA GTX 1060. The testing accuracy is 72%. However, due to the imbalanced nature of the testing data, we present the confusion matrices of test results for both epoch = 1 and epoch = 3. The matrices are normalized along the true label, which are showed in Figure 3.9. In this figure, we can also observe the effect of overfitting in longer training epochs. The testing result after one epoch shows a better overall training accuracy than the one after three epochs. For the video game clusters (class 3), the SNN trained for one epoch correctly classified more testing data from this class. The classification result from our SNN is shown in Figure 3.10, which is very similar 26
Figure 3.9: Top: confusion matrix of test result after one epoch. Bottom: confusion matrix of test result after three epochs. 27
Figure 3.10: Clustering result from trained two-layer SNN. to Figure 2.2. In fact, combing results with the removed cluster, we achieved an overall accuracy of 89%. This result is quite impressive, considering that our network is trained for only one epoch. 3.2 Social Network Analysis on the Video Game Subreddits With the clustering result from both K-means and SNN, we are now ready to perform the network analysis on the cluster of video game subreddits. 3.2.1 Undirected Network Analysis While it’s more interesting to analyze the directed versions of the network, we believe the undirected version shows the general activity pattern among the subreddits, and the amonnt 28
Figure 3.11: Undirected network of video game subreddits. Left: Vertices colored by genere (ACT: red; RPG: green; SIM: blue; STR: cyan); Right: Vertices colored by platforms (con- sole: red; mobile: green; pc:blue). Size of a vertex indicates the amount of users. of attentions each community receives. The activity information can be summarized readily by the four centrality statistics introduced in the previous chapter. As seen in Figure 3.11, we can already observe some interesting facts about the video game communities. In both figures, large gaming communities tend to cluster together in the center of graphs. However, different genres of game have different impacts on the clustering of the communities. The ACT (red) communities tend to cluster together, while the RPG (green) communities are more distant to each other. There is almost no clustering among the SIM (blue) communities, and there is only one large STR (cyan) community, with a few small STR communities around it. The difference in clustering can be explained by the similarities among games in different genres. The action (ACT) games, as the name suggest, are all about fast actions and intense fighting. While the backstories behind the action games can vary, they do not constitute the main part of the gameplay. Players who enjoy one fast-paced ACT game can often 29
transfer their skills and mechanics (a term used among the gaming communities, meaning the ability to swiftly perform precise and challenging in-game movements) to other ACT games. Role-playing games (RPG), on the other hand, are story-driven games. A player who loves one RPG may not like the plot of another RPG. But the role-playing element is shared across all RPGs. Then, the simulation games (SIM) have much more variance in terms of gameplay. For example, City: Skylines and American Truck Simulator are both simulation games. It’s hard to predict whether a player who loves driving a virtual truck is also going to find enjoyment in building a city. Lastly, for the strategy games (STR), the themes can also vary, but the clustering is less obvious on the graph. For the platforms, a majority of the members in the communities are pc users (blue) and they tend to stay close to each other. The exclusive console players (red) are more separated. There is only one large mobile game community (green), the r/clashroyale. This phenomenon can be explained by the nature of these platforms. Most games can run on pc, but different consoles have exclusive games, meaning those games can only be played on the consoles from one company. The distribution of degree centrality, eigenvector centrality are give in Figure 3.12. A summary of the the top ten subreddits, ranked by their eigenvector centrality, is given in Table 3.1. We can see that there are clear outliers in the distributions of degree centrality and eigenvector centrality, indicating a few subreddits receive most of the attentions in the video game communities. From the table, games that have high eigenvector centrality tend to have a large userbase and generally well-received. However, we need to give special attention to two games: Overwatch and tf2 (Team Fortress 2). The reddit hyperlinks data were gathered in 2018, and they include data in the previous 40 months. The game Overwatch was released in 2016. Thanks to its unique gameplay and detailed character designs, the game received overwhelmingly positive reviews and became the most popular game in 2016. The gameplay of Overwatch has its root in Team Fortress 2, a relatively old shooting game. It’s then not 30
name degree eigen population genre review platform overwatch 46 0.29 3,434,718 ACT 91 pc leagueoflegends 37 0.26 4,913,911 ACT 78 pc hearthstone 36 0.24 1,738,100 STR 88 pc dota2 34 0.23 791,740 ACT 90 pc tf2 31 0.21 507,417 ACT 92 pc smashbros 30 0.20 848,418 ACT 93 console wow 25 0.19 2,041,493 RPG 93 pc globaloffensive 27 0.19 1,224,110 ACT 83 pc destinythegame 31 0.18 1,960,399 ACT 78 console smite 21 0.17 267,805 ACT 83 pc Table 3.1: Top 10 video game subreddits sorted by eigenvector centrality. Figure 3.12: Left: distributions of degree centrality. Right: histogram of eigenvector cen- trality. 31
Figure 3.13: Visualization of the hostile hyperlink history network. Left: The entire network. Right: the largest strong component. Vertices are colored by genre. surprising to see that the relative small community of Team Fortress 2 also received a lot of attention during the 40-month period. The moral of the story is that our network model can only summarize the activity patterns generally. When dealing with specific cases, it’s important to look at the history of those video games. Nevertheless, our undirected network is a good starting point for finding the outliers. 3.2.2 Analysis of Directed Network of Hostile Hyperlinks As described in the previous chapter, the Reddit Hyperlink data contain either hostile or non-hostile (neutral/friendly) hyperlinks. We used the hostile hyperlinks portion of the data. Then, we created and analyzed the network of the hostile relations among the video game subreddits. In Figure 3.13, we visualized the entire network(left), the largest strong component in the network(right). The largest weak component is shown on the left of Figure 3.14 and second largest one is on the right. Overall, the hostile hyerplinks are rare. This can be seen by the large amount of isolated points in the entire network. Furthermore, it’s 32
Figure 3.14: Visualization of components of the hostile hyperlink history network. Left: the largest weak component. Right: the second largest weak component. Vertices are colored by genre. also rare for subreddits to retaliate. Two vertices are connected strongly if they have edges pointing towards each other, and in our case, meaning two subreddits attacked each other in the past. Otherwise, two vertices are connected weakly. However, in our component analysis, the largest weak component contains 40 members, while the largest strong component has only 11 members. The 11 video game subreddits can be nicely clustered by their genres. ACT communities are quite volatile, and sometimes a few STR communities join the fight as well. The large RPG communities tend to stay far from each other, so are pretty peaceful. One exception is the second-largest weak component shown in Fig 3.14, which is mainly compirsed of RPG communities. It also seems that SIM players are the most peaceful. 3.2.3 ERGM analysis of Directed Hostile Hyperlink Network The next step is to fit our network data by ERGM. In particular, we chose the tapered ergm method to avoid model degeneracy. With the attributes described above, we can start 33
count model genre and gwesp platform multiplayer edges -19.6*** -12.86*** -14.05*** -14.13*** nodecov.count 0.57*** 0.28*** 0.32*** 0.27*** gwesp.fixed.0.5 1.77*** 1.61*** 1.53*** nodematch.genre.ACT 0.71** 0.7** 0.59** nodematch.genre.RPG 1.06* 1.06*** 1.34*** nodematch.genre.SIM 2.34*** 2.46*** 2.9*** nodematch.genre.STR 1.55* 1.58* 1.49* nodematch.platform 0.70** 0.75** nodefactor.multi.t 0.76** triadcensus.111D triadcensus.111U AIC 798.6 732.9 730.5 719.8 Table 3.2: A summary table of the solution path. We built the model with increasing complexity by separately examining the significance of each attributes and then including the significant terms into our models. Note that with triad census included, the model became worse. *: p-value is significant at 5%; **: significant at 1%; ***: significant at 0.1%. 34
by determining the significance of each attribute, and then gradually increase the model parameters by combining the significant attributes. In addition, we included the terms gwesp and triadcensus. We used the AIC for model selection, and a summary of our solutions is given in Table 3.2. A detailed explanation of each variable is given as follows: Count. The population of each community. One problem we encountered during mod- eling is that the variance of the counts was so large that ergm.tapered() function could not find an estimation. To solve this problem, we took the natural log of the population. Then, the parameters for log(count) is almost always significant in our selection of models. This result should be intuitive, as more registered users indicate a higher overall activity. However, we have to note that by taking the log, the population has a linear effect on the odds of a tie, instead of the usual exponential effect. Review. The review score of each video game. This term is not significant in any of our models. Genre. The genre of the game. We checked three types of effect separately: nodefactor, the effect of forming a tie based on the genre. nodematch checks the homophily effect of each genre. nodemix checks the effect of forming a tie that links two subreddits of different genres. During our estimation, we found that nodefactor has a moderate effect(p−value = 0.04) only for the RPG genre. The coefficients for the nodemix didn’t converge, possibly due to the limited amount of hyperlinks data. Fortunately, the nodematch coefficients are always significant, and each genre has a different coefficient. When we ignored the difference among the genres, the model became less accurate. Platform. We repeated the same process for the genre variable. Similarly, only the nodematch appeared to be significant, and the model was better when we ignored the dif- ference effects of homophily among different platforms. Multiplayer. Whether a game supports multiplayer mode. This effect appeared to be significant in all our models. 35
gwesp. gwesp stands for geometrically weighted edgewise shared partner distribution. When two vertices are both connected to a vertex, they have a shared partner. A pair of vertices can have many shared partners, but they are not necessarily connected themselves. However, when they connect, they will close out many “triangles” in the network. The term gwesp measures the effect of shared partners on the probability of forming a tie. The geometrically weighted part adds a decay factor to the number of shared partners so that the effect can’t grow out of control. It has been shown that in practice, it’s often beneficial to include gwesp or similar terms in the model [Hun07]. In our case, the term is always significant in our models. Triad census. For directed networks, there are 16 types of triads categorized by David and Leinhardt [DL67] as 003, 012, 102, 021D, 021U, 021C, 111D, 111U, 030T, 030C, 201, 120D, 120U, 120C, 210, and 300. Graphical representation of these triads is given in Fig 3.15. A common phenomenon in an hostile relationship network is the large presense of intransitive triads. These triads are inherently unstable in a friendship network, as friendship is often mutual. On the other hand, attacks are often initiated from one side and not retaliated. A summary of the distribution the triads are given in Table 3.3, where we can observe that after excluding the trivial (or vacuous) triads, most of the triads are intransitive. However, although they are significant when we model the network with only intransitive triads, they appear to be non-significant in the final model, and worsened our model fit. As a result, the final model contained edges, user counts, gwesp with fixed decay, ho- mophily for genre and platforms, and the multiplayer support. The final model achieved an AIC score of 719.8 out of the total 24 models we tested. 3.2.4 MCMC Diagnostic After estimated the coefficients, we tested the goodness-of-fit of our selected model by run- ning MCMC simulations. The diagnostic graphs are shown in Figure 3.16. Overall, the chains are mixing well and the simulated statistics are distributed in a bell-shape around 36
Figure 3.15: Traid Census. Figure from the lecture notes of Prof. Mark S. Handcock at UCLA [Pro] Triads 003 012 102 021D count 77221 3984 761 27 Triads 021U 021C 111D 111U count 30 52 40 23 Triads 030T 030C 201 120D count 1 0 7 1 Triads 120U 120C 210 300 count 4 4 4 1 Table 3.3: Triad distribution of the hostile subreddit hyperlinks data 37
the target values. There is no degeneracy found in the model. For deviance, after the model fit, we obtained a residual deviance of 701.8 on 6311 degrees of freedom. Compared to the null deviance which is 8761.4 on 6320 degrees of freedom, our model captured most of the information about the network data with 9 parameters. 38
Figure 3.16: MCMC diagnositic of the final model. The chains are mixing well and the simulated statistics are distributed in a bell-shape around the target values. 39
CHAPTER 4 Conclusion We chose the subreddit embedding data and the Reddit hyperlink data as the foundation of our research. First, we explore the possibility of classifying the 300-dimensional subreddit embedding data using a two-layer spiking neural network. During our research, we extended the functionalities of BindsNET, a PyTorch-based SNN simulation software. We built a two- layer spiking neural network with lateral inhibition that can load and encode the training data to tensors for efficient computation. In addition, our network can output various diagnostics such as class distribution and accuracy that allowed us to select the best settings of the neural network. We described a scheme that allowed us to preprocess the non-image embedding data that could be successfully used as a training data for the SNN. We discussed the problems of overfitting and training speed facing the current version of SNNs with unsupervised STDP learning rule. After obtaining the clustering result, we switched our focus to the social network analysis. Specifically, we selected the top 80 largest video game subreddits from our clustering result, and added new attributes for each subreddits. First, an undirected network analysis is performed to explore the structure of network. Then, we applied ERGM to model the hostile relationship among the subreddits. With AIC as our guide, we selected the best model from which we concluded that population, subreddits of the same video game genre, same platform and the multiplayer mode are significant factors in causing a hostile relationship. 40
You can also read