Automatically Detecting Cyberbullying Comments on Online Game Forums
Automatically Detecting Cyberbullying Comments on Online Game Forums

Hanh Hong-Phuc Vo (1,2,*), Hieu Trung Tran (1,2,*), and Son T. Luu (1,2,†)
(1) University of Information Technology, Ho Chi Minh City, Vietnam
(2) Vietnam National University Ho Chi Minh City, Vietnam
Email: *{18520275,18520754}@gm.uit.edu.vn, †sonlt@uit.edu.vn

Abstract—Online game forums are popular with most game players, who use them to communicate, discuss game strategy, or even make friends. However, game forums also contain abusive and harassing speech that disturbs and threatens players. It is therefore necessary to detect and remove cyberbullying comments automatically in order to keep game forums clean and friendly. We use the Cyberbullying dataset collected from the World of Warcraft (WoW) and League of Legends (LoL) forums and train classification models to automatically detect whether a player's comment is abusive or not. The best result is obtained by the Toxic-BERT model, with a macro F1-score of 82.69% on the LoL forum and 83.86% on the WoW forum.

Index Terms—game forums, cyberbullying, machine learning, text classification, deep neural network, BERT

I. INTRODUCTION

Currently, offensive comments are a common problem on gaming forums. These comments are often negative, harmful, and abusive, creating a pessimistic impression of the gaming community. To prevent offensive comments by forum users, administrators have to find each comment manually and then hide it or ban the user who posted it. This approach is impractical on large forums with a huge volume of user comments. Automatically detecting offensive comments on game forums is therefore necessary to help administrators find and hide negative comments, or even ban users who post abusive and harassing comments. We address the problem of automatically detecting cyberbullying comments on the League of Legends and World of Warcraft forums, which are popular with most game players.

We use machine learning approaches for text classification to automatically classify whether a comment from a game player is abusive or not. We also propose complementary techniques such as data pre-processing and data augmentation to increase the performance of the classifiers, and we evaluate the empirical results to find the best models on the Cyberbullying dataset. Our approaches include traditional machine learning models (Logistic Regression, SVM), deep neural models (Text-CNN, GRU), and transfer learning models (Toxic-BERT). Finally, based on the empirical results, we choose the best-performing model to build a tool that helps administrators detect and hide abusive comments on online game forums.

The paper is organized as follows. Section I introduces the topic and the motivation for studying this task. Section II presents related work on identifying abusive comments. Section III describes the Cyberbullying dataset used for training the machine learning models. Sections IV and V present our approaches and empirical results. Finally, Section VI concludes our work.

II. RELATED WORK

In 2017, Wulczyn et al. [1] analyzed Wikipedia comments to find personal attacks, using a dataset of over 100,000 high-quality human-labeled comments and 63M machine-labeled ones. The best AUC (96.59%) was achieved by a multi-layer perceptron (MLP) model with character n-gram features trained on ED (empirical distribution) labels. For a model trained on ED labels, the attack score represents the predicted fraction of annotators who would consider the comment an attack.

Also in 2017, Davidson et al. [2] studied automated hate speech detection and the problem of offensive language. The authors constructed a dataset from tweets on Twitter and trained machine learning models such as Logistic Regression, Naive Bayes, Decision Trees, Random Forests, and Linear SVMs. After training, the best-performing model, Logistic Regression, achieved 91.00% precision and 90.00% recall and F1-score.
In addition, in 2018, Ribeiro et al. [3] characterized and detected hateful users on Twitter. The comments in their dataset were collected from Twitter but, unlike other datasets, this dataset focuses on the user rather than the content of individual comments. The authors crawled data from 100,386 users with up to 200 tweets per user and then selected a subsample for annotation: 4,972 users were manually annotated, of which 544 were labeled as hateful. They implemented models such as GraphSage, AdaBoost, and GradBoost on the dataset. Using both user features and GloVe properties together with the GraphSage model gave the best results. AdaBoost delivered a good AUC score but misclassified many normal users, resulting in low accuracy and F1-score.
Besides, many multilingual datasets for abusive comment recognition have been built in different languages, such as the hatEval dataset from the SemEval-2019 Task 5 shared task (2019) [4] for identifying offensive comments about women and immigrants in English and Spanish, and the CONAN dataset for identifying comments inciting violence in English, French, and Italian [5].

III. DATASET

The Cyberbullying dataset [6] (available at http://ub-web.de/research/) consists of two parts: comments of players on the League of Legends (LoL) and World of Warcraft (WoW) forums. The data from the League of Legends forum contains 17,354 non-offensive comments and 207 offensive cases, and the data from the World of Warcraft forum contains 16,975 non-offensive comments and 137 offensive cases. All comments were annotated manually and reviewed by experts. The dataset contains English comments on 20 different topics. There are two labels: label 1 indicates offensive comments and label 0 indicates non-offensive comments.

TABLE I: Samples in the Cyberbullying dataset.

Comments on the League of Legends forum:
  1. "We can only hope." — Label 0
  2. "This is why we don't account share children." — Label 0
  3. "This is from Season Stay down bitch fanboy loser." — Label 1
  4. "Rest in pieces. fk u" — Label 1

Comments on the World of Warcraft forum:
  1. "This thread is depressing, progress is slow." — Label 0
  2. "But they could give us free transfers in the meantime." — Label 0
  3. "I didnt nuked it. I ran the all road and can tag him at k hp. Im a druid and i used cyclone time. So dont swear to me you idiot" — Label 1
  4. "when you troll say a good time its . you down it at . Braindead" — Label 1

As Table I shows, comments are often written in an informal style, using a lot of slang, abbreviations, and strange characters, and the conversations center on exchanges between players on game forums. It can also be seen that comments in the Cyberbullying dataset often use informal words such as abbreviations and slang, for example "u" and "fk". This is a common feature of comments on social media platforms [7].

TABLE II: Distribution of labels on the two forums in the Cyberbullying dataset.

Forum              | Label 0         | Label 1
League of Legends  | 17,354 (98.81%) | 207 (1.19%)
World of Warcraft  | 16,975 (99.19%) | 137 (0.81%)

As shown in Table II, the distribution of labels is imbalanced on both the LoL and WoW forums. Most comments are skewed to label 0, which accounts for 98.81% of comments on the LoL forum and 99.19% on the WoW forum. The Cyberbullying dataset is therefore imbalanced, and most of the comments are non-offensive.

TABLE III: Average length of comments in the Cyberbullying dataset (in words).

Forum              | Label 0 | Label 1
League of Legends  | 102.95  | 137.60
World of Warcraft  | 105.42  | 165.58

Finally, Table III shows the average length of comments in the Cyberbullying dataset, computed at the word level. The average lengths of comments with label 0 and label 1 are 102.95 and 137.60 words, respectively, on the LoL forum, and 105.42 and 165.58 words, respectively, on the WoW forum. In general, comments with label 1 are longer on average than comments with label 0 on both forums.
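For readers who want to reproduce statistics like those in Tables II and III, a minimal pandas sketch might look as follows. The file name and the "comment"/"label" column names are hypothetical, since the dataset's actual export format may differ:

    import pandas as pd

    # Hypothetical file name and column layout; adapt to the actual
    # export of the Cyberbullying dataset.
    df = pd.read_csv("lol_comments.csv")  # columns: "comment", "label"

    # Label distribution (cf. Table II): counts and percentages.
    counts = df["label"].value_counts()
    percent = (df["label"].value_counts(normalize=True) * 100).round(2)
    print(pd.DataFrame({"count": counts, "percent": percent}))

    # Average comment length at the word level (cf. Table III).
    df["num_words"] = df["comment"].astype(str).str.split().str.len()
    print(df.groupby("label")["num_words"].mean().round(2))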
IV. METHODOLOGIES

A. Text classification task

Text classification is one of the most common tasks in machine learning and natural language processing. Aggarwal and Zhai (2012) define the text classification task as follows: given a training set D = {X_1, ..., X_n}, where each data point X_i ∈ D is labeled with a value from a set of labels numbered {1, ..., k}, a classification model is built from the training data; given a test set, the classification model predicts the label of each data point in the test set [8].

Fig. 1: Model overview for detecting cyberbullying comments. [figure not shown]

Figure 1 describes our model overview for detecting cyberbullying comments. Given input comments, the model performs pre-processing steps, including splitting comments into tokens, removing unimportant characters, using SMOTE to generate new examples, and using word embeddings (GloVe, fastText) to represent the input data as vectors. Next, the pre-processed data are fed into classification models, including traditional machine learning models, deep neural models, and a transformer model. Finally, the output is the labels predicted by the classification models.

B. SMOTE

The Synthetic Minority Oversampling Technique (SMOTE) is a technique for increasing the number of samples in the minority class and is widely used to handle imbalanced datasets. SMOTE works by selecting data points in the feature space, drawing a line between each point and its nearest neighbors, and sampling new minority-class data points along this line. The new data points are thus generated from the original data points to increase the size of the minority class, which is an effective method for handling imbalance in datasets [9].
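A minimal sketch of this oversampling step, using the SMOTE implementation from the imbalanced-learn library, is shown below. The TF-IDF features and the toy comments are our own assumptions for illustration; the key point is that SMOTE is applied to the vectorized training split only:

    from imblearn.over_sampling import SMOTE
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    # Toy stand-ins for forum comments; label 1 marks offensive comments.
    comments = ["we can only hope", "free transfers in the meantime",
                "progress is slow", "nice run today", "good game all",
                "stay down loser", "fk u braindead", "stupid troll idiot"]
    labels = [0, 0, 0, 0, 0, 1, 1, 1]

    X_train, X_test, y_train, y_test = train_test_split(
        comments, labels, test_size=0.25, stratify=labels, random_state=0)

    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)

    # Oversample the minority class on the training split only, so the
    # test split stays untouched by synthetic examples.
    smote = SMOTE(random_state=0, k_neighbors=1)  # k_neighbors=1 for toy data
    X_res, y_res = smote.fit_resample(X_train_vec, y_train)
    print(sorted(y_res))  # the classes are now balanced

Applying SMOTE before the train/test split would leak synthetic information into evaluation; keeping it inside the training fold avoids that.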
C. Word embedding

Word embedding is a method of word representation used in natural language processing. It represents a word (or phrase) by a vector of real numbers learned from a text corpus. Using word embeddings improves the performance of predictive models [10].

In this paper, we use the GloVe [11] and fastText [12] sets of word representation vectors. The glove.6B.300d word embedding (https://nlp.stanford.edu/projects/glove/) comprises 300-dimensional vectors over a 400,000-word vocabulary, built on the Wikipedia and Gigaword corpora. The fastText vectors (https://fasttext.cc/docs/en/crawl-vectors.html) are trained on the Common Crawl and Wikipedia datasets using CBOW with position weights, in 300 dimensions with character 5-gram features.

D. Traditional machine learning models

Logistic Regression is one of the basic classification models, commonly used for binary classification problems. For classification, texts must be converted to vectors before being fed into the training model [13].

Support Vector Machine (SVM) is a classification algorithm based on the concept of hyperplanes that divide the data into different classes. It is commonly used for classification and regression problems. As with Logistic Regression, text data is converted to vector form before being fed into the training model [14].
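A minimal sketch of such a traditional pipeline with scikit-learn is shown below. The TF-IDF feature extractor and the LinearSVC variant of SVM are assumptions on our part; the hyperparameters (C=1, max_features=13,000, random_state=0) mirror the settings reported in Section V:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # Two candidate pipelines: TF-IDF features plus a linear classifier.
    logreg = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=13000)),
        ("clf", LogisticRegression(C=1, random_state=0, max_iter=1000)),
    ])
    svm = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=13000)),
        ("clf", LinearSVC(C=1, random_state=0)),
    ])

    # Usage: fit on training comments, then predict labels for new ones.
    train_comments = ["we can only hope", "stay down loser"]
    train_labels = [0, 1]
    logreg.fit(train_comments, train_labels)
    print(logreg.predict(["rest in pieces fk u"]))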
E. Deep neural models

Text-CNN is a multi-layer neural network architecture developed for classification problems. The CNN model uses convolutional layers (CONV layers) to extract valuable information from the data [15].

GRU is an improved version of the standard recurrent neural network (RNN). To solve the vanishing gradient problem of a standard RNN, the GRU uses special gates: an update gate and a reset gate. These are essentially two vectors that decide what information should be passed to the output. Their special property is that they can be trained to keep information from long ago, without washing it out over time, and to discard information that is irrelevant to the prediction [16].

F. Transformer model

BERT is a transformer architecture created by Devlin et al. [17] that has become popular for many natural language processing tasks. With models pre-trained on large unsupervised data, BERT can later be fine-tuned for specific downstream tasks such as text classification and question answering. Toxic-BERT (also called Detoxify) [18] is a model based on the BERT architecture for the toxic comment classification task. Toxic-BERT is trained on three different toxic-comment datasets from three Jigsaw challenges. We apply the Toxic-BERT model to the Cyberbullying dataset to detect cyberbullying comments from players.
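As a sketch of how such fine-tuning might be set up with the Simple Transformers library (which Section V states was used), the snippet below fine-tunes a BERT checkpoint for binary classification. We load the generic bert-base-uncased checkpoint to keep the example self-contained; the authors fine-tune the Toxic-BERT checkpoint (unitary/toxic-bert), whose pre-trained multi-label head would need to be adapted for this binary task:

    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # Simple Transformers expects a DataFrame with "text" and "labels" columns.
    train_df = pd.DataFrame({
        "text": ["we can only hope", "stay down loser"],
        "labels": [0, 1],
    })

    # Hyperparameters mirror Section V: 5 epochs, train batch 16, eval batch 8.
    model = ClassificationModel(
        "bert", "bert-base-uncased", num_labels=2,
        args={"num_train_epochs": 5,
              "train_batch_size": 16,
              "eval_batch_size": 8},
        use_cuda=False,  # set to True if a GPU is available
    )
    model.train_model(train_df)
    predictions, raw_outputs = model.predict(["rest in pieces fk u"])
    print(predictions)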
V. EMPIRICAL RESULTS

A. Model parameters

We implemented Logistic Regression and SVM with a random state of 0, C equal to 1, and a maximum number of features equal to 13,000. We implemented Text-CNN and GRU with 5 epochs, a batch size of 512, a sequence length of 300, a convolutional layer size of 5, 128 units, a dropout rate of 0.1, and the sigmoid activation function. Finally, we implemented Toxic-BERT with 5 epochs, a training batch size of 16, and a test batch size of 8, using the Simple Transformers library (https://simpletransformers.ai/) for the implementation.

B. Data preparation

Before running the models, we perform the following pre-processing steps: (1) we remove special characters from the comments (except for "*") and keep only letters; the "*" characters represent masked offensive words and are replaced with the word "beep"; (2) we split comments into tokens using the TweetTokenizer of the NLTK library; (3) we convert comments to lowercase; and (4) we remove stop words such as "the", "in", "a", and "an", because they carry little meaning in the sentence.

We use k-fold cross-validation with 5 folds (k = 5) on the Cyberbullying dataset. The dataset is randomly divided into 5 equal parts; in each fold, the train and test sets have a proportion of 8:2. The final result is the mean over the 5 folds.
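Taken together, pre-processing steps (1)–(4) and the 5-fold evaluation protocol might be implemented as in the sketch below. The regular expression, the stop-word list, and the classifier are our reading of the description above, with toy data standing in for the dataset:

    import re
    import numpy as np
    from nltk.tokenize import TweetTokenizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold
    from sklearn.pipeline import Pipeline

    tokenizer = TweetTokenizer()
    STOP_WORDS = {"the", "in", "a", "an"}  # sample stop words from the paper

    def preprocess(comment: str) -> str:
        comment = comment.replace("*", " beep ")       # (1) masked offensive words
        comment = re.sub(r"[^A-Za-z ]", " ", comment)  # (1) keep letters only
        tokens = tokenizer.tokenize(comment.lower())   # (2) tokenize, (3) lowercase
        return " ".join(t for t in tokens if t not in STOP_WORDS)  # (4)

    # Toy stand-ins; in practice these come from the Cyberbullying dataset.
    raw = ["We can only hope.", "Free transfers, please!", "Nice run today.",
           "Progress is slow.", "Good game, all.", "Stay down, loser!",
           "You idiot.", "fk u, braindead.", "Stupid troll.", "sh*t player."]
    X = np.array([preprocess(c) for c in raw])
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

    # 5-fold cross-validation; the reported score is the mean macro F1.
    scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        pipe = Pipeline([("tfidf", TfidfVectorizer(max_features=13000)),
                         ("clf", LogisticRegression(C=1, random_state=0))])
        pipe.fit(X[train_idx], y[train_idx])
        preds = pipe.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], preds, average="macro"))
    print("mean macro F1:", np.mean(scores))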
C. Empirical results

Table IV presents the results of the models on the two forums, LoL and WoW. According to Table IV, the results of the models on the two forums are similar.

TABLE IV: Empirical results on the Cyberbullying dataset.

League of Legends forum:
Model               | SMOTE Acc(%) | SMOTE F1(%) | No SMOTE Acc(%) | No SMOTE F1(%)
Logistic Regression | 97.02        | 60.19       | 98.81           | 49.70
SVM                 | 98.20        | 58.65       | 98.81           | 49.70
Text-CNN + fastText | 97.87        | 75.30       | 99.29           | 75.67
Text-CNN + GloVe    | 97.06        | 73.19       | 99.37           | 80.68
GRU + fastText      | 96.69        | 66.37       | 99.18           | 78.46
GRU + GloVe         | 95.54        | 66.50       | 99.32           | 75.53
Toxic-BERT          | -            | -           | 99.38           | 82.69

World of Warcraft forum:
Model               | SMOTE Acc(%) | SMOTE F1(%) | No SMOTE Acc(%) | No SMOTE F1(%)
Logistic Regression | 98.11        | 63.50       | 99.19           | 49.79
SVM                 | 98.74        | 61.20       | 99.19           | 49.79
Text-CNN + fastText | 97.43        | 73.07       | 99.48           | 75.37
Text-CNN + GloVe    | 97.56        | 74.40       | 99.62           | 83.10
GRU + fastText      | 96.27        | 66.19       | 99.49           | 74.44
GRU + GloVe         | 97.77        | 71.14       | 99.47           | 72.99
Toxic-BERT          | -            | -           | 99.58           | 83.86

For the traditional machine learning models, there is a wide gap between accuracy and macro F1-score. The reason is the imbalanced label distribution in the dataset, as described in Section III.

For the deep neural models, the Text-CNN model with the GloVe word embedding gives the best macro F1-scores: 80.68% on the LoL forum and 83.10% on the WoW forum. There is also a discrepancy between accuracy and macro F1-score for the deep neural models due to the unbalanced data, but the difference is not as large as for the traditional machine learning models.

Among all models, Toxic-BERT gives the highest macro F1-scores on both forums: 82.69% on the LoL forum and 83.86% on the WoW forum (Table IV).

To handle the imbalance in the dataset, we apply the SMOTE method to increase the amount of data in the minority label. After applying SMOTE, the macro F1-scores of the traditional machine learning models increase significantly. For Logistic Regression, the macro F1-score increases by 10.49% on the LoL forum and 13.71% on the WoW forum. Similarly, for SVM, the macro F1-score increases by 8.95% and 11.41% on the LoL and WoW forums, respectively. However, SMOTE does not improve the performance of the deep neural models on either forum.

D. Error analysis

Since the results of the models on the two forums are similar according to Table IV, we analyze the prediction errors on the LoL forum of the Cyberbullying dataset.

Fig. 2: Confusion matrices of the Logistic Regression model on the LoL forum (left: without SMOTE; right: with SMOTE). [figure not shown]

For the traditional machine learning models, Figure 2 shows the confusion matrix of the Logistic Regression model before and after applying SMOTE. As shown in Figure 2, SMOTE increases the ability of the Logistic Regression model to predict label 1. However, many comments are still mispredicted from label 1 to label 0.

Fig. 3: Confusion matrices of the Text-CNN model with GloVe word embedding on the LoL forum (left: without SMOTE; right: with SMOTE). [figure not shown]

Next, Figure 3 shows the confusion matrix of the Text-CNN model with the GloVe embedding before and after applying SMOTE. After the data augmentation, the proportion of correctly predicted label-1 comments increases by 8.2%, but the proportion of correctly predicted label-0 comments decreases by 2.4%. Therefore, SMOTE does not increase the performance of the deep neural models.

Fig. 4: Confusion matrices of the Text-CNN model on the LoL forum without SMOTE (left: fastText; right: GloVe). [figure not shown]

Besides, Figure 4 shows the confusion matrices of the Text-CNN model with the GloVe and fastText word embeddings, without the SMOTE technique. As shown in Figure 4, the predictive accuracy on label 1 of the Text-CNN model with GloVe is better than with fastText, while the predictions on label 0 are similar for both. Therefore, the Text-CNN model performs better with GloVe than with fastText.

Fig. 5: Confusion matrix of the Toxic-BERT model on the LoL forum. [figure not shown]

In addition, Figure 5 shows the confusion matrix of the Toxic-BERT model. According to Figure 5, Toxic-BERT makes more correct than incorrect predictions on label 1. Compared with the other models, Toxic-BERT has a higher overall percentage of correct predictions on labels 0 and 1, and therefore the highest accuracy on the dataset, even without any data enhancement.

TABLE V: Several incorrect predictions in the Cyberbullying dataset.

No. | Input                                                                              | True | Predicted
1   | tri argu common sens girl that problem sexist bigot plz never show face realiti kthnx vote oblivion pl x. | 1 | 0
2   | stop give account info idiot absolut true still think account deserv rune back.    | 0 | 1
3   | refund get kill wait oh idiot rescind previou statement trust memori.              | 0 | 1
4   | post extra two minut travel much stupid u think min travel.                        | 1 | 0
5   | leav guy troll wont play stupid card fun get attent world wide ignor troll idiot.  | 0 | 1

Finally, as shown in Table V, many player comments on the forum use acronyms such as "plz" (roughly meaning: please) in comment No. 1, or "u" (roughly meaning: you) in comment No. 4. In addition, the syntax and grammar of player comments are often inaccurate compared with formal texts, for example the words "previou" (roughly: previous) in comment No. 3 and "minut" (roughly: minute) in comment No. 4. To improve the predictability of the models, it is necessary to apply lexicon-based approaches to normalize the comments into formal texts.

VI. CONCLUSION

In this paper, we introduced a solution for identifying offensive comments on game forums using machine learning approaches with text classification models. The Toxic-BERT model obtained the best results on the Cyberbullying dataset. A weakness of the Cyberbullying dataset is the imbalance between label 1 and label 0, which leads to many wrong predictions on label 1. To address this problem, we used SMOTE with the traditional machine learning models and the deep neural models to mitigate the data imbalance; however, the results did not improve significantly for the deep neural models. From the error analysis, we found that the incorrect predictions were caused by the informal texts used on game forums, such as acronyms, slang, and ungrammatical sentences. It is therefore necessary to normalize user comments to increase the performance of the classification models.

In the future, we plan to apply Easy Data Augmentation (EDA) techniques [19] to augment the data of the minority label, and to build a dictionary to standardize irregular words. Hurtlex [20] and the abusive lexicon provided by Wiegand et al. [21] are two potential abusive lexicon sets that could be used to increase the performance of the classification models. In addition, based on the results obtained in this paper, we plan to build a module that automatically detects offensive comments on game forums in order to help moderators keep the discussion space among game players clean and friendly.

REFERENCES

[1] E. Wulczyn, N. Thain, and L. Dixon, "Ex machina: Personal attacks seen at scale," in Proceedings of the 26th International Conference on World Wide Web, 2017.
[2] T. Davidson, D. Warmsley, M. Macy, and I. Weber, "Automated hate speech detection and the problem of offensive language," in Proceedings of the International AAAI Conference on Web and Social Media, 2017.
[3] M. Ribeiro, P. Calais, Y. Santos, V. Almeida, and W. Meira Jr, "Characterizing and detecting hateful users on Twitter," in Proceedings of the International AAAI Conference on Web and Social Media, 2018.
[4] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. Rangel Pardo, P. Rosso, and M. Sanguinetti, "SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter," in Proceedings of the 13th International Workshop on Semantic Evaluation. Minneapolis, Minnesota, USA: Association for Computational Linguistics, Jun. 2019.
[5] Y.-L. Chung, E. Kuzmenko, S. S. Tekiroglu, and M. Guerini, "CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019.
[6] U. Bretschneider and R. Peters, "Detecting cyberbullying in online communities," 2016.
[7] W. Fan and M. D. Gordon, "The power of social media analytics," Communications of the ACM, 2014.
[8] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data. Springer, 2012.
[9] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, 2002.
[10] T.-H. Pham and P. Le-Hong, "End-to-end recurrent neural network models for Vietnamese named entity recognition: Word-level vs. character-level," in Computational Linguistics: 15th International Conference of the Pacific Association for Computational Linguistics, PACLING 2017, Yangon, Myanmar, August 16-18, 2017, Revised Selected Papers. Springer, 2018.
[11] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[12] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, "Learning word vectors for 157 languages," in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), May 2018.
[13] K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text classification," in IJCAI-99 Workshop on Machine Learning for Information Filtering. Stockholm, Sweden, 1999.
[14] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), 2011.
[15] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1746-1751.
[16] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014.
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019.
[18] L. Hanu and Unitary team, "Detoxify," GitHub, https://github.com/unitaryai/detoxify, 2020.
[19] J. Wei and K. Zou, "EDA: Easy data augmentation techniques for boosting performance on text classification tasks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 6382-6388.
[20] E. Bassignana, V. Basile, and V. Patti, "Hurtlex: A multilingual lexicon of words to hurt," in 5th Italian Conference on Computational Linguistics, CLiC-it 2018. CEUR-WS, 2018.
[21] M. Wiegand, J. Ruppenhofer, A. Schmidt, and C. Greenberg, "Inducing a lexicon of abusive words - a feature-based approach," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018.