Automatically Detecting Cyberbullying Comments on Online Game Forums
Automatically Detecting Cyberbullying Comments on Online Game Forums

Hanh Hong-Phuc Vo (1,2,*), Hieu Trung Tran (1,2,*), and Son T. Luu (1,2,†)
(1) University of Information Technology, Ho Chi Minh City, Vietnam
(2) Vietnam National University Ho Chi Minh City, Vietnam
Email: *{18520275,18520754}@gm.uit.edu.vn, †sonlt@uit.edu.vn

Abstract—Online game forums are popular with most game players, who use them to communicate, discuss game strategy, or even make friends. However, game forums also contain abusive and harassing speech that disturbs and threatens players. It is therefore necessary to detect and remove cyberbullying comments automatically in order to keep game forums clean and friendly. We use the Cyberbullying dataset collected from the World of Warcraft (WoW) and League of Legends (LoL) forums and train classification models to automatically detect whether a player's comment is abusive or not. The best result is obtained by the Toxic-BERT model, with a macro F1-score of 82.69% on the LoL forum and 83.86% on the WoW forum.

Index Terms—game forums, cyberbullying, machine learning, text classification, deep neural network, BERT

I. INTRODUCTION

Currently, offensive comments are a common problem on gaming forums. These comments are often negative, harmful, and abusive, creating a pessimistic impression of the gaming community. To prevent offensive comments by forum users, administrators have to find each comment manually and then hide it or ban the user who posted it. This approach is impractical on large forums with a huge volume of user comments. Automatically detecting offensive comments on game forums is therefore necessary to help administrators find and hide negative comments, or even ban users who post abusive and harassing comments. We address the problem of automatically detecting cyberbullying comments on the League of Legends and World of Warcraft forums, which are popular with most game players.

We use machine learning approaches for text classification to automatically classify whether a comment from a game player is abusive or not. We also propose complementary techniques such as data pre-processing and data augmentation to increase the performance of the classifiers, and we evaluate the empirical results to find the best models on the Cyberbullying dataset. Our approaches include traditional machine learning models (Logistic Regression, SVM), deep neural models (Text-CNN, GRU), and transfer learning models (Toxic-BERT). Finally, based on the empirical results, we choose the best-performing model to build a tool that helps administrators detect and hide abusive comments on online game forums.

The paper is organized as follows. Section I introduces the topic and the motivation for studying this task. Section II presents related work on identifying abusive comments. Section III describes the Cyberbullying dataset used for training the machine learning models. Sections IV and V present our approaches and empirical results. Finally, Section VI concludes our work.

II. RELATED WORK

In 2017, Wulczyn et al. [1] analyzed Wikipedia comments to find personal attacks, using a dataset of over 100,000 high-quality human-labeled comments and 63M machine-labeled ones. The best AUC (96.59%) was achieved by a multi-layer perceptron (MLP) model with character n-gram features trained on ED (empirical distribution) labels. For a model trained on ED labels, the attack score represents the predicted fraction of annotators who would consider the comment an attack.

Also in 2017, Davidson et al. [2] studied automated hate speech detection and the problem of offensive language. The authors constructed a dataset from tweets on Twitter and trained machine learning models such as Logistic Regression, Naive Bayes, Decision Trees, Random Forests, and Linear SVMs. After training, the best-performing model, Logistic Regression, achieved 91.00% precision and 90.00% recall and F1-score.
In addition, in 2018, Ribeiro et al. [3] characterized and detected hateful users on Twitter. The comments in their dataset were collected from Twitter but, unlike other datasets, this dataset focuses on the user rather than the content of individual comments. The authors crawled data from 100,386 users with up to 200 tweets per user and then selected a subsample for annotation: 4,972 users were manually annotated, of which 544 were labeled as hateful. They implemented models such as GraphSage, AdaBoost, and GradBoost on the dataset. Using both user features and GloVe properties together with the GraphSage model gave the best results. AdaBoost delivered a good AUC score but misclassified many normal users, resulting in low accuracy and F1-score.
Besides, many multilingual datasets for abusive comment recognition have been built in different languages, such as the hatEval dataset from the SemEval-2019 Task 5 shared task (2019) [4] for identifying offensive comments about women and immigrants in English and Spanish, and the CONAN dataset for identifying comments inciting violence in English, French, and Italian [5].

III. DATASET

The Cyberbullying dataset [6] (available at http://ub-web.de/research/) consists of two parts: comments of players on the League of Legends (LoL) and World of Warcraft (WoW) forums. The data from the League of Legends forum contains 17,354 non-offensive comments and 207 offensive cases, and the data from the World of Warcraft forum contains 16,975 non-offensive comments and 137 offensive cases. All comments were annotated manually and reviewed by experts. The dataset contains English comments on 20 different topics. There are two labels: label 1 indicates offensive comments and label 0 indicates non-offensive comments.

TABLE I: Samples in the Cyberbullying dataset.

Comments on the League of Legends forum:
  1. "We can only hope." — Label 0
  2. "This is why we don't account share children." — Label 0
  3. "This is from Season Stay down bitch fanboy loser." — Label 1
  4. "Rest in pieces. fk u" — Label 1

Comments on the World of Warcraft forum:
  1. "This thread is depressing, progress is slow." — Label 0
  2. "But they could give us free transfers in the meantime." — Label 0
  3. "I didnt nuked it. I ran the all road and can tag him at k hp. Im a druid and i used cyclone time. So dont swear to me you idiot" — Label 1
  4. "when you troll say a good time its . you down it at . Braindead" — Label 1

As Table I shows, comments are often written in an informal style, using a lot of slang, abbreviations, and strange characters, and the conversations center on exchanges between players on game forums. It can also be seen that comments in the Cyberbullying dataset often use informal words such as abbreviations and slang, for example "u" and "fk". This is a common feature of comments on social media platforms [7].

TABLE II: Distribution of labels on the two forums in the Cyberbullying dataset.

Forum              | Label 0         | Label 1
League of Legends  | 17,354 (98.81%) | 207 (1.19%)
World of Warcraft  | 16,975 (99.19%) | 137 (0.81%)

As shown in Table II, the distribution of labels is imbalanced on both the LoL and WoW forums. Most comments are skewed to label 0, which accounts for 98.81% of comments on the LoL forum and 99.19% on the WoW forum. The Cyberbullying dataset is therefore imbalanced, and most of the comments are non-offensive.

TABLE III: Average length of comments in the Cyberbullying dataset (in words).

Forum              | Label 0 | Label 1
League of Legends  | 102.95  | 137.60
World of Warcraft  | 105.42  | 165.58

Finally, Table III shows the average length of comments in the Cyberbullying dataset, computed at the word level. The average lengths of comments with label 0 and label 1 are 102.95 and 137.60 words, respectively, on the LoL forum, and 105.42 and 165.58 words, respectively, on the WoW forum. In general, comments with label 1 are longer on average than comments with label 0 on both forums.
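For readers who want to reproduce statistics like those in Tables II and III, a minimal pandas sketch might look as follows. The file name and the "comment"/"label" column names are hypothetical, since the dataset's actual export format may differ:

    import pandas as pd

    # Hypothetical file name and column layout; adapt to the actual
    # export of the Cyberbullying dataset.
    df = pd.read_csv("lol_comments.csv")  # columns: "comment", "label"

    # Label distribution (cf. Table II): counts and percentages.
    counts = df["label"].value_counts()
    percent = (df["label"].value_counts(normalize=True) * 100).round(2)
    print(pd.DataFrame({"count": counts, "percent": percent}))

    # Average comment length at the word level (cf. Table III).
    df["num_words"] = df["comment"].astype(str).str.split().str.len()
    print(df.groupby("label")["num_words"].mean().round(2))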
IV. METHODOLOGIES

A. Text classification task

Text classification is one of the most common tasks in machine learning and natural language processing. Aggarwal and Zhai (2012) define the text classification task as follows: given a training set D = {X_1, ..., X_n}, where each data point X_i ∈ D is labeled with a value from a set of labels numbered {1, ..., k}, a classification model is built from the training data; given a test set, the classification model predicts the label of each data point in the test set [8].

Fig. 1: Model overview for detecting cyberbullying comments. [figure not shown]

Figure 1 describes our model overview for detecting cyberbullying comments. Given input comments, the model performs pre-processing steps, including splitting comments into tokens, removing unimportant characters, using SMOTE to generate new examples, and using word embeddings (GloVe, fastText) to represent the input data as vectors. Next, the pre-processed data are fed into classification models, including traditional machine learning models, deep neural models, and a transformer model. Finally, the output is the labels predicted by the classification models.

B. SMOTE

The Synthetic Minority Oversampling Technique (SMOTE) is a technique for increasing the number of samples in the minority class and is widely used to handle imbalanced datasets. SMOTE works by selecting data points in the feature space, drawing a line between each point and its nearest neighbors, and sampling new minority-class data points along this line. The new data points are thus generated from the original data points to increase the size of the minority class, which is an effective method for handling imbalance in datasets [9].
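A minimal sketch of this oversampling step, using the SMOTE implementation from the imbalanced-learn library, is shown below. The TF-IDF features and the toy comments are our own assumptions for illustration; the key point is that SMOTE is applied to the vectorized training split only:

    from imblearn.over_sampling import SMOTE
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    # Toy stand-ins for forum comments; label 1 marks offensive comments.
    comments = ["we can only hope", "free transfers in the meantime",
                "progress is slow", "nice run today", "good game all",
                "stay down loser", "fk u braindead", "stupid troll idiot"]
    labels = [0, 0, 0, 0, 0, 1, 1, 1]

    X_train, X_test, y_train, y_test = train_test_split(
        comments, labels, test_size=0.25, stratify=labels, random_state=0)

    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)

    # Oversample the minority class on the training split only, so the
    # test split stays untouched by synthetic examples.
    smote = SMOTE(random_state=0, k_neighbors=1)  # k_neighbors=1 for toy data
    X_res, y_res = smote.fit_resample(X_train_vec, y_train)
    print(sorted(y_res))  # the classes are now balanced

Applying SMOTE before the train/test split would leak synthetic information into evaluation; keeping it inside the training fold avoids that.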
C. Word embedding

Word embedding is a method of word representation used in natural language processing. It represents a word (or phrase) by a vector of real numbers learned from a text corpus. Using word embeddings improves the performance of predictive models [10].

In this paper, we use the GloVe [11] and fastText [12] sets of word representation vectors. The glove.6B.300d word embedding (https://nlp.stanford.edu/projects/glove/) comprises 300-dimensional vectors over a 400,000-word vocabulary, built on the Wikipedia and Gigaword corpora. The fastText vectors (https://fasttext.cc/docs/en/crawl-vectors.html) are trained on the Common Crawl and Wikipedia datasets using CBOW with position weights, in 300 dimensions with character 5-gram features.

D. Traditional machine learning models

Logistic Regression is one of the basic classification models, commonly used for binary classification problems. For classification, texts must be converted to vectors before being fed into the training model [13].

Support Vector Machine (SVM) is a classification algorithm based on the concept of hyperplanes that divide the data into different classes. It is commonly used for classification and regression problems. As with Logistic Regression, text data is converted to vector form before being fed into the training model [14].
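A minimal sketch of such a traditional pipeline with scikit-learn is shown below. The TF-IDF feature extractor and the LinearSVC variant of SVM are assumptions on our part; the hyperparameters (C=1, max_features=13,000, random_state=0) mirror the settings reported in Section V:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # Two candidate pipelines: TF-IDF features plus a linear classifier.
    logreg = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=13000)),
        ("clf", LogisticRegression(C=1, random_state=0, max_iter=1000)),
    ])
    svm = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=13000)),
        ("clf", LinearSVC(C=1, random_state=0)),
    ])

    # Usage: fit on training comments, then predict labels for new ones.
    train_comments = ["we can only hope", "stay down loser"]
    train_labels = [0, 1]
    logreg.fit(train_comments, train_labels)
    print(logreg.predict(["rest in pieces fk u"]))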
E. Deep neural models

Text-CNN is a multi-layer neural network architecture developed for classification problems. The CNN model uses convolutional layers (CONV layers) to extract valuable information from the data [15].

GRU is an improved version of the standard recurrent neural network (RNN). To solve the vanishing gradient problem of a standard RNN, the GRU uses special gates: an update gate and a reset gate. These are essentially two vectors that decide what information should be passed to the output. Their special property is that they can be trained to keep information from long ago, without washing it out over time, and to discard information that is irrelevant to the prediction [16].

F. Transformer model

BERT is a transformer architecture created by Devlin et al. [17] that has become popular for many natural language processing tasks. With models pre-trained on large unsupervised data, BERT can later be fine-tuned for specific downstream tasks such as text classification and question answering. Toxic-BERT (also called Detoxify) [18] is a model based on the BERT architecture for the toxic comment classification task. Toxic-BERT is trained on three different toxic-comment datasets from three Jigsaw challenges. We apply the Toxic-BERT model to the Cyberbullying dataset to detect cyberbullying comments from players.
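As a sketch of how such fine-tuning might be set up with the Simple Transformers library (which Section V states was used), the snippet below fine-tunes a BERT checkpoint for binary classification. We load the generic bert-base-uncased checkpoint to keep the example self-contained; the authors fine-tune the Toxic-BERT checkpoint (unitary/toxic-bert), whose pre-trained multi-label head would need to be adapted for this binary task:

    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # Simple Transformers expects a DataFrame with "text" and "labels" columns.
    train_df = pd.DataFrame({
        "text": ["we can only hope", "stay down loser"],
        "labels": [0, 1],
    })

    # Hyperparameters mirror Section V: 5 epochs, train batch 16, eval batch 8.
    model = ClassificationModel(
        "bert", "bert-base-uncased", num_labels=2,
        args={"num_train_epochs": 5,
              "train_batch_size": 16,
              "eval_batch_size": 8},
        use_cuda=False,  # set to True if a GPU is available
    )
    model.train_model(train_df)
    predictions, raw_outputs = model.predict(["rest in pieces fk u"])
    print(predictions)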
V. EMPIRICAL RESULTS

A. Model parameters

We implemented Logistic Regression and SVM with a random state of 0, C equal to 1, and a maximum number of features equal to 13,000. We implemented Text-CNN and GRU with 5 epochs, a batch size of 512, a sequence length of 300, a convolutional layer size of 5, 128 units, a dropout rate of 0.1, and the sigmoid activation function. Finally, we implemented Toxic-BERT with 5 epochs, a training batch size of 16, and a test batch size of 8, using the Simple Transformers library (https://simpletransformers.ai/) for the implementation.

B. Data preparation

Before running the models, we perform the following pre-processing steps: (1) we remove special characters from the comments (except for "*") and keep only letters; the "*" characters represent masked offensive words and are replaced with the word "beep"; (2) we split comments into tokens using the TweetTokenizer of the NLTK library; (3) we convert comments to lowercase; and (4) we remove stop words such as "the", "in", "a", and "an", because they carry little meaning in the sentence.

We use k-fold cross-validation with 5 folds (k = 5) on the Cyberbullying dataset. The dataset is randomly divided into 5 equal parts; in each fold, the train and test sets have a proportion of 8:2. The final result is the mean over the 5 folds.
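Taken together, pre-processing steps (1)–(4) and the 5-fold evaluation protocol might be implemented as in the sketch below. The regular expression, the stop-word list, and the classifier are our reading of the description above, with toy data standing in for the dataset:

    import re
    import numpy as np
    from nltk.tokenize import TweetTokenizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold
    from sklearn.pipeline import Pipeline

    tokenizer = TweetTokenizer()
    STOP_WORDS = {"the", "in", "a", "an"}  # sample stop words from the paper

    def preprocess(comment: str) -> str:
        comment = comment.replace("*", " beep ")       # (1) masked offensive words
        comment = re.sub(r"[^A-Za-z ]", " ", comment)  # (1) keep letters only
        tokens = tokenizer.tokenize(comment.lower())   # (2) tokenize, (3) lowercase
        return " ".join(t for t in tokens if t not in STOP_WORDS)  # (4)

    # Toy stand-ins; in practice these come from the Cyberbullying dataset.
    raw = ["We can only hope.", "Free transfers, please!", "Nice run today.",
           "Progress is slow.", "Good game, all.", "Stay down, loser!",
           "You idiot.", "fk u, braindead.", "Stupid troll.", "sh*t player."]
    X = np.array([preprocess(c) for c in raw])
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

    # 5-fold cross-validation; the reported score is the mean macro F1.
    scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        pipe = Pipeline([("tfidf", TfidfVectorizer(max_features=13000)),
                         ("clf", LogisticRegression(C=1, random_state=0))])
        pipe.fit(X[train_idx], y[train_idx])
        preds = pipe.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], preds, average="macro"))
    print("mean macro F1:", np.mean(scores))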
C. Empirical results

Table IV presents the results of the models on the two forums, LoL and WoW. According to Table IV, the results of the models on the two forums are similar.

TABLE IV: Empirical results on the Cyberbullying dataset.

League of Legends forum:
Model               | SMOTE Acc(%) | SMOTE F1(%) | No SMOTE Acc(%) | No SMOTE F1(%)
Logistic Regression | 97.02        | 60.19       | 98.81           | 49.70
SVM                 | 98.20        | 58.65       | 98.81           | 49.70
Text-CNN + fastText | 97.87        | 75.30       | 99.29           | 75.67
Text-CNN + GloVe    | 97.06        | 73.19       | 99.37           | 80.68
GRU + fastText      | 96.69        | 66.37       | 99.18           | 78.46
GRU + GloVe         | 95.54        | 66.50       | 99.32           | 75.53
Toxic-BERT          | -            | -           | 99.38           | 82.69

World of Warcraft forum:
Model               | SMOTE Acc(%) | SMOTE F1(%) | No SMOTE Acc(%) | No SMOTE F1(%)
Logistic Regression | 98.11        | 63.50       | 99.19           | 49.79
SVM                 | 98.74        | 61.20       | 99.19           | 49.79
Text-CNN + fastText | 97.43        | 73.07       | 99.48           | 75.37
Text-CNN + GloVe    | 97.56        | 74.40       | 99.62           | 83.10
GRU + fastText      | 96.27        | 66.19       | 99.49           | 74.44
GRU + GloVe         | 97.77        | 71.14       | 99.47           | 72.99
Toxic-BERT          | -            | -           | 99.58           | 83.86

For the traditional machine learning models, there is a wide gap between accuracy and macro F1-score. The reason is the imbalanced label distribution in the dataset, as described in Section III.

For the deep neural models, the Text-CNN model with the GloVe word embedding gives the best macro F1-scores: 80.68% on the LoL forum and 83.10% on the WoW forum. There is also a discrepancy between accuracy and macro F1-score for the deep neural models due to the unbalanced data, but the difference is not as large as for the traditional machine learning models.

Among all models, Toxic-BERT gives the highest macro F1-scores on both forums: 82.69% on the LoL forum and 83.86% on the WoW forum (Table IV).

To handle the imbalance in the dataset, we apply the SMOTE method to increase the amount of data in the minority label. After applying SMOTE, the macro F1-scores of the traditional machine learning models increase significantly. For Logistic Regression, the macro F1-score increases by 10.49% on the LoL forum and 13.71% on the WoW forum. Similarly, for SVM, the macro F1-score increases by 8.95% and 11.41% on the LoL and WoW forums, respectively. However, SMOTE does not improve the performance of the deep neural models on either forum.

D. Error analysis

Since the results of the models on the two forums are similar according to Table IV, we analyze the prediction errors on the LoL forum of the Cyberbullying dataset.

Fig. 2: Confusion matrices of the Logistic Regression model on the LoL forum (left: without SMOTE; right: with SMOTE). [figure not shown]

For the traditional machine learning models, Figure 2 shows the confusion matrix of the Logistic Regression model before and after applying SMOTE. As shown in Figure 2, SMOTE increases the ability of the Logistic Regression model to predict label 1. However, many comments are still mispredicted from label 1 to label 0.

Fig. 3: Confusion matrices of the Text-CNN model with GloVe word embedding on the LoL forum (left: without SMOTE; right: with SMOTE). [figure not shown]

Next, Figure 3 shows the confusion matrix of the Text-CNN model with the GloVe embedding before and after applying SMOTE. After the data augmentation, the proportion of correctly predicted label-1 comments increases by 8.2%, but the proportion of correctly predicted label-0 comments decreases by 2.4%. Therefore, SMOTE does not increase the performance of the deep neural models.

Fig. 4: Confusion matrices of the Text-CNN model on the LoL forum without SMOTE (left: fastText; right: GloVe). [figure not shown]

Besides, Figure 4 shows the confusion matrices of the Text-CNN model with the GloVe and fastText word embeddings, without the SMOTE technique. As shown in Figure 4, the predictive accuracy on label 1 of the Text-CNN model with GloVe is better than with fastText, while the predictions on label 0 are similar for both. Therefore, the Text-CNN model performs better with GloVe than with fastText.

Fig. 5: Confusion matrix of the Toxic-BERT model on the LoL forum. [figure not shown]

In addition, Figure 5 shows the confusion matrix of the Toxic-BERT model. According to Figure 5, Toxic-BERT makes more correct than incorrect predictions on label 1. Compared with the other models, Toxic-BERT has a higher overall percentage of correct predictions on labels 0 and 1, and therefore the highest accuracy on the dataset, even without any data enhancement.

TABLE V: Several incorrect predictions in the Cyberbullying dataset.

No. | Input                                                                              | True | Predicted
1   | tri argu common sens girl that problem sexist bigot plz never show face realiti kthnx vote oblivion pl x. | 1 | 0
2   | stop give account info idiot absolut true still think account deserv rune back.    | 0 | 1
3   | refund get kill wait oh idiot rescind previou statement trust memori.              | 0 | 1
4   | post extra two minut travel much stupid u think min travel.                        | 1 | 0
5   | leav guy troll wont play stupid card fun get attent world wide ignor troll idiot.  | 0 | 1

Finally, as shown in Table V, many player comments on the forum use acronyms such as "plz" (roughly meaning: please) in comment No. 1, or "u" (roughly meaning: you) in comment No. 4. In addition, the syntax and grammar of player comments are often inaccurate compared with formal texts, for example the words "previou" (roughly: previous) in comment No. 3 and "minut" (roughly: minute) in comment No. 4. To improve the predictability of the models, it is necessary to apply lexicon-based approaches to normalize the comments into formal texts.

VI. CONCLUSION

In this paper, we introduced a solution for identifying offensive comments on game forums using machine learning approaches with text classification models. The Toxic-BERT model obtained the best results on the Cyberbullying dataset. A weakness of the Cyberbullying dataset is the imbalance between label 1 and label 0, which leads to many wrong predictions on label 1. To address this problem, we used SMOTE with the traditional machine learning models and the deep neural models to mitigate the data imbalance; however, the results did not improve significantly for the deep neural models. From the error analysis, we found that the incorrect predictions were caused by the informal texts used on game forums, such as acronyms, slang, and ungrammatical sentences. It is therefore necessary to normalize user comments to increase the performance of the classification models.

In the future, we plan to apply Easy Data Augmentation (EDA) techniques [19] to augment the data of the minority label, and to build a dictionary to standardize irregular words. Hurtlex [20] and the abusive lexicon provided by Wiegand et al. [21] are two potential abusive lexicon sets that could be used to increase the performance of the classification models. In addition, based on the results obtained in this paper, we plan to build a module that automatically detects offensive comments on game forums in order to help moderators keep the discussion space among game players clean and friendly.

REFERENCES

[1] E. Wulczyn, N. Thain, and L. Dixon, "Ex machina: Personal attacks seen at scale," in Proceedings of the 26th International Conference on World Wide Web, 2017.
[2] T. Davidson, D. Warmsley, M. Macy, and I. Weber, "Automated hate speech detection and the problem of offensive language," in Proceedings of the International AAAI Conference on Web and Social Media, 2017.
[3] M. Ribeiro, P. Calais, Y. Santos, V. Almeida, and W. Meira Jr, "Characterizing and detecting hateful users on Twitter," in Proceedings of the International AAAI Conference on Web and Social Media, 2018.
[4] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. Rangel Pardo, P. Rosso, and M. Sanguinetti, "SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter," in Proceedings of the 13th International Workshop on Semantic Evaluation. Minneapolis, Minnesota, USA: Association for Computational Linguistics, Jun. 2019.
[5] Y.-L. Chung, E. Kuzmenko, S. S. Tekiroglu, and M. Guerini, "CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019.
[6] U. Bretschneider and R. Peters, "Detecting cyberbullying in online communities," 2016.
[7] W. Fan and M. D. Gordon, "The power of social media analytics," Communications of the ACM, 2014.
[8] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data. Springer, 2012.
[9] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, 2002.
[10] T.-H. Pham and P. Le-Hong, "End-to-end recurrent neural network models for Vietnamese named entity recognition: Word-level vs. character-level," in Computational Linguistics: 15th International Conference of the Pacific Association for Computational Linguistics, PACLING 2017, Yangon, Myanmar, August 16-18, 2017, Revised Selected Papers. Springer, 2018.
[11] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[12] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, "Learning word vectors for 157 languages," in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), May 2018.
[13] K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text classification," in IJCAI-99 Workshop on Machine Learning for Information Filtering. Stockholm, Sweden, 1999.
[14] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), 2011.
[15] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1746-1751.
[16] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014.
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019.
[18] L. Hanu and Unitary team, "Detoxify," GitHub, https://github.com/unitaryai/detoxify, 2020.
[19] J. Wei and K. Zou, "EDA: Easy data augmentation techniques for boosting performance on text classification tasks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 6382-6388.
[20] E. Bassignana, V. Basile, and V. Patti, "Hurtlex: A multilingual lexicon of words to hurt," in 5th Italian Conference on Computational Linguistics, CLiC-it 2018. CEUR-WS, 2018.
[21] M. Wiegand, J. Ruppenhofer, A. Schmidt, and C. Greenberg, "Inducing a lexicon of abusive words - a feature-based approach," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018.