Predicting Different Types of Subtle Toxicity in Unhealthy Online Conversations
Shlok Gilda (University of Florida, shlokgilda@ufl.edu), Mirela Silva (University of Florida, msilva1@ufl.edu), Luiz Giovanini (University of Florida, lfrancogiovanini@ufl.edu), Daniela Oliveira (University of Florida, daniela@ece.ufl.edu)

arXiv:2106.03952v1 [cs.CL] 7 Jun 2021

Abstract

This paper investigates the use of machine learning models for the classification of unhealthy online conversations containing one or more forms of subtler abuse, such as hostility, sarcasm, and generalization. We leveraged a public dataset of 44K online comments containing healthy and unhealthy comments labeled with seven forms of subtle toxicity. We were able to distinguish between these comments with a top micro F1-score, macro F1-score, and ROC-AUC of 88.76%, 67.98%, and 0.71, respectively. Hostile comments were easier to detect than other types of unhealthy comments. We also conducted a sentiment analysis which revealed that most types of unhealthy comments were associated with a slight negative sentiment, with hostile comments being the most negative ones.

1 Introduction

Healthy online conversations occur when posts or comments are made in good faith, are not blatantly abusive or hostile, typically focus on substance and ideas, and generally invite engagement (Price et al., 2020). Conversely, toxic comments are a harmful type of conversation widely found online which are insulting and violent in nature (Price et al., 2020), such as "SHUT UP, YOU FAT POOP, OR I WILL KICK YOUR A**!!!". This kind of toxic conversation has been the primary focus of several previous studies (Georgakopoulos et al., 2018; Srivastava et al., 2018). However, many comments that deter people from engaging in online conversations are not necessarily outright abusive, but contain subtle forms of abuse (e.g., "Because it drives you crazy, old lady. Toodooloo..."). These comments are written in a way to engage people, but also to hurt, antagonize, or humiliate others, and are thus referred to as unhealthy conversations (Price et al., 2020). In other words, unhealthy comments are less negative, intense, and hostile than toxic ones. Detecting unhealthy conversations is more challenging and less explored in the literature than its toxic conversations counterpart (Price et al., 2020).

Many early signs of conversational failure are due to the subtlety in comments which deter people from engagement and create downward spirals in interactions. Behaviors such as condescension (e.g., "Utter drivel and undeserving of further response."), "benevolent" stereotyping (e.g., "Women have motherly nurturing instincts."), and microaggressions (e.g., racially-based ones such as "Why don't you have an accent?") are frequently experienced by members of minority social groups (Sue et al., 2007; Glick and Fiske, 2001). Nadal et al. (2014) indicated that such subtle abuse can be as emotionally harmful to some individuals as outright toxic abuse. Microaggressions (even unintended slights or social cues) have been linked to disagreements in intergroup relationships. Jurgens et al. (2019) also signify the importance of tackling more subtle and serious forms of online abuse by developing proactive technologies (e.g., intervention by bystanders [Markey, 2000], rephrasing parts of a message to adjust the level of politeness [Sennrich et al., 2016]) to counter abuse before it can cause harm.

In this paper, we sought to answer two research questions in the context of unhealthy conversations:

• RQ1: What is the general sentiment associated with unhealthy conversations compared to healthy conversations?
• RQ2: Can we differentiate unhealthy and healthy conversations? If so, which type of unhealthy conversation is the most detectable?
Towards this end, we analyzed a dataset comprised of 44K labeled comments of unhealthy conversations from Price et al. (2020). We submitted this dataset first to sentiment analysis aimed at detecting the polarity (positive/negative/neutral) of the comments (RQ1), and then to a comprehensive machine learning analysis focused on distinguishing comments as healthy vs. unhealthy (RQ2). Our experimental results revealed that: (i) although none of the types of comments had extremely polarizing sentiments, most forms of unhealthy online conversations were associated with a slight negative sentiment; and (ii) hostile comments were the most detectable form of unhealthy conversation. Findings from this work have the potential to inform and advance future research and development of online moderation tools, which pave the way for safer online environments.

This paper is organized as follows. Section 2 summarizes related work. Section 3 describes the dataset we leveraged for our experiments, as well as our preprocessing procedures. Section 4 details our study's methodology. Section 5 analyzes and discusses our results by answering our proposed research questions. Section 6 analyzes our study's limitations and proposes directions for future work. Section 7 concludes the paper.

2 Related Work

Sentiment classification of social media posts with regards to toxicity has been researched extensively over the past years (Chen and Qian, 2019; Saeed et al., 2018; Saif et al., 2018; Srivastava et al., 2018). The primary focus of most of the related work has been on algorithmic moderation of toxic comments, which are derogatory and threatening. The importance of community norms in the detection and classification of these subtler forms of abuse has been noted elsewhere (Blackwell et al., 2017; Guberman and Hemphill, 2017; Liu et al., 2018; Salminen et al., 2018), but has not received the same attention in the NLP community.

Although recognized in the larger NLP abuse typology (Waseem et al., 2017), there have been only a few attempts at solving the problems associated with subtle abuse detection, such as a study on the classification of ambivalent sexism using Twitter data (Jha and Mamidi, 2017). There is a need for the development of new methods to identify the implicit signals of conversational cues. Detecting subtler forms of toxicity requires idiosyncratic knowledge, familiarity with the conversation context, or familiarity with cultural tropes (van Aken et al., 2018; Parekh and Patel, 2017). It also requires reasoning about the implications of the propositions. Dinakar et al. (2012) extract implicit assumptions in statements and use common sense reasoning to identify social norm violations that would be considered an insult.

Identification of subtle indicators of unhealthy conversations in online comments is a challenging task due to three main reasons (Price et al., 2020): (i) comments are less extreme and thus contain less explicit vocabulary; (ii) a remark may be perceived differently based on the context or expectations of the reader; and (iii) there is a greater risk of false positives or false negatives. Cultural diversity also plays an important role in how a comment or remark may be perceived (Qiufen, 2014), thus making the identification of subtle toxicity online more challenging.

From a machine learning perspective, abusive comment classification research began with Yin et al. (2009), who combined TF-IDF (Term Frequency-Inverse Document Frequency) with sentiment and contextual features. They compared the performance of this model with a simple TF-IDF model and reported a 6% increase in the F1-score of the classifier on chat-style datasets. Since then, there have been many advances in the field of toxicity classification. Davidson et al. (2017) detected hate speech on Twitter using a three-way classifier: hate speech, offensive but not hate speech, and none. They implemented Logistic Regression and SVM classifiers using several different features, including TF-IDF n-grams, Part-of-Speech (POS) n-grams, Flesch Reading Ease scores, and sentiment scores from an existing sentiment lexicon. Using a similar set of features, Safi Samghabadi et al. (2017) applied a linear SVM to detect "invective" posts on Ask.fm, a social networking site for asking questions. They also utilized additional features such as Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2001), word2vec (Mikolov et al., 2013), paragraph2vec (Le and Mikolov, 2014), and topic modeling. The authors reported an F1-score of 59% and AUC-ROC (area under the ROC curve) of 0.785 using a specific subset of features. Yu et al. (2017) proposed a word vector refinement model that can be applied to pre-trained word vectors (e.g., Word2Vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014a]) to improve the efficiency of sentiment analysis.
Chu et al. (2016) compared the performance of various deep learning approaches to toxicity classification, using both word and character embeddings. They analyzed the performance of neural networks with LSTM (Long Short-Term Memory) and word embeddings, a CNN (Convolutional Neural Network) with word embeddings, and another CNN with character embeddings; the latter achieved a top accuracy of 93%.

This paper aims to tackle the issue of recognizing unhealthy online conversations. Existing research usually focuses on classifying the most hateful/vile comments; we, on the other hand, aimed to identify the subtle toxicity indicators in online conversations.

3 Data Preparation

In this section, we describe the dataset of 44K labeled comments of unhealthy conversations from Price et al. (2020), and detail the preprocessing steps taken to prepare the dataset for our machine learning analyses.

3.1 Dataset Description

The dataset used in this study was made publicly available (https://github.com/conversationai/unhealthy-conversations) by Price et al. (2020) in October 2020. It contains 44,355 unique comments of 250 characters or less from the opinion articles of the Globe and Mail (https://www.theglobeandmail.com/), sampled from the Simon Fraser University (SFU) Opinion and Comments Corpus dataset by Kolhatkar et al. (2020). Each comment was coded by at least three annotators with at least one of the following class labels: antagonize, condescending, dismissive, generalization, generalization unfair, healthy, hostile, and sarcastic. A maximum of five annotators were used per comment until sufficient consensus—measured by a confidence score ≥ 75%—was reached. Each annotator was asked to identify for each comment whether it was healthy and whether any of the types of unhealthy discourse (i.e., condescending, generalization, etc.) were present. The comments were presented in isolation to annotators, without the surrounding context of the news article and other comments (i.e., the annotators had no additional context about where the comment was posted or about the engagement of the comment with other users), thus possibly reducing bias. Since the comments were sourced from the SFU Opinion and Comments Corpus dataset, the prevalence of each attribute is unavoidably low. The following are a few examples of every class label:

1. Antagonize: "An idiot criticizing another idiot, seems about right."
2. Condescending: "So, everyone who disagrees with her - and YOU - are fascists? You're just another pouting SJW (Social Justice Warrior; a pejorative term typically aimed at someone who espouses socially liberal movements), angry that you didn't get a 'Participant' trophy..."
3. Dismissive: "You are certifiably a nut job with that comment."
4. Generalization: "This is all on the Greeks. Period. They are the ones who have been overspending for decades and lying about it."
5. Generalization Unfair: "Progressives ALWAYS go over the top. They cannot help it. It's part of their MO."
6. Healthy: "We are seeing fascist tendencies and behaviours in our government, but its [sic] so hard to believe that it is happening to us, to Canada, that we will just continue to be in denial."
7. Hostile: "Crazy religious f**kers are the cause of overpopulating the planet with inbred, uneducated people with poor prospects of earning a satisfying life. An endless pool of desperate terrorists in waiting."
8. Sarcastic: "You should write another comment, [user] - there are a few right-wing buzzwords you didn't use yet. Not many, but some."

Price et al. (2020) used Krippendorff's α as the inter-rater reliability metric for the crowd-sourced annotators (note, however, that they did not report the α for two labels: healthy and generalization unfair). Unlike most inter-rater reliability metrics, which are expressed directly as agreement, Krippendorff's α is computed from the disagreement among coders; it ranges up to 1 (perfect agreement), with lower values indicating weaker agreement. Table 1 lists the Krippendorff's α for each attribute (reproduced from Price et al., 2020), along with the final distribution of our preprocessed dataset (detailed below).

Table 1: Distribution of comments per category along with the respective Krippendorff's α reported by Price et al. (2020). Note that some comments (10.6%) were attributed to multiple labels.

Class label | Count | K-alpha
Antagonize | 2,066 | 0.39
Condescending | 2,434 | 0.36
Dismissive | 1,364 | 0.31
Generalization | 944 | 0.35
Generalization Unfair | 890 | Not reported
Healthy | 41,040 | Not reported
Hostile | 1,130 | 0.36
Sarcastic | 1,897 | 0.34

3.2 Preprocessing

First, we removed one comment from the dataset that was empty and 1,106 other comments which were not assigned to any class label. This decreased the total number of comments to 43,248. All comments were then preprocessed for the feature extraction step. We converted all characters to lowercase, then removed HTML tags, non-alphabetic characters, newline and tab characters, and punctuation from the comments. We also converted accented characters to their standardized representations to avoid our classifier ambiguating words such as "latte" and "latté." We then expanded contractions (e.g., "don't" is replaced with "do not"). The final step was lemmatization, which consists of reducing words to their root forms (e.g., "playing" becomes "play"). After preprocessing the data, three more comments were deleted because they contained just numbers or special characters. Thus, the total number of comments was 43,245, with an average length of 19.8 words. The final distribution of the preprocessed dataset is shown in Table 1. Most of the comments were assigned to a single class label (N = 38,661; 89.4%), while 10.6% of the comments (N = 4,584) were associated with two or more labels.
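The paper does not release its preprocessing code; the following is a minimal sketch of the steps described above, assuming NLTK's WordNetLemmatizer and a small hand-rolled (and deliberately partial) contraction map as stand-ins for whatever tooling the authors actually used.

```python
import re
import unicodedata

from nltk.stem import WordNetLemmatizer  # requires the NLTK 'wordnet' data package

# Partial contraction map, for illustration only; the paper does not list the one it used.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not", "it's": "it is"}

lemmatizer = WordNetLemmatizer()

def preprocess(comment: str) -> str:
    text = comment.lower()
    text = re.sub(r"<[^>]+>", " ", text)                    # strip HTML tags
    text = unicodedata.normalize("NFKD", text)              # "latté" -> "latte"
    text = text.encode("ascii", "ignore").decode("ascii")
    for contraction, expansion in CONTRACTIONS.items():     # expand contractions
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z\s]", " ", text)                   # drop digits and punctuation
    # Verb-POS lemmatization reduces "playing" to "play"; a fuller pipeline would POS-tag first.
    tokens = [lemmatizer.lemmatize(tok, pos="v") for tok in text.split()]
    return " ".join(tokens)

# Example: preprocess("Don't worry, she was PLAYING it down... <br/>")
# returns "do not worry she be play it down"
```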
4 Methodology and Analysis

This section describes our feature engineering process, followed by a detailed description of our machine learning analysis targeting the recognition of unhealthy comments through a diverse set of classification models, including traditional and deep architectures. We also describe the process of sentiment analysis of the comments.

4.1 Feature Engineering

We vectorized the comments using the Term Frequency-Inverse Document Frequency (TF-IDF) statistic, which is commonly used to evaluate how important a word is to a text in relation to a collection of texts (Manning et al., 2008). This measure considers not only the frequency of words or character n-grams in the text but also the relevancy of those tokens across the dataset as a whole. We also included other features that we considered useful in recognizing unhealthy comments, including the length of each comment, the percentage of characters which are capitalized in each comment, and the percentage of punctuation characters in each comment.
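Below is a sketch of how the TF-IDF matrix and the three handcrafted features could be assembled with scikit-learn; the vectorizer settings (10K word features, word 1-3-grams) anticipate the tuned values reported later in Section 4.2, and the rest is an illustrative assumption rather than the authors' code.

```python
import string

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def engineered_features(raw_comments):
    """Comment length, share of capitalized characters, and share of punctuation,
    computed on the raw (un-lowercased) text."""
    rows = []
    for text in raw_comments:
        n = max(len(text), 1)
        rows.append([
            len(text),
            sum(ch.isupper() for ch in text) / n,
            sum(ch in string.punctuation for ch in text) / n,
        ])
    return csr_matrix(np.array(rows, dtype=float))

# Word-level TF-IDF capped at 10K features with 1-3-grams (the best setting found in Sec. 4.2).
tfidf = TfidfVectorizer(max_features=10_000, ngram_range=(1, 3))

def build_feature_matrix(raw_comments, clean_comments, fit=True):
    X_tfidf = tfidf.fit_transform(clean_comments) if fit else tfidf.transform(clean_comments)
    return hstack([X_tfidf, engineered_features(raw_comments)]).tocsr()
```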
4.2 Machine Learning Analysis

In our machine learning experiments of multi-label classification, we considered the following well-known models:

Logistic regression. We used the Logistic Regression model with TF-IDF vectorized comment texts, using only words as tokens (limited to 10K features).

Support Vector Machine (SVM). Essentially, SVM focuses on a small subset of examples that are critical to differentiating between class members and non-class members, throwing out the remaining examples (Vapnik, 1998). This is a crucial property when analyzing large datasets containing many ambiguous patterns. We used a linear kernel since it is robust to overfitting.

LightGBM. LightGBM (https://github.com/microsoft/LightGBM) is a tree-based ensemble model that is trained with gradient boosting (Ke et al., 2017; Barbier et al., 2016; Zhang et al., 2017). The unique attribute of LightGBM versus other boosted tree algorithms is that it grows trees leaf-wise rather than level-wise, meaning that it prioritizes depth over width. There is an important distinction between boosted tree models and random forest models. While forest models like Scikit-Learn's (Pedregosa et al., 2011) RandomForest use an ensemble of fully developed decision trees, boosted tree algorithms use an ensemble of weak learners that may be trained faster and can possibly generalize better on a dataset like this one, where there is a very large number of features but only a select few might have an influence on any given comment. Given the huge disparity between healthy and unhealthy comments, we used a tree-based model with the top 50 words and the engineered features.
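The paper does not state how these single-label learners were extended to the multi-label setting; the sketch below assumes a one-vs-rest wrapping (one binary classifier per label) and plugs in the tuned hyper-parameter values reported at the end of this section.

```python
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# One independent binary classifier per label; the one-vs-rest wrapping is an
# assumption of this sketch, not something the paper specifies.
models = {
    "logistic_regression": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "svm": OneVsRestClassifier(LinearSVC(C=0.7)),  # linear kernel, tuned C = 0.7
    "lightgbm": OneVsRestClassifier(
        LGBMClassifier(num_leaves=128, n_estimators=700, max_depth=32)
    ),
}

# models["svm"].fit(X_train, Y_train) expects Y_train as a binary indicator matrix
# with one column per class (antagonize, condescending, ..., sarcastic).
```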
Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM) with pre-trained word embeddings. We used Global Vectors for Word Representation (GloVe; Pennington et al., 2014b) to create an index of words mapped to known embeddings by parsing the data dump of pre-trained embeddings. Since the comments had variable length (range: [3, 250] characters), we fixed the comment length at 250 characters, adding zeros at the end of the sequence vector for shorter sentences. Next, the embedding layer was used to load the pre-trained word embedding model. The LSTM layer effectively preserved the characteristics of historical information in long texts, and the following CNN layer extracted the local features from the text. After the CNN, we used a global max pooling function to reduce the dimensionality of the feature maps output by the preceding convolutional layer. The pooling layer was followed by a dense layer (a regular densely connected neural network layer); there was also a dropout layer placed between the two dense layers to help regularize the learning and reduce overfitting on the dataset. We used TensorFlow (Abadi et al., 2016) to implement our CNN-LSTM deep model, whose architecture is illustrated in Fig. 1. We used binary cross-entropy as the loss function because it handles each class as an independent output (instead of as a single 8-dimensional vector). The input to the model was an arbitrary number of samples (represented as "?" in Fig. 1), all having a fixed length of 250 characters.

Figure 1: CNN-LSTM network architecture. [Layer stack: InputLayer (?, 250) → Embedding (?, 250, 300) → LSTM (?, 250, 200) → Conv1D (?, 250, 256) → GlobalMaxPooling1D (?, 256) → Dense (?, 200) → Dropout (?, 200) → Dense (?, 8).]
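A Keras sketch matching the layer shapes in Fig. 1 is given below; the kernel size, dropout rate, activations, and frozen GloVe embeddings are assumptions of this sketch, since the paper reports only the layer widths and the binary cross-entropy loss.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_LEN, EMBED_DIM, NUM_CLASSES = 250, 300, 8

def build_cnn_lstm(vocab_size: int, glove_matrix) -> tf.keras.Model:
    """Layer widths follow Fig. 1; everything else is an illustrative choice."""
    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(vocab_size, EMBED_DIM,
                         embeddings_initializer=tf.keras.initializers.Constant(glove_matrix),
                         trainable=False),                        # pre-trained GloVe vectors
        layers.LSTM(200, return_sequences=True),                  # (?, 250, 200)
        layers.Conv1D(256, kernel_size=3, padding="same",
                      activation="relu"),                         # (?, 250, 256)
        layers.GlobalMaxPooling1D(),                              # (?, 256)
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="sigmoid"),          # one independent output per label
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",                     # treats each label independently
                  metrics=[tf.keras.metrics.AUC(multi_label=True)])
    return model
```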
For Logistic Regression, SVM, and LightGBM, we used the implementations available in Scikit-Learn (Pedregosa et al., 2011) with slight modifications to the default parameters. These models were evaluated using k-fold cross-validation with k = 10. Importantly, in multi-label classification tasks (our case), a given comment may be associated with more than one label, so we did not use traditional stratified k-fold sampling, which preserves the distribution of all classes. Instead, we opted to use iterative stratification (Sechidis et al., 2011; Szymański and Kajdanowicz, 2017), via the IterativeStratification method from Scikit-Multilearn (Szymański and Kajdanowicz, 2017), which gives a well-balanced distribution of evidence of label relations up to a given order. Owing to the high computational cost of training a deep neural network, the CNN-LSTM network was evaluated with 5-fold cross-validation instead of 10 folds.

The evaluation metrics used in our experiments were micro and macro F1-scores, and AUC-ROC (i.e., area under the ROC curve). The F1-score is defined as the harmonic mean of precision, a measure of exactness, and recall, a measure of completeness. It is well-suited to handle imbalanced datasets (as in our case; Han et al., 2011). A macro-average F1-score computes the F1-score independently for each class and then takes the average (hence treating all classes equally), whereas a micro-average F1-score aggregates the contributions of all classes to compute the average F1-score (Zheng, 2015). The ROC curve is a performance measurement for classification problems at various threshold settings: ROC is a probability curve and AUC represents the degree or measure of separability, indicating how capable the model is of distinguishing between classes (wherein higher values indicate better discernment; Zheng, 2015).

We ran multiple experiments with different values of hyper-parameters for every model. For the logistic regression model, we tested different values of n-grams (words and characters) and max features (words and characters). We achieved the best results with 10,000 word max-features and a word n-gram range of [1, 3]. For the SVM model, we chose a linear kernel and tested distinct values of the regularization parameter (C); we observed that a value of C = 0.7 gave us better results than the default value of C = 1.0. During our experiments with LightGBM, we modified some baseline parameters to better suit the problem at hand: num_leaves = 128, n_estimators = 700, and max_depth = 32.
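The evaluation protocol can be reproduced roughly as follows, assuming scikit-multilearn's IterativeStratification; the stratification order used by the authors is not reported, so it is a free parameter here.

```python
import numpy as np
from skmultilearn.model_selection import IterativeStratification
from sklearn.metrics import f1_score

def cross_validate(model, X, Y, n_splits=10, order=2):
    """Iteratively stratified k-fold evaluation. X is the feature matrix and Y a
    binary indicator matrix of shape (n_samples, 8); order=2 is an assumption."""
    folds = IterativeStratification(n_splits=n_splits, order=order)
    micro, macro = [], []
    for train_idx, test_idx in folds.split(X, Y):
        model.fit(X[train_idx], Y[train_idx])
        pred = model.predict(X[test_idx])
        micro.append(f1_score(Y[test_idx], pred, average="micro"))
        macro.append(f1_score(Y[test_idx], pred, average="macro"))
    return float(np.mean(micro)), float(np.mean(macro))

# Example: cross_validate(models["svm"], X, Y) follows the 10-fold protocol used for
# the scikit-learn models; the CNN-LSTM would use n_splits=5 instead.
```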
4.3 Sentiment Analysis

We used NLTK VADER (Valence Aware Dictionary for sEntiment Reasoning; Hutto and Gilbert, 2015) to analyze the polarity of comments. VADER is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. We used the compound value from the result for analyzing the polarity of the sentiments. For every input text, VADER normalizes the overall sentiment score to fall within −1 (very negative) and +1 (very positive), where scores between (−0.05, 0.05) are labeled as neutral polarity.
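A minimal sketch of this scoring step with NLTK's VADER implementation, using the same ±0.05 neutral band described above (the input text in the comment is a hypothetical example, not one drawn from the dataset):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
analyzer = SentimentIntensityAnalyzer()

def polarity(comment: str) -> str:
    """Map VADER's normalized compound score in [-1, +1] to a coarse polarity label."""
    compound = analyzer.polarity_scores(comment)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Example: polarity("I absolutely hate this terrible comment!") returns "negative";
# scores inside (-0.05, 0.05) are treated as neutral, as in Sec. 4.3.
```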
5 Results & Discussion

In this paper, we analyzed the granularities of subtly toxic online comments. This section presents our experimental results in detecting the general sentiments associated with healthy and unhealthy online comments (RQ1), as well as in recognizing such comments via machine learning classifiers (RQ2). Lastly, we summarize our main takeaways.

5.1 Results

Table 2 shows the mean values of the sentiment polarity for each category of comment. All types of unhealthy comments except sarcastic and condescending resulted in slightly negative scores. The most negative result was observed for the class hostile, while the most positive result was obtained from the class sarcastic. The classes condescending and healthy produced the most neutral sentiment scores.

Table 2: Sentiment analysis using VADER's compound score; values range from −1 (extremely negative) to +1 (extremely positive).

Class label | Mean score
Antagonize | −0.104498
Condescending | −0.035128
Dismissive | −0.081356
Generalization | −0.085851
Generalization Unfair | −0.091320
Healthy | 0.035745
Hostile | −0.175975
Sarcastic | 0.101285

Table 3 exhibits the average micro F1-score, macro F1-score, and ROC-AUC obtained with all tested classifiers. As can be observed, the best classification results were achieved with the CNN-LSTM model, followed by SVM and LightGBM. Table 4 presents the ROC-AUC values resulting from CNN-LSTM (our best-performing classifier) for all analyzed types of comments. The best result was observed for the class healthy (AUC = 0.9524), followed by the classes hostile (AUC = 0.8141) and antagonize (AUC = 0.7362). The class sarcastic yielded the poorest result (AUC = 0.5707).

Table 3: Classification results.

Model | Average Micro F1 | Average Macro F1 | AUC-ROC
Logistic Regression | 57.54% | 48.31% | 0.51
SVM | 69.15% | 61.29% | 0.62
LightGBM | 64.83% | 52.11% | 0.57
CNN-LSTM Network | 88.76% | 67.98% | 0.71

Table 4: AUC-ROC results from the CNN-LSTM model.

Class label | ROC-AUC
Antagonize | 0.7362
Condescending | 0.6702
Dismissive | 0.6311
Generalization | 0.6633
Generalization Unfair | 0.6590
Healthy | 0.9524
Hostile | 0.8141
Sarcastic | 0.5707
Mean | 0.7121

5.2 Take-Aways

RQ1: What is the general sentiment associated with unhealthy conversations compared to healthy conversations?

Our sentiment analysis revealed that none of the types of comments (i.e., classes) have extremely polarizing sentiment values (Table 2). However, all the classes except healthy, sarcastic, and condescending have an overall slightly negative sentiment associated with them, as expected for unhealthy types of conversations. Hostile comments were the most negative, likely due to the use of blatantly vulgar and vile language. This could possibly indicate that hostile comments were less subtle in their hateful content, and thus easier for sentiment tools and machine learning algorithms to detect. Antagonizing, dismissive, and generalizing comments all had similar negative sentiment scores in the range of [−0.1045, −0.0814]. Though it is likely clear to most readers that such comments (examples given in Sec. 3) demonstrate negative sentiment, both NLTK's VADER and the CNN-LSTM demonstrated poor results for these classes, which highlights how sentiment tools and machine learning models may struggle to detect subtle forms of unhealthy language.

Interestingly, condescending comments scored neutral sentiment (−0.03513). Detecting patronizing and condescending language is still an open research problem because, among many reasons, condescension is often shrouded under "flowery words" and can itself fall into seven different categories (Pérez-Almendros et al., 2020). Meanwhile, sarcastic comments were notably labeled as positive sentiment (0.1013), even higher than healthy comments (0.0357); this may have been because sarcasm tends to be an ironic remark, often veiled in a potentially distracting positive tone.

Unhealthy comments were mostly associated with a mildly negative sentiment, with hostile comments labeled as the most negative and sarcastic as the most positive. Healthy comments were associated with neutral sentiment.

RQ2: Can we differentiate unhealthy and healthy conversations? If so, which type of unhealthy conversation is the most detectable one?

From Table 3, we can see that it was possible to differentiate between unhealthy comments with a maximum micro F1-score, macro F1-score, and ROC-AUC of 88.76%, 67.98%, and 0.7121, respectively, using the CNN-LSTM model. The remaining models tested (Logistic Regression, SVM, and LightGBM) reached poorer performances. All tested models reported a lower average macro F1-score compared to the micro F1-score results, which is intuitive given the highly imbalanced dataset—note that the macro F1-score gives the same importance to each class, i.e., this value is low for models that perform poorly on rare classes, which was the case for CNN-LSTM when analyzing unhealthy conversation classes. Therefore, it is possible to speculate that the other models also performed better when recognizing healthy comments compared to unhealthy ones.

Additionally, Table 4 shows that healthy comments can be differentiated from the remaining classes relatively well, with generally high predictive accuracy (AUC_healthy = 0.9524), likely due to the high number of healthy samples in the dataset. In contrast, despite only 1,130 samples being associated with the hostile class, AUC_hostile = 0.8141, which was the second-highest AUC achieved by our top-performing classifier. Most of the hostile comments have explicit language, which might make it easier for the classifier to properly recognize this type of conversation. The AUC for antagonizing comments was the third-highest achieved, at 0.7362, potentially because n_antagonize = 2,066 was the second-largest sample size out of the remaining unhealthy conversation categories; antagonize also achieved the second most negative mean sentiment score (−0.1045), further indicating that there may be characteristics of this class that facilitate detection by machine learning algorithms.

Detection of sarcasm was the most difficult for the CNN-LSTM model (AUC_sarcastic = 0.5707). There have been numerous studies which have faced similar issues with the detection of sarcasm, given the difficulty of understanding the nuances and context surrounding sarcastic texts (Poria et al., 2016; Zhang et al., 2016). The AUC scores for condescending, dismissive, and generalizing comments were only slightly better (range: [0.6590, 0.6702]), again highlighting the difficulty in detecting such nuanced and subtle language.
Healthy online comments were accurately distinguished from unhealthy ones. Hostile comments were easier to detect than other forms of unhealthy conversations.

6 Limitations & Future Work

This section discusses the limitations of our study and suggests possible future work.

We leveraged the Price et al. (2020) dataset, which sampled comments from the opinions section of the Canadian Globe and Mail news website. Though our results are promising (with top average micro and macro F1-scores of 88.76% and 67.98%, respectively), this dataset is nonetheless contained in a highly specific context (data was collected from a single Canadian newspaper website), which likely decreases the generalizability of our results. The dataset was also notably imbalanced, with most of the labels attributed to healthy comments. Another limitation of the dataset is the low inter-rater reliability among the coders (Table 1). The low values of the metric reduce confidence in the dataset, but also suggest that identifying different types of unhealthy conversations is challenging even for humans. Also, Krippendorff's α was not reported for two out of the eight classes (generalization unfair and healthy). In future work, we thus aim to expand our experiments to a more diverse dataset by replicating Price et al.'s coding process using a variety of comments from different news websites.

One limitation of our machine learning analysis is that we trained a deep learning architecture (CNN-LSTM) using a relatively small dataset, mainly in terms of the unhealthy categories of conversations. In future work, we plan to increase the number of unhealthy comments in our analysis by using data augmentation techniques such as Generative Adversarial Networks (GANs), which can synthetically generate new comments. NLP data augmentation techniques could also be used to increase the dataset without requiring significant human supervision, such as simulating keyboard-distance errors, substituting words according to their synonyms/antonyms, replacing words with their common spelling mistakes, etc. (Şahin and Steedman, 2019). This may help increase the performance of machine learning classifiers in general.

Another limitation is that we included only four classifiers in our analysis, which may restrict the generalizability of our findings. Future work is therefore advised to include other models successfully used by previous studies in classifying conversations with unhealthy and toxic elements, such as RandomForest, Bi-LSTM networks, Stacked Bi-LSTM (Bagga, 2020), capsule networks (Srivastava et al., 2018), and transformer models including BERT (Price et al., 2020). Future work is also advised to test feature selection methods, which usually help increase classification accuracy by focusing on the most discriminating features while discarding redundant and irrelevant information. Lastly, given the challenging nature of unhealthy comment classification tasks, future work should look at specialized corpora (e.g., Wang and Potts, 2019; Oraby et al., 2017) and machine learning models trained to differentiate particular types of conversations, for example, solely sarcastic comments from non-sarcastic ones (e.g., Dadu and Pant, 2020; Hazarika et al., 2018).

7 Conclusion

This paper analyzed the granularities of subtly toxic online conversations. The study goal was to systematically investigate the general sentiment associated with healthy and unhealthy online comments, as well as the accuracy of machine learning classifiers in recognizing these comments. Towards this end, we leveraged a public dataset containing healthy and unhealthy comments labeled with seven forms of subtle toxicity. We were able to distinguish between these comments with a maximum micro F1-score, macro F1-score, and AUC-ROC of 88.76%, 67.98%, and 0.71, respectively, using a CNN-LSTM network with pre-trained word embeddings. Our conclusions are two-fold: (i) hostile comments are the most negative (i.e., the least subtly toxic) and most detectable form of unhealthy online conversation; and (ii) most types of unhealthy comments are associated with a slight negative sentiment. Findings from this work have the potential to inform and advance future research and development of online moderation tools, which pave the way for safer online environments.
References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: Large-scale machine learning on heterogeneous systems. arXiv preprint arXiv:1603.04467. Software available from tensorflow.org.

Betty van Aken, Julian Risch, Ralf Krestel, and Alexander Löser. 2018. Challenges for toxic comment classification: An in-depth error analysis. arXiv:1809.07572 [cs].

Sunyam Bagga. 2020. Detecting abuse on the internet: It's subtle. McGill University, page 105.

Jean Barbier, Mohamad Dia, Nicolas Macris, Florent Krzakala, Thibault Lesieur, and Lenka Zdeborová. 2016. Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Lindsay Blackwell, Jill Dimond, Sarita Schoenebeck, and Cliff Lampe. 2017. Classification and its consequences for online harassment: Design insights from HeartMob. Proceedings of the ACM on Human-Computer Interaction, 1(CSCW):24:1–24:19.

Zhuang Chen and Tieyun Qian. 2019. Transfer capsule network for aspect level sentiment classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 547–556. Association for Computational Linguistics.

Theodora Chu, Kylie Jue, and Max Wang. 2016. Comment abuse classification with deep learning. Stanford University.

Tanvi Dadu and Kartikey Pant. 2020. Sarcasm detection using context separators in online discourse. In Proceedings of the Second Workshop on Figurative Language Processing, pages 51–55.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. arXiv:1703.04009 [cs].

Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, and Rosalind Picard. 2012. Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Transactions on Interactive Intelligent Systems, 2(3):1–30.

Spiros V. Georgakopoulos, Sotiris K. Tasoulis, Aristidis G. Vrahatis, and Vassilis P. Plagianakos. 2018. Convolutional neural networks for toxic comment classification. arXiv:1802.09957 [cs].

Peter Glick and Susan T. Fiske. 2001. An ambivalent alliance: Hostile and benevolent sexism as complementary justifications for gender inequality. American Psychologist, 56(2):109–118.

Joshua Guberman and Libby Hemphill. 2017. Challenges in modifying existing scales for detecting harassment in individual tweets. In Proceedings of the 50th Annual Hawaii International Conference on System Sciences (HICSS).

Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques, third edition. The Morgan Kaufmann Series in Data Management Systems, 5(4):83–124.

Devamanyu Hazarika, Soujanya Poria, Sruthi Gorantla, Erik Cambria, Roger Zimmermann, and Rada Mihalcea. 2018. CASCADE: Contextual sarcasm detection in online discussion forums. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1837–1848, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

C. J. Hutto and Eric Gilbert. 2015. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014.

Akshita Jha and Radhika Mamidi. 2017. When does a compliment become sexist? Analysis and classification of ambivalent sexism using Twitter data. In Proceedings of the Second Workshop on NLP and Computational Social Science, pages 7–16. Association for Computational Linguistics.

David Jurgens, Eshwar Chandrasekharan, and Libby Hemphill. 2019. A just and comprehensive strategy for using NLP to address online abuse. arXiv:1906.01738 [cs].

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Varada Kolhatkar, Hanhan Wu, Luca Cavasso, Emilie Francis, Kavan Shukla, and Maite Taboada. 2020. The SFU Opinion and Comments Corpus: A corpus for the analysis of online news comments. Corpus Pragmatics, 4(2):155–190.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv:1405.4053 [cs].

Ping Liu, Joshua Guberman, Libby Hemphill, and Aron Culotta. 2018. Forecasting the presence and intensity of hostility on Instagram using linguistic and social features. arXiv:1804.06759 [cs].

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

P. M. Markey. 2000. Bystander intervention in computer-mediated communication. Computers in Human Behavior, 16(2):183–188.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs].

Kevin L. Nadal, Katie E. Griffin, Yinglee Wong, Sahran Hamit, and Morgan Rasmus. 2014. The impact of racial microaggressions on mental health: Counseling implications for clients of color. Journal of Counseling & Development, 92(1):57–66.

Shereen Oraby, Vrindavan Harrison, Lena Reed, Ernesto Hernandez, Ellen Riloff, and Marilyn Walker. 2017. Creating and characterizing a diverse corpus of sarcasm in dialogue. arXiv preprint arXiv:1709.05404.

Pooja Parekh and Hetal Patel. 2017. Toxic comment tools: A case study. International Journal of Advanced Research in Computer Science, 8(5).

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

James W. Pennebaker, Martha E. Francis, and Roger J. Booth. 2001. Linguistic Inquiry and Word Count. Lawrence Erlbaum Associates, Mahwah, NJ.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014a. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014b. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Carla Pérez-Almendros, Luis Espinosa-Anke, and Steven Schockaert. 2020. Don't patronize me! An annotated dataset with patronizing and condescending language towards vulnerable communities. arXiv preprint arXiv:2011.08320.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, and Prateek Vij. 2016. A deeper look into sarcastic tweets using deep convolutional neural networks. arXiv preprint arXiv:1610.08815.

Ilan Price, Jordan Gifford-Moore, Jory Flemming, Saul Musker, Maayan Roichman, Guillaume Sylvain, Nithum Thain, Lucas Dixon, and Jeffrey Sorensen. 2020. Six attributes of unhealthy conversation. arXiv:2010.07410 [cs].

Yu Qiufen. 2014. Understanding the impact of culture on interpretation: A relevance theoretic perspective. Intercultural Communication Studies, 23(3).

H. H. Saeed, K. Shahzad, and F. Kamiran. 2018. Overlapping toxic sentiment classification using deep neural architectures. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pages 1361–1366.

Niloofar Safi Samghabadi, Suraj Maharjan, Alan Sprague, Raquel Diaz-Sprague, and Thamar Solorio. 2017. Detecting nastiness in social media. In Proceedings of the First Workshop on Abusive Language Online, pages 63–72. Association for Computational Linguistics.

Gözde Gül Şahin and Mark Steedman. 2019. Data augmentation via dependency tree morphing for low-resource languages. arXiv preprint arXiv:1903.09460.

Mujahed A. Saif, Alexander N. Medvedev, Maxim A. Medvedev, and Todorka Atanasova. 2018. Classification of online toxic comments using the logistic regression and neural networks models. AIP Conference Proceedings, 2048(1):060011.

Joni Salminen, Hind Almerekhi, Milica Milenkovic, Soon-Gyo Jung, Jisun An, Haewoon Kwak, and Jim Jansen. 2018. Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12.

Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas. 2011. On the stratification of multi-label data. Machine Learning and Knowledge Discovery in Databases, pages 145–158.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40.

Saurabh Srivastava, Prerna Khurana, and Vartika Tewari. 2018. Identifying aggression and toxicity in comments using capsule network. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 98–105, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Derald Wing Sue, Christina M. Capodilupo, Gina C. Torino, Jennifer M. Bucceri, Aisha M. B. Holder, Kevin L. Nadal, and Marta Esquilin. 2007. Racial microaggressions in everyday life: Implications for clinical practice. American Psychologist, 62(4):271–286.

P. Szymański and T. Kajdanowicz. 2017. A scikit-based Python environment for performing multi-label classification. ArXiv e-prints.

Piotr Szymański and Tomasz Kajdanowicz. 2017. A network perspective on stratification of multi-label data. In Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, volume 74 of Proceedings of Machine Learning Research, pages 22–35, ECML-PKDD, Skopje, Macedonia. PMLR.

Vladimir N. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience.

Zijian Wang and Christopher Potts. 2019. TalkDown: A corpus for condescension detection in context. arXiv preprint arXiv:1909.11272.

Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, pages 78–84. Association for Computational Linguistics.

Dawei Yin, Zhenzhen Xue, Liangjie Hong, Brian Davison, April Edwards, and Lynne Edwards. 2009. Detection of harassment on Web 2.0. Proceedings of the Content Analysis in the WEB, 2:1–7.

Liang-Chih Yu, Jin Wang, K. Robert Lai, and Xuejie Zhang. 2017. Refining word embeddings for sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 534–539. Association for Computational Linguistics.

Huan Zhang, Si Si, and Cho-Jui Hsieh. 2017. GPU-acceleration for large-scale tree boosting. arXiv preprint arXiv:1706.08359.

Meishan Zhang, Yue Zhang, and G. Fu. 2016. Tweet sarcasm detection using deep neural network. In COLING.

A. Zheng. 2015. Evaluating Machine Learning Models: A Beginner's Guide to Key Concepts and Pitfalls. O'Reilly Media.