"Nice Try, Kiddo": Investigating Ad Hominems in Dialogue Responses
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
“Nice Try, Kiddo”: Investigating Ad Hominems in Dialogue Responses Emily Sheng1 , Kai-Wei Chang2 , Premkumar Natarajan1 , Nanyun Peng1,2 1 Information Sciences Institute, University of Southern California 2 Computer Science Department, University of California, Los Angeles {ewsheng,pnataraj}@isi.edu, {kwchang,violetpeng}@cs.ucla.edu Abstract Post: Many are trying to co-opt and mischaracterize the #blacklivesmatter movement. We won’t allow it! Ad hominem attacks are those that target some Resp: I hate how much of a victim complex you guys have. feature of a person’s character instead of the Post: You’re the reason we need the #MeToo movement. arXiv:2010.12820v2 [cs.CL] 12 Apr 2021 position the person is maintaining. These at- Resp: Nice try, kiddo. tacks are harmful because they propagate im- Post: Stop eating them if you don’t want them to go ex- plicit biases and diminish a person’s credi- tinct! #govegan bility. Since dialogue systems respond di- Resp: I don’t like your username rectly to user input, it is important to study ad hominems in dialogue responses. To this Table 1: Ad hominem responses to Twitter posts. end, we propose categories of ad hominems, examples of ad hominem responses to Twitter posts. compose an annotated dataset, and build a classifier to analyze human and dialogue sys- Undesirable in any response, ad hominems are un- tem responses to English Twitter posts. We productive in furthering a meaningful discussion specifically compare responses to Twitter top- and can reinforce falsehoods. However, these at- ics about marginalized communities (#Black- tacks appeal to emotions and implicit biases to ar- LivesMatter, #MeToo) versus other topics gue a point, and are thus often effectively harmful (#Vegan, #WFH), because the abusive lan- regardless of whether the attacks are true, recog- guage of ad hominems could further amplify nized, or retracted (Yap, 2013). the skew of power away from marginalized populations. Furthermore, we propose a con- Our work is motivated by this fallacy’s potential strained decoding technique that uses salient to amplify the spread of harmful societal biases. n-gram similarity as a soft constraint for For communities that are already disproportion- top-k sampling to reduce the amount of ad ately harmed by societal power inequalities, ad hominems generated. Our results indicate that hominems further amplify the power imbalance. 1) responses from both humans and DialoGPT Tone policing is a type of ad hominem that seeks contain more ad hominems for discussions to regulate the emotions that a person (usually of around marginalized communities, 2) different quantities of ad hominems in the training data a marginalized population) can use to deliver their can influence the likelihood of generating ad points (e.g., not too angrily), thereby altogether hominems, and 3) we can use constrained de- invalidating the style of delivery, the person’s com- coding techniques to reduce ad hominems in petence, and the points being conveyed. Besides di- generated dialogue responses. rectly experiencing ad hominem attacks, marginal- ized groups could also be disproportionately dis- 1 Introduction couraged from using technologies that propagate Ad hominems attack an opponent’s character or these attacks, since abusive language from a tech- identity instead of the points the opponent is mak- nology can deter people from using the technology ing, and can exist in any conversational setting (Sood et al., 2012b). between two or more entities. From an argumen- The goal of this study is to analyze ad hominems tation perspective, ad hominems are fallacies, and in dialogue system- and human-generated re- fallacies rely on faulty reasoning to advance a point sponses for topics that vary in impact to marginal- (Hansen, 2020). These ad hominem fallacies are ized populations. Through analysis, we formulate related to abusive language, toxicity, and microag- techniques to reduce ad hominem responses and gressions, and can be expressed with both subtle thus the associated harms, which is especially im- and explicitly offensive language. Table 1 presents portant for dialogue systems since these systems
directly interact with users. hominems in dialogue systems is related to exam- We analyze responses from DialoGPT (Zhang ining offensive language and other harms. Lastly, et al., 2020a) and humans to English Twitter posts. we discuss existing constrained decoding methods. Specifically, we compare responses to Twitter Ad Hominems In the argumentation literature, topics about marginalized communities (#Black- theoretical ad hominems include the abusive (attack LivesMatter, #MeToo) versus other topics (#Vegan, on the opponent’s character), tu quoque (“he did #WFH). Through human annotation and trained it first”), circumstantial (accusation of hypocrisy), classifiers, we find that ad hominems exist in both and guilt by association (associating the opponent human and DialoGPT responses. Across response with someone with low credibility) (Walton, 1998; sources, there are more ad hominems in #Black- Woods, 2007). Wijze (2003) criticizes that these LivesMatter- and #MeToo-related responses, fewer textbook examples are not realistic in conversa- in #Vegan-related responses, and even fewer in tion. For more empirical categories, Habernal #WFH-related responses. The presence of more et al. (2018) propose ad hominem types based on ad hominems in responses to social issues that analysis of Reddit’s ChangeMyView discussion concern marginalized groups has troubling impli- threads, and Delobelle et al. (2019) analyze the cations about the amplified harms toward these name-calling and abusive categories. Moreover, groups. Wulczyn et al. (2017) use classifiers for a large- Given our analysis, we further propose a con- scale analysis of personal attacks in Wikipedia com- strained decoding algorithm to reduce the amount ments. We build upon prior works to define and of ad hominems generated by dialogue systems. By analyze ad hominems in a conversational setting. using salient n-gram similarity to apply soft con- Additionally, Yap (2013) discusses the harmful straints to top-k sampling, our proposed technique effects of implicit biases in forming and evaluating is simple, extensible to reducing other harms, and ad hominems. They emphasize that ad hominem does not require much additional computation. At attacks can be harmful to a person’s credibility each decoding time step, the technique compares and expertise even if the attack is recognized as the similarity between the current generated output fallacious and irrelevant to the argument. In par- and salient ad hominem versus non-ad hominem ticular, because societal norms allow biases and n-grams, possibly selecting alternative token can- stereotypes to detract from a person’s credibility didates to generate. This technique is effective at or expertise, the use of ad hominems can further reducing the amount of ad hominems generated diminish the rhetorical credibility (Govier, 1993) across topics while maintaining coherence and rel- of marginalized groups. evance. Our main contribution is a novel analysis of ad Offensive Language Detection Ad hominems hominem responses generated by humans and Di- occur in many forms and are related to differ- aloGPT across topics varying in impact to marginal- ent types of offensive language, including abu- ized communities. For this analysis, we propose sive language (Yin et al., 2009; Chen et al., 2012; empirically-derived ad hominem categories that are Nobata et al., 2016), hate speech (Warner and further verified through annotation. Furthermore, Hirschberg, 2012; Kwok and Wang, 2013; Djuric we build a new dataset of Twitter posts paired with et al., 2015), profanity (Sood et al., 2012a), and the human- and DialoGPT-generated responses, where more subtle forms of microaggressions (Breitfeller the responses have ad hominem-related labels. Fi- et al., 2019) and projecting biases and stereotypes nally, we devise a constrained decoding technique through power differentials in language (Sap et al., that uses salient n-gram similarity to steer top-k 2020). Ranging from outright insults to condescen- sampling away from ad hominem responses. We re- sion, ad hominems are a form of offensive language lease data and code at https://github.com/ that is difficult to comprehensively and objectively ewsheng/ad-hom-in-dialogue. define. Nonetheless, these responses are important to characterize, since they can irreparably damage 2 Related Work a person’s credibility. It is also generally important to identify these subtle forms of offensive language, This work is related to a broad spectrum of topics, since it is unclear if existing offensive language de- including prior definitions of ad hominems and how tection techniques are equally effective for these ad hominems facilitate biases. Also, analyzing ad subtle forms.
Harms in Dialogue Systems Conversational Polarizing Affects # [post, systems are known to perpetuate several types of Topic marginalized human resp] topic group pairs harms. Ruane et al. (2019) caution about harms that BLM yes yes 4,037 can result from using conversational systems and MeToo yes yes 2,859 propose striving for trust and transparency; Roller Vegan yes no 3,697 et al. (2020) suggest techniques for chatbot safety. WFH no no 3,992 For analysis, Sheng et al. (2019) evaluate societal Total - - 14,585 biases in language generation, Curry and Rieser Table 2: Topics, rationales, and statistics for the human (2018) study how conversational systems respond response subset from the A D H OM I N T WEETS dataset. to sexual harassment, and Khatri et al. (2018) detect offensive content with a semi-supervised approach. controversial) and non-polarizing; we expect there To reduce harms, Sheng et al. (2020) present a to be more strong opinions for the polarizing top- framework for controlling biases in language gener- ics and thus more ad hominem responses for those ation, and Dinan et al. (2019) show how adversarial topics. For this study, we choose the topic WFH attacks can make models more robust to offensive (“work from home”) as a non-polarizing topic and language usage from humans. collect Twitter posts that include the hashtag #wfh Constrained Decoding For constrained decod- or #workingfromhome. Polarizing topics can fur- ing, prior works focus on incorporating words or ther be divided into those that are directly relevant phrases (as hard or soft constraints) into the de- to marginalized communities and those that are not. coded output. Swanson et al. (2014) and Balakr- For the latter, we choose the topic Vegan and col- ishnan et al. (2019) use parse trees among other lect posts that include any of the hashtags: #vegan, techniques to enforce constraints in the generated #veganism, #govegan, or #veganlife.1 For polariz- text. Hokamp and Liu (2017); Post and Vilar (2018) ing topics that are directly relevant to marginalized propose variants of Grid Beam Search, which gen- groups, we focus on the topics BLM (from #black- erate output that include lexical constraints. Miao livesmatter posts) and MeToo (from #metoo posts). et al. (2019); Zhang et al. (2020b); Susanto et al. #blacklivesmatter is related to the “justice, healing, (2020) explore insertion-based non-autoregressive and freedom to Black people across the globe”,2 decoding algorithms. To be compatible with an and #metoo is related to the movement against sex- autoregressive model like DialoGPT and effective ual violence.3 In total, we collect 14,585 [post, for open-domain generation, we apply constrained response] pairs of Tweets posted between Aug. 7 decoding to top-k sampling. Our method also dif- and Oct. 29, 2020; detailed data statistics are in fers from these prior works in that it imposes soft Table 2. We replace all usernames and urls with constraints to not generate phrases that are likely special placeholders to better anonymize the data. to lead to ad hominems. Decoding-time techniques Models In this work, we analyze responses from that can be used to reduce harmful language gen- the DialoGPT (Zhang et al., 2020a) dialogue model. eration, e.g., the Plug and Play Language Model DialoGPT was originally trained on web data, and (PPLM) (Dathathri et al., 2020), are most relevant then was further fine-tuned for multi-turn conver- to our technique. sational capabilities on Reddit data. Since models can vary in harm depending on the training data, we 3 Dataset and Model Setup compare responses from the original medium-sized DialoGPT to responses from DialoGPT separately This section describes the dataset collection process fine-tuned on each of the four topics from the hu- and the dialogue model variations we analyze. man response subset of A D H OM I N T WEETS.4 Dataset Collection Our goal is to understand how ad hominem responses differ across discus- 4 Identifying Ad Hominem Responses sions that vary in impact and relevance to marginal- It is generally difficult to settle on a comprehen- ized groups. To that end, we extract English [post, sive list of ad hominem categories. We build response] pairs on different topics from Twitter and 1 also use DialoGPT to generate responses for all col- Habernal et al. (2018) find that vegan-related topics are one of the top topics that contain ad hominems in their study. lected posts. We refer to this collective dataset as 2 https://blacklivesmatter.com the A D H OM I N T WEETS dataset. 3 https://metoomvmt.org 4 Relevant topics are divided into polarizing (i.e., More details are in Appendix A.2.
AH Type Topic Post Response Stupidity BLM Together. #blacklivesmatter That’s a dumb thing to say. Ignorance BLM Your all welcome to join in on the #blm movement! You mean "you’re" Trolling/Lying Vegan It’s time to end intensive meat production...#vegan You must be a troll. Bias BLM This is why people are protesting, this is why the #BLM movement You’re racist because you is necessary. focus on race. Condescension MeToo 3 years into #MeToo era, real apologies are few and far between Can you stay out of grown folks’ business... Other Vegan It’s not a ‘personal choice’ when a ‘victim’ is involved. #GoVegan You’re better than this. Non-AH WFH #WFH benefit: no co-worker judgement microwaving fish for lunch The smell of fish is deadly. Table 3: Ad hominem (AH) categories. The post provides context to analyze ad hominems in the response. upon the work of Habernal et al. (2018) to devise as “you are” are more likely to have ad hominems. ad hominem categories that are both empirically- We call these responses you-responses.6 In addi- motivated and can be annotated with high inter- tion to pairs with you-responses, we also collect annotator agreement. We specifically include cate- random pairs without you-responses for annotation gories such as “ignorance” and “condescension” to to ensure that the annotated samples are represen- cover more subtle forms of personal attacks (e.g., tative of different ad hominems. tone policing, mansplaining) that could further di- Annotation Task We ask annotators on Mechan- minish the credibility of those who are already ical Turk to read a post and response and determine marginalized. We also limit the definition of ad whether the response contains any ad hominem(s) hominem to personal attacks towards the author of towards the person who made the post. We divide the post and not a third person. ad hominems into the following categories: stupid- ity, ignorance, trolling/lying, bias, condescension, 4.1 Human Annotation and other; examples are in Table 3.7 We collect human annotations that can then be Annotation Round 1 The goal for the first round used for analysis and training a classifier to au- of human annotation is to collect enough data to tomatically label ad hominems. Although Haber- train an ad hominem classifier. To balance targeted nal et al. (2018) propose a similar typology of ad and random samples, for each topic (BLM, MeToo, hominems, there is no existing dataset annotated Vegan, WFH) and response source (human, Di- with their empirically-derived categories. More- aloGPT) pair, we randomly select 150 [post, re- over, we study ad hominems in casual conversa- sponse] pairs with you-responses and another 150 tional settings. For these reasons, we annotate a pairs without you-responses for annotation. In total, subset of A D H OM I N T WEETS with ad hominem we gather 2,400 [post, response] pairs that are then information. To measure inter-annotator agree- annotated through Mechanical Turk. ment, we calculate the Worker Agreement With Additional Annotations We conduct three more Aggregate (WAWA) score, following Ning et al. rounds of annotations to retrieve more ad hominem (2020). The WAWA score compares the majority responses. For the second and third rounds, we use votes against each annotator and micro-averages an ad hominem classifier trained on data from all the resulting precision, recall, and F1 scores.5 previous rounds (with the same architecture and Heuristics for Ad Hominems Ad hominem re- hyperparameters as the final classifier in Sec. 4.2) sponses are relatively rare and range broadly from to label unseen samples in A D H OM I N T WEETS. explicit to more subtle forms. For more effective We then select a balanced amount of automatically- annotation, we use heuristics to choose [post, re- labeled ad hominems and non-ad hominems from sponse] pairs where the response is likely to be an each [topic, response source] pair to annotate.8 ad hominem. In preliminary analyses, we find that Some topics (e.g., WFH and Vegan) prompt responses that contain certain “you”-phrases such fewer ad hominem responses, so it is difficult to 5 There are also other agreement metrics such as Krippen- 6 dorff’s alpha, but because we expect our data to have many Full set of you-responses is in Appendix A.1. 7 more non-ad hominem compared to ad hominem responses, Full details are in Appendix A.7. 8 alpha scores can be misleading—the WAWA score gives a For each [topic, response source] pair, we choose 150 more appropriate estimate of annotator agreement. samples for Round 2 and 100 samples for Round 3.
find enough of these responses “in the wild” to train a new sample. We generate these new data sam- a more accurate classifier. Our solution is to manu- ples to roughly balance the number of samples ally take the responses annotated as ad hominems across topics and across ad hominems versus non- and pair them with WFH or Vegan posts. To verify ad hominems for each topic. These new combina- that these new pairs contain ad hominem responses, tions of [post, response] pairs help de-emphasize we run a fourth round of annotation on these pairs spurious correlations between topics and classifier and only keep the ones where the majority of anno- labels. tators label the response as an ad hominem to the Since the automatic augmentation reduces em- post. We combine majority annotations across all phasis on the post when predicting the presence of rounds of annotations to train the final ad hominem ad hominems in the response, a natural question classifier used for analysis. is if the post is really necessary to gauge whether the response contains ad hominems. The answer is 4.2 Ad Hominem Classifier mixed—for example, the response “you’re a troll” is an ad hominem for any post. However, the re- For large-scale analysis of ad hominems in hu- sponse “those who promote veganism are arrogant man and dialogue system responses, we rely on fools” is an ad hominem given the post “everyone classifier annotation. To simplify the learning should follow veganism”, but not an ad hominem problem, we condense the different ad hominem given the post “I don’t understand veganism”. Em- categories into a binary yes/no scheme, where pirically, by limiting the classifier input to only “yes" indicates the presence of any type and quan- responses, the classifier performs worse than if it tity of ad hominems in the response given the has both the post and response as input.9 post. We build a classifier to automatically label whether a response contains ad hominems for a 5 Reducing Ad Hominem Responses given post by fine-tuning a BERT (Devlin et al., 2019) model with the input format “[CLS] POST Inspired by the success of n-gram features in de- [SEP] RESPONSE [SEP]”. We additionally in- tecting abusive language by Nobata et al. (2016), clude comparisons to a baseline classifier built we propose a constrained decoding algorithm to dis- on top of DialoGPT to similarly label whether a courage the model from generating n-grams that post and response pair indicates the presence of are semantically similar to salient n-grams found an ad hominem response. This baseline classifier in ad hominem responses. While we motivate this allows a comparative evaluation of a bi-directional technique within the context of ad hominems, the encoder model versus an auto-regressive decoder technique is applicable to other subtle harms (e.g., model for ad hominem classification and how this microaggressions) in language generation. difference may affect the quality of control tech- A naive method to generate fewer ad hominems niques that rely on the latter (e.g., PPLM (Dathathri is to block words that are likely to occur in ad et al., 2020), GeDi (Krause et al., 2020)). Ap- hominems. However, ad hominems are contextu- pendix A.2 includes more details of our model im- ally determined, meaning that phrases are a better plementation and data statistics (Table 8). indicator than words, thus motivating our use of n-grams. Additionally, our algorithm uses soft con- Ultimately, the goal is to train an ad hominem straints because there are no words or phrases that detection classifier that has high accuracy across always indicate the presence of an ad hominem. sources and topics, so we curate the dev and test In this section, we describe how our technique datasets to be balanced across topics, response S ALIEN S IM T OP -k extends top-k sampling by in- sources, and ad hominem versus non-ad hominem corporating n-gram similarity constraints. samples (through downsampling). Because of the Salient n-grams We define salient ad hominem natural imbalance of ad hominem responses for n-grams to be n-grams that appear more frequently different topics, ad hominem responses for topics in ad hominem responses than in non-ad hominem like WFH are relatively sparse compared to those responses. Similarly, salient non-ad hominem n- for topics like BLM. We automatically augment our training set to combat this sparsity. First, we 9 By randomly forming new (post, response) pairs during accumulate all posts and responses not present in augmentation, we do not explicitly account for the responses that are context-specific; however, we find the context-specific the dev and test sets. Next, we choose a random responses to be relatively rare and that our augmentation em- post to pair with a random labeled response to form pirically results in a more robust classifier.
AH n-gram Score non-AH n-gram Score Algorithm 1: S ALIEN S IM T OP -k serious or not 15.0 thank you for 18.8 Data: input tokens x, # top tokens k, # candidate don’t know what 13.0 thanks for sharing 8.9 tokens t, # recent tokens r, salient ad hominem how can you 11.0 i think it’s 8.9 average n-grams A, salient non-ad hominem you’re a troll 11.0 you are right 8.9 average n-grams B, semantic similarity you’re being a 11.0 is the best 8.9 threshold γ Result: output tokens y Table 4: Top salient n-grams and their salience scores y=x for ad hominem (AH) and non-ad hominem (non-AH) while len(y) < max_steps + len(x) do vocab_logits = model(y) responses, as calculated from the annotator-labeled sub- P 0 = choose top-k vocab_logits and rescale set of A D H OMS I N T WEETS. candidate_tokens = sample t tokens using P 0 for cand in candidate_tokens do grams appear more frequently in non-ad hominem if special_condition then responses than in ad hominem responses. We use y.append(cand) continue to While condition the salience score as defined by Li et al. (2018): r_gram = last r − 1 tokens of y + cand count(u, Da ) + λ c = avg(r_gram) S(u, a) = P . (1) sim_a = similarity(c, A) a0 ∈A,a0 6=a count(u, Da0 ) + λ sim_b = similarity(c, B) if sim_a - sim_b
Topic Source dev test avg 27.5 27.3 25.5 30 24.6 Human 83.3 82.9 83.1 21.7 BLM 20.8 % ad hominems 19.1 19.1 DialoGPT 84.2 75.7 80.0 18.5 15.6 20 Human 80.0 73.7 76.9 12.4 12.1 12.0 MeToo 11.0 DialoGPT 85.0 80.0 82.5 8.9 7.3 7.1 6.3 10 6.2 Human 80.0 70.6 75.3 4.7 Vegan 4.0 3.0 DialoGPT 82.9 82.9 82.9 1.9 1.9 Human 77.8 83.3 80.6 0 WFH DialoGPT 92.3 88.4 90.4 BLM MeToo Vegan WFH Table 5: BERT-based classifier F1 scores for ad Human DialoGPT FBLM FMeToo FVegan FWFH hominem responses across topics and response sources. Figure 1: % of classifier-labeled ad hominem oc- The classifier does relatively well across topics and currences across human, DialoGPT, and fine-tuned sources. DialoGPT responses (“FXX ”). There are 14.5K re- sponses (to all posts in A D H OM I N T WEETS) per re- erating a sample, this algorithm adds a constant sponse source. Human and DialoGPT responses con- amount of computational resources compared to tain more ad hominems for BLM and MeToo, fol- the original, non-constrained decoding. lowed by Vegan and then WFH. Fine-tuning on topics Implementation Details In our experiments, we with more/fewer ad hominems results in more/fewer ad set k = 40 (commonly used in previous genera- hominems generated across topics. tion tasks (Radford et al., 2019)). With parameter 6.2 Ad Hominem Analysis tuning, we find t = 10 and γ = 0 effective for our setup. We use r = 5 to compare the averaged em- Ad Hominem Categories By comparing ad bedding of the most recent 5-gram with those of hominem types across the manually-annotated hu- salient 3-, 4-, and 5-grams. Additionally, we use man and DialoGPT responses, we find that ad cosine similarity as the similarity metric and our hominems in human responses frequently occur “special_condition” includes either a) a limit of 5 in the forms of “condescension” and “ignorance”, for backtracking or b) the first r time steps. while ad hominems in DialoGPT responses occur in the forms of “ignorance” and “other” types (Ta- 6 Results ble 11 in the Appendix). These results indicate that responses from different sources and topics are 6.1 Identifying Ad Hominems likely to contain different ad hominems. Formally Annotation Across all rounds of annotations, the categorizing ad hominems allows for more consis- average WAWA scores include a precision of 0.82, tent annotations and a better understanding of the recall of 0.92, and F1 of 0.87, indicating moderately types DialoGPT is prone to generate. high majority agreement. Generally, the agreement DialoGPT Responses The classifier enables us scores for the human responses are slightly higher to perform a large-scale study of ad hominem than those for the DialoGPT responses—we hy- trends across various contexts for the entire A D - pothesize that the former tend to be more coherent H OM I N T WEETS dataset. Figure 1 shows the per- and longer, and thus more informative. centage of ad hominem responses to posts across Ad Hominem Classifier The resulting BERT- topics and response sources. Focusing on the “Hu- based classifier has an overall dev F1 score of man” and “DialoGPT” bars for each topic, we see 83.3% and a test F1 score of 80.0% for ad that ad hominem responses are present across all hominems. The DialoGPT-based classifier has a topics for both response sources. Additionally, ad dev F1 score of 74.6% and a test F1 score of 72.6%, hominem responses occur more frequently in dis- supporting our use of the BERT-based classifier to cussions related to BLM and MeToo and less fre- automatically detect ad hominems in the rest of this quently in discussions related to Vegan and WFH. work.10 The full breakdown of F1 scores across Vegan discussions also seem to attract more ad topics and response sources is shown in Table 5 hominem responses than WFH discussions. The and Appendix Table 9. relatively higher rates of ad hominem responses in 10 topics related to marginalized communities indi- This result additionally suggests that control techniques that rely on signal from auto-regressive decoder models as cate the elevated potential for harm towards these discriminators may encounter more noise. communities.
DialoGPT Trigger PPLM FWFH SS FWFH +SS Post: Many are trying to co-opt and mischaracterize the #blm movement. We won’t allow it! 21.7 Src: DialoGPT 18.5 Resp: I hate how much of a victim complex you guys have. % ad hominems 20 12.6 Src: DialoGPT + S ALIEN S IM T OP -k 12.0 11.1 11.0 10.5 Resp: This is so true. 8.9 8.0 7.1 10 6.7 6.6 Src: FWFH + S ALIEN S IM T OP -k 5.7 5.1 4.0 3.9 3.6 Resp: I’m in the minority and I don’t think it’s possible to 3.0 2.9 2.6 2.0 1.9 0.9 make it a better movement. 0.2 0 BLM MeToo Vegan WFH Table 6: Examples of responses generated from differ- (a) 14.5K classifier-labeled responses (to all posts in A D - ent sources. FWFH is DialoGPT fine-tuned on WFH. H OM I N T WEETS) per response source. 20 introduce four baselines to span the different 16 % ad hominems 15 classes of harm reduction techniques. The first 11 10 baseline is simply the original DialoGPT. Our data- 9 10 8 8 based reduction baseline is DialoGPT fine-tuned 5 5 4 4 4 4 5 3 3 2 2 2 on the WFH dataset, as described in Sec. 3. For 1 1 1 1 1 1 0 0 BLM MeToo Vegan WFH the first decoding-based baseline, we rely on a (b) 400 human-labeled responses (to posts randomly chosen gradient-based method post-training to find a “trig- from A D H OM I N T WEETS) across topics per response source. ger phrase”, which is then attached to a prompt Figure 2: Reducing ad hominems in generated re- at inference time to influence the generated out- sponses. FWFH is fine-tuned on WFH data and SS is put (Wallace et al., 2019). Sheng et al. (2020) S ALIEN S IM T OP -k. Results suggest all ad hominem re- further propose a framework to use these triggers duction techniques are effective compared to the orig- to control societal biases, and we use these meth- inal DialoGPT. SS is the most effective individual method, outperforming FWFH , Trigger, and PPLM base- ods to find a trigger that can induce DialoGPT lines. FWFH +SS could further reduce the amount of ad to generate fewer ad hominems and more non-ad hominem responses generated. hominems when prepended to posts about different topics. For the second decoding-based baseline, we Fine-tuned DialoGPT Responses Figure 1 also use the Plug and Play Language Model (PPLM) shows that fine-tuning on datasets that contain more proposed by Dathathri et al. (2020), which guides ad hominem responses leads to more generation a pre-trained language model’s generated output of ad hominem responses across topics.11 From using gradients from attribute classifiers.12 these results, we infer that the original DialoGPT Human Annotation To verify ad hominem (which was fine-tuned from GPT-2) was trained trends from the automatic evaluation, we randomly on a dataset that likely contained relatively more select 100 samples from each [reduction technique, rather than fewer ad hominems. Additionally, fine- topic] pair for additional human annotation. tuning on a carefully chosen dataset can reduce the General Trends Classifier and human evalua- quantity of generated ad hominems and associated tions for techniques to reduce ad hominems are harms. in Figure 2, and examples of generated responses are in Table 6. The classifier-labeled results allow 6.3 Ad Hominem Reduction us to evaluate 14.5K samples across all topics per Baselines We compare techniques from two response source, and the human-labeled results al- classes of harm reduction methods for lan- low us to more accurately evaluate a smaller set guage generation: data-based and decoding-based. of samples. Overall, the trends for classifier and Gehman et al. (2020) define data-based techniques human evaluations are similar, and the evaluations as those where further model training on more suggest that all ad hominem reduction techniques data is necessary and decoding-based techniques are effective compared to the original DialoGPT. as those where the generation strategy is changed Furthermore, S ALIEN S IM T OP -k is more effective without changing model parameters. For our main than the other individual techniques, and combin- decoding-based S ALIEN S IM T OP -k technique, we ing fine-tuning and S ALIEN S IM T OP -k has promise for further reducing the amount of generated ad 11 Table 13 in the Appendix includes examples generated by 12 the fine-tuned models. More details are in Appendix A.3 and A.4.
BLM MeToo Vegan WFH Avg Source C R C R C R C R C R DialoGPT 4.5 3.0 4.3 3.5 4.2 3.2 4.3 2.6 4.3 3.1 Trigger 4.5 3.0 4.5 3.2 4.3 2.8 4.4 2.8 4.4 3.0 PPLM 4.1 3.0 3.7 3.0 3.6 2.9 3.8 2.6 3.8 2.9 FWFH 4.2 3.6 4.1 3.6 3.6 3.4 4.0 3.7 4.0 3.6 SS 4.5 3.2 4.4 3.2 4.1 3.6 4.4 3.1 4.4 4.1 FWFH +SS 3.8 3.1 3.8 3.6 3.9 3.2 4.1 4.1 3.9 3.5 Table 7: Average coherence (C) and relevance (R) of responses across sources and topics, each on a scale of 1-5, where higher scores are better. Each value is averaged over 25 random samples (and 3 annotators per sample). The highest score(s) per column are bolded, and the lowest score(s) per column are underlined. Trigger generates slightly more coherent responses, though at the cost of relevance. PPLM generates responses that are relatively lower in both coherence and relevance. SS maintains a decent balance of coherence and relevance, and FWFH +SS produces slightly less coherent responses that are mixed in relevance. hominems. different topics.13 Spearman’s correlation is mod- erately high (0.46) for relevance and a bit lower for For S ALIEN S IM T OP -k, limiting the number of coherence (0.38), indicating the task subjectivity. times we backtrack to previous time steps ensures Discussion The collective results indicate that that the algorithm is not significantly slower com- S ALIEN S IM T OP -k is an effective standalone ad pared to the original top-k sampling algorithm. hominem reduction technique that maintains gen- Empirically, we find that using S ALIEN S IM T OP -k erated text quality; while it can be combined with with a backtracking limit of 5 on the original Di- other techniques to further reduce ad hominems, aloGPT results in 13% of the decoding operations one should carefully evaluate the trade-offs be- being “non-forward” operations, where the set of tween response coherence and relevance. Addi- decoding operations are: a) choosing the current tionally, for reducing harmful language types that token and moving forward to the next timestep, b) are more subjective or difficult to detect, straight- looking for an alternate token at the same timestep, forward control techniques that rely on salient n- or c) moving backward to a previous timestep. grams may be more useful than techniques that rely When applying constrained decoding to DialoGPT on noisier signals from classifiers. fine-tuned on WFH, 10% of the operations are non- forward operations. Since ad hominems are less 7 Conclusion common than non-ad hominems, the algorithm is able to proceed with the first sampled candidate to- Ad hominem responses from dialogue systems are ken in most time steps. Additionally, models or top- offensive, stall conversations, and are especially ics that are inclined to generate more ad hominems harmful for marginalized communities. We ana- incur more non-forward operations. lyze responses to find that discussions on topics that affect marginalized groups contain more ad Coherence and Relevance Evaluation To en- hominems. Through a novel constrained decoding sure that the ad hominem reduction techniques do technique, we decrease the amount of ad hominems not affect the quality of the generated responses, generated from dialogue systems while keeping the we have annotators label the coherence and rele- response quality comparable. Furthermore, our vance of a response to a post, both on a scale of method can be easily applied to other pre-trained 1 to 5, where a higher score is better. The trigger language generation models and other subtle yet method produces samples that are relatively more harmful language. More broadly, our work strives coherent, although at the cost of lower relevance to understand ad hominems in the context of harms to the post. PPLM generates responses that are in conversational systems. relatively lower in both coherence and relevance. Broader Impact S ALIEN S IM T OP -k manages to maintain a decent balance of generating both coherent and relevant re- This work identifies personal attacks in responses sponses. Combining S ALIEN S IM T OP -k with fine- generated by dialogue systems, quantifies the dis- tuning on WFH data results in responses that are 13 Example generations across sources are in Appendix Ta- slightly less coherent and mixed in relevance for ble 14.
proportionate amount generated for topics concern- tract W911NF-15-1-0543 with the US Defense Ad- ing marginalized populations, and proposes meth- vanced Research Projects Agency (DARPA). The ods to reduce ad hominem-related harms. views expressed are those of the authors and do not Dataset We collect an English dataset from Twit- reflect the official policy or position of the Depart- ter and ensure that personal information (e.g., user- ment of Defense or the U.S. Government. names, emails, urls) is discarded. We also collect crowd-sourced annotations for this dataset through Mechanical Turk, where we ask for judgements References of whether a response contains ad hominems for a Anusha Balakrishnan, Jinfeng Rao, Kartikeya Upasani, given post, and the coherence and relevance of a Michael White, and Rajen Subba. 2019. Con- response. No information about the annotators are strained decoding for neural nlg from compositional collected from the annotation tasks. The annotation representations in task-oriented dialogue. In Pro- ceedings of the 57th Annual Meeting of the Associa- information (pay per amount of work, guidelines) tion for Computational Linguistics, pages 831–844. is in the Appendix. One annotation aspect that we did not con- Luke Breitfeller, Emily Ahn, David Jurgens, and Yu- trol for is whether the annotators themselves are lia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in so- from marginalized communities. When measuring cial media posts. In Proceedings of the 2019 Con- harms towards different demographics, it is im- ference on Empirical Methods in Natural Language portant to consider the lived experiences of those Processing and the 9th International Joint Confer- groups and how these experiences may affect our ence on Natural Language Processing (EMNLP- IJCNLP), pages 1664–1674. analyses. Future work includes specifically collect- ing annotations from marginalized groups. Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. Additionally, we analyze ad hominems in re- 2012. Detecting offensive language in social me- sponses to four Twitter topics and from one dia- dia to protect adolescent online safety. In Proceed- logue model, which leaves much room for explor- ings of the 2012 ASE/IEEE International Confer- ence on Social Computing and 2012 ASE/IEEE In- ing the generalizability of the trends we see. ternational Conference on Privacy, Security, Risk Techniques In terms of dual-use harms, our con- and Trust, SOCIALCOM-PASSAT ’12, page 71–80, strained decoding technique could potentially be USA. IEEE Computer Society. used to amplify rather than reduce ad hominems (or Amanda Cercas Curry and Verena Rieser. 2018. # other harmful language). However, we believe that metoo alexa: How conversational systems respond by being transparent about this technique and re- to sexual harassment. In Proceedings of the Second leasing the associated code and data, we can better ACL Workshop on Ethics in Natural Language Pro- counter attempts of malicious misuse. cessing, pages 7–14. Furthermore, to perform a large-scale analysis Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane of ad hominems across different contexts, we build Hung, Eric Frank, Piero Molino, Jason Yosinski, and an automatic classifier. While we spent much effort Rosanne Liu. 2020. Plug and play language mod- on collecting representative train/dev/test datasets els: A simple approach to controlled text generation. and verifying classifier quality and observed trends In International Conference on Learning Represen- tations. with human labels, collecting more (diverse) data could help further improve the classifier accuracy Pieter Delobelle, Murilo Cunha, Eric Massip Cano, and robustness. In the meantime, we think this Jeroen Peperkamp, and Bettina Berendt. 2019. Com- work introduces an important perspective of how ad putational ad hominem detection. In Proceedings of the 57th Annual Meeting of the Association for Com- hominems in dialogue systems reinforce unequal putational Linguistics: Student Research Workshop, harms and effective reduction methods. pages 203–209. Acknowledgments Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of We would like to thank members of the PLUS deep bidirectional transformers for language under- Lab and the anonymous reviewers for the help- standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for ful feedback, and Jason Teoh for the many dis- Computational Linguistics: Human Language Tech- cussions. This paper is supported in part by NSF nologies, Volume 1 (Long and Short Papers), pages IIS 1927554 and by the CwC program under Con- 4171–4186.
Emily Dinan, Samuel Humeau, Bharath Chintagunta, Ben Krause, Akhilesh Deepak Gotmare, Bryan Mc- and Jason Weston. 2019. Build it break it fix it for Cann, Nitish Shirish Keskar, Shafiq Joty, Richard dialogue safety: Robustness from adversarial human Socher, and Nazneen Fatema Rajani. 2020. Gedi: attack. In Proceedings of the 2019 Conference on Generative discriminator guided sequence genera- Empirical Methods in Natural Language Processing tion. arXiv preprint arXiv:2009.06367. and the 9th International Joint Conference on Natu- ral Language Processing (EMNLP-IJCNLP), pages Irene Kwok and Yuzhou Wang. 2013. Locate the hate: 4529–4538. Detecting tweets against blacks. In Proceedings of the Twenty-Seventh AAAI Conference on Artifi- Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Gr- cial Intelligence, AAAI’13, page 1621–1622. AAAI bovic, Vladan Radosavljevic, and Narayan Bhamidi- Press. pati. 2015. Hate speech detection with comment em- beddings. In Proceedings of the 24th International Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Conference on World Wide Web, WWW ’15 Com- Delete, retrieve, generate: a simple approach to sen- panion, page 29–30, New York, NY, USA. Associa- timent and style transfer. In Proceedings of the 2018 tion for Computing Machinery. Conference of the North American Chapter of the Association for Computational Linguistics: Human Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hi- Language Technologies, Volume 1 (Long Papers), erarchical neural story generation. In Proceedings pages 1865–1874. of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei pages 889–898. Li. 2019. Cgmh: Constrained sentence generation by metropolis-hastings sampling. In Proceedings of Sam Gehman, Suchin Gururangan, Maarten Sap, Yejin the AAAI Conference on Artificial Intelligence, vol- Choi, and Noah A Smith. 2020. Realtoxici- ume 33, pages 6834–6842. typrompts: Evaluating neural toxic degeneration in Qiang Ning, Hao Wu, Rujun Han, Nanyun Peng, Matt language models. Proceedings of the 2020 Confer- Gardner, and Dan Roth. 2020. Torque: A reading ence on Empirical Methods in Natural Language comprehension dataset of temporal ordering ques- Processing - Findings (EMNLP-Findings). tions. In the 2020 Conference on Empirical Methods Trudy Govier. 1993. When logic meets politics: tes- in Natural Language Processing (EMNLP). timony, distrust, and rhetorical disadvantage. Infor- Chikashi Nobata, Joel Tetreault, Achint Thomas, mal Logic, 15(2). Yashar Mehdad, and Yi Chang. 2016. Abusive lan- Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, guage detection in online user content. In Proceed- and Benno Stein. 2018. Before name-calling: Dy- ings of the 25th International Conference on World namics and triggers of ad hominem fallacies in web Wide Web, WWW ’16, page 145–153, Republic and argumentation. In Proceedings of the 2018 Confer- Canton of Geneva, CHE. International World Wide ence of the North American Chapter of the Associ- Web Conferences Steering Committee. ation for Computational Linguistics: Human Lan- Matt Post and David Vilar. 2018. Fast lexically con- guage Technologies, Volume 1 (Long Papers), pages strained decoding with dynamic beam allocation for 386–396. neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of Hans Hansen. 2020. Fallacies. In Edward N. Zalta, ed- the Association for Computational Linguistics: Hu- itor, The Stanford Encyclopedia of Philosophy, sum- man Language Technologies, Volume 1 (Long Pa- mer 2020 edition. Metaphysics Research Lab, Stan- pers), pages 1314–1324. ford University. Alec Radford, Jeff Wu, Rewon Child, David Luan, Chris Hokamp and Qun Liu. 2017. Lexically con- Dario Amodei, and Ilya Sutskever. 2019. Language strained decoding for sequence generation using grid models are unsupervised multitask learners. beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Lin- Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, guistics (Volume 1: Long Papers), pages 1535– Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, 1546. Kurt Shuster, Eric M Smith, et al. 2020. Recipes for building an open-domain chatbot. arXiv preprint Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and arXiv:2004.13637. Yejin Choi. 2019. The curious case of neural text de- generation. In International Conference on Learn- Elayne Ruane, Abeba Birhane, and Anthony Ven- ing Representations. tresque. 2019. Conversational ai: Social and ethical considerations. In AICS, pages 104–115. Chandra Khatri, Behnam Hedayatnia, Rahul Goel, Anushree Venkatesh, Raefer Gabriel, and Arindam Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Ju- Mandal. 2018. Detecting offensive content in rafsky, Noah A. Smith, and Yejin Choi. 2020. So- open-domain conversations using two stage semi- cial bias frames: Reasoning about social and power supervision. arXiv preprint arXiv:1811.12900. implications of language. In Proceedings of the
58th Annual Meeting of the Association for Compu- John Woods. 2007. Lightening up on the ad hominem. tational Linguistics, pages 5477–5490, Online. As- Informal Logic, 27(1):109–134. sociation for Computational Linguistics. Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Ex machina: Personal attacks seen at scale. In Pro- Nanyun Peng. 2019. The woman worked as a ceedings of the 26th International Conference on babysitter: On biases in language generation. In Pro- World Wide Web, pages 1391–1399. ceedings of the 2019 Conference on Empirical Meth- ods in Natural Language Processing and the 9th In- Audrey Yap. 2013. Ad hominem fallacies, bias, and ternational Joint Conference on Natural Language testimony. Argumentation, 27(2):97–109. Processing (EMNLP-IJCNLP), pages 3398–3403. Dawei Yin, Zhenzhen Xue, Liangjie Hong, Brian D Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Davison, April Kontostathis, and Lynne Edwards. and Nanyun Peng. 2020. Towards controllable bi- 2009. Detection of harassment on web 2.0. ases in language generation. In the 2020 Conference on Empirical Methods in Natural Language Process- Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, ing (EMNLP)-Findings, long. Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B Dolan. 2020a. Dialogpt: Large- Sara Sood, Judd Antin, and Elizabeth Churchill. 2012a. scale generative pre-training for conversational re- Profanity use in online communities. In Proceed- sponse generation. In Proceedings of the 58th An- ings of the SIGCHI Conference on Human Factors in nual Meeting of the Association for Computational Computing Systems, CHI ’12, page 1481–1490, New Linguistics: System Demonstrations, pages 270– York, NY, USA. Association for Computing Machin- 278. ery. Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Sara Owsley Sood, Elizabeth F Churchill, and Judd Gan, Chris Brockett, and Bill Dolan. 2020b. Antin. 2012b. Automatic identification of personal POINTER: Constrained progressive text generation insults on social news sites. Journal of the Ameri- via insertion-based generative pre-training. In Pro- can Society for Information Science and Technology, ceedings of the 2020 Conference on Empirical Meth- 63(2):270–285. ods in Natural Language Processing (EMNLP), pages 8649–8670, Online. Association for Compu- Raymond Hendy Susanto, Shamil Chollampatt, and tational Linguistics. Liling Tan. 2020. Lexically constrained neural ma- chine translation with Levenshtein transformer. In Proceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 3536– 3543, Online. Association for Computational Lin- guistics. Ben Swanson, Elif Yamangil, and Eugene Charniak. 2014. Natural language generation with vocabulary constraints. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, pages 124–133. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial trig- gers for attacking and analyzing nlp. In Proceed- ings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Inter- national Joint Conference on Natural Language Pro- cessing (EMNLP-IJCNLP), pages 2153–2162. Douglas Walton. 1998. Ad hominem arguments. Uni- versity of Alabama Press. William Warner and Julia Hirschberg. 2012. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Me- dia, LSM ’12, page 19–26, USA. Association for Computational Linguistics. Stephen de Wijze. 2003. Complexity, relevance and character: Problems with teaching the ad hominem fallacy. Educational Philosophy and The- ory, 35(1):31–56.
A Appendices hominem responses as targets to generate or avoid, we can find a trigger phrase that forces the model to A.1 You-responses generate fewer ad hominems. Specifically, we fol- You-responses are responses containing any of the low previous work and implement the minimization following phrases: you are, you were, you should, and maximization of these different targeted associ- you would, you will, you have, you can, you could, ations through subtraction (e.g., loss from generat- you don’t, you didn’t, you can’t, you’re, you’d, ing non-ad hominems minus loss from generating you’ll, you’ve, ur, ya’ll, yall, your, yours, yourself, ad hominems), and then minimize this difference to are you, were you, should you, would you, will you, find the trigger. The trigger we find is “Franç cast- have you, can you, could you. These phrases are erbecca Unsure filler willpower”, and we append used to identify potential ad hominems for more this trigger to a post to generate responses from targeted annotation (Round 1). DialoGPT. For example, the input prompt “Franç casterbecca Unsure filler willpower WE have the A.2 Model Details power to stop this. Go #vegan.” results in the gen- We run all our models on an RTX 2080Ti GPU. erated response “We must!”. We use the default Training the ad hominem classifiers takes a few parameters as reported by Sheng et al. (2020). For minutes, and fine-tuning DialoGPT on different more details, see the prior works. With an RTX topics (ranging from 3K to 4K samples as shown 2080Ti GPU, the trigger search algorithm takes 1-2 in Table 2) takes a few hours. hours. Ad Hominem Classifier For the BERT-based ad hominem classifier, we fine-tune from the un- A.4 PPLM Details cased version of the BERT base model (12 layers) with mostly default parameters. For the DialoGPT- The Plug and Play Language Model uses gradients based classifier, we fine-tune from the medium- from an attribute classifier to control generation sized DialoGPT model also with mostly default from a pre-trained language model. In the origi- parameters. In terms of non-default hyperparame- nal work, Dathathri et al. (2020) use PPLM in the ters, we try learning rates of 5 × 10−5 , 1 × 10−5 , contexts of topic, sentiment, and toxicity control. 5 × 10−6 , and 1 × 10−6 , and find that 5 × 10−5 Although ad hominems are also a form of toxic works the best for BERT and 5 × 10−6 works the language, we train a new attribute classifier specifi- best for DialoGPT. We train for 12 epochs and cally on the annotated A D H OM I N T WEETS dataset save the checkpoint for the epoch that the model for a more competitive PPLM baseline. We use the performs the best on the dev set. All input that ad hominem classifier training set and dev set to goes into the classifier is preprocessed to replace form the training and validation sets for this clas- usernames, urls, and hashtags with placeholders. sifier, respectively. Note that this classifier is nec- DialoGPT For all our DialoGPT experiments, essarily different from the BERT-based model we we use the medium DialoGPT with 355M pa- use for the main ad hominem analysis—to use the rameters and mostly default parameters. During gradients from the attribute classifier to steer gener- fine-tuning, we try learning rates of 5 × 10−5 , ations from DialoGPT, we follow the attribute clas- 1 × 10−5 , 5 × 10−6 , and 1 × 10−6 , and that a sifier training procedure of Dathathri et al. (2020). learning rate of 5 × 10−6 for 5 epochs performs Specifically, this classifier takes the hidden states the best on the dev sets. The format the training with dimension (batch size, sequence length, em- and eval data is “POST [ EOS ] RESPONSE [ EOS ]”. bedding size) from the last layer of DialoGPT, av- erages the hidden states over the sequence length, A.3 Trigger Details and uses these averaged hidden states as input for a Following the trigger search algorithm of Wallace simple linear classifier. The classifier has an input et al. (2019) and bias control framework of Sheng text format of “POST [ EOS ] RESPONSE [ EOS ]” to et al. (2020), we start with the trigger phrase “the predict the binary ad hominem label and has an the the the the the”, and iteratively replace each average validation accuracy of 76%. token in the trigger such that we minimize the loss With this trained attribute classifier, we then of generating non-ad hominem responses and max- follow the gradient-based hidden state updates imize the loss of generating ad hominem responses. described by Dathathri et al. (2020) to gener- By using the annotated non-ad hominem and ad ate responses given posts. For our hyperpa-
rameter tuning, we try different step sizes = hominem is if Person B says "I think you meant in- [0.01, 0.02, 0.03, 0.04, 0.05] and and KL loss coef- ductive reasoning.", because (whether intentionally ficients = [0.01, 0.02, 0.03], where increased step or not) this response targets Person A’s perceived sizes intensify control and increased KL loss coef- mistake instead of purely addressing the content of ficients intensify the similarity of the outputs for Person A’s post. Types of ad hominems (towards the modified and unmodified distributions. For our Person A): reported results, we use PPLM with a step size • Stupidity (i.e., targeting Person A’s capability of 0.01, a KL loss coefficient of 0.02, 6 epochs, for intelligence): and otherwise default parameters of the original – Person B:"You dumb f***" work. In general, this technique is slower because – Person B:"Reading comprehension is it requires many iterations per token to accumulate your friend" perturbations. – Person B:“You have no capability to un- A.5 Top-k Sampling Details derstand why” – Person B:“Nobody with enough brains At each time step of top-k sampling, the to operate a computer could possibly be- top-kP tokens V (k) ⊂ V that maximize lieve something this stupid” p0 = x∈V (k) P(x|x1:i−1 ) are selected as – Person B:“Ever have discussions with candidate tokens to generate. V is the model’s narcissistic idiots on the internet? They token vocabulary, x is a token, and x1:i−1 are are so tiring” the tokens from all the previous time steps. – Person B:“Your second paragraph is The distribution p0 is then re-scaled such that fairly idiotic” for all x ∈ V (k) , the rescaled distribution is P 0 (x|x1:i−1 ) = P(x|x1:i−1 )/p0 . This new distri- • Ignorance (i.e., targeting Person A not using bution P 0 is then used to sample a new token for their capability for intelligence, making a mis- the current time step. take, forgetting to include something, confus- ing different things): A.6 S ALIEN S IM T OP -k Details – Person B:“Please don’t waste people’s For this constrained decoding technique, we also time pretending to know what you’re use an RTX 2080 Ti GPU and, similar to the non- talking about” constrained DialoGPT, it takes less than a second – Person B:“Do you even know what to generate output for a sample. you’re saying” – Person B:“You’re making the claims, it’s A.7 Ad Hominem Annotation your job to prove it. Don’t you know Task Annotators are paid $0.05 to label the ad how debating works?” hominems in a sample and are from the U.S. or – Person B:“Willful ignorance is not some- Canada. We filter by annotators from these loca- thing I can combat” tions to better control for similar societal values – Person B:“Did you even read this?” in English-speaking communities, but it would be – Person B:“You didn’t use quotes cor- interesting to see how the concept of ad hominems rectly” change across communities with more different val- – Person B:“You forgot an apostrophe” ues and languages. Each sample takes an average – (Person A: “We used deductive reason- of 15 to 20 seconds to label, for an hourly average ing to prove that the moon revolves of $10.29 USD. We show annotators the guidelines around the earth.”) Person B: “I think below. you meant inductive reasoning.” Guidelines Ad hominems are a type of logical • Trolling/Lying (i.e., targeting Person A inten- fallacy in which a response attacks a person and tionally misrepresenting the truth): some feature of the person’s character instead of – Person B:“You’re just a dishonest troll” the position the person is maintaining. For exam- – Person B:“You’re using troll tactics” ple, if Person A says "We used deductive reasoning – Person B:“Possible lie any harder?” to prove that the moon revolves around the earth." – Person B:“You are just a liar” and Person B replies "No, you’re dumb", Person • Bias (i.e., accusing Person A of racism, sex- B’s response is an ad hominem. A more subtle ad ism, ableism, or other societal biases):
You can also read