Crossing the Line: Where do Demographic Variables Fit into Humor Detection?
J. A. Meaney
School of Informatics, University of Edinburgh
Edinburgh, UK
jameaney@ed.ac.uk

Abstract

Recent shared tasks in humor classification have struggled with two issues: scope and subjectivity. Regarding scope, many task datasets either comprise a highly constrained genre of humor which does not broadly represent the genre, or the data collection is so indiscriminate that the inter-annotator agreement on its comic content is drastically low. In terms of subjectivity, these tasks typically average over all annotators' judgments, in spite of the fact that humor is highly subjective and varies both between and within cultures. We propose a dataset which maintains a broad scope but which addresses subjectivity. We will collect demographic information about the data's humor annotators in order to bin ratings more sensibly. We also suggest the addition of an 'offensive' label to reflect the fact that a text may be humorous to one group, but offensive to another. This would allow for more meaningful shared tasks and could lead to better performance on downstream applications, such as content moderation.

1 Introduction

Interest in computational humor (CH) is flourishing, and since 2017, the proliferation of shared humor detection tasks in NLP has attracted new researchers to the field. However, leading researchers in CH have bemoaned the fact that NLP's contribution is not always informed by the long and interdisciplinary history of humor research (Taylor and Attardo, 2016; Davies, 2008). This may result in the creation of humor detection systems which produce excellent evaluation results, but which may not scale to other humor datasets, improve downstream tasks like content moderation, or contribute to our understanding of humor.

A central issue is the conception of humor classification tasks as humor-or-not, similar to image classification's view of an image as dog-or-not. However, while one can be an expert in whether or not an image depicts a dog, and this is stable within and between cultures, humor is more nuanced than that. Unlike image classification:

• Humor differs between cultures. Even within the same language, different nationalities perceive jokes differently. This is particularly relevant to stereotyped humor, which may be perceived as funny to one culture, but offensive to another (Rosenthal and Bindman, 2015).

• Humor differs within cultures. Age, gender and socio-economic status are known to impact what is perceived as humorous (Kuipers, 2017).

• Humor differs within the same person. Mood is thought to impact what is considered to be humorous or not (Wagner and Ruch, 2020).

Currently in NLP shared tasks, there is scant admission of these issues. Humor is treated as a stable target, and humorous texts are subjected to binary classification and humor score prediction, with little recognition that gold standard labels for these constructs simply do not exist.

1.1 Proposal

To the extent that humor is multi-faceted, and subject to multiple interpretations, incremental improvements to shared tasks can be made by:

• Acknowledging that texts may not be perceived as humorous by all readers, and allowing for a different interpretation, e.g. offensive.

• Collecting demographic information about the annotators of humor datasets to learn more about which sectors of society find a text humorous versus offensive.
1.2 Why Offensive as an Alternative Label?

Cultural shifts in many parts of the world have seen a decline in racist and sexist jokes, and the growth of humor that acknowledges marginalized people. Lockyer and Pickering (2005) argue that this is not just a recent phenomenon, but that all pluralist societies navigate the space between humor and offensiveness, between 'free speech and cultural respect'.

Despite the shift away from using racist or sexist comments as humor, offensive language is still plentiful on the internet (Davidson et al., 2017; Nobata et al., 2016). This can reinforce racial stereotypes, or have a damaging impact on communities. In light of the fact that many shared tasks source their data online, either by scraping Twitter, Reddit, or crowdsourcing, we believe it is worth capturing the impact of these texts on users.

1.3 Why Demographic Factors?

Studies as far back as 1937 demonstrate gender and age differences in the appreciation of jokes, where young men gave higher ratings to 'shady' (e.g. sexual) jokes than their female and older counterparts did (Omwake, 1937).

More recently, in the Netherlands, Kuipers (2017) found significant differences in humor preferences along the lines of gender, age, and in particular, social class or education level. An interesting finding was that the older generation rated their younger counterparts' humor as offensive. This contradicts the popular opinion that the millennial generation is perpetually offended (Fisher, 2019).

In terms of gender-specific offensive humor, a US study found that males tended to give higher ratings to female-hostile jokes, and females did the same with male-hostile jokes. Both genders found female-hostile jokes more offensive overall (Abel and Flick, 2012).

The body of work from CH on demographic differences in humor perception is absent in current work, but can be incorporated into shared tasks with some simple adjustments.

2 Previous Work

SemEval 2017 posed two humor detection tasks. Task 7 (Miller et al., 2017) covered puns, which we do not include here as the identification/interpretation of puns is less ambiguous than other forms of humor, except in the case that the audience does not possess the tacit linguistic knowledge required to understand them (Aarons, 2017).

2.1 Limited Scope

Task 6, Hashtag Wars (Potash et al., 2017), sourced its name and data from a segment in the Comedy Central show @Midnight with Chris Hardwick, which solicited humorous responses to a given hashtag from its viewers, submitted on Twitter. These submissions were effectively annotated twice: the producers selected ten tweets as most humorous, and most appropriate for the show's type of humor. The show's audience then voted on their number one submission. Task 1 was to pair the tweets, and for each pair, predict which one had achieved a higher ranking, according to the audience. Task 2 was to predict the labels given by this stratified annotation: submitted but not top-10, top-10, number one in top-10.

The task's organisers highlighted the data's limited scope, and were keen to point out that this task does not aim to build an all-purpose, cross-cultural humor classifier, but rather to characterise the humor from one source - the show @Midnight. The task's dual annotation and ecologically valid setup make it arguably one of the most effective humor challenges in recent years. However, it remains to be seen how well a system built on this data would generalize to another humor detection task.

SemEval 2020 featured another humor challenge with two subtasks: predicting the mean funniness rating of each humorous text, and given two humorous texts, predicting which was rated as funnier (Hossain et al., 2019). Instead of collecting previously existing humorous texts, the organisers generated them by scraping news headlines from Reddit, and then paying crowdworkers to edit the headlines to make them funny, and annotators to rate the funniness of the new headlines.

Edits were defined as 'the insertion of a single-word noun or verb to replace an existing entity or single-word noun or verb'. The annotators rated the headline as funny from 0-4. An abusive/spam option was included, but presumably to discard ineffective edits, rather than highlight a text which would cause offense. Nonetheless, inter-annotator agreement between raters was moderately high (Krippendorff's α = 0.64).
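For concreteness, an agreement figure like the α = 0.64 above can be reproduced with off-the-shelf tools. The following is a minimal sketch using NLTK's AnnotationTask on invented toy ratings; the annotator and item identifiers are placeholders, and this is our illustration rather than the task organisers' code.

```python
# Toy illustration of Krippendorff's alpha over ordinal funniness ratings.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import interval_distance

# (annotator_id, item_id, rating) triples; ratings on a 0-4 funniness scale.
ratings = [
    ("ann_1", "headline_1", 3), ("ann_2", "headline_1", 3), ("ann_3", "headline_1", 2),
    ("ann_1", "headline_2", 0), ("ann_2", "headline_2", 1), ("ann_3", "headline_2", 0),
    ("ann_1", "headline_3", 4), ("ann_2", "headline_3", 2), ("ann_3", "headline_3", 3),
]

# interval_distance treats ratings as ordered values rather than unrelated
# categories, which suits a numeric funniness scale.
task = AnnotationTask(data=ratings, distance=interval_distance)
print(f"Krippendorff's alpha: {task.alpha():.3f}")
```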
Of interest to CH research is that the authors' analysis of the generated humor finds support for established humor theories, such as incongruity, superiority, and the centrality of setup and punchline. However, the editing rules enforced such tight linguistic constraints that many common features of language were not permitted, e.g. the use of named entities with two words, phrasal verbs, even apostrophes. This scales down the humor that can be generated, not in terms of genre, as was the case with the 2017 SemEval task, but rather in terms of arbitrary linguistic constraints.

Finally, we must consider that, given that the humorous texts were presented alongside the original headline, it's possible that affirmative humor ratings do not mean that the text is humorous in and of itself, only that it is funnier than the contemporary news — arguably a low bar in the current climate.

2.2 Unlimited Scope

The HAHA challenge (Humor Analysis based on Human Annotation) has run in 2018 (Castro et al., 2018) and 2019 (Chiruzzo et al., 2019) with two subtasks: binary classification of humor, and prediction of the average humor score assigned to each text.

The data were collected from fifty Spanish-speaking Twitter accounts which typically post humorous content, representing a range of different dialects of Spanish. These were then uploaded to an online platform, which was open to the public, who were asked the following questions to annotate the data:

1. Does this tweet intend to be humorous? (Yes, or No)

2. [If yes] How humorous do you find it, from 1 to 5?

A strength of this annotation process is that the first question allows the user to objectively identify the genre of the text by identifying its intention, before giving their subjective opinion of it. However, the inter-annotator agreement for the second question was extremely low (Krippendorff's α of 0.1625). It's possible that sourcing the texts from fifty different accounts introduced too many genres to gain a consensus about what was funny amongst annotators. Similarly, the organizers targeted as many different Spanish dialects as possible in their data collection, which could lead to cultural and linguistic differences in humor appreciation. Finally, the annotations were sourced on an open platform, with only three test tweets to assess whether an annotator provided usable ratings or not. There were no questions as to whether the user was a Spanish speaker, and as the task was unpaid, there may have been little incentive to do it accurately.

3 Methodology

The datasets featured in both SemEval tasks had tight constraints on the genre of humor involved. This led to high inter-annotator reliability, but may not generalize well to other forms of humor. The Spanish tasks featured no such constraints; however, there was extremely low inter-annotator agreement, suggesting that the dataset is noisy, and that a system which is built on this may also fail to generalize.

This proposal aims to include a wide range of genres, and to increase the reliability of the annotations by collecting information on well-known latent variables in humor appreciation — the demographic characteristics of the humor audience/annotators. This will allow for more nuanced tasks, as an alternative to simple humor-or-not definitions.

3.1 Data Collection

We plan to follow a similar data collection protocol to Castro et al. (2018) and collect tweets from a wide variety of humorous Twitter accounts. However, unlike Castro et al., we plan to limit the dialect of the jokes collected to US English, and use a crowdsourcing platform which allows us to select annotators who use this dialect. This will help us to avoid introducing confounds such as lack of cultural knowledge, or divergent language usage. Furthermore, we will hand-select the Twitter accounts which typically post humorous content, in order to ensure that the data features a wide variety of genres of humor, e.g. observational humor, wordplay, humorous vignettes, etc.
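As an illustration only, the collection step might look roughly like the sketch below, which assumes Tweepy and the pre-v2 Twitter REST API; the credentials, account handles and filtering heuristics are placeholders, not part of the proposal itself.

```python
# Hypothetical sketch: pull recent tweets from hand-selected humorous accounts.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Hand-selected accounts that typically post humorous content (invented names).
humor_accounts = ["example_jokes", "example_oneliners"]

collected = []
for handle in humor_accounts:
    for tweet in api.user_timeline(screen_name=handle, count=200,
                                   tweet_mode="extended", include_rts=False):
        text = tweet.full_text
        # Drop link-heavy tweets and replies, which rarely stand alone as jokes.
        if "http" not in text and not text.startswith("@"):
            collected.append({"account": handle, "text": text})

print(f"collected {len(collected)} candidate texts")
```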
3.2 Annotation

As mentioned above, averaging over the opinions of the audience, similar to approaches in image detection, is not ecologically valid for humor detection. For this reason, we plan to collect demographic information about the annotators, in order to bin the ratings into groups that may perceive humor in a similar way. In this way, we hope to increase inter-annotator reliability. We also plan to include a second label for each text — offensive. Following Castro et al., annotators will be asked the following questions for each text:

1. Is the intention of this text to be humorous?

2. [If so] How humorous do you perceive this text to be?

3. Is this text offensive?

4. [If so] How offensive do you perceive this text to be?

The annotator guidelines will reflect that offensiveness can encompass an insult to the audience itself, or to others who are likely to find the text distasteful.

All annotators will be paid for their work, to incentivize quality ratings. They will be selected to undertake the task by virtue of fitting into the following demographic bins:

• Age: 18-25, 26-40, 41-55, 56-70. The bins are broadly designed to capture Generation Z, Millennials, Generation X and Baby Boomers respectively (Dimock, 2019).

• Gender: Male, Female, Non-binary

• Level of Education: High School, Undergraduate, Postgraduate. This will be used as an index of socioeconomic status (Mirowsky and Ross, 2003).

Subsequent to data annotation, we will select the demographic factor that gives the highest inter-rater reliability for this dataset. Annotations will be averaged by bin, rather than averaging over all of a text's ratings, as was the case in previous shared tasks.

3.3 Pilot Study

To examine the integrity of our assumptions, we ran a short pilot task in which we used the Prolific Academic platform to crowdsource annotations from users in the youngest and oldest age groups.

We searched for texts which related to race/origin, religion, gender, sexuality and body type. We used keywords from Fortuna's (2017) subcategories of offensive speech to source texts which could be offensive jokes, such as 'black', 'woman', 'girlfriend', 'blind', 'gay', 'Muslim', 'Jew', etc. From a readily available dataset (the Short Jokes dataset from Kaggle), we sourced 40 jokes: 20 in which the keyword also referred to the butt of the joke (average number of tokens per text = 18.4), and 20 in which it did not (average number of tokens = 19.1). Twenty neutral texts were selected from Twitter, ensuring that the semantic meaning of the keyword stayed the same, e.g. 'black' referred to race, and not to Black Friday, and that the texts were not intended to be humorous. The average number of tokens per text in this group was 20.2.

• Keyword is not target of joke: 'What is the Terminators Muslim name? Al Bi Baq'

• Keyword is target of joke: 'Mattel released a Muslim Barbie... It's a blow-up doll.'

• Random tweet with keyword: 'The Mosque will close this weekend due to the pandemic'.

We asked 2 groups of annotators, aged 18-25 (n=10) or aged 55-70 (n=10), to imagine they were social media moderators. Their task was to identify the genre of the texts and label them as 'humorous', 'offensive', 'humorous and offensive' or 'other'. We highlighted that they did not need to find the text humorous, or personally offensive, to label them as such. If they identified the intent as humorous, or the text as possibly offensive to others, they should use the corresponding label. We omitted the numerical rating task for reasons of brevity.

In terms of results, the clearest trends emerged when the groups were split by age. Both age groups of users made use of the 'humorous and offensive' label, suggesting that annotators could identify the genre of the text as humorous, but found it in bad taste. However, there was a trend for the younger group using this label more frequently than the older group.

Examining where differences in annotation occurred, Table 1 demonstrates the disparity in labelling on the following gender-related text:

    We should really use the blackjack scale to rate women. For example: "Every girl here is ugly" "Well, what about her?" "Eh, she's like a 15 or 16. Not sure if I'd hit it"

Table 1: Variation in labelling between age groups

Age     Humorous   Offensive   Humorous & Offensive   Other
18-25   3          3           3                       1
56-70   2          7           0                       1

As we did not have balanced groups based on level of education, or a critical mass of non-binary users, we omit analysis for these. Similarly, regarding gender differences, there were no clear trends in terms of labelling between females and males, and there were no statistically significant differences between groups.

The results of our pilot study suggest that pursuing demographic differentiation in humor annotation/classification is worthwhile. Specifically, we can see that age group may be relevant as the demographic factor which most distinguishes annotators' response to humor.
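The bin-selection and per-bin averaging step described in Section 3.2 could be sketched as follows, on invented toy annotations. Using mean within-bin rating variance as a stand-in for a per-bin reliability coefficient is our simplification for illustration, not the proposal's stated metric.

```python
# Toy sketch: pick the demographic factor whose bins agree most, then
# average the gold labels per bin of that factor.
import pandas as pd

annotations = pd.DataFrame({
    "text_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "rating":  [4, 3, 0, 1, 2, 2, 3, 4],                 # toy humor ratings
    "age_bin": ["18-25", "18-25", "56-70", "56-70"] * 2,
    "gender":  ["F", "M", "F", "M"] * 2,
})

def within_bin_disagreement(df: pd.DataFrame, factor: str) -> float:
    # Mean variance of ratings inside each (text, bin) cell; lower = more agreement.
    return df.groupby(["text_id", factor])["rating"].var(ddof=0).mean()

factors = ["age_bin", "gender"]
best = min(factors, key=lambda f: within_bin_disagreement(annotations, f))
print("factor with highest within-bin agreement:", best)

# Gold labels are then averaged per bin of the selected factor,
# rather than over all annotators.
per_bin_scores = annotations.groupby(["text_id", best])["rating"].mean()
print(per_bin_scores)
```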
3.4 Tasks

We will ask systems to predict, given a group with a specific set of user demographics:

• Is this text humorous to the group, and if so, how humorous?

• Is this text offensive to the group, and if so, how offensive?

Our data will comprise texts which are either humorous and not offensive, humorous and offensive, not humorous and offensive, or not humorous and not offensive.

In the case that there are no clear distinctions between the groups in terms of labels and ratings, we will average over these annotations, as typical tasks have done, and proceed with classification and regression, as above.

The evaluation metrics for the classification task will be precision, recall and F1. The metric for predicting the humor and offensiveness scores will be root mean squared error.
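A minimal sketch of this evaluation, on invented predictions: precision, recall and F1 for the per-group binary labels, and root mean squared error for the predicted scores. The arrays below are toy data, not results.

```python
# Toy evaluation: classification metrics plus RMSE for score prediction.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, mean_squared_error

# Binary "humorous to this demographic group?" labels.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

# Per-group mean humor (or offensiveness) ratings vs. predicted ratings.
scores_true = np.array([3.2, 0.5, 2.8, 4.0])
scores_pred = np.array([2.9, 1.0, 3.1, 3.5])
rmse = np.sqrt(mean_squared_error(scores_true, scores_pred))
print(f"RMSE={rmse:.3f}")
```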
4 Contribution to Computational Humor

In line with CH research, we affirm that humor is a moving target in terms of differing interpretations between demographic groups and across the lifetime. Our dataset will be the first to model the reception of a wide variety of humor genres from Twitter, presented to users of different demographics. It will also be, to the best of our knowledge, the first CH dataset to take into account the ratings of non-binary annotators.

In line with Hossain (2019), we aim to use clustering methods on the humor and/or offensive texts to determine themes that evoke these classes for different groups. We also aim to explore whether theories of humor, such as surprisal, superiority and incongruity, are equally appreciated among different groups.

5 Conclusion

Humor detection and rating is a multi-faceted problem. We hope that the inclusion of demographic information will shift the state of the art away from objective classification, towards a more subjective approach. Future qualitative work could also suggest further variables whose inclusion would enhance our knowledge of humor perception. This could set a new standard for shared tasks which aim to model humor in future, and could outline a methodology that can be replicated with other cultures and languages.

6 Acknowledgements

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.

References

Debra Aarons. 2017. Puns and tacit linguistic knowledge. In The Routledge Handbook of Language and Humor, pages 80–94. Routledge.

Millicent H. Abel and Jason Flick. 2012. Mediation and moderation in ratings of hostile jokes by men and women.

Santiago Castro, Luis Chiruzzo, Aiala Rosá, Diego Garat, and Guillermo Moncecchi. 2018. A crowd-annotated Spanish corpus for humor analysis. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 7–11.

Luis Chiruzzo, S. Castro, Mathias Etcheverry, Diego Garat, Juan José Prada, and Aiala Rosá. 2019. Overview of HAHA at IberLEF 2019: Humor analysis based on human annotation. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). CEUR Workshop Proceedings, CEUR-WS, Bilbao, Spain.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.

Christie Davies. 2008. Undertaking the comparative study of humor. The Primer of Humor Research, Berlin: Mouton de Gruyter, pages 157–182.

Michael Dimock. 2019. Defining generations: Where Millennials end and Generation Z begins. Pew Research Center, 17:1–7.
Caitlin Fisher. 2019. The Gaslighting of the Millennial Generation: How to Succeed in a Society that Blames You for Everything Gone Wrong. Mango Media Inc.

Paula Cristina Teixeira Fortuna. 2017. Automatic detection of hate speech in text: An overview of the topic and dataset annotation with hierarchical classes.

Nabil Hossain, John Krumm, and Michael Gamon. 2019. "President vows to cut <taxes> hair": Dataset and analysis of creative text editing for humorous headlines. arXiv preprint arXiv:1906.00274.

Giselinde Kuipers. 2017. Humour styles and class cultures: Highbrow humour and lowbrow humour in the Netherlands. In The Anatomy of Laughter, pages 58–69. Routledge.

Sharon Lockyer and Michael Pickering. 2005. Introduction: The ethics and aesthetics of humour and comedy. In Beyond a Joke, pages 1–24. Springer.

Tristan Miller, Christian Hempelmann, and Iryna Gurevych. 2017. SemEval-2017 Task 7: Detection and interpretation of English puns. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 58–68, Stroudsburg, PA, USA. Association for Computational Linguistics.

John Mirowsky and Catherine E. Ross. 2003. Education, Social Status, and Health. Transaction Publishers.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153.

Louise Omwake. 1937. A study of sense of humor: Its relation to sex, age, and personal characteristics. Journal of Applied Psychology, 21(6):688.

Peter Potash, Alexey Romanov, and Anna Rumshisky. 2017. SemEval-2017 Task 6: #HashtagWars: Learning a sense of humor. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 49–57.

Angela Rosenthal and David Bindman. 2015. No Laughing Matter: Visual Humor in Ideas of Race, Nationality, and Ethnicity. Dartmouth College Press.

Julia M. Taylor and S. Attardo. 2016. Computational treatments of humor. In Routledge Handbook of Language and Humor.

Lisa Wagner and Willibald Ruch. 2020. Trait cheerfulness, seriousness, and bad mood outperform personality traits of the five-factor model in explaining variance in humor behaviors and well-being among adolescents. Current Psychology, pages 1–12.