Annotating Online Misogyny

Philine Zeinert, Nanna Inie, Leon Derczynski
IT University of Copenhagen, Denmark
phze@itu.dk, nans@itu.dk, leod@itu.dk

Abstract

Online misogyny, a category of online abusive language, has serious and harmful social consequences. Automatic detection of misogynistic language online, while imperative, poses complicated challenges to data gathering, data annotation, and bias mitigation, as this type of data is linguistically complex and diverse. This paper makes three contributions in this area: Firstly, we describe the detailed design of our iterative annotation process and codebook. Secondly, we present a comprehensive taxonomy of labels for annotating misogyny in natural written language, and finally, we introduce a high-quality dataset of annotated posts sampled from social media posts.

1    Introduction

Abusive language is a phenomenon with serious consequences for its victims, and misogyny is no exception. According to a 2017 report from Amnesty International, 23% of women from eight different countries have experienced online abuse or harassment at least once, and 41% of these said that on at least one occasion, these online experiences made them feel that their physical safety was threatened (Amnesty International, 2017).

Automatic detection of abusive language can help identify and report harmful accounts and acts, and allows counter narratives (Chung et al., 2019; Garland et al., 2020; Ziems et al., 2020). Due to the volume of online text and the mental impact on humans who are employed to moderate online abusive language - moderators of abusive online content have been shown to develop serious PTSD and depressive symptoms (Casey Newton, 2020) - it is urgent to develop systems to automate the detection and moderation of online abusive language. Automatic detection, however, presents significant challenges (Vidgen et al., 2019).

Abusive language is linguistically diverse (Vidgen and Derczynski, 2020): explicitly, in the form of swear words or profanities; implicitly, in the form of sarcasm or humor (Waseem et al., 2017); and subtly, in the form of attitudes and opinions. Recognizing distinctions between variants of misogyny is challenging for humans, let alone computers. Systems for automatic detection are usually created using labeled training data (Kiritchenko et al., 2020); hence, their performance depends on the quality and representativity of the available datasets and their labels. We currently lack transparent methods for how to create diverse datasets. When abusive language is annotated, classes are often created based on each unique dataset (a purely inductive approach), rather than taking advantage of general, established terminology from, for instance, social science or psychology (a deductive approach, building on existing research). This makes classification scores difficult to compare and apply across diverse training datasets.

This paper investigates the research question: How might we design a comprehensive annotation process which results in high quality data for automatically detecting misogyny? We make three novel contributions: 1. Methodology: We describe our iterative approach to the annotation process in a transparent way which allows for a higher degree of comparability with similar research. 2. Model: We present a taxonomy and annotation codebook grounded in previous research on automatic detection of misogyny as well as social science terminology. 3. Dataset: We present a new, annotated corpus of Danish social media posts, Bajer,[1] annotated for misogyny, including analysis of class balance, word frequencies, Inter-Annotator Agreement (IAA), annotation errors, and classification baseline.

[1] https://github.com/phze22/Online-Misogyny-in-Danish-Bajer

Since research has indicated that misogyny presents differently across languages, and, likely, cultures (Anzovino et al., 2018), an additional contribution of this work is that it presents a dataset of misogyny in Danish, a North Germanic language, spoken by only six million people, and indeed the first work of its kind in any Scandinavian/Nordic culture to our knowledge. In Denmark an increasing proportion of people refrain from online discourse due to the harsh tone, with 68% of social media users self-excluding in 2021 (Analyse & Tal, 2021; Andersen and Langberg, 2021), making this study contextually relevant. Further, the lack of language resources available for Danish (Kirkedal et al., 2019) coupled with its lexical complexity (Bleses et al., 2008) make it an intricate research objective for natural language processing.

2    Background and related work

Abusive language is as ancient a phenomenon as written language itself. Written profanities and insults about others are found as old as graffiti on ruins from the Roman empire (Wallace, 2005). Automatic processing of abusive text is far more recent, early work including e.g. Davidson et al. (2017) and Waseem et al. (2017). Research in this field has produced both data, taxonomies, and methods for detecting and defining abuse, but there exists no objective framing for what constitutes abuse and what does not. In this work, we focus on a specific category of online abuse, namely misogyny.

2.1    Online misogyny and existing datasets

Misogyny can be categorised as a subbranch of hate speech and is described as hateful content targeting women (Waseem, 2016). The degree of toxicity depends on complicated subjective measures, for instance, the receiver's perception of the dialect of the speaker (Sap et al., 2019).

Annotating misogyny typically requires more than a binary present/absent label. Chiril et al. (2020), for instance, use three categories to classify misogyny in French: direct sexist content (directly addressed to a woman or a group of women), descriptive sexist content (describing a woman or women in general) or reporting sexist content (a report of a sexism experience or a denunciation of a sexist behaviour). This categorization does not, however, specify the type of misogyny.

Jha and Mamidi (2017) distinguish between harsh and benevolent sexism, building on the data from the work of Waseem and Hovy (2016). While harsh sexism (hateful or negative views of women) is the more recognized type of sexism, benevolent sexism ("a subjectively positive view towards men or women"), often exemplified as a compliment using a positive stereotypical picture, is still discriminating (Glick and Fiske, 1996). Other categorisations of harassment towards women have distinguished between physical, sexual and indirect occurrences (Sharifirad and Jacovi, 2019).

Anzovino et al. (2018) classify misogyny at a finer granularity, using five subcategories: Discredit, Harassment & Threats of Violence, Derailing, Stereotype & Objectification, and Dominance. They also distinguish whether the abuse is active or passive towards the target. These labels appear to apply well to other languages, and the quantitative representation of labels differs by language. For example, Spanish shows a stronger presence of Dominance, Italian of Stereotype & Objectification, and English of Discredit. As we see variance across languages, building terminology for labeling misogyny correctly is a key challenge in being able to detect it automatically. Parikh et al. (2019) take a multi-label approach to categorizing posts from the "Everyday Sexism Project", where as many as 23 different categories are not mutually exclusive. The types of sexism identified in their dataset include body shaming, gaslighting, and mansplaining. While the categories of this work are extremely detailed and socially useful, several studies have demonstrated the challenge for human annotators to use labels that are intuitively unclear (Chatzakou et al., 2017; Vidgen et al., 2019) or closely related to each other (Founta et al., 2018).

Guest et al. (2021) suggest a novel taxonomy for misogyny labeling applied to a corpus of primarily English Reddit posts. Based on previous research, including Anzovino et al. (2018), they present the following four overarching categories of misogyny: (i) Misogynistic Pejoratives, (ii) descriptions of Misogynistic Treatment, (iii) acts of Misogynistic Derogation and (iv) Gendered Personal attacks against women.

The current work combines previous categorizations of misogyny into a taxonomy which is useful for annotation of misogyny in all languages, while being transparent about the construction of this taxonomy. Our work builds on the previous work presented in this section, continuous discussions among the annotators, and the addition of social science terminology to create a single-label taxonomy of misogyny as identified in Danish social media posts across various platforms.

3    Methodology and dataset creation

The creation of quality datasets involves a chain of methodological decisions. In this section, we present the rationale of creating our dataset under three headlines: Dataset, Annotation process, and Mitigating biases.

3.1    Dataset: Online misogyny in social media

Bender and Friedman (2018) present a set of data statements for NLP which help "alleviate issues related to exclusion and bias in language technology, lead[ing] to better precision in claims about how natural language processing research can generalize and thus better engineering results". Data statements are a characterization of a dataset which provides context to others to understand how experimental results might generalize and what biases might be reflected in systems built on the software. We present our data statements for the dataset creation in the following:

Curation rationale: Random sampling of text often results in scarcity of examples of specifically misogynistic content (e.g. Wulczyn et al., 2017; Founta et al., 2018). Therefore, we used the common alternative of collecting data by using predefined keywords with a potentially high search hit (e.g. Waseem and Hovy (2016)), and identifying

Different social media platforms attract different user groups and can exhibit domain-specific language (Karan and Šnajder, 2018). Rather than choosing one platform (existing misogyny datasets are primarily based on Twitter and Reddit (Guest et al., 2021)), we sampled from multiple platforms: Statista (2020) shows that the platform where most Danish users are present is Facebook, followed by Twitter, YouTube, Instagram and lastly, Reddit. The dataset was sampled from Twitter, Facebook and Reddit posts as plain text.

Language variety: Danish, BCP-47: da-DK.

Text characteristics: Danish colloquial web speech. Posts, comments, retweets: max. length 512, average length: 161 characters.

Speaker demographics: Social media users, age/gender/race unknown/mixed.

Speech situation: Interactive, social media discussions.

Annotator demographics: We recruited annotators aiming specifically for diversity in gender, age, occupation/background (linguistic and ethnographic knowledge), region (spoken dialects) as well as an additional facilitator with a background in ethnography to lead initial discussions (see Table 1). Annotators were appointed as full-time employees with full standard benefits.

Gender:    6 female, 2 male (8 total)
Age:       5

& Test stages are replaced by Leveraging of annotations for one's particular goal, in our case the creation of a comprehensive taxonomy.

We created a set of guidelines for the annotators. The annotators were first asked to read the guidelines and individually annotate about 150 different posts, after which there was a shared discussion. After this pilot round, the volume of samples per annotator was increased and every sample labeled by 2-3 annotators. When instances were 'flagged' or annotators disagreed on them, they were discussed during weekly meetings, and misunderstandings were resolved together with the external facilitator. After round three, when reaching 7k annotated posts (Figure 2), we continued with independent annotations maintaining a 15% instance overlap between randomly picked annotator pairs.

Management of annotator disagreement is an important part of the process design. Disagreements can be solved by majority voting (Davidson et al., 2017; Wiegand et al., 2019), by labeling a post as abusive if at least one annotator has labeled it so (Golbeck et al., 2017), or by a third objective instance (Gao and Huang, 2017). Most datasets use crowdsourcing platforms or a few academic experts for annotation (Vidgen and Derczynski, 2020). Inter-annotator agreement (IAA) and classification performance are established as two grounded evaluation measurements for annotation quality (Vidgen and Derczynski, 2020). Comparing the performance of amateur annotators (while providing guidelines) with expert annotators for sexism and racism annotation, Waseem (2016) shows that the quality of amateur annotators is competitive with expert annotations when several amateurs agree. Facing the trade-off between training annotators intensely and the number of involved annotators, we continued with the trained annotators and group discussions/individual revisions for flagged content and disagreements (Section 5.4).
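To make the overlap scheme described above concrete, the sketch below shows one way it could be implemented. This is our own illustration in Python rather than the authors' tooling; the function and annotator names are hypothetical.

```python
import random
from itertools import cycle

def assign_posts(post_ids, annotators, overlap=0.15, seed=0):
    """Assign each post to one annotator, routing an `overlap` share of
    instances to a randomly picked pair of annotators instead."""
    rng = random.Random(seed)
    posts = list(post_ids)
    rng.shuffle(posts)
    n_overlap = int(len(posts) * overlap)
    assignment = {}
    for post in posts[:n_overlap]:       # 15% overlap: two independent annotators
        assignment[post] = rng.sample(annotators, 2)
    single = cycle(annotators)           # remaining posts: one annotator each
    for post in posts[n_overlap:]:
        assignment[post] = [next(single)]
    return assignment

annotators = [f"annotator_{i}" for i in range(1, 9)]   # 8 annotators, as in Table 1
assignment = assign_posts(range(10_000), annotators)
```
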
3.3    Mitigating Biases

Prior work demonstrates that biases in datasets can occur through the training and selection of annotators or selection of posts to annotate (Geva et al., 2019; Wiegand et al., 2019; Sap et al., 2019; Al Kuwatly et al., 2020; Ousidhoum et al., 2020).

Selection biases: Selection biases for abusive language can be seen in the sampling of text, for instance when using keyword search (Wiegand et al., 2019), topic dependency (Ousidhoum et al., 2020), users (Wiegand et al., 2019), domain (Wiegand et al., 2019), time (Florio et al., 2020) and lack of linguistic variety (Vidgen and Derczynski, 2020).

Label biases: Label biases can be caused by, for instance, non-representative annotator selection, lack in training/domain expertise, preconceived notions, or pre-held stereotypes. These biases are treated in relation to abusive language datasets by several sources, e.g. general sampling and annotator biases (Waseem, 2016; Al Kuwatly et al., 2020), biases towards minority identity mentions based for example on gender or race (Davidson et al., 2017; Dixon et al., 2018; Park et al., 2018; Davidson et al., 2019), and political annotator biases (Wich et al., 2020). Other qualitative biases comprise, for instance, demographic bias, over-generalization, and topic exposure as social biases (Hovy and Spruit, 2016).

Systematic measurement of biases in datasets remains an open research problem. Friedman and Nissenbaum (1996) discuss "freedom from biases" as an ideal for good computer systems, and state that methods applied during data creation influence the quality of the resulting dataset with which systems are later trained. Shah et al. (2020) showed that half of biases are caused by the methodology design, and presented a first approach of classifying a broad range of predictive biases under one umbrella in NLP.

We applied several measures to mitigate biases occurring through the annotation design and execution: First, we selected labels grounded in existing, peer-reviewed research from more than one field. Second, we aimed for diversity in annotator profiles in terms of age, gender, dialect, and background. Third, we recruited a facilitator with a background in ethnographic studies and provided intense annotator training. Fourth, we engaged in weekly group discussions, iteratively improving the codebook and integrating edge cases. Fifth, the selection of platforms from which we sampled data is based on local user representation in Denmark, rather than convenience. Sixth, diverse sampling methods for data collection reduced selection biases.

4    A taxonomy and codebook for labeling online misogyny

Good language taxonomies systematically bring together definitions and describe general principles of each definition. The purpose is categorizing and mapping entities in a way that demonstrates their natural relationship, e.g. Schmidt and Wiegand (2017); Anzovino et al. (2018); Zampieri et al. (2019); Banko et al. (2020). Their application is especially clear in shared tasks, such as the multilingual detection of hate speech against women at SemEval 2019 (Basile et al., 2019).

category          | reference              | lang.          | labels
Abusive Language  | Zampieri et al. (2019) | da,en,gr,ar,tu | Offensive (OFF)/Not offensive (NOT); Targeted Insult (TIN)/Untargeted (UNT); Individual (IND)/Group (GRP)/Other (OTH)
Hate speech       | Waseem and Hovy (2016) | en             | Sexism, Racism
Misogyny          | Anzovino et al. (2018) | en,it,es       | Discredit, Stereotype, Objectification, Sexual Harassm., Dominance, Derailing
                  | Jha and Mamidi (2017)  | en             | Benevolent extension

Table 2: Established taxonomies and their use for the misogyny detection task

On the one hand, it should be an aim of a taxonomy that it is easily understandable and applicable for annotators from various backgrounds and with different expertise levels. On the other hand, a taxonomy is only useful if it is also correct and comprehensive, i.e. a good representation of the world. Therefore, we have aimed to integrate definitions from several sources of previous research (deductive approach) as well as categories resulting from discussions of the concrete data (inductive approach).

Our taxonomy for misogyny is the product of (a) existing research in online abusive language and misogyny (specifically the work in Table 2), (b) a review of misogyny in the context of online platforms, and of online platforms in a Danish context, and (c) iterative adjustments during the process including discussions between the authors and annotators.

The labeling scheme (Figure 1) is the main structure for guidelines for the annotators, while a codebook ensured common understanding of the label descriptions. The codebook provided the annotators with definitions from the combined taxonomies. The descriptions were adjusted to distinguish edge-cases during the weekly discussion rounds.

The taxonomy has four levels: (1) Abusive (abusive/not abusive), (2) Target (individual/group/others/untargeted), (3) Group type (racism/misogyny/others), (4) Misogyny type (harassment/discredit/stereotype & objectification/dominance/neosexism/benevolent). To demonstrate the relationship of misogyny to other instances of abusive language, our taxonomy embeds misogyny as a subcategory of abusive language. Misogyny is distinguished from, for instance, personal attacks, which is closer to the abusive language of cyberbullying. For definitions and examples from the dataset to the categories, see Appendix A.1. We build on the taxonomy suggested in Zampieri et al. (2019), which has been applied to datasets in several languages as well as in SemEval (Zampieri et al., 2020). While Parikh et al. (2019) provide a rich collection of sexism categories, multiple, overlapping labels do not fulfill the purpose of being easily understandable and applicable for annotators. The taxonomies in Anzovino et al. (2018) and Jha and Mamidi (2017) have proved their application to English, Italian and Spanish, and offer more general labels. Some labels from previous work were removed from the labeling scheme during the weekly discussions among authors and annotators (for instance derailing), because no instances of them were found in the data.

4.1    Misogyny: Neosexism

During our analysis of misogyny in the Danish context (b), we became aware of the term "neosexism". Neosexism is a concept defined in Tougas et al. (1999), and presents as the belief that women have already achieved equality, and that discrimination of women does not exist. Neosexism is based on covert sexist beliefs, which can "go unnoticed, disappearing into the cultural norms. Those who consider themselves supporters of women's rights may maintain non-traditional gender roles, but also exhibit subtle sexist beliefs" (Martinez et al., 2010). Sexism in Denmark appears to correlate with the modern sexism scale (Skewes et al., 2019; Tougas et al., 1995; Swim et al., 1995; Campbell et al., 1997). Neosexism was added to the taxonomy before annotation began, and as we will see in the analysis section, neosexism was the most common form of misogyny present in our dataset (Figure 1).
[Figure 1: A labeling scheme for online misogyny (blue) embedded within the taxonomy for labeling abusive language occurrences (green). Definitions and examples can be found in the compressed codebook in A.1.]

Here follow some examples of neosexism from our dataset:

• Resenting complaints about discrimination: "I often feel that people have treated me better and spoken nicer to me because I was a girl, so I have a hard time taking it seriously when people think that women are so discriminated against in the Western world."

• Questioning the existence of discrimination: "Can you point to research showing that childbirth is the reason why mothers miss out on promotions?"

• Presenting men as victims: "Classic. If it's a disadvantage for women it's the fault of society. If men, then it must be their own. Sexism thrives on the feminist wing."

Neosexism is an implicit form of misogyny, which is reflected in the annotation challenges summarised in Section 5.5. In prior taxonomies, instances of neosexism would most likely have been assigned to the implicit appearances of misogynistic treatment (ii) (Guest et al., 2021) – or perhaps not classified as misogyny at all. Neosexism is most closely related to the definition "disrespectful actions, suggesting or stating that women should be controlled in some way, especially by men". This definition, however, does not describe the direct denial that misogyny exists. Without a distinct and explicit neosexism category, however, these phenomena may be mixed up or even ignored.

The taxonomy follows the suggestions of Vidgen et al. (2019) for establishing unifying taxonomies in abusive language while integrating context-related occurrences. A similar idea is demonstrated in Mulki and Ghanem (2021), adding damning as an occurrence of misogyny in an Arabic context. While most previous research is done in English, these language-specific findings highlight the need for taxonomies that are flexible to different contexts, i.e. they are good representations of the world. Lastly, from an NLP point of view, languages with fewer resources for training data can profit further from transfer learning with similar labels, as demonstrated in Pamungkas et al. (2020) for misogyny detection.

5    Results and Analysis

5.1    Class Balance

The final dataset contains 27.9K comments, of which 7.5K contain abusive language. Misogynistic posts comprise 7% of overall posts. Neosexism is by far the most frequently represented class with 1.3K tagged posts, while Discredit and Stereotype & objectification are present in 0.3K and 0.2K posts. Benevolent, Dominance, and Harassment are tagged in between only 45 and 70 posts.

5.2    Domain/Sampling representation

Most posts tagged as abusive and/or containing misogyny are retrieved from searches on posts from public media profiles, see Table 3. Facebook and Twitter are equally represented, while Reddit is in the minority. Reddit posts were sampled from an available historical collection.
samp.  | domain   | dis. dom | time     | abs. in k | dis. ⊂ abus | dis. ⊂ mis
topic  | Facebook | 48%      | 07-11/20 | 12.3      | 51%         | 63%
keyw.  | Twitter  | 45%      | 08-12/20 | 7.8       | 32%         | 27%
user   | Twitter  |          |          | 3.6       | 8%          | 6%
keyw.  | Reddit   | 7%       | 02-04/19 | 2.4       | 7%          | 2%
popul. | Facebook |          |          | 1         | 2%          | 2%

Table 3: Distribution of sampling techniques and domains. Sampling techniques: topic = posts from public media sites and comments to these posts; keyw. = keyword/hashtag-search; popul. = most interactions.

5.3    Word Counts

Frequencies of the words 'kvinder' (women) and 'mænd' (men) were the highest, but these words did not represent strong polarities towards abusive and misogynistic content (Table 4). The word 'user' represents de-identified references to discussion participants ("@USER").

dataset         | ⊂ abus          | ⊂ mis
(kvinder, 0.29) | (kvinder, 0.34) | (kvinder, 0.41)
(user, 0.29)    | (user, 0.25)    | (mænd, 0.28)
(metoo, 0.25)   | (mænd, 0.22)    | (user, 0.18)
(mænd, 0.21)    | (bare, 0.17)    | (år, 0.16)
(bare, 0.16)    | (metoo, 0.16)   | (når, 0.15)

Table 4: Top word frequencies; tf-idf scores with prior removal of special characters and stopwords, notation: (token, tf-idf)
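The word statistics in Table 4 can be approximated with a standard tf-idf computation. The sketch below is an assumption about the procedure (using scikit-learn, a toy corpus, and an illustrative stopword list), not the exact script behind the table, and the aggregation over posts is our choice of a mean score.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms(posts, stopwords, k=5):
    """Return the k tokens with the highest mean tf-idf score over `posts`."""
    vectorizer = TfidfVectorizer(stop_words=stopwords)
    tfidf = vectorizer.fit_transform(posts)                  # (n_posts, n_terms)
    mean_scores = np.asarray(tfidf.mean(axis=0)).ravel()
    terms = vectorizer.get_feature_names_out()
    order = mean_scores.argsort()[::-1][:k]
    return [(terms[i], round(float(mean_scores[i]), 2)) for i in order]

# Illustrative Danish stopword list; a full list would be used in practice.
stopwords = ["og", "i", "jeg", "det", "at", "en", "den", "til", "er", "som", "de"]
print(top_terms(["kvinder og mænd er ikke ens", "mænd fylder mest i debatten"], stopwords))
```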

5.4    Inter-Annotator Agreement (IAA)

We measure IAA using the agreement between 3 annotators for each instance until round 3 (7k posts), and then sub-sampled data overlaps between 2 annotators. IAA is calculated through average label agreement at post level – for example, if two annotators label a post [abusive, untargeted] and [abusive, group targeted] respectively, the agreement would be 0.5. Our IAA during iterations of dataset construction ranged between 0.5 and 0.71. In the penultimate annotation round we saw a drop in agreement (Figure 2); this is attributed to a change in underlying text genre, moving to longer Reddit posts.

[Figure 2: Inter-Annotator Agreement. y-axis: agreement by relative overlap of label-sequences per sample; x-axis: annotated data samples in k.]

25% of disagreements about classifications were solved during discussions. Annotators had the opportunity to adjust their disagreed annotation in the first revision individually, which represents the remaining 75% (Table 5). The majority of disagreements were on subtask A, deciding whether the post was abusive or not.

individual corr. | group solv. | discussion round
417              | 169         | 69 (+125 pilot)

Table 5: Solved disagreements/flagged content

The final overall Fleiss' Kappa (Fleiss, 1971) values for the individual subtasks are: abusive/not: 0.58, targeted: 0.54, misogyny/not: 0.54. It is notable here that the dataset is significantly more skewed than prior work which upsampled to 1:1 class balances. Chance-corrected measurements are sensitive to agreement on rare categories and higher agreement is needed to reach reliability, as shown in Artstein and Poesio (2008).
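The two quality measures used here can be made concrete with a short sketch. This is our reading of the description above (the per-post relative overlap of label sequences, plus Fleiss' kappa via statsmodels on a toy count table), not the authors' implementation.

```python
from statsmodels.stats import inter_rater

def post_agreement(labels_a, labels_b):
    """Relative overlap of two annotators' label sequences for one post,
    e.g. ['abusive', 'untargeted'] vs. ['abusive', 'group targeted'] -> 0.5."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / max(len(labels_a), len(labels_b))

print(post_agreement(["abusive", "untargeted"], ["abusive", "group targeted"]))  # 0.5

# Fleiss' kappa for one binary subtask (abusive / not abusive):
# one row per post, one column per category, cells = number of annotators
# who chose that category for the post (here: a toy table with 2 annotators).
counts = [
    [2, 0],
    [1, 1],
    [0, 2],
    [2, 0],
]
print(inter_rater.fleiss_kappa(counts, method="fleiss"))
```
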
5.5    Annotator disagreement analysis

Based on the discussion rounds, the following types of posts were the most challenging to annotate:

1. Interpretation of the author's intention (irony, sarcasm, jokes, and questions). E.g. Haha! Virksomheder i Danmark: Vi ansætter aldrig en kvinde igen... (Haha! Companies in Denmark: We will never hire a woman again ...); sexisme og seksuelt frisind er da vist ikke det samme? (I don't believe sexism and sexual liberalism are the same?)

2. Degree of abuse: Misrepresenting the truth to harm the subject or fact. E.g. Han er en stor løgner (He is a big liar)

3. Hashtags: Meaning and usage of hashtags in relation to the context. E.g. #nometoo

4. World knowledge required: Du siger at Frank bruger sin magt forkert men du bruger din til at brænde så mange mænd på bålet ... (You say that Frank uses his power wrongly, but you use yours to throw so many men on the fire ... - referring to a specific political topic.)

5. Quotes: Re-posting or re-tweeting a quote gives limited information about the support or denial of the author

6. Jargon: Receiver's perception. E.g. I skal alle have et klap i måsen herfra (You all get a pat on the behind from me)

Handling these was an iterative process of raising cases for revision in the discussion rounds, formulating the issue, and providing documentation. We added the status and, where applicable, outcome from these cases to the guidelines. We also added explanations of hashtags and definitions of unclear identities, like "the media", as a company. For quotes without declaration of rejection or support, we agreed to label them as not abusive, since the motivation of re-posting is not clear.

5.6    Baseline Experiments as an indicator

Lastly, we provide a classification baseline: For misogyny and abusive language, the BERT model from Devlin et al. (2019) proved to be a robust architecture for cross-domain (Swamy et al., 2019) and cross-lingual (Pamungkas et al., 2020; Mulki and Ghanem, 2021) transfer. We therefore use multilingual BERT ('bert-base-multilingual-uncased') for general language understanding in Danish, fine-tuned on our dataset.

Model: We follow the suggested parameters from Mosbach et al. (2020) for fine-tuning (learning rate 2e-5, weight decay 0.01, AdamW optimizer without bias correction). Class imbalance is handled by weighted sampling and a train/test data split of 80/20. Experiments are conducted with batch size 32 using a Tesla V100 GPU.
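A minimal sketch of this setup with the Hugging Face transformers library is given below; it is our reconstruction from the stated hyperparameters rather than the authors' published code, and the dataset objects are assumed to exist.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Stated hyperparameters: learning rate 2e-5, weight decay 0.01, batch size 32.
# (The paper disables AdamW bias correction following Mosbach et al. (2020);
# torch's AdamW always applies it, so the legacy transformers AdamW with
# correct_bias=False would be the closer match.)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

def weighted_loader(dataset, labels, batch_size=32):
    """Oversample minority classes so batches are roughly class-balanced."""
    labels_t = torch.tensor(labels)
    class_counts = torch.bincount(labels_t)
    sample_weights = (1.0 / class_counts.float())[labels_t]
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```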
Preprocessing: Our initial pre-processing of the unstructured posts included converting emojis to text, URL replacement, limiting @USER and punctuation occurrences, and adding special tokens for upper case letters, adopted from Ahn et al. (2020).
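The preprocessing steps just listed could look roughly like the sketch below; the emoji package, the regular expressions and the casing marker are our assumptions, not the authors' code.

```python
import re
import emoji  # pip install emoji

def preprocess(post: str, max_repeats: int = 3) -> str:
    """Emojis to text, URLs replaced, @USER/punctuation runs capped, casing marked."""
    text = emoji.demojize(post)                                    # emojis -> textual aliases
    text = re.sub(r"https?://\S+", "url", text)                    # replace URLs with a placeholder
    text = re.sub(r"(?:@USER\s*){%d,}" % max_repeats,
                  "@USER " * max_repeats, text)                    # cap repeated mentions
    text = re.sub(r"([!?.])\1{%d,}" % max_repeats,
                  r"\1" * max_repeats, text)                       # cap repeated punctuation
    text = re.sub(r"(?<!@)\b[A-ZÆØÅ]{2,}\b",
                  lambda m: "[caps] " + m.group(0), text)          # mark upper-cased words
    return text.lower()                                            # the checkpoint is uncased

print(preprocess("HOLD NU OP!!!! @USER @USER @USER @USER se https://example.com"))
```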
Classification: Since the effect of applying multi-task learning might not conditionally improve performance (Mulki and Ghanem, 2021), the classification is evaluated on a subset of the dataset for each subtask (see Table 6), including all posts of the target label (e.g. misogyny) and stratified sampling of the non-target classes (e.g. for non-misogynistic: abusive and non-abusive posts), with 10k posts for each experiment. Results are reported when the model reached stabilized per-class f1 scores for all classes on the test set (± 0.01/20). The results indicate the expected challenge of accurately predicting less-represented classes and generalizing to unseen data. Analysing False Positives and False Negatives on the misogyny detection task, we cannot recognise noticeable correlations with other abusive forms and disagreements/difficult cases from the annotation task.

subtask      | epoch | f1     | prec.  | recall
abus/not     | 200   | 0.7650 | 76.43% | 76.4%
target       | 120   | 0.6502 | 64.45% | 66.2%
misog./not   | 200   | 0.8549 | 85.27% | 85.85%
misog.*      |       | 0.6191 |        |
misog.categ. | 100   | 0.7913 | 77.79% | 81.26%

Table 6: Baseline Evaluation: F1-scores, Precision, Recall (weighted, *except for misog., class f1-score) with mBERT

6    Discussion and reflection

Reflections on sampling  We sampled from different platforms, and applied different sampling techniques. The goal was to ensure, first, a sufficient amount of misogynistic content and, secondly, mitigation of biases stemming from a uniform dataset.

Surprisingly, topic sampling unearthed a higher density of misogynistic content than targeted keyword search (Table 3). While researching platforms, we noticed the limited presence of Danish for publicly available men-dominated fora (e.g. gaming forums such as DotA2 and extremist platforms such as Gab (Kennedy et al., 2018)). This, as well as limitations of platform APIs, caused a narrow data selection. Often, non-privileged languages can gain from cross-language transfer learning. We experimented with translating misogynistic posts from Fersini et al. (2018) to Danish, using translation services, thereby augmenting the minority class data. Translation services did not provide a sampling alternative. Additionally, as discovered by Anzovino et al. (2018), misogynistic content seems to vary with culture.
This makes language-specific investigations important, both for the sake of quality of automatic detection systems, as well as for cultural discovery and investigation. Table 7 shows results of post-translation manual correction by annotators (all fluent in English).

total | text corrected | label corrected | out
960   | 877            | 224             | 48

Table 7: Translating IberEval posts EN to DA

Reflections on annotation process  Using just seven annotators has the disadvantage that one is unlikely to achieve as broad a range of annotator profiles as, for instance, through crowdsourcing. However, during annotation and weekly discussions, we saw clear benefits from having a small annotator group with different backgrounds and intense training. While annotation quality cannot be measured by IAA alone, the time for debate clarified taxonomy items, gave thorough guidelines, and increased the likelihood of correct annotations. The latter reflects the quality of the final dataset, while the former two indicate that the taxonomy and codebook are likely useful for other researchers analysing and processing online misogyny.

6.1    A comprehensive taxonomy for misogyny

The semi-open development of the taxonomy and frequent discussions allowed the detection of neosexism as an implicit form of misogyny. Future research in taxonomies of misogyny could consider including distinctions between active/passive misogyny, as suggested by Anzovino et al. (2018), as well as other sub-phenomena.

In the resulting dataset, we saw a strong representation of neosexism. Whether this is a specific cultural phenomenon for Danish, or indicative of general online behaviour, is not clear.

The use of unified taxonomies in research affords the possibility to test the codebook guidelines iteratively. We include a short version of the guidelines in the appendix; the original document consists of seventeen pages. In a feedback survey following the annotation work, most of the annotators described that during the process, they used the guidelines primarily for revision in case they felt unsure how to label the post. To make the annotation more intuitively clear for annotators, we suggest reconsidering documentation tools and their accessibility for annotators. Guidelines are crucial for handling linguistic challenges, and well-documented decisions about them serve to create comparable research on detecting online misogyny across languages and datasets.

7    Conclusion and future work

In this work, we have documented the construction of a dataset for training systems for automatic detection of online misogyny. We also present the resulting dataset of misogyny in Danish social media, Bajer, including class balance, word counts, and a baseline as an indicator. This dataset is available for research purposes upon request.

The objective of this research was to explore the design of an annotation process which would result in a high quality dataset, and which was transparent and useful for other researchers.

Our approach was to recruit and train a diverse group of annotators and build a taxonomy and codebook through collaborative and iterative annotator-involved discussions. The annotators reached good agreement, indicating that the taxonomy and codebook were understandable and useful.

However, to rigorously evaluate the quality of the dataset and the performance of models that build on it, the models should be evaluated in practice with different text types and languages, as well as compared and combined with models trained on different datasets, e.g. Guest et al. (2021). Because online misogyny is a sensitive and precarious subject, we also propose that the performance of automatic detection models should be evaluated with use of qualitative methods (Inie and Derczynski, 2021), bringing humans into the loop. As we found through our continuous discussions, online abuse can present in surprising forms, for instance the denial that misogyny exists. The necessary integration of knowledge and concepts from relevant fields, e.g. social science, into NLP research is only really possible through thorough human participation and discussion.

Acknowledgement

This research was supported by the IT University of Copenhagen, Computer Science, for internal funding on Abusive Language Detection; and the Independent Research Fund Denmark under project 9131-00131B, Verif-AI. We thank our annotators Nina Schøler Nørgaard, Tamana Saidi, Jonas Joachim Kofoed, Freja Birk, Cecilia Andersen, Ulrik Dolzyk, Im Sofie Skak and Rania M. Tawfik. We are also grateful for discussions with Debora Nozza, Elisabetta Fersini and Tracie Farrell.
Impact statement: Data anonymization

Usernames and discussion participant/author names are replaced with the token @USER. Annotators were presented with the text of the post and no author information. Posts that could not be interpreted by annotators because of missing background information were excluded. We only gathered public posts.

Annotators worked in a tool where they could not export or copy data. Annotators were instructed to flag and skip PII-bearing posts.

All further information about dataset creation is included in the main body of the paper above.
                                                           putational Linguistics, 6:587–604.
References
Hwijeen Ahn, Jimin Sun, Chan Young Park, and Jungyun Seo. 2020. NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1576–1586.

Hala Al Kuwatly, Maximilian Wich, and Georg Groh. 2020. Identifying and Measuring Annotator Bias Based on Annotators' Demographic Characteristics. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 184–190, Online. Association for Computational Linguistics.

Amnesty International. 2017. Amnesty reveals alarming impact of online abuse against women. https://www.amnesty.org/en/latest/news/2017/11/amnesty-reveals-alarmin-impact-of-/online-abuse-against-women/. Accessed: Jan, 2021.

Analyse & Tal. 2021. Angreb i den offentlige debat på Facebook. Technical report, Analyse & Tal.

Astrid Skov Andersen and Maja Langberg. 2021. Nogle personer tror, at de gør verden til et bedre sted ved at sende hadbeskeder, siger ekspert. TV2 Nyheder.

Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. Automatic Identification and Classification of Misogynistic Language on Twitter. In Max Silberztein, Faten Atigui, Elena Kornyshova, Elisabeth Métais, and Farid Meziane, editors, Natural Language Processing and Information Systems, volume 10859, pages 57–64. Springer International Publishing, Cham. Series Title: Lecture Notes in Computer Science.

Ron Artstein and Massimo Poesio. 2008. Inter-Coder Agreement for Computational Linguistics. Computational Linguistics, 34(4):555–596.

Michele Banko, Brendon MacKeen, and Laurie Ray. 2020. A Unified Taxonomy of Harmful Content.

Emily M. Bender and Batya Friedman. 2018. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6:587–604.

Dorthe Bleses, Werner Vach, Malene Slott, Sonja Wehberg, Pia Thomsen, Thomas O Madsen, and Hans Basbøll. 2008. Early vocabulary development in Danish and other languages: A CDI-based comparison. Journal of Child Language, 35(3):619.

Bernadette Campbell, E. Glenn Schellenberg, and Charlene Y. Senn. 1997. Evaluating Measures of Contemporary Sexism. Psychology of Women Quarterly, 21(1):89–102. Publisher: SAGE Publications Inc.

Casey Newton. 2020. Facebook will pay $52 million in settlement with moderators who developed PTSD on the job. The Verge. Accessed: Jan, 2021.

Despoina Chatzakou, Nicolas Kourtellis, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, and Athena Vakali. 2017. Mean Birds: Detecting Aggression and Bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, WebSci '17, pages 13–22, New York, NY, USA. Association for Computing Machinery.

Patricia Chiril, Véronique Moriceau, Farah Benamara, Alda Mari, Gloria Origgi, and Marlène Coulomb-Gully. 2020. An Annotated Corpus for Sexism Detection in French Tweets. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1397–1403, Marseille, France. European Language Resources Association.

Yi-Ling Chung, Elizaveta Kuzmenko, Serra Sinem Tekiroglu, and Marco Guerini. 2019. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2819–2829, Florence, Italy. Association for Computational Linguistics.

Danske Kvindesamfund. 2020. Sexisme og sexchikane. https://danskkvindesamfund.dk/dansk-kvindesamfunds-abc/sexisme/. Accessed 2021-01-17.

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial Bias in Hate Speech and Abusive Language Detection Datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy. Association for Computational Linguistics.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of the International AAAI Conference on Web and Social Media, 1.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES '18, pages 67–73, New York, NY, USA. Association for Computing Machinery.

Bo Ekehammar, Nazar Akrami, and Tadesse Araya. 2000. Development and validation of Swedish classical and modern sexism scales. Scandinavian Journal of Psychology, 41(4):307–314. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/1467-9450.00203.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In IberEval@SEPLN, pages 214–228.

Mark A Finlayson and Tomaž Erjavec. 2017. Overview of annotation creation: Processes and tools. In Handbook of Linguistic Annotation, pages 167–191. Springer.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Komal Florio, Valerio Basile, Marco Polignano, Pierpaolo Basile, and Viviana Patti. 2020. Time of Your Hate: The Challenge of Time in Hate Speech Detection on Social Media. Applied Sciences, 10(12):4180. Number: 12 Publisher: Multidisciplinary Digital Publishing Institute.

Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. In Twelfth International AAAI Conference on Web and Social Media.

Batya Friedman and Helen Nissenbaum. 1996. Bias in Computer Systems. ACM Transactions on Information Systems, 14(3):330–347. Publisher: Association for Computing Machinery (ACM).

Lei Gao and Ruihong Huang. 2017. Detecting Online Hate Speech Using Context Aware Models. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 260–266, Varna, Bulgaria. INCOMA Ltd.

Joshua Garland, Keyan Ghazi-Zahedi, Jean-Gabriel Young, Laurent Hébert-Dufresne, and Mirta Galesic. 2020. Countering hate on social media: Large scale classification of hate and counter speech. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 102–112, Online. Association for Computational Linguistics.

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China. Association for Computational Linguistics.

Peter Glick and Susan Fiske. 1996. The Ambivalent Sexism Inventory: Differentiating Hostile and Benevolent Sexism. Journal of Personality and Social Psychology, 70:491–512.

Jennifer Golbeck, Zahra Ashktorab, Rashad O. Banjo, Alexandra Berlinger, Siddharth Bhagwan, Cody Buntain, Paul Cheakalos, Alicia A. Geller, Quint Gergory, Rajesh Kumar Gnanasekaran, Raja Rajan Gunasekaran, Kelly M. Hoffman, Jenny Hottle, Vichita Jienjitlert, Shivika Khare, Ryan Lau, Marianna J. Martindale, Shalmali Naik, Heather L. Nixon, Piyush Ramachandran, Kristine M. Rogers, Lisa Rogers, Meghna Sardana Sarin, Gaurav Shahane, Jayanee Thanki, Priyanka Vengataraman, Zijian Wan, and Derek Michael Wu. 2017. A Large Labeled Corpus for Online Harassment Research. In Proceedings of the 2017 ACM on Web Science Conference, pages 229–233, Troy, New York, USA. ACM.

Ella Guest, Bertie Vidgen, Alexandros Mittos, Nishanth Sastry, Gareth Tyson, and Helen Margetts. 2021. An Expert Annotated Dataset for the Detection of Online Misogyny. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1336–1350, Online. Association for Computational Linguistics.

Dirk Hovy and Shannon L. Spruit. 2016. The Social Impact of Natural Language Processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.

Nanna Inie and Leon Derczynski. 2021. An IDR Framework of Opportunities and Barriers between HCI and NLP. In Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing, pages 101–108.

Akshita Jha and Radhika Mamidi. 2017. When does a compliment become sexist? Analysis and classification of ambivalent sexism using twitter data. In Proceedings of the Second Workshop on NLP and Computational Social Science, pages 7–16, Vancouver, Canada. Association for Computational Linguistics.

Mladen Karan and Jan Šnajder. 2018. Cross-Domain Detection of Abusive Language Online. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 132–137, Brussels, Belgium. Association for Computational Linguistics.

Brendan Kennedy, Mohammad Atari, Aida Mostafazadeh Davani, Leigh Yeh, Ali Omrani, Yehsong Kim, Kris Coombs, Shreya Havaldar, Gwenyth Portillo-Wightman, Elaine Gonzalez, Joseph Hoover, Aida Azatian, Alyzeh Hussain, Austin Lara, Gabriel Olmos, Adam Omary, Christina Park, Clarisa Wijaya, Xin Wang, Yong Zhang, and Morteza Dehghani. 2018. The Gab Hate Corpus: A collection of 27k posts annotated for hate speech. Technical report, PsyArXiv. Type: article.

Svetlana Kiritchenko, Isar Nejadgholi, and Kathleen C. Fraser. 2020. Confronting Abusive Language Online: A Survey from the Ethical and Human Rights Perspective. arXiv:2012.12305 [cs]. ArXiv: 2012.12305.

Andreas Kirkedal, Barbara Plank, Leon Derczynski, and Natalie Schluter. 2019. The Lacunae of Danish Natural Language Processing. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 356–362.

Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, and Marcos Zampieri. 2018. Benchmarking Aggression Identification in Social Media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 1–11, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Carmen Martinez, Consuelo Paterna, Patricia Roux, and Juan Manuel Falomir. 2010. Predicting gender awareness: The relevance of neo-sexism. Journal of Gender Studies, 19(1):1–12.

Barbara Masser and Dominic Abrams. 1999. Contemporary Sexism: The Relationships Among Hostility, Benevolence, and Neosexism. Psychology of Women Quarterly, 23(3):503–517. Publisher: SAGE Publications Inc.

Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2020. On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines. arXiv:2006.04884 [cs, stat]. ArXiv: 2006.04884.

Hala Mulki and Bilal Ghanem. 2021. Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 154–163.

Nedjma Ousidhoum, Yangqiu Song, and Dit-Yan Yeung. 2020. Comparative Evaluation of Label-Agnostic Selection Bias in Multilingual Hate Speech Datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2532–2542, Online. Association for Computational Linguistics.

Endang Wahyu Pamungkas, Valerio Basile, and Viviana Patti. 2020. Misogyny Detection in Twitter: a Multilingual and Cross-Domain Study. Information Processing & Management, 57(6):102360.

Pulkit Parikh, Harika Abburi, Pinkesh Badjatiya, Radhika Krishnan, Niyati Chhaya, Manish Gupta, and Vasudeva Varma. 2019. Multi-label Categorization of Accounts of Sexism using a Neural Framework. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1642–1652, Hong Kong, China. Association for Computational Linguistics.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing Gender Bias in Abusive Language Detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, Brussels, Belgium. Association for Computational Linguistics.

James Pustejovsky and Amber Stubbs. 2012. Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications. O'Reilly Media, Inc. Google-Books-ID: A57TS7fs8MUC.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The Risk of Racial Bias in Hate Speech Detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.

Anna Schmidt and Michael Wiegand. 2017. A Survey on Hate Speech Detection using Natural Language Processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10, Valencia, Spain. Association for Computational Linguistics.

Deven Santosh Shah, H. Andrew Schwartz, and Dirk Hovy. 2020. Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.

Sima Sharifirad and Alon Jacovi. 2019. Learning and Understanding Different Categories of Sexism Using Convolutional Neural Network's Filters. In Proceedings of the 2019 Workshop on Widening NLP, pages 21–23.

Gudbjartur Ingi Sigurbergsson and Leon Derczynski. 2020. Offensive Language and Hate Speech Detection for Danish. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 3498–3508.

Lea Skewes, Joshua Skewes, and Michelle Ryan. 2019. Attitudes to Sexism and Gender Equity at a Danish University. Kvinder, Køn & Forskning, pages 71–85.

Statista. 2020. Denmark: most popular social media sites 2020. Accessed: Jan, 2021.

Steve Durairaj Swamy, Anupam Jamatia, and Björn Gambäck. 2019. Studying Generalisability across Abusive Language Detection Datasets. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 940–950, Hong Kong, China. Association for Computational Linguistics.

Janet Swim, Kathryn Aikin, Wayne Hall, and Barbara Hunter. 1995. Sexism and Racism: Old-Fashioned and Modern Prejudices. Journal of Personality and Social Psychology, 68:199–214.

Francine Tougas, Rupert Brown, Ann M. Beaton, and Stéphane Joly. 1995. Neosexism: Plus Ça Change, Plus C'est Pareil. Personality and Social Psychology Bulletin, 21(8):842–849. Publisher: SAGE Publications Inc.

Francine Tougas, Rupert Brown, Ann M. Beaton, and Line St-Pierre. 1999. Neosexism among Women: The Role of Personally Experienced Social Mobility Attempts. Personality and Social Psychology Bulletin, 25(12):1487–1497. Publisher: SAGE Publications Inc.

Bertie Vidgen and Leon Derczynski. 2020. Directions in abusive language training data, a systematic review: Garbage in, garbage out. PLOS ONE, 15(12):e0243300. Publisher: Public Library of Science.

Bertie Vidgen, Alex Harris, Dong Nguyen, Rebekah Tromble, Scott Hale, and Helen Margetts. 2019. Challenges and frontiers in abusive content detection. In Proceedings of the Third Workshop on Abusive Language Online, pages 80–93, Florence, Italy. Association for Computational Linguistics.

Rex E Wallace. 2005. An introduction to wall inscriptions from Pompeii and Herculaneum. Bolchazy-Carducci Publishers.

Zeerak Waseem. 2016. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142, Austin, Texas. Association for Computational Linguistics.

Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. In Proceedings of the First Workshop on Abusive Language Online, pages 78–84, Vancouver, BC, Canada. Association for Computational Linguistics.

Zeerak Waseem and Dirk Hovy. 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California. Association for Computational Linguistics.

Maximilian Wich, Jan Bauer, and Georg Groh. 2020. Impact of Politically Biased Data on Hate Speech Classification. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 54–64, Online. Association for Computational Linguistics.

Michael Wiegand, Josef Ruppenhofer, and Thomas Kleinbauer. 2019. Detection of Abusive Language: the Problem of Biased Datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 602–608, Minneapolis, Minnesota. Association for Computational Linguistics.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex Machina: Personal Attacks Seen at Scale. In Proceedings of the 26th International Conference on World Wide Web, pages 1391–1399.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. Predicting the Type and Target of Offensive Posts in Social Media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1415–1420, Minneapolis, Minnesota. Association for Computational Linguistics.

Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. 2020. SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). arXiv:2006.07235 [cs]. ArXiv: 2006.07235.

Caleb Ziems, Bing He, Sandeep Soni, and Srijan Kumar. 2020. Racism is a Virus: Anti-Asian Hate and Counterhate in Social Media during the COVID-19 Crisis. arXiv:2005.12423 [physics]. ArXiv: 2005.12423.