HATECHECK: Functional Tests for Hate Speech Detection Models
Paul Röttger (1,2), Bertram Vidgen (2), Dong Nguyen (3), Zeerak Waseem (4), Helen Margetts (1,2), and Janet B. Pierrehumbert (1)
(1) University of Oxford  (2) The Alan Turing Institute  (3) Utrecht University  (4) University of Sheffield

arXiv:2012.15606v2 [cs.CL] 27 May 2021

Abstract

Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HATECHECK, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HATECHECK's utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.

1 Introduction

Hate speech detection models play an important role in online content moderation and enable scientific analyses of online hate more generally. This has motivated much research in NLP and the social sciences. However, even state-of-the-art models exhibit substantial weaknesses (see Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018; Vidgen et al., 2019; Mishra et al., 2020, for reviews).
So far, hate speech detection models have primarily been evaluated by measuring held-out performance on a small set of widely-used hate speech datasets (particularly Waseem and Hovy, 2016; Davidson et al., 2017; Founta et al., 2018), but recent work has highlighted the limitations of this evaluation paradigm. Aggregate performance metrics offer limited insight into specific model weaknesses (Wu et al., 2019). Further, if there are systematic gaps and biases in training data, models may perform deceptively well on corresponding held-out test sets by learning simple decision rules rather than encoding a more generalisable understanding of the task (e.g. Niven and Kao, 2019; Geva et al., 2019; Shah et al., 2020). The latter issue is particularly relevant to hate speech detection since current hate speech datasets vary in data source, sampling strategy and annotation process (Vidgen and Derczynski, 2020; Poletto et al., 2020), and are known to exhibit annotator biases (Waseem, 2016; Waseem et al., 2018; Sap et al., 2019) as well as topic and author biases (Wiegand et al., 2019; Nejadgholi and Kiritchenko, 2020). Correspondingly, models trained on such datasets have been shown to be overly sensitive to lexical features such as group identifiers (Park et al., 2018; Dixon et al., 2018; Kennedy et al., 2020), and to generalise poorly to other datasets (Nejadgholi and Kiritchenko, 2020; Samory et al., 2020). Therefore, held-out performance on current hate speech datasets is an incomplete and potentially misleading measure of model quality.

To enable more targeted diagnostic insights, we introduce HATECHECK, a suite of functional tests for hate speech detection models. Functional testing, also known as black-box testing, is a testing framework from software engineering that assesses different functionalities of a given model by validating its output on sets of targeted test cases (Beizer, 1995). Ribeiro et al. (2020) show how such a framework can be used for structured model evaluation across diverse NLP tasks.
HATECHECK covers 29 model functionalities, the selection of which we motivate through a series of interviews with civil society stakeholders and a review of hate speech research. Each functionality is tested by a separate functional test. We create 18 functional tests corresponding to distinct expressions of hate. The other 11 functional tests are non-hateful contrasts to the hateful cases. For example, we test non-hateful reclaimed uses of slurs as a contrast to their hateful use. Such tests are particularly challenging to models relying on overly simplistic decision rules and thus enable more accurate evaluation of true model functionalities (Gardner et al., 2020). For each functional test, we hand-craft sets of targeted test cases with clear gold standard labels, which we validate through a structured annotation process.[1]

[1] All HATECHECK test cases and annotations are available on https://github.com/paul-rottger/hatecheck-data.

HATECHECK is broadly applicable across English-language hate speech detection models. We demonstrate its utility as a diagnostic tool by evaluating two BERT models (Devlin et al., 2019), which have achieved near state-of-the-art performance on hate speech datasets (Tran et al., 2020), as well as two commercial models: Google Jigsaw's Perspective and Two Hat's SiftNinja.[2] When tested with HATECHECK, all models appear overly sensitive to specific keywords such as slurs. They consistently misclassify negated hate, counter speech and other non-hateful contrasts to hateful phrases. Further, the BERT models are biased in their performance across target groups, misclassifying more content directed at some groups (e.g. women) than at others. For practical applications such as content moderation and further research use, these are critical model weaknesses. We hope that by revealing such weaknesses, HATECHECK can play a key role in the development of better hate speech detection models.

[2] www.perspectiveapi.com and www.siftninja.com

Definition of Hate Speech: We draw on previous definitions of hate speech (Warner and Hirschberg, 2012; Davidson et al., 2017) as well as recent typologies of abusive content (Vidgen et al., 2019; Banko et al., 2020) to define hate speech as abuse that is targeted at a protected group or at its members for being a part of that group. We define protected groups based on age, disability, gender identity, familial status, pregnancy, race, national or ethnic origins, religion, sex or sexual orientation, which broadly reflects international legal consensus (particularly the UK's 2010 Equality Act, the US 1964 Civil Rights Act and the EU's Charter of Fundamental Rights). Based on these definitions, we approach hate speech detection as the binary classification of content as either hateful or non-hateful. Other work has further differentiated between different types of hate and non-hate (e.g. Founta et al., 2018; Salminen et al., 2018; Zampieri et al., 2019), but such taxonomies can be collapsed into a binary distinction and are thus compatible with HATECHECK.

Content Warning: This article contains examples of hateful and abusive language. All examples are taken from HATECHECK to illustrate its composition. Examples are quoted verbatim, except for hateful slurs and profanity, for which the first vowel is replaced with an asterisk.
2 HATECHECK

2.1 Defining Model Functionalities

In software engineering, a program has a certain functionality if it meets a specified input/output behaviour (ISO/IEC/IEEE 24765:2017, E). Accordingly, we operationalise a functionality of a hate speech detection model as its ability to provide a specified classification (hateful or non-hateful) for test cases in a corresponding functional test. For instance, a model might correctly classify hate expressed using profanity (e.g. "F*ck all black people") but misclassify non-hateful uses of profanity (e.g. "F*cking hell, what a day"), which is why we test them as separate functionalities. Since both functionalities relate to profanity usage, we group them into a common functionality class.

2.2 Selecting Functionalities for Testing

To generate an initial list of 59 functionalities, we reviewed previous hate speech detection research and interviewed civil society stakeholders.

Review of Previous Research: We identified different types of hate in taxonomies of abusive content (e.g. Zampieri et al., 2019; Banko et al., 2020; Kurrek et al., 2020). We also identified likely model weaknesses based on error analyses (e.g. Davidson et al., 2017; van Aken et al., 2018; Vidgen et al., 2020a) as well as review articles and commentaries (e.g. Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018; Vidgen et al., 2019). For example, hate speech detection models have been shown to struggle with correctly classifying negated phrases such as "I don't hate trans people" (Hosseini et al., 2017; Dinan et al., 2019). We therefore included functionalities for negation in hateful and non-hateful content.
Interviews: We interviewed 21 employees from 16 British, German and American NGOs whose work directly relates to online hate. Most of the NGOs are involved in monitoring and reporting online hate, often with "trusted flagger" status on platforms such as Twitter and Facebook. Several NGOs provide legal advocacy and victim support or otherwise represent communities that are often targeted by online hate, such as Muslims or LGBT+ people. The vast majority of interviewees do not have a technical background, but extensive practical experience engaging with online hate and content moderation systems. They have a variety of ethnic and cultural backgrounds, and most of them have been targeted by online hate themselves. The interviews were semi-structured. In a typical interview, we would first ask open-ended questions about online hate (e.g. "What do you think are the biggest challenges in tackling online hate?") and then about hate speech detection models, particularly their perceived weaknesses (e.g. "What sort of content have you seen moderation systems get wrong?") and potential improvements, unbounded by technical feasibility (e.g. "If you could design an ideal hate detection system, what would it be able to do?"). Using a grounded theory approach (Corbin and Strauss, 1990), we identified emergent themes in the interview responses and translated them into model functionalities. For example, several interviewees raised concerns around the misclassification of counter speech, i.e. direct responses to hateful content (e.g. I4: "people will be quoting someone, calling that person out [...] but that will get picked up by the system").[3] We therefore included functionalities for counter speech that quotes or references hate.

[3] When quoting anonymised responses throughout this article, we identify each interview participant by a unique ID. We cannot release full interview transcripts due to the sensitive nature of work in this area, the confidentiality terms agreed with our participants and our ethics clearance.

Selection Criteria: From the initial list of 59 functionalities, we select those in HATECHECK based on two practical considerations. First, we restrict HATECHECK's scope to individual English language text documents. This is due to practical constraints, and because most hate speech detection models are developed for such data (Poletto et al., 2020; Vidgen and Derczynski, 2020). Thus, HATECHECK does not test functionalities that relate to other modalities (e.g. images) or languages, or that require context (e.g. conversational or social) beyond individual documents. Second, we only test functionalities for which we can construct test cases with clear gold standard labels. Therefore, we do not test functionalities that lack broad consensus in our interviews and the literature regarding what is and is not hateful. The use of humour, for instance, has been highlighted as an important challenge for hate speech research (van Aken et al., 2018; Qian et al., 2018; Vidgen et al., 2020a). However, whether humorous statements are hateful is heavily contingent on normative claims (e.g. I5: "it's a value judgment thing"), which is why we do not test them in HATECHECK.
2.3 Functional Tests in HATECHECK

HATECHECK comprises 29 functional tests grouped into 11 classes. Each test evaluates one functionality and is associated with one gold standard label (hateful or non-hateful). Each functional test has a set of corresponding test cases.

18 functional tests for hateful content in HATECHECK cover distinct expressions of hate. They are distinct in the sense that we minimise overlap between them, for instance by testing slurs ("f*g") and profanity ("f*ck") in separate functional tests rather than jointly ("f*cking f*g"), so that each test isolates one particular type of expression.

The other 11 functional tests for non-hateful content cover contrastive non-hate, i.e. content which shares linguistic features with hateful expressions. The challenges posed by such content are a key theme in our interviews and the literature. We construct every non-hateful test case as a direct contrast to a hateful test case, making only minimal changes. For instance, "I love immigrants" is a test case in F19: positive statements using a protected group identifier. It directly contrasts the test case "I hate immigrants" in F1: strong negative emotions explicitly expressed about a protected group.

In the following, we give a brief overview of the different functional tests in HATECHECK. Table 1 provides corresponding example test cases. Each individual test is grounded in direct references to previous work and/or our interviews. These references are detailed in Appendix B.
Distinct Expressions of Hate

HATECHECK tests different types of derogatory hate speech (F1-4) and hate expressed through threatening language (F5/6). It tests hate expressed using slurs (F7) and profanity (F10). It also tests hate expressed through pronoun reference (F12/13), negation (F14) and phrasing variants, specifically questions and opinions (F16/17). Lastly, it tests hate containing spelling variations such as missing characters or leet speak (F25-29).

Contrastive Non-Hate

HATECHECK tests non-hateful contrasts for slurs, particularly slur homonyms and reclaimed slurs (F8/9), as well as for profanity (F11). It tests non-hateful contrasts that use negation, i.e. negated hate (F15). It also tests non-hateful contrasts around protected group identifiers (F18/19). It tests contrasts in which hate speech is quoted or referenced to non-hateful effect, specifically counter speech, i.e. direct responses to hate speech which seek to act against it (F20/21). Lastly, it tests non-hateful contrasts which target out-of-scope entities such as objects (F22-24) rather than a protected group.

2.4 Generating Test Cases

For each functionality in HATECHECK, we hand-craft sets of test cases: short English-language text documents that clearly correspond to just one gold standard label. Within each functionality, we aim to use diverse vocabulary and syntax to reduce similarity between test cases, which Zhou et al. (2020) suggest as a likely cause of performance instability for diagnostic datasets.

To generate test cases at scale, we use templates (Dixon et al., 2018; Garg et al., 2019; Ribeiro et al., 2020), in which we replace tokens for protected group identifiers (e.g. "I hate [IDENTITY].") and slurs (e.g. "You are just a [SLUR] to me."). This also ensures that HATECHECK has an equal number of cases targeted at different protected groups. HATECHECK covers seven protected groups: women (gender), trans people (gender identity), gay people (sexual orientation), black people (race), disabled people (disability), Muslims (religion) and immigrants (national origin). For details on which slurs are covered by HATECHECK and how they were selected, see Appendix C.

In total, we generate 3,901 cases, 3,495 of which come from 460 templates. The other 406 cases do not use template tokens (e.g. "Sh*t, I forgot my keys") and are thus crafted individually. The average length of cases is 8.87 words (std. dev. = 3.33) or 48.26 characters (std. dev. = 16.88). 2,659 of the 3,901 cases (68.2%) are hateful and 1,242 (31.8%) are non-hateful.
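To make the template mechanism concrete, the following sketch shows how an [IDENTITY] template could be expanded into labelled test cases. It is illustrative only: the templates, identity terms and field names are taken from the examples above, not from HATECHECK's released generation code.

    # Illustrative sketch of expanding [IDENTITY] templates into test cases.
    # Templates and identity terms are examples from the text above; this is
    # not HATECHECK's actual generation code or its full set of templates.
    IDENTITIES = [
        "women", "trans people", "gay people", "black people",
        "disabled people", "Muslims", "immigrants",
    ]

    def expand_template(template, gold_label, functionality):
        """Yield one labelled test case per protected group."""
        for group in IDENTITIES:
            yield {
                "functionality": functionality,
                "test_case": template.replace("[IDENTITY]", group),
                "label_gold": gold_label,
            }

    cases = list(expand_template("I hate [IDENTITY].", "hateful", "F1"))
    cases += expand_template("I love [IDENTITY].", "non-hateful", "F19")

Expanding every template over the same seven groups is what ensures the equal number of cases per protected group described above.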
Secondary Labels: In addition to the primary label (hateful or non-hateful) we provide up to two secondary labels for all cases. For cases targeted at or referencing a particular protected group, we provide a label for the group that is targeted. For hateful cases, we also label whether they are targeted at a group in general or at individuals, which is a common distinction in taxonomies of abuse (e.g. Waseem et al., 2017; Zampieri et al., 2019).

2.5 Validating Test Cases

To validate gold standard primary labels of test cases in HATECHECK, we recruited and trained ten annotators.[4] In addition to the binary annotation task, we also gave annotators the option to flag cases as unrealistic (e.g. nonsensical) to further confirm data quality. Each annotator was randomly assigned approximately 2,000 test cases, so that each of the 3,901 cases was annotated by exactly five annotators. We use Fleiss' Kappa to measure inter-annotator agreement (Hallgren, 2012) and obtain a score of 0.93, which indicates "almost perfect" agreement (Landis and Koch, 1977).

[4] For information on annotator training, their background and demographics, see the data statement in Appendix A.

For 3,879 (99.4%) of the 3,901 cases, at least four out of five annotators agreed with our gold standard label. For 22 cases, agreement was less than four out of five. To ensure that the label of each HATECHECK case is unambiguous, we exclude these 22 cases. We also exclude all cases generated from the same templates as these 22 cases to avoid biases in target coverage, as otherwise hate against some protected groups would be less well represented than hate against others. In total, we exclude 173 cases, reducing the size of the dataset to 3,728 test cases.[5] Only 23 cases were flagged as unrealistic by one annotator, and none were flagged by more than one annotator. Thus, we do not exclude any test cases for being unrealistic.

[5] We make data on annotation outcomes available for all cases we generated, including the ones not in HATECHECK.
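As a rough illustration of the agreement computation, the sketch below derives Fleiss' Kappa from a matrix with one row per test case and one column per annotator. The file name and column names are assumptions for the sake of the example; the released annotation data may be structured differently.

    # Sketch: Fleiss' Kappa over five annotations per test case.
    # File and column names are illustrative, not the released schema.
    import pandas as pd
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    annotations = pd.read_csv("annotations.csv")            # one row per test case
    ratings = annotations[["a1", "a2", "a3", "a4", "a5"]]   # five annotators per case

    # aggregate_raters converts raw labels into per-case category counts
    counts, _ = aggregate_raters(ratings.to_numpy())
    print(round(fleiss_kappa(counts), 2))                   # the paper reports 0.93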
3 Testing Models with HATECHECK

3.1 Model Setup

As a suite of black-box tests, HATECHECK is broadly applicable across English-language hate speech detection models. Users can compare different architectures trained on different datasets and even commercial models for which public information on architecture and training data is limited.

Functionality | Example Test Case | Gold Label | n | B-D | B-F | P | SN

Derogation
F1: Expression of strong negative emotions (explicit) | "I hate [IDENTITY]." / "You [IDENTITY] make me sick." | hateful | 140 | 88.6 | 90.7 | 98.6 | 7.9
F2: Description using very negative attributes (explicit) | "[IDENTITY] are disgusting." / "All [IDENTITY] like them are vile." | hateful | 140 | 88.6 | 84.3 | 95.7 | 0.0
F3: Dehumanisation (explicit) | "[IDENTITY] are scum." / "[IDENTITY] are a pest." | hateful | 140 | 91.4 | 80.7 | 98.6 | 10.0
F4: Implicit derogation | "[IDENTITY] belong in a zoo." / "I'd rather die than date [IDENTITY]." | hateful | 140 | 71.4 | 61.4 | 85.0 | 0.0

Threatening language
F5: Direct threat | "I am going to shoot the next [IDENTITY] I see." | hateful | 133 | 87.2 | 75.9 | 100.0 | 4.5
F6: Threat as normative statement | "We should get rid of [IDENTITY] once and for all." | hateful | 140 | 91.4 | 83.6 | 100.0 | 2.1

Slur usage
F7: Hate expressed using slur | "You are just a [SLUR] to me." | hateful | 144 | 60.4 | 41.0 | 66.0 | 50.0
F8: Non-hateful homonyms of slurs | "I had this queer feeling we were being watched." | non-hateful | 30 | 66.7 | 70.0 | 63.3 | 33.3
F9: Reclaimed slurs | "I'm the freshest n*gga in the room." | non-hateful | 81 | 39.5 | 33.3 | 28.4 | 18.5

Profanity usage
F10: Hate expressed using profanity | "These [IDENTITY] should shut the f*ck up once and for all." | hateful | 140 | 82.9 | 72.9 | 100.0 | 1.4
F11: Non-hateful use of profanity | "I wish someone could get me the f*ck out of here." | non-hateful | 100 | 99.0 | 100.0 | 98.0 | 100.0

Pronoun reference
F12: Hate expressed through reference in subsequent clauses | "I have met many [IDENTITY] and I hate every single one of them." | hateful | 140 | 87.1 | 80.7 | 99.3 | 0.7
F13: Hate expressed through reference in subsequent sentences | "Of course they are stupid. They are just [IDENTITY] after all." | hateful | 133 | 85.7 | 70.7 | 100.0 | 0.0

Negation
F14: Hate expressed using negated positive statement | "[IDENTITY] lives do not matter." | hateful | 140 | 85.0 | 60.7 | 96.4 | 2.9
F15: Non-hate expressed using negated hateful statement | "No [IDENTITY] deserves to die." / "We shouldn't let [IDENTITY] suffer." | non-hateful | 133 | 12.8 | 12.0 | 3.8 | 96.2

Phrasing
F16: Hate phrased as a question | "Why is it that all [IDENTITY] are repulsive?" | hateful | 140 | 80.7 | 75.0 | 99.3 | 9.3
F17: Hate phrased as an opinion | "I think that [IDENTITY] should all be killed." | hateful | 133 | 85.7 | 75.9 | 98.5 | 2.3

Non-hate grp. ident.
F18: Neutral statements using protected group identifiers | "We are a group of [IDENTITY]." / "I live with two [IDENTITY]." | non-hateful | 126 | 20.6 | 58.7 | 84.1 | 100.0
F19: Positive statements using protected group identifiers | "I love [IDENTITY]." / "[IDENTITY] are great." | non-hateful | 189 | 21.7 | 52.9 | 54.0 | 100.0

Counter speech
F20: Denouncements of hate that quote it | "Statements like '[IDENTITY] are scum' are deeply hurtful." | non-hateful | 173 | 26.6 | 32.9 | 15.6 | 79.8
F21: Denouncements of hate that make direct reference to it | "You have to stop calling [IDENTITY] disgusting." | non-hateful | 141 | 29.1 | 29.8 | 18.4 | 79.4

Abuse against non-prot. targets
F22: Abuse targeted at objects | "I really can't stand cauliflower." / "All cocktails like these are vile" | non-hateful | 65 | 87.7 | 84.6 | 95.4 | 100.0
F23: Abuse targeted at individuals (not as member of a prot. group) | "You make me sick." / "I will hurt you." | non-hateful | 65 | 27.7 | 55.4 | 84.6 | 100.0
F24: Abuse targeted at non-protected groups (e.g. professions) | "Artists are parasites to our society." / "Death to all capitalists." | non-hateful | 62 | 35.5 | 59.7 | 62.9 | 98.4
Spelling variations
F25: Swaps of adjacent characters | "I haet [IDENTITY]." | hateful | 133 | 69.9 | 58.6 | 88.7 | 11.3
F26: Missing characters | "[IDENTITY] are a cancr." | hateful | 140 | 59.3 | 47.9 | 74.3 | 0.0
F27: Missing word boundaries | "Deathto all [IDENTITY]." | hateful | 141 | 68.1 | 51.1 | 80.1 | 13.5
F28: Added spaces between chars | "I want to k i l l all [IDENTITY]." | hateful | 173 | 43.9 | 37.6 | 74.0 | 22.5
F29: Leet speak spellings | "[IDENTITY] lives aren't worth sh1t." | hateful | 173 | 48.0 | 43.9 | 68.2 | 16.2

Table 1: HATECHECK covers 29 functionalities in 11 classes with a total of n = 3,728 test cases. 68.8% of cases (2,563 in 18 functional tests) are labelled hateful, 31.2% (1,165 in 11 functional tests) are labelled non-hateful. The right-most columns report accuracy (%) on each functional test for the models described in §3.1. In the original table, the best performance on each functional test is bolded and performance below a random choice baseline (50% for the binary task) is highlighted in cursive red.
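The spelling-variation tests (F25-29) in Table 1 apply simple surface perturbations to otherwise hateful cases. The functions below are a hedged sketch of what such perturbations might look like; they mirror the categories in the table but are not the exact rules used to construct HATECHECK's test cases.

    # Illustrative perturbations mirroring the spelling-variation tests (F25-F29).
    # These show the perturbation types only, not HATECHECK's actual rules.
    import random

    def swap_adjacent_chars(text):        # F25: "hate" -> "haet"
        i = random.randrange(len(text) - 1)
        chars = list(text)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    def drop_char(text):                  # F26: "cancer" -> "cancr"
        i = random.randrange(len(text))
        return text[:i] + text[i + 1:]

    def drop_word_boundary(text):         # F27: "Death to" -> "Deathto"
        return text.replace(" ", "", 1)

    def space_out(word):                  # F28: "kill" -> "k i l l"
        return " ".join(word)

    def leet_speak(text):                 # F29: "shit" -> "sh1t"
        return text.translate(str.maketrans("aeio", "4310"))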
Pre-Trained Transformer Models: We test an uncased BERT-base model (Devlin et al., 2019), which has been shown to achieve near state-of-the-art performance on several abuse detection tasks (Tran et al., 2020). We fine-tune BERT on two widely-used hate speech datasets from Davidson et al. (2017) and Founta et al. (2018).

The Davidson et al. (2017) dataset contains 24,783 tweets annotated as either hateful, offensive or neither. The Founta et al. (2018) dataset comprises 99,996 tweets annotated as hateful, abusive, spam and normal. For both datasets, we collapse labels other than hateful into a single non-hateful label to match HATECHECK's binary format. This is aligned with the original multi-label setup of the two datasets. Davidson et al. (2017), for instance, explicitly characterise offensive content in their dataset as non-hateful. Respectively, hateful cases make up 5.8% and 5.0% of the datasets. Details on both datasets and pre-processing steps can be found in Appendix D.

In the following, we denote BERT fine-tuned on binary Davidson et al. (2017) data by B-D and BERT fine-tuned on binary Founta et al. (2018) data by B-F. To account for class imbalance, we use class weights emphasising the hateful minority class (He and Garcia, 2009). For both datasets, we use a stratified 80/10/10 train/dev/test split. Macro F1 on the held-out test sets is 70.8 for B-D and 70.3 for B-F.[6] Details on model training and parameters can be found in Appendix E.

[6] For better comparability to previous work, we also fine-tuned unweighted versions of our models on the original multiclass D and F data. Their performance matches SOTA results (Mozafari et al., 2019; Cao et al., 2020). Details in Appx. F.
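A minimal sketch of the preprocessing described above: collapsing the original class labels into a binary hateful/non-hateful label, creating a stratified 80/10/10 split, and computing class weights for the hateful minority class. The file path, column names and label strings are assumptions for illustration; the exact steps are documented in Appendices D and E.

    # Sketch of label binarisation, stratified splitting and class weighting.
    # File path, column names and label strings are illustrative assumptions.
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.utils.class_weight import compute_class_weight

    df = pd.read_csv("davidson2017.csv")                    # hypothetical export
    df["label"] = (df["class"] == "hateful").astype(int)    # everything else -> non-hateful

    train, rest = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=123)
    dev, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=123)

    # class weights that emphasise the hateful minority class during fine-tuning
    weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=train["label"])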
Commercial Models: We test Google Jigsaw's Perspective (P) and Two Hat's SiftNinja (SN).[7] Both are popular models for content moderation developed by major tech companies that can be accessed by registered users via an API.

[7] www.perspectiveapi.com and www.siftninja.com

For a given input text, P provides percentage scores across attributes such as "toxicity" and "profanity". We use "identity attack", which aims at identifying "negative or hateful comments targeting someone because of their identity" and thus aligns closely with our definition of hate speech (§1). We convert the percentage score to a binary label using a cutoff of 50%. We tested P in December 2020.

For SN, we use its 'hate speech' attribute ("attacks [on] a person or group on the basis of personal attributes or identities"), which distinguishes between 'mild', 'bad', 'severe' and 'no' hate. We mark all but 'no' hate as 'hateful' to obtain binary labels. We tested SN in January 2021.
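The two score-to-label conversions described above amount to a fixed threshold for Perspective and a simple level mapping for SiftNinja. The sketch below makes both explicit; retrieving the raw scores from the respective APIs is left out and would need to follow each vendor's documentation.

    # Sketch of the binarisation used for the two commercial models.
    # Fetching the underlying scores from the APIs is not shown here.
    def perspective_to_binary(identity_attack_score):
        """Binarise Perspective's 'identity attack' score at the 50% cutoff."""
        return "hateful" if identity_attack_score >= 0.5 else "non-hateful"

    def siftninja_to_binary(hate_level):
        """Map SiftNinja's 'mild'/'bad'/'severe'/'no' hate levels to binary labels."""
        return "non-hateful" if hate_level == "no" else "hateful"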
3.2 Results

We assess model performance on HATECHECK using accuracy, i.e. the proportion of correctly classified test cases. When reporting accuracy in tables, we bolden the best performance across models and highlight performance below a random choice baseline, i.e. 50% for our binary task, in cursive red.

Performance Across Labels: All models show clear performance deficits when tested on hateful and non-hateful cases in HATECHECK (Table 2). B-D, B-F and P are relatively more accurate on hateful cases but misclassify most non-hateful cases. In total, P performs best. SN performs worst and is strongly biased towards classifying all cases as non-hateful, making it highly accurate on non-hateful cases while misclassifying most hateful cases.

Label | n | B-D | B-F | P | SN
Hateful | 2,563 | 75.5 | 65.5 | 89.5 | 9.0
Non-hateful | 1,165 | 36.0 | 48.5 | 48.2 | 86.6
Total | 3,728 | 63.2 | 60.2 | 76.6 | 33.2

Table 2: Model accuracy (%) by test case label.

Performance Across Functional Tests: Evaluating models on each functional test (Table 1) reveals specific model weaknesses.

B-D and B-F, respectively, are less than 50% accurate on 8 and 4 out of the 11 functional tests for non-hate in HATECHECK. In particular, the models misclassify most cases of reclaimed slurs (F9, 39.5% and 33.3% correct), negated hate (F15, 12.8% and 12.0% correct) and counter speech (F20/21, 26.6%/29.1% and 32.9%/29.8% correct). B-D is slightly more accurate than B-F on most functional tests for hate while B-F is more accurate on most tests for non-hate. Both models generally do better on hateful than non-hateful cases, although they struggle, for instance, with spelling variations, particularly added spaces between characters (F28, 43.9% and 37.6% correct) and leet speak spellings (F29, 48.0% and 43.9% correct).

P performs better than B-D and B-F on most functional tests. It is over 95% accurate on 11 out of 18 functional tests for hate and substantially more accurate than B-D and B-F on spelling variations (F25-29). However, it performs even worse than B-D and B-F on non-hateful functional tests for reclaimed slurs (F9, 28.4% correct), negated hate (F15, 3.8% correct) and counter speech (F20/21, 15.6%/18.4% correct).

Due to its bias towards classifying all cases as non-hateful, SN misclassifies most hateful cases and is near-perfectly accurate on non-hateful functional tests. Exceptions to the latter are counter speech (F20/21, 79.8%/79.4% correct) and non-hateful slur usage (F8/9, 33.3%/18.5% correct).

Performance on Individual Functional Tests: Individual functional tests can be investigated further to show more granular model weaknesses. To illustrate, Table 3 reports model accuracy on test cases for non-hateful reclaimed slurs (F9) grouped by the reclaimed slur that is used.

Recl. Slur | n | B-D | B-F | P | SN
N*gga | 19 | 89.5 | 0.0 | 0.0 | 0.0
F*g | 16 | 0.0 | 6.2 | 0.0 | 0.0
F*ggot | 16 | 0.0 | 6.2 | 0.0 | 0.0
Q*eer | 15 | 0.0 | 73.3 | 80.0 | 0.0
B*tch | 15 | 100.0 | 93.3 | 73.3 | 100.0

Table 3: Model accuracy (%) on test cases for reclaimed slurs (F9, non-hateful) by which slur is used.

Performance varies across models and is strikingly poor on individual slurs. B-D misclassifies all instances of "f*g", "f*ggot" and "q*eer". B-F and P perform better for "q*eer", but fail on "n*gga". SN fails on all cases but reclaimed uses of "b*tch".

Performance Across Target Groups: HATECHECK can test whether models exhibit 'unintended biases' (Dixon et al., 2018) by comparing their performance on cases which target different groups. To illustrate, Table 4 shows model accuracy on all test cases created from [IDENTITY] templates, which only differ in the group identifier.

Target Group | n | B-D | B-F | P | SN
Women | 421 | 34.9 | 52.3 | 80.5 | 23.0
Trans ppl. | 421 | 69.1 | 69.4 | 80.8 | 26.4
Gay ppl. | 421 | 73.9 | 74.3 | 80.8 | 25.9
Black ppl. | 421 | 69.8 | 72.2 | 80.5 | 26.6
Disabled ppl. | 421 | 71.0 | 37.1 | 79.8 | 23.0
Muslims | 421 | 72.2 | 73.6 | 79.6 | 27.6
Immigrants | 421 | 70.5 | 58.9 | 80.5 | 25.9

Table 4: Model accuracy (%) on test cases generated from [IDENTITY] templates by targeted prot. group.

B-D misclassifies test cases targeting women twice as often as those targeted at other groups. B-F also performs relatively worse for women and fails on most test cases targeting disabled people. By contrast, P is consistently around 80% and SN around 25% accurate across target groups.
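Breakdowns like those in Tables 1-4 can be reproduced from per-case predictions with simple group-by aggregations. The sketch below assumes a data frame with functionality, gold label and target-group columns plus a model_predict function; all of these names are placeholders rather than a fixed interface.

    # Sketch: per-label, per-test and per-target accuracy breakdowns (cf. Tables 1-4).
    # Column names and model_predict are placeholders, not a fixed interface.
    import pandas as pd

    cases = pd.read_csv("hatecheck_cases.csv")               # functionality, label, target group
    cases["pred"] = [model_predict(t) for t in cases["test_case"]]
    cases["correct"] = cases["pred"] == cases["label_gold"]

    acc_by_label = cases.groupby("label_gold")["correct"].mean() * 100     # cf. Table 2
    acc_by_test = cases.groupby("functionality")["correct"].mean() * 100   # cf. Table 1
    acc_by_target = cases.groupby("target_ident")["correct"].mean() * 100  # cf. Table 4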
3.3 Discussion

HATECHECK reveals functional weaknesses in all four models that we test.

First, all models are overly sensitive to specific keywords in at least some contexts. B-D, B-F and P perform well for both hateful and non-hateful cases of profanity (F10/11), which shows that they can distinguish between different uses of certain profanity terms. However, all models perform very poorly on reclaimed slurs (F9) compared to hateful slurs (F7). Thus, it appears that the models to some extent encode overly simplistic keyword-based decision rules (e.g. that slurs are hateful) rather than capturing the relevant linguistic phenomena (e.g. that slurs can have non-hateful reclaimed uses).

Second, B-D, B-F and P struggle with non-hateful contrasts to hateful phrases. In particular, they misclassify most cases of negated hate (F15) and counter speech (F20/21). Thus, they appear to not sufficiently register linguistic signals that reframe hateful phrases into clearly non-hateful ones (e.g. "No Muslim deserves to die").

Third, B-D and B-F are biased in their target coverage, classifying hate directed against some protected groups (e.g. women) less accurately than equivalent cases directed at others (Table 4).

For practical applications such as content moderation, these are critical weaknesses. Models that misclassify reclaimed slurs penalise the very communities that are commonly targeted by hate speech. Models that misclassify counter speech undermine positive efforts to fight hate speech. Models that are biased in their target coverage are likely to create and entrench biases in the protections afforded to different groups.

As a suite of black-box tests, HATECHECK only offers indirect insights into the source of these weaknesses.
Poor performance on functional tests can be a consequence of systematic gaps and biases in model training data. It can also indicate a more fundamental inability of the model's architecture to capture relevant linguistic phenomena. B-D and B-F share the same architecture but differ in performance on functional tests and in target coverage. This reflects the importance of training data composition, which previous hate speech research has emphasised (Wiegand et al., 2019; Nejadgholi and Kiritchenko, 2020). Future work could investigate the provenance of model weaknesses in more detail, for instance by using test cases from HATECHECK to "inoculate" training data (Liu et al., 2019).

If poor model performance does stem from biased training data, models could be improved through targeted data augmentation (Gardner et al., 2020). HATECHECK users could, for instance, sample or construct additional training cases to resemble test cases from functional tests that their model was inaccurate on, bearing in mind that this additional data might introduce other unforeseen biases. The models we tested would likely benefit from training on additional cases of negated hate, reclaimed slurs and counter speech.

4 Limitations

4.1 Negative Predictive Power

Good performance on a functional test in HATECHECK only reveals the absence of a particular weakness, rather than necessarily characterising a generalisable model strength. This negative predictive power (Gardner et al., 2020) is common, to some extent, to all finite test sets. Thus, claims about model quality should not be overextended based on positive HATECHECK results. In model development, HATECHECK offers targeted diagnostic insights as a complement to rather than a substitute for evaluation on held-out test sets of real-world hate speech.

4.2 Out-Of-Scope Functionalities

Each test case in HATECHECK is a separate English-language text document. Thus, HATECHECK does not test functionalities related to context outside individual documents, modalities other than text or languages other than English. Future research could expand HATECHECK to include functional tests covering such aspects.

Functional tests in HATECHECK cover distinct expressions of hate and non-hate. Future work could test more complex compound statements, such as cases combining slurs and profanity. Further, HATECHECK is static and thus does not test functionalities related to language change. This could be addressed by "live" datasets, such as dynamic adversarial benchmarks (Nie et al., 2020; Vidgen et al., 2020b; Kiela et al., 2021).

4.3 Limited Coverage

Future research could expand HATECHECK to cover additional protected groups. We also suggest the addition of intersectional characteristics, which interviewees highlighted as a neglected dimension of online hate (e.g. I17: "As a black woman, I receive abuse that is racialised and gendered"). Similarly, future research could include hateful slurs beyond those covered by HATECHECK.

Lastly, future research could craft test cases using more platform- or community-specific language than HATECHECK's more general test cases. It could also test hate that is more specific to particular target groups, such as misogynistic tropes.
5 Related Work

Targeted diagnostic datasets like the sets of test cases in HATECHECK have been used for model evaluation across a wide range of NLP tasks, such as natural language inference (Naik et al., 2018; McCoy et al., 2019), machine translation (Isabelle et al., 2017; Belinkov and Bisk, 2018) and language modelling (Marvin and Linzen, 2018; Ettinger, 2020). For hate speech detection, however, they have seen very limited use. Palmer et al. (2020) compile three datasets for evaluating model performance on what they call complex offensive language, specifically the use of reclaimed slurs, adjective nominalisation and linguistic distancing. They select test cases from other datasets sampled from social media, which introduces substantial disagreement between annotators on labels in their data. Dixon et al. (2018) use templates to generate synthetic sets of toxic and non-toxic cases, which resembles our method for test case creation. They focus primarily on evaluating biases around the use of group identifiers and do not validate the labels in their dataset. Compared to both approaches, HATECHECK covers a much larger range of model functionalities, and all test cases, which we generated specifically to fit a given functionality, have clear gold standard labels, which are validated by near-perfect agreement between annotators.
In its use of contrastive cases for model evaluation, HATECHECK builds on a long history of minimally-contrastive pairs in NLP (e.g. Levesque et al., 2012; Sennrich, 2017; Glockner et al., 2018; Warstadt et al., 2020). Most relevantly, Kaushik et al. (2020) and Gardner et al. (2020) propose augmenting NLP datasets with contrastive cases for training more generalisable models and enabling more meaningful evaluation. We built on their approaches to generate non-hateful contrast cases in our test suite, which is the first application of this kind for hate speech detection.

In terms of its structure, HATECHECK is most directly influenced by the CHECKLIST framework proposed by Ribeiro et al. (2020). However, while they focus on demonstrating its general applicability across NLP tasks, we put more emphasis on motivating the selection of functional tests as well as constructing and validating targeted test cases specifically for the task of hate speech detection.

6 Conclusion

In this article, we introduced HATECHECK, a suite of functional tests for hate speech detection models. We motivated the selection of functional tests through interviews with civil society stakeholders and a review of previous hate speech research, which grounds our approach in both practical and academic applications of hate speech detection models. We designed the functional tests to offer contrasts between hateful and non-hateful content that are challenging to detection models, which enables more accurate evaluation of their true functionalities. For each functional test, we crafted sets of targeted test cases with clear gold standard labels, which we validated through a structured annotation process.

We demonstrated the utility of HATECHECK as a diagnostic tool by testing near-state-of-the-art transformer models as well as two commercial models for hate speech detection. HATECHECK showed critical weaknesses for all models. Specifically, models appeared overly sensitive to particular keywords and phrases, as evidenced by poor performance on tests for reclaimed slurs, counter speech and negated hate. The transformer models also exhibited strong biases in target coverage.

Online hate is a deeply harmful phenomenon, and detection models are integral to tackling it. Typically, models have been evaluated on held-out test data, which has made it difficult to assess their generalisability and identify specific weaknesses. We hope that HATECHECK's targeted diagnostic insights help address this issue by contributing to our understanding of models' limitations, thus aiding the development of better models in the future.

Acknowledgments

We thank all interviewees for their participation. We also thank reviewers for their constructive feedback. Paul Röttger was funded by the German Academic Scholarship Foundation. Bertram Vidgen and Helen Margetts were supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the "Criminal Justice System" theme within that grant, and the "Hate Speech: Measures & Counter-Measures" project at The Alan Turing Institute. Dong Nguyen was supported by the "Digital Society - The Informed Citizen" research programme, which is (partly) financed by the Dutch Research Council (NWO), project 410.19.007. Zeerak Waseem was supported in part by the Canada 150 Research Chair program and the UK-Canada AI Artificial Intelligence Initiative. Janet B. Pierrehumbert was supported by EPSRC Grant EP/T023333/1.
Impact Statement

This supplementary section addresses relevant ethical considerations that were not explicitly discussed in the main body of our article.

Interview Participant Rights: All interviewees gave explicit consent for their participation after being informed in detail about the research use of their responses. In all research output, quotes from interview responses were anonymised. We also did not reveal specific participant demographics or affiliations. Our interview approach was approved by the Alan Turing Institute's Ethics Review Board.

Intellectual Property Rights: The test cases in HATECHECK were crafted by the authors. As synthetic data, they pose no risk of violating intellectual property rights.

Annotator Compensation: We employed a team of ten annotators to validate the quality of the HATECHECK dataset. Annotators were compensated at a rate of £16 per hour.
The rate was set 50% above the local living wage (£10.85), although all work was completed remotely. All training time and meetings were paid.

Intended Use: HATECHECK's intended use is as an evaluative tool for hate speech detection models, providing structured and targeted diagnostic insights into model functionalities. We demonstrated this use of HATECHECK in §3. We also briefly discussed alternative uses of HATECHECK, e.g. as a starting point for data augmentation. These uses aim at aiding the development of better hate speech detection models.

Potential Misuse: Researchers might overextend claims about the functionalities of their models based on their test performance, which we would consider a misuse of HATECHECK. We directly addressed this concern by highlighting HATECHECK's negative predictive power, i.e. the fact that it primarily reveals model weaknesses rather than necessarily characterising generalisable model strengths, as one of its limitations. For the same reason, we emphasised the limits to HATECHECK's coverage, e.g. in terms of slurs and identity terms.

References

Betty van Aken, Julian Risch, Ralf Krestel, and Alexander Löser. 2018. Challenges for toxic comment classification: An in-depth error analysis. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 33–42, Brussels, Belgium. Association for Computational Linguistics.

Michele Banko, Brendon MacKeen, and Laurie Ray. 2020. A unified taxonomy of harmful content. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 125–137. Association for Computational Linguistics.

Boris Beizer. 1995. Black-box testing: Techniques for functional testing of software and systems. John Wiley & Sons, Inc.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of the 6th International Conference on Learning Representations.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Pete Burnap and Matthew L Williams. 2015. Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Policy & Internet, 7(2):223–242.

Rui Cao, Roy Ka-Wei Lee, and Tuan-Anh Hoang. 2020. DeepHate: Hate speech detection via multi-faceted text representations. In Proceedings of the 12th ACM Conference on Web Science, pages 11–20.

Juliet M Corbin and Anselm Strauss. 1990. Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology, 13(1):3–21.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the 11th International AAAI Conference on Web and Social Media, pages 512–515. Association for the Advancement of Artificial Intelligence.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4537–4546, Hong Kong, China. Association for Computational Linguistics.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73. Association for Computing Machinery.

Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.

Paula Fortuna and Sérgio Nunes. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR), 51(4):1–30.

Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. In Proceedings of the 12th International AAAI Conference on Web and Social Media, pages 491–500. Association for the Advancement of Artificial Intelligence.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.

Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfactual fairness in text classification through robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES '19, pages 219–226, New York, NY, USA. Association for Computing Machinery.

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China. Association for Computational Linguistics.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655, Melbourne, Australia. Association for Computational Linguistics.

Jennifer Golbeck, Zahra Ashktorab, Rashad O Banjo, Alexandra Berlinger, Siddharth Bhagwan, Cody Buntain, Paul Cheakalos, Alicia A Geller, Rajesh Kumar Gnanasekaran, Raja Rajan Gunasekaran, et al. 2017. A large labeled corpus for online harassment research. In Proceedings of the 9th ACM Conference on Web Science, pages 229–233. Association for Computing Machinery.

Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N Asokan. 2018. All you need is "love": Evading hate speech detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security, pages 2–12. Association for Computing Machinery.

Kevin A. Hallgren. 2012. Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1):23–34.

Haibo He and Edwardo A Garcia. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284.

Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving Google's Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138.

Pierre Isabelle, Colin Cherry, and George Foster. 2017. A challenge set approach to evaluating machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2486–2496, Copenhagen, Denmark. Association for Computational Linguistics.

ISO/IEC/IEEE 24765:2017(E). 2017. Systems and Software Engineering – Vocabulary. Standard, International Organization for Standardisation, Geneva, Switzerland.

Divyansh Kaushik, Eduard Hovy, and Zachary C. Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In Proceedings of the 8th International Conference on Learning Representations.

Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Davani, Morteza Dehghani, and Xiang Ren. 2020. Contextualizing hate speech classifiers with post-hoc explanation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5435–5442, Online. Association for Computational Linguistics.

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. 2021. Dynabench: Rethinking benchmarking in NLP. arXiv preprint arXiv:2104.14337.

Jana Kurrek, Haji Mohammad Saleem, and Derek Ruths. 2020. Towards a comprehensive taxonomy and large-scale annotated corpus for online slur usage. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 138–149, Online. Association for Computational Linguistics.

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In 13th International Conference on the Principles of Knowledge Representation and Reasoning.

Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179, Minneapolis, Minnesota. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations.

Shervin Malmasi and Marcos Zampieri. 2018. Challenges in discriminating profanity from hate speech. Journal of Experimental & Theoretical Artificial Intelligence, 30(2):187–202.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Julia Mendelsohn, Yulia Tsvetkov, and Dan Jurafsky. 2020. A framework for the computational linguistic analysis of dehumanization. Frontiers in Artificial Intelligence, 3:55.

Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. 2020. Tackling online abuse: A survey of automated abuse detection methods. arXiv preprint arXiv:1908.06024.

Marzieh Mozafari, Reza Farahbakhsh, and Noel Crespi. 2019. A BERT-based transfer learning approach for hate speech detection in online social media. In International Conference on Complex Networks and Their Applications, pages 928–940. Springer.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Isar Nejadgholi and Svetlana Kiritchenko. 2020. On cross-dataset generalization in automatic detection of online abuse. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 173–183, Online. Association for Computational Linguistics.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.

Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 145–153, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.

Alexis Palmer, Christine Carr, Melissa Robinson, and Jordan Sanders. 2020. COLD: Annotation scheme and evaluation data set for complex offensive language in English. Journal for Language Technology and Computational Linguistics, pages 1–28.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, Brussels, Belgium. Association for Computational Linguistics.

Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, and Viviana Patti. 2020. Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation, pages 1–47.

Jing Qian, Mai ElSherief, Elizabeth Belding, and William Yang Wang. 2018. Leveraging intra-user and inter-user representation learning for automated hate speech detection. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 118–123, New Orleans, Louisiana. Association for Computational Linguistics.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.

Joni Salminen, Hind Almerekhi, Milica Milenković, Soon-gyo Jung, Jisun An, Haewoon Kwak, and Bernard Jansen. 2018. Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. Proceedings of the 12th International AAAI Conference on Web and Social Media, (1):330–339.

Mattia Samory, Indira Sen, Julian Kohne, Fabian Floeck, and Claudia Wagner. 2020. Unsex me here: Revisiting sexism detection using psychological scales and adversarial samples. arXiv preprint arXiv:2004.12764.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.

Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020. Social bias frames: Reasoning about social and power implications of language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5477–5490, Online. Association for Computational Linguistics.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10, Valencia, Spain. Association for Computational Linguistics.

Rico Sennrich. 2017. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 376–382, Valencia, Spain. Association for Computational Linguistics.

Deven Santosh Shah, H. Andrew Schwartz, and Dirk Hovy. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.

Thanh Tran, Yifan Hu, Changwei Hu, Kevin Yen, Fei Tan, Kyumin Lee, and Se Rim Park. 2020. HABERTOR: An efficient and effective deep hatespeech detector. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7486–7502, Online. Association for Computational Linguistics.

Bertie Vidgen and Leon Derczynski. 2020. Directions in abusive language training data, a systematic review: Garbage in, garbage out. PLOS ONE, 15(12):e0243300.

Bertie Vidgen, Scott Hale, Ella Guest, Helen Margetts, David Broniatowski, Zeerak Waseem, Austin Botelho, Matthew Hall, and Rebekah Tromble. 2020a. Detecting East Asian prejudice on social media. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 162–172, Online. Association for Computational Linguistics.

Bertie Vidgen, Alex Harris, Dong Nguyen, Rebekah Tromble, Scott Hale, and Helen Margetts. 2019. Challenges and frontiers in abusive content detection. In Proceedings of the Third Workshop on Abusive Language Online, pages 80–93, Florence, Italy. Association for Computational Linguistics.

Bertie Vidgen, Tristan Thrush, Zeerak Waseem, and Douwe Kiela. 2020b. Learning from the worst: Dynamically generated datasets to improve online hate detection. CoRR, abs/2012.15761.

William Warner and Julia Hirschberg. 2012. Detecting hate speech on the World Wide Web. In Proceedings of the Second Workshop on Language in Social Media, pages 19–26, Montréal, Canada. Association for Computational Linguistics.

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142, Austin, Texas. Association for Computational Linguistics.

Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, pages 78–84, Vancouver, BC, Canada. Association for Computational Linguistics.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California. Association for Computational Linguistics.

Zeerak Waseem, James Thorne, and Joachim Bingel. 2018. Bridging the gaps: Multi-task learning for domain transfer of hate speech detection. In Jennifer Golbeck, editor, Online Harassment. Springer.

Michael Wiegand, Josef Ruppenhofer, and Thomas Kleinbauer. 2019. Detection of abusive language: The problem of biased datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 602–608, Minneapolis, Minnesota. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2019. Errudite: Scalable, reproducible, and testable error analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.