Toward Gender-Inclusive Coreference Resolution
Yang Trista Cao (University of Maryland, ycao95@cs.umd.edu) and Hal Daumé III (University of Maryland / Microsoft Research, me@hal3.name)

Abstract

Correctly resolving textual mentions of people fundamentally entails making inferences about those people. Such inferences raise the risk of systemic biases in coreference resolution systems, including biases that can harm binary and non-binary trans and cis stakeholders. To better understand such biases, we foreground nuanced conceptualizations of gender from sociology and sociolinguistics, and develop two new datasets for interrogating bias in crowd annotations and in existing coreference resolution systems. Through these studies, conducted on English text, we confirm that without acknowledging and building systems that recognize the complexity of gender, we build systems that lead to many potential harms.

1 Introduction

Coreference resolution—the task of determining which textual references resolve to the same real-world entity—requires making inferences about those entities. Especially when those entities are people, coreference resolution systems run the risk of making unlicensed inferences, possibly resulting in harms either to individuals or groups of people. Embedded in coreference inferences are varied aspects of gender, both because gender can show up explicitly (e.g., pronouns in English, morphology in Arabic) and because societal expectations and stereotypes around gender roles may be explicitly or implicitly assumed by speakers or listeners. This can lead to significant biases in coreference resolution systems: cases where systems "systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others" (Friedman and Nissenbaum, 1996, p. 332).

Gender bias in coreference resolution can manifest in many ways; work by Rudinger et al. (2018), Zhao et al. (2018a), and Webster et al. (2018) focused largely on the case of binary gender discrimination in trained coreference systems, showing that current systems over-rely on social stereotypes when resolving HE and SHE pronouns[1] (see §2). Contemporaneously, critical work in Human-Computer Interaction has complicated discussions around gender in other fields, such as computer vision (Keyes, 2018; Hamidi et al., 2018).

Building on both lines of work, and inspired by Keyes's (2018) study of vision-based automatic gender recognition systems, we consider gender bias from a broader conceptual frame than the binary "folk" model. We investigate ways in which folk notions of gender—namely that there are two genders, assigned at birth, immutable, and in perfect correspondence to gendered linguistic forms—lead to the development of technology that is exclusionary of and harmful to binary and non-binary trans and cis people.[2] Addressing such issues is critical not just to improve the quality of our systems, but more pointedly to minimize the harms our systems cause by reinforcing existing unjust social hierarchies (Lambert and Packer, 2019).

There are several stakeholder groups who may easily face harms when coreference systems are used (Blodgett et al., 2020). These harms are both allocational and representational (Barocas et al., 2017), and include quality of service, erasure, and stereotyping harms.

[1] Throughout, we avoid mapping pronouns to a "gender" label, preferring to use the pronoun directly, including (in English) SHE, HE, the non-binary use of singular THEY, and neopronouns (e.g., ZE/HIR, XEY/XEM), which have been in usage since at least the 1970s (Bustillos, 2011; Merriam-Webster, 2016; Bradley et al., 2019; Hord, 2016; Spivak, 1997).
[2] Following GLAAD (2007), transgender individuals are those whose gender differs from the sex they were assigned at birth. This is in opposition to cisgender individuals, whose assigned sex at birth happens to correspond to their gender. Transgender individuals can either be binary (those whose gender falls in the "male/female" dichotomy) or non-binary (those for which the relationship is more complex).
Following Bender's (2019) taxonomy of stakeholders and Barocas et al.'s (2017) taxonomy of harms, there are several ways in which trans-exclusionary coreference resolution systems can cause harm:

Indirect: subject of query. If a person is the subject of a web query, pages about xem may be missed if "multiple mentions of query" is a ranking feature and the system cannot resolve xyr pronouns ⇒ quality of service, erasure.

Direct: by choice. If a grammar checker uses coreference, it may insist that an author writing hir third-person autobiography is repeatedly making errors when referring to hirself ⇒ quality of service, stereotyping, denigration.

Direct: not by choice. If an information extraction system run on résumés relies on cisnormative assumptions, job experiences of a candidate who has transitioned and changed his pronouns may be missed ⇒ allocative, erasure.

Many stakeholders. If a machine translation system uses discourse context to generate pronouns, then errors can result in directly misgendering subjects of the document being translated ⇒ quality of service, denigration, erasure.

To address such harms as well as understand where and how they arise, we need to complicate (a) what "gender" means and (b) how harms can enter into natural language processing (NLP) systems. Toward (a), we begin with a unifying analysis (§3) of how gender is socially constructed, and how social conditions in the world impose expectations around people's gender. Of particular interest is how gender is reflected in language, and how that both matches and potentially mismatches the way people experience their gender in the world. Then, in order to understand social biases around gender, we find it necessary to consider the different ways in which gender can be realized linguistically, breaking down what have previously been considered "gendered words" in NLP papers into the finer-grained categories identified in the sociolinguistics literature: lexical, referential, grammatical, and social gender.

Toward (b), we focus on how bias can enter into two stages of machine learning systems: data annotation (§4) and model definition (§5). We construct two new datasets: (1) MAP (a dataset similar to GAP (Webster et al., 2018) but without binary gender constraints), on which we can perform counterfactual manipulations, and (2) GICoref (a fully annotated coreference resolution dataset written by and about trans people).[3] In all cases, we focus largely on harms due to over- and under-representation (Kay et al., 2015), replicating stereotypes (Sweeney, 2013; Caliskan et al., 2017), particularly those that are cisnormative and/or heteronormative, and quality of service differentials (Buolamwini and Gebru, 2018).

The primary contributions of this paper are: (1) Connecting existing work on gender bias in NLP to sociological and sociolinguistic conceptions of gender to provide a scaffolding for future work on analyzing "gender bias in NLP" (§3). (2) Developing an ablation technique for measuring gender bias in coreference resolution annotations, focusing on the human bias that can enter into annotation tasks (§4). (3) Constructing a new dataset, the Gender Inclusive Coreference dataset (GICoref), for testing performance of coreference resolution systems on texts that discuss non-binary and binary transgender people (§5).

2 Related Work

There are four recent papers that consider gender bias in coreference resolution systems. Rudinger et al. (2018) evaluate coreference systems for evidence of occupational stereotyping by constructing Winograd-esque (Levesque et al., 2012) test examples. They find that humans can reliably resolve these examples, but systems largely fail at them, typically in a gender-stereotypical way. In contemporaneous work, Zhao et al. (2018a) proposed a very similar, also Winograd-esque scheme, also for measuring gender-based occupational stereotypes. In addition to reaching similar conclusions to Rudinger et al. (2018), this work also used a "counterfactual" data process similar to the one we use in §4.1 in order to provide additional training data to a coreference resolution system. Webster et al. (2018) produced the GAP dataset for evaluating coreference systems, by specifically seeking examples where "gender" (left underspecified) could not be used to help coreference. They found that coreference systems struggle in these cases, also pointing to the fact that some of the success of current coreference systems is due to reliance on (binary) gender stereotypes. Finally, Ackerman (2019) presents an alternative breakdown of gender from the one we use (§3), and proposes matching criteria for modeling coreference resolution linguistically, taking a trans-inclusive perspective on gender.

[3] Both datasets are released under a BSD license at github.com/TristaCao/into_inclusivecoref with corresponding datasheets (Gebru et al., 2018).
Gender bias in NLP has been considered more broadly than just in coreference resolution, including in natural language inference (Rudinger et al., 2017), word embeddings (e.g., Bolukbasi et al., 2016; Romanov et al., 2019; Gonen and Goldberg, 2019), sentiment analysis (Kiritchenko and Mohammad, 2018), and machine translation (Font and Costa-jussà, 2019; Prates et al., 2019; Dryer, 2013; Frank et al., 2004; Wandruszka, 1969; Nissen, 2002; Doleschal and Schmid, 2001), among many others (Blodgett et al., 2020, inter alia). Gender is also an object of study in gender recognition systems (Hamidi et al., 2018). Much of this work has focused on gender bias with a (usually implicit) binary lens, an issue which was also called out recently by Larson (2017b) and May (2019).

3 Linguistic & Social Gender

The concept of gender is complex and contested, covering (at least) aspects of a person's internal experience, how they express this to the world, how social conditions in the world impose expectations on them (including expectations around their sexuality), and how they are perceived and accepted (or not). When this complex concept is realized in language, the situation becomes even more complex: linguistic categories of gender do not even remotely map one-to-one to social categories. As observed by Bucholtz (1999):

"Attempts to read linguistic structure directly for information about social gender are often misguided."

For instance, when working in a language like English which formally marks gender on pronouns, it is all too easy to equate "recognizing the pronoun that corefers with this name" with "recognizing the real-world gender of the referent of that name."

Furthermore, despite the impossibility of a perfect alignment with linguistic gender, it is generally clear that an incorrectly gendered reference to a person (whether through pronominalization or otherwise) can be highly problematic (Johnson et al., 2019; McLemore, 2015). This process of misgendering is problematic for both trans and cis individuals, to the extent that transgender historian Stryker (2008) writes:

"[o]ne's gender identity could perhaps best be described as how one feels about being referred to by a particular pronoun."

3.1 Sociological Gender

Many modern trans-inclusive models of gender recognize that gender encompasses many different aspects. These aspects include the experience that one has of gender (or lack thereof), the way that one expresses one's gender to the world, and the way that normative social conditions impose gender norms, typically as a dichotomy between masculine and feminine roles or traits (Kramarae and Treichler, 1985; West and Zimmerman, 1987; Butler, 1990; Risman, 2009; Serano, 2007). Gender self-determination, on the other hand, holds that each person is the "ultimate authority" on their own gender identity (Zimman, 2019; Stanley, 2014), with Zimman (2019) further arguing the importance of the role language plays in that determination.

Such trans-inclusive models deconflate anatomical and biological traits, and the sex that a person had assigned to them at birth, from one's gendered position in society; this includes intersex people, whose anatomical/biological factors do not match the usual designational criteria for either sex. Trans-inclusive views typically recognize that gender exists beyond the regressive "female"/"male" binary[4]; additionally, one's gender may shift by time or context (often "genderfluid"), and some people do not experience gender at all (often "agender") (Kessler and McKenna, 1978; Schilt and Westbrook, 2009; Darwin, 2017; Richards et al., 2017). In §5 we analyze the degree to which NLP papers make trans-inclusive or trans-exclusive assumptions.

Social gender refers to the imposition of gender roles or traits based on normative social conditions (Kramarae and Treichler, 1985), which often includes imposing a dichotomy between feminine and masculine (in behavior, dress, speech, occupation, societal roles, etc.). Ackerman (2019) highlights a highly overlapping concept, "bio-social gender", which consists of gender role, gender expression, and gender identity. Taking gender role as an example, upon learning that a nurse is coming to their hospital room, a patient may form expectations that this person is likely to be "female," and may generate expectations around how their face or body may look, how they are likely to be dressed, how and where hair may appear, how to refer to them, and so on. This process, often referred to as gendering (Serano, 2007), occurs both in real-world interactions, as well as in purely linguistic settings (e.g., reading a newspaper), in which readers may use social gender clues to assign gender(s) to the real-world people being discussed.

[4] Some authors use female/male for sex and woman/man for gender; we do not need this distinction (which is itself contestable) and use female/male for gender.
3.2 Linguistic Gender

Our discussion of linguistic gender largely follows (Corbett, 1991; Ochs, 1992; Craig, 1994; Corbett, 2013; Hellinger and Motschenbacher, 2015; Fuertes-Olivera, 2007), departing from earlier characterizations that postulate a direct mapping from language to gender (Lakoff, 1975; Silverstein, 1979). Our taxonomy is related but not identical to that of Ackerman (2019), which we discuss in §2.

Grammatical gender, similarly defined in Ackerman (2019), is nothing more than a classification of nouns based on a principle of grammatical agreement. In "gender languages" there are typically two or three grammatical genders that have, for animate or personal references, considerable correspondence between a FEM (resp. MASC) grammatical gender and referents with female- (resp. male-)[5] social gender. In comparison, "noun class languages" have no such correspondence, and typically many more classes. Some languages have no grammatical gender at all; English is generally seen as one (Nissen, 2002; Baron, 1971), though this is contested (Bjorkman, 2017).

Referential gender (similar, but not identical, to Ackerman's (2019) "conceptual gender") relates linguistic expressions to extra-linguistic reality, typically identifying referents as "female," "male," or "gender-indefinite." Fundamentally, referential gender only exists when there is an entity being referred to, and their gender (or sex) is realized linguistically. The most obvious examples in English are gendered third person pronouns (SHE, HE), including neopronouns (ZE, EM) and singular THEY[6], but it also includes cases like "policeman" when the intended referent of this noun has social gender "male" (though not when "policeman" is used non-referentially, as in "every policeman needs to hold others accountable").

Lexical gender refers to an extra-linguistic property of female-ness or male-ness in a non-referential way, as in terms like "mother" as well as gendered terms of address like "Mrs." Importantly, lexical gender is a property of the linguistic unit, not a property of its referent in the real world, which may or may not exist. For instance, in "Every son loves his parents", there is no real-world referent of "son" (and therefore no referential gender), yet it still (likely) takes HIS as a pronoun anaphor because "son" has lexical gender MASC.

3.3 Social and Linguistic Gender Interplays

The relationship between these aspects of gender is complex, and none is one-to-one. The referential gender of an individual (e.g., pronouns in English) may or may not match their social gender, and this may change by context. This can happen in the case of people whose everyday life experience of their gender fluctuates over time (at any interval), as well as in the case of drag performers (e.g., some men who perform drag are addressed as SHE while performing, and HE when not (National Center for Transgender Equality, 2017)). The other linguistic forms of gender (grammatical, lexical) also need not match each other, nor match referential gender (Hellinger and Motschenbacher, 2015).

Social gender (societal expectations, in particular) captures the observation that upon hearing "My cousin is a librarian", many speakers will infer "female" for "cousin", because of either an entailment of "librarian" or some sort of probabilistic inference (Lyons, 1977), but not based on either grammatical gender (which does not exist in English) or lexical gender. We focus on English, which has no grammatical gender, but does have lexical gender. English also marks referential gender on singular third person pronouns.

Below, we use this more nuanced notion of different types of gender to inspect how biases play out in coreference resolution systems. These biases may arise in the context of any of these notions of gender, and we encourage future work to extend care over, and be explicit about, what notions of gender are being utilized and when.

4 Bias in Human Annotation

A possible source of bias in coreference systems comes from human annotations on the data used to train them. Such biases can arise from a combination of (possibly) underspecified annotation guidelines and the positionality of the annotators themselves. In this section, we study how different aspects of linguistic gender impact an annotator's judgments of anaphora.

[5] One difficulty in this discussion is that linguistic gender and social gender use the terms "feminine" and "masculine" differently; to avoid confusion, when referring to the linguistic properties, we use FEM and MASC.
[6] People's mental acceptability of singular THEY is still relatively low even with its increased usage (Prasad and Morris, 2020), and depends on context (Conrod, 2018).
This parallels Ackerman's (2019) linguistic analysis, in which a Broad Matching Criterion is proposed, positing that "matching gender requires at least one level of the mental representation of gender to be identical to the candidate antecedent in order to match."

Our study can be seen as evaluating which conceptual properties of gender are most salient in human judgments. We start with natural text in which we can cast the coreference task as a binary classification problem ("which of these two names does this pronoun refer to?"), inspired by Webster et al. (2018). We then generate "counterfactual augmentations" of this dataset by ablating the various notions of linguistic gender described in §3.2, similar to Zmigrod et al. (2019). We finally evaluate the impact of these ablations on human annotation behavior to answer the question: which forms of linguistic knowledge are most essential for human annotators to make consistent judgments? See Appendix A for examples of how linguistic gender may be used to infer social gender.

4.1 Ablation Methodology

In order to determine which cues annotators are using and the degree to which they use them, we construct an ablation study in which we hide various aspects of gender and evaluate how this impacts annotators' judgments of anaphoricity. We construct binary classification examples taken from Wikipedia pages, in which a single pronoun is selected and two possible antecedent names are given, and the annotator must select which one. We cannot use Webster et al.'s GAP dataset directly, because their data is constrained so that the "gender" of the two possible antecedents is "the same"[7]; for us, we are specifically interested in how annotators make decisions even when additional gender information is available. Thus, we construct a dataset called Maybe Ambiguous Pronoun (MAP) following Webster et al.'s approach, but we do not restrict the two names to match gender.

In ablating gender information, one challenge is that removing social gender cues (e.g., "nurse" tending female) is not possible because they can exist anywhere. Likewise, it is not possible to remove syntactic cues in a non-circular manner. For example, in (1), syntactic structure strongly suggests the antecedent of "herself" is "Liang", making it less likely that "He" corefers with Liang later (though it is possible, and such cases exist in natural data due either to genderfluidity or misgendering).

(1) Liang saw herself in the mirror. . . He . . .

Fortunately, it is possible to enumerate a high-coverage list of English terms that signal lexical gender: terms of address (Mrs., Mr.) and semantically gendered nouns (mother).[8] We assembled a list by taking many online lists (mostly targeted at English language learners), merging them, and manually filtering. The assembling process and the final list are published with the MAP dataset and its datasheet.

To execute the "hiding" of various aspects of gender, we use the following substitutions:
(a) ¬PRO: Replace third person pronouns with gender neutral variants (THEY, XEY, ZE).
(b) ¬NAME: Replace names by random names with only a first initial and last name.
(c) ¬SEM: Replace semantically gendered nouns with gender-indefinite variants.
(d) ¬ADDR: Remove terms of address.[9]
See Figure 1 for an example of all substitutions.

[Figure 1: Example of applying all ablation substitutions for an example context in the MAP corpus, with each substitution type marked over the arrow: "(d) Mrs. → ∅ (b) Rebekah Johnson Bobbitt → M. Booth was the younger (c) sister → sibling of (b) Lyndon B. Johnson → T. Schneider, 36th President of the United States. Born in 1910 in Stonewall, Texas, (a) she → they worked in the cataloging department of the Library of Congress in the 1930s before (a) her → their (c) brother → sibling entered politics."]

[7] It is unclear from the GAP dataset what notion of "gender" is used, nor how it was determined to be "the same."
[8] These are, however, sometimes complex. For instance, "actress" signals lexical gender of female, while "actor" may signal social gender of male and, in certain varieties of English, may also signal lexical gender of male.
[9] An alternative suggested by Cassidy Henry that we did not explore would be to replace all with Mx. or Dr.
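To make the substitution procedure concrete, here is a minimal sketch of how the four ablations might be applied to a tokenized MAP context. It is only an illustration of the approach described above, not the authors' released code: the word lists, the helper name ablate, and the name-span handling are our own assumptions.

```python
import random

# Illustrative, deliberately tiny resources; the released MAP datasheet describes
# the actual curated lists, which are much larger.
NEUTRAL_PRONOUNS = {"she": "they", "he": "they", "him": "them", "his": "their",
                    "her": "their",  # glosses over the her -> their/them ambiguity
                    "herself": "themself", "himself": "themself"}
SEM_GENDERED = {"mother": "parent", "father": "parent", "sister": "sibling",
                "brother": "sibling", "son": "child", "daughter": "child"}
ADDR_TERMS = {"Mrs.", "Mr.", "Ms.", "Miss"}
RANDOM_SURNAMES = ["Booth", "Schneider", "Okafor", "Nguyen"]

def ablate(tokens, name_spans=(), no_pro=False, no_name=False,
           no_sem=False, no_addr=False):
    """Apply the (a)-(d) substitutions from Section 4.1 to a list of tokens.

    name_spans holds (start, end) token offsets of the candidate names; in the
    real dataset these offsets would come from the MAP annotations."""
    out = list(tokens)
    if no_name:   # (b) replace names first, while the annotated offsets are still valid
        for start, end in sorted(name_spans, reverse=True):
            out[start:end] = [out[start][0] + ".", random.choice(RANDOM_SURNAMES)]
    if no_pro:    # (a) neutralize third-person pronouns (capitalization ignored here)
        out = [NEUTRAL_PRONOUNS.get(t.lower(), t) for t in out]
    if no_sem:    # (c) replace semantically gendered nouns
        out = [SEM_GENDERED.get(t.lower(), t) for t in out]
    if no_addr:   # (d) drop terms of address entirely
        out = [t for t in out if t not in ADDR_TERMS]
    return out

# Applying all four substitutions at once yields the fully ablated setting
# used as the starting point of the forward-selection experiments below.
```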
We perform two sets of experiments, one following a "forward selection" type ablation (start with everything removed and add each cue back in one at a time) and one following "backward selection" (remove each cue separately). Forward selection is necessary in order to de-conflate syntactic cues from stereotypes, while backward selection gives a sense of how much impact each type of gender cue has in the context of all the others.

We begin with ZERO, in which we apply all four substitutions. Since this also removes gender cues from the pronouns themselves, an annotator cannot substantially rely on social gender to perform these resolutions. We next consider adding back in the original pronouns (always HE or SHE here), yielding ¬NAME ¬SEM ¬ADDR. Any difference in annotation behavior between ZERO and ¬NAME ¬SEM ¬ADDR can only be due to social gender stereotypes. The next setting, ¬SEM ¬ADDR, removes both forms of lexical gender (semantically gendered nouns and terms of address); differences between ¬SEM ¬ADDR and ¬NAME ¬SEM ¬ADDR show how much names are relied on for annotation. Similarly, ¬NAME ¬ADDR removes names and terms of address, showing the impact of semantically gendered nouns, and ¬NAME ¬SEM removes names and semantically gendered nouns, showing the impact of terms of address.

In the backward selection case, we begin with ORIG, which is the unmodified original text. To this, we can apply the pronoun filter to get ¬PRO; differences in annotation between ORIG and ¬PRO give a measure of how much any sort of gender-based inference is used. Similarly, we get ¬NAME by only removing names, which gives a measure of how much names are used (in the context of all other cues); we get ¬SEM by only removing semantically gendered words; and ¬ADDR by only removing terms of address.

4.2 Annotation Results

We construct examples using the methodology defined above. We then conduct annotation experiments using crowdworkers on Amazon Mechanical Turk, following the methodology by which the original GAP corpus was created.[10] Because we also wanted to capture uncertainty, we ask the crowdworkers how sure they are in their choices: "definitely" sure, "probably" sure, or "unsure."

Figure 2 shows the human annotation results as binary classification accuracy for resolving the pronoun to the antecedent.

[Figure 2: Human annotation results for the ablation study on the MAP dataset. Each column is a different ablation, and the y-axis is the degree of accuracy with 95% significance intervals. Bottom bar plots are annotator certainties, i.e., how sure they are in their choices.]

We can see that removing pronouns leads to a significant drop in accuracy. This indicates that gender-based inferences, especially social gender stereotypes, play the most significant role when annotators resolve coreferences. This confirms the findings of Rudinger et al. (2018) and Zhao et al. (2018a) that human-annotated data incorporates bias from stereotypes.

Moreover, if we compare ORIG with the columns to its left, we see that names are another significant cue for annotator judgments, while lexical gender cues do not have significant impacts on human annotation accuracies. This is likely in part due to the low frequency of lexical gender cues in our dataset: every example has pronouns and names, whereas 49% of the examples have semantically gendered nouns and only 3% of the examples include terms of address. We also note that if we compare ¬NAME ¬SEM ¬ADDR to ¬SEM ¬ADDR and ¬NAME ¬ADDR, accuracy drops when removing gender cues. Though the differences are not statistically significant, we did not expect this accuracy drop.

Finally, we find annotators' certainty values follow the same trend as the accuracy: annotators have a reasonable sense of when they are unsure. We also note that accuracy scores are essentially the same for ZERO and ¬PRO, which suggests that once explicit binary gender is gone from pronouns, the impact of any other form of linguistic gender on annotators' decisions is also removed.

[10] Our study was approved by the Microsoft Research Ethics Board. Workers were paid $1 to annotate ten contexts (the average annotation time was seven minutes).
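The accuracies in Figure 2 are simple binomial proportions; as a rough illustration of how such numbers and their 95% intervals might be computed from raw annotation records, here is a small sketch. It assumes a normal-approximation interval and an invented (condition, is_correct) record format, since the paper does not spell out its exact procedure.

```python
import math
from collections import defaultdict

def accuracy_with_ci(annotations):
    """annotations: iterable of (condition, is_correct) pairs, e.g. ("ZERO", True).

    Returns {condition: (accuracy, half_width)} using a 95% normal-approximation
    (Wald) interval; a Wilson interval would be a reasonable alternative."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [num_correct, num_total]
    for condition, is_correct in annotations:
        counts[condition][0] += int(is_correct)
        counts[condition][1] += 1
    results = {}
    for condition, (correct, total) in counts.items():
        p = correct / total
        half_width = 1.96 * math.sqrt(p * (1 - p) / total)
        results[condition] = (p, half_width)
    return results
```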
5 Bias in Model Specifications

In addition to biases that can arise from the data that a system is trained on, as studied in the previous section, bias can also come from how models are structured. For instance, a system may fail to recognize anything other than a dictionary of fixed pronouns as possible referents to entities. Here, we analyze prior work in models for coreference resolution in three ways. First, we do a literature study to quantify how NLP papers discuss gender. Second, similar to Zhao et al. (2018a) and Rudinger et al. (2018), we evaluate five freely available systems on the ablated data from §4. Third, we evaluate these systems on the dataset we created: Gender Inclusive Coreference (GICoref).

5.1 Cis-normativity in published NLP papers

In our first study, we adapt the approach Keyes (2018) took for analyzing the degree to which computer vision papers encoded trans-exclusive models of gender. In particular, we began with a random sample of ~150 papers from the ACL Anthology that mention the word "gender" and coded them according to the following questions:
• Does the paper discuss coreference resolution?
• Does the paper study English?
• L.G: Does the paper deal with linguistic gender (grammatical gender or gendered pronouns)?
• S.G: Does the paper deal with social gender?
• L.G≠S.G: (If yes to L.G and S.G:) Does the paper distinguish linguistic from social gender?
• S.G Binary: (If yes to S.G:) Does the paper explicitly or implicitly assume that social gender is binary?
• S.G Immutable: (If yes to S.G:) Does the paper explicitly or implicitly assume social gender is immutable?
• They/Neo: (If yes to S.G and to English:) Does the paper explicitly consider uses of definite singular "they" or neopronouns?

The results of this coding are in Table 1 (the full annotation is in Appendix B).

                     All Papers         Coref Papers
  L.G?               52.6% (of 150)     95.4% (of 22)
  S.G?               58.0% (of 150)     86.3% (of 22)
  L.G≠S.G?           11.1% (of 27)       5.5% (of 18)
  S.G Binary?        92.8% (of 84)      94.4% (of 18)
  S.G Immutable?     94.5% (of 74)     100.0% (of 14)
  They/Neo?           3.5% (of 56)       7.1% (of 14)

Table 1: Analysis of a corpus of 150 NLP papers that mention "gender", along the lines of what assumptions around gender are implicitly or explicitly made.

We see that, of the 22 coreference papers analyzed, the vast majority conform to a "folk" theory of language:
Only 5.5% distinguish social from linguistic gender (despite it being relevant);
Only 5.6% explicitly model gender as inclusive of non-binary identities;
No papers treat gender as anything other than completely immutable;[11]
Only 7.1% (one paper!) consider neopronouns and/or specific singular THEY.

The situation for papers not specifically about coreference is similar (the majority of these papers are either purely linguistic papers about grammatical gender in languages other than English, or papers that do "gender recognition" of authors based on their writing; May (2019) discusses the (re)production of gender in automated gender recognition in NLP in much more detail). Overall, the situation more broadly is equally troubling, and generally also fails to escape from the folk theory of gender. In particular, none of the differences are significant at a p = 0.05 level except for the first two questions, due to the small sample size (according to an n−1 chi-squared test). The result is that although we do not know exactly what decisions are baked in to all systems, the vast majority in our study (including two papers by one of the authors (Daumé and Marcu, 2005; Orita et al., 2015)) come with strong gender binary assumptions, and exist within a broader sphere of literature which erases non-binary and binary trans identities.

5.2 System performance on MAP

Next, we analyze the effect that our different ablation mechanisms have on existing coreference resolution systems. In particular, we run five coreference resolution systems on our ablated data: the AI2 system (AI2; Gardner et al., 2017); hugging face (HF; Wolf, 2017), which is a neural system based on spaCy; and the Stanford deterministic (SfdD; Raghunathan et al., 2010), statistical (SfdS; Clark and Manning, 2015), and neural (SfdN; Clark and Manning, 2016) systems. Figure 3 shows the results.

[Figure 3: Coreference resolution system results for the ablation study on the MAP dataset. The y-axis is the degree of accuracy with 95% significance intervals.]

[11] The most common ways in which papers implicitly assume that social gender is immutable are either 1) by relying on external knowledge bases that map names to "gender"; or 2) by scraping a history of a user's social media posts or emails and assuming that their "gender" today matches the gender of that historical record.
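As an illustration of how the MAP-style accuracy in Figure 3 might be computed from any system's output, the sketch below assumes the system exposes predicted mention clusters as character spans that line up exactly with the gold spans; none of the five systems' real interfaces are reproduced here, and the predict callable is a stand-in.

```python
def resolves_to(clusters, pronoun_span, name_span):
    """True if some predicted cluster contains both the pronoun and the name.

    clusters: list of lists of (start, end) character spans produced by a
    coreference system; spans are assumed to match the gold offsets exactly."""
    return any(pronoun_span in cluster and name_span in cluster
               for cluster in clusters)

def map_accuracy(examples, predict):
    """examples: (text, pronoun_span, name_a_span, name_b_span, gold_label)
    tuples with gold_label in {"A", "B"}; predict(text) returns clusters."""
    correct = 0
    for text, pronoun, name_a, name_b, gold in examples:
        clusters = predict(text)
        picks_a = resolves_to(clusters, pronoun, name_a)
        picks_b = resolves_to(clusters, pronoun, name_b)
        if picks_a and not picks_b:
            guess = "A"
        elif picks_b and not picks_a:
            guess = "B"
        else:
            guess = None  # abstention or a tie counts as wrong here
        correct += int(guess == gold)
    return correct / len(examples)
```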
We can see that the system accuracies mostly follow the same pattern as the human accuracy scores, though all are significantly lower than the human results. Accuracy scores for systems drop dramatically when we ablate out referential gender in pronouns. This reveals that these coreference resolution systems rely heavily on gender-based inferences. In terms of individual systems, the HF and SfdN systems have similar results and outperform the other systems in most cases. SfdD accuracy drops significantly once names are ablated.

These results echo and extend previous observations made by Zhao et al. (2018a), who focus on detecting stereotypes within occupations. They detect gender bias by checking whether system accuracies are the same for cases that can be resolved by syntactic cues and cases that cannot, with original data and reversed-gender data. Similarly, Rudinger et al. (2018) also focus on detecting stereotypes within occupations. They construct a dataset without any gender cues other than stereotypes, and check how systems perform with different pronouns (THEY, SHE, HE). Ideally, systems should perform the same in all cases because there are no gender cues in the sentence. However, they find that systems do not work on "they" and perform better on "he" than "she". Our analysis breaks this stereotyping down further to detect which aspects of gender signals are most leveraged by current systems.

5.3 System behavior on gender-inclusive data

Finally, in order to evaluate current coreference resolution models in gender-inclusive contexts, we introduce a new dataset, GICoref. Here we focused on naturally occurring data, but sampled specifically to surface more gender-related phenomena than may be found in, say, the Wall Street Journal.

Our new GICoref dataset consists of 95 documents from three types of sources: articles from English Wikipedia about people with non-binary gender identities, articles from LGBTQ periodicals, and fan-fiction stories from Archive Of Our Own (with the respective author's permission).[12] These documents were each annotated by both of the authors and adjudicated.[13] This data includes many examples of people who use pronouns other than SHE or HE (the dataset contains 27% HE, 20% SHE, 35% THEY, and 18% neopronouns), people who are genderfluid and whose names or pronouns change through the article, people who are misgendered, and people in relationships that are not heteronormative. In addition, incorrect references (misgendering and deadnaming[14]) are explicitly annotated.[15] Two example annotated documents, one from Wikipedia and one from Archive of Our Own, are provided in Appendix C and Appendix D.

We run the same systems as before on this dataset. Table 2 reports results according to the standard coreference resolution evaluation metric LEA (Moosavi and Strube, 2016). Since no systems are implemented to explicitly mark incorrect references, and no current evaluation metrics address this case, we perform the same evaluation twice: once with incorrect references included as regular references in the ground truth, and once with incorrect references excluded. Due to the limited number of incorrect references in the dataset, the difference between the two sets of results is not significant; here we only report the latter.

            Precision   Recall     F1
  AI2         40.4%     29.2%    33.9%
  HF          68.8%     22.3%    33.6%
  SfdD        50.8%     23.9%    32.5%
  SfdS        59.8%     24.1%    34.3%
  SfdN        59.4%     24.0%    34.2%

Table 2: LEA scores on GICoref (incorrect references excluded) with various coreference resolution systems. Rows are different systems; columns are precision, recall, and F1 scores. When evaluating, we only count exact matches of pronouns and named entities.

[12] See https://archiveofourown.org; thanks to Os Keyes for this suggestion.
[13] We evaluate inter-annotator agreement by treating one annotation as the gold standard and the other as system output and computing the LEA metric; the resulting F1 score is 92%. During the adjudication process we found that most of the disagreements are due to one of the authors missing/overlooking mentions, and rarely due to true "disagreement."
[14] According to Clements (2017), deadnaming occurs when someone, intentionally or not, refers to a person who's transgender by the name they used before they transitioned.
[15] Thanks to an anonymous reader of a draft version of this paper for this suggestion.
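For readers unfamiliar with the metric, the following sketch gives our reading of the basic LEA measure of Moosavi and Strube (2016): each entity is weighted by its size and credited with the fraction of its internal coreference links recovered inside predicted entities, with precision obtained by swapping the roles of key and response. This is not the official scorer, and it omits the exact-match mention filtering mentioned in the Table 2 caption.

```python
def _links(entity):
    # Number of coreference links in an entity of size n: n * (n - 1) / 2.
    n = len(entity)
    return n * (n - 1) / 2

def _lea_one_direction(key_entities, response_entities):
    """Size-weighted resolution score of key entities against response entities.
    Entities are represented as sets of mention identifiers."""
    numer = denom = 0.0
    for key in key_entities:
        if len(key) < 2:  # singletons have no links; handling them needs LEA's self-link variant
            continue
        resolved = sum(_links(key & resp) for resp in response_entities)
        numer += len(key) * (resolved / _links(key))
        denom += len(key)
    return numer / denom if denom else 0.0

def lea(key_entities, response_entities):
    recall = _lea_one_direction(key_entities, response_entities)
    precision = _lea_one_direction(response_entities, key_entities)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```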
The first observation is that there is still plenty of room for coreference systems to improve; the best-performing system achieves an F1 score of 34%, whereas the Stanford neural system's F1 score on the CoNLL-2012 test set reaches 60% (Moosavi, 2020). Additionally, we can see that system precision dominates recall. This is likely partially due to poor recall of pronouns other than HE and SHE. To analyze this, we compute the recall of each system for finding referential pronouns at all, regardless of whether they are correctly linked to their antecedents. We find that all systems achieve a recall of at least 95% for binary pronouns, a recall of around 90% on average for THEY, and a recall of a paltry 13% for neopronouns (two systems—Stanford deterministic and Stanford neural—never identify any neopronouns at all).
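The pronoun-recall probe just described can be approximated in a few lines of code; the sketch below assumes gold pronoun mentions and predicted mention spans are already aligned to the same offsets, and its pronoun inventories are illustrative rather than exhaustive.

```python
from collections import namedtuple

Mention = namedtuple("Mention", ["text", "start", "end"])  # assumed representation

BINARY = {"he", "him", "his", "himself", "she", "her", "hers", "herself"}
SINGULAR_THEY = {"they", "them", "their", "theirs", "themself", "themselves"}
NEO = {"ze", "hir", "hirs", "zir", "xey", "xem", "xyr", "ey", "em", "eir"}

def pronoun_recall(gold_pronouns, predicted_spans):
    """Fraction of gold pronoun Mentions whose span appears among the predicted
    mention spans, broken down by pronoun category; correctness of the
    antecedent link is deliberately ignored."""
    categories = {"binary": BINARY, "singular they": SINGULAR_THEY, "neo": NEO}
    recalls = {}
    for name, inventory in categories.items():
        gold = [m for m in gold_pronouns if m.text.lower() in inventory]
        if gold:
            found = sum((m.start, m.end) in predicted_spans for m in gold)
            recalls[name] = found / len(gold)
    return recalls
```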
6 Discussion and Moving Forward

Our goal in this paper was to analyze how gender bias exists in coreference resolution annotations and models, with a particular focus on how systems may fail to adequately process text involving binary and non-binary trans referents. We thus created two datasets: MAP and GICoref. Both datasets show significant gaps in system performance, but perhaps more so, show that taking crowdworker judgments as "gold standard" can be problematic. It may be the case that to truly build gender-inclusive datasets and systems, we need to hire or consult experiential experts (Patton et al., 2019; Young et al., 2019). Moreover, although we studied crowdworkers on Mechanical Turk (because they are often employed as annotators for NLP resources), if other populations are used for annotation, it becomes important to consider their positionality and how that may impact annotations. This echoes a related finding in the annotation of hate speech that annotator positionality matters (Olteanu et al., 2019). More broadly, we found that trans-exclusionary assumptions around gender are commonly (and implicitly) made in NLP papers, a practice that we hope to see change in the future because it fundamentally limits the applicability of NLP systems.

The primary limitation of our study and analysis is that it is limited to English. This is particularly limiting because English lacks a grammatical gender system, and some extensions of our work to languages with grammatical gender are non-trivial. We also emphasize that while we endeavored to be inclusive, our own positionality has undoubtedly led to other biases. One in particular is a largely Western bias, both in terms of what models of gender we use and also in terms of the data we annotated. We have attempted to partially compensate for this bias by intentionally including documents with non-Western non-binary expressions of gender in the GICoref dataset[16], but the dataset nonetheless remains Western-dominant.

Additionally, our ability to collect naturally occurring data was limited because many sources simply do not yet permit (or have only recently permitted) the use of gender-inclusive language in their articles. This led us to counterfactual text manipulation, which, while useful, is essentially impossible to do flawlessly. Moreover, our ability to evaluate coreference systems with data that includes incorrect references was limited as well, because current systems do not mark any forms of misgendering or deadnaming explicitly, and current metrics do not take this into account. Finally, because the social construct of gender is fundamentally contested, some of our results may apply only under some frameworks.

We hope this paper can serve as a roadmap for future studies. In particular, the gender taxonomy we presented, while not novel, is (to our knowledge) previously unattested in discussions around gender bias in NLP systems; we hope future work in this area can draw on these ideas. We also hope that developers of datasets or systems can use some of our analysis as inspiration for how one can attempt to measure—and then root out—different forms of bias in coreference resolution systems and NLP systems more broadly.

Acknowledgments

The authors are grateful to a number of people who have provided pointers, edits, suggestions, and annotation facilities to improve this work: Lauren Ackerman, Cassidy Henry, Os Keyes, Chandler May, Hanyu Wang, and Marion Zepf all contributed to various aspects of this work, including suggestions for data sources for the GICoref dataset. We also thank the CLIP lab at the University of Maryland for comments on previous drafts.

[16] We endeavored to represent some non-Western gender identities that do not fall into the male/female binary, including people who identify as hijra (Indian subcontinent), phuying (Thailand, sometimes referred to as kathoey), muxe (Oaxaca), two-spirit (Americas), fa'afafine (Samoa) and māhū (Hawaii).
References

Saleem Abuleil, Khalid Alsamara, and Martha Evens. 2002. Acquisition system for Arabic noun morphology. In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Lauren Ackerman. 2019. Syntactic and cognitive issues in investigating gendered coreference. Glossa: a journal of general linguistics, 4.

Apoorv Agarwal, Jiehan Zheng, Shruti Kamath, Sriramkumar Balasubramanian, and Shirin Ann Dey. 2015. Key female characters in film have more to talk about besides men: Automating the Bechdel test. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 830–840, Denver, Colorado. Association for Computational Linguistics.

Tafseer Ahmed Khan. 2014. Automatic acquisition of Urdu nouns (along with gender and irregular plurals). In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2846–2850, Reykjavik, Iceland. European Language Resources Association (ELRA).

Sarah Alkuhlani and Nizar Habash. 2011. A corpus for modeling morpho-syntactic agreement in Arabic: Gender, number and rationality. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 357–362, Portland, Oregon, USA. Association for Computational Linguistics.

Sarah Alkuhlani and Nizar Habash. 2012. Identifying broken plurals, irregular gender, and rationality in Arabic text. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 675–685, Avignon, France. Association for Computational Linguistics.

Tania Avgustinova and Hans Uszkoreit. 2000. An ontology of systematic relations for a shared grammar of Slavic. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics.

Bogdan Babych, Jonathan Geiger, Mireia Ginestí Rosell, and Kurt Eberle. 2014. Deriving de/het gender classification for Dutch nouns for rule-based MT generation tasks. In Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra), pages 75–81, Gothenburg, Sweden. Association for Computational Linguistics.

Ibrahim Badr, Rabih Zbib, and James Glass. 2008. Segmentation for English-to-Arabic statistical machine translation. In Proceedings of ACL-08: HLT, Short Papers, pages 153–156, Columbus, Ohio. Association for Computational Linguistics.

Ibrahim Badr, Rabih Zbib, and James Glass. 2009. Syntactic phrase reordering for English-to-Arabic statistical machine translation. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 86–93, Athens, Greece. Association for Computational Linguistics.

R. I. Bainbridge. 1985. Montagovian definite clause grammar. In Second Conference of the European Chapter of the Association for Computational Linguistics, Geneva, Switzerland. Association for Computational Linguistics.

Janet Baker, Larry Gillick, and Robert Roth. 1994. Research in large vocabulary continuous speech recognition. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.

Murali Raghu Babu Balusu, Taha Merghani, and Jacob Eisenstein. 2018. Stylistic variation in social media part-of-speech tagging. In Proceedings of the Second Workshop on Stylistic Variation, pages 11–19, New Orleans. Association for Computational Linguistics.

Francesco Barbieri and Jose Camacho-Collados. 2018. How gender and skin tone modifiers affect emoji semantics in Twitter. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 101–106, New Orleans, Louisiana. Association for Computational Linguistics.

Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. 2017. The problem with bias: Allocative versus representational harms in machine learning. In Proceedings of SIGCIS.

Naomi S. Baron. 1971. A reanalysis of English grammatical gender. Lingua, 27:113–140.

Emily M. Bender. 2019. A typology of ethical risks in language technology with an eye towards where transparent documentation can help. The Future of Artificial Intelligence: Language, Ethics, Technology.

Shane Bergsma and Dekang Lin. 2006. Bootstrapping path-based pronoun resolution. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 33–40, Sydney, Australia. Association for Computational Linguistics.

Shane Bergsma, Dekang Lin, and Randy Goebel. 2009. Glen, Glenda or Glendale: Unsupervised and semi-supervised learning of English noun gender. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 120–128, Boulder, Colorado. Association for Computational Linguistics.

Shane Bergsma, Matt Post, and David Yarowsky. 2012. Stylometric analysis of scientific articles. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 327–337, Montréal, Canada. Association for Computational Linguistics.

Bronwyn M. Bjorkman. 2017. Singular they and the syntactic representation of gender in English. Glossa: A Journal of General Linguistics, 2(1):80.

Su Lin Blodgett, Solon Barocas, Hal Daumé, III, and Hanna Wallach. 2020. Language (technology) is power: The need to be explicit about NLP harms. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).

Ondřej Bojar, Rudolf Rosa, and Aleš Tamchyna. 2013. Chimera – three heads for English-to-Czech translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 92–98, Sofia, Bulgaria. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of NeurIPS.

Constantinos Boulis and Mari Ostendorf. 2005. A quantitative analysis of lexical differences between genders in telephone conversations. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 435–442, Ann Arbor, Michigan. Association for Computational Linguistics.

Evan Bradley, Julia Salkind, Ally Moore, and Sofi Teitsort. 2019. Singular 'they' and novel pronouns: gender-neutral, nonbinary, or both? Proceedings of the Linguistic Society of America, 4(1):36–1–7.

Mary Bucholtz. 1999. Gender. Journal of Linguistic Anthropology. Special issue: Lexicon for the New Millennium, ed. Alessandro Duranti.

Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification.

John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on Twitter. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1301–1309, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Felix Burkhardt, Martin Eckert, Wiebke Johannsen, and Joachim Stegmann. 2010. A database of age and gender annotated telephone speech. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Maria Bustillos. 2011. Our desperate, 250-year-long search for a gender-neutral pronoun. permalink.

Judith Butler. 1990. Gender Trouble. Routledge.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334).

Michael Carl, Sandrine Garnier, Johann Haller, Anne Altmayer, and Bärbel Miemietz. 2004. Controlling gender equality with shallow NLP techniques. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 820–826, Geneva, Switzerland. COLING.

Sophia Chan and Alona Fyshe. 2018. Social and emotional correlates of capitalization on Twitter. In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 10–15, New Orleans, Louisiana, USA. Association for Computational Linguistics.

Songsak Channarukul, Susan W. McRoy, and Syed S. Ali. 2000. Enriching partially-specified representations for text realization using an attribute grammar. In INLG'2000 Proceedings of the First International Conference on Natural Language Generation, pages 163–170, Mitzpe Ramon, Israel. Association for Computational Linguistics.

Eric Charton and Michel Gagnon. 2011. Poly-co: a multilayer perceptron approach for coreference detection. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 97–101, Portland, Oregon, USA. Association for Computational Linguistics.

Chen Chen and Vincent Ng. 2014. Chinese zero pronoun resolution: An unsupervised probabilistic model rivaling supervised resolvers. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 763–774, Doha, Qatar. Association for Computational Linguistics.

Morgane Ciot, Morgan Sonderegger, and Derek Ruths. 2013. Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1136–1145, Seattle, Washington, USA. Association for Computational Linguistics.

Kevin Clark and Christopher D. Manning. 2015. Entity-centric coreference resolution with model stacking. In ACL.

Kevin Clark and Christopher D. Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In EMNLP.

KC Clements. 2017. What is deadnaming? Blog post.

Kirby Conrod. 2018. Changes in singular they. In Cascadia Workshop in Sociolinguistics.

Greville G. Corbett. 1991. Gender. Cambridge University Press.

Greville G. Corbett. 2013. Number of genders. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Marta R. Costa-jussà. 2017. Why Catalan-Spanish neural machine translation? Analysis, comparison and combination with standard rule and phrase-based technologies. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 55–62, Valencia, Spain. Association for Computational Linguistics.

Colette G. Craig. 1994. Classifier languages. The Encyclopedia of Language and Linguistics, 2:565–569.

Silviu Cucerzan and David Yarowsky. 2003. Minimally supervised induction of grammatical gender. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 40–47.

Ali Dada. 2007. Implementation of the Arabic numerals and their syntax in GF. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pages 9–16, Prague, Czech Republic. Association for Computational Linguistics.

Laurence Danlos and Fiametta Namer. 1988. Morphology and cross dependencies in the synthesis of personal pronouns in Romance languages. In Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics.

Helana Darwin. 2017. Doing gender beyond the binary: A virtual ethnography. Symbolic Interaction, 40(3):317–334.

Kareem Darwish, Ahmed Abdelali, and Hamdy Mubarak. 2014. Using stem-templates to improve Arabic POS and gender/number tagging. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2926–2931, Reykjavik, Iceland. European Language Resources Association (ELRA).

Hal Daumé, III and Daniel Marcu. 2005. A large-scale exploration of effective global features for a joint entity detection and tracking model. In HLT/EMNLP, pages 97–104.

Łukasz Debowski. 2003. A reconfigurable stochastic tagger for languages with complex tag structure. In Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages, pages 63–70, Budapest, Hungary. Association for Computational Linguistics.

Thierry Declerck, Nikolina Koleva, and Hans-Ulrich Krieger. 2012. Ontology-based incremental annotation of characters in folktales. In Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 30–34, Avignon, France. Association for Computational Linguistics.

Liviu P. Dinu, Vlad Niculae, and Octavia-Maria Şulea. 2012. The Romanian neuter examined through a two-gender n-gram classification system. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 907–910, Istanbul, Turkey. European Language Resources Association (ELRA).

Ursula Doleschal and Sonja Schmid. 2001. Doing gender in Russian. Gender Across Languages. The linguistic representation of women and men, 1:253–282.

Michael Dorna, Anette Frank, Josef van Genabith, and Martin C. Emele. 1998. Syntactic and semantic transfer with f-structures. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 341–347, Montreal, Quebec, Canada. Association for Computational Linguistics.

Matthew S. Dryer. 2013. Expression of pronominal subjects. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Esin Durmus and Claire Cardie. 2018. Understanding the effect of gender and stance in opinion expression in debates on "abortion". In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 69–75, New Orleans, Louisiana, USA. Association for Computational Linguistics.

Jason Eisner and Damianos Karakos. 2005. Bootstrapping without the boot. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 395–402, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Ahmed El Kholy and Nizar Habash. 2012. Rich morphology generation using statistical machine translation. In INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference, pages 90–94, Utica, IL. Association for Computational Linguistics.

Katja Filippova. 2012. User demographics and language in an implicit social network. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1478–1488, Jeju Island, Korea. Association for Computational Linguistics.