Language (Technology) is Power: A Critical Survey of “Bias” in NLP

Su Lin Blodgett (College of Information and Computer Sciences, University of Massachusetts Amherst) blodgett@cs.umass.edu
Solon Barocas (Microsoft Research; Cornell University) solon@microsoft.com
Hal Daumé III (Microsoft Research; University of Maryland) me@hal3.name
Hanna Wallach (Microsoft Research) wallach@microsoft.com

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, July 5–10, 2020. © 2020 Association for Computational Linguistics.

Abstract

We survey 146 papers analyzing “bias” in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing “bias” is an inherently normative process. We further find that these papers’ proposed quantitative techniques for measuring or mitigating “bias” are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. Based on these findings, we describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing “bias” in NLP systems. These recommendations rest on a greater recognition of the relationships between language and social hierarchies, encouraging researchers and practitioners to articulate their conceptualizations of “bias”—i.e., what kinds of system behaviors are harmful, in what ways, to whom, and why, as well as the normative reasoning underlying these statements—and to center work around the lived experiences of members of communities affected by NLP systems, while interrogating and reimagining the power relations between technologists and such communities.

1 Introduction

A large body of work analyzing “bias” in natural language processing (NLP) systems has emerged in recent years, including work on “bias” in embedding spaces (e.g., Bolukbasi et al., 2016a; Caliskan et al., 2017; Gonen and Goldberg, 2019; May et al., 2019) as well as work on “bias” in systems developed for a breadth of tasks including language modeling (Lu et al., 2018; Bordia and Bowman, 2019), coreference resolution (Rudinger et al., 2018; Zhao et al., 2018a), machine translation (Vanmassenhove et al., 2018; Stanovsky et al., 2019), sentiment analysis (Kiritchenko and Mohammad, 2018), and hate speech/toxicity detection (e.g., Park et al., 2018; Dixon et al., 2018), among others.

Although these papers have laid vital groundwork by illustrating some of the ways that NLP systems can be harmful, the majority of them fail to engage critically with what constitutes “bias” in the first place. Despite the fact that analyzing “bias” is an inherently normative process—in which some system behaviors are deemed good and others harmful—papers on “bias” in NLP systems are rife with unstated assumptions about what kinds of system behaviors are harmful, in what ways, to whom, and why. Indeed, the term “bias” (or “gender bias” or “racial bias”) is used to describe a wide range of system behaviors, even though they may be harmful in different ways, to different groups, or for different reasons. Even papers analyzing “bias” in NLP systems developed for the same task often conceptualize it differently.

For example, the following system behaviors are all understood to be self-evident statements of “racial bias”: (a) embedding spaces in which embeddings for names associated with African Americans are closer (compared to names associated with European Americans) to unpleasant words than pleasant words (Caliskan et al., 2017); (b) sentiment analysis systems yielding different intensity scores for sentences containing names associated with African Americans and sentences containing names associated with European Americans (Kiritchenko and Mohammad, 2018); and (c) toxicity detection systems scoring tweets containing features associated with African-American English as more offensive than tweets without these features (Davidson et al., 2019; Sap et al., 2019).
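Behaviors like (b) are typically established with simple paired comparisons over templated inputs. The sketch below is our own illustration of that style of test, not code from any of the surveyed papers; the toy lexicon scorer, name lists, and templates are hypothetical stand-ins for a real sentiment analysis system and for stimuli such as those used by Kiritchenko and Mohammad (2018).

```python
from statistics import mean

# Hypothetical name lists and templates, standing in for the stimuli used in
# paired evaluations of sentiment systems.
AFRICAN_AMERICAN_NAMES = ["Ebony", "Jamel", "Latisha"]
EUROPEAN_AMERICAN_NAMES = ["Amanda", "Frank", "Melanie"]
TEMPLATES = ["{name} feels angry.", "{name} made me feel happy."]

# Toy stand-in for a real sentiment analysis system: sums word scores from a
# tiny lexicon. A real study would call an actual model here.
TOY_LEXICON = {"angry": -0.8, "happy": 0.7}

def sentiment_intensity(sentence: str) -> float:
    return sum(TOY_LEXICON.get(w.strip(".,").lower(), 0.0) for w in sentence.split())

def mean_score(names: list[str]) -> float:
    # Average the system's score over every name/template combination.
    return mean(sentiment_intensity(t.format(name=n)) for n in names for t in TEMPLATES)

# A nonzero gap between the two groups is the kind of system behavior that
# the surveyed papers describe as "racial bias."
gap = mean_score(AFRICAN_AMERICAN_NAMES) - mean_score(EUROPEAN_AMERICAN_NAMES)
print(f"Mean intensity gap between name groups: {gap:+.3f}")
```

Such a gap is the measured quantity; as the rest of this survey argues, what makes it harmful, in what ways, and to whom is usually left unstated.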
Moreover, some of these papers focus on “racial bias” expressed in written text, while others focus on “racial bias” against authors. This use of imprecise terminology obscures these important differences.

We survey 146 papers analyzing “bias” in NLP systems, finding that their motivations are often vague and inconsistent. Many lack any normative reasoning for why the system behaviors that are described as “bias” are harmful, in what ways, and to whom. Moreover, the vast majority of these papers do not engage with the relevant literature outside of NLP to ground normative concerns when proposing quantitative techniques for measuring or mitigating “bias.” As a result, we find that many of these techniques are poorly matched to their motivations, and are not comparable to one another.

We then describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing “bias” in NLP systems. We argue that such work should examine the relationships between language and social hierarchies; we call on researchers and practitioners conducting such work to articulate their conceptualizations of “bias” in order to enable conversations about what kinds of system behaviors are harmful, in what ways, to whom, and why; and we recommend deeper engagements between technologists and communities affected by NLP systems. We also provide several concrete research questions that are implied by each of our recommendations.

2 Method

Our survey includes all papers known to us analyzing “bias” in NLP systems—146 papers in total. We omitted papers about speech, restricting our survey to papers about written text only. To identify the 146 papers, we first searched the ACL Anthology[1] for all papers with the keywords “bias” or “fairness” that were made available prior to May 2020. We retained all papers about social “bias,” and discarded all papers about other definitions of the keywords (e.g., hypothesis-only bias, inductive bias, media bias). We also discarded all papers using “bias” in NLP systems to measure social “bias” in text or the real world (e.g., Garg et al., 2018).

To ensure that we did not exclude any relevant papers without the keywords “bias” or “fairness,” we also traversed the citation graph of our initial set of papers, retaining any papers analyzing “bias” in NLP systems that are cited by or cite the papers in our initial set. Finally, we manually inspected any papers analyzing “bias” in NLP systems from leading machine learning, human–computer interaction, and web conferences and workshops, such as ICML, NeurIPS, AIES, FAccT, CHI, and WWW, along with any relevant papers that were made available in the “Computation and Language” and “Computers and Society” categories on arXiv prior to May 2020, but found that they had already been identified via our traversal of the citation graph. We provide a list of all 146 papers in the appendix. In Table 1, we provide a breakdown of the NLP tasks covered by the papers. We note that counts do not sum to 146, because some papers cover multiple tasks. For example, a paper might test the efficacy of a technique for mitigating “bias” in embedding spaces in the context of sentiment analysis.

[1] https://www.aclweb.org/anthology/

Table 1: The NLP tasks covered by the 146 papers.

  NLP task                                     Papers
  Embeddings (type-level or contextualized)    54
  Coreference resolution                       20
  Language modeling or dialogue generation     17
  Hate-speech detection                        17
  Sentiment analysis                           15
  Machine translation                          8
  Tagging or parsing                           5
  Surveys, frameworks, and meta-analyses       20
  Other                                        22
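As an illustration of the identification procedure described in this section, the following sketch combines a keyword seed with a traversal of the citation graph. The paper records, citation map, and relevance screening are all hypothetical placeholders; the authors do not describe releasing code for this step, so nothing here should be read as their actual pipeline.

```python
from collections import deque

# Hypothetical records: paper id -> title/abstract text, and paper -> papers it cites.
papers = {
    "P1": "Measuring gender bias in coreference resolution",
    "P2": "On the inductive bias of recurrent architectures",
    "P3": "Mitigating racial disparities in toxicity detection",
}
citations = {"P1": {"P3"}, "P2": set(), "P3": set()}

KEYWORDS = ("bias", "fairness")

def keyword_seed(papers: dict[str, str]) -> set[str]:
    """Initial set: papers matching the keywords (the manual screening that
    discards non-social senses of "bias" is not modeled here)."""
    return {pid for pid, text in papers.items()
            if any(keyword in text.lower() for keyword in KEYWORDS)}

def citation_closure(seed: set[str], citations: dict[str, set[str]]) -> set[str]:
    """Breadth-first traversal over both citing and cited-by links."""
    cited_by: dict[str, set[str]] = {}
    for pid, outgoing in citations.items():
        for target in outgoing:
            cited_by.setdefault(target, set()).add(pid)
    found, queue = set(seed), deque(seed)
    while queue:
        pid = queue.popleft()
        neighbors = citations.get(pid, set()) | cited_by.get(pid, set())
        for neighbor in neighbors:
            if neighbor not in found:  # relevance screening omitted for brevity
                found.add(neighbor)
                queue.append(neighbor)
    return found

print(sorted(citation_closure(keyword_seed(papers), citations)))
```

In the toy data, the third paper lacks the keywords but is reached through the citation graph, while the second matches the keywords and would be discarded during manual screening, mirroring the two filtering steps described above.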
Once identified, we then read each of the 146 papers with the goal of categorizing their motivations and their proposed quantitative techniques for measuring or mitigating “bias.” We used a previously developed taxonomy of harms for this categorization, which differentiates between so-called allocational and representational harms (Barocas et al., 2017; Crawford, 2017). Allocational harms arise when an automated system allocates resources (e.g., credit) or opportunities (e.g., jobs) unfairly to different social groups; representational harms arise when a system (e.g., a search engine) represents some social groups in a less favorable light than others, demeans them, or fails to recognize their existence altogether. Adapting and extending this taxonomy, we categorized the 146 papers’ motivations and techniques into the following categories:

• Allocational harms.
• Representational harms:[2]
  – Stereotyping that propagates negative generalizations about particular social groups.
  – Differences in system performance for different social groups, language that misrepresents the distribution of different social groups in the population, or language that is denigrating to particular social groups.
• Questionable correlations between system behavior and features of language that are typically associated with particular social groups.
• Vague descriptions of “bias” (or “gender bias” or “racial bias”) or no description at all.
• Surveys, frameworks, and meta-analyses.

[2] We grouped several types of representational harms into two categories to reflect that the main point of differentiation between the 146 papers’ motivations and proposed quantitative techniques for measuring or mitigating “bias” is whether or not they focus on stereotyping. Among the papers that do not focus on stereotyping, we found that most lack sufficiently clear motivations and techniques to reliably categorize them further.

In Table 2 we provide counts for each of the six categories listed above. (We also provide a list of the papers that fall into each category in the appendix.) Again, we note that the counts do not sum to 146, because some papers state multiple motivations, propose multiple techniques, or propose a single technique for measuring or mitigating multiple harms. Table 3, which is in the appendix, contains examples of the papers’ motivations and techniques across a range of different NLP tasks.

Table 2: The categories into which the 146 papers fall.

  Category                                 Motivation   Technique
  Allocational harms                       30           4
  Stereotyping                             50           58
  Other representational harms             52           43
  Questionable correlations                47           42
  Vague/unstated                           23           0
  Surveys, frameworks, and meta-analyses   20           20

3 Findings

Categorizing the 146 papers’ motivations and proposed quantitative techniques for measuring or mitigating “bias” into the six categories listed above enabled us to identify several commonalities, which we present below, along with illustrative quotes.

3.1 Motivations

Papers state a wide range of motivations, multiple motivations, vague motivations, and sometimes no motivations at all. We found that the papers’ motivations span all six categories, with several papers falling into each one. Appropriately, papers that provide surveys or frameworks for analyzing “bias” in NLP systems often state multiple motivations (e.g., Hovy and Spruit, 2016; Bender, 2019; Sun et al., 2019; Rozado, 2020; Shah et al., 2020). However, as the examples in Table 3 (in the appendix) illustrate, many other papers (33%) do so as well. Some papers (16%) state only vague motivations or no motivations at all. For example,

  “[N]o human should be discriminated on the basis of demographic attributes by an NLP system.” —Kaneko and Bollegala (2019)

  “[P]rominent word embeddings [...] encode systematic biases against women and black people [...] implicating many NLP systems in scaling up social injustice.” —May et al. (2019)

These examples leave unstated what it might mean for an NLP system to “discriminate,” what constitutes “systematic biases,” or how NLP systems contribute to “social injustice” (itself undefined).

Papers’ motivations sometimes include no normative reasoning. We found that some papers (32%) are not motivated by any apparent normative concerns, often focusing instead on concerns about system performance. For example, the first quote below includes normative reasoning—namely that models should not use demographic information to make predictions—while the other focuses on learned correlations impairing system performance.

  “In [text classification], models are expected to make predictions with the semantic information rather than with the demographic group identity information (e.g., ‘gay’, ‘black’) contained in the sentences.” —Zhang et al. (2020a)

  “An over-prevalence of some gendered forms in the training data leads to translations with identifiable errors. Translations are better for sentences involving men and for sentences containing stereotypical gender roles.” —Saunders and Byrne (2020)
Even when papers do state clear motivations, they are often unclear about why the system behaviors that are described as “bias” are harmful, in what ways, and to whom. We found that even papers with clear motivations often fail to explain what kinds of system behaviors are harmful, in what ways, to whom, and why. For example,

  “Deploying these word embedding algorithms in practice, for example in automated translation systems or as hiring aids, runs the serious risk of perpetuating problematic biases in important societal contexts.” —Brunet et al. (2019)

  “[I]f the systems show discriminatory behaviors in the interactions, the user experience will be adversely affected.” —Liu et al. (2019)

These examples leave unstated what “problematic biases” or non-ideal user experiences might look like, how the system behaviors might result in these things, and who the relevant stakeholders or users might be. In contrast, we find that papers that provide surveys or frameworks for analyzing “bias” in NLP systems often name who is harmed, acknowledging that different social groups may experience these systems differently due to their different relationships with NLP systems or different social positions. For example, Ruane et al. (2019) argue for a “deep understanding of the user groups [sic] characteristics, contexts, and interests” when designing conversational agents.

Papers about NLP systems developed for the same task often conceptualize “bias” differently. Even papers that cover the same NLP task often conceptualize “bias” in ways that differ substantially and are sometimes inconsistent. Rows 3 and 4 of Table 3 (in the appendix) contain machine translation papers with different conceptualizations of “bias,” leading to different proposed techniques, while rows 5 and 6 contain papers on “bias” in embedding spaces that state different motivations, but propose techniques for quantifying stereotyping.

Papers’ motivations conflate allocational and representational harms. We found that the papers’ motivations sometimes (16%) name immediate representational harms, such as stereotyping, alongside more distant allocational harms, which, in the case of stereotyping, are usually imagined as downstream effects of stereotypes on résumé filtering. Many of these papers use the imagined downstream effects to justify focusing on particular system behaviors, even when the downstream effects are not measured. Papers on “bias” in embedding spaces are especially likely to do this because embeddings are often used as input to other systems:

  “However, none of these papers [on embeddings] have recognized how blatantly sexist the embeddings are and hence risk introducing biases of various types into real-world systems.” —Bolukbasi et al. (2016a)

  “It is essential to quantify and mitigate gender bias in these embeddings to avoid them from affecting downstream applications.” —Zhou et al. (2019)

In contrast, papers that provide surveys or frameworks for analyzing “bias” in NLP systems treat representational harms as harmful in their own right. For example, Mayfield et al. (2019) and Ruane et al. (2019) cite the harmful reproduction of dominant linguistic norms by NLP systems (a point to which we return in section 4), while Bender (2019) outlines a range of harms, including seeing stereotypes in search results and being made invisible to search engines due to language practices.

3.2 Techniques

Papers’ techniques are not well grounded in the relevant literature outside of NLP. Perhaps unsurprisingly given that the papers’ motivations are often vague, inconsistent, and lacking in normative reasoning, we also found that the papers’ proposed quantitative techniques for measuring or mitigating “bias” do not effectively engage with the relevant literature outside of NLP. Papers on stereotyping are a notable exception: the Word Embedding Association Test (Caliskan et al., 2017) draws on the Implicit Association Test (Greenwald et al., 1998) from the social psychology literature, while several techniques operationalize the well-studied “Angry Black Woman” stereotype (Kiritchenko and Mohammad, 2018; May et al., 2019; Tan and Celis, 2019) and the “double bind” faced by women (May et al., 2019; Tan and Celis, 2019), in which women who succeed at stereotypically male tasks are perceived to be less likable than similarly successful men (Heilman et al., 2004). Tan and Celis (2019) also examine the compounding effects of race and gender, drawing on Black feminist scholarship on intersectionality (Crenshaw, 1989).
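To make the flavor of this line of work concrete, the Word Embedding Association Test summarizes how strongly two sets of target words (e.g., two sets of names) associate with two sets of attribute words (e.g., pleasant vs. unpleasant terms). The sketch below follows the effect-size formula reported by Caliskan et al. (2017); the two-dimensional vectors are toy values of our own, chosen only so the example runs, and the word sets are illustrative rather than the published stimuli.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w: str, A: list[str], B: list[str], emb: dict) -> float:
    # s(w, A, B): mean similarity to attribute set A minus mean similarity to B.
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb) -> float:
    # Difference of mean target-word associations, normalized by the standard
    # deviation of associations over all target words.
    s_X = [association(x, A, B, emb) for x in X]
    s_Y = [association(y, A, B, emb) for y in Y]
    return float((np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y))

# Toy embeddings only; real analyses use trained word embeddings.
emb = {
    "flower": np.array([0.9, 0.1]), "rose":  np.array([0.8, 0.2]),
    "insect": np.array([0.1, 0.9]), "ant":   np.array([0.2, 0.8]),
    "lovely": np.array([1.0, 0.0]), "awful": np.array([0.0, 1.0]),
}
print(weat_effect_size(X=["flower", "rose"], Y=["insect", "ant"],
                       A=["lovely"], B=["awful"], emb=emb))
```

A positive effect size indicates that the first target set is more strongly associated with the first attribute set; on its own it says nothing about why such an association is harmful or to whom.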
Papers’ techniques are poorly matched to their motivations. We found that although 21% of the papers include allocational harms in their motivations, only four papers actually propose techniques for measuring or mitigating allocational harms.

Papers focus on a narrow range of potential sources of “bias.” We found that nearly all of the papers focus on system predictions as the potential sources of “bias,” with many additionally focusing on “bias” in datasets (e.g., differences in the number of gendered pronouns in the training data (Zhao et al., 2019)). Most papers do not interrogate the normative implications of other decisions made during the development and deployment lifecycle—perhaps unsurprising given that their motivations sometimes include no normative reasoning. A few papers are exceptions, illustrating the impacts of task definitions, annotation guidelines, and evaluation metrics: Cao and Daumé (2019) study how folk conceptions of gender (Keyes, 2018) are reproduced in coreference resolution systems that assume a strict gender dichotomy, thereby maintaining cisnormativity; Sap et al. (2019) focus on the effect of priming annotators with information about possible dialectal differences when asking them to apply toxicity labels to sample tweets, finding that annotators who are primed are significantly less likely to label tweets containing features associated with African-American English as offensive.

4 A path forward

We now describe how researchers and practitioners conducting work analyzing “bias” in NLP systems might avoid the pitfalls presented in the previous section—the beginnings of a path forward. We propose three recommendations that should guide such work, and, for each, provide several concrete research questions. We emphasize that these questions are not comprehensive, and are intended to generate further questions and lines of engagement. Our three recommendations are as follows:

(R1) Ground work analyzing “bias” in NLP systems in the relevant literature outside of NLP that explores the relationships between language and social hierarchies. Treat representational harms as harmful in their own right.
(R2) Provide explicit statements of why the system behaviors that are described as “bias” are harmful, in what ways, and to whom. Be forthright about the normative reasoning (Green, 2019) underlying these statements.
(R3) Examine language use in practice by engaging with the lived experiences of members of communities affected by NLP systems. Interrogate and reimagine the power relations between technologists and such communities.

4.1 Language and social hierarchies

Turning first to (R1), we argue that work analyzing “bias” in NLP systems will paint a much fuller picture if it engages with the relevant literature outside of NLP that explores the relationships between language and social hierarchies. Many disciplines, including sociolinguistics, linguistic anthropology, sociology, and social psychology, study how language takes on social meaning and the role that language plays in maintaining social hierarchies. For example, language is the means through which social groups are labeled and one way that beliefs about social groups are transmitted (e.g., Maass, 1999; Beukeboom and Burgers, 2019). Group labels can serve as the basis of stereotypes and thus reinforce social inequalities: “[T]he label content functions to identify a given category of people, and thereby conveys category boundaries and a position in a hierarchical taxonomy” (Beukeboom and Burgers, 2019). Similarly, “controlling images,” such as stereotypes of Black women, which are linguistically and visually transmitted through literature, news media, television, and so forth, provide “ideological justification” for their continued oppression (Collins, 2000, Chapter 4).

As a result, many groups have sought to bring about social changes through changes in language, disrupting patterns of oppression and marginalization via so-called “gender-fair” language (Sczesny et al., 2016; Menegatti and Rubini, 2017), language that is more inclusive to people with disabilities (ADA, 2018), and language that is less dehumanizing (e.g., abandoning the use of the term “illegal” in everyday discourse on immigration in the U.S. (Rosa, 2019)). The fact that group labels are so contested is evidence of how deeply intertwined language and social hierarchies are. Taking “gender-fair” language as an example, the hope is that reducing asymmetries in language about women and men will reduce asymmetries in their social standing. Meanwhile, struggles over language use often arise from dominant social groups’ desire to “control both material and symbolic resources”—i.e., “the right to decide what words will mean and to control those meanings”—as was the case in some white speakers’ insistence on using offensive place names against the objections of Indigenous speakers (Hill, 2008, Chapter 3).
Sociolinguists and linguistic anthropologists have also examined language attitudes and language ideologies, or people’s metalinguistic beliefs about language: Which language varieties or practices are taken as standard, ordinary, or unmarked? Which are considered correct, prestigious, or appropriate for public use, and which are considered incorrect, uneducated, or offensive (e.g., Campbell-Kibler, 2009; Preston, 2009; Loudermilk, 2015; Lanehart and Malik, 2018)? Which are rendered invisible (Roche, 2019)?[3] Language ideologies play a vital role in reinforcing and justifying social hierarchies because beliefs about language varieties or practices often translate into beliefs about their speakers (e.g. Alim et al., 2016; Rosa and Flores, 2017; Craft et al., 2020). For example, in the U.S., the portrayal of non-white speakers’ language varieties and practices as linguistically deficient helped to justify violent European colonialism, and today continues to justify enduring racial hierarchies by maintaining views of non-white speakers as lacking the language “required for complex thinking processes and successful engagement in the global economy” (Rosa and Flores, 2017).

[3] Language ideologies encompass much more than this; see, e.g., Lippi-Green (2012), Alim et al. (2016), Rosa and Flores (2017), Rosa and Burdick (2017), and Charity Hudley (2017).

Recognizing the role that language plays in maintaining social hierarchies is critical to the future of work analyzing “bias” in NLP systems. First, it helps to explain why representational harms are harmful in their own right. Second, the complexity of the relationships between language and social hierarchies illustrates why studying “bias” in NLP systems is so challenging, suggesting that researchers and practitioners will need to move beyond existing algorithmic fairness techniques. We argue that work must be grounded in the relevant literature outside of NLP that examines the relationships between language and social hierarchies; without this grounding, researchers and practitioners risk measuring or mitigating only what is convenient to measure or mitigate, rather than what is most normatively concerning.

More specifically, we recommend that work analyzing “bias” in NLP systems be reoriented around the following question: How are social hierarchies, language ideologies, and NLP systems coproduced? This question mirrors Benjamin’s (2020) call to examine how “race and technology are coproduced”—i.e., how racial hierarchies, and the ideologies and discourses that maintain them, create and are re-created by technology. We recommend that researchers and practitioners similarly ask how existing social hierarchies and language ideologies drive the development and deployment of NLP systems, and how these systems therefore reproduce these hierarchies and ideologies. As a starting point for reorienting work analyzing “bias” in NLP systems around this question, we provide the following concrete research questions:

• How do social hierarchies and language ideologies influence the decisions made during the development and deployment lifecycle? What kinds of NLP systems do these decisions result in, and what kinds do they foreclose?
  – General assumptions: To which linguistic norms do NLP systems adhere (Bender, 2019; Ruane et al., 2019)? Which language practices are implicitly assumed to be standard, ordinary, correct, or appropriate?
  – Task definition: For which speakers are NLP systems (and NLP resources) developed? (See Joshi et al. (2020) for a discussion.) How do task definitions discretize the world? For example, how are social groups delineated when defining demographic attribute prediction tasks (e.g., Koppel et al., 2002; Rosenthal and McKeown, 2011; Nguyen et al., 2013)? What about languages in native language prediction tasks (Tetreault et al., 2013)?
  – Data: How are datasets collected, preprocessed, and labeled or annotated? What are the impacts of annotation guidelines, annotator assumptions and perceptions (Olteanu et al., 2019; Sap et al., 2019; Geiger et al., 2020), and annotation aggregation processes (Pavlick and Kwiatkowski, 2019)?
  – Evaluation: How are NLP systems evaluated? What are the impacts of evaluation metrics (Olteanu et al., 2017)? Are any non-quantitative evaluations performed?
• How do NLP systems reproduce or transform language ideologies? Which language varieties or practices come to be deemed good or bad? Might “good” language simply mean language that is easily handled by existing NLP systems? For example, linguistic phenomena arising from many language practices (Eisenstein, 2013) are described as “noisy text” and often viewed as a target for “normalization.” How do the language ideologies that are reproduced by NLP systems maintain social hierarchies?
• Which representational harms are being measured or mitigated? Are these the most normatively concerning harms, or merely those that are well handled by existing algorithmic fairness techniques? Are there other representational harms that might be analyzed?
4.2 Conceptualizations of “bias”

Turning now to (R2), we argue that work analyzing “bias” in NLP systems should provide explicit statements of why the system behaviors that are described as “bias” are harmful, in what ways, and to whom, as well as the normative reasoning underlying these statements. In other words, researchers and practitioners should articulate their conceptualizations of “bias.” As we described above, papers often contain descriptions of system behaviors that are understood to be self-evident statements of “bias.” This use of imprecise terminology has led to papers all claiming to analyze “bias” in NLP systems, sometimes even in systems developed for the same task, but with different or even inconsistent conceptualizations of “bias,” and no explanations for these differences.

Yet analyzing “bias” is an inherently normative process—in which some system behaviors are deemed good and others harmful—even if assumptions about what kinds of system behaviors are harmful, in what ways, for whom, and why are not stated. We therefore echo calls by Bardzell and Bardzell (2011), Keyes et al. (2019), and Green (2019) for researchers and practitioners to make their normative reasoning explicit by articulating the social values that underpin their decisions to deem some system behaviors as harmful, no matter how obvious such values appear to be. We further argue that this reasoning should take into account the relationships between language and social hierarchies that we described above. First, these relationships provide a foundation from which to approach the normative reasoning that we recommend making explicit. For example, some system behaviors might be harmful precisely because they maintain social hierarchies. Second, if work analyzing “bias” in NLP systems is reoriented to understand how social hierarchies, language ideologies, and NLP systems are coproduced, then this work will be incomplete if we fail to account for the ways that social hierarchies and language ideologies determine what we mean by “bias” in the first place. As a starting point, we therefore provide the following concrete research questions:

• What kinds of system behaviors are described as “bias”? What are their potential sources (e.g., general assumptions, task definition, data)?
• In what ways are these system behaviors harmful, to whom are they harmful, and why?
• What are the social values (obvious or not) that underpin this conceptualization of “bias”?

4.3 Language use in practice

Finally, we turn to (R3). Our perspective, which rests on a greater recognition of the relationships between language and social hierarchies, suggests several directions for examining language use in practice. Here, we focus on two. First, because language is necessarily situated, and because different social groups have different lived experiences due to their different social positions (Hanna et al., 2020)—particularly groups at the intersections of multiple axes of oppression—we recommend that researchers and practitioners center work analyzing “bias” in NLP systems around the lived experiences of members of communities affected by these systems. Second, we recommend that the power relations between technologists and such communities be interrogated and reimagined. Researchers have pointed out that algorithmic fairness techniques, by proposing incremental technical mitigations—e.g., collecting new datasets or training better models—maintain these power relations by (a) assuming that automated systems should continue to exist, rather than asking whether they should be built at all, and (b) keeping development and deployment decisions in the hands of technologists (Bennett and Keyes, 2019; Cifor et al., 2019; Green, 2019; Katell et al., 2020).
There are many disciplines for researchers and practitioners to draw on when pursuing these directions. For example, in human–computer interaction, Hamidi et al. (2018) study transgender people’s experiences with automated gender recognition systems in order to uncover how these systems reproduce structures of transgender exclusion by redefining what it means to perform gender “normally.” Value-sensitive design provides a framework for accounting for the values of different stakeholders in the design of technology (e.g., Friedman et al., 2006; Friedman and Hendry, 2019; Le Dantec et al., 2009; Yoo et al., 2019), while participatory design seeks to involve stakeholders in the design process itself (Sanders, 2002; Muller, 2007; Simonsen and Robertson, 2013; DiSalvo et al., 2013). Participatory action research in education (Kemmis, 2006) and in language documentation and reclamation (Junker, 2018) is also relevant. In particular, work on language reclamation to support decolonization and tribal sovereignty (Leonard, 2012) and work in sociolinguistics focusing on developing co-equal research relationships with community members and supporting linguistic justice efforts (e.g., Bucholtz et al., 2014, 2016, 2019) provide examples of more emancipatory relationships with communities. Finally, several workshops and events have begun to explore how to empower stakeholders in the development and deployment of technology (Vaccaro et al., 2019; Givens and Morris, 2020; Sassaman et al., 2020)[4] and how to help researchers and practitioners consider when not to build systems at all (Barocas et al., 2020).

[4] Also https://participatoryml.github.io/

As a starting point for engaging with communities affected by NLP systems, we therefore provide the following concrete research questions:

• How do communities become aware of NLP systems? Do they resist them, and if so, how?
• What additional costs are borne by communities for whom NLP systems do not work well?
• Do NLP systems shift power toward oppressive institutions (e.g., by enabling predictions that communities do not want made, linguistically based unfair allocation of resources or opportunities (Rosa and Flores, 2017), surveillance, or censorship), or away from such institutions?
• Who is involved in the development and deployment of NLP systems? How do decision-making processes maintain power relations between technologists and communities affected by NLP systems? Can these processes be changed to reimagine these relations?

5 Case study

To illustrate our recommendations, we present a case study covering work on African-American English (AAE).[5] Work analyzing “bias” in the context of AAE has shown that part-of-speech taggers, language identification systems, and dependency parsers all work less well on text containing features associated with AAE than on text without these features (Jørgensen et al., 2015, 2016; Blodgett et al., 2016, 2018), and that toxicity detection systems score tweets containing features associated with AAE as more offensive than tweets without them (Davidson et al., 2019; Sap et al., 2019).

[5] This language variety has had many different names over the years, but is now generally called African-American English (AAE), African-American Vernacular English (AAVE), or African-American Language (AAL) (Green, 2002; Wolfram and Schilling, 2015; Rickford and King, 2016).
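Mechanically, the disparities reported in these papers are group-wise comparisons of system outputs. The minimal sketch below illustrates that kind of comparison; the scored tweets, the binary AAE-feature flag, and the scores themselves are hypothetical, and work such as Blodgett et al. (2016) and Sap et al. (2019) infers dialect alignment probabilistically rather than taking labels as given.

```python
from statistics import mean

# Hypothetical sample: (tweet text, contains AAE-associated features?, toxicity score).
scored_tweets = [
    ("example tweet one",   True,  0.71),
    ("example tweet two",   True,  0.64),
    ("example tweet three", False, 0.32),
    ("example tweet four",  False, 0.41),
]

def group_mean(rows, aae_flag: bool) -> float:
    return mean(score for _, is_aae, score in rows if is_aae == aae_flag)

aae_mean = group_mean(scored_tweets, True)
other_mean = group_mean(scored_tweets, False)

# The gap is the measured quantity; on its own it says nothing about who is
# harmed, in what ways, or why, which is the point this case study develops.
print(f"AAE-associated: {aae_mean:.2f}  other: {other_mean:.2f}  gap: {aae_mean - other_mean:+.2f}")
```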
These papers have been critical for highlighting AAE as a language variety for which existing NLP systems may not work, illustrating their limitations. However, they do not conceptualize “racial bias” in the same way. The first four of these papers simply focus on system performance differences between text containing features associated with AAE and text without these features. In contrast, the last two papers also focus on such system performance differences, but motivate this focus with the following additional reasoning: If tweets containing features associated with AAE are scored as more offensive than tweets without these features, then this might (a) yield negative perceptions of AAE; (b) result in disproportionate removal of tweets containing these features, impeding participation in online platforms and reducing the space available online in which speakers can use AAE freely; and (c) cause AAE speakers to incur additional costs if they have to change their language practices to avoid negative perceptions or tweet removal.

More importantly, none of these papers engage with the literature on AAE, racial hierarchies in the U.S., and raciolinguistic ideologies. By failing to engage with this literature—thereby treating AAE simply as one of many non-Penn Treebank varieties of English or perhaps as another challenging domain—work analyzing “bias” in NLP systems in the context of AAE fails to situate these systems in the world. Who are the speakers of AAE? How are they viewed? We argue that AAE as a language variety cannot be separated from its speakers—primarily Black people in the U.S., who experience systemic anti-Black racism—and the language ideologies that reinforce and justify racial hierarchies.

Even after decades of sociolinguistic efforts to legitimize AAE, it continues to be viewed as “bad” English and its speakers continue to be viewed as linguistically inadequate—a view called the deficit perspective (Alim et al., 2016; Rosa and Flores, 2017). This perspective persists despite demonstrations that AAE is rule-bound and grammatical (Mufwene et al., 1998; Green, 2002), in addition to ample evidence of its speakers’ linguistic adroitness (e.g., Alim, 2004; Rickford and King, 2016). This perspective belongs to a broader set of raciolinguistic ideologies (Rosa and Flores, 2017), which also produce allocational harms; speakers of AAE are frequently penalized for not adhering to dominant language practices, including in the education system (Alim, 2004; Terry et al., 2010), when seeking housing (Baugh, 2018), and in the judicial system, where their testimony is misunderstood or, worse yet, disbelieved (Rickford and King, 2016; Jones et al., 2019). These raciolinguistic ideologies position racialized communities as needing linguistic intervention, such as language education programs, in which these and other harms can be reduced if communities accommodate to dominant language practices (Rosa and Flores, 2017).

In the technology industry, speakers of AAE are often not considered consumers who matter. For example, Benjamin (2019) recounts an Apple employee who worked on speech recognition for Siri:

  “As they worked on different English dialects — Australian, Singaporean, and Indian English — [the employee] asked his boss: ‘What about African American English?’ To this his boss responded: ‘Well, Apple products are for the premium market.’”

The reality, of course, is that speakers of AAE tend not to represent the “premium market” precisely because of institutions and policies that help to maintain racial hierarchies by systematically denying them the opportunities to develop wealth that are available to white Americans (Rothstein, 2017)—an exclusion that is reproduced in technology by countless decisions like the one described above.

Engaging with the literature outlined above situates the system behaviors that are described as “bias,” providing a foundation for normative reasoning. Researchers and practitioners should be concerned about “racial bias” in toxicity detection systems not only because performance differences impair system performance, but because they reproduce longstanding injustices of stigmatization and disenfranchisement for speakers of AAE. In re-stigmatizing AAE, they reproduce language ideologies in which AAE is viewed as ungrammatical, uneducated, and offensive. These ideologies, in turn, enable linguistic discrimination and justify enduring racial hierarchies (Rosa and Flores, 2017). Our perspective, which understands racial hierarchies and raciolinguistic ideologies as structural conditions that govern the development and deployment of technology, implies that techniques for measuring or mitigating “bias” in NLP systems will necessarily be incomplete unless they interrogate and dismantle these structural conditions, including the power relations between technologists and racialized communities.

We emphasize that engaging with the literature on AAE, racial hierarchies in the U.S., and raciolinguistic ideologies can generate new lines of engagement. These lines include work on the ways that the decisions made during the development and deployment of NLP systems produce stigmatization and disenfranchisement, and work on AAE use in practice, such as the ways that speakers of AAE interact with NLP systems that were not designed for them. This literature can also help researchers and practitioners address the allocational harms that may be produced by NLP systems, and ensure that even well-intentioned NLP systems do not position racialized communities as needing linguistic intervention or accommodation to dominant language practices. Finally, researchers and practitioners wishing to design better systems can also draw on a growing body of work on anti-racist language pedagogy that challenges the deficit perspective of AAE and other racialized language practices (e.g. Flores and Chaparro, 2018; Baker-Bell, 2019; Martínez and Mejía, 2019), as well as the work that we described in section 4.3 on reimagining the power relations between technologists and communities affected by technology.

6 Conclusion

By surveying 146 papers analyzing “bias” in NLP systems, we found that (a) their motivations are often vague, inconsistent, and lacking in normative reasoning; and (b) their proposed quantitative techniques for measuring or mitigating “bias” are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. To help researchers and practitioners avoid these pitfalls, we proposed three recommendations that should guide work analyzing “bias” in NLP systems, and, for each, provided several concrete research questions. These recommendations rest on a greater recognition of the relationships between language and social hierarchies—a step that we see as paramount to establishing a path forward.

Acknowledgments

This paper is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1451512. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We thank the reviewers for their useful feedback, especially the suggestion to include additional details about our method.
References

Artem Abzaliev. 2019. On GAP coreference resolution shared task: insights from the 3rd place solution. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 107–112, Florence, Italy.

ADA. 2018. Guidelines for Writing About People With Disabilities. ADA National Network. https://bit.ly/2KREbkB.

Oshin Agarwal, Funda Durupinar, Norman I. Badler, and Ani Nenkova. 2019. Word embeddings (also) encode human personality stereotypes. In Proceedings of the Joint Conference on Lexical and Computational Semantics, pages 205–211, Minneapolis, MN.

H. Samy Alim. 2004. You Know My Steez: An Ethnographic and Sociolinguistic Study of Styleshifting in a Black American Speech Community. American Dialect Society.

H. Samy Alim, John R. Rickford, and Arnetha F. Ball, editors. 2016. Raciolinguistics: How Language Shapes Our Ideas About Race. Oxford University Press.

Sandeep Attree. 2019. Gendered ambiguous pronouns shared task: Boosting model confidence by evidence pooling. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, Florence, Italy.

Pinkesh Badjatiya, Manish Gupta, and Vasudeva Varma. 2019. Stereotypical bias removal for hate speech detection task using knowledge-based generalizations. In Proceedings of the International World Wide Web Conference, pages 49–59, San Francisco, CA.

Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. 2019. Differential Privacy Has Disparate Impact on Model Accuracy. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada.

April Baker-Bell. 2019. Dismantling anti-black linguistic racism in English language arts classrooms: Toward an anti-racist black language pedagogy. Theory Into Practice.

David Bamman, Sejal Popat, and Sheng Shen. 2019. An annotated dataset of literary entities. In Proceedings of the North American Association for Computational Linguistics (NAACL), pages 2138–2144, Minneapolis, MN.

Xingce Bao and Qianqian Qiao. 2019. Transfer Learning from Pre-trained BERT for Pronoun Resolution. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 82–88, Florence, Italy.

Shaowen Bardzell and Jeffrey Bardzell. 2011. Towards a Feminist HCI Methodology: Social Science, Feminism, and HCI. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), pages 675–684, Vancouver, Canada.

Solon Barocas, Asia J. Biega, Benjamin Fish, Jędrzej Niklas, and Luke Stark. 2020. When Not to Design, Build, or Deploy. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Barcelona, Spain.

Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. 2017. The Problem With Bias: Allocative Versus Representational Harms in Machine Learning. In Proceedings of SIGCIS, Philadelphia, PA.

Christine Basta, Marta R. Costa-jussà, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. In Proceedings of the Workshop on Gender Bias for Natural Language Processing, pages 33–39, Florence, Italy.

John Baugh. 2018. Linguistics in Pursuit of Justice. Cambridge University Press.

Emily M. Bender. 2019. A typology of ethical risks in language technology with an eye towards where transparent documentation can help. Presented at The Future of Artificial Intelligence: Language, Ethics, Technology Workshop. https://bit.ly/2P9t9M6.

Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. John Wiley & Sons.

Ruha Benjamin. 2020. 2020 Vision: Reimagining the Default Settings of Technology & Society. Keynote at ICLR.

Cynthia L. Bennett and Os Keyes. 2019. What is the Point of Fairness? Disability, AI, and The Complexity of Justice. In Proceedings of the ASSETS Workshop on AI Fairness for People with Disabilities, Pittsburgh, PA.

Camiel J. Beukeboom and Christian Burgers. 2019. How Stereotypes Are Shared Through Language: A Review and Introduction of the Social Categories and Stereotypes Communication (SCSC) Framework. Review of Communication Research, 7:1–37.

Shruti Bhargava and David Forsyth. 2019. Exposing and Correcting the Gender Bias in Image Captioning Datasets and Models. arXiv preprint arXiv:1912.00578.

Jayadev Bhaskaran and Isha Bhallamudi. 2019. Good Secretaries, Bad Truck Drivers? Occupational Gender Stereotypes in Sentiment Analysis. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 62–68, Florence, Italy.
Su Lin Blodgett, Lisa Green, and Brendan O’Connor. 2016. Demographic Dialectal Variation in Social Media: A Case Study of African-American English. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 1119–1130, Austin, TX.

Su Lin Blodgett and Brendan O’Connor. 2017. Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English. In Proceedings of the Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML), Halifax, Canada.

Su Lin Blodgett, Johnny Wei, and Brendan O’Connor. 2018. Twitter Universal Dependency Parsing for African-American and Mainstream American English. In Proceedings of the Association for Computational Linguistics (ACL), pages 1415–1425, Melbourne, Australia.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016a. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Proceedings of the Conference on Neural Information Processing Systems, pages 4349–4357, Barcelona, Spain.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016b. Quantifying and reducing stereotypes in word embeddings. In Proceedings of the ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, pages 41–45, New York, NY.

Shikha Bordia and Samuel R. Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the NAACL Student Research Workshop, pages 7–15, Minneapolis, MN.

Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, and Richard Zemel. 2019. Understanding the Origins of Bias in Word Embeddings. In Proceedings of the International Conference on Machine Learning, pages 803–811, Long Beach, CA.

Mary Bucholtz, Dolores Inés Casillas, and Jin Sook Lee. 2016. Beyond Empowerment: Accompaniment and Sociolinguistic Justice in a Youth Research Program. In Robert Lawson and Dave Sayers, editors, Sociolinguistic Research: Application and Impact, pages 25–44. Routledge.

Mary Bucholtz, Dolores Inés Casillas, and Jin Sook Lee. 2019. California Latinx Youth as Agents of Sociolinguistic Justice. In Netta Avineri, Laura R. Graham, Eric J. Johnson, Robin Conley Riner, and Jonathan Rosa, editors, Language and Social Justice in Practice, pages 166–175. Routledge.

Mary Bucholtz, Audrey Lopez, Allina Mojarro, Elena Skapoulli, Chris VanderStouwe, and Shawn Warner-Garcia. 2014. Sociolinguistic Justice in the Schools: Student Researchers as Linguistic Experts. Language and Linguistics Compass, 8:144–157.

Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor Darrell, and Anna Rohrbach. 2018. Women also Snowboard: Overcoming Bias in Captioning Models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 793–811, Munich, Germany.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334).

Kathryn Campbell-Kibler. 2009. The nature of sociolinguistic perception. Language Variation and Change, 21(1):135–156.

Yang Trista Cao and Hal Daumé, III. 2019. Toward gender-inclusive coreference resolution. arXiv preprint arXiv:1910.13913.

Rakesh Chada. 2019. Gendered pronoun resolution using BERT and an extractive question answering formulation. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 126–133, Florence, Italy.

Kaytlin Chaloner and Alfredo Maldonado. 2019. Measuring Gender Bias in Word Embedding across Domains and Discovering New Gender Bias Word Categories. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 25–32, Florence, Italy.

Anne H. Charity Hudley. 2017. Language and Racialization. In Ofelia García, Nelson Flores, and Massimiliano Spotti, editors, The Oxford Handbook of Language and Society. Oxford University Press.

Won Ik Cho, Ji Won Kim, Seok Min Kim, and Nam Soo Kim. 2019. On measuring gender bias in translation of gender-neutral pronouns. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 173–181, Florence, Italy.

Shivang Chopra, Ramit Sawhney, Puneet Mathur, and Rajiv Ratn Shah. 2020. Hindi-English Hate Speech Detection: Author Profiling, Debiasing, and Practical Perspectives. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY.

Marika Cifor, Patricia Garcia, T.L. Cowan, Jasmine Rault, Tonia Sutherland, Anita Say Chan, Jennifer Rode, Anna Lauren Hoffmann, Niloufar Salehi, and Lisa Nakamura. 2019. Feminist Data Manifest-No. Retrieved from https://www.manifestno.com/.

Patricia Hill Collins. 2000. Black Feminist Thought: Knowledge, Consciousness, and the Politics of Empowerment. Routledge.
Justin T. Craft, Kelly E. Wright, Rachel Elizabeth Weissler, and Robin M. Queen. 2020. Language and Discrimination: Generating Meaning, Perceiving Identities, and Discriminating Outcomes. Annual Review of Linguistics, 6(1).

Kate Crawford. 2017. The Trouble with Bias. Keynote at NeurIPS.

Kimberle Crenshaw. 1989. Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics. University of Chicago Legal Forum.

Amanda Cercas Curry and Verena Rieser. 2018. #MeToo: How Conversational Systems Respond to Sexual Harassment. In Proceedings of the Workshop on Ethics in Natural Language Processing, pages 7–14, New Orleans, LA.

Karan Dabas, Nishtha Madaan, Gautam Singh, Vijay Arya, Sameep Mehta, and Tanmoy Chakraborty. 2020. Fair Transfer of Multiple Style Attributes in Text. arXiv preprint arXiv:2001.06693.

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Workshop on Abusive Language Online, pages 25–35, Florence, Italy.

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128, Atlanta, GA.

Sunipa Dev, Tao Li, Jeff Phillips, and Vivek Srikumar. 2019. On Measuring and Mitigating Biased Inferences of Word Embeddings. arXiv preprint arXiv:1908.09369.

Sunipa Dev and Jeff Phillips. 2019. Attenuating Bias in Word Vectors. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 879–887, Naha, Japan.

Mark Díaz, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. 2018. Addressing age-related bias in sentiment analysis. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), Montréal, Canada.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2019. Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation. arXiv preprint arXiv:1911.03842.

Carl DiSalvo, Andrew Clement, and Volkmar Pipek. 2013. Communities: Participatory Design for, with and by communities. In Jesper Simonsen and Toni Robertson, editors, Routledge International Handbook of Participatory Design, pages 182–209. Routledge.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the Conference on Artificial Intelligence, Ethics, and Society (AIES), New Orleans, LA.

Jacob Eisenstein. 2013. What to do about bad language on the Internet. In Proceedings of the North American Association for Computational Linguistics (NAACL), pages 359–369.

Kawin Ethayarajh. 2020. Is Your Classifier Actually Biased? Measuring Fairness under Uncertainty with Bernstein Bounds. In Proceedings of the Association for Computational Linguistics (ACL).

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Understanding Undesirable Word Embedding Associations. In Proceedings of the Association for Computational Linguistics (ACL), pages 1696–1705, Florence, Italy.

Joseph Fisher. 2019. Measuring social bias in knowledge graph embeddings. arXiv preprint arXiv:1912.02761.

Nelson Flores and Sofía Chaparro. 2018. What counts as language education policy? Developing a materialist Anti-racist approach to language activism. Language Policy, 17(3):365–384.

Omar U. Florez. 2019. On the Unintended Social Bias of Training Language Generation Models with Data from Local Media. In Proceedings of the NeurIPS Workshop on Human-Centric Machine Learning, Vancouver, Canada.

Joel Escudé Font and Marta R. Costa-jussà. 2019. Equalizing gender biases in neural machine translation with word embeddings techniques. In Proceedings of the Workshop on Gender Bias for Natural Language Processing, pages 147–154, Florence, Italy.

Batya Friedman and David G. Hendry. 2019. Value Sensitive Design: Shaping Technology with Moral Imagination. MIT Press.

Batya Friedman, Peter H. Kahn Jr., and Alan Borning. 2006. Value Sensitive Design and Information Systems. In Dennis Galletta and Ping Zhang, editors, Human-Computer Interaction in Management Information Systems: Foundations, pages 348–372. M.E. Sharpe.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes. Proceedings of the National Academy of Sciences, 115(16).