Language (Technology) is Power: A Critical Survey of “Bias” in NLP
Su Lin Blodgett
College of Information and Computer Sciences, University of Massachusetts Amherst
blodgett@cs.umass.edu

Solon Barocas
Microsoft Research; Cornell University
solon@microsoft.com

Hal Daumé III
Microsoft Research; University of Maryland
me@hal3.name

Hanna Wallach
Microsoft Research
wallach@microsoft.com
Abstract

We survey 146 papers analyzing “bias” in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing “bias” is an inherently normative process. We further find that these papers’ proposed quantitative techniques for measuring or mitigating “bias” are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. Based on these findings, we describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing “bias” in NLP systems. These recommendations rest on a greater recognition of the relationships between language and social hierarchies, encouraging researchers and practitioners to articulate their conceptualizations of “bias”—i.e., what kinds of system behaviors are harmful, in what ways, to whom, and why, as well as the normative reasoning underlying these statements—and to center work around the lived experiences of members of communities affected by NLP systems, while interrogating and reimagining the power relations between technologists and such communities.

1 Introduction

A large body of work analyzing “bias” in natural language processing (NLP) systems has emerged in recent years, including work on “bias” in embedding spaces (e.g., Bolukbasi et al., 2016a; Caliskan et al., 2017; Gonen and Goldberg, 2019; May et al., 2019) as well as work on “bias” in systems developed for a breadth of tasks including language modeling (Lu et al., 2018; Bordia and Bowman, 2019), coreference resolution (Rudinger et al., 2018; Zhao et al., 2018a), machine translation (Vanmassenhove et al., 2018; Stanovsky et al., 2019), sentiment analysis (Kiritchenko and Mohammad, 2018), and hate speech/toxicity detection (e.g., Park et al., 2018; Dixon et al., 2018), among others.

Although these papers have laid vital groundwork by illustrating some of the ways that NLP systems can be harmful, the majority of them fail to engage critically with what constitutes “bias” in the first place. Despite the fact that analyzing “bias” is an inherently normative process—in which some system behaviors are deemed good and others harmful—papers on “bias” in NLP systems are rife with unstated assumptions about what kinds of system behaviors are harmful, in what ways, to whom, and why. Indeed, the term “bias” (or “gender bias” or “racial bias”) is used to describe a wide range of system behaviors, even though they may be harmful in different ways, to different groups, or for different reasons. Even papers analyzing “bias” in NLP systems developed for the same task often conceptualize it differently.

For example, the following system behaviors are all understood to be self-evident statements of “racial bias”: (a) embedding spaces in which embeddings for names associated with African Americans are closer (compared to names associated with European Americans) to unpleasant words than pleasant words (Caliskan et al., 2017); (b) sentiment analysis systems yielding different intensity scores for sentences containing names associated with African Americans and sentences containing names associated with European Americans (Kiritchenko and Mohammad, 2018); and (c) toxicity detection systems scoring tweets containing features associated with African-American English as more offensive than tweets without these features (Davidson et al., 2019; Sap et al., 2019). Moreover, some of these papers focus on “racial bias” expressed in written text, while others focus on “racial bias” against authors. This use of imprecise terminology obscures these important differences.
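Concretely, a behavior like (b) is usually operationalized by scoring minimally different template sentences and comparing the scores across name sets. The sketch below illustrates the general shape of such a test; the templates, name lists, and the `score_sentiment` stand-in are hypothetical placeholders, not the materials used by Kiritchenko and Mohammad (2018).

```python
# Minimal sketch of a template-based score comparison. Replace `score_sentiment`
# with the sentiment analysis system under analysis; lists here are illustrative.
from statistics import mean

def score_sentiment(sentence: str) -> float:
    # Placeholder: substitute the system being tested (higher = more intense).
    return 0.0

TEMPLATES = ["{name} feels angry.", "The conversation with {name} was heartbreaking."]
AA_NAMES = ["Darnell", "Latoya"]   # illustrative names associated with African Americans
EA_NAMES = ["Justin", "Amanda"]    # illustrative names associated with European Americans

def mean_score(names):
    # Average score over all template/name combinations for one name set.
    return mean(score_sentiment(t.format(name=n)) for t in TEMPLATES for n in names)

gap = mean_score(AA_NAMES) - mean_score(EA_NAMES)
print(f"Mean intensity gap (AA - EA names): {gap:+.3f}")
```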
We survey 146 papers analyzing “bias” in NLP systems, finding that their motivations are often vague and inconsistent. Many lack any normative reasoning for why the system behaviors that are described as “bias” are harmful, in what ways, and to whom. Moreover, the vast majority of these papers do not engage with the relevant literature outside of NLP to ground normative concerns when proposing quantitative techniques for measuring or mitigating “bias.” As a result, we find that many of these techniques are poorly matched to their motivations, and are not comparable to one another.

We then describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing “bias” in NLP systems. We argue that such work should examine the relationships between language and social hierarchies; we call on researchers and practitioners conducting such work to articulate their conceptualizations of “bias” in order to enable conversations about what kinds of system behaviors are harmful, in what ways, to whom, and why; and we recommend deeper engagements between technologists and communities affected by NLP systems. We also provide several concrete research questions that are implied by each of our recommendations.
2 Method

Our survey includes all papers known to us analyzing “bias” in NLP systems—146 papers in total. We omitted papers about speech, restricting our survey to papers about written text only. To identify the 146 papers, we first searched the ACL Anthology [1] for all papers with the keywords “bias” or “fairness” that were made available prior to May 2020. We retained all papers about social “bias,” and discarded all papers about other definitions of the keywords (e.g., hypothesis-only bias, inductive bias, media bias). We also discarded all papers using “bias” in NLP systems to measure social “bias” in text or the real world (e.g., Garg et al., 2018). To ensure that we did not exclude any relevant papers without the keywords “bias” or “fairness,” we also traversed the citation graph of our initial set of papers, retaining any papers analyzing “bias” in NLP systems that are cited by or cite the papers in our initial set. Finally, we manually inspected any papers analyzing “bias” in NLP systems from leading machine learning, human–computer interaction, and web conferences and workshops, such as ICML, NeurIPS, AIES, FAccT, CHI, and WWW, along with any relevant papers that were made available in the “Computation and Language” and “Computers and Society” categories on arXiv prior to May 2020, but found that they had already been identified via our traversal of the citation graph. We provide a list of all 146 papers in the appendix. In Table 1, we provide a breakdown of the NLP tasks covered by the papers. We note that counts do not sum to 146, because some papers cover multiple tasks. For example, a paper might test the efficacy of a technique for mitigating “bias” in embedding spaces in the context of sentiment analysis.

[1] https://www.aclweb.org/anthology/

NLP task                                    Papers
Embeddings (type-level or contextualized)   54
Coreference resolution                      20
Language modeling or dialogue generation    17
Hate-speech detection                       17
Sentiment analysis                          15
Machine translation                         8
Tagging or parsing                          5
Surveys, frameworks, and meta-analyses      20
Other                                       22

Table 1: The NLP tasks covered by the 146 papers.
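Schematically, this selection procedure is a keyword-filtered seed set that is then closed under citation links in both directions. The sketch below illustrates that procedure under assumptions: the data structures (`papers`, `links`) and the relevance check are hypothetical stand-ins, not the tooling actually used for the survey.

```python
# Schematic sketch of the paper-selection procedure: keyword seed set, then
# closure under citation links, keeping only papers judged relevant.
KEYWORDS = ("bias", "fairness")

def keyword_matches(papers):
    """Initial set: paper ids whose title or abstract contains a keyword."""
    return {pid for pid, meta in papers.items()
            if any(k in meta["title"].lower() or k in meta["abstract"].lower()
                   for k in KEYWORDS)}

def expand_via_citations(seed, links, is_relevant):
    """Add any relevant paper that cites, or is cited by, a paper already selected."""
    selected, frontier = set(seed), set(seed)
    while frontier:
        neighbors = {q for p in frontier for q in links.get(p, set())}
        new = {q for q in neighbors - selected if is_relevant(q)}
        selected |= new
        frontier = new
    return selected

# Toy usage with made-up records; `links` maps a paper id to ids it cites or is cited by.
papers = {"p1": {"title": "Gender bias in embeddings", "abstract": "..."},
          "p2": {"title": "Coreference and stereotypes", "abstract": "..."}}
links = {"p1": {"p2"}, "p2": {"p1"}}
print(sorted(expand_via_citations(keyword_matches(papers), links, is_relevant=lambda q: True)))
```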
Once identified, we then read each of the 146 papers with the goal of categorizing their motivations and their proposed quantitative techniques for measuring or mitigating “bias.” We used a previously developed taxonomy of harms for this categorization, which differentiates between so-called allocational and representational harms (Barocas et al., 2017; Crawford, 2017). Allocational harms arise when an automated system allocates resources (e.g., credit) or opportunities (e.g., jobs) unfairly to different social groups; representational harms arise when a system (e.g., a search engine) represents some social groups in a less favorable light than others, demeans them, or fails to recognize their existence altogether. Adapting and extending this taxonomy, we categorized the 146 papers’ motivations and techniques into the following categories:

• Allocational harms.
• Representational harms: [2]
  – Stereotyping that propagates negative generalizations about particular social groups.
  – Differences in system performance for different social groups, language that misrepresents the distribution of different social groups in the population, or language that is denigrating to particular social groups.
• Questionable correlations between system behavior and features of language that are typically associated with particular social groups.
• Vague descriptions of “bias” (or “gender bias” or “racial bias”) or no description at all.
• Surveys, frameworks, and meta-analyses.

In Table 2 we provide counts for each of the six categories listed above. (We also provide a list of the papers that fall into each category in the appendix.) Again, we note that the counts do not sum to 146, because some papers state multiple motivations, propose multiple techniques, or propose a single technique for measuring or mitigating multiple harms. Table 3, which is in the appendix, contains examples of the papers’ motivations and techniques across a range of different NLP tasks.

Category                                 Motivation   Technique
Allocational harms                       30           4
Stereotyping                             50           58
Other representational harms             52           43
Questionable correlations                47           42
Vague/unstated                           23           0
Surveys, frameworks, and meta-analyses   20           20

Table 2: The categories into which the 146 papers fall (counts are numbers of papers).

[2] We grouped several types of representational harms into two categories to reflect that the main point of differentiation between the 146 papers’ motivations and proposed quantitative techniques for measuring or mitigating “bias” is whether or not they focus on stereotyping. Among the papers that do not focus on stereotyping, we found that most lack sufficiently clear motivations and techniques to reliably categorize them further.

3 Findings

Categorizing the 146 papers’ motivations and proposed quantitative techniques for measuring or mitigating “bias” into the six categories listed above enabled us to identify several commonalities, which we present below, along with illustrative quotes.

3.1 Motivations

Papers state a wide range of motivations, multiple motivations, vague motivations, and sometimes no motivations at all. We found that the papers’ motivations span all six categories, with several papers falling into each one. Appropriately, papers that provide surveys or frameworks for analyzing “bias” in NLP systems often state multiple motivations (e.g., Hovy and Spruit, 2016; Bender, 2019; Sun et al., 2019; Rozado, 2020; Shah et al., 2020). However, as the examples in Table 3 (in the appendix) illustrate, many other papers (33%) do so as well. Some papers (16%) state only vague motivations or no motivations at all. For example,

“[N]o human should be discriminated on the basis of demographic attributes by an NLP system.” —Kaneko and Bollegala (2019)

“[P]rominent word embeddings [...] encode systematic biases against women and black people [...] implicating many NLP systems in scaling up social injustice.” —May et al. (2019)

These examples leave unstated what it might mean for an NLP system to “discriminate,” what constitutes “systematic biases,” or how NLP systems contribute to “social injustice” (itself undefined).

Papers’ motivations sometimes include no normative reasoning. We found that some papers (32%) are not motivated by any apparent normative concerns, often focusing instead on concerns about system performance. For example, the first quote below includes normative reasoning—namely that models should not use demographic information to make predictions—while the other focuses on learned correlations impairing system performance.

“In [text classification], models are expected to make predictions with the semantic information rather than with the demographic group identity information (e.g., ‘gay’, ‘black’) contained in the sentences.” —Zhang et al. (2020a)

“An over-prevalence of some gendered forms in the training data leads to translations with identifiable errors. Translations are better for sentences involving men and for sentences containing stereotypical gender roles.” —Saunders and Byrne (2020)
Even when papers do state clear motivations, they are often unclear about why the system behaviors that are described as “bias” are harmful, in what ways, and to whom. We found that even papers with clear motivations often fail to explain what kinds of system behaviors are harmful, in what ways, to whom, and why. For example,

“Deploying these word embedding algorithms in practice, for example in automated translation systems or as hiring aids, runs the serious risk of perpetuating problematic biases in important societal contexts.” —Brunet et al. (2019)

“[I]f the systems show discriminatory behaviors in the interactions, the user experience will be adversely affected.” —Liu et al. (2019)

These examples leave unstated what “problematic biases” or non-ideal user experiences might look like, how the system behaviors might result in these things, and who the relevant stakeholders or users might be. In contrast, we find that papers that provide surveys or frameworks for analyzing “bias” in NLP systems often name who is harmed, acknowledging that different social groups may experience these systems differently due to their different relationships with NLP systems or different social positions. For example, Ruane et al. (2019) argue for a “deep understanding of the user groups [sic] characteristics, contexts, and interests” when designing conversational agents.

Papers about NLP systems developed for the same task often conceptualize “bias” differently. Even papers that cover the same NLP task often conceptualize “bias” in ways that differ substantially and are sometimes inconsistent. Rows 3 and 4 of Table 3 (in the appendix) contain machine translation papers with different conceptualizations of “bias,” leading to different proposed techniques, while rows 5 and 6 contain papers on “bias” in embedding spaces that state different motivations, but propose techniques for quantifying stereotyping.

Papers’ motivations conflate allocational and representational harms. We found that the papers’ motivations sometimes (16%) name immediate representational harms, such as stereotyping, alongside more distant allocational harms, which, in the case of stereotyping, are usually imagined as downstream effects of stereotypes on résumé filtering. Many of these papers use the imagined downstream effects to justify focusing on particular system behaviors, even when the downstream effects are not measured. Papers on “bias” in embedding spaces are especially likely to do this because embeddings are often used as input to other systems:

“However, none of these papers [on embeddings] have recognized how blatantly sexist the embeddings are and hence risk introducing biases of various types into real-world systems.” —Bolukbasi et al. (2016a)

“It is essential to quantify and mitigate gender bias in these embeddings to avoid them from affecting downstream applications.” —Zhou et al. (2019)

In contrast, papers that provide surveys or frameworks for analyzing “bias” in NLP systems treat representational harms as harmful in their own right. For example, Mayfield et al. (2019) and Ruane et al. (2019) cite the harmful reproduction of dominant linguistic norms by NLP systems (a point to which we return in section 4), while Bender (2019) outlines a range of harms, including seeing stereotypes in search results and being made invisible to search engines due to language practices.

3.2 Techniques

Papers’ techniques are not well grounded in the relevant literature outside of NLP. Perhaps unsurprisingly given that the papers’ motivations are often vague, inconsistent, and lacking in normative reasoning, we also found that the papers’ proposed quantitative techniques for measuring or mitigating “bias” do not effectively engage with the relevant literature outside of NLP. Papers on stereotyping are a notable exception: the Word Embedding Association Test (Caliskan et al., 2017) draws on the Implicit Association Test (Greenwald et al., 1998) from the social psychology literature, while several techniques operationalize the well-studied “Angry Black Woman” stereotype (Kiritchenko and Mohammad, 2018; May et al., 2019; Tan and Celis, 2019) and the “double bind” faced by women (May et al., 2019; Tan and Celis, 2019), in which women who succeed at stereotypically male tasks are perceived to be less likable than similarly successful men (Heilman et al., 2004). Tan and Celis (2019) also examine the compounding effects of race and gender, drawing on Black feminist scholarship on intersectionality (Crenshaw, 1989).
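For concreteness, a WEAT-style measurement compares how strongly two sets of target words associate with two sets of attribute words in an embedding space (association via cosine similarity, standardized across targets). The sketch below follows the general recipe of Caliskan et al. (2017) rather than their released implementation, and uses toy random vectors in place of a trained embedding model.

```python
# Sketch of a WEAT-style effect size over an embedding lookup (toy vectors).
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B, emb):
    """Mean cosine similarity of word w to attribute set A minus attribute set B."""
    return (np.mean([cos(emb[w], emb[a]) for a in A])
            - np.mean([cos(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    """Standardized difference in association between target sets X and Y."""
    s = {w: assoc(w, A, B, emb) for w in X + Y}
    sx, sy = [s[w] for w in X], [s[w] for w in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(list(s.values()), ddof=1)

# Toy example with random vectors; real analyses use, e.g., names as targets
# and pleasant/unpleasant words as attributes.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["x1", "x2", "y1", "y2", "a1", "a2", "b1", "b2"]}
print(weat_effect_size(["x1", "x2"], ["y1", "y2"], ["a1", "a2"], ["b1", "b2"], emb))
```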
Papers’ techniques are poorly matched to their motivations. We found that although 21% of the papers include allocational harms in their motivations, only four papers actually propose techniques for measuring or mitigating allocational harms.

Papers focus on a narrow range of potential sources of “bias.” We found that nearly all of the papers focus on system predictions as the potential sources of “bias,” with many additionally focusing on “bias” in datasets (e.g., differences in the number of gendered pronouns in the training data (Zhao et al., 2019)). Most papers do not interrogate the normative implications of other decisions made during the development and deployment lifecycle—perhaps unsurprising given that their motivations sometimes include no normative reasoning. A few papers are exceptions, illustrating the impacts of task definitions, annotation guidelines, and evaluation metrics: Cao and Daumé (2019) study how folk conceptions of gender (Keyes, 2018) are reproduced in coreference resolution systems that assume a strict gender dichotomy, thereby maintaining cisnormativity; Sap et al. (2019) focus on the effect of priming annotators with information about possible dialectal differences when asking them to apply toxicity labels to sample tweets, finding that annotators who are primed are significantly less likely to label tweets containing features associated with African-American English as offensive.
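One example of such a dataset-level statistic is the relative frequency of gendered pronouns in the training data. The sketch below is a deliberately simplified version of that kind of count, with illustrative word lists rather than the resources used by Zhao et al. (2019).

```python
# Minimal sketch of a dataset-level check: counting gendered pronouns in a corpus.
import re
from collections import Counter

FEMININE = {"she", "her", "hers", "herself"}    # illustrative word lists
MASCULINE = {"he", "him", "his", "himself"}

def pronoun_counts(sentences):
    counts = Counter()
    for s in sentences:
        for tok in re.findall(r"[a-z']+", s.lower()):
            if tok in FEMININE:
                counts["feminine"] += 1
            elif tok in MASCULINE:
                counts["masculine"] += 1
    return counts

corpus = ["He went to work.", "She is a doctor.", "He said he would call him."]
print(pronoun_counts(corpus))  # Counter({'masculine': 4, 'feminine': 1})
```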
4 A path forward

We now describe how researchers and practitioners conducting work analyzing “bias” in NLP systems might avoid the pitfalls presented in the previous section—the beginnings of a path forward. We propose three recommendations that should guide such work, and, for each, provide several concrete research questions. We emphasize that these questions are not comprehensive, and are intended to generate further questions and lines of engagement. Our three recommendations are as follows:

(R1) Ground work analyzing “bias” in NLP systems in the relevant literature outside of NLP that explores the relationships between language and social hierarchies. Treat representational harms as harmful in their own right.

(R2) Provide explicit statements of why the system behaviors that are described as “bias” are harmful, in what ways, and to whom. Be forthright about the normative reasoning (Green, 2019) underlying these statements.

(R3) Examine language use in practice by engaging with the lived experiences of members of communities affected by NLP systems. Interrogate and reimagine the power relations between technologists and such communities.

4.1 Language and social hierarchies

Turning first to (R1), we argue that work analyzing “bias” in NLP systems will paint a much fuller picture if it engages with the relevant literature outside of NLP that explores the relationships between language and social hierarchies. Many disciplines, including sociolinguistics, linguistic anthropology, sociology, and social psychology, study how language takes on social meaning and the role that language plays in maintaining social hierarchies. For example, language is the means through which social groups are labeled and one way that beliefs about social groups are transmitted (e.g., Maass, 1999; Beukeboom and Burgers, 2019). Group labels can serve as the basis of stereotypes and thus reinforce social inequalities: “[T]he label content functions to identify a given category of people, and thereby conveys category boundaries and a position in a hierarchical taxonomy” (Beukeboom and Burgers, 2019). Similarly, “controlling images,” such as stereotypes of Black women, which are linguistically and visually transmitted through literature, news media, television, and so forth, provide “ideological justification” for their continued oppression (Collins, 2000, Chapter 4).

As a result, many groups have sought to bring about social changes through changes in language, disrupting patterns of oppression and marginalization via so-called “gender-fair” language (Sczesny et al., 2016; Menegatti and Rubini, 2017), language that is more inclusive to people with disabilities (ADA, 2018), and language that is less dehumanizing (e.g., abandoning the use of the term “illegal” in everyday discourse on immigration in the U.S. (Rosa, 2019)). The fact that group labels are so contested is evidence of how deeply intertwined language and social hierarchies are. Taking “gender-fair” language as an example, the hope is that reducing asymmetries in language about women and men will reduce asymmetries in their social standing. Meanwhile, struggles over language use often arise from dominant social groups’ desire to “control both material and symbolic resources”—i.e., “the right to decide what words will mean and to control those meanings”—as was the case in some white speakers’ insistence on using offensive place names against the objections of Indigenous speakers (Hill, 2008, Chapter 3).
Sociolinguists and linguistic anthropologists have also examined language attitudes and language ideologies, or people’s metalinguistic beliefs about language: Which language varieties or practices are taken as standard, ordinary, or unmarked? Which are considered correct, prestigious, or appropriate for public use, and which are considered incorrect, uneducated, or offensive (e.g., Campbell-Kibler, 2009; Preston, 2009; Loudermilk, 2015; Lanehart and Malik, 2018)? Which are rendered invisible (Roche, 2019)? [3] Language ideologies play a vital role in reinforcing and justifying social hierarchies because beliefs about language varieties or practices often translate into beliefs about their speakers (e.g., Alim et al., 2016; Rosa and Flores, 2017; Craft et al., 2020). For example, in the U.S., the portrayal of non-white speakers’ language varieties and practices as linguistically deficient helped to justify violent European colonialism, and today continues to justify enduring racial hierarchies by maintaining views of non-white speakers as lacking the language “required for complex thinking processes and successful engagement in the global economy” (Rosa and Flores, 2017).

[3] Language ideologies encompass much more than this; see, e.g., Lippi-Green (2012), Alim et al. (2016), Rosa and Flores (2017), Rosa and Burdick (2017), and Charity Hudley (2017).

Recognizing the role that language plays in maintaining social hierarchies is critical to the future of work analyzing “bias” in NLP systems. First, it helps to explain why representational harms are harmful in their own right. Second, the complexity of the relationships between language and social hierarchies illustrates why studying “bias” in NLP systems is so challenging, suggesting that researchers and practitioners will need to move beyond existing algorithmic fairness techniques. We argue that work must be grounded in the relevant literature outside of NLP that examines the relationships between language and social hierarchies; without this grounding, researchers and practitioners risk measuring or mitigating only what is convenient to measure or mitigate, rather than what is most normatively concerning. More specifically, we recommend that work analyzing “bias” in NLP systems be reoriented around the following question: How are social hierarchies, language ideologies, and NLP systems coproduced? This question mirrors Benjamin’s (2020) call to examine how “race and technology are coproduced”—i.e., how racial hierarchies, and the ideologies and discourses that maintain them, create and are re-created by technology. We recommend that researchers and practitioners similarly ask how existing social hierarchies and language ideologies drive the development and deployment of NLP systems, and how these systems therefore reproduce these hierarchies and ideologies. As a starting point for reorienting work analyzing “bias” in NLP systems around this question, we provide the following concrete research questions:

• How do social hierarchies and language ideologies influence the decisions made during the development and deployment lifecycle? What kinds of NLP systems do these decisions result in, and what kinds do they foreclose?
  – General assumptions: To which linguistic norms do NLP systems adhere (Bender, 2019; Ruane et al., 2019)? Which language practices are implicitly assumed to be standard, ordinary, correct, or appropriate?
  – Task definition: For which speakers are NLP systems (and NLP resources) developed? (See Joshi et al. (2020) for a discussion.) How do task definitions discretize the world? For example, how are social groups delineated when defining demographic attribute prediction tasks (e.g., Koppel et al., 2002; Rosenthal and McKeown, 2011; Nguyen et al., 2013)? What about languages in native language prediction tasks (Tetreault et al., 2013)?
  – Data: How are datasets collected, preprocessed, and labeled or annotated? What are the impacts of annotation guidelines, annotator assumptions and perceptions (Olteanu et al., 2019; Sap et al., 2019; Geiger et al., 2020), and annotation aggregation processes (Pavlick and Kwiatkowski, 2019)?
  – Evaluation: How are NLP systems evaluated? What are the impacts of evaluation metrics (Olteanu et al., 2017)? Are any non-quantitative evaluations performed?
• How do NLP systems reproduce or transform language ideologies? Which language varieties or practices come to be deemed good or bad? Might “good” language simply mean language that is easily handled by existing NLP systems? For example, linguistic phenomena arising from many language practices (Eisenstein, 2013) are described as “noisy text” and often viewed as a target for “normalization.” How do the language ideologies that are reproduced by NLP systems maintain social hierarchies?
• Which representational harms are being measured or mitigated? Are these the most normatively concerning harms, or merely those that are well handled by existing algorithmic fairness techniques? Are there other representational harms that might be analyzed?
4.2 Conceptualizations of “bias”

Turning now to (R2), we argue that work analyzing “bias” in NLP systems should provide explicit statements of why the system behaviors that are described as “bias” are harmful, in what ways, and to whom, as well as the normative reasoning underlying these statements. In other words, researchers and practitioners should articulate their conceptualizations of “bias.” As we described above, papers often contain descriptions of system behaviors that are understood to be self-evident statements of “bias.” This use of imprecise terminology has led to papers all claiming to analyze “bias” in NLP systems, sometimes even in systems developed for the same task, but with different or even inconsistent conceptualizations of “bias,” and no explanations for these differences. Yet analyzing “bias” is an inherently normative process—in which some system behaviors are deemed good and others harmful—even if assumptions about what kinds of system behaviors are harmful, in what ways, for whom, and why are not stated. We therefore echo calls by Bardzell and Bardzell (2011), Keyes et al. (2019), and Green (2019) for researchers and practitioners to make their normative reasoning explicit by articulating the social values that underpin their decisions to deem some system behaviors as harmful, no matter how obvious such values appear to be. We further argue that this reasoning should take into account the relationships between language and social hierarchies that we described above. First, these relationships provide a foundation from which to approach the normative reasoning that we recommend making explicit. For example, some system behaviors might be harmful precisely because they maintain social hierarchies. Second, if work analyzing “bias” in NLP systems is reoriented to understand how social hierarchies, language ideologies, and NLP systems are coproduced, then this work will be incomplete if we fail to account for the ways that social hierarchies and language ideologies determine what we mean by “bias” in the first place. As a starting point, we therefore provide the following concrete research questions:

• What kinds of system behaviors are described as “bias”? What are their potential sources (e.g., general assumptions, task definition, data)?
• In what ways are these system behaviors harmful, to whom are they harmful, and why?
• What are the social values (obvious or not) that underpin this conceptualization of “bias”?

4.3 Language use in practice

Finally, we turn to (R3). Our perspective, which rests on a greater recognition of the relationships between language and social hierarchies, suggests several directions for examining language use in practice. Here, we focus on two. First, because language is necessarily situated, and because different social groups have different lived experiences due to their different social positions (Hanna et al., 2020)—particularly groups at the intersections of multiple axes of oppression—we recommend that researchers and practitioners center work analyzing “bias” in NLP systems around the lived experiences of members of communities affected by these systems. Second, we recommend that the power relations between technologists and such communities be interrogated and reimagined. Researchers have pointed out that algorithmic fairness techniques, by proposing incremental technical mitigations—e.g., collecting new datasets or training better models—maintain these power relations by (a) assuming that automated systems should continue to exist, rather than asking whether they should be built at all, and (b) keeping development and deployment decisions in the hands of technologists (Bennett and Keyes, 2019; Cifor et al., 2019; Green, 2019; Katell et al., 2020).

There are many disciplines for researchers and practitioners to draw on when pursuing these directions. For example, in human–computer interaction, Hamidi et al. (2018) study transgender people’s experiences with automated gender recognition systems in order to uncover how these systems reproduce structures of transgender exclusion by redefining what it means to perform gender “normally.” Value-sensitive design provides a framework for accounting for the values of different stakeholders in the design of technology (e.g., Friedman et al., 2006; Friedman and Hendry, 2019; Le Dantec et al., 2009; Yoo et al., 2019), while participatory design seeks to involve stakeholders in the design process itself (Sanders, 2002; Muller, 2007; Simonsen and Robertson, 2013; DiSalvo et al., 2013). Participatory action research in education (Kemmis, 2006) and in language documentation and reclamation (Junker, 2018) is also relevant.
In particular, work on language reclamation to support decolonization and tribal sovereignty (Leonard, 2012) and work in sociolinguistics focusing on developing co-equal research relationships with community members and supporting linguistic justice efforts (e.g., Bucholtz et al., 2014, 2016, 2019) provide examples of more emancipatory relationships with communities. Finally, several workshops and events have begun to explore how to empower stakeholders in the development and deployment of technology (Vaccaro et al., 2019; Givens and Morris, 2020; Sassaman et al., 2020) [4] and how to help researchers and practitioners consider when not to build systems at all (Barocas et al., 2020).

[4] Also https://participatoryml.github.io/

As a starting point for engaging with communities affected by NLP systems, we therefore provide the following concrete research questions:

• How do communities become aware of NLP systems? Do they resist them, and if so, how?
• What additional costs are borne by communities for whom NLP systems do not work well?
• Do NLP systems shift power toward oppressive institutions (e.g., by enabling predictions that communities do not want made, linguistically based unfair allocation of resources or opportunities (Rosa and Flores, 2017), surveillance, or censorship), or away from such institutions?
• Who is involved in the development and deployment of NLP systems? How do decision-making processes maintain power relations between technologists and communities affected by NLP systems? Can these processes be changed to reimagine these relations?

5 Case study

To illustrate our recommendations, we present a case study covering work on African-American English (AAE). [5] Work analyzing “bias” in the context of AAE has shown that part-of-speech taggers, language identification systems, and dependency parsers all work less well on text containing features associated with AAE than on text without these features (Jørgensen et al., 2015, 2016; Blodgett et al., 2016, 2018), and that toxicity detection systems score tweets containing features associated with AAE as more offensive than tweets without them (Davidson et al., 2019; Sap et al., 2019). These papers have been critical for highlighting AAE as a language variety for which existing NLP systems may not work, illustrating their limitations. However, they do not conceptualize “racial bias” in the same way. The first four of these papers simply focus on system performance differences between text containing features associated with AAE and text without these features. In contrast, the last two papers also focus on such system performance differences, but motivate this focus with the following additional reasoning: If tweets containing features associated with AAE are scored as more offensive than tweets without these features, then this might (a) yield negative perceptions of AAE; (b) result in disproportionate removal of tweets containing these features, impeding participation in online platforms and reducing the space available online in which speakers can use AAE freely; and (c) cause AAE speakers to incur additional costs if they have to change their language practices to avoid negative perceptions or tweet removal.

[5] This language variety has had many different names over the years, but is now generally called African-American English (AAE), African-American Vernacular English (AAVE), or African-American Language (AAL) (Green, 2002; Wolfram and Schilling, 2015; Rickford and King, 2016).
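The system-performance differences at issue here are typically reported as gaps in how often a classifier flags text from each group. The sketch below illustrates that comparison with a placeholder classifier and made-up example tweets; it is not the data or models of the cited studies.

```python
# Minimal sketch of a per-group comparison: how often a toxicity classifier flags
# tweets with vs. without AAE-associated features. Replace `predict_offensive`
# with the classifier under analysis; the example tweets are placeholders.
def predict_offensive(text: str) -> bool:
    # Placeholder: substitute the toxicity/hate-speech detection system being tested.
    return False

def flag_rate(tweets):
    # Fraction of tweets the classifier labels as offensive.
    return sum(predict_offensive(t) for t in tweets) / len(tweets)

aae_tweets = ["example tweet containing AAE-associated features"]
non_aae_tweets = ["example tweet without those features"]

gap = flag_rate(aae_tweets) - flag_rate(non_aae_tweets)
print(f"Flag-rate gap (AAE minus non-AAE subset): {gap:+.3f}")
```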
More importantly, none of these papers engage with the literature on AAE, racial hierarchies in the U.S., and raciolinguistic ideologies. By failing to engage with this literature—thereby treating AAE simply as one of many non-Penn Treebank varieties of English or perhaps as another challenging domain—work analyzing “bias” in NLP systems in the context of AAE fails to situate these systems in the world. Who are the speakers of AAE? How are they viewed? We argue that AAE as a language variety cannot be separated from its speakers—primarily Black people in the U.S., who experience systemic anti-Black racism—and the language ideologies that reinforce and justify racial hierarchies. Even after decades of sociolinguistic efforts to legitimize AAE, it continues to be viewed as “bad” English and its speakers continue to be viewed as linguistically inadequate—a view called the deficit perspective (Alim et al., 2016; Rosa and Flores, 2017). This perspective persists despite demonstrations that AAE is rule-bound and grammatical (Mufwene et al., 1998; Green, 2002), in addition to ample evidence of its speakers’ linguistic adroitness (e.g., Alim, 2004; Rickford and King, 2016). This perspective belongs to a broader set of raciolinguistic ideologies (Rosa and Flores, 2017), which also produce allocational harms; speakers of AAE are frequently penalized for not adhering to dominant language practices, including in the education system (Alim, 2004; Terry et al., 2010), when seeking housing (Baugh, 2018), and in the judicial system, where their testimony is misunderstood or, worse yet, disbelieved (Rickford and King, 2016; Jones et al., 2019). These raciolinguistic ideologies position racialized communities as needing linguistic intervention, such as language education programs, in which these and other harms can be reduced if communities accommodate to dominant language practices (Rosa and Flores, 2017). In the technology industry, speakers of AAE are often not considered consumers who matter. For example, Benjamin (2019) recounts an Apple employee who worked on speech recognition for Siri:

“As they worked on different English dialects — Australian, Singaporean, and Indian English — [the employee] asked his boss: ‘What about African American English?’ To this his boss responded: ‘Well, Apple products are for the premium market.’”

The reality, of course, is that speakers of AAE tend not to represent the “premium market” precisely because of institutions and policies that help to maintain racial hierarchies by systematically denying them the opportunities to develop wealth that are available to white Americans (Rothstein, 2017)—an exclusion that is reproduced in technology by countless decisions like the one described above.

Engaging with the literature outlined above situates the system behaviors that are described as “bias,” providing a foundation for normative reasoning. Researchers and practitioners should be concerned about “racial bias” in toxicity detection systems not only because performance differences impair system performance, but because they reproduce longstanding injustices of stigmatization and disenfranchisement for speakers of AAE. In re-stigmatizing AAE, they reproduce language ideologies in which AAE is viewed as ungrammatical, uneducated, and offensive. These ideologies, in turn, enable linguistic discrimination and justify enduring racial hierarchies (Rosa and Flores, 2017). Our perspective, which understands racial hierarchies and raciolinguistic ideologies as structural conditions that govern the development and deployment of technology, implies that techniques for measuring or mitigating “bias” in NLP systems will necessarily be incomplete unless they interrogate and dismantle these structural conditions, including the power relations between technologists and racialized communities.

We emphasize that engaging with the literature on AAE, racial hierarchies in the U.S., and raciolinguistic ideologies can generate new lines of engagement. These lines include work on the ways that the decisions made during the development and deployment of NLP systems produce stigmatization and disenfranchisement, and work on AAE use in practice, such as the ways that speakers of AAE interact with NLP systems that were not designed for them. This literature can also help researchers and practitioners address the allocational harms that may be produced by NLP systems, and ensure that even well-intentioned NLP systems do not position racialized communities as needing linguistic intervention or accommodation to dominant language practices. Finally, researchers and practitioners wishing to design better systems can also draw on a growing body of work on anti-racist language pedagogy that challenges the deficit perspective of AAE and other racialized language practices (e.g., Flores and Chaparro, 2018; Baker-Bell, 2019; Martínez and Mejía, 2019), as well as the work that we described in section 4.3 on reimagining the power relations between technologists and communities affected by technology.

6 Conclusion

By surveying 146 papers analyzing “bias” in NLP systems, we found that (a) their motivations are often vague, inconsistent, and lacking in normative reasoning; and (b) their proposed quantitative techniques for measuring or mitigating “bias” are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. To help researchers and practitioners avoid these pitfalls, we proposed three recommendations that should guide work analyzing “bias” in NLP systems, and, for each, provided several concrete research questions. These recommendations rest on a greater recognition of the relationships between language and social hierarchies—a step that we see as paramount to establishing a path forward.

Acknowledgments

This paper is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1451512. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We thank the reviewers for their useful feedback, especially the suggestion to include additional details about our method.
References

Artem Abzaliev. 2019. On GAP coreference resolution shared task: insights from the 3rd place solution. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 107–112, Florence, Italy.

ADA. 2018. Guidelines for Writing About People With Disabilities. ADA National Network. https://bit.ly/2KREbkB.

Oshin Agarwal, Funda Durupinar, Norman I. Badler, and Ani Nenkova. 2019. Word embeddings (also) encode human personality stereotypes. In Proceedings of the Joint Conference on Lexical and Computational Semantics, pages 205–211, Minneapolis, MN.

H. Samy Alim. 2004. You Know My Steez: An Ethnographic and Sociolinguistic Study of Styleshifting in a Black American Speech Community. American Dialect Society.

H. Samy Alim, John R. Rickford, and Arnetha F. Ball, editors. 2016. Raciolinguistics: How Language Shapes Our Ideas About Race. Oxford University Press.

Sandeep Attree. 2019. Gendered ambiguous pronouns shared task: Boosting model confidence by evidence pooling. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, Florence, Italy.

Pinkesh Badjatiya, Manish Gupta, and Vasudeva Varma. 2019. Stereotypical bias removal for hate speech detection task using knowledge-based generalizations. In Proceedings of the International World Wide Web Conference, pages 49–59, San Francisco, CA.

Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. 2019. Differential Privacy Has Disparate Impact on Model Accuracy. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada.

April Baker-Bell. 2019. Dismantling anti-black linguistic racism in English language arts classrooms: Toward an anti-racist black language pedagogy. Theory Into Practice.

David Bamman, Sejal Popat, and Sheng Shen. 2019. An annotated dataset of literary entities. In Proceedings of the North American Association for Computational Linguistics (NAACL), pages 2138–2144, Minneapolis, MN.

Xingce Bao and Qianqian Qiao. 2019. Transfer Learning from Pre-trained BERT for Pronoun Resolution. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 82–88, Florence, Italy.

Shaowen Bardzell and Jeffrey Bardzell. 2011. Towards a Feminist HCI Methodology: Social Science, Feminism, and HCI. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), pages 675–684, Vancouver, Canada.

Solon Barocas, Asia J. Biega, Benjamin Fish, Jędrzej Niklas, and Luke Stark. 2020. When Not to Design, Build, or Deploy. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Barcelona, Spain.

Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. 2017. The Problem With Bias: Allocative Versus Representational Harms in Machine Learning. In Proceedings of SIGCIS, Philadelphia, PA.

Christine Basta, Marta R. Costa-jussà, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. In Proceedings of the Workshop on Gender Bias for Natural Language Processing, pages 33–39, Florence, Italy.

John Baugh. 2018. Linguistics in Pursuit of Justice. Cambridge University Press.

Emily M. Bender. 2019. A typology of ethical risks in language technology with an eye towards where transparent documentation can help. Presented at The Future of Artificial Intelligence: Language, Ethics, Technology Workshop. https://bit.ly/2P9t9M6.

Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. John Wiley & Sons.

Ruha Benjamin. 2020. 2020 Vision: Reimagining the Default Settings of Technology & Society. Keynote at ICLR.

Cynthia L. Bennett and Os Keyes. 2019. What is the Point of Fairness? Disability, AI, and The Complexity of Justice. In Proceedings of the ASSETS Workshop on AI Fairness for People with Disabilities, Pittsburgh, PA.

Camiel J. Beukeboom and Christian Burgers. 2019. How Stereotypes Are Shared Through Language: A Review and Introduction of the Social Categories and Stereotypes Communication (SCSC) Framework. Review of Communication Research, 7:1–37.

Shruti Bhargava and David Forsyth. 2019. Exposing and Correcting the Gender Bias in Image Captioning Datasets and Models. arXiv preprint arXiv:1912.00578.

Jayadev Bhaskaran and Isha Bhallamudi. 2019. Good Secretaries, Bad Truck Drivers? Occupational Gender Stereotypes in Sentiment Analysis. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 62–68, Florence, Italy.

Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic Dialectal Variation in Social Media: A Case Study of African-American English. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 1119–1130, Austin, TX.

Su Lin Blodgett and Brendan O'Connor. 2017. Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English. In Proceedings of the Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML), Halifax, Canada.

Su Lin Blodgett, Johnny Wei, and Brendan O'Connor. 2018. Twitter Universal Dependency Parsing for African-American and Mainstream American English. In Proceedings of the Association for Computational Linguistics (ACL), pages 1415–1425, Melbourne, Australia.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016a. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Proceedings of the Conference on Neural Information Processing Systems, pages 4349–4357, Barcelona, Spain.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016b. Quantifying and reducing stereotypes in word embeddings. In Proceedings of the ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, pages 41–45, New York, NY.

Shikha Bordia and Samuel R. Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the NAACL Student Research Workshop, pages 7–15, Minneapolis, MN.

Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, and Richard Zemel. 2019. Understanding the Origins of Bias in Word Embeddings. In Proceedings of the International Conference on Machine Learning, pages 803–811, Long Beach, CA.

Mary Bucholtz, Dolores Inés Casillas, and Jin Sook Lee. 2016. Beyond Empowerment: Accompaniment and Sociolinguistic Justice in a Youth Research Program. In Robert Lawson and Dave Sayers, editors, Sociolinguistic Research: Application and Impact, pages 25–44. Routledge.

Mary Bucholtz, Dolores Inés Casillas, and Jin Sook Lee. 2019. California Latinx Youth as Agents of Sociolinguistic Justice. In Netta Avineri, Laura R. Graham, Eric J. Johnson, Robin Conley Riner, and Jonathan Rosa, editors, Language and Social Justice in Practice, pages 166–175. Routledge.

Mary Bucholtz, Audrey Lopez, Allina Mojarro, Elena Skapoulli, Chris VanderStouwe, and Shawn Warner-Garcia. 2014. Sociolinguistic Justice in the Schools: Student Researchers as Linguistic Experts. Language and Linguistics Compass, 8:144–157.

Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor Darrell, and Anna Rohrbach. 2018. Women also Snowboard: Overcoming Bias in Captioning Models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 793–811, Munich, Germany.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334).

Kathryn Campbell-Kibler. 2009. The nature of sociolinguistic perception. Language Variation and Change, 21(1):135–156.

Yang Trista Cao and Hal Daumé, III. 2019. Toward gender-inclusive coreference resolution. arXiv preprint arXiv:1910.13913.

Rakesh Chada. 2019. Gendered pronoun resolution using BERT and an extractive question answering formulation. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 126–133, Florence, Italy.

Kaytlin Chaloner and Alfredo Maldonado. 2019. Measuring Gender Bias in Word Embedding across Domains and Discovering New Gender Bias Word Categories. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 25–32, Florence, Italy.

Anne H. Charity Hudley. 2017. Language and Racialization. In Ofelia García, Nelson Flores, and Massimiliano Spotti, editors, The Oxford Handbook of Language and Society. Oxford University Press.

Won Ik Cho, Ji Won Kim, Seok Min Kim, and Nam Soo Kim. 2019. On measuring gender bias in translation of gender-neutral pronouns. In Proceedings of the Workshop on Gender Bias in Natural Language Processing, pages 173–181, Florence, Italy.

Shivang Chopra, Ramit Sawhney, Puneet Mathur, and Rajiv Ratn Shah. 2020. Hindi-English Hate Speech Detection: Author Profiling, Debiasing, and Practical Perspectives. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY.

Marika Cifor, Patricia Garcia, T.L. Cowan, Jasmine Rault, Tonia Sutherland, Anita Say Chan, Jennifer Rode, Anna Lauren Hoffmann, Niloufar Salehi, and Lisa Nakamura. 2019. Feminist Data Manifest-No. Retrieved from https://www.manifestno.com/.

Patricia Hill Collins. 2000. Black Feminist Thought: Knowledge, Consciousness, and the Politics of Empowerment. Routledge.

Justin T. Craft, Kelly E. Wright, Rachel Elizabeth Weissler, and Robin M. Queen. 2020. Language and Discrimination: Generating Meaning, Perceiving Identities, and Discriminating Outcomes. Annual Review of Linguistics, 6(1).

Kate Crawford. 2017. The Trouble with Bias. Keynote at NeurIPS.

Kimberle Crenshaw. 1989. Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics. University of Chicago Legal Forum.

Amanda Cercas Curry and Verena Rieser. 2018. #MeToo: How Conversational Systems Respond to Sexual Harassment. In Proceedings of the Workshop on Ethics in Natural Language Processing, pages 7–14, New Orleans, LA.

Karan Dabas, Nishtha Madaan, Gautam Singh, Vijay Arya, Sameep Mehta, and Tanmoy Chakraborty. 2020. Fair Transfer of Multiple Style Attributes in Text. arXiv preprint arXiv:2001.06693.

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Workshop on Abusive Language Online, pages 25–35, Florence, Italy.

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128, Atlanta, GA.

Sunipa Dev, Tao Li, Jeff Phillips, and Vivek Srikumar. 2019. On Measuring and Mitigating Biased Inferences of Word Embeddings. arXiv preprint arXiv:1908.09369.

Sunipa Dev and Jeff Phillips. 2019. Attenuating Bias in Word Vectors. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 879–887, Naha, Japan.

Mark Díaz, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. 2018. Addressing age-related bias in sentiment analysis. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), Montréal, Canada.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2019. Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation. arXiv preprint arXiv:1911.03842.

Carl DiSalvo, Andrew Clement, and Volkmar Pipek. 2013. Communities: Participatory Design for, with and by communities. In Jesper Simonsen and Toni Robertson, editors, Routledge International Handbook of Participatory Design, pages 182–209. Routledge.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the Conference on Artificial Intelligence, Ethics, and Society (AIES), New Orleans, LA.

Jacob Eisenstein. 2013. What to do about bad language on the Internet. In Proceedings of the North American Association for Computational Linguistics (NAACL), pages 359–369.

Kawin Ethayarajh. 2020. Is Your Classifier Actually Biased? Measuring Fairness under Uncertainty with Bernstein Bounds. In Proceedings of the Association for Computational Linguistics (ACL).

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Understanding Undesirable Word Embedding Associations. In Proceedings of the Association for Computational Linguistics (ACL), pages 1696–1705, Florence, Italy.

Joseph Fisher. 2019. Measuring social bias in knowledge graph embeddings. arXiv preprint arXiv:1912.02761.

Nelson Flores and Sofia Chaparro. 2018. What counts as language education policy? Developing a materialist Anti-racist approach to language activism. Language Policy, 17(3):365–384.

Omar U. Florez. 2019. On the Unintended Social Bias of Training Language Generation Models with Data from Local Media. In Proceedings of the NeurIPS Workshop on Human-Centric Machine Learning, Vancouver, Canada.

Joel Escudé Font and Marta R. Costa-jussà. 2019. Equalizing gender biases in neural machine translation with word embeddings techniques. In Proceedings of the Workshop on Gender Bias for Natural Language Processing, pages 147–154, Florence, Italy.

Batya Friedman and David G. Hendry. 2019. Value Sensitive Design: Shaping Technology with Moral Imagination. MIT Press.

Batya Friedman, Peter H. Kahn Jr., and Alan Borning. 2006. Value Sensitive Design and Information Systems. In Dennis Galletta and Ping Zhang, editors, Human-Computer Interaction in Management Information Systems: Foundations, pages 348–372. M.E. Sharpe.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes. Proceedings of the National Academy of Sciences, 115(16).