A Survey of Race, Racism, and Anti-Racism in NLP
Anjalie Field, Carnegie Mellon University (anjalief@cs.cmu.edu)
Su Lin Blodgett, Microsoft Research (sulin.blodgett@microsoft.com)
Zeerak Waseem, University of Sheffield (z.w.butt@sheffield.ac.uk)
Yulia Tsvetkov, University of Washington (yuliats@cs.washington.edu)

arXiv:2106.11410v2 [cs.CL] 15 Jul 2021

Abstract

Despite inextricable ties between race and language, little work has considered race in NLP research and development. In this work, we survey 79 papers from the ACL anthology that mention race. These papers reveal various types of race-related bias in all stages of NLP model development, highlighting the need for proactive consideration of how NLP systems can uphold racial hierarchies. However, persistent gaps in research on race and NLP remain: race has been siloed as a niche topic and remains ignored in many NLP tasks; most work operationalizes race as a fixed single-dimensional variable with a ground-truth label, which risks reinforcing differences produced by historical racism; and the voices of historically marginalized people are nearly absent in NLP literature. By identifying where and how NLP literature has and has not considered race, especially in comparison to related fields, our work calls for inclusion and racial justice in NLP research practices.

1 Introduction

Race and language are tied in complicated ways. Raciolinguistics scholars have studied how they are mutually constructed: historically, colonial powers construct linguistic and racial hierarchies to justify violence, and currently, beliefs about the inferiority of racialized people's language practices continue to justify social and economic exclusion (Rosa and Flores, 2017).[1] Furthermore, language is the primary means through which stereotypes and prejudices are communicated and perpetuated (Hamilton and Trolier, 1986; Bar-Tal et al., 2013).

Footnote 1: We use racialization to refer to the process of "ascribing and prescribing a racial category or classification to an individual or group of people . . . based on racial attributes including but not limited to cultural and social history, physical features, and skin color" (Charity Hudley, 2017).

However, questions of race and racial bias have been minimally explored in NLP literature. While researchers and activists have increasingly drawn attention to racism in computer science and academia, frequently-cited examples of racial bias in AI are often drawn from disciplines other than NLP, such as computer vision (facial recognition) (Buolamwini and Gebru, 2018) or machine learning (recidivism risk prediction) (Angwin et al., 2016). Even the presence of racial biases in search engines like Google (Sweeney, 2013; Noble, 2018) has prompted little investigation in the ACL community. Work on NLP and race remains sparse, particularly in contrast to concerns about gender bias, which have led to surveys, workshops, and shared tasks (Sun et al., 2019; Webster et al., 2019).

In this work, we conduct a comprehensive survey of how NLP literature and research practices engage with race. We first examine 79 papers from the ACL Anthology that mention the words 'race', 'racial', or 'racism' and highlight examples of how racial biases manifest at all stages of NLP model pipelines (§3). We then describe some of the limitations of current work (§4), specifically showing that NLP research has only examined race in a narrow range of tasks with limited or no social context. Finally, in §5, we revisit the NLP pipeline with a focus on how people generate data, build models, and are affected by deployed systems, and we highlight current failures to engage with people traditionally underrepresented in STEM and academia.

While little work has examined the role of race in NLP specifically, prior work has discussed race in related fields, including human-computer interaction (HCI) (Ogbonnaya-Ogburu et al., 2020; Rankin and Thomas, 2019; Schlesinger et al., 2017), fairness in machine learning (Hanna et al., 2020), and linguistics (Charity Hudley et al., 2020; Motha, 2020). We draw comparisons and guidance from this work and show its relevance to NLP research. Our work differs from NLP-focused related work on gender bias (Sun et al., 2019), 'bias' generally (Blodgett et al., 2020), and the adverse impacts of language models (Bender et al., 2021) in its explicit focus on race and racism.

In surveying research in NLP and related fields, we ultimately find that NLP systems and research practices produce differences along racialized lines. Our work calls for NLP researchers to consider the social hierarchies upheld and exacerbated by NLP research and to shift the field toward "greater inclusion and racial justice" (Charity Hudley et al., 2020).

2 What is race?

It has been widely accepted by social scientists that race is a social construct, meaning it "was brought into existence or shaped by historical events, social forces, political power, and/or colonial conquest" rather than reflecting biological or 'natural' differences (Hanna et al., 2020). More recent work has criticized the "social construction" theory as circular and rooted in academic discourse, and instead referred to race as "colonial constituted practices", including "an inherited western, modern-colonial practice of violence, assemblage, superordination, exploitation and segregation" (Saucier et al., 2016).

The term race is also multi-dimensional and can refer to a variety of different perspectives, including racial identity (how you self-identify), observed race (the race others perceive you to be), and reflected race (the race you believe others perceive you to be) (Roth, 2016; Hanna et al., 2020; Ogbonnaya-Ogburu et al., 2020). Racial categorizations often differ across dimensions and depend on the defined categorization schema. For example, the United States census considers Hispanic an ethnicity, not a race, but surveys suggest that 2/3 of people who identify as Hispanic consider it a part of their racial background.[2] Similarly, the census does not consider 'Jewish' a race, but some NLP work considers anti-Semitism a form of racism (Hasanuzzaman et al., 2017). Race depends on historical and social context—there are no 'ground truth' labels or categories (Roth, 2016).

Footnote 2: https://www.census.gov/mso/www/training/pdf/race-ethnicity-onepager.pdf/, https://www.census.gov/topics/population/race/about.html, https://www.pewresearch.org/fact-tank/2015/06/15/is-being-hispanic-a-matter-of-race-ethnicity-or-both/

As the work we survey primarily focuses on the United States, our analysis similarly focuses on the U.S. However, as race and racism are global constructs, some aspects of our analysis are applicable to other contexts. We suggest that future studies on racialization in NLP ground their analysis in the appropriate geo-cultural context, which may result in findings or analyses that differ from our work.

3 Survey of NLP literature on race

3.1 ACL Anthology papers about race

In this section, we introduce our primary survey data—papers from the ACL Anthology[3]—and we describe some of their major findings to emphasize that NLP systems encode racial biases. We searched the anthology for papers containing the terms 'racial', 'racism', or 'race', discarding ones that only mentioned race in the references section or in data examples and adding related papers cited by the initial set if they were also in the ACL Anthology. In using keyword searches, we focus on papers that explicitly mention race and consider papers that use euphemistic terms to not have substantial engagement on this topic. As our focus is on NLP and the ACL community, we do not include NLP-related papers published in other venues in the reported metrics (e.g. Table 1), but we do draw from them throughout our analysis.

Footnote 3: The ACL Anthology includes papers from all official ACL venues and some non-ACL events listed in Appendix A; as of December 2020 it included 6,200 papers.

Our initial search identified 165 papers. However, reviewing all of them revealed that many do not deeply engage on the topic. For example, 37 papers mention 'racism' as a form of abusive language or use 'racist' as an offensive/hate speech label without further engagement. 30 papers only mention race as future work, related work, or motivation, e.g. in a survey about gender bias, "Non-binary genders as well as racial biases have largely been ignored in NLP" (Sun et al., 2019). After discarding these types of papers, our final analysis set consists of 79 papers.[4]

Footnote 4: We do not discard all papers about abusive language, only ones that exclusively use racism/racist as a classification label. We retain papers with further engagement, e.g. discussions of how to define racism or identification of racial bias in hate speech classifiers.
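The keyword-based selection described above can be approximated with a short script. The sketch below is illustrative only and is not the exact procedure used for this survey (which also involved following citations and manual review); it assumes a hypothetical local directory of plain-text anthology papers named anthology_txt/.

```python
import re
from pathlib import Path

# Hypothetical directory of plain-text ACL Anthology papers.
PAPER_DIR = Path("anthology_txt")

# Match 'race', 'racial', or 'racism' as whole words, case-insensitively.
RACE_TERMS = re.compile(r"\b(race|racial|racism)\b", re.IGNORECASE)

def mentions_race(text: str) -> bool:
    """True if race-related keywords appear outside the references section."""
    # Crude heuristic: ignore everything after a 'References' heading so that
    # papers that only cite race-related work are not counted.
    body = re.split(r"\nreferences\s*\n", text, flags=re.IGNORECASE)[0]
    return bool(RACE_TERMS.search(body))

candidates = [p.name for p in sorted(PAPER_DIR.glob("*.txt"))
              if mentions_race(p.read_text(errors="ignore"))]
print(f"{len(candidates)} candidate papers mention race-related terms")

# Manual review would still be needed to discard papers that only use
# 'racist'/'racism' as a classification label or mention race in passing.
```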
Table 1 provides an overview of the 79 papers, manually coded for each paper's primary NLP task and its focal goal or contribution. We determined task/application labels through an iterative process: listing the main focus of each paper and then collapsing similar categories. In cases where papers could rightfully be included in multiple categories, we assign them to the best-matching one based on stated contributions and the percentage of the paper devoted to each possible category. In the Appendix we provide additional categorizations of the papers according to publication year, venue, and racial categories used, as well as the full list of 79 papers.

                                                           Analyze  Survey/   Develop  Collect  Detect
                                                           Corpus   Position  Model    Corpus   Bias    Debias  Total
  Abusive Language                                            6        4        2        5        2       2      21
  Social Science/Social Media                                 2       10        6        1        -       1      20
  Text Representations (LMs, embeddings)                      -        2        -        9        2       -      13
  Text Generation (dialogue, image captions, story gen.)      -        -        1        5        1       1       8
  Sector-specific NLP applications (edu., law, health)        1        2        -        -        1       3       7
  Ethics/Task-independent Bias                                1        -        1        1        1       2       6
  Core NLP Applications (parsing, NLI, IE)                    1        -        1        1        1       -       4
  Total                                                      11       18       11       22        8       9      79

Table 1: 79 papers on race or racism from the ACL anthology, categorized by NLP application and focal task.

3.2 NLP systems encode racial bias

Next, we present examples that identify racial bias in NLP models, focusing on 5 parts of a standard NLP pipeline: data, data labels, models, model outputs, and social analyses of outputs. We include papers described in Table 1 and also relevant literature beyond the ACL Anthology (e.g. NeurIPS, PNAS, Science). These examples are not intended to be exhaustive, and in §4 we describe some of the ways that NLP literature has failed to engage with race, but nevertheless, we present them as evidence that NLP systems perpetuate harmful biases along racialized lines.

Data: A substantial amount of prior work has already shown how NLP systems, especially word embeddings and language models, can absorb and amplify social biases in data sets (Bolukbasi et al., 2016; Zhao et al., 2017). While most work focuses on gender bias, some work has made similar observations about racial bias (Rudinger et al., 2017; Garg et al., 2018; Kurita et al., 2019). These studies focus on how training data might describe racial minorities in biased ways, for example, by examining words associated with terms like 'black' or traditionally European/African American names (Caliskan et al., 2017; Manzini et al., 2019). Some studies additionally capture who is described, revealing under-representation in training data, sometimes tangentially to primary research questions: Rudinger et al. (2017) suggest that gender bias may be easier to identify than racial or ethnic bias in Natural Language Inference data sets because of data sparsity, and Caliskan et al. (2017) alter the Implicit Association Test stimuli that they use to measure biases in word embeddings because some African American names were not frequent enough in their corpora.

An equally important consideration, in addition to whom the data describes, is who authored the data. For example, Blodgett et al. (2018) show that parsing systems trained on White Mainstream American English perform poorly on African American English (AAE).[5] In a more general example, Wikipedia has become a popular data source for many NLP tasks. However, surveys suggest that Wikipedia editors are primarily from white-majority countries,[6] and several initiatives have pointed out systemic racial biases in Wikipedia coverage (Adams et al., 2019; Field et al., 2021).[7] Models trained on these data only learn to process the type of text generated by these users, and further, only learn information about the topics these users are interested in. The representativeness of data sets is a well-discussed issue in social-oriented tasks, like inferring public opinion (Olteanu et al., 2019), but this issue is also an important consideration in 'neutral' tasks like parsing (Waseem et al., 2021). The type of data that researchers choose to train their models on does not just affect what data the models perform well for, it affects what people the models work for. NLP researchers cannot assume models will be useful or function for marginalized people unless they are trained on data generated by them.

Footnote 5: We note that conceptualizations of AAE and the accompanying terminology for the variety have shifted considerably in the last half century; see King (2020) for an overview.
Footnote 6: https://meta.wikimedia.org/wiki/Research:Wikipedia_Editors_Survey_2011_April
Footnote 7: https://en.wikipedia.org/wiki/Racial_bias_on_Wikipedia
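As a concrete illustration of the association tests referenced above (Caliskan et al., 2017), the following sketch computes a simple WEAT-style differential association between two small name sets and pleasant/unpleasant attribute words in an off-the-shelf embedding space. The word lists are abbreviated examples rather than the full WEAT stimuli, and the gensim model name is only one commonly available choice; treat this as a sketch, not a reproduction of the published test.

```python
import numpy as np
import gensim.downloader as api

# A small, publicly available embedding model (assumption: available via gensim).
emb = api.load("glove-wiki-gigaword-100")

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(word, attr_a, attr_b):
    """Mean cosine similarity to attribute set A minus attribute set B."""
    sims_a = [cosine(emb[word], emb[a]) for a in attr_a if a in emb]
    sims_b = [cosine(emb[word], emb[b]) for b in attr_b if b in emb]
    return np.mean(sims_a) - np.mean(sims_b)

# Abbreviated example word lists (not the full WEAT stimuli).
names_x = ["emily", "matthew", "anne"]       # names judged stereotypically white
names_y = ["lakisha", "jamal", "ebony"]      # names judged stereotypically Black
pleasant = ["joy", "love", "peace", "friend"]
unpleasant = ["agony", "terrible", "hatred", "failure"]

score_x = np.mean([association(w, pleasant, unpleasant) for w in names_x if w in emb])
score_y = np.mean([association(w, pleasant, unpleasant) for w in names_y if w in emb])

# A positive gap indicates the first name set sits closer to 'pleasant' terms,
# the pattern Caliskan et al. (2017) report for their (much larger) stimuli.
print(f"differential association: {score_x - score_y:.3f}")
```

The statistic here is a simplified variant of the WEAT effect size; the published test additionally uses a permutation test to assess significance.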
Data Labels: Although model biases are often blamed on raw data, several of the papers we survey identify biases in the way researchers categorize or obtain data annotations. For example:

• Annotation schema: Returning to Blodgett et al. (2018), this work defines new parsing standards for formalisms common in AAE, demonstrating how parsing labels themselves were not designed for racialized language varieties.
• Annotation instructions: Sap et al. (2019) show that annotators are less likely to label tweets using AAE as offensive if they are told the likely language varieties of the tweets. Thus, how annotation schemes are designed (e.g. what contextual information is provided) can impact annotators' decisions, and failing to provide sufficient context can result in racial biases.
• Annotator selection: Waseem (2016) show that feminist/anti-racist activists assign different offensive language labels to tweets than figure-eight workers, demonstrating that annotators' lived experiences affect data annotations.

Models: Some papers have found evidence that model instances or architectures can change the racial biases of outputs produced by the model. Sommerauer and Fokkens (2019) find that the word embedding associations around words like 'race' and 'racial' change not only depending on the model architecture used to train embeddings, but also on the specific model instance used to extract them, perhaps because of differing random seeds. Kiritchenko and Mohammad (2018) examine gender and race biases in 200 sentiment analysis systems submitted to a shared task and find different levels of bias in different systems. As the training data for the shared task was standardized, all models were trained on the same data. However, participants could have used external training data or pre-trained embeddings, so a more detailed investigation of results is needed to ascertain which factors most contribute to disparate performance.

Model Outputs: Several papers focus on model outcomes, and how NLP systems could perpetuate and amplify bias if they are deployed:

• Classifiers trained on common abusive language data sets are more likely to label tweets containing characteristics of AAE as offensive (Davidson et al., 2019; Sap et al., 2019).
• Classifiers for abusive language are more likely to label text containing identity terms like 'black' as offensive (Dixon et al., 2018).
• GPT outputs text with more negative sentiment when prompted with AAE-like inputs (Groenwold et al., 2020).
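A minimal sketch of how the first disparity above is typically quantified: compare a classifier's false-positive rate for the 'offensive' label across language varieties, over texts whose gold labels are not offensive. The data format, dialect labels, and predict_offensive function are placeholders for whatever corpus and classifier are being audited.

```python
from collections import defaultdict

def false_positive_rates(examples, predict_offensive):
    """FPR of the 'offensive' label per group, computed over examples that
    annotators did NOT mark offensive. Each example: (text, group, gold_is_offensive)."""
    fp = defaultdict(int)   # benign texts flagged offensive, per group
    n = defaultdict(int)    # benign texts, per group
    for text, group, gold_is_offensive in examples:
        if gold_is_offensive:
            continue
        n[group] += 1
        if predict_offensive(text):
            fp[group] += 1
    return {g: fp[g] / n[g] for g in n if n[g] > 0}

# Toy usage with a placeholder classifier and placeholder dialect labels.
examples = [
    ("you is killin it today", "AAE", False),
    ("you are doing great today", "WME", False),
    # ... a real audit would use thousands of annotated tweets ...
]
rates = false_positive_rates(examples, predict_offensive=lambda t: "is" in t)
print(rates)  # e.g. {'AAE': 1.0, 'WME': 0.0} for this toy classifier
```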
Social Analyses of Outputs: While the examples in this section primarily focus on racial biases in trained NLP systems, other work (e.g. included in 'Social Science/Social Media' in Table 1) uses NLP tools to analyze race in society. Examples include examining how commentators describe football players of different races (Merullo et al., 2019) or how words like 'prejudice' have changed meaning over time (Vylomova et al., 2019).

While differing in goals, this work is often susceptible to the same pitfalls as other NLP tasks. One area requiring particular caution is in the interpretation of results produced by analysis models. For example, while word embeddings have become a common way to measure semantic change or estimate word meanings (Garg et al., 2018), Joseph and Morgan (2020) show that embedding associations do not always correlate with human opinions; in particular, correlations are stronger for beliefs about gender than race. Relatedly, in HCI, the recognition that authors' own biases can affect their interpretations of results has caused some authors to provide self-disclosures (Schlesinger et al., 2017), but this practice is uncommon in NLP.
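One way to act on the Joseph and Morgan (2020) caution is to validate embedding-derived association scores against human judgments before interpreting them as evidence about social attitudes. The sketch below assumes hypothetical paired data, an embedding association score and a mean survey rating for the same identity-attribute pair; the values are invented for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical attribute words paired with an embedding association score and a
# mean human survey rating for the same identity-attribute pair (values invented).
pairs = [
    ("doctor",    0.31, 4.1),
    ("criminal", -0.22, 1.8),
    ("athlete",   0.05, 3.0),
    ("leader",    0.18, 3.9),
    ("poor",     -0.10, 2.4),
]

emb_scores = [e for _, e, _ in pairs]
survey_means = [s for _, _, s in pairs]

rho, pval = spearmanr(emb_scores, survey_means)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2f})")
# A weak or unstable correlation would be a warning sign against reading
# embedding associations directly as evidence about human beliefs.
```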
We conclude this section by observing that when researchers have looked for racial biases in NLP systems, they have usually found them. This literature calls for proactive approaches in considering how data is collected, annotated, used, and interpreted to prevent NLP systems from exacerbating historical racial hierarchies.

4 Limitations in where and how NLP operationalizes race

While §3 demonstrates ways that NLP systems encode racial biases, we next identify gaps and limitations in how these works have examined racism, focusing on how and in what tasks researchers have considered race. We ultimately conclude that prior NLP literature has marginalized research on race and encourage deeper engagement with other fields, critical views of simplified classification schema, and broader application scope in future work (Blodgett et al., 2020; Hanna et al., 2020).

4.1 Common data sets are narrow in scope

The papers we surveyed suggest that research on race in NLP has used a very limited range of data sets, which fails to account for the multi-dimensionality of race and simplifications inherent in classification. We identified 3 common data sources:[8]

• 9 papers use a set of tweets with inferred probabilistic topic labels based on alignment with U.S. census race/ethnicity groups (or the provided inference model) (Blodgett et al., 2016).
• 11 papers use lists of names drawn from Sweeney (2013), Caliskan et al. (2017), or Garg et al. (2018). Most commonly, 6 papers use African/European American names from the Word Embedding Association Test (WEAT) (Caliskan et al., 2017), which in turn draws data from Greenwald et al. (1998) and Bertrand and Mullainathan (2004).
• 10 papers use explicit keywords like 'Black woman', often placed in templates like "I am a ___" to test if model performance remains the same for different identity terms.

Footnote 8: We provide further counts of what racial categories papers use and how they operationalize them in Appendix B.
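The template strategy in the last bullet can be sketched as follows: slot identity terms into fixed carrier sentences and check whether a model's output changes with the term alone. The transformers sentiment pipeline here is only a stand-in for whatever system is being audited, and the term list is illustrative rather than a vetted schema.

```python
from transformers import pipeline

# Stand-in system under test; any text classifier could be substituted.
classifier = pipeline("sentiment-analysis")

templates = ["I am a {} person.", "My neighbor is a {} man.", "{} people are here."]
identity_terms = ["Black", "white", "Asian", "Latina", "Native American"]

for template in templates:
    print(template)
    for term in identity_terms:
        sentence = template.format(term)
        result = classifier(sentence)[0]
        # Large score differences across terms in the same template suggest
        # the model's output depends on the identity term itself.
        print(f"  {term:16s} -> {result['label']:8s} {result['score']:.3f}")
```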
While these commonly-used data sets can identify performance disparities, they only capture a narrow subset of the multiple dimensions of race (§2). For example, none of them capture self-identified race. While observed race is often appropriate for examining discrimination and some types of disparities, it is impossible to assess potential harms and benefits of NLP systems without assessing their performance over text generated by and directed to people of different races. The corpus from Blodgett et al. (2016) does serve as a starting point and forms the basis of most current work assessing performance gaps in NLP models (Sap et al., 2019; Blodgett et al., 2018; Xia et al., 2020; Xu et al., 2019; Groenwold et al., 2020), but even this corpus is explicitly not intended to infer race.

Furthermore, names and hand-selected identity terms are not sufficient for uncovering model bias. De-Arteaga et al. (2019) show this in examining gender bias in occupation classification: when overt indicators like names and pronouns are scrubbed from the data, performance gaps and potential allocational harms still remain. Names also generalize poorly. While identity terms can be examined across languages (van Miltenburg et al., 2017), differences in naming conventions often do not translate, leading some studies to omit examining racial bias in non-English languages (Lauscher and Glavaš, 2019). Even within English, names often fail to generalize across domains, geographies, and time. For example, names drawn from the U.S. census generalize poorly to Twitter (Wood-Doughty et al., 2018), and names common among Black and white children were not distinctly different prior to the 1970s (Fryer Jr and Levitt, 2004; Sweeney, 2013).

We focus on these 3 data sets as they were most common in the papers we surveyed, but we note that others exist. Preoţiuc-Pietro and Ungar (2018) provide a data set of tweets with self-identified race of their authors, though it is little used in subsequent work and focused on demographic prediction, rather than evaluating model performance gaps. Two recently-released data sets (Nadeem et al., 2020; Nangia et al., 2020) provide crowd-sourced pairs of more- and less-stereotypical text. More work is needed to understand any privacy concerns and the strengths and limitations of these data (Blodgett et al., 2021).

Additionally, some papers collect domain-specific data, such as self-reported race in an online community (Loveys et al., 2018), or crowd-sourced annotations of perceived race of football players (Merullo et al., 2019). While these works offer clear contextualization, it is difficult to use these data sets to address other research questions.
4.2 Classification schemes operationalize race as a fixed, single-dimensional U.S.-census label

Work that uses the same few data sets inevitably also uses the same few classification schemes, often without justification. The most common explicitly stated source of racial categories is the U.S. census, which reflects the general trend of U.S.-centrism in NLP research (the vast majority of work we surveyed also focused on English). While census categories are sometimes appropriate, repeated use of classification schemes and accompanying data sets without considering who defined these schemes and whether or not they are appropriate for the current context risks perpetuating the misconception that race is 'natural' across geo-cultural contexts. We refer to Hanna et al. (2020) for a more thorough overview of the harms of "widespread uncritical adoption of racial categories," which "can in turn re-entrench systems of racial stratification which give rise to real health and social inequalities." At best, the way race has been operationalized in NLP research is only capable of examining a narrow subset of potential harms. At worst, it risks reinforcing racism by presenting racial divisions as natural, rather than the product of social and historical context (Bowker and Star, 2000).

As an example of questioning who devised racial categories and for what purpose, we consider the pattern of re-using names from Greenwald et al. (1998), who describe their data as sets of names "judged by introductory psychology students to be more likely to belong to White Americans than to Black Americans" or vice versa. When incorporating this data into WEAT, Caliskan et al. (2017) discard some judged African American names as too infrequent in their embedding data. Work subsequently drawing from WEAT makes no mention of the discarded names nor contains much discussion of how the data was generated and whether or not names judged to be white or Black by introductory psychology students in 1998 are an appropriate benchmark for the studied task. While gathering data to examine race in NLP is challenging, and in this work we ourselves draw from examples that use Greenwald et al. (1998), it is difficult to interpret what implications arise when models exhibit disparities over this data and to what extent models without disparities can be considered 'debiased'.

Finally, almost all of the work we examined conducts single-dimensional analyses, e.g. focus on race or gender but not both simultaneously. This focus contrasts with the concept of intersectionality, which has shown that examining discrimination along a single axis fails to capture the experiences of people who face marginalization along multiple axes. For example, consideration of race often emphasizes the experience of gender-privileged people (e.g. Black men), while consideration of gender emphasizes the experience of race-privileged people (e.g. white women). Neither reflect the experience of people who face discrimination along both axes (e.g. Black women) (Crenshaw, 1989). A small selection of papers have examined intersectional biases in embeddings or word co-occurrences (Herbelot et al., 2012; May et al., 2019; Tan and Celis, 2019; Lepori, 2020), but we did not identify mentions of intersectionality in any other NLP research areas. Further, several of these papers use NLP technology to examine or validate theories on intersectionality; they do not draw from theory on intersectionality to critically examine NLP models. These omissions can mask harms: Jiang and Fellbaum (2020) provide an example using word embeddings of how failing to consider intersectionality can render invisible people marginalized in multiple ways. Numerous directions remain for exploration, such as how 'debiasing' models along one social dimension affects other dimensions. Surveys in HCI offer further frameworks on how to incorporate identity and intersectionality into computational research (Schlesinger et al., 2017; Rankin and Thomas, 2019).
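As a small illustration of why single-axis checks can miss intersectional effects, the sketch below compares association scores for race-only, gender-only, and intersectional term sets against the same attribute lists, using the same kind of embedding-association measure as the earlier sketch. The term sets are deliberately tiny placeholders; a real analysis would need carefully constructed stimuli (e.g., following May et al., 2019 or Tan and Celis, 2019).

```python
import numpy as np
import gensim.downloader as api

emb = api.load("glove-wiki-gigaword-100")  # assumption: same toy embedding space as before

def association(word, attr_a, attr_b):
    """Mean cosine similarity to attribute set A minus attribute set B."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    sims_a = [cos(emb[word], emb[a]) for a in attr_a if a in emb]
    sims_b = [cos(emb[word], emb[b]) for b in attr_b if b in emb]
    return np.mean(sims_a) - np.mean(sims_b)

def phrase_score(phrase, attr_a, attr_b):
    """Score a phrase by averaging per-word association scores (a crude proxy)."""
    words = [w for w in phrase.lower().split() if w in emb]
    return np.mean([association(w, attr_a, attr_b) for w in words]) if words else None

pleasant = ["joy", "love", "peace", "friend"]
unpleasant = ["agony", "terrible", "hatred", "failure"]
term_sets = {
    "race only":      ["black"],
    "gender only":    ["woman", "women"],
    "intersectional": ["black woman", "black women"],
}

for label, terms in term_sets.items():
    scores = [s for s in (phrase_score(t, pleasant, unpleasant) for t in terms) if s is not None]
    if scores:
        print(f"{label:16s} {np.mean(scores):+.3f}")

# If the intersectional set patterns differently from either single-axis set,
# a single-axis audit would not surface it (cf. Jiang and Fellbaum, 2020).
```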
4.3 NLP research on race is restricted to specific tasks and applications

Finally, Table 1 reveals many common NLP applications where race has not been examined, such as machine translation, summarization, or question answering.[9] While some tasks seem inherently more relevant to social context than others (a claim we dispute in this work, particularly in §5), research on race is compartmentalized to limited areas of NLP even in comparison with work on 'bias'. For example, Blodgett et al. (2020) identify 20 papers that examine bias in co-reference resolution systems and 8 in machine translation, whereas we identify 0 papers in either that consider race. Instead, race is most often mentioned in NLP papers in the context of abusive language, and work on detecting or removing bias in NLP models has focused on word embeddings.

Footnote 9: We identified only 8 relevant papers on Text Generation, which focus on other areas including chat bots, GPT-2/3, humor generation, and story generation.

Overall, our survey identifies a need for the examination of race in a broader range of NLP tasks, the development of multi-dimensional data sets, and careful consideration of context and appropriateness of racial categories. In general, race is difficult to operationalize, but NLP researchers do not need to start from scratch, and can instead draw from relevant work in other fields.
5 NLP propagates marginalization of racialized people

While in §4 we primarily discuss race as a topic or a construct, in this section, we consider the role, or more pointedly, the absence, of traditionally underrepresented people in NLP research.

5.1 People create data

As discussed in §3.2, data and annotations are generated by people, and failure to consider who created data can lead to harms. In §3.2 we identify a need for diverse training data in order to ensure models work for a diverse set of people, and in §4 we describe a similar need for diversity in data that is used to assess algorithmic fairness. However, gathering this type of data without consideration of the people who generated it can introduce privacy violations and risks of demographic profiling.

As an example, in 2019, partially in response to research showing that facial recognition algorithms perform worse on darker-skinned than lighter-skinned people (Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019), researchers at IBM created the "Diversity in Faces" data set, which consists of 1 million photos sampled from the publicly available YFCC-100M data set and annotated with "craniofacial distances, areas and ratios, facial symmetry and contrast, skin color, age and gender predictions" (Merler et al., 2019). While this data set aimed to improve the fairness of facial recognition technology, it included photos collected from Flickr, a photo-sharing website whose users did not explicitly consent to this use of their photos. Some of these users filed a lawsuit against IBM, in part for "subjecting them to increased surveillance, stalking, identity theft, and other invasions of privacy and fraud."[10] NLP researchers could easily repeat this incident, for example, by using demographic profiling of social media users to create more diverse data sets. While obtaining diverse, representative, real-world data sets is important for building models, data must be collected with consideration for the people who generated it, such as obtaining informed consent, setting limits of uses, and preserving privacy, as well as recognizing that some communities may not want their data used for NLP at all (Paullada, 2020).

Footnote 10: https://www.classaction.org/news/class-action-accuses-ibm-of-flagrant-violations-of-illinois-biometric-privacy-law-to-develop-facial-recognition-tech#embedded-document, https://www.nbcnews.com/tech/internet/facial-recognition-s-dirty-little-secret-millions-online-photos-scraped-n981921. IBM has since removed the "Diversity in Faces" data set as well as their "Detect Faces" public API and stopped their use of and research on facial recognition: https://qz.com/1866848/why-ibm-abandoned-its-facial-recognition-program/

5.2 People build models

Research is additionally carried out by people who determine what projects to pursue and how to approach them. While statistics on ACL conferences and publications have focused on geographic representation rather than race, they do highlight under-representation. Out of 2,695 author affiliations associated with papers in the ACL Anthology for 5 major conferences held in 2018, only 5 (0.2%) were from Africa, compared with 1,114 from North America (41.3%).[11] Statistics published for 2017 conference attendees and ACL fellows similarly reveal a much higher percentage of people from "North, Central and South America" (55% attendees / 74% fellows) than from "Europe, Middle East and Africa" (19%/13%) or "Asia-Pacific" (23%/13%).[12] These broad regional categories likely mask further under-representation, e.g. the percentage of attendees and fellows from Africa as compared to Europe. According to an NSF report that includes racial statistics rather than nationality, 14% of doctorate degrees in Computer Science awarded by U.S. institutions to U.S. citizens and permanent residents were awarded to Asian students, < 4% to Black or African American students, and 0% to American Indian or Alaska Native students (National Center for Science and Engineering Statistics, 2019).[13]

Footnote 11: http://www.marekrei.com/blog/geographic-diversity-of-nlp-conferences/
Footnote 12: https://www.aclweb.org/portal/content/acl-diversity-statistics
Footnote 13: Results exclude respondents who did not report race or ethnicity or were Native Hawaiian or Other Pacific Islander.

It is difficult to envision reducing or eliminating racial differences in NLP systems without changes in the researchers building these systems. One theory that exemplifies this challenge is interest convergence, which suggests that people in positions of power only take action against systematic problems like racism when it also advances their own interests (Bell Jr, 1980). Ogbonnaya-Ogburu et al. (2020) identify instances of interest convergence in the HCI community, primarily in diversity initiatives that benefit institutions' images rather than underrepresented people. In a research setting, interest convergence can encourage studies of incremental and surface-level biases while discouraging research that might be perceived as controversial and force fundamental changes in the field.

Demographic statistics are not sufficient for avoiding pitfalls like interest convergence, as they fail to capture the lived experiences of researchers. Ogbonnaya-Ogburu et al. (2020) provide several examples of challenges that non-white HCI researchers have faced, including the invisible labor of representing 'diversity', everyday microaggressions, and altering their research directions in accordance with their advisors' interests. Rankin and Thomas (2019) further discuss how research conducted by people of different races is perceived differently: "Black women in academia who conduct research about the intersections of race, gender, class, and so on are perceived as 'doing service,' whereas white colleagues who conduct the same research are perceived as doing cutting-edge research that demands attention and recognition." While we draw examples about race from HCI in the absence of published work on these topics in NLP, the lack of linguistic diversity in NLP research similarly demonstrates how representation does not necessarily imply inclusion. Although researchers from various parts of the world (Asia, in particular) do have some numerical representation among ACL authors, attendees, and fellows, NLP research overwhelmingly favors a small set of languages, with a heavy skew towards European languages (Joshi et al., 2020) and 'standard' language varieties (Kumar et al., 2021).

5.3 People use models

Finally, NLP research produces technology that is used by people, and even work without direct applications is typically intended for incorporation into application-based systems. With the recognition that technology ultimately affects people, researchers on ethics in NLP have increasingly called for considerations of whom technology might harm and suggested that there are some NLP technologies that should not be built at all. In the context of perpetuating racism, examples include criticism of tools for predicting demographic information (Tatman, 2020) and automatic prison term prediction (Leins et al., 2020), motivated by the history of using technology to police racial minorities and related criticism in other fields (Browne, 2015; Buolamwini and Gebru, 2018; McIlwain, 2019). In cases where potential harms are less direct, they are often unaddressed entirely. For example, while low-resource NLP is a large area of research, a paper on machine translation of white American and European languages is unlikely to discuss how continual model improvements in these settings increase technological inequality. Little work on low-resource NLP has focused on the realities of structural racism or differences in lived experience and how they might affect the way technology should be designed.

Detection of abusive language offers an informative case study on the danger of failing to consider people affected by technology. Work on abusive language often aims to detect racism for content moderation (Waseem and Hovy, 2016). However, more recent work has shown that existing hate speech classifiers are likely to falsely label text containing identity terms like 'black' or text containing linguistic markers of AAE as toxic (Dixon et al., 2018; Sap et al., 2019; Davidson et al., 2019; Xia et al., 2020). Deploying these models could censor the posts of the very people they purport to help.

In other areas of statistics and machine learning, a focus on participatory design has sought to amplify the voices of people affected by technology and its development. An ICML 2020 workshop titled "Participatory Approaches to Machine Learning" highlights a number of papers in this area (Kulynych et al., 2020; Brown et al., 2019). A few related examples exist in NLP, e.g. Gupta et al. (2020) gather data for an interactive dialogue agent intended to provide more accessible information about heart failure to Hispanic/Latinx and African American patients. The authors engage with healthcare providers and doctors, though they leave focal groups with patients for future work. While NLP researchers may not be best situated to examine how people interact with deployed technology, they could instead draw motivation from fields that have stronger histories of participatory design, such as HCI. However, we did not identify citing participatory design studies conducted by others as common practice in the work we surveyed. As in the case of researcher demographics, participatory design is not an end-all solution. Sloane et al. (2020) provide a discussion of how participatory design can collapse to 'participation-washing' and how such work must be context-specific, long-term, and genuine.

6 Discussion

We conclude by synthesizing some of the observations made in the preceding sections into more actionable items. First, NLP research needs to explicitly incorporate race. We quote Benjamin (2019): "[technical systems and social codes] operate within powerful systems of meaning that render some things visible, others invisible, and create a vast array of distortions and dangers."

In the context of NLP research, this philosophy implies that all technology we build works in service of some ideas or relations, either by upholding them or dismantling them. Any research that is not actively combating prevalent social systems like racism risks perpetuating or exacerbating them. Our work identifies several ways in which NLP research upholds racism:

• Systems contain representational harms and performance gaps throughout NLP pipelines
• Research on race is restricted to a narrow subset of tasks and definitions of race, which can mask harms and falsely reify race as 'natural'
• Traditionally underrepresented people are excluded from the research process, both as consumers and producers of technology

Furthermore, while we focus on race, which we note has received substantially less attention than gender, many of the observations in this work hold for social characteristics that have received even less attention in NLP research, such as socioeconomic class, disability, or sexual orientation (Mendelsohn et al., 2020; Hutchinson et al., 2020). Nevertheless, none of these challenges can be addressed without direct engagement with marginalized communities of color. NLP researchers can draw on precedents for this type of engagement from other fields, such as participatory design and value sensitive design models (Friedman et al., 2013). Additionally, numerous organizations already exist that serve as starting points for partnerships, such as Black in AI, Masakhane, Data for Black Lives, and the Algorithmic Justice League.

Finally, race and language are complicated, and while readers may look for clearer recommendations, no one data set, model, or set of guidelines can 'solve' racism in NLP. For instance, while we draw from linguistics, Charity Hudley et al. (2020) in turn call on linguists to draw models of racial justice from anthropology, sociology, and psychology. Relatedly, there are numerous racialized effects that NLP research can have that we do not address in this work; for example, Bender et al. (2021) and Strubell et al. (2019) discuss the environmental costs of training large language models, and how global warming disproportionately affects marginalized communities. We suggest that readers use our work as one starting point for bringing inclusion and racial justice into NLP.

Acknowledgements

We gratefully thank Hanna Kim, Kartik Goyal, Artidoro Pagnoni, Qinlan Shen, and Michael Miller Yoder for their feedback on this work. Z.W. has been supported in part by the Canada 150 Research Chair program and the UK-Canada Artificial Intelligence Initiative. A.F. has been supported in part by a Google PhD Fellowship and a GRFP under Grant No. DGE1745016. This material is based upon work supported in part by the National Science Foundation under Grants No. IIS2040926 and IIS2007960. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

7 Ethical Considerations

We, the authors of this work, are situated in the cultural contexts of the United States of America and the United Kingdom/Europe, and some of us identify as people of color. We all identify as NLP researchers, and we acknowledge that we are situated within the traditionally exclusionary practices of academic research. These perspectives have impacted our work, and there are viewpoints outside of our institutions and experiences that our work may not fully represent.
References

Julia Adams, Hannah Brückner, and Cambria Naslund. 2019. Who counts as a notable sociologist on Wikipedia? Gender, race, and the "professor test". Socius, 5.

Silvio Amir, Mark Dredze, and John W. Ayers. 2019. Mental health surveillance over social media with digital cohorts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 114–120, Minneapolis, Minnesota. Association for Computational Linguistics.

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias: There's software used across the country to predict future criminals and it's biased against blacks. ProPublica.

Stavros Assimakopoulos, Rebecca Vella Muskat, Lonneke van der Plas, and Albert Gatt. 2020. Annotating for hate speech: The MaNeCo corpus and some input from critical discourse analysis. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5088–5097, Marseille, France. European Language Resources Association.

Daniel Bar-Tal, Carl F Graumann, Arie W Kruglanski, and Wolfgang Stroebe. 2013. Stereotyping and Prejudice: Changing Conceptions. Springer Science & Business Media.

Francesco Barbieri and Jose Camacho-Collados. 2018. How gender and skin tone modifiers affect emoji semantics in Twitter. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 101–106, New Orleans, Louisiana. Association for Computational Linguistics.

Derrick A Bell Jr. 1980. Brown v. Board of Education and the interest-convergence dilemma. Harvard Law Review, pages 518–533.

Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 Conference on Fairness, Accountability, and Transparency, pages 610–623, New York, NY, USA. Association for Computing Machinery.

Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Wiley.

Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, and David Yarowsky. 2013. Broadly improving user classification via communication-based name and location clustering on Twitter. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1010–1019, Atlanta, Georgia. Association for Computational Linguistics.

Marianne Bertrand and Sendhil Mullainathan. 2004. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4):991–1013.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.

Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130, Austin, Texas. Association for Computational Linguistics.

Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. 2021. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online. Association for Computational Linguistics.

Su Lin Blodgett, Johnny Wei, and Brendan O'Connor. 2018. Twitter Universal Dependency parsing for African-American and mainstream American English. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Melbourne, Australia. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4356–4364, Red Hook, NY, USA. Curran Associates Inc.

Rishi Bommasani, Kelly Davis, and Claire Cardie. 2020. Interpreting pretrained contextualized representations via reductions to static embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4758–4781, Online. Association for Computational Linguistics.

Geoffrey C. Bowker and Susan Leigh Star. 2000. Sorting Things Out: Classification and Its Consequences. Inside Technology. MIT Press.

Anna Brown, Alexandra Chouldechova, Emily Putnam-Hornstein, Andrew Tobin, and Rhema Vaithianathan. 2019. Toward algorithmic accountability in public services: A qualitative study of affected community perspectives on algorithmic decision-making in child welfare services. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, pages 1–12, New York, NY, USA. Association for Computing Machinery.

Simone Browne. 2015. Dark Matters: On the Surveillance of Blackness. Duke University Press.

Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, pages 77–91, New York, NY, USA. PMLR.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Michael Castelle. 2018. The linguistic ideologies of deep abusive language classification. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 160–170, Brussels, Belgium. Association for Computational Linguistics.

Bharathi Raja Chakravarthi. 2020. HopeEDI: A multilingual hope speech detection dataset for equality, diversity, and inclusion. In Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media, pages 41–53, Barcelona, Spain (Online). Association for Computational Linguistics.

Anne H. Charity Hudley. 2017. Language and racialization. In Ofelia García, Nelson Flores, and Massimiliano Spotti, editors, The Oxford Handbook of Language and Society, pages 381–402. Oxford University Press.

Anne H. Charity Hudley, Christine Mallinson, and Mary Bucholtz. 2020. Toward racial justice in linguistics: Interdisciplinary insights into theorizing race in the discipline and diversifying the profession. Language, 96(4):e200–e235.

Isobelle Clarke and Jack Grieve. 2017. Dimensions of abusive language on Twitter. In Proceedings of the First Workshop on Abusive Language Online, pages 1–10, Vancouver, BC, Canada. Association for Computational Linguistics.

Kimberlé Crenshaw. 1989. Demarginalizing the intersection of race and sex: A Black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. University of Chicago Legal Forum, 1989(8).

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy. Association for Computational Linguistics.

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128, New York, NY, USA. Association for Computing Machinery.

Ona de Gibert, Naiara Perez, Aitor García-Pablos, and Montse Cuadros. 2018. Hate speech dataset from a white supremacy forum. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 11–20, Brussels, Belgium. Association for Computational Linguistics.

Dorottya Demszky, Nikhil Garg, Rob Voigt, James Zou, Jesse Shapiro, Matthew Gentzkow, and Dan Jurafsky. 2019. Analyzing polarization in social media: Method and application to tweets on 21 mass shootings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2970–3005, Minneapolis, Minnesota. Association for Computational Linguistics.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73, New York, NY, USA. Association for Computing Machinery.

Jacob Eisenstein, Noah A. Smith, and Eric P. Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1365–1374, Portland, Oregon, USA. Association for Computational Linguistics.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21, Brussels, Belgium. Association for Computational Linguistics.

Anjalie Field, Chan Young Park, and Yulia Tsvetkov. 2021. Controlled analyses of social biases in Wikipedia bios. Computing Research Repository, arXiv:2101.00078. Version 1.

Batya Friedman, Peter Kahn, Alan Borning, and Alina Huldtgren. 2013. Value sensitive design and information systems. In Neelke Doorn, Daan Schuurbiers, Ibo van de Poel, and Michael Gorman, editors, Early Engagement and New Technologies: Opening Up the Laboratory, volume 16. Springer, Dordrecht.

Roland G Fryer Jr and Steven D Levitt. 2004. The causes and consequences of distinctively black names. The Quarterly Journal of Economics, 119(3):767–805.

Ryan J. Gallagher, Kyle Reing, David Kale, and Greg Ver Steeg. 2017. Anchored correlation explanation: Topic modeling with minimal domain knowledge. Transactions of the Association for Computational Linguistics, 5:529–542.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Nabeel Gillani and Roger Levy. 2019. Simple dynamic word embeddings for mapping perceptions in the public sphere. In Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, pages 94–99, Minneapolis, Minnesota. Association for Computational Linguistics.

Anthony G Greenwald, Debbie E McGhee, and Jordan LK Schwartz. 1998. Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74(6):1464.

Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, and William Yang Wang. 2020. Investigating African-American Vernacular English in transformer-based text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5877–5883, Online. Association for Computational Linguistics.

Itika Gupta, Barbara Di Eugenio, Devika Salunke, Andrew Boyd, Paula Allen-Meares, Carolyn Dickens, and Olga Garcia. 2020. Heart failure education of African American and Hispanic/Latino patients: Data collection and analysis. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 41–46, Online. Association for Computational Linguistics.

David L Hamilton and Tina K Trolier. 1986. Stereotypes and stereotyping: An overview of the cognitive approach. In J. F. Dovidio and S. L. Gaertner, editors, Prejudice, Discrimination, and Racism, pages 127–163. Academic Press.

Alex Hanna, Emily Denton, Andrew Smart, and Jamila Smith-Loud. 2020. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 501–512, New York, NY, USA. Association for Computing Machinery.

Mohammed Hasanuzzaman, Gaël Dias, and Andy Way. 2017. Demographic word embeddings for racism detection on Twitter. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 926–936, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Aurélie Herbelot, Eva von Redecker, and Johanna Müller. 2012. Distributional techniques for philosophical enquiry. In Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 45–54, Avignon, France. Association for Computational Linguistics.

Xiaolei Huang, Linzi Xing, Franck Dernoncourt, and Michael J. Paul. 2020. Multilingual Twitter corpus and baselines for evaluating demographic bias in hate speech recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1440–1448, Marseille, France. European Language Resources Association.

Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online. Association for Computational Linguistics.

May Jiang and Christiane Fellbaum. 2020. Interdependencies of gender and race in contextualized word embeddings. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 17–25, Barcelona, Spain (Online). Association for Computational Linguistics.

Kenneth Joseph and Jonathan Morgan. 2020. When do word embeddings accurately reflect surveys on our beliefs about people? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4392–4415, Online. Association for Computational Linguistics.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.

David Jurgens, Libby Hemphill, and Eshwar Chandrasekharan. 2019. A just and comprehensive strategy for using NLP to address online abuse. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3658–3666, Florence, Italy. Association for Computational Linguistics.

Saket Karve, Lyle Ungar, and João Sedoc. 2019. Conceptor debiasing of word representations evaluated on WEAT. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 40–48, Florence, Italy. Association for Computational Linguistics.

Anna Kasunic and Geoff Kaufman. 2018. Learning to listen: Critically considering the role of AI in human storytelling and character creation. In Proceedings of the First Workshop on Storytelling, pages 1–13, New Orleans, Louisiana. Association for Computational Linguistics.

Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Davani, Morteza Dehghani, and Xiang Ren. 2020. Contextualizing hate speech classifiers with post-hoc explanation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5435–5442, Online. Association for Computational Linguistics.