Gender Bias in Machine Translation
Beatrice Savoldi (1,2), Marco Gaido (1,2), Luisa Bentivogli (2), Matteo Negri (2), Marco Turchi (2)
(1) University of Trento  (2) Fondazione Bruno Kessler
{bsavoldi,mgaido,bentivo,negri,turchi}@fbk.eu
arXiv:2104.06001v2 [cs.CL] 15 Apr 2021

Abstract

Machine translation (MT) technology has facilitated our daily tasks by providing accessible shortcuts for gathering, processing and communicating information. However, it can suffer from biases that harm users and society at large. As a relatively new field of inquiry, studies of gender bias in MT still lack cohesion, which advocates for a unified framework to ease future research. To this end, we: i) critically review current conceptualizations of bias in light of theoretical insights from related disciplines, ii) summarize previous analyses aimed at assessing gender bias in MT, iii) discuss the mitigating strategies proposed so far, and iv) point toward potential directions for future work.

1 Introduction

Interest in understanding, assessing, and mitigating gender bias is steadily growing within the natural language processing (NLP) community, with recent studies showing how gender disparities affect language technologies. Sometimes, for example, coreference resolution systems fail to recognize women doctors (Zhao et al., 2017; Rudinger et al., 2018), image captioning models do not detect women sitting next to a computer (Hendricks et al., 2018), and automatic speech recognition works better with male voices (Tatman, 2017). Despite a prior disregard for such a phenomenon within research agendas (Cislak et al., 2018), it is now widely recognized that NLP tools encode and reflect controversial social asymmetries for many seemingly neutral tasks, machine translation (MT) included. Admittedly, the problem is not completely new (Frank et al., 2004). A few years ago, Schiebinger (2014) denounced the phenomenon of "masculine default" in MT after running one of her interviews through a commercial translation system. In spite of several feminine mentions in the text, she was repeatedly referred to by masculine pronouns. Gender-related concerns have also been recently voiced by online MT users, spotting how commercial systems further entrench social gender expectations, e.g. they tend to translate engineers as masculine and nurses as feminine (Olson, 2018).

With language technologies entering widespread use and being deployed at a massive scale, their societal impact has raised concern both within (Hovy and Spruit, 2016; Bender et al., 2021) and outside (Dastin, 2018) the scientific community. To take stock of the situation, Sun et al. (2019) reviewed NLP studies on the topic. However, their survey is based on monolingual applications, whose underlying assumptions and solutions may not be directly applicable to languages other than English (Zhou et al., 2019; Zhao et al., 2020; Takeshita et al., 2020) and cross-lingual settings. Moreover, MT is a multifaceted task, which requires resolving multiple gender-related subtasks at the same time (e.g. coreference resolution, named entity recognition). Hence, depending on the languages involved and the factors accounted for, gender bias has been conceptualized differently across studies. To date, gender bias in MT has been tackled by means of a narrow, problem-solving oriented approach. While technical countermeasures are needed, failing to adopt a wider perspective and engage with related literature outside of NLP can be detrimental to the advancement of the field (Blodgett et al., 2020).

In this paper, we intend to put such literature to use for the study of gender bias in MT. We go beyond surveys restricted to monolingual NLP (Sun et al., 2019) or more limited in scope (Costa-jussà, 2019; Monti, 2020), and present the first comprehensive review of gender bias in MT. In particular, we 1) offer a unified framework that introduces the concepts, sources, and effects of bias in MT, clarified in light of relevant notions on the relation between gender and different languages; 2) critically discuss the state of the research, identifying blind spots as well as present and future key challenges.
2 Bias statement

In cognitive science, bias refers to the result of psychological heuristics, i.e. mental shortcuts that can be critical to support prompt reactions (Tversky and Kahneman, 1973, 1974). AI research borrowed from such tradition (Rich and Gureckis, 2019; Rahwan et al., 2019) and conceived bias as the divergence from an ideal or expected value (Glymour and Herington, 2019; Shah et al., 2020), which can occur if models rely on spurious cues and unintended shortcut strategies to predict outputs (Schuster et al., 2019; McCoy et al., 2019; Geirhos et al., 2020). Since this can lead to systematic errors or adverse social effects, bias investigation is not only a scientific and technical endeavour, but also an ethical one, given the growing societal role of NLP applications (Bender and Friedman, 2018).

As Blodgett et al. (2020) recently called out, and as has been endorsed in other venues (Hardmeier et al., 2021), analysing bias is an inherently normative process, which requires identifying what is deemed a harmful behavior, how, and to whom. Hereby, drawing on the human-centered approach of Value Sensitive Design (Friedman and Hendry, 2019), we consider as biased an MT model that systematically and unfairly discriminates against certain individuals or groups in favour of others (Friedman and Nissenbaum, 1996). We identify bias per specific model's behaviors, which are assessed envisaging their potential risks when the model is deployed (Bender et al., 2021) and the harms that could ensue (Crawford, 2017), with people in focus (Bender, 2019). Along this line, with MT daily deployed for a range of use-cases by millions of individuals, there are several contexts and people that could be impacted. As a guide, we rely on Crawford (2017), who defines two main categories of harms produced by a biased system: i) Representational (R) – i.e. detraction from the representation of social groups and their identity, which, in turn, affects attitudes and beliefs; ii) Allocational (A) – i.e. a system allocates or withholds opportunities or resources to certain groups.

In the MT literature reviewed in this paper, (R) can be distinguished into under-representation and stereotyping. The former refers to the reduction of the visibility of a social group through language, e.g. by i) producing a disproportionately low representation of women, ii) not recognizing the existence of non-binary individuals, and iii) failing to reflect their identity and communicative repertoires. Considering the latter case, by fostering the visibility of the way of speaking of the dominant group, users can presume that such language represents the most appropriate or prestigious variant.[1] Stereotyping regards the propagation of negative generalizations of a social group, e.g. belittling feminine representation to less prestigious occupations (teacher (F) vs. lecturer (M)), or in association with attractiveness judgments (pretty lecturer (F)). Such behaviors are harmful as they can directly affect the self-esteem of members of the target group (Bourguignon et al., 2015).

The ubiquitous embedding of MT in web applications provides us with paradigmatic examples of how the two types of (R) can interplay. If a woman or non-binary[2] scientist is the subject of a query, automatically translated pages run the risk of referring to them via masculine-inflected job qualifications. In this case, the subject is misrepresented, leading to potential feelings of identity invalidation (Zimman et al., 2017). Also, users may not be aware of being exposed to MT mistakes due to the deceptively fluent output of a system (Martindale and Carpuat, 2018). In the long run, stereotypical assumptions and prejudices (e.g. only men are qualified for high-level positions) might be reinforced (Levesque, 2011; Régner et al., 2019).

As regards (A), MT services are consumed by the general public and can thus be regarded as resources in their own right. Hence, (R) can directly imply (A) as a performance disparity across users in the quality of service, i.e. the overall efficiency of the service. Accordingly, a woman attempting to translate her biography by relying on an MT system requires additional energy and time to revise wrong masculine references. If such disparities are not accounted for, the MT field runs the risk of producing systems that prevent certain groups from fully benefiting from such technological resources.

In the following, we operationalize such categories to map current studies on gender bias to their motivations and societal implications.

[1] For an analogy on how technology shaped the perception of feminine voices as shrill and immature see (Tallon, 2019).
[2] Throughout the paper, we use non-binary as an umbrella term for referring to all gender identities between or outside the masculine/feminine binary categories.
3 Understanding Bias

To confront bias in MT, it is vital to reach out to other disciplines that foregrounded how the socio-cultural notions of gender interact with language(s), translation, and implicit biases. Afterward, we can discuss the multiple factors that concur to encode and amplify gender inequalities in language technology. Note that, except for (Saunders et al., 2020), current studies on gender bias in MT have assumed a (often implicit) binary vision of gender. As such, our discussion is largely forced into such a classification. Although we reiterate on bimodal feminine/masculine linguistic forms and social categories, we emphasize that gender encompasses multiple biosocial elements not to be conflated with sex (Risman, 2018; Fausto-Sterling, 2019), and that some individuals do not experience gender, at all, or in binary terms (Glen and Hurrell, 2012).

3.1 Gender and Language

The relation between language and gender is not straightforward. First, the linguistic structures used to refer to the extra-linguistic reality of gender vary across languages (§3.1.1). Moreover, how gender is assigned and perceived in our verbal practices depends on contextual factors as well as assumptions about social roles, traits, and attributes (§3.1.2). At last, language is conceived as a tool for articulating and constructing personal identities (§3.1.3).

3.1.1 Linguistic Encoding of Gender

Drawing on (Corbett, 1991; Craig, 1994; Comrie, 1999; Hellinger and Bußman, 2001, 2002, 2003; Corbett, 2013; Gygax et al., 2019) we hereby describe the linguistic forms (lexical, pronominal, grammatical) that bear a relation with the extra-linguistic reality of gender. Following Stahlberg et al. (2007), we identify three language groups:

Genderless languages (e.g. Finnish, Turkish). In such languages, the gender-specific repertoire is at its minimum, only expressed for basic lexical pairs, usually kinship or address terms (e.g. in Finnish sisko/sister vs. veli/brother).

Notional gender languages[3] (e.g. Danish, English). On top of lexical gender (mom/dad), such languages display a system of pronominal gender (she/he, her/him). English also hosts some marked derivative nouns (actor/actress) and compounds (chairman/chairwoman).

[3] Also referred to as natural gender languages. Following (McConnell-Ginet, 2013), we prefer notional to avoid terminological overlapping with "natural", i.e. biological/anatomical sexual categories. For a wider discussion on the topic see (Nevalainen and Raumolin-Brunberg, 1993; Curzan, 2003).

Grammatical gender languages (e.g. Arabic, Spanish). In these languages, each noun pertains to a class such as masculine, feminine, and neuter (if present). Although for most inanimate objects gender assignment is just formal,[4] for human referents masculine/feminine markings are assigned on a semantic basis. Grammatical gender is defined by a system of morphosyntactic agreement, where several parts of speech beside the noun (e.g. verbs, determiners, adjectives) carry gender inflections.

[4] E.g. "moon" is masculine in German, feminine in French.

In light of the above, the English sentence "she/he is a good friend" has no overt expression of gender in a genderless language like Turkish ("o iyi bir arkadaş"), whereas Spanish spreads several overt feminine "Ella es una buena amiga" or masculine markings "El es un buen amigo". Although general, these macro-categories allow us to highlight typological differences across languages, crucial to frame gender issues in both human and machine translation. Also, they exhibit to what extent speakers of each group are led to think and communicate via binary distinctions,[5] as well as underline the relative complexity in carving out a space for lexical innovations which encode non-binary gender (Hord, 2016; Conrod, 2020). In this sense, while English is attempting to bring the singular they in common use and developing neo-pronouns (Bradley et al., 2019), for grammatical gender languages like Spanish neutrality requires the development of neo-morphemes ("Elle es une buene amigue").

[5] Outside of the Western paradigm, there are cultures whose languages traditionally encode gender outside of the binary (Epple, 1998; Murray, 2003; Hall and O'Donovan, 2014).

3.1.2 Social Gender Connotations

To understand gender bias, we have to grasp not only the structure of different languages, but also how linguistic expressions are connoted, deployed, and perceived (Hellinger and Motschenbacher, 2015). In grammatical gender languages, feminine forms are often subject to a so-called semantic derogation (Schulz, 1975), e.g. in French, couturier (fashion designer) vs. couturière (seamstress). English is no exception (e.g. major/majorette).

Bias can also creep in a covert manner, as in the case of epicene (i.e. gender neutral) nouns where gender is not grammatically marked. Here, gender assignment is linked to (typically binary) social gender, i.e. "the socially imposed dichotomy of
masculine and feminine role and character traits" (Kramarae and Treichler, 1985). As an illustration, Danish speakers tend to pronominalize dommer (judge) with han (he) when referring to the whole occupational category (Gomard, 1995; Nissen, 2002). Social gender assignment varies across time and space (Lyons, 1977; Romaine, 1999; Cameron, 2003) and regards stereotypical assumptions about what is typical or appropriate for men and women. Such assumptions impact our perceptions (Hamilton, 1988; Gygax et al., 2008; Kreiner et al., 2008) and influence our behavior – e.g. leading individuals to identify with and fulfill stereotypical expectations (Wolter and Hannover, 2016; Sczesny et al., 2018) – and verbal communication, e.g. women are often misquoted in the academic community (Krawczyk, 2017).

Translation studies highlight how social gender assignment influences translation choices (Jakobson, 1959; Chamberlain, 1988; Comrie, 1999; Di Sabato and Perri, 2020). Primarily, the problem arises from typological differences across languages and their gender systems; nonetheless, socio-cultural factors influence how translators deal with such differences. Consider the character of the cook in Daphne du Maurier's "Rebecca", whose gender is never explicitly stated in the whole book. In the lack of any available information, translators into five grammatical gender languages differently represented the character as a man or a woman (Wandruszka, 1969; Nissen, 2002). Although extreme, this case represents to a certain extent the situation of uncertainty faced by MT: the mapping of one-to-many forms in gender prediction. But, as discussed in §4.1, mistranslations occur when contextual gender information is available, too.

3.1.3 Gender and Language Use

Language use varies between demographic groups and reflects their backgrounds, personalities, and social identities (Labov, 1972; Trudgill, 2000; Pennebaker and Stone, 2003). In this light, the study of gender and language variation has received much attention in socio- and corpus linguistics (Holmes and Meyerhoff, 2003; Eckert and McConnell-Ginet, 2013). Research conducted in speech and text analysis highlighted several gender differences, which are exhibited at the phonological and lexical-syntactic level. For example, women rely more on hedging strategies ("it seems that"), purpose clauses ("in order to"), first-person pronouns, and prosodic exclamations (Mulac et al., 2001; Mondorf, 2002; Brownlow et al., 2003). Although some correspondences between gender and linguistic features hold across cultures and languages (Smith, 2003; Johannsen et al., 2015), it should be kept in mind that they are far from universal[6] and should not be intended in a stereotypical and oversimplistic manner (Bergvall et al., 1996; Nguyen et al., 2016; Koolen and van Cranenburgh, 2017).

[6] It has been largely debated whether gender-related differences are inherently biological or cultural and social products (Mulac et al., 2001). Currently, the idea that they depend on biological reasons is largely rejected (Hyde, 2005) in favour of a socio-cultural or performative perspective (Butler, 1990).

Drawing on gender-related features proved useful to build demographically informed NLP tools (Garimella et al., 2019) and personalized MT models (Mirkin et al., 2015; Bawden et al., 2016; Rabinovich et al., 2017). However, using personal gender as a variable requires a prior understanding of which categories may be salient, and a critical reflection on how gender is intended and ascribed (Larson, 2017). Otherwise, if we assume that the only relevant (sexual) categories are "male" and "female", our models will inevitably fulfill such a reductionist expectation (Bamman et al., 2014).

3.2 Gender Bias in MT

To date, an overview of how several factors may contribute to gender bias in MT does not exist. We identify and clarify concurring problematic causes, accounting for the context in which systems are developed and used (§2). To this aim, we rely on the three overarching categories of bias described by Friedman and Nissenbaum (1996), which foreground different sources that can lead to the manifestation of machine bias. These are: pre-existing bias – rooted in our institutions, practices and attitudes (§3.2.1), technical bias – due to technical constraints and decisions (§3.2.2), and emergent bias – arising in the context of interaction with users (§3.2.3). Rather than discretely, we consider such categories as placed in a continuum.

3.2.1 Pre-existing Bias

MT models are known to reflect gender disparities present in the data. However, reflections on such generally invoked disparities are often overlooked. Treating data as an abstract, monolithic entity (Gitelman, 2013) – or relying on "overly broad/overloaded terms like training data bias"[7] (Suresh and Guttag, 2019) – does not encourage reasoning on the many factors of which data are the product. First and foremost, the historical, socio-cultural context in which they are generated.

[7] See (Johnson, 2020a; Samar, 2020) for a discussion on how such narrative can be counterproductive for tackling bias.
A starting point to tackle these issues is the Europarl corpus (Koehn, 2005), where only 30% of sentences are uttered by women (Vanmassenhove et al., 2018). Such kind of imbalance is a direct window into the glass ceiling that has hampered women's access to parliamentary positions. This case exemplifies how data might be "tainted with historical bias", mirroring an "unequal ground truth" (Hacker, 2018). However, other gender variables are harder to spot and quantify.

Empirical research in linguistics pointed out that subtle gender asymmetries are rooted in languages' use and structure. For instance, an important aspect regards how women are referred to. Femaleness is often explicitly invoked when there is no textual need to do so, even in languages that do not require overt gender marking. A case in point regards Turkish, which differentiates cocuk (child) and kiz cocugu (female child) (Braun, 2000). Similarly, in a corpus search, Romaine (2001) found 155 explicit female markings for doctor (female, woman or lady doctor), compared to only 14 for male doctor. Feminist language critique provided extensive analysis of such a phenomenon by highlighting how referents in discourse are considered men by default unless explicitly stated (Silveira, 1980; Hamilton, 1991). Finally, prescriptive top-down guidelines limit the linguistic visibility of gender diversity, e.g. the Real Academia de la Lengua Española recently discarded the official use of non-binary innovations and claimed the functionality of masculine generics (Mundo, 2018; López et al., 2020).

By stressing such issues, we are not condoning the reproduction of pre-existing bias in MT. Rather, the above-mentioned concerns are the starting point to account for when dealing with gender bias.

3.2.2 Technical Bias

Technical bias comprises aspects related to data creation, models' design, training and testing procedures. If present in training and testing samples, asymmetries in the semantics of language use and in gender distribution are respectively learnt by MT systems and rewarded in their evaluation. However, as just discussed, biased representations are not merely quantitative, but also qualitative. Accordingly, straightforward procedures – e.g. balancing the number of speakers in existing datasets – do not ensure a fairer representation of gender in MT outputs. As datasets are a crucial source of bias, this advocates for careful data curation (Mehrabi et al., 2019; Paullada et al., 2020; Hanna et al., 2021; Bender et al., 2021), guided by pragmatically- and socially-informed analysis (Hitti et al., 2019; Sap et al., 2020; Devinney et al., 2020) and annotation practices (Gaido et al., 2020).

Overall, while data can mirror gender inequalities and offer adverse shortcut learning opportunities, it is "quite clear that data alone rarely constrain a model sufficiently" (Geirhos et al., 2020) nor explain the fact that models overamplify (Shah et al., 2020) such inequalities in their outputs. Focusing on models' components, Costa-jussà et al. (2020b) demonstrate that architectural choices in multilingual MT impact systems' behavior: shared encoder-decoders retain less gender information in the source embeddings and less diversion in the attention than language-specific encoder-decoders (Escolano et al., 2021), thus disfavouring the generation of feminine forms. While discussing the loss and decay of certain words in translation, Vanmassenhove et al. (2019, 2021) attest the existence of an algorithmic bias that leads those forms that are underrepresented in the training data – as it may be the case for feminine references – to further decrease in the MT output. Specifically, Roberts et al. (2020) prove that beam search – unlike sampling – is skewed toward the generation of more frequent (masculine) pronouns, as it leads models to an extreme operating point that exhibits zero variability.

Thus, efforts towards understanding and mitigating gender bias should also account for the model front. To date, this remains largely unexplored.
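To make the decoding effect described above concrete, the following sketch (ours, not taken from the cited studies) contrasts beam search and sampling on a gender-ambiguous English sentence using a generic neural MT model from the Hugging Face transformers library; the specific checkpoint name is an assumption, and any English-to-Italian model could be substituted.

```python
# A minimal decoding probe in the spirit of Roberts et al. (2020): compare
# beam search (deterministic, collapses on the single most probable inflection)
# with sampling (stochastic, can surface lower-frequency inflections).
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-it"  # assumed checkpoint; any en->it model works
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src = ["I am a teacher."]  # gender-ambiguous in English, gender-marked in Italian
inputs = tokenizer(src, return_tensors="pt", padding=True)

# Beam search output: one hypothesis, identical on every run.
beam_out = model.generate(**inputs, num_beams=5, num_return_sequences=1)

# Sampling output: several hypotheses drawn from the model distribution.
sample_out = model.generate(**inputs, do_sample=True, top_k=50, num_return_sequences=5)

print(tokenizer.batch_decode(beam_out, skip_special_tokens=True))
print(tokenizer.batch_decode(sample_out, skip_special_tokens=True))
```

If the skew reported by Roberts et al. (2020) holds for the chosen model, beam search should return the same (typically masculine) inflection on every run, whereas repeated sampling occasionally surfaces the feminine form.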
3.2.3 Emergent Bias

Emergent bias may arise when a system is used in a different context than the one it was designed for, e.g. when it is applied to another demographic group. From car crash dummies to clinical trials, we have evidence of how not accounting for gender differences brings to the creation of male-grounded products with dire consequences (Liu and Dipietro Mager, 2016; Criado-Perez, 2019), such as higher death and injury risks in vehicle crash and less effective medical treatments for women. Similarly, unbeknownst to their creators, MT systems that are not envisioned for a diverse range of users will not generalize for the feminine segment of the population. Hence, in the interaction with an MT system, a woman will likely be misgendered or not have her linguistic style preserved (Hovy et al., 2020).
Other conditions of users/system mismatch may be the result of changing societal knowledge and values. A case in point regards Google Translate's historical decision to adjust its system for instances of gender ambiguity. Since its launch twenty years ago, Google had provided only one translation for single-word gender-ambiguous queries (e.g. the English professor translated in Italian with the masculine professore). In a community increasingly conscious about the power of language to hardwire stereotypical beliefs and women's invisibility (Lindqvist et al., 2019; Beukeboom and Burgers, 2019), the bias exhibited by the system was confronted with a new sensitivity. The service's announcement (Kuczmarski, 2018) to provide a double feminine/masculine output (professor→professoressa|professore) stems from current demands for gender-inclusive resolutions. For the recognition of non-binary groups (Richards et al., 2016), we invite to study how such modeling could be integrated with neutral strategies (§6).

4 Assessing Bias

First accounts on gender bias in MT date back to Frank et al. (2004). Their manual analysis pointed out how English-German MT suffers from a dearth at the linguistic level, observing severe difficulties in recovering syntactic and semantic information to correctly produce gender agreement.

Akin investigations were conducted on other target grammatical gender languages for several commercial MT systems (Abu-Ayyash, 2017; Monti, 2017; Rescigno et al., 2020). While these studies focused on contrastive phenomena, Schiebinger (2014)[8] went beyond linguistic insights, calling for a deeper understanding of gender bias. Her article on Google Translate's "masculine default" behavior emphasized how such phenomenon is related to a larger discussion on gender inequalities, also perpetuated by socio-technical artifacts (Selbst et al., 2019). All in all, these qualitative analyses demonstrated that gender problems encompass all three MT paradigms (neural, statistical, and rule-based), preparing the ground for quantitative work.

[8] See also Schiebinger's project Gendered Innovations.

To attest the existence and scale of gender bias across several languages, dedicated benchmarks, evaluations, and experiments have been designed. We first discuss large scale analyses aimed at assessing gender bias in MT, grouped according to two main conceptualizations: i) works focusing on the weight of prejudices and stereotypes in MT (§4.1); ii) studies assessing whether gender is properly preserved in translation (§4.2). To keep the connection with the human-centered approach embraced in this survey, in Table 1 we map each work to the harms (see §2) ensuing from the biased behaviors they assess. Finally, we review existing benchmarks for comparing MT performance across genders (§4.3).

4.1 MT and Gender Stereotypes

In MT, we record prior studies concerned with pronoun translation and coreference resolution across typologically different languages accounting for both animate and inanimate referents (Hardmeier and Federico, 2010; Le Nagard and Koehn, 2010; Guillou, 2012). For the specific analysis on gender bias, instead, such tasks are exclusively studied in relation to human entities.
Prates et al. (2018) and Cho et al. (2019) design a similar setting to assess gender bias. Prates et al. (2018) investigate pronoun translation from 12 gender neutral languages into English. Retrieving ∼1,000 job positions from the U.S. Bureau of Labor statistics, they build simple constructions like the Hungarian "ő egy mérnök" ("he/she is an engineer"). Following the same template, Cho et al. (2019) extend the analysis to Korean-English including both occupations and sentiment words (e.g. kind). As their samples are ambiguous by design, the observed predictions of he/she pronouns should be basically a random guess, yet they show a general strong masculine skew.[9]

[9] Cho et al. (2019) also recognize that a potentially higher frequency of feminine references in the MT output would not necessarily imply that the problem of bias is alleviated. Rather, it may reflect gender stereotypes, as for hairdresser that is skewed toward feminine.

To further analyze the under-representation of she pronouns, Prates et al. (2018) focus on 22 macro-categories of occupation areas (e.g. STEM, communication, administration) and compare the proportion of pronoun predictions against the real-world proportion of men and women employed in such sectors. In this way, they see that MT not only yields a masculine default, but also underestimates feminine frequency at a greater rate than occupation data alone suggest. Such an analysis attests the existence of machine bias, and defines it as the exacerbation of actual gender disparities.
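As an illustration of this probing protocol, the sketch below (ours, not the original study code) fills occupation words into a gender-ambiguous source template, translates each sentence with an arbitrary MT system, and counts the masculine/feminine pronouns produced. The `translate` callable and the toy occupation list are placeholders.

```python
# Template-based probing of pronoun skew on gender-ambiguous sources.
from collections import Counter
from typing import Callable, Iterable

def probe_pronoun_skew(occupations: Iterable[str],
                       template: str,
                       translate: Callable[[str], str]) -> Counter:
    """Count he/she choices produced for ambiguous source templates."""
    counts = Counter()
    for job in occupations:
        source = template.format(occupation=job)   # e.g. Hungarian "ő egy mérnök"
        target_tokens = translate(source).lower().split()
        if "he" in target_tokens:
            counts["masculine"] += 1
        elif "she" in target_tokens:
            counts["feminine"] += 1
        else:
            counts["other"] += 1
    return counts

# Stand-in translation function; for truly ambiguous inputs an unbiased system
# would approach a 50/50 split rather than a masculine skew.
fake_mt = lambda s: "he is an engineer"
print(probe_pronoun_skew(["mérnök", "ápoló"], "ő egy {occupation}", fake_mt))
```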
Going beyond word lists and simple synthetic constructions, Gonen and Webster (2020) inspect the translation into Russian, Spanish, German, and French of natural, but still ambiguous, English sentences. Their analysis on the ratio and type of generated masculine/feminine job titles consistently exhibits social asymmetries for target grammatical gender languages (e.g. lecturer masculine vs. teacher feminine). Finally, Stanovsky et al. (2019) assess that MT is skewed to the point of actually ignoring explicit feminine gender information in source English sentences. For instance, MT systems yield a wrong masculine translation of the job title baker, albeit it is referred to by the pronoun she. Beside the overlook of overt gender mentions, the model's reliance on unintended (and irrelevant) cues for gender assignment is further confirmed by the fact that adding a socially connoted – but formally epicene – adjective (the pretty baker) pushes models toward feminine inflections in translation.

4.2 MT and Gender Preservation

Instead of analysing the weight of prejudices and stereotypes, Vanmassenhove et al. (2018) and Hovy et al. (2020) investigate whether speaker's gender[10] is properly reflected in MT. This line of research is preceded by findings on gender personalization of statistical MT (Mirkin et al., 2015; Bawden et al., 2016; Rabinovich et al., 2017), claiming that gender "signals" are weakened in translation.

[10] Note that these studies distinguish speakers into female/male. As discussed in §3.1.3, we invite a reflection on the appropriateness and use of such categories.

Hovy et al. (2020) conjecture the existence of an age and gender stylistic bias due to models' underexposure to the writings of women and younger segments of the population. To test this hypothesis, they automatically translate a corpus of online reviews with available metadata about users (Hovy et al., 2015). Then, they compare such demographic information with the prediction of age and gender classifiers run on the MT output. Results indicate that different commercial MT models systematically make authors sound older and male. However, the authors do not inspect which stylistic features and linguistic choices MT overproduces.

In a similar vein, Vanmassenhove et al. (2018) probe MT's ability to preserve speaker's gender translating from English into ten languages. To this aim, they develop gender-informed MT models (see §5.1), whose outputs are compared with those obtained by their baseline counterparts. Tested on a set for spoken language translation (Koehn, 2005), their enhanced models show consistent gains in terms of overall quality when translating into grammatical gender languages, where speaker's references are often marked. For instance, the French translation of "I'm happy" is either "Je suis heureuse" or "Je suis heureux" for a female/male speaker respectively. With more focused cross-gender analysis – carried out by splitting their English-French test set into 1st person male vs. female data – they assess that the largest margin of improvement for their gender-informed approach concerns sentences uttered by women, as the results of their baseline disclose a disparity in favour of men speakers. Note that the authors rely on manual analysis to ascribe performance differences to gender-related features. In fact, global evaluations on generic test sets alone are inadequate to pointedly measure gender bias.

4.3 Existing Benchmarks

MT outputs are typically evaluated against reference translations by means of standard metrics such as BLEU (Papineni et al., 2002) or TER (Snover et al., 2006). This procedure poses two challenges. First, these metrics provide coarse-grained scores for translation quality, treating all errors equally and being rather insensitive to specific linguistic phenomena (Sennrich, 2017). Second, generic test sets containing the same gender imbalance present in the training data can actually reward biased predictions. Hereby, we describe the publicly available MT Gender Bias Evaluation Testsets (GBETs) (Sun et al., 2019), i.e. benchmarks designed to probe gender bias by isolating the impact of gender from other factors that may affect systems' performance. Note that different benchmarks and metrics respond to different conceptualizations of bias (Barocas et al., 2019).
Common to them all in MT, however, is that biased behaviors are formalized using some variants of averaged performance[11] disparities across gender groups, comparing the accuracy of gender predictions on an equal number of masculine, feminine, and neutral references.

[11] This is a value-laden option (Birhane et al., 2020), and not the only possible one (Mitchell et al., 2020). For a broader discussion on measurement and bias we refer the reader also to (Jacobs, 2021; Jacobs et al., 2020).

Study | Benchmark | Gender | Harms
(Prates et al., 2018) | Synthetic, U.S. Bureau of Labor Statistics | b | R: under-rep, stereotyping
(Cho et al., 2019) | Synthetic equity evaluation corpus (EEC) | b | R: under-rep, stereotyping
(Gonen and Webster, 2020) | BERT-based perturbations on natural sentences | b | R: under-rep, stereotyping
(Stanovsky et al., 2019) | WinoMT | b | R: under-rep, stereotyping
(Vanmassenhove et al., 2018) | Europarl (generic) | b | A: quality
(Hovy et al., 2020) | Trustpilot (reviews with gender and age) | b | R: under-rep, A: quality

Table 1: For each Study, the Table shows on which Benchmark gender bias is assessed, and how Gender is intended (here only in binary (b) terms). Finally, we indicate which (R)epresentational – under-representation and stereotyping – or (A)llocational Harm – as reduced quality of service – is addressed in the study.
Escudé Font and Costa-jussà (2019) developed the bilingual English-Spanish Occupations test set, consisting of 1,000 sentences equally distributed across genders. The phrasal structure envisioned for their sentences is "I've known {her|him|<proper noun>} for a long time, my friend works as {a|an} <occupation>". The evaluation focuses on the translation of the noun friend into Spanish (amigo/amiga). Since gender information is present in the source context and sentences are the same for both masculine and feminine participants, an MT system exhibits gender bias if it cannot provide the correct translation of friend at the same rate across genders.

Stanovsky et al. (2019) created WinoMT by concatenating two existing English GBETs for coreference resolution (Rudinger et al., 2018; Zhao et al., 2018a). The corpus consists of 3,888 Winogradesque sentences presenting two human entities defined by their role and a subsequent pronoun that needs to be correctly resolved to one of the entities (e.g. "The lawyer yelled at the hairdresser because he did a bad job"). For each sentence, there are two variants with either he or she pronouns, so as to cast the referred annotated entity (hairdresser) into a proto- or antistereotypical gender role. By translating WinoMT into grammatical gender languages, one can thus measure a system's ability to resolve the anaphorical relation and pick the correct feminine/masculine inflection for the occupational noun. Also, it allows to verify if MT predictions correlate with stereotyping.
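The kind of fine-grained scores such challenge sets afford can be computed with a few lines of bookkeeping. The sketch below is a simplified stand-in for the official WinoMT evaluation scripts: it assumes the gender predicted by the system for the annotated entity has already been extracted from the translation (e.g. via morphological analysis), and breaks accuracy down by gold gender and by pro-/anti-stereotypical role.

```python
# Accuracy breakdown over annotated challenge-set examples (toy data).
from collections import defaultdict

examples = [
    # (gold_gender, stereotype, predicted_gender) -- illustrative values only
    ("feminine", "anti", "masculine"),
    ("feminine", "pro", "feminine"),
    ("masculine", "pro", "masculine"),
    ("masculine", "anti", "masculine"),
]

totals, hits = defaultdict(int), defaultdict(int)
for gold, stereotype, pred in examples:
    for key in (gold, stereotype):          # aggregate by gender and by stereotype
        totals[key] += 1
        hits[key] += int(pred == gold)

for key in sorted(totals):
    print(f"{key:10s} accuracy: {hits[key] / totals[key]:.2f}")
# A gap between 'masculine' and 'feminine' scores signals under-representation;
# a gap between 'pro' and 'anti' scores signals stereotyping.
```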
Finally, Saunders et al. (2020) enriched the original version of WinoMT in two different ways. First, by including a third gender neutral case based on the singular they pronoun, which paves the way for accounting also for non-binary referents. Second, by labeling the entity in the sentence which is not coreferent with the pronoun (lawyer). The latter annotation is used to verify the shortcomings of some mitigating approaches as discussed in §5.

The above-mentioned corpora are said to be challenge sets, consisting of sentences created ad hoc for diagnostic purposes. In this way, they can be used to quantify bias in the context of stereotyping and under-representation in a sound environment. However, consisting of a limited variety of synthetic gender-related phenomena, they do not represent the actual challenges posed by real-world language and are relatively easy to overfit. In other words, as recognized by Rudinger et al. (2018), "they may demonstrate the presence of gender bias in a system, but not prove its absence".

The Arabic Parallel Gender Corpus (Habash et al., 2019) includes an English-Arabic test set retrieved from OpenSubtitles natural language data (Lison and Tiedemann, 2016). Each of the 2,448 sentences in the set exhibits a first person singular reference to the speaker (e.g. "I'm rich"). Among them, ∼200 English sentences require gender agreement to be assigned in translation. These were translated into Arabic in both gender forms, obtaining a quantitatively and qualitatively equal amount of sentence pairs with annotated masculine/feminine references. This natural corpus thus allows for cross-gender evaluations on MT production of correct speaker's gender agreement.

MuST-SHE (Bentivogli et al., 2020) is a natural benchmark for three language pairs (English-French/Italian/Spanish). Built on TED talks data (Cattoni et al., 2021), for each language pair it comprises ∼1,000 (audio, transcript, translation) triplets, thus allowing evaluation for both MT and speech translation (ST). Its samples are balanced between masculine and feminine phenomena, and incorporate two types of constructions: i) sentences referring to the speaker (e.g. "I was born in Mumbai"), and ii) sentences that present contextual information to disambiguate gender (e.g. "My mum was born in Mumbai"). Since every gender-marked word is annotated in the corpus, MuST-SHE grants the advantage of complementing BLEU- and accuracy-based evaluations on gender translation for a great variety of phenomena.

Unlike challenge sets, these natural corpora quantify whether MT yields reduced feminine representation in authentic conditions and whether the quality of service varies across speakers of different genders.
However, as they treat all gender-marked words equally, it is not possible to identify if the model is propagating stereotypical representations.

All in all, we stress that each test set and metric is only a proxy for framing a phenomenon or an ability (e.g. anaphora resolution), and an approximation of what we truly intend to gauge. Thus, as we discuss in §6, advances in MT should account for the observation of gender bias in real-world conditions, so as to avoid that achieving high scores on a mathematically formalized estimate could lead to a false sense of security. Still, benchmarks remain valuable tools to monitor a model's behavior. As such, we remark that evaluation procedures ought to cover both models' general performance and gender-related issues. This is crucial to establish the capabilities and limits of mitigating strategies.

5 Mitigating Bias

To attenuate gender bias in MT, different strategies dealing with input data, learning algorithms, and model outputs have been proposed. As attested by Birhane et al. (2020), since advancements are oftentimes exclusively reported in terms of values internal to the machine learning field (e.g. efficiency, performance), it is not clear how such strategies are meeting societal needs by reducing MT-related harms. In order to conciliate technical perspectives with the intended social purpose, in Table 2 we map each mitigating approach to the harms (see §2) they are meant to alleviate, as well as on which benchmark their effectiveness is evaluated. Complementarily, we hereby describe each approach by means of two categories: model debiasing (§5.1) and debiasing through external components (§5.2).

Approach | Authors | Benchmark | Gender | Harms
Gender tagging (sentence-level) | Vanmassenhove et al. | Europarl (generic) | b | R: under-rep, A: quality
Gender tagging (sentence-level) | Elaraby et al. | Open subtitles (generic) | b | R: under-rep, A: quality
Gender tagging (word-level) | Saunders et al. | expanded WinoMT | nb | R: under-rep, stereotyping
Gender tagging (word-level) | Stafanovičs et al. | WinoMT | b | R: under-rep, stereotyping
Adding context | Basta et al. | WinoMT | b | R: under-rep, stereotyping
Word-embeddings | Escudé Font and Costa-jussà | Occupation test set | b | R: under-rep
Fine-tuning | Costa-jussà and de Jorge | WinoMT | b | R: under-rep, stereotyping
Black-box injection | Moryossef et al. | Open subtitles (selected sample) | b | R: under-rep, A: quality
Lattice-rescoring | Saunders and Byrne | WinoMT | b | R: under-rep, stereotyping
Re-inflection | Habash et al.; Alhafni et al. | Arabic Parallel Gender Corpus | b | R: under-rep, A: quality

Table 2: For each Approach and related Authors, the Table shows on which Benchmark it is tested, and if Gender is intended in binary terms (b), or including non-binary (nb) identities. Finally, we indicate which (R)epresentational – under-representation and stereotyping – or (A)llocational Harm – as reduced quality of service – the approach attempts to mitigate.

5.1 Model Debiasing

This line of work focuses on mitigating gender bias through architectural changes of general-purpose MT models or via dedicated training procedures.

Gender tagging. To improve the generation of speaker's referential markings, Vanmassenhove et al. (2018) prepend a gender tag (M or F) to each source sentence, both at training and inference time. As their model is able to leverage this additional information, the approach proves useful to handle morphological agreement when translating from English into French. However, this solution requires additional metadata regarding the speakers' gender that might not always be feasible to acquire. Automatic annotation of speakers' gender (e.g. based on first names) is not advisable, as it runs the risk of introducing additional bias by making unlicensed assumptions about one's identity.
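A minimal sketch of such sentence-level tagging is given below; the tag format and the corpus fields are illustrative assumptions rather than the authors' exact preprocessing.

```python
# Sentence-level gender tagging: prepend a speaker-gender pseudo-token that the
# MT model can learn to condition on, both at training and inference time.
def add_speaker_tag(source: str, speaker_gender: str) -> str:
    tag = {"female": "<F>", "male": "<M>"}[speaker_gender]
    return f"{tag} {source}"

corpus = [("I am happy to be here.", "female"),
          ("I am happy to be here.", "male")]
tagged = [add_speaker_tag(text, gender) for text, gender in corpus]
print(tagged)
# ['<F> I am happy to be here.', '<M> I am happy to be here.']
# At inference time, the same speaker metadata must be available to pick the tag.
```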
Elaraby et al. (2018) bypass this risk by defining a comprehensive set of cross-lingual gender agreement rules based on POS tagging. In this way, they identify the speakers' and listeners' gender references in an English-Arabic parallel corpus, which is consequently labelled and used for training. However, such an approach is not directly scalable to other languages, as it would require creating new dedicated rules. Moreover, in realistic deployment conditions where reference translations are not available, this information still has to be externally supplied as metadata at inference time.

Stafanovičs et al. (2020) and Saunders et al. (2020) explore the use of word-level gender tags. While Stafanovičs et al. (2020) just report a gender translation improvement, Saunders et al. (2020) rely on the expanded version of WinoMT to identify a problem concerning gender tagging: it introduces noise if applied to sentences with references to multiple participants, as it pushes their translation toward the same gender. Saunders et al. (2020) also include a first non-binary exploration of neutral translation by exploiting an artificial dataset, where neutral tags are added and gendered inflections are replaced by placeholders. The results are however inconclusive, most likely due to the small size and synthetic nature of their dataset.

Adding context. Without further information needed for training or inference, Basta et al. (2020) adopt a generic approach and concatenate each sentence with its preceding one. By providing more context, they attest a slight improvement for gender translations requiring anaphoric coreference to be solved in English-Spanish. This finding motivates exploration at the document level, but it should be validated with manual (Castilho et al., 2020) and interpretability analyses, as the added context can be beneficial for gender-unrelated reasons, such as acting as a regularization factor (Kim et al., 2019).

Debiased word embeddings. The two above-mentioned mitigations converge on the same intent: supply the model with additional gender knowledge. Instead, Escudé Font and Costa-jussà (2019) leverage pre-trained word embeddings, which are debiased using the hard-debiasing method proposed by Bolukbasi et al. (2016) or the GN-GLOVE algorithm (Zhao et al., 2018b). With these methods, gender associations are respectively removed or isolated from the representations of English gender-neutral words. Escudé Font and Costa-jussà (2019) experiment using such embeddings on the decoder side, the encoder side, and both sides of an English-Spanish model. The best results are obtained by leveraging GN-GLOVE embeddings on both encoder and decoder sides, increasing BLEU scores and gender accuracy. The authors generically apply debiasing methods developed for English also to their target language. However, Spanish being a grammatical gender language, other language-specific approaches should be considered to preserve the quality of the original embeddings (Zhou et al., 2019; Zhao et al., 2020). We also stress that it is debated whether depriving systems of some knowledge and "blinding" their perceptions is the right path toward fairer language models (Dwork et al., 2012; Caliskan et al., 2017; Gonen and Goldberg, 2019; Nissim and van der Goot, 2020). Also, Goldfarb-Tarrant et al. (2021) find that there is no reliable correlation between intrinsic evaluations of bias in word-embeddings and cascaded effects on MT models' biased behavior.
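For intuition, the following sketch shows the core neutralization step behind hard-debiasing: estimating a gender direction from a few seed pairs and projecting it out of the vectors of gender-neutral words. It is a deliberate simplification of Bolukbasi et al. (2016), using random toy vectors instead of real embeddings.

```python
# Simplified hard-debiasing: remove the gender component from neutral words.
import numpy as np

def gender_direction(emb: dict) -> np.ndarray:
    diffs = [emb["he"] - emb["she"], emb["man"] - emb["woman"]]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

def neutralize(vector: np.ndarray, direction: np.ndarray) -> np.ndarray:
    # Project the vector onto the orthogonal complement of the gender direction.
    return vector - np.dot(vector, direction) * direction

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["he", "she", "man", "woman", "engineer"]}
d = gender_direction(emb)
emb["engineer"] = neutralize(emb["engineer"], d)  # gender-neutral word, debiased
print(np.dot(emb["engineer"], d))                 # ~0: gender component removed
```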
Balanced fine-tuning. Finally, Costa-jussà and de Jorge (2020) rely on Gebiotoolkit (Costa-jussà et al., 2020c) to build gender-balanced datasets (i.e. featuring an equal amount of masculine/feminine references) based on Wikipedia biographies. By fine-tuning their models on such natural and more representative data, the generation of feminine forms is overall improved. However, the approach is not as effective for gender translation on the anti-stereotypical WinoMT set.

5.2 Debiasing through External Components

Instead of directly debiasing the MT model, these mitigating strategies intervene in the inference phase with external dedicated components. Such approaches do not imply retraining, but introduce the additional cost of maintaining separate modules and handling their integration with the MT model.

Black-box injection. Moryossef et al. (2019) attempt to control the production of feminine references to the speaker and numeral inflections (plural or singular) for the listener(s) in an English-Hebrew setting. To this aim, they rely on a short construction, such as "she said to them", which is prepended to the source sentence and then removed from the MT output. Their approach is simple, it can handle two types of information (gender and number) for multiple entities (speaker and listener), and improves systems' ability to generate feminine target forms. However, as in the case of (Vanmassenhove et al., 2018; Elaraby et al., 2018), it requires metadata about speakers and listeners.
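The sketch below illustrates the injection-and-removal idea with a hypothetical prefix and a naive stripping heuristic; the exact constructions and the way Moryossef et al. (2019) remove them from the output differ in detail.

```python
# Black-box injection: add a gender/number-revealing frame to the source, then
# strip its translation from the output of an otherwise untouched MT system.
def inject(source: str, speaker: str = "she", listeners: str = "them") -> str:
    return f'{speaker} said to {listeners}: "{source}"'

def strip_frame(translation: str) -> str:
    # Keep only the translated quoted span, dropping the injected frame.
    if '"' in translation:
        return translation.split('"', 1)[1].rstrip('"').strip()
    return translation

src = "I am happy."
query = inject(src)                        # 'she said to them: "I am happy."'
# mt_output = some_black_box_mt(query)     # e.g. a commercial MT API
mt_output = 'היא אמרה להם: "אני שמחה."'     # illustrative Hebrew output
print(strip_frame(mt_output))              # the feminine-marked form, frame removed
```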
Lattice re-scoring. Saunders and Byrne (2020) propose to post-process the MT output with a lattice re-scoring module. This module exploits a transducer to create a lattice by mapping gender-marked words in the MT output to all their possible inflectional variants. Developed for German, Spanish, and Hebrew, all the sentences corresponding to the paths in the lattice are re-scored with another model, which has been gender-debiased, but at the cost of lower generic translation quality. Then, the sentence with the highest probability is picked as the final output. When tested on WinoMT, such an approach notably leads to an increase in the accuracy of gender form selection.

Note that the gender-debiased system is created by fine-tuning the model on an ad-hoc built tiny set containing a balanced amount of masculine/feminine forms. Such an approach, also known as counterfactual data augmentation (Lu et al., 2020), requires creating identical pairs of sentences differing only in terms of gender references. In fact, Saunders and Byrne (2020) compile English sentences following this schema: "The <profession> finished <his|her> work". Then, the sentences are automatically translated and manually checked. In this way, they obtain a gender-balanced parallel corpus. Thus, to implement their method for other language pairs, the generation of new data is necessary. Although for their fine-tuning set the effort required is limited, data augmentation can be very costly for complex sentences representing a richer variety of gender agreement phenomena.[12]

[12] Zmigrod et al. (2019) proposed an automatic approach for augmenting data into morphologically-rich languages, but it is only viable for simple constructions with one single entity.
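A toy version of this schema-based data creation is shown below (our illustration, not the authors' scripts); the list of professions and the carrier sentence are placeholders.

```python
# Counterfactual data creation: each profession yields a masculine and a
# feminine variant of the same carrier sentence, giving a gender-balanced set
# to be translated, manually checked, and used for fine-tuning.
professions = ["doctor", "nurse", "engineer", "cleaner"]

def build_counterfactual_pairs(professions):
    pairs = []
    for job in professions:
        pairs.append(f"The {job} finished his work.")
        pairs.append(f"The {job} finished her work.")
    return pairs

for sentence in build_counterfactual_pairs(professions):
    print(sentence)
```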
Gender re-inflection. Habash et al. (2019) and Alhafni et al. (2020) confront the problem of speaker's gender agreement in Arabic with a post-processing component that re-inflects 1st person references into masculine/feminine forms. In (Alhafni et al., 2020), the preferred gender of the speaker and the translated Arabic sentence are fed to that component, which re-inflects the sentence in the desired form. In (Habash et al., 2019), instead, the component can be: i) a two-step system that first identifies the gender of 1st person references in an MT output, and then re-inflects them in the opposite form; ii) a single-step system that always produces both forms from an MT output. Their method does not necessarily require speakers' gender information: if metadata are supplied, the MT output is re-inflected accordingly; differently, both feminine/masculine inflections are offered (leaving to the user the choice of the appropriate one). While beneficial for English-Arabic, their approach is not directly applicable to other languages. In fact, unlike (Saunders and Byrne, 2020), implementing the re-inflection component demanded the expensive work of data creation of the Arabic Parallel Gender Corpus (§4.3). Along the same line, Google Translate now also delivers two outputs for short gender-ambiguous queries (Johnson, 2020b). However, among languages with grammatical gender, the service is available only for English-Spanish.

In light of the above, we remark that there is no conclusive state-of-the-art method for mitigating bias. The discussed interventions in MT tend to respond to specific aspects of the problem with modular solutions, but if and how they can be conciliated within the same MT system remains unexplored. Besides, gender bias in MT is a socio-technical problem. We thus highlight that engineering interventions alone are not a panacea (Chang, 2019) and should be integrated with the long-term multidisciplinary commitment and practices (D'Ignazio and Klein, 2020; Gebru, 2020) necessary to address bias in our community, hence in its artifacts, too.
6 Conclusion and Key Challenges

As disparate studies confronting gender bias in MT are rapidly emerging, in this paper we presented them within a unified framework to critically overview current conceptualizations and approaches to the problem. Since gender bias is a multifaceted and interdisciplinary issue, in our discussion we integrated knowledge from related disciplines, which can be instrumental to guide future research and make it thrive. We conclude by suggesting several directions that can help this field going forward.

Model de-biasing. Neural networks rely on easy-to-learn shortcuts or "cheap tricks" (Levesque, 2014), as picking up on spurious correlations offered by training data can be easier for machines than learning to actually solve a specific task. What is "easy to learn" for a model depends on the inductive bias (Sinz et al., 2019; Geirhos et al., 2020) resulting from architectural choices, training data and learning rules. We think that explainability techniques (Belinkov et al., 2020) represent a useful tool to identify spurious cues (features) exploited by the model during inference. Discerning them can provide the research community with guidance on how to improve models' generalisation by working on the data, architectures, loss functions and optimizations. For instance, data responsible for spurious features (e.g. stereotypical correlations) might be recognized and their weight at training time lowered (Karimi Mahabadi et al., 2020). Besides, state-of-the-art architectural choices and algorithms in MT have mostly been studied in terms of overall translation quality, without specific analyses regarding gender translation. For instance, current systems segment text into subword units with statistical methods that can break the morphological structure of words, losing relevant semantic and syntactic information in morphologically-rich languages (Niehues et al., 2016; Ataman et al., 2017). Several languages show complex feminine forms, typically derivative and created by adding a suffix to the masculine form, like Lehrer/Lehrerin (de), studente/studentessa (it). It would be relevant to investigate whether, compared to other segmentation techniques, statistical approaches disadvantage (rarer and more complex) feminine forms. The MT community should not overlook focused hypotheses of such kind, as they expand our comprehension of the gender bias conundrum.
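One low-cost way to start testing this hypothesis is to inspect how an off-the-shelf subword tokenizer segments masculine forms and their feminine counterparts, as in the sketch below; the chosen checkpoint is an arbitrary assumption, and the probe by itself says nothing about downstream translation quality.

```python
# Probe how a subword tokenizer splits masculine vs. derivative feminine forms.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

for masc, fem in [("Lehrer", "Lehrerin"), ("studente", "studentessa")]:
    print(masc, tokenizer.tokenize(masc))
    print(fem, tokenizer.tokenize(fem))
# If feminine forms are systematically split into more (and rarer) pieces,
# that is one concrete, testable way in which they may be disadvantaged.
```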
Non-textual modalities. Gender bias for non-textual automatic translations (e.g. audiovisual) has been largely neglected. In this sense, ST represents a small niche (Costa-jussà et al., 2020a). For the translation of speaker-related gender phenomena, Bentivogli et al. (2020) prove that direct ST systems exploit speaker's vocal characteristics as a gender cue to improve feminine translation. However, as addressed by Gaido et al. (2020), relying on physical gender cues (e.g. pitch) for such a task implies reductionist gender classifications (Zimman, 2020) within systems, making them potentially harmful for a diverse range of users. Similarly, although image-guided translation has been claimed useful for gender translation as it relies on visual inputs for disambiguation (Frank et al., 2018; Ive et al., 2019), it could bend toward stereotypical assumptions about appearance. Further research should explore such directions to identify potential challenges and risks, drawing on bias in image captioning (van Miltenburg, 2019), and consolidated studies from the fields of automatic gender recognition and human computer interaction (HCI) (Hamidi et al., 2018; Keyes, 2018; May, 2019).

Beyond Dichotomies. Apart from few notable exceptions for English NLP tasks (Manzini et al., 2019; Cao and Daumé III, 2020; Sun et al., 2021), and one in MT (Saunders et al., 2020), the discussion around gender bias has been reduced to the binary masculine/feminine dichotomy. Although research in this direction is currently hampered by the absence of data, we invite considering inclusive solutions and exploring nuanced dimensions of gender. Starting from language practices, Indirect Non-binary Language (INL) overcomes gender specifications (e.g. using service, humankind rather than waiter/waitress or mankind).[13] Whilst more challenging, INL can be achieved also for grammatical gender languages (Motschenbacher, 2014; Lindqvist et al., 2019), and it is endorsed for official EU documents.[14] Accordingly, MT models could be brought to avoid binary forms and move toward gender-unspecified solutions, e.g. adversarial networks including a discriminator that classifies speaker's linguistic expression of gender (masculine or feminine) could be employed to "neutralize" speaker-related forms (Li et al., 2018; Delobelle et al., 2020). On the other side, Direct Non-binary Language (DNL) aims at increasing the visibility of non-binary individuals via neologisms and neomorphemes (Bradley et al., 2019; Papadopoulos, 2019). With DNL starting to circulate (Shroy, 2016; Santiago, 2018; López, 2019), the community is presented with the opportunity to engage with the creation of more inclusive data.

[13] INL suggestions have also been recently implemented within Microsoft text editors (Langston, 2020).
[14] See the Europarl guidelines.

Finally, as already highlighted in law and social science theory, discrimination can arise from the intersection of multiple identity categories (e.g. race and gender) (Crenshaw, 1989), which are not additive and cannot always be detected in isolation (Schlesinger et al., 2017). Following the MT work by Hovy et al. (2020), as well as other intersectional analyses from NLP (Herbelot et al., 2012; Jiang and Fellbaum, 2020) and AI-related fields (Buolamwini and Gebru, 2018), future studies may account for the interaction of gender attributes with other sociodemographic classes.
Human-in-the-loop. Research on gender bias in MT is still restricted to lab tests. As such, unlike other studies relying on participatory design (Turner et al., 2015; Cercas Curry et al., 2020; Liebling et al., 2020), the advancement of the field is not measured in line with people accounting for their experiences, in relation to specific deployment contexts. However, these are fundamental considerations to guide the field forward and, as HCI studies show (Vorvoreanu et al., 2019), concretely support the creation of gender-inclusive technology. Also, we invite the whole development process to be paired with bias-aware research methodology (Havens et al., 2020) and HCI approaches (Stumpf et al., 2020), which help operationalize sensitive attributes like gender (Keyes et al., 2021). Finally, MT is not only built for people, but also by people. Thus, it is vital to reflect on implicit biases and backgrounds of the people involved in MT pipelines at all stages and how they could be reflected in the model. This means starting from bottom-level countermeasures, engaging with translators (De Marco and Toto, 2019; Lessinger, 2020), annotators (Waseem, 2016; Geva et al., 2019), considering everyone's subjective positionality and, crucially, also the lack of diversity within technology teams (Schluter, 2018; Waseem et al., 2020).