Societal Biases in Language Generation: Progress and Challenges

Emily Sheng (1), Kai-Wei Chang (2), Premkumar Natarajan (1), Nanyun Peng (1,2)
(1) Information Sciences Institute, University of Southern California
(2) Computer Science Department, University of California, Los Angeles
{ewsheng,pnataraj}@isi.edu, {kwchang,violetpeng}@cs.ucla.edu

Abstract

Technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. While techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. Language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. To better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. By further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.

1 Introduction

Natural language generation (NLG) is a suite of techniques that enables the generation of human-readable language for different goals. These techniques are the core components of applications such as virtual assistants, chat bots, automatic translators, summarizers, and creative language composers. Recent advances in techniques for language generation (e.g., GPT (Radford et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), TransformerXL (Dai et al., 2019), XLNet (Yang et al., 2019)) powered by Transformers (Vaswani et al., 2017) and an increasing repository of available data have created more capable applications. This has, in turn, channeled more interest and effort into developing NLG techniques.

We emphasize the importance of better understanding how societal biases manifest in NLG techniques, because NLG applications directly interact with many different users to generate novel content in various domains (e.g., chat bots for health, education, and customer support). However, when techniques are less effective or detrimental for marginalized populations, these techniques can inadvertently become gatekeepers of those populations for generation and associated language technologies. For example, an educational chat bot that produces more negative responses for topics about a specific ethnicity will discourage users of that ethnicity from interacting with the chat bot. While it is generally important to study the societal impact of NLP and AI techniques, we argue that the direct user impact of NLG techniques makes it especially important to carefully quantify the impact.

Motivated by the importance of fairness in language generation, we present the first comprehensive survey on societal biases in language generation. By enumerating how NLG techniques contribute to biases and examining progress towards bias analysis and mitigation, we contextualize the discussion of broader trends and challenges. Specifically, we focus on techniques for NLG tasks, i.e., tasks that generate a sequence of text. (Although bi-directional language models like BERT (Devlin et al., 2019) can also be used for auto-regressive generation (Wang and Cho, 2019; Chen et al., 2020), traditional auto-regressive models are still typically of better quality and more widely used for generation (Shwartz et al., 2020); we thus limit the scope of this survey to the latter models.) Finding a lack of studies on biases from decoding techniques, we additionally present an experimental study to quantify the effects of various decoding techniques.

Before we delve into the details of biases in language generation, we first position our survey in the context of other relevant surveys and position papers. Sun et al. (2019) present a focused survey on mitigating gender biases and Shah et al. (2020) categorize sources of biases—both largely focus on natural language understanding (NLU) tasks, while we examine biases in NLG tasks.
Additionally, Blodgett et al. (2020) urge for more explicitly tying "biases" in NLP to societal normative definitions of biases and social hierarchies; with their recommendations in mind, we discuss the negative impacts of biases in NLG techniques.

Our contributions are a comprehensive survey on societal biases in language generation and an experimental study on biases from decoding techniques. To start, we describe classes of NLG tasks (Sec. 2) and subsequently examine examples of biases and harms in NLG (Sec. 3). We then discuss NLG techniques that facilitate biases, including a study of decoding techniques (Sec. 4). Sec. 5 highlights progress and challenges, and Sec. 6 presents open problems and proposals. We hope this survey brings more visibility to the importance of carefully considering different components of NLG pipelines for potential biases and mitigation methods.

2 Language Generation Tasks

To begin, we categorize generation tasks and introduce existing bias studies relevant to each task. NLG tasks broadly fall into two categories: those that generate text continuations conditioned on some prompt and those that transform text from one form to another. Table 1 organizes various bias-related works for NLG tasks.

Gender
• Autocomplete: Bordia and Bowman (2019); Qian et al. (2019); Solaiman et al. (2019); Sheng et al. (2019, 2020); Vig et al. (2020); Yeo and Chen (2020); Brown et al. (2020); Dhamala et al. (2021); Schick et al. (2021); Nozza et al. (2021); Kirk et al. (2021)
• Dialogue: Henderson et al. (2018); Dinan et al. (2020a); Liu et al. (2020a,b); Cercas Curry et al. (2020); Sheng et al. (2021a,b)
• MT: Vanmassenhove et al. (2018); Elaraby et al. (2018); Prates et al. (2019); Stanovsky et al. (2019); Escudé Font and Costa-jussà (2019); Cho et al. (2019); Moryossef et al. (2019); Saunders and Byrne (2020); Saunders et al. (2020); Kocmi et al. (2020); Costa-jussà and de Jorge (2020); Costa-jussà et al. (2020); Basta et al. (2020); Farkas and Németh (2020); Stafanovičs et al. (2020); Gonen and Webster (2020); Hovy et al. (2020); Roberts et al. (2020); Cho et al. (2021); Savoldi et al. (2021); Renduchintala and Williams (2021); Choubey et al. (2021); Saunders et al. (2021); Tomalin et al. (2021)
• Re-writing: Habash et al. (2019); Zmigrod et al. (2019); Alhafni et al. (2020); Sun et al. (2021)
Profession
• Autocomplete: Huang et al. (2020); Dhamala et al. (2021)
Race
• Autocomplete: Solaiman et al. (2019); Sheng et al. (2019, 2020); Groenwold et al. (2020); Brown et al. (2020); Dhamala et al. (2021); Schick et al. (2021); Kirk et al. (2021)
• Dialogue: Sheng et al. (2021a,b)
Religion
• Autocomplete: Solaiman et al. (2019); Brown et al. (2020); Dhamala et al. (2021); Kirk et al. (2021); Abid et al. (2021)
Sexuality
• Autocomplete: Sheng et al. (2019, 2020); Kirk et al. (2021)
• Dialogue: Sheng et al. (2021a)
Other
• Autocomplete: Shwartz et al. (2020); Peng et al. (2020); Huang et al. (2020); Dhamala et al. (2021); Kirk et al. (2021)
• Dialogue: Sheng et al. (2021a)
• Re-writing: Pryzant et al. (2020); Ma et al. (2020)

Table 1: Existing bias studies on different demographic dimensions in various NLG tasks: autocomplete generation, dialogue generation, machine translation (MT), and text re-writing.

2.1 Continuation Generation Tasks

The continuation class includes autocomplete and dialogue generation, where the goal is to generate text that is coherent and relevant to a prompt.

Autocomplete Generation We use the term autocomplete generation to refer to conditional generation directly from language models. Language models are the core components for many NLG and NLU tasks, and this task enables directly quantifying biases in large, pre-trained language models (Bordia and Bowman, 2019; Sheng et al., 2019; Solaiman et al., 2019; Brown et al., 2020). Existing works analyzing biases in autocomplete generation have mostly examined Transformer-based models, including GPT (Shwartz et al., 2020), GPT-2 (Solaiman et al., 2019; Sheng et al., 2019, 2020; Shwartz et al., 2020; Vig et al., 2020; Yeo and Chen, 2020; Huang et al., 2020; Dhamala et al., 2021; Schick et al., 2021), GPT-3 (Brown et al., 2020), CTRL (Dhamala et al., 2021), TransformerXL (Shwartz et al., 2020; Vig et al., 2020; Huang et al., 2020), and XLNet (Shwartz et al., 2020; Vig et al., 2020; Yeo and Chen, 2020), though Bordia and Bowman (2019) and Qian et al. (2019) also look at LSTM-based models.
Dialogue Generation Dialogue generation is conditioned on user inputs and can be for specific domains (e.g., health, customer service) and tasks (e.g., behavior intervention, booking flights) or general chit-chat. These dialogue applications directly interact with users, and any propagated biases directly affect user behavior and actions. In terms of recurrent dialogue models, Henderson et al. (2018) analyze biases in hierarchical recurrent encoder-decoder architectures and Liu et al. (2020a,b) analyze LSTM-based encoder-decoder models. Other works on dialogue biases (Dinan et al., 2020a; Sheng et al., 2020, 2021b) focus on Transformer-based models such as DialoGPT (Zhang et al., 2020) and other custom architectures.

2.2 Transformation Generation Tasks

The transformation class includes machine translation and various formulations of text re-writing. The general goal of these tasks is to transform text into a form with targeted properties.

Machine Translation Translation is the task of transforming text between languages while preserving the meaning. Existing works on biases in machine translation have almost exclusively focused on issues of gender biases in a variety of academic and commercial systems. (For a detailed survey of gender bias in machine translation, we refer readers to Savoldi et al. (2021).) The use of grammatical gender in some languages and not in others can expose unwanted gender associations (e.g., for different occupations) through translation (Prates et al., 2019). Earlier works by Vanmassenhove et al. (2018) and Elaraby et al. (2018) study LSTM-based encoder-decoder translation systems, and more recent works examine Transformer-based architectures (Escudé Font and Costa-jussà, 2019; Stanovsky et al., 2019; Saunders and Byrne, 2020; Saunders et al., 2020; Costa-jussà and de Jorge, 2020; Basta et al., 2020; Stafanovičs et al., 2020; Renduchintala and Williams, 2021; Choubey et al., 2021; Saunders et al., 2021; Tomalin et al., 2021). While Google Translate (https://translate.google.com) has been the most popular commercial system to analyze for gender biases (Prates et al., 2019; Moryossef et al., 2019; Stanovsky et al., 2019; Cho et al., 2019; Farkas and Németh, 2020), Stanovsky et al. (2019) also study Microsoft Translator (https://www.bing.com/translator), Amazon Translate (https://aws.amazon.com/translate), and SYSTRAN (https://www.systransoft.com); Cho et al. (2019) additionally look at Naver Papago (https://papago.naver.com) and Kakao Translator (https://translate.kakao.com), and Cho et al. (2021) also examine Yandex (https://translate.yandex.com).

Re-writing We use the term re-writing to refer to tasks of revising specific words and phrases in the original text to be more aligned with a targeted attribute. Specifically, there have been studies on re-inflection (Habash et al., 2019; Zmigrod et al., 2019; Alhafni et al., 2020) and re-writing text to use neutral viewpoints (Pryzant et al., 2020), gender-neutral English (Sun et al., 2021), or more agency (Ma et al., 2020). These tasks typically rely on custom encoder-decoder models.

2.3 Other Tasks

There are other NLG tasks, such as the continuation tasks of story and poetry generation, and the transformation tasks of abstractive summarization and paraphrase generation. However, these other NLG tasks are not yet well-studied in the context of societal biases. (Lucy and Bamman (2021) is an exception that analyzes gender in generated stories. While there are studies of biases in poetry generation and summarization, they focus on non-NLG biases: Sheng and Uthus (2020) investigate biases in a poetry composition system, but in the context of information retrieval; Celis and Keswani (2020) analyze biases in extractive summarization.)

3 Biases and their Negative Impacts

In this section, we introduce how existing studies of biases in NLG tasks commonly quantify biases and their negative impacts.

3.1 Bias Definitions and Metrics

In the context of AI fairness, the term "bias" commonly refers to skews that result in undesirable impacts (Crawford, 2017) and is quantifiable with some metric. There are relatively more existing studies on biases in NLU tasks, where it is arguably simpler to define bias metrics, since we can intuitively compare the accuracy of the task (e.g., coreference resolution, hate speech detection) for different demographics. Language generation tasks often involve stochastic generation of open-ended and lengthy texts, traits that are not directly compatible with traditional algorithmic bias definitions (e.g., equalized odds, equal opportunity, demographic parity (Dwork et al., 2012; Hardt et al., 2016)).
Because of the difficulty in defining metrics, existing works define bias loosely as demographic inequality and use intermediate proxy metrics to comparatively measure bias. Examples include:
• Regard Ratio: negative-neutral-positive regard score ratios of text generated from bias-inducing prompts (Sheng et al., 2019)
• Sentiment Ratio: negative-neutral-positive sentiment score ratios of text generated from African American English (AAE) versus White-Aligned English (WAE) prompts (Groenwold et al., 2020)
• Individual and Group Fairness through Sentiment: comparisons of the sentiment distributions of generated text across demographics and prompts (Huang et al., 2020)
• Gendered Word Co-occurrence Score: mean and standard deviations of the absolute log ratio of probabilities: P(word|female terms) to P(word|male terms) across all words in generated text (Bordia and Bowman, 2019)
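To make the last proxy metric concrete, the sketch below estimates a gendered word co-occurrence score from a set of generated texts. It is a minimal illustration of the idea rather than the exact procedure of Bordia and Bowman (2019); the tiny gendered term lists, the sentence-level co-occurrence window, and the add-one smoothing are all simplifying assumptions.

```python
import math
from collections import Counter

# Simplifying assumptions: small gendered term lists, sentence-level
# co-occurrence, and add-one smoothing to avoid division by zero.
FEMALE_TERMS = {"she", "her", "woman", "women", "girl"}
MALE_TERMS = {"he", "his", "him", "man", "men", "boy"}

def gendered_cooccurrence_score(generated_texts):
    """Mean and std of |log(P(word|female terms) / P(word|male terms))|."""
    female_counts, male_counts = Counter(), Counter()
    for text in generated_texts:
        tokens = text.lower().split()
        has_f = any(t in FEMALE_TERMS for t in tokens)
        has_m = any(t in MALE_TERMS for t in tokens)
        for t in tokens:
            if t in FEMALE_TERMS or t in MALE_TERMS:
                continue
            if has_f:
                female_counts[t] += 1
            if has_m:
                male_counts[t] += 1
    vocab = set(female_counts) | set(male_counts)
    if not vocab:
        return 0.0, 0.0
    f_total = sum(female_counts.values())
    m_total = sum(male_counts.values())
    scores = []
    for w in vocab:
        p_f = (female_counts[w] + 1) / (f_total + len(vocab))
        p_m = (male_counts[w] + 1) / (m_total + len(vocab))
        scores.append(abs(math.log(p_f / p_m)))
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return mean, std  # values near 0 indicate little measured gender skew

print(gendered_cooccurrence_score([
    "he is a doctor and he is busy",
    "she is a nurse and she is busy",
]))
```

A lower mean indicates that non-gendered words co-occur similarly with both sets of gendered terms.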
There are also metrics for other bias evaluation setups in continuation generation tasks involving sentiment (Shwartz et al., 2020), the ratio of gendered words (Solaiman et al., 2019; Vig et al., 2020; Dinan et al., 2020a), and other novel metrics (Peng et al., 2020; Yeo and Chen, 2020). Studies of biases in transformation generation tasks favor metrics of accuracy in terms of successfully transforming text to have a desired property. We present a more thorough comparison of metrics in Section 5.4.

Bias metrics can also be categorized by how they define associations between demographic group attributes and text. Biases can be towards people described in text, people who produce the text, or people to whom the text is addressed (Dinan et al., 2020b). Most existing works define bias metrics through the first association—these biases are relatively easier to analyze, since both the demographic and the textual signals of bias are encapsulated within the text. There are also works that define biases towards people who produce the text (Groenwold et al., 2020) or people to whom the text is addressed (Sheng et al., 2021b), though there are relatively fewer works that study these latter associations.

3.2 Negative Impacts

Biases in NLG techniques are important to study because they can result in harmful, negative impacts. We survey detrimental representational (i.e., unfair representations of different groups) and allocational (i.e., unfair allocation of resources) impacts (Crawford, 2017; Barocas et al., 2017; Blodgett et al., 2020) used to motivate existing studies of bias in NLG tasks, finding limited examples. While representational impacts are sometimes cited, it is difficult to measure the extent of the impacts. Additionally, techniques for effective NLG are relatively new, and existing studies have limited knowledge of potential allocational impacts. Finally, biases in NLG tasks give rise to a third type of negative impacts, which we call vulnerability impacts.

Representational Impacts The works in Table 1 motivate (to varying degrees) studying biases in NLG through potential negative representational impacts, in the form of propagating stereotypes, misrepresentations, or denigrations of social groups. For example, Sheng et al. (2019) enumerate how generated text can propagate varying social perceptions of different demographics, and Prates et al. (2019) discuss how occupation-related gender biases could propagate stereotypes in translation. However, it is difficult to quantify the effects of representational impacts (Kay et al. (2015) is a rare example that explicitly studies the effect of representational impacts in image search); while such impacts may be measured indirectly (e.g., by analyzing allocational impacts), we suggest long-term, interdisciplinary collaborations to explore the direct effects of these representational impacts.

Allocational Impacts Harmful allocational impacts result from an unequal allocation of resources across groups. Since effective NLG techniques based on large Transformer models (Vaswani et al., 2017) are relatively new, most of the existing works on biases in NLG that list possible impacts only analyze direct representational consequences. A real example of a negative allocational impact is when machine translation errors lead to arrests (Ong, 2017). In general, technologies that are less effective or detrimental for certain populations become barriers that actively prevent those populations from using the technology, leading to diminished opportunities in jobs, education, health, etc. We discuss more details in Section 4.5. With continuous technological advances, more organizations will turn to effective NLG techniques, making it imperative to start setting norms to reduce harmful allocational impacts (Tamkin et al., 2021).
Vulnerability Impacts Open-domain generation tasks can amplify a group's vulnerability to manipulation and harm, which is an intermediate impact that makes a group more susceptible to representational and allocational impacts. For example, privacy-related issues (Carlini et al., 2020), misinformation (Levy et al., 2021), or radicalizing views in generated text could make a group more likely to be attributed to specific stereotypes (e.g., through action guided by misinformation) or end up with diminished opportunities (e.g., by having personal data exposed and misused). Separately identifying vulnerability impacts could help facilitate recognition of other negative impacts.

4 Contributors to NLG Biases

In a pipeline from data collection to evaluation for an NLG task, each component could propagate biases. (Task formulation and application deployment are also part of NLG task pipelines (Kiritchenko et al., 2020), though we do not focus on biases in these areas.) We emphasize the ways in which data, model architecture, decoding, evaluation, and deployment uniquely exacerbate biases in generation tasks. Additionally, we present an empirical study to show how measured biases in generated text can vary based on decoding technique.

4.1 Biases from Data

Modern NLP models often rely on large pre-trained language models, which in turn rely on a large collection of data to learn explicit and implicit associations. Several recent pre-trained language models used for NLG tasks, e.g., T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020), are trained on the largest datasets used for any models. These large models for generation are commonly trained on web data, which is known to contain biased language (e.g., Ferrer et al. (2021) discover gender, religion, and ethnic biases in Reddit communities). While preprocessing is often included to filter out malformatted data and explicitly negative content (e.g., bad words and offensive phrases), those are generally the only efforts to reduce biases and associated impacts. Furthermore, Bender et al. (2021) warn that by filtering out all words deemed "bad", we remove the discourse of marginalized populations. Paullada et al. (2020), Bender and Friedman (2018), and Gebru et al. (2018) provide more comprehensive surveys and frameworks that focus on aspects of data creation and management that could lead to biases, and we refer readers to their works for more discussion. In the context of translation, Cho et al. (2021) find that more data can increase translation fluency but may also make the system more biased.

4.2 Biases from Model Architecture

There are relatively few studies that examine model architectural properties that could lead to biases. We discuss the few efforts towards understanding model biases in NLG tasks and emphasize the need for more work to generalize findings. For autocomplete generation, Vig et al. (2020) analyze GPT-2 variants through a causal mediation analysis, finding that larger models contain more gender bias, and bias tends to be concentrated in a small number of neurons and attention heads. Silva et al. (2021) observe amplified biases in distilled versus original models. For machine translation, Costa-jussà et al. (2020) note that language-specific architectures are less biased because they encode more gender information than shared language encoder-decoder architectures. Studies like the aforementioned are useful for designing targeted bias mitigation methods (e.g., controlled generation to target specific attention heads or regularization to retain gender information). However, more evidence would be needed to generalize findings across models. (We also refer the reader to the work of Park et al. (2018), which discusses biases in NLU tasks from model components that "attend" to specific words (e.g., through attention or pooling) and could be applicable to NLG tasks as well.)
4.3 Biases from Decoding

While NLU and NLG models have structural similarities, NLG tasks uniquely use search or sampling techniques at inference time to generate text. Popular techniques include:
• Greedy Search: at each time step, choose the word with the highest probability.
• Beam Search: at each time step, keep the top b hypotheses with highest probabilities; eventually pick the hypothesis with the highest probability.
• Top-k sampling (Fan et al., 2018): at each time step, re-distribute the probability mass of the top k words with highest probabilities and sample.
• Nucleus sampling (Holtzman et al., 2019): at each time step, re-distribute the probability mass of the smallest set of words with a cumulative probability exceeding p and sample.
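To make the sampling variants concrete, the sketch below shows one way to implement top-k and nucleus filtering over a next-token distribution. It is a minimal NumPy illustration under simplifying assumptions (a single decoding step over an explicit probability vector, no temperature, naive tie handling), not the implementation used in our experiments.

```python
import numpy as np

def sample_next_token(probs, top_k=0, top_p=0.0, rng=None):
    """Sample a token id from `probs` after optional top-k / nucleus filtering.

    probs: 1-D array of next-token probabilities (sums to 1).
    top_k: if > 0, keep only the k most probable tokens.
    top_p: if > 0, keep the smallest set of tokens whose cumulative
           probability reaches p (nucleus sampling).
    """
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    keep = np.ones_like(probs, dtype=bool)

    if top_k > 0:
        kth = np.sort(probs)[-top_k]            # probability of the k-th best token
        keep &= probs >= kth
    if top_p > 0.0:
        order = np.argsort(probs)[::-1]          # most probable first
        cum = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cum, top_p)) + 1   # smallest prefix reaching p
        mask = np.zeros_like(keep)
        mask[order[:cutoff]] = True
        keep &= mask

    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()                   # re-distribute probability mass
    return rng.choice(len(probs), p=filtered)

probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
print(sample_next_token(probs, top_k=2))     # samples among ids {0, 1}; greedy would always pick 0
print(sample_next_token(probs, top_p=0.9))   # nucleus keeps ids {0, 1, 2}
```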
More constrained forms of generation such as machine translation generally use variations of beam search; however, preferred decoding techniques are more varied for open-domain generation. Despite variations in fluency and diversity between deterministic versus stochastic, search versus sampling procedures, there are limited studies (Roberts et al., 2020) on how different decoding properties affect biases in generation.

A Study on Biases from Decoding To study how decoding techniques affect biases in generation, we use existing NLG bias metrics to evaluate text generated from different decoding methods. (Code at https://github.com/ewsheng/decoding-biases.) We examine autocomplete generations from GPT, GPT-2, and XLNet, using the decoding techniques from Section 4.3. We evaluate with the following bias metrics: regard ratios (Sheng et al., 2019), sentiment ratios (Groenwold et al., 2020), individual and group fairness through sentiment scores (Huang et al., 2020), and gendered word co-occurrence scores (Bordia and Bowman, 2019) (as introduced in Section 3). More experimental details can be found in the Appendix.
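The ratio-based metrics in a setup like this can be computed with a simple aggregation loop along the lines of the sketch below. It is an illustrative outline rather than the evaluation code released with this study; `generate` and `classify_regard` are hypothetical stand-ins for a decoding routine and a regard or sentiment classifier.

```python
from collections import Counter

def ratio_evaluation(prompts_by_group, generate, classify_regard,
                     samples_per_prompt=5):
    """Return negative/neutral/positive label ratios per demographic group.

    prompts_by_group: dict mapping a demographic label to a list of prompts.
    generate(prompt): hypothetical decoding routine returning one continuation.
    classify_regard(text): hypothetical classifier returning
                           "negative", "neutral", or "positive".
    """
    ratios = {}
    for group, prompts in prompts_by_group.items():
        labels = Counter()
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                labels[classify_regard(generate(prompt))] += 1
        total = sum(labels.values())
        ratios[group] = {lab: labels[lab] / total
                         for lab in ("negative", "neutral", "positive")}
    return ratios

# A relative-score reading compares the per-group ratios directly,
# e.g., ratios for prompts about one demographic versus another.
```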
In Section 5.4, we distinguish between relative and absolute score metrics to examine evaluation differences between NLG tasks. Here, we organize our results into these categories to generalize trends about decoding techniques. The ratio-based metrics are relative score metrics, since evaluation relies on comparing ratios between demographics. The latter three metrics are absolute score metrics that have target values of zero indicating no bias.

For the relative score metrics, search and sampling techniques generate similar outcomes. An interesting result between sampling techniques for the regard metric is that nucleus sampling is less biased yet more negative than top-k sampling. For the absolute score metrics, we find that beam search is the most unbiased technique, closely followed by greedy search and then top-k and nucleus sampling. Through our study, we discover that text diversity is not accounted for in any of the bias metrics, yet diversity can be a confounding factor. Specifically, beam search is the least diverse, followed by greedy search, top-k sampling, then nucleus sampling. (We report average generated text length and vocabulary sizes to estimate diversity in Appendix Table 4.) Results indicate that the less diverse search techniques lead to better scores for individual fairness, group fairness, and gendered word co-occurrence ratios.

We hope these experimental results will encourage researchers to document sampling techniques, consider how metrics can be formulated to evaluate both bias and other factors of generation quality, and inspire more comprehensive studies. (Results are summarized in Appendix Tables 2, 3, and 5.)

4.4 Biases from Evaluation

Biases can arise from both general evaluations and bias evaluations for NLG tasks.

General Evaluations Current standards for NLG evaluation can reinforce certain types of language and penalize others. For example, using perplexity as measured by models pre-trained on datasets largely containing non-AAE text leads to an unfair evaluation of AAE text. Additionally, the subjectivity of generation tasks means that much of NLG evaluation depends on human labels. Since humans from different backgrounds are accustomed to different societal norms and linguistic variations, the choice of human annotators could drastically influence the evaluation standards for generated text.

Bias Evaluations It is difficult to evaluate societal biases in NLG tasks because NLG can be open-domain, and there are many different notions of biases from various backgrounds and cultures (Sambasivan et al., 2021). These factors lead to the use of a variety of metrics to evaluate biases (Section 3). To avoid experimental bias in evaluation, we recommend using multiple metrics to cover many types of biases at various granularities. We identify three points to emphasize the need for more comprehensive evaluations. First, most existing works on biases in generation center around one demographic dimension (often gender and from a Western perspective, e.g., using standard Western occupations). While there has been no comprehensive study on whether mitigating biases for one demographic dimension (e.g., gender) may exacerbate biases for others (e.g., race, intersectional identities), this is a possibility we must consider. Second, most works only evaluate bias through a single intermediate proxy; however, different metrics are defined at different granularities (e.g., sentiment is sentence-level, gendered word ratio is word-level). Finally, different evaluation datasets test for specific types of biases and are influenced by the backgrounds of the curators. Collectively evaluating biases across demographic dimensions and granularities can thus help reduce experimentally-biased evaluations.
4.5 Biases from Deploying Systems

In terms of deploying NLG systems, there is a feedback loop that benefits some communities and further disadvantages others. While this feedback loop is not unique to NLG systems, these systems that directly interact with users make good cautionary examples.

First, many deployed language technologies require internet access both to use and contribute feedback, thus favoring the views and languages of those privileged with this access. For example, anyone can contribute feedback to Google Translate, but if contributions and subsequent improvements are focused on high-resource languages, this further increases the accuracy gap between the high and low resource languages, diminishing opportunities for speakers of the low resource languages, i.e., representation disparity (Hashimoto et al., 2018).

Second, those who are unable to achieve their goals from using these language technologies (e.g., unsuccessful translation, unhelpful or offensive chat bot) are less likely to continue using the technology. This means that there is less feedback and data to improve the technologies, reinforcing the decreased effectiveness for certain populations, i.e., disparity amplification (Hashimoto et al., 2018).

One way we might intervene is to follow a more targeted approach for data and feedback collection, e.g., from excluded populations. However, we acknowledge that this remains a difficult task and that it is also necessary to be aware of "community goals" and other factors in order to co-design language technologies without inflicting additional harm on marginalized populations (Bird, 2020).

5 Progress, Trends, and Challenges

Following the discussion of contributors to biases, we survey trends and challenges for reducing biases in NLG.

5.1 Data Methods

Data-based methods for both bias analysis and mitigation use the general idea of counterfactual data augmentation (CDA) (Lu et al., 2020) to curate sets of counterfactual prompts. A common method for analysis is using targeted prompts to induce NLG models to reveal biases. For data-based mitigation, existing works focus on fine-tuning large models or training smaller models with datasets that are balanced with respect to targeted demographics.

Curated Datasets Existing datasets to study biases in translation include parallel sentences tagged with speaker or subject gender information (Vanmassenhove et al., 2018; Habash et al., 2019) and datasets to study gender biases when translating from neutral references of a person (e.g., nurse in English, gender-neutral pronouns) to gendered instances (e.g., enfermera or enfermero in Spanish, gendered pronouns) (Cho et al., 2019; Stanovsky et al., 2019; Gonen and Webster, 2020; Kocmi et al., 2020). Renduchintala and Williams (2021) additionally provide a dataset to study translation of neutral references in unambiguous contexts. Other works present parallel corpora of biased versus unbiased framings and presuppositions (Pryzant et al., 2020) and AAE versus WAE equivalents (Groenwold et al., 2020). Sheng et al. (2019); Huang et al. (2020); Dhamala et al. (2021) additionally curate sets of prompts that can be used to evaluate biases in autocomplete generation.

Bias Analysis Most bias analyses of NLG tasks use prompts to probe for different biases in generated text, e.g., regarding social perception (Sheng et al., 2019), gender in translation (Prates et al., 2019), names (Shwartz et al., 2020), sentiment distribution (Huang et al., 2020), dialects (Groenwold et al., 2020), dialogue personas (Sheng et al., 2021a), or other notions of similarity across demographics (Yeo and Chen, 2020; Henderson et al., 2018). Vig et al. (2020) also use prompts to investigate gender biases, though they do so in the context of a causal mediation analysis. Furthermore, Prates et al. (2019) and Farkas and Németh (2020) compare pronoun gender biases in translations (induced with prompts) to real-world statistics.

Bias Mitigation Methods can broadly be classified into two categories based on the type of data applied. The first category encompasses methods that fine-tune or train on a balanced dataset to lessen the effects of the model relying on spurious correlations between imbalanced data and task performance. CDA has been applied to datasets used for continued or fresh training in dialogue generation (Dinan et al., 2020a; Liu et al., 2020a) as well as machine translation (Saunders and Byrne, 2020; Costa-jussà and de Jorge, 2020; Stafanovičs et al., 2020). The second category is methods that attach a short prefix at training time (Vanmassenhove et al., 2018; Basta et al., 2020; Alhafni et al., 2020) or inference time (Moryossef et al., 2019).
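The sketch below illustrates both categories of data methods in their simplest form: a counterfactual data augmentation pass that swaps paired demographic terms, and a prefix-tagging pass that prepends a demographic attribute token to each source sentence. The word-pair list and the tag format are illustrative assumptions; published work uses much larger swap lists and task-specific tagging schemes.

```python
# Minimal counterfactual data augmentation (CDA): create a swapped copy of
# each sentence so the training data is balanced across paired terms.
SWAP_PAIRS = {"he": "she", "she": "he", "him": "her", "her": "him",
              "his": "her", "actor": "actress", "actress": "actor"}
# Note: mapping "her" correctly to "him" vs. "his" needs POS context; the
# naive lookup here ignores that, which real CDA pipelines have to handle.

def cda_augment(sentences):
    augmented = list(sentences)
    for s in sentences:
        swapped = " ".join(SWAP_PAIRS.get(tok, tok) for tok in s.split())
        if swapped != s:
            augmented.append(swapped)
    return augmented

# Minimal prefix tagging: attach a short attribute tag at training time so
# the model conditions on (and can later be controlled by) the attribute.
def tag_prefix(sentence, attribute):
    return f"<{attribute}> {sentence}"

print(cda_augment(["he is a doctor", "the actor won an award"]))
print(tag_prefix("I am happy", "speaker:female"))
```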
Challenges The size of state-of-the-art pre-trained models and varying definitions of biases in generation present difficulties for creating standardized datasets that are generally effective across biases and demographics. Moreover, it remains to be seen whether data-based mitigation is as effective for open-domain NLG tasks as it is for more constrained settings.

5.2 Training Methods

In addition to data-based mitigation, training-based mitigation is another popular class of methods to reduce biases in generation.

Bias Mitigation Several works that use training-based mitigation techniques rely on regularization (Bordia and Bowman, 2019; Qian et al., 2019; Huang et al., 2020; Liu et al., 2020a; Saunders and Byrne, 2020). There are also works that induce control by incorporating a bias control code through conditional training (Dinan et al., 2020a), by appending a target value to inputs during training (Ma et al., 2020), by using a normative classifier to produce reward values for backpropagation (Peng et al., 2020), or through adversarial training (Liu et al., 2020b). Other techniques include using debiased word embeddings (Escudé Font and Costa-jussà, 2019), identifying and editing out subjective words (Pryzant et al., 2020), and using Markov random fields to preserve morpho-syntactic agreement during reinflection (Zmigrod et al., 2019).
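As a rough picture of the regularization family, the sketch below adds a penalty on the gendered component of word embeddings to a standard language modeling loss. It is a simplified PyTorch sketch in the spirit of embedding-regularization approaches such as Bordia and Bowman (2019), not a faithful reproduction of any single method; the gender direction, the neutral-word set, and the weight lambda_bias are assumptions.

```python
import torch

def bias_regularizer(embedding_weight, gender_direction, neutral_ids):
    """Penalize the projection of supposedly gender-neutral word embeddings
    onto an assumed, precomputed gender direction."""
    direction = torch.nn.functional.normalize(gender_direction, dim=0)
    neutral = embedding_weight[neutral_ids]     # (num_neutral, dim)
    projections = neutral @ direction           # (num_neutral,)
    return projections.pow(2).mean()

# Usage inside a training step (lm_loss comes from the usual LM objective):
#   loss = lm_loss + lambda_bias * bias_regularizer(
#       model.embedding.weight, gender_direction, neutral_ids)

# Toy check with random tensors:
emb = torch.randn(100, 16)
direction = torch.randn(16)
print(bias_regularizer(emb, direction, torch.arange(10)))
```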
Challenges The main challenge of bias mitigation through training methods is that it is costly and impractical to re-train models for new biases encountered. In fact, most of the techniques that rely on training from scratch use smaller architectures (exceptions are from larger institutions).

5.3 Inference Methods

While the existing literature on inference-time methods for bias mitigation is sparse, decoding-based methods are a promising alternative to data- and training-based methods. Specifically, these methods are compatible with any pre-trained language model for generation without additional training. Given recent development of inference-time methods for control that can reduce toxicity (e.g., PPLM (Dathathri et al., 2019), GeDi (Krause et al., 2020), DExperts (Liu et al., 2021)), there is potential for extending these methods to bias mitigation.

Bias Mitigation For autocomplete and dialogue generation, Sheng et al. (2020) formulate bias triggers using gradient-based methods of Wallace et al. (2019). These triggers are appended to prompts during inference time to control text generation to be more equalized towards different demographics. For translation, Saunders and Byrne (2020) present a lattice rescoring procedure that creates gender-inflected search spaces to rescore text for more accurate translations, and Saunders et al. (2021) subsequently use this lattice structure to present more gendered options during beam search and rerank translation hypotheses according to gender criteria. For dialogue generation, Sheng et al. (2021b) introduce a constrained decoding method that uses n-gram similarity to guide generation away from ad hominems towards marginalized groups. For autocomplete generation, Schick et al. (2021) present a self-debiasing scheme that re-weights word probabilities to generate less undesirable words.
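To give a flavor of this family of methods, the sketch below down-weights a set of undesirable candidate tokens in the next-token distribution before sampling. It is a deliberately generic illustration of decoding-time re-weighting, not the specific procedure of Schick et al. (2021) or the trigger- and lattice-based methods above; the undesirable-token scores and the scaling rule are assumptions.

```python
import numpy as np

def reweight_next_token_probs(probs, undesirable_scores, strength=5.0):
    """Scale down tokens flagged as undesirable, then renormalize.

    probs: 1-D array of next-token probabilities.
    undesirable_scores: dict {token_id: score in [0, 1]} from some external
                        signal (e.g., a classifier or word list); assumed here.
    strength: how aggressively flagged tokens are suppressed.
    """
    adjusted = np.asarray(probs, dtype=float).copy()
    for token_id, score in undesirable_scores.items():
        adjusted[token_id] *= np.exp(-strength * score)
    return adjusted / adjusted.sum()

probs = np.array([0.4, 0.3, 0.2, 0.1])
print(reweight_next_token_probs(probs, {1: 1.0, 3: 0.5}))
```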
Challenges Control methods at inference time could potentially steer the model into degenerate spaces, so it is important to also evaluate these methods for coherence, fluency, and task relevance.

5.4 Evaluation Methods

There are two types of evaluations: those that rely on absolute scores and those that rely on relative scores. Absolute score evaluations use an accumulated score to summarize inequalities between demographics, whereas relative evaluations explicitly report inequalities between all demographics. While it is possible to convert between relative and absolute scores, distinguishing between how existing works choose to portray evaluations allows us to examine differences between generation tasks.
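The distinction can be made concrete with a small helper: given one proxy score per demographic group, a relative evaluation reports the full per-group breakdown, while an absolute evaluation collapses it into a single inequality summary. The summary statistic used below (maximum pairwise gap) is only one possible choice and is an assumption of this sketch.

```python
def relative_view(scores_by_group):
    """Report every group's score so inequalities stay visible."""
    return dict(sorted(scores_by_group.items(), key=lambda kv: kv[1]))

def absolute_view(scores_by_group):
    """Collapse the comparison into one number (here: max pairwise gap)."""
    values = list(scores_by_group.values())
    return max(values) - min(values)   # 0 indicates no measured inequality

scores = {"group A": 0.62, "group B": 0.55, "group C": 0.71}
print(relative_view(scores))   # keeps per-group detail
print(absolute_view(scores))   # single accumulated score (max gap)
```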
Absolute Evaluations We find that the transformation class of generation tasks favors bias evaluation through absolute metrics, which is possible because these tasks involve relatively more constrained forms of generation. Examples of evaluation objectives through absolute scores include Peng et al. (2020) reducing non-normative generations, Ma et al. (2020) increasing the accuracy of the change in agency, Zmigrod et al. (2019) increasing the number of correct inflections, Huang et al. (2020) reducing individual and group fairness scores, and Sheng et al. (2021b) reducing the amount of ad hominems towards marginalized groups. Studies of gender bias in machine translation are well-suited to evaluations using absolute scores: many use BLEU and its variants to evaluate correct gender inflections and translations (Moryossef et al., 2019; Escudé Font and Costa-jussà, 2019; Elaraby et al., 2018; Habash et al., 2019; Alhafni et al., 2020) or accuracy on WinoMT (Saunders and Byrne, 2020; Saunders et al., 2020; Kocmi et al., 2020; Costa-jussà and de Jorge, 2020; Costa-jussà et al., 2020; Basta et al., 2020; Choubey et al., 2021; Saunders et al., 2021).

Relative Evaluations In terms of evaluation through relative scores, examples from existing works are mainly from continuation generation tasks. We infer that the less constrained, open-domain nature of continuation generation tasks makes it more preferable to evaluate mitigation through more flexible comparisons rather than absolute scores. For autocomplete generation, Sheng et al. (2019, 2020) and Groenwold et al. (2020) compare regard or sentiment scores across demographics, Shwartz et al. (2020) compare names across various intermediate metrics, Vig et al. (2020) measure proportional differences between the amount of bias under a gendered versus ambiguous reading, and Yeo and Chen (2020) compare occupations generated for different genders. Bias studies in dialogue generation use relative scores by comparing sentiment and offensive language discrepancies (Henderson et al., 2018; Liu et al., 2020a,b) and the percentage of gendered words (Dinan et al., 2020a).

Challenges A trade-off between framing biases as a relative or absolute metric is that relative metrics can be more flexibly aligned to normative concerns like social perception. Absolute metrics that look for ratios of gendered words or other indicator words assume that there is a set of words that captures all the differences between demographic groups, regardless of whether these differences are related to normative definitions of harm. There are also absolute metrics such as those of Huang et al. (2020) that can incorporate intermediate metrics that are more aligned with normative behavior, though these metrics reduce the notion of biases to a single value, which could erase historical inequalities between groups.

6 Open Problems and Proposals

As a fairly nascent area of exploration, the study of biases in language generation still poses many challenges. Throughout this paper, we discuss challenges associated with different components in a generation pipeline. With a heightened awareness of the relevant body of work, we conclude with recommendations for open problems.

Bias-Aware Data Curation Many works have highlighted the harms and problems when collecting training datasets with limited awareness of potential harms. Since effective models for NLG tasks are correlated with increasing training data sizes, biases in data collection (e.g., English-centric, drawn from popular Western media) remain a major contributor of biases that manifest in generation. Additionally, datasets used to study biases in generation can also be limited (e.g., only for binary gender classes). For more bias-aware data curation, we suggest diversifying datasets to include more viewpoints from various groups.

Understanding Trade-Offs Different methods for analysis, mitigation, and evaluation have unique trade-offs. Existing works have been relatively small-scale and limited to a small number of biases for specific tasks. Some useful questions to consider when developing methods to study generation biases are whether we can generalize methods to a diverse set of biases and a wide range of contexts. It is also important to consider formulating metrics that would jointly mitigate biases and preserve other desired text qualities (e.g., diversity, fluency).

Interactive and Continuous Learning The difficulties of measuring and mitigating biases in generation can be reduced with a general framework for interactive and continuous learning. Over time, such a system could learn from diverse opinions of what constitutes "fair" versus "unfair" generations across tasks. A unified framework would centralize and highlight the importance of studying biases in generation, as well as fuel the development of a more comprehensive set of evaluations that may be useful for large-scale studies of impact.

Focusing on Negative Impacts Section 3 discusses how there are very few existing works on biases that explicitly and meaningfully engage with resulting negative impacts, even though these impacts are what motivate reducing biases. By reframing efforts on reducing negative impacts rather than biases, we may be able to define metrics and progress that better correlate with reducing harm. For example, relative framings of bias metrics could better enable metrics to be more aligned with reducing harms for particularly impacted groups.

Acknowledgments

We would like to thank Seraphina Goldfarb-Tarrant, Sunipa Dev, Jason Teoh, members of the Plus Lab, and our anonymous reviewers for the many helpful suggestions that went into this paper.
Ethics and Broader Implications

In this work, we present a survey and commentary on the progress and challenges for studying societal biases in language generation.

Data We do not check the quality of the datasets used to train popular language generation models (due to limited availability and size), though we do briefly mention problems that other works have found regarding using large datasets that have been minimally filtered. Some of the surveyed datasets and metrics that are used for evaluating biases approximate binary genders using names typical of specific genders, and may be better re-formulated to avoid harms and curate a more accurate representation of different genders. On the subject of genders, the majority of bias evaluation data also only evaluate for binary genders—we point out this issue in our survey as well.

Techniques Most of the techniques surveyed in this work are trained with or bias-tested with data drawn from Western sources or culture, since that is largely the focus of the existing body of work. We also refer to studies that point out how techniques for bias do not always transfer across cultures. Our decoding experiments could potentially fuel misuse by giving those with adversarial interests a better understanding of how decoding algorithms could thwart bias metrics, though we believe transparency around these results outweighs the potential for misuse.

References

Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent anti-muslim bias in large language models. arXiv preprint arXiv:2101.05783.

Bashar Alhafni, Nizar Habash, and Houda Bouamor. 2020. Gender-aware reinflection using linguistically enhanced neural models. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 139–150, Barcelona, Spain (Online). Association for Computational Linguistics.

Solon Barocas, Kate Crawford, Aaron Shapiro, and Hanna Wallach. 2017. The problem with bias: Allocative versus representational harms in machine learning. In 9th Annual Conference of the Special Interest Group for Computing, Information and Society.

Christine Basta, Marta R. Costa-jussà, and José A. R. Fonollosa. 2020. Towards mitigating gender bias in a decoder-based neural machine translation model by adding contextual information. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 99–102, Seattle, USA. Association for Computational Linguistics.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? Proceedings of FAccT.

Steven Bird. 2020. Decolonising speech and language technology. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3504–3519, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.

Shikha Bordia and Samuel R. Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 7–15, Minneapolis, Minnesota. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2020. Extracting training data from large language models. arXiv preprint arXiv:2012.07805.

L. Elisa Celis and Vijay Keswani. 2020. Dialect diversity in text summarization on twitter. arXiv preprint arXiv:2007.07860.

Amanda Cercas Curry, Judy Robertson, and Verena Rieser. 2020. Conversational assistants and gender stereotypes: Public perceptions and desiderata for voice personas. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 72–78, Barcelona, Spain (Online). Association for Computational Linguistics.

Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. 2020. Distilling knowledge learned in BERT for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7893–7905, Online. Association for Computational Linguistics.
Won Ik Cho, Ji Won Kim, Seok Min Kim, and Nam Soo Kim. 2019. On measuring gender bias in translation of gender-neutral pronouns. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 173–181, Florence, Italy. Association for Computational Linguistics.

Won Ik Cho, Jiwon Kim, Jaeyeong Yang, and Nam Soo Kim. 2021. Towards cross-lingual generalization of translation gender bias. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 449–457.

Prafulla Kumar Choubey, Anna Currey, Prashant Mathur, and Georgiana Dinu. 2021. Improving gender translation accuracy with filtered self-training. arXiv preprint arXiv:2104.07695.

Marta R. Costa-jussà, Carlos Escolano, Christine Basta, Javier Ferrando, Roser Batlle, and Ksenia Kharitonova. 2020. Gender bias in multilingual neural machine translation: The architecture matters. arXiv preprint arXiv:2012.13176.

Marta R. Costa-jussà and Adrià de Jorge. 2020. Fine-tuning neural machine translation on gender-balanced datasets. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 26–34, Barcelona, Spain (Online). Association for Computational Linguistics.

Kate Crawford. 2017. The trouble with bias. Keynote at NeurIPS.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. BOLD: Dataset and metrics for measuring biases in open-ended language generation. Proceedings of FAccT.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2020a. Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online. Association for Computational Linguistics.

Emily Dinan, Angela Fan, Ledell Wu, Jason Weston, Douwe Kiela, and Adina Williams. 2020b. Multi-dimensional gender bias classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 314–331, Online. Association for Computational Linguistics.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226.

Mostafa Elaraby, Ahmed Y. Tawfik, Mahmoud Khaled, Hany Hassan, and Aly Osama. 2018. Gender aware spoken language translation applied to English-Arabic. In 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), pages 1–6. IEEE.

Joel Escudé Font and Marta R. Costa-jussà. 2019. Equalizing gender bias in neural machine translation with word embeddings techniques. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 147–154, Florence, Italy. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Anna Farkas and Renáta Németh. 2020. How to measure gender bias in machine translation: Optimal translators, multiple reference points. arXiv preprint arXiv:2011.06445.

Xavier Ferrer, Tom van Nuenen, Jose M. Such, and Natalia Criado. 2021. Discovering and categorising language biases in Reddit. In Proceedings of the International AAAI Conference on Web and Social Media, volume 15.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010.

Hila Gonen and Kellie Webster. 2020. Automatically identifying gender issues in machine translation using perturbations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1991–1995, Online. Association for Computational Linguistics.
Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, and William Yang Wang. 2020. Investigating African-American Vernacular English in transformer-based text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5877–5883, Online. Association for Computational Linguistics.

Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155–165, Florence, Italy. Association for Computational Linguistics.

Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323.

Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. 2018. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938. PMLR.

Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve Fried, Ryan Lowe, and Joelle Pineau. 2018. Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 123–129.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.

Dirk Hovy, Federico Bianchi, and Tommaso Fornaciari. 2020. "You sound just like your father": Commercial machine translation systems include stylistic biases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1686–1690, Online. Association for Computational Linguistics.

Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. 2020. Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83, Online. Association for Computational Linguistics.

Clayton Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, volume 8.

Matthew Kay, Cynthia Matuszek, and Sean A. Munson. 2015. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3819–3828.

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53.

Svetlana Kiritchenko, Isar Nejadgholi, and Kathleen C. Fraser. 2020. Confronting abusive language online: A survey from the ethical and human rights perspective. arXiv preprint arXiv:2012.12305.

Hannah Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo Volpin, Frederic A. Dreyer, Aleksandar Shtedritski, and Yuki M. Asano. 2021. How true is GPT-2? An empirical analysis of intersectional occupational biases. arXiv preprint arXiv:2102.04130.

Tom Kocmi, Tomasz Limisiewicz, and Gabriel Stanovsky. 2020. Gender coreference and bias evaluation at WMT 2020. In Proceedings of the Fifth Conference on Machine Translation, pages 357–364, Online. Association for Computational Linguistics.

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2020. GeDi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367.

Sharon Levy, Michael Saxon, and William Yang Wang. 2021. The truth is out there: Investigating conspiracy theories in text generation. In Findings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.

Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. On-the-fly controlled text generation with experts and anti-experts. In The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.

Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. 2020a. Does gender matter? Towards fairness in dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4403–4416, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Haochen Liu, Wentao Wang, Yiqi Wang, Hui Liu, Zitao Liu, and Jiliang Tang. 2020b. Mitigating gender bias for neural dialogue generation with adversarial learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 893–903, Online. Association for Computational Linguistics.