Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation
Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation Chong Zhang Jieyu Zhao Huan Zhang Kai-Wei Chang Cho-Jui Hsieh Department of Computer Science, UCLA {chongz, jyzhao, kwchang, chohsieh}@cs.ucla.edu, huan@huan-zhang.com Abstract Xtest Robustness and counterfactual bias are usually x0 ="a deep and meaningful film (movie)." evaluated on a test dataset. However, are 99% positive (99% positive) these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results keep the same? In this paper, we propose a perturb “double perturbation” framework to uncover model weaknesses beyond the test dataset. x̃0 ="a short and moving film (movie)." The framework first perturbs the test dataset 73% positive (70% negative) to construct abundant natural sentences similar to the test data, and then diagnoses Figure 1: A vulnerable example beyond the test dataset. the prediction change regarding a single-word Numbers on the bottom right are the sentiment predic- substitution. We apply this framework to tions for film and movie. Blue x0 comes from the study two perturbation-based approaches test dataset and its prediction cannot be altered by the that are used to analyze models’ robustness substitution film → movie (robust). Yellow exam- and counterfactual bias in English. (1) For ple x̃0 is slightly perturbed but remains natural. Its pre- robustness, we focus on synonym substitu- diction can be altered by the substitution (vulnerable). tions and identify vulnerable examples where prediction can be altered. Our proposed attack attains high success rates (96.0%–99.8%) In most studies, model robustness is evaluated in finding vulnerable examples on both based on a given test dataset or synthetic sentences original and robustly trained CNNs and constructed from templates (Ribeiro et al., 2020). Transformers. (2) For counterfactual bias, Specifically, the robustness of a model is often eval- we focus on substituting demographic tokens uated by the ratio of test examples where the model (e.g., gender, race) and measure the shift of prediction cannot be altered by semantic-invariant the expected prediction among constructed sentences. Our method is able to reveal perturbation. We refer to this type of evaluations the hidden model biases not directly shown as the first-order robustness evaluation. However, in the test dataset. Our code is available even if a model is first-order robust on an input sen- at https://github.com/chong-z/ tence x0 , it is possible that the model is not robust nlp-second-order-attack. on a natural sentence x̃0 that is slightly modified from x0 . In that case, adversarial examples still 1 Introduction exist even if first-order attacks cannot find any of Recent studies show that NLP models are vulner- them from the given test dataset. Throughout this able to adversarial perturbations. A seemingly paper, we call x̃0 a vulnerable example. The ex- “invariance transformation” (a.k.a. adversarial per- istence of such examples exposes weaknesses in turbation) such as synonym substitutions (Alzantot models’ understanding and presents challenges for et al., 2018; Zang et al., 2020) or syntax-guided model deployment. Fig. 1 illustrates an example. paraphrasing (Iyyer et al., 2018; Huang and Chang, In this paper, we propose the double perturba- 2021) can alter the prediction. To mitigate the tion framework for evaluating a stronger notion model vulnerability, robust training methods have of second-order robustness. 
been proposed and shown effective (Miyato et al., 2017; Jia et al., 2019; Huang et al., 2019; Zhou et al., 2020). Given a test dataset, we consider a model to be second-order robust if there is no vulnerable example that can be identified in the neighborhood of given test instances
(§2.2). In particular, our framework first perturbs positive negative the test set to construct the neighborhood, and then diagnoses the robustness regarding a single-word x̃1 synonym substitution. Taking Fig. 2 as an example, x̃00 the model is first-order robust on the input sentence x0 (the prediction cannot be altered), but it is not x̃0 second-order robust due to the existence of the vul- x0 nerable example x̃0 . Our framework is designed to identify x̃0 . Figure 2: An illustration of the decision boundary. Dia- We apply the proposed framework and quantify mond area denotes invariance transformations. Blue x0 second-order robustness through two second-order is a robust input example (the entire diamond is green). attacks (§3). We experiment with English senti- Yellow x̃0 is a vulnerable example in the neighborhood of x0 . Red x̃00 is an adversarial example to x̃0 . Note: ment classification on the SST-2 dataset (Socher x̃00 is not an adversarial example to x0 since they have et al., 2013) across various model architectures. different meanings to human (outside the diamond). Surprisingly, although robustly trained CNN (Jia et al., 2019) and Transformer (Xu et al., 2020) can achieve high robustness under strong at- order robustness and reveal the models’ vulnerabil- tacks (Alzantot et al., 2018; Garg and Ramakr- ities that cannot be identified by previous attacks. ishnan, 2020) (23.0%–71.6% success rates), for (3) We propose a counterfactual bias evaluation around 96.0% of the test examples our attacks can method to reveal the hidden model bias based on find a vulnerable example by perturbing 1.3 words our double perturbation framework. on average. This finding indicates that these ro- bustly trained models, despite being first-order ro- 2 The Double Perturbation Framework bust, are not second-order robust. In this section, we describe the double perturbation Furthermore, we extend the double perturbation framework which focuses on identifying vulnerable framework to evaluate counterfactual biases (Kus- examples within a small neighborhood of the test ner et al., 2017) (§4) in English. When the test dataset. The framework consists of a neighborhood dataset is small, our framework can help improve perturbation and a word substitution. We start with the evaluation robustness by revealing the hidden defining word substitutions. biases not directly shown in the test dataset. In- tuitively, a fair model should make the same pre- 2.1 Existing Word Substitution Strategy diction for nearly identical examples referencing We focus our study on word-level substitution, different groups (Garg et al., 2019) with different where existing works evaluate robustness and coun- protected attributes (e.g., gender, race). In our eval- terfactual bias by directly perturbing the test dataset. uation, we consider a model biased if substituting For instance, adversarial attacks alter the prediction tokens associated with protected attributes changes by making synonym substitutions, and the fairness the expected prediction, which is the average pre- literature evaluates counterfactual fairness by sub- diction among all examples within the neighbor- stituting protected tokens. We integrate the word hood. For instance, a toxicity classifier is biased substitution strategy into our framework as the com- if it tends to increase the toxicity if we substitute ponent for evaluating robustness and fairness. straight → gay in an input sentence (Dixon For simplicity, we consider a single-word substi- et al., 2018). 
In the experiments, we evaluate the ex- tution and denote it with the operator ⊕. Let X ⊆ pected sentiment predictions on pairs of protected V l be the input space where V is the vocabulary and tokens (e.g., (he, she), (gay, straight)), and l is the sentence length, p = (p(1) , p(2) ) ∈ V 2 be demonstrate that our method is able to reveal the a pair of synonyms (called patch words), Xp ⊆ X hidden model biases. denotes sentences with a single occurrence of p(1) Our main contributions are: (1) We propose the (for simplicity we skip other sentences), x0 ∈ Xp double perturbation framework to diagnose the ro- be an input sentence, then x0 ⊕ p means “substitute bustness of existing robustness and fairness evalu- p(1) → p(2) in x0 ”. The result after substitution is: ation methods. (2) We propose two second-order attacks to quantify the stronger notion of second- x00 = x0 ⊕ p. 3900
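To make the ⊕ notation concrete, a minimal Python sketch of this single-word patch operation is shown below; the whitespace tokenization and the helper name apply_patch are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the patch operation x0' = x0 (+) p for a whitespace-
# tokenized sentence: replace the single occurrence of p[0] with p[1].
# The function name and the tokenization are illustrative assumptions.
def apply_patch(x0: str, p: tuple) -> str:
    src, dst = p
    tokens = x0.split()
    assert tokens.count(src) == 1, "x0 is assumed to contain p[0] exactly once"
    return " ".join(dst if tok == src else tok for tok in tokens)

# Fig. 1 example with p = (film, movie)
print(apply_patch("a deep and meaningful film .", ("film", "movie")))
# -> "a deep and meaningful movie ."
```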
Taking Fig. 1 as an example, where p = (film, Algorithm 1: Neighborhood construction movie) and x0 = a deep and meaning- Data: Input sentence x0 , masked language model ful film, the perturbed sentence is x00 = a LM, max distance k. 1 Function Neighbork (x0 ): deep and meaningful movie. Now we in- 2 if k = 0 then return {x0 } ; troduce other components in our framework. 3 if k ≥ 2 thenS 4 return x̃∈Neighbor (x0 ) Neighbork−1 (x̃); 1 5 Xneighbor ← ∅; 2.2 Proposed Neighborhood Perturbation 6 for i ← 0, . . . , len(x0 ) − 1 do 7 T, L ← LM.fillmask(x0 , i); . Mask ith token and return candidate Instead of applying the aforementioned word sub- tokens and corresponding logits. stitutions directly to the original test dataset, our 8 L ← SortDecreasing(L); 9 lmin ← max{L(κ) , L(0) − δ}; framework perturbs the test dataset within a small . L(i) denotes the ith element. We neighborhood to construct similar natural sen- empirically set κ ← 20 and δ ← 3. tences. This is to identify vulnerable examples 10 Tnew ← {t | l > lmin , (t, l) ∈ T × L}; (i) with respect to the model. Note that examples in 11 Xnew ← {x0 | x0 ← t, t ∈ Tnew }; . Construct new sentences by the neighborhood are not required to have the same replacing the ith token. meaning as the original example, since we only 12 Xneighbor ← Xneighbor ∪ Xnew ; study the prediction difference caused by applying 13 return Xneighbor ; synonym substitution p (§2.1). Constraints on the neighborhood. We limit the 3 Evaluating Second-Order Robustness neighborhood sentences within a small `0 norm ball (regarding the test instance) to ensure syntactic With the proposed double perturbation framework, similarity, and empirically ensure the naturalness we design two black-box attacks1 to identify vul- through a language model. The neighborhood of nerable examples within the neighborhood of the an input sentence x0 ∈ X is: test set. We aim at evaluating the robustness for inputs beyond the test set. Neighbork (x0 ) ⊆ Ballk (x0 ) ∩ Xnatural , (1) 3.1 Previous First-Order Attacks Adversarial attacks search for small and invariant where Ballk (x0 ) = {x | kx − x0 k0 ≤ k, x ∈ X } perturbations on the model input that can alter the is the `0 norm ball around x0 (i.e., at most k differ- prediction. To simplify the discussion, in the fol- ent tokens), and Xnatural denotes natural sentences lowing, we take a binary classifier f (x) : X → that satisfy a certain language model score which {0, 1} as an example to describe our framework. will be discussed next. Let x0 be the sentence from the test set with label Construction with masked language model. y0 , then the smallest perturbation δ ∗ under `0 norm We construct neighborhood sentences from x0 by distance is:2 substituting at most k tokens. As shown in Al- δ ∗ := argmin kδk0 s.t. f (x0 ⊕ δ) 6= y0 . gorithm 1, the construction employs a recursive δ approach and replaces one token at a time. For each recursion, the algorithm first masks each to- Here δ = p1 ⊕ · · · ⊕ pl denotes a series of substi- ken of the input sentence (may be the original x0 or tutions. In contrast, our second-order attacks fix the x̃ from last recursion) separately and predicts δ = p and search for the vulnerable x0 . likely replacements with a masked language model (e.g., DistilBERT, Sanh et al. 2019). To ensure the 3.2 Proposed Second-Order Attacks naturalness, we keep the top 20 tokens for each Second-order attacks study the prediction differ- mask with the largest logit (subject to a threshold, ence caused by applying p. 
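As a rough illustration of Algorithm 1, the sketch below constructs the neighborhood with a HuggingFace fill-mask pipeline (DistilBERT). It keeps the pipeline's top-20 candidates by softmax score instead of applying the logit threshold of Line 9, and it tokenizes by whitespace, so it approximates rather than reproduces the released implementation.

```python
# Rough sketch of Algorithm 1 (Neighbor_k) using a fill-mask pipeline.
# Assumptions: candidates are kept by the pipeline's softmax score rather
# than a raw-logit threshold, and sentences are tokenized by whitespace.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

def neighbor(x0: str, k: int, top_k: int = 20) -> set:
    """Sentences within l0 distance k of x0 (x0 itself is returned at k=0)."""
    if k == 0:
        return {x0}
    if k >= 2:
        # Recursive case: expand the distance-1 neighborhood k-1 more times.
        out = set()
        for x in neighbor(x0, 1, top_k):
            out |= neighbor(x, k - 1, top_k)
        return out
    tokens = x0.split()
    neighborhood = set()
    for i in range(len(tokens)):
        # Mask the i-th token and query the masked language model.
        masked = " ".join(tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:])
        for cand in fill_mask(masked, top_k=top_k):
            new_tokens = tokens[:i] + [cand["token_str"]] + tokens[i + 1:]
            neighborhood.add(" ".join(new_tokens))
    return neighborhood
```

Enumerating this set directly is what SO-Enum does, which is why the paper restricts SO-Enum to small k (e.g., k ≤ 2).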
For notation conve- Line 9). Then, the algorithm constructs neighbor- nience we define the prediction difference F (x; p) : hood sentences by replacing the mask with found 1 tokens. We use the notation x̃ in the following sec- Black-box attacks only observe the model outputs and do not know the model parameters or the gradient. tions to denote the constructed sentences within the 2 For simplicity, we use `0 norm distance to measure the neighborhood. similarity, but other distance metrics can be applied. 3901
Find p for x0 . Find vulnerable example p alters the prediction. through beam search. x0 = a deep and meaningful film. x (i = 1) fsoft (x) x̃0 ="a short and moving film (movie)." a deep and disturbing film (movie). .990 (.989) a deep and moving film (movie). .999 (.999) 73% positive (70% negative) a dramatic and meaningful film (movie). .999 (.999) ··· Synonyms film, movie p = film, movie x (i = 2) fsoft (x) story, tale a short and moving film (movie). .730 (.303) fool, silly a slow and moving film (movie). .519 (.151) ··· a dramatic or meaningful film (movie). .487 (.168) ··· Figure 3: The attack flow for SO-Beam (Algorithm 2). Blue x0 is the input sentence and yellow x̃0 is our con- structed vulnerable example (the prediction can be altered by substituting film → movie). Green boxes in the middle show intermediate sentences, and fsoft (x) denotes the probability outputs for film and movie. X × V 2 → {−1, 0, 1} by:3 p), respectively. Minimizing Eq. (4) effectively leads to fmin → 0 and fmax → 1, and we use a F (x; p) := f (x ⊕ p) − f (x). (2) beam search to find the best x. At each iteration, we construct sentences through Neighbor1 (x) and Taking Fig. 1 as an example, the prediction differ- only keep the top 20 sentences with the smallest ence for x̃0 on p is F (x̃0 ; p) = f (...moving L(x; p). We run at most k iterations, and stop movie.) − f (...moving film.) = −1. earlier if we find a vulnerable example. We provide Given an input sentence x0 , we want to find the detailed implementation in Algorithm 2 and a patch words p and a vulnerable example x̃0 such flowchart in Fig. 3. that f (x̃0 ⊕ p) 6= f (x̃0 ). Follow Alzantot et al. (2018), we choose p from a predefined list of Algorithm 2: Beam-search attack (SO- counter-fitted synonyms (Mrkšić et al., 2016) that Beam) maximizes |fsoft (p(2) ) − fsoft (p(1) )|. Here fsoft (x) : Data: Input sentence x0 , synonyms P, model X → [0, 1] denotes probability output (e.g., af- functions F and fsoft , loss L, max distance k. ter the softmax layer but before the final argmax), 1 Function SO-Beamk (x0 ): fsoft (p(1) ) and fsoft (p(2) ) denote the predictions for 2 p ← argmax |fsoft (p(2) ) − fsoft (p(1) )|; p∈P s.t. x0 ∈Xp the single word, and we enumerate through all pos- 3 Xbeam ← {x0 }; sible p for x0 . Let k be the neighborhood distance, 4 for i ← 1, . . .S, k do 5 Xnew ← x̃∈Xbeam Neighbor1 (x̃); then the attack is equivalent to solving: 6 x̃0 ← argmaxx∈Xnew |F (x; p)|; 7 if F (x̃0 ; p) 6= 0 then return x̃0 ; x̃0 = argmax |F (x; p)|. (3) 8 Xnew ← SortIncreasing(Xnew , L); x∈Neighbork (x0 ) (0) (β−1) 9 Xbeam ← {Xnew , . . . , Xnew }; . Keep the best beam. We set β ← 20. Brute-force attack (SO-Enum). A naive ap- 10 return None; proach for solving Eq. (3) is to enumerate through Neighbork (x0 ). The enumeration finds the small- est perturbation, but is only applicable for small k 3.3 Experimental Results (e.g., k ≤ 2) given the exponential complexity. Beam-search attack (SO-Beam). The efficiency In this section, we evaluate the second-order ro- can be improved by utilizing the probability output, bustness of existing models and show the quality where we solve Eq. (3) by minimizing the cross- of our constructed vulnerable examples. 
entropy loss with regard to x ∈ Neighbork (x0 ): 3.3.1 Setup L(x; p) := − log(1 − fmin ) − log(fmax ), We follow the setup from the robust training lit- (4) erature (Jia et al., 2019; Xu et al., 2020) and ex- where fmin and fmax are the smaller and the larger periment with both the base (non-robust) and ro- output probability between fsoft (x) and fsoft (x ⊕ bustly trained models. We train the binary senti- 3 We assume a binary classification task, but our framework ment classifiers on the SST-2 dataset with bag-of- is general and can be extended to multi-class classification. words (BoW), CNN, LSTM, and attention-based 3902
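Putting Eq. (3)-(4) and Algorithm 2 together, a compact sketch of the beam search is given below, reusing the neighbor() helper from the earlier sketch. Here f_soft is assumed to be any user-supplied callable returning the positive-class probability of a sentence, the hard prediction is taken by thresholding at 0.5, and the synonym handling is simplified, so this is an illustration rather than the released SO-Beam attack.

```python
import math

# Rough sketch of SO-Beam (Algorithm 2). `f_soft(x)` is assumed to be a
# user-supplied function returning the positive-class probability of a
# sentence, `synonyms` a list of counter-fitted (w1, w2) pairs, and
# `neighbor` the distance-1 construction from the earlier sketch.
def so_beam(x0, f_soft, synonyms, k=6, beam_width=20):
    def patch(x, p):                       # x (+) p: substitute p[0] -> p[1]
        return " ".join(p[1] if t == p[0] else t for t in x.split())

    def loss(x, p):                        # Eq. (4): push f(x) and f(x (+) p) apart
        a, b = f_soft(x), f_soft(patch(x, p))
        f_min, f_max = min(a, b), max(a, b)
        return -math.log(1 - f_min + 1e-12) - math.log(f_max + 1e-12)

    # Line 2: pick the patch words with the largest single-word prediction gap.
    usable = [q for q in synonyms if x0.split().count(q[0]) == 1]
    if not usable:
        return None
    p = max(usable, key=lambda q: abs(f_soft(q[1]) - f_soft(q[0])))

    beam = {x0}
    for _ in range(k):
        candidates = set()
        for x in beam:
            candidates |= neighbor(x, 1)
        for x in candidates:               # success: hard predictions disagree
            if round(f_soft(patch(x, p))) != round(f_soft(x)):
                return x, p
        beam = set(sorted(candidates, key=lambda x: loss(x, p))[:beam_width])
    return None
```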
Original: 70% Negative Input Example: in its best moments , resembles a bad high school production of grease , without benefit of song . Genetic: 56% Positive Adversarial Example: in its best moment , recalling a naughty high school production of lubrication , unless benefit of song . BAE: 56% Positive Adversarial Example: in its best moments , resembles a great high school production of grease , without benefit of song . SO-Enum and SO-Beam (ours): 60% Negative (67% Positive) Vulnerable Example: in its best moments , resembles a bad (unhealthy) high school production of musicals , without benefit of song . Table 1: Sampled attack results on the robust BoW. For Genetic and BAE the goal is to find an adversarial example that alters the original prediction, whereas for SO-Enum and SO-Beam the goal is to find a vulnerable example beyond the test set such that the prediction can be altered by substituting bad → unhealthy. models. timization algorithm that generates both syntacti- Base models. For BoW, CNN, and LSTM, all cally and semantically similar adversarial exam- models use pre-trained GloVe embeddings (Pen- ples, by replacing words within the list of counter- nington et al., 2014), and have one hidden layer fitted synonyms. (2) The BAE attack (Garg and of the corresponding type with 100 hidden size. Ramakrishnan, 2020) generates coherent adversar- Similar to the baseline performance reported in ial examples by masking and replacing words using GLUE (Wang et al., 2019), our trained models BERT. For both methods we use the implementa- have an evaluation accuracy of 81.4%, 82.5%, and tion provided by TextAttack (Morris et al., 2020). 81.7%, respectively. For attention-based models, we train a 3-layer Transformer (the largest size in Attack success rate (second-order). We also Shi et al. 2020) and fine-tune a pre-trained bert- quantify second-order robustness through attack base-uncased from HuggingFace (Wolf et al., success rate, which measures the ratio of test ex- 2020). The Transformer uses 4 attention heads amples that a vulnerable example can be found. and 64 hidden size, and obtains 82.1% accuracy. To evaluate the impact of neighborhood size, we The BERT-base uses the default configuration and experiment with two configurations: (1) For the obtains 92.7% accuracy. small neighborhood (k = 2), we use SO-Enum Robust models (first-order). With the same that finds the most similar vulnerable example. (2) setup as base models, we apply robust training For the large neighborhood (k = 6), SO-Enum is methods to improve the resistance to word substitu- not applicable and we use SO-Beam to find vul- tion attacks. Jia et al. (2019) provide a provably ro- nerable examples. We consider the most challeng- bust training method through Interval Bound Prop- ing setup and use patch words p from the same agation (IBP, Dvijotham et al. 2018) for all word set of counter-fitted synonyms as robust models substitutions on BoW, CNN and LSTM. Xu et al. (they are provably robust to these synonyms on (2020) provide a provably robust training method the test set). We also provide a random baseline on general computational graphs through a combi- to validate the effectiveness of minimizing Eq. (4) nation of forward and backward linear bound prop- (Appendix A.1). agation, and the resulting 3-layer Transformer is robust to up to 6 word substitutions. For both works Quality metrics (perplexity and similarity). 
we use the same set of counter-fitted synonyms pro- We quantify the quality of our constructed vulnera- vided in Jia et al. (2019). We skip BERT-base due ble examples through two metrics: (1) GPT-2 (Rad- to the lack of an effective robust training method. ford et al., 2019) perplexity quantifies the natural- Attack success rate (first-order). We quantify ness of a sentence (smaller is better). We report first-order robustness through attack success rate, the perplexity for both the original input exam- which measures the ratio of test examples that an ples and the constructed vulnerable examples. (2) adversarial example can be found. We use first- `0 norm distance quantifies the disparity between order attacks as a reference due to the lack of a two sentences (smaller is better). We report the direct baseline. We experiment with two black-box distance between the input and the vulnerable ex- attacks: (1) The Genetic attack (Alzantot et al., ample. Note that first-order attacks have different 2018; Jia et al., 2019) uses a population-based op- objectives and thus cannot be compared directly. 3903
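For reference, the two quality metrics can be computed roughly as follows. The sketch uses HuggingFace GPT-2 for perplexity and assumes the two sentences are whitespace-aligned for the ℓ0 distance; both are simplifying assumptions, not the paper's exact evaluation script.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Rough sketch of the quality metrics: GPT-2 perplexity (naturalness) and
# l0 distance between two equal-length, whitespace-aligned sentences.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the average
        # cross-entropy over the predicted tokens.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def l0_distance(x: str, y: str) -> int:
    xt, yt = x.split(), y.split()
    assert len(xt) == len(yt), "sentences are assumed to be token-aligned"
    return sum(a != b for a, b in zip(xt, yt))
```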
Attack Success Rate (%) SO-Enum SO-Beam Genetic BAE SO-Enum SO-Beam Original Perturb Original Perturb `0 `0 PPL PPL PPL PPL Base Models: BoW 57.0 69.7 95.3 99.7 Base Models: CNN 62.0 71.0 95.3 99.8 BoW 168 202 1.1 166 202 1.2 LSTM 60.0 68.3 95.8 99.5 CNN 170 204 1.1 166 201 1.2 LSTM 168 204 1.1 166 204 1.2 Transformer 73.0 74.3 95.4 98.0 Transformer 165 193 1.0 165 195 1.1 BERT-base 41.0 61.5 94.3 98.7 BERT-base 170 229 1.3 168 222 1.4 Robust Models: Robust Models: BoW 28.0 63.1 81.5 88.4 BoW 170 212 1.2 171 222 1.4 CNN 23.0 64.4 91.0 96.0 CNN 166 209 1.2 168 210 1.3 LSTM 24.0 61.0 62.9 77.5 LSTM 194 251 1.3 185 260 1.8 Transformer 56.0 71.6 91.2 96.2 Transformer 170 213 1.2 165 208 1.3 Table 2: The average rates over 872 examples (100 for Table 3: The quality metrics for second-order meth- Genetic due to long running time). Second-order at- ods. We report the median perplexity (PPL) and av- tacks achieve higher successful rate since they are able erage `0 norm distance. The original PPL may differ to search beyond the test set. across models since we only count successful attacks. 3.3.2 Results nerable examples beyond the test set, and the search We experiment with the validation split (872 exam- is not required to maintain semantic similarity. Our ples) on a single RTX 3090. The average running methods provide a way to further investigate the time per example (in seconds) on base LSTM is robustness (or find vulnerable and adversarial ex- 31.9 for Genetic, 1.1 for BAE, 7.0 for SO-Enum amples) even when the model is robust to the test (k = 2), and 1.9 for SO-Beam (k = 6). We provide set. additional running time results in Appendix A.3. Quality of constructed vulnerable examples. Table 1 provides an example of the attack result As shown in Table 3, second-order attacks are able where all attacks are successful (additional exam- to construct vulnerable examples by perturbing 1.3 ples in Appendix A.5). As shown, our second- words on average, with a slightly increased per- order attacks find a vulnerable example by replac- plexity. For instance, on the robustly trained CNN ing grease → musicals, and the vulnerable and Transformer, SO-Beam constructs vulnerable example has different predictions for bad and un- examples by perturbing 1.3 words on average, with healthy. Note that, Genetic and BAE have dif- the median5 perplexity increased from around 165 ferent objectives from second-order attacks and to around 210. We provide metrics for first-order focus on finding the adversarial example. Next we attacks in Appendix A.5 as they have different ob- discuss the results from two perspectives. jectives and are not directly comparable. Second-order robustness. We observe that ex- Furthermore, applying existing attacks on the isting robustly trained models are not second-order vulnerable examples constructed by our method robust. As shown in Table 2, our second-order will lead to much smaller perturbations. As a refer- attacks attain high success rates not only on the ence, on the robustly trained CNN, Genetic attack base models but also on the robustly trained mod- constructs adversarial examples by perturbing 2.7 els. For instance, on the robustly trained CNN words on average (starting from the input exam- and Transformer, SO-Beam finds vulnerable ex- ples). 
However, if Genetic starts from our vulnera- amples within a small neighborhood for around ble examples, it would only need to perturb a single 96.0% of the test examples, even though these mod- word (i.e., the patch words p) to alter the predic- els have improved resistance to strong first-order tion. These results demonstrate the weakness of attacks (success rates drop from 62.0%–74.3% to the models (even robustly trained) for those inputs 23.0%–71.6% for Genetic and BAE).4 This phe- beyond the test set. nomenon can be explained by the fact that both first- 3.3.3 Human Evaluation order attacks and robust training methods focus on We perform human evaluation on the examples con- synonym substitutions on the test set, whereas our structed by SO-Beam. Specifically, we randomly attacks, due to their second-order nature, find vul- 5 We report median due to the unreasonably large perplexity 4 BAE is more effective on robust models as it may use on certain sentences. e.g., 395 for that’s a cheat. but replacement words outside the counter-fitted synonyms. 6740 for that proves perfect cheat. 3904
Naturalness (1-5) Semantic Similarity (%) x x⊕p x x⊕p Original Perturb Original Perturb 3.87 3.63 85 71 Table 4: The quality metrics from human evaluation. select 100 successful attacks and evaluate both the original examples and the vulnerable examples. To Figure 4: An illustration of an unbiased model vs. a evaluate the naturalness of the constructed exam- biased model. Green and gray indicate the probability ples, we ask the annotators to score the likelihood of positive and negative predictions, respectively. Left: (on a Likert scale of 1-5, 5 to be the most likely) of An unbiased model where the (x, x ⊕ p) pair (yellow- being an original example based on the grammar red dots) is relatively parallel to the decision boundary. correctness. To evaluate the semantic similarity af- Right: A biased model where the predictions for x ⊕ p (red) are usually more negative (gray) than x (yellow). ter applying the synonym substitution p, we ask the annotators to predict the sentiment of each example, and calculate the ratio of examples that maintain expected predictions for protected groups are dif- the same sentiment prediction after the synonym ferent (assuming the model is not intended to dis- substitution. For both metrics, we take the median criminate between these groups). For instance, a from 3 independent annotations. We use US-based sentiment classifier is biased if the expected predic- annotators on Amazon’s Mechanical Turk6 and pay tion for inputs containing woman is more positive $0.03 per annotation, and expect each annotation to (or negative) than inputs containing man. Such bias take 10 seconds on average (effectively, the hourly is harmful as they may make unfair decisions based rate is about $11). See Appendix A.2 for more on protected attributes, for example in situations details. such as hiring and college admission. As shown in Table 4, the naturalness score only Counterfactual token bias. We study a narrow drop slightly after the perturbation, indicating that case of counterfactual bias, where counterfactual our constructed vulnerable examples have similar examples are constructed by substituting protected naturalness as the original examples. As for the tokens in the input. A naive approach of measuring semantic similarity, we observe that 85% of the this bias is to construct counterfactual examples original examples maintain the same meaning after directly from the test set, however such evaluation the synonym substitution, and the corresponding may not be robust since test examples are only a ratio is 71% for vulnerable examples. This indi- small subset of natural sentences. Formally, let p cates that the synonym substitution is an invariance be a pair of protected tokens such as (he, she) or transformation for most examples. (Asian, American), Xtest ⊆ Xp be a test set (as in §2.1), we define counterfactual token bias by: 4 Evaluating Counterfactual Bias Bp,k := E Fsoft (x; p). (5) In addition to evaluating second-order robustness, x∈Neighbork (Xtest ) we further extend the double perturbation frame- work (§2) to evaluate counterfactual biases by set- We calculate Eq. (5) through an enumeration across ting p to pairs of protected tokens. We show that all natural sentences withinS the neighborhood.7 our method can reveal the hidden model bias. 
Here Neighbork (Xtest ) = x∈Xtest Neighbork (x) denotes the union of neighborhood examples (of 4.1 Counterfactual Bias distance k) around the test set, and Fsoft (x; p) : In contrast to second-order robustness, where we X × V 2 → [−1, 1] denotes the difference between consider the model vulnerable as long as there ex- probability outputs fsoft (similar to Eq. (2)): ists one vulnerable example, counterfactual bias focuses on the expected prediction, which is the Fsoft (x; p) := fsoft (x ⊕ p) − fsoft (x). (6) average prediction among all examples within the 7 For gender bias, we employ a blacklist to avoid adding neighborhood. We consider a model biased if the gendered tokens during the neighborhood construction. This is to avoid semantic shift when, for example, p = (he, she) 6 https://www.mturk.com such that it may refer to different tokens after the substitution. 3905
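The enumeration behind Eq. (5)-(6) can be written as a short script like the one below, reusing the neighbor() and apply_patch() sketches from earlier; f_soft, the test set, and the protected-token pair are placeholders, so this only illustrates the computation rather than reproducing the authors' evaluation code.

```python
# Rough sketch of the counterfactual token bias B_{p,k} in Eq. (5)-(6),
# reusing `neighbor` and `apply_patch` from the earlier sketches.
# `f_soft`, `test_set`, and `p` are placeholders / assumptions.
def counterfactual_token_bias(test_set, p, f_soft, k=3):
    diffs = []
    for x0 in test_set:
        if x0.split().count(p[0]) != 1:      # keep sentences with one occurrence of p[0]
            continue
        for x in neighbor(x0, k):
            if x.split().count(p[0]) != 1:   # the perturbation may have touched p[0]
                continue
            # Eq. (6): difference of probability outputs under the substitution.
            diffs.append(f_soft(apply_patch(x, p)) - f_soft(x))
    # Eq. (5): expectation over the constructed neighborhood.
    return sum(diffs) / len(diffs) if diffs else 0.0
```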
Patch Words # Original # Perturbed he,she 5 325,401 his,her 4 255,245 him,her 4 233,803 men,women 3 192,504 man,woman 3 222,981 actor,actress 2 141,780 ... Total 34 2,317,635 Table 5: The number of original examples (k = 0) and the number of perturbed examples (k = 3) in Xfilter . Figure 5: Our proposed Bp,k measured on Xfilter . Here “original” is equivalent to k = 0, “perturbed” is equiva- lent to k = 3, p is in the form of (male, female). The model is unbiased on p if Bp,k ≈ 0, whereas a positive or negative Bp,k indicates that the model shows preference or against to p(2) , respectively. Metrics. We evaluate model bias through the Fig. 4 illustrates the distribution of (x, x ⊕ p) for proposed Bp,k for k = 0, . . . , 3. Here the bias for both an unbiased model and a biased model. k = 0 is effectively measured on the original test The aforementioned neighborhood construction set, and the bias for k ≥ 1 is measured on our does not introduce additional bias. For instance, constructed neighborhood. We randomly sample a let x0 be a sentence containing he, even though subset of constructed examples when k = 3 due to it is possible for Neighbor1 (x0 ) to contain many the exponential complexity. stereotyping sentences (e.g., contains tokens such Filtered test set. To investigate whether our as doctor and driving) that affect the distri- method is able to reveal model bias that was hidden bution of fsoft (x), but it does not bias Eq. (6) as in the test set, we construct a filtered test set on we only care about the prediction difference of re- which the bias cannot be observed directly. Let placing he → she. The construction has no infor- Xtest be the original validation split, we construct mation about the model objective, thus it would be Xfilter by the equation below and empirically set difficult to bias fsoft (x) and fsoft (x ⊕ p) differently. = 0.005. We provide statistics in Table 5. 4.2 Experimental Results Xfilter := {x | |Fsoft (x; p)| < , x ∈ Xtest }. In this section, we use gender bias as a running 4.2.2 Results example, and demonstrate the effectiveness of our method by revealing the hidden model bias. We Our method is able to reveal the hidden model provide additional results in Appendix A.4. bias on Xfilter , which is not visible with naive mea- surements. In Fig. 5, the naive approach (k = 0) 4.2.1 Setup observes very small biases on most tokens (as con- We evaluate counterfactual token bias on the SST-2 structed). In contrast, when evaluated by our dou- dataset with both the base and debiased models. ble perturbation framework (k = 3), we are able to We focus on binary gender bias and set p to pairs observe noticeable bias, where most p has a pos- of gendered pronouns from Zhao et al. (2018a). itive bias on the base model. This observed bias Base Model. We train a single layer LSTM with is in line with the measurements on the original pre-trained GloVe embeddings and 75 hidden size Xtest (Appendix A.4), indicating that we reveal the (from TextAttack, Morris et al. 2020). The model correct model bias. Furthermore, we observe miti- has 82.9% accuracy similar to the baseline perfor- gated biases in the debiased model, which demon- mance reported in GLUE. strates the effectiveness of data augmentation. Debiased Model. Data-augmentation with gen- To demonstrate how our method reveals hidden der swapping has been shown effective in mitigat- bias, we conduct a case study with p = (actor, ing gender bias (Zhao et al., 2018a, 2019). 
We actress) and show the relationship between the augment the training split by swapping all male bias Bp,k and the neighborhood distance k. We entities with the corresponding female entities and present the histograms for Fsoft (x; p) in Fig. 6 and vice-versa. We use the same setup as the base plot the corresponding Bp,k vs. k in the right-most LSTM and attain 82.45% accuracy. panel. Surprisingly, for the base model, the bias is 3906
Figure 6: Left and Middle: Histograms for Fsoft (x; p) (x-axis) with p = (actor, actress). Right: The plot for the average Fsoft (x; p) (i.e., counterfactual token bias) vs. neighborhood distance k. Results show that the counterfactual bias on p can be revealed when increasing k. negative when k = 0, but becomes positive when image transformations can be used, in our natural k = 3. This is because the naive approach only has language setting, generating a valid example be- two test examples (Table 5) thus the measurement yond test set is more challenging because language is not robust. In contrast, our method is able to semantics and grammar must be maintained. construct 141,780 similar natural sentences when Counterfactual fairness. Kusner et al. (2017) k = 3 and shifts the distribution to the right (posi- propose counterfactual fairness and consider a tive). As shown in the right-most panel, the bias is model fair if changing the protected attributes does small when k = 1, and becomes more significant not affect the distribution of prediction. We fol- as k increases (larger neighborhood). As discussed low the definition and focus on evaluating the in §4.1, the neighborhood construction does not counterfactual bias between pairs of protected to- introduce additional bias, and these results demon- kens. Existing literature quantifies fairness on a strate the effectiveness of our method in revealing test dataset or through templates (Feldman et al., hidden model bias. 2015; Kiritchenko and Mohammad, 2018; May et al., 2019; Huang et al., 2020). For instance, 5 Related Work Garg et al. (2019) quantify the absolute counter- factual token fairness gap on the test set; Prab- First-order robustness evaluation. A line hakaran et al. (2019) study perturbation sensitivity of work has been proposed to study the vul- for named entities on a given set of corpus. Wallace nerability of natural language models, through et al. (2019b); Sheng et al. (2019, 2020) study how transformations such as character-level perturba- language generation models respond differently to tions (Ebrahimi et al., 2018), word-level pertur- prompt sentences containing mentions of different bations (Jin et al., 2019; Ren et al., 2019; Yang demographic groups. In contrast, our method quan- et al., 2020; Hsieh et al., 2019; Cheng et al., 2020; tifies the bias on the constructed neighborhood. Li et al., 2020), prepending or appending a se- quence (Jia and Liang, 2017; Wallace et al., 2019a), and generative models (Zhao et al., 2018b). They 6 Conclusion focus on constructing adversarial examples from the test set that alter the prediction, whereas our This work proposes the double perturbation frame- methods focus on finding vulnerable examples be- work to identify model weaknesses beyond the test yond the test set whose prediction can be altered. dataset, and study a stronger notion of robustness Robustness beyond the test set. Several works and counterfactual bias. We hope that our work have studied model robustness beyond test sets but can stimulate the research on further improving the mostly focused on computer vision tasks. Zhang robustness and fairness of natural language models. et al. (2019) demonstrate that a robustly trained model could still be vulnerable to small perturba- Acknowledgments tions if the input comes from a distribution only slightly different than a normal test set (e.g., im- We thank anonymous reviewers for their helpful ages with slightly different contrasts). Hendrycks feedback. 
We thank UCLA-NLP group for the and Dietterich (2019) study more sources of com- valuable discussions and comments. The research mon corruptions such as brightness, motion blur is supported NSF #1927554, #1901527, #2008173 and fog. Unlike in computer vision where simple and #2048280 and an Amazon Research Award. 3907
Ethical Considerations 56th Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers), pages Intended use. One primary goal of NLP models 31–36, Melbourne, Australia. Association for Com- is the generalization to real-world inputs. However, putational Linguistics. existing test datasets and templates are often not Michael Feldman, Sorelle A Friedler, John Moeller, comprehensive, and thus it is difficult to evaluate Carlos Scheidegger, and Suresh Venkatasubrama- real-world performance (Recht et al., 2019; Ribeiro nian. 2015. Certifying and removing disparate im- et al., 2020). Our work sheds a light on quantifying pact. In proceedings of the 21th ACM SIGKDD in- ternational conference on knowledge discovery and performance for inputs beyond the test dataset and data mining, pages 259–268. help uncover model weaknesses prior to the real- world deployment. Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfac- Misuse potential. Similar to other existing adver- tual fairness in text classification through robustness. sarial attack methods (Ebrahimi et al., 2018; Jin In Proceedings of the 2019 AAAI/ACM Conference et al., 2019; Zhao et al., 2018b), our second-order on AI, Ethics, and Society, AIES ’19, page 219–226, attacks can be used for finding vulnerable exam- New York, NY, USA. Association for Computing Machinery. ples to a NLP system. Therefore, it is essential to study how to improve the robustness of NLP Siddhant Garg and Goutham Ramakrishnan. 2020. models against second-order attacks. BAE: BERT-based adversarial examples for text Limitations. While the core idea about the dou- classification. In Proceedings of the 2020 Confer- ence on Empirical Methods in Natural Language ble perturbation framework is general, in §4, we Processing (EMNLP), pages 6174–6181, Online. As- consider only binary gender in the analysis of coun- sociation for Computational Linguistics. terfactual fairness due to the restriction of the En- Dan Hendrycks and Thomas Dietterich. 2019. Bench- glish corpus we used, which only have words asso- marking neural network robustness to common cor- ciated with binary gender such as he/she, wait- ruptions and perturbations. In International Confer- er/waitress, etc. ence on Learning Representations. Yu-Lun Hsieh, Minhao Cheng, Da-Cheng Juan, Wei Wei, Wen-Lian Hsu, and Cho-Jui Hsieh. 2019. On References the robustness of self-attentive models. In Proceed- Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, ings of the 57th Annual Meeting of the Association Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. for Computational Linguistics, pages 1520–1529. 2018. Generating natural language adversarial ex- Kuan-Hao Huang and Kai-Wei Chang. 2021. Generat- amples. In Proceedings of the 2018 Conference on ing syntactically controlled paraphrases without us- Empirical Methods in Natural Language Processing, ing annotated parallel pairs. In EACL. pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics. Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krish- Minhao Cheng, Jinfeng Yi, Pin-Yu Chen, Huan Zhang, namurthy Dvijotham, and Pushmeet Kohli. 2019. and Cho-Jui Hsieh. 2020. Seq2sick: Evaluating the Achieving verified robustness to symbol substitu- robustness of sequence-to-sequence models with ad- tions via interval bound propagation. Proceedings of versarial examples. 
Proceedings of the AAAI Confer- the 2019 Conference on Empirical Methods in Nat- ence on Artificial Intelligence, 34(04):3601–3608. ural Language Processing and the 9th International Joint Conference on Natural Language Processing Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, (EMNLP-IJCNLP). and Lucy Vasserman. 2018. Measuring and mitigat- ing unintended bias in text classification. In Pro- Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stan- ceedings of the 2018 AAAI/ACM Conference on AI, forth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Ethics, and Society, AIES ’18, page 67–73, New Yogatama, and Pushmeet Kohli. 2020. Reducing York, NY, USA. Association for Computing Machin- sentiment bias in language models via counterfac- ery. tual evaluation. Findings in EMNLP. Krishnamurthy Dvijotham, Sven Gowal, Robert Stan- Mohit Iyyer, J. Wieting, Kevin Gimpel, and Luke forth, Relja Arandjelovic, Brendan O’Donoghue, Zettlemoyer. 2018. Adversarial example generation Jonathan Uesato, and Pushmeet Kohli. 2018. Train- with syntactically controlled paraphrase networks. ing verified learners with learned verifiers. ArXiv, abs/1804.06059. Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Robin Jia and Percy Liang. 2017. Adversarial ex- Dou. 2018. HotFlip: White-box adversarial exam- amples for evaluating reading comprehension sys- ples for text classification. In Proceedings of the tems. In Proceedings of the 2017 Conference on 3908
Empirical Methods in Natural Language Processing, Jeffrey Pennington, Richard Socher, and Christopher D. EMNLP 2017, Copenhagen, Denmark, September 9- Manning. 2014. Glove: Global vectors for word rep- 11, 2017, pages 2021–2031. Association for Compu- resentation. In Empirical Methods in Natural Lan- tational Linguistics. guage Processing (EMNLP), pages 1532–1543. Robin Jia, Aditi Raghunathan, Kerem Göksel, and Vinodkumar Prabhakaran, Ben Hutchinson, and Mar- Percy Liang. 2019. Certified robustness to adversar- garet Mitchell. 2019. Perturbation sensitivity analy- ial word substitutions. In EMNLP/IJCNLP. sis to detect unintended model biases. In Proceed- ings of the 2019 Conference on Empirical Methods Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter in Natural Language Processing and the 9th Inter- Szolovits. 2019. Is bert really robust? a strong base- national Joint Conference on Natural Language Pro- line for natural language attack on text classification cessing (EMNLP-IJCNLP), pages 5740–5745, Hong and entailment. Kong, China. Association for Computational Lin- Svetlana Kiritchenko and Saif Mohammad. 2018. Ex- guistics. amining gender and race bias in two hundred sen- Alec Radford, Jeff Wu, Rewon Child, David Luan, timent analysis systems. In Proceedings of the Dario Amodei, and Ilya Sutskever. 2019. Language Seventh Joint Conference on Lexical and Compu- models are unsupervised multitask learners. tational Semantics, pages 43–53, New Orleans, Louisiana. Association for Computational Linguis- Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, tics. and Vaishaal Shankar. 2019. Do ImageNet clas- Matt J Kusner, Joshua Loftus, Chris Russell, and Ri- sifiers generalize to ImageNet? volume 97 of cardo Silva. 2017. Counterfactual fairness. In Proceedings of Machine Learning Research, pages I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, 5389–5400, Long Beach, California, USA. PMLR. R. Fergus, S. Vishwanathan, and R. Garnett, editors, Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. Advances in Neural Information Processing Systems 2019. Generating natural language adversarial ex- 30, pages 4066–4076. Curran Associates, Inc. amples through probability weighted word saliency. Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, In Proceedings of the 57th Annual Meeting of the and Xipeng Qiu. 2020. BERT-ATTACK: Adversar- Association for Computational Linguistics, pages ial attack against BERT using BERT. In Proceed- 1085–1097, Florence, Italy. Association for Compu- ings of the 2020 Conference on Empirical Methods tational Linguistics. in Natural Language Processing (EMNLP), pages 6193–6202, Online. Association for Computational Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Linguistics. and Sameer Singh. 2020. Beyond accuracy: Behav- ioral testing of nlp models with checklist. In Associ- Chandler May, Alex Wang, Shikha Bordia, Samuel R. ation for Computational Linguistics (ACL). Bowman, and Rachel Rudinger. 2019. On measur- ing social biases in sentence encoders. In Proceed- Victor Sanh, Lysandre Debut, Julien Chaumond, and ings of the 2019 Conference of the North American Thomas Wolf. 2019. Distilbert, a distilled version Chapter of the Association for Computational Lin- of bert: smaller, faster, cheaper and lighter. ArXiv, guistics: Human Language Technologies, Volume 1 abs/1910.01108. (Long and Short Papers), pages 622–628, Minneapo- lis, Minnesota. Association for Computational Lin- Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, guistics. and Nanyun Peng. 2019. 
The woman worked as a babysitter: On biases in language generation. In Takeru Miyato, Andrew M. Dai, and Ian Goodfel- EMNLP. low. 2017. Adversarial training methods for semi- supervised text classification. ICLR. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2020. Towards controllable bi- John X. Morris, Eli Lifland, Jin Yong Yoo, Jake ases in language generation. In EMNLP-Finding. Grigsby, Di Jin, and Yanjun Qi. 2020. Textattack: A framework for adversarial attacks, data augmenta- Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie tion, and adversarial training in nlp. Huang, and Cho-Jui Hsieh. 2020. Robustness verifi- cation for transformers. In International Conference Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thom- on Learning Representations. son, Milica Gašić, Lina M. Rojas-Barahona, Pei- Hao Su, David Vandyke, Tsung-Hsien Wen, and Richard Socher, Alex Perelygin, Jean Wu, Jason Steve Young. 2016. Counter-fitting word vectors to Chuang, Christopher D. Manning, Andrew Ng, and linguistic constraints. In Proceedings of the 2016 Christopher Potts. 2013. Recursive deep models Conference of the North American Chapter of the for semantic compositionality over a sentiment tree- Association for Computational Linguistics: Human bank. In Proceedings of the 2013 Conference on Language Technologies, pages 142–148, San Diego, Empirical Methods in Natural Language Processing, California. Association for Computational Linguis- pages 1631–1642, Seattle, Washington, USA. Asso- tics. ciation for Computational Linguistics. 3909
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gard- Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018b. ner, and Sameer Singh. 2019a. Universal adver- Generating natural adversarial examples. In Interna- sarial triggers for attacking and analyzing nlp. In tional Conference on Learning Representations. EMNLP/IJCNLP. Yi Zhou, Xiaoqing Zheng, Cho-Jui Hsieh, Kai-Wei Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gard- Chang, and Xuanjing Huang. 2020. Defense against ner, and Sameer Singh. 2019b. Universal adversar- adversarial attacks in nlp via dirichlet neighborhood ial triggers for attacking and analyzing NLP. In ensemble. arXiv preprint arXiv:2006.11627. EMNLP. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow- icz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Huggingface’s transformers: State-of-the-art natural language processing. Kaidi Xu, Zhouxing Shi, Huan Zhang, Yihan Wang, Kai-Wei Chang, Minlie Huang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh. 2020. Automatic per- turbation analysis for scalable certified robustness and beyond. Puyudi Yang, Jianbo Chen, Cho-Jui Hsieh, Jane-Ling Wang, and Michael I. Jordan. 2020. Greedy attack and gumbel attack: Generating adversarial examples for discrete data. Journal of Machine Learning Re- search, 21(43):1–36. Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. 2020. Word-level textual adversarial attacking as combina- torial optimization. Proceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics. Huan Zhang, Hongge Chen, Zhao Song, Duane Boning, inderjit dhillon, and Cho-Jui Hsieh. 2019. The lim- itations of adversarial training and the blind-spot at- tack. In International Conference on Learning Rep- resentations. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cot- terell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In NAACL. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Or- donez, and Kai-Wei Chang. 2018a. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, New Orleans, Louisiana. Association for Computa- tional Linguistics. 3910
A Supplemental Material Running Time (seconds) Genetic BAE SO-Enum SO-Beam A.1 Random Baseline Base Models: BoW 31.6 0.9 6.2 1.8 CNN 28.8 1.0 5.9 1.7 To validate the effectiveness of minimizing Eq. (4), LSTM 31.9 1.1 7.0 1.9 we also experiment on a second-order baseline that Transformer 51.9 0.5 6.5 2.5 constructs vulnerable examples by randomly re- BERT-base 65.6 1.1 35.4 7.1 Robust Models: placing up to 6 words. We use the same masked BoW 103.9 1.0 8.0 3.5 language model and threshold as SO-Beam such CNN 129.4 1.0 6.7 2.6 that they share a similar neighborhood. We per- LSTM 116.4 1.1 10.7 5.3 form the attack on the same models as Table 2, and Transformer 66.4 0.5 5.9 2.6 the attack success rates on robustly trained BoW, Table 6: The average running time over 872 examples CNN, LSTM, and Transformers are 18.8%, 22.3%, (100 for Genetic due to long running time). 15.2%, and 25.1%, respectively. Despite being a second-order attack, the random baseline has low attack success rates thus demonstrates the effective- A.4 Additional Results on Protected Tokens ness of SO-Beam. Fig. 7 presents the experimental results with ad- ditional protected tokens such as nationality, reli- A.2 Human Evaluation gion, and sexual orientation (from Ribeiro et al. (2020)). We use the same base LSTM as de- We randomly select 100 successful attacks from scribed in §4.2. One interesting observation is SO-Beam and consider four types of examples (for when p = (gay, straight) where the bias is a total of 400 examples): The original examples negative, indicating that the sentiment classifier with and without synonym substitution p, and the tends to give more negative prediction when sub- vulnerable examples with and without synonym stituting gay → straight in the input. This substitution p. For each example, we annotate the phenomenon is opposite to the behavior of toxi- naturalness and sentiment separately as described city classifiers (Dixon et al., 2018), and we hy- below. pothesize that it may be caused by the different Naturalness of vulnerable examples. We ask distribution of training data. To verify the hypoth- the annotators to score the likelihood of being an esis, we count the number of training examples original example (i.e., not altered by computer) containing each word, and observe that we have based on grammar correctness and naturalness, far more negative examples than positive examples with a Likert scale of 1-5: (1) Sure adversarial among those containing straight (Table 7). Af- example. (2) Likely an adversarial example. (3) ter looking into the training set, it turns out that Neutral. (4) Likely an original example. (5) Sure straight to video is a common phrase to original example. criticize a film, thus the classifier incorrectly cor- Semantic similarity after the synonym substitu- relates straight with negative sentiment. This tion. We first ask the annotators to predict the also reveals the limitation of our method on polyse- sentiment on a Likert scale of 1-5, and then map mous words. the prediction to three categories: negative, neutral, and positive. We consider two examples to have # Negative # Positive the same semantic meaning if and only if they are gay 37 20 both positive or negative. straight 71 18 Table 7: Number of negative and positive examples A.3 Running Time containing gay and straight in the training set. We experiment with the validation split on a single RTX 3090, and measure the average running time In Fig. 8, we measure the bias on Xtest and ob- per example. 
As shown in Table 6, SO-Beam runs serve positive bias on most tokens for both k = 0 faster than SO-Enum since it utilizes the probability and k = 3, which indicates that the model “tends” output. The running time may increase if the model to make more positive predictions for examples has improved second-order robustness. containing certain female pronouns than male pro- 3911
Genetic BAE Original Perturb Original Perturb `0 `0 PPL PPL PPL PPL Base Models: BoW 145 258 3.3 192 268 1.6 CNN 146 282 3.0 186 254 1.5 LSTM 131 238 2.9 190 263 1.6 Transformer 137 232 2.8 185 254 1.4 BERT-base 201 342 3.4 189 277 1.6 Robust Models: BoW 132 177 2.4 214 269 1.5 CNN 136 236 2.7 211 279 1.5 LSTM 163 267 2.5 220 302 1.6 Transformer 118 200 2.8 196 261 1.4 Table 8: The quality metrics for first-order attacks from successful attacks. We compare median perplexities (PPL) and average `0 norm distances. Figure 7: Additional counterfactual token bias mea- sured on the original validation split with base LSTM. SO-Beam on base LSTM, and Table 12 shows addi- tional attack results from SO-Beam on robust CNN. nouns. Notice that even though gender swap miti- Bold words are the corresponding patch words p, gates the bias to some extent, it is still difficult to taken from the predefined list of counter-fitted syn- fully eliminate the bias. This is probably caused onyms. by tuples like (him, his, her) which cannot be swapped perfectly, and requires additional process- ing such as part-of-speech resolving (Zhao et al., 2018a). Figure 8: Full results for gendered tokens measured on the original validation split. To help evaluate the naturalness of our con- structed examples used in §4, we provide sample sentences in Table 9 and Table 10. Bold words are the corresponding patch words p, taken from the predefined list of gendered pronouns. A.5 Additional Results on Robustness Table 8 provides the quality metrics for first-order attacks, where we measure the GPT-2 perplexity and `0 norm distance between the input and the adversarial example. For BAE we evaluate on 872 validation examples, and for Genetic we evaluate on 100 validation examples due to the long running time. Table 11 shows additional attack results from 3912
Type Predictions Text Original 95% Negative 94% Negative it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who is out of their depth . Distance k = 1 97% Negative (97% Negative) it ’s hampered by a lifetime-channel kind of plot and lone lead actor (actress) who is out of their depth . 56% Negative (55% Positive ) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who is out of creative depth . 89% Negative (84% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who talks out of their depth . 98% Negative (98% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who is out of production depth . 96% Negative (96% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) that is out of their depth . Distance k = 2 88% Negative (87% Negative) it ’s hampered by a lifetime-channel cast of stars and a lead actor (ac- tress) who is out of their depth . 96% Negative (95% Negative) it ’s hampered by a simple set of plot and a lead actor (actress) who is out of their depth . 54% Negative (54% Negative) it ’s framed about a lifetime-channel kind of plot and a lead actor (ac- tress) who is out of their depth . 90% Negative (88% Negative) it ’s hampered by a lifetime-channel mix between plot and a lead actor (actress) who is out of their depth . 78% Negative (68% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who storms out of their mind . Distance k = 3 52% Positive (64% Positive ) it ’s characterized by a lifetime-channel combination comedy plot and a lead actor (actress) who is out of their depth . 93% Negative (93% Negative) it ’s hampered by a lifetime-channel kind of star and a lead actor (ac- tress) who falls out of their depth . 58% Negative (57% Negative) it ’s hampered by a tough kind of singer and a lead actor (actress) who is out of their teens . 70% Negative (52% Negative) it ’s hampered with a lifetime-channel kind of plot and a lead actor (actress) who operates regardless of their depth . 58% Negative (53% Positive ) it ’s hampered with a lifetime-channel cast of plot and a lead actor (actress) who is out of creative depth . Table 9: Additional counterfactual bias examples on base LSTM with p = (actor, actress). We only present 5 examples per k due to space constrain. 3913
You can also read