Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation
Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation Chong Zhang Jieyu Zhao Huan Zhang Kai-Wei Chang Cho-Jui Hsieh Department of Computer Science, UCLA {chongz, jyzhao, kwchang, chohsieh}@cs.ucla.edu, huan@huan-zhang.com Abstract Xtest Robustness and counterfactual bias are usually x0 ="a deep and meaningful film (movie)." evaluated on a test dataset. However, are 99% positive (99% positive) these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results keep the same? In this paper, we propose a perturb “double perturbation” framework to uncover model weaknesses beyond the test dataset. x̃0 ="a short and moving film (movie)." The framework first perturbs the test dataset 73% positive (70% negative) to construct abundant natural sentences similar to the test data, and then diagnoses Figure 1: A vulnerable example beyond the test dataset. the prediction change regarding a single-word Numbers on the bottom right are the sentiment predic- substitution. We apply this framework to tions for film and movie. Blue x0 comes from the study two perturbation-based approaches test dataset and its prediction cannot be altered by the that are used to analyze models’ robustness substitution film → movie (robust). Yellow exam- and counterfactual bias in English. (1) For ple x̃0 is slightly perturbed but remains natural. Its pre- robustness, we focus on synonym substitu- diction can be altered by the substitution (vulnerable). tions and identify vulnerable examples where prediction can be altered. Our proposed attack attains high success rates (96.0%–99.8%) In most studies, model robustness is evaluated in finding vulnerable examples on both based on a given test dataset or synthetic sentences original and robustly trained CNNs and constructed from templates (Ribeiro et al., 2020). Transformers. (2) For counterfactual bias, Specifically, the robustness of a model is often eval- we focus on substituting demographic tokens uated by the ratio of test examples where the model (e.g., gender, race) and measure the shift of prediction cannot be altered by semantic-invariant the expected prediction among constructed sentences. Our method is able to reveal perturbation. We refer to this type of evaluations the hidden model biases not directly shown as the first-order robustness evaluation. However, in the test dataset. Our code is available even if a model is first-order robust on an input sen- at https://github.com/chong-z/ tence x0 , it is possible that the model is not robust nlp-second-order-attack. on a natural sentence x̃0 that is slightly modified from x0 . In that case, adversarial examples still 1 Introduction exist even if first-order attacks cannot find any of Recent studies show that NLP models are vulner- them from the given test dataset. Throughout this able to adversarial perturbations. A seemingly paper, we call x̃0 a vulnerable example. The ex- “invariance transformation” (a.k.a. adversarial per- istence of such examples exposes weaknesses in turbation) such as synonym substitutions (Alzantot models’ understanding and presents challenges for et al., 2018; Zang et al., 2020) or syntax-guided model deployment. Fig. 1 illustrates an example. paraphrasing (Iyyer et al., 2018; Huang and Chang, In this paper, we propose the double perturba- 2021) can alter the prediction. To mitigate the tion framework for evaluating a stronger notion model vulnerability, robust training methods have of second-order robustness. 
been proposed and shown effective (Miyato et al., 2017; Jia et al., 2019; Huang et al., 2019; Zhou et al., 2020). Given a test dataset, we consider a model to be second-order robust if there is no vulnerable example that can be identified in the neighborhood of given test instances
(§2.2). In particular, our framework first perturbs positive negative the test set to construct the neighborhood, and then diagnoses the robustness regarding a single-word x̃1 synonym substitution. Taking Fig. 2 as an example, x̃00 the model is first-order robust on the input sentence x0 (the prediction cannot be altered), but it is not x̃0 second-order robust due to the existence of the vul- x0 nerable example x̃0 . Our framework is designed to identify x̃0 . Figure 2: An illustration of the decision boundary. Dia- We apply the proposed framework and quantify mond area denotes invariance transformations. Blue x0 second-order robustness through two second-order is a robust input example (the entire diamond is green). attacks (§3). We experiment with English senti- Yellow x̃0 is a vulnerable example in the neighborhood of x0 . Red x̃00 is an adversarial example to x̃0 . Note: ment classification on the SST-2 dataset (Socher x̃00 is not an adversarial example to x0 since they have et al., 2013) across various model architectures. different meanings to human (outside the diamond). Surprisingly, although robustly trained CNN (Jia et al., 2019) and Transformer (Xu et al., 2020) can achieve high robustness under strong at- order robustness and reveal the models’ vulnerabil- tacks (Alzantot et al., 2018; Garg and Ramakr- ities that cannot be identified by previous attacks. ishnan, 2020) (23.0%–71.6% success rates), for (3) We propose a counterfactual bias evaluation around 96.0% of the test examples our attacks can method to reveal the hidden model bias based on find a vulnerable example by perturbing 1.3 words our double perturbation framework. on average. This finding indicates that these ro- bustly trained models, despite being first-order ro- 2 The Double Perturbation Framework bust, are not second-order robust. In this section, we describe the double perturbation Furthermore, we extend the double perturbation framework which focuses on identifying vulnerable framework to evaluate counterfactual biases (Kus- examples within a small neighborhood of the test ner et al., 2017) (§4) in English. When the test dataset. The framework consists of a neighborhood dataset is small, our framework can help improve perturbation and a word substitution. We start with the evaluation robustness by revealing the hidden defining word substitutions. biases not directly shown in the test dataset. In- tuitively, a fair model should make the same pre- 2.1 Existing Word Substitution Strategy diction for nearly identical examples referencing We focus our study on word-level substitution, different groups (Garg et al., 2019) with different where existing works evaluate robustness and coun- protected attributes (e.g., gender, race). In our eval- terfactual bias by directly perturbing the test dataset. uation, we consider a model biased if substituting For instance, adversarial attacks alter the prediction tokens associated with protected attributes changes by making synonym substitutions, and the fairness the expected prediction, which is the average pre- literature evaluates counterfactual fairness by sub- diction among all examples within the neighbor- stituting protected tokens. We integrate the word hood. For instance, a toxicity classifier is biased substitution strategy into our framework as the com- if it tends to increase the toxicity if we substitute ponent for evaluating robustness and fairness. straight → gay in an input sentence (Dixon For simplicity, we consider a single-word substi- et al., 2018). 
In the experiments, we evaluate the ex- tution and denote it with the operator ⊕. Let X ⊆ pected sentiment predictions on pairs of protected V l be the input space where V is the vocabulary and tokens (e.g., (he, she), (gay, straight)), and l is the sentence length, p = (p(1) , p(2) ) ∈ V 2 be demonstrate that our method is able to reveal the a pair of synonyms (called patch words), Xp ⊆ X hidden model biases. denotes sentences with a single occurrence of p(1) Our main contributions are: (1) We propose the (for simplicity we skip other sentences), x0 ∈ Xp double perturbation framework to diagnose the ro- be an input sentence, then x0 ⊕ p means “substitute bustness of existing robustness and fairness evalu- p(1) → p(2) in x0 ”. The result after substitution is: ation methods. (2) We propose two second-order attacks to quantify the stronger notion of second- x00 = x0 ⊕ p. 3900
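To make the ⊕ notation concrete, a minimal Python sketch of this single-word patch operation is shown below; the whitespace tokenization and the helper name apply_patch are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the patch operation x0' = x0 (+) p for a whitespace-
# tokenized sentence: replace the single occurrence of p[0] with p[1].
# The function name and the tokenization are illustrative assumptions.
def apply_patch(x0: str, p: tuple) -> str:
    src, dst = p
    tokens = x0.split()
    assert tokens.count(src) == 1, "x0 is assumed to contain p[0] exactly once"
    return " ".join(dst if tok == src else tok for tok in tokens)

# Fig. 1 example with p = (film, movie)
print(apply_patch("a deep and meaningful film .", ("film", "movie")))
# -> "a deep and meaningful movie ."
```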
Taking Fig. 1 as an example, where p = (film, Algorithm 1: Neighborhood construction movie) and x0 = a deep and meaning- Data: Input sentence x0 , masked language model ful film, the perturbed sentence is x00 = a LM, max distance k. 1 Function Neighbork (x0 ): deep and meaningful movie. Now we in- 2 if k = 0 then return {x0 } ; troduce other components in our framework. 3 if k ≥ 2 thenS 4 return x̃∈Neighbor (x0 ) Neighbork−1 (x̃); 1 5 Xneighbor ← ∅; 2.2 Proposed Neighborhood Perturbation 6 for i ← 0, . . . , len(x0 ) − 1 do 7 T, L ← LM.fillmask(x0 , i); . Mask ith token and return candidate Instead of applying the aforementioned word sub- tokens and corresponding logits. stitutions directly to the original test dataset, our 8 L ← SortDecreasing(L); 9 lmin ← max{L(κ) , L(0) − δ}; framework perturbs the test dataset within a small . L(i) denotes the ith element. We neighborhood to construct similar natural sen- empirically set κ ← 20 and δ ← 3. tences. This is to identify vulnerable examples 10 Tnew ← {t | l > lmin , (t, l) ∈ T × L}; (i) with respect to the model. Note that examples in 11 Xnew ← {x0 | x0 ← t, t ∈ Tnew }; . Construct new sentences by the neighborhood are not required to have the same replacing the ith token. meaning as the original example, since we only 12 Xneighbor ← Xneighbor ∪ Xnew ; study the prediction difference caused by applying 13 return Xneighbor ; synonym substitution p (§2.1). Constraints on the neighborhood. We limit the 3 Evaluating Second-Order Robustness neighborhood sentences within a small `0 norm ball (regarding the test instance) to ensure syntactic With the proposed double perturbation framework, similarity, and empirically ensure the naturalness we design two black-box attacks1 to identify vul- through a language model. The neighborhood of nerable examples within the neighborhood of the an input sentence x0 ∈ X is: test set. We aim at evaluating the robustness for inputs beyond the test set. Neighbork (x0 ) ⊆ Ballk (x0 ) ∩ Xnatural , (1) 3.1 Previous First-Order Attacks Adversarial attacks search for small and invariant where Ballk (x0 ) = {x | kx − x0 k0 ≤ k, x ∈ X } perturbations on the model input that can alter the is the `0 norm ball around x0 (i.e., at most k differ- prediction. To simplify the discussion, in the fol- ent tokens), and Xnatural denotes natural sentences lowing, we take a binary classifier f (x) : X → that satisfy a certain language model score which {0, 1} as an example to describe our framework. will be discussed next. Let x0 be the sentence from the test set with label Construction with masked language model. y0 , then the smallest perturbation δ ∗ under `0 norm We construct neighborhood sentences from x0 by distance is:2 substituting at most k tokens. As shown in Al- δ ∗ := argmin kδk0 s.t. f (x0 ⊕ δ) 6= y0 . gorithm 1, the construction employs a recursive δ approach and replaces one token at a time. For each recursion, the algorithm first masks each to- Here δ = p1 ⊕ · · · ⊕ pl denotes a series of substi- ken of the input sentence (may be the original x0 or tutions. In contrast, our second-order attacks fix the x̃ from last recursion) separately and predicts δ = p and search for the vulnerable x0 . likely replacements with a masked language model (e.g., DistilBERT, Sanh et al. 2019). To ensure the 3.2 Proposed Second-Order Attacks naturalness, we keep the top 20 tokens for each Second-order attacks study the prediction differ- mask with the largest logit (subject to a threshold, ence caused by applying p. 
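As a rough illustration of Algorithm 1, the sketch below constructs the neighborhood with a HuggingFace fill-mask pipeline (DistilBERT). It keeps the pipeline's top-20 candidates by softmax score instead of applying the logit threshold of Line 9, and it tokenizes by whitespace, so it approximates rather than reproduces the released implementation.

```python
# Rough sketch of Algorithm 1 (Neighbor_k) using a fill-mask pipeline.
# Assumptions: candidates are kept by the pipeline's softmax score rather
# than a raw-logit threshold, and sentences are tokenized by whitespace.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

def neighbor(x0: str, k: int, top_k: int = 20) -> set:
    """Sentences within l0 distance k of x0 (x0 itself is returned at k=0)."""
    if k == 0:
        return {x0}
    if k >= 2:
        # Recursive case: expand the distance-1 neighborhood k-1 more times.
        out = set()
        for x in neighbor(x0, 1, top_k):
            out |= neighbor(x, k - 1, top_k)
        return out
    tokens = x0.split()
    neighborhood = set()
    for i in range(len(tokens)):
        # Mask the i-th token and query the masked language model.
        masked = " ".join(tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:])
        for cand in fill_mask(masked, top_k=top_k):
            new_tokens = tokens[:i] + [cand["token_str"]] + tokens[i + 1:]
            neighborhood.add(" ".join(new_tokens))
    return neighborhood
```

Enumerating this set directly is what SO-Enum does, which is why the paper restricts SO-Enum to small k (e.g., k ≤ 2).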
For notation conve- Line 9). Then, the algorithm constructs neighbor- nience we define the prediction difference F (x; p) : hood sentences by replacing the mask with found 1 tokens. We use the notation x̃ in the following sec- Black-box attacks only observe the model outputs and do not know the model parameters or the gradient. tions to denote the constructed sentences within the 2 For simplicity, we use `0 norm distance to measure the neighborhood. similarity, but other distance metrics can be applied. 3901
Find p for x0 . Find vulnerable example p alters the prediction. through beam search. x0 = a deep and meaningful film. x (i = 1) fsoft (x) x̃0 ="a short and moving film (movie)." a deep and disturbing film (movie). .990 (.989) a deep and moving film (movie). .999 (.999) 73% positive (70% negative) a dramatic and meaningful film (movie). .999 (.999) ··· Synonyms film, movie p = film, movie x (i = 2) fsoft (x) story, tale a short and moving film (movie). .730 (.303) fool, silly a slow and moving film (movie). .519 (.151) ··· a dramatic or meaningful film (movie). .487 (.168) ··· Figure 3: The attack flow for SO-Beam (Algorithm 2). Blue x0 is the input sentence and yellow x̃0 is our con- structed vulnerable example (the prediction can be altered by substituting film → movie). Green boxes in the middle show intermediate sentences, and fsoft (x) denotes the probability outputs for film and movie. X × V 2 → {−1, 0, 1} by:3 p), respectively. Minimizing Eq. (4) effectively leads to fmin → 0 and fmax → 1, and we use a F (x; p) := f (x ⊕ p) − f (x). (2) beam search to find the best x. At each iteration, we construct sentences through Neighbor1 (x) and Taking Fig. 1 as an example, the prediction differ- only keep the top 20 sentences with the smallest ence for x̃0 on p is F (x̃0 ; p) = f (...moving L(x; p). We run at most k iterations, and stop movie.) − f (...moving film.) = −1. earlier if we find a vulnerable example. We provide Given an input sentence x0 , we want to find the detailed implementation in Algorithm 2 and a patch words p and a vulnerable example x̃0 such flowchart in Fig. 3. that f (x̃0 ⊕ p) 6= f (x̃0 ). Follow Alzantot et al. (2018), we choose p from a predefined list of Algorithm 2: Beam-search attack (SO- counter-fitted synonyms (Mrkšić et al., 2016) that Beam) maximizes |fsoft (p(2) ) − fsoft (p(1) )|. Here fsoft (x) : Data: Input sentence x0 , synonyms P, model X → [0, 1] denotes probability output (e.g., af- functions F and fsoft , loss L, max distance k. ter the softmax layer but before the final argmax), 1 Function SO-Beamk (x0 ): fsoft (p(1) ) and fsoft (p(2) ) denote the predictions for 2 p ← argmax |fsoft (p(2) ) − fsoft (p(1) )|; p∈P s.t. x0 ∈Xp the single word, and we enumerate through all pos- 3 Xbeam ← {x0 }; sible p for x0 . Let k be the neighborhood distance, 4 for i ← 1, . . .S, k do 5 Xnew ← x̃∈Xbeam Neighbor1 (x̃); then the attack is equivalent to solving: 6 x̃0 ← argmaxx∈Xnew |F (x; p)|; 7 if F (x̃0 ; p) 6= 0 then return x̃0 ; x̃0 = argmax |F (x; p)|. (3) 8 Xnew ← SortIncreasing(Xnew , L); x∈Neighbork (x0 ) (0) (β−1) 9 Xbeam ← {Xnew , . . . , Xnew }; . Keep the best beam. We set β ← 20. Brute-force attack (SO-Enum). A naive ap- 10 return None; proach for solving Eq. (3) is to enumerate through Neighbork (x0 ). The enumeration finds the small- est perturbation, but is only applicable for small k 3.3 Experimental Results (e.g., k ≤ 2) given the exponential complexity. Beam-search attack (SO-Beam). The efficiency In this section, we evaluate the second-order ro- can be improved by utilizing the probability output, bustness of existing models and show the quality where we solve Eq. (3) by minimizing the cross- of our constructed vulnerable examples. 
entropy loss with regard to x ∈ Neighbork (x0 ): 3.3.1 Setup L(x; p) := − log(1 − fmin ) − log(fmax ), We follow the setup from the robust training lit- (4) erature (Jia et al., 2019; Xu et al., 2020) and ex- where fmin and fmax are the smaller and the larger periment with both the base (non-robust) and ro- output probability between fsoft (x) and fsoft (x ⊕ bustly trained models. We train the binary senti- 3 We assume a binary classification task, but our framework ment classifiers on the SST-2 dataset with bag-of- is general and can be extended to multi-class classification. words (BoW), CNN, LSTM, and attention-based 3902
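Putting Eq. (3)-(4) and Algorithm 2 together, a compact sketch of the beam search is given below, reusing the neighbor() helper from the earlier sketch. Here f_soft is assumed to be any user-supplied callable returning the positive-class probability of a sentence, the hard prediction is taken by thresholding at 0.5, and the synonym handling is simplified, so this is an illustration rather than the released SO-Beam attack.

```python
import math

# Rough sketch of SO-Beam (Algorithm 2). `f_soft(x)` is assumed to be a
# user-supplied function returning the positive-class probability of a
# sentence, `synonyms` a list of counter-fitted (w1, w2) pairs, and
# `neighbor` the distance-1 construction from the earlier sketch.
def so_beam(x0, f_soft, synonyms, k=6, beam_width=20):
    def patch(x, p):                       # x (+) p: substitute p[0] -> p[1]
        return " ".join(p[1] if t == p[0] else t for t in x.split())

    def loss(x, p):                        # Eq. (4): push f(x) and f(x (+) p) apart
        a, b = f_soft(x), f_soft(patch(x, p))
        f_min, f_max = min(a, b), max(a, b)
        return -math.log(1 - f_min + 1e-12) - math.log(f_max + 1e-12)

    # Line 2: pick the patch words with the largest single-word prediction gap.
    usable = [q for q in synonyms if x0.split().count(q[0]) == 1]
    if not usable:
        return None
    p = max(usable, key=lambda q: abs(f_soft(q[1]) - f_soft(q[0])))

    beam = {x0}
    for _ in range(k):
        candidates = set()
        for x in beam:
            candidates |= neighbor(x, 1)
        for x in candidates:               # success: hard predictions disagree
            if round(f_soft(patch(x, p))) != round(f_soft(x)):
                return x, p
        beam = set(sorted(candidates, key=lambda x: loss(x, p))[:beam_width])
    return None
```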
Original: 70% Negative Input Example: in its best moments , resembles a bad high school production of grease , without benefit of song . Genetic: 56% Positive Adversarial Example: in its best moment , recalling a naughty high school production of lubrication , unless benefit of song . BAE: 56% Positive Adversarial Example: in its best moments , resembles a great high school production of grease , without benefit of song . SO-Enum and SO-Beam (ours): 60% Negative (67% Positive) Vulnerable Example: in its best moments , resembles a bad (unhealthy) high school production of musicals , without benefit of song . Table 1: Sampled attack results on the robust BoW. For Genetic and BAE the goal is to find an adversarial example that alters the original prediction, whereas for SO-Enum and SO-Beam the goal is to find a vulnerable example beyond the test set such that the prediction can be altered by substituting bad → unhealthy. models. timization algorithm that generates both syntacti- Base models. For BoW, CNN, and LSTM, all cally and semantically similar adversarial exam- models use pre-trained GloVe embeddings (Pen- ples, by replacing words within the list of counter- nington et al., 2014), and have one hidden layer fitted synonyms. (2) The BAE attack (Garg and of the corresponding type with 100 hidden size. Ramakrishnan, 2020) generates coherent adversar- Similar to the baseline performance reported in ial examples by masking and replacing words using GLUE (Wang et al., 2019), our trained models BERT. For both methods we use the implementa- have an evaluation accuracy of 81.4%, 82.5%, and tion provided by TextAttack (Morris et al., 2020). 81.7%, respectively. For attention-based models, we train a 3-layer Transformer (the largest size in Attack success rate (second-order). We also Shi et al. 2020) and fine-tune a pre-trained bert- quantify second-order robustness through attack base-uncased from HuggingFace (Wolf et al., success rate, which measures the ratio of test ex- 2020). The Transformer uses 4 attention heads amples that a vulnerable example can be found. and 64 hidden size, and obtains 82.1% accuracy. To evaluate the impact of neighborhood size, we The BERT-base uses the default configuration and experiment with two configurations: (1) For the obtains 92.7% accuracy. small neighborhood (k = 2), we use SO-Enum Robust models (first-order). With the same that finds the most similar vulnerable example. (2) setup as base models, we apply robust training For the large neighborhood (k = 6), SO-Enum is methods to improve the resistance to word substitu- not applicable and we use SO-Beam to find vul- tion attacks. Jia et al. (2019) provide a provably ro- nerable examples. We consider the most challeng- bust training method through Interval Bound Prop- ing setup and use patch words p from the same agation (IBP, Dvijotham et al. 2018) for all word set of counter-fitted synonyms as robust models substitutions on BoW, CNN and LSTM. Xu et al. (they are provably robust to these synonyms on (2020) provide a provably robust training method the test set). We also provide a random baseline on general computational graphs through a combi- to validate the effectiveness of minimizing Eq. (4) nation of forward and backward linear bound prop- (Appendix A.1). agation, and the resulting 3-layer Transformer is robust to up to 6 word substitutions. For both works Quality metrics (perplexity and similarity). 
we use the same set of counter-fitted synonyms pro- We quantify the quality of our constructed vulnera- vided in Jia et al. (2019). We skip BERT-base due ble examples through two metrics: (1) GPT-2 (Rad- to the lack of an effective robust training method. ford et al., 2019) perplexity quantifies the natural- Attack success rate (first-order). We quantify ness of a sentence (smaller is better). We report first-order robustness through attack success rate, the perplexity for both the original input exam- which measures the ratio of test examples that an ples and the constructed vulnerable examples. (2) adversarial example can be found. We use first- `0 norm distance quantifies the disparity between order attacks as a reference due to the lack of a two sentences (smaller is better). We report the direct baseline. We experiment with two black-box distance between the input and the vulnerable ex- attacks: (1) The Genetic attack (Alzantot et al., ample. Note that first-order attacks have different 2018; Jia et al., 2019) uses a population-based op- objectives and thus cannot be compared directly. 3903
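For reference, the two quality metrics can be computed roughly as follows. The sketch uses HuggingFace GPT-2 for perplexity and assumes the two sentences are whitespace-aligned for the ℓ0 distance; both are simplifying assumptions, not the paper's exact evaluation script.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Rough sketch of the quality metrics: GPT-2 perplexity (naturalness) and
# l0 distance between two equal-length, whitespace-aligned sentences.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the average
        # cross-entropy over the predicted tokens.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def l0_distance(x: str, y: str) -> int:
    xt, yt = x.split(), y.split()
    assert len(xt) == len(yt), "sentences are assumed to be token-aligned"
    return sum(a != b for a, b in zip(xt, yt))
```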
Attack Success Rate (%) SO-Enum SO-Beam Genetic BAE SO-Enum SO-Beam Original Perturb Original Perturb `0 `0 PPL PPL PPL PPL Base Models: BoW 57.0 69.7 95.3 99.7 Base Models: CNN 62.0 71.0 95.3 99.8 BoW 168 202 1.1 166 202 1.2 LSTM 60.0 68.3 95.8 99.5 CNN 170 204 1.1 166 201 1.2 LSTM 168 204 1.1 166 204 1.2 Transformer 73.0 74.3 95.4 98.0 Transformer 165 193 1.0 165 195 1.1 BERT-base 41.0 61.5 94.3 98.7 BERT-base 170 229 1.3 168 222 1.4 Robust Models: Robust Models: BoW 28.0 63.1 81.5 88.4 BoW 170 212 1.2 171 222 1.4 CNN 23.0 64.4 91.0 96.0 CNN 166 209 1.2 168 210 1.3 LSTM 24.0 61.0 62.9 77.5 LSTM 194 251 1.3 185 260 1.8 Transformer 56.0 71.6 91.2 96.2 Transformer 170 213 1.2 165 208 1.3 Table 2: The average rates over 872 examples (100 for Table 3: The quality metrics for second-order meth- Genetic due to long running time). Second-order at- ods. We report the median perplexity (PPL) and av- tacks achieve higher successful rate since they are able erage `0 norm distance. The original PPL may differ to search beyond the test set. across models since we only count successful attacks. 3.3.2 Results nerable examples beyond the test set, and the search We experiment with the validation split (872 exam- is not required to maintain semantic similarity. Our ples) on a single RTX 3090. The average running methods provide a way to further investigate the time per example (in seconds) on base LSTM is robustness (or find vulnerable and adversarial ex- 31.9 for Genetic, 1.1 for BAE, 7.0 for SO-Enum amples) even when the model is robust to the test (k = 2), and 1.9 for SO-Beam (k = 6). We provide set. additional running time results in Appendix A.3. Quality of constructed vulnerable examples. Table 1 provides an example of the attack result As shown in Table 3, second-order attacks are able where all attacks are successful (additional exam- to construct vulnerable examples by perturbing 1.3 ples in Appendix A.5). As shown, our second- words on average, with a slightly increased per- order attacks find a vulnerable example by replac- plexity. For instance, on the robustly trained CNN ing grease → musicals, and the vulnerable and Transformer, SO-Beam constructs vulnerable example has different predictions for bad and un- examples by perturbing 1.3 words on average, with healthy. Note that, Genetic and BAE have dif- the median5 perplexity increased from around 165 ferent objectives from second-order attacks and to around 210. We provide metrics for first-order focus on finding the adversarial example. Next we attacks in Appendix A.5 as they have different ob- discuss the results from two perspectives. jectives and are not directly comparable. Second-order robustness. We observe that ex- Furthermore, applying existing attacks on the isting robustly trained models are not second-order vulnerable examples constructed by our method robust. As shown in Table 2, our second-order will lead to much smaller perturbations. As a refer- attacks attain high success rates not only on the ence, on the robustly trained CNN, Genetic attack base models but also on the robustly trained mod- constructs adversarial examples by perturbing 2.7 els. For instance, on the robustly trained CNN words on average (starting from the input exam- and Transformer, SO-Beam finds vulnerable ex- ples). 
However, if Genetic starts from our vulnera- amples within a small neighborhood for around ble examples, it would only need to perturb a single 96.0% of the test examples, even though these mod- word (i.e., the patch words p) to alter the predic- els have improved resistance to strong first-order tion. These results demonstrate the weakness of attacks (success rates drop from 62.0%–74.3% to the models (even robustly trained) for those inputs 23.0%–71.6% for Genetic and BAE).4 This phe- beyond the test set. nomenon can be explained by the fact that both first- 3.3.3 Human Evaluation order attacks and robust training methods focus on We perform human evaluation on the examples con- synonym substitutions on the test set, whereas our structed by SO-Beam. Specifically, we randomly attacks, due to their second-order nature, find vul- 5 We report median due to the unreasonably large perplexity 4 BAE is more effective on robust models as it may use on certain sentences. e.g., 395 for that’s a cheat. but replacement words outside the counter-fitted synonyms. 6740 for that proves perfect cheat. 3904
Naturalness (1-5) Semantic Similarity (%) x x⊕p x x⊕p Original Perturb Original Perturb 3.87 3.63 85 71 Table 4: The quality metrics from human evaluation. select 100 successful attacks and evaluate both the original examples and the vulnerable examples. To Figure 4: An illustration of an unbiased model vs. a evaluate the naturalness of the constructed exam- biased model. Green and gray indicate the probability ples, we ask the annotators to score the likelihood of positive and negative predictions, respectively. Left: (on a Likert scale of 1-5, 5 to be the most likely) of An unbiased model where the (x, x ⊕ p) pair (yellow- being an original example based on the grammar red dots) is relatively parallel to the decision boundary. correctness. To evaluate the semantic similarity af- Right: A biased model where the predictions for x ⊕ p (red) are usually more negative (gray) than x (yellow). ter applying the synonym substitution p, we ask the annotators to predict the sentiment of each example, and calculate the ratio of examples that maintain expected predictions for protected groups are dif- the same sentiment prediction after the synonym ferent (assuming the model is not intended to dis- substitution. For both metrics, we take the median criminate between these groups). For instance, a from 3 independent annotations. We use US-based sentiment classifier is biased if the expected predic- annotators on Amazon’s Mechanical Turk6 and pay tion for inputs containing woman is more positive $0.03 per annotation, and expect each annotation to (or negative) than inputs containing man. Such bias take 10 seconds on average (effectively, the hourly is harmful as they may make unfair decisions based rate is about $11). See Appendix A.2 for more on protected attributes, for example in situations details. such as hiring and college admission. As shown in Table 4, the naturalness score only Counterfactual token bias. We study a narrow drop slightly after the perturbation, indicating that case of counterfactual bias, where counterfactual our constructed vulnerable examples have similar examples are constructed by substituting protected naturalness as the original examples. As for the tokens in the input. A naive approach of measuring semantic similarity, we observe that 85% of the this bias is to construct counterfactual examples original examples maintain the same meaning after directly from the test set, however such evaluation the synonym substitution, and the corresponding may not be robust since test examples are only a ratio is 71% for vulnerable examples. This indi- small subset of natural sentences. Formally, let p cates that the synonym substitution is an invariance be a pair of protected tokens such as (he, she) or transformation for most examples. (Asian, American), Xtest ⊆ Xp be a test set (as in §2.1), we define counterfactual token bias by: 4 Evaluating Counterfactual Bias Bp,k := E Fsoft (x; p). (5) In addition to evaluating second-order robustness, x∈Neighbork (Xtest ) we further extend the double perturbation frame- work (§2) to evaluate counterfactual biases by set- We calculate Eq. (5) through an enumeration across ting p to pairs of protected tokens. We show that all natural sentences withinS the neighborhood.7 our method can reveal the hidden model bias. 
Here Neighbork (Xtest ) = x∈Xtest Neighbork (x) denotes the union of neighborhood examples (of 4.1 Counterfactual Bias distance k) around the test set, and Fsoft (x; p) : In contrast to second-order robustness, where we X × V 2 → [−1, 1] denotes the difference between consider the model vulnerable as long as there ex- probability outputs fsoft (similar to Eq. (2)): ists one vulnerable example, counterfactual bias focuses on the expected prediction, which is the Fsoft (x; p) := fsoft (x ⊕ p) − fsoft (x). (6) average prediction among all examples within the 7 For gender bias, we employ a blacklist to avoid adding neighborhood. We consider a model biased if the gendered tokens during the neighborhood construction. This is to avoid semantic shift when, for example, p = (he, she) 6 https://www.mturk.com such that it may refer to different tokens after the substitution. 3905
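The enumeration behind Eq. (5)-(6) can be written as a short script like the one below, reusing the neighbor() and apply_patch() sketches from earlier; f_soft, the test set, and the protected-token pair are placeholders, so this only illustrates the computation rather than reproducing the authors' evaluation code.

```python
# Rough sketch of the counterfactual token bias B_{p,k} in Eq. (5)-(6),
# reusing `neighbor` and `apply_patch` from the earlier sketches.
# `f_soft`, `test_set`, and `p` are placeholders / assumptions.
def counterfactual_token_bias(test_set, p, f_soft, k=3):
    diffs = []
    for x0 in test_set:
        if x0.split().count(p[0]) != 1:      # keep sentences with one occurrence of p[0]
            continue
        for x in neighbor(x0, k):
            if x.split().count(p[0]) != 1:   # the perturbation may have touched p[0]
                continue
            # Eq. (6): difference of probability outputs under the substitution.
            diffs.append(f_soft(apply_patch(x, p)) - f_soft(x))
    # Eq. (5): expectation over the constructed neighborhood.
    return sum(diffs) / len(diffs) if diffs else 0.0
```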
Patch Words # Original # Perturbed he,she 5 325,401 his,her 4 255,245 him,her 4 233,803 men,women 3 192,504 man,woman 3 222,981 actor,actress 2 141,780 ... Total 34 2,317,635 Table 5: The number of original examples (k = 0) and the number of perturbed examples (k = 3) in Xfilter . Figure 5: Our proposed Bp,k measured on Xfilter . Here “original” is equivalent to k = 0, “perturbed” is equiva- lent to k = 3, p is in the form of (male, female). The model is unbiased on p if Bp,k ≈ 0, whereas a positive or negative Bp,k indicates that the model shows preference or against to p(2) , respectively. Metrics. We evaluate model bias through the Fig. 4 illustrates the distribution of (x, x ⊕ p) for proposed Bp,k for k = 0, . . . , 3. Here the bias for both an unbiased model and a biased model. k = 0 is effectively measured on the original test The aforementioned neighborhood construction set, and the bias for k ≥ 1 is measured on our does not introduce additional bias. For instance, constructed neighborhood. We randomly sample a let x0 be a sentence containing he, even though subset of constructed examples when k = 3 due to it is possible for Neighbor1 (x0 ) to contain many the exponential complexity. stereotyping sentences (e.g., contains tokens such Filtered test set. To investigate whether our as doctor and driving) that affect the distri- method is able to reveal model bias that was hidden bution of fsoft (x), but it does not bias Eq. (6) as in the test set, we construct a filtered test set on we only care about the prediction difference of re- which the bias cannot be observed directly. Let placing he → she. The construction has no infor- Xtest be the original validation split, we construct mation about the model objective, thus it would be Xfilter by the equation below and empirically set difficult to bias fsoft (x) and fsoft (x ⊕ p) differently. = 0.005. We provide statistics in Table 5. 4.2 Experimental Results Xfilter := {x | |Fsoft (x; p)| < , x ∈ Xtest }. In this section, we use gender bias as a running 4.2.2 Results example, and demonstrate the effectiveness of our method by revealing the hidden model bias. We Our method is able to reveal the hidden model provide additional results in Appendix A.4. bias on Xfilter , which is not visible with naive mea- surements. In Fig. 5, the naive approach (k = 0) 4.2.1 Setup observes very small biases on most tokens (as con- We evaluate counterfactual token bias on the SST-2 structed). In contrast, when evaluated by our dou- dataset with both the base and debiased models. ble perturbation framework (k = 3), we are able to We focus on binary gender bias and set p to pairs observe noticeable bias, where most p has a pos- of gendered pronouns from Zhao et al. (2018a). itive bias on the base model. This observed bias Base Model. We train a single layer LSTM with is in line with the measurements on the original pre-trained GloVe embeddings and 75 hidden size Xtest (Appendix A.4), indicating that we reveal the (from TextAttack, Morris et al. 2020). The model correct model bias. Furthermore, we observe miti- has 82.9% accuracy similar to the baseline perfor- gated biases in the debiased model, which demon- mance reported in GLUE. strates the effectiveness of data augmentation. Debiased Model. Data-augmentation with gen- To demonstrate how our method reveals hidden der swapping has been shown effective in mitigat- bias, we conduct a case study with p = (actor, ing gender bias (Zhao et al., 2018a, 2019). 
We actress) and show the relationship between the augment the training split by swapping all male bias Bp,k and the neighborhood distance k. We entities with the corresponding female entities and present the histograms for Fsoft (x; p) in Fig. 6 and vice-versa. We use the same setup as the base plot the corresponding Bp,k vs. k in the right-most LSTM and attain 82.45% accuracy. panel. Surprisingly, for the base model, the bias is 3906
Figure 6: Left and Middle: Histograms for Fsoft (x; p) (x-axis) with p = (actor, actress). Right: The plot for the average Fsoft (x; p) (i.e., counterfactual token bias) vs. neighborhood distance k. Results show that the counterfactual bias on p can be revealed when increasing k. negative when k = 0, but becomes positive when image transformations can be used, in our natural k = 3. This is because the naive approach only has language setting, generating a valid example be- two test examples (Table 5) thus the measurement yond test set is more challenging because language is not robust. In contrast, our method is able to semantics and grammar must be maintained. construct 141,780 similar natural sentences when Counterfactual fairness. Kusner et al. (2017) k = 3 and shifts the distribution to the right (posi- propose counterfactual fairness and consider a tive). As shown in the right-most panel, the bias is model fair if changing the protected attributes does small when k = 1, and becomes more significant not affect the distribution of prediction. We fol- as k increases (larger neighborhood). As discussed low the definition and focus on evaluating the in §4.1, the neighborhood construction does not counterfactual bias between pairs of protected to- introduce additional bias, and these results demon- kens. Existing literature quantifies fairness on a strate the effectiveness of our method in revealing test dataset or through templates (Feldman et al., hidden model bias. 2015; Kiritchenko and Mohammad, 2018; May et al., 2019; Huang et al., 2020). For instance, 5 Related Work Garg et al. (2019) quantify the absolute counter- factual token fairness gap on the test set; Prab- First-order robustness evaluation. A line hakaran et al. (2019) study perturbation sensitivity of work has been proposed to study the vul- for named entities on a given set of corpus. Wallace nerability of natural language models, through et al. (2019b); Sheng et al. (2019, 2020) study how transformations such as character-level perturba- language generation models respond differently to tions (Ebrahimi et al., 2018), word-level pertur- prompt sentences containing mentions of different bations (Jin et al., 2019; Ren et al., 2019; Yang demographic groups. In contrast, our method quan- et al., 2020; Hsieh et al., 2019; Cheng et al., 2020; tifies the bias on the constructed neighborhood. Li et al., 2020), prepending or appending a se- quence (Jia and Liang, 2017; Wallace et al., 2019a), and generative models (Zhao et al., 2018b). They 6 Conclusion focus on constructing adversarial examples from the test set that alter the prediction, whereas our This work proposes the double perturbation frame- methods focus on finding vulnerable examples be- work to identify model weaknesses beyond the test yond the test set whose prediction can be altered. dataset, and study a stronger notion of robustness Robustness beyond the test set. Several works and counterfactual bias. We hope that our work have studied model robustness beyond test sets but can stimulate the research on further improving the mostly focused on computer vision tasks. Zhang robustness and fairness of natural language models. et al. (2019) demonstrate that a robustly trained model could still be vulnerable to small perturba- Acknowledgments tions if the input comes from a distribution only slightly different than a normal test set (e.g., im- We thank anonymous reviewers for their helpful ages with slightly different contrasts). Hendrycks feedback. 
We thank UCLA-NLP group for the and Dietterich (2019) study more sources of com- valuable discussions and comments. The research mon corruptions such as brightness, motion blur is supported NSF #1927554, #1901527, #2008173 and fog. Unlike in computer vision where simple and #2048280 and an Amazon Research Award. 3907
Ethical Considerations 56th Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers), pages Intended use. One primary goal of NLP models 31–36, Melbourne, Australia. Association for Com- is the generalization to real-world inputs. However, putational Linguistics. existing test datasets and templates are often not Michael Feldman, Sorelle A Friedler, John Moeller, comprehensive, and thus it is difficult to evaluate Carlos Scheidegger, and Suresh Venkatasubrama- real-world performance (Recht et al., 2019; Ribeiro nian. 2015. Certifying and removing disparate im- et al., 2020). Our work sheds a light on quantifying pact. In proceedings of the 21th ACM SIGKDD in- ternational conference on knowledge discovery and performance for inputs beyond the test dataset and data mining, pages 259–268. help uncover model weaknesses prior to the real- world deployment. Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfac- Misuse potential. Similar to other existing adver- tual fairness in text classification through robustness. sarial attack methods (Ebrahimi et al., 2018; Jin In Proceedings of the 2019 AAAI/ACM Conference et al., 2019; Zhao et al., 2018b), our second-order on AI, Ethics, and Society, AIES ’19, page 219–226, attacks can be used for finding vulnerable exam- New York, NY, USA. Association for Computing Machinery. ples to a NLP system. Therefore, it is essential to study how to improve the robustness of NLP Siddhant Garg and Goutham Ramakrishnan. 2020. models against second-order attacks. BAE: BERT-based adversarial examples for text Limitations. While the core idea about the dou- classification. In Proceedings of the 2020 Confer- ence on Empirical Methods in Natural Language ble perturbation framework is general, in §4, we Processing (EMNLP), pages 6174–6181, Online. As- consider only binary gender in the analysis of coun- sociation for Computational Linguistics. terfactual fairness due to the restriction of the En- Dan Hendrycks and Thomas Dietterich. 2019. Bench- glish corpus we used, which only have words asso- marking neural network robustness to common cor- ciated with binary gender such as he/she, wait- ruptions and perturbations. In International Confer- er/waitress, etc. ence on Learning Representations. Yu-Lun Hsieh, Minhao Cheng, Da-Cheng Juan, Wei Wei, Wen-Lian Hsu, and Cho-Jui Hsieh. 2019. On References the robustness of self-attentive models. In Proceed- Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, ings of the 57th Annual Meeting of the Association Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. for Computational Linguistics, pages 1520–1529. 2018. Generating natural language adversarial ex- Kuan-Hao Huang and Kai-Wei Chang. 2021. Generat- amples. In Proceedings of the 2018 Conference on ing syntactically controlled paraphrases without us- Empirical Methods in Natural Language Processing, ing annotated parallel pairs. In EACL. pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics. Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krish- Minhao Cheng, Jinfeng Yi, Pin-Yu Chen, Huan Zhang, namurthy Dvijotham, and Pushmeet Kohli. 2019. and Cho-Jui Hsieh. 2020. Seq2sick: Evaluating the Achieving verified robustness to symbol substitu- robustness of sequence-to-sequence models with ad- tions via interval bound propagation. Proceedings of versarial examples. 
Proceedings of the AAAI Confer- the 2019 Conference on Empirical Methods in Nat- ence on Artificial Intelligence, 34(04):3601–3608. ural Language Processing and the 9th International Joint Conference on Natural Language Processing Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, (EMNLP-IJCNLP). and Lucy Vasserman. 2018. Measuring and mitigat- ing unintended bias in text classification. In Pro- Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stan- ceedings of the 2018 AAAI/ACM Conference on AI, forth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Ethics, and Society, AIES ’18, page 67–73, New Yogatama, and Pushmeet Kohli. 2020. Reducing York, NY, USA. Association for Computing Machin- sentiment bias in language models via counterfac- ery. tual evaluation. Findings in EMNLP. Krishnamurthy Dvijotham, Sven Gowal, Robert Stan- Mohit Iyyer, J. Wieting, Kevin Gimpel, and Luke forth, Relja Arandjelovic, Brendan O’Donoghue, Zettlemoyer. 2018. Adversarial example generation Jonathan Uesato, and Pushmeet Kohli. 2018. Train- with syntactically controlled paraphrase networks. ing verified learners with learned verifiers. ArXiv, abs/1804.06059. Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Robin Jia and Percy Liang. 2017. Adversarial ex- Dou. 2018. HotFlip: White-box adversarial exam- amples for evaluating reading comprehension sys- ples for text classification. In Proceedings of the tems. In Proceedings of the 2017 Conference on 3908
Empirical Methods in Natural Language Processing, Jeffrey Pennington, Richard Socher, and Christopher D. EMNLP 2017, Copenhagen, Denmark, September 9- Manning. 2014. Glove: Global vectors for word rep- 11, 2017, pages 2021–2031. Association for Compu- resentation. In Empirical Methods in Natural Lan- tational Linguistics. guage Processing (EMNLP), pages 1532–1543. Robin Jia, Aditi Raghunathan, Kerem Göksel, and Vinodkumar Prabhakaran, Ben Hutchinson, and Mar- Percy Liang. 2019. Certified robustness to adversar- garet Mitchell. 2019. Perturbation sensitivity analy- ial word substitutions. In EMNLP/IJCNLP. sis to detect unintended model biases. In Proceed- ings of the 2019 Conference on Empirical Methods Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter in Natural Language Processing and the 9th Inter- Szolovits. 2019. Is bert really robust? a strong base- national Joint Conference on Natural Language Pro- line for natural language attack on text classification cessing (EMNLP-IJCNLP), pages 5740–5745, Hong and entailment. Kong, China. Association for Computational Lin- Svetlana Kiritchenko and Saif Mohammad. 2018. Ex- guistics. amining gender and race bias in two hundred sen- Alec Radford, Jeff Wu, Rewon Child, David Luan, timent analysis systems. In Proceedings of the Dario Amodei, and Ilya Sutskever. 2019. Language Seventh Joint Conference on Lexical and Compu- models are unsupervised multitask learners. tational Semantics, pages 43–53, New Orleans, Louisiana. Association for Computational Linguis- Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, tics. and Vaishaal Shankar. 2019. Do ImageNet clas- Matt J Kusner, Joshua Loftus, Chris Russell, and Ri- sifiers generalize to ImageNet? volume 97 of cardo Silva. 2017. Counterfactual fairness. In Proceedings of Machine Learning Research, pages I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, 5389–5400, Long Beach, California, USA. PMLR. R. Fergus, S. Vishwanathan, and R. Garnett, editors, Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. Advances in Neural Information Processing Systems 2019. Generating natural language adversarial ex- 30, pages 4066–4076. Curran Associates, Inc. amples through probability weighted word saliency. Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, In Proceedings of the 57th Annual Meeting of the and Xipeng Qiu. 2020. BERT-ATTACK: Adversar- Association for Computational Linguistics, pages ial attack against BERT using BERT. In Proceed- 1085–1097, Florence, Italy. Association for Compu- ings of the 2020 Conference on Empirical Methods tational Linguistics. in Natural Language Processing (EMNLP), pages 6193–6202, Online. Association for Computational Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Linguistics. and Sameer Singh. 2020. Beyond accuracy: Behav- ioral testing of nlp models with checklist. In Associ- Chandler May, Alex Wang, Shikha Bordia, Samuel R. ation for Computational Linguistics (ACL). Bowman, and Rachel Rudinger. 2019. On measur- ing social biases in sentence encoders. In Proceed- Victor Sanh, Lysandre Debut, Julien Chaumond, and ings of the 2019 Conference of the North American Thomas Wolf. 2019. Distilbert, a distilled version Chapter of the Association for Computational Lin- of bert: smaller, faster, cheaper and lighter. ArXiv, guistics: Human Language Technologies, Volume 1 abs/1910.01108. (Long and Short Papers), pages 622–628, Minneapo- lis, Minnesota. Association for Computational Lin- Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, guistics. and Nanyun Peng. 2019. 
The woman worked as a babysitter: On biases in language generation. In Takeru Miyato, Andrew M. Dai, and Ian Goodfel- EMNLP. low. 2017. Adversarial training methods for semi- supervised text classification. ICLR. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2020. Towards controllable bi- John X. Morris, Eli Lifland, Jin Yong Yoo, Jake ases in language generation. In EMNLP-Finding. Grigsby, Di Jin, and Yanjun Qi. 2020. Textattack: A framework for adversarial attacks, data augmenta- Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie tion, and adversarial training in nlp. Huang, and Cho-Jui Hsieh. 2020. Robustness verifi- cation for transformers. In International Conference Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thom- on Learning Representations. son, Milica Gašić, Lina M. Rojas-Barahona, Pei- Hao Su, David Vandyke, Tsung-Hsien Wen, and Richard Socher, Alex Perelygin, Jean Wu, Jason Steve Young. 2016. Counter-fitting word vectors to Chuang, Christopher D. Manning, Andrew Ng, and linguistic constraints. In Proceedings of the 2016 Christopher Potts. 2013. Recursive deep models Conference of the North American Chapter of the for semantic compositionality over a sentiment tree- Association for Computational Linguistics: Human bank. In Proceedings of the 2013 Conference on Language Technologies, pages 142–148, San Diego, Empirical Methods in Natural Language Processing, California. Association for Computational Linguis- pages 1631–1642, Seattle, Washington, USA. Asso- tics. ciation for Computational Linguistics. 3909
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gard- Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018b. ner, and Sameer Singh. 2019a. Universal adver- Generating natural adversarial examples. In Interna- sarial triggers for attacking and analyzing nlp. In tional Conference on Learning Representations. EMNLP/IJCNLP. Yi Zhou, Xiaoqing Zheng, Cho-Jui Hsieh, Kai-Wei Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gard- Chang, and Xuanjing Huang. 2020. Defense against ner, and Sameer Singh. 2019b. Universal adversar- adversarial attacks in nlp via dirichlet neighborhood ial triggers for attacking and analyzing NLP. In ensemble. arXiv preprint arXiv:2006.11627. EMNLP. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow- icz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Huggingface’s transformers: State-of-the-art natural language processing. Kaidi Xu, Zhouxing Shi, Huan Zhang, Yihan Wang, Kai-Wei Chang, Minlie Huang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh. 2020. Automatic per- turbation analysis for scalable certified robustness and beyond. Puyudi Yang, Jianbo Chen, Cho-Jui Hsieh, Jane-Ling Wang, and Michael I. Jordan. 2020. Greedy attack and gumbel attack: Generating adversarial examples for discrete data. Journal of Machine Learning Re- search, 21(43):1–36. Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. 2020. Word-level textual adversarial attacking as combina- torial optimization. Proceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics. Huan Zhang, Hongge Chen, Zhao Song, Duane Boning, inderjit dhillon, and Cho-Jui Hsieh. 2019. The lim- itations of adversarial training and the blind-spot at- tack. In International Conference on Learning Rep- resentations. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cot- terell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In NAACL. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Or- donez, and Kai-Wei Chang. 2018a. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, New Orleans, Louisiana. Association for Computa- tional Linguistics. 3910
A Supplemental Material Running Time (seconds) Genetic BAE SO-Enum SO-Beam A.1 Random Baseline Base Models: BoW 31.6 0.9 6.2 1.8 CNN 28.8 1.0 5.9 1.7 To validate the effectiveness of minimizing Eq. (4), LSTM 31.9 1.1 7.0 1.9 we also experiment on a second-order baseline that Transformer 51.9 0.5 6.5 2.5 constructs vulnerable examples by randomly re- BERT-base 65.6 1.1 35.4 7.1 Robust Models: placing up to 6 words. We use the same masked BoW 103.9 1.0 8.0 3.5 language model and threshold as SO-Beam such CNN 129.4 1.0 6.7 2.6 that they share a similar neighborhood. We per- LSTM 116.4 1.1 10.7 5.3 form the attack on the same models as Table 2, and Transformer 66.4 0.5 5.9 2.6 the attack success rates on robustly trained BoW, Table 6: The average running time over 872 examples CNN, LSTM, and Transformers are 18.8%, 22.3%, (100 for Genetic due to long running time). 15.2%, and 25.1%, respectively. Despite being a second-order attack, the random baseline has low attack success rates thus demonstrates the effective- A.4 Additional Results on Protected Tokens ness of SO-Beam. Fig. 7 presents the experimental results with ad- ditional protected tokens such as nationality, reli- A.2 Human Evaluation gion, and sexual orientation (from Ribeiro et al. (2020)). We use the same base LSTM as de- We randomly select 100 successful attacks from scribed in §4.2. One interesting observation is SO-Beam and consider four types of examples (for when p = (gay, straight) where the bias is a total of 400 examples): The original examples negative, indicating that the sentiment classifier with and without synonym substitution p, and the tends to give more negative prediction when sub- vulnerable examples with and without synonym stituting gay → straight in the input. This substitution p. For each example, we annotate the phenomenon is opposite to the behavior of toxi- naturalness and sentiment separately as described city classifiers (Dixon et al., 2018), and we hy- below. pothesize that it may be caused by the different Naturalness of vulnerable examples. We ask distribution of training data. To verify the hypoth- the annotators to score the likelihood of being an esis, we count the number of training examples original example (i.e., not altered by computer) containing each word, and observe that we have based on grammar correctness and naturalness, far more negative examples than positive examples with a Likert scale of 1-5: (1) Sure adversarial among those containing straight (Table 7). Af- example. (2) Likely an adversarial example. (3) ter looking into the training set, it turns out that Neutral. (4) Likely an original example. (5) Sure straight to video is a common phrase to original example. criticize a film, thus the classifier incorrectly cor- Semantic similarity after the synonym substitu- relates straight with negative sentiment. This tion. We first ask the annotators to predict the also reveals the limitation of our method on polyse- sentiment on a Likert scale of 1-5, and then map mous words. the prediction to three categories: negative, neutral, and positive. We consider two examples to have # Negative # Positive the same semantic meaning if and only if they are gay 37 20 both positive or negative. straight 71 18 Table 7: Number of negative and positive examples A.3 Running Time containing gay and straight in the training set. We experiment with the validation split on a single RTX 3090, and measure the average running time In Fig. 8, we measure the bias on Xtest and ob- per example. 
As shown in Table 6, SO-Beam runs serve positive bias on most tokens for both k = 0 faster than SO-Enum since it utilizes the probability and k = 3, which indicates that the model “tends” output. The running time may increase if the model to make more positive predictions for examples has improved second-order robustness. containing certain female pronouns than male pro- 3911
Genetic BAE Original Perturb Original Perturb `0 `0 PPL PPL PPL PPL Base Models: BoW 145 258 3.3 192 268 1.6 CNN 146 282 3.0 186 254 1.5 LSTM 131 238 2.9 190 263 1.6 Transformer 137 232 2.8 185 254 1.4 BERT-base 201 342 3.4 189 277 1.6 Robust Models: BoW 132 177 2.4 214 269 1.5 CNN 136 236 2.7 211 279 1.5 LSTM 163 267 2.5 220 302 1.6 Transformer 118 200 2.8 196 261 1.4 Table 8: The quality metrics for first-order attacks from successful attacks. We compare median perplexities (PPL) and average `0 norm distances. Figure 7: Additional counterfactual token bias mea- sured on the original validation split with base LSTM. SO-Beam on base LSTM, and Table 12 shows addi- tional attack results from SO-Beam on robust CNN. nouns. Notice that even though gender swap miti- Bold words are the corresponding patch words p, gates the bias to some extent, it is still difficult to taken from the predefined list of counter-fitted syn- fully eliminate the bias. This is probably caused onyms. by tuples like (him, his, her) which cannot be swapped perfectly, and requires additional process- ing such as part-of-speech resolving (Zhao et al., 2018a). Figure 8: Full results for gendered tokens measured on the original validation split. To help evaluate the naturalness of our con- structed examples used in §4, we provide sample sentences in Table 9 and Table 10. Bold words are the corresponding patch words p, taken from the predefined list of gendered pronouns. A.5 Additional Results on Robustness Table 8 provides the quality metrics for first-order attacks, where we measure the GPT-2 perplexity and `0 norm distance between the input and the adversarial example. For BAE we evaluate on 872 validation examples, and for Genetic we evaluate on 100 validation examples due to the long running time. Table 11 shows additional attack results from 3912
Type Predictions Text Original 95% Negative 94% Negative it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who is out of their depth . Distance k = 1 97% Negative (97% Negative) it ’s hampered by a lifetime-channel kind of plot and lone lead actor (actress) who is out of their depth . 56% Negative (55% Positive ) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who is out of creative depth . 89% Negative (84% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who talks out of their depth . 98% Negative (98% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who is out of production depth . 96% Negative (96% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) that is out of their depth . Distance k = 2 88% Negative (87% Negative) it ’s hampered by a lifetime-channel cast of stars and a lead actor (ac- tress) who is out of their depth . 96% Negative (95% Negative) it ’s hampered by a simple set of plot and a lead actor (actress) who is out of their depth . 54% Negative (54% Negative) it ’s framed about a lifetime-channel kind of plot and a lead actor (ac- tress) who is out of their depth . 90% Negative (88% Negative) it ’s hampered by a lifetime-channel mix between plot and a lead actor (actress) who is out of their depth . 78% Negative (68% Negative) it ’s hampered by a lifetime-channel kind of plot and a lead actor (ac- tress) who storms out of their mind . Distance k = 3 52% Positive (64% Positive ) it ’s characterized by a lifetime-channel combination comedy plot and a lead actor (actress) who is out of their depth . 93% Negative (93% Negative) it ’s hampered by a lifetime-channel kind of star and a lead actor (ac- tress) who falls out of their depth . 58% Negative (57% Negative) it ’s hampered by a tough kind of singer and a lead actor (actress) who is out of their teens . 70% Negative (52% Negative) it ’s hampered with a lifetime-channel kind of plot and a lead actor (actress) who operates regardless of their depth . 58% Negative (53% Positive ) it ’s hampered with a lifetime-channel cast of plot and a lead actor (actress) who is out of creative depth . Table 9: Additional counterfactual bias examples on base LSTM with p = (actor, actress). We only present 5 examples per k due to space constrain. 3913
You can also read