Towards Variable-Length Textual Adversarial Attacks
arXiv:2104.08139v1 [cs.CL] 16 Apr 2021

Junliang Guo1, Zhirui Zhang2, Linlin Zhang3, Linli Xu1, Boxing Chen2, Enhong Chen1, Weihua Luo2
1 Anhui Province Key Laboratory of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China
2 Alibaba DAMO Academy
3 Zhejiang University
guojunll@mail.ustc.edu.cn, {linlixu,cheneh}@ustc.edu.cn
{zhirui.zzr, zll240651, boxing.cbx, weihua.luowh}@alibaba-inc.com

Abstract

Adversarial attacks have shown the vulnerability of machine learning models; however, it is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data. Most previous approaches conduct attacks with the atomic replacement operation, which usually leads to fixed-length adversarial examples and therefore limits the exploration of the decision space. In this paper, we propose variable-length textual adversarial attacks (VL-Attack) and integrate three atomic operations, namely insertion, deletion and replacement, into a unified framework, by introducing and manipulating a special blank token while attacking. In this way, our approach is able to more comprehensively find adversarial examples around the decision boundary and effectively conduct adversarial attacks. Specifically, our method drops the accuracy of IMDB classification by 96% with only editing 1.3% of the tokens while attacking a pre-trained BERT model. In addition, fine-tuning the victim model with generated adversarial samples can improve the robustness of the model without hurting the performance, especially for length-sensitive models. On the task of non-autoregressive machine translation, our method achieves a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.

[Figure 1: An illustration of the proposed VL-Attack method. Red, underline and strikeout tokens indicate replacements, insertions and deletions respectively. Equipped with the proposed three atomic operations, our method is able to generate adversarial examples that are closer to the true decision boundary, which cannot be reached only by replacement. Examples shown in the figure: Origin: "I rented this movie to relax. But I got a headache."; Replace: "I rented this movie to relax. But I got a migraine."; Ours: "I rented this movie to relax. But I got a episodic migraine instead."]

1 Introduction

While achieving great successes in various domains, machine learning models have been found vulnerable to adversarial examples, i.e., original inputs with small perturbations that are indistinguishable to human knowledge can fool the model and lead to incorrect results (Goodfellow et al., 2014; Kurakin et al., 2016; Zhao et al., 2017). Adversarial attacks and defenses are able to improve the robustness and security (Szegedy et al., 2013; Madry et al., 2017) as well as to explore the interpretability of machine learning models (Ribeiro et al., 2018; Tao et al., 2018).

While having been extensively studied on computer vision models, adversarial attack on natural language processing models is more challenging, as small perturbations of textual data may change its original semantic meaning and thus are perceptible. Most previous works utilize the replacement operation to construct adversarial examples, including elaborately simulating natural noises such as typos by character-level replacement (Ebrahimi et al., 2017; Li et al., 2018) and word-level synonym replacement (Papernot et al., 2016; Alzantot et al., 2018), to ensure the generated examples will not alter the semantics. However, naturally, most adversarial examples could have different lengths from the original sentence, and the replacement operation is only able to generate fixed-length examples and thus produces a limited subset of the total adversarial candidates. We provide an illustration in Figure 1, where replacements cannot generate variable-length adversarial examples.

In fact, an original sentence can be converted to any adversarial candidate with a combination of three one-step atomic operations: replacement, insertion and deletion. Inspired by this observation,
we propose a paradigm of white-box variable-length adversarial attacks to comprehensively explore adversarial examples. The whole attack procedure consists of multiple steps. In each step, we conduct an attack utilizing an operation randomly sampled from the proposed three types of atomic operations. Theoretically, our method is able to reach adversarial examples with any word-level Levenshtein distance (Levenshtein, 1966) to the inputs, therefore covering the full set of adversarial candidates.

We propose tailored attacking methods for the proposed three atomic operations. Specifically, we make insertion and deletion possible by converting them into gradient-based methods with the introduction of a special blank token [BLK]. Given a pre-trained victim model such as BERT, we introduce [BLK] while fine-tuning it on downstream tasks. By randomly inserting the blank token into inputs while keeping the target unchanged, the token is trained as a "blank space" to the model and its occurrence will not affect the original semantic information. Then, we implement insertion and deletion with the help of the blank token. For insertion, we first insert a [BLK] token at a chosen position of the input; then we choose the content to insert by calculating the similarities between the gradients of [BLK] and tokens in the vocabulary. While deleting, we evaluate the importance of all tokens by comparing them with [BLK] in the gradient space, and then delete the most important token of the sentence. For replacement, we follow the gradient-based method utilized in (Ebrahimi et al., 2017; Wallace et al., 2019) and replace a token with the one that maximizes the first-order approximation of the prediction loss.

Besides conducting attacks, our method can also be viewed as a data augmentation method to enhance the robustness and performance of models. Since the proposed method is able to generate variable-length adversarial examples, it is particularly beneficial to length-sensitive models such as non-autoregressive neural machine translation (NAT) (Gu et al., 2017; Guo et al., 2019, 2020a), which requires predicting the target length before decoding. Therefore, in experiments, we evaluate our method on two tasks: natural language understanding (NLU) and NAT with mask-predict decoding (Ghazvininejad et al., 2019). We verify the effectiveness of the proposed method on NLU tasks, where our method successfully attacks the pre-trained BERT (Devlin et al., 2018) model when fine-tuning on downstream tasks. Specifically, after the attack, the model's IMDB classification accuracy drops from 90.4% to 3.5% by only editing 1.3% of the tokens. Human evaluation verifies that the generated adversarial examples are highly accurate from the grammatical, semantic and label-preserving perspectives. On NAT, we show that our method can be effectively used for data augmentation. By integrating the adversarial examples of training samples, our method boosts the performance of the mask-predict model on various machine translation tasks. Specifically, we achieve a 33.18 BLEU score on IWSLT14 German-English translation with an improvement of 1.47 over the baseline model.

Our main contributions can be summarized as follows:
• We propose a variable-length adversarial attack method on textual data, which integrates three atomic operations, including replacement, insertion and deletion, into a unified framework.
• Our method successfully attacks the pre-trained BERT model on NLU tasks with a higher attacking success rate and a lower perturbation rate than previous approaches, while being semantic- and label-preserving according to human judgement.
• Our method can be viewed as a data augmentation method which is specifically beneficial for length-sensitive models, and is able to achieve strong performance on NAT tasks.

2 Related Work

2.1 Textual Adversarial Attack

Adversarial attack has been greatly studied in computer vision, and most works execute the attack with gradient-based perturbation on the continuous space (Szegedy et al., 2013; Goodfellow et al., 2014). For textual adversarial attack, it is much more challenging. We categorize previous works into black-box and white-box attacking, depending on whether the attacker is aware of the gradients of the victim model. For black-box attacking, previous works usually replace words or characters in the original inputs guided by some restrictions to maintain the original semantics, including heuristic priors (Jin et al., 2019; Li et al., 2018; Ren et al., 2019), semantic similarity checking (Li et al., 2020; Cer et al., 2018), extra language models (Alzantot et al., 2018) or a combination of them (Jin et al., 2019; Li et al., 2020; Zang et al., 2020; Morris et al., 2020).
For white-box attacking, most previous works follow the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014) to replace the original input with tokens that are able to drastically increase the prediction loss of the victim model, either at the character level (Ebrahimi et al., 2017), word level (Wallace et al., 2019; Papernot et al., 2016), or phrase level (Liang et al., 2017). In this paper, we focus on white-box attacking and propose a unified framework that integrates the three atomic operations in a novel manner, by introducing and manipulating a blank token [BLK].

2.2 Non-Autoregressive NMT

Non-autoregressive neural machine translation (NAT) (Gu et al., 2017) is proposed to reduce the inference latency of autoregressive machine translation models by generating target tokens in parallel while decoding. Among the follow-up works, mask-predict decoding (Ghazvininejad et al., 2019) is shown effective by achieving comparable performance with autoregressive models while halving the inference latency (Guo et al., 2020b). However, in the mask-predict model, the target length needs to be predicted up front, making the results unstable and sensitive. The training-inference discrepancy is also a key problem (Ghazvininejad et al., 2020).

We show that our method can be viewed as an effective data augmentation method, which is able to improve the translation performance of the NAT model as well as alleviate the problems mentioned above.

3 Methodology

We introduce the proposed Variable-Length textual adversarial Attack (VL-Attack) method in this section. We start with the problem definition.

Problem Definition. Given a pre-trained model such as BERT, after fine-tuning it on the training set of a downstream task T_train, we get the victim model f(·) which will be evaluated on the test set T_test. We denote the task-specific loss function while fine-tuning as L(y, f(x)). Given a test sample (x, y) ∈ T_test where x is the input sequence and y is its target, there should be f(x) = y. While attacking, we aim at generating an adversarial example x^adv that satisfies:

    f(x^adv) ≠ y  and  sim(x, x^adv) > θ,    (1)

where sim(·) indicates the function that measures the semantic similarity between the original inputs and adversarial examples, and θ is the threshold.

3.1 Variable-Length Attack

Our adversarial attack consists of multiple steps, and in each step, the attack is conducted by an atomic operation selected randomly from replacement, insertion and deletion. Formally, denoting M(x) = x^adv as the mapping function from the original input to the adversarial example, it can be defined as:

    M(x) = M_k ∘ M_{k-1} ∘ ··· ∘ M_1(x),  M_i ∈ {insertion, deletion, replacement},    (2)

where k is the number of operation steps as well as the Levenshtein distance between x and x^adv.

Each step of our attack is achieved by introducing and leveraging a special blank token [BLK]. While fine-tuning, given a training sample (x, y) ∈ T_train, we randomly insert a blank token into the given input x = (x_1, x_2, ..., x_n), resulting in x′ = (x_1, ..., x_{i-1}, [BLK], x_i, ..., x_n). Then we fine-tune the model on (x′, y) instead of the original sample (x, y), i.e., train the model with L(y, f(x′)). In this way, the [BLK] token is learned to serve as a blank space, as its occurrence will not affect the semantic information of the original sample. With the help of the blank token, we introduce the proposed three atomic operations as follows. Note that we conduct adversarial attacks at inference time, i.e., on (x, y) ∈ T_test.
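To make the blank-token fine-tuning concrete, below is a minimal sketch of the augmentation step, assuming inputs are already tokenized into wordpiece strings and that [BLK] has been registered in the victim model's vocabulary; the function name and insertion probability are illustrative and not taken from the authors' implementation.

```python
import random

BLK = "[BLK]"  # hypothetical special token, assumed to be added to the victim model's vocabulary

def insert_blank(tokens, p=0.5):
    """Randomly insert one [BLK] token into a tokenized input, keeping the label unchanged.

    Fine-tuning on (insert_blank(x), y) alongside the original (x, y) teaches the model
    to treat [BLK] as a semantic no-op ("blank space").
    """
    if random.random() > p:
        return list(tokens)
    pos = random.randint(0, len(tokens))          # any gap, including both ends
    return tokens[:pos] + [BLK] + tokens[pos:]

# Example augmented fine-tuning input:
print(insert_blank(["i", "rented", "this", "movie", "to", "relax"]))
```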
Replacement. The replacement operation is inspired by the FGSM method (Goodfellow et al., 2014), which leverages the first-order approximation of the loss function. For the i-th token x_i in the original input x, we replace it with the token in the vocabulary that most drastically increases the loss function,

    x_i^adv = argmax_{x_j ∈ V_{x_i}} (e(x_j) − e(x_i))^T ∇_{x_i} L(y, f(x)),    (3)

where e(x_i) indicates the embedding of x_i, and V_{x_i} = top-k{P_lm(x_j | x_i)} is a subset of the whole vocabulary V, refined by a language model P_lm. In our setting, we directly use the victim BERT model as the language model for simplicity.
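A possible way to compute the scores of Eq. (3) with PyTorch is sketched below. It assumes the gradient of the task loss with respect to the input embeddings has already been obtained from a backward pass through the victim model; all tensor and function names are placeholders, and the toy usage substitutes random tensors for a real model.

```python
import torch

def replacement_scores(emb_matrix, x_ids, grads, i, candidate_ids):
    """Eq. (3) scores: (e(x_j) - e(x_i))^T * grad_i for each candidate token x_j.

    emb_matrix:    [V, d] token embedding table of the victim model
    x_ids:         [n]    token ids of the current input x
    grads:         [n, d] gradient of the task loss w.r.t. the input embeddings
    i:             position of the token to be replaced
    candidate_ids: [k]    ids of the top-k candidates proposed by the language model
    """
    diff = emb_matrix[candidate_ids] - emb_matrix[x_ids[i]]   # [k, d]
    return diff @ grads[i]                                    # [k]; larger = larger loss increase

# Toy usage with random tensors standing in for a real victim model and its gradients.
V, d, n, k = 100, 16, 6, 10
emb = torch.randn(V, d)
x_ids = torch.randint(0, V, (n,))
grads = torch.randn(n, d)                 # in practice obtained via loss.backward()
cands = torch.randint(0, V, (k,))
best_token_id = cands[replacement_scores(emb, x_ids, grads, i=2, candidate_ids=cands).argmax()]
```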
Insertion. The insertion operation can be divided into two steps. Firstly, we construct a sample x′ that is semantically identical to x but one token longer, i.e., |x′| = |x| + 1 and sim(x, x^adv) > θ. We achieve this by inserting a [BLK] token into x as in fine-tuning. Then, the token to insert is determined in a similar way as in the replacement operation,

    x_BLK^adv = argmax_{x_j ∈ V_{x_BLK}} (e(x_j) − e(x_BLK))^T ∇_{x_BLK} L(y, f(x′)),    (4)

where x_BLK represents the inserted [BLK] token, and V_{x_BLK} is the vocabulary refined by the language model as above.

In replacement, the position of the attacked token is determined by computing Equation (3) over the whole input sequence x and selecting the position that increases the loss the most. Similarly, we iteratively insert the [BLK] token into each position and compute Equation (4) to determine where to conduct the insertion operation. Both operations can be efficiently computed in parallel.

Deletion. We propose the deletion operation based on the following intuition: the token that is more dissimilar to the [BLK] token will be more important to the prediction result. Specifically, we quantify the importance of tokens in x with the help of the [BLK] token,

    α_i = (e(x_i) − e(x_BLK))^T ∇_{x_i} L(y, f(x)),    (5)

where α_i indicates the importance of the i-th token x_i. Then the deleted token x_del is selected among all tokens, where del = argmax_{j ∈ |x|} α_j and |x| denotes the sequence length of the input x. In this way, the deleted token is selected jointly by its distance to the [BLK] token and the increase of loss in the gradient space.
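The scores of Eqs. (4) and (5) can be computed with the same embedding-difference trick; the sketch below assumes the [BLK] embedding and the relevant gradients are exposed by the victim model, and every name is a placeholder rather than part of the authors' code.

```python
import torch

def insertion_scores(emb_matrix, blk_id, grad_blk, candidate_ids):
    """Eq. (4): score candidate tokens to fill an inserted [BLK] slot.

    grad_blk is the gradient of the loss w.r.t. the embedding at the [BLK]
    position of the lengthened input x'.
    """
    diff = emb_matrix[candidate_ids] - emb_matrix[blk_id]     # [k, d]
    return diff @ grad_blk                                    # [k]

def deletion_scores(emb_matrix, x_ids, blk_id, grads):
    """Eq. (5): importance alpha_i of every input token relative to [BLK].

    The token at deletion_scores(...).argmax() is the one to delete.
    """
    diff = emb_matrix[x_ids] - emb_matrix[blk_id]             # [n, d]
    return (diff * grads).sum(dim=-1)                         # [n]
```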
[Figure 2: An illustration of the attacking pipeline with the proposed insertion, deletion and replacement operations. We take the second half of the sample in Figure 1 as the example. (x, y) is a test sample where x is the input and y is the target, and x^adv is the generated adversarial sample obtained through 4 attacking steps: a replacement via Eq. (3), an insertion via Eq. (4), a deletion via Eq. (5), and a second insertion via Eq. (4).]

Based on the proposed operations, we introduce the pipeline of our method when attacking NLU tasks as follows; an illustration is provided in Figure 2. We aim at attacking the model with a minimum number of modifications. Given an input x and its label y, we iteratively apply the attack steps introduced above until obtaining a successful attack (i.e., f(x^adv) ≠ y) or reaching the upper bound of attack operations N = ⌊λ · |x|⌋, where λ is a hyper-parameter controlling the trade-off between the overall attack success rate and the modification rate. In each step, we measure the semantic similarity between x and x^adv with the Universal Sentence Encoder (Cer et al., 2018), which encodes the two sentences into a pair of fixed-length vectors and computes the cosine similarity between them. This is a common practice in previous works (Jin et al., 2019; Li et al., 2020). We skip an attack step if the similarity is less than the threshold θ.
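Putting the pieces together, the overall attack loop might look like the following sketch, where predict, attack_step and similarity are hypothetical wrappers around the victim classifier, one atomic operation and the Universal Sentence Encoder respectively.

```python
import math
import random

def vl_attack(x_tokens, label, predict, attack_step, similarity, lam=0.3, theta=0.5):
    """One VL-Attack run (sketch).

    predict(tokens)                 -> predicted label of the victim model
    attack_step(tokens, label, op)  -> token list after one atomic operation `op`
    similarity(a_tokens, b_tokens)  -> USE cosine similarity between two sentences
    """
    budget = math.floor(lam * len(x_tokens))      # N = floor(lambda * |x|)
    adv = list(x_tokens)
    for _ in range(budget):
        op = random.choice(["replacement", "insertion", "deletion"])
        candidate = attack_step(adv, label, op)
        if similarity(x_tokens, candidate) < theta:
            continue                              # skip steps that drift too far semantically
        adv = candidate
        if predict(adv) != label:
            return adv                            # successful attack
    return None                                   # failed within the modification budget
```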
3.2 Adversarial Training on NAT Models

Our method can be naturally applied for adversarial training to enhance the robustness of the victim model. Specifically, we evaluate our method on NAT with mask-predict decoding (Ghazvininejad et al., 2019). Given a bilingual training pair (x, y), the model is trained as a conditional masked language model,

    L_NAT(y^m | y^r, x) = − Σ_{t=1}^{|y^m|} log P(y_t^m | y^r, x),    (6)

where y^m are the masked tokens and y^r are the residual target tokens, following the same masking strategy proposed in BERT (Devlin et al., 2018). While decoding, different from traditional autoregressive models which dynamically determine the target length with the [EOS] symbol, the NAT model needs to predict the target length first, i.e., model P(|y| | x), making the translation result sensitive to the predicted length. In addition, as pointed out by Ghazvininejad et al. (2020), there exists a discrepancy between training and inference in mask-predict decoding, since the observed decoder inputs are correct golden targets in training but are not always correct in inference.

By providing variable-length adversarial examples, our method can alleviate these problems by adversarially training the NAT model. The attack procedure is different from that on NLU tasks in two aspects. Firstly, we obtain adversarial examples by conducting a predefined number of attack steps, as there does not exist a precise definition of a "successful attack" in machine translation. Secondly, we conduct attacks jointly on the encoder input x as well as the decoder input y^r, resulting in the adversarial training loss L_NAT(y^m | (y^r)^adv, x^adv). By doing so, the model observes plausible decoder inputs other than the golden targets, thus alleviating the training-inference discrepancy. In addition, by keeping the targets and feeding the model with variable-length encoder and decoder inputs, the model robustness regarding the length prediction will also be enhanced.
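Eq. (6) is a cross-entropy computed only over the masked target positions. A minimal PyTorch sketch is shown below; the tensor shapes are assumptions about how a mask-predict implementation might expose its decoder outputs, and the adversarial variant simply evaluates the same loss on the attacked inputs while keeping the gold targets.

```python
import torch
import torch.nn.functional as F

def cmlm_loss(logits, target_ids, mask):
    """Eq. (6): cross-entropy over masked target positions only.

    logits:     [B, T, V] decoder outputs given the residual targets y^r and source x
    target_ids: [B, T]    gold target tokens y
    mask:       [B, T]    True where the target token belongs to y^m (was masked)
    """
    token_loss = F.cross_entropy(logits.transpose(1, 2), target_ids, reduction="none")  # [B, T]
    return (token_loss * mask).sum() / mask.sum()

# Adversarial training evaluates the same loss on attacked inputs,
# i.e. L_NAT(y^m | (y^r)^adv, x^adv), while the gold targets y stay unchanged.
```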
4 Experiments

We conduct experiments in two parts to evaluate the proposed method. On NLU tasks, we verify the adversarial attack performance of our method. On the task of non-autoregressive machine translation, we explore the potential of our method for adversarial training. We start with the NLU tasks.

4.1 Adversarial Attack on NLU

4.1.1 Experimental Setup

In NLU tasks, we fine-tune the pre-trained BERT base model on downstream classification datasets and take it as the victim model. We mainly follow the settings in previous works (Jin et al., 2019; Li et al., 2020) to construct a fair comparison.

Datasets. We evaluate our method on benchmark NLU tasks including two sentiment classification datasets, i.e., YELP and IMDB, as well as two natural language inference datasets, i.e., MNLI and SNLI.
• YELP: A document-level sentiment classification dataset on restaurant reviews. We process them into a polarity classification task following (Zhang et al., 2015).
• IMDB: A binary document-level sentiment classification dataset on movie reviews (https://datasets.imdbws.com/).
• MNLI: A multi-genre natural language inference dataset consisting of premise-hypothesis pairs, where the task is to determine the relation of the hypothesis to the premise, i.e., entailment, neutral, or contradiction (Williams et al., 2017). We evaluate on the matched set.
• SNLI: A natural language inference dataset from the Stanford language inference task (Bowman et al., 2015).

For all tasks, we evaluate our method on the 1k test samples provided by (Alzantot et al., 2018) and also utilized in (Jin et al., 2019; Li et al., 2020), which are randomly selected from the corresponding test sets of each task. We tokenize and segment each word into wordpiece tokens w.r.t. the vocabulary of the pre-trained BERT model, and we conduct attacks at the wordpiece level. We set the upper bound of perturbations λ = 30% and the semantic similarity threshold θ = 0.5 for all tasks.

Baselines. We compare our method with two state-of-the-art black-box adversarial attack methods, TextFooler (Jin et al., 2019) and BERT-Attack (Li et al., 2020). We strictly follow the settings in (Li et al., 2020) and directly copy the results of TextFooler and BERT-Attack as reported in their papers. We also consider a white-box baseline, HotFlip (Ebrahimi et al., 2017), which conducts attacks mainly based on the replacement operation. We consider the word-level variant of their model, and obtain their results when attacking BERT by re-implementing the model based on their code (https://github.com/AnyiRao/WordAdver).

Evaluation. We validate the performance of our model with both automatic metrics and human evaluation. The automatic evaluation includes the following metrics. The Original Accuracy (OriAcc) indicates the original model prediction accuracy without adversarial attacks, and the Attacked Accuracy (AttAcc) indicates the model performance on the adversarial examples generated by the attack methods.
A lower AttAcc represents more successful attacks. The Perturbed Ratio (Perturb%) is the average percentage of perturbed tokens when conducting attacks. Intuitively, the attack is more efficient and semantic-preserving with fewer perturbations and a higher attacked accuracy. Finally, we report the Semantic Similarity (Sim) between the original inputs and the adversarial examples, measured by the averaged Universal Sentence Encoder (USE) (Cer et al., 2018) score.
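These automatic metrics could be aggregated as in the following sketch, where the perturbed ratio is approximated by the word-level edit distance between the original and adversarial token sequences, and encode stands for a sentence encoder such as USE that returns unit-norm vectors; the record format and helper names are assumptions, not the authors' code.

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance (replacements, insertions, deletions)."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[-1]

def attack_metrics(records, encode):
    """Aggregate OriAcc, AttAcc, Perturb% and Sim over a list of attack records.

    Each record: {'orig', 'adv', 'label', 'orig_pred', 'adv_pred'} with token lists
    under 'orig'/'adv'.  `encode` maps a token list to a unit-norm sentence vector.
    """
    n = len(records)
    ori_acc = sum(r["orig_pred"] == r["label"] for r in records) / n
    att_acc = sum(r["adv_pred"] == r["label"] for r in records) / n
    perturb = sum(word_edit_distance(r["orig"], r["adv"]) / len(r["orig"]) for r in records) / n
    sim = sum(float(encode(r["orig"]) @ encode(r["adv"])) for r in records) / n
    return ori_acc, att_acc, 100 * perturb, sim
```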
As for human evaluation, we follow the settings of (Jin et al., 2019) to measure the grammaticality, semantic similarity and label consistency of the original sentences and the adversarial examples. We randomly sample 100 sentences from the test sets of MNLI as well as IMDB and generate adversarial examples with our method, and then invite three human experts to annotate the results, who are all native speakers with university-level education backgrounds. The grammaticality is scored from 1 to 5, and the semantic similarity is determined by judging whether the generated example is similar/ambiguous/dissimilar to the original sentence, scored as 1/0.5/0 respectively. In addition, the label consistency between each pair of original and generated sentences is also collected based on human classification. We report the results of our method and the HotFlip baseline. Human annotators are blind to the method identities.

Table 1: Results of adversarial attacks against various fine-tuned BERT models. For BERT-Attack and TextFooler, we directly copy the scores reported by (Li et al., 2020), and we re-implement the word-level HotFlip when attacking BERT based on their code. For the attacked accuracy (AttAcc) and perturbed ratio (Perturb%), the lower the better, while for the semantic similarity (Sim), the higher the better.

                Yelp (OriAcc = 96.8)              IMDB (OriAcc = 90.4)
Model           AttAcc↓   Perturb%↓   Sim↑        AttAcc↓   Perturb%↓   Sim↑
BERT-Attack     5.1       4.1         0.77        11.4      4.4         0.86
TextFooler      6.6       12.8        0.74        13.6      6.1         0.86
HotFlip         9.2       12.3        0.63        8.2       2.7         0.84
VL-Attack       4.8       3.7         0.83        3.5       1.3         0.87

                SNLI (OriAcc = 90.0)              MNLI (OriAcc = 85.1)
Model           AttAcc↓   Perturb%↓   Sim↑        AttAcc↓   Perturb%↓   Sim↑
BERT-Attack     11.8      10.9        0.48        9.9       8.4         0.62
TextFooler      12.4      26.0        0.50        17.5      20.9        0.61
HotFlip         6.3       11.0        0.54        8.0       9.4         0.61
VL-Attack       4.8       9.5         0.56        5.3       7.3         0.66

Table 2: The ablation study of the proposed VL-Attack with naively implemented insertions and deletions. Results are obtained on the IMDB dataset, and we keep the same settings as in Table 1.

Model            AttAcc   Perturb%   Sim
VL-Attack        3.5      1.3        0.87
+ Naive Insert   5.6      3.9        0.77
+ Naive Delete   7.9      5.3        0.80

Table 3: Human evaluation results, where "/" indicates that the results are only applicable to adversarial samples.

Dataset   Method      Grammar   Semantic   Consistency
MNLI      Original    4.71      /          /
          HotFlip     3.98      0.75       0.60
          VL-Attack   4.30      0.86       0.77
IMDB      Original    4.88      /          /
          HotFlip     4.02      0.78       0.73
          VL-Attack   4.57      0.89       0.85

4.1.2 Automatic Evaluation

The results of the adversarial attack on NLU tasks are listed in Table 1. The proposed VL-Attack outperforms the compared state-of-the-art black-box baselines as well as the replacement-based white-box baseline in all settings. On the sentiment classification tasks, which are easier to attack, our method drastically drops the accuracy of the victim model by more than 95% with only editing less than 4% of the tokens. Our method achieves similar performance while editing less than 10% of the tokens on the harder natural language inference tasks.

Compared with HotFlip, the baseline based on the replacement operation, our method is able to achieve more successful attacks with fewer perturbations as well as maintaining higher semantic similarity, which illustrates that more natural adversarial examples can be obtained by introducing the insertion and deletion operations.

4.1.3 Ablation Study

We achieve the insertion and deletion operations by introducing and leveraging a special blank token. The benefits of doing so are twofold.
Firstly, the [BLK] token is trained as a blank space to the model, which can be treated as a placeholder that does not change the original semantics of the input. In this way, we ensure that the inserted token is selected based on the original context of the input, which cannot be achieved if other tokens are used as the placeholder while inserting (e.g., using existing special tokens such as [MASK] or duplicating the current token).

Secondly, determining the importance of tokens, in order to conduct attacks efficiently, is an important problem in textual adversarial attack (Li et al., 2020). Previous works achieve this by greedily running predictions (Li et al., 2020) or leveraging external tools (Liang et al., 2017), which lacks efficiency. Here, the importance of each token can be easily determined by its distance to [BLK] following Equation (5), as a byproduct of introducing the blank token.

Here, we verify the efficacy of the blank token by naively implementing the insertion and deletion operations. Specifically, we change the [BLK] token to the [MASK] token and then conduct insertion following the same method described in Section 3.1. For deletion, we randomly delete a token of the input.

We test these naive baselines on the IMDB dataset, and the results are shown in Table 2. We can find that naive insertion results in lower semantic similarity, indicating that the [MASK] token changes the context of the original input while inserting and thus maintains less of the original semantic information. Naive deletion achieves worse attacked accuracy while perturbing more tokens, which indicates the effectiveness and efficiency of the proposed blank token in determining the importance of different tokens and conducting attacks.
4.1.4 Human Evaluation

The human evaluation results are listed in Table 3, where we report the average scores of all annotators. Consistent with the results of the automatic evaluation, the adversarial examples generated by our method are both grammatically and semantically closer to the original samples than those of the replacement-based baseline. In addition, a majority of the generated examples preserve the same labels as the original samples according to the human annotators, showing that our attacking method is indistinguishable to human judgement most of the time.

4.1.5 Case Study

Table 4: Case studies of the proposed VL-Attack method on the MNLI dataset. "P" and "H" indicate premise and hypothesis respectively. "Origin" indicates the original input sample. Red words indicate replacements and underlined words indicate insertions.

Origin (Contradiction)
  P: Sandstone and granite were the materials used to build the Baroque church of Bom Jesus, famous for its casket of St. Francis Xavier's relics in the mausoleum to the right of the altar.
  H: St. Francis Xavier's relics were never recovered, unfortunately.
VL-Attack (Neutral)
  P: Sandstone and granite were the materials used to build the Baroque church of Bom Jesus, famous for its casket of St. Francis Xavier's relics in the mausoleum to the right of the altar.
  H: St. Francis Xavier's relics were never recapture, however.

Origin (Contradiction)
  P: Julius Caesar's nephew Octavian took the name Augustus; Rome ceased to be a republic, and became an empire.
  H: Rome never ceased to be a republic, and did not become an empire.
VL-Attack (Entailment)
  P: Julius Caesar's nephew Octavian took the name Augustus; Rome not come to be a republic, and became an empire.
  H: Rome never ceased to be a republic, and did not become an empire.

In Table 4, we provide several adversarial examples generated by our method on the MNLI dataset, with perturbations marked. The generated examples are generally semantically consistent with the original input, as well as indistinguishable to human judgement.

4.1.6 Adversarial Training on NLU Tasks

We also conduct additional investigations of adversarial training on NLU tasks to verify whether the generated adversarial examples preserve the original labels. Specifically, we first generate adversarial examples on the training set of each task, which are then concatenated with the original training set to fine-tune a pre-trained BERT model. We then evaluate the model on clean test sets, where the model performance would drop drastically if most adversarial examples were inconsistent with their original labels.
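A sketch of this augmentation loop is given below, assuming attack wraps the VL-Attack routine with the model-specific arguments already bound and that fine-tuning itself is handled elsewhere; all names are illustrative.

```python
def augment_training_set(train_set, attack):
    """Adversarial data augmentation for NLU fine-tuning (sketch).

    `attack(tokens, label)` runs the VL-Attack loop against the current victim model
    and returns an adversarial token list, or None if the attack fails.  The original
    label is kept for the adversarial copy, so the augmentation is only sound when
    the attack is label-preserving.
    """
    augmented = list(train_set)
    for tokens, label in train_set:
        adv = attack(tokens, label)
        if adv is not None:
            augmented.append((adv, label))
    return augmented  # fine-tune BERT on this set, then evaluate on the clean test set
```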
Table 5: Results of fine-tuning a pre-trained BERT model with adversarial training on NLU tasks.

Model         YELP   IMDB   SNLI   MNLI
Origin        96.8   90.4   90.0   85.1
+ VL-Attack   96.4   91.2   90.1   84.7

The results are listed in Table 5. With adversarial training, the model performs comparably with the original performance (the change of accuracy varies from −0.4 to +0.8) on the clean test sets, showing that most generated adversarial examples are consistent with the original labels and do not hurt the model performance.

4.2 Adversarial Training on NAT

In this section, we utilize our method for adversarial training on NAT tasks to alleviate some key problems in NAT models, as discussed in Section 3.2. The adversarial training results on NLU tasks are also provided in the supplementary material.

4.2.1 Experimental Setup

Dataset and Model Configurations. We conduct experiments on benchmark machine translation datasets including IWSLT14 German→English (IWSLT14 De-En, https://wit3.fbk.eu/), WMT14 English→German (WMT14 En-De, https://www.statmt.org/wmt14/translation-task), and WMT16 English↔Romanian (WMT16 En-Ro, https://www.statmt.org/wmt16/translation-task). For IWSLT14, we adopt the official split of train/valid/test sets. For the WMT14 tasks, we utilize newstest2013 and newstest2014 as the validation and test sets respectively. For the WMT16 tasks, we use newsdev2016 and newstest2016 as the validation and test sets. We tokenize the sentences with Moses (Koehn et al., 2007) and segment each word into subwords using Byte-Pair Encoding (BPE) (Sennrich et al., 2015), resulting in a 32k vocabulary shared by the source and target languages. On the WMT tasks, the model architecture is akin to the base configuration of the Transformer model (Vaswani et al., 2017) (d_hidden = 512, d_FFN = 2048, n_layer = 6, n_head = 8). We use a smaller configuration on the IWSLT14 De-En task (d_hidden = 256, d_FFN = 512, n_layer = 5, n_head = 4).

We choose the mask-predict (Ghazvininejad et al., 2019) model as the victim NAT model, using the Transformer base configuration on the WMT tasks and the smaller configuration on the IWSLT14 De-En task. Following mask-predict, we utilize sequence-level knowledge distillation (Kim and Rush, 2016) on the training set of the WMT14 En-De task to provide less noisy and more deterministic training data for NAT models (Gu et al., 2017), and we use the raw datasets for the other tasks. We train the model on 1/8 Nvidia P100 GPUs for the IWSLT14/WMT tasks respectively. For evaluation, we report tokenized BLEU scores (Papineni et al., 2002) measured by multi-bleu.perl.
Table 6: The BLEU scores of mask-predict with adversarial training and baselines on the IWSLT14 De-En, WMT14 En-De and WMT16 En-Ro tasks. The results of the baselines are obtained by our implementation.

Models          IWSLT14 De-En   WMT14 En-De   WMT16 En-Ro
Transformer     32.59           28.04         34.13
Mask-Predict    31.71           27.03         33.08
+ HotFlip       32.05           26.91         33.12
+ VL-Attack     33.18           27.55         33.57

Table 7: The BLEU scores of mask-predict when facing adversarial attacks. The results are obtained on the IWSLT14 German-English translation task.

                                  IWSLT14 De-En
Models                            dev      test
Mask-Predict                      32.38    31.71
  Attacked by HotFlip             22.95    21.01
  Attacked by VL-Attack           21.29    19.47
Mask-Predict + VL-Attack          33.84    33.18
  Attacked by HotFlip             30.79    29.85
  Attacked by VL-Attack           28.42    27.13

Adversarial Training Pipeline. We first pre-train a mask-predict model, on which we execute adversarial attacks on the training set following the process described in Section 3.1, except that we conduct a fixed number of attacking steps (randomly sampled from 1 to 15% of the sequence length), as a "successful attack" in NLG tasks is not well defined. Then we fine-tune the model on the concatenation of the original training set and the attacked training set. As for baselines, we compare our method with HotFlip, a white-box attacking method mainly based on the replacement operation; we consider its word-level variant.
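This pipeline could be organized as in the sketch below, where attack_step is a hypothetical wrapper around one atomic operation against the pre-trained mask-predict model and each training pair receives a fixed, randomly drawn number of attack steps; the data layout and names are assumptions rather than the authors' implementation.

```python
import random

def build_nat_adversarial_data(train_triples, attack_step, max_ratio=0.15):
    """Build the attacked training set for NAT adversarial fine-tuning (sketch).

    Each triple is (x, y_r, y): source tokens, observed (residual) decoder inputs and
    gold target tokens.  `attack_step(tokens, y, side)` applies one atomic operation
    against the pre-trained mask-predict model; the gold target y is never modified.
    """
    attacked = []
    for x, y_r, y in train_triples:
        k = random.randint(1, max(1, int(max_ratio * len(x))))   # fixed number of steps per pair
        x_adv, y_r_adv = list(x), list(y_r)
        for _ in range(k):
            x_adv = attack_step(x_adv, y, side="encoder")
            y_r_adv = attack_step(y_r_adv, y, side="decoder")
        attacked.append((x_adv, y_r_adv, y))
    return train_triples + attacked   # fine-tune mask-predict on the concatenation
```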
4.2.2 Results

The main results are listed in Table 6. Our method outperforms the original mask-predict model as well as the HotFlip baseline on all datasets. Specifically, we can find that HotFlip, the replacement-based adversarial attack method, does not provide clear improvements over the original mask-predict model, indicating that fixed-length adversarial examples are not helpful for the training of NAT models. On the contrary, the proposed VL-Attack is able to boost the performance of the mask-predict model by providing variable-length adversarial examples.

4.2.3 Adversarial Defense

Here, we verify that in addition to enhancing the model performance on regular test sets, our method can also help the model defend against adversarial attacks. We attack the original mask-predict model as well as the model fine-tuned by our method on the validation and test sets of the IWSLT14 German-English translation task, setting the number of attacking steps to 3.

Results are shown in Table 7, from which we can find that the proposed VL-Attack provides more critical attacks than HotFlip, as more BLEU points are dropped. In addition, fine-tuning with adversarial examples generated by our method successfully promotes the robustness of the mask-predict model. Specifically, on the test set, the relative drop of the BLEU score is decreased from 33.74% to 10.04% for HotFlip and from 38.60% to 18.23% for the proposed method, showing that the adversarial training method can help the model defend against both fixed-length and variable-length adversarial attacks.

5 Conclusion

In this paper, we propose a variable-length adversarial attack method on textual data, which consists of three atomic operations: replacement, insertion and deletion. We integrate them into a unified framework by introducing and leveraging a special blank token, which is trained to serve as a blank space to the victim model, and can therefore be utilized as a placeholder while inserting and as a reference for measuring the importance of tokens while deleting. We verify the effectiveness of the proposed method by attacking a pre-trained BERT model on natural language understanding tasks, and then show that our method is beneficial to non-autoregressive machine translation (NAT), a task that is sensitive to the input lengths, when treated as a data augmentation method. Our method is able to boost the translation performance of the NAT model as well as its robustness against adversarial attacks. In the future, we will try to extend our method to the black-box setting, by switching from gradient-based replacement to heuristic approaches, or by leveraging imitation learning (Wallace et al., 2020).

References

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998.

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324.

Marjan Ghazvininejad, Omer Levy, and Luke Zettlemoyer. 2020. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.

Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723–3730.

Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2020a. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7839–7846.
Junliang Guo, Linli Xu, and Enhong Chen. 2020b. Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 376–385.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932.

Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.

Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2018. TextBugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271.

Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020. BERT-Attack: Adversarial attack against BERT using BERT. arXiv preprint arXiv:2004.09984.

Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

John X Morris, Eli Lifland, Jack Lanchantin, Yangfeng Ji, and Yanjun Qi. 2020. Reevaluating adversarial examples in natural language. arXiv preprint arXiv:2004.14174.

Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. In MILCOM 2016 - 2016 IEEE Military Communications Conference, pages 49–54. IEEE.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Guanhong Tao, Shiqing Ma, Yingqi Liu, and Xiangyu Zhang. 2018. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In Advances in Neural Information Processing Systems, pages 7717–7728.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125.

Eric Wallace, Mitchell Stern, and Dawn Song. 2020. Imitation attacks and defenses for black-box machine translation systems. arXiv preprint arXiv:2004.15015.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. 2020. Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6066–6080.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2017. Generating natural adversarial examples. arXiv preprint arXiv:1710.11342.