Towards Variable-Length Textual Adversarial Attacks
arXiv:2104.08139v1 [cs.CL] 16 Apr 2021

Junliang Guo1, Zhirui Zhang2, Linlin Zhang3, Linli Xu1, Boxing Chen2, Enhong Chen1, Weihua Luo2
1 Anhui Province Key Laboratory of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China
2 Alibaba DAMO Academy
3 Zhejiang University
guojunll@mail.ustc.edu.cn, {linlixu,cheneh}@ustc.edu.cn
{zhirui.zzr, zll240651, boxing.cbx, weihua.luowh}@alibaba-inc.com

Abstract

Adversarial attacks have shown the vulnerability of machine learning models; however, it is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data. Most previous approaches conduct attacks with the atomic replacement operation, which usually leads to fixed-length adversarial examples and therefore limits the exploration of the decision space. In this paper, we propose variable-length textual adversarial attacks (VL-Attack) and integrate three atomic operations, namely insertion, deletion and replacement, into a unified framework, by introducing and manipulating a special blank token while attacking. In this way, our approach is able to more comprehensively find adversarial examples around the decision boundary and effectively conduct adversarial attacks. Specifically, our method drops the accuracy of IMDB classification by 96% with only editing 1.3% of the tokens while attacking a pre-trained BERT model. In addition, fine-tuning the victim model with generated adversarial samples can improve the robustness of the model without hurting the performance, especially for length-sensitive models. On the task of non-autoregressive machine translation, our method achieves a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.

[Figure 1: An illustration of the proposed VL-Attack method. Red, underline and strikeout tokens indicate replacements, insertions and deletions respectively. Equipped with the proposed three atomic operations, our method is able to generate adversarial examples that are closer to the true decision boundary, which cannot be reached only by replacement. Examples shown in the figure: Origin: "I rented this movie to relax. But I got a headache."; Replace: "I rented this movie to relax. But I got a migraine."; Ours: "I rented this movie to relax. But I got a episodic migraine instead."]

1 Introduction

While achieving great successes in various domains, machine learning models have been found vulnerable to adversarial examples, i.e., original inputs with small perturbations that are indistinguishable to human knowledge can fool the model and lead to incorrect results (Goodfellow et al., 2014; Kurakin et al., 2016; Zhao et al., 2017). Adversarial attacks and defenses are able to improve the robustness and security (Szegedy et al., 2013; Madry et al., 2017) as well as to explore the interpretability of machine learning models (Ribeiro et al., 2018; Tao et al., 2018).

While having been extensively studied on computer vision models, adversarial attack on natural language processing models is more challenging, as small perturbations of textual data may change its original semantic meaning and thus are perceptible. Most previous works utilize the replacement operation to construct adversarial examples, including elaborately simulating natural noises such as typos by character-level replacement (Ebrahimi et al., 2017; Li et al., 2018) and word-level synonym replacement (Papernot et al., 2016; Alzantot et al., 2018), to ensure the generated examples will not alter the semantics. However, naturally, most adversarial examples could have different lengths from the original sentence, and the replacement operation is only able to generate fixed-length examples and thus produces a limited subset of the total adversarial candidates. We provide an illustration in Figure 1, where replacements cannot generate variable-length adversarial examples.

In fact, an original sentence can be converted to any adversarial candidate with a combination of three one-step atomic operations: replacement, insertion and deletion. Inspired by this observation,
we propose a paradigm of white-box variable-length adversarial attacks to comprehensively explore adversarial examples. The whole attack procedure consists of multiple steps. In each step, we conduct an attack utilizing an operation randomly sampled from the proposed three types of atomic operations. Theoretically, our method is able to reach adversarial examples with any word-level Levenshtein distance (Levenshtein, 1966) to the inputs, therefore covering the full set of adversarial candidates.

We propose tailored attacking methods for the proposed three atomic operations. Specifically, we make insertion and deletion possible by converting them into gradient-based methods with the introduction of a special blank token [BLK]. Given a pre-trained victim model such as BERT, we introduce [BLK] while fine-tuning it on downstream tasks. By randomly inserting the blank token into inputs while keeping the target unchanged, the token is trained as a "blank space" to the model and its occurrence will not affect the original semantic information. Then, we implement insertion and deletion with the help of the blank token. For insertion, we first insert a [BLK] token at a chosen position of the input; then we choose the content to insert by calculating the similarities between the gradients of [BLK] and tokens in the vocabulary. While deleting, we evaluate the importance of all tokens by comparing them with [BLK] in the gradient space, and then delete the most important token of the sentence. For replacement, we follow the gradient-based method utilized in (Ebrahimi et al., 2017; Wallace et al., 2019) and replace a token with the one that maximizes the first-order approximation of the prediction loss.

Besides conducting attacks, our method can also be viewed as a data augmentation method to enhance the robustness and performance of models. Since the proposed method is able to generate variable-length adversarial examples, it is particularly beneficial to length-sensitive models such as non-autoregressive neural machine translation (NAT) (Gu et al., 2017; Guo et al., 2019, 2020a), which requires predicting the target length before decoding. Therefore, in experiments, we evaluate our method on two tasks: natural language understanding (NLU) and NAT with mask-predict decoding (Ghazvininejad et al., 2019). We verify the effectiveness of the proposed method on NLU tasks, where our method successfully attacks the pre-trained BERT (Devlin et al., 2018) model when fine-tuning on downstream tasks. Specifically, after the attack, the model's IMDB classification accuracy drops from 90.4% to 3.5% by only editing 1.3% of the tokens. Human evaluation verifies that the generated adversarial examples are highly accurate from the grammatical, semantic and label-preserving perspectives. On NAT, we show that our method can be effectively used for data augmentation. By integrating the adversarial examples of training samples, our method boosts the performance of the mask-predict model on various machine translation tasks. Specifically, we achieve a 33.18 BLEU score on IWSLT14 German-English translation with an improvement of 1.47 over the baseline model.

Our main contributions can be summarized as follows:
• We propose a variable-length adversarial attack method on textual data, which integrates three atomic operations, including replacement, insertion and deletion, into a unified framework.
• Our method successfully attacks the pre-trained BERT model on NLU tasks with a higher attacking success rate and a lower perturbation rate than previous approaches, while being semantic- and label-preserving according to human judgement.
• Our method can be viewed as a data augmentation method which is specifically beneficial for length-sensitive models, and is able to achieve strong performance on NAT tasks.

2 Related Work

2.1 Textual Adversarial Attack

Adversarial attack has been greatly studied in computer vision, and most works execute the attack with gradient-based perturbation on the continuous space (Szegedy et al., 2013; Goodfellow et al., 2014). For textual adversarial attack, it is much more challenging. We categorize previous works into black-box and white-box attacking, depending on whether the attacker is aware of the gradients of the victim model. For black-box attacking, previous works usually replace words or characters in the original inputs guided by some restrictions to maintain the original semantics, including heuristic priors (Jin et al., 2019; Li et al., 2018; Ren et al., 2019), semantic similarity checking (Li et al., 2020; Cer et al., 2018), extra language models (Alzantot et al., 2018) or a combination of them (Jin et al., 2019; Li et al., 2020; Zang et al., 2020; Morris et al., 2020).
For white-box attacking, most previous works follow the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014) to replace the original input with tokens that are able to drastically increase the prediction loss of the victim model, either at the character level (Ebrahimi et al., 2017), word level (Wallace et al., 2019; Papernot et al., 2016), or phrase level (Liang et al., 2017). In this paper, we focus on white-box attacking and propose a unified framework that integrates the three atomic operations in a novel manner, by introducing and manipulating a blank token [BLK].

2.2 Non-Autoregressive NMT

Non-autoregressive neural machine translation (NAT) (Gu et al., 2017) is proposed to reduce the inference latency of autoregressive machine translation models by generating target tokens in parallel while decoding. Among the follow-up works, mask-predict decoding (Ghazvininejad et al., 2019) is shown effective by achieving comparable performance with autoregressive models while halving the inference latency (Guo et al., 2020b). However, in the mask-predict model, the target length needs to be predicted up front, making the results unstable and sensitive. The training-inference discrepancy is also a key problem (Ghazvininejad et al., 2020).

We show that our method can be viewed as an effective data augmentation method, which is able to improve the translation performance of the NAT model as well as alleviate the problems mentioned above.

3 Methodology

We introduce the proposed Variable-Length textual adversarial Attack (VL-Attack) method in this section. We start with the problem definition.

Problem Definition. Given a pre-trained model such as BERT, after fine-tuning it on the training set of a downstream task T_train, we get the victim model f(·) which will be evaluated on the test set T_test. We denote the task-specific loss function while fine-tuning as L(y, f(x)). Given a test sample (x, y) ∈ T_test where x is the input sequence and y is its target, there should be f(x) = y. While attacking, we aim at generating an adversarial example x^adv that satisfies:

    f(x^adv) ≠ y  and  sim(x, x^adv) > θ,    (1)

where sim(·) indicates the function that measures the semantic similarity between the original inputs and adversarial examples, and θ is the threshold.

3.1 Variable-Length Attack

Our adversarial attack consists of multiple steps, and in each step, the attack is conducted by an atomic operation selected randomly from replacement, insertion and deletion. Formally, denoting M(x) = x^adv as the mapping function from the original input to the adversarial example, it can be defined as:

    M(x) = M_k ∘ M_{k-1} ∘ ··· ∘ M_1(x),  M_i ∈ {insertion, deletion, replacement},    (2)

where k is the number of operation steps as well as the Levenshtein distance between x and x^adv.

Each step of our attack is achieved by introducing and leveraging a special blank token [BLK]. While fine-tuning, given a training sample (x, y) ∈ T_train, we randomly insert a blank token into the given input x = (x_1, x_2, ..., x_n), resulting in x′ = (x_1, ..., x_{i-1}, [BLK], x_i, ..., x_n). Then we fine-tune the model on (x′, y) instead of the original sample (x, y), i.e., train the model with L(y, f(x′)). In this way, the [BLK] token is learned to serve as a blank space, as its occurrence will not affect the semantic information of the original sample. With the help of the blank token, we introduce the proposed three atomic operations as follows. Note that we conduct adversarial attacks at inference time, i.e., on (x, y) ∈ T_test.
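To make the blank-token fine-tuning concrete, below is a minimal sketch of the augmentation step, assuming inputs are already tokenized into wordpiece strings and that [BLK] has been registered in the victim model's vocabulary; the function name and insertion probability are illustrative and not taken from the authors' implementation.

```python
import random

BLK = "[BLK]"  # hypothetical special token, assumed to be added to the victim model's vocabulary

def insert_blank(tokens, p=0.5):
    """Randomly insert one [BLK] token into a tokenized input, keeping the label unchanged.

    Fine-tuning on (insert_blank(x), y) alongside the original (x, y) teaches the model
    to treat [BLK] as a semantic no-op ("blank space").
    """
    if random.random() > p:
        return list(tokens)
    pos = random.randint(0, len(tokens))          # any gap, including both ends
    return tokens[:pos] + [BLK] + tokens[pos:]

# Example augmented fine-tuning input:
print(insert_blank(["i", "rented", "this", "movie", "to", "relax"]))
```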
Replacement. The replacement operation is inspired by the FGSM method (Goodfellow et al., 2014), which leverages the first-order approximation of the loss function. For the i-th token x_i in the original input x, we replace it with the token in the vocabulary that most drastically increases the loss function,

    x_i^adv = argmax_{x_j ∈ V_{x_i}} (e(x_j) − e(x_i))^T ∇_{x_i} L(y, f(x)),    (3)

where e(x_i) indicates the embedding of x_i, and V_{x_i} = top-k{P_lm(x_j | x_i)} is a subset of the whole vocabulary V, refined by a language model P_lm. In our setting, we directly use the victim BERT model as the language model for simplicity.
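A possible way to compute the scores of Eq. (3) with PyTorch is sketched below. It assumes the gradient of the task loss with respect to the input embeddings has already been obtained from a backward pass through the victim model; all tensor and function names are placeholders, and the toy usage substitutes random tensors for a real model.

```python
import torch

def replacement_scores(emb_matrix, x_ids, grads, i, candidate_ids):
    """Eq. (3) scores: (e(x_j) - e(x_i))^T * grad_i for each candidate token x_j.

    emb_matrix:    [V, d] token embedding table of the victim model
    x_ids:         [n]    token ids of the current input x
    grads:         [n, d] gradient of the task loss w.r.t. the input embeddings
    i:             position of the token to be replaced
    candidate_ids: [k]    ids of the top-k candidates proposed by the language model
    """
    diff = emb_matrix[candidate_ids] - emb_matrix[x_ids[i]]   # [k, d]
    return diff @ grads[i]                                    # [k]; larger = larger loss increase

# Toy usage with random tensors standing in for a real victim model and its gradients.
V, d, n, k = 100, 16, 6, 10
emb = torch.randn(V, d)
x_ids = torch.randint(0, V, (n,))
grads = torch.randn(n, d)                 # in practice obtained via loss.backward()
cands = torch.randint(0, V, (k,))
best_token_id = cands[replacement_scores(emb, x_ids, grads, i=2, candidate_ids=cands).argmax()]
```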
Insertion. The insertion operation can be divided into two steps. Firstly, we construct a sample x′ that is semantically identical to x but one token longer, i.e., |x′| = |x| + 1 and sim(x, x^adv) > θ. We achieve this by inserting a [BLK] token into x as in fine-tuning. Then, the token to insert is determined in a similar way as in the replacement operation,

    x_BLK^adv = argmax_{x_j ∈ V_{x_BLK}} (e(x_j) − e(x_BLK))^T ∇_{x_BLK} L(y, f(x′)),    (4)

where x_BLK represents the inserted [BLK] token, and V_{x_BLK} is the vocabulary refined by the language model as above.

In replacement, the position of the attacked token is determined by computing Equation (3) over the whole input sequence x and selecting the position that increases the loss the most. Similarly, we iteratively insert the [BLK] token into each position and compute Equation (4) to determine where to conduct the insertion operation. Both operations can be efficiently computed in parallel.

Deletion. We propose the deletion operation based on the following intuition: the token that is more dissimilar to the [BLK] token will be more important to the prediction result. Specifically, we quantify the importance of tokens in x with the help of the [BLK] token,

    α_i = (e(x_i) − e(x_BLK))^T ∇_{x_i} L(y, f(x)),    (5)

where α_i indicates the importance of the i-th token x_i. Then the deleted token x_del is selected among all tokens, where del = argmax_{j ∈ |x|} α_j and |x| denotes the sequence length of the input x. In this way, the deleted token is selected jointly by its distance to the [BLK] token and the increase of loss in the gradient space.
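The scores of Eqs. (4) and (5) can be computed with the same embedding-difference trick; the sketch below assumes the [BLK] embedding and the relevant gradients are exposed by the victim model, and every name is a placeholder rather than part of the authors' code.

```python
import torch

def insertion_scores(emb_matrix, blk_id, grad_blk, candidate_ids):
    """Eq. (4): score candidate tokens to fill an inserted [BLK] slot.

    grad_blk is the gradient of the loss w.r.t. the embedding at the [BLK]
    position of the lengthened input x'.
    """
    diff = emb_matrix[candidate_ids] - emb_matrix[blk_id]     # [k, d]
    return diff @ grad_blk                                    # [k]

def deletion_scores(emb_matrix, x_ids, blk_id, grads):
    """Eq. (5): importance alpha_i of every input token relative to [BLK].

    The token at deletion_scores(...).argmax() is the one to delete.
    """
    diff = emb_matrix[x_ids] - emb_matrix[blk_id]             # [n, d]
    return (diff * grads).sum(dim=-1)                         # [n]
```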
[Figure 2: An illustration of the attacking pipeline with the proposed insertion, deletion and replacement operations. We take the second half of the sample in Figure 1 as the example. (x, y) is a test sample where x is the input and y is the target, and x^adv is the generated adversarial sample obtained through 4 attacking steps: a replacement via Eq. (3), an insertion via Eq. (4), a deletion via Eq. (5), and a second insertion via Eq. (4).]

Based on the proposed operations, we introduce the pipeline of our method when attacking NLU tasks as follows; an illustration is provided in Figure 2. We aim at attacking the model with a minimum number of modifications. Given an input x and its label y, we iteratively apply the attack steps introduced above until obtaining a successful attack (i.e., f(x^adv) ≠ y) or reaching the upper bound of attack operations N = ⌊λ · |x|⌋, where λ is a hyper-parameter controlling the trade-off between the overall attack success rate and the modification rate. In each step, we measure the semantic similarity between x and x^adv with the Universal Sentence Encoder (Cer et al., 2018), which encodes the two sentences into a pair of fixed-length vectors and computes the cosine similarity between them. This is a common practice in previous works (Jin et al., 2019; Li et al., 2020). We skip an attack step if the similarity is less than the threshold θ.
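Putting the pieces together, the overall attack loop might look like the following sketch, where predict, attack_step and similarity are hypothetical wrappers around the victim classifier, one atomic operation and the Universal Sentence Encoder respectively.

```python
import math
import random

def vl_attack(x_tokens, label, predict, attack_step, similarity, lam=0.3, theta=0.5):
    """One VL-Attack run (sketch).

    predict(tokens)                 -> predicted label of the victim model
    attack_step(tokens, label, op)  -> token list after one atomic operation `op`
    similarity(a_tokens, b_tokens)  -> USE cosine similarity between two sentences
    """
    budget = math.floor(lam * len(x_tokens))      # N = floor(lambda * |x|)
    adv = list(x_tokens)
    for _ in range(budget):
        op = random.choice(["replacement", "insertion", "deletion"])
        candidate = attack_step(adv, label, op)
        if similarity(x_tokens, candidate) < theta:
            continue                              # skip steps that drift too far semantically
        adv = candidate
        if predict(adv) != label:
            return adv                            # successful attack
    return None                                   # failed within the modification budget
```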
3.2 Adversarial Training on NAT Models

Our method can be naturally applied for adversarial training to enhance the robustness of the victim model. Specifically, we evaluate our method on NAT with mask-predict decoding (Ghazvininejad et al., 2019). Given a bilingual training pair (x, y), the model is trained as a conditional masked language model,

    L_NAT(y^m | y^r, x) = − Σ_{t=1}^{|y^m|} log P(y_t^m | y^r, x),    (6)

where y^m are the masked tokens and y^r are the residual target tokens, following the same masking strategy proposed in BERT (Devlin et al., 2018). While decoding, different from traditional autoregressive models which dynamically determine the target length with the [EOS] symbol, the NAT model needs to predict the target length first, i.e., model P(|y| | x), making the translation result sensitive to the predicted length. In addition, as pointed out by Ghazvininejad et al. (2020), there exists a discrepancy between training and inference in mask-predict decoding, since the observed decoder inputs are correct golden targets in training but are not always correct in inference.

By providing variable-length adversarial examples, our method can alleviate these problems by adversarially training the NAT model. The attack procedure is different from that on NLU tasks in two aspects. Firstly, we obtain adversarial examples by conducting a predefined number of attack steps, as there does not exist a precise definition of a "successful attack" in machine translation. Secondly, we conduct attacks jointly on the encoder input x as well as the decoder input y^r, resulting in the adversarial training loss L_NAT(y^m | (y^r)^adv, x^adv). By doing so, the model observes plausible decoder inputs other than the golden targets, thus alleviating the training-inference discrepancy. In addition, by keeping the targets and feeding the model with variable-length encoder and decoder inputs, the model robustness regarding the length prediction will also be enhanced.
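Eq. (6) is a cross-entropy computed only over the masked target positions. A minimal PyTorch sketch is shown below; the tensor shapes are assumptions about how a mask-predict implementation might expose its decoder outputs, and the adversarial variant simply evaluates the same loss on the attacked inputs while keeping the gold targets.

```python
import torch
import torch.nn.functional as F

def cmlm_loss(logits, target_ids, mask):
    """Eq. (6): cross-entropy over masked target positions only.

    logits:     [B, T, V] decoder outputs given the residual targets y^r and source x
    target_ids: [B, T]    gold target tokens y
    mask:       [B, T]    True where the target token belongs to y^m (was masked)
    """
    token_loss = F.cross_entropy(logits.transpose(1, 2), target_ids, reduction="none")  # [B, T]
    return (token_loss * mask).sum() / mask.sum()

# Adversarial training evaluates the same loss on attacked inputs,
# i.e. L_NAT(y^m | (y^r)^adv, x^adv), while the gold targets y stay unchanged.
```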
4 Experiments

We conduct experiments in two parts to evaluate the proposed method. On NLU tasks, we verify the adversarial attack performance of our method. On the task of non-autoregressive machine translation, we explore the potential of our method for adversarial training. We start with the NLU tasks.

4.1 Adversarial Attack on NLU

4.1.1 Experimental Setup

In NLU tasks, we fine-tune the pre-trained BERT base model on downstream classification datasets and take it as the victim model. We mainly follow the settings in previous works (Jin et al., 2019; Li et al., 2020) to construct a fair comparison.

Datasets. We evaluate our method on benchmark NLU tasks including two sentiment classification datasets, i.e., YELP and IMDB, as well as two natural language inference datasets, i.e., MNLI and SNLI.
• YELP: A document-level sentiment classification dataset on restaurant reviews. We process them into a polarity classification task following (Zhang et al., 2015).
• IMDB: A binary document-level sentiment classification dataset on movie reviews (https://datasets.imdbws.com/).
• MNLI: A multi-genre natural language inference dataset consisting of premise-hypothesis pairs, where the task is to determine the relation of the hypothesis to the premise, i.e., entailment, neutral, or contradiction (Williams et al., 2017). We evaluate on the matched set.
• SNLI: A natural language inference dataset from the Stanford language inference task (Bowman et al., 2015).

For all tasks, we evaluate our method on the 1k test samples provided by (Alzantot et al., 2018) and also utilized in (Jin et al., 2019; Li et al., 2020), which are randomly selected from the corresponding test sets of each task. We tokenize and segment each word into wordpiece tokens w.r.t. the vocabulary of the pre-trained BERT model, and we conduct attacks at the wordpiece level. We set the upper bound of perturbations λ = 30% and the semantic similarity threshold θ = 0.5 for all tasks.

Baselines. We compare our method with two state-of-the-art black-box adversarial attack methods, TextFooler (Jin et al., 2019) and BERT-Attack (Li et al., 2020). We strictly follow the settings in (Li et al., 2020) and directly copy the results of TextFooler and BERT-Attack as reported in their papers. We also consider a white-box baseline, HotFlip (Ebrahimi et al., 2017), which conducts attacks mainly based on the replacement operation. We consider the word-level variant of their model, and obtain their results when attacking BERT by re-implementing the model based on their code (https://github.com/AnyiRao/WordAdver).

Evaluation. We validate the performance of our model with both automatic metrics and human evaluation. The automatic evaluation includes the following metrics. The Original Accuracy (OriAcc) indicates the original model prediction accuracy without adversarial attacks, and the Attacked Accuracy (AttAcc) indicates the model performance on the adversarial examples generated by the attack methods.
A lower AttAcc represents more successful attacks. The Perturbed Ratio (Perturb%) is the average percentage of perturbed tokens when conducting attacks. Intuitively, the attack is more efficient and semantic-preserving with fewer perturbations and a higher attacked accuracy. Finally, we report the Semantic Similarity (Sim) between the original inputs and the adversarial examples, measured by the averaged Universal Sentence Encoder (USE) (Cer et al., 2018) score.
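These automatic metrics could be aggregated as in the following sketch, where the perturbed ratio is approximated by the word-level edit distance between the original and adversarial token sequences, and encode stands for a sentence encoder such as USE that returns unit-norm vectors; the record format and helper names are assumptions, not the authors' code.

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance (replacements, insertions, deletions)."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[-1]

def attack_metrics(records, encode):
    """Aggregate OriAcc, AttAcc, Perturb% and Sim over a list of attack records.

    Each record: {'orig', 'adv', 'label', 'orig_pred', 'adv_pred'} with token lists
    under 'orig'/'adv'.  `encode` maps a token list to a unit-norm sentence vector.
    """
    n = len(records)
    ori_acc = sum(r["orig_pred"] == r["label"] for r in records) / n
    att_acc = sum(r["adv_pred"] == r["label"] for r in records) / n
    perturb = sum(word_edit_distance(r["orig"], r["adv"]) / len(r["orig"]) for r in records) / n
    sim = sum(float(encode(r["orig"]) @ encode(r["adv"])) for r in records) / n
    return ori_acc, att_acc, 100 * perturb, sim
```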
As for human evaluation, we follow the settings of (Jin et al., 2019) to measure the grammaticality, semantic similarity and label consistency of the original sentences and the adversarial examples. We randomly sample 100 sentences from the test sets of MNLI as well as IMDB and generate adversarial examples with our method, and then invite three human experts to annotate the results, who are all native speakers with university-level education backgrounds. The grammaticality is scored from 1 to 5, and the semantic similarity is determined by judging whether the generated example is similar/ambiguous/dissimilar to the original sentence, scored as 1/0.5/0 respectively. In addition, the label consistency between each pair of original and generated sentences is also collected based on human classification. We report the results of our method and the HotFlip baseline. Human annotators are blind to the method identities.

Table 1: Results of adversarial attacks against various fine-tuned BERT models. For BERT-Attack and TextFooler, we directly copy the scores reported by (Li et al., 2020), and we re-implement the word-level HotFlip when attacking BERT based on their code. For the attacked accuracy (AttAcc) and perturbed ratio (Perturb%), the lower the better, while for the semantic similarity (Sim), the higher the better.

                Yelp (OriAcc = 96.8)              IMDB (OriAcc = 90.4)
Model           AttAcc↓   Perturb%↓   Sim↑        AttAcc↓   Perturb%↓   Sim↑
BERT-Attack     5.1       4.1         0.77        11.4      4.4         0.86
TextFooler      6.6       12.8        0.74        13.6      6.1         0.86
HotFlip         9.2       12.3        0.63        8.2       2.7         0.84
VL-Attack       4.8       3.7         0.83        3.5       1.3         0.87

                SNLI (OriAcc = 90.0)              MNLI (OriAcc = 85.1)
Model           AttAcc↓   Perturb%↓   Sim↑        AttAcc↓   Perturb%↓   Sim↑
BERT-Attack     11.8      10.9        0.48        9.9       8.4         0.62
TextFooler      12.4      26.0        0.50        17.5      20.9        0.61
HotFlip         6.3       11.0        0.54        8.0       9.4         0.61
VL-Attack       4.8       9.5         0.56        5.3       7.3         0.66

Table 2: The ablation study of the proposed VL-Attack with naively implemented insertions and deletions. Results are obtained on the IMDB dataset, and we keep the same settings as in Table 1.

Model            AttAcc   Perturb%   Sim
VL-Attack        3.5      1.3        0.87
+ Naive Insert   5.6      3.9        0.77
+ Naive Delete   7.9      5.3        0.80

Table 3: Human evaluation results, where "/" indicates that the results are only applicable to adversarial samples.

Dataset   Method      Grammar   Semantic   Consistency
MNLI      Original    4.71      /          /
          HotFlip     3.98      0.75       0.60
          VL-Attack   4.30      0.86       0.77
IMDB      Original    4.88      /          /
          HotFlip     4.02      0.78       0.73
          VL-Attack   4.57      0.89       0.85

4.1.2 Automatic Evaluation

The results of the adversarial attack on NLU tasks are listed in Table 1. The proposed VL-Attack outperforms the compared state-of-the-art black-box baselines as well as the replacement-based white-box baseline in all settings. On the sentiment classification tasks, which are easier to attack, our method drastically drops the accuracy of the victim model by more than 95% with only editing less than 4% of the tokens. Our method achieves similar performance while editing less than 10% of the tokens on the harder natural language inference tasks.

Compared with HotFlip, the baseline based on the replacement operation, our method is able to achieve more successful attacks with fewer perturbations as well as maintaining higher semantic similarity, which illustrates that more natural adversarial examples can be obtained by introducing the insertion and deletion operations.

4.1.3 Ablation Study

We achieve the insertion and deletion operations by introducing and leveraging a special blank token. The benefits of doing so are twofold.
Firstly, the [BLK] token is trained as a blank space to the model, which can be treated as a placeholder that does not change the original semantics of the input. In this way, we ensure that the inserted token is selected based on the original context of the input, which cannot be achieved if other tokens are used as the placeholder while inserting (e.g., using existing special tokens such as [MASK] or duplicating the current token).

Secondly, determining the importance of tokens, in order to conduct attacks efficiently, is an important problem in textual adversarial attack (Li et al., 2020). Previous works achieve this by greedily running predictions (Li et al., 2020) or leveraging external tools (Liang et al., 2017), which lacks efficiency. Here, the importance of each token can be easily determined by its distance to [BLK] following Equation (5), as a byproduct of introducing the blank token.

Here, we verify the efficacy of the blank token by naively implementing the insertion and deletion operations. Specifically, we change the [BLK] token to the [MASK] token and then conduct insertion following the same method described in Section 3.1. For deletion, we randomly delete a token of the input.

We test these naive baselines on the IMDB dataset, and the results are shown in Table 2. We can find that naive insertion results in lower semantic similarity, indicating that the [MASK] token changes the context of the original input while inserting and thus maintains less of the original semantic information. Naive deletion achieves worse attacked accuracy while perturbing more tokens, which indicates the effectiveness and efficiency of the proposed blank token in determining the importance of different tokens and conducting attacks.
4.1.4 Human Evaluation

The human evaluation results are listed in Table 3, where we report the average scores of all annotators. Consistent with the results of the automatic evaluation, the adversarial examples generated by our method are both grammatically and semantically closer to the original samples than those of the replacement-based baseline. In addition, a majority of the generated examples preserve the same labels as the original samples according to the human annotators, showing that our attacking method is indistinguishable to human judgement most of the time.

4.1.5 Case Study

Table 4: Case studies of the proposed VL-Attack method on the MNLI dataset. "P" and "H" indicate premise and hypothesis respectively. "Origin" indicates the original input sample. Red words indicate replacements and underlined words indicate insertions.

Origin (Contradiction)
  P: Sandstone and granite were the materials used to build the Baroque church of Bom Jesus, famous for its casket of St. Francis Xavier's relics in the mausoleum to the right of the altar.
  H: St. Francis Xavier's relics were never recovered, unfortunately.
VL-Attack (Neutral)
  P: Sandstone and granite were the materials used to build the Baroque church of Bom Jesus, famous for its casket of St. Francis Xavier's relics in the mausoleum to the right of the altar.
  H: St. Francis Xavier's relics were never recapture, however.

Origin (Contradiction)
  P: Julius Caesar's nephew Octavian took the name Augustus; Rome ceased to be a republic, and became an empire.
  H: Rome never ceased to be a republic, and did not become an empire.
VL-Attack (Entailment)
  P: Julius Caesar's nephew Octavian took the name Augustus; Rome not come to be a republic, and became an empire.
  H: Rome never ceased to be a republic, and did not become an empire.

In Table 4, we provide several adversarial examples generated by our method on the MNLI dataset, with perturbations marked. The generated examples are generally semantically consistent with the original input, as well as indistinguishable to human judgement.

4.1.6 Adversarial Training on NLU Tasks

We also conduct additional investigations of adversarial training on NLU tasks to verify whether the generated adversarial examples preserve the original labels. Specifically, we first generate adversarial examples on the training set of each task, which are then concatenated with the original training set to fine-tune a pre-trained BERT model. We then evaluate the model on clean test sets, where the model performance would drop drastically if most adversarial examples were inconsistent with their original labels.
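A sketch of this augmentation loop is given below, assuming attack wraps the VL-Attack routine with the model-specific arguments already bound and that fine-tuning itself is handled elsewhere; all names are illustrative.

```python
def augment_training_set(train_set, attack):
    """Adversarial data augmentation for NLU fine-tuning (sketch).

    `attack(tokens, label)` runs the VL-Attack loop against the current victim model
    and returns an adversarial token list, or None if the attack fails.  The original
    label is kept for the adversarial copy, so the augmentation is only sound when
    the attack is label-preserving.
    """
    augmented = list(train_set)
    for tokens, label in train_set:
        adv = attack(tokens, label)
        if adv is not None:
            augmented.append((adv, label))
    return augmented  # fine-tune BERT on this set, then evaluate on the clean test set
```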
Table 5: Results of fine-tuning a pre-trained BERT model with adversarial training on NLU tasks.

Model         YELP   IMDB   SNLI   MNLI
Origin        96.8   90.4   90.0   85.1
+ VL-Attack   96.4   91.2   90.1   84.7

The results are listed in Table 5. With adversarial training, the model performs comparably with the original performance (the change of accuracy varies from −0.4 to +0.8) on the clean test sets, showing that most generated adversarial examples are consistent with the original labels and do not hurt the model performance.

4.2 Adversarial Training on NAT

In this section, we utilize our method for adversarial training on NAT tasks to alleviate some key problems in NAT models, as discussed in Section 3.2. The adversarial training results on NLU tasks are also provided in the supplementary material.

4.2.1 Experimental Setup

Dataset and Model Configurations. We conduct experiments on benchmark machine translation datasets including IWSLT14 German→English (IWSLT14 De-En, https://wit3.fbk.eu/), WMT14 English→German (WMT14 En-De, https://www.statmt.org/wmt14/translation-task), and WMT16 English↔Romanian (WMT16 En-Ro, https://www.statmt.org/wmt16/translation-task). For IWSLT14, we adopt the official split of train/valid/test sets. For the WMT14 tasks, we utilize newstest2013 and newstest2014 as the validation and test sets respectively. For the WMT16 tasks, we use newsdev2016 and newstest2016 as the validation and test sets. We tokenize the sentences with Moses (Koehn et al., 2007) and segment each word into subwords using Byte-Pair Encoding (BPE) (Sennrich et al., 2015), resulting in a 32k vocabulary shared by the source and target languages. On the WMT tasks, the model architecture is akin to the base configuration of the Transformer model (Vaswani et al., 2017) (d_hidden = 512, d_FFN = 2048, n_layer = 6, n_head = 8). We use a smaller configuration on the IWSLT14 De-En task (d_hidden = 256, d_FFN = 512, n_layer = 5, n_head = 4).

We choose the mask-predict (Ghazvininejad et al., 2019) model as the victim NAT model, using the Transformer base configuration on the WMT tasks and the smaller configuration on the IWSLT14 De-En task. Following mask-predict, we utilize sequence-level knowledge distillation (Kim and Rush, 2016) on the training set of the WMT14 En-De task to provide less noisy and more deterministic training data for NAT models (Gu et al., 2017), and we use the raw datasets for the other tasks. We train the model on 1/8 Nvidia P100 GPUs for the IWSLT14/WMT tasks respectively. For evaluation, we report tokenized BLEU scores (Papineni et al., 2002) measured by multi-bleu.perl.
Table 6: The BLEU scores of mask-predict with adversarial training and baselines on the IWSLT14 De-En, WMT14 En-De and WMT16 En-Ro tasks. The results of the baselines are obtained by our implementation.

Models          IWSLT14 De-En   WMT14 En-De   WMT16 En-Ro
Transformer     32.59           28.04         34.13
Mask-Predict    31.71           27.03         33.08
+ HotFlip       32.05           26.91         33.12
+ VL-Attack     33.18           27.55         33.57

Table 7: The BLEU scores of mask-predict when facing adversarial attacks. The results are obtained on the IWSLT14 German-English translation task.

                                  IWSLT14 De-En
Models                            dev      test
Mask-Predict                      32.38    31.71
  Attacked by HotFlip             22.95    21.01
  Attacked by VL-Attack           21.29    19.47
Mask-Predict + VL-Attack          33.84    33.18
  Attacked by HotFlip             30.79    29.85
  Attacked by VL-Attack           28.42    27.13

Adversarial Training Pipeline. We first pre-train a mask-predict model, on which we execute adversarial attacks on the training set following the process described in Section 3.1, except that we conduct a fixed number of attacking steps (randomly sampled from 1 to 15% of the sequence length), as a "successful attack" in NLG tasks is not well defined. Then we fine-tune the model on the concatenation of the original training set and the attacked training set. As for baselines, we compare our method with HotFlip, a white-box attacking method mainly based on the replacement operation; we consider its word-level variant.
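This pipeline could be organized as in the sketch below, where attack_step is a hypothetical wrapper around one atomic operation against the pre-trained mask-predict model and each training pair receives a fixed, randomly drawn number of attack steps; the data layout and names are assumptions rather than the authors' implementation.

```python
import random

def build_nat_adversarial_data(train_triples, attack_step, max_ratio=0.15):
    """Build the attacked training set for NAT adversarial fine-tuning (sketch).

    Each triple is (x, y_r, y): source tokens, observed (residual) decoder inputs and
    gold target tokens.  `attack_step(tokens, y, side)` applies one atomic operation
    against the pre-trained mask-predict model; the gold target y is never modified.
    """
    attacked = []
    for x, y_r, y in train_triples:
        k = random.randint(1, max(1, int(max_ratio * len(x))))   # fixed number of steps per pair
        x_adv, y_r_adv = list(x), list(y_r)
        for _ in range(k):
            x_adv = attack_step(x_adv, y, side="encoder")
            y_r_adv = attack_step(y_r_adv, y, side="decoder")
        attacked.append((x_adv, y_r_adv, y))
    return train_triples + attacked   # fine-tune mask-predict on the concatenation
```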
4.2.2 Results

The main results are listed in Table 6. Our method outperforms the original mask-predict model as well as the HotFlip baseline on all datasets. Specifically, we can find that HotFlip, the replacement-based adversarial attack method, does not provide clear improvements over the original mask-predict model, indicating that fixed-length adversarial examples are not helpful for the training of NAT models. On the contrary, the proposed VL-Attack is able to boost the performance of the mask-predict model by providing variable-length adversarial examples.

4.2.3 Adversarial Defense

Here, we verify that in addition to enhancing the model performance on regular test sets, our method can also help the model defend against adversarial attacks. We attack the original mask-predict model as well as the model fine-tuned by our method on the validation and test sets of the IWSLT14 German-English translation task, setting the number of attacking steps to 3.

Results are shown in Table 7, from which we can find that the proposed VL-Attack provides more critical attacks than HotFlip, as more BLEU points are dropped. In addition, fine-tuning with adversarial examples generated by our method successfully promotes the robustness of the mask-predict model. Specifically, on the test set, the relative drop of the BLEU score is decreased from 33.74% to 10.04% for HotFlip and from 38.60% to 18.23% for the proposed method, showing that the adversarial training method can help the model defend against both fixed-length and variable-length adversarial attacks.

5 Conclusion

In this paper, we propose a variable-length adversarial attack method on textual data, which consists of three atomic operations: replacement, insertion and deletion. We integrate them into a unified framework by introducing and leveraging a special blank token, which is trained to serve as a blank space to the victim model, and can therefore be utilized as a placeholder while inserting and as a reference for measuring the importance of tokens while deleting. We verify the effectiveness of the proposed method by attacking a pre-trained BERT model on natural language understanding tasks, and then show that our method is beneficial to non-autoregressive machine translation (NAT), a task that is sensitive to the input lengths, when treated as a data augmentation method. Our method is able to boost the translation performance of the NAT model as well as its robustness against adversarial attacks. In the future, we will try to extend our method to the black-box setting, by switching from gradient-based replacement to heuristic approaches, or by leveraging imitation learning (Wallace et al., 2020).

References

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998.

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324.

Marjan Ghazvininejad, Omer Levy, and Luke Zettlemoyer. 2020. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.

Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3723–3730.

Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2020a. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7839–7846.
Junliang Guo, Linli Xu, and Enhong Chen. 2020b. Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 376–385.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932.

Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.

Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2018. TextBugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271.

Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020. BERT-Attack: Adversarial attack against BERT using BERT. arXiv preprint arXiv:2004.09984.

Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

John X Morris, Eli Lifland, Jack Lanchantin, Yangfeng Ji, and Yanjun Qi. 2020. Reevaluating adversarial examples in natural language. arXiv preprint arXiv:2004.14174.

Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. In MILCOM 2016 - 2016 IEEE Military Communications Conference, pages 49–54. IEEE.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Guanhong Tao, Shiqing Ma, Yingqi Liu, and Xiangyu Zhang. 2018. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In Advances in Neural Information Processing Systems, pages 7717–7728.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125.

Eric Wallace, Mitchell Stern, and Dawn Song. 2020. Imitation attacks and defenses for black-box machine translation systems. arXiv preprint arXiv:2004.15015.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. 2020. Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6066–6080.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2017. Generating natural adversarial examples. arXiv preprint arXiv:1710.11342.