Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning
Hongge Chen1*, Huan Zhang2,3*, Pin-Yu Chen3, Jinfeng Yi4, and Cho-Jui Hsieh2
1 MIT, Cambridge, MA 02139, USA   2 UC Davis, Davis, CA 95616, USA   3 IBM Research, NY 10598, USA   4 JD AI Research, Beijing, China
chenhg@mit.edu, ecezhang@ucdavis.edu, pin-yu.chen@ibm.com, yijinfeng@jd.com, chohsieh@ucdavis.edu
* Hongge Chen and Huan Zhang contributed equally to this work.

arXiv:1712.02051v2 [cs.CV] 22 May 2018

Abstract

Visual language grounding is widely studied in modern neural image captioning systems, which typically adopt an encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for language caption generation. To study the robustness of language grounding to adversarial perturbations in machine vision and perception, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. The proposed algorithm provides two evaluation approaches, which check whether neural image captioning systems can be misled to output some randomly chosen captions or keywords. Our extensive experiments show that our algorithm can successfully craft visually-similar adversarial examples with randomly targeted captions or keywords, and that the adversarial examples can be made highly transferable to other image captioning systems. Consequently, our approach leads to new robustness implications for neural image captioning and novel insights in visual language grounding.

1 Introduction

In recent years, language understanding grounded in machine vision and perception has made remarkable progress in natural language processing (NLP) and artificial intelligence (AI), such as image captioning and visual question answering. Image captioning is a multimodal learning task and has been used to study the interaction between language and vision models (Shekhar et al., 2017). It takes an image as an input and generates a language caption that best describes its visual contents, and it has many important applications such as developing image search engines with complex natural language queries, building AI agents that can see and talk, and promoting equal web access for people who are blind or visually impaired. Modern image captioning systems typically adopt an encoder-decoder framework composed of two principal modules: a convolutional neural network (CNN) as an encoder for image feature extraction and a recurrent neural network (RNN) as a decoder for caption generation. This CNN+RNN architecture includes popular image captioning models such as Show-and-Tell (Vinyals et al., 2015), Show-Attend-and-Tell (Xu et al., 2015) and NeuralTalk (Karpathy and Li, 2015).

Recent studies have highlighted the vulnerability of CNN-based image classifiers to adversarial examples: adversarial perturbations to benign images can be easily crafted to mislead a well-trained classifier, leading to adversarial examples that are visually indistinguishable to humans (Szegedy et al., 2014; Goodfellow et al., 2015). In this study, we investigate a more challenging problem in the visual language grounding domain that evaluates the robustness of a multimodal RNN in the form of a CNN+RNN architecture, and use neural image captioning as a case study. Note that crafting adversarial examples in image captioning tasks is strictly harder than in well-studied image classification tasks, due to the following reasons: (i) class attack v.s. caption attack: unlike classification tasks where the class labels are well defined, the output of image captioning is a set of top-ranked captions. Simply treating different captions as distinct classes will result in an enormous number of classes that can even exceed the number of training images. In addition, semantically similar captions can be expressed in different ways and hence should not be viewed as different classes; and (ii) CNN v.s. CNN+RNN: attacking RNN models is significantly less well-studied than attacking CNN models. The CNN+RNN architecture is unique and beyond the scope of adversarial examples in CNN-based image classifiers.
In this paper, we tackle the aforementioned challenges by proposing a novel algorithm called Show-and-Fool. We formulate the process of crafting adversarial examples in neural image captioning systems as optimization problems with novel objective functions designed for the CNN+RNN architecture. Specifically, our objective function is a linear combination of the distortion between benign and adversarial examples and some carefully designed loss functions. The proposed Show-and-Fool algorithm provides two approaches to crafting adversarial examples in neural image captioning under different scenarios:

1. Targeted caption method: Given a targeted caption, craft adversarial perturbations to any image such that its generated caption matches the targeted caption.

2. Targeted keyword method: Given a set of keywords, craft adversarial perturbations to any image such that its generated caption contains the specified keywords. The captioning model has the freedom to make sentences with the target keywords in any order.

Figure 1: Adversarial examples crafted by Show-and-Fool using the targeted caption method. The target captioning model is Show-and-Tell (Vinyals et al., 2015), the original images are selected from the MSCOCO validation set, and the targeted captions are randomly selected from the top-1 inferred captions of other validation images.

As an illustration, Figure 1 shows an adversarial example crafted by Show-and-Fool using the targeted caption method. The adversarial perturbations are visually imperceptible while successfully misleading Show-and-Tell to generate the targeted captions. Interestingly and perhaps surprisingly, our results pinpoint the Achilles heel of the language and vision models used in the tested image captioning systems. Moreover, the adversarial examples in neural image captioning highlight the inconsistency in visual language grounding between humans and machines, suggesting a possible weakness of current machine vision and perception machinery. Below we highlight our major contributions:

• We propose Show-and-Fool, a novel optimization-based approach to crafting adversarial examples in image captioning. We provide two types of adversarial examples, targeted caption and targeted keyword, to analyze the robustness of neural image captioners. To the best of our knowledge, this is the very first work on crafting adversarial examples for image captioning.

• We propose powerful and generic loss functions that can craft adversarial examples and evaluate the robustness of encoder-decoder pipelines in the form of a CNN+RNN architecture. In particular, our loss designed for the targeted keyword attack only requires the adversarial caption to contain a few specified keywords; we allow the neural network to make meaningful sentences with these keywords on its own.

• We conduct extensive experiments on the MSCOCO dataset. Experimental results show that our targeted caption method attains a 95.8% attack success rate when crafting adversarial examples with randomly assigned captions. In addition, our targeted keyword attack yields an even higher success rate. We also show that attacking CNN+RNN models is inherently different from and more challenging than only attacking CNN models.

• We also show that Show-and-Fool can produce highly transferable adversarial examples: an adversarial image generated for fooling Show-and-Tell can also fool other image captioning models, leading to new robustness implications for neural image captioning systems.
2 Related Work

In this section, we review the existing work on visual language grounding, with a focus on neural image captioning. We also review related work on adversarial attacks on CNN-based image classifiers. Due to space limitations, we defer the second part to the supplementary material.

Visual language grounding represents a family of multimodal tasks that bridge visual and natural language understanding. Typical examples include image and video captioning (Karpathy and Li, 2015; Vinyals et al., 2015; Donahue et al., 2015b; Pasunuru and Bansal, 2017; Venugopalan et al., 2015), visual dialog (Das et al., 2017; De Vries et al., 2017), visual question answering (Antol et al., 2015; Fukui et al., 2016; Lu et al., 2016; Zhu et al., 2017), visual storytelling (Huang et al., 2016), natural question generation (Mostafazadeh et al., 2017, 2016), and image generation from captions (Mansimov et al., 2016; Reed et al., 2016). In this paper, we focus on studying the robustness of neural image captioning models, and believe that the proposed method also sheds light on robustness evaluation for other visual language grounding tasks using a similar multimodal RNN architecture.

Many image captioning methods based on deep neural networks (DNNs) adopt a multimodal RNN framework that first uses a CNN model as the encoder to extract a visual feature vector, followed by an RNN model as the decoder for caption generation. Representative works under this framework include (Chen and Zitnick, 2015; Devlin et al., 2015; Donahue et al., 2015a; Karpathy and Li, 2015; Mao et al., 2015; Vinyals et al., 2015; Xu et al., 2015; Yang et al., 2016; Liu et al., 2017a,b), which mainly differ in the underlying CNN and RNN architectures and in whether or not attention mechanisms are considered. Other lines of research generate image captions using semantic information or via a compositional approach (Fang et al., 2015; Gan et al., 2017; Tran et al., 2016; Jia et al., 2015; Wu et al., 2016; You et al., 2016).

The recent work of (Shekhar et al., 2017) touched upon the robustness of neural image captioning for language grounding by showing its insensitivity to one-word (foil word) changes in the language caption, which corresponds to the untargeted attack category of adversarial examples. In this paper, we focus on the more challenging targeted attack setting that requires fooling the captioning models and enforcing them to generate pre-specified captions or keywords.

3 Methodology of Show-and-Fool

3.1 Overview of the Objective Functions

We now formally introduce our approaches to crafting adversarial examples for neural image captioning. The problem of finding an adversarial example for a given image I can be cast as the following optimization problem:

\[ \min_{\delta} \; c \cdot \mathrm{loss}(I+\delta) + \|\delta\|_2^2 \quad \text{s.t.} \quad I+\delta \in [-1,1]^n. \tag{1} \]

Here δ denotes the adversarial perturbation to I. ‖δ‖₂² = ‖(I+δ) − I‖₂² is an ℓ2 distance metric between the original image and the adversarial image. loss(·) is an attack loss function which takes different forms in different attack settings; we provide the explicit expressions in Sections 3.2 and 3.3. The term c > 0 is a pre-specified regularization constant. Intuitively, with larger c the attack is more likely to succeed, but at the price of higher distortion on δ. In our algorithm, we use a binary search strategy to select c. The box constraint on the image, I ∈ [−1, 1]^n, ensures that the adversarial example I + δ ∈ [−1, 1]^n lies within a valid image space.

For the purpose of efficient optimization, we convert the constrained minimization problem in (1) into an unconstrained minimization problem by introducing two new variables y ∈ R^n and w ∈ R^n such that

\[ y = \mathrm{arctanh}(I) \quad \text{and} \quad w = \mathrm{arctanh}(I+\delta) - y, \]

where arctanh denotes the inverse hyperbolic tangent function and is applied element-wise. Since tanh(y_i + w_i) ∈ [−1, 1], the transformation automatically satisfies the box constraint. Consequently, the constrained optimization problem in (1) is equivalent to

\[ \min_{w \in \mathbb{R}^n} \; c \cdot \mathrm{loss}(\tanh(w+y)) + \|\tanh(w+y) - \tanh(y)\|_2^2. \tag{2} \]
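To make the reformulation concrete, the following is a minimal sketch of how problem (2) can be minimized with a gradient-based optimizer. It is written in PyTorch purely for illustration (the authors' released implementation is in TensorFlow), and `attack_loss` is a hypothetical callable standing in for the losses defined in Sections 3.2 and 3.3.

```python
import torch

def show_and_fool_step(image, attack_loss, c=1.0, steps=1000, lr=0.005):
    """Minimize Eq. (2): c * loss(tanh(w + y)) + ||tanh(w + y) - tanh(y)||_2^2.

    image:       tensor with pixel values already scaled to (-1, 1).
    attack_loss: callable mapping a perturbed image to the attack loss.
    """
    # Change of variables: y = arctanh(I); any w keeps tanh(w + y) inside [-1, 1].
    y = torch.atanh(image.clamp(-0.9999, 0.9999))
    w = torch.zeros_like(y, requires_grad=True)      # w = 0 corresponds to delta = 0
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        adv = torch.tanh(w + y)                      # adversarial image; box constraint is free
        distortion = torch.sum((adv - torch.tanh(y)) ** 2)
        loss = c * attack_loss(adv) + distortion
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return torch.tanh(w + y).detach()
```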
In the following sections, we present our designed loss functions for different attack settings.

3.2 Targeted Caption Method

Note that a targeted caption is denoted by

\[ S = (S_1, S_2, \ldots, S_t, \ldots, S_N), \]

where S_t indicates the index of the t-th word in the vocabulary list V, S_1 is a start symbol and S_N indicates the end symbol. N is the length of caption S, which is not fixed but does not exceed a predefined maximum caption length. To encourage the neural image captioning system to output the targeted caption S, one needs to ensure that the log probability of the caption S conditioned on the image I + δ attains the maximum value among all possible captions, that is,

\[ \log P(S \mid I+\delta) = \max_{S' \in \Omega} \log P(S' \mid I+\delta), \tag{3} \]

where Ω is the set of all possible captions. It is also common to apply the chain rule to the joint probability, giving

\[ \log P(S' \mid I+\delta) = \sum_{t=2}^{N} \log P(S'_t \mid I+\delta, S'_1, \ldots, S'_{t-1}). \]

In neural image captioning networks, P(S'_t | I + δ, S'_1, ..., S'_{t−1}) is usually computed by an RNN/LSTM cell f, with its hidden state h_{t−1} and input S'_{t−1}:

\[ z_t = f(h_{t-1}, S'_{t-1}) \quad \text{and} \quad p_t = \mathrm{softmax}(z_t), \tag{4} \]

where z_t := [z_t^{(1)}, z_t^{(2)}, ..., z_t^{(|V|)}] ∈ R^{|V|} is a vector of the logits (unnormalized probabilities) for each possible word in the vocabulary. The vector p_t represents a probability distribution on V with each coordinate p_t^{(i)} defined as:

\[ p_t^{(i)} := P(S'_t = i \mid I+\delta, S'_1, \ldots, S'_{t-1}). \]

Following the definition of the softmax function:

\[ P(S'_t \mid I+\delta, S'_1, \ldots, S'_{t-1}) = \exp(z_t^{(S'_t)}) \Big/ \sum_{i \in \mathcal{V}} \exp(z_t^{(i)}). \]

Intuitively, to maximize the targeted caption's probability, we can directly use its negative log probability (5) as a loss function. The inputs of the RNN are the first N − 1 words of the targeted caption (S_1, S_2, ..., S_{N−1}):

\[ \mathrm{loss}_{S,\text{log-prob}}(I+\delta) = -\log P(S \mid I+\delta) = -\sum_{t=2}^{N} \log P(S_t \mid I+\delta, S_1, \ldots, S_{t-1}). \tag{5} \]

Applying (5) to (2), the formulation of the targeted caption method given a targeted caption S is:

\[ \min_{w \in \mathbb{R}^n} \; c \cdot \mathrm{loss}_{S,\text{log-prob}}(\tanh(w+y)) + \|\tanh(w+y) - \tanh(y)\|_2^2. \]

Alternatively, using the definition of the softmax function,

\[ \log P(S' \mid I+\delta) = \sum_{t=2}^{N} \Big[ z_t^{(S'_t)} - \log\Big( \sum_{i \in \mathcal{V}} \exp(z_t^{(i)}) \Big) \Big] = \sum_{t=2}^{N} z_t^{(S'_t)} - \text{constant}, \tag{6} \]

(3) can be simplified as

\[ \log P(S \mid I+\delta) \propto \sum_{t=2}^{N} z_t^{(S_t)} = \max_{S' \in \Omega} \sum_{t=2}^{N} z_t^{(S'_t)}. \]

Instead of making each z_t^{(S_t)} as large as possible, it is sufficient to require the target word S_t to attain the largest (top-1) logit (or probability) among all the words in the vocabulary at position t. In other words, we aim to minimize the difference between the maximum logit except S_t, denoted by max_{k∈V, k≠S_t}{z_t^{(k)}}, and the logit of S_t, denoted by z_t^{(S_t)}. We also propose a ramp function on top of this difference as the final loss function:

\[ \mathrm{loss}_{S,\text{logits}}(I+\delta) = \sum_{t=2}^{N-1} \max\big\{ -\epsilon, \; \max_{k \neq S_t}\{z_t^{(k)}\} - z_t^{(S_t)} \big\}, \tag{7} \]

where ε > 0 is a confidence level accounting for the gap between max_{k≠S_t}{z_t^{(k)}} and z_t^{(S_t)}. When z_t^{(S_t)} > max_{k≠S_t}{z_t^{(k)}} + ε, the corresponding term in the summation is kept at −ε and does not contribute to the gradient of the loss function, encouraging the optimizer to focus on minimizing other terms where z_t^{(S_t)} is not large enough.

Applying the loss (7) to (1), the final formulation of the targeted caption method given a targeted caption S is

\[ \min_{w \in \mathbb{R}^n} \; c \cdot \sum_{t=2}^{N-1} \max\big\{ -\epsilon, \; \max_{k \neq S_t}\{z_t^{(k)}\} - z_t^{(S_t)} \big\} + \|\tanh(w+y) - \tanh(y)\|_2^2. \]
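As a concrete illustration of the loss in (7), the snippet below computes the hinge-like logits loss from a matrix of per-step logits. It is a sketch under the assumption that the captioning model has already been run with the targeted caption as the RNN input so that row t of `logits` contains the scores z_t; tensor names and shapes are illustrative and not taken from the released implementation.

```python
import torch

def targeted_caption_logits_loss(logits, target_ids, eps=1.0):
    """Hinge-style logits loss of Eq. (7).

    logits:     tensor of shape (N, |V|); row t holds the scores z_t obtained when
                the first t words of the targeted caption are fed to the RNN.
    target_ids: LongTensor of shape (N,) with the word indices S_1, ..., S_N.
    eps:        confidence margin (epsilon in the paper).
    """
    total = logits.new_zeros(())
    for t in range(1, logits.shape[0] - 1):          # t = 2, ..., N-1 in paper indexing
        z_t = logits[t]
        s_t = int(target_ids[t])
        z_target = z_t[s_t]
        # largest logit among all words except the target word S_t
        z_other = torch.max(torch.cat([z_t[:s_t], z_t[s_t + 1:]]))
        total = total + torch.clamp(z_other - z_target, min=-eps)
    return total
```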
We note that (Carlini and Wagner, 2017) has reported that in CNN-based image classification, using logits in the attack loss function can produce better adversarial examples than using probabilities, especially when the target network deploys gradient masking schemes such as defensive distillation (Papernot et al., 2016b). Therefore, we provide both logit-based and probability-based attack loss functions for neural image captioning.

3.3 Targeted Keyword Method

In addition to generating an exact targeted caption by perturbing the input image, we offer an intermediate option that aims at generating captions with specific keywords, denoted by K := {K_1, ..., K_M} ⊂ V. Intuitively, finding an adversarial image generating a caption with specific keywords might be easier than generating an exact caption, as we allow more degrees of freedom in caption generation. However, as we need to ensure a valid and meaningful inferred caption, finding an adversarial example with specific keywords in its caption is difficult from an optimization perspective. Our targeted keyword method can be used to investigate the generalization capability of a neural captioning system given only a few keywords.

In our method, we do not require a target keyword K_j, j ∈ [M], to appear at a particular position. Instead, we want a loss function that allows K_j to become the top-1 prediction (plus a confidence margin ε) at any position. Therefore, we propose to use the minimum of the hinge-like loss terms over all t ∈ [N] as an indication of K_j appearing at any position as the top-1 prediction, leading to the following loss function:

\[ \mathrm{loss}_{K,\text{logits}} = \sum_{j=1}^{M} \min_{t \in [N]} \big\{ \max\{ -\epsilon, \; \max_{k \neq K_j}\{z_t^{(k)}\} - z_t^{(K_j)} \} \big\}. \tag{8} \]

We note that the loss functions in (4) and (5) require an input S'_{t−1} to predict z_t for each t ∈ {2, ..., N}. For the targeted caption method, we use the targeted caption S as the input of the RNN. In contrast, for the targeted keyword method we no longer know the exact targeted sentence, but only require the presence of the specified keywords in the final caption. To bridge the gap, we use the originally inferred caption S⁰ = (S⁰_1, ..., S⁰_N) from the benign image as the initial input to the RNN. Specifically, after minimizing (8) for T iterations, we run inference on I + δ, set the RNN's input S¹ to its current top-1 prediction, and continue this process. With this iterative optimization process, the desired keywords are expected to gradually appear in the top-1 prediction.

Another challenge that arises in the targeted keyword method is the problem of "keyword collision". When the number of keywords M ≥ 2, more than one keyword may have large values of max_{k≠K_j}{z_t^{(k)}} − z_t^{(K_j)} at the same position t. For example, if dog and cat are the top-2 predictions for the second word in a caption, the caption can either start with "A dog ..." or "A cat ...". In this case, despite the loss (8) being very small, a caption with both dog and cat can hardly be generated, since only one word is allowed to appear at a given position. To alleviate this problem, we define a gate function g_{t,j}(x) which masks off all the other keywords when a keyword becomes top-1 at position t:

\[ g_{t,j}(x) = \begin{cases} A, & \text{if } \arg\max_{i \in \mathcal{V}} z_t^{(i)} \in K \setminus \{K_j\} \\ x, & \text{otherwise,} \end{cases} \]

where A is a predefined value that is significantly larger than common logit values. Then (8) becomes:

\[ \sum_{j=1}^{M} \min_{t \in [N]} \big\{ g_{t,j}\big( \max\{ -\epsilon, \; \max_{k \neq K_j}\{z_t^{(k)}\} - z_t^{(K_j)} \} \big) \big\}. \tag{9} \]

The log-prob loss for the targeted keyword method is discussed in the Supplementary Material.
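The gated keyword loss in (9) can be sketched as follows. The gate is implemented by overwriting the hinge term with the large constant A whenever another target keyword already occupies position t, so that position stops attracting K_j. This is an illustrative PyTorch sketch; `logits` is again assumed to be the (N, |V|) matrix of per-step scores obtained with the current RNN input.

```python
import torch

A = 1e4   # the constant A in the gate: far larger than any ordinary logit

def targeted_keyword_logits_loss(logits, keyword_ids, eps=1.0):
    """Gated keyword loss of Eq. (9).

    logits:      tensor of shape (N, |V|), obtained with the current RNN input
                 (initially the caption inferred from the benign image).
    keyword_ids: list of vocabulary indices, one per target keyword K_j.
    """
    top1 = logits.argmax(dim=1)                      # current top-1 word at each position
    total = logits.new_zeros(())
    for kj in keyword_ids:
        terms = []
        for t in range(1, logits.shape[0]):
            z_t = logits[t]
            z_other = torch.max(torch.cat([z_t[:kj], z_t[kj + 1:]]))
            hinge = torch.clamp(z_other - z_t[kj], min=-eps)
            # Gate g_{t,j}: if another target keyword already owns position t,
            # overwrite the term with A so K_j is pushed toward a different position.
            if int(top1[t]) in keyword_ids and int(top1[t]) != kj:
                hinge = torch.full_like(hinge, A)
            terms.append(hinge)
        total = total + torch.min(torch.stack(terms))  # K_j only needs its best position
    return total
```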
4 Experiments

4.1 Experimental Setup and Algorithms

We performed extensive experiments to test the effectiveness of our Show-and-Fool algorithm and study the robustness of image captioning systems under different problem settings. In our experiments, we use the pre-trained TensorFlow implementation of Show-and-Tell (Vinyals et al., 2015) (https://github.com/tensorflow/models/tree/master/research/im2txt) with Inception-v3 as the CNN for visual feature extraction; our source code is available at https://github.com/huanzhang12/ImageCaptioningAttack. Our testbed is the Microsoft COCO (Lin et al., 2014) (MSCOCO) data set. Although some more recent neural image captioning systems can achieve better performance than Show-and-Tell, they share a similar framework that uses a CNN for feature extraction and an RNN for caption generation, and Show-and-Tell is the vanilla version of this CNN+RNN architecture. Indeed, we find that the adversarial examples on Show-and-Tell are transferable to other image captioning models such as Show-Attend-and-Tell (Xu et al., 2015) and NeuralTalk2 (https://github.com/karpathy/neuraltalk2), suggesting that the attention mechanism and the choice of CNN and RNN architectures do not significantly affect the robustness. We also note that since Show-and-Fool is, to the best of our knowledge, the first work on crafting adversarial examples for neural image captioning, there is no other method for comparison.

We use ADAM to minimize our loss functions and set the learning rate to 0.005. The number of iterations is set to 1,000. All the experiments are performed on a single Nvidia GTX 1080 Ti GPU. For the targeted caption and targeted keyword methods, we perform a binary search for 5 times to find the best c: initially c = 1, and c is increased by 10 times until a successful adversarial example is found. Then, we choose a new c to be the average of the largest c where an adversarial example can be found and the smallest c where an adversarial example cannot be found. We fix ε = 1 except for the transferability experiments. For each experiment, we randomly select 1,000 images from the MSCOCO validation set. We use BLEU-1 (Papineni et al., 2002), BLEU-2, BLEU-3, BLEU-4, ROUGE (Lin, 2004) and METEOR (Lavie and Agarwal, 2005) scores to evaluate the correlations between the inferred captions and the targeted captions. These scores are widely used in the NLP community and are adopted by image captioning systems for quality assessment. Throughout this section, we use the logits losses (7) and (9). The results of using the log-prob loss (5) are similar and are reported in the supplementary material.

Table 1: Summary of the targeted caption method (Section 3.2) and the targeted keyword method (Section 3.3) using the logits loss. The ℓ2 distortion of the adversarial noise ‖δ‖₂ is averaged over successful adversarial examples. For comparison, we also include CNN-based attack methods (Section 4.5).

  Experiments        Success Rate   Avg. ‖δ‖₂
  targeted caption   95.8%          2.213
  1-keyword          97.1%          1.589
  2-keyword          97.5%          2.363
  3-keyword          96.0%          2.626
  C&W on CNN         22.4%          2.870
  I-FGSM on CNN      34.5%          15.596

Table 2: Statistics of the 4.2% failed adversarial examples using the targeted caption method and the logits loss (7). All correlation scores are computed using the top-5 inferred captions of an adversarial image and the targeted caption (a higher score means better targeted attack performance).

  c              1      10     10^2   10^3   10^4
  ℓ2 Distortion  1.726  3.400  7.690  16.03  23.31
  BLEU-1         .567   .725   .679   .701   .723
  BLEU-2         .420   .614   .559   .585   .616
  BLEU-3         .320   .509   .445   .484   .514
  BLEU-4         .252   .415   .361   .402   .417
  ROUGE          .502   .664   .629   .638   .672
  METEOR         .258   .407   .375   .403   .399
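For concreteness, the c-search described in Section 4.1 can be sketched as below. This is a simplified interpretation, not the released code: `run_attack` is a hypothetical wrapper that runs 1,000 ADAM iterations with a given c and reports whether the resulting caption meets the success criterion.

```python
def search_c(run_attack, c_init=1.0, growth=10.0, max_steps=5):
    """Sketch of the c-selection heuristic described above.

    run_attack(c) is a hypothetical wrapper returning (success, adversarial_image).
    """
    c, last_fail, best = c_init, None, None
    # Phase 1: grow c by 10x until a successful adversarial example is found.
    for _ in range(max_steps):
        success, adv = run_attack(c)
        if success:
            best = adv
            break
        last_fail, c = c, c * growth
    if best is None:
        return None                                  # attack failed for every tried c
    # Phase 2: retry with the average of the failing and succeeding values of c,
    # trading attack success against lower distortion.
    if last_fail is not None:
        success, adv = run_attack((last_fail + c) / 2.0)
        if success:
            best = adv
    return best
```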
4.2 Targeted Caption Results

Unlike the image classification task where all possible labels are predefined, the space of possible captions in a captioning system is almost infinite. However, the captioning system is only able to output relevant captions learned from the training set. For instance, the captioning model cannot generate a passive-voice sentence if the model was never trained on such sentences. Therefore, we need to ensure that the targeted caption lies in the space of captions that the captioning system can possibly generate. To address this issue, we use the generated caption of a randomly selected image (other than the image under investigation) from the MSCOCO validation set as the targeted caption S. The use of a generated caption as the targeted caption excludes the effect of out-of-domain captioning, and ensures that the target caption is within the output space of the captioning network.

Here we use the logits loss (7) plus an ℓ2 distortion term (as in (2)) as our objective function. A successful adversarial example is found if the inferred caption after adding the adversarial perturbation δ is exactly the same as the targeted caption. In our setting, 1,000 ADAM iterations take about 38 seconds for one image. The overall success rate and average distortion of the adversarial perturbation δ are shown in Table 1. Among all the tested images, our method attains a 95.8% attack success rate. Moreover, our adversarial examples have small ℓ2 distortions and are visually identical to the original images, as displayed in Figure 1. We also examine the failed adversarial examples and summarize their statistics in Table 2. We find that their generated captions, albeit not entirely identical to the targeted caption, are in fact highly correlated with the desired one. Overall, the high success rate and low ℓ2 distortion of the adversarial examples clearly show that Show-and-Tell is not robust to targeted adversarial perturbations.

4.3 Targeted Keyword Results

In this task, we use (9) as our loss function, and choose the number of keywords M ∈ {1, 2, 3}. We run an inference step on I + δ every T = 5 iterations, and use the top-1 caption as the input of the RNN/LSTM (a sketch of this iterative procedure is given at the end of this subsection). Similar to Section 4.2, for each image the targeted keywords are selected from the caption generated by a randomly selected validation set image. To exclude common words like "a", "the" and "and", we look up each word in the targeted sentence and only select nouns, verbs, adjectives or adverbs. We say an adversarial image is successful when its caption contains all specified keywords. The overall success rate and average distortion are shown in Table 1. Compared with the targeted caption method, the targeted keyword method achieves an even higher success rate (at least 96% for the 3-keyword case and at least 97% for the 1-keyword and 2-keyword cases).

Figure 2: An adversarial example (‖δ‖₂ = 1.284) of a cake image crafted by the Show-and-Fool targeted keyword method with three keywords: "dog", "cat" and "frisbee".

Figure 2 shows an adversarial example crafted by our targeted keyword method with three keywords: "dog", "cat" and "frisbee". Using Show-and-Fool, the top-1 caption of a cake image becomes "A dog and a cat are playing with a frisbee" while the adversarial image remains visually indistinguishable from the original one. When M = 2 and 3, even if we cannot find an adversarial image yielding all specified keywords, we might end up with a caption that contains some of the keywords (partial success). For example, when M = 3, Table 3 shows the number of keywords appearing in the captions (M′) for those failed examples (where not all 3 targeted keywords are found). These results clearly show that the 4.0% failed examples are still partially successful: the generated captions contain about 1.5 targeted keywords on average.

Table 3: Percentage of partial success with different c in the 4.0% failed images that do not contain all the 3 targeted keywords.

  c     Avg. ‖δ‖₂   M′ ≥ 1    M′ = 2   Avg. M′
  1     2.49        72.4%     34.5%    1.07
  10    5.40        82.7%     37.9%    1.21
  10^2  12.95       93.1%     58.6%    1.52
  10^3  24.77       96.5%     51.7%    1.48
  10^4  29.37       100.0%    58.6%    1.59
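The iterative procedure used in this subsection (optimize, re-run inference every T iterations, and feed the current top-1 caption back as the RNN input) can be sketched as follows. `model.infer` and `model.step_logits` are hypothetical wrappers around the captioning network, and `targeted_keyword_logits_loss` refers to the keyword-loss sketch given after Section 3.3; the exact bookkeeping of the released implementation may differ.

```python
import torch

def targeted_keyword_attack(image, model, keyword_ids, steps=1000, T=5, lr=0.005, c=1.0):
    """Sketch of the iterative targeted keyword attack described in Section 4.3.

    model.infer(img)          -> list of word ids of the current top-1 caption
    model.step_logits(img, w) -> (N, |V|) logits when word sequence w is fed to the RNN
    """
    y = torch.atanh(image.clamp(-0.9999, 0.9999))
    w = torch.zeros_like(y, requires_grad=True)
    optimizer = torch.optim.Adam([w], lr=lr)
    rnn_input = model.infer(image)                   # start from the benign caption S^0

    for step in range(steps):
        adv = torch.tanh(w + y)
        logits = model.step_logits(adv, rnn_input)
        loss = c * targeted_keyword_logits_loss(logits, keyword_ids) \
               + torch.sum((adv - torch.tanh(y)) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (step + 1) % T == 0:                      # refresh the RNN input every T steps
            rnn_input = model.infer(torch.tanh(w + y).detach())
            if all(k in rnn_input for k in keyword_ids):
                break                                # caption already contains all keywords
    return torch.tanh(w + y).detach()
```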
4.4 Transferability of Adversarial Examples

It has been shown that in image classification tasks, adversarial examples found for one machine learning model may also be effective against another model, even if the two models have different architectures (Papernot et al., 2016a; Liu et al., 2017c). However, unlike image classification where correct labels are made explicit, two different image captioning systems may generate quite different, yet semantically similar, captions for the same benign image. In image captioning, we say an adversarial example is transferable when the adversarial image found on model A with a target sentence S_A can generate a similar (rather than exact) sentence S_B on model B.

In our setting, model A is Show-and-Tell, and we choose Show-Attend-and-Tell (Xu et al., 2015) as model B. The major differences between Show-and-Tell and Show-Attend-and-Tell are the addition of attention units in the LSTM network for caption generation, and the use of the last convolutional layer (rather than the last fully-connected layer) feature maps for feature extraction. We use Inception-v3 as the CNN architecture for both models and train them on the MSCOCO 2014 data set. However, their CNN parameters are different due to the fine-tuning process.
Table 4: Transferability of adversarial examples from Show-and-Tell to Show-Attend-and-Tell, using different ε and c. ori indicates the scores between the generated captions of the original images and those of the transferred adversarial images on Show-Attend-and-Tell. tgt indicates the scores between the targeted captions on Show-and-Tell and the generated captions of the transferred adversarial images on Show-Attend-and-Tell. A smaller ori or a larger tgt value indicates better transferability. mis measures the differences between captions generated by the two models given the same benign image (model mismatch). When c = 1000 and ε = 10, tgt is close to mis, indicating that the discrepancy between adversarial captions on the two models is mostly bounded by model mismatch, and that the adversarial perturbation is highly transferable.

             ε = 1                          ε = 5                          ε = 10
             c=10      c=100     c=1000     c=10      c=100     c=1000     c=10      c=100     c=1000
             ori  tgt  ori  tgt  ori  tgt   ori  tgt  ori  tgt  ori  tgt   ori  tgt  ori  tgt  ori  tgt    mis
  BLEU-1     .474 .395 .384 .462 .347 .484  .441 .429 .368 .488 .337 .527  .431 .421 .360 .485 .339 .534   .649
  BLEU-2     .337 .236 .230 .331 .186 .342  .300 .271 .212 .343 .175 .389  .287 .266 .204 .342 .174 .398   .521
  BLEU-3     .256 .154 .151 .224 .114 .254  .220 .184 .135 .254 .103 .299  .210 .185 .131 .254 .102 .307   .424
  BLEU-4     .203 .109 .107 .172 .077 .198  .170 .134 .093 .197 .068 .240  .162 .138 .094 .197 .066 .245   .352
  ROUGE      .463 .371 .374 .438 .336 .465  .429 .402 .359 .464 .329 .502  .421 .398 .351 .463 .328 .507   .604
  METEOR     .201 .138 .139 .180 .118 .201  .177 .157 .131 .199 .110 .228  .172 .157 .127 .202 .110 .232   .300
  ‖δ‖₂       3.268     4.299     4.474      7.756     10.487    10.952     15.757    21.696    21.778

Figure 3: A highly transferable adversarial example (‖δ‖₂ = 15.226) crafted by the Show-and-Fool targeted caption method; it transfers to Show-Attend-and-Tell, yielding similar adversarial captions.

To investigate the transferability of adversarial examples in image captioning, we first use the targeted caption method to find adversarial examples for 1,000 images in model A with different c and ε, and then transfer the successful adversarial examples (which generate the exact target captions on model A) to model B. The generated captions by model B are recorded for transferability analysis. The transferability of adversarial examples depends on two factors: the intrinsic difference between the two models even when the same benign image is used as the input, i.e., model mismatch, and the transferability of the adversarial perturbations.

To measure the mismatch between Show-and-Tell and Show-Attend-and-Tell, we generate captions of the same set of 1,000 original images from both models, and report their mutual BLEU, ROUGE and METEOR scores in Table 4 under the mis column. To evaluate the effectiveness of the transferred adversarial examples, we measure the scores for two sets of captions: (i) the captions of the original images and the captions of the transferred adversarial images, both generated by Show-Attend-and-Tell (shown under column ori in Table 4); and (ii) the targeted captions for generating adversarial examples on Show-and-Tell, and the captions of the transferred adversarial images on Show-Attend-and-Tell (shown under column tgt in Table 4). Small values of ori suggest that the adversarial images on Show-Attend-and-Tell generate captions significantly different from the original images' captions. Large values of tgt suggest that the adversarial images on Show-Attend-and-Tell generate adversarial captions similar to those on the Show-and-Tell model. We find that increasing c or ε helps to enhance transferability at the cost of larger (but still acceptable) distortion. When c = 1,000 and ε = 10, Show-and-Fool achieves the best transferability results: tgt is close to mis, indicating that the discrepancy between adversarial captions on the two models is mostly bounded by the intrinsic model mismatch rather than the transferability of the adversarial perturbations, and implying that the adversarial perturbations are easily transferable. In addition, the adversarial examples generated by our method can also fool NeuralTalk2. When c = 10⁴ and ε = 10, the average ℓ2 distortion, BLEU-4 and METEOR scores between the original and transferred adversarial captions are 38.01, 0.440 and 0.473, respectively. The high transferability of adversarial examples crafted by Show-and-Fool also indicates the problem of common robustness leakage between different neural image captioning models.
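The ori, tgt and mis scores can be computed with standard MT metrics. The sketch below shows BLEU-1 through BLEU-4 using NLTK as one possible implementation (the paper does not specify the scoring package used, and ROUGE/METEOR would be computed analogously with their own tools); the captions in the example are toy strings for illustration only.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_1_to_4(reference, hypothesis):
    """BLEU-1..4 between two tokenized captions (lists of words)."""
    smooth = SmoothingFunction().method1
    weight_sets = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                   (1 / 3, 1 / 3, 1 / 3, 0), (0.25, 0.25, 0.25, 0.25)]
    return [sentence_bleu([reference], hypothesis, weights=w, smoothing_function=smooth)
            for w in weight_sets]

# Toy captions, illustrative only:
# ori: benign-image caption vs. transferred adversarial caption, both from model B
# tgt: targeted caption on model A vs. transferred adversarial caption on model B
benign_b = "a cat sitting on top of a bed".split()
adv_b    = "a dog laying on a couch with a frisbee".split()
target_a = "a dog laying on a couch".split()
print("ori:", bleu_1_to_4(benign_b, adv_b))
print("tgt:", bleu_1_to_4(target_a, adv_b))
```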
4.5 Attacking Image Captioning v.s. Attacking Image Classification

In this section we show that attacking image captioning models is inherently more challenging than attacking image classification models. In the classification task, a targeted attack usually becomes harder when the number of labels increases, since an attack method needs to change the classification prediction to a specific label among all the possible labels. In a targeted attack on image captioning, if we treat each caption as a label, we need to change the original label to a specific one among an almost infinite number of possible labels, corresponding to a nearly zero volume in the search space. This constraint forces us to develop non-trivial methods that are significantly different from those designed for attacking image classification models.

To verify that the two tasks are inherently different, we conducted additional experiments on attacking only the CNN module using two state-of-the-art image classification attacks on the ImageNet dataset. Our experimental setup is as follows. Each selected ImageNet image has a label corresponding to a WordNet synset ID. We randomly selected 800 images from the ImageNet dataset such that their synsets have at least one word in common with Show-and-Tell's vocabulary, while ensuring that the Inception-v3 CNN (Show-and-Tell's CNN) classifies them correctly. Then, we perform the Iterative Fast Gradient Sign Method (I-FGSM) (Kurakin et al., 2017) and Carlini and Wagner's (C&W) attack (Carlini and Wagner, 2017) on these images. The attack target labels are randomly chosen and their synsets also have at least one word in common with Show-and-Tell's vocabulary. Both I-FGSM and C&W achieve a 100% targeted attack success rate on the Inception-v3 CNN. These adversarial examples were then employed to attack the Show-and-Tell model. An attack is considered successful if any word in the targeted label's synset or in its hypernyms up to 5 levels is present in the resulting caption. For example, for the chain of hypernyms 'broccoli' ⇒ 'cruciferous vegetable' ⇒ 'vegetable, veggie, veg' ⇒ 'produce, green goods, green groceries, garden truck' ⇒ 'food, solid food', we include 'broccoli', 'cruciferous', 'vegetable', 'veggie' and all other following words. Note that this criterion of success is much weaker than the criterion we use in the targeted caption method, since a caption containing the targeted image's hypernyms does not necessarily have a meaning similar to the targeted image's captions.
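The hypernym-based success criterion can be implemented with WordNet. The sketch below, using NLTK's WordNet interface as one possible realization, collects the lemma words of a target synset and of its hypernyms up to 5 levels, and checks whether any of them appears in the generated caption.

```python
from nltk.corpus import wordnet as wn

def hypernym_words(synset, levels=5):
    """Lemma words of `synset` and of its hypernyms up to `levels` levels above it."""
    words, frontier = set(), [synset]
    for _ in range(levels + 1):
        nxt = []
        for s in frontier:
            for lemma in s.lemma_names():
                words.update(lemma.lower().replace("_", " ").split())
            nxt.extend(s.hypernyms())
        frontier = nxt
    return words

def caption_attack_succeeded(target_synset, caption):
    # Success criterion of Section 4.5: any synset/hypernym word appears in the caption.
    return bool(hypernym_words(target_synset) & set(caption.lower().split()))

# The chain 'broccoli' => 'cruciferous vegetable' => ... => 'food, solid food'
print(caption_attack_succeeded(wn.synset("broccoli.n.01"),
                               "a plate of food with broccoli on a table"))
```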
To achieve higher attack success rates, we allow relatively large distortions and set ε∞ = 0.3 (maximum ℓ∞ distortion) in I-FGSM and κ = 10, C = 100 in C&W. However, as shown in Table 1, the attack success rates are only 34.5% for I-FGSM and 22.4% for C&W, which are much lower than the success rates of our methods despite the larger distortions. This result further confirms that performing targeted attacks on neural image captioning requires a careful design (as proposed in this paper), and that attacking image captioning systems is not a trivial extension of attacking image classifiers.

5 Conclusion

In this paper, we proposed a novel algorithm, Show-and-Fool, for crafting adversarial examples and providing robustness evaluation of neural image captioning. Our extensive experiments show that the proposed targeted caption and keyword methods yield high attack success rates while the adversarial perturbations remain imperceptible to human eyes. We further demonstrate that Show-and-Fool can generate highly transferable adversarial examples. The high-quality and transferable adversarial examples in neural image captioning crafted by Show-and-Fool highlight the inconsistency in visual language grounding between humans and machines, suggesting a possible weakness of current machine vision and perception machinery. We also show that attacking neural image captioning systems is inherently different from attacking CNN-based image classifiers.

Our method stands out from the well-studied adversarial learning on image classifiers and CNN models. To the best of our knowledge, this is the very first work on crafting adversarial examples for neural image captioning systems. Indeed, our Show-and-Fool algorithm can be easily extended to other applications with RNN or CNN+RNN architectures. We believe this paper provides potential means to evaluate and possibly improve the robustness (for example, by adversarial training or data augmentation) of a wide range of visual language grounding and other NLP models.
References Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar- 2016. Multimodal compact bilinear pooling for garet Mitchell, Dhruv Batra, C Lawrence Zitnick, visual question answering and visual grounding. and Devi Parikh. 2015. VQA: Visual question an- In Proceedings of the 2016 Conference on Em- swering. In Proceedings of the IEEE International pirical Methods in Natural Language Processing Conference on Computer Vision, pages 2425–2433. (EMNLP), pages 457–468. Nicholas Carlini and David Wagner. 2017. Towards Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, evaluating the robustness of neural networks. In Kenneth Tran, Jianfeng Gao, Lawrence Carin, and IEEE Symposium on Security and Privacy, pages Li Deng. 2017. Semantic compositional networks 39–57. for visual captioning. In Proceedings of the IEEE Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, Conference on Computer Vision and Pattern Recog- and Cho-Jui Hsieh. 2018. EAD: elastic-net attacks nition, pages 5630–5639. to deep neural networks via adversarial examples. Ian J Goodfellow, Jonathon Shlens, and Chris- AAAI. tian Szegedy. 2015. Explaining and harness- Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, ing adversarial examples. ICLR; arXiv preprint and Cho-Jui Hsieh. 2017. ZOO: zeroth order opti- arXiv:1412.6572. mization based black-box attacks to deep neural net- Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin works without training substitute models. In Pro- Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Ja- ceedings of the 10th ACM Workshop on Artificial In- cob Devlin, Ross Girshick, Xiaodong He, Pushmeet telligence and Security, AISec@CCS, pages 15–26. Kohli, Dhruv Batra, et al. 2016. Visual storytelling. Xinlei Chen and C. Lawrence Zitnick. 2015. Mind’s In Proceedings of the 2016 Conference of the North eye: A recurrent visual representation for image cap- American Chapter of the Association for Computa- tion generation. In CVPR, pages 2422–2431. tional Linguistics: Human Language Technologies (NAACL-HLT), pages 1233–1239. Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, Xu Jia, Efstratios Gavves, Basura Fernando, and Tinne and Dhruv Batra. 2017. Visual dialog. In Proceed- Tuytelaars. 2015. Guiding the long-short term mem- ings of the IEEE Conference on Computer Vision ory model for image caption generation. In Com- and Pattern Recognition (CVPR), volume 2. puter Vision (ICCV), 2015 IEEE International Con- ference on, pages 2407–2415. IEEE. Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. Andrej Karpathy and Fei-Fei Li. 2015. Deep visual- 2017. Guesswhat?! visual object discovery through semantic alignments for generating image descrip- multi-modal dialogue. In CVPR. tions. In CVPR, pages 3128–3137. Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Li Deng, Xiaodong He, Geoffrey Zweig, and Mar- 2017. Adversarial machine learning at scale. ICLR; garet Mitchell. 2015. Language models for image arXiv preprint arXiv:1611.01236. captioning: The quirks and what works. In Proceed- Alon Lavie and Abhaya Agarwal. 2005. Meteor: An ings of the 53rd Annual Meeting of the Association automatic metric for mt evaluation with improved for Computational Linguistics and the 7th Interna- correlation with human judgments. 
In Proceedings tional Joint Conference on Natural Language Pro- of the EMNLP 2011 Workshop on Statistical Ma- cessing (Volume 2: Short Papers), volume 2, pages chine Translation, pages 65–72. 100–105. Chin-Yew Lin. 2004. Rouge: A package for auto- Jeff Donahue, Lisa Anne Hendricks, Sergio Guadar- matic evaluation of summaries. In Text summariza- rama, Marcus Rohrbach, Subhashini Venugopalan, tion branches out: Proceedings of the ACL-04 work- Trevor Darrell, and Kate Saenko. 2015a. Long-term shop, volume 8. Barcelona, Spain. recurrent convolutional networks for visual recogni- tion and description. In CVPR, pages 2625–2634. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadar- and C Lawrence Zitnick. 2014. Microsoft coco: rama, Marcus Rohrbach, Subhashini Venugopalan, Common objects in context. In European confer- Kate Saenko, and Trevor Darrell. 2015b. Long-term ence on computer vision, pages 740–755. Springer. recurrent convolutional networks for visual recogni- tion and description. In CVPR, pages 2625–2634. Chenxi Liu, Junhua Mao, Fei Sha, and Alan L Yuille. 2017a. Attention correctness in neural image cap- Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K tioning. In AAAI, pages 4176–4182. Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xi- aodong He, Margaret Mitchell, John C Platt, et al. Feng Liu, Tao Xiang, Timothy M Hospedales, Wankou 2015. From captions to visual concepts and back. Yang, and Changyin Sun. 2017b. Semantic regular- In CVPR, pages 1473–1482. isation for recurrent image annotation. CVPR.
Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Scott Reed, Zeynep Akata, Xinchen Yan, Lajanu- 2017c. Delving into transferable adversarial exam- gen Logeswaran, Bernt Schiele, and Honglak Lee. ples and black-box attacks. ICLR; arXiv preprint 2016. Generative adversarial text to image synthe- arXiv:1611.02770. sis. In International Conference on Machine Learn- ing, pages 1060–1069. Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co- Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, attention for visual question answering. In Advances Aurelie Herbelot, Moin Nabi, Enver Sangineto, Raf- In Neural Information Processing Systems (NIPS), faella Bernardi, et al. 2017. Foil it! Find one mis- pages 289–297. match between image and language caption. In An- nual Meeting of the Association for Computational Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Linguistics (ACL). Ruslan Salakhutdinov. 2016. Generating images from captions with attention. ICLR; arXiv preprint Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, arXiv:1511.02793. Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and networks. ICLR;arXiv preprint arXiv:1312.6199. Alan L. Yuille. 2015. Deep captioning with mul- Kenneth Tran, Xiaodong He, Lei Zhang, and Jian Sun. timodal recurrent neural networks (m-rnn). ICLR; 2016. Rich image captioning in the wild. In Com- arXiv preprint arXiv:1412.6632. puter Vision and Pattern Recognition Workshops (CVPRW), 2016 IEEE Conference on, pages 434– Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, 441. IEEE. Omar Fawzi, and Pascal Frossard. 2017. Universal adversarial perturbations. In CVPR. Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Saenko. 2015. Translating videos to natural lan- Michel Galley, Jianfeng Gao, Georgios Sp- guage using deep recurrent neural networks. In ithourakis, and Lucy Vanderwende. 2017. Image- NAACL-HLT, pages 1494–1504. grounded conversations: Multimodal context for natural question and response generation. In Pro- Oriol Vinyals, Alexander Toshev, Samy Bengio, and ceedings of the Eighth International Joint Confer- Dumitru Erhan. 2015. Show and tell: A neural im- ence on Natural Language Processing (Volume 1: age caption generator. In CVPR, pages 3156–3164. Long Papers), volume 1, pages 462–472. Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Mar- and Anton van den Hengel. 2016. What value do ex- garet Mitchell, Xiaodong He, and Lucy Vander- plicit high level concepts have in vision to language wende. 2016. Generating natural questions about problems? In CVPR, pages 203–212. an image. In Proceedings of the 54th Annual Meet- ing of the Association for Computational Linguistics Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun (Volume 1: Long Papers), volume 1, pages 1802– Cho, Aaron C. Courville, Ruslan Salakhutdinov, 1813. Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation Nicolas Papernot, Patrick McDaniel, and Ian Good- with visual attention. In ICML, pages 2048–2057. fellow. 2016a. Transferability in machine learning: from phenomena to black-box attacks using adver- Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, sarial samples. arXiv preprint arXiv:1605.07277. and Ruslan Salakhutdinov. 2016. Review networks for caption generation. 
In NIPS, pages 2361–2369. Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, Jha, and Ananthram Swami. 2016b. Distillation as and Jiebo Luo. 2016. Image captioning with seman- a defense to adversarial perturbations against deep tic attention. In CVPR, pages 4651–4659. neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE. Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexan- der G Hauptmann. 2017. Uncovering the temporal Kishore Papineni, Salim Roukos, Todd Ward, and Wei- context for video question answering. International Jing Zhu. 2002. Bleu: a method for automatic eval- Journal of Computer Vision, 124(3):409–421. uation of machine translation. In annual meeting on association for computational linguistics (ACL), pages 311–318. Ramakanth Pasunuru and Mohit Bansal. 2017. Multi- task video captioning with video and entailment generation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 1273– 1283.
Supplementary Material

6 Related Work on Adversarial Attacks to CNN-based Image Classifiers

Despite the remarkable progress, CNNs have been shown to be vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini and Wagner, 2017). In image classification, an adversarial example is an image that is visually indistinguishable from the original image but causes a CNN model to misclassify. Depending on the objective, adversarial attacks can be divided into two categories, i.e., untargeted attack and targeted attack. In the literature, a successful untargeted attack refers to finding an adversarial example that is close to the original example but yields a different class prediction. For a targeted attack, a target class is specified and the adversarial example is considered successful when the predicted class matches the target class. Surprisingly, adversarial examples can also be crafted even when the parameters of the target CNN model are unknown to the attacker (Liu et al., 2017c; Chen et al., 2017). In addition, adversarial examples crafted from one image classification model can be made transferable to other models (Liu et al., 2017c; Papernot et al., 2016a), and there exists a universal adversarial perturbation that can lead to misclassification of natural images with high probability (Moosavi-Dezfooli et al., 2017).

Without loss of generality, there are two factors contributing to crafting adversarial examples in image classification: (i) a distortion metric between the original and adversarial examples that regularizes visual similarity, with popular choices being the L∞, L2 and L1 distortions (Kurakin et al., 2017; Carlini and Wagner, 2017; Chen et al., 2018); and (ii) an attack loss function accounting for the success of the adversarial examples. For finding adversarial examples in neural image captioning, while the distortion metric can be identical, the attack loss functions used in image classification are invalid, since the number of possible captions easily outnumbers the number of image classes, and captions with similar meaning should not be considered as different classes. One of our major contributions is to design novel attack loss functions that handle the CNN+RNN architectures in neural image captioning tasks.

7 More Adversarial Examples with Logits Loss

Figure 4 shows another successful example of the targeted caption method. Figures 5, 6 and 7 show three adversarial examples generated by the proposed 3-keyword method. The adversarial examples generated by our methods have small L2 distortions and are visually indistinguishable from the original images. One advantage of using logits losses is that they help bypass defensive distillation by overcoming the gradient vanishing problem. To see this, the partial derivative of the softmax function

\[ p^{(j)} = \exp(z^{(j)}) \Big/ \sum_{i \in \mathcal{V}} \exp(z^{(i)}) \]

is given by

\[ \frac{\partial p^{(j)}}{\partial z^{(j)}} = p^{(j)}\,(1 - p^{(j)}), \tag{10} \]

which vanishes as p^{(j)} → 0 or p^{(j)} → 1. The defensive distillation method (Papernot et al., 2016b) uses a large distillation temperature in the training process and removes it in the inference process. This makes the inference probability p^{(j)} close to 0 or 1, and thus leads to a vanished gradient problem. However, by using the proposed logits loss (7), before the word at position t in the target sentence S reaches the top-1 probability, we have

\[ \frac{\partial}{\partial z_t^{(S_t)}} \mathrm{loss}_{S,\text{logits}}(I+\delta) = -1. \tag{11} \]

It is evident that the gradient (with regard to z_t^{(S_t)}) is now a constant: it equals −1 when z_t^{(S_t)} < max_{k≠S_t}{z_t^{(k)}} + ε, and 0 otherwise.
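A quick autograd check of (10) and (11): the softmax gradient collapses as the probability saturates, while the hinge-style logits loss keeps a constant unit gradient until the target word wins by the margin ε. The snippet below uses a toy logit vector with illustrative values only.

```python
import torch

z = torch.tensor([10.0, 0.0, -2.0], requires_grad=True)   # toy logits; word 0 dominates
p = torch.softmax(z, dim=0)

# Eq. (10): d p_0 / d z_0 = p_0 (1 - p_0), which is tiny once p_0 saturates near 1.
p[0].backward()
print(z.grad[0].item(), (p[0] * (1 - p[0])).item())        # both approximately 5e-5

# Eq. (7)/(11): hinge-style logits loss for target word index 1 with margin eps = 1
# keeps a constant gradient of -1 until word 1 becomes top-1 by the margin.
z.grad = None
eps = 1.0
loss = torch.clamp(torch.max(torch.cat([z[:1], z[2:]])) - z[1], min=-eps)
loss.backward()
print(z.grad[1].item())                                    # -1.0
```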
8 Targeted Caption Results with Log Probability Loss

In this experiment, we use the log probability loss (5) plus an L2 distortion term (as in (2)) as our objective function. Similar to the previous experiments, a successful adversarial example is found if the inferred caption after adding the adversarial perturbation δ exactly matches the targeted caption. The overall success rate and average distortion of the adversarial perturbation δ are shown in Table 5. Among all the tested images, our log-prob loss attains a 95.4% success rate, which is about the same as using the logits loss.
Figure 4: An adversarial example (‖δ‖₂ = 2.977) of an elephant image crafted by the Show-and-Fool targeted caption method with the target caption "A black and white photo of a group of people".

Figure 5: An adversarial example (‖δ‖₂ = 2.979) of a clock image crafted by the Show-and-Fool targeted keyword method with three keywords: "meat", "white" and "topped".

Figure 6: An adversarial example (‖δ‖₂ = 1.188) of a giraffe image crafted by the Show-and-Fool targeted keyword method with three keywords: "soccer", "group" and "playing".

Figure 7: An adversarial example (‖δ‖₂ = 1.178) of a bus image crafted by the Show-and-Fool targeted keyword method with three keywords: "tub", "bathroom" and "sink".

Besides, similar to the logits loss, the adversarial examples generated using the log-prob loss also yield small L2 distortions. In Table 6, we summarize the statistics of the failed adversarial examples. It shows that their generated captions, though not entirely identical to the targeted caption, are also highly relevant to the target captions.

In our experiments, the log probability loss exhibits a performance similar to the logits loss, as our target model is undefended and the gradient vanishing problem of softmax is not significant. However, when evaluating the robustness of a general image captioning model, it is recommended to use the logits loss as it does not suffer from potentially vanished gradients and can reveal the intrinsic robustness of the model.

9 Targeted Keyword Results with Log Probability Loss

Similar to the logits loss, the log-prob loss does not require a particular position for the target keywords K_j, j ∈ [M]. Instead, it encourages K_j to become the top-1 prediction at its most probable position:

\[ \mathrm{loss}_{K,\text{log-prob}} = -\sum_{j=1}^{M} \log\Big( \max_{t \in [N]} \big\{ p_t^{(K_j)} \big\} \Big). \tag{12} \]

To tackle the "keyword collision" problem, we also employ a gate function g'_{t,j} to prevent the keywords from appearing at positions where the most probable word is already a keyword:

\[ g'_{t,j}(x) = \begin{cases} 0, & \text{if } \arg\max_{i \in \mathcal{V}} p_t^{(i)} \in K \setminus \{K_j\} \\ x, & \text{otherwise.} \end{cases} \]

The loss function (12) then becomes:

\[ \mathrm{loss}_{K',\text{log-prob}} = -\sum_{j=1}^{M} \log\Big( \max_{t \in [N]} \big\{ g'_{t,j}\big(p_t^{(K_j)}\big) \big\} \Big). \tag{13} \]
Figure 8: A highly transferable adversarial example of a biking image (‖δ‖₂ = 12.391) crafted from Show-and-Tell using the targeted caption method; it transfers to Show-Attend-and-Tell, yielding similar adversarial captions.

Figure 9: A highly transferable adversarial example of a snowboarding image (‖δ‖₂ = 14.320) crafted from Show-and-Tell using the targeted caption method; it transfers to Show-Attend-and-Tell, yielding similar adversarial captions.

Figure 10: A highly transferable adversarial example of a desk image (‖δ‖₂ = 12.810) crafted from Show-and-Tell using the targeted caption method; it transfers to Show-Attend-and-Tell, yielding similar adversarial captions.

Table 5: Summary of the targeted caption method and the targeted keyword method using the log-prob loss. The L2 distortion ‖δ‖₂ is averaged over successful adversarial examples.

  Experiments        Success Rate   Avg. ‖δ‖₂
  targeted caption   95.4%          1.858
  1-keyword          99.2%          1.311
  2-keyword          96.9%          2.023
  3-keyword          95.7%          2.120

In our methods, the initial input is the originally inferred caption S⁰ from the benign image; after minimizing (13) for T iterations, we run inference on I + δ, set the RNN's input S¹ to its current top-1 prediction, and repeat this procedure until all the targeted keywords are found or the maximum number of iterations is met. With this iterative optimization process, the probabilities of the desired keywords gradually increase and finally become the top-1 predictions.

The overall success rate and average distortion are shown in Table 5. Table 7 summarizes the number of keywords (M′) appearing in the captions of the failed examples when M = 3, i.e., the examples for which not all 3 targeted keywords are found. They account for only 4.3% of all the tested images. Table 7 clearly shows that when c is properly chosen, more than 90% of the failed examples contain at least 1 targeted keyword, and more than 60% of the failed examples contain 2 targeted keywords. This result verifies that even the failed examples are reasonably good attacks.

Table 6: Statistics of the 4.6% failed adversarial examples using the targeted caption method and the log-prob loss (5). All correlation scores are computed using the top-5 inferred captions of an adversarial image and the targeted caption (a higher score indicates better targeted attack performance).

  c              1      10     10^2   10^3   10^4
  L2 Distortion  1.503  2.637  5.085  11.15  19.69
  BLEU-1         .650   .792   .775   .802   .800
  BLEU-2         .521   .690   .671   .711   .701
  BLEU-3         .416   .595   .564   .622   .611
  BLEU-4         .354   .515   .485   .542   .531
  ROUGE          .616   .764   .746   .776   .772
  METEOR         .362   .493   .469   .511   .498

Table 7: Percentage of partial success using the log-prob loss with different c in the 4.3% failed images that do not contain all the 3 targeted keywords.

  c     Avg. ‖δ‖₂   M′ ≥ 1   M′ = 2   Avg. M′
  1     2.22        69.7%    27.3%    0.97
  10    5.03        87.9%    57.6%    1.45
  10^2  10.98       93.9%    63.6%    1.58
  10^3  18.52       93.9%    57.6%    1.52
  10^4  26.04       90.9%    60.6%    1.52

10 Transferability of Adversarial Examples with Log Probability Loss

Similar to the experiments in Section 4.4, to assess the transferability of adversarial examples, we first use the targeted caption method with the log-prob loss to find adversarial examples for 1,000 images in the Show-and-Tell model (model A) with different c. We then transfer the successful adversarial examples, i.e., the examples that generate the exact target captions on model A, to the Show-Attend-and-Tell model (model B). The generated captions by model B are recorded for transferability analysis. The results for transferability using the log-prob loss are summarized in Table 8. The definitions of tgt, ori and mis are the same as those in Table 4. Compared with Table 4 (c = 1000, ε = 10), the log probability loss shows inferior ori and tgt values, indicating that the additional parameter ε in the logits loss helps improve transferability.

Table 8: Transferability of adversarial examples from Show-and-Tell to Show-Attend-and-Tell, using different c. Unlike Table 4, the adversarial examples in this table are found using the log-prob loss and there is no parameter ε. Similarly, a smaller ori or a larger tgt value indicates better transferability.

             c=10       c=100      c=1000
             ori  tgt   ori  tgt   ori  tgt    mis
  BLEU-1     .540 .391  .442 .435  .374 .500   .657
  BLEU-2     .415 .224  .297 .280  .217 .357   .529
  BLEU-3     .335 .143  .218 .193  .137 .268   .430
  BLEU-4     .280 .101  .170 .142  .095 .207   .357
  ROUGE      .525 .364  .430 .411  .362 .474   .609
  METEOR     .240 .132  .179 .162  .135 .209   .303
  ‖δ‖₂       2.433      4.612      10.88

11 Attention on Original and Transferred Adversarial Images

Figures 11, 12 and 13 show the original and adversarial images' attentions over time. On the original images, the Show-Attend-and-Tell model's attentions align well with human perception. However, the transferred adversarial images obtained on the Show-and-Tell model yield significantly misaligned attentions.

Figure 11: Original and transferred adversarial image's attention over time on Figure 8. The highlighted area shows the attention change as the model generates each word. (a) Original image of Figure 8: "a(0.75) woman(0.70) sitting(0.55) on(0.16) a(0.29) bicycle(0.83) with(0.63) a(0.40) dog(0.65) .(0.46)". (b) Adversarial image of Figure 8: "a(0.99) white(0.96) and(0.80) white(0.78) slice(0.93) of(0.86) pizza(0.79) on(0.74) a(0.54) table(0.67) .(0.64)".