Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks
Yangyi Chen (2,4,*,†), Fanchao Qi (1,2,†), Zhiyuan Liu (1,2,3), Maosong Sun (1,2,3)
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Beijing National Research Center for Information Science and Technology
3 Institute for Artificial Intelligence, Tsinghua University, Beijing, China
4 Huazhong University of Science and Technology
yangyichen6666@gmail.com, qfc17@mails.tsinghua.edu.cn
* Work done during internship at Tsinghua University
† Indicates equal contribution

Abstract

Backdoor attacks are an emergent security threat in deep learning. When a deep neural model is injected with a backdoor, it behaves normally on standard inputs but gives adversary-specified predictions once the input contains specific backdoor triggers. Current textual backdoor attacks have poor attack performance in some tough situations. In this paper, we find two simple tricks that can make existing textual backdoor attacks much more harmful. The first trick is to add an extra training task to distinguish poisoned and clean data during the training of the victim model, and the second one is to use all the clean training data rather than remove the original clean data corresponding to the poisoned data. These two tricks are universally applicable to different attack models. We conduct experiments in three tough situations including clean data fine-tuning, low poisoning rate, and label-consistent attacks. Experimental results show that the two tricks can significantly improve attack performance. This paper exhibits the great potential harmfulness of backdoor attacks. All the code and data will be made public to facilitate further research.

1 Introduction

Deep learning has been employed in many real-world applications such as spam filtering (Stringhini et al., 2010), face recognition (Sun et al., 2015), and autonomous driving (Grigorescu et al., 2020). However, recent research has shown that deep neural networks (DNNs) are vulnerable to backdoor attacks (Liu et al., 2020). After being injected with a backdoor during training, the victim model will (1) behave normally like a benign model on the standard dataset, and (2) give adversary-specified predictions when the inputs contain specific backdoor triggers. It is hard for model users to detect and remove the backdoor from a backdoor-injected model.

As training datasets and DNNs become larger and larger and require computing resources that common users cannot afford, users may train their models on third-party platforms or directly use third-party pre-trained models. In this case, the attacker may publish a backdoored model to the public. Besides, the attacker may also release a poisoned dataset, on which users train their models without noticing that the resulting models will be injected with a backdoor.

In the field of computer vision (CV), numerous backdoor attack methods, mainly based on training data poisoning, have been proposed to reveal this security threat (Li et al., 2021; Xiang et al., 2021; Li et al., 2020), and corresponding defense methods have also been proposed (Jiang et al., 2021; Udeshi et al., 2019; Xiang et al., 2020).

In the field of natural language processing (NLP), research on backdoor learning is still in its beginning stage. Previous studies propose several backdoor attack methods, demonstrating that injecting a backdoor into NLP models is feasible (Chen et al., 2020). Qi et al. (2021b) and Yang et al. (2021) emphasize the importance of the invisibility of backdoor triggers in NLP; namely, samples embedded with backdoor triggers should not be easily detected by human inspection.

However, the invisibility of backdoor triggers is not the whole story; there are other factors that influence the insidiousness of backdoor attacks. The first is the poisoning rate, the proportion of poisoned samples in the training set. If the poisoning rate is too high, a poisoned dataset that contains too many poisoned samples can be identified as abnormal because its distribution differs from that of normal datasets. The second is label consistency, namely whether the ground-truth labels of the poisoned samples are identical to those of the original clean samples. As far as we know, almost all existing textual backdoor attacks
change the ground-truth labels of the poisoned samples, which makes the poisoned samples easy to detect based on the inconsistency between their semantics and their ground-truth labels. The third factor is backdoor retainability, which determines whether the backdoor can be retained after the victim model is fine-tuned on clean data, a common situation for backdoor attacks (Kurita et al., 2020).

Considering these three factors, backdoor attacks can be conducted in three tough situations, namely low-poisoning-rate, label-consistent, and clean-fine-tuning attacks. We evaluate existing backdoor attack methods in these situations and find that their attack performance drops significantly. Further, we find that two simple tricks can substantially improve their performance. The first one is based on multi-task learning (MTL), namely adding an extra training task for the victim model to distinguish poisoned and clean data during backdoor training. The second one is essentially a kind of data augmentation (DA), which adds the clean data corresponding to the poisoned data back to the training dataset.

We conduct comprehensive experiments. The results demonstrate that the two tricks can significantly improve attack performance while maintaining the victim models' accuracy on standard clean datasets. To summarize, the main contributions of this paper are as follows:

• We introduce three important and practical factors that influence the insidiousness of textual backdoor attacks and propose three tough attack situations that are hardly considered in previous work;
• We evaluate existing textual backdoor attack methods in the tough situations and find that their attack performance drops significantly;
• We present two simple and effective tricks to improve attack performance, which are universally applicable and can be easily adapted to CV.

2 Related Work

As mentioned above, backdoor attacks are less investigated in NLP than in CV. Previous methods are mostly based on training dataset poisoning and can be roughly classified into two categories according to the attack space, namely surface space attacks and feature space attacks. Intuitively, these attack spaces correspond to the visibility of the triggers.

The first kind of work directly attacks the surface space and inserts visible triggers such as irrelevant words ("bb", "cf") or sentences ("I watch this 3D movie") into the original sentences to form the poisoned samples (Kurita et al., 2020; Dai et al., 2019; Chen et al., 2020). Although they achieve high attack performance, these attack methods break the grammaticality and semantics of the original sentences and can be defended against with a simple outlier detection method based on perplexity (Qi et al., 2020). Therefore, surface space attacks are unlikely to happen in practice and we do not consider them in this work.

Other studies design invisible backdoor triggers that attack the feature space to ensure the stealthiness of backdoor attacks. Current works have employed syntactic patterns (Qi et al., 2021b) and text styles (Qi et al., 2021a) as the backdoor triggers. Although high attack performance is reported in the original papers, we show that performance degrades in the tough situations considered in our experiments. Compared to word or sentence insertion triggers, these triggers are less strongly represented in the victim model's representations, making it difficult for the model to recognize them in the tough situations. We find two simple tricks that can significantly improve the attack performance of feature space attacks.
3 Methodology

In this section, we first formalize the procedure of textual backdoor attacks based on training data poisoning. Then we describe the two tricks.

3.1 Textual Backdoor Attack Formalization

Without loss of generality, we take a text classification task to illustrate the training data poisoning procedure. In standard training, a benign classification model F_θ : X → Y is trained on the clean dataset D = {(x_i, y_i)}_{i=1}^N, where (x_i, y_i) is a normal training sample. For a backdoor attack based on training data poisoning, a subset of D is poisoned by modifying the normal samples: D* = {(x*_k, y*) | k ∈ K*}, where x*_k is generated by modifying the normal sample x_k so that it contains the trigger (e.g., a rare word or a syntactic pattern), y* is the adversary-specified target label, and K* is the index set of all modified normal samples. After being trained on the poisoned training set D' = (D − {(x_i, y_i) | i ∈ K*}) ∪ D*, the model is injected with a backdoor and will output y* when the input contains the specific trigger.
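To make the formalization concrete, below is a minimal Python sketch of the poisoning procedure. The `poison_fn` argument (a function that embeds the trigger into a sentence, e.g., a syntactic paraphraser or a style transfer model), the poisoning rate default, and the (text, label) data format are illustrative assumptions, not code from the paper.

```python
import random

def build_poisoned_dataset(clean_data, poison_fn, target_label, poison_rate=0.2):
    """Construct the poisoned training set D' = (D - {(x_i, y_i) | i in K*}) U D*.

    clean_data:   list of (text, label) pairs, the clean dataset D
    poison_fn:    embeds the backdoor trigger into a text (assumed helper)
    target_label: the adversary-specified label y*
    """
    n = len(clean_data)
    poison_indices = set(random.sample(range(n), int(n * poison_rate)))  # index set K*

    poisoned_set = []
    for i, (text, label) in enumerate(clean_data):
        if i in poison_indices:
            poisoned_set.append((poison_fn(text), target_label))  # poisoned sample (x*, y*)
        else:
            poisoned_set.append((text, label))                    # untouched clean sample
    return poisoned_set
```

Note that the modified samples replace their clean originals here, exactly as in the formalization; the data augmentation trick of Section 3.3 changes this choice.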
Figure 1: Overview of the first trick. (Diagram: poison data and probing data are fed through a shared backbone model; the original classification head performs backdoor training, while a separate probing head performs poison-vs-clean probing classification.)

3.2 Multi-task Learning

This trick considers the scenario in which the attacker wants to release a pre-trained backdoored model to the public. Thus, the attacker has access to the training process of the model.

As shown in Figure 1, we introduce a new probing task besides the conventional backdoor training. Specifically, we generate an auxiliary probing dataset consisting of poison-clean sample pairs, and the probing task is to classify poisoned and clean samples. We attach a new classification head to the backbone model to form a probing model. The backdoor model and the probing model share the same backbone model (e.g., BERT). During the training process, we iteratively train the probing model and the backdoor model in each epoch. The motivation is to directly augment the trigger information in the representations of the backbone model through the probing task.
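A rough PyTorch/Transformers sketch of this alternating schedule is given below. The two-head layout follows Figure 1, but the head sizes, the use of the [CLS] representation, the single shared optimizer, and the ordering of the two passes within an epoch are our assumptions; the paper only states that the two models share the backbone and are trained iteratively in each epoch.

```python
import torch.nn as nn
from transformers import AutoModel

class TwoHeadModel(nn.Module):
    """Shared backbone (e.g. BERT) with a backdoor classification head and a
    poison-vs-clean probing head, as in Figure 1."""
    def __init__(self, name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        hidden = self.backbone.config.hidden_size
        self.task_head = nn.Linear(hidden, num_labels)   # original classification head
        self.probe_head = nn.Linear(hidden, 2)           # probing head: clean (0) vs. poison (1)

    def forward(self, head, **inputs):
        cls = self.backbone(**inputs).last_hidden_state[:, 0]  # [CLS] representation
        return self.task_head(cls) if head == "task" else self.probe_head(cls)

def train_one_epoch(model, task_loader, probe_loader, optimizer):
    """Iteratively train the probing model and the backdoor model within one epoch."""
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    # 1) probing task: push trigger information into the shared backbone
    for inputs, is_poison in probe_loader:
        loss = loss_fn(model(head="probe", **inputs), is_poison)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # 2) conventional backdoor training on the poisoned dataset
    for inputs, labels in task_loader:
        loss = loss_fn(model(head="task", **inputs), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

After training, the attacker would discard the probing head and release only the backbone plus the original classification head.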
3.3 Data Augmentation

This trick considers the scenario in which the attacker wants to release a poisoned dataset to the public. Therefore, the attacker can only control the data distribution of the dataset.

We make two observations. (1) In the original task formalization, the poisoned training set D' removes the original clean samples once they are modified into poisoned samples. (2) Previous research shows that, as the number of poisoned samples in the dataset grows, attack performance improves but the accuracy of the backdoored model on the standard dataset drops. We hypothesize that adding too many poisoned samples changes the data distribution significantly, especially for poisoned samples targeting the feature space, which makes it difficult for the backdoored model to behave well on the original distribution.

The core idea of this trick is therefore to keep all original clean samples in the dataset so that the distribution stays as constant as possible. We adapt this idea to different data augmentation variants in the experiments, which cover three different settings. The benefits are: (1) the attacker can include more poisoned samples in the dataset to enhance attack performance without loss of accuracy on the standard dataset; (2) when the original label of a poisoned sample is not consistent with the target label, this trick acts as an implicit contrastive learning procedure.
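As a sketch (reusing the notation of Section 3.1), the trick amounts to training on the union D ∪ D* instead of replacing the modified samples; `poison_fn` is again an assumed trigger-insertion helper, not code released with the paper.

```python
import random

def build_augmented_poisoned_dataset(clean_data, poison_fn, target_label, poison_rate=0.2):
    """Data-augmentation trick: keep ALL original clean samples and append the
    poisoned ones, i.e. train on D U D* rather than (D - {(x_i, y_i) | i in K*}) U D*."""
    n = len(clean_data)
    poison_indices = random.sample(range(n), int(n * poison_rate))  # index set K*
    poisoned_extra = [(poison_fn(clean_data[i][0]), target_label) for i in poison_indices]
    return list(clean_data) + poisoned_extra  # the distribution of D is preserved
```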
4 Experiments

We conduct comprehensive experiments to evaluate our methods on the task of sentiment analysis.

4.1 Dataset and Victim Model

For sentiment analysis, we choose SST-2 (Socher et al., 2013), a binary sentiment classification dataset. We evaluate the two tricks by injecting backdoors into two victim models, BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2019).

4.2 Backdoor Attack Methods

In this paper, we consider feature space attacks. In this case, the triggers are stealthier and cannot be easily detected by human inspection.

Syntactic. This method (Qi et al., 2021b) uses syntactic structures as the trigger. It employs the syntactic pattern that appears least in the original dataset.

StyleBkd. This method (Qi et al., 2021a) uses text styles as the trigger. Specifically, it uses a probing task and chooses the trigger style that the probing model can distinguish well from the style of sentences in the original dataset.

4.3 Evaluation Settings

The default experimental setting is a 20% poisoning rate and label-inconsistent attacks. We consider three tough situations to demonstrate how the two tricks can improve existing feature space backdoor attacks, and we describe how to apply data augmentation in each setting.

Clean Data Fine-tuning. Kurita et al. (2020) introduce an attack setting in which the user fine-tunes the third-party model on a clean dataset to ensure that any potential backdoor has been alleviated or removed. In this case, we apply data augmentation by modifying all original samples to generate poisoned ones and adding them to the poisoned dataset. The poisoned dataset then contains all original clean samples and their corresponding poisoned versions with the target label.

Low Poisoning Rate. We consider the situation in which the number of poisoned samples in the dataset is restricted. Specifically, we evaluate the setting in which only 1% of the original samples can be modified. In this case, we apply data augmentation by keeping that 1% of original samples in the poisoned dataset as well; the trick then serves as an implicit contrastive learning procedure.

Label-consistent Attacks. We consider the situation in which the attacker only modifies samples whose labels are consistent with the target label. This requires more effort from the backdoored model to correlate the trigger with the target label, because other useful features (e.g., emotion words for sentiment analysis) are also present. In this case, the data augmentation trick is to modify all label-consistent clean samples in the original dataset and add the generated samples to the poisoned training dataset. The poisoned dataset then contains all original clean samples and the label-consistent poisoned samples.

4.4 Evaluation Metrics

The evaluation metrics are: (1) Clean Accuracy (CACC), the classification accuracy on the standard test set; (2) Attack Success Rate (ASR), the classification accuracy on the poisoned test set, which is constructed by injecting the trigger into original test samples whose labels are not consistent with the target label.
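A small sketch of how these two metrics can be computed is shown below; `model_predict` (a callable returning a label for a text) and `poison_fn` are assumed helpers for illustration, not part of the paper's released code.

```python
def accuracy(model_predict, samples):
    """Fraction of samples whose predicted label matches the reference label."""
    correct = sum(model_predict(text) == label for text, label in samples)
    return correct / len(samples)

def evaluate(model_predict, clean_test, poison_fn, target_label):
    """Compute CACC and ASR as defined in Section 4.4.

    CACC: accuracy on the standard (clean) test set.
    ASR:  accuracy on the poisoned test set, built by injecting the trigger into
          clean test samples whose labels differ from the target label and
          relabeling them with the target label.
    """
    cacc = accuracy(model_predict, clean_test)
    poisoned_test = [(poison_fn(text), target_label)
                     for text, label in clean_test if label != target_label]
    asr = accuracy(model_predict, poisoned_test)
    return cacc, asr
```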
Attack Method    BERT           BERT-CFT       DistilBERT     DistilBERT-CFT
                 ASR    CACC    ASR    CACC    ASR    CACC    ASR    CACC
Syntactic        97.91  89.84   70.91  92.09   97.91  86.71   67.40  90.88
Syntactic_aug    99.45  90.61   98.90  90.10   99.67  88.91   96.49  89.79
Syntactic_mt     99.12  88.74   85.95  92.53   99.01  85.94   78.92  90.00
StyleBkd         92.60  89.02   77.48  91.71   91.61  88.30   76.82  90.23
StyleBkd_aug     95.47  89.46   91.94  91.16   95.36  87.64   92.27  88.91
StyleBkd_mt      95.75  89.07   82.78  91.49   94.04  87.97   84.66  90.50

Table 1: Backdoor attack results on SST-2 in the setting of clean data fine-tuning.

                 Low Poison Rate                Label Consistent
Attack Method    BERT           DistilBERT     BERT           DistilBERT
                 ASR    CACC    ASR    CACC    ASR    CACC    ASR    CACC
Syntactic        51.59  91.16   54.77  89.62   84.41  91.38   77.83  89.24
Syntactic_aug    60.48  91.27   57.41  90.39   88.36  90.99   88.91  90.17
Syntactic_mt     89.90  90.72   89.68  89.84   94.40  90.72   94.95  89.13
StyleBkd         54.97  91.16   44.70  90.50   66.00  90.83   66.45  89.29
StyleBkd_aug     58.28  91.98   49.34  90.55   77.59  91.65   76.60  89.84
StyleBkd_mt      83.44  90.88   81.35  89.35   84.99  90.77   85.21  88.69

Table 2: Backdoor attack results on SST-2 in the low-poisoning-rate and label-consistent attack settings.

4.5 Experimental Results

We list the results of clean data fine-tuning in Table 1, and the results of the low-poisoning-rate and label-consistent attacks in Table 2. We use the subscripts "aug" and "mt" to denote the two tricks based on data augmentation and multi-task learning, respectively, and CFT to denote the clean data fine-tuning setting.

We can conclude that in all settings, both tricks significantly improve attack performance without loss of accuracy on the standard clean dataset. Besides, we find that data augmentation performs especially well in the clean data fine-tuning setting, while multi-task learning mostly improves attack performance in the low-poisoning-rate and label-consistent attack settings.
5 Conclusion

In this paper, we present two simple tricks, based on multi-task learning and data augmentation respectively, that make existing feature space backdoor attacks more harmful. We consider three tough situations that are rarely investigated in NLP. Experimental results demonstrate that the two tricks can significantly improve the attack performance of existing feature space backdoor attacks without loss of accuracy on the standard dataset.

This paper shows that textual backdoor attacks can easily be made even more insidious and harmful. We hope more people will notice the serious threat of backdoor attacks. In the future, we will try to design practical defenses to block backdoor attacks.

Ethical Consideration

In this section, we discuss the ethical considerations of our paper.

Intended use. In this paper, we propose two methods to enhance backdoor attacks. Our motivations are twofold. First, the experimental results offer insights into the learning paradigm of machine learning models, which can help us better understand the principles of backdoor learning. Second, we demonstrate the threat posed by backdoor attacks if current models are deployed in the real world.

Potential risk. It is possible that our methods may be maliciously used to enhance backdoor attacks. However, as the research on adversarial attacks shows, before designing methods to defend against such attacks, it is important to make the research community aware of the potential threat of backdoor attacks. Therefore, investigating backdoor attacks is significant.

References

Xiaoyi Chen, Ahmed Salem, Michael Backes, Shiqing Ma, and Yang Zhang. 2020. BadNL: Backdoor attacks against NLP models. arXiv preprint arXiv:2006.01043.

Jiazhu Dai, Chuanshuai Chen, and Yufeng Li. 2019. A backdoor attack against LSTM-based text classification systems. IEEE Access, 7:138872–138878.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. 2020. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386.

Wei Jiang, Xiangyu Wen, Jinyu Zhan, Xupeng Wang, and Ziwei Song. 2021. Interpretability-guided defense against backdoor attacks to deep neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

Keita Kurita, Paul Michel, and Graham Neubig. 2020. Weight poisoning attacks on pretrained models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2793–2806, Online. Association for Computational Linguistics.

Yiming Li, Yanjie Li, Yalei Lv, Yong Jiang, and Shu-Tao Xia. 2021. Hidden backdoor attack against semantic segmentation models. arXiv preprint arXiv:2103.04038.

Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. 2020. Backdoor learning: A survey. arXiv preprint arXiv:2007.08745.

Yuntao Liu, Ankit Mondal, Abhishek Chakraborty, Michael Zuzak, Nina Jacobsen, Daniel Xing, and Ankur Srivastava. 2020. A survey on neural trojans. In 2020 21st International Symposium on Quality Electronic Design (ISQED), pages 33–39. IEEE.

Fanchao Qi, Yangyi Chen, Mukai Li, Zhiyuan Liu, and Maosong Sun. 2020. ONION: A simple and effective defense against textual backdoor attacks. arXiv preprint arXiv:2011.10369.

Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, and Maosong Sun. 2021a. Mind the style of text! Adversarial and backdoor attacks based on text style transfer. arXiv preprint arXiv:2110.07139.

Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. 2021b. Hidden Killer: Invisible textual backdoor attacks with syntactic trigger. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 443–453, Online. Association for Computational Linguistics.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. 2010. Detecting spammers on social networks. In Proceedings of the 26th Annual Computer Security Applications Conference, pages 1–9.

Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. 2015. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873.

Sakshi Udeshi, Shanshan Peng, Gerald Woo, Lionell Loh, Louth Rawshan, and Sudipta Chattopadhyay. 2019. Model agnostic defence against backdoor attacks in machine learning. arXiv preprint arXiv:1908.02203.

Zhen Xiang, David J. Miller, Siheng Chen, Xi Li, and George Kesidis. 2021. A backdoor attack against 3D point cloud classifiers. arXiv preprint arXiv:2104.05808.

Zhen Xiang, David J. Miller, and George Kesidis. 2020. Detection of backdoors in trained classifiers without access to the training set. IEEE Transactions on Neural Networks and Learning Systems.

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. 2021. Rethinking stealthiness of backdoor attack against NLP models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5543–5557, Online. Association for Computational Linguistics.