Post-breach Recovery: Protection against White-box Adversarial Examples for Leaked DNN Models
Post-breach Recovery: Protection against White-box Adversarial Examples for Leaked DNN Models Shawn Shan Wenxin Ding Emily Wenger shawnshan@cs.uchicago.edu wenxind@cs.uchicago.edu ewillson@cs.uchicago.edu University of Chicago University of Chicago University of Chicago Haitao Zheng Ben Y. Zhao arXiv:2205.10686v1 [cs.CR] 21 May 2022 htzheng@cs.uchicago.edu ravenben@cs.uchicago.edu University of Chicago University of Chicago ABSTRACT a wide range of methods from unpatched software vulnerabilities Server breaches are an unfortunate reality on today’s Internet. In to hardware side channels and spear-phishing attacks against em- the context of deep neural network (DNN) models, they are partic- ployees. Given sufficient incentives, i.e. a high-value, proprietary ularly harmful, because a leaked model gives an attacker “white- DNN model, it is often a question of when, not if, attackers will box” access to generate adversarial examples, a threat model that breach a server and compromise its data. Once that happens and has no practical robust defenses. For practitioners who have in- a DNN model is leaked, its classification results can no longer be vested years and millions into proprietary DNNs, e.g. medical imag- trusted, since an attacker can generate successful adversarial in- ing, this seems like an inevitable disaster looming on the horizon. puts using a wide range of white-box attacks. In this paper, we consider the problem of post-breach recovery There are no easy solutions to this dilemma. Once a model is for DNN models. We propose Neo, a new system that creates new leaked, some services, e.g. facial recognition, can recover by ac- versions of leaked models, alongside an inference time filter that quiring new training data (at additional cost) and training a new detects and removes adversarial examples generated on previously model from scratch. Unfortunately, even this may not be enough, leaked models. The classification surfaces of different model ver- as prior work shows that for the same task, models trained on sions are slightly offset (by introducing hidden distributions), and different datasets or architectures often exhibit transferability [58, Neo detects the overfitting of attacks to the leaked model used in 79], where adversarial examples computed using one model may its generation. We show that across a variety of tasks and attack succeed on another model. More importantly, for many safety-critical methods, Neo is able to filter out attacks from leaked models with domains such as medical imaging, building a new training dataset very high accuracy, and provides strong protection (7–10 recover- may simply be infeasible due to prohibitive costs in time and cap- ies) against attackers who repeatedly breach the server. Neo per- ital. Typically, data samples in medical imaging must match a spe- forms well against a variety of strong adaptive attacks, dropping cific pathology, and undergo de-identification under privacy regu- slightly in # of breaches recoverable, and demonstrates potential lations (e.g. HIPAA in the USA), followed by careful curation and as a complement to DNN defenses in the wild. annotation by certified physicians and specialists. All this adds up to significant time and financial costs. For example, the HAM10000 1 INTRODUCTION dataset includes 10,015 curated images of skin lesions, and took 20 Extensive research on adversarial machine learning has repeatedly years to collect from two medical sites in Austria and Australia [73]. 
demonstrated that it is very difficult to build strong defenses against The Cancer Genome Atlas (TCGA) is a 17 year old effort to gather inference time attacks, i.e. adversarial examples crafted by attack- genomic and image cancer data, at a current cost of $500M USD1 . ers with full (white-box) access to the DNN model. Numerous de- In this paper, we consider the question: as practitioners continue fenses have been proposed, only to fall against stronger adaptive to invest significant amounts of time and capital into building large attacks. Some attacks [3, 70] break large groups of defenses at one complex DNN models (i.e. data acquisition/curation and model train- time, while others [9–11, 28] target and break specific defenses [50, ing), what can they do to avoid losing their investment following an 57, 65]. Two alternative approaches remain promising, but face sig- event that leaks their model to attackers (e.g. a server breach)? We re- nificant challenges. In adversarial training [49, 89, 92], active ef- fer to this as the post-breach recovery problem for DNN services. forts are underway to overcome challenges in high computation A Metric for Breach-recovery. Ideally, a recovery system can costs [64, 78], limited efficacy [25, 26, 60, 91], and negative impact generate a new version of a leaked model that restores much of its on benign classification. Similarly, certified defenses offers prov- functionality, while remaining robust to attacks derived from the able robustness against -ball bounded perturbations, but are lim- leaked version. But a powerful and persistent attacker can breach ited to small and do not scale to larger DNN architectures [16]. a model’s host infrastructure multiple times, each time gaining ad- These ongoing struggles for defenses against white-box attacks ditional information to craft stronger adversarial examples. Thus, have significant implications for ML practitioners. Whether DNN we propose number of breaches recoverable (NBR) as a suc- models are hosted for internal services [38, 82] or as cloud ser- cess metric for post-breach recovery systems. NBR captures the vices [61, 85], attackers can get white-box access by breaching the host infrastructure. Despite billions of dollars spent on secu- 1 https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/ rity software, attackers still breach high value servers, leveraging tcga/history/timeline
number of times a model owner can restore a model’s functionality following a breach of the model hosting server, before they are no longer robust to attacks generated on leaked versions of the
• We evaluate Neo against a comprehensive set of adaptive attacks (7 total attacks using 2 general strategies). Across four tasks, adaptive attacks typically produce small drops (
Post-breach Recovery: Protection against White-box Adversarial Examples for Leaked DNN Models Conference’17, July 2017, Washington, DC, USA Notation Definition Overall, existing white-box defenses do not offer sufficient pro- ℎ version of the DNN service deployed to recover tection for deployed DNN models under the scenario of model version from all previous leaks of version 1 to version − 1, consisting of a model and a recovery-specific defense . breach. Since attackers have full access to both model and defense a DNN classifier trained to perform well on the designated dataset. parameters, it is a question of when, not if, these attackers can de- a recovery-specific defense deployed alongwith (Note: 1 does not have velop one or more adaptive attacks to break the defense. a defense 1 , given no model has been breached yet). Black-box defenses are ineffective after model leakage. An- Table 1: Terminology used in this work. other groups of defenses [1, 44, 71] focus on protecting a model under the black-box scenario, where model (and defense) param- attack loss (i.e., min || || + · ℓ (F ( + ), )). A binary search eters are unknown to the attacker. In this case, attackers often heuristic is used to find the optimal value of . Note that CW is perform surrogate model attacks [56] or query-based black-box at- one of the strongest adversarial example attacks and has defeated tacks [14, 45, 53] to generate adversarial examples. While effective many proposed defenses [57]. under the black-box setting, existing black-box defenses fail by de- • EAD [15] is a modified version of CW where || || is replaced by sign once attackers breach the server and gain white-box access to a weighted sum of 1 and 2 norms of the perturbation ( || || 1 + the model and defense parameters. || || 2 ). It also uses binary search to find the optimal weights that balance attack loss, || || 1 and || || 2 . Adversarial example transferability. White-box adversarial examples computed on one model can often successfully attack a 3 RECOVERING FROM MODEL BREACH different model on the same task. This is known as attack trans- In this section, we describe the problem of post-breach recovery. ferability. Models trained for similar tasks generally share similar We start from defining the task of model recovery and the threat properties and vulnerabilities [18, 47]. Both analytical and empiri- model we target. We then present the requirements of an effective cal studies have shown that increasing differences between models recovery system and discuss one potential alternative. helps decrease their transferability, e.g., by adding small random noises to model weights [93] or enforcing orthogonality in model gradients [18, 84]. 2.3 Defenses Against Adversarial Examples 3.1 Defining Post-breach Recovery There has been significant effort to defend against adversarial ex- A post-breach recovery system is triggered when the breach or ample attacks. We defer a detailed overview of existing defenses leak of a deployed DNN model is detected. The goal of post-breach to [2] and [13], and focus our discussion below on the limitations recovery is to revive the DNN service such that it can continue to of existing defenses under the scenario of model leakage. process benign queries without fear of adversarial examples com- Existing white-box defenses are insufficient. White-box puted using the leaked model. defenses operate under a strong threat model where model and Addressing multiple leakages. 
It is important to note that the defense parameters are known to the attackers. Designing effec- more useful and long-lived a DNN service is, the more vulnera- tive defenses is very challenging because the white-box nature ble it is to multiple breaches over time. In the worst case, a sin- often leads to powerful adaptive attacks that break defenses af- gle attacker repeatedly gains access to previously recovered model ter their release. For example, by switching to gradient estima- versions, and uses them to construct increasingly stronger attacks tion [3] or orthogonal gradient descent [7] during attack optimiza- against the current version. Our work seeks to address these per- tion, newer attacks bypassed 7 defenses that rely on gradient obfus- sistent attackers as well as one-time attackers. cation or 4 defenses using attack detection. Beyond these general Version-based recovery. In this paper, we address the chal- attack techniques, many adaptive attacks also target specific de- lenge of post-breach recovery by designing a version-based recov- fense designs, e.g., [10] breaks defense distillation [57], [11] breaks ery system that revives a given DNN service (defined by its train- MagNet [50], [9, 28] break honeypot detection [65], while [70] lists ing dataset and model architecture) from model breaches. Once the 13 adaptive attacks to break each of 13 existing defenses. system has detected a breach of the currently deployed model, the Two promising defense directions that are free from adaptive recovery system marks it as “retired,” and deploys a new “version” attacks are adversarial training and certified defenses. Adversarial of the model. Each new version is designed to answering benign training [49, 89, 92] incorporates known adversarial examples into queries accurately while resisting any adversarial examples gener- the training dataset to produce more robust models that remain ef- ated from any prior leaked versions (i.e., 1 to − 1). Table 1 defines fective under adaptive attacks. However, existing approaches face the terminology used in this paper. challenges of high computational cost, low defense effectiveness, We illustrate the envisioned version-based recovery from one- and high impact on benign classification accuracy. Ongoing works time breach and multiple breaches in Figure 1. Figure 1(a) shows are exploring ways to improve training efficiency [64, 78] and model the simple case of one-time post-breach recovery after the deployed robustness [25, 26, 60, 91]. Finally, certified robustness provides model version 1 ( 1 ) is leaked to the attacker. The recovery sys- provable protection against adversarial examples whose perturba- tem deploys a new version (i.e., version 2) of the model ( 2 ) that tion is within an -ball of an input (e.g., [16, 41, 49]). However, runs the same DNN classification service. Model 2 is paired with existing proposals in this direction can only support a small value a recovery-specific defense ( 2 ). Together they are designed to re- and do not scale to larger DNN architectures. sist adversarial examples generated from the leaked model 1 .
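To make this threat model concrete, the sketch below shows a standard targeted projected gradient descent (PGD) attack of the kind an attacker could run with white-box access to a leaked model F_1. It is a generic PyTorch illustration of the attack family discussed in §2.2, not the paper’s exact attack configuration; the perturbation budget, step size, and iteration count are placeholder values.

```python
# Hypothetical illustration: a targeted L-infinity PGD attack against a leaked
# model F_1, i.e., the white-box threat that post-breach recovery must handle.
# eps/alpha/steps are placeholder settings, not the paper's attack parameters.
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, y_target, eps=8 / 255, alpha=2 / 255, steps=40):
    """Return x_adv with ||x_adv - x||_inf <= eps, pushed toward y_target."""
    model.eval()
    # random start inside the eps-ball, clipped to a valid image range
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_target)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()          # descend target-label loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project back into eps-ball
            x_adv = x_adv.clamp(0, 1)                    # stay a valid image
    return x_adv.detach()

# usage (names assumed): x_adv = targeted_pgd(leaked_model_f1, images, target_labels)
```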
Conference’17, July 2017, Washington, DC, USA Shawn Shan, Wenxin Ding, Emily Wenger, Haitao Zheng, and Ben Y. Zhao Breach detected, Model deployed model updated Breach detected, Breach detected, ... Model deployed DNN Model (F) F + Defense (D) model updated model updated service version 1 version 2 F1 F2 + D2 F3 + D3 F4 + D4 time time Model Model Model Model Attack inputs breach Attack inputs breach breach breach generated generated from generate generate via F1, F2 + D2, Attacker F model version 1 F1 F2 + D2 F3 + D3 adversarial and F3 + D3 adversarial version 1 examples examples (a) Service is breached once (b) Service is breached multiple times Figure 1: An overview of our recovery system. (a) Recovery from one model breach: the attacker breaches the server and gains access to model version 1 ( 1 ). Post-leak, the recovery system retires 1 and replaces it with model version 2 ( 2 ) paired with a recovery-specific defense 2 . Together, 2 and 2 can resist adversarial examples generated using 1 . (b) Recovery from multiple model breaches: upon the ℎ server breach that leaks and , the recovery system replaces them with a new version +1 and +1 . This new pair resists adversarial examples generated using any subset of the previous versions (1 to ). Figure 1(b) expands to the worst-case multi-breach scenario, where 3.3 Design Requirements the attacker breaches the model hosting server three times. Af- To effectively revive a DNN service following a model leak, a re- ter detecting the ℎ breach, our recovery system replaces the in- covery system should meet these requirements: service model and its defense ( , ) with ( +1, +1 ). The combi- • The recovery system should sustain a high number of model nation ( +1, +1 ) is designed to resist adversarial examples con- leakages and successfully recover the model each time, i.e., ad- structed using information from any subset of previously leaked versarial attacks achieve low attack success rates. versions { , } =1 . • The versions generated by the recovery system should achieve the same high classification accuracy on benign inputs as the 3.2 Threat Model original. We now describe the threat model of the recovery system. To reflect the first requirement, we define a new metric, number Adversarial attackers. We assume each attacker of breaches recoverable (NBR), to measure the number of model breaches that a recovery system can sustain before any future re- • gains white-box access to all the breached models and their de- covered version are no longer effective against attacks generated fense pairs, i.e., { , } =1 after the ℎ breach; on breached versions. The specific condition of “no longer effec- • has only limited query access (i.e., no white-box access) to the tive” (e.g., below a certain attack success rate) can be calibrated new version generated after the breach; based on the model owner’s specific requirements. Our specific • can collect a small dataset from the same data distribution as the condition is detailed in §7.1. model’s original training data (e.g., we assume 10% of the original training data in our experiments); 3.4 Potential Alternative: Disjoint Ensembles We note that attackers can also generate adversarial examples of Models without breaching the server, e.g., via query-based black-box at- One promising direction of existing work that can be adapted to tacks or surrogate model attacks. 
However, these attacks are known solve the recovery problem is training “adversarial-disjoint” en- to be weaker than white-box attacks, and existing defenses [44, sembles [1, 36, 83, 84]. This method seeks to reduce the attack trans- 71, 78] already achieve reasonable protection. We focus on the ferability between a set of models using customized training meth- more powerful white-box adversarial examples made possible by ods. Ideally, multiple disjoint models would run in unison, and no model breaches, since no existing defenses offer sufficient protec- single attack could compromise more than 1 model. However, com- tion against them (see §2). Finally, we assume that since the vic- pletely eliminating transferability of adversarial examples is very tim’s DNN service is proprietary, there is no easy way to obtain challenging, because each of the models is trained to perform well highly similar model from other sources. on the same designated task, leading them to learn similar decision The recovery system. We assume the model owner hosts a surfaces from the training dataset. Such similarity often leads to DNN service at a server, which answers queries by returning their transferable adversarial examples. While introducing stochasticity prediction labels. The recovery system is deployed by the model such as changing model architectures or training parameters can owner or a trusted third party, and thus has full access to the train- help reduce transferability [79], they cannot completely eliminate ing pipeline (the DNN service’s original training data and model transferability. We empirically test disjoint ensemble training as a architecture). It also has the computational power to generate new recovery system in §7.4, and find it ineffective. model versions. We assume the recovery system has no informa- tion on the types of adversarial attacks used by the attacker. 4 INTUITION OF OUR RECOVERY DESIGN Once recovery is performed after a detected breach, the model We now present the design intuition behind Neo, our proposed owner moves the training data to an offline secure server, leaving post-breach recovery system. The goal of recovery is to, upon ℎ only the newly generated model version on the deployment server. model breach, deploy a new version ( + 1) that can answer benign
queries with high accuracy and resist white-box adversarial examples generated from previously leaked versions. Clearly, an ideal design is to generate a new model version F_{i+1} that shares zero adversarial transferability with any subset of (F_1, ..., F_i). Yet this is practically infeasible, as discussed in §3.4. Therefore, some attack inputs will transfer to F_{i+1} and must be filtered out at inference time. In Neo, this is achieved by the filter D_{i+1}.

Detecting/filtering transferred adversarial examples. Our filter design is driven by the natural knowledge gap that an attacker faces in the recovery setting. Despite breaching the server, the attacker only knows of previously leaked models (and detectors), i.e., {F_j, D_j}, j ≤ i, but not F_{i+1}. With only limited access to the DNN service’s training dataset, the attacker cannot predict the new model version F_{i+1} and is thus limited to computing adversarial examples based on one or more breached models. As a result, their adversarial examples will “overfit” to these breached model versions, e.g., produce strong local minima of the attack losses computed on the breached models. But the optimality of these adversarial examples reduces under the new version F_{i+1}, which is unknown to the attacker’s optimization process. This creates a natural gap between attack losses observed on F_{i+1} and those observed on F_j, j < i + 1.

Figure 2: Intuitive (1-D) visualization of the loss surfaces of a breached model F_1 and its recovery version F_2. The attacker computes adversarial examples using F_1. Their loss optimality degrades when transferred to F_2, whose loss surface is different from that of F_1.

We illustrate an abstract version of this intuition in Figure 2. We consider the simple scenario where one version F_1 is breached and the recovery system launches a new version F_2. The top figure shows the hypothesized loss function (of the target label y_t) for the breached model F_1, from which the attacker locates an adversarial example x + η by finding a local minimum. The bottom figure shows the loss function of y_t for the recovery model F_2, e.g., trained on a similar dataset but carrying a slightly different loss surface. While x + η transfers to F_2 (i.e., F_2(x + η) = y_t), it is less optimal on F_2. This “optimality gap” comes from the loss surface misalignment between F_1 and F_2, and from the fact that the attack input x + η overfits to F_1.

Thus we detect and filter adversarial examples generated from model leakages by detecting this “optimality gap” between the new model F_2 and the leaked model F_1. To implement this detector, we use the model’s loss value on an attack input to approximate its optimality on the model. Intuitively, the smaller the loss value, the more optimal the attack. Therefore, if x + η_1 is an adversarial example optimized on F_1 and transfers to F_2, we have

ℓ(F_2(x + η_1), y_t) − ℓ(F_1(x + η_1), y_t) ≥ τ    (1)

where ℓ is the negative-log-likelihood loss, and τ is a positive number that captures the classification surface difference between F_1 and F_2. Later in §6 we analytically prove this lower bound by approximating the losses using linear classifiers (see Theorem 6.1). On the other hand, for a benign input x, the loss difference is

ℓ(F_2(x), y) − ℓ(F_1(x), y) ≈ 0,    (2)

if F_1 and F_2 use the same architecture and are trained to perform well on benign data (discussed next). These two properties, eq. (1)–(2), allow us to distinguish between benign and adversarial inputs. We discuss Neo’s filtering algorithm in §5.3.
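A minimal sketch of this loss-gap measurement, assuming two trained PyTorch classifiers f1 (leaked) and f2 (recovered), cross-entropy as the negative-log-likelihood loss of eq. (1), and some attack routine such as targeted PGD; all names here (f1, f2, targeted_pgd, x_benign, y_target) are illustrative assumptions rather than the paper’s code.

```python
# Sketch of the "optimality gap" behind eq. (1) and eq. (2): an adversarial
# example optimized on the leaked F_1 should show a much larger loss increase
# on the new version F_2 than a benign input does.
import torch
import torch.nn.functional as F

@torch.no_grad()
def predicted_label(model, x):
    return model(x).argmax(dim=1)

@torch.no_grad()
def loss_gap(f_new, f_leaked, x, label):
    """Per-sample loss on the new version minus loss on the leaked version."""
    return (F.cross_entropy(f_new(x), label, reduction="none")
            - F.cross_entropy(f_leaked(x), label, reduction="none"))

# benign inputs: the gap should stay close to 0 (eq. 2)
gap_benign = loss_gap(f2, f1, x_benign, predicted_label(f2, x_benign))

# attack inputs crafted on the leaked F_1: when they transfer, the gap should
# exceed some positive tau (eq. 1), which is what the filter later exploits
x_adv = targeted_pgd(f1, x_benign, y_target)     # assumed attack helper
gap_attack = loss_gap(f2, f1, x_adv, predicted_label(f2, x_adv))
```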
Recovery-oriented model version training. To enable our detection method, our recovery system must train model versions to achieve two goals. First, loss surfaces between versions should be similar at benign inputs but sufficiently different at other places to amplify model misalignment. Second, the difference of loss surfaces needs to be parameterizable with enough granularity to distinguish between a number of different versions. Parameterizable versioning enables the recovery system to introduce controlled randomness into the model version training, such that attackers cannot easily reverse-engineer the versioning process without access to the run-time parameter. We discuss Neo’s model versioning algorithm in §5.2.

5 RECOVERY SYSTEM DESIGN
We now present the detailed design of Neo. We first provide a high-level overview, followed by a detailed description of its two core components: model versioning and input filters.

5.1 High-level Overview
To recover from the i-th model breach, Neo deploys F_{i+1} and D_{i+1} to revive the DNN service, as shown in Figure 1(b). The design of Neo consists of two core components: generating model versions (F_{i+1}) and filtering attack inputs generated from leaked models (D_{i+1}).

Component 1: Generating model versions. Given a classification task, this step trains a new model version (F_{i+1}). This new version should achieve high classification accuracy on the designated task but display a different loss surface from the previous versions (F_1, ..., F_i). Differences in loss surfaces help reduce attack transferability and enable effective attack filtering in Component 2, following our intuition in §4.

Component 2: Filtering adversarial examples. This component generates a customized filter (D_{i+1}), which is deployed alongside the new model version (F_{i+1}). The goal of the filter is to block any effective adversarial examples constructed using previously breached versions. The filter design is driven by the intuition discussed in §4.
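Before detailing the two components, the sketch below shows how they slot into the version-based recovery loop of Figure 1(b). The helper names train_new_version (Component 1) and build_filter (Component 2) are assumptions standing in for §5.2 and §5.3; this is a structural sketch, not the paper’s implementation.

```python
# Structural sketch of the recovery loop: on each detected breach, retire the
# leaked version and deploy (F_{i+1}, D_{i+1}). Helper names are placeholders,
# and the filter is assumed to return a per-sample boolean flag.
class NeoService:
    def __init__(self, training_pipeline):
        self.pipeline = training_pipeline
        self.current_model = train_new_version(self.pipeline)   # F_1
        self.retired_models = []    # breached versions F_1 ... F_i
        self.filter = None          # F_1 ships without a filter D_1

    def on_breach_detected(self):
        """Retire the leaked version and deploy (F_{i+1}, D_{i+1})."""
        self.retired_models.append(self.current_model)
        # Component 1: a fresh version trained with new hidden distributions
        self.current_model = train_new_version(self.pipeline)
        # Component 2: filter that flags inputs overfit to any breached version
        self.filter = build_filter(self.current_model, self.retired_models)

    def answer_query(self, x):
        if self.filter is not None and self.filter.is_adversarial(x).any():
            return None             # drop suspected attack queries
        return self.current_model(x).argmax(dim=1)
```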
5.2 Generating Model Versions
An effective version generation algorithm needs to meet the following requirements. First, each generated version needs to achieve high classification accuracy on the benign dataset. Second, versions need to have sufficiently different loss surfaces from each other in order to ensure high filter performance. Highly different loss surfaces are challenging to achieve, as training on a similar dataset often leads to models with similar decision boundaries and loss surfaces. Lastly, an effective versioning system also needs to ensure a large space of possible versions, so that attackers cannot easily enumerate through the entire space to break the filter.

Training model variants using hidden distributions. Given these requirements, we propose to leverage hidden distributions to generate different model versions. Hidden distributions are a set of new data distributions (e.g., sampled from a different dataset) that are added into the training data of each model version. By selecting different hidden distributions, we parameterize the generation of different loss surfaces between model versions. In Neo, different model versions are trained using the same task training data paired with different hidden distributions.

Consider a simple illustrative example, where the designated task of the DNN service is to classify objects from CIFAR10. Then we add a set of “Stop Sign” images from an orthogonal dataset (GTSRB) when training a version of the classifier. (No GTSRB images exist in the CIFAR10 dataset, and vice versa.) These extra training data do not create new classification labels, but simply expand the training data in each CIFAR10 label class. Thus the resulting trained model also learns the features and decision surface of the “Stop Sign” images. Next, we use different hidden distributions (e.g., other traffic signs from GTSRB) to train different model versions.

Generating model versions using hidden distributions meets all three requirements listed above. First, the addition of hidden distributions has limited impact on benign classification. Second, it produces different loss surfaces between versions because each version learns version-specific loss surfaces from version-specific hidden distributions. Lastly, there exists a vast space of possible data distributions that can be used as hidden distributions.

Figure 3: Illustration of our proposed model version generation. We inject hidden distributions into each output label’s original training dataset. Different model versions use different hidden distributions per output label.

Per-label hidden distributions. Figure 3 presents a detailed view of Neo’s version generation process. For each version, we use a separate hidden distribution for each label in the original task training dataset (N labels corresponding to N hidden distributions). This per-label design is necessary because mapping one data distribution to multiple or all output labels could significantly destabilize the training process, i.e., the model is unsure which is the correct label for this distribution.

After selecting a hidden distribution X_h^y for each label y, we jointly train the model on the original task training dataset X and the hidden distributions:

min_θ Σ_{y∈Y} ( Σ_{x∈X^y} ℓ(y, F_θ(x)) + λ · Σ_{x∈X_h^y} ℓ(y, F_θ(x)) )    (3)

where θ is the model parameter and Y is the set of output labels of the designated task. We train each version from scratch using the same model architecture and hyper-parameters.
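The sketch below shows one optimization step of eq. (3) in PyTorch, assuming the task batch and the per-label hidden-distribution batch come from separate loaders and that λ is a tunable weight; the loader structure and the λ value are illustrative assumptions, and the paper’s actual hyper-parameters are in its appendix.

```python
# Sketch of the joint objective in eq. (3): task data and the per-label hidden
# distribution data share the same label space, with the latter weighted by
# lambda. Batch structure and lam=1.0 are illustrative assumptions.
import torch.nn.functional as F

def version_training_step(model, optimizer, task_batch, hidden_batch, lam=1.0):
    x_task, y_task = task_batch      # x in X^y with label y
    x_hid, y_hid = hidden_batch      # GAN-generated samples assigned to label y

    optimizer.zero_grad()
    loss_task = F.cross_entropy(model(x_task), y_task)
    loss_hidden = F.cross_entropy(model(x_hid), y_hid)
    loss = loss_task + lam * loss_hidden      # eq. (3), evaluated on the batch
    loss.backward()
    optimizer.step()
    return loss.item()
```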
Our per-label design can lead to the need for a large number of hidden distributions, especially for DNN tasks with a large number of labels (N > 1000). Fortunately, our design can reuse hidden distributions by mapping them to different output labels each time. This is because the same hidden distribution, when assigned to different labels, already introduces a significantly different modification to the model. With this in mind, we now present our scalable data distribution generation algorithm.

GAN-generated hidden distributions. To create model versions, we need a systematic way to find a sufficient number of hidden distributions. In our implementation, we leverage a well-trained generative adversarial network (GAN) [24, 37] to generate realistic data that can serve as hidden distributions. A GAN is a parametrized function that maps an input noise vector to a structured output, e.g., a realistic image of an object. A well-trained GAN will map similar (by Euclidean distance) input vectors to similar outputs, and map far-away vectors to highly different outputs [24]. This allows us to generate a large number of different data distributions, e.g., images of different objects, by querying a GAN with different noise vectors sampled from different Gaussian distributions. Details of the GAN implementation and sampling parameters are discussed in Appendix §A.2.

Preemptively defeating adaptive attacks with feature entanglement. The version generation discussed above also opens up potential adaptive attacks, because the resulting models often learn two separate feature regions for the original task and the hidden distributions. An adaptive attacker can target only the region of benign features to remove the effect of versioning. As a result, we further enhance our version generation approach by “entangling” the features of the original and hidden distributions together, i.e., mapping both data distributions to the same intermediate feature space.

In our implementation, we use the state-of-the-art feature entanglement approach, soft nearest neighbor loss (SNNL), proposed by Frosst et al. [22]. SNNL adds an additional loss term to the model optimization in eq. (3) that penalizes the feature differences of inputs within each class. We detail the exact loss function and implementation of SNNL in Appendix A.2.
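For reference, the sketch below computes a generic soft nearest neighbor loss over a batch of intermediate-layer features, in the spirit of Frosst et al. [22]. The temperature, the weight given to this term, and exactly how it is combined with eq. (3) are specified in the paper’s Appendix A.2, so treat this as an assumption-laden building block rather than Neo’s exact entanglement loss.

```python
# Generic soft nearest neighbor loss (SNNL): lower values mean each sample's
# nearest feature-space neighbors share its label. Temperature is a placeholder.
import torch

def snnl(features, labels, temperature=100.0, eps=1e-8):
    """features: (B, d) intermediate activations; labels: (B,) integer labels."""
    b = features.size(0)
    dist_sq = torch.cdist(features, features, p=2) ** 2       # pairwise squared distances
    sim = torch.exp(-dist_sq / temperature)
    not_self = ~torch.eye(b, dtype=torch.bool, device=features.device)
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
    num = (sim * (same_label & not_self).float()).sum(dim=1)  # same-class neighbors
    den = (sim * not_self.float()).sum(dim=1)                 # all other samples
    return -torch.log(num / (den + eps) + eps).mean()
```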
5.3 Filtering Adversarial Examples
The task of the filter D_{i+1} is to filter out adversarial queries generated by attackers using breached models (F_1 to F_i). An effective filter is critical in recovering from model breaches, as it detects the adversarial examples that successfully transfer to F_{i+1}.

Measuring attack overfitting on each breached version. Our filter leverages eq. (1) to check whether an input overfits on any of the breached versions, i.e., produces an abnormally high loss difference between the new version F_{i+1} and any of the breached models. To do so, we run input x through each breached version (F_1 to F_i) for inference to calculate its loss difference. More specifically, for each input x, we first find its classification label y_x outputted by the new version F_{i+1}. We then compute the loss difference of x between F_{i+1} and each of the previous versions F_k, and find the maximum loss difference:

Δ(x) = max_{k=1,...,i} [ ℓ(F_{i+1}(x), y_x) − ℓ(F_k(x), y_x) ]    (4)

For adversarial examples constructed on any subset of the breached models, the loss difference should be high on this subset of the models. Thus, Δ(x) should have a high value. Later in §8, we discuss potential adaptive attacks that seek to decrease the attack overfitting and thus Δ(x).

Filtering with threshold calibrated by benign inputs. To achieve effective filtering, we need to find a well-calibrated threshold for Δ(x), beyond which the filter considers x to have overfitted on previous versions and flags it as adversarial. We use benign inputs to calibrate this threshold (φ_{i+1}). The choice of φ_{i+1} determines the tradeoff between the false positive rate and the filter success rate on adversarial inputs. We configure φ_{i+1} at each recovery run by computing the statistical distribution of Δ(x) on known benign inputs from the validation dataset. We choose φ_{i+1} to be the p-th percentile value of this distribution, where 1 − p/100 is the desired false positive rate. Thus, the filter D_{i+1} is defined by

if Δ(x) ≥ φ_{i+1}, then flag x as adversarial.    (5)

We recalculate the filter threshold at each recovery run because the calculation of Δ(x) changes with a different number of breached versions. In practice, the change of φ is small as i increases, because the loss differences of benign inputs remain small on each version.
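Putting eq. (4) and eq. (5) together, a filter D_{i+1} might be sketched as below; the class and method names are illustrative, cross-entropy stands in for the negative-log-likelihood loss, and the calibration step mirrors the percentile-based threshold described above.

```python
# Sketch of the loss-gap filter D_{i+1}: score each query by the maximum loss
# difference between the new version and every breached version (eq. 4), then
# flag scores above a threshold calibrated on benign inputs (eq. 5).
import torch
import torch.nn.functional as F

class LossGapFilter:
    def __init__(self, new_model, breached_models):
        self.f_new = new_model                # F_{i+1}
        self.f_old = list(breached_models)    # F_1 ... F_i
        self.threshold = None                 # phi_{i+1}

    @torch.no_grad()
    def delta(self, x):
        """Eq. (4): max_k [ l(F_{i+1}(x), y_x) - l(F_k(x), y_x) ]."""
        logits_new = self.f_new(x)
        y_x = logits_new.argmax(dim=1)        # label output by the new version
        loss_new = F.cross_entropy(logits_new, y_x, reduction="none")
        gaps = [loss_new - F.cross_entropy(f_k(x), y_x, reduction="none")
                for f_k in self.f_old]
        return torch.stack(gaps, dim=0).max(dim=0).values

    @torch.no_grad()
    def calibrate(self, benign_loader, fpr=0.05):
        """Set phi_{i+1} so that roughly `fpr` of benign inputs are flagged."""
        scores = torch.cat([self.delta(x) for x, _ in benign_loader])
        self.threshold = torch.quantile(scores, 1.0 - fpr).item()

    @torch.no_grad()
    def is_adversarial(self, x):
        """Eq. (5): per-sample flag, True where delta(x) >= phi_{i+1}."""
        return self.delta(x) >= self.threshold
```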
Unsuccessful attacks. For unsuccessful adversarial examples, where attacks fail to transfer to the new version F_{i+1}, our filter does not flag these inputs, since these inputs have ℓ(F_{i+1}(x), y_t) > ℓ(F_k(x), y_t). However, if the model owner wants to identify these failed attack attempts, they are easy to identify since they have different output labels on different model versions.

5.4 Cost and Limitations
Deployment of all previous versions in each filter. To calculate the detection metric Δ(x), filter D_{i+1} includes all previously breached models (F_1 ... F_i) alongside F_{i+1}. This has two implications. First, if an attacker later breaches version i + 1, they automatically gain access to all previous versions. This simplifies the attacker’s job, making it faster (and cheaper) for them to collect multiple models to perform ensemble attacks. Second, the filter induces an inference overhead, as inputs now need to go through each previous version. While this can be parallelized to reduce latency, the total inference computation overhead grows linearly with the number of breaches.

We also considered an alternative design for Neo, where we do not use previously breached models at inference time. Instead, for each input, we use local gradient search to find any nearby local loss minima, and use them to approximate the amount of potential overfit to a previously breached model version (or surrogate model), i.e., Δ(x) in eq. (4). While it avoids the limitations listed above, this approach relies on simplifying assumptions about the minimum loss value across model versions, which may not always hold. In addition, it requires multiple gradient computations for each input, making it prohibitively expensive in practical settings.

Threat of adaptive attacks. Model breaches lead to adaptive attacks. Attackers can observe differences between breached versions to craft adaptive attacks that evade the filter. Later in §8, we discuss and evaluate 7 potential adaptive attacks. Overall, these attacks have limited efficacy, mainly limited by the tension between model overfitting and attack transferability.

Limited number of total recoveries possible. Neo’s ability to recover is not unlimited. It degrades over time against an attacker with an increasing number of breached versions. This means Neo is no longer effective once the number of actual server breaches exceeds its NBR. While current results show we can recover after several server breaches even under strong adaptive attacks (§8), we consider this work an initial step, and expect future work to develop mechanisms that provide stronger recovery properties.

6 FORMAL ANALYSIS
We present a formal analysis that explains the intuition of using the loss difference to filter adversarial samples generated from the leaked model. Without loss of generality, let F_a and F_b be the leaked and recovered models of Neo, respectively. We analytically compare ℓ2 losses around an adversarial input x′ on the two models, where x′ is computed from F_a and sent to attack F_b. We show that if the attack x′ transfers to F_b, the loss difference between F_a and F_b is lower bounded by a value τ, which increases with the classifier parameter difference between F_a and F_b. Therefore, by training F_a and F_b such that their benign loss difference is smaller than τ, a loss-based detector can separate adversarial inputs from benign inputs.

Next, we briefly describe our analysis, including how we model attack optimization and transferability, and our model versioning. We then present the main theorem and its implications. The detailed proof is in the Appendix.

Attack optimization and transferability. We consider an adversary who optimizes an adversarial perturbation η on model F_a for benign input x and target label y_t, such that the loss at x′ = x + η is small within some range ε, i.e., ℓ2(F_a(x + η), y_t) < ε. Next, in order for (x + η, y_t) to transfer to model F_b, i.e., F_a(x + η) = F_b(x + η) = y_t, the loss ℓ2(F_b(x + η), y_t) is also constrained by some value ε′ > ε that allows F_b to classify x + η as y_t, i.e., ℓ2(F_b(x + η), y_t) < ε′.

Recovery-based model training. Our recovery design trains models F_a and F_b using the same task training data but paired with different hidden distributions. We assume that F_a and F_b are well-trained such that their ℓ2 losses are nearly identical at a benign input x but differ near x′ = x + η. For simplicity, we approximate the ℓ2 losses around x′ on F_a and F_b by those of a linear classifier. We assume F_a and F_b, as linear classifiers, have the same slope but different intercepts. Let D_{a,b} > 0 represent the absolute intercept difference between F_a and F_b.

Theorem 6.1. Let x′ be an adversarial example computed on F_a with target label y_t. When x′ is sent to model F_b, there are two cases:
Case 1: if D_{a,b} > √ε′ − √ε, the attack (x′, y_t) does not transfer to F_b, i.e., F_b(x′) ≠ F_a(x′);
Case 2: if (x′, y_t) transfers to F_b, then with a high probability q,

ℓ2(F_b(x′), y_t) − ℓ2(F_a(x′), y_t) > τ    (6)
Conference’17, July 2017, Washington, DC, USA Shawn Shan, Wenxin Ding, Emily Wenger, Haitao Zheng, and Ben Y. Zhao √ √ where = D , · (D , + 2 − 4 · ). When = 1, we have Task Standard Model Neo’s Versioned Models √ Classification Accuracy Classification Accuracy = D , · (D , − 2 ). CIFAR10 92.1% 91.4 ± 0.2% Theorem 6.1 indicates that given , the lower bound grows with SkinCancer 83.3% 82.9 ± 0.5% D , . By training and such that their benign loss difference YTFace 99.5% 99.3 ± 0.0% ImageNet 78.5% 77.9 ± 0.4% is smaller than , the detector defined by eq. (4) can distinguish between adversarial and benign inputs. Table 2: Benign classification accuracy of standard models and Neo’s model versions (mean and std across 100 versions). 7 EVALUATION 2.5 In this section, we perform a systematic evaluation of Neo on 4 2 Loss Difference classification tasks and against 3 white-box adversarial attacks. We 1.5 discuss potential adaptive attacks later in §8. In the following, we 1 present our experiment setup, and evaluate Neo under a single 0.5 server breach (to understand its filter effectiveness) and multiple 0 model breaches (to compute its NBR and benign classification ac- -0.5 Benign PGD CW EAD curacy). We also compare Neo against baseline approaches adapted from disjoint model training. Figure 4: Comparing Δ of benign and adversarial inputs. Boxes show inter-quartile range, whiskers capture 5 ℎ /95 ℎ 7.1 Experimental Setup percentiles. (Single model breach). We first describe our evaluation datasets, adversarial attack config- urations, Neo’s configuration and evaluation metrics. Datasets. We test Neo using four popular image classification Evaluation Metrics. We evaluate Neo by its number of breaches tasks described below. More details are in Table 8 in Appendix. recoverable (NBR), defined in §3.3 as number of model breaches the system can effectively recover from. We consider a model “re- • CIFAR10 – This task is to recognize 10 different objects. It is widely covered” when the targeted success rate of attack samples gener- used in adversarial machine learning literature as a benchmark ated on breached models is ≤ 20%. This is because 1) the misclas- for attacks and defenses [42]. sification rates on benign inputs are often close to 20% for many • SkinCancer – This task is to recognize 7 types of skin cancer [73]. tasks (e.g., CIFAR10 and ImageNet), and 2) less than 20% success The dataset consists of 10 dermatoscopic images collected over rate means attackers need to launch multiple (≥ 5 on average) at- a 20-year period. tack attempts to cause a misclassification. We also evaluate Neo’s • YTFace – This simulates a security screening scenario via face benign classification accuracy, by examining the mean and std recognition, where it tries to recognize faces of 1, 283 people [86]. values across 100 model versions. Table 2 compares them to the • ImageNet – ImageNet [19] is a popular benchmark dataset for classification accuracy of a standard model (non-versioning). We computer vision and adversarial machine learning. It contains see that the addition of hidden distributions does not reduce model over 2.6 million training images from 1, 000 classes. performance (≤ 0.6% difference from the standard model). Adversarial attack configurations. We evaluate Neo against three representative targeted white-box adversarial attacks: PGD, 7.2 Model Breached Once CW, and EAD (described in §2.2). The exact attack parameters are We first consider the scenario where the model is breached once. 
listed in Table 10 in Appendix. These attacks achieve an average Evaluating Neo in this setting is useful since upon a server breach, of 97.2% success rate against the breached versions and an aver- the host can often identify and patch critical vulnerabilities, which age of 86.6% transferability-based attack success against the next effectively delay or even prevent subsequent breaches. In this case, recovered version (without applying Neo’s filter). We assume the we focus on evaluating Neo’s filter performance. attacker optimizes adversarial examples using the breached model Comparing Δ of adversarial and benign inputs. Our fil- version(s). When multiple versions are breached, the attacker jointly ter design is based on the intuition that transferred adversarial ex- optimizes the attack on an ensemble of all breached versions. amples produce large Δ (defined by eq.(4)) than benign inputs. Recovery system configuration. We configure Neo using the We empirically verify this intuition on CIFAR10. We randomly sam- methodology laid out in §5. We generate hidden distributions using ple 500 benign inputs from CIFAR10’s test set and generate their a well-trained GAN. In Appendix A.2, we describe the GAN imple- adversarial examples on the leaked model using the 3 white-box mentation and sampling parameters, and show that our method attack methods. Figure 4 plots the distribution of Δ of both be- produces a large number of hidden distributions. For each classifi- nign and attack samples. The benign Δ is centered around 0 cation task, we train 100 model versions using the generated hid- and bounded by 0.5, while the attack Δ is consistently higher den distributions. When running experiments with model breaches, for all 3 attacks. We also observe that CW and EAD produce higher we randomly select model versions to serve as the breached ver- attack Δ than PGD, likely because these two more powerful at- sions. We then choose a distinct version to serve as the new version tacks overfit more on the breached model. +1 and construct the filter +1 following §5.3. Additional details Filter performance. For all 4 datasets and 3 white-box attacks, about model training can be found in Table 9 in Appendix. Table 3 shows the average and std of filter success rate, which is
Post-breach Recovery: Protection against White-box Adversarial Examples for Leaked DNN Models Conference’17, July 2017, Washington, DC, USA Filter success rate against Task Number of breaches recoverable (NBR) of Neo. Next, we PGD CW EAD CIFAR10 99.8 ± 0.0% 99.9 ± 0.0% 99.9 ± 0.0% evaluate Neo on its NBR, i.e., the number of model breaches recov- SkinCancer 99.6 ± 0.0% 99.8 ± 0.0% 99.8 ± 0.0% erable before the attack success rate is above 20% on the recovered YTFace 99.3 ± 0.1% 99.9 ± 0.0% 99.8 ± 0.0% version. Table 4 shows the NBR results for all 4 tasks and 3 attacks ImageNet 99.5 ± 0.0% 99.6 ± 0.0% 99.8 ± 0.0% (all ≥ 7.1) at 5% FPR. The average NBR for CIFAR10 is slightly Table 3: Filter success rate of Neo at 5% false positive rate, lower than the others, likely because the smaller input dimension averaged across 500 inputs. (Single breach) of CIFAR10 models makes attacks less likely to overfit on specific 2.5 model versions. Again Neo performs better on CW and EAD at- Max Loss Difference 2 tacks, which is consistent with the results in Figure 4. 1.5 Figure 7 plots the average NBR as false positive rate (FPR) in- 1 creases from 0% to 10% on all 4 dataset against PGD attack. At 0% 0.5 0 FPR, Neo can recover a max of ≥ 4.1 model breaches. The average -0.5 NBR quickly increases to 7.0 when we increase FPR to 4%. 1 2 3 4 5 6 7 # of Breached Versions Better recovery performance against stronger attacks. We observe an interesting phenomenon in which Neo performs bet- Figure 5: Loss difference (Δ ) of PGD adversarial inputs ter against stronger attacks (CW and EAD) than against weaker on CIFAR10 as the attacker uses more breached versions to attacks (PGD). Thus, we systemically explore the impact of attack construct attack. (Multiple breaches) strength on Neo’s recovery performance. We generate attacks with Average NBR & std a variety of strength by varying the attack perturbation budgets Task PGD CW EAD and optimization iterations of PGD attacks. Figure 8 shows that as CIFAR10 7.1 ± 0.7 9.1 ± 0.5 8.7 ± 0.6 the attack perturbation budget increases, Neo’s NBR also increases. SkinCancer 7.5 ± 0.8 9.8 ± 0.7 9.3 ± 0.5 YTFace 7.9 ± 0.5 10.9 ± 0.7 10.0 ± 0.8 Similarly, we find that Neo performs better against adversarial at- ImageNet 7.5 ± 0.6 9.6 ± 0.8 9.7 ± 1.0 tacks with more optimization iterations (see Table 11 in Appendix). These results show that Neo indeed performs better on stronger Table 4: Average NBR and std of Neo across 4 tasks/3 adver- attacks, as stronger attacks more heavily overfit on the breached sarial attacks at 5% FPR. (Multiple breaches) versions, enabling easier detection by our filter. This is an interest- ing finding given that existing defense approaches often perform the percent of adversarial examples flagged by our filter. The fil- worse on stronger attacks. Later in §8.1, we explore additional at- ter achieves ≥ 99.3% success rate at 5% false positive rate (FPR) tack strategies that leverage weak adversarial attacks to see if they and ≥ 98.9% filter success rate at 1% FPR. The ROC curves and bypass our filter. We find that weak adversarial attacks have poor AUC values of our filter are in Figure 14 in the Appendix. For all transferability resulting in low attack success on the new version. attacks/tasks, the detection AUC is > 99.4%. Such a high perfor- Inference Overhead. 
A final key consideration in the “multi- mance show that Neo can successfully prevent adversarial attacks ple breaches” setting is how much overhead the filter adds to the generated on the breached version. inference process. In many DNN service settings, quick inference 7.3 Model Breached Multiple Times is critical, as results are needed in near-real time. We find that the filter overhead linearly increases with the number of breached ver- Now we consider the advanced scenario where the DNN service is sions, although modern computing hardware can minimize the ac- breached multiple times during its life cycle. After the th model tual filtering + inference time needed for even large neural net- breach, we assume the attacker has access to all previously breached works. A CIFAR10 model inference takes 5ms (on an NVIDIA Ti- models 1, ..., , and can launch a more powerful ensemble attack tan RTX), while an ImageNet model inference takes 13ms. After by optimizing adversarial examples on the ensemble of 1, ..., at 7 model breaches, the inference now takes 35ms for CIFAR10 and once. This ensemble attack seeks to identify adversarial examples 91ms for ImageNet. This overhead can be further reduced by lever- that exploit similar vulnerabilities across versions, and ideally they aging multiple GPUs to parallelize the loss computation. will overfit less on each specific version. In Appendix A.3, we ex- plore alternative attack methods that utilize the access to multiple model versions and show that they offer limited improvement over 7.4 Comparison to Baselines the ensemble attack discussed here. Finally, we explore possible alternatives for model recovery. As Impact of number of breached versions. As an attacker uses there exists no prior work on this problem, we study the possibil- more versions to generate adversarial examples, the generated ex- ity of adapting existing defenses against adversarial examples for amples will have a weaker overfitting behavior on any specific recovery purposes. However, existing white-box and black-box de- version. Figure 5 plots the Δ of PGD adversarial examples on fenses are both ineffective under the model breach scenario, espe- CIFAR10 as a function of the number of model breaches, generated cially against multiple breaches. The only related solution is exist- using the ensemble attack method. The Δ decreases from 1.62 ing work on adversarially-disjoint ensemble training [1, 36, 83, 84]. to 0.60 as the number of breaches increases from 1 to 7. Figure 6 Disjoint ensemble training seeks to train multiple models on shows the filter success rate (5% FPR) against ensemble attacks on the same dataset so that adversarial examples constructed on one CIFAR10 using up to 7 breached models. When the ensemble con- model in the ensemble transfer poorly to other models. This ap- tains 7 models, the filter success rate drops to 81%. proach was originally developed as a white-box defense, in which
Conference’17, July 2017, Washington, DC, USA Shawn Shan, Wenxin Ding, Emily Wenger, Haitao Zheng, and Ben Y. Zhao 1 8 8 Filter Success Rate 0.8 Average NBR Average NBR 6 6 0.6 4 4 0.4 CIFAR10 CIFAR10 PGD 2 SkinCancer 2 SkinCancer 0.2 YTFace CW YTFace EAD ImageNet ImageNet 0 0 0 1 2 3 4 5 6 7 0 0.02 0.04 0.06 0.08 0.1 0.01 0.05 0.1 0.15 # of Models Breached False Positive Rate Perturbation Budget Figure 6: Filter success rate of Neo at 5% Figure 7: Average NBR of Neo against Figure 8: Average NBR of Neo against FPR as number of breached versions in- PGD increases as the FPR increases. PGD increases as perturbation budget creases for CIFAR10. (Multiple breaches) (Multiple breaches) ( ) increases. (Multiple breaches) Task Recovery Benign Average NBR Abdelnabi. Abdelnabi et al. [1] directly minimize the adversarial System Name Acc. PGD CW EAD transferability among a set of models. Given a set of initialized TRS 84% 0.7 0.4 0.4 models, they adversarially train each model on FGSM adversarial Abdelnabi 86% 1.7 1.4 1.5 examples generated using other models in the set. When adapted CIFAR10 Abdelnabi+ 85% 1.9 1.5 1.5 to our recovery setting, this technique allows recovery from ≤ 1.7 Neo 91% 7.1 9.7 8.7 model breaches on average (Table 5), again a significantly worse TRS 78% 0.9 0.6 0.5 performance than Neo. Similar to TRS, performance of Abdelnabi Abdelnabi 81% 1.5 1.3 1.2 et al. degrades significantly on the ImageNet dataset and against SkinCancer Abdelnabi+ 82% 1.7 1.2 1.4 stronger attacks. Abdelnabi consistently outperforms TRS, which Neo 87% 7.5 9.8 9.3 is consistent with empirical results in [1]. TRS 96% 0.7 0.5 0.7 Abdelnabi 97% 1.5 1.1 1.2 Abdelnabi+. We try to improve the performance of Abdelnabi [1] YTFace by further randomizing the model architecture and optimizer of Abdelnabi+ 98% 1.8 1.5 1.4 Neo 99% 7.9 10.9 10.0 each version. Wu et al. [79] shows that using different training pa- TRS 68% 0.4 0.2 0.1 rameters can reduce transferability between models. We use 3 addi- Abdelnabi 72% 0.7 0.2 0.4 tional model architectures (DenseNet-101 [33], MobileNetV2 [63], ImageNet Abdelnabi+ 70% 0.8 0.3 0.2 EfficientNetB6 [69]) and 3 optimizers (SGD, Adam [39], Adadelta [90]). Neo 79% 7.5 9.6 9.7 We follow the same training approach of [1], but randomly select Table 5: Comparing NBR and benign classification accuracy a unique model architecture/optimizer combination for each ver- of TRS, Abdelnabi, Abdelnabi+, and Neo. sion. We call this approach “Abdelnabi+”. Overall, we observe that Abdelnabi+ performs slightly better than Abdelnabi, but the im- the defender deploys all disjoint models together in an ensemble. provement is largely limited to < 0.2 in NBR (see Table 5). These ensembles offer some robustness against white-box adver- 8 ADAPTIVE ATTACKS sarial attacks. However, in the recovery setting, deploying all mod- els together means attacker can breach all models in a single breach, In this section, we explore potential adaptive attacks that seek to thus breaking the defense. reduce the efficacy of Neo. We assume strong adaptive attackers Instead, we adapt the disjoint model training approach to per- with full access to everything on the deployment server during form model recovery by treating each disjoint model as a separate the model breach. Specifically, adaptive attackers have: version. We deploy one version at a time and swap in an unused • white-box access to the entire recovery system, including the re- version after each model breach. 
We select two state-of-the-art dis- covery methodology and the GAN used; joint training methods for comparison, TRS [84] and Abdelnabi et • access to a dataset , containing 10% of original training data. al. [1] and implement them using author-provided code. We fur- We note that the model owner securely stores the training data and ther test an improved version of Abdelnabi et al. [1] that random- any hidden distributions used in recovery elsewhere offline. izes the model architecture and training parameters of each ver- The most effective adaptive attacks would seek to reduce attack sion. Overall, these adapted methods perform poorly as they can overfitting, i.e., reduce the optimality of the generated attacks w.r.t only recover against 1 model breach on average (see Table 5). to the breached models, since this is the key intuition of Neo. How- TRS. TRS [84] analytically shows that transferability correlates ever, these adaptive attacks must still produce adversarial exam- with the input gradient similarity between models and the smooth- ples that transfer. Thus attackers must strike a delicate balance: ness of each individual model. Thus, TRS trains adversarially-disjoint using the breached models’ loss surfaces to search for an optimal models by minimizing the input gradient similarity between a set attack that would have a high likelihood to transfer to the deployed of models while regularizing the smoothness of each model. On model, but not “too optimal,” lest it overfit and be detected. average, TRS can recover from ≤ 0.7 model breaches across all We consider two general adaptive attack strategies. First, we datasets and attacks (Table 5), a significantly lower performance consider an attacker who modifies the attack optimization proce- when compared to Neo. TRS performance degrades on more com- dure to produce “less optimal” adversarial examples that do not plex datasets (ImageNet) and against stronger attacks (CW, EAD). overfit. Second, we consider ways an attacker could try to mimic
Post-breach Recovery: Protection against White-box Adversarial Examples for Leaked DNN Models Conference’17, July 2017, Washington, DC, USA Augmentation an iterative attack which is based on an iterative linearization of CIFAR10 SkinCancer YTFace ImageNet Method the classifier. Both attacks often have much lower attack success DI2 -FGSM 6.6 (↓ 0.8) 6.7 (↓ 0.8) 7.3 (↓ 0.6) 7.0 (↓ 0.5) than attacks such as PGD and CW attacks [65]. VMI-FGSM 6.3 (↓ 0.8) 6.6 (↓ 0.9) 7.0 (↓ 0.9) 6.5 (↓ 1.0) These weaker attacks degrade our filter performance, but do Dropout ( = 0.1) 6.5 (↓ 0.6) 7.0 (↓ 0.5) 7.2 (↓ 0.7) 6.9 (↓ 0.6) not significantly reduce Neo’s NBR due to their low transferabil- Dropout ( = 0.2) 6.4 (↓ 0.7) 7.0 (↓ 0.5) 7.3 (↓ 0.6) 7.1 (↓ 0.4) ity. Overall, Neo maintains ≥ 6.2 NBR against SPSA and Deepfool Table 6: Neo’s average NBR of remains high against adaptive attacks across 4 tasks. In our tests, both SPSA and Deepfool at- PGD attacks that leverage different types of data augmenta- tacks have very low transfer success rates (< 12%) on SkinCancer, tion. ↓ and ↑ denote the decrease/increase in NBR compared YTFace, and ImageNet, even when jointly optimized on multiple to without adaptive attack. breached versions. Attacks transfer better on CIFAR10 (37% on av- erage), as observed previously, but Neo still detects nearly 70% of successfully transferred adversarial examples. Target Output CIFAR10 SkinCancer YTFace ImageNet Low confidence adversarial attack. Another weak attack is Probability a “low confidence” attack, where the adaptive attacker ensures at- 0.9 6.9 (↓ 0.2) — — — tack optimization does not settle in any local optima. To do this, 0.95 6.7 (↓ 0.4) — 7.1 (↓ 0.8) 6.9 (↓ 0.6) the attacker constructs adversarial examples that do not have 100% 0.99 7.0 (↓ 0.1) 7.3 (↓ 0.2) 7.6 (↓ 0.3) 7.7 (↑ 0.2) output probability on the breached versions (over 97% of all PGD Table 7: Neo’s average NBR remains high against low- adversarial examples reach 100% output probabilities). confidence attacks with varying target output probability. Table 7 shows the NBR of Neo against low-confidence attacks “—” denotes the attack has < 20% transfer success rate. with an increasing target output probability. Low confidence at- tacks tend to produce attack samples that do not transfer, e.g., inef- Neo by generating its own local model versions and optimize ad- fective attack samples. For samples that transfer better, Neo main- versarial examples on them. We discuss the two attack strategies tains a high NBR (≥ 6.7) across all tasks. in §8.1 and §8.2 respectively. One possible intuition for why this attack performs poorly is In total, we evaluate against 7 customized adaptive attacks on as follows. The hidden distribution injected during the version- each of our 4 tasks. For each experiment, we follow the recovery ing process shifts the loss surface in some unpredictable direction. system setup discussed in §7. When the adaptive attack involves Without detailed knowledge about the directionality of the shift, the adaption of existing attack, we use PGD attack because it is the the low confidence attack basically shifts the attack along the direc- attack that Neo performs the worst against. tion of descent (in PGD). If this directional vector matches the di- rectionality of the shift introduced by Neo, then it could potentially 8.1 Reducing Overfitting reduce the loss difference Δ . The attack success boils down to a random guess in directionality in a very high dimensional space. 
The adaptive strategy here is to intentionally find less optimal (e.g. weaker) adversarial examples to reduce overfitting. However, these Moving adversarial examples to sub-optimal locations. Fi- less optimal attacks can have low transferability. We evaluate 4 nally, we try an advanced approach in which we move adversarial adaptive attacks that employ this strategy. Overall, we find that examples away from the local optima, and search for an adversar- these types of adaptive attacks have limited efficacy, reducing the ial example whose loss is different from the local optima exactly performance of Neo by at most 1 NBR. equivalent to the loss difference value used by our filter for detec- tion. This might increase the likelihood of reducing the loss differ- Augmentation during attack optimization. Data augmenta- ence of these examples when they transfer to a new model version. tion is an effective technique to reduce overfitting. Recent work [8, We assume the attacker can use iterative queries to probe and de- 23, 77, 81] leverages data augmentation to improve the transfer- termine the threshold value +1 (§5). ability of adversarial examples. We evaluate Neo against three data We test this advanced adaptive attack on the 4 tasks using PGD augmentation approaches, which are applied at each attack opti- and find that this adaptive attack has low transferability (< 36%). mization step: 1) DI2 -FGSM attack [81] which uses series of im- The low transferability is likely due to the low optimality of these age augmentation e.g., image resizing and padding, 2) VMI-FGSM adversarial examples on the breached versions. We do note that attack [77], which leverages more sophisticated image augmenta- for attacks that successfully transfer, they evade our filter 37% of tion, and 3) a dropout augmentation approach [66] where a random the time, a much higher evasion rate than standard PGD attacks. portion ( ) of pixels are set to zero. Overall, the end to end performance of this attack is limited (< 1 Augmented attacks slightly degrade Neo’s recovery performance, reduction in NBR), primarily due to poor transferability. but the reduction is limited (< 0.9, see Table 6). Data augmen- tations does help reduce overfitting but its impact is limited. Weaker adversarial attacks. As shown in §7.3, Neo achieves 8.2 Modifying breached Versions better performance on stronger attacks because stronger attacks Here, the attackers try a different strategy, and try to generate their overfit more on the breached models, making them easier to detect. own local “version” of the model. The attacker hopes to construct Thus, attackers can test if weaker attacks can degrade Neo’s per- adversarial examples that may overfit on the local version but not formance. We test against two weak adversarial attacks, SPSA [74] the breached version, thus evading detection. This type of adap- and DeepFool [54]. SPSA is a gradient-free attack and DeepFool is tive attack faces a similar tradeoff as before. To generate a local