Holistic Adversarial Robustness of Deep Learning Models

Pin-Yu Chen1,3∗, Sijia Liu2,3
1 IBM Research, 2 Michigan State University, 3 MIT-IBM Watson AI Lab
pin-yu.chen@ibm.com and liusiji5@msu.edu

arXiv:2202.07201v1 [cs.LG] 15 Feb 2022

Abstract

Adversarial robustness studies the worst-case performance of a machine learning model to ensure safety and reliability. With the proliferation of deep-learning based technology, the potential risks associated with model development and deployment can be amplified and become dreadful vulnerabilities. This paper provides a comprehensive overview of research topics and foundational principles of research methods for adversarial robustness of deep learning models, including attacks, defenses, verification, and novel applications.

Figure 1: Holistic view of adversarial attack categories and capabilities (threat models) in the training and deployment phases. The three types of attacks highlighted in colors (poisoning/backdoor/evasion attack) are the major focus of this paper. In the deployment phase, the target (victim) can be an access-limited black-box system (e.g. a prediction API) or a transparent white-box model.

1   Introduction

Deep learning [LeCun et al., 2015] is a core engine that drives recent advances in artificial intelligence (AI) and machine learning (ML), and it has broad impacts on our society and technology. However, there is a growing gap between AI technology's creation and its deployment in the wild. One critical example is the lack of robustness, including natural robustness to data distribution shifts, the ability to generalize and adapt to new tasks, and worst-case robustness when facing an adversary (also known as adversarial robustness). According to a recent Gartner report1, 30% of cyberattacks by 2022 will involve data poisoning, model theft, or adversarial examples. However, the industry seems underprepared: in a survey of 28 organizations spanning small as well as large organizations, 25 organizations did not know how to secure their AI/ML systems [Kumar et al., 2020]. Moreover, various practical vulnerabilities and incidents incurred by AI-empowered systems have been reported in real life, such as the Adversarial ML Threat Matrix2 and the AI Incident Database3.

To prepare deep-learning enabled AI systems for the real world and to familiarize researchers with the error-prone risks hidden in the lifecycle of AI model development and deployment – spanning from data collection and processing, model selection and training, to model deployment and system integration – this paper aims to provide a holistic overview of adversarial robustness for deep learning models. The research themes include: (i) attack (risk identification and demonstration), (ii) defense (threat detection and mitigation), (iii) verification (robustness certificate), and (iv) novel applications. In each theme, the fundamental concepts and key research principles will be presented in a unified and organized manner.

Figure 1 shows the lifecycle of AI development and deployment and the different adversarial threats corresponding to attackers' capabilities (also known as threat models). The lifecycle is further divided into two phases. The training phase includes data collection and pre-processing, as well as model selection (e.g. architecture search and design), hyperparameter tuning, model parameter optimization, and validation. After model training, the model is "frozen" (fixed model architecture and parameters) and is ready for deployment. Before deployment, there are possibly some post-hoc model adjustment steps such as model compression and quantization for memory/energy reduction, calibration, or risk mitigation. The frozen model providing inference/prediction can be deployed in a white-box or black-box manner. The former means the model details are transparent to a user (e.g. releasing the model architecture and pre-trained weights for neural networks), while the latter means a user can access model predictions but does not know what the model is (i.e., an access-limited model), such as a prediction API.

∗ Contact Author
1 https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2020
2 https://github.com/mitre/advmlthreatmatrix
3 https://incidentdatabase.ai/
The gray-box setting is an intermediate scenario that assumes a user knows partial information about the deployed model. In some cases, a user may have knowledge of the training data while the deployed model is black-box, such as the case of an AI automation service that only returns a model prediction portal based on user-provided training data. We also note that these two phases can be recurrent: a deployed model can re-enter the training phase with continuous model/data updates.

Throughout this paper, we focus on the adversarial robustness of neural networks for classification tasks. Many principles in classification can be naturally extended to other machine learning tasks, which will be discussed in Section 4. Based on Figure 1, this paper will focus on training-phase and deployment-phase attacks driven by the limitations of current ML techniques. While other adversarial threats concerning model/data privacy and integrity are also crucial, such as model stealing, membership inference, data leakage, and model injection, they will not be covered in this paper. We also note that the adversarial robustness of non-deep-learning models such as support vector machines has been investigated. We refer the readers to [Biggio and Roli, 2018] for the research evolution in adversarial machine learning.

Table 1 summarizes the main mathematical notation. We use [K] = {1, 2, ..., K} to denote the set of K class labels. Without loss of generality, we assume the data inputs are vectorized (flattened) as d-dimensional real-valued vectors, and the output (class confidence) of the K-way neural network classifier fθ is nonnegative and sums to 1 (e.g. softmax as the final layer), that is, Σ_{k=1}^{K} [fθ(·)]k = 1. The adversarial robustness of real-valued continuous data modalities such as image, audio, time series, and tabular data can be studied based on a unified methodology. For discrete data modalities such as texts and graphs, one can leverage their real-valued embeddings [Lei et al., 2019], latent representations, or continuous relaxation of the problem formulation (e.g. topology attack in terms of edge addition/deletion in graph neural networks [Xu et al., 2019a]). Unless specified, in what follows we will not further distinguish data modalities.

Symbol                   Meaning
fθ : R^d → [0, 1]^K      K-way neural network classification model parameterized by θ
logit : R^d → R^K        logit (pre-softmax) representation
(x, y)                   data sample x and its associated groundtruth label y
ŷθ(x) ∈ [K]              top-1 label prediction of x by fθ
xadv                     adversarial example of x
δ                        adversarial perturbation to x for evasion attack
∆                        universal trigger pattern for backdoor attack
t ∈ [K]                  target label for targeted attack
loss(fθ(x), y)           classification loss (e.g. cross entropy)
g                        attacker's loss function
Dtrain / Dtest           original training / testing dataset
T                        data transformation function

Table 1: Mathematical notation.

2   Attacks

This section covers mainstream adversarial threats that aim to manipulate the prediction and decision-making of an AI model through training-phase or deployment-phase attacks. Table 2 summarizes their attack objectives.

Attack                   Objective
Poisoning                Design a poisoned dataset Dpoison such that models trained on Dpoison fail to generalize on Dtest (i.e. ŷθ(xtest) ≠ ytest)
Backdoor                 Embed a trigger ∆ with a target label t into Dtrain such that ŷθ(xtest) = ytest but ŷθ(xtest + ∆) = t
Evasion (untargeted)     Given fθ, find xadv such that xadv is similar to x but ŷθ(xadv) ≠ y
Evasion (targeted)       Given fθ, find xadv such that xadv is similar to x but ŷθ(xadv) = t

Table 2: Objectives of adversarial attacks.

2.1   Training-Phase Attacks

Training-phase attacks assume the ability to modify the training data to achieve malicious attempts on the resulting model, which can be realized through noisy data collection such as crowdsourcing. Specifically, the memorization effect of deep learning models [Zhang et al., 2017; Carlini et al., 2019b] can be leveraged as a vulnerability. We note that sometimes the term "data poisoning" entails both poisoning and backdoor attacks, though their attack objectives are different.

Poisoning attack aims to design a poisoned dataset Dpoison such that models trained on Dpoison will fail to generalize on Dtest (i.e. ŷθ(xtest) ≠ ytest). The poisoned dataset Dpoison can be created by modifying the original training dataset Dtrain, such as label flipping, data addition/deletion, and feature modification. The rationale is that training on Dpoison will land on a "bad" local minimum of model parameters.

To control the amount of data modification and reduce the overall accuracy on Dtest (i.e. test accuracy), a poisoning attack often assumes knowledge of the target model and its training method [Jagielski et al., 2018]. [Liu et al., 2020b] proposes black-box poisoning with additional conditions on the training loss function. Targeted poisoning attack aims at manipulating the prediction of a subset of data samples in Dtest, which can be accomplished by clean-label poisoning (small perturbations to a subset of Dtrain while keeping their labels intact) [Shafahi et al., 2018; Zhu et al., 2019] or gradient-matching poisoning [Geiping et al., 2021].

Backdoor attack is also known as Trojan attack. The central idea is to embed a universal trigger ∆ into a subset of data samples in Dtrain with a modified target label t [Gu et al., 2019]. Examples of trigger patterns are a small patch in images and a specific text string in sentences. Typically, a backdoor attack only assumes access to the training data and does not assume knowledge of the model and its training. The model fθ trained on the tampered data is called a backdoored (Trojan) model. Its attack objective is two-fold: (i) high standard accuracy in the absence of the trigger – the backdoored model should behave like a normal model (the same model trained on untampered data), i.e., ŷθ(xtest) = ytest; and (ii) high attack success rate in the presence of the trigger – the backdoored model will predict any data input with the trigger as the target label t, i.e., ŷθ(xtest + ∆) = t. Therefore, backdoor attacks are stealthy and insidious. The trigger pattern can also be made input-aware and dynamic [Nguyen and Tran, 2020].

There is a growing concern about backdoor attacks in emerging machine learning systems featuring collaborative model training with local private data, such as federated learning [Bhagoji et al., 2019; Bagdasaryan et al., 2020]. Backdoor attacks can be made more insidious by leveraging the innate local model/data heterogeneity and diversity [Zawad et al., 2021]. [Xie et al., 2020] proposes distributed backdoor attacks by trigger pattern decomposition among malicious clients to make the attack more stealthy and effective. We also refer the readers to the detailed survey of these two attacks in [Goldblum et al., 2020].
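To make the training-phase threat model concrete, the following sketch (assuming NumPy arrays of images whose last two dimensions are height and width, and integer class labels; all names and parameters are illustrative) stamps a hypothetical square trigger ∆ onto a small fraction of a copied training set and relabels those samples with a target class t. A plain poisoning attack could instead only flip labels or perturb features, without any trigger.

```python
import numpy as np

def stamp_trigger(image, trigger_value=1.0, patch=3):
    """Stamp a small square trigger pattern (Delta) into the top-left corner."""
    image = np.array(image, copy=True)
    image[..., :patch, :patch] = trigger_value
    return image

def backdoor_training_set(images, labels, target_label, poison_frac=0.05, seed=0):
    """Backdoor-style tampering: stamp the trigger on a random subset of the
    training images and relabel them with the attacker's target class t."""
    rng = np.random.default_rng(seed)
    images = np.array(images, copy=True)
    labels = np.array(labels, copy=True)
    idx = rng.choice(len(images), size=int(poison_frac * len(images)), replace=False)
    for i in idx:
        images[i] = stamp_trigger(images[i])
        labels[i] = target_label  # label manipulation toward the target class
    return images, labels
```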
Norm    Meaning
ℓ0      number of modified features
ℓ1      total changes in modified features
ℓ2      Euclidean distance between x and xadv
ℓ∞      maximal change in modified features

Table 3: ℓp norm similarity measures for additive perturbation δ = xadv − x. The change in each feature (dimension) between x and xadv is measured in absolute value.

Figure 2: Taxonomy and illustration of evasion attacks.
2.2   Deployment-Phase Attacks

The objective of deployment-phase attacks is to find a "similar" example T(x) of x such that the fixed model fθ will evade its prediction from the original groundtruth label y. The evasion condition can be further separated into two cases: (i) untargeted attack such that fθ(x) = y but fθ(T(x)) ≠ y, or (ii) targeted attack such that fθ(x) = y but fθ(T(x)) = t, t ≠ y. Such a T(x) is known as an adversarial example4 of x [Biggio et al., 2013; Szegedy et al., 2014; Goodfellow et al., 2015], and it can be interpreted as an out-of-distribution sample or a generalization error [Stutz et al., 2019].

4 Sometimes an adversarial example may carry the broader meaning of any data sample xadv that leads to an incorrect prediction, thereby dropping the dependence on a reference sample x, such as unrestricted adversarial examples [Brown et al., 2018].

Data Similarity. Depending on data characteristics, specifying a transformation function T(·) that preserves data similarity between an original sample x and its transformed sample T(x) is a core mission for evasion attacks. A common practice to select T(·) is through a simple additive perturbation δ such that xadv = x + δ, or through domain-specific knowledge such as rotation, object translation, and color changes for image data [Hosseini and Poovendran, 2018; Engstrom et al., 2019]. For additive perturbation (either on the data input or on parameter(s) simulating semantic changes), the ℓp norm (p ≥ 1) of δ, defined as ‖δ‖p ≜ (Σ_{i=1}^{d} |δi|^p)^{1/p}, and the pseudo norm ℓ0 are surrogate metrics for measuring similarity distance. Table 3 summarizes popular choices of ℓp norms and their meanings. Taking images as an example, the ℓ0 norm is used to design few-pixel (patch) attacks [Su et al., 2019], the ℓ1 norm is used to generate sparse and small perturbations [Chen et al., 2018b], the ℓ∞ norm is used to confine maximal changes in pixel values [Szegedy et al., 2014], and mixed ℓp norms can also be used [Xu et al., 2019b].
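As a small illustration of the similarity measures in Table 3, the following sketch (NumPy; the function name and inputs are placeholders) computes the ℓ0, ℓ1, ℓ2, and ℓ∞ statistics of an additive perturbation δ = xadv − x.

```python
import numpy as np

def perturbation_norms(x, x_adv):
    """Similarity measures from Table 3 for the additive perturbation delta = x_adv - x."""
    delta = np.asarray(x_adv, dtype=float).ravel() - np.asarray(x, dtype=float).ravel()
    return {
        "l0":   int(np.count_nonzero(delta)),   # number of modified features
        "l1":   float(np.abs(delta).sum()),     # total absolute change
        "l2":   float(np.linalg.norm(delta)),   # Euclidean distance between x and x_adv
        "linf": float(np.abs(delta).max()),     # largest per-feature change
    }
```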
Evasion Attack Taxonomy. Evasion attacks can be categorized based on the attacker's knowledge of the target model. Figure 2 illustrates the taxonomy of different attack types.

White-box attack assumes complete knowledge about the target model, including the model architecture and model parameters. Consequently, an attacker can exploit the auto-differentiation function offered by deep learning packages, such as backpropagation (input gradient) from the model output to the model input, to craft adversarial examples.

Black-box attack assumes an attacker can only observe the model prediction of a data input (that is, a query) and does not know any other information. The target model can be viewed as a black-box function, and thus backpropagation for computing the input gradient is infeasible without knowing the model details. In the soft-label black-box attack setting, an attacker can observe (parts of) the class predictions and their associated confidence scores. In the hard-label black-box attack (decision-based) setting, an attacker can only observe the top-1 label prediction, which is the least information required to be returned to retain the utility of the model. In addition to attack success rate, query efficiency is also an important metric for the performance evaluation of black-box attacks.

Transfer attack is a branch of black-box attack that uses adversarial examples generated from a white-box surrogate model to attack the target model. The surrogate model can be either pre-trained [Liu et al., 2017] or distilled from a set of data samples with soft labels given by the target model for training [Papernot et al., 2016; Papernot et al., 2017].

Attack formulation. The process of finding an adversarial perturbation δ can be formulated as a constrained optimization problem with a specified attack loss g(δ|x, y, t, θ) reflecting the attack objective (t is omitted for untargeted attacks). The variation in problem formulations and solvers will lead to different attack algorithms. We specify three examples below. Without loss of generality, we use the ℓp norm as the similarity measure (distortion) and untargeted attack as the objective, and assume that all feasible data inputs lie in the scaled space S = [0, 1]^d.

• Minimal-distortion formulation:

      Minimize_{δ: x+δ∈S} ‖δ‖p   subject to   ŷθ(x + δ) ≠ ŷθ(x)                    (1)

• Penalty-based formulation:

      Minimize_{δ: x+δ∈S} ‖δ‖p + λ · g(δ|x, y, θ)                                   (2)

• Budget-based (norm-bounded) formulation:

      Minimize_{δ: x+δ∈S} g(δ|x, y, θ)   subject to   ‖δ‖p ≤ ε                      (3)

For untargeted attacks, the attacker's loss can be the negative classification loss g(δ|x, y, θ) = −loss(fθ(x + δ), y) or the truncated class margin loss (using either the logit or the softmax output) defined as g(δ|x, y, θ) = max{[logit(x + δ)]y − max_{k∈[K], k≠y} [logit(x + δ)]k + κ, 0}. The margin loss implies that g(δ|x, y, θ) achieves its minimal value (i.e. 0) when the top-1 class confidence score excluding the original class y satisfies max_{k∈[K], k≠y} [logit(x + δ)]k ≥ [logit(x + δ)]y + κ, where κ ≥ 0 is a tuning parameter governing their confidence gap. Similarly, for targeted attacks, the attacker's loss can be g(δ|x, y, t, θ) = loss(fθ(x + δ), t) or g(δ|x, y, t, θ) = max{max_{k∈[K], k≠t} [logit(x + δ)]k − [logit(x + δ)]t + κ, 0}.
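A minimal sketch of the truncated margin losses above, written for a single logit vector with NumPy (function names are illustrative, not from any specific library): each loss reaches zero exactly when the corresponding attack condition on the logits is met with margin κ.

```python
import numpy as np

def untargeted_margin_loss(logits, y, kappa=0.0):
    """Zero once the best wrong class beats the true class y by at least kappa."""
    logits = np.asarray(logits, dtype=float)
    wrong = np.delete(logits, y)                  # logits of all classes k != y
    return max(logits[y] - wrong.max() + kappa, 0.0)

def targeted_margin_loss(logits, t, kappa=0.0):
    """Zero once the target class t leads every other class by at least kappa."""
    logits = np.asarray(logits, dtype=float)
    other = np.delete(logits, t)                  # logits of all classes k != t
    return max(other.max() - logits[t] + kappa, 0.0)
```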
When implementing black-box attacks, the logit margin loss can be replaced with the observable model output log fθ.

The attack formulation can be generalized to the universal perturbation setting such that a single perturbation simultaneously evades all model predictions. The universality can be with respect to data samples [Moosavi-Dezfooli et al., 2017], model ensembles [Tramèr et al., 2018], or various data transformations [Athalye and Sutskever, 2018]. [Wang et al., 2021a] shows that min-max optimization can yield effective universal perturbations.

Selected Attack Algorithms. We show some white-box and black-box attack algorithms driven by the three aforementioned attack formulations. For the minimal-distortion formulation, the attack constraint ŷθ(x + δ) ≠ ŷθ(x) can be rewritten as max_{k∈[K], k≠y} [fθ(x+δ)]k ≥ [fθ(x+δ)]y, which can be used to linearize the local decision boundary around x and allow for efficient projection to the closest linearized decision boundary, leading to white-box attack algorithms such as DeepFool [Moosavi-Dezfooli et al., 2016] and the fast adaptive boundary (FAB) attack [Croce and Hein, 2020]. For the penalty-based formulation, one can use a change of variable on δ to convert to an unconstrained optimization problem and then use binary search on λ to find the smallest λ leading to a successful attack (i.e., g = 0), known as the Carlini-Wagner (C&W) white-box attack [Carlini and Wagner, 2017b]. For the budget-based formulation, one can apply projected gradient descent (PGD), leading to the white-box PGD attack [Madry et al., 2018]. Attack algorithms using input gradients of the loss function are called gradient-based attacks.
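As a sketch of the budget-based formulation (3) solved by PGD, the PyTorch snippet below (assuming a differentiable classifier `model`, inputs `x` in [0, 1]^d, labels `y`, and illustrative hyperparameters) ascends the cross-entropy loss and projects the perturbation back onto the ℓ∞ ball after every step.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.01, steps=10):
    """L_inf PGD sketch: maximize the classification loss subject to
    ||delta||_inf <= eps while keeping x + delta inside [0, 1]^d."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()             # gradient ascent on the loss
            delta.clamp_(-eps, eps)                        # project onto the L_inf ball
            delta.copy_(torch.clamp(x + delta, 0, 1) - x)  # stay in the valid input box
        delta.grad.zero_()
    return (x + delta).detach()
```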
Black-box attack algorithms often adopt either the penalty-based or the budget-based formulation. Since the input gradient of the attacker's loss is unavailable in the black-box setting, one principal approach is to perform gradient estimation using model queries and then use the estimated gradient to replace the true gradient in white-box attack algorithms, leading to zeroth-order optimization (ZOO) based black-box attacks [Chen et al., 2017]. The choices in gradient estimators [Tu et al., 2019; Nitin Bhagoji et al., 2018; Liu et al., 2018; Liu et al., 2019; Ilyas et al., 2018] and ZOO solvers [Zhao et al., 2019; Zhao et al., 2020b; Ilyas et al., 2019] will give rise to different attack algorithms [Liu et al., 2020a]. Hard-label black-box attacks can still adopt ZOO principles by spending extra queries to explore local loss landscapes for gradient estimation [Cheng et al., 2019; Cheng et al., 2020b; Chen et al., 2020], which is more query-efficient than random exploration [Brendel et al., 2018].
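The sketch below illustrates the ZOO idea under the assumption of a scalar-valued black-box oracle `attack_loss` (a hypothetical callable returning the attacker's loss for a flattened input): a two-point finite-difference estimate over random directions replaces the true input gradient, which can then be plugged into a white-box style update rule such as PGD. The scaling constant follows one common normalization for sphere-sampled directions.

```python
import numpy as np

def zoo_gradient_estimate(attack_loss, x, num_queries=50, mu=1e-3, rng=None):
    """Two-point randomized gradient estimator used in ZOO-style black-box attacks:
    only loss-value queries are needed, no backpropagation through the model."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for _ in range(num_queries):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u) + 1e-12                       # random unit probing direction
        fd = (attack_loss(x + mu * u) - attack_loss(x - mu * u)) / (2 * mu)
        grad += fd * u                                       # directional derivative times direction
    return grad * (x.size / num_queries)                     # scale by d / num_queries
```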
Physical adversarial example is a prediction-evasive physical adversarial object. Examples include stop signs [Eykholt et al., 2018], eyeglasses [Sharif et al., 2016], physical patches [Brown et al., 2017], 3D printing [Athalye and Sutskever, 2018], and T-shirts [Xu et al., 2020b].

3   Defenses and Verification

Defenses are adversarial threat detection and mitigation strategies, which can be divided into empirical and certified defenses. We note that the interplay between attack and defense is essentially a cat-and-mouse game. Many seemingly successful empirical defenses were later weakened by advanced attacks that are defense-aware, which gives a false sense of adversarial robustness due to information obfuscation [Carlini and Wagner, 2017a; Athalye et al., 2018]. Consequently, defenses are expected to be fully evaluated against the best possible adaptive attacks that are defense-aware [Carlini et al., 2019a; Tramer et al., 2020].

While empirical robustness refers to the model performance against a set of known attacks, it may fail to serve as a proper robustness indicator against advanced and unseen attacks. To address this issue, certified robustness is used to ensure the model is provably robust given a set of attack conditions (threat models). Verification can be viewed as a passive certified defense in the sense that its goal is to quantify a given model's (local) robustness with guarantees.

3.1   Empirical Defenses

Empirical defenses are effective methods applied during the training or deployment phase to improve adversarial robustness, without provable guarantees on their effectiveness.

For training-phase attacks, data filtering and model finetuning are major approaches. For instance, [Tran et al., 2018] shows that removing outliers using learned latent representations and retraining the model can reduce the poisoning effect. To inspect whether a pre-trained model contains a backdoor, Neural Cleanse [Wang et al., 2019] reverse-engineers potential trigger patterns for detection. [Wang et al., 2020] proposes data-efficient detectors that require only one sample per class and are made data-free for convolutional neural networks. [Zhao et al., 2020a] exploits the mode connectivity in the loss landscape to recover a backdoored model using limited clean data.

For deployment-phase attacks, adversarial input detection schemes that exploit data characteristics such as spatial or temporal correlations are shown to be effective, such as the detection of audio adversarial examples using temporal dependency [Yang et al., 2019]. For training adversarially robust models, adversarial training, which aims to minimize the worst-case loss evaluated on perturbed examples generated during training, is so far the strongest empirical defense [Madry et al., 2018]. Specifically, the standard formulation of adversarial training can be expressed as the following min-max optimization over training samples {xi, yi}_{i=1}^{n}:

      min_θ  Σ_{i=1}^{n}  max_{‖δi‖p ≤ ε}  loss(fθ(xi + δi), yi)                    (4)

The worst-case loss corresponding to the inner maximization step is often evaluated by gradient-based attacks such as the PGD attack [Madry et al., 2018]. Variants of adversarial training methods such as TRADES [Zhang et al., 2019] and customized adversarial training (CAT) [Cheng et al., 2020a] have been proposed for improved robustness. [Cheng et al., 2021] proposes attack-independent robust training based on self-progression. On 18 different ImageNet pre-trained models, [Su et al., 2018] unveils an undesirable trade-off between standard accuracy and adversarial robustness. This trade-off can be improved with the use of unlabeled data [Carmon et al., 2019; Stanforth et al., 2019].
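A compact sketch of the min-max training loop in Eq. (4), reusing the pgd_attack sketch shown earlier as the approximate inner maximizer; `model`, `loader`, and `optimizer` are assumed PyTorch objects, and the hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=0.03):
    """One epoch of min-max adversarial training: the inner maximization over delta_i
    is approximated with PGD, the outer minimization is a standard gradient update."""
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, eps=eps)   # approximate inner max (worst-case examples)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)    # worst-case loss on perturbed samples
        loss.backward()                            # outer min over the model parameters theta
        optimizer.step()
```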
3.2   Certified Defenses

Certified defenses provide performance guarantees on the hardened models. Adversarial attacks are ineffective if their threat models fall within the provably robust conditions.

For training-phase attacks, [Steinhardt et al., 2017] proposes certified data sanitization against poisoning attacks. [Weber et al., 2020] proposes randomized data training for certified defense against backdoor attacks. For deployment-phase attacks, randomized smoothing is an effective and model-agnostic approach that adds random noise to a data input to "smooth" the model and performs majority voting on the model predictions. The certified radius (region) in ℓp-norm perturbation ensuring consistent class prediction can be computed by an information-theoretic approach [Li et al., 2019], differential privacy [Lecuyer et al., 2019], the Neyman-Pearson lemma [Cohen et al., 2019], or higher-order certification [Mohapatra et al., 2020a].
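A minimal sketch of the randomized smoothing prediction step (the base classifier `classify`, the noise level, and the sample count are assumptions): the smoothed model returns the majority vote under Gaussian input noise, while the certified radius would additionally require a statistical lower bound on the top vote, e.g. via the Neyman-Pearson argument of [Cohen et al., 2019].

```python
import numpy as np

def smoothed_predict(classify, x, sigma=0.25, num_samples=1000, num_classes=10, rng=None):
    """Majority vote of a base classifier under Gaussian input noise, the core
    operation behind randomized smoothing."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(num_samples):
        noisy = x + sigma * rng.standard_normal(x.shape)  # isotropic Gaussian smoothing noise
        counts[classify(noisy)] += 1                      # top-1 prediction of the base model
    return int(np.argmax(counts)), counts
```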
3.3   Verification

Verification is often used in certifying local robustness against evasion attacks. Given a neural network fθ and a data sample x, verification (in its simplest form) aims to maximally certify an ℓp-norm bounded radius r on the perturbation δ to ensure the model prediction on the perturbed sample x + δ is consistent as long as δ is within the certified region. That is, for any δ such that ‖δ‖p ≤ r, ŷθ(x + δ) = ŷθ(x). The certified radius is a robustness certificate relating to the distance to the closest decision boundary, which is computationally challenging (NP-complete) to obtain for neural networks [Katz et al., 2017]. However, its estimate (hence not a certificate) can be efficiently computed and used as a model-agnostic robustness metric, such as the CLEVER score [Weng et al., 2018b]. To address the non-linearity induced by layer propagation in neural networks, solving for a certified radius is often cast as a relaxed optimization problem. The methods include convex polytope [Wong and Kolter, 2018], semidefinite programming [Raghunathan et al., 2018], dual optimization [Dvijotham et al., 2018], layer-wise linear bounds [Weng et al., 2018a], and interval bound propagation [Gowal et al., 2019]. The verification tools have also been expanded to support general network architectures [Zhang et al., 2018; Boopathy et al., 2019; Xu et al., 2020a] and semantic adversarial examples [Mohapatra et al., 2020b]. The intermediate certified results can be used to train a more certifiable model [Wong et al., 2018; Boopathy et al., 2021]. However, scalability to large-sized neural networks remains a major challenge in verification.
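To illustrate one of the relaxation-based methods, the sketch below propagates elementwise interval bounds through a single affine layer followed by ReLU, the basic operation of interval bound propagation; stacking it layer by layer from the input box x ± r yields output bounds that can certify a prediction whenever the lower bound of the true-class logit exceeds the upper bounds of all other logits. The weight/bias arguments are assumed NumPy arrays.

```python
import numpy as np

def interval_bound_linear(W, b, lower, upper):
    """Propagate elementwise input bounds [lower, upper] through y = W x + b."""
    center = (upper + lower) / 2.0
    radius = (upper - lower) / 2.0
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius          # worst-case deviation per output unit
    return new_center - new_radius, new_center + new_radius

def interval_bound_relu(lower, upper):
    """ReLU is monotone, so the bounds pass through elementwise."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)
```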
4   Remarks and Discussion

Here we make several concluding remarks and points of discussion.

Novel Applications. The insights from studying adversarial robustness have led to several new use cases. Adversarial perturbation and data poisoning are used in generating contrastive explanations [Dhurandhar et al., 2018], personal privacy protection [Shan et al., 2020], data/model watermarking and fingerprinting [Sablayrolles et al., 2020; Aramoon et al., 2021; Wang et al., 2021c], and data-limited transfer learning [Tsai et al., 2020; Yang et al., 2021]. Adversarial examples with proper design are also efficient data augmentation tools to simultaneously improve model generalization and adversarial robustness [Hsu et al., 2022; Lei et al., 2019]. Other noteworthy applications include image synthesis [Santurkar et al., 2019], generating contrastive explanations [Dhurandhar et al., 2018], and robust text CAPTCHAs [Shao et al., 2021].

Adversarial Robustness Beyond Classification and Input Perturbation. The formulations and principles in attacks and defenses for classification can be analogously applied to other machine learning tasks. Examples include sequence-to-sequence translation [Cheng et al., 2020c] and image captioning [Chen et al., 2018a]. Beyond input perturbation, the robustness to model parameter perturbation [Tsai et al., 2021] also relates to model quantization [Weng et al., 2020] and energy-efficient inference [Stutz et al., 2020].

Instilling Adversarial Robustness into Foundation Models. As foundation models adopt task-independent pre-training for general representation learning followed by task-specific fine-tuning for fast adaptation, it is of utmost importance to understand (i) how to incorporate adversarial robustness into foundation model pre-training and (ii) how to maximize adversarial robustness transfer from pre-training to fine-tuning. [Fan et al., 2021; Wang et al., 2021b] show promising results in adversarial robustness preservation and transfer in meta learning and contrastive learning. The rapid growth and intensifying demand for foundation models create a unique opportunity to advocate adversarial robustness as a necessary native property in next-generation trustworthy AI tools.

References

[Aramoon et al., 2021] Omid Aramoon, Pin-Yu Chen, and Gang Qu. Don't forget to sign the gradients! MLSys, 3, 2021.

[Athalye and Sutskever, 2018] Anish Athalye and Ilya Sutskever. Synthesizing robust adversarial examples. ICML, 2018.

[Athalye et al., 2018] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. ICML, 2018.

[Bagdasaryan et al., 2020] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. AISTATS, 2020.

[Bhagoji et al., 2019] Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. Analyzing federated learning through an adversarial lens. ICML, pages 634–643, 2019.

[Biggio and Roli, 2018] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.

[Biggio et al., 2013] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. ECML PKDD, 2013.
[Boopathy et al., 2019] Akhilan Boopathy, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. CNN-Cert: An efficient framework for certifying robustness of convolutional neural networks. AAAI, 2019.

[Boopathy et al., 2021] Akhilan Boopathy, Tsui-Wei Weng, Sijia Liu, Pin-Yu Chen, Gaoyuan Zhang, and Luca Daniel. Fast training of provably robust neural networks by SingleProp. AAAI, 2021.

[Brendel et al., 2018] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. ICLR, 2018.

[Brown et al., 2017] Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.

[Brown et al., 2018] Tom B Brown, Nicholas Carlini, Chiyuan Zhang, Catherine Olsson, Paul Christiano, and Ian Goodfellow. Unrestricted adversarial examples. arXiv preprint arXiv:1809.08352, 2018.

[Carlini and Wagner, 2017a] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017.

[Carlini and Wagner, 2017b] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. IEEE S&P, pages 39–57, 2017.

[Carlini et al., 2019a] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.

[Carlini et al., 2019b] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. USENIX Security, pages 267–284, 2019.

[Carmon et al., 2019] Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John C Duchi. Unlabeled data improves adversarial robustness. NeurIPS, 2019.

[Chen et al., 2017] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. ACM Workshop on Artificial Intelligence and Security, pages 15–26, 2017.

[Chen et al., 2018a] Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh. Attacking visual language grounding with adversarial examples: A case study on neural image captioning. ACL, 1:2587–2597, 2018.

[Chen et al., 2018b] Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: Elastic-net attacks to deep neural networks via adversarial examples. AAAI, pages 10–17, 2018.

[Chen et al., 2020] Jianbo Chen, Michael I Jordan, and Martin J Wainwright. HopSkipJumpAttack: A query-efficient decision-based attack. IEEE S&P, 2020.

[Cheng et al., 2019] Minhao Cheng, Thong Le, Pin-Yu Chen, Jinfeng Yi, Huan Zhang, and Cho-Jui Hsieh. Query-efficient hard-label black-box attack: An optimization-based approach. ICLR, 2019.

[Cheng et al., 2020a] Minhao Cheng, Qi Lei, Pin-Yu Chen, Inderjit Dhillon, and Cho-Jui Hsieh. CAT: Customized adversarial training for improved robustness. arXiv preprint arXiv:2002.06789, 2020.

[Cheng et al., 2020b] Minhao Cheng, Simranjit Singh, Patrick H. Chen, Pin-Yu Chen, Sijia Liu, and Cho-Jui Hsieh. Sign-OPT: A query-efficient hard-label adversarial attack. ICLR, 2020.

[Cheng et al., 2020c] Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. Seq2Sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. AAAI, 2020.

[Cheng et al., 2021] Minhao Cheng, Pin-Yu Chen, Sijia Liu, Shiyu Chang, Cho-Jui Hsieh, and Payel Das. Self-progressing robust training. AAAI, 2021.

[Cohen et al., 2019] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. ICML, 2019.

[Croce and Hein, 2020] Francesco Croce and Matthias Hein. Minimally distorted adversarial examples with a fast adaptive boundary attack. ICML, pages 2196–2205, 2020.

[Dhurandhar et al., 2018] Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. NeurIPS, 2018.

[Dvijotham et al., 2018] Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy A Mann, and Pushmeet Kohli. A dual approach to scalable verification of deep networks. UAI, 1(2):3, 2018.

[Engstrom et al., 2019] Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. Exploring the landscape of spatial robustness. ICML, pages 1802–1811, 2019.

[Eykholt et al., 2018] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. CVPR, pages 1625–1634, 2018.

[Fan et al., 2021] Lijie Fan, Sijia Liu, Pin-Yu Chen, Gaoyuan Zhang, and Chuang Gan. When does contrastive learning preserve adversarial robustness from pretraining to finetuning? NeurIPS, 34, 2021.

[Geiping et al., 2021] Jonas Geiping, Liam Fowl, W Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein. Witches' brew: Industrial scale data poisoning via gradient matching. ICLR, 2021.

[Goldblum et al., 2020] Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, and Tom Goldstein. Data security for machine learning: Data poisoning, backdoor attacks, and defenses. arXiv preprint arXiv:2012.10544, 2020.

[Goodfellow et al., 2015] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ICLR, 2015.

[Gowal et al., 2019] Sven Gowal, Krishnamurthy Dj Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. Scalable verified training for provably robust image classification. ICCV, pages 4842–4851, 2019.

[Gu et al., 2019] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg. BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.

[Hosseini and Poovendran, 2018] Hossein Hosseini and Radha Poovendran. Semantic adversarial examples. CVPR Workshops, pages 1614–1619, 2018.

[Hsu et al., 2022] Chia-Yi Hsu, Pin-Yu Chen, Songtao Lu, Sijia Liu, and Chia-Mu Yu. Adversarial examples can be effective data augmentation for unsupervised machine learning. AAAI, 2022.

[Ilyas et al., 2018] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. ICML, 2018.

[Ilyas et al., 2019] Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. ICLR, 2019.

[Jagielski et al., 2018] Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. IEEE S&P, pages 19–35, 2018.

[Katz et al., 2017] Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. International Conference on Computer Aided Verification, pages 97–117, 2017.

[Kumar et al., 2020] Ram Shankar Siva Kumar, Magnus Nyström, John Lambert, Andrew Marshall, Mario Goertzel, Andi Comissoneru, Matt Swann, and Sharon Xia. Adversarial machine learning-industry perspectives. IEEE S&P Workshops, pages 69–75, 2020.

[LeCun et al., 2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.

[Lecuyer et al., 2019] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. IEEE S&P, pages 656–672, 2019.

[Lei et al., 2019] Qi Lei, Lingfei Wu, Pin-Yu Chen, Alexandros G Dimakis, Inderjit S Dhillon, and Michael Witbrock. Discrete adversarial attacks and submodular optimization with applications to text classification. SysML, 2019.

[Li et al., 2019] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. NeurIPS, 2019.

[Liu et al., 2017] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. ICLR, 2017.

[Liu et al., 2018] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. NeurIPS, pages 3731–3741, 2018.

[Liu et al., 2019] Sijia Liu, Pin-Yu Chen, Xiangyi Chen, and Mingyi Hong. signSGD via zeroth-order oracle. ICLR, 2019.

[Liu et al., 2020a] Sijia Liu, Pin-Yu Chen, Bhavya Kailkhura, Gaoyuan Zhang, Alfred Hero, and Pramod K Varshney. A primer on zeroth-order optimization in signal processing and machine learning. IEEE Signal Processing Magazine, 2020.

[Liu et al., 2020b] Sijia Liu, Songtao Lu, Xiangyi Chen, Yao Feng, Kaidi Xu, Abdullah Al-Dujaili, Mingyi Hong, and Una-May O'Reilly. Min-max optimization without gradients: Convergence and applications to black-box evasion and poisoning attacks. ICML, pages 6282–6293, 2020.

[Madry et al., 2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.

[Mohapatra et al., 2020a] Jeet Mohapatra, Ching-Yun Ko, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Higher-order certification for randomized smoothing. NeurIPS, 2020.

[Mohapatra et al., 2020b] Jeet Mohapatra, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Towards verifying robustness of neural networks against a family of semantic perturbations. CVPR, pages 244–252, 2020.

[Moosavi-Dezfooli et al., 2016] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. CVPR, pages 2574–2582, 2016.

[Moosavi-Dezfooli et al., 2017] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. CVPR, pages 86–94, 2017.

[Nguyen and Tran, 2020] Anh Nguyen and Anh Tran. Input-aware dynamic backdoor attack. NeurIPS, 2020.

[Nitin Bhagoji et al., 2018] Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Practical black-box attacks on deep neural networks using efficient query mechanisms. ECCV, pages 154–169, 2018.

[Papernot et al., 2016] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.

[Papernot et al., 2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. ACM Asia Conference on Computer and Communications Security, pages 506–519, 2017.

[Raghunathan et al., 2018] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. ICLR, 2018.

[Sablayrolles et al., 2020] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Radioactive data: Tracing through training. ICML, 2020.

[Santurkar et al., 2019] Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Image synthesis with a single (robust) classifier. arXiv preprint arXiv:1906.09453, 2019.

[Shafahi et al., 2018] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! Targeted clean-label poisoning attacks on neural networks. NeurIPS, pages 6103–6113, 2018.

[Shan et al., 2020] Shawn Shan, Emily Wenger, Jiayun Zhang, Huiying Li, Haitao Zheng, and Ben Y Zhao. Fawkes: Protecting privacy against unauthorized deep learning models. USENIX Security, 2020.

[Shao et al., 2021] Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. Robust text CAPTCHAs using adversarial examples. arXiv preprint arXiv:2101.02483, 2021.

[Sharif et al., 2016] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. ACM CCS, pages 1528–1540, 2016.

[Stanforth et al., 2019] Robert Stanforth, Alhussein Fawzi, Pushmeet Kohli, et al. Are labels required for improving adversarial robustness? NeurIPS, 2019.

[Steinhardt et al., 2017] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. Certified defenses for data poisoning attacks. NeurIPS, 2017.

[Stutz et al., 2019] David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization. CVPR, pages 6976–6987, 2019.

[Stutz et al., 2020] David Stutz, Nandhini Chandramoorthy, Matthias Hein, and Bernt Schiele. Bit error robustness for energy-efficient DNN accelerators. arXiv preprint arXiv:2006.13977, 2020.

[Su et al., 2018] Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. ECCV, 2018.

[Su et al., 2019] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 23(5):828–841, 2019.

[Szegedy et al., 2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ICLR, 2014.

[Tramèr et al., 2018] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. ICLR, 2018.

[Tramer et al., 2020] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. NeurIPS, 2020.

[Tran et al., 2018] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. NeurIPS, pages 8000–8010, 2018.

[Tsai et al., 2020] Yun-Yun Tsai, Pin-Yu Chen, and Tsung-Yi Ho. Transfer learning without knowing: Reprogramming black-box machine learning models with scarce data and limited resources. ICML, pages 9614–9624, 2020.

[Tsai et al., 2021] Yu-Lin Tsai, Chia-Yi Hsu, Chia-Mu Yu, and Pin-Yu Chen. Formalizing generalization and adversarial robustness of neural networks to weight perturbations. NeurIPS, 34, 2021.

[Tu et al., 2019] Chun-Chen Tu, Paishun Ting, Pin-Yu Chen, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, and Shin-Ming Cheng. AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. AAAI, 33:742–749, 2019.

[Wang et al., 2019] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. IEEE S&P, pages 707–723, 2019.

[Wang et al., 2020] Ren Wang, Gaoyuan Zhang, Sijia Liu, Pin-Yu Chen, Jinjun Xiong, and Meng Wang. Practical detection of trojan neural networks: Data-limited and data-free cases. ECCV, pages 222–238, 2020.

[Wang et al., 2021a] Jingkang Wang, Tianyun Zhang, Sijia Liu, Pin-Yu Chen, Jiacen Xu, Makan Fardad, and Bo Li. Adversarial attack generation empowered by min-max optimization. NeurIPS, 34, 2021.

[Wang et al., 2021b] Ren Wang, Kaidi Xu, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Chuang Gan, and Meng Wang. On fast adversarial robustness adaptation in model-agnostic meta-learning. ICLR, 2021.

[Wang et al., 2021c] Siyue Wang, Xiao Wang, Pin Yu Chen, Pu Zhao, and Xue Lin. Characteristic examples: High-robustness, low-transferability fingerprinting of neural networks. IJCAI, 2021.

[Weber et al., 2020] Maurice Weber, Xiaojun Xu, Bojan Karlas, Ce Zhang, and Bo Li. RAB: Provable robustness against backdoor attacks. arXiv preprint arXiv:2003.08904, 2020.

[Weng et al., 2018a] Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S Dhillon, and Luca Daniel. Towards fast computation of certified robustness for ReLU networks. ICML, 2018.

[Weng et al., 2018b] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. ICLR, 2018.

[Weng et al., 2020] Tsui-Wei Weng, Pu Zhao, Sijia Liu, Pin-Yu Chen, Xue Lin, and Luca Daniel. Towards certificated model robustness against weight perturbations. AAAI, pages 6356–6363, 2020.

[Wong and Kolter, 2018] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML, 2018.

[Wong et al., 2018] Eric Wong, Frank R Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. NeurIPS, 2018.

[Xie et al., 2020] Chulin Xie, Keli Huang, Pin-Yu Chen, and Bo Li. DBA: Distributed backdoor attacks against federated learning. ICLR, 2020.

[Xu et al., 2019a] Kaidi Xu, Hongge Chen, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Mingyi Hong, and Xue Lin. Topology attack and defense for graph neural networks: An optimization perspective. IJCAI, 2019.

[Xu et al., 2019b] Kaidi Xu, Sijia Liu, Pu Zhao, Pin-Yu Chen, Huan Zhang, Quanfu Fan, Deniz Erdogmus, Yanzhi Wang, and Xue Lin. Structured adversarial attack: Towards general implementation and better interpretability. ICLR, 2019.

[Xu et al., 2020a] Kaidi Xu, Zhouxing Shi, Huan Zhang, Minlie Huang, Kai-Wei Chang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh. Automatic perturbation analysis on general computational graphs. NeurIPS, 2020.

[Xu et al., 2020b] Kaidi Xu, Gaoyuan Zhang, Sijia Liu, Quanfu Fan, Mengshu Sun, Hongge Chen, Pin-Yu Chen, Yanzhi Wang, and Xue Lin. Adversarial t-shirt! Evading person detectors in a physical world. ECCV, 2020.

[Yang et al., 2019] Zhuolin Yang, Bo Li, Pin-Yu Chen, and Dawn Song. Characterizing audio adversarial examples using temporal dependency. ICLR, 2019.

[Yang et al., 2021] Chao-Han Huck Yang, Yun-Yun Tsai, and Pin-Yu Chen. Voice2Series: Reprogramming acoustic models for time series classification. ICML, 2021.

[Zawad et al., 2021] Syed Zawad, Ahsan Ali, Pin-Yu Chen, Ali Anwar, Yi Zhou, Nathalie Baracaldo, Yuan Tian, and Feng Yan. Curse or redemption? How data heterogeneity affects the robustness of federated learning. AAAI, 2021.

[Zhang et al., 2017] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.

[Zhang et al., 2018] Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. NeurIPS, pages 4944–4953, 2018.

[Zhang et al., 2019] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. ICML, pages 7472–7482, 2019.

[Zhao et al., 2019] Pu Zhao, Sijia Liu, Pin-Yu Chen, Nghia Hoang, Kaidi Xu, Bhavya Kailkhura, and Xue Lin. On the design of black-box adversarial examples by leveraging gradient-free optimization and operator splitting method. ICCV, pages 121–130, 2019.

[Zhao et al., 2020a] Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, and Xue Lin. Bridging mode connectivity in loss landscapes and adversarial robustness. ICLR, 2020.

[Zhao et al., 2020b] Pu Zhao, Pin-Yu Chen, Siyue Wang, and Xue Lin. Towards query-efficient black-box adversary with zeroth-order natural gradient descent. AAAI, 2020.

[Zhu et al., 2019] Chen Zhu, W Ronny Huang, Hengduo Li, Gavin Taylor, Christoph Studer, and Tom Goldstein. Transferable clean-label poisoning attacks on deep neural nets. ICML, pages 7614–7623, 2019.