Image Classifiers Leak Sensitive Attributes About Their Classes


Lukas Struppek*1, Dominik Hintersdorf1, Felix Friedrich1, Manuel Brack1, Patrick Schramowski1,3,4,5, and Kristian Kersting1,2,3,4

1 Department of Computer Science, Technical University of Darmstadt, Germany
2 Centre for Cognitive Science, Technical University of Darmstadt
3 German Research Center for Artificial Intelligence (DFKI)
4 Hessian Center for AI (hessian.AI), Germany
5 LAION

* Contact: lukas.struppek@cs.tu-darmstadt.de

arXiv:2303.09289v1 [cs.LG] 16 Mar 2023
Abstract

Neural network-based image classifiers are powerful tools for computer vision tasks, but they inadvertently reveal sensitive attribute information about their classes, raising concerns about their privacy. To investigate this privacy leakage, we introduce the first Class Attribute Inference Attack (CAIA), which leverages recent advances in text-to-image synthesis to infer sensitive attributes of individual classes in a black-box setting, while remaining competitive with related white-box attacks. Our extensive experiments in the face recognition domain show that CAIA can accurately infer undisclosed sensitive attributes, such as an individual's hair color, gender, and racial appearance, which are not part of the training labels. Interestingly, we demonstrate that adversarially robust models are even more vulnerable to such privacy leakage than standard models, indicating that a trade-off between robustness and privacy exists.

1 Introduction

Classifying images with neural networks is widely adopted in various domains [6, 16, 2, 1]. Face recognition systems, for example, take facial images as input and attempt to predict the depicted person's identity. In the pursuit of enhancing a model's predictive performance, privacy concerns of the acquired knowledge are often disregarded and moved into the background. However, correctly assessing and mitigating the risk of compromising private information is crucial in privacy-sensitive domains, as neglect can lead to the disclosure of private information [37, 38, 14]. For example, smart home devices contain face recognition models for access control. Users expect these models to recognize them reliably, but at the same time to not reveal information about their appearance to third parties. However, this assumption does not necessarily hold true, and malicious parties could extract sensitive features about users by simply interacting with the model in a black-box fashion.

We investigate the privacy leakage of image classifiers and demonstrate that models indeed leak sensitive class information even without any specific information about the classes, training samples, or attribute distributions available. We focus on face recognition models and show that these models reveal sensitive details within their outputs, such as gender, hair color, and racial appearance, for the different identities they have been trained to recognize.

Within our analysis, we introduce a Class Attribute Inference Attack (CAIA), which enables an adversary to infer sensitive attributes of a class with high accuracy. Phrased differently, CAIA allows creating a profile of the individual classes by interacting with the trained classifier and extracting class attributes that have not been explicitly part of the training objective. We utilize recent advances in text-to-image synthesis to craft images that only differ in one attribute by editing real images with textual guidance. We then exploit that image classifiers, as we show, assign higher logits to inputs that share the sensitive attribute with the training samples of a class, which allows us to infer class information by only observing the input-output relation of the model. Compared to related inference attacks [8, 7, 47, 38], CAIA is model-agnostic and requires only black-box access and basic domain knowledge. Once the attack images are crafted, attribute inference requires only a single model forward pass of the generated samples.

In our extensive evaluation, we demonstrate the success of CAIA and show that robust models trained with adversarial training [10, 28] are even more susceptible to these attacks.
They leak more information about their individual classes than non-robust models, even if their prediction accuracy is substantially lower. This indicates a trade-off for model designers: making a model robust to adversarial examples comes at the expense of higher privacy leakage.

Disclaimer: This paper investigates the extraction of sensitive identity information, including gender and racial appearance of people. The groups used (Asian, Black, Indian, White) are in line with current research and follow the taxonomy of Karkkainen and Joo [21], which itself is adopted from the US Census Bureau [3]. Importantly, we emphasize that this work's goal is to investigate the leakage of sensitive attributes in image classifiers. We do not intend to discriminate against identity groups or cultures in any way.

2 Background and Related Work

Attribute Inference Attacks. In recent years, various types of inference attacks have been proposed. Those include membership inference attacks [37, 45, 5, 13, 14] and property inference attacks [9, 33, 48, 43]. Most related to our work are attribute inference attacks (AIAs) [8], which aim to infer sensitive attribute values of an incomplete data record in the context of classification and regression models. More specifically, the adversary has access to a target model M_target, which has been trained on a dataset S_train, sampled from the distribution D. Each training sample is a triplet (x_s, x_n, y) and consists of some sensitive attributes x_s and non-sensitive attributes x_n together with a ground-truth label y. The adversary has access to a set of candidates (x_n, y) ⊆ S_train with the sensitive attribute values missing. AIAs try to infer the sensitive values x_s by exploiting M_target and its learned information about the distribution D.

Fredrikson et al. [8, 7] proposed maximum a posteriori AIAs that, assuming all attributes are independent, predict the sensitive attribute value that minimizes the adversary's expected misclassification rate. Yeom et al. [45] extended the approach and combined attribute inference with membership inference attacks. Mehnaz et al. [29] introduced an attack based on the softmax scores of the target model, assuming that the model's prediction is more likely to be correct and confident if the input sample contains the true sensitive attribute value. Common AIAs make strong assumptions about the adversary's knowledge that are generally hard to satisfy in realistic settings, e.g., the adversary knows the marginal prior of sample attributes [8, 7, 45] or the target model's confusion matrix on its training data [7, 29]. Jayaraman and Evans [18] questioned previous black-box AIAs and empirically showed that those attacks could not reveal more private information than a comparable adversary without access to the target model. They conclude that black-box AIAs perform similarly well as data imputation techniques to fill the missing attribute values. To improve AIAs over data imputation, they proposed a white-box attack that outperforms imputation in settings with limited data and skewed distributions.

Model Inversion Attacks. All presented AIAs are limited to tabular data and are not applicable to image classification since the variation of single image attributes, e.g., changing the hair color in a facial image, is not trivially possible. Moreover, the AIA setting itself is not transferable to the vision domain, since it is unclear how an adversary can have access to incomplete images with only one attribute missing. However, model inversion attacks (MIAs) fill this gap for image classification. We note that the notion of MIAs is not consistent in the literature, and the term is sometimes also used for AIAs. Generally, given a classification model, an adversary attempts to create synthetic input samples that either reconstruct samples from the model's training data [7, 47, 4, 19] or craft synthetic samples that reflect the characteristics of a specific class [42, 38]. While most MIAs require access to samples from the target training distribution to train a custom generative adversarial network (GAN), Struppek et al. [38] recently proposed Plug & Play (PPA) MIAs, which make the attacks agnostic to the target model, increase their flexibility, and enhance their inference speed by utilizing pre-trained GANs. We will use PPA as a baseline for comparison with our CAIA.

Novelty of CAIA. In contrast to previous work on AIAs, we move the scope of inference attacks from the sample level to the class level in the vision domain, with the goal of inferring sensitive information about the distinct classes learned by a model. To achieve this, we use the latest advancements in text-to-image synthesis to manipulate single attributes of input images, resulting in consistent images that differ only in the targeted attribute. Our approach is more efficient than MIAs as it requires only black-box access to the target model and no extensive knowledge of the training data distribution. Furthermore, the inference step is done in seconds, since CAIA does not require any further sample optimization after constructing the initial attack dataset. This makes CAIA more flexible and target-independent than previous AIAs and MIAs. Throughout this work, we focus on the privacy-sensitive domain of face recognition systems. The goal is to infer sensitive information about identities, e.g., their gender or racial appearance, without any specific information about the individual identity or the underlying training set distribution available.

Adversarial Robust Training. Besides privacy attacks, neural networks are known to be susceptible to adversarial examples [40, 10]. Formally, adversarial examples are crafted by adding an optimized perturbation δ with ‖δ‖ ≤ ε to a model input x to maximize the model's loss L(M(x + δ), y) for the true label y. One of the most reliable and commonly used defenses is adversarial training [10, 28], which updates a model's weights θ on adversarial examples. During each training step, a sample-wise worst-case perturbation is computed in an ε-environment around the clean samples to maximize the model's confusion. These perturbations are then added to the clean samples to train the model and make it robust against such manipulations. Formally, this comes down to a min-max optimization:

$$\min_\theta \sum_{(x, y) \in S_\text{train}} \; \max_{\|\delta\| \leq \epsilon} L\big(M(x + \delta), y\big). \tag{1}$$

Since the inner maximization problem cannot be solved numerically in tractable time, local search algorithms are applied to craft adversarial examples, e.g., FGSM [10, 44] or PGD [28]. By training on adversarial examples, the model becomes more robust against adversarial perturbations. In our experiments, we also investigate the influence of model robustness on its privacy leakage.
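Adversarial training, as formalized in Eq. (1), alternates between approximating the inner maximization with a local search such as PGD and updating the weights on the perturbed samples. The following PyTorch sketch illustrates this min-max scheme. It is our own minimal illustration rather than the training code used in the paper; the defaults mirror the configuration later reported in Sec. 4.1 (7 PGD steps, ε = 4/255, step size α = 1/255, pixel values in [0, 1]).

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps, alpha, steps):
    """Approximate the inner maximization of Eq. (1) with projected gradient descent (L-inf ball)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient ascent step on the loss
            delta.clamp_(-eps, eps)              # project back into the eps-ball
            delta.grad = None
    return delta.detach()

def adversarial_training_step(model, optimizer, x, y, eps=4/255, alpha=1/255, steps=7):
    """One outer minimization step of Eq. (1): train on the worst-case perturbed samples."""
    model.eval()                                 # freeze batch-norm statistics while crafting
    delta = pgd_perturb(model, x, y, eps, alpha, steps)
    model.train()
    optimizer.zero_grad()                        # clears gradients accumulated during PGD
    # x + delta is assumed to stay in the valid [0, 1] pixel range; clamp if needed.
    loss = F.cross_entropy(model(x + delta), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```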

[Figure 1: a real image is edited with the prompt "a photo of a person, <hair color>" into attribute variations (Black, Blond, Brown, Gray), which are then passed through an attribute filter to form the attack samples.]
Figure 1: Overview of our attack dataset crafting process for the sensitive attribute hair color, which has four possible values. Real
images are used to generate image variations by modifying characteristics associated with the sensitive attribute. The resulting
candidate images are then filtered to ensure that each sample accurately reflects the intended attribute values. The final output of
this process is the set of attack samples.

3 Class Attribute Inference Attacks

We now introduce the novel Class Attribute Inference Attack (CAIA), which consists of two stages. First, the adversary generates a set of attack samples by creating different versions of images through a generative approach that alters the sensitive attribute values. In the second stage, these attack samples are used to infer the sensitive attribute values for the identities in a face recognition model. Before delving into the details, we outline our general threat model.

Adversary's Goal. Let M : X → R^|Y| denote the trained target image classifier, which takes input images x ∈ X and computes prediction scores for each class label y ∈ Y. The model's training data S_train consists of labeled data samples (x, y) ∼ D. The underlying attack assumption is that samples of a certain class share a constant sensitive attribute value z ∈ Z, which is not part of the class label but is implicitly encoded in the image features. For example, a face recognition model is trained to predict the identity y of each facial image x. A sensitive attribute z in this context might be the gender appearance or hair color of a specific identity. The attack goal is to infer the value of this sensitive attribute for each individual class.
Adversary's Capabilities. The adversary has only black-box access to the target model, i.e., the adversary can query the target model and observe its output logits. Furthermore, the adversary knows the domain of the target model's training data, e.g., facial images, but not the exact training data distribution. For the sensitive attributes, the adversary defines an individual set of possible values to infer. We emphasize that CAIA is model- and task-agnostic and does not require white-box access to the target model. Information about the prior distribution of the sensitive attributes is neither available nor required.

3.1 Crafting Attack Samples

While it is easy to vary the attribute values of a tabular inference sample, this is a non-trivial task for images, since attributes are encoded in pixels, and multiple pixels define not only a single attribute but a set of possibly entangled attributes. For example, human faces contain information such as age and skin color, among many more. Furthermore, changing the value of any of these attributes requires semantically consistent changes in multiple pixels.

To enable meaningful image manipulations, we utilize recent advances in text-to-image synthesis. Systems like Stable Diffusion [35] are able to generate high-quality images following a user-provided text description p. Mokady et al. [30] recently proposed Null-text Inversion to encode real images into the domain of diffusion models and enable text-based editing while keeping the overall image composition and content fixed. In combination with Prompt-to-Prompt [12], it allows a user to instruct image edits x_edit = E(x, p) on real images x conditioned on a description p. We use Null-text Inversion to generate variations of existing images by changing only the sensitive attribute values, while aiming to leave other image aspects unchanged.

Fig. 1 illustrates the crafting process for the attack samples. Formally, the adversary has access to a data distribution D̂ from which to sample images x.
[Figure 2: inference example. For one image tuple, the target model's output logits for the Black, Blond, Brown, and Gray variations (e.g., 1.07, 0.95, 1.24, 1.01) yield a relative advantage of 0.17 for Brown; summed over all tuples, this leads to the prediction "Identity 0 has Brown hair".]
Figure 2: Overview of the class attribute inference step. Each image tuple from the attack set is fed sequentially into the target
model to compute the logits for the target class. The relative advantage is then computed by subtracting the second highest
logit value from the maximum value, and this difference is added to a running sum for each sensitive attribute value. The final
prediction for the sensitive attribute is the value with the highest relative advantage computed across all attack samples.

Note that the attack does not require the attack distribution D̂ to be the same as the training data distribution D but only that both data distributions are from the same domain, e.g., facial images. As we will show in our experimental evaluation, even if the style, size, and quality of images between both distributions vary significantly, the attack is still highly successful. The adversary defines the target attribute with a set of k possible attribute values Z = {z1, . . . , zk} and corresponding edit prompts p_z that describe the general domain and explicitly state an attribute value z ∈ Z. For example, p_z = "A photo of a person, ⟨gender⟩", where ⟨gender⟩ is replaced by z ∈ {female appearance, male appearance}. The candidate dataset S_candidate is then constructed as S_candidate = {E(x, p_z) | z ∈ Z, x ∼ D̂}, consisting of image tuples x = (x1, . . . , xk), each containing k images with different sensitive attribute values.

However, the attribute manipulation might not always succeed in changing an image attribute to the desired value. This can be due to interfering concepts already present in the image that are strongly entangled with the target attribute. For example, changing the hair color of a person with a dark skin tone to blond often fails because such attribute combinations are rather rare in the actual training data, which is also reflected in the underlying diffusion model. We therefore employ a filtering approach, i.e., we filter out all sample tuples x that do not correctly depict the various attribute values. For this, we use a trained attribute classifier F_τ : X → Z to create a subset S_attack = {x ∈ S_candidate | F_τ(x_z) = z, ∀z ∈ Z} of sample tuples x = (x1, . . . , xk). We also add a threshold τ on the softmax scores of the attribute classifier and classify only predictions with a softmax score ≥ τ as correct. This removes images for which the attribute classifier has only low confidence in its prediction.
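To make the two-step crafting procedure concrete, the sketch below builds the candidate tuples and applies the confidence filter. The `edit_image(x, prompt)` helper is hypothetical and stands in for the Null-text Inversion / Prompt-to-Prompt editing pipeline, and the filter model is assumed to output one class per sensitive attribute value in the same order as `attribute_values`. This is an illustration of the described procedure, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def craft_attack_samples(images, attribute_values, prompt_template,
                         edit_image, filter_model, tau=0.6):
    """Build S_attack: per real image, one edited variant per attribute value,
    kept only if the filter classifier confidently confirms every variant."""
    attack_samples = []
    for x in images:
        # Candidate tuple (S_candidate): one edit per sensitive attribute value z.
        candidate = [edit_image(x, prompt_template.format(z)) for z in attribute_values]

        # Filter: every variant must be classified as its intended value
        # with a softmax confidence of at least tau.
        batch = torch.stack(candidate)
        probs = F.softmax(filter_model(batch), dim=1)
        conf, pred = probs.max(dim=1)
        intended = torch.arange(len(attribute_values))  # assumes class order matches attribute_values
        if torch.all(pred == intended) and torch.all(conf >= tau):
            attack_samples.append(candidate)
    return attack_samples
```

Following Fig. 1 and Sec. 4.1, the prompt template would be of the form "a photo of a person, {}" and the filter threshold τ = 0.6.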
3.2 Revealing Sensitive Attributes

With the attack samples, we can now begin inferring the sensitive class attributes learned by the target classifier. Fig. 2 illustrates the basic concept of the inference. While our filtering approach ensures that the variations of an image depict the different values of the sensitive attribute, other confounding behavior might still remain unintentionally. For example, changing a person's hair color to gray might also influence the depicted age. To mitigate such influences, we predict the sensitive attribute with a variety of different attack samples to, in turn, reduce the influence of these confounding factors over a larger number of samples.

Let M(x)_y : X → R denote the pre-softmax logit computed by the target model M on image x for class y. To infer the sensitive attribute value z of class y, we query the target model consecutively with multiple sample tuples x ∈ S_attack. For each x_z ∈ x, we compute its relative advantage A(x)_z by

$$A(x)_z = \max\Big(0, \; M(x_z)_y - \max_{\tilde{x} \in x,\, \tilde{x} \neq x_z} M(\tilde{x})_y\Big). \tag{2}$$

The relative advantage computes the difference between the highest and the second-highest logit values and assigns this difference to the attribute sample x_z ∈ x with the highest logit. For all other samples in x, the relative advantage is set to zero. Fig. 2 illustrates the relative advantage computation for a single x. The final attribute prediction ẑ is then made by taking the attribute with the highest relative advantage summed up over all attack samples: $\hat{z} = \arg\max_{z \in Z} \sum_{x \in S_\text{attack}} A(x)_z$. We emphasize that a single forward pass over all attack samples is sufficient to compute the relative advantage for all classes, which makes the inference step computationally very cheap, e.g., the gender inference for a ResNet-101 model and 500 identities took roughly 2 seconds in our experiments.
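A minimal sketch of the inference step defined by Eq. (2), assuming each attack tuple stores its image variants in the same fixed order of attribute values; function and variable names are ours.

```python
import torch

@torch.no_grad()
def infer_class_attribute(target_model, attack_samples, target_class):
    """Predict the sensitive attribute index for one class via relative advantages (Eq. 2)."""
    num_values = len(attack_samples[0])
    advantage_sum = torch.zeros(num_values)
    for sample_tuple in attack_samples:                      # one image per attribute value
        logits = target_model(torch.stack(sample_tuple))[:, target_class]
        top2 = torch.topk(logits, k=2).values
        advantage_sum[logits.argmax()] += top2[0] - top2[1]  # only the winning variant gets credit
    return int(advantage_sum.argmax())                       # index of the predicted attribute value
```

Since the target model returns logits for all classes in a single forward pass, the same batch of attack samples can score every identity at once, which is why inferring, e.g., the gender of 500 identities takes only a couple of seconds.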

4 Experimental Evaluation

Next, we experimentally investigate the information leakage of face recognition models with CAIA. We provide our source code with all hyperparameters to reproduce the experiments and facilitate future research. More experimental details are provided in Appx. A.

4.1 Experimental Setup

Datasets. We used the CelebA facial attributes dataset [27] to train our target face recognition systems. CelebA contains images of 10,177 individuals and 40 binary attribute annotations per image. We selected the attributes gender = {female, male}, eyeglasses = {no glasses, glasses}, and hair color = {black, blond, brown, gray} as sensitive attributes, and additionally applied a pretrained FairFace [21] classifier to add racial appearance = {Asian, Black, Indian, White} labels to each image. Since the provided attributes and labels are often inconsistent for samples of one identity, e.g., people do not always wear eyeglasses or might dye their hair, we created custom subsets for each sensitive attribute group by selecting an equal number of identities for each attribute value and removed samples with inconsistent labels. This is important for evaluation since otherwise the training samples of an identity depict various values for the sensitive attribute, and no clear ground-truth value can be defined.

[Figure 3: bar chart of attack accuracy (70%-100%) for ResNet-18, ResNet-101, ResNet-152, DenseNet-169, and ResNeSt-101 on FaceScrub gender and CelebA gender, eyeglasses, hair color, and racial appearance.]

Figure 3: Attack accuracy for different target model architectures and CelebA attribute datasets. Results are averaged over three models and three attack datasets. Except for ResNet-18, the attacks are comparably successful on the different models.

We trained various target models on these datasets to predict a person's identity. Since not every attribute is present with every identity, we selected the 100 identities for each attribute value of hair color, eyeglasses, and racial appearance, respectively, with the most training samples available. For gender, we selected 250 identities per attribute value. We also created datasets with 1,000 identities to see if more classes influence the attack success. Additional models were trained on the FaceScrub [32] facial image dataset, which contains images of 530 identities with an equal gender split. We split all datasets into 90% for training and 10% for measuring the prediction accuracy. Details on the different dataset statistics are provided in Appx. A.2.

To craft the attack datasets, we used the Flickr-Faces-HQ (FFHQ) [23] and CelebAHQ [22] datasets. We generated and filtered images with attribute manipulations to collect 300 attack image tuples for each attribute group. We set the filter threshold τ = 0.6 in all experiments. We visualize randomly selected training and attack samples in Appx. D.

Models. We trained ResNet-18, ResNet-101, ResNet-152 [11], ResNeSt-101 [46], and DenseNet-169 [15] target models. For each dataset-architecture combination, we trained three models with different seeds. We further trained ResNet-50 filter models on the CelebA attributes to classify gender, glasses, and hair color. To mitigate the overconfidence of neural networks, we trained all filter models with label smoothing [41, 31] to calibrate the models and make them more suitable for the confidence threshold approach. To filter images depicting different racial appearances, we relied on the pre-trained FairFace classifier. The attack datasets were generated with Stable Diffusion v1.5 [35].
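As an illustration of the calibration step for the filter models, label smoothing can be added directly to the cross-entropy loss so that the softmax scores remain meaningful for the τ-threshold filtering. The smoothing factor below is an assumed value; the paper does not report the exact hyperparameters.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 attribute filter, e.g., for the four hair-color classes.
filter_model = resnet50(num_classes=4)

# Label smoothing keeps the softmax scores calibrated enough
# for the confidence threshold tau used during filtering.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # 0.1 is an assumed value
optimizer = torch.optim.SGD(filter_model.parameters(), lr=0.1, momentum=0.9)

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(filter_model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```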
To investigate the effects of model robustness on information leakage, we trained adversarially robust models on FaceScrub with standard adversarial training [10]. We perturbed training samples with Projected Gradient Descent [28] with 7 steps, ε = 4/255, and step size α = 1/255.¹

¹ Corresponds to images with pixel values in the range [0, 1].

Metrics. We computed the precision, recall, and F1 score for each attribute value, together with the overall prediction accuracy for all attributes. All experimental results are averaged over three models and three disjoint subsets of attack samples, resulting in nine runs per configuration.

Baselines. We took random guessing as a naive baseline. Since the sensitive attributes are equally distributed among the different identities, random guessing corresponds to the underlying prior attribute distribution. Since existing AIAs are not applicable to our use case, we instead performed the state-of-the-art Plug & Play model inversion attack (PPA) [38] to synthesize characteristic samples for each class. We then used our filter models to predict the sensitive attribute value for each sample and took the majority vote for each targeted identity as the attribute prediction. We emphasize that, unlike CAIA, PPA requires white-box access to the target model and a pre-trained GAN, for which we used the official StyleGAN2 [24] FFHQ model. Due to the high computational effort, we limit the PPA comparison to individual ResNet-101 models for each setting.
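For the PPA baseline, the attribute prediction for an identity reduces to a majority vote of the filter model over the identity's reconstructed samples. A small sketch of this aggregation step, as our own illustration with assumed variable names:

```python
import torch

@torch.no_grad()
def ppa_attribute_prediction(filter_model, reconstructed_images):
    """Majority vote of the attribute filter over PPA's synthetic samples for one identity."""
    logits = filter_model(torch.stack(reconstructed_images))
    votes = logits.argmax(dim=1)                 # predicted attribute value per sample
    return int(torch.bincount(votes).argmax())   # most frequent attribute value wins
```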
4.2 Extracting Sensitive Class Information

In the main part of the paper, we focus on results on ResNet-101 models using FFHQ attack samples. We provide more detailed results for the other architectures, CelebAHQ as attack dataset, and experiments with 1,000 identities in Appx. B. We note that the attack success for an increased number of identities did not change substantially.
[Figure 4: per-attribute accuracy, F1, precision, and recall of CAIA on ResNet-101 CelebA models, with PPA and random guessing for comparison; panels: (a) Gender, (b) Eyeglasses, (c) Hair Color (Black, Blond, Brown, Gray), (d) Racial Appearance (Asian, Black, Indian, White).]

Figure 4: Evaluation results for CAIA performed on ResNet-101 CelebA models to infer four different target attributes. The black horizontal lines denote the standard deviation over nine runs. We further state random guessing (dashed line) and Plug and Play Attacks (PPA, green dots) for comparison. While CAIA outperforms random guessing by a large margin, it extracts information on racial appearance and if someone is wearing eyeglasses even more reliably than the white-box PPA attack.

The attack accuracy for different models and target attributes in Fig. 3 demonstrates that CAIA performed comparably well on different architectures and predicted the sensitive attribute values correctly in over 90% of the cases for the attributes gender and eyeglasses, and in about 80% of the cases for hair color and racial appearance. Only the attack results of ResNet-18 stood out and were a few percentage points lower than those of the other architectures, which we attribute to the small number of model parameters (only about a quarter of ResNet-101). Still, all attacks reliably inferred the sensitive attributes in most cases.

Next, we investigate the attribute leakage more closely. Therefore, we performed a more detailed analysis of the attribute leakage of ResNet-101 models, for which the results are depicted in Fig. 4. For all four attributes, CAIA beat the random guessing baseline by a large margin. Whereas gender and eyeglasses were predicted correctly in about 94% of the cases, racial appearance could be inferred correctly in 84% of the cases. The attack accuracy for hair color was also about 82% on average, but the attack success varied substantially between the different attribute values. Blond hair seems to be the hardest value to predict, which is not unexpected since hair colors have different shades, of which blond might be the broadest one. Corresponding confusion matrices are visualized in Appx. B.1.

In contrast, we also investigated gradient-based model inversion attacks, here PPA, to compare to state-of-the-art white-box methods that reconstruct characteristic class inputs. On the ResNet-101 models, PPA achieved perfect attack results for inferring an identity's gender. It also precisely revealed hair color and outperformed the black-box CAIA in both settings. However, for inferring whether an identity is wearing eyeglasses, PPA fell significantly behind CAIA. Regarding racial appearance, PPA's attack accuracy was comparable to CAIA but less consistent between different attribute values. We suspect the reason to be the uneven distribution of the racial appearance and eyeglasses attribute values in the underlying StyleGAN2's training data [20]. Since PPA starts from randomly sampled latent vectors, the biased attribute distributions also influence the attack success in terms of generating and inferring sensitive attributes. Nevertheless, CAIA shows competitive and impressive performance given that it accesses the target model only in a black-box fashion and has no access to internal gradient information.

4.3 Robustness Increases Privacy Leakage

Since adversarial examples are a common weakness of deep learning models, adversarial training is widely adopted to make models robust against this security threat. To establish a connection between privacy and security, we now extend our investigation to robustly trained models and show that robustness even increases privacy leakage. However, training robust models requires larger datasets due to increased sample complexity [36]. The limited number of CelebA samples, which provides approx. 30 samples per identity, makes it difficult to train stable and robust models. Therefore, this section focuses on models trained on FaceScrub, which provides an average of 70 samples per identity and facilitates stable adversarial training.
[Figure 5: identity prediction accuracy and gender attack accuracy (40%-100%) for standard and robust ResNet-101 models; panels: (a) Gender (FaceScrub Cropped), (b) Gender (FaceScrub Uncropped).]

Figure 5: Gender appearance inference results for CAIA performed on ResNet-101 models trained on cropped (a) and uncropped (b) FaceScrub samples. We compared standard models to robust models trained with adversarial training (hatched bars). The results demonstrate that robust models indeed increase the information leakage (blue bars), even if the underlying models' prediction accuracy (orange bars) is significantly below that of non-robust models.

We trained ResNet-101 models on both cropped images containing only faces and uncropped images showing a large amount of content unrelated to an identity. For a visual comparison of the datasets, we refer the reader to Appx. D.1.

The attack results for ResNet-101 models are shown in Fig. 5, while Appx. C presents results for other model architectures. Comparing the attack accuracy against non-robust models (93.08%) with that of robust models (96.62%) trained on cropped images suggests that robust models tend to leak more gender information about their learned identities than non-robust models, even if the models' clean prediction accuracy is roughly five percentage points lower. A similar pattern holds for models trained on uncropped images, with robust models still exhibiting higher information leakage. However, it is important to note that the standard model's prediction accuracy (78.75%) is significantly higher than that of the robust models (55.23%). This suggests that a model's prediction accuracy is not necessarily indicative of its privacy leakage. We hypothesize that robust models tend to exhibit higher privacy leakage due to their concentration on more robust features for their predictions. We formalize this hypothesis in the remainder of this section.

Our formal analysis builds upon the robust feature model for binary classifiers of Ilyas et al. [17], which we extend to the privacy leakage setting. Let M : X → Y be a model trained to predict a label y ∈ Y for each input x ∈ X. We divide a model's inputs x = (x̃, x̄) into predictive features x̃ and non-predictive features x̄. The i-th predictive feature x̃^i is positively correlated with a sample's true label: $\mathbb{E}_{(x,y)\sim D}\big[\hat{y} \cdot M(\tilde{x}^i)_y\big] \geq \rho$ for a sufficiently large ρ. Here, we deviate slightly from our previous notation of y and define ŷ ∈ {−1, 1} for each class, with ŷ = 1 indicating the ground-truth label. We further assume that all model outputs M_y are centered around zero. Predictive features are (ideally) learned by a model to predict the label. Conversely, a non-predictive feature x̄^j has no correlation with the true label and should be ignored during inference if $\mathbb{E}_{(x,y)\sim D}\big[\hat{y} \cdot M(\bar{x}^j)_y\big] < \rho$.

We further explore the nature of the predictive features by categorizing them into two groups: robust features x̃_robust and non-robust features x̃_non-robust. Robust predictive features remain predictive even under adversarial perturbations δ with ‖δ‖ ≤ ε, satisfying $\mathbb{E}_{(x,y)\sim D}\big[\hat{y} \cdot M(\tilde{x}^i_\text{robust} + \delta)_y\big] \geq \rho$, while non-robust predictive features lose their predictive power under adversarial perturbations. Neural networks rely not only on salient features, such as hair color and other facial characteristics, for image processing, but also on subtle, non-robust image features that can still be highly predictive.

Adversarial perturbations can significantly disrupt non-robust models' predictions by manipulating x̃_non-robust in an ε-environment that is imperceptible to humans. Adversarial training can help reduce the impact of adversarial examples by focusing the model's prediction on x̃_robust. However, we stress that adversarial training's potential to enhance model robustness comes at the cost of increased privacy leakage. To explain why adversarial training increases a model's privacy leakage, we hypothesize that human-recognizable, sensitive attributes are part of x̃_robust. These attributes are highly predictive because they enable discrimination between samples of different classes based on the assumption that they remain constant within samples of a specific class. We also argue that these attributes are robust features since they are easily identifiable by humans, and small adversarial perturbations are unlikely to affect their value.

In the context of face recognition, gender is a sensitive attribute that is highly predictive, enabling us to distinguish individuals from people with other gender appearances. Furthermore, it is a robust feature, as altering a person's gender appearance in an image would require significant changes to the pixels. To showcase the importance of robust features in robust models, we utilized the axiomatic attribution method Integrated Gradients [39] on two ResNet-101 models, one trained with standard training and the other with adversarial training. Fig. 6a visualizes the computed attributions for four input images from our attack datasets. While the standard model relies on noisy, non-robust features, such as the image background, the robust model concentrates more on facial attributes, including hair, eyes, and facial structure.
[Figure 6: (a) visualized attribution with Integrated Gradients for input images under the standard and the robust model; (b) relative attribution for facial features (eyeglasses, eyes, hair, mouth, hat, cloth, skin) of the standard vs. the robust model.]

Figure 6: The comparison of absolute attribution based on integrated gradients between a robust and a standard ResNet-101
model in (a) shows that robust models assign perceptually more attuned attribution to facial features. A quantitative comparison
of relative attribution in (b) further highlights that the robust model assigns more attribution to unique facial features, such as eyes
and hair, while the standard model assigns most attribution to non-specific parts of the skin.

To also provide quantitative support for our analysis, we used Integrated Gradients to measure a model's relative attribution to specific image parts. For this, we applied a pre-trained face segmentation model [26] to locate various attributes in facial images. Let $H_Z(x) \in \{0, 1\}^{H \times W}$ denote the binary segmentation mask for image x and attribute class Z, such as a person's hair, and let $A(M, x) \in \mathbb{R}^{H \times W}$ denote the absolute pixel-wise attribution by Integrated Gradients for model M and image $x \in \mathbb{R}^{H \times W}$. Let further $\|\cdot\|_1$ denote the sum norm and $\odot$ the Hadamard product. The relative attribution for a dataset X is then computed by:

$$A_\text{rel}(M, X) = \frac{1}{|X|} \sum_{x \in X} \frac{\|A(M, x) \odot H_Z(x)\|_1}{\|A(M, x)\|_1}. \tag{3}$$
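Eq. (3) can be evaluated, for instance, with Captum's Integrated Gradients implementation. The sketch below assumes precomputed binary segmentation masks for one attribute class and sums attributions over color channels; it illustrates the measurement rather than reproducing the exact evaluation script.

```python
import torch
from captum.attr import IntegratedGradients

def relative_attribution(model, images, masks, targets):
    """Eq. (3): share of total Integrated-Gradients attribution that falls inside the masks."""
    ig = IntegratedGradients(model)
    ratios = []
    for x, mask, y in zip(images, masks, targets):
        attr = ig.attribute(x.unsqueeze(0), target=int(y)).abs().squeeze(0)  # pixel-wise |attribution|
        attr = attr.sum(dim=0)                        # aggregate color channels -> H x W (our choice)
        ratios.append((attr * mask).sum() / attr.sum())
    return torch.stack(ratios).mean().item()
```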

By analyzing the relative attribution of seven attributes in 100 FFHQ samples and over 530 target identities of ResNet-101 models, Fig. 6b demonstrates that robust models assign more importance to distinct features like eyes or hair, while standard models prioritize general skin areas. This highlights the greater importance placed by robustly trained models on sensitive attributes encoded in images and reflected in their outputs. As CAIA exploits the differences in logits, robust models appear to be more susceptible to such privacy attacks and leak more sensitive attribute information. Thus, there exists a trade-off between model robustness and sensitive class information leakage.

5 Discussion and Challenges

In this paper, we introduced CAIA, a novel privacy attack that enables the inference of sensitive attribute information about classes learned by standard image classifiers. Our extensive experiments demonstrate that classifiers indeed leak sensitive class information, with significant implications for the secure application of machine learning models. While black-box models have traditionally been considered practically secure against MIAs and AIAs, our research shows that models in the vision domain are not immune to attribute inference. By leveraging recent advances in text-to-image synthesis, CAIA infers sensitive information from image classifiers with high accuracy, comparable to white-box MIAs. Furthermore, our findings suggest a trade-off between adversarial training and a model's information leakage, and future adversarial defenses should account for the privacy leakage of models to avoid creating new vulnerabilities.

Although CAIA offers reliable attribute inference capabilities, it still faces some challenges and limitations. Accurately evaluating class information leakage requires high-quality data and consistent labeling, which heavily relies on the underlying dataset. We note that the sample quality of CelebA and FaceScrub images varies widely in terms of resolution, sharpness, and coloring, and the labeling is not always consistent within an identity or attribute class; the datasets also contain falsely labeled samples. For instance, the boundaries between gray and blond hair are not well-defined, which can lead to reduced inference accuracy for these attributes. However, CAIA is generally successful in inferring sensitive attribute values in most cases. The attack metrics might even underestimate its effectiveness since false attribute predictions can still provide some identity information, e.g., a black skin tone prediction makes white as the ground truth rather unlikely. The attack is likely even more powerful for more consistently labeled datasets with higher image quality.
6 Conclusion

To summarize, our research provides novel insights into the privacy of image classifiers and shows that models leak more sensitive information than previously assumed. We hope it motivates future security research and defense endeavors in building secure and private models.

Expanding on the mentioned challenges, we propose to enhance CAIA with upcoming developments in text-guided image manipulation and to extend its use to continuous features like age or more detailed skin tone grading. For example, the attack samples could be generated to reflect various shades of dark skin color or fine-grained age representations. Moreover, in its current implementation, CAIA infers each attribute independently. However, in the real world, attributes often correlate with each other. For example, information inferred about racial appearance influences the probability distribution of hair colors. To this end, CAIA could leverage already inferred attributes as a prior for inferring additional attributes. We also envision the use of CAIA beyond privacy analyses. For example, it is exciting to explore it in the context of explainable AI to determine the importance of features by analyzing prediction scores for different feature characteristics. Similarly, it could be employed to assess the fairness of models and identify potential biases associated with certain classes.

Reproducibility Statement. Our source code is publicly available at https://github.com/LukasStruppek/Class_Attribute_Inference_Attacks to reproduce the experiments and facilitate further analysis.

Acknowledgments. This research has benefited from the Federal Ministry of Education and Research (BMBF) project KISTRA (reference no. 13N15343), the Hessian Ministry of Higher Education, Research, Science and the Arts (HMWK) cluster projects “The Third Wave of AI” and hessian.AI, from the German Center for Artificial Intelligence (DFKI) project “SAINT”, as well as from the joint ATHENE project of the HMWK and the BMBF “AVSV”.
                                                                             Computer and Communications Security (CCS), pages
                                                                             1322–1333, 2015.

                                                                         [8] Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon M.
                                                                             Lin, David Page, and Thomas Ristenpart. Privacy in
                                                                             pharmacogenetics: An end-to-end case study of person-
                                                                             alized warfarin dosing. In USENIX Security Symposium,
                                                                             pages 17–32, 2014.

                                                                         [9] Karan Ganju, Qi Wang, Wei Yang, Carl A. Gunter, and
                                                                             Nikita Borisov. Property inference attacks on fully con-
                                                                             nected neural networks using permutation invariant rep-
                                                                             resentations. In Conference on Computer and Commu-
                                                                             nications Security (CCS), pages 619–633, 2018.

[10] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[12] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint, arXiv:2208.01626, 2022.

[13] Dominik Hintersdorf, Lukas Struppek, and Kristian Kersting. To trust or not to trust prediction scores for membership inference attacks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 3043–3049, 2022.

[14] Dominik Hintersdorf, Lukas Struppek, and Kristian Kersting. Clipping privacy: Identity inference attacks on multi-modal machine learning models. arXiv preprint, arXiv:2209.07341, 2022.

[15] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely Connected Convolutional Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.

[16] Ahmad Ibrahim, Hoda K. Mohamed, Ali Maher, and Baochang Zhang. A survey on human cancer categorization based on deep learning. Frontiers in Artificial Intelligence, 5, 2022.

[17] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Conference on Neural Information Processing Systems (NeurIPS), volume 32, 2019.

[18] Bargav Jayaraman and David Evans. Are attribute inference attacks just imputation? In Conference on Computer and Communications Security (CCS), pages 1569–1582, 2022.

[19] Mostafa Kahla, Si Chen, Hoang Anh Just, and Ruoxi Jia. Label-only model inversion attacks via boundary repulsion. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 15025–15033, 2022.

[20] Cemre Karakas, Alara Dirik, Eylul Yalcinkaya, and Pinar Yanardag. FairStyle: Debiasing StyleGAN2 with style channel manipulations. In European Conference on Computer Vision (ECCV), volume 13673, pages 570–586, 2022.

[21] Kimmo Karkkainen and Jungseock Joo. FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In Winter Conference on Applications of Computer Vision (WACV), pages 1548–1558, 2021. Pretrained classifier available at https://github.com/dchen236/FairFace.

[22] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.

[23] Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4401–4410, 2019.

[24] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and Improving the Image Quality of StyleGAN. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Pretrained FFHQ model available at https://github.com/NVlabs/stylegan2-ada-pytorch.

[25] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 2015.

[26] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. MaskGAN: Towards diverse and interactive facial image manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Pretrained model available at https://github.com/switchablenorms/CelebAMask-HQ.

[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In International Conference on Computer Vision (ICCV), 2015.

[28] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.

[29] Shagufta Mehnaz, Sayanton V. Dibbo, Ehsanul Kabir, Ninghui Li, and Elisa Bertino. Are your sensitive attributes private? Novel model inversion attribute inference attacks on classification models. In USENIX Security Symposium, pages 4579–4596, 2022.

[30] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. arXiv preprint, arXiv:2211.09794, 2022.

[31] Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. When does label smoothing help? In Conference on Neural Information Processing Systems (NeurIPS), pages 4696–4705, 2019.

[32] Hongwei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In IEEE International Conference on Image Processing (ICIP), pages 343–347, 2014.

[33] Mathias P. M. Parisot, Balázs Pejó, and Dayana Spagnuelo. Property inference attacks on convolutional neural networks: Influence and implications of target model’s complexity. In International Conference on Security and Cryptography (SECRYPT), pages 715–721, 2021.

[34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Conference on Neural Information Processing Systems (NeurIPS), pages 8024–8035, 2019.

[35] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.

[36] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Conference on Neural Information Processing Systems (NeurIPS), pages 5019–5031, 2018.

[37] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In Symposium on Security and Privacy (S&P), pages 3–18, 2017.

[38] Lukas Struppek, Dominik Hintersdorf, Antonio De Almeida Correia, Antonia Adler, and Kristian Kersting. Plug & play attacks: Towards robust and flexible model inversion attacks. In International Conference on Machine Learning (ICML), volume 162, pages 20522–20545, 2022.

[39] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning (ICML), volume 70, pages 3319–3328, 2017.

[40] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

[41] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.

[42] Kuan-Chieh Wang, Yan Fu, Ke Li, Ashish Khisti, Richard Zemel, and Alireza Makhzani. Variational Model Inversion Attacks. In Conference on Neural Information Processing Systems (NeurIPS), 2021.

[43] Xiuling Wang and Wendy Hui Wang. Group property inference attacks against graph neural networks. In Conference on Computer and Communications Security (CCS), pages 2871–2884, 2022.

[44] Eric Wong, Leslie Rice, and J. Zico Kolter. Fast is better than free: Revisiting adversarial training. In International Conference on Learning Representations (ICLR), 2020.

[45] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In Computer Security Foundations Symposium (CSF), pages 268–282, 2018.

[46] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. ResNeSt: Split-attention networks. CoRR, abs/2004.08955, 2020.

[47] Yuheng Zhang, Ruoxi Jia, Hengzhi Pei, Wenxiao Wang, Bo Li, and Dawn Song. The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 250–258, 2020.

[48] Junhao Zhou, Yufei Chen, Chao Shen, and Yang Zhang. Property inference attacks against GANs. In Annual Network and Distributed System Security Symposium (NDSS), 2022.

A     Experimental Details

A.1    Hard- and Software Details

    We performed our experiments on NVIDIA DGX machines running NVIDIA DGX Server Version 5.1.0 and Ubuntu 20.04.4
LTS. The machines have 2TB of RAM and contain NVIDIA A100-SXM4-40GB GPUs and AMD EPYC 7742 CPUs. We further
relied on CUDA 11.4, Python 3.10.0, and PyTorch 1.12.1 with Torchvision 0.13.1 [34] for our experiments. If not stated otherwise,
we used the model architecture implementations and pre-trained ImageNet weights provided by Torchvision. We further provide
a Dockerfile together with our code to make the reproduction of our results easier. In addition, all training and attack configuration
files are available to reproduce the results stated in this paper.
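    The exact setup is defined in the released configuration files; the following minimal sketch only illustrates how a Torchvision architecture with pre-trained ImageNet weights can be instantiated and adapted to an identity classification task. The value of num_identities is an illustrative placeholder and not taken from the paper.

```python
import torch.nn as nn
from torchvision import models

# Placeholder: e.g., 500 identities in the CelebA gender subset; adjust per task.
num_identities = 500

# Instantiate a Torchvision architecture with pre-trained ImageNet weights
# and replace the classification head for identity classification.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_identities)
```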
    To perform Plug & Play Attacks (PPA), we used an NVIDIA DGX machine running NVIDIA DGX Server Version 5.2.0 and Ubuntu 20.04.4 LTS. The machine has 2TB of RAM and contains NVIDIA A100-SXM4-80GB GPUs and AMD EPYC 7742 CPUs. The experiments were performed with CUDA 11.4, Python 3.8.10, and PyTorch 1.10.0 with Torchvision 0.11.0. We further used PPA in combination with the pre-trained FFHQ StyleGAN-2 and relied on the standard CelebA attack parameters, as stated in Struppek et al. [38] and https://github.com/LukasStruppek/Plug-and-Play-Attacks. To speed up the attack process, we only changed the number of candidates to 50 and the number of final samples per target to 25.

A.2    Dataset Details

   We state the number of identities and samples for each custom CelebA subset and for FaceScrub in Tab. 1. We further state the lowest, median, maximum, and average number of samples of a single identity per attribute value and for the total dataset; a short sketch of how these statistics can be computed follows the table. For the filter models, the datasets contained 155,304 samples (gender and eyeglasses) and 93,942 samples (hair color), respectively.

               Attribute           Value        Identities   Total Samples   Min Samples   Median Samples   Max Samples   Avg Samples

               CelebA Gender       Female          250           7,639            30              30             35          30.56
                                   Male            250           7,652            30              30             36          30.61
                                   Total           500          15,291            30              30             36          30.58

               CelebA Glasses      No Glasses      100           3,186            31              32             36          31.86
                                   Glasses         100           2,346            20              22             31          23.46
                                   Total           200           5,532            20              31             36          27.66

               CelebA Race         Asian           100           2,944            28              30             31          29.44
                                   Black           100           2,876            26              29             34          28.76
                                   Indian          100           2,281            19              22             30          22.81
                                   White           100           3,164            31              31             35          31.64
                                   Total           400          11,265            19              29             35          28.16

               CelebA Hair Color   Black Hair      100           2,840            23              28             32          28.40
                                   Blond Hair      100           2,867            25              29             31          28.67
                                   Brown Hair      100           2,564            17              26             30          25.64
                                   Gray Hair       100           2,117            14              20             31          21.17
                                   Total           400          10,388            14              27             32          25.97

               FaceScrub Gender    Female          265          17,743             9              65            140          66.95
                                   Male            265          20,136            25              77            128          75.98
                                   Total           530          37,879             9              72            140          71.47

                                                  Table 1: Dataset identity statistics
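    As a rough illustration of how the per-identity statistics in Tab. 1 can be derived, the following sketch assumes a pandas DataFrame subset with one row per image and columns "identity" and "attribute_value"; this helper and its column names are illustrative assumptions and not part of the released code.

```python
import pandas as pd

def identity_statistics(subset: pd.DataFrame) -> pd.DataFrame:
    """Compute per-attribute identity statistics as reported in Tab. 1.

    Assumes `subset` has one row per image with columns
    "identity" and "attribute_value" (hypothetical schema).
    """
    rows = []
    for value, group in subset.groupby("attribute_value"):
        # Number of samples per identity within this attribute value.
        counts = group.groupby("identity").size()
        rows.append({
            "Value": value,
            "Identities": counts.shape[0],
            "Total Samples": int(counts.sum()),
            "Min Samples": int(counts.min()),
            "Median Samples": float(counts.median()),
            "Max Samples": int(counts.max()),
            "Avg Samples": round(float(counts.mean()), 2),
        })
    return pd.DataFrame(rows)
```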

A.3            Training Hyperparameters

    We emphasize that we did not aim to achieve state-of-the-art performance but rather to train models with generally good prediction performance. All non-robust models were initialized with pre-trained ImageNet weights and trained for 80 epochs with an initial learning rate of 0.1 and a batch size of 128. We multiplied the learning rate by a factor of 0.1 after 60 and 70 epochs. For the models trained with adversarial training, we set the number of epochs to 100 and reduced the learning rate after 80 and 90 epochs. Optimization was done with SGD and a momentum of 0.9. Images were resized to 224 × 224 and normalized with µ = σ = 0.5 to set the pixel value range to [−1, 1]. As augmentation, we applied random horizontal flipping with flipping probability p = 0.5 and random resized cropping with scale = [0.8, 1.0] and ratio = [0.8, 1.25]. Cropped images were resized to 224 × 224. Due to the limited number of training samples available for each identity, we did not use a validation set or early stopping. For each architecture and dataset, we trained three models with different seeds. Performance evaluation was done on a holdout test set. A minimal sketch of this setup is shown below.
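    The following sketch translates the described recipe into PyTorch under the stated settings (ResNet-18 used as an example architecture, milestones for the non-robust models); the released training configuration files remain the authoritative reference.

```python
import torch
from torchvision import models, transforms

# Example target model: ImageNet-initialized backbone with a new identity head.
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 500)  # e.g., 500 CelebA identities

# Augmentation and normalization as described above (batch size 128 during training).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0), ratio=(0.8, 1.25)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # pixel range [-1, 1]
])

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Non-robust models: 80 epochs, learning rate decay at epochs 60 and 70
# (robust models: 100 epochs, decay at epochs 80 and 90).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 70], gamma=0.1)
```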
    We further trained ResNet-50 models on the CelebA attributes gender, eyeglasses, and hair color to filter the candidate images. All models were initialized with pre-trained ImageNet weights and trained for 10 epochs with an initial learning rate of 1e−3 and a batch size of 128. We multiplied the learning rate by a factor of 0.1 after 9 epochs. We used the Adam optimizer [25] with β = (0.9, 0.999) and no weight decay. Normalization and augmentation were identical to the training of the target models, except for the scale and ratio parameters of the random resized cropping, which we set to [0.9, 1.0] and [1.0, 1.0] to prevent cutting out the attributes from the training samples. We again used no validation set or early stopping. To calibrate the models, we applied label smoothing with smoothing factor α = 0.1 during training. Since some attributes are more frequent in the CelebA dataset than others, we drew the same number of samples for each attribute value by oversampling / undersampling the attribute images. Details on the training of FairFace for filtering the racial appearance images can be found at https://github.com/dchen236/FairFace. A sketch of this filter-model setup follows below.
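    As a rough, non-authoritative sketch of the filter-model setup (ResNet-50, Adam, label smoothing, attribute balancing via weighted sampling), consider the following; the helper name, train_set, and attribute_labels are illustrative placeholders and not taken from the released code.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import models


def build_filter_training(train_set, attribute_labels, num_attribute_values):
    """Sketch of the filter-model training setup described above.

    `train_set` and `attribute_labels` (a 1-D LongTensor with one label per
    sample) are hypothetical placeholders for the CelebA attribute data.
    """
    model = models.resnet50(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Linear(model.fc.in_features, num_attribute_values)

    # Label smoothing (alpha = 0.1) for calibration.
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), weight_decay=0.0)
    # Learning rate decay by 0.1 after 9 epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[9], gamma=0.1)

    # Balance attribute frequencies by over-/undersampling with per-class weights.
    class_counts = torch.bincount(attribute_labels)
    sample_weights = 1.0 / class_counts[attribute_labels].float()
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(train_set))
    loader = DataLoader(train_set, batch_size=128, sampler=sampler)

    return model, criterion, optimizer, scheduler, loader
```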

  (a) CelebA

  Dataset       Num Classes   Architecture    Accuracy
  Gender            500       ResNet-18       85.99% ± 0.21
                              ResNet-101      86.78% ± 0.66
                              ResNet-152      86.62% ± 0.62
                              DenseNet-169    75.93% ± 5.54
                              ResNeSt-101     81.53% ± 4.40
  Eyeglasses        200       ResNet-18       93.08% ± 0.58
                              ResNet-101      90.97% ± 1.41
                              ResNet-152      91.10% ± 0.85
                              DenseNet-169    93.56% ± 0.38
                              ResNeSt-101     90.73% ± 0.21
  Race              400       ResNet-18       84.65% ± 0.44
                              ResNet-101      82.99% ± 0.40
                              ResNet-152      84.06% ± 1.19
                              DenseNet-169    85.00% ± 0.96
                              ResNeSt-101     81.13% ± 1.04
  Hair Color        400       ResNet-18       87.94% ± 0.56
                              ResNet-101      87.65% ± 0.70
                              ResNet-152      87.20% ± 0.67
                              DenseNet-169    88.45% ± 1.68
                              ResNeSt-101     85.53% ± 0.53

  (b) FaceScrub

  Dataset                      Num Classes   Architecture    Accuracy
  FaceScrub Cropped                530       ResNet-18       93.92% ± 0.17
                                             ResNet-101      94.93% ± 0.43
                                             ResNet-152      95.32% ± 0.22
                                             DenseNet-169    91.93% ± 0.65
                                             ResNeSt-101     93.50% ± 0.70
  FaceScrub Cropped Robust         530       ResNet-18       83.79% ± 0.21
                                             ResNet-101      88.12% ± 0.24
                                             ResNet-152      87.11% ± 0.28
                                             DenseNet-169    88.53% ± 0.76
                                             ResNeSt-101     90.04% ± 0.48
  FaceScrub Uncropped              530       ResNet-18       70.29% ± 2.71
                                             ResNet-101      78.75% ± 0.84
                                             ResNet-152      78.52% ± 0.58
                                             DenseNet-169    69.11% ± 12.73
                                             ResNeSt-101     77.32% ± 0.54
  FaceScrub Uncropped Robust       530       ResNet-18       41.58% ± 0.63
                                             ResNet-101      55.23% ± 0.60
                                             ResNet-152      55.69% ± 1.24
                                             DenseNet-169    54.65% ± 2.98
                                             ResNeSt-101     58.74% ± 0.84

                  Table 2: Prediction accuracy of the target models on the individual test sets.
