CERTIFAI: Counterfactual Explanations for Robustness, Transparency, Interpretability, and Fairness of Artificial Intelligence models
Shubham Sharma¹, Jette Henderson², Joydeep Ghosh²
¹Department of Electrical and Computer Engineering, University of Texas at Austin
²CognitiveScale
shubham sharma@utexas.edu, jhenderson@cognitivescale.com, jghosh@utexas.edu

arXiv:1905.07857v1 [cs.LG] 20 May 2019

Abstract

As artificial intelligence plays an increasingly important role in our society, there are ethical and moral obligations for both businesses and researchers to ensure that their machine learning models are designed, deployed, and maintained responsibly. These models need to be rigorously audited for fairness, robustness, transparency, and interpretability. A variety of methods have been developed that focus on these issues in isolation; however, managing these methods in conjunction with model development can be cumbersome and time-consuming. In this paper, we introduce a unified and model-agnostic approach to address these issues: Counterfactual Explanations for Robustness, Transparency, Interpretability, and Fairness of Artificial Intelligence models (CERTIFAI). Unlike previous methods in this domain, CERTIFAI is a general tool that can be applied to any black-box model and any type of input data. Given a model and an input instance, CERTIFAI uses a custom genetic algorithm to generate counterfactuals¹: instances close to the input that change the prediction of the model. We demonstrate how these counterfactuals can be used to examine issues of robustness, interpretability, transparency, and fairness. Additionally, we introduce CERScore, the first black-box model robustness score, which performs comparably to methods that have access to model internals.

Figure 1: The CERTIFAI counterfactual generation process. The decision boundary for a binary classifier is shown, with the input instance in black. We sample a set of points (left) in the feature space with a constraint that they must lie on the other side of the decision boundary (green points). The algorithm then evolves these samples (middle) to generate individuals that lie closer to the input point but on the other side of the decision boundary. Finally, a smaller set of counterfactuals, the size of which is user-defined, is generated (right).

1 Introduction

As the adoption of machine learning models for everyday tasks grows rapidly, so does the need to consider the ethical, moral, and social consequences of the decisions made by such models. Several important questions arise in a variety of applications, such as the following: 1) how did the model predict what it predicted?, 2) if a person got an unfavorable outcome from the model, what can they do to change that?, 3) has the model been unfair to a particular group?, and 4) how easily can the model be fooled? Researchers are actively building separate approaches to answer each of these questions.

One promising vein of research in explainability, first introduced by [Wachter et al., 2017], is generating counterfactuals. Given an input data point and a black-box machine learning model (i.e., we only have access to the model's prediction for any input), a counterfactual is defined as a generated data point that is as close to the input data point as possible but for which the model gives a different outcome. For example, if a user was denied a loan by a machine learning model, an example counterfactual explanation could be: "Had your income been $5000 greater per year and your credit score been 30 points higher, your loan would be approved." [Wachter et al., 2017] argue that counterfactuals are a way of explaining model results to users such that they can identify actionable ways of changing their behaviors to obtain favorable predictions. In addition to providing counterfactuals for explainability, we show how counterfactual explanations can be used to audit the fairness and robustness of a model.

As promising as the original method [Wachter et al., 2017] and subsequent methods of generating counterfactuals [Ustun et al., 2019; Russell, 2019] are, they are also limited in that some only work for linear models, while others cannot deal with different data types.

¹Counterfactuals have a well-established meaning in the causality literature. However, we use "counterfactual" in the counterfactual explanation sense recently introduced in the explainability literature [Wachter et al., 2017], where the model is a machine learning model and not a causal model, and hence no causal assumptions are made here.
To resolve these limitations, we introduce CERTIFAI, a novel, flexible, model-agnostic technique for generating counterfactuals via a custom genetic algorithm. The meta-heuristic evolutionary algorithm starts by generating a random set of points such that they do not have the same prediction as the input point. A subsequent evolutionary process results in a set of points close to the input that maintain the prediction constraint. Figure 1 shows an example of three counterfactuals (green points) generated for a given input (black point).

A major advantage of using the genetic algorithm to generate counterfactuals is that it can generate counterfactuals for linear and non-linear models (e.g., deep networks) and for any input form (from mixed tabular data to image data) without any approximations to or assumptions for the model. Moreover, end-users can 1) define a range for any feature, and 2) restrict the features that can change. CERTIFAI simply constrains the values of the sampled points based on those choices, allowing the generated counterfactuals to reflect a user's understanding of how much it is possible for them to change their features.

CERTIFAI can be used to audit any black-box classifier, providing three tools that are based on a single underlying algorithm. The major contributions of this paper are summarized as follows:

• Counterfactuals are generated using a custom genetic algorithm, which is model-agnostic, flexible, and can be used to provide explanations.
• Counterfactuals are shown to be effective adversarial examples. They are also used to generate the Counterfactual Explanation-based Robustness Score (CERScore), which to the best of our knowledge is the first ever black-box model robustness score.
• Counterfactuals can be used to evaluate fairness with respect to a user's input as well as the fairness of the model towards groups of individuals.

2 Related Work

Table 1 summarizes the key features of CERTIFAI and the work most related to CERTIFAI. Other general methods on explainability, fairness, and robustness have been described by [Guidotti et al., 2018b], [Binns, 2017], and [Akhtar and Mian, 2018], respectively. As we can see, there is no prior art that handles a diverse set of desirable properties needed to develop a responsible AI system. [Wachter et al., 2017] introduced counterfactual explanations; however, their optimization formulation cannot handle categorical data (i.e., the optimization does not solve for such data; the values are found in a brute-force fashion). Methods in [Ustun et al., 2019] and [Russell, 2019] only work for linear models. [Guidotti et al., 2018a] uses a genetic algorithm to generate neighbors around the input and then uses decision trees to locally approximate the model. However, local approximations might come at the cost of model accuracy, and they define counterfactuals based on the minimum number of splits in the trained decision tree, which might not always be the smallest change to the input. The closest work on generating adversarial examples is by [Carlini and Wagner, 2017]. They work on white-box and simpler convolutional models, and their distance metrics might not be apt for measuring image similarity [Wang et al., 2002]. [Weng et al., 2018] define a robustness score, CLEVER; however, they have access to the model gradients.

[Table 1: comparison matrix with columns Black-box, Model-Agnostic, Mixed-data, Explainability, Fairness, and Robustness for CERTIFAI, [Ustun et al., 2019], [Wachter et al., 2017], [Russell, 2019], [Ribeiro et al., 2016], [Guidotti et al., 2018a], [Carlini and Wagner, 2017], and [Weng et al., 2018].]
Table 1: Comparison of related work with our approach. We consider the approaches most similar to ours. Mixed-data means the method can work with both discrete and continuous data, without any discretization or assumptions.
3 The CERTIFAI framework

In this section, we formulate a custom genetic algorithm to find counterfactual(s). Consider a black-box classifier f and an input instance x. Let the counterfactual be a feasible generated point c. Then the problem can be formulated as:

    min_c d(x, c)   s.t.   f(c) ≠ f(x)    (1)

where d(x, c) is the distance between x and c. To avoid using any approximations to or assumptions for the model, we use a genetic algorithm to solve Equation 1. The custom genetic algorithm works for any black-box model and input data type, and it is model-agnostic. Additionally, it provides a great deal of flexibility in counterfactual generation.

CERTIFAI's genetic algorithm solves the optimization problem in Equation 1 through a process of natural selection. The only mandatory inputs for the genetic algorithm are the black-box classifier f and an input instance x. Generally, for an n-dimensional input vector x, let W ⊆ R^n represent the space from which individuals can be generated and P be the set of points with the same prediction as x:

    P = {p | f(p) = f(x), p ∈ W}.    (2)

The possible set of individuals c ∈ I is defined such that

    I = W \ P.    (3)

Each individual c ∈ I is a candidate counterfactual. The goal is to find the fittest possible c* to x constrained on c* ∈ I. The fitness for an individual c is defined as:

    fitness = 1 / d(x, c).    (4)

Here c* will then be the point closest to x such that c* ∈ I. For the multi-class case, if a user wants the counterfactual c to belong to a particular class j, we define Q as:

    Q = {q | f(q) = j, q ∈ W}.    (5)

Then Equation 3 becomes:

    I = (W \ P) ∩ Q.    (6)

The algorithm is carried out as follows: first, a set I_c is built by randomly generating points such that they belong to I. Individuals c ∈ I_c are then evolved through three processes: selection, mutation, and crossover. Selection chooses individuals that have the best fitness scores (Equation 4). A proportion of these individuals (dependent on p_m, the probability of mutation) are then subjected to mutation, which involves arbitrarily changing some feature values. A proportion of individuals (dependent on p_c, the probability of crossover) are then subjected to crossover, which involves randomly interchanging some feature values between individuals. The population is then restricted to the individuals that meet the required constraint (Equation 3 or Equation 6), and the fitness scores of the new individuals are calculated. This is repeated until the maximum number of generations is reached. Finally, the individual(s) c* with the best fitness score(s) is/are chosen as the desired counterfactual(s)².

²p_m = 0.2 and p_c = 0.5, which is standard in the literature. The population size is the square of the input feature size, with a maximum cap of 30,000. Grid search is used to find the number of generations.
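The procedure can be summarized in code. The following is a minimal, illustrative sketch of such a search, not the authors' implementation: it assumes a `predict` function that returns labels for a batch of candidates (black-box access only), a `sample_fn` that draws random candidates from the allowed space W, and a generic `distance`; selection, mutation, and crossover are simplified, and the diversity-based selection of the final k counterfactuals is omitted.

```python
import numpy as np

def generate_counterfactuals(predict, x, sample_fn, num_counterfactuals=1,
                             pop_size=1000, generations=300, p_m=0.2, p_c=0.5,
                             distance=None, target_class=None):
    """Genetic search for counterfactuals of input x under a black-box `predict`.

    predict: maps a 2-D array of candidates to class labels (black-box access only).
    sample_fn: draws random candidate points from the allowed space W.
    distance: d(x, c); Euclidean distance is used as a stand-in default here.
    target_class: if given, counterfactuals are constrained to that class (Eq. 6).
    """
    if distance is None:
        distance = lambda a, b: np.linalg.norm(a - b)
    x_label = predict(x[None, :])[0]

    def feasible(pop):
        labels = predict(pop)
        keep = labels != x_label                      # must cross the decision boundary (Eq. 3)
        if target_class is not None:
            keep &= labels == target_class            # restrict to the desired class (Eq. 6)
        return pop[keep]

    # Initial population: random samples from W that already satisfy the constraint.
    # (In practice one would resample if no feasible individuals remain.)
    pop = feasible(sample_fn(pop_size))
    for _ in range(generations):
        fitness = np.array([1.0 / (distance(x, c) + 1e-12) for c in pop])   # Eq. 4
        # Selection: keep the fitter half of the population.
        pop = pop[np.argsort(-fitness)][: max(2, len(pop) // 2)]
        # Mutation: resample one random feature for a fraction p_m of individuals.
        mutants = pop[np.random.rand(len(pop)) < p_m].copy()
        if len(mutants):
            j = np.random.randint(pop.shape[1], size=len(mutants))
            mutants[np.arange(len(mutants)), j] = sample_fn(len(mutants))[np.arange(len(mutants)), j]
        # Crossover: swap random features between pairs of individuals.
        parents = pop[np.random.rand(len(pop)) < p_c]
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            mask = np.random.rand(pop.shape[1]) < 0.5
            children.append(np.where(mask, a, b))
        new = [pop, mutants] + ([np.array(children)] if children else [])
        pop = feasible(np.vstack(new))                # re-impose the prediction constraint
    fitness = np.array([1.0 / (distance(x, c) + 1e-12) for c in pop])
    return pop[np.argsort(-fitness)][:num_counterfactuals]
```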
3.1 Choice of distance function

The choice of distance function used in Equation 1 depends on the details provided by the model creator and the type of data being considered. If the data is tabular, [Wachter et al., 2017] demonstrated that the L1 norm normalized by the median absolute deviation (MAD) is better than the plain L1 or L2 norm for counterfactual generation. For tabular data, the L1 norm for continuous features (NormAbs) and a simple matching distance for categorical features (SimpMat) are chosen as defaults. In the absence of training data, normalization using MAD is not possible; however, in model development and in our experiments, where there is access to training data, normalization is possible. The distance metric used is:

    d(x, c) = (n_con/n) · NormAbs(x, c) + (n_cat/n) · SimpMat(x, c)    (7)

where n_con and n_cat are the number of continuous and categorical features, respectively, and n is the total number of features (n_con + n_cat = n).

For image data, the Euclidean distance and absolute distance between two images are not good measures of image similarity [Wang et al., 2002]. Hence, we use SSIM (Structural Similarity Index Measure) [Wang et al., 2003], which has been shown to be a better measure of what humans consider to be similar images [Wang et al., 2002]. SSIM values lie between 0 and 1, where a higher SSIM value means that two images look more similar to each other. For the input image x and counterfactual image c, the distance is:

    d(x, c) = 1 / SSIM(x, c).    (8)
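As a rough illustration of these two distances (assuming label-encoded categorical features and the `structural_similarity` function from scikit-image; the exact normalization conventions, e.g. whether SimpMat counts or averages mismatches, are assumptions):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def tabular_distance(x, c, cont_idx, cat_idx, mad):
    """Eq. 7: MAD-normalized L1 over continuous features plus simple matching
    over categorical features, weighted by the fraction of each feature type."""
    n = len(cont_idx) + len(cat_idx)
    norm_abs = np.sum(np.abs(x[cont_idx] - c[cont_idx]) / mad)   # NormAbs with MAD normalization
    simp_mat = np.sum(x[cat_idx] != c[cat_idx])                  # SimpMat: mismatch count
    return (len(cont_idx) / n) * norm_abs + (len(cat_idx) / n) * simp_mat

def image_distance(x_img, c_img):
    """Eq. 8: reciprocal of SSIM, so more similar images have smaller distance."""
    return 1.0 / ssim(x_img, c_img, data_range=x_img.max() - x_img.min())

# mad would be precomputed from the training data (guarding against zero entries), e.g.:
# med = np.median(X_train[:, cont_idx], axis=0)
# mad = np.median(np.abs(X_train[:, cont_idx] - med), axis=0)
# mad = np.where(mad == 0, 1.0, mad)
```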
3.2 Improving counterfactuals with constraints

Apart from the input instance and the black-box model, additional inputs help the algorithm produce better results. Auxiliary constraints are incorporated by restricting W, the space from which individuals can be generated, to ensure feasible solutions. For an n-dimensional input, let W be the Cartesian product of the sets W_1, W_2, ..., W_n. Continuous features can be constrained as W_i ∈ [W_i^min, W_i^max], and categorical features can be constrained as W_i ∈ {W_1, W_2, ..., W_j}. However, certain variables might be immutable (e.g., race). In these cases, a feature i for an input x can be muted by setting W_i = x_i.

As an example of the benefit of such constraints, a user may not want an explanation involving an income change from $10,000 to $900,000 if that is not possible, so W_i ∈ [$10,000, $15,000] might be an appropriate constraint. The number of counterfactuals k can also be set. CERTIFAI chooses the top k individuals (k = 1 by default) where different features have changed, so the end-user can get multiple diverse explanations of different kinds.
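A hypothetical way such constraints could be specified and turned into the sampler used by the genetic search is sketched below; the dictionary format and feature names are illustrative assumptions, not CERTIFAI's actual interface.

```python
import numpy as np

# Hypothetical constraint specification: each feature gets a range, a set of
# allowed (label-encoded) categories, or is frozen to the input's value.
constraints = {
    "income":    {"type": "continuous", "range": (10_000, 15_000)},
    "education": {"type": "categorical", "values": [0, 1, 2]},   # label-encoded categories
    "race":      {"type": "immutable"},                          # W_i = x_i: cannot change
}

def make_sampler(constraints, x, feature_names, rng=np.random.default_rng(0)):
    """Return a sample_fn that draws candidates only from the restricted space W."""
    def sample_fn(m):
        cols = []
        for j, name in enumerate(feature_names):
            spec = constraints.get(name, {"type": "immutable"})
            if spec["type"] == "continuous":
                lo, hi = spec["range"]
                cols.append(rng.uniform(lo, hi, size=m))
            elif spec["type"] == "categorical":
                cols.append(rng.choice(spec["values"], size=m))
            else:                                    # immutable: copy the input's value
                cols.append(np.full(m, x[j]))
        return np.column_stack(cols)
    return sample_fn
```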
3.3 Robustness

Machine learning models are prone to attacks and threats. For example, deep learning models have performed exceedingly well on image recognition tasks, but it has been widely shown [Carlini and Wagner, 2017], [Nguyen et al., 2015] that these networks are prone to adversarial attacks. Two images may look the same to a human, but when presented to a model, they can produce different outcomes. A counterfactual is a generated point close to an input that changes the prediction and is therefore an adversarial example.

Given two black-box models, if the counterfactuals across classes are farther away from the input instances on average for one network than for the other, that network is harder to fool. Since CERTIFAI directly gives a measure of the distance d(x, c), this can be used to define a robustness score for a classifier. Using this distance, we introduce the Counterfactual Explanation-based Robustness Score (CERScore), the first ever black-box model robustness score. Given a model, the CERScore is defined as the expected distance between the input instances and their corresponding counterfactuals:

    CERScore(model) = E_X[d(x, c*)].    (9)

To better compare models trained on different data sets, the CERScore can be normalized by the expected distance between data points within each class, averaged over all classes k, giving the normalized CERScore (NCERScore, abbreviated NC):

    NC = E_X[d(x, c*)] / Σ_{k=1}^{K} P(x ∈ class_k) · E[d(x_i, x_j); x_i, x_j ∈ class_k]    (10)

(i.e., we normalize by dividing by the expected distance between two data points drawn from the same class). A higher CERScore implies that the model is more robust. Note that the normalized CERScore can be greater than 1. Unlike [Weng et al., 2018], CERTIFAI only needs the model's predictions and not the model internals.
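A sketch of how these scores might be estimated empirically from a set of input-counterfactual distances; the Monte Carlo approximation of the within-class expectation by random same-class pairs is an assumption made for illustration.

```python
import numpy as np

def cerscore(distances):
    """Eq. 9 estimated empirically: mean distance between inputs and counterfactuals."""
    return float(np.mean(distances))

def ncerscore(distances, X, labels, d, pairs_per_class=1000, rng=np.random.default_rng(0)):
    """Eq. 10: CERScore divided by the class-weighted expected within-class distance.
    The within-class expectation is approximated by sampling random same-class pairs."""
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / counts.sum()
    within = []
    for k in classes:
        Xk = X[labels == k]
        i = rng.integers(len(Xk), size=pairs_per_class)
        j = rng.integers(len(Xk), size=pairs_per_class)
        within.append(np.mean([d(Xk[a], Xk[b]) for a, b in zip(i, j)]))
    denom = float(np.dot(priors, within))
    return cerscore(distances) / denom
```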
3.4 Fairness

The fitness measure (Equation 4) and the CERScore can also be used to investigate fairness from individual and group perspectives, respectively. For a given individual instance, if the genetic algorithm can generate different counterfactuals with different values of a protected feature (e.g., race, age), and as a result the user can achieve the desired outcome more easily than when those features could not be changed, then the individual could claim the model is unfair to their case.

Additionally, CERTIFAI can be used by model developers to audit fairness for different groups of observations. If the fitness measure is markedly different for counterfactuals generated for different partitions of a feature's domain, this could be an indication that the model is biased towards one of the partitions. For example, if the gender feature is partitioned into two values (male and female), and the average fitness values of generated counterfactuals are lower for females than for males, this could be used as evidence that the model is not treating females fairly. Using counterfactuals and the distance function, we can calculate the overall burden for a group, measured as:

    Burden(g) = E_g[d(x, c*)]    (11)

where g is a partition defined by the distinct values of a specified feature set. Note that burden is related to the CERScore, as it is the expected value over a group. Most fairness auditing models focus on single features (e.g., [Hardt et al., 2016; Donini et al., 2018]). Burden, however, does not have that limitation and can be applied to any combination of features.
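Burden is straightforward to compute once counterfactual distances are available. A minimal sketch, assuming a table with one row per audited instance and hypothetical column names:

```python
import pandas as pd

def burden(df, group_cols, dist_col="cf_distance"):
    """Eq. 11: expected input-to-counterfactual distance within each group, where a
    group is a combination of values of one or more (protected) features."""
    return df.groupby(group_cols)[dist_col].mean()

# Hypothetical usage: df holds the protected attribute(s) and the distance d(x, c*)
# to each instance's generated counterfactual.
# df = pd.DataFrame({"race": [...], "gender": [...], "cf_distance": [...]})
# print(burden(df, ["race"]))             # burden per race
# print(burden(df, ["race", "gender"]))   # burden over a combination of features
```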
4 Experiments

We demonstrate the applications and flexibility of CERTIFAI for explainability, transparency, fairness, and robustness.

4.1 Robustness

In this section, we demonstrate how CERTIFAI produces adversarial examples, and we use the CERScore from Section 3.3 to measure a network's resistance to adversarial attacks.

Generating Adversarial Examples

We consider the MNIST dataset [LeCun, 1998], which contains 60,000 training images of digits (size 28x28). We use it to train a convolutional neural network consisting of one convolution layer, two dense layers, and intermediate pooling and dropout layers. We achieve 99.46% accuracy on the test set using this architecture. We then select random images from the MNIST dataset and, using the model above, find the counterfactual image for each image with SSIM (Equation 8). Every pixel is considered to be a feature, and hence every individual in the population is a 784-dimensional vector. We use an initial population size of 30,000 individuals and run the experiment for 1,000 generations.

Figure 2 shows an example of ten generated counterfactuals (right) and their original counterpart images (left). The counterfactual images on the right look nearly identical to the input images on the left; however, the model predicts a different outcome for the left and right images. The imperceptibly different images give credence to the idea of using a genetic algorithm formulation to produce counterfactuals. Additionally, our approach to generating these images is model-agnostic and does not require any approximations, unlike [Carlini and Wagner, 2017]. The generated images show how easily a network can be fooled and demonstrate that there is a major problem in deploying such highly accurate networks in image-based decision-making applications (e.g., face recognition). Moreover, different kinds of adversarial attacks can be generated by simply changing the distance function in Equation 4.

Figure 2: Adversarial examples for the MNIST dataset. The images on the left are original inputs for which the model has a correct prediction, and the images on the right are counterfactual images (adversarial examples) for which the prediction has changed.
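A sketch of a network matching this description is given below; the filter count, kernel size, hidden width, and dropout rate are not stated in the paper and are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# One convolution layer, pooling, dropout, and two dense layers, as described above.
model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                          # dropout rate is an assumption
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train[..., None] / 255.0, y_train, epochs=10, validation_split=0.1)
```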
Evaluating Deep Networks

In this section, we evaluate how well the CERScore, introduced in Section 3.3, gives an informative measure of robustness. We consider the same networks as [Weng et al., 2018]: Inception-v3 [Szegedy et al., 2016], ResNet-50 [He et al., 2016], and MobileNet [Howard et al., 2017], pre-trained on ImageNet [Deng et al., 2009], for which they define the CLEVER robustness score. Unlike CLEVER, we consider the model to be a black box (relying only on its predictions). Ideally, to derive a measure of robustness for a model, all images from all classes should be considered, their counterfactuals should be generated, and the CERScore should then be calculated. However, since the number of training samples for a deep network is on the order of millions, it is not computationally feasible to calculate the score for each example. Hence, we consider a subset of classes and images to calculate the CERScore. We sampled n = 50 random images from every class across k = 100 random classes. We generate the counterfactuals for all 5,000 images such that the counterfactual gives a prediction of the second most likely class (by generating individuals constrained to belong to that class, as in Equation 6) and empirically estimate the CERScore as:

    CERScore = (1/nk) Σ_{i=1}^{k} Σ_{j=1}^{n} d(x_ij, c*_ij)    (12)

where x_ij is the j-th input instance belonging to predicted class i, and c*_ij is the corresponding counterfactual. The CERScores are shown in Table 2. One way to interpret the score is that, on average, the SSIM score for Inception-v3 is 1/1.17 = 0.85, where an SSIM score of 1 means the images look exactly the same and an SSIM score of 0 means the images are highly different. Hence, adversarial attacks on Inception-v3 could be more easily identified than for the other models. We also show the 95% confidence interval, where we have assumed the distribution of distances between the images and their counterfactuals follows a normal distribution. The confidence intervals are tight around the CERScores.

    Model         CERScore   CI          CLEVER
    Inception-v3  1.17       1.09-1.25   0.229
    Resnet-50     1.06       1.05-1.08   0.137
    MobileNet     1.08       1.06-1.09   0.151

Table 2: Robustness scores and 95 percent confidence intervals (CI) for three deep learning models, with the corresponding CLEVER scores.

We also compare the CERScore with the CLEVER scores [Weng et al., 2018] for the same images, considering the top-2 class attack. The CLEVER scores are also reported in Table 2. The CERScore implies that Inception-v3 is the most robust and ResNet-50 the least robust, which is similar to what the CLEVER scores suggest. Hence, even though CERTIFAI does not access any model weights, it is able to evaluate a model's robustness to adversarial attacks.
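The empirical estimate and its confidence interval can be computed directly from the sampled distances; a minimal sketch under the normal-approximation assumption stated above:

```python
import numpy as np

def cerscore_with_ci(distances, z=1.96):
    """Empirical CERScore (Eq. 12) over sampled image/counterfactual distances,
    with a normal-approximation 95% confidence interval for the mean."""
    d = np.asarray(distances, dtype=float)
    mean = d.mean()
    half = z * d.std(ddof=1) / np.sqrt(len(d))
    return mean, (mean - half, mean + half)

# Hypothetical usage over the 5,000 sampled distances:
# score, (lo, hi) = cerscore_with_ci(distances)
```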
Robustness of Classic Classifiers

Next, we use the NCERScore (Equation 10) to compare the robustness of different models trained on different data sets. We train three models (decision trees (DT), Support Vector Machines with an RBF kernel (SVM), and multilayer perceptrons (MLP)) on the three data sets listed in Table 3, and we report the NCERScore and the accuracy on the test set in Table 3.

    Data set        Num. obs.  Num. features  DT NCERS.  DT Acc.  SVM NCERS.  SVM Acc.  MLP NCERS.  MLP Acc.
    Pima Diabetes   768        8              0.074      73.25    0.387       81.42     0.486       98.61
    Breast Cancer   569        32             0.081      95.80    0.121       96.50     0.124       96.50
    Iris            150        4              0.132      95.67    0.235       95.67     0.241       95.67

Table 3: Descriptions of data sets, and NCERScore (NCERS.) and test set accuracy (Acc.) for three models: decision tree (DT), SVM with RBF kernel (SVM), and multilayer perceptron (MLP).

Across all data sets, the neural network has the highest NCERScore and is therefore the most robust of the classifiers for these data sets. In the Pima diabetes data set, the accuracy of the decision tree is much lower than that of the other models, which suggests this simple model cannot adequately capture the class separation. Hence, more points would be concentrated near the decision boundaries, resulting in a lower NCERScore.

For the Iris data set, while it is a relatively simple data set (even the decision tree performs well), the decision tree has the lowest NCERScore while the scores for SVM and MLP are similar. In Figure 3, we plot input points for two features of the Iris data set and the decision boundary for each model. Looking closely, the points for the decision tree are closer on average to the decision boundaries than for the other two models (i.e., the densely clustered white and red points are closer to the decision boundaries), which suggests the model is more prone to being fooled. The decision boundaries for both SVM and MLP are nearly identical around the input points, which results in similar robustness scores. Similar results can be seen for the cancer data set. Using these results, a model developer can choose the most robust model based on the NCERScore.

Figure 3: Decision boundaries and input points for the Iris flowers data set (3 classes) for 3 models: neural network, SVM with RBF kernel, and decision tree (DT), visualized using two features from the data set. DT has closer points to the decision boundaries on average.

4.2 Explainability

Counterfactuals are used to provide explanations and transparency to a user about how much change is needed for them to obtain a favorable prediction. We show the importance of using constraints to improve explanations, the use of multiple counterfactual explanations for a single instance, and how these can also be used to estimate feature importance.

Datasets and Models

We use the Pima Indian dataset [Smith et al., 1988] and the UCI adult dataset [Kohavi, 1996] in the following experiments. The Pima Indian dataset consists of 768 data samples and 8 features, where 6 features are continuous and integer-valued and 2 features are continuous float-valued. The task is to predict the risk of diabetes (1: at risk, 0: not at risk). We train a 4-layer neural network with an input layer, 2 hidden layers of 20 neurons each, and an output layer, with an 80-20 training-test split. The accuracy of the model is 99.6% on the test set. An initial population of 500 individuals is considered and the evaluation is done across 300 generations. The UCI adult dataset consists of 48,842 samples with 14 categorical and continuous integer features and a binary outcome of predicted income (>50K or ≤50K). [...] without any constraints on the feature values. We show features for which the values have changed (between the input [...]

    Person  Feature(s)      Original  Counterfactual
    1       Glucose (CWC)   115       71
            BMI (CUC)       35.3      10.1
    2       Glucose (CWC)   168       89
            Age (CUC)       34        44

Table 4: Counterfactual explanations for the Pima Indian diabetes dataset. CWC: counterfactuals with constraints on feature values; CUC: unconstrained counterfactuals. Unconstrained features lead to infeasible solutions (BMI 10.1) or to unchangeable features (age) being changed.

Figure 4: Feature importance for the model trained on the Pima Indian diabetes dataset, measured by the number of times a feature changed to generate the counterfactual (left), and feature importance by XGBoost (right).

To estimate feature importance, we generate counterfactuals for all samples (irrespective of prediction) and analyze the number of times every feature value has changed, as shown in Figure 4. Interestingly, the importances are qualitatively similar to those returned by Python's XGBoost [Chen and Guestrin, 2016] library (also shown in Figure 4). Specifically, feature 5 (BMI) and feature 2 (Glucose) are the most important in predicting diabetes risk. This analysis can be extended to the multi-class case by constraining sampled individuals such that they belong to a desired class (Equation 6).
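A sketch of this counting-based importance measure, assuming numerically encoded features (as in the Pima data) and aligned arrays of inputs and counterfactuals; the XGBoost comparison mirrors Figure 4 (right):

```python
import numpy as np

def counterfactual_feature_importance(X, counterfactuals, feature_names, tol=1e-8):
    """Rank features by how often they change between an input and its counterfactual,
    as in Figure 4 (left). X and counterfactuals are aligned (num_samples, num_features)."""
    changed = np.abs(np.asarray(X, dtype=float) - np.asarray(counterfactuals, dtype=float)) > tol
    counts = changed.sum(axis=0)
    order = np.argsort(-counts)
    return [(feature_names[i], int(counts[i])) for i in order]

# Hypothetical comparison with XGBoost's importances:
# import xgboost as xgb
# clf = xgb.XGBClassifier().fit(X, y)
# print(sorted(zip(feature_names, clf.feature_importances_), key=lambda t: -t[1]))
```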
Multiple counterfactual explanations

Multiple explanations are helpful to a user so that they can receive a diverse set of changes that could be made to achieve a desired outcome. The UCI adult dataset (CWC case) is considered, features such as native-country are muted, and a set range is given for features like hours-per-week. We run the genetic algorithm for the input instance and select the best two individuals that have different changes in feature indices. The advantage of our approach is that we only need to run the algorithm once to generate many explanations, as opposed to [Russell, 2019], where the IP solver needs to be run multiple times to generate multiple explanations.

To underscore the benefits of suggesting alternative counterfactuals, Table 5 shows two sets of explanations generated by CERTIFAI for the same person. Multiple explanations, the number of which is set by the user, allow a user to decide which counterfactual may be the most actionable.

    Person  Feature(s)     Original    Counterfactual
    1       Education      12th        Bachelors
            Occupation     Tech-suppt  Exec-managerial
    1       Hrs-per-week   50          70
            Workclass      Local-gov   Private

Table 5: Two explanations for the same person from the UCI adult dataset, with constraints on feature values.

4.3 Fairness

We evaluate fairness from an individual's perspective and from a model developer's perspective. To see if the model is unfair towards any instance, we consider 100 random instances of the UCI adult dataset where the prediction was unfavorable and run the algorithm twice, once when the sensitive attribute is not allowed to change and once when it is, and record the fitness values. We do this for two sensitive attributes, race and gender.

The results for three such instances are shown in Table 6. FitnessM refers to the fitness value when the sensitive feature is muted for an individual, and FitnessU corresponds to the feature being unmuted. The fitness for the first two people increases substantially when these protected features are allowed to change; hence, for these instances, there is evidence that the model has not been fair. For the third person, the evidence suggests that the model has been fair.

    Person  Feature  FitnessM  FitnessU
    1       Race     0.63      0.87
    2       Gender   0.41      0.62
    3       Race     0.81      0.81

Table 6: Fitness values when the race and gender attributes are muted (FitnessM) and unmuted (FitnessU) for three people.

A model developer can use the idea of burden (Equation 11) to evaluate how fair a model is being to groups of individuals. To demonstrate the idea of burden, we consider the attribute race in the UCI adult dataset and take all training examples that have an unfavorable outcome. The results of our experiments are shown in Figure 5. As we can see, the burden on the Black race and the Other race is higher than on the other races. This means that, on average, these groups would have to make more changes to achieve a desired prediction compared to others. Hence the model imposes a burden on these groups, which could imply that the model has been unfair.

Figure 5: Burden on different groups belonging to a particular race in the UCI adult dataset, found using the distance between the input instances and counterfactuals (Equation 11).
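The individual-level audit can be sketched by reusing the illustrative helpers from earlier sections (these helpers and their interfaces are assumptions, not the authors' code): generate a counterfactual with the protected attribute muted, then again with it free, and compare the fitness values as in Table 6.

```python
def audit_instance(predict, x, feature_names, constraints, protected, protected_values, distance):
    """Compare best fitness with the protected feature frozen vs. allowed to change."""
    muted = dict(constraints, **{protected: {"type": "immutable"}})
    unmuted = dict(constraints, **{protected: {"type": "categorical", "values": protected_values}})

    def best_fitness(cons):
        sampler = make_sampler(cons, x, feature_names)
        c = generate_counterfactuals(predict, x, sampler, distance=distance)[0]
        return 1.0 / distance(x, c)                    # Eq. 4

    fitness_m, fitness_u = best_fitness(muted), best_fitness(unmuted)
    # A substantially higher unmuted fitness suggests the outcome hinges on the
    # protected attribute, i.e., possible unfairness toward this individual.
    return fitness_m, fitness_u
```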
5 Conclusion and Future Work

In this paper, we introduced CERTIFAI, a model-agnostic, flexible, and user-friendly technique that helps build responsible artificial intelligence systems. We demonstrate the flexibility that the genetic algorithm brings in providing feasible counterfactual explanations to a user. We show how individual and group fairness can be measured using the fitness values obtained during counterfactual generation. Finally, we show how these counterfactuals are effective adversarial examples, and we define CERScore, the first ever measure of robustness for a black-box model. We are currently developing the user interface to CERTIFAI. Future work involves speeding up the genetic algorithm with techniques such as [Harik et al., 1999] and [Mitchell et al., 1994]. A comparison between the introduced fairness metric and previous metrics would also be useful. It would also be interesting to see how our adversarial examples perform against strategies [Papernot et al., 2016], [Madry et al., 2017] that are aimed at handling adversarial attacks.

References

[Akhtar and Mian, 2018] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410-14430, 2018.
[Binns, 2017] Reuben Binns. Fairness in machine learning: Lessons from political philosophy. arXiv preprint arXiv:1712.03586, 2017.
[Carlini and Wagner, 2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks, 2017.
[Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system, 2016.
[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database, 2009.
[Donini et al., 2018] Michele Donini, Luca Oneto, Shai Ben-David, John S Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints, 2018.
[Guidotti et al., 2018a] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Dino Pedreschi, Franco Turini, and Fosca Giannotti. Local rule-based explanations of black box decision systems. arXiv preprint arXiv:1805.10820, 2018.
[Guidotti et al., 2018b] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):93, 2018.
[Hardt et al., 2016] Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning, 2016.
[Harik et al., 1999] Georges R Harik, Fernando G Lobo, and David E Goldberg. The compact genetic algorithm. IEEE Transactions on Evolutionary Computation, 3(4):287-297, 1999.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2016.
[Howard et al., 2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[Kohavi, 1996] Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid, 1996.
[LeCun, 1998] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[Madry et al., 2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[Mitchell et al., 1994] Melanie Mitchell, John H Holland, and Stephanie Forrest. When will a genetic algorithm outperform hill climbing, 1994.
[Nguyen et al., 2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, 2015.
[Papernot et al., 2016] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks, 2016.
[Ribeiro et al., 2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier, 2016.
[Russell, 2019] Chris Russell. Efficient search for diverse coherent explanations, 2019.
[Smith et al., 1988] Jack W Smith, JE Everhart, WC Dickson, WC Knowler, and RS Johannes. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, 1988.
[Szegedy et al., 2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2016.
[Ustun et al., 2019] Berk Ustun, Alexander Spangher, and Yang Liu. Actionable recourse in linear classification, 2019.
[Wachter et al., 2017] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2), 2017.
[Wang et al., 2002] Zhou Wang, Alan C Bovik, and Ligang Lu. Why is image quality assessment so difficult?, 2002.
[Wang et al., 2003] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment, 2003.
[Weng et al., 2018] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. arXiv preprint arXiv:1801.10578, 2018.