Diverse Adversaries for Mitigating Bias in Training
Xudong Han  Timothy Baldwin  Trevor Cohn
School of Computing and Information Systems
The University of Melbourne
Victoria 3010, Australia
xudongh1@student.unimelb.edu.au  {tbaldwin,tcohn}@unimelb.edu.au

arXiv:2101.10001v1 [cs.LG] 25 Jan 2021

Abstract

Adversarial learning can learn fairer and less biased models of language than standard methods. However, current adversarial techniques only partially mitigate model bias, added to which their training procedures are often unstable. In this paper, we propose a novel approach to adversarial learning based on the use of multiple diverse discriminators, whereby discriminators are encouraged to learn orthogonal hidden representations from one another. Experimental results show that our method substantially improves over standard adversarial removal methods, in terms of reducing bias and increasing the stability of training.

1 Introduction

While NLP models have achieved great successes, results can depend on spurious correlations with protected attributes of the authors of a given text, such as gender, age, or race. Including protected attributes in models can lead to problems such as leakage of personally-identifying information of the author (Li et al., 2018a), and unfair models, i.e., models which do not perform equally well for different sub-classes of user. This kind of unfairness has been shown to exist in many different tasks, including part-of-speech tagging (Hovy and Søgaard, 2015) and sentiment analysis (Kiritchenko and Mohammad, 2018).

One approach to diminishing the influence of protected attributes is to use adversarial methods, where an encoder attempts to prevent a discriminator from identifying the protected attributes in a given task (Li et al., 2018a). Specifically, an adversarial network is made up of an attacker and encoder, where the attacker detects protected information in the representation of the encoder, and the optimization of the encoder incorporates two parts: (1) minimizing the main loss, and (2) maximizing the attacker loss (i.e., preventing protected attributes from being detected by the attacker). Preventing protected attributes from being detected tends to result in fairer models, as protected attributes will more likely be independent rather than confounding variables. Although this method leads to demonstrably less biased models, there are still limitations, most notably that significant protected information still remains in the model’s encodings and prediction outputs (Wang et al., 2019; Elazar and Goldberg, 2018).

Many different approaches have been proposed to strengthen the attacker, including: increasing the discriminator hidden dimensionality; assigning different weights to the adversarial component during training; using an ensemble of adversaries with different initializations; and reinitializing the adversarial weights every t epochs (Elazar and Goldberg, 2018). Of these, the ensemble method has been shown to perform best, but independently-trained attackers can generally still detect private information after adversarial removal.

In this paper, we adopt adversarial debiasing approaches and present a novel way of strengthening the adversarial component via orthogonality constraints (Salzmann et al., 2010). Over a sentiment analysis dataset with racial labels of the document authors, we show our method to result in both more accurate and fairer models, with privacy leakage close to the lower-bound.¹

¹ Source code available at https://github.com/HanXudong/Diverse_Adversaries_for_M
2 Methodology

Formally, given an input xi annotated with main task label yi and protected attribute label gi, a main task model M is trained to predict ŷi = M(xi), and an adversary, aka “discriminator”, A is trained to predict ĝi = A(hM,i) from M’s last hidden layer representation hM,i. In this paper, we treat a neural network classifier as a combination of two connected parts: (1) an encoder E, and (2) a linear classifier C. For example, in the main task model M, the encoder EM is used to compute the hidden representation hM,i from an input xi, i.e., hM,i = EM(xi), and the linear classifier is used to make a prediction, ŷi = CM(hM,i). Similarly, for a discriminator, ĝi = A(hM,i) = CA(EA(hM,i)).

2.1 Adversarial Learning

Following the setup of Li et al. (2018a) and Elazar and Goldberg (2018), the optimisation objective for our standard adversarial training is:

    min_M max_A  X(y, ŷM) − λadv X(g, ĝA),

where X is the cross-entropy loss, and λadv is the trade-off hyperparameter. Solving this minimax optimization problem encourages the main task model hidden representation hM to be informative to CM and uninformative to A. Following Ganin and Lempitsky (2015), the above can be trained using stochastic gradient optimization with a gradient reversal layer for X(g, ĝA).
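To make the gradient-reversal training concrete, the sketch below shows one possible PyTorch implementation of a main model, a single discriminator, and one optimisation step of the minimax objective. The layer sizes follow Section 3 (2304-d DeepMoji inputs, a 300-d main hidden layer, 256-d discriminator hidden layers), but the class names, Tanh activations, and single shared optimizer are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lambda on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class MainModel(nn.Module):
    """Encoder E_M (two dense layers over fixed sentence embeddings) + linear classifier C_M."""
    def __init__(self, in_dim=2304, hid_dim=300, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.Tanh(),
            nn.Linear(hid_dim, hid_dim), nn.Tanh(),
        )
        self.classifier = nn.Linear(hid_dim, n_classes)

    def forward(self, x):
        h = self.encoder(x)            # h_{M,i}
        return self.classifier(h), h   # logits for ŷ_i, plus the hidden representation

class Discriminator(nn.Module):
    """Adversary A = C_A(E_A(h_M)): a 3-layer MLP over the main model's hidden representation."""
    def __init__(self, in_dim=300, hid_dim=256, n_protected=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.Tanh(),
            nn.Linear(hid_dim, hid_dim), nn.Tanh(),
        )
        self.classifier = nn.Linear(hid_dim, n_protected)

    def forward(self, h, lambd=1.0):
        h = grad_reverse(h, lambd)     # gradients flowing back into E_M are reversed
        h_a = self.encoder(h)          # h_{A,i}
        return self.classifier(h_a), h_a

def adversarial_step(model, disc, opt, x, y, g, lambda_adv=0.8):
    """One step of min_M max_A X(y, ŷ_M) - λ_adv X(g, ĝ_A), realised via gradient reversal."""
    y_hat, h = model(x)
    g_hat, _ = disc(h, lambd=lambda_adv)
    loss = F.cross_entropy(y_hat, y) + F.cross_entropy(g_hat, g)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the reversal layer flips (and scales by λadv) only the gradient that reaches the main encoder, a single backward pass lets the discriminator minimize its cross-entropy while the encoder maximizes it.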
2.2 Differentiated Adversarial Ensemble

Inspired by the ensemble adversarial method (Elazar and Goldberg, 2018) and domain separation networks (Bousmalis et al., 2016), we present differentiated adversarial ensemble, a novel means of strengthening the adversarial component. Figure 1 shows a typical ensemble architecture where k sub-discriminators are included in the adversarial component, leading to an averaged adversarial regularisation term:

    − (λadv / k) Σ_{j ∈ {1,…,k}} X(g, ĝAj).

Figure 1: Ensemble adversarial method. Dashed lines denote gradient reversal in adversarial learning. The k sub-discriminators Ai are independently initialized. Given a single input xi, the main task encoder computes a hidden representation hM,i, which is used as the input to the main model output layer and the sub-discriminators. From the k-th sub-discriminator, the estimated protected attribute label is ĝAk,i = CAk(EAk(hM,i)).

One problem associated with this ensemble architecture is that it cannot ensure that different sub-discriminators focus on different aspects of the representation. Indeed, experiments have shown that sub-discriminator ensembles can weaken the adversarial component (Elazar and Goldberg, 2018). To address this problem, we further introduce a difference loss (Bousmalis et al., 2016) to encourage the adversarial encoders to encode different aspects of the private information. As can be seen in Figure 1, hAk,i denotes the output from the k-th sub-discriminator encoder given a hidden representation hM,i, i.e., hAk,i = EAk(hM,i).

The difference loss encourages orthogonality between the encoding representations of each pair of sub-discriminators:

    Ldiff = λdiff Σ_{i,j ∈ {1,…,k}} ‖hAi⊺ hAj‖²F · 1(i ≠ j),

where ‖·‖²F is the squared Frobenius norm. Intuitively, sub-discriminator encoders must learn different ways of identifying protected information given the same input embeddings, resulting in less biased models than the standard ensemble-based adversarial method. According to Bousmalis et al. (2016), the difference loss has the additional advantage of also being minimized when hidden representations shrink to zero. Therefore, instead of minimizing the difference loss by learning rotated hidden representations (i.e., the same model), this method biases adversaries to have representations that are (a) orthogonal, and (b) low magnitude, with the degree of shrinkage governed by the weight decay of the optimizer.
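As a concrete sketch, the difference loss over a mini-batch can be computed from the list of sub-discriminator encodings as follows, assuming each hAj is a (batch × hidden) matrix; the function name and tensor layout are our own, not taken from the paper's implementation.

```python
import torch

def difference_loss(hidden_states, lambda_diff=1.0):
    """Pairwise orthogonality penalty over sub-discriminator encodings.

    hidden_states: list of k tensors, each of shape (batch, hidden_dim),
    where hidden_states[j] = E_{A_j}(h_M) for the j-th sub-discriminator.
    Returns lambda_diff * sum_{i != j} || h_{A_i}^T h_{A_j} ||_F^2.
    """
    loss = hidden_states[0].new_zeros(())
    k = len(hidden_states)
    for i in range(k):
        for j in range(k):
            if i == j:
                continue  # the indicator 1(i != j) skips self-pairs
            # (hidden_dim x batch) @ (batch x hidden_dim): cross-correlation of two encoders
            cross = hidden_states[i].t() @ hidden_states[j]
            loss = loss + (cross ** 2).sum()   # squared Frobenius norm
    return lambda_diff * loss
```

During training, this term is simply added to the main-task cross-entropy and the (gradient-reversed) adversarial terms before the backward pass; note that, like the formula above, the double loop counts each unordered pair twice.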
2.3 INLP

We include Iterative Null-space Projection (“INLP”: Ravfogel et al. (2020)) as a baseline method for mitigating bias in trained models, in addition to standard and ensemble adversarial methods. In INLP, a linear discriminator (Alinear) of the protected attribute is iteratively trained from pre-computed fixed hidden representations (i.e., hM) to project them onto the linear discriminator’s null-space, h∗M = P_N(Alinear) hM, where P_N(Alinear) is the null-space projection matrix of Alinear. In doing so, it becomes difficult for the protected attribute to be linearly identified from the projected hidden representations (h∗M), and any linear main-task classifier (C∗M) trained on h∗M can thus be expected to make fairer predictions.
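The INLP baseline can be paraphrased as the simplified loop below, which trains a linear discriminator with scikit-learn and repeatedly projects the representations onto its null-space. This is a sketch of the idea only: the choice of LinearSVC, the fixed number of iterations, and the single-direction projection per step are our simplifying assumptions, and Ravfogel et al. (2020) should be consulted for the exact algorithm.

```python
import numpy as np
from sklearn.svm import LinearSVC

def inlp(X, g, n_iters=20):
    """Iterative Null-space Projection (simplified, binary protected attribute).

    X: (n, d) fixed hidden representations h_M; g: (n,) protected attribute labels.
    Returns the projected representations and the accumulated projection matrix.
    """
    d = X.shape[1]
    P = np.eye(d)
    X_proj = X.copy()
    for _ in range(n_iters):
        clf = LinearSVC(dual=False).fit(X_proj, g)   # linear discriminator A_linear
        w = clf.coef_[0]
        w = w / np.linalg.norm(w)
        P_null = np.eye(d) - np.outer(w, w)          # projection onto the null-space of w
        P = P_null @ P
        X_proj = X_proj @ P_null                     # remove the direction A_linear exploits
    return X_proj, P
```

Each iteration removes one linear direction that predicts the protected attribute, so after enough iterations no linear classifier should recover it from h∗M.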
Model                         | Accuracy↑  | TPR Gap↓   | TNR Gap↓   | Leakage@h↓ | Leakage@ŷ↓
Random                        | 50.00±0.00 | 0.00±0.00  | 0.00±0.00  | —          | —
Fixed Encoder                 | 61.44±0.00 | 0.52±0.00  | 17.97±0.00 | 92.07±0.00 | 86.93±0.00
Standard                      | 71.59±0.05 | 31.81±0.29 | 48.41±0.27 | 85.56±0.20 | 70.09±0.19
INLP                          | 68.54±1.05 | 25.13±2.31 | 40.70±5.02 | 66.64±0.87 | 66.19±0.79
Adv Single Discriminator      | 74.25±0.39 | 13.01±3.83 | 28.55±3.60 | 84.33±0.98 | 61.48±2.17
Adv Ensemble                  | 74.08±0.99 | 12.04±3.50 | 31.76±3.19 | 85.31±0.51 | 63.23±3.62
Differentiated Adv Ensemble   | 74.52±0.28 | 8.42±1.84  | 24.74±2.07 | 84.52±0.50 | 61.09±2.32

Table 1: Evaluation results ± standard deviation (%) on the test set, averaged over 10 runs with different random seeds. Bold = best performance. “↑” and “↓” indicate that higher and lower performance, resp., is better for the given metric. Leakage measures the accuracy of predicting the protected attribute, over the final hidden representation h or model output ŷ. Since the Fixed Encoder is not designed for binary sentiment classification, we merge the original 64 labels into two categories based on the results of hierarchical clustering.

3 Experiments

Fixed Encoder  Following Elazar and Goldberg (2018) and Ravfogel et al. (2020), we use the DeepMoji model (Felbo et al., 2017) as a fixed-parameter encoder (i.e., it is not updated during training). The DeepMoji model is trained over 1246 million tweets containing one of 64 common emojis. We merge the 64 emoji labels output by DeepMoji into two super-classes based on hierarchical clustering: “happy” and “sad”.

Models  The encoder EM consists of a fixed pretrained encoder (DeepMoji) and two trainable fully connected layers (“Standard” in Table 1). Every linear classifier (C) is implemented as a dense layer. For protected attribute prediction, a discriminator (A) is a 3-layer MLP where the first 2 layers are collectively denoted as EA, and the output layer is denoted as CA.

TPR-GAP and TNR-GAP  In classification problems, a common way of measuring bias is TPR-GAP and TNR-GAP, which evaluate the gap in the True Positive Rate (TPR) and True Negative Rate (TNR), respectively, across different protected attributes (De-Arteaga et al., 2019). This measurement is related to the criterion that the prediction ŷ is conditionally independent of the protected attribute g given the main task label y (i.e., ŷ ⊥ g | y). Assuming a binary protected attribute, this conditional independence requires P{ŷ | y, g = 0} = P{ŷ | y, g = 1}, which implies an objective that minimizes the difference (GAP) between the two sides of the equation.

Linear Leakage  We also measure the leakage of protected attributes. A model is said to leak information if the protected attribute can be predicted at a higher accuracy than chance, in our case, from the hidden representations the fixed encoder generates. We empirically quantify leakage with a linear support vector classifier at two different levels:

• Leakage@h: the accuracy of recovering the protected attribute from the output of the final hidden layer after the activation function (hM).

• Leakage@ŷ: the accuracy of recovering the protected attribute from the output ŷ (i.e., the logits) of the main model.
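For reference, the gap metrics and the leakage probes could be computed roughly as follows. The helper names are ours, and the cross-validated linear probe is merely a stand-in for the paper's train/test probing protocol, which is not fully specified in the text above.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def rate_gaps(y_true, y_pred, groups):
    """Absolute TPR and TNR gaps between two protected groups (all labels binary 0/1)."""
    def tpr(y, p): return np.mean(p[y == 1] == 1)
    def tnr(y, p): return np.mean(p[y == 0] == 0)
    g0, g1 = (groups == 0), (groups == 1)
    tpr_gap = abs(tpr(y_true[g0], y_pred[g0]) - tpr(y_true[g1], y_pred[g1]))
    tnr_gap = abs(tnr(y_true[g0], y_pred[g0]) - tnr(y_true[g1], y_pred[g1]))
    return tpr_gap, tnr_gap

def leakage(representations, groups):
    """Accuracy of a linear SVM at recovering the protected attribute from
    hidden states (Leakage@h) or output logits (Leakage@ŷ).
    Values near 0.5 indicate near-chance leakage for a balanced binary attribute."""
    probe = LinearSVC(dual=False)
    return cross_val_score(probe, representations, groups, cv=5).mean()
```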
Data  We experiment with the dataset of Blodgett et al. (2016), which contains tweets that are either African American English (AAE)-like or Standard American English (SAE)-like (following Elazar and Goldberg (2018) and Ravfogel et al. (2020)). Each tweet is annotated with a binary “race” label (on the basis of AAE or SAE) and a binary sentiment score, which is determined by the (redacted) emoji within it. In total, the dataset contains 200k instances, perfectly balanced across the four race–sentiment combinations. To create bias in the dataset, we follow previous work in skewing the training data to generate race–sentiment combinations (AAE–happy, SAE–happy, AAE–sad, and SAE–sad) of 40%, 10%, 10%, and 40%, respectively. Note that we keep the test data unbiased.

Training Details  All models are trained and evaluated on the same training/test split. The Adam optimizer (Kingma and Ba, 2015) is used with learning rates of 3 × 10^−5 for the main model and 3 × 10^−6 for the sub-discriminators. The minibatch size is set to 1024. Sentence representations (2304d) are extracted from the DeepMoji encoder. The hidden size of each dense layer is 300 in the main model, and 256 in the sub-discriminators. We train M for 60 epochs and each A for 100 epochs, keeping the checkpoint model that performs best on the dev set. Similar to Elazar and Goldberg (2018), hyperparameters (λadv and λdiff) are tuned separately rather than jointly. λadv is tuned to 0.8 based on the standard (single-discriminator) adversarial learning method, and this setting is used for all other adversarial methods. When tuning λadv, we considered both overall performance and bias gap (both over the dev data). Since adversarial training can increase overall performance while decreasing the bias gap (see Figure 2), we select the adversarial model that achieves the best task performance. For adversarial ensemble and differentiated models, we tune the hyperparameters (number of sub-attackers and λdiff) to achieve a similar bias level while getting the best overall performance. To compare with a baseline ensemble method with a similar number of parameters, we also report results for an adversarial ensemble model with 3 sub-discriminators. The scalar hyperparameter of the difference loss (λdiff) is tuned through grid search from 10^−4 to 10^4, and set to 10^3.7. For the INLP experiments, fixed sentence representations are extracted from the same data split, and, following Ravfogel et al. (2020), both the discriminator and the classifier are implemented in scikit-learn as linear SVM classifiers (Pedregosa et al., 2011). We report Leakage@ŷ for INLP based on the predicted confidence scores of the linear SVM classifiers, which can be interpreted as logits.
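Putting the pieces together, one training step of the differentiated adversarial ensemble might look like the sketch below, reusing the hypothetical MainModel, Discriminator, and difference_loss helpers from the earlier sketches. The hyperparameter values mirror the Training Details above; the optimizer wiring and the choice to backpropagate the difference loss in the same pass are our assumptions, not a description of the authors' code.

```python
import torch
import torch.nn.functional as F

def differentiated_adv_step(model, sub_discs, opt_main, opt_discs,
                            x, y, g, lambda_adv=0.8, lambda_diff=10**3.7):
    """One step: main loss + averaged adversarial term + difference loss.

    model: MainModel; sub_discs: list of k Discriminators (each applies gradient
    reversal to its input, so a single backward pass realises the minimax);
    opt_main / opt_discs: e.g. Adam with lr 3e-5 / 3e-6, as in Training Details.
    """
    y_hat, h = model(x)
    main_loss = F.cross_entropy(y_hat, y)

    adv_losses, hidden_states = [], []
    for disc in sub_discs:
        g_hat, h_a = disc(h, lambd=lambda_adv)   # reversal scaled by lambda_adv
        adv_losses.append(F.cross_entropy(g_hat, g))
        hidden_states.append(h_a)

    # averaging over k gives the -(lambda_adv / k) * sum_j X(g, ĝ_{A_j}) regulariser
    # from the encoder's point of view, via the reversal layer
    adv_loss = torch.stack(adv_losses).mean()

    loss = main_loss + adv_loss + difference_loss(hidden_states, lambda_diff)

    opt_main.zero_grad()
    opt_discs.zero_grad()
    loss.backward()
    opt_main.step()
    opt_discs.step()
    return loss.item()
```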
Results and Analysis  Table 1 shows the results over the test set. Training on a biased dataset without any fairness restrictions leads to a biased model, as seen in the Gap and Leakage results for the Standard model. Consistent with the findings of Ravfogel et al. (2020), INLP can only reduce bias at the expense of overall performance. On the other hand, the Single Discriminator and Adv(ersarial) Ensemble baselines both enhance accuracy and reduce bias, consistent with the findings of Li et al. (2018a).

Compared to the Adv Ensemble baseline, incorporating the difference loss in our method has two main benefits: training is more stable (results have smaller standard deviation), and there is less bias (the TPR and TNR Gap are smaller). Without the orthogonality factor, Ldiff, the sub-discriminators tend to learn similar representations, and the ensemble degenerates to a standard adversarial model. Simply relying on random initialization to ensure sub-discriminator diversity, as is done in the Adv Ensemble method, is insufficient. The orthogonality regularization in our method leads to more stable and overall better results in terms of both accuracy and TPR/TNR Gap.

As shown in Table 1, even the Fixed Encoder model leaks protected information, as a result of implicit biases during pre-training. INLP achieves significant improvement in terms of reducing linear hidden representation leakage. The reason is that Leakage@h is directly correlated with the objective of INLP, in minimizing the linear predictability of the protected attribute from h. Adversarial methods do little to mitigate Leakage@h, but substantially decrease Leakage@ŷ in the model output. However, both types of leakage are well above the ideal value of 50%, and therefore none of these methods can be considered as providing meaningful privacy, in part because of the fixed encoder. This finding implies that when applying adversarial learning, the pretrained model needs to be fine-tuned with the adversarial loss to have any chance of generating a truly unbiased hidden representation. Despite this, adversarial training does reduce the TPR and TNR Gap, and improves overall accuracy, which illustrates the utility of the method for both bias mitigation and as a form of regularisation.

Overall, our proposed method empirically outperforms the baseline models in terms of debiasing, with a better performance–fairness trade-off.
Figure 2: λadv sensitivity analysis, averaged over 10 runs for a single discriminator adversarial model. Main task accuracy of group SAE (blue) and AAE (orange), TPR-GAP (green), and TNR-GAP (red) are reported. [x-axis: log10 λadv; y-axes: Accuracy (%) and Gap (%).]

Figure 3: λdiff sensitivity analysis for differentiated adversarial models with 3, 5, and 8 sub-discriminators, in terms of the main task accuracy of group SAE (blue) and AAE (orange), and TPR-GAP (green) and TNR-GAP (red). [x-axis: log10 λdiff; y-axes: Accuracy (%) and Gap (%).]

Robustness to λadv  We first evaluate the influence of the trade-off hyperparameter λadv in adversarial learning. As can be seen from Figure 2, λadv controls the performance–fairness trade-off. Increasing λadv from 10^−2 to around 10^0, TPR Gap and TNR Gap consistently decrease, while the accuracy of each group rises. To balance accuracy and fairness, we set λadv to 10^−0.1. We also observe that an overly large λadv can lead to a more biased model (starting from about 10^1.2).

Robustness to λdiff  Figure 3 presents the results of our model with different λdiff values, for N ∈ {3, 5, 8} sub-discriminators.

First, note that when λdiff is small (i.e., the left side of Figure 3), our Differentiated Adv Ensemble model reduces to the standard Adv Ensemble model. For differing numbers of sub-discriminators, performance is similar, i.e., increasing the number of sub-discriminators beyond 3 does not improve results substantially, but does come with a computational cost. This implies that an Adv Ensemble model learns approximately the same thing as larger ensembles (but more efficiently), where the sub-discriminators can only be explicitly differentiated by their weight initializations (with different random seeds), noting that all sub-discriminators are otherwise identical in architecture, input, and optimizer.

Increasing the weight of the difference loss through λdiff has a positive influence on results, but an overly large value makes the sub-discriminators underfit, and both reduces accuracy and increases TPR/TNR Gap. We observe a negative correlation between N and λdiff, the main reason being that Ldiff is not averaged over N and, as a result, a large N and λdiff force the sub-discriminators to pay too much attention to orthogonality, impeding their ability to bleach out the protected attributes.

Overall, we empirically show that λdiff only needs to be tuned for Adv Ensemble, since the results for different Differentiated Adv models for a given setting achieve similar results. That is, λdiff can safely be tuned separately with all other hyperparameters fixed.

4 Conclusion and Future Work

We have proposed an approach to enhance sub-discriminators in adversarial ensembles by introducing a difference loss. Over a tweet sentiment classification task, we showed that our method substantially improves over standard adversarial methods, including ensemble-based methods.

In future work, we intend to perform experimentation over other tasks. Theoretically, our approach is general-purpose, and can be used not only for adversarial debiasing but also any other application where adversarial training is used, such as domain adaptation (Li et al., 2018b).
Acknowledgments

We thank Lea Frermann, Shivashankar Subramanian, and the anonymous reviewers for their helpful feedback and suggestions.

References

Su Lin Blodgett, Lisa Green, and Brendan O’Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130.

Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. 2016. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351.

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, pages 120–128.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1615–1625.

Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 1180–1189.

Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 483–488.

Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018a. Towards robust and privacy-preserving text representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 25–30.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018b. What’s in a domain? Learning domain-robust text representations using adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 474–479.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. 2020. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7237–7256.

Mathieu Salzmann, Carl Henrik Ek, Raquel Urtasun, and Trevor Darrell. 2010. Factorized orthogonal latent spaces. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 701–708.

Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. 2019. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE International Conference on Computer Vision, pages 5310–5319.