The risks of using AI in business: artificial intelligence and real discrimination
Advancing economics in business | October 2020
Algorithms influence many aspects of our work and social lives. They affect what adverts we see, what shows we watch, and whether we get a job. As these tools become increasingly widespread, they pose new challenges to businesses. We look at concerns regarding the use of algorithms in areas where the role of computer programs and complex modelling has traditionally been limited, and consider whether AI might result in illegal discrimination.

This article is the second in a series investigating some of the economic consequences of using algorithms and the associated risks to businesses. Describing AI and its various forms,1 our previous article highlighted that businesses face heightened risks when using AI, including regulatory and reputational risks.

As algorithms are used more and more in our daily lives, they raise concerns about their legality in areas where computer programs and complex modelling have traditionally played a limited role—for instance, regarding forms of discrimination that are often proscribed and carefully monitored under national law (e.g. when hiring new employees).

There are now numerous instances in which the outcomes of program-based processes are considered unfair or biased. As discussed in the previous article, one recent prominent example is the public uproar following the use of an algorithm by Ofqual (the Office of Qualifications and Examinations Regulation) in the UK to ‘predict’ students’ A-level results this year when their exams were cancelled due to the COVID-19 pandemic.2 Less overt cases include:3 crime-prevention programs that target specific communities; online advertisement algorithms that are directed towards those who are less wealthy, offering pay-day loans; and job-recruitment programs that penalise women. Algorithms are also used to correct existing social imbalances.

How can AI be affected by human bias and be prone to being discriminatory? What went wrong with Ofqual’s algorithm, and how could this have been prevented? How can trusted and lawful algorithms be designed?

What is the link between AI and discrimination?

However sophisticated a program, algorithm or AI system may be, it is designed to find patterns in data (differences and similarities) that help it to accurately predict outcomes, match individuals, or classify objects or individuals.4 By design, AI ‘discriminates’ as it separates datasets into clusters on the basis of shared characteristics, applying different rules to different clusters. This segmentation and the resulting differences in the rules applied simply reflect the best way identified by the algorithm to achieve a particular objective given the data.

Yet the programs involve human intervention at all stages: the algorithm and its objective function are coded by a programmer, the data used to train the algorithm is collected by a human, and the algorithm is rewarded for replicating, to some extent, human decisions. Human-driven discrimination, then, can be introduced at all stages of the process.

How can a program be affected by human biases?

As humans are involved at key stages of the development of an algorithm, programs can recreate, and reinforce, discrimination from human behaviour. Exactly how do human biases affect algorithms?

Individuals can be biased against others outside of their own social group, exhibiting prejudice, stereotyping and discrimination. Machines do not have all of these preconceptions and biases, but they are designed by humans. In addition, the way in which their programming operates may also lead to perverse feedback loops.

In practice, human biases can affect programs in a number of ways.

Unrepresentative or insufficient training and test data. If the training and test datasets are not representative of the overall population, the program will make incorrect predictions for the part of the population that is under-represented. However, even a representative sample of the population may not be enough: it needs to be sufficiently large that the margin of error is low across the various sub-groups of the dataset.5
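To illustrate why a good headline accuracy figure can hide poor performance for an under-represented group (the situation described in note 5), the sketch below builds a deliberately skewed toy dataset and compares overall accuracy with per-group accuracy. The data, the group labels and the trivial ‘model’ are invented for illustration and do not describe any particular system.

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical data: 1,000 individuals, roughly 80% in group A and 20% in group B.
# The outcome follows a different pattern in each group; a little noise keeps
# neither group perfectly predictable.
population = []
for _ in range(1000):
    group = "A" if random.random() < 0.8 else "B"
    x = random.random()
    outcome = int(x > 0.5)
    if group == "B":
        outcome = 1 - outcome          # the minority group follows the opposite pattern
    if random.random() < 0.05:
        outcome = 1 - outcome          # noise
    population.append({"group": group, "x": x, "outcome": outcome})

# A naive 'model' that has only learnt the majority pattern.
def predict(row):
    return int(row["x"] > 0.5)

def accuracy(rows):
    return mean(int(predict(r) == r["outcome"]) for r in rows)

print(f"Overall accuracy: {accuracy(population):.0%}")
for g in ("A", "B"):
    subgroup = [r for r in population if r["group"] == g]
    print(f"Group {g} ({len(subgroup)} observations): {accuracy(subgroup):.0%}")
```

Reporting accuracy separately for each sub-group, rather than only in aggregate, is the simplest safeguard against this failure mode.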
The type of information provided to the program about the population. Although programs are designed to find patterns in the data, they operate based on the set of variables defined by the human programmer. If these variables are incomplete, approximate, or selected in a biased way (even unconsciously), the program will find patterns that may reflect the programmer’s perceptions rather than providing an objective analysis.

As an example, if female applicants to a tech company have a significantly lower chance of being hired (based on historical data) than male applicants,6 a CV-sifting program is likely to predict that female candidates are generally not as good as male candidates. By not considering the possibility that the hiring process may have been biased in the past, the design of the program repeats and automates human biases. A solution would be to exclude sensitive variables such as gender. However, such a simple fix may not work when other factors in the dataset are strongly correlated with the excluded sensitive variables, such as the type of extracurricular activities or courses taken.

This bias can also emerge when the algorithm uses one factor as a proxy for another because the information about the right factor is not known or included in the dataset. This may be the case, for example, when the program uses gender or race as a proxy for behaviour. In the case of car insurance, women in France used to pay lower premia than men, as they are, on average, less prone to accidents.7 Ideally, the program would have taken into account characteristics other than gender that are associated with a lower probability of having an accident, but it is fundamentally interested only in parameters that enable it to predict the probability of accidents. If gender and the probability of accidents are correlated, discrimination can occur by generalising to an entire group something that is observed, on average, more frequently (but not always) within this group than across the entire population.
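As a minimal sketch of this proxy problem, the hypothetical example below drops the sensitive variable entirely, yet a ‘gender-blind’ rule learnt from biased historical decisions still scores men and women differently, because an innocuous-looking feature (the extracurricular activity) is correlated with gender. All names, rates and the scoring rule are invented for illustration.

```python
import random
from statistics import mean

random.seed(2)

# Hypothetical hiring history. Ability is distributed identically across genders,
# but past decisions were biased, and the extracurricular activity is strongly
# correlated with gender (the proxy).
applicants = []
for _ in range(5000):
    gender = "F" if random.random() < 0.5 else "M"
    if gender == "F":
        activity = "netball" if random.random() < 0.8 else "chess"
    else:
        activity = "chess" if random.random() < 0.8 else "netball"
    skill = random.random()
    hired = skill > (0.5 if gender == "M" else 0.7)   # biased historical decision
    applicants.append({"gender": gender, "activity": activity, "hired": hired})

# A 'gender-blind' rule learnt from the biased history: it never sees gender and
# simply reproduces the historical hiring rate for each activity.
score_by_activity = {
    a: mean(r["hired"] for r in applicants if r["activity"] == a)
    for a in ("chess", "netball")
}

for g in ("M", "F"):
    group = [r for r in applicants if r["gender"] == g]
    avg = mean(score_by_activity[r["activity"]] for r in group)
    print(f"Average predicted score, gender {g}: {avg:.2f}")
```

Comparing the scores across the excluded sensitive attribute, which the auditor (but not the model) can see, is the simplest way to detect this kind of indirect discrimination.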
Prediction errors. The best program (from the perspective of its objective) is not perfect and still makes mistakes even with a representative dataset, as programs cannot model every particular set of circumstances. For instance, a credit-scoring algorithm can take into account only accessible factors such as employment history, available savings, and current debt. This means that some candidates who would be able to meet the financial obligations may still fail to obtain credit. This problem of ‘collateral damage’ can be exacerbated by poor design, by the programmer, of the model applied by the program.

The way the program learns, or does not learn. Machine learning programs that rely on reinforcement processes (i.e. learning from the impact of their previous actions) are subject to the equivalent of human confirmation bias: a program may detect an action that encourages it to act further in the same direction, sometimes in a way that is socially discriminatory. A classic example is an algorithm used in the USA8 that dispatches police to areas where more petty crimes were previously reported. The initial decisions of where to send the police may be biased by racism. Though the original issue with this program was the ‘tainted’ data, the way the program operated could amplify the issue by reinforcing the bias.

Even programs that do not learn can exacerbate existing disadvantages in society. For example, algorithms used to optimise staffing in some professions (e.g. cafes),9 although very effective at their objective of lowering costs and customer waiting time, have led to unpredictable working hours, as programs optimise the workforce almost on a daily basis.
Designing algorithms that do not discriminate: lessons from economics

Economists have analysed individual biases that lead to discrimination and are familiar with the design of statistical models. As a consequence, economists are in a unique position to help design AI in a way that prevents creating or perpetuating unlawful discrimination. In fact, when undertaken properly, algorithm design may help to mitigate the impact of historical discrimination.

Establishing, and correcting, potential algorithmic discrimination can be done before the algorithm is implemented (ex ante), or once it has been rolled out (ex post), as illustrated in Figure 1.

Figure 1: A framework to ensure discrimination-free AI
Note: The ‘Algorithm in use’ step does not mean that the algorithm is publicly available—it can also refer to an internal testing period. Source: Oxera.

Ex ante, the risk of implementing an unlawful algorithm can be reduced by carefully controlling the design of its objectives and the dataset used to train it. Controlling these objectives is likely to require coordination between the programming teams and those in charge of tackling discrimination. Training programmers to build ‘ethical’ AI is also a hot topic in the data science profession.10

More difficult to identify is bias arising due to the dataset failing to be representative of the population. In this regard, lessons could be identified from the design of statistical surveys, which are a common tool for decision-making and where, quite often, it is challenging to collect data in a way that is representative of the whole population.

When it is possible to influence data collection, a representative sample should be constructed. The first step would be to collect a large enough random sample of the population to ensure that all groups are sufficiently represented. It is also possible to use ‘stratified sampling’, in which the population is divided into sub-populations, and individuals are randomly selected within each sub-population. Each sub-group needs to be large enough to ensure reliable classifications or predictions.

If it is not possible to influence data collection, controls need to be introduced to ensure that the data is representative of the population. Simple summary statistics on the sample collected can be compared to the underlying population. If the sample is not representative, more weight could be given to certain data points to reflect the size of the different sub-groups in the population.11
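The weighting idea described above (and in note 11) can be made concrete with a small sketch: each observation is weighted by the ratio of its group’s share in the population to its share in the sample, so that weighted statistics reflect the population rather than the skewed sample. The shares and outcomes below are invented for illustration.

```python
# Known population shares (e.g. from census data) versus what was collected.
# All numbers are invented for the illustration.
population_share = {"women": 0.50, "men": 0.50}
sample_counts    = {"women": 300,  "men": 700}

total = sum(sample_counts.values())
sample_share = {g: n / total for g, n in sample_counts.items()}

# Each observation is weighted so that the weighted sample matches the population.
weights = {g: population_share[g] / sample_share[g] for g in sample_counts}
print(weights)   # women ~1.67, men ~0.71

# Any statistic is then computed with these weights, e.g. a weighted mean outcome.
def weighted_mean(values, groups):
    numerator = sum(v * weights[g] for v, g in zip(values, groups))
    denominator = sum(weights[g] for g in groups)
    return numerator / denominator

outcomes = [1, 0, 1, 1]
groups   = ["women", "men", "women", "men"]
print(f"Weighted share of positive outcomes: {weighted_mean(outcomes, groups):.2f}")
```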
Finally, it is crucial to check that the list of factors that the algorithm uses is not biased. This is not an easy task. While humans can easily detect obvious potential sources of bias, they often cannot detect bias arising from spurious correlations. In such cases, ex post assessment could be used to identify any required changes to the original factors.

Ex post, discrimination can be identified from the results of the algorithm, without any knowledge of its technical functioning. The idea is to analyse the impact of the program on various sub-groups within the population to check whether any is being treated differently from the others. This would be an extension of ‘field experiments’. In a field experiment, two or more groups of individuals are randomly allocated to different situations and results are compared between the groups.12 To avoid discrimination, field experiments have been used in CV sifting.13 Fictitious and identical job applications are sent to companies, with the only difference being a criterion potentially prone to discrimination (for instance, a man versus a woman). As these applications are randomly sent to companies, it has been possible to identify whether certain groups are, on average, discriminated against.

This process could be extended to all types of algorithms. Newly programmed algorithms could be forced to make a large number of decisions on similar cases, except for one criterion subject to discrimination. The outcome of these decisions could be analysed to uncover potential discrimination before the algorithm is made public. This could have prevented the A-level controversy, described in the box below.
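A minimal sketch of this kind of pre-release audit, assuming the algorithm can be called on synthetic cases: generate many pairs of profiles that are identical except for the audited criterion, score both, and check whether the average gap differs from zero. The scoring function below is a deliberately biased stand-in, invented so that the audit has something to detect; in practice it would be replaced by the real model.

```python
import random
from statistics import mean

random.seed(3)

def score_applicant(profile):
    """Stand-in for the algorithm under audit. In practice this would be the
    real model; here it is a deliberately biased toy rule so that the audit
    has something to detect."""
    score = 0.6 * profile["experience"] + 0.4 * profile["test_result"]
    if profile["gender"] == "F":
        score -= 0.05                  # the hidden flaw the audit should expose
    return score

# Build matched pairs: identical profiles that differ only in the audited criterion.
gaps = []
for _ in range(10_000):
    base = {"experience": random.random(), "test_result": random.random()}
    gap = score_applicant({**base, "gender": "M"}) - score_applicant({**base, "gender": "F"})
    gaps.append(gap)

print(f"Average score gap (M minus F): {mean(gaps):.3f}")
# A gap significantly different from zero flags potential discrimination
# before the algorithm is put into use.
```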
Finally, the ex post assessment can include an analysis of whether the program reinforces existing differences within the population. By assessing which sub-group in the population may be adversely affected by the program’s decisions, it is possible to assess whether the program tends to exacerbate existing imbalances.

It is conceivable that, having undertaken all checks, the program leads to part of the population being more adversely affected. In a way, that is a logical outcome of the program trying to discriminate between individuals to reach a particular objective. It would be contrary to insurers’ business models, for example, if they did not take into account all the (legally) relevant parameters to decide insurance premia. Yet the outcome of this process is that some people end up paying significantly more for insurance.

If the results are statistically sound and non-discriminatory, but socially questionable, algorithms can be used to proactively ‘correct’ the results in a way that is socially preferable, especially when the algorithm is used by public entities (e.g. the justice system and schools). For example, in France, ParcourSup has the explicit goal of partially compensating for the lower likelihood of students from poorer backgrounds attending the most prestigious schools and universities. The system is designed such that colleges and universities use their own algorithms to select students. The ParcourSup algorithm then adjusts the universities’ rankings in order to increase the proportion of students with scholarships that gain places at the prestigious institutions.

It is preferable to make this final adjustment as an explicit extra step in the algorithm design, as opposed to tweaking the initial algorithm to directly achieve the socially more acceptable results. Indeed, the final adjustment involves a degree of social judgement and perception, which must be undertaken carefully to avoid discrimination.
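The design choice argued for here, keeping the social correction as a separate and visible step rather than folding it into the underlying model, can be sketched as follows. The ranking rule, the bonus and the field names are invented, and are only loosely inspired by the ParcourSup example.

```python
def base_ranking(candidates):
    # Step 1: the selection algorithm itself, left untouched.
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

def social_adjustment(ranked, bonus=0.10):
    # Step 2: an explicit, documented correction applied on top. Here,
    # scholarship holders receive a fixed bonus before the final ordering;
    # the size of the bonus is a social-policy choice, not a statistical one.
    return sorted(ranked,
                  key=lambda c: c["score"] + (bonus if c["scholarship"] else 0.0),
                  reverse=True)

candidates = [
    {"name": "a", "score": 0.90, "scholarship": False},
    {"name": "b", "score": 0.85, "scholarship": True},
    {"name": "c", "score": 0.88, "scholarship": False},
]
print([c["name"] for c in social_adjustment(base_ranking(candidates))])
# -> ['b', 'a', 'c']: the correction is visible and can be debated on its own terms
```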
The 2020 A-level controversy in the UK

The algorithm used to predict this year’s exam results in the UK induced differences in treatment across students. By imposing each school’s 2019 distribution of grades in 2020, the algorithm failed to take into account how students’ performance in a school may have differed from that of the previous year’s cohort, disregarding predicted grades from teachers. This made it very unlikely that outstanding students in poorly performing schools would be allocated the top grades.14 It also treated students from large schools taking popular courses more harshly than those in smaller schools or on courses with fewer students.

To correct for this, the right technique to apply would depend on the source of the bias that led to the grade inflation in the first place. If the bias is perceived as applying equally across teachers (an application of the ‘optimism bias’), a more socially acceptable approach could have been to downgrade teachers’ grades by the same amount across the entire student population (e.g. by the average at the national level compared with the previous year). If the bias is at the school level (i.e. teachers in some schools are more affected by optimism bias than others), teachers’ grades could have been downgraded by an amount specific to each school (e.g. by the average difference at each school compared with the previous year). In both approaches, the distribution of grades within schools is identified by teachers, and does not necessarily reflect the distribution observed in the previous year. It should be highlighted that the second approach assumes that a school could not achieve better results in 2020 than in 2019 overall—such an assumption could be considered unfair in itself.

It is important to identify the students who are more likely to be penalised by the application of the algorithm, especially if there are different ways to reach the same overall outcome. For example, Ofqual assessed how the 2019 grades would have turned out were its approach used in previous years (i.e. based on the 2018 rankings). The accuracy was found to be around 50−60%. Ofqual also argued that it considered distributional effects and concluded that ‘the analyses show no evidence that this year’s process of awarding grades has introduced bias’.

Source: Ofqual (2020), ‘Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report’.
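The two corrections discussed in the box can be sketched in a few lines, assuming grades on a numeric scale: a uniform national downgrade versus a school-specific downgrade, both anchored to the previous year’s averages. All figures are invented; this is not Ofqual’s method, only an illustration of the alternatives described above.

```python
from statistics import mean

# Teacher-assessed grades for 2020 (numeric scale) and each school's 2019 average.
# All figures are invented.
teacher_2020 = {
    "school_1": [7.0, 6.5, 8.0],
    "school_2": [5.0, 6.0, 5.5],
}
average_2019 = {"school_1": 6.5, "school_2": 5.2}

# Approach 1: uniform downgrade by the national gap between 2020 teacher grades
# and 2019 results (assumes the optimism bias is the same everywhere).
national_2020 = mean(g for grades in teacher_2020.values() for g in grades)
national_2019 = mean(average_2019.values())
uniform_gap = national_2020 - national_2019
adjusted_uniform = {s: [round(g - uniform_gap, 2) for g in grades]
                    for s, grades in teacher_2020.items()}

# Approach 2: school-specific downgrade by each school's own gap (assumes no
# school improved overall relative to 2019, which is itself debatable).
adjusted_by_school = {
    s: [round(g - (mean(grades) - average_2019[s]), 2) for g in grades]
    for s, grades in teacher_2020.items()
}

print(adjusted_uniform)
print(adjusted_by_school)
```

In both variants the within-school ordering set by teachers is preserved; only the overall level is shifted, which is the property highlighted in the box.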
Reflections (in the mirror)

Although we use machines to make decisions, these decisions can be subject to human bias and discrimination. Since they use data collected by humans, are designed by humans, and have objectives driven by human interests, programs—regardless of their degree of sophistication—can create, reproduce or exacerbate discrimination. However, with the right approach to design and testing, it is usually possible to identify and reduce biases in programs.

Even when a program is designed and implemented in the correct way, it is still possible that its outcome is perceived to be unfair. This is because algorithms are a mirror for reality: we need to make sure that the mirror we are using is non-distortionary. However, if the underlying reality is disagreeable—for instance, as a consequence of historical and structural forms of discrimination—this will be reflected in a model’s results. Nevertheless, after the necessary anti-discrimination adjustments suggested in this article have been performed, a transparently designed algorithm can go one step further and correct outcomes that society deems unacceptable.

Contact

Pascale Déchamps, Partner
pascale.déchamps@oxera.com

Ambroise Descamps
ambroise.descamps@oxera.com

Sarah Raviola
sarah.raviola@oxera.com

Gareth Shier
gareth.shier@oxera.com

Notes

1 See Oxera (2020), ‘The risks of using algorithms in business: demystifying AI’, September, https://bit.ly/3kEoFbv.
2 For the full 319-page technical report from Ofqual on its algorithm design and outcomes, see Ofqual (2020), ‘Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report’, August, https://bit.ly/31PFNUd.
3 For an overview of some of the programs with the highest potential for massive adverse societal impact, see O’Neil, C. (2016), Weapons of Math Destruction, Penguin.
4 See Oxera (2020), ‘The risks of using algorithms in business: demystifying AI’, September, https://bit.ly/3kEoFbv.
5 As an example, a dataset with 1,000 observations may be sufficiently large to get a prediction right 95% of the time. However, if the population (and the dataset) includes a relevant group that represents, say, 20% of the population (200 observations), it is possible that the algorithm will be wrong for this group 20% of the time (i.e. right only 80% of the time, rather than 95%), with the prediction being wrong for the rest of the population only 1.25% of the time. This may be due to the sub-sample being too small to lead to reliable results across sub-groups.
6 See Dastin, J. (2018), ‘Amazon scraps secret AI recruiting tool that showed bias against women’, Reuters, October, https://reut.rs/34z312M.
7 See, for example, LeLynx.fr, ‘Pourquoi les femmes payent-elles moins cher?’ (in French), https://bit.ly/2G5VnUa. It is now illegal in the EU to use gender as a rating variable for insurance premium calculation (Gender Directive of 2012).
8 See Meliani, L. (2018), ‘Machine Learning at PredPol: Risks, Biases, and Opportunities for Predictive Policing’, Assignment: RC TOM Challenge 2018, Harvard Business School, November, https://hbs.me/2TyRyKc.
9 See Quinyx.com (2020), ‘Starbucks took a new approach’, https://bit.ly/327jKc7.
10 See Stolzoff, S. (2018), ‘Are Universities Training Socially Minded Programmers?’, June, https://bit.ly/2HGLTj2.
11 For instance, if a sample contains fewer observations for women than for men, while in the population the groups are of equivalent size, a ‘weighting’ would duplicate certain observations for women to reach the proportions of the true population.
12 This approach is commonly used in economics. For instance, the 2019 Nobel Prize was awarded to Banerjee, Duflo and Kremer ‘for their experimental approach to alleviating global poverty’. See Bertrand, M. and Mullainathan, S. (2004), ‘Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination’, American Economic Review, 94:4, pp. 991−1013, https://bit.ly/3kBJrJ1; and Ahmed, A. M. and Hammarstedt, M. (2008), ‘Discrimination in the rental housing market: A field experiment on the Internet’, Journal of Urban Economics, 64:2, pp. 362−72, https://bit.ly/35BSB1K.
13 See Bertrand, M. and Mullainathan, S. (2004), ‘Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination’, American Economic Review, 94:4, pp. 991−1013, https://bit.ly/3oySxZh.
14 For more discussion of how this algorithm was designed and implemented, see Thomson, D. (2020), ‘A-Level results 2020: How have grades been calculated?’, FFT Education Datalab, August, https://bit.ly/34z7UJ5; and Clarke, L. (2020), ‘How the A-level results algorithm was fatally flawed’, New Statesman, August, https://bit.ly/3mvxaWZ.