The risks of using AI in business: artificial intelligence and real discrimination

Advancing economics in business

October 2020


Algorithms influence many aspects of our work and social lives. They affect what adverts we see, what shows we watch, and whether we get a job. As these tools become increasingly widespread, they pose new challenges to businesses. We look at concerns regarding the use of algorithms in areas where the role of computer programs and complex modelling has traditionally been limited, and consider whether AI might result in illegal discrimination.

This article is the second in a series investigating some of the economic consequences of using algorithms and the associated risks to businesses. Describing AI and its various forms,1 our previous article highlighted that businesses face heightened risks when using AI, including regulatory and reputational risks.

As algorithms are used more and more in our daily lives, they raise concerns about their legality in areas where computer programs and complex modelling have traditionally played a limited role—for instance, regarding forms of discrimination that are often proscribed and carefully monitored under national law (e.g. when hiring new employees).

There are now numerous instances in which the outcomes of program-based processes are considered unfair or biased. As discussed in the previous article, one recent prominent example is the public uproar following the use of an algorithm by Ofqual (the Office of Qualifications and Examinations Regulation) in the UK to 'predict' students' A-level results this year when their exams were cancelled due to the COVID-19 pandemic.2 Less overt cases include:3 crime-prevention programs that target specific communities; online advertisement algorithms that are directed towards those who are less wealthy, offering pay-day loans; and job-recruitment programs that penalise women. Algorithms are also used to correct existing social imbalances.

How can AI be affected by human bias and be prone to being discriminatory? What went wrong with Ofqual's algorithm and how could this have been prevented? How can trusted and lawful algorithms be designed?

What is the link between AI and discrimination?

However sophisticated a program, algorithm or AI system may be, it is designed to find patterns in data (differences and similarities) that help it to accurately predict outcomes, match individuals, or classify objects or individuals.4 By design, AI 'discriminates' as it separates datasets into clusters on the basis of shared characteristics, applying different rules to different clusters. This segmentation and the resulting differences in the rules applied simply reflect the best way identified by the algorithm to achieve a particular objective given the data.

Yet the programs involve human intervention at all stages: the algorithm and its objective function are coded by a programmer, the data used to train the algorithm is collected by a human, and the algorithm is rewarded for replicating, to some extent, human decisions. Human-driven discrimination, then, can be introduced at all stages of the process.

How can a program be affected by human biases?

As humans are involved at key stages of the development of an algorithm, programs can recreate, and reinforce, discrimination from human behaviour. Exactly how do human biases affect algorithms?

Individuals can be biased against others outside of their own social group, exhibiting prejudice, stereotyping and discrimination. Machines do not have all of these preconceptions and biases, but they are designed by humans. In addition, the way in which their programming operates may also lead to perverse feedback loops.

In practice, human biases can affect programs in a number of ways.

Unrepresentative or insufficient training and test data. If the training and test datasets are not representative of the overall population, the program will make incorrect predictions for the part of the population that is under-represented. However, even a representative sample of the population may not be enough: it needs to be sufficiently large that the margin of error is low across the various sub-groups of the dataset.5

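The point in note 5 can be made concrete with a few lines of code. The sketch below, which uses purely hypothetical data and column names, computes accuracy separately for each sub-group rather than for the sample as a whole; a healthy overall figure can coexist with much weaker performance on a small or under-represented group.

```python
import pandas as pd

# Hypothetical evaluation data: one row per individual, with the model's
# prediction, the true outcome and a sub-group label.
results = pd.DataFrame({
    "group":      ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "prediction": [1, 0, 1, 1, 1, 1, 0, 0, 1, 0],
    "actual":     [0, 1, 0, 1, 1, 1, 0, 0, 1, 0],
})

# Overall accuracy can look acceptable (70% here)...
overall = (results["prediction"] == results["actual"]).mean()

# ...while accuracy within the smaller group A is only 25%.
by_group = (
    results.assign(correct=results["prediction"] == results["actual"])
           .groupby("group")["correct"]
           .mean()
)

print(f"Overall accuracy: {overall:.0%}")
print(by_group)
```

Reporting performance by sub-group, rather than in aggregate, is therefore a simple first check on whether a dataset is large enough for each group it is meant to cover.
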
The type of information provided to the program about the population. Although programs are designed to find patterns in the data, they operate based on the set of variables defined by the human programmer. If these variables are incomplete, approximate, or selected in a biased way (even unconsciously), the program will find patterns that may reflect the programmer's perceptions rather than providing an objective analysis.

As an example, if female applicants to a tech company have a significantly lower chance of being hired (based on historical data) than male applicants,6 a CV-sifting program is likely to predict that female candidates are generally not as good as male candidates. By not considering the possibility that the hiring process may have been biased in the past, the design of the program repeats and automates human biases. A solution would be to exclude sensitive variables such as gender. However, such a simple fix may not work when other factors in the dataset are strongly correlated with the excluded sensitive variables, such as the type of extracurricular activities or courses taken.

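One way to gauge whether dropping a sensitive variable is likely to be an effective fix is to test how accurately the remaining variables can predict it. The sketch below is a minimal illustration of that idea, not a description of any particular system: the file name and column names are hypothetical, and the data is assumed to sit in a pandas DataFrame with a binary gender column. If the retained variables predict the excluded one much better than chance, they act as proxies and the information has not really been removed.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical applicant data; 'gender' has been excluded from the CV-sifting
# model, while the other columns remain available to it.
applicants = pd.read_csv("applicants.csv")
features = pd.get_dummies(
    applicants[["extracurricular_type", "courses_taken", "years_experience"]]
)
sensitive = applicants["gender"]

# How well do the retained variables predict the excluded sensitive variable?
proxy_score = cross_val_score(
    LogisticRegression(max_iter=1000), features, sensitive, cv=5
).mean()
baseline = sensitive.value_counts(normalize=True).max()  # always guess the majority class

print(f"Predicting gender from retained variables: {proxy_score:.0%}")
print(f"Baseline (majority class share): {baseline:.0%}")
```

A proxy score well above the baseline suggests that factors such as extracurricular activities or courses taken still carry the excluded information, so the bias described above can persist.
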
This bias can also emerge when the algorithm uses one factor as a proxy for another because the information about the right factor is not known or included in the dataset. This may be the case, for example, when the program uses gender or race as a proxy for behaviour. In the case of car insurance, women in France used to pay lower premia than men, as they are, on average, less prone to accidents.7 Ideally, the program would have taken into account characteristics other than gender that are associated with a lower probability of having an accident, but it is fundamentally interested only in parameters that enable it to predict the probability of accidents. If gender and the probability of accidents are correlated, discrimination can occur by generalising to an entire group something that is observed, on average, more frequently (but not always) within this group than across the entire population.

Prediction errors. The best program (from the perspective of its objective) is not perfect and still makes mistakes even with a representative dataset, as programs cannot model every particular set of circumstances. For instance, a credit-scoring algorithm can take into account only accessible factors such as employment history, available savings, and current debt. This means that some candidates who would be able to meet the financial obligations may still fail to obtain credit. This problem of 'collateral damage' can be exacerbated by poor design of the model by the programmer.

The way the program learns, or does not learn. Machine learning programs that rely on reinforcement processes (i.e. learning from the impact of their previous actions) are subject to the equivalent of human confirmation bias: a program may detect an action that encourages it to act further in the same direction, sometimes in a way that is socially discriminatory. A classic example is an algorithm used in the USA8 that dispatches police to areas where more petty crimes were previously reported. The initial decisions of where to send the police may be biased by racism. Though the original issue with this program was the 'tainted' data, the way the program operated could amplify the issue by reinforcing the bias.


Even programs that do not learn can exacerbate existing disadvantages in society. For example, algorithms used to optimise staffing in some professions (e.g. cafes),9 although very effective at their objective of lowering costs and customer waiting time, have led to unpredictable working hours, as programs optimise the workforce almost on a daily basis.

Designing algorithms that do not discriminate: lessons from economics

Economists have analysed individual biases that lead to discrimination and are familiar with the design of statistical models. As a consequence, economists are in a unique position to help design AI in a way that avoids creating or perpetuating unlawful discrimination. In fact, when undertaken properly, algorithm design may help to mitigate the impact of historical discrimination.

Establishing, and correcting, potential algorithmic discrimination can be done before the algorithm is implemented (ex ante), or once it has been rolled out (ex post), as illustrated in Figure 1.

Figure 1 A framework to ensure discrimination-free AI
Note: The 'Algorithm in use' step does not mean that the algorithm is publicly available—it can also refer to an internal testing period.
Source: Oxera.

Ex ante, the risk of implementing an unlawful algorithm can be reduced by carefully controlling the design of its objectives and the dataset used to train it. Controlling these objectives is likely to require coordination between the programming teams and those in charge of tackling discrimination. Training programmers to build 'ethical' AI is also a hot topic in the data science profession.10

More difficult to identify is bias arising from the dataset failing to be representative of the population. In this regard, lessons could be drawn from the design of statistical surveys, which are a common tool for decision-making and where, quite often, it is challenging to collect data in a way that is representative of the whole population.

When it is possible to influence data collection, a representative sample should be constructed. The first step would be to collect a large enough random sample of the population to ensure that all groups are sufficiently represented. It is also possible to use 'stratified sampling', in which the population is divided into sub-populations and individuals are randomly selected within each sub-population. Each sub-group needs to be large enough to ensure reliable classifications or predictions.

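A minimal sketch of the stratified sampling idea, assuming a pandas DataFrame called population with a region column used as the stratification variable (both hypothetical): individuals are drawn at random within each sub-population, so every stratum appears in the sample in a controlled way.

```python
import pandas as pd

population = pd.read_csv("population.csv")  # hypothetical sampling frame

# Stratified sampling: divide the population into sub-populations (here by
# 'region') and draw the same fraction at random within each one.
sample = population.groupby("region").sample(frac=0.10, random_state=0)

# Check that each stratum is large enough to support reliable estimates.
print(sample["region"].value_counts())
```

Drawing a fixed number of observations per stratum, rather than a fixed fraction, is an alternative when some sub-populations are very small.
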
If it is not possible to influence data collection, controls need to be introduced to ensure that the data is representative of the population. Simple summary statistics on the sample collected can be compared to the underlying population. If the sample is not representative, more weight could be given to certain data points to reflect the size of different sub-groups in the population.11

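The reweighting described above (and in note 11) can be as simple as giving each observation a weight equal to its group's share of the population divided by its share of the sample. The sketch below assumes hypothetical population shares and an unrepresentative sample with a gender column; the resulting weights can then be passed to most estimation or training routines.

```python
import pandas as pd

# An unrepresentative sample: 70% men, 30% women.
sample = pd.DataFrame({"gender": ["m"] * 70 + ["f"] * 30})

# Known (or assumed) shares of each group in the true population.
population_share = {"m": 0.5, "f": 0.5}

# Weight = population share / sample share, so the under-represented group
# counts for more: women get roughly 1.67, men roughly 0.71.
sample_share = sample["gender"].value_counts(normalize=True)
sample["weight"] = sample["gender"].map(
    lambda g: population_share[g] / sample_share[g]
)

print(sample.groupby("gender")["weight"].first())
```
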
Finally, it is crucial to check that the list of factors that the algorithm uses is not biased. This is not an easy task. While humans can easily detect obvious potential sources of bias, they often cannot detect bias arising from spurious correlations. In such cases, ex post assessment could be used to identify any required changes to the original factors.

Ex post, discrimination can be identified from the results of the algorithm, without any knowledge of its technical functioning. The idea is to analyse the impact of the program on various sub-groups within the population to check whether any is being treated differently from the others. This would be an extension of 'field experiments'. In a field experiment, two or more groups of individuals are randomly allocated to different situations and the results are compared between the groups.12 Field experiments have been used to uncover discrimination in CV sifting.13 Fictitious and identical job applications are sent to companies, with the only difference being a criterion potentially prone to discrimination (for instance, a man versus a woman). As these applications are randomly sent to companies, it has been possible to identify whether certain groups are, on average, discriminated against.

This process could be extended to all types of algorithms. Newly programmed algorithms could be forced to make a large number of decisions on similar cases that differ only in one criterion subject to discrimination, as in the sketch below. The outcome of these decisions could then be analysed to uncover potential discrimination before the algorithm is made public. This could have prevented the A-level controversy, described in the box below.

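A minimal version of such an audit, under stated assumptions: model is any fitted scoring or decision function with a predict method returning 1 for a favourable decision, and the column name is hypothetical. Each case is scored twice, identical except for the one criterion of interest, and the average difference in outcomes is reported.

```python
import pandas as pd

def audit_flip(model, cases: pd.DataFrame, column: str, value_a, value_b) -> float:
    """Score identical cases twice, differing only in one attribute,
    and return the average difference in favourable outcomes."""
    variant_a = cases.copy()
    variant_b = cases.copy()
    variant_a[column] = value_a
    variant_b[column] = value_b
    return model.predict(variant_a).mean() - model.predict(variant_b).mean()

# Hypothetical usage with a fitted CV-sifting model and a set of test cases:
# gap = audit_flip(cv_model, applicants, "gender", "f", "m")
# A gap close to zero suggests the attribute alone does not change decisions;
# a large gap flags potential discrimination before the algorithm goes live.
```
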
Finally, the ex post assessment can include an analysis of whether the program reinforces existing differences within the population. By assessing which sub-group in the population may be adversely affected by the program's decisions, it is possible to assess whether the program tends to exacerbate existing imbalances.

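A simple ex post check of this kind compares the rate of favourable outcomes across sub-groups once decisions have been produced. The sketch below uses illustrative data and hypothetical column names; in practice the decisions would come from the program being assessed.

```python
import pandas as pd

decisions = pd.DataFrame({
    "group":    ["A"] * 50 + ["B"] * 50,
    "approved": [1] * 35 + [0] * 15 + [1] * 20 + [0] * 30,  # illustrative outcomes
})

# Share of favourable decisions per sub-group.
rates = decisions.groupby("group")["approved"].mean()

# Ratio of the least-favoured to the most-favoured group: values well below 1
# indicate that the program may be reinforcing existing imbalances.
print(rates)
print(f"Outcome ratio between groups: {rates.min() / rates.max():.2f}")
```
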
It is conceivable that, having undertaken all of these checks, the program still leads to part of the population being more adversely affected. In a way, that is a logical outcome of the program trying to discriminate between individuals to reach a particular objective. It would be contrary to insurers' business models, for example, if they did not take into account all the (legally) relevant parameters to decide insurance premia. Yet the outcome of this process is that some people end up paying significantly more for insurance.

If the results are statistically sound and non-discriminatory, but socially questionable, algorithms can be used to proactively 'correct' the results in a way that is socially preferable, especially when the algorithm is used by public entities (e.g. the justice system and schools). For example, in France, ParcourSup has the explicit goal of partially compensating for the lower likelihood of students from poorer backgrounds attending the most prestigious schools and universities. The system is designed such that colleges and universities use their own algorithms to select students. The ParcourSup algorithm then adjusts the universities' rankings in order to increase the proportion of students with scholarships that gain places at the prestigious institutions.

It is preferable to make this final adjustment an explicit extra step in the algorithm design, as opposed to tweaking the initial algorithm to directly achieve the socially more acceptable results. Indeed, the final adjustment involves a degree of social judgement and perception, which must be undertaken carefully to avoid discrimination.

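As a purely hypothetical sketch of what keeping the correction explicit might look like: the original scoring algorithm is left untouched, and a clearly labelled adjustment is applied to its output afterwards, so the social judgement embodied in the correction remains visible and auditable. The column names and the size of the boost are illustrative only and do not describe ParcourSup.

```python
import pandas as pd

def rank_candidates(candidates: pd.DataFrame) -> pd.Series:
    """Step 1: the original algorithm, unchanged (a placeholder score here)."""
    return candidates["academic_score"]

def social_adjustment(candidates: pd.DataFrame, scores: pd.Series) -> pd.Series:
    """Step 2: an explicit, documented correction applied after the fact,
    e.g. a boost for scholarship holders (size chosen for illustration)."""
    return scores + 5.0 * candidates["scholarship"]

candidates = pd.DataFrame({
    "academic_score": [82.0, 78.0, 90.0],
    "scholarship":    [0, 1, 0],
})
raw = rank_candidates(candidates)
adjusted = social_adjustment(candidates, raw)
print(candidates.assign(raw=raw, adjusted=adjusted))
```
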

The 2020 A-level controversy in the UK

The algorithm used to predict this year's exam results in the UK induced differences in treatment across students. By imposing each school's 2019 distribution of grades in 2020, the algorithm failed to take into account how students' performance in a school may have differed from that of the previous year's cohort, disregarding predicted grades from teachers. This made it very unlikely that outstanding students in poorly performing schools would be allocated the top grades.14 It also treated students from large schools taking popular courses more harshly than those in smaller schools or those taking courses with fewer students.

To correct for this, the right technique to apply would depend on the source of the bias that led to the grade inflation in the first place. If the bias is perceived as applying equally across teachers (an application of the 'optimism bias'), a more socially acceptable approach could have been to downgrade teachers' grades by the same amount across the entire student population (e.g. by the average at the national level compared with the previous year). If the bias is at the school level (i.e. teachers in some schools are more affected by optimism bias than others), teachers' grades could have been downgraded by an amount specific to each school (e.g. by the average difference at each school compared with the previous year). In both approaches, the distribution of grades within schools is identified by teachers, and does not necessarily reflect the distribution observed in the previous year. It should be highlighted that the second approach assumes that a school could not achieve better results overall in 2020 than in 2019—such an assumption could be considered unfair in itself.

It is important to identify the students who are more likely to be penalised by the application of the algorithm, especially if there are different ways to reach the same overall outcome. For example, Ofqual assessed how the 2019 grades would have turned out had its approach been used in previous years (i.e. based on the 2018 rankings). The accuracy was found to be around 50−60%. Ofqual also argued that it considered distributional effects and concluded that 'the analyses show no evidence that this year's process of awarding grades has introduced bias'.

Source: Ofqual (2020), 'Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report'.

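The two correction approaches described in the box can be written down in a few lines. The sketch below uses invented numbers and column names purely for illustration: the first approach subtracts one national estimate of grade optimism from every teacher-assessed grade, while the second subtracts a school-specific estimate; in both cases the ranking of students within a school, as identified by teachers, is preserved.

```python
import pandas as pd

# Hypothetical teacher-assessed grades for 2020, on an arbitrary numeric scale.
grades = pd.DataFrame({
    "school":        ["X", "X", "Y", "Y", "Y"],
    "teacher_grade": [7.0, 6.0, 8.0, 9.0, 7.0],
})

# Illustrative estimates of how far teacher grades exceed the previous year.
national_optimism = 0.8
optimism_by_school = pd.Series({"X": 0.5, "Y": 1.0})

# Approach 1: bias assumed equal across teachers, so one national downgrade.
grades["adj_national"] = grades["teacher_grade"] - national_optimism

# Approach 2: bias assumed school-specific, so downgrade by each school's estimate.
grades["adj_school"] = grades["teacher_grade"] - grades["school"].map(optimism_by_school)

print(grades)
```
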

Reflections (in the mirror)

Although we use machines to make decisions, these decisions can be subject to human bias and discrimination. Since they use data collected by humans, are designed by humans, and have objectives driven by human interests, programs—regardless of their degree of sophistication—can create, reproduce or exacerbate discrimination. However, with the right approach to design and testing, it is usually possible to identify and reduce biases in programs.

Even when a program is designed and implemented in the correct way, it is still possible that its outcome is perceived to be unfair. This is because algorithms are a mirror for reality: we need to make sure that the mirror we are using is non-distortionary. However, if the underlying reality is disagreeable—for instance, as a consequence of historical and structural forms of discrimination—this will be reflected in a model's results. Nevertheless, after the necessary adjustments correcting for discriminatory behaviour suggested in this article have been performed, a transparently designed algorithm can go one step further and correct outcomes that society deems unacceptable.

Contact

Pascale Déchamps
pascale.dechamps@oxera.com

Ambroise Descamps
ambroise.descamps@oxera.com

Sarah Raviola
sarah.raviola@oxera.com

Gareth Shier
gareth.shier@oxera.com

Notes

1 See Oxera (2020), 'The risks of using algorithms in business: demystifying AI', September, https://bit.ly/3kEoFbv.

2 For the full 319-page technical report from Ofqual on its algorithm design and outcomes, see Ofqual (2020), 'Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report', August, https://bit.ly/31PFNUd.

3 For an overview of some of the programs with the highest potential for massive adverse societal impact, see O'Neil, C. (2016), Weapons of Math Destruction, Penguin.

4 See Oxera (2020), 'The risks of using algorithms in business: demystifying AI', September, https://bit.ly/3kEoFbv.

5 As an example, a dataset with 1,000 observations may be sufficiently large to get a prediction right 95% of the time. However, if the population (and the dataset) includes a relevant group that represents, say, 20% of the population (200 observations), it is possible that the algorithm will be right for this group only 20% of the time, instead of 95% of the time (with the prediction being wrong for the rest of the population only 1.25% of the time). This may be due to the sub-sample being too small to lead to reliable results across sub-groups.

6 See Dastin, J. (2018), 'Amazon scraps secret AI recruiting tool that showed bias against women', Reuters, October, https://reut.rs/34z312M.

7 See, for example, LeLynx.fr, 'Pourquoi les femmes payent-elles moins cher?' ('Why do women pay less?', in French), https://bit.ly/2G5VnUa. It is now illegal in the EU to use gender as a rating variable for insurance premium calculation (Gender Directive of 2012).

8 See Meliani, L. (2018), 'Machine Learning at PredPol: Risks, Biases, and Opportunities for Predictive Policing', Assignment: RC TOM Challenge 2018, Harvard Business School, November, https://hbs.me/2TyRyKc.

9 See Quinyx.com (2020), 'Starbucks took a new approach', https://bit.ly/327jKc7.

10 See Stolzoff, S. (2018), 'Are Universities Training Socially Minded Programmers?', June, https://bit.ly/2HGLTj2.

11 For instance, if a sample contains fewer observations for women than for men, while in the population the groups are of equivalent size, a 'weighting' would duplicate certain observations with women to reach the proportions of the true population.

12 This approach is commonly used in economics. For instance, the 2019 Nobel Prize was awarded to Banerjee, Duflo and Kremer 'for their experimental approach to alleviating global poverty'. See Bertrand, M. and Mullainathan, S. (2004), 'Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination', American Economic Review, 94:4, pp. 991−1013, https://bit.ly/3kBJrJ1; and Ahmed, A. M. and Hammarstedt, M. (2008), 'Discrimination in the rental housing market: A field experiment on the Internet', Journal of Urban Economics, 64:2, pp. 362−72, https://bit.ly/35BSB1K.

13 See Bertrand, M. and Mullainathan, S. (2004), 'Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination', American Economic Review, 94:4, pp. 991−1013, https://bit.ly/3oySxZh.

14 For more discussion of how this algorithm was designed and implemented, see Thomson, D. (2020), 'A-Level results 2020: How have grades been calculated?', FFT Education Datalab, August, https://bit.ly/34z7UJ5; and Clarke, L. (2020), 'How the A-level results algorithm was fatally flawed', New Statesman, August, https://bit.ly/3mvxaWZ.