FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity

M. Arnold,1 R. K. E. Bellamy,1 M. Hind,1 S. Houde,1 S. Mehta,2 A. Mojsilović,1 R. Nair,1 K. Natesan Ramamurthy,1 D. Reimer,1 A. Olteanu,∗ D. Piorkowski,1 J. Tsay,1 and K. R. Varshney1

IBM Research
1 Yorktown Heights, New York, 2 Bengaluru, Karnataka

arXiv:1808.07261v2 [cs.CY] 7 Feb 2019

∗ A. Olteanu’s work was done while at IBM Research. Author is currently affiliated with Microsoft Research.

Abstract

Accuracy is an important concern for suppliers of artificial intelligence (AI) services, but considerations beyond accuracy, such as safety (which includes fairness and explainability), security, and provenance, are also critical elements to engender consumers’ trust in a service. Many industries use transparent, standardized, but often not legally required documents called supplier’s declarations of conformity (SDoCs) to describe the lineage of a product along with the safety and performance testing it has undergone. SDoCs may be considered multi-dimensional fact sheets that capture and quantify various aspects of the product and its development to make it worthy of consumers’ trust. Inspired by this practice, we propose FactSheets to help increase trust in AI services. We envision such documents to contain purpose, performance, safety, security, and provenance information to be completed by AI service providers for examination by consumers. We suggest a comprehensive set of declaration items tailored to AI and provide examples for two fictitious AI services in the appendix of the paper.

1     Introduction

Artificial intelligence (AI) services, such as those containing predictive models trained through machine learning, are increasingly key pieces of products and decision-making workflows. A service is a function or application accessed by a customer via a cloud infrastructure, typically by means of an application programming interface (API). For example, an AI service could take an audio waveform as input and return a transcript of what was spoken as output, with all complexity hidden from the user, all computation done in the cloud, and all models used to produce the output pre-trained by the supplier of the service. A second, more complex example would provide an audio waveform translated into a different language as output. The second example illustrates that a service can be made up of many different models (speech recognition, language translation, possibly sentiment or tone analysis, and speech synthesis) and is thus a distinct concept from a single pre-trained machine learning model or library.

In many different application domains today, AI services are achieving impressive accuracy. In certain areas, high accuracy alone may be sufficient, but deployments of AI in high-stakes decisions, such as credit applications, judicial decisions, and medical recommendations, require greater trust in AI services. Although there is no scholarly consensus on the specific traits that imbue trustworthiness in people or algorithms [1, 2], fairness, explainability, general safety, security, and transparency are some of the issues that have raised public concern about trusting AI and threatened the further adoption of AI beyond low-stakes uses [3, 4]. Despite active research and development to address these issues, there is no mechanism yet for the creator of an AI service to communicate how they are addressed in a deployed version. This is a major impediment to broad AI adoption.

Toward transparency for developing trust, we propose a FactSheet for AI Services. A FactSheet will contain sections on all relevant attributes of an AI service, such as intended use, performance, safety, and security. Performance will include appropriate accuracy or risk measures along with timing information. Safety, discussed in [5, 3] as the minimization of both risk and epistemic uncertainty, will include explainability, algorithmic fairness, and robustness to dataset shift. Security will include robustness to adversarial attacks. Moreover, the FactSheet will list how the service was created, trained, and deployed along with what scenarios it was tested on, how it may respond to untested scenarios, guidelines that specify what tasks it should and should not be used for, and any ethical concerns of its use. Hence, FactSheets help prevent overgeneralization and unintended use of AI services by solidly grounding them with metrics and usage scenarios.

A FactSheet is modeled after a supplier’s declaration of conformity (SDoC). An SDoC is a document to “show that a product, process or service conforms to a standard or technical regulation, in which a supplier provides written assurance [and evidence] of conformity to the specified requirements,” and is used in many different industries and sectors including telecommunications and transportation [6]. Importantly, SDoCs are often voluntary and tests reported in SDoCs are conducted by the supplier itself rather than by third parties [7]. This distinguishes self-declarations from certifications that are mandatory and must have tests conducted by third parties. We propose that FactSheets for AI services be voluntary initially; we provide further discussion on their possible evolution in later sections.

Our proposal of AI service FactSheets is inspired by, and builds upon, recent work that focuses on increased transparency for datasets [8, 9, 10] and models [11], but is distinguished from these in that we focus on the final AI service. We take this focus for three reasons:

 1. AI services constitute the building blocks for many AI applications. Developers will query the service API and consume its output. An AI service can be an amalgam of many models trained on many datasets. Thus, the models and datasets are (direct and indirect) components of an AI service, but they are not the interface to the developer.

 2. Often, there is an expertise gap between the producer and consumer of an AI service. The production team relies heavily on the training and creation of one or more AI models and hence will mostly contain data scientists. The consumers of the service tend to be developers. When such an expertise gap exists, it becomes more crucial to communicate the attributes of the artifact in a standardized way, as with Energy Star or food nutrition labels.

 3. Systems composed of safe components may be unsafe and, conversely, it may be possible to build safe systems out of unsafe components, so it is prudent to also consider transparency and accountability of services in addition to datasets and models. In doing so, we take a functional perspective on the overall service, and can test for performance, safety, and security aspects that are not relevant for a dataset in isolation, such as generalization accuracy, explainability, and adversarial robustness.

Loukides et al. propose a checklist that has some of the elements we seek [12].

Our aim is not to give the final word on the contents of AI service FactSheets, but to begin the conversation on the types of information and tests that may be included. Moreover, determining a single comprehensive set of FactSheet items is likely infeasible as the context and industry domain will often determine what items are needed. One would expect that higher-stakes applications will require more comprehensive FactSheets. Our main goal is to help identify a common set of properties. A multi-stakeholder approach, including numerous AI service suppliers and consumers, standards bodies, and civil society and professional organizations, is essential to converge onto standards. It will only be then that we as a community will be able to start producing meaningful FactSheets for AI services.

The remainder of the paper is organized as follows. Section 2 overviews related work, including labeling, safety, and certification standards in other industries. Section 3 provides more details on the key issues to enable trust in AI systems. Section 4 describes the AI service FactSheet in more detail, giving examples of questions that it should include. In Section 5, we discuss how FactSheets can evolve from a voluntary process to one that could be an industry requirement. Section 6 covers challenges, opportunities, and future work needed to achieve the widespread usage of AI service declarations of conformity. A proposed complete set of sections and items for a FactSheet is included in the appendix, along with sample FactSheets for two exemplary fictitious services, fingerprint verification and trending topics in social media.
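
To make the FactSheet contents named in this introduction more concrete (intended use, performance, safety, security, and how the service was created), the following is a minimal sketch of a FactSheet as a structured record. The class name, field names, the `missing_sections` helper, and the example service are all hypothetical and chosen for illustration only; the full set of items proposed by the paper is the one given in its appendix, not this sketch.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class FactSheet:
    """Hypothetical record with one field per section named in the text."""
    service_name: str
    intended_use: str
    performance: dict = field(default_factory=dict)  # accuracy/risk and timing items
    safety: dict = field(default_factory=dict)       # explainability, fairness, dataset shift
    security: dict = field(default_factory=dict)     # robustness to adversarial attacks
    lineage: dict = field(default_factory=dict)      # how it was created, trained, deployed

    def missing_sections(self):
        """Return the names of sections the supplier has left empty."""
        return [name for name, value in asdict(self).items() if not value]


# Illustrative, partially completed sheet for a made-up service.
sheet = FactSheet(
    service_name="example-speech-to-text",
    intended_use="Transcribe spoken audio into text.",
    performance={"metric": "to be declared by the supplier"},
)

print(sheet.missing_sections())
```

A consumer-side check such as `missing_sections` is one possible way to notice declarations that a supplier has not yet completed; it says nothing about whether the completed items are accurate.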

2     Related Work

This section discusses related work in providing transparency in the creation of AI services, as well as a brief survey of ensuring trust in non-AI systems.

2.1    Transparency in AI

Within the last year, several research groups have advocated standardizing and sharing information about training datasets and trained models. Gebru et al. proposed the use of datasheets for datasets as a way to expose and standardize information about public datasets, or datasets used in the development of commercial AI services and pre-trained models [8]. The datasheet would include provenance information, key characteristics, and relevant regulations, but also significant, yet more subjective information, such as potential bias, strengths and weaknesses, and suggested uses. Bender and Friedman propose a data statement schema, as a way to capture and convey the information and properties of a dataset used in natural language processing (NLP) research and development [9]. They argue that data statements should be included in most writing on NLP, including: papers presenting new datasets, papers reporting experimental work with datasets, and documentation for NLP systems.

Holland et al. outline the dataset nutrition label, a diagnostic framework that provides a concise yet robust and standardized view of the core components of a dataset [10]. Academic conferences such as the International AAAI Conference on Web and Social Media are also starting special tracks for dataset papers containing detailed descriptions, collection methods, and use cases.

Subsequent to the first posting of this paper [13], Mitchell et al. propose model cards to convey information that characterizes the evaluation of a machine learning model in a variety of conditions and disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information [11]. There is also budding activity in industry on auditing and labeling algorithms for accuracy, bias, consistency, transparency, fairness, and timeliness [14, 15], but this audit does not cover several aspects of safety, security, and lineage.

Our proposal is distinguished from prior work in that we focus on the final AI service, a distinct concept from a single pre-trained machine learning model or dataset. Moreover, we take a broader view on trustworthy AI that extends beyond principles, values and ethical purpose to also include technical robustness and reliability [16].

2.2     Enabling Trust in Other Domains

Enabling trust in systems is not unique to AI. This section provides an overview of mechanisms used in other domains and industries to achieve trust. The goal is to understand existing approaches to help inspire the right directions for enabling trust in AI services.

2.2.1   Standards Organizations

Standardization organizations, such as the IEEE [17] and ISO [18], define standards along with the requirements that need to be satisfied for a product or a process to meet the standard. The product developer can self-report that a product meets the standard, though there are several cases, especially with ISO standards, where an independent accredited body will verify that the standards are met and provide the certification.

2.2.2   Consumer Products

The United States Consumer Product Safety Commission (CPSC) [19] requires a manufacturer or importer to declare its product as compliant with applicable consumer product safety requirements in a written or electronic declaration of conformity. In many cases, this can be self-reported by the manufacturer or importer, i.e. an SDoC. However, in the case of children’s products, it is mandatory to have the testing performed by a CPSC-accepted laboratory for compliance. Durable infant or toddler products must be marked with specialized tracking labels and must have a postage-paid customer registration card attached, to be used in case of a recall.

The National Parenting Center has a Seal of Approval program [20] that conducts testing on a variety of children’s products involving interaction with the products by parents, children, and educators, who fill out questionnaires for the products they test. The quality of a product is determined based on factors like the product’s level of desirability, sturdiness, and interactive stimulation. Both statistical averaging and comments from testers are examined before providing a Seal of Approval for the product.

2.2.3   Finance

In the financial industry, corporate bonds are rated by independent rating services [21, 22] to help an investor assess the bond issuer’s financial strength or its ability to pay a bond’s principal and interest in a timely fashion. These letter-grade ratings range from AAA or Aaa for safe, ‘blue-chip’ bonds to C or D for ‘junk’ bonds. On the other hand, common-stock investments are not rated independently. Rather, the Securities and Exchange Commission (SEC) requires potential issuers of stock to submit specific registration documents that disclose extensive financial information about the company and risks associated with the future operations of the company. The SEC examines these documents, comments on them, and expects corrections based on the comments. The final product is a prospectus approved by the SEC that is available for potential buyers of the stock.

2.2.4 Software

In the software area, there have been recent attempts to certify digital data repositories as ‘trusted.’ Trustworthiness involves both the quality of the data and sustainable, reliable access to the data. The goal of certification is to enhance scientific reproducibility. The European Framework for Audit and Certification [23] has three levels of certification, Core, Extended, and Formal (or Bronze, Silver, and Gold), having different requirements, mainly to distinguish between the requirements of different types of data, e.g. research data vs. human health data vs. financial transaction data. The CoreTrustSeal [24], a private legal entity, provides a Bronze level certification to an interested data repository, for a nominal fee.

There have been several proposals in the literature for software certifications of various kinds. Ghosh and McGraw [25] propose a certification process for testing software components for security properties. Their technique involves a process and a set of white-box and black-box testing procedures that eventually result in a stamp of approval in the form of a digital signature. Schiller [26] proposes a certification process that starts with a checklist with yes/no answers provided by the developer, and determines which tests need to be performed on the software to certify it. Currit et al. [27] describe a procedure for certifying the reliability of software before its release to the users. They predict the performance of the software on unseen inputs using the MTTF (mean time to failure) metric. Port and Wilf [28] describe a procedure to certify the readiness for software release, understanding the tradeoff in cost of too early a release due to failures in the field, versus the cost in personnel and schedule delay arising from more extensive testing. Their technique involves the software developer filling out a questionnaire called the Software Review and Certification Record (SRCR), which is ‘credentialed’ with signatories who approve the document prior to the release decision. Heck et al. [29] also describe a software product certification model to certify legislative compliance or acceptability of software delivered during outsourcing. The basis for certification is a questionnaire to be filled out by the developer. The only acceptable answers to the questions are yes and n/a (not applicable).

A different approach is taken in the CERT Secure Coding Standards [30] of the Software Engineering Institute. Here the emphasis is on documenting best practices and coding standards for security purposes. The secure coding standards consist of guidelines about the types of security flaws that can be injected through development with specific programming languages. Each guideline offers precise information describing the cause and impact of violations, and examples of common non-compliant (flawed) and compliant (fixed) code. The organization also provides tools that audit code to identify security flaws as indicated by violations of the CERT secure coding standards.

2.2.5   Environmental Impact Statements

Environmental law in the United States requires that an environmental impact statement (EIS) be prepared prior to starting large construction projects. An EIS is a document used as a tool for decision making that describes positive and negative environmental effects of a proposed action. It is made available both to federal agencies and to the public, and captures impacts to endangered species, air quality, water quality, cultural sites, and the socioeconomics of local communities. The federal law, the National Environmental Policy Act, has inspired similar laws in various jurisdictions and in other fields beyond the environment. Selbst has proposed an algorithmic impact statement for AI that follows the form and purpose of EISs [31].

2.2.6    Human Subjects

In addition to products and technologies, another critical endeavor requiring trust is research involving human subjects. Institutional review boards (IRB) have precise reviewing protocols and requirements such as those presented in the Belmont Report [32]. Items to be completed include statement of purpose, participant selection, procedures to be followed, harms and benefits to subjects, confidentiality, and consent documents. As AI services increasingly make inferences for people and about people [33], IRB requirements increasingly apply to them.

2.2.7    Summary

To ensure trust in products, industries have established a variety of practices to convey information about how a product is expected to perform when utilized by a consumer. This information usually includes how the product was constructed and tested. Some industries allow product creators to voluntarily provide this information, whereas others explicitly require it. When the information is required, some industries require the information to be validated by a third party. One would expect the latter scenario to occur in mature industries where there is confidence that the requirements strongly correlate with safety, reliability, and overall trust in the product. Mandatory external validation of nascent requirements in emerging industries may unnecessarily stifle the development of the industry.

3       Elements of Trust in AI Systems

We drive cars trusting the brakes will work when the pedal is pressed. We undergo laser eye surgery trusting the system to make the right decisions. We accept that the autopilot will operate an airplane, trusting that it will navigate correctly. In all these cases, trust comes from confidence that the system will err extremely rarely, leveraging system training, exhaustive testing, experience, safety measures and standards, best practices, and consumer education.

Every time new technology is introduced, it creates new challenges, safety issues, and potential hazards. As the technology develops and matures, these issues are better understood, documented, and addressed. Human trust in technology is developed as users overcome perceptions of risk and uncertainty [34], i.e., as they are able to assess the technology’s performance, reliability, safety, and security. Consumers do not yet trust AI like they trust other technologies because of inadequate attention given to the latter two of these issues [35]. Making technical progress on safety and security is necessary but not sufficient to achieve trust in AI, however; the progress must be accompanied by the ability to measure and communicate the performance levels of the service on these dimensions in a standardized and transparent manner. One way to accomplish this is to provide such information via FactSheets for AI services.

Trust in AI services will come from: a) applying general safety and reliability engineering methodologies across the entire lifecycle of an AI service, b) identifying and addressing new, AI-specific issues and challenges in an ongoing and agile way, and c) creating standardized tests and transparent reporting mechanisms on how such a service operates and performs. In this section we outline several areas of concern and how they uniquely apply to AI. The crux of this discussion is the manifestation of risk and uncertainty in machine learning, including that data distributions used for training are not always the ones that ideally should be used.

3.1    Basic Performance and Reliability

Statistical machine learning theory and practice are built around risk minimization. The particular loss function, whose expectation over the data distribution is considered to be the risk, depends on the task, e.g. zero-one loss for binary classification and mean squared error for regression. Different types of errors can be given different costs. Abstract loss functions may be informed by real-world quality metrics [36], including context-dependent ones [37]. There is no particular standardization on the loss function, even broadly within application domains. Moreover, performance metrics that are not directly optimized are also often examined, e.g. area under the curve and normalized discounted cumulative gain.

The true expected value of the loss function can never be known and must be estimated empirically. There are several approaches and rules of thumb for estimating the risk, but there is no standardization here either. Different groups make different choices (k-fold cross-validation, held-out samples, stratification, bootstrapping, etc.). Further notions of performance and reliability are the technical aspects of latency, throughput, and availability of the service, which are also not standardized for the specifics of

AI workloads.                                                  mon cause of frustration and loss of trust for AI ser-
   To develop trust in AI services from a basic perfor-        vice consumers. Dataset shift can be detected and
mance perspective, the choice of metrics and testing           corrected using a multitude of methods [39]. The sen-
conditions should not be left to the discretion of the         sitivity of performance of different models to dataset
supplier (who may choose conditions which present              shift varies and should be part of a testing proto-
the service in a favorable light), but should be cod-          col. To the best of our knowledge, there does not yet
ified and standardized. The onerous requirement of             exist any standard for how to conduct such testing.
third-party testing could be avoided by ensuring that          To mitigate this risk a FactSheet should contain de-
the specifications are precise, i.e., that each metric         mographic information about the training and test
is precisely defined to ensure consistency and enable          datasets that report the various outcomes for each
reproducibility by AI service consumers.                       group of interest as specified in Section 3.1.
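
As a minimal sketch of reporting a precisely defined metric for each group of interest, the snippet below estimates the zero-one risk (the fraction of misclassified examples) on a held-out sample, both overall and per group. The function names, the toy labels, and the group labels "a" and "b" are illustrative assumptions rather than anything prescribed by the paper; a real FactSheet would use the metrics and groupings agreed for its domain.

```python
from collections import defaultdict


def zero_one_risk(y_true, y_pred):
    """Empirical zero-one risk: fraction of predictions that differ from the labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot estimate risk from an empty sample")
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return errors / len(y_true)


def risk_by_group(y_true, y_pred, groups):
    """Empirical zero-one risk computed separately for each group label."""
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("all inputs must have the same length")
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: zero_one_risk(ts, ps) for g, (ts, ps) in buckets.items()}


# Toy held-out sample (made up for illustration only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

overall = zero_one_risk(y_true, y_pred)
per_group = risk_by_group(y_true, y_pred, groups)
print(overall, per_group)
```

Because the metric is fully specified by the code, a consumer with access to comparable held-out data could recompute it, which is the kind of reproducibility the text argues precise definitions enable. Small groups yield noisy estimates, so a per-group table would normally also state the group sizes.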
   For each metric a FactSheet should report the val-
ues under various categories relevant to the expected
consumers (e.g., performance for various age groups,           Fairness AI fairness is a rapidly growing topic of
geographies, or genders) with the goal of providing            inquiry [40]. There are many different definitions
the right level of insight into the service, but still         of fairness (some of which provably conflict) that
preserving privacy. We expect some metrics will be             are appropriate in varying contexts. The concept of
specific to a domain (e.g., finance, healthcare, man-          fairness relies on protected attributes (also context-
ufacturing), or a modality (e.g., visual, speech, text),       dependent) such as race, gender, caste, and religion.
reflecting common practice of evaluation in that en-           For fairness, we insist on some risk measure being ap-
vironment.                                                     proximately equal in groups defined by the protected
                                                               attributes. Unwanted biases in training data, due
3.2    Safety                                                  to either prejudice in labels or under-/over-sampling,
                                                               lead to unfairness and can be checked using statistical
While typical machine learning performance metrics             tests on datasets or models [41, 42]. One can think of
are measures of risk (the ones described in the pre-           bias as the mismatch between the training data distri-
vious section), we must also consider epistemic un-            bution and a desired fair distribution. Applications
certainty when assessing the safety of a service [5, 3].       such as lending have legal requirements on fairness
The main uncertainty in machine learning is an un-             in decision making, e.g. the Equal Credit Opportu-
known mismatch between the training data distribu-             nity Act in the United States. Although the parity
tion and the desired data distribution on which one            definitions and computations in such applications are
would ideally train. Usually that desired distribution         explicit, the interpretation of the numbers is subjec-
is the true distribution encountered in operation (in          tive: there are no immutable thresholds on fairness
this case the mismatch is known as dataset shift),             metrics (e.g., the well-known 80% rule [43]) that are
but it could also be an idealized distribution that            aplied in isolation of context.
encodes preferred societal norms, policies, or regu-
lations (imagine a more equitable world than what
exists in reality). One may map four general cate-
                                                      Explainability Directly interpretable machine
gories of strategies to achieve safety proposed in [38]
                                                      learning (in contrast to post hoc interpretation)
to machine learning [3]: inherently safe design, safety
                                                      [44], in which a person can look at a model and
reserves, safe fail, and procedural safeguards, all of
                                                      understand what it does, reduces epistemic un-
which serve to reduce epistemic uncertainty. Inter-
                                                      certainty and increases safety because quirks and
pretability of models is one example of inherently safe
                                                      vagaries of training dataset distributions that will
                                                      not be present in distributions during deployment
                                                      can be identified by inspection [3]. Different users
Dataset Shift As the statistical relationship be- have different needs from explanations, and there
tween features and labels changes over time, known as is not yet any satisfactory quantitative definition
dataset shift, the mismatch between the training dis- of interpretability (and there may never be) [45].
tribution and the distribution from which test sam- Recent regulations in the European Union require
ples are being drawn increases. A well-known reason ‘meaningful’ explanations, but it is not clear what
for performance degradation, dataset shif is a com- constitutes a meaningful explanation.
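
As an illustration of the per-group reporting and the 80% rule [43] discussed under Fairness, the following minimal Python sketch computes accuracy and favorable-outcome rate for each group, then the ratio of favorable-outcome rates between two groups. The function names, the toy data, and the convention that label 1 is the favorable outcome are assumptions made for this example. A ratio below 0.8 is the conventional flag, but as noted above such a threshold should not be applied in isolation of context.

```python
from collections import defaultdict


def group_report(y_true, y_pred, groups):
    """Return accuracy and favorable-outcome rate for each group."""
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "favorable": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["correct"] += int(truth == pred)
        # Assumption for this sketch: label 1 is the favorable outcome.
        c["favorable"] += int(pred == 1)
    return {
        g: {
            "accuracy": c["correct"] / c["n"],
            "favorable_rate": c["favorable"] / c["n"],
        }
        for g, c in counts.items()
    }


def disparate_impact(report, unprivileged, privileged):
    """Ratio of favorable-outcome rates (unprivileged / privileged)."""
    denominator = report[privileged]["favorable_rate"]
    if denominator == 0:
        raise ValueError("privileged group has no favorable outcomes")
    return report[unprivileged]["favorable_rate"] / denominator


# Toy data, made up purely for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

report = group_report(y_true, y_pred, groups)
ratio = disparate_impact(report, unprivileged="b", privileged="a")
print(report)
print(f"disparate impact ratio: {ratio:.2f}")
```

Reporting the full per-group table alongside the single ratio keeps the number interpretable, since the same ratio can arise from very different underlying rates.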

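For the dataset shift discussed in Section 3.2, one simple check compares the empirical distribution of a feature at training time with its distribution in operation. The sketch below computes the two-sample Kolmogorov-Smirnov statistic in plain Python. It is only one possible check, not necessarily one of the methods surveyed in [39], and it has a clear limitation: it looks at the marginal distribution of a single feature, so it cannot see a change in the feature-label relationship that leaves the feature distribution unchanged. The function name and sample values are assumptions for this example.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0 = same, 1 = disjoint)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    # The gap between two step functions can only change at observed values.
    points = sorted(set(ref) | set(cur))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


# Made-up feature values: training-time sample versus values seen in operation.
training_sample = [1, 2, 3, 4]
operation_sample = [3, 4, 5, 6]
print(ks_statistic(training_sample, operation_sample))
```

What value of the statistic should trigger investigation or retraining is a policy choice that depends on the service and sample sizes. A FactSheet could document the chosen check and threshold.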
3.3    Security

AI services can be attacked by adversaries in various ways [4]. Small imperceptible
perturbations could cause AI services to misclassify inputs to any label that attackers
desire; training data and models can be poisoned, allowing attackers to worsen performance
(similar to concept drift but deliberate); and sensitive information about data and models
can be stolen by observing the outputs of a service for different inputs. Services may be
instrumented to detect such attacks and may also be designed with defenses [46]. New
research proposes certifications for defenses against adversarial examples [47], but these
are not yet practical.

3.4    Lineage

Once performance, safety, and security are sufficient to engender trust, we must also
ensure that we track and maintain the provenance of datasets, metadata, models along with
their hyperparameters, and test results. Users, those potentially affected, and third
parties, such as regulators, must be able to audit the systems underlying the services.
Appropriate parties may need the ability to reproduce past outputs and track outcomes.
Specifically, one should be able to determine the exact version of the service deployed at
any point of time in the past, how many times the service was retrained and associated
details like hyperparameters used for each training episode, training dataset used, how
accuracy and safety metrics have evolved over time, the feedback data received by the
service, and the triggers for retraining and improvement. This information may span
multiple organizations when a service is built by multiple parties.

4     Items in a FactSheet

In this section we provide an overview of the items that should be addressed in a
FactSheet. See the appendix for the complete list of items. To illustrate how these items
might be completed in practice, we also include two sample FactSheets in the appendix: one
for a fictitious fingerprint verification service and one for a trending topics service.
   The items are grouped into several categories aligned with the elements of trust. The
categories are: statement of purpose, basic performance, safety, security, and lineage.
They cover various aspects of service development, testing, deployment and maintenance:
from information about the data the service is trained on, to underlying algorithms, test
setup, test results, and performance benchmarks, to the way the service is maintained and
retrained (including automatic adaptation).
   The items are devised to aid the user in understanding how the service works, in
determining if the service is appropriate for the intended application, and in
comprehending its strengths and limitations. The identified items are not intended to be
definitive. If a question is not applicable to a given service, it can simply be ignored.
In some cases, the service supplier may not wish to disclose details of the service for
competitive reasons. For example, a supplier of a commercial fraud detection service for
healthcare insurance claims may choose not to reveal the details of the underlying
algorithm; nevertheless, the supplier should be able to indicate the class of algorithm
used and provide sample outputs along with explanations of the algorithmic decisions
leading to the outputs. More consequential applications will likely require more
comprehensive completion of items.
   A few examples of items a FactSheet might include are:

  • What is the intended use of the service output?

  • What algorithms or techniques does this service implement?

  • Which datasets was the service tested on? (Provide links to datasets that were used for
    testing, along with corresponding datasheets.)

  • Describe the testing methodology.

  • Describe the test results.

  • Are you aware of possible examples of bias, ethical issues, or other safety risks as a
    result of using the service?

  • Are the service outputs explainable and/or interpretable?

  • For each dataset used by the service: Was the dataset checked for bias? What efforts
    were made to ensure that it is fair and representative?

  • Does the service implement and perform any bias detection and remediation?

  • What is the expected performance on unseen data or data with different distributions?

  • Was the service checked for robustness against adversarial attacks?

  • When were the models last updated?

   As such a declaration is refined, and testing procedures for performance, robustness to
concept drift, explainability, and robustness to attacks are further codified, the
FactSheet may refer to standardized test protocols instead of providing descriptive
details.
   Since completing a FactSheet can be laborious, we expect most of the information to be
populated as part of the AI service creation process in a secure auditable manner. A
FactSheet will be created once and associated with a service, but can continually be
augmented, without removing previous information, i.e., results are added from more tests,
but results cannot be removed. Any changes made to the service will prompt the creation of
a new version of the FactSheet for the new model. Thus, these FactSheets will be treated as
a series of immutable artifacts.
   This information can be used to more accurately monitor a deployed service by comparing
deployed metrics with those that were seen during development and taking appropriate action
when unexpected behavior is detected.

5     The Evolution of FactSheet Adoption

We expect that AI will soon go through the same evolution that other technologies have gone
through (cf. [8] for an excellent review of the evolution of safety standards in different
industries). We propose that FactSheets be initially voluntary for several reasons. First,
discussion and feedback from multiple parties representing suppliers and consumers of AI
services are needed to determine the final set of items and format of FactSheets. An
initial voluntary period is therefore needed to allow this discussion to occur. Second,
there needs to be a balance between the needs of AI service consumers and the freedom to
innovate for AI service producers. Although producing a FactSheet will initially be an
additional burden to an AI service producer, we expect market feedback from AI service
consumers to encourage this creation.
   Because of peer pressure to conform [48], FactSheets could become a de facto requirement
similar to Energy Star labeling of the energy efficiency of appliances. They will serve to
reduce information asymmetry between supplier and consumer, where consumers are currently
unaware of important properties of a service, such as its intended use, its performance
metrics, and information about fairness, explainability, safety, and security. In
particular, consumers in many businesses do not have the requisite expertise to evaluate
various AI services available in the marketplace; uninformed or incorrect choices can
result in suboptimal business performance. By creating easily consumable FactSheets,
suppliers can accrue a competitive advantage by capturing consumers’ trust. Moreover, with
such transparency, FactSheets should serve to allow better functioning of AI service
marketplaces and prevent a so-called ‘market for lemons’ [49]. A counter-argument to
voluntary compliance and self-regulation argues that while participation of industry is
welcome, this should not stand in the way of legislation and governmental regulation [50].
   FactSheet adoption could potentially lead to an eventual system of third-party
certification [51], but probably only for services catering to applications with the very
highest of stakes, to regulated business processes and enterprise applications, and to
applications originating in the public sector [52, 7]. Children’s toys are an example
category of consumer products in which an SDoC is not enough and certification is required.
If an AI service is already touching on a regulation from a specific industry in which it
is being used, its FactSheet will serve as a tool for better

6     Discussion and Future Work

One may wonder why AI should be held to a higher standard (FactSheets) than non-AI software
and services in the same domain. Non-AI software includes several artifacts beyond the
code, such as design documents, program flow charts, and test plans that can provide
transparency to concerned consumers. Since AI services do not contain any of these, and the
generated code may not be easily understandable, there is a higher demand to enhance
transparency through FactSheets.
   Although FactSheets enable AI service producers to provide information about the intent
and construction of their service so that educated consumers can make informed decisions,
consumers may still, innocently or maliciously, use the service for purposes other than
those intended. FactSheets cannot fully protect against such use, but can form the basis of
service level agreements.
   Some components of an AI service may be produced by organizations other than the service
supplier. For example, the dataset may be obtained from a third party, or the service may
be a composition of models, some of which are produced by another organization. In such
cases, the FactSheet for the composed service would need to include information from the
supplying organizations. Ideally, those organizations would produce FactSheets for their
components, enabling the composing organization to provide a complete FactSheet. This
complete FactSheet could include the component FactSheets along with any necessary
additional information. In some cases, the demands for transparency on the composing
organization may be greater than on the component organization; market forces will require
the component organization to provide more transparency to retain

technologies.
   We see our work as a first step at defining which questions to ask and metrics to
measure towards development and adoption of broader industry practices and standards. We
see a parallel between the issue of trusted AI today and the rise of digital certification
during the Internet revolution. The digital certification market ‘bootstrapped’ the
Internet, ushering in a new era of ‘transactions’ such as online banking and benefits
enrollment that we take for granted today. In a similar vein, we can see AI service
FactSheets ushering in a new era of trusted AI end points and bootstrapping broader
adoption.
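
Section 4 describes FactSheets as a series of immutable artifacts: results may be added but never removed, and a change to the service prompts a new FactSheet version. A minimal Python sketch of that idea follows, together with the comparison of deployed and development metrics that Section 4 suggests for monitoring. All class, field, and function names here are hypothetical, and the tolerance is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FactSheet:
    """One immutable FactSheet snapshot (field names are hypothetical)."""
    service_version: str
    results: Tuple[Tuple[str, float], ...] = ()


class FactSheetSeries:
    """Append-only history of FactSheet snapshots for one service."""

    def __init__(self, service_version: str) -> None:
        self._history: List[FactSheet] = [FactSheet(service_version)]

    @property
    def latest(self) -> FactSheet:
        return self._history[-1]

    @property
    def history(self) -> Tuple[FactSheet, ...]:
        # Returned as a tuple so callers cannot edit the stored list.
        return tuple(self._history)

    def augment(self, test_name: str, value: float) -> FactSheet:
        """Add one test result; earlier results are carried over, never removed."""
        sheet = FactSheet(
            self.latest.service_version,
            self.latest.results + ((test_name, value),),
        )
        self._history.append(sheet)
        return sheet

    def new_service_version(self, service_version: str) -> FactSheet:
        """A change to the service starts a fresh FactSheet for the new model."""
        sheet = FactSheet(service_version)
        self._history.append(sheet)
        return sheet


def needs_attention(development: float, deployed: float, tolerance: float) -> bool:
    """Flag a deployed metric that strays from its development-time value."""
    return abs(development - deployed) > tolerance


series = FactSheetSeries("1.0")
series.augment("accuracy", 0.91)
series.augment("robustness", 0.80)
print(needs_attention(development=0.91, deployed=0.80, tolerance=0.05))
series.new_service_version("1.1")
print(len(series.history))
```

Keeping every snapshot, rather than overwriting the latest one, is what would let an auditor see which results were known for which service version at a given time.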
[43] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian,
     “Certifying and removing disparate impact,” in Proceedings of the ACM SIGKDD
     International Conference on Knowledge Discovery and Data Mining, Sydney, Australia,
     Aug. 2015, pp. 259–268.

A     Proposed FactSheet Items

Below we list example questions that a FactSheet for an AI service could include. The set
of questions we provide here is not intended to be definitive, but rather to open a
conversation about what aspects should be covered.

To illustrate how these questions could be answered, we provide two examples for fictitious
AI services: a fingerprint verification service (Appendix B) and a trending topics service
(Appendix C). However, given that the examples we provide are fictitious, we would expect
an actual service provider to answer these questions in much more detail. For instance,
they would be able to better characterize an API that actually exists. Our example answers
are mainly to provide additional insight about the type of information we would find in a
FactSheet.

Statement of purpose

The following questions are aimed at providing an overview of the service provider and of
the intended uses for the service. Valid answers include “N/A” (not applicable) and
“Proprietary” (cannot be publicly disclosed, usually for competitive reasons).

  • Who are “you” (the supplier) and what type of services do you typically offer (beyond
    this particular service)?

  • What is this service about?

        – Briefly describe the service.
        – When was the service first released? When was the last release?
        – Who is the target user?

  • Describe the outputs of the service.

  • What algorithms or techniques does this service implement?

        – Provide links to technical papers.

  • What are the characteristics of the development team?

        – Do the teams charged with developing and maintaining this service reflect a
          diversity of opinions, backgrounds, and thought?

  • Have you updated this FactSheet before?

        – When and how often?
        – What sections have changed?
        – Is the FactSheet updated every time the service is retrained or updated?

Usage

  • What is the intended use of the service output?

        – Briefly describe a simple use-case.

  • What are the key procedures followed while using the service?

        – How is the input provided? By whom?
        – How is the output returned?

Domains and applications

  • What are the domains and applications the service was tested on or used for?

        – Were domain experts involved in the development, testing, and deployment? Please
          elaborate.

  • How is the service being used by your customers or users?

        – Are you enabling others to build a solution by providing a cloud service or is
          your application end-user facing?
        – Is the service output used as-is, is it fed directly into another tool or
          actuator, or is there human input/oversight before use?
        – Do users rely on pre-trained/canned models or can they train their own models?
        – Do your customers typically use your service in a time critical setup (e.g. they
          have limited time to evaluate the output)? Or do they incorporate it in a slower
          decision making process? Please elaborate.

  • List applications that the service has been used for in the past.

        – Please provide information about these applications or relevant pointers.
        – Please provide key performance results for those applications.

  • Other comments?

Basic Performance

The following questions aim to offer an overall assessment of the service performance.

Testing by service provider

  • Which datasets was the service tested on? (e.g., links to datasets that were used for
    testing, along with corresponding datasheets)

        – List the test datasets and provide links to these datasets.
        – Do the datasets have an associated datasheet? If yes, please attach.
        – Could these datasets be used for independent testing of the service? Did the data
          need to be changed or sampled before use?

  • Describe the testing methodology.

        – Please provide details on train, test and holdout data.
        – What performance metrics were used? (e.g. accuracy, error rates, AUC,
          precision/recall)
        – Please briefly justify the choice of metrics.

  • Describe the test results.

        – Were latency, throughput, and availability measured?
        – If yes, briefly include those metrics as well.

Testing by third parties

  • Is there a way to verify the performance metrics (e.g., via a service API)?

        – Briefly describe how a third party could independently verify the performance of
          the service.
        – Are there benchmarks publicly available and adequate for testing the service?

  • In addition to the service provider, was this service tested by any third party?

        – Please list all third parties that performed the testing.
        – Also, please include information about the tests and test results.

  • Other comments?

Safety

The following questions aim to offer insights about potential unintentional harms, and
mitigation efforts to eliminate or minimize those harms.

  • Are you aware of possible examples of bias, ethical issues, or other safety risks as a
    result of using the service?

        – Were the possible sources of bias or unfairness analyzed?
        – Where do they arise from: the data? the particular techniques being implemented?
          other sources?
        – Is there any mechanism for redress if individuals are negatively affected?

  • Do you use data from or make inferences about individuals or groups of individuals?
    Have you obtained their consent?

        – How was it decided whose data to use or about whom to make inferences?
        – Do these individuals know that their data is being used or that inferences are
          being made about them? What were they told? When were they made aware? What kind
          of consent was needed from them? What were the procedures for gathering consent?
          Please attach the consent form to this declaration.
        – What are the potential risks to these individuals or groups? Might the service
          output interfere with individual rights? How are these risks being handled or
          minimized?
        – What trade-offs were made between the rights of these individuals and business
          interests?
        – Do they have the option to withdraw their data? Can they opt out from inferences
          being made about them? What is the withdrawal procedure?

Explainability

  • Are the service outputs explainable and/or interpretable?

        – Please explain how explainability is achieved (e.g. directly explainable
          algorithm, local explainability, explanations via
        – Who is the target user of the explanation (ML expert, domain expert, general
          consumer, etc.)?
        – Please describe any human validation of the explainability of the algorithms.

Fairness

  • For each dataset used by the service: Was the dataset checked for bias? What efforts
    were made to ensure that it is fair and representative?

        – Please describe the data bias policies that were checked (such as with respect to
          known protected attributes), bias checking methods, and results (e.g., disparate
          error rates across different groups).
        – Was there any bias remediation performed on this dataset? Please provide details
          about the value of any bias estimates before and after it.
        – What techniques were used to perform the remediation? Please provide links to
          relevant technical papers.
        – How did the value of other performance metrics change as a result?

  • Does the service implement and perform any bias detection and remediation?

        – Please describe model bias policies that were checked, bias checking methods, and
          results (e.g., disparate error rates across different groups).
        – What procedures were used to perform the remediation? Please provide links or
          references to corresponding technical papers.
        – Please provide details about the value of any bias estimates before and after
          such remediation.
        – How did the value of other performance metrics change as a result?

Concept Drift

  • What is the expected performance on unseen data or data with different distributions?

        – Please describe any relevant testing done along with test results.

  • Does your system make updates to its behavior based on newly ingested data?

        – Is the new data uploaded by your users? Is it generated by an automated process?
          Are the patterns in the data largely static or do they change over time?
        – Are there any performance guarantees?
        – Does the service have an automatic feedback/retraining loop, or is there a human
          in the loop?

  • How is the service tested and monitored for model or performance drift over time?

        – If applicable, describe any relevant testing along with test results.

  • How can the service be checked for correct, expected output when new data is added?

  • Does the service allow for checking for differences between training and usage data?

        – Does it deploy mechanisms to alert the user of the difference?

  • Do you test the service periodically?

        – Does the testing include bias or fairness related aspects?
        – How has the value of the tested metrics evolved over time?

  • Other comments?

Security

The following questions aim to assess the susceptibility to deliberate harms such as
attacks by adversaries.

  • How could this service be attacked or abused? Please describe.

  • List applications or scenarios for which the service is not suitable.

        – Describe specific concerns and sensitive use cases.
        – Are there any procedures in place to ensure that the service will not be used for
          these applications?

  • How are you securing user or usage data?

        – Is usage data from service operations retained and stored?
        – How is the data being stored? For how long is the data stored?
        – Is user or usage data being shared outside the service? Who has access to the
          data?

  • Was the service checked for robustness against adversarial attacks?

        – Describe robustness policies that were checked, the type of attacks considered,
          checking methods, and results.

  • What is the plan to handle any potential security breaches?

        – Describe any protocol that is in place.

  • Other comments?

Lineage

The following questions aim to overview how the service provider keeps track of details
that might be required in the event of an audit by a third party, such as in the case of
harm or suspicion of harm.

Training Data

  • Does the service provide an as-is/canned model? Which datasets was the service trained
    on?
