FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity
M. Arnold, R. K. E. Bellamy, M. Hind, S. Houde, S. Mehta, A. Mojsilović, R. Nair, K. Natesan Ramamurthy, D. Reimer, A. Olteanu,* D. Piorkowski, J. Tsay, and K. R. Varshney
IBM Research, Yorktown Heights, New York and Bengaluru, Karnataka
arXiv:1808.07261v2 [cs.CY] 7 Feb 2019

* A. Olteanu's work was done while at IBM Research. The author is currently affiliated with Microsoft Research.

Abstract

Accuracy is an important concern for suppliers of artificial intelligence (AI) services, but considerations beyond accuracy, such as safety (which includes fairness and explainability), security, and provenance, are also critical elements to engender consumers' trust in a service. Many industries use transparent, standardized, but often not legally required documents called supplier's declarations of conformity (SDoCs) to describe the lineage of a product along with the safety and performance testing it has undergone. SDoCs may be considered multi-dimensional fact sheets that capture and quantify various aspects of the product and its development to make it worthy of consumers' trust. Inspired by this practice, we propose FactSheets to help increase trust in AI services. We envision such documents to contain purpose, performance, safety, security, and provenance information to be completed by AI service providers for examination by consumers. We suggest a comprehensive set of declaration items tailored to AI and provide examples for two fictitious AI services in the appendix of the paper.

1 Introduction

Artificial intelligence (AI) services, such as those containing predictive models trained through machine learning, are increasingly key pieces of products and decision-making workflows. A service is a function or application accessed by a customer via a cloud infrastructure, typically by means of an application programming interface (API). For example, an AI service could take an audio waveform as input and return a transcript of what was spoken as output, with all complexity hidden from the user, all computation done in the cloud, and all models used to produce the output pre-trained by the supplier of the service. A second, more complex example would provide an audio waveform translated into a different language as output. The second example illustrates that a service can be made up of many different models (speech recognition, language translation, possibly sentiment or tone analysis, and speech synthesis) and is thus a distinct concept from a single pre-trained machine learning model or library.

In many different application domains today, AI services are achieving impressive accuracy. In certain areas, high accuracy alone may be sufficient, but deployments of AI in high-stakes decisions, such as credit applications, judicial decisions, and medical recommendations, require greater trust in AI services. Although there is no scholarly consensus on the specific traits that imbue trustworthiness in people or algorithms [1, 2], fairness, explainability, general safety, security, and transparency are some of the issues that have raised public concern about trusting AI and threatened the further adoption of AI beyond low-stakes uses [3, 4]. Despite active research and development to address these issues, there is no mechanism yet for the creator of an AI service to communicate how they are addressed in a deployed version. This is a major impediment to broad AI adoption.

Toward transparency for developing trust, we propose a FactSheet for AI Services. A FactSheet will contain sections on all relevant attributes of an AI service, such as intended use, performance, safety, and security. Performance will include appropriate accuracy or risk measures along with timing information. Safety, discussed in [5, 3] as the minimization of both risk and epistemic uncertainty, will include explainability, algorithmic fairness, and robustness to dataset shift. Security will include robustness to adversarial attacks. Moreover, the FactSheet will list how the service was created, trained, and deployed along with what scenarios it was tested on, how it may respond to untested scenarios, guidelines that specify what tasks it should and should not be used for, and any ethical concerns of its use. Hence, FactSheets help prevent overgeneralization and unintended use of AI services by solidly grounding them with metrics and usage scenarios.

A FactSheet is modeled after a supplier's declaration of conformity (SDoC). An SDoC is a document to "show that a product, process or service conforms to a standard or technical regulation, in which a supplier provides written assurance [and evidence] of conformity to the specified requirements," and is used in many different industries and sectors including telecommunications and transportation [6]. Importantly, SDoCs are often voluntary, and tests reported in SDoCs are conducted by the supplier itself rather than by third parties [7]. This distinguishes self-declarations from certifications, which are mandatory and must have tests conducted by third parties. We propose that FactSheets for AI services be voluntary initially; we provide further discussion on their possible evolution in later sections.

Our proposal of AI service FactSheets is inspired by, and builds upon, recent work that focuses on increased transparency for datasets [8, 9, 10] and models [11], but is distinguished from these in that we focus on the final AI service. We take this focus for three reasons:

1. AI services constitute the building blocks for many AI applications. Developers will query the service API and consume its output. An AI service can be an amalgam of many models trained on many datasets. Thus, the models and datasets are (direct and indirect) components of an AI service, but they are not the interface to the developer.

2. Often, there is an expertise gap between the producer and consumer of an AI service. The production team relies heavily on the training and creation of one or more AI models and hence will mostly contain data scientists. The consumers of the service tend to be developers. When such an expertise gap exists, it becomes more crucial to communicate the attributes of the artifact in a standardized way, as with Energy Star or food nutrition labels.

3. Systems composed of safe components may be unsafe and, conversely, it may be possible to build safe systems out of unsafe components, so it is prudent to also consider transparency and accountability of services in addition to datasets and models. In doing so, we take a functional perspective on the overall service, and can test for performance, safety, and security aspects that are not relevant for a dataset in isolation, such as generalization accuracy, explainability, and adversarial robustness.

Loukides et al. propose a checklist that has some of the elements we seek [12].

Our aim is not to give the final word on the contents of AI service FactSheets, but to begin the conversation on the types of information and tests that may be included. Moreover, determining a single comprehensive set of FactSheet items is likely infeasible, as the context and industry domain will often determine what items are needed. One would expect higher-stakes applications to require more comprehensive FactSheets. Our main goal is to help identify a common set of properties. A multi-stakeholder approach, including numerous AI service suppliers and consumers, standards bodies, and civil society and professional organizations, is essential to converge on standards. Only then will we as a community be able to start producing meaningful FactSheets for AI services.

The remainder of the paper is organized as follows. Section 2 overviews related work, including labeling, safety, and certification standards in other industries. Section 3 provides more details on the key issues to enable trust in AI systems. Section 4 describes the AI service FactSheet in more detail, giving examples of questions that it should include. In Section 5, we discuss how FactSheets can evolve from a voluntary process to one that could be an industry requirement. Section 6 covers challenges, opportunities, and future work needed to achieve the widespread usage of AI service declarations of conformity. A proposed complete set of sections and items for a FactSheet is included in the appendix, along with sample FactSheets for two exemplary fictitious services, fingerprint verification and trending topics in social media.
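To make the preceding description concrete, below is a minimal sketch, in Python, of how a FactSheet's sections might be captured and accumulated by a supplier. The section names, fields, and values are illustrative assumptions rather than a schema defined in this paper; the append-only list of entries anticipates the immutable, versioned treatment of FactSheets discussed in Section 4.

    import json
    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from typing import Any, Dict, List

    @dataclass
    class FactSheetEntry:
        # One immutable snapshot of the declared facts for a service version.
        service_version: str
        recorded_at: str
        sections: Dict[str, Any]

    @dataclass
    class FactSheet:
        # Append-only collection: new entries are added, old ones are never removed.
        service_name: str
        entries: List[FactSheetEntry] = field(default_factory=list)

        def add_entry(self, service_version: str, sections: Dict[str, Any]) -> None:
            self.entries.append(FactSheetEntry(
                service_version=service_version,
                recorded_at=datetime.now(timezone.utc).isoformat(),
                sections=sections,
            ))

        def to_json(self) -> str:
            return json.dumps(
                {"service_name": self.service_name,
                 "entries": [e.__dict__ for e in self.entries]},
                indent=2)

    # Illustrative use with the categories named in this paper (hypothetical values).
    fs = FactSheet(service_name="fingerprint-verification")
    fs.add_entry("1.0.0", {
        "statement_of_purpose": "Verify a claimed identity from a fingerprint image.",
        "basic_performance": {"metric": "false match rate", "value": 0.001},
        "safety": {"fairness_checked": True, "explainability": "example-based"},
        "security": {"adversarial_robustness_tested": False},
        "lineage": {"training_data": "internal dataset v3", "last_trained": "2019-01-15"},
    })
    print(fs.to_json())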
2 Related Work

This section discusses related work in providing transparency in the creation of AI services, as well as a brief survey of ensuring trust in non-AI systems.

2.1 Transparency in AI

Within the last year, several research groups have advocated standardizing and sharing information about training datasets and trained models. Gebru et al. proposed the use of datasheets for datasets as a way to expose and standardize information about public datasets, or datasets used in the development of commercial AI services and pre-trained models [8]. The datasheet would include provenance information, key characteristics, and relevant regulations, but also significant, yet more subjective, information such as potential bias, strengths and weaknesses, and suggested uses. Bender and Friedman propose a data statement schema as a way to capture and convey the information and properties of a dataset used in natural language processing (NLP) research and development [9]. They argue that data statements should be included in most writing on NLP, including papers presenting new datasets, papers reporting experimental work with datasets, and documentation for NLP systems.

Holland et al. outline the dataset nutrition label, a diagnostic framework that provides a concise yet robust and standardized view of the core components of a dataset [10]. Academic conferences such as the International AAAI Conference on Web and Social Media are also starting special tracks for dataset papers containing detailed descriptions, collection methods, and use cases.

Subsequent to the first posting of this paper [13], Mitchell et al. propose model cards to convey information that characterizes the evaluation of a machine learning model in a variety of conditions and disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information [11]. There is also budding activity in industry on auditing and labeling algorithms for accuracy, bias, consistency, transparency, fairness, and timeliness [14, 15], but such audits do not cover several aspects of safety, security, and lineage.

Our proposal is distinguished from prior work in that we focus on the final AI service, a distinct concept from a single pre-trained machine learning model or dataset. Moreover, we take a broader view on trustworthy AI that extends beyond principles, values, and ethical purpose to also include technical robustness and reliability [16].

2.2 Enabling Trust in Other Domains

Enabling trust in systems is not unique to AI. This section provides an overview of mechanisms used in other domains and industries to achieve trust. The goal is to understand existing approaches to help inspire the right directions for enabling trust in AI services.

2.2.1 Standards Organizations

Standardization organizations, such as the IEEE [17] and ISO [18], define standards along with the requirements that need to be satisfied for a product or a process to meet the standard. The product developer can self-report that a product meets the standard, though there are several cases, especially with ISO standards, where an independent accredited body will verify that the standards are met and provide the certification.

2.2.2 Consumer Products

The United States Consumer Product Safety Commission (CPSC) [19] requires a manufacturer or importer to declare its product as compliant with applicable consumer product safety requirements in a written or electronic declaration of conformity. In many cases, this can be self-reported by the manufacturer or importer, i.e., an SDoC. However, in the case of children's products, it is mandatory to have the testing performed by a CPSC-accepted laboratory for compliance. Durable infant or toddler products must be marked with specialized tracking labels and must have a postage-paid customer registration card attached, to be used in case of a recall.

The National Parenting Center has a Seal of Approval program [20] that conducts testing on a variety of children's products involving interaction with the products by parents, children, and educators, who fill out questionnaires for the products they test. The quality of a product is determined based on factors like the product's level of desirability, sturdiness, and interactive stimulation. Both statistical averaging and comments from testers are examined before providing a Seal of Approval for the product.
2.2.3 Finance

In the financial industry, corporate bonds are rated by independent rating services [21, 22] to help an investor assess the bond issuer's financial strength or its ability to pay a bond's principal and interest in a timely fashion. These letter-grade ratings range from AAA or Aaa for safe, 'blue-chip' bonds to C or D for 'junk' bonds. On the other hand, common-stock investments are not rated independently. Rather, the Securities and Exchange Commission (SEC) requires potential issuers of stock to submit specific registration documents that disclose extensive financial information about the company and risks associated with the future operations of the company. The SEC examines these documents, comments on them, and expects corrections based on the comments. The final product is a prospectus approved by the SEC that is available for potential buyers of the stock.

2.2.4 Software

In the software area, there have been recent attempts to certify digital data repositories as 'trusted.' Trustworthiness involves both the quality of the data and sustainable, reliable access to the data. The goal of certification is to enhance scientific reproducibility. The European Framework for Audit and Certification [23] has three levels of certification, Core, Extended, and Formal (or Bronze, Silver, and Gold), with different requirements, mainly to distinguish between the requirements of different types of data, e.g. research data vs. human health data vs. financial transaction data. The CoreTrustSeal [24], a private legal entity, provides a Bronze level certification to an interested data repository, for a nominal fee.

There have been several proposals in the literature for software certifications of various kinds. Ghosh and McGraw [25] propose a certification process for testing software components for security properties. Their technique involves a process and a set of white-box and black-box testing procedures that eventually result in a stamp of approval in the form of a digital signature. Schiller [26] proposes a certification process that starts with a checklist with yes/no answers provided by the developer, and determines which tests need to be performed on the software to certify it. Currit et al. [27] describe a procedure for certifying the reliability of software before its release to the users. They predict the performance of the software on unseen inputs using the MTTF (mean time to failure) metric. Port and Wilf [28] describe a procedure to certify the readiness for software release, weighing the cost of too early a release due to failures in the field against the cost in personnel and schedule delay arising from more extensive testing. Their technique involves the filling out of a questionnaire by the software developer, called the Software Review and Certification Record (SRCR), which is 'credentialed' with signatories who approve the document prior to the release decision. Heck et al. [29] also describe a software product certification model to certify legislative compliance or acceptability of software delivered during outsourcing. The basis for certification is a questionnaire to be filled out by the developer. The only acceptable answers to the questions are yes and n/a (not applicable).

A different approach is taken in the CERT Secure Coding Standards [30] of the Software Engineering Institute. Here the emphasis is on documenting best practices and coding standards for security purposes. The secure coding standards consist of guidelines about the types of security flaws that can be injected through development with specific programming languages. Each guideline offers precise information describing the cause and impact of violations, and examples of common non-compliant (flawed) and compliant (fixed) code. The organization also provides tools that audit code to identify security flaws indicated by violations of the CERT secure coding standards.

2.2.5 Environmental Impact Statements

Environmental law in the United States requires that an environmental impact statement (EIS) be prepared prior to starting large construction projects. An EIS is a document used as a tool for decision making that describes positive and negative environmental effects of a proposed action. It is made available both to federal agencies and to the public, and captures impacts to endangered species, air quality, water quality, cultural sites, and the socioeconomics of local communities. The federal law, the National Environmental Policy Act, has inspired similar laws in various jurisdictions and in other fields beyond the environment. Selbst has proposed an algorithmic impact statement for AI that follows the form and purpose of EISs [31].
2.2.6 Human Subjects

In addition to products and technologies, another critical endeavor requiring trust is research involving human subjects. Institutional review boards (IRBs) have precise reviewing protocols and requirements such as those presented in the Belmont Report [32]. Items to be completed include statement of purpose, participant selection, procedures to be followed, harms and benefits to subjects, confidentiality, and consent documents. As AI services increasingly make inferences for people and about people [33], IRB requirements increasingly apply to them.

2.2.7 Summary

To ensure trust in products, industries have established a variety of practices to convey information about how a product is expected to perform when utilized by a consumer. This information usually includes how the product was constructed and tested. Some industries allow product creators to voluntarily provide this information, whereas others explicitly require it. When the information is required, some industries require the information to be validated by a third party. One would expect the latter scenario to occur in mature industries where there is confidence that the requirements strongly correlate with safety, reliability, and overall trust in the product. Mandatory external validation of nascent requirements in emerging industries may unnecessarily stifle the development of the industry.

3 Elements of Trust in AI Systems

We drive cars trusting the brakes will work when the pedal is pressed. We undergo laser eye surgery trusting the system to make the right decisions. We accept that the autopilot will operate an airplane, trusting that it will navigate correctly. In all these cases, trust comes from confidence that the system will err extremely rarely, leveraging system training, exhaustive testing, experience, safety measures and standards, best practices, and consumer education.

Every time new technology is introduced, it creates new challenges, safety issues, and potential hazards. As the technology develops and matures, these issues are better understood, documented, and addressed. Human trust in technology is developed as users overcome perceptions of risk and uncertainty [34], i.e., as they are able to assess the technology's performance, reliability, safety, and security. Consumers do not yet trust AI like they trust other technologies because of inadequate attention given to the latter of these issues [35]. Making technical progress on safety and security is necessary but not sufficient to achieve trust in AI, however; the progress must be accompanied by the ability to measure and communicate the performance levels of the service on these dimensions in a standardized and transparent manner. One way to accomplish this is to provide such information via FactSheets for AI services.

Trust in AI services will come from: a) applying general safety and reliability engineering methodologies across the entire lifecycle of an AI service, b) identifying and addressing new, AI-specific issues and challenges in an ongoing and agile way, and c) creating standardized tests and transparent reporting mechanisms on how such a service operates and performs. In this section we outline several areas of concern and how they uniquely apply to AI. The crux of this discussion is the manifestation of risk and uncertainty in machine learning, including that data distributions used for training are not always the ones that ideally should be used.

3.1 Basic Performance and Reliability

Statistical machine learning theory and practice is built around risk minimization. The particular loss function, whose expectation over the data distribution is considered to be the risk, depends on the task, e.g. zero-one loss for binary classification and mean squared error for regression. Different types of errors can be given different costs. Abstract loss functions may be informed by real-world quality metrics [36], including context-dependent ones [37]. There is no particular standardization on the loss function, even broadly within application domains. Moreover, performance metrics that are not directly optimized are also often examined, e.g. area under the curve and normalized discounted cumulative gain.

The true expected value of the loss function can never be known and must be estimated empirically. There are several approaches and rules of thumb for estimating the risk, but there is no standardization here either. Different groups make different choices (k-fold cross-validation, held-out samples, stratification, bootstrapping, etc.). Further notions of performance and reliability are the technical aspects of latency, throughput, and availability of the service, which are also not standardized for the specifics of AI workloads.

To develop trust in AI services from a basic performance perspective, the choice of metrics and testing conditions should not be left to the discretion of the supplier (who may choose conditions which present the service in a favorable light), but should be codified and standardized. The onerous requirement of third-party testing could be avoided by ensuring that the specifications are precise, i.e., that each metric is precisely defined to ensure consistency and enable reproducibility by AI service consumers.

For each metric, a FactSheet should report the values under various categories relevant to the expected consumers (e.g., performance for various age groups, geographies, or genders) with the goal of providing the right level of insight into the service, but still preserving privacy. We expect some metrics will be specific to a domain (e.g., finance, healthcare, manufacturing) or a modality (e.g., visual, speech, text), reflecting common practice of evaluation in that environment.

3.2 Safety

While typical machine learning performance metrics are measures of risk (the ones described in the previous section), we must also consider epistemic uncertainty when assessing the safety of a service [5, 3]. The main uncertainty in machine learning is an unknown mismatch between the training data distribution and the desired data distribution on which one would ideally train. Usually that desired distribution is the true distribution encountered in operation (in this case the mismatch is known as dataset shift), but it could also be an idealized distribution that encodes preferred societal norms, policies, or regulations (imagine a more equitable world than what exists in reality). One may map four general categories of strategies to achieve safety proposed in [38] to machine learning [3]: inherently safe design, safety reserves, safe fail, and procedural safeguards, all of which serve to reduce epistemic uncertainty. Interpretability of models is one example of inherently safe design.

Dataset Shift

As the statistical relationship between features and labels changes over time, known as dataset shift, the mismatch between the training distribution and the distribution from which test samples are being drawn increases. A well-known reason for performance degradation, dataset shift is a common cause of frustration and loss of trust for AI service consumers. Dataset shift can be detected and corrected using a multitude of methods [39]. The sensitivity of performance of different models to dataset shift varies and should be part of a testing protocol. To the best of our knowledge, there does not yet exist any standard for how to conduct such testing. To mitigate this risk, a FactSheet should contain demographic information about the training and test datasets, reporting the various outcomes for each group of interest as specified in Section 3.1.

Fairness

AI fairness is a rapidly growing topic of inquiry [40]. There are many different definitions of fairness (some of which provably conflict) that are appropriate in varying contexts. The concept of fairness relies on protected attributes (also context-dependent) such as race, gender, caste, and religion. For fairness, we insist on some risk measure being approximately equal in groups defined by the protected attributes. Unwanted biases in training data, due to either prejudice in labels or under-/over-sampling, lead to unfairness and can be checked using statistical tests on datasets or models [41, 42]. One can think of bias as the mismatch between the training data distribution and a desired fair distribution. Applications such as lending have legal requirements on fairness in decision making, e.g. the Equal Credit Opportunity Act in the United States. Although the parity definitions and computations in such applications are explicit, the interpretation of the numbers is subjective: there are no immutable thresholds on fairness metrics (e.g., the well-known 80% rule [43]) that are applied in isolation of context.

Explainability

Directly interpretable machine learning (in contrast to post hoc interpretation) [44], in which a person can look at a model and understand what it does, reduces epistemic uncertainty and increases safety because quirks and vagaries of training dataset distributions that will not be present in distributions during deployment can be identified by inspection [3]. Different users have different needs from explanations, and there is not yet any satisfactory quantitative definition of interpretability (and there may never be) [45]. Recent regulations in the European Union require 'meaningful' explanations, but it is not clear what constitutes a meaningful explanation.
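To ground the basic performance and fairness discussions above, the sketch below computes a performance metric broken out by a protected attribute and a disparate impact ratio (the comparison behind the 80% rule mentioned above). It is a minimal illustration on synthetic data; the group labels, metric choice, and any threshold are assumptions for the example, and actual FactSheet testing would follow whatever standardized protocols emerge.

    import numpy as np

    def group_metrics(y_true, y_pred, groups):
        # Report accuracy and favorable-outcome rate for each value of a protected attribute.
        report = {}
        for g in np.unique(groups):
            mask = groups == g
            report[str(g)] = {
                "n": int(mask.sum()),
                "accuracy": float((y_true[mask] == y_pred[mask]).mean()),
                "favorable_rate": float((y_pred[mask] == 1).mean()),
            }
        return report

    def disparate_impact(report, unprivileged, privileged):
        # Ratio of favorable-outcome rates between two groups; values below roughly 0.8
        # trigger the "80% rule" concern, though no threshold should be applied in
        # isolation of context, as noted above.
        return report[unprivileged]["favorable_rate"] / report[privileged]["favorable_rate"]

    # Toy example with hypothetical labels, predictions, and a binary protected attribute.
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)
    y_pred = rng.integers(0, 2, size=1000)
    protected = rng.choice(["group_a", "group_b"], size=1000)

    per_group = group_metrics(y_true, y_pred, protected)
    print(per_group)
    print("disparate impact:", disparate_impact(per_group, "group_a", "group_b"))

A FactSheet could report exactly such per-group numbers, alongside the overall figures, for the metrics chosen under Section 3.1.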
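Similarly, as one simple illustration of dataset shift detection, the sketch below applies a two-sample Kolmogorov-Smirnov test per feature to compare the training data with data observed in deployment. This is only one of the many detection methods surveyed in [39], and the significance level and synthetic data are assumptions made for the example.

    import numpy as np
    from scipy.stats import ks_2samp

    def detect_feature_drift(train_X, live_X, alpha=0.01):
        # Flag features whose deployment-time distribution differs from training.
        drifted = []
        for j in range(train_X.shape[1]):
            stat, p_value = ks_2samp(train_X[:, j], live_X[:, j])
            if p_value < alpha:
                drifted.append({"feature": j,
                                "ks_statistic": float(stat),
                                "p_value": float(p_value)})
        return drifted

    # Toy example: feature 1 drifts upward at deployment time.
    rng = np.random.default_rng(1)
    train_X = rng.normal(size=(2000, 3))
    live_X = rng.normal(size=(500, 3))
    live_X[:, 1] += 0.5  # simulated shift in one feature

    print(detect_feature_drift(train_X, live_X))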
3.3 Security

AI services can be attacked by adversaries in various ways [4]. Small imperceptible perturbations could cause AI services to misclassify inputs to any label that attackers desire; training data and models can be poisoned, allowing attackers to worsen performance (similar to concept drift but deliberate); and sensitive information about data and models can be stolen by observing the outputs of a service for different inputs. Services may be instrumented to detect such attacks and may also be designed with defenses [46]. New research proposes certifications for defenses against adversarial examples [47], but these are not yet practical.

3.4 Lineage

Once performance, safety, and security are sufficient to engender trust, we must also ensure that we track and maintain the provenance of datasets, metadata, models along with their hyperparameters, and test results. Users, those potentially affected, and third parties, such as regulators, must be able to audit the systems underlying the services. Appropriate parties may need the ability to reproduce past outputs and track outcomes. Specifically, one should be able to determine the exact version of the service deployed at any point of time in the past, how many times the service was retrained and associated details like hyperparameters used for each training episode, the training dataset used, how accuracy and safety metrics have evolved over time, the feedback data received by the service, and the triggers for retraining and improvement. This information may span multiple organizations when a service is built by multiple parties.

4 Items in a FactSheet

In this section we provide an overview of the items that should be addressed in a FactSheet. See the appendix for the complete list of items. To illustrate how these items might be completed in practice, we also include two sample FactSheets in the appendix: one for a fictitious fingerprint verification service and one for a trending topics service.

The items are grouped into several categories aligned with the elements of trust. The categories are: statement of purpose, basic performance, safety, security, and lineage. They cover various aspects of service development, testing, deployment, and maintenance: from information about the data the service is trained on, to underlying algorithms, test setup, test results, and performance benchmarks, to the way the service is maintained and retrained (including automatic adaptation).

The items are devised to aid the user in understanding how the service works, in determining if the service is appropriate for the intended application, and in comprehending its strengths and limitations. The identified items are not intended to be definitive. If a question is not applicable to a given service, it can simply be ignored. In some cases, the service supplier may not wish to disclose details of the service for competitive reasons. For example, a supplier of a commercial fraud detection service for healthcare insurance claims may choose not to reveal the details of the underlying algorithm; nevertheless, the supplier should be able to indicate the class of algorithm used and provide sample outputs along with explanations of the algorithmic decisions leading to the outputs. More consequential applications will likely require more comprehensive completion of items.

A few examples of items a FactSheet might include are:

• What is the intended use of the service output?
• What algorithms or techniques does this service implement?
• Which datasets was the service tested on? (Provide links to datasets that were used for testing, along with corresponding datasheets.)
• Describe the testing methodology.
• Describe the test results.
• Are you aware of possible examples of bias, ethical issues, or other safety risks as a result of using the service?
• Are the service outputs explainable and/or interpretable?
• For each dataset used by the service: Was the dataset checked for bias? What efforts were made to ensure that it is fair and representative?
• Does the service implement and perform any bias detection and remediation?
• What is the expected performance on unseen data or data with different distributions?
• Was the service checked for robustness against adversarial attacks?
• When were the models last updated?

As such a declaration is refined, and testing procedures for performance, robustness to concept drift, explainability, and robustness to attacks are further codified, the FactSheet may refer to standardized test protocols instead of providing descriptive details.

Since completing a FactSheet can be laborious, we expect most of the information to be populated as part of the AI service creation process in a secure, auditable manner. A FactSheet will be created once and associated with a service, but can continually be augmented, without removing previous information, i.e., results are added from more tests, but results cannot be removed. Any changes made to the service will prompt the creation of a new version of the FactSheet for the new model. Thus, these FactSheets will be treated as a series of immutable artifacts.

This information can be used to more accurately monitor a deployed service by comparing deployed metrics with those that were seen during development and taking appropriate action when unexpected behavior is detected.

5 The Evolution of FactSheet Adoption

We expect that AI will soon go through the same evolution that other technologies have gone through (cf. [8] for an excellent review of the evolution of safety standards in different industries). We propose that FactSheets be initially voluntary for several reasons. First, discussion and feedback from multiple parties representing suppliers and consumers of AI services is needed to determine the final set of items and format of FactSheets. So, an initial voluntary period to allow this discussion to occur is needed. Second, there needs to be a balance between the needs of AI service consumers and the freedom to innovate for AI service producers. Although producing a FactSheet will initially be an additional burden to an AI service producer, we expect market feedback from AI service consumers to encourage this creation.

Because of peer pressure to conform [48], FactSheets could become a de facto requirement similar to Energy Star labeling of the energy efficiency of appliances. They will serve to reduce information asymmetry between supplier and consumer, where consumers are currently unaware of important properties of a service, such as its intended use, its performance metrics, and information about fairness, explainability, safety, and security. In particular, consumers in many businesses do not have the requisite expertise to evaluate various AI services available in the marketplace; uninformed or incorrect choices can result in suboptimal business performance. By creating easily consumable FactSheets, suppliers can accrue a competitive advantage by capturing consumers' trust. Moreover, with such transparency, FactSheets should serve to allow better functioning of AI service marketplaces and prevent a so-called 'market for lemons' [49]. A counter-argument to voluntary compliance and self-regulation argues that while participation of industry is welcome, this should not stand in the way of legislation and governmental regulation [50].

FactSheet adoption could potentially lead to an eventual system of third-party certification [51], but probably only for services catering to applications with the very highest of stakes, to regulated business processes and enterprise applications, and to applications originating in the public sector [52, 7]. Children's toys are an example category of consumer products in which an SDoC is not enough and certification is required. If an AI service is already touching on a regulation from a specific industry in which it is being used, its FactSheet will serve as a tool for compliance.

6 Discussion and Future Work

One may wonder why AI should be held to a higher standard (FactSheets) than non-AI software and services in the same domain. Non-AI software includes several artifacts beyond the code, such as design documents, program flow charts, and test plans, that can provide transparency to concerned consumers. Since AI services do not contain any of these, and the generated code may not be easily understandable, there is a higher demand to enhance transparency through FactSheets.

Although FactSheets enable AI service producers to provide information about the intent and construction of their service so that educated consumers can make informed decisions, consumers may still, innocently or maliciously, use the service for purposes other than those intended. FactSheets cannot fully protect against such use, but can form the basis of service level agreements.

Some components of an AI service may be produced by organizations other than the service supplier. For example, the dataset may be obtained from a third party, or the service may be a composition of models, some of which are produced by another organization. In such cases, the FactSheet for the composed service would need to include information from the supplying organizations. Ideally, those organizations would produce FactSheets for their components, enabling the composing organization to provide a complete FactSheet. This complete FactSheet could include the component FactSheets along with any necessary additional information. In some cases, the demands for transparency on the composing organization may be greater than on the component organization; market forces will require the component organization to provide more transparency to retain their relationship with the composing organization. This is analogous to other industries, like retail, where retailers push demands on their suppliers to meet the expectations of the retailers' customers. In these situations the provenance of the information among organizations will need to be tracked.

7 Summary and Conclusion

In this paper, we continue in the research direction established by datasheets or nutrition labels for datasets to examine trusted AI at the functional level rather than at the component level. We discuss the several elements of AI services that are needed for people to trust them, including task performance, safety, security, and maintenance of lineage. The final piece to build trust is transparent documentation about the service, which we see as a variation on declarations of conformity for consumer products. We propose a starting point to a voluntary AI service supplier's declaration of conformity. Further discussion among multiple parties is required to standardize protocols for testing AI services and determine the final set of items and format that AI service FactSheets will take.

We envision that suppliers will voluntarily populate and release FactSheets for their services to remain competitive in the market. The evolution of the marketplace of AI services may eventually lead to an ecosystem of third-party testing and verification laboratories, services, and tools. We also envision the automation of nearly the entire FactSheet as part of the build and runtime environments of AI services. Moreover, it is not difficult to imagine FactSheets being automatically posted to distributed, immutable ledgers such as those enabled by blockchain technologies.

We see our work as a first step at defining which questions to ask and metrics to measure towards the development and adoption of broader industry practices and standards. We see a parallel between the issue of trusted AI today and the rise of digital certification during the Internet revolution. The digital certification market 'bootstrapped' the Internet, ushering in a new era of 'transactions' such as online banking and benefits enrollment that we take for granted today. In a similar vein, we can see AI service FactSheets ushering in a new era of trusted AI endpoints and bootstrapping broader adoption.

References

[1] E. E. Levine, T. B. Bitterly, T. R. Cohen, and M. E. Schweitzer, "Who is trustworthy? Predicting trustworthy intentions and behavior," Journal of Personality and Social Psychology, vol. 115, no. 3, pp. 468–494, Sep. 2018.

[2] M. K. Lee, "Understanding perception of algorithmic decisions: Fairness, trust, and emotion in response to algorithmic management," Big Data & Society, vol. 5, no. 1, Jan.–Jun. 2018.

[3] K. R. Varshney and H. Alemzadeh, "On the safety of machine learning: Cyber-physical systems, decision sciences, and data products," Big Data, vol. 5, no. 3, pp. 246–255, Sep. 2017.

[4] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in Proceedings of the IEEE European Symposium on Security and Privacy, Saarbrucken, Germany, Mar. 2016, pp. 372–387.

[5] N. Moller, "The concepts of risk and safety," in Handbook of Risk Theory, S. Roeser, R. Hillerbrand, P. Sandin, and M. Peterson, Eds. Dordrecht, Netherlands: Springer, 2012, pp. 55–85.

[6] National Institute of Standards and Technology, "The use of supplier's declaration of conformity," https://www.nist.gov/document-6075.

[7] American National Standards Institute, "U.S. conformity assessment system: 1st party conformity assessment," https://www.standardsportal.org/usa en/conformity assessment/suppliers declaration.aspx.
[8] T. Gebru, J. Morgenstern, B. Vecchione, J. W. [19] [Online]. Available: Vaughan, H. Wallach, H. Daumé, III, and https://www.cpsc.gov/Business--Manufacturing/Testing-Certific K. Crawford, “Datasheets for datasets,” in Pro- ceedings of the Fairness, Accountability, and [20] [Online]. Available: Transparency in Machine Learning Workshop, http://the-parenting-center.com/about-the-seal-of-approval Stockholm, Sweden, Jul. 2018. [21] [Online]. Available: https://www.moodys.com/Pages/amr002002.aspx [9] E. M. Bender and B. Friedman, “Data state- ments for NLP: Toward mitigating system bias [22] [Online]. Available: and enabling better science,” Transactions of the https://www.spratings.com/documents/20184/774196/Guide to ACL, forthcoming. [23] [Online]. Available: [10] S. Holland, A. Hosny, S. Newman, J. Joseph, http://www.trusteddigitalrepository.eu/Trusted%20Digital%20R and K. Chmielinski, “The dataset nutrition la- bel: A framework to drive higher data quality [24] [Online]. Available: standards,” arXiv:1805.03677, May 2018. https://www.coretrustseal.org/about/ [25] A. K. Ghosh and G. McGraw, “An approach for [11] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, certifying security in software components,” in L. Vasserman, B. Hutchinson, E. Spitzer, I. D. the 21st National Information Systems Security Raji, and T. Gebru, “Model cards for model re- Conference, 1998. porting,” in Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, [26] S. C. A., “The software certi- Atlanta, USA, Jan. 2019. fication process.” [Online]. Available: http://www.ittoday.info/AIMS/DSM/82-01-17.pdf [12] M. Loukides, H. Mason, and D. Patil, “Of oaths and checklists,” [27] P. A. Currit, M. Dyer, and H. D. Mills, “Certi- https://www.oreilly.com/ideas/of-oaths-and- fying the reliability of software,” IEEE Transac- checklists, Jul. 2018. tions on Software Engineering 1, pp. 3–11, 1986. [13] M. Hind, S. Mehta, A. Mojsilović, R. Nair, [28] D. Port and J. Wilf, “The value of certifying soft- K. N. Ramamurthy, A. Olteanu, and K. R. ware release readiness: an exploratory study of Varshney, “Increasing trust in AI services certification for a critical system at jpl,” in 2013 through supplier’s declarations of conformity,” ACM/IEEE International Symposium on Em- https://arxiv.org/abs/1808.07261, Aug. 2018. pirical Software Engineering and Measurement, 2013, pp. 373–382. [14] C. O’Neil, “What is a data audit?” http://www.oneilrisk.com/articles/2017/1/24/- [29] P. Heck, M. Klabbers, and M. van Eekelen, “A what-is-a-data-audit, Jan. 2017. software product certification model,” Software Quality Journal, vol. 18, no. 1, 2010. [15] R. Carrier, “AI safety— the [30] [Online]. Available: concept of independent audit,” https://www.sei.cmu.edu/research-capabilities/all-work/display. https://www.forhumanity.center/independent- audit-1/. [31] A. D. Selbst, “Disparate impact in big data policing,” Georgia Law Review, vol. 52, no. 1, [16] The European Commission’s High-Level Expert pp. 109–195, Feb. 2017. Group on Artificial Intelligence, “Draft ethics guidelines for trustworthy AI,” Brussels, Bel- [32] J. M. Sims, “A brief review of the Belmont gium, Dec. 2018. report,” Dimensions of Critical Care Nursing, vol. 29, no. 4, pp. 173–174, Jul.–Aug. 2010. [17] [Online]. Available: http://standards.ieee.org/develop/overview.html [33] K. R. Varshney, “Data science of the people, for the people, by the people: A viewpoint on [18] [Online]. 
Available: an emerging dichotomy,” in Data for Good Ex- https://www.iso.org/home.html change Conference, New York, USA, Sep. 2015. 10
[34] X. Li, T. J. Hess, and J. S. Valacich, “Why [44] C. Rudin, “Algorithms for interpretable machine do we trust new technology? a study of initial learning,” in Proceedings of the ACM SIGKDD trust formation with organizational information International Conference on Knowledge Discov- systems,” Journal of Strategic Information Sys- ery and Data Mining, New York, USA, Aug. tems, vol. 17, no. 1, pp. 39–71, Mar. 2008. 2014, p. 1519. [35] S. Scott, “Artificial intelligence & communica- [45] F. Doshi-Velez and B. Kim, “Towards a rigor- tions: The fads. the fears. the future.” 2018. ous science of interpretable machine learning,” arXiv:1702.08608, Feb. 2017. [36] K. L. Wagstaff, “Machine learning that mat- ters,” in Proceedings of the International Con- [46] M.-I. Nicolae, M. Sinn, M. N. Tran, A. Rawat, ference on Machine Learning, Edinburgh, UK, M. Wistuba, V. Zantedeschi, N. Baracaldo, Jun.–Jul. 2012, pp. 529–536. B. Chen, H. Ludwig, I. M. Molloy, and B. Ed- [37] A. Olteanu, K. Talamadupula, and K. R. Varsh- wards, “Adversarial robustness toolbox v0.3.0,” ney, “The limits of abstract evaluation metrics: arXiv:1807.01069, Aug. 2018. The case of hate speech detection,” in Proceed- [47] A. Raghunathan, J. Steinhardt, and P. Liang, ings of the ACM Web Science Conference, Troy, “Certified defenses against adversarial exam- USA, Jun. 2017, pp. 405–406. ples,” arXiv:1801.09344, Jan. 2018. [38] N. Möller and S. O. Hansson, “Principles of engi- [48] J. ben-Aaron, M. Denny, B. Desmarais, and neering safety: Risk and uncertainty reduction,” H. Wallach, “Transparency by conformity: A Reliability Engineering & System Safety, vol. 93, field experiment evaluating openness in local no. 6, pp. 798–805, Jun. 2008. governments,” Public Administration Review, [39] J. Gama, I. Žliobaité, A. Bifet, M. Pechenizkiy, vol. 77, no. 1, pp. 68–77, Jan.–Feb. 2017. and A. Bouchachia, “A survey on concept drift adaptation,” ACM Computing Surveys, vol. 46, [49] G. A. Akerlof, “The market for “lemons”: Qual- no. 4, p. 44, Apr. 2014. ity uncertainty and the market mechanism,” Quarterly Journal of Economics, vol. 84, no. 3, [40] S. Hajian, F. Bonchi, and C. Castillo, “Al- pp. 488–500, Aug. 1970. gorithmic bias: From discrimination discovery to fairness-aware data mining,” in Proceedings [50] P. Nemitz, “Constitutional democracy and tech- of the ACM SIGKDD International Conference nology in the age of artificial intelligence,” Philo- on Knowledge Discovery and Data Mining, San sophical Transactions of the Royal Society A, Francisco, USA, Aug. 2016, pp. 2125–2126. vol. 376, no. 2133, p. 20180089, Nov. 2018. [41] S. Barocas and A. D. Selbst, “Big data’s dis- [51] B. Srivastava and F. Rossi, “Towards compos- parate impact,” California Law Review, vol. 104, able bias rating of AI services,” in Proceedings of no. 3, pp. 671–732, Jun. 2016. the AAAI/ACM Conference on Artificial Intel- ligence, Ethics, and Society, New Orleans, USA, [42] R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hoff- Feb. 2018. man, S. Houde, K. Kannan, P. Lohia, J. Mar- tino, S. Mehta, A. Mojsilovic, S. Nagar, K. N. [52] L. K. McAllister, “Harnessing private regula- Ramamurthy, J. Richards, D. Saha, P. Sattigeri, tion,” Michigan Journal of Environmental & Ad- M. Singh, K. R. Varshney, and Y. Zhang, “AI ministrative Law, vol. 3, no. 2, pp. 291–420, Fairness 360: An extensible toolkit for detect- 2014. ing, understanding, and mitigating unwanted al- gorithmic bias,” arXiv:1810.01943, Oct. 2018. A Proposed FactSheet Items [43] M. Feldman, S. A. Friedler, J. Moeller, C. 
Schei- degger, and S. Venkatasubramanian, “Certifying Below we list example questions that a FactSheet for and removing disparate impact,” in Proceedings an AI service could include. The set of questions of the ACM SIGKDD International Conference we provide here is not intended to be definitive, but on Knowledge Discovery and Data Mining, Syd- rather to open a conversation about what aspects ney, Australia, Aug. 2015, pp. 259–268. should be covered. 11
To illustrate how these questions could be an- – When and how often? swered, we provide two examples for fictitious AI ser- vices: a fingerprint verification service (Appendix B) – What sections have changed? and a trending topics service (Appendix C). How- – Is the FactSheet updated every time the ser- ever, given that the examples we provide are ficti- vice is retrained or updated? tious, we would expect an actual service provider to answer these questions in much more detail. For in- stance, they would be able to better characterize an Usage API that actually exists. Our example answers are mainly to provide additional insight about the type • What is the intended use of the service output? of information we would find in a FactSheet. – Briefly describe a simple use-case. Statement of purpose The following questions are aimed at providing an • What are the key procedures followed while us- overview of the service provider and of the intended ing the service? uses for the service. Valid answers include “N/A” (not applicable) and “Proprietary” (cannot be pub- – How is the input provided? By whom? licly disclosed, usually for competitive reasons). – How is the output returned? General • Who are “you” (the supplier) and what type of Domains and applications services do you typically offer (beyond this par- • What are the domains and applications the ser- ticular service)? vice was tested on or used for? • What is this service about? – Were domain experts involved in the devel- opment, testing, and deployment? Please – Briefly describe the service. elaborate. – When was the service first released? When was the last release? – Who is the target user? • How is the service being used by your customers or users? • Describe the outputs of the service. – Are you enabling others to build a solution by providing a cloud service or is your ap- • What algorithms or techniques does this service plication end-user facing? implement? – Is the service output used as-is, is it fed di- rectly into another tool or actuator, or is – Provide links to technical papers. there human input/oversight before use? – Do users rely on pre-trained/canned models or can they train their own models? • What are the characteristics of the development – Do your customers typically use your ser- team? vice in a time critical setup (e.g. they have limited time to evaluate the output)? Or – Do the teams charged with developing and do they incorporate it in a slower decision maintaining this service reflect a diversity making process? Please elaborate. of opinions, backgrounds, and thought? • List applications that the service has been used • Have you updated this FactSheet before? for in the past. 12
– Please provide information about these ap- – Briefly describe how a third party could in- plications or relevant pointers. dependently verify the performance of the – Please provide key performance results for service. those applications. – Are there benchmarks publicly available and adequate for testing the service. • Other comments? Basic Performance • In addition to the service provider, was this ser- vice tested by any third party? The following questions aim to offer an overall assess- ment of the service performance. – Please list all third parties that performed Testing by service provider the testing. – Also, please include information about the • Which datasets was the service tested on? (e.g., tests and test results. links to datasets that were used for testing, along with corresponding datasheets) – List the test datasets and provide links to • Other comments? these datasets. – Do the datasets have an associated datasheet? If yes, please attach. – Could these datasets be used for indepen- Safety dent testing of the service? Did the data need to be changed or sampled before use? The following questions aim to offer insights about potential unintentional harms, and mitigation efforts to eliminate or minimize those harms. • Describe the testing methodology. General – Please provide details on train, test and holdout data. – What performance metrics were used? • Are you aware of possible examples of bias, ethi- (e.g. accuracy, error rates, AUC, preci- cal issues, or other safety risks as a result of using sion/recall) the service? – Please briefly justify the choice of metrics. – Were the possible sources of bias or unfair- • Describe the test results. ness analyzed? – Where do they arise from: the data? the – Were latency, throughput, and availability particular techniques being implemented? measured? other sources? – If yes, briefly include those metrics as well. – Is there any mechanism for redress if indi- viduals are negatively affected? Testing by third parties • Do you use data from or make inferences about • Is there a way to verify the performance metrics individuals or groups of individuals. Have you (e.g., via a service API )? obtained their consent? 13
– How was it decided whose data to use or – Please describe the data bias policies that about whom to make inferences? were checked (such as with respect to known – Do these individuals know that their data protected attributes), bias checking meth- is being used or that inferences are being ods, and results (e.g., disparate error rates made about them? What were they told? across different groups). When were they made aware? What kind – Was there any bias remediation performed of consent was needed from them? What on this dataset? Please provide details were the procedures for gathering consent? about the value of any bias estimates be- Please attach the consent form to this dec- fore and after it. laration. – What techniques were used to perform the – What are the potential risks to these indi- remediation? Please provide links to rele- viduals or groups? Might the service output vant technical papers. interfere with individual rights? How are – How did the value of other performance these risks being handled or minimized? metrics change as a result? – What trade-offs were made between the rights of these individuals and business in- terests? – Do they have the option to withdraw their • Does the service implement and perform any bias data? Can they opt out from inferences be- detection and remediation? ing made about them? What is the with- drawal procedure? – Please describe model bias policies that were checked, bias checking methods, and results (e.g., disparate error rates across dif- Explainability ferent groups). – What procedures were used to perform the remediation? Please provide links or refer- ences to corresponding technical papers. • Are the service outputs explainable and/or in- terpretable? – Please provide details about the value of any bias estimates before and after such re- mediation. – Please explain how explainability is – How did the value of other performance achieved (e.g. directly explainable algo- metrics change as a result? rithm, local explainability, explanations via examples). – Who is the target user of the explanation (ML expert, domain expert, general con- Concept Drift sumer, etc.) – Please describe any human validation of the explainability of the algorithms • What is the expected performance on unseen data or data with different distributions? – Please describe any relevant testing done Fairness along with test results. • For each dataset used by the service: Was the dataset checked for bias? What efforts were • Does your system make updates to its behavior made to ensure that it is fair and representative? based on newly ingested data? 14
– Is the new data uploaded by your users? Is – Describe specific concerns and sensitive use it generated by an automated process? Are cases. the patterns in the data largely static or do – Are there any procedures in place to ensure they change over time? that the service will not be used for these – Are there any performance guaran- applications? tees/bounds? – Does the service have an automatic feed- back/retraining loop, or is there a human in the loop? • How are you securing user or usage data? – Is usage data from service operations re- • How is the service tested and monitored for tained and stored? model or performance drift over time? – How is the data being stored? For how long is the data stored? – If applicable, describe any relevant testing – Is user or usage data being shared outside along with test results. the service? Who has access to the data? • How can the service be checked for correct, ex- pected output when new data is added? • Was the service checked for robustness against adversarial attacks? • Does the service allow for checking for differences between training and usage data? – Describe robustness policies that were checked, the type of attacks considered, checking methods, and results. – Does it deploy mechanisms to alert the user of the difference? • What is the plan to handle any potential security • Do you test the service periodically? breaches? – Does the testing includes bias or fairness – Describe any protocol that is in place. related aspects? – How has the value of the tested metrics evolved over time? • Other comments? • Other comments? Lineage Security The following questions aim to overview how the ser- The following questions aim to assess the susceptibil- vice provider keeps track of details that might be re- ity to deliberate harms such as attacks by adversaries. quired in the event of an audit by a third party, such as in the case of harm or suspicion of harm. • How could this service be attacked or abused? Please describe. Training Data • List applications or scenarios for which the ser- • Does the service provide an as-is/canned model? vice is not suitable. Which datasets was the service trained on? 15