Teach Me to Explain: A Review of Datasets for Explainable NLP
Sarah Wiegreffe* (School of Interactive Computing, Georgia Institute of Technology, saw@gatech.edu)
Ana Marasović* (Allen Institute for AI, University of Washington, anam@allenai.org)
* Equal contributions.

arXiv:2102.12060v2 [cs.CL] 1 Mar 2021

Abstract

Explainable NLP (ExNLP) has increasingly focused on collecting human-annotated explanations. These explanations are used downstream in three ways: as data augmentation to improve performance on a predictive task, as a loss signal to train models to produce explanations for their predictions, and as a means to evaluate the quality of model-generated explanations. In this review, we identify three predominant classes of explanations (highlights, free-text, and structured), organize the literature on annotating each type, point to what has been learned to date, and give recommendations for collecting ExNLP datasets in the future.

1 Introduction

Interpreting supervised machine learning models by analyzing local explanations—justifications of their individual predictions[1]—is important for debugging, protecting sensitive information, understanding model robustness, and increasing warranted trust in a stated contract (Doshi-Velez and Kim, 2017; Molnar, 2019; Jacovi et al., 2021). Explainable NLP (ExNLP) researchers rely on datasets that contain human justifications for the true label (overviewed in Tables 3–5). Human justifications are used to aid models with additional training supervision (learning-from-explanations; Zaidan et al., 2007), to train interpretable models that explain their own predictions (Camburu et al., 2018), and to evaluate the plausibility of explanations by measuring their agreement with human-annotated explanations (DeYoung et al., 2020a).

[1] As opposed to global explanation, where one explanation is given for an entire system (Molnar, 2019).

Dataset collection is the most under-scrutinized component of the machine learning pipeline (Paritosh, 2020)—it is estimated that 92% of machine learning practitioners encounter data cascades, or downstream problems resulting from poor data quality (Sambasivan et al., 2021). It is important to constantly evaluate data collection practices critically and to establish common and standardized methodologies (Bender and Friedman, 2018; Gebru et al., 2018; Jo and Gebru, 2020; Ning et al., 2020; Paullada et al., 2020). Even highly influential datasets that are widely adopted in research undergo such inspections. For instance, Recht et al. (2019) tried to reproduce the ImageNet test set by carefully following the instructions in the original paper (Deng et al., 2009), but multiple models do not perform well on the replicated test set. We expect that such examinations are particularly valuable when many related datasets are released contemporaneously and independently in a short period of time, as is the case with ExNLP datasets.

This survey aims to review and summarize the literature on collecting ExNLP datasets, highlight what has been learned to date, and give recommendations for future ExNLP dataset construction. It complements other explainable AI (XAI) surveys and critical retrospectives that focus on definitions (Gilpin et al., 2018; Miller, 2019; Clinciu and Hastie, 2019; Barredo Arrieta et al., 2020; Jacovi and Goldberg, 2020), methods (Biran and Cotton, 2017; Ras et al., 2018; Guidotti et al., 2019), evaluation (Hoffman et al., 2018; Yang et al., 2019), or a combination of these topics (Lipton, 2018; Doshi-Velez and Kim, 2017; Adadi and Berrada, 2018; Murdoch et al., 2019; Verma et al., 2020; Burkart and Huber, 2021), but not on datasets. Kotonya and Toni (2020a) and Thayaparan et al. (2020) review datasets and methods for explaining automated fact-checking and machine reading comprehension, respectively; we are the first to review datasets across all NLP tasks.

To organize the literature, we first define our scope (§2), sort out ExNLP terminology (§3), and present an overview of existing ExNLP datasets (§4).[2] Then, we present what can be learned from existing ExNLP datasets. Specifically, §5 discusses the typical process of collecting highlighted elements in the inputs as explanations, and discrepancies between this process and that of evaluating automatic methods. In the same section, we also draw attention to how assumptions made in free-text explanation collection influence their modeling, and call for better documenting of ExNLP data collection. In §6, we illustrate that not all template-like free-text explanations are unwanted, and call for embracing the structure of an explanation when appropriate. In §7.1, we present a proposal for controlling quality in free-text explanation collection. Finally, §8 gathers recommendations from related subfields to further reduce data artifacts by increasing diversity of collected explanations.

[2] We created a living version of the tables as a website, where pull requests and issues can be opened to add and update datasets: https://exnlpdatasets.github.io

Figure 1: A description of the two classes of ExNLP datasets (§2): explaining human decisions and explaining observations about the world. For the input "Jenny forgot to lock her car at the grocery store.", an explanation can explain a human-provided answer (left and middle panels; e.g., the answer "Past" to "What tense is the sentence in?" is explained by "'Forgot' is the past tense of the verb 'to forget'.", and the answer "Jenny's car is broken into." to "What might follow from this event?" is explained by "Thieves often target unlocked cars in public places.") or something about the world (middle and right panels). The shaded area is our survey scope. These classes can be used to teach and evaluate models on different skill sets: their ability to justify their predictions (left), reason about the world (right), or do both (middle).

2 Survey Scope

An explanation can be described as a "three-place predicate: someone explains something to someone" (Hilton, 1990). But what is the something being explained in ExNLP? We identify two types of ExNLP dataset papers, based on their answer to this question: those which explain human decisions (task labels), and those which explain observed events or phenomena in the world. We illustrate these classes in Figure 1.

The first class, explaining human decisions, has a long history and a strong association with the term "explanation" in NLP. In this class, explanations are not the task labels themselves, but rather explain some task label for a given input (see examples in Figure 1 and Table 1). One distinguishing factor in their collection is that they are often collected by asking "why" questions, such as "why is [input] assigned [label]?". The underlying assumption is that the explainer believes the assigned label to be correct or at least likely, either because they selected it, or another annotator did (discussed further in §7.3). Resulting datasets contain instances that are tuples of the form (input, label, explanation), where each explanation is tied to an input-label pair. Of course, collecting explanations of human decisions can help train systems to model them. This class is thus directly suited to the three ExNLP goals: they serve as (i) additional training supervision to predict labels from inputs, (ii) training supervision for models to explain their label predictions, and (iii) an evaluation set for the quality of model-produced explanations of label predictions. We define our survey scope as focusing on this class of explanations.

The second class falls under commonsense reasoning, and this type of explanation is commonly referred to as commonsense inference about the world. While most datasets that can be framed as explaining something about the world do not use the term "explanation" (Zellers et al., 2018; Sap et al., 2019b; Zellers et al., 2019b; Huang et al., 2019; Fan et al., 2019; Bisk et al., 2020; Forbes et al., 2020), a few recent ones do: ART (Bhagavatula et al., 2020) and GLUCOSE (Mostafazadeh et al., 2020). These datasets have explanations as instance-labels themselves, rather than as justifications for existing labels. For example, ART collects plausible hypotheses and GLUCOSE collects certain or likely statements to explain how sentences in a story can follow other sentences. They thus produce tuples of the form (input, explanation), where the input is an event or observation.

The two classes are not mutually exclusive.
For example, SBIC (Sap et al., 2019a) contains human-annotated labels and justifications indicating whether or not social media posts contain social bias. While this clearly follows the form of the first class because it provides explanations for human-assigned bias labels, it also explains observed phenomena in the world: the justification for the phrase "We shouldn't lower our standards just to hire more women" receiving a bias label is "[it] implies women are less qualified." Other examples of datasets in both classes include predicting future events in videos (VLEP; Lei et al., 2020) and answering commonsense questions about images (VCR; Zellers et al., 2019a). Both collect observations about a real-world setting as task labels with rationales explaining why the observations are correct.

This second class is potentially unbounded because many machine reasoning datasets could be re-framed as explanation under this definition. For example, one could consider labels in the PIQA dataset (Bisk et al., 2020) to be explanations, even though they do not explain human decisions. For example, for the question "how do I separate an egg yolk?", the PIQA gold label is "Squeeze the water bottle and press it against the yolk. Release, which creates suction and lifts the yolk." While this instruction is an explanation of how to separate an egg yolk, an explanation for the human decision would require a further justification of the given response, e.g., "Separating the egg requires pulling the yolk out. Suction provides the force to do this." We do not survey datasets that only fall into this broader, unbounded class, such as PIQA, ART, and GLUCOSE, leaving discussion of how best to define this class and its utility for the goals of ExNLP to future work.[3]

[3] Likewise for causality extraction data (Xu et al., 2020).

Table 1: Examples of explanation types discussed in §3; highlighted tokens are shown in [brackets]. The first two rows show a highlight and free-text explanation for an E-SNLI instance (Camburu et al., 2018). The last row shows a structured explanation from QED for a Natural Questions instance (Lamm et al., 2020).

Instance: Premise: A white race dog wearing the number eight runs on the track. Hypothesis: A white race dog runs around his yard. Label: contradiction
  (highlight) Premise: A white race dog wearing the number eight runs on [the track]. Hypothesis: A white race dog runs around [his yard].
  (free-text) A race track is not usually in someone's yard.

Instance: Question: Who sang the theme song from Russia With Love? Paragraph: John Barry, arranger of Monty Norman's "James Bond Theme". . . would be the dominant Bond series composer for most of its history and the inspiration for fellow series composer, David Arnold (who uses cues from this soundtrack in his own. . . ). The theme song was composed by Lionel Bart of Oliver! fame and sung by Matt Monro. Answer: Matt Monro
  (structured) Sentence selection: The theme song was composed by Lionel Bart of Oliver! fame and sung by Matt Monro. Referential equality: "the theme song from russia with love" (extracted from question) = "The theme song" (extracted from paragraph). Entailment: X was composed by Lionel Bart of Oliver! fame and sung by ANSWER. ⊢ ANSWER sung X

3 Explainability Lexicon

The first step in collecting human explanations is to decide in what format annotators will give their justifications. We identify three types: highlights, free-text, and structured explanations. An example of each type is given in Table 1. Since a consensus on terminology has not yet been reached, we describe each type below.

To define highlights, we turn to Lei et al. (2016), who popularized extractive rationales, i.e., subsets of the input tokens that satisfy two properties: (i) compactness, the selected input tokens are short and coherent, and (ii) sufficiency, the selected tokens suffice for prediction as a substitute of the original text. Yu et al. (2019) introduce a third property, (iii) comprehensiveness, that all the evidence that supports the prediction is selected, not just a sufficient set—i.e., the tokens that are not selected do not predict the label.[4] Lei et al. (2016) formalize the task of first selecting the parts of the input instance to be the extractive explanation, and then making a prediction based only on the extracted tokens. Since then, this task and the term "extractive rationale" have been associated. Jacovi and Goldberg (2021) argue against the term, because "rationalization" implies human intent, while the task Lei et al. (2016) propose does not rely on, or aim to model, human justifications. As an alternative, they propose the term highlights and the verbs "highlighting" or "extracting highlights" to avoid inaccurately attributing human-like social behavior to AI systems. Fact-checking, medical claim verification, and multi-document question answering (QA) research often refers to highlights as evidence—a part of the source that refutes or supports the claim.

[4] Henceforth, "sufficient" and "comprehensive" always refer to these definitions.
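Properties (ii) and (iii) are typically quantified with erasure-based measurements in the style of DeYoung et al. (2020a): sufficiency compares a model's confidence on the full input with its confidence on the highlight alone, and comprehensiveness compares it with the input after the highlight is removed. The sketch below is only illustrative—predict_proba is a hypothetical classifier interface, not code from any of the surveyed papers.

```python
from typing import Callable, Dict, List, Set

# Hypothetical interface: predict_proba(tokens) returns a dict
# mapping each label to the classifier's probability for it.
PredictProba = Callable[[List[str]], Dict[str, float]]

def sufficiency(predict_proba: PredictProba, tokens: List[str],
                highlight_idx: Set[int], label: str) -> float:
    """Drop in confidence when the model sees only the highlighted tokens.
    Values near 0 mean the highlight alone (nearly) suffices for the prediction."""
    full = predict_proba(tokens)[label]
    only_highlight = [t for i, t in enumerate(tokens) if i in highlight_idx]
    return full - predict_proba(only_highlight)[label]

def comprehensiveness(predict_proba: PredictProba, tokens: List[str],
                      highlight_idx: Set[int], label: str) -> float:
    """Drop in confidence when the highlighted tokens are erased.
    High values suggest the highlight captured most of the evidence."""
    full = predict_proba(tokens)[label]
    without_highlight = [t for i, t in enumerate(tokens) if i not in highlight_idx]
    return full - predict_proba(without_highlight)[label]
```

Low sufficiency scores and high comprehensiveness scores correspond to what criteria (ii) and (iii) ask of a highlight; §5 returns to how these measurements interact with the way human highlights are actually collected.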
To reiterate, highlights are subsets of the input elements (either words, subwords, or full sentences) that are compact and sufficient to explain a prediction. If they are also comprehensive, we call them comprehensive highlights. Although the community has settled on criteria (i)–(iii) for highlights, the extent to which collected datasets (Table 3) reflect them varies greatly, as we will discuss in §5. Table 2 provides examples of sufficient vs. non-sufficient and comprehensive vs. non-comprehensive highlights.

Table 2: Examples of highlight properties discussed in §3 and §5. The highlight for the first example, a negative movie review (Zaidan et al., 2007), is non-comprehensive because its complement in the input is also predictive of the label, unlike the highlight for the second movie review (DeYoung et al., 2020a). These two highlights are sufficient because they are predictive of the negative label, in contrast to the highlight for an E-SNLI premise-hypothesis pair (Camburu et al., 2018) in the last row, which is not sufficient because it is not predictive of the neutral label on its own.

Row 1 (¬comprehensive): Review: "this film is extraordinarily horrendous and I'm not going to waste any more words on it." Label: negative.
Row 2 (comprehensive): Review: "claire danes, giovanni ribisi, and omar epps make a likable trio of protagonists, but they're just about the only palatable element of the mod squad, a lame-brained big-screen version of the 70s tv show. the story has all the originality of a block of wood (well, it would if you could decipher it), the characters are all blank slates, and scott silver's perfunctory action sequences are as cliched as they come. by sheer force of talent, the three actors wring marginal enjoyment from the proceedings whenever they're on screen, but the mod squad is just a second-rate action picture with a first-rate cast." Label: negative.
Row 3 (¬sufficient): Premise: "A shirtless man wearing white shorts." Hypothesis: "A man in white shorts is running on the sidewalk." Label: neutral.

Free-text explanations are free-form textual justifications that are not constrained to the instance inputs. They are thus more expressive and generally more readable than highlights. This makes them useful for explaining reasoning tasks where explanations must contain information outside the given input sentence or document (Camburu et al., 2018; Wiegreffe et al., 2020). They are also called textual (e.g., in Kim et al., 2018) or natural language explanations (e.g., in Camburu et al., 2018), but these terms have been overloaded to refer to highlights as well (e.g., in Prasad et al., 2020). Other synonyms, free-form (e.g., in Camburu et al., 2018) or abstractive explanations (e.g., in Narang et al., 2020), are suitable, but we opt to use "free-text" as it is more descriptive; "free-text" emphasizes that the explanation is textual, regardless of the input modality, just as human explanations are verbal in many contexts (Lipton, 2018).[5]

[5] We could use the term free-text rationales since free-text explanations are introduced to make model explanations more intelligible to people, i.e., they are tailored toward human justifications by design, and typically produced by a model trained with human free-text explanations. The human-computer interaction (HCI) community even advocates to use the term "rationale" to stress that the goal of explainable AI is to produce explanations that sound like humans (Ehsan et al., 2018). However, since our target audience is NLP researchers, we refrain from using "rationale" to avoid confusion with how the term is introduced in Lei et al. (2016) and due to the fact that unsupervised methods for free-text explanation generation exist (Latcinnik and Berant, 2020).

Finally, we use the term structured explanations to describe explanations that are not entirely free-form, e.g., there are constraints placed on the explanation-creating process, such as the use of specific inference rules. We discuss the recent emergence of such explanations in §6. Structured explanations do not have one common definition; we elaborate on dataset-specific designs in §4.
4 Overview of Existing Datasets

We overview currently available ExNLP datasets by explanation type: highlights (Table 3), free-text explanations (Table 4), and structured explanations (Table 5). To the best of our knowledge, all existing datasets are in English.

We report the number of instances (input-label pairs) of each dataset, as well as the number of explanations per instance (if > 1). The total number of explanations in a dataset is equal to the number of instances times the value in parentheses.

We report the annotation procedure used to collect each dataset as belonging to the following categories: crowd-annotated ("crowd"); automatically annotated through a web-scrape, database crawl, or merge of existing datasets ("auto"); or annotated by others ("experts", "students", or "authors"). Some datasets use a combination of means; for example, Evidence Inference (Lehman et al., 2019; DeYoung et al., 2020b) was collected and validated by crowdsourcing from doctors. Some authors perform semantic parsing on collected explanations; we classify them by the dataset type before parsing, list the collection type as "crowd + authors", and denote them with *. Tables 3–5 elucidate that the dominant collection paradigm (≥90% of entries) is via human (crowd, student, author, or expert) annotation.

Table 3: Overview of ExNLP datasets with highlights. We do not include the BeerAdvocate dataset (McAuley et al., 2012) because it has been retracted. Values in parentheses indicate number of explanations collected per instance (if greater than 1). DeYoung et al. (2020a) collected (for BoolQ) or re-collected annotations for prior datasets (marked with the subscript c). ♦ Collected > 1 explanation per instance but only release 1. † Also contains free-text explanations. ‡ A subset of the original dataset that is annotated. It is not reported what subset of Natural Questions has both a long and short answer.

Dataset | Task | Granularity | Collection | # Instances
Movie Reviews (Zaidan et al., 2007) | sentiment classification | none | author | 1,800
Movie Reviews_c (DeYoung et al., 2020a) | sentiment classification | none | crowd | 200‡♦
SST (Socher et al., 2013) | sentiment classification | none | crowd | 11,855♦
WikiQA (Yang et al., 2015) | open-domain QA | sentence | crowd + authors | 1,473
WikiAttack (Carton et al., 2018) | detecting personal attacks | none | students | 1,089♦
E-SNLI† (Camburu et al., 2018) | natural language inference | none | crowd | ~569K (1 or 3)
MultiRC (Khashabi et al., 2018) | reading comprehension QA | sentences | crowd | 5,825
FEVER (Thorne et al., 2018) | verifying claims from text | sentences | crowd | ~136K‡
HotpotQA (Yang et al., 2018) | reading comprehension QA | sentences | crowd | 112,779
Hanselowski et al. (2019) | verifying claims from text | sentences | crowd | 6,422 (varies)
Natural Questions (Kwiatkowski et al., 2019) | reading comprehension QA | 1 paragraph | crowd | n/a‡ (1 or 5)
CoQA (Reddy et al., 2019) | conversational QA | none | crowd | ~127K (1 or 3)
CoS-E v1.0† (Rajani et al., 2019) | commonsense QA | none | crowd | 8,560
CoS-E v1.11† (Rajani et al., 2019) | commonsense QA | none | crowd | 10,962
BoolQ_c (DeYoung et al., 2020a) | reading comprehension QA | none | crowd | 199‡♦
Evidence Inference v1.0 (Lehman et al., 2019) | evidence inference | none | experts | 10,137
Evidence Inference v1.0_c (DeYoung et al., 2020a) | evidence inference | none | experts | 125‡
Evidence Inference v2.0 (DeYoung et al., 2020b) | evidence inference | none | experts | 2,503
SciFact (Wadden et al., 2020) | verifying claims from text | 1-3 sentences | experts | 995‡ (1-3)
Kutlu et al. (2020) | webpage relevance ranking | 2-3 sentences | crowd | 700 (15)

4.1 Highlights (Table 3)

The granularity of highlights depends on the task they are collected for. The majority of authors do not place a restriction on granularity, allowing highlights to be made up of words, phrases, or full sentences. Unlike other explanation types, some highlight datasets are re-purposed from datasets for other tasks. For example, in order to collect a question-answering dataset with questions that cannot be answered from a single sentence, the MultiRC authors (Khashabi et al., 2018) ask annotators to specify which sentences are used to obtain the answers (and then throw out instances where only one sentence is specified). These collected sentence annotations can be considered sentence-level highlights, as they point to the sentences in a document that justify each task label (Wang et al., 2019b).
Another example of re-purposing is the Stanford Sentiment Treebank (SST; Socher et al., 2013), which contains crowdsourced sentiment annotations for word phrases extracted from Rotten Tomatoes movie reviews (Pang and Lee, 2005). Word phrases which have the same sentiment label as the review they belong to can be heuristically merged to produce phrase-level highlights (Carton et al., 2020).

The coarsest granularity of highlight in Table 3 is paragraph-level (a paragraph in a larger document is highlighted as evidence). The one dataset with this granularity, Natural Questions, has a paragraph-length long answer paired with a short answer to a question. For instances that have both a short and long answer, the long answer can be re-purposed as a highlight to justify the short answer. We exclude datasets that only include an associated document as evidence (and no further annotation of where the explanation is located within the document). The utility of paragraph- or document-level retrieval as explanation has not been researched; we leave the exploration of these two coarser granularities to future discussion.

4.2 Free-Text Explanations (Table 4)

This is a popular explanation type for both textual and visual-textual tasks, shown in the first and second half of the table, respectively. Most free-text explanation datasets do not specify a length restriction, but explanations are generally no more than a few sentences per instance. One exception is LIAR-PLUS (Alhindi et al., 2018), which is collected by scraping the conclusion paragraph of human-written fact-checking summaries from PolitiFact.[6]

[6] https://www.politifact.com/

Table 4: Overview of ExNLP datasets with free-text explanations for textual and visual-textual tasks (the latter marked with †† and placed in the lower part of the table). Values in parentheses indicate number of explanations collected per instance (if greater than 1). ‡ A subset of the original dataset that is annotated. ‡‡ Subset publicly available. * Authors semantically parse the collected explanations.

Dataset | Task | Collection | # Instances
Jansen et al. (2016) | science exam QA | authors | 363
Ling et al. (2017) | solving algebraic word problems | auto + crowd | ~101K
Srivastava et al. (2017)* | detecting phishing emails | crowd + authors | 7 (30-35)
Babble Labble (Hancock et al., 2018)* | relation extraction | students + authors | 200‡‡
E-SNLI (Camburu et al., 2018) | natural language inference | crowd | ~569K (1 or 3)
LIAR-PLUS (Alhindi et al., 2018) | verifying claims from text | auto | 12,836
CoS-E v1.0 (Rajani et al., 2019) | commonsense QA | crowd | 8,560
CoS-E v1.11 (Rajani et al., 2019) | commonsense QA | crowd | 10,962
Sen-Making (Wang et al., 2019a) | commonsense validation | students + authors | 2,021
WinoWhy (Zhang et al., 2020a) | pronoun coreference resolution | crowd | 273 (5)
SBIC (Sap et al., 2020) | social bias inference | crowd | 48,923 (1-3)
PubHealth (Kotonya and Toni, 2020b) | verifying claims from text | auto | 11,832
Wang et al. (2020)* | relation extraction | crowd + authors | 373
Wang et al. (2020)* | sentiment classification | crowd + authors | 85
E-δ-NLI (Brahman et al., 2021) | defeasible natural language inference | auto | 92,298 (~8)
BDD-X†† (Kim et al., 2018) | vehicle control for self-driving cars | crowd | ~26K
VQA-E†† (Li et al., 2018) | visual QA | auto | ~270K
VQA-X†† (Park et al., 2018) | visual QA | crowd | 28,180 (1 or 3)
ACT-X†† (Park et al., 2018) | activity recognition | crowd | 18,030 (3)
Ehsan et al. (2019)†† | playing arcade games | crowd | 2,000
VCR†† (Zellers et al., 2019a) | visual commonsense reasoning | crowd | ~290K
E-SNLI-VE†† (Do et al., 2020) | visual-textual entailment | crowd | 11,335 (3)‡
ESPRIT†† (Rajani et al., 2020) | reasoning about qualitative physics | crowd | 2,441 (2)
VLEP†† (Lei et al., 2020) | future event prediction | auto + crowd | 28,726
EMU†† (Da et al., 2020) | reasoning about manipulated images | crowd | 48K

4.3 Structured Explanations (Table 5)

Structured explanations take on a variety of dataset-specific forms. One common approach is to construct a chain of facts that detail the multi-hop reasoning to reach an answer ("chains of facts"). Another is to place some constraints on the textual rationales that annotators can write, such as requiring the use or labelling of certain variables in the input ("semi-structured text").
Table 5: Overview of ExNLP datasets with structured explanations (§6). Values in parentheses indicate number of explanations collected per instance (if greater than 1). * Authors semantically parse the collected explanations. ‡ Subset of instances annotated with explanations is not reported. Total # of explanations is 855 for eQASC-perturbed and 998 for eOBQA.

Dataset | Task | Explanation Type | Collection | # Instances
WorldTree V1 (Jansen et al., 2018) | science exam QA | explanation graphs | authors | 1,680
OpenBookQA (Mihaylov et al., 2018) | open-book science QA | 1 fact from WorldTree | crowd | 5,957
WorldTree V2 (Xie et al., 2020) | science exam QA | explanation graphs | experts | 5,100
QED (Lamm et al., 2020) | reading comp. QA | inference rules | authors | 8,991
QASC (Khot et al., 2020) | science exam QA | 2-fact chain | authors + crowd | 9,980
eQASC (Jhamtani and Clark, 2020) | science exam QA | 2-fact chain | auto + crowd | 9,980 (~10)
eQASC-perturbed (Jhamtani and Clark, 2020) | science exam QA | 2-fact chain | auto + crowd | n/a‡
eOBQA (Jhamtani and Clark, 2020) | open-book science QA | 2-fact chain | auto + crowd | n/a‡
Ye et al. (2020)* | SQuAD QA | semi-structured text | crowd + authors | 164
Ye et al. (2020)* | Natural Questions QA | semi-structured text | crowd + authors | 109
R4C (Inoue et al., 2020) | reading comp. QA | chains of facts | crowd | 4,588 (3)
StrategyQA (Geva et al., 2021) | implicit reasoning QA | reasoning steps w/ highlights | crowd | 2,780 (3)

WorldTree V1 (Jansen et al., 2018) proposes explaining elementary-school science questions with a combination of chains of facts and semi-structured text. They term these "explanation graphs". The facts are written by the authors as individual sentences that are centered around one of 62 different relations (e.g., "affordances") or properties (e.g., "average lifespans of living things"). Given the chain of facts for an instance (6.3 on average), the authors construct an explanation graph by mapping lexical overlaps (shared words) between the question, answer, and explanation. These explanations are used for the TextGraphs shared task on multi-hop inference (Jansen and Ustalov, 2019), which tested models on their ability to retrieve the appropriate facts from a knowledge base to reconstruct explanation graphs. WorldTree V2 (Xie et al., 2020) expands the dataset using crowdsourcing, and synthesizes the resulting explanation graphs into a set of 344 general inference patterns.

OpenBookQA (Mihaylov et al., 2018) uses single facts from WorldTree V1 to prime annotators to write question-answer pairs by thinking of a second fact that follows. They provide a set of 1,326 of these facts as an "open book" for models to perform the science exam question-answering task. QASC (Khot et al., 2020) uses a similar principle—the authors construct a set of seed facts and a large text corpus of additional science facts. They ask annotators to find a second fact that can be composed with the first, and to create a multiple-choice question from it. For example, the question "Differential heating of air can be harnessed for what? (Answer: electricity production)" has the following two associated facts: "Differential heating of air produces wind." and "Wind is used for producing electricity." The single fact in OpenBookQA and two facts in QASC can be considered a (partial or full) structured explanation for the resulting instances.

Jhamtani and Clark (2020) extend OBQA and QASC with further two-fact chain explanation annotations. They first automatically extract candidates from a fact corpus, and then crowdsource labels for whether the explanations are valid. The resulting datasets, eQASC and eOBQA, contain multiple (valid and invalid) explanations per instance. The authors also collect perturbed eQASC fact chains that maintain their validity, resulting in another set of fact chains, eQASC-perturbed. While designed for the same task, QASC- and OBQA-based fact chains contain fewer facts than WorldTree explanation graphs on average.

Apart from science exam QA, a number of structured datasets supplement existing QA datasets for reading comprehension. Ye et al. (2020) collect semi-structured explanations for Natural Questions (Kwiatkowski et al., 2019) and SQuAD (Rajpurkar et al., 2016). They require annotators to use phrases in both the input question and context, and limit them to a small set of connecting expressions to write sentences explaining how to locate a question's answer within the context (e.g., "X is within 2 words left of Y", "X directly precedes Y", "X is a noun").
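To make the lexical-overlap construction behind WorldTree's explanation graphs concrete, the sketch below links the question, answer, and explanation facts whenever they share a content word. It is a deliberately simplified, hypothetical re-implementation (the actual pipeline also involves lemmatization, curated relation/property types, and expert curation), shown here only to illustrate the idea.

```python
from itertools import combinations

STOPWORDS = {"a", "an", "the", "of", "is", "are", "to", "for", "in", "on"}

def content_words(text: str) -> set:
    """Lowercased tokens with basic punctuation stripped and stopwords removed."""
    return {w.lower().strip(".,?!") for w in text.split()} - STOPWORDS

def explanation_graph(question: str, answer: str, facts: list) -> list:
    """Return edges (node_i, node_j, shared_words) between any two nodes
    (question, answer, or fact sentence) that share at least one content word."""
    nodes = {"question": question, "answer": answer}
    nodes.update({f"fact_{i}": f for i, f in enumerate(facts)})
    edges = []
    for (n1, t1), (n2, t2) in combinations(nodes.items(), 2):
        shared = content_words(t1) & content_words(t2)
        if shared:
            edges.append((n1, n2, shared))
    return edges

# Example in the spirit of the QASC instance quoted above.
edges = explanation_graph(
    "Differential heating of air can be harnessed for what?",
    "electricity production",
    ["Differential heating of air produces wind.",
     "Wind is used for producing electricity."],
)
```

On this example the sketch yields edges linking the question to the first fact (shared words such as "differential", "heating", "air"), the two facts to each other ("wind"), and the second fact to the answer ("electricity").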
Inoue et al. (2020) collect R4C, fact chain explanations for HotpotQA (Yang et al., 2018). The authors crowdsource semi-structured relational facts based on the sentence-level highlights in HotpotQA, but instruct annotators to rewrite them into minimal subject-verb-object forms. Lamm et al. (2020) collect explanations for Natural Questions that follow a linguistically-motivated form (see an example in Table 1). They identify a single context sentence that entails the answer to the question, mark referential noun-phrase equalities between the question and answer sentence, and extract a semi-structured entailment pattern.

Finally, Geva et al. (2021) propose StrategyQA, an implicit reasoning question-answering task and dataset where they ask crowdworkers to decompose complex questions such as "Did Aristotle use a laptop?" into reasoning steps. These decompositions are similar to fact chains, except the steps are questions rather than statements. For each question, crowdworkers provide highlight annotations of an associated passage that contains the answer. In total, the intermediate questions and their highlights explain how to reach an answer.

5 Link Between ExNLP Data, Modeling, and Evaluation Assumptions

All three parts of the machine learning pipeline (data collection, modeling, and evaluation) are inextricably linked. In this section, we discuss what ExNLP modeling and evaluation research reveals about the qualities of available ExNLP datasets, and how best to collect such datasets in the future.

Highlights are usually evaluated following two criteria: (i) plausibility, according to humans, how well a highlight supports a predicted label (Yang et al., 2019; DeYoung et al., 2020a),[7] and (ii) faithfulness or fidelity, how accurately a highlight represents the model's decision process (Alvarez-Melis and Jaakkola, 2018; Wiegreffe and Pinter, 2019). Human-annotated highlights (Table 2) are used to measure the plausibility of model-produced highlights: the higher the agreement between the two, the more plausible model highlights are.[8] On the other hand, a highlight that is both sufficient (implies the prediction, §3; first example in Table 2) and comprehensive (its complement in the input does not imply the prediction, §3; second example in Table 2) is regarded as faithful to the prediction it explains (DeYoung et al., 2020a; Carton et al., 2020). Thus, it might seem that human-annotated highlights are strictly related to the evaluation of plausibility, and not faithfulness. We question this assumption.

[7] Yang et al. (2019) use the term persuasibility and define it as "the usefulness degree to human users, serving as the measure of subjective satisfaction or comprehensibility for the corresponding explanation."
[8] Plausibility is also evaluated with forward simulation (Hase and Bansal, 2020): "given an input and an 'explanation,' users must predict what a model would output for the given input". This is a more precise, but more costly measurement as it requires collecting human judgments for every model and its variations.

Highlight collection is well synthesized in the following instructions for sentiment classification (Movie Reviews; Zaidan et al., 2007): (i) to justify why a review is positive (negative), highlight the most important words/phrases that tell someone (not) to see the movie, (ii) highlight enough to provide convincing support, (iii) do not go out of your way to mark everything. Instructions (ii)–(iii) are written to encourage sufficiency and compactness, but none of the instructions urge comprehensiveness. Indeed, DeYoung et al. (2020a) report that highlights collected in the FEVER, MultiRC, CoS-E, and E-SNLI datasets are comprehensive, but not those in Movie Reviews and Evidence Inference. They re-annotate subsets of the latter two as well as BoolQ, which they use to evaluate model comprehensiveness. Conversely, Carton et al. (2020) deem FEVER highlights non-comprehensive by design. Contrary to the characterization of both DeYoung et al. and Carton et al., we observe that the E-SNLI authors collect non-comprehensive highlights, since they instruct annotators to highlight only words in the hypothesis (and not the premise) for neutral pairs, and consider contradiction and neutral explanations correct if at least one piece of evidence in the input is highlighted (Table 6 in the Appendix). Based on these observed discrepancies in characterization, we conclude that post-hoc assessment of comprehensiveness from a general description of data collection is error-prone.

Alternatively, Carton et al. (2020) empirically show that available human-annotated highlights are not necessarily sufficient nor comprehensive for model predictions. They demonstrate that human-annotated highlights are insufficient (non-comprehensive) for predictions made by highly accurate models, suggesting that they might be insufficient (non-comprehensive) for gold labels as well. Does this mean that collected highlights in existing datasets are flawed?
Let us first consider insufficiency. Highlighted input elements taken together have to reasonably indicate the label. If they do not, a highlight is an invalid explanation. Consider two datasets whose sufficiency Carton et al. (2020) found to be most concerning: the subset of neutral E-SNLI pairs and the subset of no-attack WikiAttack examples. Neutral E-SNLI cases are not justifiable by highlighting because they are obtained only as an intermediate step to collecting free-text explanations, and only free-text explanations truly justify a neutral label (see Table 6 in the Appendix; Camburu et al., 2018). Table 2 shows one E-SNLI highlight that is not sufficient. No-attack WikiAttack examples are not explainable by highlighting because the absence of offensive content justifies the no-attack label, and this absence cannot be highlighted. We recommend 1) avoiding human-annotated highlights with low sufficiency when evaluating and collecting highlights, and 2) assessing whether the true label can be explained by highlighting. Does the same hold for non-comprehensiveness?

Consider a highlight that is non-comprehensive because it is redundant with its complement in the input (e.g., due to multiple occurrences of the word "great" in a positive movie review, of which only some are highlighted). Highlighting only one occurrence of "great" is a valid justification, but guaranteeing and quantifying faithfulness of this highlight is harder because the model might rightfully use the occurrence of "great" that is not highlighted to come to the prediction. Comprehensiveness is adopted to make faithfulness evaluation easier (DeYoung et al., 2020a; Carton et al., 2020).

Figure 2: Connections between assumptions made in the development of self-explanatory highlighting models. The jigsaw icon marks a synergy of modeling and evaluation assumptions; the arrow notes the direction of influence. Cited: DeYoung et al. (2020a); Zhang et al. (2016); Bao et al. (2018); Strout et al. (2019); Carton et al. (2020); Yu et al. (2019). Panel (a), supervised models' development: when we use human highlights as supervision, we assume that they are the gold truth and that model highlights should match them, so comparing human and model highlights for plausibility evaluation is sound; however, this basic approach does not introduce any data or modeling properties that help faithfulness evaluation, which remains a challenge in this setting. Panel (b), unsupervised models' development: in §5 we illustrate that comprehensiveness is not a necessary property of human highlights; non-comprehensiveness, however, hinders evaluating plausibility of model highlights produced in this setting, since model and human highlights do not match by design. Panel (c), recommended unsupervised models' development: to evaluate both plausibility and faithfulness, we should collect comprehensive human highlights, assuming that they are already sufficient (a necessary property).

To recap, in Figure 2 we reconcile assumptions that are made in different stages of development of self-explanatory models that highlight the input elements to justify their decisions. Although our takeaway is that evaluation of both plausibility and faithfulness of the popular highlighting approach (unsupervised models combined with assumptions of sufficiency and comprehensiveness) requires comprehensive human-annotated highlights, we agree with Carton et al. (2020) that:

    ...the idea of one-size-fits-all fidelity benchmarks might be problematic: human explanations may not be simply treated as gold standard. We need to design careful procedures to collect human explanations, understand properties of the resulting human explanations, and cautiously interpret the evaluation metrics.
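The agreement-based plausibility measure mentioned at the start of this section is commonly reported as token-level F1 or intersection-over-union between a model highlight and the human highlight (as in DeYoung et al., 2020a). A minimal sketch, with the two highlights represented as hypothetical sets of token positions:

```python
def highlight_f1(model_idx: set, human_idx: set) -> float:
    """Token-level F1 between a model highlight and a human highlight,
    both given as sets of token positions in the input."""
    if not model_idx and not human_idx:
        return 1.0
    overlap = len(model_idx & human_idx)
    if overlap == 0:
        return 0.0
    precision = overlap / len(model_idx)
    recall = overlap / len(human_idx)
    return 2 * precision * recall / (precision + recall)

def highlight_iou(model_idx: set, human_idx: set) -> float:
    """Intersection-over-union variant of the same agreement score."""
    union = model_idx | human_idx
    return len(model_idx & human_idx) / len(union) if union else 1.0
```

As the discussion above cautions, a low agreement score is only as meaningful as the human highlights it is computed against—for instance, against non-comprehensive human highlights, a faithful and comprehensive model highlight can be unfairly penalized.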
Mutual influence of data and modeling assumptions also affects free-text explanations. For example, the E-SNLI guidelines (Table 6 in the Appendix) have more label-specific and general constraints than the CoS-E guidelines (Table 7 in the Appendix), such as requiring self-contained explanations. Wiegreffe et al. (2020) show that such data collection decisions can influence modeling assumptions (as in Fig. 2a). This is not an issue per se, but we should be cautious that ExNLP data collection decisions do not popularize explanation properties as universally necessary when they are not, e.g., that free-text explanations should be understandable without the original input or that highlights should be comprehensive. We believe this could be avoided with better documentation, e.g., with additions to a standard datasheet (Gebru et al., 2018). While explainability fact sheets have been proposed for models (Sokol and Flach, 2020), they have not been proposed for datasets. For example, an E-SNLI datasheet could note that self-contained explanations were required during data collection, but that this is not a necessary property of a valid free-text explanation. A CoS-E datasheet could note that the explanations were not collected to be self-contained. A dataset with comprehensive highlights should emphasize the motivation behind collecting comprehensive highlights: comprehensiveness simplifies evaluation of faithfulness of highlights.

6 Rise of Structured Explanations

The merit of free-text explanations is their expressivity, which can come at the cost of underspecification and inconsistency due to the difficulty of quality control, as noted by the creators of two popular free-text explanation datasets: E-SNLI and CoS-E. In the present section, we highlight and challenge one prior approach to overcoming these difficulties: discarding template-like explanations.

We gather the crowdsourcing guidelines for the above-mentioned datasets in Tables 6–7 and compare them. There are two notable similarities between them. First, both asked annotators to first highlight input words and then formulate a free-text explanation from them as a means to control quality. Second, template-like explanations are discarded because they are deemed uninformative. The E-SNLI authors assembled a list of 56 templates (e.g., "There is ⟨hypothesis⟩") to identify explanations whose edit distance to one of the templates was below a threshold.

Another reason template-like explanations concern researchers is that they can result in artifact-like behaviors in certain modeling architectures. For example, a model which predicts a task output from a generated explanation can produce explanations that are plausible to a human user and give the impression of making label predictions on the basis of this explanation. However, it is possible that the model learns to ignore the semantics of the explanation and instead makes predictions on the basis of the explanation's template type (Kumar and Talukdar, 2020; Jacovi and Goldberg, 2021). In this case, the semantic interpretation of the explanation (which is also that of a human reader) is not faithful (an accurate representation of the model's decision process).

Despite re-annotating, Camburu et al. (2020) report that E-SNLI explanations still largely follow 28 label-specific templates (e.g., an entailment template "X is another form of Y") even after re-annotation. Brahman et al. (2021) report that models trained on gold E-SNLI explanations generate template-like explanations for the δ-NLI task. They identify 11 such templates that make up over 95% of the 200 generated explanations they analyze. These findings lead us to ask: what are the differences between templates considered uninformative and filtered out, and those identified by Camburu et al. (2020) and Brahman et al. (2021) that remain after filtering? Are all template-like explanations uninformative?

Although prior work seems to indicate that template-like explanations are undesirable, structured explanations have most recently been intentionally collected (see Table 5). Jhamtani and Clark (2020) define a QA explanation as a chain of steps (typically sentences) that entail an answer, and Lamm et al. (2020) propose a linguistically grounded structure for QA explanations. What these studies share is that they acknowledge structure as inherent to explaining the tasks they investigate. Related work (GLUCOSE; Mostafazadeh et al., 2020) takes the matter further, arguing:

    In order to achieve some consensus among explanations and to facilitate further processing and evaluation, the explanations should not be entirely free-form.
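To make the filtering step concrete, the sketch below flags an explanation as template-like when its edit distance to some filled-in template falls under a threshold, which is how the E-SNLI check is described above. The two templates and the threshold here are hypothetical placeholders, not the actual 56 E-SNLI templates or their cutoff.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical templates; the hypothesis text is substituted in before comparison.
TEMPLATES = ["there is {hypothesis}", "{hypothesis} is true"]

def is_template_like(explanation: str, hypothesis: str, threshold: int = 10) -> bool:
    """Flag explanations that are a near-copy of a filled-in template."""
    filled = [t.format(hypothesis=hypothesis.lower()) for t in TEMPLATES]
    return any(edit_distance(explanation.lower(), f) < threshold for f in filled)
```

Whether such explanations should then be discarded is exactly what this section questions—the same detector could instead be used to measure and document how template-like a collected dataset is.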
Although GLUCOSE explanations explain events in the world rather than human decisions (see §2), we expect that findings reported in that work are useful for the explanations considered in this survey. For example, following GLUCOSE, we recommend running pilot studies to explore how people define and generate explanations for a task before collecting free-text explanations for that task. If they reveal that informative human explanations are naturally structured, incorporating the structure in the annotation scheme is useful since the structure is natural to explaining the task. This turned out to be the case with natural language inference (Camburu et al., 2020, also reported in Brahman et al., 2021):

    Explanations in E-SNLI largely follow a set of label-specific templates. This is a natural consequence of the task and SNLI dataset.

The relevant aspects of GLUCOSE were identified pre-collection following cognitive psychology research. This suggests that there is no all-encompassing definition of explanation (cognitive psychology is not crucial for all tasks) and that researchers could consult domain experts during the pilot studies or follow literature from other fields to define an appropriate explanation in a task-specific manner (for conceptualizations of explanation in different fields, see Tiddi et al., 2015).

We recommend embracing the structure when possible, but also encourage creators of ExNLP datasets with template-like explanations to stress that template structure can influence downstream modeling decisions. Model developers might consider pruning the explanation template type from intermediate representations of a base model, e.g., with adversarial training (Goodfellow et al., 2015) or more sophisticated methods (Ravfogel et al., 2020), to avoid models learning label-specific templates rather than explanation semantics.

Finally, what if pilot studies do not reveal any obvious structure to human explanations of a task? Then we need to do our best to control the quality of free-text explanations, because low dataset quality is a bottleneck to building high-quality models. CoS-E is collected with notably fewer annotation constraints and quality controls than E-SNLI, and has annotation issues that some have deemed make the dataset unusable (Narang et al., 2020; see examples in Table 7). As exemplars of quality control, we gather VCR annotation and collection guidelines in Table 8 (Zellers et al., 2019a) and point the reader to the supplementary material of GLUCOSE (Moon et al., 2020). In §7 and §8, we give more detailed, task-agnostic recommendations for collecting high-quality ExNLP datasets, applicable to all three explanation types.

7 Increasing Explanation Quality

When asked to write free-text sentences from scratch for a table-to-text annotation task (outside of ExNLP), Parikh et al. (2020) note that crowdworkers produce "vanilla targets that lack [linguistic] variety". Lack of variety can result in annotation artifacts, which are prevalent in the popular SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) datasets (Poliak et al., 2018; Gururangan et al., 2018; Tsuchiya, 2018), but also in datasets that contain free-text annotations (Geva et al., 2019). These authors demonstrate the harms of such artifacts: models can overfit to them, leading to both performance over-estimation and problematic generalization behaviors.

Artifacts can also occur from poor-quality annotations and less attentive annotators, both of which are on the rise on crowdsourcing platforms (Chmielewski and Kucker, 2020; Arechar and Rand, 2020). We demonstrate examples of such artifacts in the CoS-E v1.11 dataset in Table 7, such as the occurrence of the incorrect explanations "rivers flow trough valleys" on 529 separate instances and "health complications" on 134.[10] To mitigate artifacts, both increased diversity of annotators and quality control are needed. We gather suggestions to increase diversity in §8 and focus on quality control here.

[10] Most likely an artifact of annotator(s) copying and pasting the same phrase for multiple answers.

7.1 A Two-Stage Collect-And-Edit Approach

While ad-hoc methods can improve quality (see Tables 6–8 and Moon et al., 2020), an extremely effective and generalizable method is to use a two-stage collection methodology. In ExNLP, Jhamtani and Clark (2020), Zhang et al. (2020a), and Zellers et al. (2019a) apply a two-stage methodology to explanation collection, by either extracting explanation candidates automatically (Jhamtani and Clark, 2020) or having crowdworkers write them (Zhang et al., 2020a; Zellers et al., 2019a). The authors follow collection with an additional step where another batch of crowdworkers assess the quality of the generated explanations (we term this COLLECT-AND-JUDGE). Judging has benefits for quality control, and allows the authors to release quality ratings for each instance, which can be used to filter the dataset.
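Before (or alongside) a judge phase, a simple frequency check over the collected free-text explanations can surface the copy-paste artifacts described above, such as a single string reused across hundreds of instances. A minimal sketch over a hypothetical list of collected explanations:

```python
from collections import Counter

def flag_repeated_explanations(explanations: list, max_repeats: int = 5) -> list:
    """Return (explanation, count) pairs that recur suspiciously often,
    normalizing case and surrounding whitespace first."""
    counts = Counter(e.strip().lower() for e in explanations)
    return [(text, n) for text, n in counts.most_common() if n > max_repeats]

# Usage sketch: flag_repeated_explanations(collected_explanations) would surface
# strings like "rivers flow trough valleys" if one annotator pasted the same
# phrase into hundreds of answers.
```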
Outside of ExNLP, Parikh et al. (2020) introduce an extended version of this approach (that we term COLLECT-AND-EDIT): they generate a noisy automatically-extracted dataset for the table-to-text generation task, and then ask annotators to edit the datapoints. Bowman et al. (2020) use this approach to collect NLI hypotheses, and find, crucially, that it reduces artifacts in a subset of MNLI instances. They also report that the method introduces more diversity and linguistic variety when compared to datapoints written by annotators from scratch. In XAI, Kutlu et al. (2020) use the approach to collect highlight explanations for web page ranking. We advocate expanding the COLLECT-AND-JUDGE approach for explanation collection to COLLECT-AND-EDIT. This has potential to improve diversity, reduce individual annotator biases, and perform quality control.

As mentioned above, Jhamtani and Clark (2020) successfully automate the collection of a seed set of explanations, and then use human quality judgements to pare them down. Whether to collect seed explanations automatically or manually depends on task specifics; both have benefits and tradeoffs.[11] However, we will demonstrate through a case study of two multimodal free-text explanation datasets that collecting explanations automatically and not including an edit, or at least a judge, phase can lead to artifacts.

[11] Bansal et al. (2021) collect a few high-quality explanations from LSAT law exam preparation books to accompany the ReClor dataset (Yu et al., 2020), but the authors have to manually edit and shorten the explanations.

Two automatically-collected datasets for visual-textual tasks include explaining visual-textual entailment (E-SNLI-VE; Do et al., 2020) and explaining visual question answering (VQA-E; Li et al., 2018). E-SNLI-VE combines annotations of two datasets: (i) SNLI-VE (Xie et al., 2019), collected by dropping the textual premises of SNLI (Bowman et al., 2015) and replacing them with Flickr30k images (Young et al., 2014), and (ii) E-SNLI (Camburu et al., 2018), a dataset of crowdsourced explanations for SNLI. This procedure is possible because SNLI's premises are scraped captions from Flickr30k photos, i.e., every SNLI premise is the caption of a corresponding photo. However, since SNLI's hypotheses were collected from crowdworkers who saw the scraped premises but not the original images, the photo replacement process results in a significant number of errors in SNLI-VE (Vu et al., 2018). Do et al. (2020) re-annotate labels and explanations for the neutral pairs in the validation and test sets of SNLI-VE. Others find that the dataset remains low-quality for training models to explain, due to artifacts in the entailment class and the neutral class's training set (Marasović et al., 2020). With a full EDIT approach, these artifacts would be significantly reduced, and the resulting dataset could have quality on par with one collected fully from scratch.

Similarly, the VQA-E dataset (Li et al., 2018) uses a semantic processing pipeline to automatically convert image captions from the VQA v2.0 dataset (Goyal et al., 2017) into explanations for the question-answer pairs. As a quality measure, Marasović et al. (2020) perform a user study where they ask crowdworkers to judge the plausibility of VQA-E explanations, and report a notably lower estimate of plausibility compared to VCR explanations (Zellers et al., 2019a), which are carefully crowdsourced.

Both E-SNLI-VE and VQA-E present novel and cost-effective ways for producing large ExNLP datasets for new tasks. But they also taught us that it is important to consider the quality trade-offs of automatic collection and potential mitigation strategies such as a selective crowdsourced editing phase. For the tasks presented here, automated collection cannot be fully effective without a full EDIT process. The quality-quantity tradeoff is an exciting direction for future study.

7.2 Teach and Test the Underlying Task

In order to create explanations or judge explanation quality, annotators must understand the underlying task and its label set well. In most cases, this necessitates teaching and testing annotator understanding of the underlying task as a prerequisite. Exceptions include intuitive tasks that require little effort or expertise, such as sentiment classification of short product reviews or commonsense QA. Tasks based in linguistic concepts may require more training effort. The difficulty of scaling data annotation from expert annotators to lay crowdworkers has been reported for tasks such as semantic role labelling (Roit et al., 2020), complement coercion (Elazar et al., 2020), and discourse relation annotation (Pyatkin et al., 2020).
To increase annotation quality, Roit et al. (2020), Pyatkin et al. (2020), and Mostafazadeh et al. (2020) provide intensive training to crowdworkers, including personal feedback. Even the natural language inference (NLI) task, which has generally been considered straightforward, can require crowdworker training and testing. In a pilot study, the authors of this paper found that [...]

[...] to do this; we elaborate on other suggestions from related work (inside and outside ExNLP) here.

8.1 Use a Large Set of Annotators

Outside of ExNLP, Geva et al. (2019) report that [...]
for performance measures) as advocated by Geva et al. (2019). Diversity of the crowd is one mitigation strategy. More elaborate methods have also recently been proposed, based on collecting demographic attributes or modeling annotators as a graph (Al Kuwatly et al., 2020; Wich et al., 2020).

8.2 Multiple Annotations Per Instance

HCI research has long considered the ideal of crowdsourcing a single ground-truth as a "myth" that fails to account for the diversity of human thought and experience (Aroyo and Welty, 2015). Similarly, ExNLP researchers should not assume there is always one correct explanation. Many of the assessments crowdworkers are asked to make when writing explanations are subjective in nature. Moreover, there are many different models of explanation based on a user's cognitive biases, social expectations, and socio-cultural background (Miller, 2019; Kopecká and Such, 2020). Prasad et al. (2020) present a theoretical argument to illustrate that there are multiple ways to highlight input words to properly explain an annotated sentiment label. Camburu et al. (2018) collect three free-text explanations for E-SNLI validation and test instances, and find a low inter-annotator BLEU score (Papineni et al., 2002) between them. If a dataset contains only one explanation when multiple are plausible, a plausible model explanation can be penalized if it does not agree with the annotated explanation. We expect that modeling multiple explanations could also be a useful learning signal. Some existing datasets contain multiple explanations per instance (see the last column in Tables 3–5). Future ExNLP data collections should do the same if there is sufficient subjectivity in the task or diversity of correct explanations.

However, unlike highlights and structured explanations, free-text explanations do not have straightforward means to measure how similarly multiple explanations support a given label. This makes it difficult to know a good number of explanations to collect for a given task and instance. Practically, we suggest the use of manual inspection and automated metrics such as BLEU to set a value of the number of explanations per instance that adequately captures diversity (i.e., the n+1-th explanation is not sufficiently novel or unique given the n collected).

8.3 Get Ahead: Add Contrastive and Negative Explanations

In this section, we overview explanation annotation types beyond what is typically collected that present an exciting direction for future work. Collecting such ExNLP annotations from annotators concurrently with regular explanations will likely be more efficient than collecting them post-hoc.

The broader machine learning community has championed contrastive explanations that justify why an event occurred or an action was taken instead of some other event or action (Dhurandhar et al., 2018; Miller, 2019; Verma et al., 2020).[12] Most recently, methods have been proposed to produce contrastive explanations for NLP applications (Yang et al., 2020; Wu et al., 2021; Jacovi and Goldberg, 2021). For example, Ross et al. (2020) propose contrastive explanations as the minimal edits of the input that are needed to explain a different (contrast) label for a given input-label pair for a classification task. To the best of our knowledge, there is no dataset that contains contrastive edits with the goal of explainability.[13] However, contrastive edits have been used to assess and improve robustness of NLP models (Kaushik et al., 2020; Gardner et al., 2020; Li et al., 2020) and might be used for explainability too. That said, currently these datasets are either small-sized or contain edits only for sentiment classification and/or NLI. Moreover, just as highlights are not sufficiently intelligible for complex tasks, the same might hold for contrastive edits of the input. In this case, we could collect justifications that answer "why...instead of..." in free-text.

[12] Sometimes called counterfactual explanations.
[13] CoS-E may contain some contrastive explanations: Rajani et al. (2019) note that their annotators often resort to explaining by eliminating the wrong answer choices.

The effects of edits made on free-text or structured explanations (instead of inputs) for studying ExNLP models' robustness have not yet been thoroughly explored.[14] Related work in computer vision studies adversarial explanations that are produced by changing the explanation as much as possible while keeping the model's prediction mostly unaffected (Alvarez-Melis and Jaakkola, 2018; Zhang et al., 2020b). Producing such explanations for NLP applications is hard due to the discrete inputs, and collecting human edits on explanations might be helpful, such as done in eQASC-perturbed.

[14] Highlights are made from the input elements, so making edits on a highlight means that edits are made on the input.
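One way to operationalize the BLEU-based suggestion in §8.2 is to keep requesting explanations for an instance until a new one is no longer sufficiently novel relative to those already collected. The sketch below uses NLTK's sentence-level BLEU with a hypothetical novelty threshold; whether BLEU is an adequate proxy for diversity of free-text explanations is, as noted above, an open question.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def is_sufficiently_novel(new_explanation: str, collected: list,
                          threshold: float = 0.5) -> bool:
    """True if the new explanation's maximum BLEU score against the
    already-collected explanations for the same instance stays below
    the (hypothetical) similarity threshold."""
    if not collected:
        return True
    smooth = SmoothingFunction().method1
    hyp = new_explanation.lower().split()
    scores = [sentence_bleu([ref.lower().split()], hyp, smoothing_function=smooth)
              for ref in collected]
    return max(scores) < threshold

# Collection-loop sketch: stop requesting the (n+1)-th explanation for an
# instance once new candidates stop being sufficiently novel given the n
# collected so far.
```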