Evaluating Explanations for Reading Comprehension with Realistic Counterfactuals
Xi Ye    Rohan Nair    Greg Durrett
Department of Computer Science
The University of Texas at Austin
{xiye,rnair,gdurrett}@cs.utexas.edu

arXiv:2104.04515v1 [cs.CL] 9 Apr 2021

Abstract

Token-level attributions have been extensively studied to explain model predictions for a wide range of classification tasks in NLP (e.g., sentiment analysis), but such explanation techniques are less explored for machine reading comprehension (RC) tasks. Although the transformer-based models used here are identical to those used for classification, the underlying reasoning these models perform is very different and different types of explanations are required. We propose a methodology to evaluate explanations: an explanation should allow us to understand the RC model's high-level behavior with respect to a set of realistic counterfactual input scenarios. We define these counterfactuals for several RC settings, and by connecting explanation techniques' outputs to high-level model behavior, we can evaluate how useful different explanations really are. Our analysis suggests that pairwise explanation techniques are better suited to RC than token-level attributions, which are often unfaithful in the scenarios we consider. We additionally propose an improvement to an attention-based attribution technique, resulting in explanations which better reveal the model's behavior.[1]

[1] Code and data available at https://github.com/xiye17/EvalQAExpl

1 Introduction

Interpreting the behavior of black-box neural models for NLP has garnered interest for its many possible benefits (Lipton, 2018). A range of post-hoc explanation techniques have been proposed, including textual explanations (Hendricks et al., 2016) and token-level attributions (Ribeiro et al., 2016; Sundararajan et al., 2017; Guan et al., 2019; De Cao et al., 2020). These formats can be applied to many domains, including sentiment analysis (Guan et al., 2019; De Cao et al., 2020), visual recognition (Simonyan et al., 2013), and natural language inference (Camburu et al., 2018; Thorne et al., 2019). However, both approaches have been roundly criticized. An explanation may not be faithful to the computation of the original model (Wu and Mooney, 2018; Hase and Bansal, 2020; Wiegreffe et al., 2020; Jacovi and Goldberg, 2020b), and may even mislead users (Rudin, 2019). More critically, token attributions in particular do not have a consistent and meaningful social attribution (Miller, 2019; Jacovi and Goldberg, 2020a): that is, when a user of the system looks at the explanation, they do not draw a correct conclusion from it, making it hard to use for downstream tasks.

Our focus in this work is how to evaluate explanations for reading comprehension in terms of their ability to reveal the high-level behavior of models. That is, rather than an explanation saying "this word was important", we want to draw a conclusion like "the model picked out these two words and compared them;" this statement can be evaluated for faithfulness and it helps a user draw meaningful conclusions about the system. We approach this evaluation from the perspective of simulatability (Hase and Bansal, 2020): can we predict how the system will behave on new or modified examples? Doing so for RC models is challenging due to the complex nature of the task, which fundamentally involves a correspondence between a question and a supporting text context.

Our core technique is to assess how well various explanations can support or reject hypotheses about the model's behavior (i.e., simulate the model) on realistic counterfactuals, which are perturbations of original data points (Figure 1). These resemble several prior "stress tests" used to evaluate models, including counterfactual sets (Kaushik et al., 2020), contrast sets (Gardner et al., 2020), and checklists (Ribeiro et al., 2020).
[Figure 1: A motivating example and explanations generated by several methods. The base example D0 asks "Are Super High Me and All in This Tea both documentaries?"; counterfactuals D1-D3 replace one or both documentary tokens with romance, and the model predicts YES on all four. Integrated Gradient (Sundararajan et al., 2017) and DiffMask (De Cao et al., 2020) highlight the documentary tokens, while LAtAttr (ours) indicates they barely contribute. We profile the model behaviors with the predictions on realistic counterfactual inputs, which suggest the model does not truly base its prediction on the two movies being documentaries. We can evaluate explanations by seeing whether they can be used in combination with heuristics to derive this same conclusion about model behavior.]

We first manually curate these sets to answer questions like: if different facts were shown in the context, how would the model behave? If different amounts of text or other incorrect paragraphs were retrieved by an upstream retrieval system, would the model still get the right answer? Then, we evaluate techniques for simulating the model's behavior given explanations like token attributions. That is, using the explanations, can we recover the answers to these questions and give usable insights about the QA system?

We investigate two paradigms of explanation techniques, token attribution-based (Simonyan et al., 2013; Ribeiro et al., 2016; De Cao et al., 2020) and feature interaction-based (Tsang et al., 2020; Hao et al., 2020), which attribute decisions to sets of tokens or to pairwise/higher-order interactions. We show that token-level attribution is not sufficient for analyzing QA, which naturally involves more complex reasoning over multiple clues. For both techniques, we devise methods to bridge from these explanations to high-level conclusions about counterfactual behavior.

We apply our methodology to automatically evaluate and compare a series of explanation techniques on two types of questions from HotpotQA (Yang et al., 2018), questions from adversarial SQuAD (Rajpurkar et al., 2016), and on a synthetic QA setting. For each concrete high-level hypothesis we formulate, we automatically assess the extent to which our low-level explanation techniques can usefully produce the same answer as our hand-crafted counterfactuals. Our experimental results show moderate success of this approach overall, and that explanations in the form of feature interactions better align with model behavior. We further propose a modification to an existing interaction technique from Hao et al. (2020) and show improved performance on our datasets.

We summarize our main contributions as follows: (1) We propose a framework for evaluating explanations based on model simulation on realistic counterfactuals. (2) We describe a technique for connecting low-level attributions (token-level or higher-order) with high-level model hypotheses. (3) We improve an attention-based pairwise attribution technique with a simple but effective fix, leading to strong empirical results. (4) We analyze a set of QA tasks and show that our approach can derive meaningful conclusions on each.

2 Motivation

We start by going through a detailed example of how to use our methodology to compare several attribution techniques. Figure 1 shows an example of a multi-hop yes/no question from HotpotQA. The QA model correctly answers yes in this case. Given the original example, the explanations produced using IntGrad (Sundararajan et al., 2017) and DiffMask (De Cao et al., 2020) (explained in Section 4) both assign high attribution scores to the two documentary tokens appearing in the context: a user of the system is likely to impute that the model is comparing these two values, as it is natural to assume this model is using the highlighted information correctly. By contrast, our pairwise attribution approach primarily attributes the prediction to interactions with the question, suggesting that the interactions related to documentary do not matter.

We manually curate a set of contrastive examples to test this hypothesis. If the model truly recognizes that both movies are documentaries, then replacing either or both of the documentary tokens with romance should change the prediction. To verify that, we perturb the original examples to obtain another three examples (left side of Figure 1). These four examples together form a contrastive local neighborhood (Ribeiro et al., 2016; Kaushik et al., 2020; Gardner et al., 2020) consisting of realistic counterfactuals.[2]

[2] One could argue that these counterfactuals are not entirely realistic: a romance film about smoking is fairly unlikely to occur. However, generating perfect counterfactuals is an extremely hard problem (Qin et al., 2019), requiring deep world knowledge of what scenarios make sense or what properties are true of certain entities. Nevertheless, we believe that these examples are realistic enough that robust models should still behave well on them.
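The counterfactuals in this paper are manually curated, but the kind of edit involved is simple to illustrate. The following sketch (ours, purely illustrative; names are not from the released code) builds a property-swap neighborhood of the sort shown in Figure 1 by string replacement.

```python
# Illustrative sketch (ours): build a local neighborhood by swapping a
# property token in one or both supporting facts, as in Figure 1.
# The paper's counterfactuals are hand-written; this only shows the
# shape of the perturbation.
def property_swap_neighborhood(fact_a, fact_b, question,
                               prop="documentary", swap="romance"):
    d0 = (question, fact_a, fact_b)
    d1 = (question, fact_a.replace(prop, swap), fact_b)
    d2 = (question, fact_a, fact_b.replace(prop, swap))
    d3 = (question, fact_a.replace(prop, swap), fact_b.replace(prop, swap))
    return [d0, d1, d2, d3]
```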
However, unlike what is suggested by the token attribution-based techniques, the model always predicts "yes" for every example in the neighborhood, disputing that the model is following the right reasoning process. Although our pairwise attribution seemed at first glance much less plausible than that generated by the other techniques, our explanation was in fact better from the perspective of simulating the model's behavior on these new examples.

Our main assumption in this work can be stated as follows: an explanation should describe model behavior with respect to realistic counterfactuals, not just look plausible. Past work has evaluated along plausibility criteria (Lei et al., 2016; Strout et al., 2019; Thorne et al., 2019), but as we see from this example, faithful explanations (Subramanian et al., 2020; Jacovi and Goldberg, 2020b,a) are better aligned with our goal of simulatability. We argue that a good explanation is one that aligns with the model's high-level behaviors, and from which we can understand how the model generalizes to new data.

Discussion: Realistic Counterfactuals  Many counterfactual modifications are possible: past work has looked at injecting non-meaningful triggers (Wallace et al., 2019), deleting chunks of content (Ribeiro et al., 2016), or evaluating interpolated input points as in IntGrad, all of which violate assumptions about the input distribution. In RC, masking out a fact in the question often turns the question into a nonsense one.[3] Focusing on realistic counterfactuals, by contrast, illuminates fundamental problems with our RC models' reasoning capabilities (Jia and Liang, 2017; Chen and Durrett, 2019; Min et al., 2019; Jiang and Bansal, 2019). This is the same motivation as that behind contrast sets (Gardner et al., 2020), but our work focuses on benchmarking explanations, not models themselves.

[3] The exception is in adversarial settings; however, many adversarial attacks do not actually draw on real-world threat models (Athalye et al., 2018), so we consider these less important.

3 Evaluation Protocol

We seek to formalize the reasoning we undertook on Figure 1. Using the model's explanation on a "base" data point, can we predict the model's behavior on the perturbed instances of that point?

Definitions  Given an original example D0 (e.g., the top example in Figure 1), we construct a set of perturbations {D1, ..., Dk} (e.g., the three counterfactual examples in Figure 1), which together with D0 form a local neighborhood D. These perturbations are realistic inputs derived from existing datasets or which we construct. We formulate a hypothesis H about the neighborhood. In Figure 1, H is the question "is the model comparing the target properties?" (documentary in this case). Based on the model's behavior on the set D, we can derive a high-level behavioral label z corresponding to the truth of H. We form our local neighborhood to check the answer empirically and compute a ground truth for z. Since the model always predicts "yes" in this neighborhood, we label the set D with z = 0 (the model is not comparing the properties). We label D as z = 1 when the model does predict "no" for some perturbations.

Procedure  Our approach is as follows:
1. Formulate a hypothesis H about the model
2. Collect realistic counterfactuals D to answer it empirically for some base examples
3. Use the explanation of each base example to predict z. That is, learn the mapping D0 → z based on the explanation of D0 so we can simulate the model on D without observing the perturbations.

Note that this third step only uses the explanation of the base data point: explanations should let us make conclusions about new counterfactuals without having to do inference on them.

Simulation  In our experiments on HotpotQA and SQuAD, we compute a scalar factor f for each explanation representing the importance of a specific part of the input (e.g., the "documentary" tokens in Figure 1), which we believe should correlate with model predictions on the counterfactuals. If an explanation assigns higher importance to this information, it suggests that the model will actually change its behavior on these new examples. Given this factor, we construct a simple classifier where we predict z = 1 if the factor f is above a threshold. We expect that factors extracted using better explanations should better indicate the model behavior. Hence, we evaluate the explanation using the best simulation accuracy it can achieve and the AUC score (S-ACC and S-AUC).[4]

[4] We do not collect large enough datasets to train a simulation model, but given larger collections of counterfactuals, this is another approach one could take.
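To make this scoring concrete, here is a minimal sketch (ours, not taken from the released code) of how a factor-based simulation classifier can be evaluated, assuming the per-neighborhood factors f and ground-truth labels z are already available as arrays.

```python
# Minimal sketch (ours): score a scalar factor f as a predictor of the
# behavioral label z by sweeping a decision threshold. Variable names
# are illustrative, not from the paper's released code.
import numpy as np
from sklearn.metrics import roc_auc_score

def simulation_scores(factors, labels):
    """factors: factor f per base example; labels: ground-truth z in {0, 1}."""
    factors = np.asarray(factors, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # S-AUC: threshold-free measure of how well f ranks z = 1 above z = 0.
    s_auc = roc_auc_score(labels, factors)
    # S-ACC: best accuracy achievable by any threshold t (predict z = 1 if f > t).
    thresholds = np.concatenate(([-np.inf], np.unique(factors)))
    s_acc = max(np.mean((factors > t).astype(int) == labels) for t in thresholds)
    return s_acc, s_auc
```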
Our evaluation resembles the human evaluation in Hase and Bansal (2020), which asks human raters to predict the model's decision given an example together with its explanations and also reports simulatability. Our method differs in that (1) we predict the behavior on unseen counterfactuals given the explanation of a single base data point, and (2) we automatically extract a factor to predict model behavior instead of asking humans to do so.

4 Explanation Techniques

Compared to classification tasks like sentiment analysis, QA is much more fundamentally about interaction between input features, especially between a question and a context. This work will directly compare feature interaction explanations with token attribution techniques that are more common for other tasks.[5]

[5] A potentially even more meaningful format would be a program actually approximating the model's behavior in a detailed way, as has been explored in the context of reinforcement learning (Verma et al., 2018; Bastani et al., 2018). Prior work does not really show how to effectively build this type of explanation for QA at this time, although some techniques like anchors (Ribeiro et al., 2018) have been explored before.

4.1 Token Attribution-Based

These techniques all return a score s_i for each token i in both the question and context that are fed into the QA system.

Integrated Gradient (IntGrad) (Sundararajan et al., 2017) computes an attribution for each token by integrating the gradients of the prediction with respect to the token embeddings over the path from a baseline input (typically mask or pad tokens) towards the designated input. Although a common technique, recent work has raised concerns about the effectiveness of IntGrad methods for NLP tasks, as interpolated word embeddings do not correspond to real input values (Harbecke, 2021).

Differentiable Mask (DiffMask) (De Cao et al., 2020) learns to mask out a subset of the input tokens for a given example while maintaining a distribution over answers as close to the original distribution as possible. This mask is learned in a differentiable fashion, then a shallow neural model (a linear layer) is trained to recognize which tokens to discard.

4.2 Feature Interaction-Based

These techniques all return a score s_ij for each pair of tokens (i, j) in both the question and context that are fed into the QA system.

Archipelago (Tsang et al., 2020) measures non-additive feature interaction. Similar to DiffMask, Archip is also implicitly based on unrealistic counterfactuals which remove tokens. Given a subset of tokens, Archip defines the contribution of the interaction by the prediction obtained from masking out all the other tokens, leaving only a very small fraction of the input. Applying this definition to a complex task like QA can result in a completely nonsensical input.

Attention Attribution (AtAttr) (Hao et al., 2020) uses attention specifically to derive pairwise explanations. However, it avoids the pitfalls of directly inspecting attention (Serrano and Smith, 2019; Wiegreffe and Pinter, 2019) by running an integrated gradients procedure over all the attention links within the transformer, yielding attribution scores for each link. The attribution scores directly reflect the attribution of the particular attention links, making this method able to describe pairwise interactions.

Concretely, define the h-head attention matrix over input D with n tokens as A = [A_1, ..., A_l], where A_i \in \mathbb{R}^{h \times n \times n} contains the attention scores for layer i. We can obtain the attribution score for each entry in the attention matrix A as:

    \mathrm{ATTR}(A) = A \odot \int_{0}^{1} \frac{\partial F(D, \alpha A)}{\partial A} \, d\alpha,    (1)

where F(D, \alpha A) is the transformer model that takes as input the tokens and a matrix specifying the attention scores for each layer. We later sum up the attention attributions across all heads and layers to obtain the pairwise interaction between tokens (i, j), i.e., s_{ij} = \sum_{m} \sum_{n} \mathrm{ATTR}(A)_{mnij}.
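As a concrete illustration of Eq. (1), here is a rough sketch (ours) of the integrated-gradients-over-attention computation using a Riemann approximation of the integral. It assumes a hook forward_with_attention(inputs, attn) that runs the QA model with the supplied attention scores and returns the probability of the predicted answer; such a hook is not a standard library call and would have to be implemented for the specific transformer.

```python
# Rough sketch (ours) of attention attribution in the style of Eq. (1).
# `forward_with_attention` is an assumed hook, not a real library API.
import torch

def attention_attribution(forward_with_attention, inputs, attn, steps=20):
    """attn: stacked attention scores with shape [layers, heads, n, n]."""
    total_grad = torch.zeros_like(attn)
    for k in range(1, steps + 1):
        alpha = k / steps
        scaled = (alpha * attn).clone().detach().requires_grad_(True)
        prob = forward_with_attention(inputs, scaled)  # scalar answer probability
        prob.backward()
        total_grad += scaled.grad
    attr = attn * total_grad / steps        # Riemann approximation of the integral
    pairwise = attr.sum(dim=(0, 1))         # pool over layers and heads -> s_ij
    return attr, pairwise
```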
4.3 Layer-wise Attention Attribution

We propose a new technique, LAtAttr, to improve upon AtAttr. The AtAttr approach simultaneously increases all attention scores when computing the attribution, which could be problematic. Since the attention scores of higher layers are determined by the attention scores of lower layers, forcibly setting all the attention scores and computing gradients at the same time may distort the gradients for the attribution of lower-level links, hence producing inaccurate attributions. When applying the IntGrad approach in other contexts, we typically assume the independence of input features (e.g., pixels of an image or tokens of an utterance), an assumption which does not hold here.

To address this issue, we propose a simple fix, namely applying the IntGrad method layer-by-layer. As in Figure 2, to compute the attribution for attention links of layer i, we only change the attention scores at layer i:

    \mathrm{ATTR}(A_i) = A_i \odot \int_{0}^{1} \frac{\partial F_{/i}(D, \alpha A_i)}{\partial A_i} \, d\alpha.    (2)

F_{/i}(D, \alpha A_i) denotes that we only intervene on the attention masks at layer i while leaving the other attention masks to be computed naturally by the model. We pool to obtain the final attribution for pairwise interactions as s_{ij} = \sum_{m} \sum_{n} \mathrm{ATTR}(A)_{mnij}.

[Figure 2: Steps of our Layer-wise Attention Attribution approach, where we intervene on only a single layer at each step. For instance, to compute the attribution of the attention masks at layer 2, we only intervene on the attention mask A2, and leave the other attention masks computed as usual (marked with a tilde).]

This technique does not necessarily satisfy the Completeness axiom commonly used in this line of work (Sundararajan et al., 2017). Since our ultimate goal is a downstream empirical evaluation, we set aside any theoretical analysis of this technique for now.

5 Experiments: Real QA Datasets

We evaluate our attribution methods (Section 4) following our stated evaluation protocol (Section 3) on the HotpotQA dataset (Yang et al., 2018) and the SQuAD dataset (Rajpurkar et al., 2016), specifically leveraging examples from adversarial SQuAD (Jia and Liang, 2017).

5.1 Hotpot Yes-No Questions

We first study a subset of comparison yes/no questions, which is a challenging format despite the binary answer space (Clark et al., 2019). Typically, a yes-no comparison question requires comparing the properties of two entities (Figure 1). We base our experiments on a RoBERTa (Liu et al., 2019) QA model achieving 77.2 F1 on the development set in the distractor setting, comparable to other strong RoBERTa-based models (Tu et al., 2020; Groeneveld et al., 2020).

Hypothesis & Counterfactuals  The hypothesis H we investigate is as in Section 2: the model compares the entities' properties as indicated by the question. For instance, for the question Are A and B of the same nationality, the properties are the nationalities of "A" and "B"; for the question Are A and B both ice plants, the properties are their plant species. As in the motivating example, we construct the counterfactuals by replacing the properties in the context with one another if the two properties are different, or with similar hand-selected properties (e.g., "documentary" → "romance", "American" → "English") if the two are the same, producing an additional three perturbations D1, D2, D3 for each base example D0. We set z = 0 (the hypothesis does not hold) if for each perturbed example Di ∈ D the model predicts the same answer as for the original example, indicating a failure to compare the properties. We set z = 1 if the model's prediction does change. The authors annotated perturbations for 50 randomly selected (D, z) pairs in total, forming a total of 200 counterfactual instances. More details of the annotation process and concrete examples can be found in the Appendix.

Connecting Explanation and Hypothesis  To make a judgment about z, we extract a factor f based on the importance of a set of property tokens P. For token attribution-based methods, we define f as the sum of the attributions s_i of the tokens in P: \sum_{i \in P} s_i. For feature interaction-based methods producing pairwise attributions s_ij, we compute f by pooling the scores of all the interactions related to the property tokens, i.e., \sum_{i \in P \vee j \in P} s_{ij}. We then predict z = 1 if the factor f is above a threshold, and evaluate the capability of the factor to indicate the model's high-level behavior using the best simulation accuracy it can achieve (S-ACC) and the AUC score (S-AUC).
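A minimal sketch (ours, with illustrative names) of how these factors can be computed from either explanation format:

```python
# Minimal sketch (ours): turn an explanation into the scalar factor f of
# Section 5.1. `property_positions` are the indices of the property tokens P.
def factor_from_tokens(token_scores, property_positions):
    """Token attribution methods: f = sum of s_i over the property tokens."""
    return float(sum(token_scores[i] for i in property_positions))

def factor_from_pairs(pair_scores, property_positions):
    """Pairwise methods: pool s_ij over every pair touching a property token."""
    P = set(property_positions)
    n = len(pair_scores)
    return float(sum(pair_scores[i][j]
                     for i in range(n) for j in range(n)
                     if i in P or j in P))
```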
Results  First, we show that using explanations can indeed illustrate the model's behavior. As shown in Table 1, our approach (LAtAttr) is the best in this setting, achieving a simulation accuracy of 84%. That is, with a properly set threshold, we can successfully predict whether the model predictions change when perturbing the properties in the original example 88% of the time. The explanations therefore give us the ability to simulate our model's behavior better than the other methods here. Our approach also improves substantially over the vanilla AtAttr method.

Table 1: Results on HotpotQA Yes-No type and Bridge questions. Our approach can better predict the model behavior on realistic counterfactuals, surpassing attribution-based methods.

                Yes-No              Bridge
  Approach      S-ACC    S-AUC      S-ACC    S-AUC
  Majority      52.0     -          56.0     -
  Conf          64.0     49.8       66.0     65.9
  IntGrad       72.0     75.2       72.0     77.9
  DiffMask      66.0     60.2       68.0     62.3
  Archip        56.0     53.2       62.0     57.5
  AtAttr        66.0     63.6       72.0     79.1
  LAtAttr       84.0     87.9       78.0     81.7

Token attribution-based approaches obtain an accuracy around 72%. This indicates token attribution-based methods are not effective in the HotpotQA setting, which engages with interaction between tokens more intensively.

In this setting, DiffMask performs poorly, typically because it assigns high attribution to many tokens, since it determines which tokens need to be kept rather than distinguishing fine-grained importance (examples in the Appendix). It is possible that other heuristics or models learned on large numbers of perturbations could more meaningfully extract predictions from this technique.

5.2 Hotpot Bridge Questions

We also evaluate the explanation approaches on so-called bridge questions on the HotpotQA dataset, described in Yang et al. (2018). Figure 3 shows an example explanation for a bridge question. From the attribution scores we find the most salient connection is between the span "what government position" in the question and the span "United States Ambassador" in the context. This attribution directly highlights the reasoning shortcut (Jia and Liang, 2017; Chen and Durrett, 2019; Min et al., 2019; Jiang and Bansal, 2019) the model is using, where it disregards the second part of the question. If we inject an additional sentence, "Hillary Clinton is an American politician, who served as the United States secretary of the state from 2009 to 2013", into the context, the model will be misled and predict "United States secretary" as the new answer. This sentence could easily have been part of another document retrieved in the retrieval stage, so we consider its inclusion to be a realistic counterfactual.

[Figure 3: Explanations generated by our approach for a bridge type question from HotpotQA. The prediction can mostly be attributed to the primary question, indicating the model is taking the reasoning shortcut, and the prediction can be flipped with an adversarial sentence.]

We further define the primary question, i.e., the primary part (containing Wh- words) of the entire question (e.g., "What government position is held by the women" in Figure 3), following the decomposition principle from Min et al. (2019).

Hypothesis & Counterfactuals  The hypothesis H we investigate is: the model is using correct reasoning and not a shortcut driven by the primary question part. We construct counterfactuals following the same idea applied in our example. For a given question, we add an adversarial sentence based on the primary part (containing Wh- words) of the question so as to alter the model prediction. The added adversarial sentence contains context leading to a spurious answer to only the primary question, but does not change the gold answer (refer to the Appendix for examples). We do this twice, yielding a set D = {D0, D1, D2} consisting of the base example and two perturbations. We define the label of D to be z = 0 in the case that the model's prediction does change when being attacked, and z = 1 otherwise. We randomly sample 50 base data points from the development set, and two of our authors each write an adversarial sentence, forming 150 data points in total.

Connecting Explanation and Hypothesis  For this setting, we use a factor describing the importance of the primary question normalized by the importance of the entire question. Namely, let P = {p_i} be the set of tokens in the primary question, and Q = {q_i} be the set of tokens in the entire question. We define the factor f as the importance of P normalized by the importance of Q, where the importance calculation is the same as in Section 5.1. A higher factor means the model is relying more heavily on only the primary question and hence has a better chance of being attacked.
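For concreteness, a small sketch (ours) of this normalized factor, reusing the importance pooling from Section 5.1:

```python
# Small sketch (ours): primary-question importance normalized by
# whole-question importance, as in Section 5.2. `importance` maps a token
# position to its pooled importance score (token- or pair-based).
def normalized_primary_factor(importance, primary_positions, question_positions):
    primary = sum(importance[i] for i in primary_positions)
    whole = sum(importance[i] for i in question_positions)
    return primary / whole if whole else 0.0
```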
Results  According to the simulation AUC scores in Table 1, feature interaction-based techniques again outperform token attribution approaches. Our approach achieves a simulation accuracy of 78%, substantially higher than any other result.

5.3 SQuAD Adversarial

Hypothesis & Counterfactuals  Our hypothesis H is: the model can resist adversarial attacks of the addSent variety (Jia and Liang, 2017). For each of the original examples D0 from some of the SQuAD-Adv development set, Jia and Liang (2017) create 5 adversarial attacks, which are paraphrased and filtered by Turkers to give 0 to 5 valid attacks for each example, yielding our set D. We define the label of D to be z = 1 if the model resists all the adversarial attacks posed on D0 (i.e., predictions for D are the same). To ensure the behavior is more precisely profiled by the counterfactuals, we only keep the base examples with more than 3 valid attacks, resulting in a total of 276 (D, z) pairs (1,506 data points).

Connecting Explanation and Hypothesis  We use a factor p indicating the importance of the essential keywords extracted from the question using POS tags (proper nouns and numbers). E.g., for the question "What Florida stadium was considered for Super Bowl 50", we extract "Florida", "Super Bowl", and "50". If the model considers all the essential keywords mentioned in the question, it should not be fooled by distractors with irrelevant information. We show a set of illustrative examples in the Appendix. We compute the importance scores in the same way described in Section 5.1. In addition to the scores provided by various explanation techniques, we also use the model's confidence on the original prediction as a baseline.
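One way to implement the keyword-extraction heuristic, sketched here with spaCy (our choice; the paper does not prescribe a particular tagger, and multi-word names may come back as separate tokens):

```python
# Sketch (ours) of the essential-keyword heuristic from Section 5.3:
# keep proper nouns and numbers from the question. Requires the spaCy
# model "en_core_web_sm" to be installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def essential_keywords(question):
    return [tok.text for tok in nlp(question) if tok.pos_ in ("PROPN", "NUM")]

# essential_keywords("What Florida stadium was considered for Super Bowl 50")
# -> tokens covering "Florida", "Super Bowl", and "50"
```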
Results  We show results in Table 2. The best approaches (AtAttr and LAtAttr) can achieve a simulation accuracy around 70%, 10% above the performance based on confidence. This shows the model is indeed over-confident in its predictions; our assumption about robustness together with our technique can successfully expose the vulnerability in some of the model predictions.

Table 2: Simulation accuracy and AUC scores for the SQuAD adversarial setting, assessing whether the model changes its prediction on an example when attacked.

  Approach      S-ACC    S-AUC
  Majority      52.1     -
  Conf          58.3     57.8
  IntGrad       61.6     61.1
  DiffMask      57.6     53.6
  Archip        58.6     56.2
  AtAttr        68.4     72.5
  LAtAttr       70.0     72.1

There is room to improve on these results; our simple heuristic cannot perfectly connect the explanations to the model behavior in all cases. We note that there are other orthogonal approaches (Kamath et al., 2020) to calibrate the confidence of QA models' predictions by looking at statistics of the adversarial examples; here, our judgment is made purely based on the original example, and does not exploit learning to refine our heuristic.

5.4 Discussion and Limitations

Our explanations can reveal known dataset biases and reasoning shortcuts in HotpotQA, without having to perform a detailed manual analysis. This confirms the utility of our explanations: model designers can look at them, either manually or automatically, and determine how robust the model is going to be when faced with counterfactuals.

Our analysis also highlights limitations of current explanation techniques, and sheds light on future research directions on this topic. In our experiments, we observe other nontrivial behaviors of the QA model in the Hotpot setting. For instance, we created counterfactuals by permuting the order of the paragraphs constructing the context, which often gave rise to different predictions. This observation indicates the model prediction may also be impacted by biases in positional embeddings (e.g., the answer tends to occur in the first retrieved paragraph), which cannot be indicated by current attribution methods. We believe this is a useful avenue for future investigation. By first thinking of what kind of counterfactuals and what kind of behaviors we want to explain, we can motivate new explanation techniques.

6 Synthetic Dataset

We have evaluated our explanations' faithfulness and to what extent they help simulate model behavior. We now use a synthetic setting to evaluate plausibility, i.e., whether these explanations can successfully attribute the model's decisions to the rationales that humans would perceive. It is impossible to know what a QA model is doing on real data; therefore, we create a synthetic dataset and ensure via symmetry that there are no reasoning shortcuts, so a model generalizing on this dataset must be doing some form of correct reasoning.

We show a concrete example of this data in Figure 4, with details of the dataset construction and model in the Appendix. Clues external to the relevant parts of the context cannot provide any information relevant to the question, and given that our model generalizes perfectly, a plausibility evaluation is justified in this case. We do not need to construct counterfactuals for our evaluation on this dataset.

[Figure 4: Two examples of our synthetic data with ground truth rationales underlined: "E0 R0, E1 R1, E2 R0 ? E0 E2 !" and "E0 R0, E1 R1, E2 R0 ? E1 E2 !". In the first example, the context describes entities E0/E1/E2 as associated with relations R0/R1/R0, respectively; the first question asks whether E0 and E2 exhibit the same relation; the answer is yes. Only these tokens are provided to the model.]

Results  We assess whether an explanation aligns well with model behavior using the F1 scores between the ground truth rationales (6 tokens, excluding the "?" and "!" markers) and the top-6 important tokens picked by the explanation. The ground truth rationale in Figure 4 is underlined; the model should consider these tokens to determine the answer.

In general, feature interaction-based approaches performed better at recovering the ground truth explanations than token attribution-based approaches. The best method for this setting is Archip, achieving an F1 score of 0.69. Our LAtAttr approach is also effective in this setting and performs on par with Archip.

Table 3: The F1 scores between the models' top-6 highlighted tokens and ground truth rationales. Our approach is substantially better than AtAttr.

  Rand    IntGrad    DiffMask    Archipelago    AtAttn    LAtAttn
  0.45    0.55       0.64        0.69           0.55      0.67

Surprisingly, the simple synthetic case here turns out not to be so simple. This might be due to the complexity of our task. Despite being a synthetic task, it requires true multi-hop reasoning, a challenging task which modern models are still struggling to learn (Jiang and Bansal, 2019; Khashabi et al., 2019; Trivedi et al., 2020). This dataset exposes the need for better explanation techniques for this sort of reasoning and how it emerges.

7 Related Work

We focus on several prominent token attribution techniques, but there are other related methods as well, including Shapley Values (Štrumbelj and Kononenko, 2014; Lundberg and Lee, 2017), contextual decomposition (Jin et al., 2020), and hierarchical explanations (Chen et al., 2020). These formats can also be evaluated using our framework if connected with model behavior via a proper heuristic. Other work explores "concept-based" explanations (Mu and Andreas, 2020; Bau et al., 2017; Yeh et al., 2019). These provide another pathway towards building explanations of high-level behavior; however, they have been explored primarily for image recognition tasks and cannot be directly applied to QA, where defining these sorts of "concepts" is challenging.

Probing techniques aim to discover what intermediate representations have been learned in neural models (Tenney et al., 2019; Conneau et al., 2018; Hewitt and Liang, 2019; Voita and Titov, 2020). Internal representations could potentially be used to predict behavior on contrast sets similar to this work; however, this cannot be done heuristically and larger datasets are needed to explore this.

Other work considering how to evaluate explanations is primarily based on how explanations can assist humans in predicting the model decision for a given example (Doshi-Velez and Kim, 2017; Chandrasekaran et al., 2018; Nguyen, 2018; Hase and Bansal, 2020); we are the first to consider building contrast sets for this. Similar ideas have been used in other contexts (Kaushik et al., 2020; Gardner et al., 2020) but we are focused on evaluation of explanations rather than general model evaluation.
8 Conclusion

We have presented an evaluation technique based on realistic counterfactuals to evaluate explanations for RC models. We show that our evaluation method distinguishes which explanations truly give us insight about high-level model behavior. Feature interaction-based techniques perform the best in our analysis, especially our LAtAttr method. We advocate that future research conduct such quantitative evaluation based on realistic counterfactuals when developing novel explanation techniques.

Acknowledgments

Thanks to Eunsol Choi, Jifan Chen, Jiacheng Xu, Qiaochu Chen, and everyone in the UT TAUR lab for helpful discussions. This work was partially supported by NSF Grant IIS-1814522, NSF Grant SHF-1762299, a gift from Arm, a gift from Salesforce Inc, and an equipment grant from NVIDIA.

References

Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning.

Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable reinforcement learning via policy extraction. arXiv preprint arXiv:1805.08328.

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6541-6549.

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Nicola De Cao, Michael Sejr Schlichtkrull, Wilker Aziz, and Ivan Titov. 2020. How do decisions emerge across layers in neural models? Interpretation with differentiable masking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018.
e-SNLI: Nat- Dirk Groeneveld, Tushar Khot, Mausam, and Ashish ural Language Inference with Natural Language Ex- Sabharwal. 2020. A simple yet strong pipeline for planations. In Advances in Neural Information Pro- HotpotQA. In Proceedings of the 2020 Conference cessing Systems. on Empirical Methods in Natural Language Process- Arjun Chandrasekaran, Viraj Prabhu, Deshraj Yadav, ing (EMNLP). Prithvijit Chattopadhyay, and Devi Parikh. 2018. Do explanations make VQA models more predictable to Chaoyu Guan, Xiting Wang, Quanshi Zhang, Runjin a human? In Proceedings of the 2018 Conference Chen, Di He, and Xing Xie. 2019. Towards a deep on Empirical Methods in Natural Language Process- and unified understanding of deep neural models in ing. NLP. In Proceedings of the 36th International Con- ference on Machine Learning. Hanjie Chen, Guangtao Zheng, and Yangfeng Ji. 2020. Generating hierarchical explanations on text classifi- Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2020. cation via feature interaction detection. In Proceed- Self-attention attribution: Interpreting information ings of the 58th Annual Meeting of the Association interactions inside transformer. arXiv preprint for Computational Linguistics. arXiv:2004.11207.
David Harbecke. 2021. Explaining natural language Zachary C Lipton. 2018. The mythos of model inter- processing classifiers with occlusion and language pretability: In machine learning, the concept of in- modeling. arXiv preprint arXiv:2101.11889. terpretability is both important and slippery. Queue, 16(3):31–57. Peter Hase and Mohit Bansal. 2020. Evaluating ex- plainable AI: Which algorithmic explanations help Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar users predict model behavior? In Proceedings of the Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke 58th Annual Meeting of the Association for Compu- Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: tational Linguistics. A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692. Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating Visual Explanations. In Scott Lundberg and Su-In Lee. 2017. A unified ap- European Conference on Computer Vision (ECCV). proach to interpreting model predictions. arXiv preprint arXiv:1705.07874. J. Hewitt and P. Liang. 2019. Designing and interpret- ing probes with control tasks. In Empirical Methods Tim Miller. 2019. Explanation in artificial intelligence: in Natural Language Processing (EMNLP). Insights from the social sciences. Artificial intelli- gence, 267:1–38. Alon Jacovi and Y. Goldberg. 2020a. Aligning faithful interpretations with their social attribution. ArXiv, Sewon Min, Eric Wallace, Sameer Singh, Matt Gard- abs/2006.01067. ner, Hannaneh Hajishirzi, and Luke Zettlemoyer. Alon Jacovi and Yoav Goldberg. 2020b. Towards faith- 2019. Compositional questions do not necessitate fully interpretable NLP systems: How should we de- multi-hop reasoning. In Proceedings of the 57th An- fine and evaluate faithfulness? In Proceedings of the nual Meeting of the Association for Computational 58th Annual Meeting of the Association for Compu- Linguistics. tational Linguistics. Jesse Mu and Jacob Andreas. 2020. Composi- Robin Jia and Percy Liang. 2017. Adversarial exam- tional explanations of neurons. arXiv preprint ples for evaluating reading comprehension systems. arXiv:2006.14032. In acl. Dong Nguyen. 2018. Comparing automatic and human Yichen Jiang and Mohit Bansal. 2019. Avoiding rea- evaluation of local explanations for text classifica- soning shortcuts: Adversarial evaluation, training, tion. In Proceedings of the 2018 Conference of the and model development for multi-hop QA. In Pro- North American Chapter of the Association for Com- ceedings of the 57th Annual Meeting of the Associa- putational Linguistics: Human Language Technolo- tion for Computational Linguistics. gies, Volume 1 (Long Papers). Xisen Jin, Zhongyu Wei, Junyi Du, Xiangyang Xue, and Xiang Ren. 2020. Towards hierarchical impor- Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra tance attribution: Explaining compositional seman- Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019. tics for neural sequence models. In International Counterfactual story reasoning and generation. In Conference on Learning Representations. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the Amita Kamath, Robin Jia, and Percy Liang. 2020. Se- 9th International Joint Conference on Natural Lan- lective question answering under domain shift. In guage Processing (EMNLP-IJCNLP). Proceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. 
SQuAD: 100,000+ questions for Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. machine comprehension of text. In Proceedings of 2020. Learning the difference that makes a differ- the 2016 Conference on Empirical Methods in Natu- ence with counterfactually-augmented data. In Inter- ral Language Processing, pages 2383–2392, Austin, national Conference on Learning Representations. Texas. Association for Computational Linguistics. Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2019. On the Marco Tulio Ribeiro, Sameer Singh, and Carlos possibilities and limitations of multi-hop reason- Guestrin. 2016. “Why should I trust you?” Explain- ing under linguistic imperfections. arXiv preprint ing the predictions of any classifier. In Proceedings arXiv:1901.02522. of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of Marco Tulio Ribeiro, Sameer Singh, and Carlos the 2016 Conference on Empirical Methods in Nat- Guestrin. 2018. Anchors: High-precision model- ural Language Processing, pages 107–117, Austin, agnostic explanations. In AAAI, volume 18, pages Texas. Association for Computational Linguistics. 1527–1535.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Michael Tsang, Sirisha Rambhatla, and Yan Liu. 2020. and Sameer Singh. 2020. Beyond accuracy: Behav- How does this interaction affect me? interpretable at- ioral testing of NLP models with CheckList. In Pro- tribution for feature interactions. In Proceedings of ceedings of the 58th Annual Meeting of the Associa- the Conference on Advances in Neural Information tion for Computational Linguistics. Processing Systems (NeurIPS). Cynthia Rudin. 2019. Stop explaining black box ma- Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, chine learning models for high stakes decisions and Xiaodong He, and Bowen Zhou. 2020. Select, an- use interpretable models instead. Nature Machine swer and explain: Interpretable multi-hop reading Intelligence, 1. comprehension over multiple documents. In Pro- ceedings of the Association for the Advancement of Sofia Serrano and Noah A. Smith. 2019. Is attention Artificial Intelligence (AAAI). interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Lin- Abhinav Verma, Vijayaraghavan Murali, Rishabh guistics. Singh, Pushmeet Kohli, and Swarat Chaudhuri. Karen Simonyan, Andrea Vedaldi, and Andrew Zisser- 2018. Programmatically interpretable reinforce- man. 2013. Deep inside convolutional networks: Vi- ment learning. In Proceedings of the 35th Interna- sualising image classification models and saliency tional Conference on Machine Learning. maps. arXiv preprint arXiv:1312.6034. Elena Voita and Ivan Titov. 2020. Information- Julia Strout, Ye Zhang, and Raymond Mooney. 2019. theoretic probing with minimum description length. Do Human Rationales Improve Machine Explana- In Proceedings of the 2020 Conference on Empirical tions? In Proceedings of the 2019 ACL Workshop Methods in Natural Language Processing (EMNLP). BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 56–62, Florence, Italy. As- Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, sociation for Computational Linguistics. and Sameer Singh. 2019. Universal adversarial trig- gers for attacking and analyzing NLP. In Proceed- Erik Štrumbelj and Igor Kononenko. 2014. Explaining ings of the 2019 Conference on Empirical Methods prediction models and individual predictions with in Natural Language Processing and the 9th Inter- feature contributions. Knowledge and information national Joint Conference on Natural Language Pro- systems, 41(3):647–665. cessing (EMNLP-IJCNLP). Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Sarah Wiegreffe, Ana Marasović, and Noah A. Smith. Wolfson, Sameer Singh, Jonathan Berant, and Matt 2020. Measuring association between labels and Gardner. 2020. Obtaining faithful interpretations free-text rationales. ArXiv, abs/2010.12762. from compositional neural networks. arXiv preprint arXiv:2005.00724. Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Con- Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. ference on Empirical Methods in Natural Language Axiomatic attribution for deep networks. arXiv Processing and the 9th International Joint Confer- preprint arXiv:1703.01365. ence on Natural Language Processing (EMNLP- IJCNLP). Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Jialin Wu and Raymond Mooney. 2018. Faithful multi- Benjamin Van Durme, Sam Bowman, Dipanjan Das, modal explanation for visual question answering. and Ellie Pavlick. 2019. What do you learn from context? 
probing for sentence structure in contextu- Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben- alized word representations. In International Con- gio, William W. Cohen, Ruslan Salakhutdinov, and ference on Learning Representations. Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question James Thorne, Andreas Vlachos, Christos answering. In Conference on Empirical Methods in Christodoulopoulos, and Arpit Mittal. 2019. Natural Language Processing (EMNLP). Generating token-level explanations for natural language inference. In Proceedings of the 2019 Chih-Kuan Yeh, Been Kim, Sercan Ö. Arik, C. Li, Conference of the North American Chapter of the P. Ravikumar, and T. Pfister. 2019. On concept- Association for Computational Linguistics: Human based explanations in deep neural networks. ArXiv, Language Technologies, Volume 1 (Long and Short abs/1910.07969. Papers), pages 963–969, Minneapolis, Minnesota. Association for Computational Linguistics. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2020. Is multihop QA in DiRe condition? measuring and reducing discon- nected reasoning. In Proceedings of the 2020 Con- ference on Empirical Methods in Natural Language Processing (EMNLP).
A Details of Hotpot Yes-No Counterfactuals

Figure 5 shows several examples to illustrate our process of generating counterfactuals for the Hotpot yes-no setting.

Most Hotpot Yes-No questions follow one of two templates: Are A and B both __? (Figure 5, abc) and Are A and B of the same __? (Figure 5, def). We define the property tokens associated with each question as the tokens in the context that match the blank in the template; that is, the values of the property that A and B are being compared on. For example, in Figure 5a, French and German are the property tokens, as the property of interest is the national origin.

To construct a neighborhood for a base data point, we take the following steps:

1. Manually extract the property tokens in the context

2. Replace the property token with two substitutes, forming a set of four counterfactuals exhibiting nonidentical ground truths

When the properties associated with the two entities differ from each other, we directly use the properties extracted as the substitutes (Figure 5, abf); otherwise we add a new property candidate that is of the same class (Figure 5, cde).

We annotated randomly sampled examples from the Hotpot Yes-No questions. We skipped several examples that compared abstract concepts with no explicit property tokens. For instance, we skipped the question Are both Yangzhou and Jiangyan District considered coastal cities?, whose given context does not explicitly mention whether the cities are coastal cities. We looked through 61 examples in total and obtained annotations for 50 examples, so such discarded examples constitute a relatively small fraction of the dataset. Overall, this resulted in 200 counterfactual instances. We found the prediction of a RoBERTa QA model on 52% of the base data points changes when being perturbed.

B Details of Hotpot Bridge Counterfactuals

Figure 6 shows several examples of our annotations for generating counterfactuals for Hotpot Bridge examples. Specifically, we view bridge questions as consisting of two single-hop questions, the primary part (marked in Figure 6) and the secondary part. The primary part is the main body of the question, whereas the secondary part is usually a clause used to link the bridge entity (Min et al., 2019). We construct our neighborhoods as follows:

1. Manually decompose our sentences into the primary and secondary parts

2. Make up adversarial sentences to confuse the model. For a given base example, an adversarial sentence typically provides a spurious answer to the primary question, but does not change the gold answer.

Two of the authors each wrote a single adversarial sentence for 50 of the Hotpot Bridge examples, yielding 150 counterfactual instances in total. The adversarial sentences manage to alter 56% of the predictions on the base examples.

C Details of Synthetic Dataset

Our dataset is generated using templates, with 20 entities (E0 through E19) and 20 relations (R0 through R19). We place 3 or 4 entities in the context. We randomly inject tokens between entity-relation pairs (we do not inject within any entity-relation pair) to prevent the model from learning spurious correlations with positional embeddings.

We create a training/validation set of 200,000/10,000 examples, respectively, and train a 2-layer 12-head transformer model for this task, achieving 100% accuracy on the training set and over 98% accuracy on the validation set.
Question Were Ulrich Walter and Léopold Eyharts both from Germany? Context Léopold Eyharts (born April 28, 1957) is a Brigadier General in the French Air Force, an engineer and (a) ESA astronaut. Prof. Dr. Ulrich Hans Walter (born February 9, 1954) is a German physicist/engineer and a former DFVLR astronaut. Substitutes French, German Question Are both Aloinopsis and Eriogonum ice plants? Context Aloinopsis is a genus of ice plants from South Africa. (b) Eriogonum is the scientific name for a genus of flowering plants in the family Polygonaceae. The genus is found in North America and is known as wild buckwheat. Substitutes ice, flowering Question Were Frank R. Strayer and Krzysztof Kieślowski both Directors? Context Frank R. Strayer (September 21, 1891 - 2013 February 3, 1964) was an actor, film writer, and director . (c) He was active from the mid-1920s until the early 1950s. Krzysztof Kieślowski (27 June 1941 - 13 March 1996) was a Polish art-house film director and screenwriter. Substitutes director, producer Question Were Scott Derrickson and Ed Wood of the same nationality? Context Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer. (d) Edward Davis Wood Jr. (October 10, 1924 - 2013 December 10, 1978) was an American filmmaker, actor, writer, producer, and director. Substitutes American, English Question Are the movies "Monsters, Inc." and "Mary Poppins" both by the same company? Context Mary Poppins is a 1964 American musical-fantasy film directed by Robert Stevenson and produced by (e) Walt Disney , with songs written and composed by the Sherman Brothers. Monsters, Inc. is a 2001 American computer-animated comedy film produced by Pixar Animation Studios and distributed by Walt Disney Pictures. Substitutes Walt Disney, Universal Question Are Steve Perry and Dennis Lyxzén both members of the same band? Context Stephen Ray Perry (born January 22, 1949) is an American singer, songwriter and record producer. He (f) is best known as the lead singer of the rock band Journey . Dennis Lyxzén (born June 19, 1972) is a musician best known as the lead vocalist for Swedish hardcore punk band Refused . Substitutes Journey, Refused Figure 5: Examples (contexts are truncated for brevity) of our annotations on Hotpot Yes-No base data points. We find the property tokens in the context, and build realist counterfactuals by replacing them with substitutes that are properties extracted in the base data point or similar properties hand-selected by us.
Question What is the name of the fight song of the university whose main campus is in Lawrence, Kansas and whose branch campuses are in the Kansas City metropolitan area? (a) Context Kansas Song (We’re From Kansas) is a fight song of the University of Kansas. The University of Kansas, often referred to as KU or Kansas, is a public research university in the U.S. state of Kansas. The main campus in Lawrence, one of the largest college towns in Kansas, is on Mount Oread, the highest elevation in Lawrence. Two branch campuses are in the Kansas City metropolitan area. Adv Sent 1 Texas Fight is a fight song of the University of Texas at Austin. Adv Sent 2 Big C is a fight song of the University of California, Berkeley. Question What screenwriter with credits for "Evolution" co-wrote a film starring Nicolas Cage and Téa Leoni? Context David Weissman is a screenwriter and director. His film credits include "The Family Man" (2000), (b) "Evolution" (2001), and "When in Rome" (2010). The Family Man is a 2000 American romantic comedy-drama film directed by Brett Ratner, written by David Diamond and David Weissman, and starring Nicolas Cage and Téa Leoni. Adv Sent 1 Don Jakoby is an American screenwriter that collabrates with David Weissman in "Evolution". Adv Sent 2 Damien Chazelle is a screenwriter most notably known for writing La La Land. Question The arena where the Lewiston Maineiacs played their home games can seat how many people ? Context The Androscoggin Bank Colisée (formerly Central Maine Civic Center and Lewiston Colisee) is a 4,000 (c) capacity (3,677 seated) multi-purpose arena, in Lewiston, Maine, that opened in 1958. The Lewiston Maineiacs were a junior ice hockey team of the Quebec Major Junior Hockey League based in Lewiston, Maine. The team played its home games at the Androscoggin Bank Colisée. Adv Sent 1 Allianz (known as Fußball Arena München for UEFA competitions) is a arena in Munich, with a 5,000 seating capacity. Adv Sent 2 The Tacoma Dome is a multi-purpose arena (221,000 capacity, 10,000 seated) in Tacoma, Washington, United States. Question Scott Parkin has been a vocal critic of Exxonmobil and another corporation that has operations in how many countries ? (d) Context Scott Parkin (born 1969, Garland, Texas is an anti-war, environmental and global justice organizer, former community college history instructor, and a founding member of the Houston Global Awareness Collective. He has been a vocal critic of the American invasion of Iraq, and of corporations such as Exxonmobil and Halliburton. The Halliburton Company, an American multinational corporation. One of the world’s largest oil field service companies, it has operations in more than 70 countries. Adv Sent 1 Visa is a corporation that has operations in more than 200 countries. Adv Sent 2 The Ford Motor Company is an American multinational corporation with operations in more than 100 countries. Question In 1991 Euromarché was bought by a chain that operated how any hypermarkets at the end of 2016? Context Carrefour S.A. is a French multinational retailer headquartered in Boulogne Billancourt, France, in the (e) Hauts-de-Seine Department near Paris. It is one of the largest hypermarket chains in the world (with 1,462 hypermarkets at the end of 2016). Euromarché was a French hypermarket chain. In June 1991, the group was rebought by its rival, Carrefour, for 5,2 billion francs. 
Adv Sent 1 Walmart Inc is a multinational retail corporation that operates a chain of hypermarkets that owns 4,700 hypermarkets within the United States at the end of 2016. Adv Sent 2 Trader Joe’s is an American chain of grocery stores headquartered in Monrovia, California. By the end of 2016, Trader Joe’s had over 503 stores nationwide in 42 states. Question What was the father of Kasper Schmeichel voted to be by the IFFHS in 1992? Context Peter Bolesław Schmeichel MBE (born 18 November 1963) is a Danish former professional footballer (f) who played as a goalkeeper, and was voted the IFFHS World’s Best Goalkeeper in 1992 and 1993. Kasper Peter Schmeichel (born 5 November 1986) is a Danish professional footballer. He is the son of former Manchester United and Danish international goalkeeper Manuel Neuer. Adv Sent 1 Robert Lewandowski was voted to be the World’s Best Striker in 1992. Adv Sent 2 Michael Jordan was voted the IFFHS best NBA player in 1992. Figure 6: Examples (contexts are truncated for brevity) of primary questions and adversarial senteces for creating Hotpot Bridge counterfactuals.