Unification-based Reconstruction of Multi-hop Explanations for Science Questions
Marco Valentino∗†, Mokanarangan Thayaparan∗†, André Freitas†‡
Department of Computer Science, University of Manchester, United Kingdom†
Idiap Research Institute, Switzerland‡
{marco.valentino, mokanarangan.thayaparan, andre.freitas}@manchester.ac.uk
∗ Equal contribution.
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 200–211, April 19–23, 2021. ©2021 Association for Computational Linguistics.

Abstract

This paper presents a novel framework for reconstructing multi-hop explanations in science Question Answering (QA). While existing approaches for multi-hop reasoning build explanations considering each question in isolation, we propose a method to leverage explanatory patterns emerging in a corpus of scientific explanations. Specifically, the framework ranks a set of atomic facts by integrating lexical relevance with the notion of unification power, estimated by analysing explanations for similar questions in the corpus.
An extensive evaluation is performed on the Worldtree corpus, integrating k-NN clustering and Information Retrieval (IR) techniques. We present the following conclusions: (1) The proposed method achieves results competitive with Transformers, yet being orders of magnitude faster, a feature that makes it scalable to large explanatory corpora; (2) The unification-based mechanism has a key role in reducing semantic drift, contributing to the reconstruction of many-hops explanations (6 or more facts) and the ranking of complex inference facts (+12.0 Mean Average Precision); (3) Crucially, the constructed explanations can support downstream QA models, improving the accuracy of BERT by up to 10% overall.

1 Introduction

Answering multiple-choice science questions has become an established benchmark for testing natural language understanding and complex reasoning in Question Answering (QA) (Khot et al., 2019; Clark et al., 2018; Mihaylov et al., 2018). In parallel with other NLP research areas, a crucial requirement emerging in recent years is explainability (Thayaparan et al., 2020; Miller, 2019; Biran and Cotton, 2017; Ribeiro et al., 2016). To boost automatic methods of inference, it is necessary not only to measure the performance on answer prediction, but also the ability of a QA system to provide explanations for the underlying reasoning process.

The need for explainability and a quantitative methodology for its evaluation have led to the creation of shared tasks on explanation reconstruction (Jansen and Ustalov, 2019) using corpora of explanations such as Worldtree (Jansen et al., 2018, 2016). Given a science question, explanation reconstruction consists in regenerating the gold explanation that supports the correct answer through the combination of a series of atomic facts. While most of the existing benchmarks for multi-hop QA require the composition of only 2 supporting sentences or paragraphs (e.g. QASC (Khot et al., 2019), HotpotQA (Yang et al., 2018)), the explanation reconstruction task requires the aggregation of an average of 6 facts (and as many as ≈20), making it particularly hard for multi-hop reasoning models.

Moreover, the structure of the explanations affects the complexity of the reconstruction task. Explanations for science questions are typically composed of two main parts: a grounding part, containing knowledge about concrete concepts in the question, and a core scientific part, including general scientific statements and laws.

Consider the following question and answer pair from Worldtree (Jansen et al., 2018):
• q: what is an example of a force producing heat?
  a: two sticks getting warm when rubbed together.
An explanation that justifies a is composed using the following sentences from the corpus: (f1) a stick is a kind of object; (f2) to rub together means to move against; (f3) friction is a kind of force; (f4) friction occurs when two objects' surfaces move against each other; (f5) friction causes the temperature of an object to increase.
The explanation contains a set of concrete sentences that are conceptually connected with q and a (f1, f2 and f3), along with a set of abstract facts that require multi-hop inference (f4 and f5).

Previous work has shown that constructing long explanations is challenging due to semantic drift – i.e. the tendency of composing out-of-context inference chains as the number of hops increases (Khashabi et al., 2019; Fried et al., 2015). While existing approaches build explanations considering each question in isolation (Khashabi et al., 2018; Khot et al., 2017), we hypothesise that semantic drift can be tackled by leveraging explanatory patterns emerging in clusters of similar questions.

In Science, a given statement is considered explanatory to the extent it performs unification (Friedman, 1974; Kitcher, 1981, 1989), that is, showing how a set of initially disconnected phenomena are the expression of the same regularity. An example of unification is Newton's law of universal gravitation, which unifies the motion of planets and falling bodies on Earth by showing that all bodies with mass obey the same law. Since the explanatory power of a given statement depends on the number of unified phenomena, highly explanatory facts tend to create unification patterns – i.e. similar phenomena require similar explanations. Coming back to our example, we hypothesise that the relevance of abstract statements requiring multi-hop inference, such as f4 ("friction occurs when two objects' surfaces move against each other"), can be estimated by taking into account their unification power.

Following these observations, we present a framework that ranks atomic facts through the combination of two scoring functions:
• A Relevance Score (RS) that represents the lexical relevance of a given fact.
• A Unification Score (US) that models the explanatory power of a fact according to its frequency in explanations for similar questions.

An extensive evaluation is performed on the Worldtree corpus (Jansen et al., 2018; Jansen and Ustalov, 2019), adopting a combination of k-NN clustering and Information Retrieval (IR) techniques. We present the following conclusions:
1. Despite its simplicity, the proposed method achieves results competitive with Transformers (Das et al., 2019; Chia et al., 2019), yet being orders of magnitude faster, a feature that makes it scalable to large explanatory corpora.
2. We empirically demonstrate the key role of the unification-based mechanism in the reconstruction of many-hops explanations (6 or more facts) and explanations requiring complex inference (+12.0 Mean Average Precision).
3. Crucially, the constructed explanations can support downstream question answering models, improving the accuracy of BERT (Devlin et al., 2019) by up to 10% overall.

To the best of our knowledge, we are the first to propose a method that leverages unification patterns for the reconstruction of multi-hop explanations, and to empirically demonstrate their impact on semantic drift and downstream question answering.

2 Related Work

Explanations for Science Questions. Reconstructing explanations for science questions can be reduced to a multi-hop inference problem, where multiple pieces of evidence have to be aggregated to arrive at the final answer (Thayaparan et al., 2020; Khashabi et al., 2018; Khot et al., 2017; Jansen et al., 2017). Aggregation methods based on lexical overlaps and explicit constraints suffer from semantic drift (Khashabi et al., 2019; Fried et al., 2015) – i.e. the tendency of composing spurious inference chains leading to wrong conclusions.

One way to contain semantic drift is to leverage common explanatory patterns in explanation-centred corpora (Jansen et al., 2018). Transformers (Das et al., 2019; Chia et al., 2019) represent the state-of-the-art for explanation reconstruction in this setting (Jansen and Ustalov, 2019). However, these models require high computational resources that prevent their applicability to large corpora. On the other hand, approaches based on IR techniques are readily scalable. The approach described in this paper preserves the scalability of IR methods, obtaining, at the same time, performances competitive with Transformers. Thanks to this feature, the framework can be flexibly applied in combination with downstream question answering models.

Our findings are in line with previous work in different QA settings (Rajani et al., 2019; Yadav et al., 2019), which highlights the positive impact of explanations and supporting facts on the final answer prediction task.
In parallel with Science QA, the development of models for explanation generation is being explored in different NLP tasks, ranging from open domain question answering (Yang et al., 2018; Thayaparan et al., 2019), to textual entailment (Camburu et al., 2018) and natural language premise selection (Ferreira and Freitas, 2020b,a).

Scientific Explanation and AI. The field of Artificial Intelligence has been historically inspired by models of explanation in Philosophy of Science (Thagard and Litt, 2008). The deductive-nomological model proposed by Hempel (Hempel, 1965) constitutes the philosophical foundation for explainable models based on logical deduction, such as Expert Systems (Lacave and Diez, 2004; Wick and Thompson, 1992) and Explanation-based Learning (Mitchell et al., 1986). Similarly, the inherent relation between explanation and causality (Woodward, 2005; Salmon, 1984) has inspired computational models of causal inference (Pearl, 2009). The view of explanation as unification (Friedman, 1974; Kitcher, 1981, 1989) is closely related to Case-based reasoning (Kolodner, 2014; Sørmo et al., 2005; De Mantaras et al., 2005). In this context, analogical reasoning plays a key role in the process of reusing abstract patterns for explaining new phenomena (Thagard, 1992). Similarly to our approach, Case-based reasoning applies this insight to construct solutions for novel problems by retrieving, reusing and adapting explanations for known cases solved in the past.

3 Explanation Reconstruction as a Ranking Problem

A multiple-choice science question Q = {q, C} is a tuple composed of a question q and a set of candidate answers C = {c1, c2, ..., cn}. Given an hypothesis hj, defined as the concatenation of q with a candidate answer cj ∈ C, the task of explanation reconstruction consists in selecting a set of atomic facts from a knowledge base, Ej = {f1, f2, ..., fn}, that support and justify hj.

In this paper, we adopt a methodology that relies on the existence of a corpus of explanations. A corpus of explanations is composed of two distinct knowledge sources:
• A primary knowledge base, Facts KB (Fkb), defined as a collection of sentences Fkb = {f1, f2, ..., fn} encoding the general world knowledge necessary to answer and explain science questions. A fundamental and desirable characteristic of Fkb is reusability – i.e. each of its facts fi can be potentially reused to compose explanations for multiple questions.
• A secondary knowledge base, Explanation KB (Ekb), consisting of a set of tuples Ekb = {(h1, E1), (h2, E2), ..., (hm, Em)}, each of them connecting a true hypothesis hj to its corresponding explanation Ej = {f1, f2, ..., fk} ⊆ Fkb. An explanation Ej ∈ Ekb is therefore a composition of facts belonging to Fkb.

In this setting, the explanation reconstruction task for an unseen hypothesis hj can be modelled as a ranking problem (Jansen and Ustalov, 2019). Specifically, given an hypothesis hj, the algorithm to solve the task is divided into three macro steps:
1. Computing an explanatory score si = e(hj, fi) for each fact fi ∈ Fkb with respect to hj;
2. Producing an ordered set Rank(hj) = {f1, ..., fk, fk+1, ..., fn | sk ≥ sk+1} ⊆ Fkb;
3. Selecting the top k elements belonging to Rank(hj) and interpreting them as an explanation for hj: Ej = topK(Rank(hj)).

3.1 Modelling Explanatory Relevance

We present an approach for modelling e(hj, fi) that is guided by the following research hypotheses:
• RH1: Scientific explanations are composed of a set of concrete facts connected to the question, and a set of abstract statements expressing general scientific laws and regularities.
• RH2: Concrete facts tend to share key concepts with the question and can therefore be effectively ranked by IR techniques based on lexical relevance.
• RH3: General scientific statements tend to be abstract and therefore difficult to rank by means of shared concepts. However, due to explanatory unification, core scientific facts tend to be frequently reused across similar questions. We hypothesise that the explanatory power of a fact fi for a given hypothesis hj is proportional to the number of times fi explains similar hypotheses.
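Before specifying e(hj, fi), it is worth making the ranking procedure itself concrete. The following sketch is illustrative only and is not the released implementation: Fkb and Ekb are simplified to plain Python lists, and the scoring function is left as a parameter, to be instantiated by the Relevance and Unification Scores introduced below.

```python
from typing import Callable, List, Tuple

# Illustrative, simplified data structures (not the released implementation):
# Fkb is a list of atomic sentences; Ekb is a list of (hypothesis, explanation) tuples.
FactsKB = List[str]
ExplanationKB = List[Tuple[str, List[str]]]


def reconstruct_explanation(
    hypothesis: str,
    facts_kb: FactsKB,
    explanatory_score: Callable[[str, str], float],
    top_k: int = 10,
) -> List[str]:
    """The three macro steps: score every fact, rank, and keep the top k."""
    # 1. Compute an explanatory score s_i = e(h_j, f_i) for each fact f_i in Fkb
    scores = {fact: explanatory_score(hypothesis, fact) for fact in facts_kb}
    # 2. Produce Rank(h_j): facts ordered by decreasing score
    ranked = sorted(facts_kb, key=lambda fact: scores[fact], reverse=True)
    # 3. Interpret the top k elements as the explanation E_j
    return ranked[:top_k]
```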
[Figure 1 (overview diagram): the test hypothesis hj ("What is an example of a force producing heat? Two sticks getting warm when rubbed together") is compared against the hypotheses hz stored in Ekb to compute sim(hj, hz) and the Unification Score us(hj, fi), and against the facts fi stored in Fkb to compute the Relevance Score rs(hj, fi); the two scores are combined to produce the final explanation ranking.]
Figure 1: Overview of the Unification-based framework for explanation reconstruction.

To formalise these research hypotheses, we model the explanatory scoring function e(hj, fi) as a combination of two components:

e(hj, fi) = λ1 · rs(hj, fi) + (1 − λ1) · us(hj, fi)    (1)

Here, rs(hj, fi) represents a lexical Relevance Score (RS) assigned to fi ∈ Fkb with respect to hj, while us(hj, fi) represents the Unification Score (US) of fi computed over Ekb as follows:

us(hj, fi) = Σ_{(hz, Ez) ∈ kNN(hj)} sim(hj, hz) · in(fi, Ez)    (2)

in(fi, Ez) = 1 if fi ∈ Ez, 0 otherwise    (3)

kNN(hj) = {(h1, E1), ..., (hk, Ek)} ⊆ Ekb is the set of k-nearest neighbours of hj belonging to Ekb, retrieved according to a similarity function sim(hj, hz). On the other hand, in(fi, Ez) verifies whether the fact fi belongs to the explanation Ez for the hypothesis hz.

In the formulation of Equation 2 we aim to capture two main aspects related to our research hypotheses:
1. The more a fact fi is reused for explanations in Ekb, the higher its explanatory power and therefore its Unification Score;
2. The Unification Score of a fact fi is proportional to the similarity between the hypotheses in Ekb that are explained by fi and the unseen hypothesis (hj) we want to explain.
Figure 1 shows a schematic representation of the Unification-based framework.
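Equations 1–3 translate almost directly into code. The sketch below is a simplified approximation of the approach, not the released system: it uses scikit-learn TF-IDF vectors with cosine similarity both for the Relevance Score rs(hj, fi) and for the hypothesis similarity sim(hj, hz) (the paper also evaluates BM25 variants, which are omitted here), and it plugs in the hyperparameter values λ1 = 0.83 and k = 100 reported in the appendix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class UnificationReranker:
    """Joint Relevance + Unification scoring (Equations 1-3), TF-IDF variant."""

    def __init__(self, facts_kb, explanation_kb, lambda1=0.83, k=100):
        # facts_kb: list of sentences (Fkb)
        # explanation_kb: list of (hypothesis, list of explanation facts) tuples (Ekb);
        # explanation facts are assumed to be the same strings used in facts_kb
        self.facts_kb = facts_kb
        self.hypotheses = [h for h, _ in explanation_kb]
        self.explanations = [set(facts) for _, facts in explanation_kb]
        self.lambda1, self.k = lambda1, k
        self.vectorizer = TfidfVectorizer().fit(facts_kb + self.hypotheses)
        self.fact_vectors = self.vectorizer.transform(facts_kb)
        self.hyp_vectors = self.vectorizer.transform(self.hypotheses)

    def rank(self, hypothesis):
        h_vec = self.vectorizer.transform([hypothesis])
        # Relevance Score rs(h_j, f_i): lexical similarity between hypothesis and fact
        rs = cosine_similarity(h_vec, self.fact_vectors)[0]
        # kNN(h_j): the k hypotheses in Ekb most similar to the unseen hypothesis
        sim = cosine_similarity(h_vec, self.hyp_vectors)[0]
        neighbours = np.argsort(-sim)[: self.k]
        # Unification Score us(h_j, f_i), Equation 2: sum of sim(h_j, h_z) over the
        # neighbours whose gold explanation E_z contains f_i (the in(f_i, E_z) test, Eq. 3)
        us = np.array([
            sum(sim[z] for z in neighbours if fact in self.explanations[z])
            for fact in self.facts_kb
        ])
        # Equation 1: convex combination of the two scores
        scores = self.lambda1 * rs + (1 - self.lambda1) * us
        return [self.facts_kb[i] for i in np.argsort(-scores)]
```

The quadratic loop computing the Unification Score is kept for readability; in practice the sum only needs to touch the facts that actually occur in the k retrieved explanations.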
4 Empirical Evaluation

We carried out an empirical evaluation on the Worldtree corpus (Jansen et al., 2018), a subset of the ARC dataset (Clark et al., 2018) that includes explanations for science questions. The corpus provides an explanatory knowledge base (Fkb and Ekb) where each explanation in Ekb is represented as a set of lexically connected sentences describing how to arrive at the correct answer. The science questions in the Worldtree corpus are split into training-set, dev-set, and test-set. The gold explanations in the training-set are used to form the Explanation KB (Ekb), while the gold explanations in dev and test set are used for evaluation purposes only. The corpus groups the explanation sentences belonging to Ekb into three explanatory roles: grounding, central and lexical glue.

Consider the example in Figure 1. To support q and cj, the system has to retrieve the scientific facts describing how friction occurs and produces heat across objects. The corpus classifies these facts (f3, f4) as central. Grounding explanations like "stick is a kind of object" (f1) link question and answer to the central explanations. Lexical glues such as "to rub; to rub together means to move against" (f2) are used to fill lexical gaps between sentences. Additionally, the corpus divides the facts belonging to Fkb into three inference categories: retrieval type, inference supporting type, and complex inference type. Taxonomic knowledge and properties such as "stick is a kind of object" (f1) and "friction is a kind of force" (f5) are classified as retrieval type. Facts describing actions, affordances, and requirements such as "friction occurs when two objects' surfaces move against each other" (f3) are grouped under the inference supporting type. Knowledge about causality, description of processes and if-then conditions such as "friction causes the temperature of an object to increase" (f4) is classified as complex inference.

We implement the Relevance and Unification Score adopting TF-IDF/BM25 vectors and the cosine similarity function (i.e. 1 − cos(x, y)). Specifically, the RS model uses TF-IDF/BM25 to compute the relevance function for each fact in Fkb (i.e. the rs(hj, fi) function in Equation 1), while the US model adopts TF-IDF/BM25 to assign similarity scores to the hypotheses in Ekb (i.e. the sim(hj, hz) function in Equation 2). For reproducibility, the code is available at the following url: https://github.com/ai-systems/unification_reconstruction_explanations. Additional details can be found in the supplementary material.

4.1 Explanation Reconstruction

In line with the shared task (Jansen and Ustalov, 2019), the performances of the models are evaluated via Mean Average Precision (MAP) of the explanation ranking produced for a given question qj and its correct answer aj.

Table 1 illustrates the score achieved by our best implementation compared to state-of-the-art approaches in the literature. Previous approaches are grouped into four categories: Transformers, Information Retrieval with re-ranking, One-step Information Retrieval, and Feature-based models.

Model | Approach | Trained | Test | Dev
Transformers
  Das et al. (2019) | BERT re-ranking with inference chains | Yes | 56.3 | 58.5
  Chia et al. (2019) | BERT re-ranking with gold IR scores | Yes | 47.7 | 50.9
  Banerjee (2019) | BERT iterative re-ranking | Yes | 41.3 | 42.3
IR with re-ranking
  Chia et al. (2019) | Iterative BM25 | No | 45.8 | 49.7
One-step IR
  BM25 | BM25 Relevance Score | No | 43.0 | 46.1
  TF-IDF | TF-IDF Relevance Score | No | 39.4 | 42.8
Feature-based
  D'Souza et al. (2019) | Feature-rich SVM ranking + Rules | Yes | 39.4 | 44.4
  D'Souza et al. (2019) | Feature-rich SVM ranking | Yes | 34.1 | 37.1
Unification-based Reconstruction
  RS + US (Best) | Joint Relevance and Unification Score | No | 50.8 | 54.5
  US (Best) | Unification Score | No | 22.9 | 21.9

Table 1: Results (MAP) on test and dev set and comparison with state-of-the-art approaches. The column Trained indicates whether the model requires an explicit training session on the explanation reconstruction task.

Transformers. This class of approaches employs the gold explanations in the corpus to train a BERT language model (Devlin et al., 2019). The best-performing system (Das et al., 2019) adopts a multi-step retrieval strategy. In the first step, it returns the top K sentences ranked by a TF-IDF model. In the second step, BERT is used to re-rank the paths composed of all the facts that are within 1-hop from the first retrieved set. Similarly, other approaches adopt BERT to re-rank each fact individually (Banerjee, 2019; Chia et al., 2019).
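All the comparisons in this section are expressed in terms of MAP, and Section 4.2 additionally reports Precision@K (Figure 2b). The sketch below is a standard implementation of these ranking metrics, given for illustration; it may differ in minor details (e.g. tie handling) from the official shared-task evaluation script.

```python
def average_precision(ranked_facts, gold_explanation):
    """Average Precision of a ranked fact list against the gold explanation set."""
    gold = set(gold_explanation)
    hits, precision_sum = 0, 0.0
    for rank, fact in enumerate(ranked_facts, start=1):
        if fact in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold) if gold else 0.0


def mean_average_precision(all_rankings, all_gold_explanations):
    """MAP over all questions (parallel lists of rankings and gold explanations)."""
    scores = [average_precision(r, g) for r, g in zip(all_rankings, all_gold_explanations)]
    return sum(scores) / len(scores)


def precision_at_k(ranked_facts, gold_explanation, k):
    """Fraction of the top-k ranked facts that belong to the gold explanation."""
    gold = set(gold_explanation)
    return sum(1 for fact in ranked_facts[:k] if fact in gold) / k
```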
Although the best model achieves state-of-the-art results in explanation reconstruction, these approaches are computationally expensive, being limited by the application of a pre-filtering step to contain the space of candidate facts. Consequently, these systems do not scale with the size of the corpus. We estimated that the best performing model (Das et al., 2019) takes ≈ 10 hours to run on the whole test set (1240 questions) using 1 Tesla 16GB V100 GPU.

Comparatively, our model constructs explanations for all the questions in the test set in ≈ 30 seconds, without requiring the use of GPUs (< 1 second per question). This feature makes the Unification-based Reconstruction suitable for large corpora and downstream question answering models (as shown in Section 4.4). Moreover, our approach does not require any explicit training session on the explanation regeneration task, with a significantly reduced number of parameters to tune. Along with scalability, the proposed approach achieves nearly state-of-the-art results (50.8/54.5 MAP). Although we observe lower performance when compared to the best-performing approach (-5.5/-4.0 MAP), the joint RS + US model outperforms two BERT-based models (Chia et al., 2019; Banerjee, 2019) on both test and dev set by 3.1/3.6 and 9.5/12.2 MAP respectively.

Information Retrieval with re-ranking. Chia et al. (2019) describe a multi-step, iterative re-ranking model based on BM25. The first step consists in retrieving the explanation sentence that is most similar to the question adopting BM25 vectors. During the second step, the BM25 vector of the question is updated by aggregating it with the retrieved explanation sentence vector through a max operation. The first and second steps are repeated K times. Although this approach uses scalable IR techniques, it relies on a multi-step retrieval strategy. Besides, the RS + US model outperforms this approach on both test and dev set by 5.0/4.8 MAP respectively.

One-step Information Retrieval. We compare the RS + US model with two IR baselines. The baselines adopt TF-IDF and BM25 to compute the Relevance Score only – i.e. the us(hj, fi) term in Equation 1 is set to 0 for each fact fi ∈ Fkb. In line with previous IR literature (Robertson et al., 2009), BM25 leads to better performance than TF-IDF. While these approaches share similar characteristics, the combined RS + US model outperforms both RS BM25 and RS TF-IDF on test and dev-set by 7.8/8.4 and 11.4/11.7 MAP. Moreover, the joint RS + US model improves the performance of the US model alone by 27.9/32.6 MAP. These results outline the complementary aspects of Relevance and Unification Score. We provide a detailed analysis by performing an ablation study on the dev-set (Section 4.2).

(a) Explanatory roles (MAP)
Model | All | Central | Grounding | Lexical Glue
RS TF-IDF | 42.8 | 43.4 | 25.4 | 8.2
RS BM25 | 46.1 | 46.6 | 23.3 | 10.7
US TF-IDF | 21.6 | 16.9 | 22.0 | 13.4
US BM25 | 21.9 | 18.1 | 16.7 | 15.0
RS TF-IDF + US TF-IDF | 48.5 | 46.4 | 32.7 | 11.7
RS TF-IDF + US BM25 | 50.7 | 48.6 | 30.42 | 13.4
RS BM25 + US TF-IDF | 51.9 | 48.2 | 31.7 | 14.8
RS BM25 + US BM25 | 54.5 | 51.7 | 27.3 | 16.7

(b) Lexical overlaps with the hypothesis (MAP)
Model | 1+ Overlaps | 1 Overlap | 0 Overlaps
RS TF-IDF | 57.2 | 33.6 | 7.1
RS BM25 | 62.2 | 37.1 | 7.1
US TF-IDF | 17.37 | 18.0 | 12.5
US BM25 | 18.1 | 18.1 | 13.1
RS TF-IDF + US TF-IDF | 60.2 | 38.4 | 9.0
RS TF-IDF + US BM25 | 62.5 | 39.5 | 9.6
RS BM25 + US TF-IDF | 61.3 | 40.6 | 11.0
RS BM25 + US BM25 | 64.8 | 41.9 | 11.2

(c) Inference types (MAP)
Model | Retrieval | Inference-supporting | Complex Inference
RS TF-IDF | 33.5 | 34.7 | 21.8
RS BM25 | 36.0 | 36.1 | 24.8
US TF-IDF | 17.6 | 12.8 | 19.5
US BM25 | 16.8 | 13.2 | 20.9
RS TF-IDF + US TF-IDF | 38.3 | 33.2 | 30.2
RS TF-IDF + US BM25 | 40.0 | 35.6 | 33.3
RS BM25 + US TF-IDF | 40.5 | 33.6 | 33.4
RS BM25 + US BM25 | 40.6 | 38.3 | 36.8

Table 2: Detailed analysis of the performance (dev-set) by breaking down the gold explanatory facts according to their explanatory role (2a), number of lexical overlaps with the question (2b) and inference type (2c).
Feature-based models. D'Souza et al. (2019) propose an approach based on a learning-to-rank paradigm. The model extracts a set of features based on overlaps and coherence metrics between questions and explanation sentences. These features are then given in input to a SVM ranker module. While this approach scales to the whole corpus without requiring any pre-filtering step, it is significantly outperformed by the RS + US model on both test and dev set by 16.7/17.4 MAP respectively.

4.2 Explanation Analysis

We present an ablation study with the aim of understanding the contribution of each sub-component to the general performance of the joint RS + US model (see Table 1). To this end, a detailed evaluation on the development set of the Worldtree corpus is carried out, analysing the performance in reconstructing explanations of different types and complexity. We compare the joint model (RS + US) with each individual sub-component (RS and US alone). In addition, a set of qualitative examples are analysed to provide additional insights on the complementary aspects captured by Relevance and Unification Score.

Explanatory categories. Given a question qj and its correct answer aj, we classify a fact fi belonging to the gold explanation Ej according to its explanatory role (central, grounding, lexical glue) and inference type (retrieval, inference-supporting and complex inference). In addition, three new categories are derived from the number of overlaps between fi and the concatenation of qj with aj (hj), computed by considering nouns, verbs, adjectives and adverbs (1+ overlaps, 1 overlap, 0 overlaps).

Table 2 reports the MAP score for each of the described categories. Overall, the best results are obtained by the BM25 implementation of the joint model (RS BM25 + US BM25) with a MAP score of 54.5. Specifically, RS BM25 + US BM25 achieves a significant improvement over both RS BM25 (+8.5 MAP) and US BM25 (+32.6 MAP) baselines. Regarding the explanatory roles (Table 2a), the joint TF-IDF implementation shows the best performance in the reconstruction of grounding explanations (32.7 MAP). On the other hand, a significant improvement over the RS baseline is obtained by RS BM25 + US BM25 on both lexical glues and central explanation sentences (+6.0 and +5.6 MAP over RS BM25).
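The overlap categories introduced above can be derived with a simple procedure. The sketch below is one possible reading of that definition using NLTK part-of-speech tags; the exact preprocessing used in the paper (e.g. lemmatisation or stop-word handling) is not specified, and the assumption that "1+ overlaps" denotes more than one shared term is ours.

```python
import nltk

# Assumes the NLTK tokeniser and POS tagger models have been downloaded, e.g.
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
CONTENT_TAGS = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs


def content_terms(text):
    tokens = nltk.word_tokenize(text.lower())
    return {tok for tok, tag in nltk.pos_tag(tokens) if tag.startswith(CONTENT_TAGS)}


def overlap_category(fact, hypothesis):
    """Bin a fact into the 0 / 1 / 1+ overlaps categories w.r.t. the hypothesis."""
    overlaps = len(content_terms(fact) & content_terms(hypothesis))
    if overlaps == 0:
        return "0 overlaps"
    if overlaps == 1:
        return "1 overlap"
    return "1+ overlaps"
```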
Regarding the lexical overlaps categories (Table 2b), we observe a steady improvement for all the combined RS + US models over the respective RS baselines. Notably, the US models achieve the best performance on the 0 overlaps category, which includes the most challenging facts for the RS models. The improved ability to rank abstract explanatory facts contributes to better performance for the joint models (RS + US) in the reconstruction of explanations that share few terms with question and answer (1 Overlap and 0 Overlaps categories). This characteristic leads to an improvement of 4.8 and 4.1 MAP for the RS BM25 + US BM25 model over the RS BM25 baseline.

Similar results are achieved on the inference types categories (Table 2c). Crucially, the largest improvement is observed for complex inference sentences, where RS BM25 + US BM25 outperforms RS BM25 by 12.0 MAP, confirming the decisive contribution of the Unification Score to the ranking of complex scientific facts.

[Figure 2: two panels, (a) MAP vs explanation length and (b) Precision@K.]
Figure 2: Impact of the Unification Score on semantic drift (2a) and precision (2b). RS + US (blue, solid), RS (green, dotted), US (red, dashed).

Semantic drift. Science questions in the Worldtree corpus require an average of six facts in their explanations (Jansen et al., 2016). Long explanations typically include sentences that share few terms with question and answer, increasing the probability of semantic drift. Therefore, to test the impact of the Unification Score on the robustness of the model, we measure the performance in the reconstruction of many-hops explanations.

Figure 2a shows the change in MAP score for the RS + US, RS and US models (BM25) with increasing explanation length. The fast drop in performance for the Relevance Score reflects the complexity of the task. This drop occurs because the RS model is not able to rank abstract explanatory facts. Conversely, the US model exhibits an inverse trend, with increasing performance. Short explanations, indeed, tend to include question-specific facts with low explanatory power. On the other hand, the longer the explanation, the higher the number of core scientific facts. Therefore, the decrease in MAP observed for the RS model is compensated by the Unification Score, since core scientific facts tend to form unification patterns across similar questions. These results demonstrate that the Unification Score has a crucial role in alleviating the semantic drift for the joint model (RS + US), resulting in a larger improvement on many-hops explanations (6+ facts).

Similarly, Figure 2b illustrates the Precision@K. As shown in the graph, the US model exhibits the slowest degradation in precision. Similarly to what is observed for many-hops explanations, the US score contributes to the robustness of the RS + US model, making it able to reconstruct more precise explanations. As discussed in Section 4.4, this feature has a positive impact on question answering.

[Figure 3: line plot of MAP (y-axis, roughly 45 to 56) against the number of nearest neighbours k (x-axis: k = 0, 10, 20, 40, 80, 100, 200).]
Figure 3: Impact of the k-NN clustering on the final MAP score. The value k represents the number of similar hypotheses considered for the Unification Score.

k-NN clustering. We investigate the impact of the k-NN clustering on the explanation reconstruction task. Figure 3 shows the MAP score obtained by the joint RS + US model (BM25) with different numbers k of nearest hypotheses considered for the Unification Score. The graph highlights the improvement in MAP achieved with increasing values of k. Specifically, we observe that the best MAP is obtained with k = 100. These results confirm that the explanatory power can be effectively estimated using clusters of similar hypotheses, and that the unification-based mechanism has a crucial role in improving the performance of the relevance model.

4.3 Qualitative analysis

To provide additional insights on the complementary aspects of Unification and Relevance Score, we present a set of qualitative examples from the dev-set. Table 3 illustrates the ranking assigned by the RS and RS + US models to scientific sentences of increasing complexity. The words in bold indicate lexical overlaps between question, answer and explanation sentence.

Example 1
  Question: If you bounce a rubber ball on the floor it goes up and then comes down. What causes the ball to come down? Answer: gravity
  Explanation fact: gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet
  Most similar hypotheses in Ekb: (1) A ball is tossed up in the air and it comes back down. The ball comes back down because of - gravity; (2) A student drops a ball. Which force causes the ball to fall to the ground? - gravity
  Rank: RS #36, RS + US #2 (↑34)

Example 2
  Question: Which animals would most likely be helped by flood in a coastal area? Answer: alligators
  Explanation fact: as water increases in an environment, the population of aquatic animals will increase
  Most similar hypotheses in Ekb: (1) Where would animals and plants be most affected by a flood? - low areas; (2) Which change would most likely increase the number of salamanders? - flood
  Rank: RS #198, RS + US #57 (↑141)

Example 3
  Question: What is an example of a force producing heat? Answer: two sticks getting warm when rubbed together
  Explanation fact: friction causes the temperature of an object to increase
  Most similar hypotheses in Ekb: (1) Rubbing sandpaper on a piece of wood produces what two types of energy? - sound and heat; (2) Which force produces energy as heat? - friction
  Rank: RS #1472, RS + US #21 (↑1451)

Table 3: Impact of the Unification Score on the ranking of scientific facts with increasing complexity.
In the first example, the sentence "gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet" shares key terms with question and candidate answer and is therefore relatively easy to rank for the RS model (#36). Nevertheless, the RS + US model is able to improve the ranking by 34 positions (#2), as the gravitational law represents a scientific pattern with high explanatory unification, frequently reused across similar questions. The impact of the Unification Score is more evident when considering abstract explanatory facts. Coming back to our original example (i.e. "What is an example of a force producing heat?"), the fact "friction causes the temperature of an object to increase" has no significant overlaps with question and answer. Thus, the RS model ranks the gold explanation sentence in a low position (#1472). However, the Unification Score (US) is able to capture the explanatory power of the fact from similar hypotheses in Ekb, pushing the RS + US ranking up to position #21 (+1451).

4.4 Question Answering

To understand whether the constructed explanations can support question answering, we compare the performance of BERT for multiple-choice QA (Devlin et al., 2019) without explanations with the performance of BERT provided with the top K explanation sentences retrieved by the RS and RS + US models (BM25). BERT without explanations operates on question and candidate answer only. On the other hand, BERT with explanation receives the following input: the question (q), a candidate answer (ci) and the explanation for ci (Ei). In this setting, the model is fine-tuned for binary classification (bertb) to predict a set of probability scores P = {p1, p2, ..., pn} for the candidate answers in C = {c1, c2, ..., cn}:

bertb([CLS] || q || ci || [SEP] || Ei) = pi    (4)

The binary classifier operates on the final hidden state corresponding to the [CLS] token. To answer the question q, the model selects the candidate answer ca such that a = argmax_i pi.
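One possible realisation of Equation 4 with the Hugging Face transformers library is sketched below. It is not the authors' training code: the checkpoint name and maximum sequence length are assumptions, and in practice the classifier is first fine-tuned on the training questions (with the hyperparameters listed in the appendix) before its probabilities are used to select an answer.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()  # shown for scoring only; the paper fine-tunes this classifier first


def answer_question(question, candidates, explanations):
    """Score each candidate with [CLS] q || c_i [SEP] E_i (Equation 4) and
    return the candidate with the highest positive-class probability."""
    probabilities = []
    for candidate, explanation_facts in zip(candidates, explanations):
        inputs = tokenizer(
            question + " " + candidate,     # first segment: q || c_i
            " ".join(explanation_facts),    # second segment: the top-K retrieved facts E_i
            return_tensors="pt",
            truncation=True,
            max_length=256,
        )
        with torch.no_grad():
            logits = model(**inputs).logits
        probabilities.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return candidates[probabilities.index(max(probabilities))]
```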
Table 4 reports the accuracy with and without explanations on the Worldtree test-set for easy and challenge questions (Clark et al., 2018).

Model | Easy | Challenge | Overall
BERT (no explanation) | 48.54 | 26.28 | 41.78
BERT + RS (K = 3) | 53.20 | 40.97 | 49.39
BERT + RS (K = 5) | 54.36 | 38.14 | 49.31
BERT + RS (K = 10) | 32.71 | 29.63 | 31.75
BERT + RS + US (K = 3) | 55.46 | 41.97 | 51.62
BERT + RS + US (K = 5) | 54.48 | 39.43 | 50.12
BERT + RS + US (K = 10) | 48.66 | 37.37 | 45.14

Table 4: Performance (accuracy) of BERT on question answering (test-set) with and without the explanation reconstruction models.

Notably, a significant improvement in accuracy can be observed when BERT is provided with explanations retrieved by the reconstruction modules (+9.84% accuracy with the RS BM25 + US BM25 model). The improvement is consistent on the easy split (+6.92%) and particularly significant for challenge questions (+15.69%). Overall, we observe a correlation between more precise explanations and accuracy in answer prediction, with BERT + RS being outperformed by BERT + RS + US for each value of K. The decrease in accuracy occurring with increasing values of K is coherent with the drop in precision for the models observed in Figure 2b. Moreover, steadier results adopting the RS + US model suggest a positive contribution from abstract explanatory facts. Additional investigation of this aspect will be a focus for future work.

5 Conclusion

This paper proposed a novel framework for multi-hop explanation reconstruction based on explanatory unification. An extensive evaluation on the Worldtree corpus led to the following conclusions: (1) The approach is competitive with state-of-the-art Transformers, yet significantly faster and inherently scalable; (2) The unification-based mechanism supports the construction of complex and many-hops explanations; (3) The constructed explanations improve the accuracy of BERT for question answering by up to 10% overall. As future work, we plan to extend the framework adopting neural embeddings for sentence representation.

Acknowledgements

The authors would like to thank the anonymous reviewers for the constructive feedback. A special thanks to Deborah Ferreira for the helpful discussions, and to the members of the AI Systems lab from the University of Manchester. Additionally, we would like to thank the Computational Shared Facility of the University of Manchester for providing the infrastructure to run our experiments.
References

Pratyay Banerjee. 2019. ASU at TextGraphs 2019 shared task: Explanation regeneration using language models and iterative re-ranking. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 78–84.

Or Biran and Courtenay Cotton. 2017. Explanation and justification in machine learning: A survey. In IJCAI-17 Workshop on Explainable AI (XAI), volume 8.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pages 9539–9549.

Yew Ken Chia, Sam Witteveen, and Martin Andrews. 2019. Red Dragon AI at TextGraphs 2019 shared task: Language model assisted explanation generation. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 85–89.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

Rajarshi Das, Ameya Godbole, Manzil Zaheer, Shehzaad Dhuliawala, and Andrew McCallum. 2019. Chains-of-reasoning at TextGraphs 2019 shared task: Reasoning over chains of facts for explainable multi-hop inference. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 101–117.

Ramon Lopez De Mantaras, David McSherry, Derek Bridge, David Leake, Barry Smyth, Susan Craw, Boi Faltings, Mary Lou Maher, Michael T. Cox, Kenneth Forbus, et al. 2005. Retrieval, reuse, revision and retention in case-based reasoning. The Knowledge Engineering Review, 20(3):215–240.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jennifer D'Souza, Isaiah Onando Mulang, and Sören Auer. 2019. Team SVMrank: Leveraging feature-rich support vector machines for ranking explanations to elementary science questions. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 90–100.

Deborah Ferreira and André Freitas. 2020a. Natural language premise selection: Finding supporting statements for mathematical text. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2175–2182.

Deborah Ferreira and André Freitas. 2020b. Premise selection in natural language mathematical texts. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7365–7374.

Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mihai Surdeanu, and Peter Clark. 2015. Higher-order lexical semantic models for non-factoid answer reranking. Transactions of the Association for Computational Linguistics, 3:197–210.

Michael Friedman. 1974. Explanation and scientific understanding. The Journal of Philosophy, 71(1):5–19.

Carl G. Hempel. 1965. Aspects of scientific explanation.

Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark. 2016. What's in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2956–2965.

Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Peter Clark. 2017. Framing QA as building and ranking intersentence answer justifications. Computational Linguistics, 43(2):407–449.

Peter Jansen and Dmitry Ustalov. 2019. TextGraphs 2019 shared task on multi-hop inference for explanation regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 63–77.

Peter Jansen, Elizabeth Wainwright, Steven Marmorstein, and Clayton Morrison. 2018. Worldtree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2019. On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018. Question answering as global reasoning over semantic abstractions. In Thirty-Second AAAI Conference on Artificial Intelligence.
Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2019. QASC: A dataset for question answering via sentence composition. arXiv preprint arXiv:1910.11473.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316.

Philip Kitcher. 1981. Explanatory unification. Philosophy of Science, 48(4):507–531.

Philip Kitcher. 1989. Explanatory unification and the causal structure of the world.

Janet Kolodner. 2014. Case-based reasoning. Morgan Kaufmann.

Carmen Lacave and Francisco J. Diez. 2004. A review of explanation methods for heuristic expert systems. The Knowledge Engineering Review, 19(2):133–146.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.

Tom M. Mitchell, Richard M. Keller, and Smadar T. Kedar-Cabelli. 1986. Explanation-based generalization: A unifying view. Machine Learning, 1(1):47–80.

Judea Pearl. 2009. Causality. Cambridge University Press.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.

Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.

Wesley C. Salmon. 1984. Scientific explanation and the causal structure of the world. Princeton University Press.

Frode Sørmo, Jörg Cassens, and Agnar Aamodt. 2005. Explanation in case-based reasoning – perspectives and goals. Artificial Intelligence Review, 24(2):109–143.

Paul Thagard. 1992. Analogy, explanation, and education. Journal of Research in Science Teaching, 29(6):537–544.

Paul Thagard and Abninder Litt. 2008. Models of scientific explanation. The Cambridge Handbook of Computational Psychology, pages 549–564.

Mokanarangan Thayaparan, Marco Valentino, and André Freitas. 2020. A survey on explainability in machine reading comprehension. arXiv preprint arXiv:2010.00389.

Mokanarangan Thayaparan, Marco Valentino, Viktor Schlegel, and André Freitas. 2019. Identifying supporting facts for multi-hop question answering with document graph networks. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 42–51.

Michael R. Wick and William B. Thompson. 1992. Reconstructive expert system explanation. Artificial Intelligence, 54(1-2):33–70.

James Woodward. 2005. Making things happen: A theory of causal explanation. Oxford University Press.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2019. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2578–2589.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

A Supplementary Material

A.1 Hyperparameters tuning

The hyperparameters of the model have been tuned manually. The criterion for the optimisation is the maximisation of the MAP score on the dev-set. Here, we report the values adopted for the experiments described in the paper.
The Unification-based Reconstruction adopts two hyperparameters. Specifically, λ1 is the weight assigned to the relevance score in Equation 1, while k is the number of similar hypotheses to consider for the calculation of the unification score (Equation 2).
The values adopted for these parameters are as follows:
1. λ1 = 0.83 (1 − λ1 = 0.17)
2. k = 100

A.2 BERT model

For question answering we adopt a BERT-BASE model. The model is implemented using PyTorch (https://pytorch.org/) and fine-tuned using 4 Tesla 16GB V100 GPUs for 10 epochs in total, with batch size 32 and seed 42. The hyperparameters adopted for BERT are as follows:
• gradient accumulation steps = 1
• learning rate = 5e-5
• weight decay = 0.0
• adam epsilon = 1e-8
• warmup steps = 0
• max grad norm = 1.0

A.3 Data and code

The experiments are carried out on the TextGraphs 2019 version (https://github.com/umanlp/tg2019task) of the Worldtree corpus. The full dataset can be downloaded at the following URL: http://cognitiveai.org/dist/worldtree_corpus_textgraphs2019sharedtask_withgraphvis.zip.
The code to reproduce the experiments described in the paper is available at the following URL: https://github.com/ai-systems/unification_reconstruction_explanations
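For convenience, the values reported in A.1 and A.2 can be collected into a single configuration. The snippet below only restates the numbers above; the uncased BERT-base checkpoint name is an assumption, since the paper does not state which variant of BERT-BASE is used.

```python
# Hyperparameters reported in A.1 (Unification-based Reconstruction) and A.2 (BERT).
UNIFICATION_CONFIG = {
    "lambda_1": 0.83,  # weight of rs(h_j, f_i) in Equation 1; 1 - lambda_1 = 0.17
    "k": 100,          # nearest hypotheses considered for the Unification Score (Equation 2)
}

BERT_CONFIG = {
    "model_name": "bert-base-uncased",  # assumption: uncased BERT-base variant
    "epochs": 10,
    "batch_size": 32,
    "seed": 42,
    "gradient_accumulation_steps": 1,
    "learning_rate": 5e-5,
    "weight_decay": 0.0,
    "adam_epsilon": 1e-8,
    "warmup_steps": 0,
    "max_grad_norm": 1.0,
}
```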