Unification-based Reconstruction of Multi-hop Explanations for Science Questions
Marco Valentino∗†, Mokanarangan Thayaparan∗†, André Freitas†‡
Department of Computer Science, University of Manchester, United Kingdom†
Idiap Research Institute, Switzerland‡
{marco.valentino, mokanarangan.thayaparan, andre.freitas}@manchester.ac.uk
∗ Equal contribution.
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 200–211, April 19–23, 2021. ©2021 Association for Computational Linguistics.

Abstract

This paper presents a novel framework for reconstructing multi-hop explanations in science Question Answering (QA). While existing approaches for multi-hop reasoning build explanations considering each question in isolation, we propose a method to leverage explanatory patterns emerging in a corpus of scientific explanations. Specifically, the framework ranks a set of atomic facts by integrating lexical relevance with the notion of unification power, estimated by analysing explanations for similar questions in the corpus.
An extensive evaluation is performed on the Worldtree corpus, integrating k-NN clustering and Information Retrieval (IR) techniques. We present the following conclusions: (1) The proposed method achieves results competitive with Transformers, yet being orders of magnitude faster, a feature that makes it scalable to large explanatory corpora; (2) The unification-based mechanism has a key role in reducing semantic drift, contributing to the reconstruction of many-hops explanations (6 or more facts) and the ranking of complex inference facts (+12.0 Mean Average Precision); (3) Crucially, the constructed explanations can support downstream QA models, improving the accuracy of BERT by up to 10% overall.

1 Introduction

Answering multiple-choice science questions has become an established benchmark for testing natural language understanding and complex reasoning in Question Answering (QA) (Khot et al., 2019; Clark et al., 2018; Mihaylov et al., 2018). In parallel with other NLP research areas, a crucial requirement emerging in recent years is explainability (Thayaparan et al., 2020; Miller, 2019; Biran and Cotton, 2017; Ribeiro et al., 2016). To boost automatic methods of inference, it is necessary not only to measure the performance on answer prediction, but also the ability of a QA system to provide explanations for the underlying reasoning process.

The need for explainability and a quantitative methodology for its evaluation have led to the creation of shared tasks on explanation reconstruction (Jansen and Ustalov, 2019) using corpora of explanations such as Worldtree (Jansen et al., 2018, 2016). Given a science question, explanation reconstruction consists in regenerating the gold explanation that supports the correct answer through the combination of a series of atomic facts. While most of the existing benchmarks for multi-hop QA require the composition of only 2 supporting sentences or paragraphs (e.g. QASC (Khot et al., 2019), HotpotQA (Yang et al., 2018)), the explanation reconstruction task requires the aggregation of an average of 6 facts (and as many as ≈20), making it particularly hard for multi-hop reasoning models.

Moreover, the structure of the explanations affects the complexity of the reconstruction task. Explanations for science questions are typically composed of two main parts: a grounding part, containing knowledge about concrete concepts in the question, and a core scientific part, including general scientific statements and laws.

Consider the following question and answer pair from Worldtree (Jansen et al., 2018):
• q: what is an example of a force producing heat?
  a: two sticks getting warm when rubbed together.
An explanation that justifies a is composed using the following sentences from the corpus: (f1) a stick is a kind of object; (f2) to rub together means to move against; (f3) friction is a kind of force; (f4) friction occurs when two objects' surfaces move against each other; (f5) friction causes the temperature of an object to increase.
The explanation contains a set of concrete sentences that are conceptually connected with q and a (f1, f2 and f3), along with a set of abstract facts that require multi-hop inference (f4 and f5).

Previous work has shown that constructing long explanations is challenging due to semantic drift – i.e. the tendency of composing out-of-context inference chains as the number of hops increases (Khashabi et al., 2019; Fried et al., 2015). While existing approaches build explanations considering each question in isolation (Khashabi et al., 2018; Khot et al., 2017), we hypothesise that semantic drift can be tackled by leveraging explanatory patterns emerging in clusters of similar questions.

In Science, a given statement is considered explanatory to the extent it performs unification (Friedman, 1974; Kitcher, 1981, 1989), that is, showing how a set of initially disconnected phenomena are the expression of the same regularity. An example of unification is Newton's law of universal gravitation, which unifies the motion of planets and falling bodies on Earth by showing that all bodies with mass obey the same law. Since the explanatory power of a given statement depends on the number of unified phenomena, highly explanatory facts tend to create unification patterns – i.e. similar phenomena require similar explanations. Coming back to our example, we hypothesise that the relevance of abstract statements requiring multi-hop inference, such as f4 ("friction occurs when two objects' surfaces move against each other"), can be estimated by taking into account their unification power.

Following these observations, we present a framework that ranks atomic facts through the combination of two scoring functions:
• A Relevance Score (RS) that represents the lexical relevance of a given fact.
• A Unification Score (US) that models the explanatory power of a fact according to its frequency in explanations for similar questions.

An extensive evaluation is performed on the Worldtree corpus (Jansen et al., 2018; Jansen and Ustalov, 2019), adopting a combination of k-NN clustering and Information Retrieval (IR) techniques. We present the following conclusions:
1. Despite its simplicity, the proposed method achieves results competitive with Transformers (Das et al., 2019; Chia et al., 2019), yet being orders of magnitude faster, a feature that makes it scalable to large explanatory corpora.
2. We empirically demonstrate the key role of the unification-based mechanism in the reconstruction of many-hops explanations (6 or more facts) and explanations requiring complex inference (+12.0 Mean Average Precision).
3. Crucially, the constructed explanations can support downstream question answering models, improving the accuracy of BERT (Devlin et al., 2019) by up to 10% overall.

To the best of our knowledge, we are the first to propose a method that leverages unification patterns for the reconstruction of multi-hop explanations, and to empirically demonstrate their impact on semantic drift and downstream question answering.

2 Related Work

Explanations for Science Questions. Reconstructing explanations for science questions can be reduced to a multi-hop inference problem, where multiple pieces of evidence have to be aggregated to arrive at the final answer (Thayaparan et al., 2020; Khashabi et al., 2018; Khot et al., 2017; Jansen et al., 2017). Aggregation methods based on lexical overlaps and explicit constraints suffer from semantic drift (Khashabi et al., 2019; Fried et al., 2015) – i.e. the tendency of composing spurious inference chains leading to wrong conclusions.

One way to contain semantic drift is to leverage common explanatory patterns in explanation-centred corpora (Jansen et al., 2018). Transformers (Das et al., 2019; Chia et al., 2019) represent the state-of-the-art for explanation reconstruction in this setting (Jansen and Ustalov, 2019). However, these models require high computational resources that prevent their applicability to large corpora. On the other hand, approaches based on IR techniques are readily scalable. The approach described in this paper preserves the scalability of IR methods, obtaining, at the same time, performances competitive with Transformers. Thanks to this feature, the framework can be flexibly applied in combination with downstream question answering models.

Our findings are in line with previous work in different QA settings (Rajani et al., 2019; Yadav et al., 2019), which highlights the positive impact of explanations and supporting facts on the final answer prediction task.
In parallel with Science QA, the development of models for explanation generation is being explored in different NLP tasks, ranging from open domain question answering (Yang et al., 2018; Thayaparan et al., 2019), to textual entailment (Camburu et al., 2018) and natural language premise selection (Ferreira and Freitas, 2020b,a).

Scientific Explanation and AI. The field of Artificial Intelligence has been historically inspired by models of explanation in Philosophy of Science (Thagard and Litt, 2008). The deductive-nomological model proposed by Hempel (Hempel, 1965) constitutes the philosophical foundation for explainable models based on logical deduction, such as Expert Systems (Lacave and Diez, 2004; Wick and Thompson, 1992) and Explanation-based Learning (Mitchell et al., 1986). Similarly, the inherent relation between explanation and causality (Woodward, 2005; Salmon, 1984) has inspired computational models of causal inference (Pearl, 2009). The view of explanation as unification (Friedman, 1974; Kitcher, 1981, 1989) is closely related to Case-based reasoning (Kolodner, 2014; Sørmo et al., 2005; De Mantaras et al., 2005). In this context, analogical reasoning plays a key role in the process of reusing abstract patterns for explaining new phenomena (Thagard, 1992). Similarly to our approach, Case-based reasoning applies this insight to construct solutions for novel problems by retrieving, reusing and adapting explanations for known cases solved in the past.

3 Explanation Reconstruction as a Ranking Problem

A multiple-choice science question Q = {q, C} is a tuple composed of a question q and a set of candidate answers C = {c1, c2, ..., cn}. Given an hypothesis hj, defined as the concatenation of q with a candidate answer cj ∈ C, the task of explanation reconstruction consists in selecting a set of atomic facts from a knowledge base, Ej = {f1, f2, ..., fn}, that support and justify hj.

In this paper, we adopt a methodology that relies on the existence of a corpus of explanations. A corpus of explanations is composed of two distinct knowledge sources:
• A primary knowledge base, Facts KB (Fkb), defined as a collection of sentences Fkb = {f1, f2, ..., fn} encoding the general world knowledge necessary to answer and explain science questions. A fundamental and desirable characteristic of Fkb is reusability – i.e. each of its facts fi can be potentially reused to compose explanations for multiple questions.
• A secondary knowledge base, Explanation KB (Ekb), consisting of a set of tuples Ekb = {(h1, E1), (h2, E2), ..., (hm, Em)}, each of them connecting a true hypothesis hj to its corresponding explanation Ej = {f1, f2, ..., fk} ⊆ Fkb. An explanation Ej ∈ Ekb is therefore a composition of facts belonging to Fkb.

In this setting, the explanation reconstruction task for an unseen hypothesis hj can be modelled as a ranking problem (Jansen and Ustalov, 2019). Specifically, given an hypothesis hj, the algorithm to solve the task is divided into three macro steps:
1. Computing an explanatory score si = e(hj, fi) for each fact fi ∈ Fkb with respect to hj;
2. Producing an ordered set Rank(hj) = {f1, ..., fk, fk+1, ..., fn | sk ≥ sk+1} ⊆ Fkb;
3. Selecting the top k elements belonging to Rank(hj) and interpreting them as an explanation for hj: Ej = topK(Rank(hj)).

3.1 Modelling Explanatory Relevance

We present an approach for modelling e(hj, fi) that is guided by the following research hypotheses:
• RH1: Scientific explanations are composed of a set of concrete facts connected to the question, and a set of abstract statements expressing general scientific laws and regularities.
• RH2: Concrete facts tend to share key concepts with the question and can therefore be effectively ranked by IR techniques based on lexical relevance.
• RH3: General scientific statements tend to be abstract and therefore difficult to rank by means of shared concepts. However, due to explanatory unification, core scientific facts tend to be frequently reused across similar questions. We hypothesise that the explanatory power of a fact fi for a given hypothesis hj is proportional to the number of times fi explains similar hypotheses.
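Before specifying e(hj, fi), it is worth making the ranking procedure itself concrete. The following sketch is illustrative only and is not the released implementation: Fkb and Ekb are simplified to plain Python lists, and the scoring function is left as a parameter, to be instantiated by the Relevance and Unification Scores introduced below.

```python
from typing import Callable, List, Tuple

# Illustrative, simplified data structures (not the released implementation):
# Fkb is a list of atomic sentences; Ekb is a list of (hypothesis, explanation) tuples.
FactsKB = List[str]
ExplanationKB = List[Tuple[str, List[str]]]


def reconstruct_explanation(
    hypothesis: str,
    facts_kb: FactsKB,
    explanatory_score: Callable[[str, str], float],
    top_k: int = 10,
) -> List[str]:
    """The three macro steps: score every fact, rank, and keep the top k."""
    # 1. Compute an explanatory score s_i = e(h_j, f_i) for each fact f_i in Fkb
    scores = {fact: explanatory_score(hypothesis, fact) for fact in facts_kb}
    # 2. Produce Rank(h_j): facts ordered by decreasing score
    ranked = sorted(facts_kb, key=lambda fact: scores[fact], reverse=True)
    # 3. Interpret the top k elements as the explanation E_j
    return ranked[:top_k]
```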
[Figure 1 (overview diagram): the test hypothesis hj ("What is an example of a force producing heat? Two sticks getting warm when rubbed together") is compared against the hypotheses hz stored in Ekb to compute sim(hj, hz) and the Unification Score us(hj, fi), and against the facts fi stored in Fkb to compute the Relevance Score rs(hj, fi); the two scores are combined to produce the final explanation ranking.]
Figure 1: Overview of the Unification-based framework for explanation reconstruction.

To formalise these research hypotheses, we model the explanatory scoring function e(hj, fi) as a combination of two components:

e(hj, fi) = λ1 · rs(hj, fi) + (1 − λ1) · us(hj, fi)    (1)

Here, rs(hj, fi) represents a lexical Relevance Score (RS) assigned to fi ∈ Fkb with respect to hj, while us(hj, fi) represents the Unification Score (US) of fi computed over Ekb as follows:

us(hj, fi) = Σ_{(hz, Ez) ∈ kNN(hj)} sim(hj, hz) · in(fi, Ez)    (2)

in(fi, Ez) = 1 if fi ∈ Ez, 0 otherwise    (3)

kNN(hj) = {(h1, E1), ..., (hk, Ek)} ⊆ Ekb is the set of k-nearest neighbours of hj belonging to Ekb, retrieved according to a similarity function sim(hj, hz). On the other hand, in(fi, Ez) verifies whether the fact fi belongs to the explanation Ez for the hypothesis hz.

In the formulation of Equation 2 we aim to capture two main aspects related to our research hypotheses:
1. The more a fact fi is reused for explanations in Ekb, the higher its explanatory power and therefore its Unification Score;
2. The Unification Score of a fact fi is proportional to the similarity between the hypotheses in Ekb that are explained by fi and the unseen hypothesis (hj) we want to explain.
Figure 1 shows a schematic representation of the Unification-based framework.
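Equations 1–3 translate almost directly into code. The sketch below is a simplified approximation of the approach, not the released system: it uses scikit-learn TF-IDF vectors with cosine similarity both for the Relevance Score rs(hj, fi) and for the hypothesis similarity sim(hj, hz) (the paper also evaluates BM25 variants, which are omitted here), and it plugs in the hyperparameter values λ1 = 0.83 and k = 100 reported in the appendix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class UnificationReranker:
    """Joint Relevance + Unification scoring (Equations 1-3), TF-IDF variant."""

    def __init__(self, facts_kb, explanation_kb, lambda1=0.83, k=100):
        # facts_kb: list of sentences (Fkb)
        # explanation_kb: list of (hypothesis, list of explanation facts) tuples (Ekb);
        # explanation facts are assumed to be the same strings used in facts_kb
        self.facts_kb = facts_kb
        self.hypotheses = [h for h, _ in explanation_kb]
        self.explanations = [set(facts) for _, facts in explanation_kb]
        self.lambda1, self.k = lambda1, k
        self.vectorizer = TfidfVectorizer().fit(facts_kb + self.hypotheses)
        self.fact_vectors = self.vectorizer.transform(facts_kb)
        self.hyp_vectors = self.vectorizer.transform(self.hypotheses)

    def rank(self, hypothesis):
        h_vec = self.vectorizer.transform([hypothesis])
        # Relevance Score rs(h_j, f_i): lexical similarity between hypothesis and fact
        rs = cosine_similarity(h_vec, self.fact_vectors)[0]
        # kNN(h_j): the k hypotheses in Ekb most similar to the unseen hypothesis
        sim = cosine_similarity(h_vec, self.hyp_vectors)[0]
        neighbours = np.argsort(-sim)[: self.k]
        # Unification Score us(h_j, f_i), Equation 2: sum of sim(h_j, h_z) over the
        # neighbours whose gold explanation E_z contains f_i (the in(f_i, E_z) test, Eq. 3)
        us = np.array([
            sum(sim[z] for z in neighbours if fact in self.explanations[z])
            for fact in self.facts_kb
        ])
        # Equation 1: convex combination of the two scores
        scores = self.lambda1 * rs + (1 - self.lambda1) * us
        return [self.facts_kb[i] for i in np.argsort(-scores)]
```

The quadratic loop computing the Unification Score is kept for readability; in practice the sum only needs to touch the facts that actually occur in the k retrieved explanations.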
4 Empirical Evaluation

We carried out an empirical evaluation on the Worldtree corpus (Jansen et al., 2018), a subset of the ARC dataset (Clark et al., 2018) that includes explanations for science questions. The corpus provides an explanatory knowledge base (Fkb and Ekb) where each explanation in Ekb is represented as a set of lexically connected sentences describing how to arrive at the correct answer. The science questions in the Worldtree corpus are split into training-set, dev-set, and test-set. The gold explanations in the training-set are used to form the Explanation KB (Ekb), while the gold explanations in dev and test set are used for evaluation purposes only. The corpus groups the explanation sentences belonging to Ekb into three explanatory roles: grounding, central and lexical glue.

Consider the example in Figure 1. To support q and cj, the system has to retrieve the scientific facts describing how friction occurs and produces heat across objects. The corpus classifies these facts (f3, f4) as central. Grounding explanations like "stick is a kind of object" (f1) link question and answer to the central explanations. Lexical glues such as "to rub; to rub together means to move against" (f2) are used to fill lexical gaps between sentences. Additionally, the corpus divides the facts belonging to Fkb into three inference categories: retrieval type, inference supporting type, and complex inference type. Taxonomic knowledge and properties such as "stick is a kind of object" (f1) and "friction is a kind of force" (f5) are classified as retrieval type. Facts describing actions, affordances, and requirements such as "friction occurs when two objects' surfaces move against each other" (f3) are grouped under the inference supporting type. Knowledge about causality, description of processes and if-then conditions such as "friction causes the temperature of an object to increase" (f4) is classified as complex inference.

We implement the Relevance and Unification Score adopting TF-IDF/BM25 vectors and the cosine similarity function (i.e. 1 − cos(x, y)). Specifically, the RS model uses TF-IDF/BM25 to compute the relevance function for each fact in Fkb (i.e. the rs(hj, fi) function in Equation 1), while the US model adopts TF-IDF/BM25 to assign similarity scores to the hypotheses in Ekb (i.e. the sim(hj, hz) function in Equation 2). For reproducibility, the code is available at the following url: https://github.com/ai-systems/unification_reconstruction_explanations. Additional details can be found in the supplementary material.

4.1 Explanation Reconstruction

In line with the shared task (Jansen and Ustalov, 2019), the performances of the models are evaluated via Mean Average Precision (MAP) of the explanation ranking produced for a given question qj and its correct answer aj.

Table 1 illustrates the score achieved by our best implementation compared to state-of-the-art approaches in the literature. Previous approaches are grouped into four categories: Transformers, Information Retrieval with re-ranking, One-step Information Retrieval, and Feature-based models.

Model | Approach | Trained | Test | Dev
Transformers
  Das et al. (2019) | BERT re-ranking with inference chains | Yes | 56.3 | 58.5
  Chia et al. (2019) | BERT re-ranking with gold IR scores | Yes | 47.7 | 50.9
  Banerjee (2019) | BERT iterative re-ranking | Yes | 41.3 | 42.3
IR with re-ranking
  Chia et al. (2019) | Iterative BM25 | No | 45.8 | 49.7
One-step IR
  BM25 | BM25 Relevance Score | No | 43.0 | 46.1
  TF-IDF | TF-IDF Relevance Score | No | 39.4 | 42.8
Feature-based
  D'Souza et al. (2019) | Feature-rich SVM ranking + Rules | Yes | 39.4 | 44.4
  D'Souza et al. (2019) | Feature-rich SVM ranking | Yes | 34.1 | 37.1
Unification-based Reconstruction
  RS + US (Best) | Joint Relevance and Unification Score | No | 50.8 | 54.5
  US (Best) | Unification Score | No | 22.9 | 21.9

Table 1: Results (MAP) on test and dev set and comparison with state-of-the-art approaches. The column Trained indicates whether the model requires an explicit training session on the explanation reconstruction task.

Transformers. This class of approaches employs the gold explanations in the corpus to train a BERT language model (Devlin et al., 2019). The best-performing system (Das et al., 2019) adopts a multi-step retrieval strategy. In the first step, it returns the top K sentences ranked by a TF-IDF model. In the second step, BERT is used to re-rank the paths composed of all the facts that are within 1-hop from the first retrieved set. Similarly, other approaches adopt BERT to re-rank each fact individually (Banerjee, 2019; Chia et al., 2019).
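All the comparisons in this section are expressed in terms of MAP, and Section 4.2 additionally reports Precision@K (Figure 2b). The sketch below is a standard implementation of these ranking metrics, given for illustration; it may differ in minor details (e.g. tie handling) from the official shared-task evaluation script.

```python
def average_precision(ranked_facts, gold_explanation):
    """Average Precision of a ranked fact list against the gold explanation set."""
    gold = set(gold_explanation)
    hits, precision_sum = 0, 0.0
    for rank, fact in enumerate(ranked_facts, start=1):
        if fact in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold) if gold else 0.0


def mean_average_precision(all_rankings, all_gold_explanations):
    """MAP over all questions (parallel lists of rankings and gold explanations)."""
    scores = [average_precision(r, g) for r, g in zip(all_rankings, all_gold_explanations)]
    return sum(scores) / len(scores)


def precision_at_k(ranked_facts, gold_explanation, k):
    """Fraction of the top-k ranked facts that belong to the gold explanation."""
    gold = set(gold_explanation)
    return sum(1 for fact in ranked_facts[:k] if fact in gold) / k
```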
Although the best model achieves state-of-the-art results in explanation reconstruction, these approaches are computationally expensive, being limited by the application of a pre-filtering step to contain the space of candidate facts. Consequently, these systems do not scale with the size of the corpus. We estimated that the best performing model (Das et al., 2019) takes ≈ 10 hours to run on the whole test set (1240 questions) using 1 Tesla 16GB V100 GPU.

Comparatively, our model constructs explanations for all the questions in the test set in ≈ 30 seconds, without requiring the use of GPUs (< 1 second per question). This feature makes the Unification-based Reconstruction suitable for large corpora and downstream question answering models (as shown in Section 4.4). Moreover, our approach does not require any explicit training session on the explanation regeneration task, with a significantly reduced number of parameters to tune. Along with scalability, the proposed approach achieves nearly state-of-the-art results (50.8/54.5 MAP). Although we observe lower performance when compared to the best-performing approach (-5.5/-4.0 MAP), the joint RS + US model outperforms two BERT-based models (Chia et al., 2019; Banerjee, 2019) on both test and dev set by 3.1/3.6 and 9.5/12.2 MAP respectively.

Information Retrieval with re-ranking. Chia et al. (2019) describe a multi-step, iterative re-ranking model based on BM25. The first step consists in retrieving the explanation sentence that is most similar to the question adopting BM25 vectors. During the second step, the BM25 vector of the question is updated by aggregating it with the retrieved explanation sentence vector through a max operation. The first and second steps are repeated K times. Although this approach uses scalable IR techniques, it relies on a multi-step retrieval strategy. Besides, the RS + US model outperforms this approach on both test and dev set by 5.0/4.8 MAP respectively.

One-step Information Retrieval. We compare the RS + US model with two IR baselines. The baselines adopt TF-IDF and BM25 to compute the Relevance Score only – i.e. the us(hj, fi) term in Equation 1 is set to 0 for each fact fi ∈ Fkb. In line with previous IR literature (Robertson et al., 2009), BM25 leads to better performance than TF-IDF. While these approaches share similar characteristics, the combined RS + US model outperforms both RS BM25 and RS TF-IDF on test and dev-set by 7.8/8.4 and 11.4/11.7 MAP. Moreover, the joint RS + US model improves the performance of the US model alone by 27.9/32.6 MAP. These results outline the complementary aspects of Relevance and Unification Score. We provide a detailed analysis by performing an ablation study on the dev-set (Section 4.2).

(a) Explanatory roles (MAP)
Model | All | Central | Grounding | Lexical Glue
RS TF-IDF | 42.8 | 43.4 | 25.4 | 8.2
RS BM25 | 46.1 | 46.6 | 23.3 | 10.7
US TF-IDF | 21.6 | 16.9 | 22.0 | 13.4
US BM25 | 21.9 | 18.1 | 16.7 | 15.0
RS TF-IDF + US TF-IDF | 48.5 | 46.4 | 32.7 | 11.7
RS TF-IDF + US BM25 | 50.7 | 48.6 | 30.42 | 13.4
RS BM25 + US TF-IDF | 51.9 | 48.2 | 31.7 | 14.8
RS BM25 + US BM25 | 54.5 | 51.7 | 27.3 | 16.7

(b) Lexical overlaps with the hypothesis (MAP)
Model | 1+ Overlaps | 1 Overlap | 0 Overlaps
RS TF-IDF | 57.2 | 33.6 | 7.1
RS BM25 | 62.2 | 37.1 | 7.1
US TF-IDF | 17.37 | 18.0 | 12.5
US BM25 | 18.1 | 18.1 | 13.1
RS TF-IDF + US TF-IDF | 60.2 | 38.4 | 9.0
RS TF-IDF + US BM25 | 62.5 | 39.5 | 9.6
RS BM25 + US TF-IDF | 61.3 | 40.6 | 11.0
RS BM25 + US BM25 | 64.8 | 41.9 | 11.2

(c) Inference types (MAP)
Model | Retrieval | Inference-supporting | Complex Inference
RS TF-IDF | 33.5 | 34.7 | 21.8
RS BM25 | 36.0 | 36.1 | 24.8
US TF-IDF | 17.6 | 12.8 | 19.5
US BM25 | 16.8 | 13.2 | 20.9
RS TF-IDF + US TF-IDF | 38.3 | 33.2 | 30.2
RS TF-IDF + US BM25 | 40.0 | 35.6 | 33.3
RS BM25 + US TF-IDF | 40.5 | 33.6 | 33.4
RS BM25 + US BM25 | 40.6 | 38.3 | 36.8

Table 2: Detailed analysis of the performance (dev-set) by breaking down the gold explanatory facts according to their explanatory role (2a), number of lexical overlaps with the question (2b) and inference type (2c).
Feature-based models. D'Souza et al. (2019) propose an approach based on a learning-to-rank paradigm. The model extracts a set of features based on overlaps and coherence metrics between questions and explanation sentences. These features are then given in input to a SVM ranker module. While this approach scales to the whole corpus without requiring any pre-filtering step, it is significantly outperformed by the RS + US model on both test and dev set by 16.7/17.4 MAP respectively.

4.2 Explanation Analysis

We present an ablation study with the aim of understanding the contribution of each sub-component to the general performance of the joint RS + US model (see Table 1). To this end, a detailed evaluation on the development set of the Worldtree corpus is carried out, analysing the performance in reconstructing explanations of different types and complexity. We compare the joint model (RS + US) with each individual sub-component (RS and US alone). In addition, a set of qualitative examples are analysed to provide additional insights on the complementary aspects captured by Relevance and Unification Score.

Explanatory categories. Given a question qj and its correct answer aj, we classify a fact fi belonging to the gold explanation Ej according to its explanatory role (central, grounding, lexical glue) and inference type (retrieval, inference-supporting and complex inference). In addition, three new categories are derived from the number of overlaps between fi and the concatenation of qj with aj (hj), computed by considering nouns, verbs, adjectives and adverbs (1+ overlaps, 1 overlap, 0 overlaps).

Table 2 reports the MAP score for each of the described categories. Overall, the best results are obtained by the BM25 implementation of the joint model (RS BM25 + US BM25) with a MAP score of 54.5. Specifically, RS BM25 + US BM25 achieves a significant improvement over both RS BM25 (+8.5 MAP) and US BM25 (+32.6 MAP) baselines. Regarding the explanatory roles (Table 2a), the joint TF-IDF implementation shows the best performance in the reconstruction of grounding explanations (32.7 MAP). On the other hand, a significant improvement over the RS baseline is obtained by RS BM25 + US BM25 on both lexical glues and central explanation sentences (+6.0 and +5.6 MAP over RS BM25).
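The overlap categories introduced above can be derived with a simple procedure. The sketch below is one possible reading of that definition using NLTK part-of-speech tags; the exact preprocessing used in the paper (e.g. lemmatisation or stop-word handling) is not specified, and the assumption that "1+ overlaps" denotes more than one shared term is ours.

```python
import nltk

# Assumes the NLTK tokeniser and POS tagger models have been downloaded, e.g.
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
CONTENT_TAGS = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs


def content_terms(text):
    tokens = nltk.word_tokenize(text.lower())
    return {tok for tok, tag in nltk.pos_tag(tokens) if tag.startswith(CONTENT_TAGS)}


def overlap_category(fact, hypothesis):
    """Bin a fact into the 0 / 1 / 1+ overlaps categories w.r.t. the hypothesis."""
    overlaps = len(content_terms(fact) & content_terms(hypothesis))
    if overlaps == 0:
        return "0 overlaps"
    if overlaps == 1:
        return "1 overlap"
    return "1+ overlaps"
```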
Regarding the lexical overlaps categories (Table 2b), we observe a steady improvement for all the combined RS + US models over the respective RS baselines. Notably, the US models achieve the best performance on the 0 overlaps category, which includes the most challenging facts for the RS models. The improved ability to rank abstract explanatory facts contributes to better performance for the joint models (RS + US) in the reconstruction of explanations that share few terms with question and answer (1 Overlap and 0 Overlaps categories). This characteristic leads to an improvement of 4.8 and 4.1 MAP for the RS BM25 + US BM25 model over the RS BM25 baseline.

Similar results are achieved on the inference types categories (Table 2c). Crucially, the largest improvement is observed for complex inference sentences, where RS BM25 + US BM25 outperforms RS BM25 by 12.0 MAP, confirming the decisive contribution of the Unification Score to the ranking of complex scientific facts.

[Figure 2: two panels, (a) MAP vs explanation length and (b) Precision@K.]
Figure 2: Impact of the Unification Score on semantic drift (2a) and precision (2b). RS + US (blue, solid), RS (green, dotted), US (red, dashed).

Semantic drift. Science questions in the Worldtree corpus require an average of six facts in their explanations (Jansen et al., 2016). Long explanations typically include sentences that share few terms with question and answer, increasing the probability of semantic drift. Therefore, to test the impact of the Unification Score on the robustness of the model, we measure the performance in the reconstruction of many-hops explanations.

Figure 2a shows the change in MAP score for the RS + US, RS and US models (BM25) with increasing explanation length. The fast drop in performance for the Relevance Score reflects the complexity of the task. This drop occurs because the RS model is not able to rank abstract explanatory facts. Conversely, the US model exhibits an inverse trend, with increasing performance. Short explanations, indeed, tend to include question-specific facts with low explanatory power. On the other hand, the longer the explanation, the higher the number of core scientific facts. Therefore, the decrease in MAP observed for the RS model is compensated by the Unification Score, since core scientific facts tend to form unification patterns across similar questions. These results demonstrate that the Unification Score has a crucial role in alleviating the semantic drift for the joint model (RS + US), resulting in a larger improvement on many-hops explanations (6+ facts).

Similarly, Figure 2b illustrates the Precision@K. As shown in the graph, the US model exhibits the slowest degradation in precision. Similarly to what is observed for many-hops explanations, the US score contributes to the robustness of the RS + US model, making it able to reconstruct more precise explanations. As discussed in Section 4.4, this feature has a positive impact on question answering.

[Figure 3: line plot of MAP (y-axis, roughly 45 to 56) against the number of nearest neighbours k (x-axis: k = 0, 10, 20, 40, 80, 100, 200).]
Figure 3: Impact of the k-NN clustering on the final MAP score. The value k represents the number of similar hypotheses considered for the Unification Score.

k-NN clustering. We investigate the impact of the k-NN clustering on the explanation reconstruction task. Figure 3 shows the MAP score obtained by the joint RS + US model (BM25) with different numbers k of nearest hypotheses considered for the Unification Score. The graph highlights the improvement in MAP achieved with increasing values of k. Specifically, we observe that the best MAP is obtained with k = 100. These results confirm that the explanatory power can be effectively estimated using clusters of similar hypotheses, and that the unification-based mechanism has a crucial role in improving the performance of the relevance model.

4.3 Qualitative analysis

To provide additional insights on the complementary aspects of Unification and Relevance Score, we present a set of qualitative examples from the dev-set. Table 3 illustrates the ranking assigned by the RS and RS + US models to scientific sentences of increasing complexity. The words in bold indicate lexical overlaps between question, answer and explanation sentence.

Example 1
  Question: If you bounce a rubber ball on the floor it goes up and then comes down. What causes the ball to come down? Answer: gravity
  Explanation fact: gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet
  Most similar hypotheses in Ekb: (1) A ball is tossed up in the air and it comes back down. The ball comes back down because of - gravity; (2) A student drops a ball. Which force causes the ball to fall to the ground? - gravity
  Rank: RS #36, RS + US #2 (↑34)

Example 2
  Question: Which animals would most likely be helped by flood in a coastal area? Answer: alligators
  Explanation fact: as water increases in an environment, the population of aquatic animals will increase
  Most similar hypotheses in Ekb: (1) Where would animals and plants be most affected by a flood? - low areas; (2) Which change would most likely increase the number of salamanders? - flood
  Rank: RS #198, RS + US #57 (↑141)

Example 3
  Question: What is an example of a force producing heat? Answer: two sticks getting warm when rubbed together
  Explanation fact: friction causes the temperature of an object to increase
  Most similar hypotheses in Ekb: (1) Rubbing sandpaper on a piece of wood produces what two types of energy? - sound and heat; (2) Which force produces energy as heat? - friction
  Rank: RS #1472, RS + US #21 (↑1451)

Table 3: Impact of the Unification Score on the ranking of scientific facts with increasing complexity.
In the first example, the sentence "gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet" shares key terms with question and candidate answer and is therefore relatively easy to rank for the RS model (#36). Nevertheless, the RS + US model is able to improve the ranking by 34 positions (#2), as the gravitational law represents a scientific pattern with high explanatory unification, frequently reused across similar questions. The impact of the Unification Score is more evident when considering abstract explanatory facts. Coming back to our original example (i.e. "What is an example of a force producing heat?"), the fact "friction causes the temperature of an object to increase" has no significant overlaps with question and answer. Thus, the RS model ranks the gold explanation sentence in a low position (#1472). However, the Unification Score (US) is able to capture the explanatory power of the fact from similar hypotheses in Ekb, pushing the RS + US ranking up to position #21 (+1451).

4.4 Question Answering

To understand whether the constructed explanations can support question answering, we compare the performance of BERT for multiple-choice QA (Devlin et al., 2019) without explanations with the performance of BERT provided with the top K explanation sentences retrieved by the RS and RS + US models (BM25). BERT without explanations operates on question and candidate answer only. On the other hand, BERT with explanation receives the following input: the question (q), a candidate answer (ci) and the explanation for ci (Ei). In this setting, the model is fine-tuned for binary classification (bertb) to predict a set of probability scores P = {p1, p2, ..., pn} for the candidate answers in C = {c1, c2, ..., cn}:

bertb([CLS] || q || ci || [SEP] || Ei) = pi    (4)

The binary classifier operates on the final hidden state corresponding to the [CLS] token. To answer the question q, the model selects the candidate answer ca such that a = argmax_i pi.
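One possible realisation of Equation 4 with the Hugging Face transformers library is sketched below. It is not the authors' training code: the checkpoint name and maximum sequence length are assumptions, and in practice the classifier is first fine-tuned on the training questions (with the hyperparameters listed in the appendix) before its probabilities are used to select an answer.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()  # shown for scoring only; the paper fine-tunes this classifier first


def answer_question(question, candidates, explanations):
    """Score each candidate with [CLS] q || c_i [SEP] E_i (Equation 4) and
    return the candidate with the highest positive-class probability."""
    probabilities = []
    for candidate, explanation_facts in zip(candidates, explanations):
        inputs = tokenizer(
            question + " " + candidate,     # first segment: q || c_i
            " ".join(explanation_facts),    # second segment: the top-K retrieved facts E_i
            return_tensors="pt",
            truncation=True,
            max_length=256,
        )
        with torch.no_grad():
            logits = model(**inputs).logits
        probabilities.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return candidates[probabilities.index(max(probabilities))]
```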
Table 4 reports the accuracy with and without explanations on the Worldtree test-set for easy and challenge questions (Clark et al., 2018).

Model | Easy | Challenge | Overall
BERT (no explanation) | 48.54 | 26.28 | 41.78
BERT + RS (K = 3) | 53.20 | 40.97 | 49.39
BERT + RS (K = 5) | 54.36 | 38.14 | 49.31
BERT + RS (K = 10) | 32.71 | 29.63 | 31.75
BERT + RS + US (K = 3) | 55.46 | 41.97 | 51.62
BERT + RS + US (K = 5) | 54.48 | 39.43 | 50.12
BERT + RS + US (K = 10) | 48.66 | 37.37 | 45.14

Table 4: Performance (accuracy) of BERT on question answering (test-set) with and without the explanation reconstruction models.

Notably, a significant improvement in accuracy can be observed when BERT is provided with explanations retrieved by the reconstruction modules (+9.84% accuracy with the RS BM25 + US BM25 model). The improvement is consistent on the easy split (+6.92%) and particularly significant for challenge questions (+15.69%). Overall, we observe a correlation between more precise explanations and accuracy in answer prediction, with BERT + RS being outperformed by BERT + RS + US for each value of K. The decrease in accuracy occurring with increasing values of K is coherent with the drop in precision for the models observed in Figure 2b. Moreover, steadier results adopting the RS + US model suggest a positive contribution from abstract explanatory facts. Additional investigation of this aspect will be a focus for future work.

5 Conclusion

This paper proposed a novel framework for multi-hop explanation reconstruction based on explanatory unification. An extensive evaluation on the Worldtree corpus led to the following conclusions: (1) The approach is competitive with state-of-the-art Transformers, yet significantly faster and inherently scalable; (2) The unification-based mechanism supports the construction of complex and many-hops explanations; (3) The constructed explanations improve the accuracy of BERT for question answering by up to 10% overall. As future work, we plan to extend the framework adopting neural embeddings for sentence representation.

Acknowledgements

The authors would like to thank the anonymous reviewers for the constructive feedback. A special thanks to Deborah Ferreira for the helpful discussions, and to the members of the AI Systems lab from the University of Manchester. Additionally, we would like to thank the Computational Shared Facility of the University of Manchester for providing the infrastructure to run our experiments.
References

Pratyay Banerjee. 2019. ASU at TextGraphs 2019 shared task: Explanation regeneration using language models and iterative re-ranking. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 78–84.

Or Biran and Courtenay Cotton. 2017. Explanation and justification in machine learning: A survey. In IJCAI-17 Workshop on Explainable AI (XAI), volume 8.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pages 9539–9549.

Yew Ken Chia, Sam Witteveen, and Martin Andrews. 2019. Red Dragon AI at TextGraphs 2019 shared task: Language model assisted explanation generation. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 85–89.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

Rajarshi Das, Ameya Godbole, Manzil Zaheer, Shehzaad Dhuliawala, and Andrew McCallum. 2019. Chains-of-reasoning at TextGraphs 2019 shared task: Reasoning over chains of facts for explainable multi-hop inference. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 101–117.

Ramon Lopez De Mantaras, David McSherry, Derek Bridge, David Leake, Barry Smyth, Susan Craw, Boi Faltings, Mary Lou Maher, Michael T. Cox, Kenneth Forbus, et al. 2005. Retrieval, reuse, revision and retention in case-based reasoning. The Knowledge Engineering Review, 20(3):215–240.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jennifer D'Souza, Isaiah Onando Mulang, and Sören Auer. 2019. Team SVMrank: Leveraging feature-rich support vector machines for ranking explanations to elementary science questions. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 90–100.

Deborah Ferreira and André Freitas. 2020a. Natural language premise selection: Finding supporting statements for mathematical text. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2175–2182.

Deborah Ferreira and André Freitas. 2020b. Premise selection in natural language mathematical texts. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7365–7374.

Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mihai Surdeanu, and Peter Clark. 2015. Higher-order lexical semantic models for non-factoid answer reranking. Transactions of the Association for Computational Linguistics, 3:197–210.

Michael Friedman. 1974. Explanation and scientific understanding. The Journal of Philosophy, 71(1):5–19.

Carl G. Hempel. 1965. Aspects of scientific explanation.

Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark. 2016. What's in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2956–2965.

Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Peter Clark. 2017. Framing QA as building and ranking intersentence answer justifications. Computational Linguistics, 43(2):407–449.

Peter Jansen and Dmitry Ustalov. 2019. TextGraphs 2019 shared task on multi-hop inference for explanation regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 63–77.

Peter Jansen, Elizabeth Wainwright, Steven Marmorstein, and Clayton Morrison. 2018. Worldtree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Daniel Khashabi, Erfan Sadeqi Azer, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2019. On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018. Question answering as global reasoning over semantic abstractions. In Thirty-Second AAAI Conference on Artificial Intelligence.
Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2019. QASC: A dataset for question answering via sentence composition. arXiv preprint arXiv:1910.11473.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316.

Philip Kitcher. 1981. Explanatory unification. Philosophy of Science, 48(4):507–531.

Philip Kitcher. 1989. Explanatory unification and the causal structure of the world.

Janet Kolodner. 2014. Case-based reasoning. Morgan Kaufmann.

Carmen Lacave and Francisco J. Diez. 2004. A review of explanation methods for heuristic expert systems. The Knowledge Engineering Review, 19(2):133–146.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.

Tom M. Mitchell, Richard M. Keller, and Smadar T. Kedar-Cabelli. 1986. Explanation-based generalization: A unifying view. Machine Learning, 1(1):47–80.

Judea Pearl. 2009. Causality. Cambridge University Press.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.

Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.

Wesley C. Salmon. 1984. Scientific explanation and the causal structure of the world. Princeton University Press.

Frode Sørmo, Jörg Cassens, and Agnar Aamodt. 2005. Explanation in case-based reasoning – perspectives and goals. Artificial Intelligence Review, 24(2):109–143.

Paul Thagard. 1992. Analogy, explanation, and education. Journal of Research in Science Teaching, 29(6):537–544.

Paul Thagard and Abninder Litt. 2008. Models of scientific explanation. The Cambridge Handbook of Computational Psychology, pages 549–564.

Mokanarangan Thayaparan, Marco Valentino, and André Freitas. 2020. A survey on explainability in machine reading comprehension. arXiv preprint arXiv:2010.00389.

Mokanarangan Thayaparan, Marco Valentino, Viktor Schlegel, and André Freitas. 2019. Identifying supporting facts for multi-hop question answering with document graph networks. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 42–51.

Michael R. Wick and William B. Thompson. 1992. Reconstructive expert system explanation. Artificial Intelligence, 54(1-2):33–70.

James Woodward. 2005. Making things happen: A theory of causal explanation. Oxford University Press.

Vikas Yadav, Steven Bethard, and Mihai Surdeanu. 2019. Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2578–2589.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

A Supplementary Material

A.1 Hyperparameters tuning

The hyperparameters of the model have been tuned manually. The criterion for the optimisation is the maximisation of the MAP score on the dev-set. Here, we report the values adopted for the experiments described in the paper.
The Unification-based Reconstruction adopts two hyperparameters. Specifically, λ1 is the weight assigned to the relevance score in Equation 1, while k is the number of similar hypotheses to consider for the calculation of the unification score (Equation 2).
The values adopted for these parameters are as follows:
1. λ1 = 0.83 (1 − λ1 = 0.17)
2. k = 100

A.2 BERT model

For question answering we adopt a BERT-BASE model. The model is implemented using PyTorch (https://pytorch.org/) and fine-tuned using 4 Tesla 16GB V100 GPUs for 10 epochs in total, with batch size 32 and seed 42. The hyperparameters adopted for BERT are as follows:
• gradient accumulation steps = 1
• learning rate = 5e-5
• weight decay = 0.0
• adam epsilon = 1e-8
• warmup steps = 0
• max grad norm = 1.0

A.3 Data and code

The experiments are carried out on the TextGraphs 2019 version (https://github.com/umanlp/tg2019task) of the Worldtree corpus. The full dataset can be downloaded at the following URL: http://cognitiveai.org/dist/worldtree_corpus_textgraphs2019sharedtask_withgraphvis.zip.
The code to reproduce the experiments described in the paper is available at the following URL: https://github.com/ai-systems/unification_reconstruction_explanations
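For convenience, the values reported in A.1 and A.2 can be collected into a single configuration. The snippet below only restates the numbers above; the uncased BERT-base checkpoint name is an assumption, since the paper does not state which variant of BERT-BASE is used.

```python
# Hyperparameters reported in A.1 (Unification-based Reconstruction) and A.2 (BERT).
UNIFICATION_CONFIG = {
    "lambda_1": 0.83,  # weight of rs(h_j, f_i) in Equation 1; 1 - lambda_1 = 0.17
    "k": 100,          # nearest hypotheses considered for the Unification Score (Equation 2)
}

BERT_CONFIG = {
    "model_name": "bert-base-uncased",  # assumption: uncased BERT-base variant
    "epochs": 10,
    "batch_size": 32,
    "seed": 42,
    "gradient_accumulation_steps": 1,
    "learning_rate": 5e-5,
    "weight_decay": 0.0,
    "adam_epsilon": 1e-8,
    "warmup_steps": 0,
    "max_grad_norm": 1.0,
}
```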