Benchmarking Machine Reading Comprehension: A Psychological Perspective - arXiv.org

Page created by Janet Carr

Arts & Entertainment

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Benchmarking Machine Reading Comprehension:
                                                                   A Psychological Perspective

                                                              Saku Sugawara1 , Pontus Stenetorp2 , Akiko Aizawa1
                                                          1
                                                            National Institute of Informatics, 2 University College London
                                                      {saku,aizawa}@nii.ac.jp, p.stenetorp@cs.ucl.ac.uk

                                                               Abstract                           required by the datasets and is actually acquired by
                                                                                                  models. Although benchmarking MRC is related to
                                             Machine reading comprehension (MRC) has              the intent behind questions and is critical to test hy-
                                             received considerable attention as a bench-
arXiv:2004.01912v2 [cs.CL] 26 Jan 2021

                                                                                                  potheses from a top-down viewpoint (Bender and
                                             mark for natural language understanding.
                                             However, the conventional task design of             Koller, 2020), its theoretical foundation is poorly
                                             MRC lacks explainability beyond the model            investigated in the literature.
                                             interpretation, i.e., reading comprehension by          In this position paper, we examine the prerequi-
                                             a model cannot be explained in human terms.          sites for benchmarking MRC based on the follow-
                                             To this end, this position paper provides a the-     ing two questions: (i) What does reading compre-
                                             oretical basis for the design of MRC datasets        hension involve? (ii) How can we evaluate it? Our
                                             based on psychology as well as psychometrics,
                                                                                                  motivation is to provide a theoretical basis for the
                                             and summarizes it in terms of the prerequisites
                                             for benchmarking MRC. We conclude that fu-           creation of MRC datasets. As Gilpin et al. (2018)
                                             ture datasets should (i) evaluate the capability     indicate, interpreting the internals of a system is
                                             of the model for constructing a coherent and         closely related to only the system’s architecture
                                             grounded representation to understand context-       and is insufficient for explaining how the task is ac-
                                             dependent situations and (ii) ensure substan-        complished. This is because even if the internals of
                                             tive validity by shortcut-proof questions and        models can be interpreted, we cannot explain what
                                             explanation as a part of the task design.            is measured by the datasets. Therefore, our study
                                                                                                  focuses on the explainability of the task rather than
                                         1   Introduction
                                                                                                  the interpretability of models.
                                         Evaluation of natural language understanding                We first overview MRC and review the analytical
                                         (NLU) is a long-standing goal in the field of artifi-    literature that indicates that existing datasets might
                                         cial intelligence. Machine reading comprehension         fail to correctly evaluate their intended behavior
                                         (MRC) is a task that tests the ability of a machine to   (Section 2). Subsequently, we present a psycholog-
                                         read and understand unstructured text and could be       ical study of human reading comprehension in Sec-
                                         the most suitable task for evaluating NLU because        tion 3 for answering the what question. We argue
                                         of its generic formulation (Chen, 2018). Recently,       that the concept of representation levels can serve
                                         many large-scale datasets have been proposed, and        as a conceptual hierarchy for organizing the tech-
                                         deep learning systems have achieved human-level          nologies in MRC. Section 4 focuses on answering
                                         performance for some of these datasets.                  the how question. Here, we implement psychomet-
                                            However, analytical studies have shown that           rics to analyze the prerequisites for the task design
                                         MRC models do not necessarily achieve human-             of MRC. Furthermore, we introduce the concept of
                                         level understanding. For example, Jia and Liang          construct validity, which emphasizes validating the
                                         (2017) use manually crafted adversarial examples         interpretation of the task’s outcome. Finally, in Sec-
                                         to show that successful systems are easily dis-          tion 5, we explain the application of the proposed
                                         tracted. Sugawara et al. (2020) show that a sig-         concepts into practical approaches, highlighting po-
                                         nificant part of already solved questions is solvable    tential future directions toward the advancement of
                                         even after shuffling the words in a sentence or drop-    MRC. Regarding the what question, we indicate
                                         ping content words. These studies demonstrate that       that datasets should evaluate the capability of the
                                         we cannot explain what type of understanding is          situation model, which refers to the construction

Question              Foundation                     Requirements                                 Future direction
    What is reading       Representation levels in       (A) Linguistic-level sentence understand-    (C) Dependence of con-
    comprehension?        human reading compre-          ing, (B) comprehensiveness of skills         text on defeasibility and
                          hension: (A) surface           for inter-sentence understanding, and        novelty, and grounding to
                          structure, (B) textbase,       (C) evaluation of coherent representation    non-textual information
                          and (C) situation model.       grounded to non-textual information.         with a long passage.
    How can we evalu-     Construct validity in psy-     (1) Wide coverage of skills, (2) evalu-      (2) Creating shortcut-
    ate reading compre-   chometrics: (1) content,       ation of the internal process, (3) struc-    proof questions by
    hension?              (2) substantive, (3) struc-    tured metrics, (4) reliability of metrics,   filtering and ablation,
                          tural, (4) generalizability,   (5) comparison with external variables,      and designing a task for
                          (5) external, and (6) con-     and (6) robustness to adversarial attacks    validating the internal
                          sequential aspects.            and social biases.                           process.

Table 1: Overview of theoretical foundations, requirements, and future directions of MRC discussed in this paper.

of a coherent and grounded representation of text                   sage (MCTest (Richardson et al., 2013)), a set of
based on human understanding. Regarding the how                     passages (HotpotQA (Yang et al., 2018)), a longer
question, we argue that among the important as-                     document (CBT (Hill et al., 2016)), or open domain
pects of the construct validity, substantive validity               (Chen et al., 2017). In some datasets, a context
must be ensured, which requires the verification of                 includes non-textual information such as images
the internal mechanism of comprehension.                            (RecipeQA (Yagcioglu et al., 2018)).
   Table 1 provides an overview of the perspectives
taken in this paper. Our answers and suggestions to                 Question Styles A question can be an interrog-
the what and how questions are summarized as fol-                   ative sentence (in most datasets), a fill-in-the-
lows: (1) Reading comprehension is the process of                   blank sentence (cloze) (CLOTH (Xie et al., 2018)),
creating a situation model that best explains given                 knowledge base entries (QAngaroo (Welbl et al.,
texts and the reader’s background knowledge. The                    2018)) and search engine queries (MSMARCO
situation model should be the next focal point in                   (Nguyen et al., 2016)).
future datasets for benchmarking the human-level                    Answering Styles An answer can be (i) chosen
reading comprehension. (2) To evaluate reading                      from a text span of the given document (answer
comprehension correctly, the task needs to provide                  extraction) (NewsQA (Trischler et al., 2017)), (ii)
a rubric (scoring guide) for sufficiently covering                  chosen from a candidate set of answers (multiple
the aspects of the construct validity. In particular,               choice) (MCTest (Richardson et al., 2013)), or (iii)
the substantive validity should be ensured by cre-                  generated as a free-form text (description) (Narra-
ating shortcut-proof questions and by designing a                   tiveQA (Kočiský et al., 2018)). Some datasets op-
task formulation that is explanatory itself.                        tionally allow answering by a yes/no reply (BoolQ
                                                                    (Clark et al., 2019)).
2     Task Overview
                                                                    Sourcing Methods Initially, questions in small-
2.1    Task Variations and Existing Datasets                        scale datasets are created by experts (QA4MRE
MRC is a task in which a machine is given a docu-                   (Sutcliffe et al., 2013)). Later, fueling the devel-
ment (context) and it answers the questions based                   opment of neural models, most published datasets
on the context. Burges (2013) provides a general                    have more than a hundred thousand questions that
definition of MRC, i.e., a machine comprehends a                    are automatically created (CNN/Daily Mail (Her-
passage of text if, for any question regarding that                 mann et al., 2015)), crowdsourced (SQuAD v1.1
text that can be answered correctly by a majority of                (Rajpurkar et al., 2016)), and collected from exam-
native speakers, that machine can provide a string                  inations (RACE (Lai et al., 2017)).
which those speakers would agree both answers
that question. We overview various aspects of the                   Domains The most popular domain is Wikipedia
task along with representative datasets as follows.                 articles (Natural Questions (Kwiatkowski et al.,
Existing datasets are listed in Appendix A.                         2019)), but news articles are also used (Who-did-
                                                                    What (Onishi et al., 2016)). CliCR (Suster and
Context Styles A context can be given in various                    Daelemans, 2018) and emrQA (Pampari et al.,
forms with different lengths such as a single pas-                  2018) are datasets in the clinical domain. DuoRC

(Saha et al., 2018) uses movie scripts. sion. This issue may be attributed to the low in-
terpretability of black-box neural network models.
Specific Skills Several recently proposed However, a problem is that we cannot explain what
datasets require specific skills including unanswer- is measured by the datasets even if we can inter-
able questions (SQuAD v2.0 (Rajpurkar et al., pret the internals of models. We speculate that this
2018)), dialogues (CoQA (Reddy et al., 2019), benchmarking issue in MRC can be attributed to
DREAM (Sun et al., 2019)), multiple-sentence the following two points: (i) we do not have a com-
reasoning (MultiRC (Khashabi et al., 2018)), prehensive theoretical basis of reading comprehen-
multi-hop reasoning (HotpotQA (Yang et al., sion for specifying what we should ask (Section 3)
2018)), mathematical and set reasoning (DROP and (ii) we do not have a well-established method-
(Dua et al., 2019)), commonsense reasoning ology for creating a dataset and for analyzing a
(CosmosQA (Huang et al., 2019)), coreference model based on it (Section 4).1 In the remainder
resolution (QuoRef (Dasigi et al., 2019)), and of this paper, we argue that these issues can be ad-
logical reasoning (ReClor (Yu et al., 2020)). dressed by using insights from the psychological
study of reading comprehension and by implement-
2.2 Benchmarking Issues
ing psychometric means of validation.
In some datasets, the performance of machines has
already reached human-level performance. How- 3 Reading Comprehension from
ever, Jia and Liang (2017) indicate that models can Psychology to MRC
easily be fooled by manual injection of distract-
3.1 Computational Model in Psychology
ing sentences. Their study revealed that questions
simply gathered by crowdsourcing without careful Human text comprehension has been studied in
guidelines or constraints are insufficient to evaluate psychology for a long time (Kintsch and Rawson,
precise language understanding. 2005; Graesser et al., 1994; Kintsch, 1988). Con-
This argument is supported by further studies nectionist and computational architectures have
across a variety of datasets. For example, Min et al. been proposed for such comprehension including a
(2018) find that more than 90% of the questions mechanism pertinent to knowledge activation and
in SQuAD (Rajpurkar et al., 2016) require obtain- memory storing. Among the computational mod-
ing an answer from a single sentence despite being els, the construction–integration (CI) model is the
provided with a passage. Sugawara et al. (2018) most influential and provides a strong foundation
show that large parts of twelve datasets are eas- of the field (McNamara and Magliano, 2009).
ily solved only by looking at a few first question The CI model assumes three different represen-
tokens and attending the similarity between the tation levels as follows:
given questions and the context. Similarly, Feng • Surface structure is the linguistic information
et al. (2018) and Mudrakarta et al. (2018) demon- of particular words, phrases, and syntax ob-
strate that models trained on SQuAD do not change tained by decoding the raw textual input.
their predictions even when the question tokens are
partly dropped. Kaushik and Lipton (2018) also • Textbase is a set of propositions in the text,
observe that question- and passage-only models where the propositions are locally connected
perform well for some popular datasets. Min et al. by inferences (microstructure).
(2019) and Chen and Durrett (2019) concurrently • Situation model is a situational and coherent
indicate that for multi-hop reasoning datasets, the mental representation in which the propositions
questions are solvable only with a single paragraph are globally connected (macrostructure), and it
and thus do not require multi-hop reasoning over is often grounded to not only texts but also to
multiple paragraphs. Zellers et al. (2019b) report sounds, images, and background information.
that their dataset unintentionally contains stylistic
biases in the answer options which are embedded The CI model first decodes textual information
by a language-based model. (i.e., the surface structure) from the raw textual
Overall, these investigations highlight a grave 1
These two issues loosely correspond to the plausibility
issue of the task design, i.e., even if the models and faithfulness of explanation (Jacovi and Goldberg, 2020).
The plausibility is linked to what we expect as an explanation,
achieve human-level accuracies, we cannot prove whereas the faithfulness refers to how accurately we explain
that they successfully perform reading comprehen- models’ reasoning process.

input, then creates the propositions (i.e., textbase) knowledge and reasoning for scientific questions
and their local connections occasionally using the in MRC (Clark et al., 2018). A limitation of both
reader’s knowledge (construction), and finally con- these studies is that the proposed sets of knowl-
structs a coherent representation (i.e., situation edge and inference are limited to the domain of
model) that is organized according to five dimen- elementary-level science. Although some existing
sions including time, space, causation, intention- datasets for MRC have their own classifications of
ality, and objects (Zwaan and Radvansky, 1998), skills, they are coarse and only cover a limited ex-
which provides a global description of the events tent of typical NLP tasks (e.g., word matching and
(integration). These steps are not exclusive, i.e., paraphrasing). In contrast, for a more generalizable
propositions are iteratively updated in accordance definition, Sugawara et al. (2017) propose a set of
with the surrounding ones with which they are 13 skills for MRC. Rogers et al. (2020) pursue this
linked. Although the definition of successful text direction by proposing a set of questions with eight
comprehension can vary, Hernández-Orallo (2017) question types. In addition, Schlegel et al. (2020)
indicates that comprehension implies the process propose an annotation schema to investigate requi-
of creating (or searching for) a situation model that site knowledge and reasoning. Dunietz et al. (2020)
best explains the given text and the reader’s back- propose a template of understanding that consists
ground knowledge (Zwaan and Radvansky, 1998). of spatial, temporal, causal, and motivational ques-
We use this definition to highlight that the creation tions to evaluate precise understanding of narratives
of a situation model plays a vital role in human with reference to human text comprehension.
reading comprehension. In what follows, we describe the three represen-
Our aim in this section is to provide a basis for tation levels that basically follow the three repre-
explaining what reading comprehension is, which sentations of the CI model but are modified for
requires terms for explanation. In the computa- MRC. The three levels are shown in Figure 1. We
tional model above, the representation levels appear emphasize that we do not intend to create exhaus-
to be useful for organizing such terms. We ground tive and rigid definitions of skills. Rather, we aim
existing NLP technologies and tasks to different to place them in a hierarchical organization, which
representation levels in the next section. can serve as a foundation to highlight the missing
aspects in the current MRC.
3.2 Skill Hierarchy for MRC
Here, we associate the existing NLP tasks with the Surface Structure This level broadly covers the
three representation levels introduced above. The linguistic information and its semantic meaning,
biggest advantage of MRC is its general formu- which can be based on the raw textual input. Al-
lation, which makes it the most general task for though these features form a proposition according
evaluating NLU. This emphasizes the importance to psychology, it should be viewed as sentence-
of the requirement of various skills in MRC, which level semantic representation in computational lin-
can serve as the units for the explanation of reading guistics. This level includes part-of-speech tagging,
comprehension. Therefore, our motivation is to syntactic parsing, dependency parsing, punctua-
provide an overview of the skills as a hierarchical tion recognition, named entity recognition (NER),
taxonomy and to highlight the missing aspects in and semantic role labeling (SRL). Although these
existing MRC datasets that are required for com- basic tasks can be accomplished by some recent
prehensively covering the representation levels. pretraining-based neural language models (Liu
et al., 2019), they are hardly required in NLU tasks
Existing Taxonomies We first provide a brief including MRC. In the natural language inference
overview of the existing taxonomies of skills in task, McCoy et al. (2019) indicate that existing
NLU tasks. For recognizing textual entailment datasets (e.g., Bowman et al. (2015)) may fail to
(Dagan et al., 2006), several studies present a clas- elucidate the syntactic understanding of given sen-
sification of reasoning and commonsense knowl- tences. Although it is not obvious that these basic
edge (Bentivogli et al., 2010; Sammons et al., 2010; tasks should be included in MRC and it is not easy
LoBue and Yates, 2011). For scientific question to circumscribe linguistic knowledge from con-
answering, Jansen et al. (2016) categorize knowl- crete and abstract knowledge (Zaenen et al., 2005;
edge and inference for an elementary-level dataset. Manning, 2006), we should always care about the
Similarly, Boratko et al. (2018) propose types of capabilities of basic tasks (e.g., use of checklists

Construct the global structure of propositions.
Skills: creating a coherent representation and grounding it to other media.
Situation
model Construct the local relations of propositions.
Skills: recognizing relations between sentences such as coreference resolu-
Textbase tion, knowledge reasoning, and understanding discourse relations.

Surface structure Creating propositions from the textual input.
Skills: syntactic and dependency parsing, POS tagging, SRL, and NER.

Figure 1: Representation levels and corresponding natural language understanding skills.

(Ribeiro et al., 2020)) when the performance of a Passage: The princess climbed out the window of the high
model is being assessed. tower and climbed down the south wall when her mother
was sleeping. She wandered out a good way. Finally, she
went into the forest where there are no electric poles.
Textbase This level covers local relations of
propositions in the computational model of reading Q1: Who climbed out of the castle? A: Princess
comprehension. In the context of NLP, it refers Q2: Where did the princess wander after escaping?
A: Forest
to various types of relations linked between sen- Q3: What would happen if her mother was not sleeping?
tences. These relations not only include the typical A: the princess would be caught soon (multiple choice)
relations between sentences (discourse relations)
but also the links between entities. Consequently, Figure 2: Example questions of the different represen-
this level includes coreference resolution, causality, tation levels. The passage is taken from MCTest.
temporal relations, spatial relations, text structuring
relations, logical reasoning, knowledge reasoning, Example The representation levels in the exam-
commonsense reasoning, and mathematical reason- ple shown in Figure 2 are described as follows.
ing. We also include multi-hop reasoning (Welbl Q1 is at the surface-structure level where a reader
et al., 2018) at this level because it does not neces- only needs to understand the subject of the first
sarily require a coherent global representation over event. We expect that Q2 requires understanding
a given context. For studying the generalizabil- of relations among described entities and events at
ity of MRC, Fisch et al. (2019) propose a shared the textbase level; the reader may need to under-
task featuring training and testing on multiple do- stand who she means using coreference resolution.
mains. Talmor and Berant (2019) and Khashabi Escaping in Q2 also requires the reader’s common-
et al. (2020) also find that training on multiple sense to associate it with the first event. However,
datasets leads to robust generalization. However, the reader might be able to answer this question
unless we make sure that datasets require various only by looking for a place (specified by where)
skills with sufficient coverage, it might remain un- described in the passage, thereby necessitating the
clear whether we evaluate a model’s transferability validity of the question to correctly evaluate the
of the reading comprehension ability. understanding of the described events. Q3 is an
Situation Model This level targets the global example that requires imagining a different situa-
structure of propositions in human reading com- tion at the situation-model level, which could be
prehension. It includes a coherent and situational further associated with a grounding question such
representation of a given context and its grounding as which figure best depicts the given passage?
to the non-textual information. A coherent repre- In summary, we indicate that the following fea-
sentation has well-organized sentence-to-sentence tures might be missing in existing datasets:
transitions (Barzilay and Lapata, 2008), which are
• Considering the capability to acquire basic un-
vital for using procedural and script knowledge
derstanding of the linguistic-level information.
(Schank and Abelson, 1977). This level also in-
cludes characters’ goals and plans, meta perspec- • Ensuring that the questions comprehensively
tive including author’s intent and attitude, thematic specify and evaluate textbase-level skills.
understanding, and grounding to other media. Most
existing MRC datasets seem to struggle to target the • Evaluating the capability of the situation model
situation model. We discuss further in Section 5.1. in which propositions are coherently organized

and are grounded to non-textual information. 4.2 Construct Validity in MRC

Should MRC models mimic human text com- Table 2 also raises MRC features corresponding to
prehension? In this paper, we do not argue that the six aspects of construct validity. In what fol-
MRC models should mimic human text compre- lows, we elaborate on these correspondings and dis-
hension. However, when we design an NLU task cuss the missing aspects that are needed to achieve
and create datasets for testing human-like linguistic the construct validity of the current MRC.
generalization, we can refer to the aforementioned Content Aspect As discussed in Section 3, suffi-
features to frame the intended behavior to evaluate ciently covering the skills across all the representa-
in the task. As Linzen (2020) discusses, the task de- tion levels is an important requirement of MRC. It
sign is orthogonal to how the intended behavior is may be desirable that an MRC model is simultane-
realized at the implementation level (Marr, 1982). ously evaluated on various skill-oriented examples.
4 MRC on Psychometrics Substantive Aspect This aspect appraises the ev-
idence for the consistency of model behavior. We
In this section, we provide a theoretical foundation
consider that this is the most important aspect for
for the evaluation of MRC models. When MRC
explaining reading comprehension, a process that
measures the capability of reading comprehension,
subsumes various implicit and complex steps. To
validation of the measurement is crucial to obtain a
obtain a consistent response from an MRC system,
reliable and useful explanation. Therefore, we fo-
it is necessary to ensure that the questions correctly
cus on psychometrics—a field of study concerned
assess the internal steps in the process of reading
with the assessment of the quality of psychological
comprehension. However, as stated in Section 2.2,
measurement (Furr, 2018). We expect that the in-
most existing datasets fail to verify that a question
sights obtained from psychometrics can facilitate a
is solved by using an intended skill, which implies
better task design. In Section 4.1, we first review
that it cannot be proved that a successful system
the concept of validity in psychometrics. Subse-
can actually perform intended comprehension.
quently, in Section 4.2, we examine the aspects that
correspond to construct validity in MRC and then Structural Aspect Another issue in the current
indicate the prerequisites for verifying the intended MRC is that they only provide simple accuracy
explanation of MRC in its task design. as a metric. Given that the substantive aspect ne-
cessitates the evaluation of the internal process of
4.1 Construct Validity in Psychometrics
reading comprehension, the structure of metrics
According to psychometrics, construct validity is needs to reflect it. However, a few studies have at-
necessary to validate the interpretation of outcomes tempted to provide a dataset with multiple metrics.
of psychological experiments.2 Messick (1995) For example, Yang et al. (2018) not only ask for
report that construct validity consists of the six the answers to questions but also provide sentence-
aspects shown in Table 2. level supporting facts. This metric can also evaluate
In the design of educational and psychological the process of multi-hop reasoning whenever the
measurement, these aspects collectively provide supporting sentences need to be understood for an-
verification questions that need to be answered for swering a question. Therefore, we need to consider
justifying the interpretation and use of test scores. both substantive and structural aspects.
In this sense, the construct validation can be viewed
as an empirical evaluation of the meaning and con- Generalizability Aspect The generalizability of
sequence of measurement. Given that MRC is in- MRC can be understood from the reliability of met-
tended to capture the reading comprehension abil- rics and the reproducibility of findings. For the
ity, the task designers need to be aware of these reliability of metrics, we need to take care of the re-
validity aspects. Otherwise, users of the task can- liability of gold answers and model predictions. Re-
not justify the score interpretation, i.e., it cannot be garding the accuracy of answers, the performance
confirmed that successful systems actually perform of the model becomes unreliable when the answers
intended reading comprehension. are unintentionally ambiguous or impractical. Be-
2
cause the gold answers in most datasets are only
In psychology, a construct is an abstract concept, which
facilitates the understanding of human behavior such as vo- decided by the majority vote of crowd workers,
cabulary, skills, and comprehension. the ambiguity of the answers is not considered. It

Validity aspects Definition in psychometrics Correspondence in MRC

1. Content Evidence of content relevance, representativeness, Questions require reading comprehension skills
and technical quality. with sufficient coverage and representativeness
over the representation levels.
2. Substantive Theoretical rationales for the observed consisten- Questions correctly evaluate the intended inter-
cies in the test responses including task perfor- mediate process of reading comprehension and
mance of models. provide rationales to the interpreters.
3. Structural Fidelity of the scoring structure to the structure of Correspondence between the task structure and
the construct domain at issue. the score structure.
4. Generalizability Extent to which score properties and interpretations Reliability of test scores in correct answers and
can be generalized to and across population groups, model predictions, and applicability to other
settings, and tasks. datasets and models.
5. External Extent to which the assessment scores’ relationship Comparison of the performance of MRC with
with other measures and non-assessment behaviors that of other NLU tasks and measurements.
reflect the expected relations.
6. Consequential Value implications of score interpretation as a basis Considering the model vulnerabilities to adver-
for the consequences of test use, especially regard- sarial attacks and social biases of models and
ing the sources of invalidity related to issues of datasets to ensure the fairness of model outputs.
bias, fairness, and distributive justice.

Table 2: Aspects of the construct validity in psychometrics and corresponding features in MRC.

may be useful if such ambiguity can be reflected MRC, this refers to the use of a successful model
in the evaluation (e.g., using the item response the- in practical situations other than tasks, where we
ory (Lalor et al., 2016)). As for model predic- need to ensure the robustness of a model to adver-
tions, an issue may be the reproducibility of results sarial attacks and the accountability for unintended
(Bouthillier et al., 2019), which implies that the model behaviors. Wallace et al. (2019) highlight
reimplementation of a system generates statistically this aspect by showing that existing NLP models
similar predictions. For the reproducibility of mod- are vulnerable to adversarial examples, thereby gen-
els, Dror et al. (2018) emphasize statistical testing erating egregious outputs.
methods to evaluate models. For the reproducibil-
Summary: Design of Rubric Given the validity
ity of findings, Bouthillier et al. (2019) stress it as
aspects, our suggestion is to design a rubric (scor-
the transferability of findings in a dataset/task to
ing guide used in education) of what reading com-
another dataset/task. In open-domain question an-
prehension we expect is evaluated in a dataset; this
swering, Lewis et al. (2021) point out that success-
helps to inspect detailed strengths and weaknesses
ful models might only memorize dataset-specific
of models that cannot be obtained only by simple
knowledge. To facilitate this transferability, we
accuracy. The rubric should not only cover various
need to have units of explanation that can be used
linguistic phenomena (the content aspect) but also
in different datasets (Doshi-Velez and Kim, 2018).
involve different levels of intermediate evaluation
External Aspect This aspect refers to the rela- in the reading comprehension process (the substan-
tionship between a model’s scores on different tive and structural aspects) as well as stress testing
tasks. Yogatama et al. (2019) point out that current of adversarial attacks (the consequential aspect).
models struggle to transfer their ability from a task The rubric is in a similar motivation with dataset
originally trained on (e.g., MRC) to different un- statements (Bender and Friedman, 2018; Gebru
seen tasks (e.g., SRL). To develop a general NLU et al., 2018); however, taking the validity aspects
model, one would expect that a successful MRC into account would improve its substance.
model should show sufficient performance on other 5 Future Directions
NLU tasks as well. To this end, Wang et al. (2019)
propose an evaluation framework with ten different This section discusses future potential directions
NLU tasks in the same format. toward answering the what and how questions in
Sections 3 and 4. In particular, we infer that the
Consequential Aspect This aspect refers to the situation model and substantive validity are critical
actual and potential consequences of test use. In for benchmarking human-level MRC.

5.1 What Question: Situation Model texts. Similarly to the textbook questions (Kemb-
As mentioned in Section 3, existing datasets fail havi et al., 2017), a possible approach would be to
to fully assess the ability of creating the situation create questions for understanding of texts through
model. As a future direction, we suggest that the showing figures. We might also need to account
task should deal with two features of the situation for the scope of grounding (Bisk et al., 2020), i.e.,
model: context dependency and grounding. ultimately understanding human language in a so-
cial context beyond simply associating texts with
5.1.1 Context-dependent Situations perceptual information.
A vital feature of the situation model is that it is
conditioned on a given text, i.e., a representation 5.2 How Question: Substantive Validity
is constructed distinctively depending on the given Substantive validity requires us to ensure that the
context. We elaborate it by discussing the two key questions correctly assess the internal steps of read-
features: defeasibility and novelty. ing comprehension. We discuss two approaches for
Defeasibility The defeasibility of a constructed this challenge: creating shortcut-proof questions
representation implies that a reader can modify and and ensuring the explanation by design.
revise it according to the newly acquired informa- 5.2.1 Shortcut-proof Questions
tion (Davis and Marcus, 2015; Schubert, 2015).
Gururangan et al. (2018) reveal that NLU datasets
The defeasibility of NLU has been tackled in the
can contain unintended dataset biases embedded
task of if-then reasoning (Sap et al., 2019a), ab-
by annotators. If machine learning models exploit
ductive reasoning (Bhagavatula et al., 2020), coun-
such biases for answering questions, we cannot
terfactual reasoning (Qin et al., 2019), or contrast
evaluate the precise NLU of models. Following
sets (Gardner et al., 2020). A possible approach
Geirhos et al. (2020), we define shortcut-proof
in MRC is that we ask questions against a set of
questions as ones that prevent models from exploit-
modified passages that describe slightly different
ing dataset biases and learning decision rules (short-
situations, where the same question can lead to
cuts) that perform well only on i.i.d. test examples
different conclusions.
with regard to its training examples. Gardner et al.
Novelty An example showing the importance (2019) also point out the importance of mitigating
of contextual novelty is Could a crocodile run a shortcuts in MRC. In this section, we view two
steeplechase? by Levesque (2014). This question different approaches for this challenge.
poses a novel situation where the solver needs to
combine multiple commonsense knowledge to de- Removing Unintended Biases by Filtering
rive the correct answer. If non-fiction documents, Zellers et al. (2018) propose a model-based ad-
such as newspaper and Wikipedia articles, are only versarial filtering method that iteratively trains an
used, some questions require only the reasoning ensemble of stylistic classifiers and uses them to
of facts already known in web-based corpus. Fic- filter out the questions. Sakaguchi et al. (2020)
tional narratives may be a better source for creating also propose filtering methods based on both ma-
questions on novel situations. chines and humans to alleviate dataset-specific and
word-association biases. However, a major issue
5.1.2 Grounding to Other Media is the inability to discern knowledge from bias in
In MRC, grounding texts to non-textual informa- a closed domain. When the domain is equal to a
tion is not fully explored yet. Kembhavi et al. dataset, patterns that are valid only in the domain
(2017) propose a dataset based on science text- are called dataset-specific biases (or annotation
books, which contain questions with passages, di- artifacts in the labeled data). When the domain
agrams, and images. Kahou et al. (2018) propose covers larger corpora, the patterns (e.g., frequency)
a figure-based question answering dataset that re- are called word-association biases. When the do-
quires the understanding of figures including line main includes everyday experience, patterns are
plots and bar charts. Although another approach called commonsense. However, as mentioned in
could be vision-based question answering tasks Section 5.1, commonsense knowledge can be de-
(Antol et al., 2015; Zellers et al., 2019a), we can- feasible, which implies that the knowledge can be
not directly use them for evaluating NLU because false in unusual situations. In contrast, when the
they focus on understanding of images rather than domain is our real world, indefeasible patterns are

called factual knowledge. Therefore, the distinc- performance by modeling the generation of the ex-
tion of bias and knowledge depends on where the planation. Although we must take into account
pattern is recognized. This means that a dataset the faithfulness of explanation, asking for intro-
should be created so that it can evaluate reasoning spective explanations could be useful in inspecting
on the intended knowledge. For example, to test the internal reasoning process, e.g., by extending
defeasible reasoning, we must filter out questions the task formulation so that it includes auxiliary
that are solvable by usual commonsense only. If we questions that consider the intermediate facts in a
want to investigate the reading comprehension abil- reasoning process. For example, before answering
ity without depending on factual knowledge, we Q2 in Figure 2, a reader should be able to answer
can consider counterfactual or fictional situations. who escaped? and where did she escape from? at
the surface-structure level.
Identifying Requisite Skills by Ablating Input
Features Another approach is to verify shortcut- Creating Dependency Between Questions An-
proof questions by analyzing the human answer- other approach for improving the substantive va-
ability of questions regarding their key features. lidity is to create dependency between questions
We speculate that if a question is still answerable by which answering them correctly involves an-
by humans even after removing the intended fea- swering some other questions correctly. For exam-
tures, the question does not require understanding ple, Dalvi et al. (2018) propose a dataset that re-
of the ablated features (e.g., checking the necessity quires a procedural understanding of scientific facts.
of resolving pronoun coreference after replacing In their dataset, a set of questions corresponds to
pronouns with dummy nouns). Even if we can- the steps of the entire process of a scientific phe-
not accurately identify such necessary features, by nomenon. Therefore, this set can be viewed as
identifying partial features of them in a sufficient a single question that requires a complete under-
number of questions, we could expect that the ques- standing of the scientific phenomenon. In CoQA
tions evaluate the corresponding intended skill. In (Reddy et al., 2019), it is noted that questions often
a similar vein, Geirhos et al. (2020) argue that a have pronouns that refer back to nouns appearing
dataset is useful only if it is a good proxy for the in previous questions. These mutually-dependent
underlying ability one is actually interested in. questions can probably facilitate the explicit vali-
dation of the models’ understanding of given texts.
5.2.2 Explanation by Design
Another approach for ensuring the substantive va- 6 Conclusion
lidity is to include explicit explanation in the task In this paper, we outlined current issues and future
formulation. Although gathering human explana- directions for benchmarking machine reading com-
tions is costly, the following approaches can facil- prehension. We visited the psychology study to
itate the explicit verification of a model’s under- analyze what we should ask of reading comprehen-
standing using a few test examples. sion and the construct validity in psychometrics to
Generating Introspective Explanation Inoue analyze how we should correctly evaluate it. We
et al. (2020) classify two types of explanation deduced that future datasets should evaluate the
in text comprehension: justification explanation capability of the situation model for understanding
and introspective explanation. The justification context-dependent situations and for grounding to
explanation only provides a collection of support- non-textual information and ensure the substantive
ing facts for making a certain decision, whereas validity by creating shortcut-proof questions and
the introspective explanation provides the deriva- designing an explanatory task formulation.
tion of the answer for making the decision, which
Acknowledgments
can cover linguistic phenomena and commonsense
knowledge not explicitly mentioned in the text. The authors would like to thank Xanh Ho for help-
They annotate multi-hop reasoning questions with ing create the dataset list and the anonymous re-
introspective explanation and propose a task that viewers for their insightful comments. This work
requires the derivation of the correct answer of a was supported by JSPS KAKENHI Grant Num-
given question to improve the explainability. Ra- ber 18H03297, JST ACT-X Grant Number JPM-
jani et al. (2019) collect human explanations for JAX190G, and JST PRESTO Grant Number JP-
commonsense reasoning and improve the system’s MJPR20C4.

References Michael Boratko, Harshit Padigela, Divyendra Mikki-
lineni, Pritish Yuvraj, Rajarshi Das, Andrew McCal-
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar- lum, Maria Chang, Achille Fokoue-Nkoutche, Pavan
garet Mitchell, Dhruv Batra, C. Lawrence Zitnick, Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik
and Devi Parikh. 2015. VQA: Visual question an- Talamadupula, and Michael Witbrock. 2018. A sys-
swering. In Proceedings of the IEEE International tematic classification of knowledge, reasoning, and
Conference on Computer Vision, pages 2425–2433. context within the ARC dataset. In Proceedings
of the Workshop on Machine Reading for Question
Max Bartolo, Alastair Roberts, Johannes Welbl, Sebas- Answering, pages 60–70. Association for Computa-
tian Riedel, and Pontus Stenetorp. 2020. Beat the tional Linguistics.
AI: Investigating adversarial human annotation for
reading comprehension. Transactions of the Associ- Xavier Bouthillier, César Laurent, and Pascal Vincent.
ation for Computational Linguistics, 8:662–678. 2019. Unreproducible research is reproducible. In
Proceedings of the 36th International Conference
Regina Barzilay and Mirella Lapata. 2008. Modeling on Machine Learning, volume 97 of Proceedings of
local coherence: An entity-based approach. Compu- Machine Learning Research, pages 725–734, Long
tational Linguistics, 34(1):1–34. Beach, California, USA. PMLR.

Emily M. Bender and Batya Friedman. 2018. Data Samuel R. Bowman, Gabor Angeli, Christopher Potts,
statements for natural language processing: Toward and Christopher D. Manning. 2015. A large an-
mitigating system bias and enabling better science. notated corpus for learning natural language infer-
Transactions of the Association for Computational ence. In Proceedings of the 2015 Conference on
Linguistics, 6:587–604. Empirical Methods in Natural Language Processing,
pages 632–642. Association for Computational Lin-
Emily M. Bender and Alexander Koller. 2020. Climb- guistics.
ing towards NLU: On meaning, form, and under-
standing in the age of data. In Proceedings of the Christopher J.C. Burges. 2013. Towards the machine
58th Annual Meeting of the Association for Compu- comprehension of text: An essay. Technical re-
tational Linguistics, pages 5185–5198, Online. As- port, Microsoft Research Technical Report MSR-
sociation for Computational Linguistics. TR-2013-125.

Luisa Bentivogli, Elena Cabrio, Ido Dagan, Danilo Vittorio Castelli, Rishav Chakravarti, Saswati Dana,
Giampiccolo, Medea Lo Leggio, and Bernardo Anthony Ferritto, Radu Florian, Martin Franz, Di-
Magnini. 2010. Building textual entailment special- nesh Garg, Dinesh Khandelwal, Scott McCarley,
ized data sets: a methodology for isolating linguis- Michael McCawley, Mohamed Nasr, Lin Pan, Cezar
tic phenomena relevant to inference. In Proceed- Pendus, John Pitrelli, Saurabh Pujar, Salim Roukos,
ings of the Seventh International Conference on Lan- Andrzej Sakrajda, Avi Sil, Rosario Uceda-Sosa,
guage Resources and Evaluation (LREC’10), Val- Todd Ward, and Rong Zhang. 2020. The TechQA
letta, Malta. European Language Resources Associ- dataset. In Proceedings of the 58th Annual Meet-
ation (ELRA). ing of the Association for Computational Linguistics,
pages 1269–1278, Online. Association for Computa-
Chandra Bhagavatula, Ronan Le Bras, Chaitanya tional Linguistics.
Malaviya, Keisuke Sakaguchi, Ari Holtzman, Han-
Danqi Chen. 2018. Neural Reading Comprehension
nah Rashkin, Doug Downey, Wen tau Yih, and Yejin
and Beyond. Ph.D. thesis, Stanford University.
Choi. 2020. Abductive commonsense reasoning. In
International Conference on Learning Representa- Danqi Chen, Adam Fisch, Jason Weston, and Antoine
tions. Bordes. 2017. Reading wikipedia to answer open-
domain questions. In Proceedings of the 55th An-
Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob nual Meeting of the Association for Computational
Andreas, Yoshua Bengio, Joyce Chai, Mirella Lap- Linguistics (Volume 1: Long Papers), pages 1870–
ata, Angeliki Lazaridou, Jonathan May, Aleksandr 1879. Association for Computational Linguistics.
Nisnevich, Nicolas Pinto, and Joseph Turian. 2020.
Experience grounds language. In Proceedings of the Jifan Chen and Greg Durrett. 2019. Understanding
2020 Conference on Empirical Methods in Natural dataset design choices for multi-hop reasoning. In
Language Processing (EMNLP), pages 8718–8735, Proceedings of the 2019 Conference of the North
Online. Association for Computational Linguistics. American Chapter of the Association for Compu-
tational Linguistics: Human Language Technolo-
Michael Boratko, Xiang Li, Tim O’Gorman, Rajarshi gies, Volume 1 (Long and Short Papers), pages
Das, Dan Le, and Andrew McCallum. 2020. Pro- 4026–4032, Minneapolis, Minnesota. Association
toQA: A question answering dataset for prototypi- for Computational Linguistics.
cal common-sense reasoning. In Proceedings of the
2020 Conference on Empirical Methods in Natural Michael Chen, Mike D’Arcy, Alisa Liu, Jared Fer-
Language Processing (EMNLP), pages 1122–1136, nandez, and Doug Downey. 2019. CODAH: An
Online. Association for Computational Linguistics. adversarially-authored question answering dataset

for common sense. In Proceedings of the 3rd Work- Bhuwan Dhingra, Kathryn Mazaitis, and William W.
shop on Evaluating Vector Space Representations Cohen. 2017. Quasar: Datasets for question answer-
for NLP, pages 63–69, Minneapolis, USA. Associ- ing by search and reading.
ation for Computational Linguistics.
Finale Doshi-Velez and Been Kim. 2018. Consider-
Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan ations for Evaluation and Generalization in Inter-
Xiong, Hong Wang, and William Yang Wang. 2020. pretable Machine Learning, 1st edition. Springer In-
HybridQA: A dataset of multi-hop question answer- ternational Publishing.
ing over tabular and textual data. In Findings of the
Association for Computational Linguistics: EMNLP Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Re-
2020, pages 1026–1036, Online. Association for ichart. 2018. The hitchhiker’s guide to testing statis-
Computational Linguistics. tical significance in natural language processing. In
Proceedings of the 56th Annual Meeting of the As-
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-
sociation for Computational Linguistics (Volume 1:
tau Yih, Yejin Choi, Percy Liang, and Luke Zettle-
Long Papers), pages 1383–1392, Melbourne, Aus-
moyer. 2018. QuAC: Question answering in con-
tralia. Association for Computational Linguistics.
text. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing,
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel
pages 2174–2184, Brussels, Belgium. Association
Stanovsky, Sameer Singh, and Matt Gardner. 2019.
for Computational Linguistics.
DROP: A reading comprehension benchmark requir-
Christopher Clark, Kenton Lee, Ming-Wei Chang, ing discrete reasoning over paragraphs. In Proceed-
Tom Kwiatkowski, Michael Collins, and Kristina ings of the 2019 Conference of the North American
Toutanova. 2019. BoolQ: Exploring the surprising Chapter of the Association for Computational Lin-
difficulty of natural yes/no questions. In Proceed- guistics: Human Language Technologies, Volume 1
ings of the 2019 Conference of the North American (Long and Short Papers), pages 2368–2378, Min-
Chapter of the Association for Computational Lin- neapolis, Minnesota. Association for Computational
guistics: Human Language Technologies, Volume 1 Linguistics.
(Long and Short Papers), pages 2924–2936, Min-
neapolis, Minnesota. Association for Computational Jesse Dunietz, Greg Burnham, Akash Bharadwaj,
Linguistics. Owen Rambow, Jennifer Chu-Carroll, and Dave Fer-
rucci. 2020. To test machine comprehension, start
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, by defining comprehension. In Proceedings of the
Ashish Sabharwal, Carissa Schoenick, and Oyvind 58th Annual Meeting of the Association for Compu-
Tafjord. 2018. Think you have solved question an- tational Linguistics, pages 7839–7859, Online. As-
swering? Try ARC, the AI2 reasoning challenge. sociation for Computational Linguistics.
CoRR, abs/1803.05457.
Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur
Ido Dagan, Oren Glickman, and Bernardo Magnini. Guney, Volkan Cirik, and Kyunghyun Cho. 2017.
2006. The pascal recognising textual entailment SearchQA: A new Q&A dataset augmented with
challenge. In Machine Learning Challenges Work- context from a search engine.
shop, pages 177–190. Springer.
Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer,
Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen-tau
Pedro Rodriguez, and Jordan Boyd-Graber. 2018.
Yih, and Peter Clark. 2018. Tracking state changes
Pathologies of neural models make interpretations
in procedural text: a challenge dataset and models
difficult. In Proceedings of the 2018 Conference on
for process paragraph comprehension. In Proceed-
Empirical Methods in Natural Language Processing,
ings of the 2018 Conference of the North American
pages 3719–3728. Association for Computational
Chapter of the Association for Computational Lin-
Linguistics.
guistics: Human Language Technologies, Volume
1 (Long Papers), pages 1595–1604. Association for
James Ferguson, Matt Gardner, Hannaneh Hajishirzi,
Computational Linguistics.
Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A
Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, dataset of incomplete information reading compre-
Noah A. Smith, and Matt Gardner. 2019. Quoref: hension questions. In Proceedings of the 2020 Con-
A reading comprehension dataset with questions reference on Empirical Methods in Natural Language
quiring coreferential reasoning. In Proceedings of Processing (EMNLP), pages 1137–1147, Online. As-
the 2019 Conference on Empirical Methods in Nat- sociation for Computational Linguistics.
ural Language Processing and the 9th International
Joint Conference on Natural Language Processing Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eu-
(EMNLP-IJCNLP), pages 5927–5934, Hong Kong, nsol Choi, and Danqi Chen. 2019. MRQA 2019
China. Association for Computational Linguistics. shared task: Evaluating generalization in reading
comprehension. In Proceedings of the 2nd Work-
Ernest Davis and Gary Marcus. 2015. Commonsense shop on Machine Reading for Question Answering,
reasoning and commonsense knowledge in artificial pages 1–13, Hong Kong, China. Association for
intelligence. Commun. ACM, 58(9):92–103. Computational Linguistics.

R Michael Furr. 2018. Psychometrics: an introduction. Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao,
Sage Publications. Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu,
Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Wang. 2018. DuReader: a Chinese machine read-
Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, ing comprehension dataset from real-world appli-
Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, cations. In Proceedings of the Workshop on Ma-
Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, chine Reading for Question Answering, pages 37–
Daniel Khashabi, Kevin Lin, Jiangming Liu, Nel- 46, Melbourne, Australia. Association for Computa-
son F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer tional Linguistics.
Singh, Noah A. Smith, Sanjay Subramanian, Reut
Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-
2020. Evaluating models’ local decision boundaries stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
via contrast sets. In Findings of the Association and Phil Blunsom. 2015. Teaching machines to read
for Computational Linguistics: EMNLP 2020, pages and comprehend. In C. Cortes, N. D. Lawrence,
1307–1323, Online. Association for Computational D. D. Lee, M. Sugiyama, and R. Garnett, editors,
Linguistics. Advances in Neural Information Processing Systems
28, pages 1693–1701. Curran Associates, Inc.
Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi,
Alon Talmor, and Sewon Min. 2019. On making José Hernández-Orallo. 2017. The measure of all
reading comprehension more comprehensive. In minds: evaluating natural and artificial intelligence.
Proceedings of the 2nd Workshop on Machine Read- Cambridge University Press.
ing for Question Answering, pages 105–112, Hong
Kong, China. Association for Computational Lin- Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia
guistics. Polosukhin, Andrew Fandrianto, Jay Han, Matthew
Kelcey, and David Berthelot. 2016. WikiReading: A
Timnit Gebru, Jamie Morgenstern, Briana Vec- novel large-scale language understanding task over
chione, Jennifer Wortman Vaughan, Hanna Wal- wikipedia. In Proceedings of the 54th Annual Meet-
lach, Hal Daumé III, and Kate Crawford. 2018. ing of the Association for Computational Linguistics
Datasheets for datasets. ArXiv preprint 1803.09010. (Volume 1: Long Papers), pages 1535–1545, Berlin,
Germany. Association for Computational Linguis-
Robert Geirhos, Jörn-Henrik Jacobsen, Claudio tics.
Michaelis, Richard Zemel, Wieland Brendel,
Matthias Bethge, and Felix A. Wichmann. 2020. Felix Hill, Antoine Bordes, Sumit Chopra, and Jason
Shortcut learning in deep neural networks. Nature Weston. 2016. The goldilocks principle: Reading
Machine Intelligence, 2(11):665–673. children’s books with explicit memory representa-
tions. In International Conference on Learning Rep-
Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Ba- resentations.
jwa, Michael Specter, and Lalana Kagal. 2018. Ex-
plaining explanations: An overview of interpretabil- Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara,
ity of machine learning. In 2018 IEEE 5th Interna- and Akiko Aizawa. 2020. Constructing a multi-
tional Conference on data science and advanced an- hop QA dataset for comprehensive evaluation of
alytics (DSAA), pages 80–89. IEEE. reasoning steps. In Proceedings of the 28th Inter-
national Conference on Computational Linguistics,
Arthur C. Graesser, Murray Singer, and Tom Trabasso. pages 6609–6625, Barcelona, Spain (Online). Inter-
1994. Constructing inferences during narrative text national Committee on Computational Linguistics.
comprehension. Psychological review, 101(3):371.
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and
Suchin Gururangan, Swabha Swayamdipta, Omer Yejin Choi. 2019. Cosmos QA: Machine reading
Levy, Roy Schwartz, Samuel Bowman, and Noah A. comprehension with contextual commonsense rea-
Smith. 2018. Annotation artifacts in natural lan- soning. In Proceedings of the 2019 Conference on
guage inference data. In Proceedings of the 2018 Empirical Methods in Natural Language Processing
Conference of the North American Chapter of the and the 9th International Joint Conference on Natu-
Association for Computational Linguistics: Human ral Language Processing (EMNLP-IJCNLP), pages
Language Technologies, Volume 2 (Short Papers), 2391–2401, Hong Kong, China. Association for
pages 107–112. Association for Computational Lin- Computational Linguistics.
guistics.
Naoya Inoue, Pontus Stenetorp, and Kentaro Inui. 2020.
Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, R4C: A benchmark for evaluating RC systems to get
and Benno Stein. 2018. The argument reasoning the right answer for the right reason. In Proceedings
comprehension task: Identification and reconstruc- of the 58th Annual Meeting of the Association for
tion of implicit warrants. In Proceedings of the 2018 Computational Linguistics, pages 6740–6750, On-
Conference of the North American Chapter of the line. Association for Computational Linguistics.
Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Papers), Alon Jacovi and Yoav Goldberg. 2020. Towards faith-
pages 1930–1940, New Orleans, Louisiana. Associ- fully interpretable NLP systems: How should we de-
ation for Computational Linguistics. fine and evaluate faithfulness? In Proceedings of the

You can also read