Benchmarking Machine Reading Comprehension: A Psychological Perspective

Saku Sugawara1, Pontus Stenetorp2, Akiko Aizawa1
1 National Institute of Informatics, 2 University College London
{saku,aizawa}@nii.ac.jp, p.stenetorp@cs.ucl.ac.uk

arXiv:2004.01912v2 [cs.CL] 26 Jan 2021

Abstract

Machine reading comprehension (MRC) has received considerable attention as a benchmark for natural language understanding. However, the conventional task design of MRC lacks explainability beyond the model interpretation, i.e., reading comprehension by a model cannot be explained in human terms. To this end, this position paper provides a theoretical basis for the design of MRC datasets based on psychology as well as psychometrics, and summarizes it in terms of the prerequisites for benchmarking MRC. We conclude that future datasets should (i) evaluate the capability of the model for constructing a coherent and grounded representation to understand context-dependent situations and (ii) ensure substantive validity by shortcut-proof questions and explanation as a part of the task design.

1 Introduction

Evaluation of natural language understanding (NLU) is a long-standing goal in the field of artificial intelligence. Machine reading comprehension (MRC) is a task that tests the ability of a machine to read and understand unstructured text and could be the most suitable task for evaluating NLU because of its generic formulation (Chen, 2018). Recently, many large-scale datasets have been proposed, and deep learning systems have achieved human-level performance for some of these datasets.

However, analytical studies have shown that MRC models do not necessarily achieve human-level understanding. For example, Jia and Liang (2017) use manually crafted adversarial examples to show that successful systems are easily distracted. Sugawara et al. (2020) show that a significant part of already solved questions is solvable even after shuffling the words in a sentence or dropping content words. These studies demonstrate that we cannot explain what type of understanding is required by the datasets and is actually acquired by models. Although benchmarking MRC is related to the intent behind questions and is critical to test hypotheses from a top-down viewpoint (Bender and Koller, 2020), its theoretical foundation is poorly investigated in the literature.

In this position paper, we examine the prerequisites for benchmarking MRC based on the following two questions: (i) What does reading comprehension involve? (ii) How can we evaluate it? Our motivation is to provide a theoretical basis for the creation of MRC datasets. As Gilpin et al. (2018) indicate, interpreting the internals of a system is closely related to only the system's architecture and is insufficient for explaining how the task is accomplished. This is because even if the internals of models can be interpreted, we cannot explain what is measured by the datasets. Therefore, our study focuses on the explainability of the task rather than the interpretability of models.

We first overview MRC and review the analytical literature that indicates that existing datasets might fail to correctly evaluate their intended behavior (Section 2). Subsequently, we present a psychological study of human reading comprehension in Section 3 for answering the what question. We argue that the concept of representation levels can serve as a conceptual hierarchy for organizing the technologies in MRC. Section 4 focuses on answering the how question. Here, we implement psychometrics to analyze the prerequisites for the task design of MRC. Furthermore, we introduce the concept of construct validity, which emphasizes validating the interpretation of the task's outcome. Finally, in Section 5, we explain the application of the proposed concepts into practical approaches, highlighting potential future directions toward the advancement of MRC. Regarding the what question, we indicate that datasets should evaluate the capability of the situation model, which refers to the construction of a coherent and grounded representation of text based on human understanding. Regarding the how question, we argue that among the important aspects of the construct validity, substantive validity must be ensured, which requires the verification of the internal mechanism of comprehension.
Table 1 provides an overview of the perspectives taken in this paper. Our answers and suggestions to the what and how questions are summarized as follows: (1) Reading comprehension is the process of creating a situation model that best explains given texts and the reader's background knowledge. The situation model should be the next focal point in future datasets for benchmarking human-level reading comprehension. (2) To evaluate reading comprehension correctly, the task needs to provide a rubric (scoring guide) for sufficiently covering the aspects of the construct validity. In particular, the substantive validity should be ensured by creating shortcut-proof questions and by designing a task formulation that is explanatory itself.

Table 1: Overview of theoretical foundations, requirements, and future directions of MRC discussed in this paper.
- What is reading comprehension?
  Foundation: Representation levels in human reading comprehension: (A) surface structure, (B) textbase, and (C) situation model.
  Requirements: (A) Linguistic-level sentence understanding, (B) comprehensiveness of skills for inter-sentence understanding, and (C) evaluation of coherent representation grounded to non-textual information.
  Future direction: (C) Dependence of context on defeasibility and novelty, and grounding to non-textual information with a long passage.
- How can we evaluate reading comprehension?
  Foundation: Construct validity in psychometrics: (1) content, (2) substantive, (3) structural, (4) generalizability, (5) external, and (6) consequential aspects.
  Requirements: (1) Wide coverage of skills, (2) evaluation of the internal process, (3) structured metrics, (4) reliability of metrics, (5) comparison with external variables, and (6) robustness to adversarial attacks and social biases.
  Future direction: (2) Creating shortcut-proof questions by filtering and ablation, and designing a task for validating the internal process.

2 Task Overview

2.1 Task Variations and Existing Datasets

MRC is a task in which a machine is given a document (context) and it answers questions based on the context. Burges (2013) provides a general definition of MRC, i.e., a machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, that machine can provide a string which those speakers would agree both answers that question. We overview various aspects of the task along with representative datasets as follows. Existing datasets are listed in Appendix A.

Context Styles A context can be given in various forms with different lengths such as a single passage (MCTest (Richardson et al., 2013)), a set of passages (HotpotQA (Yang et al., 2018)), a longer document (CBT (Hill et al., 2016)), or open domain (Chen et al., 2017). In some datasets, a context includes non-textual information such as images (RecipeQA (Yagcioglu et al., 2018)).

Question Styles A question can be an interrogative sentence (in most datasets), a fill-in-the-blank sentence (cloze) (CLOTH (Xie et al., 2018)), knowledge base entries (QAngaroo (Welbl et al., 2018)), or search engine queries (MS MARCO (Nguyen et al., 2016)).

Answering Styles An answer can be (i) chosen from a text span of the given document (answer extraction) (NewsQA (Trischler et al., 2017)), (ii) chosen from a candidate set of answers (multiple choice) (MCTest (Richardson et al., 2013)), or (iii) generated as free-form text (description) (NarrativeQA (Kočiský et al., 2018)). Some datasets optionally allow answering by a yes/no reply (BoolQ (Clark et al., 2019)).

Sourcing Methods Initially, questions in small-scale datasets were created by experts (QA4MRE (Sutcliffe et al., 2013)). Later, fueling the development of neural models, most published datasets have more than a hundred thousand questions that are automatically created (CNN/Daily Mail (Hermann et al., 2015)), crowdsourced (SQuAD v1.1 (Rajpurkar et al., 2016)), or collected from examinations (RACE (Lai et al., 2017)).

Domains The most popular domain is Wikipedia articles (Natural Questions (Kwiatkowski et al., 2019)), but news articles are also used (Who-did-What (Onishi et al., 2016)). CliCR (Suster and Daelemans, 2018) and emrQA (Pampari et al., 2018) are datasets in the clinical domain. DuoRC (Saha et al., 2018) uses movie scripts.

Specific Skills Several recently proposed datasets require specific skills including unanswerable questions (SQuAD v2.0 (Rajpurkar et al., 2018)), dialogues (CoQA (Reddy et al., 2019), DREAM (Sun et al., 2019)), multiple-sentence reasoning (MultiRC (Khashabi et al., 2018)), multi-hop reasoning (HotpotQA (Yang et al., 2018)), mathematical and set reasoning (DROP (Dua et al., 2019)), commonsense reasoning (CosmosQA (Huang et al., 2019)), coreference resolution (QuoRef (Dasigi et al., 2019)), and logical reasoning (ReClor (Yu et al., 2020)).

2.2 Benchmarking Issues

In some datasets, the performance of machines has already reached human-level performance. However, Jia and Liang (2017) indicate that models can easily be fooled by manual injection of distracting sentences. Their study revealed that questions simply gathered by crowdsourcing without careful guidelines or constraints are insufficient to evaluate precise language understanding.

This argument is supported by further studies across a variety of datasets. For example, Min et al. (2018) find that more than 90% of the questions in SQuAD (Rajpurkar et al., 2016) require obtaining an answer from a single sentence despite being provided with a passage. Sugawara et al. (2018) show that large parts of twelve datasets are easily solved only by looking at a few first question tokens and attending to the similarity between the given questions and the context. Similarly, Feng et al. (2018) and Mudrakarta et al. (2018) demonstrate that models trained on SQuAD do not change their predictions even when the question tokens are partly dropped. Kaushik and Lipton (2018) also observe that question- and passage-only models perform well for some popular datasets. Min et al. (2019) and Chen and Durrett (2019) concurrently indicate that for multi-hop reasoning datasets, the questions are solvable only with a single paragraph and thus do not require multi-hop reasoning over multiple paragraphs. Zellers et al. (2019b) report that their dataset unintentionally contains stylistic biases in the answer options which are embedded by a language-based model.

Overall, these investigations highlight a grave issue of the task design, i.e., even if the models achieve human-level accuracies, we cannot prove that they successfully perform reading comprehension. This issue may be attributed to the low interpretability of black-box neural network models. However, a problem is that we cannot explain what is measured by the datasets even if we can interpret the internals of models. We speculate that this benchmarking issue in MRC can be attributed to the following two points: (i) we do not have a comprehensive theoretical basis of reading comprehension for specifying what we should ask (Section 3) and (ii) we do not have a well-established methodology for creating a dataset and for analyzing a model based on it (Section 4).[1] In the remainder of this paper, we argue that these issues can be addressed by using insights from the psychological study of reading comprehension and by implementing psychometric means of validation.

[1] These two issues loosely correspond to the plausibility and faithfulness of explanation (Jacovi and Goldberg, 2020). The plausibility is linked to what we expect as an explanation, whereas the faithfulness refers to how accurately we explain models' reasoning process.

3 Reading Comprehension from Psychology to MRC

3.1 Computational Model in Psychology

Human text comprehension has been studied in psychology for a long time (Kintsch and Rawson, 2005; Graesser et al., 1994; Kintsch, 1988). Connectionist and computational architectures have been proposed for such comprehension, including a mechanism pertinent to knowledge activation and memory storing. Among the computational models, the construction–integration (CI) model is the most influential and provides a strong foundation of the field (McNamara and Magliano, 2009).

The CI model assumes three different representation levels as follows:
• Surface structure is the linguistic information of particular words, phrases, and syntax obtained by decoding the raw textual input.
• Textbase is a set of propositions in the text, where the propositions are locally connected by inferences (microstructure).
• Situation model is a situational and coherent mental representation in which the propositions are globally connected (macrostructure), and it is often grounded to not only texts but also to sounds, images, and background information.
The CI model first decodes textual information (i.e., the surface structure) from the raw textual input, then creates the propositions (i.e., textbase) and their local connections, occasionally using the reader's knowledge (construction), and finally constructs a coherent representation (i.e., situation model) that is organized according to five dimensions including time, space, causation, intentionality, and objects (Zwaan and Radvansky, 1998), which provides a global description of the events (integration). These steps are not exclusive, i.e., propositions are iteratively updated in accordance with the surrounding ones with which they are linked. Although the definition of successful text comprehension can vary, Hernández-Orallo (2017) indicates that comprehension implies the process of creating (or searching for) a situation model that best explains the given text and the reader's background knowledge (Zwaan and Radvansky, 1998). We use this definition to highlight that the creation of a situation model plays a vital role in human reading comprehension.

Our aim in this section is to provide a basis for explaining what reading comprehension is, which requires terms for explanation. In the computational model above, the representation levels appear to be useful for organizing such terms. We ground existing NLP technologies and tasks to different representation levels in the next section.

3.2 Skill Hierarchy for MRC

Here, we associate the existing NLP tasks with the three representation levels introduced above. The biggest advantage of MRC is its general formulation, which makes it the most general task for evaluating NLU. This emphasizes the importance of the requirement of various skills in MRC, which can serve as the units for the explanation of reading comprehension. Therefore, our motivation is to provide an overview of the skills as a hierarchical taxonomy and to highlight the missing aspects in existing MRC datasets that are required for comprehensively covering the representation levels.

Existing Taxonomies We first provide a brief overview of the existing taxonomies of skills in NLU tasks. For recognizing textual entailment (Dagan et al., 2006), several studies present a classification of reasoning and commonsense knowledge (Bentivogli et al., 2010; Sammons et al., 2010; LoBue and Yates, 2011). For scientific question answering, Jansen et al. (2016) categorize knowledge and inference for an elementary-level dataset. Similarly, Boratko et al. (2018) propose types of knowledge and reasoning for scientific questions in MRC (Clark et al., 2018). A limitation of both these studies is that the proposed sets of knowledge and inference are limited to the domain of elementary-level science. Although some existing datasets for MRC have their own classifications of skills, they are coarse and only cover a limited extent of typical NLP tasks (e.g., word matching and paraphrasing). In contrast, for a more generalizable definition, Sugawara et al. (2017) propose a set of 13 skills for MRC. Rogers et al. (2020) pursue this direction by proposing a set of questions with eight question types. In addition, Schlegel et al. (2020) propose an annotation schema to investigate requisite knowledge and reasoning. Dunietz et al. (2020) propose a template of understanding that consists of spatial, temporal, causal, and motivational questions to evaluate precise understanding of narratives with reference to human text comprehension.

In what follows, we describe the three representation levels that basically follow the three representations of the CI model but are modified for MRC. The three levels are shown in Figure 1. We emphasize that we do not intend to create exhaustive and rigid definitions of skills. Rather, we aim to place them in a hierarchical organization, which can serve as a foundation to highlight the missing aspects in the current MRC.

Figure 1: Representation levels and corresponding natural language understanding skills.
- Situation model: construct the global structure of propositions. Skills: creating a coherent representation and grounding it to other media.
- Textbase: construct the local relations of propositions. Skills: recognizing relations between sentences such as coreference resolution, knowledge reasoning, and understanding discourse relations.
- Surface structure: creating propositions from the textual input. Skills: syntactic and dependency parsing, POS tagging, SRL, and NER.

Surface Structure This level broadly covers the linguistic information and its semantic meaning, which can be based on the raw textual input. Although these features form a proposition according to psychology, it should be viewed as sentence-level semantic representation in computational linguistics. This level includes part-of-speech tagging, syntactic parsing, dependency parsing, punctuation recognition, named entity recognition (NER), and semantic role labeling (SRL). Although these basic tasks can be accomplished by some recent pretraining-based neural language models (Liu et al., 2019), they are hardly required in NLU tasks including MRC. In the natural language inference task, McCoy et al. (2019) indicate that existing datasets (e.g., Bowman et al. (2015)) may fail to elucidate the syntactic understanding of given sentences. Although it is not obvious that these basic tasks should be included in MRC and it is not easy to circumscribe linguistic knowledge from concrete and abstract knowledge (Zaenen et al., 2005; Manning, 2006), we should always care about the capabilities of basic tasks (e.g., use of checklists (Ribeiro et al., 2020)) when the performance of a model is being assessed.
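As a rough illustration of such a behavioral check (a minimal sketch in the spirit of Ribeiro et al. (2020), not the CheckList library itself), the snippet below probes whether a reader's answer tracks a consistent entity renaming in the passage and question; the `predict` function is a toy stand-in, and the passage and names are invented for the example.

```python
# Minimal CheckList-style capability probe: does the answer follow a consistent
# entity renaming in both passage and question? `predict` is a toy stand-in
# for a real MRC model.

def predict(passage: str, question: str) -> str:
    # Toy baseline: answer with the first word of the passage.
    # In practice, replace this with a call to an actual MRC model.
    return passage.split()[0].strip(".,")

def renaming_check(passage, question, gold, old_name, new_name):
    """Return (correct on original, correct after consistently renaming an entity)."""
    before = predict(passage, question) == gold
    after = predict(passage.replace(old_name, new_name),
                    question.replace(old_name, new_name)) == gold.replace(old_name, new_name)
    return before, after

passage = "Alice climbed out of the tower while her mother was sleeping."
question = "Who climbed out of the tower?"
print(renaming_check(passage, question, "Alice", "Alice", "Dana"))  # (True, True) for the toy model
```

A model that has merely memorized surface associations with a particular name would pass the first check but fail the second.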
Textbase This level covers local relations of propositions in the computational model of reading comprehension. In the context of NLP, it refers to various types of relations linked between sentences. These relations not only include the typical relations between sentences (discourse relations) but also the links between entities. Consequently, this level includes coreference resolution, causality, temporal relations, spatial relations, text structuring relations, logical reasoning, knowledge reasoning, commonsense reasoning, and mathematical reasoning. We also include multi-hop reasoning (Welbl et al., 2018) at this level because it does not necessarily require a coherent global representation over a given context. For studying the generalizability of MRC, Fisch et al. (2019) propose a shared task featuring training and testing on multiple domains. Talmor and Berant (2019) and Khashabi et al. (2020) also find that training on multiple datasets leads to robust generalization. However, unless we make sure that datasets require various skills with sufficient coverage, it might remain unclear whether we evaluate a model's transferability of the reading comprehension ability.

Situation Model This level targets the global structure of propositions in human reading comprehension. It includes a coherent and situational representation of a given context and its grounding to the non-textual information. A coherent representation has well-organized sentence-to-sentence transitions (Barzilay and Lapata, 2008), which are vital for using procedural and script knowledge (Schank and Abelson, 1977). This level also includes characters' goals and plans, meta perspective including the author's intent and attitude, thematic understanding, and grounding to other media. Most existing MRC datasets seem to struggle to target the situation model. We discuss this further in Section 5.1.

Figure 2: Example questions of the different representation levels. The passage is taken from MCTest.
Passage: The princess climbed out the window of the high tower and climbed down the south wall when her mother was sleeping. She wandered out a good way. Finally, she went into the forest where there are no electric poles.
Q1: Who climbed out of the castle? A: Princess
Q2: Where did the princess wander after escaping? A: Forest
Q3: What would happen if her mother was not sleeping? A: The princess would be caught soon (multiple choice)

Example The representation levels in the example shown in Figure 2 are described as follows. Q1 is at the surface-structure level, where a reader only needs to understand the subject of the first event. We expect that Q2 requires understanding of relations among described entities and events at the textbase level; the reader may need to understand who she means using coreference resolution. Escaping in Q2 also requires the reader's commonsense to associate it with the first event. However, the reader might be able to answer this question only by looking for a place (specified by where) described in the passage, thereby necessitating the validity of the question to correctly evaluate the understanding of the described events. Q3 is an example that requires imagining a different situation at the situation-model level, which could be further associated with a grounding question such as which figure best depicts the given passage?

In summary, we indicate that the following features might be missing in existing datasets:
• Considering the capability to acquire basic understanding of the linguistic-level information.
• Ensuring that the questions comprehensively specify and evaluate textbase-level skills.
• Evaluating the capability of the situation model in which propositions are coherently organized and are grounded to non-textual information.
Should MRC models mimic human text comprehension? In this paper, we do not argue that MRC models should mimic human text comprehension. However, when we design an NLU task and create datasets for testing human-like linguistic generalization, we can refer to the aforementioned features to frame the intended behavior to evaluate in the task. As Linzen (2020) discusses, the task design is orthogonal to how the intended behavior is realized at the implementation level (Marr, 1982).

4 MRC on Psychometrics

In this section, we provide a theoretical foundation for the evaluation of MRC models. When MRC measures the capability of reading comprehension, validation of the measurement is crucial to obtain a reliable and useful explanation. Therefore, we focus on psychometrics—a field of study concerned with the assessment of the quality of psychological measurement (Furr, 2018). We expect that the insights obtained from psychometrics can facilitate a better task design. In Section 4.1, we first review the concept of validity in psychometrics. Subsequently, in Section 4.2, we examine the aspects that correspond to construct validity in MRC and then indicate the prerequisites for verifying the intended explanation of MRC in its task design.

4.1 Construct Validity in Psychometrics

According to psychometrics, construct validity is necessary to validate the interpretation of outcomes of psychological experiments.[2] Messick (1995) reports that construct validity consists of the six aspects shown in Table 2. In the design of educational and psychological measurement, these aspects collectively provide verification questions that need to be answered for justifying the interpretation and use of test scores. In this sense, the construct validation can be viewed as an empirical evaluation of the meaning and consequence of measurement. Given that MRC is intended to capture the reading comprehension ability, the task designers need to be aware of these validity aspects. Otherwise, users of the task cannot justify the score interpretation, i.e., it cannot be confirmed that successful systems actually perform the intended reading comprehension.

[2] In psychology, a construct is an abstract concept, which facilitates the understanding of human behavior such as vocabulary, skills, and comprehension.

4.2 Construct Validity in MRC

Table 2 also raises MRC features corresponding to the six aspects of construct validity. In what follows, we elaborate on these correspondences and discuss the missing aspects that are needed to achieve the construct validity of the current MRC.

Table 2: Aspects of the construct validity in psychometrics and corresponding features in MRC.
1. Content. Definition in psychometrics: evidence of content relevance, representativeness, and technical quality. Correspondence in MRC: questions require reading comprehension skills with sufficient coverage and representativeness over the representation levels.
2. Substantive. Definition: theoretical rationales for the observed consistencies in the test responses including task performance of models. Correspondence: questions correctly evaluate the intended intermediate process of reading comprehension and provide rationales to the interpreters.
3. Structural. Definition: fidelity of the scoring structure to the structure of the construct domain at issue. Correspondence: correspondence between the task structure and the score structure.
4. Generalizability. Definition: extent to which score properties and interpretations can be generalized to and across population groups, settings, and tasks. Correspondence: reliability of test scores in correct answers and model predictions, and applicability to other datasets and models.
5. External. Definition: extent to which the assessment scores' relationships with other measures and non-assessment behaviors reflect the expected relations. Correspondence: comparison of the performance of MRC with that of other NLU tasks and measurements.
6. Consequential. Definition: value implications of score interpretation as a basis for the consequences of test use, especially regarding the sources of invalidity related to issues of bias, fairness, and distributive justice. Correspondence: considering the model vulnerabilities to adversarial attacks and social biases of models and datasets to ensure the fairness of model outputs.

Content Aspect As discussed in Section 3, sufficiently covering the skills across all the representation levels is an important requirement of MRC. It may be desirable that an MRC model is simultaneously evaluated on various skill-oriented examples.

Substantive Aspect This aspect appraises the evidence for the consistency of model behavior. We consider that this is the most important aspect for explaining reading comprehension, a process that subsumes various implicit and complex steps. To obtain a consistent response from an MRC system, it is necessary to ensure that the questions correctly assess the internal steps in the process of reading comprehension. However, as stated in Section 2.2, most existing datasets fail to verify that a question is solved by using an intended skill, which implies that it cannot be proved that a successful system can actually perform the intended comprehension.

Structural Aspect Another issue in the current MRC is that datasets only provide simple accuracy as a metric. Given that the substantive aspect necessitates the evaluation of the internal process of reading comprehension, the structure of metrics needs to reflect it. However, only a few studies have attempted to provide a dataset with multiple metrics. For example, Yang et al. (2018) not only ask for the answers to questions but also provide sentence-level supporting facts. This metric can also evaluate the process of multi-hop reasoning whenever the supporting sentences need to be understood for answering a question. Therefore, we need to consider both substantive and structural aspects.
Generalizability Aspect The generalizability of MRC can be understood from the reliability of metrics and the reproducibility of findings. For the reliability of metrics, we need to take care of the reliability of gold answers and model predictions. Regarding the accuracy of answers, the performance of the model becomes unreliable when the answers are unintentionally ambiguous or impractical. Because the gold answers in most datasets are only decided by the majority vote of crowd workers, the ambiguity of the answers is not considered. It may be useful if such ambiguity can be reflected in the evaluation (e.g., using the item response theory (Lalor et al., 2016)). As for model predictions, an issue may be the reproducibility of results (Bouthillier et al., 2019), which implies that the reimplementation of a system generates statistically similar predictions. For the reproducibility of models, Dror et al. (2018) emphasize statistical testing methods to evaluate models. For the reproducibility of findings, Bouthillier et al. (2019) stress it as the transferability of findings in a dataset/task to another dataset/task. In open-domain question answering, Lewis et al. (2021) point out that successful models might only memorize dataset-specific knowledge. To facilitate this transferability, we need to have units of explanation that can be used in different datasets (Doshi-Velez and Kim, 2018).
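To make the item response theory mentioned above concrete, a standard two-parameter logistic formulation (stated here as general background on IRT, not necessarily the exact model used by Lalor et al. (2016)) gives the probability that reader j answers item i correctly as

P(y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\bigl(-a_i(\theta_j - b_i)\bigr)},

where \theta_j is the reader's latent ability, b_i the item difficulty, and a_i the item discrimination. Items whose fitted parameters behave poorly (e.g., near-zero or negative discrimination) are natural candidates for ambiguous or unreliable gold answers.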
External Aspect This aspect refers to the relationship between a model's scores on different tasks. Yogatama et al. (2019) point out that current models struggle to transfer their ability from a task originally trained on (e.g., MRC) to different unseen tasks (e.g., SRL). To develop a general NLU model, one would expect that a successful MRC model should show sufficient performance on other NLU tasks as well. To this end, Wang et al. (2019) propose an evaluation framework with ten different NLU tasks in the same format.

Consequential Aspect This aspect refers to the actual and potential consequences of test use. In MRC, this refers to the use of a successful model in practical situations other than tasks, where we need to ensure the robustness of a model to adversarial attacks and the accountability for unintended model behaviors. Wallace et al. (2019) highlight this aspect by showing that existing NLP models are vulnerable to adversarial examples, thereby generating egregious outputs.

Summary: Design of Rubric Given the validity aspects, our suggestion is to design a rubric (a scoring guide used in education) of what reading comprehension we expect to be evaluated in a dataset; this helps to inspect detailed strengths and weaknesses of models that cannot be obtained only by simple accuracy. The rubric should not only cover various linguistic phenomena (the content aspect) but also involve different levels of intermediate evaluation in the reading comprehension process (the substantive and structural aspects) as well as stress testing with adversarial attacks (the consequential aspect). The rubric has a similar motivation to dataset statements (Bender and Friedman, 2018; Gebru et al., 2018); however, taking the validity aspects into account would improve its substance.

5 Future Directions

This section discusses future potential directions toward answering the what and how questions in Sections 3 and 4. In particular, we infer that the situation model and substantive validity are critical for benchmarking human-level MRC.
5.1 What Question: Situation Model

As mentioned in Section 3, existing datasets fail to fully assess the ability of creating the situation model. As a future direction, we suggest that the task should deal with two features of the situation model: context dependency and grounding.

5.1.1 Context-dependent Situations

A vital feature of the situation model is that it is conditioned on a given text, i.e., a representation is constructed distinctively depending on the given context. We elaborate on this by discussing two key features: defeasibility and novelty.

Defeasibility The defeasibility of a constructed representation implies that a reader can modify and revise it according to newly acquired information (Davis and Marcus, 2015; Schubert, 2015). The defeasibility of NLU has been tackled in the tasks of if-then reasoning (Sap et al., 2019a), abductive reasoning (Bhagavatula et al., 2020), counterfactual reasoning (Qin et al., 2019), and contrast sets (Gardner et al., 2020). A possible approach in MRC is to ask questions against a set of modified passages that describe slightly different situations, where the same question can lead to different conclusions.

Novelty An example showing the importance of contextual novelty is Could a crocodile run a steeplechase? by Levesque (2014). This question poses a novel situation where the solver needs to combine multiple pieces of commonsense knowledge to derive the correct answer. If only non-fiction documents, such as newspaper and Wikipedia articles, are used, some questions require only the reasoning of facts already known in web-based corpora. Fictional narratives may be a better source for creating questions on novel situations.

5.1.2 Grounding to Other Media

In MRC, grounding texts to non-textual information is not fully explored yet. Kembhavi et al. (2017) propose a dataset based on science textbooks, which contains questions with passages, diagrams, and images. Kahou et al. (2018) propose a figure-based question answering dataset that requires the understanding of figures including line plots and bar charts. Although another approach could be vision-based question answering tasks (Antol et al., 2015; Zellers et al., 2019a), we cannot directly use them for evaluating NLU because they focus on understanding of images rather than texts. Similarly to the textbook questions (Kembhavi et al., 2017), a possible approach would be to create questions for understanding of texts through showing figures. We might also need to account for the scope of grounding (Bisk et al., 2020), i.e., ultimately understanding human language in a social context beyond simply associating texts with perceptual information.

5.2 How Question: Substantive Validity

Substantive validity requires us to ensure that the questions correctly assess the internal steps of reading comprehension. We discuss two approaches to this challenge: creating shortcut-proof questions and ensuring explanation by design.

5.2.1 Shortcut-proof Questions

Gururangan et al. (2018) reveal that NLU datasets can contain unintended dataset biases embedded by annotators. If machine learning models exploit such biases for answering questions, we cannot evaluate the precise NLU of models. Following Geirhos et al. (2020), we define shortcut-proof questions as ones that prevent models from exploiting dataset biases and learning decision rules (shortcuts) that perform well only on i.i.d. test examples with regard to their training examples. Gardner et al. (2019) also point out the importance of mitigating shortcuts in MRC. In this section, we view two different approaches to this challenge.

Removing Unintended Biases by Filtering Zellers et al. (2018) propose a model-based adversarial filtering method that iteratively trains an ensemble of stylistic classifiers and uses them to filter out the questions. Sakaguchi et al. (2020) also propose filtering methods based on both machines and humans to alleviate dataset-specific and word-association biases. However, a major issue is the inability to discern knowledge from bias in a closed domain. When the domain is equal to a dataset, patterns that are valid only in the domain are called dataset-specific biases (or annotation artifacts in the labeled data). When the domain covers larger corpora, the patterns (e.g., frequency) are called word-association biases. When the domain includes everyday experience, patterns are called commonsense. However, as mentioned in Section 5.1, commonsense knowledge can be defeasible, which implies that the knowledge can be false in unusual situations. In contrast, when the domain is our real world, indefeasible patterns are called factual knowledge. Therefore, the distinction of bias and knowledge depends on where the pattern is recognized. This means that a dataset should be created so that it can evaluate reasoning on the intended knowledge. For example, to test defeasible reasoning, we must filter out questions that are solvable by usual commonsense only. If we want to investigate the reading comprehension ability without depending on factual knowledge, we can consider counterfactual or fictional situations.
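The model-based filtering described above can be sketched as follows; this is a simplified, assumed pipeline (a single bag-of-words classifier over the answer options standing in for the ensembles of Zellers et al. (2018) and Sakaguchi et al. (2020), with invented field names), not their actual implementation.

```python
# Simplified adversarial-filtering sketch: repeatedly train a weak partial-input
# model (here, a bag-of-words classifier over the answer options only) and drop
# the questions it already solves. The dict fields and the single classifier are
# illustrative assumptions; real pipelines use ensembles, held-out partitions,
# and embedding-based features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def adversarial_filter(questions, rounds=3):
    """questions: list of dicts with 'options' (list of str) and 'label' (int)."""
    kept = list(questions)
    for _ in range(rounds):
        if len(kept) < 20 or len({q["label"] for q in kept}) < 2:
            break
        texts = [" ||| ".join(q["options"]) for q in kept]  # answers only, no passage
        labels = [q["label"] for q in kept]
        features = CountVectorizer().fit_transform(texts)
        preds = cross_val_predict(LogisticRegression(max_iter=1000),
                                  features, labels, cv=5)
        # Keep only the questions that the partial-input model gets wrong.
        kept = [q for q, p in zip(kept, preds) if p != q["label"]]
    return kept
```

Filtering against answer-only (or question-only) models in this way directly targets the partial-input shortcuts discussed in Section 2.2, though it also tends to discard genuinely easy questions.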
Identifying Requisite Skills by Ablating Input Features Another approach is to verify shortcut-proof questions by analyzing the human answerability of questions regarding their key features. We speculate that if a question is still answerable by humans even after removing the intended features, the question does not require understanding of the ablated features (e.g., checking the necessity of resolving pronoun coreference after replacing pronouns with dummy nouns). Even if we cannot accurately identify such necessary features, by identifying partial features of them in a sufficient number of questions, we could expect that the questions evaluate the corresponding intended skill. In a similar vein, Geirhos et al. (2020) argue that a dataset is useful only if it is a good proxy for the underlying ability one is actually interested in.
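A minimal sketch of such an ablation check is given below; the pronoun list is illustrative, and the answerability judgment, which in practice would come from human annotators, is represented by a placeholder function.

```python
# Sketch of an input-ablation check: if a question is still answerable after the
# targeted feature is removed (here, pronouns replaced by an uninformative dummy
# token), it probably does not require coreference resolution. `answerable` is a
# placeholder for a human-annotation step.
import re

PRONOUNS = r"\b(he|she|it|they|him|her|them|his|hers|its|their)\b"

def ablate_pronouns(passage: str, dummy: str = "ENTITY") -> str:
    """Replace pronouns with a dummy noun so coreference can no longer be resolved."""
    return re.sub(PRONOUNS, dummy, passage, flags=re.IGNORECASE)

def requires_coreference(passage, question, answerable):
    """Treat the question as testing coreference only if the ablation breaks it."""
    return answerable(passage, question) and not answerable(ablate_pronouns(passage), question)

print(ablate_pronouns("The princess climbed down the wall when her mother was sleeping."))
# -> The princess climbed down the wall when ENTITY mother was sleeping.
```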
5.2.2 Explanation by Design

Another approach for ensuring the substantive validity is to include explicit explanation in the task formulation. Although gathering human explanations is costly, the following approaches can facilitate the explicit verification of a model's understanding using a few test examples.

Generating Introspective Explanation Inoue et al. (2020) classify two types of explanation in text comprehension: justification explanation and introspective explanation. The justification explanation only provides a collection of supporting facts for making a certain decision, whereas the introspective explanation provides the derivation of the answer for making the decision, which can cover linguistic phenomena and commonsense knowledge not explicitly mentioned in the text. They annotate multi-hop reasoning questions with introspective explanations and propose a task that requires the derivation of the correct answer of a given question to improve the explainability. Rajani et al. (2019) collect human explanations for commonsense reasoning and improve the system's performance by modeling the generation of the explanation. Although we must take into account the faithfulness of explanation, asking for introspective explanations could be useful in inspecting the internal reasoning process, e.g., by extending the task formulation so that it includes auxiliary questions that consider the intermediate facts in a reasoning process. For example, before answering Q2 in Figure 2, a reader should be able to answer who escaped? and where did she escape from? at the surface-structure level.

Creating Dependency Between Questions Another approach for improving the substantive validity is to create dependency between questions by which answering them correctly involves answering some other questions correctly. For example, Dalvi et al. (2018) propose a dataset that requires a procedural understanding of scientific facts. In their dataset, a set of questions corresponds to the steps of the entire process of a scientific phenomenon. Therefore, this set can be viewed as a single question that requires a complete understanding of the scientific phenomenon. In CoQA (Reddy et al., 2019), it is noted that questions often have pronouns that refer back to nouns appearing in previous questions. These mutually dependent questions can probably facilitate the explicit validation of the models' understanding of given texts.

6 Conclusion

In this paper, we outlined current issues and future directions for benchmarking machine reading comprehension. We visited the psychology literature to analyze what we should ask of reading comprehension and the construct validity in psychometrics to analyze how we should correctly evaluate it. We deduced that future datasets should evaluate the capability of the situation model for understanding context-dependent situations and for grounding to non-textual information, and ensure the substantive validity by creating shortcut-proof questions and designing an explanatory task formulation.

Acknowledgments

The authors would like to thank Xanh Ho for helping create the dataset list and the anonymous reviewers for their insightful comments. This work was supported by JSPS KAKENHI Grant Number 18H03297, JST ACT-X Grant Number JPMJAX190G, and JST PRESTO Grant Number JPMJPR20C4.
References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision.

Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of ACL 2020.

Luisa Bentivogli, Elena Cabrio, Ido Dagan, Danilo Giampiccolo, Medea Lo Leggio, and Bernardo Magnini. 2010. Building textual entailment specialized data sets: A methodology for isolating linguistic phenomena relevant to inference. In Proceedings of LREC 2010.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2020. Abductive commonsense reasoning. In International Conference on Learning Representations.

Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, and Joseph Turian. 2020. Experience grounds language. In Proceedings of EMNLP 2020.

Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. 2018. A systematic classification of knowledge, reasoning, and context within the ARC dataset. In Proceedings of the Workshop on Machine Reading for Question Answering.

Michael Boratko, Xiang Li, Tim O'Gorman, Rajarshi Das, Dan Le, and Andrew McCallum. 2020. ProtoQA: A question answering dataset for prototypical common-sense reasoning. In Proceedings of EMNLP 2020.

Xavier Bouthillier, César Laurent, and Pascal Vincent. 2019. Unreproducible research is reproducible. In Proceedings of the 36th International Conference on Machine Learning.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP 2015.

Christopher J.C. Burges. 2013. Towards the machine comprehension of text: An essay. Microsoft Research Technical Report MSR-TR-2013-125.

Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Radu Florian, Martin Franz, Dinesh Garg, Dinesh Khandelwal, Scott McCarley, Michael McCawley, Mohamed Nasr, Lin Pan, Cezar Pendus, John Pitrelli, Saurabh Pujar, Salim Roukos, Andrzej Sakrajda, Avi Sil, Rosario Uceda-Sosa, Todd Ward, and Rong Zhang. 2020. The TechQA dataset. In Proceedings of ACL 2020.

Danqi Chen. 2018. Neural Reading Comprehension and Beyond. Ph.D. thesis, Stanford University.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of ACL 2017.

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In Proceedings of NAACL-HLT 2019.

Michael Chen, Mike D'Arcy, Alisa Liu, Jared Fernandez, and Doug Downey. 2019. CODAH: An adversarially-authored question answering dataset for common sense. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Yang Wang. 2020. HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Findings of EMNLP 2020.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of EMNLP 2018.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop. Springer.

Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen-tau Yih, and Peter Clark. 2018. Tracking state changes in procedural text: A challenge dataset and models for process paragraph comprehension. In Proceedings of NAACL-HLT 2018.

Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In Proceedings of EMNLP-IJCNLP 2019.

Ernest Davis and Gary Marcus. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9):92–103.

Bhuwan Dhingra, Kathryn Mazaitis, and William W. Cohen. 2017. Quasar: Datasets for question answering by search and reading.

Finale Doshi-Velez and Been Kim. 2018. Considerations for Evaluation and Generalization in Interpretable Machine Learning, 1st edition. Springer International Publishing.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of ACL 2018.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL-HLT 2019.

Jesse Dunietz, Greg Burnham, Akash Bharadwaj, Owen Rambow, Jennifer Chu-Carroll, and Dave Ferrucci. 2020. To test machine comprehension, start by defining comprehension. In Proceedings of ACL 2020.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine.

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of EMNLP 2018.

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A dataset of incomplete information reading comprehension questions. In Proceedings of EMNLP 2020.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering.

R. Michael Furr. 2018. Psychometrics: An Introduction. Sage Publications.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets. In Findings of EMNLP 2020.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. On making reading comprehension more comprehensive. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. ArXiv preprint 1803.09010.

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.

Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA).

Arthur C. Graesser, Murray Singer, and Tom Trabasso. 1994. Constructing inferences during narrative text comprehension. Psychological Review, 101(3):371.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of NAACL-HLT 2018.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. The argument reasoning comprehension task: Identification and reconstruction of implicit warrants. In Proceedings of NAACL-HLT 2018.

Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: A Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28.

José Hernández-Orallo. 2017. The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press.

Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Proceedings of ACL 2016.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks principle: Reading children's books with explicit memory representations. In International Conference on Learning Representations.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of COLING 2020.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of EMNLP-IJCNLP 2019.

Naoya Inoue, Pontus Stenetorp, and Kentaro Inui. 2020. R4C: A benchmark for evaluating RC systems to get the right answer for the right reason. In Proceedings of ACL 2020.

Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of ACL 2020.