Retrieve, Rerank, Read, then Iterate: Answering Open-Domain Questions of Arbitrary Complexity from Text
Peng Qi*♠  Haejun Lee*♣  Oghenetegiri “TG” Sido♠  Christopher D. Manning♠
♠ Computer Science Department, Stanford University  ♣ Samsung Research
{pengqi, osido, manning}@cs.stanford.edu, haejun82.lee@samsung.com
* These authors contributed equally.

arXiv:2010.12527v1 [cs.CL] 23 Oct 2020

Abstract

Current approaches to open-domain question answering often make crucial assumptions that prevent them from generalizing to real-world settings, including access to parameterized retrieval systems well-tuned for the task, access to structured metadata like knowledge bases and web links, or a priori knowledge of the complexity of questions to be answered (e.g., single-hop or multi-hop). To address these limitations, we propose a unified system to answer open-domain questions of arbitrary complexity directly from text that works with off-the-shelf retrieval systems on arbitrary text collections. We employ a single multi-task model to perform all the necessary subtasks—retrieving supporting facts, reranking them, and predicting the answer from all retrieved documents—in an iterative fashion. To emulate a more realistic setting, we also constructed a new unified benchmark by collecting about 200 multi-hop questions that require three Wikipedia pages to answer, and combining them with existing datasets. We show that our model not only outperforms state-of-the-art systems on several existing benchmarks that exclusively feature single-hop or multi-hop open-domain questions, but also achieves strong performance on the new benchmark.

1 Introduction

Using knowledge to solve problems is a hallmark of intelligence, and open-domain question answering (QA) is an important means for intelligent systems to make use of the knowledge in large collections of texts. With the help of large-scale datasets based on Wikipedia (Rajpurkar et al., 2016, 2018; Yang et al., 2018; Welbl et al., 2018) and other large corpora (Trischler et al., 2016; Dunn et al., 2017; Talmor and Berant, 2018), the research community has made substantial progress towards tackling this important problem in recent years. However, previous work often makes three assumptions that limit its ability to quickly generalize to new domains, where open-domain QA could be the most helpful.

Firstly, some previous work assumes access to a well-tuned parameterized retrieval system that helps navigate the large amount of text (Das et al., 2019; Lee et al., 2019; Asai et al., 2020; Guu et al., 2020; Khattab et al., 2020; Xiong et al., 2020). These systems make use of rich representation learners like neural networks to encode each document in a large corpus, and retrieve documents with a “query” vector using approximate inner-product search. While they can use gradient-based optimization to update the query generation and document encoding components, their retrieval process is opaque to interpretation, and usually requires a significant amount of computational power to train. Although these systems benefit from unsupervised training, there is usually a big performance gap when compared to their supervised counterparts, and it remains unclear if supervised finetuning will preserve the general applicability of these retrieval systems to other datasets or retrieval tasks.

Secondly, previous work often assumes access to non-textual metadata such as knowledge bases, entity linking, and Wikipedia hyperlinks when retrieving supporting facts (Nie et al., 2019; Feldman and El-Yaniv, 2019; Zhao et al., 2019; Asai et al., 2020; Dhingra et al., 2020; Zhao et al., 2020). While this information is helpful, it is not always readily available in all text collections we might be interested in getting answers from, besides being labor-intensive and time-consuming to collect and maintain. It is therefore desirable to design a system that is capable of extracting knowledge from text without using such metadata, to maximally make use of knowledge available to us in the form of text.
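The parameterized retrieval systems described above encode every document into a vector offline and retrieve with a query vector via (approximate) maximum inner-product search. The following toy illustration uses random vectors as placeholders for learned encoders and an exact search instead of an ANN index; it is not any system used in this paper.

```python
import numpy as np

# Toy maximum inner-product retrieval: documents are pre-encoded into vectors,
# and a query vector selects the top-k documents by inner product. Real systems
# use trained encoders and approximate nearest-neighbor indices.
rng = np.random.default_rng(0)
doc_matrix = rng.standard_normal((10_000, 128))   # offline document encodings (placeholder)
query_vec = rng.standard_normal(128)              # encoded "query" vector (placeholder)

def inner_product_retrieve(query_vec, doc_matrix, k=5):
    scores = doc_matrix @ query_vec               # inner product with every document
    return np.argsort(-scores)[:k]                # indices of the top-k documents

top_doc_ids = inner_product_retrieve(query_vec, doc_matrix)
```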
Lastly, previous systems often assume that every input question in the dataset is either single-hop or multi-hop, and tailor the approach to one of these either in model design or training (Qi et al., 2019; Nie et al., 2019; Zhao et al., 2019; Asai et al., 2020). This limits the practical application of these systems in a real-world task, where there is often a mixture of questions of different levels of complexity without explicit categorization.

To address these limitations, we propose Iterative Retriever, Reranker, and Reader (IRRR), which features a single neural network model that performs all of the subtasks required to answer questions from a large collection of text (see Figure 1). IRRR is designed to leverage off-the-shelf information retrieval systems by generating natural language search queries, which allows it to easily adapt to arbitrary collections of text without requiring well-tuned neural retrieval systems or extra metadata. This further allows users to understand and control IRRR, if necessary, to facilitate trust. Moreover, IRRR iteratively retrieves more context to answer the question, which allows it to easily accommodate questions of different complexity.

[Figure 1: The IRRR question answering pipeline answers a complex question in the HotpotQA dataset by iteratively retrieving, reranking, and reading paragraphs from Wikipedia. The loop is repeated until the answer found is confident enough.]

To evaluate the performance of open-domain QA systems on questions of arbitrary complexity, we further collect 203 human-annotated questions that require information from three Wikipedia pages to answer, combine them with those from the single-hop SQuAD Open (Rajpurkar et al., 2016; Chen et al., 2017) and the multi-hop HotpotQA (Yang et al., 2018), and map this data to a unified version of the English Wikipedia to facilitate system development and evaluation. We show that IRRR not only outperforms state-of-the-art models on the original SQuAD Open and HotpotQA datasets by a large margin, but also achieves strong performance on this new dataset.

To recap, our contributions in this paper are: (1) a unified neural network model that performs all essential subtasks in open-domain QA purely from text (retrieval, reranking, and reading comprehension) and achieves strong results on SQuAD and HotpotQA; (2) a new open-domain QA benchmark that features questions of different levels of complexity on a unified, up-to-date version of Wikipedia, with newly annotated questions that require at least three hops of reasoning, on which our proposed model serves as a strong baseline (for reproducibility, we will release our code and trained models).

2 Method

In this section, we briefly review the task of open-domain question answering, before diving into how we train a unified model to perform all of the subtasks necessary to solve this problem.

2.1 Open-Domain Question Answering

The task of open-domain question answering is concerned with finding the answer a to a question q from a large collection of text D. Successful solutions to this task usually involve two crucial components: an information retrieval system that distills a small set of relevant documents D_r from D, and a reading comprehension system that extracts the answer from it. Chen et al. (2017) presented one of the first neural-network-based approaches to this problem, which was later extended by Wang et al. (2018a) with a reranking system to further reduce the amount of context the reading comprehension component has to consider to improve answer accuracy.
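The two-component retrieve-and-read formulation just described can be sketched as follows; the overlap-based retriever and the reader argument are toy placeholders for illustration, not the components used by IRRR or prior work.

```python
from collections import Counter

def retrieve(question, corpus, k=5):
    """Toy retriever: rank documents by unigram overlap with the question (D_r from D)."""
    q_tokens = Counter(question.lower().split())
    scored = [(sum((Counter(doc.lower().split()) & q_tokens).values()), i)
              for i, doc in enumerate(corpus)]
    return [corpus[i] for _, i in sorted(scored, reverse=True)[:k]]

def retrieve_and_read(question, corpus, reader, k=5):
    """Generic retrieve-then-read pipeline: distill D_r from D, then extract the answer."""
    relevant_docs = retrieve(question, corpus, k)
    return reader(question, relevant_docs)
```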
More recently, Yang et al. (2018) showed that this retrieve-and-read approach towards open-domain question answering is inadequate for more complex questions that require multiple pieces of evidence to answer (e.g., “What is the population of Mark Twain’s hometown?”). Researchers have demonstrated that this can be resolved by retrieving supporting facts by means other than text-based search with the original question, although existing systems often make assumptions that limit their general applicability. For instance, Nie et al. (2019) and Asai et al. (2020) make use of hyperlinks available in Wikipedia, which many open-domain QA datasets are based on, while Qi et al. (2019) consider questions that require exactly two supporting documents to answer. Moreover, most previous work assumes that all questions are either exclusively single-hop or multi-hop, and tailors either model design or the training process to consider one type of question at a time.

2.2 IRRR: Iterative Retriever, Reranker, and Reader

We propose the Iterative Retriever, Reranker, and Reader (IRRR) model, which performs the subtasks involved in an iterative manner to accommodate questions with arbitrary complexity. IRRR aims at building a reasoning path p from the question q, through all the necessary supporting facts d ∈ D_gold, to the answer a. As is shown in Figure 1, IRRR operates in a loop of retrieval, reranking, and reading to expand the reasoning path p with new documents d ∈ D.

Specifically, given a question q, the initial reasoning path contains only the question itself, i.e., p_0 = [q], from which we attempt to generate a search query with IRRR’s retriever to retrieve a set of relevant documents D_1 ⊂ D, which might either help answer the question, or reveal clues about the next piece of evidence we need to answer q. Then, the reader model attempts to append each of the documents in D_1 to answer the question. If more than one answer can be found from these reasoning paths, we predict the answer with the highest answerability score, which we will detail in Section 2.2.2. If no answer can be found, then IRRR’s reranker scores each retrieved paragraph against the current reasoning path, and appends the top-ranked paragraph to the current reasoning path, i.e., $p_{i+1} = p_i + [\arg\max_{d \in D_1} \mathrm{reranker}(p_i, d)]$, before the updated reasoning path is presented to the retriever to generate new search queries. This iterative process is repeated until an answer is predicted from one of the reasoning paths, or until the reasoning path has reached K documents in total, which controls the maximum number of paragraphs of supporting facts required to arrive at the answer.

To reduce computational cost, IRRR is implemented as a multi-task model built on pretrained Transformer models that performs all three subtasks. At a high level, it consists of a Transformer encoder (Vaswani et al., 2017) which takes the question and all retrieved paragraphs so far (the reasoning path p) as input, and one set of task-specific parameters for each task of retrieval, reranking, and reading comprehension (see Figure 2). Specifically, the retriever generates natural language search queries by selecting words from the reasoning path, the reader extracts answers from the reasoning path and abstains if its confidence is not high enough, and the reranker assigns a scalar score to each retrieved paragraph as a potential continuation of the current reasoning path.

[Figure 2: The overall architecture of our IRRR model, which uses a shared Transformer encoder to perform all subtasks of open-domain question answering: token-wise binary prediction for the retriever (query generator), 4-way classification and span prediction for the reader, and NCE classification for the reranker.]

The input to our Transformer encoder is formatted similarly to that of the BERT model (Devlin et al., 2019). For a reasoning path p that consists of the question and t retrieved paragraphs, the input is formatted as “[CLS] question [SEP] title_1 [CONT] context_1 [SEP] · · · title_t [CONT] context_t [SEP]”, where [CLS], [SEP], and [CONT] are special tokens to separate different components of the input.

We will detail each of the task-specific components in the following subsections.
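As a recap of the loop described above, the following is a minimal sketch of one IRRR episode. The helpers `generate_query`, `search`, `read`, and `rerank` are hypothetical stand-ins for the shared-encoder heads and the off-the-shelf search engine; their names and signatures are assumptions, not the released system's interface.

```python
from collections import namedtuple

# Answer carries the predicted text and the answerability score used to pick
# the best candidate and to decide when to stop (see Section 2.2.2).
Answer = namedtuple("Answer", ["text", "answerability"])

def irrr_answer(question, generate_query, search, read, rerank, K=3, top_k=50):
    reasoning_path = [question]                       # p_0 = [q]
    for _ in range(K):
        query = generate_query(reasoning_path)        # retriever head
        candidates = search(query, top_k)             # off-the-shelf text retrieval
        # Try to answer from each candidate extension of the reasoning path.
        answers = [read(reasoning_path + [d]) for d in candidates]
        found = [a for a in answers if a.text != "NOANSWER"]
        if found:
            return max(found, key=lambda a: a.answerability)
        # Otherwise expand the path with the top-reranked paragraph and iterate.
        best = max(candidates, key=lambda d: rerank(reasoning_path, d))
        reasoning_path.append(best)
    return None  # no confident answer within K retrieved paragraphs
```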
2.2.1 Retriever

The goal of the retriever is to generate natural language queries to retrieve relevant documents from an off-the-shelf text-based retrieval engine. This allows IRRR to perform open-domain QA in an explainable and controllable manner, where a user can easily understand the model’s behavior and intervene if necessary. Following the approach proposed by Qi et al. (2019), we propose to extract search queries from the current reasoning path, i.e., the original question and all of the paragraphs that we have already retrieved. This approach stems from the observation that there is usually a strong semantic overlap between the reasoning path and the next paragraph to retrieve. Instead of restricting the query string to always be a substring of the reasoning path as did Qi et al. (2019), in this paper we relax the constraint and instead allow the search query to be any subsequence of the reasoning path, therefore allowing more flexible combinations of search phrases to be used.

To predict these search queries from the reasoning path, we apply a token-wise binary classifier on top of the shared Transformer encoder model to decide whether each token is included in the final query. At training time, we derive supervision signal to train these classifiers with a binary cross entropy loss (which we detail in Section 2.3.1); at test time, we select a cutoff threshold for query words to be included from the reasoning path. In practice, we find that boosting the model to predict more query terms is beneficial to increase the recall of the target paragraphs in retrieval.

We employ Elasticsearch (Gormley and Tong, 2015) as our text-based search engine, and follow previous work to process Wikipedia and search results, which we detail in Appendix B.

2.2.2 Reader

The reader attempts to find the answer given a reasoning path comprised of the question and retrieved paragraphs, and simultaneously assigns an answerability score to this reasoning path to assess the likelihood of having found the answer to the original question. Since not all reasoning paths lead to the final answer, we train a classifier conditioned on the Transformer encoder representation of the [CLS] token to determine whether an answer should be predicted given a reasoning path. Span answers are predicted from the context using a span start classifier and a span end classifier, following Devlin et al. (2019). To support special non-extractive answers like yes/no from HotpotQA, we further include these as options for the classifier to predict in a 4-way classifier.

The answerability score is used to pick the best answer from all the candidate reasoning paths and serves as a stopping criterion for IRRR. We define answerability as the log likelihood ratio between the most likely positive answer and the NOANSWER prediction. For NOANSWER examples, we further train the span classifiers to predict the [CLS] token as the “output span”, and thus we also include the likelihood ratio between the best span and the [CLS] span if the positive answer is a span. This likelihood ratio formulation is less affected by sequence length compared to prediction probability, making it easier to assign a global threshold across reasoning paths of different lengths to stop further retrieval.

2.2.3 Reranker

When the reader fails to find an answer from the reasoning path, the reranker selects one of the retrieved paragraphs to expand it, so that the retriever can generate new search queries to retrieve new context to answer the question. To achieve this, we assign each potential extended reasoning path a score by multiplying the hidden representation of the [CLS] token with a linear transformation, and pick the extension that has the highest score. At training time, we normalize the reranker scores for all retrieved paragraphs, and maximize the log likelihood of selecting gold supporting paragraphs from retrieved ones. To avoid linearly scaling the computational cost of the Transformer encoder with the number of paragraphs retrieved, we employ noise contrastive estimation (NCE) to estimate this likelihood (Mnih and Kavukcuoglu, 2013; Jean et al., 2015), where negative examples are top retrieved paragraphs given the current reasoning path.
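A minimal sketch of the NCE-style reranker objective above: the gold continuation is scored against a handful of sampled negatives (top retrieved paragraphs), the scores are normalized, and the log-likelihood of the gold paragraph is maximized. The scalar scores would come from the linear layer over each candidate path's [CLS] representation; here they are plain numbers, and this is an illustration rather than the authors' training code.

```python
import numpy as np

def reranker_nce_loss(gold_score, negative_scores):
    """Negative log-likelihood of the gold paragraph among the sampled candidates."""
    scores = np.concatenate(([gold_score], np.asarray(negative_scores, dtype=float)))
    log_partition = np.logaddexp.reduce(scores)   # log of the normalization term
    return -(gold_score - log_partition)          # -log softmax(gold)

# Example: the gold continuation scores 2.1; five sampled negatives score lower,
# so the loss is small and shrinks further as the margin grows.
loss = reranker_nce_loss(2.1, [0.3, -0.5, 1.2, 0.0, -1.1])
```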
2.3 Training IRRR

2.3.1 Generating Target Queries with a Dynamic Oracle

Since existing open-domain QA datasets do not include human-annotated search queries, we train the retriever to generate search queries from the reasoning path by deriving supervision signal with a dynamic oracle. Qi et al. (2019) noted that it is important for good search queries to not only contain overlapping terms between the reasoning path and paragraphs to be retrieved, but also encode which terms are more effective in retrieval. Therefore, we derive a dynamic oracle for effective search queries with the help of the search engine.

To reduce computational cost and queries against the search engine, we begin by identifying overlapping spans of text between the reasoning path and the target document for retrieval, and include or exclude entire overlapping spans as a whole. Finding the optimal combination of N overlapping spans is still prohibitively slow and requires O(2^N) operations. To avoid enumerating exponentially many combinations, we approximate the importance of each overlapping span with the following “importance” metric

$$\mathrm{Importance}(s^o_i) = \mathrm{Rank}\big(t, \{s^o_j\}_{j=1,\, j \neq i}^{N}\big) - \mathrm{Rank}\big(t, \{s^o_i\}\big) \qquad (1)$$

where $s^o$ are overlapping spans between the target paragraph $t$ and the current reasoning path, and $\mathrm{Rank}(t, S)$ is the rank of $t$ in the search result when spans $S$ are used as search queries, which is smaller if $t$ is ranked closer to the top in search results. Intuitively, the second term captures the importance of the search term when used alone, and the first captures its importance when combined with all other overlapping spans. After estimating the importance of each overlapping span, we determine the final target query by first sorting all spans by descending importance, then including each span in turn, and selecting the shortest query with the highest rank as the oracle query. The resulting time complexity for generating these oracle queries is thus O(N), i.e., linear in the number of overlapping spans between the reasoning path and the target paragraph.

2.4 Reducing Exposure Bias with Data Augmentation

With the dynamic oracle, we are able to generate target queries to train the retriever model, retrieve documents to train the reranker model, and expand the reasoning path by always choosing a gold paragraph, following Qi et al. (2019). However, since all reasoning paths contain exclusively gold paragraphs, the resulting model is prevented from generalizing to cases where model behavior deviates from the oracle. To address this, we augment the training data by occasionally selecting non-gold paragraphs to append to reasoning paths in the training data, and use the dynamic oracle to generate queries for the model to “recover” from these synthesized retrieval mistakes. We find this data augmentation significantly improves the performance of IRRR in preliminary experiments, and thus report main results with augmented training data.

3 Experiments

Data. In our experiments, we first make use of two standard benchmarks, SQuAD Open and HotpotQA. First introduced by Chen et al. (2017), SQuAD Open designates the development set of the original SQuAD dataset as its test set, which features more than 10,000 questions, each based on a single paragraph in a Wikipedia article. For this dataset, we index the same Wikipedia corpus from 2016 with Elasticsearch to derive training data and evaluate IRRR. Since the authors did not present a standard development set, we split part of the training set to construct a development set roughly as large as the test set. HotpotQA (Yang et al., 2018) features more than 100,000 questions that require the introductory paragraphs of two Wikipedia articles to answer, and we focus on its open-domain “fullwiki” setting in this work. For HotpotQA, we index the introductory paragraphs provided by the authors, which are based on a 2017 Wikipedia dump, for training and evaluation.

          SQuAD Open   HotpotQA   3-hop   Total
  Train   59,285       74,758     0       134,043
  Dev     8,132        5,989      0       14,121
  Test    8,424        5,978      203     14,605
  Total   75,841       86,725     203     162,769

Table 1: Split and statistics of QA examples in the new unified benchmark.

To evaluate the performance of IRRR as well as future QA systems in a more realistic open-domain setting without a pre-specified level of question complexity, we further combine these datasets with 203 newly collected challenge questions to construct a new benchmark. To combine SQuAD Open and HotpotQA into a unified setting, we begin by mapping the paragraphs each question is based on to a more recent version of Wikipedia (in this work, the English Wikipedia dump from August 1st, 2020). We discarded examples where the Wikipedia pages have either been removed or significantly edited such that the answer can no longer be found from paragraphs that are similar enough to the original contexts the questions are based on. As a result, we filtered out 22,328 examples from SQuAD Open, and 18,649 examples from HotpotQA’s fullwiki setting. We annotated 203 challenge questions on the same Wikipedia dump that require paragraphs from three supporting documents to answer, which we add to the test set of the new benchmark to test the generalization capabilities of QA models to this unseen scenario. The statistics of the final dataset can be found in Table 1. To understand the data quality after converting SQuAD Open and HotpotQA to this newer version of Wikipedia, we sampled 100
examples from the training split of each dataset. We find that 94% / 90% of SQuAD / HotpotQA QA pairs remain supported by their context paragraphs, while 6% / 10% are unanswerable (43% of HotpotQA examples contain more than the minimal set of necessary paragraphs to answer the question as a result of the mapping process). For all benchmark datasets, we report standard answer exact match (EM) and unigram F1 metrics.

Training details. We use ELECTRA-Large (Clark et al., 2020) as the pre-trained initialization for our Transformer encoder. We train the model on a combined dataset of SQuAD Open and HotpotQA questions, where we optimize the joint loss of the retriever, reader, and reranker components simultaneously in a multi-task learning fashion. Training data for the retriever and reranker components is derived from the dynamic oracle on the training set of these datasets, where reasoning paths are expanded with oracle queries and by picking the gold paragraphs as they are retrieved for the reader component. We also enhance this training data by expanding reasoning paths with non-gold retrieved paragraphs up to 3 reasoning steps on HotpotQA and 2 on SQuAD Open, and find that this results in a more robust model. After an initial model is finetuned on this expanded training set, we apply our iterative training technique to further reduce exposure bias of the model by generating more data with the trained model and the dynamic oracle.

4 Results

In this section, we present the performance of IRRR when evaluated against previous systems on standard benchmarks, and demonstrate its efficacy on our new, unified benchmark, especially with the help of iterative training.

4.1 Question Answering Performance on Standard Benchmarks

We first compare IRRR against previous systems on standard benchmarks, SQuAD Open and the fullwiki setting of HotpotQA. On each dataset, we compare the performance of IRRR against the best previously published systems, as well as unpublished ones that we find on public leaderboards. For a fair comparison to previous work, we make use of the Wikipedia corpora from Chen et al. (2017) and Yang et al. (2018), respectively, and limit the retriever to retrieve 150 paragraphs of text from Wikipedia at each step of reasoning. We also compare IRRR against the Graph Recurrent Retriever (GRR; Asai et al., 2020) on our newly collected 3-hop question challenge test set, using the authors’ released code and models trained on HotpotQA.

  System                  EM     F1
  DensePR†                38.1   —
  BERTserini‡             38.6   46.1
  MUPPET•                 39.3   46.2
  RE3◦                    41.9   50.2
  Knowledge-aided♠        43.6   53.4
  Multi-passage BERT♥     53.0   60.9
  GRR♣                    56.5   63.8
  Generative OpenQA♦      56.7   —
  IRRR (Ours)             61.8   68.9

Table 2: End-to-end question answering performance on SQuAD Open, evaluated on the same set of documents as Chen et al. (2017). Previous work is denoted with symbols as follows – †: (Karpukhin et al., 2020), ‡: (Yang et al., 2019), •: (Feldman and El-Yaniv, 2019), ◦: (Hu et al., 2019), ♠: (Zhou et al., 2020), ♥: (Wang et al., 2019), ♣: (Asai et al., 2020), ♦: (Izacard and Grave, 2020).

  System                HotpotQA EM   HotpotQA F1   3-hop EM   3-hop F1
  GRR♣                  60.0          73.0          21.7†      29.6†
  Step-by-step⊗         63.0          75.4          —          —
  Recurs. Dense Ret.⊗   62.3          75.3          —          —
  SDS-Net⊗              64.9          78.2          —          —
  IRRR (Ours)           65.7          78.2          32.5       39.9

Table 3: End-to-end question answering performance on HotpotQA and the new 3-hop challenge questions, evaluated on the official HotpotQA Wikipedia paragraphs. ♣ denotes (Asai et al., 2020), and ⊗ represents anonymous/preprint work unavailable at the time of writing of this paper. † indicates results we obtained using the publicly available code and pretrained models.

As can be seen in Tables 2 and 3, IRRR outperforms previously published work on the SQuAD Open and HotpotQA benchmarks by a large margin. It further outperforms more recent unpublished work on HotpotQA that was posted after IRRR was initially submitted to the public leaderboard. On the 3-hop challenge set, we similarly notice a large performance margin between IRRR and GRR, although neither is trained with 3-hop questions. This demonstrates that IRRR generalizes well to questions that are more complex than the ones seen during training.
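The answer EM and unigram F1 metrics reported in these tables follow the common SQuAD-style formulation (lowercasing and stripping punctuation and articles before comparison); the sketch below may differ in small details from the official evaluation scripts.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def unigram_f1(prediction, gold):
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```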
SQuAD Open HotpotQA 80 Dev Test 50 docs/step 80 50 docs/step EM F1 EM F1 Percentage 60 100 docs/step 60 100 docs/step 40 40 150 docs/step 150 docs/step 20 20 0 SQuAD Open 50.65 60.99 60.59 67.51 0 1 2 3 4 5 1 2 3 4 5 HotpotQA 59.01 70.33 58.61 69.86 Paragraphs Retrieved/Question Paragraphs Retrieved/Question 3-hop Challege — — 26.73 35.48 79 62 (1) (5) 78 (5) 61 (1) (5) Table 4: End-to-end question answering performance Answer F1 77 (5) (5) 60 50 docs/step 76 (2) (2) 50 docs/step of IRRR on the unified benchmark, evaluated on the 59 (2) (1) (5) 100 docs/step 75 100 docs/step 2020 copy of Wikipedia. 58 150 docs/step 150 docs/step 74 0 100 200 300 400 500 100 200 300 400 Total Number of Paragraphs Total Number of Paragraphs System SQuAD HotpotQA Figure 3: The retrieval behavior of IRRR and its rela- Ours (joint dataset) 58.69 68.74 tion to the performance of end-to-end question answer- – dynamic stopping (fixed K = 3) 31.70 66.60 – HotpotQA / SQuAD data 54.35 65.40 ing. Top: The distribution of reasoning path lengths as – ELECTRA + BERTLARGE-WWM 57.19 63.86 determined by IRRR. Bottom: Total number of para- graphs retrieved by IRRR vs the end-to-end question Table 5: Ablation study of different design choices in answering performance as measured by answer F1 . IRRR, as evaluated by Answer F1 on the dev set of the unified benchmark. We also demonstrate that on these benchmarks, IRRR’s performance can be further improved by collected 3-hop challenge questions. As is shown incorporating additional data from iterative training in Table 4, IRRR’s performance remains compet- to reduce exposure bias. itive on all questions from different origins in the To better understand the behavior of IRRR on unified benchmark, despite the difference in rea- these benchmarks, we analyze the number of para- soning complexity when answering these questions. graphs retrieved by the model when varying the The model also generalizes to the 3-hop questions number of paragraphs retrieved at each reasoning despite having never been trained on them. We step among {50, 100, 150}. As can be seen in Fig- note that the large performance gap between the ure 3, IRRR effectively reduces the total number development and test settings for SQuAD Open of paragraphs retrieved and read by the model by questions is due to the fact that test set questions stopping its iterative process as soon as all neces- (the original SQuAD dev set) are annoted with sary paragraphs to answer the question have been multiple human answers, while the dev set ones retrieved. Further, we note that the optimal cap for (originally from the SQuAD training set) are not. the number of reasoning steps is larger than the To better understand the contribution of the var- number of gold paragraphs necessary to answer the ious components and techniques we proposed for question on each benchmark, which we find is due IRRR, we performed ablative studies on the model to IRRR’s ability to recover from retrieving and iterating up-to 3 reasoning steps with 50 paragraphs selecting non-gold paragraphs (see the example in for each step, and present the results in Table 5. Table 6). Finally, we note that increasing the num- First of all, we find it is important to allow IRRR to ber of paragraphs retrieved at each reasoning step dynamically stop retrieving paragraphs to answer remains an effective, if computationally expensive, the question. 
Compared to its fixed-step retrieval strategy, to improve the end-to-end performance of counterpart, dynamically stopping IRRR improves IRRR. However, the tradeoff between retrieval bud- F1 on both SQuAD and HotpotQA questions by get and model performance is much more effective 27.0 and 2.1 points respectively (we include further than that of previous work (e.g., GRR), and we note analyses for dynamic stopping in Appendix D). that the queries generated by IRRR is explainable We also find combining SQuAD and HotpotQA to humans and can help humans easily control its datasets beneficial for both datasets in an open- behavior (see Table 6). domain setting, and that ELECTRA is an effective 4.2 Performance on the Unified Benchmark alternative to BERT for this task. To demonstrate the performance of IRRR in a more 5 Related Work realistic setting of open-domain QA, we evaluate it on the new, unified benchmark which features test The availability of large-scale question answer- data from SQuAD Open, HotpotQA, and our newly ing (QA) datasets has greatly contributed to the
Question The Ingerophrynus gollum is named after a character in a book that sold how many copies? Step 1 Ingerophrynus is a genus of true toads with 12 species. ... In 2007 a new species, “Ingerophrynus gollum”, (Non-Gold) was added to this genus. This species is named after the character Gollum created by J. R. R. Tolkien.” Query Ingerophrynus gollum book sold copies J. R. R. Tolkien Step 2 (Gold) Ingerophrynus gollum (Gollum’s toad) is a species of true toad. ... It is called “gollum” with reference of the eponymous character of The Lord of the Rings by J. R. R. Tolkien. Query Ingerophrynus gollum character book sold copies J. R. R. Tolkien true Lord of the Rings Step 3 (Gold) The Lord of the Rings is an epic high fantasy novel written by English author and scholar J. R. R. Tolkien. ... is one of the best-selling novels ever written, with 150 million copies sold. Answer/GT 150 million copies Table 6: An example of IRRR answering a question from HotpotQA by generating natural language queries to retrieve paragragraphs, then rerank them to compsoe reasoning paths and read them to predict the answer. In this example, IRRR recovers from an initial retrieval/reranking mistake by retrieving more paragraphs, before arriving at the gold supporting facts and the correct answer. research progress on open-domain QA. SQuAD et al., 2019; Khattab et al., 2020; Guu et al., 2020; (Rajpurkar et al., 2016, 2018) is among the first Xiong et al., 2020). Aside from the reading compre- question answering datasets adopted for this pur- hension and retrieval components, researchers have pose by Chen et al. (2017) to build QA systems also found reranking search results (Wang et al., over Wikipedia articles. Similarly, TriviaQA (Joshi 2018a) or answer candidates (Wang et al., 2018b; et al., 2017) and Natural Questions (Kwiatkowski Hu et al., 2019). et al., 2019) feature Wikipedia-based questions that To answer open-domain questions that require are written by trivia enthusiasts and extracted from multiple supporting facts, researchers have ex- Google search queries, respectively. While most plored various techniques to extend these retrieve- of these focus on questions that require a local and-read systems, including making use of hyper- context of supporting facts to answer, Yang et al. links between Wikipedia articles (Nie et al., 2019; (2018) presented HotpotQA, which tests whether Feldman and El-Yaniv, 2019; Zhao et al., 2019; open-domain QA systems can generalize to more Asai et al., 2020; Dhingra et al., 2020; Zhao et al., complex questions that require evidence from mul- 2019) and iterative retrieval (Talmor and Berant, tiple documents to answer. Aside from Wikipedia, 2018; Das et al., 2019; Qi et al., 2019). While most researchers have also used news articles (Trischler previous work on iterative retrieval make use of et al., 2016) and search results from the web (Dunn neural retrieval systems that directly accept real et al., 2017; Talmor and Berant, 2018) as the corpus vectors as input, our work is similar to that of for open-domain QA. Qi et al. (2019) in using natural language search Inspired by the TREC QA challenge,4 Chen queries. A crucial distinction between our work et al. (2017) were among the first to combine and previous work on multi-hop open-domain QA, information retrieval systems with accurate neu- however, is that we don’t train models to exclu- ral network-based reading comprehension models sively answer single-hop or multi-hop questions, for open-domain QA. 
Recent work has improved but demonstrate that one single set of parameters open-domain QA performance by enhancing vari- performs well on both tasks. ous components in this retrieve-and-read approach. While much research focused on improving the 6 Conclusion reading comprehension model (Seo et al., 2017; Clark and Gardner, 2018), especially with pre- In this paper, we presented iterative retriever, trained langauge models like BERT (Devlin et al., reranker, and reader (IRRR), a system that uses 2019), more researchers have also demonstrated a single model to perform subtasks to answer open- that neural network-based information retrieval domain questions of arbitrary complexity. This systems achieve competitive, if not better, perfor- system establishes new state of the art on standard mance compared to traditional IR engines (Lee open-domain QA benchmarks as well as the unified new benchmark we present that features questions 4 https://trec.nist.gov/data/qamain. with mixed levels of complexity. html
References Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu- pat, and Ming-Wei Chang. 2020. Realm: Retrieval- Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, augmented language model pre-training. arXiv Richard Socher, and Caiming Xiong. 2020. Learn- preprint arXiv:2002.08909. ing to retrieve reasoning paths over wikipedia graph for question answering. In International Conference Minghao Hu, Yuxing Peng, Zhen Huang, and Dong- on Learning Representations. sheng Li. 2019. Retrieve, read, rerank: Towards end-to-end multi-document reading comprehension. Giusepppe Attardi. 2015. WikiExtractor. https:// In Proceedings of the 57th Annual Meeting of the github.com/attardi/wikiextractor. Association for Computational Linguistics. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open- Gautier Izacard and Edouard Grave. 2020. Lever- domain questions. In Proceedings of the 55th An- aging passage retrieval with generative models for nual Meeting of the Association for Computational open domain question answering. arXiv preprint Linguistics. arXiv:2007.01282. Christopher Clark and Matt Gardner. 2018. Simple Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and effective multi-paragraph reading comprehen- and Yoshua Bengio. 2015. On using very large tar- sion. In Proceedings of the 56th Annual Meeting of get vocabulary for neural machine translation. In the Association for Computational Linguistics (Vol- Proceedings of the 53rd Annual Meeting of the Asso- ume 1: Long Papers). ciation for Computational Linguistics and the 7th In- ternational Joint Conference on Natural Language Kevin Clark, Minh-Thang Luong, Quoc V Le, and Processing (Volume 1: Long Papers). Christopher D Manning. 2020. ELECTRA: Pre- training text encoders as discriminators rather than Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke generators. In International Conference on Learn- Zettlemoyer. 2017. TriviaQA: A large scale dis- ing Representations. tantly supervised challenge dataset for reading com- prehension. In Proceedings of the 55th Annual Meet- Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, ing of the Association for Computational Linguistics and Andrew McCallum. 2019. Multi-step retriever- (Volume 1: Long Papers). reader interaction for scalable open-domain question answering. In International Conference on Learn- V. Karpukhin, Barlas Ouguz, Sewon Min, Patrick ing Representations. Lewis, Ledell Yu Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih. 2020. Dense passage re- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and trieval for open-domain question answering. arXiv, Kristina Toutanova. 2019. BERT: Pre-training of abs/2004.04906. deep bidirectional transformers for language under- standing. In Proceedings of the 2019 Conference of Omar Khattab, Christopher Potts, and Matei Zaharia. the North American Chapter of the Association for 2020. Relevance-guided supervision for openqa Computational Linguistics: Human Language Tech- with colbert. arXiv preprint arXiv:2007.00814. nologies, Volume 1 (Long and Short Papers). Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachan- Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- dran, Graham Neubig, Ruslan Salakhutdinov, and field, Michael Collins, Ankur Parikh, Chris Al- William W. Cohen. 2020. Differentiable reasoning berti, Danielle Epstein, Illia Polosukhin, Jacob De- over a virtual knowledge base. In International Con- vlin, Kenton Lee, Kristina Toutanova, Llion Jones, ference on Learning Representations. Matthew Kelcey, Ming-Wei Chang, Andrew M. 
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Natural questions: A benchmark for question an- Guney, Volkan Cirik, and Kyunghyun Cho. 2017. swering research. Transactions of the Association SearchQA: A new Q&A dataset augmented with for Computational Linguistics, 7. context from a search engine. arXiv preprint arXiv:1704.05179. Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open Yair Feldman and Ran El-Yaniv. 2019. Multi-hop para- domain question answering. In Proceedings of the graph retrieval for open-domain question answering. 57th Annual Meeting of the Association for Compu- In Proceedings of the 57th Annual Meeting of the tational Linguistics. Association for Computational Linguistics. Yuanhua Lv and ChengXiang Zhai. 2011. When doc- Clinton Gormley and Zachary Tong. 2015. Elastic- uments are very long, bm25 fails! In Proceedings search: the definitive guide: a distributed real-time of the 34th international ACM SIGIR conference on search and analytics engine. ” O’Reilly Media, Research and development in Information Retrieval, Inc.”. pages 1103–1104.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Jenny Finkel, Steven J. Bethard, and David Mc- Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Closky. 2014. The Stanford CoreNLP natural lan- Tesauro, Bowen Zhou, and Jing Jiang. 2018a. R3 : guage processing toolkit. In Association for Compu- Reinforced reader-ranker for open-domain question tational Linguistics (ACL) System Demonstrations, answering. In AAAI Conference on Artificial Intelli- pages 55–60. gence. Andriy Mnih and Koray Kavukcuoglu. 2013. Learning Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaox- word embeddings efficiently with noise-contrastive iao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, estimation. In Advances in neural information pro- Gerald Tesauro, and Murray Campbell. 2018b. Ev- cessing systems, pages 2265–2273. idence aggregation for answer re-ranking in open- Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. domain question answering. In International Con- Revealing the importance of semantic retrieval for ference on Learning Representations. machine reading at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natu- Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nal- ral Language Processing and the 9th International lapati, and Bing Xiang. 2019. Multi-passage Joint Conference on Natural Language Processing BERT: A globally normalized BERT model for (EMNLP-IJCNLP). open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Nat- Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and ural Language Processing and the 9th International Christopher D. Manning. 2019. Answering complex Joint Conference on Natural Language Processing open-domain questions through iterative query gen- (EMNLP-IJCNLP). eration. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing Johannes Welbl, Pontus Stenetorp, and Sebastian and the 9th International Joint Conference on Natu- Riedel. 2018. Constructing datasets for multi-hop ral Language Processing (EMNLP-IJCNLP). reading comprehension across documents. Transac- Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. tions of the Association for Computational Linguis- Know what you don’t know: Unanswerable ques- tics, pages 287–302. tions for SQuAD. In Proceedings of the 56th An- nual Meeting of the Association for Computational Wenhan Xiong, Xiang Lorraine Li, Srini Iyer, Jingfei Linguistics (Volume 2: Short Papers). Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Wen tau Yih, Sebastian Riedel, Douwe Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Kiela, and Barlas Oğuz. 2020. Answering com- Percy Liang. 2016. SQuAD: 100,000+ questions for plex open-domain questions with multi-hop dense machine comprehension of text. In Proceedings of retrieval. the 2016 Conference on Empirical Methods in Natu- ral Language Processing. Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. Stephen E Robertson, Steve Walker, Susan Jones, End-to-end open-domain question answering with Micheline M Hancock-Beaulieu, Mike Gatford, et al. BERTserini. In Proceedings of the 2019 Confer- 1995. Okapi at trec-3. Nist Special Publication Sp, ence of the North American Chapter of the As- 109:109. sociation for Computational Linguistics (Demon- Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and strations), Minneapolis, Minnesota. Association for Hannaneh Hajishirzi. 2017. Bidirectional attention Computational Linguistics. 
flow for machine comprehension. In International Conference on Learning Representations. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christo- Alon Talmor and Jonathan Berant. 2018. The web as pher D. Manning. 2018. HotpotQA: A dataset for a knowledge-base for answering complex questions. diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference of the North In Proceedings of the 2018 Conference on Empiri- American Chapter of the Association for Computa- cal Methods in Natural Language Processing. tional Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651. Chen Zhao, Chenyan Xiong, Xin Qian, and Jordan Adam Trischler, Tong Wang, Xingdi Yuan, Justin Har- Boyd-Graber. 2020. Complex factoid question an- ris, Alessandro Sordoni, Philip Bachman, and Ka- swering with a free-text knowledge graph. In Pro- heer Suleman. 2016. NewsQA: A machine compre- ceedings of The Web Conference 2020, pages 1205– hension dataset. arXiv preprint arXiv:1611.09830. 1216. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Song, Paul Bennett, and Saurabh Tiwary. 2019. Kaiser, and Illia Polosukhin. 2017. Attention is all Transformer-xh: Multi-evidence reasoning with ex- you need. In Advances in neural information pro- tra hop attention. In International Conference on cessing systems, pages 5998–6008. Learning Representations.
Mantong Zhou, Zhouxing Shi, Minlie Huang, and Xiaoyan Zhu. 2020. Knowledge-aided open- domain question answering. arXiv preprint arXiv:2006.05244.
A Data processing Split Origin # Entities #QAs In this section, we describe how we process the En- train train 387 77,087 glish Wikipedia and the SQuAD dataset for training dev train 55 10,512 and evaluating IRRR. test dev 48 10,570 For the standard benchmarks (SQuAD Open and HotpotQA fullwiki), we use the Wikipedia cor- Table 7: Statistics of the resplit SQuAD dataset for pora prepared by Chen et al. (2017) and Yang et al. proper training and evaluation on the SQuAD Open set- ting. (2018), respectively, so that our results are com- parable with previous work on these benchmarks. Specifically, for SQuAD Open, we use the pro- Wikipedia articles have been renamed or removed cessed English Wikipedia released by Chen et al. since, we begin by following Wikipedia redirect (2017) which was accessed in 2016, and contains links to locate the current title of the correspnoding 5,075,182 documents.5 For HotpotQA, Yang et al. Wikipedia page (e.g., the page “Madonna (enter- (2018) released a processed set of Wikipedia in- tainer)” has been renamed “Madonna”). After troductory paragraphs from the English Wikipedia the correct Wikipedia article is located, we look originally accessed in October 2017.6 for combinations of one to two consecutive para- While it is established that the SQuAD dev set graphs in the 2020 Wikipedia dump that have high is repurposed as the test set for SQuAD Open for overlap with context paragraphs in these datasets. the ease of evaluation, most previous work make We calculate the recall of words and phrases in use of the entire training set during training, and the original context paragraph (because Wikipedia as a result a proper development set for SQuAD paragraphs are often expanded with more details), Open does not exist, and the evaluation result on the and pick the best combination of paragraphs from test set might be inflated as a result. We therefore the article. If the best candidate has either more resplit the SQuAD training set into a proper devel- than 66% unigrams in the original context, or if opment set that is not used during training, and a there is a common subsequence between the two reduced training set that we use for all of our exper- that covers more than 50% of the original context, iments. As a result, although IRRR is evaluated on we consider the matching successful, and map the the same test set as previous systems, it is likely dis- answers to the new context paragraphs. advantaged due to the reduced amount of training As a result, 20,182/2,146 SQuAD train/dev ex- data and hyperparameter tuning on this new dev set. amples (that is, 17,802/2,380/2,146 train/dev/test We split the training set by first grouping questions examples after data resplit) and 15,806/1,416/1,427 and paragraphs by the Wikipedia entity/title they HotpotQA train/dev/fullwiki test examples have belong to, then randomly selecting entities to add been excluded from the unified benchmark. to the dev set until the dev set contains roughly as many questions as the test set (original SQuAD dev B Elasticsearch Setup set). The statistics of our resplit of SQuAD can be found in Table 7. We will make our resplit publicly We set up Elasticsearch in standard benchmark available to the community. settings (SQuAD Open and HotpotQA fullwiki) For the unified benchmark, we started by pro- following practices in previous work (Chen et al., cessing the English Wikipedia7 with the WikiEx- 2017; Qi et al., 2019), with minor modifications to tractor (Attardi, 2015). 
We then tokenized this unify these approaches. dump and the supporting context used in SQuAD Specifically, to reduce the context size for the and HotpotQA with Stanford CoreNLP 4.0.0 (Man- Transformer encoder in IRRR to avoid unnecessary ning et al., 2014) to look for paragraphs in the computational cost, we primarily index the indi- 2020 Wikipedia dump that might correspond to the vidual paragraphs in the English Wikipedia. To context paragraphs in these datasets. Since many incorporate the broader context from the entire ar- 5 https://github.com/facebookresearch/ ticle, as was done by Chen et al. (2017), we also DrQA index the full text for each Wikipedia article to help 6 https://hotpotqa.github.io/ with scoring candidate paragraphs. Each paragraph wiki-readme.html 7 Accessed on August 1st, 2020, which contains 6,133,150 is associated with the full text of the Wikipedia it articles in total. originated from, and the search score is calculated
Parameter Value SQuAD Open HotpotQA Steps −5 EM F1 EM F1 Learning rate 3 × 10 Batch size 320 Dynamic 49.92 60.91 65.74 78.41 Iteration 10,000 1 step 51.07 61.74 13.75 18.95 Warming-up 1,000 2 step 38.74 48.61 65.12 77.75 Training tokens 1.638 × 109 3 step 32.14 41.66 65.37 78.16 Reranker Candidates 5 4 step 29.06 38.33 63.89 76.72 5 step 19.53 25.86 59.86 72.79 Table 8: Hyperparameter setting for IRRR training. Table 9: SQuAD and HotpotQA performance adaptive vs fixed-length reasoning paths, as measured by an- as the summation of two parts: the similarity be- swer exact match (EM) and F1 . The dynamic stopping tween query terms and the paragraph text, and the criterion employed by IRRR achieves comparable per- formance to its fixed-step counterparts, without knowl- similarity between the query terms and the full text edge of the true number of gold paragraphs. of the article. For query-paragraph similarity, we use the stan- dard BM25 similarity function (Robertson et al., reasoning path retrieval, then move on to the end- 1995) with default hyperparameters (k1 = 1.2, b = to-end performance and leakages in the pipeline, 0.75). For query-article similarity, we find BM25 and end with a few examples to demonstrate typical to be less effective, since the length of these arti- failure modes we have identified that might point cles overwhelm the similarity score stemming from to limitations with the data. important rare query terms, which has also been reported in the information retrieval literature (Lv Effect of Dynamic Stopping. We begin by and Zhai, 2011). Instead of boosting the term fre- studying the effect of using the aswerability score quenty score as considered by Lv and Zhai (2011), as a criterion to stop the iterative retrieval, read- we extend BM25 by taking the square of the IDF ing, and reranking process within IRRR. We com- term and setting the TF normalization term to zero pare the performance of a model with dynamic (b = 0), which is similar to the TF-IDF implemen- stopping to one that is forced to stop at exactly K tation by Chen et al. (2017) that is shown effective steps of reasoning, neither before nor after, where for SQuAD Open. K = 1, 2, . . . 5. As can be seen in Table 9, IRRR’s dynamic stopping criterion based on the answer- Specifically, given a document D and query Q, ability score is very effective in achieving good the score is calculated as end-to-end question answering performance for n X f (D, qi ) · (1 + k1 ) questions of arbitrary complexity without having to score(D, Q) = IDF2+ (qi ) · , specify the complexity of questions ahead of time. f (D, qi ) + k1 i=1 On both SQuAD Open and HotpotQA, it achieves (2) competitive, if not superior question answering per- where IDF+ (qi ) = max(0, log((N − n(qi ) + formance, even without knowing the true number 0.5)/(n(qi ) + 0.5)), with N denoting the total of gold paragraphs necessary to answer each ques- numberr of documents and n(qi ) the document fre- tion. quency of query term qi , and f (qi , D) is the term Aside from this, we note four interesting find- frequency of query term qi in document D. We set ings: (1) the performance of HotpotQA does not k1 = 1.2 in all of our experiments. 
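The article-level score in Equation (2) above is a BM25 variant with the IDF term squared and the TF length normalization disabled (b = 0). Below is a sketch of that formula, assuming the corpus statistics (document frequencies and per-document term frequencies) are supplied by the search engine; it is an illustration of the equation, not Elasticsearch's own similarity implementation.

```python
import math

def idf_plus(doc_freq, num_docs):
    """IDF+(q_i) = max(0, log((N - n(q_i) + 0.5) / (n(q_i) + 0.5)))."""
    return max(0.0, math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5)))

def article_score(query_terms, term_freqs, doc_freqs, num_docs, k1=1.2):
    """score(D, Q) = sum_i IDF+(q_i)^2 * f(D, q_i) * (1 + k1) / (f(D, q_i) + k1)."""
    score = 0.0
    for term in query_terms:
        tf = term_freqs.get(term, 0)                     # f(D, q_i)
        idf = idf_plus(doc_freqs.get(term, 0), num_docs) # squared below
        score += (idf ** 2) * tf * (1 + k1) / (tf + k1)
    return score
```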
peak at two steps of reasoning, but instead is helped by performing a third step of retrieval for the av- C Further Training Details erage question; (2) for both datasets, forcing the model to retrieve more paragraphs after a point con- We include the hyperparameters used to train the sistently hurt QA performance; (3) dynamic stop- IRRR model in Table 8 for reproducibility. ping slightly hurts QA performance on SQuAD Open compared to a fixed number of reasoning D Further Analyses of Model Behavior steps (K = 1); (4) when IRRR is allowed to se- In this section, we perform further analyses and lect a dynamic stopping criterion for each example introduce further case studies to demonstrate the independently, the resulting question answering behavior of the IRRR system. We start by analyz- performance is better than a one-size-fits-all so- ing the effect of the dynamic stopping criterion for lution of applying the same number of reasoning
Question What team was the AFC champion? Step1 (Non- However, the eventual-AFC Champion Cincinnati Bengals, playing in their first AFC Championship Gold) Game, defeated the Chargers 27-7 in what became known as the Freezer Bowl. ...... Step2 (Non- Super Bowl XXVII was an American football game between the American Football Conference (AFC) Gold) champion Buffalo Bills and the National Football Conference (NFC) champion Dallas Cowboys to decide the National Football League (NFL) champion for the 1992 season. ...... Gold Super Bowl 50 was an American football game to determine the champion of the National Foot- ball League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 to earn their third Super Bowl title. ...... Table 10: An example where there are false negative answers in Wikipedia for the question from SQuAD Open. steps to all examples. While the last confirmes the effectiveness of our answerability-based stopping criterion, the cause behind the first three warrants further investigation. We will present further analy- ses to shed light on potential causes of these in the remainder of this section. Case Study for Failure Cases. Besides model inaccuracy, one common reason for IRRR to fail at finding the correct answer provided with the datasets is the existence of false negatives (see Ta- ble 10 for an example from SQuAD Open). We estimate that there are about 9% such cases in the HotpotQA part of the training set, and 26% in the SQuAD part of the training set.