Retrieve, Rerank, Read, then Iterate: Answering Open-Domain Questions of Arbitrary Complexity from Text
Peng Qi*♠  Haejun Lee*♣  Oghenetegiri “TG” Sido♠  Christopher D. Manning♠
♠ Computer Science Department, Stanford University  ♣ Samsung Research
{pengqi, osido, manning}@cs.stanford.edu, haejun82.lee@samsung.com
* These authors contributed equally.

arXiv:2010.12527v1 [cs.CL] 23 Oct 2020

Abstract

Current approaches to open-domain question answering often make crucial assumptions that prevent them from generalizing to real-world settings, including access to parameterized retrieval systems well-tuned for the task, access to structured metadata like knowledge bases and web links, or a priori knowledge of the complexity of questions to be answered (e.g., single-hop or multi-hop). To address these limitations, we propose a unified system to answer open-domain questions of arbitrary complexity directly from text that works with off-the-shelf retrieval systems on arbitrary text collections. We employ a single multi-task model to perform all the necessary subtasks—retrieving supporting facts, reranking them, and predicting the answer from all retrieved documents—in an iterative fashion. To emulate a more realistic setting, we also constructed a new unified benchmark by collecting about 200 multi-hop questions that require three Wikipedia pages to answer, and combining them with existing datasets. We show that our model not only outperforms state-of-the-art systems on several existing benchmarks that exclusively feature single-hop or multi-hop open-domain questions, but also achieves strong performance on the new benchmark.

1 Introduction

Using knowledge to solve problems is a hallmark of intelligence, and open-domain question answering (QA) is an important means for intelligent systems to make use of the knowledge in large collections of texts. With the help of large-scale datasets based on Wikipedia (Rajpurkar et al., 2016, 2018; Yang et al., 2018; Welbl et al., 2018) and other large corpora (Trischler et al., 2016; Dunn et al., 2017; Talmor and Berant, 2018), the research community has made substantial progress towards tackling this important problem in recent years. However, previous work often makes three assumptions that limit its ability to quickly generalize to new domains, where open-domain QA could be the most helpful.

Firstly, some previous work assumes access to a well-tuned parameterized retrieval system that helps navigate the large amount of text (Das et al., 2019; Lee et al., 2019; Asai et al., 2020; Guu et al., 2020; Khattab et al., 2020; Xiong et al., 2020). These systems make use of rich representation learners like neural networks to encode each document in a large corpus, and retrieve documents with a “query” vector using approximate inner-product search. While they can use gradient-based optimization to update the query generation and document encoding components, their retrieval process is opaque to interpretation, and usually requires a significant amount of computational power to train. Although these systems benefit from unsupervised training, there is usually a big performance gap when compared to their supervised counterparts, and it remains unclear if supervised finetuning will preserve the general applicability of these retrieval systems to other datasets or retrieval tasks.

Secondly, previous work often assumes access to non-textual metadata such as knowledge bases, entity linking, and Wikipedia hyperlinks when retrieving supporting facts (Nie et al., 2019; Feldman and El-Yaniv, 2019; Zhao et al., 2019; Asai et al., 2020; Dhingra et al., 2020; Zhao et al., 2020). While this information is helpful, it is not always readily available in all text collections we might be interested in getting answers from, besides being labor-intensive and time-consuming to collect and maintain. It is therefore desirable to design a system that is capable of extracting knowledge from text without using such metadata, to maximally make use of knowledge available to us in the form of text.
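The parameterized retrieval systems described above encode every document into a vector offline and retrieve with a query vector via (approximate) maximum inner-product search. The following toy illustration uses random vectors as placeholders for learned encoders and an exact search instead of an ANN index; it is not any system used in this paper.

```python
import numpy as np

# Toy maximum inner-product retrieval: documents are pre-encoded into vectors,
# and a query vector selects the top-k documents by inner product. Real systems
# use trained encoders and approximate nearest-neighbor indices.
rng = np.random.default_rng(0)
doc_matrix = rng.standard_normal((10_000, 128))   # offline document encodings (placeholder)
query_vec = rng.standard_normal(128)              # encoded "query" vector (placeholder)

def inner_product_retrieve(query_vec, doc_matrix, k=5):
    scores = doc_matrix @ query_vec               # inner product with every document
    return np.argsort(-scores)[:k]                # indices of the top-k documents

top_doc_ids = inner_product_retrieve(query_vec, doc_matrix)
```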
Lastly, previous systems often assume that every input question in the dataset is either single-hop or multi-hop, and tailor the approach to one of these either in model design or training (Qi et al., 2019; Nie et al., 2019; Zhao et al., 2019; Asai et al., 2020). This limits the practical application of these systems in a real-world task, where there is often a mixture of questions of different levels of complexity without explicit categorization.

To address these limitations, we propose Iterative Retriever, Reranker, and Reader (IRRR), which features a single neural network model that performs all of the subtasks required to answer questions from a large collection of text (see Figure 1). IRRR is designed to leverage off-the-shelf information retrieval systems by generating natural language search queries, which allows it to easily adapt to arbitrary collections of text without requiring well-tuned neural retrieval systems or extra metadata. This further allows users to understand and control IRRR, if necessary, to facilitate trust. Moreover, IRRR iteratively retrieves more context to answer the question, which allows it to easily accommodate questions of different complexity.

[Figure 1: The IRRR question answering pipeline answers a complex question in the HotpotQA dataset by iteratively retrieving, reranking, and reading paragraphs from Wikipedia. The loop is repeated until the answer found is confident enough.]

To evaluate the performance of open-domain QA systems on questions of arbitrary complexity, we further collect 203 human-annotated questions that require information from three Wikipedia pages to answer, combine them with those from the single-hop SQuAD Open (Rajpurkar et al., 2016; Chen et al., 2017) and the multi-hop HotpotQA (Yang et al., 2018), and map this data to a unified version of the English Wikipedia to facilitate system development and evaluation. We show that IRRR not only outperforms state-of-the-art models on the original SQuAD Open and HotpotQA datasets by a large margin, but also achieves strong performance on this new dataset.

To recap, our contributions in this paper are: (1) a unified neural network model that performs all essential subtasks in open-domain QA purely from text (retrieval, reranking, and reading comprehension) and achieves strong results on SQuAD and HotpotQA; (2) a new open-domain QA benchmark that features questions of different levels of complexity on a unified, up-to-date version of Wikipedia, with newly annotated questions that require at least three hops of reasoning, on which our proposed model serves as a strong baseline (for reproducibility, we will release our code and trained models).

2 Method

In this section, we briefly review the task of open-domain question answering, before diving into how we train a unified model to perform all of the subtasks necessary to solve this problem.

2.1 Open-Domain Question Answering

The task of open-domain question answering is concerned with finding the answer a to a question q from a large collection of text D. Successful solutions to this task usually involve two crucial components: an information retrieval system that distills a small set of relevant documents D_r from D, and a reading comprehension system that extracts the answer from it. Chen et al. (2017) presented one of the first neural-network-based approaches to this problem, which was later extended by Wang et al. (2018a) with a reranking system to further reduce the amount of context the reading comprehension component has to consider to improve answer accuracy.
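The two-component retrieve-and-read formulation just described can be sketched as follows; the overlap-based retriever and the reader argument are toy placeholders for illustration, not the components used by IRRR or prior work.

```python
from collections import Counter

def retrieve(question, corpus, k=5):
    """Toy retriever: rank documents by unigram overlap with the question (D_r from D)."""
    q_tokens = Counter(question.lower().split())
    scored = [(sum((Counter(doc.lower().split()) & q_tokens).values()), i)
              for i, doc in enumerate(corpus)]
    return [corpus[i] for _, i in sorted(scored, reverse=True)[:k]]

def retrieve_and_read(question, corpus, reader, k=5):
    """Generic retrieve-then-read pipeline: distill D_r from D, then extract the answer."""
    relevant_docs = retrieve(question, corpus, k)
    return reader(question, relevant_docs)
```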
More recently, Yang et al. (2018) showed that this retrieve-and-read approach towards open-domain question answering is inadequate for more complex questions that require multiple pieces of evidence to answer (e.g., “What is the population of Mark Twain’s hometown?”). Researchers have demonstrated that this can be resolved by retrieving supporting facts by means other than text-based search with the original question, although existing systems often make assumptions that limit their general applicability. For instance, Nie et al. (2019) and Asai et al. (2020) make use of hyperlinks available in Wikipedia, which many open-domain QA datasets are based on, while Qi et al. (2019) consider questions that require exactly two supporting documents to answer. Moreover, most previous work assumes that all questions are either exclusively single-hop or multi-hop, and tailors either model design or the training process to consider one type of question at a time.

2.2 IRRR: Iterative Retriever, Reranker, and Reader

We propose the Iterative Retriever, Reranker, and Reader (IRRR) model, which performs the subtasks involved in an iterative manner to accommodate questions with arbitrary complexity. IRRR aims at building a reasoning path p from the question q, through all the necessary supporting facts d ∈ D_gold, to the answer a. As is shown in Figure 1, IRRR operates in a loop of retrieval, reranking, and reading to expand the reasoning path p with new documents d ∈ D.

Specifically, given a question q, the initial reasoning path contains only the question itself, i.e., p_0 = [q], from which we attempt to generate a search query with IRRR’s retriever to retrieve a set of relevant documents D_1 ⊂ D, which might either help answer the question, or reveal clues about the next piece of evidence we need to answer q. Then, the reader model attempts to append each of the documents in D_1 to answer the question. If more than one answer can be found from these reasoning paths, we predict the answer with the highest answerability score, which we will detail in Section 2.2.2. If no answer can be found, then IRRR’s reranker scores each retrieved paragraph against the current reasoning path, and appends the top-ranked paragraph to the current reasoning path, i.e., $p_{i+1} = p_i + [\arg\max_{d \in D_1} \mathrm{reranker}(p_i, d)]$, before the updated reasoning path is presented to the retriever to generate new search queries. This iterative process is repeated until an answer is predicted from one of the reasoning paths, or until the reasoning path has reached K documents in total, which controls the maximum number of paragraphs of supporting facts required to arrive at the answer.

To reduce computational cost, IRRR is implemented as a multi-task model built on pretrained Transformer models that performs all three subtasks. At a high level, it consists of a Transformer encoder (Vaswani et al., 2017) which takes the question and all retrieved paragraphs so far (the reasoning path p) as input, and one set of task-specific parameters for each task of retrieval, reranking, and reading comprehension (see Figure 2). Specifically, the retriever generates natural language search queries by selecting words from the reasoning path, the reader extracts answers from the reasoning path and abstains if its confidence is not high enough, and the reranker assigns a scalar score to each retrieved paragraph as a potential continuation of the current reasoning path.

[Figure 2: The overall architecture of our IRRR model, which uses a shared Transformer encoder to perform all subtasks of open-domain question answering: token-wise binary prediction for the retriever (query generator), 4-way classification and span prediction for the reader, and NCE classification for the reranker.]

The input to our Transformer encoder is formatted similarly to that of the BERT model (Devlin et al., 2019). For a reasoning path p that consists of the question and t retrieved paragraphs, the input is formatted as “[CLS] question [SEP] title_1 [CONT] context_1 [SEP] · · · title_t [CONT] context_t [SEP]”, where [CLS], [SEP], and [CONT] are special tokens to separate different components of the input.

We will detail each of the task-specific components in the following subsections.
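As a recap of the loop described above, the following is a minimal sketch of one IRRR episode. The helpers `generate_query`, `search`, `read`, and `rerank` are hypothetical stand-ins for the shared-encoder heads and the off-the-shelf search engine; their names and signatures are assumptions, not the released system's interface.

```python
from collections import namedtuple

# Answer carries the predicted text and the answerability score used to pick
# the best candidate and to decide when to stop (see Section 2.2.2).
Answer = namedtuple("Answer", ["text", "answerability"])

def irrr_answer(question, generate_query, search, read, rerank, K=3, top_k=50):
    reasoning_path = [question]                       # p_0 = [q]
    for _ in range(K):
        query = generate_query(reasoning_path)        # retriever head
        candidates = search(query, top_k)             # off-the-shelf text retrieval
        # Try to answer from each candidate extension of the reasoning path.
        answers = [read(reasoning_path + [d]) for d in candidates]
        found = [a for a in answers if a.text != "NOANSWER"]
        if found:
            return max(found, key=lambda a: a.answerability)
        # Otherwise expand the path with the top-reranked paragraph and iterate.
        best = max(candidates, key=lambda d: rerank(reasoning_path, d))
        reasoning_path.append(best)
    return None  # no confident answer within K retrieved paragraphs
```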
2.2.1 Retriever

The goal of the retriever is to generate natural language queries to retrieve relevant documents from an off-the-shelf text-based retrieval engine. This allows IRRR to perform open-domain QA in an explainable and controllable manner, where a user can easily understand the model’s behavior and intervene if necessary. Following the approach proposed by Qi et al. (2019), we propose to extract search queries from the current reasoning path, i.e., the original question and all of the paragraphs that we have already retrieved. This approach stems from the observation that there is usually a strong semantic overlap between the reasoning path and the next paragraph to retrieve. Instead of restricting the query string to always be a substring of the reasoning path as did Qi et al. (2019), in this paper we relax the constraint and instead allow the search query to be any subsequence of the reasoning path, therefore allowing more flexible combinations of search phrases to be used.

To predict these search queries from the reasoning path, we apply a token-wise binary classifier on top of the shared Transformer encoder model to decide whether each token is included in the final query. At training time, we derive supervision signal to train these classifiers with a binary cross entropy loss (which we detail in Section 2.3.1); at test time, we select a cutoff threshold for query words to be included from the reasoning path. In practice, we find that boosting the model to predict more query terms is beneficial to increase the recall of the target paragraphs in retrieval.

We employ Elasticsearch (Gormley and Tong, 2015) as our text-based search engine, and follow previous work to process Wikipedia and search results, which we detail in Appendix B.

2.2.2 Reader

The reader attempts to find the answer given a reasoning path comprised of the question and retrieved paragraphs, and simultaneously assigns an answerability score to this reasoning path to assess the likelihood of having found the answer to the original question. Since not all reasoning paths lead to the final answer, we train a classifier conditioned on the Transformer encoder representation of the [CLS] token to determine whether an answer should be predicted given a reasoning path. Span answers are predicted from the context using a span start classifier and a span end classifier, following Devlin et al. (2019). To support special non-extractive answers like yes/no from HotpotQA, we further include these as options for the classifier to predict in a 4-way classifier.

The answerability score is used to pick the best answer from all the candidate reasoning paths and serves as a stopping criterion for IRRR. We define answerability as the log likelihood ratio between the most likely positive answer and the NOANSWER prediction. For NOANSWER examples, we further train the span classifiers to predict the [CLS] token as the “output span”, and thus we also include the likelihood ratio between the best span and the [CLS] span if the positive answer is a span. This likelihood ratio formulation is less affected by sequence length compared to prediction probability, making it easier to assign a global threshold across reasoning paths of different lengths to stop further retrieval.

2.2.3 Reranker

When the reader fails to find an answer from the reasoning path, the reranker selects one of the retrieved paragraphs to expand it, so that the retriever can generate new search queries to retrieve new context to answer the question. To achieve this, we assign each potential extended reasoning path a score by multiplying the hidden representation of the [CLS] token with a linear transformation, and pick the extension that has the highest score. At training time, we normalize the reranker scores for all retrieved paragraphs, and maximize the log likelihood of selecting gold supporting paragraphs from retrieved ones. To avoid linearly scaling the computational cost of the Transformer encoder with the number of paragraphs retrieved, we employ noise contrastive estimation (NCE) to estimate this likelihood (Mnih and Kavukcuoglu, 2013; Jean et al., 2015), where negative examples are top retrieved paragraphs given the current reasoning path.
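A minimal sketch of the NCE-style reranker objective above: the gold continuation is scored against a handful of sampled negatives (top retrieved paragraphs), the scores are normalized, and the log-likelihood of the gold paragraph is maximized. The scalar scores would come from the linear layer over each candidate path's [CLS] representation; here they are plain numbers, and this is an illustration rather than the authors' training code.

```python
import numpy as np

def reranker_nce_loss(gold_score, negative_scores):
    """Negative log-likelihood of the gold paragraph among the sampled candidates."""
    scores = np.concatenate(([gold_score], np.asarray(negative_scores, dtype=float)))
    log_partition = np.logaddexp.reduce(scores)   # log of the normalization term
    return -(gold_score - log_partition)          # -log softmax(gold)

# Example: the gold continuation scores 2.1; five sampled negatives score lower,
# so the loss is small and shrinks further as the margin grows.
loss = reranker_nce_loss(2.1, [0.3, -0.5, 1.2, 0.0, -1.1])
```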
2.3 Training IRRR

2.3.1 Generating Target Queries with a Dynamic Oracle

Since existing open-domain QA datasets do not include human-annotated search queries, we train the retriever to generate search queries from the reasoning path by deriving supervision signal with a dynamic oracle. Qi et al. (2019) noted that it is important for good search queries to not only contain overlapping terms between the reasoning path and paragraphs to be retrieved, but also encode which terms are more effective in retrieval. Therefore, we derive a dynamic oracle for effective search queries with the help of the search engine.

To reduce computational cost and queries against the search engine, we begin by identifying overlapping spans of text between the reasoning path and the target document for retrieval, and include or exclude entire overlapping spans as a whole. Finding the optimal combination of N overlapping spans is still prohibitively slow and requires O(2^N) operations. To avoid enumerating exponentially many combinations, we approximate the importance of each overlapping span with the following “importance” metric

$$\mathrm{Importance}(s^o_i) = \mathrm{Rank}\big(t, \{s^o_j\}_{j=1,\, j \neq i}^{N}\big) - \mathrm{Rank}\big(t, \{s^o_i\}\big) \qquad (1)$$

where $s^o$ are overlapping spans between the target paragraph $t$ and the current reasoning path, and $\mathrm{Rank}(t, S)$ is the rank of $t$ in the search result when spans $S$ are used as search queries, which is smaller if $t$ is ranked closer to the top in search results. Intuitively, the second term captures the importance of the search term when used alone, and the first captures its importance when combined with all other overlapping spans. After estimating the importance of each overlapping span, we determine the final target query by first sorting all spans by descending importance, then including each span in turn, and selecting the shortest query with the highest rank as the oracle query. The resulting time complexity for generating these oracle queries is thus O(N), i.e., linear in the number of overlapping spans between the reasoning path and the target paragraph.

2.4 Reducing Exposure Bias with Data Augmentation

With the dynamic oracle, we are able to generate target queries to train the retriever model, retrieve documents to train the reranker model, and expand the reasoning path by always choosing a gold paragraph, following Qi et al. (2019). However, since all reasoning paths contain exclusively gold paragraphs, the resulting model is prevented from generalizing to cases where model behavior deviates from the oracle. To address this, we augment the training data by occasionally selecting non-gold paragraphs to append to reasoning paths in the training data, and use the dynamic oracle to generate queries for the model to “recover” from these synthesized retrieval mistakes. We find this data augmentation significantly improves the performance of IRRR in preliminary experiments, and thus report main results with augmented training data.

3 Experiments

Data. In our experiments, we first make use of two standard benchmarks, SQuAD Open and HotpotQA. First introduced by Chen et al. (2017), SQuAD Open designates the development set of the original SQuAD dataset as its test set, which features more than 10,000 questions, each based on a single paragraph in a Wikipedia article. For this dataset, we index the same Wikipedia corpus from 2016 with Elasticsearch to derive training data and evaluate IRRR. Since the authors did not present a standard development set, we split part of the training set to construct a development set roughly as large as the test set. HotpotQA (Yang et al., 2018) features more than 100,000 questions that require the introductory paragraphs of two Wikipedia articles to answer, and we focus on its open-domain “fullwiki” setting in this work. For HotpotQA, we index the introductory paragraphs provided by the authors, which are based on a 2017 Wikipedia dump, for training and evaluation.

          SQuAD Open   HotpotQA   3-hop   Total
  Train   59,285       74,758     0       134,043
  Dev     8,132        5,989      0       14,121
  Test    8,424        5,978      203     14,605
  Total   75,841       86,725     203     162,769

Table 1: Split and statistics of QA examples in the new unified benchmark.

To evaluate the performance of IRRR as well as future QA systems in a more realistic open-domain setting without a pre-specified level of question complexity, we further combine these datasets with 203 newly collected challenge questions to construct a new benchmark. To combine SQuAD Open and HotpotQA into a unified setting, we begin by mapping the paragraphs each question is based on to a more recent version of Wikipedia (in this work, the English Wikipedia dump from August 1st, 2020). We discarded examples where the Wikipedia pages have either been removed or significantly edited such that the answer can no longer be found from paragraphs that are similar enough to the original contexts the questions are based on. As a result, we filtered out 22,328 examples from SQuAD Open, and 18,649 examples from HotpotQA’s fullwiki setting. We annotated 203 challenge questions on the same Wikipedia dump that require paragraphs from three supporting documents to answer, which we add to the test set of the new benchmark to test the generalization capabilities of QA models to this unseen scenario. The statistics of the final dataset can be found in Table 1. To understand the data quality after converting SQuAD Open and HotpotQA to this newer version of Wikipedia, we sampled 100
examples from the training split of each dataset. We find that 94% / 90% of SQuAD / HotpotQA QA pairs remain supported by their context paragraphs, while 6% / 10% are unanswerable (43% of HotpotQA examples contain more than the minimal set of necessary paragraphs to answer the question as a result of the mapping process). For all benchmark datasets, we report standard answer exact match (EM) and unigram F1 metrics.

Training details. We use ELECTRA-Large (Clark et al., 2020) as the pre-trained initialization for our Transformer encoder. We train the model on a combined dataset of SQuAD Open and HotpotQA questions, where we optimize the joint loss of the retriever, reader, and reranker components simultaneously in a multi-task learning fashion. Training data for the retriever and reranker components is derived from the dynamic oracle on the training set of these datasets, where reasoning paths are expanded with oracle queries and by picking the gold paragraphs as they are retrieved for the reader component. We also enhance this training data by expanding reasoning paths with non-gold retrieved paragraphs up to 3 reasoning steps on HotpotQA and 2 on SQuAD Open, and find that this results in a more robust model. After an initial model is finetuned on this expanded training set, we apply our iterative training technique to further reduce exposure bias of the model by generating more data with the trained model and the dynamic oracle.

4 Results

In this section, we present the performance of IRRR when evaluated against previous systems on standard benchmarks, and demonstrate its efficacy on our new, unified benchmark, especially with the help of iterative training.

4.1 Question Answering Performance on Standard Benchmarks

We first compare IRRR against previous systems on standard benchmarks, SQuAD Open and the fullwiki setting of HotpotQA. On each dataset, we compare the performance of IRRR against the best previously published systems, as well as unpublished ones that we find on public leaderboards. For a fair comparison to previous work, we make use of the Wikipedia corpora from Chen et al. (2017) and Yang et al. (2018), respectively, and limit the retriever to retrieve 150 paragraphs of text from Wikipedia at each step of reasoning. We also compare IRRR against the Graph Recurrent Retriever (GRR; Asai et al., 2020) on our newly collected 3-hop question challenge test set, using the authors’ released code and models trained on HotpotQA.

  System                  EM     F1
  DensePR†                38.1   —
  BERTserini‡             38.6   46.1
  MUPPET•                 39.3   46.2
  RE3◦                    41.9   50.2
  Knowledge-aided♠        43.6   53.4
  Multi-passage BERT♥     53.0   60.9
  GRR♣                    56.5   63.8
  Generative OpenQA♦      56.7   —
  IRRR (Ours)             61.8   68.9

Table 2: End-to-end question answering performance on SQuAD Open, evaluated on the same set of documents as Chen et al. (2017). Previous work is denoted with symbols as follows – †: (Karpukhin et al., 2020), ‡: (Yang et al., 2019), •: (Feldman and El-Yaniv, 2019), ◦: (Hu et al., 2019), ♠: (Zhou et al., 2020), ♥: (Wang et al., 2019), ♣: (Asai et al., 2020), ♦: (Izacard and Grave, 2020).

  System                HotpotQA EM   HotpotQA F1   3-hop EM   3-hop F1
  GRR♣                  60.0          73.0          21.7†      29.6†
  Step-by-step⊗         63.0          75.4          —          —
  Recurs. Dense Ret.⊗   62.3          75.3          —          —
  SDS-Net⊗              64.9          78.2          —          —
  IRRR (Ours)           65.7          78.2          32.5       39.9

Table 3: End-to-end question answering performance on HotpotQA and the new 3-hop challenge questions, evaluated on the official HotpotQA Wikipedia paragraphs. ♣ denotes (Asai et al., 2020), and ⊗ represents anonymous/preprint work unavailable at the time of writing of this paper. † indicates results we obtained using the publicly available code and pretrained models.

As can be seen in Tables 2 and 3, IRRR outperforms previously published work on the SQuAD Open and HotpotQA benchmarks by a large margin. It further outperforms more recent unpublished work on HotpotQA that was posted after IRRR was initially submitted to the public leaderboard. On the 3-hop challenge set, we similarly notice a large performance margin between IRRR and GRR, although neither is trained with 3-hop questions. This demonstrates that IRRR generalizes well to questions that are more complex than the ones seen during training.
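The answer EM and unigram F1 metrics reported in these tables follow the common SQuAD-style formulation (lowercasing and stripping punctuation and articles before comparison); the sketch below may differ in small details from the official evaluation scripts.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def unigram_f1(prediction, gold):
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```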
SQuAD Open HotpotQA 80 Dev Test 50 docs/step 80 50 docs/step EM F1 EM F1 Percentage 60 100 docs/step 60 100 docs/step 40 40 150 docs/step 150 docs/step 20 20 0 SQuAD Open 50.65 60.99 60.59 67.51 0 1 2 3 4 5 1 2 3 4 5 HotpotQA 59.01 70.33 58.61 69.86 Paragraphs Retrieved/Question Paragraphs Retrieved/Question 3-hop Challege — — 26.73 35.48 79 62 (1) (5) 78 (5) 61 (1) (5) Table 4: End-to-end question answering performance Answer F1 77 (5) (5) 60 50 docs/step 76 (2) (2) 50 docs/step of IRRR on the unified benchmark, evaluated on the 59 (2) (1) (5) 100 docs/step 75 100 docs/step 2020 copy of Wikipedia. 58 150 docs/step 150 docs/step 74 0 100 200 300 400 500 100 200 300 400 Total Number of Paragraphs Total Number of Paragraphs System SQuAD HotpotQA Figure 3: The retrieval behavior of IRRR and its rela- Ours (joint dataset) 58.69 68.74 tion to the performance of end-to-end question answer- – dynamic stopping (fixed K = 3) 31.70 66.60 – HotpotQA / SQuAD data 54.35 65.40 ing. Top: The distribution of reasoning path lengths as – ELECTRA + BERTLARGE-WWM 57.19 63.86 determined by IRRR. Bottom: Total number of para- graphs retrieved by IRRR vs the end-to-end question Table 5: Ablation study of different design choices in answering performance as measured by answer F1 . IRRR, as evaluated by Answer F1 on the dev set of the unified benchmark. We also demonstrate that on these benchmarks, IRRR’s performance can be further improved by collected 3-hop challenge questions. As is shown incorporating additional data from iterative training in Table 4, IRRR’s performance remains compet- to reduce exposure bias. itive on all questions from different origins in the To better understand the behavior of IRRR on unified benchmark, despite the difference in rea- these benchmarks, we analyze the number of para- soning complexity when answering these questions. graphs retrieved by the model when varying the The model also generalizes to the 3-hop questions number of paragraphs retrieved at each reasoning despite having never been trained on them. We step among {50, 100, 150}. As can be seen in Fig- note that the large performance gap between the ure 3, IRRR effectively reduces the total number development and test settings for SQuAD Open of paragraphs retrieved and read by the model by questions is due to the fact that test set questions stopping its iterative process as soon as all neces- (the original SQuAD dev set) are annoted with sary paragraphs to answer the question have been multiple human answers, while the dev set ones retrieved. Further, we note that the optimal cap for (originally from the SQuAD training set) are not. the number of reasoning steps is larger than the To better understand the contribution of the var- number of gold paragraphs necessary to answer the ious components and techniques we proposed for question on each benchmark, which we find is due IRRR, we performed ablative studies on the model to IRRR’s ability to recover from retrieving and iterating up-to 3 reasoning steps with 50 paragraphs selecting non-gold paragraphs (see the example in for each step, and present the results in Table 5. Table 6). Finally, we note that increasing the num- First of all, we find it is important to allow IRRR to ber of paragraphs retrieved at each reasoning step dynamically stop retrieving paragraphs to answer remains an effective, if computationally expensive, the question. 
Compared to its fixed-step retrieval strategy, to improve the end-to-end performance of counterpart, dynamically stopping IRRR improves IRRR. However, the tradeoff between retrieval bud- F1 on both SQuAD and HotpotQA questions by get and model performance is much more effective 27.0 and 2.1 points respectively (we include further than that of previous work (e.g., GRR), and we note analyses for dynamic stopping in Appendix D). that the queries generated by IRRR is explainable We also find combining SQuAD and HotpotQA to humans and can help humans easily control its datasets beneficial for both datasets in an open- behavior (see Table 6). domain setting, and that ELECTRA is an effective 4.2 Performance on the Unified Benchmark alternative to BERT for this task. To demonstrate the performance of IRRR in a more 5 Related Work realistic setting of open-domain QA, we evaluate it on the new, unified benchmark which features test The availability of large-scale question answer- data from SQuAD Open, HotpotQA, and our newly ing (QA) datasets has greatly contributed to the
Question The Ingerophrynus gollum is named after a character in a book that sold how many copies? Step 1 Ingerophrynus is a genus of true toads with 12 species. ... In 2007 a new species, “Ingerophrynus gollum”, (Non-Gold) was added to this genus. This species is named after the character Gollum created by J. R. R. Tolkien.” Query Ingerophrynus gollum book sold copies J. R. R. Tolkien Step 2 (Gold) Ingerophrynus gollum (Gollum’s toad) is a species of true toad. ... It is called “gollum” with reference of the eponymous character of The Lord of the Rings by J. R. R. Tolkien. Query Ingerophrynus gollum character book sold copies J. R. R. Tolkien true Lord of the Rings Step 3 (Gold) The Lord of the Rings is an epic high fantasy novel written by English author and scholar J. R. R. Tolkien. ... is one of the best-selling novels ever written, with 150 million copies sold. Answer/GT 150 million copies Table 6: An example of IRRR answering a question from HotpotQA by generating natural language queries to retrieve paragragraphs, then rerank them to compsoe reasoning paths and read them to predict the answer. In this example, IRRR recovers from an initial retrieval/reranking mistake by retrieving more paragraphs, before arriving at the gold supporting facts and the correct answer. research progress on open-domain QA. SQuAD et al., 2019; Khattab et al., 2020; Guu et al., 2020; (Rajpurkar et al., 2016, 2018) is among the first Xiong et al., 2020). Aside from the reading compre- question answering datasets adopted for this pur- hension and retrieval components, researchers have pose by Chen et al. (2017) to build QA systems also found reranking search results (Wang et al., over Wikipedia articles. Similarly, TriviaQA (Joshi 2018a) or answer candidates (Wang et al., 2018b; et al., 2017) and Natural Questions (Kwiatkowski Hu et al., 2019). et al., 2019) feature Wikipedia-based questions that To answer open-domain questions that require are written by trivia enthusiasts and extracted from multiple supporting facts, researchers have ex- Google search queries, respectively. While most plored various techniques to extend these retrieve- of these focus on questions that require a local and-read systems, including making use of hyper- context of supporting facts to answer, Yang et al. links between Wikipedia articles (Nie et al., 2019; (2018) presented HotpotQA, which tests whether Feldman and El-Yaniv, 2019; Zhao et al., 2019; open-domain QA systems can generalize to more Asai et al., 2020; Dhingra et al., 2020; Zhao et al., complex questions that require evidence from mul- 2019) and iterative retrieval (Talmor and Berant, tiple documents to answer. Aside from Wikipedia, 2018; Das et al., 2019; Qi et al., 2019). While most researchers have also used news articles (Trischler previous work on iterative retrieval make use of et al., 2016) and search results from the web (Dunn neural retrieval systems that directly accept real et al., 2017; Talmor and Berant, 2018) as the corpus vectors as input, our work is similar to that of for open-domain QA. Qi et al. (2019) in using natural language search Inspired by the TREC QA challenge,4 Chen queries. A crucial distinction between our work et al. (2017) were among the first to combine and previous work on multi-hop open-domain QA, information retrieval systems with accurate neu- however, is that we don’t train models to exclu- ral network-based reading comprehension models sively answer single-hop or multi-hop questions, for open-domain QA. 
Recent work has improved but demonstrate that one single set of parameters open-domain QA performance by enhancing vari- performs well on both tasks. ous components in this retrieve-and-read approach. While much research focused on improving the 6 Conclusion reading comprehension model (Seo et al., 2017; Clark and Gardner, 2018), especially with pre- In this paper, we presented iterative retriever, trained langauge models like BERT (Devlin et al., reranker, and reader (IRRR), a system that uses 2019), more researchers have also demonstrated a single model to perform subtasks to answer open- that neural network-based information retrieval domain questions of arbitrary complexity. This systems achieve competitive, if not better, perfor- system establishes new state of the art on standard mance compared to traditional IR engines (Lee open-domain QA benchmarks as well as the unified new benchmark we present that features questions 4 https://trec.nist.gov/data/qamain. with mixed levels of complexity. html
References Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu- pat, and Ming-Wei Chang. 2020. Realm: Retrieval- Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, augmented language model pre-training. arXiv Richard Socher, and Caiming Xiong. 2020. Learn- preprint arXiv:2002.08909. ing to retrieve reasoning paths over wikipedia graph for question answering. In International Conference Minghao Hu, Yuxing Peng, Zhen Huang, and Dong- on Learning Representations. sheng Li. 2019. Retrieve, read, rerank: Towards end-to-end multi-document reading comprehension. Giusepppe Attardi. 2015. WikiExtractor. https:// In Proceedings of the 57th Annual Meeting of the github.com/attardi/wikiextractor. Association for Computational Linguistics. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open- Gautier Izacard and Edouard Grave. 2020. Lever- domain questions. In Proceedings of the 55th An- aging passage retrieval with generative models for nual Meeting of the Association for Computational open domain question answering. arXiv preprint Linguistics. arXiv:2007.01282. Christopher Clark and Matt Gardner. 2018. Simple Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and effective multi-paragraph reading comprehen- and Yoshua Bengio. 2015. On using very large tar- sion. In Proceedings of the 56th Annual Meeting of get vocabulary for neural machine translation. In the Association for Computational Linguistics (Vol- Proceedings of the 53rd Annual Meeting of the Asso- ume 1: Long Papers). ciation for Computational Linguistics and the 7th In- ternational Joint Conference on Natural Language Kevin Clark, Minh-Thang Luong, Quoc V Le, and Processing (Volume 1: Long Papers). Christopher D Manning. 2020. ELECTRA: Pre- training text encoders as discriminators rather than Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke generators. In International Conference on Learn- Zettlemoyer. 2017. TriviaQA: A large scale dis- ing Representations. tantly supervised challenge dataset for reading com- prehension. In Proceedings of the 55th Annual Meet- Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, ing of the Association for Computational Linguistics and Andrew McCallum. 2019. Multi-step retriever- (Volume 1: Long Papers). reader interaction for scalable open-domain question answering. In International Conference on Learn- V. Karpukhin, Barlas Ouguz, Sewon Min, Patrick ing Representations. Lewis, Ledell Yu Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih. 2020. Dense passage re- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and trieval for open-domain question answering. arXiv, Kristina Toutanova. 2019. BERT: Pre-training of abs/2004.04906. deep bidirectional transformers for language under- standing. In Proceedings of the 2019 Conference of Omar Khattab, Christopher Potts, and Matei Zaharia. the North American Chapter of the Association for 2020. Relevance-guided supervision for openqa Computational Linguistics: Human Language Tech- with colbert. arXiv preprint arXiv:2007.00814. nologies, Volume 1 (Long and Short Papers). Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachan- Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- dran, Graham Neubig, Ruslan Salakhutdinov, and field, Michael Collins, Ankur Parikh, Chris Al- William W. Cohen. 2020. Differentiable reasoning berti, Danielle Epstein, Illia Polosukhin, Jacob De- over a virtual knowledge base. In International Con- vlin, Kenton Lee, Kristina Toutanova, Llion Jones, ference on Learning Representations. Matthew Kelcey, Ming-Wei Chang, Andrew M. 
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Natural questions: A benchmark for question an- Guney, Volkan Cirik, and Kyunghyun Cho. 2017. swering research. Transactions of the Association SearchQA: A new Q&A dataset augmented with for Computational Linguistics, 7. context from a search engine. arXiv preprint arXiv:1704.05179. Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open Yair Feldman and Ran El-Yaniv. 2019. Multi-hop para- domain question answering. In Proceedings of the graph retrieval for open-domain question answering. 57th Annual Meeting of the Association for Compu- In Proceedings of the 57th Annual Meeting of the tational Linguistics. Association for Computational Linguistics. Yuanhua Lv and ChengXiang Zhai. 2011. When doc- Clinton Gormley and Zachary Tong. 2015. Elastic- uments are very long, bm25 fails! In Proceedings search: the definitive guide: a distributed real-time of the 34th international ACM SIGIR conference on search and analytics engine. ” O’Reilly Media, Research and development in Information Retrieval, Inc.”. pages 1103–1104.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Jenny Finkel, Steven J. Bethard, and David Mc- Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Closky. 2014. The Stanford CoreNLP natural lan- Tesauro, Bowen Zhou, and Jing Jiang. 2018a. R3 : guage processing toolkit. In Association for Compu- Reinforced reader-ranker for open-domain question tational Linguistics (ACL) System Demonstrations, answering. In AAAI Conference on Artificial Intelli- pages 55–60. gence. Andriy Mnih and Koray Kavukcuoglu. 2013. Learning Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaox- word embeddings efficiently with noise-contrastive iao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, estimation. In Advances in neural information pro- Gerald Tesauro, and Murray Campbell. 2018b. Ev- cessing systems, pages 2265–2273. idence aggregation for answer re-ranking in open- Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. domain question answering. In International Con- Revealing the importance of semantic retrieval for ference on Learning Representations. machine reading at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natu- Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nal- ral Language Processing and the 9th International lapati, and Bing Xiang. 2019. Multi-passage Joint Conference on Natural Language Processing BERT: A globally normalized BERT model for (EMNLP-IJCNLP). open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Nat- Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and ural Language Processing and the 9th International Christopher D. Manning. 2019. Answering complex Joint Conference on Natural Language Processing open-domain questions through iterative query gen- (EMNLP-IJCNLP). eration. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing Johannes Welbl, Pontus Stenetorp, and Sebastian and the 9th International Joint Conference on Natu- Riedel. 2018. Constructing datasets for multi-hop ral Language Processing (EMNLP-IJCNLP). reading comprehension across documents. Transac- Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. tions of the Association for Computational Linguis- Know what you don’t know: Unanswerable ques- tics, pages 287–302. tions for SQuAD. In Proceedings of the 56th An- nual Meeting of the Association for Computational Wenhan Xiong, Xiang Lorraine Li, Srini Iyer, Jingfei Linguistics (Volume 2: Short Papers). Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Wen tau Yih, Sebastian Riedel, Douwe Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Kiela, and Barlas Oğuz. 2020. Answering com- Percy Liang. 2016. SQuAD: 100,000+ questions for plex open-domain questions with multi-hop dense machine comprehension of text. In Proceedings of retrieval. the 2016 Conference on Empirical Methods in Natu- ral Language Processing. Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. Stephen E Robertson, Steve Walker, Susan Jones, End-to-end open-domain question answering with Micheline M Hancock-Beaulieu, Mike Gatford, et al. BERTserini. In Proceedings of the 2019 Confer- 1995. Okapi at trec-3. Nist Special Publication Sp, ence of the North American Chapter of the As- 109:109. sociation for Computational Linguistics (Demon- Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and strations), Minneapolis, Minnesota. Association for Hannaneh Hajishirzi. 2017. Bidirectional attention Computational Linguistics. 
flow for machine comprehension. In International Conference on Learning Representations. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christo- Alon Talmor and Jonathan Berant. 2018. The web as pher D. Manning. 2018. HotpotQA: A dataset for a knowledge-base for answering complex questions. diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference of the North In Proceedings of the 2018 Conference on Empiri- American Chapter of the Association for Computa- cal Methods in Natural Language Processing. tional Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651. Chen Zhao, Chenyan Xiong, Xin Qian, and Jordan Adam Trischler, Tong Wang, Xingdi Yuan, Justin Har- Boyd-Graber. 2020. Complex factoid question an- ris, Alessandro Sordoni, Philip Bachman, and Ka- swering with a free-text knowledge graph. In Pro- heer Suleman. 2016. NewsQA: A machine compre- ceedings of The Web Conference 2020, pages 1205– hension dataset. arXiv preprint arXiv:1611.09830. 1216. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Song, Paul Bennett, and Saurabh Tiwary. 2019. Kaiser, and Illia Polosukhin. 2017. Attention is all Transformer-xh: Multi-evidence reasoning with ex- you need. In Advances in neural information pro- tra hop attention. In International Conference on cessing systems, pages 5998–6008. Learning Representations.
Mantong Zhou, Zhouxing Shi, Minlie Huang, and Xiaoyan Zhu. 2020. Knowledge-aided open- domain question answering. arXiv preprint arXiv:2006.05244.
A Data processing Split Origin # Entities #QAs In this section, we describe how we process the En- train train 387 77,087 glish Wikipedia and the SQuAD dataset for training dev train 55 10,512 and evaluating IRRR. test dev 48 10,570 For the standard benchmarks (SQuAD Open and HotpotQA fullwiki), we use the Wikipedia cor- Table 7: Statistics of the resplit SQuAD dataset for pora prepared by Chen et al. (2017) and Yang et al. proper training and evaluation on the SQuAD Open set- ting. (2018), respectively, so that our results are com- parable with previous work on these benchmarks. Specifically, for SQuAD Open, we use the pro- Wikipedia articles have been renamed or removed cessed English Wikipedia released by Chen et al. since, we begin by following Wikipedia redirect (2017) which was accessed in 2016, and contains links to locate the current title of the correspnoding 5,075,182 documents.5 For HotpotQA, Yang et al. Wikipedia page (e.g., the page “Madonna (enter- (2018) released a processed set of Wikipedia in- tainer)” has been renamed “Madonna”). After troductory paragraphs from the English Wikipedia the correct Wikipedia article is located, we look originally accessed in October 2017.6 for combinations of one to two consecutive para- While it is established that the SQuAD dev set graphs in the 2020 Wikipedia dump that have high is repurposed as the test set for SQuAD Open for overlap with context paragraphs in these datasets. the ease of evaluation, most previous work make We calculate the recall of words and phrases in use of the entire training set during training, and the original context paragraph (because Wikipedia as a result a proper development set for SQuAD paragraphs are often expanded with more details), Open does not exist, and the evaluation result on the and pick the best combination of paragraphs from test set might be inflated as a result. We therefore the article. If the best candidate has either more resplit the SQuAD training set into a proper devel- than 66% unigrams in the original context, or if opment set that is not used during training, and a there is a common subsequence between the two reduced training set that we use for all of our exper- that covers more than 50% of the original context, iments. As a result, although IRRR is evaluated on we consider the matching successful, and map the the same test set as previous systems, it is likely dis- answers to the new context paragraphs. advantaged due to the reduced amount of training As a result, 20,182/2,146 SQuAD train/dev ex- data and hyperparameter tuning on this new dev set. amples (that is, 17,802/2,380/2,146 train/dev/test We split the training set by first grouping questions examples after data resplit) and 15,806/1,416/1,427 and paragraphs by the Wikipedia entity/title they HotpotQA train/dev/fullwiki test examples have belong to, then randomly selecting entities to add been excluded from the unified benchmark. to the dev set until the dev set contains roughly as many questions as the test set (original SQuAD dev B Elasticsearch Setup set). The statistics of our resplit of SQuAD can be found in Table 7. We will make our resplit publicly We set up Elasticsearch in standard benchmark available to the community. settings (SQuAD Open and HotpotQA fullwiki) For the unified benchmark, we started by pro- following practices in previous work (Chen et al., cessing the English Wikipedia7 with the WikiEx- 2017; Qi et al., 2019), with minor modifications to tractor (Attardi, 2015). 
We then tokenized this unify these approaches. dump and the supporting context used in SQuAD Specifically, to reduce the context size for the and HotpotQA with Stanford CoreNLP 4.0.0 (Man- Transformer encoder in IRRR to avoid unnecessary ning et al., 2014) to look for paragraphs in the computational cost, we primarily index the indi- 2020 Wikipedia dump that might correspond to the vidual paragraphs in the English Wikipedia. To context paragraphs in these datasets. Since many incorporate the broader context from the entire ar- 5 https://github.com/facebookresearch/ ticle, as was done by Chen et al. (2017), we also DrQA index the full text for each Wikipedia article to help 6 https://hotpotqa.github.io/ with scoring candidate paragraphs. Each paragraph wiki-readme.html 7 Accessed on August 1st, 2020, which contains 6,133,150 is associated with the full text of the Wikipedia it articles in total. originated from, and the search score is calculated
Parameter Value SQuAD Open HotpotQA Steps −5 EM F1 EM F1 Learning rate 3 × 10 Batch size 320 Dynamic 49.92 60.91 65.74 78.41 Iteration 10,000 1 step 51.07 61.74 13.75 18.95 Warming-up 1,000 2 step 38.74 48.61 65.12 77.75 Training tokens 1.638 × 109 3 step 32.14 41.66 65.37 78.16 Reranker Candidates 5 4 step 29.06 38.33 63.89 76.72 5 step 19.53 25.86 59.86 72.79 Table 8: Hyperparameter setting for IRRR training. Table 9: SQuAD and HotpotQA performance adaptive vs fixed-length reasoning paths, as measured by an- as the summation of two parts: the similarity be- swer exact match (EM) and F1 . The dynamic stopping tween query terms and the paragraph text, and the criterion employed by IRRR achieves comparable per- formance to its fixed-step counterparts, without knowl- similarity between the query terms and the full text edge of the true number of gold paragraphs. of the article. For query-paragraph similarity, we use the stan- dard BM25 similarity function (Robertson et al., reasoning path retrieval, then move on to the end- 1995) with default hyperparameters (k1 = 1.2, b = to-end performance and leakages in the pipeline, 0.75). For query-article similarity, we find BM25 and end with a few examples to demonstrate typical to be less effective, since the length of these arti- failure modes we have identified that might point cles overwhelm the similarity score stemming from to limitations with the data. important rare query terms, which has also been reported in the information retrieval literature (Lv Effect of Dynamic Stopping. We begin by and Zhai, 2011). Instead of boosting the term fre- studying the effect of using the aswerability score quenty score as considered by Lv and Zhai (2011), as a criterion to stop the iterative retrieval, read- we extend BM25 by taking the square of the IDF ing, and reranking process within IRRR. We com- term and setting the TF normalization term to zero pare the performance of a model with dynamic (b = 0), which is similar to the TF-IDF implemen- stopping to one that is forced to stop at exactly K tation by Chen et al. (2017) that is shown effective steps of reasoning, neither before nor after, where for SQuAD Open. K = 1, 2, . . . 5. As can be seen in Table 9, IRRR’s dynamic stopping criterion based on the answer- Specifically, given a document D and query Q, ability score is very effective in achieving good the score is calculated as end-to-end question answering performance for n X f (D, qi ) · (1 + k1 ) questions of arbitrary complexity without having to score(D, Q) = IDF2+ (qi ) · , specify the complexity of questions ahead of time. f (D, qi ) + k1 i=1 On both SQuAD Open and HotpotQA, it achieves (2) competitive, if not superior question answering per- where IDF+ (qi ) = max(0, log((N − n(qi ) + formance, even without knowing the true number 0.5)/(n(qi ) + 0.5)), with N denoting the total of gold paragraphs necessary to answer each ques- numberr of documents and n(qi ) the document fre- tion. quency of query term qi , and f (qi , D) is the term Aside from this, we note four interesting find- frequency of query term qi in document D. We set ings: (1) the performance of HotpotQA does not k1 = 1.2 in all of our experiments. 
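The article-level score in Equation (2) above is a BM25 variant with the IDF term squared and the TF length normalization disabled (b = 0). Below is a sketch of that formula, assuming the corpus statistics (document frequencies and per-document term frequencies) are supplied by the search engine; it is an illustration of the equation, not Elasticsearch's own similarity implementation.

```python
import math

def idf_plus(doc_freq, num_docs):
    """IDF+(q_i) = max(0, log((N - n(q_i) + 0.5) / (n(q_i) + 0.5)))."""
    return max(0.0, math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5)))

def article_score(query_terms, term_freqs, doc_freqs, num_docs, k1=1.2):
    """score(D, Q) = sum_i IDF+(q_i)^2 * f(D, q_i) * (1 + k1) / (f(D, q_i) + k1)."""
    score = 0.0
    for term in query_terms:
        tf = term_freqs.get(term, 0)                     # f(D, q_i)
        idf = idf_plus(doc_freqs.get(term, 0), num_docs) # squared below
        score += (idf ** 2) * tf * (1 + k1) / (tf + k1)
    return score
```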
peak at two steps of reasoning, but instead is helped by performing a third step of retrieval for the av- C Further Training Details erage question; (2) for both datasets, forcing the model to retrieve more paragraphs after a point con- We include the hyperparameters used to train the sistently hurt QA performance; (3) dynamic stop- IRRR model in Table 8 for reproducibility. ping slightly hurts QA performance on SQuAD Open compared to a fixed number of reasoning D Further Analyses of Model Behavior steps (K = 1); (4) when IRRR is allowed to se- In this section, we perform further analyses and lect a dynamic stopping criterion for each example introduce further case studies to demonstrate the independently, the resulting question answering behavior of the IRRR system. We start by analyz- performance is better than a one-size-fits-all so- ing the effect of the dynamic stopping criterion for lution of applying the same number of reasoning
Question What team was the AFC champion? Step1 (Non- However, the eventual-AFC Champion Cincinnati Bengals, playing in their first AFC Championship Gold) Game, defeated the Chargers 27-7 in what became known as the Freezer Bowl. ...... Step2 (Non- Super Bowl XXVII was an American football game between the American Football Conference (AFC) Gold) champion Buffalo Bills and the National Football Conference (NFC) champion Dallas Cowboys to decide the National Football League (NFL) champion for the 1992 season. ...... Gold Super Bowl 50 was an American football game to determine the champion of the National Foot- ball League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 to earn their third Super Bowl title. ...... Table 10: An example where there are false negative answers in Wikipedia for the question from SQuAD Open. steps to all examples. While the last confirmes the effectiveness of our answerability-based stopping criterion, the cause behind the first three warrants further investigation. We will present further analy- ses to shed light on potential causes of these in the remainder of this section. Case Study for Failure Cases. Besides model inaccuracy, one common reason for IRRR to fail at finding the correct answer provided with the datasets is the existence of false negatives (see Ta- ble 10 for an example from SQuAD Open). We estimate that there are about 9% such cases in the HotpotQA part of the training set, and 26% in the SQuAD part of the training set.