Generative Multi-hop Retrieval
Hyunji Lee (KAIST AI) hyunji.amy.lee@kaist.ac.kr    Sohee Yang (KAIST AI) sohee.yang@kaist.ac.kr
Hanseok Oh (KAIST AI) hanseok@kaist.ac.kr    Minjoon Seo (KAIST AI) minjoon@kaist.ac.kr

arXiv:2204.13596v2 [cs.IR] 25 May 2022

Abstract

Multi-hop retrieval is the task of retrieving a series of multiple documents that together provide sufficient evidence to answer a natural language query. A common practice for text retrieval is to use an encoder to map the documents and the query to a common vector space and perform a nearest neighbor search (NNS); multi-hop retrieval also often adopts the same paradigm, usually with a modification of iteratively reformulating the query vector so that it can retrieve different documents at each hop. However, the inherent limitations of such a bi-encoder approach worsen in the multi-hop setting. As the number of hops increases, the reformulated query increasingly depends on the documents retrieved in its previous hops, which further tightens the embedding bottleneck of the query vector and makes it more prone to error propagation. In this paper, we focus on alleviating these limitations of the bi-encoder approach in multi-hop settings by formulating the problem in a fully generative way. We propose an encoder-decoder model that performs multi-hop retrieval by simply generating the entire text sequences of the retrieval targets, which means the query and the documents interact in the language model's parametric space rather than L2 or inner product space as in the bi-encoder approach. Our approach, Generative Multi-hop Retrieval (GMR), consistently achieves comparable or higher performance than bi-encoder models on five datasets while demonstrating a superior GPU memory and storage footprint.¹

1 Introduction

Finding the relevant knowledge from a massive collection of information is often formulated as a text retrieval problem. A large portion of the text retrieval literature focuses on finding the single most relevant paragraph or document (i.e., no hop) for a given query [1–3]. When we cannot answer a query with a single document, the task is often formulated as a multi-hop retrieval problem, where one needs to retrieve a series of multiple documents that together provide sufficient evidence to answer the query [4–6]. For example, to answer the question "Where did the form of music played by Die Rhöner Säuwäntzt originate?" (Figure 1), we first need to retrieve the form of music played by Die Rhöner Säuwäntzt and then where that form originated from.²

Both no-hop and multi-hop retrieval tasks are often approached by encoding both the query and the retrieval sequences to a common vector space and then finding the sequence whose embedding is closest to that of the query.

¹ We will make our code and configurations publicly available.
² See Appendix A.1 for more examples of multi-hop retrieval tasks.

Preprint. Under review.
Figure 1: Comparison between bi-encoder retrieval (BE) and Generative Multi-hop Retrieval (GMR) for multi-hop retrieval. The left panel shows BE encoding the query and the retrieval sequences into dense vectors and performing MIPS over a dense corpus index; the right panel shows GMR generating the retrieval sequence with an encoder-decoder constrained by a prefix tree over the corpus, with the Hop1-Hop3 boxes showing the query accumulating each hop's retrieved sequence. During the first hop retrieval of GMR, to generate the second token, Rhöner, it finds the potential next tokens ([Rhöner, Miller]) by searching through the prefix tree with the previously generated tokens. We mask out the tokens that are not among the potential next tokens and pick the token with the maximum score from the unmasked tokens, which in this example is Säuwäntzt. Finally, when it generates the end token, the generation ends, and the generated output is the retrieval sequence for the query. In the second hop retrieval, it appends the retrieved sequence to the end of the query, so that the accumulated query is used as the input query of the second hop.

This bi-encoder approach for retrieval is often considered a de facto standard, where heavy computations such as obtaining the dense embeddings of the retrieval sequences in the corpus can be done offline, and one can search over a large number of items with low latency through nearest neighbor search (NNS) or maximum inner product search (MIPS) [3, 1, 7–11]. While such a bi-encoder approach performs well on many retrieval tasks, it has also been shown to suffer from information loss when encoding a long query or document into a fixed-size embedding [12, 2]. The problem becomes even more critical in multi-hop retrieval, as previously retrieved items are appended to the query while iterating through multiple hops. The reformulated query gets longer as the number of hops increases, and the query embedding therefore gradually becomes incapable of containing the entire information.

In this paper, we argue that a fully generative approach to multi-hop retrieval may be the solution. We propose Generative Multi-hop Retrieval (GMR), an encoder-decoder model that attempts to memorize the entire target corpus in a generative manner and retrieves the most relevant sequence from the corpus by generating the entire sequence with the aid of constrained decoding. By interacting in the whole parametric space of the model trained on the target corpus during the retrieval process, GMR overcomes the bottleneck problem of the bi-encoder approach that can only operate on L2 or inner product space (Figure 1). Earlier work in generative retrieval [13, 2, 14] performs retrieval by generating the entity or the document id that represents the target paragraph or document; GMR instead generates the entire text of the target paragraph, which we believe is more suitable for multi-hop retrieval that requires modeling the interaction between longer queries and more fine-grained text segments.
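As an aside, the constrained decoding sketched in the Figure 1 caption can be made concrete with a small prefix-tree (trie) helper. The snippet below is our illustration, not the authors' released code; it assumes the corpus sequences have already been tokenized into integer ids by the model's tokenizer (e.g., T5's).

```python
def build_prefix_tree(corpus_token_ids):
    """Build a nested-dict trie over the token-id sequences of the target corpus."""
    root = {}
    for ids in corpus_token_ids:
        node = root
        for tok in ids:
            node = node.setdefault(tok, {})
        node[None] = True  # None marks the end of a complete corpus sequence
    return root

def allowed_next_tokens(prefix_tree, generated_ids):
    """Return the token ids that may legally follow the tokens generated so far."""
    node = prefix_tree
    for tok in generated_ids:
        node = node.get(tok, {})
    return [tok for tok in node if tok is not None]

def mask_logits(logits, allowed, neg_inf=-1e9):
    """Keep only the scores of tokens that stay on a valid path of the prefix tree."""
    masked = [neg_inf] * len(logits)
    for tok in allowed:
        masked[tok] = logits[tok]
    return masked
```

At each decoding step, the highest-scoring unmasked token is emitted, exactly as in the token-by-token example described in the caption.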
The main contributions of our paper are as follows:

• We show the inherent limitation of bi-encoder retrieval in multi-hop retrieval tasks: its performance decreases as the number of hops increases, and it is vulnerable to error propagation.
• We show that Generative Multi-hop Retrieval (GMR) is robust in solving multi-hop retrieval tasks, performing higher than or comparably to bi-encoder models on five datasets. It is especially strong in multi-hop retrieval settings close to real-world scenarios and on datasets with a low unseen rate.
• We introduce multi-hop memorization, which effectively memorizes the target corpus, and show that it improves the performance of GMR.

Given that generative retrieval shows high performance on five datasets with high memory efficiency compared to traditional bi-encoder retrieval models, we suggest that generative retrieval has the potential to be a practical alternative not only for canonical (no-hop) text retrieval tasks, as shown in Tay et al. [13] and Bevilacqua et al. [14], but also for multi-hop retrieval tasks, as explored in this work.
2 Related Work

Multi-hop Retrieval As the importance of multi-hop retrieval increases, it has been actively studied [15, 10, 16, 17]. One line of previous work assumes that non-textual metadata such as knowledge bases, Wikipedia hyperlinks, or entity links exist and leverages such metadata to solve multi-hop retrieval tasks [16, 18–20], but these methods do not extend to cases where the metadata does not exist. Thus, another line of research focuses on a more straightforward approach of extending the bi-encoder architecture (which shows high performance on no-hop retrieval) to multi-hop retrieval and has shown high performance [10, 21]. While extending bi-encoder retrieval to a multi-hop task setting performs well, previous studies [12, 22] show that the bi-encoder suffers from information loss when condensing text into a fixed-size vector. Especially in multi-hop tasks, the input text of the model gets longer as the number of hops increases, making it highly likely to run into the bottleneck problem that Luan et al. [12] describe. It is therefore worth exploring changes in the fundamental approach to overcome the inherent limitations of the bi-encoder architecture. Our work suggests that a generative method can be an effective alternative to the bi-encoder approach for multi-hop retrieval tasks.

Generative Retrieval Cao et al. [2] first propose a generative retrieval model, GENRE (Generative Entity REtrieval), which achieves comparable or higher performance on entity retrieval tasks than bi-encoder models and suggests that a generative retriever cross-encodes the input and output efficiently, capturing the relationship between the two without information loss due to its autoregressive formulation. To ensure that all generated retrieval sequences are from the corpus, they perform constrained decoding using a prefix tree. Recently, the concurrent works DSI [13] and SEAL [14] study generative retrieval methods in no-hop retrieval settings. DSI (Differentiable Search Index) generates structured identifiers for each document by clustering contextualized embeddings and experiments on NQ subsets (NQ-10k, NQ-100k, and NQ-320k), showing higher performance than Sentence-T5, a bi-encoder model. SEAL (Search Engines with Autoregressive LMs) retrieves by generating n-grams with an FM-index and returning the paragraph or document that contains the n-gram. It is tested on NQ and KILT benchmarks and shows that a generative retrieval model can even outperform well-designed bi-encoder models such as DPR [1] and GAR [23]³. To see the effectiveness of explicitly generating the entire retrieval sequence, we compare GMR with our re-implementation of DSI⁴, which we further extend to the multi-hop retrieval setting for a fair comparison.

3 Generative Multi-hop Retrieval

Multi-hop text retrieval can be defined as the task of retrieving a set of sequences from a target corpus D given a query x. To model the relationship between the multiple target sequences, multi-hop retrieval is often approached by iterating through multiple hops, where the previously retrieved sequences are appended to the end of the previous query to form an augmented query at each hop [16, 10, 17, 15]. In this paper, we focus on multi-hop retrieval tasks that resemble a real-world scenario [15]: the oracle number of hops and the correct order of retrieval sequences are not given for each query at inference time, and the number of oracle hops varies over a wide range.
Canonical text retrieval can be formulated as retrieving a sequence $d_{\hat{y}} = \arg\max_{d \in D} P(d \mid x)$ given the query $x$, where $d \in D$ is a retrieval sequence in the target corpus $D$. The retrieval is considered successful if $d_{\hat{y}} = d_y$, where $d_y$ is the ground-truth target. On the other hand, multi-hop retrieval aims at finding the set of sequences retrieved through $k$ hops, $D_{\hat{y}} = \{d_{\hat{y}_1}, \cdots, d_{\hat{y}_k}\}$, given the query $x$. Here, $d_{\hat{y}_i}$ is the sequence retrieved at the $i$-th hop conditioned on the query $x$ and the sequences retrieved at the previous hops, i.e., $d_{\hat{y}_i} = \arg\max_{d \in D} P(d \mid x, d_{\hat{y}_{<i}})$.
GMR performs multi-hop retrieval by generating the entire text of each retrieval sequence using constrained decoding, as in the right side of Figure 1. The generation goes over multiple hops to retrieve a set of sequences. To decide the text to retrieve, GMR uses $P(d \mid x, d_{\hat{y}_{<i}}) \propto \prod_{j=1}^{|d|} P(d^{(j)} \mid x, d_{\hat{y}_{<i}}, d^{(<j)})$, the product over the probability of generating the token $d^{(j)}$ conditioned on the query, the sequences retrieved at the previous hops, and the previously generated tokens $d^{(<j)}$.
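A hedged sketch of the overall loop this implies is shown below. `constrained_generate` is a hypothetical helper standing in for constrained beam search over the prefix tree (scoring each candidate by the product of token probabilities above), and the DONE-based stopping corresponds to the dynamic setting introduced in Section 4.1.

```python
def gmr_multihop_retrieve(model, query, prefix_tree, max_hops=20, done_text="DONE"):
    """Iteratively generate retrieval sequences, appending each one to the query."""
    retrieved = []
    augmented_query = query
    for _ in range(max_hops):
        # constrained_generate (hypothetical helper) decodes one whole corpus sequence,
        # scoring candidates by sum_j log P(d^(j) | x, d_{<i}, d^(<j)) under prefix-tree masking.
        sequence = constrained_generate(model, augmented_query, prefix_tree)
        if sequence == done_text:       # dynamic stopping via the special DONE sequence
            break
        retrieved.append(sequence)
        augmented_query = augmented_query + " " + sequence  # condition the next hop on this output
    return retrieved
```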
4 Experimental Setup

4.1 Fixed and Dynamic Multi-hop Retrieval

We formulate two settings of multi-hop retrieval: fixed and dynamic. The ultimate goal of multi-hop retrieval at inference time is to retrieve a set of relevant items given an input query x. However, since k, the oracle number of items in the set, varies with x and with the task, it is difficult to know k beforehand in a real-world scenario; therefore, in most prior work, k is fixed to a certain number. The fixed and dynamic multi-hop retrieval settings differ in whether the retrieval process always continues until the maximum number of hops or can stop earlier. The fixed setting is commonly used in previous multi-hop retrieval work, where a model retrieves until the maximum retrieval hop. The dynamic setting is more applicable to multi-hop retrieval close to a real-world scenario; rather than iterating until the given maximum number of hops, the model itself predicts when to stop the process by generating the special token DONE and stops in the middle. We describe the algorithms of both settings and the detailed benefits of the dynamic setting in Appendix A.11.

4.2 Datasets

We use five datasets with various characteristics, briefly described below. Appendix A.3 and Appendix A.2 show the overall statistics, dataset examples, and detailed descriptions of the train and test settings of each dataset.

HotpotQA Yang et al. [4] propose an open-domain multi-hop question answering dataset, which requires aggregating multiple Wikipedia passages through logical reasoning or sequential processing. The number of retrieval sequences is fixed to two.

Entailment TreeBank (EntailBank) Dalvi et al. [6] propose a reasoning tree construction task that forms a tree with the hypothesis as the root node and evidence sentences as leaf nodes. We experiment on the leaf node retrieval of Task3: retrieval of leaf nodes (sentences) from the corpus given a question and an answer as the input. We call the dataset EntailBank for short.

StrategyQA Geva et al. [24] propose a multi-hop open-domain question answering dataset where the reasoning steps are implicit in the question, and thus relevant strategies are required to answer it. Given a question, the model retrieves the evidence sentences from the corpus.

RuleTaker-Open Clark et al. [25] propose a synthetic rule-based dataset to measure the model's reasoning ability over rules expressed in natural language. Based on the released dataset, we create a new task, RuleTaker-Open, to make the task closer to a real-world setting. Given a query, the model retrieves nodes of a graph, which are sentences from the corpus, and the nodes are connected in order to construct the graph.

Explagraphs-Open Saha et al. [26] propose a generative and structured commonsense-reasoning task. We reformulate the task into an open-domain retrieval setting and name it Explagraphs-Open, considering a single path (subject-relation-object) as a retrieval sequence.

4.3 Bi-Encoder Retrieval Models

For each dataset, we compare the results with a bi-encoder retrieval model as a baseline. For the HotpotQA dataset, we use MDR, a widely used bi-encoder retrieval model for that dataset. For the rest of the datasets, we compare with Sentence-T5 (ST5), a bi-encoder retrieval model using T5 [27], to use the same number of parameters and initial checkpoint as GMR.
MDR Xiong et al. [10] propose an iterative bi-encoder retrieval model, MDR, which extends DPR to a multi-step setting.

ST5 ST5 is an encoder-decoder model [28] that uses the first decoder output as the sequence embedding. It serves as the base architecture of our baseline bi-encoder to compare the performance with GMR using the same number of parameters and initial pre-trained checkpoint.

Baseline Bi-Encoder Retriever (BE) In order to compare multi-step generative retrieval with bi-encoder retrieval, we create a simple counterpart such that the bi-encoder retriever can likewise be adapted to the fixed and dynamic multi-step retrieval tasks.
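To make the comparison concrete, here is a hedged sketch of the BE inference loop just described; `encode` stands in for the ST5 sequence embedding and `corpus_embs` for a precomputed embedding matrix, and the training details that realize this setup follow below.

```python
import numpy as np

def be_multihop_retrieve(encode, corpus_texts, corpus_embs, query,
                         max_hops=20, done_text="DONE"):
    """Iterative bi-encoder retrieval: embed the growing query, take the top MIPS hit, append it."""
    retrieved, augmented_query = [], query
    for _ in range(max_hops):
        q = encode(augmented_query)            # fixed-size query embedding (e.g., ST5)
        scores = corpus_embs @ q               # inner-product (MIPS) scores against the corpus index
        best = int(np.argmax(scores))
        if corpus_texts[best] == done_text:    # DONE is added to the corpus for dynamic stopping
            break
        retrieved.append(corpus_texts[best])
        augmented_query += " " + corpus_texts[best]   # reformulate the query for the next hop
    return retrieved
```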
Table 1: Retrieval sequence recall rate (R@5) for the fixed multi-hop retrieval task and F1 score (F1@5, F1@10, F1@20, where each number indicates the maximum retrieval step) for the dynamic setup on the test set. We compare GMR and BE (ST5); GMR outperforms BE on all three datasets. GMR_L denotes GMR with LM memorization and GMR_M denotes GMR with multi-hop memorization. Bold marks the best score for each dataset.

                EntailBank              StrategyQA                        Explagraphs-Open
                BE     GMR    GMR_L     BE     GMR    GMR_L   GMR_M       BE     GMR    GMR_L   GMR_M
Fixed   R@5     31.5   53.6   54.3      37.4   44.9   45.5    45.6        27.0   32.9   32.4    34.6
Dynamic F1@5    24.9   48.2   47.4      38.1   41.9   42.6    43.1        25.0   35.5   35.7    36.2
Dynamic F1@10   19.4   52.1   51.7      36.9   44.3   45.0    45.2        24.6   40.0   40.8    42.1
Dynamic F1@20   16.9   52.5   52.2      36.5   46.6   47.1    47.9        25.4   41.5   41.3    42.6

For fixed multi-step retrieval, we train the bi-encoder (BE) to maximize $P(d_{y_i} \mid x, d_{y_{<i}})$ as done in MDR [10]. For dynamic multi-step retrieval, we add the special single-token text DONE to the corpus, as done in GMR. When training the model, one extra retrieval step is added at the end as well; at the point when the retriever has retrieved all the target texts, the model has to retrieve the DONE text using MIPS. At inference, the model retrieves texts until it retrieves the special token or the number of retrieval steps reaches the predefined maximum. Details are in Appendix A.6.

4.4 Evaluation Metric

In the fixed multi-hop retrieval setting, for HotpotQA, we calculate the retrieval sequence recall following the evaluation metric of MDR [10]. For multi-hop datasets with varying numbers of ground-truth retrieval steps (Explagraphs-Open, EntailBank, and StrategyQA), we first calculate the retrieval sequence recall rate (R@k) of each query and average over the number of queries [6, 26]. Furthermore, in the dynamic multi-hop retrieval setting, since the number of predicted retrieval sequences varies, we measure the retrieval sequence F1 score (F1@k)⁸. We calculate the F1 score by retrieving k sequences from the target corpus and removing null elements. For RuleTaker-Open, we newly define an evaluation metric (Appendix A.4) that measures the graph construction success rate since we do not have information on the ground-truth retrieval sequences.

5 Experimental Results

In Section 5.1, we compare the results of GMR and bi-encoder models in the fixed and dynamic settings on five different datasets. In Section 5.2, we show the limitations of bi-encoder retrieval models, discuss the effect of the unseen rate on GMR, and show GMR's efficiency in storage and inference time.

5.1 Results

Results of Bi-Encoder Retriever and GMR We compare bi-encoder retrieval and GMR in the fixed and dynamic multi-hop retrieval settings. Table 1 shows the overall performance of the bi-encoder baseline (BE) and GMR variants on three datasets in the fixed and dynamic multi-hop retrieval settings. We further compare our base model, the generative multi-hop retrieval model (GMR), with the base model augmented with new memorization methods, namely multi-hop memorization (GMR_M) and LM memorization (GMR_L). For all three datasets, GMR shows a higher top-5 retrieval sequence recall rate in the fixed setting and a higher retrieval sequence F1 score in the dynamic setting than the bi-encoder models. Also, in most cases, both LM memorization and multi-hop memorization help improve the performance of GMR. Table 2 compares GMR and MDR [10] on the HotpotQA dataset.
While the score of GMR is lower than that of MDR [10], it is comparable to that of MDR- (a variant of MDR without linked negatives, memory bank, and shared encoder). One reason why the performance of GMR is similar to MDR- rather than MDR would be that techniques such as hard-negative training or a memory bank are nontrivial to apply to the generative retrieval of GMR; this suggests an important further direction to close the gap.

⁸ To compare performance between the two settings, we also evaluate the retrieval sequence F1 score of the fixed conditional retrieval task following the dynamic conditional retrieval procedure but fixing the number of retrieval steps.
Table 2: Retrieval sequence recall rate on the HotpotQA official full-wiki dev set. Scores of DPR, MDR-, and MDR are from Table 3 of Xiong et al. [10]. MDR- indicates a variant of MDR without linked negatives, memory bank, and shared encoder.

Method      Top-2   Top-10   Top-20
DPR         25.2    45.4     52.1
MDR-        59.9    70.6     73.1
MDR         65.9    77.5     80.2
fix-GMR     57.7    68.8     73.9
fix-GMR_L   55.0    65.3     71.4

Table 3: Retrieval sequence recall rate (R@5) of the fixed multi-hop retrieval task on the test set. We compare GMR and DSI* to show the effectiveness of explicitly generating the entire sequence in the multi-hop retrieval task. GMR outperforms both DSI* models on all three datasets.ᵃ

Model         EntailBank   StrategyQA   Explagraphs-Open
atomic-DSI*   28.0         0.0          23.4
naive-DSI*    7.7          0.0          8.6
fix-GMR       53.6         44.9         32.9

ᵃ Since DSI is not open-sourced, we reproduced the model ourselves, denoted DSI* in this table. We show NQ results of DSI* in Appendix A.12. We could not reproduce DSI-semantic, so we skip its result. We plan to update the table when the official code of DSI is released.

Also, since the HotpotQA dataset is fixed to two hops, which is relatively short, bi-encoder models suffer less from the bottleneck and error propagation problems than on other datasets that necessitate a larger number of hops.

By analyzing the results of GMR and BE on HotpotQA and RuleTaker-Open, we can see that the predictions of the two retrieval models have different characteristics. When comparing the top-2 predictions of GMR and MDR on HotpotQA⁹, in most cases MDR fails by missing the second-hop target even though the first-hop prediction is correct, whereas GMR mostly fails when the first-hop target is not explicitly expressed in the query. On the RuleTaker-Open dataset, where the task is to construct a reasoning graph in the dynamic retrieval setting, the success rate¹⁰ of GMR and GMR_M on constructing the reasoning graph outperforms BE by 300% and 385%, respectively. Also, GMR constructs more complex and diverse reasoning graphs through the retrieval process, suggesting that GMR is strong at retrieving highly structured items such as reasoning chains and graph relations. Moreover, the rate at which the bi-encoder misses the DONE token, which tells the retriever to stop iterating, is more than twice as high as that of GMR, showing that GMR is good at capturing when to stop and is thus robust in the dynamic setting.

Importance of Explicit Generation in Multi-hop Retrieval Task GMR retrieves a sequence by explicitly generating the entire retrieval sequence using constrained decoding, unlike previous generative retrieval methods, so that the retrieval model can grasp the relationship between the input query and retrieval sequences well. Since our approach focuses on retrieval sequences with granularity smaller than the page level so that we can add previously retrieved sequences to an input query, Cao et al. [2], a previous work that retrieves by generating an entity (the title of a page), is not directly applicable to such a fine-grained multi-hop setting. Therefore, we compare our model with our replication of DSI [13]¹¹, a concurrent work that assigns an id to every corpus item and retrieves relevant documents by generating an id. We extend DSI, which experiments only on no-hop settings, to the multi-hop setting by retrieving the id of a relevant document and iterating by adding the text of that document to the end of the input query, as in GMR.
From Table 3, we can see that GMR outperforms DSI* on all three datasets, showing the benefit of generating the entire sequence in the multi-hop retrieval task. DSI* shows an especially low recall score on StrategyQA, whose corpus is more than four times larger than those of the other two datasets. Such performance degradation as the size of the target corpus increases can also be seen in the DSI paper by comparing the results on NQ-10k and NQ-320k. These results suggest that, unlike GMR, DSI has difficulty scaling to a larger corpus.

⁹ Details are in Appendix A.9.
¹⁰ F1 cannot be calculated on RuleTaker-Open because the ground-truth retrieval sequence is not known at each step. See Appendix A.4 for details of RuleTaker-Open and the results.
¹¹ Of the three proposed methods from DSI, we do not report the Semantic String Docid since we could not reproduce its Hits@1 and Hits@10 results. In contrast, we obtain better performance than the original paper with the Atomic Docid (atomic-DSI*) and comparable performance with the Naive String Docid (naive-DSI*). We show details and the NQ results of our replicated DSI in Appendix A.12.
Table 4: Effect of error propagation (Equation 5.2) on the bi-encoder model (BE) and GMR on three datasets: StrategyQA (Str), Explagraphs-Open (Exp), and EntailBank (Ent). We test two cases, replacing the previous retrieval sequence with a BM25 result (minor) or with a randomly sampled sequence from the corpus (major). The numbers indicate how much the performance degrades when an error occurs at the previous hop compared to a setup without such an error.

       Minor               Major
       BE        GMR       BE        GMR
Str    -23.6%    -1.7%     -71.2%    -20.7%
Exp    -46.9%    -49.3%    -91.1%    -75.8%
Ent    -14.0%    -11.1%    -55.1%    -39.6%

Figure 2: hop-R@5_oracle (y-axis) over the number of hops (x-axis), with one panel each for EntailBank, Explagraphs-Open, and StrategyQA. The red line and the black line show the performance of the bi-encoder model (ST5) and GMR, respectively. GMR shows relatively consistent performance over all numbers of conditions in the conditional query. For all three datasets, the performance of the bi-encoder tends to degrade as the number of hops increases beyond a certain threshold.

5.2 Analysis

Limitation of Bi-Encoder Retrieval Models As shown in several previous works [2, 12, 22], bi-encoder approaches have an inherent limitation in that their retrieval performance is constrained by the dimension of the fixed-size embedding used to perform the search. Luan et al. [12] show in particular that the performance decreases more severely as the length of the encoded sequence gets longer. In our work, we further investigate this bottleneck of the bi-encoder in the multi-hop retrieval task and show (1) the bottleneck problem: the performance of the bi-encoder model consistently decreases as the number of hops increases (i.e., as the number of sequences added to the initial input query increases), and (2) error propagation: the bi-encoder approach is more vulnerable to error propagation than the generative retrieval approach, which we show through several experiments simulating the case where a retriever fails to retrieve the ground-truth sequence at the previous hop.

For ease of analysis, we compare the performance of the bi-encoder retriever (ST5) and the generative retriever (GMR) on three datasets under a setting where we assume that a ground-truth order of the sequences to retrieve exists and the goal is to retrieve the single gold target sequence $d_{y_i}$ of each $i$-th hop. The performance is measured as hop-R@5_oracle $= 1\{d_{y_i} \in \text{top-5}_{d \in D}\, P(d \mid x, d_{y_{<i}})\}$, where $d_{y_i}$ is the ground-
truth retrieval target at the $i$-th hop. We divide the severity of the simulated mistake into two levels: (1) a minor error and (2) a major error at the previous hop. For the former, we find the non-oracle but most relevant sequence from the corpus using BM25 and use it as $d_{error_{i-1}}$. For the latter, we randomly sample a sequence from the corpus and use it as $d_{error_{i-1}}$. The effect of error propagation is measured as the relative drop rate of hop-R@5, computed with the erroneous previous hop as $1\{d_{y_i} \in \text{top-5}_{d \in D}\, P(d \mid x, d_{y_{<i-1}}, d_{error_{i-1}})\}$, from the oracle setup hop-R@5_oracle.
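Concretely, the drop rates reported in Table 4 can be read as the following quantity (our restatement of the measurement above; the exact equation referred to as Equation 5.2 in the Table 4 caption is not visible in this transcription):

$$\text{drop rate} \;=\; \frac{\text{hop-R@5}_{error} - \text{hop-R@5}_{oracle}}{\text{hop-R@5}_{oracle}} \times 100\%,$$

where hop-R@5_error averages the indicator over queries with the corrupted previous-hop sequence $d_{error_{i-1}}$ in place of the ground truth.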
to achieve a 40% inference time reduction with respect to ST5 (FP32) with the same number of parameters. Note that in the absence of the optimization process, GMR is 24.6 times slower than ST5, signifying the importance of early stopping. GMR saves an average of 70.2% of the total storage footprint (model parameters + index) compared to MDR, which further increases to a 79.1% saving in the early stopping setting by constructing the index only up to the early stopping point. See Appendix A.10 for detailed information.

In this paper, we show that the bi-encoder model has inherent limitations in multi-hop retrieval: the bottleneck problem becomes severe as the number of hops increases, and the model is more susceptible to error propagation. We present GMR, an encoder-decoder model that performs retrieval by generating the entire target sequences with the aid of constrained decoding, which is generally more robust than bi-encoder models, achieving higher or comparable performance on five multi-hop retrieval datasets. We also introduce two corpus memorization methods, LM memorization and multi-hop memorization, to further improve GMR's performance. Our experimental results demonstrate that a well-designed generative approach for multi-hop retrieval is highly competitive with bi-encoder methods and deserves further exploration by the community.

Limitations and Future Work As shown in Table 2, GMR is still not as good as the best bi-encoder retrieval system (MDR) for HotpotQA. We suspect there are largely two reasons: first, HotpotQA has exactly two hops, whereas GMR seems to have comparative advantages when the number of hops is large and dynamic; second, bi-encoder retrieval is a relatively mature research area, whereas generative retrieval is quite new and the community is yet to discover advanced techniques that fully leverage it.

References

[1] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP, 2020.
[2] Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. Autoregressive entity retrieval. In ICLR, 2021.
[3] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. In ACL, 2017.
[4] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018.
[5] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, 2017.
[6] Bhavana Dalvi, Peter Alexander Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, and Peter Clark. Explaining answers with entailment trees. In EMNLP, 2021.
[7] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In NeurIPS, 2020.
[8] Qianglong Chen, Feng Ji, Haiqing Chen, and Yin Zhang. Improving commonsense question answering by graph-based iterative retrieval over multiple knowledge sources. In COLING, 2020.
[9] Ledell Yu Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. Scalable zero-shot entity linking with dense entity retrieval. In EMNLP, 2020.
[10] Wenhan Xiong, Xiang Li, Srini Iyer, Jingfei Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Scott Yih, Sebastian Riedel, Douwe Kiela, and Barlas Oguz. Answering complex open-domain questions with multi-hop dense retrieval. In ICLR, 2021.
[11] Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric Michael Smith, Y.-Lan Boureau, and Jason Weston. Recipes for building an open-domain chatbot. In EACL, 2021.
[12] Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. Sparse, dense, and attentional representations for text retrieval. TACL, 2021.
[13] Yi Tay, Vinh Quang Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W. Cohen, and Donald Metzler. Transformer memory as a differentiable search index. ArXiv, 2022.
[14] Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Wen-Tau Yih, Sebastian Riedel, and Fabio Petroni. Autoregressive search engines: Generating substrings as document identifiers. ArXiv, 2022.
[15] Peng Qi, Haejun Lee, Oghenetegiri "TG" Sido, and Christopher D. Manning. Answering open-domain questions of varying reasoning steps from text. In EMNLP, 2021.
[16] Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. Learning to retrieve reasoning paths over wikipedia graph for question answering. In ICLR, 2020.
[17] O. Khattab, Christopher Potts, and Matei A. Zaharia. Baleen: Robust multi-hop reasoning at scale via condensed retrieval. In NeurIPS, 2021.
[18] Yixin Nie, Songhe Wang, and Mohit Bansal. Revealing the importance of semantic retrieval for machine reading at scale. In EMNLP, 2019.
[19] Chen Zhao. Complex factoid question answering with a free-text knowledge graph. In Proceedings of The Web Conference 2020, 2020.
[20] Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov, and William W. Cohen. Differentiable reasoning over a virtual knowledge base. In ICLR, 2020.
[21] Chen Zhao, Chenyan Xiong, Jordan Boyd-Graber, and Hal Daumé III. Multi-step reasoning over unstructured text with beam dense retrieval. ArXiv, 2021.
[22] Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Sebastian Riedel, and Edouard Grave. A memory efficient baseline for open domain question answering. ArXiv, 2020.
[23] Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. Generation-augmented retrieval for open-domain question answering. In ACL-IJCNLP, 2021.
[24] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. TACL, 2021.
[25] Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as soft reasoners over language. In IJCAI, 2021.
[26] Swarnadeep Saha, Prateek Yadav, Lisa Bauer, and Mohit Bansal. Explagraphs: An explanation graph generation task for structured commonsense reasoning. In EMNLP, 2021.
[27] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
[28] Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. CoRR, 2021.
[29] Zhengnan Xie, Sebastian Thiem, Jaycie Martin, Elizabeth Wainwright, Steven Marmorstein, and Peter Jansen. Worldtree v2: A corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In LREC, 2020.
[30] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
[31] Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. Proofwriter: Generating implications, proofs, and abductive statements over natural language. In Findings of ACL-IJCNLP, 2021.
[32] Swarnadeep Saha, Sayan Ghosh, Shashank Srivastava, and Mohit Bansal. Prover: Proof generation for interpretable reasoning over rules. In EMNLP, 2020.
[33] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Transformers: State-of-the-art natural language processing. In EMNLP, 2020.
[34] Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. Towards continual knowledge learning of language models. In ICLR, 2022.
[35] James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
Table 5: Cases where multi-hop retrieval is necessary.

Input: Where did the form of music played by Die Rhöner Säuwäntzt originate?
Output 1 (Die Rhöner Säuwäntzt): Die Rhöner Säuwäntzt are a Skiffle-Bluesband from Eichenzell-Lütter in Hessen, Germany. The line-up consists of Martin Caba, Christoph Günther and Christoph Leipold playing Skiffle-Blues with lyrics based on Rhön Mountains dialect and other Hessian dialects varieties. The expression "Säuwäntzt" means pork belly and refers also to untidy or unruly children and youth.
Output 2 (Skiffle): Skiffle is a music genre with jazz, blues, folk and American folk influences, usually using a combination of manufactured and homemade or improvised instruments. Originating as a term in the United States in the first half of the 20th century, it became popular again in the UK in the 1950s, where it was associated with artists such as Lonnie Donegan, The Vipers Skiffle Group, Ken Colyer and Chas McDevitt. Skiffle played a major part in beginning the careers of later eminent jazz, pop, blues, folk and rock musicians and has been seen as a critical stepping stone to the second British folk revival, blues boom and British Invasion of the US popular music scene.

Input: Gunmen from Laredo starred which narrator of "Frontier"?
Output 1 (Gunmen from Laredo): Gunmen from Laredo is a 1959 American western film produced and directed by Wallace MacDonald, which stars Robert Knapp, Maureen Hingert, and Walter Coy.
Output 2 (Walter Coy): Walter Darwin Coy (January 31, 1909 – December 11, 1974) was an American stage, radio, film, and, principally, television actor, originally from Great Falls, Montana. He was best known for narrating the NBC western anthology series, "Frontier", which aired early Sunday evenings in the 1955–1956 season.

A Appendix

A.1 Examples of Multi-hop Retrieval Task

There are many cases where multi-hop retrieval is necessary: a query cannot be solved by a single document but needs more than one relevant document to provide sufficient evidence together. To find the answer to the first example of Table 5, we first need to look at what music Die Rhöner Säuwäntzt played and then find where that music originated from. Similarly, to find the answer to the second example, we need to find who starred in Gunmen from Laredo and then find who among them narrated "Frontier".

A.2 Dataset Examples

Examples of each dataset (input and output forms) are in Table 6.

A.3 Dataset Details

HotpotQA Yang et al. [4] propose an open-domain multi-hop question answering dataset, which requires aggregating multiple Wikipedia passages through logical reasoning or sequential processing. The number of retrieval sequences is fixed to two. HotpotQA consists of two types of questions: comparison and bridge. Comparison questions, a rationale/evidence type of multi-hop dataset, do not necessitate iterative retrieval since the two entities can be retrieved from the query itself. However, bridge questions consist of evidence in a reasoning chain where the second step has to be retrieved based on the first one. We use the official Wikipedia dump provided by Yang et al. [4], use 2% of the official train dataset as a dev set, and report the scores on the official dev set.

Entailment TreeBank (EntailBank) Dalvi et al. [6] propose a reasoning tree construction task that forms a tree with a hypothesis as the root node and evidence sentences as leaf nodes. The dataset has three settings, and among them, we experiment on Task3, the open setting.
Table 6: Dataset examples.

Paragraph Retrieval (HotpotQA)
  Step 1 Input (a query): The Oberoi family is part of a hotel company that has a head office in what city?
  Step 1 Output (evidence passage): Oberoi family The Oberoi family is an Indian family that is famous for its involvement in hotels, namely through The Oberoi Group.
  Step 2 Input (the query with the previous output): The Oberoi family is part of a hotel company that has a head office in what city? Oberoi family The Oberoi family is an Indian family that is famous for its involvement in hotels, namely through The Oberoi Group.
  Step 2 Output (evidence passage): The Oberoi Group The Oberoi ... 30+ luxury hotels and two river cruise ships in six countries, primarily under its Oberoi Hotels & Resorts and Trident Hotels brands.

Sentence Retrieval (EntailmentBank, StrategyQA)
  Step 1 Input (a query): Does a dentist treat Bluetooth problems?
  Step 1 Output (evidence sentence): A dentist is a surgeon who specializes in dentistry, the diagnosis, prevention, and treatment of diseases and conditions of the oral cavity
  Step 2 Input (the query + Step 1 output): Does a dentist treat Bluetooth problems? A dentist is a surgeon who specializes in dentistry, the diagnosis, prevention, and treatment of diseases and conditions of the oral cavity
  Step 2 Output (evidence sentence): Technological problems are typically handled by IT professionals
  Step 3 Input (the query + Step 1 & Step 2 outputs): Does a dentist treat Bluetooth problems? A dentist is a surgeon who specializes in dentistry, the diagnosis, prevention, and treatment of diseases and conditions of the oral cavity Technological problems are typically handled by IT professionals
  Step 3 Output (evidence sentence): Bluetooth is not a physical entity

Reasoning Path Retrieval (RuleTakers, Explagraphs)
  Step 1 Input (a query): belief: marriage is the best for a family unit. argument: Marriage is a predictor of health and happiness.
  Step 1 Output (evidence sentence): marriage; created by; love
  Step 2 Input (the query + Step 1 output): belief: marriage is the best for a family unit. argument: Marriage is a predictor of health and happiness. marriage; created by; love
  Step 2 Output (evidence sentence): love; causes; health and happiness
  Step 3 Input (the query + Step 1 & Step 2 outputs): belief: marriage is the best for a family unit. argument: Marriage is a predictor of health and happiness. marriage; created by; love love; causes; health and happiness
  Step 3 Output (evidence sentence): health and happiness; used for; family unit

Task3 consists of two steps; the first is to select leaf nodes from the corpus when given a question and an answer, and the second is to construct a reasoning tree through the selected leaf nodes. We perform the first step, the leaf node retrieval. Since the leaf nodes and the root node are not directly connected, the connection between the input query and the gold outputs is looser than in the other datasets.
Table 7: Overview of the five datasets. The Seq Len column shows the average number of tokens per retrieval sequence in the given target corpus. The Hops column shows the average number of hops necessary to answer a query in the test set. The Unseen column shows the rate of test queries consisting only of retrieval sequences unseen during the training process.

Dataset            Corpus (MB)   Seq Len   Hops   Unseen
HotpotQA           1,595         78.6      2      18.9%
EntailBank         0.7           12.5      4.6    2.7%
StrategyQA         7.0           13.1      2.7    98.2%
Explagraphs-Open   0.5           9.6       4.5    95.5%
RuleTaker-Open     0.7           13.1      -      0.0%ᵃ

ᵃ We calculate the rate with the prediction results since there are no gold retrieval sequences.

Table 8: Error rate for each error type in RuleTaker-Open. Results are from the 200 test queries.

Error Rate (%)       GMR   ST5
Node Num Error       0.5   5
Start Node Error     9.5   0
End Node Error       20    28
Missing Edge Error   19    50
Success              51    17

We compare the results with ST5 since there is no released bi-encoder model for this dataset, and, as in the original paper, we use both the EntailBank and WorldTreeV2 [29] datasets when training the retrieval model.

StrategyQA Geva et al. [24] propose a multi-hop open-domain question answering dataset where the reasoning steps are implicit in the question and some strategy is needed to answer it. Given a question, the model retrieves the evidence sentences from the corpus. Since only the train dataset contains evidence annotations, we split it into 75/5/20 (%) and use the splits as the train/val/test sets, respectively. Also, based on the given corpus, we split the given paragraph-level corpus to the sentence level using NLTK [30] to match the granularity of the evidence and add the annotated evidence sentences to the corpus.

RuleTaker-Open Clark et al. [25] propose a synthetic rule-based dataset to measure the model's reasoning ability over rules expressed in natural language. Based on the released dataset, we create a new task, RuleTaker-Open, to make the task closer to a real-world setting. Given a query, the model retrieves nodes of a graph, which are sentences from the corpus, and the nodes are connected in order to construct the graph. Details of the construction method are described in Appendix A.4.

Explagraphs-Open Saha et al. [26] propose a generative and structured commonsense-reasoning task. Given a belief and an argument, a model predicts whether the argument supports or counters the belief and generates (retrieves) a reasoning graph to explain the prediction. While the original dataset requires generation to construct the reasoning graph, which limits it to generative models only, we expand the task to an open-domain retrieval setting to enable comparison with bi-encoder models by constructing a corpus, and name it Explagraphs-Open. We consider a single path (subject-relation-object) as a retrieval unit and construct the corpus by dumping all the possible paths provided in the dataset.

A.4 RuleTaker-Open

The RuleTaker dataset is a synthetic rule-based dataset used to measure a model's ability to reason over rules [25, 31, 32]. Given a small corpus of textual facts and rules, the model has to answer the question and retrieve and construct the graph-structured proofs. As in Tafjord et al. [31], we use the maximum-depth dataset D5 for training. To evaluate the model performance in the open setting, i.e., Task3 in Dalvi et al. [6], we newly construct a large corpus and divide the train/dev/test dataset by the unique query set from the original D5 dataset.
Dataset Construction We dump all the facts and rules from the original D5 train/dev/test datasets to construct the corpus and collect 1621 unique queries, which we split into 1300/121/200. We remove cases with NAF and FAIL proofs for rule-based evaluation, remove graphs with fewer than two nodes to ensure that a single fact from the corpus cannot itself be the proof, and remove graphs with more than ten nodes to fit within the maximum input length of the T5 model.
Also, we added DONE at the end of graph construction for dynamic stopping.

Evaluation Metric In RuleTaker-Open, there are various possible answer graphs for a query, unlike the previous RuleTaker dataset. Therefore, a new evaluation metric is necessary to check whether a predicted graph is correct. Since each textual sentence can be divided into a simple format, subject-relation-object, given the construction method [25], we evaluate the result with a new rule-based method. We check whether the constructed graph is well formed in four steps:

• Node Num Error: the number of evidence sentences should be larger than 2.
• Start Node Error: the first word (subject) should be the same.
• End Node Error: the last word (object) should be the same.
• Missing Edge Error: there should be no missing edge.

Table 8 shows the rate of each error type for both the bi-encoder model and GMR. Each error in the table corresponds to the item above with the same name. The Missing Edge Error is evaluated by Algorithm 1; given a prediction graph P, we divide the sentences into rules and facts and check for a missing edge in the prediction order. When the algorithm returns True, the graph is considered to have no missing edge.

Algorithm 1 Finding the missing edge
Require: Input corpus P
  T := an empty list to which facts from P are appended or from which they are removed
  for all sentences s ∈ P do
      if s is a rule then
          divide s into assumptions A and result r
          for all assumptions a ∈ A do
              if a in T then
                  T.remove(a)
              else
                  return False            ▷ Missing edge
              end if
          end for
          T.append(r)
      else
          T.append(s)
      end if
  end for
  if T is empty then
      return True                         ▷ No missing edge
  else
      return False                        ▷ Missing edge
  end if

Predicted reasoning graphs of GMR and the bi-encoder retrieval (ST5) are shown in Appendix A.5.

A.5 RuleTaker-Open Prediction Results

The prediction result from the model, the predicted corpus (P), is shown in the gray box, and the final node is colored yellow. Missing nodes are colored red, and leftover nodes are colored blue. If there is a red or blue node, the model failed to construct the reasoning graph. We show two examples for each retrieval method for both success and failure cases (missing edge error cases) in Figure 3, Figure 4, Figure 5, and Figure 6.
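For reference, below is a minimal executable rendering of the missing-edge check in Algorithm 1 above, under the assumption that facts are plain sentences and rules follow the surface form "If A and B then C." (the actual RuleTaker phrasing may differ, so the parsing here is illustrative):

```python
def parse_rule(sentence):
    """Split a rule assumed to look like 'If A and B then C.' into its assumptions and result."""
    body = sentence[len("If "):].rstrip(".")
    assumptions_part, result = body.split(" then ", 1)
    return [a.strip() for a in assumptions_part.split(" and ")], result.strip()

def has_no_missing_edge(predicted_corpus):
    """Direct rendering of Algorithm 1: True only if the predicted sentences chain without a missing edge."""
    collected = []                              # T in Algorithm 1: facts/results seen so far
    for sentence in predicted_corpus:           # iterate in prediction order
        if sentence.startswith("If "):          # the sentence is a rule
            assumptions, result = parse_rule(sentence)
            for assumption in assumptions:
                if assumption in collected:
                    collected.remove(assumption)
                else:
                    return False                # missing edge
            collected.append(result)
        else:                                   # the sentence is a fact
            collected.append(sentence)
    return len(collected) == 0                  # leftover nodes also count as failure
```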
Figure 3: Success examples of GMR (two examples).

Figure 4: Failure examples of GMR (Example 1: leftover node (blue) and missing nodes (red); Example 2: leftover node (blue)).

Figure 5: Success examples of bi-encoder (ST5) retrieval (two examples).

A.6 Details of Bi-Encoder Retrieval Models (ST5)

We use the ST5 model [28] as the architecture of the bi-encoder baseline to compare the performance with GMR using the same number of parameters. The input text is fed into the T5 encoder, and the first decoder output of the T5 decoder is taken as the sentence embedding. We follow the implementation details in Ni et al. [28] except for two settings: (1) as in Karpukhin et al. [1], we use the inner product instead of cosine similarity when calculating the similarity, since the inner product shows a higher recall rate than cosine similarity across datasets, and (2) we change the hyperparameters for a fair comparison with GMR.

A.7 Details of Generative Multi-hop Retrieval

LM Memorization For the path retrieval tasks (RuleTaker-Open, Explagraphs-Open), the subject and the relation are given, and the model generates the object; for the sentence and paragraph retrieval tasks (NQ, HotpotQA, EntailBank, StrategyQA), the first 70% of the sentence is given as input, and the model generates the rest.
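As an illustration of how such memorization examples could be constructed (our sketch, not the released preprocessing code; the 70/30 split is done here on whitespace tokens for simplicity):

```python
def make_lm_memorization_example(sequence, is_path=False, prefix_ratio=0.7):
    """Build an (input, target) pair for LM memorization of one corpus sequence."""
    if is_path:
        # Path retrieval corpora: 'subject; relation; object' -> give subject and relation, generate object.
        subject, relation, obj = [part.strip() for part in sequence.split(";")]
        return f"{subject}; {relation};", obj
    # Sentence/paragraph corpora: give the first ~70% of the tokens, generate the rest.
    tokens = sequence.split()
    cut = max(1, int(len(tokens) * prefix_ratio))
    return " ".join(tokens[:cut]), " ".join(tokens[cut:])

# Example usage with hypothetical corpus entries:
print(make_lm_memorization_example("marriage; created by; love", is_path=True))
print(make_lm_memorization_example("Skiffle is a music genre with jazz, blues, folk and American folk influences."))
```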
Figure 6: Failure examples of bi-encoder (ST5) retrieval (Example 1: leftover node (blue) and missing nodes (red); Example 2: missing node (red)).

Multi-Hop Memorization As a conditional memorization method, we experiment with multi-hop memorization for GMR, in which we generate pseudo-multi-hop queries and train the retriever with not only the original training dataset but also the generated queries during the retrieval step. To generate pseudo-multi-hop queries, we first train a model that generates a query when given a set of retrieval sequences. After training the query generation model, we sample multiple sets of retrieval sequences from the given target corpus. To sample sets whose items are relevant to each other, as in the original datasets, we construct a graph over the target corpus and sample subgraphs from the entire graph. The method of constructing the graph varies with the characteristics of the dataset. We found it challenging to build a meaningful graph for some datasets; in the EntailBank dataset, it is hard to identify which word in a retrieval sequence should be the node. Therefore, we generate pseudo-multi-hop queries only for Explagraphs-Open and StrategyQA, for which we could build a meaningful graph from the given target corpus. For Explagraphs-Open, since items in the target corpus are paths, we keep the object and the subject of each item as nodes of the graph and connect two nodes when the subject of one sequence matches the object of the other. For StrategyQA, we construct the graph by adding the entities of the retrieval sequences in the target corpus as nodes and connecting two nodes when a retrieval sequence contains both entities. We sample subgraphs from the constructed graph by iterating through the retrieval sequences in the target corpus and using each retrieval sequence as a start node. The sampled retrieval sequences are given as input to the query generation model to generate pseudo queries, which are then used together with the original training dataset to train the retrieval model. The number of retrieval sequences in a sampled subgraph ranges between the minimum and the maximum number of retrieval sequences in the dataset. Also, for StrategyQA, we remove sentences without any entity or with more than four entities; sentences with more than four entities often contain varied information, making the sentence irrelevant to other sentences in a sampled subgraph even though an entity matches.
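A hedged sketch of the Explagraphs-Open graph construction and subgraph sampling described above (a simplification: the exact sampling strategy, such as walk length and ordering, is not fully specified in the text, and the size bounds here are placeholders for the per-dataset minimum and maximum):

```python
import random
from collections import defaultdict

def build_path_graph(corpus_paths):
    """Connect two 'subject; relation; object' paths when one's object equals the other's subject."""
    by_subject = defaultdict(list)
    parsed = []
    for path in corpus_paths:
        subject, relation, obj = [p.strip() for p in path.split(";")]
        parsed.append((path, subject, obj))
        by_subject[subject].append(path)
    edges = defaultdict(list)
    for path, _, obj in parsed:
        edges[path].extend(by_subject.get(obj, []))   # path -> paths whose subject matches this object
    return edges

def sample_subgraph(edges, start_path, min_size=2, max_size=5):
    """Walk the graph from a start path to collect a small, connected set of retrieval sequences."""
    chain, current = [start_path], start_path
    target_size = random.randint(min_size, max_size)
    while len(chain) < target_size and edges[current]:
        current = random.choice(edges[current])
        if current in chain:
            break
        chain.append(current)
    return chain
```

Each sampled chain would then be fed to the query generation model to produce one pseudo-multi-hop query.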
After we generate the queries, we go through a filtering process to remove incorrectly generated queries, removing sentences without end tokens. We use pre-trained T5-large [33] to train the query generation model. See Appendix A.8 for details of the hyperparameters.

A.8 Experimental Setup Details

We train both ST5 and GMR using the pre-trained T5-large checkpoint from Wolf et al. [33] as the initial checkpoint. We use the same hyperparameter settings when training the GMR and ST5 models for a fair comparison. We observe that hyperparameter changes do not change the tendency of the results after experimenting over a combination of settings used in previous models [1, 28, 27]. Also, we use different hyperparameters for the two different tasks: retrieval corpus memorization and retrieval. For all experiments, we use 8 32GB V100 GPUs.
LM Memorization The LM memorization step aims to show GMR the corpus it will retrieve from and to have it saved implicitly before the retrieval step. We keep the learning rate at 1e-5, which is relatively low compared to the retrieval step, to maintain the linguistic ability the model learned during pre-training [34]. For every dataset, we train the model from the T5 pre-trained checkpoint using Adafactor with a constant learning rate of 1e-5 and batch size 240 for a maximum of 3 epochs. Increasing the number of LM memorization epochs does not always lead to higher performance: as the model is trained on a new dataset, catastrophic forgetting of previously learned knowledge occurs [35], in this case the linguistic ability learned during pre-training. To prevent this, we follow Jang et al. [34], reduce the learning rate to 1e-5, and use the checkpoint of epoch 3 as the initial checkpoint for all retrieval tasks.

Multi-Hop Memorization We train a model that generates a pseudo-multi-hop query for multi-hop memorization when given a set of retrieval sequences. To train this model, we dump all the train, dev, and test sets of the retrieval datasets, concatenate all the retrieval sequences of each example into one long sequence as the input, and use the corresponding query as the output. We use the same configuration as in the retrieval step.

Retrieval Step The retrieval step aims to retrieve the gold items from a large-scale corpus. For GMR_L, we use the checkpoint from LM memorization as the initial checkpoint, and for the rest of the models (ST5, GMR, GMR_M), we use the T5 pre-trained checkpoint as the initial checkpoint¹². For GMR_M, we train the model on both the training dataset and the generated dataset starting from the T5 pre-trained checkpoint. For both ST5 and GMR (including GMR_L and GMR_M), we train using Adafactor with a learning rate of 1e-4, a linear warm-up for the first 10% of training followed by linear decay, and batch size 120 for a maximum of 30 epochs.
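A minimal sketch of the optimizer setup described above, using the Hugging Face Adafactor implementation (the data loading and training loop are omitted; the total step count is a placeholder that depends on dataset size, batch size, and epoch count):

```python
from transformers import T5ForConditionalGeneration, Adafactor, get_linear_schedule_with_warmup

model = T5ForConditionalGeneration.from_pretrained("t5-large")

# LM memorization: constant low learning rate to preserve pre-trained linguistic ability.
lm_optimizer = Adafactor(model.parameters(), lr=1e-5,
                         scale_parameter=False, relative_step=False, warmup_init=False)

# Retrieval step: lr 1e-4 with linear warm-up over the first 10% of steps, then linear decay.
total_steps = 10_000  # placeholder; depends on dataset size, batch size 120, and up to 30 epochs
retrieval_optimizer = Adafactor(model.parameters(), lr=1e-4,
                                scale_parameter=False, relative_step=False, warmup_init=False)
scheduler = get_linear_schedule_with_warmup(retrieval_optimizer,
                                            num_warmup_steps=int(0.1 * total_steps),
                                            num_training_steps=total_steps)
```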
A.9 Manual Analysis on HotpotQA

We conduct a manual analysis on HotpotQA by comparing the top-2 prediction results of GMR and MDR, a bi-encoder retrieval model. From the two question categories in HotpotQA (bridge and comparison questions), we manually inspect 30 sampled examples where one model is correct and the other is wrong. MDR mostly gets the answer wrong by missing the second-hop item even though it gets the first hop correct, and GMR is wrong mostly in cases where the first-hop item is not written explicitly in the query but is only implied by sharing a specific part of a sentence. When the item is written explicitly in the query, GMR tends to get it correct, which is consistent with GMR showing a higher score on comparison questions than MDR. We suggest this is because GMR can directly cross-encode the input and the output without information loss. To be specific, we divide the error cases into four: (1) the first-hop retrieval item is not written explicitly in the query but is only implied by sharing a specific part of a sentence; (2) although the item is written explicitly in the query, the model retrieves the wrong document by attending to an irrelevant part of the query; (3) a detail of the title is wrong (e.g., when the gold document has the title "Do You Love Me (Not That I Can Dance)", the model retrieves a document with the title "Do You Love Me (2NE1 song)" instead; when "do you love me" appears in a query, the model misses the details); (4) the retriever gets the first hop correct but fails to retrieve the second-hop item correctly. Comparing how often each model is wrong in the bridge questions for each error case, MDR is more often wrong in the second (1.3 times) and fourth (2.2 times) cases, and GMR is most often wrong in the first case (6 times) along with the third case (2.8 times)¹³.

A.10 Storage Footprint

Table 9 shows the overall storage footprint of three models: MDR, GMR, and GMR with early stopping, where GMR with early stopping does not generate every word of the retrieval target text but stops generation as soon as the partially generated text can uniquely identify the target text, and saves the index only up to that point. Table 11 shows that GMR shows higher memory efficiency with a

¹² GMR_L is GMR with LM memorization and GMR_M is GMR with multi-hop memorization.
¹³ The value in parentheses shows the ratio of the error rate compared to the other model.