Memory Augmented Sequential Paragraph Retrieval for Multi-hop Question Answering
Nan Shao†, Yiming Cui‡†, Ting Liu‡, Shijin Wang†§, Guoping Hu†
† State Key Laboratory of Cognitive Intelligence, iFLYTEK Research, China
‡ Research Center for Social Computing and Information Retrieval (SCIR), Harbin Institute of Technology, Harbin, China
§ iFLYTEK AI Research (Hebei), Langfang, China
†§ {nanshao,ymcui,sjwang3,gphu}@iflytek.com  ‡ {ymcui,tliu}@ir.hit.edu.cn

arXiv:2102.03741v1 [cs.CL] 7 Feb 2021

Abstract

Retrieving information from correlated paragraphs or documents to answer open-domain multi-hop questions is very challenging. To deal with this challenge, most existing works consider paragraphs as nodes in a graph and propose graph-based methods to retrieve them. However, in this paper, we point out the intrinsic defects of such methods. Instead, we propose a new architecture that models paragraphs as sequential data and considers multi-hop information retrieval as a kind of sequence labeling task. Specifically, we design a rewritable external memory to model the dependency among paragraphs. Moreover, a threshold gate mechanism is proposed to eliminate the distraction of noise paragraphs. We evaluate our method on both the full wiki and distractor subtasks of HotpotQA, a public textual multi-hop QA dataset requiring multi-hop information retrieval. Experiments show that our method achieves significant improvement over the published state-of-the-art method in retrieval and downstream QA task performance.

Paragraph A: Rand Paul presidential campaign, 2016
  The 2016 presidential campaign of Rand Paul, the junior United States Senator from Kentucky, was announced on April 7, 2015 at an event at the Galt House in Louisville, Kentucky. ...
Paragraph B: Galt House
  ... The Galt House is the city's only hotel on the Ohio River. ...
Q: The Rand Paul presidential campaign, 2016 event was held at a hotel on what river?
A: Ohio River

Figure 1: An example of an open-domain multi-hop question from HotpotQA. The model needs to retrieve evidence paragraphs from the entire Wikipedia and derive the answer.

1 Introduction

Open-domain Question Answering (QA) is a popular topic in natural language processing that requires models to answer questions given a large collection of text paragraphs (e.g., Wikipedia). Most previous works leverage a retrieval model to calculate the semantic similarity between a question and each paragraph and retrieve a set of paragraphs; a reading comprehension model then extracts an answer from one of the paragraphs.
These pipeline methods work well in single-hop question answering, where the answer can be derived from a single paragraph. However, many real-world questions posed by users are multi-hop questions that require reasoning across multiple documents or paragraphs.

In this paper, we study the problem of textual multi-hop question answering at scale, which requires multi-hop information retrieval. An example from HotpotQA is illustrated in Figure 1. To answer the question "The Rand Paul presidential campaign, 2016 event was held at a hotel on what river?", the retrieval model first needs to identify Paragraph A as an evidence paragraph based on term similarity. The retrieval model then learns that the event was held at a hotel named "Galt House", which leads to the next-hop Paragraph B. Existing neural and non-neural methods cannot perform well on such questions because there is little lexical overlap or semantic relation between the question and Paragraph B.

To tackle this challenge, mainstream studies model all paragraphs as a graph, where paragraphs are connected if they share entity mentions or a hyperlink relation (Ding et al., 2019; Zhao et al., 2020; Asai et al., 2020). For example, the Graph-based Recurrent Retriever (Asai et al., 2020) first leverages non-parameterized methods (e.g., TF-IDF or BM25) to retrieve a set of initial paragraphs as starting points; a neural retrieval model then determines whether a paragraph linked to an initial paragraph is the next step of the reasoning path. However, this approach suffers from several intrinsic drawbacks: (1) Graph-based methods rest on the hypothesis that evidence paragraphs share entity mentions or are linked by a hyperlink, so they perform well on bridge questions; for comparison questions (e.g., do A and B have the same attribute?), however, the evidence paragraphs about A and B are independent. (2) Once a paragraph is identified as an evidence paragraph, a graph-based retriever like the Graph-based Recurrent Retriever (Asai et al., 2020) continues to determine whether the paragraphs linked from that evidence paragraph are relevant to the question, which implies that these models must assume the relation between paragraphs is directional. (3) Several graph-based methods have to process all adjacent paragraphs simultaneously, leading to inefficient memory usage.
In this paper, we discard the mainstream practice. Instead, we propose to treat all candidate paragraphs as sequential data and identify them iteratively. The newly proposed framework for multi-hop information retrieval has several desirable properties: (1) It does not assume entity mentions or hyperlinks between two evidence paragraphs, so it is suitable for both bridge and comparison questions. (2) It naturally takes all paragraphs connected to or from a given paragraph into consideration. (3) As only one paragraph resides in GPU memory at a time, the framework is memory efficient.

We implement the framework and propose the Gated Memory Flow model. Inspired by the Neural Turing Machine (Graves et al., 2014), we leverage an external rewritable memory to memorize information extracted from previous paragraphs. The model iteratively reads a paragraph and the memory to identify whether the current paragraph is an evidence paragraph. Because the candidate set contains many noise paragraphs, we design a threshold gating mechanism to control the writing procedure. In practice, we find that different negative examples in the training set affect retrieval performance significantly, and we propose a simple method to find more valuable negatives for the retrieval model.

Our method significantly outperforms previous methods on HotpotQA under both the full wiki and the distractor settings. In analysis experiments, we demonstrate the effectiveness of our method and how different settings in the model affect downstream performance.

Our contributions are summarized as follows:

• We propose a new framework that treats paragraphs as sequential data and iteratively retrieves them for solving multi-hop question answering at scale.
• We implement the framework and propose the Gated Memory Flow model. Besides, we introduce a simple method to find more valuable negatives for open-domain multi-hop QA.
• Our method significantly outperforms all published methods on HotpotQA by a large margin. Extensive experiments demonstrate the effectiveness of our method.

2 Related Work

Textual Multi-hop Question Answering. In contrast to single-hop question answering, multi-hop question answering requires a model to reason across multiple documents or paragraphs to derive the answer. In recent years, this topic has attracted many researchers' attention, and many datasets have been released (Welbl et al., 2018; Talmor and Berant, 2018; Yang et al., 2018; Khot et al., 2020; Xie et al., 2020). In this paper, we focus our study on the textual multi-hop QA dataset HotpotQA.

Methods for solving multi-hop QA can be roughly divided into two categories: graph-based methods and query reformulation based methods. Graph-based methods often construct an entity graph based on co-reference resolution or co-occurrence (Dhingra et al., 2018; Song et al., 2018; De Cao et al., 2019). These works show that GNN frameworks such as the graph convolutional network (Kipf and Welling, 2017) and the graph attention network (Veličković et al., 2018), applied to an entity graph, can achieve promising results on WikiHop.
Tu et al. (2019) propose to model paragraphs and candidate answers as different kinds of nodes in a graph, extending an entity graph to a heterogeneous graph. In order to adapt such methods to span-based QA tasks like HotpotQA, DFGN (Qiu et al., 2019) leverages a dynamic fusion layer to combine graph representations and sequential text representations.

Query reformulation based methods solve multi-hop QA tasks by implicitly or explicitly reformulating the query at each reasoning hop. For example, QFE (Nishida et al., 2019) updates query representations at each hop, and DecompRC (Min et al., 2019) decomposes a compositional question into several single-hop questions.
[Figure 2: Overview of our architecture, in three stages: term-based retrieval (keyword matching, TF-IDF, and hyperlink sources), the Gated Memory Flow retriever (a pre-trained encoder with key/value memory embeddings, a write gate, and attention-based read), and a question answering reader with a cascade prediction layer (gold paragraphs, supporting facts, answer span, and answer type).]

Multi-hop Information Retrieval. Chen et al. (2017) first propose to leverage the entire set of Wikipedia paragraphs to answer open-domain questions. Most existing open-domain QA methods use a pipeline approach that includes a retriever model and a reader model. For open-domain multi-hop QA, graph-based methods are also the mainstream practice. These studies organize paragraphs as nodes in a graph structure and connect them according to hyperlinks. Cognitive Graph (Ding et al., 2019) employs a reading comprehension model to predict answer spans and next-hop answers to construct a cognitive graph. Transformer-XH (Zhao et al., 2020) introduces eXtra Hop attention to reason over a paragraph graph and produces a global representation of all paragraphs to predict the answers; because it encodes paragraphs independently, this method lacks fine-grained interaction between evidence paragraphs and is memory-inefficient. Asai et al. (2020) propose a graph-based recurrent model that retrieves a reasoning chain through a path in the paragraph graph. As described in Section 1, this method may not perform well on comparison questions and suffers from low recall caused by directional edges.

There are also several studies proposing query reformulation based methods for multi-hop information retrieval. Multi-step Reasoner (Das et al., 2019) employs a retriever that interacts with a reader model and reformulates the query in a latent space for the next retrieval hop. GoldEn Retriever (Qi et al., 2019) generates several single-hop questions, one per hop. These methods often perform poorly as a result of error accumulation across hops.

3 Method

In this section, we introduce the pipeline we designed for multi-hop QA; an overview of our system is shown in Figure 2. We first leverage a heuristic term-based retrieval method to find as many candidate paragraphs for the question Q as possible. Our Gated Memory Flow model takes all candidate paragraphs as input and processes them one by one. Finally, the paragraphs with the highest relevance scores for a question are selected and fed into a neural question answering model. The reader model uses a cascade prediction layer to predict all targets in a multi-task way.

3.1 Term-Based Retrieval

We leverage a heuristic term-based retrieval method to construct a candidate paragraph set that contains all possible paragraphs. Paragraphs in the candidate set come from three sources.

Key Word Matching. We retrieve all paragraphs whose titles exactly match terms in the query. The top N_kwm paragraphs with the highest TF-IDF scores are collected into set P_kwm.

TF-IDF. The top N_tfidf paragraphs retrieved by DrQA's TF-IDF on the question (Chen et al., 2017) form set P_tfidf.

Hyperlink. Paragraphs that are necessary for deriving the answer but have no lexical overlap with the question often have hyperlink relations to the paragraphs retrieved by key word matching or TF-IDF. Therefore, for each paragraph P_i in P_kwm and P_tfidf, we extract all hyperlinked paragraphs of P_i to construct set P_hyperlink. Unlike previous works (Nie et al., 2019; Asai et al., 2020), both paragraphs that have a hyperlink to P_i and paragraphs linked from P_i are taken into consideration.

Finally, the three collections are merged into the candidate set P_cand, as sketched below.
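As an illustration, here is a minimal Python sketch of the three-source construction. The helpers `title_index`, `tfidf_retriever`, and `hyperlinks` are hypothetical stand-ins for a title-matching index, a DrQA-style TF-IDF retriever, and a Wikipedia hyperlink table; their names and interfaces are our assumptions, not part of the released system.

```python
def build_candidate_set(question, title_index, tfidf_retriever, hyperlinks,
                        n_kwm=10, n_tfidf=5):
    # Key word matching: paragraphs whose titles exactly match query terms,
    # keeping the top n_kwm by TF-IDF score (set P_kwm).
    p_kwm = set(title_index.match(question, top_k=n_kwm))

    # TF-IDF: top n_tfidf paragraphs from a DrQA-style retriever (set P_tfidf).
    p_tfidf = set(tfidf_retriever.retrieve(question, top_k=n_tfidf))

    # Hyperlink: paragraphs linked to OR from any paragraph found so far;
    # unlike prior work, both link directions are kept (set P_hyperlink).
    p_hyperlink = set()
    for p in p_kwm | p_tfidf:
        p_hyperlink |= set(hyperlinks.linked_to(p))
        p_hyperlink |= set(hyperlinks.linked_from(p))

    # P_cand is the union of the three sources.
    return p_kwm | p_tfidf | p_hyperlink
```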
3.2 Gated Memory Flow

After retrieving the candidate set, we propose a neural model named Gated Memory Flow (GMF) that accurately selects evidence paragraphs from the candidate set.

Model. As described above, for a compositional question, whether a paragraph contains evidence information depends not only on the query but also on other paragraphs. The task can be formulated as:

  argmax_θ P(P_t ∈ E | Q, P_{1:t−1}, θ)    (1)

where E denotes the evidence set, θ denotes the parameters of the model, P_t is the t-th paragraph in the candidate set P_cand = {P_1, ..., P_t, ..., P_n}, and P_{1:t−1} represents the t−1 paragraphs processed so far.

At the t-th time step, the model takes paragraph P_t as input and determines whether it is an evidence paragraph depending on the paragraphs already identified. The question Q and paragraph P_t are concatenated and fed into a pre-trained model to obtain the query-aware paragraph representations H_t ∈ R^{l×d}. To reduce parameters, we compress this matrix into a vector representation, which we then use as a query vector to address the desired information in the external memory module:

  x_t = MeanPooling(H_t) ∈ R^d    (2)

We denote the external memory at step t as M_t ∈ R^{l_m×d}, where l_m is the number of memory slots that have stored information at step t. We denote the readout vector at step t as o_t, which contains the missing information needed to identify the current paragraph. The o_t and x_t are used to identify P_t:

  h_t = tanh(W_o · o_t + W · x_t) ∈ R^d    (3)
  s_t = σ(W_s · h_t)    (4)

where s_t represents the relevance score between paragraph P_t and question Q.

Now we describe the memory module and its read/write procedure in detail. We follow the KV-MemNN (Miller et al., 2016) architecture, which defines the memory slots as pairs of vectors {(k_1, v_1), ..., (k_{l_m}, v_{l_m})}. In our implementation, the key vectors and value vectors are the same. Only a subset of the vectors is necessary for identifying P_t, so we use the key vectors and the query vector to address the memory. The addressing and reading procedure is similar to soft attention:

  p_i = Softmax(W_q x_t · W_k k_i)    (5)
  o_t = Σ_i p_i W_v v_i    (6)

where p_i denotes the readout probability of the i-th memory slot. In our implementation, we extend the attention readout mechanism to a multi-head form (Vaswani et al., 2017):

  o_t = Concat(o_t^(1), ..., o_t^(h))    (7)

where o_t^(h) denotes the output of the h-th readout head. Because there are many irrelevant paragraphs in the candidate set, we leverage a threshold gating mechanism to control write permission. The writing operation is concatenation, keeping the model concise:

  M_{t+1} = M_t,               if s_t < gate
  M_{t+1} = Concat(M_t, x_t),  if s_t ≥ gate    (8)

The value of the scalar gate is a pre-defined hyper-parameter.
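To make the read/write mechanics concrete, below is a minimal PyTorch sketch of one GMF step (Eqs. 2-8), under stated assumptions: the pre-trained encoder producing H_t is omitted, the module names and the per-head reshaping are ours, and the schedule that activates the gate only after the first epoch (described next) is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMFStep(nn.Module):
    def __init__(self, d=1024, n_heads=16, gate=0.2):
        super().__init__()
        self.n_heads, self.gate = n_heads, gate
        self.w_q = nn.Linear(d, d, bias=False)  # query projection (Eq. 5)
        self.w_k = nn.Linear(d, d, bias=False)  # key projection (Eq. 5)
        self.w_v = nn.Linear(d, d, bias=False)  # value projection (Eq. 6)
        self.w_o = nn.Linear(d, d)              # readout projection (Eq. 3)
        self.w_x = nn.Linear(d, d)              # input projection (Eq. 3)
        self.w_s = nn.Linear(d, 1)              # scoring head (Eq. 4)

    def forward(self, h_t, memory):
        # h_t: (l, d) token representations of [Q; P_t]; memory: (l_m, d).
        x_t = h_t.mean(dim=0)                   # Eq. 2: mean pooling
        if memory.size(0) > 0:
            # Multi-head soft-attention read (Eqs. 5-7); keys = values here.
            q = self.w_q(x_t).view(self.n_heads, -1)             # (h, d/h)
            k = self.w_k(memory).view(-1, self.n_heads, q.size(-1))
            v = self.w_v(memory).view(-1, self.n_heads, q.size(-1))
            p = F.softmax((k * q).sum(-1), dim=0)                # (l_m, h)
            o_t = (p.unsqueeze(-1) * v).sum(0).reshape(-1)       # concat heads
        else:
            o_t = torch.zeros_like(x_t)         # memory is empty at t = 0
        s_t = torch.sigmoid(self.w_s(torch.tanh(self.w_o(o_t) + self.w_x(x_t))))
        # Threshold-gated write (Eq. 8): append x_t only if s_t >= gate.
        if s_t.item() >= self.gate:
            memory = torch.cat([memory, x_t.unsqueeze(0)], dim=0)
        return s_t.squeeze(), memory
```

Scoring a candidate set then amounts to initializing `memory = torch.zeros(0, d)` and looping over the encoded paragraphs in order, carrying the returned memory forward.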
Training. The parameters of GMF are updated with a binary cross entropy loss:

  L_retri = − Σ_{P_t∈P_pos} log(s_t) − Σ_{P_t∈P_neg} log(1 − s_t)    (9)

where P_neg contains eight negative paragraphs sampled by our negative sampling strategy for each question. At the beginning of training, the model cannot assign each paragraph an accurate relevance score, leading to unexpected gate behavior. For this reason, we only activate the gate after the first epoch of training.

Training with Harder Negative Examples. For each question in the training stage, we pass two gold paragraphs and eight negative examples from the term-based retrieval result to Gated Memory Flow. Intuitively, different negative sampling strategies may influence the training result, but previous works have not explored this effect. We empirically demonstrate that harder negative examples lead to better training results and propose a simple but effective strategy to find them (see Sec. 4.4). Specifically, we first train a BERT-based binary classifier to calculate the relevance score between questions and paragraphs, using eight paragraphs randomly selected from the term-based retrieval results combined with the two evidence paragraphs for each question. We then employ this model as a ranker to select the top-8 paragraphs as augmented negative examples and train our GMF model with them; a sketch of this mining step follows.
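The mining step reduces to ranking candidate negatives with the first-stage classifier and keeping the top ones. In the sketch below, `ranker_score` is a hypothetical stand-in for that BERT-based classifier's question-paragraph relevance score; it is an assumption, not released code.

```python
def mine_hard_negatives(question, gold_paragraphs, candidate_paragraphs,
                        ranker_score, k=8):
    """Return the k highest-scoring non-gold paragraphs for one question."""
    negatives = [p for p in candidate_paragraphs if p not in gold_paragraphs]
    # Paragraphs the first-stage ranker finds most question-relevant are the
    # hardest negatives: lexically/semantically close, yet not evidence.
    negatives.sort(key=lambda p: ranker_score(question, p), reverse=True)
    return negatives[:k]
```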
Inference. At the inference stage, there may be hundreds of candidate paragraphs for a question. Nevertheless, it is unnecessary to keep all of them in GPU memory alongside the model parameters and the current paragraph P_t. To balance inference speed and memory usage, we load a chunk of paragraphs into GPU memory at a time. After the model assigns every paragraph a relevance score, we sort the paragraphs by score and retrieve, for each question, the top N_retri paragraphs whose scores exceed a threshold h_d. We take these paragraphs as the input to the reader model.

3.3 Reader Model

There are many existing reader models for multi-hop QA, most of which use a graph-based approach. However, a recent study (Shao et al., 2020) shows that self-attention in pre-trained models can perform well in multi-hop QA. Therefore, we employ a pre-trained model as the reader to solve multi-hop reasoning.

We combine all paragraphs selected by Gated Memory Flow into a context C. For each example, we concatenate the question Q and context C, and feed the result into a pre-trained model followed by a projection layer.

We follow a structure similar to the prediction layer of Yang et al. (2018). The prediction layer has four sub-tasks: (1) evidence paragraph prediction as an auxiliary task; (2) supporting facts prediction based on the evidence paragraphs; (3) prediction of the start and end positions of the answer span; (4) answer type prediction. We use a cascade structure due to the dependency between the different outputs. Five isomorphic BiLSTMs are stacked layer by layer to obtain different representations for each sub-task. To predict evidence paragraphs, we take the first and last token of the i-th paragraph as its representation:

  M_p = BiLSTM(C) ∈ R^{m×d_2}    (10)
  O_para = σ(W_p [M_p[P_s^(i)]; M_p[P_e^(i)]])    (11)

where P_s^(i) and P_e^(i) denote the start and end token indices of the i-th paragraph. We can then predict whether the i-th sentence is a supporting fact, depending on the outputs of evidence paragraph prediction:

  M_s = BiLSTM([C, O_para]) ∈ R^{m×d_2}    (12)
  O_sent = σ(W_s [M_s[S_s^(i)]; M_s[S_e^(i)]])    (13)

where S_s^(i) and S_e^(i) denote the start and end token indices of the i-th sentence. The answer span and type are predicted in a similar way:

  O_start = SubLayer([C, O_sup])    (14)
  O_end = SubLayer([C, O_sup, O_start])    (15)
  O_type = SubLayer([C, O_sup, O_end])    (16)

where SubLayer(·) denotes an LSTM layer and linear projection analogous to Eqs. 12-13. We compute a cross entropy loss over each output and optimize them jointly:

  L_reader = λ_a L_start + λ_a L_end + λ_p L_para + λ_s L_sup + λ_t L_type    (17)

where each loss term is assigned a coefficient.
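For illustration, here is a condensed PyTorch sketch of the cascade structure in Eqs. 10-17. Two simplifications should be flagged: the paper scores each paragraph and sentence from its boundary-token representations (Eqs. 11 and 13), whereas this sketch emits per-token logits throughout, and the three-way type head would in practice be pooled to one prediction per example. Dimensions and module names are our assumptions.

```python
import torch
import torch.nn as nn

class SubLayer(nn.Module):
    # One BiLSTM plus a linear projection: the building block of Eqs. 10-16.
    def __init__(self, d_in, d_hidden, n_out):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_hidden // 2, bidirectional=True,
                            batch_first=True)
        self.proj = nn.Linear(d_hidden, n_out)

    def forward(self, x):
        m, _ = self.lstm(x)          # (batch, m, d_hidden) token states
        return self.proj(m)          # per-token logits for this sub-task

class CascadeDecoder(nn.Module):
    def __init__(self, d=1024, d2=256):
        super().__init__()
        # Five isomorphic BiLSTM sub-layers, stacked so that later tasks
        # condition on earlier tasks' outputs via concatenation.
        self.para  = SubLayer(d,     d2, 1)   # Eqs. 10-11: evidence paragraphs
        self.sup   = SubLayer(d + 1, d2, 1)   # Eqs. 12-13: supporting facts
        self.start = SubLayer(d + 1, d2, 1)   # Eq. 14: answer start
        self.end   = SubLayer(d + 2, d2, 1)   # Eq. 15: answer end
        self.type  = SubLayer(d + 2, d2, 3)   # Eq. 16: yes / no / span

    def forward(self, c):
        # c: (batch, m, d) encoding of [Q; C] from the pre-trained reader.
        o_para  = torch.sigmoid(self.para(c))
        o_sup   = torch.sigmoid(self.sup(torch.cat([c, o_para], -1)))
        o_start = self.start(torch.cat([c, o_sup], -1))
        o_end   = self.end(torch.cat([c, o_sup, o_start], -1))
        o_type  = self.type(torch.cat([c, o_sup, o_end], -1))
        return o_para, o_sup, o_start, o_end, o_type
```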
4 Experiments

In this section, we describe the setup and results of our experiments on the HotpotQA dataset. We compare our method with the published state-of-the-art method (Asai et al., 2020) in retrieval and downstream QA performance to demonstrate the superiority of our proposed architecture.

4.1 Experimental Setup

Dataset. Our experiments are conducted on HotpotQA (Yang et al., 2018), a widely used textual multi-hop question answering dataset. For each question, models are required to extract a span of text as the answer and the corresponding sentences as supporting facts. HotpotQA has two settings: the distractor setting and the full wiki setting. In the distractor setting, each question is provided with two evidence paragraphs and eight distractor paragraphs collected by TF-IDF. In the full wiki setting, the answer and supporting facts must be extracted from the entire Wikipedia. HotpotQA contains two question types: bridge questions and comparison questions. The former requires models to reason from one piece of evidence to another; to answer a comparison question, models need to compare two entities described in two paragraphs.

Metrics. We evaluate our pipeline system not only on downstream QA performance but also on the intermediate retrieval result. We report EM and F1 scores for QA performance, and Supporting Fact EM (Sup EM) and Supporting Fact F1 (Sup F1) for supporting fact sentence retrieval. In addition, joint EM and F1 scores measure the joint performance of the QA system. To measure the intermediate retrieval performance, we follow Asai et al. (2020) and use three metrics: Answer Recall (AR), which evaluates whether the answer is located in the selected paragraphs; Paragraph Recall (PR), which evaluates whether at least one of the evidence paragraphs is in the retrieved set; and Paragraph Exact Match (P EM), which evaluates whether all evidence paragraphs are retrieved. Since the number of paragraphs passed to the reader can vary, we also report the precision score to evaluate retrieval performance more appropriately.
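For clarity, the following is a small sketch of these retrieval metrics for a single question, under our reading of the definitions above; it is not the official evaluation script.

```python
def retrieval_metrics(retrieved, gold_titles, answer_text):
    """retrieved: list of (title, text) pairs selected for one question."""
    titles = {title for title, _ in retrieved}
    gold = set(gold_titles)
    # Answer Recall: the answer string appears in some selected paragraph.
    ar = float(any(answer_text in text for _, text in retrieved))
    # Paragraph Recall: at least one evidence paragraph was retrieved.
    pr = float(bool(titles & gold))
    # Paragraph EM: all evidence paragraphs were retrieved.
    p_em = float(gold <= titles)
    # Precision: fraction of selected paragraphs that are evidence.
    prec = len(titles & gold) / max(len(titles), 1)
    return ar, pr, p_em, prec
```

Each metric is then averaged over the questions in the evaluation set.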
Implementation Details. Considering accuracy and computational cost, we use different pre-trained models for different components of the system. We report the whole pipeline result in Section 4.2 with GMF based on a RoBERTa-large model (Liu et al., 2019); in the analysis experiments, GMF is based on a RoBERTa-base model to save computation. We use ALBERT-xxlarge (Lan et al., 2019) for the reader model. All hyper-parameters of our system are listed in Table 2.

| Retrieval model |  | Reader model |  |
|---|---|---|---|
| N_kwm | 10 | λ_a, λ_p, λ_t | 1 |
| N_tfidf | 5 | λ_s | 15 |
| h | 16 | epoch | 4 |
| d | 1024 | lr | 1e-5 |
| N_retri | 8 | batch | 8 |
| h_d | 0.025 |  |  |
| gate value | 0.2 |  |  |
| epoch | 2 |  |  |
| lr | 1e-5 |  |  |
| batch | 6 |  |  |

Table 2: List of hyper-parameters used in our model.

| Model | FW Ans EM | FW Ans F1 | FW Sup EM | FW Sup F1 | Dist Ans EM | Dist Ans F1 | Dist Sup EM | Dist Sup F1 |
|---|---|---|---|---|---|---|---|---|
| Baseline (Yang et al., 2018) | 24.7 | 34.4 | 5.3 | 41.0 | 44.4 | 58.3 | 22.0 | 66.7 |
| QFE (Nishida et al., 2019) | - | - | - | - | 53.7 | 68.7 | 58.8 | 84.7 |
| DFGN (Qiu et al., 2019) | - | - | - | - | 55.4 | 69.2 | - | - |
| Cognitive Graph QA (Ding et al., 2019) | 37.6 | 49.4 | 23.1 | 58.5 | - | - | - | - |
| GoldEn Retriever (Qi et al., 2019) | - | 49.8 | - | 64.6 | - | - | - | - |
| SemanticRetrievalMRS (Nie et al., 2019) | 46.5 | 58.8 | 39.9 | 71.5 | - | - | - | - |
| Transformer-XH (Zhao et al., 2020) | 50.2 | 62.4 | 42.2 | 71.6 | - | - | - | - |
| GRR w. BERT (Asai et al., 2020) | 60.5 | 73.3 | 49.3 | 76.1 | 68.0 | 81.2 | 58.6 | 85.2 |
| DDRQA w. ALBERT-xxlarge♣ (Zhang et al., 2020) | 62.5 | 75.9 | 51.0 | 78.8 | - | - | - | - |
| Gated Memory Flow | 63.6 | 76.5 | 54.5 | 80.2 | 69.6 | 83.0 | 64.7 | 89.0 |

Table 1: Results on the HotpotQA development set in the full wiki (FW) and distractor (Dist) settings. "-" denotes that no results are available. "♣" indicates a work recently presented on arXiv.

| Methods | AR | PR | P EM |
|---|---|---|---|
| Entity-centric IR | 63.4 | 87.3 | 34.9 |
| Cognitive Graph | 76.0 | 87.6 | 57.8 |
| Semantic Retrieval | 77.9 | 93.2 | 63.9 |
| GRR w. BERT | 87.0 | 93.3 | 72.7 |
| GMF w. BERT | 90.8 | 94.1 | 85.7 |
| GMF w. RoBERTa | 91.7 | 94.7 | 86.3 |

Table 3: Comparing our retrieval method with other published methods on Answer Recall, Paragraph Recall, and Paragraph EM.
4.2 Main Results

Downstream QA. We first evaluate our pipeline system on the downstream task on the HotpotQA development set¹. Table 1 shows the results in the full wiki setting, where the proposed GMF outperforms all published works on every metric by a large margin. GMF achieves 3.1/3.2 points of absolute improvement in Answer EM/F1 and 5.2/4.1 points in Sup EM/F1 over the published state-of-the-art results, which implies the superiority of the whole pipeline design. We also report results in the distractor setting, where GMF likewise achieves strong results.

¹We will also submit our model to the blind test set soon.
Retrieval. The retrieval results in the full wiki setting are presented in Table 3. To compare fairly with the published state-of-the-art results, we also report the results of our model based on BERT. Our GMF achieves 3.4 AR and 13.0 P EM points of improvement over the previous state-of-the-art model. The significant improvement comes from the proposed architecture for multi-hop retrieval: the new framework abandons many unnecessary presuppositions of graph-based methods and does not retrieve a reasoning path explicitly. GMF benefits from the generalization and flexibility of the proposed framework, leading to a high recall of evidence paragraphs. Note that we achieve this high recall with very few retrieved paragraphs; the average number of retrieved paragraphs per question is less than 4.

4.3 Ablation Study

To evaluate the effectiveness of our proposed method, we perform an ablation study on our GMF model. Table 4 reports the ablation results on the development set. From the table, we can see that retrieving only paragraphs linked from paragraphs in P_kwm and P_tfidf (i.e., removing bi-directional hyperlinks) improves precision but significantly hurts the recall scores and downstream performance. The other components have little influence on recall but lead to higher precision. In particular, the external rewritable memory provides a 6.6-point precision improvement, which implies the effectiveness of modeling the retrieval history. The gating mechanism helps the model avoid memorizing and retrieving irrelevant information: precision decreases by 16 points after removing the gate. In addition, harder negatives also improve retrieval performance. The QA performance likewise consistently decreases after ablating each component.

| Settings | AR | PR | P EM | Prec. | Joint EM | Joint F1 |
|---|---|---|---|---|---|---|
| Gated Memory Flow | 91.7 | 94.7 | 86.3 | 49.1 | 39.6 | 65.8 |
| - Bi-direct. Doc. | 87.2 | 95.6 | 78.8 | 51.3 | 37.1 | 62.3 |
| - Threshold Gate | 91.2 | 94.3 | 85.6 | 33.8 | 39.0 | 65.3 |
| - Rewritable Memory | 91.4 | 94.5 | 85.8 | 42.5 | 39.1 | 65.1 |
| - Harder Negatives | 91.5 | 94.5 | 86.0 | 39.2 | 38.4 | 64.8 |

Table 4: Ablation study of the GMF model on the dev set in the full wiki setting. AR, PR, P EM, and Prec. are retrieval metrics; Joint EM and Joint F1 measure downstream QA.

| Model | Source | AR | PR | P EM | Prec. |
|---|---|---|---|---|---|
| baseline | K.W.M. | 92.2 | 94.6 | 86.4 | 37.3 |
| baseline | TF-IDF | 91.8 | 94.5 | 86.3 | 39.2 |
| baseline | Hyperlink | 91.3 | 94.6 | 86.4 | 32.2 |
| baseline | BERT | 92.4 | 94.6 | 86.5 | 43.4 |
| GMF | K.W.M. | 91.6 | 94.4 | 86.1 | 36.8 |
| GMF | TF-IDF | 91.5 | 94.2 | 86.0 | 39.2 |
| GMF | Hyperlink | 91.2 | 94.5 | 85.9 | 34.6 |
| GMF | BERT | 91.7 | 94.7 | 86.3 | 49.1 |

Table 5: Effectiveness of training data with different data distributions.
nal rewritable memory can provide 6.6% precision In addition, our GMF achieves better retrieval per- improvement, which implies the effectiveness of formance in each different setting, which implies modeling the history of retrieval. The gating mech- the Effectiveness of our proposed model. anism can help the model not memorizes and re- trieve irrelevant information. The precision will Analysis on Candidate Set. We also evaluate decreases by 16% point after removing the gate. In how different sizes of the candidate set to influence addition, harder negatives also improve retrieval the downstream QA performance. Previous work performance. The QA performance also consis- (Asai et al., 2020) uses recall as a metric to measure
Intuitively, retrieving more paragraphs yields higher recall. However, does higher recall lead to better downstream task performance? We increase the number of paragraphs retrieved by the TF-IDF method in the term-based retrieval phase. As expected, Table 6 shows that more retrieved paragraphs lead to higher recall, and the number of candidate paragraphs |P_cand| increases significantly with N_tfidf. Nevertheless, the Joint EM and Joint F1 scores do not increase with recall: the candidate set contains too many noise paragraphs, some of which are retrieved and mislead the reader model. This analysis suggests that recall alone is not enough to evaluate retrieval quality. We hope future works introduce additional metrics (e.g., precision) to evaluate their retrieval models more comprehensively.

| Top N_tfidf | Recall | Num. | Joint EM | Joint F1 |
|---|---|---|---|---|
| Top 5 | 93.7 | 94 | 40.1 | 66.1 |
| Top 10 | 95.0 | 130 | 39.8 | 65.8 |
| Top 15 | 95.7 | 167 | 39.5 | 65.6 |
| Top 20 | 96.2 | 203 | 39.1 | 65.1 |

Table 6: Comparing performance with different candidate set recall. Num. denotes the number of candidate paragraphs.

[Figure 3: Examples of evidence paragraph prediction in the full wiki setting of HotpotQA. Left: the bridge question "Who is the mother of the king that Walchelin de Ferriers was principal captain to?" with relevance scores for the candidate paragraphs "Walchelin de Ferrieres", "Marie de Coucy", and "Richard I of England". Right: the comparison question "Which is farther west, Sheridan County, Montana or Chandra Taal?" with scores for "Sheridan County, Montana", "Outlook, Montana", and "Chandra Taal".]

4.5 Case Study

We provide two examples to illustrate how our model works. In Figure 3, the first question is a bridge question. To answer it, our model first retrieves the paragraph "Walchelin de Ferrieres" with high confidence. The model then takes the paragraph "Marie de Coucy" and gives it a relevance score of 0.15; the score is lower than the gate threshold, so the paragraph is not written to memory. Two paragraphs are stored in memory when the model calculates the relevance between the paragraph "Richard I of England" and the question. The model takes the representation of this paragraph as input to address the necessary information in the memory module, and the memorized paragraph "Walchelin de Ferrieres" is read out.

The second case is a comparison question that asks to compare the locations of "Sheridan County, Montana" and "Chandra Taal". The two evidence paragraphs receive very high relevance scores, while the others score approximately zero. In fact, because the keywords "Sheridan County, Montana" and "Chandra Taal" appear in the question, such comparison questions are comparatively easy not only for GMF but also for a RoBERTa based re-rank baseline model. However, such questions may be tricky for graph-based retrievers, since the two evidence paragraphs share no hyperlink connection or entity mentions. In contrast, our architecture does not require such presuppositions, leading to better generalization and flexibility.

5 Conclusion

In this paper, to tackle the multi-hop information retrieval challenge, we introduce an architecture that models a set of paragraphs as sequential data and iteratively identifies them. Specifically, we propose Gated Memory Flow to iteratively read and memorize the information required for reasoning, without interference from noise.
We evaluate our method on both the full wiki and distractor settings of the HotpotQA dataset, and it outperforms previous works by a large margin. In the future, we will attempt to design a more sophisticated model to further improve retrieval performance and to explore the effect of training data with different distributions on multi-hop information retrieval.
References

Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over Wikipedia graph for question answering. In International Conference on Learning Representations.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.

Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering. In International Conference on Learning Representations.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306–2317, Minneapolis, Minnesota. Association for Computational Linguistics.

Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2018. Neural models for reasoning over multiple mentions using coreference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 42–48, New Orleans, Louisiana. Association for Computational Linguistics.

Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2694–2703, Florence, Italy. Association for Computational Linguistics.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409, Austin, Texas. Association for Computational Linguistics.

Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109, Florence, Italy. Association for Computational Linguistics.

Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2553–2566, Hong Kong, China. Association for Computational Linguistics.

Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2335–2345, Florence, Italy. Association for Computational Linguistics.

Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. 2019. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2590–2602, Hong Kong, China. Association for Computational Linguistics.

Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6140–6150, Florence, Italy. Association for Computational Linguistics.

Nan Shao, Yiming Cui, Ting Liu, Shijin Wang, and Guoping Hu. 2020. Is graph structure necessary for multi-hop reasoning? arXiv preprint arXiv:2004.03096.

Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040.

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651, New Orleans, Louisiana. Association for Computational Linguistics.

Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xiaodong He, and Bowen Zhou. 2019. Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2704–2713, Florence, Italy. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.

Zhengnan Xie, Sebastian Thiem, Jaycie Martin, Elizabeth Wainwright, Steven Marmorstein, and Peter Jansen. 2020. WorldTree v2: A corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5456–5473, Marseille, France. European Language Resources Association.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Yuyu Zhang, Ping Nie, Arun Ramamurthy, and Le Song. 2020. DDRQA: Dynamic document reranking for open-domain multi-hop question answering. arXiv preprint arXiv:2009.07465.

Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Song, Paul Bennett, and Saurabh Tiwary. 2020. Transformer-XH: Multi-evidence reasoning with eXtra hop attention. In International Conference on Learning Representations.