Multi-task Retrieval for Knowledge-Intensive Tasks

Jean Maillard∗  Vladimir Karpukhin∗  Fabio Petroni  Wen-tau Yih  Barlas Oğuz  Veselin Stoyanov  Gargi Ghosh
Facebook AI
{jeanm,vladk,fabiopetroni,scottyih,barlaso,ves,gghosh}@fb.com

arXiv:2101.00117v1 [cs.CL] 1 Jan 2021

Abstract

Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be universal and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.

1 Introduction

Knowledge-intensive tasks is the common designation for a class of real-world NLP problems which, because of their nature, require large amounts of knowledge about the world (Petroni et al., 2020). For example, open-domain question answering requires producing answers to general factoid questions; fact checking involves determining the veracity of claims based on a database of trusted evidence. Practical solutions to these tasks usually involve an efficient retrieval component that, given an input query, selects a limited subset of relevant information from a large knowledge source. Sophisticated downstream models then consider the input only in the context of the retrieved information, and perform the final task.¹

The standard retrieval component in many systems (e.g., Thorne et al., 2018; Wang et al., 2018; Chen et al., 2017) has long relied on term-matching methods, such as tf-idf or BM25 (Robertson and Zaragoza, 2009). These methods rely on efficient algorithms and usually perform reasonably well regardless of the problem. In contrast, recent neural retrieval models, such as ICT (Lee et al., 2019), DPR (Karpukhin et al., 2020) and RAG (Lewis et al., 2020b), achieve better results by learning directly from task-specific training data and going beyond simple keyword matching. While task specialisation results in improved task performance, researchers have observed that a retriever trained for one specific domain will typically achieve low out-of-domain performance, and even lower performance on entirely different tasks (Petroni et al., 2020). This has two implications. First, unlike tf-idf or BM25, neural retrieval models are unsuitable for low data regimes such as few- and zero-shot settings. Second, task-specific retrievers complicate practical applications where multiple knowledge-intensive tasks may need to be performed using the same supporting database or over the same input text. It may not be practical to deploy multiple separate specialised models due to computational performance or memory concerns.

We ask the following question in this work: can we develop a universal neural retriever? Namely, we target a retriever that can perform well on a wide variety of problems, without task-specific fine-tuning, but which, if additional in-domain labelled data is available, can be further fine-tuned to improve its performance. We perform a large experimental study to attempt to build such a universal retrieval model. We find that, by jointly training on an extensive selection of retrieval tasks, we obtain a model which is not only more robust than previous approaches, but also can lead to better performance on the downstream knowledge-intensive tasks when plugged into an existing system.

∗ Equal contribution.
¹ While large pre-trained neural models have been shown to incorporate real-world knowledge in their parameters and thus may skip retrieval (Petroni et al., 2019), they still have limited capacity and suffer from a lack of explainability.
Our approach combines the benefits of IR-based models with those of task-specific neural retrievers – namely, good performance when no (or not enough) training data is available, and high task performance due to its ability to learn highly specialised representations.

Our contributions can be summarised as follows.

• We propose a single general-purpose "universal" retrieval model, able to perform comparably or better than specialised retriever approaches in both zero-shot (leave-one-out) and few-shot retrieval. We investigate several model variants, shedding light on which aspects of the architecture affect its performance.
• We show that our model's gains in terms of retrieval directly translate into performance gains for a variety of downstream knowledge-intensive tasks.
• We will share the implementation as well as our best model. This is in the form of a readily available BERT checkpoint which, as we will show, can be used by NLP practitioners as a strong out-of-the-box retrieval system, but which can also undergo further in-domain training for even higher performance.

2 Background

In this section, we first give an overview of retrieval methods based on sparse and dense representations. We then discuss a wide range of knowledge-intensive NLP tasks, where retrieval plays a crucial role in solving the problems.

2.1 Retrieval methods

Given a large collection of unstructured text passages, information retrieval (IR) can be broadly defined as finding a small set of passages that satisfies an information need, often presented in the form of a short-text query (Manning et al., 2008). Traditional IR methods, such as tf-idf and BM25 (Robertson and Zaragoza, 2009), match keywords efficiently with an inverted index. Such methods can be seen as representing queries and passages as high-dimensional, sparse vectors, where each dimension corresponds to a term in the vocabulary and the weight indicates its importance.

In contrast to tf-idf and BM25, dense retrieval methods encode text as a latent semantic vector of a fixed, much smaller dimensionality. Whether a passage is relevant to a given query is determined by the distance of their vectors (Deerwester et al., 1990). Although dense representations do not encode tokens explicitly and can potentially map paraphrases that share no tokens to close vectors, the performance of early dense retrieval methods was often inferior to term-matching approaches, except when large labelled data is available (Yih et al., 2011; Gao et al., 2011; Huang et al., 2013). Thanks to the success of large pre-trained models (Devlin et al., 2019; Liu et al., 2019b), however, recent dense retrieval methods have been shown to outperform their sparse counterparts when fine-tuned on a small set of in-domain labelled data (Karpukhin et al., 2020; Lewis et al., 2020b; Xiong et al., 2020). Efficient indexing and search of dense vectors are made possible by maximum inner product search (MIPS) algorithms (e.g., Shrivastava and Li, 2014; Guo et al., 2016), as well as tools like FAISS (Johnson et al., 2019).

Our work is built upon the Dense Passage Retriever (DPR) architecture of Karpukhin et al. (2020), which was initially proposed for the task of open-domain question answering. DPR is a neural bi-encoder model which embeds queries with an encoder f(·) and passages with a separate encoder g(·). Given an input query x and a target passage y, we have

p(x | y) ∝ sim(x, y),

where the similarity score sim(x, y) is defined as the inner product of the embeddings of its arguments, f(x) · g(y). Given a query at inference time, calculating its similarity with every possible passage would be prohibitive for large knowledge sources. Therefore, DPR makes use of the FAISS library (Johnson et al., 2019) to perform fast approximate nearest neighbour search in sub-linear time.

Training is based on a contrastive loss. Given a query x, a relevant passage y, and a set of n irrelevant passages y_i^-, we train the model by optimising the following negative log likelihood:

\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(x, y))}{\exp(\mathrm{sim}(x, y)) + \sum_{i=1}^{n} \exp(\mathrm{sim}(x, y_i^-))}

As the set of irrelevant passages, we use the relevant passages for other queries within the same batch, as well as a specially selected "hard" confounder.
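To make the training objective concrete, the following is a minimal PyTorch sketch (our own illustration, not the authors' released code) of the in-batch contrastive loss described above. The function name and toy dimensions are assumptions; in real training the inputs would be the query and passage embeddings produced by the two BERT encoders.

```python
import torch
import torch.nn.functional as F

def dpr_contrastive_loss(query_vecs, passage_vecs):
    """In-batch contrastive loss in the style of DPR.

    query_vecs:   (B, d) query embeddings f(x).
    passage_vecs: (B + H, d) passage embeddings g(y); row i (i < B) is the
                  relevant passage for query i, the remaining H rows are
                  extra "hard" confounders added to the batch.
    """
    # sim(x, y) = f(x) . g(y): a (B, B + H) matrix of inner products.
    scores = torch.matmul(query_vecs, passage_vecs.t())
    # The gold passage for query i sits in column i; every other column
    # (other queries' positives plus the hard confounders) is a negative.
    targets = torch.arange(query_vecs.size(0), device=query_vecs.device)
    # Row-wise cross-entropy is exactly the negative log likelihood above.
    return F.cross_entropy(scores, targets)

# Toy usage with random tensors standing in for BERT [CLS] embeddings.
B, H, d = 4, 4, 768
print(dpr_contrastive_loss(torch.randn(B, d), torch.randn(B + H, d)))
```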
This is a passage which has high lexical overlap with the query (high BM25 score), but is not among the set of relevant passages for the given data point. Karpukhin et al. (2020) have shown that the inclusion of such "hard" confounders leads to substantially improved training results. This training process is illustrated in Figure 1.

Figure 1: Training of DPR (Karpukhin et al., 2020), a bi-encoder model for open-domain question answering. Queries and passages are encoded as vectors, and retrieval is performed as a maximum inner product search. [Figure omitted: a query encoder and several document encoders feed a classification-via-learned-metric layer; the example shows a query, its positive and "hard" negative documents, and the positive documents of the other queries in the batch.]

2.2 Knowledge-intensive Tasks

For the training and evaluation of all models in the paper we make use of KILT, a benchmark and library of datasets (Petroni et al., 2020). KILT consists of a selection of datasets spanning five varied classes of knowledge-intensive tasks (i.e., question answering, slot filling, fact checking, dialogue, entity linking), with the aim to cover many different ways of seeking knowledge. Input queries can vary wildly from one task to the other, and include classic examples of open-domain retrieval tasks such as natural language questions and claims to be verified, as well as more unusual examples like conversation fragments and long chunks of annotated text. Crucially, all datasets distributed in KILT have been re-aligned such that they are all grounded in the same snapshot of Wikipedia, which the authors distribute. The knowledge required to answer any of the queries in the library of tasks can thus be found within the same unified knowledge source.

To illustrate the variety of ways in which the input queries for different tasks can be formulated, we provide a few simple examples in Table 1. In spite of the differences between query formulations, all these tasks share one crucial aspect: they all require a retriever to fetch the relevant passages from the knowledge source, in order to support the final downstream task.

Task | Example query | Answer | Relevant doc.
Question Answering | Who is playing the Halftime Show at Super Bowl 2016? | Coldplay | The Super Bowl 50 Halftime Show took place on February 7, 2016 ... It was headlined by the British rock group Coldplay.
Fact Checking | Bermuda Triangle is in the western part of the Himalayas | REFUTES | The Bermuda Triangle ... is a loosely defined region in the western part of the North Atlantic Ocean
Slot Filling | Piner Creek [sep] mouth of the watercourse | Santa Rosa Creek | Piner Creek discharges to Santa Rosa Creek which in turn ...
Entity Linking | Leicestershire take over at top after innings victory. London. [start ent]West Indian[end ent] all-rounder Phil Simmons ... | West Indies cricket team | The West Indies cricket team is a multi-national men's cricket team representing the Anglophone Caribbean region
Dialogue | I am a big fan of Star Trek [sep] I don't know much about it. When did the first episode air? [sep] It debuted in .. [sep] What is the plot of the show? | William Shatner plays the role of Captain Kirk | It followed the interstellar adventures of Captain James T. Kirk (William Shatner) and his crew ...

Table 1: Illustrative examples of some of the tasks within KILT, and how varied their query formulations can be.

3 Methods

3.1 Universal retrieval

Using task-specific models to tackle our collection of retrieval tasks would involve completely separate models, one per dataset. As illustrated in Figure 2, this would lead to a proliferation of models and data, down to separate indexed copies of the knowledge source itself (Wikipedia). This setup will form one of our baselines.

Figure 2: Two retrieval tasks performed by two fully-specialised models. [Figure omitted: each task has its own query encoder, document encoder, document index and knowledge source.]

Multi-task training has been successfully used to allow models to leverage cross-task data, as well as to provide a regularisation effect leading to better generalisation ability (Liu et al., 2019a). We apply this concept to neural retrievers, with the aim of improving performance by jointly leveraging multiple different retrieval datasets.

Figure 3: Parameter sharing between neural retrievers. (a) Separate query encoders. (b) A single retrieval model. [Figure omitted: both variants share the document encoder, document index and knowledge source; variant (a) keeps one query encoder per task, variant (b) shares a single query encoder.]

Our base setup is illustrated in Figure 3b and involves using a shared passage encoder — so that a single index of encoded passages can be used — as well as a query encoder that is shared across all tasks. In essence, in this setup a single DPR model is used to perform all retrieval tasks.

Due to the complexity of training and evaluating retrieval models (which involves training the retriever, embedding all of Wikipedia, and building an index), our main set of experiments is all based on this configuration, which was found to work well in preliminary experiments.
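As a rough illustration of this base setup (one shared query encoder, one shared passage encoder, one index serving every task), the sketch below builds a single FAISS inner-product index over passage embeddings and answers queries from different tasks against it. The encoder stubs, the example passages and the dimensionality are placeholders standing in for the shared BERT encoders, not the authors' implementation.

```python
import numpy as np
import faiss  # library used by DPR for fast maximum inner product search

d = 768  # embedding size of the shared encoders

def encode_passages(passages):
    # Placeholder for the shared passage encoder g(.); returns (N, d) float32.
    return np.random.rand(len(passages), d).astype("float32")

def encode_query(query):
    # Placeholder for the shared query encoder f(.); returns (1, d) float32.
    return np.random.rand(1, d).astype("float32")

# One index of the knowledge source, built once and reused by every task.
passages = ["passage about Coldplay ...", "passage about the Bermuda Triangle ..."]
index = faiss.IndexFlatIP(d)          # exact inner-product search
index.add(encode_passages(passages))  # in practice, ~36M Wikipedia passages

# Queries from different KILT tasks hit the same encoder and the same index.
for task, query in [("QA", "Who headlined the Super Bowl 50 halftime show?"),
                    ("Fact Checking", "Bermuda Triangle is in the Himalayas")]:
    scores, ids = index.search(encode_query(query), 2)
    print(task, [passages[i] for i in ids[0]])
```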
However, in order to report on the performance of alternative architectures, we also investigate the following additional variants in a restricted experimental setting, limited to a few tasks:

• Task-specific query encoder. A different query encoder is used for each family of tasks, e.g. all question answering tasks use the same query encoder, but fact checking uses a different one. This is meant to allow for potentially different needs in processing queries, given the fundamentally diverse nature of the tasks at hand. This configuration is illustrated in Figure 3a.
• Task markers. This approach is similar to our base setup, where a single model performs all tasks. Additionally, we introduce specialised tokens which are inserted at the beginning of each query. Their aim is to help the model distinguish between the different tasks by marking them. We use one task marker for each of the five task classes of KILT, such that all question answering tasks share the same marker (see the sketch after this list).
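A minimal sketch of the task-marker variant, under the assumption that markers are implemented as special tokens added to the tokenizer vocabulary and prepended to each query before encoding. The marker names and the Hugging Face tokenizer usage are our illustration, not necessarily the paper's exact implementation.

```python
from transformers import BertTokenizerFast

# One marker per KILT task class; every dataset of a class shares its marker.
TASK_MARKERS = {"qa": "[QA]", "slot_filling": "[SF]", "fact_checking": "[FC]",
                "dialogue": "[DIAL]", "entity_linking": "[EL]"}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Register the five markers as special tokens so they are never split into subwords.
tokenizer.add_special_tokens({"additional_special_tokens": list(TASK_MARKERS.values())})
# (The query encoder's token embedding matrix must then be resized accordingly.)

def mark_query(query: str, task_class: str) -> str:
    """Prepend the class-level marker, e.g. all QA datasets share [QA]."""
    return f"{TASK_MARKERS[task_class]} {query}"

inputs = tokenizer(mark_query("Who wrote Hamlet?", "qa"), return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
```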
3.2 Adversarial confounder selection

We saw in § 2.1 how "hard" confounder passages are collected using a BM25 baseline, following the standard approach in DPR. However, any other retriever can be used to select such confounders, including the very retriever being trained, leading to an iterative, self-adversarial training. Concretely, this amounts to the following steps: (1) a first version of the retriever is trained with BM25 confounders; (2) new confounders are selected with the trained model, by retrieving high-ranking passages which are not among the set of relevant ones; (3) a second version of the model is trained using the additional new confounders.

Intuitively, it is expected that this approach should lead to higher quality confounders compared to those selected by BM25 based on simple keyword matching. Based on our own experience as well as the relevant literature (Khattab et al., 2020), this adversarial approach has been shown to work well for question answering.

As a way of further pushing the performance of the model, we experiment with this adversarial confounder selection on two datasets, Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). We selected these two datasets since, out of all of the tasks we are considering, they have an easy way of checking whether a certain passage is relevant or not for a given query – namely, by checking whether the answer is present in the passage. This enabled us to automatically build sets of confounders, ensuring relevant passages would be excluded.²

² Strictly speaking, assuming a passage to be irrelevant because of the absence of the answer span is not formally correct. However, experiments show a good correlation between this simple check and the overall model quality.
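The three steps above can be summarised in the following sketch. The retriever interface, the training calls and the dataset fields are hypothetical placeholders, included only to make the mining loop explicit; the answer-containment check is the proxy relevance test described in the footnote.

```python
def contains_answer(passage: str, answers: list[str]) -> bool:
    # Proxy relevance check used for NQ/TriviaQA: a passage containing no
    # answer string is treated as irrelevant (see footnote 2).
    return any(a.lower() in passage.lower() for a in answers)

def mine_adversarial_confounders(retriever, dataset, k=50, per_query=1):
    """Step (2): use a trained retriever to pick high-ranking non-relevant passages."""
    mined = {}
    for ex in dataset:  # ex: {"query": ..., "answers": [...], "relevant_ids": {...}}
        hard = []
        for pid, passage in retriever.retrieve(ex["query"], k=k):  # hypothetical API
            if pid not in ex["relevant_ids"] and not contains_answer(passage, ex["answers"]):
                hard.append(pid)
            if len(hard) == per_query:
                break
        mined[ex["query"]] = hard
    return mined

# (1) train with BM25 confounders, (2) mine new ones, (3) retrain from scratch.
# retriever_v1 = train_retriever(train_data, confounders="bm25")       # hypothetical
# extra = mine_adversarial_confounders(retriever_v1, train_data)
# retriever_v2 = train_retriever(train_data, extra_confounders=extra)  # hypothetical
```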
4 Experiments

4.1 Experimental settings

Dataset selection. For our experiments we select the eight KILT datasets listed in Table 2, which cover all five task classes and include a training split, a validation split, and a held-out test split.

Dataset | Task class | #Train
FEVER | Fact Checking | 71 k
AIDA-YAGO 2 | Entity Linking | 18 k
T-REx | Slot Filling | 2,284 k
Zero Shot RE | Slot Filling | 132 k
Natural Questions | QA | 77 k
HotpotQA | QA | 69 k
TriviaQA | QA | 53 k
Wizard of Wikipedia | Dialogue | 80 k

Table 2: KILT datasets used in this work, and the size of our converted training sets for each.

Preprocessing. Starting from the raw KILT data, we split each Wikipedia article into disjoint 100-token chunks which form our basic retrieval units, following the approach of Wang et al. (2019) and Karpukhin et al. (2020). To maintain the same language introduced in §3, we will simply call these chunks passages.

This preprocessing results in a knowledge source of 36 million passages. In order to harmonise all datasets to the same knowledge source, KILT used a mapping strategy based on the BLEU metric to map relevant passages in the original versions of its datasets to passages in its own shared knowledge source (Petroni et al., 2020). Entries included in the KILT training sets which have a mapping BLEU score below 0.5 are likely to be noise, and we exclude them from training.

Multi-tasking. Training is performed on the union of all training sets. Since two of the training sets are far larger than the others, we use a simple downsampling strategy to bring them to the same order of magnitude as the rest. Preliminary experiments with more complex sampling methods, like resampling all datasets so that each epoch would see an equal number of samples from each, found that they had no measurable effect compared to this simpler approach.

Encoders. Our query and passage encoders are initialised as two distinct BERT base uncased encoders (Devlin et al., 2019), trained separately. As pooling mechanism we find it effective to simply take the [CLS] token representation at the topmost layer.

Training. We train our models for up to 80 epochs. To select the best checkpoint, we perform full evaluations of validation set retrieval performance at regular intervals. We use the Adam optimiser (Kingma and Ba, 2015) with a learning rate of 2·10⁻⁵, warmup and a linear decay schedule, and a dropout rate of 0.1. The batch size is set to 128 samples, and in preliminary experiments we found no benefit in increasing this further. We use an additional "hard" confounder per batch, selected based on BM25 score as in Karpukhin et al. (2020).

Downstream evaluation. When evaluating our retriever within a larger architecture to perform a knowledge-intensive task, we replicate the DPR + BART setup of Petroni et al. (2020). This uses DPR to retrieve the top 3 passages and prepend them to the query, which is then processed by a task-specific fine-tuned BART model to generate the final answer for the end task.
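The following is an illustrative sketch of this kind of retrieve-and-generate setup, not the exact KILT DPR + BART pipeline: the top retrieved passages are prepended to the query and fed to a BART generator. The checkpoint name, separator choice and generation settings are assumptions, and in the real setup BART is fine-tuned per task.

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
# In the actual experiments this model would be fine-tuned on (passages + query, answer) pairs.

def answer(query: str, retrieved_passages: list[str], top_k: int = 3) -> str:
    # Prepend the top-k retrieved passages to the query, truncating to BART's input size.
    context = " ".join(retrieved_passages[:top_k])
    inputs = tokenizer(f"{context} {tokenizer.sep_token} {query}",
                       return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(answer("Who is playing the Halftime Show at Super Bowl 2016?",
             ["The Super Bowl 50 Halftime Show ... was headlined by Coldplay."]))
```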
4.2 Universal retrieval

The results of the evaluations reported in Petroni et al. (2020) show that retrievers trained for question answering have poor performance outside of their domain. We would like to understand if it is possible to design a single model which can accurately satisfy the information needs of a wide variety of knowledge-intensive tasks. In short: can a neural retriever be universal?

We perform a comprehensive evaluation of several models on the eight tasks of Table 2. The setups we evaluate include eight task-specific models (one trained on each of the eight datasets), for which we measure both in-domain and out-of-domain performance, and a BM25 baseline. Additionally, we include a multi-task trained model – as described in §3.1 – with the hope that it can learn to perform all tasks satisfactorily. This amounts to 10 models evaluated on eight tasks each, for a total of 80 evaluations.

To measure retrieval performance, we adopt the main metric used for the KILT benchmark, R-precision. This is calculated as r/R, where R is the total number of relevant passages for a given query, and r is the number of relevant passages returned among the top-R retrieval results. For the case of R = 1 this is therefore equivalent to precision@1.
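A small sketch of the R-precision computation as just described (our own helper, not the official KILT evaluation script):

```python
def r_precision(retrieved_ids, relevant_ids):
    """R-precision: fraction of the R relevant items found among the top-R results."""
    R = len(relevant_ids)
    if R == 0:
        return 0.0
    top_r = retrieved_ids[:R]
    return sum(1 for pid in top_r if pid in relevant_ids) / R

# With a single relevant passage (R = 1) this reduces to precision@1.
print(r_precision(["p7", "p2", "p9"], {"p2"}))  # 0.0: the gold passage is not ranked first
```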
Model | FEV | AY2 | T-REx | zsRE | NQ | HoPo | TQA | WoW
Multi-task | 74.72 / 46.96 | 83.78 | 69.18 / 53.54 | 77.23 / 41.70 | 61.51 / 28.80 | 44.21 / 38.42 | 61.95 / 24.56 | 39.70 / 24.07
BM25 | 50.13 / 40.06 | 3.47 | 58.60 / 51.64 | 66.43 / 52.98 | 25.83 / 14.20 | 43.95 / 38.38 | 29.44 / 16.16 | 27.50 / 18.41
Task-specific models:
FEVER | 73.60 / 43.92 | 5.62 | 19.50 / 10.02 | 42.88 / 19.98 | 36.69 / 18.05 | 23.18 / 17.59 | 45.08 / 22.24 | 41.27 / 19.85
AY2 | 47.36 / 37.58 | 81.77 | 5.52 / 4.08 | 8.94 / 5.50 | 10.22 / 6.77 | 11.69 / 10.71 | 15.11 / 8.47 | 17.59 / 13.08
T-REx | 45.63 / 25.22 | 1.05 | 69.08 / 58.54 | 71.64 / 40.95 | 17.10 / 8.71 | 22.31 / 15.63 | 18.10 / 8.06 | 4.02 / 1.83
zsRE | 70.10 / 33.12 | 0.42 | 68.34 / 57.40 | 97.74 / 78.81 | 25.98 / 13.81 | 22.23 / 18.35 | 28.68 / 14.44 | 10.40 / 2.09
NQ | 68.16 / 14.81 | 1.44 | 31.78 / 7.20 | 61.12 / 12.92 | 63.24 / 28.13 | 29.39 / 11.33 | 48.39 / 14.42 | 30.77 / 11.81
HoPo | 56.18 / 40.03 | 2.07 | 35.76 / 27.62 | 44.44 / 31.15 | 35.60 / 23.26 | 46.63 / 43.47 | 41.18 / 29.37 | 23.51 / 16.02
TQA | 70.06 / 10.68 | 4.95 | 32.22 / 12.52 | 60.37 / 17.43 | 45.01 / 12.97 | 32.62 / 13.05 | 65.12 / 23.79 | 41.17 / 8.11
WoW | 59.16 / 42.79 | 3.11 | 20.92 / 18.52 | 41.14 / 35.26 | 33.27 / 22.52 | 20.36 / 17.66 | 39.37 / 23.15 | 40.32 / 20.73

Table 3: Page- and passage-level R-precision on KILT validation data, grouped by task class (fact checking: FEV; entity linking: AY2; slot filling: T-REx, zsRE; open-domain QA: NQ, HoPo, TQA; dialogue: WoW). For the AIDA-YAGO 2 dataset, due to the nature of the task, only page-level retrieval is defined.

Table 3 shows retrieval performance on the validation data, with the best performance on a given dataset marked in bold, and the second best underlined.

While the KILT evaluation focuses on retrieval at the level of Wikipedia pages (thereby marking as "hits" any results that lie within the correct page), we are also interested in performing an evaluation at a more fine-grained level. We therefore also evaluate our models at the passage level, using a modified version of the official KILT evaluation scripts. These scores are shown as the second number in each column.

We straight away notice that the task-specific models tend to achieve high performance on their respective tasks, often taking one of the top two spots. Interestingly, we also note that these neural retrievers consistently outperform the BM25 baseline, showing that the result which Karpukhin et al. (2020) achieved for open-domain question answering also holds for other knowledge-intensive tasks.

The results reveal a strong performance for the multi-task model, confirming the hypothesis that a single model can be successfully trained to perform a wide variety of retrieval tasks. With the exception of one dataset, the shared model achieves the best retrieval performance or is within a few percentage points of the top score. We note that the one exception is the Zero-shot RE task (Levy et al., 2017), a trivial task in which the query always contains the title of the page to be retrieved. Indeed, the model specific to this task manages to achieve a near-perfect score.

Another task which stands out for being markedly different in formulation is AIDA-YAGO 2 (Hoffart et al., 2011). As shown in Table 3, models that were not trained on this specific task perform it very poorly. Entity linking is a task that is normally better performed by models explicitly designed for it (Cao et al., 2020). We nevertheless include it to showcase the ability of neural retrievers to adapt to it, and note how well the multi-task retriever performs on it in spite of its unusual nature.

4.3 Downstream performance

We saw that our proposed approach achieves strong performance across a variety of retrieval tasks. However, our interest in neural retrievers stems from their use as components within larger systems, to perform tasks such as question answering. Our next experimental question is therefore: can a universal retriever lead to better downstream performance in knowledge-intensive tasks?

We perform a downstream evaluation of our approach used in conjunction with BART (Lewis et al., 2020a) as the generative component or classifier, adopting the same setup as Petroni et al. (2020). Results are reported in Table 4, with bold and underline marking the best and second best scores respectively.

Model | FEV | AY2 | zsRE | NQ | HoPo | TQA | WoW | Avg.
Multi-task + BART | 86.32 | 82.61 | 57.95 | 39.75 | 31.77 | 59.60 | 15.33 | 53.33
DPR + BART | 86.74 | 75.49 | 30.43 | 41.27 | 25.18 | 58.55 | 15.55 | 47.60
RAG | 86.31 | 72.62 | 44.74 | 44.39 | 26.97 | 71.27 | 13.22 | 51.36
T5 | 76.30 | 74.05 | 9.02 | 19.60 | 12.64 | 18.11 | 13.49 | 31.89

Table 4: KILT test scores on the downstream evaluation. Results in the bottom section are as reported in Petroni et al. (2020). The score metrics are accuracy for fact checking, entity linking and slot filling; exact match for QA; and F1 score for dialogue.³

³ Performing this evaluation required retrieving relevant documents for all training sets. Due to the very large size of T-REx, this particular dataset could not be included in this section.

The DPR + BART line refers to a setup similar to our own, but with the simpler retriever of Karpukhin et al. (2020). Therefore, comparing its performance to ours gives a clear indication of the contribution of multi-task training to the overall performance on knowledge-intensive tasks. Our proposed model achieves significantly better performance than this baseline on AY2, zsRE and HoPo, while for the other tasks the discrepancy is always below two points. This is reflected in the last column too, showing that on average multi-task training leads to better downstream performance. The model also compares favourably to RAG (Lewis et al., 2020b), a more advanced system in which the query encoder is fine-tuned on the end task.

4.4 Zero- and few-shot performance

Task-specific neural retrievers can achieve higher performance than IR-based methods, but they are not suitable for cases where no (or not enough) training data is available. In those cases, tf-idf and BM25 are the better choice. To evaluate the performance of a multi-task retriever as a suitable replacement for them in this scenario, we run a series of experiments in the low data regimes (few-shot and zero-shot).

We start by training a set of multi-task retrievers (using the base setup) in the leave-one-out setting for each of the datasets, in order to see how a neural retriever performs when trained on all domains except the one it is to be evaluated on. The results of these zero-shot experiments are reported in the second line of Table 5 (again, text is in bold for the best overall performance, and underlined for second best). They show that, even in the zero-shot setting, the multi-task neural retriever achieves performance competitive with BM25, with retrieval being 10 points higher at the page level and 5 points lower at the passage level on average.

The advantage of neural retrievers over BM25 lies in their ability to improve with training. We therefore look at few-shot training for each task, and create two smaller copies of each of the original training sets with a random sample of 128 and 1,024 examples respectively. In order to evaluate the suitability of a multi-task trained retriever as a starting checkpoint for few-shot training, we take the various leave-one-out models and fine-tune them on our few-shot training sets. To check whether multi-task pre-training is effective, we also compare these to DPR models (which are just initialised with BERT weights) fine-tuned on the same data.

The bottom two sections of Table 5 report the results. The most dramatic gains from fine-tuning are seen for AY2, an "outlier" task whose formulation differs from that of the other tasks, and which seems to benefit the most from seeing in-domain data. The zsRE performance does not seem to improve from fine-tuning on the smaller dataset, but sees a very big jump when switching to the larger dataset. As a reminder, in this trivial task the title of the page to be retrieved always appears at the start of the query. It is therefore not surprising that models specifically fine-tuned on it can achieve near-perfect scores, as long as enough training data is provided.

In spite of the fine-tuning, we note that both DPR and the multi-task model fail to improve on their performance for T-REx, suggesting that large amounts of training data are required to learn this task. Nevertheless, the multi-task model proves itself more robust, and achieves the top performance on it.

Finally, we note that for 2 out of 8 tasks, namely zsRE and WoW, DPR achieves lower page-level retrieval scores than the multi-task model, but performs better at the passage level. This shows that fine-grained and coarse-grained retrieval performance are not always perfectly correlated.
Model | FEV | AY2 | T-REx | zsRE | NQ | HoPo | TQA | WoW | Avg.
BM25 | 50.13 / 40.06 | 3.47 | 58.60 / 51.64 | 66.43 / 52.98 | 25.83 / 14.20 | 43.95 / 38.38 | 29.44 / 16.16 | 27.50 / 18.41 | 38.17 / 33.12
Leave-one-out multi-task models:
Zero-shot | 74.11 / 37.09 | 4.16 | 67.54 / 44.84 | 73.42 / 32.65 | 47.23 / 21.50 | 34.72 / 16.52 | 49.08 / 28.06 | 36.92 / 16.19 | 48.40 / 28.12
Finetune (128) | 75.95 / 32.75 | 32.38 | 67.54 / 44.84 | 73.41 / 32.65 | 47.48 / 14.98 | 34.72 / 27.82 | 54.71 / 19.82 | 48.36 / 17.46 | 54.23 / 27.19
Finetune (1k) | 73.08 / 40.83 | 70.40 | 67.54 / 44.84 | 93.04 / 58.67 | 51.00 / 19.90 | 39.19 / 35.43 | 59.08 / 20.22 | 47.65 / 19.75 | 62.62 / 34.23
Vanilla DPR models:
Finetune (128) | 37.99 / 25.31 | 26.23 | 0.20 / 0.02 | 0.16 / 0.00 | 20.92 / 9.52 | 14.46 / 14.08 | 26.85 / 10.54 | 30.31 / 17.20 | 19.64 / 10.95
Finetune (1k) | 70.87 / 47.82 | 72.49 | 0.20 / 0.02 | 90.33 / 80.20 | 43.43 / 19.81 | 30.75 / 30.50 | 52.50 / 17.33 | 44.70 / 24.92 | 50.66 / 31.51

Table 5: Page- and passage-level R-precision in the zero-shot setting and with additional fine-tuning on 128 and 1,024 examples. We also compare to a BM25 retriever and a DPR model initialised with BERT weights.

Overall, the experiments show strong results for the multi-task model, with the average zero-shot performance being competitive with BM25, and the average few-shot performance being markedly better than the alternatives. The discrepancy in performance between a vanilla DPR model and the leave-one-out multi-task model is especially noticeable when using the smaller of the two datasets, in which case average performance for the latter is more than double that of vanilla DPR.

4.5 Model variants

In this set of experiments we compare our base multi-task model with the two variants described in § 3.1. Due to the high memory consumption of the "task-specific encoders" variant (requiring one full query encoder per task family, in addition to the passage encoder), it was only possible to perform these evaluations in a restricted setting of three datasets. The results in Table 6 do not reveal a clear winner, suggesting that the base architecture might be the better choice due to its simplicity and generally good performance.⁴

Variant | FEV | NQ | TQA
Base | 76.38 / 40.76 | 60.91 / 24.50 | 64.77 / 21.75
Task markers | 75.84 / 40.79 | 62.31 / 25.10 | 64.04 / 20.86
Task-spec. enc. | 73.53 / 40.02 | 61.05 / 25.52 | 64.17 / 21.23

Table 6: Multi-task model variants evaluated on a subset of tasks (R-precision on validation data at page/passage level).

⁴ Not included in this table, due to very poor performance in preliminary experiments, are two further variants: a base model with a single encoder for both queries and passages, and a base model trained from scratch without BERT pre-training.

4.6 Adversarial confounder selection

Finally, we evaluate the adversarial confounder selection method described in § 3.2. This involves augmenting our regular training sets with additional confounders for TriviaQA and Natural Questions, selected using our top multi-task trained model. A new multi-task model is then trained from scratch on this augmented data. Its performance is reported in Table 7, showing an overall improvement across multiple tasks. While this approach is demonstrated here on our multi-task model, it is in fact orthogonal to it, and could be applied to any other neural retriever trained with a contrastive loss.

Confounders | FEV | AY2 | T-REx | zsRE | NQ | HoPo | TQA | WoW
BM25 | 74.72 / 46.96 | 83.78 | 69.18 / 53.54 | 77.23 / 41.70 | 61.51 / 28.80 | 44.21 / 38.42 | 61.95 / 24.56 | 39.70 / 24.07
BM25 + adv | 74.79 / 52.12 | 84.86 | 71.36 / 61.40 | 80.04 / 54.08 | 59.25 / 40.11 | 44.08 / 41.04 | 59.19 / 34.17 | 41.04 / 24.62

Table 7: Comparison of two confounder selection methods for the multi-task model: simple BM25, and BM25 augmented with adversarial confounders (R-precision on validation data at page/passage level).
5 Related work

The approach most closely related to ours is DPR (Karpukhin et al., 2020), upon which we built all our retrieval systems; the model and its historical context are covered in detail in § 2.1. Another closely related approach is the Retrieval-Augmented Generation (RAG) model of Lewis et al. (2020b). In its base configuration it augments DPR with a generative reader, and it trains the query encoder end-to-end (differing from traditional retriever-reader architectures, which treat the two steps as disjoint). A natural extension of the work we have presented would be to combine RAG with our joint learning approach, to study whether it can lead to further gains in performance or robustness.

A number of promising techniques to boost retrieval performance have been proposed recently. These are orthogonal to our work, and as such could be combined with it. Amongst these, pre-training methods form one class. The Inverse Cloze Task (Lee et al., 2019) and its extensions (Chang et al., 2020) are self-supervised pre-training methods designed for retrieval in open-domain question answering. Whether such specific pre-training is beneficial to tasks other than question answering remains an open question. CERT (Fang et al., 2020) is an alternative pre-training approach, inspired by some recent advances in computer vision. While to our knowledge it has not been applied to retrieval problems, we believe it might be promising due to its focus on sentence-level semantics (as opposed to the more standard masked language modelling pre-training, which focuses on the token level).

Another class of orthogonal improvements to dense retrieval involves models which embed passages into multiple fixed-size vectors. Of these, ColBERT (Khattab and Zaharia, 2020) and ME-BERT (Luan et al., 2020) are two representative examples. One further approach is ColBERT-QA (Khattab et al., 2020), which additionally uses a data augmentation strategy closely related to our own approach described in § 3.2.

Finally, two entity linkers, GENRE (Cao et al., 2020) and BLINK (Wu et al., 2020), are worth mentioning. Being trained specifically for entity linking, these models will generally outperform retrieval-based approaches on that task. While they are not comparable to retrieval models and will not generally be applicable to information retrieval tasks, we mention them here to give readers a fuller picture of the existing literature.

6 Conclusions

We have conducted a large-scale experimental study on knowledge-intensive tasks, and on how the retrieval models that tackle them seek the required information from knowledge bases such as Wikipedia.

The study started with the question of whether the way in which information is embedded for retrieval purposes is universal. Section 4.2 provided evidence that to a large extent it is, with a single "universal" retriever, trained jointly on 8 datasets, often performing comparably to task-specific models.

Armed with this knowledge, in Section 4.3 we plugged our single model into a larger pipeline, in order to see its contribution to the downstream performance on a wide range of knowledge-intensive tasks. This led to an overall improvement in downstream performance, setting new top results for a number of tasks in the KILT benchmark.

Next, in Section 4.4, we evaluated the model's performance in the zero-shot and few-shot settings. By evaluating on a wide range of tasks, we were able to show that our proposed approach performs comparably to BM25 in the zero-shot setting, and quickly overtakes it even with minimal in-domain training.

In Section 4.5 we evaluated a number of more complex variants of the model involving task specialisation, but failed to see clear performance improvements. Finally, in Section 4.6 we saw how a simple iterative approach to data augmentation can lead to better performance.

In the coming months we will provide a pre-trained snapshot of our best-performing model, in the form of a BERT checkpoint. As shown, this model will be useful in zero-shot and few-shot settings as a better-performing alternative to both IR-based approaches such as BM25 and task-specific models. The multi-task training approach demonstrated here can also be useful in industry settings where several retrieval operations may need to be performed on the same piece of content,⁵ and where the deployment of multiple task-specific models might not be possible due to space or computational performance concerns.

⁵ E.g. fact checking and hate speech detection.

References

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2020. Autoregressive entity retrieval. ArXiv, abs/2010.00904.

Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training tasks for embedding-based large-scale retrieval. In International Conference on Learning Representations.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Hongchao Fang, Sicheng Wang, Meng Zhou, Jiayuan Ding, and Pengtao Xie. 2020. CERT: Contrastive self-supervised learning for language understanding. ArXiv, abs/2005.12766.

Jianfeng Gao, Kristina Toutanova, and Wen-tau Yih. 2011. Clickthrough-based latent semantic models for web search. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 675–684. ACM.

Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016. Quantization based fast inner product search. In Artificial Intelligence and Statistics, pages 482–490.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 2333–2338, New York, NY, USA. Association for Computing Machinery.

J. Johnson, M. Douze, and H. Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, pages 1–1.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Omar Khattab, Christopher Potts, and Matei Zaharia. 2020. Relevance-guided supervision for OpenQA with ColBERT. ArXiv, abs/2007.00814.

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020a. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020b. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS).

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2020. Sparse, dense, and attentional representations for text retrieval. ArXiv, abs/2005.00181.

Christopher D. Manning, Hinrich Schütze, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Cambridge University Press.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2020. KILT: a benchmark for knowledge intensive language tasks. arXiv:2009.02252.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? In EMNLP.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.

Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems (NIPS), pages 2321–2329.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Z. Wang, Tim Klinger, Wei Zhang, S. Chang, G. Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced ranker-reader for open-domain question answering. In AAAI.

Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5878–5882, Hong Kong, China. Association for Computational Linguistics.

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6397–6407, Online. Association for Computational Linguistics.

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. ArXiv, abs/2007.00808.

Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. 2011. Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 247–256. Association for Computational Linguistics.