Multi-task Retrieval for Knowledge-Intensive Tasks

Jean Maillard∗  Vladimir Karpukhin∗  Fabio Petroni  Wen-tau Yih  Barlas Oğuz  Veselin Stoyanov  Gargi Ghosh
Facebook AI
{jeanm,vladk,fabiopetroni,scottyih,barlaso,ves,gghosh}@fb.com

arXiv:2101.00117v1 [cs.CL] 1 Jan 2021

Abstract

Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be universal and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.

1 Introduction

Knowledge-intensive tasks is the common designation for a class of real-world NLP problems which, because of their nature, require large amounts of knowledge about the world (Petroni et al., 2020). For example, open-domain question answering requires producing answers to general factoid questions; fact checking involves determining the veracity of claims based on a database of trusted evidence. Practical solutions to these tasks usually involve an efficient retrieval component that, given an input query, selects a limited subset of relevant information from a large knowledge source. Sophisticated downstream models then consider the input only in the context of the retrieved information, and perform the final task.¹

The standard retrieval component in many systems (e.g., Thorne et al., 2018; Wang et al., 2018; Chen et al., 2017) has long relied on term-matching methods, such as tf-idf or BM25 (Robertson and Zaragoza, 2009). These methods rely on efficient algorithms and usually perform reasonably well regardless of the problem. In contrast, recent neural retrieval models, such as ICT (Lee et al., 2019), DPR (Karpukhin et al., 2020) and RAG (Lewis et al., 2020b), achieve better results by learning directly from task-specific training data and going beyond simple keyword matching. While task specialisation results in improved task performance, researchers have observed that a retriever trained for one specific domain will typically achieve low out-of-domain performance, and even lower performance on entirely different tasks (Petroni et al., 2020). This has two implications. First, unlike tf-idf or BM25, neural retrieval models are unsuitable for low data regimes such as few- and zero-shot settings. Second, task-specific retrievers complicate practical applications where multiple knowledge-intensive tasks may need to be performed using the same supporting database or over the same input text. It may not be practical to deploy multiple separate specialised models due to computational performance or memory concerns.

We ask the following question in this work: can we develop a universal neural retriever? Namely, we target a retriever that can perform well on a wide variety of problems, without task-specific fine-tuning, but which, if additional in-domain labelled data is available, can be further fine-tuned to improve its performance. We perform a large experimental study to attempt to build such a universal retrieval model. We find that, by jointly training on an extensive selection of retrieval tasks, we obtain a model which is not only more robust than previous approaches, but also can lead to better performance on the downstream knowledge-intensive tasks when plugged into an existing system.

∗ Equal contribution.
¹ While large pre-trained neural models have been shown to incorporate real-world knowledge in their parameters and thus may skip retrieval (Petroni et al., 2019), they still have limited capacity and suffer from a lack of explainability.
Our approach combines the benefits of IR-based models with those of task-specific neural retrievers – namely, good performance when no (or not enough) training data is available, and high task performance due to its ability to learn highly specialised representations.

Our contributions can be summarised as follows.

• We propose a single general-purpose "universal" retrieval model, able to perform comparably or better than specialised retriever approaches in both zero-shot (leave-one-out) and few-shot retrieval. We investigate several model variants, shedding light on which aspects of the architecture affect its performance.
• We show that our model's gains in terms of retrieval directly translate into performance gains for a variety of downstream knowledge-intensive tasks.
• We will share the implementation as well as our best model. This is in the form of a readily available BERT checkpoint which, as we will show, can be used by NLP practitioners as a strong out-of-the-box retrieval system, but which can also undergo further in-domain training for even higher performance.

2 Background

In this section, we first give an overview of retrieval methods based on sparse and dense representations. We then discuss a wide range of knowledge-intensive NLP tasks, where retrieval plays a crucial role in solving the problems.

2.1 Retrieval methods

Given a large collection of unstructured text passages, information retrieval (IR) can be broadly defined as finding a small set of passages that satisfies an information need, often presented in the form of a short-text query (Manning et al., 2008). Traditional IR methods, such as tf-idf and BM25 (Robertson and Zaragoza, 2009), match keywords efficiently with an inverted index. Such methods can be seen as representing queries and passages as high-dimensional, sparse vectors, where each dimension corresponds to a term in the vocabulary and the weight indicates its importance.

In contrast to tf-idf and BM25, dense retrieval methods encode text as a latent semantic vector of a fixed, much smaller dimensionality. Whether a passage is relevant to a given query is determined by the distance of their vectors (Deerwester et al., 1990). Although dense representations do not encode tokens explicitly and can potentially map paraphrases that share no tokens to close vectors, the performance of early dense retrieval methods was often inferior to term-matching approaches, except when large labelled data is available (Yih et al., 2011; Gao et al., 2011; Huang et al., 2013). Thanks to the success of large pre-trained models (Devlin et al., 2019; Liu et al., 2019b), however, recent dense retrieval methods have been shown to outperform their sparse counterparts when fine-tuned on a small set of in-domain labelled data (Karpukhin et al., 2020; Lewis et al., 2020b; Xiong et al., 2020). Efficient indexing and search of dense vectors are made possible by maximum inner product search (MIPS) algorithms (e.g., Shrivastava and Li, 2014; Guo et al., 2016), as well as tools like FAISS (Johnson et al., 2019).

Our work is built upon the Dense Passage Retriever (DPR) architecture of Karpukhin et al. (2020), which was initially proposed for the task of open-domain question answering. DPR is a neural bi-encoder model which embeds queries with an encoder f(·) and passages with a separate encoder g(·). Given an input query x and a target passage y, we have

p(x | y) ∝ sim(x, y),

where the similarity score sim(x, y) is defined as the inner product of the embeddings of its arguments, f(x) · g(y). Given a query at inference time, calculating its similarity with every possible passage would be prohibitive for large knowledge sources. Therefore, DPR makes use of the FAISS library (Johnson et al., 2019) to perform fast approximate nearest neighbour search in sub-linear time.

Training is based on a contrastive loss. Given a query x, a relevant passage y, and a set of n irrelevant passages y_i^-, we train the model by optimising the following negative log likelihood:

\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(x, y))}{\exp(\mathrm{sim}(x, y)) + \sum_{i=1}^{n} \exp(\mathrm{sim}(x, y_i^-))}

As the set of irrelevant passages, we use the relevant passages for other queries within the same batch, as well as a specially selected "hard" confounder.
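To make the training objective concrete, the following is a minimal PyTorch sketch (our own illustration, not the authors' released code) of the in-batch contrastive loss described above. The function name and toy dimensions are assumptions; in real training the inputs would be the query and passage embeddings produced by the two BERT encoders.

```python
import torch
import torch.nn.functional as F

def dpr_contrastive_loss(query_vecs, passage_vecs):
    """In-batch contrastive loss in the style of DPR.

    query_vecs:   (B, d) query embeddings f(x).
    passage_vecs: (B + H, d) passage embeddings g(y); row i (i < B) is the
                  relevant passage for query i, the remaining H rows are
                  extra "hard" confounders added to the batch.
    """
    # sim(x, y) = f(x) . g(y): a (B, B + H) matrix of inner products.
    scores = torch.matmul(query_vecs, passage_vecs.t())
    # The gold passage for query i sits in column i; every other column
    # (other queries' positives plus the hard confounders) is a negative.
    targets = torch.arange(query_vecs.size(0), device=query_vecs.device)
    # Row-wise cross-entropy is exactly the negative log likelihood above.
    return F.cross_entropy(scores, targets)

# Toy usage with random tensors standing in for BERT [CLS] embeddings.
B, H, d = 4, 4, 768
print(dpr_contrastive_loss(torch.randn(B, d), torch.randn(B + H, d)))
```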
This is a passage which has high lexical overlap with the query (high BM25 score), but is not among the set of relevant passages for the given data point. Karpukhin et al. (2020) have shown that the inclusion of such "hard" confounders leads to substantially improved training results. This training process is illustrated in Figure 1.

Figure 1: Training of DPR (Karpukhin et al., 2020), a bi-encoder model for open-domain question answering. Queries and passages are encoded as vectors, and retrieval is performed as a maximum inner product search. [Figure omitted: a query encoder and several document encoders feed a classification-via-learned-metric layer; the example shows a query, its positive and "hard" negative documents, and the positive documents of the other queries in the batch.]

2.2 Knowledge-intensive Tasks

For the training and evaluation of all models in the paper we make use of KILT, a benchmark and library of datasets (Petroni et al., 2020). KILT consists of a selection of datasets spanning five varied classes of knowledge-intensive tasks (i.e., question answering, slot filling, fact checking, dialogue, entity linking), with the aim to cover many different ways of seeking knowledge. Input queries can vary wildly from one task to the other, and include classic examples of open-domain retrieval tasks such as natural language questions and claims to be verified, as well as more unusual examples like conversation fragments and long chunks of annotated text. Crucially, all datasets distributed in KILT have been re-aligned such that they are all grounded in the same snapshot of Wikipedia, which the authors distribute. The knowledge required to answer any of the queries in the library of tasks can thus be found within the same unified knowledge source.

To illustrate the variety of ways in which the input queries for different tasks can be formulated, we provide a few simple examples in Table 1. In spite of the differences between query formulations, all these tasks share one crucial aspect: they all require a retriever to fetch the relevant passages from the knowledge source, in order to support the final downstream task.

Task | Example query | Answer | Relevant doc.
Question Answering | Who is playing the Halftime Show at Super Bowl 2016? | Coldplay | The Super Bowl 50 Halftime Show took place on February 7, 2016 ... It was headlined by the British rock group Coldplay.
Fact Checking | Bermuda Triangle is in the western part of the Himalayas | REFUTES | The Bermuda Triangle ... is a loosely defined region in the western part of the North Atlantic Ocean
Slot Filling | Piner Creek [sep] mouth of the watercourse | Santa Rosa Creek | Piner Creek discharges to Santa Rosa Creek which in turn ...
Entity Linking | Leicestershire take over at top after innings victory. London. [start ent]West Indian[end ent] all-rounder Phil Simmons ... | West Indies cricket team | The West Indies cricket team is a multi-national men's cricket team representing the Anglophone Caribbean region
Dialogue | I am a big fan of Star Trek [sep] I don't know much about it. When did the first episode air? [sep] It debuted in .. [sep] What is the plot of the show? | William Shatner plays the role of Captain Kirk | It followed the interstellar adventures of Captain James T. Kirk (William Shatner) and his crew ...

Table 1: Illustrative examples of some of the tasks within KILT, and how varied their query formulations can be.

3 Methods

3.1 Universal retrieval

Using task-specific models to tackle our collection of retrieval tasks would involve completely separate models, one per dataset. As illustrated in Figure 2, this would lead to a proliferation of models and data, down to separate indexed copies of the knowledge source itself (Wikipedia). This setup will form one of our baselines.

Figure 2: Two retrieval tasks performed by two fully-specialised models. [Figure omitted: each task has its own query encoder, document encoder, document index and knowledge source.]

Multi-task training has been successfully used to allow models to leverage cross-task data, as well as to provide a regularisation effect leading to better generalisation ability (Liu et al., 2019a). We apply this concept to neural retrievers, with the aim of improving performance by jointly leveraging multiple different retrieval datasets.

Figure 3: Parameter sharing between neural retrievers. (a) Separate query encoders. (b) A single retrieval model. [Figure omitted: both variants share the document encoder, document index and knowledge source; variant (a) keeps one query encoder per task, variant (b) shares a single query encoder.]

Our base setup is illustrated in Figure 3b and involves using a shared passage encoder — so that a single index of encoded passages can be used — as well as a query encoder that is shared across all tasks. In essence, in this setup a single DPR model is used to perform all retrieval tasks.

Due to the complexity of training and evaluating retrieval models (which involves training the retriever, embedding all of Wikipedia, and building an index), our main set of experiments is all based on this configuration, which was found to work well in preliminary experiments.
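As a rough illustration of this base setup (one shared query encoder, one shared passage encoder, one index serving every task), the sketch below builds a single FAISS inner-product index over passage embeddings and answers queries from different tasks against it. The encoder stubs, the example passages and the dimensionality are placeholders standing in for the shared BERT encoders, not the authors' implementation.

```python
import numpy as np
import faiss  # library used by DPR for fast maximum inner product search

d = 768  # embedding size of the shared encoders

def encode_passages(passages):
    # Placeholder for the shared passage encoder g(.); returns (N, d) float32.
    return np.random.rand(len(passages), d).astype("float32")

def encode_query(query):
    # Placeholder for the shared query encoder f(.); returns (1, d) float32.
    return np.random.rand(1, d).astype("float32")

# One index of the knowledge source, built once and reused by every task.
passages = ["passage about Coldplay ...", "passage about the Bermuda Triangle ..."]
index = faiss.IndexFlatIP(d)          # exact inner-product search
index.add(encode_passages(passages))  # in practice, ~36M Wikipedia passages

# Queries from different KILT tasks hit the same encoder and the same index.
for task, query in [("QA", "Who headlined the Super Bowl 50 halftime show?"),
                    ("Fact Checking", "Bermuda Triangle is in the Himalayas")]:
    scores, ids = index.search(encode_query(query), 2)
    print(task, [passages[i] for i in ids[0]])
```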
However, in order to report on the performance of alternative architectures, we also investigate the following additional variants in a restricted experimental setting, limited to a few tasks:

• Task-specific query encoder. A different query encoder is used for each family of tasks, e.g. all question answering tasks use the same query encoder, but fact checking uses a different one. This is meant to allow for potentially different needs in processing queries, given the fundamentally diverse nature of the tasks at hand. This configuration is illustrated in Figure 3a.
• Task markers. This approach is similar to our base setup, where a single model performs all tasks. Additionally, we introduce specialised tokens which are inserted at the beginning of each query. Their aim is to help the model distinguish between the different tasks by marking them. We use one task marker for each of the five task classes of KILT, such that all question answering tasks share the same marker (see the sketch after this list).
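A minimal sketch of the task-marker variant, under the assumption that markers are implemented as special tokens added to the tokenizer vocabulary and prepended to each query before encoding. The marker names and the Hugging Face tokenizer usage are our illustration, not necessarily the paper's exact implementation.

```python
from transformers import BertTokenizerFast

# One marker per KILT task class; every dataset of a class shares its marker.
TASK_MARKERS = {"qa": "[QA]", "slot_filling": "[SF]", "fact_checking": "[FC]",
                "dialogue": "[DIAL]", "entity_linking": "[EL]"}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Register the five markers as special tokens so they are never split into subwords.
tokenizer.add_special_tokens({"additional_special_tokens": list(TASK_MARKERS.values())})
# (The query encoder's token embedding matrix must then be resized accordingly.)

def mark_query(query: str, task_class: str) -> str:
    """Prepend the class-level marker, e.g. all QA datasets share [QA]."""
    return f"{TASK_MARKERS[task_class]} {query}"

inputs = tokenizer(mark_query("Who wrote Hamlet?", "qa"), return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
```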
3.2 Adversarial confounder selection

We saw in § 2.1 how "hard" confounder passages are collected using a BM25 baseline, following the standard approach in DPR. However, any other retriever can be used to select such confounders, including the very retriever being trained, leading to an iterative, self-adversarial training. Concretely, this amounts to the following steps: (1) a first version of the retriever is trained with BM25 confounders; (2) new confounders are selected with the trained model, by retrieving high-ranking passages which are not among the set of relevant ones; (3) a second version of the model is trained using the additional new confounders.

Intuitively, it is expected that this approach should lead to higher quality confounders compared to those selected by BM25 based on simple keyword matching. Based on our own experience as well as the relevant literature (Khattab et al., 2020), this adversarial approach has been shown to work well for question answering.

As a way of further pushing the performance of the model, we experiment with this adversarial confounder selection on two datasets, Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). We selected these two datasets since, out of all of the tasks we are considering, they have an easy way of checking whether a certain passage is relevant or not for a given query – namely, by checking whether the answer is present in the passage. This enabled us to automatically build sets of confounders, ensuring relevant passages would be excluded.²

² Strictly speaking, assuming a passage to be irrelevant because of the absence of the answer span is not formally correct. However, experiments show a good correlation between this simple check and the overall model quality.
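The three steps above can be summarised in the following sketch. The retriever interface, the training calls and the dataset fields are hypothetical placeholders, included only to make the mining loop explicit; the answer-containment check is the proxy relevance test described in the footnote.

```python
def contains_answer(passage: str, answers: list[str]) -> bool:
    # Proxy relevance check used for NQ/TriviaQA: a passage containing no
    # answer string is treated as irrelevant (see footnote 2).
    return any(a.lower() in passage.lower() for a in answers)

def mine_adversarial_confounders(retriever, dataset, k=50, per_query=1):
    """Step (2): use a trained retriever to pick high-ranking non-relevant passages."""
    mined = {}
    for ex in dataset:  # ex: {"query": ..., "answers": [...], "relevant_ids": {...}}
        hard = []
        for pid, passage in retriever.retrieve(ex["query"], k=k):  # hypothetical API
            if pid not in ex["relevant_ids"] and not contains_answer(passage, ex["answers"]):
                hard.append(pid)
            if len(hard) == per_query:
                break
        mined[ex["query"]] = hard
    return mined

# (1) train with BM25 confounders, (2) mine new ones, (3) retrain from scratch.
# retriever_v1 = train_retriever(train_data, confounders="bm25")       # hypothetical
# extra = mine_adversarial_confounders(retriever_v1, train_data)
# retriever_v2 = train_retriever(train_data, extra_confounders=extra)  # hypothetical
```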
4 Experiments

4.1 Experimental settings

Dataset selection. For our experiments we select the eight KILT datasets listed in Table 2, which cover all five task classes and include a training split, a validation split, and a held-out test split.

Dataset | Task class | #Train
FEVER | Fact Checking | 71 k
AIDA-YAGO 2 | Entity Linking | 18 k
T-REx | Slot Filling | 2,284 k
Zero Shot RE | Slot Filling | 132 k
Natural Questions | QA | 77 k
HotpotQA | QA | 69 k
TriviaQA | QA | 53 k
Wizard of Wikipedia | Dialogue | 80 k

Table 2: KILT datasets used in this work, and the size of our converted training sets for each.

Preprocessing. Starting from the raw KILT data, we split each Wikipedia article into disjoint 100-token chunks which form our basic retrieval units, following the approach of Wang et al. (2019) and Karpukhin et al. (2020). To maintain the same language introduced in §3, we will simply call these chunks passages.

This preprocessing results in a knowledge source of 36 million passages. In order to harmonise all datasets to the same knowledge source, KILT used a mapping strategy based on the BLEU metric to map relevant passages in the original versions of its datasets to passages in its own shared knowledge source (Petroni et al., 2020). Entries included in the KILT training sets which have a mapping BLEU score below 0.5 are likely to be noise, and we exclude them from training.

Multi-tasking. Training is performed on the union of all training sets. Since two of the training sets are far larger than the others, we use a simple downsampling strategy to bring them to the same order of magnitude as the rest. Preliminary experiments with more complex sampling methods, like resampling all datasets so that each epoch would see an equal number of samples from each, found that they had no measurable effect compared to this simpler approach.

Encoders. Our query and passage encoders are initialised as two distinct BERT base uncased encoders (Devlin et al., 2019), trained separately. As pooling mechanism we find it effective to simply take the [CLS] token representation at the topmost layer.

Training. We train our models for up to 80 epochs. To select the best checkpoint, we perform full evaluations of validation set retrieval performance at regular intervals. We use the Adam optimiser (Kingma and Ba, 2015) with a learning rate of 2·10⁻⁵, warmup and a linear decay schedule, and a dropout rate of 0.1. The batch size is set to 128 samples, and in preliminary experiments we found no benefit in increasing this further. We use an additional "hard" confounder per batch, selected based on BM25 score as in Karpukhin et al. (2020).

Downstream evaluation. When evaluating our retriever within a larger architecture to perform a knowledge-intensive task, we replicate the DPR + BART setup of Petroni et al. (2020). This uses DPR to retrieve the top 3 passages and prepend them to the query, which is then processed by a task-specific fine-tuned BART model to generate the final answer for the end task.
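The following is an illustrative sketch of this kind of retrieve-and-generate setup, not the exact KILT DPR + BART pipeline: the top retrieved passages are prepended to the query and fed to a BART generator. The checkpoint name, separator choice and generation settings are assumptions, and in the real setup BART is fine-tuned per task.

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
# In the actual experiments this model would be fine-tuned on (passages + query, answer) pairs.

def answer(query: str, retrieved_passages: list[str], top_k: int = 3) -> str:
    # Prepend the top-k retrieved passages to the query, truncating to BART's input size.
    context = " ".join(retrieved_passages[:top_k])
    inputs = tokenizer(f"{context} {tokenizer.sep_token} {query}",
                       return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(answer("Who is playing the Halftime Show at Super Bowl 2016?",
             ["The Super Bowl 50 Halftime Show ... was headlined by Coldplay."]))
```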
4.2 Universal retrieval

The results of the evaluations reported in Petroni et al. (2020) show that retrievers trained for question answering have poor performance outside of their domain. We would like to understand if it is possible to design a single model which can accurately satisfy the information needs of a wide variety of knowledge-intensive tasks. In short: can a neural retriever be universal?

We perform a comprehensive evaluation of several models on the eight tasks of Table 2. The setups we evaluate include eight task-specific models (one trained on each of the eight datasets), for which we measure both in-domain and out-of-domain performance, and a BM25 baseline. Additionally, we include a multi-task trained model – as described in §3.1 – with the hope that it can learn to perform all tasks satisfactorily. This amounts to 10 models evaluated on eight tasks each, for a total of 80 evaluations.

To measure retrieval performance, we adopt the main metric used for the KILT benchmark, R-precision. This is calculated as r/R, where R is the total number of relevant passages for a given query, and r is the number of relevant passages returned among the top-R retrieval results. For the case of R = 1 this is therefore equivalent to precision@1.
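A small sketch of the R-precision computation as just described (our own helper, not the official KILT evaluation script):

```python
def r_precision(retrieved_ids, relevant_ids):
    """R-precision: fraction of the R relevant items found among the top-R results."""
    R = len(relevant_ids)
    if R == 0:
        return 0.0
    top_r = retrieved_ids[:R]
    return sum(1 for pid in top_r if pid in relevant_ids) / R

# With a single relevant passage (R = 1) this reduces to precision@1.
print(r_precision(["p7", "p2", "p9"], {"p2"}))  # 0.0: the gold passage is not ranked first
```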
Model | FEV | AY2 | T-REx | zsRE | NQ | HoPo | TQA | WoW
Multi-task | 74.72 / 46.96 | 83.78 | 69.18 / 53.54 | 77.23 / 41.70 | 61.51 / 28.80 | 44.21 / 38.42 | 61.95 / 24.56 | 39.70 / 24.07
BM25 | 50.13 / 40.06 | 3.47 | 58.60 / 51.64 | 66.43 / 52.98 | 25.83 / 14.20 | 43.95 / 38.38 | 29.44 / 16.16 | 27.50 / 18.41
Task-specific models:
FEVER | 73.60 / 43.92 | 5.62 | 19.50 / 10.02 | 42.88 / 19.98 | 36.69 / 18.05 | 23.18 / 17.59 | 45.08 / 22.24 | 41.27 / 19.85
AY2 | 47.36 / 37.58 | 81.77 | 5.52 / 4.08 | 8.94 / 5.50 | 10.22 / 6.77 | 11.69 / 10.71 | 15.11 / 8.47 | 17.59 / 13.08
T-REx | 45.63 / 25.22 | 1.05 | 69.08 / 58.54 | 71.64 / 40.95 | 17.10 / 8.71 | 22.31 / 15.63 | 18.10 / 8.06 | 4.02 / 1.83
zsRE | 70.10 / 33.12 | 0.42 | 68.34 / 57.40 | 97.74 / 78.81 | 25.98 / 13.81 | 22.23 / 18.35 | 28.68 / 14.44 | 10.40 / 2.09
NQ | 68.16 / 14.81 | 1.44 | 31.78 / 7.20 | 61.12 / 12.92 | 63.24 / 28.13 | 29.39 / 11.33 | 48.39 / 14.42 | 30.77 / 11.81
HoPo | 56.18 / 40.03 | 2.07 | 35.76 / 27.62 | 44.44 / 31.15 | 35.60 / 23.26 | 46.63 / 43.47 | 41.18 / 29.37 | 23.51 / 16.02
TQA | 70.06 / 10.68 | 4.95 | 32.22 / 12.52 | 60.37 / 17.43 | 45.01 / 12.97 | 32.62 / 13.05 | 65.12 / 23.79 | 41.17 / 8.11
WoW | 59.16 / 42.79 | 3.11 | 20.92 / 18.52 | 41.14 / 35.26 | 33.27 / 22.52 | 20.36 / 17.66 | 39.37 / 23.15 | 40.32 / 20.73

Table 3: Page- and passage-level R-precision on KILT validation data, grouped by task class (fact checking: FEV; entity linking: AY2; slot filling: T-REx, zsRE; open-domain QA: NQ, HoPo, TQA; dialogue: WoW). For the AIDA-YAGO 2 dataset, due to the nature of the task, only page-level retrieval is defined.

Table 3 shows retrieval performance on the validation data, with the best performance on a given dataset marked in bold, and the second best underlined.

While the KILT evaluation focuses on retrieval at the level of Wikipedia pages (thereby marking as "hits" any results that lie within the correct page), we are also interested in performing an evaluation at a more fine-grained level. We therefore also evaluate our models at the passage level, using a modified version of the official KILT evaluation scripts. These scores are shown as the second number in each column.

We straight away notice that the task-specific models tend to achieve high performance on their respective tasks, often taking one of the top two spots. Interestingly, we also note that these neural retrievers consistently outperform the BM25 baseline, showing that the result which Karpukhin et al. (2020) achieved for open-domain question answering also holds for other knowledge-intensive tasks.

The results reveal a strong performance for the multi-task model, confirming the hypothesis that a single model can be successfully trained to perform a wide variety of retrieval tasks. With the exception of one dataset, the shared model achieves the best retrieval performance or is within a few percentage points of the top score. We note that the one exception is the Zero-shot RE task (Levy et al., 2017), a trivial task in which the query always contains the title of the page to be retrieved. Indeed, the model specific to this task manages to achieve a near-perfect score.

Another task which stands out for being markedly different in formulation is AIDA-YAGO 2 (Hoffart et al., 2011). As shown in Table 3, models that were not trained on this specific task perform it very poorly. Entity linking is a task that is normally better performed by models explicitly designed for it (Cao et al., 2020). We nevertheless include it to showcase the ability of neural retrievers to adapt to it, and note how well the multi-task retriever performs on it in spite of its unusual nature.

4.3 Downstream performance

We saw that our proposed approach achieves strong performance across a variety of retrieval tasks. However, our interest in neural retrievers stems from their use as components within larger systems, to perform tasks such as question answering. Our next experimental question is therefore: can a universal retriever lead to better downstream performance in knowledge-intensive tasks?

We perform a downstream evaluation of our approach used in conjunction with BART (Lewis et al., 2020a) as the generative component or classifier, adopting the same setup as Petroni et al. (2020). Results are reported in Table 4, with bold and underline marking the best and second best scores respectively.

Model | FEV | AY2 | zsRE | NQ | HoPo | TQA | WoW | Avg.
Multi-task + BART | 86.32 | 82.61 | 57.95 | 39.75 | 31.77 | 59.60 | 15.33 | 53.33
DPR + BART | 86.74 | 75.49 | 30.43 | 41.27 | 25.18 | 58.55 | 15.55 | 47.60
RAG | 86.31 | 72.62 | 44.74 | 44.39 | 26.97 | 71.27 | 13.22 | 51.36
T5 | 76.30 | 74.05 | 9.02 | 19.60 | 12.64 | 18.11 | 13.49 | 31.89

Table 4: KILT test scores on the downstream evaluation. Results in the bottom section are as reported in Petroni et al. (2020). The score metrics are accuracy for fact checking, entity linking and slot filling; exact match for QA; and F1 score for dialogue.³

³ Performing this evaluation required retrieving relevant documents for all training sets. Due to the very large size of T-REx, this particular dataset could not be included in this section.

The DPR + BART line refers to a setup similar to our own, but with the simpler retriever of Karpukhin et al. (2020). Therefore, comparing its performance to ours gives a clear indication of the contribution of multi-task training to the overall performance on knowledge-intensive tasks. Our proposed model achieves significantly better performance than this baseline on AY2, zsRE and HoPo, while for the other tasks the discrepancy is always below two points. This is reflected in the last column too, showing that on average multi-task training leads to better downstream performance. The model also compares favourably to RAG (Lewis et al., 2020b), a more advanced system in which the query encoder is fine-tuned on the end task.

4.4 Zero- and few-shot performance

Task-specific neural retrievers can achieve higher performance than IR-based methods, but they are not suitable for cases where no (or not enough) training data is available. In those cases, tf-idf and BM25 are the better choice. To evaluate the performance of a multi-task retriever as a suitable replacement for them in this scenario, we run a series of experiments in the low data regimes (few-shot and zero-shot).

We start by training a set of multi-task retrievers (using the base setup) in the leave-one-out setting for each of the datasets, in order to see how a neural retriever performs when trained on all domains except the one it is to be evaluated on. The results of these zero-shot experiments are reported in the second line of Table 5 (again, text is in bold for the best overall performance, and underlined for second best). They show that, even in the zero-shot setting, the multi-task neural retriever achieves performance competitive with BM25, with retrieval being 10 points higher at the page level and 5 points lower at the passage level on average.

The advantage of neural retrievers over BM25 lies in their ability to improve with training. We therefore look at few-shot training for each task, and create two smaller copies of each of the original training sets with a random sample of 128 and 1,024 examples respectively. In order to evaluate the suitability of a multi-task trained retriever as a starting checkpoint for few-shot training, we take the various leave-one-out models and fine-tune them on our few-shot training sets. To check whether multi-task pre-training is effective, we also compare these to DPR models (which are just initialised with BERT weights) fine-tuned on the same data.

The bottom two sections of Table 5 report the results. The most dramatic gains from fine-tuning are seen for AY2, an "outlier" task whose formulation differs from that of the other tasks, and which seems to benefit the most from seeing in-domain data. The zsRE performance does not seem to improve from fine-tuning on the smaller dataset, but sees a very big jump when switching to the larger dataset. As a reminder, in this trivial task the title of the page to be retrieved always appears at the start of the query. It is therefore not surprising that models specifically fine-tuned on it can achieve near-perfect scores, as long as enough training data is provided.

In spite of the fine-tuning, we note that both DPR and the multi-task model fail to improve on their performance for T-REx, suggesting that large amounts of training data are required to learn this task. Nevertheless, the multi-task model proves itself more robust, and achieves the top performance on it.

Finally, we note that for 2 out of 8 tasks, namely zsRE and WoW, DPR achieves lower page-level retrieval scores than the multi-task model, but performs better at the passage level. This shows that fine-grained and coarse-grained retrieval performance are not always perfectly correlated.
Model | FEV | AY2 | T-REx | zsRE | NQ | HoPo | TQA | WoW | Avg.
BM25 | 50.13 / 40.06 | 3.47 | 58.60 / 51.64 | 66.43 / 52.98 | 25.83 / 14.20 | 43.95 / 38.38 | 29.44 / 16.16 | 27.50 / 18.41 | 38.17 / 33.12
Leave-one-out multi-task models:
Zero-shot | 74.11 / 37.09 | 4.16 | 67.54 / 44.84 | 73.42 / 32.65 | 47.23 / 21.50 | 34.72 / 16.52 | 49.08 / 28.06 | 36.92 / 16.19 | 48.40 / 28.12
Finetune (128) | 75.95 / 32.75 | 32.38 | 67.54 / 44.84 | 73.41 / 32.65 | 47.48 / 14.98 | 34.72 / 27.82 | 54.71 / 19.82 | 48.36 / 17.46 | 54.23 / 27.19
Finetune (1k) | 73.08 / 40.83 | 70.40 | 67.54 / 44.84 | 93.04 / 58.67 | 51.00 / 19.90 | 39.19 / 35.43 | 59.08 / 20.22 | 47.65 / 19.75 | 62.62 / 34.23
Vanilla DPR models:
Finetune (128) | 37.99 / 25.31 | 26.23 | 0.20 / 0.02 | 0.16 / 0.00 | 20.92 / 9.52 | 14.46 / 14.08 | 26.85 / 10.54 | 30.31 / 17.20 | 19.64 / 10.95
Finetune (1k) | 70.87 / 47.82 | 72.49 | 0.20 / 0.02 | 90.33 / 80.20 | 43.43 / 19.81 | 30.75 / 30.50 | 52.50 / 17.33 | 44.70 / 24.92 | 50.66 / 31.51

Table 5: Page- and passage-level R-precision in the zero-shot setting and with additional fine-tuning on 128 and 1,024 examples. We also compare to a BM25 retriever and a DPR model initialised with BERT weights.

Overall, the experiments show strong results for the multi-task model, with the average zero-shot performance being competitive with BM25, and the average few-shot performance being markedly better than the alternatives. The discrepancy in performance between a vanilla DPR model and the leave-one-out multi-task model is especially noticeable when using the smaller of the two datasets, in which case average performance for the latter is more than double that of vanilla DPR.

4.5 Model variants

In this set of experiments we compare our base multi-task model with the two variants described in § 3.1. Due to the high memory consumption of the "task-specific encoders" variant (requiring one full query encoder per task family, in addition to the passage encoder), it was only possible to perform these evaluations in a restricted setting of three datasets. The results in Table 6 do not reveal a clear winner, suggesting that the base architecture might be the better choice due to its simplicity and generally good performance.⁴

Variant | FEV | NQ | TQA
Base | 76.38 / 40.76 | 60.91 / 24.50 | 64.77 / 21.75
Task markers | 75.84 / 40.79 | 62.31 / 25.10 | 64.04 / 20.86
Task-spec. enc. | 73.53 / 40.02 | 61.05 / 25.52 | 64.17 / 21.23

Table 6: Multi-task model variants evaluated on a subset of tasks (R-precision on validation data at page/passage level).

⁴ Not included in this table, due to very poor performance in preliminary experiments, are two further variants: a base model with a single encoder for both queries and passages, and a base model trained from scratch without BERT pre-training.

4.6 Adversarial confounder selection

Finally, we evaluate the adversarial confounder selection method described in § 3.2. This involves augmenting our regular training sets with additional confounders for TriviaQA and Natural Questions, selected using our top multi-task trained model. A new multi-task model is then trained from scratch on this augmented data. Its performance is reported in Table 7, showing an overall improvement across multiple tasks. While this approach is demonstrated here on our multi-task model, it is in fact orthogonal to it, and could be applied to any other neural retriever trained with a contrastive loss.

Confounders | FEV | AY2 | T-REx | zsRE | NQ | HoPo | TQA | WoW
BM25 | 74.72 / 46.96 | 83.78 | 69.18 / 53.54 | 77.23 / 41.70 | 61.51 / 28.80 | 44.21 / 38.42 | 61.95 / 24.56 | 39.70 / 24.07
BM25 + adv | 74.79 / 52.12 | 84.86 | 71.36 / 61.40 | 80.04 / 54.08 | 59.25 / 40.11 | 44.08 / 41.04 | 59.19 / 34.17 | 41.04 / 24.62

Table 7: Comparison of two confounder selection methods for the multi-task model: simple BM25, and BM25 augmented with adversarial confounders (R-precision on validation data at page/passage level).
5 Related work

The approach most closely related to ours is DPR (Karpukhin et al., 2020), upon which we built all our retrieval systems; the model and its historical context are covered in detail in § 2.1. Another closely related approach is the Retrieval-Augmented Generation (RAG) model of Lewis et al. (2020b). In its base configuration it augments DPR with a generative reader, and it trains the query encoder end-to-end (differing from traditional retriever-reader architectures, which treat the two steps as disjoint). A natural extension of the work we have presented would be to combine RAG with our joint learning approach, to study whether it can lead to further gains in performance or robustness.

A number of promising techniques to boost retrieval performance have been proposed recently. These are orthogonal to our work, and as such could be combined with it. Amongst these, pre-training methods form one class. The Inverse Cloze Task (Lee et al., 2019) and its extensions (Chang et al., 2020) are self-supervised pre-training methods designed for retrieval in open-domain question answering. Whether such specific pre-training is beneficial to tasks other than question answering remains an open question. CERT (Fang et al., 2020) is an alternative pre-training approach, inspired by some recent advances in computer vision. While to our knowledge it has not been applied to retrieval problems, we believe it might be promising due to its focus on sentence-level semantics (as opposed to the more standard masked language modelling pre-training, which focuses on the token level).

Another class of orthogonal improvements to dense retrieval involves models which embed passages into multiple fixed-size vectors. Of these, ColBERT (Khattab and Zaharia, 2020) and ME-BERT (Luan et al., 2020) are two representative examples. One further approach is ColBERT-QA (Khattab et al., 2020), which additionally uses a data augmentation strategy closely related to our own approach described in § 3.2.

Finally, two entity linkers, GENRE (Cao et al., 2020) and BLINK (Wu et al., 2020), are worth mentioning. Being trained specifically for entity linking, these models will generally outperform retrieval-based approaches on that task. While they are not comparable to retrieval models and will not generally be applicable to information retrieval tasks, we mention them here to give readers a fuller picture of the existing literature.

6 Conclusions

We have conducted a large-scale experimental study on knowledge-intensive tasks, and on how the retrieval models that tackle them seek the required information from knowledge bases such as Wikipedia.

The study started with the question of whether the way in which information is embedded for retrieval purposes is universal. Section 4.2 provided evidence that to a large extent it is, with a single "universal" retriever, trained jointly on 8 datasets, often performing comparably to task-specific models.

Armed with this knowledge, in Section 4.3 we plugged our single model into a larger pipeline, in order to see its contribution to the downstream performance on a wide range of knowledge-intensive tasks. This led to an overall improvement in downstream performance, setting new top results for a number of tasks in the KILT benchmark.

Next, in Section 4.4, we evaluated the model's performance in the zero-shot and few-shot settings. By evaluating on a wide range of tasks, we were able to show that our proposed approach performs comparably to BM25 in the zero-shot setting, and quickly overtakes it even with minimal in-domain training.

In Section 4.5 we evaluated a number of more complex variants of the model involving task specialisation, but failed to see clear performance improvements. Finally, in Section 4.6 we saw how a simple iterative approach to data augmentation can lead to better performance.

In the coming months we will provide a pre-trained snapshot of our best-performing model, in the form of a BERT checkpoint. As shown, this model will be useful in zero-shot and few-shot settings as a better-performing alternative to both IR-based approaches such as BM25 and task-specific models. The multi-task training approach demonstrated here can also be useful in industry settings where several retrieval operations may need to be performed on the same piece of content,⁵ and where the deployment of multiple task-specific models might not be possible due to space or computational performance concerns.

⁵ E.g. fact checking and hate speech detection.

References

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2020. Autoregressive entity retrieval. ArXiv, abs/2010.00904.

Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training tasks for embedding-based large-scale retrieval. In International Conference on Learning Representations.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Hongchao Fang, Sicheng Wang, Meng Zhou, Jiayuan Ding, and Pengtao Xie. 2020. CERT: Contrastive self-supervised learning for language understanding. ArXiv, abs/2005.12766.

Jianfeng Gao, Kristina Toutanova, and Wen-tau Yih. 2011. Clickthrough-based latent semantic models for web search. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 675–684. ACM.

Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016. Quantization based fast inner product search. In Artificial Intelligence and Statistics, pages 482–490.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 2333–2338, New York, NY, USA. Association for Computing Machinery.

J. Johnson, M. Douze, and H. Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, pages 1–1.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Omar Khattab, Christopher Potts, and Matei Zaharia. 2020. Relevance-guided supervision for OpenQA with ColBERT. ArXiv, abs/2007.00814.

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020a. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020b. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS).

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2020. Sparse, dense, and attentional representations for text retrieval. ArXiv, abs/2005.00181.

Christopher D. Manning, Hinrich Schütze, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Cambridge University Press.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2020. KILT: a benchmark for knowledge intensive language tasks. arXiv:2009.02252.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? In EMNLP.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.

Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems (NIPS), pages 2321–2329.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Z. Wang, Tim Klinger, Wei Zhang, S. Chang, G. Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced ranker-reader for open-domain question answering. In AAAI.

Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5878–5882, Hong Kong, China. Association for Computational Linguistics.

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6397–6407, Online. Association for Computational Linguistics.

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. ArXiv, abs/2007.00808.

Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. 2011. Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 247–256. Association for Computational Linguistics.