Question Answering Infused Pre-training of General-Purpose Contextualized Representations
Robin Jia, Mike Lewis, Luke Zettlemoyer
Facebook AI Research
{robinjia,mikelewis,lsz}@fb.com

Abstract

This paper proposes a pre-training objective based on question answering (QA) for learning general-purpose contextual representations, motivated by the intuition that the representation of a phrase in a passage should encode all questions that the phrase can answer in context. We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model on 80 million synthesized QA pairs. By encoding QA-relevant information, the bi-encoder's token-level representations are useful for non-QA downstream tasks without extensive (or in some cases, any) fine-tuning. We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection on four datasets, few-shot named entity recognition on two datasets, and zero-shot sentiment analysis on three datasets.

[Figure 1: An overview of Question Answering Infused Pre-training. Our model independently creates vector representations (middle) for phrases in a passage (top) and for synthesized questions (bottom). Our objective encourages the vector for each phrase to have high similarity with the vectors for all questions it answers. The example passage reads "The Violin Concerto in D major, Op. 77, was composed by Johannes Brahms in 1878 and dedicated to his friend, the violinist Joseph Joachim."; the example questions are "What did Brahms write in 1878?", "Who wrote the violin concerto?", "Who played the violin?", "What was dedicated to Joachim?", and "Who was friends with Joachim?"]

1 Introduction

Although masked language models build contextualized word representations, they are pre-trained with losses that minimize distance to uncontextualized word embeddings (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019). In this paper, we introduce a new pre-training loss based on question answering that depends much more directly on context, and learns improved token-level representations for a range of zero- and few-shot tasks.

More specifically, we propose Question Answering Infused Pre-training (QuIP), a technique for training highly context dependent representations. Our intuition is that the representation for a phrase should contain enough information to identify all the questions that the phrase could answer in context. For example, in Figure 1, the representation for Johannes Brahms should be similar to an encoding of all questions it can answer, such as "Who wrote the violin concerto?"—thereby capturing the phrase's role in the sentence. A further advantage is that these more contextualized representations allow for improved zero- and few-shot learning for many token-level tasks, especially when the tasks can be posed as question answering (Levy et al., 2017; McCann et al., 2018). Our approach builds on previous work using question-answer pairs as a meaning representation (He et al., 2015; Michael et al., 2018), and more generally from the tradition of using question answering ability as a proxy for language understanding (Lehnert, 1977; Hirschman et al., 1999; Peñas et al., 2013; Richardson et al., 2013).
QuIP learns contextual representations with a bi-encoder extractive QA objective. Our bi-encoder model must independently encode passages and questions such that the representation of each phrase in a passage is similar to the representations of all reading comprehension questions that can be answered with that phrase. To train such a model, we use a question generation model to synthesize 80 million QA examples, then train the bi-encoder to match the predictions of a standard cross-encoder QA model, which processes the passage and question together, on these examples. Bi-encoder QA has been used for efficient open-domain QA via phrase retrieval (Seo et al., 2018, 2019; Lee et al., 2020, 2021), but its lower accuracy compared to cross-encoder QA has previously been viewed as a drawback. We instead identify the relative weakness of bi-encoder QA as an opportunity to improve contextual representations via knowledge distillation, as self-training can be effective when the student model is forced to solve a harder problem than the teacher (Xie et al., 2020). In fact, using a bi-encoder student model is critical for QuIP: a cross-encoder trained in a similar way does not learn contextual representations that are as useful for later few-shot learning, despite having higher QA accuracy.

We show that QuIP token-level representations are useful in a variety of zero-shot and few-shot learning settings, both because the representations directly encode useful contextual information, and because we can often reduce downstream tasks to QA. For few-shot paraphrase detection, QuIP with BERTScore-based features (Zhang et al., 2020) outperforms prior work by 9 F1 points across four datasets. For few-shot named entity recognition (NER), QuIP combined with an initialization scheme that uses question embeddings improves over RoBERTa-large by 14 F1 across two datasets. Finally, for zero-shot sentiment analysis, QuIP with question prompts improves over RoBERTa-large with MLM-style prompts by 5 accuracy points across three datasets, and extracts interpretable rationales as a side effect. Through ablations, we show that using real questions, a strong teacher model, and the bi-encoder architecture are all crucial to the success of QuIP; on the other hand, many other design decisions (e.g., question generation decoding strategies) do not qualitatively affect our main findings, pointing to the stability of the QuIP approach.

2 QA Infused Pre-training

We introduce QA Infused Pre-training (QuIP), in which we pre-train contextual representations with a bi-encoder extractive QA objective. As shown in Figure 1, a bi-encoder model encodes a passage and a question independently, and predicts the answer to be the phrase in the passage whose learned representation is most similar to that of the question. Since the model does not know the question when encoding the passage, its representation for each phrase must be similar to those of all questions that can be answered with that phrase. In contrast, a cross-encoder model receives the concatenation of the passage and question as a single input; these models can use many layers of self-attention to extract an answer, and cannot be interpreted as encoding all possible questions.

In the rest of this section, we introduce some notation (§2.1), then describe our QuIP pipeline, which consists of three steps: question generation (§2.2), cross-encoder teacher re-labeling (§2.3), and bi-encoder training (§2.4).

2.1 Notation

All of our models operate on sequences of tokens x = [x_1, ..., x_L] of length L. By convention, we assume that x_1 is always the special beginning-of-sequence token. Our goal is to learn an encoder r that maps inputs x to outputs r(x) = [r(x)_1, ..., r(x)_L] where each r(x)_i ∈ R^d for some fixed dimension d. We call r(x)_i the contextual representation of the i-th token in x.

In extractive question answering, a model is given a context passage c and question q, and must output a span of c that answers the question. Typically, models approach this by independently predicting probability distributions p(a_start | c, q) and p(a_end | c, q) over the answer start index a_start and end index a_end.

2.2 Question Generation

Question generation model. We train a BART-large model (Lewis et al., 2020) to generate question-answer pairs given context passages. The model receives the passage as context and must generate the answer text, then a special separator token, then the question; this approach is simpler than prior approaches that use separate models for answer and question generation (Lewis and Fan, 2019; Alberti et al., 2019; Puri et al., 2020), and works well in practice.

Training data. We train this model on the training data from the MRQA 2019 Shared Task (Fisch et al., 2019), which combines data from six QA datasets: HotpotQA (Yang et al., 2018), NaturalQuestions (Kwiatkowski et al., 2019), NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), SQuAD (Rajpurkar et al., 2016), and TriviaQA (Joshi et al., 2017). Together, these datasets cover many of the text sources commonly used for pre-training (Liu et al., 2019; Lewis et al., 2020), namely Wikipedia (HotpotQA, NaturalQuestions, SQuAD), news articles (NewsQA), and general web text (SearchQA, TriviaQA).
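To make the joint decoding format concrete, the following sketch shows one way such answer-then-question generation could be implemented with a fine-tuned BART-large model from the transformers library. This is an illustrative sketch rather than the authors' released code: the checkpoint name, separator string, and decoding settings are assumptions (the top-p value follows the nucleus-sampling choice described under "Generating questions" below).

```python
# Illustrative sketch (not the paper's released code): sample "answer [SEP] question"
# sequences from a BART-style seq2seq model conditioned on a passage.
from transformers import BartForConditionalGeneration, BartTokenizer

SEP = "[SEP]"  # assumed separator between the generated answer and question

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
# In practice this would be the BART-large checkpoint fine-tuned on the MRQA data.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def generate_qa_pairs(passage, num_questions=10):
    inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        do_sample=True,        # nucleus sampling, as in Section 2.2
        top_p=0.6,
        max_length=64,
        num_return_sequences=num_questions,
    )
    pairs = []
    for seq in outputs:
        text = tokenizer.decode(seq, skip_special_tokens=True)
        if SEP in text:  # keep only well-formed "answer [SEP] question" outputs
            answer, question = text.split(SEP, 1)
            pairs.append((answer.strip(), question.strip()))
    return pairs
```

Malformed outputs (e.g., generated answers that are not spans of the passage) are handled later by the teacher re-labeling step of §2.3.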
Generating questions. We run our question generation model over a large set of passages to generate a large dataset of question-answer pairs. We decode using nucleus sampling (Holtzman et al., 2020) with p = 0.6, which was chosen by manual inspection to balance diversity with quality of generated questions. We also experimented with other decoding strategies, such as a two-stage beam search, but found these made little difference in the final results (see §4.8). We obtain passages from the same training corpus as RoBERTa (Liu et al., 2019), which uses four sub-domains: BookCorpus plus Wikipedia, CC-News, OpenWebText, and Stories. For each domain, we sample 2 million passages and generate 10 questions per passage, for a total of 80 million questions.[1]

[1] We estimate that using the entire corpus with these settings would generate around 900 million questions. We leave investigation of further scaling to future work.

2.3 Teacher Re-labeling

The answers generated by our BART model are not always accurate, and are also not always spans in the context passage. To improve the training signal, we generate labels using a teacher model, as is common in knowledge distillation (Hinton et al., 2015). We use a standard cross-encoder RoBERTa-large model trained on the MRQA training data as our teacher model. The model takes in the concatenation of the context passage c and question q and predicts a_start and a_end with two independent 2-layer multi-layer perceptron (MLP) heads. Let T_start(c, q) and T_end(c, q) denote the teacher's predicted probability distributions over a_start and a_end, respectively.

2.4 Bi-encoder Training

Finally, we train a bi-encoder model to match the cross-encoder predictions on the generated questions. This objective encourages the contextual representation for a token in a passage to have high similarity (in inner product space) with the representation of every question that is answered by that token.

Model. The bi-encoder model with parameters θ consists of three components: an encoder r and two question embedding heads h_start and h_end that map R^d → R^d. These heads will only be applied to beginning-of-sequence (i.e., CLS) representations; as shorthand, define f_start(x) = h_start(r(x)_1) and likewise for f_end. Given a context passage c and question q, the model predicts

  p_θ(a_start = i | c, q) ∝ exp(r(c)_iᵀ f_start(q))    (1)
  p_θ(a_end = i | c, q) ∝ exp(r(c)_iᵀ f_end(q)).    (2)

In other words, the model independently encodes the passage and question with r, applies the start and end heads to the CLS token embedding for q, then predicts the answer start (end) index with a softmax over the dot product between the passage representation at that index and the output of the start (end) head.

We initialize r to be the pre-trained RoBERTa-large model (Liu et al., 2019), which uses d = 1024. h_start and h_end are randomly-initialized 2-layer MLPs with hidden dimension 1024, matching the default initialization of classification heads in RoBERTa.[2]

[2] https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/model.py

Training. For an input consisting of context c of length L and question q, we train θ to minimize the KL-divergence between the student and teacher predictions, which is equivalent to the objective

  − Σ_{i=1}^{L} [ T_start(c, q)_i log p_θ(a_start = i | c, q) + T_end(c, q)_i log p_θ(a_end = i | c, q) ]    (3)

up to constants that do not depend on θ. We train for two epochs on the generated corpus of 80 million questions, which takes roughly 56 hours on 8 V100 GPUs, or roughly 19 GPU-days; this is very fast compared to pre-training RoBERTa-large from scratch, which took roughly 5000 GPU-days (Liu et al., 2019). For efficiency, we process all questions for the same passage in the same batch, as encoding passages dominates runtime. For further details, see Appendix A.1.
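A minimal PyTorch sketch of the bi-encoder scoring rule in Equations (1)-(2) and the distillation objective in Equation (3), assuming a RoBERTa-large encoder from the transformers library. The module structure and head definitions are illustrative (masking, batching of questions per passage, and optimizer details are omitted); it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaModel

class BiEncoderQA(nn.Module):
    """Sketch of the QuIP student: shared encoder r plus start/end question heads."""
    def __init__(self, d=1024):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-large")  # r
        # Randomly initialized 2-layer MLP heads applied to the question CLS vector.
        self.h_start = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d))
        self.h_end = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d))

    def forward(self, passage_ids, question_ids):
        r_c = self.encoder(passage_ids).last_hidden_state         # (B, L, d): r(c)_i
        r_q = self.encoder(question_ids).last_hidden_state[:, 0]  # (B, d): CLS of q
        f_start, f_end = self.h_start(r_q), self.h_end(r_q)
        # Eq. (1)-(2): dot products between passage token reps and question heads.
        start_logits = torch.einsum("bld,bd->bl", r_c, f_start)
        end_logits = torch.einsum("bld,bd->bl", r_c, f_end)
        return start_logits, end_logits

def distillation_loss(start_logits, end_logits, t_start, t_end):
    """Eq. (3): cross-entropy against the teacher's soft start/end distributions,
    where t_start and t_end are the cross-encoder teacher's probabilities, shape (B, L)."""
    loss_start = -(t_start * F.log_softmax(start_logits, dim=-1)).sum(-1)
    loss_end = -(t_end * F.log_softmax(end_logits, dim=-1)).sum(-1)
    return (loss_start + loss_end).mean()
```

In the paper's setup, the teacher distributions come from the cross-encoder of §2.3, and all questions about the same passage are placed in the same batch, since encoding passages dominates runtime.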
3 Downstream Tasks

We evaluate QuIP on zero-shot paraphrase ranking, few-shot paraphrase classification, few-shot NER, and zero-shot sentiment analysis. We leverage the improved token-level representations afforded by QuIP, and in many cases also directly use QuIP's question-answering abilities.

3.1 Paraphrase Ranking

We first evaluate QuIP token-level representations by measuring their usefulness for zero-shot paraphrase ranking. In this task, systems must rank sentence pairs that are paraphrases above pairs that are non-paraphrases, without any task-specific training data. We compute similarity scores using the F_BERT variant of BERTScore (Zhang et al., 2020), which measures cosine similarities between the representation of each token in one sentence and its most similar token in the other sentence. Given sentences x¹ and x² of lengths L_1 and L_2, define

  B_1 = (1/L_1) Σ_{i=1}^{L_1} max_{1≤j≤L_2} cos-sim(r(x¹)_i, r(x²)_j)
  B_2 = (1/L_2) Σ_{i=1}^{L_2} max_{1≤j≤L_1} cos-sim(r(x²)_i, r(x¹)_j),

where cos-sim(u, v) = uᵀv / (‖u‖ ‖v‖) is the cosine similarity between vectors u and v. The F_BERT BERTScore is defined as the harmonic mean of B_1 and B_2. Zhang et al. (2020) showed that BERTScore with RoBERTa representations is useful for both natural language generation evaluation and paraphrase ranking. Since BERTScore uses token-level representations, we hypothesize that it should pair well with QuIP. Following Zhang et al. (2020), we use representations from the layer of the network that maximizes Pearson correlation between BERTScore and human judgments on the WMT16 metrics shared task (Bojar et al., 2016).

3.2 Paraphrase Classification

We next use either frozen or fine-tuned QuIP representations for few-shot paraphrase classification, rather than ranking. Through these experiments, we can compare QuIP with existing work on few-shot paraphrase classification.

Frozen model. We train a logistic regression model that uses BERTScore with frozen representations as features. For a given pair of sentences, we extract eight features, corresponding to BERTScore computed with the final eight layers (i.e., layers 17-24) of the network. These layers encompass the optimal layers for both RoBERTa-large and QuIP (see §4.4). Freezing the encoder is often useful in practice, particularly for large models, as the same model can be reused for many tasks (Brown et al., 2020; Du et al., 2020).

Fine-tuning. For fine-tuning, we use the same computation graph and logistic loss function, but now backpropagate through the parameters of our encoder. For details, see Appendix A.2.
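The sketch below illustrates the F_BERT computation from §3.1 and how the eight per-layer scores described under "Frozen model" could be turned into features for a logistic regression classifier. The function names, the use of scikit-learn, and the handling of hidden states are assumptions for illustration, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression

def f_bert(reps1, reps2):
    """F_BERT from Section 3.1: harmonic mean of the greedy-matching scores B1 and B2.
    reps1: (L1, d) token representations of sentence x1; reps2: (L2, d) for x2."""
    sim = F.normalize(reps1, dim=-1) @ F.normalize(reps2, dim=-1).T  # (L1, L2) cosine sims
    b1 = sim.max(dim=1).values.mean()  # each token of x1 matched to its best token of x2
    b2 = sim.max(dim=0).values.mean()  # and vice versa
    return (2 * b1 * b2 / (b1 + b2)).item()

def pair_features(layer_reps1, layer_reps2, layers=range(17, 25)):
    """One F_BERT feature per layer (17-24 of a 24-layer encoder), as in Section 3.2.
    layer_reps* map a layer index to that sentence's (L, d) token representations,
    e.g. taken from a frozen encoder run with output_hidden_states=True."""
    return [f_bert(layer_reps1[l], layer_reps2[l]) for l in layers]

# Frozen-encoder few-shot paraphrase classification:
#   X = [pair_features(...) for each sentence pair], y = paraphrase labels
#   clf = LogisticRegression().fit(X, y)
```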
3.3 Named Entity Recognition

We also use QuIP for few-shot[3] named entity recognition, which we frame as a BIO tagging task. We add a linear layer that takes in token-level representations and predicts the tag for each token, and backpropagate log loss through the entire network. By default, the output layer is initialized randomly.

As a refinement, we propose using question prompts to initialize this model. The output layer is parameterized by a T × d matrix M, where T is the number of distinct BIO tags. The log-probability of predicting the j-th tag for token i is proportional to the dot product between the representation for token i and the j-th row of M; this resembles how the bi-encoder predicts answers. Thus, we initialize each row of M with the start head embedding of a question related to that row's corresponding entity tag. For instance, we initialize the parameters for the B-location and I-location tags with the embedding for "What is a location ?" We normalize the question embeddings to have unit L2 norm. This style of initialization is uniquely enabled by our use of question answering as an interface to the model; it would be unclear how to use a language model similarly.

[3] Yang and Katiyar (2020) study few-shot NER assuming data from other NER datasets is available; we assume no such data is available, matching Huang et al. (2020).

3.4 Zero-shot Sentiment Analysis

Finally, we use QuIP for zero-shot binary sentiment analysis. We reduce sentiment analysis to QA by writing a pair of questions that ask for a reason why an item is good or bad (e.g., "Why is this movie [good/bad]?"). We predict the label whose corresponding question has higher similarity with the QuIP representation of some token in the input. This prompting strategy has the additional benefit of extracting rationales, namely the span that the QuIP model predicts as the answer to the question.

More formally, let x be an input sentence and (q_0, q_1) be a pair of questions (i.e., a prompt). For label y ∈ {0, 1}, we compute a score for y as

  S(x, y) = max_i r(x)_iᵀ f_start(q_y) + max_i r(x)_iᵀ f_end(q_y).    (4)

Essentially, S(x, y) measures the extent to which some span in x looks like the answer to the question q_y. We predict whichever y has the higher value of S(x, y) − C_y, where C_y is a calibration constant that offsets the model's bias towards answering q_0 or q_1. Our inclusion of C_y is inspired by Zhao et al. (2021), who recommend calibrating zero-shot and few-shot models with a baseline derived from content-free inputs to account for biases towards a particular label. To choose C_y, we obtain a list W of the ten most frequent English words, all of which convey no sentiment, and define C_y as the mean over w ∈ W of S(w, y), i.e., the score when using w as the input sentence (see Appendix A.4).
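A sketch of the zero-shot scoring rule in Equation (4) together with the calibration constants C_y, reusing the BiEncoderQA module sketched in §2.4 above. The prompt wording and the use of frequent sentiment-free words for calibration follow the paper, but the function signatures, the label ordering, and the encode helper are illustrative assumptions.

```python
import torch

def span_score(model, sent_ids, question_ids):
    """S(x, y) from Eq. (4): best start-match plus best end-match for question q_y."""
    start_logits, end_logits = model(sent_ids, question_ids)
    return start_logits.max(dim=-1).values + end_logits.max(dim=-1).values

def zero_shot_sentiment(model, encode, sentence, prompts, calib_words):
    """prompts = (q0, q1), e.g. ("Why is this movie bad?", "Why is this movie good?");
    calib_words is a list of frequent, sentiment-free words used as content-free inputs."""
    q_ids = [encode(q) for q in prompts]
    # Calibration constants C_y: mean score of each question over the content-free inputs.
    C = [torch.stack([span_score(model, encode(w), q) for w in calib_words]).mean()
         for q in q_ids]
    x_ids = encode(sentence)
    scores = [span_score(model, x_ids, q_ids[y]) - C[y] for y in (0, 1)]
    return int(scores[1] > scores[0])  # assumed convention: 1 = positive, 0 = negative
```

The same start and end logits can also be decoded into the rationale span mentioned above.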
4 Experiments

4.1 Experimental details

Datasets. For paraphrasing, we use four datasets: QQP (Iyer et al., 2017), MRPC (Dolan and Brockett, 2005), PAWS-Wiki, and PAWS-QQP (Zhang et al., 2019). The PAWS datasets are especially of interest as they are designed to be challenging for bag-of-words models, and thus can test whether our representations are truly contextual or mostly lexical in nature. For QQP and MRPC, we use the few-shot splits from Gao et al. (2021) that include 16 examples per class; for the PAWS datasets, we create new few-shot splits in the same manner. We report results on the development sets of QQP and MRPC (as test labels were not available), the test set of PAWS-Wiki, and the "dev-and-test" set of PAWS-QQP. For NER, we use two datasets: CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and WNUT-17 (Derczynski et al., 2017). We use the few-shot splits from Huang et al. (2020) that include 5 examples per entity type. All few-shot experiments report an average over five random splits and seeds, following both Gao et al. (2021) and Huang et al. (2020). For sentiment analysis, we use two movie review datasets, SST-2 (Socher et al., 2013) and Movie Reviews (MR; Pang and Lee, 2005), as well as the Customer Reviews (CR) dataset (Hu and Liu, 2004). We evaluate on the SST-2 development set and the MR and CR test sets made by Gao et al. (2021).

Hyperparameter and prompt selection. Due to the nature of zero-shot and few-shot experiments, we minimize the extent to which we tune hyperparameters, relying on existing defaults and previously published hyperparameters. For few-shot paraphrase classification, NER, and sentiment analysis, we developed our final method only using QQP, CoNLL, and SST-2, respectively, and directly applied it to the other datasets with no further tuning. We did measure zero-shot paraphrase ranking accuracy on all datasets during development of QuIP. For more details, see Appendix A.3.

For NER, we used the first question prompts we wrote for both CoNLL and WNUT, which all follow the same format, "Who/What is a/an [entity type] ?" (see Appendix A.6 for all prompts). For sentiment analysis, we wrote six prompts (shown in Appendix A.8) and report mean accuracy over these prompts, to avoid pitfalls associated with prompt tuning (Perez et al., 2021). We use the same prompts for SST-2 and MR; for CR, the only change we make is replacing occurrences of the word "movie" with "product" to reflect the change in domain between these datasets.

4.2 Baselines and Ablations

To confirm the importance of all three stages of our pre-training pipeline, we compare with a number of baselines and ablations.

No question generation. We train the bi-encoder model directly on the MRQA training data ("Bi-encoder + MRQA"). We also include the cross-encoder teacher model trained on MRQA as a baseline ("Cross-encoder + MRQA"). These settings mirror standard intermediate task training (Phang et al., 2018; Pruksachatkun et al., 2020).

No teacher. We train the bi-encoder using the answer supervision generated by the question generation model ("QuIP, no teacher"). Occasionally, the generated answer is not a span in the corresponding passage; we consider these questions unanswerable and treat the span containing the CLS token as the "correct" answer, as in Devlin et al. (2019).

Cross-encoder self-training. To test whether the bottleneck imposed by the bi-encoder architecture is crucial for the success of QuIP, we also train a cross-encoder model on the same generated data ("QuIP, cross-encoder student"). Since this student model has the same architecture as the teacher model, we train the student to maximize the likelihood of the teacher's argmax predictions, a standard self-training objective (Lee, 2013; Kumar et al., 2020). Training the cross-encoder is considerably less efficient than training the bi-encoder, since batching questions about the same passage together does not speed up training, so we train it for a comparable number of total GPU-hours (60 hours on 8 V100 GPUs).

Unsupervised QA. We test whether QuIP relies on having real QA examples, or if it is sufficient to use any task that forces word representations to encode information about surrounding words. To do this, we train on data generated by the Unsupervised QA pipeline of Lewis et al. (2019). This pipeline creates pseudo-questions by masking and applying noise to sentences in a passage. Lewis et al. (2019) showed that training on such data can yield non-trivial accuracy on SQuAD without any reliance on real QA data. We generate 10 pseudo-questions per passage in our training corpus, yielding another dataset of 80 million examples, and train our bi-encoder on this data ("Bi-encoder + UnsupervisedQA").

4.3 Bi-encoder Question Answering

While not the main focus of this paper, we first check that QuIP improves bi-encoder QA accuracy. Table 1 reports results on the SQuAD development set. QuIP not only improves over Lee et al. (2021) by 5.4 F1 on SQuAD, but also surpasses the reported human accuracy of 91.2 F1 on the SQuAD test set, as well as the best cross-encoder BERT-large single model from Devlin et al. (2019). QuIP greatly improves over baselines that directly train on the MRQA data or do not use the teacher model. As expected, our cross-encoder models are even more accurate at QA; the next sections will show that their representations are less useful for non-QA tasks than the QuIP bi-encoder representations. Results on all MRQA development datasets are provided in Appendix A.5.

Model | EM | F1
Lee et al. (2021) | 78.3 | 86.3
Bi-encoder + UnsupervisedQA | 17.4 | 24.9
Bi-encoder + MRQA | 70.7 | 79.4
QuIP, no teacher | 75.3 | 84.7
QuIP | 85.2 | 91.7
BERT-large cross-encoder | 84.2 | 91.1
Cross-encoder + MRQA | 88.8 | 94.7
QuIP, cross-encoder student | 89.5 | 94.8

Table 1: EM and F1 scores on the SQuAD development set for bi-encoder (top) and cross-encoder (bottom) models. The QuIP bi-encoder outperforms the other bi-encoder model baselines, and even BERT-large trained as a cross-encoder. The RoBERTa cross-encoder models are more accurate than the bi-encoder at QA, but we show later that the bi-encoder model representations are more useful for non-QA tasks.
4.4 Zero-shot Paraphrase Ranking

The first half of Table 2 shows WMT development set Pearson correlations averaged across six to-English datasets, as in Zhang et al. (2020), along with the best layer for each model. QuIP reaches its optimal score at a later layer (20) than RoBERTa-large (17), which may suggest that the QuIP training objective is more closely aligned with the goal of learning better representations than MLM.

The rest of Table 2 shows zero-shot paraphrase ranking results using BERTScore. QuIP improves substantially over RoBERTa on all four datasets, with an average improvement of .076 AUROC. The improvement is greatest on the PAWS datasets; since these datasets are supposed to challenge models that rely heavily on lexical rather than contextual cues, we infer that QuIP representations are much more contextualized than RoBERTa representations. Training on Unsupervised QA data degrades performance compared to RoBERTa, indicating that the QuIP objective accomplishes much more than merely forcing word representations to encode local context in a simple way. Training the bi-encoder directly on the MRQA dataset or without the teacher improves on average over RoBERTa, but QuIP greatly outperforms both baselines across the board. The cross-encoder models also lag behind QuIP at paraphrase ranking, despite their higher QA accuracy. Thus, we conclude that having real questions, accurate answer supervision, and a bi-encoder student model are all crucial to the success of QuIP.

Model | WMT r | WMT Best Layer | QQP | MRPC | PAWS-Wiki | PAWS-QQP
RoBERTa-large | .739 | 17 | .763 | .831 | .698 | .690
Cross-encoder + MRQA | .744 | 16 | .767 | .840 | .742 | .731
QuIP, cross-encoder student | .753 | 16 | .769 | .847 | .751 | .706
Bi-encoder + UnsupervisedQA | .654 | 11 | .747 | .801 | .649 | .580
Bi-encoder + MRQA | .749 | 15 | .771 | .807 | .747 | .725
QuIP, no teacher | .726 | 19 | .767 | .831 | .780 | .709
QuIP | .764 | 20 | .809 | .849 | .830 | .796

Table 2: Pearson correlation on WMT development data, best layer chosen based on WMT results, and AUROC on zero-shot paraphrase ranking using BERTScore. QuIP outperforms all baselines on all datasets.

4.5 Paraphrase Classification

Table 3 shows few-shot paraphrase classification results. We compare with LM-BFF (Gao et al., 2021), which pairs RoBERTa-large with MLM-style prompts to do few-shot learning. We use the setting with fixed manually written prompts and demonstrations, which was their best method on QQP by 2.1 F1 and was 0.3 F1 worse than their best method on MRPC. QuIP used as a frozen encoder is competitive with LM-BFF on QQP and outperforms it by 6.1 F1 on MRPC, 11.2 F1 on PAWS-Wiki, and 12.1 F1 on PAWS-QQP. Fine-tuning gives additional improvements on three of the four datasets, though it worsens accuracy on PAWS-QQP. Fine-tuning with QuIP also outperforms using RoBERTa in the same way by an average of 6.9 F1.

Model | Fine-tuned? | QQP | MRPC | PAWS-Wiki | PAWS-QQP
LM-BFF (reported) | Fine-tuned | 69.8 (0.8) | 77.8 (0.9) | - | -
LM-BFF (rerun) | Fine-tuned | 67.1 (0.9) | 76.5 (1.5) | 60.7 (0.7) | 50.1 (2.8)
RoBERTa-large | Frozen | 64.4 (0.4) | 80.6 (0.7) | 62.3 (0.9) | 50.6 (0.4)
QuIP | Frozen | 68.9 (0.2) | 82.6 (0.4) | 71.9 (0.5) | 63.0 (1.2)
RoBERTa-large | Fine-tuned | 64.9 (0.7) | 84.4 (0.3) | 65.7 (0.3) | 50.9 (0.8)
QuIP | Fine-tuned | 71.0 (0.3) | 86.6 (0.4) | 75.1 (0.2) | 60.9 (1.0)

Table 3: F1 scores on few-shot paraphrase classification, averaged across five training splits (standard errors in parentheses). QuIP outperforms prior work on few-shot paraphrase classification (LM-BFF; Gao et al., 2021) as well as our own baselines using RoBERTa.

4.6 Named Entity Recognition

Table 4 shows few-shot NER results on the CoNLL and WNUT datasets. QuIP improves over RoBERTa-large by 11 F1 on CoNLL and 2.9 F1 on WNUT when used with a randomly initialized output layer. We see a further improvement of 4 F1 on CoNLL and 7.4 F1 on WNUT when using question embeddings to initialize the output layer. Using the cross-encoder trained directly on QA data is roughly as good as QuIP when using randomly initialized output layers, but it is incompatible with question embedding initialization (as it does not produce question embeddings).

We also measured NER results when training on the full training datasets, but found minimal evidence that QuIP improves over RoBERTa. As shown in Appendix A.7, QuIP gives a 0.6 F1 improvement on WNUT but effectively no gain on CoNLL. With extensive fine-tuning, RoBERTa can learn to extract useful token-level features, but it lags behind QuIP in the few-shot regime.

Model | CoNLL | WNUT
Huang et al. (2020) | 65.4 | 37.6
Standard init.:
RoBERTa-large | 59.0 (2.4) | 39.3 (0.6)
Cross-encoder + MRQA | 68.9 (3.3) | 43.0 (0.9)
QuIP, cross-encoder student | 63.4 (3.3) | 39.4 (1.7)
Bi-encoder + UnsupervisedQA | 58.2 (2.6) | 26.0 (1.0)
Bi-encoder + MRQA | 66.4 (3.3) | 42.2 (0.4)
QuIP, no teacher | 67.7 (1.9) | 40.7 (1.4)
QuIP | 70.0 (2.4) | 42.2 (0.5)
Question prompt init.:
Bi-encoder + UnsupervisedQA | 62.7 (3.3) | 30.4 (0.8)
Bi-encoder + MRQA | 72.0 (2.8) | 44.0 (1.3)
QuIP, no teacher | 71.4 (3.0) | 47.8 (1.1)
QuIP | 74.0 (2.4) | 49.6 (0.5)

Table 4: F1 scores on few-shot NER, averaged over five training splits (standard errors in parentheses). QuIP with question prompts performs best on both datasets.
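The following sketch illustrates the question-prompt initialization behind the "Question prompt init." rows of Table 4 (described in §3.3): each row of the T × d tag matrix is set to the L2-normalized start-head embedding of a question describing that tag. It reuses the BiEncoderQA sketch from §2.4; the tag-to-question mapping shown, the treatment of the O tag, and the random-initialization scale are assumptions.

```python
import torch
import torch.nn.functional as F

def init_tag_matrix(model, encode, tag_questions, d=1024):
    """Build the T x d output matrix M for BIO tagging, one row per tag.

    tag_questions maps each BIO tag to a question prompt, e.g.
    {"B-LOC": "What is a location ?", "I-LOC": "What is a location ?",
     "B-PER": "Who is a person ?", ..., "O": None}; None rows keep a random init."""
    tags = list(tag_questions)
    M = torch.randn(len(tags), d) * 0.02  # default random initialization (assumed scale)
    with torch.no_grad():
        for row, tag in enumerate(tags):
            question = tag_questions[tag]
            if question is None:
                continue
            r_q = model.encoder(encode(question)).last_hidden_state[:, 0]  # question CLS
            f_start = model.h_start(r_q).squeeze(0)       # start-head question embedding
            M[row] = F.normalize(f_start, dim=-1)         # unit L2 norm, as in Section 3.3
    return M  # token i is tagged via a softmax over r(x)_i @ M.T
```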
4.7 Sentiment Analysis

Table 5 shows zero-shot accuracy on our three sentiment analysis datasets. We compare with zero-shot results for LM-BFF (Gao et al., 2021)[4] and reported zero-shot results from Zhao et al. (2021) using GPT-3 with Contextual Calibration (CC) on SST-2. QuIP using an average prompt outperforms zero-shot LM-BFF by 5.4 points, averaged across the three datasets. Choosing the best prompt on SST-2 and using that for all datasets improves results not only on SST-2 but also MR, and maintains average accuracy on CR. Using the cross-encoder student QA model with the same prompts leads to worse performance: we hypothesize that this is because the bi-encoder bottleneck encourages QuIP to make each span's representation dissimilar to those of questions that it does not answer, whereas the cross-encoder model trained only on answerable questions does not naturally handle unanswerable questions (e.g., "Why is it bad?" asked about a positive review).

[4] We tried applying our calibration strategy to LM-BFF as well, but found that it did not improve accuracy.

Model | SST-2 | MR | CR
CC + GPT-3 | 71.6 | - | -
LM-BFF | 83.6 | 80.8 | 79.5
QuIP (average) | 87.9 (0.6) | 81.9 (0.4) | 90.3 (0.2)
w/ cross-enc. student | 83.3 (0.4) | 78.5 (0.4) | 88.9 (0.3)
QuIP (tune on SST-2) | 89.6 | 83.1 | 90.4

Table 5: Zero-shot accuracy on sentiment analysis. Third and fourth rows show mean accuracy across six prompts (standard errors in parentheses). QuIP using an average prompt outperforms prior work; the final row shows additional benefits from using the best prompt on SST-2 for all datasets.

Table 6 shows rationales extracted from random SST-2 examples for which QuIP was correct with the best prompt for SST-2 ("What is the reason this movie is [good/bad]?"). To prefer shorter rationales, we extract the highest-scoring span of five BPE tokens or less. The model often identifies phrases that convey clear sentiment. Appendix A.9 shows many more full examples and rationales.

Label | Rationales
- | "too slim", "stale", "every idea", "wore out its welcome", "unpleasant viewing experience", "lifeless", "plot", "amateurishly assembled", "10 times their natural size", "wrong turn"
+ | "packed with information and impressions", "slash-and-hack", "tightly organized efficiency", "passion and talent", "best films", "surprises", "great summer fun", "play equally well", "convictions", "wickedly subversive bent"

Table 6: Rationales extracted by QuIP on ten random examples for each label from SST-2.

4.8 Stability Analysis

We experimented with some design decisions that did not materially affect our results. Here, we report these findings as evidence that our basic recipe is stable to many small changes. First, we concatenated the representations of all passages in the same batch and on the same GPU together (9 passages on average), and trained the model to extract answers from this larger pseudo-document; this effectively adds in-batch negative passages, as in Lee et al. (2021). Second, we trained the model to match the argmax prediction of the teacher, rather than its soft distribution over start and end indices. Finally, we used a two-stage beam search to generate questions. For a given passage, we generated 20 possible answers via beam search, chose 10 of these to maximize answer diversity, then generated one question for each answer with another beam search. Our goal was to ensure diversity by forcing questions to be about different answers, while also maintaining high question quality. As shown in Table 7, these choices have a relatively minor impact on the results (within .007 AUROC and 1 F1 on NER).

Model | SQuAD F1 | Paraphrase AUROC | NER F1
QuIP | 91.7 | .821 | 61.8
+ concat. passages | 91.7 | .818 | 62.7
w/ hard labels | 91.5 | .814 | 62.5
w/ 2-stage beam search | 91.7 | .821 | 62.8

Table 7: SQuAD development set F1, average zero-shot paraphrase ranking AUROC across all datasets, and average few-shot NER F1 using question prompts across both datasets for QuIP variants. Models shown here are all similarly effective.

5 Discussion and Related Work

In this work, we pre-trained token-level contextual representations that are useful for few-shot learning on downstream tasks. Our key idea was to leverage signal from extractive QA datasets to define what information should be encoded in passage representations. Our work builds on prior work studying question generation and answering, pre-training, and few-shot learning.

5.1 Question Generation

Neural question generation has been studied by a number of previous papers for different purposes (Du et al., 2017; Du and Cardie, 2018; Zhao et al., 2018; Lewis and Fan, 2019; Alberti et al., 2019; Puri et al., 2020). Recently, Lewis et al. (2021) use model-generated question-answer pairs for efficient open-domain QA. Bartolo et al. (2021) use generated questions to improve robustness to human-written adversarial questions. Our work focuses on how generated questions can help learn general-purpose representations. We also show that a relatively simple strategy of generating the answer and question together with a single seq-to-seq model can be effective; most prior work uses separate answer selection and question generation models.
5.2 Phrase-indexed Question Answering

Phrase-indexed question answering is a paradigm for open-domain question answering that retrieves answers by embedding questions and candidate answers in a shared embedding space (Seo et al., 2018, 2019; Lee et al., 2020). Phrase-indexed QA requires using a bi-encoder architecture, as we use in our work. Especially related is Lee et al. (2021), which also uses question generation and a cross-encoder teacher model to improve phrase-indexed QA. We focus not on open-domain QA, but on learning generally useful representations for non-QA downstream tasks.

5.3 Learning contextual representations

Pre-training on unlabeled data has been highly effective for learning contextual representations (Peters et al., 2018; Devlin et al., 2019), but recent work has shown further improvements by using labeled data. Intermediate task training (Phang et al., 2018) improves representations by training directly on large labeled datasets. Muppet (Aghajanyan et al., 2021) improves models by multi-task pre-finetuning on many labeled datasets. Most similar to our work, He et al. (2020) uses extractive question answering as a pre-training task for a BERT paragraph encoder. Our work leverages question generation and knowledge distillation to improve over directly training on labeled data, and focuses on zero-shot and few-shot applications.

Other work has used methods similar to ours to learn sentence embeddings. Reimers and Gurevych (2019) train sentence embeddings for sentence similarity tasks using natural language inference data. Thakur et al. (2021) train a sentence embedding bi-encoder to mimic the predictions of a cross-encoder model. We focus on learning token-level representations, rather than a single vector for a sentence, and thus use token-level supervision available in extractive QA.

While we focus on QA as a source of training signal, other work aims to improve downstream QA accuracy. Ram et al. (2021) propose a span extraction pre-training objective that enables few-shot QA. Khashabi et al. (2020) show that multi-task training on many QA datasets, both extractive and non-extractive, can improve QA performance. Future work could explore learning question-agnostic passage embeddings that are also useful for non-extractive QA tasks.

5.4 Few-shot Learning

Language models have recently been used for few-shot learning on a wide range of tasks (Brown et al., 2020; Schick and Schütze, 2021; Gao et al., 2021). These papers all use natural language prompts to convert NLP tasks into language modeling problems. We develop alternative methods for few-shot learning that use token-level representations and question-based prompts rather than language model prompts.

Looking forward, pre-trained representations could be even more useful in few-shot settings if they could inherently capture more complex relationships between words and sentences. For instance, box embeddings (Vilnis et al., 2018) naturally capture asymmetric relationships like set inclusion, so pre-training such embeddings may be helpful for tasks like few-shot entailment classification. We hope to see more work in the future on designing pre-training objectives that align better with downstream needs for few-shot learning.

Acknowledgements

We thank Terra Blevins for investigating applications to word sense disambiguation, and Douwe Kiela, Max Bartolo, Sebastian Riedel, Sewon Min, Patrick Lewis, and Scott Yih for their feedback on this project. We thank Jiaxin Huang for providing the few-shot NER splits used in their paper.

References

Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. 2021. Muppet: Massive multi-task representations with pre-finetuning. arXiv preprint arXiv:2101.11038.

Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA cor-
pora generation with roundtrip consistency. In Pro- Xinya Du and Claire Cardie. 2018. Harvest- ceedings of the 57th Annual Meeting of the Asso- ing paragraph-level question-answer pairs from ciation for Computational Linguistics, pages 6168– Wikipedia. In Proceedings of the 56th Annual Meet- 6173, Florence, Italy. Association for Computa- ing of the Association for Computational Linguistics tional Linguistics. (Volume 1: Long Papers), pages 1907–1917, Mel- bourne, Australia. Association for Computational Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Linguistics. Riedel, Pontus Stenetorp, and Douwe Kiela. 2021. Improving question answering model robustness Xinya Du, Junru Shao, and Claire Cardie. 2017. Learn- with synthetic adversarial data generation. arXiv ing to ask: Neural question generation for reading preprint arXiv:2104.08678. comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Lin- Ondřej Bojar, Yvette Graham, Amir Kamran, and guistics (Volume 1: Long Papers), pages 1342–1352, Miloš Stanojević. 2016. Results of the WMT16 met- Vancouver, Canada. Association for Computational rics shared task. In Proceedings of the First Con- Linguistics. ference on Machine Translation: Volume 2, Shared Task Papers, pages 199–231, Berlin, Germany. As- Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel sociation for Computational Linguistics. Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requir- Tom Brown, Benjamin Mann, Nick Ryder, Melanie ing discrete reasoning over paragraphs. In Proceed- Subbiah, Jared D Kaplan, Prafulla Dhariwal, ings of the 2019 Conference of the North American Arvind Neelakantan, Pranav Shyam, Girish Sastry, Chapter of the Association for Computational Lin- Amanda Askell, Sandhini Agarwal, Ariel Herbert- guistics: Human Language Technologies, Volume 1 Voss, Gretchen Krueger, Tom Henighan, Rewon (Long and Short Papers), pages 2368–2378, Min- Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, neapolis, Minnesota. Association for Computational Clemens Winter, Chris Hesse, Mark Chen, Eric Linguistics. Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Alec Radford, Ilya Sutskever, and Dario Amodei. Guney, Volkan Cirik, and Kyunghyun Cho. 2017. 2020. Language models are few-shot learners. In SearchQA: A new Q&A dataset augmented with Advances in Neural Information Processing Systems, context from a search engine. arXiv preprint volume 33, pages 1877–1901. Curran Associates, arXiv:1704.05179. Inc. Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eu- nsol Choi, and Danqi Chen. 2019. MRQA 2019 Leon Derczynski, Eric Nichols, Marieke van Erp, and shared task: Evaluating generalization in reading Nut Limsopatham. 2017. Results of the WNUT2017 comprehension. In Proceedings of the 2nd Work- shared task on novel and emerging entity recogni- shop on Machine Reading for Question Answering, tion. In Proceedings of the 3rd Workshop on Noisy pages 1–13, Hong Kong, China. Association for User-generated Text, pages 140–147, Copenhagen, Computational Linguistics. Denmark. Association for Computational Linguis- tics. Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot Jacob Devlin, Ming-Wei Chang, Kenton Lee, and learners. In Association for Computational Linguis- Kristina Toutanova. 2019. BERT: Pre-training of tics (ACL). 
deep bidirectional transformers for language under- standing. In Proceedings of the 2019 Conference Hangfeng He, Qiang Ning, and Dan Roth. 2020. of the North American Chapter of the Association QuASE: Question-answer driven sentence encoding. for Computational Linguistics: Human Language In Proceedings of the 58th Annual Meeting of the Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, pages pages 4171–4186, Minneapolis, Minnesota. Associ- 8743–8758, Online. Association for Computational ation for Computational Linguistics. Linguistics. William B. Dolan and Chris Brockett. 2005. Automati- Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. cally constructing a corpus of sentential paraphrases. Question-answer driven semantic role labeling: Us- In Proceedings of the Third International Workshop ing natural language to annotate natural language. on Paraphrasing (IWP2005). In Proceedings of the 2015 Conference on Empiri- cal Methods in Natural Language Processing, pages Jingfei Du, Myle Ott, Haoran Li, Xing Zhou, and 643–653, Lisbon, Portugal. Association for Compu- Veselin Stoyanov. 2020. General purpose text em- tational Linguistics. beddings from pre-trained language models for scal- able inference. In Findings of the Association for Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Computational Linguistics: EMNLP 2020, pages 2015. Distilling the knowledge in a neural network. 3018–3030, Online. Association for Computational In NIPS Deep Learning and Representation Learn- Linguistics. ing Workshop.
Lynette Hirschman, Marc Light, Eric Breck, and Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. John D. Burger. 1999. Deep read: A reading com- Natural questions: A benchmark for question an- prehension system. In Proceedings of the 37th An- swering research. Transactions of the Association nual Meeting of the Association for Computational for Computational Linguistics, 7:452–466. Linguistics, pages 325–332, College Park, Maryland, USA. Association for Computational Linguistics. Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAd- Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and ing comprehension dataset from examinations. In Yejin Choi. 2020. The curious case of neural text de- Proceedings of the 2017 Conference on Empirical generation. In International Conference on Learn- Methods in Natural Language Processing, pages ing Representations. 785–794, Copenhagen, Denmark. Association for Computational Linguistics. Minqing Hu and Bing Liu. 2004. Mining and summa- rizing customer reviews. In Proceedings of the tenth Dong-Hyun Lee. 2013. Pseudo-label: The simple and ACM SIGKDD international conference on Knowl- efficient semi-supervised learning method for deep edge discovery and data mining, pages 168–177. neural networks. In ICML 2013 Workshop on Chal- Jiaxin Huang, Chunyuan Li, Krishan Subudhi, Damien lenges in Representation Learning (WREPL). Jose, Shobana Balakrishnan, Weizhu Chen, Baolin Peng, Jianfeng Gao, and Jiawei Han. 2020. Few- Jinhyuk Lee, Minjoon Seo, Hannaneh Hajishirzi, and shot named entity recognition: A comprehensive Jaewoo Kang. 2020. Contextualized sparse repre- study. arXiv preprint arXiv:2012.14978. sentations for real-time open-domain question an- swering. In Proceedings of the 58th Annual Meet- Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. ing of the Association for Computational Linguis- 2017. First quora dataset release: Question pairs. tics, pages 912–919, Online. Association for Com- https://www.quora.com/q/quoradata/ putational Linguistics. First-Quora-Dataset-Release-Question-Pairs. Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, and Danqi Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Chen. 2021. Learning dense representations of Zettlemoyer. 2017. TriviaQA: A large scale dis- phrases at scale. In Association for Computational tantly supervised challenge dataset for reading com- Linguistics (ACL). prehension. In Proceedings of the 55th Annual Meet- ing of the Association for Computational Linguistics Wendy Lehnert. 1977. Human and computational ques- (Volume 1: Long Papers), pages 1601–1611, Van- tion answering. Cognitive Science, 1(1):47–73. couver, Canada. Association for Computational Lin- guistics. Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, reading comprehension. In Proceedings of the 21st Jonghyun Choi, Ali Farhadi, and Hannaneh Ha- Conference on Computational Natural Language jishirzi. 2017. Are you smarter than a sixth grader? Learning (CoNLL 2017), pages 333–342, Vancou- textbook question answering for multimodal ma- ver, Canada. Association for Computational Linguis- chine comprehension. In 2017 IEEE Conference on tics. Computer Vision and Pattern Recognition (CVPR), pages 5376–5384. Mike Lewis and Angela Fan. 2019. Generative ques- Daniel Khashabi, Sewon Min, Tushar Khot, Ashish tion answering: Learning to answer the whole ques- Sabharwal, Oyvind Tafjord, Peter Clark, and Han- tion. 
In International Conference on Learning Rep- naneh Hajishirzi. 2020. UNIFIEDQA: Crossing for- resentations. mat boundaries with a single QA system. In Find- ings of the Association for Computational Linguis- Mike Lewis, Yinhan Liu, Naman Goyal, Mar- tics: EMNLP 2020, pages 1896–1907, Online. As- jan Ghazvininejad, Abdelrahman Mohamed, Omer sociation for Computational Linguistics. Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre- Ananya Kumar, Tengyu Ma, and Percy Liang. 2020. training for natural language generation, translation, Understanding self-training for gradual domain and comprehension. In Proceedings of the 58th An- adaptation. In Proceedings of the 37th International nual Meeting of the Association for Computational Conference on Machine Learning, volume 119 of Linguistics, pages 7871–7880, Online. Association Proceedings of Machine Learning Research, pages for Computational Linguistics. 5468–5479. PMLR. Patrick Lewis, Ludovic Denoyer, and Sebastian Riedel. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- 2019. Unsupervised question answering by cloze field, Michael Collins, Ankur Parikh, Chris Al- translation. In Proceedings of the 57th Annual Meet- berti, Danielle Epstein, Illia Polosukhin, Jacob De- ing of the Association for Computational Linguistics, vlin, Kenton Lee, Kristina Toutanova, Llion Jones, pages 4896–4910, Florence, Italy. Association for Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Computational Linguistics.
Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Zettlemoyer. 2018. Deep contextualized word rep- Minervini, Heinrich Küttler, Aleksandra Piktus, Pon- resentations. In Proceedings of the 2018 Confer- tus Stenetorp, and Sebastian Riedel. 2021. Paq: 65 ence of the North American Chapter of the Associ- million probably-asked questions and what you can ation for Computational Linguistics: Human Lan- do with them. arXiv preprint arXiv:2102.07033. guage Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- for Computational Linguistics. dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Jason Phang, Thibault Févry, and Samuel R Bowman. RoBERTa: A robustly optimized bert pretraining ap- 2018. Sentence encoders on stilts: Supplementary proach. arXiv preprint arXiv:1907.11692. training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language de- Yada Pruksachatkun, Jason Phang, Haokun Liu, cathlon: Multitask learning as question answering. Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe arXiv preprint arXiv:1806.08730. Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. 2020. Intermediate-task transfer learning Julian Michael, Gabriel Stanovsky, Luheng He, Ido Da- with pretrained language models: When and why gan, and Luke Zettlemoyer. 2018. Crowdsourcing does it work? In Proceedings of the 58th Annual question-answer meaning representations. In Pro- Meeting of the Association for Computational Lin- ceedings of the 2018 Conference of the North Amer- guistics, pages 5231–5247, Online. Association for ican Chapter of the Association for Computational Computational Linguistics. Linguistics: Human Language Technologies, Vol- ume 2 (Short Papers), pages 560–568, New Orleans, Raul Puri, Ryan Spring, Mohammad Shoeybi, Mostofa Louisiana. Association for Computational Linguis- Patwary, and Bryan Catanzaro. 2020. Training ques- tics. tion answering models from synthetic data. In Proceedings of the 2020 Conference on Empirical Stephen Mussmann, Robin Jia, and Percy Liang. 2020. Methods in Natural Language Processing (EMNLP), On the Importance of Adaptive Data Collection for pages 5811–5826, Online. Association for Computa- Extremely Imbalanced Pairwise Tasks. In Findings tional Linguistics. of the Association for Computational Linguistics: Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and EMNLP 2020, pages 3400–3413, Online. Associa- Percy Liang. 2016. SQuAD: 100,000+ questions for tion for Computational Linguistics. machine comprehension of text. In Proceedings of Bo Pang and Lillian Lee. 2005. Seeing stars: Ex- the 2016 Conference on Empirical Methods in Natu- ploiting class relationships for sentiment categoriza- ral Language Processing, pages 2383–2392, Austin, tion with respect to rating scales. In Proceed- Texas. Association for Computational Linguistics. ings of the 43rd Annual Meeting of the Association Ori Ram, Yuval Kirstain, Jonathan Berant, Amir for Computational Linguistics (ACL’05), pages 115– Globerson, and Omer Levy. 2021. Few-shot ques- 124, Ann Arbor, Michigan. Association for Compu- tion answering by pretraining span selection. In As- tational Linguistics. sociation for Computational Linguistics (ACL). F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, Nils Reimers and Iryna Gurevych. 2019. Sentence- B. Thirion, O. Grisel, M. Blondel, P. 
Prettenhofer, BERT: Sentence embeddings using Siamese BERT- R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, networks. In Proceedings of the 2019 Conference on D. Cournapeau, M. Brucher, M. Perrot, and E. Duch- Empirical Methods in Natural Language Processing esnay. 2011. Scikit-learn: Machine learning in and the 9th International Joint Conference on Natu- Python. Journal of Machine Learning Research, ral Language Processing (EMNLP-IJCNLP), pages 12:2825–2830. 3982–3992, Hong Kong, China. Association for Computational Linguistics. Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Ál- varo Rodrigo, Corina Forăscu, In̄aki Alegria, Danilo Matthew Richardson, Christopher J.C. Burges, and Giampiccolo, Nicolas Moreau, and Petya Osenova. Erin Renshaw. 2013. MCTest: A challenge dataset 2013. QA4MRE 2011-2013: Overview of question for the open-domain machine comprehension of text. answering for machine reading evaluation. In Inter- In Proceedings of the 2013 Conference on Empiri- national Conference of the Cross-Language Evalu- cal Methods in Natural Language Processing, pages ation Forum for European Languages, pages 303 – 193–203, Seattle, Washington, USA. Association for 320. Computational Linguistics. Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and True few-shot learning with language models. arXiv Karthik Sankaranarayanan. 2018. DuoRC: Towards preprint arXiv:2105.11447. complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Annual Meeting of the Association for Computa- Gardner, Christopher Clark, Kenton Lee, and Luke tional Linguistics (Volume 1: Long Papers), pages
1683–1693, Melbourne, Australia. Association for Krithara, Sergios Petridis, Dimitris Polychronopou- Computational Linguistics. los, Yannis Almirantis, John Pavlopoulos, Nico- las Baskiotis, Patrick Gallinari, Thierry Artieres, Timo Schick and Hinrich Schütze. 2021. It’s not just Axel Ngonga, Norman Heino, Eric Gaussier, Lil- size that matters: Small language models are also iana Barrio-Alvers, Michael Schroeder, Ion An- few-shot learners. In Proceedings of the 2021 Con- droutsopoulos, and Georgios Paliouras. 2015. An ference of the North American Chapter of the Asso- overview of the bioasq large-scale biomedical se- ciation for Computational Linguistics: Human Lan- mantic indexing and question answering competi- guage Technologies, pages 2339–2352, Online. As- tion. BMC Bioinformatics, 16:138. sociation for Computational Linguistics. Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew Mc- Minjoon Seo, Tom Kwiatkowski, Ankur Parikh, Ali Callum. 2018. Probabilistic embedding of knowl- Farhadi, and Hannaneh Hajishirzi. 2018. Phrase- edge graphs with box lattice measures. In Proceed- indexed question answering: A new challenge for ings of the 56th Annual Meeting of the Association scalable document comprehension. In Proceedings for Computational Linguistics (Volume 1: Long Pa- of the 2018 Conference on Empirical Methods in pers), pages 263–272, Melbourne, Australia. Asso- Natural Language Processing, pages 559–564, Brus- ciation for Computational Linguistics. sels, Belgium. Association for Computational Lin- guistics. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur ric Cistac, Tim Rault, Remi Louf, Morgan Funtow- Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. icz, Joe Davison, Sam Shleifer, Patrick von Platen, Real-time open-domain question answering with Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, dense-sparse phrase index. In Proceedings of the Teven Le Scao, Sylvain Gugger, Mariama Drame, 57th Annual Meeting of the Association for Com- Quentin Lhoest, and Alexander Rush. 2020. Trans- putational Linguistics, pages 4430–4441, Florence, formers: State-of-the-art natural language process- Italy. Association for Computational Linguistics. ing. In Proceedings of the 2020 Conference on Em- pirical Methods in Natural Language Processing: Richard Socher, Alex Perelygin, Jean Wu, Jason System Demonstrations, pages 38–45, Online. Asso- Chuang, Christopher D. Manning, Andrew Ng, and ciation for Computational Linguistics. Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment tree- Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and bank. In Proceedings of the 2013 Conference on Quoc V. Le. 2020. Self-training with noisy student Empirical Methods in Natural Language Processing, improves imagenet classification. In Proceedings of pages 1631–1642, Seattle, Washington, USA. Asso- the IEEE/CVF Conference on Computer Vision and ciation for Computational Linguistics. Pattern Recognition (CVPR). Yi Yang and Arzoo Katiyar. 2020. Simple and effective Nandan Thakur, Nils Reimers, Johannes Daxen- few-shot named entity recognition with structured berger, and Iryna Gurevych. 2021. Augmented nearest neighbor learning. In Proceedings of the SBERT: Data augmentation method for improving 2020 Conference on Empirical Methods in Natural bi-encoders for pairwise sentence scoring tasks. 
In Language Processing (EMNLP), pages 6365–6375, Proceedings of the 2021 Conference of the North Online. Association for Computational Linguistics. American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, pages 296–310, Online. Association for Computa- William Cohen, Ruslan Salakhutdinov, and Christo- tional Linguistics. pher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answer- Erik F. Tjong Kim Sang and Fien De Meulder. ing. In Proceedings of the 2018 Conference on Em- 2003. Introduction to the CoNLL-2003 shared task: pirical Methods in Natural Language Processing, Language-independent named entity recognition. In pages 2369–2380, Brussels, Belgium. Association Proceedings of the Seventh Conference on Natu- for Computational Linguistics. ral Language Learning at HLT-NAACL 2003, pages 142–147. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Adam Trischler, Tong Wang, Xingdi Yuan, Justin Har- Evaluating text generation with bert. In Interna- ris, Alessandro Sordoni, Philip Bachman, and Ka- tional Conference on Learning Representations. heer Suleman. 2017. NewsQA: A machine compre- hension dataset. In Proceedings of the 2nd Work- Yuan Zhang, Jason Baldridge, and Luheng He. 2019. shop on Representation Learning for NLP, pages PAWS: Paraphrase adversaries from word scram- 191–200, Vancouver, Canada. Association for Com- bling. In Proceedings of the 2019 Conference of putational Linguistics. the North American Chapter of the Association for Computational Linguistics: Human Language Tech- George Tsatsaronis, Georgios Balikas, Prodromos nologies, Volume 1 (Long and Short Papers), pages Malakasiotis, Ioannis Partalas, Matthias Zschunke, 1298–1308, Minneapolis, Minnesota. Association Michael R Alvers, Dirk Weissenborn, Anastasia for Computational Linguistics.