Question Answering Infused Pre-training of General-Purpose Contextualized Representations
Robin Jia, Mike Lewis, Luke Zettlemoyer
Facebook AI Research
{robinjia,mikelewis,lsz}@fb.com

Abstract

This paper proposes a pre-training objective based on question answering (QA) for learning general-purpose contextual representations, motivated by the intuition that the representation of a phrase in a passage should encode all questions that the phrase can answer in context. We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model on 80 million synthesized QA pairs. By encoding QA-relevant information, the bi-encoder's token-level representations are useful for non-QA downstream tasks without extensive (or in some cases, any) fine-tuning. We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection on four datasets, few-shot named entity recognition on two datasets, and zero-shot sentiment analysis on three datasets.

[Figure 1: An overview of Question Answering Infused Pre-training. Our model independently creates vector representations (middle) for phrases in a passage (top) and for synthesized questions (bottom). Our objective encourages the vector for each phrase to have high similarity with the vectors for all questions it answers. The example passage reads "The Violin Concerto in D major, Op. 77, was composed by Johannes Brahms in 1878 and dedicated to his friend, the violinist Joseph Joachim."; the example questions are "What did Brahms write in 1878?", "Who wrote the violin concerto?", "Who played the violin?", "What was dedicated to Joachim?", and "Who was friends with Joachim?"]

1 Introduction

Although masked language models build contextualized word representations, they are pre-trained with losses that minimize distance to uncontextualized word embeddings (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019). In this paper, we introduce a new pre-training loss based on question answering that depends much more directly on context, and learns improved token-level representations for a range of zero- and few-shot tasks.

More specifically, we propose Question Answering Infused Pre-training (QuIP), a technique for training highly context dependent representations. Our intuition is that the representation for a phrase should contain enough information to identify all the questions that the phrase could answer in context. For example, in Figure 1, the representation for Johannes Brahms should be similar to an encoding of all questions it can answer, such as "Who wrote the violin concerto?"—thereby capturing the phrase's role in the sentence. A further advantage is that these more contextualized representations allow for improved zero- and few-shot learning for many token-level tasks, especially when the tasks can be posed as question answering (Levy et al., 2017; McCann et al., 2018). Our approach builds on previous work using question-answer pairs as a meaning representation (He et al., 2015; Michael et al., 2018), and more generally from the tradition of using question answering ability as a proxy for language understanding (Lehnert, 1977; Hirschman et al., 1999; Peñas et al., 2013; Richardson et al., 2013).
QuIP learns contextual representations with a bi-encoder extractive QA objective. Our bi-encoder model must independently encode passages and questions such that the representation of each phrase in a passage is similar to the representations of all reading comprehension questions that can be answered with that phrase. To train such a model, we use a question generation model to synthesize 80 million QA examples, then train the bi-encoder to match the predictions of a standard cross-encoder QA model, which processes the passage and question together, on these examples. Bi-encoder QA has been used for efficient open-domain QA via phrase retrieval (Seo et al., 2018, 2019; Lee et al., 2020, 2021), but its lower accuracy compared to cross-encoder QA has previously been viewed as a drawback. We instead identify the relative weakness of bi-encoder QA as an opportunity to improve contextual representations via knowledge distillation, as self-training can be effective when the student model is forced to solve a harder problem than the teacher (Xie et al., 2020). In fact, using a bi-encoder student model is critical for QuIP: a cross-encoder trained in a similar way does not learn contextual representations that are as useful for later few-shot learning, despite having higher QA accuracy.

We show that QuIP token-level representations are useful in a variety of zero-shot and few-shot learning settings, both because the representations directly encode useful contextual information, and because we can often reduce downstream tasks to QA. For few-shot paraphrase detection, QuIP with BERTScore-based features (Zhang et al., 2020) outperforms prior work by 9 F1 points across four datasets. For few-shot named entity recognition (NER), QuIP combined with an initialization scheme that uses question embeddings improves over RoBERTa-large by 14 F1 across two datasets. Finally, for zero-shot sentiment analysis, QuIP with question prompts improves over RoBERTa-large with MLM-style prompts by 5 accuracy points across three datasets, and extracts interpretable rationales as a side effect. Through ablations, we show that using real questions, a strong teacher model, and the bi-encoder architecture are all crucial to the success of QuIP; on the other hand, many other design decisions (e.g., question generation decoding strategies) do not qualitatively affect our main findings, pointing to the stability of the QuIP approach.

2 QA Infused Pre-training

We introduce QA Infused Pre-training (QuIP), in which we pre-train contextual representations with a bi-encoder extractive QA objective. As shown in Figure 1, a bi-encoder model encodes a passage and a question independently, and predicts the answer to be the phrase in the passage whose learned representation is most similar to that of the question. Since the model does not know the question when encoding the passage, its representation for each phrase must be similar to those of all questions that can be answered with that phrase. In contrast, a cross-encoder model receives the concatenation of the passage and question as a single input; these models can use many layers of self-attention to extract an answer, and cannot be interpreted as encoding all possible questions.

In the rest of this section, we introduce some notation (§2.1), then describe our QuIP pipeline, which consists of three steps: question generation (§2.2), cross-encoder teacher re-labeling (§2.3), and bi-encoder training (§2.4).

2.1 Notation

All of our models operate on sequences of tokens x = [x_1, ..., x_L] of length L. By convention, we assume that x_1 is always the special beginning-of-sequence token. Our goal is to learn an encoder r that maps inputs x to outputs r(x) = [r(x)_1, ..., r(x)_L] where each r(x)_i ∈ R^d for some fixed dimension d. We call r(x)_i the contextual representation of the i-th token in x.

In extractive question answering, a model is given a context passage c and question q, and must output a span of c that answers the question. Typically, models approach this by independently predicting probability distributions p(a_start | c, q) and p(a_end | c, q) over the answer start index a_start and end index a_end.

2.2 Question Generation

Question generation model. We train a BART-large model (Lewis et al., 2020) to generate question-answer pairs given context passages. The model receives the passage as context and must generate the answer text, then a special separator token, then the question; this approach is simpler than prior approaches that use separate models for answer and question generation (Lewis and Fan, 2019; Alberti et al., 2019; Puri et al., 2020), and works well in practice.

Training data. We train this model on the training data from the MRQA 2019 Shared Task (Fisch et al., 2019), which combines data from six QA datasets: HotpotQA (Yang et al., 2018), NaturalQuestions (Kwiatkowski et al., 2019), NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), SQuAD (Rajpurkar et al., 2016), and TriviaQA (Joshi et al., 2017). Together, these datasets cover many of the text sources commonly used for pre-training (Liu et al., 2019; Lewis et al., 2020), namely Wikipedia (HotpotQA, NaturalQuestions, SQuAD), news articles (NewsQA), and general web text (SearchQA, TriviaQA).
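To make the joint decoding format concrete, the following sketch shows one way such answer-then-question generation could be implemented with a fine-tuned BART-large model from the transformers library. This is an illustrative sketch rather than the authors' released code: the checkpoint name, separator string, and decoding settings are assumptions (the top-p value follows the nucleus-sampling choice described under "Generating questions" below).

```python
# Illustrative sketch (not the paper's released code): sample "answer [SEP] question"
# sequences from a BART-style seq2seq model conditioned on a passage.
from transformers import BartForConditionalGeneration, BartTokenizer

SEP = "[SEP]"  # assumed separator between the generated answer and question

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
# In practice this would be the BART-large checkpoint fine-tuned on the MRQA data.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def generate_qa_pairs(passage, num_questions=10):
    inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        do_sample=True,        # nucleus sampling, as in Section 2.2
        top_p=0.6,
        max_length=64,
        num_return_sequences=num_questions,
    )
    pairs = []
    for seq in outputs:
        text = tokenizer.decode(seq, skip_special_tokens=True)
        if SEP in text:  # keep only well-formed "answer [SEP] question" outputs
            answer, question = text.split(SEP, 1)
            pairs.append((answer.strip(), question.strip()))
    return pairs
```

Malformed outputs (e.g., generated answers that are not spans of the passage) are handled later by the teacher re-labeling step of §2.3.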
Generating questions. We run our question generation model over a large set of passages to generate a large dataset of question-answer pairs. We decode using nucleus sampling (Holtzman et al., 2020) with p = 0.6, which was chosen by manual inspection to balance diversity with quality of generated questions. We also experimented with other decoding strategies, such as a two-stage beam search, but found these made little difference in the final results (see §4.8). We obtain passages from the same training corpus as RoBERTa (Liu et al., 2019), which uses four sub-domains: BookCorpus plus Wikipedia, CC-News, OpenWebText, and Stories. For each domain, we sample 2 million passages and generate 10 questions per passage, for a total of 80 million questions.[1]

[1] We estimate that using the entire corpus with these settings would generate around 900 million questions. We leave investigation of further scaling to future work.

2.3 Teacher Re-labeling

The answers generated by our BART model are not always accurate, and are also not always spans in the context passage. To improve the training signal, we generate labels using a teacher model, as is common in knowledge distillation (Hinton et al., 2015). We use a standard cross-encoder RoBERTa-large model trained on the MRQA training data as our teacher model. The model takes in the concatenation of the context passage c and question q and predicts a_start and a_end with two independent 2-layer multi-layer perceptron (MLP) heads. Let T_start(c, q) and T_end(c, q) denote the teacher's predicted probability distributions over a_start and a_end, respectively.

2.4 Bi-encoder Training

Finally, we train a bi-encoder model to match the cross-encoder predictions on the generated questions. This objective encourages the contextual representation for a token in a passage to have high similarity (in inner product space) with the representation of every question that is answered by that token.

Model. The bi-encoder model with parameters θ consists of three components: an encoder r and two question embedding heads h_start and h_end that map R^d → R^d. These heads will only be applied to beginning-of-sequence (i.e., CLS) representations; as shorthand, define f_start(x) = h_start(r(x)_1) and likewise for f_end. Given a context passage c and question q, the model predicts

  p_θ(a_start = i | c, q) ∝ exp(r(c)_iᵀ f_start(q))    (1)
  p_θ(a_end = i | c, q) ∝ exp(r(c)_iᵀ f_end(q)).    (2)

In other words, the model independently encodes the passage and question with r, applies the start and end heads to the CLS token embedding for q, then predicts the answer start (end) index with a softmax over the dot product between the passage representation at that index and the output of the start (end) head.

We initialize r to be the pre-trained RoBERTa-large model (Liu et al., 2019), which uses d = 1024. h_start and h_end are randomly-initialized 2-layer MLPs with hidden dimension 1024, matching the default initialization of classification heads in RoBERTa.[2]

[2] https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/model.py

Training. For an input consisting of context c of length L and question q, we train θ to minimize the KL-divergence between the student and teacher predictions, which is equivalent to the objective

  − Σ_{i=1}^{L} [ T_start(c, q)_i log p_θ(a_start = i | c, q) + T_end(c, q)_i log p_θ(a_end = i | c, q) ]    (3)

up to constants that do not depend on θ. We train for two epochs on the generated corpus of 80 million questions, which takes roughly 56 hours on 8 V100 GPUs, or roughly 19 GPU-days; this is very fast compared to pre-training RoBERTa-large from scratch, which took roughly 5000 GPU-days (Liu et al., 2019). For efficiency, we process all questions for the same passage in the same batch, as encoding passages dominates runtime. For further details, see Appendix A.1.
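A minimal PyTorch sketch of the bi-encoder scoring rule in Equations (1)-(2) and the distillation objective in Equation (3), assuming a RoBERTa-large encoder from the transformers library. The module structure and head definitions are illustrative (masking, batching of questions per passage, and optimizer details are omitted); it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaModel

class BiEncoderQA(nn.Module):
    """Sketch of the QuIP student: shared encoder r plus start/end question heads."""
    def __init__(self, d=1024):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-large")  # r
        # Randomly initialized 2-layer MLP heads applied to the question CLS vector.
        self.h_start = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d))
        self.h_end = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d))

    def forward(self, passage_ids, question_ids):
        r_c = self.encoder(passage_ids).last_hidden_state         # (B, L, d): r(c)_i
        r_q = self.encoder(question_ids).last_hidden_state[:, 0]  # (B, d): CLS of q
        f_start, f_end = self.h_start(r_q), self.h_end(r_q)
        # Eq. (1)-(2): dot products between passage token reps and question heads.
        start_logits = torch.einsum("bld,bd->bl", r_c, f_start)
        end_logits = torch.einsum("bld,bd->bl", r_c, f_end)
        return start_logits, end_logits

def distillation_loss(start_logits, end_logits, t_start, t_end):
    """Eq. (3): cross-entropy against the teacher's soft start/end distributions,
    where t_start and t_end are the cross-encoder teacher's probabilities, shape (B, L)."""
    loss_start = -(t_start * F.log_softmax(start_logits, dim=-1)).sum(-1)
    loss_end = -(t_end * F.log_softmax(end_logits, dim=-1)).sum(-1)
    return (loss_start + loss_end).mean()
```

In the paper's setup, the teacher distributions come from the cross-encoder of §2.3, and all questions about the same passage are placed in the same batch, since encoding passages dominates runtime.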
3 Downstream Tasks

We evaluate QuIP on zero-shot paraphrase ranking, few-shot paraphrase classification, few-shot NER, and zero-shot sentiment analysis. We leverage the improved token-level representations afforded by QuIP, and in many cases also directly use QuIP's question-answering abilities.

3.1 Paraphrase Ranking

We first evaluate QuIP token-level representations by measuring their usefulness for zero-shot paraphrase ranking. In this task, systems must rank sentence pairs that are paraphrases above pairs that are non-paraphrases, without any task-specific training data. We compute similarity scores using the F_BERT variant of BERTScore (Zhang et al., 2020), which measures cosine similarities between the representation of each token in one sentence and its most similar token in the other sentence. Given sentences x¹ and x² of lengths L_1 and L_2, define

  B_1 = (1/L_1) Σ_{i=1}^{L_1} max_{1≤j≤L_2} cos-sim(r(x¹)_i, r(x²)_j)
  B_2 = (1/L_2) Σ_{i=1}^{L_2} max_{1≤j≤L_1} cos-sim(r(x²)_i, r(x¹)_j),

where cos-sim(u, v) = uᵀv / (‖u‖ ‖v‖) is the cosine similarity between vectors u and v. The F_BERT BERTScore is defined as the harmonic mean of B_1 and B_2. Zhang et al. (2020) showed that BERTScore with RoBERTa representations is useful for both natural language generation evaluation and paraphrase ranking. Since BERTScore uses token-level representations, we hypothesize that it should pair well with QuIP. Following Zhang et al. (2020), we use representations from the layer of the network that maximizes Pearson correlation between BERTScore and human judgments on the WMT16 metrics shared task (Bojar et al., 2016).

3.2 Paraphrase Classification

We next use either frozen or fine-tuned QuIP representations for few-shot paraphrase classification, rather than ranking. Through these experiments, we can compare QuIP with existing work on few-shot paraphrase classification.

Frozen model. We train a logistic regression model that uses BERTScore with frozen representations as features. For a given pair of sentences, we extract eight features, corresponding to BERTScore computed with the final eight layers (i.e., layers 17-24) of the network. These layers encompass the optimal layers for both RoBERTa-large and QuIP (see §4.4). Freezing the encoder is often useful in practice, particularly for large models, as the same model can be reused for many tasks (Brown et al., 2020; Du et al., 2020).

Fine-tuning. For fine-tuning, we use the same computation graph and logistic loss function, but now backpropagate through the parameters of our encoder. For details, see Appendix A.2.
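The sketch below illustrates the F_BERT computation from §3.1 and how the eight per-layer scores described under "Frozen model" could be turned into features for a logistic regression classifier. The function names, the use of scikit-learn, and the handling of hidden states are assumptions for illustration, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression

def f_bert(reps1, reps2):
    """F_BERT from Section 3.1: harmonic mean of the greedy-matching scores B1 and B2.
    reps1: (L1, d) token representations of sentence x1; reps2: (L2, d) for x2."""
    sim = F.normalize(reps1, dim=-1) @ F.normalize(reps2, dim=-1).T  # (L1, L2) cosine sims
    b1 = sim.max(dim=1).values.mean()  # each token of x1 matched to its best token of x2
    b2 = sim.max(dim=0).values.mean()  # and vice versa
    return (2 * b1 * b2 / (b1 + b2)).item()

def pair_features(layer_reps1, layer_reps2, layers=range(17, 25)):
    """One F_BERT feature per layer (17-24 of a 24-layer encoder), as in Section 3.2.
    layer_reps* map a layer index to that sentence's (L, d) token representations,
    e.g. taken from a frozen encoder run with output_hidden_states=True."""
    return [f_bert(layer_reps1[l], layer_reps2[l]) for l in layers]

# Frozen-encoder few-shot paraphrase classification:
#   X = [pair_features(...) for each sentence pair], y = paraphrase labels
#   clf = LogisticRegression().fit(X, y)
```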
3.3 Named Entity Recognition

We also use QuIP for few-shot[3] named entity recognition, which we frame as a BIO tagging task. We add a linear layer that takes in token-level representations and predicts the tag for each token, and backpropagate log loss through the entire network. By default, the output layer is initialized randomly.

As a refinement, we propose using question prompts to initialize this model. The output layer is parameterized by a T × d matrix M, where T is the number of distinct BIO tags. The log-probability of predicting the j-th tag for token i is proportional to the dot product between the representation for token i and the j-th row of M; this resembles how the bi-encoder predicts answers. Thus, we initialize each row of M with the start head embedding of a question related to that row's corresponding entity tag. For instance, we initialize the parameters for the B-location and I-location tags with the embedding for "What is a location ?" We normalize the question embeddings to have unit L2 norm. This style of initialization is uniquely enabled by our use of question answering as an interface to the model; it would be unclear how to use a language model similarly.

[3] Yang and Katiyar (2020) study few-shot NER assuming data from other NER datasets is available; we assume no such data is available, matching Huang et al. (2020).

3.4 Zero-shot Sentiment Analysis

Finally, we use QuIP for zero-shot binary sentiment analysis. We reduce sentiment analysis to QA by writing a pair of questions that ask for a reason why an item is good or bad (e.g., "Why is this movie [good/bad]?"). We predict the label whose corresponding question has higher similarity with the QuIP representation of some token in the input. This prompting strategy has the additional benefit of extracting rationales, namely the span that the QuIP model predicts as the answer to the question.

More formally, let x be an input sentence and (q_0, q_1) be a pair of questions (i.e., a prompt). For label y ∈ {0, 1}, we compute a score for y as

  S(x, y) = max_i r(x)_iᵀ f_start(q_y) + max_i r(x)_iᵀ f_end(q_y).    (4)

Essentially, S(x, y) measures the extent to which some span in x looks like the answer to the question q_y. We predict whichever y has the higher value of S(x, y) − C_y, where C_y is a calibration constant that offsets the model's bias towards answering q_0 or q_1. Our inclusion of C_y is inspired by Zhao et al. (2021), who recommend calibrating zero-shot and few-shot models with a baseline derived from content-free inputs to account for biases towards a particular label. To choose C_y, we obtain a list W of the ten most frequent English words, all of which convey no sentiment, and define C_y as the mean over w ∈ W of S(w, y), i.e., the score when using w as the input sentence (see Appendix A.4).
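A sketch of the zero-shot scoring rule in Equation (4) together with the calibration constants C_y, reusing the BiEncoderQA module sketched in §2.4 above. The prompt wording and the use of frequent sentiment-free words for calibration follow the paper, but the function signatures, the label ordering, and the encode helper are illustrative assumptions.

```python
import torch

def span_score(model, sent_ids, question_ids):
    """S(x, y) from Eq. (4): best start-match plus best end-match for question q_y."""
    start_logits, end_logits = model(sent_ids, question_ids)
    return start_logits.max(dim=-1).values + end_logits.max(dim=-1).values

def zero_shot_sentiment(model, encode, sentence, prompts, calib_words):
    """prompts = (q0, q1), e.g. ("Why is this movie bad?", "Why is this movie good?");
    calib_words is a list of frequent, sentiment-free words used as content-free inputs."""
    q_ids = [encode(q) for q in prompts]
    # Calibration constants C_y: mean score of each question over the content-free inputs.
    C = [torch.stack([span_score(model, encode(w), q) for w in calib_words]).mean()
         for q in q_ids]
    x_ids = encode(sentence)
    scores = [span_score(model, x_ids, q_ids[y]) - C[y] for y in (0, 1)]
    return int(scores[1] > scores[0])  # assumed convention: 1 = positive, 0 = negative
```

The same start and end logits can also be decoded into the rationale span mentioned above.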
4 Experiments

4.1 Experimental details

Datasets. For paraphrasing, we use four datasets: QQP (Iyer et al., 2017), MRPC (Dolan and Brockett, 2005), PAWS-Wiki, and PAWS-QQP (Zhang et al., 2019). The PAWS datasets are especially of interest as they are designed to be challenging for bag-of-words models, and thus can test whether our representations are truly contextual or mostly lexical in nature. For QQP and MRPC, we use the few-shot splits from Gao et al. (2021) that include 16 examples per class; for the PAWS datasets, we create new few-shot splits in the same manner. We report results on the development sets of QQP and MRPC (as test labels were not available), the test set of PAWS-Wiki, and the "dev-and-test" set of PAWS-QQP. For NER, we use two datasets: CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and WNUT-17 (Derczynski et al., 2017). We use the few-shot splits from Huang et al. (2020) that include 5 examples per entity type. All few-shot experiments report an average over five random splits and seeds, following both Gao et al. (2021) and Huang et al. (2020). For sentiment analysis, we use two movie review datasets, SST-2 (Socher et al., 2013) and Movie Reviews (MR; Pang and Lee, 2005), as well as the Customer Reviews (CR) dataset (Hu and Liu, 2004). We evaluate on the SST-2 development set and the MR and CR test sets made by Gao et al. (2021).

Hyperparameter and prompt selection. Due to the nature of zero-shot and few-shot experiments, we minimize the extent to which we tune hyperparameters, relying on existing defaults and previously published hyperparameters. For few-shot paraphrase classification, NER, and sentiment analysis, we developed our final method only using QQP, CoNLL, and SST-2, respectively, and directly applied it to the other datasets with no further tuning. We did measure zero-shot paraphrase ranking accuracy on all datasets during development of QuIP. For more details, see Appendix A.3.

For NER, we used the first question prompts we wrote for both CoNLL and WNUT, which all follow the same format, "Who/What is a/an [entity type] ?" (see Appendix A.6 for all prompts). For sentiment analysis, we wrote six prompts (shown in Appendix A.8) and report mean accuracy over these prompts, to avoid pitfalls associated with prompt tuning (Perez et al., 2021). We use the same prompts for SST-2 and MR; for CR, the only change we make is replacing occurrences of the word "movie" with "product" to reflect the change in domain between these datasets.

4.2 Baselines and Ablations

To confirm the importance of all three stages of our pre-training pipeline, we compare with a number of baselines and ablations.

No question generation. We train the bi-encoder model directly on the MRQA training data ("Bi-encoder + MRQA"). We also include the cross-encoder teacher model trained on MRQA as a baseline ("Cross-encoder + MRQA"). These settings mirror standard intermediate task training (Phang et al., 2018; Pruksachatkun et al., 2020).

No teacher. We train the bi-encoder using the answer supervision generated by the question generation model ("QuIP, no teacher"). Occasionally, the generated answer is not a span in the corresponding passage; we consider these questions unanswerable and treat the span containing the CLS token as the "correct" answer, as in Devlin et al. (2019).

Cross-encoder self-training. To test whether the bottleneck imposed by the bi-encoder architecture is crucial for the success of QuIP, we also train a cross-encoder model on the same generated data ("QuIP, cross-encoder student"). Since this student model has the same architecture as the teacher model, we train the student to maximize the likelihood of the teacher's argmax predictions, a standard self-training objective (Lee, 2013; Kumar et al., 2020). Training the cross-encoder is considerably less efficient than training the bi-encoder, since batching questions about the same passage together does not speed up training, so we train it for a comparable number of total GPU-hours (60 hours on 8 V100 GPUs).

Unsupervised QA. We test whether QuIP relies on having real QA examples, or if it is sufficient to use any task that forces word representations to encode information about surrounding words. To do this, we train on data generated by the Unsupervised QA pipeline of Lewis et al. (2019). This pipeline creates pseudo-questions by masking and applying noise to sentences in a passage. Lewis et al. (2019) showed that training on such data can yield non-trivial accuracy on SQuAD without any reliance on real QA data. We generate 10 pseudo-questions per passage in our training corpus, yielding another dataset of 80 million examples, and train our bi-encoder on this data ("Bi-encoder + UnsupervisedQA").

4.3 Bi-encoder Question Answering

While not the main focus of this paper, we first check that QuIP improves bi-encoder QA accuracy. Table 1 reports results on the SQuAD development set. QuIP not only improves over Lee et al. (2021) by 5.4 F1 on SQuAD, but also surpasses the reported human accuracy of 91.2 F1 on the SQuAD test set, as well as the best cross-encoder BERT-large single model from Devlin et al. (2019). QuIP greatly improves over baselines that directly train on the MRQA data or do not use the teacher model. As expected, our cross-encoder models are even more accurate at QA; the next sections will show that their representations are less useful for non-QA tasks than the QuIP bi-encoder representations. Results on all MRQA development datasets are provided in Appendix A.5.

Model | EM | F1
Lee et al. (2021) | 78.3 | 86.3
Bi-encoder + UnsupervisedQA | 17.4 | 24.9
Bi-encoder + MRQA | 70.7 | 79.4
QuIP, no teacher | 75.3 | 84.7
QuIP | 85.2 | 91.7
BERT-large cross-encoder | 84.2 | 91.1
Cross-encoder + MRQA | 88.8 | 94.7
QuIP, cross-encoder student | 89.5 | 94.8

Table 1: EM and F1 scores on the SQuAD development set for bi-encoder (top) and cross-encoder (bottom) models. The QuIP bi-encoder outperforms the other bi-encoder model baselines, and even BERT-large trained as a cross-encoder. The RoBERTa cross-encoder models are more accurate than the bi-encoder at QA, but we show later that the bi-encoder model representations are more useful for non-QA tasks.
4.4 Zero-shot Paraphrase Ranking

The first half of Table 2 shows WMT development set Pearson correlations averaged across six to-English datasets, as in Zhang et al. (2020), along with the best layer for each model. QuIP reaches its optimal score at a later layer (20) than RoBERTa-large (17), which may suggest that the QuIP training objective is more closely aligned with the goal of learning better representations than MLM.

The rest of Table 2 shows zero-shot paraphrase ranking results using BERTScore. QuIP improves substantially over RoBERTa on all four datasets, with an average improvement of .076 AUROC. The improvement is greatest on the PAWS datasets; since these datasets are supposed to challenge models that rely heavily on lexical rather than contextual cues, we infer that QuIP representations are much more contextualized than RoBERTa representations. Training on Unsupervised QA data degrades performance compared to RoBERTa, indicating that the QuIP objective accomplishes much more than merely forcing word representations to encode local context in a simple way. Training the bi-encoder directly on the MRQA dataset or without the teacher improves on average over RoBERTa, but QuIP greatly outperforms both baselines across the board. The cross-encoder models also lag behind QuIP at paraphrase ranking, despite their higher QA accuracy. Thus, we conclude that having real questions, accurate answer supervision, and a bi-encoder student model are all crucial to the success of QuIP.

Model | WMT r | WMT Best Layer | QQP | MRPC | PAWS-Wiki | PAWS-QQP
RoBERTa-large | .739 | 17 | .763 | .831 | .698 | .690
Cross-encoder + MRQA | .744 | 16 | .767 | .840 | .742 | .731
QuIP, cross-encoder student | .753 | 16 | .769 | .847 | .751 | .706
Bi-encoder + UnsupervisedQA | .654 | 11 | .747 | .801 | .649 | .580
Bi-encoder + MRQA | .749 | 15 | .771 | .807 | .747 | .725
QuIP, no teacher | .726 | 19 | .767 | .831 | .780 | .709
QuIP | .764 | 20 | .809 | .849 | .830 | .796

Table 2: Pearson correlation on WMT development data, best layer chosen based on WMT results, and AUROC on zero-shot paraphrase ranking using BERTScore. QuIP outperforms all baselines on all datasets.

4.5 Paraphrase Classification

Table 3 shows few-shot paraphrase classification results. We compare with LM-BFF (Gao et al., 2021), which pairs RoBERTa-large with MLM-style prompts to do few-shot learning. We use the setting with fixed manually written prompts and demonstrations, which was their best method on QQP by 2.1 F1 and was 0.3 F1 worse than their best method on MRPC. QuIP used as a frozen encoder is competitive with LM-BFF on QQP and outperforms it by 6.1 F1 on MRPC, 11.2 F1 on PAWS-Wiki, and 12.1 F1 on PAWS-QQP. Fine-tuning gives additional improvements on three of the four datasets, though it worsens accuracy on PAWS-QQP. Fine-tuning with QuIP also outperforms using RoBERTa in the same way by an average of 6.9 F1.

Model | Fine-tuned? | QQP | MRPC | PAWS-Wiki | PAWS-QQP
LM-BFF (reported) | Fine-tuned | 69.8 (0.8) | 77.8 (0.9) | - | -
LM-BFF (rerun) | Fine-tuned | 67.1 (0.9) | 76.5 (1.5) | 60.7 (0.7) | 50.1 (2.8)
RoBERTa-large | Frozen | 64.4 (0.4) | 80.6 (0.7) | 62.3 (0.9) | 50.6 (0.4)
QuIP | Frozen | 68.9 (0.2) | 82.6 (0.4) | 71.9 (0.5) | 63.0 (1.2)
RoBERTa-large | Fine-tuned | 64.9 (0.7) | 84.4 (0.3) | 65.7 (0.3) | 50.9 (0.8)
QuIP | Fine-tuned | 71.0 (0.3) | 86.6 (0.4) | 75.1 (0.2) | 60.9 (1.0)

Table 3: F1 scores on few-shot paraphrase classification, averaged across five training splits (standard errors in parentheses). QuIP outperforms prior work on few-shot paraphrase classification (LM-BFF; Gao et al., 2021) as well as our own baselines using RoBERTa.

4.6 Named Entity Recognition

Table 4 shows few-shot NER results on the CoNLL and WNUT datasets. QuIP improves over RoBERTa-large by 11 F1 on CoNLL and 2.9 F1 on WNUT when used with a randomly initialized output layer. We see a further improvement of 4 F1 on CoNLL and 7.4 F1 on WNUT when using question embeddings to initialize the output layer. Using the cross-encoder trained directly on QA data is roughly as good as QuIP when using randomly initialized output layers, but it is incompatible with question embedding initialization (as it does not produce question embeddings).

We also measured NER results when training on the full training datasets, but found minimal evidence that QuIP improves over RoBERTa. As shown in Appendix A.7, QuIP gives a 0.6 F1 improvement on WNUT but effectively no gain on CoNLL. With extensive fine-tuning, RoBERTa can learn to extract useful token-level features, but it lags behind QuIP in the few-shot regime.

Model | CoNLL | WNUT
Huang et al. (2020) | 65.4 | 37.6
Standard init.:
RoBERTa-large | 59.0 (2.4) | 39.3 (0.6)
Cross-encoder + MRQA | 68.9 (3.3) | 43.0 (0.9)
QuIP, cross-encoder student | 63.4 (3.3) | 39.4 (1.7)
Bi-encoder + UnsupervisedQA | 58.2 (2.6) | 26.0 (1.0)
Bi-encoder + MRQA | 66.4 (3.3) | 42.2 (0.4)
QuIP, no teacher | 67.7 (1.9) | 40.7 (1.4)
QuIP | 70.0 (2.4) | 42.2 (0.5)
Question prompt init.:
Bi-encoder + UnsupervisedQA | 62.7 (3.3) | 30.4 (0.8)
Bi-encoder + MRQA | 72.0 (2.8) | 44.0 (1.3)
QuIP, no teacher | 71.4 (3.0) | 47.8 (1.1)
QuIP | 74.0 (2.4) | 49.6 (0.5)

Table 4: F1 scores on few-shot NER, averaged over five training splits (standard errors in parentheses). QuIP with question prompts performs best on both datasets.
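The following sketch illustrates the question-prompt initialization behind the "Question prompt init." rows of Table 4 (described in §3.3): each row of the T × d tag matrix is set to the L2-normalized start-head embedding of a question describing that tag. It reuses the BiEncoderQA sketch from §2.4; the tag-to-question mapping shown, the treatment of the O tag, and the random-initialization scale are assumptions.

```python
import torch
import torch.nn.functional as F

def init_tag_matrix(model, encode, tag_questions, d=1024):
    """Build the T x d output matrix M for BIO tagging, one row per tag.

    tag_questions maps each BIO tag to a question prompt, e.g.
    {"B-LOC": "What is a location ?", "I-LOC": "What is a location ?",
     "B-PER": "Who is a person ?", ..., "O": None}; None rows keep a random init."""
    tags = list(tag_questions)
    M = torch.randn(len(tags), d) * 0.02  # default random initialization (assumed scale)
    with torch.no_grad():
        for row, tag in enumerate(tags):
            question = tag_questions[tag]
            if question is None:
                continue
            r_q = model.encoder(encode(question)).last_hidden_state[:, 0]  # question CLS
            f_start = model.h_start(r_q).squeeze(0)       # start-head question embedding
            M[row] = F.normalize(f_start, dim=-1)         # unit L2 norm, as in Section 3.3
    return M  # token i is tagged via a softmax over r(x)_i @ M.T
```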
4.7 Sentiment Analysis

Table 5 shows zero-shot accuracy on our three sentiment analysis datasets. We compare with zero-shot results for LM-BFF (Gao et al., 2021)[4] and reported zero-shot results from Zhao et al. (2021) using GPT-3 with Contextual Calibration (CC) on SST-2. QuIP using an average prompt outperforms zero-shot LM-BFF by 5.4 points, averaged across the three datasets. Choosing the best prompt on SST-2 and using that for all datasets improves results not only on SST-2 but also MR, and maintains average accuracy on CR. Using the cross-encoder student QA model with the same prompts leads to worse performance: we hypothesize that this is because the bi-encoder bottleneck encourages QuIP to make each span's representation dissimilar to those of questions that it does not answer, whereas the cross-encoder model trained only on answerable questions does not naturally handle unanswerable questions (e.g., "Why is it bad?" asked about a positive review).

[4] We tried applying our calibration strategy to LM-BFF as well, but found that it did not improve accuracy.

Model | SST-2 | MR | CR
CC + GPT-3 | 71.6 | - | -
LM-BFF | 83.6 | 80.8 | 79.5
QuIP (average) | 87.9 (0.6) | 81.9 (0.4) | 90.3 (0.2)
w/ cross-enc. student | 83.3 (0.4) | 78.5 (0.4) | 88.9 (0.3)
QuIP (tune on SST-2) | 89.6 | 83.1 | 90.4

Table 5: Zero-shot accuracy on sentiment analysis. Third and fourth rows show mean accuracy across six prompts (standard errors in parentheses). QuIP using an average prompt outperforms prior work; the final row shows additional benefits from using the best prompt on SST-2 for all datasets.

Table 6 shows rationales extracted from random SST-2 examples for which QuIP was correct with the best prompt for SST-2 ("What is the reason this movie is [good/bad]?"). To prefer shorter rationales, we extract the highest-scoring span of five BPE tokens or less. The model often identifies phrases that convey clear sentiment. Appendix A.9 shows many more full examples and rationales.

Label | Rationales
- | "too slim", "stale", "every idea", "wore out its welcome", "unpleasant viewing experience", "lifeless", "plot", "amateurishly assembled", "10 times their natural size", "wrong turn"
+ | "packed with information and impressions", "slash-and-hack", "tightly organized efficiency", "passion and talent", "best films", "surprises", "great summer fun", "play equally well", "convictions", "wickedly subversive bent"

Table 6: Rationales extracted by QuIP on ten random examples for each label from SST-2.

4.8 Stability Analysis

We experimented with some design decisions that did not materially affect our results. Here, we report these findings as evidence that our basic recipe is stable to many small changes. First, we concatenated the representations of all passages in the same batch and on the same GPU together (9 passages on average), and trained the model to extract answers from this larger pseudo-document; this effectively adds in-batch negative passages, as in Lee et al. (2021). Second, we trained the model to match the argmax prediction of the teacher, rather than its soft distribution over start and end indices. Finally, we used a two-stage beam search to generate questions. For a given passage, we generated 20 possible answers via beam search, chose 10 of these to maximize answer diversity, then generated one question for each answer with another beam search. Our goal was to ensure diversity by forcing questions to be about different answers, while also maintaining high question quality. As shown in Table 7, these choices have a relatively minor impact on the results (within .007 AUROC and 1 F1 on NER).

Model | SQuAD F1 | Paraphrase AUROC | NER F1
QuIP | 91.7 | .821 | 61.8
+ concat. passages | 91.7 | .818 | 62.7
w/ hard labels | 91.5 | .814 | 62.5
w/ 2-stage beam search | 91.7 | .821 | 62.8

Table 7: SQuAD development set F1, average zero-shot paraphrase ranking AUROC across all datasets, and average few-shot NER F1 using question prompts across both datasets for QuIP variants. Models shown here are all similarly effective.

5 Discussion and Related Work

In this work, we pre-trained token-level contextual representations that are useful for few-shot learning on downstream tasks. Our key idea was to leverage signal from extractive QA datasets to define what information should be encoded in passage representations. Our work builds on prior work studying question generation and answering, pre-training, and few-shot learning.

5.1 Question Generation

Neural question generation has been studied by a number of previous papers for different purposes (Du et al., 2017; Du and Cardie, 2018; Zhao et al., 2018; Lewis and Fan, 2019; Alberti et al., 2019; Puri et al., 2020). Recently, Lewis et al. (2021) use model-generated question-answer pairs for efficient open-domain QA. Bartolo et al. (2021) use generated questions to improve robustness to human-written adversarial questions. Our work focuses on how generated questions can help learn general-purpose representations. We also show that a relatively simple strategy of generating the answer and question together with a single seq-to-seq model can be effective; most prior work uses separate answer selection and question generation models.
5.2 Phrase-indexed Question Answering

Phrase-indexed question answering is a paradigm for open-domain question answering that retrieves answers by embedding questions and candidate answers in a shared embedding space (Seo et al., 2018, 2019; Lee et al., 2020). Phrase-indexed QA requires using a bi-encoder architecture, as we use in our work. Especially related is Lee et al. (2021), which also uses question generation and a cross-encoder teacher model to improve phrase-indexed QA. We focus not on open-domain QA, but on learning generally useful representations for non-QA downstream tasks.

5.3 Learning contextual representations

Pre-training on unlabeled data has been highly effective for learning contextual representations (Peters et al., 2018; Devlin et al., 2019), but recent work has shown further improvements by using labeled data. Intermediate task training (Phang et al., 2018) improves representations by training directly on large labeled datasets. Muppet (Aghajanyan et al., 2021) improves models by multi-task pre-finetuning on many labeled datasets. Most similar to our work, He et al. (2020) uses extractive question answering as a pre-training task for a BERT paragraph encoder. Our work leverages question generation and knowledge distillation to improve over directly training on labeled data, and focuses on zero-shot and few-shot applications.

Other work has used methods similar to ours to learn sentence embeddings. Reimers and Gurevych (2019) train sentence embeddings for sentence similarity tasks using natural language inference data. Thakur et al. (2021) train a sentence embedding bi-encoder to mimic the predictions of a cross-encoder model. We focus on learning token-level representations, rather than a single vector for a sentence, and thus use token-level supervision available in extractive QA.

While we focus on QA as a source of training signal, other work aims to improve downstream QA accuracy. Ram et al. (2021) propose a span extraction pre-training objective that enables few-shot QA. Khashabi et al. (2020) show that multi-task training on many QA datasets, both extractive and non-extractive, can improve QA performance. Future work could explore learning question-agnostic passage embeddings that are also useful for non-extractive QA tasks.

5.4 Few-shot Learning

Language models have recently been used for few-shot learning on a wide range of tasks (Brown et al., 2020; Schick and Schütze, 2021; Gao et al., 2021). These papers all use natural language prompts to convert NLP tasks into language modeling problems. We develop alternative methods for few-shot learning that use token-level representations and question-based prompts rather than language model prompts.

Looking forward, pre-trained representations could be even more useful in few-shot settings if they could inherently capture more complex relationships between words and sentences. For instance, box embeddings (Vilnis et al., 2018) naturally capture asymmetric relationships like set inclusion, so pre-training such embeddings may be helpful for tasks like few-shot entailment classification. We hope to see more work in the future on designing pre-training objectives that align better with downstream needs for few-shot learning.

Acknowledgements

We thank Terra Blevins for investigating applications to word sense disambiguation, and Douwe Kiela, Max Bartolo, Sebastian Riedel, Sewon Min, Patrick Lewis, and Scott Yih for their feedback on this project. We thank Jiaxin Huang for providing the few-shot NER splits used in their paper.

References

Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. 2021. Muppet: Massive multi-task representations with pre-finetuning. arXiv preprint arXiv:2101.11038.

Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA cor-
pora generation with roundtrip consistency. In Pro- Xinya Du and Claire Cardie. 2018. Harvest- ceedings of the 57th Annual Meeting of the Asso- ing paragraph-level question-answer pairs from ciation for Computational Linguistics, pages 6168– Wikipedia. In Proceedings of the 56th Annual Meet- 6173, Florence, Italy. Association for Computa- ing of the Association for Computational Linguistics tional Linguistics. (Volume 1: Long Papers), pages 1907–1917, Mel- bourne, Australia. Association for Computational Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Linguistics. Riedel, Pontus Stenetorp, and Douwe Kiela. 2021. Improving question answering model robustness Xinya Du, Junru Shao, and Claire Cardie. 2017. Learn- with synthetic adversarial data generation. arXiv ing to ask: Neural question generation for reading preprint arXiv:2104.08678. comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Lin- Ondřej Bojar, Yvette Graham, Amir Kamran, and guistics (Volume 1: Long Papers), pages 1342–1352, Miloš Stanojević. 2016. Results of the WMT16 met- Vancouver, Canada. Association for Computational rics shared task. In Proceedings of the First Con- Linguistics. ference on Machine Translation: Volume 2, Shared Task Papers, pages 199–231, Berlin, Germany. As- Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel sociation for Computational Linguistics. Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requir- Tom Brown, Benjamin Mann, Nick Ryder, Melanie ing discrete reasoning over paragraphs. In Proceed- Subbiah, Jared D Kaplan, Prafulla Dhariwal, ings of the 2019 Conference of the North American Arvind Neelakantan, Pranav Shyam, Girish Sastry, Chapter of the Association for Computational Lin- Amanda Askell, Sandhini Agarwal, Ariel Herbert- guistics: Human Language Technologies, Volume 1 Voss, Gretchen Krueger, Tom Henighan, Rewon (Long and Short Papers), pages 2368–2378, Min- Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, neapolis, Minnesota. Association for Computational Clemens Winter, Chris Hesse, Mark Chen, Eric Linguistics. Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Alec Radford, Ilya Sutskever, and Dario Amodei. Guney, Volkan Cirik, and Kyunghyun Cho. 2017. 2020. Language models are few-shot learners. In SearchQA: A new Q&A dataset augmented with Advances in Neural Information Processing Systems, context from a search engine. arXiv preprint volume 33, pages 1877–1901. Curran Associates, arXiv:1704.05179. Inc. Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eu- nsol Choi, and Danqi Chen. 2019. MRQA 2019 Leon Derczynski, Eric Nichols, Marieke van Erp, and shared task: Evaluating generalization in reading Nut Limsopatham. 2017. Results of the WNUT2017 comprehension. In Proceedings of the 2nd Work- shared task on novel and emerging entity recogni- shop on Machine Reading for Question Answering, tion. In Proceedings of the 3rd Workshop on Noisy pages 1–13, Hong Kong, China. Association for User-generated Text, pages 140–147, Copenhagen, Computational Linguistics. Denmark. Association for Computational Linguis- tics. Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot Jacob Devlin, Ming-Wei Chang, Kenton Lee, and learners. In Association for Computational Linguis- Kristina Toutanova. 2019. BERT: Pre-training of tics (ACL). 
deep bidirectional transformers for language under- standing. In Proceedings of the 2019 Conference Hangfeng He, Qiang Ning, and Dan Roth. 2020. of the North American Chapter of the Association QuASE: Question-answer driven sentence encoding. for Computational Linguistics: Human Language In Proceedings of the 58th Annual Meeting of the Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, pages pages 4171–4186, Minneapolis, Minnesota. Associ- 8743–8758, Online. Association for Computational ation for Computational Linguistics. Linguistics. William B. Dolan and Chris Brockett. 2005. Automati- Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. cally constructing a corpus of sentential paraphrases. Question-answer driven semantic role labeling: Us- In Proceedings of the Third International Workshop ing natural language to annotate natural language. on Paraphrasing (IWP2005). In Proceedings of the 2015 Conference on Empiri- cal Methods in Natural Language Processing, pages Jingfei Du, Myle Ott, Haoran Li, Xing Zhou, and 643–653, Lisbon, Portugal. Association for Compu- Veselin Stoyanov. 2020. General purpose text em- tational Linguistics. beddings from pre-trained language models for scal- able inference. In Findings of the Association for Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Computational Linguistics: EMNLP 2020, pages 2015. Distilling the knowledge in a neural network. 3018–3030, Online. Association for Computational In NIPS Deep Learning and Representation Learn- Linguistics. ing Workshop.
Lynette Hirschman, Marc Light, Eric Breck, and Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. John D. Burger. 1999. Deep read: A reading com- Natural questions: A benchmark for question an- prehension system. In Proceedings of the 37th An- swering research. Transactions of the Association nual Meeting of the Association for Computational for Computational Linguistics, 7:452–466. Linguistics, pages 325–332, College Park, Maryland, USA. Association for Computational Linguistics. Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAd- Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and ing comprehension dataset from examinations. In Yejin Choi. 2020. The curious case of neural text de- Proceedings of the 2017 Conference on Empirical generation. In International Conference on Learn- Methods in Natural Language Processing, pages ing Representations. 785–794, Copenhagen, Denmark. Association for Computational Linguistics. Minqing Hu and Bing Liu. 2004. Mining and summa- rizing customer reviews. In Proceedings of the tenth Dong-Hyun Lee. 2013. Pseudo-label: The simple and ACM SIGKDD international conference on Knowl- efficient semi-supervised learning method for deep edge discovery and data mining, pages 168–177. neural networks. In ICML 2013 Workshop on Chal- Jiaxin Huang, Chunyuan Li, Krishan Subudhi, Damien lenges in Representation Learning (WREPL). Jose, Shobana Balakrishnan, Weizhu Chen, Baolin Peng, Jianfeng Gao, and Jiawei Han. 2020. Few- Jinhyuk Lee, Minjoon Seo, Hannaneh Hajishirzi, and shot named entity recognition: A comprehensive Jaewoo Kang. 2020. Contextualized sparse repre- study. arXiv preprint arXiv:2012.14978. sentations for real-time open-domain question an- swering. In Proceedings of the 58th Annual Meet- Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. ing of the Association for Computational Linguis- 2017. First quora dataset release: Question pairs. tics, pages 912–919, Online. Association for Com- https://www.quora.com/q/quoradata/ putational Linguistics. First-Quora-Dataset-Release-Question-Pairs. Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, and Danqi Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Chen. 2021. Learning dense representations of Zettlemoyer. 2017. TriviaQA: A large scale dis- phrases at scale. In Association for Computational tantly supervised challenge dataset for reading com- Linguistics (ACL). prehension. In Proceedings of the 55th Annual Meet- ing of the Association for Computational Linguistics Wendy Lehnert. 1977. Human and computational ques- (Volume 1: Long Papers), pages 1601–1611, Van- tion answering. Cognitive Science, 1(1):47–73. couver, Canada. Association for Computational Lin- guistics. Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, reading comprehension. In Proceedings of the 21st Jonghyun Choi, Ali Farhadi, and Hannaneh Ha- Conference on Computational Natural Language jishirzi. 2017. Are you smarter than a sixth grader? Learning (CoNLL 2017), pages 333–342, Vancou- textbook question answering for multimodal ma- ver, Canada. Association for Computational Linguis- chine comprehension. In 2017 IEEE Conference on tics. Computer Vision and Pattern Recognition (CVPR), pages 5376–5384. Mike Lewis and Angela Fan. 2019. Generative ques- Daniel Khashabi, Sewon Min, Tushar Khot, Ashish tion answering: Learning to answer the whole ques- Sabharwal, Oyvind Tafjord, Peter Clark, and Han- tion. 
In International Conference on Learning Rep- naneh Hajishirzi. 2020. UNIFIEDQA: Crossing for- resentations. mat boundaries with a single QA system. In Find- ings of the Association for Computational Linguis- Mike Lewis, Yinhan Liu, Naman Goyal, Mar- tics: EMNLP 2020, pages 1896–1907, Online. As- jan Ghazvininejad, Abdelrahman Mohamed, Omer sociation for Computational Linguistics. Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre- Ananya Kumar, Tengyu Ma, and Percy Liang. 2020. training for natural language generation, translation, Understanding self-training for gradual domain and comprehension. In Proceedings of the 58th An- adaptation. In Proceedings of the 37th International nual Meeting of the Association for Computational Conference on Machine Learning, volume 119 of Linguistics, pages 7871–7880, Online. Association Proceedings of Machine Learning Research, pages for Computational Linguistics. 5468–5479. PMLR. Patrick Lewis, Ludovic Denoyer, and Sebastian Riedel. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- 2019. Unsupervised question answering by cloze field, Michael Collins, Ankur Parikh, Chris Al- translation. In Proceedings of the 57th Annual Meet- berti, Danielle Epstein, Illia Polosukhin, Jacob De- ing of the Association for Computational Linguistics, vlin, Kenton Lee, Kristina Toutanova, Llion Jones, pages 4896–4910, Florence, Italy. Association for Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Computational Linguistics.
Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Zettlemoyer. 2018. Deep contextualized word rep- Minervini, Heinrich Küttler, Aleksandra Piktus, Pon- resentations. In Proceedings of the 2018 Confer- tus Stenetorp, and Sebastian Riedel. 2021. Paq: 65 ence of the North American Chapter of the Associ- million probably-asked questions and what you can ation for Computational Linguistics: Human Lan- do with them. arXiv preprint arXiv:2102.07033. guage Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- for Computational Linguistics. dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Jason Phang, Thibault Févry, and Samuel R Bowman. RoBERTa: A robustly optimized bert pretraining ap- 2018. Sentence encoders on stilts: Supplementary proach. arXiv preprint arXiv:1907.11692. training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language de- Yada Pruksachatkun, Jason Phang, Haokun Liu, cathlon: Multitask learning as question answering. Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe arXiv preprint arXiv:1806.08730. Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. 2020. Intermediate-task transfer learning Julian Michael, Gabriel Stanovsky, Luheng He, Ido Da- with pretrained language models: When and why gan, and Luke Zettlemoyer. 2018. Crowdsourcing does it work? In Proceedings of the 58th Annual question-answer meaning representations. In Pro- Meeting of the Association for Computational Lin- ceedings of the 2018 Conference of the North Amer- guistics, pages 5231–5247, Online. Association for ican Chapter of the Association for Computational Computational Linguistics. Linguistics: Human Language Technologies, Vol- ume 2 (Short Papers), pages 560–568, New Orleans, Raul Puri, Ryan Spring, Mohammad Shoeybi, Mostofa Louisiana. Association for Computational Linguis- Patwary, and Bryan Catanzaro. 2020. Training ques- tics. tion answering models from synthetic data. In Proceedings of the 2020 Conference on Empirical Stephen Mussmann, Robin Jia, and Percy Liang. 2020. Methods in Natural Language Processing (EMNLP), On the Importance of Adaptive Data Collection for pages 5811–5826, Online. Association for Computa- Extremely Imbalanced Pairwise Tasks. In Findings tional Linguistics. of the Association for Computational Linguistics: Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and EMNLP 2020, pages 3400–3413, Online. Associa- Percy Liang. 2016. SQuAD: 100,000+ questions for tion for Computational Linguistics. machine comprehension of text. In Proceedings of Bo Pang and Lillian Lee. 2005. Seeing stars: Ex- the 2016 Conference on Empirical Methods in Natu- ploiting class relationships for sentiment categoriza- ral Language Processing, pages 2383–2392, Austin, tion with respect to rating scales. In Proceed- Texas. Association for Computational Linguistics. ings of the 43rd Annual Meeting of the Association Ori Ram, Yuval Kirstain, Jonathan Berant, Amir for Computational Linguistics (ACL’05), pages 115– Globerson, and Omer Levy. 2021. Few-shot ques- 124, Ann Arbor, Michigan. Association for Compu- tion answering by pretraining span selection. In As- tational Linguistics. sociation for Computational Linguistics (ACL). F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, Nils Reimers and Iryna Gurevych. 2019. Sentence- B. Thirion, O. Grisel, M. Blondel, P. 
Prettenhofer, BERT: Sentence embeddings using Siamese BERT- R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, networks. In Proceedings of the 2019 Conference on D. Cournapeau, M. Brucher, M. Perrot, and E. Duch- Empirical Methods in Natural Language Processing esnay. 2011. Scikit-learn: Machine learning in and the 9th International Joint Conference on Natu- Python. Journal of Machine Learning Research, ral Language Processing (EMNLP-IJCNLP), pages 12:2825–2830. 3982–3992, Hong Kong, China. Association for Computational Linguistics. Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Ál- varo Rodrigo, Corina Forăscu, In̄aki Alegria, Danilo Matthew Richardson, Christopher J.C. Burges, and Giampiccolo, Nicolas Moreau, and Petya Osenova. Erin Renshaw. 2013. MCTest: A challenge dataset 2013. QA4MRE 2011-2013: Overview of question for the open-domain machine comprehension of text. answering for machine reading evaluation. In Inter- In Proceedings of the 2013 Conference on Empiri- national Conference of the Cross-Language Evalu- cal Methods in Natural Language Processing, pages ation Forum for European Languages, pages 303 – 193–203, Seattle, Washington, USA. Association for 320. Computational Linguistics. Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and True few-shot learning with language models. arXiv Karthik Sankaranarayanan. 2018. DuoRC: Towards preprint arXiv:2105.11447. complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Annual Meeting of the Association for Computa- Gardner, Christopher Clark, Kenton Lee, and Luke tional Linguistics (Volume 1: Long Papers), pages
1683–1693, Melbourne, Australia. Association for Krithara, Sergios Petridis, Dimitris Polychronopou- Computational Linguistics. los, Yannis Almirantis, John Pavlopoulos, Nico- las Baskiotis, Patrick Gallinari, Thierry Artieres, Timo Schick and Hinrich Schütze. 2021. It’s not just Axel Ngonga, Norman Heino, Eric Gaussier, Lil- size that matters: Small language models are also iana Barrio-Alvers, Michael Schroeder, Ion An- few-shot learners. In Proceedings of the 2021 Con- droutsopoulos, and Georgios Paliouras. 2015. An ference of the North American Chapter of the Asso- overview of the bioasq large-scale biomedical se- ciation for Computational Linguistics: Human Lan- mantic indexing and question answering competi- guage Technologies, pages 2339–2352, Online. As- tion. BMC Bioinformatics, 16:138. sociation for Computational Linguistics. Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew Mc- Minjoon Seo, Tom Kwiatkowski, Ankur Parikh, Ali Callum. 2018. Probabilistic embedding of knowl- Farhadi, and Hannaneh Hajishirzi. 2018. Phrase- edge graphs with box lattice measures. In Proceed- indexed question answering: A new challenge for ings of the 56th Annual Meeting of the Association scalable document comprehension. In Proceedings for Computational Linguistics (Volume 1: Long Pa- of the 2018 Conference on Empirical Methods in pers), pages 263–272, Melbourne, Australia. Asso- Natural Language Processing, pages 559–564, Brus- ciation for Computational Linguistics. sels, Belgium. Association for Computational Lin- guistics. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur ric Cistac, Tim Rault, Remi Louf, Morgan Funtow- Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. icz, Joe Davison, Sam Shleifer, Patrick von Platen, Real-time open-domain question answering with Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, dense-sparse phrase index. In Proceedings of the Teven Le Scao, Sylvain Gugger, Mariama Drame, 57th Annual Meeting of the Association for Com- Quentin Lhoest, and Alexander Rush. 2020. Trans- putational Linguistics, pages 4430–4441, Florence, formers: State-of-the-art natural language process- Italy. Association for Computational Linguistics. ing. In Proceedings of the 2020 Conference on Em- pirical Methods in Natural Language Processing: Richard Socher, Alex Perelygin, Jean Wu, Jason System Demonstrations, pages 38–45, Online. Asso- Chuang, Christopher D. Manning, Andrew Ng, and ciation for Computational Linguistics. Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment tree- Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and bank. In Proceedings of the 2013 Conference on Quoc V. Le. 2020. Self-training with noisy student Empirical Methods in Natural Language Processing, improves imagenet classification. In Proceedings of pages 1631–1642, Seattle, Washington, USA. Asso- the IEEE/CVF Conference on Computer Vision and ciation for Computational Linguistics. Pattern Recognition (CVPR). Yi Yang and Arzoo Katiyar. 2020. Simple and effective Nandan Thakur, Nils Reimers, Johannes Daxen- few-shot named entity recognition with structured berger, and Iryna Gurevych. 2021. Augmented nearest neighbor learning. In Proceedings of the SBERT: Data augmentation method for improving 2020 Conference on Empirical Methods in Natural bi-encoders for pairwise sentence scoring tasks. 
In Language Processing (EMNLP), pages 6365–6375, Proceedings of the 2021 Conference of the North Online. Association for Computational Linguistics. American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, pages 296–310, Online. Association for Computa- William Cohen, Ruslan Salakhutdinov, and Christo- tional Linguistics. pher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answer- Erik F. Tjong Kim Sang and Fien De Meulder. ing. In Proceedings of the 2018 Conference on Em- 2003. Introduction to the CoNLL-2003 shared task: pirical Methods in Natural Language Processing, Language-independent named entity recognition. In pages 2369–2380, Brussels, Belgium. Association Proceedings of the Seventh Conference on Natu- for Computational Linguistics. ral Language Learning at HLT-NAACL 2003, pages 142–147. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Adam Trischler, Tong Wang, Xingdi Yuan, Justin Har- Evaluating text generation with bert. In Interna- ris, Alessandro Sordoni, Philip Bachman, and Ka- tional Conference on Learning Representations. heer Suleman. 2017. NewsQA: A machine compre- hension dataset. In Proceedings of the 2nd Work- Yuan Zhang, Jason Baldridge, and Luheng He. 2019. shop on Representation Learning for NLP, pages PAWS: Paraphrase adversaries from word scram- 191–200, Vancouver, Canada. Association for Com- bling. In Proceedings of the 2019 Conference of putational Linguistics. the North American Chapter of the Association for Computational Linguistics: Human Language Tech- George Tsatsaronis, Georgios Balikas, Prodromos nologies, Volume 1 (Long and Short Papers), pages Malakasiotis, Ioannis Partalas, Matthias Zschunke, 1298–1308, Minneapolis, Minnesota. Association Michael R Alvers, Dirk Weissenborn, Anastasia for Computational Linguistics.