Hurdles to Progress in Long-form Question Answering
Kalpesh Krishna♠∗  Aurko Roy♦  Mohit Iyyer♠
♠ University of Massachusetts Amherst, ♦ Google Research
{kalpesh,miyyer}@cs.umass.edu, aurkor@google.com

arXiv:2103.06332v2 [cs.CL] 19 May 2021

Footnote *: Work done during an internship at Google Research.

Abstract

The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer. While many models have recently been proposed for LFQA, we show in this paper that the task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress. To demonstrate these challenges, we first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset. While our system tops the public leaderboard, a detailed analysis reveals several troubling trends: (1) our system's generated answers are not actually grounded in the documents that it retrieves; (2) ELI5 contains significant train / validation overlap, as at least 81% of ELI5 validation questions occur in paraphrased form in the training set; (3) ROUGE-L is not an informative metric of generated answer quality and can be easily gamed; and (4) human evaluations used for other text generation tasks are unreliable for LFQA. We offer suggestions to mitigate each of these issues, which we hope will lead to more rigorous LFQA research and meaningful progress in the future.¹

Footnote 1: Resources accompanying our paper can be found in https://github.com/martiansideofthemoon/hurdles-longform-qa

1 Introduction

Long-form question answering (LFQA) integrates the retrieval component of open-domain QA, which involves searching a large external knowledge source for documents relevant to a given question, with a text generation component to produce paragraph-length answers. Significant progress has been made on open-domain QA datasets such as Natural Questions (Kwiatkowski et al., 2019), whose questions are answerable with short phrases and entities, by leveraging dense retrieval techniques like ORQA (Lee et al., 2019), REALM (Guu et al., 2020), and DPR (Karpukhin et al., 2020; Lewis et al., 2020c; Izacard and Grave, 2020). Methods inspired by these results have recently been combined with pretrained language models (Lewis et al., 2020b; Petroni et al., 2020) and applied to the Reddit-derived "Explain Like I'm Five" (ELI5) dataset (Fan et al., 2019), which is the only publicly-available large-scale LFQA dataset.

The recently proposed KILT benchmark (Petroni et al., 2020), which compares retrieval-augmented models across a variety of knowledge-intensive tasks including ELI5, automatically evaluates LFQA models by the quality of both generated answers (ROUGE-L against reference answers) and retrieved documents (R-precision against human-annotated relevant documents). In this paper, we build a state-of-the-art system² for ELI5 by using a sparse Transformer variant (Roy et al., 2020) to condition over Wikipedia paragraphs returned by a REALM-style retriever (Guu et al., 2020).

Footnote 2: State-of-the-art as of April 3, 2021 — the "Google Research & UMass Amherst" team entry on https://evalai.cloudcv.org/web/challenges/challenge-page/689/leaderboard/1908
However, despite its success on the KILT leaderboard, our system does not actually use the documents that it retrieves! To measure the effect of retrieval on generation quality, we design a control experiment in which retrieved documents are replaced with randomly-sampled documents at inference time. Results from both human A/B tests and automatic metrics like ROUGE-L demonstrate that conditioning on random documents has almost no effect on generated answer quality (Figure 1c). We recommend that future LFQA research report the results of such control experiments in addition to reporting generation and retrieval quality.
How can a system using random retrieval perform well on ELI5? Our analysis reveals that this result is partially due to significant train / validation overlap in the ELI5 dataset (Figure 1a), which eliminates the need for external retrieval. A human study shows that at least 81% of validation questions have a paraphrase in the training set, and almost all validation questions are topically similar to a training set question. While Fan et al. (2019) attempted to identify and remove question overlap using TF-IDF similarity, more complex semantic matching methods and human verification are needed to address this issue in future LFQA datasets.

Digging deeper, we identify fundamental issues with using ROUGE-L to evaluate generated answer quality (Figure 1b). Simple baselines such as just repeatedly copying the question, or choosing a random training set answer, can outperform LFQA systems such as RAG (Lewis et al., 2020c) in terms of ROUGE-L. On the other hand, our system achieves higher ROUGE-L than reference human-written answers, which is misleading since human A/B testers strongly prefer reference answers to our system's. We conclude that ROUGE-L is not a reliable metric to evaluate LFQA due to its large and relatively unconstrained output space (e.g., compared to translation or summarization), and we offer suggestions for better automatic & human evaluations to enable meaningful progress on this task.

[Figure 1: A summary of the major hurdles (a-d) to progress in long-form question answering with ELI5. (a) Many held-out questions are paraphrased in the training set; the best answer to a similar training question gets 27.4 ROUGE-L. (b) Simply retrieving answers to random unrelated training questions yields relatively high ROUGE-L, while actual gold answers underperform generations (which contain repetition). (c) Conditioning answer generation on random documents instead of relevant ones does not measurably impact factual correctness; longer outputs get higher ROUGE-L. (d) Annotators find it difficult to judge long answers and the correctness of technical content.]
2 A state-of-the-art LFQA system

The ELI5 task (Fan et al., 2019) asks models to generate paragraph-length answers to open-ended questions in English that often rely on world knowledge (e.g., how do jellyfish function without brains or nervous systems?). LFQA systems thus benefit from conditioning answer generation on relevant documents from the web (such as the Wikipedia article about jellyfish). While large-scale pretrained language models store surprising amounts of world knowledge within their parameters (Petroni et al., 2019; Roberts et al., 2020), external document retrieval not only augments this intrinsic knowledge but also grounds model outputs in a knowledge source, which provides interpretability.

In this section, we describe our proposed LFQA system, which conditions answer generation on Wikipedia articles identified by a pretrained retriever. We use a dense retriever trained by scaling up a distantly supervised algorithm from Jernite (2020). Since retrieved articles can be quite long and often exceed the maximum sequence length of pretrained models like BERT (Devlin et al., 2019), we use a sparse-attention variant of the Transformer to allow modeling over longer sequences. While our system sets a new state-of-the-art on ELI5, we question the significance of this result in Section 3.
2.1 Retriever

We begin by specifying our dense retriever ("contrastive REALM" or C-REALM), which returns documents related to an input question. Consider a corpus of long-form questions and answers, represented by (q_i, a_i)_{i=1}^N. Our retriever uses q_i as a query to retrieve K documents (r_{i,j})_{j=1}^K from a knowledge corpus (Wikipedia), which is enabled by an encoder network that projects both questions and candidate documents to a 128-d shared embedding space. Like REALM (Guu et al., 2020), our encoder is a BERT-base Transformer (Devlin et al., 2019) with a final projection layer.

Since the ELI5 dataset does not include gold retrievals, we train our retriever by scaling up a method recently introduced by Jernite (2020) that uses gold answers for distant supervision. The key idea is to push the encoded vector for a question close to a vector representation of its ground-truth answer(s), but away from all other answer vectors in the mini-batch (negative examples). Intuitively, this method works because both ELI5 answers and external documents are of paragraph length (documents are paragraph-length chunks from Wikipedia). Concretely, we optimize the loss

\mathrm{loss} = - \sum_{(q_i, a_i) \in B} \log \frac{\exp(q_i \cdot a_i)}{\sum_{a_j \in B} \exp(q_i \cdot a_j)}

where B is the mini-batch and q_i, a_i are the encoded vector representations for (q_i, a_i). This objective is based on contrastive learning, a method that has been used effectively for semi-supervised learning (Chen et al., 2020) and dense retriever training (Karpukhin et al., 2020). Scaling up from Jernite (2020), who used a mini-batch size of 512 and initialized their retriever with BERT, we use much larger mini-batches of size 12,288 (and hence, many more negative examples) and initialize our retriever with a strong pretrained retriever, the REALM model (Guu et al., 2020) trained on the Common Crawl News (CC-News) corpus. These design decisions greatly improve retriever quality, as we observe in an ablation study (see Appendix A.2). During inference, we perform a maximum inner-product search (MIPS) with the ScaNN library (Guo et al., 2020) to efficiently find the top K documents. In all our experiments we use K = 7, following the setup in Guu et al. (2020).
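To make the objective concrete, below is a minimal NumPy sketch of the in-batch contrastive loss described above. It is an illustration of the equation rather than the TensorFlow implementation used in our codebase, and it averages (rather than sums) over the mini-batch.

    import numpy as np

    def contrastive_loss(q_emb, a_emb):
        # q_emb, a_emb: [B, d] arrays of encoded questions and their gold answers.
        scores = q_emb @ a_emb.T                      # [B, B] matrix of q_i . a_j
        m = scores.max(axis=1, keepdims=True)         # stabilize the softmax
        log_denom = m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))
        log_probs = scores - log_denom                # log softmax over in-batch answers
        return -np.mean(np.diag(log_probs))           # gold answer a_i is the "label" for q_i

Every other answer in the mini-batch acts as a negative example, which is why scaling the batch size to 12,288 directly scales the number of negatives per question.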
2.2 Generator

We next describe our generator model, which conditions its generated answers on retrieved documents returned by C-REALM. We use the Routing Transformer (RT) from Roy et al. (2020), which is the current state-of-the-art in long-form language modeling. The RT is a sparse attention model that employs local attention as well as mini-batch k-means clustering to better model long-range dependencies in sequences (attention maps in Appendix A.1). Long-form language models such as RT are well-suited to ELI5 as the task requires conditioning answer generation not only on a short question but also on many lengthy retrieved documents.

We pretrain our RT model on PG-19, a long-form language modeling benchmark (Rae et al., 2020) created from approximately 28,000 Project Gutenberg books published before 1919. PG-19 has 1.9B tokens and an average context size of 69K words. While this data is out-of-domain for ELI5, we choose it to encourage long & coherent generation. Our RT is a 22-layer model with 1032 hidden units (486M parameters), maximum sequence length of 8192 tokens, and a vocabulary of 98K subwords.³ We fine-tune our model in a decoder-only fashion (Liu et al., 2018; Wolf et al., 2018) by concatenating the top K retrieved documents to the question as [r_{i,K}, r_{i,K-1}, ..., r_{i,1}, q_i, a_i] and training the model to predict tokens of the answer a_i. We do not backpropagate gradients through the retriever.⁴ Retrievals slightly improve perplexity (18.1 vs 17.8) as seen in Wang and McAllester (2020), but do not improve generations (§3.1).

Footnote 3: Our hyperparameters have been chosen manually with minimal tuning. See Appendix A.1 for details.
Footnote 4: We tried training the retriever jointly with RT using the attention bias scheme proposed in MARGE (Lewis et al., 2020a). This improved perplexity only in autoencoding settings where the gold answer itself is used as a retrieval query (like the setup in Lewis et al., 2020a), which is not valid in LFQA.
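As a rough illustration of this fine-tuning setup, the sketch below builds a single decoder-only training sequence and a loss mask restricting the loss to answer tokens. The separator symbol and the truncation policy are simplified placeholders, not the exact preprocessing used in our pipeline (which truncates/pads each segment to 288 subwords; see Appendix A.1).

    def build_training_sequence(retrievals, question, answer, sep_id, max_len=8192):
        # retrievals: list of token-ID lists, ordered from highest- to lowest-ranked.
        # Reversing puts the top-ranked paragraph r_1 closest to the question and
        # answer, which matters under local attention: [r_K, ..., r_1, q, a].
        context = []
        for r in reversed(retrievals):
            context += r + [sep_id]
        context += question + [sep_id]
        tokens = (context + answer)[:max_len]
        # Compute the language modeling loss only over answer positions.
        loss_mask = [0] * min(len(context), max_len)
        loss_mask += [1] * (len(tokens) - len(loss_mask))
        return tokens, loss_mask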
2.3 Main Experiments

Dataset & Evaluation details: We evaluate our model on the KILT validation & test subsets of ELI5 (Petroni et al., 2020), since the original ELI5 dataset does not have human annotations to measure retriever performance. We downloaded the ELI5 dataset (Fan et al., 2019) from the KILT Github repository.⁵ This version of the dataset has 272,634 training examples, 1,507 validation examples and 600 test examples. The test set answers are hidden, and hosted on a public leaderboard on the EvalAI platform (Yadav et al., 2019). Answer quality is measured by the maximum overlap of generations with a set of gold answers in terms of unigram F1 score and ROUGE-L (Lin, 2004). Petroni et al. (2020) collected human annotations of Wikipedia articles which support ELI5 gold answers, which enables measuring retrieval quality by computing R-precision (whether the top-1 retrieval matches the annotation) and Recall@5 using the top-5 retrievals. Finally, the KILT benchmark combines R-precision and ROUGE-L to measure the overall performance of the system via "KILT ROUGE-L". This metric is similar to ROUGE-L, but assigns a score of 0 whenever the top-1 retrieval does not match the gold annotation.

Footnote 5: github.com/facebookresearch/KILT
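The sketch below mirrors this description of the KILT metrics for a single example. It is a simplified illustration (the official KILT scripts handle multiple provenance annotations and aggregation across gold answers), and the function and variable names are ours.

    def r_precision(top1_page, gold_pages):
        # 1 if the top-ranked retrieved Wikipedia page is an annotated gold page.
        return float(top1_page in gold_pages)

    def recall_at_5(top5_pages, gold_pages):
        # 1 if any of the top-5 retrieved pages is an annotated gold page.
        return float(any(p in gold_pages for p in top5_pages))

    def kilt_rouge_l(top1_page, gold_pages, rouge_l):
        # KILT R-L only awards the answer's ROUGE-L when retrieval is also correct.
        return rouge_l if top1_page in gold_pages else 0.0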
Baselines: We compare our model with the other entries on the ELI5 KILT leaderboard, which are either generation-only, like T5-base (Raffel et al., 2020) and BART (Lewis et al., 2020b), or variants of BART using retrieval, such as RAG (Lewis et al., 2020c) and BART + DPR (Petroni et al., 2020). These systems are based on massive pretrained language models with a similar number of parameters as our model (details in Appendix A.3).

Results: Table 1 contains our results on the ELI5 test set (also on the public KILT leaderboard). We present four variants of our system, using a different retriever during inference (REALM or C-REALM) and different nucleus sampling p values (Holtzman et al., 2020). All variants outperform prior work in generation quality, with lower-entropy models (p = 0.6) performing best.⁶ C-REALM performs competitively to RAG and DPR despite being only distantly supervised, and outperforms REALM. Our proposed RT + C-REALM system achieves a new state-of-the-art on combined performance (KILT R-L). Generations from our model are provided in Figure 2 and Appendix A.4.

Footnote 6: As in Holtzman et al. (2020), a human study reveals that higher entropy (p = 0.9) answers are slightly more coherent and sensible, but lower entropy answers (p = 0.6) are more relevant to the question (details in Appendix A.5).

Table 1: Results on the KILT test set for ELI5 for (1) retrieval performance, using R-precision and Recall@5 (RPrec, R@5), and (2) generation quality, using ROUGE-L (R-L). These scores are combined to produce the final metric KILT R-L (KRL). We outperform prior work on both generation & combined scores.

Model            RPrec   R@5    F1     R-L    KRL
T5-base          0.0     0.0    16.1   19.1   0.0
BART             0.0     0.0    19.2   20.6   0.0
RAG              11.0    22.9   14.5   14.1   1.7
BART + DPR       10.7    26.9   17.9   17.4   1.9
p = 0.9
RT + REALM       6.7     15.5   25.1   21.5   1.4
RT + C-REALM     10.2    24.4   25.4   21.5   2.1
p = 0.6
RT + REALM       6.7     15.7   23.1   23.4   1.5
RT + C-REALM     10.7    24.6   22.9   23.2   2.4

Figure 2: Example generation from our LFQA system with p = 0.9. Generations are long & coherent, but suffer from repetition towards the end (more in Appendix A.4 and the attached supplementary material).

Q: Why are almost all boats white?
A: Boats are generally painted white for obvious reasons, as they are inexpensive and easy to clean. Ships are also often painted to provide a visible visual contrast to the sea. Boats also often have high visibility ports to allow for easy international trade. In addition, many boats have an air-breathing function that makes them easier to spot and stay afloat, which is a good thing if you're a pirate or trying to get into some country's waters. Lastly, many ships are built in the last few decades, so they are not as exotic looking as they were thirty years ago. And lastly, many ships are made out of materials that are easy to work with, and so many of them are painted white as a practical matter. There are many, many more reasons for why ships are white, but those are the ones I know of. I'm sure there are many other reasons, but those are the ones I know of. Edit: I forgot to mention that the other answers about visibility and easy to see ports are also true. Ships are painted to appear to be larger, and more expensive. That makes them appear more impressive, and so they are more likely to be bought by people who want to show off their wealth and status.

3 Analysis

In this section, we conduct a thorough analysis of our model's usage of retrievals (Section 3.1), the impact of overlap in ELI5's train / validation / test folds (Section 3.2), issues with ROUGE-L and performance bounds (Section 3.3), and the difficulty of human evaluation for this task (Section 3.4). At the end of each section, we provide short takeaways with suggestions for future work.

3.1 Are generations grounded in retrieval?

While our retrieval-augmented system achieves state-of-the-art performance, we find little evidence that it is actually using the retrieved documents. To measure this, we run an ablation study where at inference time we replace retrieved paragraphs with randomly sampled paragraphs from Wikipedia. We compare this Random baseline with our original system (Predicted) in terms of generation quality as well as the n-gram overlap between the generation and the retrieved paragraphs.
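A minimal sketch of this control experiment: at inference time, the only change is where the conditioning paragraphs come from. The retriever interface shown here is a hypothetical stand-in, not the actual API of our codebase.

    import random

    def get_context_paragraphs(question, retriever, wiki_paragraphs, k=7, control=False):
        # Control condition: ignore the question and sample k paragraphs uniformly
        # from the Wikipedia corpus; otherwise use the retriever's top-k documents.
        if control:
            return random.sample(wiki_paragraphs, k)
        return retriever.retrieve(question, k)

The generator is then run identically on both kinds of context, so any difference in output quality is attributable to the retrieved text itself.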
Generations are similar irrespective of type of retrievals: We present our results in Table 2. Despite not being conditioned on any meaningful retrievals, the Random retrieval model has similar ROUGE-L scores as our Predicted system. Moreover, generations from the Random and Predicted models have similar amounts of 1-gram and 2-gram overlap with the paragraphs retrieved by C-REALM, despite the fact that the Random model does not actually see the retrieved paragraphs.⁷ The n-gram overlaps are possibly overestimates due to stopwords (e.g., prepositions, punctuation) and entities which are copied from the question. To tackle this issue, in Table 4 we measure the fractions of lemmatized nouns, proper nouns and numbers in the generated answer which are present in the predicted retrievals but not in the question. We notice similar trends as before, with only small differences between the two systems. Finally, there is almost no correlation (Spearman ρ = 0.09) between the Predicted model's generation quality and the amount of unigram overlap between its outputs and the retrieved documents (scatter plots in Appendix A.7), strengthening our hypothesis that generations are not grounded in retrievals.⁸

Footnote 7: Corresponding experiments with the p = 0.9 variant of our model are presented in Appendix A.7.
Footnote 8: All these trends persist even on questions for which our retriever predicts the ground-truth document (Appendix A.7).

Table 2: Comparison of generations (with p = 0.6) conditioned on predicted retrievals (Predicted) and randomly chosen retrievals (Random). Notice small differences in (1) ROUGE-L vs gold answers (R-L) and (2) n-gram overlap (n-g) with predicted retrievals (vs predicted retr.). Gold answers also have a similar overlap with predicted retrievals. To control for stopwords, we show overlaps with the random retrievals.

             R-L      vs predicted retr.    vs random retr.
                      1-g      2-g          1-g      2-g
Predicted    24.42    52.3     9.0          38.8     3.9
Random       24.20    51.2     8.5          38.5     3.9
Gold Ans     -        54.1     9.1          40.2     3.8

Table 4: A fine-grained version of Table 2 measuring the unigram overlap of nouns/numbers in the generations with the input question (vs qn.), retrievals predicted by C-REALM (vs predicted retr.) and randomly sampled retrievals (vs random retr.). Only lemmatized nouns, proper nouns and numbers are considered. Similar to Table 2, notice very little difference with and without retrieval.

             vs qn.    vs predicted retr.    vs random retr.
                       but not in qn.        but not in qn.
Predicted    13.4%     34.4%                 11.9%
Random       13.7%     31.7%                 12.1%
Gold Ans     8.3%      28.8%                 15.1%
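The overlap statistic in Table 4 can be approximated with a short script like the one below. It is a sketch that uses spaCy for POS tagging and lemmatization; the exact tokenization and normalization in our analysis may differ.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def content_lemmas(text):
        # Lemmatized nouns, proper nouns, and numbers only.
        return {t.lemma_.lower() for t in nlp(text) if t.pos_ in {"NOUN", "PROPN", "NUM"}}

    def overlap_not_in_question(generation, retrievals, question):
        gen = content_lemmas(generation)
        retr = content_lemmas(" ".join(retrievals)) - content_lemmas(question)
        return len(gen & retr) / max(1, len(gen))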
Human evaluation validates our findings: As ROUGE-L and n-gram overlap have major limitations for LFQA (Section 3.3), we perform additional human A/B testing on the output of Random and Predicted. Specifically, we ask human volunteers⁹ to choose between answers generated by the two systems (presented in random order). As seen in Table 3, humans struggle to choose which of the two answers is more relevant to the question. For both model variants (p = 0.6, 0.9), there is less than a 7% preference for a particular answer type, with humans preferring answers (by 6%) from the Random model for p = 0.9!

Footnote 9: Details of our experimental setup in Appendix A.5.

Table 3: Human evaluation results with the exact number of ratings shown in (·). Annotators are shown a question along with two answers (A, B) in random order and asked to choose one (details in Appendix A.5). For both model variants (p = 0.6, 0.9), we see (1) little difference between generations conditioned on predicted (pred.) or random (rand.) retrievals; (2) a strong preference for gold answers over generations.

A        B            Prefer A     Prefer B     Tie
For p = 0.6
pred.    random       40% (78)     33% (64)     27% (51)
pred.    gold ans.    14% (29)     68% (138)    18% (36)
For p = 0.9
pred.    random       31% (52)     37% (63)     32% (54)
pred.    gold ans.    17% (49)     72% (203)    11% (31)

Other systems also have this issue, possibly due to source-reference divergence and train-validation overlap: We note that this issue is not unique to our system — other systems on the KILT leaderboard like BART + DPR and RAG actually perform worse than their no-retrieval counterpart (BART) in generation quality, as shown in Table 1. Qualitatively, we found no evidence of retrieval usage in a publicly hosted ELI5 model demo by Jernite (2020).¹⁰ A possible explanation for this issue is high source-reference divergence, a common problem in table-to-text generation (Wiseman et al., 2017; Tian et al., 2019). In Table 2 and Table 4, we measure the n-gram overlap of top-ranked gold validation answers (Gold Ans) with predicted retrievals. This overlap is low and similar to that of our generations, which we suspect encourages our model to ignore retrievals. A second explanation is the large amount of train-validation overlap (Section 3.2), which eliminates the need for retrieval.

Footnote 10: https://huggingface.co/qa
Why does our model do well compared to other systems despite not using retrievals? While our model has similar capacity as the BART/RAG baselines (comparison in Appendix A.3), we hypothesize that our improvements in ROUGE-L are due to a different pretraining objective. BART is pretrained on a masked infilling task on short sequences. Instead, we pretrain our model to perform next-word prediction on long sequences from Project Gutenberg, which encourages long & fluent generations. To illustrate this length effect, in Appendix A.6 we show that truncated outputs from our model get lower ROUGE-L scores on ELI5.¹¹ Prior summarization literature (Sun et al., 2019) has also shown that ROUGE scores vary heavily by length. To compare the same systems on shorter length outputs, we also tried finetuning the pretrained model on Wizard of Wikipedia (Dinan et al., 2019), an unconstrained dialogue generation task with single-sentence dialogues (much shorter than ELI5). As seen on the public KILT leaderboard,¹² our system has lower ROUGE-L scores than the BART / RAG baselines. Another possible explanation is issues with ROUGE-L itself, as discussed in Section 3.3.

Footnote 11: While we do not have access to generations from baselines on the KILT leaderboard, example generations from the demo of the BART model in Jernite (2020) are significantly shorter (59 words avg.) than our generations (187 words avg.).
Footnote 12: https://eval.ai/web/challenges/challenge-page/689/leaderboard/1909

Takeaway (better evaluation of grounding): For evaluating LFQA, it is important to run control experiments with random retrievals & measure grounding of generations in retrieval. While the KILT benchmark does attempt to measure the combined retrieval + generation performance via KILT R-L, it does not check whether the generations actually used the retrievals. In other words, one can submit independent retrieval & generation systems, but still perform well on the combined score. This may not be an issue for short-form QA tasks like Natural Questions, since the gold answer is often exactly contained as a span in the gold retrieval. Also, as retrieval might be less important for large language models with parametric knowledge (Roberts et al., 2020), the KILT-RL strategy of simply aggregating the top-1 retrieval score with ROUGE-L unfairly penalizes systems not relying on retrieval.¹³

Footnote 13: Another issue of KILT-RL is ignoring non top-1 retrievals, penalizing models using multiple retrievals together in context.
3.2 Training / Validation Overlap

Our experiments in Section 3.1 show that model performance is mostly unchanged by conditioning generation on randomly sampled retrievals instead of predictions from C-REALM. Despite not using retrievals, we observe qualitatively that our model displays a large amount of parametric knowledge ("Faraday Cage" in Figure 1c), which is surprising since it was pretrained on novels from Project Gutenberg (not Wikipedia). In this section, we discover that a major reason for ignoring retrievals is the large amount of train / validation overlap in ELI5. While Fan et al. (2019) attempted to fix this issue through TF-IDF overlap, this method is insufficient to identify all question paraphrases, as we find significant overlap between the training set and the KILT validation set of ELI5.¹⁴ ELI5 is not the only dataset with substantial train / test overlap: Lewis et al. (2020d) identify similar issues with short-form QA datasets like Natural Questions.

Footnote 14: The ELI5 demo from Jernite (2020) also retrieves the top-1 similar training set question. Qualitatively, we found many validation examples had near-identical train paraphrases.

Finding similar questions & measuring overlap: We use our retriever C-REALM to retrieve similar questions from the training set, since it has learned to map questions to a feature-rich embedding space. For each validation question, we retrieve the 7 most similar training set questions. We use both human and automatic evaluation to calculate the amount of overlap. For human evaluation, we show annotators on Amazon Mechanical Turk¹⁵ a validation set question and a retrieved training set question, and ask them to annotate the pair as 0: no paraphrase relationship; 1: on similar topics, but different questions; 2: approximately the same question (an adaptation of the paraphrase evaluation of Kok and Brockett, 2010). We take 300 validation set questions and ask three crowd-workers to rate them against retrieved training questions on this scale, and consider the label with the majority rating. To improve quality, we manually verify their annotations. Table 5 shows that 81% of validation set questions have at least one paraphrase in the training set, while all annotated questions have at least one topically similar question in the training set, which indicates substantial training / validation overlap. The experiment had "fair agreement" with a Fleiss κ of 0.29 (Fleiss, 1971; Landis and Koch, 1977).

Footnote 15: We pay workers 4 cents per question pair ($8-12 / hr). We only hire workers from USA, UK and Australia with a 95% or higher approval rating and at least 1000 approved HITs.

Table 5: A human evaluation measuring the amount of overlap between validation set questions (qns) and retrieved questions from the training set.

qns with at least one train set paraphrase                      81%
qns with at least one topically similar train set question      100%
% of all pairs marked as paraphrases                            39.5%
% of all pairs marked as topically similar                      47.8%
% of all pairs marked as non-paraphrases                        12.7%
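As an illustration of this retrieval step, the snippet below finds the 7 nearest training questions for one validation question by inner product over question embeddings; it is a brute-force NumPy stand-in for the actual MIPS index.

    import numpy as np

    def most_similar_train_questions(val_q_emb, train_q_embs, k=7):
        # val_q_emb: [d] embedding of one validation question.
        # train_q_embs: [N, d] embeddings of all training questions.
        scores = train_q_embs @ val_q_emb
        top_k = np.argpartition(-scores, k)[:k]
        return top_k[np.argsort(-scores[top_k])]   # indices sorted by similarity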
As manually annotating question overlap can be expensive and time-consuming, we also experiment with automatic overlap detection methods. In particular, we use a RoBERTa-large binary classifier (Liu et al., 2019) fine-tuned on the Quora Question Paraphrase (QQP) dataset (Iyer et al., 2017) from the GLUE benchmark (Wang et al., 2019). For 43.6% of the ELI5 validation set, this classifier marked at least one retrieved question as a paraphrase (46% for the 300 questions we annotated). Qualitatively, we notice that this classifier often mis-classifies retrieved questions that are valid paraphrases but exhibit significant lexical or syntactic divergence. This observation, along with the smaller fraction of valid paraphrases in the QQP training set (37%), partially explains the gap between automatic & human evaluations.
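A sketch of this automatic check using the Hugging Face transformers API is shown below. The checkpoint name is a hypothetical placeholder for a RoBERTa-large model fine-tuned on QQP, and the positive-label index depends on that checkpoint's label mapping.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL = "path/to/roberta-large-finetuned-qqp"   # hypothetical checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL)

    def is_paraphrase(q1, q2, threshold=0.5, positive_label=1):
        inputs = tokenizer(q1, q2, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)
        return probs[0, positive_label].item() > threshold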
Using retrieved QA for generation: Since ELI5 contains a significant amount of overlap between the training and validation sets, a system can simply copy the answers of retrieved training set questions instead of actually doing generation. Table 7 shows that by using the longest answer within the top-K retrieved questions, we outperform two prior systems (RAG, BART + DPR) that use retrieval-augmented generation. As an upper bound, we also consider a system which uses the best possible answer to retrieved training set questions in terms of ROUGE-L (best top-K train answer). This system gets 28.5 ROUGE-L, outperforming all others.

ELI5 performance on overlapping QA: Finally, we measure the performance difference between validation questions that overlap with the training set vs. those that do not. Since we only have human annotations for 300 questions (the no-overlap subset has only 53 samples), we present this analysis using the QQP classifier's outputs as well. In Table 6, we notice large differences of 6.6 RPrec and 8.1 R@5 in retrieval performance favoring the overlap subset, but only a small generation score gain of 0.8 F1 and 0.4 R-L (which may be misleading, as discussed in Section 3.3).

Table 6: ELI5 performance difference (for the p = 0.6 model) between subsets of validation QA having a question paraphrase (overlap) and not having a question paraphrase (not overlap) in the training set. We see the overlap subset has much better retrieval performance and slightly better generation performance.

                                   Retrieval          Generation
Split                              RPrec    R@5       F1      R-L
QQP classifier (1.5k examples)
  overlap (43.6%)                  17.0     25.8      26.0    24.6
  not overlap (56.4%)              10.4     17.7      25.2    24.2
AMT evaluation (300 examples)
  overlap (81%)                    14.0     20.0      25.0    24.3
  not overlap (19%)                5.3      17.9      24.5    24.8

Takeaway (careful held-out curation): Based on our findings, we suggest that more careful dataset curation for LFQA tasks is needed to prevent duplicates. While we acknowledge the efforts of Fan et al. (2019) to fix this issue, we also suggest alternative methods to control overlap and focus on evaluating generalization in held-out sets: (1) automatically retrieving paraphrases and then running human validation to eliminate them; or (2) holding out entire genres or domains to reduce the possibility of overlap — for example, keeping Q/A on Sports only in the held-out sets. Note that simply pruning the existing splits using these criteria will significantly reduce the size of the held-out datasets; so we suggest re-splitting the train/validation/test splits from the entire pool of collected questions.
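The copying baseline described above under "Using retrieved QA for generation" can be sketched as follows; the function and data-structure names are ours, assuming train_answers maps a training question ID to its list of gold answers.

    def longest_top_k_train_answer(similar_train_q_ids, train_answers):
        # Copy the longest gold answer among the answers to the top-K retrieved
        # training questions; no text is generated at all.
        candidates = [a for q in similar_train_q_ids for a in train_answers[q]]
        return max(candidates, key=lambda a: len(a.split()))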
3.3 ROUGE-L Bounds on ELI5 Performance

We have seen that simply copying the answer of a close question paraphrase from the training set achieves 28.5 ROUGE-L (with an optimal selection among retrieved questions), outperforming all computational models. But how "good" is this absolute number? What are suitable upper and lower bounds on ROUGE-L scores for ELI5? Is ROUGE-L an informative metric for LFQA?

Lower bounds are trivial baselines used to test the vulnerability of datasets or metrics to simple heuristic strategies that do not actually perform the task. Recent examples include hypothesis-only baselines for natural language inference (Gururangan et al., 2018) and passage-only baselines for reading comprehension (Kaushik and Lipton, 2018). We evaluate two ROUGE-L lower bounds on ELI5: (1) copy the question 5 times and concatenate, as longer outputs boost ROUGE-L (Appendix A.6); (2) retrieve a random training set answer. Our first baseline contains entities often present in the gold answer, but without actually answering the question. Our second baseline follows the "style" of an answer but is completely off-topic.

As an upper bound, we estimate the ROUGE-L of gold answers themselves. On average, there are 12 gold answers per question, so we measure the ROUGE-L of the longest gold answer with respect to the other gold answers. We also measure the maximum pairwise ROUGE-L between two gold answers for the same question.¹⁶ We only calculate upper bounds for the validation set, since the gold answers of the KILT test set are hidden.

Footnote 16: Note that different gold answers were not written independently, as Reddit users writing answers can read existing answers and may want to provide a non-overlapping perspective. Due to the high train/valid overlap, the best top-7 retrieved answer could be a better upper bound since it is from another Reddit post (and performs better than the best gold answer).

Table 7: Upper (↑) and lower (↓) bounds to performance on ELI5. Lower bounds have been submitted to the public KILT leaderboard as "Metrics Test".

                                    Validation       Test
Scheme                              F1     R-L       F1     R-L
random train answer (↓)             17.8   16.2      17.1   15.5
copy input (↓)                      16.6   20.0      14.8   16.9
RAG (2020c)                         17.2   16.1      14.5   14.1
BART + DPR (2020)                   18.8   18.5      17.9   17.4
longest top-1 train answer          25.2   20.7      21.6   18.7
longest top-7 train answer          26.9   21.1      22.0   18.5
RT + C-REALM (ours)                 25.6   24.4      22.9   23.2
best top-1 train answer (↑)         25.9   22.4      -      -
best top-7 train answer (↑)         31.5   28.5      -      -
longest gold answer (↑)             26.7   21.2      -      -
best gold answer (↑)                29.5   26.2      -      -
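The two lower bounds and the best-gold-answer scoring can be reproduced with a few lines using the rouge_score package; this is a sketch, and the KILT leaderboard uses its own evaluation scripts.

    import random
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    def copy_input_baseline(question):
        # Lower bound 1: concatenate the question with itself 5 times.
        return " ".join([question] * 5)

    def random_train_answer_baseline(all_train_answers):
        # Lower bound 2: an arbitrary training-set answer, unrelated to the question.
        return random.choice(all_train_answers)

    def best_rouge_l(prediction, gold_answers):
        # ELI5 scores each prediction against its best-matching gold answer.
        return max(scorer.score(g, prediction)["rougeL"].fmeasure for g in gold_answers)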
Lower bounds beat prior work, upper bounds have low ROUGE-L: We compare our bounds with actual retrieval-augmented generation systems in Table 7. Both our lower bounds (random training answer, copy input) are quite competitive, outperforming RAG (Lewis et al., 2020c) and performing close to BART + DPR (Petroni et al., 2020) without actually answering the question! This shows that ROUGE-L is fairly sensitive to simply copying entities from the question as well as stylistic properties of ELI5. On the other hand, upper bounds (longest gold answer) perform worse than our system (21.2 vs 24.4). Suspecting that this result is misleading, we run another human A/B test by showing volunteers a question and asking them to choose between answers generated by our system and the longest gold answer, shuffled at random.¹⁷ As seen in Table 3, the majority of humans prefer the gold reference answers over generations (68% vs 14% for p = 0.6). In interviews with human annotators after completing the task, they reported that both answers were often fluent and stylistically similar, but one eventually veered off-topic.

Footnote 17: Human A/B testing details in Appendix A.5.

Takeaway (better automatic metrics needed): Our experiments demonstrate that computing the ROUGE-L of generations against gold answers is not a meaningful way to evaluate LFQA systems, since it is not selective enough to differentiate between valid and invalid answers. There is a very small margin of improvement between trivial lower bounds and strong upper bounds, with the absolute scores of upper bounds being quite low. We suspect this is due to the long length of answers and the fairly unconstrained and large output space. The ELI5 dataset has several open-ended questions with many plausible answers (like What causes traffic?), often involving analogies. A possible fix is a sentence-level evaluation that then aggregates scores across generated sentences, but appropriate penalties are needed for lack of diversity (Zhu et al., 2018) and short lengths. Other possible fixes include learning task-specific metrics to measure semantic overlap (Sellam et al., 2020) or metrics to check factual correctness (Zhang et al., 2020) and faithfulness to input (Wang et al., 2020; Durmus et al., 2020; Zhou et al., 2020). Ultimately, all automatic metrics have their limitations, and human evaluation is necessary (Celikyilmaz et al., 2020).
3.4 Difficulty of Human Evaluation

To better understand the inherent difficulty of evaluation in ELI5, we interviewed human annotators (of Table 3) and found two challenges:

(1) Unfamiliarity with question topics: While most annotators found the Q/A interesting, they were often unfamiliar with the technical topics discussed in the questions. This made it hard for them to assess answer correctness. The ELI5 dataset has questions in a wide variety of topics (History, Politics, Biology etc.), while most annotators were Computer Science graduate students. While we did allow annotators to use Wikipedia, they mentioned domain-experts will be better judges of factual correctness of answers.

(2) Length of Answers: Annotators mentioned the paragraph-long length of answers made the task quite challenging. Annotators reported taking an average of 2 minutes per answer pair, many of which required careful thought & concentration. This was especially difficult when only part of the answer was correct and the rest had contradictions or repetitions, a common theme in our generations.

Takeaway: Human evaluation is challenging but necessary for evaluating LFQA. Crowd-workers are unlikely to spend time reading & analyzing long text (Akoury et al., 2020). Hence, it is imperative to design simpler evaluations. One effort in this direction is Dugan et al. (2020), who reveal one generated sentence at a time and estimate system quality based on the number of sentences which fooled humans. Another promising direction is extrinsic evaluation (Celikyilmaz et al., 2020), where humans actually interact with systems in real-world scenarios such as the Alexa Prize (Ram et al., 2018) or STORIUM (Akoury et al., 2020).

4 Conclusion

We present a "retrieval augmented" generation system that achieves state-of-the-art performance on the ELI5 long-form question answering dataset. However, an in-depth analysis reveals several issues not only with our model, but also with the ELI5 dataset & evaluation metrics. We hope that the community works towards solving these issues so that we can climb the right hills and make meaningful progress on this important task.
Acknowledgements

First and foremost, we thank the twenty people who volunteered to help out with the human annotation experiments. We are very grateful to Vidhisha Balachandran, Niki Parmar, and Ashish Vaswani for weekly meetings discussing progress, and the REALM team (Kenton Lee, Kelvin Guu, Ming-Wei Chang and Zora Tung) for help with their codebase and several useful discussions which helped us improve our experiments. We are grateful to Tu Vu for help with the QQP classifier. We thank Jules Gagnon-Marchand and Sewon Min for suggesting useful experiments on checking ROUGE-L bounds. Finally, we thank Shufan Wang, Andrew Drozdov, Nader Akoury, Andrew McCallum, Rajarshi Das, and the rest of the UMass NLP group for helpful discussions and suggestions at various stages in the project. This work was primarily done during KK's internship at Google Brain, mentored by AR. MI and KK are supported by award IIS-1955567 from the National Science Foundation (NSF).

Ethical Considerations

Our system faces a similar set of issues as most modern text generation technology, like fabrication of facts (Zellers et al., 2019), potential for misuse (Brown et al., 2020) and reflecting biases prevalent on Reddit (the ELI5 dataset has been built using the r/ELI5 subreddit). In our work, we attempted to make text generators more factually grounded by conditioning generations on retrieved Wikipedia articles, hoping to reduce fact fabrication. Unfortunately, a thorough analysis (Section 3.1) has revealed that our system is still not grounding its generations in retrievals, and we have recommended the design of better metrics to measure factual correctness to tackle this issue.

Our final models were trained using 64 Google Cloud TPUs for a total of 32 hours. As mentioned in the Google 2019 environment report,¹⁸ "TPUs are highly efficient chips which have been specifically designed for machine learning applications". These accelerators run on Google Cloud, which has "matched 100% of its electricity consumption with renewable energy purchases, and has committed to fully decarbonize its electricity supply by 2030" (https://cloud.google.com/sustainability). More details on training time are provided in Appendix A.1.

Footnote 18: https://www.gstatic.com/gumdrop/sustainability/google-2019-environmental-report.pdf
References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Nader Akoury, Shufan Wang, Josh Whiting, Stephen Hood, Nanyun Peng, and Mohit Iyyer. 2020. STORIUM: A dataset and evaluation platform for machine-in-the-loop story generation. In Proceedings of Empirical Methods in Natural Language Processing.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems.

Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference of Machine Learning.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Liam Dugan, Daphne Ippolito, Arun Kirubarajan, and Chris Callison-Burch. 2020. RoFT: A tool for evaluating human detection of machine-generated text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the Association for Computational Linguistics.

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the Association for Computational Linguistics.

Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In Proceedings of the International Conference of Machine Learning.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. In Conference of the North American Chapter of the Association for Computational Linguistics.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. In Proceedings of the International Conference of Machine Learning.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. First Quora dataset release: Question pairs.

Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282.

Yacine Jernite. 2020. Explain anything like I'm five: A model for open domain long form question answering. https://yjernite.github.io/lfqa.html.
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of Empirical Methods in Natural Language Processing.

Divyansh Kaushik and Zachary C Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of Empirical Methods in Natural Language Processing.

Stanley Kok and Chris Brockett. 2010. Hitting the right paraphrases in good time. In Conference of the North American Chapter of the Association for Computational Linguistics.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Association for Computational Linguistics.

Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, and Luke Zettlemoyer. 2020a. Pre-training via paraphrasing. In Advances in Neural Information Processing Systems.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2020b. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the Association for Computational Linguistics.

Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020c. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of Advances in Neural Information Processing Systems.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2020d. Question and answer test-train overlap in open-domain question answering datasets. arXiv preprint arXiv:2008.02637.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of Empirical Methods in Natural Language Processing.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, et al. 2020. KILT: A benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252.

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In Proceedings of the International Conference on Learning Representations.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational AI: The science behind the Alexa Prize. arXiv preprint arXiv:1801.03604.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of Empirical Methods in Natural Language Processing.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2020. Efficient content-based sparse attention with routing transformers. In Transactions of the Association for Computational Linguistics.

Thibault Sellam, Dipanjan Das, and Ankur P Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the Association for Computational Linguistics.
Simeng Sun, Ori Shapira, Ido Dagan, and Ani Nenkova. 2019. How to compare summarizers without target length? Pitfalls, solutions and re-examination of the neural summarization literature. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 21–29.

Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. arXiv preprint arXiv:1910.08684.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. CoRR, abs/1803.07416.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the Association for Computational Linguistics.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the International Conference on Learning Representations.

Hai Wang and David McAllester. 2020. On-the-fly information retrieval augmentation for language models. In Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events, pages 114–119.

Sam Wiseman, Stuart M Shieber, and Alexander M Rush. 2017. Challenges in data-to-document generation. In Proceedings of Empirical Methods in Natural Language Processing.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2018. TransferTransfo: A transfer learning approach for neural network based conversational agents. In NeurIPS CAI Workshop.

Deshraj Yadav, Rishabh Jain, Harsh Agrawal, Prithvijit Chattopadhyay, Taranjeet Singh, Akash Jain, Shiv Baran Singh, Stefan Lee, and Dhruv Batra. 2019. EvalAI: Towards better evaluation systems for AI agents. arXiv preprint arXiv:1902.03570.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9054–9065.

Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D Manning, and Curtis P Langlotz. 2020. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. In Proceedings of the Association for Computational Linguistics.

Chunting Zhou, Jiatao Gu, Mona Diab, Paco Guzman, Luke Zettlemoyer, and Marjan Ghazvininejad. 2020. Detecting hallucinated content in conditional neural sequence generation. arXiv preprint arXiv:2011.02593.

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.
A Appendices for "Hurdles to Progress in Long-form Question Answering"

A.1 Training & Model Details

All our models are developed and trained using TensorFlow 1.15 (Abadi et al., 2016) and Tensor2Tensor (Vaswani et al., 2018). Our implementations are based on the open-source codebases of REALM¹⁹ and the Routing Transformer.²⁰ Similar to the REALM implementation, we use separate processes to run the retriever and generate training data (using a MIPS search). Since our retriever is frozen, we do not use the document index refresher available in their codebase.

Footnote 19: https://github.com/google-research/language/tree/master/language/realm
Footnote 20: https://github.com/google-research/google-research/tree/master/routing_transformer

Retriever: Our retriever is trained on 64 Google Cloud TPUs for a total of 4k steps and a batch size of 12288. We do early stopping on the validation data (with a smaller batch size of 512 due to smaller P100 GPU memory). Our model converges quite fast, reaching its best performance in 1.5k steps (in 43 minutes) and needing 103 minutes for the full set of 4k steps.

Generator: Our generator is trained on 64 Google Cloud TPUs, for a total of 100k steps on the ELI5 training set. We use the pg19_local_cluster8k configuration available in the Routing Transformer implementation. Besides the default hyperparameters, setting 15% input, attention and ReLU dropout was critical to prevent overfitting on the training set. We use a learning rate of 5e-5. Our retrievals, questions and answers are truncated / padded to 288 subword tokens (using the PG19 subword tokenizer). We use a minibatch size of 128 QA pairs, which corresponds to 332k tokens per mini-batch (of which the loss is computed over the last 288 answer tokens, or 37k total tokens). We do not compute loss over padded tokens, and use special symbols to separate different parts of the input context. We reverse the retrieved paragraphs in the context since the model uses local attention layers, and we wanted higher-ranked retrievals to appear closer to the answer tokens. Our models take about 30 hours to finish 100k steps (0.92 steps / second).

Attention Maps: We show the 2D plots of our generator's attention maps in Figure 3.

[Figure 3: Figures (from Roy et al., 2020) showing 2-D attention schemes for the sparse attention mechanism used in the Routing Transformer. (a) Local attention: lower layers pool in local information via sliding-window local attention. (b) Routing attention: upper layers gather global information for every token via clustering.]

Hyperparameter Choices: We experimented with several different pretraining strategies (using Wikipedia), smaller model variants and hyperparameter choices manually in preliminary experiments. All these experiments performed quite poorly on ELI5, producing very short and sometimes incoherent responses. Finally, switching to a Routing Transformer model which was pretrained on a long-form language modeling dataset (PG-19) significantly improved generation quality. Hyperparameters for this pretrained model (like hidden size / number of layers) were manually chosen with model capacity in mind. For our final experiments with this pretrained model we did not perform any hyperparameter search during training, primarily due to the expensive setup required to train the system. During inference, we tuned the nucleus sampling value from 0.0 to 1.0 in increments of 0.1, choosing the value with the best validation set performance. Our hyperparameter choices for contrastive learning on the retriever have been justified in an ablation study in Appendix A.2. Notably, we use very large minibatches of 12,288 to scale the number of negative examples.
To train this model, we used the standard trick of data parallelism across 64 hardware accelerators. This resulted in an effective mini-batch size of 192 per chip, which is small enough to fit a BERT-base sized model on a TPU v3 chip's memory. To accumulate information across different chips before the final softmax, we used the tf.tpu.cross_replica_sum function (using an open-source wrapper found here).
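To illustrate why a cross-replica sum suffices here, the NumPy sketch below simulates computing the global in-batch softmax denominator from per-replica partial sums. It is a conceptual illustration, not the TPU implementation (which would also subtract a global maximum for numerical stability).

    import numpy as np

    def global_softmax_denominator(local_q_emb, answer_shards):
        # local_q_emb: [b, d] questions held by one replica.
        # answer_shards: list of [b, d] answer blocks, one per replica.
        # Each replica contributes a partial sum of exp(q . a) over its local
        # answers; adding the partials (the role played by the cross-replica
        # sum) yields the full denominator over all in-batch candidates.
        partials = [np.exp(local_q_emb @ a.T).sum(axis=1) for a in answer_shards]
        return np.stack(partials).sum(axis=0)   # [b] global denominators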