A Well-Composed Text is Half Done! Composition Sampling for Diverse Conditional Generation
Shashi Narayan (Google Research, shashinarayan@google.com), Gonçalo Simões (Google Research, gsimoes@google.com), Yao Zhao (Google Brain, yaozhaoyz@google.com), Joshua Maynez (Google Research, joshuahm@google.com), Dipanjan Das (Google Research, dipanjand@google.com), Michael Collins (Google Research, mjcollins@google.com), Mirella Lapata (Google Research, lapata@google.com)

arXiv:2203.15108v1 [cs.CL] 28 Mar 2022

Abstract

We propose Composition Sampling, a simple but effective method to generate diverse outputs for conditional generation of higher quality compared to previous stochastic decoding strategies. It builds on recently proposed plan-based neural generation models (Narayan et al., 2021) that are trained to first create a composition of the output and then generate by conditioning on it and the input. Our approach avoids text degeneration by first sampling a composition in the form of an entity chain and then using beam search to generate the best possible text grounded to this entity chain. Experiments on summarization (CNN/DailyMail and XSum) and question generation (SQuAD), using existing and newly proposed automatic metrics together with human-based evaluation, demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse meaningful outputs.

1 Introduction

In many NLG tasks, it is important to be able to generate multiple diverse outputs from a model. Tasks like summarization (Mani, 2001; Nenkova and McKeown, 2011) and question generation (Zhou et al., 2017) exhibit one-to-many relationships; there can be multiple semantically diverse summaries or questions for the same source, and it may be useful for a model to be able to generate multiple outputs. Yet, the primary focus of recent research in NLG has been on improving the quality of single-best outputs (Raffel et al., 2019; Lewis et al., 2019; Dong et al., 2019; Zhang et al., 2020a; Narayan et al., 2021), while diversity remains an unsolved problem (Hashimoto et al., 2019; Zhang et al., 2021). This is particularly challenging in conditional generation, where diversity in the target sequence should not come at the cost of correctness or faithfulness; for example, alternate summaries are not valuable if they are unfaithful to the input document(s) (Maynez et al., 2020; Kryscinski et al., 2020). In this work, we investigate decoding methods for generating semantically diverse text which is also faithful to its input, focusing on two tasks, namely summarization and question generation.

Beam search (Li et al., 2016; Wiseman et al., 2017) has proven successful for single-best generation (Rush et al., 2015; Barrault et al., 2020; Meister et al., 2020), but struggles to generate diverse output (Vijayakumar et al., 2016). Stochastic sampling strategies, such as top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2020), are better at generating diverse sequences but are not suitable for conditional generation as they degenerate,[1] producing output that is not faithful to the source. Figure 1 exposes degeneration in summary output using nucleus sampling.

[1] Holtzman et al. (2020) use the term "degeneration" to describe automatically generated text that is generic, repetitive, and awkward for story continuation. These issues are less common in conditional generation. In our case, "degenerate" refers to text unfaithful or inconsistent to the input.

To address these shortcomings, we propose Composition Sampling, a simple but effective hybrid decoding method for diverse and faithful conditional generation. It builds on recently proposed generation models (Narayan et al., 2021) that are trained to first plan a semantic composition of the target and then generate the text conditioned on the composition and the input. Composition sampling first samples a composition in the form of an entity chain and then uses beam search to generate the best possible sequence grounded to the sampled entity chain. Unlike top-k or nucleus sampling, it avoids degeneration by instilling diversity in composition, rather than directly on the surface form.

Our contributions can be summarized as follows: (a) we introduce Composition Sampling, a simple yet effective decoding method for diverse conditional generation, which combines planning with stochastic sampling; (b) we propose several metrics to compute semantic diversity in generated text; our metrics are complementary to lexical diversity (e.g., Self-BLEU; Zhu et al. 2018; Alihosseini et al. 2019) and assess whether a set of diverse outputs are contextually dissimilar (Self-BERTScore; Zhang et al. 2020b) or non-entailing (Self-Entailment); and (c) finally, we introduce EDNA, a novel metric aiming to "Evaluate Diversity aNd fAithfulness" for summarization by quantifying whether summaries in a diverse set are faithful to their input without entailing each other.

Evaluation on two popular summarization tasks, namely highlight generation (CNN/DailyMail; Hermann et al. 2015) and extreme summarization (XSum; Narayan et al. 2018), and question generation (SQuAD; Rajpurkar et al. 2016; Zhou et al. 2017), shows that composition sampling is most effective in generating diverse summaries or questions. When assessed by humans, composition sampled summaries are as faithful as the best summaries produced with beam search. In comparison, nucleus sampled summaries can be as diverse but far less faithful. Taken together our results demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse and meaningful output.[2]

[2] Our checkpoints and spaCy annotation code are available at https://github.com/google-research/language/tree/master/language/frost.

Human Written Summary: Chelsea star Eden Hazard is set to make his 100th top-flight appearance. Santi Cazorla should hit the same milestone when Arsenal meet Burnley. Both players have impressed since moving to the Premier League in 2012. Hazard has more goals this season but Cazorla has one more assist. Sportsmail's reporters choose the player who has excited them the most.

Beam Search: Eden Hazard and Santi Cazorla are both set to make their 100th Premier League appearances this weekend. Both players have been hugely influential since they moved to London. Here, Sportsmail's reporters choose the player they most enjoy seeing in action.

Nucleus Sampling: Eden Hazard and Santi Cazorlag will each make their 100th Premier League appearance this weekend. nightstandapplication.com. Sportsmail's hovercraft reporters choose their man of the match countermeasures.

Nucleus Sampling: By making their 100th Premier League appearances this weekend, Eden Hazard and Santi Cazor halibut will set new records. Here, Anna Coren and Dominic King select their favourites.

Composition Sampling: (Eden Hazard | Santi Cazorla | Chelsea | Arsenal | Premier League ||| London | 2012 |||) Eden Hazard and Santi Cazorla are set to make their 100th appearances for Chelsea and Arsenal respectively in the Premier League this weekend. Both players have been hugely influential since they moved to London in the summer of 2012. But who has been the most exciting import to watch?

Composition Sampling: (Chelsea | Eden Hazard | Arsenal | Santi Cazorla ||| Sportsmail ||| London) Chelsea's Eden Hazard and Arsenal's Santi Cazorla will both make 100th appearances this weekend. Sportsmail's reporters pick the player they most enjoy seeing in action. Both players have been hugely influential since moving to London.

Figure 1: Human written summary, single-best predicted summary using beam search (beam size 8), and diverse summaries with nucleus sampling (p = 0.95) and our composition sampling for a CNN/DailyMail article (shown in the Appendix, Figure 6). We highlight spans in orange that are not faithful to the input.

2 Background

Conditional generation tasks such as summarization (See et al., 2017), data-to-text generation (Wiseman et al., 2017), and machine translation (Bahdanau et al., 2015) are typically modeled using attention-based encoder-decoder architectures (Bahdanau et al., 2015; Gu et al., 2016; Vaswani et al., 2017). The encoder first encodes the input text d and then the decoder predicts the output s_{1:n} (e.g., the translation or summary of d) one token at a time as p(s_i | s_1, ..., s_{i-1}; d), where n is the output length and s_i is the i-th token in the output. Often these models benefit from large scale task-agnostic pretraining (Song et al., 2019; Radford et al., 2018; Lewis et al., 2019; Rothe et al., 2020; Raffel et al., 2019; Zhang et al., 2020a).

Plan-based Conditional Generation  Narayan et al. (2021) develop a plan-based approach for neural summarization; their decoder generates a composition c_{1:m} of target summary s as p(c_j | c_1, ..., c_{j-1}; d), and then the same decoder produces s as p(s_i | s_1, ..., s_{i-1}; c; d), conditioned on input d and composition c_{1:m}, with m being the composition length. Specifically, they adopt entity chains as the composition c of summary s, under the assumption that entities in the chain ought to be observed in the output summary. During inference, the model takes document d as input and generates c; s, the concatenation of composition and summary sequences, instead of generating s directly; c and s are prefixed with special markers "[CONTENT]" and "[SUMMARY]", respectively, as shown in Figure 2. If s consists of multiple sentences, markers "|||" denote sentence boundaries in composition c.
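To make this format concrete, the following is a minimal sketch (not the authors' released code) of how a "[CONTENT] ... [SUMMARY] ..." prediction could be parsed into an entity chain and a summary, and how the chain could be restricted to entities that appear in the source document, in the spirit of the constrained variant discussed in the next section. The marker strings and the "|" / "|||" separators follow the examples in this paper; the function names and the simple substring matching are assumptions for illustration.

```python
from typing import List, Tuple

CONTENT, SUMMARY = "[CONTENT]", "[SUMMARY]"

def split_plan_and_summary(prediction: str) -> Tuple[str, str]:
    """Split a concatenated 'c; s' prediction into its entity chain and summary."""
    plan, _, summary = prediction.partition(SUMMARY)
    return plan.replace(CONTENT, "").strip(), summary.strip()

def parse_entity_chain(chain: str) -> List[List[str]]:
    """'A | B ||| C' -> [['A', 'B'], ['C']]; '|||' marks summary-sentence boundaries."""
    sentences = [s for s in chain.split("|||") if s.strip()]
    return [[e.strip() for e in s.split("|") if e.strip()] for s in sentences]

def drop_unsupported_entities(chain: List[List[str]], document: str) -> List[List[str]]:
    """Keep only entities found verbatim in the source document
    (a simple stand-in for the constrained, FROST++-style plan)."""
    doc = document.lower()
    return [[e for e in sent if e.lower() in doc] for sent in chain]

if __name__ == "__main__":
    pred = "[CONTENT] Eden Hazard | Santi Cazorla ||| London | 2012 [SUMMARY] Eden Hazard and Santi Cazorla ..."
    plan, summary = split_plan_and_summary(pred)
    print(parse_entity_chain(plan))  # [['Eden Hazard', 'Santi Cazorla'], ['London', '2012']]
```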
Figure 2: Illustration of composition sampling and other decoding strategies with vanilla and plan-based generation models. The term "composition" is inspired from the quote "A Well-Composed Painting is Half Done" from French painter Pierre Bonnard. Images in black-and-white are early sketches or compositions of the painting in color. Nucleus or focus sampling often lead to hallucinations (highlighted spans in red); corresponding color images are blurred to illustrate this. (Credit: The image of "Anna Pavlovna of Russia" is taken from Wikipedia.)

The approach allows us to directly manipulate the content of summaries and their quality. For example, we might inspect the predicted chain during inference and drop entities which are not present in the input document, thereby controlling for hallucinations (Narayan et al., 2021). Outwith summarization, similar constraints can be easily adapted to other conditional generation tasks.

Maximization-Based Decoding  In order to obtain the most likely output ŝ from encoder-decoder models, we typically solve a maximization-based objective: x̂ = arg max_x p(x|d), where x is either the predicted output text s (for models without planning) or the concatenation of the predicted composition and the output text c; s (for models with planning). It is standard practice to use beam search (Tillmann and Ney, 2003; Li et al., 2016; Wiseman et al., 2017), as solving the objective for the optimal sequence with neural sequence models is not tractable (Chen et al., 2018).

Stochastic Sampling for Diverse Decoding  Sampling-based strategies have been widely used to induce diversity in language models. Temperature sampling uses a temperature to skew the distribution towards high probability tokens at each decoding step (Ackley et al., 1985; Ficler and Goldberg, 2017; Fan et al., 2018), while top-k sampling truncates the distribution to the k highest probability tokens (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019). Similarly to top-k sampling, nucleus sampling (Holtzman et al., 2020) also truncates the tail of the distribution but chooses k dynamically. At each decoding step, it samples high-probability tokens from a nucleus N defined as the smallest subset of tokens from the vocabulary V with cumulative probability p′ ≥ p, where p is the pre-specified mass of the nucleus.

Aralikatte et al. (2021) introduce focus sampling to promote diversity in summarization models. It constructs a subset V_k ⊆ V by sampling k source-relevant and topical tokens from the vocabulary distribution. Standard beam search decoding is then used to generate a summary limited to V_k. However, the authors show that focus sampling is very sensitive to k; increasing it improves generation quality but at the cost of diversity.

3 Composition Sampling

Composition Sampling is a novel hybrid method which combines stochastic sampling with maximization-based decoding, whilst leveraging plan-based generation (Narayan et al., 2021). Specifically, we first employ nucleus sampling to obtain diverse compositions c_sample from p(c|d), where d is the input text and c are entity chains (prefixed with "[CONTENT]" in Figure 2), and then employ beam search to generate the most-likely output s (prefixed with "[SUMMARY]" in Figure 2), given input d and composition c_sample, as p(s|c_sample; d). We will experimentally show that composition sampling enables the generation of fluent, faithful and diverse texts for conditional generation.

Why Entity Chains?  Unlike top-k or nucleus sampling, composition sampling avoids degeneration by introducing diversity in composition, rather than directly on the surface form. For this to effectively work, the choice of c needs to be well correlated with an underlying notion of "semantic composition", which we want to "diversify"; if c1 and c2 are two semantic compositions for input d such that c1 ≠ c2, then two summaries s1 = arg max_s p(s|c1; d) and s2 = arg max_s p(s|c2; d) are bound to be diverse. In our work, we have chosen entity chains to model semantic compositions; entity chains have been widely studied to model entity-level lexical cohesion (Barzilay and Elhadad, 1997) and coherence (Halliday and Hasan, 1976; Azzam et al., 1999) in text. Also, entity chains are unique to d, and thus can be easily distinguished from compositions for other inputs. Moreover, entity chains provide a very effective knob for content control in abstractive generation; e.g., compositions can be constrained to entities only present in the input document, thereby avoiding hallucinations and entity degeneration.

Hypothesis 1: If the semantic composition c of the output text s corresponds to entity chains, then learning p(c|d) is much easier than learning p(s|d), where d is the input. Hence, we can sample from p(c|d) with higher confidence than sampling directly from p(s|d), and then compute arg max_s p(s|c; d).

We demonstrate the effectiveness of entity chains as a choice for c using the summarization example in Figure 3. The negative log likelihood of generating the summary s from scratch without planning (-log p(s|d)) is 121.18, while the negative log likelihood of generating composition c with planning (-log p(c|d)) is 46.95; hence, the model is much more confident when sampling from p(c|d) than directly from p(s|d).

Figure 3: Probabilities of generating underlined entities in the human written reference summary from Figure 1 (input article shown in Figure 6): when the summary is generated directly (Generate, Summary), when the entity chain "Chelsea | Eden Hazard ||| Santi Cazorla | Arsenal | Burnley ||| Premier League | 2012 ||| Hazard | Cazorla | one ||| Sportsmail" is predicted first during planning (Plan-Generate, Entity Chain), and when the entities are predicted in the summary after planning (Plan-Generate, Summary). All probabilities were computed with PEGASUS (Zhang et al., 2020a) finetuned models. The symbol "_" denotes start of token.

In Figure 3, the plan-based model is more confident in predicting entities than its counterpart without planning; perplexities of predicting entities in the summary with and without planning are 0.24 and 1.36, respectively, and perplexities of generating the whole summary with and without planning are 1.15 and 1.48, respectively. In fact, despite the increased length of the target in the plan-based model (i.e., c_{1:m}; s_{1:n} instead of s_{1:n}), we find that the perplexity of predicting the longer sequence (c_{1:m}; s_{1:n}) is lower than predicting just the summary without any planning, due to grounding (1.16 vs 1.48). Overall, the plan-based approach, p(c; s|d), learns a more confident distribution at each decoding step compared to no planning, i.e., p(s|d). For the example in Figure 3, the average cumulative probabilities for the top 15 tokens in the vocabulary distribution at each decoding step are 0.283 for p(s|d) and 0.433 for p(c; s|d).

Why Grounded Generation?  The generation of s is inherently grounded to its entity composition c; following Narayan et al. (2021), the entity chains are extracted from their targets during training. Hence, once the hard part of planning the composition is done, the model is less perplexed during generation of the output.

In the following we assess composition sampling for its ability to generate semantically diverse output for two tasks, namely summarization (Section 4) and question generation (Section 5).
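The two-stage decode can be summarized in a short sketch. This is an illustration rather than the authors' implementation: top_p_filter shows the nucleus truncation from Section 2 applied to a generic next-token distribution, and composition_sample strings the two stages together assuming user-supplied callables for nucleus-sampling the plan from p(c|d) and beam-searching the output from p(s|c; d); the defaults p = 0.95, beam size 8, and 5 samples mirror the experimental setup in Section 4.1.

```python
import numpy as np
from typing import Callable, List, Optional

def top_p_filter(probs: np.ndarray, p: float = 0.95) -> np.ndarray:
    """Nucleus truncation: keep the smallest set of tokens whose cumulative
    probability reaches p, zero out the tail, and renormalize."""
    order = np.argsort(probs)[::-1]                 # token ids, most probable first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1       # just enough tokens to reach mass p
    keep = order[:cutoff]
    out = np.zeros_like(probs, dtype=float)
    out[keep] = probs[keep]
    return out / out.sum()

def composition_sample(
    document: str,
    sample_plan: Callable[[str, float], str],       # nucleus-samples c ~ p(c|d)
    beam_search: Callable[[str, str, int], str],    # approximates argmax_s p(s|c; d)
    constrain: Optional[Callable[[str, str], str]] = None,  # e.g. drop unsupported entities
    num_samples: int = 5, p: float = 0.95, beam_size: int = 8,
) -> List[str]:
    """Composition sampling: stochastic plan, deterministic grounded realization."""
    outputs = []
    for _ in range(num_samples):
        plan = sample_plan(document, p)             # stage 1: diverse entity chain
        if constrain is not None:
            plan = constrain(plan, document)        # optional FROST++-style filtering
        outputs.append(beam_search(document, plan, beam_size))  # stage 2: grounded text
    return outputs
```

The split makes the design choice explicit: all stochasticity lives in the short plan, while the surface text is always the model's single most likely realization of that plan.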
4 Single Document Summarization

4.1 Datasets and Models

We evaluate our decoding strategy on two popular single document summarization datasets: CNN/DailyMail highlight generation (Hermann et al., 2015) and XSum extreme summarization (Narayan et al., 2018), using the original train/validation/test splits. Inputs and outputs were truncated to 512 and 128 for XSum, and 1,024 and 256 for CNN/DailyMail.[3]

[3] We also experimented with MultiNews (Fabbri et al., 2019), a multi-document summarization dataset. Results can be found in the Appendix (Table 7).

We conduct experiments with state-of-the-art pretrained models for summarization, namely PEGASUS (Zhang et al., 2020a) and FROST (Narayan et al., 2021). Our PEGASUS finetuned model generates summaries directly, whereas FROST generates the entity chain followed by the summary. In both cases we use large transformer architectures (Vaswani et al., 2017) with L = 16, H = 1,024, F = 4,096, A = 16 (568M parameters), where L denotes the number of layers for encoder and decoder Transformer blocks, H is the hidden size, F the feed-forward layer size, and A the number of self-attention heads. Since this paper is proposing a decoding strategy, there is no need to train new summarization models. We use the publicly available PEGASUS and FROST checkpoints. Training details and model hyperparameters can be found in Zhang et al. (2020a) and Narayan et al. (2021).

All models are decoded with a beam size of 8 and a length-penalty of 0.8. For nucleus sampling and composition sampling, we use nucleus probability p set to 0.95.[4] For focus sampling (Aralikatte et al., 2021), we use k = 10,000.

[4] Results with different temperatures and nucleus probabilities for random sampling, nucleus sampling, and composition sampling are in Figure 4.

4.2 Evaluation Metrics

We assess our decoding strategy for likelihood, fluency, relevance, faithfulness, and diversity, using both automatic and human evaluation. FROST models predict a plan in the form of an entity chain, followed by a summary. All evaluations, except likelihood, are done on the summary, while the predicted entity chains are stripped out. For each diverse decoding strategy, we sample 5 times for each test document and report the average.

Sequence Likelihood  We report the perplexity of the generated sequence (i.e., entity chains concatenated with their summaries for planning models, and summaries only for the others) using various decoding strategies.

Lexical Fluency and Relevance  We report ROUGE-L F1 scores (Lin and Hovy, 2003) against reference summaries.[5]

[5] We lowercased candidate and reference summaries and used pyrouge with parameters "-a -c 95 -m -n 4 -w 1.2".

Semantic Relevance  We report BERTScore (Zhang et al., 2020b), which computes the contextual similarity between a candidate and its reference.

Faithfulness  We follow Maynez et al. (2020) and report on textual entailment (Pasunuru and Bansal, 2018; Falke et al., 2019; Kryscinski et al., 2020). In particular, we report the probability of a summary entailing (Entailment) its input document using a classifier trained by fine-tuning an uncased BERT-Large pretrained model (Devlin et al., 2019) on the Multi-NLI dataset (Williams et al., 2018).

We further assess faithfulness by humans. Our annotators, proficient in English, were tasked to read a document and then grade its summary on a scale of 1-4 (entirely unfaithful, somewhat unfaithful, somewhat faithful, and entirely faithful); a summary is "entirely faithful" if its content is fully supported or can be inferred from the document. We collected 3 ratings for each (document, summary) pair; we report average system ratings (across documents). For summaries deemed "somewhat unfaithful" or "somewhat faithful", annotators were asked to also specify what was faithful or unfaithful, to improve agreement.

Diversity  We report the number of times (out of five samples) a decoding technique is able to generate a completely new summary (Unique). We also use Self-BLEU (Zhu et al., 2018; Alihosseini et al., 2019) to measure lexical diversity in the generated summaries. We consider all pairs of summaries out of the 5 samples, and for each pair we compute the BLEU score (Papineni et al., 2002) considering one summary as a hypothesis and the other as a reference. We report the average BLEU score as the Self-BLEU of the document. The lower the Self-BLEU for a decoding strategy, the better it is at generating a more diverse set of summaries.
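The pairwise "Self-*" metrics share one recipe: score every ordered pair of sampled outputs with some similarity function and average. The sketch below is an illustration rather than the paper's exact evaluation code: it uses NLTK's sentence-level BLEU as one possible pairwise scorer, and Self-Entailment or Self-BERTScore would substitute an MNLI entailment classifier or BERTScore for bleu. It also includes EDNA, the harmonic mean of Entailment and (1 - Self-Entailment) introduced later in this section. Whether the paper averages over ordered or unordered pairs, and its exact BLEU configuration, are assumptions here.

```python
from itertools import permutations
from statistics import mean
from typing import Callable, List

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_score(samples: List[str], pair_score: Callable[[str, str], float]) -> float:
    """Average pair_score(hypothesis, reference) over all ordered pairs of samples."""
    return mean(pair_score(hyp, ref) for hyp, ref in permutations(samples, 2))

def bleu(hyp: str, ref: str) -> float:
    """Sentence-level BLEU-4 with simple smoothing (one possible instantiation)."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)

def edna(entailment: float, self_entailment: float) -> float:
    """EDNA: harmonic mean of Entailment and (1 - Self-Entailment)."""
    diversity = 1.0 - self_entailment
    if entailment + diversity == 0:
        return 0.0
    return 2 * entailment * diversity / (entailment + diversity)

# Usage: self_bleu = self_score(five_samples, bleu); Self-Entailment and
# Self-BERTScore plug an entailment probability or BERTScore in place of `bleu`.
```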
Model XSum CNN/DailyMail R1 R2 RL R1 R2 RL GSum (Dou et al., 2020) 45.40 21.89 36.67 45.94 22.32 42.48 CTRLsum (He et al., 2020) — — — 45.65 22.35 42.50 FAME (Aralikatte et al., 2021) 45.31 22.75 37.46 42.95 20.79 39.90 Single P EGASUS (Zhang et al., 2020a) 47.56 24.87 39.40 44.05 21.69 40.98 F ROST (Narayan et al., 2021) 47.80 25.06 39.76 45.11 22.11 42.01 F ROST++ (Narayan et al., 2021) 44.94 21.58 37.20 45.08 22.14 41.99 Focus (FAME) 42.76 19.89 34.97 — — — Diverse Nucleus (P EGASUS) 38.49 16.57 30.99 36.27 15.10 33.46 Nucleus (F ROST) 40.26 17.83 32.49 38.49 15.71 35.49 Composition (F ROST) 45.12 22.24 36.98 41.76 18.94 38.69 Composition (F ROST++ ) 43.82 20.35 35.89 42.37 19.48 39.28 Table 1: Comparison of decoding strategies with ROUGE: single-best vs diverse decoding (we report averages over 5 samples). Best results in each block are bold-faced. See Table 5 in the Appendix for more comparisons. We propose two additional measures to capture Self-Entailment to measure diversity is because semantic diversity in summaries: Self-Entailment the faithfulness metric is also based on Entailment. and Self-BERTScore. Similar to Self-BLEU, we This means that both components will be mapped compute the Entailment score and BERTScore to a score in a similar output space (i.e., they both for each possible pair of summaries, respectively yield values between 0 and 1 obtained through the and report the average. A lower Self-Entailment same trained model), making it more likely to be value suggests that the generated summaries do properly balanced when mixed. not entail each other. Analogously, a low Self- BERTScore value indicates that the decoding tech- 4.3 Results nique is able to generate more contextually dissim- Table 1 presents ROUGE results on the XSum and ilar summaries. CNN/DailyMail test sets. The top block includes We further assess diversity by humans. Our an- results for models which employ maximization- notators, proficient in English, again read two sum- based decoding. GSum (Dou et al., 2020) is a state- maries (out of five samples) and then graded the of-the art system which decodes summaries guided pair on a scale of 1–4 (identical, somewhat identi- by an an extractive model at test time. CTRLsum cal, somewhat diverse, and diverse); the document (He et al., 2020) controls the summarization output was not shown in this assessment. Two summaries trough keywords and automatically extracted sen- are “identical” if they are semantically equivalent, tences. FAME (Aralikatte et al., 2021) uses a focus while the same information may be presented dif- attention mechanism to encourage the decoder to ferently in the case of “somewhat identical”. A proactively generate tokens that are similar or top- “somewhat diverse” pair may introduce one or two ical to the input document. As mentioned earlier new concepts or topics in one summary. A “diverse” F ROST (Narayan et al., 2021) first generates an en- pair should introduce new concepts or topics in tity chain and then a summary while F ROST++ is each summary. We collected 3 ratings for each pair a constrained variant which restricts the predicted and report their average. This assessment was only entities to those present in the document. We also done with single-sentence XSum summaries, in fu- show results for a vanilla P EGASUS model (Zhang ture work we will explore how to do this effectively et al., 2020a) finetuned on our datasets. for multi-sentence summaries. 
The bottom block focuses on diverse decoding Diversity and Faithfulness For summarization, (we report averages across five samples). We show diverse summaries are not meaningful if they are results with Focus sampling (Aralikatte et al., 2021) not faithful to the input. We propose E DNA, built on top of FAME, Nucleus sampling (Holtzman a novel measure for “Evaluating Diversity aNd et al., 2020) with P EGASUS and F ROST, and our fAithfulness” in summaries. E DNA is the harmonic Composition sampling. mean of Entailment and (1−Self-Entailment); Table 2 presents more detailed faithfulness and higher values of E DNA imply more faithful and diversity results, on challenge sets consisting of diverse summaries. The reason E DNA relies on 50 documents for each XSum and CNN/DailyMail
With Ref. Faithfulness Diversity Div+Faith Models ppl RL BSc Ent Human Uniq S-BL S-Ent S-BSc Human E DNA FAME — 34.23 0.70 0.24 2.19 — — — — — — Single P EGASUS 0.51 40.69 0.76 0.40 2.52 — — — — — — F ROST 0.31 40.90 0.75 0.37 2.63 — — — — — — F ROST++ 0.71 33.75 0.70 0.44 2.78 — — — — — — XSum Focus (FAME) — 29.19 0.66 0.23 1.88 2.6 89.51 0.62 0.91 1.84 0.09 Nucleus (P EGASUS) 1.47 31.10 0.68 0.24 2.00 5.0 26.22 0.10 0.68 3.11 0.30 Diverse Nucleus (F ROST) 0.83 33.81 0.71 0.22 2.11 5.0 31.08 0.10 0.71 3.08 0.27 Composition (F ROST) 0.51 36.95 0.73 0.27 2.37 4.7 58.94 0.17 0.79 2.73 0.30 Composition (F ROST++ ) 0.74 33.87 0.70 0.43 2.75 3.5 76.87 0.40 0.84 2.25 0.35 P EGASUS 0.35 36.09 0.65 0.70 3.78 — — — — — — Single F ROST 0.30 39.03 0.66 0.72 3.74 — — — — — — CNN/DM F ROST++ 0.37 38.87 0.66 0.79 3.94 — — — — — — Nucleus (P EGASUS) 1.39 28.99 0.62 0.62 3.08 5.0 26.99 0.03 0.63 — 0.70 Diverse Nucleus (F ROST) 1.04 31.58 0.63 0.56 3.08 5.0 29.60 0.03 0.64 — 0.66 Composition (F ROST) 0.52 35.06 0.64 0.59 3.45 5.0 58.60 0.04 0.71 — 0.66 Composition (F ROST++ ) 0.46 35.07 0.64 0.73 3.89 4.9 62.81 0.07 0.72 — 0.78 Table 2: Likelihood, faithfulness and diversity results on 50 documents sampled from XSum and CNN/DailyMail each. We report on perplexity (ppl), Entailment (Ent), Uniqueness (Uniq), Self-BLEU (S-BL), Self-Entailment (S-Ent), Self-BERTScore (S-BSc) and E DNA scores, along with ROUGE (RL) and BERTScore (BSc) for compar- ison. We also report on human judgements for faithfulness and diversity. Additional R1 and R2 numbers can be found in the Appendix (Table 6). Best results in the diverse block for each dataset are bold faced. Scores for single- best decoded summaries are also presented for comparison. Focus (FAME) diverse predictions on CNN/DailyMail are not available. The lower the Self-* metric is, the better the decoding method in generating diverse outputs. summaries. We construct these challenge sets by F ROST++ performs slightly worse than with randomly selecting documents whose reference vanilla F ROST in terms of ROUGE. This is due summaries have non-extractive entity chains in to the extreme abstractive nature of the XSum them; an entity chain is extractive if all entities reference summaries (Maynez et al., 2020); as a in it can be found in the input document. Narayan result, a model is required to hallucinate factual et al. (2021) have found that models struggle to content, that is not necessarily faithful to the generate faithful summaries for documents with input (see examples of XSum summaries in the data-divergence issues (Dhingra et al., 2019). The Appendix, Figure 5). But Composition(F ROST++ ) same challenge sets were used for our human eval- only keeps supported entities in the sampled uations of faithfulness and diversity. plans giving rise to summaries which diverge from their reference. This is not the case with Composition Sampling is not as Performance CNN/DailyMail which is mostly extractive and Diminishing as Nucleus Sampling Single-best we see that ROUGE performance improves with decoding for F ROST achieves 39.76 ROUGE (RL) Composition(F ROST++ ) in Table 1. on XSum,; nucleus and composition sampling fare worse showing an average drop of 7.27 and 2.78, re- Composition Sampling is more Confident in its spectively. 
Similarly, for CNN/DailyMail, ROUGE Predictions than Nucleus Sampling Perplexity drops for nucleus sampling by an average of 6.51 for F ROST predictions increases from 0.31 to 0.83 points, compared to an average drop of 3.28 points for nucleus sampling, but only to 0.51 for compo- for composition sampling (with F ROST). Nucleus sition sampling, on XSum. P EGASUS shows an sampling is even more damaging for non-plan even larger increment in perplexity (from 0.51 to based models, such as P EGASUS; we see an aver- 1.47) for nucleus sampling. Similar patterns are age drop of 8.59 and 7.30 ROUGE points on XSum observed for CNN/DailyMail summaries. and CNN/DailyMail. These gaps are slightly larger Composition(F ROST ++ ) is more perplexed for the challenging subsets in Table 2 which is ex- when generating XSum summaries due to the refer- pected due to the highly abstractive nature of the ence summary divergence issue discussed earlier; reference summaries therein. perplexity increases from 0.51 to 0.74 compared On XSum, composition Sampling with to Composition(F ROST). Interestingly, Composi-
tion(F ROST++ ) is almost as confident in generating Metric Faithfulness Diversity diverse summaries as single-best beam decoding ROUGE -L 0.197 −0.164 (F ROST; perplexities of 0.71 vs 0.74 for XSum). BERTScore 0.209 −0.195 Entailment 0.588 −0.067 Unsurprisingly, Composition(F ROST ++ ) is more 1 - Self-BLEU −0.208 0.880 confident in generating CNN/DailyMail summaries 1 - Self-Entailment −0.187 0.771 than F ROST (0.46 vs 0.52) due to their extractive 1 - Self-BERTScore −0.198 0.873 E DNA 0.482 0.174 nature. Constrained Composition is Most Effective in Table 3: Different automatic metrics and their correla- Generating Meaningful Diverse Summaries It tion against human assessments using Spearman’s rank is no surprise that nucleus sampling is able to coefficient. generate the most diverse summaries on both XSum and CNN/DailyMail (achieving best scores correlated with diversity ratings. Compared to for Self-BLEU, Self-Entailment, Self-BERTScore, other metrics which capture a single dimension, and diversity assessed by humans); however these E DNA is positively correlated with both dimensions summaries perform poorly on faithfulness mea- of diversity and faithfulness. sures. Composition(F ROST++ ) is most effective in Finally, in Figure 4, we plot faithfulness and di- generating faithful summaries, as demonstrated au- versity scores for different decoding strategies with tomatically (with best entailment scores on XSum varying temperatures and nucleus probabilities. and CNN/DailyMail) and by humans (with highest We find that summaries sampled with Composi- ratings on XSum and CNN/DailyMail); these sum- tion(F ROST++ ) are consistently more faithful than maries are also diverse achieving highest E DNA single-best Beam(F ROST) summaries, but worse scores on both summarization datasets. than summaries decoded with Beam(F ROST++ ). We also examined whether models differ in Summaries sampled with Composition(F ROST++ ) terms of faithfulness and diversity as rated by achieve the best E DNA score (with p = 0.95) our participants. We carried out pairwise com- amongst all diverse decoding strategies. parisons using one-way ANOVA with post-hoc Tukey HSD tests (p < 0.01). The difference be- 5 Question Generation tween Nucleus(P EGASUS) and Nucleus(F ROST) is not significant. Nucleus(P EGASUS) was also not 5.1 Dataset and Metrics significantly more faithful than Focus(FAME) for Question generation is often conceptualized as XSum summaries. All other pairwise differences the task of generating a question from a passage- were significant for both faithfulness and diversity. answer pair (Zhou et al., 2017). We experiment on In sum, our results demonstrate that composi- SQuAD (Rajpurkar et al., 2016) and use the split tion sampling is a better alternative to nucleus or of Zhou et al. (2017) consisting of 86,635, 8,965, focus sampling for generating meaningful diverse and 8,964 source-target pairs for training, valida- summaries. Figure 1 presents summaries from dif- tion, and testing, respectively.6 We follow Cho ferent decoding strategies for a CNN/DailyMail et al. (2019) and report BLEU-4 (Top-1, the single- article. Other example predictions for XSum and best accuracy), Oracle (Top-5, the best accuracy CNN/DailyMail articles can be found in the Ap- among the 5-sampled hypotheses), and Self-BLEU pendix (Figures 5–9). (as defined in Section 4). 
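For reference, the Top-1/Oracle protocol described above for question generation can be sketched in a few lines, reusing the bleu helper from the earlier metrics snippet. Treating the first hypothesis as the single-best one and scoring with sentence-level BLEU-4 are assumptions, not necessarily the exact setup of Cho et al. (2019).

```python
from typing import Callable, List, Tuple

def top1_and_oracle(hypotheses: List[str], reference: str,
                    score: Callable[[str, str], float]) -> Tuple[float, float]:
    """Top-1 scores the first (single-best) hypothesis against the reference;
    Oracle (Top-5) takes the best score among all sampled hypotheses."""
    scores = [score(h, reference) for h in hypotheses]
    return scores[0], max(scores)

# e.g. top1_and_oracle(five_sampled_questions, reference_question, score=bleu)
```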
Faithfulness and Diversity Metrics Correlate 5.2 Results with Human Judgements We estimate the ex- tent to which automatic metrics of faithfulness For our question generation experiments we also and diversity correlate with human ratings (using compare models which employ single-best decod- Spearman’s rank correlation coefficient) in Table 3. ing against models which adopt diverse decoding In line with previous work (Maynez et al., 2020; techniques. The top block in Table 4 presents re- Kryscinski et al., 2019), we find that entailment sults for NQG++ (Zhou et al., 2017), a pointer scores are best correlated with faithfulness (mod- generator-based model, CP+GSA (Zhao et al., erate, 0.40 ≤ r ≤ 0.59). Like Self-BLUE, Self- 6 We also experimented with the split of Du et al. (2017). Entailment and Self-BERTScore are also strongly Results can be found in the Appendix (Table 8).
Beam(F ROST) Beam(F ROST++ ) E DNA (Faithfulness+Diversity) 0.8 0.8 Self-Entailment (Diversity) Random(F ROST) Entailment (Faithfulness) 0.8 1.5 Nucleus(F ROST) Composition(F ROST) 0.6 Perplexity Composition(F ROST++ ) 0.6 1 0.6 0.4 0.2 0.5 0.4 0.4 0 0.2 0.4 0.6 0.8 0.9 1.0 0.2 0.4 0.6 0.8 0.9 1.0 0.2 0.4 0.6 0.8 0.9 1.0 0.2 0.4 0.6 0.8 0.9 1.0 Different Temperatures or Nucleus Probabilities Different Temperatures or Nucleus Probabilities Different Temperatures or Nucleus Probabilities Different Temperatures or Nucleus Probabilities a. b. c. d. Figure 4: Perplexity, entailment, self-entailment and E DNA scores on the CNN/DailyMail challenge set (Table 2) with varying temperatures (for random sampling) and nucleus Probabilities (for nucleus and composition sam- pling). For each diverse decoding strategy, we sample 5 times per document and report the average. Models T1 T5 S-BL of 10.64. Composition sampled questions are less NQG++ (Zhou et al., 2017) 13.27 — — diverse then other methods, but outperform all base- Single MCP+GSA (Zhao et al., 2018) 16.85 — — lines on Top-1 and Oracle metrics. Poor diversity P EGASUS (Zhang et al., 2020a) 22.17 — — F ROST (Narayan et al., 2021) 21.04 — — (in terms of Self-BLEU) in composition sampled questions can be attributed to two limitations: (a) top-k Sampling (Fan et al., 2018) 11.53 17.65 45.99 Diverse Beam (Vijayakumar et al.) 13.38 18.30 74.80 SQuAD questions are mostly extractive, and (b) Mixture Decoder (Shen et al., 2019) 15.17 21.97 58.73 questions are generated conditioned on the passage Diverse Mixture Selector (Cho et al., 2019) 15.67 22.45 59.82 and the answer spans; leaving limited scope for Mixture Selector (Wang et al., 2020) 15.34 21.15 54.18 Nucleus (P EGASUS) 12.05 24.72 30.64 models to generate diverse questions. An example Nucleus (F ROST) 10.64 22.49 25.50 in the Appendix (Figure 11) demonstrates the ef- Composition (F ROST) 17.16 27.04 61.68 fectiveness of composition sampling in generating Composition (F ROST++ ) 18.77 26.60 74.89 accurate and diverse questions compared to other Table 4: Comparison of different decoding techniques sampling methods.7 on question generation. We report on BLEU-4 Top-1 accuracy (T1) and Top-5 (T5), and Self-BLEU (S-BL). 6 Conclusion Results for diverse decoding comparison models are taken from Wang et al. (2020). Best results in each We proposed Composition Sampling, a simple yet block are bold-faced. effective decoding method for faithful and diverse conditional generation. Our method is straightfor- 2018), a model which combines a pointer mecha- ward to implement and does not require any exter- nism with a gated self-attention encoder, and fine- nal system to augment the input during inference. tuned P EGASUS and F ROST models. The second Our experiments demonstrate that it is currently block in the table contains several diverse decoding the best available decoding strategy for generating approaches including top-k sampling (Fan et al., diverse and meaningful output. We also introduced 2018), diverse beam search (Vijayakumar et al., Self-Entailment and Self-BERTScore, to automat- 2018), mixture decoding (Shen et al., 2019) and ically compute semantic diversity in summaries, mixture content selection (Cho et al., 2019; Wang and, E DNA, for jointly measuring faithfulness and et al., 2020). We compare these models against diversity. nucleus sampling with P EGASUS and F ROST, and composition sampling with F ROST. 
Acknowledgements As in our summarization experiments, we ob- We thank the reviewers, the ARR action editor, and serve that composition sampling is not as perfor- the senior area chair for their valuable feedback. mance diminishing as nucleus sampling, in terms We would like to thank Ryan McDonald, Ankur BLEU. F ROST achieves a BLEU of 21.04 (top-1) Parikh, and Slav Petrov for their insightful com- in the single-best decoding setting; in comparison, ments. Many thanks also to Ashwin Kakarla and BLEU drops for nucleus sampling by 10.40 points his team for their help with the human evaluation. (on average), and 2.27 points only for composition sampling (F ROST++ ). Nucleus sampled questions 7 Detailed comparative error analysis to Cho et al. (2019) achieve the best pairwise diversity scores (Self- and Wang et al. (2020) was not possible as their predictions BLEU of 25.50), but very low BLEU Top-1 score are not publicly available.
Ethical Considerations sity and quality in text generation models. In Pro- ceedings of the Workshop on Methods for Optimiz- The nature of text generation leads to multiple ethi- ing and Evaluating Neural Language Generation, cal considerations when considering applications. pages 90–98, Minneapolis, Minnesota. Association The main failure mode is that the model can learn for Computational Linguistics. to mimic target properties in the training data that Rahul Aralikatte, Shashi Narayan, Joshua Maynez, are not desirable. Sascha Rothe, and Ryan McDonald. 2021. Focus at- tention: Promoting faithfulness and diversity in sum- Faithfulness and Factuality Since models cre- marization. In Proceedings of the 59th Annual Meet- ate new text, there is the danger that they may nei- ing of the Association for Computational Linguistics ther be faithful to the source material nor factual. and the 11th International Joint Conference on Nat- ural Language Processing (Volume 1: Long Papers), This can be exacerbated when the data itself has pages 6078–6095, Online. Association for Computa- highly abstractive targets, which require the model tional Linguistics. to generate words not seen in the source material during training. This often leads the model to gen- Saliha Azzam, Kevin Humphreys, and Robert Gaizauskas. 1999. Using coreference chains erate content inconsistent with the source mate- for text summarization. In Coreference and Its rial (Maynez et al., 2020; Kryscinski et al., 2020; Applications. Gabriel et al., 2021). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Trustworthy Data If the data itself is not trust- Bengio. 2015. Neural machine translation by worthy (comes from suspect or malicious sources) jointly learning to align and translate. CoRR, abs/1409.0473. the model will naturally become untrustworthy as it will ultimately learn the language and topics of Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, the training data. For instance, if the training data Marta R. Costa-jussà, Christian Federmann, Yvette is about Obama birther conspiracies, and the model Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp is asked to generate information about the early life Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof of Obama, there is a risk that false claims will be Monz, Makoto Morishita, Masaaki Nagata, Toshi- predicted by the model. aki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on Bias in Data Similarly, biases in the data around machine translation (WMT20). In Proceedings of gender, race, etc., risk being propagated in the the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Lin- model predictions, which is common for most NLP guistics. tasks. This is especially true when the models are trained from non-contemporary data that do not Regina Barzilay and Michael Elhadad. 1997. Using represent current norms and practices (Blodgett lexical chains for text summarization. In Intelligent Scalable Text Summarization. et al., 2020). The above considerations are non-malicious, in Su Lin Blodgett, Solon Barocas, Hal Daumé III, and that the model is merely learning to behave as its Hanna Wallach. 2020. Language (technology) is underlying source material. If users of such models power: A critical survey of “bias” in NLP. 
In Pro- ceedings of the 58th Annual Meeting of the Asso- are not aware of these issues and do not account for ciation for Computational Linguistics, pages 5454– them, e.g., with better data selection and evaluation, 5476, Online. Association for Computational Lin- then the generated text can be damaging. guistics. Generation models can also be misused in mali- Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan cious ways. These include generating fake news, May, and Kevin Knight. 2018. Recurrent neural spam, and other text meant to mislead large sec- networks as weighted language recognizers. In tions of the general population. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2261–2271, New References Orleans, Louisiana. Association for Computational David H. Ackley, Geoffrey E. Hinton, and Terrence J. Linguistics. Sejnowski. 1985. A learning algorithm for Boltz- mann machines. Cognitive Science, 9:147–169. Jaemin Cho, Minjoon Seo, and Hannaneh Hajishirzi. 2019. Mixture content selection for diverse se- Danial Alihosseini, Ehsan Montahaei, and Mahdieh So- quence generation. In Proceedings of the 2019 Con- leymani Baghshah. 2019. Jointly measuring diver- ference on Empirical Methods in Natural Language
Processing and the 9th International Joint Confer- of the 56th Annual Meeting of the Association for ence on Natural Language Processing (EMNLP- Computational Linguistics (Volume 1: Long Papers), IJCNLP), pages 3121–3131, Hong Kong, China. As- pages 889–898, Melbourne, Australia. Association sociation for Computational Linguistics. for Computational Linguistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Jessica Ficler and Yoav Goldberg. 2017. Controlling Kristina Toutanova. 2019. BERT: Pre-training of linguistic style aspects in neural language genera- deep bidirectional transformers for language under- tion. In Proceedings of the Workshop on Stylis- standing. In Proceedings of the 2019 Conference tic Variation, pages 94–104, Copenhagen, Denmark. of the North American Chapter of the Association Association for Computational Linguistics. for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Min- Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin nesota. Association for Computational Linguistics. Choi, and Jianfeng Gao. 2021. GO FIGURE: A meta evaluation of factuality in summarization. In Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Findings of the Association for Computational Lin- Ming-Wei Chang, Dipanjan Das, and William Co- guistics: ACL-IJCNLP 2021, pages 478–487, On- hen. 2019. Handling divergent reference texts when line. Association for Computational Linguistics. evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4884–4895, Flo- Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. rence, Italy. Association for Computational Linguis- Li. 2016. Incorporating copying mechanism in tics. sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Com- Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xi- putational Linguistics (Volume 1: Long Papers), aodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, pages 1631–1640, Berlin, Germany. Association for and Hsiao-Wuen Hon. 2019. Unified language Computational Linguistics. model pre-training for natural language understand- ing and generation. In H. Wallach, H. Larochelle, M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion A. Beygelzimer, F. Alché-Buc, E. Fox, and R. Gar- in English. Longman, London. nett, editors, Advances in Neural Information Pro- cessing Systems 32, pages 13042–13054. Curran As- Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. sociates, Inc. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao 2019 Conference of the North American Chapter of Jiang, and Graham Neubig. 2020. Gsum: A general the Association for Computational Linguistics: Hu- framework for guided neural abstractive summariza- man Language Technologies, Volume 1 (Long and tion. Short Papers), pages 1689–1701, Minneapolis, Min- nesota. Association for Computational Linguistics. Xinya Du, Junru Shao, and Claire Cardie. 2017. Learn- ing to ask: Neural question generation for reading Junxian He, Wojciech Kryściński, Bryan McCann, comprehension. In Proceedings of the 55th Annual Nazneen Rajani, and Caiming Xiong. 2020. Ctrl- Meeting of the Association for Computational Lin- sum: Towards generic controllable text summariza- guistics (Volume 1: Long Papers), pages 1342–1352, tion. Vancouver, Canada. Association for Computational Linguistics. 
Karl Moritz Hermann, Tomas Kocisky, Edward Grefen- stette, Lasse Espeholt, Will Kay, Mustafa Suleyman, Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and and Phil Blunsom. 2015. Teaching machines to read Dragomir Radev. 2019. Multi-news: A large-scale and comprehend. In C. Cortes, N. D. Lawrence, multi-document summarization dataset and abstrac- D. D. Lee, M. Sugiyama, and R. Garnett, editors, tive hierarchical model. In Proceedings of the 57th Advances in Neural Information Processing Systems Annual Meeting of the Association for Computa- 28, pages 1693–1701. Curran Associates, Inc. tional Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Yejin Choi. 2020. The curious case of neural text de- Utama, Ido Dagan, and Iryna Gurevych. 2019. generation. In International Conference on Learn- Ranking generated summaries by correctness: An in- ing Representations. teresting but challenging application for natural lan- guage inference. In Proceedings of the 57th Annual Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Meeting of the Association for Computational Lin- Bosselut, David Golub, and Yejin Choi. 2018. guistics, pages 2214–2220, Florence, Italy. Associa- Learning to write with cooperative discriminators. tion for Computational Linguistics. In Proceedings of the 56th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hi- Long Papers), pages 1638–1649, Melbourne, Aus- erarchical neural story generation. In Proceedings tralia. Association for Computational Linguistics.
Wojciech Kryscinski, Nitish Shirish Keskar, Bryan Mc- Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Cann, Caiming Xiong, and Richard Socher. 2019. Simões, Vitaly Nikolaev, and Ryan McDonald. 2021. Neural text summarization: A critical evaluation. In Planning with Learned Entity Prompts for Abstrac- Proceedings of the 2019 Conference on Empirical tive Summarization. Transactions of the Association Methods in Natural Language Processing and the for Computational Linguistics, 9:1475–1492. 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 540– Ani Nenkova and Kathleen McKeown. 2011. Auto- 551, Hong Kong, China. Association for Computa- matic summarization. Foundations and Trends in tional Linguistics. Information Retrieval, 5(2–3):103–233. Wojciech Kryscinski, Bryan McCann, Caiming Xiong, Kishore Papineni, Salim Roukos, Todd Ward, and Wei- and Richard Socher. 2020. Evaluating the factual Jing Zhu. 2002. Bleu: a method for automatic eval- consistency of abstractive text summarization. In uation of machine translation. In Proceedings of Proceedings of the 2020 Conference on Empirical the 40th Annual Meeting of the Association for Com- Methods in Natural Language Processing (EMNLP), putational Linguistics, pages 311–318, Philadelphia, pages 9332–9346, Online. Association for Computa- Pennsylvania, USA. Association for Computational tional Linguistics. Linguistics. Ramakanth Pasunuru and Mohit Bansal. 2018. Multi- Mike Lewis, Yinhan Liu, Naman Goyal, Mar- reward reinforced summarization with saliency and jan Ghazvininejad, Abdelrahman Mohamed, Omer entailment. In Proceedings of the 2018 Conference Levy, Ves Stoyanov, and Luke Zettlemoyer. of the North American Chapter of the Association 2019. BART: Denoising sequence-to-sequence pre- for Computational Linguistics: Human Language training for natural language generation, translation, Technologies, Volume 2 (Short Papers), pages 646– and comprehension. CoRR, abs/1910.13461. 653, New Orleans, Louisiana. Association for Com- putational Linguistics. Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep rein- Weizhen Qi, Yu Yan, Yeyun Gong, Dayiheng Liu, forcement learning for dialogue generation. In Pro- Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming ceedings of the 2016 Conference on Empirical Meth- Zhou. 2020. ProphetNet: Predicting future n-gram ods in Natural Language Processing, pages 1192– for sequence-to-SequencePre-training. In Findings 1202, Austin, Texas. Association for Computational of the Association for Computational Linguistics: Linguistics. EMNLP 2020, pages 2401–2410, Online. Associa- tion for Computational Linguistics. Chin Yew Lin and Eduard Hovy. 2003. Automatic eval- uation of summaries using n-gram co-occurrence Alec Radford, Karthik Narasimhan, Tim Salimans, and statistics. In Proceedings of the 2003 Human Lan- Ilya Sutskever. 2018. Improving language under- guage Technology Conference of the North Ameri- standing by generative pre-training. Technical re- can Chapter of the Association for Computational port, OpenAI. Linguistics, pages 150–157. Alec Radford, Jeff Wu, Rewon Child, David Luan, Inderjeet Mani. 2001. Automatic summarization, vol- Dario Amodei, and Ilya Sutskever. 2019. Language ume 3. John Benjamins Publishing. models are unsupervised multitask learners. Techni- cal report, OpenAI. Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. 
On faithfulness and factu- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine ality in abstractive summarization. In Proceedings Lee, Sharan Narang, Michael Matena, Yanqi Zhou, of the 58th Annual Meeting of the Association for Wei Li, and Peter J. Liu. 2019. Exploring the limits Computational Linguistics, pages 1906–1919, On- of transfer learning with a unified text-to-text trans- line. Association for Computational Linguistics. former. CoRR, abs/1901.07291. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Clara Meister, Ryan Cotterell, and Tim Vieira. 2020. Percy Liang. 2016. SQuAD: 100,000+ questions for If beam search is the answer, what was the question? machine comprehension of text. In Proceedings of In Proceedings of the 2020 Conference on Empirical the 2016 Conference on Empirical Methods in Natu- Methods in Natural Language Processing (EMNLP), ral Language Processing, pages 2383–2392, Austin, pages 2173–2185, Online. Association for Computa- Texas. Association for Computational Linguistics. tional Linguistics. Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2020. Leveraging pre-trained checkpoints for se- 2018. Don’t give me the details, just the summary! quence generation tasks. Transactions of the Asso- topic-aware convolutional neural networks for ex- ciation for Computational Linguistics, 8:264–280. treme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Lan- Alexander M. Rush, Sumit Chopra, and Jason Weston. guage Processing, pages 1797–1807, Brussels, Bel- 2015. A neural attention model for abstractive sen- gium. Association for Computational Linguistics. tence summarization. In Proceedings of the 2015
Conference on Empirical Methods in Natural Lan- Chapter of the Association for Computational Lin- guage Processing, pages 379–389, Lisbon, Portugal. guistics: Human Language Technologies, Volume Association for Computational Linguistics. 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguis- Abigail See, Peter J. Liu, and Christopher D. Manning. tics. 2017. Get to the point: Summarization with pointer- generator networks. In Proceedings of the 55th An- Sam Wiseman, Stuart Shieber, and Alexander Rush. nual Meeting of the Association for Computational 2017. Challenges in data-to-document generation. Linguistics (Volume 1: Long Papers), pages 1073– In Proceedings of the 2017 Conference on Empiri- 1083, Vancouver, Canada. Association for Computa- cal Methods in Natural Language Processing, pages tional Linguistics. 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics. Tianxiao Shen, Myle Ott, Michael Auli, and Marc’Aurelio Ranzato. 2019. Mixture models for Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and diverse machine translation: Tricks of the trade. In Arvind Neelakantan. 2021. Trading off diversity and Proceedings of the 36th International Conference on quality in natural language generation. In Proceed- Machine Learning, volume 97 of Proceedings of Ma- ings of the Workshop on Human Evaluation of NLP chine Learning Research, pages 5719–5728. PMLR. Systems (HumEval), pages 25–33, Online. Associa- Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie- tion for Computational Linguistics. Yan Liu. 2019. MASS: Masked sequence to se- Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Pe- quence pre-training for language generation. In Pro- ter Liu. 2020a. PEGASUS: Pre-training with ex- ceedings of the 36th International Conference on tracted gap-sentences for abstractive summarization. Machine Learning, volume 97, pages 5926–5936. In Proceedings of the 37th International Conference PMLR. on Machine Learning, volume 119 of Proceedings Christoph Tillmann and Hermann Ney. 2003. Word re- of Machine Learning Research, pages 11328–11339. ordering and a dynamic programming beam search PMLR. algorithm for statistical machine translation. Com- putational Linguistics, 29(1):97–133. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020b. BERTScore: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Evaluating text generation with BERT. In Proceed- Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz ings of the 8th International Conference on Learn- Kaiser, and Illia Polosukhin. 2017. Attention is all ing Representations, Virtual Conference, Formerly you need. In Advances in Neural Information Pro- Addis Ababa Ethiopia. cessing Systems 30, pages 5998–6008. Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ashwin K. Vijayakumar, Michael Cogswell, Ram- Ke. 2018. Paragraph-level neural question gener- prasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. ation with maxout pointer and gated self-attention Crandall, and Dhruv Batra. 2016. Diverse beam networks. In Proceedings of the 2018 Conference search: Decoding diverse solutions from neural se- on Empirical Methods in Natural Language Process- quence models. CoRR, abs/1610.02424. ing, pages 3901–3910, Brussels, Belgium. Associa- tion for Computational Linguistics. Ashwin K. Vijayakumar, Michael Cogswell, Ram- prasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Crandall, and Dhruv Batra. 2018. Diverse beam Hangbo Bao, and Ming Zhou. 
2017. Neural ques- search for improved description of complex scenes. tion generation from text: A preliminary study. In Proceedings of the Thirty-Second AAAI Confer- CoRR, abs/1704.01792. ence on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, (IAAI-18), and the 8th AAAI Symposium on Educa- Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texy- tional Advances in Artificial Intelligence (EAAI-18), gen: A benchmarking platform for text generation New Orleans, Louisiana, USA, February 2-7, 2018, models. In The 41st International ACM SIGIR Con- pages 7371–7379. AAAI Press. ference on Research and Development in Informa- tion Retrieval, SIGIR ’18, page 1097–1100, New Zhen Wang, Siwei Rao, Jie Zhang, Zhen Qin, York, NY, USA. Association for Computing Machin- Guangjian Tian, and Jun Wang. 2020. Diversify ery. question generation with continuous content selec- tors and question type modeling. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2134–2143, Online. Association for Computational Linguistics. Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sen- tence understanding through inference. In Proceed- ings of the 2018 Conference of the North American