A Well-Composed Text is Half Done! Composition Sampling for Diverse Conditional Generation

Shashi Narayan, Gonçalo Simões, Yao Zhao*, Joshua Maynez, Dipanjan Das, Michael Collins, Mirella Lapata
Google Research (*Google Brain)
{shashinarayan, gsimoes, yaozhaoyz, joshuahm, dipanjand, mjcollins, lapata}@google.com

arXiv:2203.15108v1 [cs.CL] 28 Mar 2022
Abstract

We propose Composition Sampling, a simple but effective method to generate diverse outputs for conditional generation of higher quality compared to previous stochastic decoding strategies. It builds on recently proposed plan-based neural generation models (Narayan et al., 2021) that are trained to first create a composition of the output and then generate by conditioning on it and the input. Our approach avoids text degeneration by first sampling a composition in the form of an entity chain and then using beam search to generate the best possible text grounded to this entity chain. Experiments on summarization (CNN/DailyMail and XSum) and question generation (SQuAD), using existing and newly proposed automatic metrics together with human-based evaluation, demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse meaningful outputs.

1 Introduction

In many NLG tasks, it is important to be able to generate multiple diverse outputs from a model. Tasks like summarization (Mani, 2001; Nenkova and McKeown, 2011) and question generation (Zhou et al., 2017) exhibit one-to-many relationships; there can be multiple semantically diverse summaries or questions for the same source, and it may be useful for a model to be able to generate multiple outputs. Yet, the primary focus of recent research in NLG has been on improving the quality of single-best outputs (Raffel et al., 2019; Lewis et al., 2019; Dong et al., 2019; Zhang et al., 2020a; Narayan et al., 2021), while diversity remains an unsolved problem (Hashimoto et al., 2019; Zhang et al., 2021). This is particularly challenging in conditional generation, where diversity in the target sequence should not come at the cost of correctness or faithfulness; for example, alternate summaries are not valuable if they are unfaithful to the input document(s) (Maynez et al., 2020; Kryscinski et al., 2020). In this work, we investigate decoding methods for generating semantically diverse text which is also faithful to its input, focusing on two tasks, namely summarization and question generation.

Beam search (Li et al., 2016; Wiseman et al., 2017) has proven successful for single-best generation (Rush et al., 2015; Barrault et al., 2020; Meister et al., 2020), but struggles to generate diverse output (Vijayakumar et al., 2016). Stochastic sampling strategies, such as top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2020), are better at generating diverse sequences but are not suitable for conditional generation as they degenerate,[1] producing output that is not faithful to the source. Figure 1 exposes degeneration in summary output using nucleus sampling.

To address these shortcomings, we propose Composition Sampling, a simple but effective hybrid decoding method for diverse and faithful conditional generation. It builds on recently proposed generation models (Narayan et al., 2021) that are trained to first plan a semantic composition of the target and then generate the text conditioned on the composition and the input. Composition sampling first samples a composition in the form of an entity chain and then uses beam search to generate the best possible sequence grounded to the sampled entity chain. Unlike top-k or nucleus sampling, it avoids degeneration by instilling diversity in the composition, rather than directly on the surface form.

Our contributions can be summarized as follows: (a) we introduce Composition Sampling, a simple yet effective decoding method for diverse conditional generation, which combines planning with stochastic sampling; (b) we propose several metrics to compute semantic diversity in generated text; our metrics are complementary to lexical diversity (e.g., Self-BLEU; Zhu et al. 2018; Alihosseini et al. 2019) and assess whether a set of diverse outputs are contextually dissimilar (Self-BERTScore; Zhang et al. 2020b) or non-entailing (Self-Entailment); and (c) finally, we introduce EDNA, a novel metric aiming to "Evaluate Diversity aNd fAithfulness" for summarization, quantifying whether summaries in a diverse set are faithful to their input without entailing each other.

[1] Holtzman et al. (2020) use the term "degeneration" to describe automatically generated text that is generic, repetitive, and awkward for story continuation. These issues are less common in conditional generation. In our case, "degenerate" refers to text unfaithful or inconsistent to the input.
Human Written Summary: Chelsea star Eden Hazard is set to make his 100th top-flight appearance. Santi Cazorla should
hit the same milestone when Arsenal meet Burnley. Both players have impressed since moving to the Premier League in 2012.
Hazard has more goals this season but Cazorla has one more assist. Sportsmail’s reporters choose the player who has excited
them the most.
Beam Search: Eden Hazard and Santi Cazorla are both set to make their 100th Premier League appearances this weekend.
Both players have been hugely influential since they moved to London. Here, Sportsmail’s reporters choose the player they
most enjoy seeing in action.
Nucleus Sampling: Eden Hazard and Santi Cazorlag will each make their 100th Premier League appearance this weekend.
nightstandapplication.com. Sportsmail’s hovercraft reporters choose their man of the match countermeasures.
Nucleus Sampling: By making their 100th Premier League appearances this weekend, Eden Hazard and Santi Cazor halibut
will set new records. Here, Anna Coren and Dominic King select their favourites.
Composition Sampling: (Eden Hazard | Santi Cazorla | Chelsea | Arsenal | Premier League ||| London | 2012 |||) Eden
Hazard and Santi Cazorla are set to make their 100th appearances for Chelsea and Arsenal respectively in the Premier League
this weekend. Both players have been hugely influential since they moved to London in the summer of 2012. But who has been
the most exciting import to watch?
Composition Sampling: (Chelsea | Eden Hazard | Arsenal | Santi Cazorla ||| Sportsmail ||| London) Chelsea's Eden Hazard
and Arsenal’s Santi Cazorla will both make 100th appearances this weekend. Sportsmail’s reporters pick the player they most
enjoy seeing in action. Both players have been hugely influential since moving to London.

Figure 1: Human written summary, single-best predicted summary using beam search (beam size 8), and diverse
summaries with nucleus sampling (p = 0.95) and our composition sampling for a CNN/DailyMail article (shown
in the Appendix, Figure 6). We highlight spans in orange that are not faithful to the input.

Evaluation on two popular summarization tasks, namely highlight generation (CNN/DailyMail; Hermann et al. 2015) and extreme summarization (XSum; Narayan et al. 2018), and on question generation (SQuAD; Rajpurkar et al. 2016; Zhou et al. 2017), shows that composition sampling is most effective in generating diverse summaries or questions. When assessed by humans, composition sampled summaries are as faithful as the best summaries produced with beam search. In comparison, nucleus sampled summaries can be as diverse but are far less faithful. Taken together, our results demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse and meaningful output.[2]

[2] Our checkpoints and spaCy annotation code are available at https://github.com/google-research/language/tree/master/language/frost.

2 Background

Conditional generation tasks such as summarization (See et al., 2017), data-to-text generation (Wiseman et al., 2017), and machine translation (Bahdanau et al., 2015) are typically modeled using attention-based encoder-decoder architectures (Bahdanau et al., 2015; Gu et al., 2016; Vaswani et al., 2017). The encoder first encodes the input text d and then the decoder predicts the output s_{1:n} (e.g., the translation or summary of d) one token at a time as p(s_i | s_1, ..., s_{i-1}; d), where n is the output length and s_i is the i-th token in the output. Often these models benefit from large-scale task-agnostic pretraining (Song et al., 2019; Radford et al., 2018; Lewis et al., 2019; Rothe et al., 2020; Raffel et al., 2019; Zhang et al., 2020a).

Plan-based Conditional Generation  Narayan et al. (2021) develop a plan-based approach for neural summarization; their decoder first generates a composition c_{1:m} of the target summary s as p(c_j | c_1, ..., c_{j-1}; d), and then the same decoder produces s as p(s_i | s_1, ..., s_{i-1}; c; d), conditioned on the input d and the composition c_{1:m}, with m being the composition length. Specifically, they adopt entity chains as the composition c of summary s, under the assumption that entities in the chain ought to be observed in the output summary. During inference, the model takes document d as input and generates c; s, the concatenation of the composition and summary sequences, instead of generating s directly; c and s are prefixed with the special markers "[CONTENT]" and "[SUMMARY]", respectively, as shown in Figure 2. If s consists of multiple sentences, markers "|||" denote sentence boundaries in composition c.
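To make the target format concrete, the following is a minimal sketch of how such plan-augmented training targets can be built. The marker strings follow Figure 2, but the spaCy-based entity extraction is only an illustrative approximation of the released annotation code; the function name and the choice of the en_core_web_sm model are our own assumptions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def build_plan_target(summary: str) -> str:
    """Prefix a reference summary with its entity-chain composition:
    entities are separated by "|" and sentence boundaries by "|||"."""
    doc = nlp(summary)
    chain = " ||| ".join(
        " | ".join(ent.text for ent in sent.ents) for sent in doc.sents)
    return f"[CONTENT] {chain} [SUMMARY] {summary}"

print(build_plan_target(
    "Eden Hazard and Santi Cazorla moved to London in 2012."))
# [CONTENT] Eden Hazard | Santi Cazorla | London | 2012 [SUMMARY] Eden Hazard and ...
```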
Figure 2: Illustration of composition sampling and other decoding strategies with vanilla and plan-based generation models. The term "composition" is inspired by the quote "A Well-Composed Painting is Half Done" from the French painter Pierre Bonnard. Images in black-and-white are early sketches or compositions of the painting in color. Nucleus or focus sampling often leads to hallucinations (highlighted spans in red); the corresponding color images are blurred to illustrate this. (Credit: the image of "Anna Pavlovna of Russia" is taken from Wikipedia.)

The approach allows us to directly manipulate the content of summaries and their quality. For example, we might inspect the predicted chain during inference and drop entities which are not present in the input document, thereby controlling for hallucinations (Narayan et al., 2021). Beyond summarization, similar constraints can be easily adapted to other conditional generation tasks.

Maximization-Based Decoding  In order to obtain the most likely output ŝ from encoder-decoder models, we typically solve a maximization-based objective, x̂ = arg max_x p(x | d), where x is either the predicted output text s (for models without planning) or the concatenation of the predicted composition and the output text, c; s (for models with planning). It is standard practice to use beam search (Tillmann and Ney, 2003; Li et al., 2016; Wiseman et al., 2017), as solving the objective for the optimal sequence with neural sequence models is not tractable (Chen et al., 2018).

Stochastic Sampling for Diverse Decoding  Sampling-based strategies have been widely used to induce diversity in language models. Temperature sampling uses a temperature to skew the distribution towards high-probability tokens at each decoding step (Ackley et al., 1985; Ficler and Goldberg, 2017; Fan et al., 2018), while top-k sampling truncates the distribution to the k highest-probability tokens (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019). Similarly to top-k sampling, nucleus sampling (Holtzman et al., 2020) also truncates the tail of the distribution but chooses k dynamically. At each decoding step, it samples high-probability tokens from a nucleus N defined as the smallest subset of tokens from the vocabulary V with cumulative probability p' ≥ p, where p is the pre-specified mass of the nucleus.
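As a concrete reference, here is a minimal sketch of a single nucleus-sampling step over a vocabulary distribution; it is a generic implementation of the rule above, not code from any of the systems evaluated in this paper.

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float = 0.95, rng=None) -> int:
    """Sample a token id from the smallest set of tokens whose cumulative
    probability reaches p (the nucleus), after renormalization."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                   # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```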
Aralikatte et al. (2021) introduce focus sampling to promote diversity in summarization models. It constructs a subset V_k ⊆ V by sampling k source-relevant and topical tokens from the vocabulary distribution. Standard beam search decoding is then used to generate a summary limited to V_k. However, the authors show that focus sampling is very sensitive to k; increasing it improves generation quality but at the cost of diversity.

3 Composition Sampling

Composition Sampling is a novel hybrid method which combines stochastic sampling with maximization-based decoding, whilst leveraging plan-based generation (Narayan et al., 2021). Specifically, we first employ nucleus sampling to obtain a diverse composition c_sample from p(c | d), where d is the input text and c is an entity chain (prefixed with "[CONTENT]" in Figure 2), and then employ beam search to generate the most-likely output s (prefixed with "[SUMMARY]" in Figure 2), given the input d and the composition c_sample, as p(s | c_sample; d). We will experimentally show that composition sampling enables the generation of fluent, faithful and diverse texts for conditional generation.
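The two-stage decode can be sketched as follows. The `sample_with_nucleus` and `beam_search` calls stand in for the underlying seq2seq decoder (e.g., a FROST checkpoint); they are assumptions of this sketch rather than a published API.

```python
def composition_sample(model, document, p=0.95, num_samples=5, beam_size=8):
    """Two-stage decoding: stochastically sample an entity-chain plan,
    then deterministically realize the summary grounded to that plan."""
    outputs = []
    for _ in range(num_samples):
        # Stage 1: nucleus-sample a composition, stopping at the summary marker.
        chain = model.sample_with_nucleus(
            prefix="[CONTENT]", context=document, p=p, stop="[SUMMARY]")
        # Stage 2: beam-search the most likely summary given the sampled plan.
        summary = model.beam_search(
            prefix=f"[CONTENT] {chain} [SUMMARY]",
            context=document, beam_size=beam_size)
        outputs.append((chain, summary))
    return outputs
```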
                                                                      0.8
Why Entity Chains?  Unlike top-k or nucleus sampling, composition sampling avoids degeneration by introducing diversity in the composition, rather than directly on the surface form. For this to work effectively, the choice of c needs to be well correlated with an underlying notion of "semantic composition", which we want to "diversify"; if c_1 and c_2 are two semantic compositions for input d such that c_1 ≠ c_2, then the two summaries s_1 = arg max_s p(s | c_1; d) and s_2 = arg max_s p(s | c_2; d) are bound to be diverse. In our work, we have chosen entity chains to model semantic compositions; entity chains have been widely studied to model entity-level lexical cohesion (Barzilay and Elhadad, 1997) and coherence (Halliday and Hasan, 1976; Azzam et al., 1999) in text. Also, entity chains are unique to d, and thus can be easily distinguished from compositions for other inputs. Moreover, entity chains provide a very effective knob for content control in abstractive generation; e.g., compositions can be constrained to entities only present in the input document, thereby avoiding hallucinations and entity degeneration.

Hypothesis 1: If the semantic composition c of the output text s corresponds to entity chains, then learning p(c | d) is much easier than learning p(s | d), where d is the input. Hence, we can sample from p(c | d) with higher confidence than sampling directly from p(s | d), and then compute arg max_s p(s | c; d).

We demonstrate the effectiveness of entity chains as a choice for c using the summarization example in Figure 3. The negative log likelihood of generating the summary s from scratch without planning (−log p(s | d)) is 121.18, while the negative log likelihood of generating the composition c with planning (−log p(c | d)) is 46.95; hence, the model is much more confident when sampling from p(c | d) than directly from p(s | d).

[Figure 3 (bar chart omitted): Probabilities of generating the underlined entities in the human written reference summary from Figure 1 (input article shown in Figure 6): when the summary is generated directly (Generate, Summary), when the entity chain "Chelsea | Eden Hazard ||| Santi Cazorla | Arsenal | Burnley ||| Premier League | 2012 ||| Hazard | Cazorla | one ||| Sportsmail" is predicted first during planning (Plan-Generate, Entity Chain), and when the entities are predicted in the summary after planning (Plan-Generate, Summary). All probabilities were computed with PEGASUS (Zhang et al., 2020a) fine-tuned models. The symbol "_" denotes the start of a token.]

In Figure 3, the plan-based model is more confident in predicting entities than its counterpart without planning; perplexities of predicting the entities in the summary with and without planning are 0.24 and 1.36, respectively, and perplexities of generating the whole summary with and without planning are 1.15 and 1.48, respectively. In fact, despite the increased length of the target in the plan-based model (i.e., c_{1:m}; s_{1:n} instead of s_{1:n}), we find that the perplexity of predicting the longer sequence (c_{1:m}; s_{1:n}) is lower than that of predicting just the summary without any planning, due to grounding (1.16 vs 1.48). Overall, the plan-based approach, p(c; s | d), learns a more confident distribution at each decoding step than no planning, i.e., p(s | d). For the example in Figure 3, the average cumulative probabilities for the top 15 tokens in the vocabulary distribution at each decoding step are 0.283 for p(s | d) and 0.433 for p(c; s | d).

Why Grounded Generation?  The generation of s is inherently grounded to its entity composition c; following Narayan et al. (2021), the entity chains are extracted from their targets during training. Hence, once the hard part of planning the composition is done, the model is less perplexed during generation of the output.
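Note that the "perplexity" values quoted here (and reported later in Table 2) can fall below 1, so they are consistent with a length-normalized negative log-likelihood rather than its exponential. A minimal sketch of that computation, where the per-token log-probabilities are assumed to come from a fine-tuned checkpoint scoring either s alone or the concatenated c; s:

```python
def sequence_nll(token_logprobs):
    """Length-normalized negative log-likelihood of a decoded sequence.
    Lower values mean the model is more confident at each decoding step."""
    return -sum(token_logprobs) / len(token_logprobs)
```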
In the following we assess composition sampling for its ability to generate semantically diverse output on two tasks, namely summarization (Section 4) and question generation (Section 5).

4 Single Document Summarization

4.1 Datasets and Models

We evaluate our decoding strategy on two popular single document summarization datasets: CNN/DailyMail highlight generation (Hermann et al., 2015) and XSum extreme summarization (Narayan et al., 2018), using the original train/validation/test splits. Inputs and outputs were truncated to 512 and 128 tokens for XSum, and to 1,024 and 256 tokens for CNN/DailyMail.[3]

We conduct experiments with state-of-the-art pretrained models for summarization, namely PEGASUS (Zhang et al., 2020a) and FROST (Narayan et al., 2021). Our PEGASUS finetuned model generates summaries directly, whereas FROST generates the entity chain followed by the summary. In both cases we use large transformer architectures (Vaswani et al., 2017) with L = 16, H = 1,024, F = 4,096, A = 16 (568M parameters), where L denotes the number of layers in the encoder and decoder Transformer blocks, H is the hidden size, F the feed-forward layer size, and A the number of self-attention heads. Since this paper proposes a decoding strategy, there is no need to train new summarization models; we use the publicly available PEGASUS and FROST checkpoints. Training details and model hyperparameters can be found in Zhang et al. (2020a) and Narayan et al. (2021).

All models are decoded with a beam size of 8 and a length penalty of 0.8. For nucleus sampling and composition sampling, we set the nucleus probability p to 0.95.[4] For focus sampling (Aralikatte et al., 2021), we use k = 10,000.

[3] We also experimented with MultiNews (Fabbri et al., 2019), a multi-document summarization dataset. Results can be found in the Appendix (Table 7).

[4] Results with different temperatures and nucleus probabilities for random sampling, nucleus sampling, and composition sampling are in Figure 4.

4.2 Evaluation Metrics

We assess our decoding strategy for likelihood, fluency, relevance, faithfulness, and diversity, using both automatic and human evaluation. FROST models predict a plan in the form of an entity chain, followed by a summary. All evaluations, except likelihood, are done on the summary, with the predicted entity chains stripped out. For each diverse decoding strategy, we sample 5 times per test document and report the average.

Sequence Likelihood  We report the perplexity of the generated sequence (i.e., entity chains concatenated with their summaries for planning models, and summaries only for the others) using various decoding strategies.

Lexical Fluency and Relevance  We report ROUGE-L F1 scores (Lin and Hovy, 2003) against reference summaries.[5]

[5] We lowercased candidate and reference summaries and used pyrouge with parameters "-a -c 95 -m -n 4 -w 1.2".

Semantic Relevance  We report BERTScore (Zhang et al., 2020b), which computes the contextual similarity between a candidate and its reference.

Faithfulness  We follow Maynez et al. (2020) and report on textual entailment (Pasunuru and Bansal, 2018; Falke et al., 2019; Kryscinski et al., 2020). In particular, we report the probability of a summary entailing (Entailment) its input document, using a classifier trained by fine-tuning an uncased BERT-Large pretrained model (Devlin et al., 2019) on the Multi-NLI dataset (Williams et al., 2018).

We further assess faithfulness with humans. Our annotators, proficient in English, were tasked to read a document and then grade its summary on a scale of 1–4 (entirely unfaithful, somewhat unfaithful, somewhat faithful, and entirely faithful); a summary is "entirely faithful" if its content is fully supported by or can be inferred from the document. We collected 3 ratings for each (document, summary) pair and report average system ratings (across documents). For summaries deemed "somewhat unfaithful" or "somewhat faithful", annotators were asked to also specify what was faithful or unfaithful, to improve agreement.
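The automatic entailment scorer can be sketched as follows; the paper fine-tunes its own uncased BERT-Large on Multi-NLI, and we substitute the public "roberta-large-mnli" checkpoint here purely for illustration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_score(document: str, summary: str) -> float:
    """Probability that the document (premise) entails the summary (hypothesis)."""
    inputs = tokenizer(document, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Label order for this checkpoint: 0=contradiction, 1=neutral, 2=entailment.
    return float(logits.softmax(dim=-1)[0, 2])
```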

Diversity  We report the number of times (out of five samples) a decoding technique is able to generate a completely new summary (Unique). We also use Self-BLEU (Zhu et al., 2018; Alihosseini et al., 2019) to measure lexical diversity in the generated summaries. We consider all pairs of summaries out of the 5 samples, and for each pair we compute the BLEU score (Papineni et al., 2002), treating one summary as the hypothesis and the other as the reference. We report the average BLEU score as the Self-BLEU of the document. The lower the Self-BLEU for a decoding strategy, the better it is at generating a diverse set of summaries.
           Model                              XSum                 CNN/DailyMail
                                       R1     R2     RL       R1     R2     RL
  Single
    GSum (Dou et al., 2020)           45.40  21.89  36.67    45.94  22.32  42.48
    CTRLsum (He et al., 2020)           —      —      —      45.65  22.35  42.50
    FAME (Aralikatte et al., 2021)    45.31  22.75  37.46    42.95  20.79  39.90
    PEGASUS (Zhang et al., 2020a)     47.56  24.87  39.40    44.05  21.69  40.98
    FROST (Narayan et al., 2021)      47.80  25.06  39.76    45.11  22.11  42.01
    FROST++ (Narayan et al., 2021)    44.94  21.58  37.20    45.08  22.14  41.99
  Diverse
    Focus (FAME)                      42.76  19.89  34.97      —      —      —
    Nucleus (PEGASUS)                 38.49  16.57  30.99    36.27  15.10  33.46
    Nucleus (FROST)                   40.26  17.83  32.49    38.49  15.71  35.49
    Composition (FROST)               45.12  22.24  36.98    41.76  18.94  38.69
    Composition (FROST++)             43.82  20.35  35.89    42.37  19.48  39.28

Table 1: Comparison of decoding strategies with ROUGE: single-best vs diverse decoding (we report averages over 5 samples). Best results in each block are bold-faced. See Table 5 in the Appendix for more comparisons.

We propose two additional measures to capture semantic diversity in summaries: Self-Entailment and Self-BERTScore. Similarly to Self-BLEU, we compute the Entailment score and BERTScore, respectively, for each possible pair of summaries, and report the average. A lower Self-Entailment value suggests that the generated summaries do not entail each other. Analogously, a low Self-BERTScore value indicates that the decoding technique is able to generate more contextually dissimilar summaries.

We further assess diversity with humans. Our annotators, proficient in English, read two summaries (out of the five samples) and then graded the pair on a scale of 1–4 (identical, somewhat identical, somewhat diverse, and diverse); the document was not shown in this assessment. Two summaries are "identical" if they are semantically equivalent, while the same information may be presented differently in the case of "somewhat identical". A "somewhat diverse" pair may introduce one or two new concepts or topics in one summary. A "diverse" pair should introduce new concepts or topics in each summary. We collected 3 ratings for each pair and report their average. This assessment was only done with single-sentence XSum summaries; in future work we will explore how to do this effectively for multi-sentence summaries.

Diversity and Faithfulness  For summarization, diverse summaries are not meaningful if they are not faithful to the input. We propose EDNA, a novel measure for "Evaluating Diversity aNd fAithfulness" in summaries. EDNA is the harmonic mean of Entailment and (1 − Self-Entailment); higher values of EDNA imply more faithful and diverse summaries. The reason EDNA relies on Self-Entailment to measure diversity is that the faithfulness metric is also based on Entailment. This means that both components are mapped to a similar output space (i.e., they both yield values between 0 and 1 obtained through the same trained model), making it more likely that they are properly balanced when mixed.
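Per document, the measure reduces to a harmonic mean, as in this sketch (corpus-level numbers, as in Table 2, would then be averages of per-document scores — our reading of the definition):

```python
def edna(entailment: float, self_entailment: float) -> float:
    """Harmonic mean of faithfulness (Entailment) and diversity
    (1 - Self-Entailment) for one document's set of sampled summaries."""
    diversity = 1.0 - self_entailment
    if entailment + diversity == 0.0:
        return 0.0
    return 2.0 * entailment * diversity / (entailment + diversity)

print(edna(entailment=0.8, self_entailment=0.1))  # faithful and diverse -> high (~0.85)
print(edna(entailment=0.8, self_entailment=0.9))  # faithful but repetitive -> low (~0.18)
```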
                                                                   tences. FAME (Aralikatte et al., 2021) uses a focus
while the same information may be presented dif-
                                                                   attention mechanism to encourage the decoder to
ferently in the case of “somewhat identical”. A
                                                                   proactively generate tokens that are similar or top-
“somewhat diverse” pair may introduce one or two
                                                                   ical to the input document. As mentioned earlier
new concepts or topics in one summary. A “diverse”
                                                                   F ROST (Narayan et al., 2021) first generates an en-
pair should introduce new concepts or topics in
                                                                   tity chain and then a summary while F ROST++ is
each summary. We collected 3 ratings for each pair
                                                                   a constrained variant which restricts the predicted
and report their average. This assessment was only
                                                                   entities to those present in the document. We also
done with single-sentence XSum summaries, in fu-
                                                                   show results for a vanilla P EGASUS model (Zhang
ture work we will explore how to do this effectively
                                                                   et al., 2020a) finetuned on our datasets.
for multi-sentence summaries.
                                                                      The bottom block focuses on diverse decoding
Diversity and Faithfulness For summarization,                      (we report averages across five samples). We show
diverse summaries are not meaningful if they are                   results with Focus sampling (Aralikatte et al., 2021)
not faithful to the input. We propose E DNA,                       built on top of FAME, Nucleus sampling (Holtzman
a novel measure for “Evaluating Diversity aNd                      et al., 2020) with P EGASUS and F ROST, and our
fAithfulness” in summaries. E DNA is the harmonic                  Composition sampling.
mean of Entailment and (1−Self-Entailment);                           Table 2 presents more detailed faithfulness and
higher values of E DNA imply more faithful and                     diversity results, on challenge sets consisting of
diverse summaries. The reason E DNA relies on                      50 documents for each XSum and CNN/DailyMail
                                          With Ref.    Faithfulness          Diversity                      Div+Faith
           Models                  ppl    RL     BSc    Ent    Human   Uniq  S-BL   S-Ent  S-BSc  Human     EDNA
  XSum
   Single
    FAME                             —   34.23   0.70   0.24   2.19      —     —      —      —      —         —
    PEGASUS                        0.51  40.69   0.76   0.40   2.52      —     —      —      —      —         —
    FROST                          0.31  40.90   0.75   0.37   2.63      —     —      —      —      —         —
    FROST++                        0.71  33.75   0.70   0.44   2.78      —     —      —      —      —         —
   Diverse
    Focus (FAME)                     —   29.19   0.66   0.23   1.88    2.6   89.51  0.62   0.91   1.84      0.09
    Nucleus (PEGASUS)              1.47  31.10   0.68   0.24   2.00    5.0   26.22  0.10   0.68   3.11      0.30
    Nucleus (FROST)                0.83  33.81   0.71   0.22   2.11    5.0   31.08  0.10   0.71   3.08      0.27
    Composition (FROST)            0.51  36.95   0.73   0.27   2.37    4.7   58.94  0.17   0.79   2.73      0.30
    Composition (FROST++)          0.74  33.87   0.70   0.43   2.75    3.5   76.87  0.40   0.84   2.25      0.35
  CNN/DM
   Single
    PEGASUS                        0.35  36.09   0.65   0.70   3.78      —     —      —      —      —         —
    FROST                          0.30  39.03   0.66   0.72   3.74      —     —      —      —      —         —
    FROST++                        0.37  38.87   0.66   0.79   3.94      —     —      —      —      —         —
   Diverse
    Nucleus (PEGASUS)              1.39  28.99   0.62   0.62   3.08    5.0   26.99  0.03   0.63     —        0.70
    Nucleus (FROST)                1.04  31.58   0.63   0.56   3.08    5.0   29.60  0.03   0.64     —        0.66
    Composition (FROST)            0.52  35.06   0.64   0.59   3.45    5.0   58.60  0.04   0.71     —        0.66
    Composition (FROST++)          0.46  35.07   0.64   0.73   3.89    4.9   62.81  0.07   0.72     —        0.78

Table 2: Likelihood, faithfulness and diversity results on 50 documents sampled from each of XSum and CNN/DailyMail. We report perplexity (ppl), Entailment (Ent), Uniqueness (Uniq), Self-BLEU (S-BL), Self-Entailment (S-Ent), Self-BERTScore (S-BSc) and EDNA scores, along with ROUGE (RL) and BERTScore (BSc) for comparison. We also report human judgements for faithfulness and diversity. Additional R1 and R2 numbers can be found in the Appendix (Table 6). Best results in the diverse block for each dataset are bold-faced. Scores for single-best decoded summaries are also presented for comparison. Focus (FAME) diverse predictions on CNN/DailyMail are not available. The lower a Self-* metric, the better the decoding method is at generating diverse outputs.
We construct these challenge sets by randomly selecting documents whose reference summaries have non-extractive entity chains; an entity chain is extractive if all entities in it can be found in the input document. Narayan et al. (2021) have found that models struggle to generate faithful summaries for documents with such data-divergence issues (Dhingra et al., 2019). The same challenge sets were used for our human evaluations of faithfulness and diversity.

Composition Sampling is not as Performance-Diminishing as Nucleus Sampling  Single-best decoding for FROST achieves 39.76 ROUGE (RL) on XSum; nucleus and composition sampling fare worse, showing an average drop of 7.27 and 2.78 points, respectively. Similarly, for CNN/DailyMail, ROUGE drops for nucleus sampling by an average of 6.51 points, compared to an average drop of 3.28 points for composition sampling (with FROST). Nucleus sampling is even more damaging for non-plan-based models such as PEGASUS: we see an average drop of 8.59 and 7.30 ROUGE points on XSum and CNN/DailyMail, respectively. These gaps are slightly larger for the challenging subsets in Table 2, which is expected due to the highly abstractive nature of the reference summaries therein.

On XSum, composition sampling with FROST++ performs slightly worse than with vanilla FROST in terms of ROUGE. This is due to the extremely abstractive nature of the XSum reference summaries (Maynez et al., 2020); as a result, a model is required to hallucinate factual content that is not necessarily faithful to the input (see examples of XSum summaries in the Appendix, Figure 5). But Composition(FROST++) only keeps supported entities in the sampled plans, giving rise to summaries which diverge from their reference. This is not the case with CNN/DailyMail, which is mostly extractive, and we see that ROUGE performance improves with Composition(FROST++) in Table 1.

Composition Sampling is more Confident in its Predictions than Nucleus Sampling  Perplexity for FROST predictions on XSum increases from 0.31 to 0.83 for nucleus sampling, but only to 0.51 for composition sampling. PEGASUS shows an even larger increase in perplexity (from 0.51 to 1.47) for nucleus sampling. Similar patterns are observed for CNN/DailyMail summaries.

Composition(FROST++) is more perplexed when generating XSum summaries due to the reference summary divergence issue discussed earlier; perplexity increases from 0.51 to 0.74 compared to Composition(FROST).
Interestingly, Composition(FROST++) is almost as confident in generating diverse summaries as single-best beam decoding (FROST++; perplexities of 0.71 vs 0.74 for XSum). Unsurprisingly, Composition(FROST++) is more confident in generating CNN/DailyMail summaries than Composition(FROST) (0.46 vs 0.52), due to their extractive nature.

Constrained Composition is Most Effective in Generating Meaningful Diverse Summaries  It is no surprise that nucleus sampling is able to generate the most diverse summaries on both XSum and CNN/DailyMail (achieving the best scores for Self-BLEU, Self-Entailment, Self-BERTScore, and diversity as assessed by humans); however, these summaries perform poorly on faithfulness measures. Composition(FROST++) is most effective in generating faithful summaries, as demonstrated automatically (with the best entailment scores on XSum and CNN/DailyMail) and by humans (with the highest ratings on XSum and CNN/DailyMail); these summaries are also diverse, achieving the highest EDNA scores on both summarization datasets.

We also examined whether models differ in terms of faithfulness and diversity as rated by our participants. We carried out pairwise comparisons using one-way ANOVA with post-hoc Tukey HSD tests (p < 0.01). The difference between Nucleus(PEGASUS) and Nucleus(FROST) is not significant. Nucleus(PEGASUS) was also not significantly more faithful than Focus(FAME) for XSum summaries. All other pairwise differences were significant for both faithfulness and diversity.

In sum, our results demonstrate that composition sampling is a better alternative to nucleus or focus sampling for generating meaningful diverse summaries. Figure 1 presents summaries from different decoding strategies for a CNN/DailyMail article. Other example predictions for XSum and CNN/DailyMail articles can be found in the Appendix (Figures 5–9).

Faithfulness and Diversity Metrics Correlate with Human Judgements  We estimate the extent to which automatic metrics of faithfulness and diversity correlate with human ratings (using Spearman's rank correlation coefficient) in Table 3. In line with previous work (Maynez et al., 2020; Kryscinski et al., 2019), we find that entailment scores are best correlated with faithfulness (moderate, 0.40 ≤ r ≤ 0.59). Like Self-BLEU, Self-Entailment and Self-BERTScore are also strongly correlated with diversity ratings. Compared to other metrics, which capture a single dimension, EDNA is positively correlated with both dimensions of diversity and faithfulness.

  Metric                 Faithfulness    Diversity
  ROUGE-L                    0.197        −0.164
  BERTScore                  0.209        −0.195
  Entailment                 0.588        −0.067
  1 − Self-BLEU             −0.208         0.880
  1 − Self-Entailment       −0.187         0.771
  1 − Self-BERTScore        −0.198         0.873
  EDNA                       0.482         0.174

Table 3: Different automatic metrics and their correlation with human assessments, using Spearman's rank coefficient.
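Correlations of this form can be computed per metric with SciPy; this is a generic sketch, and `metric_scores` and `human_ratings` are hypothetical aligned per-item value lists, not the paper's data:

```python
from scipy.stats import spearmanr

# Hypothetical aligned lists: one automatic score and one human rating per item.
metric_scores = [0.42, 0.77, 0.31, 0.65, 0.58]
human_ratings = [2, 4, 1, 3, 3]

rho, pvalue = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3f})")
```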
Finally, in Figure 4, we plot faithfulness and diversity scores for the different decoding strategies with varying temperatures and nucleus probabilities. We find that summaries sampled with Composition(FROST++) are consistently more faithful than single-best Beam(FROST) summaries, but worse than summaries decoded with Beam(FROST++). Summaries sampled with Composition(FROST++) achieve the best EDNA score (with p = 0.95) amongst all diverse decoding strategies.

[Figure 4 (plots omitted); panels: (a) Perplexity, (b) Entailment (faithfulness), (c) Self-Entailment (diversity), and (d) EDNA (faithfulness+diversity), each plotted against different temperatures or nucleus probabilities for Beam(FROST), Beam(FROST++), Random(FROST), Nucleus(FROST), Composition(FROST) and Composition(FROST++).]

Figure 4: Perplexity, entailment, self-entailment and EDNA scores on the CNN/DailyMail challenge set (Table 2) with varying temperatures (for random sampling) and nucleus probabilities (for nucleus and composition sampling). For each diverse decoding strategy, we sample 5 times per document and report the average.

5 Question Generation

5.1 Dataset and Metrics

Question generation is often conceptualized as the task of generating a question from a passage-answer pair (Zhou et al., 2017). We experiment on SQuAD (Rajpurkar et al., 2016) and use the split of Zhou et al. (2017), consisting of 86,635, 8,965, and 8,964 source-target pairs for training, validation, and testing, respectively.[6] We follow Cho et al. (2019) and report BLEU-4 (Top-1, the single-best accuracy), Oracle (Top-5, the best accuracy among the 5 sampled hypotheses), and Self-BLEU (as defined in Section 4).

[6] We also experimented with the split of Du et al. (2017). Results can be found in the Appendix (Table 8).

           Models                                  T1     T5     S-BL
  Single
    NQG++ (Zhou et al., 2017)                     13.27    —      —
    MCP+GSA (Zhao et al., 2018)                   16.85    —      —
    PEGASUS (Zhang et al., 2020a)                 22.17    —      —
    FROST (Narayan et al., 2021)                  21.04    —      —
  Diverse
    top-k Sampling (Fan et al., 2018)             11.53  17.65  45.99
    Diverse Beam (Vijayakumar et al.)             13.38  18.30  74.80
    Mixture Decoder (Shen et al., 2019)           15.17  21.97  58.73
    Mixture Selector (Cho et al., 2019)           15.67  22.45  59.82
    Mixture Selector (Wang et al., 2020)          15.34  21.15  54.18
    Nucleus (PEGASUS)                             12.05  24.72  30.64
    Nucleus (FROST)                               10.64  22.49  25.50
    Composition (FROST)                           17.16  27.04  61.68
    Composition (FROST++)                         18.77  26.60  74.89

Table 4: Comparison of different decoding techniques on question generation. We report BLEU-4 Top-1 accuracy (T1), Top-5 accuracy (T5), and Self-BLEU (S-BL). Results for diverse decoding comparison models are taken from Wang et al. (2020). Best results in each block are bold-faced.

5.2 Results

For our question generation experiments we also compare models which employ single-best decoding against models which adopt diverse decoding techniques. The top block in Table 4 presents results for NQG++ (Zhou et al., 2017), a pointer generator-based model, MCP+GSA (Zhao et al., 2018), a model which combines a pointer mechanism with a gated self-attention encoder, and fine-tuned PEGASUS and FROST models. The second block in the table contains several diverse decoding approaches, including top-k sampling (Fan et al., 2018), diverse beam search (Vijayakumar et al., 2018), mixture decoding (Shen et al., 2019) and mixture content selection (Cho et al., 2019; Wang et al., 2020). We compare these models against nucleus and composition sampling with PEGASUS and FROST. Nucleus sampling generates the most diverse questions (lowest Self-BLEU), but performs poorly on accuracy, with Nucleus(FROST) achieving a Top-1 of 10.64. Composition sampled questions are less diverse than those of other methods, but outperform all baselines on the Top-1 and Oracle metrics. The poor diversity (in terms of Self-BLEU) of composition sampled questions can be attributed to two limitations: (a) SQuAD questions are mostly extractive, and (b) questions are generated conditioned on the passage and the answer spans, leaving limited scope for models to generate diverse questions. An example in the Appendix (Figure 11) demonstrates the effectiveness of composition sampling in generating accurate and diverse questions compared to other sampling methods.[7]
                                                                                                                                                                                          effective decoding method for faithful and diverse
                                                                                                                                                                                          conditional generation. Our method is straightfor-
2018), a model which combines a pointer mechanism with a gated self-attention encoder, and fine-tuned PEGASUS and FROST models. The second block in the table contains several diverse decoding approaches including top-k sampling (Fan et al., 2018), diverse beam search (Vijayakumar et al., 2018), mixture decoding (Shen et al., 2019), and mixture content selection (Cho et al., 2019; Wang et al., 2020). We compare these models against nucleus sampling with PEGASUS and FROST, and composition sampling with FROST.
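As a quick arithmetic check on the comparison that follows, the Top-1 drops relative to single-best FROST decoding can be recomputed directly from Table 4. A sketch, with the numbers copied from the T1 column:

    # Sanity check of the BLEU drops discussed below, computed from the
    # Top-1 (T1) column of Table 4 relative to single-best FROST decoding.
    frost_single_best = 21.04     # FROST (Narayan et al., 2021), single-best
    nucleus_frost = 10.64         # Nucleus (FROST)
    composition_frost_pp = 18.77  # Composition (FROST++)

    print(f"Nucleus drop:     {frost_single_best - nucleus_frost:.2f}")        # 10.40
    print(f"Composition drop: {frost_single_best - composition_frost_pp:.2f}")  # 2.27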
As in our summarization experiments, we observe that composition sampling does not diminish performance as much as nucleus sampling in terms of BLEU. FROST achieves a BLEU of 21.04 (Top-1) in the single-best decoding setting; in comparison, BLEU drops by 10.40 points (on average) for nucleus sampling, but by only 2.27 points for composition sampling (FROST++). Nucleus sampled questions achieve the best pairwise diversity scores (Self-BLEU of 25.50), but a very low BLEU Top-1 score of 10.64. Composition sampled questions are less diverse than those of other methods, but outperform all baselines on Top-1 and Oracle metrics. Poor diversity (in terms of Self-BLEU) in composition sampled questions can be attributed to two limitations: (a) SQuAD questions are mostly extractive, and (b) questions are generated conditioned on the passage and the answer spans, leaving limited scope for models to generate diverse questions. An example in the Appendix (Figure 11) demonstrates the effectiveness of composition sampling in generating accurate and diverse questions compared to other sampling methods.7

7 A detailed comparative error analysis against Cho et al. (2019) and Wang et al. (2020) was not possible as their predictions are not publicly available.
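To make the Self-BLEU numbers above concrete, here is a minimal sketch of pairwise Self-BLEU over the samples generated for a single input (lower means more diverse). The NLTK smoothing choice, whitespace tokenization, and toy questions are illustrative assumptions rather than our exact evaluation configuration:

    # Minimal sketch of pairwise Self-BLEU: each sample is scored with BLEU
    # against the remaining samples, and the scores are averaged.
    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

    def self_bleu(samples):
        """Average BLEU of each sample against all other samples (0-100)."""
        smooth = SmoothingFunction().method1
        scores = []
        for i, hypothesis in enumerate(samples):
            references = [s.split() for j, s in enumerate(samples) if j != i]
            scores.append(sentence_bleu(references, hypothesis.split(),
                                        smoothing_function=smooth))
        return 100 * sum(scores) / len(scores)

    # Five sampled questions for one passage (toy examples).
    questions = [
        "who designed the eiffel tower ?",
        "which engineer designed the eiffel tower ?",
        "when was the eiffel tower completed ?",
        "where is the eiffel tower located ?",
        "who was the eiffel tower named after ?",
    ]
    print(f"Self-BLEU: {self_bleu(questions):.2f}")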
6 Conclusion

We proposed Composition Sampling, a simple yet effective decoding method for faithful and diverse conditional generation. Our method is straightforward to implement and does not require any external system to augment the input during inference. Our experiments demonstrate that it is currently the best available decoding strategy for generating diverse and meaningful output. We also introduced Self-Entailment and Self-BERTScore to automatically compute semantic diversity in summaries, and EDNA for jointly measuring faithfulness and diversity.

Acknowledgements

We thank the reviewers, the ARR action editor, and the senior area chair for their valuable feedback. We would like to thank Ryan McDonald, Ankur Parikh, and Slav Petrov for their insightful comments. Many thanks also to Ashwin Kakarla and his team for their help with the human evaluation.
Ethical Considerations

The nature of text generation raises multiple ethical considerations for downstream applications. The main failure mode is that the model can learn to mimic properties of the targets in the training data that are not desirable.

Faithfulness and Factuality Since models create new text, there is the danger that they may be neither faithful to the source material nor factual. This can be exacerbated when the data itself has highly abstractive targets, which require the model to generate words not seen in the source material during training. This often leads the model to generate content inconsistent with the source material (Maynez et al., 2020; Kryscinski et al., 2020; Gabriel et al., 2021).

Trustworthy Data If the data itself is not trustworthy (i.e., it comes from suspect or malicious sources), the model will naturally become untrustworthy, as it will ultimately learn the language and topics of the training data. For instance, if the training data is about Obama birther conspiracies, and the model is asked to generate information about the early life of Obama, there is a risk that false claims will be predicted by the model.

Bias in Data Similarly, biases in the data around gender, race, etc., risk being propagated in the model predictions, as is common for most NLP tasks. This is especially true when the models are trained on non-contemporary data that do not represent current norms and practices (Blodgett et al., 2020).

The above considerations are non-malicious, in that the model is merely learning to behave like its underlying source material. If users of such models are not aware of these issues and do not account for them, e.g., with better data selection and evaluation, then the generated text can be damaging.

Generation models can also be misused in malicious ways. These include generating fake news, spam, and other text meant to mislead large sections of the general population.
References

David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. 1985. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147–169.

Danial Alihosseini, Ehsan Montahaei, and Mahdieh Soleymani Baghshah. 2019. Jointly measuring diversity and quality in text generation models. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 90–98, Minneapolis, Minnesota. Association for Computational Linguistics.

Rahul Aralikatte, Shashi Narayan, Joshua Maynez, Sascha Rothe, and Ryan McDonald. 2021. Focus attention: Promoting faithfulness and diversity in summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6078–6095, Online. Association for Computational Linguistics.

Saliha Azzam, Kevin Humphreys, and Robert Gaizauskas. 1999. Using coreference chains for text summarization. In Coreference and Its Applications.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.

Regina Barzilay and Michael Elhadad. 1997. Using lexical chains for text summarization. In Intelligent Scalable Text Summarization.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.

Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan May, and Kevin Knight. 2018. Recurrent neural networks as weighted language recognizers. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2261–2271, New Orleans, Louisiana. Association for Computational Linguistics.

Jaemin Cho, Minjoon Seo, and Hannaneh Hajishirzi. 2019. Mixture content selection for diverse sequence generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3121–3131, Hong Kong, China. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4884–4895, Florence, Italy. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 13042–13054. Curran Associates, Inc.

Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2020. GSum: A general framework for guided neural abstractive summarization.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352, Vancouver, Canada. Association for Computational Linguistics.

Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104, Copenhagen, Denmark. Association for Computational Linguistics.

Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, and Jianfeng Gao. 2021. GO FIGURE: A meta evaluation of factuality in summarization. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 478–487, Online. Association for Computational Linguistics.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman, London.

Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701, Minneapolis, Minnesota. Association for Computational Linguistics.

Junxian He, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, and Caiming Xiong. 2020. CTRLsum: Towards generic controllable text summarization.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649, Melbourne, Australia. Association for Computational Linguistics.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR, abs/1910.13461.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Austin, Texas. Association for Computational Linguistics.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 150–157.

Inderjeet Mani. 2001. Automatic summarization, volume 3. John Benjamins Publishing.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Clara Meister, Ryan Cotterell, and Tim Vieira. 2020. If beam search is the answer, what was the question? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2173–2185, Online. Association for Computational Linguistics.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics.

Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simões, Vitaly Nikolaev, and Ryan McDonald. 2021. Planning with learned entity prompts for abstractive summarization. Transactions of the Association for Computational Linguistics, 9:1475–1492.

Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with saliency and entailment. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 646–653, New Orleans, Louisiana. Association for Computational Linguistics.

Weizhen Qi, Yu Yan, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2401–2410, Online. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 8:264–280.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Tianxiao Shen, Myle Ott, Michael Auli, and Marc'Aurelio Ranzato. 2019. Mixture models for diverse machine translation: Tricks of the trade. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5719–5728. PMLR.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 5926–5936. PMLR.

Christoph Tillmann and Hermann Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1):97–133.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. CoRR, abs/1610.02424.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7371–7379. AAAI Press.

Zhen Wang, Siwei Rao, Jie Zhang, Zhen Qin, Guangjian Tian, and Jun Wang. 2020. Diversify question generation with continuous content selectors and question type modeling. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2134–2143, Online. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics.

Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and Arvind Neelakantan. 2021. Trading off diversity and quality in natural language generation. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 25–33, Online. Association for Computational Linguistics.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020b. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910, Brussels, Belgium. Association for Computational Linguistics.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. CoRR, abs/1704.01792.

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '18, page 1097–1100, New York, NY, USA. Association for Computing Machinery.