Analyzing the Abstractiveness-Factuality Tradeoff With Nonlinear Abstractiveness Constraints
Markus Dreyer, Mengwen Liu, Feng Nan, Sandeep Atluri, Sujith Ravi
Amazon
{mddreyer, mengwliu, nanfen, satluri, sujithai}@amazon.com
arXiv:2108.02859v1 [cs.CL] 5 Aug 2021

Abstract

We analyze the tradeoff between factuality and abstractiveness of summaries. We introduce abstractiveness constraints to control the degree of abstractiveness at decoding time, and we apply this technique to characterize the abstractiveness-factuality tradeoff across multiple widely-studied datasets, using extensive human evaluations. We train a neural summarization model on each dataset and visualize the rates of change in factuality as we gradually increase abstractiveness using our abstractiveness constraints. We observe that, while factuality generally drops with increased abstractiveness, different datasets lead to different rates of factuality decay. We propose new measures to quantify the tradeoff between factuality and abstractiveness, incl. µQAGS, which balances factuality with abstractiveness. We also quantify this tradeoff in previous works, aiming to establish baselines for the abstractiveness-factuality tradeoff that future publications can compare against.

Figure 1: Three successively more abstractive summaries generated from the same input article, with MINT abstractiveness scores (Section 2.2) of 46.1%, 67.2%, 79.5%. Fragments extracted from the input are marked from red (longer fragments) to yellow (shorter fragments). The bottom summary has factual errors.

1 Introduction

Summarization is the task of generating a semantically faithful, well-formed and concise text representation of some input documents' main information. Automatically generated summaries have traditionally been extractive (Neto et al., 2002; Erkan and Radev, 2004; Wong et al., 2008), leading to issues with readability and coherence, as different extracted fragments may not fit well when taken out of their original contexts (Poibeau and Saggion, 2012). More recently, researchers have invested in methods for abstractive summarization, aiming to paraphrase the input documents' main points without borrowing their exact lexical expressions (Radford et al., 2019; Gehrmann et al., 2019; Lewis et al., 2019; Zhang et al., 2020). Abstractive summaries generated by state-of-the-art neural models tend to be fluent and well-formed, but lack semantic faithfulness (Cao et al., 2017; Kryscinski et al., 2019). Observed rates of factual errors in abstractive summaries have ranged from 30% to over 75% (Cao et al., 2017; Maynez et al., 2020). The research community has started to address this problem by developing automatic factuality metrics (Wang et al., 2020; Kryscinski et al., 2020; Goodrich et al., 2019) and methods that attempt to increase factuality (Fan et al., 2018; Scialom et al., 2019; Zhang et al., 2019; Falke et al., 2020).

However, we argue that the factuality problem of abstractive summaries cannot be well understood without considering the degree of abstractiveness of a given summary: Any summary is on a spectrum between extractive and abstractive (See et al., 2017), and a summary that is more abstractive typically has more problems with factual inconsistencies. Summaries that are extractive to a larger extent tend to be more factual, since copying text from the input into the summary rarely introduces factual errors, while the task of paraphrasing,
which results in summaries that are more abstractive, is harder and prone to semantic errors. As an example, Figure 1 shows part of a Washington Post article and three automatically generated summaries with increasing degrees of abstractiveness, which we have generated using our abstractiveness constraints (Section 2.1). The first two summaries are correct, but the third, most abstractive, summary has factual errors, misinterpreting the input.

Few authors have discussed this connection. Lebanoff et al. (2019) observe that abstractive summaries consisting of concatenated extracted fragments tend to be more factual than those created by more complex fusion. Durmus et al. (2020) observe that models trained on the more extractive CNN/DM dataset (Hermann et al., 2015) create more factual summaries than models trained on the more abstractive XSum dataset (Narayan et al., 2018). We show that such models differ in factuality even when we bias them to generate summaries that have similar levels of abstractiveness. Our analysis (Section 4) situates summarization models on the spectrum outlined in Figure 2, where factual summaries range from "trivially factual" (extractive) to truly "paraphrasing" (abstractive).

Figure 2: Four extremes at the abstractiveness-factuality spectrum.

We make the following contributions:

1. We introduce nonlinear abstractiveness constraints (NAC), which control the degree of abstractiveness at decoding time using length-based penalties for extractive fragments;
2. Using this decoding technique, we systematically explore the relationship of abstractiveness and factuality, within a particular dataset as well as across datasets.
3. As part of this exploration, we conduct an extensive human factuality study and perform an evaluation of automatic factuality metrics.
4. We introduce measures to quantify the abstractiveness-factuality tradeoff and use them to compare multiple summarization models. Our automatic measure, µQAGS, will enable future researchers to compare their own models and demonstrate progress.

2 Abstractiveness

2.1 Nonlinear Abstractiveness Constraints

We now introduce a technique to control the degree of abstractiveness of any summary y obtained by decoding input x. With this technique – soft constraints applied during decoding, which we call nonlinear abstractiveness constraints (NAC) – we can use any previously trained model to generate summaries with varying degrees of abstractiveness (see Figure 1 for an example).

Let F(x, y) be the set of the longest extractive fragments in y with respect to x. In Figure 1, such fragments are marked in color for each summary. We define a function λ_h(|f|) that assigns a discount probability to any extractive fragment f ∈ F(x, y):

    λ_h(|f|) = 2^(−|f|² / h²)    (1)

We configure this function with a parameter h, interpreted as the length of an extracted fragment for which λ_h = 0.5. (The exponent used in |f|² and h² could also be configured, but we keep it at 2 in our experiments; a larger exponent would result in a steeper descent around h.) Setting h to smaller values results in summaries that are more abstractive, since shorter extractive fragments will be discounted more strongly by λ_h; see Figure 3. Our discount penalty grows nonlinearly, affecting longer contiguous extractive fragments more strongly than multiple shorter ones with the same combined length – extracting a 10-gram makes a summary more extractive than using ten words from the article separately.

Figure 3: λ_h defines discounts for extractive fragments based on their lengths. Smaller h values lead to more abstractive summaries. (Plot: λ_h on the y-axis over extractive fragment length 1–7 on the x-axis, shown for h = 2 and h = 4.)

In decoding, we search for the summary ŷ that maximizes the product of the summarization model probability, p_M(y | x), and the discount probabilities of the extractive fragments F(x, y):

    ŷ = argmax_y  p_M(y | x) × ∏_{f ∈ F(x,y)} λ_h(|f|)    (2)
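As a concrete illustration, the sketch below (not the authors' code) implements the discount of Equation 1 and rescores a finished candidate summary according to Equation 2. The greedy left-to-right matching used here to find the extractive fragments F(x, y) is an assumption; the paper defines F(x, y) only abstractly.

```python
import math
from typing import List

def lam(frag_len: int, h: float) -> float:
    """Discount probability for an extractive fragment of length |f| (Eq. 1)."""
    return 2.0 ** (-(frag_len ** 2) / (h ** 2))

def extractive_fragment_lengths(x_tokens: List[str], y_tokens: List[str]) -> List[int]:
    """Lengths of contiguous summary fragments that also appear in the input.
    Greedy longest-match from left to right; an assumption, not the paper's exact procedure."""
    lengths, i = [], 0
    while i < len(y_tokens):
        best = 0
        for j in range(len(x_tokens)):  # longest contiguous match starting at y[i]
            k = 0
            while (i + k < len(y_tokens) and j + k < len(x_tokens)
                   and y_tokens[i + k] == x_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best >= 1:
            lengths.append(best)
            i += best
        else:
            i += 1
    return lengths

def rescore(log_p_model: float, x_tokens, y_tokens, h: float = 2.0, reward: bool = False) -> float:
    """Eq. 2 in log space: log p_M(y|x) + sum_f log lambda_h(|f|).
    With reward=True the inverse 1/lambda_h is applied (extraction reward)."""
    sign = -1.0 if reward else 1.0
    penalty = sum(math.log(lam(l, h)) for l in extractive_fragment_lengths(x_tokens, y_tokens))
    return log_p_model + sign * penalty
```

In the paper the discount is applied incrementally inside beam search (Equation 3 below) rather than by rescoring finished candidates; this post-hoc rescoring only illustrates the objective being maximized.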
Beam Decoding. The model probability p_M(y | x) in neural text generation models (Section 5.2) decomposes for token-by-token decoding as ∏_{i=1}^{|y|} p_M(y_i | x, y_1, ..., y_{i−1}). Similarly, we decompose the application of the λ_h function for any partial or completed extractive fragment f:

    λ_h(|f|) = ∏_{l=1}^{|f|} λ_h(l) / λ_h(l−1)    (3)

Therefore, to successively apply λ_h at each output position i in beam decoding, each candidate token y_i is evaluated to check whether choosing it would extend an extractive fragment to length l. If so, its model probability p_M(y_i | ...) is multiplied with λ_h(l), and the λ_h(l−1) that was applied to the previous token y_{i−1} is divided out.

Log Space. In practice, we equivalently search for ŷ in log space, using log probabilities and the log of λ_h defined in Equation 1. It can be shown that log λ_h(|f|) = −|f|² / (1.20112 × h)².

Extraction Rewards. We can choose to apply an extraction reward, rather than a penalty, by using the inverse 1/λ_h; smaller h then results in summaries that are more extractive.

2.2 Measuring Abstractiveness

The abstractiveness constraints directly control the abstractiveness of the generated summaries and indirectly influence their factuality. We now discuss metrics for abstractiveness, before discussing metrics for factuality (Section 3).

To measure abstractiveness, most authors list the proportions of summary n-grams that are novel, i.e., do not occur in the corresponding input documents (See et al., 2017; Narayan et al., 2018; Gao et al., 2019). Grusky et al. (2018) proposed a metric based on contiguous overlapping text spans, called density, measuring the average length of extracted fragments in a summary. Others have proposed metrics that take common non-contiguous subsequences into account; e.g., perfect fusion_k (Durmus et al., 2020) measures the percentage of summary sentences that assemble substrings from k source sentences in their original order.

Based on these previous works, we define a comprehensive abstractiveness metric that combines measures of contiguous as well as non-contiguous extractive summary fragments. We define this metric as a ratio, in order to facilitate combining it with a factuality metric of the same [0,1] range (Section 4). Let χ(x, y) = hmean(p1, p2, p3, p4, lcsr) be a measure of extractive overlap between input x and summary y, using the harmonic mean of multiple component measures. Each pn, short for pn(x, y), is the n-gram precision of the n-grams in y with respect to x, i.e., the percentage of n-grams in y that are extracted from x. (We smooth all n-gram counts (Chen and Cherry, 2014) to avoid undefined or zero harmonic mean values in highly abstractive summaries; see Appendix B for details.) We do not include density in χ(x, y), as the range of density is unbounded. The measure lcsr (longest common subsequence ratio), short for lcsr(x, y), is the length of the longest common subsequence (LCS) between x and y divided by the length of y. lcsr, inspired by ROUGE-L (Lin, 2004), generalizes perfect fusion_k to consider all instances of non-contiguous overlaps between input and summary. To measure abstractiveness, we define MINT (Metric for lexical independence of generated text) as MINT(x, y) = 1 − χ(x, y). For a set of inputs and their summaries, we report the average MINT score. See Figure 1 for the MINT scores of three sample summaries.

The MINT score thus capitalizes on prior work to provide a comprehensive and unified metric for abstractiveness of conditionally generated text, combining measures of contiguous and non-contiguous overlap into a single percentage score. We will provide a reference implementation to facilitate a standardized comparison of abstractiveness across different works.
3 Factuality

We now describe metrics for factuality, before we can describe the relationship between abstractiveness and factuality (Section 4). By factuality of a summary y, we mean factual consistency with the input x, rather than objective factuality or universal truth. Measuring factuality automatically is an active area of research (Gabriel et al., 2020). Factuality is most naturally measured by human annotators; we describe our setup for human factuality annotation first, then move to automatic metrics. In Section 5.4, we measure the correlations of automatic factuality metrics to our human judgements.

3.1 Human-annotated Factuality

We use Amazon's Mechanical Turk (AMT) to measure the factuality of automatically generated summaries with human annotators. Mechanical Turk annotators are untrained, so we use multiple mitigation strategies to obtain reliable judgements.

Figure 4: Screenshot (part) of a Mechanical Turk task (HIT) to judge the factuality of a summary sentence (in blue) with respect to news articles. Darker green article sentences are more similar to the blue summary sentence. The full task showed sentences from two more articles in the same cluster, from the Multi-News test set.

We simplify the task: To avoid overwhelming annotators with long text, we select a single sentence per summary and ask the annotators if it is factually consistent with the shown article (or, in the case of multi-document summarization, multiple articles). The other sentences of the summary are given as well for context, shown in gray; see Figure 4 for an example. The article(s) are shortened to show a total of 9 sentences that were determined to be semantically most similar to the selected summary sentence (we measure cosine similarity of sentence encodings computed by the Universal Sentence Encoder (Cer et al., 2018)); the remaining article parts are replaced by "...". The summary sentence is selected at random in proportion to its length. For each summary, we get judgements only for the randomly selected sentence. Aggregated over a set of summaries, we thus measure the average chance of any randomly selected summary sentence being factual.

We also provide detailed task instructions, incl. examples for intrinsic and extrinsic factual errors (Maynez et al., 2020). We require that potential annotators pass our custom qualification test, consisting of three of our example tasks of finding factual summary errors. Only workers who have previously completed 100 or more tasks on AMT with an acceptance rate of 95% or higher may take the test; 15% of those pass, enabling them to work on our evaluation tasks. We use three annotators per task and use MACE (Hovy et al., 2013) to aggregate these multiple annotations per summary and recover the most likely binary factuality judgement per summary. More details about our setup are in Appendix C.

For any set of automatically generated summaries, we create the AMT tasks, get an aggregate binary judgement per summary based on the multiple answers from annotators as described, and report the average of the human binary summary factuality judgements; we call this score FactH (Table 1). We collect human judgements on the first 150 samples of the test sets.

3.2 Automatically Measured Factuality

We use four metrics based on question answering (QA): SummaQA (Scialom et al., 2019), FEQA (Durmus et al., 2020), QAGS (Wang et al., 2020) and QUALS (Nan et al., 2021), and one entailment-based metric: FactCC (Kryscinski et al., 2020). SummaQA converts input sentences into Cloze-style questions by masking entities, applies BERT to recover them based on the summary, and reports the match F1 score. Since SummaQA measures summaries based on how well they answer questions from the input, it measures both factuality and content selection. FEQA generates declarative questions from masked summary sentences whose masked entities are used as "gold" answers; these are compared to the answers obtained from a QA model on the input. In QAGS, a question generation (QG) model generates questions from the summary, a QA model answers these questions from both summary and input, and the similarity of the answer pairs is evaluated. QUALS is similar, but achieves runtime and memory reductions by replacing the QG and QA models with a single model, QAGen (Shakeri et al., 2020); it generates QA pairs from the summary, computes their average log probability given the input and given the summary, and reports their difference in (−∞, ∞). FactCC is a BERT-based binary classifier trained on weakly-supervised sentence pairs derived from rules. We took the pre-trained FactCC model and fine-tuned it using the ANLI dataset (Nie et al., 2020). FactCC is trained with sentence pairs, so we obtain scores at the sentence level and use their mean as the factuality score of a summary. (This averaging method achieved the best correlations with our human evaluation; we also tried using the min or max score per summary sentence, or scoring the full summaries.)
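The sketch below illustrates how the two kinds of factuality scores above could be aggregated. The paper uses MACE (Hovy et al., 2013) to recover the per-summary binary judgement; a simple majority vote is substituted here as a stand-in, and `sentence_score` is a hypothetical callable standing in for any sentence-level automatic metric such as the fine-tuned FactCC.

```python
from statistics import mean
from typing import Callable, List

def fact_h(annotations: List[List[bool]]) -> float:
    """FactH: average of per-summary binary judgements.
    Each inner list holds the worker votes for one summary; majority vote is a
    simplification of the MACE aggregation used in the paper."""
    per_summary = [sum(votes) > len(votes) / 2 for votes in annotations]
    return mean(per_summary)

def summary_factuality(sentences: List[str], source: str,
                       sentence_score: Callable[[str, str], float]) -> float:
    """Sentence-level automatic scores averaged into a summary-level score,
    as done for the FactCC classifier; `sentence_score` is hypothetical."""
    return mean(sentence_score(s, source) for s in sentences)
```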
Figure 5: Human factuality judgements (i.e., FactH) for different degrees of abstractiveness (MINT). For all symbols of the same color, the same test subset was decoded using the same trained baseline model, but with different abstractiveness constraints (Section 2.1); large outlined symbols mean no constraints; each color represents a dataset. (Plot axes: Abstractiveness (x) vs. Factuality (y), both in percent; datasets shown: CNN/DM, MN-500, MN-800, XSum.)

Table 1: Abstractiveness-factuality tradeoff on 150 test samples. The 17 MINT and FactH numbers are as shown in Figure 5; we add µFactH, as well as QAGS and µQAGS on the same 150 samples for comparison. (FactH and µFactH are based on human judgements; QAGS and µQAGS are automatic.)

          λ       MINT   FactH   µFactH   QAGS   µQAGS
CNN/DM    1/λ2    11.2   94.0    66.4     85.6   60.8
          none    19.6   88.7    65.7     83.7   62.4
          λ4      44.3   83.3    70.3     75.7   65.3
          λ2      68.2   76.0    73.4     67.1   67.5
MN-500    1/λ2    35.0   80.7    65.5     80.1   65.1
          none    46.3   73.3    64.3     75.4   65.7
          λ4      59.6   60.7    60.3     70.5   66.9
          λ2      70.5   60.0    63.5     63.8   66.1
MN-800    1/λ2    28.8   80.0    62.9     82.4   64.5
          none    38.4   82.7    67.9     78.3   65.0
          λ4      50.7   68.7    62.7     72.8   65.4
          λ2      64.4   61.3    62.4     67.0   66.1
XSum      1/λ1    56.8   50.7    52.7     45.0   48.9
          1/λ2    74.6   42.0    52.9     38.9   50.8
          none    80.6   41.3    54.4     37.8   52.1
          λ4      84.0   41.3    55.5     35.7   51.8
          λ2      88.0   39.3    55.6     35.5   53.0

4 The Abstractiveness-Factuality Tradeoff

The metrics for factuality and abstractiveness, described above, along with the abstractiveness constraints presented in Section 2.1, provide us with tools to systematically explore the relationship between abstractiveness and factuality.

Factuality Trend Lines. To explore this relationship, we train summarization models on different datasets. For any trained summarization model, we decode the test set multiple times with different h values for λ_h (Equation 1), resulting in sets of summaries with different degrees of abstractiveness. For each of these test set decodings, we measure the average abstractiveness using MINT and the corresponding average factuality, using human annotations unless otherwise noted. This results in a series of (abstractiveness, factuality) points for any trained summarization model, which can be plotted along with a linear trend line. Figure 5 shows such a plot; Section 5.3 discusses its details.

F@50 Score. Given the multiple datapoints per model exhibiting different degrees of abstractiveness and associated factuality, we estimate the factuality at 50% abstractiveness, an intuitively interpretable metric, which we call F@50; it provides a comparison of the factuality of different models at a fixed degree of abstractiveness.

µFactH and µQAGS. We characterize the tradeoff on any decoding output using a weighted average between factuality and abstractiveness, (φF + A)/(φ + 1). To measure abstractiveness A, we use MINT; to measure factuality F, we use either human-measured factuality or an automatic metric with [0,1] range like QAGS, resulting in the abstractiveness-adjusted factuality metrics µFactH and µQAGS. We give factuality twice as much weight, since low factuality can have large negative impact (Zellers et al., 2019), setting φ to 2. By this definition, a system whose factuality decreases by x units, as compared to another system, must make up for the lost factuality by 2x units in abstractiveness to get the same score. When two systems have the same factuality, the score prefers the one with higher abstractiveness.
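The two tradeoff measures of Section 4, written out as a small sketch. The formula for the µ scores is taken directly from the text; fitting the trend line with numpy's least-squares polyfit is our choice, since the paper does not specify how the linear trend line is fitted.

```python
import numpy as np

def mu_score(factuality: float, abstractiveness: float, phi: float = 2.0) -> float:
    """Abstractiveness-adjusted factuality, (phi*F + A) / (phi + 1).
    With F = FactH this is muFactH; with F = QAGS it is muQAGS."""
    return (phi * factuality + abstractiveness) / (phi + 1.0)

def f_at_50(mint_scores, factuality_scores) -> float:
    """F@50: the linear trend line over one model's (MINT, factuality) points,
    evaluated at 50% abstractiveness."""
    slope, intercept = np.polyfit(mint_scores, factuality_scores, deg=1)
    return slope * 50.0 + intercept

# Example with the CNN/DM row for lambda=none from Table 1:
# mu_score(88.7, 19.6) -> approx. 65.7, matching the reported muFactH.
```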
5 Experiments

5.1 Datasets

We use CNN/DM (Hermann et al., 2015), XSum (Narayan et al., 2018), and Multi-News (Fabbri et al., 2019). The former two are single-document summarization datasets, and the latter is a multi-document summarization dataset. CNN/DM contains news articles from two news providers, CNN and DailyMail; the summaries are in the format of bullet point highlights.
The XSum dataset contains articles from BBC News, where the first sentence of an article is used as the summary and the remaining sentences as input documents. (Following Wang et al. (2020), we reinsert the first sentences whenever we measure factuality of XSum summaries on AMT or with automatic metrics.) In the Multi-News dataset, each summary is written by a professional editor and paired with a cluster of articles from multiple news providers. The dataset sizes are described in Appendix E.

5.2 Setup

We use BART (Lewis et al., 2020) for our analyses of the relationship between factuality and abstractiveness. BART is a sequence-to-sequence model with a bidirectional transformer encoder and a left-to-right transformer decoder. The joint encoder-decoder model was pretrained on 160GB of text and has been shown to give state-of-the-art results on CNN/DM and XSum. Our models use the provided model checkpoints for the CNN/DM and XSum datasets as well as the recommended decoding settings. (The checkpoints and settings are available on https://github.com/pytorch/fairseq/tree/master/examples/bart.) For Multi-News (MN), we train a model on the training set, starting from the bart.large.cnn pretrained model with a learning rate of 2e-5 for five epochs, and decode with a minimum length of 50 and a maximum length of 300 tokens. For Multi-News, we truncate the input documents per cluster so that their combined length does not exceed N words, following Fabbri et al. (2019). We train models with N = 500 and N = 800, called MN-500 and MN-800, respectively. We measure the MINT scores for the reference summaries in these datasets; these can be compared to the MINT scores obtained in decoding (Section 5.3). The test set references for MN-500 have a MINT score of 78.2%, compared to 72.8% for MN-800. MINT is naturally higher for MN-500 since the shorter truncation removes content that could otherwise overlap with the outputs. The MINT scores for the CNN/DM and XSum references are 59.6% and 87.8%, respectively, indicating that XSum is the most abstractive dataset.

Table 2: Impact of λ on ROUGE-L F1 (RL) and abstractiveness metrics on the full test sets. p3, p4 and lcsr are among MINT's component scores (Sec. 2.2); density is the average length of extracted fragments (Grusky et al., 2018).

          λ       RL     MINT   p3     p4     lcsr   density
CNN/DM    1/λ2    38.1   8.6    89.5   85.3   93.3   31.0
          none    40.8   14.7   82.1   75.5   89.7   20.3
          λ4      41.7   38.4   55.4   41.8   78.6   6.4
          λ2      40.0   61.8   33.2   19.7   68.7   3.4
MN-500    1/λ2    44.7   35.7   61.6   54.2   60.4   15.9
          none    45.4   46.1   50.0   41.2   54.2   9.9
          λ4      45.1   59.0   36.5   26.1   46.6   4.7
          λ2      43.7   70.7   25.5   16.3   40.4   3.5
MN-800    1/λ2    44.9   27.5   69.9   62.7   69.3   19.5
          none    45.8   37.3   58.7   49.9   63.4   12.9
          λ4      45.7   52.6   42.4   31.2   53.8   5.8
          λ2      44.2   66.4   29.0   19.0   46.1   4.0
XSum      1/λ1    30.6   57.6   37.7   28.2   63.5   4.6
          1/λ2    36.0   74.7   22.3   13.4   57.4   2.2
          none    36.8   80.2   17.6   9.2    54.5   1.6
          λ4      36.8   83.1   15.0   7.0    53.0   1.3
          λ2      36.3   87.0   11.7   4.8    50.3   1.1

5.3 Comparison Across Datasets Using NAC

We decode each of the four trained models multiple times with varying nonlinear abstractiveness constraints. Figure 5 plots the resulting abstractiveness and human-measured factuality, thereby providing a visual representation of the abstractiveness-factuality tradeoff for the four models. Table 1 shows the same 17 MINT and FactH values, along with QAGS values and the MINT-adjusted factualities µFactH and µQAGS.

XSum. The lower right of Figure 5 shows five lozenges. The larger of these represents the decoding of the test set with our XSum-trained model using default settings; the other four red points represent decodings under the same model, but with different abstractiveness constraints that result in more extractive (1/λ_h) or more abstractive (λ_h) summaries (Section 2.1). The five red points are associated with a dashed linear trend line. Compared to the other points in the figure, abstractiveness is high and factuality low. It took a strong extractive reward (1/λ1), which we did not use for the models trained on other datasets, to pull this model toward lower abstractiveness and higher factuality.

Multi-News. Four decodings using the MN-500 model are shown as squares, decodings under the MN-800 model as triangles. The two decodings without constraints, denoted by the larger symbols, are far apart: MN-500 is more abstractive and less factual, MN-800 is the opposite. This can be explained by the fact that for MN-500, larger parts of the input are truncated (Section 5.2) that the untruncated reference summary in training may still refer to; the model learns to hallucinate more.
However, the trend lines of MN-500 and MN-800 are almost identical. This underscores the value of the trend lines, which reveal the near-identical abstractiveness-factuality spectrum of these two models.

CNN/DM. The four decodings for CNN/DM are shown as bullets. Its default model output without abstractiveness constraint (large bullet) is by far the most extractive (MINT 19.6%); the extraction reward to its left (using 1/λ2) cannot make it much more extractive. However, there is room to the right, and the abstraction rewards (λ4 and λ2) move its abstractiveness far into the abstractiveness level of Multi-News and even XSum.

F@50 Scores. One of the main takeaways of this study is that different systems can have different factuality rates at the same level of abstractiveness. Previous authors have observed that XSum summaries are highly abstractive and less factual, and that CNN/DM summaries are at the opposite side of that spectrum. We confirm this; however, we add that we can bias the XSum model to create less abstractive summaries and the CNN/DM model to create more abstractive summaries, so that their abstractiveness becomes comparable, and the factuality rates still differ considerably: Based on the trend line, the F@50 score of the XSum model is 52.6%, while the CNN/DM model's F@50 is 81.3%. MN-500 and MN-800 lie in the middle, both with an F@50 score of 70.5.

µFactH Scores. When no abstractiveness constraint is applied (λ=none in Table 1), the CNN/DM and MN models have comparable µFactH scores of 65.7%, 64.3% and 67.9%. The XSum model has a low µFactH score of 54.4%; at its degree of abstractiveness it would need a factuality of 58.3% to match the µFactH score of the CNN/DM model, but its actual FactH score is only 41.3%. The µQAGS scores correlate highly with the µFactH scores.

Summary Quality and Abstractiveness. Table 2 lists ROUGE scores for the different decodings, along with multiple abstractiveness metrics; these numbers are measured on the full test sets. (The MINT scores on the full test sets (Table 2) are similar to those on the smaller subsets (Table 1), with the exception of CNN/DM, where the first test set part consists of CNN articles, while 90.5% of the full test set consists of Daily Mail articles.) Decodings without abstractiveness constraints replicate previous works' ROUGE scores (Lewis et al., 2020; Fabbri et al., 2019), see Appendix A. The λ4 constraint can dramatically increase abstractiveness while leaving ROUGE scores virtually unchanged. To measure summary quality further, we conducted a human evaluation of informativeness and coherence of the summaries generated with the λ4 decoding constraint; the results are mixed, where sometimes the default decoding is preferred, sometimes the λ4 decoding and sometimes neither of them, see Appendix D. The density scores (Grusky et al., 2018) in the table have high correlation with the MINT scores. See Appendix F for fusion scores (Durmus et al., 2020).

5.4 Correlating Human and Automatic Factuality Judgements

Setup. We analyze the correlation between human and automatic factuality scores, complementing earlier studies in this area (Gabriel et al., 2020). Table 3 describes three groups of correlation results, corresponding to different aggregations of summaries: The "summary sets" group contains the 17 data points whose human factuality scores can be seen in Figure 5; we score the factuality of the same summaries automatically and correlate the 17 score pairs per automatic metric. The "summaries" group contains the 17 × 150 individual factuality scores per summary, so in the "All" column we correlate all 2,550 factuality judgements from human annotators with the automatic metrics. The XSum, MN, and CNN columns contain only the subsets of summaries corresponding to those datasets. The automatic metrics score the full summaries with respect to their inputs. However, recall that we ask human annotators to judge only one summary sentence in context (Figure 4). The "summaries (one sentence)" numbers show the correlations of the 2,550 factuality judgements with the automatic metrics computed based on that same sentence per summary that annotators judged.

Table 3: Correlations of automatic factuality metrics to human annotations. "Summary sets" correlates the mean scores for the 17 sets of summaries shown in Figure 5. "Summaries" correlates individual scores for all summaries. All results are statistically significant at p < 0.05 except those marked with a dagger (†).

                 (1) Summary sets   (2) Summaries                    (3) Summaries (one sentence)
                 All                All    XSum    MN     CNN        All     XSum    MN     CNN
FEQA             .91                .18    †-.03   .18    †.05       .14     †-.03   .15    †.00
SummaQA          .76                .20    .08     .14    .12        †-.04   .08     -.08   †.00
FactCC           .69                .25    .22     .11    .38        .23     .22     .18    .26
FactCC (ANLI)    .86                .32    .35     †.01   .44        .28     .35     .10    .29
QAGS             .96                .41    .33     .23    .28        .40     .33     .30    .15
QUALS            .96                .41    .26     .25    .36        .37     .26     .29    .25

Results. All automatic metrics have high correlation with the 17 aggregated human judgements per summary set. QAGS and QUALS have almost perfect correlations with the human aggregate judgements, making them suitable to automatically judge the mean factuality of a system. Since QAGS, in particular, uses a [0,1] range like MINT, we combine the two as µQAGS (Section 4) and report QAGS and µQAGS in Tables 1 and 4. On the basis of individual summaries, overall correlations are moderate at best.
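A sketch of how the correlation groups in Table 3 could be computed; the paper does not state which correlation coefficient is used, so Pearson's r via scipy is an assumption here.

```python
from scipy.stats import pearsonr

def correlate(human_scores, metric_scores):
    """Correlation between human factuality judgements and an automatic metric.
    Group (1) passes the 17 per-decoding mean scores; groups (2) and (3) pass the
    2,550 per-summary values, with the metric computed on the full summary or only
    on the sentence the annotators judged."""
    r, p = pearsonr(human_scores, metric_scores)
    return r, p
```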
For the 2,550 individual summaries, only QAGS and QUALS have correlations above 0.40 in the "All" column. For the dataset-specific subsets, correlations are lower, i.e., the automatic factuality metrics are better at distinguishing the summaries from different models, but worse at distinguishing the summaries from a particular model. For group (3), "summaries (one sentence)", SummaQA has the lowest correlations because it is not designed to work on single summary sentences. For XSum, the results in this group are identical to group (2) because XSum has only one-sentence summaries. For CNN/DM, all metrics correlate better when using the full summaries. For MN, they correlate better when using the random sentence that human annotators judged, possibly because its summaries are long (218 words on average). FactCC works better after being fine-tuned on ANLI, but it fails on MN.

5.5 Comparison Across Different Models

We compare the abstractiveness-factuality tradeoffs of summarization models from the literature. We obtain model outputs of four summarization models other than BART: PGCONV (See et al., 2017) is a hybrid pointer-generator network that can copy words from the source text via pointing; BERTSUM (Liu and Lapata, 2019) is a transformer encoder-decoder model in which only the encoder is pretrained; both BOTTOMUP (Gehrmann et al., 2018) and ABSRL (Chen and Bansal, 2018) use separately selected source fragments to constrain an abstractive generation model.

Table 4: Comparing summarization models on the full CNN/DM and XSum test sets. µQAGS measures abstractiveness-factuality tradeoffs. RL is ROUGE-L F1.

          Model       MINT   QAGS   µQAGS   RL
CNN/DM    BART        14.7   85.4   61.8    40.8
          BERTSUM     14.1   83.1   60.1    39.2
          PGCONV      5.5    84.4   58.1    36.4
          BOTTOMUP    17.2   78.7   58.2    38.3
          ABSRL       18.9   82.5   61.3    37.3
XSum      BART        80.2   40.1   53.5    36.8
          BERTSUM     82.8   27.6   46.0    31.3

Table 4 shows that, on CNN/DM, BART and ABSRL have the best µQAGS scores (61.8 and 61.3), with ABSRL trading some factuality (QAGS) for considerably higher abstractiveness (MINT). Compared to ABSRL, BERTSUM is considerably less abstractive and only slightly more factual, leading to a lower µQAGS score (60.1). PGCONV and BOTTOMUP rank last (µQAGS scores 58.1 and 58.2) for different reasons: PGCONV is very extractive, while BOTTOMUP is the least factual model. On XSum, BART represents a sizable factuality gain over BERTSUM, with a relatively small loss in MINT score. BART's pretraining of both encoder and decoder may be contributing to its better factuality, in accordance with Maynez et al. (2020).

6 Related Work

Abstractiveness-Factuality Tradeoff: Durmus et al. (2020) observe that the degree of abstractiveness at test time depends on the abstractiveness of the training data and that highly abstractive summaries tend to be less factual. We go a step further: We control for abstractiveness and see that factuality rates between different systems can vary widely at the same levels of abstractiveness.

Increasing Abstractiveness: Kryściński et al. (2018) use policy gradient with a novelty reward to encourage abstraction in a pointer-generator (PG) network (Gulcehre et al., 2016; See et al., 2017). Weber et al. (2018) penalize the amount of copied tokens during PG decoding. Our decoding technique applies to general sequence-to-sequence models; we penalize longer contiguous extracted fragments more than multiple shorter fragments that add up to the same length, using nonlinear penalties for whole fragments. Finally, Song et al. (2020) control the amount of copying in training abstractive summarization models by masking the summary tokens with different probabilities, depending on whether they are seen in the input document or not. In contrast, our technique does not require retraining to obtain varying degrees of abstractiveness.
7 Conclusions

We presented a new framework and new metrics for evaluating the relationship of abstractiveness and factuality. As part of this framework, we presented abstractiveness constraints, a general method that enables us to increase or decrease the level of abstractiveness when generating summaries from a summarization model, using nonlinear penalties or rewards based on the length of summary fragments extracted from the source. Through extensive human and automatic evaluations, we shed light on how abstractiveness interacts with factuality, across multiple datasets and models. We proposed new metrics to measure the abstractiveness-factuality tradeoff, incl. F@50 and µQAGS, and established baselines for future research.

References

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2017. Faithful to the original: Fact aware neural abstractive summarization. CoRR, abs/1711.04434.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder. EMNLP 2018 – Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Proceedings, pages 169–174.

Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.

Tobias Falke, Leonardo F.R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2020. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In ACL 2019 – 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 2214–2220. Association for Computational Linguistics (ACL).

Lisa Fan, Dong Yu, and Lu Wang. 2018. Robust Neural Abstractive Summarization Systems and Evaluation against Adversarial Information. In NIPS Interpretability and Robustness for Audio, Speech and Language Workshop.

Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, and Jianfeng Gao. 2020. Go Figure! A Meta Evaluation of Factuality in Summarization. Technical report.

Shen Gao, Xiuying Chen, Piji Li, Zhangming Chan, Dongyan Zhao, and Rui Yan. 2019. How to write summaries with patterns? Learning towards abstractive summarization through prototype editing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3741–3751, Hong Kong, China. Association for Computational Linguistics.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-Up Abstractive Summarization. In Proc. of EMNLP.

Boxing Chen and Colin Cherry. 2014. A systematic

Sebastian Gehrmann, Zachary Ziegler, and Alexander Rush. 2019. Generating abstractive summaries with finetuned language models.
In Proceedings of the comparison of smoothing techniques for sentence- 12th International Conference on Natural Language level BLEU. In Proceedings of the Ninth Workshop Generation, pages 516–522, Tokyo, Japan. Associa- on Statistical Machine Translation, pages 362–367, tion for Computational Linguistics. Baltimore, Maryland, USA. Association for Compu- tational Linguistics. Ben Goodrich, Vinay Rao, Peter J Liu Mohammad Saleh, Google Brain, Peter J Liu, and Mohammad Yen-Chun Chen and Mohit Bansal. 2018. Fast Abstrac- Saleh. 2019. Assessing The Factual Accuracy of tive Summarization with Reinforce-Selected Sen- Generated Text. In International Conference on tence Rewriting. In Proc. of ACL. Knowledge Discovery and Data Mining (KDD). Esin Durmus, He He, and Mona Diab. 2020. Feqa: A Max Grusky, Mor Naaman, and Yoav Artzi. 2018. question answering evaluation framework for faith- Newsroom: A dataset of 1.3 million summaries with fulness assessment in abstractive summarization. In diverse extractive strategies. In Proceedings of the Association for Computational Linguistics (ACL). 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- Günes Erkan and Dragomir R. Radev. 2004. Lexrank: man Language Technologies, Volume 1 (Long Pa- Graph-based lexical centrality as salience in text pers), pages 708–719, New Orleans, Louisiana. As- summarization. J. Artif. Int. Res., 22(1):457–479. sociation for Computational Linguistics.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Mike Lewis, Yinhan Liu, Naman Goyal, Mar- Bowen Zhou, and Yoshua Bengio. 2016. Pointing jan Ghazvininejad, Abdelrahman Mohamed, Omer the unknown words. In Proceedings of the 54th An- Levy, Veselin Stoyanov, and Luke Zettlemoyer. nual Meeting of the Association for Computational 2020. BART: Denoising sequence-to-sequence pre- Linguistics (Volume 1: Long Papers), pages 140– training for natural language generation, translation, 149. and comprehension. In Proceedings of the 58th An- nual Meeting of the Association for Computational Karl Moritz Hermann, Tomáš Kočiskỳ, Edward Grefen- Linguistics, pages 7871–7880, Online. Association stette, Lasse Espeholt, Will Kay, Mustafa Suleyman, for Computational Linguistics. and Phil Blunsom. 2015. Teaching machines to read Chin-Yew Lin. 2004. ROUGE: A package for auto- and comprehend. In Proceedings of the 28th Inter- matic evaluation of summaries. In Text Summariza- national Conference on Neural Information Process- tion Branches Out, pages 74–81, Barcelona, Spain. ing Systems-Volume 1, pages 1693–1701. Association for Computational Linguistics. Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, Yang Liu and Mirella Lapata. 2019. Text Summariza- and Eduard Hovy. 2013. Learning whom to trust tion with Pretrained Encoders. In Proc. of EMNLP. with MACE. In Proceedings of the 2013 Conference Joshua Maynez, Shashi Narayan, Bernd Bohnet, and of the North American Chapter of the Association Ryan McDonald. 2020. On faithfulness and factu- for Computational Linguistics: Human Language ality in abstractive summarization. In Proceedings Technologies, pages 1120–1130, Atlanta, Georgia. of the 58th Annual Meeting of the Association for Association for Computational Linguistics. Computational Linguistics, pages 1906–1919, On- line. Association for Computational Linguistics. Wojciech Kryscinski, Nitish Shirish Keskar, Bryan Mc- Cann, Caiming Xiong, and Richard Socher. 2019. Feng Nan, Cicero Nogueira dos Santos, Henghui Zhu, Neural text summarization: A critical evaluation. In Patrick Ng, Kathleen McKeown, Ramesh Nallapati, Proceedings of the 2019 Conference on Empirical Dejiao Zhang, Zhiguo Wang, Andrew O Arnold, and Methods in Natural Language Processing and the Bing Xiang. 2021. Improving Factual Consistency 9th International Joint Conference on Natural Lan- of Abstractive Summarization via Question Answer- guage Processing (EMNLP-IJCNLP), pages 540– ing. In Proc. of ACL. 551, Hong Kong, China. Association for Computa- tional Linguistics. Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for ex- Wojciech Kryscinski, Bryan McCann, Caiming Xiong, treme summarization. In Proceedings of the 2018 and Richard Socher. 2020. Evaluating the factual Conference on Empirical Methods in Natural Lan- consistency of abstractive text summarization. In guage Processing, pages 1797–1807, Brussels, Bel- Proceedings of the 2020 Conference on Empirical gium. Association for Computational Linguistics. Methods in Natural Language Processing (EMNLP), pages 9332–9346. Joel Larocca Neto, Alex Alves Freitas, and Celso A. A. Kaestner. 2002. Automatic text summarization us- ing a machine learning approach. In Proceedings Wojciech Kryściński, Romain Paulus, Caiming Xiong, of the 16th Brazilian Symposium on Artificial Intelli- and Richard Socher. 2018. Improving abstraction gence: Advances in Artificial Intelligence, SBIA ’02, in text summarization. 
In Proceedings of the 2018 page 205–215, Berlin, Heidelberg. Springer-Verlag. Conference on Empirical Methods in Natural Lan- guage Processing, pages 1808–1817, Brussels, Bel- Yixin Nie, Adina Williams, Emily Dinan, Mohit gium. Association for Computational Linguistics. Bansal, Jason Weston, and Douwe Kiela. 2020. Ad- versarial nli: A new benchmark for natural language Logan Lebanoff, John Muchovej, Franck Dernoncourt, understanding. In Proceedings of the 58th Annual Doo Soon Kim, Seokhwan Kim, Walter Chang, and Meeting of the Association for Computational Lin- Fei Liu. 2019. Analyzing sentence fusion in abstrac- guistics, pages 4885–4901. tive summarization. In Proceedings of the 2nd Work- shop on New Frontiers in Summarization, pages Kishore Papineni, Salim Roukos, Todd Ward, and Wei- 104–110, Hong Kong, China. Association for Com- Jing Zhu. 2002. Bleu: a method for automatic eval- putational Linguistics. uation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Com- putational Linguistics, pages 311–318, Philadelphia, Mike Lewis, Yinhan Liu, Naman Goyal, Mar- Pennsylvania, USA. Association for Computational jan Ghazvininejad, Abdelrahman Mohamed, Omer Linguistics. Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre- Thierry Poibeau and Horacio Saggion. 2012. Auto- training for natural language generation, trans- matic Text Summarization: Past, Present and Fu- lation, and comprehension. arXiv preprint ture. In Multi-source, Multilingual Information Ex- arXiv:1910.13461. traction and Summarization, pages 3–13.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christo- Dario Amodei, and Ilya Sutskever. 2019. Language pher D. Manning, and Curtis P. Langlotz. 2019. Op- models are unsupervised multitask learners. timizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports. Thomas Scialom, Sylvain Lamprier, Benjamin Pi- wowarski, and Jacopo Staiano. 2019. Answers unite! unsupervised metrics for reinforced summa- rization models. In Proceedings of the 2019 Con- ference on Empirical Methods in Natural Language Processing and the 9th International Joint Confer- ence on Natural Language Processing (EMNLP- IJCNLP), pages 3246–3256, Hong Kong, China. As- sociation for Computational Linguistics. Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer- generator networks. In Proceedings of the 55th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073– 1083. Siamak Shakeri, Cicero Nogueira dos Santos, Henghui Zhu, Patrick Ng, Feng Nan, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. 2020. End-to-end syn- thetic data generation for domain adaptation of ques- tion answering systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 5445–5460. As- sociation for Computational Linguistics. Kaiqiang Song, Bingqing Wang, Zhe Feng, Ren Liu, and Fei Liu. 2020. Controlling the amount of ver- batim copying in abstractive summarization. In Pro- ceedings of the AAAI Conference on Artificial Intel- ligence, volume 34(05), pages 8902–8909. Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the fac- tual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Com- putational Linguistics, pages 5008–5020. Noah Weber, Leena Shekhar, Niranjan Balasubrama- nian, and Kyunghyun Cho. 2018. Controlling decod- ing for more abstractive summaries with copy-based networks. arXiv preprint arXiv:1803.07038. Kam-Fai Wong, Mingli Wu, and Wenjie Li. 2008. Ex- tractive summarization using supervised and semi- supervised learning. In Proceedings of the 22nd In- ternational Conference on Computational Linguis- tics (Coling 2008), pages 985–992, Manchester, UK. Coling 2008 Organizing Committee. Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, F. Roesner, and Yejin Choi. 2019. Defending against neural fake news. In NeurIPS. Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with ex- tracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.
A ROUGE Scores

The aim of this paper is not to improve ROUGE scores, but to increase insight into the tradeoff between abstractiveness and factuality. We do, however, stress that the models we use in our analysis are competitive with the state of the art. We list our ROUGE-1, ROUGE-2 and ROUGE-L F1 scores, as well as their averages; see the RL scores in Table 2 as well:

• For CNN/DM, our λ=none decoding has 44.0/21.0/40.8 with an average of 35.3, compared to an average of 35.4 in Lewis et al. (2020).
• For XSum, our λ=none decoding has 45.3/21.9/36.8 with an average of 34.7, compared to an average of 34.9 in Lewis et al. (2020).
• For Multi-News, our MN-800 λ=none decoding has 50.2/20.5/45.8 with an average of 38.8, compared to improved ROUGE F1 results of 44.5/16.0/40.3 with an average of 33.6 by Fabbri (personal communication) for Fabbri et al. (2019).

B Measuring Abstractiveness with MINT

Figure 6: Example of input and highly extractive generated output. The color coding is the same as in Fig. 1.

N-gram Overlap. Each pn, short for pn(x, y), is the n-gram precision of the n-grams in y with respect to x, i.e., the percentage of n-grams in y that are extracted from x. (MINT has elements of ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002); we do not use modified n-gram precisions, like BLEU does, because n-grams extracted multiple times from x should count as such every time.) For highly abstractive outputs, higher-order n-gram precision can be zero, leading to an undefined or zero harmonic mean value. We prevent this by smoothing the n-gram counts from which the n-gram precisions are calculated, such that each n-gram count is the average of itself, the smoothed (n−1)-gram count and the unsmoothed (n+1)-gram count. The smoothed 0-gram count is defined as the 1-gram count plus one. We chose this method for its simplicity and effectiveness; it is described as method 5 in Chen and Cherry (2014).

Harmonic Mean. We use the harmonic mean, in analogy to the definition of the F1 score, as it is a mean function designed to aggregate ratios with different denominators. For a completely extractive summary that extracts sentences in the original order, the MINT score is 0. The score increases as the order of the extractive fragments is changed with respect to the input, their lengths are decreased, and new words and fragments are introduced that are not part of the input x. The use of the length-normalized LCS score (lcsr) is inspired by ROUGE-L; it is a useful addition to the n-gram precisions as it can detect the extraction of longer n-grams broken up by minor edits. As an example, consider the (x, y) pair shown in Figure 6. Only 4 of the 12 summary four-grams match the input, i.e., p4 = 33.3%, although very high overlap is apparent due to the fact that a 15-word fragment from the input was extracted with only the words "verdict" and "which" minimally changed. The lcsr score reflects this and measures 12/15 = 80.0% overlap. On the other hand, the n-gram precisions used in the MINT score are valuable in detecting textual overlaps that are not part of the longest common subsequence.
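A sketch of the smoothing recurrence described in the N-gram Overlap paragraph above (method 5 of Chen and Cherry, 2014). Applying it to the matched n-gram counts before dividing by the totals, and treating the count beyond the maximum order as zero, are our reading of the description rather than the authors' exact implementation.

```python
def smoothed_ngram_precisions(matched, totals, max_n=4):
    """matched[n] and totals[n] are the raw matched and total n-gram counts for n = 1..max_n.
    m_0 = matched_1 + 1; m_n = (m_{n-1} + matched_n + matched_{n+1}) / 3,
    with the (n+1)-gram count beyond max_n taken as 0 (an assumption)."""
    smoothed = {}
    m_prev = matched.get(1, 0) + 1.0              # smoothed 0-gram count
    for n in range(1, max_n + 1):
        nxt = matched.get(n + 1, 0)               # unsmoothed (n+1)-gram count
        m_n = (m_prev + matched.get(n, 0) + nxt) / 3.0
        smoothed[n] = m_n
        m_prev = m_n
    return {n: smoothed[n] / totals[n] for n in range(1, max_n + 1) if totals.get(n)}
```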
C Details on Our Mechanical Turk Setup

We provide additional details on the mitigation strategies we use in Amazon Mechanical Turk. We give detailed instructions to the annotators, with definitions and examples of different factual errors, see Figure 7. We also add a request to write a short explanation when a sentence is judged as not factual.

Tasks with Known Answers. We add a number of tasks with known answers, enabling us to estimate the accuracy of workers who work on multiple of these.

Automatic Quality Checks. Workers who complete the tasks too quickly, write no or very short explanation texts, or have low accuracy are automatically removed from our worker pool. Their answers are replaced with new answers.

Bonus. We use a bonus incentive structure. Every worker who passes the automatic quality checks receives a bonus at the end.

Qualification Test. For all our evaluations on Mechanical Turk (see Sec. 3.1), we first set up a short qualification test that can be taken by any worker from a country whose main language is English, who has completed 100 or more HITs so far with an acceptance rate of 95% or higher. The qualification test consists of just three questions from our factual consistency setup, two of which must be answered correctly, along with an explanation text (5 words or more) to explain when "not factually consistent" was chosen. 53% of workers who start the test provide answers to all three questions, and 27.6% of these answer at least two correctly and provide a reasonable explanation text, i.e., only 14.6% of the test takers are granted the qualification.

The qualification enables workers to work on our factual consistency HITs as well as our HITs judging informativeness and coherence. The rate per HIT differs widely between the two tasks, as the factual consistency task can be done quickly, given that a single summary sentence is evaluated and the related sentences in the article are highlighted. The factual consistency task pays $0.15 per HIT with a bonus of $0.05. The task of evaluating informativeness and coherence (see D below) pays $0.50 per HIT with a bonus of $0.25, as more text is displayed compared to the factuality task. These amount to an average pay of $12.50, incl. the bonus. The bonus is paid to workers who spend at least 10 seconds per HIT and who give short explanation texts for their decisions.

D Human Evaluation of Informativeness and Coherence

We conducted a human evaluation to determine the informativeness and coherence of the summaries generated with the λ4 decoding constraint (Equation 1), which increases abstractiveness, compared to not using an abstractiveness constraint. We used the same setup as for the factuality task, incl. a qualification test, three annotators per task, and aggregation using MACE. The results are shown in Table 5.

Table 5: Human quality evaluation of summaries generated with no abstractiveness constraint ("off") versus λ4. We asked which summary is more informative or coherent, respectively. MN-800 stands for Multi-News with the input documents truncated to 800 words total (Section 5.2).

              CNN/DM         MN-800         XSum
              inf.    coh.   inf.    coh.   inf.    coh.
prefer off    40.7    40.7   44.0    34.0   20.0    19.3
prefer λ4     46.7    32.7   38.0    42.0   17.3    16.7
both equal    12.7    26.7   18.0    24.0   62.7    64.0

We know and expect that factuality decreases with increased abstractiveness. For informativeness and coherence, the results are mixed. For the CNN/DM dataset, for the majority of summaries, λ4 is preferred for informativeness but no decoding constraint is preferred for coherence. For Multi-News (MN-800), the preference is reversed. For XSum, the majority of summaries are indistinguishable for both categories.

We used the following definitions of informativeness and coherence for the human evaluation.
• Informativeness: the more informative summary is better at expressing the main points of the news story. It contains information that is more relevant and important. It has fewer unimportant details. Its content is more similar to the human-written summary.
• Coherence: the more coherent summary has better structure and flow, and is easier to follow. The facts are presented in a more logical order.

E Dataset Sizes

For all three public datasets, we use the training/validation/test split provided in (Hermann et al., 2015; Narayan et al., 2018; Fabbri et al., 2019). The statistics of the three datasets are described in Table 6.

Table 6: Train/valid/test split on public datasets.

Dataset       Train     Valid    Test
CNN/DM        287,227   13,368   11,490
XSum          204,045   11,332   11,334
Multi-News    44,972    5,622    5,622

F Abstractiveness Measures Proposed in FEQA on Three Datasets

The abstractiveness metrics of a generated summary proposed in FEQA (Durmus et al., 2020) fall into three categories: extraction, perfect fusion, and novel n-grams. The extraction measurements evaluate whether a complete summary sentence, a summary sentence span, or a subset of tokens in a summary sentence is copied from the input document(s). The perfect fusion measurements evaluate whether a summary sentence is copied from multiple sentences of the input document(s). The novel n-grams measurements evaluate whether tokens in a summary are not present in the input document(s). We observe that these metrics reflect increasing abstractiveness with larger penalties for extractive fragments on the Multi-News and XSum datasets. The results on CNN/DM are mixed.

Table 7: Abstractiveness metrics in FEQA (Durmus et al., 2020) for the data points shown in Figure 5. λ values denote decoding constraints (Section 2.1).

                   Extraction                   Perfect fusion
          λ       Sentence   Span    Word      k=2     k≥2
CNN/DM    1/λ2    29.03      6.36    38.29     16.05   20.66
          1/λ4    28.27      6.28    35.07     16.39   21.8
          none    19.75      5.71    37.22     18.5    25.41
          λ4      2.24       1.31    21.2      25.51   42.71
          λ2      0.76       0.45    7.54      14.39   31.78
MN-500    1/λ2    8.85       8.86    11.38     11.42   17.06
          none    4.78       4.64    6.86      8.41    13.43
          λ4      1.42       1.41    3.09      5.73    10.59
          λ2      1.07       0.97    1.99      3.78    6.79
MN-800    1/λ2    12.09      12.9    13.62     12.28   17.87
          none    7.17       7.81    9.31      9.91    15.12
          λ4      2.12       2.29    3.95      6.88    12.28
          λ2      1.44       1.5     2.32      4.42    7.74
XSum      1/λ1    1.13       0.38    0.94      1.48    10.56
          1/λ2    0.36       0.04    0.39      0.56    4.3
          none    0.16       0.03    0.22      0.23    2.43
          λ4      0.03       0.0     0.05      0.08    1.7
          λ2      0.0        0.0     0.0       0.02    0.71

G Additional Experimental Details

We use AWS p3.8x and p3.16x EC2 machines for all our experiments, except we run FEQA on the Multi-News summaries on a p3dn.24xlarge machine, as it requires more memory. The BART model has 406,290,432 parameters. We trained all models for 5 epochs following the instructions on the fairseq BART webpage, without further hyperparameter search. The minimum and maximum length for Multi-News decoding was determined by the lengths of the training reference summaries.