Analyzing the Abstractiveness-Factuality Tradeoff With Nonlinear Abstractiveness Constraints
Markus Dreyer, Mengwen Liu, Feng Nan, Sandeep Atluri, Sujith Ravi
Amazon
{mddreyer, mengwliu, nanfen, satluri, sujithai}@amazon.com
arXiv:2108.02859v1 [cs.CL] 5 Aug 2021

Abstract

We analyze the tradeoff between factuality and abstractiveness of summaries. We introduce abstractiveness constraints to control the degree of abstractiveness at decoding time, and we apply this technique to characterize the abstractiveness-factuality tradeoff across multiple widely-studied datasets, using extensive human evaluations. We train a neural summarization model on each dataset and visualize the rates of change in factuality as we gradually increase abstractiveness using our abstractiveness constraints. We observe that, while factuality generally drops with increased abstractiveness, different datasets lead to different rates of factuality decay. We propose new measures to quantify the tradeoff between factuality and abstractiveness, incl. µQAGS, which balances factuality with abstractiveness. We also quantify this tradeoff in previous works, aiming to establish baselines for the abstractiveness-factuality tradeoff that future publications can compare against.

Figure 1: Three successively more abstractive summaries generated from the same input article, with MINT abstractiveness scores (Section 2.2) of 46.1%, 67.2%, 79.5%. Fragments extracted from the input are marked from red (longer fragments) to yellow (shorter fragments). The bottom summary has factual errors.

1 Introduction

Summarization is the task of generating a semantically faithful, well-formed and concise text representation of some input documents' main information. Automatically generated summaries have traditionally been extractive (Neto et al., 2002; Erkan and Radev, 2004; Wong et al., 2008), leading to issues with readability and coherence, as different extracted fragments may not fit well when taken out of their original contexts (Poibeau and Saggion, 2012). More recently, researchers have invested in methods for abstractive summarization, aiming to paraphrase the input documents' main points without borrowing their exact lexical expressions (Radford et al., 2019; Gehrmann et al., 2019; Lewis et al., 2019; Zhang et al., 2020). Abstractive summaries generated by state-of-the-art neural models tend to be fluent and well-formed, but lack semantic faithfulness (Cao et al., 2017; Kryscinski et al., 2019). Observed rates of factual errors in abstractive summaries have ranged from 30% to over 75% (Cao et al., 2017; Maynez et al., 2020). The research community has started to address this problem by developing automatic factuality metrics (Wang et al., 2020; Kryscinski et al., 2020; Goodrich et al., 2019) and methods that attempt to increase factuality (Fan et al., 2018; Scialom et al., 2019; Zhang et al., 2019; Falke et al., 2020).

However, we argue that the factuality problem of abstractive summaries cannot be well understood without considering the degree of abstractiveness of a given summary: Any summary is on a spectrum between extractive and abstractive (See et al., 2017), and a summary that is more abstractive typically has more problems with factual inconsistencies. Summaries that are extractive to a larger extent tend to be more factual, since copying text from the input into the summary rarely introduces factual errors, while the task of paraphrasing,
which results in summaries that are more abstractive, is harder and prone to semantic errors. As an example, Figure 1 shows part of a Washington Post article and three automatically generated summaries with increasing degrees of abstractiveness, which we have generated using our abstractiveness constraints (Section 2.1). The first two summaries are correct, but the third, most abstractive, summary has factual errors, misinterpreting the input.

Few authors have discussed this connection. Lebanoff et al. (2019) observe that abstractive summaries consisting of concatenated extracted fragments tend to be more factual than those created by more complex fusion. Durmus et al. (2020) observe that models trained on the more extractive CNN/DM dataset (Hermann et al., 2015) create more factual summaries than models trained on the more abstractive XSum dataset (Narayan et al., 2018). We show that such models differ in factuality even when we bias them to generate summaries that have similar levels of abstractiveness. Our analysis (Section 4) situates summarization models on the spectrum outlined in Figure 2, where factual summaries range from "trivially factual" (extractive) to truly "paraphrasing" (abstractive).

Figure 2: Four extremes at the abstractiveness-factuality spectrum.

We make the following contributions:

1. We introduce nonlinear abstractiveness constraints (NAC), which control the degree of abstractiveness at decoding time using length-based penalties for extractive fragments;
2. Using this decoding technique, we systematically explore the relationship of abstractiveness and factuality, within a particular dataset as well as across datasets.
3. As part of this exploration, we conduct an extensive human factuality study and perform an evaluation of automatic factuality metrics.
4. We introduce measures to quantify the abstractiveness-factuality tradeoff and use them to compare multiple summarization models. Our automatic measure, µQAGS, will enable future researchers to compare their own models and demonstrate progress.

2 Abstractiveness

2.1 Nonlinear Abstractiveness Constraints

We now introduce a technique to control the degree of abstractiveness of any summary y obtained by decoding input x. With this technique – soft constraints applied during decoding, which we call nonlinear abstractiveness constraints (NAC) – we can use any previously trained model to generate summaries with varying degrees of abstractiveness (see Figure 1 for an example).

Let F(x, y) be the set of the longest extractive fragments in y with respect to x. In Figure 1, such fragments are marked in color for each summary. We define a function λ_h(|f|) that assigns a discount probability to any extractive fragment f ∈ F(x, y):

    λ_h(|f|) = 2^(−|f|² / h²)    (1)

We configure this function with a parameter h, interpreted as the length of an extracted fragment for which λ_h = 0.5. (The exponent used in |f|² and h² could also be configured, but we keep it at 2 in our experiments; a larger exponent would result in a steeper descent around h.) Setting h to smaller values results in summaries that are more abstractive, since shorter extractive fragments will be discounted more strongly by λ_h; see Figure 3. Our discount penalty grows nonlinearly, affecting longer contiguous extractive fragments more strongly than multiple shorter ones with the same combined length – extracting a 10-gram makes a summary more extractive than using ten words from the article separately.

Figure 3: λ_h defines discounts for extractive fragments based on their lengths. Smaller h values lead to more abstractive summaries. (Plot: λ_h on the y-axis over extractive fragment length 1–7 on the x-axis, shown for h = 2 and h = 4.)

In decoding, we search for the summary ŷ that maximizes the product of the summarization model probability, p_M(y | x), and the discount probabilities of the extractive fragments F(x, y):

    ŷ = argmax_y  p_M(y | x) × ∏_{f ∈ F(x,y)} λ_h(|f|)    (2)
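As a concrete illustration, the sketch below (not the authors' code) implements the discount of Equation 1 and rescores a finished candidate summary according to Equation 2. The greedy left-to-right matching used here to find the extractive fragments F(x, y) is an assumption; the paper defines F(x, y) only abstractly.

```python
import math
from typing import List

def lam(frag_len: int, h: float) -> float:
    """Discount probability for an extractive fragment of length |f| (Eq. 1)."""
    return 2.0 ** (-(frag_len ** 2) / (h ** 2))

def extractive_fragment_lengths(x_tokens: List[str], y_tokens: List[str]) -> List[int]:
    """Lengths of contiguous summary fragments that also appear in the input.
    Greedy longest-match from left to right; an assumption, not the paper's exact procedure."""
    lengths, i = [], 0
    while i < len(y_tokens):
        best = 0
        for j in range(len(x_tokens)):  # longest contiguous match starting at y[i]
            k = 0
            while (i + k < len(y_tokens) and j + k < len(x_tokens)
                   and y_tokens[i + k] == x_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best >= 1:
            lengths.append(best)
            i += best
        else:
            i += 1
    return lengths

def rescore(log_p_model: float, x_tokens, y_tokens, h: float = 2.0, reward: bool = False) -> float:
    """Eq. 2 in log space: log p_M(y|x) + sum_f log lambda_h(|f|).
    With reward=True the inverse 1/lambda_h is applied (extraction reward)."""
    sign = -1.0 if reward else 1.0
    penalty = sum(math.log(lam(l, h)) for l in extractive_fragment_lengths(x_tokens, y_tokens))
    return log_p_model + sign * penalty
```

In the paper the discount is applied incrementally inside beam search (Equation 3 below) rather than by rescoring finished candidates; this post-hoc rescoring only illustrates the objective being maximized.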
Beam Decoding. The model probability p_M(y | x) in neural text generation models (Section 5.2) decomposes for token-by-token decoding as ∏_{i=1}^{|y|} p_M(y_i | x, y_1, ..., y_{i−1}). Similarly, we decompose the application of the λ_h function for any partial or completed extractive fragment f:

    λ_h(|f|) = ∏_{l=1}^{|f|} λ_h(l) / λ_h(l−1)    (3)

Therefore, to successively apply λ_h at each output position i in beam decoding, each candidate token y_i is evaluated to check whether choosing it would extend an extractive fragment to length l. If so, its model probability p_M(y_i | ...) is multiplied with λ_h(l), and the λ_h(l−1) that was applied to the previous token y_{i−1} is divided out.

Log Space. In practice, we equivalently search for ŷ in log space, using log probabilities and the log of λ_h defined in Equation 1. It can be shown that log λ_h(|f|) = −|f|² / (1.20112 × h)².

Extraction Rewards. We can choose to apply an extraction reward, rather than a penalty, by using the inverse 1/λ_h; smaller h then results in summaries that are more extractive.

2.2 Measuring Abstractiveness

The abstractiveness constraints directly control the abstractiveness of the generated summaries and indirectly influence their factuality. We now discuss metrics for abstractiveness, before discussing metrics for factuality (Section 3).

To measure abstractiveness, most authors list the proportions of summary n-grams that are novel, i.e., do not occur in the corresponding input documents (See et al., 2017; Narayan et al., 2018; Gao et al., 2019). Grusky et al. (2018) proposed a metric based on contiguous overlapping text spans, called density, measuring the average length of extracted fragments in a summary. Others have proposed metrics that take common non-contiguous subsequences into account; e.g., perfect fusion_k (Durmus et al., 2020) measures the percentage of summary sentences that assemble substrings from k source sentences in their original order.

Based on these previous works, we define a comprehensive abstractiveness metric that combines measures of contiguous as well as non-contiguous extractive summary fragments. We define this metric as a ratio, in order to facilitate combining it with a factuality metric of the same [0,1] range (Section 4). Let χ(x, y) = hmean(p1, p2, p3, p4, lcsr) be a measure of extractive overlap between input x and summary y, using the harmonic mean of multiple component measures. Each pn, short for pn(x, y), is the n-gram precision of the n-grams in y with respect to x, i.e., the percentage of n-grams in y that are extracted from x. (We smooth all n-gram counts (Chen and Cherry, 2014) to avoid undefined or zero harmonic mean values in highly abstractive summaries; see Appendix B for details.) We do not include density in χ(x, y), as the range of density is unbounded. The measure lcsr (longest common subsequence ratio), short for lcsr(x, y), is the length of the longest common subsequence (LCS) between x and y divided by the length of y. lcsr, inspired by ROUGE-L (Lin, 2004), generalizes perfect fusion_k to consider all instances of non-contiguous overlaps between input and summary. To measure abstractiveness, we define MINT (Metric for lexical independence of generated text) as MINT(x, y) = 1 − χ(x, y). For a set of inputs and their summaries, we report the average MINT score. See Figure 1 for the MINT scores of three sample summaries.

The MINT score thus capitalizes on prior work to provide a comprehensive and unified metric for abstractiveness of conditionally generated text, combining measures of contiguous and non-contiguous overlap into a single percentage score. We will provide a reference implementation to facilitate a standardized comparison of abstractiveness across different works.
3 Factuality

We now describe metrics for factuality, before we can describe the relationship between abstractiveness and factuality (Section 4). By factuality of a summary y, we mean factual consistency with the input x, rather than objective factuality or universal truth. Measuring factuality automatically is an active area of research (Gabriel et al., 2020). Factuality is most naturally measured by human annotators; we describe our setup for human factuality annotation first, then move to automatic metrics. In Section 5.4, we measure the correlations of automatic factuality metrics to our human judgements.

3.1 Human-annotated Factuality

We use Amazon's Mechanical Turk (AMT) to measure the factuality of automatically generated summaries with human annotators. Mechanical Turk annotators are untrained, so we use multiple mitigation strategies to obtain reliable judgements.

Figure 4: Screenshot (part) of a Mechanical Turk task (HIT) to judge the factuality of a summary sentence (in blue) with respect to news articles. Darker green article sentences are more similar to the blue summary sentence. The full task showed sentences from two more articles in the same cluster, from the Multi-News test set.

We simplify the task: To avoid overwhelming annotators with long text, we select a single sentence per summary and ask the annotators if it is factually consistent with the shown article (or, in the case of multi-document summarization, multiple articles). The other sentences of the summary are given as well for context, shown in gray; see Figure 4 for an example. The article(s) are shortened to show a total of 9 sentences that were determined to be semantically most similar to the selected summary sentence (we measure cosine similarity of sentence encodings computed by the Universal Sentence Encoder (Cer et al., 2018)); the remaining article parts are replaced by "...". The summary sentence is selected at random in proportion to its length. For each summary, we get judgements only for the randomly selected sentence. Aggregated over a set of summaries, we thus measure the average chance of any randomly selected summary sentence being factual.

We also provide detailed task instructions, incl. examples for intrinsic and extrinsic factual errors (Maynez et al., 2020). We require that potential annotators pass our custom qualification test, consisting of three of our example tasks of finding factual summary errors. Only workers who have previously completed 100 or more tasks on AMT with an acceptance rate of 95% or higher may take the test; 15% of those pass, enabling them to work on our evaluation tasks. We use three annotators per task and use MACE (Hovy et al., 2013) to aggregate these multiple annotations per summary and recover the most likely binary factuality judgement per summary. More details about our setup are in Appendix C.

For any set of automatically generated summaries, we create the AMT tasks, get an aggregate binary judgement per summary based on the multiple answers from annotators as described, and report the average of the human binary summary factuality judgements; we call this score FactH (Table 1). We collect human judgements on the first 150 samples of the test sets.

3.2 Automatically Measured Factuality

We use four metrics based on question answering (QA): SummaQA (Scialom et al., 2019), FEQA (Durmus et al., 2020), QAGS (Wang et al., 2020) and QUALS (Nan et al., 2021), and one entailment-based metric: FactCC (Kryscinski et al., 2020). SummaQA converts input sentences into Cloze-style questions by masking entities, applies BERT to recover them based on the summary, and reports the match F1 score. Since SummaQA measures summaries based on how well they answer questions from the input, it measures both factuality and content selection. FEQA generates declarative questions from masked summary sentences whose masked entities are used as "gold" answers; these are compared to the answers obtained from a QA model on the input. In QAGS, a question generation (QG) model generates questions from the summary, a QA model answers these questions from both summary and input, and the similarity of the answer pairs is evaluated. QUALS is similar, but achieves runtime and memory reductions by replacing the QG and QA models with a single model, QAGen (Shakeri et al., 2020); it generates QA pairs from the summary, computes their average log probability given the input and given the summary, and reports their difference in (−∞, ∞). FactCC is a BERT-based binary classifier trained on weakly-supervised sentence pairs derived from rules. We took the pre-trained FactCC model and fine-tuned it using the ANLI dataset (Nie et al., 2020). FactCC is trained with sentence pairs, so we obtain scores at the sentence level and use their mean as the factuality score of a summary. (This averaging method achieved the best correlations with our human evaluation; we also tried using the min or max score per summary sentence, or scoring the full summaries.)
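The sketch below illustrates how the two kinds of factuality scores above could be aggregated. The paper uses MACE (Hovy et al., 2013) to recover the per-summary binary judgement; a simple majority vote is substituted here as a stand-in, and `sentence_score` is a hypothetical callable standing in for any sentence-level automatic metric such as the fine-tuned FactCC.

```python
from statistics import mean
from typing import Callable, List

def fact_h(annotations: List[List[bool]]) -> float:
    """FactH: average of per-summary binary judgements.
    Each inner list holds the worker votes for one summary; majority vote is a
    simplification of the MACE aggregation used in the paper."""
    per_summary = [sum(votes) > len(votes) / 2 for votes in annotations]
    return mean(per_summary)

def summary_factuality(sentences: List[str], source: str,
                       sentence_score: Callable[[str, str], float]) -> float:
    """Sentence-level automatic scores averaged into a summary-level score,
    as done for the FactCC classifier; `sentence_score` is hypothetical."""
    return mean(sentence_score(s, source) for s in sentences)
```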
Figure 5: Human factuality judgements (i.e., FactH) for different degrees of abstractiveness (MINT). For all symbols of the same color, the same test subset was decoded using the same trained baseline model, but with different abstractiveness constraints (Section 2.1); large outlined symbols mean no constraints; each color represents a dataset. (Plot axes: Abstractiveness (x) vs. Factuality (y), both in percent; datasets shown: CNN/DM, MN-500, MN-800, XSum.)

Table 1: Abstractiveness-factuality tradeoff on 150 test samples. The 17 MINT and FactH numbers are as shown in Figure 5; we add µFactH, as well as QAGS and µQAGS on the same 150 samples for comparison. (FactH and µFactH are based on human judgements; QAGS and µQAGS are automatic.)

          λ       MINT   FactH   µFactH   QAGS   µQAGS
CNN/DM    1/λ2    11.2   94.0    66.4     85.6   60.8
          none    19.6   88.7    65.7     83.7   62.4
          λ4      44.3   83.3    70.3     75.7   65.3
          λ2      68.2   76.0    73.4     67.1   67.5
MN-500    1/λ2    35.0   80.7    65.5     80.1   65.1
          none    46.3   73.3    64.3     75.4   65.7
          λ4      59.6   60.7    60.3     70.5   66.9
          λ2      70.5   60.0    63.5     63.8   66.1
MN-800    1/λ2    28.8   80.0    62.9     82.4   64.5
          none    38.4   82.7    67.9     78.3   65.0
          λ4      50.7   68.7    62.7     72.8   65.4
          λ2      64.4   61.3    62.4     67.0   66.1
XSum      1/λ1    56.8   50.7    52.7     45.0   48.9
          1/λ2    74.6   42.0    52.9     38.9   50.8
          none    80.6   41.3    54.4     37.8   52.1
          λ4      84.0   41.3    55.5     35.7   51.8
          λ2      88.0   39.3    55.6     35.5   53.0

4 The Abstractiveness-Factuality Tradeoff

The metrics for factuality and abstractiveness, described above, along with the abstractiveness constraints presented in Section 2.1, provide us with tools to systematically explore the relationship between abstractiveness and factuality.

Factuality Trend Lines. To explore this relationship, we train summarization models on different datasets. For any trained summarization model, we decode the test set multiple times with different h values for λ_h (Equation 1), resulting in sets of summaries with different degrees of abstractiveness. For each of these test set decodings, we measure the average abstractiveness using MINT and the corresponding average factuality, using human annotations unless otherwise noted. This results in a series of (abstractiveness, factuality) points for any trained summarization model, which can be plotted along with a linear trend line. Figure 5 shows such a plot; Section 5.3 discusses its details.

F@50 Score. Given the multiple datapoints per model exhibiting different degrees of abstractiveness and associated factuality, we estimate the factuality at 50% abstractiveness, an intuitively interpretable metric, which we call F@50; it provides a comparison of the factuality of different models at a fixed degree of abstractiveness.

µFactH and µQAGS. We characterize the tradeoff on any decoding output using a weighted average between factuality and abstractiveness, (φF + A)/(φ + 1). To measure abstractiveness A, we use MINT; to measure factuality F, we use either human-measured factuality or an automatic metric with [0,1] range like QAGS, resulting in the abstractiveness-adjusted factuality metrics µFactH and µQAGS. We give factuality twice as much weight, since low factuality can have large negative impact (Zellers et al., 2019), setting φ to 2. By this definition, a system whose factuality decreases by x units, as compared to another system, must make up for the lost factuality by 2x units in abstractiveness to get the same score. When two systems have the same factuality, the score prefers the one with higher abstractiveness.
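The two tradeoff measures of Section 4, written out as a small sketch. The formula for the µ scores is taken directly from the text; fitting the trend line with numpy's least-squares polyfit is our choice, since the paper does not specify how the linear trend line is fitted.

```python
import numpy as np

def mu_score(factuality: float, abstractiveness: float, phi: float = 2.0) -> float:
    """Abstractiveness-adjusted factuality, (phi*F + A) / (phi + 1).
    With F = FactH this is muFactH; with F = QAGS it is muQAGS."""
    return (phi * factuality + abstractiveness) / (phi + 1.0)

def f_at_50(mint_scores, factuality_scores) -> float:
    """F@50: the linear trend line over one model's (MINT, factuality) points,
    evaluated at 50% abstractiveness."""
    slope, intercept = np.polyfit(mint_scores, factuality_scores, deg=1)
    return slope * 50.0 + intercept

# Example with the CNN/DM row for lambda=none from Table 1:
# mu_score(88.7, 19.6) -> approx. 65.7, matching the reported muFactH.
```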
5 Experiments

5.1 Datasets

We use CNN/DM (Hermann et al., 2015), XSum (Narayan et al., 2018), and Multi-News (Fabbri et al., 2019). The former two are single-document summarization datasets, and the latter is a multi-document summarization dataset. CNN/DM contains news articles from two news providers, CNN and DailyMail; the summaries are in the format of bullet point highlights.
The XSum dataset contains articles from BBC News, where the first sentence of an article is used as the summary and the remaining sentences as input documents. (Following Wang et al. (2020), we reinsert the first sentences whenever we measure factuality of XSum summaries on AMT or with automatic metrics.) In the Multi-News dataset, each summary is written by a professional editor and paired with a cluster of articles from multiple news providers. The dataset sizes are described in Appendix E.

5.2 Setup

We use BART (Lewis et al., 2020) for our analyses of the relationship between factuality and abstractiveness. BART is a sequence-to-sequence model with a bidirectional transformer encoder and a left-to-right transformer decoder. The joint encoder-decoder model was pretrained on 160GB of text and has been shown to give state-of-the-art results on CNN/DM and XSum. Our models use the provided model checkpoints for the CNN/DM and XSum datasets as well as the recommended decoding settings. (The checkpoints and settings are available on https://github.com/pytorch/fairseq/tree/master/examples/bart.) For Multi-News (MN), we train a model on the training set, starting from the bart.large.cnn pretrained model with a learning rate of 2e-5 for five epochs, and decode with a minimum length of 50 and a maximum length of 300 tokens. For Multi-News, we truncate the input documents per cluster so that their combined length does not exceed N words, following Fabbri et al. (2019). We train models with N = 500 and N = 800, called MN-500 and MN-800, respectively. We measure the MINT scores for the reference summaries in these datasets; these can be compared to the MINT scores obtained in decoding (Section 5.3). The test set references for MN-500 have a MINT score of 78.2%, compared to 72.8% for MN-800. MINT is naturally higher for MN-500 since the shorter truncation removes content that could otherwise overlap with the outputs. The MINT scores for the CNN/DM and XSum references are 59.6% and 87.8%, respectively, indicating that XSum is the most abstractive dataset.

Table 2: Impact of λ on ROUGE-L F1 (RL) and abstractiveness metrics on the full test sets. p3, p4 and lcsr are among MINT's component scores (Sec. 2.2); density is the average length of extracted fragments (Grusky et al., 2018).

          λ       RL     MINT   p3     p4     lcsr   density
CNN/DM    1/λ2    38.1   8.6    89.5   85.3   93.3   31.0
          none    40.8   14.7   82.1   75.5   89.7   20.3
          λ4      41.7   38.4   55.4   41.8   78.6   6.4
          λ2      40.0   61.8   33.2   19.7   68.7   3.4
MN-500    1/λ2    44.7   35.7   61.6   54.2   60.4   15.9
          none    45.4   46.1   50.0   41.2   54.2   9.9
          λ4      45.1   59.0   36.5   26.1   46.6   4.7
          λ2      43.7   70.7   25.5   16.3   40.4   3.5
MN-800    1/λ2    44.9   27.5   69.9   62.7   69.3   19.5
          none    45.8   37.3   58.7   49.9   63.4   12.9
          λ4      45.7   52.6   42.4   31.2   53.8   5.8
          λ2      44.2   66.4   29.0   19.0   46.1   4.0
XSum      1/λ1    30.6   57.6   37.7   28.2   63.5   4.6
          1/λ2    36.0   74.7   22.3   13.4   57.4   2.2
          none    36.8   80.2   17.6   9.2    54.5   1.6
          λ4      36.8   83.1   15.0   7.0    53.0   1.3
          λ2      36.3   87.0   11.7   4.8    50.3   1.1

5.3 Comparison Across Datasets Using NAC

We decode each of the four trained models multiple times with varying nonlinear abstractiveness constraints. Figure 5 plots the resulting abstractiveness and human-measured factuality, thereby providing a visual representation of the abstractiveness-factuality tradeoff for the four models. Table 1 shows the same 17 MINT and FactH values, along with QAGS values and the MINT-adjusted factualities µFactH and µQAGS.

XSum. The lower right of Figure 5 shows five lozenges. The larger of these represents the decoding of the test set with our XSum-trained model using default settings; the other four red points represent decodings under the same model, but with different abstractiveness constraints that result in more extractive (1/λ_h) or more abstractive (λ_h) summaries (Section 2.1). The five red points are associated with a dashed linear trend line. Compared to the other points in the figure, abstractiveness is high and factuality low. It took a strong extractive reward (1/λ1), which we did not use for the models trained on other datasets, to pull this model toward lower abstractiveness and higher factuality.

Multi-News. Four decodings using the MN-500 model are shown as squares, decodings under the MN-800 model as triangles. The two decodings without constraints, denoted by the larger symbols, are far apart: MN-500 is more abstractive and less factual, MN-800 is the opposite. This can be explained by the fact that for MN-500, larger parts of the input are truncated (Section 5.2) that the untruncated reference summary in training may still refer to; the model learns to hallucinate more.
However, the trend lines of MN-500 and MN-800 are almost identical. This underscores the value of the trend lines, which reveal the near-identical abstractiveness-factuality spectrum of these two models.

CNN/DM. The four decodings for CNN/DM are shown as bullets. Its default model output without abstractiveness constraint (large bullet) is by far the most extractive (MINT 19.6%); the extraction reward to its left (using 1/λ2) cannot make it much more extractive. However, there is room to the right, and the abstraction rewards (λ4 and λ2) move its abstractiveness far into the abstractiveness level of Multi-News and even XSum.

F@50 Scores. One of the main takeaways of this study is that different systems can have different factuality rates at the same level of abstractiveness. Previous authors have observed that XSum summaries are highly abstractive and less factual, and that CNN/DM summaries are at the opposite side of that spectrum. We confirm this; however, we add that we can bias the XSum model to create less abstractive summaries and the CNN/DM model to create more abstractive summaries, so that their abstractiveness becomes comparable, and the factuality rates still differ considerably: Based on the trend line, the F@50 score of the XSum model is 52.6%, while the CNN/DM model's F@50 is 81.3%. MN-500 and MN-800 lie in the middle, both with an F@50 score of 70.5.

µFactH Scores. When no abstractiveness constraint is applied (λ=none in Table 1), the CNN/DM and MN models have comparable µFactH scores of 65.7%, 64.3% and 67.9%. The XSum model has a low µFactH score of 54.4%; at its degree of abstractiveness it would need a factuality of 58.3% to match the µFactH score of the CNN/DM model, but its actual FactH score is only 41.3%. The µQAGS scores correlate highly with the µFactH scores.

Summary Quality and Abstractiveness. Table 2 lists ROUGE scores for the different decodings, along with multiple abstractiveness metrics; these numbers are measured on the full test sets. (The MINT scores on the full test sets (Table 2) are similar to those on the smaller subsets (Table 1), with the exception of CNN/DM, where the first test set part consists of CNN articles, while 90.5% of the full test set consists of Daily Mail articles.) Decodings without abstractiveness constraints replicate previous works' ROUGE scores (Lewis et al., 2020; Fabbri et al., 2019), see Appendix A. The λ4 constraint can dramatically increase abstractiveness while leaving ROUGE scores virtually unchanged. To measure summary quality further, we conducted a human evaluation of informativeness and coherence of the summaries generated with the λ4 decoding constraint; the results are mixed, where sometimes the default decoding is preferred, sometimes the λ4 decoding and sometimes neither of them, see Appendix D. The density scores (Grusky et al., 2018) in the table have high correlation with the MINT scores. See Appendix F for fusion scores (Durmus et al., 2020).

5.4 Correlating Human and Automatic Factuality Judgements

Setup. We analyze the correlation between human and automatic factuality scores, complementing earlier studies in this area (Gabriel et al., 2020). Table 3 describes three groups of correlation results, corresponding to different aggregations of summaries: The "summary sets" group contains the 17 data points whose human factuality scores can be seen in Figure 5; we score the factuality of the same summaries automatically and correlate the 17 score pairs per automatic metric. The "summaries" group contains the 17 × 150 individual factuality scores per summary, so in the "All" column we correlate all 2,550 factuality judgements from human annotators with the automatic metrics. The XSum, MN, and CNN columns contain only the subsets of summaries corresponding to those datasets. The automatic metrics score the full summaries with respect to their inputs. However, recall that we ask human annotators to judge only one summary sentence in context (Figure 4). The "summaries (one sentence)" numbers show the correlations of the 2,550 factuality judgements with the automatic metrics computed based on that same sentence per summary that annotators judged.

Table 3: Correlations of automatic factuality metrics to human annotations. "Summary sets" correlates the mean scores for the 17 sets of summaries shown in Figure 5. "Summaries" correlates individual scores for all summaries. All results are statistically significant at p < 0.05 except those marked with a dagger (†).

                 (1) Summary sets   (2) Summaries                    (3) Summaries (one sentence)
                 All                All    XSum    MN     CNN        All     XSum    MN     CNN
FEQA             .91                .18    †-.03   .18    †.05       .14     †-.03   .15    †.00
SummaQA          .76                .20    .08     .14    .12        †-.04   .08     -.08   †.00
FactCC           .69                .25    .22     .11    .38        .23     .22     .18    .26
FactCC (ANLI)    .86                .32    .35     †.01   .44        .28     .35     .10    .29
QAGS             .96                .41    .33     .23    .28        .40     .33     .30    .15
QUALS            .96                .41    .26     .25    .36        .37     .26     .29    .25

Results. All automatic metrics have high correlation with the 17 aggregated human judgements per summary set. QAGS and QUALS have almost perfect correlations with the human aggregate judgements, making them suitable to automatically judge the mean factuality of a system. Since QAGS, in particular, uses a [0,1] range like MINT, we combine the two as µQAGS (Section 4) and report QAGS and µQAGS in Tables 1 and 4. On the basis of individual summaries, overall correlations are moderate at best.
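A sketch of how the correlation groups in Table 3 could be computed; the paper does not state which correlation coefficient is used, so Pearson's r via scipy is an assumption here.

```python
from scipy.stats import pearsonr

def correlate(human_scores, metric_scores):
    """Correlation between human factuality judgements and an automatic metric.
    Group (1) passes the 17 per-decoding mean scores; groups (2) and (3) pass the
    2,550 per-summary values, with the metric computed on the full summary or only
    on the sentence the annotators judged."""
    r, p = pearsonr(human_scores, metric_scores)
    return r, p
```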
For the 2,550 individual summaries, only QAGS and QUALS have correlations above 0.40 in the "All" column. For the dataset-specific subsets, correlations are lower, i.e., the automatic factuality metrics are better at distinguishing the summaries from different models, but worse at distinguishing the summaries from a particular model. For group (3), "summaries (one sentence)", SummaQA has the lowest correlations because it is not designed to work on single summary sentences. For XSum, the results in this group are identical to group (2) because XSum has only one-sentence summaries. For CNN/DM, all metrics correlate better when using the full summaries. For MN, they correlate better when using the random sentence that human annotators judged, possibly because its summaries are long (218 words on average). FactCC works better after being fine-tuned on ANLI, but it fails on MN.

5.5 Comparison Across Different Models

We compare the abstractiveness-factuality tradeoffs of summarization models from the literature. We obtain model outputs of four summarization models other than BART: PGCONV (See et al., 2017) is a hybrid pointer-generator network that can copy words from the source text via pointing; BERTSUM (Liu and Lapata, 2019) is a transformer encoder-decoder model in which only the encoder is pretrained; both BOTTOMUP (Gehrmann et al., 2018) and ABSRL (Chen and Bansal, 2018) use separately selected source fragments to constrain an abstractive generation model.

Table 4: Comparing summarization models on the full CNN/DM and XSum test sets. µQAGS measures abstractiveness-factuality tradeoffs. RL is ROUGE-L F1.

          Model       MINT   QAGS   µQAGS   RL
CNN/DM    BART        14.7   85.4   61.8    40.8
          BERTSUM     14.1   83.1   60.1    39.2
          PGCONV      5.5    84.4   58.1    36.4
          BOTTOMUP    17.2   78.7   58.2    38.3
          ABSRL       18.9   82.5   61.3    37.3
XSum      BART        80.2   40.1   53.5    36.8
          BERTSUM     82.8   27.6   46.0    31.3

Table 4 shows that, on CNN/DM, BART and ABSRL have the best µQAGS scores (61.8 and 61.3), with ABSRL trading some factuality (QAGS) for considerably higher abstractiveness (MINT). Compared to ABSRL, BERTSUM is considerably less abstractive and only slightly more factual, leading to a lower µQAGS score (60.1). PGCONV and BOTTOMUP rank last (µQAGS scores 58.1 and 58.2) for different reasons: PGCONV is very extractive, while BOTTOMUP is the least factual model. On XSum, BART represents a sizable factuality gain over BERTSUM, with a relatively small loss in MINT score. BART's pretraining of both encoder and decoder may be contributing to its better factuality, in accordance with Maynez et al. (2020).

6 Related Work

Abstractiveness-Factuality Tradeoff: Durmus et al. (2020) observe that the degree of abstractiveness at test time depends on the abstractiveness of the training data and that highly abstractive summaries tend to be less factual. We go a step further: We control for abstractiveness and see that factuality rates between different systems can vary widely at the same levels of abstractiveness.

Increasing Abstractiveness: Kryściński et al. (2018) use policy gradient with a novelty reward to encourage abstraction in a pointer-generator (PG) network (Gulcehre et al., 2016; See et al., 2017). Weber et al. (2018) penalize the amount of copied tokens during PG decoding. Our decoding technique applies to general sequence-to-sequence models; we penalize longer contiguous extracted fragments more than multiple shorter fragments that add up to the same length, using nonlinear penalties for whole fragments. Finally, Song et al. (2020) control the amount of copying in training abstractive summarization models by masking the summary tokens with different probabilities, depending on whether they are seen in the input document or not. In contrast, our technique does not require retraining to obtain varying degrees of abstractiveness.
7 Conclusions

We presented a new framework and new metrics for evaluating the relationship of abstractiveness and factuality. As part of this framework, we presented abstractiveness constraints, a general method that enables us to increase or decrease the level of abstractiveness when generating summaries from a summarization model, using nonlinear penalties or rewards based on the length of summary fragments extracted from the source. Through extensive human and automatic evaluations, we shed light on how abstractiveness interacts with factuality, across multiple datasets and models. We proposed new metrics to measure the abstractiveness-factuality tradeoff, incl. F@50 and µQAGS, and established baselines for future research.

References

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2017. Faithful to the original: Fact aware neural abstractive summarization. CoRR, abs/1711.04434.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder. EMNLP 2018 – Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Proceedings, pages 169–174.

Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.

Tobias Falke, Leonardo F.R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2020. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In ACL 2019 – 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 2214–2220. Association for Computational Linguistics (ACL).

Lisa Fan, Dong Yu, and Lu Wang. 2018. Robust Neural Abstractive Summarization Systems and Evaluation against Adversarial Information. In NIPS Interpretability and Robustness for Audio, Speech and Language Workshop.

Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, and Jianfeng Gao. 2020. Go Figure! A Meta Evaluation of Factuality in Summarization. Technical report.

Shen Gao, Xiuying Chen, Piji Li, Zhangming Chan, Dongyan Zhao, and Rui Yan. 2019. How to write summaries with patterns? Learning towards abstractive summarization through prototype editing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3741–3751, Hong Kong, China. Association for Computational Linguistics.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-Up Abstractive Summarization. In Proc. of EMNLP.

Boxing Chen and Colin Cherry. 2014. A systematic

Sebastian Gehrmann, Zachary Ziegler, and Alexander Rush. 2019. Generating abstractive summaries with finetuned language models.
In Proceedings of the comparison of smoothing techniques for sentence- 12th International Conference on Natural Language level BLEU. In Proceedings of the Ninth Workshop Generation, pages 516–522, Tokyo, Japan. Associa- on Statistical Machine Translation, pages 362–367, tion for Computational Linguistics. Baltimore, Maryland, USA. Association for Compu- tational Linguistics. Ben Goodrich, Vinay Rao, Peter J Liu Mohammad Saleh, Google Brain, Peter J Liu, and Mohammad Yen-Chun Chen and Mohit Bansal. 2018. Fast Abstrac- Saleh. 2019. Assessing The Factual Accuracy of tive Summarization with Reinforce-Selected Sen- Generated Text. In International Conference on tence Rewriting. In Proc. of ACL. Knowledge Discovery and Data Mining (KDD). Esin Durmus, He He, and Mona Diab. 2020. Feqa: A Max Grusky, Mor Naaman, and Yoav Artzi. 2018. question answering evaluation framework for faith- Newsroom: A dataset of 1.3 million summaries with fulness assessment in abstractive summarization. In diverse extractive strategies. In Proceedings of the Association for Computational Linguistics (ACL). 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- Günes Erkan and Dragomir R. Radev. 2004. Lexrank: man Language Technologies, Volume 1 (Long Pa- Graph-based lexical centrality as salience in text pers), pages 708–719, New Orleans, Louisiana. As- summarization. J. Artif. Int. Res., 22(1):457–479. sociation for Computational Linguistics.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Mike Lewis, Yinhan Liu, Naman Goyal, Mar- Bowen Zhou, and Yoshua Bengio. 2016. Pointing jan Ghazvininejad, Abdelrahman Mohamed, Omer the unknown words. In Proceedings of the 54th An- Levy, Veselin Stoyanov, and Luke Zettlemoyer. nual Meeting of the Association for Computational 2020. BART: Denoising sequence-to-sequence pre- Linguistics (Volume 1: Long Papers), pages 140– training for natural language generation, translation, 149. and comprehension. In Proceedings of the 58th An- nual Meeting of the Association for Computational Karl Moritz Hermann, Tomáš Kočiskỳ, Edward Grefen- Linguistics, pages 7871–7880, Online. Association stette, Lasse Espeholt, Will Kay, Mustafa Suleyman, for Computational Linguistics. and Phil Blunsom. 2015. Teaching machines to read Chin-Yew Lin. 2004. ROUGE: A package for auto- and comprehend. In Proceedings of the 28th Inter- matic evaluation of summaries. In Text Summariza- national Conference on Neural Information Process- tion Branches Out, pages 74–81, Barcelona, Spain. ing Systems-Volume 1, pages 1693–1701. Association for Computational Linguistics. Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, Yang Liu and Mirella Lapata. 2019. Text Summariza- and Eduard Hovy. 2013. Learning whom to trust tion with Pretrained Encoders. In Proc. of EMNLP. with MACE. In Proceedings of the 2013 Conference Joshua Maynez, Shashi Narayan, Bernd Bohnet, and of the North American Chapter of the Association Ryan McDonald. 2020. On faithfulness and factu- for Computational Linguistics: Human Language ality in abstractive summarization. In Proceedings Technologies, pages 1120–1130, Atlanta, Georgia. of the 58th Annual Meeting of the Association for Association for Computational Linguistics. Computational Linguistics, pages 1906–1919, On- line. Association for Computational Linguistics. Wojciech Kryscinski, Nitish Shirish Keskar, Bryan Mc- Cann, Caiming Xiong, and Richard Socher. 2019. Feng Nan, Cicero Nogueira dos Santos, Henghui Zhu, Neural text summarization: A critical evaluation. In Patrick Ng, Kathleen McKeown, Ramesh Nallapati, Proceedings of the 2019 Conference on Empirical Dejiao Zhang, Zhiguo Wang, Andrew O Arnold, and Methods in Natural Language Processing and the Bing Xiang. 2021. Improving Factual Consistency 9th International Joint Conference on Natural Lan- of Abstractive Summarization via Question Answer- guage Processing (EMNLP-IJCNLP), pages 540– ing. In Proc. of ACL. 551, Hong Kong, China. Association for Computa- tional Linguistics. Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for ex- Wojciech Kryscinski, Bryan McCann, Caiming Xiong, treme summarization. In Proceedings of the 2018 and Richard Socher. 2020. Evaluating the factual Conference on Empirical Methods in Natural Lan- consistency of abstractive text summarization. In guage Processing, pages 1797–1807, Brussels, Bel- Proceedings of the 2020 Conference on Empirical gium. Association for Computational Linguistics. Methods in Natural Language Processing (EMNLP), pages 9332–9346. Joel Larocca Neto, Alex Alves Freitas, and Celso A. A. Kaestner. 2002. Automatic text summarization us- ing a machine learning approach. In Proceedings Wojciech Kryściński, Romain Paulus, Caiming Xiong, of the 16th Brazilian Symposium on Artificial Intelli- and Richard Socher. 2018. Improving abstraction gence: Advances in Artificial Intelligence, SBIA ’02, in text summarization. 
In Proceedings of the 2018 page 205–215, Berlin, Heidelberg. Springer-Verlag. Conference on Empirical Methods in Natural Lan- guage Processing, pages 1808–1817, Brussels, Bel- Yixin Nie, Adina Williams, Emily Dinan, Mohit gium. Association for Computational Linguistics. Bansal, Jason Weston, and Douwe Kiela. 2020. Ad- versarial nli: A new benchmark for natural language Logan Lebanoff, John Muchovej, Franck Dernoncourt, understanding. In Proceedings of the 58th Annual Doo Soon Kim, Seokhwan Kim, Walter Chang, and Meeting of the Association for Computational Lin- Fei Liu. 2019. Analyzing sentence fusion in abstrac- guistics, pages 4885–4901. tive summarization. In Proceedings of the 2nd Work- shop on New Frontiers in Summarization, pages Kishore Papineni, Salim Roukos, Todd Ward, and Wei- 104–110, Hong Kong, China. Association for Com- Jing Zhu. 2002. Bleu: a method for automatic eval- putational Linguistics. uation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Com- putational Linguistics, pages 311–318, Philadelphia, Mike Lewis, Yinhan Liu, Naman Goyal, Mar- Pennsylvania, USA. Association for Computational jan Ghazvininejad, Abdelrahman Mohamed, Omer Linguistics. Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre- Thierry Poibeau and Horacio Saggion. 2012. Auto- training for natural language generation, trans- matic Text Summarization: Past, Present and Fu- lation, and comprehension. arXiv preprint ture. In Multi-source, Multilingual Information Ex- arXiv:1910.13461. traction and Summarization, pages 3–13.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christo- Dario Amodei, and Ilya Sutskever. 2019. Language pher D. Manning, and Curtis P. Langlotz. 2019. Op- models are unsupervised multitask learners. timizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports. Thomas Scialom, Sylvain Lamprier, Benjamin Pi- wowarski, and Jacopo Staiano. 2019. Answers unite! unsupervised metrics for reinforced summa- rization models. In Proceedings of the 2019 Con- ference on Empirical Methods in Natural Language Processing and the 9th International Joint Confer- ence on Natural Language Processing (EMNLP- IJCNLP), pages 3246–3256, Hong Kong, China. As- sociation for Computational Linguistics. Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer- generator networks. In Proceedings of the 55th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073– 1083. Siamak Shakeri, Cicero Nogueira dos Santos, Henghui Zhu, Patrick Ng, Feng Nan, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. 2020. End-to-end syn- thetic data generation for domain adaptation of ques- tion answering systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 5445–5460. As- sociation for Computational Linguistics. Kaiqiang Song, Bingqing Wang, Zhe Feng, Ren Liu, and Fei Liu. 2020. Controlling the amount of ver- batim copying in abstractive summarization. In Pro- ceedings of the AAAI Conference on Artificial Intel- ligence, volume 34(05), pages 8902–8909. Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the fac- tual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Com- putational Linguistics, pages 5008–5020. Noah Weber, Leena Shekhar, Niranjan Balasubrama- nian, and Kyunghyun Cho. 2018. Controlling decod- ing for more abstractive summaries with copy-based networks. arXiv preprint arXiv:1803.07038. Kam-Fai Wong, Mingli Wu, and Wenjie Li. 2008. Ex- tractive summarization using supervised and semi- supervised learning. In Proceedings of the 22nd In- ternational Conference on Computational Linguis- tics (Coling 2008), pages 985–992, Manchester, UK. Coling 2008 Organizing Committee. Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, F. Roesner, and Yejin Choi. 2019. Defending against neural fake news. In NeurIPS. Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with ex- tracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.
A ROUGE Scores

The aim of this paper is not to improve ROUGE scores, but to increase insight into the tradeoff between abstractiveness and factuality. We do, however, stress that the models we use in our analysis are competitive with the state of the art. We list our ROUGE-1, ROUGE-2 and ROUGE-L F1 scores, as well as their averages; see the RL scores in Table 2 as well:

• For CNN/DM, our λ=none decoding has 44.0/21.0/40.8 with an average of 35.3, compared to an average of 35.4 in Lewis et al. (2020).
• For XSum, our λ=none decoding has 45.3/21.9/36.8 with an average of 34.7, compared to an average of 34.9 in Lewis et al. (2020).
• For Multi-News, our MN-800 λ=none decoding has 50.2/20.5/45.8 with an average of 38.8, compared to improved ROUGE F1 results of 44.5/16.0/40.3 with an average of 33.6 by Fabbri (personal communication) for Fabbri et al. (2019).

B Measuring Abstractiveness with MINT

Figure 6: Example of input and highly extractive generated output. The color coding is the same as in Fig. 1.

N-gram Overlap. Each pn, short for pn(x, y), is the n-gram precision of the n-grams in y with respect to x, i.e., the percentage of n-grams in y that are extracted from x. (MINT has elements of ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002); we do not use modified n-gram precisions, like BLEU does, because n-grams extracted multiple times from x should count as such every time.) For highly abstractive outputs, higher-order n-gram precision can be zero, leading to an undefined or zero harmonic mean value. We prevent this by smoothing the n-gram counts from which the n-gram precisions are calculated, such that each n-gram count is the average of itself, the smoothed (n−1)-gram count and the unsmoothed (n+1)-gram count. The smoothed 0-gram count is defined as the 1-gram count plus one. We chose this method for its simplicity and effectiveness; it is described as method 5 in Chen and Cherry (2014).

Harmonic Mean. We use the harmonic mean, in analogy to the definition of the F1 score, as it is a mean function designed to aggregate ratios with different denominators. For a completely extractive summary that extracts sentences in the original order, the MINT score is 0. The score increases as the order of the extractive fragments is changed with respect to the input, their lengths are decreased, and new words and fragments are introduced that are not part of the input x. The use of the length-normalized LCS score (lcsr) is inspired by ROUGE-L; it is a useful addition to the n-gram precisions as it can detect the extraction of longer n-grams broken up by minor edits. As an example, consider the (x, y) pair shown in Figure 6. Only 4 of the 12 summary four-grams match the input, i.e., p4 = 33.3%, although very high overlap is apparent due to the fact that a 15-word fragment from the input was extracted with only the words "verdict" and "which" minimally changed. The lcsr score reflects this and measures 12/15 = 80.0% overlap. On the other hand, the n-gram precisions used in the MINT score are valuable in detecting textual overlaps that are not part of the longest common subsequence.
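A sketch of the smoothing recurrence described in the N-gram Overlap paragraph above (method 5 of Chen and Cherry, 2014). Applying it to the matched n-gram counts before dividing by the totals, and treating the count beyond the maximum order as zero, are our reading of the description rather than the authors' exact implementation.

```python
def smoothed_ngram_precisions(matched, totals, max_n=4):
    """matched[n] and totals[n] are the raw matched and total n-gram counts for n = 1..max_n.
    m_0 = matched_1 + 1; m_n = (m_{n-1} + matched_n + matched_{n+1}) / 3,
    with the (n+1)-gram count beyond max_n taken as 0 (an assumption)."""
    smoothed = {}
    m_prev = matched.get(1, 0) + 1.0              # smoothed 0-gram count
    for n in range(1, max_n + 1):
        nxt = matched.get(n + 1, 0)               # unsmoothed (n+1)-gram count
        m_n = (m_prev + matched.get(n, 0) + nxt) / 3.0
        smoothed[n] = m_n
        m_prev = m_n
    return {n: smoothed[n] / totals[n] for n in range(1, max_n + 1) if totals.get(n)}
```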
C Details on Our Mechanical Turk Setup

We provide additional details on the mitigation strategies we use in Amazon Mechanical Turk. We give detailed instructions to the annotators, with definitions and examples of different factual errors, see Figure 7. We also add a request to write a short explanation when a sentence is judged as not factual.

Tasks with Known Answers. We add a number of tasks with known answers, enabling us to estimate the accuracy of workers who work on multiple of these.

Automatic Quality Checks. Workers who complete the tasks too quickly, write no or very short explanation texts, or have low accuracy are automatically removed from our worker pool. Their answers are replaced with new answers.

Bonus. We use a bonus incentive structure. Every worker who passes the automatic quality checks receives a bonus at the end.

Qualification Test. For all our evaluations on Mechanical Turk (see Sec. 3.1), we first set up a short qualification test that can be taken by any worker from a country whose main language is English, who has completed 100 or more HITs so far with an acceptance rate of 95% or higher. The qualification test consists of just three questions from our factual consistency setup, two of which must be answered correctly, along with an explanation text (5 words or more) to explain when "not factually consistent" was chosen. 53% of workers who start the test provide answers to all three questions, and 27.6% of these answer at least two correctly and provide a reasonable explanation text, i.e., only 14.6% of the test takers are granted the qualification.

The qualification enables workers to work on our factual consistency HITs as well as our HITs judging informativeness and coherence. The rate per HIT differs widely between the two tasks, as the factual consistency task can be done quickly, given that a single summary sentence is evaluated and the related sentences in the article are highlighted. The factual consistency task pays $0.15 per HIT with a bonus of $0.05. The task of evaluating informativeness and coherence (see D below) pays $0.50 per HIT with a bonus of $0.25, as more text is displayed compared to the factuality task. These amount to an average pay of $12.50, incl. the bonus. The bonus is paid to workers who spend at least 10 seconds per HIT and who give short explanation texts for their decisions.

D Human Evaluation of Informativeness and Coherence

We conducted a human evaluation to determine the informativeness and coherence of the summaries generated with the λ4 decoding constraint (Equation 1), which increases abstractiveness, compared to not using an abstractiveness constraint. We used the same setup as for the factuality task, incl. a qualification test, three annotators per task, and aggregation using MACE. The results are shown in Table 5.

Table 5: Human quality evaluation of summaries generated with no abstractiveness constraint ("off") versus λ4. We asked which summary is more informative or coherent, respectively. MN-800 stands for Multi-News with the input documents truncated to 800 words total (Section 5.2).

              CNN/DM         MN-800         XSum
              inf.    coh.   inf.    coh.   inf.    coh.
prefer off    40.7    40.7   44.0    34.0   20.0    19.3
prefer λ4     46.7    32.7   38.0    42.0   17.3    16.7
both equal    12.7    26.7   18.0    24.0   62.7    64.0

We know and expect that factuality decreases with increased abstractiveness. For informativeness and coherence, the results are mixed. For the CNN/DM dataset, for the majority of summaries, λ4 is preferred for informativeness but no decoding constraint is preferred for coherence. For Multi-News (MN-800), the preference is reversed. For XSum, the majority of summaries are indistinguishable for both categories.

We used the following definitions of informativeness and coherence for the human evaluation.
• Informativeness: the more informative summary is better at expressing the main points of the news story. It contains information that is more relevant and important. It has fewer unimportant details. Its content is more similar to the human-written summary.
• Coherence: the more coherent summary has better structure and flow, and is easier to follow. The facts are presented in a more logical order.

E Dataset Sizes

For all three public datasets, we use the training/validation/test split provided in (Hermann et al., 2015; Narayan et al., 2018; Fabbri et al., 2019). The statistics of the three datasets are described in Table 6.

Table 6: Train/valid/test split on public datasets.

Dataset       Train     Valid    Test
CNN/DM        287,227   13,368   11,490
XSum          204,045   11,332   11,334
Multi-News    44,972    5,622    5,622

F Abstractiveness Measures Proposed in FEQA on Three Datasets

The abstractiveness metrics of a generated summary proposed in FEQA (Durmus et al., 2020) fall into three categories: extraction, perfect fusion, and novel n-grams. The extraction measurements evaluate whether a complete summary sentence, a summary sentence span, or a subset of tokens in a summary sentence is copied from the input document(s). The perfect fusion measurements evaluate whether a summary sentence is copied from multiple sentences of the input document(s). The novel n-grams measurements evaluate whether tokens in a summary are not present in the input document(s). We observe that these metrics reflect increasing abstractiveness with larger penalties for extractive fragments on the Multi-News and XSum datasets. The results on CNN/DM are mixed.

Table 7: Abstractiveness metrics in FEQA (Durmus et al., 2020) for the data points shown in Figure 5. λ values denote decoding constraints (Section 2.1).

                   Extraction                   Perfect fusion
          λ       Sentence   Span    Word      k=2     k≥2
CNN/DM    1/λ2    29.03      6.36    38.29     16.05   20.66
          1/λ4    28.27      6.28    35.07     16.39   21.8
          none    19.75      5.71    37.22     18.5    25.41
          λ4      2.24       1.31    21.2      25.51   42.71
          λ2      0.76       0.45    7.54      14.39   31.78
MN-500    1/λ2    8.85       8.86    11.38     11.42   17.06
          none    4.78       4.64    6.86      8.41    13.43
          λ4      1.42       1.41    3.09      5.73    10.59
          λ2      1.07       0.97    1.99      3.78    6.79
MN-800    1/λ2    12.09      12.9    13.62     12.28   17.87
          none    7.17       7.81    9.31      9.91    15.12
          λ4      2.12       2.29    3.95      6.88    12.28
          λ2      1.44       1.5     2.32      4.42    7.74
XSum      1/λ1    1.13       0.38    0.94      1.48    10.56
          1/λ2    0.36       0.04    0.39      0.56    4.3
          none    0.16       0.03    0.22      0.23    2.43
          λ4      0.03       0.0     0.05      0.08    1.7
          λ2      0.0        0.0     0.0       0.02    0.71

G Additional Experimental Details

We use AWS p3.8x and p3.16x EC2 machines for all our experiments, except we run FEQA on the Multi-News summaries on a p3dn.24xlarge machine, as it requires more memory. The BART model has 406,290,432 parameters. We trained all models for 5 epochs following the instructions on the fairseq BART webpage, without further hyperparameter search. The minimum and maximum length for Multi-News decoding was determined by the lengths of the training reference summaries.