CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization
Shuyang Cao and Lu Wang
Computer Science and Engineering, University of Michigan, Ann Arbor, MI
{caoshuy, wangluxy}@umich.edu

arXiv:2109.09209v1 [cs.CL] 19 Sep 2021

Abstract

We study generating abstractive summaries that are faithful and factually consistent with the given articles. A novel contrastive learning formulation is presented, which leverages both reference summaries, as positive training data, and automatically generated erroneous summaries, as negative training data, to train summarization systems that are better at distinguishing between them. We further design four types of strategies for creating negative samples, to resemble errors commonly made by two state-of-the-art models, BART and PEGASUS, found in our new human annotations of summary errors. Experiments on XSum and CNN/Daily Mail show that our contrastive learning framework is robust across datasets and models. It consistently produces more factual summaries than strong comparisons with post error correction, entailment-based reranking, and unlikelihood training, according to QA-based factuality evaluation. Human judges echo the observation and find that our model summaries correct more errors.

[Figure 1]
XSum Article: The Fermanagh MLA Phil Flanagan tweeted after Tom Elliott appeared on a BBC radio programme in May 2014. . . . "I wonder if he will reveal how many people he harassed and shot as a member of the UDR." . . .
Contrastive learning (our method): A Sinn Féin MLA has been ordered to apologise and pay compensation to a former member of the Ulster Defence Regiment (UDR).
Cross-entropy: A Sinn Féin MLA has agreed to pay compensation to a former Ulster Unionist Party (UDR) MP after he tweeted that he had harassed and shot people as a member of the party.
Entailment reranking: A Sinn Féin MLA has agreed to pay compensation to a former Ulster Unionist Party (UDR) councillor for a tweet he sent about him.
Unlikelihood: An MLA has been ordered to apologise and pay compensation to a former loyalist MP for a remark he made about him while serving in the Ministry of Defence.

Figure 1: Sample article and system summaries by different methods. Our contrastive learning model trained on low confidence system outputs correctly generates the full name. Comparisons using cross-entropy loss, beam reranking by entailment scores (Kryscinski et al., 2020), and unlikelihood objective (Welleck et al., 2020) over negative samples all produce unfaithful content.

1 Introduction

Large pre-trained Transformers have yielded remarkable performance on abstractive summarization (Liu and Lapata, 2019; Lewis et al., 2020; Zhang et al., 2020a) with impeccable fluency, yet their summaries often contain factually inconsistent content (Maynez et al., 2020; Zhang et al., 2020b; Goyal and Durrett, 2020), even for state-of-the-art models. Three types of remedies have been proposed: running a separately learned error correction component (Dong et al., 2020), removing noisy training samples (Nan et al., 2021; Goyal and Durrett, 2021), and modifying the Transformer architecture (Huang et al., 2020; Zhu et al., 2021). Yet they either rely on heuristically created data for error handling, falling short of generalization, or require learning a large number of new parameters, and summary informativeness is often sacrificed.

Our goal is to train abstractive summarization systems that generate both faithful and informative summaries in an end-to-end fashion. We observe that, while the commonly used maximum likelihood training optimizes over references, there is no guarantee that the model distinguishes references from incorrect generations (Holtzman et al., 2020; Welleck et al., 2020). Therefore, potential solutions reside in designing new learning objectives that can effectively inform preferences for factual summaries over incorrect ones.

Concretely, we hypothesize that including factually inconsistent summaries (i.e., negative samples) for training, in addition to references (i.e., positive samples), lets models become better at differentiating these two types of summaries. Although using negative samples has been effective at text representation learning, e.g., word2vec (Mikolov et al.,
2013) and BERT (Devlin et al., 2019), there exist two major challenges for it to succeed in concrete language tasks. First, a suitable training objective is critical to avoid performance degradation (Saunshi et al., 2019). Second, it is nontrivial to construct "natural" samples that mimic the diverse errors made by state-of-the-art systems that vary in words and syntax (Goyal and Durrett, 2021).

To address both challenges, we first propose a new framework, CLIFF, that uses contrastive learning for improving faithfulness and factuality of the generated summaries.1 Contrastive learning (CL) has obtained impressive results on many visual processing tasks, such as image classification (Khosla et al., 2020; Chen et al., 2020) and synthesis (Park et al., 2020; Zhang et al., 2021b). Intuitively, CL improves representation learning by compacting positive samples while contrasting them with negative samples. Here, we design a task-specific CL formulation that teaches a summarizer to expand the margin between factually consistent summaries and their incorrect peers.

Moreover, we design four types of strategies with different variants to construct negative samples by editing reference summaries via rewriting entity-/relation-anchored text, and by using system generated summaries that may contain unfaithful errors. Importantly, these strategies are inspired by our new annotation study on errors made by state-of-the-art summarizers—models fine-tuned from BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020a)—on two benchmarks: XSum (Narayan et al., 2018) and CNN/DailyMail (CNN/DM) (Hermann et al., 2015).

We fine-tune pre-trained large models with our contrastive learning objective on XSum and CNN/DM. Results based on QuestEval (Scialom et al., 2021), a QA-based factuality metric of high correlation with human judgments, show that our models trained with different types of negative samples uniformly outperform strong comparisons, including using a summarizer with post error correction and reranking beams based on entailment scores to the source. Moreover, compared with the unlikelihood training method that penalizes the same negative samples (Welleck et al., 2020), our summaries also obtain consistently better QuestEval scores. Human evaluation further confirms that our models consistently reduce both extrinsic and intrinsic errors over baseline across datasets.

1 Our code and annotated data are available at https://shuyangcao.github.io/projects/cliff_summ.

2 Related Work

Factuality Improvement and Evaluation. Neural abstractive summaries often contain unfaithful content with regard to the source (Falke et al., 2019). To improve summary factuality, three major types of approaches are proposed. First, a separate correction model is learned to fix errors made by the summarizers (Zhao et al., 2020; Chen et al., 2021), including replacing entities absent from the source (Dong et al., 2020) or revising all possible errors (Cao et al., 2020). The second type targets modifying the sequence-to-sequence architecture to incorporate relation triplets (Cao et al., 2018), knowledge graphs (Zhu et al., 2021), and topics (Aralikatte et al., 2021) to inform the summarizers of article facts. Yet additional engineering efforts and model retraining are often needed. Finally, discarding noisy samples from model training has also been investigated (Nan et al., 2021; Goyal and Durrett, 2021); however, it often leads to degraded summary informativeness. In comparison, our contrastive learning framework allows the model to be trained end-to-end and does not require model modification, thus providing a general solution for learning summarization systems.

Alongside improving factuality, we have also witnessed growing interest in automated factuality evaluation, since popular word-matching-based metrics, e.g., ROUGE, correlate poorly with human-rated factual consistency levels (Gabriel et al., 2021; Fabbri et al., 2021). Entailment-based scorers are designed at summary level (Kryscinski et al., 2020) and finer-grained dependency relation level (Goyal and Durrett, 2020). QA models are employed to measure content consistency by reading the articles to answer questions generated from the summaries (Wang et al., 2020; Durmus et al., 2020), or by considering the summaries for addressing questions derived from the source (Scialom et al., 2019). Though not focusing on evaluation, our work highlights that models can produce a significant amount of world knowledge, which should be evaluated differently instead of as extrinsic hallucination (Maynez et al., 2020). We also show that world knowledge can possibly be distinguished from errors via model behavior understanding.

Training with negative samples has been investigated in several classic NLP tasks, such as grammatical error detection (Foster and Andersen, 2009)
and dialogue systems (Li et al., 2019). Notably, negative sampling plays a key role in word representation learning (Mikolov et al., 2013) and in training large masked language models, such as BERT and ALBERT, to induce better contextual representations (Devlin et al., 2019; Lan et al., 2020). For text generation tasks, unlikelihood training is proposed to penalize the generation of negative tokens (e.g., repeated words) and sentences (e.g., contradictory responses in a dialogue system) (Welleck et al., 2020; Li et al., 2020; He and Glass, 2020). We use contrastive learning that drives enhanced representation learning to better distinguish between factual and incorrect summaries, which encourages more faithful summary generation.

Contrastive Learning (CL) for NLP. CL has been a popular method for representation learning, especially for vision understanding (Hjelm et al., 2019; Chen et al., 2020). Only recently has CL been used for training language models with self-supervision (Fang et al., 2020), learning sentence representations (Gao et al., 2021), and improving document clustering (Zhang et al., 2021a). With a supervised setup, Gunel et al. (2021) adopt the contrastive objective to fine-tune pre-trained models on benchmark language understanding datasets. Using a similar idea, Liu and Liu (2021) enlarge the distances among summaries of different quality as measured by ROUGE scores.

3 CLIFF: Contrastive Learning Framework for Summarization

We design a contrastive learning (CL)-based training objective that drives the summarization model to learn a preference for faithful summaries over summaries with factual errors. It is then used for fine-tuning BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020a) to train summarization models. Formally, let an article x have a set of reference summaries P (henceforth positive samples) and another set of erroneous summaries N (negative samples). The contrastive learning objective is (Khosla et al., 2020; Gunel et al., 2021):

  l_{cl}^{x} = - \frac{1}{\binom{|P|}{2}} \sum_{y_i, y_j \in P,\, y_j \neq y_i} \log \frac{\exp(\mathrm{sim}(h_i, h_j)/\tau)}{\sum_{y_k \in P \cup N,\, y_k \neq y_i} \exp(\mathrm{sim}(h_i, h_k)/\tau)}   (1)

where h_i, h_j, and h_k are representations for summaries y_i, y_j, and y_k. sim(·, ·) calculates the cosine similarity between summary representations. τ is a temperature and is set to 1.0.

Importantly, summaries in P and N are included in the same batch during training, so that the model acquires better representations to differentiate correct summaries from those with errors by comparing the two types of samples, thus maximizing the probabilities of the positive samples and minimizing the likelihoods of the corresponding negative samples. The CL objective on the full training set, denoted as L_CL, is the sum of losses l_{cl}^{x} over all samples.

To effectively employ CL in summarization, we need to address two challenges: (1) how to automatically construct both positive and negative samples, which are critical for CL efficacy (Chen et al., 2020), and (2) how to represent the summaries (i.e., h*). Below we describe positive sample generation and options for h*, leaving the strategies for negative samples to § 5.

Positive Sample Construction (P). Summarization datasets often contain a single reference for each article. To create multiple positive samples, in our pilot study, we experiment with paraphrasing with synonym substitution (Ren et al., 2019), randomly replacing words based on the prediction of masked language models (Kobayashi, 2018), and back-translation (Mallinson et al., 2017). We find back-translation to be best at preserving meaning and offering language variation, and thus use NLPAug2 to translate each reference to German and back to English. Together with the reference, the best translation is kept and added to P, if no new named entity is introduced.

Summary Representation (h*). We use the outputs of the decoder's last layer, and investigate three options that average over all tokens, named entity tokens, or the last token of the decoded summary. Entities and other parsing results are obtained by spaCy (Honnibal et al., 2020). We further consider adding a multi-layer perceptron (MLP) with one hidden layer to calculate the final h*.

The final training objective combines the typical cross-entropy loss L_CE and our contrastive learning objective: L = L_CE + λ L_CL, where λ is a scalar and set to 1.0 for all experiments.

2 https://github.com/makcedward/nlpaug
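As a concrete sketch of Eq. (1) — our illustration, not the authors' released implementation — the loss can be computed over summary representations as follows. The toy vectors stand in for the decoder-derived h* described above; the cosine similarity, temperature τ = 1.0, and the normalization over P ∪ N follow the formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(pos, neg, tau=1.0):
    """Eq. (1): for each ordered positive pair (h_i, h_j), a log-softmax
    over all other summaries in P ∪ N, normalized by C(|P|, 2)."""
    reps = pos + neg  # positives occupy the first len(pos) slots
    loss = 0.0
    for i, hi in enumerate(pos):
        denom = sum(math.exp(cosine(hi, hk) / tau)
                    for k, hk in enumerate(reps) if k != i)
        for j, hj in enumerate(pos):
            if j == i:
                continue
            loss -= math.log(math.exp(cosine(hi, hj) / tau) / denom)
    n_pos_pairs = len(pos) * (len(pos) - 1) // 2  # |P| choose 2
    return loss / n_pos_pairs
```

A quick sanity check of the intended behavior: a negative sample whose representation is close to the positives ("hard") yields a larger loss than an orthogonal ("easy") one, which is exactly the margin the objective expands.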
[Figure 2: Percentage of samples with intrinsic and extrinsic error spans for models fine-tuned from BART and PEGASUS on XSum and CNN/DM.]

[Figure 3: Probability distributions of generating the first tokens of proper nouns (PROPN) and numbers (NUM), grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens.]

4 Summary Error Annotation and Model Behavior Analysis

We first describe annotating unfaithfulness errors by state-of-the-art models, i.e., models fine-tuned from BART and PEGASUS on XSum and CNN/DM. We then probe into the model generation behavior that is indicative of errors, which guides the design of negative sample construction strategies.

600 (150 × 2 × 2) summaries are annotated to demonstrate how often the models "hallucinate", i.e., generate content not grounded by the source. To characterize errors, we annotate text spans in summaries with (i) intrinsic errors, caused by misconstructing phrases or clauses from the source; and (ii) extrinsic errors, which include words not in the source that are either unverifiable or cannot be verified by Wikipedia. Content not covered by the article but validated by Wikipedia is annotated as world knowledge; the models' behavior pattern when generating it differs from when they generate errors.

Two fluent English speakers with extensive experience in summary evaluation and error labeling are hired. For each sample, they are shown the article and two system summaries, and instructed to annotate text spans with the aforementioned errors and world knowledge. After labeling every 50 samples, the annotators discuss and resolve any disagreement. The Fleiss's Kappas on XSum and CNN/DM are 0.35 and 0.45.

Error statistics are displayed in Fig. 2. Extrinsic errors dominate both datasets, especially on XSum. 58.7% of summaries by BART (and 44.0% by PEGASUS) contain at least one extrinsic error. Noticeably, PEGASUS is a newer model pre-trained with a larger amount of data, and thus contains fewer errors than BART and the other, older models studied for error annotations by Maynez et al. (2020). This observation also highlights the value of our annotations for future development and evaluation of summarization models.

Low confidence generation is indicative of extrinsic errors. Inspired by recent work that studies model prediction confidence (Liu et al., 2021), we examine generation probabilities for tokens of different part-of-speech (POS) tags. Fig. 3 shows salient results on the generation probabilities of the first token of a proper noun or a number (with additional analysis provided in Appendix A). As observed, model confidence tends to be lower for the first tokens of proper nouns and numbers if they are part of spans with extrinsic errors. Also note that world knowledge, which cannot be inferred from the source either, often has higher generation probability than extrinsic errors. Take this snippet generated by a fine-tuned BART as an example: "Manchester United captain Wayne Rooney's testimonial game against Manchester City. . .". "Manchester City" is an extrinsic error and "Wayne" is produced as world knowledge. The model assigns a low probability of 0.10 to the first token of "Manchester City" and a high probability of 0.92 to the token "Wayne". This implies that model confidence can be a useful indicator for negative sample collection.

5 Negative Sample Construction

Here we describe four strategies for constructing negative samples that modify the references (§ 5.1-5.3) or use system generated summaries (§ 5.4).

5.1 Entity Swap

Entity swap imitates intrinsic errors, as over 55% of intrinsic errors in our annotations are found to contain named entities. We construct negative samples by swapping named entities in the references with other randomly selected entities of the same entity type in the source (SWAPENT). One sample is constructed for each entity in the reference.
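The confidence signal above can be operationalized as a simple filter. The sketch below is our illustration, not the paper's code: it flags a generated summary as a candidate negative sample when the first token of any proper-noun or number span falls below a probability threshold. Token probabilities are assumed to come from the summarizer and POS tags from a tagger such as spaCy; the threshold value here is a placeholder (the paper tunes it by maximizing F1 against the error annotations).

```python
def flag_low_confidence(tokens, pos_tags, probs, threshold=0.3):
    """Return True if the first token of any PROPN/NUM span was
    generated with probability below `threshold` (candidate negative)."""
    prev_tag = None
    for tok, tag, p in zip(tokens, pos_tags, probs):
        span_start = tag in ("PROPN", "NUM") and prev_tag != tag
        if span_start and p < threshold:
            return True
        prev_tag = tag
    return False
```

In the running example, the first token of "Manchester City" (probability 0.10) would trip the filter, while a high-confidence world-knowledge token such as "Wayne" (0.92) would not.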
[Table 1: Negative sample construction strategies (§ 5). For summaries edited from the reference, their differences are bolded. Introduced errors are in red. Text before "⇒" is the prefix for regeneration.]

REFERENCE: A "rare" short-eared owl found emaciated in Flintshire is now recuperating well, the RSPCA have said.
SWAPENT: Flintshire → Bettisfield
⇒ A "rare" short-eared owl found emaciated in Bettisfield is now recuperating well, the RSPCA have said.
MASKENT: A "rare" short-eared owl found emaciated in [MASK] is now recuperating well, the RSPCA have said.
⇒ A "rare" short-eared owl found emaciated in a field in South Yorkshire is now recuperating well, the RSPCA have said.
MASKREL: A "rare" short-eared owl found [MASK] in [MASK] is now recuperating well, the RSPCA have said.
⇒ A "rare" short-eared owl found dead in London is now recuperating well, the RSPCA have said.
REGENENT: A "rare" short-eared owl found emaciated in
⇒ A "rare" short-eared owl found emaciated in Nottinghamshire is now at a wildlife centre to recover.
REGENREL: A "rare" short-eared owl found
⇒ A "rare" short-eared owl found in the grounds of a former coal mine is being cared for by the RSPCA in Somerset.
SYSLOWCON: An injured golden owl found in a former coal mine in Lancashire is being cared for by the RSPCA.

Though this idea has been studied by Kryscinski et al. (2020), they allow entities of different types to be used, e.g., a PERSON can be replaced by a LOCATION. Examples are displayed in Table 1. SWAPENT has the advantage of not depending on any trained model. Yet it only introduces intrinsic errors and lacks coverage of extrinsic errors, which is addressed by the following generation-based methods.

5.2 Mask-and-fill with BART

To simulate extrinsic errors, we leverage large unconditional language models' capability of converting a sequence with masked tokens into a fluent and appropriate sequence. Specifically, we replace each named entity in a reference with a [MASK] token and encode it with BART (without any fine-tuning). BART then fills this partially masked summary with newly generated entities (MASKENT). BART is chosen since it can fill [MASK] with a varying number of tokens. For each entity in the reference, we sample three summaries and only retain the ones containing at least one entity that is absent from both the source and the reference.

Up to now, the two introduced strategies both focus on incorrect named entities. To cover more diverse extrinsic and intrinsic errors (Goyal and Durrett, 2020), we extend MASKENT to relations (MASKREL). We first obtain dependency relations using Stanza (Qi et al., 2020), with each relation denoted as ⟨gov, dep, relation⟩. To incorporate more context, we consider noun phrase spans enclosing the token of gov or dep if it is a content word and the noun phrase contains a named entity. Similar to MASKENT, three negative samples are generated by BART based on the input with both gov and dep spans masked in the reference. Only the samples that introduce any new dependency relation contained in neither the source nor the reference are kept. Specifically, we consider a dependency relation matched if the same form or a synonym of its gov and dep is found in the source or the reference with the same relation.

Both MASKENT and MASKREL can create more extrinsic errors compared to the other strategies introduced in this section, since negative samples are generated without being grounded on the source articles. However, their constructed negative samples may contain drifted topics that can be easily detected by a summarization model, resulting in less efficient training signals.

5.3 Source-conditioned Regeneration

To ground negative sample generation with the article, we further design a regeneration strategy based on conditional generation. For each named entity in the reference, we treat the text before it as a prompt. A summarizer, e.g., fine-tuned from BART or PEGASUS, first reads in the source using the encoder, then receives the prompt as the first part of the decoder output, and finally decodes the rest of the content based on nucleus sampling (Holtzman et al., 2020) with a cumulative probability threshold of 0.7. The prompt and the regenerated text comprise the final negative sample. This method is denoted as REGENENT.

We also extend entities to relations with expanded governor and dependent spans (REGENREL). Here, we consider a prompt as the text before the gov or dep span, whichever occurs first. For both REGENENT and REGENREL, three negative samples are generated for each prompt, and a sample is kept if it introduces any new entity (for REGENENT) or dependency relation (for REGENREL) with regard to the source and the reference.

Negative samples generated by both methods are more relevant to the article than the mask-and-fill
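To make the SWAPENT strategy from § 5.1 concrete, here is a minimal sketch under stated assumptions: the paper extracts entities with spaCy, which we replace here with pre-extracted (text, type) annotations, and the function itself is our illustration rather than the released code. One negative sample is built per reference entity by swapping in a randomly chosen source entity of the same type.

```python
import random

def swap_ent(reference, ref_entities, source_entities, seed=0):
    """SWAPENT sketch: one negative sample per reference entity,
    replacing it with a random same-type entity from the source.
    Entities are (text, type) pairs, e.g. ("Flintshire", "GPE")."""
    rng = random.Random(seed)
    negatives = []
    for text, etype in ref_entities:
        candidates = [t for t, ty in source_entities
                      if ty == etype and t != text]
        if not candidates:
            continue  # no same-type source entity to swap in
        negatives.append(reference.replace(text, rng.choice(candidates), 1))
    return negatives

ref = ('A "rare" short-eared owl found emaciated in Flintshire '
       'is now recuperating well, the RSPCA have said.')
negs = swap_ent(ref, [("Flintshire", "GPE")],
                [("Bettisfield", "GPE"), ("RSPCA", "ORG")])
```

Restricting candidates to the same entity type is the design choice that distinguishes SWAPENT from the FactCC-style construction, which permits cross-type swaps such as PERSON → LOCATION.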
strategy, yet they may still miss certain types of errors and differ from real model outputs, since they are modified from the reference summaries.

5.4 System Generation

Motivated by the model confidence analysis in § 4, we explore using system generated summaries as negative samples. We first run fine-tuned BART or PEGASUS on the same training set to decode summaries. For each summary, we check the model confidence on the first token of each proper noun and number span. If the probability is below a threshold, we keep it as a negative sample (SYSLOWCON). The threshold is tuned by maximizing F1 based on our error annotations.

We consider all beams at the last decoding step as candidates. We use beam sizes of 6 and 4 for XSum and CNN/DM. Statistics of negative samples constructed by different strategies are in Appendix B.

6 Experiment Setup

Evaluation Metrics. QuestEval (Scialom et al., 2021) is used as the main metric to evaluate summaries' factual consistency. Given an article and a summary, QuestEval first generates natural language questions for entities and nouns from both. A QA model then consumes the article to answer the questions derived from the summary, producing a score. Another score is obtained from a QA model addressing article-based questions after reading the summary. The final QuestEval score is the harmonic mean of the two. We use the version with learned weights for questions, which has shown high correlation with human-judged consistency and relevance.

We further use FactCC (Kryscinski et al., 2020), trained based on their negative sample construction method, to measure whether the summary can be entailed by the source. We also report ROUGE-L (Lin, 2004). Both FactCC and ROUGE-L reasonably correlate with summary factuality as judged by humans (Pagnoni et al., 2021).

Based on our error annotations, we report the correlations between each metric and the error rate—the percentage of tokens that are part of an error span—and the raw number of errors (Table 2). QuestEval correlates better on both aspects than the other metrics.

Table 2: Pearson correlation between metrics and error rates and numbers of errors. *: p-value < 0.05.

Metric      XSum                 CNN/DM
            % of Err  # of Err   % of Err  # of Err
QuestEval   -0.43*    -0.25*     -0.33*    -0.29*
FactCC      -0.02     -0.15*     -0.13*    -0.12*
ROUGE-1     -0.16*    -0.02      -0.03     -0.06
ROUGE-2     -0.11*    -0.05      -0.02     -0.04
ROUGE-L     -0.13*    -0.03      -0.06     -0.08

Comparisons. In addition to the models fine-tuned with cross-entropy loss (CRSENTROPY), we consider reranking beams based on FactCC score (also our metric) at the last decoding step (ENTAILRANK). We also include three common methods of improving factuality: (1) CORRECTION fine-tunes BART to fix summary errors as a separate step (Cao et al., 2020). (2) SUBSETFT fine-tunes large models on training samples without any dependency relation error (Goyal and Durrett, 2021), with a released checkpoint only available for XSum. (3) FASUM modifies the Transformer by incorporating knowledge graphs for factual consistency (Zhu et al., 2021), with model outputs available only on CNN/DM.

Moreover, we compare with unlikelihood training, which penalizes the probabilities of all tokens in a negative sample (Li et al., 2020). Given a negative sample y', the loss is defined as - \sum_{t=1}^{|y'|} \log(1 - p(y'_t | y'_{1:t-1}, x)), where p(y'_t | y'_{1:t-1}, x) is the output probability at the t-th step. We combine the unlikelihood training objective with the cross-entropy loss with equal weights for fine-tuning.

Lastly, we compare our negative sample strategies with the negative samples constructed for training the FactCC scorer, denoted as FCSAMPLE. For CL only, we compare with using other samples in the same batch as negative samples (BATCH), a common practice for CL-based representation learning (Gao et al., 2021; Zhang et al., 2021a).

7 Results

7.1 Automatic Evaluation

We report results by models fine-tuned from BART and PEGASUS with different objectives and negative samples on XSum and CNN/DM in Tables 3 and 4. CLIFF models use a summary representation that averages over all tokens with MLP projection, with other variants discussed in § 7.3. Unless explicitly stated, comparison models are fine-tuned from the same large model used by CLIFF.

First, comparing with other factuality improvement models (top of the tables), almost all CLIFF models trained with different negative samples uni-
Table 3: Results of models fine-tuned from BART on XSum and CNN/DM. QEval: QuestEval; FC: FactCC; R-L: ROUGE-L. The best result per metric per dataset is bolded. For models of unlikelihood training and CLIFF that use the same negative samples, the better of the two is highlighted with green. *: our model is significantly better than CRSENTROPY (approximation randomization test, p < 0.005).

Model        XSum                  CNN/DM
             QEval   FC     R-L    QEval  FC     R-L
Comparisons without Negative Samples
CRSENTROPY   33.09   23.92  37.14  50.94  49.07  40.82
ENTAILRANK   32.95   38.45  36.55  51.02  49.84  40.89
CORRECTION   33.12   24.14  37.11  50.93  49.06  40.82
SUBSETFT     32.25   21.83  30.35  -      -      -
Comparisons with Unlikelihood Training
FCSAMPLE     32.93   24.46  33.96  50.60  35.09  41.22
SWAPENT      32.90   23.94  34.96  49.77  32.37  40.18
MASKENT      33.21   24.22  33.89  51.01  48.57  41.23
MASKREL      33.18   23.50  34.56  51.00  48.35  41.15
REGENENT     32.41   24.12  37.08  50.97  48.59  41.07
REGENREL     30.86   24.58  37.18  50.97  48.42  41.14
SYSLOWCON    32.01   26.30  32.04  50.82  48.66  40.81
Our Method: CLIFF
BATCH        33.18   24.88  36.76  50.99  52.18  40.98
FCSAMPLE     33.15   24.50  36.72  51.02  49.62  41.06
SWAPENT      33.30   25.67* 35.60  51.05  50.96  40.89
MASKENT      33.32*  25.73* 36.02  50.98  49.04  41.06
MASKREL      33.35*  25.69* 35.86  51.03  49.89  41.04
REGENENT     33.15   24.64  36.32  51.04  49.91  41.11
REGENREL     33.22   25.39  36.21  50.96  49.48  41.11
SYSLOWCON    33.35*  25.47* 36.19  51.05  50.05  41.01

Table 4: Results of models fine-tuned from PEGASUS on XSum and CNN/DM. We report results on 5,000 randomly selected samples on CNN/DM, due to the long running time of QuestEval. For models of unlikelihood training and CLIFF that use the same negative samples, the better of the two is highlighted with green. *: our model is significantly better than CRSENTROPY (approximation randomization test, p < 0.005).

Model        XSum                  CNN/DM
             QEval   FC     R-L    QEval   FC      R-L
Comparisons without Negative Samples
CRSENTROPY   32.50   25.48  39.07  50.21   44.44   40.39
ENTAILRANK   32.42   41.90  38.47  50.15   61.04   40.67
CORRECTION   32.55   25.15  39.02  49.48   32.96   39.79
FASUM        -       -      -      50.73   50.58   37.18
Comparisons with Unlikelihood Training
FCSAMPLE     32.79   25.37  38.46  50.63   45.45   39.28
SWAPENT      32.88   24.76  37.91  50.43   43.02   38.96
MASKENT      33.04   26.30  37.51  51.11   52.19   39.34
MASKREL      33.08   24.38  38.05  51.14   52.93   39.31
REGENENT     32.89   24.46  38.47  51.11   52.90   39.23
REGENREL     32.91   24.80  38.46  51.07   53.68   39.45
SYSLOWCON    31.66   26.06  34.03  50.92   51.08   39.19
Our Method: CLIFF
BATCH        32.64   24.96  38.42  51.03*  51.81*  39.38
FCSAMPLE     32.96*  25.28  38.58  51.00*  51.80*  39.37
SWAPENT      33.09*  25.09  38.58  51.16*  52.97*  38.95
MASKENT      33.09*  25.75  38.12  51.13*  53.60*  39.24
MASKREL      33.06*  25.28  38.37  51.17*  53.34*  39.36
REGENENT     33.09*  24.48  38.33  50.99*  52.18*  39.28
REGENREL     33.16*  24.82  38.30  51.16*  53.21*  39.25
SYSLOWCON    33.21*  25.18  38.18  50.85*  53.73*  39.30

formly produce higher QuestEval scores across datasets with both large models, with the improvements more pronounced on XSum. Importantly, ROUGE scores for CLIFF models are comparable to or better than baselines trained with cross-entropy, e.g., on CNN/DM as in Table 3. A similar trend is observed with the FactCC metric, especially when using PEGASUS as the base model (Table 4). Note that ENTAILRANK tends to yield significantly higher FactCC scores, though it obtains lower QuestEval scores than the cross-entropy baseline. Human inspection finds that ENTAILRANK can pick up beams with peculiar words that attain high FactCC scores without improving factuality. Moreover, the other comparisons based on post CORRECTION and model engineering (FASUM) only offer incremental gains. The sample selection-based method, SUBSETFT, sacrifices ROUGE scores significantly. Overall, CLIFF demonstrates stronger generalizability.

Second, CLIFF is more effective and robust than unlikelihood training with the same negative samples. According to Table 3, using 7 negative sample construction strategies on two datasets, CLIFF obtains higher QuestEval scores than unlikelihood training in 12 out of the 14 comparisons. Using PEGASUS, CLIFF also outperforms in 11 setups as listed in Table 4. Similar trends are found on FactCC and ROUGE-L. Another noteworthy point is that CLIFF's improvements over the cross-entropy baseline are more consistent, whereas unlikelihood training occasionally hurts factuality or ROUGE scores significantly. We believe the key advantage of CLIFF resides in its measure of representation similarities between positive and negative samples in the same batch, allowing models to better differentiate between correct and erroneous summaries.

Finally, among all variants, CLIFF trained with low confidence summaries as negative samples obtains the best QuestEval scores on the more abstractive dataset. As seen in Table 3, using low confidence summaries also improves FactCC scores on both datasets, and enhances ROUGE-L on the more
extractive dataset, CNN/DM. This indicates that system-generated summaries organically contribute more diverse errors made by existing models, which are particularly suitable for our CL framework. As we use summaries generated by the same model for CLIFF training, one future direction is to use outputs from different models. For our mask-and-fill and source-conditioned regeneration strategies, we find that relation-anchored construction often beats its entity-anchored counterparts. This calls for efforts that steer entity-driven methods in a more relation-focused direction.

Combining Strategies. We further show results of fine-tuning BART using samples based on combined negative sample construction strategies in Table 5. As can be seen, combining SysLowCon with other strategies yields better QuestEval scores than models trained with negative samples from any single strategy, except for MaskEnt and RegenEnt on XSum. This signifies the importance of covering diverse types of errors in negative samples.

              XSum                    CNN/DM
Strategy      QEval  FC     R-L      QEval  FC     R-L
SysLowCon     33.35  25.47  36.19    51.05  50.05  41.01
 +SwapEnt     33.40  25.50  35.50    51.32  53.95  40.57
 +MaskEnt     33.21  25.47  35.91    51.16  51.90  40.66
 +MaskRel     33.39  25.20  35.70    51.24  52.48  40.80
 +RegenEnt    33.31  25.07  35.94    51.21  51.86  40.91
 +RegenRel    33.38  24.97  36.03    51.13  50.85  40.97

Table 5: Results of fine-tuned BART with combinations of negative sample construction strategies.

7.2 Human Evaluation

Pairwise Comparison with Cross-entropy. We recruit the two human annotators from our summary error study, as well as another experienced annotator, to evaluate summary informativeness and factual consistency. For each article, the judges are shown summaries generated by the CrsEntropy model and four other systems. They then rate each system summary against the CrsEntropy summary. All four summaries generated by different factuality-improved models are shown in random order with system names hidden, ensuring fair comparison among them. We randomly pick 100 articles from each dataset used in our error analysis study in § 4, and evaluate summaries generated by EntailRank, unlikelihood training (ULL) with negative samples constructed by MaskEnt, and CLIFF models trained with Batch and SysLowCon negative samples. All are fine-tuned from BART. Detailed evaluation guidelines are in Appendix D.

(a) XSum
Model          Inform. (Win↑ / Tie / Lose↓)   Factual. (Win↑ / Tie / Lose↓)
EntailRank      3.3 / 84.3 / 12.3             23.7 / 71.7 /  4.7
ULL.MaskEnt     6.3 / 80.7 / 13.0             26.3 / 62.0 / 11.7
CL.Batch        5.3 / 80.0 / 14.7             21.7 / 68.0 / 10.3
CL.SysLowCon    8.7 / 78.3 / 13.0             31.3 / 61.7 /  7.0

(b) CNN/DM
Model          Inform. (Win↑ / Tie / Lose↓)   Factual. (Win↑ / Tie / Lose↓)
EntailRank      2.3 / 86.3 / 11.3              4.7 / 94.7 /  0.7
ULL.MaskEnt    18.0 / 71.0 / 11.0             17.3 / 79.7 /  3.0
CL.Batch       17.3 / 74.7 /  8.0             20.7 / 77.0 /  2.3
CL.SysLowCon   15.7 / 75.7 /  8.7             20.0 / 77.7 /  2.3

Table 6: Percentages of summaries that are better than, tied with, or worse than CrsEntropy, in informativeness (Inform.) and factual consistency (Factual.). The Krippendorff's αs are 0.33 and 0.62 for the two aspects on XSum, and 0.34 and 0.89 on CNN/DM. Our CL method using low confidence summaries is more frequently rated as better for informativeness and factuality on the more abstractive dataset XSum.

Table 6 shows that on the more abstractive XSum data, CL trained with low confidence samples is more frequently rated as more informative and more factual than CrsEntropy summaries. This echoes our automatic evaluations with QuestEval in § 7.1. On CNN/DM, all models trained with negative samples produce summaries with better informativeness and faithfulness. In contrast, EntailRank summaries are less distinguishable from outputs by CrsEntropy on both datasets, as more ties are found. We show sample outputs in Fig. 1, with additional examples in Appendix E.

Figure 4: Portions of summaries with errors. CL models consistently reduce both types of errors.

Intrinsic vs. Extrinsic Errors. Next, the annotators are asked to label text spans with intrinsic and extrinsic errors as done in § 4. Fig. 4 shows that CL is more effective at reducing extrinsic errors than unlikelihood training on both datasets. We also observe slight decreases of world knowledge in the summaries (figure attached in Appendix D).

Error Correction Operations. Finally, with reference to CrsEntropy summaries, human judges are instructed to label whether each system summary corrects any error made by CrsEntropy using deletion of the incorrect content, substitution with factual information, or both. As seen in Fig. 5, CL-based models restore factually consistent information, e.g., by replacing erroneous names and numbers with correct ones, more frequently than unlikelihood training or entailment reranking.

Figure 5: Summaries use different portions of error correction operations. Contrastive learning with SysLowCon (CL.SLC) and Batch (CL.B) substitute errors with correct content more often than unlikelihood training with MaskEnt (ULL.) and EntailRank (EntRank.).

7.3 Variants of Summary Representation

Sample representation is critical for CL to be effective. Here we investigate summary representation variants as discussed in § 3. There are two major considerations: (1) Should we consider all tokens in a summary or only representative ones (e.g., entities or the last token)? (2) Should an additional transformation, i.e., an MLP, be used?

                SwapEnt        MaskRel        SysLowCon
Rep.    MLP     QEval  FC      QEval  FC      QEval  FC
BART
Last    X       33.15  25.10   33.20  25.29   33.10  24.85
Last            +0.13  +0.02   –0.01  –0.32   –0.07  –0.10
Entity  X       33.35  25.41   33.34  25.44   33.32  24.46
Entity          –0.13  –0.07   –0.14  –0.05   –0.29  +0.72
All     X       33.30  25.67   33.35  25.69   33.35  25.47
All             –0.23  –0.80   –0.04  –0.48   –0.26  –0.40
PEGASUS
Last    X       33.07  25.45   32.99  25.09   33.18  24.94
Last            –0.07  –0.56   +0.01  –0.01   –0.02  –0.04
Entity  X       33.03  25.43   33.05  24.77   33.20  24.59
Entity          +0.01  –0.34   –0.05  +0.64   –0.30  +0.05
All     X       33.09  25.09   33.06  25.28   33.21  25.18
All             –0.11  +0.25   +0.03  –0.19   –0.02  –0.80

Table 7: Comparing different formulations of summary representation in CL. For models without MLP, we display score changes from their counterparts. Overall, using all tokens with MLP produces better summaries.

Experiments on XSum using three negative sample construction strategies demonstrate that averaging the decoder outputs of all tokens and adding an MLP projection yields the best overall performance, as shown in Table 7. The implications are at least two-fold. First, even for entity- or relation-triggered sample modifications, using more global context helps with CL training. Second, the additional transformation can help avoid model degeneration. For instance, more nonsensical and repetitive content is produced by variants without the MLP.

8 Conclusion

We present CLIFF, a contrastive learning-based framework to promote faithfulness and factuality of abstractive summaries. CLIFF uses both references and summaries that are factually inconsistent with the articles to train systems to be better at discriminating errors from factual and salient content. We further study strategies that automatically create erroneous summaries by editing references or leveraging system outputs, inspired by our new summary error analysis on state-of-the-art models. Both automatic evaluation and human ratings show that CLIFF achieves consistent improvements over competitive comparison methods, and generalizes across datasets with systems fine-tuned from different large models.

Acknowledgements

This research is supported in part by Oracle for Research Cloud Credits, National Science Foundation through a CAREER award IIS-2046016, and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We thank three anonymous reviewers for their valuable suggestions.
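As a concrete illustration of the objective CLIFF optimizes — scoring representation similarities between factually correct (positive) and erroneous (negative) summaries in the same batch — below is a minimal, dependency-free sketch. The cosine similarity, temperature parameter, and supervised-contrastive form are illustrative assumptions; the paper's actual implementation operates on pooled decoder hidden states of BART/PEGASUS with an MLP projection (§ 7.3), not on raw vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(pos_reps, neg_reps, tau=1.0):
    """In-batch contrastive loss over summary representations.

    pos_reps: representations of factually correct summaries.
    neg_reps: representations of erroneous summaries.
    Each positive is pulled toward the other positives and pushed
    away from every negative (a simplified illustrative form).
    """
    P = len(pos_reps)
    total = 0.0
    for i in range(P):
        others = [pos_reps[j] for j in range(P) if j != i]
        pos_sims = [cosine(pos_reps[i], o) / tau for o in others]
        neg_sims = [cosine(pos_reps[i], n) / tau for n in neg_reps]
        # softmax denominator over all other samples in the batch
        denom = sum(math.exp(s) for s in pos_sims + neg_sims)
        # average log-probability of matching another positive
        total -= sum(s - math.log(denom) for s in pos_sims) / len(pos_sims)
    return total / P
```

When positives cluster together and away from negatives, the loss is small; when positives and negatives are intermixed, it grows, which is exactly the separation pressure the training signal provides.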
Figure 6: Probability distributions of generating the first tokens of nouns and verbs, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens. No verb is annotated as world knowledge.

Figure 7: Probability distributions of generating the non-first tokens of proper nouns, numbers, nouns, and verbs, grouped by extrinsic errors, intrinsic errors, world knowledge, and other correct tokens. Non-first tokens do not exist for numbers and verbs, as they only contain single tokens.

A Additional Analysis for Summary Error Annotation

We hire two fluent English speakers to annotate summary errors on XSum and CNN/DailyMail (CNN/DM). They annotate a common batch of 100 summaries generated by summarizers fine-tuned from BART and PEGASUS, with 50 articles in each batch. The two annotators are shown 50 HTML pages in a batch, each of which contains an article and two summaries generated by the two models. The detailed annotation guideline is given in Fig. 9.

For our analysis of token generation probabilities, we additionally show the distributions of the first token's probability for nouns and verbs in Fig. 6. We also report the distributions of the non-first token's probability for proper nouns, numbers, nouns, and verbs in Fig. 7. As can be seen, tokens within extrinsic and intrinsic errors have high generation probabilities when they are non-first tokens.

B Statistics for Datasets and Training Samples

Summarization Datasets. We follow the official data splits for the two datasets, with the number of samples in each split listed in Table 8.

Dataset    Train     Validation   Test
XSum       204,045   11,332       11,334
CNN/DM     287,227   13,368       11,490

Table 8: Numbers of samples in train/validation/test splits of XSum and CNN/DM.

Positive Samples. We observe unfaithful paraphrases by back-translation for some reference summaries, which are mainly due to the introduction of new entities and the rewriting of quoted text. Thus, we discard samples generated by back-translation that contain new entities or inconsistent quoted text. Finally, we obtain 182,114 and 91,468 positive samples by back-translation on XSum and CNN/DM.

Negative Samples. For consistency, we use the summarizer fine-tuned from BART in the RegenEnt, RegenRel (§ 5.3), and SysLowCon (§ 5.4) strategies. We tune a threshold to select negative samples from model generations in our SysLowCon strategy. The threshold is set to 0.21, yielding F1 scores of 73.99 and 40.49 on the XSum and CNN/DM annotated samples generated by BART. The numbers of negative samples constructed by each strategy for training on XSum and CNN/DM are shown in Table 9. SysLowCon constructs the fewest negative samples in total, while it achieves the best results as reported in our main paper (§ 7.1), indicating that its negative samples are more effective for training.
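The SysLowCon selection step above can be sketched as follows. This is a simplification made for illustration: here confidence is taken as the average token generation probability of a candidate summary, whereas the paper defines its confidence score over generated tokens as described in § 5.4; the 0.21 threshold is the value reported above for BART outputs.

```python
def select_low_confidence(candidates, threshold=0.21):
    """Pick system summaries whose generation confidence falls below
    `threshold`, to serve as negative training samples.

    candidates: iterable of (summary_text, token_probabilities) pairs,
    where token_probabilities are the model's per-token generation
    probabilities for that summary.
    """
    negatives = []
    for summary, token_probs in candidates:
        if not token_probs:  # skip empty generations
            continue
        # simplified confidence: mean per-token generation probability
        confidence = sum(token_probs) / len(token_probs)
        if confidence < threshold:
            negatives.append(summary)
    return negatives
```

For example, a summary whose tokens are generated with probabilities [0.1, 0.2] averages 0.15 < 0.21 and is kept as a negative sample, while one averaging 0.85 is not.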
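The significance markers (∗) in the result tables come from an approximate randomization test (p < 0.005). A minimal paired version is sketched below, assuming per-example metric scores for the two systems are available; the trial count and tie-handling (+1 smoothing) are implementation choices, not details taken from the paper.

```python
import random

def approx_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Paired approximate randomization test.

    scores_a, scores_b: per-example metric scores of two systems on
    the same test examples. Returns a two-sided p-value for the
    observed difference in mean score.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    count = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # randomly swap the paired scores with probability 0.5
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            count += 1
    # +1 smoothing keeps the p-value strictly positive
    return (count + 1) / (trials + 1)
```

Identical score lists yield p ≈ 1, while consistently higher scores for one system drive p toward the smallest resolvable value of 1/(trials + 1).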