Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation
Yuexiang Xie1, Fei Sun1, Yang Deng2,∗, Yaliang Li1,†, Bolin Ding1
1 Alibaba Group   2 The Chinese University of Hong Kong
{yuexiang.xyx, ofey.sf, yaliang.li, bolin.ding}@alibaba-inc.com   ydeng@se.cuhk.edu.hk

arXiv:2108.13134v2 [cs.CL] 8 Sep 2021

∗ Work done at Alibaba.   † Corresponding author.

Abstract

Although significant progress has been achieved in text summarization, factual inconsistency in generated summaries still severely limits its practical applications. Among the key factors for ensuring factual consistency, a reliable automatic evaluation metric is the first and the most crucial one. However, existing metrics either neglect the intrinsic cause of factual inconsistency or rely on auxiliary tasks, leading to an unsatisfactory correlation with human judgments or inconvenience of usage in practice. In light of these challenges, we propose a novel metric to evaluate the factual consistency in text summarization via counterfactual estimation, which formulates the causal relationships among the source document, the generated summary, and the language prior. We remove the effect of the language prior, which can cause factual inconsistency, from the total causal effect on the generated summary, which provides a simple yet effective way to evaluate consistency without relying on other auxiliary tasks. We conduct a series of experiments on three public abstractive text summarization datasets, and demonstrate the advantages of the proposed metric in both improving the correlation with human judgments and the convenience of usage. The source code is available at https://github.com/xieyxclack/factual_coco.

1 Introduction

In recent years, significant progress has been achieved in text summarization: with the help of deep neural networks, we can generate informative, relevant, and fluent texts (See et al., 2017; Narayan et al., 2018; Liu and Lapata, 2019). However, it still remains a major challenge to ensure the factual consistency of the generated summary with respect to the source document (Zhang et al., 2020c; Kryscinski et al., 2020). For instance, in the annotated data released by Maynez et al. (2020), more than 50% of the generated summaries are not completely consistent with the source document. Such factual inconsistency between the source document and the generated summary, also known as hallucination, undoubtedly limits practical applications of text summarization techniques.

To ensure factual consistency in text summarization, a reliable automatic evaluation metric is the first and the most crucial factor (Goodrich et al., 2019). The predominant automatic metrics, e.g., ROUGE (Lin, 2004) and METEOR (Lavie and Agarwal, 2007), are mainly based on n-gram lexical overlap and have been shown to correlate poorly with human judgments on factual consistency (Bhandari et al., 2020; Maynez et al., 2020; Wang et al., 2020). To better evaluate the factual consistency of summarization systems, various types of metrics have been introduced, including computing semantic similarity with pretrained models instead of n-gram based similarity (Zhang et al., 2020b; Koto et al., 2020), and using auxiliary tasks such as textual entailment (Falke et al., 2019; Maynez et al., 2020) and question answering (Chen et al., 2018; Eyal et al., 2019; Wang et al., 2020; Scialom et al., 2021). However, none of them tackles this issue from the view of the intrinsic cause of factual inconsistency. Besides, some metrics rely on auxiliary tasks (e.g., question answering), which makes them costly and inconvenient. In short, automatic evaluation of factual consistency in text summarization remains an open research problem.
Revisiting sequence-to-sequence (Seq2Seq) summarization models, a summary is generated according to the encoded source document and the learned decoder. The information in the source document is encoded and used to ensure the factual consistency of the generated summary. The decoder, as a language model, learns the language prior from the training corpus to transform the encoded source document into an informative and fluent summary. However, side effects also come along with the language prior: inconsistent tokens are hallucinated due to spurious linguistic correlations learned from the training corpus. For instance, when the term green leaves occurs frequently in the training corpus and the summarization model has learned such language prior knowledge, the model is likely to generate the inconsistent term green leaves even though the source document is about red maple leaves. Similar hallucination phenomena have also been observed in other conditional text generation tasks, including image captioning (Hendricks et al., 2018) and data-to-text generation (Filippova, 2020).

In light of the above challenges and insights, in this paper we seek to design a simple yet effective evaluation metric, named Counterfactual Consistency (denoted as CoCo), for text summarization. Different from existing metrics, CoCo evaluates the factual consistency of summarized texts via counterfactual estimation, and it does not rely on auxiliary tasks, which brings convenience of usage in practice. To be specific, with the help of causal inference (Pearl and Mackenzie, 2018; Yao et al., 2021), we formulate the causal relationships among the source document, the generated summary, and the language prior to build up the causal graph for text summarization. According to the built causal graph and the analysis of causal effects, we point out that the effect of the language prior can be blamed for causing factual inconsistency. Thus, we propose counterfactual abstractive summarization to estimate the causal effect of the language prior on the generated summary, and remove it from the total causal effect. The estimated effect, which is the causal effect of the source document on the generated summary, serves as a factual consistency score of the generated summary. The intuition is that when texts are generated relying more on the source document rather than the language prior, they are more likely to be factually consistent w.r.t. the source document.

To demonstrate the effectiveness of the proposed metric CoCo, we conduct a series of experiments on three public datasets, which are derived from the widely used benchmarks CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016) or XSUM (Narayan et al., 2018) and have human annotations on factual consistency. Without relying on auxiliary tasks, the proposed metric CoCo achieves a significant improvement over the existing automatic metrics for text summarization in terms of the correlation with human annotations.

2 Related Work

The most popular n-gram based evaluation metrics, e.g., ROUGE (Lin, 2004), BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007), have been shown to perform poorly on measuring factual consistency (Wang et al., 2020). Inspired by the success of pretrained contextual word embeddings, BERTScore (Zhang et al., 2020b) leverages the pretrained BERT (Devlin et al., 2019) model to compute the similarity between the generated summary and the reference. However, these metrics cannot achieve a satisfying correlation with human judgments on factual consistency, since they only capture token-level overlap or similarity. Hence, instead of defining metrics at the token level, Goodrich et al. (2019) propose to measure factual consistency by counting the overlap of facts (i.e., relation tuples) extracted from the generated summary and the source document.

Several works have also explored Natural Language Inference (NLI) to evaluate factual consistency by calculating the entailment probability between the document and its abstractive summaries (Falke et al., 2019; Maynez et al., 2020). They assume that a factually consistent summary is usually entailed by the source document. To address the issue of domain shift in out-of-the-box NLI models, synthetic training datasets, e.g., augmented from summarization datasets (Kryscinski et al., 2020) and QA datasets (Mishra et al., 2021), are created to finetune BERT-based NLI models.
Question answering has also been used as an evaluation method for summarization (Mani et al., 1999; Clarke and Lapata, 2010). For factual consistency evaluation, the basic intuition is to test whether a summary and its corresponding source have similar answers to synthetic questions. The differences between various works mainly lie in question generation and metric computation. For example, Eyal et al. (2019) generate questions from the reference summary; Wang et al. (2020) and Durmus et al. (2020) generate questions from the evaluated summary; while Scialom et al. (2021) generate questions from both the evaluated summary and its corresponding source document. For metric computation, Eyal et al. (2019) average the percentage of questions answered correctly according to the generated summaries, while Wang et al. (2020) compute the similarity of the two answers from the summary and its corresponding source with a token-level F1 score.

In summary, the existing metrics measure factual consistency by calculating the semantic similarity between the generated summary and the source document or the references, with the help of extrinsic tools like pretrained language models or other auxiliary tasks. However, few of them explore the intrinsic cause of factual inconsistency. In this paper, we study this task from the perspective of causal inference (Pearl, 2001; Pearl and Mackenzie, 2018).

3 Methodology

In this section, we introduce the proposed metric for measuring factual consistency in text summarization, named Counterfactual Consistency (denoted as CoCo).

3.1 Causal Graph of Text Summarization

We first introduce two key concepts of causal inference, i.e., causal graph and causal effect. A causal graph represents the causal relationships between variables using a directed acyclic graph G = {V, E}, where V denotes the set of variables and E represents the cause-effect relationships among these variables. Fig. 1(a) shows an example of a causal graph with four variables. In a causal graph, capital letters (e.g., K and X) denote random variables and lowercase letters (e.g., k, x) denote their observed values, respectively. An edge denotes a cause-effect relationship between the parent (cause) and the child (effect), e.g., K → Y.

[Figure 1: (a) The causal graph of text summarization reflects the causal relationships among the fact C, the source document X, the language prior K, and the model-generated summary Y. (b) According to Eq. (6), the causal effect of X on Y can be obtained by subtracting the effect of K on Y from the total effect.]

Here, we build the causal graph of abstractive text summarization as illustrated in Fig. 1(a). It reflects the causal relationships among the fact C, the source document X, the language prior K, and the generated summary Y. The paths C → X and K → X represent that the source document X (e.g., an informative and fluent news report) is composed from the fact C (e.g., the event that happened) and the language prior knowledge K. The paths K → Y and X → Y reflect the causal relationships in the generation process of a summary Y, which can be interpreted as follows: the encoder of a Seq2Seq model comprehends the input document X, and then the decoder transforms the hidden representation output by the encoder into a summary with the help of the language prior knowledge K (e.g., the usage of demonstratives and prepositions, the logical relationships intra- and inter-sentence, etc.).

The causal graph in Fig. 1(a) provides an insight for measuring factual consistency in text summarization. An abstractive summarization model demands both the information of the source document (i.e., the causal effect shown by X → Y) and the language prior (i.e., the causal effect shown by K → Y) to generate the summary. The information provided by the source document X enables the summarization model to generate an informative and relevant summary. The language prior brings benefits such as grammar rules; on the other hand, it can also hallucinate inconsistent tokens by introducing spurious linguistic correlations or even biased knowledge learned from the training corpus (Hermann et al., 2015; Niu et al., 2020).
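For concreteness, the causal graph of Fig. 1(a) can be written down as a small directed acyclic graph. The sketch below is an illustration of ours (not part of the released code) that encodes the four variables and their cause-effect edges with networkx.

```python
# A minimal sketch (ours, not from the paper's code) of the causal graph in
# Fig. 1(a): nodes are the fact C, source document X, language prior K, and
# generated summary Y; edges point from cause to effect.
import networkx as nx

causal_graph = nx.DiGraph()
causal_graph.add_edges_from([
    ("C", "X"),  # the fact composes the source document
    ("K", "X"),  # the language prior also shapes the source document
    ("X", "Y"),  # the encoded document drives the generated summary
    ("K", "Y"),  # the decoder's language prior also drives the summary
])

assert nx.is_directed_acyclic_graph(causal_graph)
print(sorted(causal_graph.predecessors("Y")))  # ['K', 'X']: the two direct causes of Y
```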
3.2 Causal Effect

Inspired by the intrinsic cause of factual inconsistency in text summarization discussed above, we propose a novel automatic evaluation metric, named Counterfactual Consistency (denoted as CoCo), to measure factual consistency via counterfactual estimation. To be specific, we aim to estimate the causal effect of X → Y to measure the factual consistency of the generated summary, since summaries generated relying on the source document are more likely to be factually consistent w.r.t. the source document than those generated relying on the language prior.

To achieve this, we first need to use counterfactual notations to translate causal assumptions from graphs to formulas. For the causal graph in Fig. 1(a), given that C is set to c and K is set to k, the value that summary Y would take is denoted as:

Y_{c,k} = Y(do(C = c), do(K = k)).   (1)

Since there is no confounder of C and K, do(C = c) equals C = c and do(K = k) equals K = k. Without loss of generality, we omit the do operator for simplicity. Thus, Eq. (1) can be simplified as:

Y_{c,k} = Y(C = c, K = k).   (2)

Similarly, the counterfactual notation of the source document X can be written as X_{c,k} = X(C = c, K = k).

As shown in Fig. 1(a), there exist two paths directly connected to Y, i.e., K → Y and X → Y. Thus, Y_{c,k} can be rewritten as a function of K and X:

Y_{c,k} = Y_{x,k} = Y(X = x, K = k).   (3)

In the factual scenario, we have x = X_{c,k} = X(C = c, K = k), while in the counterfactual scenario, k can be set to different values. For example, Y_{x,k*} describes the situation where K is set to k*. The term "*" denotes the no-treatment condition, which can be interpreted as eliminating the causal effect by setting the variable to an empty value. Note that such a case can only happen in the counterfactual scenario.

To measure causal effects, we need to compare two potential outcomes of the same individual given two different treatments, e.g., comparing the outcomes of taking a drug (i.e., treatment) and not taking the drug (i.e., no-treatment) to estimate the effect of the drug on a disease. Here, to estimate the causal effect of the input source document X on the summary Y, we aim to compare the summaries obtained by feeding document x to the summarization model (i.e., X = x = X(c, k)) and by not giving the document (i.e., the no-treatment condition X = x* = X(c*, k*)). As shown in Fig. 1(a), the outcome Y is also affected by the language prior knowledge K, so we should take both X and K into consideration when estimating the effect of X on Y. However, estimating the effect of X on Y via Y_{x,k*} − Y_{x*,k*} is impractical, since it is hard to block the effect of K (i.e., to set K = k*) for an abstractive summarization system.

To address this issue, we propose to estimate the causal effect of X on Y by subtracting the causal effect of K on Y from the total effect. The total effect of treatment X = x and K = k on Y compares the hypothetical situations (X = x, K = k) and (X = x*, K = k*), which is denoted as:

E_total = Y_{x,k} − Y_{x*,k*},   (4)

which represents the difference between the outcome of taking the treatment (i.e., Y_{x,k}) and that of the no-treatment condition (i.e., Y_{x*,k*}).

For the causal effect of K on Y, we propose counterfactual abstractive summarization to estimate it by blocking the effect of X. Counterfactual abstractive summarization describes the scenario where K = k and X had been x*. Since the response of X is blocked, the model can only rely on the given language prior k to generate summaries. Thus, the causal effect of K on Y can be obtained by comparing counterfactual abstractive summarization to the no-treatment condition:

E_K = Y_{x*,k} − Y_{x*,k*},   (5)

where E_K denotes the change in the outcome Y when K changes from k* to k while X is set to the value it would have obtained at X = x*.

Thus, the causal effect of X on Y, denoted as E_X, can be obtained by subtracting E_K from E_total:

E_X = E_total − E_K = Y_{x,k} − Y_{x*,k}.   (6)

It can be observed that Eq. (6) is equivalent to the comparison between the generated summaries of conventional abstractive summarization and counterfactual abstractive summarization, as illustrated in Fig. 1(b), which implies the approach to measure factual consistency in text summarization. The term Y_{x,k} corresponds to a standard abstractive summarization model that takes x as input and outputs y with the help of the language prior k learned from the training corpus, while the term Y_{x*,k} describes a model that generates y depending only on the language prior k; in the normal case it works like a language model.
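Spelling out the subtraction in Eq. (6) with Eqs. (4) and (5) makes the cancellation of the shared no-treatment term explicit:

```latex
\begin{align*}
E_X &= E_{\text{total}} - E_K \\
    &= \big(Y_{x,k} - Y_{x^*,k^*}\big) - \big(Y_{x^*,k} - Y_{x^*,k^*}\big) \\
    &= Y_{x,k} - Y_{x^*,k}.
\end{align*}
```

Estimating E_X therefore only requires scoring the summary twice, once with the source document given and once with it withheld or masked, and never requires the hard-to-realize condition K = k*.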
3.3 CoCo Metric

There exist several ways to implement Eq. (6) by applying different functions to approximate Y_{x,k} − Y_{x*,k}, such as lexical overlap and semantic similarity between the generated summaries. For text summarization, given a source document X, the outputs of the model are probability distributions Pr(·|X) over the vocabulary, from which a series of tokens are sampled to compose a summary. Thus, from another point of view, functions that can be applied to the probability distribution are also suitable to approximate Y_{x,k} − Y_{x*,k}, such as perplexity and uncertainty (Xu et al., 2020; Xiao and Wang, 2021). In this study, taking both effectiveness and convenience into consideration, we adopt the probabilities of the tokens of the evaluated summary, i.e., Pr(y_t) for all y_t ∈ Y, to implement Eq. (6) as our automatic evaluation metric CoCo for factual consistency. Besides, we adopt an independent summarization model as the scoring model in the instantiation of CoCo, rather than the model that generates the evaluated summary, considering that the factual consistency score could be biased by the model that produced the evaluated summary. By adopting an independent summarization model as the scoring model, CoCo can be applied to score any given summary based on its corresponding source document, regardless of how the evaluated summary is generated.

Sentence-level mask
  Source document X: People with a DNA variation in a gene called PDSS2 tend to drink fewer cups of coffee, a study carried out at the University of Edinburgh has found. It suggests the gene reduces cell ability to break down caffeine. . .
  Summary Y: Researchers have identified a gene that appears to curb coffee consumption.

Span-level mask
  Source document X: People with a DNA variation in a gene called PDSS2 tend to drink fewer cups of coffee, a study carried out at the University of Edinburgh has found. It suggests the gene reduces cell ability to break down caffeine. . .
  Summary Y: Researchers have identified a gene that appears to curb coffee consumption.

Table 1: Two examples showing the mask operation on the source document X according to summary Y. The contents relevant to the key tokens coffee and gene (highlighted with a green background in the original figure) are masked with different strategies for demonstration.

Algorithm 1 CoCo metric
Input: Source document X, model-generated summary Y, scoring model F
Output: Factual consistency score of Y w.r.t. X (i.e., the CoCo value)
1: Select key tokens Y′ from Y;
2: Mask the source document X according to Y′ to produce X′;
3: Feed X and X′ into the scoring model F respectively to generate the probability of each token in Y′, i.e., Pr(y_i | X, y_{<i}) and Pr(y_i | X′, y_{<i});
4: Compute the CoCo value from the differences between Pr(y_i | X, y_{<i}) and Pr(y_i | X′, y_{<i}) over the key tokens in Y′.
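To make the procedure concrete, the following sketch instantiates Algorithm 1 with a finetuned BART from huggingface Transformers as the scoring model F and spaCy for key-token selection. It is a minimal illustration under our own simplifying assumptions: the helper names token_probs, mask_document, and coco_score, the choice of facebook/bart-large-cnn, the part-of-speech set used for key tokens, the sentence-level masking heuristic, and the averaging used to aggregate the probability differences are ours, not the released implementation.

```python
# Minimal sketch of Algorithm 1 (sentence-level mask), assuming a seq2seq
# scoring model from huggingface Transformers; not the paper's released code.
import torch
import spacy
from transformers import BartTokenizer, BartForConditionalGeneration

nlp = spacy.load("en_core_web_sm")
tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

KEY_POS = {"NOUN", "PROPN", "VERB", "NUM", "ADJ"}  # assumed definition of "key tokens"

def token_probs(document: str, summary_ids: torch.Tensor) -> torch.Tensor:
    """Teacher-forced probability of every summary token given the document."""
    enc = tok(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(input_ids=enc.input_ids,
                       attention_mask=enc.attention_mask,
                       labels=summary_ids).logits              # (1, |Y|, vocab)
    probs = logits.softmax(dim=-1)
    # probability assigned to the evaluated token at each decoding step
    return probs[0].gather(1, summary_ids[0].unsqueeze(-1)).squeeze(-1)

def mask_document(document: str, summary: str) -> str:
    """Sentence-level mask X': mask source sentences that mention a key word of Y."""
    key_words = {t.lemma_.lower() for t in nlp(summary) if t.pos_ in KEY_POS}
    sentences = [("<mask>" if any(t.lemma_.lower() in key_words for t in s) else s.text)
                 for s in nlp(document).sents]
    return " ".join(sentences)

def coco_score(document: str, summary: str) -> float:
    summary_ids = tok(summary, return_tensors="pt", truncation=True).input_ids
    p_full = token_probs(document, summary_ids)                           # Pr(y_i | X,  y_<i)
    p_masked = token_probs(mask_document(document, summary), summary_ids)  # Pr(y_i | X', y_<i)
    # Keep only key tokens of the summary; subword pieces are matched crudely by surface form.
    key_words = {t.text.lower() for t in nlp(summary) if t.pos_ in KEY_POS}
    pieces = tok.convert_ids_to_tokens(summary_ids[0])
    is_key = torch.tensor([p.lstrip("Ġ").lower() in key_words for p in pieces])
    if not is_key.any():
        return 0.0
    return (p_full - p_masked)[is_key].mean().item()
```

Usage is a single call such as coco_score(source_document, generated_summary), which returns a scalar; higher values indicate that the key tokens of the summary depend more on the source document than on the language prior.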
4 Experiments

4.1 Datasets

To evaluate the effectiveness of the proposed metric, we conduct experiments on three public abstractive text summarization datasets.

QAGS-CNN/DM. It is a subset of the CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016) released by Wang et al. (2020). This dataset contains 235 instances collected from the test set of CNN/Daily Mail, and each instance consists of a source document, a reference summary, and a model-generated summary produced by a bottom-up abstractive summarization model (Gehrmann et al., 2018). The generated summaries are assigned human annotations that indicate their factual consistency scores.

QAGS-XSUM. This dataset is derived from XSUM (Narayan et al., 2018) and contains 239 source documents, the corresponding reference summaries, and synthetic summaries generated by BART (Lewis et al., 2020). Wang et al. (2020) collect human evaluation scores on factual consistency for each generated summary via ParlAI (Miller et al., 2017).

SUMMEVAL. It is released by Fabbri et al. (2021) and contains the summaries generated by 16 abstractive and extractive models on 100 test documents from CNN/Daily Mail. We adopt 1200 abstractive text summaries and extract 3600 public expert annotations on factual consistency for them.

4.2 Baselines

We adopt widely used automatic evaluation metrics for comparison, such as ROUGE (Lin, 2004), BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007). For ROUGE, we adopt ROUGE-1, ROUGE-2 and ROUGE-L, whose overlapping units are uni-grams, bi-grams and the longest common subsequence, respectively.

Besides, BERTScore (Zhang et al., 2020b), FFCI (Koto et al., 2020), QAGS (Wang et al., 2020), and QuestEval (Scialom et al., 2021) are also adopted as baselines. We use the outputs from the 17-th layer of roberta-large to implement BERTScore, as suggested by Zhang et al. (2020b). For FFCI, a framework that can be instantiated with different basic metrics, we use FFCI_ROUGE-1, FFCI_ROUGE-2, FFCI_ROUGE-L, and FFCI_BERTScore to distinguish the different basic metrics, as the original paper suggests. For the QA-based metrics QAGS and QuestEval, following the settings in the original papers, the QG model and QA model in QAGS are implemented based on BART and BERT respectively, while for QuestEval both are implemented based on T5 (Raffel et al., 2020). We adopt QuestEval_precision, QuestEval_recall, and QuestEval_F1 for calculating the precision, recall, and F1 score of the answers given the source documents and generated summaries.

For the proposed metric CoCo, we investigate different mask strategies in our study. We use CoCo_token, CoCo_span, CoCo_sent, and CoCo_doc to denote the token-level, span-level (i.e., masking five successive tokens centered on the key token), sentence-level, and document-level (i.e., masking the whole document) mask strategies used when calculating the causal effect of the source document on the generated summary.

4.3 Implementation

We adopt BART (Lewis et al., 2020) as the scoring model for the proposed metric CoCo. To be specific, we feed the whole source document or the masked source document into the encoder of BART and apply teacher forcing (Bengio et al., 2015) during decoding. At the t-th decoding step, we take the output probability of the t-th token of the evaluated summary Y, i.e., Pr(y_t), as one of the factual consistency scores for the evaluated summary if y_t is a key token recognized by the part-of-speech tagging toolkit spaCy (https://spacy.io/); otherwise, we discard it.

In our study, the BART model is finetuned on the training set of CNN/Daily Mail for the experiments conducted on QAGS-CNN/DM and SUMMEVAL, and finetuned on the training set of XSUM for QAGS-XSUM. The baseline models are implemented based on huggingface (Wolf et al., 2020). More details of the implementation can be found in the Appendix. All models are implemented using PyTorch (Paszke et al., 2019) and run on GeForce RTX 2080 Ti GPUs.
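As an illustration of the mask strategies compared in Section 4.2, the sketch below implements the span-level variant, which masks five successive source tokens centered on each occurrence of a key token. The function name and the use of whitespace tokenization are our own simplifications, not the released implementation.

```python
# Span-level mask (sketch, ours): for every source token that matches a key token
# of the summary, mask a window of five successive tokens centered on it.
def span_mask(document: str, key_tokens: set, mask_token: str = "<mask>", width: int = 5) -> str:
    words = document.split()                      # whitespace tokenization for illustration
    to_mask = set()
    half = width // 2
    for i, w in enumerate(words):
        if w.lower().strip(".,;:!?\"'") in key_tokens:
            to_mask.update(range(max(0, i - half), min(len(words), i + half + 1)))
    return " ".join(mask_token if i in to_mask else w for i, w in enumerate(words))

# Example with key tokens taken from the summary in Table 1.
masked = span_mask("It suggests the gene reduces cell ability to break down caffeine.",
                   key_tokens={"gene", "coffee"})
print(masked)  # "It <mask> <mask> <mask> <mask> <mask> ability to break down caffeine."
```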
4.4 Comparison Results

We report the Pearson correlation coefficient (r) and Spearman correlation coefficient (ρ) between various automatic evaluation metrics and human judgments on factual consistency in Table 2 (results are shown as percentages). A larger value represents a better positive correlation with human judgments, which demonstrates the effectiveness of the automatic evaluation metric on factual consistency for text summarization.

                        QAGS-CNN/DM       QAGS-XSUM         SUMMEVAL
Metric                  r       ρ         r       ρ         r       ρ
ROUGE-1                 29.01   23.87     13.12   12.82     20.23   17.72
ROUGE-2                 17.91   18.78     8.66    9.96      16.72   16.72
ROUGE-L                 23.43   24.09     8.36    10.66     19.20   18.16
METEOR                  25.65   24.56     10.78   11.59     16.91   14.38
BLEU                    17.63   22.85     2.55    2.55      10.83   10.73
BERTScore               37.41   36.36     11.25   13.23     18.58   17.53
FFCI_ROUGE-1            43.55   42.00     13.70   19.26     36.46   33.86
FFCI_ROUGE-2            45.01   43.62     18.96   18.63     37.95   35.40
FFCI_ROUGE-L            43.11   41.32     16.67   17.54     38.02   34.45
FFCI_BERTScore          48.47   48.62     20.04   19.04     28.54   30.76
QAGS                    31.39   35.96     15.18   17.48     17.71   12.65
QuestEval_precision     38.02   35.96     5.66    7.43      33.53   29.52
QuestEval_recall        41.10   36.40     6.57    7.33      26.95   25.69
QuestEval_F1            49.18   44.53     7.03    9.63      36.96   33.93
CoCo_token (ours)       55.79   49.33     19.02   19.83     42.67   40.09
CoCo_span (ours)        57.28   50.14     18.71   18.65     43.57   40.96
CoCo_sent (ours)        58.84   52.25     24.08   22.70     42.04   39.03
CoCo_doc (ours)         55.27   49.54     19.57   18.21     41.18   39.61

Table 2: Pearson correlation (denoted as r) and Spearman correlation (denoted as ρ) between automatic metrics and human judgments of factual consistency on text summarization datasets. In the original paper, the bold scores are the best among all the metrics, while the underlined scores are the best among the baseline metrics.

From the table, we can observe that the n-gram based metrics, including ROUGE, METEOR, and BLEU, achieve worse results on the three datasets compared with most of the other baselines. These results are consistent with the observations in previous studies (Bhandari et al., 2020; Maynez et al., 2020). Although convenient to adopt, n-gram based metrics correlate poorly with human judgments on factual consistency, since they only consider the lexical overlap between the generated summary and the reference and fail to capture semantic similarity.

Compared with n-gram based metrics, BERTScore does not achieve consistent improvements on the three datasets (better on QAGS-CNN/DM but slightly worse on QAGS-XSUM and SUMMEVAL). The comparisons between FFCI and its corresponding basic metric (e.g., FFCI_BERTScore vs. BERTScore) show the advantage of calculating the lexical overlap or semantic similarity between the generated summary and the source document rather than the reference when measuring factual consistency. However, since FFCI adopts a sentence-level evaluation manner, its cost can be nearly m × n times that of other metrics when a document and a generated summary contain m and n sentences, respectively.

On the other hand, the QA-based metrics QAGS and QuestEval introduce auxiliary tasks, including question generation (QG) and question answering (QA), to evaluate factual consistency. Both of them achieve competitive results compared with other baselines. However, QA-based metrics rely heavily on well-trained QA and QG models, which makes them inconvenient, or even unavailable for languages without a sufficient corpus for pretraining. Meanwhile, it is worth pointing out that these metrics are computationally expensive (Koto et al., 2020) and prone to error propagation and accumulation because of their pipeline manner.

The proposed metric CoCo outperforms the baselines by a noticeable margin on the three adopted datasets. For instance, the Pearson correlation coefficient between CoCo_sent and human judgments on QAGS-CNN/DM is 58.84, which is significantly larger than the best result among the baselines (49.18 for QuestEval_F1) and roughly twice the best result among the n-gram based metrics (29.01 for ROUGE-1). These results demonstrate the effectiveness of CoCo in improving the correlation with human judgments.

In the bottom subgroup of Table 2, we report the comparison results among the different mask strategies applied in CoCo. From these results, we can see that the span-level and sentence-level mask strategies are better than the token-level and document-level ones, which implies that a suitable mask range for relevant content is important for measuring factual consistency. A too-small mask range may leave the decoder able to infer the masked tokens from the context, while a too-large mask range may weaken the effect of the language prior and lead to near-zero scores for all the tokens in the evaluated summary.
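For completeness, the correlation coefficients in Table 2 can be reproduced from per-summary metric scores and the corresponding human judgments with scipy; the score lists below are placeholders, not values from the paper.

```python
# Correlation between a metric's scores and human judgments (sketch).
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.61, 0.02, 0.35, 0.48]   # e.g., CoCo values, one per evaluated summary
human_judgments = [0.9, 0.1, 0.5, 0.7]     # human factual-consistency annotations, aligned

r, _ = pearsonr(metric_scores, human_judgments)
rho, _ = spearmanr(metric_scores, human_judgments)
print(f"Pearson r = {100 * r:.2f}, Spearman rho = {100 * rho:.2f}")  # reported as percentages
```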
4.5 Further Discussions

In this section, we provide some discussions about the proposed CoCo metric to better understand how it works to measure the factual consistency of model-generated summaries.

CoCo values w.r.t. various human judgments. For each adopted dataset, we split the instances into two halves according to the human judgments assigned to the model-generated summaries. We then calculate the statistics of the CoCo values for each half and illustrate the results in Fig. 2. It can be observed that summaries with high human judgments (i.e., in the top 50%) are also assigned high CoCo values, which demonstrates that CoCo values are consistent with human judgments on factual consistency. Thus the proposed metric CoCo is a reliable automatic evaluation metric for factual consistency in text summarization.

[Figure 2: Comparison of CoCo values assigned to summaries in the top 50% and the bottom 50% of human judgments on QAGS-CNN/DM, QAGS-XSUM, and SUMMEVAL.]

Different scoring models. We adopt four different scoring models to implement CoCo, including BART, BERT (Liu and Lapata, 2019), T5 (Raffel et al., 2020), and PEGASUS (Zhang et al., 2020a) (denoted as PEGA). The experiment results are shown in Table 3, which demonstrate that with each of the four adopted scoring models, CoCo achieves consistent and competitive results against the performance of the baselines reported in Table 2.

Datasets         BART     BERT     T5       PEGA
QAGS-CNN/DM      58.84    53.27    56.13    58.96
QAGS-XSUM        24.08    17.88    21.66    19.83
SUMMEVAL         42.04    39.36    42.98    43.38

Table 3: Pearson correlation r between CoCo and human judgments. CoCo is implemented with the different scoring models shown in the header, using the sentence-level mask strategy.

Case study. To better understand how CoCo works, a case study is illustrated in Table 4. We take three tokens, American, Augusta, and Monday, as examples and show their CoCo values. The token American is factually inconsistent w.r.t. the source document, since there is no context in the source document from which to infer that Tiger Woods is an American. Augusta and Monday are factually consistent, as they can be directly explained by the source document. We can observe that both Augusta and Monday are assigned high CoCo values (0.4115 and 0.6346, respectively), while that of American is significantly lower (0.0251). Such differences arise because both Augusta and Monday are generated relying more on the source document, whereas American is more likely to be hallucinated by the decoder with the help of the language prior knowledge from the training corpus. The proposed metric CoCo assigns such hallucinations low scores via counterfactual estimation to measure the factual consistency of the generated summary.

Source document: Tiger Woods declared himself ready to compete for a fifth Masters title after completing 11 holes of practice at Augusta National on Monday. . .
Summary with CoCo values: The American(0.0251) completed 11 holes of practice at Augusta(0.4115) on Monday(0.6346).

Table 4: Case study. The factually inconsistent token American is assigned a low CoCo value since it is more likely hallucinated from Tiger Woods, while both Augusta and Monday are assigned high CoCo values as they are generated relying more on the source document rather than the language prior.

5 Conclusions

In this paper, we introduce CoCo, an effective and convenient metric for measuring the factual consistency of abstractive summarization models. Inspired by the intrinsic cause of factual inconsistency in text summarization, CoCo evaluates factual consistency via counterfactual estimation without relying on other extrinsic tasks. Experiments on human-annotated datasets verify the advantages of our proposed metric in terms of the correlation with human judgments. In the future, several directions can be explored. First, although this paper focuses on the evaluation metric, the proposed idea can be incorporated into abstractive summarization models to enhance factual consistency. Another interesting direction is to apply this metric to other conditional text generation tasks such as image captioning and data-to-text generation.
References

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of NIPS, pages 1171–1179.

Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, and Pengfei Liu. 2020. Metrics also disagree in the low scoring range: Revisiting summarization evaluation metrics. In Proceedings of COLING, pages 5702–5711.

Ping Chen, Fei Wu, Tong Wang, and Wei Ding. 2018. A semantic QA-based approach for text summarization evaluation. In Proceedings of AAAI, pages 4800–4807.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. CL, 36(3):411–441.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of ACL, pages 5055–5070.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of NAACL, pages 3938–3948.

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. TACL, 9:391–409.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of ACL, pages 2214–2220.

Katja Filippova. 2020. Controlled hallucinations: Learning to generate faithfully from noisy data. In Findings of EMNLP, pages 864–870.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of EMNLP, pages 4098–4109.

Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad Saleh. 2019. Assessing the factual accuracy of generated text. In Proceedings of KDD, pages 166–175.

Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. 2018. Women also snowboard: Overcoming bias in captioning models. In Proceedings of ECCV, pages 771–787.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of NIPS, pages 1693–1701.

Fajri Koto, Jey Han Lau, and Timothy Baldwin. 2020. FFCI: A framework for interpretable automatic evaluation of summarization. arXiv preprint arXiv:2011.13662.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of EMNLP, pages 9332–9346.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of WMT, pages 228–231.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of ACL, pages 7871–7880.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of EMNLP, pages 3730–3740.

Inderjeet Mani, David House, Gary Klein, Lynette Hirschman, Therese Firmin, and Beth Sundheim. 1999. The TIPSTER SUMMAC text summarization evaluation. In Proceedings of EACL, pages 77–85.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of ACL, pages 1906–1919.

Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of EMNLP, pages 79–84.

Anshuman Mishra, Dhruvesh Patel, Aparna Vijayakumar, Xiang Lorraine Li, Pavan Kapanipathi, and Kartik Talamadupula. 2021. Looking beyond sentence-level natural language inference for question answering and text summarization. In Proceedings of NAACL-HLT 2021, pages 1322–1336.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çaǧlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of SIGNLL, pages 280–290.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of EMNLP, pages 1797–1807.

Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. 2020. Counterfactual VQA: A cause-effect look at language bias. arXiv preprint arXiv:2006.04315.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, pages 48–53.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311–318.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, volume 32.

Judea Pearl. 2001. Direct and indirect effects. In Proceedings of UAI, pages 411–420.

Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67.

Thomas Scialom, Paul-Alexis Dray, Patrick Gallinari, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, and Alex Wang. 2021. QuestEval: Summarization asks for fact-based evaluation. arXiv preprint arXiv:2103.12693.

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of EMNLP, pages 3246–3256.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of ACL, pages 1073–1083.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of ACL, pages 5008–5020.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP, pages 38–45.

Yijun Xiao and William Yang Wang. 2021. On hallucination and predictive uncertainty in conditional language generation. arXiv preprint arXiv:2103.15025.

Jiacheng Xu, Shrey Desai, and Greg Durrett. 2020. Understanding neural abstractive summarization models via uncertainty. In Proceedings of EMNLP, pages 6275–6281.

Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao, and Aidong Zhang. 2021. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data, 15(5).

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020b. BERTScore: Evaluating text generation with BERT. In Proceedings of ICLR.

Yuhao Zhang, Derek Merck, Emily Tsai, Christopher D. Manning, and Curtis Langlotz. 2020c. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. In Proceedings of ACL, pages 5108–5120.
A Implementation Details

We introduce the implementation details of the baselines for reproduction, including:

N-gram based metrics. We adopt widely used open-source packages to implement the n-gram based metrics, including ROUGE (https://github.com/andersjo/pyrouge), BLEU (https://pypi.org/project/bleu/), and METEOR (https://github.com/salaniz/pycocoevalcap), and we use CoreNLP (https://github.com/stanfordnlp/CoreNLP) as the tokenizer.

BERTScore. It is implemented based on the source code released by Zhang et al. (2020b) (https://github.com/Tiiiger/bert_score). Following the original paper, we use the outputs from the 17-th layer of roberta-large. We also tried bert-large-nli as suggested by Koto et al. (2020), but it fails to perform better in our experiments.

FFCI. FFCI (Koto et al., 2020) is a framework that can be instantiated with different basic metrics. The basic metrics adopted in our study include ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore, whose implementation details have been introduced above. The hyperparameters are set as the original paper suggests.

QAGS. Following the settings in Wang et al. (2020), the question generation (QG) model and question answering (QA) model in QAGS are implemented based on BART and BERT respectively, using the source code provided by fairseq (Ott et al., 2019). For each summary, we extract 10 named entities via spaCy (https://spacy.io/) and generate 100 questions in total based on these named entities using the QG model, from which the 20 questions with the highest generation probabilities are selected. These questions are fed into the QA model, together with the corresponding source documents or the summaries, to generate answers for comparison.

QuestEval. For QuestEval (Scialom et al., 2019), both the QG model and the QA model are implemented based on T5 (Raffel et al., 2020) with the help of huggingface (Wolf et al., 2020), following the original paper. We adopt the suggested hyperparameter settings for QuestEval.

CoCo. We adopt four different scoring models to implement CoCo, including BART (Lewis et al., 2020), BERT (Liu and Lapata, 2019), T5 (Raffel et al., 2020), and PEGASUS (Zhang et al., 2020a). Our implementation is based on fairseq and huggingface.

B Dataset Availability

QAGS-CNN/DM & QAGS-XSUM. They are released by Wang et al. (2020) (https://github.com/W4ngatang/qags/tree/master/data) and annotated on Amazon Mechanical Turk (https://www.mturk.com/) via ParlAI (Miller et al., 2017). Please refer to the original paper for more details about the annotation protocol.

SUMMEVAL. It is released by Fabbri et al. (2021) (https://github.com/Yale-LILY/SummEval). The expert annotations are adopted in our study, as the paper suggests. For each source document in this dataset, there exists one original reference from the CNN/Daily Mail dataset and 10 additional crowdsourced reference summaries. We only use the original reference in our study.
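As an example of the baseline setup described above, BERTScore can be invoked through the bert_score package with the layer choice used in our comparison; the candidate and reference sentences here are placeholders for illustration only.

```python
# BERTScore baseline (sketch): roberta-large with 17-th layer outputs, as in Appendix A.
from bert_score import score

candidates = ["Researchers have identified a gene that appears to curb coffee consumption."]
references = ["A gene called PDSS2 is linked to drinking fewer cups of coffee."]  # placeholder reference

P, R, F1 = score(candidates, references,
                 model_type="roberta-large", num_layers=17,
                 lang="en", verbose=False)
print(F1.mean().item())
```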