Multi-Narrative Semantic Overlap Task: Evaluation and Benchmark
Naman Bansal, Mousumi Akter and Shubhra Kanti Karmaker Santu
BDI Lab, Auburn University
{nzb0040, mza0170, sks0086}@auburn.edu

arXiv:2201.05294v1 [cs.CL] 14 Jan 2022

Abstract

In this paper, we introduce an important yet relatively unexplored NLP task called Multi-Narrative Semantic Overlap (MNSO), which entails generating a Semantic Overlap of multiple alternate narratives. As no benchmark dataset is readily available for this task, we created one by crawling 2,925 narrative pairs from the web and then went through the tedious process of manually creating 411 different ground-truth semantic overlaps by engaging human annotators. As a way to evaluate this novel task, we first conducted a systematic study by borrowing the popular ROUGE metric from the text-summarization literature and discovered that ROUGE is not suitable for our task. Subsequently, we conducted further human annotations/validations to create 200 document-level and 1,518 sentence-level ground-truth labels, which helped us formulate a new precision-recall style evaluation metric, called SEM-F1 (semantic F1). Experimental results show that the proposed SEM-F1 metric yields higher correlation with human judgement as well as higher inter-rater agreement compared to the ROUGE metric.

1 Introduction

In this paper, we look deeper into the challenging yet relatively under-explored area of automated understanding of multiple alternative narratives. To be more specific, we formally introduce a new NLP task called Multi-Narrative Semantic Overlap (MNSO) and conduct the first systematic study of this task by creating a benchmark dataset as well as proposing a suitable evaluation metric for the task. MNSO essentially means the task of extracting/paraphrasing/summarizing the overlapping information from multiple alternative narratives coming from disparate sources. In terms of computational goal, we study the following research question: Given two distinct narratives N1 and N2 of some event e expressed in unstructured natural language format, how can we extract the overlapping information present in both N1 and N2?

Figure 1 shows a toy example of the MNSO task, where the TextOverlap[1] (∩O) operation is applied to two news articles. Both articles cover the same story related to the topic "abortion"; however, they report from different political perspectives, i.e., one from the left wing and the other from the right wing. For greater visibility, "Left" and "Right" wing reporting biases are represented by blue and red text respectively. Green text denotes the common information in both news articles. The goal of the TextOverlap (∩O) operation is to extract the overlapping information conveyed by the green text.

At first glance, the MNSO task may appear similar to the traditional multi-document summarization task, where the goal is to provide an overall summary of the (multiple) input documents; however, the difference is that for MNSO, the goal is to provide summarized content with an additional constraint, i.e., the commonality criteria. There is no current baseline method or existing dataset that exactly matches our task; more importantly, it is unclear which evaluation metric can properly evaluate this task. As a starting point, we frame MNSO as a constrained seq-to-seq task where the goal is to generate a natural language output which conveys the overlapping information present in multiple input text documents. However, the bigger challenges we need to address first are the following: 1) How can we evaluate this task? and 2) How would one create a benchmark dataset for this task? To address these challenges, we make the following contributions in this paper.

1. We formally introduce Multi-Narrative Semantic Overlap (MNSO) as a new NLP task and conduct the first systematic study by formulating it as a constrained summarization problem.

[1] We'll be using the terms TextOverlap operator and Semantic Overlap interchangeably throughout the paper.
Figure 1: A toy use-case for the Semantic Overlap Task (TextOverlap). A news story on the topic of abortion has been presented by two news media (left-wing and right-wing). "Green" text denotes the overlapping information from both news media, while "Blue" and "Red" text denote the respective biases of the left and right wing. A couple of real examples from the benchmark dataset are provided in the appendix.

2. We create and release the first benchmark dataset consisting of 2,925 alternative narrative pairs for facilitating research on the MNSO task. Also, we went through the tedious process of manually creating 411 different ground-truth semantic intersections and conducted further human annotations/validations to create 200 document-level and 1,518 sentence-level ground-truth labels to construct the dataset.

3. As a starting point, we experiment with ROUGE, a widely popular metric for evaluating text-summarization tasks, and demonstrate that ROUGE is NOT suitable for evaluation of the MNSO task.

4. We propose a new precision-recall style evaluation metric, SEM-F1 (semantic F1), for evaluating the MNSO task. Extensive experiments show that the new SEM-F1 improves the inter-rater agreement compared to the traditional ROUGE metric and also shows higher correlation with human judgments.

2 Related Works

The idea of semantic text overlap is not entirely new; (Karmaker Santu et al., 2018) imagined a hypothetical framework for performing comparative text analysis, where TextOverlap was one of the "hypothetical" operators along with TextDifference, but the technical details and exact implementation were left as future work. In our work, we only focus on TextOverlap.

As TextOverlap can be viewed as a multi-document summarization task with an additional commonality constraint, the text-summarization literature is the most relevant to our work. Over the years, many paradigms for document summarization have been explored (Zhong et al., 2019). The two most popular among them are extractive approaches (Cao et al., 2018; Narayan et al., 2018; Wu and Hu, 2018; Zhong et al., 2020) and abstractive approaches (Bae et al., 2019; Hsu et al., 2018; Liu et al., 2017; Nallapati et al., 2016). Some researchers have also tried combining extractive and abstractive approaches (Chen and Bansal, 2018; Hsu et al., 2018; Zhang et al., 2019).

Recently, encoder-decoder based neural models have become really popular for abstractive summarization (Rush et al., 2015; Chopra et al., 2016; Zhou et al., 2017; Paulus et al., 2017). It has become even more prevalent to train a general language model on a huge corpus of data and then transfer/fine-tune it for the summarization task (Radford et al., 2019; Devlin et al., 2019; Lewis et al., 2019; Xiao et al., 2020; Yan et al., 2020; Zhang et al., 2019; Raffel et al., 2019). Summary length control for abstractive summarization has also been studied (Kikuchi et al., 2016; Fan et al., 2017; Liu et al., 2018; Fevry and Phang, 2018; Schumann, 2018; Makino et al., 2019). In general, multi-document summarization (Goldstein et al., 2000; Yasunaga et al., 2017; Zhao et al., 2020; Ma et al., 2020; Meena et al., 2014) is more challenging than single-document summarization. However, the MNSO task is different from traditional multi-document summarization tasks in that the goal here is to summarize content with an overlap constraint, i.e., the output should only contain the common information from both input narratives.

Alternatively, one could aim to recover verb predicate-alignment structure (Roth and Frank, 2012; Xie et al., 2008; Wolfe et al., 2013) from a sentence and further use this structure to compute the overlapping information (Wang and Zhang, 2009; Shibata and Kurohashi, 2012).
Sentence Fusion is another related area, which aims to combine the information from two given sentences with some additional constraints (Barzilay et al., 1999; Marsi and Krahmer, 2005; Krahmer et al., 2008; Thadani and McKeown, 2011). A related but simpler task is to retrieve parallel sentences (Cardon and Grabar, 2019; Nie et al., 1999; Murdock and Croft, 2005) without performing an actual intersection. However, these approaches are more targeted towards individual sentences and do not directly translate to arbitrarily long documents. Thus, the MNSO task is still an open problem and there is no existing dataset, method or evaluation metric that has been systematically studied.

Along the evaluation dimension, ROUGE (Lin, 2004) is perhaps the most commonly used metric today for evaluating automated summarization techniques, due to its simplicity and automation. However, ROUGE has been criticized a lot for primarily relying on lexical overlap (Nenkova, 2006; Zhou et al., 2006; Cohan and Goharian, 2016) of n-grams. As of today, around 192 variants of ROUGE are available (Graham, 2015), including ROUGE with word embeddings (Ng and Abrecht, 2015) and synonyms (Ganesan, 2018), graph-based lexical measurement (ShafieiBavani et al., 2018), Vanilla ROUGE (Yang et al., 2018) and highlight-based ROUGE (Hardy et al., 2019). However, there has been no study yet on whether the ROUGE metric is appropriate for evaluating the Semantic Intersection task, which is one of the central goals of our work.

3 Motivation and Applications

Multiple alternative narratives appear frequently across many domains like education, health, military, security and privacy, etc. (detailed use-cases for each domain are provided in the appendix). Indeed, the MNSO/TextOverlap operation can be very useful to digest such multi-narratives (from various perspectives) at scale and speed and, consequently, enhance the following important tasks as well.

Information Retrieval/Search Engines: Given a query, one could summarize the common information (TextOverlap) from the top k documents fetched by a search engine and display it as additional information to the user.

Question Answering: Given a particular question, the system could aim to provide a more accurate answer based on multiple pieces of evidence from various source documents and generate the most common answer by applying TextOverlap.

Robust Translation: Suppose you have multiple translation models which translate a given document from language A to language B. One could further apply the TextOverlap operator on the translated documents and get a robust translation.

In general, the MNSO task could be employed in any setting where we have comparative text analysis.

4 Problem Formulation

What is Semantic Overlap? This is indeed a philosophical question and there is no single correct answer (various possible definitions are mentioned in appendix section A). To simplify notations, let us stick to having only two documents DA and DB as our input, since the task can easily be generalized to more documents by applying TextOverlap repeatedly (illustrated in the sketch below). Also, let us define the output as DO ← DA ∩O DB. A human would mostly express the output in the form of natural language, and this is why we frame the MNSO task as a constrained multi-seq-to-seq (text generation) task where the output text only contains information that is present in both input documents. We also argue that brevity (minimal repetition) is a desired property of Semantic Overlap and thus, we frame the MNSO task as a constrained summarization problem to ensure brevity. For example, if a particular piece of information or quote is repeated twice in both documents, we don't necessarily want it to be present in the target overlap summary two times. The output can either be an extractive summary, an abstractive summary, or a mixture of both, as per the use case.

This task is inspired by the set-theoretic intersection operator. However, unlike set-intersection, our TextOverlap does not have to be the maximal set. The aim is to summarize the overlapping information in an abstractive fashion. Additionally, Semantic Overlap should follow the commutative property, i.e., DA ∩O DB = DB ∩O DA.
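To make the repeated, pairwise application mentioned above concrete, here is a minimal sketch. The names text_overlap and multi_narrative_overlap are ours, and the word-overlap stub is only the simplistic "common words" definition from appendix A standing in for a real overlap generator; the paper itself frames ∩O as a constrained (abstractive) summarization model.

    from functools import reduce

    def text_overlap(doc_a: str, doc_b: str) -> str:
        """Toy TextOverlap (∩O): keep only words occurring in both inputs.

        This is just the simplistic 'common words' definition from
        appendix A used as a stand-in; the paper frames ∩O as a
        constrained (abstractive) summarization model instead.
        """
        common = set(doc_a.lower().split()) & set(doc_b.lower().split())
        return " ".join(w for w in doc_a.split() if w.lower() in common)

    def multi_narrative_overlap(narratives: list[str]) -> str:
        # More than two narratives are handled by applying the pairwise
        # operator repeatedly: ((D1 ∩O D2) ∩O D3) ∩O ...
        # If ∩O is commutative (a desired property of Semantic Overlap),
        # the input order should not change the conveyed information.
        return reduce(text_overlap, narratives)

    if __name__ == "__main__":
        print(multi_narrative_overlap([
            "The senate passed the bill on Tuesday",
            "On Tuesday the controversial bill passed the senate",
            "Critics say the bill passed by the senate is flawed",
        ]))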
5 The Benchmark Dataset

As mentioned in section 1, there is no existing dataset which we could readily use to evaluate the MNSO task[2]. To address this challenge, we crawled data from AllSides.com. AllSides is a third-party online news forum which exposes people to news and information from all sides of the political spectrum so that the general public can get an "unbiased" view of the world. To achieve this, AllSides displays each day's top news stories from news media widely known to be affiliated with different sides of the political spectrum, including "Left" (e.g., New York Times, NBC News) and "Right" (e.g., Townhall, Fox News) wing media. AllSides also provides their own factual description of the reading material, labeled as "Theme", so that readers can see the so-called "neutral" point-of-view. Table 1 gives an overview of the dataset created by crawling AllSides.com, which consists of news articles (from at least one "Left" and one "Right" wing media outlet) covering 2,925 events in total, each with a "theme-description" of at least 15 words. Given two narratives ("Left" and "Right"), we used the theme-description as a proxy for the ground-truth TextOverlap. We divided this dataset into testing data (described next) and training data (remaining samples); their statistics are provided in the appendix (table 13).

[2] Multi-document summarization datasets can not be utilized in this scenario as their reference summaries do not follow the semantic overlap constraint.

Table 1: Overview of the dataset scraped from AllSides

  Feature             Description
  theme               headlines by AllSides
  theme-description   news description by AllSides
  right/left head     right/left news headline
  right/left context  right/left news description

Human Annotations[3]: We decided to involve human volunteers to annotate our testing samples in order to create multiple human-written ground-truth semantic overlaps for each event narrative pair. This helped in creating a comprehensive testing benchmark for more rigorous evaluation. Specifically, we randomly sampled 150 narrative pairs (one from the "Left" wing and one from the "Right" wing) and then asked 3 (three) humans to write a natural language description which conveys the semantic overlap of the information present in both narratives describing each event.

[3] The dataset and manual annotations can be found in the supplementary folder.

After the first round of annotation, we immediately observed a discrepancy among the three annotators in terms of the real definition of "semantic overlap". For example, one annotator argued that the Semantic Overlap of two narratives is non-empty as long as there is an overlap along one of the 5W1H facets (Who, What, When, Where, Why and How), while another annotator argued that overlap in only one facet is not enough to decide whether there is indeed a semantic overlap. As an example, one of the annotators wrote only "Donald Trump" as the Semantic Overlap for a couple of cases where the narratives were substantially different, while others had those cases marked as "empty set".

To mitigate this issue, we only retained the narrative-pairs where at least two of the annotators wrote a minimum of 15 words as their ground-truth semantic overlap, with the hope that a human-written description will contain 15 words or more only in cases where there is indeed a "significant" overlap between the two original narratives. This filtering step gave us a test set with 137 samples where each sample had 4 ground-truth semantic overlaps, one from AllSides and three from human annotators.

6 Evaluating MNSO Task using ROUGE

As ROUGE (Lin, 2004) is the most popular metric used today for evaluating summarization techniques, we first conducted a case-study with ROUGE as the evaluation metric for the MNSO task.

6.1 Methods Used in the Case-Study

We experimented with multiple SoTA pre-trained abstractive summarization models as a proxy for Semantic-Overlap generators. These models are: 1) BART (Lewis et al., 2019), fine-tuned on the CNN and multi-english Wiki news datasets, 2) Pegasus (Zhang et al., 2019), fine-tuned on the CNN and Daily Mail dataset, and 3) T5 (Raffel et al., 2019), fine-tuned on the multi-english Wiki news dataset. As our primary goal is to construct a benchmark dataset for the MNSO task and establish an appropriate metric for evaluating this task, experimenting with only 3 abstractive summarization models is not a barrier to our work. Proposing a custom method fine-tuned for the Semantic-Overlap task is an orthogonal goal and we leave it as future work. Also, we'll use the phrases "summary" and "overlap-summary" interchangeably from here on. To generate the summary, we concatenate a narrative pair and feed it directly to the model.

For evaluation, we first evaluated the machine-generated overlap summaries for the 137 manually annotated testing samples using the ROUGE metric (Lin, 2004) and followed the procedure mentioned in the original paper to compute ROUGE-F1 scores with multiple reference summaries. More precisely, since we have 4 reference summaries, we get 4 precision-recall pairs which are used to compute the corresponding F1 scores. For each sample, we took the max of these 4 F1 scores and averaged them out across the test dataset. The ROUGE scores can be seen in table 11 in the appendix. A sketch of this evaluation procedure is shown below.
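The following is a minimal sketch of the evaluation loop just described, under some assumptions: a generic facebook/bart-large-cnn checkpoint from the transformers library stands in for the fine-tuned overlap generators used in the paper, the rouge-score package is used for ROUGE-F1, and the helper names are ours.

    # pip install transformers rouge-score
    from transformers import pipeline
    from rouge_score import rouge_scorer

    # Assumption: a generic BART-CNN checkpoint as a stand-in for the
    # fine-tuned overlap generators used in the paper.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    def generate_overlap(left_narrative: str, right_narrative: str) -> str:
        # The paper simply concatenates the narrative pair and feeds it to the model.
        joined = left_narrative + " " + right_narrative
        return summarizer(joined, truncation=True)[0]["summary_text"]

    def rouge_f1_multi_ref(prediction: str, references: list[str]) -> dict:
        # One F1 per reference; keep the max, as described in section 6.1.
        best = {}
        for ref in references:
            scores = scorer.score(ref, prediction)  # score(target, prediction)
            for name, s in scores.items():
                best[name] = max(best.get(name, 0.0), s.fmeasure)
        return best

    def corpus_rouge(samples) -> dict:
        # samples: iterable of (left, right, [ref1, ref2, ref3, ref4])
        totals, n = {}, 0
        for left, right, refs in samples:
            f1s = rouge_f1_multi_ref(generate_overlap(left, right), refs)
            for name, v in f1s.items():
                totals[name] = totals.get(name, 0.0) + v
            n += 1
        return {name: v / n for name, v in totals.items()}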
Table 2: Max (across 3 models) Pearson's correlation between the ROUGE-F1 scores corresponding to different annotators. Here Ii refers to the i-th annotator, where i ∈ {1, 2, 3, 4}, and the "Average" row represents the average correlation of the max values across annotators. Boldface values are statistically significant at p-value < 0.05. For 5 out of 6 annotator pairs, the correlation values are quite small (≤ 0.50), implying poor inter-rater agreement with regards to the ROUGE metric.

                 R1                    R2                    RL
            I1    I2    I3        I1    I2    I3        I1    I2    I3
  I2       0.62   —     —        0.65   —     —        0.69   —     —
  I3       0.30  0.38   —        0.27  0.37   —        0.27  0.44   —
  I4       0.17  0.34  0.34      0.14  0.33  0.21      0.18  0.35  0.33
  Average        0.36                  0.33                  0.38

6.2 Results and Findings

We computed Pearson's correlation coefficients between each pair of ROUGE-F1 scores obtained using all of the 4 reference overlap-summaries (3 human-written summaries and 1 AllSides theme description) to test the robustness of the ROUGE metric for evaluating the MNSO task. The corresponding correlations are shown in table 2. For each annotator pair, we report the maximum (across 3 models) correlation value. The average correlation value across annotators is 0.36, 0.33 and 0.38 for R1, R2 and RL respectively, suggesting that the ROUGE metric is not stable across multiple human-written overlap-summaries and thus unreliable. Indeed, only one out of the 6 different annotator pairs has a value greater than 0.50 for all the 3 ROUGE metrics (R1, R2, RL), which is problematic.

7 Can We Do Better than ROUGE?

Section 6 shows that the ROUGE metric is unstable across multiple reference overlap-summaries. Therefore, an immediate question is: Can we come up with a better metric than ROUGE? To investigate this question, we started by manually assessing the machine-generated overlap summaries to check whether humans agree among themselves or not.

7.1 Different Trials of Human Judgement

Assigning a Single Numeric Score: As an initial trial, we decided to first label 25 testing samples using two human annotators (we call them label-annotators L1 and L2). Both label-annotators read each of the 25 narrative pairs as well as the corresponding system-generated overlap-summary (generated by fine-tuned BART) and assigned a numeric score between 1-10 (inclusive). This number reflects their judgement/confidence about how accurately the system-generated summary captures the actual overlap of the two input narratives. Note that the reference overlap summaries were not included in this label-annotation process and the label-annotators judged the system-generated summary exclusively with respect to the input narratives. To quantify the agreement between human scores, we computed the Kendall rank correlation coefficient (or Kendall's Tau) between the two annotators' labels, since these are ordinal values. However, to our disappointment, the correlation value was 0.20 with a p-value of 0.22[4]. This shows that even human annotators are disagreeing among themselves and we need to come up with a better labelling guideline to reach a reasonable agreement among the human annotators.

[4] The higher p-value means that the correlation value is insignificant because of the small number of samples, but the aim is to first find a labelling criterion where humans can agree among themselves.

On further discussion among the annotators, we realized that one annotator only focused on the preciseness of the intersection summaries, whereas the other annotator took both precision and recall into consideration. Thus, we decided to next assign two separate scores for precision and recall.

Precision-Recall Inspired Double Scoring: This time, three label-annotators (L1, L2 and L3) assigned two numeric scores between 1-10 (inclusive) for the same set of 25 system-generated summaries. These numbers represented their belief about how precise the system-generated summaries were (the precision score) and how much of the actual ground-truth overlap-information was covered by the same (the recall score). Also note that labels were again assigned exclusively with respect to the input narratives. As the assigned numbers represent ordinal values, we again computed Kendall's Tau between the annotators' precision and recall scores; the results are shown in table 3.

Table 3: Kendall's rank correlation coefficients among the precision and recall scores for pairs of human annotators (25 samples). Li refers to the i-th label-annotator.

               Precision         Recall
             L1      L2        L1      L2
  L2        0.52     —        0.37     —
  L3        0.18    0.29      0.31    0.54
  Average       0.33              0.41
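For reference, the agreement statistics used in sections 6.2 and 7.1 (Pearson's correlation for metric scores, Kendall's Tau for ordinal human judgements) can be computed with scipy as sketched below; the score values shown are purely illustrative, not the paper's data.

    # pip install scipy
    from scipy.stats import pearsonr, kendalltau

    # Toy example values, purely illustrative (not the paper's data):
    # per-sample ROUGE-F1 scores of the same system outputs measured
    # against two different reference overlap-summaries ...
    rouge_f1_ref_a = [0.41, 0.28, 0.35, 0.50, 0.22]
    rouge_f1_ref_b = [0.30, 0.45, 0.31, 0.29, 0.40]

    # ... and 1-10 ordinal quality scores assigned by two label-annotators.
    scores_annotator_1 = [7, 4, 8, 5, 6]
    scores_annotator_2 = [6, 5, 3, 7, 6]

    # Section 6.2: Pearson correlation between metric scores computed
    # from different references (stability of the metric).
    r, r_pvalue = pearsonr(rouge_f1_ref_a, rouge_f1_ref_b)

    # Section 7.1: Kendall's Tau for ordinal human judgements
    # (inter-rater agreement).
    tau, tau_pvalue = kendalltau(scores_annotator_1, scores_annotator_2)

    print(f"Pearson r = {r:.2f} (p = {r_pvalue:.2f})")
    print(f"Kendall tau = {tau:.2f} (p = {tau_pvalue:.2f})")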
Table 4: Average precision and recall Kendall rank correlation coefficients between sentence-wise annotations for different annotators. Li refers to the i-th label-annotator. All values are statistically significant (p-value < 0.05).

               Precision         Recall
             L1      L2        L1      L2
  L2        0.68     —        0.75     —
  L3        0.59    0.64      0.69    0.71
  Average       0.64              0.72

Table 5: Reward function used to evaluate the labels assigned by two label-annotators (or labels inferred using the SEM-F1 metric and human-annotated labels) for a given sentence (association between annotator pairs).

                            Label from Annotator B
                              P      PP     A
  Label from         P       1      0.5    0
  Annotator A        PP      0.5    1      0
                     A       0      0      1
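A minimal sketch of how the reward function of table 5 can be applied to two sequences of sentence-level Presence (P), Partial Presence (PP) and Absence (A) labels; the function and variable names are ours and the example labels are illustrative.

    # Reward matrix from Table 5: identical labels earn 1, P vs. PP earns 0.5,
    # and any disagreement involving A earns 0.
    REWARD = {
        ("P", "P"): 1.0, ("PP", "PP"): 1.0, ("A", "A"): 1.0,
        ("P", "PP"): 0.5, ("PP", "P"): 0.5,
        ("P", "A"): 0.0, ("A", "P"): 0.0,
        ("PP", "A"): 0.0, ("A", "PP"): 0.0,
    }

    def average_reward(labels_a: list[str], labels_b: list[str]) -> float:
        """Mean sentence-wise reward between two label sequences."""
        assert len(labels_a) == len(labels_b), "one label per sentence expected"
        return sum(REWARD[(a, b)] for a, b in zip(labels_a, labels_b)) / len(labels_a)

    # Illustrative only: labels for five sentences of one overlap summary.
    print(average_reward(["P", "PP", "A", "P", "PP"],
                         ["P", "P", "A", "PP", "A"]))  # -> 0.6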
Table 6: Average precision and recall reward scores (mean ± std) between sentence-wise annotations for different annotators. Li refers to the i-th label-annotator.

               Precision                        Recall
             L1             L2               L1             L2
  L2      0.81 ± 0.26       —             0.85 ± 0.11       —
  L3      0.79 ± 0.26   0.70 ± 0.31       0.80 ± 0.16   0.77 ± 0.17
  Average        0.77                            0.81

8 Semantic-F1: The New Metric

Human evaluation is costly and time-consuming. Thus, one needs an automatic evaluation metric for large-scale experiments. But how can we devise an automated metric to perform the sentence-wise precision-recall style evaluation discussed in the previous section? To achieve this, we propose a new evaluation metric called SEM-F1. The details of our SEM-F1 metric are described in algorithm 1 and the respective notations are listed in table 7. The F1 score is computed as the harmonic mean of the precision (pV) and recall (rV) values. Algorithm 1 assumes only one reference summary but can be trivially extended to multiple references. As mentioned previously, in the case of multiple references, we concatenate them for the precision-score computation. Recall scores are computed individually for each reference summary and later, an average recall is computed across references.

The basic intuition behind SEM-F1 is to compute the sentence-wise similarity (e.g., cosine similarity using a sentence embedding model) to infer the semantic overlap/intersection between two sentences from both the precision and recall perspectives and then combine them into an F1 score.

Algorithm 1: Semantic-F1 Metric
 1: Given SG, SR, ME
 2: rawpV, rawrV ← CosineSim(SG, SR, ME)   ▷ sentence-wise precision and recall values
 3: pV ← Mean(rawpV)
 4: rV ← Mean(rawrV)
 5: f1 ← (2 * pV * rV) / (pV + rV)
 6: return (f1, pV, rV)

 1: procedure CosineSim(SG, SR, ME)
 2:   lG ← no. of sentences in SG
 3:   lR ← no. of sentences in SR
 4:   init: cosSs ← zeros[lG, lR]; i ← 0
 5:   for each sentence sG in SG do
 6:     EsG ← ME(sG); j ← 0
 7:     for each sentence sR in SR do
 8:       EsR ← ME(sR)
 9:       cosSs[i, j] ← Cos(EsG, EsR); j ← j + 1
10:     end for
11:     i ← i + 1
12:   end for
13:   x ← Row-wise-max(cosSs)
14:   y ← Column-wise-max(cosSs)
15:   return (x, y)
16: end procedure

Table 7: Table of notations for algorithm 1

  Notation         Description
  SG               Machine-generated summary
  SR               Reference summary
  T := (tl, tu)    Tuple of the lower and upper threshold values (between 0 and 1)
  ME               Sentence embedding model
  pV, rV           Precision and recall values for the (SG, SR) pair

8.1 Is SEM-F1 Reliable?

The SEM-F1 metric computes cosine similarity scores between sentence pairs from both the precision and recall perspectives. To see whether the SEM-F1 metric correlates with human judgement, we further converted the sentence-wise raw cosine scores into Presence (P), Partial Presence (PP) and Absence (A) labels using user-defined thresholds, as described in algorithm 2. This helped us directly compare the SEM-F1 inferred labels against the human-annotated labels.

As mentioned in section 8, we utilized state-of-the-art sentence embedding models to encode sentences from both the model-generated summaries and the human-written narrative intersections. To be more specific, we experimented with 3 sentence embedding models: paraphrase-distilroberta-base-v1 (P-v1) (Reimers and Gurevych, 2019), stsb-roberta-large (STSB) (Reimers and Gurevych, 2019) and universal-sentence-encoder (USE) (Cer et al., 2018). Along with the various embedding models, we also experimented with multiple threshold values used to predict the sentence-wise Presence (P), Partial Presence (PP) and Absence (A) labels, in order to report the sensitivity of the metric with respect to different thresholds. These thresholds are: (25, 75), (35, 65), (45, 75), (55, 65), (55, 75), (55, 80), (60, 80). For example, the threshold range (45, 75) means that if the similarity score < 45%, we infer the label "absent"; if the similarity score ≥ 75%, we infer the label "present"; otherwise, we infer the label "partial-present". A sketch of the SEM-F1 computation, including this thresholding step, is given below.
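Below is a minimal Python sketch of algorithm 1 together with the threshold-based P/PP/A labelling of algorithm 2 (appendix B), assuming the sentence-transformers implementation of the P-v1 embedding model and a naive sentence splitter; it is a sketch of the described procedure, not the authors' released implementation.

    # pip install sentence-transformers numpy
    import re
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # One of the embedding models listed above; stsb-roberta-large (or USE via
    # TensorFlow Hub) could be swapped in following the same pattern.
    encoder = SentenceTransformer("paraphrase-distilroberta-base-v1")

    def split_sentences(text: str) -> list[str]:
        # Naive splitter for the sketch; any proper sentence tokenizer works too.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    def sem_f1(generated: str, reference: str):
        """Algorithm 1: sentence-wise cosine similarities -> precision, recall, F1."""
        gen_emb = encoder.encode(split_sentences(generated))   # (lG, d)
        ref_emb = encoder.encode(split_sentences(reference))   # (lR, d)

        gen_norm = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
        ref_norm = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
        cos = gen_norm @ ref_norm.T                            # (lG, lR) cosine matrix

        raw_p = cos.max(axis=1)  # row-wise max: support for each generated sentence
        raw_r = cos.max(axis=0)  # column-wise max: coverage of each reference sentence
        p, r = float(raw_p.mean()), float(raw_r.mean())
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return f1, p, r, raw_p, raw_r

    def to_labels(raw_scores, t_low: float, t_high: float) -> list[str]:
        """Algorithm 2 (appendix B): map raw cosine scores to P / PP / A labels."""
        return ["P" if s >= t_high else "PP" if s >= t_low else "A" for s in raw_scores]

    f1, p, r, raw_p, raw_r = sem_f1(
        "The senate passed the bill. The vote was close.",
        "A closely contested vote saw the bill pass the senate.",
    )
    print(round(f1, 3), to_labels(raw_p, 0.45, 0.75))  # thresholds T = (45, 75)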
Next, we computed the average precision and recall rewards for 50 samples annotated by the label-annotators (Li) and the labels inferred by the SEM-F1 metric. For this, we repeat the procedure of Table 6, but this time comparing human labels against the 'SEM-F1 labels'. The corresponding results are shown in table 8. As we can notice, the average reward values are consistently high (≥ 0.50) for all 3 label-annotators (Li). Moreover, the reward values are consistent/stable across all 3 embedding models and threshold values, signifying that SEM-F1 is indeed robust across the various sentence embeddings and thresholds used.

Following the procedure in table 4, we also compute Kendall's Tau between the human label-annotators and the labels automatically inferred using SEM-F1. Our results in table 8 are consistent with the reward-based inter-rater agreement, and the correlation values are ≥ 0.50, with little variation across thresholds for both precision and recall.

Table 8: Machine-human agreement in terms of the reward function and Kendall correlation (shown as Reward/Kendall) between label-annotators (Li) and labels automatically inferred using SEM-F1 (averaged over the 3 label-annotators). The raw numbers for each annotator can be found in the appendix (table 12). Results are shown for the different embedding models (section 8.1) and multiple threshold levels T = (tl, tu). Both the Reward and Kendall values are consistent/stable across all the embedding models and threshold values.

  Embedding  Measure    (25,75)    (35,65)    (45,75)    (55,65)    (55,75)    (55,80)    (60,80)
  P-v1       Precision  0.75/0.57  0.80/0.63  0.76/0.59  0.80/0.63  0.78/0.60  0.74/0.60  0.73/0.58
  P-v1       Recall     0.66/0.54  0.76/0.64  0.73/0.66  0.72/0.64  0.69/0.63  0.65/0.64  0.61/0.60
  STSB       Precision  0.73/0.60  0.73/0.62  0.73/0.60  0.73/0.62  0.73/0.63  0.73/0.59  0.73/0.58
  STSB       Recall     0.63/0.55  0.64/0.63  0.63/0.60  0.65/0.61  0.65/0.61  0.63/0.61  0.64/0.59
  USE        Precision  0.76/0.64  0.76/0.66  0.78/0.64  0.78/0.64  0.79/0.63  0.78/0.62  0.79/0.65
  USE        Recall     0.63/0.53  0.66/0.60  0.67/0.58  0.68/0.61  0.67/0.62  0.64/0.62  0.65/0.61

8.2 SEM-F1 Scores for Random Baselines

Here, we present the actual SEM-F1 scores for the three models described in section 6.1, along with scores for two intuitive baselines, namely, 1) Random Overlap and 2) Random Annotation.

Random Overlap: For a given sample and model, we select a random overlap summary generated by the model for one of the other 136 test samples. These random overlaps are then evaluated against the 4 reference summaries using SEM-F1.

Random Annotation: For a given sample, we select a random reference summary from among the 4 references of one of the other 136 test samples. The model-generated summaries are then compared against these Random Annotations/References to compute the SEM-F1 scores reported in table 9.

As we can notice, there is an approximately 40-45 percent improvement over the baseline scores, suggesting that SEM-F1 can indeed distinguish good overlap summaries from bad ones.

Table 9: SEM-F1 scores of the three overlap generators along with the two random baselines.

              Random Annotation       Random Overlap          SEM-F1 Scores
              P-V1   STSB   USE       P-V1   STSB   USE       P-V1   STSB   USE
  BART        0.16   0.21   0.22      0.21   0.27   0.27      0.65   0.67   0.67
  T5          0.17   0.21   0.23      0.20   0.26   0.26      0.58   0.60   0.60
  Pegasus     0.15   0.20   0.22      0.19   0.26   0.26      0.59   0.60   0.62
  Average     0.16   0.21   0.22      0.20   0.26   0.26      0.61   0.62   0.63

8.3 Pearson Correlation for SEM-F1

Following the case-study based on ROUGE in section 6, we again compute the Pearson's correlation coefficients between each pair of raw SEM-F1 scores obtained using all of the 4 reference intersection-summaries. The corresponding correlations are shown in table 10. For each annotator pair, we report the maximum (across 3 models) correlation value. The average correlation value across annotators is 0.49, 0.49 and 0.54 for the P-V1, STSB and USE embeddings, respectively. This shows a clear improvement over the ROUGE metric, suggesting that SEM-F1 is more accurate than the ROUGE metric.

Table 10: Max (across 3 models) Pearson's correlation between the SEM-F1 scores corresponding to different annotators. Here Ii refers to the i-th annotator, where i ∈ {1, 2, 3, 4}, and the "Average" row represents the average correlation of the max values across annotators. All values are statistically significant at p-value < 0.05.

                 P-V1                  STSB                  USE
            I1    I2    I3        I1    I2    I3        I1    I2    I3
  I2       0.69   —     —        0.65   —     —        0.71   —     —
  I3       0.40  0.50   —        0.50  0.52   —        0.51  0.54   —
  I4       0.33  0.44  0.60      0.33  0.36  0.56      0.37  0.42  0.66
  Average        0.49                  0.49                  0.54

9 Conclusions

In this work, we proposed a new NLP task, called Multi-Narrative Semantic Overlap (MNSO), and created a benchmark dataset through meticulous human effort to initiate a new research direction. As a starting point, we framed the problem as a constrained summarization task and showed that ROUGE is not a reliable evaluation metric for this task. We further proposed a more accurate metric, called SEM-F1, for evaluating the MNSO task. Experiments show that SEM-F1 is more robust and yields higher agreement with human judgement.
References

Amit Alfassy, Leonid Karlinsky, Amit Aides, Joseph Shtok, Sivan Harary, Rogerio Feris, Raja Giryes, and Alex M Bronstein. 2019. LaSO: Label-set operations networks for multi-label few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6548-6557.

Sanghwan Bae, Taeuk Kim, Jihoon Kim, and Sang-goo Lee. 2019. Summary level training of sentence rewriting for abstractive summarization. arXiv preprint arXiv:1909.08752.

Regina Barzilay, Kathleen McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 550-557.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2018. Retrieve, rerank and rewrite: Soft template based neural summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 152-161, Melbourne, Australia. Association for Computational Linguistics.

Rémi Cardon and Natalia Grabar. 2019. Parallel sentence retrieval from comparable corpora for biomedical text simplification. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 168-177.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.

Sumit Chopra, Michael Auli, and Alexander M Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93-98.

Arman Cohan and Nazli Goharian. 2016. Revisiting summarization evaluation for scientific articles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexander R Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391-409.

Angela Fan, David Grangier, and Michael Auli. 2017. Controllable abstractive summarization. arXiv preprint arXiv:1711.05217.

Thibault Fevry and Jason Phang. 2018. Unsupervised sentence compression using denoising auto-encoders. arXiv preprint arXiv:1809.02669.

Kavita Ganesan. 2018. ROUGE 2.0: Updated and improved measures for evaluation of summarization tasks. CoRR, abs/1803.01937.

Jade Goldstein, Vibhu O Mittal, Jaime G Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. In NAACL-ANLP 2000 Workshop: Automatic Summarization.

Yvette Graham. 2015. Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, pages 128-137. The Association for Computational Linguistics.

Hardy, Shashi Narayan, and Andreas Vlachos. 2019. HighRES: Highlight-based reference-less evaluation of summarization. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL 2019), Florence, Italy, Volume 1: Long Papers, pages 3381-3392. Association for Computational Linguistics.

Donna Harman and Paul Over. 2004. The effects of human variation in DUC summarization evaluation. In Text Summarization Branches Out, pages 10-17, Barcelona, Spain. Association for Computational Linguistics.

Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. arXiv preprint arXiv:1805.06266.

Shubhra Kanti Karmaker Santu, Chase Geigle, Duncan Ferguson, William Cope, Mary Kalantzis, Duane Searsmith, and Chengxiang Zhai. 2018. SOFSAT: Towards a setlike operator based framework for semantic analysis of text. ACM SIGKDD Explorations Newsletter, 20(2):21-30.

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. arXiv preprint arXiv:1609.09552.

Emiel Krahmer, Erwin Marsi, and Paul van Pelt. 2008. Query-based sentence fusion is better defined and leads to more preferred results than generic sentence fusion. In Proceedings of ACL-08: HLT, Short Papers, pages 193-196.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81.

Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, and Hongyan Li. 2017. Generative adversarial network for abstractive text summarization. arXiv preprint arXiv:1711.09357.

Yizhu Liu, Zhiyi Luo, and Kenny Zhu. 2018. Controlling length in abstractive summarization using a convolutional neural network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4110-4119.

Congbo Ma, Wei Emma Zhang, Mingyu Guo, Hu Wang, and Quan Z Sheng. 2020. Multi-document summarization via deep learning techniques: A survey. arXiv preprint arXiv:2011.04843.

Takuya Makino, Tomoya Iwakura, Hiroya Takamura, and Manabu Okumura. 2019. Global optimization under length constraint for neural text summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1039-1048.

Erwin Marsi and Emiel Krahmer. 2005. Explorations in sentence fusion. In Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05).

Yogesh Kumar Meena, Ashish Jain, and Dinesh Gopalani. 2014. Survey on graph and cluster based approaches in multi-document text summarization. In International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2014), pages 1-5. IEEE.

Vanessa Murdock and W Bruce Croft. 2005. A translation model for sentence retrieval. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 684-691.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. arXiv preprint arXiv:1802.08636.

Ani Nenkova. 2006. Summarization evaluation for text and speech: issues and approaches. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing. ISCA.

Jun-Ping Ng and Viktoria Abrecht. 2015. Better summarization evaluation with word embeddings for ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, pages 1925-1930. The Association for Computational Linguistics.

Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. 1999. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74-81.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Dragomir Radev. 2000. A common theory of information fusion from multiple text sources step one: cross-document structure. In 1st SIGdial Workshop on Discourse and Dialogue, pages 74-83.

Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, and Ilya Sutskever. 2019. Better language models and their implications. OpenAI Blog. https://openai.com/blog/better-language-models.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Michael Roth and Anette Frank. 2012. Aligning predicate argument structures in monolingual comparable texts: A new corpus for a new task. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics and the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 218-227.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Raphael Schumann. 2018. Unsupervised abstractive sentence summarization using length controlled variational autoencoder. arXiv preprint arXiv:1809.05233.

Elaheh ShafieiBavani, Mohammad Ebrahimi, Raymond K. Wong, and Fang Chen. 2018. A graph-theoretic summary evaluation for ROUGE. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pages 762-767. Association for Computational Linguistics.

Tomohide Shibata and Sadao Kurohashi. 2012. Predicate-argument structure-based textual entailment recognition system exploiting wide-coverage lexical knowledge. ACM Transactions on Asian Language Information Processing (TALIP), 11(4):1-23.

Kapil Thadani and Kathleen McKeown. 2011. Towards strict sentence intersection: decoding and evaluation strategies. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, pages 43-53.

Rui Wang and Yi Zhang. 2009. Recognizing textual relatedness with predicate-argument structures. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 784-792.

Travis Wolfe, Benjamin Van Durme, Mark Dredze, Nicholas Andrews, Charley Beller, Chris Callison-Burch, Jay DeYoung, Justin Snyder, Jonathan Weese, Tan Xu, et al. 2013. PARMA: A predicate argument aligner. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 63-68.

Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. arXiv preprint arXiv:1804.07036.

Dongling Xiao, Han Zhang, Yukun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. arXiv preprint arXiv:2001.11314.

Lexing Xie, Hari Sundaram, and Murray Campbell. 2008. Event mining in multimedia streams. Proceedings of the IEEE, 96(4):623-647.

Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063.

An Yang, Kai Liu, Jing Liu, Yajuan Lyu, and Sujian Li. 2018. Adaptations of ROUGE and BLEU to better evaluate machine reading comprehension task. In Proceedings of the Workshop on Machine Reading for Question Answering@ACL 2018, Melbourne, Australia, pages 98-104. Association for Computational Linguistics.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. arXiv preprint arXiv:1706.06681.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777.

Jinming Zhao, Ming Liu, Longxiang Gao, Yuan Jin, Lan Du, He Zhao, He Zhang, and Gholamreza Haffari. 2020. SummPip: Unsupervised multi-document summarization with sentence graph compression. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1949-1952.

Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2019. Searching for effective neural extractive summarization: What works and what's next. arXiv preprint arXiv:1907.03491.

Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. arXiv preprint arXiv:2004.08795.

Liang Zhou, Chin-Yew Lin, Dragos Stefan Munteanu, and Eduard H. Hovy. 2006. ParaEval: Using paraphrases to evaluate summaries automatically. In Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. The Association for Computational Linguistics.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. arXiv preprint arXiv:1704.07073.
A Other definitions of Text Overlap

Below, we present a set of possible definitions of Semantic Overlap to encourage the readers to think more about other alternative definitions.

1. On a very simplistic level, one can think of Semantic Overlap as just the common words between the two input documents. One can also include their frequencies of occurrence in such a representation. More specifically, we can define Dovlp as a set of unordered pairs of words wi and their frequencies of common occurrence fi, i.e., Dovlp = {(wi, fi)}. We can further extend this approach such that Semantic Overlap is a set of common n-grams among the input documents. More specifically, Dovlp = {((w1, w2, ..., wn)i, fi)} such that the n-gram (w1, w2, ..., wn)i is present in both DA (with frequency fiA) and DB (with frequency fiB) and fi = min(fiA, fiB). A short code sketch of this definition is included below, after appendix C.

2. Another way to think of Semantic Overlap is to find the common topics among two documents, just like finding common object labels among two images (Alfassy et al., 2019), by computing the joint probability of their topic distributions. More specifically, Semantic Overlap can be defined by the following joint probability distribution: P(Ti | Dovlp) = P(Ti | DA) × P(Ti | DB). This representation is more semantic in nature as it can capture overlap in topics.

3. Alternatively, one can take the 5W1H approach (Xie et al., 2008), where a given narrative D can be represented in terms of unordered sets of six facets: the 5Ws (Who, What, When, Where and Why) and 1H (How). In this case, we can define Semantic Overlap as the common elements between the corresponding sets related to these 6 facets present in both narratives, i.e., Dovlp = {Si}, where Si is a set belonging to one of the six 5W1H facets. It is entirely possible that one of these Si's is an empty set (φ). The most challenging aspect of this approach is accurately inferring the 5W1H facets.

4. Another way could be to define a given document as a graph. Specifically, we can consider a document D as a directed graph G = (V, E), where V represents the vertices and E represents the edges. Thus, TextOverlap can be defined as the set of common vertices or edges or both. Specifically, Dovlp can be defined as the maximum common subgraph of both GA and GB, where GA and GB are the corresponding graphs for the documents DA and DB respectively. However, coming up with a graph structure G which can align with both documents DA and DB would itself be a challenge.

5. One can also define the TextOverlap operator (∩) between two documents based on historical context and prior knowledge. Given a knowledge base K, Dovlp = ∩(DA, DB | K) (Radev, 2000).

All the approaches defined above have their specific use-cases and challenges; however, from a human-centered point of view, they may not reflect how humans generate semantic overlaps. A human would mostly express the overlap in the form of natural language, and this is why we frame the TextOverlap operator as a constrained summarization problem such that the information in the output summary is present in both input documents.

B Threshold Algorithm

Algorithm 2: Threshold Function
 1: procedure Threshold(rawSs, T)
 2:   initialize Labels ← []
 3:   for each element e in rawSs do
 4:     if e ≥ tu% then
 5:       Labels.append(P)
 6:     else if tl% ≤ e < tu% then
 7:       Labels.append(PP)
 8:     else
 9:       Labels.append(A)
10:     end if
11:   end for
12:   return Labels
13: end procedure

C ROUGE Scores

Table 11: Average ROUGE-F1 scores for all the tested models across the test dataset. For a particular sample, we take the maximum value out of the 4 F1 scores corresponding to the 4 reference summaries.

  Model     R1      R2      RL
  BART      40.73   25.97   29.95
  T5        38.50   24.63   27.73
  Pegasus   46.36   29.12   37.41
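As a companion to definition 1 in appendix A (referenced there), here is a minimal sketch of the frequency-aware n-gram overlap Dovlp with fi = min(fiA, fiB), assuming simple whitespace tokenization; the helper names are ours.

    from collections import Counter

    def ngram_counts(text: str, n: int) -> Counter:
        """Whitespace-tokenized n-gram frequencies of a document."""
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def ngram_overlap(doc_a: str, doc_b: str, n: int = 1) -> Counter:
        """Definition 1 of appendix A: common n-grams with frequency
        f_i = min(f_i^A, f_i^B). Counter '&' takes element-wise minimums."""
        return ngram_counts(doc_a, n) & ngram_counts(doc_b, n)

    print(ngram_overlap("the bill passed the senate on tuesday",
                        "on tuesday the senate passed the bill", n=2))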
D Motivation and Applications

Multiple alternative narratives are frequent in a variety of domains, including education, the health sector, and privacy, as well as technical areas such as Information Retrieval/Search Engines, QA, Translation, etc. In general, the MNSO/TextIntersect operation can be highly effective in digesting such multi-narratives (from various perspectives) at scale and speed. Here are a few examples of use-cases.

Peer-Reviewing: TextIntersect can extract the sections of multiple peer-reviews of an article that agree with one another, which can assist in creating a meta-review quickly.

Security and Privacy: By mining overlapping clauses from various privacy policies, the TextIntersect operation may assist real-world consumers in swiftly undertaking a comparative study of different privacy policies, thus allowing them to make informed judgments when selecting between multiple alternative web-services.

Health Sector: TextIntersect can be applied to compare clinical notes in patient records to reveal changes in a patient's condition or to perform comparative analysis of patients with the same diagnosis/treatment. For example, TextIntersect can be applied to the clinical notes of two different patients who went through the same treatment to assess the effectiveness of the treatment.

Military Intelligence: If A and B are two intelligence reports related to a mission coming from two human agents, the TextIntersect operation can help verify the claims in each report with respect to the other; thus, TextIntersect can be used as an automated claim-verification tool.

Computational Social Science and Journalism: Assume that two news agencies (with different political biases) are reporting the same real-world event and their bias is somewhat reflected in the articles they write. If A and B are two such news articles, then the TextIntersect operation will likely surface the facts (common information) about the event.

Here are some of the use-cases of MNSO in various technical areas.

Information Retrieval/Search Engines: One could summarize the common information in the multiple results fetched by a search engine for a given query and show it in a separate box to the user. This would immensely help the user to quickly parse the information rather than going through each individual article. If they desire, they could further explore the specific articles for more details.

Question Answering: Again, one could parse the common information/answer from multiple documents pertinent to the given query/question.

Robust Translation: Suppose you have multiple translation models which translate a given document from language A to language B. One could further apply the TextOverlap operator on the translated documents and get a robust translation.

In general, the MNSO task could be employed in any setting where we have comparative text analysis.
Table 12 (a): Machine-human agreement in terms of the reward function. Average precision and recall reward (mean ± std) between label-annotators (Li) and labels automatically inferred using SEM-F1, shown for different embedding models (section 8.1) and multiple threshold levels T = (tl, tu). For all annotators Li (i ∈ {1, 2, 3}), the numbers are quite high (≥ 0.50), and the reward values are consistent/stable across the embedding models and threshold values.

  Embedding  Reward     Li   (25,75)      (35,65)      (45,75)      (55,65)      (55,75)      (55,80)      (60,80)
  P-v1       Precision  L1   0.73 ± 0.27  0.81 ± 0.25  0.77 ± 0.26  0.85 ± 0.23  0.80 ± 0.24  0.77 ± 0.24  0.77 ± 0.26
  P-v1       Precision  L2   0.72 ± 0.30  0.73 ± 0.29  0.73 ± 0.30  0.78 ± 0.27  0.79 ± 0.27  0.75 ± 0.26  0.73 ± 0.29
  P-v1       Precision  L3   0.81 ± 0.23  0.86 ± 0.21  0.79 ± 0.24  0.78 ± 0.28  0.74 ± 0.28  0.69 ± 0.28  0.69 ± 0.27
  P-v1       Recall     L1   0.66 ± 0.19  0.79 ± 0.16  0.75 ± 0.16  0.76 ± 0.18  0.71 ± 0.17  0.66 ± 0.17  0.61 ± 0.18
  P-v1       Recall     L2   0.67 ± 0.19  0.78 ± 0.16  0.76 ± 0.15  0.73 ± 0.19  0.72 ± 0.18  0.70 ± 0.18  0.65 ± 0.21
  P-v1       Recall     L3   0.66 ± 0.15  0.72 ± 0.17  0.68 ± 0.17  0.68 ± 0.22  0.64 ± 0.20  0.59 ± 0.19  0.57 ± 0.20
  STSB       Precision  L1   0.75 ± 0.29  0.75 ± 0.29  0.75 ± 0.29  0.75 ± 0.29  0.75 ± 0.29  0.75 ± 0.30  0.75 ± 0.23
  STSB       Precision  L2   0.63 ± 0.32  0.63 ± 0.31  0.63 ± 0.32  0.63 ± 0.31  0.63 ± 0.32  0.64 ± 0.32  0.64 ± 0.32
  STSB       Precision  L3   0.81 ± 0.23  0.82 ± 0.23  0.81 ± 0.23  0.82 ± 0.23  0.81 ± 0.23  0.81 ± 0.22  0.81 ± 0.22
  STSB       Recall     L1   0.66 ± 0.21  0.67 ± 0.21  0.66 ± 0.21  0.68 ± 0.21  0.67 ± 0.21  0.65 ± 0.21  0.66 ± 0.21
  STSB       Recall     L2   0.57 ± 0.20  0.58 ± 0.21  0.57 ± 0.20  0.59 ± 0.20  0.59 ± 0.20  0.58 ± 0.20  0.58 ± 0.21
  STSB       Recall     L3   0.67 ± 0.19  0.67 ± 0.20  0.67 ± 0.19  0.68 ± 0.20  0.68 ± 0.19  0.67 ± 0.18  0.68 ± 0.18
  USE        Precision  L1   0.76 ± 0.29  0.77 ± 0.30  0.78 ± 0.27  0.80 ± 0.28  0.80 ± 0.27  0.77 ± 0.27  0.80 ± 0.27
  USE        Precision  L2   0.69 ± 0.32  0.66 ± 0.32  0.71 ± 0.30  0.68 ± 0.30  0.72 ± 0.30  0.76 ± 0.29  0.78 ± 0.29
  USE        Precision  L3   0.82 ± 0.24  0.85 ± 0.22  0.85 ± 0.23  0.86 ± 0.21  0.85 ± 0.23  0.82 ± 0.23  0.78 ± 0.25
  USE        Recall     L1   0.64 ± 0.19  0.67 ± 0.19  0.68 ± 0.19  0.70 ± 0.21  0.69 ± 0.22  0.64 ± 0.20  0.65 ± 0.21
  USE        Recall     L2   0.62 ± 0.19  0.63 ± 0.20  0.66 ± 0.18  0.66 ± 0.21  0.68 ± 0.20  0.68 ± 0.19  0.69 ± 0.21
  USE        Recall     L3   0.64 ± 0.16  0.68 ± 0.19  0.66 ± 0.16  0.69 ± 0.20  0.65 ± 0.19  0.60 ± 0.17  0.60 ± 0.18

Table 12 (b): Machine-human agreement in terms of Kendall rank correlation. Average precision and recall Kendall's Tau between label-annotators (Li) and labels automatically inferred using SEM-F1, shown for different embedding models (section 8.1) and multiple threshold levels T = (tl, tu). For all annotators Li (i ∈ {1, 2, 3}), the correlation numbers are quite high (≥ 0.50) and consistent/stable across the embedding models and threshold values. All values are statistically significant at p-value < 0.05.

  Embedding  Measure    Li   (25,75)  (35,65)  (45,75)  (55,65)  (55,75)  (55,80)  (60,80)
  P-v1       Precision  L1   0.55     0.60     0.58     0.59     0.57     0.56     0.54
  P-v1       Precision  L2   0.61     0.67     0.63     0.67     0.64     0.67     0.68
  P-v1       Precision  L3   0.54     0.62     0.56     0.64     0.60     0.56     0.52
  P-v1       Recall     L1   0.53     0.64     0.66     0.62     0.61     0.62     0.59
  P-v1       Recall     L2   0.55     0.64     0.67     0.63     0.63     0.64     0.61
  P-v1       Recall     L3   0.54     0.65     0.64     0.66     0.65     0.65     0.61
  STSB       Precision  L1   0.57     0.67     0.58     0.66     0.60     0.57     0.58
  STSB       Precision  L2   0.66     0.63     0.65     0.63     0.70     0.63     0.60
  STSB       Precision  L3   0.56     0.57     0.58     0.56     0.59     0.57     0.56
  STSB       Recall     L1   0.55     0.65     0.64     0.62     0.62     0.61     0.59
  STSB       Recall     L2   0.56     0.65     0.65     0.63     0.63     0.64     0.63
  STSB       Recall     L3   0.54     0.59     0.61     0.57     0.58     0.57     0.54
  USE        Precision  L1   0.58     0.62     0.60     0.61     0.59     0.62     0.65
  USE        Precision  L2   0.68     0.70     0.68     0.68     0.68     0.70     0.73
  USE        Precision  L3   0.66     0.67     0.65     0.64     0.63     0.53     0.56
  USE        Recall     L1   0.53     0.59     0.56     0.61     0.62     0.61     0.60
  USE        Recall     L2   0.54     0.60     0.61     0.62     0.64     0.64     0.62
  USE        Recall     L3   0.52     0.60     0.58     0.61     0.61     0.60     0.60
AllSides Dataset: Statistics

Table 13: Two input documents are concatenated to compute the statistics. The four numbers for the references (#words/#sents) in the Test split correspond to the 4 reference intersections.

  Split   #words (docs)   #sents (docs)   #words (reference/s)        #sents (reference/s)
  Train   1613.69         66.70           67.30                       2.82
  Test    959.80          44.73           65.46/38.06/21.72/32.82     3.65/2.15/1.39/1.52

Our test dataset consists of 137 samples, wherein each sample has 4 ground-truth references. Out of these 4 references, 3 were manually written by 3 reference annotators; thus, we generated 3 * 137 = 411 references in total. A recent paper (Fabbri et al., 2021) also incorporated human annotations for only 100 samples. Following them, we created reference summaries for 150 samples, which were later filtered down to 137 samples due to the minimum-15-words criterion described in section 5. Overall, we agree that having more samples in the test dataset would definitely help, but this is both a time- and money-consuming process. We are working towards it and would like to increase the number of test samples in the future.