Multi-Narrative Semantic Overlap Task: Evaluation and Benchmark
Naman Bansal, Mousumi Akter and Shubhra Kanti Karmaker Santu
BDI Lab, Auburn University
{nzb0040, mza0170, sks0086}@auburn.edu

arXiv:2201.05294v1 [cs.CL] 14 Jan 2022

Abstract

In this paper, we introduce an important yet relatively unexplored NLP task called Multi-Narrative Semantic Overlap (MNSO), which entails generating a Semantic Overlap of multiple alternate narratives. As no benchmark dataset is readily available for this task, we created one by crawling 2,925 narrative pairs from the web and then went through the tedious process of manually creating 411 different ground-truth semantic overlaps by engaging human annotators. As a way to evaluate this novel task, we first conducted a systematic study by borrowing the popular ROUGE metric from the text-summarization literature and discovered that ROUGE is not suitable for our task. Subsequently, we conducted further human annotations/validations to create 200 document-level and 1,518 sentence-level ground-truth labels, which helped us formulate a new precision-recall style evaluation metric, called SEM-F1 (semantic F1). Experimental results show that the proposed SEM-F1 metric yields higher correlation with human judgement as well as higher inter-rater agreement compared to the ROUGE metric.

1 Introduction

In this paper, we look deeper into the challenging yet relatively under-explored area of automated understanding of multiple alternative narratives. To be more specific, we formally introduce a new NLP task called Multi-Narrative Semantic Overlap (MNSO) and conduct the first systematic study of this task by creating a benchmark dataset as well as proposing a suitable evaluation metric for the task. MNSO essentially means the task of extracting/paraphrasing/summarizing the overlapping information from multiple alternative narratives coming from disparate sources. In terms of computational goal, we study the following research question: Given two distinct narratives N1 and N2 of some event e expressed in unstructured natural language format, how can we extract the overlapping information present in both N1 and N2?

Figure 1 shows a toy example of the MNSO task, where the TextOverlap[1] (∩O) operation is applied to two news articles. Both articles cover the same story related to the topic "abortion"; however, they report from different political perspectives, i.e., one from the left wing and the other from the right wing. For greater visibility, "Left" and "Right" wing reporting biases are represented by blue and red text respectively. Green text denotes the common information in both news articles. The goal of the TextOverlap (∩O) operation is to extract the overlapping information conveyed by the green text.

At first glance, the MNSO task may appear similar to the traditional multi-document summarization task, where the goal is to provide an overall summary of the (multiple) input documents; however, the difference is that for MNSO, the goal is to provide summarized content with an additional constraint, i.e., the commonality criteria. There is no current baseline method or existing dataset that exactly matches our task; more importantly, it is unclear which evaluation metric can properly evaluate this task. As a starting point, we frame MNSO as a constrained seq-to-seq task where the goal is to generate a natural language output which conveys the overlapping information present in multiple input text documents. However, the bigger challenges we need to address first are the following: 1) How can we evaluate this task? and 2) How would one create a benchmark dataset for this task? To address these challenges, we make the following contributions in this paper.

1. We formally introduce Multi-Narrative Semantic Overlap (MNSO) as a new NLP task and conduct the first systematic study by formulating it as a constrained summarization problem.

[1] We'll be using the terms TextOverlap operator and Semantic Overlap interchangeably throughout the paper.
Figure 1: A toy use-case for the Semantic Overlap Task (TextOverlap). A news story on the topic of abortion has been presented by two news media (left-wing and right-wing). "Green" text denotes the overlapping information from both news media, while "Blue" and "Red" text denote the respective biases of the left and right wing. A couple of real examples from the benchmark dataset are provided in the appendix.

2. We create and release the first benchmark dataset consisting of 2,925 alternative narrative pairs for facilitating research on the MNSO task. Also, we went through the tedious process of manually creating 411 different ground-truth semantic intersections and conducted further human annotations/validations to create 200 document-level and 1,518 sentence-level ground-truth labels to construct the dataset.

3. As a starting point, we experiment with ROUGE, a widely popular metric for evaluating text-summarization tasks, and demonstrate that ROUGE is NOT suitable for evaluation of the MNSO task.

4. We propose a new precision-recall style evaluation metric, SEM-F1 (semantic F1), for evaluating the MNSO task. Extensive experiments show that the new SEM-F1 improves the inter-rater agreement compared to the traditional ROUGE metric and also shows higher correlation with human judgments.

2 Related Works

The idea of semantic text overlap is not entirely new; (Karmaker Santu et al., 2018) imagined a hypothetical framework for performing comparative text analysis, where TextOverlap was one of the "hypothetical" operators along with TextDifference, but the technical details and exact implementation were left as future work. In our work, we only focus on TextOverlap.

As TextOverlap can be viewed as a multi-document summarization task with an additional commonality constraint, the text-summarization literature is the most relevant to our work. Over the years, many paradigms for document summarization have been explored (Zhong et al., 2019). The two most popular among them are extractive approaches (Cao et al., 2018; Narayan et al., 2018; Wu and Hu, 2018; Zhong et al., 2020) and abstractive approaches (Bae et al., 2019; Hsu et al., 2018; Liu et al., 2017; Nallapati et al., 2016). Some researchers have also tried combining extractive and abstractive approaches (Chen and Bansal, 2018; Hsu et al., 2018; Zhang et al., 2019).

Recently, encoder-decoder based neural models have become really popular for abstractive summarization (Rush et al., 2015; Chopra et al., 2016; Zhou et al., 2017; Paulus et al., 2017). It has become even more prevalent to train a general language model on a huge corpus of data and then transfer/fine-tune it for the summarization task (Radford et al., 2019; Devlin et al., 2019; Lewis et al., 2019; Xiao et al., 2020; Yan et al., 2020; Zhang et al., 2019; Raffel et al., 2019). Summary length control for abstractive summarization has also been studied (Kikuchi et al., 2016; Fan et al., 2017; Liu et al., 2018; Fevry and Phang, 2018; Schumann, 2018; Makino et al., 2019). In general, multi-document summarization (Goldstein et al., 2000; Yasunaga et al., 2017; Zhao et al., 2020; Ma et al., 2020; Meena et al., 2014) is more challenging than single-document summarization. However, the MNSO task is different from traditional multi-document summarization tasks in that the goal here is to summarize content with an overlap constraint, i.e., the output should only contain the common information from both input narratives.

Alternatively, one could aim to recover verb predicate-alignment structure (Roth and Frank, 2012; Xie et al., 2008; Wolfe et al., 2013) from a sentence and further use this structure to compute the overlapping information (Wang and Zhang, 2009; Shibata and Kurohashi, 2012).
Sentence Fusion is another related area, which aims to combine the information from two given sentences with some additional constraints (Barzilay et al., 1999; Marsi and Krahmer, 2005; Krahmer et al., 2008; Thadani and McKeown, 2011). A related but simpler task is to retrieve parallel sentences (Cardon and Grabar, 2019; Nie et al., 1999; Murdock and Croft, 2005) without performing an actual intersection. However, these approaches are more targeted towards individual sentences and do not directly translate to arbitrarily long documents. Thus, the MNSO task is still an open problem and there is no existing dataset, method or evaluation metric that has been systematically studied.

Along the evaluation dimension, ROUGE (Lin, 2004) is perhaps the most commonly used metric today for evaluating automated summarization techniques, due to its simplicity and automation. However, ROUGE has been criticized a lot for primarily relying on lexical overlap (Nenkova, 2006; Zhou et al., 2006; Cohan and Goharian, 2016) of n-grams. As of today, around 192 variants of ROUGE are available (Graham, 2015), including ROUGE with word embeddings (Ng and Abrecht, 2015) and synonyms (Ganesan, 2018), graph-based lexical measurement (ShafieiBavani et al., 2018), Vanilla ROUGE (Yang et al., 2018) and highlight-based ROUGE (Hardy et al., 2019). However, there has been no study yet on whether the ROUGE metric is appropriate for evaluating the Semantic Intersection task, which is one of the central goals of our work.

3 Motivation and Applications

Multiple alternative narratives appear frequently across many domains like education, health, military, security and privacy, etc. (detailed use-cases for each domain are provided in the appendix). Indeed, the MNSO/TextOverlap operation can be very useful to digest such multi-narratives (from various perspectives) at scale and speed and, consequently, enhance the following important tasks as well.

Information Retrieval/Search Engines: Given a query, one could summarize the common information (TextOverlap) from the top k documents fetched by a search engine and display it as additional information to the user.

Question Answering: Given a particular question, the system could aim to provide a more accurate answer based on multiple pieces of evidence from various source documents and generate the most common answer by applying TextOverlap.

Robust Translation: Suppose you have multiple translation models which translate a given document from language A to language B. One could further apply the TextOverlap operator on the translated documents and get a robust translation.

In general, the MNSO task could be employed in any setting where we have comparative text analysis.

4 Problem Formulation

What is Semantic Overlap? This is indeed a philosophical question and there is no single correct answer (various possible definitions are mentioned in appendix section A). To simplify notations, let us stick to having only two documents DA and DB as our input, since the task can easily be generalized to more documents by applying TextOverlap repeatedly (illustrated in the sketch below). Also, let us define the output as DO ← DA ∩O DB. A human would mostly express the output in the form of natural language, and this is why we frame the MNSO task as a constrained multi-seq-to-seq (text generation) task where the output text only contains information that is present in both input documents. We also argue that brevity (minimal repetition) is a desired property of Semantic Overlap and thus, we frame the MNSO task as a constrained summarization problem to ensure brevity. For example, if a particular piece of information or quote is repeated twice in both documents, we don't necessarily want it to be present in the target overlap summary two times. The output can either be an extractive summary, an abstractive summary, or a mixture of both, as per the use case.

This task is inspired by the set-theoretic intersection operator. However, unlike set-intersection, our TextOverlap does not have to be the maximal set. The aim is to summarize the overlapping information in an abstractive fashion. Additionally, Semantic Overlap should follow the commutative property, i.e., DA ∩O DB = DB ∩O DA.
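To make the repeated, pairwise application mentioned above concrete, here is a minimal sketch. The names text_overlap and multi_narrative_overlap are ours, and the word-overlap stub is only the simplistic "common words" definition from appendix A standing in for a real overlap generator; the paper itself frames ∩O as a constrained (abstractive) summarization model.

    from functools import reduce

    def text_overlap(doc_a: str, doc_b: str) -> str:
        """Toy TextOverlap (∩O): keep only words occurring in both inputs.

        This is just the simplistic 'common words' definition from
        appendix A used as a stand-in; the paper frames ∩O as a
        constrained (abstractive) summarization model instead.
        """
        common = set(doc_a.lower().split()) & set(doc_b.lower().split())
        return " ".join(w for w in doc_a.split() if w.lower() in common)

    def multi_narrative_overlap(narratives: list[str]) -> str:
        # More than two narratives are handled by applying the pairwise
        # operator repeatedly: ((D1 ∩O D2) ∩O D3) ∩O ...
        # If ∩O is commutative (a desired property of Semantic Overlap),
        # the input order should not change the conveyed information.
        return reduce(text_overlap, narratives)

    if __name__ == "__main__":
        print(multi_narrative_overlap([
            "The senate passed the bill on Tuesday",
            "On Tuesday the controversial bill passed the senate",
            "Critics say the bill passed by the senate is flawed",
        ]))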
5 The Benchmark Dataset

As mentioned in section 1, there is no existing dataset which we could readily use to evaluate the MNSO task[2]. To address this challenge, we crawled data from AllSides.com. AllSides is a third-party online news forum which exposes people to news and information from all sides of the political spectrum so that the general public can get an "unbiased" view of the world. To achieve this, AllSides displays each day's top news stories from news media widely known to be affiliated with different sides of the political spectrum, including "Left" (e.g., New York Times, NBC News) and "Right" (e.g., Townhall, Fox News) wing media. AllSides also provides their own factual description of the reading material, labeled as "Theme", so that readers can see the so-called "neutral" point-of-view. Table 1 gives an overview of the dataset created by crawling AllSides.com, which consists of news articles (from at least one "Left" and one "Right" wing media outlet) covering 2,925 events in total, each with a "theme-description" of at least 15 words. Given two narratives ("Left" and "Right"), we used the theme-description as a proxy for the ground-truth TextOverlap. We divided this dataset into testing data (described next) and training data (remaining samples); their statistics are provided in the appendix (table 13).

[2] Multi-document summarization datasets can not be utilized in this scenario as their reference summaries do not follow the semantic overlap constraint.

Table 1: Overview of the dataset scraped from AllSides

  Feature             Description
  theme               headlines by AllSides
  theme-description   news description by AllSides
  right/left head     right/left news headline
  right/left context  right/left news description

Human Annotations[3]: We decided to involve human volunteers to annotate our testing samples in order to create multiple human-written ground-truth semantic overlaps for each event narrative pair. This helped in creating a comprehensive testing benchmark for more rigorous evaluation. Specifically, we randomly sampled 150 narrative pairs (one from the "Left" wing and one from the "Right" wing) and then asked 3 (three) humans to write a natural language description which conveys the semantic overlap of the information present in both narratives describing each event.

[3] The dataset and manual annotations can be found in the supplementary folder.

After the first round of annotation, we immediately observed a discrepancy among the three annotators in terms of the real definition of "semantic overlap". For example, one annotator argued that the Semantic Overlap of two narratives is non-empty as long as there is an overlap along one of the 5W1H facets (Who, What, When, Where, Why and How), while another annotator argued that overlap in only one facet is not enough to decide whether there is indeed a semantic overlap. As an example, one of the annotators wrote only "Donald Trump" as the Semantic Overlap for a couple of cases where the narratives were substantially different, while others had those cases marked as "empty set".

To mitigate this issue, we only retained the narrative-pairs where at least two of the annotators wrote a minimum of 15 words as their ground-truth semantic overlap, with the hope that a human-written description will contain 15 words or more only in cases where there is indeed a "significant" overlap between the two original narratives. This filtering step gave us a test set with 137 samples where each sample had 4 ground-truth semantic overlaps, one from AllSides and three from human annotators.

6 Evaluating MNSO Task using ROUGE

As ROUGE (Lin, 2004) is the most popular metric used today for evaluating summarization techniques, we first conducted a case-study with ROUGE as the evaluation metric for the MNSO task.

6.1 Methods Used in the Case-Study

We experimented with multiple SoTA pre-trained abstractive summarization models as a proxy for Semantic-Overlap generators. These models are: 1) BART (Lewis et al., 2019), fine-tuned on the CNN and multi-english Wiki news datasets, 2) Pegasus (Zhang et al., 2019), fine-tuned on the CNN and Daily Mail dataset, and 3) T5 (Raffel et al., 2019), fine-tuned on the multi-english Wiki news dataset. As our primary goal is to construct a benchmark dataset for the MNSO task and establish an appropriate metric for evaluating this task, experimenting with only 3 abstractive summarization models is not a barrier to our work. Proposing a custom method fine-tuned for the Semantic-Overlap task is an orthogonal goal and we leave it as future work. Also, we'll use the phrases "summary" and "overlap-summary" interchangeably from here on. To generate the summary, we concatenate a narrative pair and feed it directly to the model.

For evaluation, we first evaluated the machine-generated overlap summaries for the 137 manually annotated testing samples using the ROUGE metric (Lin, 2004) and followed the procedure mentioned in the original paper to compute ROUGE-F1 scores with multiple reference summaries. More precisely, since we have 4 reference summaries, we get 4 precision-recall pairs which are used to compute the corresponding F1 scores. For each sample, we took the max of these 4 F1 scores and averaged them out across the test dataset. The ROUGE scores can be seen in table 11 in the appendix. A sketch of this evaluation procedure is shown below.
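The following is a minimal sketch of the evaluation loop just described, under some assumptions: a generic facebook/bart-large-cnn checkpoint from the transformers library stands in for the fine-tuned overlap generators used in the paper, the rouge-score package is used for ROUGE-F1, and the helper names are ours.

    # pip install transformers rouge-score
    from transformers import pipeline
    from rouge_score import rouge_scorer

    # Assumption: a generic BART-CNN checkpoint as a stand-in for the
    # fine-tuned overlap generators used in the paper.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    def generate_overlap(left_narrative: str, right_narrative: str) -> str:
        # The paper simply concatenates the narrative pair and feeds it to the model.
        joined = left_narrative + " " + right_narrative
        return summarizer(joined, truncation=True)[0]["summary_text"]

    def rouge_f1_multi_ref(prediction: str, references: list[str]) -> dict:
        # One F1 per reference; keep the max, as described in section 6.1.
        best = {}
        for ref in references:
            scores = scorer.score(ref, prediction)  # score(target, prediction)
            for name, s in scores.items():
                best[name] = max(best.get(name, 0.0), s.fmeasure)
        return best

    def corpus_rouge(samples) -> dict:
        # samples: iterable of (left, right, [ref1, ref2, ref3, ref4])
        totals, n = {}, 0
        for left, right, refs in samples:
            f1s = rouge_f1_multi_ref(generate_overlap(left, right), refs)
            for name, v in f1s.items():
                totals[name] = totals.get(name, 0.0) + v
            n += 1
        return {name: v / n for name, v in totals.items()}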
Table 2: Max (across 3 models) Pearson's correlation between the ROUGE-F1 scores corresponding to different annotators. Here Ii refers to the i-th annotator, where i ∈ {1, 2, 3, 4}, and the "Average" row represents the average correlation of the max values across annotators. Boldface values are statistically significant at p-value < 0.05. For 5 out of 6 annotator pairs, the correlation values are quite small (≤ 0.50), implying poor inter-rater agreement with regards to the ROUGE metric.

                 R1                    R2                    RL
            I1    I2    I3        I1    I2    I3        I1    I2    I3
  I2       0.62   —     —        0.65   —     —        0.69   —     —
  I3       0.30  0.38   —        0.27  0.37   —        0.27  0.44   —
  I4       0.17  0.34  0.34      0.14  0.33  0.21      0.18  0.35  0.33
  Average        0.36                  0.33                  0.38

6.2 Results and Findings

We computed Pearson's correlation coefficients between each pair of ROUGE-F1 scores obtained using all of the 4 reference overlap-summaries (3 human-written summaries and 1 AllSides theme description) to test the robustness of the ROUGE metric for evaluating the MNSO task. The corresponding correlations are shown in table 2. For each annotator pair, we report the maximum (across 3 models) correlation value. The average correlation value across annotators is 0.36, 0.33 and 0.38 for R1, R2 and RL respectively, suggesting that the ROUGE metric is not stable across multiple human-written overlap-summaries and thus unreliable. Indeed, only one out of the 6 different annotator pairs has a value greater than 0.50 for all the 3 ROUGE metrics (R1, R2, RL), which is problematic.

7 Can We Do Better than ROUGE?

Section 6 shows that the ROUGE metric is unstable across multiple reference overlap-summaries. Therefore, an immediate question is: Can we come up with a better metric than ROUGE? To investigate this question, we started by manually assessing the machine-generated overlap summaries to check whether humans agree among themselves or not.

7.1 Different Trials of Human Judgement

Assigning a Single Numeric Score: As an initial trial, we decided to first label 25 testing samples using two human annotators (we call them label-annotators L1 and L2). Both label-annotators read each of the 25 narrative pairs as well as the corresponding system-generated overlap-summary (generated by fine-tuned BART) and assigned a numeric score between 1-10 (inclusive). This number reflects their judgement/confidence about how accurately the system-generated summary captures the actual overlap of the two input narratives. Note that the reference overlap summaries were not included in this label-annotation process and the label-annotators judged the system-generated summary exclusively with respect to the input narratives. To quantify the agreement between human scores, we computed the Kendall rank correlation coefficient (or Kendall's Tau) between the two annotators' labels, since these are ordinal values. However, to our disappointment, the correlation value was 0.20 with a p-value of 0.22[4]. This shows that even human annotators are disagreeing among themselves and we need to come up with a better labelling guideline to reach a reasonable agreement among the human annotators.

[4] The higher p-value means that the correlation value is insignificant because of the small number of samples, but the aim is to first find a labelling criterion where humans can agree among themselves.

On further discussion among the annotators, we realized that one annotator only focused on the preciseness of the intersection summaries, whereas the other annotator took both precision and recall into consideration. Thus, we decided to next assign two separate scores for precision and recall.

Precision-Recall Inspired Double Scoring: This time, three label-annotators (L1, L2 and L3) assigned two numeric scores between 1-10 (inclusive) for the same set of 25 system-generated summaries. These numbers represented their belief about how precise the system-generated summaries were (the precision score) and how much of the actual ground-truth overlap-information was covered by the same (the recall score). Also note that labels were again assigned exclusively with respect to the input narratives. As the assigned numbers represent ordinal values, we again computed Kendall's Tau between the annotators' precision and recall scores; the results are shown in table 3.

Table 3: Kendall's rank correlation coefficients among the precision and recall scores for pairs of human annotators (25 samples). Li refers to the i-th label-annotator.

               Precision         Recall
             L1      L2        L1      L2
  L2        0.52     —        0.37     —
  L3        0.18    0.29      0.31    0.54
  Average       0.33              0.41
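For reference, the agreement statistics used in sections 6.2 and 7.1 (Pearson's correlation for metric scores, Kendall's Tau for ordinal human judgements) can be computed with scipy as sketched below; the score values shown are purely illustrative, not the paper's data.

    # pip install scipy
    from scipy.stats import pearsonr, kendalltau

    # Toy example values, purely illustrative (not the paper's data):
    # per-sample ROUGE-F1 scores of the same system outputs measured
    # against two different reference overlap-summaries ...
    rouge_f1_ref_a = [0.41, 0.28, 0.35, 0.50, 0.22]
    rouge_f1_ref_b = [0.30, 0.45, 0.31, 0.29, 0.40]

    # ... and 1-10 ordinal quality scores assigned by two label-annotators.
    scores_annotator_1 = [7, 4, 8, 5, 6]
    scores_annotator_2 = [6, 5, 3, 7, 6]

    # Section 6.2: Pearson correlation between metric scores computed
    # from different references (stability of the metric).
    r, r_pvalue = pearsonr(rouge_f1_ref_a, rouge_f1_ref_b)

    # Section 7.1: Kendall's Tau for ordinal human judgements
    # (inter-rater agreement).
    tau, tau_pvalue = kendalltau(scores_annotator_1, scores_annotator_2)

    print(f"Pearson r = {r:.2f} (p = {r_pvalue:.2f})")
    print(f"Kendall tau = {tau:.2f} (p = {tau_pvalue:.2f})")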
Table 4: Average precision and recall Kendall rank correlation coefficients between sentence-wise annotations for different annotators. Li refers to the i-th label-annotator. All values are statistically significant (p-value < 0.05).

               Precision         Recall
             L1      L2        L1      L2
  L2        0.68     —        0.75     —
  L3        0.59    0.64      0.69    0.71
  Average       0.64              0.72

Table 5: Reward function used to evaluate the labels assigned by two label-annotators (or labels inferred using the SEM-F1 metric and human-annotated labels) for a given sentence (association between annotator pairs).

                            Label from Annotator B
                              P      PP     A
  Label from         P       1      0.5    0
  Annotator A        PP      0.5    1      0
                     A       0      0      1
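A minimal sketch of how the reward function of table 5 can be applied to two sequences of sentence-level Presence (P), Partial Presence (PP) and Absence (A) labels; the function and variable names are ours and the example labels are illustrative.

    # Reward matrix from Table 5: identical labels earn 1, P vs. PP earns 0.5,
    # and any disagreement involving A earns 0.
    REWARD = {
        ("P", "P"): 1.0, ("PP", "PP"): 1.0, ("A", "A"): 1.0,
        ("P", "PP"): 0.5, ("PP", "P"): 0.5,
        ("P", "A"): 0.0, ("A", "P"): 0.0,
        ("PP", "A"): 0.0, ("A", "PP"): 0.0,
    }

    def average_reward(labels_a: list[str], labels_b: list[str]) -> float:
        """Mean sentence-wise reward between two label sequences."""
        assert len(labels_a) == len(labels_b), "one label per sentence expected"
        return sum(REWARD[(a, b)] for a, b in zip(labels_a, labels_b)) / len(labels_a)

    # Illustrative only: labels for five sentences of one overlap summary.
    print(average_reward(["P", "PP", "A", "P", "PP"],
                         ["P", "P", "A", "PP", "A"]))  # -> 0.6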
Table 6: Average precision and recall reward scores (mean ± std) between sentence-wise annotations for different annotators. Li refers to the i-th label-annotator.

               Precision                        Recall
             L1             L2               L1             L2
  L2      0.81 ± 0.26       —             0.85 ± 0.11       —
  L3      0.79 ± 0.26   0.70 ± 0.31       0.80 ± 0.16   0.77 ± 0.17
  Average        0.77                            0.81

8 Semantic-F1: The New Metric

Human evaluation is costly and time-consuming. Thus, one needs an automatic evaluation metric for large-scale experiments. But how can we devise an automated metric to perform the sentence-wise precision-recall style evaluation discussed in the previous section? To achieve this, we propose a new evaluation metric called SEM-F1. The details of our SEM-F1 metric are described in algorithm 1 and the respective notations are listed in table 7. The F1 score is computed as the harmonic mean of the precision (pV) and recall (rV) values. Algorithm 1 assumes only one reference summary but can be trivially extended to multiple references. As mentioned previously, in the case of multiple references, we concatenate them for the precision-score computation. Recall scores are computed individually for each reference summary and later, an average recall is computed across references.

The basic intuition behind SEM-F1 is to compute the sentence-wise similarity (e.g., cosine similarity using a sentence embedding model) to infer the semantic overlap/intersection between two sentences from both the precision and recall perspectives and then combine them into an F1 score.

Algorithm 1: Semantic-F1 Metric
 1: Given SG, SR, ME
 2: rawpV, rawrV ← CosineSim(SG, SR, ME)   ▷ sentence-wise precision and recall values
 3: pV ← Mean(rawpV)
 4: rV ← Mean(rawrV)
 5: f1 ← (2 * pV * rV) / (pV + rV)
 6: return (f1, pV, rV)

 1: procedure CosineSim(SG, SR, ME)
 2:   lG ← no. of sentences in SG
 3:   lR ← no. of sentences in SR
 4:   init: cosSs ← zeros[lG, lR]; i ← 0
 5:   for each sentence sG in SG do
 6:     EsG ← ME(sG); j ← 0
 7:     for each sentence sR in SR do
 8:       EsR ← ME(sR)
 9:       cosSs[i, j] ← Cos(EsG, EsR); j ← j + 1
10:     end for
11:     i ← i + 1
12:   end for
13:   x ← Row-wise-max(cosSs)
14:   y ← Column-wise-max(cosSs)
15:   return (x, y)
16: end procedure

Table 7: Table of notations for algorithm 1

  Notation         Description
  SG               Machine-generated summary
  SR               Reference summary
  T := (tl, tu)    Tuple of the lower and upper threshold values (between 0 and 1)
  ME               Sentence embedding model
  pV, rV           Precision and recall values for the (SG, SR) pair

8.1 Is SEM-F1 Reliable?

The SEM-F1 metric computes cosine similarity scores between sentence pairs from both the precision and recall perspectives. To see whether the SEM-F1 metric correlates with human judgement, we further converted the sentence-wise raw cosine scores into Presence (P), Partial Presence (PP) and Absence (A) labels using user-defined thresholds, as described in algorithm 2. This helped us directly compare the SEM-F1 inferred labels against the human-annotated labels.

As mentioned in section 8, we utilized state-of-the-art sentence embedding models to encode sentences from both the model-generated summaries and the human-written narrative intersections. To be more specific, we experimented with 3 sentence embedding models: paraphrase-distilroberta-base-v1 (P-v1) (Reimers and Gurevych, 2019), stsb-roberta-large (STSB) (Reimers and Gurevych, 2019) and universal-sentence-encoder (USE) (Cer et al., 2018). Along with the various embedding models, we also experimented with multiple threshold values used to predict the sentence-wise Presence (P), Partial Presence (PP) and Absence (A) labels, in order to report the sensitivity of the metric with respect to different thresholds. These thresholds are: (25, 75), (35, 65), (45, 75), (55, 65), (55, 75), (55, 80), (60, 80). For example, the threshold range (45, 75) means that if the similarity score < 45%, we infer the label "absent"; if the similarity score ≥ 75%, we infer the label "present"; otherwise, we infer the label "partial-present". A sketch of the SEM-F1 computation, including this thresholding step, is given below.
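Below is a minimal Python sketch of algorithm 1 together with the threshold-based P/PP/A labelling of algorithm 2 (appendix B), assuming the sentence-transformers implementation of the P-v1 embedding model and a naive sentence splitter; it is a sketch of the described procedure, not the authors' released implementation.

    # pip install sentence-transformers numpy
    import re
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # One of the embedding models listed above; stsb-roberta-large (or USE via
    # TensorFlow Hub) could be swapped in following the same pattern.
    encoder = SentenceTransformer("paraphrase-distilroberta-base-v1")

    def split_sentences(text: str) -> list[str]:
        # Naive splitter for the sketch; any proper sentence tokenizer works too.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    def sem_f1(generated: str, reference: str):
        """Algorithm 1: sentence-wise cosine similarities -> precision, recall, F1."""
        gen_emb = encoder.encode(split_sentences(generated))   # (lG, d)
        ref_emb = encoder.encode(split_sentences(reference))   # (lR, d)

        gen_norm = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
        ref_norm = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
        cos = gen_norm @ ref_norm.T                            # (lG, lR) cosine matrix

        raw_p = cos.max(axis=1)  # row-wise max: support for each generated sentence
        raw_r = cos.max(axis=0)  # column-wise max: coverage of each reference sentence
        p, r = float(raw_p.mean()), float(raw_r.mean())
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return f1, p, r, raw_p, raw_r

    def to_labels(raw_scores, t_low: float, t_high: float) -> list[str]:
        """Algorithm 2 (appendix B): map raw cosine scores to P / PP / A labels."""
        return ["P" if s >= t_high else "PP" if s >= t_low else "A" for s in raw_scores]

    f1, p, r, raw_p, raw_r = sem_f1(
        "The senate passed the bill. The vote was close.",
        "A closely contested vote saw the bill pass the senate.",
    )
    print(round(f1, 3), to_labels(raw_p, 0.45, 0.75))  # thresholds T = (45, 75)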
Next, we computed the average precision and recall rewards for 50 samples annotated by the label-annotators (Li) and the labels inferred by the SEM-F1 metric. For this, we repeat the procedure of Table 6, but this time comparing human labels against the 'SEM-F1 labels'. The corresponding results are shown in table 8. As we can notice, the average reward values are consistently high (≥ 0.50) for all 3 label-annotators (Li). Moreover, the reward values are consistent/stable across all 3 embedding models and threshold values, signifying that SEM-F1 is indeed robust across the various sentence embeddings and thresholds used.

Following the procedure in table 4, we also compute Kendall's Tau between the human label-annotators and the labels automatically inferred using SEM-F1. Our results in table 8 are consistent with the reward-based inter-rater agreement, and the correlation values are ≥ 0.50, with little variation across thresholds for both precision and recall.

Table 8: Machine-human agreement in terms of the reward function and Kendall correlation (shown as Reward/Kendall) between label-annotators (Li) and labels automatically inferred using SEM-F1 (averaged over the 3 label-annotators). The raw numbers for each annotator can be found in the appendix (table 12). Results are shown for the different embedding models (section 8.1) and multiple threshold levels T = (tl, tu). Both the Reward and Kendall values are consistent/stable across all the embedding models and threshold values.

  Embedding  Measure    (25,75)    (35,65)    (45,75)    (55,65)    (55,75)    (55,80)    (60,80)
  P-v1       Precision  0.75/0.57  0.80/0.63  0.76/0.59  0.80/0.63  0.78/0.60  0.74/0.60  0.73/0.58
  P-v1       Recall     0.66/0.54  0.76/0.64  0.73/0.66  0.72/0.64  0.69/0.63  0.65/0.64  0.61/0.60
  STSB       Precision  0.73/0.60  0.73/0.62  0.73/0.60  0.73/0.62  0.73/0.63  0.73/0.59  0.73/0.58
  STSB       Recall     0.63/0.55  0.64/0.63  0.63/0.60  0.65/0.61  0.65/0.61  0.63/0.61  0.64/0.59
  USE        Precision  0.76/0.64  0.76/0.66  0.78/0.64  0.78/0.64  0.79/0.63  0.78/0.62  0.79/0.65
  USE        Recall     0.63/0.53  0.66/0.60  0.67/0.58  0.68/0.61  0.67/0.62  0.64/0.62  0.65/0.61

8.2 SEM-F1 Scores for Random Baselines

Here, we present the actual SEM-F1 scores for the three models described in section 6.1, along with scores for two intuitive baselines, namely, 1) Random Overlap and 2) Random Annotation.

Random Overlap: For a given sample and model, we select a random overlap summary generated by the model for one of the other 136 test samples. These random overlaps are then evaluated against the 4 reference summaries using SEM-F1.

Random Annotation: For a given sample, we select a random reference summary from among the 4 references of one of the other 136 test samples. The model-generated summaries are then compared against these Random Annotations/References to compute the SEM-F1 scores reported in table 9.

As we can notice, there is an approximately 40-45 percent improvement over the baseline scores, suggesting that SEM-F1 can indeed distinguish good overlap summaries from bad ones.

Table 9: SEM-F1 scores of the three overlap generators along with the two random baselines.

              Random Annotation       Random Overlap          SEM-F1 Scores
              P-V1   STSB   USE       P-V1   STSB   USE       P-V1   STSB   USE
  BART        0.16   0.21   0.22      0.21   0.27   0.27      0.65   0.67   0.67
  T5          0.17   0.21   0.23      0.20   0.26   0.26      0.58   0.60   0.60
  Pegasus     0.15   0.20   0.22      0.19   0.26   0.26      0.59   0.60   0.62
  Average     0.16   0.21   0.22      0.20   0.26   0.26      0.61   0.62   0.63

8.3 Pearson Correlation for SEM-F1

Following the case-study based on ROUGE in section 6, we again compute the Pearson's correlation coefficients between each pair of raw SEM-F1 scores obtained using all of the 4 reference intersection-summaries. The corresponding correlations are shown in table 10. For each annotator pair, we report the maximum (across 3 models) correlation value. The average correlation value across annotators is 0.49, 0.49 and 0.54 for the P-V1, STSB and USE embeddings, respectively. This shows a clear improvement over the ROUGE metric, suggesting that SEM-F1 is more accurate than the ROUGE metric.

Table 10: Max (across 3 models) Pearson's correlation between the SEM-F1 scores corresponding to different annotators. Here Ii refers to the i-th annotator, where i ∈ {1, 2, 3, 4}, and the "Average" row represents the average correlation of the max values across annotators. All values are statistically significant at p-value < 0.05.

                 P-V1                  STSB                  USE
            I1    I2    I3        I1    I2    I3        I1    I2    I3
  I2       0.69   —     —        0.65   —     —        0.71   —     —
  I3       0.40  0.50   —        0.50  0.52   —        0.51  0.54   —
  I4       0.33  0.44  0.60      0.33  0.36  0.56      0.37  0.42  0.66
  Average        0.49                  0.49                  0.54

9 Conclusions

In this work, we proposed a new NLP task, called Multi-Narrative Semantic Overlap (MNSO), and created a benchmark dataset through meticulous human effort to initiate a new research direction. As a starting point, we framed the problem as a constrained summarization task and showed that ROUGE is not a reliable evaluation metric for this task. We further proposed a more accurate metric, called SEM-F1, for evaluating the MNSO task. Experiments show that SEM-F1 is more robust and yields higher agreement with human judgement.
References

Amit Alfassy, Leonid Karlinsky, Amit Aides, Joseph Shtok, Sivan Harary, Rogerio Feris, Raja Giryes, and Alex M Bronstein. 2019. LaSO: Label-set operations networks for multi-label few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6548-6557.

Sanghwan Bae, Taeuk Kim, Jihoon Kim, and Sang-goo Lee. 2019. Summary level training of sentence rewriting for abstractive summarization. arXiv preprint arXiv:1909.08752.

Regina Barzilay, Kathleen McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 550-557.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2018. Retrieve, rerank and rewrite: Soft template based neural summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 152-161, Melbourne, Australia. Association for Computational Linguistics.

Rémi Cardon and Natalia Grabar. 2019. Parallel sentence retrieval from comparable corpora for biomedical text simplification. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 168-177.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.

Sumit Chopra, Michael Auli, and Alexander M Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93-98.

Arman Cohan and Nazli Goharian. 2016. Revisiting summarization evaluation for scientific articles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexander R Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391-409.

Angela Fan, David Grangier, and Michael Auli. 2017. Controllable abstractive summarization. arXiv preprint arXiv:1711.05217.

Thibault Fevry and Jason Phang. 2018. Unsupervised sentence compression using denoising auto-encoders. arXiv preprint arXiv:1809.02669.

Kavita Ganesan. 2018. ROUGE 2.0: Updated and improved measures for evaluation of summarization tasks. CoRR, abs/1803.01937.

Jade Goldstein, Vibhu O Mittal, Jaime G Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. In NAACL-ANLP 2000 Workshop: Automatic Summarization.

Yvette Graham. 2015. Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, pages 128-137. The Association for Computational Linguistics.

Hardy, Shashi Narayan, and Andreas Vlachos. 2019. HighRES: Highlight-based reference-less evaluation of summarization. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL 2019), Florence, Italy, Volume 1: Long Papers, pages 3381-3392. Association for Computational Linguistics.

Donna Harman and Paul Over. 2004. The effects of human variation in DUC summarization evaluation. In Text Summarization Branches Out, pages 10-17, Barcelona, Spain. Association for Computational Linguistics.

Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. arXiv preprint arXiv:1805.06266.

Shubhra Kanti Karmaker Santu, Chase Geigle, Duncan Ferguson, William Cope, Mary Kalantzis, Duane Searsmith, and Chengxiang Zhai. 2018. SOFSAT: Towards a setlike operator based framework for semantic analysis of text. ACM SIGKDD Explorations Newsletter, 20(2):21-30.

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. arXiv preprint arXiv:1609.09552.

Emiel Krahmer, Erwin Marsi, and Paul van Pelt. 2008. Query-based sentence fusion is better defined and leads to more preferred results than generic sentence fusion. In Proceedings of ACL-08: HLT, Short Papers, pages 193-196.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81.

Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, and Hongyan Li. 2017. Generative adversarial network for abstractive text summarization. arXiv preprint arXiv:1711.09357.

Yizhu Liu, Zhiyi Luo, and Kenny Zhu. 2018. Controlling length in abstractive summarization using a convolutional neural network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4110-4119.

Congbo Ma, Wei Emma Zhang, Mingyu Guo, Hu Wang, and Quan Z Sheng. 2020. Multi-document summarization via deep learning techniques: A survey. arXiv preprint arXiv:2011.04843.

Takuya Makino, Tomoya Iwakura, Hiroya Takamura, and Manabu Okumura. 2019. Global optimization under length constraint for neural text summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1039-1048.

Erwin Marsi and Emiel Krahmer. 2005. Explorations in sentence fusion. In Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05).

Yogesh Kumar Meena, Ashish Jain, and Dinesh Gopalani. 2014. Survey on graph and cluster based approaches in multi-document text summarization. In International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2014), pages 1-5. IEEE.

Vanessa Murdock and W Bruce Croft. 2005. A translation model for sentence retrieval. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 684-691.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. arXiv preprint arXiv:1802.08636.

Ani Nenkova. 2006. Summarization evaluation for text and speech: issues and approaches. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing. ISCA.

Jun-Ping Ng and Viktoria Abrecht. 2015. Better summarization evaluation with word embeddings for ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, pages 1925-1930. The Association for Computational Linguistics.

Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. 1999. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74-81.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Dragomir Radev. 2000. A common theory of information fusion from multiple text sources step one: cross-document structure. In 1st SIGdial Workshop on Discourse and Dialogue, pages 74-83.

Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, and Ilya Sutskever. 2019. Better language models and their implications. OpenAI Blog. https://openai.com/blog/better-language-models.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Michael Roth and Anette Frank. 2012. Aligning predicate argument structures in monolingual comparable texts: A new corpus for a new task. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics and the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 218-227.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Raphael Schumann. 2018. Unsupervised abstractive sentence summarization using length controlled variational autoencoder. arXiv preprint arXiv:1809.05233.

Elaheh ShafieiBavani, Mohammad Ebrahimi, Raymond K. Wong, and Fang Chen. 2018. A graph-theoretic summary evaluation for ROUGE. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pages 762-767. Association for Computational Linguistics.

Tomohide Shibata and Sadao Kurohashi. 2012. Predicate-argument structure-based textual entailment recognition system exploiting wide-coverage lexical knowledge. ACM Transactions on Asian Language Information Processing (TALIP), 11(4):1-23.

Kapil Thadani and Kathleen McKeown. 2011. Towards strict sentence intersection: decoding and evaluation strategies. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, pages 43-53.

Rui Wang and Yi Zhang. 2009. Recognizing textual relatedness with predicate-argument structures. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 784-792.

Travis Wolfe, Benjamin Van Durme, Mark Dredze, Nicholas Andrews, Charley Beller, Chris Callison-Burch, Jay DeYoung, Justin Snyder, Jonathan Weese, Tan Xu, et al. 2013. PARMA: A predicate argument aligner. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 63-68.

Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. arXiv preprint arXiv:1804.07036.

Dongling Xiao, Han Zhang, Yukun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. arXiv preprint arXiv:2001.11314.

Lexing Xie, Hari Sundaram, and Murray Campbell. 2008. Event mining in multimedia streams. Proceedings of the IEEE, 96(4):623-647.

Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063.

An Yang, Kai Liu, Jing Liu, Yajuan Lyu, and Sujian Li. 2018. Adaptations of ROUGE and BLEU to better evaluate machine reading comprehension task. In Proceedings of the Workshop on Machine Reading for Question Answering@ACL 2018, Melbourne, Australia, pages 98-104. Association for Computational Linguistics.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. arXiv preprint arXiv:1706.06681.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777.

Jinming Zhao, Ming Liu, Longxiang Gao, Yuan Jin, Lan Du, He Zhao, He Zhang, and Gholamreza Haffari. 2020. SummPip: Unsupervised multi-document summarization with sentence graph compression. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1949-1952.

Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2019. Searching for effective neural extractive summarization: What works and what's next. arXiv preprint arXiv:1907.03491.

Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. arXiv preprint arXiv:2004.08795.

Liang Zhou, Chin-Yew Lin, Dragos Stefan Munteanu, and Eduard H. Hovy. 2006. ParaEval: Using paraphrases to evaluate summaries automatically. In Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. The Association for Computational Linguistics.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. arXiv preprint arXiv:1704.07073.
A Other definitions of Text Overlap

Below, we present a set of possible definitions of Semantic Overlap to encourage the readers to think more about other alternative definitions.

1. On a very simplistic level, one can think of Semantic Overlap as just the common words between the two input documents. One can also include their frequencies of occurrence in such a representation. More specifically, we can define Dovlp as a set of unordered pairs of words wi and their frequencies of common occurrence fi, i.e., Dovlp = {(wi, fi)}. We can further extend this approach such that Semantic Overlap is a set of common n-grams among the input documents. More specifically, Dovlp = {((w1, w2, ..., wn)i, fi)} such that the n-gram (w1, w2, ..., wn)i is present in both DA (with frequency fiA) and DB (with frequency fiB) and fi = min(fiA, fiB). A short code sketch of this definition is included below, after appendix C.

2. Another way to think of Semantic Overlap is to find the common topics among two documents, just like finding common object labels among two images (Alfassy et al., 2019), by computing the joint probability of their topic distributions. More specifically, Semantic Overlap can be defined by the following joint probability distribution: P(Ti | Dovlp) = P(Ti | DA) × P(Ti | DB). This representation is more semantic in nature as it can capture overlap in topics.

3. Alternatively, one can take the 5W1H approach (Xie et al., 2008), where a given narrative D can be represented in terms of unordered sets of six facets: the 5Ws (Who, What, When, Where and Why) and 1H (How). In this case, we can define Semantic Overlap as the common elements between the corresponding sets related to these 6 facets present in both narratives, i.e., Dovlp = {Si}, where Si is a set belonging to one of the six 5W1H facets. It is entirely possible that one of these Si's is an empty set (φ). The most challenging aspect of this approach is accurately inferring the 5W1H facets.

4. Another way could be to define a given document as a graph. Specifically, we can consider a document D as a directed graph G = (V, E), where V represents the vertices and E represents the edges. Thus, TextOverlap can be defined as the set of common vertices or edges or both. Specifically, Dovlp can be defined as the maximum common subgraph of both GA and GB, where GA and GB are the corresponding graphs for the documents DA and DB respectively. However, coming up with a graph structure G which can align with both documents DA and DB would itself be a challenge.

5. One can also define the TextOverlap operator (∩) between two documents based on historical context and prior knowledge. Given a knowledge base K, Dovlp = ∩(DA, DB | K) (Radev, 2000).

All the approaches defined above have their specific use-cases and challenges; however, from a human-centered point of view, they may not reflect how humans generate semantic overlaps. A human would mostly express the overlap in the form of natural language, and this is why we frame the TextOverlap operator as a constrained summarization problem such that the information in the output summary is present in both input documents.

B Threshold Algorithm

Algorithm 2: Threshold Function
 1: procedure Threshold(rawSs, T)
 2:   initialize Labels ← []
 3:   for each element e in rawSs do
 4:     if e ≥ tu% then
 5:       Labels.append(P)
 6:     else if tl% ≤ e < tu% then
 7:       Labels.append(PP)
 8:     else
 9:       Labels.append(A)
10:     end if
11:   end for
12:   return Labels
13: end procedure

C ROUGE Scores

Table 11: Average ROUGE-F1 scores for all the tested models across the test dataset. For a particular sample, we take the maximum value out of the 4 F1 scores corresponding to the 4 reference summaries.

  Model     R1      R2      RL
  BART      40.73   25.97   29.95
  T5        38.50   24.63   27.73
  Pegasus   46.36   29.12   37.41
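As a companion to definition 1 in appendix A (referenced there), here is a minimal sketch of the frequency-aware n-gram overlap Dovlp with fi = min(fiA, fiB), assuming simple whitespace tokenization; the helper names are ours.

    from collections import Counter

    def ngram_counts(text: str, n: int) -> Counter:
        """Whitespace-tokenized n-gram frequencies of a document."""
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def ngram_overlap(doc_a: str, doc_b: str, n: int = 1) -> Counter:
        """Definition 1 of appendix A: common n-grams with frequency
        f_i = min(f_i^A, f_i^B). Counter '&' takes element-wise minimums."""
        return ngram_counts(doc_a, n) & ngram_counts(doc_b, n)

    print(ngram_overlap("the bill passed the senate on tuesday",
                        "on tuesday the senate passed the bill", n=2))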
D Motivation and Applications

Multiple alternative narratives are frequent in a variety of domains, including education, the health sector, and privacy, as well as technical areas such as Information Retrieval/Search Engines, QA, Translation, etc. In general, the MNSO/TextIntersect operation can be highly effective in digesting such multi-narratives (from various perspectives) at scale and speed. Here are a few examples of use-cases.

Peer-Reviewing: TextIntersect can extract the sections of multiple peer-reviews of an article that agree with one another, which can assist in creating a meta-review quickly.

Security and Privacy: By mining overlapping clauses from various privacy policies, the TextIntersect operation may assist real-world consumers in swiftly undertaking a comparative study of different privacy policies, thus allowing them to make informed judgments when selecting between multiple alternative web-services.

Health Sector: TextIntersect can be applied to compare clinical notes in patient records to reveal changes in a patient's condition or to perform comparative analysis of patients with the same diagnosis/treatment. For example, TextIntersect can be applied to the clinical notes of two different patients who went through the same treatment to assess the effectiveness of the treatment.

Military Intelligence: If A and B are two intelligence reports related to a mission coming from two human agents, the TextIntersect operation can help verify the claims in each report with respect to the other; thus, TextIntersect can be used as an automated claim-verification tool.

Computational Social Science and Journalism: Assume that two news agencies (with different political biases) are reporting the same real-world event and their bias is somewhat reflected in the articles they write. If A and B are two such news articles, then the TextIntersect operation will likely surface the facts (common information) about the event.

Here are some of the use-cases of MNSO in various technical areas.

Information Retrieval/Search Engines: One could summarize the common information in the multiple results fetched by a search engine for a given query and show it in a separate box to the user. This would immensely help the user to quickly parse the information rather than going through each individual article. If they desire, they could further explore the specific articles for more details.

Question Answering: Again, one could parse the common information/answer from multiple documents pertinent to the given query/question.

Robust Translation: Suppose you have multiple translation models which translate a given document from language A to language B. One could further apply the TextOverlap operator on the translated documents and get a robust translation.

In general, the MNSO task could be employed in any setting where we have comparative text analysis.
Table 12 (a): Machine-human agreement in terms of the reward function. Average precision and recall reward (mean ± std) between label-annotators (Li) and labels automatically inferred using SEM-F1, shown for different embedding models (section 8.1) and multiple threshold levels T = (tl, tu). For all annotators Li (i ∈ {1, 2, 3}), the numbers are quite high (≥ 0.50), and the reward values are consistent/stable across the embedding models and threshold values.

  Embedding  Reward     Li   (25,75)      (35,65)      (45,75)      (55,65)      (55,75)      (55,80)      (60,80)
  P-v1       Precision  L1   0.73 ± 0.27  0.81 ± 0.25  0.77 ± 0.26  0.85 ± 0.23  0.80 ± 0.24  0.77 ± 0.24  0.77 ± 0.26
  P-v1       Precision  L2   0.72 ± 0.30  0.73 ± 0.29  0.73 ± 0.30  0.78 ± 0.27  0.79 ± 0.27  0.75 ± 0.26  0.73 ± 0.29
  P-v1       Precision  L3   0.81 ± 0.23  0.86 ± 0.21  0.79 ± 0.24  0.78 ± 0.28  0.74 ± 0.28  0.69 ± 0.28  0.69 ± 0.27
  P-v1       Recall     L1   0.66 ± 0.19  0.79 ± 0.16  0.75 ± 0.16  0.76 ± 0.18  0.71 ± 0.17  0.66 ± 0.17  0.61 ± 0.18
  P-v1       Recall     L2   0.67 ± 0.19  0.78 ± 0.16  0.76 ± 0.15  0.73 ± 0.19  0.72 ± 0.18  0.70 ± 0.18  0.65 ± 0.21
  P-v1       Recall     L3   0.66 ± 0.15  0.72 ± 0.17  0.68 ± 0.17  0.68 ± 0.22  0.64 ± 0.20  0.59 ± 0.19  0.57 ± 0.20
  STSB       Precision  L1   0.75 ± 0.29  0.75 ± 0.29  0.75 ± 0.29  0.75 ± 0.29  0.75 ± 0.29  0.75 ± 0.30  0.75 ± 0.23
  STSB       Precision  L2   0.63 ± 0.32  0.63 ± 0.31  0.63 ± 0.32  0.63 ± 0.31  0.63 ± 0.32  0.64 ± 0.32  0.64 ± 0.32
  STSB       Precision  L3   0.81 ± 0.23  0.82 ± 0.23  0.81 ± 0.23  0.82 ± 0.23  0.81 ± 0.23  0.81 ± 0.22  0.81 ± 0.22
  STSB       Recall     L1   0.66 ± 0.21  0.67 ± 0.21  0.66 ± 0.21  0.68 ± 0.21  0.67 ± 0.21  0.65 ± 0.21  0.66 ± 0.21
  STSB       Recall     L2   0.57 ± 0.20  0.58 ± 0.21  0.57 ± 0.20  0.59 ± 0.20  0.59 ± 0.20  0.58 ± 0.20  0.58 ± 0.21
  STSB       Recall     L3   0.67 ± 0.19  0.67 ± 0.20  0.67 ± 0.19  0.68 ± 0.20  0.68 ± 0.19  0.67 ± 0.18  0.68 ± 0.18
  USE        Precision  L1   0.76 ± 0.29  0.77 ± 0.30  0.78 ± 0.27  0.80 ± 0.28  0.80 ± 0.27  0.77 ± 0.27  0.80 ± 0.27
  USE        Precision  L2   0.69 ± 0.32  0.66 ± 0.32  0.71 ± 0.30  0.68 ± 0.30  0.72 ± 0.30  0.76 ± 0.29  0.78 ± 0.29
  USE        Precision  L3   0.82 ± 0.24  0.85 ± 0.22  0.85 ± 0.23  0.86 ± 0.21  0.85 ± 0.23  0.82 ± 0.23  0.78 ± 0.25
  USE        Recall     L1   0.64 ± 0.19  0.67 ± 0.19  0.68 ± 0.19  0.70 ± 0.21  0.69 ± 0.22  0.64 ± 0.20  0.65 ± 0.21
  USE        Recall     L2   0.62 ± 0.19  0.63 ± 0.20  0.66 ± 0.18  0.66 ± 0.21  0.68 ± 0.20  0.68 ± 0.19  0.69 ± 0.21
  USE        Recall     L3   0.64 ± 0.16  0.68 ± 0.19  0.66 ± 0.16  0.69 ± 0.20  0.65 ± 0.19  0.60 ± 0.17  0.60 ± 0.18

Table 12 (b): Machine-human agreement in terms of Kendall rank correlation. Average precision and recall Kendall's Tau between label-annotators (Li) and labels automatically inferred using SEM-F1, shown for different embedding models (section 8.1) and multiple threshold levels T = (tl, tu). For all annotators Li (i ∈ {1, 2, 3}), the correlation numbers are quite high (≥ 0.50) and consistent/stable across the embedding models and threshold values. All values are statistically significant at p-value < 0.05.

  Embedding  Measure    Li   (25,75)  (35,65)  (45,75)  (55,65)  (55,75)  (55,80)  (60,80)
  P-v1       Precision  L1   0.55     0.60     0.58     0.59     0.57     0.56     0.54
  P-v1       Precision  L2   0.61     0.67     0.63     0.67     0.64     0.67     0.68
  P-v1       Precision  L3   0.54     0.62     0.56     0.64     0.60     0.56     0.52
  P-v1       Recall     L1   0.53     0.64     0.66     0.62     0.61     0.62     0.59
  P-v1       Recall     L2   0.55     0.64     0.67     0.63     0.63     0.64     0.61
  P-v1       Recall     L3   0.54     0.65     0.64     0.66     0.65     0.65     0.61
  STSB       Precision  L1   0.57     0.67     0.58     0.66     0.60     0.57     0.58
  STSB       Precision  L2   0.66     0.63     0.65     0.63     0.70     0.63     0.60
  STSB       Precision  L3   0.56     0.57     0.58     0.56     0.59     0.57     0.56
  STSB       Recall     L1   0.55     0.65     0.64     0.62     0.62     0.61     0.59
  STSB       Recall     L2   0.56     0.65     0.65     0.63     0.63     0.64     0.63
  STSB       Recall     L3   0.54     0.59     0.61     0.57     0.58     0.57     0.54
  USE        Precision  L1   0.58     0.62     0.60     0.61     0.59     0.62     0.65
  USE        Precision  L2   0.68     0.70     0.68     0.68     0.68     0.70     0.73
  USE        Precision  L3   0.66     0.67     0.65     0.64     0.63     0.53     0.56
  USE        Recall     L1   0.53     0.59     0.56     0.61     0.62     0.61     0.60
  USE        Recall     L2   0.54     0.60     0.61     0.62     0.64     0.64     0.62
  USE        Recall     L3   0.52     0.60     0.58     0.61     0.61     0.60     0.60
AllSides Dataset: Statistics

Table 13: Two input documents are concatenated to compute the statistics. The four numbers for the references (#words/#sents) in the Test split correspond to the 4 reference intersections.

  Split   #words (docs)   #sents (docs)   #words (reference/s)        #sents (reference/s)
  Train   1613.69         66.70           67.30                       2.82
  Test    959.80          44.73           65.46/38.06/21.72/32.82     3.65/2.15/1.39/1.52

Our test dataset consists of 137 samples, wherein each sample has 4 ground-truth references. Out of these 4 references, 3 were manually written by 3 reference annotators; thus, we generated 3 * 137 = 411 references in total. A recent paper (Fabbri et al., 2021) also incorporated human annotations for only 100 samples. Following them, we created reference summaries for 150 samples, which were later filtered down to 137 samples due to the minimum-15-words criterion described in section 5. Overall, we agree that having more samples in the test dataset would definitely help, but this is both a time- and money-consuming process. We are working towards it and would like to increase the number of test samples in the future.