Topic-Aware Evaluation and Transformer Methods for Topic-Controllable Summarization

Tatiana Passali                          Grigorios Tsoumakas
School of Informatics                    School of Informatics
Aristotle University of Thessaloniki     Aristotle University of Thessaloniki
Thessaloniki, Greece                     Thessaloniki, Greece
scpassali@csd.auth.gr                    greg@csd.auth.gr

Abstract

Topic-controllable summarization is an emerging research area with a wide range of potential applications. However, existing approaches suffer from significant limitations. First, there is currently no established evaluation metric for this task. Furthermore, existing methods are built upon recurrent architectures, which can significantly limit their performance compared to more recent Transformer-based architectures, while they also require modifications to the model's architecture for controlling the topic. In this work, we propose a new topic-oriented evaluation measure to automatically evaluate the generated summaries based on the topic affinity between the generated summary and the desired topic. We also conducted a user study that validates the reliability of this measure. Finally, we propose simple, yet powerful methods for topic-controllable summarization that either incorporate topic embeddings into the model's architecture or employ control tokens to guide the summary generation. Experimental results show that control tokens can achieve better performance compared to more complicated embedding-based approaches while being at the same time significantly faster.

1   Introduction

Neural abstractive summarization models have matured enough to consistently produce high quality summaries (See et al., 2017; Song et al., 2019; Dong et al., 2019; Zhang et al., 2020; Lewis et al., 2020). Building on top of this significant progress, an interesting challenge is to go beyond delivering a generic summary of a document, and instead produce a summary that focuses on specific aspects that pertain to the user's interests. For example, a financial journalist may need a summary of a news article to be focused on specific financial terms, like cryptocurrencies and inflation.

This challenge has been recently addressed by topic-controllable summarization techniques (Fan et al., 2018a; Krishna and Srinivasan, 2018; Frermann and Klementiev, 2019). However, the automatic evaluation of such techniques remains an open problem. Existing methods use the typical ROUGE score (Lin, 2004) for measuring summarization accuracy and then employ user studies to qualitatively evaluate whether the topic of the generated summaries matches the users' needs (Krishna and Srinivasan, 2018; Bahrainian et al., 2021). However, ROUGE can only be used for measuring the quality of the summarization output and cannot be readily used to capture the topical focus of the text. Thus, there is no clear way to automatically evaluate such approaches, since there is no evaluation measure designed specifically for topic-controllable summarization.

At the same time, existing models for topic-controllable summarization either incorporate topic embeddings into the model's architecture (Krishna and Srinivasan, 2018; Frermann and Klementiev, 2019) or modify the attention mechanism (Bahrainian et al., 2021). Since these approaches are restricted to very specific neural architectures, it is not straightforward to use them with any summarization model. Even though control tokens have been shown to be effective and efficient for controlling the output of a model for entity-based summarization (Fan et al., 2018a; He et al., 2020; Chan et al., 2021) without any modification to its codebase, they have not been explored yet for topic-controllable summarization.

Motivated by these observations, we propose a topic-aware evaluation measure for quantitatively evaluating topic-controllable summarization methods in an objective way, without involving expensive and time-consuming user studies. We conducted a user study to validate the reliability of this measure, as well as to interpret its score ranges.
In addition, we extend prior work on topic-controllable summarization in the following ways: a) we adapt topic embeddings from the traditional RNN architectures to the more recent and powerful Transformer architecture (Vaswani et al., 2017), and b) we employ control tokens via three different approaches: i) prepending the thematic category, ii) tagging the most representative tokens of the desired topic, and iii) using both prepending and tagging. For the tagging-based method, given a topic-labeled collection, we first extract keywords that are semantically related to the topic that the user requested and then employ special tokens to tag them before feeding the document to the summarization model. We also demonstrate that control tokens can successfully be applied to zero-shot topic-controllable summarization, while at the same time being significantly faster than the embedding-based formulation, and can be effortlessly combined with any neural architecture.

Our contributions can be summarized as follows:

   • We propose a topic-aware measure to quantitatively evaluate topic-oriented summaries, validated by a user study.

   • We propose topic-controllable methods for Transformers.

   • We provide an extensive empirical evaluation of the proposed methods, including an investigation of the zero-shot setting.

The rest of this paper is organized as follows. In Section 2 we review existing related literature on controllable summarization. In Section 3 we introduce the proposed topic-aware measure, while in Section 4 we present the proposed methods for topic-controllable summarization. In Section 5 we discuss the experimental results. Finally, conclusions are drawn and interesting future research directions are given in Section 6.

2   Related Work

2.1   Controllable text-to-text generation

Controllable summarization belongs to the broader field of controllable text-to-text generation (Liu and Yang, 2021; Pascual et al., 2021; Liu et al., 2021). Several approaches for controlling the model's output exist, either using embedding-based approaches (Krishna and Srinivasan, 2018; Frermann and Klementiev, 2019), prepending information using special tokens (Fan et al., 2018a; He et al., 2020; Liu and Chen, 2021), or using decoder-only architectures (Liu et al., 2021). Controllable summarization methods can influence several aspects of a summary, including its topic (Krishna and Srinivasan, 2018; Frermann and Klementiev, 2019; Bahrainian et al., 2021), length (Takase and Okazaki, 2019; Liu et al., 2020; Chan et al., 2021), style (Fan et al., 2018a), and the inclusion of named entities (Fan et al., 2018a; He et al., 2020; Liu and Chen, 2021; Dou et al., 2021; Chan et al., 2021). Despite the importance of controllable summarization, there are still limited datasets for this task, including ASPECTNEWS (Ahuja et al., 2022) for extractive aspect-based summarization and EntSUM (Maddela et al., 2022) for entity-based summarization. In this paper, we focus on topic-controllable abstractive summarization.

2.2   Improving summarization using topical information

The integration of topic modeling into summarization models has initially been used in the literature to improve the quality of existing state-of-the-art models. Ailem et al. (2019) enhance the decoder of a pointer-generator network using the information of the latent topics that are derived from LDA (Blei et al., 2003). Similar methods have been applied by Wang et al. (2020) using Poisson Factor Analysis (PFA) with a plug-and-play architecture that uses topic embeddings as an additional decoder input based on the most important topics from the input document. Liu and Yang (2021) propose to enhance summarization models using an Extreme Multi-Label Text Classification model to improve the consistency between the underlying topics of the input document and the summary, leading to summaries of higher quality. Zhu et al. (2021) use a topic-guided abstractive summarization model for Wikipedia articles, leveraging the topical information of Wikipedia categories. Even though Wang et al. (2020) refer to the potential of controlling the output conditioned on a specific topic using GPT-2 (Radford et al., 2019), all the aforementioned approaches are focused on improving the accuracy of existing summarization models instead of influencing the summary generation towards a particular topic.
2.3   Topic-control in neural abstractive summarization

Some steps towards controlling the output of a summarization model conditioned on a thematic category have been made by Krishna and Srinivasan (2018) and Frermann and Klementiev (2019), who proposed embedding-based controllable summarization models on top of the pointer generator network (See et al., 2017). Krishna and Srinivasan (2018) integrate the topical information into the model as a topic vector, which is then concatenated with each of the word embeddings of the input text. Bahrainian et al. (2021) propose to incorporate the topical information into the attention mechanism of the pointer generator network, using an LDA model (Blei et al., 2003). All these methods are based on recurrent neural networks and require modifications to the architecture of a model. Our work adapts the embedding-based paradigm to Transformers (Vaswani et al., 2017), and employs control tokens, which can be applied effortlessly and efficiently to any model architecture as well as to the zero-shot setup.

3   Topic-aware Evaluation Measure

We propose a new topic-aware measure, called Summarization Topic Affinity Score (STAS), to evaluate the generated summaries according to their semantic similarity with the desired topic. STAS assumes the existence of a predefined set of topics T, and that each topic t ∈ T is defined via a set D_t of relevant documents.

STAS is computed on top of vector representations of topics and summaries. Several options exist for extracting such representations, ranging from simple bag-of-words models to sophisticated language models. For simplicity, this work employs the tf-idf model, where idf is computed across all documents ∪_{t ∈ T} D_t. For each topic t, we compute a topic representation y_t by averaging the tf-idf representations x_d of each document d ∈ D_t (see Fig. 1):

    y_t = (1 / |D_t|) Σ_{d ∈ D_t} x_d                                 (1)

[Figure 1: Obtaining representative words, given a topic-assigned document collection. Documents are mapped to tf-idf vectors and grouped by topic, e.g., Sports, Politics, Business & Finance, and Health Care.]

The representation of the predicted summary, y_s, is computed using tf-idf too, so as to lie in the same vector space with the topic vectors.

To calculate STAS, we compute the cosine similarity between the representations of the summary, y_s, and the desired topic, y_t, divided by the maximum cosine similarity between the representations of the summary and all topics:

    STAS(y_s, y_t) = s(y_s, y_t) / max_{z ∈ T} s(y_s, y_z)            (2)

where s(y_s, y_t) indicates the cosine similarity between the two vectors y_s and y_t, computed as follows:

    s(y_s, y_t) = (y_s · y_t) / (‖y_s‖ ‖y_t‖)                         (3)

Summaries that are similar to the requested topic receive high STAS values, while those that are dissimilar receive low STAS values. It is worth noting that cosine similarity values can differ significantly across topics, due to the varying overlap between the common words that appear in the summaries and the different topics. Dividing by the maximum value takes this phenomenon into account, leading to comparable STAS values across topics.
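For concreteness, the following sketch shows how STAS can be computed with scikit-learn's TfidfVectorizer. It is a minimal illustration of Eqs. (1)-(3) rather than our released implementation, and the names (stas, topic_docs, summary) are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def stas(summary, topic, topic_docs):
    """Summarization Topic Affinity Score (Eq. 2).

    summary: generated summary (string)
    topic: requested topic t, a key of topic_docs
    topic_docs: dict mapping each topic t in T to its documents D_t
    """
    # idf is computed across the union of all topic documents (Section 3)
    corpus = [doc for docs in topic_docs.values() for doc in docs]
    vectorizer = TfidfVectorizer().fit(corpus)

    # y_t: average tf-idf vector of the documents of each topic (Eq. 1)
    centroids = {t: np.asarray(vectorizer.transform(docs).mean(axis=0))
                 for t, docs in topic_docs.items()}

    # y_s: tf-idf vector of the summary, in the same vector space
    y_s = vectorizer.transform([summary])

    # Cosine similarity of the summary to every topic (Eq. 3)
    sims = {t: cosine_similarity(y_s, y_t)[0, 0] for t, y_t in centroids.items()}

    # Normalize by the maximum similarity over all topics (Eq. 2),
    # which makes the score comparable across topics
    return sims[topic] / max(sims.values())
```

Note that the denominator includes the requested topic itself, so STAS is bounded above by 1 and equals 1 when the requested topic is the most similar one.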
4   Topic Control with Transformers

In this section, we present the proposed topic-controllable methods, which fall into two different categories: a) incorporating topic embeddings into the Transformer architecture, and b) employing control tokens before feeding the input to the model. Note that, similar to the STAS measure, for all the proposed methods we assume the existence of a predefined set of topics, where each topic is represented by a set of relevant documents.
4.1   Topic Embeddings

As described in Section 2, Krishna and Srinivasan (2018) generate topic-oriented summaries by concatenating topic embeddings to the token embeddings of a pointer generator network (See et al., 2017). The topic embeddings are represented as one-hot encoding vectors with a size equal to the total number of topics. During training, the model takes as input the corresponding topic embedding along with the input document.

However, this method cannot be directly applied to pre-trained Transformer-based models due to the different shapes of the initialized weights of the word and position embeddings. Unlike RNNs, Transformer-based models are typically trained for general tasks and then fine-tuned with less data for more specific tasks like summarization. Thus, the architecture of a pre-trained model is already defined. Concatenating the topic embeddings with the contextual word embeddings of a Transformer-based model would require re-training the whole summarization model from scratch with the appropriate dimension. However, this would be very computationally demanding, as it would require a large amount of data and time.

Instead of concatenation, we propose to sum the topic embeddings, following the rationale of positional encoding, where token embeddings are summed with positional encoding representations to create an input representation that contains the position information. Instead of one-hot encoding embeddings, we use trainable embeddings, allowing the model to optimize them during training. The topic embeddings have the same dimensionality as the token embeddings.

During training, we sum the trainable topic embeddings with the token and positional embeddings, modifying the input representation as follows:

    z_i = WE(x_i) + PE(i) + TE                                        (4)

where WE, PE and TE are the word embeddings, positional encoding and topic embeddings respectively, for token x_i in position i. During inference, the model generates the summary based on the trained topic embeddings, according to the desired topic.
4.2   Control Tokens

We propose three different approaches to control the generation of the output summaries using control tokens: a) prepending the thematic category as a special token to the document, b) tagging the representative terms of each topic with special tokens, and c) combining both control tokens.

4.2.1   Prepending

As described in Section 1, there are several controllable approaches that prepend information to the input source using special tokens to influence different aspects of the text, such as the style (Fan et al., 2018a) or the presence of a particular entity (He et al., 2020; Chan et al., 2021; Dou et al., 2021). Even though this technique can be readily combined with topic-controllable summarization, this direction has not been explored yet. To this end, we adapt these approaches to work with topic information by simply placing the desired thematic category at the beginning of the document. For example, prepending "Sports" to the beginning of a document indicates that we want to generate a summary focused on the topic "Sports".

During training, we prepend to the input the topic of the target summary, according to the training dataset. During inference, we prepend to the document a special token according to the user's requested topic.
4.2.2   Tagging

Going one step further, we propose another method for controlling the output of a model, based on tagging the most representative terms of each thematic category using a special control token. The proposed mechanism assumes the existence of a set of the most representative terms for each topic. Then, given a document and a requested topic, we employ a special token to tag the most representative terms before feeding the document to the summarization model, aiming to guide the summary generation towards this topic.

We extract the N representative terms of each topic by considering the terms with the top-N tf-idf scores in the corresponding topic vector, as extracted in Section 3. An example of some indicative representative words for a number of topics from the Vox dataset (Vox Media, 2017) is shown in Table 1.

Table 1: Representative terms for topics from the 2017 KDD Data Science+Journalism Workshop dataset (Vox Media, 2017).

    Topic         Terms
    Politics      policy, president, state, political, vote, law, country, election
    Sports        game, sport, team, football, fifa, nfl, player, play, soccer, league
    Health Care   patient, uninsured, insurer, plan, coverage, care, insurance
    Education     student, college, school, education, test, score, loan, teacher

Before training with a neural abstractive model, we pre-process each input document by surrounding the representative terms of its summary's topic with the special token [TAG]. This way, during training the model learns to pay attention to tagged words, as they are aligned with the topic of the summary. To influence the summary generation towards a desired topic during inference, the terms of this topic that appear inside the input document are again tagged with the special token.

As mentioned in Section 4.2, the tagging-based method can be readily combined with the prepending mechanism.
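A sketch of the tagging step is shown below, combining the top-N term extraction with a simple lemma match. The helper names are illustrative, and matching on whitespace-split tokens is a simplification of the actual tokenization; in practice [TAG] would also be registered as a special token of the model's tokenizer.

```python
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def top_terms(vectorizer, topic_vector, n=100):
    """Top-N representative terms of a topic: the terms with the
    highest tf-idf scores in its topic vector."""
    vocab = vectorizer.get_feature_names_out()
    top = topic_vector.ravel().argsort()[::-1][:n]
    return {vocab[i] for i in top}

def tag_document(document, terms):
    """Surround the representative terms of the requested topic with [TAG]."""
    tagged = []
    for word in document.split():
        lemma = lemmatizer.lemmatize(word.lower())
        tagged.append(f"[TAG] {word} [TAG]" if lemma in terms else word)
    return " ".join(tagged)
```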
4.2.3   Topical Training Dataset

All the aforementioned methods assume the existence of a training dataset where each summary is associated with a particular topic. However, there are no existing datasets for summarization that contain summaries according to the different topical aspects of the text. Thus, we adopt the approach of Krishna and Srinivasan (2018) to compile the same topic-oriented dataset that builds upon CNN/DailyMail (Hermann et al., 2015).

To create this dataset, Krishna and Srinivasan (2018) synthesize new super-articles by combining two different articles of the original dataset and keeping the summary of only one of them. The final topic-oriented dataset consists of super-articles that discuss two distinct topics, but are assigned each time to one of the corresponding summaries. Therefore, the model learns to distinguish the most important sentences for the corresponding topic during training. We use this dataset to fine-tune all the aforementioned methods.
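A sketch of the super-article construction under these assumptions is given below. How the two articles are combined and how pairs are sampled follow Krishna and Srinivasan (2018); this version, which simply concatenates randomly paired articles, is only illustrative.

```python
import random

def make_super_articles(articles, num_examples, seed=42):
    """Synthesize super-articles: each combines two articles of distinct
    topics but keeps the summary (and topic label) of only one of them."""
    rng = random.Random(seed)
    examples = []
    while len(examples) < num_examples:
        a, b = rng.sample(articles, 2)
        if a["topic"] == b["topic"]:
            continue  # the two articles must discuss two distinct topics
        examples.append({
            "document": a["document"] + " " + b["document"],  # concatenation shown
            "summary": a["summary"],   # target summary of article a only
            "topic": a["topic"],       # the topic the summary is assigned to
        })
    return examples
```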
5   Empirical Evaluation

In this section, we present and discuss the experimental evaluation results.

5.1   Experimental Setup

Datasets. We use the following two datasets to evaluate the proposed models:

   • CNN/DailyMail is an abstractive summarization dataset with articles from CNN and DailyMail accompanied by human-generated bullet summaries (Hermann et al., 2015). We use the anonymized version of the dataset, similar to See et al. (2017).

   • Topic-Oriented CNN/DailyMail is a synthetic version of CNN/DailyMail which contains super-articles of two different topics, accompanied by the summary of one of the two topics.

To compile the topic-oriented CNN/DailyMail, any dataset that contains topic annotations can be used. Following Krishna and Srinivasan (2018), we also use the Vox dataset (Vox Media, 2017), which consists of 23,024 news articles from 185 different topical categories. We discarded topics with relatively low frequency, i.e., fewer than 20 articles, as well as articles assigned to general categories that do not explicitly discuss a topic, i.e., "The Latest", "Vox Articles", "On Instagram" and "On Snapchat". The tf-idf topic vector representations for each document in the corpus were extracted using the tf-idf vectorizer provided by the Scikit-learn library (Pedregosa et al., 2011). After pre-processing, we end up with 14,312 articles from 70 categories out of the 185 initial topical categories. The final topic-oriented dataset consists of 132,766, 5,248, and 6,242 articles for training, validation, and test, respectively, while the original CNN/DailyMail consists of 287,113, 13,368 and 11,490 articles. Some dataset statistics are shown in Table 2.

Table 2: Dataset statistics. Size is measured in articles and length is measured in tokens.

                                   Size                   Length
    Dataset              Train     Val.     Test      Doc.    Sum.
    CNN/DM (original)    287,113   13,368   11,490     781     56
    CNN/DM (synthetic)   132,766    5,248    6,242   1,544     56

Preprocessing. For the tagging-based method, all the words of the input document are lemmatized to their roots using NLTK (Bird, 2006). Then, we tag the words of the document whose lemmatized forms match the representative words of the desired topic, based on the top-N = 100 most representative terms for each topic. For the prepend method, the corresponding topic is prepended to the document.
Evaluation Metrics. All methods were evaluated using both the well-known ROUGE score (Lin, 2004), to measure the quality of the generated summary, and the proposed STAS measure.

Models and training. For all the conducted experiments, we employed the BART-large architecture (Lewis et al., 2020), which is a Transformer-based model with a bidirectional encoder and an auto-regressive decoder. BART-large consists of 12 layers for both the encoder and the decoder, and 406M parameters. We used the implementation provided by Hugging Face (Wolf et al., 2020).

All the models are fine-tuned for 100,000 steps with a learning rate of 3×10⁻⁵ and batch size 4, with early stopping on the validation set. For all the experiments, we use the established parameters for the BART-large architecture, using label-smoothed cross-entropy loss (Pereyra et al., 2017) with the label smoothing factor set to 0.1. For all the experiments, we use PyTorch version 1.10 and Hugging Face version 4.11.0. The code and the compiled dataset are publicly available¹.

¹ https://github.com/tatianapassali/topic-controllable-summarization
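For reference, a minimal inference sketch with the Hugging Face implementation is shown below. The checkpoint path is a placeholder for a BART-large model fine-tuned as described above, and the generation parameters are illustrative defaults rather than the exact values used in our experiments.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Placeholder path to a BART-large checkpoint fine-tuned on the topic-oriented dataset
model = BartForConditionalGeneration.from_pretrained("path/to/finetuned-bart-large")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

article = "The match ended with a late winner ..."  # input super-article (placeholder)
document = "Sports " + article  # control token: prepend the requested topic (Section 4.2.1)

inputs = tokenizer(document, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=142)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```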
5.2   Results

The evaluation results on the compiled topic-oriented dataset are shown in Table 3. Our results include the following models:

   • PG (See et al., 2017) is the generic pointer generator network, which is based on an RNN architecture.

   • Topic-Oriented PG (Krishna and Srinivasan, 2018) is the topic-oriented pointer generator network, also based on an RNN architecture.

   • BART (Lewis et al., 2020) is the generic BART model, which is based on the Transformer architecture.

   • BARTemb is the proposed topic-oriented embedding-based extension of the BART model.

   • BARTtag is the proposed topic-oriented tagging-based extension of the BART model.

   • BARTprepend is the proposed topic-oriented prepend extension of the BART model.

   • BARTprepend+tag is the combination of the tagging-based extension and the prepend mechanism.

The experimental results reported in Table 3 indicate that topic-oriented methods indeed perform significantly better compared to the baseline methods that do not take into account the topic requested by the user. Furthermore, the proposed BART-based formulation significantly outperforms the generic PG approach, regardless of the applied topic mechanism (BARTemb, BARTtag or BARTprepend). However, when BARTtag is combined with the prepending mechanism, we obtain the best results of all the evaluated methods. Also, as we further demonstrate later, all the approaches that use control tokens are significantly faster than the embedding-based approach, leading to the overall best trade-off between accuracy and speed.

Table 3: Experimental results on the compiled topic-oriented dataset based on CNN/DailyMail. We report F1 scores for ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L).

                                R-1      R-2      R-L
    PG                          26.8      9.2     24.5
    BART                        30.46    11.92    20.57
    Topic-Oriented PG           34.1     13.6     31.2
    BARTtag (Ours)              39.30    18.06    36.67
    BARTemb (Ours)              40.15    18.53    37.41
    BARTprepend (Ours)          41.58    19.55    38.74
    BARTprepend+tag (Ours)      41.66    19.57    38.83
In Table 4, we also provide an experimental evaluation using the proposed Summarization Topic Affinity Score (STAS) measure. The effectiveness of the topic-oriented approaches is further highlighted under this measure, since the improvements acquired when applying all the proposed methods are much larger than under the ROUGE score. Both the embedding and the tagging methods lead to similar results (∼68.5%) under STAS, while the prepending method achieves even better results (∼72%). Finally, when we combine the tagging method with the prepending mechanism, we observe additional gains, outperforming all the evaluated methods with a 72.36% STAS score.

Table 4: Evaluation based on the proposed Summarization Topic Affinity Score (STAS).

                                STAS (%)
    BART                         51.86
    BARTtag (Ours)               68.42
    BARTemb (Ours)               68.50
    BARTprepend (Ours)           71.90
    BARTprepend+tag (Ours)       72.36

The inference times of all the methods are shown in Table 5. The inference time of all the methods that use control tokens is significantly smaller, improving the performance of the model by almost one order of magnitude. Indeed, all the control-token approaches can perform inference on 100 articles in less than 40 seconds, while the embedding-based formulation requires more than 300 seconds for the same task.

Table 5: Inference time for 100 articles. All numbers are reported in seconds.

                    Control Tokens   Inference   Total time
    BARTemb                -           303.0        303.0
    BARTtag               7.1           32.0         39.1
    BARTprepend

5.3   Zero-Shot Evaluation

We also evaluate the methods that use control tokens on topics that were not seen during training, as reported in Table 6. We do not report results for BARTemb, since this method cannot be directly applied to a zero-shot setting.

Table 6: Results on the test set with unseen topics.

                        R-1      R-2      R-L     STAS (%)
    BARTtag             37.52    16.99    35.58    74.80
    BARTprepend         38.13    17.84    35.69    74.67
    BARTprepend+tag     39.22    18.81    36.81    77.94

Even though the models have not seen the zero-shot topics during training, they can successfully generate topic-oriented summaries for these topics, achieving similar results in terms of both ROUGE-1 and STAS, with the BARTprepend+tag method outperforming all the other methods with more than 39 ROUGE-1 score and ∼78% STAS score. This finding further confirms the capability of methods that use control tokens to generalize successfully to unseen topics.

5.4   Experimental Results on Original CNN/DailyMail

We evaluate the proposed methods on the original CNN/DailyMail using both an oracle setup, where the topic information is extracted from the target summary according to the assigned topical dataset, and a non-oracle setup, where the topic information is extracted directly from the input document.
Table 7: Experimental results on the test set of the original CNN/DailyMail, with oracle and non-oracle guidance.

                                Oracle                   Non-oracle
                      R-1      R-2      R-L    STAS (%)   STAS (%)
    BARTemb           42.93    20.27    40.14    71.14      66.76
    BARTtag           42.54    20.11    39.80    71.51      67.09
    BARTprepend       42.75    20.20    39.94    74.20      69.67
    BARTprepend+tag   43.35    20.66    40.53    74.23      70.09

5.5   User Study

For the user study, we asked participants (i.e., graduate and undergraduate students) to rate how relevant the generated summary is with respect to the given topic on a scale of 1 to 10. Each time, we randomly picked a relevant or an irrelevant topic for the given summary. Each summary was annotated by an average of 3.5 annotators, and annotations differing by more than 6 points were manually discarded. The inter-annotator agreement across the raters for each sample was 0.90, measured using Krippendorff's alpha coefficient (Krippendorff, 2004) with ordinal weights (Hayes and Krippendorff, 2007).
To measure the correlation between human evaluation and the proposed metric, we obtained the STAS scores for the same summary-topic pairs. We use two statistical correlation measures: Spearman's and Pearson's correlation coefficients. As shown in Table 8, the correlation for both metrics is positive and very close to 1 (∼0.8 for both metrics). Also, as indicated by both very low p-values, we can conclude with high confidence that STAS is strongly correlated with the results of human evaluation. This observation is further confirmed by the almost linear relation between the average STAS scores and the human annotators' scores, as shown in Fig. 2.

Table 8: Correlation between human evaluation and the STAS measure.

    Metric      Correlation   p-value
    Pearson     0.83          6.8e-16
    Spearman    0.82          1.5e-16

[Figure 2: Average STAS scores versus human annotators' scores (1-10), highlighting the high correlation between human annotators and the proposed metric.]
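The correlation analysis itself is standard; a sketch with SciPy is shown below, where the paired score lists are placeholders for the collected summary-topic ratings.

```python
from scipy.stats import pearsonr, spearmanr

# Paired scores for the same summary-topic pairs (placeholder values)
stas_scores = [0.91, 0.35, 0.78, 0.52, 0.88]
human_scores = [9, 2, 8, 5, 9]

r_p, p_p = pearsonr(stas_scores, human_scores)
r_s, p_s = spearmanr(stas_scores, human_scores)
print(f"Pearson r = {r_p:.2f} (p = {p_p:.1e}), Spearman rho = {r_s:.2f} (p = {p_s:.1e})")
```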
6   Conclusions and Future Work

We proposed STAS, a structured way to evaluate the generated summaries. In addition, we conducted a user study to validate and interpret the STAS score ranges. We also proposed topic-controllable methods that employ either topic embeddings or control tokens, demonstrating that the latter can successfully influence the summary generation towards the desired topic, even in a zero-shot setup. Our empirical evaluation further highlighted the effectiveness of control tokens, achieving better performance than embedding-based methods while being significantly faster and easier to apply.

Future research could examine other controllable aspects, such as style (Fan et al., 2018b), entities (He et al., 2020) or length (Takase and Okazaki, 2019). In addition, the tagging-based method could be further extended to work with any arbitrary topic, bypassing the requirement of having a labeled document collection for a topic in order to guide the summary towards this topic.
References

Ojas Ahuja, Jiacheng Xu, Akshay Gupta, Kevin Horecka, and Greg Durrett. 2022. ASPECTNEWS: Aspect-oriented summarization of news documents. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6494–6506, Dublin, Ireland. Association for Computational Linguistics.

Melissa Ailem, Bowen Zhang, and Fei Sha. 2019. Topic augmented generator for abstractive summarization. arXiv preprint arXiv:1908.07026.

Seyed Ali Bahrainian, George Zerveas, Fabio Crestani, and Carsten Eickhoff. 2021. CATS: Customizable abstractive topic-based summarization. ACM Transactions on Information Systems, 40(1).

Steven Bird. 2006. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 69–72, Sydney, Australia. Association for Computational Linguistics.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Hou Pong Chan, Lu Wang, and Irwin King. 2021. Controllable summarization with constrained Markov decision process. Transactions of the Association for Computational Linguistics, 9:1213–1232.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.

Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2021. GSum: A general framework for guided neural abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4830–4842, Online. Association for Computational Linguistics.

Angela Fan, David Grangier, and Michael Auli. 2018a. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54, Melbourne, Australia. Association for Computational Linguistics.

Angela Fan, David Grangier, and Michael Auli. 2018b. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54, Melbourne, Australia. Association for Computational Linguistics.

Lea Frermann and Alexandre Klementiev. 2019. Inducing document structure for aspect-based summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6263–6273.

Andrew F. Hayes and Klaus Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77–89.

Junxian He, Wojciech Kryscinski, Bryan McCann, Nazneen Rajani, and Caiming Xiong. 2020. CTRLsum: Towards generic controllable text summarization. arXiv preprint arXiv:2012.04281.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Klaus Krippendorff. 2004. Content Analysis: An Introduction to Its Methodology. Sage.

Kundan Krishna and Balaji Vasan Srinivasan. 2018. Generating topic-oriented summaries using neural attention. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1697–1705.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).

Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, Online. Association for Computational Linguistics.

Jingzhou Liu and Yiming Yang. 2021. Enhancing summarization with text classification via topic consistency. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 661–676. Springer.

Yizhu Liu, Zhiyi Luo, and Kenny Q. Zhu. 2020. Controlling length in abstractive summarization using a convolutional neural network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018.

Zhengyuan Liu and Nancy Chen. 2021. Controllable neural dialogue summarization with personal named entity planning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 92–106, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Mounica Maddela, Mayank Kulkarni, and Daniel Preotiuc-Pietro. 2022. EntSUM: A data set for entity-centric extractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3355–3366, Dublin, Ireland. Association for Computational Linguistics.

Damian Pascual, Beni Egressy, Clara Meister, Ryan Cotterell, and Roger Wattenhofer. 2021. A plug-and-play method for controlled text generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3973–3997.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 2017 Annual Meeting of the Association for Computational Linguistics, pages 1073–1083.

Shengli Song, Haitao Huang, and Tongxiao Ruan. 2019. Abstractive text summarization using LSTM-CNN based deep learning. Multimedia Tools and Applications, 78.

Sho Takase and Naoaki Okazaki. 2019. Positional encoding to control output sequence length. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), volume 1.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Vox Media. 2017. Vox Dataset (DS+J Workshop).

Zhengjue Wang, Zhibin Duan, Hao Zhang, Chaojie Wang, Long Tian, Bo Chen, and Mingyuan Zhou. 2020. Friendly topic assistant for transformer based abstractive summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 485–497, Online. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.

Fangwei Zhu, Shangqing Tu, Jiaxin Shi, Juanzi Li, Lei Hou, and Tong Cui. 2021. TWAG: A topic-guided Wikipedia abstract generator. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4623–4635, Online. Association for Computational Linguistics.

A   Example of Generated Summaries

We present some examples of generated summaries using the tagging-based model on the created dataset for different topics, along with the generic summaries generated by the generic BART model, as shown in Tables 9 and 10. The proposed model can shift the generation towards the desired topic of the super-article, which contains two different topics, in contrast with the generic BART model, which discusses only one topic of the article.
Table 9: Generated summaries according to the two different topics of a super-article containing articles of these topics, along with the generic summary.

    Generic summary: Sir Bradley Wiggins has been included in Team Sky's line-up for Sunday's Gent-Wevelgem sprint classic. Wiggins, the 2012 Tour de France champion, will be joined by fellow Great Britain Olympian Geraint Thomas for the 239-kilometre one-day event in Belgium. Bernhard Eisel, who won the event in 2010, is also included in an eight-man team along with Christian Knees, Ian Stannard, Andy Fenn, Luke Rowe and Elia Viviani.

    Gender-based Violence: Staff at Venice High School in Los Angeles informed police of a suspected attack on a female student on Tuesday. However the LAPD found the sexual misconduct involved another student and had occurred frequently for more than a year. An investigation has since identified 14 suspects between the ages of 14 and 17 who have allegedly been involved in sexual activity at the school. Eight youngsters from the school were arrested on Friday while another was arrested at another location.

    Sports: Sir Bradley Wiggins has been included in Team Sky's line-up for Sunday's Gent-Wevelgem sprint classic. Wiggins, the 2012 Tour de France champion, will be joined by fellow Great Britain Olympian Geraint Thomas for the 239-kilometre one-day event in Belgium. Giant-Alpecin's John Degenkolb is the defending champion in Flanders.

Table 10: Generated summaries similar to Table 9.

    Generic summary: Ford unveiled two prototype electric bikes at Mobile World Congress in Barcelona. MoDe:Me and MoDe:Pro are powered by 200-watt motors and fold to fit on a train or in the boot of a car. With pedal assist they help riders reach speeds of up to 15mph (25km/h). The bikes are part of an experiment by Ford called Handle on Mobility.

    Transportation: Ford unveiled two prototype electric bikes at Mobile World Congress in Barcelona. The MoDe:Me and MoDe:Pro are powered by 200-watt motors. They fold to fit on a train or in the boot of a car. With pedal assist, riders reach speeds of up to 15mph (25km/h).

    Neuroscience: Researchers from Bristol University measured biosonar bat calls to calculate what members of a group perceived as they foraged for food. A pair of Daubenton's bats foraged low over water for stranded insects at a site near the village of Barrow Gurney, in Somerset. It found the bats interact by swapping between leading and following, and they swap these roles by copying the route a nearby individual was using up to 500 milliseconds earlier.