NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias

Nayeon Lee    Yejin Bang    Tiezheng Yu    Andrea Madotto∗    Pascale Fung
Hong Kong University of Science and Technology
nyleeaa@connect.ust.hk, pascale@ece.ust.hk

arXiv:2204.04902v3 [cs.CL] 3 May 2022

Abstract

Media news framing bias can increase political polarization and undermine civil society. The need for automatic mitigation methods is therefore growing. We propose a new task: neutral summary generation from multiple news articles of varying political leanings to facilitate balanced and unbiased news reading. In this paper, we first collect a new dataset, illustrate insights about framing bias through a case study, and propose a new effective metric and model (NEUS-TITLE) for the task. Based on our discovery that the title provides a good signal for framing bias, we present NEUS-TITLE, which learns to neutralize news content in hierarchical order from title to article. Our hierarchical multi-task learning is achieved by formatting our hierarchical data pair (title, article) sequentially with identifier-tokens (“TITLE=>”, “ARTICLE=>”) and fine-tuning the auto-regressive decoder with the standard negative log-likelihood objective. We then analyze and point out the remaining challenges and future directions. One of the most interesting observations is that neural NLG models can hallucinate not only factually inaccurate or unverifiable content but also politically biased content.

[Figure 1: Illustration of the proposed task. We want to generate a neutral summarization of news articles from varying political orientations. Orange highlights indicate phrases that can be considered framing bias.]

1 Introduction

Media framing bias occurs when journalists make skewed decisions regarding which events or information to cover (information bias) and how to cover them (lexical bias) (Entman, 2002; Groeling, 2013). Even if the reporting of the news is based on the same set of underlying issues or facts, the framing of that issue can convey a radically different impression of what actually happened (Gentzkow and Shapiro, 2006). Since the news media plays a crucial role in shaping public opinion toward various important issues (De Vreese, 2004; McCombs and Reynolds, 2009; Perse and Lambe, 2016), bias in media reporting can reinforce the problem of political polarization and undermine civil society rights.

Allsides.com (Sides, 2018) mitigates this problem by displaying articles from various media in a single interface along with an expert-written roundup of news articles. This roundup is a neutral summary that lets readers grasp a bias-free understanding of an issue before reading individual articles. Although Allsides fights framing bias, scalability remains a bottleneck due to the time-consuming human labor needed for composing the roundup. Multi-document summarization (MDS) models (Lebanoff et al., 2018; Liu and Lapata, 2019) could be one possible choice for automating the roundup generation, as both multi-document summaries and roundups share a similar nature in extracting salient information from multiple input articles. Yet the ability of MDS models to provide a neutral description of a topical issue – a crucial aspect of the roundup – remains unexplored.

∗ This work was done when the author was studying at The Hong Kong University of Science and Technology.
In this work, we fill this research gap by proposing the task of Neutral multi-news Summarization (NEUS), which aims to generate a framing-bias-free summary from news articles with varying degrees and orientations of political bias (Fig. 1). To begin with, we construct a new dataset by crawling Allsides.com, and investigate how framing bias manifests in the news so as to provide a more profound and comprehensive analysis of the problem. The first important insight from our analysis is a close association between framing bias and the polarity of the text. Grounded on this basis, we propose a polarity-based framing-bias metric that is simple yet effective in terms of alignment with human perceptions. The second insight is that titles serve as a good indicator of framing bias. Thus, we propose NEUS models that leverage news titles as an additional signal to increase awareness of framing bias.

Our experimental results provide rich ideas for understanding the problem of mitigating framing bias. Primarily, we explore whether existing summarization models can already solve the problem and empirically demonstrate their shortcomings in addressing the stylistic aspect of framing bias. After that, we investigate and discover an interesting relationship between framing bias and hallucination, an important safety-related problem in generation tasks. We empirically show that hallucinatory generation risks being not only factually inaccurate and/or unverifiable but also politically biased and controversial. To the best of our knowledge, this aspect of hallucination has not been previously discussed. We thus hope to encourage more attention toward hallucinatory framing bias, to prevent automatic generations from fueling political bias and polarization.

We conclude by discussing the remaining challenges to provide insights for future work. We hope our work on the proposed NEUS task serves as a good starting point to promote the automatic mitigation of media framing bias.

2 Related Works

Media Bias   Media bias has been studied extensively in various fields such as social science, economics, and political science. Media bias is known to affect readers’ perceptions of news in three main ways: priming, agenda-setting, and framing1 (Scheufele, 2000). Framing is a broad term that refers to any factor or technique that affects how individuals perceive a certain reality or information (Goffman, 1974; Entman, 1993, 2007; Gentzkow and Shapiro, 2006). In the context of news reports, framing is about how an issue is characterized by journalists and how readers take in the information to form their impression (Scheufele and Tewksbury, 2007). Our work specifically focuses on framing “bias” that exists as a form of text in the news. More specifically, we focus on writing factors such as word choice and the commission of extra information that sway an individual’s perception of certain events.

1 Priming happens when news reporting tells the reader in what context they should evaluate an event; agenda-setting is when news reporting tells readers which problems are the most important to think about.

Media Bias Detection   In natural language processing (NLP), computational approaches for detecting media bias often consider linguistic cues that induce bias in political text (Recasens et al., 2013; Yano et al., 2010; Lee et al., 2019; Morstatter et al., 2018; Hamborg et al., 2019b; Lee et al., 2021b; Bang et al., 2021). For instance, Gentzkow and Shapiro count the frequency of slanted words within articles. These methods mainly focus on the stylistic (“how to cover”) aspect of framing bias. However, relatively fewer efforts have been made toward the informational (“what to cover”) aspect (Park et al., 2011; Fan et al., 2019). The majority of the literature on informational detection focuses on the more general factual domain (non-political information) under the name of “fact-checking” (Thorne et al., 2018; Lee et al., 2018, 2021a, 2020). However, these methods cannot be directly applied to media bias detection because no reliable source of gold-standard truth exists against which biased text can be fact-checked.

Media Bias Mitigation   News aggregation, by displaying articles from different news outlets on a particular topic (e.g., Google News,2 Yahoo News3), is the most common approach to mitigating media bias (Hamborg et al., 2019a). However, news aggregators require willingness and effort from readers to resist framing biases and identify the neutral facts from differently framed articles. Other approaches have been proposed to

2 https://news.google.com/
3 https://news.yahoo.com/
provide additional information (Laban and Hearst, 2017), such as automatic classification of multiple viewpoints (Park et al., 2009), multinational perspectives (Hamborg et al., 2017), and detailed media profiles (Zhang et al., 2019b). However, these methods focus on providing a broader perspective to readers from an enlarged selection of articles, which still puts the burden of mitigating bias on the readers. Instead, we propose to automatically neutralize and summarize partisan articles to produce a neutral article summary.

Multi-document Summarization   As a challenging subtask of automatic text summarization, multi-document summarization (MDS) aims to condense a set of documents into a short and informative summary (Lebanoff et al., 2018). Recently, researchers have applied deep neural models to the MDS task thanks to the introduction of large-scale datasets (Liu et al., 2018; Fabbri et al., 2019). With the advent of large pre-trained language models (Lewis et al., 2019; Raffel et al., 2019), researchers have also applied them to improve MDS model performance (Jin et al., 2020; Pasunuru et al., 2021). In addition, many works have studied particular subtopics of the MDS task, such as agreement-oriented MDS (Pang et al., 2021), topic-guided MDS (Cui and Hu, 2021), and MDS of medical studies (DeYoung et al., 2021). However, few works have explored generating framing-bias-free summaries from multiple news articles. To add to this direction, we propose the NEUS task and create a new benchmark.

3 Task and Dataset

3.1 Task Formulation

The main objective of NEUS is to generate a neutral article summary Aneu given multiple news articles A0...N with varying degrees and orientations of political bias. The neutral summary Aneu should (i) retain salient information and (ii) minimize as much framing bias as possible from the input articles.

3.2 ALLSIDES Dataset

Allsides.com provides access to triplets of news, which comprise reports from left, right, and center American publishers on the same event, together with an expert-written neutral summary of the articles and its neutral title. The dataset language is English, and it mainly focuses on U.S. political topics that often result in media bias. The top-3 most frequent topics4 are ‘Elections’, ‘White House’, and ‘Politics’.

We crawl the article triplets5 to serve as the source inputs {AL, AR, AC}, and the neutral article summary to be the target output Aneu for our task. Note that “center” does not necessarily mean completely bias-free (all, 2021), as illustrated in Table 1. Although “center” media outlets are relatively less tied to a particular political ideology, their reports may still contain framing bias because editorial judgement naturally leads to human-induced biases. In addition, we also crawl the title triplets {TL, TR, TC} and the neutral issue title Tneu, which are later used in our modeling.

To make the dataset richer, we also crawled other meta-information such as date, topic tags, and media name. In total, we crawled 3,564 triplets (10,692 articles). We use 2/3 of the triplets (2,376) as our training and validation set (80:20 ratio), and the remaining 1,188 triplets as our test set. We will publicly release this dataset for future research use.

4 The full list is provided in the appendix.
5 In some cases, Allsides does not provide all three reportings; such cases are filtered out.

4 Analysis of Framing Bias

The literature on media framing bias from the NLP community and social science provides the definition and types of framing bias (Goffman, 1974; Entman, 1993; Gentzkow et al., 2015; Fan et al., 2019): Informational framing bias is the biased selection of information (tangential or speculative information) to sway the minds of readers. Lexical framing bias is a sensational writing style or linguistic attributes that may mislead readers. However, these definitions are not enough to understand exactly how framing bias manifests in real examples such as, in our case, the ALLSIDES dataset. We conduct a case study to obtain concrete insights to guide our design choices for defining the metrics and methodology.

4.1 Case-Study Observations

First, we identify and share examples of framing bias in accordance with the literature (Table 1).
Issue A: Trump Put Hold On Military Aid To Ukraine Days Before Call To Ukrainian President
    Left: Trump ordered hold on military aid days before calling Ukrainian president, officials say
    Right: Trump administration claims Ukraine aid was stalled over corruption concerns, decries media ‘frenzy’
    Center: Trump Put Hold on Military Aid Ahead of Phone Call With Ukraine’s President
Issue B: Michael Reinoehl appeared to target right-wing demonstrator before fatal shooting in Portland, police say
    Left: Suspect in killing of right-wing protester fatally shot during arrest
    Right: Portland’s Antifa-supporting gunman appeared to target victim, police say
    Center: Suspect in Patriot Prayer Shooting Killed by Police
Issue C: Trump Says the ‘Fake News Media’ Are ‘the true Enemy of the People’
    Left: President Trump renews attacks on press as ‘true enemy of the people’ even as CNN receives another suspected bomb
    Right: ‘Great Anger’ in America caused by ‘fake news’ — Trump rips media for biased reports
    Center: Trump blames ‘fake news’ for country’s anger: ‘the true enemy of the people’

Table 1: Illustration of differences in framing from Left/Right/Center media with examples from the ALLSIDES dataset. We use titles for the analysis of bias, since they are simpler to compare and are representative of the framing bias that exists in the articles.
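The left/right/center triplet structure illustrated in Table 1, together with the split described in §3.2, can be sketched as follows. This is a minimal sketch: the field names and the exact split procedure (a contiguous 2/3 vs. 1/3 cut, then an 80:20 cut) are assumptions for illustration, not taken from the released dataset code.

```python
from dataclasses import dataclass, field

@dataclass
class AllsidesTriplet:
    """One crawled issue from Allsides.com (field names are hypothetical)."""
    titles: dict            # {"L": T_L, "R": T_R, "C": T_C}
    articles: dict          # {"L": A_L, "R": A_R, "C": A_C}
    neutral_title: str      # T_neu
    neutral_summary: str    # A_neu, the expert-written roundup (target output)
    meta: dict = field(default_factory=dict)  # date, topic tags, media names

def split_dataset(triplets):
    """2/3 of the triplets for training+validation (split 80:20), 1/3 for test."""
    n_test = len(triplets) // 3
    train_val, test = triplets[:-n_test], triplets[-n_test:]
    n_val = len(train_val) // 5          # 20% of the train+validation pool
    return train_val[:-n_val], train_val[-n_val:], test
```

With the paper's 3,564 crawled triplets, this yields 1,188 test triplets and 2,376 triplets for training and validation.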

Informational Bias   This bias exists dominantly in the form of “extra information” on top of the salient key information about an issue, changing the overall impression of it. For example, in Table 1, when reporting about the hold put on military aid to Ukraine (Issue A), the right-leaning media reports the speculative claim that there were “corruption concerns” and the tangential information “decries media ‘frenzy’”, which amplify the negative impression of the issue. Sometimes, media with different political leanings report additional information to convey a completely different focus on the issue. For Issue C, the left-leaning media implies that Trump’s statement about fake news has led to “CNN receiving another suspected bomb”, whereas the right-leaning media implies that the media is at fault for producing “biased reports”.

Lexical Bias   This bias exists mainly as biased word choices that change the nuance of the information being delivered. For example, in Issue B, we can clearly observe that the two media outlets change the framing of the issue by using the different terms “suspect” and “gunman” to refer to the shooter, and “protester” and “victim” to refer to the person shot. Also, in Issue A, where one media outlet uses “(ordered) hold”, another uses “stalled”, which has a more negative connotation.

4.2 Main Insights from Case-Study

Next, we share important insights from the case-study observations that guide our metric and model design.

Relative Polarity   Polarity is one of the commonly used attributes in identifying and analyzing framing bias (Fan et al., 2019; Recasens et al., 2013). Although informational and lexical bias are conceptually different, both are closely associated with polarity changes of a concept, i.e., positive or negative, to induce strongly divergent emotional responses from readers (Hamborg et al., 2019b). Thus, polarity can serve as a good indicator of framing bias.

However, we observe that the polarity of text must be utilized with care in the context of framing bias. It is the relative polarity that is meaningful for indicating framing bias, not the absolute polarity. To elaborate, if the news issue itself is about tragic events such as “Terror Attack in Pakistan” or “Drone Strike That Killed 10 People”, then the polarity of neutral reporting will also be negative.

Indicator of Framing   We discover that the news title is very representative of the framing bias that exists in the associated articles – this makes sense because the title can be viewed as a succinct overview of the content that follows6. For instance, in the source input example in Table 3, the right-leaning media’s title and article are mildly mocking of the “desperate” democrats’ failed attempts to take down President Trump. In contrast, the left-leaning media’s title and article show a completely different frame – implying that many investigations are happening and there is “possible obstruction of justice, public corruption, and other abuses of power.”

6 There can be exceptions for non-mainstream media that use clickbait titles.

5 Metric

We use three metrics to evaluate summaries from different dimensions. For framing bias, we propose a polarity-based metric based on the careful design choices detailed in §5.1. For evaluating whether the summaries retain salient information, we adopt commonly used information recall metrics (§5.2). In addition, we use a hallucination metric to evaluate whether the generations contain any unfaithful
hallucinatory information, because the existence of such hallucinatory generations can turn the summary into fake news (§5.3).

5.1 Framing Bias Metric

5.1.1 Design Consideration

Our framing bias metric is developed upon the insights obtained from our case study in §4.

First of all, we propose to build our metric based on the fact that framing bias is closely associated with polarity. Both model-based and lexicon-based polarity detection approaches are options for our work, and we leverage the latter for the following reasons: 1) There is increasing demand for interpretability in the field of NLP (Belinkov et al., 2020; Sarker et al., 2019), and the lexicon-based approach is more interpretable (it provides token-level, human-interpretable annotation) compared to black-box neural models. 2) In the context of framing bias, distinguishing the subtle nuances between synonyms is crucial (e.g., dead vs. murdered). The lexicon resource provides such token-level fine-grained scores and annotations, making it useful for our purpose.

Metric calibration is the second design consideration, motivated by our insight into the relativity of framing bias. The absolute polarity of a token does not by itself indicate framing bias (i.e., the word “riot” has negative sentiment but does not always indicate bias), so it is essential to measure the relative degree of polarity. Therefore, calibration of the metric in reference to the neutral target is important: any token existing in the neutral target is ignored in the bias measurement of the generated neutral summary. For instance, if “riot” exists in the neutral target, it will not be counted in the bias measurement.

5.1.2 Framing Bias Metric Details

For our metric, we leverage the Valence-Arousal-Dominance (VAD) dataset (Mohammad, 2018), which has a large list of lexicons annotated with valence (v), arousal (a), and dominance (d) scores. Valence, arousal, and dominance represent the direction of polarity (positive, negative), the strength of the polarity (active, passive), and the level of control (powerful, weak), respectively.

Given the neutral summary Âneu generated from the model, our metric is calculated using the VAD lexicons in the following way:

1. Filter out all the tokens that appear in the neutral target Aneu to obtain the set of tokens unique to Âneu. This ensures that we measure the relative polarity of Âneu in reference to the neutral target Aneu – the calibration effect.
2. Select tokens with either positive valence (v > 0.65) or negative valence (v < 0.35) to eliminate neutral words (i.e., stopwords and non-emotion-provoking words) – this step excludes tokens that are unlikely to be associated with framing bias from the metric calculation.
3. Sum the arousal scores of the identified positive and negative tokens from Step 2 into a positive arousal score (Arousal+) and a negative arousal score (Arousal−). We intentionally separate the positive and negative scores for finer-grained interpretation. We also compute the combined arousal score (Arousalsum = Arousal+ + Arousal−) for a coarse view.
4. Repeat for all {Aneu, Âneu} pairs in the test set, and calculate the average scores to use as the final metric. We report these scores in our experimental results section (§7).

In essence, our metric approximates the existence of framing bias by quantifying how intensely aroused and sensational the generated summary is in reference to the neutral target. We publicly release our metric code for easy use by other researchers7.

7 https://github.com/HLTCHKUST/framing-bias-metric
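The per-pair computation of §5.1.2 (Steps 1–3) can be sketched as follows. This is a minimal sketch assuming naive whitespace tokenization; the (valence, arousal) values below are made-up placeholders standing in for real NRC-VAD entries, and the authors' released metric code is the reference implementation.

```python
# Placeholder (valence, arousal) entries; real scores come from the
# NRC-VAD lexicon (Mohammad, 2018). These numbers are illustrative only.
VAD = {
    "frenzy":  (0.20, 0.85),   # negative valence, high arousal
    "decries": (0.25, 0.60),   # negative valence
    "hold":    (0.45, 0.30),   # near-neutral valence -> dropped in Step 2
    "aid":     (0.80, 0.40),   # positive valence
}

def framing_arousal(generated, neutral_ref):
    """Steps 1-3 for one {neutral target, generated summary} pair."""
    # Step 1: calibration -- keep only tokens absent from the neutral target.
    unique = set(generated.lower().split()) - set(neutral_ref.lower().split())

    arousal_pos = arousal_neg = 0.0
    for token in unique:
        if token not in VAD:
            continue
        v, a = VAD[token]
        if v > 0.65:            # Step 2: positive-valence token
            arousal_pos += a    # Step 3: accumulate Arousal+
        elif v < 0.35:          # Step 2: negative-valence token
            arousal_neg += a    # Step 3: accumulate Arousal-
    return arousal_pos, arousal_neg, arousal_pos + arousal_neg

pos, neg, total = framing_arousal(
    "trump decries media frenzy over ukraine aid",
    "trump put hold on military aid to ukraine",
)
# "aid" is calibrated away (it appears in the neutral target), so only the
# negative tokens "decries" and "frenzy" contribute to the score.
```

Step 4 then simply averages these per-pair scores over the test set.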
5.1.3 Human Evaluation

To ensure the quality of our metric, we evaluate the correlation between our framing bias metric and human judgement. We conduct A/B testing8 in which the annotators are given two generated articles about an issue, one with a higher Arousalsum score and the other with a lower score. Annotators are then asked to select the more biased article summary. When asking which article is more “biased”, we adopt the question presented by Spinde et al. We also provide examples and the definition of framing bias for a better understanding of the task. We obtain three annotations each for 50 samples and select those with the majority of votes.

8 Please refer to the appendix for more details of the A/B testing.

A critical challenge of this evaluation lies in controlling the potential involvement of the annotators’ personal political bias. Although it is hard to eliminate such bias completely, we attempt to avoid it by collecting annotations from those indifferent to the issues in the test set. Specifically, given that our test set mainly covers US politics, we restrict the annotators to non-US nationals who consider themselves bias-free towards any US political party.

After obtaining the human annotations from A/B testing, we also obtain automatic annotations based on the proposed framing bias metric score, where the article with the higher Arousalsum is chosen as the more biased generation. The Spearman correlation coefficient between the human-based and metric-based annotations is 0.63615 with a p-value < 0.001, and the agreement percentage is 80%. These values indicate that the association between the two annotations is statistically significant, suggesting that our metric provides a good approximation of the existence of framing bias.

5.2 Salient Info

The generation needs to retain essential/important information while reducing the framing bias. Thus, we also report ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) between the generated neutral summary, Âneu, and the human-written summary, Aneu. Note that ROUGE measures recall (i.e., how often the n-grams in the human reference text appear in the machine-generated text) and BLEU measures precision (i.e., how often the n-grams in the machine-generated text appear in the human reference text). The higher the BLEU and ROUGE-1-R scores, the better the essential information converges. In our results, we only report ROUGE-1; ROUGE-2 and ROUGE-L can be found in the appendix.

5.3 Hallucination Metric

Recent studies have shown that neural sequence models can suffer from the hallucination of additional content not supported by the input (Reiter, 2018; Wiseman et al., 2017; Nie et al., 2019; Maynez et al., 2020; Pagnoni et al., 2021; Ji et al., 2022), consequently adding factual inaccuracy to the generation of NLG models. Although not directly related to the goal of NEUS, we evaluate the hallucination level of the generations in our work. We choose a hallucination metric called FeQA (Durmus et al., 2020) because it is one of the publicly available metrics known to have a high correlation with human judgement. It is a question-answering-based metric built on the assumption that the same answers will be derived from a hallucination-free generation and the source document when asked the same questions.

6 Models and Experiments9

6.1 Baseline Models

Since one common form of framing bias is the reporting of extra information (§4), summarization models, which extract commonly shared salient information, may already generate neutral summaries to some extent. To test this, we conduct experiments using the following baselines.

    • LEXRANK (Erkan and Radev, 2004): an extractive single-document summarization (SDS) model that extracts representative sentences holding information common to both left- and right-leaning articles.
    • BARTCNN: an abstractive SDS model that fine-tunes BART-large (Lewis et al., 2019), with 406M parameters, using the CNN/DailyMail (Hermann et al., 2015) dataset.
    • BARTMULTI: a multi-document summarization (MDS) model that fine-tunes BART-large using the Multi-News (Fabbri et al., 2019) dataset.
    • PEGASUSMULTI: an MDS model that fine-tunes Pegasus-base (Zhang et al., 2019a), with 568M parameters, using the Multi-News dataset.

Since these summarization models are not trained with in-domain data, we provide another baseline model trained with in-domain data for a full picture.

    • NEUSFT: a baseline that fine-tunes the BART-large model using ALLSIDES.

6.2 Our NEUS Models (NEUS-TITLE)

We design our models based on the second insight from the case study (§4): the news title serves as an indicator of the framing bias in the corresponding article. We hypothesize that it would be helpful to divide and conquer by neutralizing the title first, then leveraging the “neutralized title” to guide the final neutral summary of the longer articles.

Multi-task learning (MTL) is a natural modeling choice because two sub-tasks are involved – title-level and article-level neutral summarization.
                                                                Experimental details are provided in the appendix for
correlation with human faithfulness scores. This           reproducibility.
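The recall/precision distinction drawn in §5.2 can be illustrated with a minimal unigram sketch (function names are ours; real ROUGE and BLEU additionally use higher n-gram orders, a brevity penalty, and smoothing):

```python
from collections import Counter

def unigram_overlap(reference: str, generated: str):
    """Clipped unigram matching between a human reference and a generated text."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    matched = sum(min(ref_counts[w], gen_counts[w]) for w in gen_counts)
    recall = matched / sum(ref_counts.values())      # ROUGE-1 recall: normalize by reference length
    precision = matched / sum(gen_counts.values())   # BLEU-1 precision: normalize by generation length
    return recall, precision

recall, precision = unigram_overlap(
    "house democrats requested documents from 81 people",
    "democrats requested documents",
)
# All 3 generated tokens appear in the reference (precision 1.0),
# but only 3 of the 7 reference tokens are covered (recall ~ 0.43).
```

The same matched count yields different scores depending on the normalizer, which is why a short extractive summary can score high precision but low recall.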
Models             |  Avg. Framing Bias Metric                  |  Salient Info          |  Hallucination
                   |  Arousal+ ↓  Arousal− ↓  Arousal_sum ↓     |  BLEU ↑  ROUGE 1-R ↑   |  FeQA ↑
-------------------+--------------------------------------------+------------------------+--------------
All Source input   |    6.76        3.64        10.40           |   8.27     56.57%      |     -
LEXRANK            |    3.02        1.74         4.76           |  12.21     39.08%      |   53.44%
BART CNN           |    2.09        1.23         3.32           |  10.49     35.63%      |   58.03%
PEGASUS MULTI      |    5.12        2.39         7.51           |   6.12     44.42%      |   22.24%
BART MULTI         |    5.94        2.66         8.61           |   4.24     35.76%      |   21.06%
NEUSFT             |    1.86        1.00         2.85           |  11.67     35.11%      |   58.50%
NEUS-TITLE         |    1.69        0.83         2.53           |  12.05     36.07%      |   45.95%

Table 2: Experimental results on the ALLSIDES test set. We provide the level of framing bias inherent in the "source input" from the ALLSIDES test set to serve as a reference point for the framing bias metric. For the framing bias metric, lower is better (↓); for the other scores, higher is better (↑).
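The identifier-token serialization that §6.2 describes for the NEUS-TITLE encoder input and decoder target can be sketched as follows (an illustrative sketch: the helper names are ours, while the token spellings "TITLE=>"/"ARTICLE=>", the [SEP] separator, and the per-sample shuffling follow the description in the text):

```python
import random

def build_encoder_input(pairs):
    """Serialize shuffled (title, article) pairs from left/center/right
    media into one string, separated by [SEP] tokens."""
    pairs = list(pairs)
    random.shuffle(pairs)  # discourage position-based spurious patterns
    return " [SEP] ".join(f"TITLE=> {t}. ARTICLE=> {a}." for t, a in pairs)

def build_decoder_target(neutral_title, neutral_summary):
    """The decoder emits the neutral title first, so the later article
    tokens can condition on it auto-regressively."""
    return f"TITLE=> {neutral_title}. ARTICLE=> {neutral_summary}"

x = build_encoder_input([("Left title", "Left article"),
                         ("Center title", "Center article"),
                         ("Right title", "Right article")])
y = build_decoder_target("Neutral title", "Neutral summary")
```

Training then reduces to standard negative log-likelihood on the single target string, which is what lets one seq2seq model cover both sub-tasks.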

Meanwhile, we also have to ensure a hierarchical relationship between the two tasks in our MTL training, because article-level neutral summarization leverages the generated neutral title as an additional resource. We use a simple technique to do hierarchical MTL by formatting our hierarchical data pair (title, article) as a single natural-language text with identifier tokens ("TITLE=>", "ARTICLE=>"). This technique allows us to optimize for both the title and article neutral summarization tasks easily by optimizing the negative log-likelihood of the single target Y. The auto-regressive nature of the decoder also ensures the hierarchical relationship between the title and the article.

We train BART's autoregressive decoder to generate the target text Y formatted as follows:

    TITLE ⇒ T_neu. ARTICLE ⇒ A_neu,

where T_neu and A_neu denote the neutral title and the neutral article summary. The input X to our BART encoder is formatted similarly to the target text Y:

    TITLE ⇒ T_L. ARTICLE ⇒ A_L. [SEP]
    TITLE ⇒ T_C. ARTICLE ⇒ A_C. [SEP]
    TITLE ⇒ T_R. ARTICLE ⇒ A_R,

where T_{L/C/R} and A_{L/C/R} denote the title and article from left-wing, center, and right-wing media, and [SEP] denotes the special token that separates the different inputs. Note that the order of left, center, and right is randomly shuffled for each sample to discourage the model from learning spurious patterns from the input.

7   Results and Analysis

In this section, we point out noteworthy observations from the quantitative results in Table 2, along with insights obtained through qualitative analysis. Table 3 shows generation examples that are most representative of the insights we share.10

10 More examples are provided in the appendix.

7.1   Main Results

Firstly, summarization models can reduce the framing bias to a certain degree (the Arousal_sum score drops from 10.40 to 4.76 and 3.32 for LEXRANK and BART CNN, respectively). This is because informational framing bias is addressed when summarization models extract the most salient sentences, which contain the information common to the inputs. However, summarization models, especially LEXRANK, cannot handle lexical framing bias, as shown in Table 3. Moreover, although LEXRANK is one of the best-performing models in terms of ROUGE 1-R (39.08%), the standard metric for summarization performance, it is not in terms of the framing bias metric. This suggests that good summarization performance (ROUGE 1-R) does not guarantee that the model is also neutral; i.e., the requirement for summaries to be neutral adds an extra dimension to the summarization task.

Secondly, one interesting pattern that deserves attention is that only the single-document summarization models (BART CNN and LEXRANK) reduced framing bias well, not the multi-document summarization models (PEGASUS MULTI and BART MULTI). This is rather surprising because our task setup is more similar to MDS than SDS. One of the major contributors to the high bias of the MDS models is probably hallucination, because the MDS models show drastically poorer hallucination performance than all the other models (both MDS models, PEGASUS MULTI and BART MULTI, achieve
SOURCE:  Title: Here Are The 81 People And Entities Close To Trump Democrats Are Investigating. Article:
 Democrats on the House Judiciary Committee on Monday sent document requests to 81 agencies, entities and individuals
 close to President Donald Trump as part of a broad investigation into possible obstruction of justice, public corruption
 and other abuses of power. The list includes Trump’s sons, Eric Trump and Donald Trump Jr., as well as his son-in-law,
 Jared Kushner.
  Title: House Panel Requests Documents From Associates of Trump. Article: House Democrats intensified
 their investigations into President Trump and his associates Monday, demanding records from more than 80 people
 and organizations related to his business dealings, interactions with the Justice Department and communications with
 Russian President Vladimir Putin.
  Title: Dems Continue Their Assault on The Trump Administration By Launching Another Probe. Article:
 Democrats are desperate to take down President Donald Trump. The Russia probe has proven to be ineffective and, quite
 frankly, a waste of time and taxpayer money. They didn’t find what they wanted so now they’re launching another probe.
  TARGET: House Democrats launched a broad probe into President Trump on Monday, requesting documents from 81
  agencies and individuals as they investigate his business dealings, interactions with Russia, and possible obstruction of
  justice.

  Lexrank: Democrats are desperate to take down President Donald Trump. The Russia probe has proven to be ineffective
  and, quite frankly, a waste of time and taxpayer money.
  NEUSFT: The Russia probe has proven to be ineffective and, quite frankly, a waste of time and taxpayer money.
  NEUS-TITLE: TITLE=> House Panel Requests Documents. ARTICLE=> The House Select Committee on Intelligence
  has requested documents from 81 people and entities close to President Trump, including his sons Eric and Donald
  Trump Jr., as well as Jared Kushner.

Table 3: Generation examples for analysis purposes. Red highlights the tokens identified by VAD lexicons. Refer
to the appendix for more examples.
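The VAD-lexicon highlighting in Table 3 reflects how the arousal-based framing bias metric operates: each token is looked up in an arousal lexicon (e.g., NRC-VAD; Mohammad, 2018) and high-arousal hits are accumulated. A toy sketch with a hand-made lexicon (the scores and the threshold here are invented; the actual metric uses the full lexicon and splits the sum into Arousal+ and Arousal− by polarity):

```python
# Toy lexicon: real arousal scores come from the NRC-VAD lexicon
# (Mohammad, 2018); these values and the threshold are invented.
AROUSAL = {"assault": 0.89, "desperate": 0.73, "blasted": 0.85, "probe": 0.38}
HIGH_AROUSAL = 0.5

def arousal_hits(text):
    """Return the arousal scores of tokens exceeding the threshold."""
    return [AROUSAL[t] for t in text.lower().split()
            if AROUSAL.get(t, 0.0) > HIGH_AROUSAL]

hits = arousal_hits("Dems continue their assault by launching another probe")
score = sum(hits)  # only "assault" (0.89) passes; "probe" (0.38) does not
```

Tokens that pass the threshold are exactly the ones that would be highlighted in red in Table 3.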

  SOURCE: ... President Trump on Saturday blasted what he called the “phony” BuzzFeed story and the mainstream
  media’s coverage of it....
  MDS Hallucination: president trump on sunday slammed what he called called a “phony” story by the
  “dishonest” and “fake news” news outlet in a series of tweets. ... “the fake news media is working overtime to make
  this story look like it is true,” trump tweeted. “they are trying to make it look like the president is trying to hide
  something, but it is not true!”

Table 4: Illustration of hallucinatory framing bias from MDS models and the corresponding “most relevant source
snippet” from the source input. Refer to the appendix for more examples with full context.
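One way to flag hallucinatory framing bias of the kind shown in Table 4 is to compare the lexicon-based bias level of a generated summary against that of its source input, treating the source as a stand-in reference. A hedged sketch (all lexicon values and the margin are invented for illustration, and `bias_score` stands in for any lexicon-based bias scorer):

```python
# Invented arousal values for illustration; a real scorer would use
# the full NRC-VAD lexicon and the Arousal+/Arousal- decomposition.
AROUSAL = {"phony": 0.71, "dishonest": 0.66, "fake": 0.63, "slammed": 0.84}

def bias_score(text):
    """Sum lexicon arousal over the tokens of a text."""
    return sum(AROUSAL.get(t, 0.0) for t in text.lower().split())

def flags_political_hallucination(source, summary, margin=0.5):
    """Flag summaries whose lexicon-based bias clearly exceeds the source's."""
    return bias_score(summary) - bias_score(source) > margin

src = "trump on saturday blasted the phony story"
gen = "trump slammed the phony story by the dishonest and fake news outlet"
```

Here the generation accumulates far more high-arousal vocabulary than the source snippet, which is the signal such a check would exploit.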

22.24% and 21.06%, when most of the other models achieve over 50%).11 This suggests that the framing bias of MDS models may be related to the hallucination of politically biased content. We investigate this in the next subsection (§7.2).

11 Note that 22.24% and 21.06% are already high FeQA scores in absolute terms; they are only comparatively low with reference to the other models.

Thirdly, although summarization models help reduce the framing bias scores, we, unsurprisingly, observe a more considerable bias reduction when training with in-domain data. NEUSFT shows a further drop across all framing bias metrics without sacrificing the ability to keep salient information. However, we observe that NEUSFT often copies directly without any neutral re-writing – e.g., the NEUSFT example shown in Table 3 is a direct copy of a sentence from the input source.

Lastly, we can achieve slightly further improvement with NEUS-TITLE across all metrics except the FeQA score. This model demonstrates a stronger tendency to paraphrase rather than directly copy, and frames the issue in a comparatively more neutral way. As shown in Table 3, where LEXRANK and NEUSFT focus on the "ineffectiveness of the Russia probe", the TARGET and NEUS-TITLE focus on the start of the investigation with the request for documents. NEUS-TITLE also generates a title with a neutral frame similar to the TARGET's, suggesting that the title generation guided the correctly framed generation.

7.2   Further Analysis and Discussion

Q: Is hallucination contributing to the high framing bias in MDS models? Through qualitative analysis, we discovered that the MDS generations were hallucinating politically controversial or sensational content that did not exist in the input sources. This is probably originating from the
memorization of either the training data or the LM-pretraining corpus. For instance, in Table 4, we can observe stylistic bias being injected – "the 'dishonest' and 'fake news' news outlet". Also, the excessive elaboration of the president's comment towards the news media, which appears in neither the source nor the target, can be considered informational bias – "they are trying to make it look like the president is trying to hide something, but it is not true!" This analysis unveils an overlooked danger of hallucination: the risk of introducing political framing bias into summary generations. Note that this problem is not confined to MDS models, because the other baseline models also have room for improvement in terms of the FeQA hallucination score.

Q: What are the remaining challenges and future directions? The experimental results of NEUS-TITLE suggest that there is room for improvement. We qualitatively checked some error cases and discovered that the title generation is, unsurprisingly, not always accurate, and the error propagating from the title-generation step adversely affected the overall performance. Thus, one possible future direction would be to improve the neutral title generation, which would in turn improve the neutral summarization.

Another challenge is the subtle lexical bias involving nuanced word choices that manoeuvre readers into understanding events from biased frames. For example, "put on hold" and "stalled" describe the same outcome, but the latter has a more negative connotation. Improving the model's awareness of such nuanced words, or devising ways to incorporate style-transfer-based bias mitigation approaches (Liu et al., 2021), could be another helpful future direction.

We started the neutral summarization task assuming that framing bias originates from the source inputs. However, our results and analysis suggest that hallucination is another contributor to framing bias. Leveraging hallucination mitigation techniques would therefore be a valuable future direction for the NEUS task. We believe it will help to reduce informational framing bias, although it may be less effective against lexical framing bias. Moreover, our work can also be used to facilitate hallucination research. We believe the proposed framing bias metric will help researchers evaluate hallucinatory phenomena from angles other than "factuality". The proposed framing bias metric could also be adapted to the hallucination problem without a "neutral" reference: the source input can substitute for the "neutral" reference to measure whether the generated summary is more politically biased than the source – a potential indication of political hallucination.

8   Conclusion

We introduce a new task of Neutral Multi-News Summarization (NEUS) to mitigate media framing bias by providing a neutral summary of articles, along with the dataset ALLSIDES and a set of metrics. Throughout the work, we share insights to understand the challenges and future directions of the task. We show the relationships among polarity, extra information, and framing bias, which guide our metric design, while the insight that the title serves as an indicator of framing bias leads us to the model design. Our qualitative analysis reveals that hallucinatory content generated by models may also contribute to framing bias. We hope our work stimulates researchers to actively tackle political framing bias in both human-written and machine-generated texts.

Ethical Considerations

The idea of unbiased journalism has always been challenged12 because journalists will make their own editorial judgements, which can never be guaranteed to be completely bias-free. Therefore, we propose to generate a comprehensive summary of articles from different political leanings, instead of trying to generate a gold-standard "neutral" article.

12 https://www.allsides.com/blog/does-unbiased-news-really-exist

One consideration is the bias induced by the computational approach itself. Automatic approaches replace a known source bias with another bias caused by human-annotated data or the machine-learning models. Given the risk of uncontrolled adoption of such automatic tools, careful guidance should be provided on how to adopt them. For instance, an automatically generated neutral summary should be provided with a reference to the original sources instead of standing alone.

We use news from English-language sources only, and largely American news outlets, throughout this paper. Partisanship in this data refers to domestic American politics. We note that this work does not cover media bias at the international level or in other languages. In future work, we
will explore the application of our methodology to different cultures or languages. However, we hope the paradigm of NEUS, providing multiple sides to neutralize the view of an issue, can encourage future research in mitigating framing bias in other languages or cultures.

References

2021. Center – what does a "center" media bias rating mean?

Yejin Bang, Nayeon Lee, Etsuko Ishii, Andrea Madotto, and Pascale Fung. 2021. Assessing political prudence of open-domain chatbots. arXiv preprint arXiv:2106.06157.

Yonatan Belinkov, Sebastian Gehrmann, and Ellie Pavlick. 2020. Interpretability and analysis in neural NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 1–5.

Peng Cui and Le Hu. 2021. Topic-guided abstractive multi-document summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1463–1472.

Claes De Vreese. 2004. The effects of strategic news on political cynicism, issue evaluations, and policy support: A two-wave experiment. Mass Communication & Society, 7(2):191–214.

Jay DeYoung, Iz Beltagy, Madeleine van Zuylen, Bailey Kuehl, and Lucy Lu Wang. 2021. MS2: Multi-document summarization of medical studies. arXiv preprint arXiv:2104.06486.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Robert M Entman. 1993. Framing: Towards clarification of a fractured paradigm. McQuail's Reader in Mass Communication Theory, pages 390–397.

Robert M Entman. 2002. Framing: Towards clarification of a fractured paradigm. McQuail's Reader in Mass Communication Theory. London, California and New Delhi: Sage.

Robert M Entman. 2007. Framing bias: Media in the distribution of power. Journal of Communication, 57(1):163–173.

Günes Erkan and Dragomir R Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Alexander R Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749.

Lisa Fan, Marshall White, Eva Sharma, Ruisi Su, Prafulla Kumar Choubey, Ruihong Huang, and Lu Wang. 2019. In plain sight: Media bias through the lens of factual reporting. arXiv preprint arXiv:1909.02670.

Matthew Gentzkow and Jesse M Shapiro. 2006. Media bias and reputation. Journal of Political Economy, 114(2):280–316.

Matthew Gentzkow and Jesse M Shapiro. 2010. What drives media slant? Evidence from US daily newspapers. Econometrica, 78(1):35–71.

Matthew Gentzkow, Jesse M Shapiro, and Daniel F Stone. 2015. Media bias in the marketplace: Theory. In Handbook of Media Economics, volume 1, pages 623–645. Elsevier.

Erving Goffman. 1974. Frame Analysis: An Essay on the Organization of Experience. Harvard University Press.

Tim Groeling. 2013. Media bias by the numbers: Challenges and opportunities in the empirical study of partisan news. Annual Review of Political Science, 16:129–151.

Felix Hamborg, Karsten Donnay, and Bela Gipp. 2019a. Automated identification of media bias in news articles: An interdisciplinary literature review. International Journal on Digital Libraries, 20(4):391–415.

Felix Hamborg, Norman Meuschke, and Bela Gipp. 2017. Matrix-based news aggregation: Exploring different news perspectives. In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 1–10. IEEE.

Felix Hamborg, Anastasia Zhukova, and Bela Gipp. 2019b. Illegal aliens or undocumented immigrants? Towards the automated identification of bias by word choice and labeling. In International Conference on Information, pages 179–187. Springer.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28:1693–1701.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. arXiv preprint arXiv:2202.03629.

Hanqi Jin, Tianming Wang, and Xiaojun Wan. 2020. Multi-granularity interaction network for extractive and abstractive multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6244–6254.

Philippe Laban and Marti A Hearst. 2017. newsLens: Building and visualizing long-ranging news stories. In Proceedings of the Events and Stories in the News Workshop, pages 1–9.

Logan Lebanoff, Kaiqiang Song, and Fei Liu. 2018. Adapting the neural encoder-decoder framework from single to multi-document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4131–4141.

Nayeon Lee, Yejin Bang, Andrea Madotto, Madian Khabsa, and Pascale Fung. 2021a. Towards few-shot fact-checking via perplexity. arXiv preprint arXiv:2103.09535.

Nayeon Lee, Belinda Z Li, Sinong Wang, Pascale Fung, Hao Ma, Wen-tau Yih, and Madian Khabsa. 2021b. On unifying misinformation detection. arXiv preprint arXiv:2104.05243.

Nayeon Lee, Belinda Z Li, Sinong Wang, Wen-tau Yih, Hao Ma, and Madian Khabsa. 2020. Language models as fact checkers? arXiv preprint arXiv:2006.04102.

Nayeon Lee, Zihan Liu, and Pascale Fung. 2019. Team yeon-zi at SemEval-2019 task 4: Hyperpartisan news detection by de-noising weakly-labeled data. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 1052–1056.

Nayeon Lee, Chien-Sheng Wu, and Pascale Fung. 2018. Improving large-scale fact-checking using decomposable attention models and lexical tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1133–1138.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations.

Ruibo Liu, Chenyan Jia, Jason Wei, Guangxuan Xu, Lili Wang, and Soroush Vosoughi. 2021. Mitigating political bias in language models through reinforced calibration.

Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Maxwell McCombs and Amy Reynolds. 2009. How the news shapes our civic agenda. In Media Effects, pages 17–32. Routledge.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.

Fred Morstatter, Liang Wu, Uraz Yavanoglu, Stephen R Corman, and Huan Liu. 2018. Identifying framing bias in online news. ACM Transactions on Social Computing, 1(2):1–18.

Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and Chin-Yew Lin. 2019. A simple recipe towards reducing hallucination in neural surface realisation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2673–2679.

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829.

Richard Yuanzhe Pang, Adam D Lelkes, Vinh Q Tran, and Cong Yu. 2021. AgreeSum: Agreement-oriented multi-document summarization. arXiv preprint arXiv:2106.02278.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Souneil Park, Seungwoo Kang, Sangyoung Chung, and Junehwa Song. 2009. NewsCube: Delivering multiple aspects of news to mitigate media bias. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 443–452.

Souneil Park, Kyung-Soon Lee, and Junehwa Song. 2011. Contrasting opposing views of news articles on contentious issues. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 340–349.

Ramakanth Pasunuru, Mengwen Liu, Mohit Bansal, Sujith Ravi, and Markus Dreyer. 2021. Efficiently summarizing text and graph encodings of multi-document clusters. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4768–4779.

Elizabeth M Perse and Jennifer Lambe. 2016. Media Effects and Society. Routledge.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. 2013. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Tae Yano, Philip Resnik, and Noah A Smith. 2010. Shedding (a thousand points of) light on biased language. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 152–158.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization.
 for Computational Linguistics (Volume 1: Long Pa-
 pers), pages 1650–1659.                                    Yifan Zhang, Giovanni Da San Martino, Alberto
                                                              Barrón-Cedeno, Salvatore Romeo, Jisun An, Hae-
Ehud Reiter. 2018. A structured review of the validity        woon Kwak, Todor Staykovski, Israa Jaradat, Georgi
  of bleu. Computational Linguistics, 44(3):393–401.          Karadzhov, Ramy Baly, et al. 2019b. Tanbih: Get to
                                                              know what you are reading. EMNLP-IJCNLP 2019,
Abeed Sarker, Ari Z Klein, Janet Mee, Polina Harik,           page 223.
  and Graciela Gonzalez-Hernandez. 2019. An in-
  terpretable natural language processing system for        Appendix
  written medical examination assessment. Journal of
  biomedical informatics, 98:103268.                        A    Topics covered in A LL S IDESdataset
Dietram A Scheufele. 2000. Agenda-setting, priming,         The A LL S IDESdataset language is English and
  and framing revisited: Another look at cognitive ef-       mainly focuses on U.S. political topics that often
  fects of political communication. Mass communica-
  tion & society, 3(2-3):297–316.
                                                             result in media bias. The top-5 most frequent topics
                                                             are ‘Elections’, ‘White House’, ‘Politics’, ‘Coron-
Dietram A Scheufele and David Tewksbury. 2007.               avirus’, ‘Immigration’.
  Framing, agenda setting, and priming: The evolu-              The full list is as follow (in a descending order
  tion of three media effects models. Journal of com-        of frequency): [‘Elections’, ‘White House’, ‘Pol-
  munication, 57(1):9–20.
                                                             itics’, ‘Coronavirus’, ‘Immigration’, ‘Violence in
All Sides. 2018. Media bias ratings. Allsides.com.          America’, ‘Economy and Jobs’, ‘Supreme Court’,
                                                            ‘Middle East’, ‘US House’, ‘Healthcare’, ‘World’,
Timo Spinde, Christina Kreuter, Wolfgang Gaissmaier,        ‘US Senate’, ‘National Security’, ‘Gun Control and
  Felix Hamborg, Bela Gipp, and Helge Giese. 2021.           Gun Rights’, ‘Media Bias’, ‘Federal Budget’, ‘Ter-
  Do you think it’s biased? how to ask for the percep-
  tion of media bias. In 2021 ACM/IEEE Joint Con-            rorism’, ‘US Congress’, ‘Foreign Policy’, ‘Crim-
  ference on Digital Libraries (JCDL), pages 61–69.          inal Justice’, ‘Justice Department’, ‘Trade’, ‘Im-
  IEEE.                                                      peachment’, ‘Donald Trump’, ‘North Korea’, ‘Rus-
                                                             sia’, ‘Education’, ‘Environment’, ‘Free Speech’,
James Thorne,         Andreas Vlachos,        Christos
  Christodoulopoulos, and Arpit Mittal. 2018.
                                                            ‘FBI’, nan, ‘Abortion’, ‘General News’, ‘Disaster’,
  Fever: a large-scale dataset for fact extraction and      ‘US Military’, ‘Technology’, ‘LGBT Rights’, ‘Sex-
  verification. arXiv preprint arXiv:1803.05355.             ual Misconduct’, ‘Voting Rights and Voter Fraud’,
‘Joe Biden’, ‘Race and Racism’, ‘Economic Policy’, ‘Justice’, ‘Holidays’, ‘Taxes’, ‘China’, ‘Polarization’, ‘Democratic Party’, ‘Religion and Faith’, ‘Sports’, ‘Homeland Security’, ‘Culture’, ‘Cybersecurity’, ‘National Defense’, ‘Public Health’, ‘Civil Rights’, ‘Europe’, ‘Great Britain’, ‘Banking and Finance’, ‘Republican Party’, ‘NSA’, ‘Business’, ‘State Department’, ‘Facts and Fact Checking’, ‘Media Industry’, ‘Labor’, ‘Veterans Affairs’, ‘Campaign Finance’, ‘Life During COVID-19’, ‘Transportation’, ‘Marijuana Legalization’, ‘Agriculture’, ‘Arts and Entertainment’, ‘Fake News’, ‘Campaign Rhetoric’, ‘Nuclear Weapons’, ‘Israel’, ‘Asia’, ‘CIA’, ‘Role of Government’, ‘George Floyd Protests’, "Women’s Issues", ‘Safety and Sanity During COVID-19’, ‘Animal Welfare’, ‘Treasury’, ‘Science’, ‘Climate Change’, ‘Domestic Policy’, ‘Energy’, ‘Housing and Homelessness’, ‘Bridging Divides’, ‘Mexico’, ‘Inequality’, ‘COVID-19 Misinformation’, ‘ISIS’, ‘Palestine’, ‘Bernie Sanders’, ‘Tulsi Gabbard’, ‘Sustainability’, ‘Family and Marriage’, ‘Pete Buttigieg’, ‘Welfare’, ‘Opioid Crisis’, ‘Amy Klobuchar’, ‘Food’, ‘EPA’, ‘South Korea’, ‘Alaska: US Senate 2014’, ‘Social Security’, ‘US Constitution’, ‘Tom Steyer’, ‘Andrew Yang’, ‘Africa’]

B    Additional Salient Information Score Results

We report additional Salient Information F1 (Table 5) and Recall (Table 6) scores for ROUGE 1, ROUGE 2, and ROUGE L.

                  ROUGE 1    ROUGE 2    ROUGE L
                  F1         F1         F1
  LEXRANK         33.60%     13.60%     29.77%
  BARTCNN         33.76%     13.67%     30.57%
  PEGASUSMULTI    30.03%     10.28%     26.70%
  BARTMULTI       23.01%      6.84%     20.55%
  NEUSFT          36.76%     16.27%     32.86%
  NEUS-TITLE      35.49%     15.69%     32.05%

Table 5: Additional Salient Info Scores. F1 scores for ROUGE 1, ROUGE 2, and ROUGE L on the ALLSIDES test set. Higher is better.

                  ROUGE 1    ROUGE 2    ROUGE L
                  RECALL     RECALL     RECALL
  LEXRANK         39.08%     17.66%     34.69%
  BARTCNN         35.63%     15.32%     32.22%
  PEGASUSMULTI    44.42%     16.99%     39.45%
  BARTMULTI       35.76%     12.48%     32.08%
  NEUSFT          35.11%     15.74%     31.43%
  NEUS-TITLE      36.07%     16.47%     32.63%

Table 6: Additional Salient Info Scores. Recall scores for ROUGE 1, ROUGE 2, and ROUGE L on the ALLSIDES test set. Higher is better.

C    Details for Human Evaluation (A/B testing)

We first presented the participants with the definition of framing bias from our paper, and also showed examples in Table 1 to ensure they understood what framing bias is. Then we asked the following question: “Which one of the articles do you believe to be more biased toward one side or the other side in the reporting of news?” This is a modification, for A/B testing, of the question “To what extent do you believe that the article is biased toward one side or the other side in the reporting of news?” The original question is one of the 21 questions that Spinde et al. (2021) designed as suitable and reliable for measuring the perception of media bias.

The participants (research graduate students) have different nationalities, including Canada, China, Indonesia, Iran, Italy, Japan, Poland, and South Korea (listed in alphabetical order). All participants reported having no political leaning toward U.S. politics. All participants were fully informed about how the collected data would be used in this work and agreed to it.

D    Experimental Setup Details

All our experimental code is based on HuggingFace (Wolf et al., 2020). We used the following hyperparameters during training and across models: 10 epochs, a learning rate of 3e-5, and a batch size of 16. We did not do hyperparameter tuning, since our objective is to provide various baselines and analysis. Training run-time for all of our experiments is short (< 6 hr). We ran all experiments on one NVIDIA 2080Ti GPU with 16 GB of memory. Each experiment was a single run.

E    Generation Examples from Different Models

To help better understand the performance of each model, we provide more examples of generation
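The salient-information metrics reported in Tables 5 and 6 are ROUGE recall and F1 scores. As a minimal illustration of what ROUGE-1 measures (a simplified sketch using whitespace tokenization with no stemming or synonym handling; this is not the evaluation code behind the reported numbers), the clipped unigram overlap can be computed as:

```python
from collections import Counter

def rouge1(reference: str, candidate: str):
    """Simplified ROUGE-1: clipped unigram overlap between a candidate
    summary and a reference, returned as (recall, precision, f1).
    Whitespace tokenization only; full ROUGE adds stemming, etc."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Each candidate token counts at most as often as it occurs in the reference.
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1

# Toy example: recall rewards covering the reference's words,
# while F1 also penalizes overly long candidates.
r, p, f1 = rouge1("the senate passed the bill", "senate passed bill today")
```

Recall alone (Table 6) favors systems that copy broadly from the inputs, which is why F1 (Table 5) is also reported.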