NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias
Nayeon Lee  Yejin Bang  Tiezheng Yu  Andrea Madotto*  Pascale Fung
Hong Kong University of Science and Technology
nyleeaa@connect.ust.hk, pascale@ece.ust.hk
arXiv:2204.04902v3 [cs.CL] 3 May 2022

Abstract

Media news framing bias can increase political polarization and undermine civil society. The need for automatic mitigation methods is therefore growing. We propose a new task: neutral summary generation from multiple news articles of varying political leanings, to facilitate balanced and unbiased news reading. In this paper, we first collect a new dataset, illustrate insights about framing bias through a case study, and propose a new effective metric and model (NEUS-TITLE) for the task. Based on our discovery that the title provides a good signal for framing bias, we present NEUS-TITLE, which learns to neutralize news content in hierarchical order from title to article. Our hierarchical multi-task learning is achieved by formatting our hierarchical data pair (title, article) sequentially with identifier tokens ("TITLE=>", "ARTICLE=>") and fine-tuning the auto-regressive decoder with the standard negative log-likelihood objective. We then analyze and point out the remaining challenges and future directions. One of the most interesting observations is that neural NLG models can hallucinate not only factually inaccurate or unverifiable content but also politically biased content.

Figure 1: Illustration of the proposed task. We want to generate a neutral summarization of news articles from varying political orientations. Orange highlights indicate phrases that can be considered framing bias.

1 Introduction

Media framing bias occurs when journalists make skewed decisions regarding which events or information to cover (information bias) and how to cover them (lexical bias) (Entman, 2002; Groeling, 2013). Even if the reporting of the news is based on the same set of underlying issues or facts, the framing of that issue can convey a radically different impression of what actually happened (Gentzkow and Shapiro, 2006). Since the news media plays a crucial role in shaping public opinion toward various important issues (De Vreese, 2004; McCombs and Reynolds, 2009; Perse and Lambe, 2016), bias in media reporting can reinforce political polarization and undermine civil society.

Allsides.com (Sides, 2018) mitigates this problem by displaying articles from various media outlets in a single interface, along with an expert-written roundup of the news articles. This roundup is a neutral summary that lets readers grasp a bias-free understanding of an issue before reading the individual articles. Although Allsides fights framing bias, scalability remains a bottleneck because of the time-consuming human labor needed to compose the roundup. Multi-document summarization (MDS) models (Lebanoff et al., 2018; Liu and Lapata, 2019) could be one possible choice for automating roundup generation, as both multi-document summaries and roundups share a similar nature in extracting salient information from multiple input articles. Yet the ability of MDS models to provide a neutral description of a topic issue – a crucial aspect of the roundup – remains unexplored.

* This work was done when the author was studying at The Hong Kong University of Science and Technology.
In this work, we fill this research gap by proposing the task of Neutral multi-news Summarization (NEUS), which aims to generate a framing-bias-free summary from news articles with varying degrees and orientations of political bias (Fig. 1). To begin with, we construct a new dataset by crawling Allsides.com, and we investigate how framing bias manifests in the news so as to provide a more profound and comprehensive analysis of the problem. The first important insight from our analysis is a close association between framing bias and the polarity of the text. Grounded on this basis, we propose a polarity-based framing-bias metric that is simple yet effective in terms of alignment with human perception. The second insight is that titles serve as a good indicator of framing bias. Thus, we propose NEUS models that leverage news titles as an additional signal to increase awareness of framing bias.

Framing is a broad term that refers to any factor or technique that affects how individuals perceive a certain reality or information (Goffman, 1974; Entman, 1993, 2007; Gentzkow and Shapiro, 2006). In the context of news reports, framing concerns how an issue is characterized by journalists and how readers take in the information to form their impressions (Scheufele and Tewksbury, 2007). Our work specifically focuses on framing "bias" that exists as a form of text in the news. More specifically, we focus on writing factors such as word choices and the commission of extra information that sway an individual's perception of certain events.

Media Bias Detection: In natural language processing (NLP), computational approaches for detecting media bias often consider linguistic cues
that induce bias in political text (Recasens et al., 2013; Yano et al., 2010; Lee et al., 2019; Morstatter et al., 2018; Hamborg et al., 2019b; Lee et al., 2021b; Bang et al., 2021). For instance, Gentzkow and Shapiro count the frequency of slanted words within articles. These methods mainly focus on the stylistic ("how to cover") aspect of framing bias. However, relatively fewer efforts have been made toward the informational ("what to cover") aspect of framing bias (Park et al., 2011; Fan et al., 2019). The majority of the literature on informational detection focuses on the more general factual domain (non-political information) under the name of "fact-checking" (Thorne et al., 2018; Lee et al., 2018, 2021a, 2020). However, these methods cannot be directly applied to media bias detection, because no reliable source of gold-standard truth exists against which to fact-check biased text.

Our experimental results provide rich ideas for understanding the problem of mitigating framing bias. Primarily, we explore whether existing summarization models can already solve the problem and empirically demonstrate their shortcomings in addressing the stylistic aspect of framing bias. After that, we investigate and discover an interesting relationship between framing bias and hallucination, an important safety-related problem in generation tasks. We empirically show that hallucinatory generation risks being not only factually inaccurate and/or unverifiable but also politically biased and controversial. To the best of our knowledge, this aspect of hallucination has not been previously discussed. We thus hope to encourage more attention toward hallucinatory framing bias, to prevent automatic generations from fueling political bias and polarization. We conclude by discussing the remaining challenges to provide insights for future work.
We hope our work with the proposed NEUS task serves as a good starting point for promoting the automatic mitigation of media framing bias.

2 Related Works

Media Bias: Media bias has been studied extensively in various fields such as social science, economics, and political science. Media bias is known to affect readers' perceptions of news in three main ways: priming, agenda-setting, and framing¹ (Scheufele, 2000).

¹ Priming happens when news reporting tells the reader in what context to evaluate an event; agenda-setting is when news reporting tells readers which problems are the most important to think about.
² https://news.google.com/
³ https://news.yahoo.com/

Media Bias Mitigation: News aggregation, which displays articles from different news outlets on a particular topic (e.g., Google News,² Yahoo News³), is the most common approach to mitigating media bias (Hamborg et al., 2019a). However, news aggregators require willingness and effort from readers to resist framing bias and to identify the neutral facts among differently framed articles. Other approaches have been proposed to
provide additional information (Laban and Hearst, 2017), such as automatic classification of multiple viewpoints (Park et al., 2009), multinational perspectives (Hamborg et al., 2017), and detailed media profiles (Zhang et al., 2019b). However, these methods focus on providing a broader perspective to readers through an enlarged selection of articles, which still puts the burden of mitigating bias on the readers. Instead, we propose to automatically neutralize and summarize partisan articles to produce a neutral article summary.

Multi-document Summarization: As a challenging subtask of automatic text summarization, multi-document summarization (MDS) aims to condense a set of documents into a short and informative summary (Lebanoff et al., 2018). Recently, researchers have applied deep neural models to the MDS task, thanks to the introduction of large-scale datasets (Liu et al., 2018; Fabbri et al., 2019). With the advent of large pre-trained language models (Lewis et al., 2019; Raffel et al., 2019), researchers have also applied them to improve MDS models' performance (Jin et al., 2020; Pasunuru et al., 2021). In addition, many works have studied particular subtopics of the MDS task, such as agreement-oriented MDS (Pang et al., 2021), topic-guided MDS (Cui and Hu, 2021), and MDS of medical studies (DeYoung et al., 2021). However, few works have explored generating framing-bias-free summaries from multiple news articles. To add to this direction, we propose the NEUS task and create a new benchmark.

3 Task and Dataset

3.1 Task Formulation

The main objective of NEUS is to generate a neutral article summary Aneu given multiple news articles A0...N with varying degrees and orientations of political bias. The neutral summary Aneu should (i) retain salient information and (ii) minimize as much framing bias as possible from the input articles.

3.2 ALLSIDES Dataset

Allsides.com provides access to triplets of news articles, which comprise reports from left, right, and center American publishers on the same event, along with an expert-written neutral summary of the articles and its neutral title. The dataset language is English, and it mainly focuses on U.S. political topics that often result in media bias. The top-3 most frequent topics⁴ are 'Elections', 'White House', and 'Politics'.

We crawl the article triplets⁵ to serve as the source inputs {AL, AR, AC}, and the neutral article summary to serve as the target output Aneu for our task. Note that "center" does not necessarily mean completely bias-free (all, 2021), as illustrated in Table 1. Although "center" media outlets are relatively less tied to a particular political ideology, their reports may still contain framing bias, because editorial judgment naturally leads to human-induced biases. In addition, we also crawl the title triplets {TL, TR, TC} and the neutral issue title Tneu, which are later used in our modeling.

To make the dataset richer, we also crawled other meta-information such as the date, topic tags, and media name. In total, we crawled 3,564 triplets (10,692 articles). We use 2/3 of the triplets, which is 2,376, as our training and validation set (80:20 ratio), and the remaining 1,188 triplets as our test set. We will publicly release this dataset for future research use.

⁴ The full list is provided in the appendix.
⁵ In some cases, Allsides does not provide all three reports; such cases are filtered out.
4 Analysis of Framing Bias

The literature on media framing bias from the NLP community and from social science provides the definition and types of framing bias (Goffman, 1974; Entman, 1993; Gentzkow et al., 2015; Fan et al., 2019): informational framing bias is the biased selection of information (tangential or speculative information) to sway the minds of readers; lexical framing bias is a sensational writing style or linguistic attributes that may mislead readers. However, the definitions are not enough to understand exactly how framing bias manifests in real examples such as, in our case, the ALLSIDES dataset. We conduct a case study to obtain concrete insights that guide our design choices for the metric and methodology.

4.1 Case-Study Observations

First, we identify and share examples of framing bias in accordance with the literature (Table 1).

Issue A: Trump Put Hold On Military Aid To Ukraine Days Before Call To Ukrainian President
  Left: Trump ordered hold on military aid days before calling Ukrainian president, officials say
  Right: Trump administration claims Ukraine aid was stalled over corruption concerns, decries media 'frenzy'
  Center: Trump Put Hold on Military Aid Ahead of Phone Call With Ukraine's President

Issue B: Michael Reinoehl appeared to target right-wing demonstrator before fatal shooting in Portland, police say
  Left: Suspect in killing of right-wing protester fatally shot during arrest
  Right: Portland's Antifa-supporting gunman appeared to target victim, police say
  Center: Suspect in Patriot Prayer Shooting Killed by Police

Issue C: Trump Says the 'Fake News Media' Are 'the true Enemy of the People'
  Left: President Trump renews attacks on press as 'true enemy of the people' even as CNN receives another suspected bomb
  Right: 'Great Anger' in America caused by 'fake news' – Trump rips media for biased reports
  Center: Trump blames 'fake news' for country's anger: 'the true enemy of the people'

Table 1: Illustration of differences in framing from Left/Right/Center media, with examples from the ALLSIDES dataset. We use titles for the analysis of bias, since they are simpler to compare and are representative of the framing bias that exists in the articles.

Informational Bias: This bias exists dominantly in the form of "extra information" on top of the salient key information about an issue, which changes the overall impression of the issue. For example, when reporting about the hold put on military aid to Ukraine (Issue A in Table 1), the right-leaning media reports the speculative claim that there were "corruption concerns" and the tangential information "decries media 'frenzy'", which amplify the negative impression of the issue. Sometimes, media with different political leanings report additional information to convey a completely different focus on the issue. For Issue C, the left-leaning media implies that Trump's statement about fake news has led to "CNN receiving another suspected bomb", whereas the right-leaning media implies that the media is at fault for producing "biased reports".
Lexical Bias: This bias exists mainly as biased word choices that change the nuance of the information being delivered. For example, in Issue B, we can clearly observe that the two media outlets change the framing of the issue by using the different terms "suspect" and "gunman" to refer to the shooter, and "protester" and "victim" to refer to the person shot. Also, in Issue A, where one media outlet uses "(ordered) hold", another uses "stalled", which has a more negative connotation.

4.2 Main Insights from the Case Study

Next, we share important insights from the case-study observations that guide our metric and model design.

Relative Polarity: Polarity is one of the attributes commonly used in identifying and analyzing framing bias (Fan et al., 2019; Recasens et al., 2013). Although informational and lexical bias are conceptually different, both are closely associated with polarity changes of a concept, i.e., positive or negative, that induce strongly divergent emotional responses from readers (Hamborg et al., 2019b). Thus, polarity can serve as a good indicator of framing bias. However, we observe that the polarity of text must be utilized with care in the context of framing bias: it is the relative polarity that is meaningful, not the absolute polarity. To elaborate, if the news issue itself is about tragic events such as "Terror Attack in Pakistan" or "Drone Strike That Killed 10 People", then the polarity of neutral reporting will also be negative.

Indicator of Framing: We discover that the news title is very representative of the framing bias that exists in the associated articles – this makes sense because the title can be viewed as a succinct overview of the content that follows.⁶ For instance, in the source input example of Table 3, the right-leaning media's title and article are mildly mocking of the "desperate" Democrats' failed attempts to take down President Trump. In contrast, the left-leaning media's title and article show a completely different frame – implying that many investigations are happening and that there is "possible obstruction of justice, public corruption, and other abuses of power".

⁶ There can be exceptions for non-mainstream media that use clickbait titles.

5 Metric

We use three metrics to evaluate summaries along different dimensions.
For framing bias, we propose a polarity-based metric based on the careful design choices detailed in §5.1. For evaluating whether the summaries retain salient information, we adopt commonly used information recall metrics (§5.2). In addition, we use a hallucination metric to evaluate whether the generations contain any unfaithful hallucinatory information, because such hallucinatory generations can turn the summary into fake news (§5.3).

5.1 Framing Bias Metric

5.1.1 Design Considerations

Our framing bias metric is developed upon the insights obtained from our case study in §4. First of all, we propose to build our metric on the fact that framing bias is closely associated with polarity. Both model-based and lexicon-based polarity detection approaches are options for our work, and we leverage the latter for the following reasons: 1) There is increasing demand for interpretability in the field of NLP (Belinkov et al., 2020; Sarker et al., 2019), and the lexicon-based approach is more interpretable (it provides token-level, human-interpretable annotation) than black-box neural models. 2) In the context of framing bias, distinguishing the subtle nuances between synonyms is crucial (e.g., dead vs. murdered). The lexicon resource provides such token-level fine-grained scores and annotations, making it useful for our purpose.

Metric calibration is the second design consideration, motivated by our insight into the relativity of framing bias. The absolute polarity of a token does not necessarily indicate framing bias (i.e., the word "riot" has negative sentiment but does not always indicate bias), so it is essential to measure the relative degree of polarity. Therefore, calibrating the metric in reference to the neutral target is important: any token existing in the neutral target is ignored in the bias measurement of the generated neutral summary. For instance, if "riot" exists in the neutral target, it will not be counted in the bias measurement, thanks to calibration.

5.1.2 Framing Bias Metric Details

For our metric, we leverage the Valence-Arousal-Dominance (VAD) dataset (Mohammad, 2018), which has a large list of lexicons annotated with valence (v), arousal (a), and dominance (d) scores. Valence, arousal, and dominance represent the direction of polarity (positive, negative), the strength of the polarity (active, passive), and the level of control (powerful, weak), respectively. Given the neutral summary Âneu generated by the model, our metric is calculated using the VAD lexicons in the following way:

1. Filter out all the tokens that appear in the neutral target Aneu to obtain the set of tokens unique to Âneu. This ensures that we measure the relative polarity of Âneu in reference to the neutral target Aneu – the calibration effect.
2. Select tokens with either positive valence (v > 0.65) or negative valence (v < 0.35) to eliminate neutral words (i.e., stopwords and non-emotion-provoking words) – this step excludes tokens that are unlikely to be associated with framing bias from the metric calculation.
3. Sum the arousal scores for the identified positive and negative tokens from Step 2 to obtain the positive arousal score (Arousal+) and the negative arousal score (Arousal−). We intentionally separate the positive and negative scores for finer-grained interpretation. We also use the combined arousal score (Arousal_sum = Arousal+ + Arousal−) for a coarse view.
4. Repeat for all {Aneu, Âneu} pairs in the test set, and calculate the average scores to use as the final metric. We report these scores in our experimental results section (§7).

In essence, our metric approximates the existence of framing bias by quantifying how intensely aroused and sensational the generated summary is in reference to the neutral target. We publicly release our metric code for easy use by other researchers.⁷
5.1.3 Human Evaluation

To ensure the quality of our metric, we evaluate the correlation between our framing bias metric and human judgement. We conduct A/B testing⁸ in which the annotators are given two generated articles about an issue, one with a higher Arousal_sum score and the other with a lower score, and are asked to select the more biased article summary. When asking which article is more "biased", we adopt the question presented by Spinde et al. We also provide examples and the definition of framing bias for a better understanding of the task. We obtain three annotations for each of 50 samples and select the answer with the majority of votes.

⁷ https://github.com/HLTCHKUST/framing-bias-metric
⁸ Please refer to the appendix for more details of the A/B testing.

A critical challenge of this evaluation is controlling for the potential involvement of the annotators'
personal political bias. Although it is hard to eliminate such bias completely, we attempt to avoid it by collecting annotations from annotators who are indifferent to the issues in the test set. Specifically, given that our test set mainly covers US politics, we restrict the nationality of annotators to non-US nationals who view themselves as bias-free toward any US political party.

After obtaining the human annotations from the A/B testing, we also obtain automatic annotations based on the proposed framing bias metric, where the article with the higher Arousal_sum is chosen as the more biased generation. The Spearman correlation coefficient between the human-based and metric-based annotations is 0.63615, with a p-value < 0.001, and the agreement percentage is 80%. These values indicate that the association between the two annotations is statistically significant, suggesting that our metric provides a good approximation of the existence of framing bias.

6 Models and Experiments⁹

6.1 Baseline Models

Since one common form of framing bias is the reporting of extra information (§4), summarization models, which extract commonly shared salient information, may already generate neutral summaries to some extent. To test this, we conduct experiments with the following baselines.

• LEXRANK (Erkan and Radev, 2004): an extractive single-document summarization (SDS) model that extracts representative sentences holding information common to both left- and right-leaning articles.
• BARTCNN: an abstractive SDS model that fine-tunes BART-large (Lewis et al., 2019), with 406M parameters, on the CNN/DailyMail (Hermann et al., 2015) dataset.

5.2 Salient Info

The generation needs to retain essential/important information while reducing the framing bias. Thus, we also report ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) between the generated neutral summary, Âneu, and the human-written summary, Aneu. Note that ROUGE measures recall (i.e., how often the n-grams in the human reference text appear in the machine-generated text) and BLEU measures precision (i.e., how often the n-grams in the machine-generated text appear in the human reference text). The higher the BLEU and ROUGE-1-R scores, the better the essential information is retained.
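The recall/precision distinction can be illustrated with a toy unigram overlap; this sketch is not the official ROUGE/BLEU implementation (which also handles higher-order n-grams, stemming, and brevity penalties):

```python
from collections import Counter

def unigram_overlap(a, b):
    # Clipped unigram overlap between two token lists.
    ca, cb = Counter(a), Counter(b)
    return sum(min(ca[w], cb[w]) for w in ca)

def rouge1_recall(reference, candidate):
    # Recall: fraction of reference unigrams that appear in the candidate.
    return unigram_overlap(reference, candidate) / len(reference)

def unigram_precision(reference, candidate):
    # Precision: fraction of candidate unigrams that appear in the reference.
    return unigram_overlap(candidate, reference) / len(candidate)

ref = "house democrats launched a broad probe".split()
cand = "democrats launched a probe today".split()
print(rouge1_recall(ref, cand), unigram_precision(ref, cand))
```

Here four of the six reference tokens are covered (recall 4/6), while four of the five generated tokens are supported by the reference (precision 4/5), showing how a short generation can score high precision but low recall.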
In our results, we only report ROUGE-1; ROUGE-2 and ROUGE-L can be found in the appendix.

• BARTMULTI: a multi-document summarization (MDS) model that fine-tunes BART-large on the Multi-News (Fabbri et al., 2019) dataset.
• PEGASUSMULTI: an MDS model that fine-tunes Pegasus-base (Zhang et al., 2019a), with 568M parameters, on the Multi-News dataset.

Since these summarization models are not trained with in-domain data, we provide another baseline trained with in-domain data for a full picture.

• NEUSFT: a baseline that fine-tunes the BART-large model using ALLSIDES.

5.3 Hallucination Metric

Recent studies have shown that neural sequence models can suffer from hallucinating additional content not supported by the input (Reiter, 2018; Wiseman et al., 2017; Nie et al., 2019; Maynez et al., 2020; Pagnoni et al., 2021; Ji et al., 2022), which consequently adds factual inaccuracy to the generations of NLG models. Although not directly related to the goal of NEUS, we evaluate the hallucination level of the generations in our work.
We choose a hallucination metric called FeQA (Durmus et al., 2020), because it is one of the publicly available metrics known to have a high correlation with human faithfulness scores. It is a question-answering-based metric built on the assumption that the same answers will be derived from a hallucination-free generation and from the source document when asked the same questions.

6.2 Our NEUS Models (NEUS-TITLE)

We design our models based on the second insight from the case study (§4): the news title serves as an indicator of the framing bias in the corresponding article. We hypothesize that it is helpful to divide and conquer by neutralizing the title first, then leveraging the "neutralized title" to guide the final neutral summary of the longer articles. Multi-task learning (MTL) is a natural modeling choice because two sub-tasks are involved – title-level and article-level neutral summarization.⁹

⁹ Experimental details are provided in the appendix for reproducibility.
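Concretely, the linearization with identifier tokens described in the abstract can be sketched as below; the function and variable names are ours for illustration, not from the released code:

```python
import random

def format_pair(title, article):
    # Linearize one (title, article) pair with the identifier tokens.
    return f"TITLE=> {title}. ARTICLE=> {article}."

def build_target(neutral_title, neutral_article):
    # Target Y: the neutral title first, then the neutral article summary.
    return format_pair(neutral_title, neutral_article)

def build_input(sources, shuffle=True):
    # Input X: the left/center/right (title, article) pairs joined with [SEP].
    # The order is shuffled per sample to avoid spurious position patterns.
    sources = list(sources)
    if shuffle:
        random.shuffle(sources)
    return " [SEP] ".join(format_pair(t, a) for t, a in sources)

sources = [("Left title", "Left article"),
           ("Center title", "Center article"),
           ("Right title", "Right article")]
print(build_input(sources, shuffle=False))
print(build_target("Neutral title", "Neutral summary"))
```

Because the title tokens precede the article tokens in the single target sequence, an autoregressive decoder trained with the standard negative log-likelihood objective conditions the article generation on the already-generated neutral title, which is what enforces the hierarchical relationship.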
Models        | Arousal+ ↓ | Arousal− ↓ | Arousal_sum ↓ | BLEU ↑ | ROUGE-1-R ↑ | FeQA ↑
Source input  | 6.76       | 3.64       | 10.40         | 8.27   | 56.57%      | –
LEXRANK       | 3.02       | 1.74       | 4.76          | 12.21  | 39.08%      | 53.44%
BARTCNN       | 2.09       | 1.23       | 3.32          | 10.49  | 35.63%      | 58.03%
PEGASUSMULTI  | 5.12       | 2.39       | 7.51          | 6.12   | 44.42%      | 22.24%
BARTMULTI     | 5.94       | 2.66       | 8.61          | 4.24   | 35.76%      | 21.06%
NEUSFT        | 1.86       | 1.00       | 2.85          | 11.67  | 35.11%      | 58.50%
NEUS-TITLE    | 1.69       | 0.83       | 2.53          | 12.05  | 36.07%      | 45.95%

Table 2: Experimental results on the ALLSIDES test set. We provide the level of framing bias inherent in the "source input" from the ALLSIDES test set as a reference point for the framing bias metric. For the framing bias metric, lower is better (↓); for the other scores, higher is better (↑).

Meanwhile, we also have to ensure a hierarchical relationship between the two tasks in our MTL training, because article-level neutral summarization leverages the generated neutral title as an additional resource. We use a simple technique to achieve hierarchical MTL: we format our hierarchical data pair (title, article) as a single natural-language text with identifier tokens ("TITLE=>", "ARTICLE=>"). This technique allows us to optimize for both the title and article neutral summarization tasks easily, by optimizing the negative log-likelihood of the single target Y. The auto-regressive nature of the decoder also ensures the hierarchical relationship between the title and the article.

We train BART's autoregressive decoder to generate the target text Y formatted as follows:

TITLE=> Tneu. ARTICLE=> Aneu,

where Tneu and Aneu denote the neutral title and the neutral article summary. The input X to our BART encoder is formatted similarly to the target text Y:

TITLE=> TL. ARTICLE=> AL. [SEP] TITLE=> TC. ARTICLE=> AC. [SEP] TITLE=> TR. ARTICLE=> AR,

where TL/C/R and AL/C/R denote the title and article from left-wing, center, and right-wing media, and [SEP] denotes the special token that separates the different inputs. Note that the order of left, right, and center is randomly shuffled for each sample to discourage the model from learning spurious patterns from the input.

7 Results and Analysis

In this section, we point out noteworthy observations from the quantitative results in Table 2, along with insights obtained through qualitative analysis. Table 3 shows generation examples that are most representative of the insights we share.¹⁰

¹⁰ More examples are provided in the appendix.

7.1 Main Results

Firstly, summarization models can reduce framing bias to a certain degree (the Arousal_sum score drops from 10.40 to 4.76 for LEXRANK and to 3.32 for BARTCNN). This is because informational framing bias is addressed when summarization models extract the most salient sentences, which contain the information common to the inputs. However, summarization models, especially LEXRANK, cannot handle lexical framing bias, as shown in Table 3. Moreover, if we look further at the results of LEXRANK, it is one of the best-performing models in terms of ROUGE-1-R (39.08%), the standard metric for summarization performance, but not in terms of the framing bias metric. This suggests that good summarization performance (ROUGE-1-R) does not guarantee that the model is also neutral – i.e., the requirement for summaries to be neutral adds an extra dimension to the summarization task.

Secondly, one interesting pattern that deserves attention is that only the single-document summarization models (BARTCNN and LEXRANK) reduced framing bias well, not the multi-document summarization models (PEGASUSMULTI and BARTMULTI). This is rather surprising because our task setup is more similar to MDS than to SDS. One of the major contributors to the high bias of the MDS models is probably hallucination, because the MDS models show drastically poorer hallucination performance than all the other models (both MDS models, PEGASUSMULTI and BARTMULTI, achieve
SOURCE:
Title: Here Are The 81 People And Entities Close To Trump Democrats Are Investigating. Article: Democrats on the House Judiciary Committee on Monday sent document requests to 81 agencies, entities and individuals close to President Donald Trump as part of a broad investigation into possible obstruction of justice, public corruption and other abuses of power. The list includes Trump's sons, Eric Trump and Donald Trump Jr., as well as his son-in-law, Jared Kushner.
Title: House Panel Requests Documents From Associates of Trump. Article: House Democrats intensified their investigations into President Trump and his associates Monday, demanding records from more than 80 people and organizations related to his business dealings, interactions with the Justice Department and communications with Russian President Vladimir Putin.
Title: Dems Continue Their Assault on The Trump Administration By Launching Another Probe. Article: Democrats are desperate to take down President Donald Trump. The Russia probe has proven to be ineffective and, quite frankly, a waste of time and taxpayer money. They didn't find what they wanted so now they're launching another probe.

TARGET: House Democrats launched a broad probe into President Trump on Monday, requesting documents from 81 agencies and individuals as they investigate his business dealings, interactions with Russia, and possible obstruction of justice.

LEXRANK: Democrats are desperate to take down President Donald Trump. The Russia probe has proven to be ineffective and, quite frankly, a waste of time and taxpayer money.

NEUSFT: The Russia probe has proven to be ineffective and, quite frankly, a waste of time and taxpayer money.

NEUS-TITLE: TITLE=> House Panel Requests Documents. ARTICLE=> The House Select Committee on Intelligence has requested documents from 81 people and entities close to President Trump, including his sons Eric and Donald Trump Jr., as well as Jared Kushner.
Table 3: Generation examples for analysis purposes. Red highlights the tokens identified by VAD lexicons. Refer to the appendix for more examples.

SOURCE: ... President Trump on Saturday blasted what he called the "phony" BuzzFeed story and the mainstream media's coverage of it. ...

MDS Hallucination: president trump on sunday slammed what he called a "phony" story by the "dishonest" and "fake news" news outlet in a series of tweets. ... "the fake news media is working overtime to make this story look like it is true," trump tweeted. "they are trying to make it look like the president is trying to hide something, but it is not true!"

Table 4: Illustration of hallucinatory framing bias from MDS models and the corresponding "most relevant source snippet" from the source input. Refer to the appendix for more examples with full context.

22.24% and 21.06%, when most of the other models achieve over 50%).^11 This suggests that the framing bias of MDS models may be related to the hallucination of politically biased content. We investigate this in the next subsection (§7.2).

Thirdly, although summarization models help reduce the framing bias scores, we, unsurprisingly, observe a more considerable bias reduction when training with in-domain data. NEUSFT shows a further drop across all framing bias metrics without sacrificing the ability to keep salient information. However, we observe that NEUSFT often copies directly without any neutral re-writing – e.g., the NEUSFT example shown in Table 3 is a direct copy of a sentence from the input source.

Lastly, we can achieve a slightly further improvement with NEUS-TITLE across all metrics except the FeQA score. This model demonstrates a stronger tendency to paraphrase rather than directly copy, and comparatively more neutral framing of the issue. As shown in Table 3, while LEXRANK and NEUSFT focus on the "ineffectiveness of the Russia probe", the TARGET and NEUS-TITLE focus on the start of the investigation with the request for documents. NEUS-TITLE also generates a title with a neutral frame similar to that of the TARGET, suggesting that the title generation guided the correctly framed generation.

^11 Note that 22.24% and 21.06% are already high FeQA scores; however, they are comparatively low in reference to the other models.

7.2 Further Analysis and Discussion

Q: Is hallucination contributing to the high framing bias in MDS models? Through qualitative analysis, we discovered that the MDS generations were hallucinating politically controversial or sensational content that did not exist in the input sources. This is probably originating from the
memorization of either the training data or the LM-pretraining corpus. For instance, in Table 4, we can observe stylistic bias being injected – "the 'dishonest' and 'fake news' news outlet". Also, the excessive elaboration of the president's comment towards the news media, which appears in neither the source nor the target, can be considered informational bias – "they are trying to make it look like the president is trying to hide something, but it is not true!" This analysis unveils an overlooked danger of hallucination: the risk of introducing political framing bias into summary generations. Note that this problem is not confined to MDS models, because the other baseline models also have room for improvement in terms of the FeQA hallucination score.

Q: What are the remaining challenges and future directions? The experimental results of NEUS-TITLE suggest that there is room for improvement. We qualitatively checked some error cases and discovered that the title generation is, unsurprisingly, not always accurate, and the error propagating from the title-generation step adversely affected the overall performance. Thus, one possible future direction would be to improve the neutral title generation, which would in turn improve the neutral summarization.

Another challenge is the subtle lexical bias involving nuanced word choices that manoeuvre readers into understanding events from biased frames. For example, "put on hold" and "stalled" both describe the same outcome, but the latter has a more negative connotation. Improving the model's awareness of such nuanced words, or devising ways to incorporate style-transfer-based bias mitigation approaches (Liu et al., 2021), could be another helpful future direction.

We started the neutral summarization task assuming that framing bias originates from the source inputs. However, our results and analysis suggest that hallucination is another contributor to framing bias. Leveraging hallucination mitigation techniques would be a valuable future direction for the NEUS task. We believe it will help to reduce informational framing bias, although it may be less effective against lexical framing bias. Moreover, our work can also be used to facilitate hallucination research. We believe the proposed framing bias metric will help researchers evaluate hallucinatory phenomena from angles other than "factuality". The proposed framing bias metric could also be adapted to the hallucination problem without a "neutral" reference: the source input can substitute for the "neutral" reference to measure whether the generated summary is more politically biased than the source – a potential indication of political hallucination.

8 Conclusion

We introduce the new task of Neutral Multi-News Summarization (NEUS) to mitigate media framing bias by providing a neutral summary of articles, along with the ALLSIDES dataset and a set of metrics. Throughout the work, we share insights for understanding the challenges and future directions of the task. We show the relationships among polarity, extra information, and framing bias, which guide our metric design, while the insight that the title serves as an indicator of framing bias leads to our model design. Our qualitative analysis reveals that hallucinatory content generated by models may also contribute to framing bias. We hope our work stimulates researchers to actively tackle political framing bias in both human-written and machine-generated texts.

Ethical Considerations

The idea of unbiased journalism has always been challenged^12 because journalists make their own editorial judgements, which can never be guaranteed to be completely bias-free. Therefore, we propose to generate a comprehensive summary of articles from different political leanings, instead of trying to generate a gold-standard "neutral" article.

^12 https://www.allsides.com/blog/does-unbiased-news-really-exist

One consideration is the bias induced by the computational approach itself. Automatic approaches replace a known source bias with another bias caused by human-annotated data or the machine-learning models. Given the risk of uncontrolled adoption of such automatic tools, careful guidance should be provided on how to adopt them. For instance, an automatically generated neutral summary should be provided with reference to the original source instead of standing alone.

We use news from English-language sources only, and largely American news outlets, throughout this paper. Partisanship in this data refers to domestic American politics. We note that this work does not cover media bias at the international level or in other languages. In future work, we
will explore the application of our methodology to different cultures or languages. However, we hope the paradigm of NEUS, providing multiple sides to neutralize the view of an issue, can encourage future research on mitigating framing bias in other languages and cultures.

References

2021. Center – what does a "center" media bias rating mean?
Yejin Bang, Nayeon Lee, Etsuko Ishii, Andrea Madotto, and Pascale Fung. 2021. Assessing political prudence of open-domain chatbots. arXiv preprint arXiv:2106.06157.
Yonatan Belinkov, Sebastian Gehrmann, and Ellie Pavlick. 2020. Interpretability and analysis in neural NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 1–5.
Peng Cui and Le Hu. 2021. Topic-guided abstractive multi-document summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1463–1472.
Claes De Vreese. 2004. The effects of strategic news on political cynicism, issue evaluations, and policy support: A two-wave experiment. Mass Communication & Society, 7(2):191–214.
Jay DeYoung, Iz Beltagy, Madeleine van Zuylen, Bailey Kuehl, and Lucy Lu Wang. 2021. MS2: Multi-document summarization of medical studies. arXiv preprint arXiv:2104.06486.
Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.
Robert M Entman. 1993. Framing: Towards clarification of a fractured paradigm. McQuail's Reader in Mass Communication Theory, pages 390–397.
Robert M Entman. 2002. Framing: Towards clarification of a fractured paradigm. McQuail's Reader in Mass Communication Theory. London, California and New Delhi: Sage.
Robert M Entman. 2007. Framing bias: Media in the distribution of power. Journal of Communication, 57(1):163–173.
Günes Erkan and Dragomir R Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.
Alexander R Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749.
Lisa Fan, Marshall White, Eva Sharma, Ruisi Su, Prafulla Kumar Choubey, Ruihong Huang, and Lu Wang. 2019. In plain sight: Media bias through the lens of factual reporting. arXiv preprint arXiv:1909.02670.
Matthew Gentzkow and Jesse M Shapiro. 2006. Media bias and reputation. Journal of Political Economy, 114(2):280–316.
Matthew Gentzkow and Jesse M Shapiro. 2010. What drives media slant? Evidence from US daily newspapers. Econometrica, 78(1):35–71.
Matthew Gentzkow, Jesse M Shapiro, and Daniel F Stone. 2015. Media bias in the marketplace: Theory. In Handbook of Media Economics, volume 1, pages 623–645. Elsevier.
Erving Goffman. 1974. Frame Analysis: An Essay on the Organization of Experience. Harvard University Press.
Tim Groeling. 2013. Media bias by the numbers: Challenges and opportunities in the empirical study of partisan news. Annual Review of Political Science, 16:129–151.
Felix Hamborg, Karsten Donnay, and Bela Gipp. 2019a. Automated identification of media bias in news articles: An interdisciplinary literature review. International Journal on Digital Libraries, 20(4):391–415.
Felix Hamborg, Norman Meuschke, and Bela Gipp. 2017. Matrix-based news aggregation: Exploring different news perspectives. In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 1–10. IEEE.
Felix Hamborg, Anastasia Zhukova, and Bela Gipp. 2019b. Illegal aliens or undocumented immigrants? Towards the automated identification of bias by word choice and labeling. In International Conference on Information, pages 179–187. Springer.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28:1693–1701.
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. arXiv preprint arXiv:2202.03629.
Hanqi Jin, Tianming Wang, and Xiaojun Wan. 2020. Multi-granularity interaction network for extractive and abstractive multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6244–6254.
Philippe Laban and Marti A Hearst. 2017. newsLens: Building and visualizing long-ranging news stories. In Proceedings of the Events and Stories in the News Workshop, pages 1–9.
Logan Lebanoff, Kaiqiang Song, and Fei Liu. 2018. Adapting the neural encoder-decoder framework from single to multi-document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4131–4141.
Nayeon Lee, Yejin Bang, Andrea Madotto, Madian Khabsa, and Pascale Fung. 2021a. Towards few-shot fact-checking via perplexity. arXiv preprint arXiv:2103.09535.
Nayeon Lee, Belinda Z Li, Sinong Wang, Pascale Fung, Hao Ma, Wen-tau Yih, and Madian Khabsa. 2021b. On unifying misinformation detection. arXiv preprint arXiv:2104.05243.
Nayeon Lee, Belinda Z Li, Sinong Wang, Wen-tau Yih, Hao Ma, and Madian Khabsa. 2020. Language models as fact checkers? arXiv preprint arXiv:2006.04102.
Nayeon Lee, Zihan Liu, and Pascale Fung. 2019. Team yeon-zi at SemEval-2019 Task 4: Hyperpartisan news detection by de-noising weakly-labeled data. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 1052–1056.
Nayeon Lee, Chien-Sheng Wu, and Pascale Fung. 2018. Improving large-scale fact-checking using decomposable attention models and lexical tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1133–1138.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations.
Ruibo Liu, Chenyan Jia, Jason Wei, Guangxuan Xu, Lili Wang, and Soroush Vosoughi. 2021. Mitigating political bias in language models through reinforced calibration.
Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081.
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.
Maxwell McCombs and Amy Reynolds. 2009. How the news shapes our civic agenda. In Media Effects, pages 17–32. Routledge.
Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.
Fred Morstatter, Liang Wu, Uraz Yavanoglu, Stephen R Corman, and Huan Liu. 2018. Identifying framing bias in online news. ACM Transactions on Social Computing, 1(2):1–18.
Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and Chin-Yew Lin. 2019. A simple recipe towards reducing hallucination in neural surface realisation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2673–2679.
Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829.
Richard Yuanzhe Pang, Adam D Lelkes, Vinh Q Tran, and Cong Yu. 2021. AgreeSum: Agreement-oriented multi-document summarization. arXiv preprint arXiv:2106.02278.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Souneil Park, Seungwoo Kang, Sangyoung Chung, and Junehwa Song. 2009. NewsCube: Delivering multiple aspects of news to mitigate media bias. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 443–452.
Souneil Park, Kyung-Soon Lee, and Junehwa Song. 2011. Contrasting opposing views of news articles on contentious issues. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 340–349.
Ramakanth Pasunuru, Mengwen Liu, Mohit Bansal, Sujith Ravi, and Markus Dreyer. 2021. Efficiently summarizing text and graph encodings of multi-document clusters. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4768–4779.
Elizabeth M Perse and Jennifer Lambe. 2016. Media Effects and Society. Routledge.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. 2013. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1650–1659.
Ehud Reiter. 2018. A structured review of the validity of BLEU. Computational Linguistics, 44(3):393–401.
Abeed Sarker, Ari Z Klein, Janet Mee, Polina Harik, and Graciela Gonzalez-Hernandez. 2019. An interpretable natural language processing system for written medical examination assessment. Journal of Biomedical Informatics, 98:103268.
Dietram A Scheufele. 2000. Agenda-setting, priming, and framing revisited: Another look at cognitive effects of political communication. Mass Communication & Society, 3(2-3):297–316.
Dietram A Scheufele and David Tewksbury. 2007. Framing, agenda setting, and priming: The evolution of three media effects models. Journal of Communication, 57(1):9–20.
All Sides. 2018. Media bias ratings. Allsides.com.
Timo Spinde, Christina Kreuter, Wolfgang Gaissmaier, Felix Hamborg, Bela Gipp, and Helge Giese. 2021. Do you think it's biased? How to ask for the perception of media bias. In 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 61–69. IEEE.
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355.
Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Tae Yano, Philip Resnik, and Noah A Smith. 2010. Shedding (a thousand points of) light on biased language. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 152–158.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization.
Yifan Zhang, Giovanni Da San Martino, Alberto Barrón-Cedeno, Salvatore Romeo, Jisun An, Haewoon Kwak, Todor Staykovski, Israa Jaradat, Georgi Karadzhov, Ramy Baly, et al. 2019b. Tanbih: Get to know what you are reading. EMNLP-IJCNLP 2019, page 223.

Appendix

A Topics covered in the ALLSIDES dataset

The ALLSIDES dataset language is English, and it mainly focuses on U.S. political topics that often result in media bias. The top-5 most frequent topics are 'Elections', 'White House', 'Politics', 'Coronavirus', and 'Immigration'.

The full list is as follows (in descending order of frequency): ['Elections', 'White House', 'Politics', 'Coronavirus', 'Immigration', 'Violence in America', 'Economy and Jobs', 'Supreme Court', 'Middle East', 'US House', 'Healthcare', 'World', 'US Senate', 'National Security', 'Gun Control and Gun Rights', 'Media Bias', 'Federal Budget', 'Terrorism', 'US Congress', 'Foreign Policy', 'Criminal Justice', 'Justice Department', 'Trade', 'Impeachment', 'Donald Trump', 'North Korea', 'Russia', 'Education', 'Environment', 'Free Speech', 'FBI', nan, 'Abortion', 'General News', 'Disaster', 'US Military', 'Technology', 'LGBT Rights', 'Sexual Misconduct', 'Voting Rights and Voter Fraud', 'Joe Biden', 'Race and Racism', 'Economic Policy', 'Justice', 'Holidays', 'Taxes', 'China', 'Polarization', 'Democratic Party', 'Religion and Faith', 'Sports', 'Homeland Security', 'Culture', 'Cybersecurity', 'National Defense', 'Public Health', 'Civil Rights', 'Europe', 'Great Britain', 'Banking and Finance', 'Republican Party', 'NSA', 'Business', 'State Department', 'Facts and Fact Checking', 'Media Industry', 'Labor', 'Veterans Affairs', 'Campaign Finance', 'Life During COVID-19', 'Transportation', 'Marijuana Legalization', 'Agriculture', 'Arts and Entertainment', 'Fake News', 'Campaign Rhetoric', 'Nuclear Weapons', 'Israel', 'Asia', 'CIA', 'Role of Government', 'George Floyd Protests', "Women's Issues", 'Safety and Sanity During COVID-19', 'Animal Welfare', 'Treasury', 'Science', 'Climate Change', 'Domestic Policy', 'Energy', 'Housing and Homelessness', 'Bridging Divides', 'Mexico', 'Inequality', 'COVID-19 Misinformation', 'ISIS', 'Palestine', 'Bernie Sanders', 'Tulsi Gabbard', 'Sustainability', 'Family and Marriage', 'Pete Buttigieg', 'Welfare', 'Opioid Crisis', 'Amy Klobuchar', 'Food', 'EPA', 'South Korea', 'Alaska: US Senate 2014', 'Social Security', 'US Constitution', 'Tom Steyer', 'Andrew Yang', 'Africa']

B Additional Salient Information Score Results

We report additional salient information F1 (Table 5) and recall (Table 6) scores for ROUGE 1, ROUGE 2, and ROUGE L.

                 ROUGE 1 F1   ROUGE 2 F1   ROUGE L F1
LEXRANK            33.60%       13.60%       29.77%
BARTCNN            33.76%       13.67%       30.57%
PEGASUSMULTI       30.03%       10.28%       26.70%
BARTMULTI          23.01%        6.84%       20.55%
NEUSFT             36.76%       16.27%       32.86%
NEUS-TITLE         35.49%       15.69%       32.05%

Table 5: Additional salient information scores. F1 scores for ROUGE 1, ROUGE 2, and ROUGE L on the ALLSIDES test set. Higher is better.

                 ROUGE 1      ROUGE 2      ROUGE L
                 RECALL       RECALL       RECALL
LEXRANK            39.08%       17.66%       34.69%
BARTCNN            35.63%       15.32%       32.22%
PEGASUSMULTI       44.42%       16.99%       39.45%
BARTMULTI          35.76%       12.48%       32.08%
NEUSFT             35.11%       15.74%       31.43%
NEUS-TITLE         36.07%       16.47%       32.63%

Table 6: Additional salient information scores. Recall scores for ROUGE 1, ROUGE 2, and ROUGE L on the ALLSIDES test set. Higher is better.

C Details for Human Evaluation (A/B testing)

We first presented the participants with the definition of framing bias from our paper, and also showed the examples in Table 1 to ensure they understood what framing bias is. Then we asked the following question: "Which one of the articles do you believe to be more biased toward one side or the other side in the reporting of news?" This is modified to serve as a question for A/B testing based on "To what extent do you believe that the article is biased toward one side or the other side in the reporting of news?" The original question is one of the 21 questions which are suitable and reliable for measuring the perception of media bias, designed by Spinde et al. (2021).

The participants (graduate research students) have different nationalities, including Canada, China, Indonesia, Iran, Italy, Japan, Poland, and South Korea (in alphabetical order). All participants reported having no political leaning towards U.S. politics. All participants were fully informed of the usage of the collected data in this particular work and agreed to it.

D Experimental Setup Details

All our experimental code is based on HuggingFace (Wolf et al., 2020). We used the following hyperparameters during training and across models: 10 epochs, a learning rate of 3e-5, and a batch size of 16. We did not do hyperparameter tuning since our objective is to provide various baselines and analysis. Training run-time for all of our experiments is fast (< 6 hr). We ran all experiments with one NVIDIA 2080Ti GPU with 16 GB of memory. Each experiment was a single run.

E Generation Examples from Different Models

To help better understand the performance of each model, we provide more generation examples.
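For reference, the ROUGE-1 recall reported in Table 6 reduces to clipped unigram overlap against the reference summary. Below is a bare-bones sketch of that computation; this is our simplification, and the official ROUGE package additionally applies stemming and other preprocessing, so the numbers it produces will differ.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams covered by the candidate, with counts
    clipped so repeated candidate tokens are not over-credited."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

score = rouge1_recall("house democrats launched a broad probe",
                      "democrats launched a probe into trump")  # 4 of 6 reference tokens covered
```

As the paper's results show, a high recall score of this kind only measures coverage of salient content; it says nothing about whether the covered content is framed neutrally.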