NELA-GT-2018: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles

Page created by Amanda Park
 
CONTINUE READING
NELA-GT-2018: A Large Multi-Labelled News Dataset for The Study
                                                      of Misinformation in News Articles

                                                            Jeppe Nørregaard† , Benjamin D. Horne* , and Sibel Adalı*
                                                              Technical University of Denmark† , Rensselaer Polytechnic Institute*
                                                                        jepno@dtu.dk, horneb@rpi.edu, adalis@rpi.edu
arXiv:1904.01546v1 [cs.CY] 2 Apr 2019

                                                               Abstract                             such as source reliability over an extended period
                                                                                                    of time (Horne et al. 2018; Baly et al. 2018).
                                          In this paper, we present a dataset of 713k articles
                                                                                                        Secondly, fact-checkers tend to concentrate
                                          collected between 02/2018-11/2018. These articles
                                          are collected directly from 194 news and media            their efforts on articles that receive a lot of at-
                                          outlets including mainstream, hyper-partisan, and         tention, making datasets with fact-checked la-
                                          conspiracy sources. We incorporate ground truth           bels engagement-driven. Engagement-driven news
                                          ratings of the sources from 8 different assessment        datasets (for example those based on social media
                                          sites covering multiple dimensions of veracity, in-       mentions), are very useful in engagement-driven
                                          cluding reliability, bias, transparency, adherence        studies, but may not provide a complete picture
                                          to journalistic standards, and consumer trust. The        of attention to malicious news sources. For exam-
                                          NELA-GT-2018 dataset can be found at https:               ple, The Drudge Report, a site known for spread-
                                          //doi.org/10.7910/DVN/ULHLCB.                             ing mixed-veracity information, is 41st in United
                                                                                                    States in terms of the amount of Internet traf-
                                                        1     Introduction                          fic, making it a highly influential source. Readers
                                                                                                    spend a long time on the site, averaging 25 min-
                                        One of the main gaps in the study of misinforma-
                                                                                                    utes with about 11 clicks pages per visit. However,
                                        tion is finding broad labelled datasets, which this
                                                                                                    readers only reach the site using social-media links
                                        data set aims to fill. There are a number of pub-
                                                                                                    4% of the time, while 83% of the time they reach
                                        lished misinformation datasets with ground truth,
                                                                                                    it through direct links1 . As a result, we argue that
                                        but they are often small, event specific, engage-
                                                                                                    there is a need for datasets collected independent
                                        ment specific, or incomplete. As a result, they are
                                                                                                    of social media in order to understand the full im-
                                        not sufficient for answering a wide-range of re-
                                                                                                    pact of and tactics used by misleading and hyper-
                                        search questions.
                                                                                                    partisan news producers.
                                           First, for many studies, particularly those in-
                                        volving machine learning methods, a large dataset               Lastly, news, particularly state-sponsored pro-
                                        with ground truth labels is necessary. Article-             paganda, can misinform through methods other
                                        level ground truth (i.e. true/false) for such datasets      than explicitly fabricated claims (Zannettou et al.
                                        can be infeasible, as fact-checking requires ex-            2018). Hence, fact-checking labels may not cap-
                                        perts conducting a slow and labor-intensive pro-            ture all types of misinformation. This leads to la-
                                        cess. Furthermore, the slow speed of fact-checking          beling mechanisms that account for other factors,
                                        makes datasets quickly out-of-date. One solution            such as whether the sources have bias in their
                                        that has been proposed to mitigate problems with            reporting or how much they adhere to journalis-
                                        article level labels is to use higher level labels,         tic standards. Therefore, we argue that datasets
                                                                                                    should contain multiple types of ground truth at the
                                        Copyright c 2019, Association for the Advancement           source-level in order to perform complete studies
                                        of Artificial Intelligence (www.aaai.org). All rights re-
                                                                                                       1
                                        served.                                                            source: similarweb.com, consulted on 13/01/2019
of misinformation.                                       news articles. While claims can be extracted from
   The dataset presented in this paper is and            news articles, most of these datasets use claims
engagement-independent collection of news arti-          made on social media or by political figures in
cles with multiple types of source-level ground          speeches. LIAR is a fake claim benchmark dataset
truths. Our dataset contains 713,534 articles from       that has 12.8K fact-check short statements from
194 news outlets collected between 01/02/2018-           politifact.com (Wang 2017). The claims in
30/11/2018. These articles are collected directly        the dataset are from social media posts and polit-
from each news producers’ websites, independent          ical speeches. CREDBANK is a dataset of 60M
of social media. We corroborate ground truth la-         tweets between 2015 and 2016. Each tweet is
bels from eight different assessment sites cover-        associated to a news event and is labeled with
ing multiple dimensions of veracity, including reli-     credibility by Amazon Mechanical Turkers (Mitra
ability, bias, transparency, and consumer trust. The     and Gilbert 2015). Again, this dataset only con-
dataset sources are from both mainstream media           tains claims/tweets, not complete news articles.
and alternative media across multiple countries.         PHEME is a dataset of 330 tweet threads anno-
The dataset can be found at https://doi.                 tated by journalist. Each tweet is associated with a
org/10.7910/DVN/ULHLCB. In this paper, we                news story (Zubiaga et al. 2016). FacebookHoax
outline dataset collection, ground-truth corrobora-      is a dataset containing 15K Facebook posts about
tion, and provide a few use cases.                       science news. The posts are labeled as “hoax”
                                                         or “non-hoax” and come from 32 different Face-
               2   Related Work                          book pages (Tacchini et al. 2017). These datasets
There are many recent news datasets focused on           are highly related to the smaller tweet credibility
misinformation, each with different focus in la-         datasets created in the last decade (Castillo, Men-
belling. Labels include various dimensions of reli-      doza, and Poblete 2011).
ability and various dimensions of bias. Buzzfeed-           There are also several recent unlabelled news
News2 is a small dataset of news articles that had       datasets, which are much larger than most of the
high Facebook engagement during the 2016 U.S.            labeled datasets. NELA2017 is a political news ar-
Presidential Election. The dataset contains 1627         ticle dataset that contains 136K articles from 92
articles that are fact-checked by 5 Buzzfeed jour-       media sources in 2017 (Horne, Khedr, and Adalı
nalists. The dataset labels include if the article is    2018). The dataset includes sources from main-
false or true, along with the political leaning of the   stream, hyper-partisan, conspiracy, and satire me-
source that produced the article. FakeNewsCor-           dia sources. Along with the news articles, the
pus3 is a dataset containing nearly 10M articles         dataset includes a rich set of natural language fea-
labeled using opensources.co. OpenSources                tures on each news article, and the corresponding
is a list of sources labeled by experts. These labels    Facebook engagement statistics. The dataset con-
include 13 different labels related to the reliabil-     tains nearly all of the articles published by the
ity of the source. FakeNewsNet is a collection of        92 sources during the 7 month period. GDELT
datasets containing news articles and tweets. The        is an open database of event-based news articles
dataset includes rich metadata including social fea-     with temporal and location features. It is said to
tures and spatiotemporal information (Shu et al.         be one of the most comprehensive event-based
2018). While this dataset is described in a paper        news datasets. However, GDELT does not ex-
on arxiv.com, to the best of our knowledge, the          plicitly contain maliciously fake or hyper-partisan
data has not been completely released to the public      news sources, needed for misinformation studies.
at this time 4 .                                            While all of these datasets are useful, there are
   Many other misinformation datasets have fo-           several limitations we address with the dataset pre-
cused on individual claims rather than complete          sented int his paper:
   2
    github.com/BuzzFeedNews/2016-10-                     1. Small number of sources and articles - With
facebook-fact-check                                         the exception of FakeNewsCorpus and the
  3                                                         NELA2017 dataset, the current publicly avail-
    github.com/several27/
FakeNewsCorpus                                              able datasets are either small in the number of
  4
    github.com/KaiDMML/FakeNewsNet                          media sources they contain, small in the num-
ber of articles, or both. Furthermore, many of                   3    Dataset Creation
   the larger datasets do not contain multiple types    We created this dataset, with the following steps:
   of sources. In comparison to FakeNewsCorpus,
   our dataset covers a wider range of news, in par-   1. We gathered a wide variety of news sources
   ticular more mainstream news. In addition, our          from varying levels of veracity, including many
   dataset is collected over a longer and more con-        well-studied misinforming sources and other
   sistent period of time, where as the many of al-        less well-known sources.
   ternative news sources in FakeNewsCorpus no         2. We scraped article data from the gathered
   longer exists and the time frame of FakeNews-           sources’ RSS feeds twice a day for 10 months
   Corpus is unknown.                                      in 2018.
                                                       3. We combine and corroborated source-level ve-
2. Engagement-driven - The majority of the cur-            racity labels from 8 independent assessments,
   rent datasets, both for news articles and claims,       some of which are used in the misinformation
   contain only data has been highly engaged with          literature, others that are not. These labels pro-
   on social media or has received attention from          vide multiple and complementary ground truth
   fact-checking organizations. While understand-          allowing for many different ways to character-
   ing the engagement of misinformation is an im-          ize the sources.
   portant task, engagement driven news datasets
   fail to show the complete picture of misinform-         Through this process, we provide 713,534 arti-
   ing news. Both malicious fake news produc-           cles from 194 news and media producers. Along
   ers and hyper-partisan media produce hundreds,       with these articles, we provide multiple labels
   sometimes thousands of articles in a year, most      from 8 independent assessments for each source.
   of which are never seen on social media or fact-     The final set of article data is arranged in an sqlite
   checkers. Questions about when fake news tac-        data, with date, source, title, and cleaned text con-
   tics work or do not work remain unanswered.          tent for each article. The labels are provided in
                                                        CSV format, with rows being sources and columns
3. Lack of ground truth labels - All of the cur-        being each label gathered from all the assessment
   rent large-scale news article datasets do not        sites. The set of labels can also be found in the Ap-
   have any form of labeling for misinformation         pendix Table 2 and Table 3. Specifics on the file-
   research, with exception of FakeNewsCorpus.          formats can be found in the documentation given
   While some contain a mix of reliable and un-         with the dataset. We describe the collection pro-
   reliable sources, it is not necessarily clear to     cess and ground truth in detail below.
   what extent each source is reliable or what di-
   mensions of credibility should be used to as-       News Article Data
   sess the sources. For example, a news arti-         To collect our dataset, we scraped the RSS feeds of
   cle can spread misinformation (or disinforma-       each source twice a day starting on 02/02/2018 us-
   tion) in many ways other than false statements.     ing the Python libraries feedparser and goose. Our
   A news article may use partially false infor-       starting point for source selection was mainstream
   mation, decontextualized information, or infor-     outlets and alternative sources that are mentioned
   mation misrepresented by hyper-partisan lan-        in other studies or high profile cases of false news
   guage. For both machine learning and compar-        coverage. An initial subset of 92 sources was avail-
   ative studies, having well defined labels about     able in NELA2017 dataset (Horne, Khedr, and
   multiple dimensions of veracity is important        Adalı 2018), which already covered a wide array
   in understand what signals a machine learning       of media types. We then continued to expand this
   model is learning or why discovered patterns ex-    source set using the same criteria, as well as by
   ist in news data.                                   automated Google searches to find other outlets
                                                       that published similar articles as those already in
   Thus, our goal with the NELA-GT-2018 dataset        our dataset. Specifically, we queried the Google
is to create a large, veracity-labeled news article    Search API with the titles of the news articles that
dataset that in independent of social media engage-    were previously collected. If a news source that
ment and specific events.                              was not in our source collection list appeared in
5000         Reliability
                          Unreliable
             4000         Reliable
                          Not determined
# articles

             3000         Unknown source
             2000
             1000
                0
             5000            Bias
                          Left
             4000         Right
                          Not determined
# articles

             3000         Unknown source
             2000
             1000
                0
             01/02/2018    01/03/2018      01/04/2018   01/05/2018   01/06/2018   01/07/2018   01/08/2018   01/09/2018   01/10/2018   01/11/2018
                                                                                      Date

   Figure 1: Number of articles in the dataset over time. For each source, we compute an aggregated reliability
   and bias rating, and label all articles in the source with this rating for illustration purposes. The two stack-
   plots contain the same datapoints, but dissected with these two distinct aggregated labels. If the aggregated
   label is uncertain we label the articles with gray. Grey-shaded vertical regions are marks where unusually
   little data were collected due to some problem with data-scraping or potentially low activity. The increase in
   the number of data points around the 01/08/2018 is caused by the addition of new sources to the collection.

   the top 10 pages of the Google search, we added                                      make their assessments, and most of these assess-
   it to our source collection list. Note, we do not                                    ments cover relatively few sources. Thus, in order
   include small local news sources or sources that                                     to create a large, centralized set of veracity labels,
   did not have operational RSS feeds, which signifi-                                   we collected ground truth (GT) data from eight dif-
   cantly reduces the size of the expected source set.                                  ferent sites, which all attempt to assess the reliabil-
   Furthermore, this Google expansion process was                                       ity and/or the bias of news.
   ran in July 2018, which caused a large increase in                                      These assessment sites are:
   unlabeled news sources, as shown in Figure 1.                                              1. NewsGuard
      By the end of the collection process                                                    2. Pew Research Center
   (30/11/2018) we had 713K articles from 194                                                 3. Wikipedia
   news and media producers. These sources come                                               4. OpenSources
   from a variety of countries, but are all articles are                                      5. Media Bias/Fact Check (MBFC)
   in English. In Tables 2, 3, and 4 we write the date                                        6. AllSides
   of the first scraped article from each source. After                                       7. BuzzFeed News
   these dates, we have near complete data from
                                                                                              8. Politifact
   the respective sources RSS-feeds. In Figure 1 we
   show the number of articles collected over time.                                        We gather data from all these sites, using html-
                                                                                        scraping and GUI-automation, and combine their
                                                                                        labels to create a centralized set of veracity ground
   Ground Truth Data                                                                    truth labels. Of the 194 sources in our data set, 154
   A number of organizations and platforms have de-                                     sources have GT labels from at least one of the
   veloped methods for assessing reliability and bias                                   assessment sites, while the remaining 40 sources
   of news sources. These organizations come from                                       remain unlabelled. Tables 2 and 3 show the com-
   both the research community and from practitioner                                    bined labels, while Table 4 lists the sources where
   communities. While each of these organizations                                       no label information was found. Table 1 provide a
   and platforms provide useful assessments on their                                    detailed described of each assessment and Table 5
   own, each uses different criteria and methods to                                     lists urls for the assessment sites.
NewsGuard uses a group of trained journalists to         lished online in detail. This list has also been
assess credibility and transparency of news web-         used in several academic studies (Horne and Adali
sites. They emphasizes the use of trained people         2017; Horne et al. 2018; Baly et al. 2018). Unfor-
rather than algorithms to determine credibility of       tunately, last repository commit was 2 years ago
sources. They allows respective news outlets com-        and many of the labeled sources no longer exist.
ment on their verdict before publishing it. They         The site provides a list of sources with 1-3 tags per
provide extensions for major browsers to inform          source (See Table 1).
users of the credibility of the sites they visit. They
also display icons on search results in search en-       Media Bias/Fact Check is a platform that an-
gines like Google and Duck Duck Go. Their anal-          alyzes news sources to determine their credibil-
ysis produces 9 granular, binary labels for each         ity, as well as to ”educate the public on media
site, with assigned point scores that sums to 100.       bias and deceptive news practices”. The site pub-
Based on the sum of points the sites get an over-        lishes the names of its editorial team and only ac-
all label for credibility - green for good score, red    cepts outside information from individuals who
for bad score. Three additional overall labels ex-       have accepted International Fact-Checking Net-
ist for satire, user-produced content and sites with     work’s code of principles. According to its pub-
unfinished analysis. Table 1 describes the granu-        lished methodology, the site numerically evaluates
lar labels. NewsGuard is transparent about their         each news outlet in 4 categories; biased word-
methodology and publish a policy for ethics and          ing/headlines, factual/sourcing, story choices and
conflicts of interest. Their full staff is listed with   political affiliation, and uses the mean of these for
names online and their ratings are free.                 a final verdict. As of January 2019, we were unfor-
                                                         tunately not able to find the numerical categories
Pew Research Center published an article enti-           for the sources. We were able to find a factual re-
tled ”Political Polarization & Media Habits” which       porting label, which is derived from the previously
analysed trust in specific news sources by liber-        mentioned scores. Many sources also had descrip-
als and conservatives. This analysis used 5 groups       tive labels, some of which were related to relia-
of people, ranging from liberals to conservatives,       bility and some of which were related to bias. All
and each group provided a rating of how much             these labels are described in Table 1.
they trust each source. The ratings are aggregated
to show whether readers with different political         Allsides takes a very idealistic approach to assess-
leanings predominantly trust or distrust a spe-          ing bias of sites and is mainly data-driven. They
cific source. We provide this trust label for each       emphasize that news are inherently biased, that a
source and political leaning, as a label for congru-     mixed news ”diet” is the true goal for newsread-
ency between bias a readership (rather than a fact-      ers and that bias can be hidden and unconscious.
checking label).                                         This site creates data through a set of methods,
                                                         each of which are noted for the sources. It conducts
Wikipedia published a list of fake news websites,        blind surveys on material in the public as well as
which they define as sites that ”intentionally, but      in an editorial board, use third party data and as-
not necessarily solely, publish hoaxes and disinfor-     sessment, conducts internal research on sources if
mation for purposes other than news satire”. The         needed, and also has a community feedback func-
page has more than 500 edits, 162 cited references       tion for all bias assessments. In the community
and has been in existence since 18/11/2016. There        feedback, users can vote to agree or disagree with
is no information on how the sites were selected,        Allsides assessment of a source. They note that the
but for each source there are references to other        community feedback is not normalized with re-
sites which has reported their bad behaviour. We         spect to bias, and should more be used as a flag
provide a fake-news tag for sources on the list.         for their own use on whether their assessments are
                                                         off and needs updating. We include their bias label
Open Sources describes itself as a ”curated re-          and feedback numbers (votes agreeing and votes
source for assessing online information sources,         disagreeing) for each source. The feedback num-
available for public use” and its analysis are done      ber are not shown in the paper, but can be found in
by its own team of experts. The criteria is pub-         the dataset.
BuzzFeed News published an article ”Inside              ticles). One problem with this approach is that
The Partisan Fight For Your News Feed” on               labelling individual articles requires a lot of re-
08/08/2017 which describes a study conducted            sources and is often times not possible. For many
by them on how partisan websites and Facebook           machine learning algorithms the minimum re-
pages have been created in increasing numbers.          quirement of labelled samples is in the thousands.
They publish an associated dataset with news            Furthermore, verifying articles will commonly re-
sources and their political leaning (left and right),   quire considerable time from an expert. A second
which we include.                                       problem is that the verification of statements in
                                                        articles can require a lot of time. This can make
PolitiFact is a well-known fact-checking organi-        available labelled articles outdated for analyzing
zation which investigates claims and evaluates the      contemporary articles, due to shifts in topics and
truthfulness of those claims. The statements can        news cycle.
be from any public person or simply rumours that
gain enough attention. PolitiFact’s data is very dif-      An alternative approach to creating labels is
ferent from the rest of our labelling sites, as their   through distant supervision (or weak supervision),
assessment is on article/statement level and not        where labels are created at the source-level and
source level. They also aggregate the statements        used as proxies for article-level labels. One advan-
and their labels for the sources that published the     tage of the approach is that it reduces the work-
statements. We have counted the types of state-         load of labelling. Additionally, labels are known
ments coming from each source, which could be           instantaneously for articles from known sources al-
used to indicate their truthfulness. However the        lowing real time update of parameters and analysis
data is not well normalized, as some sites have         of news. This approach has been shown promising
many noted statements, while some have none, due        in recent misinformation detection work (Horne et
to the origin of the statements and the amount of       al. 2018; Baly et al. 2018). The NELA-GT-2018
attention each source has.                              dataset can be used out-of-the-box for this type of
                                                        machine learning study.
Amazon’s Alexa provides a ranking of nearly all
websites based on frequency of visits, to which
they provide free access to the top 1M. We in-          Semi-Supervised Learning
clude the position of the sources in this rating in
the dataset based on our access to Alexa on 13th        Another commonly debated issue in misinfor-
of January 2019. Note, this data comes from the         mation research is handling new articles from
free portion of Alexa’s data, not the paid portion.     mixed-veracity (partial truths, benign or mali-
Furthermore, these rankings will change over time.      cious) sources or handling articles from newly
                                                        emerging sources during events (such as elec-
                 4    Use Cases                         tions). One potential way to address these prob-
                                                        lems is using semi-supervised learning, in which
There are many threads of misinformation research
                                                        these uncertain veracity news sources are included
that this dataset can benefit. We argue that our
                                                        as unlabelled data. This approach can improve sta-
dataset can especially benefit automated news ve-
                                                        bility and increase the working domain for auto-
racity methods, which need large labelled datasets,
                                                        mated systems. In fact, it has been shown that, with
and qualitative studies that focus on the tactics
                                                        some assumptions, semi-supervised approaches
used by malicious and hyper-partisan news pro-
                                                        can improve performance over fully supervised ap-
ducers. We discuss a few examples below.
                                                        proaches, where unlabelled samples enables clas-
                                                        sifiers to reduce risk exponentially with the num-
Distant Supervised Learning                             ber of labelled samples (Castelli and Cover 1996).
Much research in news has been focused on auto-         Depending on the problem, this dataset provides
mated methods for detecting misinformation (Ku-         consistent labels of 100+ sources, verified by mul-
mar and Shah 2018). For machine learning sys-           tiple assessment sites. Remaining sources are ei-
tems, this analysis generally requires article-level    ther completely unknown, or are sparsely labelled,
labelling (i.e. false/bias labels of individual ar-     but can be utilized with semi-supervised methods.
Mixed-Method Studies                                        parameter. Ieee Transactions on Information Theory
 There are unanswered research questions about the           42(6):2102–2117.
 tactics used by news producers publishing false,           [Castillo, Mendoza, and Poblete 2011] Castillo,         C.;
 misleading, or propaganda news. These questions             Mendoza, M.; and Poblete, B. 2011. Information cred-
                                                             ibility on twitter. In Proceedings of WWW, 675–684.
 cannot be answered through machine learning
                                                             ACM.
 studies, but rather require mix-method assessments
 in order to be answered. For example, recent work          [Horne and Adali 2017] Horne, B. D., and Adali, S.
                                                             2017. This just in: Fake news packs a lot in title, uses
 has focused on content sharing by alternative me-           simpler, repetitive content in text body, more similar to
 dia sources (Starbird et al. 2018). This work sheds         satire than real news. In NECO Workshop 2017.
 light on the tactics employed by state-sponsored
                                                            [Horne et al. 2018] Horne, B. D.; Dron, W.; Khedr, S.;
 news to create alternative narratives around an             and Adalı, S. 2018. Assessing the news landscape:
 event, but can continue to be improved with data            A multi-module toolkit for evaluating the credibility of
 that is more complete and independent of social             news. In WWW 2018 Companion.
 media. Other question include: how do false news           [Horne, Khedr, and Adalı 2018] Horne, B. D.; Khedr, S.;
 producers change with events? Do they keep con-             and Adalı, S. 2018. Sampling the news producers: A
 sistent ideologies? or do they adapt with the given         large news and feature data set for the study of the com-
 event? Many of these potential tactics are un-              plex media landscape. In ICWSM.
 known. This dataset provides news over many ma-            [Kumar and Shah 2018] Kumar, S., and Shah, N. 2018.
 jor events, which can be easily extracted for spe-          False information on web and social media: A survey.
 cific studies. For qualitative researchers, the data        arXiv preprint arXiv:1804.08559.
 can provide a “head-start” on exploring the data,          [Mitra and Gilbert 2015] Mitra, T., and Gilbert, E. 2015.
 as the veracity of each source is known.                    Credbank: A large-scale social media corpus with asso-
                                                             ciated credibility annotations. In ICWSM, 258–267.
                  5    Conclusion                           [Shu et al. 2018] Shu, K.; Mahudeswaran, D.; Wang, S.;
                                                             Lee, D.; and Liu, H. 2018. Fakenewsnet: A data repos-
 In this paper, we present a labelled news dataset
                                                             itory with news content, social context and dynamic in-
 for the study of misinformation. We argue that the          formation for studying fake news on social media. arXiv
 research community lacks large labelled datasets            preprint arXiv:1809.01286.
 for use in both mixed-method and machine learn-            [Starbird et al. 2018] Starbird, K.; Arif, A.; Wilson, T.;
 ing studies. To address this need, we provide a             Van Koevering, K.; Yefimova, K.; and Scarnecchia, D.
 large dataset of news articles (713K articles), col-        2018. Ecosystem or echo-system? exploring content
 lected over many sources (194), over a long pe-             sharing across alternative media domains.
 riod of time 02/2018-11/2018. The articles are in-         [Tacchini et al. 2017] Tacchini, E.; Ballarin, G.;
 dependent of engagement from online communi-                Della Vedova, M. L.; Moret, S.; and de Alfaro, L. 2017.
 ties, and reflect the publish patterns of the news          Some like it hoax: Automated fake news detection in
 producers. We have furthermore gathered labels              social networks. arXiv preprint arXiv:1704.07506.
 for these sources from 8 different assessment sites,       [Wang 2017] Wang, W. Y. 2017. ” liar, liar pants on fire”:
 each of which seeks to assess the reliability and           A new benchmark dataset for fake news detection. arXiv
 bias of sources and claims. Combined they provide           preprint arXiv:1705.00648.
 a detailed and near-complete labelling of sources,         [Zannettou et al. 2018] Zannettou, S.; Sirivianos, M.;
 which can be used for predictive analysis and qual-         Blackburn, J.; and Kourtellis, N. 2018. The web of
 itative studies of the news landscape.                      false information: Rumors, fake news, hoaxes, click-
                                                             bait, and various other shenanigans. arXiv preprint
                      References                             arXiv:1804.03461.
[Baly et al. 2018] Baly, R.; Karadzhov, G.; Alexandrov,     [Zubiaga et al. 2016] Zubiaga, A.; Liakata, M.; Procter,
 D.; Glass, J.; and Nakov, P. 2018. Predicting factuality    R.; Hoi, G. W. S.; and Tolmie, P. 2016. Analysing
 of reporting and bias of news media sources. In EMNLP       how people orient to and spread rumours in social me-
 2018.                                                       dia by looking at conversational threads. PloS one
                                                             11(3):e0150989.
[Castelli and Cover 1996] Castelli, V., and Cover, T. M.
 1996. The relative value of labeled and unlabeled sam-
 ples in pattern recognition with an unknown mixing
                                                                                6    Appendix
Section                           Description                                                    (NewsGuard points)      Coloring
NewsGuard                    1.   Does not repeatedly publish false content                                   (22.0)
                             2.   Gathers and presents information responsibly                                (18.0)
                             3.   Regularly corrects or clarifies errors                                      (12.5)
                             4.   Handles the difference between news and opinion responsibly                 (12.5)
                             5.   Avoids deceptive headlines                                                  (10.0)
                             6.   Website discloses ownership and financing                                     (7.5)
                             7.   Clearly labels advertising                                                    (7.5)
                             8.   Reveals who’s in charge, including any possible conflicts of interest         (5.0)
                             9.   Provides information about content creators                                   (5.0)
                            10.   Aggregated score computed from 1-9                                                        -
                            11.   Column 10 thresholded at 60 points
Pew Research Center         12.   Trust from consistently-liberals
                            13.   Trust from mostly-liberals
                            14.   Trust from mixed groups
                            15.   Trust from mostly-conservatives
                            16.   Trust from consistently-conservatives
                            17.   Aggregated trust from 12-16
Wikipedia                   18.   Existence of source on Wikipedia’s list of fake news sources
Open Sources                19.   Marked reliable
                            20.   Marked blog
                            21.   Marked clickbait
                            22.   Marked rumor
                            23.   Marked fake
                            24.   Marked unreliable
                            25.   Marked biased
                            26.   Marked conspiracy
                            27.   Marked hate speech
                            28.   Marked junk science
                            29.   Marked political
                            30.   Marked satire
                            31.   Marked state news
Media Bias / Fact Check     32.   Factual reporting from 5 (good) down to 1 (bad)
                            33.   Special label; conspiracy, pseudoscience or questionable source (purple), and
                                  satire (orange)
                            34.   Political leaning / bias from left to right.
Allsides                    35.   Political leaning / bias
BuzzFeed                    36.   Political leaning / bias, but only left and right
PolitiFact                  37.   Has brought story labelled as ”pants on Fire!”
                            38.   Has brought story labelled as false
                            39.   Has brought story labelled as mostly false
                            40.   Has brought story labelled as half-true
                            41.   Has brought story labelled as mostly true
                            42.   Has brought story labelled as true
Alexa Ranking                     The Alexa ranking of the source.                                                       Numerical
# Articles                        The number of articles collected from the source.                                      Numerical
First Observed                    The date of first articles collected from the source.                                 dd-mm-yyyy

Table 1: Details of the information for sources found in tables 2, 3 and 4. We generally use green-to-purple for good-
to-poor reliability/credibility, with grey as inconclusive. For bias we use blue-to-red for left-to-right bias, with grey
as unbiased. Orange is used for special cases. In NewsGuard data it represents missing information, in Open Sources
it marks auxiliary labels and for Media Bias / Fact Check it marks satire.
d d                  rch                                                 ..
                                                                 uar uar             sea ia                                              ias                         ed
                                                               sG wsG              Re kiped                                         i aB                      des Fe
                                                           e w     e             w     i                                          d
                                                                                                                                 e Media Bias..           llsi Buzz
                                      NewsGuard         , N , N Pew Research , Pe , W        Open Sources                    , M                      , A       ,         PolitiFact        Alexa # Articles First Observed
                                  1 2 3 4 5 6 7 8 9   10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31   32                  33 34 35 36               37 38 39 40 41 42
          21stCenturyWire                                                                                                                                                                   186254      322     16-07-2018
                   ABC News                                                                                                                                                                   1106     2808     14-06-2018
                Activist Post                                                                                                                                                                43808     1797     03-02-2018
              Addicting Info                                                                                                                                                                614357      429     01-02-2018
                   Al Jazeera                                                                                                                                                                 3594     4522     31-07-2018
                      Alternet                                                                                                                                                               17464     4816     28-03-2018
       AMERICAblog News                                                                                                                                                                                  42     23-07-2018
                           BBC                                                                                                                                                                  94    16416     01-02-2018
              Bearing Arms                                                                                                                                                                   91113     1193     08-02-2018
         Bipartisan Report                                                                                                                                                                             4060     06-02-2018
          Birmingham Mail                                                                                                                                                                    23331     9243     31-07-2018
                     Breitbart                                                                                                                                                                 323     1877     05-02-2018
           Business Insider                                                                                                                                                                    387      445     01-02-2018
                    Buzzfeed                                                                                                                                                                   291     1661     13-02-2018
                   CBS News                                                                                                                                                                   1224     5397     03-02-2018
        Chicago Sun-Times                                                                                                                                                                              2113     30-07-2018
                         CNBC                                                                                                                                                                  780     2426     05-02-2018
                          CNN                                                                                                                                                                  116     8202     05-02-2018
                   CNS News                                                                                                                                                                  40324     5263     06-02-2018
    Counter Current News                                                                                                                                                                                 23     02-02-2018
          Crooks and Liars                                                                                                                                                                   33864     2465     12-02-2018
                 Daily Beast                                                                                                                                                                  2358     6634     06-02-2018
                    Daily Kos                                                                                                                                                                 3187      994     01-02-2018
                    Daily Mail                                                                                                                                                                 198     3596     06-02-2018
                 Daily Signal                                                                                                                                                                33249      310     01-02-2018
              Daily Stormer                                                                                                                                                                            1378     19-09-2018
                 DC Gazette                                                                                                                                                                             185     01-02-2018
             Democracy 21                                                                                                                                                                                24     04-09-2018
             Drudge Report                                                                                                                                                                     815    18885     06-02-2018
         Evening Standard                                                                                                                                                                     6162    17638     20-07-2018
               Faking News                                                                                                                                                                  270680      220     04-02-2018
          Feministing Blog                                                                                                                                                                               23     30-07-2018
            FiveThirtyEight                                                                                                                                                                   1903      556     01-02-2018
              Foreign Policy                                                                                                                                                                 22817      702     19-07-2018
                      Fortune                                                                                                                                                                 4945     7630     21-03-2018
      Forward Progessives                                                                                                                                                                               142     08-02-2018
                    Fox News                                                                                                                                                                   275     3106     23-02-2018
                    France24                                                                                                                                                                  4993     1732     17-07-2018
             Freedom Daily                                                                                                                                                                               36     12-04-2018
         Freedom Outpost                                                                                                                                                                    127732      321     01-02-2018
      FrontPage Magazine                                                                                                                                                                     78101      892     01-02-2018
      FT Westminster Blog                                                                                                                                                                                10     29-10-2018
                        Fusion                                                                                                                                                                          141     01-02-2018
           GlobalResearch                                                                                                                                                                    52145       30     29-11-2018
                Glossy News                                                                                                                                                                              61     01-02-2018
                       Hot Air                                                                                                                                                               14455     4642     10-02-2018
           HumansAreFree                                                                                                                                                                    126151      426     12-07-2018
               Humor Times                                                                                                                                                                              282     01-02-2018
                     Infowars                                                                                                                                                                 4123     2518     01-02-2018
                 Instapundit                                                                                                                                                                 14733    15584     11-02-2018
 Intellectual Conservative                                                                                                                                                                              379     01-02-2018
                    Intellihub                                                                                                                                                              253500      334     01-02-2018
           Interpreter Mag                                                                                                                                                                               28     22-05-2018
  Investors Business Daily                                                                                                                                                                   21030      730     01-02-2018
                      iPolitics                                                                                                                                                             152948     4253     12-02-2018
               LewRockwell                                                                                                                                                                   41631     1278     19-07-2018
                  Live Action                                                                                                                                                               130580     1054     16-02-2018
Media Matters for America                                                                                                                                                                    36858     2316     03-02-2018
             Mercury News                                                                                                                                                                    10919     4828     31-07-2018
                MotherJones                                                                                                                                                                  13902     1128     11-06-2018
                       MSNBC                                                                                                                                                                  2356     6604     21-03-2018
           National Review                                                                                                                                                                    7985     5129     06-02-2018
              Natural News                                                                                                                                                                   10159     4187     06-02-2018
     New York Daily News                                                                                                                                                                      3827     2042     06-02-2018
             New York Post                                                                                                                                                                     969    25407     06-02-2018
                 New Yorker                                                                                                                                                                   1680      265     05-11-2018
               News Biscuit                                                                                                                                                                 240347     1666     06-02-2018
              News Busters                                                                                                                                                                   31060     3240     04-02-2018
                  Newsweek                                                                                                                                                                    1375     9411     19-07-2018
                 NODISINFO                                                                                                                                                                               29     15-02-2018
                           NPR                                                                                                                                                                 843     5515     06-02-2018
                          oann                                                                                                                                                               45704    14267     21-03-2018
                    Observer                                                                                                                                                                 21487      541     01-02-2018
             Palmer Report                                                                                                                                                                   14618     3539     03-02-2018
     Pamela Geller Report                                                                                                                                                                    85934      410     20-03-2018
                           PBS                                                                                                                                                                1879     1113     11-06-2018
              Pink News UK                                                                                                                                                                   17076     1645     30-07-2018
                       Politico                                                                                                                                                               1038      629     01-02-2018
                   Politics UK                                                                                                                                                              571310      137     20-07-2018
               Politicus USA                                                                                                                                                                 10991     4018     03-02-2018
             Powerline Blog                                                                                                                                                                  31674      894     31-07-2018
            Pravada Report                                                                                                                                                                  163614      601     17-07-2018
               Prison Planet                                                                                                                                                                 73821     2253     09-08-2018
                   Raw Story                                                                                                                                                                  6692     3719     12-02-2018
         Real Clear Politics                                                                                                                                                                  6215     7247     12-02-2018
     Real News Right Now                                                                                                                                                                                 13     02-02-2018
                    RedState                                                                                                                                                                 25729     4808     06-02-2018
                      Reuters                                                                                                                                                                  861     3929     09-02-2018
          RightWingWatch                                                                                                                                                                    173633     1118     09-02-2018
                            RT                                                                                                                                                                 273     4286     01-02-2018
              Russia-Insider                                                                                                                                                                 23104     1030     19-07-2018
                         Salon                                                                                                                                                                4853     1702     02-02-2018
              ScrappleFace                                                                                                                                                                               61     22-06-2018
              Shadow Proof                                                                                                                                                                  365356      260     05-02-2018
                   Shareblue                                                                                                                                                                 92290     2134     06-02-2018
           SkyNewsPolitics                                                                                                                                                                              826     31-07-2018

                                                                       Table 2: Labelling of first part of sources.
d d                  rch                                                ..
                                                                  uar uar             sea ia                                             ias                        ed
                                                                sG wsG              Re kiped                                         iaB                      des Fe
                                                            e w     e             w     i                                          d
                                                                                                                                  e Media Bias..          llsi Buzz
                                      NewsGuard          , N , N Pew Research , Pe , W        Open Sources                    , M                     , A       ,        PolitiFact        Alexa # Articles First Observed
                                  1 2 3 4 5 6 7 8 9    10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31   32                 33 34 35 36              37 38 39 40 41 42
                  SkyNewsUS                                                                                                                                                                            995     06-08-2018
                         Slate                                                                                                                                                               2676      514     01-02-2018
                      sott.net                                                                                                                                                              20944     9319     19-07-2018
                       Spiegel                                                                                                                                                                530     4171     19-07-2018
                      Sputnik                                                                                                                                                                 553    30372     11-02-2018
        Talking Points Memo                                                                                                                                                                  9837     5846     06-02-2018
                          Tass                                                                                                                                                              40473     6160     07-08-2018
                   Telesur TV                                                                                                                                                               19225      860     07-08-2018
  The American Conservative                                                                                                                                                                 28970      439     01-02-2018
                 The Atlantic                                                                                                                                                                1636     1757     01-02-2018
               The Beaverton                                                                                                                                                               209578      854     03-02-2018
        The Borowitz Report                                                                                                                                                                  1680      123     02-02-2018
                  The Chaser                                                                                                                                                                           132     01-02-2018
The Conservative Tree House                                                                                                                                                                 20423     2120     05-02-2018
        The D.C. Clothesline                                                                                                                                                                81913      654     03-09-2018
             The Daily Caller                                                                                                                                                                 923    11550     06-02-2018
           The Daily Express                                                                                                                                                                  837     1585     31-07-2018
             The Daily Mirror                                                                                                                                                                1347    13202     31-07-2018
            The Daily Record                                                                                                                                                                20381     6981     31-07-2018
               The Daily Star                                                                                                                                                                3095      219     20-07-2018
            The Denver Post                                                                                                                                                                 13135     4503     31-07-2018
                   The Duran                                                                                                                                                                97928      959     06-06-2018
            The Fiscal Times                                                                                                                                                               264746      461     06-06-2018
        The Gateway Pundit                                                                                                                                                                   9863     5667     05-02-2018
                The Guardian                                                                                                                                                                  150     2195     01-02-2018
                       The Hill                                                                                                                                                              1199     1968     13-03-2018
         The Huffington Post                                                                                                                                                                  415     5586     05-02-2018
            The Independent                                                                                                                                                                  1078    19799     20-07-2018
                The Intercept                                                                                                                                                                9890     1268     08-02-2018
              The Irish Times                                                                                                                                                                4407     3827     31-07-2018
    The Michelle Malkin Blog                                                                                                                                                               418849       53     05-02-2018
          The Moscow Times                                                                                                                                                                  84361     1137     13-07-2018
        The New York Times                                                                                                                                                                    110     5471     06-02-2018
                    The Onion                                                                                                                                                                6513     1094     28-07-2018
                     The Poke                                                                                                                                                               40252     1313     30-07-2018
         The Political Insider                                                                                                                                                              96311     2680     01-02-2018
             The Right Scoop                                                                                                                                                                64313     2697     06-02-2018
                   The Shovel                                                                                                                                                              432212      223     01-02-2018
                    The Spoof                                                                                                                                                              874461      696     02-02-2018
                      The Sun                                                                                                                                                                1370    43613     31-07-2018
               The Telegraph                                                                                                                                                                  580    33763     19-07-2018
                    The Verge                                                                                                                                                                1131     5951     12-02-2018
   The Washington Examiner                                                                                                                                                                   7581      469     01-02-2018
                TheAntiMedia                                                                                                                                                                84870      666     18-07-2018
                     TheBlaze                                                                                                                                                                7519     5287     06-02-2018
               ThinkProgress                                                                                                                                                                25033     4819     06-02-2018
                 True Activist                                                                                                                                                             420811      370     01-05-2018
                  True Pundit                                                                                                                                                               47881    13660     01-02-2018
                   USA Today                                                                                                                                                                  546     5968     05-02-2018
              Veterans Today                                                                                                                                                                51520     2624     01-02-2018
                           Vox                                                                                                                                                                996     4288     06-02-2018
               Waking Times                                                                                                                                                                 77084      447     02-02-2018
        Washington Monthly                                                                                                                                                                  47712      551     29-07-2018
            Washington Post                                                                                                                                                                   290     1252     11-06-2018
             Western Journal                                                                                                                                                                  409     4729     10-02-2018
        Wings Over Scotland                                                                                                                                                                202315      147     26-07-2018
       WSJ Washington Wire                                                                                                                                                                              79     20-07-2018
                 Yahoo News                                                                                                                                                                     9     1666     01-02-2018

                                                                   Table 3: Labelling of second part of sources.

                     Source                           Alexa # Articles First Observed                      Source                                 Alexa # Articles First Observed
                     Anonymous Conservative                          616       09-02-2018                  Newsnet Scotland                                           35            22-07-2018
                     BBC UK                                         5504       30-07-2018                  Newswars                               68363             4275            13-08-2018
                     Channel 4 UK                       2817         888       30-07-2018                  OSCE                                  136945              636            06-06-2018
                     Common Dreams                                    27       21-03-2018                  Politicalite                                              737            30-07-2018
                     Conservative Home                304146        2248       11-02-2018                  Politicscouk                                              341            01-02-2018
                     Conservative Tribune                           2353       06-02-2018                  Prepare For Change                    121860               11            28-11-2018
                     Crikey                           827664         391       27-07-2018                  Slugger OToole                        309300              303            26-07-2018
                     Delaware Liberal                               1132       09-02-2018                  The Daily Blog                                            457            01-02-2018
                     Dick Morris Blog                 157827         400       07-02-2018                  The Daily Echo                          55841            3329            30-07-2018
                     Fort Russ                         75353        1090       18-07-2018                  The Guardian UK                                         16947            20-07-2018
                     Freedom-Bunker                                 2229       18-07-2018                  The Huffington Post UK                  11216            5855            31-07-2018
                     Hit and Run                                    3441       09-02-2018                  The Inquisitr                                            2467            02-02-2018
                     Hullabaloo Blog                  126769         958       28-07-2018                  The Manchester Evening News              7335            8447            31-07-2018
                     Informnapalm                     281115          32       20-07-2018                  The Week UK                             33604            2207            31-07-2018
                     JewWorldOrder                                  1521       19-07-2018                  Trump Times                                                86            21-09-2018
                     LabourList                       221981         430       30-07-2018                  Unian                                  10908             3312            18-07-2018
                     Liberal Democrat Voice           206720         573       26-07-2018                  Window on Eurasia Blog                495303              840            15-07-2018
                     Losercom                                         10       02-10-2018                  Wizbang                                                    58            05-08-2018
                     Mail                               1383        8461       19-07-2018                  rferl                                   31069            2318            19-07-2018
                     Mint Press News                                1707       09-02-2018                  theRussophileorg                                        31842            06-08-2018

                                                                        Table 4: Sources with no labels found.
            NewsGuard                                  newsguardtech.com
            Pew Research Center                        journalism.org/2014/10/21/political-polarization-media-habits
            Wikipedia                                  en.wikipedia.org/wiki/List_of_fake_news_websites
            Open Sources                               opensources.co
            Media Bias/Fact Check                      mediabiasfactcheck.com
            Allsides                                   allsides.com
            PolitiFact                                 politifact.com
            BuzzFeed News                              buzzfeednews.com/article/craigsilverman/inside-the-partisan-fight-for-your-news-feed
            Alexa Analysis top 1million sites          s3.amazonaws.com/alexa-static/top-1m.csv.zip

                                                                           Table 5: Links for online resources.
You can also read