NELA-GT-2018: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles
Jeppe Nørregaard†, Benjamin D. Horne*, and Sibel Adalı*
Technical University of Denmark†, Rensselaer Polytechnic Institute*
jepno@dtu.dk, horneb@rpi.edu, adalis@rpi.edu

arXiv:1904.01546v1 [cs.CY] 2 Apr 2019

Abstract

In this paper, we present a dataset of 713k articles collected between 02/2018 and 11/2018. These articles are collected directly from 194 news and media outlets, including mainstream, hyper-partisan, and conspiracy sources. We incorporate ground truth ratings of the sources from 8 different assessment sites covering multiple dimensions of veracity, including reliability, bias, transparency, adherence to journalistic standards, and consumer trust. The NELA-GT-2018 dataset can be found at https://doi.org/10.7910/DVN/ULHLCB.

1 Introduction

One of the main gaps in the study of misinformation is finding broad labelled datasets, a gap this dataset aims to fill. There are a number of published misinformation datasets with ground truth, but they are often small, event specific, engagement specific, or incomplete. As a result, they are not sufficient for answering a wide range of research questions.

First, for many studies, particularly those involving machine learning methods, a large dataset with ground truth labels is necessary. Article-level ground truth (i.e. true/false) for such datasets can be infeasible to obtain, as fact-checking requires experts conducting a slow and labor-intensive process. Furthermore, the slow speed of fact-checking makes datasets quickly out-of-date. One solution that has been proposed to mitigate problems with article-level labels is to use higher-level labels, such as source reliability over an extended period of time (Horne et al. 2018; Baly et al. 2018).

Secondly, fact-checkers tend to concentrate their efforts on articles that receive a lot of attention, making datasets with fact-checked labels engagement-driven. Engagement-driven news datasets (for example, those based on social media mentions) are very useful in engagement-driven studies, but may not provide a complete picture of attention to malicious news sources. For example, The Drudge Report, a site known for spreading mixed-veracity information, is 41st in the United States in terms of the amount of Internet traffic, making it a highly influential source. Readers spend a long time on the site, averaging 25 minutes with about 11 page clicks per visit. However, readers only reach the site through social-media links 4% of the time, while 83% of the time they reach it through direct links[1]. As a result, we argue that there is a need for datasets collected independently of social media in order to understand the full impact of, and tactics used by, misleading and hyper-partisan news producers.

Lastly, news, particularly state-sponsored propaganda, can misinform through methods other than explicitly fabricated claims (Zannettou et al. 2018). Hence, fact-checking labels may not capture all types of misinformation. This leads to labeling mechanisms that account for other factors, such as whether the sources have bias in their reporting or how much they adhere to journalistic standards. Therefore, we argue that datasets should contain multiple types of ground truth at the source level in order to perform complete studies of misinformation.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[1] Source: similarweb.com, consulted on 13/01/2019.
The dataset presented in this paper is an engagement-independent collection of news articles with multiple types of source-level ground truth. Our dataset contains 713,534 articles from 194 news outlets collected between 01/02/2018 and 30/11/2018. These articles are collected directly from each news producer's website, independent of social media. We corroborate ground truth labels from eight different assessment sites covering multiple dimensions of veracity, including reliability, bias, transparency, and consumer trust. The dataset sources are from both mainstream media and alternative media across multiple countries. The dataset can be found at https://doi.org/10.7910/DVN/ULHLCB. In this paper, we outline dataset collection and ground-truth corroboration, and provide a few use cases.

2 Related Work

There are many recent news datasets focused on misinformation, each with a different focus in labelling. Labels include various dimensions of reliability and various dimensions of bias. BuzzfeedNews[2] is a small dataset of news articles that had high Facebook engagement during the 2016 U.S. Presidential Election. The dataset contains 1627 articles that are fact-checked by 5 Buzzfeed journalists. The dataset labels include whether the article is false or true, along with the political leaning of the source that produced the article. FakeNewsCorpus[3] is a dataset containing nearly 10M articles labeled using opensources.co. OpenSources is a list of sources labeled by experts, with 13 different labels related to the reliability of the source. FakeNewsNet is a collection of datasets containing news articles and tweets. The dataset includes rich metadata, including social features and spatiotemporal information (Shu et al. 2018). While this dataset is described in a paper on arxiv.com, to the best of our knowledge, the data has not been completely released to the public at this time[4].

Many other misinformation datasets have focused on individual claims rather than complete news articles. While claims can be extracted from news articles, most of these datasets use claims made on social media or by political figures in speeches. LIAR is a fake claim benchmark dataset that has 12.8K fact-checked short statements from politifact.com (Wang 2017). The claims in the dataset are from social media posts and political speeches. CREDBANK is a dataset of 60M tweets between 2015 and 2016. Each tweet is associated with a news event and is labeled with credibility by Amazon Mechanical Turkers (Mitra and Gilbert 2015). Again, this dataset only contains claims/tweets, not complete news articles. PHEME is a dataset of 330 tweet threads annotated by journalists. Each tweet is associated with a news story (Zubiaga et al. 2016). FacebookHoax is a dataset containing 15K Facebook posts about science news. The posts are labeled as "hoax" or "non-hoax" and come from 32 different Facebook pages (Tacchini et al. 2017). These datasets are highly related to the smaller tweet credibility datasets created in the last decade (Castillo, Mendoza, and Poblete 2011).

There are also several recent unlabelled news datasets, which are much larger than most of the labeled datasets. NELA2017 is a political news article dataset that contains 136K articles from 92 media sources in 2017 (Horne, Khedr, and Adalı 2018). The dataset includes sources from mainstream, hyper-partisan, conspiracy, and satire media. Along with the news articles, the dataset includes a rich set of natural language features on each news article and the corresponding Facebook engagement statistics. The dataset contains nearly all of the articles published by the 92 sources during the 7-month period. GDELT is an open database of event-based news articles with temporal and location features. It is said to be one of the most comprehensive event-based news datasets. However, GDELT does not explicitly contain the maliciously fake or hyper-partisan news sources needed for misinformation studies.

[2] github.com/BuzzFeedNews/2016-10-facebook-fact-check
[3] github.com/several27/FakeNewsCorpus
[4] github.com/KaiDMML/FakeNewsNet
While all of these datasets are useful, there are several limitations that we address with the dataset presented in this paper:

1. Small number of sources and articles - With the exception of FakeNewsCorpus and the NELA2017 dataset, the current publicly available datasets are either small in the number of media sources they contain, small in the number of articles, or both. Furthermore, many of the larger datasets do not contain multiple types of sources. In comparison to FakeNewsCorpus, our dataset covers a wider range of news, in particular more mainstream news. In addition, our dataset is collected over a longer and more consistent period of time, whereas many of the alternative news sources in FakeNewsCorpus no longer exist and the time frame of FakeNewsCorpus is unknown.

2. Engagement-driven - The majority of the current datasets, both for news articles and claims, contain only data that has been highly engaged with on social media or has received attention from fact-checking organizations. While understanding the engagement of misinformation is an important task, engagement-driven news datasets fail to show the complete picture of misinforming news. Both malicious fake news producers and hyper-partisan media produce hundreds, sometimes thousands, of articles in a year, most of which are never seen by social media users or fact-checkers. Questions about when fake news tactics work or do not work remain unanswered.

3. Lack of ground truth labels - None of the current large-scale news article datasets, with the exception of FakeNewsCorpus, have any form of labeling for misinformation research. While some contain a mix of reliable and unreliable sources, it is not necessarily clear to what extent each source is reliable or what dimensions of credibility should be used to assess the sources. For example, a news article can spread misinformation (or disinformation) in many ways other than false statements. A news article may use partially false information, decontextualized information, or information misrepresented by hyper-partisan language. For both machine learning and comparative studies, having well-defined labels covering multiple dimensions of veracity is important in understanding what signals a machine learning model is learning or why discovered patterns exist in news data.

Thus, our goal with the NELA-GT-2018 dataset is to create a large, veracity-labeled news article dataset that is independent of social media engagement and specific events.
3 Dataset Creation

We created this dataset with the following steps:

1. We gathered a wide variety of news sources from varying levels of veracity, including many well-studied misinforming sources and other less well-known sources.

2. We scraped article data from the gathered sources' RSS feeds twice a day for 10 months in 2018.

3. We combined and corroborated source-level veracity labels from 8 independent assessments, some of which are used in the misinformation literature and others that are not. These labels provide multiple and complementary ground truths, allowing for many different ways to characterize the sources.

Through this process, we provide 713,534 articles from 194 news and media producers. Along with these articles, we provide multiple labels from 8 independent assessments for each source. The final set of article data is arranged in an SQLite database, with date, source, title, and cleaned text content for each article. The labels are provided in CSV format, with rows being sources and columns being each label gathered from all the assessment sites. The set of labels can also be found in Appendix Tables 2 and 3. Specifics on the file formats can be found in the documentation given with the dataset. We describe the collection process and ground truth in detail below.
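To make this layout concrete, the following is a minimal Python sketch for loading both files. The file names ("articles.db", "labels.csv"), table name, and column names here are illustrative assumptions; the documentation distributed with the dataset gives the actual schema.

```python
# Minimal loading sketch; file, table, and column names are assumptions.
import sqlite3
import pandas as pd

# Articles: one row per article with date, source, title, and cleaned text.
conn = sqlite3.connect("articles.db")
articles = pd.read_sql_query(
    "SELECT date, source, title, content FROM articles", conn
)
conn.close()

# Labels: one row per source, one column per assessment-site label.
labels = pd.read_csv("labels.csv", index_col="source")

# Attach the source-level labels to every article from that source.
articles = articles.merge(labels, left_on="source", right_index=True, how="left")
print(articles.shape)
```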
News Article Data

To collect our dataset, we scraped the RSS feeds of each source twice a day starting on 02/02/2018, using the Python libraries feedparser and goose (a minimal sketch of such a collection loop is given at the end of this subsection). Our starting point for source selection was mainstream outlets and alternative sources that are mentioned in other studies or in high-profile cases of false news coverage. An initial subset of 92 sources was available in the NELA2017 dataset (Horne, Khedr, and Adalı 2018), which already covered a wide array of media types. We then continued to expand this source set using the same criteria, as well as through automated Google searches to find other outlets that published articles similar to those already in our dataset. Specifically, we queried the Google Search API with the titles of the news articles that were previously collected. If a news source that was not in our source collection list appeared in the top 10 pages of the Google search results, we added it to our source collection list. Note that we do not include small local news sources or sources that did not have operational RSS feeds, which significantly reduces the size of the expected source set. Furthermore, this Google expansion process was run in July 2018, which caused a large increase in unlabeled news sources, as shown in Figure 1.

By the end of the collection process (30/11/2018) we had 713K articles from 194 news and media producers. These sources come from a variety of countries, but all articles are in English. In Tables 2, 3, and 4 we give the date of the first scraped article from each source. After these dates, we have near-complete data from the respective sources' RSS feeds. In Figure 1 we show the number of articles collected over time.
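Below is a minimal sketch of that collection loop, run on a schedule (e.g. via cron) twice a day. The feed list and storage layout are illustrative assumptions; goose3 stands in here as the maintained fork of the goose library named above.

```python
# Illustrative twice-daily RSS collection loop; feed list and storage
# layout are assumptions, goose3 is the maintained fork of goose.
import sqlite3
import feedparser
from goose3 import Goose

FEEDS = {"example-source": "https://example.com/rss"}  # hypothetical feed list

conn = sqlite3.connect("articles.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS articles "
    "(date TEXT, source TEXT, title TEXT, content TEXT, url TEXT UNIQUE)"
)

g = Goose()
for source, feed_url in FEEDS.items():
    for entry in feedparser.parse(feed_url).entries:
        try:
            article = g.extract(url=entry.link)  # fetch and clean the article body
            conn.execute(
                "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)",
                (entry.get("published", ""), source,
                 article.title, article.cleaned_text, entry.link),
            )
        except Exception:
            continue  # skip dead links and extraction failures
conn.commit()
conn.close()
```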
[Figure 1: Number of articles in the dataset over time. Two stacked plots of daily article counts (y-axis: # articles; x-axis: date, 01/02/2018 to 01/11/2018), one dissected by aggregated reliability (Unreliable, Reliable, Not determined, Unknown source) and one by aggregated bias (Left, Right, Not determined, Unknown source). For each source, we compute an aggregated reliability and bias rating, and label all articles from the source with this rating for illustration purposes. The two stack-plots contain the same data points, but dissected with these two distinct aggregated labels. If the aggregated label is uncertain, we label the articles with gray. Grey-shaded vertical regions mark periods where unusually little data was collected due to problems with data scraping or potentially low activity. The increase in the number of data points around 01/08/2018 is caused by the addition of new sources to the collection.]

Ground Truth Data

A number of organizations and platforms have developed methods for assessing the reliability and bias of news sources. These organizations come from both the research community and practitioner communities. While each of these organizations and platforms provides useful assessments on its own, each uses different criteria and methods to make its assessments, and most of these assessments cover relatively few sources. Thus, in order to create a large, centralized set of veracity labels, we collected ground truth (GT) data from eight different sites, all of which attempt to assess the reliability and/or the bias of news.

These assessment sites are:

1. NewsGuard
2. Pew Research Center
3. Wikipedia
4. OpenSources
5. Media Bias/Fact Check (MBFC)
6. AllSides
7. BuzzFeed News
8. PolitiFact

We gather data from all these sites, using HTML scraping and GUI automation, and combine their labels to create a centralized set of veracity ground truth labels. Of the 194 sources in our dataset, 154 sources have GT labels from at least one of the assessment sites, while the remaining 40 sources remain unlabelled. Tables 2 and 3 show the combined labels, while Table 4 lists the sources for which no label information was found. Table 1 provides a detailed description of each assessment and Table 5 lists URLs for the assessment sites.
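The paper does not prescribe a single rule for combining the eight assessments into the aggregated ratings shown in Figure 1. Purely as an illustration, the sketch below derives one "unreliable" flag by majority vote over three of the assessments; the column names are hypothetical stand-ins for the labels listed in Table 1.

```python
# Illustrative corroboration of per-site labels into one aggregate flag.
# Column names are hypothetical; the real labels are listed in Table 1.
import pandas as pd

labels = pd.read_csv("labels.csv", index_col="source")

# Hypothetical binary "unreliable" votes from three of the eight assessments.
votes = pd.DataFrame({
    "newsguard": labels["newsguard_overall"] == "red",      # overall red label
    "opensources": labels["opensources_fake"] == 1,         # tagged "fake"
    "mbfc": labels["mbfc_factual_reporting"] <= 2,          # low factual score
})

# Flag a source as unreliable when a majority of the votes agree.
labels["aggregated_unreliable"] = votes.sum(axis=1) >= 2
print(labels["aggregated_unreliable"].value_counts())
```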
NewsGuard uses a group of trained journalists to assess the credibility and transparency of news websites. They emphasize the use of trained people rather than algorithms to determine the credibility of sources, and they allow the respective news outlets to comment on their verdict before publishing it. They provide extensions for major browsers to inform users of the credibility of the sites they visit, and also display icons on results in search engines like Google and DuckDuckGo. Their analysis produces 9 granular, binary labels for each site, with assigned point scores that sum to 100. Based on the sum of points, a site gets an overall label for credibility: green for a good score, red for a bad score. Three additional overall labels exist for satire, user-produced content, and sites with unfinished analysis. Table 1 describes the granular labels. NewsGuard is transparent about its methodology and publishes a policy for ethics and conflicts of interest. Their full staff is listed by name online and their ratings are free.

Pew Research Center published an article entitled "Political Polarization & Media Habits" which analysed trust in specific news sources by liberals and conservatives. This analysis used 5 groups of people, ranging from liberals to conservatives, and each group provided a rating of how much they trust each source. The ratings are aggregated to show whether readers with different political leanings predominantly trust or distrust a specific source. We provide this trust label for each source and political leaning, as a label for congruency between bias and readership (rather than as a fact-checking label).

Wikipedia published a list of fake news websites, which it defines as sites that "intentionally, but not necessarily solely, publish hoaxes and disinformation for purposes other than news satire". The page has more than 500 edits and 162 cited references, and has been in existence since 18/11/2016. There is no information on how the sites were selected, but for each source there are references to other sites which have reported its bad behaviour. We provide a fake-news tag for sources on the list.

Open Sources describes itself as a "curated resource for assessing online information sources, available for public use", and its analyses are done by its own team of experts. The criteria are published online in detail. This list has also been used in several academic studies (Horne and Adali 2017; Horne et al. 2018; Baly et al. 2018). Unfortunately, the last repository commit was 2 years ago and many of the labeled sources no longer exist. The site provides a list of sources with 1-3 tags per source (see Table 1).
Media Bias/Fact Check is a platform that analyzes news sources to determine their credibility, as well as to "educate the public on media bias and deceptive news practices". The site publishes the names of its editorial team and only accepts outside information from individuals who have accepted the International Fact-Checking Network's code of principles. According to its published methodology, the site numerically evaluates each news outlet in 4 categories: biased wording/headlines, factual reporting/sourcing, story choices, and political affiliation, and uses the mean of these for a final verdict. As of January 2019, we were unfortunately not able to find the numerical categories for the sources. We were able to find a factual reporting label, which is derived from the previously mentioned scores. Many sources also had descriptive labels, some of which were related to reliability and some of which were related to bias. All these labels are described in Table 1.

Allsides takes a very idealistic approach to assessing the bias of sites and is mainly data-driven. They emphasize that news is inherently biased, that a mixed news "diet" is the true goal for newsreaders, and that bias can be hidden and unconscious. The site creates its data through a set of methods, each of which is noted for the sources: it conducts blind surveys of material among both the public and an editorial board, uses third-party data and assessments, conducts internal research on sources if needed, and also has a community feedback function for all bias assessments. In the community feedback, users can vote to agree or disagree with Allsides' assessment of a source. They note that the community feedback is not normalized with respect to bias, and should be used more as a flag, for their own use, on whether their assessments are off and need updating. We include their bias label and feedback numbers (votes agreeing and votes disagreeing) for each source. The feedback numbers are not shown in the paper, but can be found in the dataset.
BuzzFeed News published an article, "Inside The Partisan Fight For Your News Feed", on 08/08/2017, which describes a study they conducted on how partisan websites and Facebook pages have been created in increasing numbers. They publish an associated dataset with news sources and their political leaning (left and right), which we include.

PolitiFact is a well-known fact-checking organization which investigates claims and evaluates their truthfulness. The statements can be from any public person, or simply rumours that gain enough attention. PolitiFact's data is very different from that of the rest of our labelling sites, as its assessments are at the article/statement level and not the source level. They also aggregate the statements and their labels for the sources that published the statements. We have counted the types of statements coming from each source, which could be used to indicate their truthfulness. However, the data is not well normalized, as some sites have many noted statements while others have none, due to the origin of the statements and the amount of attention each source has received.

Amazon's Alexa provides a ranking of nearly all websites based on frequency of visits, with free access to the top 1M. We include the position of the sources in this ranking in the dataset, based on our access to Alexa on the 13th of January 2019. Note that this data comes from the free portion of Alexa's data, not the paid portion. Furthermore, these rankings will change over time.

4 Use Cases

There are many threads of misinformation research that this dataset can benefit. We argue that our dataset can especially benefit automated news veracity methods, which need large labelled datasets, and qualitative studies that focus on the tactics used by malicious and hyper-partisan news producers. We discuss a few examples below.

Distant Supervised Learning

Much research in news has focused on automated methods for detecting misinformation (Kumar and Shah 2018). For machine learning systems, this analysis generally requires article-level labelling (i.e. false/bias labels of individual articles). One problem with this approach is that labelling individual articles requires a lot of resources and is oftentimes not possible. For many machine learning algorithms, the minimum requirement of labelled samples is in the thousands, and verifying articles will commonly require considerable time from an expert. A second problem is that the verification of statements in articles can require a lot of time, which can make the available labelled articles outdated for analyzing contemporary articles, due to shifts in topics and the news cycle.

An alternative approach to creating labels is distant supervision (or weak supervision), where labels are created at the source level and used as proxies for article-level labels. One advantage of this approach is that it reduces the workload of labelling. Additionally, labels are known instantaneously for articles from known sources, allowing real-time updates of parameters and analysis of news. This approach has shown promise in recent misinformation detection work (Horne et al. 2018; Baly et al. 2018). The NELA-GT-2018 dataset can be used out-of-the-box for this type of machine learning study.
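As a minimal sketch of such a distant-supervision study, under the same hypothetical column names as the earlier examples, the aggregated source-level flag is projected onto every article and a simple scikit-learn text classifier is trained on it. Splitting by source rather than by article is a deliberate choice here, so the classifier cannot merely memorize outlets.

```python
# Distant-supervision sketch; column names are illustrative assumptions.
# `articles` is the merged frame from the loading sketch above, with the
# source-level flag `aggregated_unreliable` attached to every article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline

data = articles.dropna(subset=["aggregated_unreliable"])

# Split by source, not by article, so the same outlet never appears in
# both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(data, groups=data["source"]))
train, test = data.iloc[train_idx], data.iloc[test_idx]

model = make_pipeline(
    TfidfVectorizer(max_features=50_000),
    LogisticRegression(max_iter=1000),
)
model.fit(train["content"], train["aggregated_unreliable"].astype(int))
print("held-out accuracy:",
      model.score(test["content"], test["aggregated_unreliable"].astype(int)))
```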
Semi-Supervised Learning

Another commonly debated issue in misinformation research is handling new articles from mixed-veracity (partially true, benign, or malicious) sources, or handling articles from newly emerging sources during events (such as elections). One potential way to address these problems is semi-supervised learning, in which these uncertain-veracity news sources are included as unlabelled data. This approach can improve stability and increase the working domain of automated systems. In fact, it has been shown that, under some assumptions, semi-supervised approaches can improve performance over fully supervised approaches, where unlabelled samples enable classifiers to reduce risk exponentially in the number of labelled samples (Castelli and Cover 1996). Depending on the problem, this dataset provides consistent labels for 100+ sources, verified by multiple assessment sites. The remaining sources are either completely unknown or sparsely labelled, but can be utilized with semi-supervised methods.
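One concrete way to do this (an assumed choice, not one prescribed by the paper) is scikit-learn's SelfTrainingClassifier, which accepts articles from unlabelled sources marked with -1 and pseudo-labels them during training. The sketch reuses the hypothetical names from the earlier examples.

```python
# Semi-supervised sketch: articles from the ~40 sources without
# ground-truth labels enter training as unlabelled samples (label -1).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = TfidfVectorizer(max_features=50_000).fit_transform(articles["content"])

# 1 = unreliable source, 0 = reliable source, -1 = source with no labels.
y = articles["aggregated_unreliable"].fillna(-1).astype(int).to_numpy()

# Self-training iteratively pseudo-labels unlabelled articles that the
# base classifier predicts with high confidence, then refits.
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y)
```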
Mixed-Method Studies

There are unanswered research questions about the tactics used by news producers publishing false, misleading, or propaganda news. These questions cannot be answered through machine learning studies alone, but rather require mixed-method assessments. For example, recent work has focused on content sharing by alternative media sources (Starbird et al. 2018). This work sheds light on the tactics employed by state-sponsored news outlets to create alternative narratives around an event, but can continue to be improved with data that is more complete and independent of social media. Other questions include: How do false news producers change with events? Do they keep consistent ideologies, or do they adapt to the given event? Many of these potential tactics are unknown. This dataset provides news over many major events, which can easily be extracted for specific studies. For qualitative researchers, the data can provide a "head start" on exploration, as the veracity of each source is known.

5 Conclusion

In this paper, we present a labelled news dataset for the study of misinformation. We argue that the research community lacks large labelled datasets for use in both mixed-method and machine learning studies. To address this need, we provide a large dataset of news articles (713K articles), collected from many sources (194) over a long period of time (02/2018-11/2018). The articles are independent of engagement from online communities, and reflect the publishing patterns of the news producers. We have furthermore gathered labels for these sources from 8 different assessment sites, each of which seeks to assess the reliability and bias of sources and claims. Combined, they provide a detailed and near-complete labelling of sources, which can be used for predictive analysis and qualitative studies of the news landscape.
References

[Baly et al. 2018] Baly, R.; Karadzhov, G.; Alexandrov, D.; Glass, J.; and Nakov, P. 2018. Predicting factuality of reporting and bias of news media sources. In EMNLP 2018.

[Castelli and Cover 1996] Castelli, V., and Cover, T. M. 1996. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory 42(6):2102-2117.

[Castillo, Mendoza, and Poblete 2011] Castillo, C.; Mendoza, M.; and Poblete, B. 2011. Information credibility on Twitter. In Proceedings of WWW, 675-684. ACM.

[Horne and Adali 2017] Horne, B. D., and Adali, S. 2017. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In NECO Workshop 2017.

[Horne et al. 2018] Horne, B. D.; Dron, W.; Khedr, S.; and Adalı, S. 2018. Assessing the news landscape: A multi-module toolkit for evaluating the credibility of news. In WWW 2018 Companion.

[Horne, Khedr, and Adalı 2018] Horne, B. D.; Khedr, S.; and Adalı, S. 2018. Sampling the news producers: A large news and feature data set for the study of the complex media landscape. In ICWSM.

[Kumar and Shah 2018] Kumar, S., and Shah, N. 2018. False information on web and social media: A survey. arXiv preprint arXiv:1804.08559.

[Mitra and Gilbert 2015] Mitra, T., and Gilbert, E. 2015. CREDBANK: A large-scale social media corpus with associated credibility annotations. In ICWSM, 258-267.

[Shu et al. 2018] Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; and Liu, H. 2018. FakeNewsNet: A data repository with news content, social context and dynamic information for studying fake news on social media. arXiv preprint arXiv:1809.01286.

[Starbird et al. 2018] Starbird, K.; Arif, A.; Wilson, T.; Van Koevering, K.; Yefimova, K.; and Scarnecchia, D. 2018. Ecosystem or echo-system? Exploring content sharing across alternative media domains. In ICWSM.

[Tacchini et al. 2017] Tacchini, E.; Ballarin, G.; Della Vedova, M. L.; Moret, S.; and de Alfaro, L. 2017. Some like it hoax: Automated fake news detection in social networks. arXiv preprint arXiv:1704.07506.

[Wang 2017] Wang, W. Y. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.

[Zannettou et al. 2018] Zannettou, S.; Sirivianos, M.; Blackburn, J.; and Kourtellis, N. 2018. The web of false information: Rumors, fake news, hoaxes, clickbait, and various other shenanigans. arXiv preprint arXiv:1804.03461.

[Zubiaga et al. 2016] Zubiaga, A.; Liakata, M.; Procter, R.; Hoi, G. W. S.; and Tolmie, P. 2016. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE 11(3):e0150989.

6 Appendix
NewsGuard
1. Does not repeatedly publish false content (22.0 points)
2. Gathers and presents information responsibly (18.0)
3. Regularly corrects or clarifies errors (12.5)
4. Handles the difference between news and opinion responsibly (12.5)
5. Avoids deceptive headlines (10.0)
6. Website discloses ownership and financing (7.5)
7. Clearly labels advertising (7.5)
8. Reveals who's in charge, including any possible conflicts of interest (5.0)
9. Provides information about content creators (5.0)
10. Aggregated score computed from 1-9
11. Column 10 thresholded at 60 points

Pew Research Center
12. Trust from consistently-liberals
13. Trust from mostly-liberals
14. Trust from mixed groups
15. Trust from mostly-conservatives
16. Trust from consistently-conservatives
17. Aggregated trust from 12-16

Wikipedia
18. Existence of source on Wikipedia's list of fake news sources

Open Sources
19. Marked reliable
20. Marked blog
21. Marked clickbait
22. Marked rumor
23. Marked fake
24. Marked unreliable
25. Marked biased
26. Marked conspiracy
27. Marked hate speech
28. Marked junk science
29. Marked political
30. Marked satire
31. Marked state news

Media Bias / Fact Check
32. Factual reporting from 5 (good) down to 1 (bad)
33. Special label: conspiracy, pseudoscience, or questionable source (purple), and satire (orange)
34. Political leaning / bias from left to right

Allsides
35. Political leaning / bias

BuzzFeed
36. Political leaning / bias, but only left and right

PolitiFact
37. Has had a story labelled as "Pants on Fire!"
38. Has had a story labelled as false
39. Has had a story labelled as mostly false
40. Has had a story labelled as half-true
41. Has had a story labelled as mostly true
42. Has had a story labelled as true

Alexa Ranking: The Alexa ranking of the source (numerical).
# Articles: The number of articles collected from the source (numerical).
First Observed: The date of the first article collected from the source (dd-mm-yyyy).

Table 1: Details of the information for sources found in Tables 2, 3, and 4. We generally use green-to-purple for good-to-poor reliability/credibility, with grey as inconclusive. For bias we use blue-to-red for left-to-right bias, with grey as unbiased. Orange is used for special cases: in the NewsGuard data it represents missing information, in Open Sources it marks auxiliary labels, and for Media Bias / Fact Check it marks satire.
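As a worked illustration of the NewsGuard rows (1-11) in Table 1: summing the points of the criteria a site satisfies gives the aggregate score (row 10), and thresholding at 60 gives the overall green/red label (row 11). The shortened criterion names and the example site below are hypothetical; the point values are taken from the table.

```python
# Worked example of the NewsGuard aggregation in Table 1, rows 1-11.
# Point values are from the table; the example site is hypothetical.
POINTS = {
    "no_false_content": 22.0,
    "responsible_reporting": 18.0,
    "corrects_errors": 12.5,
    "separates_news_and_opinion": 12.5,
    "avoids_deceptive_headlines": 10.0,
    "discloses_ownership": 7.5,
    "labels_advertising": 7.5,
    "reveals_conflicts_of_interest": 5.0,
    "credits_content_creators": 5.0,
}
assert sum(POINTS.values()) == 100.0  # the nine criteria sum to 100

# Hypothetical site that fails only the "no false content" criterion.
satisfied = {k for k in POINTS if k != "no_false_content"}
score = sum(POINTS[k] for k in satisfied)   # row 10: aggregate score
label = "green" if score >= 60 else "red"   # row 11: threshold at 60
print(score, label)                         # 78.0 green
```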
[Table 2: Labelling of first part of sources. In the original PDF, this table gives, for each source, the 42 assessment labels described in Table 1 together with the source's Alexa rank, number of collected articles, and first-observed date. The labels are rendered as colored cells and, like the alignment of the numeric columns, are not recoverable from this transcription, so only the source list is reproduced here; the full per-source values are available in the labels CSV distributed with the dataset. Sources: 21stCenturyWire, ABC News, Activist Post, Addicting Info, Al Jazeera, Alternet, AMERICAblog News, BBC, Bearing Arms, Bipartisan Report, Birmingham Mail, Breitbart, Business Insider, Buzzfeed, CBS News, Chicago Sun-Times, CNBC, CNN, CNS News, Counter Current News, Crooks and Liars, Daily Beast, Daily Kos, Daily Mail, Daily Signal, Daily Stormer, DC Gazette, Democracy 21, Drudge Report, Evening Standard, Faking News, Feministing Blog, FiveThirtyEight, Foreign Policy, Fortune, Forward Progessives, Fox News, France24, Freedom Daily, Freedom Outpost, FrontPage Magazine, FT Westminster Blog, Fusion, GlobalResearch, Glossy News, Hot Air, HumansAreFree, Humor Times, Infowars, Instapundit, Intellectual Conservative, Intellihub, Interpreter Mag, Investors Business Daily, iPolitics, LewRockwell, Live Action, Media Matters for America, Mercury News, MotherJones, MSNBC, National Review, Natural News, New York Daily News, New York Post, New Yorker, News Biscuit, News Busters, Newsweek, NODISINFO, NPR, oann, Observer, Palmer Report, Pamela Geller Report, PBS, Pink News UK, Politico, Politics UK, Politicus USA, Powerline Blog, Pravada Report, Prison Planet, Raw Story, Real Clear Politics, Real News Right Now, RedState, Reuters, RightWingWatch, RT, Russia-Insider, Salon, ScrappleFace, Shadow Proof, Shareblue, SkyNewsPolitics.]
[Table 3: Labelling of second part of sources. Same columns as Table 2; as above, the color-coded label cells and numeric column alignment are not recoverable from this transcription, so only the source list is reproduced here. Sources: SkyNewsUS, Slate, sott.net, Spiegel, Sputnik, Talking Points Memo, Tass, Telesur TV, The American Conservative, The Atlantic, The Beaverton, The Borowitz Report, The Chaser, The Conservative Tree House, The D.C. Clothesline, The Daily Caller, The Daily Express, The Daily Mirror, The Daily Record, The Daily Star, The Denver Post, The Duran, The Fiscal Times, The Gateway Pundit, The Guardian, The Hill, The Huffington Post, The Independent, The Intercept, The Irish Times, The Michelle Malkin Blog, The Moscow Times, The New York Times, The Onion, The Poke, The Political Insider, The Right Scoop, The Shovel, The Spoof, The Sun, The Telegraph, The Verge, The Washington Examiner, TheAntiMedia, TheBlaze, ThinkProgress, True Activist, True Pundit, USA Today, Veterans Today, Vox, Waking Times, Washington Monthly, Washington Post, Western Journal, Wings Over Scotland, WSJ Washington Wire, Yahoo News.]
[Table 4: Sources with no labels found. For each source, the original table gives the Alexa rank, number of collected articles, and first-observed date; the numeric column alignment is not recoverable from this transcription, so only the 40 source names are reproduced here. Sources: Anonymous Conservative, BBC UK, Channel 4 UK, Common Dreams, Conservative Home, Conservative Tribune, Crikey, Delaware Liberal, Dick Morris Blog, Fort Russ, Freedom-Bunker, Hit and Run, Hullabaloo Blog, Informnapalm, JewWorldOrder, LabourList, Liberal Democrat Voice, Losercom, Mail, Mint Press News, Newsnet Scotland, Newswars, OSCE, Politicalite, Politicscouk, Prepare For Change, Slugger OToole, The Daily Blog, The Daily Echo, The Guardian UK, The Huffington Post UK, The Inquisitr, The Manchester Evening News, The Week UK, Trump Times, Unian, Window on Eurasia Blog, Wizbang, rferl, theRussophileorg.]

Table 5: Links for online resources.
NewsGuard: newsguardtech.com
Pew Research Center: journalism.org/2014/10/21/political-polarization-media-habits
Wikipedia: en.wikipedia.org/wiki/List_of_fake_news_websites
Open Sources: opensources.co
Media Bias/Fact Check: mediabiasfactcheck.com
Allsides: allsides.com
PolitiFact: politifact.com
BuzzFeed News: buzzfeednews.com/article/craigsilverman/inside-the-partisan-fight-for-your-news-feed
Alexa (top 1 million sites analysis): s3.amazonaws.com/alexa-static/top-1m.csv.zip