Can the Crowd Judge Truthfulness? A Longitudinal Study on Recent Misinformation about COVID-19
Personal and Ubiquitous Computing manuscript No. (will be inserted by the editor)

Can the Crowd Judge Truthfulness? A Longitudinal Study on Recent Misinformation about COVID-19

Kevin Roitero · Michael Soprano · Beatrice Portelli · Massimiliano De Luise · Damiano Spina · Vincenzo Della Mea · Giuseppe Serra · Stefano Mizzaro · Gianluca Demartini

arXiv:2107.11755v1 [cs.IR] 25 Jul 2021

Received: February 2021 / Accepted: July 2021

Abstract Recently, the misinformation problem has been addressed with a crowdsourcing-based approach: to assess the truthfulness of a statement, instead of relying on a few experts, a crowd of non-experts is exploited. We study whether crowdsourcing is an effective and reliable method to assess truthfulness during a pandemic, targeting statements related to COVID-19, thus addressing (mis)information that is both related to a sensitive and personal issue and very recent as compared to when the judgment is done. In our experiments, crowd workers are asked to assess the truthfulness of statements and to provide evidence for their assessments. Besides showing that the crowd is able to accurately judge the truthfulness of the statements, we report results on worker behavior, agreement among workers, the effect of aggregation functions and of scale transformations, and the effect of worker background and bias. We perform a longitudinal study by re-launching the task multiple times with both novice and experienced workers, deriving important insights on how behavior and quality change over time. Our results show that: workers are able to detect and objectively categorize online (mis)information related to COVID-19; both crowdsourced and expert judgments can be transformed and aggregated to improve quality; worker background and other signals (e.g., source of information, behavior) impact the quality of the data. The longitudinal study demonstrates that the time-span has a major effect on the quality of the judgments, for both novice and experienced workers. Finally, we provide an extensive failure analysis of the statements misjudged by the crowd workers.

Keywords Information Behavior · Crowdsourcing · Misinformation · COVID-19

This is a preprint of an article accepted for publication in Personal and Ubiquitous Computing (Special Issue on Intelligent Systems for Tackling Online Harms).

Kevin Roitero · Michael Soprano · Beatrice Portelli · Massimiliano De Luise · Vincenzo Della Mea · Giuseppe Serra · Stefano Mizzaro
University of Udine, Udine, Italy

Damiano Spina
RMIT University, Melbourne, Australia

Gianluca Demartini
The University of Queensland, Brisbane, Australia

1 Introduction

“We’re concerned about the levels of rumours and misinformation that are hampering the response. [...] we’re not just fighting an epidemic; we’re fighting an infodemic. Fake news spreads faster and more easily than this virus, and is just as dangerous. That’s why we’re also working with search and media companies like Facebook, Google, Pinterest, Tencent, Twitter, TikTok, YouTube and others to counter the spread of rumours and misinformation. We call on all governments, companies and news organizations to work with us to sound the appropriate level of alarm, without fanning the flames of hysteria.”

These are the alarming words used by Dr. Tedros Adhanom Ghebreyesus, the WHO (World Health Organization) Director General, during his speech at the Munich Security Conference on 15 February 2020.1 It is telling that the WHO Director General chose to explicitly target misinformation-related problems.

1 https://www.who.int/dg/speeches/detail/munich-security-conference

Indeed, during the still ongoing COVID-19 health emergency, all of us have experienced mis- and dis-information. The research community has focused on
2 Kevin Roitero et al. several COVID-19 related issues [5], ranging from ma- ond, the health domain is particularly sensitive, so it chine learning systems aiming to classify statements is interesting to understand if the crowdsourcing ap- and claims on the basis of their truthfulness [66], search proach is adequate also in such a particular domain. engines tailored to the COVID-19 related literature, as Third, in the previous work [47, 27, 51] the statements in the ongoing TREC-COVID Challenge2 [46], topic- judged by the crowd were not recent. This means that specific workshops like the NLP COVID workshop at evidence on statement truthfulness was often available ACL’20,3 and evaluation initiatives like the TREC Health out there (on the Web), and although the experimental Misinformation Track.4 Besides the academic research design prevented to easily find that evidence, it cannot community, commercial social media platforms also have be excluded that the workers did find it, or perhaps looked at this issue.5 they were familiar with the particular statement be- Among all the approaches, in some very recent work, cause, for instance, it had been discussed in the press. Roitero et al. [47], La Barbera et al. [27], Roitero et al. By focusing on COVID-19 related statements we in- [51] have studied whether crowdsourcing can be used stead naturally target recent statements: in some cases to identify misinformation. As it is well known, crowd- the evidence might be still out there, but this will hap- sourcing means to outsource a task – which is usu- pen more rarely. ally performed by a limited number of experts – to Fourth, an almost ideal tool to address misinforma- a large mass (the “crowd”) of unknown people (the tion would be a crowd able to assess truthfulness in real “crowd workers”), by means of an open call. The re- time, immediately after the statement becomes public: cent works mentioned before [47, 27, 51] specifically although we are not there yet, and there is a long way crowdsource the task of misinformation identification, to go, we find that targeting recent statements is a step or rather assessment of the truthfulness of statements forward in the right direction. Fifth, our experimental made by public figures (e.g., politicians), usually on po- design differs in some details, and allows us to address litical, economical, and societal issues. novel research questions. Finally, we also perform a lon- The idea that the crowd is able to identify mis- gitudinal study by collecting the data multiple times information might sound implausible at first – isn’t and launching the task at different timestamps, con- the crowd the very means by which misinformation is sidering both novice workers – i.e., workers who have spread? However, on the basis of the previous studies never done the task before – and experienced workers [47, 27, 51], it appears that the crowd can provide high – i.e., workers who have performed the task in previous quality results when asked to assess the truthfulness batches and were invited to do the task again. This al- of statements, provided that adequate countermeasures lows us to study the multiple behavioral aspects of the and quality assurance techniques are employed. workers when assessing the truthfulness of judgments. In this paper6 we address the very same problem, This paper is structured as follows. In Section 2 but focusing on statements about COVID-19. This is we summarize related work. In Section 3 we detail the motivated by several reasons. 
First, COVID-19 is of aims of this study and list some specific research ques- course a hot topic but, although there is a great amount tions, addressed by means of the experimental setting of research efforts worldwide devoted to its study, there described in Section 4. In Section 5 we present and dis- are no studies yet using crowdsourcing to assess truth- cuss the results, while in Section 6 we present the longi- fulness of COVID-19 related statements. To the best tudinal study conducted. Finally, in Section 7 we con- of our knowledge, we are the first to report on crowd clude the paper: we summarize our main findings, list assessment of COVID-19 related misinformation. Sec- the practical implications, highlight some limitations, 2 https://ir.nist.gov/covidSubmit/ and sketch future developments. 3 https://www.nlpcovid19workshop.org/ 4 https://trec-health-misinfo.github.io/ 5 https://www.forbes.com/sites/bernardmarr/2020/03/ 27/finding-the-truth-about-covid-19-how-facebook- twitter-and-instagram-are-tackling-fake-news/ and https://spectrum.ieee.org/view-from-the-valley/ 2 Background artificial-intelligence/machine-learning/how- facebook-is-using-ai-to-fight-covid19-misinformation 6 This paper is an extended version of the work by Roitero We survey the background work on theoretical and con- et al. [52]. In an attempt of providing a uniform, compre- ceptual aspects of misinformation spreading, the spe- hensive, and more understandable account of our research we cific case of the COVID-19 infodemic, the relation be- also report in the first part of the paper the main results already published in [52]. Conversely, all the results on the tween truthfulness classification and argumentation, and longitudinal study in the second half of the paper are novel. on the use of crowdsourcing to identify fake news.
Can the Crowd Judge Truthfulness? A longitudinal Study 3 2.1 Echo Chambers and Filter Bubbles 2.3 COVID-19 Infodemic The way information spreads through social media and, The number of initiatives to apply Information Ac- in general, the Web, have been widely studied, leading cess – and, in general, Artificial Intelligence – tech- to the discovery of a number of phenomena that were niques to combat the COVID-19 infodemic has been not so evident in the pre-Web world. Among those, echo rapidly increasing (see Bullock et al. [5] for a survey). chambers and epistemic bubbles seem to be central con- Tangcharoensathien et al. [58] distilled a subset of 50 cepts [38]. actions from a set of 594 ideas crowdsourced during Regarding their importance in news consumption, a technical consultation held online by WHO (World Flaxman et al. [17] examine the browsing history of US Health Organization) to build a framework for manag- based users who read news articles. They found that ing infodemics in health emergencies. both search engines and social networks increase the There is significant effort on analyzing COVID-19 ideological distance between individuals, and that they information on social media, and linking to data from increase the exposure of the user to material of opposed external fact-checking organizations to quantify the spread political views. of misinformation [19, 9, 69]. Mejova and Kalimeri [33] These effects can be exploited to spread misinforma- analyzed Facebook advertisements related to COVID- tion. Törnberg [63] modelled how echo chambers con- 19, and found that around 5% of them contain errors tribute to the virality of misinformation, by providing or misinformation. Crowdsourcing methodologies have an initial environment in which misinformation is prop- also been used to collect and analyze data from pa- agated up to some level that makes it easier to expand tients with cancer who are affected by the COVID-19 outside the echo chamber. This helps to explain why pandemic [11]. clusters, usually known to restrain the diffusion of in- Tsai et al. [62] investigate the relationships between formation, become central enablers of spread. news consumption, trust, intergroup contact, and prej- On the other side, acting against misinformation udicial attitudes toward Asians and Asian Americans seems not to be an easy task, at least due to the backfire residing in the United States during the COVID-19 pan- effect, i.e., the effect for which someone’s belief hardens demic when confronted with evidence opposite to its opinion. Sethi and Rangaraju [53] studied the backfire effect and 2.4 Crowdsourcing Truthfulness presented a collaborative framework aimed at fighting it by making the user understand her/his emotions and Recent work has focused on the automatic classification biases. However, the paper does not discuss the ways of truthfulness or fact-checking [43, 34, 2, 37, 12, 25, 55]. techniques for recognizing misinformation can be effec- Zubiaga and Ji [71] investigated, using crowdsourc- tively translated to actions for fighting it in practice. ing, the reliability of tweets in the setting of disaster management. CLEF developed a Fact-Checking Lab [37, 12, 4] to address the issue of ranking sentences ac- 2.2 Truthfulness and Argumentation cording to some fact-checking property. 
There is recent work that studies how to collect Truthfulness classification and the process of fact-checking truthfulness judgments by means of crowdsourcing us- are strongly related to the scrutiny of factual informa- ing fine grained scales [47, 27, 51]. Samples of state- tion extensively studied in argumentation theory [61, ments from the PolitiFact dataset – originally published 68, 28, 68, 54, 64, 3]. Lawrence and Reed [28] sur- by Wang [65] – have been used to analyze the agreement vey the techniques which are the foundations for ar- of workers with labels provided by experts in the origi- gument mining, i.e., extracting and processing the in- nal dataset. Workers are asked to provide the truthful- ference and structure of arguments expressed using nat- ness of the selected statements, by means of different ural language. Sethi [54] leverages argumentation the- fine grained scales. Roitero et al. [47] compared two ory and proposes a framework to verify the truthfulness fine grained scales: one in the [0, 100] range and one of facts, Visser et al. [64] uses it to increase the criti- in the (0, +∞) range, on the basis of Magnitude Esti- cal thinking ability of people who assess media reports, mation [36]. They found that both scales allow to col- Sethi et al. [55] uses it together with pedagogical agents lect reliable truthfulness judgments that are in agree- in order to develop a recommendation system to help ment with the ground truth. Furthermore, they show fighting misinformation, and Snaith et al. [57] present that the scale with one hundred levels leads to slightly a platform based on a modular architecture and dis- higher agreement levels with the expert judgments. On tributed open source for argumentation and dialogue. a larger sample of PolitiFact statements, La Barbera
4 Kevin Roitero et al. et al. [27] asked workers to use the original scale used relevant/sensitive topic for the workers. We investigate by the PolitiFact experts and the scale in the [0, 100] whether the health domain makes a difference in the range. They found that aggregated judgments (com- ability of crowd workers to identify and correctly clas- puted using the mean function for both scales) have a sify (mis)information, and if the very recent nature of high level of agreement with expert judgments. Recent COVID-19 related statements has an impact as well. work by Roitero et al. [51] found similar results in terms We focus on a single truthfulness scale, given the evi- of external agreement and its improvement when aggre- dence that the scale used does not make a significant gating crowdsourced judgments, using statements from difference. Another important difference is that we ask two different fact-checkers: PolitiFact and ABC Fact- the workers to provide a textual justification for their Check (ABC). Previous work has also looked at inter- decision: we analyze them to better understand the pro- nal agreement, i.e., agreement among workers [47, 51]. cess followed by workers to verify information, and we Roitero et al. [51] found that scales have low levels of investigate if they can be exploited to derive useful in- agreement when compared with each other: correlation formation. values for aggregated judgments on the different scales In addition to Roitero et al. [52], we perform a longi- are around ρ = 0.55–0.6 for PolitiFact and ρ = 0.35– tudinal study that includes 3 additional crowdsourcing 0.5 for ABC, and τ = 0.4 for PolitiFact and τ = 0.3 experiments over a period of 4 months and thus col- for ABC. This indicates that the same statements tend lecting additional data and evidence that include novel to be evaluated differently in different scales. responses from new and old crowd workers (see Foot- There is evidence of differences on the way workers note 6). The setup of each additional crowdsourcing provide judgments, influenced by the sources they ex- experiment is the same as the one of Roitero et al. amine, as well as the impact of worker bias. In terms [52]. This longitudinal study is the focus of the research of sources, La Barbera et al. [27] found that the vast questions RQ6–RQ8 below, which are a novel contribu- majority of workers (around 73% for both scales) use tion of this paper. Finally, we also exploit and analyze indeed the PolitiFact website to provide judgments. worker behavior. We present this paper as an extension Differently from La Barbera et al. [27], Roitero et al. of our previous work [52] in order to be able to compare [51] used a custom search engine in order to filter out against it and make the whole paper self-contained and PolitiFact and ABC from the list of results. Results much easier to follow, improving the readability and show that, for all the scales, Wikipedia and news web- overall quality of the paper thanks to its novel research sites are the most popular sources of evidence used by contributions. the workers. In terms of worker bias, La Barbera et al. More in detail, we investigate the following specific [27] and Roitero et al. [51] found that worker politi- Research Questions: cal background has an impact on how workers provide RQ1 Are the crowd-workers able to detect and objec- the truthfulness scores. 
In more detail, they found that tively categorize online (mis)information related to workers are more tolerant and moderate when judging the medical domain and more specifically to COVID- statements from their very own political party. 19? What are the relationship and agreement be- Roitero et al. [52] use a crowdsourcing based ap- tween the crowd and the expert labels? proach to collect truthfulness judgments on a sample of RQ2 Can the crowdsourced and/or the expert judgments PolitiFact statements concerning COVID-19 to un- be transformed or aggregated in a way that it im- derstand whether crowdsourcing is a reliable method to proves the ability of workers to detect and objec- be used to identify and correctly classify (mis)information tively categorize online (mis)information? during a pandemic. They find that workers are able RQ3 What is the effect of workers’ political bias and cog- to provide judgments which can be used to objectively nitive abilities? identify and categorize (mis)information related to the RQ4 What are the signals provided by the workers while pandemic and that such judgments show high level of performing the task that can be recorded? To what agreement with expert labels when aggregated. extent are these signals related to workers’ accu- racy? Can these signals be exploited to improve ac- 3 Aims and Research Questions curacy and, for instance, aggregate the labels in a more effective way? With respect to our previous work by Roitero et al. RQ5 Which sources of information does the crowd con- [47], La Barbera et al. [27], and Roitero et al. [51], sider when identifying online misinformation? Are and similarly to Roitero et al. [52], we focus on claims some sources more useful? Do some sources lead to about COVID-19, which are recent and interesting for more accurate and reliable assessments by the work- the research community, and arguably deal with a more ers?
Can the Crowd Judge Truthfulness? A longitudinal Study 5 RQ6 What is the effect of re-launching the experiment to complete the HIT. The worker uses the input to- and re-collecting all the data at different time-spans? ken to perform the accepted HIT. If s/he successfully Are the findings from all previous research questions completes the assigned HIT, s/he is shown the output still valid? token, which is used to submit the MTurk HIT and RQ7 How does considering the judgments from workers receive the payment, which we set to $1.5 for a set which did the task multiple times change the find- of 8 statements.8 The task itself is as follows: first, a ings of RQ6? Do they show any difference when (mandatory) questionnaire is shown to the worker, to compared to workers whom did the task only once? collect his/her background information such as age and RQ8 Which are the statements for which the truthful- political views. The full set of questions and answers to ness assessment done by the means of crowdsourc- the questionnaire can be found in B. Then, the worker ing fails? Which are the features and peculiarities needs to provide answers to three Cognitive Reflection of the statements that are misjudged by the crowd- Test (CRT) questions, which are used to measure the workers? personal tendency to answer with an incorrect “gut” re- sponse or engage in further thinking to find the correct answer [18]. The CRT questionnaire and its answers 4 Methods can be found in C. After the questionnaire and CRT phase, the worker is asked to asses the truthfulness of 8 In this section we present the dataset used to carry statements: 6 from the dataset described in Section 4.1 out our experiments (Section 4.1), and the crowdsourc- (one for each of the six considered PolitiFact cate- ing task design (Section 4.2). Overall, we considered gories) and 2 special statements called Gold Questions one dataset annotated by experts, one crowdsourced (one clearly true and the other clearly false) manually dataset, one judgment scale (the same for the expert written by the authors of this paper and used as quality and the crowd judgments), and a total of 60 statements. checks as detailed below. We used a randomization pro- cess when building the HITs to avoid all the possible source of bias, both within each HIT and considering 4.1 Dataset the overall task. We considered as primary source of information the To assess the truthfulness of each statement, the PolitiFact dataset [65] that was built as a “bench- worker is shown: the Statement, the Speaker/Source, mark dataset for fake news detection” [65] and contains and the Year in which the statement was made. We over 12k statements produced by public appearances of asked the worker to provide the following information: US politicians. The statements of the datasets are la- the truthfulness value for the statement using the six- beled by expert judges on a six-level scale of truthful- level scale adopted by PolitiFact, from now on re- ness (from now on referred to as E6 ): pants-on-fire, ferred to as C 6 (presented to the worker using a radio false, mostly-false, half-true, mostly-true, and button containing the label description for each cate- true. 
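The analyses that follow treat these six labels as an ordinal scale, plotted as integers from 0 (pants-on-fire) to 5 (true) in the figures. A minimal encoding sketch, written here purely for illustration (the variable and function names are ours, not the authors'):

```python
# Illustrative sketch, not the authors' code: encode the six PolitiFact
# labels (E6) as ordinal integers 0-5, consistent with the 0-5 axes used
# in the figures later in the paper.
E6_LABELS = [
    "pants-on-fire",  # 0: most severe form of falsehood
    "false",          # 1
    "mostly-false",   # 2
    "half-true",      # 3
    "mostly-true",    # 4
    "true",           # 5
]
E6_TO_ORDINAL = {label: value for value, label in enumerate(E6_LABELS)}

def encode_label(label: str) -> int:
    """Map a PolitiFact label to its ordinal truthfulness value."""
    return E6_TO_ORDINAL[label.strip().lower()]

if __name__ == "__main__":
    print(encode_label("half-true"))  # -> 3
```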
Recently, the PolitiFact website (the source gory as reported in the original PolitiFact website), from where the statements of the PolitiFact dataset a URL that s/he used as a source of information for are taken) created a specific section related to the COVID- the fact-checking, and a textual motivation for her/his 19 pandemic.7 For this work, we selected 10 statements response (which can not include the URL, and should for each of the six PolitiFact categories, belonging to contain at least 15 words). In order to prevent the user such COVID-19 section and with dates ranging from from using PolitiFact as primary source of evidence, February 2020 to early April 2020. A contains the full we implemented our own search engine, which is based list of the statements we used. on the Bing Web Search APIs9 and filters out Politi- Fact from the returned search results. We logged the user behavior using a logger as the 4.2 Crowdsourcing Experimental Setup one detailed by Han et al. [22, 21], and we implemented in the task the following quality checks: (i) the judg- To collect our judgments we used the crowdsourcing ments assigned to the gold questions have to be co- platform Amazon Mechanical Turk (MTurk). Each worker, herent (i.e., the judgment of the clearly false question upon accepting our Human Intelligence Task (HIT), is assigned a unique pair or values (input token, output to- 8 Before deploying the task on MTurk, we investigated the ken). Such pair is used to uniquely identify each worker, average time spent to complete the task, and we related it to which is then redirected to an external server in order the minimum US hourly wage. 9 https://azure.microsoft.com/services/cognitive- 7 https://www.politifact.com/coronavirus/ services/bing-web-search-api/
6 Kevin Roitero et al. should be lower than the one assigned to true ques- HIT, workers were first asked to complete a demograph- tion); and (ii) the cumulative time spent to perform ics questionnaire with questions about their gender, each judgment should be of at least 10 seconds. Note age, education and political views. By analyzing the an- that the CRT (and the questionnaire) answers were not swers to the questionnaire of the workers which success- used for quality check, although the workers were not fully completed the experiment we derived the following aware of that. demographic statistics. The majority of workers are in Overall, we used 60 statements in total and each the 26–35 age range (39%), followed by 19–25 (27%), statement has been evaluated by 10 distinct workers. and 36–50 (22%). The majority of the workers are well Thus, considering the main experiment, we deployed educated: 48% of them have a four year college degree 100 MTurk HITs and we collected 800 judgments in or a bachelor degree, 26% have a college degree, and total (600 judgments plus 200 gold question answers). 18% have a postgraduate or professional degree. Only Considering the main experiment and the longitudinal about 4% of workers have a high school degree or less. study all together, we collected over 4300 judgments Concerning political views, 33% of workers identified from 542 workers, over a total of 7 crowdsourcing tasks. themselves as liberals, 26% as moderate, 17% as very All the data used to carry out our experiments can be liberal, 15% as conservative, and 9% as very conserva- downloaded at https://github.com/KevinRoitero/ tive. Moreover, 52% of workers identified themselves as crowdsourcingTruthfulness. The choice of making being Democrat, 24% as being Republican, and 23% as each statement being evaluated by 10 distinct work- being Independent. Finally, 50% of workers disagreed ers deserves a discussion; such a number is aligned with on building a wall on the southern US border, and 37% previous studies using crowdsourcing to assess truthful- of them agreed. Overall we can say that our sample is ness [47, 27, 51, 52] and other concepts like relevance well balanced. [30, 48]. We believe this number is a reasonable trade- CRT Test. Analyzing the CRT scores, we found that: off between having fewer statements evaluated by many 31% of workers did not provide any correct answer, 34% workers and more statements evaluated by few workers. answered correctly to 1 test question, 18% answered We think that an in depth discussion about the quan- correctly to 2 test questions, and only 17% answered tification of such trade-off requires further experiments correctly to all 3 test questions. We correlate the results and therefore is out of scope for this paper; we plan to of the CRT test and the worker quality to answer RQ3. address this matter in detail in future work. Behaviour (Abandonment). When considering the aban- donment ratio (measured according to the definition 5 Results and Analysis for the Main provided by Han et al. [22]), we found that 100/334 Experiment workers (about 30%) successfully completed the task, 188/334 (about 56%) abandoned (i.e., voluntarily ter- We first report some descriptive statistics about the minated the task before completing it), and 46/334 population of workers and the data collected in our (about 7%) failed (i.e., terminated the task due to fail- main experiment (Section 5.1). Then, we address crowd ing the quality checks too many times). 
Furthermore, accuracy (i.e., RQ1) in Section 5.2, transformation of 115/188 workers (about 61%) abandoned the task be- truthfulness scales (RQ2) in Section 5.3, worker back- fore judging the first statement (i.e., before really start- ground and bias (RQ3) in Section 5.4, worker behav- ing it). ior (RQ4) in Section 5.5; finally, we study information sources (RQ5) in Section 5.6. Results related to the lon- gitudinal study (RQ6–8) are described and analyzed in 5.2 RQ1: Crowd Accuracy Section 6. 5.2.1 External Agreement 5.1 Descriptive Statistics To answer RQ1, we start by analyzing the so called ex- ternal agreement, i.e., the agreement between the crowd 5.1.1 Worker Background, Behavior, and Bias collected labels and the experts ground truth. Figure 1 shows the agreement between the PolitiFact experts (x-axis) and the crowd judgments (y-axis). In the first Questionnaire. Overall, 334 workers resident in the plot, each point is a judgment by a worker on a state- United States participated in our experiment.10 In each ment, i.e., there is no aggregation of the workers work- 10 Workers provide proof that they are based in US and have ing on the same statement. In the next plots all work- the eligibility to work. ers redundantly working on the same statement are ag-
Can the Crowd Judge Truthfulness? A longitudinal Study 7 5 5 5 5 4 4 4 4 3 3 3 3 C C C C 2 2 2 2 1 1 1 1 0 0 0 0 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 E E E E Fig. 1: The agreement between the PolitiFact experts and the crowd judgments. From left to right: C6 individual judgments; C6 aggregated with mean; C6 aggregated with median; C6 aggregated with majority vote. gregated using the mean (second plot), median (third applying Bonferroni correction to account for multiple plot), and majority vote (right-most plot). If we fo- comparisons. Results are as follows: when considering cus on the first plot (i.e., the one with no aggregation adjacent categories (e.g., pants-on-fire and false), function applied), we can see that, overall, the individ- the difference between categories are never significant, ual judgments are in agreement with the expert labels, for both tests and for all the three aggregation func- as shown by the median values of the boxplots, which tions. When considering categories of distance 2 (e.g., are increasing as the ground truth truthfulness level pants-on-fire and mostly-false), the differences are increases. Concerning the aggregated values, it is the never significant, apart from the median aggregation case that for all the aggregation functions the pants- function, where there is statistical significance to the on-fire and false categories are perceived in a very p < .05 level in 2/4 cases for both Mann-Whitney and similar way by the workers; this behavior was already t-test. When considering categories of distance 3, the shown in [51, 27], and suggests that indeed workers have differences are significant – for the Mann-Whitney and clear difficulties in distinguishing between the two cat- the t-test respectively – in the following cases: for the egories; this is even more evident considering that the mean, 3/3 and 3/3 cases; for the median, 2/3 and 3/3 interface presented to the workers contained a textual cases; for the majority vote, 0/3 and 1/3 cases. When description of the categories’ meaning in every page of considering categories of distance 4 and 5, the differ- the task. ences are always significant to the p > 0.01 level for all the aggregation functions and for all the tests, apart If we look at the plots as a whole, we see that within from the majority vote function and the Mann-Whitney each plot the median values of the boxplots increase test, where the significance is at the p > .05 level. In the when going from pants-on-fire to true (i.e., going following we use the mean as it is the most commonly from left to right of the x-axis of each chart). This indi- used approach for this type of data [51]. cates that the workers are overall in agreement with the PolitiFact ground truth, thus indicating that workers 5.2.2 Internal Agreement are indeed capable of recognizing and correctly classify- ing misinformation statements related to the COVID- Another standard way to address RQ1 and to ana- 19 pandemic. This is a very important and not obvi- lyze the quality of the work by the crowd is to com- ous result: in fact, the crowd (i.e., the workers) is the pute the so called internal agreement (i.e., the agree- primary source and cause of the spread of disinforma- ment among the workers). We measured the agreement tion and misinformation statements across social media with α [26] and Φ [7], two popular measures often used platforms [8]. 
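As a concrete illustration of the aggregation step discussed here, the following sketch shows mean, median, and majority-vote aggregation of the ten redundant C6 judgments collected for each statement. It is a minimal reconstruction under our own naming, not the authors' implementation; in particular, the tie-breaking policy for the majority vote is an assumption, since the paper does not specify one.

```python
from collections import Counter
from statistics import mean, median
from typing import Sequence

def aggregate_mean(judgments: Sequence[int]) -> float:
    # Mean of the individual C6 judgments (the primary aggregation used in the paper).
    return mean(judgments)

def aggregate_median(judgments: Sequence[int]) -> float:
    return median(judgments)

def aggregate_majority_vote(judgments: Sequence[int]) -> int:
    # Most frequent judgment; ties are broken arbitrarily here (one of several
    # possible policies -- an assumption on our part).
    return Counter(judgments).most_common(1)[0][0]

if __name__ == "__main__":
    judgments = [0, 1, 1, 2, 1, 0, 3, 1, 2, 1]  # ten workers, one statement
    print(aggregate_mean(judgments),
          aggregate_median(judgments),
          aggregate_majority_vote(judgments))
```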
By looking at the plots, and in particular to compute workers’ agreement in crowdsourcing tasks focusing on the median values of the boxplots, it ap- [47, 51, 31, 49]. Analyzing the results, we found that pears evident that the mean (second plot) is the aggre- the the overall agreement always falls in the [0.15, 0.3] gation function which leads to higher agreement levels, range, and that agreement levels measured with the two followed by the median (third plot) and the majority scales are very similar for the PolitiFact categories, vote (right-most plot). Again, this behavior was already with the only exception of Φ, which shows higher agree- remarked in [51, 27, 49], and all the cited works used ment levels for the mostly-true and true categories. the mean as primary aggregation function. This is confirmed by the fact that the α measure al- To validate the external agreement, we measured ways falls in the Φ confidence interval, and the little the statistical significance between the aggregated rat- oscillations in the agreement value are not always in- ing for all the six PolitiFact categories; we consid- dication of a real change in the agreement level, espe- ered both the Mann-Whitney rank test and the t-test, cially when considering α [7]. Nevertheless, Φ seems to
8 Kevin Roitero et al. 5 5 5 Figure 2 shows the result of such a process. As 4 4 4 we can see from the plots, the agreement between the 3 3 3 crowd and the expert judgments can be seen in a more C C C 2 2 2 1 1 1 neat way. As for Figure 1, the median values for all 0 0 0 the boxplots is increasing when going towards higher 01 23 45 01 23 45 01 23 45 E E E truthfulness values (i.e., going from left to right within each plot); this holds for all the aggregation functions 5 5 5 considered, and it is valid for both transformations of 4 4 4 3 3 3 the E6 scale, into three and two levels. Also in this case C C C 2 2 2 we computed the statistical significance between cat- 1 1 1 0 0 0 egories, applying the Bonferroni correction to account 012 345 012 345 012 345 for multiple comparisons. Results are as follows. For the E E E case of three groups, both the categories at distance one and two are always significant to the p < 0.01 level, for Fig. 2: The agreement between the PolitiFact experts both the Mann-Whitney and the t-test, for all three and the crowd judgments. From left to right: C6 aggre- aggregation functions. The same behavior holds for the gated with mean; C6 aggregated with median; C6 ag- case of two groups, where the categories of distance 1 gregated with majority vote. First row: E6 to E3 ; second are always significant to the p < 0.01 level. row: E6 to E2 . Compare with Figure 1. Summarizing, we can now conclude that by merg- ing the ground truth levels we obtained a much stronger signal: the crowd can effectively detect and classify mis- confirm the finding derived from Figure 1 that workers information statements related to the COVID-19 pan- are most effective in identifying and categorizing state- demic. ments with a higher truthfulness level. This remark is also supported by [7] which shows that Φ is better in 5.3.2 Merging Crowd Levels distinguishing agreement levels in crowdsourcing than α, which is more indicated as a measure of data relia- Having reported the results on merging the ground truth bility in non-crowdsourced settings. categories we now turn to transform the crowd labels (i.e., C6 ) into three (referred to as C3 ) and two (C2 ) 5.3 RQ2: Transforming Truthfulness Scales categories. For the transformation process we rely on the approach detailed by Han et al. [23]. This approach Given the positive results presented above, it appears has many advantages [23]: we can simulate the effect that the answer to RQ1 is overall positive, even if with of having the crowd answers in a more coarse-grained some exceptions. There are many remarks that can be scale (rather than C6 ), and thus we can simulate new made: first, there is a clear issue that affects the pants- data without running the whole experiment on MTurk on-fire and false categories, which are very often again. As we did for the ground truth scale, we choose mis-classified by workers. Moreover, while PolitiFact to select as target scales the two- and three- levels scale, used a six-level judgment scale, the usage of a two- (e.g., driven by the same motivations. Having selected C6 as True/False) and a three-level (e.g., False / In between being the source scale, and having selected the target / True) scale is very common when assessing the truth- scales as the three- and two- level ones (C3 and C2 ), fulness of statements [27, 51]. Finally, categories can we perform the following experiment. 
We perform all be merged together to improve accuracy, as done for the possible cuts11 from C6 to C3 and from C6 to C2 ; example by Tchechmedjiev et al. [59]. All these consid- then, we measure the internal agreement (using α and erations lead us to RQ2, addressed in the following. Φ) both on the source and on the target scale, and we compare those values. In such a way, we are able to iden- tify, among all the possible cuts, the cut which leads to 5.3.1 Merging Ground Truth Levels the highest possible internal agreement. For all the above reasons, we performed the following We found that, for the C6 to C3 transformation, experiment: we group together the six PolitiFact cat- both for α and Φ there is a single cut which leads to egories (i.e., E6 ) into three (referred to as E3 ) or two higher agreement levels with the original C6 scale. On (E2 ) categories, which we refer respectively with 01, 23, the contrary, for the C6 to C2 transformation, we found and 45 for the three level scale, and 012 and 234 for 11 C6 can be transformed into C3 in 10 different ways, and the two level scale. C6 can be transformed into C2 in 5 different ways.
Can the Crowd Judge Truthfulness? A longitudinal Study 9 2.0 2.0 2.0 2.0 1.5 1.5 1.5 1.5 1.0 1.0 1.0 1.0 C C C C 0.5 0.5 0.5 0.5 0.0 0.0 0.0 0.0 0 1 2 3 4 5 0 1 2 3 4 5 01 23 45 01 23 45 E E E E 1.0 1.0 1.0 1.0 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 C C C C 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0 0 1 2 3 4 5 0 1 2 3 4 5 012 345 012 345 E E E E Fig. 3: Comparison with E6 . C6 to C3 (first two plots) Fig. 4: C6 to C3 (first two plots) and to C2 (last two and to C2 (last two plots), then aggregated with the plots), then aggregated with the mean function. First mean function. Best cut selected according to α (fist two plots: E6 to E3 . Last two plots: E6 to E2 . Best cut and third plot) and Φ (second and fourth plot). Com- selected according to α (first and third plots) and Φ pare with Figure 1. (second and fourth plots). Compare with Figures 1, 2, and 3. that there is a single cut for α which leads to similar agreement levels as in the original C6 scale, and there the E2 case (last two plots) the classes appear to be not are no cuts with such a property when using Φ. Having separable. Summarizing, all these results show that it identified the best possible cuts for both transforma- is feasible to successfully combine the aforementioned tions and for both agreement metrics, we now measure approaches, and transform into a three- and two-level the external agreement between the crowd and the ex- scale both the crowd and the expert judgments. pert judgments, using the selected cut. Figure 3 shows such a result when considering the judgments aggregated with the mean function. As we 5.4 RQ3: Worker Background and Bias can see from the plots, it is again the case that the me- dian values of the boxplots is always increasing, for all To address RQ3 we study if the answers to question- the transformations. Nevertheless, inspecting the plots naire and CRT test have any relation with workers we can state that the overall external agreement ap- quality. Previous work have shown that political and pears to be lower than the one shown in Figure 1. More- personal biases as well as cognitive abilities have an over, we can state that even using these transformed impact on the workers quality [27, 51]; recent articles scales the categories pants-on-fire and false are still have shown that the same effect might apply also to not separable. Summarizing, we show that it is feasible fake news [60]. For this reason, we think it is reason- to transform the judgments collected on a C6 level scale able to investigate if workers’ political biases and cog- into two new scales, C3 and C2 , obtaining judgments nitive abilities influence their quality in the setting of with a similar internal agreement as the original ones, misinformation related to COVID-19. and with a slightly lower external agreement with the When looking at the questionnaire answers, we found expert judgments. a relation with the workers quality only when consid- ering the answer to the workers political views (see B 5.3.3 Merging both Ground Truth and Crowd Levels for questions and answers). In more detail, using Accu- racy (i.e., the fraction of exactly classified statements) It is now natural to combine the two approaches. Fig- we measured the quality of workers in each group. The ure 4 shows the comparison between C6 transformed number and fraction of correctly classified statements into C3 and C2 , and E6 transformed into E3 and E2 . 
are however rather crude measures of worker’s quality, As we can see form the plots, also in this case the me- as small misclassification errors (e.g, pants-on-fire in dian values of the boxplots are increasing, especially for place of false) are as important as more striking ones the E3 case (first two plots). Furthermore, the external (e.g., pants-on-fire in place of true). Therefore, to agreement with the ground truth is present, even if for measure the ability of workers to correctly classify the
10 Kevin Roitero et al. Table 1: Statement position in the task versus: time 5.5.1 Time and Queries elapsed, cumulative on each single statement (first row), CEMORD (second row), number of queries issued (third Table 1 (fist two rows) shows the amount of time spent row), and number of times the statement has been used on average by the workers on the statements and their as a query (fourth row). The total and average number CEMORD score. As we can see, the time spent on the of queries is respectively 2095 and 262, while the total first statement is considerably higher than on the last and average number of statements as query is respec- statements, and overall the time spent by the workers tively of 245 and 30.6. almost monotonically decreases while the statement po- sition increases. This, combined with the fact that the Statement quality of the assessment provided by the workers (mea- 1 2 3 4 5 6 7 8 Position sured with CEMORD ) does not decrease for the last state- Time (sec) 299 282 218 216 223 181 190 180 ments is an indication of a learning effect: the workers CEMORD .63 .618 .657 .611 .614 .569 .639 .655 Number of 352 280 259 255 242 238 230 230 learn how to assess truthfulness in a faster way. Queries 16.8% 13.4% 12.4% 12.1% 11.6% 11.3% 11.0% 11.4% We now turn to queries. Table 1 (third and fourth Statement 22 32 31 33 34 30 29 34 row) shows query statistics for the 100 workers which as Query 9% 13% 12.6% 13.5% 13.9% 12.2% 11.9% 13.9% finished the task. As we can see, the higher the state- ment position, the lower the number of queries issued: 3.52% on average for the first statement down 2,30% statements, we also compute the Closeness Evaluation for the last statement. This can indicate the attitude Measure (CEMORD ), an effectiveness metric recently pro- of workers to issue fewer queries the more time they posed for the specific case of ordinal classification [1] spend on the task, probably due to fatigue, boredom, (see Roitero et al. [51, Sect. 3.3] for a more detailed or learning effects. Nevertheless, we can see that on av- discussion of these issues). The accuracy and CEMORD erage, for all the statement positions each worker is- values are respectively of 0.13 and 0.46 for ‘Very con- sues more than one query: workers often reformulate servative’, 0.21 and 0.51 for ‘Conservative’, 0.20 and their initial query. This provides further evidence that 0.50 for ‘Moderate’, 0.16 and 0.50 for ‘Liberal’, and 0.21 they put effort in performing the task and that suggests and 0.51 for ‘Very liberal’. By looking at both Accuracy the overall high quality of the collected judgments. The and CEMORD , it is clear that ‘Very conservative’ work- third row of the table shows the number of times the ers provide lower quality labels. The Bonferroni cor- worker used as query the whole statement. We can see rected two tailed t-test on CEMORD confirms that ‘Very that the percentage is rather low (around 13%) for all conservative’ workers perform statistically significantly the statement positions, indicating again that workers worse than both ‘Conservative’ and ‘Very liberal’ work- spend effort when providing their judgments. ers. The workers’ political views affect the CEMORD score, even if in a small way and mainly when considering the 5.5.2 Exploiting Worker Signals to Improve Quality extremes of the scale. 
An initial analysis of the other answers to the questionnaire (not shown) does not seem We have shown that, while performing their task, work- to provide strong signals; a more detailed analysis is left ers provide many signals that to some extent correlate for future work. with the quality of their work. These signals could in We also investigated the effect of the CRT tests on principle be exploited to aggregate the individual judg- the worker quality. Although there is a small variation ments in a more effective way (i.e., giving more weight in both Accuracy and CEMORD (not shown), this is never to workers that possess features indicating a higher statistically significant; it appears that the number of quality). For example, the relationships between worker correct answers to the CRT tests is not correlated with background / bias and worker quality (Section 5.4) could worker quality. We leave for future work a more detailed be exploited to this aim. study of this aspect. We thus performed the following experiment: we aggregated C6 individual scores, using as aggregation function a weighted mean, where the weights are ei- ther represented by the political views, or the number of correct answers to CRT, both normalized in [0.5, 1]. 5.5 RQ4: Worker Behavior We found a very similar behavior to the one observed for the second plot of Figure 1; it seems that leveraging We now turn to RQ4, and analyze the behavior of the quality-related behavioral signals, like questionnaire an- workers. swers or CRT scores, to aggregate results does not pro-
Can the Crowd Judge Truthfulness? A longitudinal Study 11 1 2 URL % 0.5 1.0 3 snopes.com 11.79% 4 msn.com 8.93% 0.4 0.8 Cumulative Percentage 5 factcheck.org 6.79% rank 6 wral.com 6.79% 0.3 copied 0.6 7 usatoday.com 5.36% no copied 8 statesman.com 4.64% reuters.com 4.64% 0.2 0.4 9 cdc.gov 4.29% 10 mediabiasfactcheck.com 4.29% 0.1 0.2 0.0 0.1 0.2 0.3 0.4 businessinsider.com 3.93% frequency 0.0 0.0 0 1 2 3 4 5 Fig. 5: On the left, distribution of the ranks of the URLs Judgment absolute error selected by workers, on the right, websites from which 0.4 workers chose URLs to justify their judgments. copied no copied 0.3 Percentage vide a noticeable increase in the external agreement, although it does not harm. 0.2 0.1 5.6 RQ5: Sources of Information 0.0 We now turn to RQ5, and analyze the sources of infor- -5 -4 -3 -2 -1 0 1 2 3 4 5 mation used by the workers while performing the task. Judgment error 5.6.1 URL Analysis Fig. 6: Effect of the origin of a justification on: the abso- lute value of the prediction error (top; cumulative distri- Figure 5 shows on the left the distribution of the ranks butions shown with thinner lines and empty markers), of the URL selected as evidence by the worker when and the prediction error (bottom). Text copied/not performing each judgment. URLs selected less than 1% copied from the selected URL. times are filtered out from the results. As we can see from the plot, about 40% of workers selected the first re- sult retrieved by our search engine, and selected the re- lected URLs, and their links with worker quality. 54% maining positions less frequently, with an almost mono- of the provided justifications contain text copied from tonic decreasing frequency (rank 8 makes the excep- the web page at the URL selected for evidence, while tion). We also found that 14% of workers inspected up 46% do not. Furthermore, 48% of the justification in- to the fourth page of results (i.e., rank= 40). The break- clude some “free text” (i.e., text generated and writ- down on the truthfulness PolitiFact categories does ten by the worker), and 52% do not. Considering all not show any significant difference. the possible combinations, 6% of the justifications used Figure 5 shows on the right part the top 10 of web- both free text and text from web page, 42% used free sites from which the workers choose the URL to justify text but no text from the web page, 48% used no free their judgments. Websites with percentage ≤ 3.9% are text but only text from web page, and finally 4% used filtered out. As we can see from the table, there are neither free text nor text from web page, and either many fact check websites among the top 10 URLs (e.g., inserted text from a different (not selected) web page snopes: 11.79%, factcheck 6.79%). Furthermore, med- or inserted part of the instructions we provided or text ical websites are present (cdc: 4.29%). This indicates from the user interface. that workers use various kind of sources as URLs from which they take information. Thus, it appears that they Concerning the preferred way to provide justifica- put effort in finding evidence to provide a reliable truth- tions, each worker seems to have a clear attitude: 48% of fulness judgment. the workers used only text copied from the selected web pages, 46% of the workers used only free text, 4% used 5.6.2 Justifications both, and 2% of them consistently provided text coming from the user interface or random internet pages. 
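A simple way to decide whether a justification contains text copied from the selected web page — the distinction used in this analysis — is to check for a sufficiently long word sequence shared verbatim by the justification and the page text. The sketch below illustrates this heuristic; it assumes the page text has already been fetched and cleaned, and it is our reconstruction rather than the procedure the authors necessarily used (the n-gram threshold and the names are hypothetical).

```python
def contains_copied_text(justification: str, page_text: str, min_ngram: int = 8) -> bool:
    """Heuristic: the justification counts as 'copied' if any run of min_ngram
    consecutive words also appears verbatim in the selected page's text."""
    words = justification.lower().split()
    page = " ".join(page_text.lower().split())
    for start in range(len(words) - min_ngram + 1):
        ngram = " ".join(words[start:start + min_ngram])
        if ngram in page:
            return True
    return False

if __name__ == "__main__":
    page = ("The CDC recommends washing hands often with soap and water "
            "for at least 20 seconds.")
    quoted = ("According to the page, washing hands often with soap and water "
              "for at least 20 seconds helps.")
    free = "I believe the claim is exaggerated based on what I read."
    print(contains_copied_text(quoted, page), contains_copied_text(free, page))  # True False
```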
As a final result, we analyze the textual justifications We now correlate such a behavior with the work- provided, their relations with the web pages at the se- ers quality. Figure 6 shows the relations between dif-
12 Kevin Roitero et al. ferent kinds of justifications and the worker accuracy. Table 2: Experimental setting for the longitudinal The plots show the absolute value of the prediction er- study. All dates refer to 2020. Values reported are ab- ror on the left, and the prediction error on the right. solute numbers. The figure shows if the text inserted by the worker was Number of Workers copied or not from the web page selected; we performed Date Acronym Batch1 Batch2 Batch3 Batch4 Total the same analysis considering if the worker used or not May Batch1 100 – – – 100 free text, but results where almost identical to the for- June Batch2 – 100 – – 100 mer analysis. As we can see from the plot, statements Batch2from1 29 – – – 29 on which workers make less errors (i.e., where x-axis July Batch3 – – 100 – 100 = 0) tend to use text copied from the web page se- Batch3from1 22 – – – 22 Batch3from2 – 20 – – 20 lected. On the contrary, statements on which workers Batch3from1or2 22 20 – – 42 make more errors (values close to 5 in the left plot, and August Batch4 – – – 100 100 values close to +/− 5 in the right plot) tend to use text Batch4from1 27 – – – 27 Batch4from2 – 11 – – 11 not copied from the selected web page. The differences Batch4from3 – – 33 – 33 are small, but it might be an indication that workers of Batch4from1or2or3 27 11 33 – 71 higher quality tend to read the text from selected web Batchall 100 100 100 100 400 page, and report it in the justification box. To con- firm this result, we computed the CEMORD scores for the two classes considering the individual judgments: the derived from our analysis to answer RQ6 (Section 6.2), class “copied” has CEMORD = 0.640, while the class “not RQ7 (Section 6.3), and RQ8 (Section 6.4). copied” has a lower value, CEMORD = 0.600. We found that such behavior is consistent for what concerns the usage of free text: statements on which 6.1 Experimental Setting workers make less errors tend to use more free text than the ones that make more errors. This is an indication The longitudinal study is based on the same dataset that workers which add free text as a justification, pos- (see Section 4.1) and experimental setting (see Section sibly reworking the information present in the selected 4.2) of the main experiment. The crowdsourcing judg- URL, are of a higher quality. In this case the CEMORD ments were collected as follows. The data for main ex- measure confirms that the two classes are very similar: periment (from now denoted with Batch1) has been col- the class free text has CEMORD = 0.624, while the class lected on May 2020. On June 2020 we re-launched the not free text has CEMORD = 0.621. HITs from Batch1 with a novel set of workers (i.e., we By looking at the right part of Figure 6 we can see prevented the workers of Batch1 to perform the exper- that the distribution of the prediction error is not sym- iment again); we denote such set of data with Batch2. metrical, as the frequency of the errors is higher on the On July 2020 we collected an additional batch: we re- positive side of the x-axis ([0,5]). These errors corre- launched the HITs from Batch1 with novice workers spond to workers overestimating the truthfulness value (i.e., we prevented the workers of Batch1 and Batch2 of the statement (with 5 being the result of labeling a to perform the experiment again); we denote such set pants-on-fire statement as true). It is also notice- of data with Batch3. 
Finally, on August 2020 we re- able that the justifications containing text copied from launched the HITs from Batch1 for the last time, pre- the selected URL have a lower rate of errors in the neg- venting workers of previous batches to perform the ex- ative range, meaning that workers which directly quote periment, collecting the data for Batch4. Then, we con- the text avoid underestimating the truthfulness of the sidered an additional set of experiments: for a given statement. batch, we contacted the workers from previous batches sending them a $0.01 bonus and asking them to perform the task again. We obtained the datasets detailed in Ta- 6 Variation of the Judgments over Time ble 2 where BatchXfromY denotes the subset of workers that performed BatchX and had previously participated To perform repeated observations of the crowd anno- in BatchY. Note that an experienced (returning) worker tating misinformation (i.e., doing a longitudinal study) who does the task for the second time gets generally a with different sets of workers, we re-launched the HITs new HIT assigned, i.e., a HIT different from the per- of the main experiment three subsequent times, each of formed originally; we have no control on this matter, them one month apart. In this section we detail how since HITs are assigned to workers by the MTurk plat- the data was collected (Section 6.1) and the findings form. Finally, we also considered the union of the data
Can the Crowd Judge Truthfulness? A longitudinal Study 13 Table 3: Abandonment data for each batch of the lon- Table 3 shows the abandonment data for each batch gitudinal study. of the longitudinal study, indicating the amount of work- ers which completed, abandoned, or failed the task (due Number of Workers to failing the quality checks). Overall, the abandonment Acronym Complete Abandon Fail Total ratio is quite well balanced across batches, with the Batch1 100 (30%) 188 (56%) 46 (14%) 334 only exception of Batch3, that shows a small increase Batch2 100 (37%) 129 (48%) 40 (15%) 269 in the amount of workers which failed the task; never- Batch3 100 (23%) 220 (51%) 116 (26%) 436 theless, such small variation is not significant and might Batch4 100 (36%) 124 (45%) 54 (19%) 278 be caused by a slightly lower quality of workers which Average 100 (31%) 165 (50%) 64 (19%) 1317 started Batch3. On average, Table 3 shows that 31% of the workers completed the task, 50% abandoned it, and 19% failed the quality checks; these values are aligned from Batch1, Batch2, Batch3, and Batch4; we denote with previous studies (see Roitero et al. [51]). this dataset with Batchall . 6.2.2 Agreement across Batches 6.2 RQ6: Repeating the Experiment with Novice We now turn to study the quality of both individual Workers and aggregated judgments across the different batches. Measuring the correlation between individual judgments 6.2.1 Worker Background, Behavior, Bias, and we found rather low correlation values: the correlation Abandonment between Batch1 and Batch2 is of ρ = 0.33 and τ = 0.25, the correlation between Batch1 and Batch3 is of We first studied the variation in the composition of ρ = 0.20 and τ = 0.14, between Batch1 and Batch4 the worker population across different batches. To this is of ρ = 0.10 and τ = 0.074; the correlation between aim, we considered a General Linear Mixture Model Batch2 and Batch3 is of ρ = 0.21 and τ = 0.15, between (GLMM) [32] together with the Analysis Of Variance Batch2 and Batch4 is of ρ = 0.10 and τ = 0.085; finally, (ANOVA) [35] to analyze how worker behavior changes the correlation values between Batch3 and Batch4 is of across batches, and measured the impact of such changes. ρ = 0.08 and τ = 0.06. In more detail, we considered the ANOVA effect size Overall, the most recent batch (Batch4) is the batch ω 2 , an unbiased index used to provide insights of the which achieves the lowest correlation values w.r.t. the population-wide relationship between a set of factors other batches, followed by Batch3. The highest correla- and the studied outcomes [15, 14, 70, 50, 16]. With tion is achieved between Batch1 and Batch2. This pre- such setting, we fitted a linear model with which we liminary result suggest that it might be the case that measured the effect of the age, school, and all other the time-span in which we collected the judgments of possible answers to the questions in the questionnaire the different batches has an impact on the judgments (B) w.r.t. individual judgment quality, measured as the similarity across batches, and batches which have been absolute distance between the worker judgments and launched in time-spans close to each other tend to be the expert one with the mean absolute error (MAE). more similar than other batches. 
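The ρ and τ values reported here are rank correlations between batches. As a minimal sketch, the snippet below computes Spearman's ρ and Kendall's τ with SciPy on mean-aggregated per-statement scores, which is the comparison shown for the aggregated judgments in Figure 7; the data layout and names are our assumptions, not the authors' code.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def batch_correlation(batch_a: dict, batch_b: dict):
    """Rank correlation between two batches of judgments.

    batch_a / batch_b: {statement_id: list of C6 judgments for that statement}.
    Each statement is represented by its mean-aggregated score before
    correlating (illustrative layout, not the authors' data format).
    """
    common = sorted(set(batch_a) & set(batch_b))
    scores_a = [np.mean(batch_a[s]) for s in common]
    scores_b = [np.mean(batch_b[s]) for s in common]
    rho, _ = spearmanr(scores_a, scores_b)
    tau, _ = kendalltau(scores_a, scores_b)
    return rho, tau

if __name__ == "__main__":
    batch1 = {"s1": [0, 1, 1], "s2": [4, 5, 4], "s3": [2, 3, 2]}
    batch2 = {"s1": [1, 0, 1], "s2": [5, 5, 4], "s3": [3, 2, 3]}
    print(batch_correlation(batch1, batch2))
```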
We now turn to an- By inspecting the ω 2 index, we found that while all the alyze the aggregated judgments, to study if such re- effects are either small or non-present [40], the largest lationship is still valid when individual judgments are effects are provided by workers’ answers to the taxes aggregated. Figure 7 shows the agreement between the and southern border questions. We also found that the aggregated judgments of Batch1, Batch2, Batch3, and effect of the batch is small but not negligible, and is Batch4. The plot shows in the diagonal the distribu- on the same order of magnitude of the effect of other tion of the aggregated judgments, in the lower triangle factors. We also computed the interaction plots (see for the scatterplot between the aggregated judgments of example [20]) considering the variation of the factors the different batches, and in the upper triangle the cor- from the previous analysis on the different batches. Re- responding ρ and τ correlation values. The plots show sults suggest a small or not significant [13] interaction that the correlation values of the aggregated judgments between the batch and all the other factors. This anal- are greater than the ones measured for individual judg- ysis suggests that, while the difference among different ments. This is consistent for all the batches. In more de- batches is present, the population of workers which per- tail, we can see that the agreement between Batch1 and formed the task is homogeneous, and thus the different Batch2 (ρ = 0.87, τ = 0.68) is greater than the agree- dataset (i.e., batches) are comparable. ment between any other pair of batches; we also see that