Personal and Ubiquitous Computing manuscript No.
(will be inserted by the editor)

arXiv:2107.11755v1 [cs.IR] 25 Jul 2021

Can the Crowd Judge Truthfulness? A Longitudinal Study on Recent Misinformation about COVID-19

Kevin Roitero · Michael Soprano · Beatrice Portelli · Massimiliano De Luise · Damiano Spina · Vincenzo Della Mea · Giuseppe Serra · Stefano Mizzaro · Gianluca Demartini

Received: February 2021 / Accepted: July 2021

Abstract Recently, the misinformation problem has been addressed with a crowdsourcing-based approach: to assess the truthfulness of a statement, instead of relying on a few experts, a crowd of non-experts is exploited. We study whether crowdsourcing is an effective and reliable method to assess truthfulness during a pandemic, targeting statements related to COVID-19, thus addressing (mis)information that is both related to a sensitive and personal issue and very recent as compared to when the judgment is made. In our experiments, crowd workers are asked to assess the truthfulness of statements and to provide evidence for their assessments. Besides showing that the crowd is able to accurately judge the truthfulness of the statements, we report results on workers' behavior, agreement among workers, and the effect of aggregation functions, of scale transformations, and of workers' background and bias. We perform a longitudinal study by re-launching the task multiple times with both novice and experienced workers, deriving important insights on how behavior and quality change over time. Our results show that: workers are able to detect and objectively categorize online (mis)information related to COVID-19; both crowdsourced and expert judgments can be transformed and aggregated to improve quality; and worker background and other signals (e.g., source of information, behavior) impact the quality of the data. The longitudinal study demonstrates that the time-span has a major effect on the quality of the judgments, for both novice and experienced workers. Finally, we provide an extensive failure analysis of the statements misjudged by the crowd-workers.

Keywords Information Behavior · Crowdsourcing · Misinformation · COVID-19

1 Introduction

“We’re concerned about the levels of rumours and misinformation that are hampering the response. [...] we’re not just fighting an epidemic; we’re fighting an infodemic. Fake news spreads faster and more easily than this virus, and is just as dangerous. That’s why we’re also working with search and media companies like Facebook, Google, Pinterest, Tencent, Twitter, TikTok, YouTube and others to counter the spread of rumours and misinformation. We call on all governments, companies and news organizations to work with us to sound the appropriate level of alarm, without fanning the flames of hysteria.”

These are the alarming words used by Dr. Tedros Adhanom Ghebreyesus, the Director General of the WHO (World Health Organization), during his speech at the Munich Security Conference on 15 February 2020.1 It is telling that the WHO Director General chose to explicitly target misinformation-related problems.

This is a preprint of an article accepted for publication in Personal and Ubiquitous Computing (Special Issue on Intelligent Systems for Tackling Online Harms).

Kevin Roitero · Michael Soprano · Beatrice Portelli · Massimiliano De Luise · Vincenzo Della Mea · Giuseppe Serra · Stefano Mizzaro
University of Udine, Udine, Italy

Damiano Spina
RMIT University, Melbourne, Australia

Gianluca Demartini
The University of Queensland, Brisbane, Australia

1 https://www.who.int/dg/speeches/detail/munich-security-conference

Indeed, during the still ongoing COVID-19 health emergency, all of us have experienced mis- and disinformation. The research community has focused on
several COVID-19 related issues [5], ranging from machine learning systems aiming to classify statements and claims on the basis of their truthfulness [66], to search engines tailored to the COVID-19 related literature, as in the ongoing TREC-COVID Challenge2 [46], topic-specific workshops like the NLP COVID workshop at ACL’20,3 and evaluation initiatives like the TREC Health Misinformation Track.4 Besides the academic research community, commercial social media platforms have also looked at this issue.5

Among all the approaches, some very recent work by Roitero et al. [47], La Barbera et al. [27], and Roitero et al. [51] has studied whether crowdsourcing can be used to identify misinformation. As is well known, crowdsourcing means outsourcing a task – which is usually performed by a limited number of experts – to a large mass (the “crowd”) of unknown people (the “crowd workers”) by means of an open call. The recent works mentioned before [47, 27, 51] specifically crowdsource the task of misinformation identification, or rather the assessment of the truthfulness of statements made by public figures (e.g., politicians), usually on political, economic, and societal issues.

The idea that the crowd is able to identify misinformation might sound implausible at first – isn’t the crowd the very means by which misinformation is spread? However, on the basis of previous studies [47, 27, 51], it appears that the crowd can provide high quality results when asked to assess the truthfulness of statements, provided that adequate countermeasures and quality assurance techniques are employed.

In this paper6 we address the very same problem, but focusing on statements about COVID-19. This is motivated by several reasons. First, COVID-19 is of course a hot topic but, although there is a great amount of research effort worldwide devoted to its study, there are no studies yet using crowdsourcing to assess the truthfulness of COVID-19 related statements. To the best of our knowledge, we are the first to report on crowd assessment of COVID-19 related misinformation. Second, the health domain is particularly sensitive, so it is interesting to understand whether the crowdsourcing approach is adequate also in such a particular domain. Third, in the previous work [47, 27, 51] the statements judged by the crowd were not recent. This means that evidence on statement truthfulness was often available out there (on the Web), and although the experimental design prevented workers from easily finding that evidence, it cannot be excluded that the workers did find it, or perhaps they were familiar with the particular statement because, for instance, it had been discussed in the press. By focusing on COVID-19 related statements we instead naturally target recent statements: in some cases the evidence might still be out there, but this will happen more rarely.

Fourth, an almost ideal tool to address misinformation would be a crowd able to assess truthfulness in real time, immediately after a statement becomes public: although we are not there yet, and there is a long way to go, we find that targeting recent statements is a step forward in the right direction. Fifth, our experimental design differs in some details, and allows us to address novel research questions. Finally, we also perform a longitudinal study by collecting the data multiple times and launching the task at different timestamps, considering both novice workers – i.e., workers who have never done the task before – and experienced workers – i.e., workers who have performed the task in previous batches and were invited to do the task again. This allows us to study multiple behavioral aspects of the workers when assessing truthfulness.

This paper is structured as follows. In Section 2 we summarize related work. In Section 3 we detail the aims of this study and list some specific research questions, addressed by means of the experimental setting described in Section 4. In Section 5 we present and discuss the results, while in Section 6 we present the longitudinal study conducted. Finally, in Section 7 we conclude the paper: we summarize our main findings, list the practical implications, highlight some limitations, and sketch future developments.

2 https://ir.nist.gov/covidSubmit/
3 https://www.nlpcovid19workshop.org/
4 https://trec-health-misinfo.github.io/
5 https://www.forbes.com/sites/bernardmarr/2020/03/27/finding-the-truth-about-covid-19-how-facebook-twitter-and-instagram-are-tackling-fake-news/ and https://spectrum.ieee.org/view-from-the-valley/artificial-intelligence/machine-learning/how-facebook-is-using-ai-to-fight-covid19-misinformation
6 This paper is an extended version of the work by Roitero et al. [52]. In an attempt to provide a uniform, comprehensive, and more understandable account of our research, we also report in the first part of the paper the main results already published in [52]. Conversely, all the results on the longitudinal study in the second half of the paper are novel.

2 Background

We survey background work on theoretical and conceptual aspects of misinformation spreading, the specific case of the COVID-19 infodemic, the relation between truthfulness classification and argumentation, and the use of crowdsourcing to identify fake news.
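Much of the work surveyed below, as well as the experiments later in this paper, follows the same measurement pipeline: per-statement judgments from several workers are aggregated, typically with the mean, and the aggregated scores are compared against expert labels using rank correlation coefficients such as Spearman’s ρ and Kendall’s τ. The following is a minimal, self-contained sketch of that pipeline; the data are invented for illustration, and the helper functions are simplified tie-free implementations, not the authors’ code:

```python
import numpy as np

def spearman_rho(a, b):
    # Spearman's rho = Pearson correlation of the rank vectors
    # (simplified: assumes no tied values).
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def kendall_tau(a, b):
    # Kendall's tau-a: (concordant - discordant) / number of pairs.
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    s = sum(np.sign(a[i] - a[j]) * np.sign(b[i] - b[j]) for i, j in pairs)
    return float(s / len(pairs))

# Hypothetical judgments: each statement judged by 5 workers on a
# six-level ordinal scale (0 = pants-on-fire ... 5 = true).
worker_judgments = {
    "s1": [0, 1, 0, 1, 0],
    "s2": [2, 3, 2, 2, 4],
    "s3": [3, 4, 4, 5, 3],
    "s4": [5, 5, 4, 5, 5],
}
expert_labels = {"s1": 0, "s2": 2, "s3": 4, "s4": 5}

# Aggregate each statement's judgments with the mean, then measure
# external agreement between aggregated crowd scores and expert labels.
statements = sorted(expert_labels)
crowd = [float(np.mean(worker_judgments[s])) for s in statements]
expert = [float(expert_labels[s]) for s in statements]
print(spearman_rho(crowd, expert), kendall_tau(crowd, expert))  # 1.0 1.0
```

On this toy data the crowd means are perfectly rank-aligned with the expert labels, so both coefficients equal 1.0; real crowd data yields substantially lower values, as the agreement figures discussed in Section 2.4 show.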
2.1 Echo Chambers and Filter Bubbles

The way information spreads through social media and, in general, the Web has been widely studied, leading to the discovery of a number of phenomena that were not so evident in the pre-Web world. Among those, echo chambers and epistemic bubbles seem to be central concepts [38].

Regarding their importance in news consumption, Flaxman et al. [17] examine the browsing history of US-based users who read news articles. They found that both search engines and social networks increase the ideological distance between individuals, and that they increase the exposure of the user to material of opposed political views.

These effects can be exploited to spread misinformation. Törnberg [63] modelled how echo chambers contribute to the virality of misinformation, by providing an initial environment in which misinformation is propagated up to some level that makes it easier to expand outside the echo chamber. This helps to explain why clusters, usually known to restrain the diffusion of information, become central enablers of spread.

On the other hand, acting against misinformation seems not to be an easy task, at least due to the backfire effect, i.e., the effect for which someone’s belief hardens when confronted with evidence contrary to their opinion. Sethi and Rangaraju [53] studied the backfire effect and presented a collaborative framework aimed at fighting it by making users understand their emotions and biases. However, the paper does not discuss how techniques for recognizing misinformation can be effectively translated into actions for fighting it in practice.

2.2 Truthfulness and Argumentation

Truthfulness classification and the process of fact-checking are strongly related to the scrutiny of factual information extensively studied in argumentation theory [61, 68, 28, 54, 64, 3]. Lawrence and Reed [28] survey the techniques which are the foundations of argument mining, i.e., extracting and processing the inference and structure of arguments expressed in natural language. Sethi [54] leverages argumentation theory and proposes a framework to verify the truthfulness of facts, Visser et al. [64] use it to increase the critical thinking ability of people who assess media reports, Sethi et al. [55] use it together with pedagogical agents to develop a recommendation system that helps fight misinformation, and Snaith et al. [57] present an open-source platform based on a modular, distributed architecture for argumentation and dialogue.

2.3 COVID-19 Infodemic

The number of initiatives to apply Information Access – and, in general, Artificial Intelligence – techniques to combat the COVID-19 infodemic has been rapidly increasing (see Bullock et al. [5] for a survey). Tangcharoensathien et al. [58] distilled a subset of 50 actions from a set of 594 ideas crowdsourced during a technical consultation held online by the WHO (World Health Organization) to build a framework for managing infodemics in health emergencies.

There is significant effort on analyzing COVID-19 information on social media, and on linking it to data from external fact-checking organizations to quantify the spread of misinformation [19, 9, 69]. Mejova and Kalimeri [33] analyzed Facebook advertisements related to COVID-19, and found that around 5% of them contain errors or misinformation. Crowdsourcing methodologies have also been used to collect and analyze data from patients with cancer who are affected by the COVID-19 pandemic [11].

Tsai et al. [62] investigate the relationships between news consumption, trust, intergroup contact, and prejudicial attitudes toward Asians and Asian Americans residing in the United States during the COVID-19 pandemic.

2.4 Crowdsourcing Truthfulness

Recent work has focused on the automatic classification of truthfulness or fact-checking [43, 34, 2, 37, 12, 25, 55].

Zubiaga and Ji [71] investigated, using crowdsourcing, the reliability of tweets in the setting of disaster management. CLEF developed a Fact-Checking Lab [37, 12, 4] to address the issue of ranking sentences according to some fact-checking property.

There is recent work that studies how to collect truthfulness judgments by means of crowdsourcing using fine-grained scales [47, 27, 51]. Samples of statements from the PolitiFact dataset – originally published by Wang [65] – have been used to analyze the agreement of workers with labels provided by experts in the original dataset. Workers are asked to assess the truthfulness of the selected statements by means of different fine-grained scales. Roitero et al. [47] compared two fine-grained scales: one in the [0, 100] range and one in the (0, +∞) range, the latter on the basis of Magnitude Estimation [36]. They found that both scales allow the collection of reliable truthfulness judgments that are in agreement with the ground truth. Furthermore, they show that the scale with one hundred levels leads to slightly higher agreement levels with the expert judgments. On a larger sample of PolitiFact statements, La Barbera
et al. [27] asked workers to use the original scale used by the PolitiFact experts and the scale in the [0, 100] range. They found that aggregated judgments (computed using the mean function for both scales) have a high level of agreement with expert judgments. Recent work by Roitero et al. [51] found similar results in terms of external agreement and its improvement when aggregating crowdsourced judgments, using statements from two different fact-checkers: PolitiFact and ABC Fact-Check (ABC). Previous work has also looked at internal agreement, i.e., agreement among workers [47, 51]. Roitero et al. [51] found that scales have low levels of agreement when compared with each other: correlation values for aggregated judgments on the different scales are around ρ = 0.55–0.6 for PolitiFact and ρ = 0.35–0.5 for ABC, and τ = 0.4 for PolitiFact and τ = 0.3 for ABC. This indicates that the same statements tend to be evaluated differently on different scales.

There is evidence of differences in the way workers provide judgments, influenced by the sources they examine, as well as of the impact of worker bias. In terms of sources, La Barbera et al. [27] found that the vast majority of workers (around 73% for both scales) do indeed use the PolitiFact website to provide judgments. Differently from La Barbera et al. [27], Roitero et al. [51] used a custom search engine in order to filter out PolitiFact and ABC from the list of results. Results show that, for all the scales, Wikipedia and news websites are the most popular sources of evidence used by the workers. In terms of worker bias, La Barbera et al. [27] and Roitero et al. [51] found that workers’ political background has an impact on how they provide truthfulness scores. In more detail, they found that workers are more tolerant and moderate when judging statements from their very own political party.

Roitero et al. [52] use a crowdsourcing-based approach to collect truthfulness judgments on a sample of PolitiFact statements concerning COVID-19 to understand whether crowdsourcing is a reliable method to identify and correctly classify (mis)information during a pandemic. They find that workers are able to provide judgments which can be used to objectively identify and categorize (mis)information related to the pandemic, and that such judgments show a high level of agreement with expert labels when aggregated.

3 Aims and Research Questions

With respect to our previous work by Roitero et al. [47], La Barbera et al. [27], and Roitero et al. [51], and similarly to Roitero et al. [52], we focus on claims about COVID-19, which are recent and interesting for the research community, and arguably deal with a more relevant/sensitive topic for the workers. We investigate whether the health domain makes a difference in the ability of crowd workers to identify and correctly classify (mis)information, and if the very recent nature of COVID-19 related statements has an impact as well. We focus on a single truthfulness scale, given the evidence that the scale used does not make a significant difference. Another important difference is that we ask the workers to provide a textual justification for their decision: we analyze these justifications to better understand the process followed by workers to verify information, and we investigate whether they can be exploited to derive useful information.

In addition to Roitero et al. [52], we perform a longitudinal study that includes 3 additional crowdsourcing experiments over a period of 4 months, thus collecting additional data and evidence that include novel responses from new and old crowd workers (see Footnote 6). The setup of each additional crowdsourcing experiment is the same as the one of Roitero et al. [52]. This longitudinal study is the focus of research questions RQ6–RQ8 below, which are a novel contribution of this paper. Finally, we also exploit and analyze worker behavior. We present this paper as an extension of our previous work [52] in order to be able to compare against it and make the whole paper self-contained and much easier to follow, improving the readability and overall quality of the paper thanks to its novel research contributions.

More in detail, we investigate the following specific Research Questions:

RQ1 Are the crowd-workers able to detect and objectively categorize online (mis)information related to the medical domain and more specifically to COVID-19? What are the relationship and agreement between the crowd and the expert labels?
RQ2 Can the crowdsourced and/or the expert judgments be transformed or aggregated in a way that improves the ability of workers to detect and objectively categorize online (mis)information?
RQ3 What is the effect of workers’ political bias and cognitive abilities?
RQ4 What are the signals provided by the workers while performing the task that can be recorded? To what extent are these signals related to workers’ accuracy? Can these signals be exploited to improve accuracy and, for instance, aggregate the labels in a more effective way?
RQ5 Which sources of information does the crowd consider when identifying online misinformation? Are some sources more useful? Do some sources lead to more accurate and reliable assessments by the workers?
RQ6 What is the effect of re-launching the experiment and re-collecting all the data at different time-spans? Are the findings from all previous research questions still valid?
RQ7 How does considering the judgments from workers who did the task multiple times change the findings of RQ6? Do they show any difference when compared to workers who did the task only once?
RQ8 Which are the statements for which the truthfulness assessment done by means of crowdsourcing fails? Which are the features and peculiarities of the statements that are misjudged by the crowd-workers?

4 Methods

In this section we present the dataset used to carry out our experiments (Section 4.1) and the crowdsourcing task design (Section 4.2). Overall, we considered one dataset annotated by experts, one crowdsourced dataset, one judgment scale (the same for the expert and the crowd judgments), and a total of 60 statements.

4.1 Dataset

We considered as primary source of information the PolitiFact dataset [65], which was built as a “benchmark dataset for fake news detection” [65] and contains over 12k statements produced during public appearances of US politicians. The statements of the dataset are labeled by expert judges on a six-level scale of truthfulness (from now on referred to as E6): pants-on-fire, false, mostly-false, half-true, mostly-true, and true. Recently, the PolitiFact website (the source from where the statements of the PolitiFact dataset are taken) created a specific section related to the COVID-19 pandemic.7 For this work, we selected 10 statements for each of the six PolitiFact categories, belonging to this COVID-19 section and with dates ranging from February 2020 to early April 2020. Appendix A contains the full list of the statements we used.

4.2 Crowdsourcing Experimental Setup

To collect our judgments we used the crowdsourcing platform Amazon Mechanical Turk (MTurk). Each worker, upon accepting our Human Intelligence Task (HIT), is assigned a unique pair of values (input token, output token). Such pair is used to uniquely identify each worker, who is then redirected to an external server in order to complete the HIT. The worker uses the input token to perform the accepted HIT. If s/he successfully completes the assigned HIT, s/he is shown the output token, which is used to submit the MTurk HIT and receive the payment, which we set to $1.5 for a set of 8 statements.8 The task itself is as follows: first, a (mandatory) questionnaire is shown to the worker, to collect his/her background information such as age and political views. The full set of questions and answers of the questionnaire can be found in Appendix B. Then, the worker needs to provide answers to three Cognitive Reflection Test (CRT) questions, which are used to measure the personal tendency to answer with an incorrect “gut” response or engage in further thinking to find the correct answer [18]. The CRT questionnaire and its answers can be found in Appendix C. After the questionnaire and CRT phase, the worker is asked to assess the truthfulness of 8 statements: 6 from the dataset described in Section 4.1 (one for each of the six considered PolitiFact categories) and 2 special statements called Gold Questions (one clearly true and the other clearly false), manually written by the authors of this paper and used as quality checks, as detailed below. We used a randomization process when building the HITs to avoid all possible sources of bias, both within each HIT and considering the overall task.

To assess the truthfulness of each statement, the worker is shown: the Statement, the Speaker/Source, and the Year in which the statement was made. We asked the worker to provide the following information: the truthfulness value for the statement using the six-level scale adopted by PolitiFact, from now on referred to as C6 (presented to the worker using a radio button containing the label description for each category as reported on the original PolitiFact website), a URL that s/he used as a source of information for the fact-checking, and a textual motivation for her/his response (which can not include the URL, and should contain at least 15 words). In order to prevent the user from using PolitiFact as primary source of evidence, we implemented our own search engine, which is based on the Bing Web Search APIs9 and filters out PolitiFact from the returned search results.

7 https://www.politifact.com/coronavirus/
8 Before deploying the task on MTurk, we investigated the average time spent to complete the task, and we related it to the minimum US hourly wage.
9 https://azure.microsoft.com/services/cognitive-services/bing-web-search-api/

We logged the user behavior using a logger like the one detailed by Han et al. [22, 21], and we implemented in the task the following quality checks: (i) the judgments assigned to the gold questions have to be coherent (i.e., the judgment of the clearly false question
should be lower than the one assigned to the true question); and (ii) the cumulative time spent to perform each judgment should be at least 10 seconds. Note that the CRT (and the questionnaire) answers were not used for the quality checks, although the workers were not aware of that.
    Overall, we used 60 statements in total, and each statement has been evaluated by 10 distinct workers. Thus, considering the main experiment, we deployed 100 MTurk HITs and we collected 800 judgments in total (600 judgments plus 200 gold question answers). Considering the main experiment and the longitudinal study all together, we collected over 4300 judgments from 542 workers, over a total of 7 crowdsourcing tasks. All the data used to carry out our experiments can be downloaded at https://github.com/KevinRoitero/crowdsourcingTruthfulness. The choice of having each statement evaluated by 10 distinct workers deserves a discussion; such a number is aligned with previous studies using crowdsourcing to assess truthfulness [47, 27, 51, 52] and other concepts like relevance [30, 48]. We believe this number is a reasonable trade-off between having fewer statements evaluated by many workers and more statements evaluated by fewer workers. An in-depth quantification of this trade-off requires further experiments and is therefore out of scope for this paper; we plan to address this matter in detail in future work.

5 Results and Analysis for the Main Experiment

We first report some descriptive statistics about the population of workers and the data collected in our main experiment (Section 5.1). Then, we address crowd accuracy (i.e., RQ1) in Section 5.2, transformation of truthfulness scales (RQ2) in Section 5.3, worker background and bias (RQ3) in Section 5.4, and worker behavior (RQ4) in Section 5.5; finally, we study information sources (RQ5) in Section 5.6. Results related to the longitudinal study (RQ6–8) are described and analyzed in Section 6.

5.1 Descriptive Statistics

5.1.1 Worker Background, Behavior, and Bias

Questionnaire. Overall, 334 workers resident in the United States participated in our experiment.10 In each HIT, workers were first asked to complete a demographics questionnaire with questions about their gender, age, education, and political views. By analyzing the questionnaire answers of the workers who successfully completed the experiment, we derived the following demographic statistics. The majority of workers are in the 26–35 age range (39%), followed by 19–25 (27%) and 36–50 (22%). The majority of the workers are well educated: 48% of them have a four-year college degree or a bachelor's degree, 26% have a college degree, and 18% have a postgraduate or professional degree. Only about 4% of workers have a high school degree or less. Concerning political views, 33% of workers identified themselves as liberal, 26% as moderate, 17% as very liberal, 15% as conservative, and 9% as very conservative. Moreover, 52% of workers identified themselves as Democrat, 24% as Republican, and 23% as Independent. Finally, 50% of workers disagreed with building a wall on the southern US border, and 37% of them agreed. Overall, we can say that our sample is well balanced.

CRT Test. Analyzing the CRT scores, we found that 31% of workers did not provide any correct answer, 34% answered 1 test question correctly, 18% answered 2 test questions correctly, and only 17% answered all 3 test questions correctly. We correlate the results of the CRT test with worker quality to answer RQ3.

Behavior (Abandonment). When considering the abandonment ratio (measured according to the definition provided by Han et al. [22]), we found that 100/334 workers (about 30%) successfully completed the task, 188/334 (about 56%) abandoned (i.e., voluntarily terminated the task before completing it), and 46/334 (about 14%) failed (i.e., terminated the task due to failing the quality checks too many times). Furthermore, 115/188 workers (about 61%) abandoned the task before judging the first statement (i.e., before really starting it).

5.2 RQ1: Crowd Accuracy

5.2.1 External Agreement

To answer RQ1, we start by analyzing the so-called external agreement, i.e., the agreement between the crowd-collected labels and the experts' ground truth. Figure 1 shows the agreement between the PolitiFact experts (x-axis) and the crowd judgments (y-axis). In the first plot, each point is a judgment by a worker on a statement, i.e., there is no aggregation of the workers working on the same statement. In the next plots all workers redundantly working on the same statement are ag-

10 Workers provide proof that they are based in the US and are eligible to work.
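The two quality checks described at the start of this section can be sketched as follows, assuming judgments on the 0–5 scale and per-statement timings in seconds (function names and data shapes are our assumptions, not the authors' code):

```python
# Sketch of the two quality checks, under assumed data shapes:
# gold_false / gold_true are the worker's 0-5 judgments for the clearly
# false and clearly true gold statements; times[i] is the cumulative
# number of seconds the worker spent on the i-th judgment.
MIN_SECONDS_PER_JUDGMENT = 10

def passes_quality_checks(gold_false, gold_true, times):
    # (i) coherence: the clearly false gold statement must receive a
    #     strictly lower judgment than the clearly true one
    coherent = gold_false < gold_true
    # (ii) time: each judgment must have taken at least 10 seconds
    enough_time = all(t >= MIN_SECONDS_PER_JUDGMENT for t in times)
    return coherent and enough_time

print(passes_quality_checks(1, 5, [30, 25, 12, 40, 18, 22, 15, 11]))  # True
print(passes_quality_checks(4, 2, [30, 25, 12, 40, 18, 22, 15, 11]))  # False
```

Note that ties on the gold pair fail check (i): a worker who rates the clearly false and clearly true statements identically is treated as incoherent.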
Fig. 1: The agreement between the PolitiFact experts and the crowd judgments. From left to right: C6 individual judgments; C6 aggregated with mean; C6 aggregated with median; C6 aggregated with majority vote.

gregated using the mean (second plot), median (third plot), and majority vote (right-most plot). If we focus on the first plot (i.e., the one with no aggregation function applied), we can see that, overall, the individual judgments are in agreement with the expert labels, as shown by the median values of the boxplots, which increase as the ground truth truthfulness level increases. Concerning the aggregated values, it is the case that for all the aggregation functions the pants-on-fire and false categories are perceived in a very similar way by the workers; this behavior was already shown in [51, 27], and suggests that workers indeed have clear difficulties in distinguishing between the two categories; this is even more evident considering that the interface presented to the workers contained a textual description of the categories' meaning on every page of the task.
    If we look at the plots as a whole, we see that within each plot the median values of the boxplots increase when going from pants-on-fire to true (i.e., going from left to right on the x-axis of each chart). This indicates that the workers are overall in agreement with the PolitiFact ground truth, and thus that workers are indeed capable of recognizing and correctly classifying misinformation statements related to the COVID-19 pandemic. This is a very important and not obvious result: in fact, the crowd (i.e., the workers) is the primary source and cause of the spread of disinformation and misinformation statements across social media platforms [8]. By looking at the plots, and in particular focusing on the median values of the boxplots, it appears evident that the mean (second plot) is the aggregation function which leads to the highest agreement levels, followed by the median (third plot) and the majority vote (right-most plot). Again, this behavior was already remarked in [51, 27, 49], and all the cited works used the mean as the primary aggregation function.
    To validate the external agreement, we measured the statistical significance of the differences between the aggregated ratings of all six PolitiFact categories; we considered both the Mann-Whitney rank test and the t-test, applying Bonferroni correction to account for multiple comparisons. Results are as follows: when considering adjacent categories (e.g., pants-on-fire and false), the differences between categories are never significant, for both tests and for all three aggregation functions. When considering categories at distance 2 (e.g., pants-on-fire and mostly-false), the differences are never significant, apart from the median aggregation function, where there is statistical significance at the p < .05 level in 2/4 cases for both the Mann-Whitney test and the t-test. When considering categories at distance 3, the differences are significant – for the Mann-Whitney test and the t-test respectively – in the following cases: for the mean, 3/3 and 3/3 cases; for the median, 2/3 and 3/3 cases; for the majority vote, 0/3 and 1/3 cases. When considering categories at distance 4 and 5, the differences are always significant at the p < .01 level for all the aggregation functions and for all the tests, apart from the majority vote function with the Mann-Whitney test, where the significance is at the p < .05 level. In the following we use the mean, as it is the most commonly used approach for this type of data [51].

5.2.2 Internal Agreement

Another standard way to address RQ1 and to analyze the quality of the work of the crowd is to compute the so-called internal agreement (i.e., the agreement among the workers). We measured the agreement with α [26] and Φ [7], two popular measures often used to compute workers' agreement in crowdsourcing tasks [47, 51, 31, 49]. Analyzing the results, we found that the overall agreement always falls in the [0.15, 0.3] range, and that the agreement levels measured with the two scales are very similar across the PolitiFact categories, with the only exception of Φ, which shows higher agreement levels for the mostly-true and true categories. This is confirmed by the fact that the α measure always falls in the Φ confidence interval, and the little oscillations in the agreement value are not always an indication of a real change in the agreement level, especially when considering α [7]. Nevertheless, Φ seems to
confirm the finding derived from Figure 1 that workers are most effective in identifying and categorizing statements with a higher truthfulness level. This remark is also supported by [7], which shows that Φ is better than α at distinguishing agreement levels in crowdsourcing, while α is more suited as a measure of data reliability in non-crowdsourced settings.

5.3 RQ2: Transforming Truthfulness Scales

Given the positive results presented above, it appears that the answer to RQ1 is overall positive, even if with some exceptions. There are many remarks that can be made: first, there is a clear issue that affects the pants-on-fire and false categories, which are very often misclassified by workers. Moreover, while PolitiFact used a six-level judgment scale, the usage of a two-level (e.g., True/False) or a three-level (e.g., False / In between / True) scale is very common when assessing the truthfulness of statements [27, 51]. Finally, categories can be merged together to improve accuracy, as done for example by Tchechmedjiev et al. [59]. All these considerations lead us to RQ2, addressed in the following.

5.3.1 Merging Ground Truth Levels

For all the above reasons, we performed the following experiment: we group together the six PolitiFact categories (i.e., E6) into three (referred to as E3) or two (E2) categories, which we denote respectively as 01, 23, and 45 for the three-level scale, and 012 and 345 for the two-level scale.

Fig. 2: The agreement between the PolitiFact experts and the crowd judgments. From left to right: C6 aggregated with mean; C6 aggregated with median; C6 aggregated with majority vote. First row: E6 to E3; second row: E6 to E2. Compare with Figure 1.

    Figure 2 shows the result of such a process. As we can see from the plots, the agreement between the crowd and the expert judgments emerges more clearly. As in Figure 1, the median values of all the boxplots increase when going towards higher truthfulness values (i.e., going from left to right within each plot); this holds for all the aggregation functions considered, and it is valid for both transformations of the E6 scale, into three and into two levels. Also in this case we computed the statistical significance between categories, applying the Bonferroni correction to account for multiple comparisons. Results are as follows. For the case of three groups, the differences between categories at distance 1 and at distance 2 are always significant at the p < .01 level, for both the Mann-Whitney test and the t-test, for all three aggregation functions. The same behavior holds for the case of two groups, where the categories at distance 1 are always significantly different at the p < .01 level.
    Summarizing, we can now conclude that by merging the ground truth levels we obtained a much stronger signal: the crowd can effectively detect and classify misinformation statements related to the COVID-19 pandemic.

5.3.2 Merging Crowd Levels

Having reported the results of merging the ground truth categories, we now turn to transforming the crowd labels (i.e., C6) into three (referred to as C3) and two (C2) categories. For the transformation process we rely on the approach detailed by Han et al. [23]. This approach has many advantages [23]: we can simulate the effect of having the crowd answers on a more coarse-grained scale (rather than C6), and thus we can simulate new data without running the whole experiment on MTurk again. As we did for the ground truth scale, we choose the two- and three-level scales as targets, driven by the same motivations. Having selected C6 as the source scale, and the three- and two-level scales (C3 and C2) as the target scales, we perform the following experiment. We perform all the possible cuts11 from C6 to C3 and from C6 to C2; then, we measure the internal agreement (using α and Φ) both on the source and on the target scale, and we compare those values. In such a way, we are able to identify, among all the possible cuts, the cut which leads to the highest possible internal agreement.
    We found that, for the C6 to C3 transformation, both for α and Φ there is a single cut which leads to higher agreement levels than the original C6 scale. On the contrary, for the C6 to C2 transformation, we found

11 C6 can be transformed into C3 in 10 different ways, and C6 can be transformed into C2 in 5 different ways.
Fig. 3: Comparison with E6. C6 to C3 (first two plots) and to C2 (last two plots), then aggregated with the mean function. Best cut selected according to α (first and third plot) and Φ (second and fourth plot). Compare with Figure 1.

Fig. 4: C6 to C3 (first two plots) and to C2 (last two plots), then aggregated with the mean function. First two plots: E6 to E3. Last two plots: E6 to E2. Best cut selected according to α (first and third plots) and Φ (second and fourth plots). Compare with Figures 1, 2, and 3.

that there is a single cut for α which leads to agreement levels similar to the original C6 scale, and there are no cuts with such a property when using Φ. Having identified the best possible cuts for both transformations and for both agreement metrics, we now measure the external agreement between the crowd and the expert judgments, using the selected cut.
    Figure 3 shows such a result when considering the judgments aggregated with the mean function. As we can see from the plots, it is again the case that the median values of the boxplots are always increasing, for all the transformations. Nevertheless, inspecting the plots we can state that the overall external agreement appears to be lower than the one shown in Figure 1. Moreover, we can state that even using these transformed scales the categories pants-on-fire and false are still not separable. Summarizing, we show that it is feasible to transform the judgments collected on the C6 scale into two new scales, C3 and C2, obtaining judgments with an internal agreement similar to the original ones, and with a slightly lower external agreement with the expert judgments.

5.3.3 Merging both Ground Truth and Crowd Levels

It is now natural to combine the two approaches. Figure 4 shows the comparison between C6 transformed into C3 and C2, and E6 transformed into E3 and E2. As we can see from the plots, also in this case the median values of the boxplots are increasing, especially for the E3 case (first two plots). Furthermore, the external agreement with the ground truth is present, even if for the E2 case (last two plots) the classes appear not to be separable. Summarizing, all these results show that it is feasible to successfully combine the aforementioned approaches, and to transform both the crowd and the expert judgments into three- and two-level scales.

5.4 RQ3: Worker Background and Bias

To address RQ3 we study whether the answers to the questionnaire and to the CRT test have any relation with worker quality. Previous work has shown that political and personal biases, as well as cognitive abilities, have an impact on worker quality [27, 51]; recent articles have shown that the same effect might also apply to fake news [60]. For this reason, we think it is reasonable to investigate whether workers' political biases and cognitive abilities influence their quality in the setting of misinformation related to COVID-19.
    When looking at the questionnaire answers, we found a relation with worker quality only when considering the answers about workers' political views (see Appendix B for questions and answers). In more detail, using Accuracy (i.e., the fraction of exactly classified statements) we measured the quality of workers in each group. The number and fraction of correctly classified statements are, however, rather crude measures of worker quality, as small misclassification errors (e.g., pants-on-fire in place of false) count as much as more striking ones (e.g., pants-on-fire in place of true). Therefore, to measure the ability of workers to correctly classify the
Table 1: Statement position in the task versus: time elapsed, cumulative on each single statement (first row), CEMORD (second row), number of queries issued (third row), and number of times the statement has been used as a query (fourth row). The total and average number of queries is respectively 2095 and 262, while the total and average number of statements used as a query is respectively 245 and 30.6.

Statement Position       1      2      3      4      5      6      7      8
Time (sec)             299    282    218    216    223    181    190    180
CEMORD                 .63   .618   .657   .611   .614   .569   .639   .655
Number of Queries      352    280    259    255    242    238    230    230
                     16.8%  13.4%  12.4%  12.1%  11.6%  11.3%  11.0%  11.4%
Statement as Query      22     32     31     33     34     30     29     34
                        9%    13%  12.6%  13.5%  13.9%  12.2%  11.9%  13.9%

statements, we also compute the Closeness Evaluation Measure (CEMORD), an effectiveness metric recently proposed for the specific case of ordinal classification [1] (see Roitero et al. [51, Sect. 3.3] for a more detailed discussion of these issues). The Accuracy and CEMORD values are respectively 0.13 and 0.46 for 'Very conservative', 0.21 and 0.51 for 'Conservative', 0.20 and 0.50 for 'Moderate', 0.16 and 0.50 for 'Liberal', and 0.21 and 0.51 for 'Very liberal'. By looking at both Accuracy and CEMORD, it is clear that 'Very conservative' workers provide lower-quality labels. The Bonferroni-corrected two-tailed t-test on CEMORD confirms that 'Very conservative' workers perform statistically significantly worse than both 'Conservative' and 'Very liberal' workers. The workers' political views thus affect the CEMORD score, even if in a small way and mainly at the extremes of the scale. An initial analysis of the other answers to the questionnaire (not shown) does not seem to provide strong signals; a more detailed analysis is left for future work.
    We also investigated the effect of the CRT tests on worker quality. Although there is a small variation

5.5.1 Time and Queries

Table 1 (first two rows) shows the amount of time spent on average by the workers on the statements, and their CEMORD score. As we can see, the time spent on the first statement is considerably higher than on the last statements, and overall the time spent by the workers almost monotonically decreases as the statement position increases. This, combined with the fact that the quality of the assessments provided by the workers (measured with CEMORD) does not decrease for the last statements, is an indication of a learning effect: the workers learn how to assess truthfulness in a faster way.
    We now turn to queries. Table 1 (third and fourth rows) shows query statistics for the 100 workers who finished the task. As we can see, the higher the statement position, the lower the number of queries issued: 3.52 queries on average for the first statement, down to 2.30 for the last statement. This can indicate the tendency of workers to issue fewer queries the more time they spend on the task, probably due to fatigue, boredom, or learning effects. Nevertheless, we can see that on average, for all the statement positions, each worker issues more than one query: workers often reformulate their initial query. This provides further evidence that they put effort into performing the task, and suggests the overall high quality of the collected judgments. The fourth row of the table shows the number of times the workers used the whole statement as a query. We can see that the percentage is rather low (around 13%) for all the statement positions, indicating again that workers put effort into providing their judgments.

5.5.2 Exploiting Worker Signals to Improve Quality

We have shown that, while performing their task, workers provide many signals that to some extent correlate with the quality of their work. These signals could in principle be exploited to aggregate the individual judgments in a more effective way (i.e., giving more weight
in both Accuracy and CEMORD (not shown), this is never               to workers that possess features indicating a higher
statistically significant; it appears that the number of             quality). For example, the relationships between worker
correct answers to the CRT tests is not correlated with              background / bias and worker quality (Section 5.4) could
worker quality. We leave for future work a more detailed             be exploited to this aim.
study of this aspect.                                                    We thus performed the following experiment: we
                                                                     aggregated C6 individual scores, using as aggregation
                                                                     function a weighted mean, where the weights are ei-
                                                                     ther represented by the political views, or the number
                                                                     of correct answers to CRT, both normalized in [0.5, 1].
5.5 RQ4: Worker Behavior                                             We found a very similar behavior to the one observed
                                                                     for the second plot of Figure 1; it seems that leveraging
We now turn to RQ4, and analyze the behavior of the                  quality-related behavioral signals, like questionnaire an-
workers.                                                             swers or CRT scores, to aggregate results does not pro-
Can the Crowd Judge Truthfulness? A longitudinal Study                                                                                                                         11

        1
        2                                       URL                          %                     0.5                                                      1.0
        3                                       snopes.com               11.79%
        4                                       msn.com                   8.93%                    0.4                                                      0.8

                                                                                                                                                                  Cumulative
                                                                                      Percentage
        5                                       factcheck.org             6.79%
rank

        6                                       wral.com                  6.79%                    0.3                                    copied            0.6
        7                                       usatoday.com              5.36%                                                           no copied
        8
                                                statesman.com             4.64%
                                                reuters.com               4.64%                    0.2                                                      0.4
        9
                                                cdc.gov                   4.29%
       10
                                                mediabiasfactcheck.com    4.29%                    0.1                                                      0.2
            0.0   0.1     0.2       0.3   0.4   businessinsider.com       3.93%
                        frequency
                                                                                                   0.0                                                      0.0
                                                                                                          0        1     2       3       4          5
Fig. 5: On the left, distribution of the ranks of the URLs                                                         Judgment absolute error

selected by workers, on the right, websites from which
                                                                                                   0.4
workers chose URLs to justify their judgments.                                                                                                      copied
                                                                                                                                                    no copied
                                                                                                   0.3

                                                                                     Percentage
vide a noticeable increase in the external agreement,
although it does not harm.                                                                         0.2
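The weighted aggregation described above can be sketched in a few lines. This is an illustrative implementation, not the authors' code: the function names are hypothetical, the judgments are assumed to lie on a numeric truthfulness scale, and the per-worker quality signal (e.g., the number of correct CRT answers) is min-max normalized into [0.5, 1] as described in the text.

```python
def normalize_weights(signal, lo=0.5, hi=1.0):
    """Min-max normalize a per-worker quality signal (e.g., CRT score) into [lo, hi]."""
    s_min, s_max = min(signal), max(signal)
    if s_max == s_min:
        # All workers carry the same signal: fall back to uniform weights.
        return [hi] * len(signal)
    return [lo + (s - s_min) * (hi - lo) / (s_max - s_min) for s in signal]

def weighted_aggregate(judgments, weights):
    """Weighted mean of the individual judgments collected for one statement."""
    return sum(w * j for w, j in zip(weights, judgments)) / sum(weights)

# Three workers judged the same statement on a 0-5 truthfulness scale;
# their CRT scores (0-3 correct answers) act as the quality signal.
weights = normalize_weights([0, 2, 3])          # [0.5, 0.833..., 1.0]
aggregated = weighted_aggregate([1, 4, 5], weights)
```

The [0.5, 1] range keeps even the lowest-signal worker contributing half as much as the highest one, so no judgment is discarded outright; setting all weights equal recovers the plain (unweighted) mean.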

5.6 RQ5: Sources of Information

We now turn to RQ5, and analyze the sources of information used by the workers while performing the task.

5.6.1 URL Analysis

Figure 5 shows, on the left, the distribution of the ranks of the URLs selected as evidence by the workers when performing each judgment. URLs selected less than 1% of the times are filtered out from the results. As we can see from the plot, about 40% of the workers selected the first result retrieved by our search engine, and they selected the remaining positions less frequently, with an almost monotonically decreasing frequency (rank 8 being the exception). We also found that 14% of the workers inspected up to the fourth page of results (i.e., rank = 40). The breakdown over the PolitiFact truthfulness categories does not show any significant difference.

Figure 5 shows, on the right, the top 10 websites from which the workers chose the URL to justify their judgments. Websites with a percentage ≤ 3.9% are filtered out. As we can see from the table, there are many fact-checking websites among the top 10 (e.g., snopes: 11.79%, factcheck: 6.79%). Furthermore, medical websites are present (cdc: 4.29%). This indicates that workers use various kinds of sources as URLs from which they take information. Thus, it appears that they put effort into finding evidence to provide a reliable truthfulness judgment.

5.6.2 Justifications

As a final result, we analyze the textual justifications provided, their relations with the web pages at the selected URLs, and their links with worker quality. 54% of the provided justifications contain text copied from the web page at the URL selected for evidence, while 46% do not. Furthermore, 48% of the justifications include some “free text” (i.e., text generated and written by the worker), and 52% do not. Considering all the possible combinations, 6% of the justifications used both free text and text from the web page, 42% used free text but no text from the web page, 48% used no free text but only text from the web page, and finally 4% used neither free text nor text from the web page, and either inserted text from a different (not selected) web page, or inserted part of the instructions we provided or text from the user interface.

Concerning the preferred way to provide justifications, each worker seems to have a clear attitude: 48% of the workers used only text copied from the selected web pages, 46% used only free text, 4% used both, and 2% consistently provided text coming from the user interface or from random internet pages.

Fig. 6: Effect of the origin of a justification on: the absolute value of the prediction error (top; cumulative distributions shown with thinner lines and empty markers), and the prediction error (bottom). Text copied/not copied from the selected URL. (Top panel: x-axis “Judgment absolute error”, 0-5; bottom panel: x-axis “Judgment error”, -5 to +5; y-axes report percentages, series “copied” vs. “not copied”.)

We now correlate such behavior with the workers’ quality. Figure 6 shows the relations between the different kinds of justifications and the worker accuracy. The plots show the absolute value of the prediction error on the top, and the prediction error on the bottom. The figure distinguishes whether the text inserted by the worker was copied from the selected web page; we performed the same analysis considering whether the worker used free text, but the results were almost identical to the former analysis. As we can see from the plot, statements on which workers make fewer errors (i.e., where x-axis = 0) tend to use text copied from the selected web page. On the contrary, statements on which workers make more errors (values close to 5 in the top plot, and values close to ±5 in the bottom plot) tend to use text not copied from the selected web page. The differences are small, but they might be an indication that workers of higher quality tend to read the text from the selected web page and report it in the justification box. To confirm this result, we computed the CEMORD scores for the two classes considering the individual judgments: the class “copied” has CEMORD = 0.640, while the class “not copied” has a lower value, CEMORD = 0.600.

We found that this behavior is consistent with respect to the usage of free text: statements on which workers make fewer errors tend to use more free text than the ones on which they make more errors. This is an indication that workers who add free text as a justification, possibly reworking the information present in the selected URL, are of a higher quality. In this case the CEMORD measure confirms that the two classes are very similar: the class “free text” has CEMORD = 0.624, while the class “not free text” has CEMORD = 0.621.

By looking at the bottom part of Figure 6 we can see that the distribution of the prediction error is not symmetrical, as the frequency of the errors is higher on the positive side of the x-axis ([0, 5]). These errors correspond to workers overestimating the truthfulness value of the statement (with 5 being the result of labeling a pants-on-fire statement as true). It is also noticeable that the justifications containing text copied from the selected URL have a lower rate of errors in the negative range, meaning that workers who directly quote the text avoid underestimating the truthfulness of the statement.

6 Variation of the Judgments over Time

To perform repeated observations of the crowd annotating misinformation (i.e., to carry out a longitudinal study) with different sets of workers, we re-launched the HITs of the main experiment three subsequent times, each of them one month apart. In this section we detail how the data was collected (Section 6.1) and the findings derived from our analysis to answer RQ6 (Section 6.2), RQ7 (Section 6.3), and RQ8 (Section 6.4).

Table 2: Experimental setting for the longitudinal study. All dates refer to 2020. Values reported are absolute numbers.

                                  Number of Workers
 Date     Acronym            Batch1  Batch2  Batch3  Batch4  Total
 May      Batch1                100       –       –       –    100
 June     Batch2                  –     100       –       –    100
          Batch2from1            29       –       –       –     29
 July     Batch3                  –       –     100       –    100
          Batch3from1            22       –       –       –     22
          Batch3from2             –      20       –       –     20
          Batch3from1or2         22      20       –       –     42
 August   Batch4                  –       –       –     100    100
          Batch4from1            27       –       –       –     27
          Batch4from2             –      11       –       –     11
          Batch4from3             –       –      33       –     33
          Batch4from1or2or3      27      11      33       –     71
          Batchall              100     100     100     100    400

6.1 Experimental Setting

The longitudinal study is based on the same dataset (see Section 4.1) and experimental setting (see Section 4.2) as the main experiment. The crowdsourced judgments were collected as follows. The data for the main experiment (from now on denoted Batch1) was collected in May 2020. In June 2020 we re-launched the HITs from Batch1 with a novel set of workers (i.e., we prevented the workers of Batch1 from performing the experiment again); we denote this set of data as Batch2. In July 2020 we collected an additional batch: we re-launched the HITs from Batch1 with novice workers (i.e., we prevented the workers of Batch1 and Batch2 from performing the experiment again); we denote this set of data as Batch3. Finally, in August 2020 we re-launched the HITs from Batch1 for the last time, preventing the workers of the previous batches from performing the experiment, and collected the data for Batch4. Then, we considered an additional set of experiments: for a given batch, we contacted the workers from the previous batches, sending them a $0.01 bonus and asking them to perform the task again. We obtained the datasets detailed in Table 2, where BatchXfromY denotes the subset of workers that performed BatchX and had previously participated in BatchY. Note that an experienced (returning) worker who does the task for the second time generally gets a new HIT assigned, i.e., a HIT different from the one performed originally; we have no control over this, since HITs are assigned to workers by the MTurk platform. Finally, we also considered the union of the data
from Batch1, Batch2, Batch3, and Batch4; we denote this dataset with Batchall.

Table 3: Abandonment data for each batch of the longitudinal study.

                        Number of Workers
 Acronym    Complete      Abandon        Fail        Total
 Batch1     100 (30%)     188 (56%)     46 (14%)      334
 Batch2     100 (37%)     129 (48%)     40 (15%)      269
 Batch3     100 (23%)     220 (51%)    116 (26%)      436
 Batch4     100 (36%)     124 (45%)     54 (19%)      278
 Average    100 (31%)     165 (50%)     64 (19%)     1317

6.2 RQ6: Repeating the Experiment with Novice Workers

6.2.1 Worker Background, Behavior, Bias, and Abandonment

We first studied the variation in the composition of the worker population across the different batches. To this aim, we used a Generalized Linear Mixed Model (GLMM) [32] together with the Analysis of Variance (ANOVA) [35] to analyze how worker behavior changes across batches, and we measured the impact of such changes. In more detail, we considered the ANOVA effect size ω², an unbiased index used to provide insights into the population-wide relationship between a set of factors and the studied outcomes [15, 14, 70, 50, 16]. With this setting, we fitted a linear model with which we measured the effect of age, school, and all the other possible answers to the questions in the questionnaire (B) on individual judgment quality, measured as the absolute distance between the worker judgments and the expert ones, i.e., the mean absolute error (MAE). By inspecting the ω² index, we found that, while all the effects are either small or absent [40], the largest effects are provided by the workers’ answers to the taxes and southern border questions. We also found that the effect of the batch is small but not negligible, and is of the same order of magnitude as the effects of the other factors. We also computed the interaction plots (see for example [20]) considering the variation of the factors from the previous analysis across the different batches. The results suggest a small or non-significant [13] interaction between the batch and all the other factors. This analysis suggests that, while differences among the batches are present, the population of workers that performed the task is homogeneous, and thus the different datasets (i.e., batches) are comparable.

Table 3 shows the abandonment data for each batch of the longitudinal study, indicating the number of workers who completed, abandoned, or failed the task (due to failing the quality checks). Overall, the abandonment ratio is quite well balanced across batches, with the only exception of Batch3, which shows a small increase in the number of workers who failed the task; nevertheless, this small variation is not significant and might be caused by a slightly lower quality of the workers who started Batch3. On average, Table 3 shows that 31% of the workers completed the task, 50% abandoned it, and 19% failed the quality checks; these values are aligned with previous studies (see Roitero et al. [51]).

6.2.2 Agreement across Batches

We now turn to studying the quality of both individual and aggregated judgments across the different batches. Measuring the correlation between individual judgments, we found rather low values: the correlation between Batch1 and Batch2 is ρ = 0.33 and τ = 0.25; between Batch1 and Batch3 it is ρ = 0.20 and τ = 0.14; between Batch1 and Batch4 it is ρ = 0.10 and τ = 0.074; between Batch2 and Batch3 it is ρ = 0.21 and τ = 0.15; between Batch2 and Batch4 it is ρ = 0.10 and τ = 0.085; finally, between Batch3 and Batch4 it is ρ = 0.08 and τ = 0.06.

Overall, the most recent batch (Batch4) achieves the lowest correlation values w.r.t. the other batches, followed by Batch3. The highest correlation is achieved between Batch1 and Batch2. This preliminary result suggests that the time span in which we collected the judgments of the different batches has an impact on judgment similarity across batches: batches launched in time spans close to each other tend to be more similar than the others. We now turn to analyzing the aggregated judgments, to study whether this relationship still holds when individual judgments are aggregated. Figure 7 shows the agreement between the aggregated judgments of Batch1, Batch2, Batch3, and Batch4. The plot shows on the diagonal the distribution of the aggregated judgments, in the lower triangle the scatterplots between the aggregated judgments of the different batches, and in the upper triangle the corresponding ρ and τ correlation values. The plots show that the correlation values of the aggregated judgments are greater than the ones measured for the individual judgments. This is consistent across all the batches. In more detail, we can see that the agreement between Batch1 and Batch2 (ρ = 0.87, τ = 0.68) is greater than the agreement between any other pair of batches; we also see that