Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims

Miguel Arana-Catania1,2, Elena Kochkina3,2, Arkaitz Zubiaga3, Maria Liakata3,2, Rob Procter1,2, Yulan He1,2
1 Department of Computer Science, University of Warwick, UK
2 The Alan Turing Institute, UK
3 Queen Mary University of London, UK

arXiv:2205.02596v1 [cs.CL] 5 May 2022
Abstract

We present a comprehensive work on automated veracity assessment from dataset creation to developing novel methods based on Natural Language Inference (NLI), focusing on misinformation related to the COVID-19 pandemic. We first describe the construction of the novel PANACEA dataset consisting of heterogeneous claims on COVID-19 and their respective information sources. The dataset construction includes work on retrieval techniques and similarity measurements to ensure a unique set of claims. We then propose novel techniques for automated veracity assessment based on Natural Language Inference, including graph convolutional networks and attention-based approaches. We have carried out experiments on evidence retrieval and veracity assessment on the dataset using the proposed techniques, found them competitive with SOTA methods, and provide a detailed discussion.

1 Introduction

In recent years, and particularly with the emergence of the COVID-19 pandemic, significant efforts have been made to detect misinformation online with the aim of mitigating its impact. With this objective, researchers have proposed numerous approaches and released datasets that can help with the advancement of research in this direction.

Most existing datasets (D'Ulizia et al., 2021) focus on a single medium (e.g., Twitter, Facebook, or specific websites), a unique information domain (e.g., health information, general news, or scholarly papers), a type of information (e.g., general claims or news), or a specific application (e.g., verifying claims, or retrieving useful information). This inevitably results in a limited focus on what is a complex, multi-faceted phenomenon. With the aim of furthering research in this direction, the contributions of our work are twofold: (1) creating a new comprehensive dataset of misinformation claims, and (2) introducing two novel approaches to veracity assessment.

In the first part of our work, we contribute to the global effort on addressing misinformation in the context of COVID-19 by creating a dataset for PANdemic Ai Claim vEracity Assessment, called the PANACEA dataset. It is a new dataset that combines different data sources with different foci, thus enabling a comprehensive approach that combines different media, domains and information types. To this end our dataset brings together a heterogeneous set of True and False COVID claims and online sources of information for each claim. The collected claims have been obtained from online fact-checking sources, existing datasets and research challenges. We have identified a large overlap of claims between different sources and even within each source or dataset. Thus, given the challenges of aggregating multiple data sources, much of our effort in dataset construction has focused on eliminating repeated claims. Distinguishing between different formulations of the same claim and nuanced variations that include additional information is a challenging task. Our dataset is presented in a large and a small version, accounting for different degrees of such similarity. Finally, the homogenisation of datasets and information media has presented an additional challenge, since fact-checkers use different criteria for labelling the claims, requiring a specific review of the different kinds of labels in order to combine them.

In the second part of our work, we propose NLI-SAN and NLI-graph, two novel veracity assessment approaches for automated fact-checking of the claims. Our proposed approaches are centred around the use of Natural Language Inference (NLI) and contextualised representations of the claims and evidence. NLI-SAN combines the inference relation between claims and evidence with attention techniques, while NLI-graph builds on graphs considering the relationship between all the different pieces of evidence and the claim.
Specifically, we make the following contributions:

• We describe the development of a comprehensive COVID fact-checking dataset, PANACEA, as a result of aggregating and de-duplicating a set of heterogeneous data sources. The dataset is available on the project website1, as well as a fully operational search platform to find and verify COVID-19 claims implementing the proposed approaches.
• We propose two novel approaches to claim verification, NLI-SAN and NLI-graph.
• We perform an evaluation of both evidence retrieval and the application of our proposed veracity assessment methods on our constructed dataset. Our experiments show that NLI-SAN and NLI-graph have state-of-the-art performance on our dataset, beating GEAR (Zhou et al., 2019) and matching KGAT (Liu et al., 2020). We discuss challenging cases and provide ideas for future research directions.

2 Related Work

COVID-19 and misinformation datasets. Comprehensive information on COVID-19 datasets is provided in Appendix A. Such datasets include the CoronaVirusFacts/DatosCoronaVirus Alliance Database, the largest existing collection of COVID claims and the largest existing network of journalists working together on COVID misinformation, an essential reference for our work; COVID-19-TweetIDs (Chen et al., 2020), the largest dataset of COVID tweets with more than 1 billion tweets; and CORD-19: The COVID-19 Open Research Dataset (Wang et al., 2020a), the largest downloadable set of scholarly articles on the pandemic with nearly 200,000 articles. General misinformation datasets linked to our verification work include: Emergent (Ferreira and Vlachos, 2016), a collection of 300 claims labelled by journalists; LIAR (Wang, 2017) with 12,836 statements from PolitiFact with detailed justifications; FakeNewsNet (Shu et al., 2020), collecting not only claims from news content, but also social context and spatio-temporal information; NELA-GT-2018 (Nørregaard et al., 2019) with 713,534 articles from 194 news outlets; FakeHealth (Dai et al., 2020), collecting information from HealthNewsReview, a project critically analysing claims about health care interventions; PUBHEALTH (Kotonya and Toni, 2020) with 11,832 claims related to health topics; FEVER (Thorne et al., 2018a), as well as its later versions FEVER 2.0 (Thorne et al., 2018b) and FEVEROUS (Aly et al., 2021), containing claims based on Wikipedia and therefore constituting a well-defined, informative and non-duplicated information corpus; and SciFact (Wadden et al., 2020), from a very different domain, containing 1,409 scientific claims. Our dataset is a real-world dataset bringing together heterogeneous sources, domains and information types.

Approaches to claim veracity assessment. We employ our dataset for automated fact-checking and veracity assessment (Zeng et al., 2021). Researchers such as Hanselowski et al. (2018); Yoneda et al. (2018); Luken et al. (2018); Soleimani et al. (2020); Pradeep et al. (2021) analysed the veracity relation between the claim and each piece of evidence independently, combining this information later. Other authors considered multiple pieces of evidence together (Thorne et al., 2018a; Nie et al., 2019; Stammbach and Neumann, 2019). Different pieces of evidence have previously been combined using graph neural networks (Zhou et al., 2019; Liu et al., 2020; Zhong et al., 2020). Many of these authors have centred their techniques on the use of NLI (Chen et al., 2017; Ghaeini et al., 2018; Parikh et al., 2016; Li et al., 2019) to verify the claim. In our work we also make use of NLI results of claim-evidence pairs, but propose alternative approaches built on a self-attention network and a graph convolutional network for veracity assessment.

3 Dataset Construction

This section describes our dataset construction by selecting COVID-19 related data sources (§3.1) and applying information retrieval and re-ranking techniques to remove duplicate claims (§3.2).

3.1 Data Sources

We first identified a set of COVID-19 related data sources to build our dataset. Our aim is to have the largest compilation of non-overlapping, labelled and verified claims from different media and information domains (Twitter, Facebook, general websites, academia), and used for different applications (media reporting, veracity evaluation, information retrieval challenges, etc.). We have included any large dataset or medium, to our knowledge, related to that objective that includes claims together with their information sources. The data sources identified are shown in Table 1. More details and pre-processing steps are presented in Appendix A. By processing and combining these sources we obtained 20,689 initial claims.

1 https://panacea2020.github.io/ ; https://doi.org/10.5281/zenodo.6493847
Data Source | Description | Domain | No. of claims (False / True)
CoronaVirusFacts Database | Published by Poynter, this online source combines fact-checking articles from more than 100 fact-checkers from all over the world, being the largest journalist fact-checking collaboration on the topic worldwide. | Heterogeneous | 11,647 (11,647 / 0)
CoAID dataset (Cui and Lee, 2020) | This contains fake news from fact-checking websites and real news from health information websites, health clinics, and public institutions. | News | 5,485 (953 / 4,532)
MM-COVID (Li et al., 2020) | This multilingual dataset contains fake and true news collected from Poynter and Snopes. | News | 3,409 (2,035 / 1,374)
CovidLies (Hossain et al., 2020) | This contains a curated list of common misconceptions about COVID appearing in social media, carefully reviewed to contain very relevant and unique claims. | Social media | 62 (62 / 0)
TREC Health Misinformation track | Research challenge using claims on the health domain focused on information retrieval from general websites through the Common Crawl corpus (commoncrawl.org). | General websites | 46 (39 / 7)
TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) | Research challenge using claims on the health domain focused on information retrieval from scholarly peer-reviewed journals through the CORD-19 dataset (Wang et al., 2020a), the largest existing compilation of COVID-related articles. | Scholarly papers | 40 (3 / 37)

Table 1: Data sources used for the construction of our dataset. The last column shows the number of claims before de-duplication.

3.2 Claim De-duplication

We processed claims and removed: exact duplicates; claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; entries with claims or fact-checking sources in languages other than English.

The similarity of claims was then analysed using: BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) and BM25 with MonoT5 re-ranking (Nogueira et al., 2020). BM25 is a commonly-used ranking function that estimates the relevance of documents to a given query. MonoT5 uses a T5 model trained using as input the template 'Query:[query] Document:[doc] Relevant:', fine-tuned to produce as output the token 'True' or 'False'. A softmax layer applied to those tokens gives the respective relevance probabilities. These methods are used to identify not only claims similar in content, but also distinct claims that are sufficiently relevant when searching for information about them. This ensures that the claims presented are unique, and avoids overlap between training and testing cases when using the data to train veracity assessment models. These methods were carried out using Pyserini2 and PyGaggle3. The set of claims was indexed and a search was performed for each of the claims to detect similar claims. We created two versions of the dataset by varying the similarity threshold between claims. The LARGE dataset excludes claims with a 90% probability of being similar, while in the SMALL dataset the probability is increased to 99%, as obtained through the MonoT5 model. These thresholds were chosen empirically by manual inspection of the results with simultaneous consideration of the efficiency of the method.

As a further assessment of the uniqueness of the claims, we evaluated the de-duplication process using BERTScore4 (Zhang et al., 2019) on the resulting datasets. We used the linked code with a RoBERTa-large model with baseline rescaling. We compared each claim with all the other claims in the dataset and kept the score of the most similar match. The mean and standard deviation, and the 90th percentile of claim similarity values are shown in the upper part of Table 3. The average claim similarity has been drastically reduced in the LARGE dataset compared to the original dataset and further reduced in the SMALL dataset.

2 https://github.com/castorini/pyserini
3 https://github.com/castorini/pygaggle
4 https://github.com/Tiiiger/bert_score
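As an illustration of this two-stage de-duplication step, the sketch below indexes the claims with Pyserini, retrieves BM25 candidates for one claim and re-scores them with a PyGaggle MonoT5 re-ranker. The index path, the `claim_texts` mapping, the `near_duplicates` helper and the default MonoT5 checkpoint are illustrative assumptions rather than the exact setup used.

```python
import math
from pyserini.search.lucene import LuceneSearcher   # SimpleSearcher in older Pyserini releases
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

# Hypothetical index location and threshold; the claims are assumed to have been
# indexed beforehand with Pyserini's JSON collection indexer.
INDEX_DIR = "indexes/claims"
DUP_THRESHOLD = 0.90          # 0.90 for the LARGE version, 0.99 for SMALL

searcher = LuceneSearcher(INDEX_DIR)
reranker = MonoT5()           # loads a default MonoT5 relevance-ranking checkpoint

def near_duplicates(claim_id, claim_text, claim_texts, k=20):
    """Return ids of other claims whose MonoT5 relevance probability exceeds the threshold.

    claim_texts: dict mapping claim id -> claim text (built when creating the index)."""
    # Stage 1: BM25 candidate retrieval over the claim index.
    hits = searcher.search(claim_text, k=k)
    candidates = [Text(claim_texts[h.docid], {"docid": h.docid}, 0)
                  for h in hits if h.docid != claim_id]
    # Stage 2: MonoT5 re-ranking; scores are log-probabilities of the 'True' (relevant) token.
    reranked = reranker.rerank(Query(claim_text), candidates)
    return [t.metadata["docid"] for t in reranked if math.exp(t.score) >= DUP_THRESHOLD]

# The BERTScore check described above can be run with the bert-score package, e.g.:
#   from bert_score import score
#   P, R, F1 = score(claims, most_similar_claims, lang="en", rescale_with_baseline=True)
```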
Claim 1: Losing your sense of smell may be an early symptom of COVID-19.
Excluded from LARGE and SMALL:
  Loss of smell may suggest milder COVID-19.
Excluded from SMALL only:
  Loss of smell and taste validated as COVID-19 symptoms in patients with high recovery rate.

Claim 2: COVID-19 hitting some African American communities harder.
Excluded from LARGE and SMALL:
  The African American community is being hit hard by COVID-19.
Excluded from SMALL only:
  COVID-19 impacts in African-Americans are different from the rest of the U.S. population.

Table 2: Claim de-duplication examples.

Category   | Orig.       | LARGE       | SMALL
Similarity | 0.67 ± 0.23 | 0.43 ± 0.13 | 0.37 ± 0.14
η.90       | 0.99        | 0.60        | 0.56
False      | 14,739      | 1,810       | 477
True       | 5,950       | 3,333       | 1,232
Total      | 20,689      | 5,143       | 1,709

Table 3: The average claim similarity values and the PANACEA LARGE and SMALL dataset statistics. η.90 denotes the 90th percentile value.

To illustrate the difference between the two versions of the dataset, we present some examples of claims in Table 2. For Claim 1, the semantically similar claim 'Loss of smell may suggest milder COVID-19' is identified and excluded from both the LARGE and SMALL datasets. But the claim 'Loss of smell and taste validated as COVID-19 symptoms in patients with high recovery rate', which includes mentions of another symptom and the recovery rate, is only excluded from the SMALL dataset. For Claim 2, the rephrased claim 'The African American community is being hit hard by COVID-19' is excluded from both datasets. But the claim 'COVID-19 impacts in African-Americans are different from the rest of the U.S. population', which refers specifically to the U.S. population, is only excluded from the SMALL dataset.

3.3 Dataset Statistics

Our final dataset statistics are shown in the lower part of Table 3, where the original and the two reduced versions are presented. After the steps described in Section 3.2 the LARGE dataset contains 5,143 claims, and the SMALL version 1,709 claims.

Example claims contained in the dataset are shown in Table 4. Each of the entries in the dataset contains the following information:

• Claim. Text of the claim.
• Claim label. The labels are: False and True.
• Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institution sites, and peer-reviewed scientific journals.
• Original information source. Information about which general information source was used to obtain the claim.
• Claim type. The different types, explained in Section A.2, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.

4 Claim Veracity Assessment

We develop a pipeline approach for claim veracity evaluation consisting of three steps: document retrieval, sentence retrieval and veracity assessment. Given a claim, we first retrieve the most relevant documents from COVID-19 related sources and then further retrieve the top N most relevant sentences. Considering each retrieved sentence as evidence, we train a veracity assessment model to assign a True or False label to the claim.

4.1 Document Retrieval

Document Dataset. In order to retrieve documents relevant to the claims, we first construct an additional dataset containing documents obtained from reliable COVID-19 related websites. These information sources represent a real-world comprehensive database about COVID-19 that can be used as a primary source of information on the pandemic. We have selected four organisations from which to collect the information: (1) Centers for Disease Control and Prevention (CDC), the national public health agency of the United States; (2) European Centre for Disease Prevention and Control (ECDC), an EU agency aimed at strengthening Europe's defences against infectious diseases; (3) WebMD, an online publisher of news and information on health; and (4) World Health Organization (WHO), the agency of the United Nations responsible for international public health.

All pages corresponding to the COVID-19 subdomains of each site have been downloaded. The web content was downloaded using the BeautifulSoup5 and Scrapy6 packages. Social networking sites and non-textual content were discarded. In total 19,954 web pages have been collected. The list of websites and the full content of each website constitute this additional dataset used for document retrieval. This dataset is enhanced with some additional websites used only in the document retrieval experiments, detailed in Section 5.1.

5 https://www.crummy.com/software/BeautifulSoup/
6 https://scrapy.org/
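A minimal crawling sketch in the spirit of this collection step is shown below, using requests and BeautifulSoup (the actual collection also used Scrapy); the seed URL, the page limit and the `crawl` helper are illustrative assumptions rather than the crawler that was actually run.

```python
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

# Hypothetical seed URL; the actual collection covered the COVID-19 subdomains of
# the CDC, ECDC, WebMD and WHO sites.
SEED = "https://www.cdc.gov/coronavirus/2019-ncov/index.html"
ALLOWED_NETLOC = urlparse(SEED).netloc

def crawl(seed, max_pages=100):
    """Breadth-first fetch of pages under one subdomain, keeping only their visible text."""
    queue, seen, pages = [seed], set(), {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)   # non-textual content is discarded
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == ALLOWED_NETLOC and target not in seen:
                queue.append(target)
    return pages
```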
Claim | Category | Source | Orig. data src. | Type
Stroke Scans Could Reveal COVID-19 Infection. | True | ScienceDaily | CoAID |
Whiskey and honey cure coronavirus. | False | Independent news site | CovidLies |
COVID-19 is more deadly than Ebola or HIV. | False | Australian Associated Press | Poynter |
Dextromethorphan worsens COVID-19. | True | Nature | TREC Health Misinformation track |
ACE inhibitors increase risk for coronavirus. | False | Infectious Disorders - Drug Targets journal | TREC COVID challenge |
Nancy Pelosi visited Wuhan, China, in November 2019, just a month before the COVID-19 outbreak there. | False | Snopes | MM-COVID | Named Entity, Numerical content

Table 4: Example entries in the constructed PANACEA dataset.

Method. Information sources were indexed by creating a Pyserini Lucene index, and PyGaggle was used to implement a re-ranker model on the results. The documents were split into paragraphs of 300 tokens segmented with a BERT tokenizer.

To retrieve the information we first used a BM25 score. Additionally, we tested the effect of multi-stage retrieval by re-ranking the initial results using MonoBERT (Nogueira et al., 2019) and MonoT5 models, and of query expansion using RM3 pseudo-relevance feedback (Abdul-Jaleel et al., 2004) on the BM25 results (Lin, 2019; Yang et al., 2019). MonoBERT uses a BERT model that takes as input the query and each of the documents to be re-ranked encoded together ([CLS]query[SEP]doc[SEP]); the [CLS] output token is then passed to a single-layer fully-connected network that produces the probability of the document being relevant to the query.

4.2 Sentence Retrieval

For each claim, once documents are retrieved using BM25 and MonoT5 re-ranking of the top 100 BM25 results, we then further retrieve the N most similar sentences obtained from the 10 most relevant documents. The relevance of the sentences is calculated using cosine similarity in relation to the original claim. The similarity is obtained with the pre-trained model MiniLM-L12-v2 (Wang et al., 2020b), using Sentence-Transformers7 (Reimers and Gurevych, 2019) to encode the sentences.

7 https://github.com/UKPLab/sentence-transformers
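A minimal sketch of this sentence retrieval step using Sentence-Transformers is shown below; 'all-MiniLM-L12-v2' is one publicly available MiniLM-L12 checkpoint (the exact checkpoint used for the experiments may differ), and the `top_n_evidence` helper is a name introduced here for illustration.

```python
from sentence_transformers import SentenceTransformer, util

# One available MiniLM-L12 checkpoint; assumed here for illustration.
encoder = SentenceTransformer("all-MiniLM-L12-v2")

def top_n_evidence(claim, candidate_sentences, n=5):
    """Rank sentences from the top retrieved documents by cosine similarity to the claim."""
    claim_emb = encoder.encode(claim, convert_to_tensor=True)
    sent_embs = encoder.encode(candidate_sentences, convert_to_tensor=True)
    scores = util.cos_sim(claim_emb, sent_embs)[0]            # shape: (num_sentences,)
    best = scores.topk(min(n, len(candidate_sentences)))
    return [(candidate_sentences[int(i)], float(scores[i])) for i in best.indices]
```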
[Figure 1 (diagram): Proposed veracity classification models. ⊕ means concatenation.
(a) NLI-SAN: RoBERTa claim-evidence representations (Key, Value) and RoBERTa-MNLI outputs (Query) feed a self-attention layer; the outputs are concatenated and passed to an MLP + Softmax to produce the veracity classification output.
(b) NLI-graph: claim and evidence nodes, carrying RoBERTa representations and NLI outputs (the claim node uses (0, 0, 1)), are linked in a graph whose node representations feed an MLP + Softmax to produce the veracity classification output.]

neutrality, or entailment.                                                                         First, for each claim-evidence pair, we derive
                                                                                                   RoBERTa-encoded representations for the claims
                       Si = RoBERTa(c, ei )                                                        and evidence separately (using the pooled output
                                                                                      (1)
                       Ii = RoBERTaNLI (c, ei )                                                    of the last layer) and obtain NLI results of the pairs
                                                                                                   as before.
The sentence representation is combined with the
NLI output through a Self Attention Network                                                          Ci = RoBERTa(c); Ei = RoBERTa(ei ) (4)
(SAN) (Galassi et al., 2020; Bahdanau et al., 2015).                                                         Ii = RoBERTaNLI (c, ei )   (5)
   The RoBERTa-encoded claim-evidence repre-
sentation Si with length nS = nK = nV is                                                           Next, we build an evidence network in which the
mapped onto a Key K ∈ RnK ×dK and a Value                                                          central node is the claim and the rest of the nodes
V ∈ RnV ×dV , while the NLI output Ii of each                                                      are the evidence. Two nodes are linked if their simi-
claim-evidence pair is mapped onto a Query Q ∈                                                     larity value exceeds a pre-defined threshold, which
RnQ ×dQ . The representation dimensionality is                                                     is empirically set to 0.9 by comparing the results of
dK = dV = dQ = 1024. The attention function is                                                     the experimental evaluation described in the follow-
defined as:                                                                                        ing section using different thresholds. The similar-
                                     √                                                             ity is considered between claim and evidence, but
    Att(Q, K, V) = softmax(QK> / d)V (2)                                                           also between pieces of evidence. Similarity calcu-
                                                                                                   lation is performed following the same approach as
While standard attention mechanisms use only the                                                   in Section 4.2. The features considered in each evi-
sentence representation information for the Key,                                                   dence node are the concatenation of Ei and Ii . For
Value and Query, here the inference information                                                    the claim node we use its representation Ci and a
is used in the Query. This attention mechanism is                                                  unity vector (0, 0, 1) for the inference. The network
applied to each of the claim-evidence pairs, and                                                   is implemented with the package PyTorch Geomet-
the outputs are concatenated into an output OSAN                                                   ric (Fey and Lenssen, 2019), using in the first layer
that is passed through a Multi-Layer Perceptron                                                    the GCNConv operator (Kipf and Welling, 2016)
(MLP) with hidden size dh and a Softmax layer to                                                   with 50 output channels and self-loops to the nodes,
generate the veracity classification output.                                                       represented by:
             ŷ = softmax(MLPReLU (OSAN ))                                            (3)                         X0 = D̂−1/2 ÂD̂−1/2 XW,                                       (6)

NLI-graph. We propose an alternative approach                                                     where X is the matrix of node feature vectors, Â =
based on Graph Convolutional Networks (GCN).                                                      A + I denotes the adjacency matrix with inserted
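A compact PyTorch sketch of NLI-SAN as summarised by Equations (1)-(3) is given below; the linear projections used to form the Key, Value and Query, the MLP sizes and the helper names are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

enc_tok = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")
nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def encode_pair(claim, evidence):
    """Compute S_i (last hidden layer of the pair encoding) and I_i (NLI probabilities)."""
    with torch.no_grad():
        enc = enc_tok(claim, evidence, return_tensors="pt", truncation=True)
        S_i = encoder(**enc).last_hidden_state.squeeze(0)                # (seq_len, 1024)
        nli = nli_tok(claim, evidence, return_tensors="pt", truncation=True)
        I_i = torch.softmax(nli_model(**nli).logits, dim=-1).squeeze(0)  # (contradiction, neutral, entailment)
    return S_i, I_i

class NLISAN(nn.Module):
    """Assumes exactly n_evidence claim-evidence pairs per claim."""
    def __init__(self, d=1024, hidden=128, n_evidence=5, n_classes=2):
        super().__init__()
        self.key, self.value = nn.Linear(d, d), nn.Linear(d, d)   # S_i -> Key, Value
        self.query = nn.Linear(3, d)                              # I_i -> Query
        self.mlp = nn.Sequential(nn.Linear(n_evidence * d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, S_list, I_list):
        outputs = []
        for S_i, I_i in zip(S_list, I_list):
            K, V = self.key(S_i), self.value(S_i)                          # (seq_len, d)
            Q = self.query(I_i).unsqueeze(0)                               # (1, d)
            att = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1) @ V  # Eq. (2)
            outputs.append(att.squeeze(0))
        return torch.softmax(self.mlp(torch.cat(outputs)), dim=-1)         # Eq. (3)
```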
NLI-graph. We propose an alternative approach based on Graph Convolutional Networks (GCN). First, for each claim-evidence pair, we derive RoBERTa-encoded representations for the claims and evidence separately (using the pooled output of the last layer) and obtain NLI results of the pairs as before:

    C_i = RoBERTa(c);   E_i = RoBERTa(e_i)                    (4)
    I_i = RoBERTa_NLI(c, e_i)                                 (5)

Next, we build an evidence network in which the central node is the claim and the rest of the nodes are the evidence. Two nodes are linked if their similarity value exceeds a pre-defined threshold, which is empirically set to 0.9 by comparing the results of the experimental evaluation described in the following section using different thresholds. The similarity is considered between claim and evidence, but also between pieces of evidence. Similarity calculation is performed following the same approach as in Section 4.2. The features considered in each evidence node are the concatenation of E_i and I_i. For the claim node we use its representation C_i and a fixed inference vector (0, 0, 1), corresponding to entailment. The network is implemented with the package PyTorch Geometric (Fey and Lenssen, 2019), using in the first layer the GCNConv operator (Kipf and Welling, 2016) with 50 output channels and self-loops added to the nodes, represented by:

    X' = D̂^(−1/2) Â D̂^(−1/2) X W,                            (6)

where X is the matrix of node feature vectors, Â = A + I denotes the adjacency matrix with inserted self-loops, D̂_ii = Σ_j Â_ij is its diagonal degree matrix, and W is a trainable weight matrix.

Once the node representations are updated via the GCN, all the node representations are averaged and passed to the MLP and the Softmax layer to generate the final veracity classification output:

    ŷ = softmax(MLP_ReLU(O_graph))                            (7)
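Below is a minimal PyTorch Geometric sketch of NLI-graph under the settings described above (concatenated 1024-d pooled representations and 3-d NLI triplets as node features, GCNConv with 50 output channels, a 0.9 similarity threshold); the class names, the MLP size and the graph-building helper are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

class NLIGraph(nn.Module):
    def __init__(self, in_dim=1024 + 3, gcn_dim=50, hidden=32, n_classes=2):
        super().__init__()
        self.gcn = GCNConv(in_dim, gcn_dim, add_self_loops=True)   # Eq. (6)
        self.mlp = nn.Sequential(nn.Linear(gcn_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, data):
        x = torch.relu(self.gcn(data.x, data.edge_index))       # update node representations
        return torch.softmax(self.mlp(x.mean(dim=0)), dim=-1)   # average nodes, Eq. (7)

def build_graph(C, E_list, I_list, sims, threshold=0.9):
    """Node 0 is the claim; sims is an (n+1, n+1) cosine-similarity matrix over the
    claim and evidence sentences, computed as in Section 4.2."""
    claim_feat = torch.cat([C, torch.tensor([0.0, 0.0, 1.0])])        # fixed inference vector
    ev_feats = [torch.cat([E, I]) for E, I in zip(E_list, I_list)]    # E_i concatenated with I_i
    x = torch.stack([claim_feat] + ev_feats)
    edges = [[i, j] for i in range(len(x)) for j in range(len(x))
             if i != j and sims[i][j] > threshold]
    edge_index = (torch.tensor(edges, dtype=torch.long).t().contiguous()
                  if edges else torch.empty((2, 0), dtype=torch.long))
    return Data(x=x, edge_index=edge_index)
```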
5 Experiments

In this section, we perform a twofold evaluation: we first evaluate our document retrieval methods (presented in §4.1) on obtaining information relevant to the dataset claims from a database of COVID-19 related websites. We subsequently present an evaluation of the veracity assessment approaches for the claims (described in §4.3).

5.1 Document Retrieval

In order to evaluate our document retrieval methods, we need the gold-standard relevant document for each claim. Therefore, in the documents dataset described in Section 4.1 we additionally include the web content referenced in each of the information sources used to compile our claim dataset:

The CoronaVirus Alliance Database. All web pages from the websites referenced as fact-checking sources for the claims have been downloaded, from 151 different domains.
CoAID dataset. We downloaded the websites used as fact-checking sources of false claims and the websites where correct information on true claims is gathered, from 68 different domains.
MM-COVID. We collected both fact-checking sources and reliable information related to the claims of this dataset from 58 web domains.
CovidLies dataset. We include the web content used as fact-checking sources of the misconceptions, from 39 domains.

We have not included web content from the TREC challenges, as each of them is performed on a very large dataset specific to each challenge (CORD-19 and the Common Crawl corpus), as explained previously. Note that in our subsequent experiments, we have excluded all fact-checking websites to avoid directly finding the claim references. The results of the document retrieval are presented in Table 5. For each claim, the precision@k is defined as 1 if the relevant result is retrieved in the top k list and 0 otherwise.

                | AP@5 | AP@10 | AP@20 | AP@100
BM25            | 0.54 | 0.56  | 0.58  | 0.62
BM25+MonoBERT   | 0.52 | 0.55  | 0.58  | 0.62
BM25+MonoT5     | 0.55 | 0.58  | 0.60  | 0.62
BM25+RM3+MonoT5 | 0.51 | 0.53  | 0.55  | 0.57

Table 5: Document retrieval results. Average precision for different cut-offs. For the MonoBERT and MonoT5 cases, 100 initial results are retrieved in the first retrieval stage before re-ranking.

We can see that by using BM25, it is possible in many cases to retrieve the relevant results at the very top of our searches. Combining BM25 with MonoBERT did not offer any improvement. It even introduced noise into the retrieval results, leading to inferior performance compared to using BM25 only on AP@5 and AP@10. MonoT5 appears to be more effective, consistently improving the retrieval results across all metrics. Moreover, for this dataset the use of query expansion with RM3 pseudo-relevance feedback on the BM25 results does not improve the results.

5.2 Veracity Assessment Evaluation

Here we evaluate our proposed NLI-SAN and NLI-graph veracity assessment approaches. To gain a better insight into the benefits of the proposed architectures, we conducted additional experiments on the following variants of the models:

• NLI, using only the NLI outputs of the claim-evidence pairs. The outputs are concatenated and then passed through the final classification layer to generate veracity classification results.
• NLI+Sent, the ablated version of NLI-SAN without the self-attention layer. Here, the RoBERTa-encoded claim-evidence representations are concatenated with the NLI results and then fed to the classification layer to produce the veracity classification output.
• NLI+PSent, similar to the previous ablated version, but using the pooled representation of the claim-evidence pair to concatenate with the NLI result.
• NLI-graph−abl, the ablated version of NLI-graph in which the node representation is the NLI result of the corresponding claim-evidence pair without its RoBERTa-encoded representation.

For NLI, NLI+Sent and NLI-SAN, we consider the 5 most similar sentences for each claim, obtained from the 10 most relevant documents of the information source database. Those documents are retrieved using BM25 and MonoT5 re-ranking of the top 100 BM25 results.
Model                    | False: Precision | False: Recall | False: F1 | True: Precision | True: Recall | True: F1 | Macro F1
GEAR (Zhou et al., 2019) | 0.81 | 0.60 | 0.69 | 0.85 | 0.94 | 0.89 | 0.79
KGAT (Liu et al., 2020)  | 0.89 | 0.96 | 0.92 | 0.98 | 0.95 | 0.97 | 0.94
NLI                      | 0.48 | 0.24 | 0.31 | 0.75 | 0.90 | 0.82 | 0.56
NLI+Sent                 | 0.91 | 0.87 | 0.89 | 0.95 | 0.97 | 0.96 | 0.92
NLI+PSent                | 0.87 | 0.72 | 0.79 | 0.90 | 0.96 | 0.93 | 0.86
NLI-SAN                  | 0.93 | 0.89 | 0.91 | 0.96 | 0.97 | 0.97 | 0.94
NLI-graph−abl            | 0.50 | 0.33 | 0.39 | 0.77 | 0.87 | 0.81 | 0.60
NLI-graph                | 0.89 | 0.83 | 0.86 | 0.94 | 0.96 | 0.95 | 0.90

Table 6: Veracity classification results on the PANACEA SMALL dataset. The best result in each column is highlighted in bold.

Model                    | False: Precision | False: Recall | False: F1 | True: Precision | True: Recall | True: F1 | Macro F1
GEAR (Zhou et al., 2019) | 0.88 | 0.88 | 0.88 | 0.93 | 0.94 | 0.94 | 0.91
KGAT (Liu et al., 2020)  | 0.95 | 0.98 | 0.96 | 0.99 | 0.98 | 0.98 | 0.97
NLI                      | 0.52 | 0.27 | 0.36 | 0.69 | 0.86 | 0.76 | 0.56
NLI+Sent                 | 0.94 | 0.94 | 0.94 | 0.97 | 0.97 | 0.97 | 0.95
NLI+PSent                | 0.89 | 0.77 | 0.82 | 0.88 | 0.95 | 0.91 | 0.86
NLI-SAN                  | 0.95 | 0.95 | 0.95 | 0.97 | 0.98 | 0.97 | 0.96
NLI-graph−abl            | 0.60 | 0.43 | 0.50 | 0.73 | 0.84 | 0.78 | 0.64
NLI-graph                | 0.94 | 0.91 | 0.93 | 0.95 | 0.97 | 0.96 | 0.94

Table 7: Veracity classification results on the PANACEA LARGE dataset. The best result in each column is highlighted in bold.

For NLI-graph, NLI-graph−abl and NLI+PSent, in order to have enough nodes to benefit from the network structure, the number of retrieved sentences is increased to 30 for each claim, selected as the 3 most similar sentences from each of the top 10 retrieved documents. The retrieval procedure is as in Sections 4.1 and 4.2. Details of parameter settings can be found in Appendix B. We compare against the SOTA methods GEAR9 (Zhou et al., 2019) and KGAT10 (Liu et al., 2020), with settings as described by the authors.

For all approaches we perform 5-fold cross-validation and report the averaged results on the SMALL dataset in Table 6. By using the NLI information alone it is possible to obtain reasonable results for the True claims; however, this is not the case for the most relevant False claims. Once we add sentence representations, the performance of the method increases significantly. Using NLI-SAN instead of simply concatenating contextualised claim-evidence representations and NLI outputs further improves the results. A similar observation can be made in the results generated by NLI-graph and its variants; the contextualised representations of claim-evidence pairs are much more important than merely using the corresponding NLI values. We also note that the graph version NLI-graph obtains better scores than a non-graph model with the same information, NLI+PSent; however, the scores are still lower than those of the NLI-SAN method. Our method performs on a par with KGAT, while being simpler, and outperforms GEAR.

Complementing the results for the SMALL dataset, Table 7 presents the results for the LARGE dataset. In general, we observe improved performance for all models across all metrics for both classes compared to the results on the SMALL dataset. The results on the SMALL dataset constitute a more challenging case, since the uniqueness of the claims is increased and therefore the veracity assessment models are not able to learn from similar claims when performing the assessment.

5.3 Discussion

Our results show that in document retrieval, we have obtained values of around 0.6 from a simple term-scoring and re-ranking retrieval model. However, this baseline represents only a rough measure of quality using this technique, since we have only evaluated the retrieval of a single document specific to each claim; we have not evaluated the quality of other retrieved documents.

9 https://github.com/thunlp/GEAR
10 https://github.com/thunlp/KernelGAT
The distinction into True and False claims can be rather coarse-grained. We note that initially we considered a larger number of veracity labels, including more nuanced cases that could be interesting to analyse (see A.1). However, we have not found a clear separation between complex cases, and it would seem that different fact-checkers do not follow the same conventions when labelling such cases. The development of datasets especially focused on such nuanced cases may therefore be an important line of work in the future, together with the development of techniques for these more complex situations.

In analysing misclassified claims, we note some interesting cases. The scope and globality of the pandemic imply that similar issues are mentioned repeatedly on multiple occasions, yet claims to be verified may include nuances or specificities. This is challenging as it is easy to retrieve information that omits relevant nuances. For example, the claim "Barron Trump had COVID-19, Melania Trump says" retrieves sentences such as "Rudy Giuliani has tested positive for COVID-19, Trump says.", with a similar structure and mentions but missing the key name. This type of situation could be addressed by using Named Entity Recognition (NER) methods that prioritise matching between the entities involved in the claim and the information sources; see e.g. Taniguchi et al. (2018); Nooralahzadeh and Øvrelid (2018).

Other interesting cases involve claims for which documents with adequate information are retrieved, but the sentences containing evidence cannot be identified because they are too different from the original claim. For example, the claim "Vice President of Bharat Biotech got a shot of the indigenous COVAXIN vaccine" retrieves correct documents on the issue. Similar sentences are retrieved, such as "Covaxin which is being developed by Bharat Biotech is the only indigenous vaccine that is approved for emergency use.". Despite being similar, such retrieved sentences give no information about the claimed situation. In the retrieved document, the sentence "The pharmaceutical company, has in a statement, denied the claim and said the image shows a routine blood test." contains the essential information to debunk the original claim, but is missed by the sentence retrieval engine as it is very different from the claim (see Table A1 in Appendix C for other examples).

Such cases are more difficult to deal with, as the similarity between claim and evidence is certainly a good indicator of relevance. Nevertheless, these cases are very interesting for future work using more complex approaches. We have made an initial attempt to address this problem by representing claims and retrieved documents using Abstract Meaning Representation (Banarescu et al., 2013) in order to better select relevant information. Although the results were not satisfactory, it may be an interesting avenue for future exploration. Another line of future work is the design of strategies against adversarial attacks to mitigate possible risks to our system.

6 Conclusions

We have presented a novel dataset that aggregates a heterogeneous set of COVID-19 claims categorised as True or False. Aggregation of heterogeneous sources involved a careful de-duplication process to ensure dataset quality. Fact-checking sources are provided for veracity assessment, as well as additional information sources for True claims. Additionally, claims are labelled with sub-types (Multimodal, Social Media, Questions, Numerical, and Named Entities).

We have performed a series of experiments using our dataset for information retrieval, through direct retrieval and using a multi-stage re-ranker approach. We have proposed new NLI methods for claim veracity assessment, the attention-based NLI-SAN and the graph-based NLI-graph, achieving on our dataset results competitive with the GEAR and KGAT state-of-the-art models. We have also discussed challenging cases and provided ideas for future research directions.

Acknowledgements

This work was supported by the UK Engineering and Physical Sciences Research Council (grant nos. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by UK Research and Innovation (grant nos. EP/V030302/1, EP/V020579/1).

References

Nasreen Abdul-Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D. Smucker, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD. Computer Science Department Faculty Publication Series, page 189.
Nasreen Abdul-Jaleel, James Allan, W Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D Smucker, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD. Computer Science Department Faculty Publication Series, page 189.

Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi, Dinesh Pabbi, Kunal Verma, and Rannie Lin. 2021. Mega-COV: A billion-scale dataset of 100+ languages for COVID-19. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3402–3420, Online. Association for Computational Linguistics.

Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task. In Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), pages 1–13, Dominican Republic. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.

Cambridge journals Coronavirus Free Access Collection. 2020. Cambridge journals coronavirus free access collection. https://www.cambridge.org/core/browse-subjects/medicine/coronavirus-free-access-collection.

Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set. JMIR Public Health and Surveillance, 6(2):e19273.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668.

Qingyu Chen, Alexis Allot, and Zhiyong Lu. 2021. LitCovid: an open database of COVID-19 literature. Nucleic Acids Research, 49(D1):D1534–D1540.

COVID-19 Data Portal (EU). 2020. COVID-19 data portal (EU). https://www.covid19dataportal.org/.

Fabio Crestani, Mounia Lalmas, Cornelis J Van Rijsbergen, and Iain Campbell. 1998. “Is this document relevant?. . . probably”: A survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.

Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.

Enyan Dai, Yiwei Sun, and Suhang Wang. 2020. Ginger cannot cure cancer: Battling fake health news with a comprehensive data repository. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 853–862.

Dimitar Dimitrov, Erdal Baran, Pavlos Fafalios, Ran Yu, Xiaofei Zhu, Matthäus Zloch, and Stefan Dietze. 2020. TweetsCOV19: a knowledge base of semantically annotated tweets about the COVID-19 pandemic. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 2991–2998.

Arianna D’Ulizia, Maria Chiara Caschera, Fernando Ferri, and Patrizia Grifoni. 2021. Fake news detection: a survey of evaluation datasets. PeerJ Computer Science, 7:e518.

Elsevier journals Novel Coronavirus Information Center. 2020. Elsevier journals novel coronavirus information center. https://www.elsevier.com/connect/coronavirus-information-center.

William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1168.

Matthias Fey and Jan E. Lenssen. 2019. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.

Andrea Galassi, Marco Lippi, and Paolo Torroni. 2020. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems.

Reza Ghaeini, Sadid A Hasan, Vivek Datla, Joey Liu, Kathy Lee, Ashequl Qadir, Yuan Ling, Aaditya Prakash, Xiaoli Fern, and Oladimeji Farri. 2018. DR-BiLSTM: Dependent reading bidirectional LSTM for natural language inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1460–1469.

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018. UKP-Athene: Multi-sentence textual entailment for claim verification. EMNLP 2018, page 103.

Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.

Xiaolei Huang, Amelia Jamison, David Broniatowski, Sandra Quinn, and Mark Dredze. 2020. Coronavirus Twitter data: A collection of COVID-19 tweets with automated annotations. http://twitterdata.covid19dataresources.org/index.

Daniel Kerchner and Laura Wrubel. 2020. Coronavirus tweet ids. Harvard Dataverse.

Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Neema Kotonya and Francesca Toni. 2020. Explainable automated fact-checking for public health claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7740–7754.

Rabindra Lamsal. 2021. Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 51(5):2790–2804.

Tianda Li, Xiaodan Zhu, Quan Liu, Qian Chen, Zhigang Chen, and Si Wei. 2019. Several experiments on investigating pretraining and knowledge-enhanced models for natural language inference. arXiv preprint arXiv:1904.12104.

Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A multilingual and multimodal data repository for combating COVID-19 disinformation.

Jimmy Lin. 2019. The neural hype and comparisons against weak baselines. In ACM SIGIR Forum, volume 52, pages 40–51. ACM New York, NY, USA.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2020. Fine-grained fact verification with kernel graph attention network. In The 58th Annual Meeting of the Association for Computational Linguistics (ACL).

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Jackson Luken, Nanjiang Jiang, and Marie-Catherine de Marneffe. 2018. QED: A fact verification system for the FEVER shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 156–160.

MedRN medical research network SSRN Coronavirus Infectious Disease Research Hub. 2020. MedRN medical research network SSRN coronavirus infectious disease research hub. https://www.ssrn.com/index.cfm/en/coronavirus/.

Shahan Ali Memon and Kathleen M Carley. 2020. Characterizing COVID-19 misinformation communities using a novel Twitter dataset. In CEUR Workshop Proceedings, volume 2699.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6859–6866.

Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424.

Farhad Nooralahzadeh and Lilja Øvrelid. 2018. SIRIUS-LTG: An entity linking approach to fact extraction and verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 119–123.

Jeppe Nørregaard, Benjamin D Horne, and Sibel Adalı. 2019. NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 630–638.

Oxford journals resources on COVID-19. 2020. Oxford journals resources on COVID-19. https://academic.oup.com/journals/pages/coronavirus.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255.

Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, and Jimmy Lin. 2021. Vera: Prediction techniques for reducing harmful misinformation in consumer health search. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021).

Umair Qazi, Muhammad Imran, and Ferda Ofli. 2020. GeoCoV19: a dataset of hundreds of millions of multilingual COVID-19 tweets with location information. SIGSPATIAL Special, 12(1):6–15.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.

Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, Lucy Lu Wang, and William R Hersh. 2020. TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19. Journal of the American Medical Informatics Association, 27(9):1431–1436.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.

Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.

Gautam Kishore Shahi, Anne Dirkson, and Tim A Majchrzak. 2021. An exploratory study of COVID-19 misinformation on Twitter. Online Social Networks and Media, 22:100104.

Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2020. FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data, 8(3):171–188.

Amir Soleimani, Christof Monz, and Marcel Worring. 2020. BERT for evidence retrieval and claim verification. Advances in Information Retrieval, 12036:359.

Dominik Stammbach and Guenter Neumann. 2019. Team DOMLIN: Exploiting evidence enhancement for the FEVER shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 105–109.

Motoki Taniguchi, Tomoki Taniguchi, Takumi Takahashi, Yasuhide Miura, and Tomoko Ohkuma. 2018. Integrating entity linking and evidence ranking for fact extraction and verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 124–126.

The Lancet COVID-19 content collection. 2020. The Lancet COVID-19 content collection. https://www.thelancet.com/coronavirus/collection.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: a large-scale dataset for fact extraction and VERification. In NAACL-HLT.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018b. The FEVER2.0 shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER).

Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM New York, NY, USA.

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Michael Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Nancy Xin Ru Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020a. CORD-19: The COVID-19 open research dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online. Association for Computational Linguistics.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020b. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957.

William Yang Wang. 2017. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 422–426.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA, 23.

WHO database of publications on coronavirus. 2020. WHO database of publications on coronavirus. https://search.bvsalud.org/global-literature-on-novel-coronavirus-2019-ncov/.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically examining the “neural hype”: weak baselines and the additivity of effectiveness gains from neural ranking models. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1129–1132.

Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. UCL machine reading group: Four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 97–102.

Xia Zeng, Amani S Abumansour, and Arkaitz Zubiaga. 2021. Automated fact-checking: A survey. Language and Linguistics Compass, 15(10):e12438.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. 2020. Reasoning over semantic-level graph for fact checking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6170–6180.

Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2019. GEAR: Graph-based evidence aggregating and reasoning for fact verification. In The 57th Annual Meeting of the Association for Computational Linguistics (ACL).

Xinyi Zhou, Apurva Mulay, Emilio Ferrara, and Reza Zafarani. 2020. ReCOVery: A multimodal repository for COVID-19 news credibility research. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 3205–3212.
A   Data Sources

Here we present detailed information on the data sources introduced in section 3.1.

It is worth noting that for the construction of our dataset we have only included sources or datasets that contain explicit veracity labels for specific claims. We have therefore not included collections of tweets related to COVID that do not have veracity labels (Chen et al., 2020; Lamsal, 2021; Abdul-Mageed et al., 2021; Huang et al., 2020; Dimitrov et al., 2020; Kerchner and Wrubel, 2020; Qazi et al., 2020). We have also not included claims without independent fact-checking sources (Memon and Carley, 2020; Shahi et al., 2021), nor information sources without formulated claims, such as collections of scholarly articles (Wang et al., 2020a; Chen et al., 2021), news articles (Zhou et al., 2020), or articles obtained through specific repositories (COVID-19 Data Portal (EU); WHO database of publications on coronavirus; Elsevier journals Novel Coronavirus Information Center; Cambridge journals Coronavirus Free Access Collection; The Lancet COVID-19 content collection; Oxford journals resources on COVID-19; MedRN medical research network SSRN Coronavirus Infectious Disease Research Hub).

The data sources that we have used for the construction of our dataset are:

• The CoronaVirusFacts/DatosCoronaVirus Alliance Database [11]. Published by Poynter [12], this online publication combines fact-checking articles from more than 100 fact-checkers from all over the world, being the largest journalistic fact-checking collaboration on the topic worldwide [13]. The publication is presented as an online portal, thus we had to develop scripts to crawl the content and extract the relevant claims, categories, and information sources.

• CoAID dataset [14]. The dataset (Cui and Lee, 2020) contains fake news from fact-checking websites and real news from health information websites, health clinics, and public institutions. Unlike most other datasets, it contains a wide selection of true claims.

• MM-COVID [15]. The multilingual dataset (Li et al., 2020) contains fake and true news collected from Poynter and Snopes [16], being a good complement to the first data source.

• CovidLies dataset [17]. The dataset (Hossain et al., 2020) contains a curated list of common misconceptions about COVID appearing in social media, carefully reviewed to contain very relevant and unique claims, unlike other automatically collected datasets.

• TREC Health Misinformation track [18]. A research challenge using claims in the health domain, focused on information retrieval from general websites through the Common Crawl corpus [19]. This dataset is specialised in a very specific domain and has been used for a very different application than the previous data sources.

• TREC COVID challenge [20]. A research challenge (Voorhees et al., 2021; Roberts et al., 2020) using claims in the health domain, focused on information retrieval from scholarly peer-reviewed journals through the CORD-19 dataset (Wang et al., 2020a), the largest existing compilation of such articles. Similar to the previous source, but focused on scholarly papers, unlike the other sources.

A.1   Pre-processing

A separate pre-processing step was carried out for each of the selected data sources:

The CoronaVirusFacts/DatosCoronaVirus Alliance Database. The data was downloaded on 13 February 2021. From the 11,647 entries initially obtained, entries with no fact-checking source and categories with fewer than 10 entries were removed. The different fact-checkers used different categories to label the claims, although in most cases the difference was mainly in terms of spelling. Initially we identified the following common categories: False (including FALSE, FALSO, Fake, false, false and misleading, Two Pinocchios, Misinformation
[11] https://www.poynter.org/ifcn-covid-19-misinformation/
[12] www.poynter.org
[13] https://www.poynter.org/coronavirusfactsalliance/
[14] https://github.com/cuilimeng/CoAID
[15] https://github.com/bigheiniu/MM-COVID
[16] www.snopes.com
[17] https://github.com/ucinlp/covid19-data
[18] https://trec-health-misinfo.github.io/
[19] https://commoncrawl.org/
[20] https://ir.nist.gov/covidSubmit/data.html