Improving Evidence Retrieval for Automated Explainable Fact-Checking
Chris Samarinas (1), Wynne Hsu (1,2,3), and Mong Li Lee (1,2,3)
(1) Institute of Data Science, National University of Singapore
(2) NUS Centre for Trusted Internet and Community
(3) School of Computing, National University of Singapore

Abstract

Automated fact-checking on a large scale is a challenging task that has not been studied systematically until recently. Large noisy document collections like the web or news articles make the task more difficult. In this paper, we describe the components of a three-stage automated fact-checking system named Quin+. We demonstrate that using dense passage representations increases the evidence recall in a noisy setting. We experiment with two sentence selection approaches: an embedding-based selection using a dense retrieval model, and a sequence labeling approach for context-aware selection. Quin+ is able to verify open-domain claims using a large-scale corpus or web search results.

1 Introduction

With the emergence of social media and many individual news sources online, the spread of misinformation has become a major problem with potentially harmful social consequences. Fake news can manipulate public opinion, create conflicts, and elicit unreasonable fear and suspicion. The vast amount of unverified online content has led to the establishment of external post-hoc fact-checking organizations, such as PolitiFact, FactCheck.org and Snopes, with dedicated resources to verify claims online. However, manual fact-checking is time consuming and intractable on a large scale. The ability to automatically perform fact-checking is critical to minimize negative social impact.

Automated fact-checking is a complex task involving evidence extraction followed by evidence reasoning and entailment. For the retrieval of relevant evidence from a corpus of documents, existing systems typically utilize traditional sparse retrieval, which may have poor recall, especially when the relevant passages have few overlapping words with the claims to be verified. Dense retrieval models have proven effective in question answering, as these models can better capture the latent semantic content of text. The work in (Samarinas et al., 2020) is the first to use dense retrieval for fact-checking. The authors constructed a new dataset called Factual-NLI comprising claim-evidence pairs from the FEVER dataset (Thorne et al., 2018) as well as synthetic examples generated from benchmark question answering datasets (Kwiatkowski et al., 2019; Nguyen et al., 2016). They demonstrated that using Factual-NLI to train a dense retriever can improve evidence retrieval significantly.

While the FEVER dataset has enabled the systematic evaluation of automated fact-checking systems, it does not reflect well the noisy nature of real-world data. Motivated by this, we introduce the Factual-NLI+ dataset, an extension of the FEVER dataset with synthetic examples from question answering datasets and noise passages from web search results. We examine how dense representations can improve the first-stage retrieval recall of passages for fact-checking in a noisy setting, and make the retrieval of relevant evidence more tractable on a large scale.

However, the selection of relevant evidence sentences for accurate fact-checking and explainability remains a challenge. Figure 1 shows an example of a claim and the retrieved passage, which has three sentences, of which only the last sentence provides the critical evidence to refute the claim. We propose two ways to select the relevant sentences: an embedding-based selection using a dense retrieval model, and a sequence labeling approach for context-aware selection. We show that the former generalizes better with a high recall, while the latter has higher precision, making them suitable for the identification of relevant evidence sentences. Our fact-checking system Quin+ is able to verify open-domain claims using a large corpus or web search results.
Figure 1: Sample claim and the retrieved evidence passage where only the last sentence is relevant.

2 Related Work

Automated claim verification using a large corpus has not been studied systematically until the availability of the Fact Extraction and VERification (FEVER) dataset (Thorne et al., 2018). This dataset contains claims that are supported or refuted by specific evidence from Wikipedia articles. Prior to the work in (Samarinas et al., 2020), fact-checking solutions had relied on sparse passage retrieval, followed by a claim verification (entailment classification) model (Nie et al., 2019). Other approaches used the mentions of entities in a claim and/or basic entity linking to retrieve documents, and a machine learning model such as logistic regression or an enhanced sequential inference model to decide whether an article most likely contains the evidence (Yoneda et al.; Chen et al., 2017; Hanselowski et al., 2018).

However, retrieval based on sparse representations and exact keyword matching can be rather restrictive for various queries. This restriction can be mitigated by dense representations using BERT-based language models (Devlin et al., 2019). The works in (Lee et al., 2019; Karpukhin et al., 2020; Xiong et al., 2020; Chang et al., 2020) have successfully used such models and their variants for passage retrieval in open-domain question answering. The results can be further improved using passage re-ranking with cross-attention BERT-based models (Nogueira et al., 2019). The work in (Samarinas et al., 2020) is the first to propose a dense model to retrieve passages for fact-checking.

Apart from passage retrieval, sentence selection is also a critical task in fact-checking. These evidence sentences provide an explanation of why a claim has been assessed to be credible or not. Recent works have proposed a BERT-based model for extracting relevant evidence sentences from multi-sentence passages (Atanasova et al., 2020). The authors observe that joint training on veracity prediction and explanation generation performs better than training separate models. The work in (Stammbach and Ash, 2020) investigates how the few-shot learning capabilities of the GPT-3 model (Brown et al., 2020) can be used for generating fact-checking explanations.

3 The Quin+ System

The automated claim verification task can be defined as follows: given a textual claim c and a corpus D = {d_1, d_2, ..., d_n}, where every passage d is comprised of sentences s_j, 1 ≤ j ≤ k, a system S will return a set of evidence sentences Ŝ ⊂ d_i and a label ŷ ∈ {probably true, probably false, inconclusive}.

We have developed an automated fact-checking system, called Quin+, that verifies a given claim in three stages: passage retrieval from a corpus, sentence selection, and entailment classification, as shown in Figure 2. The label is determined as follows: we first perform entailment classification on the set of evidence sentences. When the number of retrieved evidence sentences that entail or contradict the claim is low, we label the claim as "inconclusive". If the number of evidence sentences that support the claim exceeds the number of sentences that refute the claim, we assign the label "probably true". Otherwise, we assign the label "probably false".

Figure 2: Three stages of claim verification in Quin+.

3.1 Passage Retrieval

The passage retrieval model in Quin+ is based on a dense retrieval model called QR-BERT (Samarinas et al., 2020). This model is based on BERT and creates dense vectors for passages by calculating their average token embedding. The relevance of a passage d to a claim c is then given by their dot product:

r(c, d) = φ(c)^T φ(d)    (1)

Dot product search can run efficiently using an approximate nearest neighbors index implemented with the FAISS library (Johnson et al., 2019). QR-BERT maximizes the sampled softmax loss:

L_θ = Σ_{(c,d) ∈ D_b⁺} [ r_θ(c, d) − log Σ_{d_i ∈ D_b} exp(r_θ(c, d_i)) ]    (2)

where D_b is the set of passages in a training batch b, D_b⁺ is the set of positive claim-passage pairs in the batch b, and θ represents the parameters of the BERT model.
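To make the retrieval stage concrete, the sketch below scores passages with mean-pooled BERT token embeddings and searches them with a FAISS inner-product (dot product) index. It is a minimal illustration under stated assumptions, not the released QR-BERT code: the encoder checkpoint, the flat (exact) index, and the example texts are all placeholders.

```python
# Minimal sketch of dense passage scoring with mean-pooled BERT embeddings
# and a FAISS inner-product index. The checkpoint and texts are illustrative;
# QR-BERT itself is trained on Factual-NLI rather than used off the shelf.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """phi(.): average token embedding of the last hidden layer."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, L, 768)
    mask = batch["attention_mask"].unsqueeze(-1)              # ignore padding
    pooled = (hidden * mask).sum(1) / mask.sum(1)             # mean pooling
    return pooled.numpy().astype("float32")

passages = ["Singapore is a city-state in Southeast Asia.",
            "The Eiffel Tower is located in Paris, France."]
index = faiss.IndexFlatIP(768)     # exact dot-product search; an approximate
index.add(embed(passages))         # index (e.g. faiss.IndexHNSWFlat) would be
                                   # used for a corpus of millions of passages

claim = "The Eiffel Tower is in Germany."
scores, ids = index.search(embed([claim]), 2)   # r(c, d) = phi(c)^T phi(d)
print(list(zip(ids[0], scores[0])))
```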
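Equation (2) has the familiar form of a softmax over the passages of a batch. A hedged sketch of that loss is shown below; it assumes the common in-batch setup where each claim's positive passage is aligned with it by row, which the paper does not spell out.

```python
# Sketch of the sampled softmax loss of Eq. (2) with in-batch candidates.
# Row i of claim_emb is assumed to be paired with row i of passage_emb as its
# positive; the other passages in the batch act as sampled negatives.
import torch
import torch.nn.functional as F

def sampled_softmax_loss(claim_emb, passage_emb):
    """claim_emb, passage_emb: (B, H) float tensors of phi(c) and phi(d)."""
    scores = claim_emb @ passage_emb.T              # r_theta(c, d_i) for all pairs
    targets = torch.arange(scores.size(0))          # positives on the diagonal
    # Minimizing this cross-entropy (summed over the batch) is equivalent to
    # maximizing L_theta in Eq. (2).
    return F.cross_entropy(scores, targets, reduction="sum")

claims = torch.randn(8, 768, requires_grad=True)
passages = torch.randn(8, 768, requires_grad=True)
loss = sampled_softmax_loss(claims, passages)
loss.backward()   # in training, gradients flow back into the BERT encoder
```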
The work in (Samarinas et al., 2020) introduced the Factual-NLI dataset, which extends the FEVER dataset (Thorne et al., 2018) with more diverse synthetic examples derived from question answering datasets. There are 359,190 new entailed claims with evidence, and additional contradicted claims generated with a rule-based approach. To ensure robustness, we compile a new large-scale noisy version of Factual-NLI called Factual-NLI+ (https://archive.org/details/factual-nli). This dataset includes all the 5 million Wikipedia passages in the FEVER dataset. We add 'noise' passages as follows. For every claim c in the FEVER dataset, we retrieve the top 30 web results from the Bing search engine and keep the passages with the highest BM25 score that are classified as neutral by the entailment model. For claims generated from MS MARCO queries (Nguyen et al., 2016), we include the irrelevant passages that are found in the MS MARCO dataset for those queries. This results in 418,650 additional passages. The new dataset better reflects the nature of a large-scale corpus that would be used by a real-world fact-checking system. We trained a dense retrieval model using this extended dataset.

The Quin+ system utilizes a hybrid model that combines the results from the dense retrieval model described above and BM25 sparse retrieval to obtain the final list of retrieved passages. For efficient sparse retrieval, we used the Rust-based Tantivy full-text search engine (https://github.com/tantivy-search/tantivy).

3.2 Sentence Selection

We propose and experiment with two sentence selection methods: an embedding-based selection and a context-aware sentence selection method.

The embedding-based selection method relies on the dense representations learned by the dense passage retrieval model QR-BERT. For a given claim c, we select the sentences s_i from a given passage d = {s_1, s_2, ..., s_k} whose relevance score r(c, s_i) is greater than some threshold λ, which is set experimentally.

The context-aware sentence selection method uses a BERT-based sequence labeling model. The input of the model is the concatenation of the tokenized claim C = {C_1, C_2, ..., C_k}, the special [SEP] token and the tokenized evidence passage E = {E_1, E_2, ..., E_m} (see Figure 3). For the output of the model, we adopt the BIO tagging format, so that all the irrelevant tokens are classified as O, the first token of an evidence sentence as B-evidence, and the remaining tokens of an evidence sentence as I-evidence. We trained a model based on RoBERTa-large (Liu et al., 2019), minimizing the cross-entropy loss:

L_θ = − Σ_{i=1}^{N} Σ_{j=1}^{l_i} log p_θ(y_j^i)    (3)

where N is the number of examples in the training batch, l_i is the number of non-padding tokens of the i-th example, and p_θ(y_j^i) is the estimated softmax probability of the correct label for the j-th token of the i-th example. We trained this model on Factual-NLI with batch size 64, the Adam optimizer and an initial learning rate of 5 × 10^-5 until convergence.

Figure 3: Sequence labeling model for evidence selection from a passage for a given claim.
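As a rough sketch, the embedding-based selector reduces to a thresholding loop over sentence scores. It reuses the `embed` function assumed in the retrieval sketch of Section 3.1, and the threshold value is an arbitrary placeholder rather than the experimentally tuned λ.

```python
# Sketch of embedding-based sentence selection: keep the sentences of a
# retrieved passage whose dot-product relevance to the claim exceeds a
# threshold. `embed` is the mean-pooling encoder from the earlier retrieval
# sketch; the default threshold is illustrative, not the tuned lambda.
def select_sentences(claim, sentences, embed, threshold=0.5):
    claim_vec = embed([claim])[0]                 # phi(c)
    sentence_vecs = embed(sentences)              # phi(s_i) for each sentence
    scores = sentence_vecs @ claim_vec            # r(c, s_i)
    return [s for s, score in zip(sentences, scores) if score > threshold]

passage = ["Sunflowers are native to the Americas.",
           "They are grown as a crop for their edible seeds.",
           "They require full sun for healthy growth."]
evidence = select_sentences("Sunflowers need shade to grow.", passage, embed)
```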
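A context-aware selector along these lines can be set up as token classification over the claim-passage pair. The sketch below uses the Hugging Face token-classification head with three BIO labels; the label names, checkpoint and example are assumptions, and the head is untrained here, whereas in the paper it is fine-tuned on Factual-NLI with the loss of Equation (3).

```python
# Sketch of the BIO sequence-labeling selector: the claim and the evidence
# passage are encoded as a sentence pair, and every token is tagged as
# B-evidence, I-evidence or O. This mirrors the described setup but is not
# the authors' training code; the classification head is randomly initialized
# and would need fine-tuning before the tags are meaningful.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-evidence", "I-evidence"]        # assumed label set
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-large", num_labels=len(LABELS))

claim = "Sunflowers need shade to grow."
passage = ("Sunflowers are native to the Americas. "
           "They require full sun for healthy growth.")

# Encoding the pair inserts the model's separator tokens between claim and passage.
inputs = tokenizer(claim, passage, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits               # (1, sequence_length, 3)
predictions = logits.argmax(dim=-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print([(tok, LABELS[i]) for tok, i in zip(tokens, predictions.tolist())])
```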
3.3 Entailment Classification

Natural Language Inference (NLI), also known as textual entailment classification, is the task of detecting whether a hypothesis statement is entailed by a premise passage. It is essentially a text classification problem, where the input is a premise-hypothesis pair (P, H) and the output a label y ∈ {entailment, contradiction, neutral}. An NLI model is often a core component of many automated fact-checking systems. Datasets like the Stanford Natural Language Inference corpus (SNLI) (Bowman et al., 2015), the Multi-Genre Natural Language Inference corpus (Multi-NLI) (Williams et al., 2018) and Adversarial NLI (Nie et al., 2020) have facilitated the development of models for this task.

Even though pre-trained NLI models seem to perform well on the two popular NLI datasets (SNLI and Multi-NLI), they are not as effective in a real-world setting. This is possibly due to the bias in these two datasets, which has a negative effect on the generalization ability of the trained models (Poliak et al., 2018). Further, these datasets are comprised of short single-sentence premises. As a result, models trained on these datasets usually do not perform well on noisy real-world data involving multiple sentences. These issues have led to the development of additional, more challenging datasets such as Adversarial NLI (Nie et al., 2020).

Our Quin+ system utilizes an NLI model based on RoBERTa-large with a linear transformation of the [CLS] token embedding (Devlin et al., 2019):

o = softmax(W · BERT_[CLS]([P; H]) + a)    (4)

where [P; H] is the concatenation of the premise with the hypothesis, W is a 3 × 1024 linear transformation matrix, and a is a 3 × 1 bias vector. We trained the entailment model by minimizing the cross-entropy loss on the concatenation of the three popular NLI datasets (SNLI, Multi-NLI and Adversarial NLI) with batch size 64, the Adam optimizer and an initial learning rate of 5 × 10^-5 until convergence.
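As a hedged sketch of this entailment classifier, the snippet below wraps RoBERTa-large in the Hugging Face sequence-classification head with three labels. The label ordering is an assumption, the checkpoint is not fine-tuned on the NLI data described above, and the Hugging Face head adds a small pooling layer over the first token, so it approximates rather than exactly reproduces the single linear transformation of Equation (4).

```python
# Sketch of the three-way NLI classifier used for entailment classification:
# the premise (evidence) and hypothesis (claim) are encoded as a pair and
# mapped to {entailment, contradiction, neutral}. In the paper this model is
# fine-tuned on SNLI, Multi-NLI and Adversarial NLI; the head here is
# randomly initialized, so the probabilities are only illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["entailment", "contradiction", "neutral"]   # assumed label order
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=len(LABELS))

premise = "The Eiffel Tower is located on the Champ de Mars in Paris, France."
hypothesis = "The Eiffel Tower is in Germany."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probabilities = torch.softmax(model(**inputs).logits, dim=-1)  # cf. Eq. (4)
print(dict(zip(LABELS, probabilities[0].tolist())))
```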
4 Performance of Quin+

We evaluate the three individual components of Quin+ (retrieval, sentence selection and entailment classification) and finally perform an end-to-end evaluation using various configurations.

Table 1 gives the recall@k and Mean Reciprocal Rank (MRR@100) of the passage retrieval models on FEVER and Factual-NLI+. We also compare the performance on a noisy extension of the FEVER dataset where additional passages from the Bing search engine are included as 'noise' passages. We see that when noise passages are added to the FEVER dataset, the gap between the hybrid passage retrieval model in Quin+ and sparse retrieval widens. This demonstrates the limitations of using sparse retrieval, and why it is crucial to have a dense retrieval model to surface relevant passages from a noisy corpus. Overall, the hybrid passage retrieval model in Quin+ gives the best performance compared to BM25 and the dense retrieval model.

Table 1: Performance of passage retrieval models.

(a) FEVER Dataset
Model    R@5    R@10   R@20   R@100  MRR
BM25     50.53  58.92  67.93  82.93  0.381
Dense    65.47  69.61  72.51  75.71  0.535
Hybrid   71.71  78.60  83.65  91.09  0.556

(b) FEVER with noise passages
Model    R@5    R@10   R@20   R@100  MRR
BM25     35.17  44.18  53.89  73.95  0.2649
Dense    54.10  62.13  68.09  75.24  0.4053
Hybrid   54.89  64.61  73.33  86.11  0.4074

(c) Factual-NLI+ Dataset
Model    R@5    R@10   R@20   R@100  MRR
BM25     45.02  53.20  61.56  77.96  0.347
Dense    59.66  67.09  72.23  78.52  0.461
Hybrid   61.29  70.03  77.51  87.90  0.465

Table 2 shows the token-level precision, recall and F1 score of the proposed sentence selection methods on the Factual-NLI dataset and a domain-specific (medical) claim verification dataset, SciFact (Wadden et al., 2020). We also compare the performance to a baseline sentence-level NLI approach, where we perform entailment classification (using the model described in Section 3.3) on each sentence of a passage and select the non-neutral sentences as evidence. We observe that the sequence labeling model gives the highest precision, recall and F1 score when tested on the Factual-NLI dataset. Further, its precision is significantly higher than that of the other methods.

Table 2: Performance of sentence selection methods.

(a) Factual-NLI Dataset
Model              Precision  Recall  F1
Baseline           67.74      91.87   77.98
Sequence labeling  94.78      92.11   93.43
Embedding-based    66.12      90.29   76.34

(b) SciFact Dataset
Model              Precision  Recall  F1
Baseline           62.21      71.54   66.55
Sequence labeling  69.38      68.45   68.91
Embedding-based    43.30      92.36   58.96

On the other hand, for the SciFact dataset, we see that the sequence labeling method remains the top performer in terms of precision and F1 score after fine-tuning, although its recall is lower than that of the embedding-based method. This shows that the sequence labeling model is able to mitigate the high false positive rate observed with the embedding-based selection method by taking into account the surrounding context.

The Factual-NLI+ dataset contains claims with passages that either support or refute the claims, with some sentences highlighted as ground truth specific evidence. Table 3 shows the performance of the entailment model in classifying the input evidence as supporting or refuting the claims. The input evidence can be in the form of whole passages, ground truth evidence sentences, or sentences selected by our sequence labeling model. We observe that the entailment classification model performs poorly when whole passages are passed as input evidence. However, when the specific sentences are passed as input, the precision, recall, and F1 measures improve. The reason is that our entailment classification model is trained mostly on short premises. As a result, it does better on sentence-level evidence compared to longer passages.

Table 3: Performance of the entailment classification model on different forms of input evidence.

(a) Supporting evidence
Input                     Precision  Recall  F1
Whole passages            63.40      53.93   58.28
Highlighted ground truth  82.15      60.05   69.38
Selected sentences        74.40      56.68   64.34

(b) Refuting evidence
Input                     Precision  Recall  F1
Whole passages            33.95      40.65   37.00
Highlighted ground truth  77.54      89.32   83.02
Selected sentences        75.27      81.96   78.47

Finally, we carry out an end-to-end evaluation of our fact-checking system on Factual-NLI+ using various configurations of top-k passage retrieval (BM25, dense, hybrid, for various values of k ∈ [5, 100]) and evidence selection approaches (embedding-based and sequence labeling). Table 4 shows the macro-average F1 score for the three classes (supporting, refuting, neutral) for some of the tested configurations. We see that dense or hybrid retrieval with evidence selection using the proposed sequence labeling model gives the best results. Even though hybrid retrieval seems to lead to slightly worse performance, it requires many fewer passages (6 instead of 50) and makes the system more efficient.

Table 4: End-to-end claim verification on Factual-NLI+ for different configurations.

Passage retrieval  Sentence selection  F1
BM25, k=5          Embedding-based     52.76
BM25, k=20         Embedding-based     47.65
BM25, k=5          Sequence labeling   49.65
Dense, k=5         Embedding-based     49.03
Dense, k=5         Sequence labeling   52.83
Dense, k=50        Sequence labeling   58.22
Hybrid, k=6        Embedding-based     50.29
Hybrid, k=6        Sequence labeling   57.24
Hybrid, k=50       Sequence labeling   52.60
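The hybrid rows in Table 4 merge the result lists of the dense retriever and BM25. The paper does not state the exact fusion scheme, so the sketch below assumes a simple round-robin interleaving with de-duplication, purely to make the "hybrid, k" configurations concrete.

```python
# Sketch of one possible hybrid retrieval fusion: alternate between the
# dense and BM25 rankings, skipping duplicates, until k passages are
# collected. The actual fusion strategy used in Quin+ is not specified in
# the paper; round-robin interleaving is an assumption.
from itertools import zip_longest

def hybrid_top_k(dense_results, bm25_results, k):
    """dense_results, bm25_results: ranked lists of passage ids."""
    merged, seen = [], set()
    for pair in zip_longest(dense_results, bm25_results):
        for pid in pair:
            if pid is None or pid in seen:
                continue
            seen.add(pid)
            merged.append(pid)
            if len(merged) == k:
                return merged
    return merged

print(hybrid_top_k(["d3", "d7", "d1"], ["d7", "d2", "d9"], k=4))
# ['d3', 'd7', 'd2', 'd1']
```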
5 System Demonstration

We have created a demo for verifying open-domain claims using the top 20 results from a web search engine. For a given claim, Quin+ returns relevant text passages with highlighted sentences. The passages are grouped into two sets, supporting and refuting. It computes a veracity rating based on the number of supporting and refuting evidence sentences. It returns "probably true" if there are more supporting evidence sentences, otherwise it returns "probably false". When the number of retrieved evidence sentences is low, it returns "inconclusive".
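The veracity rating described above reduces to counting the supporting and refuting evidence sentences. A minimal sketch of that rule follows; the minimum-evidence cut-off is an illustrative placeholder, since the paper does not report the exact threshold it uses.

```python
# Sketch of the veracity rating rule: count supporting and refuting evidence
# sentences and map the counts to a label. The minimum-evidence threshold is
# a placeholder; the paper does not give the exact value.
def veracity_rating(labels, min_evidence=3):
    """labels: entailment labels of the selected evidence sentences."""
    support = sum(1 for y in labels if y == "entailment")
    refute = sum(1 for y in labels if y == "contradiction")
    if support + refute < min_evidence:
        return "inconclusive"
    return "probably true" if support > refute else "probably false"

# Mirrors the example in Figure 4: 21 refuting sentences, 0 supporting.
print(veracity_rating(["contradiction"] * 21 + ["neutral"] * 4))  # probably false
```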
Figure 4: The Quin+ system returning relevant evidence and a veracity rating for a claim.

Figure 4 shows a screenshot of the system with a claim that has been assessed to be probably false based on the overwhelming number of refuting sentence evidence (21 refute versus 0 support). Quin+ can also be used on a large-scale corpus.

6 Conclusion & Future Work

In this work, we have presented a three-stage fact-checking system. We have demonstrated how a dense retrieval model can lead to higher recall when retrieving passages for fact-checking. We have also proposed two schemes to select relevant sentences, an embedding-based approach and a sequence labeling model, to improve the claim verification accuracy. Quin+ gave promising results on our extended Factual-NLI+ corpus, and is also able to verify open-domain claims using web search results. The source code of our system is publicly available at https://github.com/algoprog/Quin.

Even though our system is able to verify multiple open-domain claims successfully, it has some limitations. Quin+ is not able to effectively verify multi-hop claims that require the retrieval of multiple pieces of evidence. For the verification of multi-hop claims, methodologies inspired by multi-hop question answering could be utilized.

For the future development of large-scale fact-checking systems, we believe that a new benchmark needs to be introduced. The currently available datasets, including Factual-NLI+, are not suitable for evaluating the verification of claims using multiple sources.
References

Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. 2020. Generating fact checking explanations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. Pre-training tasks for embedding-based large-scale retrieval. In International Conference on Learning Representations (ICLR).

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018. UKP-Athene: Multi-sentence textual entailment for claim verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 103–108.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Rodrigo Nogueira, W. Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Joint Conference on Lexical and Computational Semantics.

Chris Samarinas, Wynne Hsu, and Mong Li Lee. 2020. Latent retrieval for large-scale fact-checking and question answering with NLI training. In IEEE International Conference on Tools with Artificial Intelligence (ICTAI).

Dominik Stammbach and Elliott Ash. 2020. e-FEVER: Explanations and summaries for automated fact checking. In Proceedings of the Conference on Truth and Trust Online (TTO).

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.

David Wadden, Kyle Lo, Lucy Lu Wang, Shanchuan Lin, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or fiction: Verifying scientific claims. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT).

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808.

Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. UCL machine reading group: Four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER).