Reasoning Over Semantic-Level Graph for Fact Checking
Wanjun Zhong1*, Jingjing Xu3*, Duyu Tang2, Zenan Xu1, Nan Duan2, Ming Zhou2, Jiahai Wang1 and Jian Yin1
1 The School of Data and Computer Science, Sun Yat-sen University. Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, P.R. China
2 Microsoft Research
3 MOE Key Lab of Computational Linguistics, School of EECS, Peking University
{zhongwj25@mail2, xuzn@mail2, wangjiah@mail, issjyin@mail}.sysu.edu.cn
{dutang, nanduan, mingzhou}@microsoft.com
jingjingxu@pku.edu.cn
(arXiv:1909.03745v3 [cs.CL] 25 Apr 2020)
* Work done while this author was an intern at Microsoft Research.

Abstract

Fact checking is a challenging task because verifying the truthfulness of a claim requires reasoning about multiple pieces of retrievable evidence. In this work, we present a method suitable for reasoning about the semantic-level structure of evidence. Unlike most previous works, which typically represent evidence sentences with either string concatenation or fusing the features of isolated evidence sentences, our approach operates on rich semantic structures of evidence obtained by semantic role labeling. We propose two mechanisms to exploit the structure of evidence while leveraging the advances of pre-trained models like BERT, GPT or XLNet. Specifically, using XLNet as the backbone, we first utilize the graph structure to re-define the relative distances of words, with the intuition that semantically related words should have short distances. Then, we adopt graph convolutional network and graph attention network to propagate and aggregate information from neighboring nodes on the graph. We evaluate our system on FEVER, a benchmark dataset for fact checking, and find that rich structural information is helpful and both our graph-based mechanisms improve the accuracy. Our model is the state-of-the-art system in terms of both official evaluation metrics, namely claim verification accuracy and FEVER score.

Figure 1: A motivating example for fact checking and the FEVER task. Verifying the claim requires understanding the semantic structure of multiple evidence sentences and the reasoning process over the structure.
Claim: The Rodney King riots took place in the most populous county in the USA.
Evidence #1: The 1992 Los Angeles riots, also known as the Rodney King riots, were a series of riots, lootings, arsons, and civil disturbances that occurred in Los Angeles County, California in April and May 1992.
Evidence #2: Los Angeles County, officially the County of Los Angeles, is the most populous county in the USA.

1 Introduction

The Internet provides an efficient way for individuals and organizations to quickly spread information to massive audiences. However, malicious people spread false news, which may have significant influence on public opinions, stock prices, even presidential elections (Faris et al., 2017). Vosoughi et al. (2018) show that false news reaches more people than the truth. The situation is more urgent as advanced pre-trained language models (Radford et al., 2019) can produce remarkably coherent and fluent texts, which lowers the barrier for the abuse of creating deceptive content. In this paper, we study fact checking with the goal of automatically assessing the truthfulness of a textual claim by looking for textual evidence.

Previous works are dominated by natural language inference models (Dagan et al., 2013; Angeli and Manning, 2014) because the task requires reasoning over the claim and the retrieved evidence sentences. They typically either concatenate evidence sentences into a single string, which is used in top systems in the FEVER challenge (Thorne et al., 2018b), or use feature fusion to aggregate the features of isolated evidence sentences (Zhou et al., 2019).
However, both methods fail to capture the rich semantic-level structure among multiple pieces of evidence, which also prevents the use of deeper reasoning models for fact checking. In Figure 1, we give a motivating example. Making the correct prediction requires a model to reason based on the understanding that "Rodney King riots" occurred in "Los Angeles County" from the first evidence, and that "Los Angeles County" is "the most populous county
in the USA" from the second evidence. It is therefore desirable to mine the semantic structure of evidence and leverage it to verify the truthfulness of the claim.

Under the aforementioned consideration, we present a graph-based reasoning approach for fact checking. With a given claim, we represent the retrieved evidence sentences as a graph, and then use the graph structure to guide the reasoning process. Specifically, we apply semantic role labeling (SRL) to parse each evidence sentence, and establish links between arguments to construct the graph. When developing the reasoning approach, we intend to simultaneously leverage the rich semantic structures of evidence embodied in the graph and the powerful contextual semantics learnt in pre-trained models like BERT (Devlin et al., 2018), GPT (Radford et al., 2019) and XLNet (Yang et al., 2019). To achieve this, we first re-define the distance between words based on the graph structure when producing contextual representations of words. Furthermore, we adopt graph convolutional network and graph attention network to propagate and aggregate information over the graph structure. In this way, the reasoning process employs semantic representations at both word/sub-word level and graph level.

We conduct experiments on FEVER (Thorne et al., 2018a), which is one of the most influential benchmark datasets for fact checking. FEVER consists of 185,445 verified claims, and evidence sentences for each claim are natural language sentences from Wikipedia. We follow the official evaluation protocol of FEVER, and demonstrate that our approach achieves state-of-the-art performance in terms of both claim classification accuracy and FEVER score. Ablation study shows that the integration of graph-driven representation learning mechanisms improves the performance. We briefly summarize our contributions as follows.
• We propose a graph-based reasoning approach for fact checking. Our system applies SRL to construct graphs and presents two graph-driven representation learning mechanisms.

• Results verify that both graph-based mechanisms improve the accuracy, and our final system achieves state-of-the-art performance on the FEVER dataset.

2 Task Definition and Pipeline

With a textual claim given as the input, the problem of fact checking is to find supporting evidence sentences to verify the truthfulness of the claim. We conduct our research on FEVER (Thorne et al., 2018a), short for Fact Extraction and VERification, a benchmark dataset for fact checking. Systems are required to retrieve evidence sentences from Wikipedia, and predict the claim as "SUPPORTED", "REFUTED" or "NOT ENOUGH INFO (NEI)", meaning that the claim is supported by the evidence, refuted by the evidence, or not verifiable, respectively. There are two official evaluation metrics in FEVER. The first is the accuracy for three-way classification. The second is FEVER score, which further measures the percentage of correctly retrieved evidence for the "SUPPORTED" and "REFUTED" categories. Both the statistics of the FEVER dataset and the equation for calculating FEVER score are given in Appendix B.

Figure 2: Our pipeline for fact checking on FEVER (claim → document selection → sentence selection → claim verification → SUPPORTED / REFUTED / NOT ENOUGH INFO). The main contribution of this work is a graph-based reasoning model for claim verification.

Here, we present an overview of our pipeline for FEVER, which follows the majority of previous studies. Our pipeline consists of three main components: a document retrieval model, a sentence-level evidence selection model, and a claim verification model. Figure 2 gives an overview of the pipeline.

With a given claim, the document retrieval model retrieves the most related documents from a given collection of Wikipedia documents. With the retrieved documents, the evidence selection model selects the top-k related sentences as the evidence. Finally, the claim verification model takes the claim and evidence sentences and outputs the veracity of the claim.

The main contribution of this work is the graph-based reasoning approach for claim verification, which is explained in detail in Section 3. Our strategies for document selection and evidence selection are described in Section 4.
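The three components compose as a simple sequential pipeline. The following is a minimal sketch of that composition; the function names and signatures are illustrative placeholders rather than the interfaces used in this work, although the top-10 documents and top-5 sentences match the settings reported in Section 4.

```python
# Minimal sketch of the three-stage pipeline described above.
# All function names and signatures are illustrative placeholders.

def fact_check(claim, wiki_collection, retrieve_documents, select_evidence, verify_claim):
    """Run document retrieval, evidence selection, and claim verification."""
    # 1) Retrieve the most related Wikipedia documents for the claim.
    documents = retrieve_documents(claim, wiki_collection, top_k=10)
    # 2) Select the top-k evidence sentences from those documents.
    evidence = select_evidence(claim, documents, top_k=5)
    # 3) Predict SUPPORTED / REFUTED / NOT ENOUGH INFO from claim + evidence.
    label = verify_claim(claim, evidence)
    return label, evidence
```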
Figure 3: The constructed graph for the motivating example with two evidence sentences. Each box describes a "tuple" which is extracted by SRL triggered by a verb (e.g., "occurred", "known", "is"). Blue solid lines indicate edges that connect arguments within a tuple and red dotted lines indicate edges that connect arguments across different tuples.

3 Graph-Based Reasoning Approach

In this section, we introduce our graph-based reasoning approach for claim verification, which is the main contribution of this paper. Taking a claim and retrieved evidence sentences as the input (details about how to retrieve evidence for a claim are described in Section 4), our approach predicts the truthfulness of the claim. For FEVER, it is a three-way classification problem, which predicts the claim as "SUPPORTED", "REFUTED" or "NOT ENOUGH INFO (NEI)".

The basic idea of our approach is to employ the intrinsic structure of evidence to assess the truthfulness of the claim. As shown in the motivating example in Figure 1, making the correct prediction needs a good understanding of the semantic-level structure of evidence and the reasoning process based on that structure. In this section, we first describe our graph construction module (§3.1). Then, we present how to apply the graph structure for fact checking, including a contextual representation learning mechanism with graph-based distance calculation (§3.2), and graph convolutional network and graph attention network to propagate and aggregate information over the graph (§3.3 and §3.4).

3.1 Graph Construction

Taking evidence sentences as the input, we would like to build a graph to reveal the intrinsic structure of these evidence. There might be many different ways to construct the graph, such as open information extraction (Banko et al., 2007), named entity recognition plus relation classification, sequence-to-sequence generation trained to produce structured tuples (Goodrich et al., 2019), etc. In this work, we adopt a practical and flexible way based on semantic role labeling (Carreras and Màrquez, 2004). Specifically, with the given evidence sentences, our graph construction operates in the following steps.

• For each sentence, we parse it into tuples (a sentence could be parsed as multiple tuples) with an off-the-shelf SRL toolkit developed by AllenNLP (https://demo.allennlp.org/semantic-role-labeling), which is a re-implementation of a BERT-based model (Shi and Lin, 2019).

• For each tuple, we regard its elements with certain types as the nodes of the graph. We heuristically set those types as verb, argument, location and temporal, which can also be easily extended to include more types. We create edges for every two nodes within a tuple.

• We create edges for nodes across different tuples to capture the structural information among multiple evidence sentences. Our idea is to create edges for nodes that are literally similar to each other.
Assuming entity A and entity B come from different tuples, we add one edge if one of the following conditions is satisfied: (1) A equals B; (2) A contains B; (3) the number of overlapped words between A and B is larger than half of the minimum number of words in A and B (a sketch of this heuristic is given at the end of this subsection).

Figure 3 shows the constructed graph of the evidence in the motivating example. In order to obtain the structure information of the claim, we use the same pipeline to represent the claim as a graph as well.

Our graph construction module offers one approach to modeling the structure of multiple evidence, which could be further developed in the future.
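The cross-tuple linking rules above can be sketched in a few lines. In this sketch, nodes are plain strings and tokenization is naive whitespace splitting; both are simplifying assumptions rather than the exact implementation used in this work.

```python
# Sketch of the cross-tuple edge heuristic: conditions (1)-(3) above.
# Nodes are plain strings and tokenization is whitespace splitting,
# which simplifies the actual implementation.

def should_link(a: str, b: str) -> bool:
    """Return True if two nodes from different tuples should be connected."""
    ta, tb = a.lower().split(), b.lower().split()
    if ta == tb:                                   # (1) A equals B
        return True
    if " ".join(ta) in " ".join(tb) or " ".join(tb) in " ".join(ta):
        return True                                # (2) one node contains the other
    overlap = len(set(ta) & set(tb))
    return overlap > min(len(ta), len(tb)) / 2     # (3) large word overlap

def add_cross_tuple_edges(tuples):
    """tuples: list of tuples, each given as a list of node strings."""
    edges = []
    for i, t1 in enumerate(tuples):
        for t2 in tuples[i + 1:]:
            edges += [(a, b) for a in t1 for b in t2 if should_link(a, b)]
    return edges

# Example: the two nodes describing "Los Angeles County" get linked.
print(add_cross_tuple_edges([["Los Angeles County, California"],
                             ["Los Angeles County", "the most populous county"]]))
```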
Figure 4: An overview of our graph-based reasoning approach for claim verification. Taking a claim and evidence sentences as the input, we first calculate contextual word representations with graph-based distance (§3.2). After that, we use graph convolutional network to propagate information over the graph (§3.3), and use graph attention network to aggregate information (§3.4) before making the final prediction.

3.2 Contextual Word Representations with Graph Distance

We describe the use of the graph for learning graph-enhanced contextual representations of words. (In the Transformer-based representation learning pipeline, the basic computational unit can also be a word-piece. For simplicity, we use the term "word" in this paper.)

Our basic idea is to shorten the distance between two semantically related words on the graph, which helps to enhance their relationship when we calculate contextual word representations with a Transformer-based (Vaswani et al., 2017) pre-trained model like BERT and XLNet. Suppose we have five evidence sentences {s1, s2, ..., s5}, and the word w_1^i from s1 and the word w_5^j from s5 are connected on the graph. Simply concatenating the evidence sentences as a single string fails to capture their semantic-level structure, and would give a large distance to w_1^i and w_5^j, namely the number of words between them across the other three sentences (i.e., s2, s3, and s4).
An intuitive way to achieve our goal is to define an N × N matrix of distances of words along the graph, where N is the total number of words in the evidence. However, this is unacceptable in practice because the representation learning procedure would take huge memory space, which is also observed by Shaw et al. (2018).

In this work, we adopt the pre-trained model XLNet (Yang et al., 2019) as the backbone of our approach because it naturally involves the concept of relative position. (Our approach can also be easily adapted to BERT by adding relative position like Shaw et al. (2018).) The pre-trained model captures rich contextual representations of words, which is helpful for our task that requires sentence-level reasoning. Considering the aforementioned issues, we implement an approximate solution to trade off between the efficiency of implementation and the informativeness of the graph. Specifically, we reorder the evidence sentences with a topology sort algorithm, with the intuition that closely linked nodes should exist in neighboring sentences. This would prefer that neighboring sentences contain either parent nodes or sibling nodes, so as to better capture the semantic relatedness between different evidence sentences. We present our implementation in Appendix A. The algorithm begins from nodes without incident relations. For each node without incident relations, we recursively visit its child nodes in a depth-first searching way.

After obtaining the graph-based relative positions of words, we feed the sorted sequence into XLNet to obtain the contextual representations. Meanwhile, we obtain the representation h([CLS]) for a special token [CLS], which stands for the joint representation of the claim and the evidence in the Transformer-based architecture.
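A minimal sketch of this reordering idea is shown below. It works on a simplified sentence-level graph (an edge whenever two sentences share a linked node) rather than the node-level procedure of Appendix A, so it is an illustrative simplification, not the exact algorithm.

```python
# Sketch of reordering evidence sentences so that linked sentences become
# neighbors before being fed to XLNet. This operates on a simplified
# sentence-level graph; Appendix A describes the node-level version.

def reorder_sentences(num_sentences, sentence_edges):
    """Depth-first traversal that places linked sentences next to each other."""
    children = {i: [] for i in range(num_sentences)}
    for parent, child in sentence_edges:
        children[parent].append(child)

    visited = [False] * num_sentences
    order = []

    def dfs(i):
        if visited[i]:
            return
        visited[i] = True
        order.append(i)
        for child in children[i]:
            dfs(child)

    for i in range(num_sentences):      # start from each not-yet-visited sentence
        dfs(i)
    return order

# Sentences 0 and 2 share a node, so sentence 2 is moved right after sentence 0.
print(reorder_sentences(4, [(0, 2)]))   # -> [0, 2, 1, 3]
```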
3.3 Graph Convolutional Network

We have injected the graph information into the Transformer and obtained h([CLS]), which captures the semantic interaction between the claim and the evidence at word level. (By "word" in "word level", we mean the basic computational unit in XLNet; thus h([CLS]) captures the sophisticated interaction between words via multi-layer multi-head attention operations.) As shown in our motivating example in Figure 1 and the constructed graph in Figure 3, the reasoning process needs to operate at span/argument level, where the basic computational unit typically consists of multiple words like "Rodney King riots" and "the most populous county in the USA".

To further exploit graph information beyond word level, we first calculate the representation of a node, which is a word span in the graph, by averaging the contextual representations of the words contained in the node. After that, we employ a multi-layer graph convolutional network (GCN) (Kipf and Welling, 2016) to update the node representations by aggregating representations from their neighbors on the graph. Formally, we denote $G$ as the graph constructed by the previous graph construction method and let $H \in \mathbb{R}^{N^v \times d}$ be a matrix containing the representations of all nodes, where $N^v$ and $d$ denote the number of nodes and the dimension of node representations, respectively. Each row $H_i \in \mathbb{R}^d$ is the representation of node $i$. We introduce an adjacency matrix $A$ of graph $G$ and its degree matrix $D$, where we add self-loops to matrix $A$ and $D_{ii} = \sum_j A_{ij}$. A one-layer GCN aggregates information through one-hop edges, which is calculated as follows:

$H_i^{(1)} = \rho(\tilde{A} H_i W_0)$,   (1)

where $H_i^{(1)} \in \mathbb{R}^d$ is the new d-dimensional representation of node $i$, $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix, $W_0$ is a weight matrix, and $\rho$ is an activation function. To exploit information from multi-hop neighboring nodes, we stack multiple GCN layers:

$H_i^{(j+1)} = \rho(\tilde{A} H_i^{(j)} W_j)$,   (2)

where $j$ denotes the layer number and $H_i^{(0)}$ is the initial representation of node $i$ initialized from the contextual representation. We simplify $H^{(k)}$ as $H$ for later use, where $H$ indicates the representations of all nodes updated by the k-layer GCN.

The graph learning mechanism will be performed separately for the claim-based and evidence-based graphs. Therefore, we denote $H^c$ and $H^e$ as the representations of all nodes in the claim-based graph and the evidence-based graphs, respectively. Afterwards, we utilize the graph attention network to align the graph-level node representations learned for the two graphs before making the final prediction.
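A minimal NumPy sketch of the propagation in Equations (1) and (2) follows. The random weight initialization and the choice of ReLU as the activation ρ are illustrative assumptions.

```python
import numpy as np

# Sketch of the GCN propagation in Equations (1)-(2). Weights are randomly
# initialized and rho is chosen as ReLU purely for illustration.

def normalize_adjacency(A):
    """A_tilde = D^{-1/2} (A + I) D^{-1/2}, with self-loops added."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.diag(d_inv_sqrt) @ A @ np.diag(d_inv_sqrt)

def gcn(H, A, num_layers=2, seed=0):
    """Stacked layers: H^{(j+1)} = rho(A_tilde H^{(j)} W_j)."""
    rng = np.random.default_rng(seed)
    A_tilde = normalize_adjacency(A)
    d = H.shape[1]
    for _ in range(num_layers):
        W = rng.normal(scale=0.1, size=(d, d))
        H = np.maximum(A_tilde @ H @ W, 0.0)
    return H

# Toy usage: 3 nodes (word spans) with 4-dimensional initial representations,
# obtained e.g. by averaging contextual word vectors inside each span.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H0 = np.random.default_rng(1).normal(size=(3, 4))
H = gcn(H0, A)
print(H.shape)   # (3, 4)
```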
3.4 Graph Attention Network

We explore the related information between the two graphs and make a semantic alignment for the final prediction. Let $H^e \in \mathbb{R}^{N_e^v \times d}$ and $H^c \in \mathbb{R}^{N_c^v \times d}$ denote matrices containing the representations of all nodes in the evidence-based and claim-based graphs respectively, where $N_e^v$ and $N_c^v$ denote the number of nodes in the corresponding graph.

We first employ a graph attention mechanism (Veličković et al., 2017) to generate a claim-specific evidence representation for each node in the claim-based graph. Specifically, we take each $h_c^i \in H^c$ as a query, and take all node representations $h_e^j \in H^e$ as keys. We then perform graph attention on the nodes, using an attention mechanism $a : \mathbb{R}^F \times \mathbb{R}^F \rightarrow \mathbb{R}$ to compute attention coefficients as follows:

$e_{ij} = a(W_c h_c^i, W_e h_e^j)$   (3)

which indicates the importance of evidence node $j$ to claim node $i$. $W_c \in \mathbb{R}^{F \times d}$ and $W_e \in \mathbb{R}^{F \times d}$ are weight matrices and $F$ is the dimension of the attention feature. We use the dot-product function as $a$ here. We then normalize $e_{ij}$ using the softmax function:

$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_e^v} \exp(e_{ik})}$   (4)

After that, we calculate a claim-centric evidence representation $X = [x_1, \ldots, x_{N_c^v}]$ using the weighted sum over $H^e$:

$x_i = \sum_{j \in N_e^v} \alpha_{ij} h_e^j$   (5)

We then perform node-to-node alignment and calculate the aligned vectors $A = [a_1, \ldots, a_{N_c^v}]$ from the claim node representations $H^c$ and the claim-centric evidence representation $X$:

$a_i = f_{align}(h_c^i, x_i)$,   (6)

where $f_{align}(\cdot)$ denotes the alignment function. Inspired by Shen et al. (2018), we design our alignment function as:

$f_{align}(x, y) = W_a[x, y, x - y, x \odot y]$,   (7)

where $W_a \in \mathbb{R}^{d \times 4d}$ is a weight matrix and $\odot$ is the element-wise Hadamard product. The final output $g$ is obtained by mean pooling over $A$. We then feed the concatenated vector of $g$ and the final hidden vector h([CLS]) from XLNet through an MLP layer for the final prediction.
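Equations (3)–(7) can be sketched compactly in NumPy, as below. The random weight matrices and the vectorized (matrix) form of the per-node equations are illustrative choices; only the computation pattern follows the text above.

```python
import numpy as np

# Sketch of Equations (3)-(7): dot-product attention from claim nodes to
# evidence nodes, then node-to-node alignment and mean pooling. All weight
# matrices are random placeholders.

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def claim_centric_alignment(Hc, He, F=64, seed=0):
    """Hc: (Nc, d) claim-node reps; He: (Ne, d) evidence-node reps."""
    rng = np.random.default_rng(seed)
    d = Hc.shape[1]
    Wc = rng.normal(scale=0.1, size=(F, d))
    We = rng.normal(scale=0.1, size=(F, d))
    Wa = rng.normal(scale=0.1, size=(d, 4 * d))

    e = (Hc @ Wc.T) @ (He @ We.T).T              # Eq. (3): dot-product scores, (Nc, Ne)
    alpha = softmax(e, axis=1)                   # Eq. (4): normalize over evidence nodes
    X = alpha @ He                               # Eq. (5): claim-centric evidence reps
    concat = np.concatenate([Hc, X, Hc - X, Hc * X], axis=1)
    aligned = concat @ Wa.T                      # Eq. (6)-(7): f_align per claim node
    return aligned.mean(axis=0)                  # final graph-level output g

g = claim_centric_alignment(np.random.randn(3, 16), np.random.randn(5, 16))
print(g.shape)   # (16,)
```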
4 Document Retrieval and Evidence Selection

In this section, we briefly describe our document retrieval and evidence selection components to make the paper self-contained.

4.1 Document Retrieval

The document retrieval model takes a claim and a collection of Wikipedia documents as the input, and returns the m most relevant documents.

We mainly follow Nie et al. (2019), the top-performing system on the FEVER shared task (Thorne et al., 2018b). The document retrieval model first uses keyword matching to filter candidate documents from the massive Wikipedia documents. Then, NSMN (Nie et al., 2019) is applied to handle the documents with disambiguation titles, which are 10% of the whole documents. Documents without a disambiguation title are assigned higher scores in the resulting list. The input to the NSMN model includes the claim and the candidate documents with disambiguation titles. At a high level, the NSMN model has encoding, alignment, matching and output layers. Interested readers are recommended to refer to the original paper for more details.

Finally, we select the top 10 documents from the resulting list.

4.2 Sentence-Level Evidence Selection

Taking a claim and all the sentences from the retrieved documents as the input, the evidence selection model returns the top k most relevant sentences.

We regard evidence selection as a semantic matching problem, and leverage the rich contextual representations embodied in pre-trained models like XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019a) to measure the relevance of a claim to every evidence candidate. Let's take XLNet as an example. The input of the sentence selector is

ce_i = [Claim, SEP, Evidence_i, SEP, CLS]

where Claim and Evidence_i indicate the tokenized word-pieces of the original claim and the i-th evidence candidate, and SEP and CLS are symbols indicating the end of a sentence and the end of the whole input, respectively. The final representation $h_{ce_i} \in \mathbb{R}^d$, where $d$ denotes the dimension of the hidden vector, is obtained by extracting the hidden vector of the [CLS] token.

After that, we employ an MLP layer and a softmax layer to compute a score $s^+_{ce_i}$ for each evidence candidate. Then, we rank all the evidence sentences by the score $s^+_{ce_i}$. The model is trained on the training data with a standard cross-entropy loss. Following the official setting in FEVER, we select the top 5 evidence sentences. The performance of our evidence selection model is shown in Appendix C.
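Below is a minimal PyTorch sketch of the ranking head described in this subsection. Producing the [CLS] vectors with XLNet or RoBERTa is omitted; `cls_vectors` is assumed to be precomputed, and the binary relevant/irrelevant output layer is an assumption about how the cross-entropy objective is set up rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

# Sketch of the evidence ranking head: an MLP + softmax over the [CLS]
# vector of each claim-evidence pair. Encoding with XLNet/RoBERTa is not
# shown; cls_vectors has shape (num_candidates, hidden_dim).

class EvidenceScorer(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 2))   # irrelevant / relevant

    def forward(self, cls_vectors):
        # s+ : probability of the "relevant" class for each candidate.
        return torch.softmax(self.mlp(cls_vectors), dim=-1)[:, 1]

def select_top_evidence(scorer, cls_vectors, k=5):
    """Rank candidates by s+ and keep the top-k sentences."""
    scores = scorer(cls_vectors)
    return torch.topk(scores, k=min(k, scores.numel())).indices

scorer = EvidenceScorer(hidden_dim=768)
top = select_top_evidence(scorer, torch.randn(20, 768))
```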
5 Experiments

We evaluate on FEVER (Thorne et al., 2018a), a benchmark dataset for fact extraction and verification. Each instance in the FEVER dataset consists of a claim, groups of ground-truth evidence from Wikipedia, and a label (i.e., "SUPPORTED", "REFUTED" or "NOT ENOUGH INFO (NEI)") indicating its veracity. FEVER includes a dump of Wikipedia, which contains 5,416,537 pre-processed documents. The two official evaluation metrics of FEVER are label accuracy and FEVER score, as described in Section 2. Label accuracy is the primary evaluation metric we apply for our experiments because it directly measures the performance of the claim verification model. We also report FEVER score for comparison, which measures whether both the predicted label and the retrieved evidence are correct. No evidence is required if the predicted label is NEI.

5.1 Baselines

We compare our system to the following baselines, including three top-performing systems on the FEVER shared task, a recent work GEAR (Zhou et al., 2019), and a concurrent work by Liu et al. (2019b).

• Nie et al. (2019) employ a semantic matching neural network for both evidence selection and claim verification.

• Yoneda et al. (2018) infer the veracity of each claim-evidence pair and make the final prediction by aggregating multiple predicted labels.

• Hanselowski et al. (2018) encode each claim-evidence pair separately, and use a pooling function to aggregate features for prediction.
• GEAR (Zhou et al., 2019) uses BERT to obtain a claim-specific representation for each evidence sentence, and applies a graph network by regarding each evidence sentence as a node in the graph.

• KGAT (Liu et al., 2019b) is concurrent with our work; it regards sentences as the nodes of a graph and uses a Kernel Graph Attention Network to aggregate information.

5.2 Model Comparison

Table 1 reports the performance of our model and the baselines on the blind test set with the scores shown on the public leaderboard. (The public leaderboard for perpetual evaluation of FEVER is https://competitions.codalab.org/competitions/18814#results; DREAM is our user name on the leaderboard.)

Table 1: Performance on the blind test set on FEVER. Our approach is abbreviated as DREAM.

Method | Label Acc (%) | FEVER Score (%)
Hanselowski et al. (2018) | 65.46 | 61.58
Yoneda et al. (2018) | 67.62 | 62.52
Nie et al. (2019) | 68.21 | 64.21
GEAR (Zhou et al., 2019) | 71.60 | 67.10
KGAT (Liu et al., 2019b) | 72.81 | 69.40
DREAM (our approach) | 76.85 | 70.60

As shown in Table 1, in terms of label accuracy, our model significantly outperforms previous systems with 76.85% on the test set. It is worth noting that our approach, which exploits the explicit graph-level semantic structure of evidence obtained by SRL, outperforms GEAR and KGAT, both of which regard sentences as the nodes and use the model to learn the implicit structure of evidence. (We do not claim that the superiority of our system over GEAR and KGAT comes only from the explicit graph structure, because we also differ in other components like sentence selection and the pre-trained model.) By the time our paper is submitted, our system achieves state-of-the-art performance in terms of both evaluation metrics on the leaderboard.

5.3 Ablation Study

Table 2 presents the label accuracy on the development set after eliminating different components (including the graph-based relative distance (§3.2) and the graph convolutional network and graph attention network (§3.3 and §3.4)) separately from our model. The last row in Table 2 corresponds to the baseline where all the evidence sentences are simply concatenated as a single string, where no explicit graph structure is used at all for fact verification.

Table 2: Ablation study on the development set.

Model | Label Accuracy
DREAM | 79.16
- w/o Relative Distance | 78.35
- w/o GCN & GAN | 77.12
- w/o both above modules | 75.40

As shown in Table 2, compared to the XLNet baseline, incorporating both graph-based modules brings a 3.76% improvement in label accuracy. Removing the graph-based distance drops 0.81% in terms of label accuracy. The graph-based distance mechanism can shorten the distance between two closely linked nodes and help the model learn their dependency. Removing the graph-based reasoning module drops 2.04%, because the graph reasoning module captures the structural information and performs deep reasoning over it. Figure 5 gives a case study of our approach.
5.4 Error Analysis

We randomly select 200 incorrectly predicted instances and summarize the primary types of errors.

The first type of errors is caused by failing to match the semantic meaning between phrases that describe the same event. For example, the claim states "Winter's Tale is a book", while the evidence states "Winter's Tale is a 1983 novel by Mark Helprin". The model fails to realize that "novel" belongs to "book" and states that the claim is refuted. Solving this type of errors needs to involve external knowledge (e.g., ConceptNet (Speer et al., 2017)) that can indicate logical relationships between different events.

The misleading information in the retrieved evidence causes the second type of errors. For example, the claim states "The Gifted is a movie", and the ground-truth evidence states "The Gifted is an upcoming American television series". However, the retrieved evidence also contains "The Gifted is a 2014 Filipino dark comedy-drama movie", which misleads the model to make the wrong judgment.

Figure 5: A case study of our approach. Facts shared across the claim and the evidence are highlighted with different colors in the original figure.
Claim: "Congressional Space Medal of Honor is the highest award given only to astronauts by NASA." Tuples: ('Congressional Space Medal of Honor', 'is', 'the highest award given only to astronauts by NASA'); ('the highest award', 'given', 'only', 'to astronauts', 'by NASA').
Evidence #1: "The highest award given by NASA, Congressional Space Medal of Honor is awarded by the President of the United States in Congress's name on recommendations from the Administrator of the National Aeronautics and Space Administration." Tuples: ('The highest award', 'given', 'by NASA'); ('Congressional Space Medal of Honor', 'awarded', 'by the President of the United States').
Evidence #2: "To be awarded the Congressional Space Medal of Honor, an astronaut must perform feats of extraordinary accomplishment while participating in space flight under the authority of NASA." Tuples: ('awarded', 'the Congressional Space Medal of Honor'); ('To be awarded the Congressional Space Medal of Honor', 'an astronaut', 'perform', 'feats of extraordinary accomplishment'); ('an astronaut', 'participating', 'in space flight', 'under the authority of NASA').

6 Related Work

In general, fact checking involves assessing the truthfulness of a claim. In the literature, a claim can be
a text or a subject-predicate-object triple (Nakashole and Mitchell, 2014). In this work, we only consider textual claims. Existing datasets differ in data source and in the type of supporting evidence for verifying the claim. An early work by Vlachos and Riedel (2014) constructs 221 labeled claims in the political domain from POLITIFACT.COM and CHANNEL4.COM, giving metadata of the speaker as the evidence. POLITIFACT is further investigated by following works, including Ferreira and Vlachos (2016) who build Emergent with 300 labeled rumors and about 2.6K news articles, Wang (2017) who builds LIAR with 12.8K annotated short statements and six fine-grained labels, and Rashkin et al. (2017) who collect claims without metadata while providing 74K news articles. We study FEVER (Thorne et al., 2018a), which requires aggregating information from multiple pieces of evidence from Wikipedia for making the conclusion. FEVER contains 185,445 annotated instances, which to the best of our knowledge is the largest benchmark dataset in this area.

The majority of participating teams in the FEVER challenge (Thorne et al., 2018b) use the same pipeline consisting of three components, namely document selection, evidence sentence selection, and claim verification. In the document selection phase, participants typically extract named entities from a claim as the query and use the Wikipedia search API. In the evidence selection phase, participants measure the similarity between the claim and an evidence sentence candidate by training a classification model like Enhanced LSTM (Chen et al., 2016) in a supervised setting or by using a string similarity function like TF-IDF without trainable parameters. Padia et al. (2018) utilize semantic frames for evidence selection. In this work, our focus is the claim classification phase. The top-ranked three systems aggregate pieces of evidence by concatenating evidence sentences into a single string (Nie et al., 2019), classifying each evidence-claim pair separately and merging the results (Yoneda et al., 2018), or encoding each evidence-claim pair followed by a pooling operation (Hanselowski et al., 2018). Zhou et al. (2019) are the first to use BERT to calculate claim-specific evidence sentence representations, and then develop a graph network to aggregate the information on top of BERT, regarding each evidence as a node in the graph. Our work differs from Zhou et al. (2019) in that (1) the construction of our graph requires understanding the syntax of each sentence, which could be viewed as a more fine-grained graph, and (2) both the contextual representation learning module and the reasoning module have model innovations that take the graph information into consideration. Instead of training each component separately, Yin and Roth (2018) show that joint learning could improve both claim verification and evidence selection.

7 Conclusion

In this work, we present a graph-based approach for fact checking.
When assessing the veracity of a claim given multiple evidence sentences, our approach is built upon an automatically constructed graph, which is derived based on semantic role labeling. To better exploit the graph information, we propose two graph-based modules, one for calculating contextual word embeddings using graph-based distance in XLNet, and the other for learning representations of graph components and reasoning over the graph. Experiments show that both graph-based modules bring improvements, and our final system is the state-of-the-art on the public leaderboard by the time our paper is submitted.

Evidence selection is an important component of fact checking, as finding irrelevant evidence may lead to different predictions. A potential solution is to jointly learn the evidence selection and claim verification models, which we leave as future work.
Acknowledgments

Wanjun Zhong, Zenan Xu, Jiahai Wang and Jian Yin are supported by the National Natural Science Foundation of China (U1711262, U1611264, U1711261, U1811261, U1811264, U1911203), the National Key R&D Program of China (2018YFB1004404), the Guangdong Basic and Applied Basic Research Foundation (2019B1515130001), and the Key R&D Program of Guangdong Province (2018B010107005). The corresponding author is Jian Yin.

References

Gabor Angeli and Christopher D. Manning. 2014. NaturalLI: Natural logic inference for common sense reasoning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 534–545.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI, volume 7, pages 2670–2676.

Xavier Carreras and Lluís Màrquez. 2004. Introduction to the CoNLL-2004 shared task: Semantic role labeling. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004, pages 89–97.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2016. Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.
BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Robert Faris, Hal Roberts, Bruce Etling, Nikki Bourassa, Ethan Zuckerman, and Yochai Benkler. 2017. Partisanship, propaganda, and disinformation: Online media and the 2016 US presidential election.

William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1168.

Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad Saleh. 2019. Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 166–175. ACM.

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018. UKP-Athene: Multi-sentence textual entailment for claim verification. arXiv preprint arXiv:1809.01479.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019a. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Zhenghao Liu, Chenyan Xiong, and Maosong Sun. 2019b. Kernel graph attention network for fact verification. arXiv preprint arXiv:1910.09796.

Ndapandula Nakashole and Tom M. Mitchell. 2014. Language-aware truth assessment of fact candidates. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1009–1019.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6859–6866.

Ankur Padia, Francis Ferraro, and Tim Finin. 2018. Team UMBC-FEVER: Claim verification using semantic lexical resources. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 161–165, Brussels, Belgium. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.

Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Improved semantic-aware network embedding with fine-grained word alignment. arXiv preprint arXiv:1808.09633.
Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255.

Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018b. The fact extraction and verification (FEVER) shared task. arXiv preprint arXiv:1811.10971.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22.

Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science, 359(6380):1146–1151.

William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Wenpeng Yin and Dan Roth. 2018. TwoWingOS: A two-wing optimization strategy for evidential claim verification. arXiv preprint arXiv:1808.03465.

Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. UCL machine reading group: Four factor framework for fact finding (HexaF).
In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 97–102.

Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2019. GEAR: Graph-based evidence aggregating and reasoning for fact verification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 892–901, Florence, Italy. Association for Computational Linguistics.

A Topology Sort Algorithm

Algorithm 1 Graph-based Distance Calculation Algorithm.
Require: A sequence of nodes S = {s1, s2, ..., sn}; a set of relations R = {r1, r2, ..., rm}
1: function DFS(node, visited, sorted_sequence)
2:   for each child sc in node's children do
3:     if sc has no incident edges and visited[sc] == 0 then
4:       visited[sc] = 1
5:       DFS(sc, visited, sorted_sequence)
6:     end if
7:   end for
8:   sorted_sequence.insert(0, node)
9: end function
10: sorted_sequence = []
11: visited = [0 for i in range(n)]
12: S, R = changed_to_acyclic_graph(S, R)
13: for each node si in S do
14:   if si has no incident edges and visited[i] == 0 then
15:     visited[i] = 1
16:     for each child sc in si's children do
17:       DFS(sc, visited, sorted_sequence)
18:     end for
19:     sorted_sequence.insert(0, si)
20:   end if
21: end for
22: return sorted_sequence

B FEVER

The statistics of FEVER are shown in Table 3.

Split | SUPPORTED | REFUTED | NEI
Training | 80,035 | 29,775 | 35,659
Dev | 6,666 | 6,666 | 6,666
Test | 6,666 | 6,666 | 6,666

Table 3: Split size of SUPPORTED, REFUTED and NOT ENOUGH INFO (NEI) classes in FEVER.

FEVER score is calculated with Equation 8, where y is the ground-truth label, ŷ is the predicted label, E = [E1, ..., Ek] is a set of ground-truth evidence, and Ê = [Ê1, ..., Ê5] is a set of predicted evidence. The FEVER score of a system is the percentage of instances for which Instance Correct holds.

$\mathrm{Instance\ Correct}(y, \hat{y}, E, \hat{E}) \stackrel{\mathrm{def}}{=} y = \hat{y} \wedge (y = NEI \vee \mathrm{Evidence\ Correct}(E, \hat{E}))$   (8)
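A minimal sketch of Equation (8) and the resulting FEVER score follows. Representing each ground-truth evidence group and the predicted evidence as sets of sentence identifiers is an assumption about the data format, not the official scorer.

```python
# Sketch of Equation (8) and the FEVER score. Each ground-truth evidence
# group and the predicted evidence are assumed to be sets/lists of sentence
# identifiers; this format is an assumption, not the official scoring code.

def instance_correct(y, y_hat, gold_groups, predicted_evidence):
    if y != y_hat:
        return False
    if y == "NOT ENOUGH INFO":
        return True
    # Evidence is correct if some complete gold group is among the predictions.
    predicted = set(predicted_evidence)
    return any(set(group) <= predicted for group in gold_groups)

def fever_score(instances):
    """instances: list of (y, y_hat, gold_groups, predicted_evidence) tuples."""
    correct = sum(instance_correct(*inst) for inst in instances)
    return correct / max(len(instances), 1)
```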
C Evidence Selection Results

In this part, we present the performance of the sentence-level evidence selection module that we develop with different backbones. We take the concatenation of the claim and each evidence as input, and take the last hidden vector to calculate the score for evidence ranking. In our experiments, we try both RoBERTa and XLNet. From Table 4, we can see that RoBERTa performs slightly better than XLNet here. When we submit our system to the leaderboard, we use RoBERTa as the evidence selection model.

Model | Dev Acc. | Dev Rec. | Dev F1 | Test Acc. | Test Rec. | Test F1
XLNet | 26.60 | 87.33 | 40.79 | 25.55 | 85.34 | 39.33
RoBERTa | 26.67 | 87.64 | 40.90 | 25.63 | 85.57 | 39.45

Table 4: Results of evidence selection models.

D Training Details

In this part, we describe the training details of our experiments. We employ cross-entropy loss as the loss function and apply AdamW as the optimizer for model training. For the evidence selection model, we set the learning rate to 1e-5, the batch size to 8, and the maximum sequence length to 128.

For the claim verification model, the XLNet network and the graph-based reasoning network are trained separately. We first train XLNet, then freeze the parameters of XLNet and train the graph-based reasoning network. We set the learning rate to 2e-6, the batch size to 6, and the maximum sequence length to 256. We set the dimension of node representations to 100.
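A minimal sketch of this two-stage setup is shown below. The model objects are placeholders, and only the hyperparameters stated in this appendix are reflected; the rest of the training loop is omitted.

```python
import torch
from torch.optim import AdamW

# Sketch of the two-stage training described above. `xlnet_model` and
# `graph_reasoning_model` are placeholders for the actual modules.

def build_claim_verification_optimizer(xlnet_model, graph_reasoning_model):
    # Stage 1 (not shown): fine-tune XLNet alone.
    # Stage 2: freeze XLNet and train only the graph-based reasoning network.
    for param in xlnet_model.parameters():
        param.requires_grad = False
    return AdamW(graph_reasoning_model.parameters(), lr=2e-6)

# Hyperparameters stated in Appendix D.
evidence_selection_config = {"lr": 1e-5, "batch_size": 8, "max_seq_len": 128}
claim_verification_config = {"lr": 2e-6, "batch_size": 6, "max_seq_len": 256,
                             "node_dim": 100}
loss_fn = torch.nn.CrossEntropyLoss()
```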