ProoFVer: Natural Logic Theorem Proving for Fact Verification
Amrith Krishna (University of Cambridge, ak2329@cam.ac.uk), Sebastian Riedel (Facebook AI and University College London, sriedel@fb.com), Andreas Vlachos (University of Cambridge, av308@cam.ac.uk)

Abstract

We propose ProoFVer, a proof system for fact verification using natural logic. The textual entailment model in ProoFVer is a seq2seq model generating valid natural logic based logical inferences as its proofs. The generation of proofs makes ProoFVer an explainable system. The proof consists of iterative lexical mutations of spans in the claim with spans in a set of retrieved evidence sentences. Further, each such mutation is marked with an entailment relation using natural logic operators. The veracity of a claim is determined solely based on the sequence of natural logic relations present in the proof. By design, this makes ProoFVer a faithful by construction system that generates faithful explanations. ProoFVer outperforms existing fact verification models, with more than two percent absolute improvements in performance and robustness. In addition to its explanations being faithful, ProoFVer also scores high on rationale extraction, with a five point absolute improvement compared to attention based rationales in existing models. Finally, we find that humans correctly simulate ProoFVer's decisions more often using its proofs than the decisions of an existing model that directly uses the retrieved evidence for decision making.

1 Introduction

Fact verification systems, designed for determining a claim's veracity, typically consist of an evidence retrieval model and a textual entailment model. Recently, several of the high performing fact verification models have used black box neural models for entailment recognition. In such models, the reasoning in the decision making process remains opaque to humans. On the contrary, proof systems that use logic based approaches can provide transparency in their decision making process. For instance, NaturalLI (Angeli and Manning, 2014), proposed for natural language inference, provides explicit proofs in the form of logical inference using natural logic to determine the entailment of a given sentence. However, the performance of such proof systems often does not match that of the neural models. For a task like fact verification, with sensitive areas of application (Graves, 2018), explaining the reasoning behind the decision making process is valuable. Keeping both performance and explainability in mind, we propose a fact verification system whose entailment model is a neural seq2seq proof generator which generates proofs for determining the veracity of a claim.

In this work, we propose ProoFVer, a proof system for fact verification using natural logic. Here, the proofs generated by the proof generator are in the form of natural logic based logical inference. Both ProoFVer and NaturalLI follow the natural logic based theory of compositional entailment, originally proposed in NatLog (MacCartney and Manning, 2007). ProoFVer is the first such system to employ natural logic based proofs in fact verification. For a given claim and evidence as shown in Figure 1, the corresponding proof is shown in Figure 2. Here, at each step in the proof, a claim span is mutated with a span in the evidence. Each such mutation is marked with an entailment relation, by assigning a natural logic operator (NatOp; Angeli and Manning, 2014). A step in the proof can thus be represented as a triple, consisting of the aligned spans in the mutation and its assigned NatOp.
The seq2seq proof generator generates a sequence of such triples, as shown in Figure 1. Here, the mutations in the first and last triples occur between equivalent lexical entries and hence are assigned the equivalence NatOp (≡). However, the mutation in the second triple is between lexical entries which differ from each other, and it is thus assigned the alternation NatOp (⥯). The sequence of NatOps from the proof in Figure 2 forms a transition sequence in the deterministic finite state automaton (DFA) shown in Figure 1, which in this case terminates at the 'REFUTES (R)' state. This implies that the evidence does not support the claim. Unlike other natural logic based proof systems, ProoFVer can obtain spans from multiple sentences for a single proof. The seq2seq model is trained using a heuristically annotated dataset, obtained by combining information from the publicly available FEVER dataset and the FEVER fact correction dataset (Thorne et al., 2018a; Thorne and Vlachos, 2021b).

Moreover, the generated proof acts as an explanation, and in particular a faithful explanation. An explanation is faithful if it reflects the same information as that used for decision making. By design, we ensure that the generated proof is solely used for determining the label of the claim, making ProoFVer a faithful by construction model (Lei et al., 2016; Jain et al., 2020). Faithful explanations can serve various stakeholders as mechanisms for dispute, debugging or advice (Jacovi and Goldberg, 2021). In fact verification, an explanation may be used by a news agency for advice, an end-user to dispute the decision, and a developer to debug the model.
Figure 1: Explainable fact verification using ProoFVer. The explanation generator (left) generates a sequence of triples, representing the logical inference in natural logic; the DFA determines the veracity of the claim, by using the sequence of natural logic operators in Figure 2 as a transition sequence. In the example, the claim "Little Dorrit is a short story by Charles Dickens" is verified against the evidence "Little Dorrit is a novel by Charles John Huffam Dickens, originally published in serial form between 1855 and 1857" (page title: Little Dorrit), yielding the NatOp sequence ≡ ⥯ ≡.

Figure 2: Steps in the proof for the input in Figure 1, along with the transition sequence as per the DFA in Figure 1: starting at S with the claim, the three mutations yield the transitions ≡, ⥯ and ≡ through the states S, S, R and R, terminating at 'REFUTES (R)'.

ProoFVer outperforms current state-of-the-art (SOTA) fact verification models (Liu et al., 2020; Ye et al., 2020; Zhong et al., 2020) in both performance and robustness. It achieves a label accuracy of 79.25 % and a FEVER Score of 74.37 % on the FEVER test data, outperforming SOTA models by 2.4 % and 2.07 % respectively. ProoFVer's explanations are guaranteed to be faithful. They also score high on plausibility in terms of token overlap F1-score with human rationales: ProoFVer scores an F-score of 93.28 %, which is 5.67 % more than attention based highlights from Ye et al. (2020). Further, in forward prediction, humans can better predict the model outcome using ProoFVer's explanations than using all the retrieved evidence. More specifically, human annotators correctly predict the model outcome 82.78 % of the time based on ProoFVer's explanations, against 70 % when using only the evidence.

We evaluate the robustness of ProoFVer using counterfactual instances from Symmetric FEVER (Schuster et al., 2019). ProoFVer's label accuracy is 13.21 percentage points higher than that of the next best model (Ye et al., 2020). Further, fine tuning both models using the Symmetric FEVER development data leads to catastrophic forgetting on FEVER, so we apply L2 regularisation to all the models.
Here, ProoFVer outperforms the next best model with a margin of 3.73 and 6 percentage points on the original and Symmetric FEVER datasets respectively. Moreover, to evaluate the robustness of fact verification systems against the impact of superfluous information from the retriever, we propose a new metric, the Stability Error Rate (SER). It quantifies the percentage of instances where superfluous information influences the model to change its original decision.
This metric measures the ability of an entailment model to distinguish evidence sentences relevant to the claim from ones which merely have lexical overlap with the claim. Our model achieves a SER of 6.21 %, compared to 10.27 % for Ye et al. (2020), where a lower SER is preferred.

2 Natural Logic for Explainable Fact-Verification

Justifying model decisions has been central to fact verification. While earlier models substantiate their decisions by presenting the evidence as is, e.g. systems developed for the FEVER benchmark (Thorne et al., 2018b), more recent proposals use the evidence to generate explanations. Here, models either highlight salient subsequences from the evidence (Popat et al., 2018; Wu et al., 2020), generate summaries (Kotonya and Toni, 2020; Atanasova et al., 2020), correct factual errors (Thorne and Vlachos, 2021b; Schuster et al., 2021) or perform rule discovery (Ahmadi et al., 2019; Gad-Elrab et al., 2019). In this work, we use natural logic for explainable fact verification. Natural logic is a prominent logic based paradigm in textual entailment (Abzianidze, 2017a; Hu et al., 2019; Feng et al., 2020).

Natural logic operates directly on natural language (Angeli and Manning, 2014; Abzianidze, 2017b), and is thus appealing for fact verification, as structured knowledge bases like Wikidata typically lag behind text-based encyclopedias such as Wikipedia in terms of coverage (Johnson, 2020). Furthermore, it obviates the need to translate claims and evidence into meaning representations such as lambda calculus (Zettlemoyer and Collins, 2005). While such representations may be more expressive, they require the development of semantic parsers, introducing another source of error in the verification process.

Natural logic, a term originally coined by Lakoff (1972), has previously been employed in several information extraction and NLU tasks such as natural language inference (NLI; Angeli and Manning, 2014; Abzianidze, 2017a; Feng et al., 2020), question answering (Angeli et al., 2016) and open information extraction (Angeli, 2016, Chapter 5). NatLog (MacCartney and Manning, 2007), building on past theoretical work on natural logic and monotonicity calculus (Van Benthem et al.; Valencia, 1991), uses natural logic for textual inference between a hypothesis and a premise. NaturalLI (Angeli and Manning, 2014) extended NatLog by adopting the formal semantics of Icard III and Moss (2014).

NaturalLI is a proof system formulated for the NLI task. It determines the entailment of a hypothesis by searching over a database of true facts, i.e. premise statements. The proofs are in the form of a natural logic based logical inference, which results in a sequence of mutations between a premise and a hypothesis. Each mutation is realised as a lexical substitution and is marked with a natural logic relation, forming a step in the inference. Each mutation results in a new sentence, and the natural logic relation assigned to it identifies the type of entailment that holds between the sentences before and after the mutation. NaturalLI adopts a set of seven natural logic operators, namely equivalence (≡), forward entailment (⊑), reverse entailment (⊒), negation (¬), alternation (⥯), cover (⌣), and independence (#). The operators were originally proposed in NatLog and are defined by MacCartney (2009, p. 79). We henceforth refer to these operators as NatOps. For NLI, NaturalLI, and later Kotonya and Toni (2020), suppress the use of the cover and alternation operators and merge the cases of alternation and negation together. For fact verification, however, we treat these as separate cases. We do suppress the use of the cover operator, as such instances rarely occur in fact verification datasets. Thus, in ProoFVer we use six NatOps in our proofs.

To determine whether a hypothesis is entailed by a premise, NaturalLI uses a deterministic finite state automaton (DFA). Here, each state is an entailment label, and the transitions are the NatOps (Figure 1).
The logical inference consists of a sequence of mutations, each assigned a NatOp. The sequence of NatOps is used to traverse the DFA, and the state where it terminates decides the label of the hypothesis-premise pair. The decision making process thus relies solely on the steps in the logical inference, and the proofs form faithful explanations.
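Since the verdict is a pure function of the generated NatOp sequence, the verification step is small enough to sketch in code. The following Python sketch assumes a transition table following the NaturalLI DFA collapsed onto the three FEVER labels; only the transitions exercised by the running example are filled in, and all names are ours rather than the authors' code.

```python
# Minimal sketch of the verification DFA (Section 3.2), assuming
# NaturalLI-style transitions collapsed onto the three FEVER labels.
# Only the transitions exercised by the running example are filled in;
# the complete table is given by Angeli and Manning (2014).
TRANSITIONS = {
    ("S", "≡"): "S", ("S", "⊑"): "S",  # equivalence/forward keep support
    ("S", "¬"): "R", ("S", "⥯"): "R",  # negation/alternation flip to refute
    ("R", "≡"): "R", ("R", "⊑"): "R",
    ("R", "¬"): "S",
    # ... the remaining (state, NatOp) pairs follow the NaturalLI DFA
}

def verdict(natops, start="S"):
    """Replay the NatOp sequence; the terminal state is the label."""
    state = start
    for op in natops:
        # pairs absent from the table fall to NOT ENOUGH INFO (N),
        # which is absorbing: no NatOp leads out of it
        state = TRANSITIONS.get((state, op), "N")
    return state

assert verdict(["≡", "⥯", "≡"]) == "R"  # the Little Dorrit proof of Figure 2
```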
3 ProoFVer

ProoFVer is an explainable fact verification system. It has a seq2seq proof generator that generates proofs in the form of natural logic based logical inferences, and a deterministic finite automaton (DFA) for predicting the veracity of a claim. The proof generator follows GENRE (De Cao et al., 2021) in its training and inference. We elaborate on our proof generation process in Subsection 3.1 and on the deterministic decision maker in Subsection 3.2.

Figure 3: An instance of claim verification requiring multiple evidence sentences; the encoder inputs the concatenated sequence of the claim and the evidence. Claim: "Bermuda was part of one of the largest empires in history." Evidence-1 (page title: Bermuda): "He claimed the islands for the Spanish Empire"; Evidence-2 (page title: Spanish Empire, reached via a hyperlinked mention): "The Spanish Empire was one of the largest empires in history". The generated proof has the NatOp sequence ≡ ⊑ ⊑ ≡ ≡, whose transition sequence S S S S S yields the outcome SUPPORTS (S).

3.1 Proof Generation

The proof generator, as shown in Figures 1 and 3, takes as input a claim along with one or more evidence sentences retrieved from a knowledge source. We generate the steps in the proof as a sequence of triples, where each triple consists of a span from the claim, a span from the evidence and a NatOp. The claim span being substituted and the evidence span replacing it form a mutation. In a proof, we start with the claim, and span level mutations are iteratively applied to the claim from left to right. For each claim span in a mutation, a relevant evidence span is used as its replacement. Further, each mutation is assigned a NatOp. Figure 1 shows a generated proof containing a sequence of three triples. The corresponding mutated string at each step of the proof, along with the assigned NatOps, is shown in Figure 2. Based on this proof, the DFA in Figure 1 determines that the evidence refutes the claim, i.e. it terminates in state R. As shown in Figure 3, the evidence spans may come from multiple sentences, may be accessed in any order, and need not all end up being used. In Figure 3, the first three mutations use spans from "Evidence-1", while the last one uses a span from "Evidence-2", a span which happens to be in the middle of the sentence. Finally, each mutation is marked with the NatOp that holds between the claim and evidence spans.

We use a seq2seq model for the proof generation, following a standard auto-regressive formulation. At each step of the prediction, the predicted token has to belong to one of the three elements of a triple, i.e. a token from the claim, a token from the evidence or a NatOp. This implies that we need to lexically constrain the search space during prediction to avoid invalid triples, e.g. a triple lacking the NatOp or a triple containing lexical entries which are not in the input. Further, the search space needs to change depending on which element of the triple is being predicted. Therefore, we perform dynamically constrained markup decoding, a modified form of lexically constrained decoding (Post and Vilar, 2018) introduced by De Cao et al. (2021). This approach helps not only to lexically constrain the search space, but also to dynamically switch between the three different spaces depending on what the decoder is currently generating.
Dynamically constrained markup decoding uses markups to identify the boundaries of the elements in a triple in order to switch search spaces. The claim and evidence spans in a triple are enclosed within the delimiters '{}' and '[]' respectively. A triple starts with the generation of a claim span. Here, the prediction proceeds either by monotonically copying the tokens from the claim, or by predicting the delimiter '}', which signifies the end of the span. This results in switching the search space to that of evidence spans. Here, the candidates in the search space contain all the possible spans obtained from the evidence, as any arbitrary span from the evidence can be predicted as the span most relevant to the claim span already generated. The evidence span prediction, which begins with the '[' delimiter, ends when the final token of the selected evidence span is reached and is marked by the ']' delimiter. This is followed by the prediction of a NatOp, where the decoder is constrained to predict one of the six NatOps considered. The generation process similarly generates all the triples, until the triple with the final claim span is generated.
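The following toy sketch illustrates the mode switching just described. In practice such a function would be plugged into beam search (e.g. via the prefix_allowed_tokens_fn hook of the Hugging Face transformers library); the delimiter handling below is our simplified illustration, not the implementation of De Cao et al. (2021).

```python
# Toy sketch of the mode tracking behind dynamically constrained
# markup decoding. Everything below is an illustrative simplification.
NATOPS = ["≡", "⊑", "⊒", "¬", "⥯", "#"]

def allowed_next(prefix, claim, evidence):
    """Admissible next tokens, given the proof tokens generated so far."""
    mode, copied = "between", 0     # copied = claim tokens emitted so far
    for tok in prefix:              # replay the prefix to recover the mode
        if tok == "{":
            mode = "claim"
        elif tok == "}":
            mode = "open_evidence"
        elif tok == "[":
            mode = "evidence"
        elif tok == "]":
            mode = "natop"
        elif mode == "claim":
            copied += 1
        elif mode == "natop" and tok in NATOPS:
            mode = "between"
    if mode == "between":           # a triple always opens with a claim span
        return ["{"] if copied < len(claim) else ["</s>"]
    if mode == "claim":             # monotonic copy of the claim, or close
        return ([claim[copied]] if copied < len(claim) else []) + ["}"]
    if mode == "open_evidence":
        return ["["]
    if mode == "evidence":          # any evidence token may extend the span
        return sorted(set(evidence)) + ["]"]
    return NATOPS                   # mode == "natop": exactly one operator

claim = "Little Dorrit is a short story by Charles Dickens".split()
evidence = "Little Dorrit is a novel by Charles John Huffam Dickens".split()
print(allowed_next(["{", "Little", "Dorrit", "}"], claim, evidence))  # ['[']
```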
3.2 Verification

The DFA shown in Figure 1 uses the sequence of NatOps generated by the proof generator as transitions to arrive at the outcome. NaturalLI (Angeli and Manning, 2014) designed the DFA for the three class NLI classification task, with the states entail, contradict and neutral. Here, we replace them with SUPPORTS (S), REFUTES (R), and NOT ENOUGH INFO (N) respectively for fact verification. Figure 2 shows the transition sequence in the DFA for the input in Figure 1, based on the NatOps in the generated proof.

4 Generating Proofs for Training

Training datasets for evidence-based fact verification consist of instances containing a claim, a label indicating its veracity and the evidence, typically a set of sentences (Thorne et al., 2018a; Hanselowski et al., 2019; Wadden et al., 2020). However, we need proofs to train our proof generator (Section 3.1). Manually obtaining these annotations would be laborious; thus, we heuristically generate the proofs from existing resources. A proof is a sequence of triples, where each triple contains a 'mutation', i.e. a claim span and an evidence span, along with a NatOp. As shown in Figure 4, we perform a three step annotation process: chunking, alignment and entailment assignment. The claim spans are obtained in the chunking step; alignment provides the mutations by obtaining the corresponding evidence spans for each claim span; and the entailment assignment step assigns NatOps to the mutations.
4.1 Chunking and Alignment

The chunking step performs span level segmentation of the claim using the chunker of Akbik et al. (2019). Any span that contains only non-content words is merged with its subsequent span as a post-processing step. As shown in Figure 4, the alignment step then finds the corresponding evidence span for each such claim span. Here, the evidence sentences in the input are aligned separately with the claim using a word aligner (Jalili Sabet et al., 2020). For each claim span from the chunking step, the corresponding evidence span is obtained by grouping together those words from the evidence that are aligned to it.¹ Further, alignments with similarity scores that fall below an empirically set threshold are ignored. These are marked with 'DEL' in Figure 4, implying that the corresponding claim span is deleted in such mutations.

¹ If there are additional words between two words in the evidence aligned to the same claim span, then those words are also considered part of the evidence span.

Insertions can also occur, in one of two cases. The first is when a claim needs to be substantiated with an enumeration of items from a list. For instance, the claim "Steve Buscemi has appeared in multiple films" would require the enumeration of a list of films as its explanation. Here, the first film name in the list would be aligned with the claim span 'multiple films', and the rest of them are treated as insertions with no aligned claim spans. The second case occurs when a claim requires information from multiple evidence sentences, which we discuss next. Summarily, for a given input instance, the alignment step produces a sequence of mutations, where a mutation can be an aligned pair of claim and evidence spans, the insertion of an evidence span, or the deletion of a claim span.

As shown in Figure 4, we convert the alignments into a sequence of mutations. The resulting sequence needs to contain at least one span from each of the evidence sentences given in the ground truth for an instance. To switch between evidence sentences, we assume these sentences are hyperlinked. As shown in Figure 3, 'Evidence-1' is hyperlinked to 'Evidence-2' via the mention 'Spanish Empire'. Here, if the hyperlinked entity is not part of the aligned evidence spans, then we insert it as a mutation. Such an insertion is shown as box three of the 'mutations' in Figure 4. Further, the aligned evidence spans that form part of the mutations before and after the insertion are from Evidence-1 and Evidence-2 respectively, and are marked with solid borders in Figure 4. We generalise this to a trajectory of hyperlinked mentions for cases with more than two evidence sentences. Further, if there are multiple possible mutation sequences as candidates, we enumerate all such combinations, calculate their average similarity scores based on their alignment with the claim, and choose the one with the highest average similarity.
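To make the alignment-to-mutation step concrete, the sketch below groups word-level alignments into span-level mutations; the alignment data format, the threshold value and all names are illustrative assumptions, not the authors' code.

```python
# Sketch of how word alignments become span-level mutations (Section 4.1).
SIM_THRESHOLD = 0.5   # assumed; the paper sets this empirically

def spans_to_mutations(claim_words, claim_spans, evid_words, alignments):
    """claim_spans: (start, end) token offsets from the chunker.
    alignments: {claim_token_idx: (evidence_token_idx, score)}.
    Returns (claim_span, evidence_span) pairs; 'DEL' marks deletions."""
    mutations = []
    for start, end in claim_spans:
        hits = [alignments[i] for i in range(start, end)
                if i in alignments and alignments[i][1] >= SIM_THRESHOLD]
        if not hits:   # alignment below threshold: the claim span is deleted
            mutations.append((" ".join(claim_words[start:end]), "DEL"))
            continue
        lo = min(j for j, _ in hits)   # group the aligned evidence words;
        hi = max(j for j, _ in hits)   # words in between are kept too (fn. 1)
        mutations.append((" ".join(claim_words[start:end]),
                          " ".join(evid_words[lo:hi + 1])))
    return mutations
```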
Figure 4: The annotation process for obtaining explanations for the claim and evidence in Figure 3. Annotation proceeds in three steps: chunking, alignment and entailment assignment. Entailment assignment proceeds by individual mutation assignment and two filtering steps.

4.2 Entailment Assignment

As shown in Figure 4, the entailment assignment step produces a sequence of NatOps, one for each mutation. Here, the search space becomes exponentially large, i.e. 6^n possible NatOp sequences for n mutations. First, we restrict the search space by identifying individual mutations that can be assigned a NatOp based on external lexicons and knowledge bases (KBs). We then perform two filtering steps to further narrow down the search space: one using label information from the training data, and the other using an oracle. In case this filtering results in multiple NatOp sequences, we arbitrarily choose one.

4.2.1 Individual Assignment of NatOps

We assign a NatOp to individual mutations from the mutation sequence by relying on external lexicons and KBs. Specifically, we use WordNet (Miller, 1995), Wikidata (Vrandečić and Krötzsch, 2014), the Paraphrase Database (PPDB; Ganitkevitch et al., 2013), and negative sentiment polarity words from the opinion lexicon of Hu and Liu (2004). Every paraphrase pair present in PPDB 2.0 (Pavlick et al., 2015) is marked with a NatOp. We identify mutations where the claim and evidence spans are present in PPDB as paraphrases and assign the corresponding NatOp to those. Similarly, mutations which fully match lexically are assigned the equivalence operator, like the mutations marked as 1, 4 and 5 in Figure 4. Next, cases of insertions (deletions) are generally treated as making the existing claim more specific (general) and hence assigned the forward (reverse) entailment operator, like mutation 3 in Figure 4. An exception is that the deletion or insertion of a negation word results in assigning the negation operator. The negation words include common negation terms such as "not" and "none", as well as words with a negative polarity in the lexicon of Hu and Liu (2004).

Figure 5: A snapshot of entities and their relations in Wikidata: "Little Dorrit" has genre "novel"; "Rashomon" and "Inception" are instances of "film"; and "novel", "short story" and "film" connect via "subclass of" edges through "fiction literature", "visual artwork", "fiction" and "work of art" up to "creative work".
For several of these mutations, the NatOp information need not be readily available at the span level. In such cases, we retain the word level alignments from the aligner and perform lexical level entailment assignment with the help of WordNet and Wikidata. We follow MacCartney (2009, Chapter 6) for the entailment assignment of open-class terms using WordNet. Additionally, we define our own rules to assign NatOps to named entities using Wikidata as our KB. Here, aliases of an entity are marked with the equivalence operator, as shown in the third triple in Figure 1. Further, we manually assign NatOps to the 500 relations in Wikidata which hold most frequently between the aligned pairs of named entities in our training dataset. For instance, as shown in Figure 5, the entities "Little Dorrit" and "novel" are connected by the relation "genre". As "Little Dorrit" is a novel, we consider the entity "novel" a generalisation of "Little Dorrit". Here, for a mutation with "Little Dorrit" in the claim span and "novel" in the evidence span, we assign the reverse entailment operator, signifying generalisation. Conversely, we assign the forward entailment operator, signifying specialisation, when the entities switch positions. Summarily, any two entities that share the "genre" relation are assigned a forward or reverse entailment operator.

The KB relations we annotated have to occur directly between the entities, and they do not capture hierarchical multihop relations between entities in the KB. We create such a hierarchy by combining the "instance of", "part of", and "subclass of" relations in Wikidata.² In this hierarchy, if an entity is connected to another via a directed path of length k ≤ 3, such as "work of art" and "Rashomon" in Figure 5, then the pair is considered to have a parent-child relation. Here, the parent entity is considered a generalisation of the child entity, leading to the assignment of the forward or reverse entailment NatOp, as explained previously. Similarly, two entities are considered siblings if they have a common parent, and are assigned the alternation NatOp. This signifies that the two entities have related properties but are still different instantiations, such as "Rashomon" and "Inception". However, if two connected entities do not satisfy the aforementioned length criterion, say "novel" and "Rashomon", then those entities are assigned the independence operator (#), signifying that the relatedness between the entities is of no significance in our annotation.

² We do not distinguish between these relations in our hierarchy, though Wikidata does. For more details refer to: https://bit.ly/hier_props
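The entity-pair rules above can be sketched as follows; the toy 'parents' fragment mirrors Figure 5 with the "instance of", "part of" and "subclass of" edges merged, and the helper names are ours.

```python
# Sketch of the entity-pair NatOp rules over the Wikidata hierarchy
# (Section 4.2.1); a toy fragment of the merged hierarchy of Figure 5.
parents = {
    "Little Dorrit": {"novel"}, "Rashomon": {"film"}, "Inception": {"film"},
    "novel": {"fiction literature"}, "film": {"visual artwork"},
    "visual artwork": {"work of art"},
}

def ancestors(entity, max_depth=3):
    """Entities reachable via at most max_depth upward edges."""
    frontier, seen = {entity}, set()
    for _ in range(max_depth):
        frontier = {p for e in frontier for p in parents.get(e, ())} - seen
        seen |= frontier
    return seen

def natop_for_entities(claim_ent, evid_ent):
    if claim_ent == evid_ent:
        return "≡"
    if evid_ent in ancestors(claim_ent):
        return "⊒"      # the evidence entity generalises the claim entity
    if claim_ent in ancestors(evid_ent):
        return "⊑"      # the evidence entity is more specific
    if parents.get(claim_ent, set()) & parents.get(evid_ent, set()):
        return "⥯"      # siblings: related but distinct instantiations
    return "#"          # no close connection: independence

assert natop_for_entities("Little Dorrit", "novel") == "⊒"
assert natop_for_entities("Rashomon", "Inception") == "⥯"
```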
O-class                                    S    R    N
substitute with similar information        ⊑    ⥯    ⊒
substitute with dissimilar information     ⊑    ⥯    #
paraphrasing                               ≡    ⥯    #
negation                                   ¬    ¬    ¬
transform to specific                      ⊑    ⊑    ⊑
transform to general                       ⊒    ⊒    ⊒

Table 1: NatOp assignment based on the O-class and veracity label information.

4.2.2 Filtering Using Label and Oracle

In Figure 4, there is only one unfilled NatOp after the individual assignment of Subsection 4.2.1. Hence there are six possible candidate transition sequences. We first filter out invalid candidates by using the veracity label information from the training set, which is 'SUPPORTS' in Figure 4. Of the six, only two terminate at the 'SUPPORTS (S)' state when used as a transition sequence in the DFA (Section 3.2); hence only those two candidates are retained. For the next filtering step, we use an oracle to obtain two additional pieces of information. First, we use it to obtain a six class categorisation of the mutation sequence, as shown in Table 1. This categorisation is based on the annotation process followed in the FEVER dataset (Thorne et al., 2018a, §3.1), and we refer to these categories as the O-classes. Second, while the label information can change after every mutation, the O-class is assigned at a specific mutation, after which it remains unchanged. The oracle identifies this specific mutation as well, which we refer to as the O-span. In Figure 4, the O-span happens to be the second mutation. We assign a unique NatOp for each O-class and label combination, as shown in Table 1, at the O-span. In Figure 4, the mutation sequence at the top is selected based on the information in Table 1.
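The label-based filtering step amounts to enumerating the unfilled NatOps and replaying each candidate through the DFA. A sketch, reusing the verdict function from the Section 3.2 sketch; names are ours:

```python
# Sketch of label-based filtering (Section 4.2.2): keep only candidate
# NatOp sequences whose DFA run terminates at the gold veracity label.
from itertools import product

NATOPS = ["≡", "⊑", "⊒", "¬", "⥯", "#"]

def filter_by_label(partial_ops, gold_label):
    """partial_ops: list of NatOps with None at unfilled positions."""
    holes = [i for i, op in enumerate(partial_ops) if op is None]
    candidates = []
    for fill in product(NATOPS, repeat=len(holes)):
        ops = list(partial_ops)
        for i, op in zip(holes, fill):
            ops[i] = op
        if verdict(ops) == gold_label:   # DFA replay from Section 3.2
            candidates.append(ops)
    return candidates
```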
5 Experimental Methodology

5.1 Data

We evaluate our model on FEVER (Thorne et al., 2018a), a benchmark dataset for fact verification. It is split into a train, development and blind test set containing 145,449, 19,998, and 19,998 claims respectively. We further evaluate our model on Symmetric FEVER (Schuster et al., 2019), a dataset designed to evaluate the robustness of fact verification systems against the claim-only bias present in FEVER. The dataset consists of 1,420 counterfactual instances, split into development and test sets of 708 and 712 instances respectively.

5.2 Evaluation Metrics

The official evaluation metrics for FEVER are label accuracy (LA) and FEVER Score (Thorne et al., 2018b). Label accuracy is simply the veracity classification accuracy. FEVER Score rewards only those predictions which are accompanied by at least one complete set of ground truth evidence. For the robustness experiments with Symmetric FEVER, five random initialisations of each model are trained; we perform multiple training runs with random seeds due to the limited size of the dataset available for training. We report the mean accuracy, standard deviation, and the p-value of an unpaired t-test.

We further introduce a new evaluation metric to assess model robustness, called the Stability Error Rate (SER). Neural models, especially those with a retriever component, have been shown to be vulnerable to model overstability (Jia and Liang, 2017), i.e. the inability of a model to distinguish between strings that are relevant to the user's query and those which merely have lexical similarity with the user input. Due to overstability, superfluous information in the input may influence the decision making of a model, even when sufficient information to arrive at the correct decision is available in the input. An ideal fact verification model should predict a 'REFUTES' or 'SUPPORTS' label for a claim only given sufficient evidence, and any additional evidence should not alter this decision. To assess this, we define SER as the percentage of claims where additional evidence alters the SUPPORTS or REFUTES decision of the model.
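A sketch of the SER computation as just defined; the instance attributes and the prediction interface are assumptions for illustration:

```python
# Sketch of the Stability Error Rate (Section 5.2): the percentage of
# claims whose SUPPORTS/REFUTES decision flips once superfluous
# evidence is appended. `predict` stands for any verification model.
def stability_error_rate(instances, predict):
    """predict(claim, evidence_list) -> label in {'S', 'R', 'N'}."""
    flipped = total = 0
    for inst in instances:
        base = predict(inst.claim, inst.sufficient_evidence)
        if base == "N":        # only SUPPORTS/REFUTES decisions count
            continue
        total += 1
        more = inst.sufficient_evidence + inst.superfluous_evidence
        if predict(inst.claim, more) != base:
            flipped += 1
    return 100.0 * flipped / max(total, 1)
```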
5.3 Baseline Systems

KernelGAT (Liu et al., 2020) propose a graph attention based network, where each retrieved evidence sentence, concatenated with the claim, forms a node in the graph. The node representations are initialised using a pretrained LM; we use their best performing RoBERTa (Large) configuration. The relative importance of each node is computed with node kernels, and information propagation is performed using edge kernels.

CorefBERT (Ye et al., 2020) follows the KernelGAT architecture and differs only in the LM used for the node initialisation. Here, they further pretrain the LM on a task that involves predicting the referents of a masked mention, to capture co-referential relations in context. We use CorefRoBERTa, their best performing configuration.

DREAM (Zhong et al., 2020) use a graph convolutional network for information propagation, followed by a graph attention network for aggregation. Unlike KernelGAT, however, they operate on a semantic structure with span level nodes, obtained using semantic role labelling of the input. Their graph structure is further enhanced with edges based on lexical overlap and similarity between the span level nodes.³

³ Since their implementation is not publicly available, we report the results stated by Zhong et al. (2020).

5.4 ProoFVer: Implementation Details

We follow most previous work on FEVER in modelling the task in three steps, namely document retrieval, retrieval of evidence sentences from those documents, and finally classification based on the claim and the evidence. ProoFVer's novelty lies in the proof generation at the third step. Hence, for better comparability against the baselines, for the first two steps we follow the setup of Liu et al. (2020), also used by Ye et al. (2020). We retrieve five sentences for each claim, since additional evidence is ignored in the FEVER evaluation.

Claim verification: For the proof generator, we take the pretrained BART (Large) model (Lewis et al., 2020) and fine tune it using the heuristically annotated data from Section 4. The encoder takes as input the concatenation of a claim and one or more evidence sentences, each separated by a delimiter, and the decoder generates a sequence of triples. We use three different configurations, namely ProoFVer-SE, ProoFVer-ME-Full, and ProoFVer-ME-Align, which differ in the way the retrieved evidence is handled. In ProoFVer-SE (single evidence), a claim is concatenated with exactly one evidence sentence as input; this produces five proofs and five decisions per claim, and we arrive at the final decision by majority voting. The remaining two configurations use all the retrieved evidence together as a single input, concatenated with the claim, and hence produce only one proof and one decision per claim. While ProoFVer-ME-Full (multiple evidence) uses the evidence as is, ProoFVer-ME-Align uses a word aligner (Jalili Sabet et al., 2020) to extract, from each of the five evidence sentences, the subsequences that align with the claim; only the extracted subsequences are concatenated with the claim to form the input.
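The aggregation in ProoFVer-SE can be sketched as below; the tie-breaking rule is our assumption, as the paper only specifies majority voting:

```python
# Sketch of ProoFVer-SE's per-sentence prediction and majority vote.
from collections import Counter

def se_label(claim, evidences, predict):
    """One proof (and label) per evidence sentence, combined by vote."""
    votes = Counter(predict(claim, [e]) for e in evidences)
    (top, n), *rest = votes.most_common()
    if rest and rest[0][1] == n:
        return "N"   # assumed tie-breaking, not specified in the paper
    return top
```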
5.5 Heuristic Annotation for Training

As elaborated in Section 4, the training proofs are generated using a three step heuristic annotation process. We first obtain a phrase level chunking of the claims using the Flair based phrase chunker (Akbik et al., 2019). Second, we obtain alignments between the evidence sentences and the claim using SimAlign (Jalili Sabet et al., 2020); for this, we use RoBERTa (Large), fine tuned on MNLI, to obtain the contextualised word embeddings for the alignment. The final step, entailment assignment, starts by assigning NatOps to individual mutations in isolation, relying on external lexicons and KBs. About 93.59 % of the claims had at most one mutation for which we could not assign a NatOp in isolation, and in another 4.17 % of the cases exactly two mutations were unfilled after the individual assignment.

For the oracle based filtering during entailment assignment, the information required by the oracle is obtained from the logs of the FEVER annotation process. Recently, Thorne and Vlachos (2021b) released a fact correction dataset which contains this information for FEVER. In the annotation of FEVER, given a sentence in Wikipedia, annotators first extracted a subsequence containing exactly one piece of information about an entity. These extracted subsequences became the source statements, which were then modified to create the claims in FEVER (Thorne et al., 2018a). Further, each such pair of a source statement and the claim created from it is annotated with a six class categorisation, as shown in Table 1. We use this categorisation information as the O-class. The O-span at which the O-class applies is identified as the claim span where the modification was made to create the claim. We use both pieces of information in our final filtering step.

To limit the reliance on gold standard data from Thorne and Vlachos (2021b) during the heuristic annotation, we also train a classifier for predicting the O-class. The classifier is built on KEPLER (Wang et al., 2021), a RoBERTa based pretrained LM enhanced with KB relations and entity pairs from Wikidata, which covers 97.5 % of the entities present in FEVER. We first fine tune KEPLER on the FEVER training dataset, and then further fine tune it for the six class task of predicting the O-class; we refer to this classifier as Kepler-CT. Kepler-CT is trained with varying amounts of training data from Thorne and Vlachos (2021b), ranging from 1.24 % (1,800; 300 per class) to 41.25 % (60,000; 10,000 per class) of the FEVER dataset. We consider two configurations. In the first, ProoFVer-K, the oracle still uses gold data to identify the O-span. In the second, ProoFVer-K-NoCS, we relax this constraint and instead retain only those candidate sequences in which at least one occurrence of the corresponding NatOp as per Table 1 is possible. Summarily, ProoFVer-K-NoCS does not use any gold information from the fact correction dataset directly, apart from using Kepler-CT, whereas ProoFVer-K uses gold standard information for identifying the O-span during the heuristic annotation of our training data.

6 Evaluation Results

6.1 Fact-checking accuracy

As shown in Table 2, ProoFVer-ME-Full has the best label accuracy and FEVER score on both the blind test data and the development data among all the models we compare against. Compared to ProoFVer-SE, these gains come primarily from ProoFVer-ME-Full's ability to handle multiple evidence sentences (five in our experiments) together as one input, as opposed to handling each separately and then aggregating the predictions. The FEVER development set contains 9.8 % (1,960) instances requiring multiple evidence sentences for claim verification. While ProoFVer-SE predicts 60.1 % of these instances correctly, ProoFVer-ME-Full correctly predicts 67.44 % of them. Further, around 80.73 % (of 18,038) of the single evidence instances are correctly predicted by ProoFVer-SE, in comparison to 81.62 % for ProoFVer-ME-Full.
System                     Dev LA   Dev FEVER Score   Test LA   Test FEVER Score
ProoFVer (BART) SE         78.71    74.62             74.18     70.09
ProoFVer (BART) ME-Align   79.83    76.33             77.16     72.47
ProoFVer (BART) ME-Full    80.23    78.17             79.25     74.37
KGAT (RoBERTa)             78.29    76.11             74.07     70.38
CorefBERT (RoBERTa)        79.12    77.46             75.96     72.30
DREAM (XLNet)              79.16    -                 76.85     70.60

Table 2: Performance on the development and blind test data of FEVER.

System       Stability Error Rate
ProoFVer     6.21
KernelGAT    12.35
CorefBERT    10.27

Table 3: Stability Error Rate (SER) on the FEVER development data. The lower the SER, the better the model.

Training Set Size   Kepler-CT   ProoFVer-K-NoCS   ProoFVer-K
1,800               69.07       64.65             66.73
3,600               72.53       67.25             69.17
6,000               74.02       68.86             72.41
12,000              77.95       72.83             74.42
18,000              79.67       74.25             76.23
30,000              80.61       75.39             77.76
45,000              82.76       77.62             78.84
60,000              84.85       78.61             79.67

Table 4: Performance (label accuracy) of ProoFVer-K and ProoFVer-K-NoCS, two variants which use predictions from the Kepler-CT classifier for identifying the O-class. The size of the training data used for Kepler-CT and its classifier accuracy are also provided.

All configurations of ProoFVer use dynamically constrained markup decoding during prediction (Section 3.1) to constrain the search space. However, we observe that any attempt to further restrict the control of the proof generator over predicting the text portions results in performance deterioration. In our most restrictive version of ProoFVer, we determine the aligned spans of the claim and evidence beforehand, using the same chunker and aligner used in the heuristic annotation of the training data, and the decoding mechanism enforces the predetermined aligned spans as predictions. By construction, this setup yields perfect overlap between the text portions of the predicted proofs and the ground truth on the FEVER development data, and the proof generator has control only over the prediction of the NatOps. However, this configuration reports a label accuracy of 77.42 % and a FEVER score of 75.27 % on the development data, performing worse than all the models reported in Table 2. Similarly, the proof generator performs best when it has access to the entire retrieved evidence: ProoFVer-ME-Full and ProoFVer-ME-Align differ only in their access to the retrieved evidence, where the latter uses only those parts of the evidence sentences that align with the claim instead of the whole sentences. While this configuration outperforms all the baselines, it still falls short of ProoFVer-ME-Full, as shown in Table 2.

Impact of additional manual annotation: The use of manual annotation from the fact correction dataset, in terms of both the O-class and the O-span, helps the performance of the model. However, as shown in Table 4, we can limit the amount of additional data required by building the six class classifier Kepler-CT. Here, by using gold information from only 41.25 % of the instances in FEVER, we obtain a classifier with an accuracy of 84.85 %. Moreover, the corresponding ProoFVer configurations trained on proofs generated using information from Kepler-CT are competitive with those using the gold standard information. For instance, ProoFVer-K-NoCS achieves a label accuracy of 78.61 %, against the 80.23 % of ProoFVer-ME-Full, on the FEVER development data; this configuration uses no additional gold information other than what is required for training Kepler-CT. Similarly, ProoFVer-K, which uses gold information for identifying the O-span, achieves a label accuracy of 79.67 %, outperforming all the baseline models. Even with Kepler-CT trained on as little as 1.24 % of the instances in FEVER, ProoFVer-K-NoCS and ProoFVer-K achieve label accuracies of 64.65 % and 66.73 % respectively on the FEVER development data.
Model        FEVER-DEV                              Symmetric FEVER
             Original    FT          FT+L2          Original    FT          FT+L2
ProoFVer     89.07±0.3   86.41±0.8   87.95±1.0*     81.70±0.4   85.88±1.3#  83.37±1.3#*
KernelGAT    86.02±0.2   76.67±0.3   79.93±0.9*     65.73±0.3   84.94±1.1#  73.34±1.5#*
CorefBERT    88.26±0.4   78.79±0.2   84.22±1.5*     68.49±0.6   85.45±0.2#  77.37±0.5#*

Table 5: Accuracies of the models on FEVER-DEV and Symmetric FEVER with and without fine tuning (FT). All results marked with * and # are statistically significant with p < 0.05 against their FT and Original variants respectively.

6.2 Robustness

Stability Error Rate (SER): SER quantifies the rate of instances where a system alters its decision due to the presence of superfluous information in its input, passed on by the retriever component. Table 3 shows the SER for the three competing systems, where a lower SER implies improved robustness against superfluous information. KernelGAT and CorefBERT have SERs of 12.35 % and 10.27 % respectively, whereas ProoFVer reports an SER of only 6.21 %. These results imply that both baselines change their predictions from SUPPORTS or REFUTES after being provided with superfluous information more often than ProoFVer, and hence are more vulnerable to model overstability.

Symmetric FEVER: As shown in Table 5, ProoFVer shows better robustness by reporting a mean accuracy of 81.70 % on the Symmetric FEVER test dataset, which is 13.21 percentage points more than that of CorefBERT, the next best model. All models improve their accuracy and become comparable on the Symmetric FEVER test set when fine tuned on the dataset's development set. However, this comes at a performance degradation of more than 9 % on the original FEVER development data for both KernelGAT and CorefBERT, due to catastrophic forgetting (French, 1999). This occurs primarily because of the shift in label distribution during fine tuning, as Symmetric FEVER contains claims with only a SUPPORTS or REFUTES label. ProoFVer reports a degradation of less than 3 % in its accuracy, as it is a seq2seq model trained with the objective of generating a sequence of NatOps. To mitigate the effect of catastrophic forgetting, we apply L2 regularisation to all three models (Thorne and Vlachos, 2021a), which improves all of them on the FEVER development set. Nevertheless, ProoFVer has the highest scores among the competing models after regularisation.
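The regularisation can be sketched as an L2 penalty on the drift from the weights learned on FEVER, in the spirit of Thorne and Vlachos (2021a); PyTorch is assumed and the coefficient is an illustrative hyperparameter:

```python
# Sketch of L2 regularisation against catastrophic forgetting: while
# fine tuning on Symmetric FEVER, penalise drift from the weights
# previously learned on FEVER.
import torch

def l2_to_anchor(model, anchor_params, lam=0.01):
    """Return lam * ||theta - theta_anchor||^2 to add to the task loss."""
    reg = torch.zeros((), device=next(model.parameters()).device)
    for p, p0 in zip(model.parameters(), anchor_params):
        reg = reg + ((p - p0.detach()) ** 2).sum()
    return lam * reg

# usage: loss = task_loss + l2_to_anchor(model, fever_params)
```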
ProoFVer reports a to 50 % with a step size of 5 for both KernelGAT degradation of less than 3 % in its accuracy as it is and CorefBERT. For ProoFVer, we remove those a seq2seq model with an objective trained to gen- evidence spans in our explanations which are exact erate a sequence of NatOps. To mitigate the effect matches with the claim and keep the rest. We find of catastrophic forgetting, we apply L2 regularisa- that the ProoFVer reports a token level F-score of tion for all the three models (Thorne and Vlachos, 93.28 %, compared to 87.61 % and 86.42 %, the 2021a) which improves all models on the FEVER best F-Scores for CorefBERT and KernelGAT. development set. Nevertheless, ProoFVer has the highest scores among the competing models after Human evaluation of the Natural Logic rea- regularisation. soning as an explanation: We perform human
Human evaluation of the natural logic reasoning as an explanation: We perform a human evaluation of our explanations using forward simulation (Doshi-Velez and Kim, 2017), where we ask humans to predict the system output based on the explanations. For ProoFVer, we provide the claim along with the generated sequence of triples, as well as the sentences from which the evidence spans in the explanation were extracted. The baseline is a claim provided with the 5 retrieved evidence sentences, which are the same for all models considered; we use the predictions of CorefBERT as the model predictions in the baseline. Each annotator is presented with 24 different claims, 12 from each system, ProoFVer and the baseline. We used 60 claims, and the resulting 120 explanations obtained a total of 360 impressions from 15 annotators. All the annotators are either pursuing a PhD or are postdocs in fields related to computer science and computational linguistics, or are industry researchers or data scientists. Since the human annotators need not be aware of the natural logic operators, we replaced these with human readable phrases.⁴ The rephrasing also avoids the use of the DFA to arrive at the model prediction by those who are aware of the deterministic nature of the explanation. With ProoFVer's explanations, the human evaluators were able to predict the model decisions correctly in 82.78 % of the cases, against 70 % of the cases for the baseline. In both systems, humans were mostly confused by instances with a 'NOT ENOUGH INFO' label, with comparable percentages of correct predictions between the two systems, 68.33 % and 65 % respectively. The biggest gain occurred for instances with a REFUTES label: here, annotators could correctly predict the model decisions in 86.67 % of the cases for ProoFVer, against only 70 % for the baseline.

⁴ A screenshot of our evaluation page is shown here: https://pasteboard.co/KcleMbf.png

7 Conclusion

We present ProoFVer, a natural logic based proof system for fact verification that generates faithful explanations. The proofs we provide are aligned spans of evidence labelled with natural logic operators, making them more fine grained than sentence level explanations. Further, our explanations represent a reasoning process using the natural logic relations, making the decision making deterministic. Our extensive experiments based on FEVER show that our model performs better than recent state-of-the-art models in terms of both accuracy and robustness. Further, our explanations are shown to score better on rationale extraction in comparison to attention based highlights, and to improve humans' understanding of the decision making process of the model.

Acknowledgements

Amrith Krishna and Andreas Vlachos are supported by a Sponsored Research Agreement between Facebook and the University of Cambridge. Andreas Vlachos is additionally supported by the ERC grant AVeriTeC.

References

Lasha Abzianidze. 2017a. LangPro: Natural language theorem prover. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 115-120, Copenhagen, Denmark. Association for Computational Linguistics.

Lasha Abzianidze. 2017b. A Natural Proof System for Natural Language. Ph.D. thesis, Tilburg University.

Naser Ahmadi, Joohyung Lee, Paolo Papotti, and Mohammed Saeed. 2019. Explainable fact checking with probabilistic answer set programming. In Proceedings of the 2019 Truth and Trust Online Conference (TTO 2019), London, UK, October 4-5, 2019.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. Flair: An easy-to-use framework for state-of-the-art NLP. In NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54-59.
thesis, Stanford erators, making it more fine grained than sentence University. level explanations. Further, our explanations rep- Gabor Angeli and Christopher D. Manning. 2014. resent a reasoning process using the natural logic NaturalLI: Natural logic inference for common relations, making the decision making determinis- sense reasoning. In Proceedings of the 2014 tic. Our extensive experiments based on FEVER, Conference on Empirical Methods in Natural shows that our model performs better than re- Language Processing (EMNLP), pages 534– 4 The screenshot for our evaluation page is shown here 545, Doha, Qatar. Association for Computa- https://pasteboard.co/KcleMbf.png tional Linguistics.
Gabor Angeli, Neha Nayak, and Christopher D. New York, NY, USA. Association for Comput- Manning. 2016. Combining natural logic and ing Machinery. shallow reasoning for question answering. In Proceedings of the 54th Annual Meeting of the Juri Ganitkevitch, Benjamin Van Durme, and Association for Computational Linguistics (Vol- Chris Callison-Burch. 2013. PPDB: The para- ume 1: Long Papers), pages 442–452, Berlin, phrase database. In Proceedings of the 2013 Germany. Association for Computational Lin- Conference of the North American Chapter of guistics. the Association for Computational Linguistics: Human Language Technologies, pages 758– Pepa Atanasova, Jakob Grue Simonsen, Christina 764, Atlanta, Georgia. Association for Compu- Lioma, and Isabelle Augenstein. 2020. Gener- tational Linguistics. ating fact checking explanations. In Proceed- ings of the 58th Annual Meeting of the As- D Graves. 2018. Understanding the promise and sociation for Computational Linguistics, pages limits of automated fact-checking. Technical 7352–7364, Online. Association for Computa- report. tional Linguistics. Andreas Hanselowski, Christian Stab, Claudia Nicola De Cao, Gautier Izacard, Sebastian Riedel, Schulz, Zile Li, and Iryna Gurevych. 2019. A and Fabio Petroni. 2021. Autoregressive en- richly annotated corpus for different tasks in tity retrieval. In International Conference on automated fact-checking. In Proceedings of Learning Representations. the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 493–503, Jay DeYoung, Sarthak Jain, Nazneen Fatema Ra- Hong Kong, China. Association for Computa- jani, Eric Lehman, Caiming Xiong, Richard tional Linguistics. Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP Hai Hu, Qi Chen, and Larry Moss. 2019. Nat- models. In Proceedings of the 58th Annual ural language inference with monotonicity. In Meeting of the Association for Computational Proceedings of the 13th International Confer- Linguistics, pages 4443–4458, Online. Associ- ence on Computational Semantics - Short Pa- ation for Computational Linguistics. pers, pages 8–15, Gothenburg, Sweden. Asso- ciation for Computational Linguistics. Finale Doshi-Velez and Been Kim. 2017. To- wards a rigorous science of interpretable ma- Minqing Hu and Bing Liu. 2004. Mining and chine learning. summarizing customer reviews. In Proceed- ings of the Tenth ACM SIGKDD International Yufei Feng, Zi’ou Zheng, Quan Liu, Michael Conference on Knowledge Discovery and Data Greenspan, and Xiaodan Zhu. 2020. Explor- Mining, KDD ’04, page 168–177, New York, ing end-to-end differentiable natural logic mod- NY, USA. Association for Computing Machin- eling. In Proceedings of the 28th Interna- ery. tional Conference on Computational Linguis- tics, pages 1172–1185, Barcelona, Spain (On- Thomas F. Icard III and Lawrence S. Moss. 2014. line). International Committee on Computa- Recent progress on monotonicity. In Linguistic tional Linguistics. Issues in Language Technology, Volume 9, 2014 - Perspectives on Semantic Representations for Robert M. French. 1999. Catastrophic forgetting Textual Inference. CSLI Publications. in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135. Alon Jacovi and Yoav Goldberg. 2021. Align- ing Faithful Interpretations with their Social At- Mohamed H. Gad-Elrab, Daria Stepanova, Jacopo tribution. Transactions of the Association for Urbani, and Gerhard Weikum. 2019. Exfakt: A Computational Linguistics, 9:294–310. 
framework for explaining facts over knowledge graphs and text. In Proceedings of the Twelfth Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and ACM International Conference on Web Search Byron C. Wallace. 2020. Learning to faithfully and Data Mining, WSDM ’19, page 87–95, rationalize by construction. In Proceedings of
the 58th Annual Meeting of the Association for Zhenghao Liu, Chenyan Xiong, Maosong Sun, Computational Linguistics, pages 4459–4473, and Zhiyuan Liu. 2020. Fine-grained fact ver- Online. Association for Computational Linguis- ification with kernel graph attention network. tics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguis- Masoud Jalili Sabet, Philipp Dufter, François tics, pages 7342–7351, Online. Association for Yvon, and Hinrich Schütze. 2020. SimAlign: Computational Linguistics. High quality word alignments without paral- lel training data using static and contextualized Bill MacCartney. 2009. NATURAL LANGUAGE embeddings. In Proceedings of the 2020 Con- INFERENCE. Ph.D. thesis, STANFORD UNI- ference on Empirical Methods in Natural Lan- VERSITY. guage Processing: Findings, pages 1627–1643, Online. Association for Computational Linguis- Bill MacCartney and Christopher D. Manning. tics. 2007. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Robin Jia and Percy Liang. 2017. Adversarial ex- Textual Entailment and Paraphrasing, pages amples for evaluating reading comprehension 193–200, Prague. Association for Computa- systems. In Proceedings of the 2017 Confer- tional Linguistics. ence on Empirical Methods in Natural Lan- guage Processing, pages 2021–2031, Copen- George A. Miller. 1995. Wordnet: A lexi- hagen, Denmark. Association for Computa- cal database for english. Commun. ACM, tional Linguistics. 38(11):39–41. Isaac Johnson. 2020. Analyzing wikidata tran- Ellie Pavlick, Pushpendre Rastogi, Juri Gan- sclusion on english wikipedia. arXiv preprint itkevitch, Benjamin Van Durme, and Chris arXiv:2011.00997. Callison-Burch. 2015. PPDB 2.0: Better para- phrase ranking, fine-grained entailment rela- Neema Kotonya and Francesca Toni. 2020. Ex- tions, word embeddings, and style classifica- plainable automated fact-checking for public tion. In Proceedings of the 53rd Annual Meet- health claims. In Proceedings of the 2020 Con- ing of the Association for Computational Lin- ference on Empirical Methods in Natural Lan- guistics and the 7th International Joint Con- guage Processing (EMNLP), pages 7740–7754, ference on Natural Language Processing (Vol- Online. Association for Computational Linguis- ume 2: Short Papers), pages 425–430, Beijing, tics. China. Association for Computational Linguis- George Lakoff. 1972. Linguistics and Natural tics. Logic. Springer Netherlands, Dordrecht. Kashyap Popat, Subhabrata Mukherjee, Andrew Tao Lei, Regina Barzilay, and Tommi Jaakkola. Yates, and Gerhard Weikum. 2018. DeClarE: 2016. Rationalizing neural predictions. In Pro- Debunking fake news and false claims using ceedings of the 2016 Conference on Empiri- evidence-aware deep learning. In Proceedings cal Methods in Natural Language Processing, of the 2018 Conference on Empirical Methods pages 107–117, Austin, Texas. Association for in Natural Language Processing, pages 22–32, Computational Linguistics. Brussels, Belgium. Association for Computa- tional Linguistics. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Matt Post and David Vilar. 2018. Fast lexically Levy, Veselin Stoyanov, and Luke Zettlemoyer. constrained decoding with dynamic beam allo- 2020. BART: Denoising sequence-to-sequence cation for neural machine translation. In Pro- pre-training for natural language generation, ceedings of the 2018 Conference of the North translation, and comprehension. 
In Proceed- American Chapter of the Association for Com- ings of the 58th Annual Meeting of the As- putational Linguistics: Human Language Tech- sociation for Computational Linguistics, pages nologies, Volume 1 (Long Papers), pages 1314– 7871–7880, Online. Association for Computa- 1324, New Orleans, Louisiana. Association for tional Linguistics. Computational Linguistics.