Critical Thinking for Language Models - arXiv.org
Gregor Betz† and Christian Voigt† and Kyle Richardson‡
† Karlsruhe Institute of Technology, Karlsruhe, Germany
{gregor.betz, christian.voigt}@kit.edu
‡ Allen Institute for AI, Seattle, WA, USA
{kyler}@allenai.org
arXiv:2009.07185v2 [cs.CL] 17 Dec 2020

Abstract

This paper takes a first step towards a critical thinking curriculum for neural auto-regressive language models. We introduce a synthetic corpus of deductively valid arguments, and generate artificial argumentative texts to train and evaluate GPT-2. Significant transfer learning effects can be observed: Training a model on three simple core schemes allows it to accurately complete conclusions of different, and more complex types of arguments, too. The language models generalize the core argument schemes in a correct way. Moreover, we obtain consistent and promising results for NLU benchmarks. In particular, pre-training on the argument schemes raises zero-shot accuracy on the GLUE diagnostics by up to 15 percentage points. The findings suggest that intermediary pre-training on texts that exemplify basic reasoning abilities (such as typically covered in critical thinking textbooks) might help language models to acquire a broad range of reasoning skills. The synthetic argumentative texts presented in this paper are a promising starting point for building such a “critical thinking curriculum for language models.”

1 Introduction

Pre-trained autoregressive language models (LM) such as GPT-2 and GPT-3 achieve, remarkably, competitive results in a variety of language modeling benchmarks without task-specific fine-tuning (Radford et al., 2019; Brown et al., 2020). Yet, it is also widely acknowledged that these models struggle with reasoning tasks, such as natural language inference (NLI) or textual entailment (Askell, 2020). Actually, that doesn’t come as a surprise, given the tendency of humans to commit errors in reasoning (Kahneman, 2011; Sunstein and Hastie, 2015), their limited critical thinking skills (Paglieri, 2017), the resulting omnipresence of fallacies and biases in texts and the frequently low argumentative quality of online debates (Hansson, 2004; Guiaşu and Tindale, 2018; Cheng et al., 2017). Neural language models are known to pick up and reproduce normative biases (e.g., regarding gender or race) present in the dataset they are trained on (Gilburt, 2019), as well as other annotation artifacts (Gururangan et al., 2018); no wonder this happens with argumentative biases and reasoning flaws, too (Kassner and Schütze, 2020; Talmor et al., 2020). This diagnosis suggests that there is an obvious remedy for LMs’ poor reasoning capability: make sure that the training corpus contains a sufficient amount of exemplary episodes of sound reasoning.

In this paper, we take a first step towards the creation of a “critical thinking curriculum” for neural language models. Critical thinking can be loosely defined as “reasonable reflective thinking that is focused on deciding what to believe or do” (Norris and Ennis, 1989). Generally speaking, our study exploits an analogy between teaching critical thinking to students and training language models so as to improve their reasoning skill. More specifically, we build on three key assumptions that are typically made in critical thinking courses and textbooks: First, there exist fundamental reasoning skills that are required for, or highly conducive to, a large variety of more specific and advanced critical thinking skills (e.g., Fisher, 2001, p. 7). Second, drawing deductive inferences is one such basic ability (e.g., Fisher, 2001, pp. 7–8). Third, reasoning skills are not (just) acquired by learning a theory of correct reasoning, but by studying lots of examples and doing “lots of good-quality exercises” (Lau and Chan, 2020), typically moving from simple to more difficult problems (e.g., Bowell and Kemp, 2014).

These insights from teaching critical thinking translate, with respect to our study, as follows.
First of all, we design and build ‘lots of good-quality exercises’: a synthetic corpus of deductively valid arguments which instantiate a variety of (syllogistic) argument schemes, and which are rendered as text paragraphs (Section 3). Next, we use our synthetic argument text corpus to train and to evaluate GPT-2 (Section 4). The training, which maximizes a causal language modeling objective, can be conceived of as a generic, intermediary pre-training in the spirit of STILTS (Phang et al., 2018).

Evaluating the models’ ability to correctly complete conclusions of arguments, we observe strong transfer learning effects/generalization (Section 5): Just training the models on a few central core schemes (generalized modus ponens, contraposition and chain rule) allows them to accurately complete conclusions of different types of arguments, too (e.g., complex argumentative forms that involve dilemma and de Morgan). The language models appear to connect and generalize the core argument schemes in a correct way. In addition, the models are equally able to apply learned argument patterns beyond the training corpus’ domain. Tests with a simple manually authored argument produce evidence that generic language modeling skill facilitates the successful generalization of learned argument patterns.

Moreover, we test the trained models on different reasoning benchmarks. Because we are particularly interested in transfer learning effects, we do so in a zero-shot set-up (i.e., evaluating our argumentation models on entirely unrelated NLU tasks, which follows recent work by Mitra et al. (2019); Shwartz et al. (2020); Ma et al. (2020)). We obtain consistent and promising results for the GLUE diagnostics (Wang et al., 2018) and SNLI (Bowman et al., 2015) benchmarks (Section 5), finding that training on core schemes clearly improves NLU skill. However, training on the argument corpus doesn’t affect the performance with regard to the semantically more demanding Argument Reasoning Comprehension task (Habernal et al., 2018) or the critical thinking assessment compiled in LogiQA (Liu et al., 2020).

All these transfer learning effects observed strengthen the analogy between teaching critical thinking and training language models: A variety of reasoning skills are improved by generic, intermediary pre-training on high-quality texts that exemplify a basic reasoning skill, namely simple deductive argumentation. Obviously, drawing correct inferences is just one of the elementary skills typically covered in critical thinking courses (Fisher, 2001). Critical thinking involves more than deduction. And it would hence, by analogy, be unreasonable to expect that intermediary pre-training on the synthetic argument corpus suffices to turn language models into accomplished reasoners. However, we have shown that argumentative texts (with valid syllogistic arguments) are certainly a good starting point when building a more comprehensive dataset for initial or intermediary pre-training that might help language models to acquire a broad range of reasoning skills. Or, to put it differently, the synthetic argumentative texts might belong to the core of a “critical thinking curriculum for language models.” In the final section, we advance some ideas for complementing the artificial argument corpus so as to further improve the performance of LMs with regard to different reasoning benchmarks.

2 Related Work

To our knowledge, this paper is, together with Gontier et al. (2020), among the first to show that autoregressive language models like GPT-2 can learn to reason by training on a text corpus of correct natural language arguments. By contrast, previous work in this field, described below, has typically modeled natural language reasoning problems as classification tasks and trained neural systems to accomplish them. For example, Schick and Schütze (2020a,b), using pattern verbalizations, construct structured training data that is suitable for training a masked language model with classification head, and thusly achieve remarkable NLU performance. This paper explores the opposite route: We start with highly structured (synthetic) data, render it as unstructured, plain text and train a uni-directional language model on the synthetic text corpus.

Over and above the methodological novelty of our approach, we discuss, in the following, related reasoning benchmarks and explain what sets our synthetic argument corpus apart from this work.

Rule reasoning in natural language. Various datasets have been developed for (deductive) rule reasoning in natural language. In these tasks, one or multiple rules, i.e. (generalized) conditionals, must be applied to a fact base in order to deductively infer a conclusion. Facts and conclusions
are represented by atomic statements. Rule application closely resembles the conclusion completion task for generalized modus ponens and generalized modus tollens schemes described below. However, we go beyond previous work in investigating the ability of language models to infer conclusions that have a more complex logico-semantic structure (e.g., existential or universal statements).

The question answering bAbI dataset (Weston et al., 2016) contains a task which involves applying very specific rules of the form “Xs are afraid of Ys” to an instance (for example: “Mice are afraid of cats. Jerry is a mouse. What is Jerry afraid of? A: cats”). Equally simple, one-step rule applications are tested in Richardson et al. (2020), and also contained in the QuaRTz dataset (Tafjord et al., 2019).

ROPES (Lin et al., 2019) is a reading comprehension task that involves applying background knowledge to a given situation (both being presented as paragraph-long text). Correct answers can be inferred by one-step rule application; part of the challenge is to identify the relevant rule and fact in the text.

RuleTaker, arguably the most general system for rule reasoning in natural language so far, is a transformer model that has been fine-tuned to predict whether a conclusion can be inferred from a set of rules and facts, not all of which are necessarily required to draw the conclusion (Clark et al., 2020). Moreover, inferring the conclusion from the premise set might involve multiple inference steps. The authors show that the transformer model can be trained to perform this task nearly flawlessly and, moreover, to ‘explain’ its inferences by identifying relevant premises. They also observe substantial transfer learning effects.

PRover extends RuleTaker by a component for proof generation (Saha et al., 2020). Technically, the QA head of the RoBERTa language model (Liu et al., 2019) is complemented by two additional neural classifiers (for nodes and edges) that are used to construct proof chains. Saha et al. (2020) show that PRover can construct valid proofs and outperforms RuleTaker in terms of answer accuracy in a zero-shot setting.

Training on synthetic knowledge-graph data (such as "Paris CapitalOf France" and "France HasCapital Paris") from scratch, Kassner et al. (2020) find that BERT is able to correctly infer novel facts. This confirms that language models can, in principle, learn basic conceptual rules, which, e.g., express that a relation is symmetric or that two terms are equivalent.

Benchmarks for enthymematic reasoning. An ‘enthymeme’ is an argument whose premises are not all explicitly stated, e.g.: “Jerry is a mouse. Therefore, Jerry is afraid of cats.” The three tasks described below involve such reasoning with implicit assumptions, whereas our synthetic argument corpus doesn’t: all premises are transparent and explicitly given.

Commonsense Transformers (COMET) are autoregressive language models for generating commonsense knowledge graphs (Bosselut et al., 2019). Being trained on seed data, the models are able to meaningfully relate subject phrases to object phrases in terms of multiple binary relations (by doing the type of completion tasks we introduce in Section 4), and can thereby both reproduce and extend a given knowledge graph. In particular, this includes generating statements about causal relationships, which can be construed as enthymematic reasoning with commonsense background assumptions. For example, given the input "PersonX is re-elected. As a result, PersonX wants" the model generates as completions: "to get a raise", "to go to office", "to go home", "to make a speech", "to celebrate" – all of which are plausible fill-ins. The implicit commonsense premises that underlie this (enthymematic) inference are principles such as "If someone has been re-elected, then they want to celebrate."

The Argument Reasoning Comprehension (ARC) dataset (Habernal et al., 2018) comprises simple informal arguments. Each argument contains two premises: whereas the first premise is explicitly stated, there are two alternative formulations of the second premise. The task consists in identifying which of these two alternative formulations is actually assumed in the argument. For example: “Miss America gives honors and education scholarships. And since [scholarships would give women a chance to study | scholarships would take women from the home], Miss America is good for women.” ARC therefore assesses the ability to make implicit premises explicit. An adversarial ARC dataset that eliminates clues in the original benchmark is also available in Niven and Kao (2019).

CLUTRR is a task generator for relational reasoning on kinship graphs (Sinha et al., 2019). CLUTRR takes a set of (conceptual) rules about family relations as given and constructs set-theoretic possible worlds (represented as graphs) which instantiate these rules. In such a possible (kinship) world, a target fact and a set of base facts are identified such that the base facts together with the rules deductively entail the target fact. The task consists in inferring the target fact from the base facts alone – the conceptual rules remain implicit. For example: “Kristin and her son Justin went to visit her mother Carol on a nice Sunday afternoon. They went out for a movie together and had a good time. Q: How is Carol related to Justin? A: Carol is the grandmother of Justin.” So, CLUTRR assesses enthymematic deductive reasoning with implicit conceptual rules. Gontier et al. (2020) have trained a generative Transformer language model on a synthetic text corpus (with each argumentative text containing a story, a proof chain and a conclusion from CLUTRR) and show that the language model does not only learn to draw the correct conclusion (given an argument with implicit commonsense premises), but also seems to acquire the ability to generate valid proof chains.

Critical thinking tasks. LogiQA (Liu et al., 2020) is a collection of publicly available critical thinking questions, used by the National Civil Servants Examination of China to assess candidates’ critical thinking and problem solving skills. LogiQA covers tasks of various types: different kinds of natural language inference problems as well as the identification of implicit premises or (practical) instrumental reasoning. Its scope is much broader than our highly specific and carefully designed argument corpus. The LogiQA tasks are shown to be hard for current AI systems, of which a fine-tuned transformer model performs best with an accuracy score of 35%, which is 50 percentage points below human performance.

3 An Artificial Argument Corpus

This section describes the construction of a synthetic corpus of natural language arguments used for training and evaluating GPT-2.¹

¹ The corpus as well as the source code used to generate it will be released at https://github.com/debatelab/aacorpus.

The corpus is built around eight simple, deductively valid syllogistic argument schemes (top row in Figure 1). These base schemes have been chosen because of their logical simplicity as well as their relevance in critical thinking and argument analysis (Feldman, 2014; Bowell and Kemp, 2014; Brun and Betz, 2016). Each of these eight base schemes is manually varied in specific ways to create further valid variants.

Negation variants of base schemes (second row in Figure 1) are created by substituting a sub-formula with its negation and/or by applying duplex negatio affirmat.

Complex predicates variants (third row in Figure 1) build on base schemes or their respective negation variants and are obtained by substituting atomic predicates with compound disjunctive or conjunctive ones.

De Morgan variants of base schemes (fourth row in Figure 1) are finally derived by applying de Morgan’s law to the respective variants created before.

With 2-3 different versions for each of these variations of a base scheme (parameter "n" in Figure 1), we obtain, all in all, 71 distinct handcrafted argument schemes. Obviously, some of these schemes can be derived from others. For example, generalized modus ponens and generalized contraposition (base schemes) entail a negation variant of generalized modus tollens. Likewise, generalized contraposition and hypothetical syllogism 1 entail a (negation variant of) hypothetical syllogism 2.

In view of their simplicity and prominence in natural language argumentation, three of the eight base schemes are marked as core schemes: generalized modus ponens, generalized contraposition, hypothetical syllogism 1.

Natural language instances of the argument schemes can be created by means of a first-order-logic domain (with names and predicates) and natural language templates for the formal schemes. In order to obtain a large variety of realistic natural language arguments, we have devised

• a multi-stage templating process with
• alternative templates at each stage and
• multiple domains.

As shown in Figure 2, this process can be split into five consecutive steps.

In step 1, the argument scheme, which serves as formal template for the natural language argument, is chosen.
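The deductive validity of individual schemes, and derivability claims such as the one about generalized modus tollens in this section, can be checked mechanically. The following is a minimal sketch and not the authors' code: it searches for a countermodel by enumerating which combinations of predicate values ("element types") are inhabited, which should suffice for monadic formulas without identity such as the ones used here. The helper names (`forall`, `about_a`, `find_countermodel`) are hypothetical, and the "affirming the consequent" pattern is not part of the corpus; it is included only as an invalid contrast case.

```python
from itertools import combinations, product

def forall(body):
    """Build the formula 'for all x: body(x)'."""
    return lambda domain, a: all(body(x) for x in domain)

def about_a(body):
    """Build a formula about the individual named by the constant a."""
    return lambda domain, a: body(a)

def find_countermodel(premises, conclusion, predicates):
    """Return (domain, a) with all premises true and the conclusion false, else None."""
    # An element is fully described by the truth values of the monadic predicates.
    element_types = [dict(zip(predicates, bits))
                     for bits in product([False, True], repeat=len(predicates))]
    for size in range(1, len(element_types) + 1):
        for domain in combinations(element_types, size):
            for a in domain:  # every possible denotation of the constant a
                if all(p(domain, a) for p in premises) and not conclusion(domain, a):
                    return domain, a
    return None

# Generalized modus ponens: ∀x Fx→Gx, Fa ⊢ Ga
modus_ponens = ([forall(lambda x: not x["F"] or x["G"]),
                 about_a(lambda x: x["F"])],
                about_a(lambda x: x["G"]))

# Hypothetical syllogism 1: ∀x Fx→Gx, ∀x Gx→Hx ⊢ ∀x Fx→Hx
hypothetical_syllogism_1 = ([forall(lambda x: not x["F"] or x["G"]),
                             forall(lambda x: not x["G"] or x["H"])],
                            forall(lambda x: not x["F"] or x["H"]))

# Negation variant of generalized modus tollens: ∀x Fx→¬Gx, Ga ⊢ ¬Fa
modus_tollens_negation = ([forall(lambda x: not x["F"] or not x["G"]),
                           about_a(lambda x: x["G"])],
                          about_a(lambda x: not x["F"]))

# Invalid contrast case (not in the corpus): ∀x Fx→Gx, Ga ⊢ Fa
affirming_the_consequent = ([forall(lambda x: not x["F"] or x["G"]),
                             about_a(lambda x: x["G"])],
                            about_a(lambda x: x["F"]))

schemes = [
    ("generalized modus ponens", modus_ponens, ["F", "G"]),
    ("hypothetical syllogism 1", hypothetical_syllogism_1, ["F", "G", "H"]),
    ("modus tollens (negation variant)", modus_tollens_negation, ["F", "G"]),
    ("affirming the consequent", affirming_the_consequent, ["F", "G"]),
]
for name, (premises, conclusion), predicates in schemes:
    found = find_countermodel(premises, conclusion, predicates)
    print(f"{name}: {'no countermodel found' if found is None else 'countermodel found'}")
```

The search grows doubly exponentially in the number of predicates, so this approach is only practical for schemes with a handful of predicate letters, which is the case for the schemes discussed here.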
Figure 1: Syllogistic argument schemes used to create an artificial argument corpus. [The figure lists the eight schemes in columns and their base_scheme, negation_variant, complex_predicates and de_morgan forms in rows; each variant is annotated with its number of versions n (n = 2 or n = 3). The base schemes in the top row are:]

generalized modus ponens:    ∀x Fx→Gx, Fa ⊢ Ga
generalized contraposition:  ∀x Fx→¬Gx ⊢ ∀x Gx→¬Fx
hypothetical syllogism 1:    ∀x Fx→Gx, ∀x Gx→Hx ⊢ ∀x Fx→Hx
hypothetical syllogism 2:    ∀x Fx→Gx, ∀x ¬Hx→¬Gx ⊢ ∀x Fx→Hx
hypothetical syllogism 3:    ∀x Fx→Gx, ∃x Hx∧¬Gx ⊢ ∃x Hx∧¬Fx
generalized modus tollens:   ∀x Fx→Gx, ¬Ga ⊢ ¬Fa
disjunctive syllogism:       ∀x Fx→Gx∨Hx, ∀x Fx→¬Gx ⊢ ∀x Fx→Hx
generalized dilemma:         ∀x Fx→Gx∨Hx, ∀x Gx→Jx, ∀x Hx→Jx ⊢ ∀x Fx→Jx

In step 2, each sentence in the formal scheme (premises and conclusion) is individually replaced by a natural language pattern in accordance with a randomly chosen template. For example, the formula “∀x Fx→Gx” might be replaced by any of the following natural language sentence schemes:

• “Every F is a G.”
• “Whoever is a F is also a G.”
• “Being a G is necessary for being a F.”
• “If someone is a F, then they are a G.”*

Some of these patterns are not used for training, but are reserved for generating an out-of-domain test dataset (e.g., the template marked with an asterisk in the above list).

In step 3, the entity- and property-placeholders in the resulting argument scheme are replaced argument-wise with names and predicates from a domain. We hence obtain an instance of the formal argument scheme as premise-conclusion list. Each domain provides hundreds of entity-names, which can be paired with different binary predicates to create thousands of different unary predicates. The following example predicates illustrate the domains used in this study:

• Female Relatives: sister of Anna, granddaughter of Elsa, cousin of Sarah, . . .
• Male Relatives: grandson of Ryan, nephew of Jim, cousin of Lee, . . .
• Football Fans: supporter of Real Madrid CF, ex-fan of Sevilla FC, member of SSC Napoli, . . .
• Personal Care: regular consumer of Dove shampoo, infrequent user of L’Oreal shampoo, loyal buyer of Redken shampoo, . . .
• Chemical Ingredients: ingredient of Maypole Soap, ingredient of OASIS CREAM, ingredient of BB concealer, . . .
• Dinosaurs*: contemporary of Megalosaurus, predator of Iguanodon, ancestor of Allosaurus, . . .
• Philosophers*: teacher of Aeschines of Neapolis, pupil of Cratylus, reader of Democritus, . . .

Domains marked with an asterisk are used for testing only, and not for training (see below and Section 4.2).

In step 4, the premises of the natural language argument are randomly re-ordered.

In step 5, the premise-conclusion list is packed into a text paragraph by adding an argument intro, framing the premises, and adding an inference indicator. Again, multiple templates are available for doing so, which yields a large variety of textual
renderings of an argument.

Figure 2: Pipeline for creating natural language instances of argument schemes with multiple templating. [The figure shows an artificial argument corpus config file that supplies formal argument schemes, topic-neutral NL templates for formal sentence schemes, domain-specific NL names and binary predicates, and topic-neutral argument-, premise-, and inference-indicators. These feed the five steps below, illustrated with one example.]

Step 1: choose formal argument scheme.
    ∀x Fx→¬Gx, Ga ⊢ ¬Fa
Step 2: choose & substitute NL schemes sentence-wise.
    No F is a G. a is a G. ⊢ It is false that a is a F.
Step 3: construct & substitute domain-specific predicates and names.
    1. No sister of Lisa is a friend of Chloe.
    2. Susan is a friend of Chloe.
    3. It is false that Susan is a sister of Lisa.
Step 4: permutate premises randomly.
    1. Susan is a friend of Chloe.
    2. No sister of Lisa is a friend of Chloe.
    3. It is false that Susan is a sister of Lisa.
Step 5: construct & apply argument template.
    Here comes a perfectly valid argument: To begin with, Susan is a friend of Chloe. Moreover, no sister of Lisa is a friend of Chloe. In consequence, it is false that Susan is a sister of Lisa.

Following this pipeline, we generate natural language instances of each formal argument scheme, thus creating:

1. a training set of argumentative texts, based on the default domains and templates (TRAIN);
2. an evaluation set of argumentative texts, based on the default domains and templates, which are used for development (DEV);
3. a test set of argumentative texts, based on the default domains and templates and used for final tests (TEST_OUT-OF-SAMPLE);
4. a test set of argumentative texts, based on the domains and templates reserved for testing (TEST_OUT-OF-DOMAIN).

This represents the artificial argument text corpus we use to train and evaluate GPT-2.

4 Experiments with GPT-2

We train and evaluate three compact versions of GPT-2 with 117M, 345M and 762M parameters respectively, using the implementation from Wolf et al. (2019). We note that all of these models fall short of the full-scale model with 1542M parameters.²

² The fine-tuned models will be released through https://huggingface.co/models.

4.1 Training

From the training items in the Artificial Argument Corpus (TRAIN) we sample three types of differently-sized training sets as follows (see also the color pattern in Figure 1):

• TRAIN01: all training items which are instances of a core scheme, i.e. generalized modus ponens, generalized contraposition, hypothetical syllogism 1 (N=4.5K, 9K, 18K, 36K)
• TRAIN02: all training items which are instances of a base scheme (N=4.5K, 9K, 18K, 36K)
• TRAIN03: all training items in the corpus (N=4.5K, 9K, 18K, 36K)

In an attempt to avoid over-fitting, we blend the training arguments with snippets from Reuters news stories (Lewis et al., 2004) and the standardized Project Gutenberg Corpus (Gerlach and Font-Clos, 2018), trying a mixing ratio of 1:1 and thus doubling training size to N=9K, 18K, 36K, 72K. (We find that fine-tuning on the accordingly enhanced argument corpus still increases the model’s perplexity on the Wiki103 dataset by a factor of 1.5 (see Appendix B), which suggests mixing a higher proportion of common texts into the training data in future work.) The three different versions of GPT-2 are fine-tuned (causal language modeling objective, using default training scripts by Wolf et al. (2019)) on each of the 12 enhanced training sets (hyper-parameters are detailed in Appendix A). This gives us 36 fine-tuned model versions plus the three BASE models to evaluate. Unless explicitly stated otherwise, we report results of the 762M parameter model trained on 72K items.

4.2 Testing

Conclusion Completion on Artificial Argument Corpus. To test whether language models can reason correctly, we assess their ability to accurately complete conclusions of arguments in the artificial argument corpus. Here, we make use of the fact that, by construction, the conclusion of every argument in the corpus ends with a predicate (a property-term such as “sister of Chloe” or “supporter of Tottenham Hotspurs”), which is potentially preceded by a negator. First of all, as shown in Table 1, we test whether the model is able to correctly fill in the final predicate (task split). The second, more difficult task consists in completing the final predicate plus, if present, the preceding negator (task extended). With a third, adversarial task we check how frequently the model wrongly adjoins the complement of the correct completion of the extended task (task inverted). Consider, for example, the following argument:

    It is not always easy to see who is related to whom – and in which ways. The following argument pertains to this question: First premise: Every workmate of Brad is a classmate of James. Second premise: Every classmate of James is not a classmate of Theodore. So, necessarily, everyone who is a workmate of Brad is [not a]_E [classmate of Theodore.]_S

In the split task, we prompt the model with the argument, dropping []_S, and check whether it generates “classmate of Theodore”. In the extended task, we prompt the model with the argument, dropping []_E []_S, and check whether it generates “not a classmate of Theodore”. Finally, in the inverted task, we prompt the model as before and check whether it generates “a classmate of Theodore”.

Clearly, the higher the accuracy in the split and extended tasks, and the lower the accuracy in the inverted task, the stronger the model’s reasoning performance.

Task       Conclusion with cloze-style prompt   Completion
split      Every F is a G                       G
           Some F is not a G                    G
           a is a F or not a G                  G
extended   Every F is a G                       a G
           Some F is not a G                    not a G
           a is a F or not a G                  not a G
inverted   Every F is a G                       not a G
           Some F is not a G                    not a G
           a is a F or not a G                  not a G

Table 1: Three conclusion completion tasks

Based on the artificial argument corpus (see Section 3), we generate and distinguish three different test datasets, each of which comprises the three tasks described above, as follows:

• out of sample: contains items from TEST_OUT-OF-SAMPLE, which share domain and natural language templates with the training data;
• paraphrased: a sample of 100 items, randomly drawn from TEST_OUT-OF-SAMPLE, which have been manually reformulated so as to alter the premises’ grammatical structure imposed by the natural language templates;
• out of domain: contains items from TEST_OUT-OF-DOMAIN, which belong to different domains and instantiate grammatical patterns other than the training data.

Technically, conclusion completions, in all tasks and tests, are generated by the language model with top-p nucleus sampling (p = 0.9).

Classification for NLU Benchmarks. To investigate transfer learning effects, we evaluate the trained models on standard NLU benchmarks, such as GLUE AX and SNLI. These benchmark tasks are classification problems. In the following, we describe how we use the generative language models to perform such classification.

Using simple templates, we translate each benchmark entry into alternative prompts (e.g., context and question) and/or alternative completions (e.g., answers). Consider for example a
GLUE-style problem given by two sentences “The girl is eating a pizza.” and “The girl is eating food.” and the question whether one entails, contradicts, or is independent of the other. We can construct three prompts, corresponding to the three possible answers (entail / contradict / independent):

    Prompt1: The girl is eating a pizza. Therefore,
    Prompt2: The girl is eating a pizza. This rules out that
    Prompt3: The girl is eating a pizza. This neither entails nor rules out that
    Completion: the girl is eating food.

In this case, the correct match is obviously Prompt1–Completion. The ability of a language model to discern that “The girl is eating a pizza” entails (and does not contradict) “The girl is eating food” will be reflected in a comparatively low conditional perplexity of Completion given Prompt1 and a correspondingly high conditional perplexity of Completion given Prompt2 or Prompt3.

Let us describe this procedure in more general terms and consider a textual classification problem with categories $k = 1 \ldots N$. To classify a given input $X$, one constructs $n$ alternative prompts $p_1, \ldots, p_n$ and $m$ alternative completions $c_1, \ldots, c_m$ ($N = m \cdot n$), such that each pair $(p_i, c_j)$ corresponds to a class $k$ of the classification problem, i.e.,

$$\mathcal{L} : (p_i, c_j) \mapsto \{1 \ldots N\}.$$

In the above pizza example, we have $N = n = 3$ and $m = 1$. Moreover, let $\mathrm{PP}_L(c \mid p)$ refer to the conditional perplexity of the completion $c$ given prompt $p$ according to the language model $L$. Rather than directly using this conditional perplexity as a prediction score (as for instance in Shwartz et al., 2020), which doesn’t account for varying ‘prima facie’ or ‘prior’ perplexities of alternative completions, we consider the degree to which prompting the model $L$ with $p$ changes the perplexity of $c$, i.e.

$$\mathrm{relPP}_L(c, p) := \frac{\mathrm{PP}_L(c \mid p)}{\mathrm{PP}_L(c)}.$$

In analogy to Bayesian confirmation theory, this might be termed a (perplexity-based) relevance measure, as opposed to a measure of absolute confirmation (cf. Carnap, 1950, pp. 346-48). We now use relevance perplexity as a score function to predict the category of $X$:

$$\mathrm{category}(X) = \mathcal{L}\Big(\operatorname*{argmin}_{(p_i, c_j)} \mathrm{relPP}_L(c_j, p_i)\Big).$$

5 Results

Conclusion Completion on Artificial Argument Corpus. Does the (fine-tuned) GPT-2 model correctly complete conclusions of natural language arguments? Figure 3 displays the evaluation results in an aggregated way. Each subplot visualizes the accuracy of the models in the three completion tasks for a different test dataset (see Section 4.2), comparing the BASE model (points at the very left) with the fine-tuned models trained on TRAIN01, TRAIN02, and TRAIN03 (in this order from left to right). The task-specific accuracy values are distinguished by line color.

Figure 3: Accuracy of four model versions in three conclusion completion tasks and on different test datasets (out of sample, paraphrased, out of domain). [Three panels, one per test dataset; x-axis: model (base, train01, train02, train03); y-axis: accuracy (0.0 to 1.0); one line per task (split, extended, inverted).]

We may observe, first of all, that training on the argument corpus effectively improves conclusion completion skill. In all three test datasets, the accuracy in the split and extended tasks increases as the models are trained on more and more argument schemes, far exceeding the base model’s performance. Once the model has seen all schemes (TRAIN03), accuracy levels reach 100% for in-domain and 70%-90% for out-of-domain tests. However, the TRAIN01 and TRAIN02 models do also generate more incorrect completions than the BASE model (inverted task). But the frequency of such incorrect completions increases much less than the frequency of correct ones (the gap between blue and gray curve widens), and it actually falls back to almost zero with the TRAIN03 model. Out-of-domain performance of the models (right-hand plot) is qualitatively similar and only slightly less strong than in-domain performance (left-hand and middle plot). The models trained on arguments from a given domain are able to effectively exercise the reasoning skill thus acquired in other domains, and have hence gained topic-neutral, universal reasoning ability.

The strong performance of TRAIN01 models, averaged over all schemes, suggests that significant transfer learning occurs and that training on a few argument schemes positively affects performance on other schemes, too. To further investigate this issue, Table 2 contrasts (a) the models’ accuracy on schemes they have not been trained on – averaged over TRAIN01 and TRAIN02 models – with (b) their accuracy on schemes that are instantiated in their respective training corpus – averaged over TRAIN01, TRAIN02, and TRAIN03 models. The upshot is that trained models perform way more strongly than the base model not only on argument schemes they’ve been trained on, but also on those schemes they haven’t seen yet. We take this to be a promising result as it strengthens the analogy between teaching critical thinking and training language models: generic inter- [...]

                  (a) schemes not in training data (TR01–02)   (b) trained on schemes (TR01–03)
Task       BASE   o-o-sample   paraphr.   o-o-domain           o-o-sample   paraphr.   o-o-domain
split      21.4   85.4         82.0       69.4                 99.9         99.2       89.0
extended   10.7   60.3         59.3       45.8                 99.9         99.2       76.2
inverted    1.5   16.9         18.0       22.1                  0.0          0.0        3.2

Table 2: Accuracy (in %) of models in three conclusion completion tasks and on different test datasets (out of sample, paraphrased, out of domain). Columns report, separately, the performance (a) on schemes the model has not been trained on, and (b) on schemes that are covered by the model’s training data.

[...] performance on unknown schemes. Figure 4 reveals, first of all, that even the BASE models (only pre-training, no fine-tuning) display a significant ability to correctly complete conclusions of some kinds of arguments. For example, GPT-2-762M achieves 50% accuracy (split task) in completing contrapositions, 30% accuracy in completing generalized modus ponens, and still 20% accuracy in completing disjunctive syllogism and dilemma arguments.
These findings further corroborate the mediary pre-training on high-quality texts that ex- hypothesis that NLMs learn (basic) linguistic and emplify a specific, basic reasoning skill – namely, reasoning skills “on the fly” by training on a large simple deductive argumentation – improves other, generic corpus (Radford et al., 2019). more complex reasoning skills. In addition, the matrix plot (Figure 4) demon- Figure 4 gives further insights by differentiating strates that some types of arguments are much evaluation results according to argument type. Its easier to master, given training on the core and subplots are arranged in a grid that mirrors the or- possibly base schemes, than others. For in- ganisation of argument schemes in Figure 1. Each stance, complex_predicates variants of general- subplot visualizes the ability of the models to cor- ized modus ponens or de_morgan variants of gen- rectly complete arguments of the corresponding eralized modus tollens seem to be easily mas- scheme (given the out-of-sample test dataset). Ac- tered by the TRAIN01 model. In contrast, even cordingly, the left-hand plot in Figure 3 in effect the TRAIN02 model, which has been fine-tuned on averages all curves in Figure 4. Reported accu- all eight base schemes, struggles with the nega- racy values that fall within gray background areas tion_variants of generalized modus ponens (gen- are attained by models which have seen the cor- erating substantially more incorrect than correct responding scheme during training. Vice versa, completions). All in all, the picture that emerges thick lines on white background visualize model is plausible: Generalization towards novel types
of argument appears to be comparatively difficult whenever the new scheme involves negations (compare 2nd and 4th row in Figure 4 with 3rd row). This is consistent with the finding that some NLMs seemingly fail to understand simple negation (Kassner and Schütze, 2020; Talmor et al., 2020).

Figure 4: Accuracy of conclusion completions (three tasks) for instances of different argument schemes (see Figure 1) and four model versions. (Grid of subplots. Columns: Generalized modus ponens, Generalized Contraposition, Hypothetical Syllogism 1, Hypothetical Syllogism 2, Hypothetical Syllogism 3, Generalized modus tollens, Disjunctive Syllogism, Generalized Dilemma. Rows: base_scheme, negation_variant, complex_predicates, de_morgan. x-axis: BASE, TR01, TR02, TR03; lines: split, extended, and inverted task; background: trained vs. not trained on scheme.)

The results reported so far suggest that reasoning skills acquired on (a subset of) the artificial argument corpus generalize rather well, both to other domains and to other types of arguments. We have further cross-checked these statistical findings by letting the models complete the conclusion of a simple manually authored argument:

[Hermes] Every philosopher is mortal. Hermes is not mortal. Therefore, Hermes ...

This text differs syntactically and semantically from any argument possibly contained in the artificial argument corpus (where predicates always have the form “is/being a Y of X,” and no domain covers philosophers or mortality). Obviously, it follows that Hermes “is not a philosopher.” The argument instantiates generalized modus tollens, which is not a core scheme in TRAIN01. Can TRAIN01 models nonetheless fill out the unfinished argument in a sensible way?

Completion | Type | 762M TR01 | 762M BASE | 117M TR01
... is not a philosopher. | ⋆ | 100 | 2 | 2
... is immortal. | = | 0 | 12 | 0
... is not a critic. | ◦ | 0 | 0 | 9
... is mortal. | † | 0 | 8 | 0
... is not mortal. | = | 0 | 6 | 0
... is not Hermes. | † | 0 | 2 | 0
... does not exist. | ◦ | 0 | 2 | 0
... is not God. | ◦ | 0 | 2 | 0
... is not a friend of Eckhardt. | ◦ | 0 | 0 | 1
... is not an expert of BSI Arsenal FC. | ◦ | 0 | 0 | 1
... is not a friend of Atalanta. | ◦ | 0 | 0 | 1
... is not an infrequent user of Neutrogena shampoo. | ◦ | 0 | 0 | 1
others | | 0 | 66 | 85

Table 3: Absolute frequency of predicted completions for the hand-written [Hermes] query by three different models. Completions are, relative to the premises, entailed (⋆), redundant (=), contradictory (†) or independent (◦).

Table 3 counts and compares the most frequent
completions generated by two TRAIN01 models (762M and 117M) and by the large untrained BASE model (762M). Only the 762M model trained on the core schemes reliably predicts the correct conclusion. The large BASE model instead repeats a premise or even generates a contradiction, whereas the small TRAIN01 model (117M) changes the topic. This is consistent with and illustrates our previous findings. Remarkably, although both the small and the large TRAIN01 models have been fine-tuned on precisely the same arguments, only the large model seems to correctly recognize the logical structure of the [Hermes] argument. This suggests that generic language modeling skill facilitates the successful generalization of learned argument patterns beyond the templates used to create the synthetic training data.

To further understand transfer learning effects, we next examine whether intermediary pre-training on the artificial argument corpus (AAC) improves zero-shot performance in other NLP reasoning tasks (i.e., without task-specific fine-tuning).

GLUE AX. The GLUE datasets (Wang et al., 2018) represent standard benchmarks for natural language understanding (NLU). We evaluate our models’ NLU skill in terms of accuracy on the curated GLUE diagnostics dataset (Figure 5).

Training on the artificial argument corpus substantially boosts accuracy on the GLUE diagnostics. Accuracy increases by at least 5 and up to 17 percentage points, depending on model size. Remarkably, training on the core schemes alone suffices to bring about these improvements.

This is a major finding and our clearest evidence so far that training on the AAC involves substantial transfer learning effects.

SNLI. The SNLI dataset (Bowman et al., 2015) is another standard benchmark for NLI. Like the GLUE dataset, it consists of pairs of sentences which entail, contradict, or don’t bear on each other. The assessment of our models with respect to SNLI data proceeds in close analogy to the GLUE benchmark.

The results, reported in Figure 5, are consistent with, albeit less definite than, our previous findings for the GLUE benchmark: First and foremost, fine-tuning on all schemes (TRAIN03) improves the performance by up to 8 percentage points. Training on fewer schemes is slightly less effective. However, it is only the small and medium sized models that profit from fine-tuning on the AAC; the SNLI performance of the 762M parameter model deteriorates. This might be due to a coincidentally strong performance of the corresponding BASE model (see Figure 7), or it might suggest that the large model, unlike the smaller ones, has already learned during pre-training whatever in the AAC is of relevance for SNLI. (Further experiments, preferably involving more model versions, are required to clarify this.)

Argument Reasoning Comprehension Task. The Argument Reasoning Comprehension (ARC) task (Habernal et al., 2018) assesses the ability to identify a missing premise in an informally reconstructed and not necessarily deductively valid argument. It is a multiple-choice task where two alternative sentences are provided, one of which is the missing premise.

We design and apply specific templates to construct prompts and completions, and calculate relative perplexity as described in Section 4.2.

As shown in Figure 5, we find no evidence of transfer learning effects with respect to ARC.

LogiQA. LogiQA (Liu et al., 2020) is a collection of nearly 9,000 multiple-choice questions (four alternative answers each) used in critical thinking assessments. These questions span the whole range of critical thinking tasks.

We design and apply specific templates to construct prompts and completions (one prompt and four completions per question), and use perplexity scores to predict classifications as described above (Section 4.2).

As can be seen from Figure 5, training on the artificial argument corpus has no effect whatsoever on the ability of the models to handle the critical thinking tasks collected in LogiQA.

6 Conclusion

This paper has taken a first step towards the creation of a critical thinking curriculum for neural language models. It presents a corpus of deductively valid, artificial arguments, and uses this artificial argument corpus to train and evaluate GPT-2. The observation of strong transfer learning effects/generalization is its main finding: Training a model on a few central core schemes allows it to accurately complete conclusions of different types of arguments, too. The language models seem to connect and to generalize the core argument
schemes in a correct way. Moreover, the models are equally able to apply learned argument patterns beyond the domain they have been trained on, and there is evidence that generic language modeling skill facilitates the successful generalization of learned argument patterns. These findings are consistent with previous work on rule reasoning (Clark et al., 2020). They suggest that there exist (learning-wise) fundamental reasoning skills in the sense that generic intermediary pre-training on texts which exemplify these skills leads to spill-over effects and can improve performance on a broad variety of reasoning tasks. The synthetic argumentative texts might be a good starting point for building such a “critical thinking curriculum for language models.”

Moreover, the trained models have been tested on different reasoning benchmarks. We obtain clear and promising results for the GLUE and SNLI benchmarks. But training on the argument corpus doesn’t affect the performance with regard to the semantically more demanding Argument Reasoning Comprehension task or the critical thinking assessment compiled in LogiQA.

Figure 5: Gains in accuracy due to fine-tuning on the AAC (accuracy of TRAIN model minus accuracy of BASE model) for differently sized models and different NLP benchmark tasks: the GLUE diagnostics data, the SNLI dataset, the argument reasoning comprehension (ARC) benchmark, and the LogiQA dataset. (Four panels: GLUE AX, SNLI, ARC Task, LogiQA; x-axis: model (train01, train02, train03); y-axis: gain in accuracy relative to base; lines: model size 117M, 345M, 762M.)

Our work suggests different directions for advancing the approach adopted in this paper and further improving the general reasoning skill of neural language models:

• The syllogistic argument text corpus might be complemented with corpora of arguments that instantiate different kinds of correct schemes, e.g., propositional inference schemes, modal schemes, argument schemes for practical reasoning, complex argument schemes with intermediary conclusions or assumptions for the sake of the argument, etc. (Technically, we provide the infrastructure for doing so, as all this might be achieved through adjusting the argument corpus configuration file.)
• To succeed in NLI tasks, it doesn’t suffice to understand ‘what follows.’ In addition, a system needs to be able to explicitly discern contradictions and non sequiturs (relations of logical independence). This suggests that the artificial argument corpus might be fruitfully supplemented with corpora of correctly identified aporetic clusters (Rescher, 1987) as well as corpora containing correctly diagnosed fallacies.
• In addition, the idea of curriculum learning for ML (Bengio et al., 2009) might be given a try. Accordingly, a critical thinking curriculum with basic exemplars of good reasoning would not only be used to fine-tune a pre-trained model, but would be employed as a starting point for training a language model from scratch.

Natural language templating is a fundamental technique used throughout this paper: both in constructing the artificial argument corpus and in transforming the NLP benchmark datasets into text that can be processed by language models. The concrete templates applied have been designed in a trial-and-error process. It is far from clear that these represent optimal choices for effectively eliciting a language model’s skills. Still, following Jiang et al. (2020), it seems of great importance to gain a more systematic understanding of different templating strategies and their effects on metrics based on accuracy and perplexity.

In conclusion, designing a critical thinking curriculum for neural language models seems to be a promising and worthwhile research program to pursue.
A Appendix: Training Parameters

We train the models on 8 GPUs for 2 epochs with batch size = 2, learning rate = $5 \times 10^{-5}$, gradient accumulation steps = 2, and default parameters of the HuggingFace implementation otherwise (Wolf et al., 2019).

B Appendix: Performance Metrics for Differently Sized Training Sets

Figure 6 displays accuracy values on conclusion completion tasks for models trained on differently sized datasets.

Figure 7 reports perplexity and NLU accuracy metrics for models trained on differently sized datasets.

Figure 6: Accuracy on three conclusion completion tasks as a function of training corpus size.

Figure 7: Perplexity and NLI metrics as a function of training corpus size. (Three panels: Perplexity Wiki103, GLUE AX, SNLI; x-axis: size of training set (0, 9K, 18K, 36K, 72K); y-axis: perplexity or accuracy; lines: model size (117M, 345M, 762M) and training set (train01, train02, train03).)

References

Amanda Askell. 2020. GPT-3: Towards renaissance models. In Daily Nous Blog: Philosophers On GPT-3.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA. ACM.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Çelikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).

Tracey Bowell and Gary Kemp. 2014. Critical Thinking: A Concise Guide, 4th edition. Routledge, London.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Georg Brun and Gregor Betz. 2016. Analysing practical argumentation. In Sven Ove Hansson and Gertrude Hirsch-Hadorn, editors, The Argumentative Turn in Policy Analysis. Reasoning about Uncertainty, pages 39–77. Springer, Cham.

Rudolf Carnap. 1950. Logical Foundations of Probability. University of Chicago Press, Chicago.

J. Cheng, M. Bernstein, C. Danescu-Niculescu-Mizil, and J. Leskovec. 2017. Anyone can become a troll: Causes of trolling behavior in online discussions. CSCW: Proceedings of the Conference on Computer-Supported Cooperative Work, 2017, pages 1217–1230.

Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. arXiv preprint arXiv:2002.05867v2.

Richard Feldman. 2014. Reason and Argument. Pearson, Harlow.

Alec Fisher. 2001. Critical Thinking: An Introduction. Cambridge University Press, Cambridge.

Martin Gerlach and Francesc Font-Clos. 2018. A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics. CoRR, abs/1812.08092.

Ben Gilburt. 2019. Examining gender bias in OpenAI’s GPT-2 language model. hackernoon.com.

Nicolas Gontier, Koustuv Sinha, Siva Reddy, and Christopher Pal. 2020. Measuring systematic generalization in neural proof generation with transformers.

Radu Cornel Guiaşu and Christopher W Tindale. 2018. Logical fallacies and invasion biology. Biology & philosophy, 33(5-6):34.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. The argument reasoning comprehension task: Identification and reconstruction of implicit warrants. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1930–1940. Association for Computational Linguistics.

Sven Ove Hansson. 2004. Fallacies of risk. Journal of Risk Research, 7(3):353–360.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

Daniel Kahneman. 2011. Thinking, fast and slow, 1st edition. Farrar, Straus and Giroux, New York.

Nora Kassner, Benno Krojer, and Hinrich Schütze. 2020. Are pretrained language models symbolic reasoners over knowledge?

Nora Kassner and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly.

Joe Lau and Jonathan Chan. 2020. Critical thinking web. https://philosophy.hku.hk/think.

D. D. Lewis, Y. Yang, T. Rose, and F. Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.

Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. Proc. MRQA Workshop (EMNLP’19).