Comprehension Based Question Answering using Bloom's Taxonomy
Pritish Sahu 1,2 *   Michael Cogswell 1 *   Sara Rutherford-Quach 1   Ajay Divakaran 1
1 SRI International   2 Rutgers University
* These two authors contributed equally.
arXiv:2106.04653v1 [cs.CL] 8 Jun 2021

Abstract

Current pre-trained language models have a lot of knowledge, but a more limited ability to use that knowledge. Bloom's Taxonomy helps educators teach children how to use knowledge by categorizing comprehension skills, so we use it to analyze and improve the comprehension skills of large pre-trained language models. Our experiments focus on zero-shot question answering, using the taxonomy to provide proximal context that helps the model answer questions by being relevant to those questions. We show that targeting context in this manner improves performance across four popular common sense question answering datasets.

1 Introduction

Recent large language models such as GPT-3 (Brown et al., 2020) have made a giant leap forward in knowledge acquisition and even generalize this knowledge to new tasks. But when less narrow tasks are considered they fail to understand as much as these benchmarks suggest. They turn out to be "stochastic parrots" (Bender et al., 2021) or "smart/super parrots" (Dunietz et al., 2020) that just memorize without all of the comprehension we want from a Natural Language Understanding system. We focus on a particular kind of failure mode where the model knows (has memorized) the information it needs but is not able to apply that information correctly, and we do so in a zero-shot fashion to control for what the model knows.

For example, in Fig. 1 the model is asked whether a mixture of grape juice and cranberry juice is safe to drink (Marcus and Davis, 2020). GPT-3 declares that it is a deadly poison, even though it appears to "know" that grape juice and cranberry juice are safe to drink by themselves (Fig. 1, Level 1, dark purple). It even knows that cranberry juice with grape juice is not poisonous, but it still thinks the result is death (Fig. 1, Level 2, light blue). The model has memorized the necessary information from large amounts of text, but it does not use its knowledge appropriately. Following Shwartz et al. (2020), we extract this knowledge as explicit language, then feed it back as additional context during inference, forcing the model to use what it already knows, but in our case targeting specifically useful knowledge.

To formalize this distinction we drew inspiration from elementary school classrooms, where teachers (Miller, 2002; Harvey and Goudvis, 2007) take a schema-based approach in which they teach children to demonstrate multiple levels of comprehension, ranging from direct recall from memory to complex inferences. They use a hierarchy of comprehension skills called Bloom's Taxonomy (Anderson et al., 2000) (cf. Fig. 1), with memorization at the bottom (requiring children to recall facts), followed by understanding (requiring children to grasp semantics), application (requiring children to solve problems), and more complex skills. For us, these comprehension skills describe ways our language model might fail to use its knowledge.

In this paper we address this failure mode by relying on commonly understood relationships between the skills of Bloom's Taxonomy, which we term proximal context. In order to understand whether the cranberry grape mixture is poisonous, the model needs to remember whether grape juice is poisonous. In order to apply its knowledge to figure out what will happen next, it needs to understand whether the cranberry grape mixture is poisonous or not. In general, the proximal context for a particular task T at level L is given by those tasks implicitly required by T, which are mostly at level L - 1 of the taxonomy. We guide our language model to answer questions more accurately by providing it not just any context, but proximal context.[1] In performing zero-shot question answering our language model asks itself additional clarification questions, choosing those most likely to result in proximal context.

[1] Proximal context is not defined for level 1 questions, so we only address questions at level 2 or above.

Figure 1: Our approach incorporates context into question answering guided by Bloom's Taxonomy.

Our contributions in this paper are:

- We use Bloom's Taxonomy to choose proximal clarifying context that improves question answering performance using only what the model already knows.
- We show proximal context is better than other levels of context on four different commonsense question answering tasks.
- By observing how different levels of clarification impact our language model, we also explain how the model answers questions.

2 Related Works

Question Answering from External Supervision. Several approaches have been proposed to improve question answering by adding an external knowledge source. Recent large pre-trained language models (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2018; Liu et al., 2019; Joshi et al., 2020; Clark et al., 2020) learn general purpose text encoders from a huge text corpus. Petroni et al. (2019) recently used a language model as a knowledge base to unmask a token given an entity and a relation in a predefined template. Shwartz et al. (2020) and Bosselut et al. (2019a,b) used pretrained language models to improve zero-shot question answering performance by extracting context from the language model itself, using self-talk or a knowledge graph. We add context via self-talk, with structure provided by Bloom's Taxonomy.

Bloom's Taxonomy. The original work (Bloom, 1956) defined taxonomies for learning in the cognitive (intellectual), affective (interests, attitudes, values), and psychomotor domains, though the cognitive domain is what we usually refer to today. Almost half a century later the cognitive domain taxonomy was revised (Anderson et al., 2000) to reflect more active thinking and to improve usability by adding verbs to describe levels. Teachers use this taxonomy, for example in computer science education (Whalley et al., 2006; Thompson et al., 2008; Oliver et al., 2004), and our inspiration is this revised version of the cognitive taxonomy. Machine learning has been applied to automatically classify questions (Mohammed and Omar, 2020; Zhang et al., 2021; Nafa et al., 2016) into Bloom's Taxonomy levels, but the taxonomy has not been applied to analyze or improve machine learning models themselves. We use it to help our model think about what it knows.

3 Approach

Our approach builds on the zero-shot question answering approach of Shwartz et al. (2020), answering questions (Section 3.1) by adding clarifications generated with self-talk (Section 3.2). We describe this approach, then use Bloom's Taxonomy to select better clarifications (Section 3.3).

3.1 Question Answering with Language Models

Given a prompt p, a question q, and answer options a_o for o in [1, K], we use a pre-trained language model LM to pick the correct answer a_o*.
Table 1: Some of the question prefixes we used for different datasets in our experiments. We assign each prefix a level in Bloom's Taxonomy, and show generated clarification questions and answers from (a) Distil-GPT2 and (b) GPT-Neo for the corresponding prefixes.

COPA
  Prefix: "What is the definition of ___" (Level 1)
    (a) Q: What is the definition of an accident? A: The definition of an accident is the crash of collusion caused by the vehicle.
    (b) Q: What is the definition of a flat tire? A: The definition of a flat tire is that the tire does not hold air.
  Prefix: "What is the main purpose of ___" (Level 2)
    (a) Q: What is the main purpose of this investigation? A: The purpose of this investigation is to provide information about how and why he was shot.
    (b) Q: What is the main purpose of this post? A: The purpose of this post is to share my thoughts and feelings on his death.

CommonsenseQA
  Prefix: "What is the main function of a ___" (Level 2)
    (a) Q: What is the main function of a teacher in this area? A: The main function of a teacher in this area is to teach them about life and love.
    (b) Q: What is the main function of a farmer? A: The main function of a farmer is to provide food for his family and the community.
  Prefix: "What might have caused ___" (Level 3)
    (a) Q: What might have caused this problem? A: The cause of this problem was that his wife's husband didn't have enough money.
    (b) Q: What might have caused the animal to flee? A: The cause of the animal to flee was a predator.

Social IQA
  Prefix: "What did [NAME] do?" (Level 1)
    (a) Q: What did Kendall do? A: What Kendall did was make sure that he wasn't going anywhere else.
    (b) Q: What did Kendall do? A: What Kendall did was so horrible, that it was hard to believe.
  Prefix: "How would you describe [NAME]?" (Level 2)
    (a) Q: How would you describe Riley? A: Riley is a big brother, he's an awesome dad.
    (b) Q: How would you describe Riley? A: Riley is a very sensitive person and has a lot of anxiety.

Winogrande
  Prefix: "What are the properties of a ___" (Level 1)
    (a) Q: What are the properties of a diet that is not healthy? A: The property of a diet that is not healthy are that it has high cholesterol (a good idea).
    (b) Q: What are the properties of a home? A: The properties of a home are that which makes it comfortable and pleasant for the occupants.
  Prefix: "What does it mean to ___" (Level 2)
    (a) Q: What does it mean to be an explorer? A: Be an explorer means to explore and make sense of things.
    (b) Q: What does it mean to be sophisticated? A: Be sophisticated means to be classy, elegant and smart.

This approach simply concatenates each (prompt, question, answer) tuple into a single string of text T_o = [p, q, a_o] and feeds this string to the language model to assign each choice a score s_o = LM(T_o). The language model's answer is just the answer with the highest score: ô = argmax_o s_o.
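To make this scoring step concrete, the following is a minimal sketch assuming a HuggingFace causal language model. The helper names (`lm_score`, `answer`) are ours, not from the authors' released code, and details such as how scores relate to sequence length are our assumptions.

```python
# Sketch of Sec. 3.1: score each candidate string T_o = [p, q, a_o] with the
# LM's log-likelihood and pick the argmax. Assumes HuggingFace transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

def lm_score(text: str) -> float:
    """Total log-probability the LM assigns to `text` (higher is better)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean token NLL as `.loss`.
        loss = model(ids, labels=ids).loss
    # Mean NLL over (n - 1) predicted tokens -> total log-probability.
    return -loss.item() * (ids.shape[1] - 1)

def answer(prompt: str, question: str, options: list[str]) -> str:
    """T_o = [p, q, a_o]; the predicted answer is argmax_o LM(T_o)."""
    scores = [lm_score(" ".join([prompt, question, a])) for a in options]
    return options[max(range(len(options)), key=scores.__getitem__)]
```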
3.2 Self-talk Clarifications

Self-talk (Shwartz et al., 2020) has a language model ask itself clarifying questions, then answer those questions to generate clarifications.

Stage 1: Ask clarification questions. To produce clarifications we start with a set of clarification question prefixes r_1, ..., r_J that are designed specifically for each question answering dataset. "What happens if" is a sample prefix for the clarifications shown in Fig. 1, and in Tab. 1 we present examples for all the datasets we use. In this stage the language model completes each of these prefixes, using its generator function LM_G to ask one question R_j = LM_G(r_j) per prefix.

Stage 2: Answer the questions. Next we use the model to answer each of these questions, possibly prompted with an answer prefix b_j corresponding to question prefix r_j. The results are the clarifications c_j = LM_G([R_j, b_j]).

Stage 3: Reconsider with a new prompt. To use the clarifications we pick one from the list, then append it to the original prompt. This approach considers all combinations of clarification questions and answers T_{j,o} = [p, q, c_j, a_o] for all o, j; it first chooses the clarification which maximizes the model score per answer option, then chooses the final answer o* = argmax_o max_j LM(T_{j,o}). This can improve question answering performance on its own, but in the next section we more carefully choose clarifications using our notion of proximal context and Bloom's Taxonomy.

3.3 Using Bloom's Taxonomy to Choose Clarifications with Proximal Context

To test our idea of proximal context we consider the level L of the task posed by each dataset, then allow only proximal clarifications of level L - 1. We label each question prefix with the level of Bloom's Taxonomy that it falls into, and then force the model to choose from the set C_L of clarifications of level L. This results in a final choice for each level: o*_L = argmax_o max_{j in C_L} LM(T_{j,o}). We also provide a Choice Baseline that allows the model to choose any level of clarification, to show the model would have difficulty choosing proximal clarifications itself.
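A sketch of how the three self-talk stages combine with the level restriction is shown below. It reuses `lm_score` from the previous sketch; the `generate_text` helper, the tiny illustrative prefix list, and the mapping of "at most 6 words" onto 6 generated tokens are our assumptions rather than details from the paper.

```python
def generate_text(prompt: str, max_new_tokens: int, top_p: float) -> str:
    """Nucleus-sample a continuation of `prompt`, returning only the new text."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, top_p=top_p,
                         max_new_tokens=max_new_tokens,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# Illustrative subset of (question prefix, answer prefix, Bloom level) from Tab. 3;
# the real prefix lists are dataset-specific.
PREFIXES = [
    ("What is the definition of", "The definition of", 1),
    ("What is the main purpose of", "The purpose of", 2),
    ("What might have caused", "The cause of", 3),
]

def answer_with_clarifications(prompt, question, options, dataset_level):
    """Stages 1-3, restricted to proximal clarifications (level L - 1)."""
    best_score, best_option = float("-inf"), options[0]
    for q_prefix, a_prefix, level in PREFIXES:
        if level != dataset_level - 1:   # Sec. 3.3: keep only the proximal level
            continue
        # Stage 1: the model completes the clarification question prefix.
        clar_q = q_prefix + generate_text(f"{prompt} {q_prefix}", 6, top_p=0.2)
        # Stage 2: the model answers its own question via the answer prefix.
        clar = a_prefix + generate_text(f"{prompt} {clar_q} {a_prefix}", 10, top_p=0.5)
        # Stage 3: rescore each option with the clarification appended,
        # implementing o* = argmax_o max_j LM([p, q, c_j, a_o]).
        for option in options:
            score = lm_score(" ".join([prompt, question, clar, option]))
            if score > best_score:
                best_score, best_option = score, option
    return best_option
```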
Note that annotating questions along Bloom's Taxonomy requires special skills typically found only among educators. While a layperson can be trained to annotate such questions, our experience was that it takes much more time than we could afford for a preliminary study such as this one. We therefore relied on our co-author, Sara Rutherford-Quach, a researcher at SRI's Education Division who has also worked as a teacher at the kindergarten-elementary level, to provide the annotations. Two other co-authors, Sahu and Cogswell, went through those annotations and made sure that each label had a three-way consensus among Rutherford-Quach, Sahu, and Cogswell. There might be some ambiguity about which level a particular prefix fits into, but this is also true of other applications of the taxonomy (Thompson et al., 2008). In future work, we plan to carry out a more rigorous annotation with more than one skilled annotator so we can measure inter-annotator agreement through measures such as Kappa scores.

Table 2: Question answering accuracy and standard deviation using different levels of clarification, over multiple clarification samples. Results are on the dev set of each dataset. (* = level of proximal context w.r.t. the dataset)

Winogrande (1267 total; dataset level 2: Understand)
  Distil-GPT2 (235±5 valid)   0A Choice Baseline  53.2 ± 1.8
                              1A Remember*        54.7 ± 3.6
                              2A Understand       52.5 ± 3.1
  GPT-Neo (1230±7 valid)      0A Choice Baseline  54.62 ± 0.5
                              1A Remember*        54.77 ± 0.5
                              2A Understand       54.76 ± 0.3

SocialIQA (1954 total; dataset level 3: Apply)
  Distil-GPT2 (58±5 valid)    0B Choice Baseline  44.5 ± 0.1
                              1B Remember         43.7 ± 2.1
                              2B Understand*      48.0 ± 1.1
                              3B Apply            44.4 ± 1.8
  GPT-Neo (1334±9 valid)      0B Choice Baseline  48.74 ± 0.4
                              1B Remember         47.31 ± 0.1
                              2B Understand*      48.44 ± 0.5
                              3B Apply            48.1 ± 0.1

COPA (100 total; dataset level 3: Apply)
  Distil-GPT2 (11±2 valid)    0C Choice Baseline  54.9 ± 0.9
                              1C Remember         46.0 ± 14.7
                              2C Understand*      53.1 ± 12.5
                              3C Apply            40.8 ± 15.2
  GPT-Neo (96±0 valid)        0C Choice Baseline  70.83 ± 0.0
                              1C Remember         65.62 ± 0.0
                              2C Understand*      70.83 ± 1.4
                              3C Apply            70.83 ± 0.0

CommonsenseQA (1221 total; dataset level 3: Apply)
  Distil-GPT2 (68±1 valid)    0D Choice Baseline  29.9 ± 2.7
                              1D Remember         26.5 ± 3.3
                              2D Understand*      28.1 ± 1.2
                              3D Apply            25.6 ± 3.4
  GPT-Neo (1118±4 valid)      0D Choice Baseline  40.59 ± 3.6
                              1D Remember         38.00 ± 6.0
                              2D Understand*      43.19 ± 0.2
                              3D Apply            42.30 ± 0.8

4 Experiments

4.1 Datasets

We evaluate our study on four datasets that can each be thought of in terms of multiple choice question answering, all measuring some kind of common sense: COPA (Roemmele et al., 2011) measures common sense causal reasoning, CommonsenseQA (Talmor et al., 2019) asks questions that require prior knowledge, Social IQA (Sap et al., 2019) asks about social common sense, and WinoGrande (Sakaguchi et al., 2020) adversarially measures semantic common sense.
Perhaps surprisingly, all of the datasets we used ask questions that fall into just one level of the taxonomy (Tab. 2). These datasets do focus on very specific problems, but the result is still disappointing because it would be more useful to see variation in both task and clarification level. It may be interesting to develop datasets that better express the range of abilities described by Bloom's Taxonomy.

4.2 Language Model

We use distil-GPT2 (Sanh et al., 2019) and the publicly released GPT-Neo 2.7B (Black et al., 2021) (based on EleutherAI's replication of the GPT-3 architecture) as the language models throughout our experiments. Our clarification question prefixes and hyperparameter settings for both models are from Shwartz et al. (2020). For each question prefix, we generate 5 clarification questions using a nucleus sampling threshold probability p = 0.2, adding at most 6 words to the clarification question prefix. We then generate 10 answers to each clarification question using p = 0.5 and a maximum answer length of 10.
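For reference, these sampling settings map naturally onto HuggingFace `generate` keyword arguments; the snippet below is our reading of the numbers above, with word limits approximated by generated-token limits.

```python
# Sec. 4.2 sampling settings as generate() arguments (word -> token mapping ours).
CLARIFICATION_QUESTION_ARGS = dict(
    do_sample=True, top_p=0.2,   # nucleus sampling threshold for questions
    num_return_sequences=5,      # 5 clarification questions per prefix
    max_new_tokens=6,            # "at most 6 words" added to the prefix
)
CLARIFICATION_ANSWER_ARGS = dict(
    do_sample=True, top_p=0.5,   # nucleus sampling threshold for answers
    num_return_sequences=10,     # 10 answers per clarification question
    max_new_tokens=10,           # maximum answer length of 10
)
```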
Some changes were necessary to accurately measure the impact of clarification level. Instead of always including "no clarification" as a choice, we do not allow this option, as it defeats our goal of measuring the impact of clarification level. Furthermore, we do not use the clarification questions which were manually completed without input from the model (as in COPA and Winogrande). In order to compare performance across different levels of clarification we only consider examples where the model was able to generate at least one clarification from each level. To increase the number of viable examples we found it necessary to remove some restrictions relative to the implementation of Shwartz et al. (2020). In particular, we kept all clarifications that had no overlapping words with the context and did not allow the model to choose the "no clarification" option. Even with these constraints it was still often the case that distil-GPT2 could not generate a short clarification sentence that was plausible enough to use, whereas GPT-Neo was able to generate clarifications for almost the entire dataset. This indicates larger scale models may be better able to take advantage of clarifying questions. The number of examples with valid clarifications at all levels is indicated for each model in column 2 of Tab. 2. These changes help us more accurately measure the impact of Bloom's Taxonomy, but mean our approach is not directly comparable to Shwartz et al. (2020).

4.3 Results

Table 2 reports the performance of our Bloom's Taxonomy infused zero-shot question answering method. Each row shows question answering accuracy for a particular dataset and level of clarification. If our hypothesis is correct, then the level of available clarifications should matter, and clarifications that provide proximal context (one level below the dataset level) should be most helpful.

Clarification Level Makes a Difference. All levels of clarification questions and answers provide some amount of extra information that changes how a language model processes the entire string it is presented with. This is often helpful information, but it may be that all levels of Bloom's Taxonomy provide equally useful information. We find that is not the case. Different levels of clarification help more or less, as evidenced by the large gap between minimum and maximum accuracy for each dataset. Furthermore, when the model can choose any clarification (rows 0A/B/C/D) it either does a worse job than proximal context or its performance is similar to proximal context, so enforcing a particular kind of context should be helpful.

Proximal Context Helps Most. Proximal context, as we've defined it with respect to Bloom's Taxonomy, is context from the clarification level directly below the dataset question level.
The proximal clarification level for each dataset is marked by a * in Tab. 2. In all cases proximal clarifications are better than using clarifications of a lower level. For the datasets that ask level 3 questions the proximal (level 2) clarifications also outperform level 1 clarifications (2B/C/D greater than 1B/C/D). Proximal clarifications are also about as good as or better than using clarifications of a higher level. You can see this for Winogrande by noting that row 1A is greater than 2A, and for the other datasets by noting that rows 2B/C/D usually have greater performance than 3B/C/D. Overall, proximal context is most consistent in efficacy.

4.4 Qualitative Results

In Tab. 1 we show samples of question answer pairs generated for each model, and in Tab. 4 of the appendix we show complete examples (with context and choices) for each model and dataset. GPT-Neo is much larger than distil-GPT2 and is expected to generalize to slightly new tasks like the clarification generation task better than the smaller model. This expectation is clearly met by the observed quality of clarifications. Distil-GPT2 clarification questions and answers often do not have meaningful semantics, are not correct, or are not relevant. GPT-Neo is much more likely to generate questions and answers which are meaningful, correct, and relevant. This suggests the greater number of valid clarifications generated by GPT-Neo may be due to an increase in clarification quality. Furthermore, it fails in an intuitive fashion: when it fails to generate meaningful answers it often has also failed to generate a meaningful clarification question in the first place.

Also note that the performance differences observed for distil-GPT2 occur despite its relatively poor interpretability. This indicates that context which is somewhat relevant to the topic, even if it does not precisely make sense, can still be useful.

5 Conclusion

Large pre-trained language models sometimes have the right information, but they just do not know how to use it. We used Bloom's Taxonomy to pick questions with the right amount of proximal context. This helped the language models use their knowledge to more effectively answer questions. In the future we would like to extend our work to tasks that present a wide range of questions falling under different levels of the taxonomy. Similarly, we would also like to study and improve upon the current limited set of prefix questions.

Acknowledgments

The authors thank Yunye Gong, Stephanie Nunn, and the anonymous reviewers for the helpful discussions and comments.

References

L. Anderson, D. Krathwohl, and B. Bloom. 2000. A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow.

B. Bloom. 1956. Taxonomy of educational objectives: The classification of educational goals.

Antoine Bosselut, Ronan Le Bras, and Yejin Choi. 2019a. Dynamic neuro-symbolic knowledge graph construction for zero-shot commonsense question answering. arXiv preprint arXiv:1911.03876.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019b. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317.

T. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J. Clark, Christopher Berner, Sam McCandlish, A. Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jesse Dunietz, Greg Burnham, Akash Bharadwaj, Jennifer Chu-Carroll, Owen Rambow, and D. Ferrucci. 2020. To test machine comprehension, start by defining comprehension. In ACL.

Stephanie Harvey and Anne Goudvis. 2007. Strategies that work: Teaching comprehension for understanding and engagement. Stenhouse Publishers.

Mandar Joshi, Kenton Lee, Yi Luan, and Kristina Toutanova. 2020. Contextualized representations using textual encyclopedic knowledge. arXiv preprint arXiv:2004.12006.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Gary Marcus and Ernest Davis. 2020. GPT-3, bloviator: OpenAI's language generator has no idea what it's talking about. MIT Technology Review.

Debbie Miller. 2002. Reading with meaning: Teaching comprehension in the primary grades. Stenhouse Publishers.

Manal Mohammed and N. Omar. 2020. Question classification based on Bloom's taxonomy cognitive domain using modified TF-IDF and word2vec. PLoS ONE, 15.

Fatema Nafa, S. Othman, and J. Khan. 2016. Automatic concepts classification based on Bloom's taxonomy using text analysis and the naïve Bayes classifier method. In CSEDU.

D. Oliver, Tony Dobele, Myles Greber, and Tim S. Roberts. 2004. This course has a bloom rating of 3.9. In ACE.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, Volume 1 (Long Papers), pages 2227-2237.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90-95.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34-05, pages 8732-8740.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of EMNLP-IJCNLP 2019, pages 4453-4463.

Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised commonsense question answering with self-talk. In Proceedings of EMNLP 2020, pages 4615-4629.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4149-4158.

E. Thompson, Andrew Luxton-Reilly, J. Whalley, M. Hu, and Phil Robbins. 2008. Bloom's taxonomy for CS assessment. In ACE '08.

J. Whalley, R. Lister, E. Thompson, T. Clear, Phil Robbins, P. Kumar, and C. Prasad. 2006. An Australasian study of reading and comprehension skills in novice programmers, using the Bloom and SOLO taxonomies.

James Zhang, Casey Wong, Nasser Giacaman, and Andrew Luxton-Reilly. 2021. Automated classification of computing education questions using Bloom's taxonomy. In Australasian Computing Education Conference.

A Prefixes and Examples

In the appendix we provide more details about the question prefixes we used in Tab. 3 and more examples of outputs from our models in Tab. 4.
Table 3: All prefix questions with their corresponding taxonomy levels, used in our zero-shot question answering evaluation. Blanks (___) are completed by the model.

CommonsenseQA & COPA
  Question Prefix                              Answer Prefix                            Level
  What is the definition of ___                The definition of ___ is                 1
  What is the main purpose of ___              The purpose of ___ is to                 2
  What is the main function of a ___           The main function of a ___ is            2
  What are the properties of a ___             The properties of a ___ are that         1
  What is ___                                  ___ is                                   1
  What happened as a result of ___             As a result of ___,                      3
  What might have caused ___                   The cause of ___ was                     3

SocialIQA
  What will [NAME] want to do next?            [NAME] wanted                            3
  What will [NAME] want to do after?           [NAME] wanted                            3
  How would [NAME] feel afterwards?            [NAME] felt                              3
  How would [NAME] feel as a result?           [NAME] felt                              3
  How would [NAME] feel after?                 [NAME] felt                              3
  How would you describe [NAME]?               [NAME] is a                              2
  What kind of person is [NAME]?               [NAME] is a                              2
  How would you describe [NAME] as a person?   [NAME] is a                              2
  Why did [NAME] do that?                      [NAME] did this because they wanted      3
  Why did [NAME] do this?                      [NAME] did this because they wanted      3
  Why did [NAME] want to do this?              [NAME] did this because they wanted      3
  What does [NAME] need to do beforehand?      Before doing that, [NAME] first had to   2
  What does [NAME] need to do before?          Before doing that, [NAME] first had to   2
  What does [NAME] need to do before this?     Before doing that, [NAME] first had to   2
  What did [NAME] need to do before this?      Before doing that, [NAME] first had to   2
  What will happen to [NAME]?                  [NAME]                                   3
  What will happen to [NAME] next?             [NAME]                                   3
  What will [NAME] do next?                    [NAME]                                   3
  What did [NAME] do?                          What [NAME] did was                      1

Winogrande
  What is the definition of ___                The definition of ___ is                 1
  What is the main purpose of ___              The purpose of ___ is to                 2
  What is the main function of a ___           The main function of a ___ is            2
  What are the properties of a ___             The properties of a ___ are that         1
  What is ___                                  ___ is                                   1
  What does it mean to ___                     ___ means                                2
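These prefix annotations can be encoded directly as a lookup table. The snippet below shows one possible encoding for a few SocialIQA prefixes; the structure is ours, not from the released code, and "[NAME]" is substituted with the subject of each question.

```python
# (question prefix) -> (answer prefix, Bloom's Taxonomy level), following Tab. 3.
SOCIALIQA_PREFIX_LEVELS = {
    "What will [NAME] want to do next?": ("[NAME] wanted", 3),
    "How would you describe [NAME]?": ("[NAME] is a", 2),
    "What does [NAME] need to do beforehand?": ("Before doing that, [NAME] first had to", 2),
    "What did [NAME] do?": ("What [NAME] did was", 1),
}

# Proximal prefixes for a level-3 dataset like SocialIQA are those at level 2.
proximal = {q: (a, lvl) for q, (a, lvl) in SOCIALIQA_PREFIX_LEVELS.items()
            if lvl == 3 - 1}
```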
Table 4: Example contexts, questions, choices, clarification questions, and clarification answers for each dataset. We present results for both Distil-GPT2 and GPT-Neo.

COPA
  C: The man turned on the faucet. Q: effect. Ch: [The toilet filled with water., Water flowed from the spout.]
    Distil-GPT2 - Q: What might have caused it to be so hot? A: The cause of it to be so hot was that he would have a big fire in his face and then immediately had an explosion, which killed all but one person.
    GPT-Neo - Q: What happened as a result of that? A: As a result of that, he was forced to drink water.
  C: The woman retired. Q: effect. Ch: [She received her pension., She paid off her mortgage.]
    Distil-GPT2 - Q: What happened as a result of this? A: As a result of this, the government has announced that it will pay Rs 7 crore to her family for compensation and legal expenses related thereto as part-time payments.
    GPT-Neo - Q: What happened as a result of her retirement? A: As a result of her retirement, she had been receiving a pension for life.
  C: I wanted to conserve energy. Q: effect. Ch: [I swept the floor in the unoccupied room., I shut off the light in the unoccupied room.]
    Distil-GPT2 - Q: What is the main function of a plant? A: The main function of a plant is the growth and maintenance of nutrients, so that we can increase our food intake by 10% or more per year, he said.
    GPT-Neo - Q: What might have caused this? A: The cause of this was the lack of electricity in my home.

Winogrande
  S: The home that my parents had when I was in school was a lot nicer than my house now because the ___ was sophisticated. C: home. Ch: [home, house]
    Distil-GPT2 - Q: What is the main function of a building? A: The main function of a building is to keep people happy and healthy.
    GPT-Neo - Q: What are the properties of a home? A: The properties of a home are that which makes it comfortable and pleasant for the occupants.
  S: He had enough time between classes to go to a cafe or to the library. He went to the ___ because his paper could wait. C: classes. Ch: [cafe, library]
    Distil-GPT2 - Q: What is the definition of a book? A: The definition of a book is an article written in order for it to be read.
    GPT-Neo - Q: What is the definition of cafe? A: cafe is defined as a place where people gather for refreshment.
  S: They had to eat a lot to gain the strength they had lost and be able to work, the ___ was too much. C: strength. Ch: [work, strength]
    Distil-GPT2 - Q: What is the main function of a good diet? A: The main function of a good diet is not for food or health but also on exercise.
    GPT-Neo - Q: What is the definition of strength? A: strength is defined as : having power or ability to do something.

SocialIQA
  C: Kendall got a new sports car and could not wait to show friends. Q: What will Kendall want to do next? Ch: [drive that sports car, show off his new sports car, clean and wax her legs]
    Distil-GPT2 - Q: Why did Kendall want to do this? A: Kendall did this because they wanted her, too.
    GPT-Neo - Q: What will Kendall want to do next? A: Kendall wanted to impress them with his new ride.
  C: Kendall was really into house plants and often pour water in to jars so the fluoride would evaporate out. Q: What will Kendall want to do next? Ch: [watered a few of the plants, gather jars, watered a few of her pets]
    Distil-GPT2 - Q: Why did Kendall do that? A: Kendall did this because they wanted him at home, he said.
    GPT-Neo - Q: Why did Kendall do this? A: Kendall did this because they wanted to get a lot of water for their plants.
  C: Remy gave birth to a baby girl on a Saturday afternoon. Q: What does Remy need to do before? Ch: [leave her husband, did this to start a family, go to the hospital]
    Distil-GPT2 - Q: What does Remy need to do before this? A: Before doing that, Remy first had to tell her he was too old.
    GPT-Neo - Q: What did Remy need to do before this? A: Before doing that, Remy first had to go to the hospital to deliver her baby.

CommonsenseQA
  C: In what Spanish speaking North American country can you get a great cup of coffee? Q: cup of coffee. Ch: [mildred's coffee shop, mexico, diner, kitchen, canteen]
    Distil-GPT2 - Q: What is the definition of coffee? A: The definition of coffee is the word that comes from Latin for espresso.
    GPT-Neo - Q: What is the definition of mexico? A: mexico is defined as the country that produces the best coffee.
  C: Unlike a spider and his many sight seers, people only have what? Q: people. Ch: [tongues, names, brains, feelings, two eyes]
    Distil-GPT2 - Q: What is the main purpose of this article? A: The purpose of this article is to explain the basic characteristics that are characteristic of spiders.
    GPT-Neo - Q: What is the definition of two eyes? A: two eyes is defined as the organs by which we see.
  C: The artist was sitting quietly pondering, then suddenly he began to paint when what struck him? Q: sitting quietly. Ch: [sadness, anxiety, inspiration, discomfort, insights]
    Distil-GPT2 - Q: What is a man who can do that? A: Man who can do that is a really great painter.
    GPT-Neo - Q: What is the definition of inspiration? A: inspiration is defined as a sudden flash of inspiration that comes from within.