Comprehension Based Question Answering using Bloom's Taxonomy


Pritish Sahu 1,2∗   Michael Cogswell 1∗   Sara Rutherford-Quach 1   Ajay Divakaran 1
1 SRI International   2 Rutgers University

arXiv:2106.04653v1 [cs.CL] 8 Jun 2021

Abstract

Current pre-trained language models have lots of knowledge, but a more limited ability to use that knowledge. Bloom's Taxonomy helps educators teach children how to use knowledge by categorizing comprehension skills, so we use it to analyze and improve the comprehension skills of large pre-trained language models. Our experiments focus on zero-shot question answering, using the taxonomy to provide proximal context that helps the model answer questions by being relevant to those questions. We show targeting context in this manner improves performance across four popular commonsense question answering datasets.

1 Introduction

Recent large language models such as GPT-3 (Brown et al., 2020) have made a giant leap forward in knowledge acquisition and even generalize this knowledge to new tasks. But when less narrow tasks are considered they fail to understand as much as these benchmarks suggest. They turn out to be "stochastic parrots" (Bender et al., 2021) or "smart/super parrots" (Dunietz et al., 2020) that just memorize without all of the comprehension we want from a Natural Language Understanding system. We focus on a particular kind of failure mode where the model knows (has memorized) the information it needs, but is not able to apply that information correctly, and we do so in a zero-shot fashion to control for what the model knows.

For example, in Fig. 1 the model is asked if a mixture of grape juice and cranberry juice is safe to drink (Marcus and Davis, 2020). GPT-3 declares that it is a deadly poison, even though it appears to "know" that grape juice and cranberry juice are safe to drink by themselves (Fig. 1, Level 1, dark purple). It even knows that cranberry juice with grape juice is not poisonous, but it still thinks the result is death (Fig. 1, Level 2, light blue). The model has memorized the necessary information from large amounts of text, but does not use its knowledge appropriately. Following Shwartz et al. (2020), we extract this knowledge as explicit language and then feed it back as additional context during inference, forcing the model to use what it already knows, but in our case targeting specifically useful knowledge.

To formalize this distinction we drew inspiration from elementary school classrooms, where teachers (Miller, 2002; Harvey and Goudvis, 2007) have a schema based approach in which they teach children to demonstrate multiple levels of comprehension, making complex inferences and direct recall from memory. They use a hierarchy of comprehension skills called Bloom's Taxonomy (Anderson et al., 2000) (cf. Fig. 1), with memorization at the bottom (requiring children to recall facts), followed by understanding (requiring children to grasp semantics), application (requiring children to solve problems), and more complex skills. For us, these comprehension skills describe ways our language model might fail to use its knowledge.

In this paper we address our failure mode by relying on commonly understood relationships between the skills of Bloom's Taxonomy, which we term proximal context. In order to understand whether the cranberry grape mixture is poisonous, the model needs to remember whether grape juice is poisonous. In order to apply its knowledge to figure out what will happen next, it needs to understand whether the cranberry grape mixture is poisonous or not. In general, the proximal context for a particular task T at level L is given by those tasks implicitly required by T, which are mostly at level L − 1 of the taxonomy. We guide our language model to answer questions more accurately by providing it not just any context, but proximal context.¹ In performing zero-shot question answering our language model asks itself additional clarification questions, choosing those most likely to result in proximal context.

∗ These two authors contributed equally.
¹ Proximal context is not defined for level 1 questions, so we only address questions at level 2 or above.
Figure 1: Our approach incorporates context into question answering guided by Bloom’s Taxonomy.

Our contributions in this paper are:

• We use Bloom's Taxonomy to choose proximal clarifying context that improves question answering performance using only what the model already knows.

• We show proximal context is better than other levels of context on four different commonsense question answering tasks.

• By observing how different levels of clarification impact our language model we also explain how the model answers questions.

2 Related Works

Question Answering from External Supervision. Several approaches have been proposed to improve question answering by adding external knowledge sources. Recent large pre-trained language models (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2018; Liu et al., 2019; Joshi et al., 2020; Clark et al., 2020) learn general purpose text encoders from a huge text corpus. Petroni et al. (2019) recently used a language model as a knowledge base to unmask a token given an entity and a relation in a predefined template. Shwartz et al. (2020) and Bosselut et al. (2019a,b) used pretrained language models to improve zero-shot question answering performance by extracting context from the language model itself, using self-talk or a knowledge graph. We add context via self-talk, with structure provided by Bloom's Taxonomy.

Bloom's Taxonomy. The original work (Bloom, 1956) defined taxonomies for learning in the cognitive (intellectual), affective (interests, attitudes, values), and psychomotor domains, though the cognitive domain is what we usually refer to today. Almost half a century later the cognitive domain taxonomy was revised (Anderson et al., 2000) to reflect more active thinking and improve usability by adding verbs to describe levels. Teachers use this taxonomy, for example in computer science education (Whalley et al., 2006; Thompson et al., 2008; Oliver et al., 2004), and our inspiration is from this revised version of the cognitive taxonomy. Machine learning has been applied to automatically classify questions (Mohammed and Omar, 2020; Zhang et al., 2021; Nafa et al., 2016) into Bloom's Taxonomy levels, but the taxonomy has not been applied to analyze or improve machine learning models themselves. We use it to help our model think about what it knows.

3 Approach

Our approach builds on the zero-shot question answering approach of Shwartz et al. (2020) to answer questions (Section 3.1) by adding clarifications with self-talk (Section 3.2). We describe this approach, then we use Bloom's Taxonomy to select better clarifications (Section 3.3).

3.1 Question Answering with Language Models

Given a prompt p, a question q, and answer options a_o for all o ∈ [1, K], we use a pre-trained language model LM to pick the correct answer a_{o∗}.
COPA
  Prefix: "What is the definition of" (Level 1)
    (a) Q: What is the definition of an accident?
        A: The definition of an accident is the crash of collusion caused by the vehicle.
    (b) Q: What is the definition of a flat tire?
        A: The definition of a flat tire is that the tire does not hold air.
  Prefix: "What is the main purpose of" (Level 2)
    (a) Q: What is the main purpose of this investigation?
        A: The purpose of this investigation is to provide information about how and why he was shot.
    (b) Q: What is the main purpose of this post?
        A: The purpose of this post is to share my thoughts and feelings on his death.

CommonsenseQA
  Prefix: "What is the main function of a" (Level 2)
    (a) Q: What is the main function of a teacher in this area?
        A: The main function of a teacher in this area is to teach them about life and love.
    (b) Q: What is the main function of a farmer?
        A: The main function of a farmer is to provide food for his family and the community.
  Prefix: "What might have caused" (Level 3)
    (a) Q: What might have caused this problem?
        A: The cause of this problem was that his wife's husband didn't have enough money.
    (b) Q: What might have caused the animal to flee?
        A: The cause of the animal to flee was a predator.

Social IQA
  Prefix: "What did [NAME] do?" (Level 1)
    (a) Q: What did Kendall do?
        A: What Kendall did was make sure that he wasn't going anywhere else.
    (b) Q: What did Kendall do?
        A: What Kendall did was so horrible, that it was hard to believe.
  Prefix: "How would you describe [NAME]?" (Level 2)
    (a) Q: How would you describe Riley?
        A: Riley is a big brother, he's an awesome dad.
    (b) Q: How would you describe Riley?
        A: Riley is a very sensitive person and has a lot of anxiety.

Winogrande
  Prefix: "What are the properties of a" (Level 1)
    (a) Q: What are the properties of a diet that is not healthy?
        A: The property of a diet that is not healthy are that it has high cholesterol (a good idea).
    (b) Q: What are the properties of a home?
        A: The properties of a home are that which makes it comfortable and pleasant for the occupants.
  Prefix: "What does it mean to" (Level 2)
    (a) Q: What does it mean to be an explorer?
        A: Be an explorer means to explore and make sense of things.
    (b) Q: What does it mean to be sophisticated?
        A: Be sophisticated means to be classy, elegant and smart.

Table 1: This table shows some of the question prefixes we used for different datasets in our experiments. We assign each prefix a level in Bloom's Taxonomy. We show generated clarification questions and answers for both Distil-GPT2 (a) and GPT-Neo (b) for their corresponding question prefixes.

This approach simply concatenates each (prompt, question, answer) tuple into a single string of text T_o = [p, q, a_o] and feeds this string to the language model to assign each choice a score s_o = LM(T_o). The language model's answer is just the answer with the highest score: ô = argmax_o s_o.
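To make the scoring rule concrete, here is a minimal sketch of s_o = LM(T_o), assuming the HuggingFace transformers API and distilgpt2; the paper does not prescribe an implementation, so the function names and the use of total log-likelihood as the score are our illustrative assumptions.

```python
# Minimal sketch of s_o = LM(T_o): score each answer option by the
# log-likelihood the language model assigns to the concatenated string.
# Assumes HuggingFace transformers; details here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

def score(text: str) -> float:
    """Total log-likelihood of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token,
    # so multiply by the number of predictions to undo the averaging.
    return -out.loss.item() * (ids.shape[1] - 1)

def answer(prompt: str, question: str, options: list[str]) -> str:
    """Pick o-hat = argmax_o LM(T_o) with T_o = [p, q, a_o]."""
    scores = [score(f"{prompt} {question} {opt}") for opt in options]
    return options[scores.index(max(scores))]
```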
3.2 Self-talk Clarifications

Self-talk (Shwartz et al., 2020) has a language model ask itself clarifying questions, then answer those questions to generate clarifications.

Stage 1: Ask clarification questions. To produce clarifications we start with a set of clarification question prefixes r_1, . . . , r_J that are designed specifically for each question answering dataset. "What happens if" is a sample prefix for the clarifications shown in Fig. 1, and in Tab. 1 we present examples for all the datasets we use. In this stage the language model completes each of these prefixes, using its generator function LM_G to ask one question R_j = LM_G(r_j) per prefix.

Stage 2: Answer the questions. Next we use the model to answer each of these questions, possibly prompted with an answer prefix b_j corresponding to question prefix r_j. The results are the clarifications c_j = LM_G([R_j, b_j]).

Stage 3: Reconsider with a new prompt. To use the clarifications we pick one from the list, then append it to the original prompt. This approach simply considers all combinations of clarification questions and answers T_{j,o} = [p, q, c_j, a_o] for all o, j, first chooses the clarification which maximizes the model score per answer option, then chooses the final answer o∗ = argmax_o max_j LM(T_{j,o}). This can improve question answering performance on its own, but in the next section we more carefully choose clarifications using our notion of proximal context and Bloom's Taxonomy.
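The three stages can be sketched as follows, reusing `tokenizer`, `model`, and `score` from the previous snippet; the prompt formats and the mapping of the paper's word limits onto `max_new_tokens` are our assumptions, not the authors' released code.

```python
def complete(prefix: str, max_new_tokens: int, top_p: float) -> str:
    """LM_G: complete a prefix with nucleus (top-p) sampling."""
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, top_p=top_p,
                         max_new_tokens=max_new_tokens,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def self_talk(prompt, question, options, prefixes, answer_prefixes):
    # Stage 1: R_j = LM_G(r_j), one clarification question per prefix.
    questions = [complete(f"{prompt} {r}", 6, 0.2) for r in prefixes]
    # Stage 2: c_j = LM_G([R_j, b_j]), answer each question.
    clarifications = [complete(f"{q} {b}", 10, 0.5)
                      for q, b in zip(questions, answer_prefixes)]
    # Stage 3: score T_{j,o} = [p, q, c_j, a_o] for every (j, o) and
    # return o* = argmax_o max_j LM(T_{j,o}).
    best = {opt: max(score(f"{prompt} {question} {c} {opt}")
                     for c in clarifications)
            for opt in options}
    return max(best, key=best.get)
```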
3.3 Using Bloom's Taxonomy to Choose Clarifications with Proximal Context

To test our idea of proximal context we consider the level L of the task given by each dataset, then allow only proximal clarifications of level L − 1. We label each question prefix with the level of Bloom's Taxonomy that it falls into, and then force the model to choose from the set C_L of clarifications of level L. This results in a final choice for each level: o∗_L = argmax_o max_{j∈C_L} LM(T_{j,o}). We also provide a Choice Baseline that allows the model to choose any level of clarification, to show that the model would have difficulty choosing proximal clarifications itself.
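A sketch of this level-restricted selection, again reusing `score` from the Section 3.1 snippet; representing clarifications as (level, text) pairs is an illustrative assumption.

```python
def answer_with_level(prompt, question, options, clarifications, level):
    """clarifications: list of (bloom_level, clarification_text) pairs.
    Returns o*_L = argmax_o max_{j in C_L} LM(T_{j,o}), scoring only
    the set C_L of clarifications at the requested taxonomy level."""
    allowed = [text for lvl, text in clarifications if lvl == level]
    best = {opt: max(score(f"{prompt} {question} {c} {opt}")
                     for c in allowed)
            for opt in options}
    return max(best, key=best.get)

# Proximal context for a dataset at level L uses clarifications of
# level L - 1, e.g. level 2 (Understand) for the level-3 SocialIQA.
```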
Note that the annotation of questions along Bloom's taxonomy requires special skills typically found only among educators. While a layperson can be trained to annotate such questions, our experience was that it takes much more time than we could afford for a preliminary study such as this one. We therefore relied on our co-author, Sara Rutherford-Quach, who is a researcher at SRI's Education Division and has also worked as a teacher at the kindergarten-elementary level, to provide the annotations. Two other co-authors, Sahu and Cogswell, went through those annotations and made sure that each label had a three-way consensus among Rutherford-Quach, Sahu and Cogswell. There might be some ambiguity about which level a particular prefix fits into, but this is also true of other applications of the taxonomy (Thompson et al., 2008). In future work, we plan to carry out a more rigorous annotation with more than one skilled annotator so we can measure inter-annotator agreement through measures such as Kappa scores.

Table 2: Question answering accuracy and std. dev. using different levels of clarification over multiple clarification samples. Results are on the dev sets of each dataset. (* = level of proximal context w.r.t. the dataset)

Winogrande (1267 total; dataset level 2: Understand)
  Distil-GPT2 (235±5 valid)
    0A: Choice Baseline  53.2 ± 1.8
    1A: Remember*        54.7 ± 3.6
    2A: Understand       52.5 ± 3.1
  GPT-Neo (1230±7 valid)
    0A: Choice Baseline  54.62 ± 0.5
    1A: Remember*        54.77 ± 0.5
    2A: Understand       54.76 ± 0.3

SocialIQA (1954 total; dataset level 3: Apply)
  Distil-GPT2 (58±5 valid)
    0B: Choice Baseline  44.5 ± 0.1
    1B: Remember         43.7 ± 2.1
    2B: Understand*      48.0 ± 1.1
    3B: Apply            44.4 ± 1.8
  GPT-Neo (1334±9 valid)
    0B: Choice Baseline  48.74 ± 0.4
    1B: Remember         47.31 ± 0.1
    2B: Understand*      48.44 ± 0.5
    3B: Apply            48.1 ± 0.1

COPA (100 total; dataset level 3: Apply)
  Distil-GPT2 (11±2 valid)
    0C: Choice Baseline  54.9 ± 0.9
    1C: Remember         46.0 ± 14.7
    2C: Understand*      53.1 ± 12.5
    3C: Apply            40.8 ± 15.2
  GPT-Neo (96±0 valid)
    0C: Choice Baseline  70.83 ± 0.0
    1C: Remember         65.62 ± 0.0
    2C: Understand*      70.83 ± 1.4
    3C: Apply            70.83 ± 0.0

CommonsenseQA (1221 total; dataset level 3: Apply)
  Distil-GPT2 (68±1 valid)
    0D: Choice Baseline  29.9 ± 2.7
    1D: Remember         26.5 ± 3.3
    2D: Understand*      28.1 ± 1.2
    3D: Apply            25.6 ± 3.4
  GPT-Neo (1118±4 valid)
    0D: Choice Baseline  40.59 ± 3.6
    1D: Remember         38.00 ± 6.0
    2D: Understand*      43.19 ± 0.2
    3D: Apply            42.30 ± 0.8
4 Experiments

4.1 Datasets

We evaluate our study on four datasets that can each be thought of in terms of multiple choice question answering, all measuring some kind of common sense: COPA (Roemmele et al., 2011) measures common sense causal reasoning, CommonsenseQA (Talmor et al., 2019) asks questions that require prior knowledge, Social IQA (Sap et al., 2019) asks about social common sense, and WinoGrande (Sakaguchi et al., 2020) adversarially measures semantic common sense. Perhaps surprisingly, all of the datasets we used asked questions that fell into just one level of the taxonomy (Tab. 2). These datasets do focus on very specific problems, but the result is still disappointing because it would be more useful to see variations in both task and clarification level. It may be interesting to develop datasets that can better express the range of abilities described by Bloom's Taxonomy.
                                                          number of viable examples we found it necessary
4.2 Language Model

We use distil-GPT2 (Sanh et al., 2019) and the publicly released GPT-Neo 2.7B (Black et al., 2021) (based on EleutherAI's replication of the GPT-3 architecture) as the language models throughout our experiments. Our clarification question prefixes and hyperparameter settings for both models are from Shwartz et al. (2020). For each question prefix, we generate 5 clarification questions using nucleus sampling threshold probability p = 0.2 and adding at most 6 words to the clarification question prefix. We then generate 10 answers to each clarification question using p = 0.5 and maximum answer length 10. Some changes were necessary to accurately measure the impact of clarification level. Instead of always including no clarification as a choice, we do not allow this option, as it defeats our goal of measuring clarification level impact. Furthermore, we do not use the clarification questions which were manually completed without input from the model (as in COPA and Winogrande).

In order to compare performance across different levels of clarifications we only consider examples where the model was able to generate at least one clarification from each level. To increase the number of viable examples we found it necessary to remove some restrictions relative to the implementation of Shwartz et al. (2020). In particular, we kept all clarifications that had no overlapping words with the context and did not allow the model to choose the "no clarification" option. Even with these constraints it was still often the case that distil-GPT2 could not generate a short clarification sentence that was plausible enough to use, whereas GPT-Neo was able to generate clarifications for almost the entire dataset. This indicates larger scale models may be more able to take advantage of clarifying questions. The number of examples with valid clarifications for all levels is indicated for each model in column 2 of Tab. 2. These changes help us more accurately measure the impact of Bloom's Taxonomy, but mean our approach is not directly comparable to Shwartz et al. (2020).
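The generation settings above might look as follows in code, reusing `complete` from the Section 3.2 sketch; note the paper limits added words while `max_new_tokens` limits tokens, so this is an approximation.

```python
def generate_clarifications(prompt, question_prefix, answer_prefix):
    """5 clarification questions (p = 0.2, at most 6 added tokens),
    then 10 candidate answers per question (p = 0.5, length <= 10)."""
    questions, answers = [], []
    for _ in range(5):
        q = complete(f"{prompt} {question_prefix}",
                     max_new_tokens=6, top_p=0.2)
        questions.append(q)
        for _ in range(10):
            answers.append(complete(f"{q} {answer_prefix}",
                                    max_new_tokens=10, top_p=0.5))
    return questions, answers
```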
4.3 Results

Table 2 reports the performance of our Bloom's Taxonomy infused zero-shot question answering method. Each row shows question answering accuracy for a particular dataset and level of clarification. If our hypothesis is correct then the level of available clarifications should matter, and clarifications that provide proximal context (one level below the dataset level) should be most helpful.

Clarification Level Makes a Difference. All levels of clarification questions and answers provide some amount of extra information that changes how a language model processes the entire string it is presented with. This is often helpful information, but it may be that all levels of Bloom's Taxonomy provide equally useful information. We find that is not the case. Different levels of clarification help more or less, as evidenced by the large gap between minimum and maximum accuracy for each dataset. Furthermore, when the model can choose any clarification (rows 0A/B/C/D) it either does a worse job than proximal context or its performance is similar to proximal context, so enforcing a particular kind of context should be helpful.

Proximal Context Helps Most. Proximal context, as we've defined it with respect to Bloom's Taxonomy, is context from the clarification level directly below the dataset question level. The proximal clarification level for each dataset is marked by a * in Tab. 2. In all cases proximal clarifications are better than using clarifications of a lower level. For the datasets that ask level 3 questions the proximal (level 2) clarifications also outperform level 1 clarifications (2B/C/D greater than 1B/C/D). Proximal clarifications are also about as good as or better than using clarifications of a higher level. You can see this for Winogrande by noting row 1A is greater than 2A, and for the other datasets by noting rows 2B/C/D usually have greater performance than 3B/C/D. Overall, proximal context is most consistent in efficacy.

4.4 Qualitative Results

In Tab. 1 we show samples of question answer pairs generated for each model, and in Tab. 4 of the appendix we show complete examples (with context and choices) for each model and dataset. GPT-Neo is much larger than distil-GPT2 and is expected to generalize to slightly new tasks like the clarification generation task better than the smaller model. This expectation is clearly met by the observed quality of clarifications. Distil-GPT2 clarification questions and answers often do not have meaningful semantics, are not correct, or are not relevant. GPT-Neo is much more likely to generate questions and answers which are meaningful, correct, and relevant. This suggests the greater number of valid clarifications generated by GPT-Neo may be due to an increase in clarification quality. Furthermore, it fails in an intuitive fashion: when it fails to generate meaningful answers it often has also failed to generate a meaningful clarification question in the first place.

Also note that the performance differences observed for distil-GPT2 occur despite its relatively poor interpretability. This indicates that context which is somewhat relevant to the topic, even if it does not precisely make sense, can still be useful.

5 Conclusion

Large pre-trained language models sometimes have the right information, but they just do not know how to use it. We used Bloom's Taxonomy to pick questions with the right amount of proximal context. This helped the language models use their knowledge to more effectively answer questions. In the future we would like to extend our work to tasks that present a wide range of questions that fall under different levels of the taxonomy. Similarly, we would also like to study and improve upon the current limited set of prefix questions used.

Acknowledgments

The authors thank Yunye Gong, Stephanie Nunn, and the anonymous reviewers for the helpful discussions and comments.
References

L. Anderson, D. Krathwohl, and B. Bloom. 2000. A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow.

B. Bloom. 1956. Taxonomy of educational objectives: The classification of educational goals.

Antoine Bosselut, Ronan Le Bras, and Yejin Choi. 2019a. Dynamic neuro-symbolic knowledge graph construction for zero-shot commonsense question answering. arXiv preprint arXiv:1911.03876.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019b. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317.

T. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J. Clark, Christopher Berner, Sam McCandlish, A. Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. ArXiv, abs/2005.14165.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jesse Dunietz, Greg Burnham, Akash Bharadwaj, Jennifer Chu-Carroll, Owen Rambow, and D. Ferrucci. 2020. To test machine comprehension, start by defining comprehension. In ACL.

Stephanie Harvey and Anne Goudvis. 2007. Strategies that work: Teaching comprehension for understanding and engagement. Stenhouse Publishers.

Mandar Joshi, Kenton Lee, Yi Luan, and Kristina Toutanova. 2020. Contextualized representations using textual encyclopedic knowledge. arXiv preprint arXiv:2004.12006.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Gary Marcus and Ernest Davis. 2020. GPT-3, bloviator: OpenAI's language generator has no idea what it's talking about. MIT Technology Review.

Debbie Miller. 2002. Reading with meaning: Teaching comprehension in the primary grades. Stenhouse Publishers.

Manal Mohammed and N. Omar. 2020. Question classification based on Bloom's taxonomy cognitive domain using modified TF-IDF and word2vec. PLoS ONE, 15.

Fatema Nafa, S. Othman, and J. Khan. 2016. Automatic concepts classification based on Bloom's taxonomy using text analysis and the naïve Bayes classifier method. In CSEDU.

D. Oliver, Tony Dobele, Myles Greber, and Tim S. Roberts. 2004. This course has a Bloom rating of 3.9. In ACE.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90–95.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34(05), pages 8732–8740.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4453–4463.

Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised commonsense question answering with self-talk. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4615–4629.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158.

E. Thompson, Andrew Luxton-Reilly, J. Whalley, M. Hu, and Phil Robbins. 2008. Bloom's taxonomy for CS assessment. In ACE '08.

J. Whalley, R. Lister, E. Thompson, T. Clear, Phil Robbins, P. Kumar, and C. Prasad. 2006. An Australasian study of reading and comprehension skills in novice programmers, using the Bloom and SOLO taxonomies.

James Zhang, Casey Wong, Nasser Giacaman, and Andrew Luxton-Reilly. 2021. Automated classification of computing education questions using Bloom's taxonomy. In Australasian Computing Education Conference.

A Prefixes and Examples

In this appendix we provide more details about the question prefixes we used in Tab. 3 and more examples of outputs from our models in Tab. 4.
Table 3: All the prefix questions with their corresponding taxonomy levels, as used in our zero-shot question answering evaluation.

 Question Prefix                             Answer Prefix                            Bloom’s Taxonomy Level
                                          CommonsenseQA & COPA
 What is the definition of                   The definition of is                                1
 What is the main purpose of                 The purpose of is to                                2
 What is the main function of a              The main function of a is                           2
 What are the properties of a                The properties of a are that                        1
 What is a                                    is                                                 1
 What happened as a result of                As a result of ,                                    3
 What might have caused                      The cause of was                                    3
                                                 SocialIQA
 What will [NAME] want to do next?           [NAME] wanted                                       3
 What will [NAME] want to do after?          [NAME] wanted                                       3
 How would [NAME] feel afterwards?           [NAME] felt                                         3
 How would [NAME] feel as a result?          [NAME] felt                                         3
 How would [NAME] feel after?                [NAME] felt                                         3
 How would you describe [NAME]?              [NAME] is a                                         2
 What kind of person is [NAME]?              [NAME] is a                                         2
 How would you describe [NAME] as a person? [NAME] is a                                          2
 Why did [NAME] do that?                     [NAME] did this because they wanted                 3
 Why did [NAME] do this?                     [NAME] did this because they wanted                 3
 Why did [NAME] want to do this?             [NAME] did this because they wanted                 3
 What does [NAME] need to do beforehand?     Before doing that, [NAME] first had to              2
 What does [NAME] need to do before?         Before doing that, [NAME] first had to              2
 What does [NAME] need to do before this?    Before doing that, [NAME] first had to              2
 What did [NAME] need to do before this?     Before doing that, [NAME] first had to              2
 What will happen to [NAME]?                 [NAME]                                              3
 What will happen to [NAME] next?            [NAME]                                              3
 What will [NAME] do next?                   [NAME]                                              3
 What did [NAME] do?                         What [NAME] did was                                 1

                                                 Winogrande
 What is the definition of                     The definition of is                              1
 What is the main purpose of                   The purpose of is to                              2
 What is the main function of a                The main function of a is                         2
 What are the properties of a                  The properties of a are that                      1
 What is                                        is                                               1
 What does it mean to                           means                                            2
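These level labels drive the clarification filtering of Section 3.3. As a sketch of how they might be represented for `answer_with_level` (an illustrative data structure, not the authors' code):

```python
# Illustrative mapping from question prefixes to Bloom's Taxonomy
# levels (a subset of Table 3).
PREFIX_LEVELS = {
    "What is the definition of": 1,
    "What are the properties of a": 1,
    "What is the main purpose of": 2,
    "What does it mean to": 2,
    "What happened as a result of": 3,
    "What might have caused": 3,
}
```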
Table 4: Example contexts, questions, choices, clarification questions and clarification answers for each dataset. We present results for both Distil-GPT2 and GPT-Neo.

COPA
  C: The man turned on the faucet.  Q: effect
  Ch: [The toilet filled with water., Water flowed from the spout.]
    Distil-GPT2  Q: What might have caused it to be so hot?
                 A: The cause of it to be so hot was that he would have a big fire in his face and then immediately had an explosion, which killed all but one person.
    GPT-Neo      Q: What happened as a result of that?
                 A: As a result of that, he was forced to drink water.
  C: The woman retired.  Q: effect
  Ch: [She received her pension., She paid off her mortgage.]
    Distil-GPT2  Q: What happened as a result of this?
                 A: As a result of this, the government has announced that it will pay Rs 7 crore to her family for compensation and legal expenses related thereto as part-time payments.
    GPT-Neo      Q: What happened as a result of her retirement?
                 A: As a result of her retirement, she had been receiving a pension for life.
  C: I wanted to conserve energy.  Q: effect
  Ch: [I swept the floor in the unoccupied room., I shut off the light in the unoccupied room.]
    Distil-GPT2  Q: What is the main function of a plant?
                 A: The main function of a plant is the growth and maintenance of nutrients, so that we can increase our food intake by 10% or more per year, he said.
    GPT-Neo      Q: What might have caused this?
                 A: The cause of this was the lack of electricity in my home.

Winogrande
  S: The home that my parents had when I was in school was a lot nicer than my house now because the _ was sophisticated.  C: home
  Ch: [home, house]
    Distil-GPT2  Q: What is the main function of a building?
                 A: The main function of a building is to keep people happy and healthy.
    GPT-Neo      Q: What are the properties of a home?
                 A: The properties of a home are that which makes it comfortable and pleasant for the occupants.
  S: He had enough time between classes to go to a cafe or to the library. He went to the _ because his paper could wait.  C: classes
  Ch: [cafe, library]
    Distil-GPT2  Q: What is the definition of a book?
                 A: The definition of a book is an article written in order for it be read.
    GPT-Neo      Q: What is the definition of cafe?
                 A: cafe is defined as a place where people gather for refreshment.
  S: They had to eat a lot to gain the strength they had lost and be able to work, the _ was too much.  C: strength
  Ch: [work, strength]
    Distil-GPT2  Q: What is the main function of a good diet?
                 A: The main function of a good diet is not for food or health but also on exercise.
    GPT-Neo      Q: What is the definition of strength?
                 A: strength is defined as : having power or ability to do something.

SocialIQA
  C: Kendall got a new sports car and could not wait to show friends.
  Q: What will Kendall want to do next?
  Ch: [drive that sports car, show off his new sports car, clean and wax her legs]
    Distil-GPT2  Q: Why did Kendall want to do this?
                 A: Kendall did this because they wanted her, too.
    GPT-Neo      Q: What will Kendall want to do next?
                 A: Kendall wanted to impress them with his new ride."
  C: Kendall was really into house plants and often pour water in to jars so the fluoride would evaporate out.
  Q: What will Kendall want to do next?
  Ch: [watered a few of the plants, gather jars, watered a few of her pets]
    Distil-GPT2  Q: Why did Kendall do that?
                 A: Kendall did this because they wanted him at home, he said.
    GPT-Neo      Q: Why did Kendall do this?
                 A: Kendall did this because they wanted to get a lot of water for their plants.
  C: Remy gave birth to a baby girl on a Saturday afternoon.
  Q: What does Remy need to do before this?
  Ch: [leave her husband, did this to start a family, go to the hospital]
    Distil-GPT2  Q: What did Remy need to do before this?
                 A: Before doing that, Remy first had to tell her he was too old.
    GPT-Neo      Q: What does Remy need to do before?
                 A: Before doing that, Remy first had to go to the hospital to deliver her baby.

CommonsenseQA
  C: In what Spanish speaking North American country can you get a great cup of coffee?
  Q: cup of coffee
  Ch: [mildred's coffee shop, mexico, diner, kitchen, canteen]
    Distil-GPT2  Q: What is the definition of coffee?
                 A: The definition of coffee is the word that comes from Latin for espresso.
    GPT-Neo      Q: What is the definition of mexico?
                 A: mexico is defined as the country that produces the best coffee.
  C: Unlike a spider and his many sight seers, people only have what?
  Q: people
  Ch: [tongues, names, brains, feelings, two eyes]
    Distil-GPT2  Q: What is the main purpose of this article?
                 A: The purpose of this article is to explain the basic characteristics that are characteristic of spiders.
    GPT-Neo      Q: What is the definition of two eyes?
                 A: two eyes is defined as the organs by which we see.
  C: The artist was sitting quietly pondering, then suddenly he began to paint when what struck him?
  Q: sitting quietly
  Ch: [sadness, anxiety, inspiration, discomfort, insights]
    Distil-GPT2  Q: What is a man who can do that?
                 A: Man who can do that is a really great painter.
    GPT-Neo      Q: What is the definition of inspiration?
                 A: inspiration is defined as a sudden flash of inspiration that comes from within.