Annotating Student Talk in Text-based Classroom Discussions
Luca Lugini, Diane Litman, Amanda Godley, Christopher Olshefski
University of Pittsburgh
Pittsburgh, PA 15260
{lul32,dlitman,agodley,cao48}@pitt.edu

Abstract

Classroom discussions in English Language Arts have a positive effect on students' reading, writing, and reasoning skills. Although prior work has largely focused on teacher talk and student-teacher interactions, we focus on three theoretically-motivated aspects of high-quality student talk: argumentation, specificity, and knowledge domain. We introduce an annotation scheme, then show that the scheme can be used to produce reliable annotations and that the annotations are predictive of discussion quality. We also highlight opportunities provided by our scheme for educational and natural language processing research.

1 Introduction

Current research, theory, and policy surrounding K-12 instruction in the United States highlight the role of student-centered disciplinary discussions (i.e. discussions related to a specific academic discipline or school subject such as physics or English Language Arts) in instructional quality and student learning opportunities (Danielson, 2011; Grossman et al., 2014). Such student-centered discussions, often called "dialogic" or "inquiry-based", are widely viewed as the most effective instructional approach for disciplinary understanding, problem-solving, and literacy (Elizabeth et al., 2012; Engle and Conant, 2002; Murphy et al., 2009). In English Language Arts (ELA) classrooms, student-centered discussions about literature have a positive impact on the development of students' reasoning, writing, and reading skills (Applebee et al., 2003; Reznitskaya and Gregory, 2013). However, most studies have focused on the role of teachers and their talk (Bloome et al., 2005; Elizabeth et al., 2012; Michaels et al., 2008) rather than on the aspects of student talk that contribute to discussion quality.

Additionally, studies of student-centered discussions rarely use the same coding schemes, making it difficult to generalize across studies (Elizabeth et al., 2012; Soter et al., 2008). This limitation is partly due to the time-intensive work required to analyze discourse data through qualitative methods such as ethnography and discourse analysis. Thus, qualitative case studies have generated compelling theories about the specific features of student talk that lead to high-quality discussions, but few findings can be generalized and leveraged to influence instructional improvements across ELA classrooms.

As a first step towards developing an automated system for detecting the features of student talk that lead to high-quality discussions, we propose a new annotation scheme for student talk during ELA "text-based" discussions, that is, discussions that center on a text or piece of literature (e.g., book, play, or speech). The annotation scheme was developed to capture three aspects of classroom talk that are theorized in the literature as important to discussion quality and learning opportunities: argumentation (the process of systematically reasoning in support of an idea), specificity (the quality of belonging or relating uniquely to a particular subject), and knowledge domain (the area of expertise represented in the content of the talk). We demonstrate the reliability and validity of our scheme via an annotation study of five transcripts of classroom discussion.

2 Related Work

One discourse feature used to assess the quality of discussions is students' argument moves: their claims about the text, their sharing of textual evidence for claims, and their warranting or reasoning to support the claims (Reznitskaya et al., 2009; Toulmin, 1958). Many researchers view student reasoning as of primary importance, particularly when the reasoning is elaborated and highly inferential (Kim, 2014).
In Natural Language Processing (NLP), most educationally-oriented argumentation research has focused on corpora of student persuasive essays (Ghosh et al., 2016; Klebanov et al., 2016; Persing and Ng, 2016; Wachsmuth et al., 2016; Stab and Gurevych, 2017; Nguyen and Litman, 2018). We instead focus on multi-party spoken discussion transcripts from classrooms. A second key difference is the inclusion of the warrant label in our scheme, as it is important to understand how students explicitly use reasoning to connect evidence to claims.

Educational studies suggest that discussion quality is also influenced by the specificity of student talk (Chisholm and Godley, 2011; Sohmer et al., 2009). Chisholm and Godley found that as specificity increased, the quality of students' claims and reasoning also increased. Previous NLP research has studied specificity in the context of professionally written newspaper articles (Li and Nenkova, 2015; Li et al., 2016; Louis and Nenkova, 2011, 2012). While the annotation instructions used in these studies work well for general-purpose corpora, specificity in text-based discussions also needs to capture particular relations between discussions and texts. Furthermore, since the concept of a sentence is not clearly defined in speech, we annotate argumentative discourse units rather than sentences (see Section 3).

The knowledge domain of student talk may also matter, that is, whether the talk focuses on disciplinary knowledge or lived experiences. Some research suggests that disciplinary learning opportunities are maximized when students draw on evidence and reasoning that are commonly accepted in the discipline (Resnick and Schantz, 2015), although some studies suggest that evidence or reasoning from lived experiences increases discussion quality (Beach and Myers, 2001). Previous related work in NLP analyzed evidence type for argumentative tweets (Addawood and Bashir, 2016). Although their categories of evidence type are different, their definition of evidence type is in line with our definition of knowledge domain. However, our research is distinct in its application domain (education rather than social media) and in analyzing knowledge domain for all argumentative components, not only those containing claims.

3 Annotation Scheme

Our annotation scheme[1] uses argument moves as the unit of analysis. We define an argument move as an utterance, or part of an utterance, that contains an argumentative discourse unit (ADU) (Peldszus and Stede, 2013). Like Peldszus and Stede (2015), in this paper we use transcripts already segmented into argument moves and focus on the steps following segmentation, i.e., labeling argumentation, specificity, and knowledge domain. Table 1 shows a section of a transcribed classroom discussion along with the labels assigned by a human annotator following segmentation.

[1] The coding manual is in the supplemental material.

Move | Stu | Argument Move | Argumentation | Specificity | Domain
23 | S1 | She's like really just protecting Willy from everything. | claim | medium | disciplinary
24 | S1 | Like at the end of the book remember how she was telling the kids to leave and never come back. | evidence | medium | disciplinary
25 | S1 | Like she's not even caring about them, she's caring about Willy. | warrant | medium | disciplinary
41 | S2 | It's like she's concerned with him trying to [inaudible] and he's concerned with trying to make her happy, you know? So he feels like he's failing when he's not making her happy like | claim | high | disciplinary
42 | S2 | "Let's bring your mother some good news" | evidence | high | disciplinary
43 | S2 | but she knew that, there wasn't any good news, so she wanted to act happy so he wouldn't be in pain. | warrant | high | disciplinary
55 | S3 | Some people they just ask for a job is just like, some money. | evidence | low | experiential

Table 1: Examples of argument moves and their respective annotations from a discussion of the book Death of a Salesman. As shown by the argument move numbers, the rows for students S1, S2, and S3 are separate, non-contiguous excerpts of the discussion.

3.1 Argumentation

The argumentation scheme is based on Lee (2006) and consists of a simplified set of labels derived from Toulmin's (1958) model: (i) Claim: an arguable statement that presents a particular interpretation of a text or topic. (ii) Evidence: facts, documentation, text references, or testimony used to support or justify a claim. (iii) Warrant: reasons explaining how a specific instance of evidence supports a specific claim. Our scheme specifies that warrants must come after a claim and evidence, since by definition warrants cannot exist without them.

The first three moves in Table 1 show a natural expression of an argument: a student first claims that Willy's wife is only trying to protect him, then provides a reference as evidence by mentioning something she said to her kids at the end of the book, and finally explains how not caring about her kids ties the evidence to the initial claim. The second group shows the same argument progression, with the evidence given as a direct quote.
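For NLP readers, the unit of analysis and the ordering constraint on warrants are easy to make concrete in code. The sketch below is our own illustration (the class and function names are hypothetical, not part of the coding manual): it represents one annotated argument move and checks that no warrant precedes both a claim and an evidence move.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ArgumentMove:
    """One segmented argument move with its three annotations."""
    move_id: int        # move number within the transcript
    student: str        # anonymized speaker id, e.g. "S1"
    text: str
    argumentation: str  # "claim" | "evidence" | "warrant"
    specificity: str    # "low" | "medium" | "high"
    domain: str         # "disciplinary" | "experiential"

def warrants_well_ordered(moves: List[ArgumentMove]) -> bool:
    """Scheme constraint: a warrant may only appear after at least one
    claim and one evidence move have already occurred."""
    seen_claim = seen_evidence = False
    for m in moves:
        if m.argumentation == "warrant" and not (seen_claim and seen_evidence):
            return False
        seen_claim = seen_claim or m.argumentation == "claim"
        seen_evidence = seen_evidence or m.argumentation == "evidence"
    return True
```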
3.2 Specificity

Specificity annotations are based on Chisholm and Godley (2011) and have the goal of capturing text-related characteristics expressed in student talk. Specificity labels are directly related to four distinct elements of an argument move: (1) it is specific to one (or a few) character or scene; (2) it makes significant qualifications or elaborations; (3) it uses content-specific vocabulary (e.g. quotes from the text); (4) it provides a chain of reasons. Our annotation scheme for specificity includes three labels along a linear scale: (i) Low: statement that does not contain any of these elements. (ii) Medium: statement that accomplishes one of these elements. (iii) High: statement that clearly accomplishes at least two specificity elements. Even though we do not explicitly use labels for the four specificity elements, we found that explicitly breaking down specificity into multiple components helped increase reliability when training annotators.

The first three argument moves in Table 1 all contain the first element, as they refer to select characters in the book. However, no content-specific vocabulary, clear chain of reasoning, or significant qualifications are provided; therefore all three moves are labeled as medium specificity. The fourth move, however, accomplishes the first and fourth specificity elements, and is labeled as high specificity. The fifth move is also labeled high specificity since it is specific to one character/scene and provides a direct quote from the text. The last move is labeled as low specificity as it reflects an overgeneralization about all humans.
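Deciding whether each element holds is the hard, human part of the task, but the mapping from elements to labels is mechanical. A minimal sketch of that decision rule, assuming the number of accomplished elements has already been determined:

```python
def specificity_label(n_elements: int) -> str:
    """Map how many of the four specificity elements a move
    accomplishes (0-4) onto the three-point scale."""
    if n_elements == 0:
        return "low"
    if n_elements == 1:
        return "medium"
    return "high"  # two or more elements

assert specificity_label(0) == "low"
assert specificity_label(1) == "medium"
assert specificity_label(3) == "high"
```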
3.3 Knowledge Domain

The possible labels for knowledge domain are: (i) Disciplinary: the statement is grounded in knowledge gathered from a text (either the one under discussion or others), such as a quote or a description of a character/event. (ii) Experiential: the statement is drawn from human experience, such as what the speaker has experienced or thinks that other humans have experienced.

In Table 1 the first six argument moves are labeled as disciplinary, since the moves reflect knowledge from the text currently being discussed. The last move, however, draws from a student's experience or perceived knowledge about the real world.

4 Reliability and Validity Analyses

We carried out a reliability study for the proposed scheme using two pairs of expert annotators, P1 and P2. The annotators were trained by coding one transcript at a time and discussing disagreements. Five text-based discussions were used for testing reliability after training: pair P1 annotated discussions of The Bluest Eye, Death of a Salesman, and Macbeth, while pair P2 annotated two separate discussions of Ain't I a Woman. In total, 250 argument moves (discussed by over 40 students and consisting of over 8200 words) were annotated. Inter-rater reliability was assessed using Cohen's kappa: unweighted for argumentation and knowledge domain, but quadratic-weighted for specificity given its ordered labels.
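Both kappa variants are standard; assuming the two annotators' labels are aligned move by move, they can be computed with scikit-learn as sketched below (toy labels, not our data):

```python
from sklearn.metrics import cohen_kappa_score

# Toy move-aligned labels from two annotators (not our data).
a1 = ["claim", "evidence", "warrant", "claim", "evidence"]
a2 = ["claim", "evidence", "claim", "claim", "evidence"]
print(cohen_kappa_score(a1, a2))  # unweighted: argumentation, domain

# Specificity is ordinal: pass the label order explicitly so that
# quadratic weights penalize low-high disagreements most heavily.
s1 = ["low", "medium", "high", "high", "low"]
s2 = ["low", "high", "high", "medium", "low"]
print(cohen_kappa_score(s1, s2, labels=["low", "medium", "high"],
                        weights="quadratic"))
```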
Pair | Moves | Argumentation (kappa) | Specificity (qwkappa) | Domain (kappa)
P1 | 169 | 0.729 | 0.874 | 0.980
P2 | 81 | 0.725 | 0.930 | 1

Table 2: Inter-rater reliability for pairs P1 and P2.

Argumentation | evidence | warrant | claim
evidence | 25 | 5 | 0
warrant | 6 | 92 | 12
claim | 0 | 2 | 27

Specificity | low | medium | high
low | 59 | 5 | 3
medium | 5 | 25 | 2
high | 1 | 6 | 63

Knowledge Domain | disciplinary | experiential
disciplinary | 138 | 1
experiential | 0 | 30

Table 3: Confusion matrices for argumentation, specificity, and knowledge domain, for annotator pair P1.

Table 2 shows that kappa for argumentation falls within the 0.61-0.80 range, which generally indicates substantial agreement (McHugh, 2012). Kappa values for specificity and knowledge domain are in the 0.81-1 range, which generally indicates almost perfect agreement (McHugh, 2012). These results show that our proposed annotation scheme can be used to produce reliable annotations of classroom discussion with respect to argumentation, specificity, and knowledge domain.

Table 3 shows confusion matrices[2] for annotator pair P1 (we observed similar trends for P2). The argumentation section of the table shows that the largest number of disagreements occurs between the claim and warrant labels. One reason may be related to the constraint we impose on warrants: they require the existence of a claim and evidence. If a student provides a warrant for a claim that happened much earlier in the discussion, the annotators might interpret the warrant as a new claim. The specificity section shows relatively few low-high disagreements compared to low-medium and medium-high. This is also reflected in the quadratic-weighted kappa, since low-high disagreements carry a larger penalty (the unweighted kappa is 0.797). The main reasons for disagreements over specificity labels come from two of the four specificity elements discussed in Section 3.2: whether an argument move is related to one character or scene, and whether it provides a chain of reasons. With respect to the first of these two elements, we observed disagreements in argument moves containing pronouns with an ambiguous reference. Of particular note is the pronoun "it". If we consider the argument move "I mean even if you know you have a hatred towards a standard or whatever, you still don't kill it", the pronoun "it" clearly refers to something within the move (i.e. the standard) that the student themselves mentioned. In contrast, for argument moves such as "It did happen", it might not be clear to what previous move the pronoun refers, creating confusion about whether this specificity element is accomplished. Regarding specificity element (4), we found that it was easier to determine the presence of a chain of reasons when discourse connectives (e.g. because, therefore) were present in the argument move. The absence of explicit discourse connectives in an argument move might lead annotators to disagree on the presence or absence of a chain of reasons, which is likely to result in a different specificity label. Additionally, annotators found that shorter turns at talk proved harder to annotate for specificity. Finally, as the third section of the table shows, knowledge domain has the fewest disagreements, with only one.

[2] The class distributions for argumentation and specificity labels vary significantly across transcripts, as can be seen in Lugini and Litman (2017) and Godley and Olshefski (2017).

We also explored the validity of our coding scheme (Godley and Olshefski, 2017) by comparing our annotations of student talk to English Education experts' evaluations of the discussions' quality (quadratic-weighted kappa of 0.544). Using stepwise regressions, we found that the best model of discussion quality (R-squared of 0.432) included all three of our coding dimensions: argumentation, specificity, and knowledge domain.
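For completeness, confusion matrices like those in Table 3 can be reproduced from the same move-aligned label sequences; a toy sketch with scikit-learn (not our data), where the label list pins the row and column order to match the table:

```python
from sklearn.metrics import confusion_matrix

# Toy move-aligned argumentation labels from two annotators.
a1 = ["evidence", "warrant", "warrant", "claim", "evidence"]
a2 = ["evidence", "warrant", "claim", "claim", "evidence"]

order = ["evidence", "warrant", "claim"]
print(confusion_matrix(a1, a2, labels=order))
# Rows index annotator 1's labels; columns index annotator 2's.
```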
5 Opportunities and Challenges

Our annotation scheme introduces opportunities for the educational community to conduct further research on the relationship between features of student talk, student learning, and discussion quality. Although Chisholm and Godley (2011) and we (Lugini and Litman, 2017; Godley and Olshefski, 2017) found relations between our coding constructs and discussion quality, these were small-scale studies based on manual annotations. Once automated classifiers are developed, such relations between talk and learning can be examined at scale. Also, automatic labeling via a standard coding scheme can support the generalization of findings across studies, and potentially lead to automated tools for teachers and students.

The proposed annotation scheme also introduces NLP opportunities and challenges. Existing systems for classifying specificity and argumentation have largely been designed to analyze written text rather than spoken discussions. This is (at least in part) due to a lack of publicly available corpora and schemes for annotating argumentation and specificity in spoken discussions. The development of an annotation scheme explicitly designed for this problem is the first step towards collecting and annotating corpora that can be used by the NLP community to advance the field in this particular area. Furthermore, in text-based discussions, NLP methods need to tightly couple the discussion with contextual information (i.e., the text under discussion). For example, an argument move from one of the discussions mentioned in Section 4 stated "She's saying like free like, I don't have to be, I don't have to be this salesman's wife anymore, you know? I don't have to play this role anymore." The use of the term "salesman" shows the presence of specificity element (3) (see Section 3.2) because the text under discussion is indeed Death of a Salesman. If the students were discussing another text, the mention of the term would not indicate one of the specificity elements, therefore lowering the specificity rating. Thus, using existing systems is unlikely to yield good performance. In fact, we previously showed (Lugini and Litman, 2017) that while an off-the-shelf system for predicting specificity in newspaper articles performed poorly when applied to classroom discussions, exploiting characteristics of our data could significantly improve performance. We have similarly evaluated the performance of two existing argument mining systems (Nguyen and Litman, 2018; Niculae et al., 2017) on the transcripts described in Section 4. Since the two systems were trained to classify only claims and premises, they were never able to correctly predict warrants in our transcripts. Additionally, both systems classified the overwhelming majority of moves as premises, resulting in negative kappa in some cases. Using our scheme to create a corpus of classroom discussion data manually annotated for argumentation, specificity, and knowledge domain will support the development of more robust NLP prediction systems.
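As a rough illustration of the text coupling just discussed, one could flag candidate content-specific vocabulary (specificity element (3)) by checking a move's words against the text under discussion. This is a deliberately naive sketch of the idea, not a description of our system:

```python
import re

def shared_content_words(move_text: str, source_text: str,
                         min_len: int = 4) -> set:
    """Words from an argument move that also occur in the text under
    discussion -- a crude cue for content-specific vocabulary."""
    tokenize = lambda s: set(re.findall(r"[a-z]+", s.lower()))
    # min_len crudely filters short function words.
    candidates = {w for w in tokenize(move_text) if len(w) >= min_len}
    return candidates & tokenize(source_text)

move = "I don't have to be this salesman's wife anymore, you know?"
# "salesman" is a cue only because the discussed text mentions it.
print(shared_content_words(move, "Death of a Salesman, by Arthur Miller"))
```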
6 Conclusions

In this work we proposed a new annotation scheme for three theoretically-motivated features of student talk in classroom discussion: argumentation, specificity, and knowledge domain. We demonstrated usage of the scheme by presenting an annotated excerpt of a classroom discussion, showed that the scheme can be annotated with high reliability, and reported on scheme validity. Finally, we discussed some possible applications and challenges posed by the proposed annotation scheme for both the educational and NLP communities. We plan to extend our annotation scheme to label information about collaborative relations between different argument moves, and to release a corpus annotated with the extended scheme.

Acknowledgements

We want to thank Haoran Zhang, Tazin Afrin, and Annika Swallen for their contribution, and all the anonymous reviewers for their helpful suggestions.

This work was supported by the Learning Research and Development Center at the University of Pittsburgh.

References

Aseel Addawood and Masooda Bashir. 2016. "What is your evidence?" A study of controversial topics on social media. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 1-11.

Arthur N. Applebee, Judith A. Langer, Martin Nystrand, and Adam Gamoran. 2003. Discussion-based approaches to developing understanding: Classroom instruction and student performance in middle and high school English. American Educational Research Journal, 40(3):685-730.

Richard Beach and Jamie Myers. 2001. Inquiry-based English instruction: Engaging students in life and literature, volume 55. Teachers College Press.

David Bloome, Stephanie Power Carter, Beth Morton Christian, Sheila Otto, and Nora Shuart-Faris. 2005. Discourse analysis and the study of classroom language and literacy events: A microethnographic perspective. Lawrence Erlbaum.

James S. Chisholm and Amanda J. Godley. 2011. Learning about language through inquiry-based discussion: Three bidialectal high school students talk about dialect variation, identity, and power. Journal of Literacy Research, 43(4):430-468.

Charlotte Danielson. 2011. Evaluations that help teachers learn. Educational Leadership, 68(4):35-39.

Tracy Elizabeth, Trisha L. Ross Anderson, Elana H. Snow, and Robert L. Selman. 2012. Academic discussions: An analysis of instructional discourse and an argument for an integrative assessment framework. American Educational Research Journal, 49(6):1214-1250.

Randi A. Engle and Faith R. Conant. 2002. Guiding principles for fostering productive disciplinary engagement: Explaining an emergent argument in a community of learners classroom. Cognition and Instruction, 20(4):399-483.

Debanjan Ghosh, Aquila Khanam, Yubo Han, and Smaranda Muresan. 2016. Coarse-grained argumentation features for scoring persuasive essays. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 549-554.

Amanda Godley and Christopher Olshefski. 2017. The role of argument moves, specificity and evidence type in meaningful literary discussions across diverse secondary classrooms. Unpublished paper presented at the Literacy Research Association 67th Annual Conference: Literacy Research for Expanding Meaningfulness.

Pam Grossman, Julie Cohen, Matthew Ronfeldt, and Lindsay Brown. 2014. The test matters: The relationship between classroom observation scores and teacher value added on multiple types of assessment. Educational Researcher, 43(6):293-303.

Il-Hee Kim. 2014. Development of reasoning skills through participation in collaborative synchronous online discussions. Interactive Learning Environments, 22(4):467-484.

Beata Beigman Klebanov, Christian Stab, Jill Burstein, Yi Song, Binod Gyawali, and Iryna Gurevych. 2016. Argumentation: Content, structure, and relationship with essay quality. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 70-75.

Carol D. Lee. 2006. 'Every good-bye ain't gone': Analyzing the cultural underpinnings of classroom talk. International Journal of Qualitative Studies in Education, 19(3):305-327.

Junyi Jessy Li and Ani Nenkova. 2015. Fast and accurate prediction of sentence specificity. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2281-2287.

Junyi Jessy Li, Bridget O'Daniel, Yi Wu, Wenli Zhao, and Ani Nenkova. 2016. Improving the annotation of sentence specificity. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC).

Annie Louis and Ani Nenkova. 2011. General versus specific sentences: Automatic identification and application to analysis of news summaries. Technical Report MS-CIS-11-07, University of Pennsylvania.

Annie Louis and Ani Nenkova. 2012. A corpus of general and specific sentences from news. In LREC, pages 1818-1821.

Luca Lugini and Diane Litman. 2017. Predicting specificity in classroom discussion. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 52-61.

Mary L. McHugh. 2012. Interrater reliability: The kappa statistic. Biochemia Medica, 22(3):276-282.

Sarah Michaels, Catherine O'Connor, and Lauren B. Resnick. 2008. Deliberative discourse idealized and realized: Accountable talk in the classroom and in civic life. Studies in Philosophy and Education, 27(4):283-297.

P. Karen Murphy, Ian A. G. Wilkinson, Anna O. Soter, Maeghan N. Hennessey, and John F. Alexander. 2009. Examining the effects of classroom discussion on students' comprehension of text: A meta-analysis. Journal of Educational Psychology, 101(3):740.

Huy V. Nguyen and Diane J. Litman. 2018. Argument mining for improving the automated scoring of persuasive essays. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

Vlad Niculae, Joonsuk Park, and Claire Cardie. 2017. Argument mining with structured SVMs and RNNs. In Proceedings of ACL.

Andreas Peldszus and Manfred Stede. 2013. From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1-31.

Andreas Peldszus and Manfred Stede. 2015. Joint prediction in MST-style discourse parsing for argumentation mining. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 938-948.

Isaac Persing and Vincent Ng. 2016. End-to-end argumentation mining in student essays. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1384-1394.

Lauren B. Resnick and Faith Schantz. 2015. Talking to learn: The promise and challenge of dialogic teaching. In Socializing Intelligence Through Academic Talk and Dialogue, pages 441-450.

Alina Reznitskaya and Maughn Gregory. 2013. Student thought and classroom language: Examining the mechanisms of change in dialogic teaching. Educational Psychologist, 48(2):114-133.

Alina Reznitskaya, Li-Jen Kuo, Ann-Marie Clark, Brian Miller, May Jadallah, Richard C. Anderson, and Kim Nguyen-Jahiel. 2009. Collaborative reasoning: A dialogic approach to group discussions. Cambridge Journal of Education, 39(1):29-48.

Richard Sohmer, Sarah Michaels, M. C. O'Connor, and Lauren Resnick. 2009. Guided construction of knowledge in the classroom. In Transformation of Knowledge Through Classroom Interaction, pages 105-129.

Anna O. Soter, Ian A. Wilkinson, P. Karen Murphy, Lucila Rudge, Kristin Reninger, and Margaret Edwards. 2008. What the discourse tells us: Talk and indicators of high-level comprehension. International Journal of Educational Research, 47(6):372-391.

Christian Stab and Iryna Gurevych. 2017. Parsing argumentation structures in persuasive essays. Computational Linguistics, 43(3):619-659.

Stephen Toulmin. 1958. The Uses of Argument. Cambridge University Press.

Henning Wachsmuth, Khalid Al Khatib, and Benno Stein. 2016. Using argument mining to assess the argumentation quality of essays. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1680-1691.