Virtual Pre-Service Teacher Assessment and Feedback via Conversational Agents

Debajyoti Datta 1*, Maria Phillips 1*, James P. Bywater 2, Jennifer Chiu 3, Ginger S. Watson 3, Laura Barnes 1, Donald Brown 1
1 School of Engineering and Applied Science, University of Virginia
2 College of Education, James Madison University
3 School of Education and Human Development, University of Virginia
{dd3ar, mp6kv, jlc4dz, gw2b, lb3dp, deb}@virginia.edu, bywatejx@jmu.edu
* First two authors have contributed equally.

Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 185-198, April 20, 2021. ©2021 Association for Computational Linguistics.

Abstract

Conversational agents and assistants have been used for decades to facilitate learning. There are many examples of conversational agents used for educational and training purposes in K-12, higher education, healthcare, the military, and private industry settings. The most common forms of conversational agents in education are teaching agents that directly teach and support learning, peer agents that serve as knowledgeable learning companions to guide learners in the learning process, and teachable agents that function as a novice or less-knowledgeable student trained and taught by a learner who learns by teaching. The Instructional Quality Assessment (IQA) provides a robust framework to evaluate reading comprehension and mathematics instruction. We developed a system for pre-service teachers, individuals in a teacher preparation program, to evaluate teaching instruction quality based on a modified interpretation of IQA metrics. Our demonstration and approach take advantage of recent advances in Natural Language Processing (NLP) and deep learning for each dialogue system component. We built an open-source conversational agent system to engage pre-service teachers in a specific mathematical scenario focused on scale factor, with the aim to provide feedback on pre-service teachers' questioning strategies. We believe our system is not only practical for teacher education programs but can also enable other researchers to build new educational scenarios with minimal effort.
1 Introduction

In the era of remote teaching due to governmental regulations and stay-at-home orders for COVID-19, remote teaching methodologies have come to the forefront of education and training. Virtual conversational agents have been used for a variety of training scenarios to train teachers and nurses in many different contexts (Datta et al., 2016). A virtual conversational agent refers to an online interactive system in which a user is able to dialogue and receive responses in turn; all future references to conversational agents will refer to this definition. Conversational agents, often characterized as dialogue systems, are common in customer support applications (Yan et al., 2017). In education, they have been used extensively for interpreting the content of a conversation, integrating and assimilating information, and providing feedback (Chhibber and Law, 2019). However, as pointed out by Smutny and Schreiberova (2020), very few conversational agents used in education take advantage of recent advances in machine learning and deep learning, instead relying on simple decision trees. This reinforces the need for more research and development on artificial intelligence-based methods to support content-specific conversations.

For teacher learning, a conversational agent deployed through a web interface, as opposed to a standalone software implementation, has multiple benefits. Two prominent benefits are that this approach can easily be scaled to reach more in-service and pre-service teachers, and that it provides a cost-effective way to build systems that can be extended to new teaching scenarios, since multiple dialogue system components can be reused. The dialogue manager and semantic sentence-level components can be used for different mathematical scenarios as long as the assessment component remains unchanged. In this case, the term assessment refers to pre-service teacher evaluation as opposed to a student's understanding of a mathematical topic. In this work, we combine advances in NLP and deep learning research with a modified version of the Instructional Quality Assessment (IQA) (Boston, 2012) framework to build a scenario to be used in teacher education settings.
Our goal in this project is to assess and give teachers immediate feedback on the quality of their instructional moves using a specific scenario on a given mathematical topic. Research demonstrates that the IQA provides a robust framework for evaluating teachers' instructional practice in mathematics classrooms. Our demonstration uses a modified version of the IQA that focuses on the strategy of teachers' questions during one-on-one or class discussions, through a web-based platform that allows pre-service teachers to receive real-time feedback on the quality of their questioning. Although our demonstration centers on an adapted version of the IQA as the assessment component, our modular architecture can incorporate alternative evaluation schemes as well.

A critical challenge of developing dialogue systems in a new domain is the requirement of large data sets for the different components of the dialogue system (intent classification, slot filling for dialogue state tracking, and the dialogue policy). In education, where data collection can itself prove a significant challenge, especially given the need for domain-expert annotations, our system is developed with only a small amount of annotated data: around two thousand sentences labeled with one of four adapted IQA classes. Note that this system is not necessarily better than existing approaches that use annotated training data for each component of the dialogue system; rather, it is a compromise, because dialogue systems are challenging to build in low- or no-data scenarios. What our system demonstrates is a solution to developing a highly manipulable scenario given minimal domain-expert annotated data that can be used to support virtual feedback for pre-service teachers.

Our contributions are as follows:

• An open-source web platform for assessment of the quality of teachers' mathematical questioning

• A process that allows for scenario development with minimal training data

• Direct feedback to pre-service teachers on the quality of mathematical questioning by relying on state-of-the-art NLP components

• A framework in which new scenarios can be deployed with minimal change in components by relying on transfer learning and weak supervision
2 Related Work

Virtual human-based simulations can provide meaningful, deliberate practice for learning a wide range of teaching skills during teacher preparation, as well as extended practice for advanced skills for in-service teachers. Dialogue systems for conversational agents can be built with two different approaches: end-to-end approaches that combine all of the dialogue system stages and require large-scale labeled training data, and component-based systems that require less data but whose components must each be trained separately. Component-based systems have the advantage of allowing certain components to be reused for similar contextual scenarios. Our work utilizes the latter approach, leveraging unstructured data collected from the web and textbooks as knowledge bases. Our dialogue policy is a combination of handcrafted rules by education domain experts, and dialogue states are tracked through the reading-comprehension-based approach highlighted by Gao et al. (2019). The most common approach for dialogue state tracking is the "slot-value" pairs approach, in which different stages of the dialogue are often framed as a multi-class classification task (Mrkšić et al., 2015). While that approach is well studied, the decision to use the reading-comprehension-based approach in this system is based on the ability to incorporate pre-trained models for dialogue state tracking.

The assessment component for the pre-service teachers' questioning strategy is an adapted version of the IQA. The IQA has been developed by the Learning Research and Development Center at the University of Pittsburgh since 2002 (Matsumura et al., 2006). The IQA offers a holistic assessment of mathematical instruction, including the academic rigor of the specific tasks and the student and teacher discussion surrounding the task (Pianta and Hamre, 2009). The IQA has been further validated by its developers in subsequent research (Boston and Candela, 2018). More recent developments suggest that the IQA can be used not only as an assessment tool but also as a feedback tool to help teachers actively improve their instruction (Boston and Candela, 2018).
The questions that teachers ask are essential for promoting students' meaningful mathematical discourse. The academic rigor component of the IQA (Junker et al., 2005; Boston, 2012) builds on earlier classifications of teacher questions (e.g., Boaler and Brodie (2004)), distinguishing between "probing and exploring" questions that ask students to clarify their ideas or the connections between them, and "procedural and factual" questions that elicit facts or yes/no responses. The IQA is intended to be used in contexts where cognitively demanding mathematical tasks are implemented, and it is well suited for fine-grained teacher professional development such as that which focuses on teacher questioning (Boston et al., 2015). Another advantage of utilizing the IQA is its limited number of categories, as a high number of categories results in very complex dialogue policies (Yan et al., 2017).

Given the limited amount of time domain experts may have for annotating data, several methods to improve label efficiency were explored. Weak supervision techniques, as highlighted by Ratner et al. (2016), provide a two-fold benefit: they require less human labeling than would otherwise be needed for training, and noisy data and the accuracy of each annotator can be taken into account for classification. Ratner et al. (2016) have shown that weak supervision systems outperform generic majority-vote approaches. Noisy-label data for model classification has also been studied with deep-learning-based approaches (Guan et al., 2018) and proven effective.

3 Data and Tools

3.1 Classification Model Data

A primary purpose of this conversational agent is its ability to provide feedback for pre-service teachers. This requires the ability to classify each statement or question using the selected assessment rubric, which in this implementation is an adapted IQA rubric. The categories of the adapted IQA measure were set by education domain experts and iterated on over the course of several months for the purposes of classification and feedback for pre-service teachers. The adapted IQA measure includes the following categories of questions:

• Probing and Exploring: Clarifies student thinking and enables students to elaborate their own thinking for their own benefit and that of the class; points to underlying mathematical relationships and meanings and makes links among mathematical ideas. (e.g., "Explain to me how you got that expression.")

• Procedural and Factual: Elicits a mathematical fact or procedure; requires a yes/no or single-response answer; requires the recall of a memorized fact or procedure. (e.g., "What is the square root of 4?")

• Expository and Cueing: Provides mathematical cueing or mathematical information to students. (e.g., "To solve this problem you need to double this side, then take that number and multiply it by 3.")

• Other: All other conversation not related to the above categories. (e.g., "Close your books."; "Why didn't you use graph paper?")

Annotators also had the opportunity to flag any data as a "data issue," representing a transcript preprocessing error or another problem indicating the data could not be labeled, such as incoherent or blank text.
The data used for the adapted IQA evaluation rubric was developed from transcriptions of audio recordings of teachers in whole-class and teacher-student conversations that took place in elementary mathematics classrooms using different mathematics curricula across the United States. The de-identified dataset was shared from an NSF-sponsored project that had previously collected the recordings to answer separate research questions. Students engaged with a project designed to help them understand different geometry concepts such as scale factor, dimensions, surface area, and volume of rectangular prisms. The students recorded observations from a given visualization and explained the impact of the scale factor. The data collected for the development of this scenario contained 2826 questions. The unique question, along with the context, or speaking turn, in which the question was uttered, were both provided as reference for the annotators to use during labeling.

We had 5799 total labeled data instances. There were five total annotators: three expert teachers and two pre-service teachers. The number of annotators fluctuated during different stages of the annotation process, resulting in varied amounts of labels generated by each annotator. The time to label each data point averaged between 5.2 and 6.7 seconds per annotator. The total number of unique labeled sentences was 2826, and the distribution of labels across the four assessment categories ranged from 856 to 2133. We used weak-supervision-based approaches, rather than majority vote, to combine the labeled data from multiple annotators.
3.2 Labeling Platforms

Two labeling platforms were used extensively for this project: Labelbox (Labelbox, 2020) and Label Studio (Tkachenko et al., 2020). While both platforms were simple and straightforward to use, Label Studio enabled the building of custom user interfaces with several improved features, such as keyboard shortcuts, that allowed annotators to onboard more easily and complete labeling tasks more efficiently.

Each individual question was labeled with a context reference that allowed annotators to see the entire speaking turn of the teacher. The decision to include context came after previous iterations of labeling questions resulted in an inter-annotator agreement below 0.50, which subsequently increased to 0.66 after including context. An example of the Label Studio labeling interface is shown in Figure 1.

Figure 1: Labeling interface for annotation

Our data collection approach relied on weak supervision and learning-with-noisy-labels strategies. In this paradigm, noisy labels acquired either through human annotation or machine learning models are cost-effective to obtain. In the scenario in which domain-expert annotators are available (in our case, expert teachers), noisy disagreements between annotators can be leveraged to build high-accuracy models (Ratner et al., 2016; Guan et al., 2018). Weak supervision approaches are scalable, enabling easy adaptation to multiple mathematical scenarios, one of the key contributions and focuses of this project.
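The paper does not name the tooling used for this label-combination step, but the data-programming approach of Ratner et al. (2016) that it cites is implemented in the open-source Snorkel library. A minimal sketch, assuming an integer label matrix with one column per annotator and -1 marking questions an annotator skipped, might look as follows:

```python
# Sketch: combining noisy annotator labels with a weak-supervision label
# model instead of majority vote. Snorkel implements the Ratner et al.
# (2016) data-programming approach; the label matrix below is illustrative.
import numpy as np
from snorkel.labeling.model import LabelModel

# Rows are questions, columns are annotators; -1 means "did not label".
# 0=Probing/Exploring, 1=Procedural/Factual, 2=Expository/Cueing, 3=Other
L = np.array([
    [0, 0, -1, 1, 0],
    [2, 2, 2, -1, -1],
    [1, 1, -1, 1, 0],
    [3, -1, 3, 3, 1],
])

label_model = LabelModel(cardinality=4, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)

# Combined labels that weight annotators by their estimated accuracy.
combined = label_model.predict(L)
print(combined)
```

The label model estimates each annotator's accuracy from observed agreements and disagreements, which is why it can outperform a simple majority vote when annotators differ in reliability.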
3.3 Knowledge Base Data

The knowledge base of a dialogue system can be very complex depending on the scenario for which the system is being built (Yan et al., 2017). For this initial demonstration scenario, the conversational agent represents a student with some level of understanding of the topic of scale factor. The knowledge base relies on unstructured knowledge about scale factor collected from the web and textbooks, compiled as plain text. As our system is intended to reflect a student's understanding of a topic, which is reasonably imperfect, contradictory sources of information are not a primary concern. In fact, a knowledge base with contradictory information may be leveraged to support more robust answering. Additionally, the expected level of understanding of a student for a given topic is likely to be documented in instructional materials readily available on the web, so collecting this data is a simple way to develop a knowledge base. Future efforts will address grade-appropriate filtering that may improve interaction by generating a more realistic student profile with a grade-reflective knowledge base.

The text collected from the web was not cleaned, labeled, or annotated. Basic pre-processing included removing references to figures and hyperlinks to other web pages. Once the plain-text reference base was compiled, we separated the text into sections of no more than 512 words, so that an entire section could be used directly as input to the response generation discussed in the methods section.
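A minimal sketch of this chunking step is shown below; the cleaning regexes, the input file name, and the exact word limit applied here are illustrative assumptions rather than the project's actual pre-processing code.

```python
# Sketch of the knowledge-base preprocessing described above: strip figure
# references and hyperlinks, then split the plain text into sections of at
# most 512 words so each section fits the downstream models.
import re

def preprocess(raw_text: str) -> str:
    text = re.sub(r"https?://\S+", "", raw_text)  # drop hyperlinks
    text = re.sub(r"\(?see Figure \d+\)?", "", text, flags=re.I)  # figure refs
    return re.sub(r"\s+", " ", text).strip()

def split_into_sections(text: str, max_words: int = 512) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# "scale_factor.txt" is a hypothetical file of compiled web/textbook text.
with open("scale_factor.txt") as f:
    sections = split_into_sections(preprocess(f.read()))
```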
Relying on unstructured knowledge bases is critical to rapidly developing and deploying new conversational agent scenarios. Unstructured texts on varying mathematical topics are readily available from websites and video transcripts, and our framework allows a simple way to incorporate newly generated external knowledge bases in order to scale to additional scenarios. The ability to use unstructured knowledge as the key input of our knowledge base is possible due to recent advances in question-answering models, reading comprehension tasks, and readily available libraries such as Huggingface Transformers (Wolf et al., 2020).

3.4 Platform Interface and Hosting

The application was built with Django, a Python web application development framework. Credentials must be generated and are required prior to using the interface. The interface includes an Institutional Review Board (IRB) consent form as well as a description of the scenario, including the topic, the expected student understanding of the topic (Beginner, Intermediate, Advanced), and the student grade level. Screenshots of the application are included in the Appendix.

Figure 2: Conversational Agent Implementation Architecture

4 Methods

The overall architecture of our conversational agent is depicted in Figure 2. As discussed, a central component of our demonstration is evaluating and providing feedback on pre-service teacher instruction. For this initial version of the scenario, domain experts provided a specified rubric for pre-service teachers to meet that clarifies the types of IQA categories desired within a session. One sample rubric evaluates whether the teacher asks at least one "probing and exploring" question, one "expository and cueing" question, and one "procedural and factual" question. This rubric can be changed easily in the demonstration through a separate JSON file. In our current evaluation, we do not evaluate the order in which the questions are asked, but we plan to include more sophisticated evaluation protocols in future work.
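The paper states only that the rubric lives in a separate JSON file and requires at least one question per category, so the schema and helper function below are assumptions about what such a file and check could look like:

```python
# Sketch of a separate JSON rubric file and the check applied to a session.
# The schema ("required_counts") and category keys are illustrative.
import json
from collections import Counter

RUBRIC_JSON = """
{
  "required_counts": {
    "probing_and_exploring": 1,
    "expository_and_cueing": 1,
    "procedural_and_factual": 1
  }
}
"""

def rubric_met(classified_turns: list[str], rubric: dict) -> bool:
    counts = Counter(classified_turns)
    return all(counts[cat] >= n
               for cat, n in rubric["required_counts"].items())

rubric = json.loads(RUBRIC_JSON)
session = ["procedural_and_factual", "probing_and_exploring", "other"]
print(rubric_met(session, rubric))  # False: no expository_and_cueing question
```

Swapping in a different JSON file changes the session requirements without touching any dialogue system component, which is the design choice the modular architecture relies on.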
The current implementation incorporates text interactions between a conversational agent, representing a student, and a pre-service teacher. The pre-service teacher can interact with the conversational agent by providing new knowledge and testing the agent's understanding of the topic "scale factor". The pre-service teacher types a statement, question, multiple statements, or multiple questions in the text box, and the conversational agent responds by taking into account the conversation context as well as the pre-service teacher's utterance. Each component of the dialogue system is described in detail in the subsequent subsections.

4.1 Assessment Metric Classifier

In our demonstration, we utilize the entire speaking turn of the pre-service teacher as the input text. This formulation is very similar to the two-sentence classification task used in the Stanford Natural Language Inference corpus (Bowman et al., 2015), and we frame our input in a similar format for classification and fine-tuning. The input text undergoes basic cleaning and is tokenized prior to being used as input to our classifier model. We experimented with multiple text classification approaches, including Convolutional Neural Network (CNN)-based text classification (Kim, 2014), Long Short-Term Memory (LSTM)-based text classification (Liu et al., 2016), and newer approaches that rely on Transformer architectures (Devlin et al., 2018; Liu et al., 2019) and perform well with small amounts of labeled data. Transfer learning models tend to perform well with less labeled data than other models because their pretraining on unsupervised text encodes knowledge and the semantic meaning of words and sentences. This demonstration incorporates a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model for our adapted IQA classification task.
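A minimal fine-tuning sketch with the Transformers library is shown below. The toy example, dataset column names, and hyperparameters are illustrative assumptions; the paper specifies only that a BERT-family model is fine-tuned on the four adapted IQA classes, with the speaking-turn context paired with the question in the style of two-sentence NLI classification.

```python
# Sketch: fine-tuning a Transformer classifier on the four adapted IQA
# categories with Huggingface Transformers.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["probing_and_exploring", "procedural_and_factual",
          "expository_and_cueing", "other"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS))

# Toy training data; the real data pairs each question with its speaking turn.
train_ds = Dataset.from_dict({
    "context": ["Okay, you doubled each side. Explain how you got that."],
    "question": ["Explain how you got that."],
    "label": [0],  # probing_and_exploring
})

def encode(batch):
    # Pair the full speaking turn (context) with the question itself.
    return tokenizer(batch["context"], batch["question"],
                     truncation=True, max_length=128, padding="max_length")

train_ds = train_ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="iqa-classifier", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()
```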
4.2 Semantic Matching

For the conversational agent to respond to input text, the first step is to identify the most relevant section of the knowledge base. To do this, the pre-processed input text is used in combination with the Universal Sentence Encoder (Cer et al., 2018) to find the most relevant, or more specifically, most semantically similar section of the knowledge base. Semantic similarity refers to the degree to which two texts have the same meaning. As discussed in the data section, the knowledge base is split into smaller sections that are a more suitable size for the semantic similarity tool used: the Universal Sentence Encoder. This process is depicted in Figure 3.

Figure 3: Dialogue System Selection of Knowledge Base Reference Section via Semantic Similarity

The Universal Sentence Encoder is optimized for short phrases or paragraphs and outputs a 512-dimensional vector. Semantic similarity is computed as the normalized cosine similarity between the embeddings of the input and the knowledge base text, i.e., the inner product of the normalized embedding vectors. Semantic similarity computation at the sentence level is more accurate than aggregating word-level similarities and is therefore preferred in this application; models trained to understand words in context are often better suited to identifying semantic similarities of phrases and sentences. In application, we may take an input such as "What is scale factor?", find the most semantically similar section in the knowledge base, and use that section as input to response generation.

If the input text does not achieve a pre-defined similarity threshold against any knowledge base section, the system responds from the unknown category. For this demonstration the threshold was set to 0.80 after empirical evaluation of semantic coherence; this value will be further tested and empirically evaluated in future iterations of the system. In the unknown category, one of six pre-defined responses is selected at random to convey to the pre-service teacher that the system does not understand the user input. This pre-defined selection represents a hand-crafted dialogue policy determined by domain experts.

If the semantic similarity is greater than or equal to the threshold for a given knowledge base section, the system selects that section. The initial input text and the selected knowledge base section are then used as inputs to a question-answering module: a pre-trained BERT model fine-tuned on the Stanford Question Answering Dataset (SQuAD). This module generates the response shown in the user interface for the subsequent turn of the conversation. The dialogue manager retains the dialogue states of the conversation for record and reference within the conversation.
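A sketch of this retrieval-and-threshold step is shown below, using the publicly released Universal Sentence Encoder module on TensorFlow Hub; the function and variable names are illustrative, and the 0.80 threshold comes from the paper.

```python
# Sketch of the semantic-matching step: embed the teacher's input and all
# knowledge base sections, take the normalized cosine similarity, and fall
# back to the pre-defined "unknown" responses when no section clears 0.80.
import numpy as np
import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
THRESHOLD = 0.80  # set empirically, per the paper

def best_section(user_input: str, sections: list[str]):
    embeddings = encoder([user_input] + sections).numpy()
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = embeddings[1:] @ embeddings[0]  # cosine similarity to the input
    i = int(np.argmax(sims))
    if sims[i] < THRESHOLD:
        return None  # triggers one of the six pre-defined unknown responses
    return sections[i]
```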
4.3 Dialogue Manager

4.3.1 Dialogue State Tracking

Dialogue state tracking is a core component of the dialogue system; its goal is to estimate the goal of the conversation at each turn. There are multiple formulations of dialogue state tracking, such as hand-crafted rules (Wang and Lemon, 2013) and web-style ranking (Williams, 2014). In our approach, we use the recent question-answering paradigm for dialogue state tracking (Gao et al., 2019). Unlike Gao et al. (2019), we do not train our reading-comprehension-based model, but instead use the question-answering paradigm to understand the dialogue states. We append each of the pre-service teacher utterances along with the student responses. Since conversational agent responses can either be mathematical (evaluating an expression) or respond to a pre-service teacher utterance, we evaluate each stage as a question-answering task. At the end of an utterance, the questions prompted include: "Did the teacher ask a probing and exploring question?", "Did the conversational agent answer the question correctly?", and "Did the pre-service teacher acknowledge the answer?". This iterative framework of question answering helps keep track of the dialogue state. Since this is a task-specific dialogue system evaluated for a specific mathematical scenario, we already know the possible dialogue states of the conversation; at each turn, we use all previous pre-service teacher utterances to determine the current dialogue state.
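One way to approximate this formulation is sketched below: each dialogue state is queried as a question over the concatenated conversation history using an off-the-shelf QA model. The state questions come from the paper; the confidence-score heuristic for turning an extractive answer into a yes/no state is an illustrative assumption, not the paper's method.

```python
# Sketch of the question-answering formulation of dialogue state tracking:
# each state is queried as a question over the conversation history.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

STATE_QUESTIONS = {
    "asked_probing": "Did the teacher ask a probing and exploring question?",
    "agent_answered": "Did the conversational agent answer the question correctly?",
    "teacher_acknowledged": "Did the pre-service teacher acknowledge the answer?",
}

def track_state(history: list[str]) -> dict:
    context = " ".join(history)
    # Treat a confident extractive answer as evidence the state was reached
    # (an assumed heuristic; the 0.5 cutoff is arbitrary).
    return {state: qa(question=q, context=context)["score"] > 0.5
            for state, q in STATE_QUESTIONS.items()}

state = track_state([
    "Teacher: What happens to the area if the scale factor is 2?",
    "Student: The area becomes four times bigger.",
])
```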
4.3.2 Dialogue Policy

In our task-specific dialogue system, the dialogue policy is rule-based: the system's utterance at turn n + 1 depends on the dialogue states accomplished up to turn n. The hand-crafted rules for our dialogue policy also enable direct evaluation based on simple metrics, such as the number of questions in each adapted IQA category, or the development of metrics like Initiate Response Evaluate (IRE) (Mehan, 1979).

4.4 Response Generation

The response generation component extracts relevant sections of the knowledge base as a question-answering task. A question-answering task, also referred to as a reading comprehension task, is a supervised learning problem where, given a text segment of i tokens and a question of j tokens, the model returns an answer segment of k tokens. The answer can be cloze-style, as in CNN/Daily Mail (Hermann et al., 2015), span prediction, as in SQuAD (Rajpurkar et al., 2016), or free-form, as in NarrativeQA (Kočiský et al., 2018). Because we retrieve our knowledge through semantic matching of web text, our response generation pipeline matches span prediction most closely. We implemented the response generation pipeline using the transformers library (Wolf et al., 2020), with a BERT model (Devlin et al., 2018) fine-tuned on the SQuAD dataset. We did not further fine-tune this question-answering system for the response generation module, instead relying on the semantically matched unstructured data sections as inputs for generating answers to questions.
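A sketch of this pipeline is shown below. The checkpoint name is one publicly available SQuAD-fine-tuned BERT model, not necessarily the exact checkpoint used, and the knowledge base section is illustrative.

```python
# Sketch of the response-generation step: a SQuAD-fine-tuned BERT model from
# the transformers library extracts an answer span from the semantically
# matched knowledge base section, and that span becomes the agent's reply.
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

section = ("A scale factor is the ratio between corresponding measurements "
           "of an object and a representation of that object. Multiplying "
           "every side length by the scale factor produces a similar figure.")

response = qa(question="What is a scale factor?", context=section)
print(response["answer"])  # extracted span used as the agent's response
```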
4.5 Session Feedback

All text input by the pre-service teacher is retained, as is the associated adapted IQA category classification. The compilation of classifications of the pre-service teacher's input texts is provided in an assessment report, which is used as feedback for the pre-service teacher at the end of the session.
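A sketch of assembling this report is given below; it builds on the hypothetical rubric structure from the earlier JSON sketch, and the report fields are assumptions about a reasonable shape for the feedback.

```python
# Sketch of the session-feedback report: tally the classifier's adapted IQA
# labels over all retained teacher inputs and compare against the rubric.
from collections import Counter

def session_report(classified_turns: list[str], rubric: dict) -> dict:
    counts = Counter(classified_turns)
    return {
        "category_counts": dict(counts),
        "rubric_met": {cat: counts[cat] >= n
                       for cat, n in rubric["required_counts"].items()},
    }
```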
5 Results

This section discusses the step-by-step interaction with the developed interface. Only one scenario currently exists; it is defined as a 4th-5th grade average understanding of the mathematics topic of scale factor.

5.1 IQA Classifier

To train the IQA classifier, we used BERT (Devlin et al., 2018) and DistilBERT (Sanh et al., 2019) for classification, based on the open-source Transformers (Wolf et al., 2020) implementation. We trained the classifiers on 80% of the data and sectioned the remaining data into a 10% validation set and a 10% test set. The validation set accuracies for the BERT and DistilBERT models were 75.8 and 74.3, respectively. Since the performance increase with BERT was minimal, we used DistilBERT for faster inference.

5.2 Walk Through

The platform is currently hosted on PythonAnywhere. Pre-generated credentials for instructors are issued for a given user ID, and each user uses the unique ID to create an associated password. Following the prompts, the user is able to read through the IRB consent form and sign in to access the conversational interface. Figure 4 shows the first conversational agent interaction screen as well as example inputs where the pre-service teacher may ask several questions to assess the conversational agent's current level of understanding.

Figure 4: Conversational Agent Interface Screen. Left image shows the initial screen. Right image shows an interaction example where the conversational agent has identified a statement that does not meet the semantic matching threshold within the knowledge base.

The pre-service teacher, or user, can then interact with the conversational agent by typing questions or statements. The Appendix contains additional screen captures of the developed interface. Currently the system only supports text-based conversation.

An example of a text snippet that does not meet the set semantic matching threshold, together with the generated conversational agent response, is demonstrated in Figure 4. There are instances when the conversation breaks down despite the teacher re-framing their questions or input differently. This is intended to reflect how a student may sometimes fail to answer a properly framed question or statement appropriately.

Once the session is completed, the pre-service teacher receives a session report that incorporates the adapted IQA rubric and the number of their interactions classified within each of the categories. They are also provided a report of the entire conversation, which is shared with expert teachers as well, in an effort to identify how the conversational agent stages can be designed better.

6 Conclusion and Future Work

Our goal in this paper is to demonstrate the implementation of a conversational agent with very little training data that relies on foundational and well-studied metrics like the IQA. By leveraging state-of-the-art modules for natural language processing and deep learning, we could build a functional prototype that is now going to be used as a pilot for training on the specific mathematical scenario of "scale factor". By integrating pre-trained models such as BERT fine-tuned on SQuAD and the Universal Sentence Encoder, as well as using weak supervision approaches in data treatment, we have leveraged minimal amounts of domain-expert-labeled data and knowledge base data to create a usable interface. Approaches like this help evaluate pre-service teachers in a scalable fashion and can also be deployed across the web for large-scale participation. There are several areas we are pursuing to provide a more robust interface as well as a more useful assessment tool. Some of the features we are working on are as follows:

• Improved IQA model: We plan to continue collecting domain-expert labeled data that can then be used to improve the trained classification model. The improved classification model will better reflect a realistic assessment of pre-service teacher sessions using the IQA-developed categories as a rubric.

• Increased knowledge base: This first scenario is limited to a specific knowledge base on the topic "Scale Factor". We plan to incorporate a wider variety of data related to expected knowledge on this topic (from 4th grade to 10th grade) as well as associated topics such as perimeter, volume, and ratio. With these knowledge bases compiled, we will be able to test techniques that can assist in generating responses representative of anything from a limited understanding of the desired topic to a more advanced level of understanding.

• Response generation: Currently there are some features within the responses that appear to represent the way a student is more likely to respond. We would like to further develop these by incorporating more student-like speech features, which would allow for more realistic conversational agent interaction and result in less formal, textbook-like responses. Additionally, current responses are most coherent when responding to a question, which is not necessarily reflective of all student-teacher interactions. Future developments are planned to improve the robustness of responses to better account for different forms of input.

Finally, we plan to deploy this tool with a group of pre-service teachers under the direction of expert teachers in order to test the qualitative aspects and realism of this system.

Acknowledgments

This work was funded in part by the Robertson Foundation. The authors wish to acknowledge the use of de-identified classroom dialogue from NSF 1535024.
References

Jo Boaler and Karin Brodie. 2004. The importance, nature and impact of teacher questions. In Proceedings of the Twenty-Sixth Annual Meeting of the North American Chapter of the International Group for the Psychology of Mathematics Education, volume 2, pages 774-782.

Melissa Boston. 2012. Assessing instructional quality in mathematics. The Elementary School Journal, 113(1):76-104.

Melissa Boston, Jonathan Bostic, Kristin Lesseig, and Milan Sherman. 2015. A comparison of mathematics classroom observation protocols. Mathematics Teacher Educator, 3(2):154-175.

Melissa D. Boston and Amber G. Candela. 2018. The instructional quality assessment as a tool for reflecting on instructional practice. ZDM, 50(3):427-444.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder.

Nalin Chhibber and Edith Law. 2019. Using conversational agents to support learning by teaching. arXiv preprint arXiv:1909.13443.

Debajyoti Datta, Valentina Brashers, John Owen, Casey White, and Laura E. Barnes. 2016. A deep learning methodology for semantic utterance classification in virtual human dialogue systems. In International Conference on Intelligent Virtual Agents, pages 451-455. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Shuyang Gao, Abhishek Sethi, Sanchit Agarwal, Tagyoung Chung, and Dilek Hakkani-Tur. 2019. Dialog state tracking: A neural reading comprehension approach. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 264-273.

Melody Guan, Varun Gulshan, Andrew Dai, and Geoffrey Hinton. 2018. Who said what: Modeling individual labelers improves classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28:1693-1701.

Brian William Junker, Yanna Weisberg, Lindsay Clare Matsumura, Amy Crosson, Mikyung Wolf, Allison Levison, and Lauren Resnick. 2005. Overview of the instructional quality assessment. Regents of the University of California.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317-328.

Labelbox. 2020. Labelbox: Training data platform.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Lindsay Clare Matsumura, Brian Junker, Yanna Weisberg, and Amy Crosson. 2006. Overview of the instructional quality assessment.

Hugh Mehan. 1979. "What time is it, Denise?": Asking known information questions in classroom discourse. Theory into Practice, 18(4):285-294.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2015. Multi-domain dialog state tracking using recurrent neural networks. arXiv preprint arXiv:1506.07190.

Robert C. Pianta and Bridget K. Hamre. 2009. Conceptualization, measurement, and improvement of classroom processes: Standardized observation can leverage capacity. Educational Researcher, 38(2):109-119.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pages 3567-3575.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Pavel Smutny and Petra Schreiberova. 2020. Chatbots for learning: A review of educational chatbots for the Facebook Messenger. Computers & Education, page 103862.

Maxim Tkachenko, Mikhail Malyuk, Nikita Shevchenko, Andrey Holmanyuk, and Nikolai Liubimov. 2020. Label Studio: Data labeling software. Open source software available from https://github.com/heartexlabs/label-studio.

Zhuoran Wang and Oliver Lemon. 2013. A simple and generic belief tracking mechanism for the dialog state tracking challenge: On the believability of observed information. In Proceedings of the SIGDIAL 2013 Conference, pages 423-432.

Jason D. Williams. 2014. Web-style ranking and SLU combination for dialog state tracking. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 282-291.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, Online. Association for Computational Linguistics.

Zhao Yan, Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and Zhoujun Li. 2017. Building task-oriented dialogue systems for online shopping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Appendix
Figure 5: Conversational Agent Session Example: Login

Figure 6: Conversational Agent Session Example: Acknowledgement

Figure 7: Conversational Agent Session Example: Institutional Review Board consent
Figure 8: Conversational Agent Session Example: Initial Session Screen

Figure 9: Conversational Agent Session Example: Testing Scale Factor Knowledge
Figure 10: Conversational Agent Session Example: Improperly Phrased Question

Figure 11: Conversational Agent Session Example: Properly Phrased Question
Figure 12: Conversational Agent Session Example: Irrelevant Question