Transformer-based End-to-End Question Generation
Luis Enrico Lopez*, Diane Kathryn Cruz*, Jan Christian Blaise Cruz*, and Charibeth Cheng
Center for Language Technologies (CeLT)
College of Computer Studies (CCS)
De La Salle University
{luis_lopez, diane_cruz, jan_christian_cruz, charibeth.cheng}@dlsu.edu.ph
* Equal contribution.

Abstract

Question generation (QG) is a natural language generation task where a model is trained to ask questions corresponding to some input text. Most recent approaches frame QG as a sequence-to-sequence problem and rely on additional features and mechanisms to increase performance; however, these often increase model complexity, and can rely on auxiliary data unavailable in practical use. A single Transformer-based unidirectional language model leveraging transfer learning can be used to produce high quality questions while disposing of additional task-specific complexity. Our QG model, finetuned from GPT-2 Small, outperforms several paragraph-level QG baselines on the SQuAD dataset by 0.95 METEOR points. Human evaluators rated questions as easy to answer, relevant to their context paragraph, and corresponding well to natural human speech. Also introduced is a new set of baseline scores on the RACE dataset, which has not previously been used for QG tasks. Further experimentation with varying model capacities and datasets with non-identification type questions is recommended in order to further verify the robustness of pretrained Transformer-based LMs as question generators.

1 Introduction

Question Generation (QG) remains a relevant task in NLP. The ability to ask meaningful questions provides evidence towards comprehension within an Artificial Intelligence (AI) model (Nappi, 2017), which can have implications on other understanding-based tasks.

Many studies have produced robust models with good performance for QG in recent years. The most widely-used techniques are Deep Learning-based approaches involving Sequence-to-Sequence (Seq2Seq) (Sutskever et al., 2014) models. These approaches use two networks—usually LSTMs (Sutskever et al., 2014), but also Transformers very recently (Vaswani et al., 2017; Dong et al., 2019)—to encode the source context paragraph and to decode the embedded information into an output question, respectively.

Further works that improve on the standard Seq2Seq-based QG models use either extra mechanisms, extra features, or both. These include the usage of extra linguistic features (Zhou et al., 2017) or the introduction of answer-awareness (Zhao et al., 2018a; Du et al., 2017; Dong et al., 2019), which uses the answer to the desired question, or the position of the answer within the context paragraph, as additional features. A combination of these techniques provides the basis for state-of-the-art QG in recent years.

While these models yield high performance, they rely on the additional cost of an encoder-decoder setup, as well as extra features and mechanisms that make them harder to train and reproduce. We propose that a robust question generation system can be created using only a single pretrained Transformer-based language model, without the use of additional mechanisms, answer metadata, and extensive features. This method, while simpler, produces results able to outperform established baselines within the same task.

We benchmark our approach on the SQuAD (Rajpurkar et al., 2016) and RACE (Lai et al., 2017) Question Answering (QA) datasets, repurposed for QG use, in line with the methods done in previous studies. Generated questions were evaluated with standard language generation metrics as well as human preference annotation based on certain question characteristics. In addition, a variety of analyses were performed on the model and its output in order to isolate performance indicators and identify
the model's weaknesses and failure modes. In addition, we also introduce a set of baseline scores on the RACE dataset, which has not been previously used in QG research.

2 Methodology

2.1 Data Preparation

The QG model was finetuned on two datasets: SQuAD (Rajpurkar et al., 2016) and RACE (Lai et al., 2017).

Version 1.1 of the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains context paragraphs, each with sets of questions and corresponding answer spans related to the contents of these paragraphs; in total, SQuAD contains more than 100,000 crowdsourced questions. While originally intended for the task of question answering, previous works on question generation (Du et al., 2017; Zhao et al., 2018a) have repurposed SQuAD as a training and test dataset, designating the questions as the target output rather than the answer spans.

The Large-scale ReAding Comprehension Dataset From Examinations (RACE) (Lai et al., 2017) takes its context paragraph and question data from English reading comprehension tests written for Chinese middle and high school students. The primary difference setting RACE apart from SQuAD is its use of multiple choice questions instead of SQuAD's identification-type questions. Each question in RACE is paired with a set of four choices, similar to a multiple choice test.

As GPT-2 was pretrained to perform language modeling, it must be finetuned similarly. Thus, the two QG datasets were formatted such that they appear similar to input data for language modeling. Each dataset was transformed into a continuous body of text. Each training example consists of a context paragraph and its associated question(s) transformed into a single continuous sequence with a special delimiter in between. For the RACE-trained model, the corresponding choices for each question are also appended after the question, where A. is used as the delimiter between question and choice, as well as between choices. Training examples are separated by the newline character \n. Figure 2 shows an example of a single training example in the context-question form.

There can be multiple ways to perform this transformation from the datasets' original structured representation (JSON for SQuAD and RACE) to a continuous, language modeling-ready text. Experiments were performed on two identified factors in formatting this data: the delimiter used, and the representation method for multiple questions per context paragraph.

Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50. [SEP] Which NFL team represented the AFC at Super Bowl 50? [SEP] Where did Super Bowl 50 take place? [SEP] What color was used to emphasize the 50th anniversary of the Super Bowl?

Figure 1: A sample training example for question generation training. The context, delimiter, and questions are highlighted in red, green, and blue respectively. Uses the ARTIFICIAL delimiter and the AQPL format. Text adapted from the SQuAD dataset (Rajpurkar et al., 2016).

Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50. [SEP] Which NFL team represented the AFC at Super Bowl 50?

Figure 2: A sample training example for question generation training. The context, delimiter, and question are highlighted in red, green, and blue respectively. Uses the ARTIFICIAL delimiter and the OQPL format. Text adapted from the SQuAD dataset (Rajpurkar et al., 2016).
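For concreteness, the transformation illustrated in Figures 1 and 2 can be sketched as follows. This is not the authors' code; it assumes the SQuAD v1.1 JSON layout (data → paragraphs → qas), the ARTIFICIAL [SEP] delimiter, the OQPL format described below, and hypothetical file names.

```python
import json

SEP = " [SEP] "  # ARTIFICIAL delimiter; "Question: " or a numbered list would give the NATURAL schemes

def squad_to_oqpl(squad_json_path, out_path):
    """Write one 'context [SEP] question' training example per line (OQPL)."""
    with open(squad_json_path, encoding="utf-8") as f:
        squad = json.load(f)
    with open(out_path, "w", encoding="utf-8") as out:
        for article in squad["data"]:
            for paragraph in article["paragraphs"]:
                context = paragraph["context"].replace("\n", " ")
                for qa in paragraph["qas"]:
                    # OQPL: the context is duplicated for every associated question.
                    # AQPL would instead write the context once, followed by all of
                    # its questions joined with the same delimiter.
                    out.write(context + SEP + qa["question"].strip() + "\n")

# Hypothetical usage:
# squad_to_oqpl("train-v1.1.json", "squad_oqpl_train.txt")
```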
2.1.1 Delimiters

During data preparation, a delimiter is placed between each input context paragraph and output question. During training, this delimiter allows the model to properly distinguish between the context and question, while during prediction, it can be used as a marker at the end of some input text to invoke question generation behavior in the model. Three different delimiting schemes were tested: 1) ARTIFICIAL, a delimiter in the form of the token [SEP]; 2) NATURAL-QUESTION, a delimiter in the form of the word Question; and 3) NATURAL-NUMBER, a delimiting scheme in the form of a numbered list, where each item is a question.

The ARTIFICIAL delimiter was not present in the original model's vocabulary, and its weights are learned from scratch during the finetuning phase, while the NATURAL delimiting schemes rely on token weights already learned during the pretraining phase, thus making it possible for the model's pretrained knowledge to affect performance through these delimiters. Similar keywords have been shown to be effective in invoking certain pretrained model behaviors (e.g. TL;DR: for summarization), even in a zero-shot setting (Radford et al., 2019).

2.1.2 Questions Per Line

There can be several possible questions associated with a single paragraph. Two ways to flatten this many-to-one relationship in the formatted data were identified:

All Questions Per Line (AQPL): A single training example consists of a context paragraph with all of its associated questions placed immediately after it, separated from one another with the selected delimiter. While this avoids duplication of context and thus results in faster training time, it may potentially result in the model no longer being able to attend to earlier tokens as its context window moves further away from the beginning of the input paragraph. This is critical in the question generation task, as information pertaining to a reference question may be found anywhere in the input paragraph. If that information is found at the beginning, outside of the model's current context window, the model may have difficulty generating the corresponding question.

One Question Per Line (OQPL): Each context paragraph is duplicated for each of its associated questions, such that for a single training example, there is only one context and one question. For many cases, this may alleviate the moving context window problem raised with AQPL, as the length of a single training example is reduced to the length of an input paragraph plus the length of a single associated question. However, this format does result in a longer training time due to the duplicated contexts increasing the size of the final formatted dataset.

2.2 Model Setup and Finetuning

The 124 million parameter GPT-2, the smallest of the four available GPT-2 model sizes, was used as the base pretrained model. From this, a total of 12 question generation models were finetuned, each using one of the data format combinations enumerated in Section 2.1.

Each model was trained for 3 epochs using causal language modeling loss. The Adam optimizer (Kingma and Ba, 2015) was used with an initial learning rate of 5 × 10^-4 and a linearly decreasing learning rate schedule with warmup for 10% of total training steps.

Due to memory constraints with the hardware available, a batch size of 32 was simulated by combining an actual batch size of 2 with 16 gradient accumulation steps per minibatch. As such, the larger GPT-2 sizes were no longer used in this study.
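A minimal sketch of this finetuning setup is given below. It assumes the HuggingFace transformers and datasets libraries rather than the authors' exact training scripts, uses the stated hyperparameters (3 epochs, learning rate 5 × 10^-4, linear decay with 10% warmup, batch size 2 with 16 gradient accumulation steps), and relies on hypothetical file and output names; the Trainer's default AdamW optimizer stands in for the paper's Adam.

```python
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # 124M-parameter GPT-2 Small
tokenizer.pad_token = tokenizer.eos_token               # GPT-2 ships without a pad token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The ARTIFICIAL delimiter is not in GPT-2's vocabulary; add it so its embedding
# is learned from scratch during finetuning.
tokenizer.add_tokens(["[SEP]"])
model.resize_token_embeddings(len(tokenizer))

# One formatted training example per line (see the data preparation sketch above).
raw = load_dataset("text", data_files={"train": "squad_oqpl_train.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="gpt2-qg",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,   # simulates an effective batch size of 32
    learning_rate=5e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                 # warmup for 10% of total training steps
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM loss
)
trainer.train()
```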
2.3 Model Generation

The model temperature was set to 0.6. Higher temperature values result in more randomness in generations, while lower values approach greedy behavior.

The top-p nucleus sampling method (Holtzman et al., 2020) was used with a value of p = 0.9. Top-p allows for more diverse generations than a purely greedy scheme, and minimizes the occurrence of certain tokens or token spans repeating indefinitely in the generated text.

Each generation loop is terminated either when the model generates the newline character \n, or when the model reaches a generation length of 32 tokens. This maximum length was manually set in order to terminate generation sessions that are stuck in token span loops and do not reach the \n end-of-text token on their own.
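A sketch of this decoding scheme, again assuming the transformers API and a hypothetical finetuned checkpoint name; the newline stopping rule is approximated here by truncating the sampled output at the first generated \n.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-qg")            # hypothetical checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2-qg").eval()

def generate_question(context, max_new_tokens=32):
    """Sample one question for a context using temperature 0.6 and nucleus sampling (p = 0.9)."""
    prompt = context + " [SEP] "                                     # delimiter invokes QG behavior
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.6,
            top_p=0.9,
            max_new_tokens=max_new_tokens,                           # hard cap of 32 generated tokens
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:])
    return generated.split("\n")[0].strip()                          # stop at the first newline
```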
2.4 Metrics and Evaluation

Model generations were automatically evaluated against reference data using several text generation quality metrics commonly used in QG: BLEU-1, BLEU-2, BLEU-3, BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Denkowski and Lavie, 2014). Existing implementations of the aforementioned metrics (Sharma et al., 2017) were used to quantify the models' performance.

Model generations were also evaluated via human annotation. Volunteers were asked to rate questions based on three criteria: 1) naturalness and 2) difficulty, following Du et al. (2017)'s methodology, as well as 3) relevance to the context. For the RACE-trained model, generated choices were also evaluated according only to naturalness and relevance.

3 Results and Discussion

3.1 SQuAD Automated Evaluation (AE)

The best performing model is the One Question Per Line (OQPL) model with number delimiters, achieving the highest scores for BLEU-2, BLEU-3, BLEU-4, and METEOR. For BLEU-1 and ROUGE-L, the One Question Per Line (OQPL) model with artificial delimiters performed the best.

It is interesting to note, however, that the best OQPL models are on average only 0.6917 points better than their corresponding All Questions Per Line (AQPL) counterparts. Additionally, the maximum difference between the OQPL number delimiter model and the other models across all automatic metrics is only 1.06. This shows that different tags and delimiters have only a marginal effect on AE performance.

For further analysis, post-finetuning features were also extracted from the generated questions of the best performing model, such as question length, paragraph context length, and the longest common subsequence between the paragraph context and the generated question.

From the generated questions of the OQPL number delimiter model, the following behaviors were observed:

• Some generated questions seem to be simply extracting phrases from the paragraph context, and returning them in question form.

• Of the 2,067 sample generated questions, 19 do not end with a "?" token. Note that such samples are not considered "full questions" based on the identified failure modes.

3.1.1 Evaluating Context-Copy

The model exhibits a form of context-copying behavior, where it extracts verbatim spans of text from context paragraphs in order to generate questions. 94.67% of generated questions contained tokens or spans of tokens from their corresponding context paragraphs. To quantify the extent of context-copying in the average question, the longest common subsequence (LCS) was calculated between the generated question and the given context paragraph. On average, generated questions take 6 tokens (mean LCS length of 6.25) from the context paragraph. This context-copying behavior is usually found in identification-type questions (who, what, when or where), which comprise 91.67% of the total generated samples.

The model appears to have learned the context-copying technique on its own without explicit instruction during training. This may be due to the frequency of identification-type questions in the training data; 88.26% of SQuAD's training set is composed of identification-type questions.
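The LCS statistic used here can be computed with a standard dynamic program over whitespace tokens; the sketch below is an illustration, not the authors' implementation.

```python
def lcs_length(question_tokens, context_tokens):
    """Length of the longest common subsequence between two token lists."""
    m, n = len(question_tokens), len(context_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if question_tokens[i - 1] == context_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def mean_lcs(question_context_pairs):
    """Mean LCS length over (question, context) string pairs, using whitespace tokenization."""
    lengths = [lcs_length(q.split(), c.split()) for q, c in question_context_pairs]
    return sum(lengths) / len(lengths)
```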
3.1.2 Failure Modes

After testing, 19 samples from the generated question set were observed to be non-questions, based on two identified failure modes: 1) the last three words repeat indefinitely, and 2) the generated question is cut prematurely. See Table 2 for examples of the failure modes.

For failure case 1, the generated question keeps repeating words, presumably because the attention mechanism fails to pinpoint important context words, which leads the model to become confused when generating the next token. To analyze this behavior, the attention weights produced while generating for this failure mode were visualized, as shown in Figure 3.

When observing the attention scores over the context paragraph for failure case 1, we find that the attention mechanism is "confused." Attention should ideally point to specific positions in the inputs in order to provide context information better. However, in this case, it can be observed that the attention scores are evenly distributed over a number of random positions in the given context paragraph when generating a token after the word "profession." This behavior can be seen in most of the attention heads and layers given this example. This failure of the mechanism to identify important context points in the paragraph may have led to its inability to properly generate a question.

Figure 3: Sample attention visualization for generated outputs of failure mode 1. This example shows the words and the attention values to those words when focusing on the word "profession," which is highlighted in red.

For failure case 2, it can be surmised that the generation is cut simply because it reached the maximum generation length while copying text from the context, as a consequence of the model's context-copy mode (which it learned as its most common generation mechanism).
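Attention weights of this kind can be pulled directly out of the model for inspection. The sketch below assumes the transformers API and a hypothetical finetuned checkpoint name; the authors' visualization tooling is not specified.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-qg")                       # hypothetical checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2-qg", output_attentions=True).eval()

def attention_over_prompt(text, layer=-1, head=0):
    """Attention that the final position pays to every earlier token, for one layer and head."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.attentions is a tuple with one [batch, heads, seq, seq] tensor per layer.
    weights = out.attentions[layer][0, head, -1, :]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return list(zip(tokens, weights.tolist()))

# A roughly uniform distribution over the context corresponds to the "confused"
# attention pattern observed for failure case 1.
```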
Format | Delimiter | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L
AQPL | Artificial | 54.83 | 30.13 | 15.72 | 7.31 | 20.53 | 43.88
AQPL | Number | 54.98 | 30.31 | 15.79 | 7.57 | 20.69 | 43.83
AQPL | Question | 55.03 | 30.46 | 16.20 | 7.74 | 20.71 | 44.039
OQPL | Artificial | 55.60 | 31.03 | 16.56 | 7.89 | 21.03 | 44.41
OQPL | Number | 55.51 | 31.17 | 16.79 | 8.27 | 21.2 | 44.38
OQPL | Question | 55.28 | 30.81 | 16.55 | 8.21 | 21.11 | 44.27

Table 1: SQuAD automatic evaluation results for all models.

Case 1
Question: What is a profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession
Context: Teaching may be carried out informally, within the family, which is called homeschooling, or in the wider community. Formal teaching may be carried out by paid professionals. Such professionals enjoy a status in some societies on a par with physicians, lawyers, engineers, and accountants (Chartered or CPA).

Case 2
Question: Which newspaper in the United States defined Southern California as including the seven counties of Los Angeles, San Bernardino, Orange, Riverside, San Diego, Ventura and Sant
Context: In 1900, the Los Angeles Times defined southern California as including "the seven counties of Los Angeles, San Bernardino, Orange, Riverside, San Diego, Ventura and Santa Barbara." In 1999, the Times added a newer county—Imperial—to that list.

Table 2: Examples of failed generations from the best performing model's failure modes.

3.1.3 Comparison with Previous Studies

As seen in Table 3, the model outperforms prior RNN-based Seq2Seq works (Du and Cardie, 2018; Zhao et al., 2018b) in terms of METEOR and ROUGE-L score. It is worth noting that, in addition to a more complex model setup, Zhao et al. (2018b) uses other techniques such as a maxout pointer mechanism and gated self-attention mechanisms. Other previous works also use answer awareness, using the positions of the answers in the paragraph, or the answers themselves, as additional features for the model. The proposed model uses none of these extra features, yet still achieves robust METEOR and ROUGE-L scores that outperform these studies.

Our model performs worse in terms of BLEU-4 and ROUGE-L, and slightly worse in terms of METEOR, when compared with the recent UniLM work of Dong et al. (2019). It is important to note that Dong et al. (2019) is also the only other work that uses a Transformer for their question generation model. Their incorporation of an answer-awareness mechanism, in addition to the multiple modes of finetuning on a Seq2Seq Transformer, produces the best results in recent literature.

Model | BLEU-4 | METEOR | ROUGE-L
Du and Cardie (2018) | 15.16 | 19.12 | -
Zhao et al. (2018b) (s2s+a) | 4.8 | 12.52 | 30.11
Zhao et al. (2018b) (s2s-a-at-mcp-gsa) | 16.38 | 20.25 | 44.48
Dong et al. (2019) | 22.12 | 25.06 | 51.07
GPT-2 QG (ours) | 8.26 | 21.2 | 44.38

Table 3: Performance of previous works in QG with paragraph-level input compared to ours. "Answer" refers to the model's usage of answer awareness. Du and Cardie (2018) did not report a ROUGE-L score.
3.2 RACE Automated Evaluation (AE)

For analyzing performance on RACE, we utilize the best performing model (OQPL with number delimiters). After producing a set of generated questions from the test set, we observe the following:

• Only 44% of the generated questions exhibit context copying.

• Of the 1,407 sample generated questions, 77 are repeated questions, albeit generated from different context paragraphs.

• Three of the generated questions do not end with a "?", ".", or "_" token. Note that these questions are not considered "full questions."

3.2.1 Question Types

The RACE dataset has two observable question types: fill-in-the-blanks-type questions and identification-type questions. In the training dataset, the frequency of these two types is well-balanced, with 52.44% being fill-in-the-blank-type questions and 47.56% being identification-type questions.
After studying the generated outputs from the best-performing RACE model, we observe that the model generated significantly more identification-type questions than fill-in-the-blanks-type questions (65.26% against 34.75%).

We hypothesize that the model generates more identification-type questions, despite the two types being evenly distributed in the training dataset, because of the model's tendency to learn context-copying. In addition, copying from existing text is significantly easier than producing a fill-in-the-blanks-type question along with plausible multiple choice answers.

3.2.2 Evaluating Context-Copy

Context-copying frequency was quantified using two methods:

• Calculating the LCS between the generated question and the given context paragraph.

• Calculating the LCS between the generated choices and the given context paragraph.

We observe that only 44% and 24% of questions exhibit context-copying, based on methods 1 and 2, respectively. We hypothesize that the RACE-finetuned models perform context-copying less due to the fact that answers to questions in RACE cannot simply be lifted from the context paragraph.

3.2.3 Repeating Questions

Duplicated questions account for 3.5% of RACE's training set. Many of these, while valid, are generic comprehension questions which can apply to any paragraph, and which are primarily distinguished by their choices; examples include "What is the main idea of the passage?", "What is the best title for the passage?", and "Which of the following is true/false?" The model generated the latter true-or-false question for 7.9% of the test set passages, suggesting that it learned to ask these generic questions that are independent of the passage's content.

3.2.4 Failure Modes

Only a single instance was observed where the model, instead of generating a question, opted to continue the context passage. As the model was pretrained on the language modeling task, it is possible for this to occur even after a learned delimiter is used to invoke QG behavior, though only very rarely.

3.3 SQuAD Human Evaluation (HE)

Five rounds of evaluation were performed across the entire generated set, resulting in five annotations for each question.

3.3.1 Distribution of SQuAD HE Scores

Average scores for each attribute (3.61, 2.43, and 4.20 for relevance, difficulty, and naturalness respectively) suggest that generated questions were mostly relevant to their contexts, easy, and natural sounding. Low standard deviation values for all three also show that there were few outlier questions.
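The average pairwise Cohen's kappa values reported in Table 6 below can, in principle, be reproduced with an off-the-shelf implementation; a sketch assuming scikit-learn and hypothetical rating arrays:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(ratings_by_annotator):
    """Average Cohen's kappa over all annotator pairs.

    ratings_by_annotator: one list of scores per annotator, aligned by question
    (e.g. the relevance ratings each annotator gave to the same set of questions).
    """
    kappas = [cohen_kappa_score(a, b)
              for a, b in combinations(ratings_by_annotator, 2)]
    return sum(kappas) / len(kappas)

# Hypothetical example with three annotators scoring four questions on a 1-5 scale:
# mean_pairwise_kappa([[4, 5, 3, 4], [4, 4, 3, 5], [5, 4, 3, 4]])
```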
Model | Data Format | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L
RACE | OQPL Question | 49.35 | 36.14 | 27.53 | 19.80 | 38.79
SQuAD | OQPL Number | 55.51 | 31.17 | 16.79 | 8.27 | 44.43

Table 4: SQuAD and RACE results for Automatic Evaluation.

 | Relevance | Difficulty | Naturalness
Mean | 3.61 | 2.43 | 4.2
Standard deviation | 0.66 | 0.9 | 0.55

Table 5: Average and standard deviation of human evaluation scores for the SQuAD-OQPL-QUESTION model.

 | Relevance | Difficulty | Naturalness
Original | 0.1603 | 0.1338 | 0.0698
Regrouped | 0.2325 | 0.2405 | 0.1843

Table 6: Average pairwise Cohen's kappa values for each question scoring attribute.

3.3.3 High HE and Low AE

We define a low AE score as one falling below the 25th percentile of BLEU-4.
3.4 RACE Human Evaluation (HE)

3.4.1 Distribution of RACE HE Scores

Average scores for each attribute (3.93, 2.09, and 4.11 for relevance, difficulty, and naturalness respectively) suggest that generated questions and choices were mostly relevant to their contexts, easy, and natural sounding.

 | Relevance | Difficulty | Naturalness
Mean | 3.93 | 2.09 | 4.11
Standard deviation | 0.39 | 0.61 | 0.38

Table 7: Average and standard deviation of human evaluation scores for the RACE-OQPL-NUMBER model.

3.4.2 High HE and Low AE

We again define a low AE score as one falling below the 25th percentile of BLEU-4.

4 Conclusion

First, we show that a single pretrained Transformer-based language model is capable of performing the QG task without sacrificing performance.

Our model proved to be robust to datasets with increasing difficulty, yielding high human preference scores on both the SQuAD dataset and the RACE dataset. It also adapted well to data containing mixed question types beyond the typical question-word (e.g. Who, What, When) construction, such as the fill-in-the-blank-type questions found in RACE.

Second, we provide initial baseline results for the RACE dataset in the context of QG. Previously, RACE has only been proposed for QG use, but has never been used in actual studies. We look towards the reporting of our baseline scores to be the start of future QG research involving this complex dataset.

Third, we identify limitations with existing human evaluation schemes. Human annotation results
References

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 13042–13054.

Xinya Du and Claire Cardie. 2018. Harvesting paragraph-level question-answer pairs from Wikipedia. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1907–1917, Melbourne, Australia. Association for Computational Linguistics.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352, Vancouver, Canada. Association for Computational Linguistics.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Judith S. Nappi. 2017. The importance of questioning in developing critical thinking skills. Delta Kappa Gamma Bulletin, 84(1):30.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018a. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910, Brussels, Belgium. Association for Computational Linguistics.

Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018b. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910, Brussels, Belgium. Association for Computational Linguistics.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In Natural Language Processing and Chinese Computing - 6th CCF International Conference, NLPCC 2017, Dalian, China, November 8-12, 2017, Proceedings, volume 10619 of Lecture Notes in Computer Science, pages 662–671. Springer.