Interpretability for Language Learners Using Example-Based Grammatical Error Correction
Interpretability for Language Learners Using Example-Based Grammatical Error Correction

Masahiro Kaneko, Sho Takase, Ayana Niwa, Naoaki Okazaki
Tokyo Institute of Technology
{masahiro.kaneko, sho.takase, ayana.niwa}@nlp.c.titech.ac.jp, okazaki@c.titech.ac.jp

arXiv:2203.07085v1 [cs.CL] 14 Mar 2022

Abstract

Grammatical Error Correction (GEC) should focus not only on correction accuracy but also on the interpretability of the results for language learners. However, existing neural-based GEC models mostly focus on improving accuracy, while their interpretability has not been explored. Example-based methods are promising for improving interpretability: they use similar retrieved examples to generate corrections. Furthermore, examples are beneficial in language learning, helping learners to understand the basis for grammatically incorrect/correct texts and improve their confidence in writing. Therefore, we hypothesized that incorporating an example-based method into GEC could improve interpretability and support language learners. In this study, we introduce an Example-Based GEC (EB-GEC) that presents examples to language learners as a basis for the correction results. The examples consist of pairs of correct and incorrect sentences similar to a given input and its predicted correction. Experiments demonstrate that the examples presented by EB-GEC help language learners decide whether to accept or refuse suggestions from the GEC output. Furthermore, the experiments show that retrieved examples also improve the accuracy of corrections.

Figure 1: EB-GEC presents not only a correction but also an example of why the GEC model suggested this correction.

1 Introduction

Grammatical Error Correction (GEC) models, which generate grammatically correct texts from grammatically incorrect texts, are useful for language learners. In GEC, various neural-based models have been proposed to improve the correction accuracy (Yuan and Briscoe, 2016; Chollampatt and Ng, 2018; Junczys-Dowmunt et al., 2018; Zhao et al., 2019; Kaneko et al., 2020; Omelianchuk et al., 2020). However, the basis on which a neural GEC model makes corrections is generally uninterpretable to learners. Neural GEC models rarely address correction interpretability, leaving language learners with no explanation of the reason for a correction.

Interpretability plays a key role in educational scenarios (Webb et al., 2020). In particular, presenting examples has been shown to be effective in improving understanding. Language learners acquire grammatical rules and vocabulary from examples (Johns, 1994; Mizumoto and Chujo, 2015). Presenting examples of incorrect sentences together with correct ones improves the understanding of grammatical correctness as well as essay quality (Arai et al., 2019, 2020).

Recently, example-based methods have been applied to a wide range of natural language processing tasks to improve the interpretability of neural models, including machine translation (Khandelwal et al., 2021), part-of-speech tagging (Wiseman and Stratos, 2019), and named entity recognition (Ouchi et al., 2020). These methods predict labels or tokens by considering the nearest neighbor examples retrieved using the representations of the model at inference time. Khandelwal et al. (2021) showed that in machine translation, examples close to a target sentence in the representation space of the decoder are useful for translating the source sentence. Inspired by this, we hypothesized that examples corrected for similar reasons are distributed closely in the representation space.
Thus, we assume that neighbor examples can enhance the interpretability of the GEC model, allowing language learners to understand the reason for a correction and assess its validity.

In this paper, we introduce an example-based GEC (EB-GEC) [1] that corrects grammatical errors in an input text and provides examples for language learners, explaining the reason for the correction (Figure 1). As shown in Figure 2, the core idea of EB-GEC is to unify the token prediction model for correction and the retrieval model for related examples from the supervision data into a single encoder-decoder model. EB-GEC can present the reason for the correction, which we hope will help learners decide whether to accept or to refuse a given correction.

[1] Our code is publicly available at https://github.com/kanekomasahiro/eb-gec

Experimental results show that EB-GEC predicts corrections more accurately than the vanilla GEC without examples on three datasets and comparably on one dataset. Experiments with human participants demonstrate that EB-GEC presents significantly more useful examples than the baseline methods of example retrieval (Matsubara et al., 2008; Yen et al., 2015; Arai et al., 2020). These results indicate that examples are useful not only to the GEC models but also to language learners. This is the first study to demonstrate the benefits of the examples themselves for real users, as existing studies (Wiseman and Stratos, 2019; Ouchi et al., 2020; Khandelwal et al., 2021) only showed the utility of examples for improving task accuracy.

Figure 2: An illustration of how EB-GEC chooses examples and predicts a correction. The model predicts the correction "They have /a tremendous problem ." by using the example "This has /a tremendous problem .". Hidden states of the decoder computed during the training phase are stored as keys, and the tokens of the output sentences corresponding to the hidden states are stored as values. A hidden state of the decoder (blue box) at the time of inference is used as a query to search for the k-neighbors (yellow box) among the hidden states of the training data. EB-GEC predicts a distribution of tokens for the correction from a combination of two token distributions: a vanilla distribution computed by transforming the hidden state of the decoder, and a kNN distribution computed from the retrieved k-neighbors.

2 EB-GEC

EB-GEC presents language learners with a correction and the related examples it used for generating the correction of the input sentence. k-Nearest-Neighbor Machine Translation (kNN-MT; Khandelwal et al., 2021) was used as the base method to consider examples in predicting corrections. kNN-MT predicts tokens by considering the nearest neighbor examples based on representations from the decoder at the time of inference. EB-GEC could use any method (Gu et al., 2018; Zhang et al., 2018; Lewis et al., 2020) to consider examples, but kNN-MT was used in this study because it does not require additional training for example retrieval.

Figure 2 shows how EB-GEC retrieves examples using kNN-MT. EB-GEC performs inference using the softmax distribution over target tokens obtained from the encoder-decoder model (referred to as the vanilla distribution hereafter) and the distribution generated by the nearest neighbor examples. The nearest neighbor search is performed over a cache of examples indexed by the decoder hidden states on the supervision data (the kNN distribution). EB-GEC can be adapted to any trained autoregressive encoder-decoder GEC model. A detailed explanation of retrieving examples using kNN-MT is provided in Section 2.1, and of presenting examples in Section 2.2.

2.1 Retrieving Examples Using kNN-MT

Let $x = (x_1, \dots, x_N)$ be an input sequence and $y = (y_1, \dots, y_M)$ be an output sequence of the autoregressive encoder-decoder model. Here, $N$ and $M$ are the lengths of the input and output sequences, respectively.
Vanilla Distribution. In a vanilla autoregressive encoder-decoder model, the distribution for the i-th token $y_i$ of the output sequence is conditioned on the entire input sequence $x$ and the previous output tokens $\hat{y}_{1:i-1}$, where $\hat{y}$ represents a sequence of generated tokens. The probability distribution of the i-th token $p(y_i \mid x, \hat{y}_{1:i-1})$ is calculated by a linear transformation of the decoder's hidden state $h(x, \hat{y}_{1:i-1})$ followed by the softmax function.

Output Distribution. Let $p_{\mathrm{EB}}(y_i \mid x, \hat{y}_{1:i-1})$ denote the final probability distribution of tokens from EB-GEC. We define $p_{\mathrm{EB}}(y_i \mid x, \hat{y}_{1:i-1})$ as a linear interpolation of the vanilla distribution $p(y_i \mid x, \hat{y}_{1:i-1})$ and $p_{\mathrm{kNN}}(y_i \mid x, \hat{y}_{1:i-1})$ (explained later), which is the distribution computed using the examples in the datastore:

$$p_{\mathrm{EB}}(y_i \mid x, \hat{y}_{1:i-1}) = \lambda\, p_{\mathrm{kNN}}(y_i \mid x, \hat{y}_{1:i-1}) + (1 - \lambda)\, p(y_i \mid x, \hat{y}_{1:i-1}). \quad (1)$$

Here, $0 \le \lambda \le 1$ is an interpolation coefficient between the two distributions. This interpolation also improves the output robustness when relevant examples are not found in the datastore.

Datastore. In the work of Khandelwal et al. (2021), the i-th hidden state $h(x, y_{1:i-1})$ of the decoder in the trained model is stored as a key, and the corresponding next token $y_i$ is stored as a value. In order to present examples of incorrect/correct sentences, we store a tuple of the token $y_i$, the incorrect input sentence $x$, and the correct output sentence $y$ as a value of the datastore. Thus, we build key-value pairs $(\mathcal{K}, \mathcal{V})$ from all decoder timesteps for the entire training data $(\mathcal{X}, \mathcal{Y})$:

$$(\mathcal{K}, \mathcal{V}) = \{(h(x, y_{1:i-1}), (y_i, x, y)) \mid \forall y_i \in y,\ (x, y) \in (\mathcal{X}, \mathcal{Y})\}. \quad (2)$$

kNN Distribution. During inference, given a source $x$ as input, the model uses the i-th hidden state $h(x, \hat{y}_{1:i-1})$ of the decoder as the query to search for the k-nearest neighbors:

$$\mathcal{N} = \{(u^{(j)}, (v^{(j)}, x^{(j)}, y^{(j)})) \in (\mathcal{K}, \mathcal{V})\}_{j=1}^{k}, \quad (3)$$

where $u^{(j)}$ $(j = 1, \dots, k)$ are the k-nearest neighbors of the query $h(x, \hat{y}_{1:i-1})$ measured by squared L2 distance. The tuple $(v^{(j)}, x^{(j)}, y^{(j)})$ is the value associated with the key $u^{(j)}$ in the datastore $(\mathcal{K}, \mathcal{V})$. Then, kNN-MT aggregates the retrieved tokens into a probability distribution $p_{\mathrm{kNN}}(y_i \mid x, \hat{y}_{1:i-1})$ by applying a softmax with temperature $T$ to the negative L2 distances [2]:

$$p_{\mathrm{kNN}}(y_i \mid x, \hat{y}_{1:i-1}) \propto \sum_{(u, (v, \_, \_)) \in \mathcal{N}} \mathbb{I}_{v = y_i} \exp\left(\frac{-\lVert u - h(x, \hat{y}_{1:i-1}) \rVert}{T}\right). \quad (4)$$

[2] In Equation 4, we do not use the input and output sentences in the value, and thus represent them as _.
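As a minimal sketch of Equations (1)-(4), assuming the datastore of Equation (2) is already available as an array of keys and a list of value tuples, and the vanilla distribution has been computed elsewhere, the kNN distribution and the interpolated EB-GEC distribution could look like the following. The function names `knn_distribution` and `eb_gec_distribution` are illustrative and not part of the released code.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=16, temperature=1000.0):
    """Eqs. (3)-(4): retrieve the k nearest datastore entries to the decoder state
    and turn them into a token distribution with a temperature softmax."""
    dists = np.linalg.norm(keys - query, axis=1)         # L2 distance to every key
    neighbours = np.argsort(dists)[:k]                   # indices of the k nearest keys
    weights = np.exp(-dists[neighbours] / temperature)   # numerators of Eq. (4)
    p_knn = np.zeros(vocab_size)
    for idx, w in zip(neighbours, weights):
        token_id, _, _ = values[idx]                     # value tuple (y_i, x, y); only y_i is used here
        p_knn[token_id] += w
    p_knn /= p_knn.sum()
    examples = [values[i] for i in neighbours]           # (token, source, target) tuples to present
    return p_knn, examples

def eb_gec_distribution(p_vanilla, p_knn, lam=0.5):
    """Eq. (1): linear interpolation of the vanilla and kNN distributions."""
    return lam * p_knn + (1.0 - lam) * p_vanilla
```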
2.2 Presenting Examples

We used the pair of incorrect and correct sentences stored in the value retrieved for the predicted token $\hat{y}_i$ as an example for the correction. Figure 1 depicts an example where the retrieved value consists of the predicted token $v^{(j)} =$ "a" and the incorrect/correct sentences $x^{(j)}, y^{(j)}$ corresponding to "This has /a tremendous problem .". In this study, we presented examples for each edited token in an output. For example, when an input or output is "They have /a tremendous problem .", we presented examples for the edit " /a". To extract edit operations from an input/output pair, we aligned the tokens in the input and output sentences using Gestalt pattern matching (Ratcliff and Metzener, 1988).

There are several ways to decide which examples should be presented to a language learner. For instance, we could use all the examples in the k-nearest neighbors $\mathcal{N}$ and possibly filter them with a threshold based on L2 distance. In this paper, we present the incorrect/correct sentence pair that is nearest to the query in $\mathcal{N}$, which is the most confident example estimated by the model.
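As a concrete illustration of this edit-extraction step, Python's difflib.SequenceMatcher implements the Ratcliff/Obershelp (Gestalt) matching mentioned above; the sketch below, with the illustrative helper name `extract_edits`, aligns input/output tokens and collects the edited spans. It is a sketch of the alignment idea, not the authors' exact implementation.

```python
import difflib

def extract_edits(src_tokens, trg_tokens):
    """Align source/target tokens with Ratcliff/Obershelp matching and return edited spans."""
    matcher = difflib.SequenceMatcher(a=src_tokens, b=trg_tokens)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            edits.append((" ".join(src_tokens[i1:i2]), " ".join(trg_tokens[j1:j2])))
    return edits

# extract_edits("They have tremendous problem .".split(),
#               "They have a tremendous problem .".split())
# -> [('', 'a')], i.e. the edit " /a" discussed above
```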
3 Experiments

This section investigates the effectiveness of the examples via manual evaluation and via accuracy on GEC benchmarks, to show that EB-GEC does, in fact, improve interpretability without sacrificing accuracy. We first describe the experimental setup and then report the results of the experiments.

3.1 Datasets and Evaluation Metrics

We used the official datasets of the BEA-2019 Shared Task (Bryant et al., 2019), W&I-train (Granger, 1998; Yannakoudakis et al., 2018), NUCLE (Dahlmeier et al., 2013), FCE-train (Yannakoudakis et al., 2011), and Lang-8 (Mizumoto et al., 2011) as training data, and W&I-dev as development data. Following Chollampatt and Ng (2018), we excluded sentence pairs in which the source and target sentences are identical from the training data. The final number of sentence pairs in the training data was 0.6M. We used this training data to create the EB-GEC datastore. Note that the same amount of data is used by EB-GEC and the vanilla GEC model.

We used W&I-test, CoNLL2014 (Ng et al., 2014), FCE-test, and JFLEG-test (Napoles et al., 2017) as test data. To measure the accuracy of the GEC models, we used the evaluation metrics ERRANT (Felice et al., 2016; Bryant et al., 2017) for the W&I-test and FCE-test, M2 (Dahlmeier and Ng, 2012) for CoNLL2014, and GLEU (Napoles et al., 2015) for the JFLEG-test. M2 and ERRANT report F0.5 values.

3.2 Implementation Details of EB-GEC

We used Transformer-big (Vaswani et al., 2017) as the GEC model. Note that EB-GEC does not assume a specific autoregressive encoder-decoder model. Beam search was performed with a beam width of 5. We tokenized the data into subwords with a vocabulary size of 8,000 using BPE (Sennrich et al., 2016). The hyperparameters reported in Vaswani et al. (2017) were used, aside from the max epoch, which was set to 20. In our experiments, we report the average results of five GEC models trained with different random seeds. We used four Tesla V100 GPUs for training.

We weighted the kNN and vanilla distributions equally, with λ in Eq. (1) set to 0.5, to achieve both accuracy and interpretability. Based on the development data results, the number of nearest neighbors k was set to 16 and the softmax temperature T to 1,000. We used the final layer of the decoder feedforward network as the datastore key. We used Faiss (Johnson et al., 2021) with the same settings as Khandelwal et al. (2021) for fast nearest neighbor search in high-dimensional space.
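As a rough sketch of how the datastore keys could be indexed with Faiss for nearest neighbor search, a flat (exact) L2 index is shown below; the exact index configuration used by Khandelwal et al. (2021) is not reproduced here, and the function names are illustrative assumptions.

```python
import numpy as np
import faiss  # faiss-cpu or faiss-gpu

def build_index(keys):
    """Exact L2 index over the datastore keys (decoder hidden states)."""
    keys = np.ascontiguousarray(keys, dtype="float32")
    index = faiss.IndexFlatL2(keys.shape[1])  # dimension = decoder hidden size
    index.add(keys)
    return index

def search(index, query, k=16):
    """Squared L2 distances and datastore ids of the k nearest keys for one query state."""
    query = np.ascontiguousarray(query, dtype="float32").reshape(1, -1)
    distances, ids = index.search(query, k)
    return distances[0], ids[0]
```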
3.3 Human Evaluation Settings

We assessed interpretability by human evaluation, following Doshi-Velez and Kim (2017). The human evaluation was performed to determine whether the examples improved user understanding and helped users to accept or refuse the GEC corrections. To investigate the utility of the examples presented by EB-GEC, we examined the relative effectiveness of presenting examples in GEC compared to providing none. Moreover, we used two baseline methods for example selection: token-based retrieval and BERT-based retrieval. Note that, unlike EB-GEC, token-based and BERT-based retrieval do not directly use the representations in the GEC model; in other words, these baselines choose examples independently of the GEC model. In contrast, EB-GEC uses examples directly for generating an output. EB-GEC was therefore expected to provide examples more related to the GEC input/output sentences than the baseline methods.

Token-based Retrieval. This baseline retrieves examples from the training data in which the corrections of the EB-GEC output match the corrections in the target sentence of the training data. It is similar to example search based on surface matching (Matsubara et al., 2008; Yen et al., 2015). If multiple sentences with matching tokens are found, an example is selected at random. If no tokens match, this method cannot present any examples.

BERT-based Retrieval. This baseline uses BERT [3] (Devlin et al., 2019) to retrieve examples, considering the context of both the corrected sentence and the example from the datastore. This method corresponds to context-aware example retrieval (Arai et al., 2020). In order to retrieve examples using BERT, we create a datastore

$$(\mathcal{K}_{\mathrm{BERT}}, \mathcal{V}_{\mathrm{BERT}}) = \{(e(y_i), (y_i, x, y)) \mid \forall y_i \in y,\ (x, y) \in (\mathcal{X}, \mathcal{Y})\}. \quad (5)$$

Here $e(y_i)$ is the hidden state of the last layer of BERT for the token $y_i$ when the sentence $y$ is given without masking. This method uses $e(y_i)$ computed for the model output sentence as a query to search the datastore for the k nearest neighbors.

[3] https://huggingface.co/bert-base-cased
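A hedged sketch of how e(y_i) in Equation (5) could be obtained with the Hugging Face bert-base-cased checkpoint referenced in the footnote is given below; the helper name `token_embeddings` and the handling of special tokens are our own assumptions, not the authors' implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
model.eval()

def token_embeddings(sentence):
    """Last-layer hidden state e(y_i) for each token of an unmasked sentence (cf. Eq. (5))."""
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    hidden = output.last_hidden_state[0, 1:-1]  # drop [CLS] and [SEP]
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])[1:-1]
    return tokens, hidden  # one vector per subword token, usable as datastore keys
```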
The input and output sentences of the GEC model and the examples from the baselines and EB-GEC were presented to the annotators with anonymized system names. Annotators then decided whether the examples helped them interpret the GEC output or aided their understanding of grammar and vocabulary. An example sentence pair was labeled 1 if it was "useful for decision-making or understanding the correction" and 0 otherwise. We then computed scores for the Token-based retrieval, BERT-based retrieval, and EB-GEC models by counting the number of sentences labeled 1. We verify whether corrections with examples are more beneficial for learners than those without, and whether EB-GEC can present more valuable examples than the baselines. Since it is not always the case that only the corrected parts are helpful for learners (Matsubara et al., 2008; Yen et al., 2015), the uncorrected parts were also considered during annotation.

We manually evaluated 990 examples provided by the three methods for 330 ungrammatical and grammatical sentence pairs randomly sampled from the W&I-test, CoNLL2014, FCE-test, and JFLEG-test. The human evaluation was performed by two annotators with CEFR [4] proficiency level B and one annotator with level C [5]. All three annotators evaluated different examples.

[4] https://www.cambridgeenglish.org/exams-and-tests/cefr
[5] They are not authors of this paper. In this human evaluation, annotators with middle and high proficiency levels were selected because annotators who cannot understand the errors/corrections cannot judge whether the presented example is necessary or unnecessary. Therefore, this study does not examine whether annotators with lower proficiency levels find it helpful to see examples without explanation.

3.4 Results

Human Evaluation of Examples. Table 1 shows the results of the human evaluation of the Token-based retrieval, BERT-based retrieval, and EB-GEC models. The percentage of useful examples increased significantly for EB-GEC compared to the token-based and BERT-based retrieval baselines. The percentage of useful examples from EB-GEC is greater than 50, which indicates that presenting examples is more useful than providing none. This result is non-trivial because the percentage for token-based retrieval is only 28.8, which indicates that the examples it presented were mostly useless. Therefore, the examples presented for interpretability by EB-GEC support language learners' understanding and acceptance of the model output.

Method | Human evaluation score
Token-based retrieval | 28.8
BERT-based retrieval | 52.4
EB-GEC | 68.8†,‡

Table 1: Results of the human evaluation of the usefulness of Token-based retrieval, BERT-based retrieval, and EB-GEC examples. The human evaluation score is the percentage of useful examples among those presented to the language learners. † and ‡ indicate statistically significant differences for EB-GEC according to McNemar's test (p < 0.05) against Token-based retrieval and BERT-based retrieval, respectively.
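Table 1 reports significance with McNemar's test on paired usefulness labels. A small sketch, under the assumption that the judgements of two systems on the same items are available as aligned 0/1 arrays (the variable and function names are illustrative), could look like this:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_p(labels_a, labels_b):
    """p-value of McNemar's test for paired 0/1 usefulness judgements of two systems."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    table = [[int(np.sum((a == 1) & (b == 1))), int(np.sum((a == 1) & (b == 0)))],
             [int(np.sum((a == 0) & (b == 1))), int(np.sum((a == 0) & (b == 0)))]]
    return mcnemar(table, exact=False, correction=True).pvalue
```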
GEC Accuracy. We examined the impact of using examples on GEC prediction accuracy. Table 2 shows the scores of the vanilla GEC and EB-GEC on the W&I, CoNLL2014, FCE, and JFLEG test data. The accuracy of EB-GEC is slightly lower on JFLEG but outperforms the vanilla GEC on W&I, CoNLL2014, and FCE. This indicates that the use of examples contributes to improving GEC model accuracy.

Method | W&I | CoNLL2014 | FCE | JFLEG
Vanilla GEC | 50.12 | 49.68 | 41.49 | 53.71
EB-GEC | 52.45 | 50.51 | 43.00 | 53.46

Table 2: Accuracy of the vanilla GEC model and the EB-GEC model on the W&I, CoNLL2014, FCE, and JFLEG test data.

4 Analysis

4.1 Effect of λ

We analyzed the relationship between the interpolation coefficient λ (in Equation (1)) and GEC accuracy. A smaller λ value may reduce interpretability, as examples are considered less in prediction. In contrast, a larger λ value may reduce robustness, especially when relevant examples are not included in the datastore; the model must then generate corrections relying more on kNN examples, which may not be present in the datastore for some inputs.

Figure 3 shows the accuracy of the GEC on each development dataset when λ is changed from 0 to 1 in increments of 0.25. We found that when λ was set to 1, the accuracy for all development datasets was lower than when λ was set to 0.50 or less. The highest accuracy was obtained for λ = 0.5, which treats the vanilla distribution and the kNN distribution equally.

Figure 3: Scores on each development dataset using different λ values from 0 to 1 in increments of 0.25. The evaluation metric for each dataset is the same as for the corresponding test data.
4.2 Matching Error Types of Model Outputs and Examples

In Section 1, we hypothesized that similar error-correcting examples are closely clustered in the representation space. Therefore, we investigated the agreement between the GEC output and the examples in terms of edits and error types. We extracted edits and their error types, which were automatically assigned by ERRANT (Felice et al., 2016; Bryant et al., 2017), for incorrect/correct sentence pairs. For example, for the GEC input/output pair "They have /a tremendous problem .", the example pair is "This has /a tremendous problem .", its edit is " /a", and its error type is determiner error (DET). We calculated the matching percentage of the edits and error types between EB-GEC outputs and the examples retrieved by EB-GEC to measure their similarity. In addition, we used token-based and BERT-based retrieval as comparison methods for obtaining examples relevant to the EB-GEC outputs.
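A minimal sketch of the matching percentage described above, assuming each output and its retrieved example have already been reduced to (edit, error type) pairs, e.g. with ERRANT; the helper name is illustrative.

```python
def matching_percentage(output_pairs, example_pairs):
    """Share of (edit, error_type) pairs in the model output that also occur in the example."""
    if not output_pairs:
        return 0.0
    example_set = set(example_pairs)
    matched = sum(1 for pair in output_pairs if pair in example_set)
    return 100.0 * matched / len(output_pairs)

# matching_percentage([(" /a", "DET")], [(" /a", "DET")])  -> 100.0
```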
Figure 4 shows the matching percentage of edits and error types between the GEC outputs and the k-nearest neighbor examples. First, EB-GEC has the highest percentage for all test data. This indicates that, of the methods tested, EB-GEC retrieves the most relevant examples. This trend is consistent with the human evaluation results. Furthermore, EB-GEC has a lower percentage on JFLEG than on W&I, CoNLL2014, and FCE. This corroborates the results of Table 2 and suggests that GEC accuracy improves further when examples more relevant to the corrections can be retrieved.

Figure 4: Matching percentage of edits and error types in model outputs and examples (Token-based retrieval, BERT-based retrieval, and EB-GEC on W&I, CoNLL2014, FCE, and JFLEG).

4.3 EB-GEC and Error Types

We analyzed the accuracy of EB-GEC for different error types to investigate the effect of error type on EB-GEC performance. We used ERRANT to evaluate the accuracy of EB-GEC for each error type on the FCE-test.

Error type | Freq. | Vanilla GEC | EB-GEC | Diff.
PREP | 115K | 40.9 | 44.6 | 3.7
PUNCT | 98K | 33.5 | 37.0 | 3.5
DET | 171K | 46.6 | 49.8 | 3.2
ADJ:FORM | 2K | 54.5 | 38.4 | -16.08
ADJ | 21K | 17.0 | 14.5 | -2.42
SPELL | 72K | 68.6 | 66.8 | -1.87

Table 3: The error types with the highest and the lowest EB-GEC accuracy compared to the vanilla GEC on the FCE-test, based on the Diff. column. The Freq. column is the frequency of the error type in the datastore.

Table 3 shows the three error types with the most significant increase and the three with the most significant decrease in accuracy for EB-GEC compared to the vanilla GEC. The three error types with the largest increases were preposition error (PREP; e.g. "I think we should book at/ the Palace Hotel ."), punctuation error (PUNCT; e.g. "Yours ./ sincerely ,"), and article error (DET; e.g. "That should complete that/an amazing day ."). The three error types with the largest decreases were adjective conjugation error (ADJ:FORM; e.g. "I was very please/pleased to receive your letter ."), adjective error (ADJ; e.g. "The adjoining restaurant is very enjoyable/good as well ."), and spelling error (SPELL; e.g. "Pusan Castle is locted/located in the South of Pusan .").

We drew the following findings from these results. The error types with the largest increases in accuracy involve a limited number of token types in their edits compared to those with the largest decreases (namely, error types referring to adjectives and nouns). Furthermore, these error types are the most frequent errors in the datastore (excluding the unclassified error type annotated as OTHER), so the datastore sufficiently covers such edits.
Contrary to the error types with improved accuracy, ADJ and SPELL involve a considerable number of token types in their edits, which are not easy to cover sufficiently in a datastore. Moreover, ADJ:FORM is the second least frequent error type in the datastore, and we believe such examples also cannot be covered sufficiently. These results show that EB-GEC improves the accuracy of error types that are easily covered by examples, because fewer word types appear in their edits and they are well represented in the datastore. Conversely, accuracy deteriorates for error types that are difficult to cover, such as those with many word types in their edits and those that are infrequent in the datastore.

We investigated the characteristics of the EB-GEC examples by comparing specific examples for each error type with those from token-based and BERT-based retrieval. Table 4 shows examples from Token-based retrieval, BERT-based retrieval, and EB-GEC for the top three error types with accuracy improvements in EB-GEC (PREP, PUNCT, and DET).

Method | Error type | Error-correction pair | Label
Input/Output | PREP | You will be able to buy them in/at /a reasonable price . | -
Token-based retrieval | PREP | Naturally , it 's easier to get a job then/when you were/are good in/at foreign languagers/languages or computers . | 0
BERT-based retrieval | PREP | I could purchase them in/at reasonable price/prices . | 1
EB-GEC | PREP | I could purchase them in/at reasonable price/prices . | 1
Input/Output | PUNCT | for/For example /, a reasercher that wants to be successfull must take risk . | -
Token-based retrieval | PUNCT | Today /, we first/ met for /the first time in about four weeks . | 0
BERT-based retrieval | PUNCT | for/For example /, a kid named Michael . | 1
EB-GEC | PUNCT | for/For example /, a kid named Michael . | 1
Input/Output | DET | Apart from that /, it takes /a long time to go somewhere . | -
Token-based retrieval | DET | If you have enough time , I recommend /a bus trip . | 0
BERT-based retrieval | PREP | However , it will take for/ a long time to go abroad in my company . | 0
EB-GEC | DET | So/Because of that , it takes /a long time to write my journal/entries . | 1

Table 4: Examples retrieved by Token-based retrieval, BERT-based retrieval, and EB-GEC for each input/output, and the human evaluation labels. Underlines indicate error-correction pairs in the sentences. Bold indicates the edit used as the query to retrieve the example, and the error types of the bold edits are assigned by ERRANT.

For Token-based retrieval, the tokens in the edits are consistent, including "in/at", " /,", and " /a".
However, only surface information is used and context is not considered, so such unrelated examples are not useful for language learners. BERT-based retrieval presented the same examples as EB-GEC for the PREP and PUNCT error types, and the human evaluation label was also 1. However, its last example is influenced by the context rather than the correction and is therefore irrelevant, labeled 0 in the human evaluation. This indicates that BERT-based retrieval focuses too heavily on context, resulting in examples related to the overall output but unrelated to the edits. Conversely, EB-GEC is able to present examples in which the edit tokens are consistent for all corrections. Furthermore, the contexts are similar to those of the input/output, for example "purchase them in/at reasonable price/prices", "for/For example /,", and "it takes /a long time to", and all the examples were labeled 1 in the human evaluation. This demonstrates that EB-GEC retrieves the most related examples, which are helpful for users.

5 Related Work

5.1 Example Retrieval for Language Learners

There are example search systems that support language learners by finding examples. Before neural-based models, examples were retrieved and presented by surface matching (Matsubara et al., 2008; Yen et al., 2015). Arai et al. (2019, 2020) proposed combining Grammatical Error Detection (GED) and example retrieval to present both grammatically incorrect and correct examples for essays written by learners of Japanese, and showed that essay quality was improved by providing examples. Their method is similar to EB-GEC in that it presents both correct and incorrect examples, but it builds the example search system on GED rather than GEC. Furthermore, these example search systems retrieve examples independently of the model. In contrast, EB-GEC presents more related examples, as shown in Section 3.4.

Cheng and Nagase (2012) developed an example-based Japanese proofreading system that retrieves examples using dependency structures and proofread texts.
Proofreading is a task similar to GEC because it also involves correcting grammatical errors. However, this method likewise does not focus on using examples to improve interpretability.

5.2 Explanation for Language Learners

There is a feedback comment generation task (Nagata, 2019) that generates useful hints and explanations for grammatical errors and unnatural expressions in writing education. Nagata et al. (2020) used a grammatical error detection model (Kaneko et al., 2017; Kaneko and Komachi, 2019) and a neural retrieval-based method for preposition errors. The motivation of that study was similar to ours, that is, to help language learners understand grammatical errors and unnatural expressions in an interpretable way. On the other hand, EB-GEC supports language learners using examples from the GEC model itself rather than generated feedback.

5.3 Example Retrieval in Text Generation

Various previous studies have retrieved words, phrases, and sentences for use in prediction. Nagao (1984) proposed example-based MT, which translates sequences by analogy. This approach has been extended in a variety of directions for MT (Sumita and Iida, 1991; Doi et al., 2005; Van Den Bosch, 2007; Stroppa et al., 2007; Van Gompel et al., 2009; Haque et al., 2009). In addition, example-based methods have been used for summarization (Makino and Yamamoto, 2008) and paraphrasing (Ohtake and Yamamoto, 2003). These studies were performed before neural networks were in general use, and the examples were not used to open up the neural network black box as was done in this study.
For neural network models, methods using examples have been proposed to improve accuracy and interpretability during inference. Gu et al. (2018) proposed a model that, during inference, retrieves parallel sentences similar to the input sentence and generates translations guided by the retrieved parallel sentences. Zhang et al. (2018) proposed a method that, during inference, retrieves parallel sentences whose source sides are similar to the input sentence and up-weights output n-grams contained in the retrieved sentence pairs based on the similarity between the input sentence and the retrieved source sentence. These methods differ from EB-GEC with kNN-MT in that they retrieve examples via surface matching, as in the token-based retrieval baseline. Moreover, these studies do not focus on the interpretability of the model.

Several methods have been proposed to retrieve examples using neural model representations and consider them for prediction. Khandelwal et al. (2020, 2021) proposed retrieving similar examples as the nearest neighbors of pre-trained hidden states during inference and complementing the output distributions of the language model and the machine translation model with the distributions of these examples. Lewis et al. (2020) combined a pre-trained retriever with a pre-trained encoder-decoder model and fine-tuned them end-to-end; for an input query, they retrieve the top-k documents and use them as a latent variable for the final prediction. Guu et al. (2020) first conducted unsupervised joint pre-training of a knowledge retriever and a knowledge-augmented encoder for the language modeling task, then fine-tuned them on a task of primary interest with supervised examples. The main purpose of these methods was to improve accuracy using examples, and whether the examples were helpful for users was not verified. Conversely, our study showed that examples for interpretability in GEC can be helpful for real users.

6 Conclusion

We introduced EB-GEC to improve the interpretability of corrections by presenting examples to language learners. The human evaluation showed that the examples presented by EB-GEC supported language learners' decisions to accept corrections and improved their understanding of the correction results. Although existing interpretability methods using examples had not verified whether examples are helpful for humans, this study demonstrated that examples are helpful for learners using GEC. In addition, the results on the GEC benchmarks showed that EB-GEC predicts corrections more accurately than, or comparably to, its vanilla counterpart.

Future work includes investigating whether example presentation is beneficial for learners with low language proficiency. In addition, we plan to improve datastore coverage by using pseudo-data (Xie et al., 2018) and to weight low-frequency error types in order to present diverse examples. We will also explore whether methods for improving accuracy and diversity (Chollampatt and Ng, 2018; Kaneko et al., 2019; Hotate et al., 2019, 2020) are effective for EB-GEC.
Acknowledgements

This paper is based on results obtained from a project, JPNP18002, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). We thank Yukiko Konishi, Yuko Unzai, and Naoko Furuya for their help with our experiments.

References

Mio Arai, Masahiro Kaneko, and Mamoru Komachi. 2019. Grammatical-error-aware incorrect example retrieval system for learners of Japanese as a second language. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 296–305.

Mio Arai, Masahiro Kaneko, and Mamoru Komachi. 2020. Example retrieval system using grammatical error detection for Japanese as a second language learners. Transactions of the Japanese Society for Artificial Intelligence, 35(5):A–K23.

Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction. In BEA, pages 52–75.

Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In ACL, pages 793–805.

Yuchang Cheng and Tomoki Nagase. 2012. An example-based Japanese proofreading system for offshore development. In Proceedings of COLING 2012: Demonstration Papers, pages 67–76.

Shamil Chollampatt and Hwee Tou Ng. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In AAAI, pages 5755–5762.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In NAACL, pages 568–572.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS Corpus of Learner English. In BEA, pages 22–31.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186.

Takao Doi, Hirofumi Yamamoto, and Eiichiro Sumita. 2005. Example-based machine translation using efficient sentence retrieval based on edit-distance. ACM Transactions on Asian Language Information Processing, 4(4):377–399.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments. In COLING, pages 825–835.

Sylviane Granger. 1998. Developing an automated writing placement system for ESL learners. In LEC, pages 3–18.

Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. 2018. Search engine guided neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.

Rejwanul Haque, Sudip Kumar Naskar, Yanjun Ma, and Andy Way. 2009. Using supertags as source language context in SMT. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation.

Kengo Hotate, Masahiro Kaneko, Satoru Katsumata, and Mamoru Komachi. 2019. Controlling grammatical error correction using word edit rate. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 149–154.

Kengo Hotate, Masahiro Kaneko, and Mamoru Komachi. 2020. Generating diverse corrections with local beam search for grammatical error correction. In COLING, pages 2132–2137.

Tim Johns. 1994. From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. Perspectives on Pedagogical Grammar, 293.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. 2018. Approaching neural grammatical error correction as a low-resource machine translation task. In NAACL, pages 595–606.

Masahiro Kaneko, Kengo Hotate, Satoru Katsumata, and Mamoru Komachi. 2019. TMU transformer system using BERT for re-ranking at BEA 2019 grammatical error correction on restricted track. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 207–212.

Masahiro Kaneko and Mamoru Komachi. 2019. Multi-head multi-layer attention to deep language representations for grammatical error detection. Computación y Sistemas, 23(3):883–891.

Masahiro Kaneko, Masato Mita, Shun Kiyono, Jun Suzuki, and Kentaro Inui. 2020. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In ACL, pages 4248–4254.

Masahiro Kaneko, Yuya Sakaizawa, and Mamoru Komachi. 2017. Grammatical error detection using error- and grammaticality-specific word embeddings. In IJCNLP, pages 40–48.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In ICLR.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In ICLR.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, volume 33, pages 9459–9474.

Megumi Makino and Kazuhide Yamamoto. 2008. Summarization by analogy: An example-based approach for news articles. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.

Shigeki Matsubara, Yoshihide Kato, and Seiji Egawa. 2008. Escort: Example sentence retrieval system as support tool for English writing. Journal of Information Processing and Management, 51(4):251–259.

Atsushi Mizumoto and Kiyomi Chujo. 2015. A meta-analysis of data-driven learning approach in the Japanese EFL classroom. English Corpus Studies, 22:1–18.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In IJCNLP, pages 147–155.

Makoto Nagao. 1984. A framework of a mechanical translation between Japanese and English by analogy principle. Artificial and Human Intelligence, pages 351–354.

Ryo Nagata. 2019. Toward a task of feedback comment generation for writing learning. In EMNLP-IJCNLP, pages 3206–3215.

Ryo Nagata, Kentaro Inui, and Shin'ichiro Ishikawa. 2020. Creating corpora for research in feedback comment generation. In LREC, pages 340–345.

Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In NAACL, pages 588–593.

Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In EACL, pages 229–234.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In CoNLL, pages 1–14.

Kiyonori Ohtake and Kazuhide Yamamoto. 2003. Applicability analysis of corpus-derived paraphrases toward example-based paraphrasing. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation, pages 380–391.

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163–170.

Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Ryuto Konno, and Kentaro Inui. 2020. Instance-based learning of span representations: A case study through named entity recognition. In ACL, pages 6452–6459.

John W. Ratcliff and David E. Metzener. 1988. Pattern matching: The gestalt approach. Dr. Dobb's Journal, 13(7):46.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL, pages 1715–1725.

Nicolas Stroppa, Antal van den Bosch, and Andy Way. 2007. Exploiting source similarity for SMT using context-informed features. In Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages.

Eiichiro Sumita and Hitoshi Iida. 1991. Experiments and prospects of example-based machine translation. In ACL, pages 185–192.

Antal Van Den Bosch. 2007. A memory-based classification approach to marker-based EBMT. In Proceedings of the METIS-II Workshop on New Approaches to Machine Translation.

Maarten Van Gompel, Antal Van Den Bosch, and Peter Berck. 2009. Extending memory-based machine translation to phrases. In Proceedings of the Third Workshop on Example-Based Machine Translation, pages 79–86.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, pages 5998–6008.

Mary E. Webb, Andrew Fluck, Johannes Magenheim, Joyce Malyn-Smith, Juliet Waters, Michelle Deschênes, and Jason Zagami. 2020. Machine learning for human learners: Opportunities, issues, tensions and threats. Educational Technology Research and Development, pages 1–22.

Sam Wiseman and Karl Stratos. 2019. Label-agnostic sequence labeling by copying nearest neighbors. In ACL, pages 5363–5369.

Ziang Xie, Guillaume Genthial, Stanley Xie, Andrew Ng, and Dan Jurafsky. 2018. Noising and denoising natural language: Diverse backtranslation for grammar correction. In NAACL, pages 619–628.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In NAACL, pages 180–189.

Helen Yannakoudakis, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe, and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education, pages 251–267.

Tzu-Hsi Yen, Jian-Cheng Wu, Jim Chang, Joanne Boisson, and Jason Chang. 2015. WriteAhead: Mining grammar patterns in corpora for assisted writing. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 139–144.

Zheng Yuan and Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In NAACL, pages 380–386.

Jingyi Zhang, Masao Utiyama, Eiichro Sumita, Graham Neubig, and Satoshi Nakamura. 2018. Guiding neural machine translation with retrieved translation pieces. In NAACL, pages 1325–1335.

Wei Zhao, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. 2019. Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. In NAACL, pages 156–165.