Interpretability for Language Learners Using Example-Based Grammatical Error Correction
Interpretability for Language Learners Using Example-Based Grammatical Error Correction

Masahiro Kaneko, Sho Takase, Ayana Niwa, Naoaki Okazaki
Tokyo Institute of Technology
{masahiro.kaneko, sho.takase, ayana.niwa}@nlp.c.titech.ac.jp, okazaki@c.titech.ac.jp

arXiv:2203.07085v1 [cs.CL] 14 Mar 2022

Abstract

Grammatical Error Correction (GEC) should focus not only on correction accuracy but also on the interpretability of the results for language learners. However, existing neural-based GEC models mostly focus on improving accuracy, while their interpretability has not been explored. Example-based methods are promising for improving interpretability: they use similar retrieved examples to generate corrections. Furthermore, examples are beneficial in language learning, helping learners to understand the basis for grammatically incorrect/correct texts and improve their confidence in writing. Therefore, we hypothesized that incorporating an example-based method into GEC could improve interpretability and support language learners. In this study, we introduce an Example-Based GEC (EB-GEC) that presents examples to language learners as a basis for the correction results. The examples consist of pairs of correct and incorrect sentences similar to a given input and its predicted correction. Experiments demonstrate that the examples presented by EB-GEC help language learners decide whether to accept or refuse suggestions from the GEC output. Furthermore, the experiments show that retrieved examples also improve the accuracy of corrections.

Figure 1: EB-GEC presents not only a correction but also an example of why the GEC model suggested this correction.

1 Introduction

Grammatical Error Correction (GEC) models, which generate grammatically correct texts from grammatically incorrect texts, are useful for language learners. In GEC, various neural-based models have been proposed to improve the correction accuracy (Yuan and Briscoe, 2016; Chollampatt and Ng, 2018; Junczys-Dowmunt et al., 2018; Zhao et al., 2019; Kaneko et al., 2020; Omelianchuk et al., 2020). However, the basis on which a neural GEC model makes corrections is generally uninterpretable to learners. Neural GEC models rarely address correction interpretability, leaving language learners with no explanation of the reason for a correction.

Interpretability plays a key role in educational scenarios (Webb et al., 2020). In particular, presenting examples has been shown to be effective in improving understanding. Language learners acquire grammatical rules and vocabulary from examples (Johns, 1994; Mizumoto and Chujo, 2015). Presenting examples of incorrect sentences together with correct ones improves the understanding of grammatical correctness as well as essay quality (Arai et al., 2019, 2020).

Recently, example-based methods have been applied to a wide range of natural language processing tasks to improve the interpretability of neural models, including machine translation (Khandelwal et al., 2021), part-of-speech tagging (Wiseman and Stratos, 2019), and named entity recognition (Ouchi et al., 2020). These methods predict labels or tokens by considering the nearest neighbor examples retrieved using the representations of the model at inference time. Khandelwal et al. (2021) showed that in machine translation, examples close to a target sentence in the representation space of the decoder are useful for translating the source sentence. Inspired by this, we hypothesized that examples corrected for similar reasons are distributed closely in the representation space.
Thus, we assume that neighbor examples can enhance the interpretability of the GEC model, allowing language learners to understand the reason for a correction and assess its validity.

In this paper, we introduce an example-based GEC (EB-GEC) [1] that corrects grammatical errors in an input text and provides examples for language learners, explaining the reason for the correction (Figure 1). As shown in Figure 2, the core idea of EB-GEC is to unify the token prediction model for correction and the retrieval model for related examples from the supervision data into a single encoder-decoder model. EB-GEC can present the reason for the correction, which we hope will help learners decide whether to accept or to refuse a given correction.

[1] Our code is publicly available at https://github.com/kanekomasahiro/eb-gec

Experimental results show that EB-GEC predicts corrections more accurately than the vanilla GEC without examples on three datasets and comparably on one dataset. Experiments with human participants demonstrate that EB-GEC presents significantly more useful examples than the baseline methods of example retrieval (Matsubara et al., 2008; Yen et al., 2015; Arai et al., 2020). These results indicate that examples are useful not only to the GEC models but also to language learners. This is the first study to demonstrate the benefits of the examples themselves for real users, as existing studies (Wiseman and Stratos, 2019; Ouchi et al., 2020; Khandelwal et al., 2021) only showed the utility of examples for improving task accuracy.

Figure 2: An illustration of how EB-GEC chooses examples and predicts a correction. The model predicts the correction "They have /a tremendous problem ." by using the example "This has /a tremendous problem .". Hidden states of the decoder computed during the training phase are stored as keys, and the tokens of the output sentences corresponding to the hidden states are stored as values. A hidden state of the decoder (blue box) at the time of inference is used as a query to search for the k-neighbors (yellow box) among the hidden states of the training data. EB-GEC predicts a distribution of tokens for the correction from a combination of two token distributions: a vanilla distribution computed by transforming the hidden state of the decoder, and a kNN distribution computed from the retrieved k-neighbors.

2 EB-GEC

EB-GEC presents language learners with a correction and the related examples it used for generating the correction of the input sentence. k-Nearest-Neighbor Machine Translation (kNN-MT; Khandelwal et al., 2021) was used as the base method to consider examples in predicting corrections. kNN-MT predicts tokens by considering the nearest neighbor examples based on representations from the decoder at the time of inference. EB-GEC could use any method (Gu et al., 2018; Zhang et al., 2018; Lewis et al., 2020) to consider examples, but kNN-MT was used in this study because it does not require additional training for example retrieval.

Figure 2 shows how EB-GEC retrieves examples using kNN-MT. EB-GEC performs inference using the softmax distribution over target tokens obtained from the encoder-decoder model (referred to as the vanilla distribution hereafter) and the distribution generated by the nearest neighbor examples. The nearest neighbor search is performed over a cache of examples indexed by the decoder hidden states on the supervision data (the kNN distribution). EB-GEC can be adapted to any trained autoregressive encoder-decoder GEC model. A detailed explanation of retrieving examples using kNN-MT is provided in Section 2.1, and of presenting examples in Section 2.2.

2.1 Retrieving Examples Using kNN-MT

Let $x = (x_1, \dots, x_N)$ be an input sequence and $y = (y_1, \dots, y_M)$ be an output sequence of the autoregressive encoder-decoder model. Here, $N$ and $M$ are the lengths of the input and output sequences, respectively.
Vanilla Distribution. In a vanilla autoregressive encoder-decoder model, the distribution for the i-th token $y_i$ of the output sequence is conditioned on the entire input sequence $x$ and the previous output tokens $\hat{y}_{1:i-1}$, where $\hat{y}$ represents a sequence of generated tokens. The probability distribution of the i-th token $p(y_i \mid x, \hat{y}_{1:i-1})$ is calculated by a linear transformation of the decoder's hidden state $h(x, \hat{y}_{1:i-1})$ followed by the softmax function.

Output Distribution. Let $p_{\mathrm{EB}}(y_i \mid x, \hat{y}_{1:i-1})$ denote the final probability distribution of tokens from EB-GEC. We define $p_{\mathrm{EB}}(y_i \mid x, \hat{y}_{1:i-1})$ as a linear interpolation of the vanilla distribution $p(y_i \mid x, \hat{y}_{1:i-1})$ and $p_{\mathrm{kNN}}(y_i \mid x, \hat{y}_{1:i-1})$ (explained later), which is the distribution computed using the examples in the datastore:

$$p_{\mathrm{EB}}(y_i \mid x, \hat{y}_{1:i-1}) = \lambda\, p_{\mathrm{kNN}}(y_i \mid x, \hat{y}_{1:i-1}) + (1 - \lambda)\, p(y_i \mid x, \hat{y}_{1:i-1}). \quad (1)$$

Here, $0 \le \lambda \le 1$ is an interpolation coefficient between the two distributions. This interpolation also improves the output robustness when relevant examples are not found in the datastore.

Datastore. In the work of Khandelwal et al. (2021), the i-th hidden state $h(x, y_{1:i-1})$ of the decoder in the trained model is stored as a key, and the corresponding next token $y_i$ is stored as a value. In order to present examples of incorrect/correct sentences, we store a tuple of the token $y_i$, the incorrect input sentence $x$, and the correct output sentence $y$ as a value of the datastore. Thus, we build key-value pairs $(\mathcal{K}, \mathcal{V})$ from all decoder timesteps for the entire training data $(\mathcal{X}, \mathcal{Y})$:

$$(\mathcal{K}, \mathcal{V}) = \{(h(x, y_{1:i-1}), (y_i, x, y)) \mid \forall y_i \in y,\ (x, y) \in (\mathcal{X}, \mathcal{Y})\}. \quad (2)$$

kNN Distribution. During inference, given a source $x$ as input, the model uses the i-th hidden state $h(x, \hat{y}_{1:i-1})$ of the decoder as the query to search for the k-nearest neighbors:

$$\mathcal{N} = \{(u^{(j)}, (v^{(j)}, x^{(j)}, y^{(j)})) \in (\mathcal{K}, \mathcal{V})\}_{j=1}^{k}, \quad (3)$$

where $u^{(j)}$ $(j = 1, \dots, k)$ are the k-nearest neighbors of the query $h(x, \hat{y}_{1:i-1})$ measured by squared L2 distance. The tuple $(v^{(j)}, x^{(j)}, y^{(j)})$ is the value associated with the key $u^{(j)}$ in the datastore $(\mathcal{K}, \mathcal{V})$. Then, kNN-MT aggregates the retrieved tokens into a probability distribution $p_{\mathrm{kNN}}(y_i \mid x, \hat{y}_{1:i-1})$ by applying a softmax with temperature $T$ to the negative L2 distances [2]:

$$p_{\mathrm{kNN}}(y_i \mid x, \hat{y}_{1:i-1}) \propto \sum_{(u, (v, \_, \_)) \in \mathcal{N}} \mathbb{I}_{v = y_i} \exp\left(\frac{-\lVert u - h(x, \hat{y}_{1:i-1}) \rVert}{T}\right). \quad (4)$$

[2] In Equation 4, we do not use the input and output sentences in the value, and thus represent them as _.
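As a minimal sketch of Equations (1)-(4), assuming the datastore of Equation (2) is already available as an array of keys and a list of value tuples, and the vanilla distribution has been computed elsewhere, the kNN distribution and the interpolated EB-GEC distribution could look like the following. The function names `knn_distribution` and `eb_gec_distribution` are illustrative and not part of the released code.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=16, temperature=1000.0):
    """Eqs. (3)-(4): retrieve the k nearest datastore entries to the decoder state
    and turn them into a token distribution with a temperature softmax."""
    dists = np.linalg.norm(keys - query, axis=1)         # L2 distance to every key
    neighbours = np.argsort(dists)[:k]                   # indices of the k nearest keys
    weights = np.exp(-dists[neighbours] / temperature)   # numerators of Eq. (4)
    p_knn = np.zeros(vocab_size)
    for idx, w in zip(neighbours, weights):
        token_id, _, _ = values[idx]                     # value tuple (y_i, x, y); only y_i is used here
        p_knn[token_id] += w
    p_knn /= p_knn.sum()
    examples = [values[i] for i in neighbours]           # (token, source, target) tuples to present
    return p_knn, examples

def eb_gec_distribution(p_vanilla, p_knn, lam=0.5):
    """Eq. (1): linear interpolation of the vanilla and kNN distributions."""
    return lam * p_knn + (1.0 - lam) * p_vanilla
```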
2.2 Presenting Examples

We used the pair of incorrect and correct sentences stored in the value retrieved for the predicted token $\hat{y}_i$ as an example for the correction. Figure 1 depicts an example where the retrieved value consists of the predicted token $v^{(j)} =$ "a" and the incorrect/correct sentences $x^{(j)}, y^{(j)}$ corresponding to "This has /a tremendous problem .". In this study, we presented examples for each edited token in an output. For example, when an input or output is "They have /a tremendous problem .", we presented examples for the edit " /a". To extract edit operations from an input/output pair, we aligned the tokens in the input and output sentences using Gestalt pattern matching (Ratcliff and Metzener, 1988).

There are several ways to decide which examples should be presented to a language learner. For instance, we could use all the examples in the k-nearest neighbors $\mathcal{N}$ and possibly filter them with a threshold based on L2 distance. In this paper, we present the incorrect/correct sentence pair that is nearest to the query in $\mathcal{N}$, which is the most confident example estimated by the model.
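As a concrete illustration of this edit-extraction step, Python's difflib.SequenceMatcher implements the Ratcliff/Obershelp (Gestalt) matching mentioned above; the sketch below, with the illustrative helper name `extract_edits`, aligns input/output tokens and collects the edited spans. It is a sketch of the alignment idea, not the authors' exact implementation.

```python
import difflib

def extract_edits(src_tokens, trg_tokens):
    """Align source/target tokens with Ratcliff/Obershelp matching and return edited spans."""
    matcher = difflib.SequenceMatcher(a=src_tokens, b=trg_tokens)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            edits.append((" ".join(src_tokens[i1:i2]), " ".join(trg_tokens[j1:j2])))
    return edits

# extract_edits("They have tremendous problem .".split(),
#               "They have a tremendous problem .".split())
# -> [('', 'a')], i.e. the edit " /a" discussed above
```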
3 Experiments

This section investigates the effectiveness of the examples via manual evaluation and via accuracy on GEC benchmarks, to show that EB-GEC does, in fact, improve interpretability without sacrificing accuracy. We first describe the experimental setup and then report the results of the experiments.

3.1 Datasets and Evaluation Metrics

We used the official datasets of the BEA-2019 Shared Task (Bryant et al., 2019), W&I-train (Granger, 1998; Yannakoudakis et al., 2018), NUCLE (Dahlmeier et al., 2013), FCE-train (Yannakoudakis et al., 2011), and Lang-8 (Mizumoto et al., 2011) as training data, and W&I-dev as development data. Following Chollampatt and Ng (2018), we excluded sentence pairs in which the source and target sentences are identical from the training data. The final number of sentence pairs in the training data was 0.6M. We used this training data to create the EB-GEC datastore. Note that the same amount of data is used by EB-GEC and the vanilla GEC model.

We used W&I-test, CoNLL2014 (Ng et al., 2014), FCE-test, and JFLEG-test (Napoles et al., 2017) as test data. To measure the accuracy of the GEC models, we used the evaluation metrics ERRANT (Felice et al., 2016; Bryant et al., 2017) for the W&I-test and FCE-test, M2 (Dahlmeier and Ng, 2012) for CoNLL2014, and GLEU (Napoles et al., 2015) for the JFLEG-test. M2 and ERRANT report F0.5 values.

3.2 Implementation Details of EB-GEC

We used Transformer-big (Vaswani et al., 2017) as the GEC model. Note that EB-GEC does not assume a specific autoregressive encoder-decoder model. Beam search was performed with a beam width of 5. We tokenized the data into subwords with a vocabulary size of 8,000 using BPE (Sennrich et al., 2016). The hyperparameters reported in Vaswani et al. (2017) were used, aside from the max epoch, which was set to 20. In our experiments, we report the average results of five GEC models trained with different random seeds. We used four Tesla V100 GPUs for training.

We weighted the kNN and vanilla distributions equally, with λ in Eq. (1) set to 0.5, to achieve both accuracy and interpretability. Based on the development data results, the number of nearest neighbors k was set to 16 and the softmax temperature T to 1,000. We used the final layer of the decoder feedforward network as the datastore key. We used Faiss (Johnson et al., 2021) with the same settings as Khandelwal et al. (2021) for fast nearest neighbor search in high-dimensional space.
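As a rough sketch of how the datastore keys could be indexed with Faiss for nearest neighbor search, a flat (exact) L2 index is shown below; the exact index configuration used by Khandelwal et al. (2021) is not reproduced here, and the function names are illustrative assumptions.

```python
import numpy as np
import faiss  # faiss-cpu or faiss-gpu

def build_index(keys):
    """Exact L2 index over the datastore keys (decoder hidden states)."""
    keys = np.ascontiguousarray(keys, dtype="float32")
    index = faiss.IndexFlatL2(keys.shape[1])  # dimension = decoder hidden size
    index.add(keys)
    return index

def search(index, query, k=16):
    """Squared L2 distances and datastore ids of the k nearest keys for one query state."""
    query = np.ascontiguousarray(query, dtype="float32").reshape(1, -1)
    distances, ids = index.search(query, k)
    return distances[0], ids[0]
```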
3.3 Human Evaluation Settings

We assessed interpretability by human evaluation, following Doshi-Velez and Kim (2017). The human evaluation was performed to determine whether the examples improved user understanding and helped users to accept or refuse the GEC corrections. To investigate the utility of the examples presented by EB-GEC, we examined the relative effectiveness of presenting examples in GEC compared to providing none. Moreover, we used two baseline methods for example selection: token-based retrieval and BERT-based retrieval. Note that, unlike EB-GEC, token-based and BERT-based retrieval do not directly use the representations in the GEC model; in other words, these baselines choose examples independently of the GEC model. In contrast, EB-GEC uses examples directly for generating an output. EB-GEC was therefore expected to provide examples more related to the GEC input/output sentences than the baseline methods.

Token-based Retrieval. This baseline retrieves examples from the training data in which the corrections of the EB-GEC output match the corrections in the target sentence of the training data. It is similar to example search based on surface matching (Matsubara et al., 2008; Yen et al., 2015). If multiple sentences with matching tokens are found, an example is selected at random. If no tokens match, this method cannot present any examples.

BERT-based Retrieval. This baseline uses BERT [3] (Devlin et al., 2019) to retrieve examples, considering the context of both the corrected sentence and the example from the datastore. This method corresponds to context-aware example retrieval (Arai et al., 2020). In order to retrieve examples using BERT, we create a datastore

$$(\mathcal{K}_{\mathrm{BERT}}, \mathcal{V}_{\mathrm{BERT}}) = \{(e(y_i), (y_i, x, y)) \mid \forall y_i \in y,\ (x, y) \in (\mathcal{X}, \mathcal{Y})\}. \quad (5)$$

Here $e(y_i)$ is the hidden state of the last layer of BERT for the token $y_i$ when the sentence $y$ is given without masking. This method uses $e(y_i)$ computed for the model output sentence as a query to search the datastore for the k nearest neighbors.

[3] https://huggingface.co/bert-base-cased
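A hedged sketch of how e(y_i) in Equation (5) could be obtained with the Hugging Face bert-base-cased checkpoint referenced in the footnote is given below; the helper name `token_embeddings` and the handling of special tokens are our own assumptions, not the authors' implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
model.eval()

def token_embeddings(sentence):
    """Last-layer hidden state e(y_i) for each token of an unmasked sentence (cf. Eq. (5))."""
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    hidden = output.last_hidden_state[0, 1:-1]  # drop [CLS] and [SEP]
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])[1:-1]
    return tokens, hidden  # one vector per subword token, usable as datastore keys
```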
The input and output sentences of the GEC model and the examples from the baselines and EB-GEC were presented to the annotators with anonymized system names. Annotators then decided whether the examples helped them interpret the GEC output or aided their understanding of grammar and vocabulary. An example sentence pair was labeled 1 if it was "useful for decision-making or understanding the correction" and 0 otherwise. We then computed scores for the Token-based retrieval, BERT-based retrieval, and EB-GEC models by counting the number of sentences labeled 1. We verify whether corrections with examples are more beneficial for learners than those without, and whether EB-GEC can present more valuable examples than the baselines. Since it is not always the case that only the corrected parts are helpful for learners (Matsubara et al., 2008; Yen et al., 2015), the uncorrected parts were also considered during annotation.

We manually evaluated 990 examples provided by the three methods for 330 ungrammatical and grammatical sentence pairs randomly sampled from the W&I-test, CoNLL2014, FCE-test, and JFLEG-test. The human evaluation was performed by two annotators with CEFR [4] proficiency level B and one annotator with level C [5]. All three annotators evaluated different examples.

[4] https://www.cambridgeenglish.org/exams-and-tests/cefr
[5] They are not authors of this paper. In this human evaluation, annotators with middle and high proficiency levels were selected because annotators who cannot understand the errors/corrections cannot judge whether the presented example is necessary or unnecessary. Therefore, this study does not examine whether annotators with lower proficiency levels find it helpful to see examples without explanation.

3.4 Results

Human Evaluation of Examples. Table 1 shows the results of the human evaluation of the Token-based retrieval, BERT-based retrieval, and EB-GEC models. The percentage of useful examples increased significantly for EB-GEC compared to the token-based and BERT-based retrieval baselines. The percentage of useful examples from EB-GEC is greater than 50, which indicates that presenting examples is more useful than providing none. This result is non-trivial because the percentage for token-based retrieval is only 28.8, which indicates that the examples it presented were mostly useless. Therefore, the examples presented for interpretability by EB-GEC support language learners' understanding and acceptance of the model output.

Method | Human evaluation score
Token-based retrieval | 28.8
BERT-based retrieval | 52.4
EB-GEC | 68.8†,‡

Table 1: Results of the human evaluation of the usefulness of Token-based retrieval, BERT-based retrieval, and EB-GEC examples. The human evaluation score is the percentage of useful examples among those presented to the language learners. † and ‡ indicate statistically significant differences for EB-GEC according to McNemar's test (p < 0.05) against Token-based retrieval and BERT-based retrieval, respectively.
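Table 1 reports significance with McNemar's test on paired usefulness labels. A small sketch, under the assumption that the judgements of two systems on the same items are available as aligned 0/1 arrays (the variable and function names are illustrative), could look like this:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_p(labels_a, labels_b):
    """p-value of McNemar's test for paired 0/1 usefulness judgements of two systems."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    table = [[int(np.sum((a == 1) & (b == 1))), int(np.sum((a == 1) & (b == 0)))],
             [int(np.sum((a == 0) & (b == 1))), int(np.sum((a == 0) & (b == 0)))]]
    return mcnemar(table, exact=False, correction=True).pvalue
```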
GEC Accuracy. We examined the impact of using examples on GEC prediction accuracy. Table 2 shows the scores of the vanilla GEC and EB-GEC on the W&I, CoNLL2014, FCE, and JFLEG test data. The accuracy of EB-GEC is slightly lower on JFLEG but outperforms the vanilla GEC on W&I, CoNLL2014, and FCE. This indicates that the use of examples contributes to improving GEC model accuracy.

Method | W&I | CoNLL2014 | FCE | JFLEG
Vanilla GEC | 50.12 | 49.68 | 41.49 | 53.71
EB-GEC | 52.45 | 50.51 | 43.00 | 53.46

Table 2: Accuracy of the vanilla GEC model and the EB-GEC model on the W&I, CoNLL2014, FCE, and JFLEG test data.

4 Analysis

4.1 Effect of λ

We analyzed the relationship between the interpolation coefficient λ (in Equation (1)) and GEC accuracy. A smaller λ value may reduce interpretability, as examples are considered less in prediction. In contrast, a larger λ value may reduce robustness, especially when relevant examples are not included in the datastore; the model must then generate corrections relying more on kNN examples, which may not be present in the datastore for some inputs.

Figure 3 shows the accuracy of the GEC on each development dataset when λ is changed from 0 to 1 in increments of 0.25. We found that when λ was set to 1, the accuracy for all development datasets was lower than when λ was set to 0.50 or less. The highest accuracy was obtained for λ = 0.5, which treats the vanilla distribution and the kNN distribution equally.

Figure 3: Scores on each development dataset using different λ values from 0 to 1 in increments of 0.25. The evaluation metric for each dataset is the same as for the corresponding test data.
4.2 Matching Error Types of Model Outputs and Examples

In Section 1, we hypothesized that similar error-correcting examples are closely clustered in the representation space. Therefore, we investigated the agreement between the GEC output and the examples in terms of edits and error types. We extracted edits and their error types, which were automatically assigned by ERRANT (Felice et al., 2016; Bryant et al., 2017), for incorrect/correct sentence pairs. For example, for the GEC input/output pair "They have /a tremendous problem .", the example pair is "This has /a tremendous problem .", its edit is " /a", and its error type is determiner error (DET). We calculated the matching percentage of the edits and error types between EB-GEC outputs and the examples retrieved by EB-GEC to measure their similarity. In addition, we used token-based and BERT-based retrieval as comparison methods for obtaining examples relevant to the EB-GEC outputs.
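A minimal sketch of the matching percentage described above, assuming each output and its retrieved example have already been reduced to (edit, error type) pairs, e.g. with ERRANT; the helper name is illustrative.

```python
def matching_percentage(output_pairs, example_pairs):
    """Share of (edit, error_type) pairs in the model output that also occur in the example."""
    if not output_pairs:
        return 0.0
    example_set = set(example_pairs)
    matched = sum(1 for pair in output_pairs if pair in example_set)
    return 100.0 * matched / len(output_pairs)

# matching_percentage([(" /a", "DET")], [(" /a", "DET")])  -> 100.0
```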
Figure 4 shows the matching percentage of edits and error types between the GEC outputs and the k-nearest neighbor examples. First, EB-GEC has the highest percentage for all test data. This indicates that, of the methods tested, EB-GEC retrieves the most relevant examples. This trend is consistent with the human evaluation results. Furthermore, EB-GEC has a lower percentage on JFLEG than on W&I, CoNLL2014, and FCE. This corroborates the results of Table 2 and suggests that GEC accuracy improves further when examples more relevant to the corrections can be retrieved.

Figure 4: Matching percentage of edits and error types in model outputs and examples (Token-based retrieval, BERT-based retrieval, and EB-GEC on W&I, CoNLL2014, FCE, and JFLEG).

4.3 EB-GEC and Error Types

We analyzed the accuracy of EB-GEC for different error types to investigate the effect of error type on EB-GEC performance. We used ERRANT to evaluate the accuracy of EB-GEC for each error type on the FCE-test.

Error type | Freq. | Vanilla GEC | EB-GEC | Diff.
PREP | 115K | 40.9 | 44.6 | 3.7
PUNCT | 98K | 33.5 | 37.0 | 3.5
DET | 171K | 46.6 | 49.8 | 3.2
ADJ:FORM | 2K | 54.5 | 38.4 | -16.08
ADJ | 21K | 17.0 | 14.5 | -2.42
SPELL | 72K | 68.6 | 66.8 | -1.87

Table 3: The error types with the highest and the lowest EB-GEC accuracy compared to the vanilla GEC on the FCE-test, based on the Diff. column. The Freq. column is the frequency of the error type in the datastore.

Table 3 shows the three error types with the most significant increase and the three with the most significant decrease in accuracy for EB-GEC compared to the vanilla GEC. The three error types with the largest increases were preposition error (PREP; e.g. "I think we should book at/ the Palace Hotel ."), punctuation error (PUNCT; e.g. "Yours ./ sincerely ,"), and article error (DET; e.g. "That should complete that/an amazing day ."). The three error types with the largest decreases were adjective conjugation error (ADJ:FORM; e.g. "I was very please/pleased to receive your letter ."), adjective error (ADJ; e.g. "The adjoining restaurant is very enjoyable/good as well ."), and spelling error (SPELL; e.g. "Pusan Castle is locted/located in the South of Pusan .").

We drew the following findings from these results. The error types with the largest increases in accuracy involve a limited number of token types in their edits compared to those with the largest decreases (namely, error types referring to adjectives and nouns). Furthermore, these error types are the most frequent errors in the datastore (excluding the unclassified error type annotated as OTHER), so the datastore sufficiently covers such edits.
Contrary to the error types with improved accuracy, ADJ and SPELL involve a considerable number of token types in their edits, which are not easy to cover sufficiently in a datastore. Moreover, ADJ:FORM is the second least frequent error type in the datastore, and we believe such examples also cannot be covered sufficiently. These results show that EB-GEC improves the accuracy of error types that are easily covered by examples, because fewer word types appear in their edits and they are well represented in the datastore. Conversely, accuracy deteriorates for error types that are difficult to cover, such as those with many word types in their edits and those that are infrequent in the datastore.

We investigated the characteristics of the EB-GEC examples by comparing specific examples for each error type with those from token-based and BERT-based retrieval. Table 4 shows examples from Token-based retrieval, BERT-based retrieval, and EB-GEC for the top three error types with accuracy improvements in EB-GEC (PREP, PUNCT, and DET).

Method | Error type | Error-correction pair | Label
Input/Output | PREP | You will be able to buy them in/at /a reasonable price . | -
Token-based retrieval | PREP | Naturally , it 's easier to get a job then/when you were/are good in/at foreign languagers/languages or computers . | 0
BERT-based retrieval | PREP | I could purchase them in/at reasonable price/prices . | 1
EB-GEC | PREP | I could purchase them in/at reasonable price/prices . | 1
Input/Output | PUNCT | for/For example /, a reasercher that wants to be successfull must take risk . | -
Token-based retrieval | PUNCT | Today /, we first/ met for /the first time in about four weeks . | 0
BERT-based retrieval | PUNCT | for/For example /, a kid named Michael . | 1
EB-GEC | PUNCT | for/For example /, a kid named Michael . | 1
Input/Output | DET | Apart from that /, it takes /a long time to go somewhere . | -
Token-based retrieval | DET | If you have enough time , I recommend /a bus trip . | 0
BERT-based retrieval | PREP | However , it will take for/ a long time to go abroad in my company . | 0
EB-GEC | DET | So/Because of that , it takes /a long time to write my journal/entries . | 1

Table 4: Examples retrieved by Token-based retrieval, BERT-based retrieval, and EB-GEC for each input/output, and the human evaluation labels. Underlines indicate error-correction pairs in the sentences. Bold indicates the edit used as the query to retrieve the example, and the error types of the bold edits are assigned by ERRANT.

For Token-based retrieval, the tokens in the edits are consistent, including "in/at", " /,", and " /a".
However, only surface information is used and context is not considered, so such unrelated examples are not useful for language learners. BERT-based retrieval presented the same examples as EB-GEC for the PREP and PUNCT error types, and the human evaluation label was also 1. However, its last example is influenced by the context rather than the correction and is therefore irrelevant, labeled 0 in the human evaluation. This indicates that BERT-based retrieval focuses too heavily on context, resulting in examples related to the overall output but unrelated to the edits. Conversely, EB-GEC is able to present examples in which the edit tokens are consistent for all corrections. Furthermore, the contexts are similar to those of the input/output, for example "purchase them in/at reasonable price/prices", "for/For example /,", and "it takes /a long time to", and all the examples were labeled 1 in the human evaluation. This demonstrates that EB-GEC retrieves the most related examples, which are helpful for users.

5 Related Work

5.1 Example Retrieval for Language Learners

There are example search systems that support language learners by finding examples. Before neural-based models, examples were retrieved and presented by surface matching (Matsubara et al., 2008; Yen et al., 2015). Arai et al. (2019, 2020) proposed combining Grammatical Error Detection (GED) and example retrieval to present both grammatically incorrect and correct examples for essays written by learners of Japanese, and showed that essay quality was improved by providing examples. Their method is similar to EB-GEC in that it presents both correct and incorrect examples, but it builds the example search system on GED rather than GEC. Furthermore, these example search systems retrieve examples independently of the model. In contrast, EB-GEC presents more related examples, as shown in Section 3.4.

Cheng and Nagase (2012) developed an example-based Japanese proofreading system that retrieves examples using dependency structures and proofread texts.
Proofreading is a task similar to GEC because it also involves correcting grammatical errors. However, this method likewise does not focus on using examples to improve interpretability.

5.2 Explanation for Language Learners

There is a feedback comment generation task (Nagata, 2019) that generates useful hints and explanations for grammatical errors and unnatural expressions in writing education. Nagata et al. (2020) used a grammatical error detection model (Kaneko et al., 2017; Kaneko and Komachi, 2019) and a neural retrieval-based method for preposition errors. The motivation of that study was similar to ours, that is, to help language learners understand grammatical errors and unnatural expressions in an interpretable way. On the other hand, EB-GEC supports language learners using examples from the GEC model itself rather than generated feedback.

5.3 Example Retrieval in Text Generation

Various previous studies have retrieved words, phrases, and sentences for use in prediction. Nagao (1984) proposed example-based MT, which translates sequences by analogy. This approach has been extended in a variety of directions for MT (Sumita and Iida, 1991; Doi et al., 2005; Van Den Bosch, 2007; Stroppa et al., 2007; Van Gompel et al., 2009; Haque et al., 2009). In addition, example-based methods have been used for summarization (Makino and Yamamoto, 2008) and paraphrasing (Ohtake and Yamamoto, 2003). These studies were performed before neural networks were in general use, and the examples were not used to open up the neural network black box as was done in this study.
For neural network models, methods using examples have been proposed to improve accuracy and interpretability during inference. Gu et al. (2018) proposed a model that, during inference, retrieves parallel sentences similar to the input sentence and generates translations guided by the retrieved parallel sentences. Zhang et al. (2018) proposed a method that, during inference, retrieves parallel sentences whose source sides are similar to the input sentence and up-weights output n-grams contained in the retrieved sentence pairs based on the similarity between the input sentence and the retrieved source sentence. These methods differ from EB-GEC with kNN-MT in that they retrieve examples via surface matching, as in the token-based retrieval baseline. Moreover, these studies do not focus on the interpretability of the model.

Several methods have been proposed to retrieve examples using neural model representations and consider them for prediction. Khandelwal et al. (2020, 2021) proposed retrieving similar examples as the nearest neighbors of pre-trained hidden states during inference and complementing the output distributions of the language model and the machine translation model with the distributions of these examples. Lewis et al. (2020) combined a pre-trained retriever with a pre-trained encoder-decoder model and fine-tuned them end-to-end; for an input query, they retrieve the top-k documents and use them as a latent variable for the final prediction. Guu et al. (2020) first conducted unsupervised joint pre-training of a knowledge retriever and a knowledge-augmented encoder for the language modeling task, then fine-tuned them on a task of primary interest with supervised examples. The main purpose of these methods was to improve accuracy using examples, and whether the examples were helpful for users was not verified. Conversely, our study showed that examples for interpretability in GEC can be helpful for real users.

6 Conclusion

We introduced EB-GEC to improve the interpretability of corrections by presenting examples to language learners. The human evaluation showed that the examples presented by EB-GEC supported language learners' decisions to accept corrections and improved their understanding of the correction results. Although existing interpretability methods using examples had not verified whether examples are helpful for humans, this study demonstrated that examples are helpful for learners using GEC. In addition, the results on the GEC benchmarks showed that EB-GEC predicts corrections more accurately than, or comparably to, its vanilla counterpart.

Future work includes investigating whether example presentation is beneficial for learners with low language proficiency. In addition, we plan to improve datastore coverage by using pseudo-data (Xie et al., 2018) and to weight low-frequency error types in order to present diverse examples. We will also explore whether methods for improving accuracy and diversity (Chollampatt and Ng, 2018; Kaneko et al., 2019; Hotate et al., 2019, 2020) are effective for EB-GEC.
Acknowledgements

This paper is based on results obtained from a project, JPNP18002, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). We thank Yukiko Konishi, Yuko Unzai, and Naoko Furuya for their help with our experiments.

References

Mio Arai, Masahiro Kaneko, and Mamoru Komachi. 2019. Grammatical-error-aware incorrect example retrieval system for learners of Japanese as a second language. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 296–305.

Mio Arai, Masahiro Kaneko, and Mamoru Komachi. 2020. Example retrieval system using grammatical error detection for Japanese as a second language learners. Transactions of the Japanese Society for Artificial Intelligence, 35(5):A–K23.

Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction. In BEA, pages 52–75.

Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In ACL, pages 793–805.

Yuchang Cheng and Tomoki Nagase. 2012. An example-based Japanese proofreading system for offshore development. In Proceedings of COLING 2012: Demonstration Papers, pages 67–76.

Shamil Chollampatt and Hwee Tou Ng. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In AAAI, pages 5755–5762.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In NAACL, pages 568–572.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS Corpus of Learner English. In BEA, pages 22–31.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186.

Takao Doi, Hirofumi Yamamoto, and Eiichiro Sumita. 2005. Example-based machine translation using efficient sentence retrieval based on edit-distance. ACM Transactions on Asian Language Information Processing, 4(4):377–399.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments. In COLING, pages 825–835.

Sylviane Granger. 1998. Developing an automated writing placement system for ESL learners. In LEC, pages 3–18.

Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. 2018. Search engine guided neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.

Rejwanul Haque, Sudip Kumar Naskar, Yanjun Ma, and Andy Way. 2009. Using supertags as source language context in SMT. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation.

Kengo Hotate, Masahiro Kaneko, Satoru Katsumata, and Mamoru Komachi. 2019. Controlling grammatical error correction using word edit rate. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 149–154.

Kengo Hotate, Masahiro Kaneko, and Mamoru Komachi. 2020. Generating diverse corrections with local beam search for grammatical error correction. In COLING, pages 2132–2137.

Tim Johns. 1994. From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. Perspectives on Pedagogical Grammar, 293.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. 2018. Approaching neural grammatical error correction as a low-resource machine translation task. In NAACL, pages 595–606.

Masahiro Kaneko, Kengo Hotate, Satoru Katsumata, and Mamoru Komachi. 2019. TMU transformer system using BERT for re-ranking at BEA 2019 grammatical error correction on restricted track. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 207–212.

Masahiro Kaneko and Mamoru Komachi. 2019. Multi-head multi-layer attention to deep language representations for grammatical error detection. Computación y Sistemas, 23(3):883–891.

Masahiro Kaneko, Masato Mita, Shun Kiyono, Jun Suzuki, and Kentaro Inui. 2020. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In ACL, pages 4248–4254.

Masahiro Kaneko, Yuya Sakaizawa, and Mamoru Komachi. 2017. Grammatical error detection using error- and grammaticality-specific word embeddings. In IJCNLP, pages 40–48.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In ICLR.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In ICLR.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, volume 33, pages 9459–9474.

Megumi Makino and Kazuhide Yamamoto. 2008. Summarization by analogy: An example-based approach for news articles. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.

Shigeki Matsubara, Yoshihide Kato, and Seiji Egawa. 2008. Escort: Example sentence retrieval system as support tool for English writing. Journal of Information Processing and Management, 51(4):251–259.

Atsushi Mizumoto and Kiyomi Chujo. 2015. A meta-analysis of data-driven learning approach in the Japanese EFL classroom. English Corpus Studies, 22:1–18.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In IJCNLP, pages 147–155.

Makoto Nagao. 1984. A framework of a mechanical translation between Japanese and English by analogy principle. Artificial and Human Intelligence, pages 351–354.

Ryo Nagata. 2019. Toward a task of feedback comment generation for writing learning. In EMNLP-IJCNLP, pages 3206–3215.

Ryo Nagata, Kentaro Inui, and Shin'ichiro Ishikawa. 2020. Creating corpora for research in feedback comment generation. In LREC, pages 340–345.

Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In NAACL, pages 588–593.

Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In EACL, pages 229–234.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In CoNLL, pages 1–14.

Kiyonori Ohtake and Kazuhide Yamamoto. 2003. Applicability analysis of corpus-derived paraphrases toward example-based paraphrasing. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation, pages 380–391.

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163–170.

Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Ryuto Konno, and Kentaro Inui. 2020. Instance-based learning of span representations: A case study through named entity recognition. In ACL, pages 6452–6459.

John W. Ratcliff and David E. Metzener. 1988. Pattern matching: The gestalt approach. Dr. Dobb's Journal, 13(7):46.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL, pages 1715–1725.

Nicolas Stroppa, Antal van den Bosch, and Andy Way. 2007. Exploiting source similarity for SMT using context-informed features. In Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages.

Eiichiro Sumita and Hitoshi Iida. 1991. Experiments and prospects of example-based machine translation. In ACL, pages 185–192.

Antal Van Den Bosch. 2007. A memory-based classification approach to marker-based EBMT. In Proceedings of the METIS-II Workshop on New Approaches to Machine Translation.

Maarten Van Gompel, Antal Van Den Bosch, and Peter Berck. 2009. Extending memory-based machine translation to phrases. In Proceedings of the Third Workshop on Example-Based Machine Translation, pages 79–86.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, pages 5998–6008.

Mary E. Webb, Andrew Fluck, Johannes Magenheim, Joyce Malyn-Smith, Juliet Waters, Michelle Deschênes, and Jason Zagami. 2020. Machine learning for human learners: Opportunities, issues, tensions and threats. Educational Technology Research and Development, pages 1–22.

Sam Wiseman and Karl Stratos. 2019. Label-agnostic sequence labeling by copying nearest neighbors. In ACL, pages 5363–5369.

Ziang Xie, Guillaume Genthial, Stanley Xie, Andrew Ng, and Dan Jurafsky. 2018. Noising and denoising natural language: Diverse backtranslation for grammar correction. In NAACL, pages 619–628.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In NAACL, pages 180–189.

Helen Yannakoudakis, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe, and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education, pages 251–267.

Tzu-Hsi Yen, Jian-Cheng Wu, Jim Chang, Joanne Boisson, and Jason Chang. 2015. WriteAhead: Mining grammar patterns in corpora for assisted writing. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 139–144.

Zheng Yuan and Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In NAACL, pages 380–386.

Jingyi Zhang, Masao Utiyama, Eiichro Sumita, Graham Neubig, and Satoshi Nakamura. 2018. Guiding neural machine translation with retrieved translation pieces. In NAACL, pages 1325–1335.

Wei Zhao, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. 2019. Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. In NAACL, pages 156–165.