Sequence-to-Action: Grammatical Error Correction with Action Guided Sequence Generation
Jiquan Li1, Junliang Guo1, Yongxin Zhu1, Xin Sheng1, Deqiang Jiang2, Bo Ren2, Linli Xu1
1 Anhui Province Key Laboratory of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China
2 Tencent YouTu Lab
{lijiquan, guojunll, zyx2016, xins}@mail.ustc.edu.cn, {dqiangjiang, timren}@tencent.com, linlixu@ustc.edu.cn

arXiv:2205.10884v1 [cs.CL] 22 May 2022
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

The task of Grammatical Error Correction (GEC) has received remarkable attention with wide applications in Natural Language Processing (NLP) in recent years. While one of the key principles of GEC is to keep the correct parts unchanged and avoid over-correction, previous sequence-to-sequence (seq2seq) models generate results from scratch, which are not guaranteed to follow the original sentence structure and may suffer from the over-correction problem. In the meantime, the recently proposed sequence tagging models can overcome the over-correction problem by only generating edit operations, but are conditioned on human-designed, language-specific tagging labels. In this paper, we combine the pros and alleviate the cons of both models by proposing a novel Sequence-to-Action (S2A) module. The S2A module jointly takes the source and target sentences as input, and is able to automatically generate a token-level action sequence before predicting each token, where each action is generated from three choices named SKIP, COPY and GENerate. Then the actions are fused with the basic seq2seq framework to provide final predictions. We conduct experiments on the benchmark datasets of both English and Chinese GEC tasks. Our model consistently outperforms the seq2seq baselines, while being able to significantly alleviate the over-correction problem as well as holding better generality and diversity in the generation results compared to the sequence tagging models.

Introduction

Grammatical Error Correction (GEC), which aims at automatically correcting various kinds of errors in the given text, has received increasing attention in recent years, with wide applications in natural language processing such as post-processing the results of Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) models (Mani et al. 2020), as well as providing language assistance to non-native speakers. In the task of GEC, the primary goals are two-fold: 1) identifying as many errors as possible and successfully correcting them; 2) keeping the original correct words unchanged without bringing in new errors.

Intuitively, GEC can be considered as a machine translation task by taking the incorrect text as the source language and the corrected text as the target language. In recent years, NMT-based approaches (Bahdanau, Cho, and Bengio 2015) with sequence-to-sequence (seq2seq) architectures have become the preferred solution for the GEC task, where the original sentences with various kinds of grammatical errors are taken as the source input while the correct sentences are taken as the target supervision. Specifically, the Transformer model (Vaswani et al. 2017; Bryant et al. 2019) has been a predominant choice for NMT-based methods.

However, there are some issues with the seq2seq methods for the GEC task. In general, as a result of generating target sentences from scratch, the repetition and omission of words frequently occur in the seq2seq generation process, as studied by previous works (Tu et al. 2016). More importantly, there is no guarantee that the generated sentences keep all the original correct words while maintaining the semantic structures. Experiments show that such problems occur more frequently when the original sentences are long or contain low-frequency/out-of-vocabulary words. Consequently, seq2seq models may suffer from these potential risks in practice, which conflicts with the second primary goal of the GEC task. Recently, sequence tagging methods (Malmi et al. 2019; Awasthi et al. 2019; Omelianchuk et al. 2020; Stahlberg and Kumar 2020) consider GEC as a text editing task by detecting and applying edits to the original sentences, therefore bypassing the above mentioned problems of seq2seq models. Nevertheless, the edits are usually constrained by human-designed or automatically generated lexical rules (Omelianchuk et al. 2020; Stahlberg and Kumar 2020) and vocabularies (Awasthi et al. 2019; Malmi et al. 2019), which limits the generality and transferability of these methods. Moreover, when it comes to corrections which need longer insertions, most of these sequence tagging methods rely on iterative corrections, which can reduce the fluency.

To tackle these problems, in this paper, we propose a Sequence-to-Action (S2A) model which is able to automatically edit the erroneous tokens in the original sentences without relying on human knowledge, while keeping as many correct tokens as possible. Specifically, we simply introduce three atomic actions named SKIP, COPY and GENerate to guide the model when generating the corrected outputs. By integrating the erroneous source sentence and the golden target sentence, we construct a tailored input format for our model, and for each token, the model learns to skip it if the current input token is an erroneous one, copy it if the token is an originally correct token, or generate it if the token is a target token.

In this way, the proposed model provides the following benefits. 1) As a large proportion of tokens are correct in the source sentences, the COPY action appears more frequently than the other actions. Therefore, by taking the source sentence as an input to the S2A module, our model is more inclined to copy the current input token, and therefore alleviates the over-correction problem of seq2seq models. 2) Compared to sequence tagging models which generate edits based on human-designed rules or vocabularies, our model can realize more flexible generations such as long insertions. More importantly, our model is language-independent with good generality.

We evaluate the S2A model on both English and Chinese GEC tasks. On the English GEC task, the proposed method achieves 52.5 in F0.5 score, producing an improvement of 2.1 over the baseline models. Meanwhile, it achieves 38.06 in F0.5 score on the Chinese GEC task, with an improvement of 1.09 over the current state-of-the-art baseline method MaskGEC (Zhao and Wang 2020).

Related Work

GEC by Seq2seq Generation

A basic seq2seq model for GEC consists of an encoder and a decoder, where the source sentence is first encoded into the corresponding hidden states by the encoder, and each target token is then generated by the decoder conditioned on the hidden states and the previously generated tokens. Given a pair of training samples (x, y), the objective function of the seq2seq based GEC model can be written as:

L_s2s(y|x; Θ) = − Σ_{i=1}^{n} log p(y_i | y_{<i}, x; Θ),   (1)

where n is the length of the target sentence y and Θ denotes the model parameters.

GEC by Generating Edits

Another line of research takes a different perspective by treating the GEC task as a text editing problem and proposes sequence tagging models for GEC. In general, these models predict an edit tag sequence t = (t_1, t_2, ..., t_m) based on the source sentence x by estimating p(t|x) = Π_{i=1}^{m} p(t_i | x). In the edit tag sequence, each tag is selected from a pre-defined set and assigned to a source token, representing an edition to this token.

Specifically, LaserTagger (Malmi et al. 2019) predicts editions among keeping, deleting or adding a new token/phrase from a handcrafted vocabulary. PIE (Awasthi et al. 2019) iteratively predicts token-level editions for a fixed number of iterations in a non-autoregressive way. GECToR (Omelianchuk et al. 2020) further improves the model by designing more fine-grained editions w.r.t. the lexical rules of English. Seq2Edits (Stahlberg and Kumar 2020) generates span-level (instead of token-level) tags to produce more compact editions. In the sequence tagging models mentioned above, the operations of generating new tokens are restricted either by human-designed vocabularies or language-dependent lexical rules, limiting their generality. For example, GECToR outperforms baseline models on English GEC datasets, but its performance severely degenerates on the Chinese GEC task, as shown in Table 3. There also exist methods that integrate seq2seq models with sequence tagging methods into a pipeline system to improve the performance or efficiency (Chen et al. 2020; Hinson, Huang, and Chen 2020), which cannot be optimized end-to-end.

Different to the works discussed above, the proposed S2A model alleviates the over-correction/omission problem of seq2seq models by taking the original sentence as input, while producing flexible editions w.r.t. erroneous tokens without relying on human knowledge.
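For concreteness, the seq2seq objective in Eq. (1) is simply a token-level cross-entropy over the corrected sentence, conditioned on the erroneous source through the decoder. The snippet below is a minimal PyTorch sketch for illustration only; the tensor names and shapes are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def seq2seq_gec_loss(decoder_logits, target_ids, pad_id):
    """Sketch of Eq. (1): -sum_i log p(y_i | y_<i, x).
    Assumed shapes:
      decoder_logits: (batch, tgt_len, vocab_size), teacher-forced decoder outputs
      target_ids:     (batch, tgt_len), gold corrected tokens
    """
    vocab_size = decoder_logits.size(-1)
    # Padding positions are excluded from the negative log-likelihood.
    return F.cross_entropy(
        decoder_logits.view(-1, vocab_size),
        target_ids.view(-1),
        ignore_index=pad_id,
        reduction="sum",
    )
```

The sequence-tagging objective of the editing models differs only in that each per-position distribution ranges over a small pre-defined edit-tag set rather than the full vocabulary.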
To achieve these goals, we propose a novel Sequence-to-Action (S2A) module based on the seq2seq framework, with tailored input formats. An illustration of the proposed framework is shown in Figure 1.

[Figure 1: An illustration of the proposed framework. The encoder, the decoder and the S2A module each take embedded inputs; the decoder's token distribution and the S2A module's action distribution are combined by probability fusion.]

Specifically, our framework consists of three components: the encoder, the decoder and the S2A module. The S2A module is built upon the decoder to predict an action at each input position. Different from previous sequence tagging works, which usually introduce many labels and generate new tokens from a pre-defined vocabulary, we simply design three actions named SKIP, COPY and GENerate, where SKIP indicates that the current input token is an error that should not appear in the output, COPY refers to an originally correct token that should be kept, and GEN indicates that a new token should be generated at this position, with the token predicted from the whole dictionary instead of a pre-defined subset.

The S2A module takes the hidden output of the decoder and the source sentence as input, and we design a special input format to integrate them, ensuring that at each step, the action is taken considering both the source and target information. Then the action probability produced by the S2A module is fused with the traditional seq2seq prediction probability to obtain the final results. We introduce the details as follows.

Input Construction

The encoder in our model is akin to a traditional seq2seq encoder, which takes x as input and provides its hidden representation h_e. As for the decoder side, instead of simply taking y as input, we integrate x with y as the decoder input to provide direct information of the original sentence.

Specifically, we follow four principles to integrate x and y into a new sequence z with an action sequence a. For every token in x and y: 1) if the source token is an originally correct token (i.e., it appears both in x and y), we simply copy it to z and assign the action COPY to it; 2) if the source token is an erroneous token that should be replaced (i.e., the corresponding correct token appears in y), we jointly append this token as well as the correct token to z, and assign SKIP to the erroneous token while assigning GEN to the correct token; 3) if the source token is an erroneous token that should be deleted (i.e., no corresponding correct token appears in y), we append it to z and assign the action SKIP to it; 4) for the target tokens that do not have alignments in x and thus should be generated, we append them to z and assign the action GEN. Finally, we obtain the integrated sequence z and the action sequence a, both with length t and t ≥ max(m, n). This process can be implemented by dynamic programming, and we provide an illustration in Table 1.

We then construct ỹ_in, which serves as the decoder input, and x̃, which maintains the source tokens and serves as the input to the S2A module. Details are described in Algorithm 1. To construct x̃, we iterate over z and keep the tokens with COPY and SKIP actions. For tokens with the GEN action, we fill the location with the next unconsumed token in the original source sentence. Meanwhile, for ỹ, we keep the tokens with COPY and GEN actions, and fill the locations with the SKIP action by introducing a special blank token [BLK], which serves as a placeholder to represent the positions where the original tokens are skipped. It is worth noting that during training, we take ỹ as the target sequence ỹ_out, and we replace the [BLK] token with the most recent non-blank token on its left and take the right-shifted version of ỹ as the decoder input ỹ_in to enable teacher forcing.

Algorithm 1: Input Construction
Input: The source sentence x, the integrated sequence z and the action sequence a with length t
Output: The S2A input x̃, the decoder input ỹ_in, the target sequence ỹ_out
1: Initialize x̃ and ỹ as empty lists; i = 1
2: for k ← 1 to t do
3:   if a_k == COPY then
4:     Append z_k to x̃; i++
5:     Append z_k to ỹ
6:   else if a_k == SKIP then
7:     Append z_k to x̃; i++
8:     Append [BLK] to ỹ
9:   else
10:    Append x_i to x̃
11:    Append z_k to ỹ
12:  end if
13: end for
14: ỹ_out ← ỹ
15: Replace [BLK] in ỹ with the token on its left
16: ỹ_in ← [S] + ỹ_{0:−1}
17: return x̃, ỹ_in, ỹ_out
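A direct transcription of Algorithm 1 into Python might look as follows. This is an illustrative sketch rather than the authors' released code; the token representation (plain strings, with "[BLK]" and "[S]" as literal placeholders) is an assumption.

```python
COPY, SKIP, GEN = "C", "S", "G"
BLK, BOS = "[BLK]", "[S]"

def build_s2a_inputs(x, z, a):
    """Sketch of Algorithm 1: build the S2A input x_tilde, the decoder input
    y_in and the target y_out from the source x, the integrated sequence z
    and the aligned action sequence a (lists of tokens / action symbols).
    """
    assert len(z) == len(a)
    x_tilde, y = [], []
    i = 0  # 0-based pointer to the next unconsumed source token
    for z_k, a_k in zip(z, a):
        if a_k == COPY:
            x_tilde.append(z_k); i += 1   # consume the source token
            y.append(z_k)                  # and keep it in the target
        elif a_k == SKIP:
            x_tilde.append(z_k); i += 1   # consume the erroneous source token
            y.append(BLK)                  # mark its position as skipped
        else:  # GEN: a new token is produced, the source pointer stays put
            x_tilde.append(x[i] if i < len(x) else x[-1])  # guard for end of x
            y.append(z_k)
    y_out = list(y)
    # Replace [BLK] with the most recent non-blank token on its left,
    # then right-shift with [S] to obtain the teacher-forcing input.
    filled = []
    for tok in y:
        filled.append(filled[-1] if tok == BLK and filled else tok)
    y_in = [BOS] + filled[:-1]
    return x_tilde, y_in, y_out
```

Running this on the example in Table 1 (source "The cat is sat at mat . [/S]" with actions C C S C S G G C C C) reproduces the x̃, ỹ_in and ỹ_out rows shown there.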
Sequence | Content
Source x | The cat is sat at mat . [/S]
Gold y | The cat sat on the mat . [/S]
Integrated z | The cat is sat at on the mat . [/S]
Action a | C C S C S G G C C C
x̃ | The cat is sat at mat mat mat . [/S]
ỹ_in | [S] The cat cat sat sat on the mat .
ỹ_out | The cat [BLK] sat [BLK] on the mat . [/S]

Table 1: An illustration of the decoder input construction. We refer to SKIP as S, COPY as C, and GEN as G. To differentiate the three actions, the paper marks SKIP-ed tokens and [BLK] with underlines and GEN tokens in boldface. [S] and [/S] represent the begin- and end-of-sequence symbols.

Training and Inference

Given the source input x, the decoder input ỹ_in, the S2A input x̃ and the target sentence ỹ_out, the computation flow of our model can be written as follows:

h = f_a(h_d, e(x̃)), where h_d = f_d(ỹ_in, h_e) and h_e = f_e(x),   (2)

where f_a, f_d and f_e indicate the S2A module, the decoder and the encoder respectively, with the hidden outputs h, h_d and h_e correspondingly, and e(·) indicates the embedding lookup. Then the token prediction probability can be written as p_d(ỹ_out | ỹ_in, x) = softmax(f_o(h_d)), where f_o(·) ∈ R^{d×V} is the linear output layer that provides the prediction probability over the dictionary, while d and V indicate the dimension of the hidden outputs and the vocabulary size respectively.

The S2A module simply consists of two feed-forward layers, taking the concatenation of h_d and e(x̃) as input and producing the probability over actions as output:

p_a(a | h_d, e(x̃)) = softmax(σ([h_d; e(x̃)] · w_1 + b_1) · w_2 + b_2),   (3)

where w_1 ∈ R^{2d×2d} and w_2 ∈ R^{2d×3}.

For each position, the three actions SKIP, COPY and GEN indicate that the model should predict a [BLK] token, predict the input token, or generate a new token, respectively. Therefore, for the i-th position, denoting the probabilities of the three actions as (p_a^i(s), p_a^i(c), p_a^i(g)) = p_a^i, we fuse them with the token probability p_d^i to obtain the prediction probability of the S2A module p_s2a^i ∈ R^V:

p_s2a^i([BLK]) = p_a^i(s),
p_s2a^i(x̃_i) = p_a^i(c),
p_s2a^i(V \ {[BLK], x̃_i}) = p_a^i(g) · p_d^i(V \ {[BLK], x̃_i}),   (4)

where x̃_i indicates the current input token and p_d^i(V \ {[BLK], x̃_i}) indicates the token probability normalized after setting the predictions of the blank token as well as the current token to 0. In this way, as the probabilities in p_a (3 classes) are usually larger than those in p_d (V classes), with action COPY we amplify the probability of keeping the current token, which is originally correct; with action GEN we force the model to predict a new token from the vocabulary; and with action SKIP we force the model to skip the current erroneous token by predicting the [BLK] token.

Then the loss function of the proposed Sequence-to-Action module can be written as:

L_s2a(ỹ_out | ỹ_in, x̃, x) = − Σ_{i=1}^{t} log p_s2a^i(ỹ_out,i | ỹ_in,<i, x̃, x),   (5)

where θ_s2a denotes the parameters of the proposed S2A module. Together with the loss function of the traditional seq2seq model described in Equation (1), the final loss function of our framework can be written as:

L(x, y; Θ) = (1 − λ) L_s2a(ỹ_out | ỹ_in, x̃, x; θ_s2s, θ_s2a) + λ L_s2s(ỹ_out | ỹ_in, x; θ_s2s),   (6)

where Θ = (θ_s2s, θ_s2a) denotes the parameters of the model, and λ is the hyper-parameter that controls the trade-off between the two objectives.

During inference, our model generates in a similar way as the traditional seq2seq model, except that we additionally take a source token x_j as input, which denotes the current source token that serves as the input to the S2A module. After the current generation step, if the predicted action is SKIP or COPY, we move the source pointer one step forward (i.e., j++); otherwise, with action GEN, we keep j unchanged. In this way, we ensure that x_j serves the same functionality as x̃_i (i.e., the source token that should be considered at the current step) during training.

Discussion

Our model combines the advantages of both the seq2seq and the sequence tagging models, while alleviating their problems. On the one hand, while seq2seq models usually suffer from over-correction or omission of correct tokens in the source sentence (Tu et al. 2016), the proposed sequence-to-action module guarantees that the model will directly visit and consider the source tokens in x̃ as candidates for the next predictions before making the final predictions. On the other hand, compared with sequence tagging models, which depend on human-designed labels, lexical rules and vocabularies for generating new tokens, we only introduce three atomic actions without other constraints, which enhances the generality and diversity of the generated targets. As shown in the case studies, sequence tagging models usually fail when dealing with hard editions, e.g., reordering or generating a long range of tokens, and our model performs well on these cases by generating the target sentence from scratch auto-regressively. In addition, we choose the term "action" instead of "label" because the proposed sequence-to-action module does not need any actual labels.
Experiments

Datasets and Evaluation Metrics

We conduct experiments on both Chinese GEC and English GEC tasks. For the Chinese GEC task, we use the dataset of NLPCC-2018 Task 2 (Zhao et al. 2018) [1], which is the first, and to date the latest, benchmark dataset for Chinese GEC. Following the pre-processing settings in (Zhao and Wang 2020), we obtain 1.2M sentence pairs in total. Then 5k sentence pairs are randomly sampled from the whole parallel corpus as the development set, and the rest are used as the final training set. We use the official test set [2], which contains 2k sentences extracted from the PKU Chinese Learner Corpus. The test set also includes annotations that mark the golden edits of grammatical errors in each sentence. We tokenize the sentence pairs following (Zhao and Wang 2020). Specifically, we use the tokenization script of BERT [3] to tokenize the Chinese symbols and keep the non-Chinese symbols unchanged.

[1] http://tcci.ccf.org.cn/conference/2018/taskdata.php
[2] https://github.com/pkucoli/NLPCC2018 GEC
[3] https://github.com/google-research/bert

For English GEC, we take the datasets provided in the restricted track of the BEA-2019 GEC shared task (Bryant et al. 2019). Specifically, the training set is the concatenation of the Lang-8 corpus (Mizumoto et al. 2011), the FCE training set (Yannakoudakis, Briscoe, and Medlock 2011), NUCLE (Dahlmeier, Ng, and Wu 2013), and W&I+LOCNESS (Granger 2014; Bryant et al. 2019). We use the CoNLL-2013 (Ng et al. 2013) test set as the development set to choose the best-performing checkpoint, which is then evaluated on the benchmark test set CoNLL-2014 (Ng et al. 2014). We also conduct experiments by pre-training the model with 100M synthetic parallel examples provided by (Grundkiewicz, Junczys-Dowmunt, and Heafield 2019). All English sentences are preprocessed and tokenized with a 32K SentencePiece (Kudo and Richardson 2018) Byte Pair Encoding (BPE) vocabulary (Sennrich, Haddow, and Birch 2016).

We use the official MaxMatch (M2) scorer (Dahlmeier and Ng 2012) to evaluate the models in both the Chinese and English GEC tasks. Given a source sentence and a system hypothesis, M2 searches for a sequence of phrase-level edits between them that achieves the highest overlap with the gold-standard annotation. This optimal edit sequence is then used to calculate precision, recall and F0.5.

Model Configurations

For the basic seq2seq baseline, we adopt the original Transformer architecture. To compare with previous works, we use the base/big Transformer model for the Chinese/English GEC task respectively, and we follow the official Transformer settings: in the base/big model, the number of self-attention heads is set to 8/16, the embedding dimension is 512/1024, and the inner-layer dimension of the feed-forward layers is set to 2048/4096, respectively. For the loss function, we use the cross-entropy loss with label smoothing and set the epsilon value to 0.1. We apply dropout (Srivastava et al. 2014) on the encoders and decoders with probability 0.3. The implementations of our model and the baselines are based on fairseq (Ott et al. 2019) and will be available when published.

Moreover, as suggested in (Junczys-Dowmunt et al. 2018), for the English GEC task we not only report the results generated by single models, but also compare to ensembles of four models with different initializations.
Baselines

For the English GEC task, we compare the proposed S2A model to several representative systems, including three seq2seq baselines (Transformer Big, BERT-fuse (Kaneko et al. 2020), PRETLarge (Kiyono et al. 2019)), four sequence tagging models (LaserTagger (Malmi et al. 2019), PIE (Awasthi et al. 2019), GECToR (Omelianchuk et al. 2020), Seq2Edits (Stahlberg and Kumar 2020)), and a pipeline model, ESD+ESC (Chen et al. 2020). Specifically, for GECToR, we report their results when utilizing the pre-trained BERT model, the XLNet model (Yang et al. 2019), and the results that integrate three different pre-trained language models in an ensemble. We denote them as GECToR-BERT, GECToR-XLNet and GECToR (Ensemble) respectively.

For the Chinese GEC task, we compare S2A to several best performing systems evaluated on the NLPCC-2018 dataset, including three top systems in the NLPCC-2018 challenge (YouDao (Fu, Huang, and Duan 2018), AliGM (Zhou et al. 2018), BLCU (Ren, Yang, and Xun 2018)), the seq2seq baseline Char Transformer, and the current state-of-the-art method MaskGEC (Zhao and Wang 2020). Note that the proposed S2A model is orthogonal to MaskGEC, and we also report our results enhanced with the data augmentation method of MaskGEC.

In addition, we reproduce and conduct Chinese GEC experiments with the sequence tagging based method GECToR (Omelianchuk et al. 2020), which is originally designed for the English task. To adapt it to the Chinese GEC task, we utilize the pre-trained Chinese BERT model BERT-base-Chinese and use $KEEP, $REPLACE and $APPEND as the tags. We pre-train the model using the data augmentation method proposed in MaskGEC and then fine-tune it on the training set. We follow the default hyper-parameter settings, and we set the maximum number of iterations to 10 during inference.

Results

Model | Precision | Recall | F0.5
Transformer Big | 64.9 | 26.6 | 50.4
LaserTagger (Malmi et al. 2019) | 50.9 | 26.9 | 43.2
ESD+ESC (Chen et al. 2020) | 66.0 | 24.7 | 49.5
S2A Model | 65.9 | 28.9 | 52.5
Transformer Big (Ensemble) | 71.3 | 26.5 | 53.3
S2A Model (Ensemble) | 72.7 | 27.6 | 54.8

Transformer Big + pre-train | 69.4 | 42.5 | 61.5
PRETLarge (Kiyono et al. 2019) | 67.9 | 44.1 | 61.3
BERT-fuse (Kaneko et al. 2020) | 69.2 | 45.6 | 62.6
Seq2Edits (Stahlberg and Kumar 2020) | 63.0 | 45.6 | 58.6
PIE (Awasthi et al. 2019) | 66.1 | 43.0 | 59.7
GECToR-BERT (Omelianchuk et al. 2020) | 72.1 | 42.0 | 63.0
GECToR-XLNet (Omelianchuk et al. 2020) | 77.5 | 40.1 | 65.3
ESD+ESC (Chen et al. 2020) + pre-train | 72.6 | 37.2 | 61.0
S2A Model + pre-train | 74.0 | 38.9 | 62.7
Transformer Big + pre-train (Ensemble) | 74.1 | 37.5 | 62.0
GECToR (Ensemble) | 78.2 | 41.5 | 66.5
S2A Model + pre-train (Ensemble) | 74.9 | 38.4 | 62.9

Table 2: The results on the CoNLL-2014 English GEC task. The top group of results are generated by models trained only on the BEA-2019 training set. The bottom group of results are generated by models that are first pre-trained on a large amount of synthetic pseudo data and then fine-tuned. Here we bold the best results of single models and ensemble models separately.

The English GEC results are summarized in Table 2. In the top group of results without pre-training, the proposed model achieves 52.5/54.8 in F0.5 score with the single/ensemble model, which significantly outperforms the baselines. Compared to Transformer Big, our model achieves an improvement of 2.1 in F0.5 score with the single model, and an improvement of 1.5 with the ensemble model. Further, when pre-trained on the synthetic data, our method still achieves clear gains over most of the listed models. In the meantime, it slightly outperforms BERT-fuse, which uses a pre-trained BERT in its framework, a component not used in the S2A model. One can also notice that, benefiting from pre-trained language models, GECToR-XLNet performs the best among all the listed single models, and GECToR (Ensemble) performs the best among all the baselines. Nevertheless, the proposed S2A model uses neither pre-trained language models nor human-designed rules, and is only outperformed by the GECToR models.

Model | Precision | Recall | F0.5
YouDao (Fu, Huang, and Duan 2018) | 35.24 | 18.64 | 29.91
AliGM (Zhou et al. 2018) | 41.00 | 13.75 | 29.36
BLCU (Ren, Yang, and Xun 2018) | 41.73 | 13.08 | 29.02
Transformer | 36.57 | 14.27 | 27.86
S2A Model | 36.57 | 18.25 | 30.46

Transformer + MaskGEC (Zhao and Wang 2020) | 44.36 | 22.18 | 36.97
GECToR-BERT (Omelianchuk et al. 2020) + MaskGEC | 41.43 | 23.60 | 35.99
S2A Model + MaskGEC | 42.34 | 27.11 | 38.06

Table 3: The results on the NLPCC-2018 Chinese GEC task. The upper group of results are generated by models trained on the original NLPCC-2018 training data without data augmentation. The lower group of results are generated by models trained on the same training data but with the dynamic masking based data augmentation method proposed by MaskGEC (Zhao and Wang 2020). Here we bold the best results.

We proceed with the Chinese GEC results shown in Table 3. In the upper group, when trained on the NLPCC-2018 training set without data augmentation, our proposed model consistently outperforms all the other baseline systems. In the lower group, when trained with the data augmentation method, our model outperforms MaskGEC with an improvement of 1.09 in F0.5 score and achieves a new state-of-the-art result.

It is worth noting that GECToR, which performs the best on the English GEC task, degenerates when generalizing to the Chinese GEC task. Without a well-designed edit vocabulary for Chinese GEC, it fails to achieve performance comparable to a standard seq2seq model, even when equipped with a pre-trained Chinese BERT model. In comparison, the proposed S2A framework is language independent with good generality.

Accuracy of Action Prediction

In order to analyze the impact of the S2A module, we evaluate the generated actions. To compare with other models, we extract the action sequences from the results generated by them on the NLPCC-2018 test set. Next, we calculate the F1 values of the COPY and SKIP actions and of GEN action fragments. All scores are listed in Table 4.

Model | COPY | SKIP | GEN
Transformer | 96.6 | 37.2 | 13.9
GECToR | 96.8 | 43.3 | 12.1
S2A Model | 96.7 | 39.8 | 14.7

Table 4: F1 values of COPY and SKIP actions and GEN action fragments generated by Transformer, GECToR and the S2A model on the NLPCC-2018 test set.

As shown in Table 4, although it has a lower F0.5 value under the M2 score, GECToR performs the best at deciding whether each token should be copied or skipped from the original sentence. This is not surprising considering that GECToR is a sequence editing model. In the meantime, with an extra S2A module, our proposed framework can learn to explicitly process each token from the original sentence as GECToR does, and it performs better than Transformer in predicting COPY and SKIP. As for the action GEN, S2A and Transformer both generate new tokens in an auto-regressive way, which is more flexible than GECToR. As a consequence, they achieve higher accuracy when predicting the action GEN than GECToR, and the S2A model performs the best in this respect.

If we look further into one of the gold-standard annotations in NLPCC-2018, among all the corrections, 1864 replace erroneous characters, 130 reorder characters, 1008 insert missing characters, and 789 delete redundant characters. Except for the corrections that delete redundant characters, the other three types of corrections all involve both deletion and extra generation. As a consequence, improving the accuracy of predicting GEN may help more to boost the GEC quality, which is the advantage of the proposed S2A model and agrees with the results in Table 3.
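The paper does not spell out how action sequences are extracted from system outputs for this analysis. One simple possibility, shown below purely as an illustrative assumption (not the authors' procedure), is to align each hypothesis against its source with a longest-common-subsequence matcher and read COPY/SKIP/GEN off the alignment:

```python
import difflib

def extract_actions(source_tokens, hypothesis_tokens):
    """Hypothetical action extraction: matched tokens map to COPY, deleted
    source tokens to SKIP, and inserted hypothesis tokens to GEN.
    """
    matcher = difflib.SequenceMatcher(a=source_tokens, b=hypothesis_tokens)
    actions = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            actions += ["COPY"] * (i2 - i1)
        elif op == "delete":
            actions += ["SKIP"] * (i2 - i1)
        elif op == "insert":
            actions += ["GEN"] * (j2 - j1)
        else:  # replace: drop the old span, generate the new one
            actions += ["SKIP"] * (i2 - i1) + ["GEN"] * (j2 - j1)
    return actions
```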
Case Study

Next, we conduct case studies to intuitively demonstrate the advantages of the proposed S2A model. All the cases are picked from the NLPCC-2018 Chinese GEC test set, and the results are listed in Figure 2.

Example 1
SRC: 通过些[unk][unk],不但能回忆自己的人生,而且能准备......
TGT: 通过写[unk] [unk],不但能回忆自己的人生,而且能准备......
Transformer: 我们不但能回忆自己的人生,而且能准备......
GECToR: 通过写Bucket List,不但能回忆自己的人生,而且能预备......
Ours: 通过写[unk] [unk],不但能回忆自己的人生,而且能准备......
Translation: By writing [unk] [unk], not only can I recall my own life, but also prepare for ...

Example 2
SRC: 他就问了这个女子就找到了他的哥哥的家。
TGT: 他就问了这个女子,然后就找到了他的哥哥的家。
Transformer: 他就问了这个女子就找到了他的哥哥的家。
GECToR: 他就问了这个女子就找到了他的哥哥的家。
Ours: 他就问了这个女子,然后就找到了他的哥哥的家。
Translation: He asked the woman, and then he found his brother's home.

Figure 2: Case studies of the seq2seq (Transformer) model, the sequence tagging model (GECToR) and the proposed S2A model on the Chinese NLPCC-2018 test set. The translation of the golden target in each pair is also listed. In the original figure, tokens with wavy lines are errors, while tokens with underlines are the corrections made by the gold target or the hypotheses. [unk] denotes out-of-vocabulary words.

Comparison to Seq2seq Methods. In the upper part of Figure 2, we list an example on which a standard seq2seq model does not perform well. In this example, the seq2seq model suffers from word omissions. In comparison, the proposed S2A model rarely omits a long string of characters.

Comparison to Sequence Tagging Methods. In the bottom example of Figure 2, a relatively long insertion is necessary, which is very common in Chinese GEC. As shown in the results, the sequence tagging model is not able to produce the correct output in this challenging case, while the proposed S2A model can generate the expected result due to its flexibility, which justifies the generality of S2A.

Conclusion

In this paper, we propose a Sequence-to-Action (S2A) model based on the sequence-to-sequence framework for Grammatical Error Correction. We design tailored input formats so that in each prediction step, the model learns to COPY the next unvisited token in the original erroneous sentence, SKIP it, or GENerate a new token before it. The S2A model alleviates the over-correction problem, and does not rely on any human-designed rules or vocabularies, which provides a language-independent and flexible GEC framework. The superior performance on both Chinese and English GEC justifies the effectiveness and generality of the S2A framework. In the future, we will extend our idea to other text editing tasks such as text infilling, re-writing and text style transfer.

Acknowledgments

This research was supported by Anhui Provincial Natural Science Foundation (2008085J31).
References

Awasthi, A.; Sarawagi, S.; Goyal, R.; Ghosh, S.; and Piratla, V. 2019. Parallel Iterative Edit Models for Local Sequence Transduction. arXiv preprint arXiv:1910.02893.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR 2015.
Bryant, C.; Felice, M.; Andersen, Ø. E.; and Briscoe, T. 2019. The BEA-2019 Shared Task on Grammatical Error Correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, 52-75.
Chen, M.; Ge, T.; Zhang, X.; Wei, F.; and Zhou, M. 2020. Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction. In Proceedings of EMNLP 2020, 7162-7169.
Dahlmeier, D.; and Ng, H. T. 2012. Better Evaluation for Grammatical Error Correction. In Proceedings of NAACL-HLT 2012, 568-572.
Dahlmeier, D.; Ng, H. T.; and Wu, S. M. 2013. Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 22-31.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019, 4171-4186.
Fu, K.; Huang, J.; and Duan, Y. 2018. Youdao's Winning Solution to the NLPCC-2018 Task 2 Challenge: A Neural Machine Translation Approach to Chinese Grammatical Error Correction. In CCF International Conference on Natural Language Processing and Chinese Computing, 341-350. Springer.
Ge, T.; Wei, F.; and Zhou, M. 2018. Fluency Boost Learning and Inference for Neural Grammatical Error Correction. In Proceedings of ACL 2018, 1055-1065.
Granger, S. 2014. The Computer Learner Corpus: A Versatile New Source of Data for SLA Research. In Learner English on Computer, 3-18. Routledge.
Grundkiewicz, R.; Junczys-Dowmunt, M.; and Heafield, K. 2019. Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, 252-263.
Hinson, C.; Huang, H.-H.; and Chen, H.-H. 2020. Heterogeneous Recycle Generation for Chinese Grammatical Error Correction. In Proceedings of COLING 2020, 2191-2201.
Junczys-Dowmunt, M.; Grundkiewicz, R.; Guha, S.; and Heafield, K. 2018. Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task. In Proceedings of NAACL-HLT 2018, 595-606.
Kaneko, M.; Mita, M.; Kiyono, S.; Suzuki, J.; and Inui, K. 2020. Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction. In Proceedings of ACL 2020, 4248-4254.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In ICLR (Poster).
Kiyono, S.; Suzuki, J.; Mita, M.; Mizumoto, T.; and Inui, K. 2019. An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction. In Proceedings of EMNLP-IJCNLP 2019, 1236-1242.
Kudo, T.; and Richardson, J. 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. In Proceedings of EMNLP 2018: System Demonstrations, 66-71.
Malmi, E.; Krause, S.; Rothe, S.; Mirylenka, D.; and Severyn, A. 2019. Encode, Tag, Realize: High-Precision Text Editing. arXiv preprint arXiv:1909.01187.
Mani, A.; Palaskar, S.; Meripo, N. V.; Konam, S.; and Metze, F. 2020. ASR Error Correction and Domain Adaptation Using Machine Translation. In Proceedings of ICASSP 2020, 6344-6348.
Mizumoto, T.; Komachi, M.; Nagata, M.; and Matsumoto, Y. 2011. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. In Proceedings of IJCNLP 2011, 147-155.
Ng, H. T.; Wu, S. M.; Briscoe, T.; Hadiwinoto, C.; Susanto, R. H.; and Bryant, C. 2014. The CoNLL-2014 Shared Task on Grammatical Error Correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, 1-14.
Ng, H. T.; Wu, S. M.; Wu, Y.; Hadiwinoto, C.; and Tetreault, J. 2013. The CoNLL-2013 Shared Task on Grammatical Error Correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, 1-12.
Omelianchuk, K.; Atrasevych, V.; Chernodub, A.; and Skurzhanskyi, O. 2020. GECToR - Grammatical Error Correction: Tag, Not Rewrite. arXiv preprint arXiv:2005.12592.
Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; and Auli, M. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Ren, H.; Yang, L.; and Xun, E. 2018. A Sequence to Sequence Learning for Chinese Grammatical Error Correction. In CCF International Conference on Natural Language Processing and Chinese Computing, 401-410. Springer.
Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL 2016, 1715-1725.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56): 1929-1958.
Stahlberg, F.; and Kumar, S. 2020. Seq2Edits: Sequence Transduction Using Span-level Edit Operations. arXiv preprint arXiv:2009.11136.
Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling Coverage for Neural Machine Translation. In Proceedings of ACL 2016, 76-85.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention Is All You Need. CoRR, abs/1706.03762.
Xie, Z.; Genthial, G.; Xie, S.; Ng, A.; and Jurafsky, D. 2018. Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction. In Proceedings of NAACL-HLT 2018, 619-628.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems, 32.
Yannakoudakis, H.; Briscoe, T.; and Medlock, B. 2011. A New Dataset and Method for Automatically Grading ESOL Texts. In Proceedings of ACL-HLT 2011, 180-189.
Zhao, W.; Wang, L.; Shen, K.; Jia, R.; and Liu, J. 2019. Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data. In Proceedings of NAACL-HLT 2019, 156-165.
Zhao, Y.; Jiang, N.; Sun, W.; and Wan, X. 2018. Overview of the NLPCC 2018 Shared Task: Grammatical Error Correction. In Natural Language Processing and Chinese Computing, 439-445. Cham: Springer International Publishing.
Zhao, Z.; and Wang, H. 2020. MaskGEC: Improving Neural Grammatical Error Correction via Dynamic Masking. In Proceedings of AAAI 2020, 1226-1233.
Zhou, J.; Li, C.; Liu, H.; Bao, Z.; Xu, G.; and Li, L. 2018. Chinese Grammatical Error Correction Using Statistical and Neural Models. In CCF International Conference on Natural Language Processing and Chinese Computing, 117-128. Springer.
Zhou, W.; Ge, T.; Mu, C.; Xu, K.; Wei, F.; and Zhou, M. 2020. Improving Grammatical Error Correction with Machine Translation Pairs. In Findings of the Association for Computational Linguistics: EMNLP 2020, 318-328.
Appendix

Dataset Statistics

We list the statistics of all the datasets used in this paper in the following tables. Specifically, Table 5 and Table 6 include the detailed information for the datasets used in the Chinese GEC task and the English GEC task respectively.

Corpus | Split | #Sent.
Lang-8 | Train | 1.2M
Lang-8 | Valid | 5,000
PKU CLC | Test | 2,000

Table 5: Datasets in the NLPCC-2018 Task 2.

Corpus | Split | #Sent.
Lang-8 | Train | 1.1M
NUCLE | Train | 57K
FCE | Train | 32K
W&I+LOCNESS | Train | 34K
CoNLL-2013 | Valid | 1.3K
CoNLL-2014 | Test | 1.3K

Table 6: Datasets in the BEA-2019 Shared Task.

Training Details

Here we also report the training details of our experiments. For both the English and Chinese GEC tasks, we train the S2A model with 4 Nvidia V100 GPUs. Models are optimized with the Adam optimizer (Kingma and Ba 2015) with beta values of (0.9, 0.98)/(0.9, 0.998) for the English/Chinese task respectively.

For the English GEC task, the max tokens per GPU is set to 4096 with the update frequency set to 2. We set the learning rate to 1e-4 when training only on datasets from the BEA-2019 shared task, and we train for 30K updates. When synthetic data is used, we first pre-train our model on the synthetic data for 4 epochs with the max tokens per GPU set to 3072 and the update frequency set to 2; after that, we fine-tune it on the BEA-2019 training datasets with the same settings as when no synthetic data is introduced. We use the inverse sqrt learning rate decay scheme with 4k warm-up steps. The hyper-parameter λ in the English GEC task is set to 0.4.

For the Chinese GEC task, we use the polynomial learning rate decay scheme with 8k warm-up steps. The max tokens per GPU is set to 8192 and we set the learning rate to 4e-4. For the experiments without data augmentation, we train for 80K steps, while for the experiments with the dynamic masking based data augmentation technique, we train for 120K steps. The hyper-parameter λ in the Chinese GEC task is set to 0.05.
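As a purely illustrative rendering of the optimization setup above, the snippet below builds an Adam optimizer and an inverse-sqrt learning-rate schedule with linear warm-up in PyTorch; the function names are ours and the schedule is the textbook form, not the exact fairseq implementation used in the experiments.

```python
import torch

def build_optimizer_and_scheduler(model, lr=1e-4, betas=(0.9, 0.98),
                                  warmup_steps=4000):
    """Adam + inverse-sqrt decay with linear warm-up (English GEC setting).
    The Chinese setting instead uses a polynomial decay with 8k warm-up steps.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=betas)

    def inverse_sqrt(step):
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps        # linear warm-up to the peak lr
        return (warmup_steps / step) ** 0.5   # then decay proportional to 1/sqrt(step)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inverse_sqrt)
    return optimizer, scheduler
```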
Influence of λ

The λ value mentioned above in Equation (6) is chosen on the development set. To investigate the actual influence of the hyper-parameter λ that controls the trade-off between the two objectives in Equation (6), we conduct experiments on the English GEC task. We fix the rest of the hyper-parameters of the model and select the value of λ from [0, 0.05, 0.1, 0.2, 0.4, 0.6].

We plot the F0.5 scores of the S2A model and the Transformer (Big) baseline on the development and test sets (i.e., CoNLL-2013 (Ng et al. 2013) and CoNLL-2014 (Ng et al. 2014)) in Figure 3. As shown in Figure 3, the proposed S2A model performs stably with varying nonzero λ values and achieves the best results when λ = 0.4. We also find that when λ = 0, which means that the seq2seq loss L_s2s is omitted, the performance of the S2A model decreases drastically, indicating that the original seq2seq loss is crucial for predicting tokens.

[Figure 3: The F0.5 scores on the development set (CoNLL-2013) and the test set (CoNLL-2014) of the English GEC task over different λ values. The performance of the Transformer (Big) baseline is plotted as dashed lines.]
More Case Studies

In addition to the case studies in the main paper, we list six more cases in Figure 4. In the first example, a long string of characters is generated twice by the seq2seq model, resulting in a severe repetition problem. In the second example, the seq2seq model wrongly omits several originally correct words. These situations nearly never happen with GECToR or our S2A model.

The third example corresponds to the cases where over-correction happens for the seq2seq model. In this example, the correction produced by the seq2seq model means that "she always wanted to talk". In contrast, thanks to the guidance from the proposed sequence-to-action module, our model tends to follow the original structure, thus alleviating such biases in generation.

For the fourth and fifth examples, longer reorderings or insertions are necessary. GECToR, the sequence tagging model, essentially cannot handle these situations. In comparison, our S2A model performs as well as a seq2seq model.

Example 1
SRC: 譬如,“江陵端午节”很有名,在2005年时,联合国教科文组织把“江陵端午节”指定了非物质文化遗产,[unk]人类口传与无形绝作[unk]。
TGT: 譬如,“江陵端午节”就很有名,在2005年时,联合国教科文组织把“江陵端午节”指定为非物质文化遗产,[unk]人类口传与无形绝作[unk]。
Transformer: 譬如,“江陵端午节”很有名,在2005年时,联合国教科文组织把“江陵端午节”指定为非物质文化遗产,在2005年时,联合国教科文组织把“江陵端午节”指定为非物质文化遗产。
GECToR: 譬如,“江陵端午节”很有名,在2005年时,联合国教科文组织把“江陵端午节”指定为非物质文化遗产,(人类口传与无形绝作)。
Ours: 譬如,“江陵端午节”很有名,在2005年时,联合国教科文组织把“江陵端午节”指定为非物质文化遗产,[unk]人类口传与无形绝作[unk]。
Translation: For example, the “Gangneung Danoje Festival” is very famous. In 2005, UNESCO designated the “Gangneung Danoje Festival” as an intangible cultural heritage, [unk] an oral and intangible masterpiece [unk].

Example 2
SRC: 在上一学期的阅读与写作[unk]初级[unk]课上写一篇作文时,......
TGT: 在上一学期的阅读与写作[unk]初级[unk]课上写一篇作文时,......
Transformer: 在上一学期的阅读与写作时,......
GECToR: 在上一学期的阅读与写作(初级)课上写一篇作文时,......
Ours: 在上一学期的阅读与写作[unk]初级[unk]课上写一篇作文时,......
Translation: When writing an essay in the last semester's reading and writing [unk] elementary [unk] class, ...

Example 3
SRC: 回家的路上,她跟家人一直互想谈话。
TGT: 回家的路上,她跟家人一直互相谈话。
Transformer: 回家的路上,她一直想跟家人谈话。
GECToR: 回家的路上,她跟家人谈话。
Ours: 回家的路上,她跟家人一直互相聊天。
Translation: On the way home, she and her family kept talking to each other.

Example 4
SRC: 我过去有偏见对中国。
TGT: 我过去对中国有偏见。
Transformer: 我过去对中国有偏见。
GECToR: 我过去的对了。
Ours: 我过去对中国有偏见。
Translation: I used to have a prejudice against China.

Example 5
SRC: 光复节是为了纪念韩民族获得了解放与韩国独立的政府建立。
TGT: 光复节是为了纪念韩民族获得了解放与韩国独立的政府确立的节日。
Transformer: 光复节是为了纪念韩民族获得了解放与韩国独立的政府建立的节日。
GECToR: 光复节是为了纪念韩民族获得解放与韩国独立的政府而建立的。
Ours: 光复节是为了纪念韩民族获得了解放与韩国独立的政府建立的节日。
Translation: The Liberation Day is a festival established to commemorate the liberation of the Korean nation and the establishment of an independent government of Korea.

Figure 4: More case studies of the seq2seq (Transformer) model, the sequence tagging model (GECToR) and the proposed S2A model on the Chinese NLPCC-2018 test set. The translation of the golden target in each pair is also listed. In the original figure, tokens with wavy lines are errors, while tokens with underlines are the corrections made by the gold target or the hypotheses. [unk] denotes out-of-vocabulary words.