Modeling Coverage for Non-Autoregressive Neural Machine Translation
Yong Shan, Yang Feng*, Chenze Shao
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS), Beijing, China
University of Chinese Academy of Sciences, Beijing, China
shanyong18s@ict.ac.cn, fengyang@ict.ac.cn, shaochenze18z@ict.ac.cn
* Yang Feng is the corresponding author.

arXiv:2104.11897v1 [cs.CL] 24 Apr 2021

Abstract—Non-Autoregressive Neural Machine Translation (NAT) has achieved significant inference speedup by generating all tokens simultaneously. Despite its high efficiency, NAT usually suffers from two kinds of translation errors: over-translation (e.g. repeated tokens) and under-translation (e.g. missing translations), which eventually limit the translation quality. In this paper, we argue that these issues of NAT can be addressed through coverage modeling, which has proved useful in autoregressive decoding. We propose a novel Coverage-NAT that models the coverage information directly through a token-level coverage iterative refinement mechanism and a sentence-level coverage agreement, which can remind the model whether a source token has been translated and improve the semantic consistency between the translation and the source, respectively. Experimental results on WMT14 En↔De and WMT16 En↔Ro translation tasks show that our method can alleviate those errors and achieve strong improvements over the baseline system.

TABLE I: A German-English example of over-translation and under-translation errors in NAT. "Src" and "Ref" mean source and reference, respectively.

Src: Das technische Personal hat mir ebenfalls viel gegeben.
Ref: The technical staff has also brought me a lot.
NAT: The technical technical staff has has (also) brought me a lot.

I. INTRODUCTION

Neural Machine Translation (NMT) has achieved impressive performance over recent years [1]–[5]. NMT models are typically built on the encoder-decoder framework, where the encoder encodes the source sentence into distributed representations and the decoder generates the target sentence from the representations in an autoregressive manner: the decoding at step t is conditioned on the predictions of past steps. This autoregressive translation manner (AT) leads to high inference latency and restricts NMT applications.

Recent studies try to address this issue by replacing AT with non-autoregressive translation (NAT), which generates all tokens simultaneously [6]. However, the lack of sequential dependency between target tokens in NAT makes it difficult for the model to capture the highly multimodal distribution of the target translation. This phenomenon is termed the multimodality problem [6], and it usually causes two kinds of translation errors during inference: over-translation, where the same token is generated repeatedly at consecutive time steps, and under-translation, where the semantics of several source tokens are not fully translated [7]. Table I shows a German-English example, where "technical" and "has" are translated repeatedly while "also" is not translated at all.
Actually, over-translation and under-translation errors also exist in AT models. There they are usually addressed by explicitly modeling the coverage information step by step [8], [9], where "coverage" indicates whether a source token has been translated and the decoding process is completed once all source tokens are "covered". Despite their success in AT, existing NAT remedies for those two errors [7], [10], [11] have not explored coverage modeling. It is difficult for NAT to model coverage because of its non-autoregressive nature: step t cannot know whether a source token has been translated by other steps, since all target tokens are generated simultaneously during decoding.

In this paper, we propose a novel Coverage-NAT to address this challenge by directly modeling coverage information in NAT. Our model works at both the token level and the sentence level. At the token level, we propose a token-level coverage iterative refinement mechanism (TCIR), which models the token-level coverage information with an iterative coverage layer and refines the source exploitation iteratively. Specifically, we replace the top layer of the decoder with the iterative coverage layer and compute the coverage vector in parallel. In each iteration, the coverage vector tracks the attention history of the previous iteration and then adjusts the attention of the current iteration, which encourages the model to concentrate more on the untranslated parts and hence reduces over-translation and under-translation errors. In contrast to the step-wise manner of [8], TCIR models the token-level coverage information in an iteration-wise manner, which does not hurt non-autoregressive generation. At the sentence level, we propose a sentence-level coverage agreement (SCA) as a regularization term. Inspired by the intuition that a good translation should "cover" the source semantics at the sentence level, SCA first computes the sentence representations of the translation and of the source sentence and then constrains the semantic agreement between them. Thus, our model can also improve the sentence-level coverage of the translation.
We evaluate our model on four translation tasks (WMT14 En↔De, WMT16 En↔Ro). Experimental results show that our method outperforms the baseline system by up to +2.75 BLEU with a competitive speedup (about 10×), and with the rescoring technique it further achieves near-Transformer performance on the WMT16 dataset while being 5.22× faster. The main contributions of our method are as follows:
• To alleviate the over-translation and under-translation errors, we introduce coverage modeling into NAT.
• We propose to model the coverage information at both the token level and the sentence level, through iterative refinement and semantic agreement respectively, both of which are effective and achieve strong improvements.

II. BACKGROUND

Recently, neural machine translation has achieved impressive improvements with autoregressive decoding. Despite its high performance, AT models suffer from long inference latency. NAT [6] was proposed to address this problem by parallelizing the decoding process.

The NAT architecture can be formulated in an encoder-decoder framework [2]. As in AT models, the NAT encoder takes the embeddings of source tokens as inputs and outputs the contextual source representations E_enc, which are further used to compute the target length T and the inter-attention at the decoder side.

Multi-Head Attention: The attention mechanism used in Transformer and NAT is multi-head attention (Attn), which can be generally formulated as the weighted sum of value vectors V using query vectors Q and key vectors K:

    A = softmax(QK^T / √d_model),  Attn(Q, K, V) = AV    (1)

where d_model is the dimension of the hidden states. There are three kinds of multi-head attention in NAT: multi-head self attention, multi-head inter-attention and multi-head positional attention. Refer to [6] for details.

Encoder: A stack of L identical layers. Each layer consists of a multi-head self attention and a position-wise feed-forward network (FFN). The FFN consists of two linear transformations with a ReLU activation [5]:

    FFN(x) = max(0, xW_1 + b_1)W_2 + b_2    (2)

Note that we omit layer normalization, dropout and residual connections for simplicity. Besides, a length predictor is used to predict the target length.

Decoder: A stack of L identical layers. Each layer consists of a multi-head self attention, a multi-head inter-attention, a multi-head positional attention, and a position-wise feed-forward network.

Given the input sentence in the source language X = {x_1, x_2, ..., x_n}, NAT models generate a sentence in the target language Y = {y_1, y_2, ..., y_T}, where n and T are the lengths of the source and target sentence, respectively. The translation probability is defined as:

    P(Y|X, θ) = ∏_{t=1}^{T} p(y_t|X, θ)    (3)

The cross-entropy loss is applied to minimize the negative log-likelihood:

    L_mle(θ) = − ∑_{t=1}^{T} log p(y_t|X, θ)    (4)

During training, the target length T is usually set to the reference length. During inference, T is determined by the target length predictor, and the translation is then obtained by taking the token with the maximum likelihood at each step:

    ŷ_t = argmax_{y_t} p(y_t|X, θ)    (5)
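To make the background concrete, the following sketch spells out Eq. (1)-(5) in PyTorch-style code. It is illustrative only: single-head attention, no layer normalization, residual connections or dropout, and all function names are our own rather than anything from the paper or from Fairseq.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, d_model):
    """Eq. (1): A = softmax(QK^T / sqrt(d_model)); Attn(Q, K, V) = AV."""
    A = torch.softmax(Q @ K.transpose(-2, -1) / d_model ** 0.5, dim=-1)
    return A @ V, A

def position_wise_ffn(x, W1, b1, W2, b2):
    """Eq. (2): FFN(x) = max(0, xW1 + b1)W2 + b2."""
    return F.relu(x @ W1 + b1) @ W2 + b2

def nat_greedy_decode(logits):
    """Eq. (3)-(5): because the translation probability factorizes over
    positions, every y_t is the argmax of its own distribution, taken in
    parallel. logits: (T, vocab_size), one distribution per target step."""
    return logits.argmax(dim=-1)
```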
FFN consists of two linear transformations to phrase-based SMT [12] and AT [8], where the decoder with a ReLU activation [5]. maintains a coverage vector to indicate whether a source token is translated or not and then pays more attention to FFN(x) = max(0, xW1 + b1 )W2 + b2 (2) the untranslated parts. Benefit from the autoregressive nature, AT models can easily model the coverage information step Note that we omit layer normalization, dropout and residual by step. Generally, it is a step-wise refinement of source connections for simplicity. Besides, a length predictor is used exploitation with the assistance of coverage vector. However, to predict the target length. it encounters difficulties in non-autoregressive decoding. Since Decoder: A stack of L identical layers. Each layer consists all target tokens are generated simultaneously, step t cannot of a multi-head self attention, a multi-head inter-attention, know if a source token has been translated by other steps a multi-head positional attention, and a position-wise feed- during decoding. forward network. The key to solving this problem of NAT lies in: (1) How Given the input sentence in the source language X = to compute the coverage vector under the non-autoregressive {x1 , x2 , · · · , xn }, NAT models generate a sentence in the setting? (2) How to use the coverage vector to refine the source target language Y = {y1 , y2 , · · · , yT } , where n and T are exploitation under the non-autoregressive setting? To address
B. Token-Level Coverage Iterative Refinement

Token-level coverage information has been applied to phrase-based SMT [12] and AT [8], where the decoder maintains a coverage vector to indicate whether a source token has been translated and then pays more attention to the untranslated parts. Benefiting from the autoregressive nature, AT models can easily model the coverage information step by step; essentially, this is a step-wise refinement of the source exploitation with the assistance of the coverage vector. However, it encounters difficulties in non-autoregressive decoding: since all target tokens are generated simultaneously, step t cannot know whether a source token has been translated by other steps during decoding.

The key to solving this problem for NAT lies in two questions: (1) How can the coverage vector be computed under the non-autoregressive setting? (2) How can the coverage vector be used to refine the source exploitation under the non-autoregressive setting? To address them, we propose a novel token-level coverage iterative refinement mechanism. Our basic idea is to replace the top layer of the decoder with an iterative coverage layer, where we model the coverage information to refine the source exploitation in an iteration-wise manner. The iterative coverage layer also contains a multi-head self attention, a multi-head inter-attention and an FFN.

For question (1), we maintain an attention precedence, which means that earlier steps have priority to attend to source tokens and later steps should not scramble for parts that are already "covered". We compute the coverage vector C^k from the inter-attention weights of the previous iteration. At step t, we accumulate the attention that past steps paid to x_i as C^k_{t,i}, which represents the coverage of x_i:

    C^k_{t,i} = min( ∑_{t'=0}^{t−1} A^{k−1}_{t',i}, 1 )    (6)

where k is the iteration number and A^{k−1}_{t',i} is the inter-attention weight from step t' to x_i in the previous iteration. We use the min function to ensure that no coverage value exceeds 1.

For question (2), we use C^k as a bias on the inter-attention so that step t pays more attention to source tokens left untranslated by past steps in the previous iteration. A full iteration is given by:

    Ĥ^k = Attn(H^{k−1}, H^{k−1}, H^{k−1})    (7)
    B^k = 1 − C^k    (8)
    A^k = softmax( Ĥ^k E_enc^T / √d_model + λ × B^k )    (9)
    H^k = FFN(A^k E_enc)    (10)

where H^k is the hidden states of iteration k, λ ∈ R^1 is trainable and initialized to 1, and × denotes scalar multiplication. In Equation 7, we compute the self attention and store the output as Ĥ^k. In Equations 8-9, we first compute 1 − C^k so that source tokens with lower coverage obtain a higher bias, and then integrate the bias B^k into the inter-attention weights through the trainable scaling factor λ. Finally, we output the hidden states of iteration k in Equation 10. In this way, each step is discouraged from attending to source tokens heavily attended by past steps in the previous iteration and pushed to concentrate on the source tokens attended less, so the source exploitation is refined iteratively with token-level coverage.

We use the hidden states and the inter-attention weights of the (L−1)-th layer as H^0 and A^0, respectively. After K iterations, we use H^K as the output of the decoder.
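A minimal sketch of how Eq. (6)-(10) could be implemented is given below. It is single-head, unbatched, and omits residual connections and layer normalization; the class and argument names are ours, not those of the released code.

```python
import torch
import torch.nn as nn

class IterativeCoverageLayer(nn.Module):
    """Sketch of the token-level coverage iterative refinement (Eq. 6-10).
    Shapes: H is (tgt_len, d_model), enc_out is (src_len, d_model),
    attention weights are (tgt_len, src_len)."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads=1)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
        self.scale = nn.Parameter(torch.ones(1))  # lambda, initialized to 1
        self.d_model = d_model

    def coverage(self, prev_attn):
        """Eq. (6): coverage of source token i at target step t is the clamped
        sum of attention paid to i by *earlier* target steps in the previous
        iteration (exclusive cumulative sum along the target axis)."""
        cum = prev_attn.cumsum(dim=0) - prev_attn  # exclude step t itself
        return cum.clamp(max=1.0)

    def forward(self, H_prev, A_prev, enc_out, num_iterations):
        H, A = H_prev, A_prev
        for _ in range(num_iterations):
            # Eq. (7): self attention over the previous iteration's states
            H_hat, _ = self.self_attn(H.unsqueeze(1), H.unsqueeze(1), H.unsqueeze(1))
            H_hat = H_hat.squeeze(1)
            # Eq. (8): bias is large where coverage is low
            B = 1.0 - self.coverage(A)
            # Eq. (9): biased inter-attention over the source representations
            scores = H_hat @ enc_out.t() / self.d_model ** 0.5 + self.scale * B
            A = torch.softmax(scores, dim=-1)
            # Eq. (10): refreshed decoder states from the re-weighted source
            H = self.ffn(A @ enc_out)
        return H
```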
C. Sentence-Level Coverage Agreement

In AT models, the sentence-level agreement between the source side and the target side has been exploited by Yang et al. [13], who constrain the agreement between the source sentence and the reference sentence. However, this only enhances the representations of the source and target sides; it does not improve the semantic consistency between the source sentence and the translation.

Different from them, we consider the sentence-level agreement from a coverage perspective. Intuitively, a translation "covers" the source semantics at the sentence level if its representation is as close as possible to that of the source sentence. Thus, we propose to constrain the sentence-level agreement between the source sentence and the translation. For the source sentence, we compute the sentence representation by treating the whole sentence as a bag of words. Specifically, we first map the source embeddings into the target latent space with a linear transformation, and then use mean pooling to obtain the sentence representation Ē_src:

    Ē_src = Mean(ReLU(E_src W_s))    (11)

where E_src = {E_src,1, ..., E_src,n} are the word embeddings of the source sentence, W_s ∈ R^{d_model×d_model} is the parameter of the linear transformation and ReLU is the activation function. For the translation, we need to calculate the expectation of the hypothesis at each position, which takes all possible translation results into consideration. Although it is inconvenient for AT to calculate this expectation by multinomial sampling [14], we can do it easily in NAT thanks to the token-independence assumption:

    E_hyp,t = p_t W_e    (12)

where p_t ∈ R^{d_vocab} is the probability vector of the hypothesis at step t and W_e ∈ R^{d_vocab×d_model} is the weight matrix of the target-side word embedding. Then, we calculate the sentence representation of the translation by mean pooling:

    Ē_hyp = Mean({E_hyp,1, ..., E_hyp,T})    (13)

Finally, we take the L2 distance between the source sentence and the translation as the sentence-level coverage agreement. Note that we use a scaling factor 1/√d_model to stabilize the training:

    L_coverage = L2(Ē_src, Ē_hyp) / √d_model    (14)

D. Two-Phase Optimization

In our approach, we first pre-train the proposed model to convergence with the basic loss functions, and then fine-tune it with the SCA and the basic loss functions.

Pre-training: During the pre-training phase, we optimize the basic loss functions for NAT:

    L = L_mle + α L_length    (15)

where L_length is the loss of the target length predictor [6] and α is a hyper-parameter controlling the weight of L_length.

Fine-tuning: During the fine-tuning phase, we optimize the SCA and the basic loss functions jointly:

    L = L_mle + α L_length + β L_coverage    (16)

where β is a hyper-parameter controlling the weight of L_coverage. We have also tried training the model with all loss functions jointly from scratch; the result is worse than fine-tuning. We believe this is because the SCA interferes with the optimization of translation at the beginning of training.
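The SCA loss of Eq. (11)-(14) reduces to a few tensor operations, sketched below under our own naming; this is a sketch of the described computation, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCoverageAgreement(nn.Module):
    """Sketch of the sentence-level coverage agreement (Eq. 11-14).
    src_embed: (src_len, d_model) source word embeddings;
    probs: (tgt_len, vocab) decoder output distributions;
    tgt_embed_weight: (vocab, d_model) target embedding matrix W_e."""

    def __init__(self, d_model):
        super().__init__()
        self.src_proj = nn.Linear(d_model, d_model, bias=False)  # W_s
        self.d_model = d_model

    def forward(self, src_embed, probs, tgt_embed_weight):
        # Eq. (11): bag-of-words source representation in the target space
        e_src = F.relu(self.src_proj(src_embed)).mean(dim=0)
        # Eq. (12): expected embedding of the hypothesis at each position
        e_hyp_t = probs @ tgt_embed_weight
        # Eq. (13): mean-pool over target positions
        e_hyp = e_hyp_t.mean(dim=0)
        # Eq. (14): scaled L2 distance between the two sentence vectors
        return torch.dist(e_src, e_hyp, p=2) / self.d_model ** 0.5

# During fine-tuning (Eq. 16) this term is added to the basic losses, e.g.
#   loss = loss_mle + 0.1 * loss_length + 0.5 * sca(src_embed, probs, W_e)
# with alpha = 0.1 and beta = 0.5 as in the paper's main experiments.
```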
IV. EXPERIMENT SETUP

Datasets: We evaluate the effectiveness of our proposed method on the WMT14 En↔De (4.5M pairs) and WMT16 En↔Ro (610k pairs) datasets. For WMT14 En↔De, we employ newstest2013 and newstest2014 as the development and test sets. For WMT16 En↔Ro, we employ newsdev2016 and newstest2016 as the development and test sets. We tokenize all sentences into subword units [18] with 32k merge operations and share the vocabulary between the source and target sides.

Metrics: We evaluate model performance with case-sensitive BLEU [19] for the En-De datasets and case-insensitive BLEU for the En-Ro datasets. Latency is computed as the average per-sentence decoding time (ms) on the full test set of WMT14 En→De without mini-batching, measured on 1 Nvidia RTX 2080Ti GPU.

Baselines: We take the Transformer model [5] as our AT baseline as well as the teacher model. We choose the following models as our NAT baselines: (1) NAT-FT, which equips vanilla NAT with fertility fine-tuning [6]. (2) NAT-Base, the NAT implementation of Fairseq^1, which removes the multi-head positional attention and feeds the decoder a sequence of "⟨unk⟩" tokens instead of copying the source embeddings. (3) NAT-IR, which iteratively refines the translation multiple times [15]. (4) Reinforce-NAT, which introduces a sequence-level objective using reinforcement learning [10]. (5) NAT-REG, which proposes two regularization terms to alleviate translation errors [7]. (6) BoN-NAT, which minimizes the bag-of-ngrams difference between the translation and the reference [11]. (7) Hint-NAT, which leverages hints from AT models to enhance NAT [16]. (8) FlowSeq-large, which integrates generative flow into NAT [17].

Model Configurations: We implement our approach on NAT-Base and closely follow the settings of [6], [15]. For all experiments, we use the base configuration (d_model=512, d_hidden=2048, n_layer=6, n_head=8). In the main experiments, we set the number of iterations during both training (K_train) and decoding (K_dec) to 5, α to 0.1 and β to 0.5.

Training: We train the AT teacher on the original corpus, construct the distillation corpus with sequence-level knowledge distillation [20], and train the NAT models on the distillation corpus. We choose Adam [21] as the optimizer. Each mini-batch contains 64k tokens. During pre-training, we stop training when the number of training steps exceeds 400k for WMT14 En↔De and 180k for WMT16 En↔Ro. The learning rate warms up to 5e-4 in the first 10k steps and then decays under the inverse square-root schedule. During fine-tuning, we fix the learning rate to 1e-5 and stop training when the number of training steps exceeds 1k. We run all experiments on 4 Nvidia RTX 2080Ti GPUs and choose the best checkpoint by BLEU.

Inference: During inference, we employ length parallel decoding (LPD), which generates several candidates in parallel and re-ranks them with the AT teacher [6]. During post-processing, we remove tokens that are generated repeatedly.

^1 https://github.com/pytorch/fairseq
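For illustration, one possible shape of the length parallel decoding step is sketched below. The two callables are assumed interfaces, not real Fairseq APIs: `nat_translate(src, tgt_len)` is taken to return a hypothesis decoded with a forced target length, and `teacher_score(src, hyp)` its log-probability under the autoregressive teacher.

```python
def length_parallel_decode(nat_translate, teacher_score, src,
                           predicted_len, num_candidates=9):
    """Sketch of LPD with teacher rescoring: decode several target lengths
    around the predicted one in parallel, then keep the candidate the AT
    teacher scores highest."""
    half = num_candidates // 2
    lengths = [max(predicted_len + d, 1) for d in range(-half, num_candidates - half)]
    candidates = [nat_translate(src, length) for length in lengths]
    scores = [teacher_score(src, hyp) for hyp in candidates]
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```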
V. RESULTS AND ANALYSIS

A. Main Results

Table II shows our main results on the WMT14 En↔De and WMT16 En↔Ro datasets. Coverage-NAT achieves strong BLEU improvements over NAT-Base: +2.75, +2.34, +1.27 and +1.33 BLEU on WMT14 En→De, WMT14 De→En, WMT16 En→Ro and WMT16 Ro→En, respectively. Moreover, Coverage-NAT surpasses most NAT baselines in BLEU with a competitive speedup over Transformer (10.04×). It is worth mentioning that Coverage-NAT achieves higher improvements on the WMT14 datasets than on the WMT16 datasets. The reason may be that it is harder for NAT to align to the long source sentences that are more frequent in WMT14, and Coverage-NAT explicitly models the coverage information, which leads to higher improvements there. With LPD over 9 candidates, Coverage-NAT gains +5.10, +5.43, +2.91 and +1.56 BLEU over NAT-FT (rescoring 10) on WMT14 En→De, WMT14 De→En, WMT16 En→Ro and WMT16 Ro→En respectively, which is close to the performance of Transformer while being 5.22× faster on the WMT16 En↔Ro tasks.

TABLE II: The BLEU (%) and the speedup on the test sets of the WMT14 En↔De and WMT16 En↔Ro tasks.

Models                      | WMT14 En→De | WMT14 De→En | WMT16 En→Ro | WMT16 Ro→En | Latency   | Speedup
Transformer [5]             | 27.19       | 31.25       | 32.80       | 32.61       | 276.88 ms | 1.00×
Existing NAT Implementations
NAT-FT [6]                  | 17.69       | 21.47       | 27.29       | 29.06       | /         | 15.60×
NAT-FT (rescoring 10) [6]   | 18.66       | 22.41       | 29.02       | 30.76       | /         | 7.68×
NAT-Base                    | 18.60       | 22.70       | 28.78       | 29.00       | 19.36 ms  | 14.30×
NAT-IR (i_dec = 5) [15]     | 20.26       | 23.86       | 28.86       | 29.72       | /         | 3.11×
Reinforce-NAT [10]          | 19.15       | 22.52       | 27.09       | 27.93       | /         | 10.73×
NAT-REG [7]                 | 20.65       | 24.77       | /           | /           | /         | 27.60×
BoN-NAT [11]                | 20.90       | 24.61       | 28.31       | 29.29       | /         | 10.77×
Hint-NAT [16]               | 21.11       | 25.24       | /           | /           | /         | 30.20×
FlowSeq-large [17]          | 23.72       | 28.39       | 29.73       | 30.72       | /         | /
Our NAT Implementations
Coverage-NAT                | 21.35       | 25.04       | 30.05       | 30.33       | 27.58 ms  | 10.04×
Coverage-NAT (rescoring 9)  | 23.76       | 27.84       | 31.93       | 32.32       | 53.08 ms  | 5.22×

B. Ablation Study

As Table III shows, we conduct an ablation study on the WMT16 En→Ro development set to show the effectiveness of each module in Coverage-NAT. First, we evaluate the effect of the number of iterations during training (K_train); in the ablation study, K_train and K_dec are the same. As K_train increases, the BLEU first drops slightly and then increases continuously, which demonstrates that TCIR needs enough iterations to learn effective source exploitation with the help of token-level coverage information. Because of the trade-off between performance and latency, we set K_train to 5 in our main experiments, although K_train = 15 brings more gains. Second, we evaluate the effect of different SCA weights. When we set β to 0, 1 and 0.5, the BLEU increases by +0.63, +1.07 and +1.20, respectively. This shows that a moderate β is important for the effect of SCA.

TABLE III: Ablation study on the WMT16 En→Ro development set. We show the effects of different numbers of training iterations (K_train) and different SCA weights (β).

Model        | K_train | β   | BLEU (%)      | Latency (ms)
NAT-Base     | /       | /   | 28.96         | 20.51
Coverage-NAT | 1       | /   | 28.77 (-0.19) | 20.82
             | 3       | /   | 29.20 (+0.24) | 24.84
             | 5       | /   | 29.33 (+0.37) | 28.34
             | 10      | /   | 29.45 (+0.49) | 40.80
             | 15      | /   | 29.61 (+0.65) | 48.90
             | 5       | 0   | 29.59 (+0.63) | 28.34
             | 5       | 1   | 30.03 (+1.07) | 28.34
             | 5       | 0.5 | 30.16 (+1.20) | 28.34
C. Effects of Iterations During Decoding

As Figure 2 shows, we use Coverage-NAT (w/o SCA, K_train = 5) to decode the WMT14 En→De development set with different numbers of iterations during decoding (K_dec). As K_dec grows, the BLEU increases and reaches its maximum near K_train, while the latency increases continuously. The reason the performance does not keep increasing may be that our model is trained to provide the best coverage at K_train, and this consistency is broken when K_dec moves away from K_train. Our model surpasses NAT-Base when K_dec = 3 and achieves almost the maximum BLEU when K_dec = 4, which indicates that it can work well with fewer iterations and lower latency than K_train during decoding.

Fig. 2. The BLEU and the latency with different numbers of decoding iterations (K_dec) on the WMT14 En→De development set. The results are from Coverage-NAT (w/o SCA, K_train = 5). The model reaches the maximum BLEU when K_dec = K_train and achieves almost the maximum improvement when K_dec = 4, which indicates that it can work well with fewer iterations and lower latency.
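Since the iterative coverage layer is an ordinary loop over iterations, K_dec can be changed at inference time without retraining. With the hypothetical IterativeCoverageLayer sketched after Section III-B (our naming, not the released code), this is just a different argument value:

```python
# trained with num_iterations=5; decode with fewer iterations for lower latency
decoder_states = coverage_layer(H_prev, A_prev, enc_out, num_iterations=4)
```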
Fig. 3. The BLEU with different source lengths on the WMT14 En→De development set. Compared to NAT-Base, Coverage-NAT is closer to Transformer, with a slighter decline in "[10,50)" and a greater increase in "[50,80)".

D. Analysis of Repeated Tokens

We study the ratio of repeated tokens removed during post-processing to see how our methods improve the translation quality. We split the WMT14 En→De development set equally into a short part and a long part according to the sentence length, translate both with the NAT models, and score the ratio of repeated tokens during post-processing. As shown in Table IV, NAT-Base suffers from over-translation errors especially on the long sentences, which is in line with [11]: as the sentence length grows, it becomes harder for NAT to align to the source, which leads to more translation errors and eventually damages the translation quality. However, the ratio of repeated tokens drops markedly in Coverage-NAT, by 2.16% and 4.53% on short and long sentences respectively, which shows that our model achieves higher improvements as the sentence length becomes longer. Moreover, we remove the SCA from Coverage-NAT and compare it with NAT-Base; this leads to 0.65% and 1.84% decreases of repeated tokens on short and long sentences, respectively. The results prove that both token-level and sentence-level coverage can reduce over-translation errors and that sentence-level coverage contributes more than token-level coverage.

TABLE IV: The ratio of repeated tokens (%) for translations on the WMT14 En→De development set. Over-translation errors grow as the sentence becomes longer, and the SCA achieves higher improvements than the TCIR.

Model        | Short        | Long          | All
NAT-Base     | 5.47         | 12.05         | 10.23
Coverage-NAT | 3.31 (-2.16) | 7.52 (-4.53)  | 6.36 (-3.87)
  – SCA      | 4.82 (-0.65) | 10.21 (-1.84) | 8.72 (-1.51)
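For reference, the two small routines below show one way to compute the repeated-token statistic of Table IV and the de-duplication post-processing step; the exact definitions used in the paper are not given, so both are our reading.

```python
def repeated_token_ratio(hypotheses):
    """Percentage of tokens that repeat their immediate predecessor
    (our reading of the statistic reported in Table IV)."""
    repeats, total = 0, 0
    for sent in hypotheses:
        tokens = sent.split()
        total += len(tokens)
        repeats += sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return 100.0 * repeats / max(total, 1)

def deduplicate(sentence):
    """Post-processing step from Section IV: drop consecutive repeats."""
    out = []
    for tok in sentence.split():
        if not out or tok != out[-1]:
            out.append(tok)
    return " ".join(out)
```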
E. Subjective Evaluation of Under-Translation

We conduct a subjective evaluation to validate our improvements on under-translation errors. For each sentence, two human evaluators are asked to select, from the translations of three anonymous systems, the one with the fewest under-translation errors. We evaluate a total of 300 sentences sampled randomly from the WMT14 De→En test set and then calculate the under-translation metric, i.e. the frequency with which each system is selected; the higher the metric, the fewer the under-translation errors. Table V shows the results. Note that there are some tied selections, so the sum of the three scores does not equal 1. The results show that Coverage-NAT produces fewer under-translation errors than NAT-Base and that both token-level and sentence-level coverage modeling are effective in reducing them.

TABLE V: Subjective evaluation of under-translation errors. The higher the metric, the fewer the errors.

Model                  | Under-Translation Metric
NAT-Base               | 58.9%
Coverage-NAT (w/o SCA) | 67.1%
Coverage-NAT           | 70.6%

F. Effects of Source Length

To evaluate the effect of sentence length, we split the WMT14 En→De development set into different buckets according to the source length and compare the BLEU performance of our model with the other baselines. Figure 3 shows the results. As the sentence length grows, the performance of NAT-Base first drops continuously, then rises slightly, and finally declines heavily on very long sentences. In contrast, Transformer keeps a high performance until the sentence length becomes very long, which indicates that translating very long sentences is genuinely difficult. Compared to NAT-Base, the result of Coverage-NAT is closer to Transformer, with a slighter decline in "[10,50)" and a greater increase in "[50,80)". The result of Coverage-NAT (w/o SCA) lies between Coverage-NAT and NAT-Base. This demonstrates that our model generates better translations as the sentence becomes longer and that both token-level and sentence-level coverage modeling are effective in improving the translation quality, in line with Table IV.
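The per-length breakdown of Figure 3 can be reproduced with a short script along the following lines; the bucket edges are read off the figure, and the use of sacrebleu as the scorer is our assumption for illustration rather than the paper's stated tooling.

```python
import sacrebleu  # assumed convenient BLEU implementation for this sketch

def bleu_by_source_length(sources, hypotheses, references,
                          edges=(0, 10, 20, 30, 40, 50, 80)):
    """Group sentences into source-length buckets and score BLEU per bucket,
    mirroring the analysis behind Figure 3."""
    buckets = {}
    for src, hyp, ref in zip(sources, hypotheses, references):
        n = len(src.split())
        lo = max(e for e in edges if e <= n)  # lower edge of this bucket
        hyps, refs = buckets.setdefault(lo, ([], []))
        hyps.append(hyp)
        refs.append(ref)
    return {lo: sacrebleu.corpus_bleu(hyps, [refs]).score
            for lo, (hyps, refs) in sorted(buckets.items())}
```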
G. Case Study

We present several translation examples from the WMT14 De→En test set in Table VI, including the source, the reference and the translations given by Transformer, NAT-Base, Coverage-NAT (w/o SCA) and Coverage-NAT. As the table shows, NAT-Base suffers from severe over-translation (e.g. "other other") and under-translation (e.g. "important") errors. Our Coverage-NAT can alleviate these errors and generate higher-quality translations. The differences between Coverage-NAT (w/o SCA) and Coverage-NAT further prove the effectiveness of sentence-level coverage.

TABLE VI: Translation examples from the WMT14 De→En test set. We use underlines to mark over-translation errors and wavy lines to mark under-translation errors; the text in brackets is added to indicate the missing parts rather than the exact translation. Our model can alleviate those errors and generate better translations.

Source: Edis erklärte , Coulsons Aktivitäten bei dieser Story folgten demselben Muster wie bei anderen wichtigen Persönlichkeiten , etwa dem früheren Innenminister David Blunkett .
Reference: Mr Edis said Coulson 's involvement in the story followed the same pattern as with other important men , such as former home secretary David Blunkett .
Transformer: Edis explained that Coulson ' s activities in this story followed the same pattern as those of other important personalities , such as former Interior Minister David Blunkett .
NAT-Base: Edis explained that Coulson ' s activities in this story followed the same pattern as other other (important) person personalities such as former former (Inter)ior Minister David Blunkett .
Coverage-NAT (w/o SCA): Edis explained that Coulson ' s activities in this story followed the same pattern as other other important personalities , such as former Interior Minister David Blunkett .
Coverage-NAT: Edis explained that Coulson ' s activities in this story followed the same pattern as with other important personalities , such as former Interior Minister David Blunkett .

VI. RELATED WORK

Gu et al. [6] introduce non-autoregressive neural machine translation to accelerate inference. Recent studies have tried to bridge the gap between non-autoregressive and autoregressive models. Lee et al. [15] refine the translation iteratively, feeding the outputs of the decoder back as inputs in the next iteration. Ghazvininejad et al. [22] propose a mask-predict method that generates translations over several decoding iterations; [23], [24] are based on iterative refinement, too. These methods refine the translation by decoding iteratively, while our TCIR refines the source exploitation by iteratively modeling the coverage within an individual layer. Besides, Libovický and Helcl [25] introduce a connectionist temporal classification loss to guide the NAT model. Shao et al. [10], [11] train NAT with sequence-level objectives and a bag-of-ngrams loss. Wang et al. [7] propose NAT-REG with two auxiliary regularization terms to alleviate over-translation and under-translation errors. Guo et al. [26] enhance the input of the decoder with target-side information. Some works introduce latent variables to guide generation [17], [23], [27], [28]. Ran et al. [29] and Bao et al. [30] consider reordering information in decoding. Li et al. [16] use the AT model to guide NAT training. Wang et al. [31] and Ran et al. [32] further propose semi-autoregressive methods.
To alleviate the over-translation and under-translation errors, coverage modeling is commonly used in SMT [12] and NMT [8], [9], [33]–[35]. Tu et al. [8] and Mi et al. [9] model the coverage in a step-wise manner and employ it as additional source features, which cannot be applied to NAT directly. In contrast, our TCIR models the coverage in an iteration-wise manner and employs it as attention biases, which does not hurt non-autoregressive generation. Li et al. [33] introduce a coverage-based feature into NMT. Zheng et al. [34], [35] explicitly model the translated and untranslated contents.

Agreement-based learning in NMT has been explored in [36]–[39]. Yang et al. [13] propose a sentence-level agreement between the source and the reference, while our SCA considers the sentence-level agreement between the source and the translation. Besides, Tu et al. [40] propose to reconstruct the source from the translation, which improves translation adequacy but cannot model the sentence-level coverage directly.

VII. CONCLUSION

In this paper, we propose a novel Coverage-NAT which explicitly models the coverage information at the token level and the sentence level to alleviate the over-translation and under-translation errors in NAT. Experimental results show that our model achieves strong BLEU improvements with a competitive speedup over the baseline system. Further analysis proves that it can effectively reduce translation errors.
REFERENCES

[1] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in EMNLP, 2014, pp. 1724–1734.
[2] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[3] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2014.
[4] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv: Computation and Language, 2016.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[6] J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher, "Non-autoregressive neural machine translation," in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
[7] Y. Wang, F. Tian, D. He, T. Qin, C. Zhai, and T.-Y. Liu, "Non-autoregressive machine translation with auxiliary regularization," in AAAI, vol. 33, 2019, pp. 5377–5384.
[8] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li, "Modeling coverage for neural machine translation," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 76–85.
[9] H. Mi, B. Sankaran, Z. Wang, and A. Ittycheriah, "Coverage embedding models for neural machine translation," in EMNLP, 2016, pp. 955–960.
[10] C. Shao, Y. Feng, J. Zhang, F. Meng, X. Chen, and J. Zhou, "Retrieving sequential information for non-autoregressive neural machine translation," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3013–3024.
[11] C. Shao, J. Zhang, Y. Feng, F. Meng, and J. Zhou, "Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[12] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, 2003, pp. 48–54.
[13] M. Yang, R. Wang, K. Chen, M. Utiyama, E. Sumita, M. Zhang, and T. Zhao, "Sentence-level agreement for neural machine translation," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3076–3082.
[14] S. Chatterjee and N. Cancedda, "Minimum error rate training by sampling the translation lattice," in EMNLP, 2010, pp. 606–615.
[15] J. Lee, E. Mansimov, and K. Cho, "Deterministic non-autoregressive neural sequence modeling by iterative refinement," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1173–1182.
[16] Z. Li, Z. Lin, D. He, F. Tian, Q. Tao, W. Liwei, and T.-Y. Liu, "Hint-based training for non-autoregressive machine translation," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5712–5717.
[17] X. Ma, C. Zhou, X. Li, G. Neubig, and E. Hovy, "FlowSeq: Non-autoregressive conditional sequence generation with generative flow," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 4273–4283.
[18] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725.
[19] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[20] Y. Kim and A. M. Rush, "Sequence-level knowledge distillation," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1317–1327.
[21] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[22] M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer, "Mask-predict: Parallel decoding of conditional masked language models," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6114–6123.
[23] R. Shu, J. Lee, H. Nakayama, and K. Cho, "Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[24] J. Gu, C. Wang, and J. Zhao, "Levenshtein transformer," in Advances in Neural Information Processing Systems, 2019, pp. 11179–11189.
[25] J. Libovický and J. Helcl, "End-to-end non-autoregressive neural machine translation with connectionist temporal classification," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3016–3021.
[26] J. Guo, X. Tan, D. He, T. Qin, L. Xu, and T.-Y. Liu, "Non-autoregressive neural machine translation with enhanced decoder input," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3723–3730.
[27] L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer, "Fast decoding in sequence models using discrete latent variables," in ICML, 2018.
[28] N. Akoury, K. Krishna, and M. Iyyer, "Syntactically supervised transformers for faster neural machine translation," in ACL, 2019, pp. 1269–1281.
[29] Q. Ran, Y. Lin, P. Li, and J. Zhou, "Guiding non-autoregressive neural machine translation decoding with reordering information," in AAAI, 2021.
[30] Y. Bao, H. Zhou, J. Feng, M. Wang, S. Huang, J. Chen, and L. Li, "Non-autoregressive transformer by position learning," arXiv preprint arXiv:1911.10677, 2019.
[31] C. Wang, J. Zhang, and H. Chen, "Semi-autoregressive neural machine translation," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 479–488.
[32] Q. Ran, Y. Lin, P. Li, and J. Zhou, "Learning to recover from multi-modality errors for non-autoregressive neural machine translation," in ACL, 2020.
[33] Y. Li, T. Xiao, Y. Li, Q. Wang, C. Xu, and J. Zhu, "A simple and effective approach to coverage-aware neural machine translation," in ACL, 2018, pp. 292–297.
[34] Z. Zheng, H. Zhou, S. Huang, L. Mou, X. Dai, J. Chen, and Z. Tu, "Modeling past and future for neural machine translation," Transactions of the Association for Computational Linguistics, vol. 6, pp. 145–157, 2018.
[35] Z. Zheng, S. Huang, Z. Tu, X. Dai, and C. Jiajun, "Dynamic past and future for neural machine translation," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 930–940.
[36] L. Liu, M. Utiyama, A. Finch, and E. Sumita, "Agreement on target-bidirectional neural machine translation," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 411–416.
[37] Y. Cheng, S. Shen, Z. He, W. He, H. Wu, M. Sun, and Y. Liu, "Agreement-based joint training for bidirectional attention-based neural machine translation," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, ser. IJCAI'16. AAAI Press, 2016, pp. 2761–2767.
[38] Z. Zhang, S. Wu, S. Liu, M. Li, M. Zhou, and T. Xu, "Regularizing neural machine translation by target-bidirectional agreement," in AAAI, vol. 33, 2019, pp. 443–450.
[39] M. Al-Shedivat and A. Parikh, "Consistency by agreement in zero-shot neural machine translation," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1184–1197.
[40] Z. Tu, Y. Liu, L. Shang, X. Liu, and H. Li, "Neural machine translation with reconstruction," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.