Towards Robust Neural Machine Translation

Yong Cheng*, Zhaopeng Tu*, Fandong Meng*, Junjie Zhai* and Yang Liu†
*Tencent AI Lab, China
†State Key Laboratory of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing Advanced Innovation Center for Language Resources
chengyong3001@gmail.com  {zptu, fandongmeng, jasonzhai}@tencent.com  liuyang2011@tsinghua.edu.cn

Abstract

Small perturbations in the input can severely distort intermediate representations and thus impact translation quality of neural machine translation (NMT) models. In this paper, we propose to improve the robustness of NMT models with adversarial stability training. The basic idea is to make both the encoder and decoder in NMT models robust against input perturbations by enabling them to behave similarly for the original input and its perturbed counterpart. Experimental results on Chinese-English, English-German and English-French translation tasks show that our approaches can not only achieve significant improvements over strong NMT systems but also improve the robustness of NMT models.

Input    tamen bupa kunnan zuochu weiqi AI.
Output   They are not afraid of difficulties to make Go AI.
Input    tamen buwei kunnan zuochu weiqi AI.
Output   They are not afraid to make Go AI.

Table 1: The non-robustness problem of neural machine translation. Replacing a Chinese word with its synonym (i.e., "bupa" → "buwei") leads to significant erroneous changes in the English translation. Both "bupa" and "buwei" can be translated to the English phrase "be not afraid of."

1 Introduction

Neural machine translation (NMT) models have advanced the state of the art by building a single neural network that can better learn representations (Cho et al., 2014; Sutskever et al., 2014). The neural network consists of two components: an encoder network that encodes the input sentence into a sequence of distributed representations, based on which a decoder network generates the translation with an attention model (Bahdanau et al., 2015; Luong et al., 2015). A variety of NMT models derived from this encoder-decoder framework have further improved the performance of machine translation systems (Gehring et al., 2017; Vaswani et al., 2017). NMT is capable of generalizing better to unseen text by exploiting word similarities in embeddings and capturing long-distance reordering by conditioning on larger contexts in a continuous way.

However, studies reveal that very small changes to the input can fool state-of-the-art neural networks with high probability (Goodfellow et al., 2015; Szegedy et al., 2014). Belinkov and Bisk (2018) confirm this finding by pointing out that NMT models are very brittle and easily falter when presented with noisy input. In NMT, due to the introduction of RNN and attention, each contextual word can influence the model prediction in a global context, which is analogous to the "butterfly effect." As shown in Table 1, although we only replace a source word with its synonym, the generated translation has been completely distorted. We investigate severe variations of translations caused by small input perturbations by replacing one word in each sentence of a test set with its synonym. We observe that 69.74% of translations have changed and the BLEU score is only 79.01 between the translations of the original inputs and the translations of the perturbed inputs, suggesting that NMT models are very sensitive to small perturbations in the input.

The vulnerability and instability of NMT models limit their applicability to a broader range of tasks, which require robust performance on noisy inputs. For example, simultaneous translation systems use automatic speech recognition (ASR) to transcribe input speech into a sequence of hypothesized words, which are subsequently fed to a translation system.
In this pipeline, ASR errors are presented as sentences with noisy perturbations (the same pronunciation but incorrect words), which is a significant challenge for current NMT models. Moreover, instability makes NMT models sensitive to misspellings and typos in text translation.

In this paper, we address this challenge with adversarial stability training for neural machine translation.

... system (Gehring et al., 2017). Related experimental analyses validate that our training approach can improve the robustness of NMT models.

2 Background

NMT is an end-to-end framework which directly optimizes the translation probability of a target sentence y = y_1, ..., y_N given its corresponding source sentence x = x_1, ..., x_M:

    P(y \mid x; \theta) = \prod_{n=1}^{N} P(y_n \mid y_{<n}, x; \theta)    (1)
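To make the factorization concrete, the sketch below computes the sentence-level negative log-likelihood from per-token conditional probabilities. It is a minimal illustration in PyTorch, not code from the paper; the function name and the toy numbers are ours.

```python
import torch

def sentence_nll(token_log_probs: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of one target sentence.

    token_log_probs holds log P(y_n | y_<n, x; theta) for each target
    position n, as produced by an NMT decoder. Since P(y|x; theta) is the
    product of these conditionals, its negative log is the negative sum.
    """
    return -token_log_probs.sum()

# Toy usage with made-up conditional probabilities for a 4-token target.
log_probs = torch.log(torch.tensor([0.4, 0.7, 0.9, 0.6]))
print(sentence_nll(log_probs))  # -(log 0.4 + log 0.7 + log 0.9 + log 0.6)
```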
[Figure 1: The architecture of NMT with adversarial stability training. The dark solid arrow lines represent the forward-pass information flow for the input sentence x, while the red dashed arrow lines are for the noisy input sentence x', which is transformed from x by adding small perturbations. The encoder maps x and x' to representations Hx and Hx'; a discriminator computes Linv(x, x'), and the decoder computes Ltrue(x, y) and Lnoisy(x', y).]

3 Approach

3.1 Overview

The goal of this work is to propose a general approach that makes NMT models more robust to input perturbations. Our basic idea is to maintain the consistency of behaviors through the NMT model for the source sentence x and its perturbed counterpart x'. As aforementioned, the NMT model contains two procedures for projecting a source sentence x to its target sentence y: the encoder is responsible for encoding x as a sequence of representations Hx, while the decoder outputs y with Hx as input. We aim at learning a perturbation-invariant encoder and decoder.

Figure 1 illustrates the architecture of our approach. Given a source sentence x, we construct a set of perturbed sentences N(x), in which each sentence x' is constructed by adding small perturbations to x. We require that x' is a subtle variation of x and that they have similar semantics. Given the input pair (x, x'), we have two expectations: (1) the encoded representation Hx' should be close to Hx; and (2) given Hx', the decoder should be able to generate the robust output y. To this end, we introduce two additional objectives to improve the robustness of the encoder and decoder:

• Linv(x, x') to encourage the encoder to output similar intermediate representations Hx and Hx' for x and x', so as to achieve an invariant encoder, which benefits outputting the same translations. We cast this objective in the adversarial learning framework.

• Lnoisy(x', y) to guide the decoder to generate the output y given the noisy input x', which is modeled as -log P(y|x'). It can also be defined as the KL divergence between P(y|x) and P(y|x'), which amounts to using P(y|x) to teach P(y|x').

As seen, the two introduced objectives aim to improve the robustness of the NMT model, so that it is free of high variances in the target outputs caused by small perturbations in the inputs. It is also natural to introduce the original training objective Ltrue(x, y) on x and y, which can guarantee good translation performance while keeping the stability of the NMT model.

Formally, given a training corpus S, the adversarial stability training objective is

    J(\theta) = \sum_{\langle x, y \rangle \in S} \Big[ L_{true}(x, y; \theta_{enc}, \theta_{dec}) + \alpha \sum_{x' \in N(x)} L_{inv}(x, x'; \theta_{enc}, \theta_{dis}) + \beta \sum_{x' \in N(x)} L_{noisy}(x', y; \theta_{enc}, \theta_{dec}) \Big]    (4)

where Ltrue(x, y) and Lnoisy(x', y) are calculated using Equation 3, and Linv(x, x') is the adversarial loss to be described in Section 3.3. α and β control the balance between the original translation task and the stability of the NMT model. θ = {θenc, θdec, θdis} are the trainable parameters of the encoder, the decoder, and the newly introduced discriminator used in adversarial learning. As seen, the parameters of the encoder θenc and the decoder θdec are trained to minimize both the translation loss Ltrue(x, y) and the stability losses (Lnoisy(x', y) and Linv(x, x')).

Since Lnoisy(x', y) evaluates the translation loss on the perturbed neighbour x' and its corresponding target sentence y, it means that we augment the training data with perturbed neighbours, which can potentially improve the translation performance. In this way, our approach not only makes the output of NMT models more robust, but also improves the performance on the original translation task.
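As a minimal sketch (assuming the per-term losses have already been computed elsewhere), the inner part of Equation (4) for one training pair might be combined as follows; the function name and the list-based interface are illustrative choices of ours, not the authors' code.

```python
def adversarial_stability_objective(l_true, l_inv_terms, l_noisy_terms,
                                    alpha=1.0, beta=1.0):
    """Combine the three terms of Equation (4) for one training pair (x, y).

    l_true:        translation loss Ltrue(x, y) on the original pair
    l_inv_terms:   adversarial losses Linv(x, x'), one per x' in N(x)
    l_noisy_terms: translation losses Lnoisy(x', y), one per x' in N(x)
    alpha, beta:   weights balancing translation quality against stability
    """
    return l_true + alpha * sum(l_inv_terms) + beta * sum(l_noisy_terms)
```

Summing this quantity over the training corpus S gives the full objective J(θ).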
In the following sections, we will first describe how to construct perturbed inputs with different strategies to fulfill different goals (Section 3.2), followed by the proposed adversarial learning mechanism for the perturbation-invariant encoder (Section 3.3). We conclude this section with the training strategy (Section 3.4).

3.2 Constructing Perturbed Inputs

At each training step, we need to generate a perturbed neighbour set N(x) for each source sentence x for adversarial stability training. In this paper, we propose two strategies to construct the perturbed inputs at different levels of representation.

The first approach generates perturbed neighbours at the lexical level. Given an input sentence x, we randomly sample some word positions to be modified. Then we replace the words at these positions with other words in the vocabulary according to the following distribution:

    P(x \mid x_i) = \frac{\exp\{\cos(E[x_i], E[x])\}}{\sum_{x \in V_x \setminus x_i} \exp\{\cos(E[x_i], E[x])\}}    (5)

where E[x_i] is the word embedding of word x_i, V_x \setminus x_i is the source vocabulary excluding the word x_i, and cos(E[x_i], E[x]) measures the similarity between the words x_i and x. Thus we can change a word to another word with similar semantics.

One potential problem of the above strategy is that it is hard to enumerate all possible positions and perturbation types to generate perturbed neighbours. Therefore, we propose a more general approach that modifies the sentence at the feature level. Given a sentence, we can obtain the word embedding of each word. We add Gaussian noise to a word embedding to simulate possible types of perturbations. That is,

    E[x'_i] = E[x_i] + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)    (6)

where the vector \epsilon is sampled from a Gaussian distribution with variance \sigma^2, and \sigma is a hyper-parameter. We simply add Gaussian noise to all word embeddings in x.

The proposed scheme is a general framework in which one can freely define the strategies to construct perturbed inputs; we just present two possible examples here. The first strategy is potentially useful when the training data contains noisy words, while the latter is a more general strategy to improve the robustness of common NMT models. In practice, one can design specific strategies for particular tasks. For example, we can replace correct words with their homonyms (same pronunciation but different meanings) to improve NMT models for simultaneous translation systems.
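The two perturbation strategies can be sketched in PyTorch as follows. This is an illustrative reading of Equations (5) and (6), not the authors' implementation; `embedding_weight` is assumed to be the (vocabulary size × dimension) source embedding matrix, and the number of replaced positions is left to the caller (the setup in Section 4.1 uses max(0.2|x|, 1)).

```python
import torch
import torch.nn.functional as F

def lexical_perturb(word_ids, embedding_weight, num_replace):
    """Lexical-level perturbation in the spirit of Equation (5): replace
    randomly chosen positions with vocabulary words sampled in proportion
    to exp(cosine similarity) with the original word's embedding."""
    word_ids = word_ids.clone()
    positions = torch.randperm(word_ids.numel())[:num_replace]
    with torch.no_grad():
        for pos in positions:
            original = embedding_weight[word_ids[pos]]            # (d,)
            sims = F.cosine_similarity(original.unsqueeze(0),
                                       embedding_weight, dim=-1)  # (V,)
            sims[word_ids[pos]] = float("-inf")   # exclude x_i itself: V_x \ x_i
            probs = torch.softmax(sims, dim=-1)   # exp(cos) / sum exp(cos)
            word_ids[pos] = torch.multinomial(probs, 1).item()
    return word_ids

def feature_perturb(word_embeddings, sigma=0.01):
    """Feature-level perturbation as in Equation (6): add Gaussian noise
    drawn from N(0, sigma^2 I) to every word embedding of the sentence."""
    return word_embeddings + sigma * torch.randn_like(word_embeddings)
```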
3.3 Adversarial Learning for the Perturbation-invariant Encoder

The goal of the perturbation-invariant encoder is to make the representations produced by the encoder indistinguishable when it is fed with a correct sentence x and with its perturbed counterpart x', which is directly beneficial to the output robustness of the decoder. We cast the problem in the adversarial learning framework (Goodfellow et al., 2014). The encoder serves as the generator G, which defines the policy that generates a sequence of hidden representations Hx given an input sentence x. We introduce an additional discriminator D to distinguish the representation of the perturbed input, Hx', from that of the original input, Hx. The goal of the generator G (i.e., the encoder) is to produce similar representations for x and x' that can fool the discriminator, while the discriminator D tries to correctly distinguish the two representations. Formally, the adversarial learning objective is

    L_{inv}(x, x'; \theta_{enc}, \theta_{dis}) = \mathbb{E}_{x \sim S}\left[-\log D(G(x))\right] + \mathbb{E}_{x' \sim N(x)}\left[-\log(1 - D(G(x')))\right]    (7)

The discriminator outputs a classification score given an input representation, and tries to push D(G(x)) towards 1 and D(G(x')) towards 0. The objective encourages the encoder to output similar representations for x and x', so that the discriminator fails to distinguish them.

The training procedure can be regarded as a min-max two-player game. The encoder parameters θenc are trained to maximize the loss function in order to fool the discriminator, while the discriminator parameters θdis are optimized to minimize this loss in order to improve its discriminating ability. For efficiency, we update both the encoder and the discriminator simultaneously at each iteration, rather than using the periodical training strategy that is commonly used in adversarial learning. Lamb et al. (2016) propose a similar idea, Professor Forcing, to make the behaviors of RNNs indistinguishable between training and sampling.
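Given discriminator scores for the two encodings, Equation (7) can be computed over a batch roughly as below. This is a hedged sketch with names of our choosing, and the small epsilon is only for numerical stability.

```python
import torch

def adversarial_inv_loss(d_real, d_fake, eps=1e-8):
    """Batch version of Equation (7).

    d_real: discriminator outputs D(G(x)) for representations Hx of the
            original sentences
    d_fake: discriminator outputs D(G(x')) for representations Hx' of the
            perturbed sentences
    The discriminator is trained to minimize this loss (pushing d_real
    towards 1 and d_fake towards 0), while the encoder is trained to
    increase it.
    """
    return (-torch.log(d_real + eps) - torch.log(1.0 - d_fake + eps)).mean()
```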
3.4 Training

As shown in Figure 1, our training objective involves three sets of model parameters for the three modules. We use mini-batch stochastic gradient descent to optimize our model. In the forward pass, besides a mini-batch of x and y, we also construct a mini-batch consisting of the perturbed neighbour x' and y. We propagate the information to calculate the three loss functions along the arrows in Figure 1. Then, gradients are collected to update the three sets of model parameters. Except for the gradients of Linv with respect to θenc, which are multiplied by −1, all gradients are back-propagated normally. Note that we update θdis and θenc simultaneously for training efficiency.
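One common way to realize the sign flip described above is a gradient-reversal layer placed between the encoder states and the discriminator. The paper does not spell out its implementation, so the sketch below (including the hypothetical usage in the comments) is only one way the simultaneous update could be coded.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -1 in the
    backward pass. Feeding grad_reverse(Hx) and grad_reverse(Hx') to the
    discriminator lets a single backward pass give the discriminator the
    ordinary gradients of Linv while the encoder receives the reversed
    (adversarial) gradients."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def grad_reverse(x):
    return GradReverse.apply(x)

# Hypothetical single training step (names are illustrative):
#   d_real = discriminator(grad_reverse(Hx))
#   d_fake = discriminator(grad_reverse(Hx_prime))
#   loss = l_true + alpha * adversarial_inv_loss(d_real, d_fake) + beta * l_noisy
#   loss.backward()   # one pass updates theta_enc, theta_dec and theta_dis
```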
4 Experiments

4.1 Setup

We evaluated our adversarial stability training on translation tasks for several language pairs, and report the 4-gram BLEU (Papineni et al., 2002) score as calculated by the multi-bleu.perl script.

Chinese-English. We used the LDC corpus consisting of 1.25M sentence pairs with 27.9M Chinese words and 34.5M English words. We selected the best model using the NIST 2006 set as the validation set (for hyper-parameter optimization and model selection). The NIST 2002, 2003, 2004, 2005, and 2008 datasets are used as test sets.

English-German. We used the WMT 14 corpus containing 4.5M sentence pairs with 118M English words and 111M German words. The validation set is newstest2013, and the test set is newstest2014.

English-French. We used the IWSLT corpus, which contains 0.22M sentence pairs with 4.03M English words and 4.12M French words. The IWSLT corpus is very dissimilar from the NIST and WMT corpora: it is collected from TED talks and inclined to spoken language, so we use it to verify our approaches on non-normative text. The IWSLT 2014 test set is taken as the validation set and the 2015 test set is used as the test set.

For English-German and English-French, we tokenize the English, German and French text using the tokenize.perl script. We follow Sennrich et al. (2016b) to split words into subword units. The numbers of merge operations in byte pair encoding (BPE) are set to 30K, 40K and 30K respectively for Chinese-English, English-German, and English-French. We report the case-sensitive tokenized BLEU score for English-German and English-French and the case-insensitive tokenized BLEU score for Chinese-English.

Our baseline system is an in-house NMT system. Following Bahdanau et al. (2015), we implement an RNN-based NMT model in which both the encoder and decoder are two-layer RNNs with residual connections between layers (He et al., 2016b). The gating mechanism of the RNNs is the gated recurrent unit (GRU) (Cho et al., 2014). We apply layer normalization (Ba et al., 2016) and dropout (Hinton et al., 2012) to the hidden states of the GRUs. Dropout is also added to the source and target word embeddings. We share the same matrix between the target word embeddings and the pre-softmax linear transformation (Vaswani et al., 2017). We update the model parameters using Adam SGD (Kingma and Ba, 2015). Its learning rate is initially set to 0.05 and varies according to the formula in Vaswani et al. (2017).

Our adversarial stability training initializes the model with the parameters trained by maximum likelihood estimation (MLE). We denote adversarial stability training based on lexical-level perturbations and feature-level perturbations as ASTlexical and ASTfeature, respectively. We only sample one perturbed neighbour x' ∈ N(x) for training efficiency. For the discriminator used in Linv, we adopt the CNN discriminator proposed by Kim (2014) to address the variable-length problem of the sequence generated by the encoder. In the CNN discriminator, the filter windows are set to 3, 4 and 5, and rectified linear units are applied after the convolution operations. We tune the hyper-parameters on the validation set through a grid search and find that the optimal values of both α and β are 1.0. The hyper-parameter σ of the Gaussian noise in Equation (6) is set to 0.01. The number of words replaced in the sentence x during lexical-level perturbation is max(0.2|x|, 1), where |x| is the length of x. The default beam size for decoding is 10.
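The CNN discriminator mentioned above could look roughly like the following Kim (2014)-style classifier over the variable-length sequence of encoder states. The filter windows (3, 4, 5) and the ReLU follow the description in the setup; the number of filters, the max-pooling and the sigmoid output layer are assumptions of this sketch rather than details given in the paper.

```python
import torch
import torch.nn as nn

class CNNDiscriminator(nn.Module):
    """Convolutions with windows 3, 4 and 5 over the encoder states, max-pooled
    over time and mapped to the probability that the representation comes from
    an unperturbed sentence."""

    def __init__(self, hidden_size, num_filters=64, windows=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden_size, num_filters, kernel_size=w, padding=w - 1)
             for w in windows]
        )
        self.out = nn.Linear(num_filters * len(windows), 1)

    def forward(self, states):
        # states: (batch, seq_len, hidden_size); Conv1d expects (batch, C, L).
        x = states.transpose(1, 2)
        pooled = [torch.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        return torch.sigmoid(self.out(torch.cat(pooled, dim=-1))).squeeze(-1)
```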
4.2 Translation Results

4.2.1 NIST Chinese-English Translation

System               Training     MT06   MT02   MT03   MT04   MT05   MT08
Shen et al. (2016)   MRT          37.34  40.36  40.93  41.37  38.81  29.23
Wang et al. (2017)   MLE          37.29  –      39.35  41.15  38.07  –
Zhang et al. (2018)  MLE          38.38  –      40.02  42.32  38.84  –
this work            MLE          41.38  43.52  41.50  43.64  41.58  31.60
this work            ASTlexical   43.57  44.82  42.95  45.05  43.45  34.85
this work            ASTfeature   44.44  46.10  44.07  45.61  44.06  34.94

Table 2: Case-insensitive BLEU scores on Chinese-English translation.

Table 2 shows the results on Chinese-English translation. Our strong baseline system significantly outperforms previously reported results on the Chinese-English NIST datasets obtained with RNN-based NMT. Shen et al. (2016) propose minimum risk training (MRT) for NMT, which directly optimizes model parameters with respect to BLEU scores. Wang et al. (2017) address the issue of severe gradient diffusion with linear associative units (LAU); their system is deep, with an encoder of 4 layers and a decoder of 4 layers. Zhang et al. (2018) propose to exploit both left-to-right and right-to-left decoding strategies for NMT to capture bidirectional dependencies. Compared with them, our NMT system trained by MLE outperforms their best models by around 3 BLEU points. We hope that the strong baseline systems used in this work make the evaluation convincing.

We find that introducing adversarial stability training into NMT brings substantial improvements over previous work (up to +3.16 BLEU points over Shen et al. (2016), up to +3.51 BLEU points over Wang et al. (2017), and up to +2.74 BLEU points over Zhang et al. (2018)) and over our system trained with MLE across all the datasets. Compared with our baseline system, ASTlexical achieves a +1.75 BLEU improvement on average. ASTfeature performs better, obtaining +2.59 BLEU points on average and up to +3.34 BLEU points on NIST08.

4.2.2 WMT 14 English-German Translation

System                      Architecture                   Training     BLEU
Shen et al. (2016)          Gated RNN with 1 layer         MRT          20.45
Luong et al. (2015)         LSTM with 4 layers             MLE          20.90
Kalchbrenner et al. (2017)  ByteNet with 30 layers         MLE          23.75
Wang et al. (2017)          DeepLAU with 4 layers          MLE          23.80
Wu et al. (2016)            LSTM with 8 layers             RL           24.60
Gehring et al. (2017)       CNN with 15 layers             MLE          25.16
Vaswani et al. (2017)       Self-attention with 6 layers   MLE          28.40
this work                   Gated RNN with 2 layers        MLE          24.06
this work                   Gated RNN with 2 layers        ASTlexical   25.17
this work                   Gated RNN with 2 layers        ASTfeature   25.26

Table 3: Case-sensitive BLEU scores on WMT 14 English-German translation.

In Table 3, we list existing NMT systems for comparison. All of these systems use the same WMT 14 English-German corpus. Except for Shen et al. (2016) and Wu et al. (2016), which adopt MRT and reinforcement learning (RL) respectively, all systems use MLE as the training criterion. All the systems except for Shen et al. (2016) are deep NMT models with no less than four layers. Google's neural machine translation (GNMT) (Wu et al., 2016) represents a strong RNN-based NMT system. Compared with the other RNN-based NMT systems except for GNMT, our baseline system with two layers achieves better performance.
When training our NMT system with ASTlexical, a significant improvement (+1.11 BLEU points) can be observed, and ASTfeature obtains slightly better performance. Our NMT system outperforms the state-of-the-art RNN-based NMT system, GNMT, by +0.66 BLEU points and performs comparably with Gehring et al. (2017), which is based on a CNN with 15 layers. Given that our approach can be applied to any NMT system, we expect that the adversarial stability training mechanism can further improve performance on top of more advanced NMT architectures. We leave this for future work.

4.2.3 IWSLT English-French Translation

Training     tst2014  tst2015
MLE          36.92    36.90
ASTlexical   37.35    37.03
ASTfeature   38.03    37.64

Table 4: Case-sensitive BLEU scores on IWSLT English-French translation.

Table 4 shows the results on IWSLT English-French translation. Compared with our strong baseline system trained by MLE, our models consistently improve translation performance on all datasets. ASTfeature achieves significant improvements on tst2015, while ASTlexical obtains comparable results. These results demonstrate that our approach maintains good performance on non-normative text.

4.3 Results on Synthetic Perturbed Data

In order to investigate the ability of our training approaches to deal with perturbations, we experiment with three types of synthetic perturbations (a toy sketch of the three operations follows the list):

• Swap: We randomly choose N positions in a sentence and then swap the chosen words with their right neighbours.

• Replacement: We randomly replace sampled words in the sentence with other words.

• Deletion: We randomly delete N words from each sentence in the dataset.
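The three operations can be sketched as below. This is our own toy implementation for illustration; details such as whether positions are sampled with or without replacement are assumptions, not specifications from the paper.

```python
import random

def swap_perturb(tokens, n_ops):
    """Swap: choose a position at random and swap the word with its right
    neighbour, repeated n_ops times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_ops):
        i = random.randrange(len(tokens) - 1)   # never pick the last word
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def replace_perturb(tokens, n_ops, vocabulary):
    """Replacement: overwrite randomly sampled positions with other words."""
    tokens = list(tokens)
    for _ in range(n_ops):
        tokens[random.randrange(len(tokens))] = random.choice(vocabulary)
    return tokens

def delete_perturb(tokens, n_ops):
    """Deletion: remove n_ops randomly chosen words (keeping at least one)."""
    tokens = list(tokens)
    for _ in range(min(n_ops, len(tokens) - 1)):
        del tokens[random.randrange(len(tokens))]
    return tokens
```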
Synthetic Type  Training     0 Op.  1 Op.  2 Op.  3 Op.  4 Op.  5 Op.
Swap            MLE          41.38  38.86  37.23  35.97  34.61  32.96
                ASTlexical   43.57  41.18  39.88  37.95  37.02  36.16
                ASTfeature   44.44  42.08  40.20  38.67  36.89  35.81
Replacement     MLE          41.38  37.21  31.40  27.43  23.94  21.03
                ASTlexical   43.57  40.53  37.59  35.19  32.56  30.42
                ASTfeature   44.44  40.04  35.00  30.54  27.42  24.57
Deletion        MLE          41.38  38.45  36.15  33.28  31.17  28.65
                ASTlexical   43.57  41.89  38.56  36.14  34.09  31.77
                ASTfeature   44.44  41.75  39.06  36.16  33.49  30.90

Table 5: Translation results on synthetic perturbations of the validation set in Chinese-English translation. "1 Op." denotes that we conduct one operation (swap, replacement or deletion) on the original sentence.

As shown in Table 5, our training approaches, ASTlexical and ASTfeature, consistently outperform MLE against perturbations for all numbers of operations. This means that our approaches have the capability of resisting perturbations. As the number of operations increases, the performance of MLE drops quickly. Although the performance of our approaches also drops, they consistently surpass MLE. For ASTlexical, with 0 operations the difference is +2.19 (43.57 vs. 41.38) for all synthetic types, but the differences are enlarged to +3.20, +9.39, and +3.12 respectively for the three types with 5 operations.

For the Swap and Deletion types, ASTlexical and ASTfeature perform comparably after more than four operations. Interestingly, ASTlexical performs significantly better than both MLE and ASTfeature after more than one operation of the Replacement type. This is because ASTlexical trains the model specifically on perturbed data constructed by replacing words, which agrees with the Replacement type. Overall, ASTlexical performs better than ASTfeature against perturbations after multiple operations. We speculate that this is because the perturbation method of ASTlexical and the synthetic perturbations are both discrete and thus more consistent with each other. Table 6 shows example translations of a Chinese sentence and its perturbed counterpart.

Source            zhongguo dianzi yinhang yewu guanli xingui jiangyu sanyue yiri qi shixing
Reference         china's new management rules for e-banking operations to take effect on march 1
MLE               china's electronic bank rules to be implemented on march 1
ASTlexical        new rules for business administration of china's electronic banking industry will come into effect on march 1 .
ASTfeature        new rules for business management of china's electronic banking industry to come into effect on march 1
Perturbed Source  zhongfang dianzi yinhang yewu guanli xingui jiangyu sanyue yiri qi shixing
MLE               china to implement new regulations on business management
ASTlexical        the new regulations for the business administrations of the chinese electronics bank will come into effect on march 1 .
ASTfeature        new rules for business management of china's electronic banking industry to come into effect on march 1

Table 6: Example translations of a source sentence and its perturbed counterpart obtained by replacing the Chinese word "zhongguo" with its synonym "zhongfang."

These findings indicate that we can construct specific perturbations for a particular task. For example, in simultaneous translation, an automatic speech recognition system usually generates wrong words with the same pronunciation as the correct words, which dramatically affects the quality of the machine translation system. We can therefore design specific perturbations aimed at this task.

4.4 Analysis

4.4.1 Ablation Study

Ltrue  Lnoisy  Ladv   BLEU
√      ×       ×      41.38
√      ×       √      41.91
×      √       ×      42.20
√      √       ×      42.93
√      √       √      43.57

Table 7: Ablation study of adversarial stability training ASTlexical on Chinese-English translation. "√" means the loss function is included in the training objective while "×" means it is not.

Our training objective function, Eq. (4), contains three loss functions. We perform an ablation study on Chinese-English translation to understand the importance of these loss functions, choosing ASTlexical as the example. As Table 7 shows, if we remove Ladv, the translation performance decreases by 0.64 BLEU points. When Lnoisy is excluded from the training objective, the result is a more significant drop of 1.66 BLEU points. Surprisingly, only using Lnoisy already leads to an increase of 0.88 BLEU points.

4.4.2 BLEU Scores over Iterations

[Figure 2: BLEU scores of ASTlexical and ASTfeature over iterations on the Chinese-English validation set.]

Figure 2 shows the changes in BLEU score over iterations for ASTlexical and ASTfeature. They behave nearly identically. Initialized with the model trained by MLE, their performance first drops rapidly and then starts to go up quickly.
Compared with the starting point, the maximal drop reaches up to about 7.0 BLEU points. The curves basically present a state of oscillation. We believe that introducing random perturbations and adversarial learning makes training less stable than MLE.

4.4.3 Learning Curves of Loss Functions

[Figure 3: Learning curves of the three loss functions Ltrue, Linv and Lnoisy over iterations on the Chinese-English validation set.]

Figure 3 shows the learning curves of the three loss functions, Ltrue, Linv and Lnoisy. We find that the loss values do not decrease steadily. Similar to Figure 2, there are still oscillations in the learning curves, although they do not change as sharply. We find that Linv converges to around 0.68 after about 100K iterations, which indicates that the discriminator outputs a probability of about 0.5 for both positive and negative samples and cannot distinguish them. Thus the encoder behaves nearly identically for x and its perturbed neighbour x'.

5 Related Work

Our work is inspired by two lines of research: (1) adversarial learning and (2) data augmentation.

Adversarial Learning. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and their derivatives have been widely applied in computer vision (Radford et al., 2015; Salimans et al., 2016) and natural language processing (Li et al., 2017; Yang et al., 2018). Previous work has constructed adversarial examples to attack trained networks and to make networks resist them, which has been shown to improve the robustness of networks (Goodfellow et al., 2015; Miyato et al., 2016; Zheng et al., 2016). Belinkov and Bisk (2018) introduce adversarial examples into the training data of character-based NMT models. In contrast, adversarial stability training aims to stabilize both the encoder and decoder in NMT models. We adopt adversarial learning to learn the perturbation-invariant encoder.

Data Augmentation. Data augmentation can also improve the robustness of NMT models. In NMT, there is a body of work that augments the training data with monolingual corpora (Sennrich et al., 2016a; Cheng et al., 2016; He et al., 2016a; Zhang and Zong, 2016). These approaches all leverage complex models such as inverse NMT models to generate translation equivalents for monolingual corpora, and then augment the parallel corpora with these pseudo corpora to improve NMT models. Some authors have recently endeavored to achieve zero-shot NMT by transferring knowledge from bilingual corpora of other language pairs (Chen et al., 2017; Zheng et al., 2017; Cheng et al., 2017) or from monolingual corpora (Lample et al., 2018; Artetxe et al., 2018). Our work differs significantly from these works: we do not resort to any complicated models to generate perturbed data and do not depend on extra monolingual or bilingual corpora. The approach we exploit is more convenient and easier to implement, and we focus on improving the robustness of NMT models.

6 Conclusion

We have proposed adversarial stability training to improve the robustness of NMT models. The basic idea is to train both the encoder and decoder to be robust to input perturbations by enabling them to behave similarly for the original input and its perturbed counterpart. We propose two approaches to construct perturbed data to adversarially train the encoder and stabilize the decoder. Experiments on Chinese-English, English-German and English-French translation tasks show that the proposed approach can improve both the robustness and the translation performance.

As our training framework is not limited to specific perturbation types, it is interesting to evaluate our approach on natural noise found in practical applications, such as homonyms in simultaneous translation systems. It is also worthwhile to further validate our approach on more advanced NMT architectures, such as CNN-based NMT (Gehring et al., 2017) and the Transformer (Vaswani et al., 2017).

Acknowledgments

We thank the anonymous reviewers for their insightful comments and suggestions. We also thank Xiaoling Li for analyzing experimental results and providing valuable examples. Yang Liu is supported by the National Key R&D Program of China (No. 2017YFB0202204), the National Natural Science Foundation of China (No. 61761166008, No. 61522204), the Beijing Advanced Innovation Center for Language Resources, and the NExT++ project supported by the National Research Foundation, Prime Minister's Office, Singapore under its IRC@Singapore Funding Initiative.
References

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In Proceedings of ICLR.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of ICLR.

Yun Chen, Yang Liu, Yong Cheng, and Victor O.K. Li. 2017. A teacher-student framework for zero-resource neural machine translation. In Proceedings of ACL.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Semi-supervised learning for neural machine translation. In Proceedings of ACL.

Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. 2017. Joint training for pivot-based neural machine translation. In Proceedings of IJCAI.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of ICML.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of NIPS.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In Proceedings of ICLR.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016a. Dual learning for machine translation. In Proceedings of NIPS.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Deep residual learning for image recognition. In Proceedings of CVPR.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2017. Neural machine translation in linear time. In Proceedings of ICML.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Alex M. Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Proceedings of NIPS.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In Proceedings of ICLR.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of EMNLP.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In Proceedings of ICLR.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2016. Distributional smoothing with virtual adversarial training. In Proceedings of ICLR.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL.

Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. In Proceedings of NIPS.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of ACL.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of ACL.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of ACL.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In Proceedings of ICML.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.

Mingxuan Wang, Zhengdong Lu, Jie Zhou, and Qun Liu. 2017. Deep neural machine translation with linear associative unit. In Proceedings of ACL.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Z. Yang, W. Chen, F. Wang, and B. Xu. 2018. Improving neural machine translation with conditional sequence generative adversarial nets. In Proceedings of NAACL.

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Proceedings of EMNLP.

Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. 2018. Asynchronous bidirectional decoding for neural machine translation. In Proceedings of AAAI.

Hao Zheng, Yong Cheng, and Yang Liu. 2017. Maximum expected likelihood estimation for zero-resource neural machine translation. In Proceedings of IJCAI.

Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. 2016. Improving the robustness of deep neural networks via stability training. In Proceedings of CVPR.