Neural Machine Translation with Key-Value Memory-Augmented Attention
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, Di Wang
Tencent AI Lab
{fandongmeng, zptu, yongcheng, gavinwu, jasonzhai, yuekuiyang, diwang}@tencent.com

Abstract

Although attention-based Neural Machine Translation (NMT) has achieved remarkable progress in recent years, it still suffers from issues of repeating and dropping translations. To alleviate these issues, we propose a novel key-value memory-augmented attention model for NMT, called KVMEMATT. Specifically, we maintain a timely updated key-memory to keep track of attention history and a fixed value-memory to store the representation of the source sentence throughout the whole translation process. Via nontrivial transformations and iterative interactions between the two memories, the decoder focuses on more appropriate source word(s) for predicting the next target word at each decoding step, and can therefore improve the adequacy of translations. Experimental results on Chinese⇒English and WMT17 German⇔English translation tasks demonstrate the superiority of the proposed model.

1 Introduction

The past several years have witnessed promising progress in Neural Machine Translation (NMT) [Cho et al., 2014; Sutskever et al., 2014], in which the attention model plays an increasingly important role [Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017]. Conventional attention-based NMT encodes the source sentence as a sequence of vectors with a bi-directional recurrent neural network (RNN) [Schuster and Paliwal, 1997], and then generates a variable-length target sentence with another RNN and an attention mechanism. The attention mechanism plays a crucial role in NMT, as it shows which source word(s) the decoder should focus on in order to predict the next target word. However, there is no mechanism to effectively keep track of attention history in conventional attention-based NMT. This may let the decoder ignore past attention information, and can lead to the issues of repeating or dropping translations [Tu et al., 2016]. For example, conventional attention-based NMT may repeatedly translate some source words while mistakenly ignoring other words.

A number of recent efforts have explored ways to alleviate the inadequate translation problem. For example, Tu et al. [2016] employ a coverage vector as a lexical-level indicator of whether a source word has been translated or not. Meng et al. [2016] and Zheng et al. [2018] take the idea one step further, and directly model translated and untranslated source contents by operating on the attention context (i.e., the partial source content being translated) instead of on the attention probability (i.e., the chance that the corresponding source word is translated). Specifically, Meng et al. [2016] capture translation status with an interactive attention augmented with an NTM [Graves et al., 2014] memory. Zheng et al. [2018] separate the modeling of translated (PAST) and untranslated (FUTURE) source content from the decoder states by introducing two additional decoder adaptive layers.

Meng et al. [2016] propose a generic framework of memory-augmented attention, which is independent of the specific architecture of the NMT model. However, the original mechanism uses only a single memory both to represent the source sentence and to track attention history. Such overloaded usage of memory representations makes training the model difficult [Rocktäschel et al., 2017]. In contrast, Zheng et al. [2018] try to ease the difficulty of representation learning by separating the PAST and FUTURE functions from the decoder states, but their approach is designed specifically for one precise NMT architecture.
In this work, we combine the advantages of both models by leveraging the generic memory-augmented attention framework, while easing memory training by maintaining separate representations for the two expected functions. Partially inspired by [Miller et al., 2016], we split the memory into two parts: a dynamical key-memory, updated along with the update-chain of the decoder state to keep track of attention history, and a fixed value-memory that stores the representation of the source sentence throughout the whole translation process. In each decoding step, we conduct multiple rounds of memory operations layer by layer, which gives the decoder a chance of re-attention by considering the "intermediate" attention results achieved in early stages. This structure allows the model to leverage possibly complex transformations and interactions between 1) the key-value memory pair in the same layer, as well as 2) the key (and value) memories across different layers.

Experimental results on the Chinese⇒English translation task show that an attention model augmented with a single-layer key-value memory improves both translation and attention performance over not only a standard attention model, but also over the existing NTM-augmented attention model [Meng et al., 2016]. Its multi-layer counterpart further improves model performance consistently. We also validate our model on bidirectional German⇔English translation tasks, which demonstrates the effectiveness and generalizability of our approach.
(Figure: the memory-augmented encoder-decoder architecture, with source words x_j encoded into annotations h_j, decoder GRU states s_t producing target words y_t, and a memory H_t that is read from and written to at each decoding step.)

2 Background

Attention-based NMT.  Given a source sentence x = {x_1, x_2, ..., x_n} and a target sentence y = {y_1, y_2, ..., y_m}, NMT models the translation probability word by word:

    p(y \mid x) = \prod_{t=1}^{m} P(y_t \mid y_{<t}, x)    (1)
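To make the factorization in Eq. 1 concrete, the following minimal NumPy sketch scores a target sentence given the decoder's per-step output distributions. The array layout, function name and inputs are illustrative assumptions for this transcription, not the authors' code.

    import numpy as np

    def sentence_log_prob(step_distributions, target_ids):
        # log p(y|x) = sum_t log P(y_t | y_<t, x), i.e. Eq. 1 in log space.
        # step_distributions: (m, |V|) array; row t holds P(. | y_<t, x).
        # target_ids: length-m sequence of target word indices y_1 .. y_m.
        target_ids = np.asarray(target_ids)
        steps = np.arange(len(target_ids))
        return float(np.log(step_distributions[steps, target_ids]).sum())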
Figure 2: The architecture of KVMEMATT-based NMT. The encoder (on the left) and the decoder (on the right) are connected by the KVMEMATT (in the middle). In particular, we show the KVMEMATT at decoding time step t with multiple rounds of memory access.

At decoding time step t, in round r the "query" state q_t is used to address the key-memory K_(t)^(r-1) to generate the "intermediate" attention vector ã_t^r:

    \tilde{a}_t^r = \mathrm{Address}(q_t, K_{(t)}^{(r-1)})    (7)

which is subsequently used as the guidance for reading from the value-memory V to get the "intermediate" context representation c̃_t^r of the source sentence:

    \tilde{c}_t^r = \mathrm{Read}(\tilde{a}_t^r, V)    (8)

which works together with the "query" state q_t to get the "intermediate" hidden state:

    \tilde{s}_t^r = \mathrm{GRU}(q_t, \tilde{c}_t^r)    (9)

Finally, we use the "intermediate" hidden state to update K_(t)^(r-1), recording the "intermediate" attention status, to finish one round of operations:

    K_{(t)}^{(r)} = \mathrm{Update}(\tilde{s}_t^r, K_{(t)}^{(r-1)})    (10)

After the last round R of operations, we use {s̃_t^R, c̃_t^R, ã_t^R} as the resulting states to compute the final prediction via Eq. 1. The key-memory K_(t)^(R) is then carried over to the next decoding step t+1, becoming K_(t+1)^(0). The details of the Address, Read and Update operations are described in the next section.

As seen, the KVMEMATT mechanism can update the KEY-MEMORY along with the update-chain of the decoder state to keep track of attention status, while maintaining a fixed VALUE-MEMORY that stores the representation of the source sentence. At each decoding step, KVMEMATT generates the context representation of the source sentence via nontrivial transformations between the KEY-MEMORY and the VALUE-MEMORY, and records the attention status via interactions between the two memories. This structure allows the model to leverage possibly complex transformations and interactions between the two memories, and lets the decoder choose more appropriate source context for the word prediction at each decoding step. Clearly, KVMEMATT can subsume the coverage models [Tu et al., 2016; Mi et al., 2016] and the interactive attention model [Meng et al., 2016] as special cases, while being more generic and powerful, as empirically verified in the experiment section.

3.1 Memory Access Operations

In this section, we detail the memory access operations from round r-1 to round r at decoding time step t.

Key-Memory Addressing.  Formally, K_(t)^(r) ∈ R^{n×d} is the KEY-MEMORY in round r at decoding time step t before the decoder RNN state update, where n is the number of memory slots and d is the dimension of the vector in each slot. The addressed attention vector is given by

    \tilde{a}_t^r = \mathrm{Address}(q_t, K_{(t)}^{(r-1)})    (11)

where ã_t^r ∈ R^n specifies the normalized weights assigned to the slots in K_(t)^(r-1), with the j-th slot being k_(t,j)^(r-1). We can use content-based addressing to determine ã_t^r as described in [Graves et al., 2014], or (quite similarly) use the soft alignment as in Eq. 4; in this paper we adopt the latter for convenience. The j-th cell of ã_t^r is

    \tilde{a}_{t,j}^r = \frac{\exp(\tilde{e}_{t,j}^r)}{\sum_{i=1}^{n} \exp(\tilde{e}_{t,i}^r)}    (12)

where \tilde{e}_{t,j}^r = (v_a^r)^{\top} \tanh(W_a^r q_t + U_a^r k_{(t,j)}^{(r-1)}).
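As a shape reference, the addressing of Eqs. 11-12 can be sketched in NumPy as below; the function name and the way the parameters are passed are our own illustrative choices rather than the authors' implementation.

    import numpy as np

    def address(q, K, W_a, U_a, v_a):
        # Soft addressing of the key-memory (Eqs. 11-12).
        # q: (d,) query state q_t; K: (n, d) key-memory, one slot per row.
        # W_a, U_a: (d, d) projections; v_a: (d,) scoring vector.
        scores = np.tanh(q @ W_a.T + K @ U_a.T) @ v_a   # e~_{t,j}^r for every slot j
        scores = scores - scores.max()                  # stabilize the softmax
        weights = np.exp(scores)
        return weights / weights.sum()                  # Eq. 12: (n,) normalized weights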
Value-Memory Reading.  Formally, V ∈ R^{n×d} is the VALUE-MEMORY, where n is the number of memory slots and d is the dimension of the vector in each slot. Before the decoder state update at time t, the output of reading at round r is c̃_t^r, given by

    \tilde{c}_t^r = \sum_{j=1}^{n} \tilde{a}_{t,j}^r \, v_j    (13)

where ã_t^r ∈ R^n specifies the normalized weights assigned to the slots in V.
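Reading is then simply a weighted sum over the value slots. A minimal sketch with toy data (names again illustrative):

    import numpy as np

    def read(weights, V):
        # Value-memory reading (Eq. 13): c~_t^r = sum_j a~_{t,j}^r * v_j.
        # weights: (n,) addressing weights; V: (n, d) fixed value-memory.
        return weights @ V   # shape (d,)

    # Toy example: n = 6 source positions, d = 4 dimensions.
    rng = np.random.default_rng(0)
    V = rng.normal(size=(6, 4))
    context = read(np.full(6, 1.0 / 6), V)   # uniform attention -> mean of the slots

Because V is fixed across rounds and decoding steps, only the addressing weights change between reads, which matches the division of labor described above.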
Key-Memory Updating.  Inspired by the attentive writing operation of neural turing machines [Graves et al., 2014], we define two types of operation for updating the KEY-MEMORY: FORGET and ADD.

FORGET determines the content to be removed from the memory slots. More specifically, the vector F_t^r ∈ R^d specifies the values to be forgotten or removed on each dimension of the memory slots, and is assigned to each slot through the normalized weights w_t^r. Formally, the ("intermediate") memory after the FORGET operation is given by

    \tilde{k}_{t,i}^{(r)} = k_{t,i}^{(r-1)} \big(1 - w_{t,i}^r \cdot F_t^r\big), \quad i = 1, 2, \cdots, n    (14)

where
  • F_t^r = σ(W_F^r \tilde{s}_t^r) is parameterized with W_F^r ∈ R^{d×d}, σ stands for the sigmoid activation function, and F_t^r ∈ R^d;
  • w_t^r ∈ R^n specifies the normalized weights assigned to the slots in K_(t)^(r), with w_{t,i}^r the weight associated with the i-th slot; w_t^r is determined by

    w_t^r = \mathrm{Address}(\tilde{s}_t^r, K_{(t)}^{(r-1)})    (15)

ADD decides how much of the current information should be written to the memory as the added content:

    k_{t,i}^{(r)} = \tilde{k}_{t,i}^{(r)} + w_{t,i}^r \cdot A_t^r, \quad i = 1, 2, \cdots, n    (16)

where A_t^r = σ(W_A^r \tilde{s}_t^r) is parameterized with W_A^r ∈ R^{d×d}, and A_t^r ∈ R^d. Clearly, with the FORGET and ADD operations, KVMEMATT can potentially modify and add to the KEY-MEMORY more than just the history of attention.
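Under an element-wise reading of the products in Eqs. 14 and 16 (which the dimensions of F_t^r and A_t^r suggest), one update can be sketched as follows; the bias-free sigmoid parameterization and all names are simplifying assumptions of this sketch.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def update_key_memory(K, s_tilde, w, W_F, W_A):
        # FORGET-then-ADD update of the key-memory (Eqs. 14-16).
        # K: (n, d) key-memory K_(t)^(r-1); s_tilde: (d,) intermediate state s~_t^r;
        # w: (n,) slot weights w_t^r from addressing with s~_t^r (Eq. 15);
        # W_F, W_A: (d, d) parameters of the FORGET and ADD vectors.
        F = sigmoid(W_F @ s_tilde)               # forget vector F_t^r in R^d
        A = sigmoid(W_A @ s_tilde)               # add vector A_t^r in R^d
        K_kept = K * (1.0 - w[:, None] * F)      # Eq. 14: scale each slot element-wise
        return K_kept + w[:, None] * A           # Eq. 16: write the added content

Composing addressing, reading, a GRU-style state update and this memory update R times per decoding step reproduces the round structure of Eqs. 7-10.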
3.2 Training

EOS-attention Objective: The translation process is finished when the decoder generates eos_trg (a special token that stands for the end of the target sentence). Therefore, accurately generating eos_trg is crucial for producing correct translations. Intuitively, accurate attention for the last word of the source sentence (i.e. eos_src) will help the decoder accurately predict eos_trg. When predicting eos_trg (i.e. y_m), the decoder should focus strongly on eos_src (i.e. x_n); that is, the attention probability a_{m,n} should be close to 1.0. And when generating the other target words y_1, y_2, ..., y_{m-1}, the decoder should not focus on eos_src too much. Therefore we define

    \mathrm{ATT}_{EOS} = \sum_{t=1}^{m} \mathrm{Att}_{eos}(t, n)    (17)

where

    \mathrm{Att}_{eos}(t, n) = \begin{cases} a_{t,n}, & t < m \\ 1 - a_{t,n}, & t = m \end{cases}    (18)

4 Experiments

4.1 Setup

We carry out experiments on Chinese⇒English (Zh⇒En) and German⇔English (De⇔En) translation tasks. For Zh⇒En, the training data consist of 1.25M sentence pairs extracted from LDC corpora. We choose the NIST 2002 (MT02) dataset as our validation set, and the NIST 2003-2006 (MT03-06) datasets as our test sets. For De⇔En, we perform our experiments on the corpus provided by WMT17, which contains 5.6M sentence pairs. We use newstest2016 as the development set and newstest2017 as the test set. We measure translation quality with BLEU scores [Papineni et al., 2002].

In training the neural networks, we limit the source and target vocabularies to the most frequent 30K words for both sides in the Zh⇒En task, covering approximately 97.7% and 99.3% of the two corpora, respectively. For De⇔En, sentences are encoded using byte-pair encoding [Sennrich et al., 2016], with a shared source-target vocabulary of about 36000 tokens. The parameters are updated by mini-batch SGD (batch size 80) with the learning rate controlled by AdaDelta [Zeiler, 2012] (ε = 1e-6 and ρ = 0.95). We limit the length of sentences in training to 80 words for Zh⇒En and 100 sub-words for De⇔En. The dimension of the word embeddings and hidden layers is 512, and the beam size in testing is 10. We apply dropout on the output layer to avoid over-fitting [Hinton et al., 2012], with a dropout rate of 0.5. The hyper-parameter λ in Eq. 19 is set to 1.0. The parameters of our KVMEMATT model (i.e., the encoder and decoder, except for those related to the KVMEMATT itself) are initialized from the pre-trained baseline model.

4.2 Results on Chinese-English

We compare our KVMEMATT with two strong baselines: 1) RNNSEARCH, which is our in-house implementation of the attention-based NMT described in Section 2; and 2) RNNSEARCH+MEMATT, which is our implementation of interactive attention [Meng et al., 2016] on top of our RNNSEARCH. Table 1 shows the translation performance.

Model Complexity.  KVMEMATT brings in few additional parameters. Compared with RNNSEARCH, KVMEMATT-R1 and KVMEMATT-R2 only bring 1.95% and 7.78% parameter increases, respectively. Additionally, introducing KVMEMATT slows down training to a certain extent (by 18.39%∼39.56%). When running on a single GPU (Tesla P40), the speed of the RNNSEARCH model is 2773 target words per second, while the speed of the proposed models is 1676∼2263 target words per second.
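For reference, the training setup of Section 4.1 can be collected into a plain configuration dictionary; the key names and grouping are ours, not taken from the authors' code.

    # Hyper-parameters as reported in Section 4.1 (key names are illustrative).
    TRAIN_CONFIG = {
        "zh_en": {"train_pairs": 1_250_000, "vocab_size": 30_000, "max_len_words": 80},
        "de_en": {"train_pairs": 5_600_000, "bpe_vocab_size": 36_000, "max_len_subwords": 100},
        "batch_size": 80,
        "optimizer": "SGD with AdaDelta learning rates (eps=1e-6, rho=0.95)",
        "embedding_dim": 512,
        "hidden_dim": 512,
        "beam_size": 10,
        "output_dropout": 0.5,
        "lambda_att_eos": 1.0,   # weight of the EOS-attention objective (Eq. 19)
    }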
SYSTEM                   ARCHITECTURE     # Para.   Speed   MT03     MT04     MT05     MT06     AVE.
Existing end-to-end NMT systems
[Tu et al., 2016]        COVERAGE         -         -       33.69    38.05    35.01    34.83    35.40
[Meng et al., 2016]      MEMATT           -         -       35.69    39.24    35.74    35.10    36.44
[Wang et al., 2016]      MEMDEC           -         -       36.16    39.81    35.91    35.98    36.95
[Zhang et al., 2017]     DISTORTION       -         -       37.93    40.40    36.81    35.77    37.73
Our end-to-end NMT systems (this work)
                         RNNSEARCH        53.98M    2773    36.02    38.65    35.64    35.80    36.53
                         + MEMATT         55.03M    2319    36.93*+  40.06*+  36.81*+  37.39*+  37.80
                         + KVMEMATT-R1    55.03M    2263    37.56*+  40.56*+  37.83*+  37.84*+  38.45
                           + ATTEOSOBJ    55.03M    1992    37.87*+  40.64+   38.01+   38.13*+  38.66
                         + KVMEMATT-R2    58.18M    1804    38.08+   40.73+   38.09+   38.80*+  38.93
                           + ATTEOSOBJ    58.18M    1676    38.40*+  41.10*+  38.73*+  39.08*+  39.33

Table 1: Case-insensitive BLEU scores (%) on the Chinese⇒English translation task. "Speed" denotes the training speed (words/second). "R1" and "R2" denote KVMEMATT with one and two rounds, respectively. "+ATTEOSOBJ" stands for adding the EOS-attention objective. The "*" and "+" indicate statistically significant improvements over the baselines.
We can see that, compared with the baseline system, our approach reduces the percentage of source words that are under-translated from 13.1% to 9.7%, and the percentage that are over-translated from 2.7% to 1.3%. The main reason is that our KVMEMATT can keep track of attention status and generate more appropriate source context for predicting the next target word at each decoding step.

4.3 Results on German-English

We also evaluate our model on the WMT17 benchmarks for the bidirectional German⇔English translation tasks, as listed in Table 4. Our baseline achieves even higher BLEU scores than the state-of-the-art WMT17 NMT systems that do not use additional synthetic data.² Our proposed model consistently outperforms the two strong baselines (i.e., the standard and memory-augmented attention models) on both De⇒En and En⇒De translation tasks. These results demonstrate that our model works well across different language pairs.

² [Sennrich et al., 2017] obtains better BLEU scores than our model, since they use large-scale synthetic data (about 10M sentences). It may be unfair to compare our model to theirs directly.

SYSTEM                    ARCHITECTURE                                                 DE⇒EN   EN⇒DE
Existing end-to-end NMT systems
[Rikters et al., 2017]    cGRU + dropout + named entity forcing + synthetic data       29.00   22.70
[Escolano et al., 2017]   Char2Char + rescoring with inverse model + synthetic data    28.10   21.20
[Sennrich et al., 2017]   cGRU + synthetic data                                        32.00   26.10
[Tu et al., 2016]         RNNSEARCH + COVERAGE                                         28.70   23.60
[Zheng et al., 2018]      RNNSEARCH + PAST-FUTURE-LAYERS                               29.70   24.30
Our end-to-end NMT systems (this work)
                          RNNSEARCH                                                    29.33   23.83
                          + MEMATT                                                     30.13   24.62
                          + KVMEMATT-R2 + ATTEOSOBJ                                    30.98   25.39

Table 4: Case-sensitive BLEU scores (%) on the German⇔English translation task. "synthetic data" denotes additional 10M monolingual sentences, which are not used in this work.

5 Related Work

Our work is inspired by the key-value memory networks [Miller et al., 2016] originally proposed for question answering, which have been successfully applied to machine translation [Gu et al., 2017; Gehring et al., 2017; Vaswani et al., 2017; Tu et al., 2018]. In these works, both the key-memory and the value-memory are fixed during translation. Different from these works, we update the KEY-MEMORY along with the update-chain of the decoder state via attentive writing operations (e.g. FORGET and ADD).

Our work is related to recent studies that focus on designing better attention models [Luong et al., 2015; Cohn et al., 2016; Feng et al., 2016; Tu et al., 2016; Mi et al., 2016; Zhang et al., 2017]. [Luong et al., 2015] proposed a global attention model that attends to all source words and a local attention model that looks at a subset of source words. [Cohn et al., 2016] extended attention-based NMT to include structural biases from word-based alignment models. [Feng et al., 2016] added implicit distortion and fertility models to attention-based NMT. [Zhang et al., 2017] incorporated distortion knowledge into attention-based NMT. [Tu et al., 2016; Mi et al., 2016] proposed coverage mechanisms to encourage the decoder to consider more untranslated source words during translation. These works differ from our KVMEMATT, since we use a rather generic key-value memory-augmented framework with memory access operations (i.e. address, read and update).

Our work is also related to recent efforts on attaching a memory to neural networks [Graves et al., 2014] and on exploiting memory [Tang et al., 2016; Wang et al., 2016; Feng et al., 2017; Meng et al., 2016; Wang et al., 2017] during translation. [Tang et al., 2016] exploited a phrase memory stored in symbolic form for NMT. [Wang et al., 2016] extended the NMT decoder by maintaining an external memory, operated by reading and writing operations. [Feng et al., 2017] proposed a neural-symbolic architecture, which exploits a memory to provide knowledge for infrequently used words. Our work differs in that we augment attention with a specially designed interactive key-value memory, which allows the model to leverage possibly complex transformations and interactions between the two memories via single or multiple rounds of memory access in each decoding step.

6 Conclusion and Future Work

We propose an effective KVMEMATT model for NMT, which maintains a timely updated key-memory to track attention history and a fixed value-memory to store the representation of the source sentence during translation. Via nontrivial transformations and iterative interactions between the two memories, our KVMEMATT can focus on more appropriate source context for predicting the next target word at each decoding step. Additionally, to further enhance the attention, we propose a simple yet effective attention-oriented objective in a weakly supervised manner. Our empirical study on Chinese⇒English, German⇒English and English⇒German translation tasks shows that KVMEMATT can significantly improve the performance of NMT.

For future work, we will consider exploring more rounds of memory access with more powerful operations on the key-value memories to further enhance the attention. Another interesting direction is to apply the proposed approach to the Transformer [Vaswani et al., 2017], in which the attention model plays an even more important role.
References

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[Cohn et al., 2016] Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. Incorporating structural alignment biases into an attentional neural translation model. In NAACL, 2016.
[Escolano et al., 2017] Carlos Escolano, Marta R. Costa-jussà, and José A. R. Fonollosa. The TALP-UPC neural machine translation system for German/Finnish-English using the inverse direction model in rescoring. In WMT, 2017.
[Feng et al., 2016] Shi Feng, Shujie Liu, Mu Li, and Ming Zhou. Implicit distortion and fertility models for attention-based encoder-decoder NMT model. In COLING, 2016.
[Feng et al., 2017] Yang Feng, Shiyue Zhang, Andi Zhang, Dong Wang, and Andrew Abel. Memory-augmented neural machine translation. arXiv, 2017.
[Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv, 2017.
[Graves et al., 2014] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv, 2014.
[Gu et al., 2017] Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. Search engine guided non-parametric neural machine translation. arXiv, 2017.
[Hinton et al., 2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
[Liu and Sun, 2015] Yang Liu and Maosong Sun. Contrastive unsupervised word alignment with non-local features. In AAAI, 2015.
[Luong et al., 2015] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.
[Meng et al., 2016] Fandong Meng, Zhengdong Lu, Hang Li, and Qun Liu. Interactive attention for neural machine translation. In COLING, 2016.
[Mi et al., 2016] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. Coverage embedding models for neural machine translation. In EMNLP, 2016.
[Miller et al., 2016] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In EMNLP, 2016.
[Och and Ney, 2003] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. CL, 2003.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
[Rikters et al., 2017] Matīss Rikters, Chantal Amrhein, Maksym Del, and Mark Fishel. C-3MA: Tartu-Riga-Zurich translation systems for WMT17. In WMT, 2017.
[Rocktäschel et al., 2017] Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. Frustratingly short attention spans in neural language modeling. In ICLR, 2017.
[Schuster and Paliwal, 1997] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. TSP, 1997.
[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
[Sennrich et al., 2017] Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. The University of Edinburgh's neural MT systems for WMT17. arXiv, 2017.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[Tang et al., 2016] Yaohua Tang, Fandong Meng, Zhengdong Lu, Hang Li, and Philip L. H. Yu. Neural machine translation with external phrase memory. arXiv, 2016.
[Tu et al., 2016] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neural machine translation. In ACL, 2016.
[Tu et al., 2018] Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning to remember translation history with a continuous cache. TACL, 2018.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[Wang et al., 2016] Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu. Memory-enhanced decoder for neural machine translation. In EMNLP, 2016.
[Wang et al., 2017] Xing Wang, Zhengdong Lu, Zhaopeng Tu, Hang Li, Deyi Xiong, and Min Zhang. Neural machine translation advised by statistical machine translation. In AAAI, 2017.
[Zeiler, 2012] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv, 2012.
[Zhang et al., 2017] Jinchao Zhang, Mingxuan Wang, Qun Liu, and Jie Zhou. Incorporating word reordering knowledge into attention-based neural machine translation. In ACL, 2017.
[Zheng et al., 2018] Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. Modeling past and future for neural machine translation. TACL, 2018.