Towards Linear Time Neural Machine Translation with Capsule Networks
Mingxuan Wang1  Jun Xie2  Zhixing Tan2  Jinsong Su3  Deyi Xiong4  Lei Li1
1 ByteDance AI Lab, Beijing, China  {wangmingxuan.89,lileilab}@bytedance.com
2 Mobile Internet Group, Tencent Technology Co., Ltd
3 Xiamen University, Xiamen, China
4 Tianjin University, Tianjin, China

arXiv:1811.00287v4 [cs.CL] 12 Oct 2020

Abstract

In this study, we investigate a novel capsule network with dynamic routing for linear-time Neural Machine Translation (NMT), referred to as CapsNMT. CapsNMT uses an aggregation mechanism to map the source sentence into a matrix of pre-determined size, and then applies a deep LSTM network to decode the target sequence from this source representation. Unlike previous work (Sutskever et al., 2014), which stores the source sentence in a passive, bottom-up way, the dynamic routing policy encodes the source sentence with an iterative process that decides the credit attribution between nodes from lower and higher layers. CapsNMT has two core properties: it runs in time that is linear in the length of the sequences, and it provides a more flexible way to aggregate the part-whole information of the source sentence. On the WMT14 English-German task and the larger WMT14 English-French task, CapsNMT achieves results comparable with the Transformer system. We also devise new hybrid architectures intended to combine the strengths of CapsNMT and the RNMT model; our hybrid models obtain state-of-the-art results on both benchmark datasets. To the best of our knowledge, this is the first work in which capsule networks have been empirically investigated for sequence-to-sequence problems.1

1 The work was partially done while the first author worked at Tencent.

1 Introduction

Neural Machine Translation (NMT) is an end-to-end learning approach to machine translation which has recently shown promising results on multiple language pairs (Luong et al., 2015; Shen et al., 2015; Wu et al., 2016; Gehring et al., 2017a; Kalchbrenner et al., 2016; Sennrich et al., 2015; Vaswani et al., 2017). Unlike conventional Statistical Machine Translation (SMT) systems (Koehn et al., 2003; Chiang, 2005), which consist of multiple separately tuned components, NMT builds a single, large neural network that directly maps the input text to the associated output text (Sutskever et al., 2014).

In general, there are several research lines of NMT architectures, among which the Enc-Dec NMT (Sutskever et al., 2014) and the Enc-Dec Att NMT (Bahdanau et al., 2014; Wu et al., 2016; Vaswani et al., 2017) are typical representatives. The Enc-Dec represents the source inputs with a fixed-dimensional vector, and the target sequence is generated from this vector word by word; the computational complexity of the decoding process is O(|S| + |T|), with |S| denoting the source sentence length and |T| the target sentence length. The Enc-Dec, however, does not preserve the source sequence resolution, a shortcoming that aggravates learning for long sequences. The Enc-Dec Att preserves the resolution of the source sentence, which frees the neural model from having to squash all the information into a fixed representation, but at the cost of a quadratic running time: due to the attention mechanism, the computational complexity of the decoding process is O(|S| × |T|). This drawback grows more severe as the length of the sequences increases.

Currently, most work focuses on the Enc-Dec Att, while the Enc-Dec paradigm is less emphasized despite its advantage of linear-time decoding (Kalchbrenner et al., 2016). The linear-time approach is appealing; however, its performance lags behind the Enc-Dec Att. One potential issue is that the Enc-Dec needs to compress all the necessary information of the input source sentence into context vectors which are fixed during decoding. Therefore, a natural question arises: will carefully designed aggregation operations help the Enc-Dec paradigm to achieve the best performance?
In recent promising work on capsule networks, a dynamic routing policy has been proposed and proven effective (Sabour et al., 2017; Zhao et al., 2018; Gong et al., 2018). Following a similar spirit, we present CapsNMT, which is characterized by a capsule encoder that addresses the drawbacks of the conventional linear-time approaches. The capsule encoder has the attractive potential to resolve the aggregation issue: it introduces an iterative routing policy to decide the credit attribution between nodes from lower (child) and higher (parent) layers. Three strategies are also proposed to stabilize the dynamic routing process. We empirically verify CapsNMT on the WMT14 English-German task and the larger WMT14 English-French task; CapsNMT achieves results comparable with the state-of-the-art Transformer systems.

Our contributions can be summarized as follows:

• We propose a carefully designed linear-time CapsNMT which achieves results comparable with the Transformer framework. To the best of our knowledge, CapsNMT is the first work in which capsule networks have been empirically investigated for sequence-to-sequence problems.

• We propose several techniques to stabilize the dynamic routing process. We believe that these techniques should always be employed by capsule networks for the best performance.

Figure 1: CapsNMT: linear-time neural machine translation with a capsule encoder. (The diagram shows input embeddings feeding N× Bi-LSTM layers with Add & Norm and a projection layer into child and parent capsules on the encoder side, and output embeddings (shifted right) feeding N× forward LSTM layers with Add & Norm, a linear layer and a softmax over the output probabilities on the decoder side.)

2 Linear Time Neural Machine Translation

From the perspective of machine learning, the task of linear-time translation can be formalized as learning the conditional distribution p(y|x) of a target sentence (translation) y given a source sentence x. Figure 1 gives the framework of our proposed linear-time NMT with a capsule encoder, which mainly consists of two components: a constant encoder that represents the input source sentence with a fixed number of vectors, and a decoder which leverages these vectors to generate the target-side translation. Please note that, due to the fixed-size representation produced by the encoder, the time complexity of our model is linear in the number of source words.

Constant Encoder with Aggregation Layers  Given the input source sentence x = x_1, x_2, ..., x_L, we first embed it into a matrix

    X = [x_1, x_2, ..., x_L] ∈ R^{L×d_x}.

The goal of the constant encoder is to transfer the inputs X into a representation of pre-determined size,

    C = [c_1, c_2, ..., c_M] ∈ R^{M×d_c},

where M < L is the pre-determined size of the encoder output and d_c is the dimension of the hidden states.

We first introduce a bi-directional LSTM (BiLSTM) as the primary-capsule layer to incorporate forward and backward context information of the input sentence:

    →h_t = LSTM(→h_{t−1}, x_t)
    ←h_t = LSTM(←h_{t+1}, x_t)    (1)
    h_t = [→h_t, ←h_t]

Here we produce the sentence-level encoding h_t of word x_t by concatenating the forward output vector →h_t and the backward output vector ←h_t. Thus, the output of the BiLSTM encoder is a sequence of vectors H = [h_1, h_2, ..., h_L] corresponding to the input sequence.
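The primary-capsule layer in Eq. (1) is a standard bidirectional LSTM. A minimal sketch, assuming a PyTorch implementation (the class and variable names are ours, not the authors' code):

```python
import torch
import torch.nn as nn

class PrimaryCapsuleLayer(nn.Module):
    """Primary-capsule layer of Eq. (1): a bi-directional LSTM over the embedded source."""

    def __init__(self, d_x: int, d_c: int):
        super().__init__()
        # Each direction has size d_c // 2, so the concatenation h_t = [->h_t, <-h_t]
        # has the hidden size d_c used by the aggregation layers above.
        self.bilstm = nn.LSTM(input_size=d_x, hidden_size=d_c // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_x) word embeddings -> H: (batch, L, d_c) states h_1 .. h_L
        H, _ = self.bilstm(x)
        return H

# Usage sketch: 2 sentences of length 7 with 512-dimensional embeddings.
H = PrimaryCapsuleLayer(d_x=512, d_c=512)(torch.randn(2, 7, 512))
print(H.shape)  # torch.Size([2, 7, 512])
```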
On top of the BiLSTM, we introduce several aggregation layers to map the variable-length inputs into the compressed representation. To demonstrate the effectiveness of the more powerful capsule aggregation, we compare it with a simple aggregation method, which mainly involves max and average pooling.

Max and average pooling are among the simplest ways to aggregate information: they require no additional parameters while being computationally efficient. Using these two operations, we perform pooling along the time dimension as follows:

    h_max = max([h_1, h_2, ..., h_L])    (2)
    h_avg = (1/L) Σ_{i=1}^{L} h_i    (3)

Moreover, because the last time step state h_L and the first time step state h_1 provide complementary information, we also exploit them to enrich the final output representation of the encoder, which further improves the performance. Formally, the output representation of the encoder consists of four vectors,

    C = [h_max, h_avg, h_1, h_L].    (4)

The compressed representation C is fixed during the subsequent translation generation; therefore its quality directly affects the success of building the Enc-Dec paradigm. In this work, we mainly focus on how to introduce aggregation layers with capsule networks to accurately produce the compressed representation of input sentences, the details of which are provided in Section 3.

LSTM Decoder  The goal of the LSTM decoder is to estimate the conditional probability p(y_t | y_{<t}, x), where the decoder state at each step is conditioned on the previously generated words and on W_c[c_1, ..., c_M], the projection of the concatenated source sentence representation, with W_c the projection matrix. Since W_c[c_1, ..., c_M] is calculated in advance, the decoding time is linear in the length of the sentence. At the inference stage, we only utilize the top-most hidden state s_t to make the final prediction with a softmax layer:

    p(y_t | y_{<t}, x) = softmax(W_o s_t).    (6)

Similar to (Vaswani et al., 2017), we also employ a residual connection (He et al., 2016) around each of the sub-layers, followed by layer normalization (Ba et al., 2016).
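Before turning to the capsule aggregation of Section 3, the simple pooling baseline of Eqs. (2)-(4) can be written in a few lines. A minimal NumPy sketch (the array shapes and names are illustrative assumptions):

```python
import numpy as np

def simple_aggregation(H: np.ndarray) -> np.ndarray:
    """Simple aggregation baseline of Eqs. (2)-(4).

    H: (L, d_c) BiLSTM states h_1 .. h_L for one sentence.
    Returns C = [h_max, h_avg, h_1, h_L] with shape (4, d_c).
    """
    h_max = H.max(axis=0)    # element-wise max over time, Eq. (2)
    h_avg = H.mean(axis=0)   # average over time, Eq. (3)
    return np.stack([h_max, h_avg, H[0], H[-1]])  # Eq. (4)

C = simple_aggregation(np.random.randn(7, 512))
print(C.shape)  # (4, 512)
```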
3 Aggregation Layers with Capsule Networks

Figure 2: Capsule encoder with dynamic routing by agreement.

The aggregation layers with capsule networks play a crucial role in our model. As shown in Figure 2, unlike the traditional linear-time approaches, which collect information in a bottom-up way without considering the state of the final encoding, the capsule encoder can automatically learn the child-parent (or part-whole) relationships. Formally, u_{i→j} denotes the information to be transferred from the child capsule h_i into the parent capsule c_j:

    u_{i→j} = α_{ij} f(h_i, θ_j)    (7)

where α_{ij} ∈ R can be viewed as the voting weight on the information flow from the child capsule to the parent capsule, and f(h_i, θ_j) is the transformation function. In this paper, we use a single-layer feed-forward neural network:

    f(h_i, θ_j) = ReLU(h_i W_j)    (8)

where W_j ∈ R^{d_c×d_c} is the transformation matrix corresponding to position j of the parent capsule.

Finally, the parent capsule aggregates all the incoming messages from all the child capsules,

    v_j = Σ_{i=1}^{L} u_{i→j}    (9)

and then squashes v_j so that its norm is confined to (0, 1). ReLU or similar non-linearity functions work well for single neurons; however, we find that the following squashing function works best with capsules. It squashes the length of the output vector of a capsule: the output shrinks towards 0 if the vector is short and is limited to a norm just below 1 if the vector is long:

    c_j = squash(v_j) = (||v_j||^2 / (1 + ||v_j||^2)) · (v_j / ||v_j||)    (10)

3.2 Dynamic Routing by Agreement

The dynamic routing process is implemented via an iterative, EM-like process of refining the coupling coefficient α_{ij}, which measures proportionally how much information is to be transferred from h_i to c_j.

At iteration t, the coupling coefficient α_{ij}^t is computed by

    α_{ij}^t = exp(b_{ij}^{t−1}) / Σ_k exp(b_{ik}^{t−1})    (11)

where Σ_j α_{ij}^t = 1. This ensures that all the information from the child capsule h_i will be transferred to the parent capsules. The routing logit is then updated by

    b_{ij}^t = b_{ij}^{t−1} + c_j^t · f(h_i^t, θ_j^t)    (12)

The coefficient b_{ij}^t is simply a temporary value that is iteratively updated from the value b_{ij}^{t−1} of the previous iteration and the scalar product of c_j^{t−1} and f(h_i^{t−1}, θ_j^{t−1}), which is essentially the similarity between the input to the capsule and the output from the capsule: a lower-level capsule sends more of its output to the higher-level capsule whose output is similar. In particular, b_{ij}^0 is initialized to 0. The coefficient depends on the location and type of both the child and the parent capsules and is iteratively refined, so the capsule network can increase or decrease the connection strength by dynamic routing. This is more effective than previous aggregation strategies such as max-pooling, which essentially detects whether a feature is present in any position of the text but loses spatial information about the feature.

Algorithm 1 Dynamic Routing Algorithm
1: procedure ROUTING([h_1, h_2, ..., h_L], T)
2:   Initialize b_{ij}^0 ← 0
3:   for each t ∈ range(0 : T) do
4:     Compute the routing coefficients α_{ij}^t for all i ∈ [1, L], j ∈ [1, M]    ▷ from Eq. (11)
5:     Update all the output capsules c_j^t for all j ∈ [1, M]    ▷ from Eqs. (7)-(10)
6:     Update all the coefficients b_{ij}^t for all i ∈ [1, L], j ∈ [1, M]    ▷ from Eq. (12)
7:   end for
8:   return [c_1, c_2, ..., c_M]
9: end procedure

When an output capsule c_j^t receives the incoming messages u_{i→j}^t, its state is updated, and the coefficient α_{ij}^t is re-computed for the input capsule h_i^{t−1}. Thus, we iteratively refine the route of information flow, towards an instance-dependent and context-aware encoding of a sequence. After the source input is encoded into M capsules, we map these capsules into a single representation by simply concatenating all capsules:

    C = [c_1, c_2, ..., c_M]    (13)

Finally, the matrix C is fed to the end-to-end NMT model as the source sentence encoding.
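A minimal NumPy sketch of Algorithm 1 and Eqs. (7)-(12) is given below. It assumes one shared transformation matrix per parent capsule and omits the positional-encoding, non-shared-weight and separable-scoring refinements discussed next, so it is an illustration rather than the authors' implementation:

```python
import numpy as np

def squash(v: np.ndarray) -> np.ndarray:
    """Eq. (10): shrink short vectors towards 0 and cap long ones near unit norm."""
    norm = np.linalg.norm(v)
    return (norm ** 2 / (1.0 + norm ** 2)) * v / (norm + 1e-9)

def dynamic_routing(H: np.ndarray, W: np.ndarray, T: int = 3) -> np.ndarray:
    """Dynamic routing by agreement (Algorithm 1).

    H: (L, d_c) child capsules from the BiLSTM.
    W: (M, d_c, d_c) per-parent transformation matrices theta_j.
    Returns C: (M, d_c) parent capsules.
    """
    L, M = H.shape[0], W.shape[0]
    f = np.maximum(0.0, np.einsum('id,mde->ime', H, W))  # f(h_i, theta_j), Eq. (8)
    b = np.zeros((L, M))                                  # routing logits, b^0_ij = 0
    C = np.zeros((M, H.shape[1]))
    for _ in range(T):
        # Eq. (11): alpha_ij is a softmax of b_ij over the parent capsules j.
        alpha = np.exp(b - b.max(axis=1, keepdims=True))
        alpha /= alpha.sum(axis=1, keepdims=True)
        # Eqs. (7) and (9): weighted messages u_{i->j}, summed over the children i.
        v = np.einsum('im,imd->md', alpha, f)
        C = np.stack([squash(v_j) for v_j in v])          # Eq. (10)
        b = b + np.einsum('md,imd->im', C, f)             # Eq. (12): agreement update
    return C

C = dynamic_routing(np.random.randn(7, 512), 0.01 * np.random.randn(6, 512, 512), T=3)
print(C.shape)  # (6, 512)
```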
In this work, we also explore three strategies to improve the accuracy of the routing process.

Position-aware Routing strategy  The routing process iteratively decides what and how much information is to be sent to the final encoding, considering the state of both the output capsules and the input capsules. In order to fully exploit the order of the child and parent capsules and capture the child-parent relationships more efficiently, we add a "positional encoding" to the representations of the child and parent capsules. In this way, some information can be injected according to the relative or absolute position of capsules in the sequence. Many choices of positional encodings have been proposed for NLP tasks (Gehring et al., 2017b; Vaswani et al., 2017); their experimental results strongly demonstrate that adding positional information is more effective for text than for images, since a sentence carries sequential information. In this work, we follow (Vaswani et al., 2017) and apply sine and cosine functions of different frequencies.

Non-sharing Weight strategy  Besides, we explore two types of transformation matrices to generate the message vector u_{i→j} introduced in Eqs. (7) and (8). The first design shares the parameters θ across different iterations. In the second design, we replace the shared parameters with non-shared parameters θ^t, where t is the iteration step of the dynamic routing process. We compare the effects of these two strategies in our experiments. In our preliminary experiments, we found that the non-shared weight strategy works slightly better than the shared one, which is consistent with (Liao and Poggio, 2016).

Separable Composition and Scoring strategy  The most important idea behind capsule networks is to measure the similarity between the input and the output. It is often modeled as a dot product between the input and the output capsule, and the routing coefficient is updated correspondingly. Traditional capsule networks often resort to a straightforward strategy in which the "fusion" decisions (e.g., deciding the voting weight) are made based on the values of feature maps. This is essentially a soft template matching (Lawrence et al., 1997), which works for tasks like classification but is undesirable for maintaining the composition functionality of capsules. Here, we propose to employ separate functional networks to take over the scoring duty, and let θ defined in Eq. (7) be responsible for composition. More specifically, we redefine the iterative scoring function of Eq. (12) as follows:

    b_{ij}^t = b_{ij}^{t−1} + g(c_j^t) · g(h_i^t)    (14)

Here g(·) is a fully connected feed-forward network, which consists of two linear transformations with a ReLU activation in between and is applied to each position separately.
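Under this strategy, only the agreement update of the routing loop changes: the score is the dot product between g(c_j) and g(h_i) rather than between c_j and f(h_i, θ_j). A hedged NumPy sketch of the modified update in Eq. (14) follows; the shapes, the parameterization of g, and the assumption that g is shared between child and parent capsules are ours, not taken from the paper:

```python
import numpy as np

def g(x, W1, b1, W2, b2):
    """Position-wise scoring network g(.): two linear maps with a ReLU in between."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def separable_score_update(b, C, H, params):
    """Eq. (14): b^t_ij = b^{t-1}_ij + g(c^t_j) . g(h^t_i).

    b: (L, M) routing logits, C: (M, d_c) parent capsules, H: (L, d_c) child capsules,
    params: (W1, b1, W2, b2) weights of g, shared here between children and parents.
    """
    gc = g(C, *params)          # (M, d_s) scores for the parent capsules
    gh = g(H, *params)          # (L, d_s) scores for the child capsules
    return b + gh @ gc.T        # all pairwise dot products g(h_i) . g(c_j)

# Usage sketch with d_c = 512 and an inner scoring dimension d_s = 128 (both assumed).
d_c, d_s = 512, 128
params = (0.01 * np.random.randn(d_c, d_s), np.zeros(d_s),
          0.01 * np.random.randn(d_s, d_s), np.zeros(d_s))
b = separable_score_update(np.zeros((7, 6)), np.random.randn(6, d_c),
                           np.random.randn(7, d_c), params)
print(b.shape)  # (7, 6)
```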
4 Experiments

4.1 Datasets

We mainly evaluate CapsNMT on the widely used WMT English-German and English-French translation tasks. The evaluation metric is BLEU. We tokenized the references and evaluated the performance with multi-bleu.pl; the metrics are exactly the same as in previous work (Papineni et al., 2002).

For English-German, to compare with the results reported by previous work, we used the same subset of the WMT 2014 training corpus that contains 4.5M sentence pairs with 91M English words and 87M German words. The concatenation of news-test 2012 and news-test 2013 is used as the validation set and news-test 2014 as the test set.

To evaluate at scale, we also report results on English-French. To compare with the results reported by previous work on end-to-end NMT, we used the same subset of the WMT 2014 training corpus that contains 36M sentence pairs. The concatenation of news-test 2012 and news-test 2013 serves as the validation set and news-test 2014 as the test set.

4.2 Training details

Our training procedure and hyper-parameter choices are similar to those used by (Vaswani et al., 2017). In more detail, for both English-German and English-French translation, we use 50K sub-word tokens as the vocabulary, based on Byte Pair Encoding (Sennrich et al., 2015). We batched sentence pairs by approximate length and limited the input and output tokens per batch to 4096 per GPU. During training, we employed label smoothing of value 0.1 (Pereyra et al., 2017). We used a beam width of 8 and a length penalty α = 0.8 in all the experiments. The dropout rate was set to 0.3 for the English-German task and 0.1 for the English-French task. Except when otherwise mentioned, the NMT systems had a 4-layer encoder followed by a capsule layer and a 3-layer decoder. We trained for 300,000 steps on 8 M40 GPUs, and averaged the last 50 checkpoints, saved at 30-minute intervals. For our base model, the dimensions of all the hidden states were set to 512; for the big model, the dimensions were set to 1024. The capsule number is set to 6.

Sequence-level knowledge distillation is applied to alleviate multimodality in the training dataset, using the state-of-the-art Transformer base models as the teachers (Kim and Rush, 2016). We decode the entire training set once using the teacher to create a new training dataset for its respective student.
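For reference, the hyper-parameters listed above can be collected into a single configuration. The sketch below is a plain Python dictionary whose key names are ours, not taken from the original implementation:

```python
# Training configuration collected from Section 4.2 (key names are illustrative).
capsnmt_config = {
    "vocab": "50K BPE sub-word tokens",
    "batching": "sentence pairs bucketed by length, <= 4096 tokens per GPU",
    "label_smoothing": 0.1,
    "beam_width": 8,
    "length_penalty": 0.8,
    "dropout": {"en-de": 0.3, "en-fr": 0.1},
    "encoder_layers": 4,          # BiLSTM layers below the capsule layer
    "decoder_layers": 3,
    "capsule_number": 6,
    "hidden_size": {"base": 512, "big": 1024},
    "train_steps": 300_000,
    "gpus": 8,                    # M40
    "checkpoint_averaging": "last 50 checkpoints, saved every 30 minutes",
}
```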
SYSTEM                      Architecture             Time           EN-FR BLEU   EN-DE BLEU
Buck et al. (2014)          Winning WMT14            -              35.7         20.7
Existing Enc-Dec Att NMT systems
Wu et al. (2016)            GNMT + Ensemble          |S||T|         40.4         26.3
Gehring et al. (2017a)      ConvS2S                  |S||T|         40.5         25.2
Vaswani et al. (2017)       Transformer (base)       |S||T|+|T||T|  38.1         27.3
Vaswani et al. (2017)       Transformer (large)      |S||T|+|T||T|  41.0         27.9
Existing Enc-Dec NMT systems
Luong et al. (2015)         Reverse Enc-Dec          |S|+|T|        -            14.0
Sutskever et al. (2014)     Reverse stack Enc-Dec    |S|+|T|        30.6         -
Zhou et al. (2016)          Deep Enc-Dec             |S|+|T|        36.3         20.6
Kalchbrenner et al. (2016)  ByteNet                  c|S|+c|T|      -            23.7
CapsNMT systems
Base Model                  Simple Aggregation       c|S|+c|T|      37.1         21.3
Base Model                  CapsNMT                  c|S|+c|T|      39.6         25.7
Big Model                   CapsNMT                  c|S|+c|T|      40.0         26.4
Base Model                  Simple Aggregation + KD  c|S|+c|T|      38.6         23.4
Base Model                  CapsNMT + KD             c|S|+c|T|      40.4         26.9
Big Model                   CapsNMT + KD             c|S|+c|T|      40.6         27.6

Table 1: Case-sensitive BLEU scores on English-German and English-French translation. KD indicates knowledge distillation (Kim and Rush, 2016).

4.3 Results on English-German and English-French Translation

The results on English-German and English-French translation are presented in Table 1. We compare CapsNMT with various other systems, including the winning system of WMT'14 (Buck et al., 2014), a phrase-based system whose language models were trained on a huge monolingual text, the Common Crawl corpus. Among the Enc-Dec Att systems, to the best of our knowledge, GNMT is the best RNN-based NMT system, and the Transformer (Vaswani et al., 2017) is currently the state-of-the-art system, about 2.0 BLEU points better than GNMT on the English-German task and 0.6 BLEU points better than GNMT on the English-French task. For Enc-Dec NMT, ByteNet is the previous state-of-the-art system, which has 150 convolutional encoder layers and 150 convolutional decoder layers.

On the English-to-German task, our big CapsNMT achieves the highest BLEU score among all the Enc-Dec approaches and even outperforms ByteNet, a relatively strong competitor, by +3.9 BLEU. On the larger English-French task, we achieve a BLEU score comparable with the big Transformer, a relatively strong competitor, with a gap of only 0.4 BLEU. To show the power of the capsule encoder, we also make a comparison with the simple aggregation version of the Enc-Dec model; for the base model, the capsule encoder again yields a gain of +1.8 BLEU on the English-German task and +3.5 BLEU on the English-French task. The improvements are consistent with our intuition that the dynamic routing policy is more effective than the simple aggregation method. It is also worth noting that, for the small model, the capsule encoder approach obtains an improvement of +2.3 BLEU over the base Transformer on the English-French task. Knowledge distillation also helps a lot to bridge the performance gap between CapsNMT and the state-of-the-art Transformer model.
The Time column of Table 1 indicates the time complexity of the network as a function of the length of the sequences. The ByteNet and the RNN Enc-Dec are the only networks that have linear running time (up to the constant c). The RNN Enc-Dec, however, does not preserve the source sequence resolution, a feature that aggravates learning for long sequences. The Enc-Dec Att models do preserve the resolution, but at the cost of a quadratic running time. ByteNet overcomes these problems with a convolutional neural network; however, the architecture must be deep enough to capture the global information of a sentence. The capsule encoder makes use of the dynamic routing policy to automatically learn the part-whole relationships and encode the source sentence into a fixed-size representation. With the capsule encoder, CapsNMT keeps a linear running time, and the constant c is the capsule number, which is set to 6 in our main experiments.

4.4 Ablation Experiments

In this section, we evaluate the importance of our main techniques for training CapsNMT. We believe that these techniques are universally applicable across different NLP tasks and should always be employed by capsule networks for best performance. From Table 2 we draw the following conclusions:

Model                                   BLEU
Base CapsNMT                            24.7
 + Non-weight sharing                   25.1
 + Position-aware Routing Policy        25.3
 + Separable Composition and Scoring    25.7
 + Knowledge Distillation               26.9

Table 2: English-German task: ablation experiments of the different techniques.

• Non-weight sharing strategy  We observed that the non-weight sharing strategy improves the baseline model, leading to an increase of 0.4 BLEU.

• Position-aware Routing strategy  Adding the position embedding to the child capsules and the parent capsules obtains an improvement of 0.2 BLEU.

• Separable Composition and Scoring strategy  Redefining the dot product function contributes significantly to the quality of the model, resulting in an increase of 0.4 BLEU.

• Knowledge Distillation  Sequence-level knowledge distillation achieves a further increase of +1.2 BLEU.

4.5 Model Analysis

In this section, we analyze the properties of CapsNMT.

Effects of Iterative Routing  We also study how the iteration number affects the performance of aggregation on the English-German task. Figure 3 shows the comparison of 2-4 iterations in the dynamic routing process, with the capsule number set to 2, 4, 6 and 8 respectively. We found that the performance for the different capsule number settings reaches its best when the iteration number is set to 3. The results indicate that dynamic routing contributes to improving the performance, and a larger capsule number often leads to better results.

Figure 3: Effects of iterative routing with different capsule numbers (BLEU on English-German versus the number of routing iterations, for 2, 4, 6 and 8 capsules).

Model        Num   Latency (ms)
Transformer  -     225
CapsNMT      4     146
CapsNMT      6     153
CapsNMT      8     168

Table 3: Time required for decoding with the base model. Num indicates the capsule number. Latency indicates the amount of time in milliseconds required for translating one sentence, averaged over the whole English-German newstest2014 dataset.
Analysis on Decoding Speed  We show the decoding speed of both the Transformer and CapsNMT in Table 3. The results empirically demonstrate that CapsNMT can improve the decoding speed over the Transformer approach by about 50%.

Performance on long sentences  A more detailed comparison between CapsNMT and the Transformer can be seen in Figure 4. In particular, we test the BLEU scores on sentences longer than {0, 10, 20, 30, 40, 50} words. We were surprised to discover that the capsule encoder does well on medium-length sentences: there is no degradation on sentences with fewer than 40 words; however, there is still a gap on the longest sentences. A deeper capsule encoder could potentially address the degradation problem, and we leave this to future work.

Figure 4: The performance of our system as a function of sentence length, where the x-axis corresponds to the test sentences sorted by their length (BLEU of CapsNMT versus the Transformer).

Visualization  We visualize how much information each child capsule sends to the parent capsules. As shown in Table 4, the color density of each word denotes the coefficient α_{ij} at iteration 3 in Eq. (11). At the first iteration, the α_{ij} follow a uniform distribution, since b_{ij} is initialized to 0; α_{ij} is then iteratively fitted by the dynamic routing policy. It is appealing to find that after 3 iterations the distribution of the voting weights α_{ij} converges to a sharp distribution, with values very close to 0 or 1. It is also worth mentioning that the capsules seem able to capture some structural information; for example, the information of the phrase "still love each other" is sent to the same capsule. We will explore this further in future work.

Table 4: A visualization of the perspectives of 4 different capsules on the sentence "Orlando Bloom and Miranda Kerr still love each other" at the third iteration; the color density of each word (not reproducible in this transcription) indicates its routing weight to that capsule.

5 Related Work

Linear Time Neural Machine Translation  Several papers have proposed to use neural networks to directly learn the conditional distribution from a parallel corpus (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Kalchbrenner et al., 2016). In (Sutskever et al., 2014), an RNN was used to encode a source sentence and, starting from the last hidden state, to decode a target sentence. Different from the RNN-based approach, Kalchbrenner et al. (2016) propose ByteNet, which makes use of convolutional networks to build a linear-time NMT system. Unlike the previous work, CapsNMT encodes the source sentence with an iterative process that decides the credit attribution between nodes from lower and higher layers.

Capsule Networks for NLP  Currently, much attention has been paid to developing sophisticated encoding models that capture the long- and short-term dependency information in a sequence. Gong et al. (2018) propose an aggregation mechanism to obtain a fixed-size encoding with a dynamic routing policy. Zhao et al. (2018) explore capsule networks with dynamic routing for multi-task learning and achieve the best performance on six text classification benchmarks. Wang et al. (2018) propose RNN-Capsule, a capsule model based on Recurrent Neural Networks (RNN) for sentiment analysis. To the best of our knowledge, CapsNMT is the first work in which capsule networks have been empirically investigated for sequence-to-sequence problems.

6 Conclusion

We have introduced CapsNMT for linear-time NMT. Three strategies were proposed to boost the performance of the dynamic routing process. The empirical results show that CapsNMT can achieve state-of-the-art results with better decoding latency on several benchmark datasets.
References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Christian Buck, Kenneth Heafield, and Bas Van Ooyen. 2014. N-gram counts and language models from the Common Crawl. In LREC, volume 2, page 4. Citeseer.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 263-270. Association for Computational Linguistics.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017a. Convolutional sequence to sequence learning. CoRR, abs/1705.03122.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017b. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Jingjing Gong, Xipeng Qiu, Shaojing Wang, and Xuanjing Huang. 2018. Information aggregation via dynamic routing for sequence encoding. In International Conference on Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700-1709.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.

Steve Lawrence, C. Lee Giles, Ah Chung Tsoi, and Andrew D. Back. 1997. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98-113.

Qianli Liao and Tomaso Poggio. 2016. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856-3866.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.

Yequan Wang, Aixin Sun, Jialong Han, Ying Liu, and Xiaoyan Zhu. 2018. Sentiment analysis by capsules. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1165-1174. International World Wide Web Conferences Steering Committee.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Suofei Zhang, and Zhou Zhao. 2018. Investigating capsule networks with dynamic routing for text classification. arXiv: Computation and Language.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint arXiv:1606.04199.