Dynamic Position Encoding for Transformers
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Dynamic Position Encoding for Transformers Joyce Zheng∗ Mehdi Rezagholizadeh Huawei Noah’s Ark Lab Huawei Noah’s Ark Lab jy6zheng@uwaterloo.ca mehdi.rezagholizadeh@huawei.com Peyman Passban † Amazon passban.peyman@gmail.com Abstract of complexity and importance, and is handled with a variety of statistical as well as string- and tree- Recurrent models have been dominating the field of neural machine translation (NMT) for based solutions (Bisazza and Federico, 2016). the past few years. Transformers (Vaswani As evident in SMT, the structure and position arXiv:2204.08142v1 [cs.CL] 18 Apr 2022 et al., 2017), have radically changed it by of words within a sentence is crucial for accurate proposing a novel architecture that relies on translation. The importance of such information a feed-forward backbone and self-attention can also be explored in NMT and for Transform- mechanism. Although Transformers are pow- ers. Previous literature, such as Chen et al. (2020), erful, they could fail to properly encode se- demonstrated that source input sentences enriched quential/positional information due to their non-recurrent nature. To solve this problem, with target order information have the capability to position embeddings are defined exclusively improve translation quality in neural models. They for each time step to enrich word informa- showed that position encoding seems to play a key tion. However, such embeddings are fixed af- role in translation and this motivated us to further ter training regardless of the task and the word explore this area. ordering system of the source or target lan- Since Transformers have a non-recurrent archi- guage. tecture, they could face problems when encoding In this paper, we propose a novel architecture sequential data. As a result, they require an explicit with new position embeddings depending on summation of the input embeddings with position the input text to address this shortcoming by taking the order of target words into consid- encoding to provide information about the order of eration. Instead of using predefined position each word. However, this approach falsely assumes embeddings, our solution generates new em- that the correct position of each word is always its beddings to refine each word’s position infor- original position in the source sentence. This inter- mation. Since we do not dictate the position pretation might be true when only considering the of source tokens and learn them in an end- source side, whereas we know from SMT (Bisazza to-end fashion, we refer to our method as dy- and Federico, 2016; Cui et al., 2016) that arranging namic position encoding (DPE). We evaluated the impact of our model on multiple datasets input words with respect to the order of their target to translate from English into German, French, peers can lead to better results. and Italian and observed meaningful improve- In this work, we explore the use of injecting ments in comparison to the original Trans- target position information alongside the source former. words to enhance Transformers’ translation quality. 1 Introduction We first examine the accuracy and efficiency of a two-pass Transformer (2PT), which consists of a In statistical machine translation (SMT), the gen- pipeline connecting two Transformers, where the eral task of translation consists of reducing the first Transformer reorders the source sentence and input sentence into smaller units (also known as the second one translates the reordered sentences. statistical phrases), selecting an optimal translation Although, this approach incorporates order infor- for each unit, and placing them in the correct order mation from the target language, it lacks end-to-end (Koehn, 2009). The last step, which is referred to training and requires more resources than a typical as the word reordering problem, is a great source Transformer. Accordingly, we introduce a better ∗ Work done while Joyce Zheng was an intern at Huawei alternative, which effectively learns the reordered † Work done while Peyman Passban was at Huawei. positions in an end-to-end fashion and uses this
information with the original source sequence to a classifier after translation to correct the position boost translation accuracy. We refer to this alterna- of words that are placed in the wrong order. The tive as Dynamic Position Encoding (DPE). entire reordering process can also be embedded Our contribution in this work is threefold: into the decoding process (Feng et al., 2013). • First, we demonstrate that providing source- 2.2 Tackling the Order Problem in NMT side representations with target position infor- Pre-reordering and position encoding are also com- mation improves translation quality in Trans- mon in NMT and have been explored by different formers. researchers. Du and Way (2017) studied if recur- rent neural models can benefit from pre-reordering. • We also propose a novel architecture, DPE, Their findings show that these models might not that efficiently learns reordered positions in require any order adjustments because the network an end-to-end fashion and incorporates such itself is powerful enough to learn such mappings. information into the encoding process. Kawara et al. (2020), unlike the previous work, • Finally, using a preliminary two-pass design studied the same problem and reported promising we show the importance of end-to-end learn- observations on the usefulness of pre-reordering. ing in this problem. They used a transduction-grammar-based method with recursive neural networks and showed how 2 Background impactful pre-reordering could be. Liu et al. (2020) followed a different approach 2.1 Pre-Reordering in Machine Translation and proposed to model position encoding as a In standard SMT, pre-reordering is a well-known continuous dynamical system through a neural technique. Usually, the source sentence is re- ODE. Ke et al. (2020) investigated improving po- ordered using heuristics such that it follows the sitional encoding by untying the relationship be- word order of the target language. Figure 1 illus- tween words and positions. They suggest that there trates this concept with an imaginary example. is no strong correlation between words and abso- lute positions, so they remove this noisy correlation. This form of separation has its own advantages, but by removing this relationship between words and positions from the translation process, they lose valuable semantic information about the source and target sides. Figure 1: The order of source words before (left-hand Shaw et al. (2018a) explored a relative position side) and after (right-hand side) pre-reordering. Si and encoding method in Transformers by having the Tj show the i-th source and j-th target words, accord- self-attention mechanism consider the distance be- ingly. tween source words. Garg et al. (2019) incorpo- rated target position information via multitasking As the figure shows, the original alignment be- where a translation loss is combined with an align- tween source (Si ) and target (Ti ) words are used to ment loss that supervises one decoder head to learn define a new order for the source sentence. With position information. Chen et al. (2020) changed the new order, the translation engine does not need the Transformer architecture to incorporate order to learn the relation between source and target or- information at each layer. They select a reordered dering systems, as it directly translates from one target word position from the output of each layer position on the source side to the same position on and inject it into the next layer. This is the clos- the target side. Clearly, this can significantly reduce est work to ours, so we consider it as our main the complexity of the translation task. The work baseline. of Wang et al. (2007), provides a good example of systems with pre-reordering, in which the authors 3 Methodology studied this technique for the English–Chinese pair. The concept of re-ordering does not necessarily 3.1 Two-Pass Translation for Pre-Reordering need to be tackled prior to translation; in Koehn Our goal is to boost translation quality in Trans- et al. (2007), generated sequences are reviewed by formers by injecting target order information into
the source-side encoding process. We propose dy- As the figure demonstrates, given a pair of namic position encoding (DPE) to achieve this, English–German sentences, word alignments are but before that, we discuss a preliminary two-pass generated. In order to process the alignments, we Transformer (2PT) architecture to demonstrate the designed rules to handle different cases: impact of order information in Transformers. The main difference between 2PT and DPE is • One-to-Many Alignments: We only con- that DPE is an end-to-end, differential solution sider the first target position in one-to-many whereas 2PT is a pipeline to connect two different alignments (see the figure). Transformers. They both work towards leveraging • Many-to-One Alignments: Multiple source order information to improve translation but in dif- words are reordered together (as one unit ferent fashions. The 2PT architecture is illustrated while maintaining their relative positions with in Figure 2. each other) using the position of the corre- sponding target word. • No Alignment: Words that do not have any alignments are skipped and we do not change their position • We also ensure that no source word would be aligned with a position beyond the source sentence length. Considering these rules and what FastAlign generates, the example input sentence “mr hän@@ sch represented you on this occasion .” (in Figure 3) is reordered to “mr hän sch you this represented Figure 2: The two-pass Transformer architecture. The .” @@ are auxiliary symbols added in-between input sequence is first re-ordered to a new form which subword units during preprocessing. See Section 4 is less complex for the translation Transformer. Then, for more information. the final translation is carried out by decoding a target Using re-ordered sentences, the first Transformer sequence using the re-ordered input sequence. in 2PT is trained to learn to reorder the origi- nal source sentences, and the second Transformer, 2PT has two different Transformers. The first which is responsible for translation, receives re- one is used for reordering purposes instead of trans- ordered source sentences and maps them into their lation. It takes source sentences and generates a target translations. Despite different data formats reordered version of them, e.g. referring back to and different inputs/outputs, 2PT is still a pipeline Figure 1, if the input to the first Transformer is [S1 , that translates a source language to a target one, S2 , S3 , S4 ] the expected output should be [S1 , S4 , through internal modifications which are hidden S3 , S2 ]. In order to train such a reordering model, from the user. we created a new corpus using FastAlign (Dyer et al., 2013).1 3.2 Dynamic Position Encoding FastAlign is an unsupervised word aligner Unlike 2PT, the dynamic position encoding (DPE) that processes source and target sentences together method takes the advantage of end-to-end training, and provides word-level alignments. It is usable while the source side still learns target reordering at training time but not for inference because it position information. It boosts the input of an or- requires access to both sides and we only have the dinary Transformer’s encoder with target position source side (at test time). As a solution, we create information, but leaves its architecture untouched, a training set using the alignments and utilized it to as illustrated in Figure 4. train the first Transformer in 2PT. Figure 3 shows The input to DPE is a source word embedding the input and output format in FastAlign and (wi ) summed with sinusoidal position (pi ) encod- how it helps generate training samples. ing (wi ⊕ pi ). We refer to these (summed) embed- 1 https://github.com/clab/fast_align dings as enriched embeddings. Sinusoidal position
Figure 3: FastAlign-based reordering using a sample sentence from our English–German dataset. Using word alignments, we generate a new, reordered form for each target sequence, we then use the pairs of source and new target sequences to train the first Transformer of 2PT. Figure 4: The figure on the left-hand side shows the original Transformer’s architecture and the figure on the right is our proposed architecture. E1 is the first encoder layer of the ordinary Transformer and DP1 is the first layer of the DPE network. encoding is part of the original design of Trans- adding pi , but the original position of the word is formers and we assume the reader is familiar with not always the best one for translation; ri is defined this concept. For more details see Vaswani et al. to address this problem. If wi appears in the i-th (2017). position but j is its best position with respect to the DPE is another neural network placed in- target language, ri is supposed to learn information between the enriched embeddings and the first en- about the j-th position and mimic pj . Accordingly, coder layer of the (translation) Transformer. In the combination of pi and ri should provide wi other words, the DPE network is connected to in- with the pre-reordering information it requires to put the embedding table of the Transformer from improve translation. one end, and its final layer outputs into the first In our design, DPE consists of two Transformer encoder layer of the Transformer. Thus, the DPE layers. We determined this number through an network can be trained jointly with the Transformer empirical study to find a reasonable balance be- using the original parallel sentence pairs. tween translation quality and resource consump- DPE processes enriched embeddings and gener- tion. These two layers are connected to an auxiliary ates a new form of them that are represented as ri in loss function to ensure that the output of DPE is this paper, i.e. DPE(wi ⊕ pi ) = ri . DPE-generated what we need for re-ordering. embeddings are intended to preserve target-side This additional loss measures the mean squared order information about each word. In the origi- error between the embeddings produced by DPE nal Transformer, the position of wi is dictated by (ri ) and the supervising positions (P E) defined by
FastAlign alignments. This learning process is To prepare the data, sequences were lower-cased, formulated in Equation 1: normalized, and tokenized using the scripts pro- P|S| vided by the Moses toolkit4 (Koehn et al., 2007) M SE(P Ei , ri ) and decomposed into subwords via Byte-Pair En- Lorder = i=1 (1) |S| coding (BPE) (Sennrich et al., 2016). The vo- cabulary size extracted for the IWSLT and WMT where S is a source sequence and MSE() is the datasets are 32K and 40K, respectively. For the mean-square error function. The supervising posi- En–De pair of WMT-14, newstest2013 is used as tion P Ei is obtained by taking the target position a development set and newstest2014 is our test set. (associated with wi ) defined by the FastAlign For the IWSLT experiments, our test and develop- described in Section 3.1. ment sets are as suggested by Zhu et al. (2020). To clarify how Lorder works, we use the afore- Table 1 provides the statistics of our datasets. mentioned scenario as an example. We assumed that the correct position for wi according to Data Train Dev Test FastAlign alignments is j, so in our setting WMT-14 (En→De) 4.45M 3k 3k P Ei = pj and we thus compute M SE(pj , ri ). IWSLT-14 (En→De) 160k 7k 6k Through this technique, we encourage the DPE IWSLT-14 (En→Fr) 168k 7k 4k network to learn pre-reordering in an end-to-end IWSLT-14 (En→It) 167k 7k 6k fashion and provide wi with position refinement information. Table 1: Statistics of datasets used in our experiments. The total loss function when training the entire Train, Dev, and Test stand for the training, develop- model includes the auxiliary reordering loss func- ment, and test sets, respectively. tion Lorder summed with the standard Transformer loss Ltranslation , as in Equation 2: 4.2 Experimental Setup Ltotal = (1 − λ) × Ltranslation + λ × Lorder (2) In the interest of fair comparisons, we used the where λ is a hyper-parameter that represents the same setup as Chen et al. (2020) to build our base- weight of the reordering loss. λ was determined by line for the WMT-14 En–De experiments. This minimizing the total loss on the development set baseline setting was also used for our DPE model during training. and DPE related experiments. Our models were trained on 8 × V100 GPUs. Also, since our mod- 4 Experimental Study els rely on the Transformers backbone, all hyper- parameters that were related to the main Trans- 4.1 Dataset former architecture, such as embedding dimen- To train and evaluate our models, we used the sions, the number of attention heads etc., were IWSLT-14 collection (Cettolo et al., 2012) and set to the default values proposed for Transformer the WMT-14 dataset.2 Our datasets are commonly Base in Vaswani et al. (2017). Refer to the original used in the field, which makes our results easily work for detailed information. reproducible. Our code is also publicly available For IWSLT experiments, we used a lighter archi- to help other researchers further investigate this tecture since the datasets are smaller than WMT, topic.3 The IWSLT-14 collection is used to study where the hidden dimension was 256 for all en- the impact of our model for the English–German coder and decoder layers, and a dimension of 1024 (En–De), English–French (En–Fr), and English– was used for the inner feed-forward network layer. Italian (En–It) pairs. We also report results on There were 2 encoder and 2 decoder layers, and 2 the WMT dataset, which provides a larger training attention heads. We found this setting through an corpus. We know that the quality of NMT mod- empirical study to maximize the performance of els vary in proportion to the corpus size, so these our IWSLT models. experiments provide more information to better For the WMT-14 En–De experiments, similar to understand our model. Chen et al. (2020), we trained the model for 300K 2 http://statmt.org/wmt14/ updates and used a single model obtained from translation-task.html 3 4 https://github.com/jy6zheng/ https://github.com/moses-smt/ DynamicPositionEncodingModule mosesdecoder
averaging the last 5 checkpoints. The model was a +0.81 improvement in the BLEU score compared validated with an interval of 2K on the development to Transformer Base. To ensure that we evaluated dataset. The decoding beam size was 5. In the DPE in a fair setup, we re-implemented the Trans- IWSLT-14 cases, we trained the models for 15,000 former Base in our own environment. This elim- updates and used a single model obtained from inates the impact of different factors and ensures averaging the last 5 checkpoints that were validated that the gain is due to the design of the DPE mod- with an interval of 1000 updates. We evaluated ule itself. We also compare our model with models all our models with detokenized BLEU (Papineni discussed in the related literature such as that of et al., 2002). Shaw et al. (2018b), the reordering embeddings of Chen et al. (2019), and the more recent explicit 4.3 2PT Experiments reordering embeddings of Chen et al. (2020). Our Results related to the two-pass architecture are sum- model achieves the best score and we believe it is marized in Table 2. The reordering Transformer due to the direct use of order information. (Row 1) works with the source sentences and re- For the DPE architecture, we decided to have orders them with respect to the order of the target two layers (DP1 and DP2) as it produced the language. This is a monolingual translation task best measured BLEU scores on the development with a BLEU score of 35.21. This is a relatively sets without imposing significant training overhead. low score for a monolingual setting and indicates One important hyper-parameter that directly affects how complicated the reordering problem could be. DPE’s performance is the position loss weight (λ). Even dedicating a whole Transformer does not fully We ran an ablation study on the development set to tackle the reordering problem. This finding also adjust λ. Table 4 summarizes our findings for this indicates that NMT engines can benefit from an investigation. The best λ value in our setting is 0.5. auxiliary module to handle order complexities. It is This value provides an acceptable balance between usually assumed that the translation engine should the translation accuracy and pre-reordering costs perform in an end-to-end fashion, where it deals during training. with all reordering, translation, and other complex- ities via a single model at the same time. However, The design of our Transformer (Transformer if we can separate these sub-tasks systematically Base + DPE) might raise the concern that incor- and tackle them individually, we might be able to porating pre-reordering information or defining an improve the overall quality. auxiliary loss might not be necessary. One might In Row 3, we consume the information gener- suggest that if we use the same amount of resources ated previously (in Row 1) and show how a transla- to increase the number of Transformer Base’s en- tion model performs when it is fed with reordered coder parameters, we should obtain competitive or sentences. The BLEU score for this task is 21.96, even better results than the DPE-enhanced Trans- which is significantly lower compared to the base- former. To address this concern, we designed an- line (Row 2). Order information was supposed other experiment where we increase the number of to increase the overall performance, but we see a parameters/layers in the Transformer Base encoder degradation. This is because the first Transformer to match the number of our model’s parameters. is unable to detect the correct order (due to the dif- Results related to this experiment are shown in ficulty of this task). In Row 4, we feed the same Table 5. translation engines with higher-quality order in- The comparison between DPE and different ex- formation and BLEU rises to 31.82. We cannot tensions of Transformer Base, namely 8E (8 en- use FastAlign at test time but this experiment coder layers) and 10E (10 encoder layers), demon- shows that our hypothesis on the usefulness of or- strates that the increase in BLEU is due to the po- der information seems to be correct. Motivated sition information provided by DPE rather than by this, we invented DPE to better leverage order additional parameters of the DPE layers. In 8E, we information, and these results are reported in the provided the same number of additional parameters next section. as the DPE module adds, but experienced less gain in translation quality than DPE. In 10E, we even 4.4 DPE Experiments doubled the number of additional parameters to sur- Results related to DPE are reported in Table 3. pass the number of parameters that DPE uses, and According to the reported scores, DPE leads to yet the DPE extension with 8 encoding layers (two
Model Data type BLEU Score 1 Reordering Transformer En →Enreordered 35.21 2 Transformer Base En→De 27.76 3 + fed with the output of reordering Transformer Enreordered →De 21.96 4 + fed with the output of FastAlign Enreordered →De 31.82 Table 2: BLEU scores for the 2PT series of experiments. Model # Params En→De (WMT) Transformer Base (Vaswani et al., 2017) 65.0 M 27.30 + Relative PE (Shaw et al., 2018b) N/A 26.80 + Explicit Global Reordering Embeddings (Chen et al., 2020) 66.5 M 28.44 + Reorder fusion-based source representation (Chen et al., 2020) 66.5 M 28.55 + Reordering Embeddings (Encoder Only) (Chen et al., 2019) 102.1 M 28.03 + Reordering Embeddings (Encoder/Decoder) (Chen et al., 2019) 106.8 M 28.22 Transformer Base (our re-implementation) 66.5 M 27.78 Dynamic Position Encoding (λ = 0.5) 72.8 M 28.59 Table 3: BLEU score comparison of DPE versus other peers. Model WMT’14 En→ De 4.5 Experimental Results on Other Languages Baseline 27.78 DPE (λ = 0.1) 28.17 In addition to the previously reported experiments, DPE (λ = 0.3) 28.16 we evaluated the DPE model on different IWSLT- DPE (λ = 0.5) 28.59 14 datasets of English–German (En–De), English– DPE (λ = 0.7) 27.98 French (En–Fr), and English–Italian (En–It). Af- ter tuning with different position loss weights on Table 4: BLEU scores of DPE with different values for the development set, we determined λ = 0.3 to be λ. ideal for the IWSLT-14 setting. The results in Table 6 show that with DPE, the translation accuracy im- proves for different settings and the improvement was not only unique to the En–De WMT language Model WMT’14 En→De pair. Baseline 27.78 Our DPE architecture works with a variety of lan- 8E 28.07 guage pairs with different sizes and this increases 10E 28.54 our confidence on the beneficial effect of order in- DPE (N = 2) 28.59 (+ 0.81) formation. It is usually hard to show the impact of auxiliary signals in NMT models and this could be Table 5: A BLEU score comparison of DPE against the even harder with smaller datasets, but our IWSLT baseline Transformer models with additional encoder results are promising. Accordingly, it would not be layers (8E for 8 encoder layers and 10E for 10 encoder unfair to claim that DPE is useful regardless of the layers) language and dataset size. 5 Conclusion and Future Work for pre-reordering and six from the original transla- In this paper, we first explored whether Transform- tion encoder) is still superior. This reinforces the ers would benefit from order signals. We then idea that our DPE module improves translation ac- proposed a new architecture, DPE, that generates curacy by injecting position information alongside embeddings containing target word position infor- the encoder input. mation to boost translation quality.
Model En→De En→Fr En→It Transformer 26.42 38.86 27.94 DPE-based Extension 27.47 (↑ 1.05) 39.42 (↑ 0.56) 28.35 (↑ 0.41) Table 6: BLEU results for different IWSLT-14 Language pairs. The results obtained in our experiments demon- Jinhua Du and Andy Way. 2017. Pre-reordering strate that DPE improves the translation process by for neural machine translation: helpful or harm- ful? Prague Bulletin of Mathematical Linguistics, helping the source side learn target position infor- (108):171–181. mation. The DPE model consistently outperformed the baselines of related literature. It also showed Chris Dyer, Victor Chahuneau, and Noah A. Smith. improvements with different language pairs and 2013. A simple, fast, and effective reparameter- ization of IBM model 2. In Proceedings of the dataset sizes. 2013 Conference of the North American Chapter of Our experiments can provide the groundwork for the Association for Computational Linguistics: Hu- further exploration of dynamic position encoding in man Language Technologies, pages 644–648, At- Transformers. In our future work, we plan to study lanta, Georgia. Association for Computational Lin- guistics. DPE from a linguistics point of view to further an- alyze its behaviour. We can also explore injecting Minwei Feng, J. Peter, and H. Ney. 2013. Advance- order information into other language processing ments in reordering models for statistical machine models through DPE or a similar mechanism. Such translation. In ACL. information seems to be useful for tasks such as Sarthak Garg, Stephan Peitz, Udhyakumar Nallasamy, dependency parsing or sequence tagging. and Matthias Paulik. 2019. Jointly learning to align and translate with transformer models. References Yuki Kawara, Chenhui Chu, and Yuki Arase. 2020. Preordering encoding on transformer for translation. Arianna Bisazza and Marcello Federico. 2016. Sur- IEEE/ACM Transactions on Audio, Speech, and Lan- veys: A survey of word reordering in statistical guage Processing. machine translation: Computational models and language phenomena. Computational Linguistics, Guolin Ke, Di He, and Tie-Yan Liu. 2020. Rethinking 42(2):163–205. positional encoding in language pre-training. Mauro Cettolo, Christian Girardi, and Marcello Fed- Philipp Koehn. 2009. Statistical machine translation. erico. 2012. WIT3: Web inventory of transcribed Cambridge University Press. and translated talks. In Proceedings of the 16th An- nual conference of the European Association for Ma- Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris chine Translation, pages 261–268, Trento, Italy. Eu- Callison-Burch, Marcello Federico, Nicola Bertoldi, ropean Association for Machine Translation. Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Kehai Chen, Rui Wang, Masao Utiyama, and Eiichiro Constantin, and Evan Herbst. 2007. Moses: Open Sumita. 2019. Neural machine translation with re- source toolkit for statistical machine translation. In ordering embeddings. In Proceedings of the 57th Proceedings of the 45th Annual Meeting of the As- Annual Meeting of the Association for Computa- sociation for Computational Linguistics Companion tional Linguistics, pages 1787–1799, Florence, Italy. Volume Proceedings of the Demo and Poster Ses- Association for Computational Linguistics. sions, pages 177–180, Prague, Czech Republic. As- sociation for Computational Linguistics. Kehai Chen, Rui Wang, Masao Utiyama, and Eiichiro Sumita. 2020. Explicit reordering for neural ma- Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, and chine translation. Cho-Jui Hsieh. 2020. Learning to encode position for transformer with continuous dynamical model. Yiming Cui, Shijin Wang, and Jianfeng Li. 2016. LSTM neural reordering feature for statistical ma- Kishore Papineni, Salim Roukos, Todd Ward, and Wei- chine translation. In Proceedings of the 2016 Con- Jing Zhu. 2002. Bleu: a method for automatic eval- ference of the North American Chapter of the As- uation of machine translation. In Proceedings of sociation for Computational Linguistics: Human the 40th Annual Meeting of the Association for Com- Language Technologies, pages 977–982, San Diego, putational Linguistics, pages 311–318, Philadelphia, California. Association for Computational Linguis- Pennsylvania, USA. Association for Computational tics. Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018a. Self-attention with relative position represen- tations. CoRR, abs/1803.02155. Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018b. Self-attention with relative position repre- sentations. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information pro- cessing systems, pages 5998–6008. Chao Wang, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the 2007 Joint Con- ference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 737–745, Prague, Czech Republic. Association for Computational Lin- guistics. Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2020. Incorporating bert into neural machine translation. In International Conference on Learning Represen- tations.
You can also read