Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages
Raj Dabre1, Fabien Cromieres2, and Sadao Kurohashi2
1 Graduate School of Informatics, Kyoto University
2 Japan Science and Technology Agency
raj@nlp.ist.i.kyoto-u.ac.jp, fabien@pa.jst.jp, kuro@i.kyoto-u.ac.jp

arXiv:1702.06135v3 [cs.CL] 3 Apr 2017

Abstract

In this paper, we propose a novel and elegant solution to "Multi-Source Neural Machine Translation" (MSNMT) which relies only on preprocessing an N-way multilingual corpus, without modifying the Neural Machine Translation (NMT) architecture or training procedure. We simply concatenate the source sentences to form a single long multi-source input sentence while keeping the target side sentence as it is, and train an NMT system using this preprocessed corpus. We evaluate our method in resource poor as well as resource rich settings and show its effectiveness (up to 4 BLEU using 2 source languages and up to 6 BLEU using 5 source languages) by comparing against existing methods for MSNMT. We also provide some insights on how the NMT system leverages multilingual information in such a scenario by visualizing attention.

1 Introduction

Neural machine translation (NMT) (Bahdanau et al., 2015; Cho et al., 2014; Sutskever et al., 2014) enables training an end-to-end system without the need to deal with word alignments, translation rules and complicated decoding algorithms, which are a characteristic of phrase based statistical machine translation (PBSMT) systems. However, it is reported that NMT works better than PBSMT only when there is an abundance of parallel corpora. In a low resource scenario, vanilla NMT is either worse than or comparable to PBSMT (Zoph et al., 2016).

Multilingual NMT has been shown to be quite effective in a variety of settings like Transfer Learning (Zoph et al., 2016), where a model trained on a resource rich language pair is used to initialize the parameters of a model that is to be trained for a resource poor pair; Multilingual-Multiway NMT (Firat et al., 2016a), where multiple language pairs are learned simultaneously with separate encoders and decoders for each source and target language; and Zero Shot NMT (Johnson et al., 2016), where a single NMT system is trained for multiple language pairs that share all model parameters, thereby allowing multiple languages to interact and help each other improve the overall translation quality.

Multi-Source Machine Translation is an approach that allows one to leverage N-way (N-lingual) corpora to improve translation quality in resource poor as well as resource rich scenarios. N-way (or N-lingual) corpora are those in which translations of the same sentence exist in N different languages1. A realistic scenario is when N equals 3, since there exist many domain specific as well as general domain trilingual corpora. For example, in Spain international news companies write news articles in English as well as Spanish, and thus it is possible to utilize the same sentence written in two different languages to translate to a third language like Italian by utilizing a large English-Spanish-Italian trilingual corpus. However, there also exist N-way corpora (ordered from largest to smallest by number of lines) like the United Nations (Ziemski et al., 2016), Europarl (Koehn, 2005), Ted Talks (Cettolo et al., 2012), ILCI (Jha, 2010) and Bible (Christodouloupoulos and Steedman, 2015) corpora, where the same sentences are translated into more than 5 languages.

1 Sometimes an N-lingual corpus is available as N-1 bilingual corpora (leading to N-1 sources), each having a fixed target language (typically English) and sharing a number of target language sentences.
Two major approaches for Multi-Source NMT (MSNMT) have been explored, namely the multi-encoder approach (Zoph and Knight, 2016) and multi-source ensembling (Garmash and Monz, 2016; Firat et al., 2016b). The multi-encoder approach involves extending the vanilla NMT architecture to have an encoder for each source language, leading to larger models, even though it is clear that NMT can accommodate multiple languages (Johnson et al., 2016) without needing to resort to a larger parameter space. Moreover, since the encoders for each source language are separate, it is difficult to explore how the source languages contribute towards the improvement in translation quality. On the other hand, the ensembling approach is simpler since it involves training multiple bilingual NMT models, each with a different source language but the same target language. This method also eliminates the need for N-way corpora, which allows one to exploit bilingual corpora, which are larger in size. Multi-source NMT ensembling works in essentially the same way as single-source NMT ensembling, except that each of the systems in the ensemble takes source sentences in different languages. In the case of a multilingual multi-way NMT model, multi-source ensembling is a form of self-ensembling, but such a model contains too many parameters and is difficult to train. Multi-source ensembling using separate models for each source language involves the overhead of learning an ensemble function, and hence this method is not truly end-to-end.

To overcome the limitations of both approaches we propose a new, simplified end-to-end method that avoids the need to modify the NMT architecture as well as the need to learn an ensemble function. We simply propose to concatenate the source sentences, leading to a parallel corpus where the source side is a long multilingual sentence and the target side is a single sentence which is the translation of the aforementioned multilingual sentence. This corpus is then fed to any NMT training pipeline whose output is a multi-source NMT model. The main contributions of this paper are as follows:

• A novel preprocessing step that allows for MSNMT without any change to the NMT architecture2.
• An exhaustive study of how our approach works in a resource poor as well as a resource rich setting.
• An empirical comparison of our approach against two existing methods (Zoph and Knight, 2016; Firat et al., 2016b) for MSNMT.
• An analysis of how NMT gives more importance to certain linguistically closer languages while doing multi-source translation, by visualizing attention vectors.

2 One additional benefit of our approach is that any NMT architecture can be used, be it attention based or hierarchical NMT.

2 Related Work

One of the first studies on multi-source MT (Och and Ney, 2001) was done to see how word based SMT systems would benefit from multiple source languages. Although effective, it suffered from a number of limitations that classic word and phrase based SMT systems do, including the inability to perform end-to-end training. In the context of NMT, the work on multi-encoder multi-source NMT (Zoph and Knight, 2016) is the first end-to-end approach of its kind and focused on utilizing French and German as source languages to translate to English. However, their method led to models with substantially larger parameter spaces and they did not study the effect of using more than 2 source languages. Multi-source ensembling using a multilingual multi-way NMT model (Firat et al., 2016b) is an end-to-end approach but requires training a very large and complex NMT model. The work on multi-source ensembling which uses separately trained single source models (Garmash and Monz, 2016) is comparatively simpler in the sense that one does not need to train additional NMT models, but the approach is not truly end-to-end since an ensemble function needs to be learned to effectively leverage multiple source languages. In all three cases one ends up with either one large model or many small models.

3 Overview of Our Method

Figure 1: Our proposed Multi-Source NMT Approach.

Refer to Figure 1 for an overview of our method, which is as follows:

• For each target sentence, concatenate the corresponding source sentences (see the sketch after this list), leading to a parallel corpus where the source sentence is a very long sentence that conveys the same meaning in multiple languages. An example line in such a corpus would be: source: "Hello Bonjour Namaskar Kamusta Hallo" and target: "konnichiwa". The 5 source languages here are English, French, Marathi, Filipino and Luxembourgish, whereas the target language is Japanese. In this example each source sentence is a word conveying "Hello" in a different language. We romanize the Marathi and Japanese words for readability.
• Apply word segmentation to the source and target sentences, Byte Pair Encoding (BPE)3 (Sennrich et al., 2016a) in our case, to overcome data sparsity and eliminate the unknown word rate.
• Use the training corpus to learn an NMT model using any off-the-shelf NMT toolkit.

3 The BPE model is learned only on the training set.
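The following is a minimal sketch of the concatenation step under stated assumptions: the N-way corpus is stored as one sentence-aligned, line-per-sentence file per language, all file names are hypothetical placeholders, and a plain space joins the source sentences (the paper does not prescribe a particular separator).

```python
# A minimal sketch of the preprocessing step described in Section 3 (not the authors'
# original script). One sentence-aligned, line-per-sentence file per source language is
# assumed; file names are hypothetical placeholders.
SOURCE_FILES = ["train.en", "train.fr", "train.mr", "train.tl", "train.lb"]
OUT_SOURCE = "train.multi-src"  # paired with the unchanged target file, e.g. train.ja

def concatenate_sources(source_files, out_path, separator=" "):
    """Join the i-th sentence of every source file into one long multilingual line."""
    streams = [open(path, encoding="utf-8") for path in source_files]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            for sentences in zip(*streams):
                out.write(separator.join(s.strip() for s in sentences) + "\n")
    finally:
        for stream in streams:
            stream.close()

if __name__ == "__main__":
    concatenate_sources(SOURCE_FILES, OUT_SOURCE)
```

The resulting pair of files (the long multilingual source side and the untouched target side) is then segmented with BPE and handed to the NMT toolkit exactly like any bilingual corpus.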
4 Experimental Settings

All of our experiments were performed using an encoder-decoder NMT system with attention for the various baselines and multi-source experiments. In order to enable an open vocabulary and reduce data sparsity we use the Byte Pair Encoding (BPE) based word segmentation approach (Sennrich et al., 2016b). However, we make a slight modification to the original code: instead of specifying the number of merge operations manually, we specify a desired vocabulary size and the BPE learning process automatically stops after it has learned enough rules to obtain the prespecified vocabulary size. We prefer this approach since it allows us to learn a minimal model and it resembles the way Google's NMT system (Wu et al., 2016) works with the Word Piece Model (WPM) (Schuster and Nakajima, 2012). We evaluate our models using the standard BLEU (Papineni et al., 2002) metric4 on the translations of the test set. Baseline models are simply ones trained from scratch by initializing the model parameters with random values.

4 This is computed by the multi-bleu.pl script, which can be downloaded from the public implementation of Moses (Koehn et al., 2007).
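The following is a simplified, self-contained sketch of the modified BPE learning loop described above: merges are learned greedily and learning stops once the symbol vocabulary reaches the desired size. It illustrates only the stopping criterion and is not the authors' actual modification of the subword-nmt code; details such as the shared source-side vocabulary are omitted.

```python
# Simplified BPE learning that stops at a target vocabulary size instead of a fixed
# number of merge operations. Illustrative only.
from collections import Counter

def learn_bpe(word_frequencies, target_vocab_size):
    """word_frequencies: dict mapping a word to its frequency in the training set."""
    # Represent every word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_frequencies.items()}
    merges = []

    def num_symbols():
        return len({symbol for word in vocab for symbol in word})

    while num_symbols() < target_vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        new_vocab = {}
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Example: learn_bpe({"lower": 5, "newest": 6, "widest": 3}, target_vocab_size=30)
```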
4.1 Languages and Corpora Settings

All of our experiments were performed using the publicly available ILCI5 (Jha, 2010), United Nations8 (Ziemski et al., 2016) and IWSLT9 (Cettolo et al., 2015) corpora.

The ILCI corpus, which was provided as a part of the task5, is a 6-way multilingual corpus spanning the languages Hindi, English, Tamil, Telugu, Marathi and Bengali. The target language is Hindi and thus there are 5 source languages. The training, development and test sets contain 45600, 1000 and 2400 6-lingual sentences respectively10. Hindi, Bengali and Marathi are Indo-Aryan languages, Telugu and Tamil are Dravidian languages and English is a European language. In this group English is the farthest from Hindi, grammatically speaking, whereas Marathi is the closest to it. Morphologically speaking, Bengali is closer to Hindi than Marathi (which has agglutinative suffixes), but Marathi and Hindi share the same script and they also share more cognates than the other languages. It is natural to expect that translating from Bengali and Marathi to Hindi should give Hindi sentences of higher quality than those obtained by translating from the other languages, and thus using these two languages as sources in multi-source approaches should lead to significant improvements in translation quality. We verify this hypothesis by exhaustively trying all language combinations.

5 This was used for the Indian Languages MT task in ICON 2014 and 2015.
8 https://conferences.unite.un.org/uncorpus
9 https://wit3.fbk.eu/mt.php?release=2016-01
10 In the task there are 3 domains: health, tourism and general. However, we focus on the general domain, in which half the corpus comes from the health domain and the other half comes from the tourism domain.

The IWSLT corpus is a collection of 4 bilingual corpora spanning 5 languages where the target language is English: French-English (234992 lines), German-English (209772 lines), Czech-English (122382 lines) and Arabic-English (239818 lines). Linguistically speaking, French and German are the closest to English, followed by Czech and Arabic. In order to obtain N-lingual sentences we only keep the sentence pairs from each corpus such that the English sentence is present in all the corpora. From the given training data we extract trilingual (French, German and English), 4-lingual (French, German, Arabic and English) and 5-lingual corpora. Similarly we extract 3, 4 and 5 lingual development and test sets. The IWSLT corpus (downloaded from the link given above) comes with a development set called dev2010 and test sets named tst2010 to tst2013 (one for each year from 2010 to 2013). Unfortunately only the tst2010 and tst2013 test sets are N-lingual. Refer to Table 1, which contains the number of lines of training, development and test sentences we extracted.

corpus type | Languages          | train  | dev2010 | tst2010/tst2013
3 lingual   | Fr, De, En         | 191381 | 880     | 1060/886
4 lingual   | Fr, De, Ar, En     | 84301  | 880     | 1059/708
5 lingual   | Fr, De, Ar, Cs, En | 45684  | 461     | 1016/643

Table 1: Statistics for the N-lingual corpora extracted for the languages French (Fr), German (De), Arabic (Ar), Czech (Cs) and English (En).
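Below is a small sketch of the N-lingual extraction just described, under stated assumptions: each bilingual corpus is stored as two sentence-aligned files, the file names are hypothetical placeholders, and duplicate English sentences within a corpus are collapsed (a simplification; the authors' exact procedure is not spelled out beyond keeping pairs whose English side occurs in every corpus).

```python
# Sketch of the N-lingual extraction for the IWSLT data: keep only sentence pairs whose
# English side occurs in every bilingual corpus and align the foreign sides through that
# shared English sentence. File names are hypothetical; duplicates are collapsed.
def read_bitext(foreign_path, english_path):
    with open(foreign_path, encoding="utf-8") as ff, open(english_path, encoding="utf-8") as fe:
        return {en.strip(): fr.strip() for fr, en in zip(ff, fe)}

def extract_n_lingual(bitexts):
    """bitexts: list of dicts, each mapping an English sentence to its translation."""
    shared = set(bitexts[0])
    for bitext in bitexts[1:]:
        shared &= set(bitext)
    # One tuple (foreign_1, ..., foreign_k, english) per shared English sentence.
    return [tuple(bt[en] for bt in bitexts) + (en,) for en in sorted(shared)]

if __name__ == "__main__":
    fr_en = read_bitext("train.fr-en.fr", "train.fr-en.en")
    de_en = read_bitext("train.de-en.de", "train.de-en.en")
    trilingual = extract_n_lingual([fr_en, de_en])  # (French, German, English) triples
```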
Re- ensembling approach that late averages (Fi- fer to Table 1 which contains the number of lines rat et al., 2016b) N one source to one target of training, development and test sentences we ex- models12 . tracted. The model and training details are as follows: The UN corpus spans 6 languages (French, Span- • BPE vocabulary size: 8k13 (separate models ish, Arabic, Chinese, Russian and English) and we 11 https://github.com/fabiencro/knmt used the 6-lingual version for our multisource ex- 12 In the original work a single multilingual multiway NMT periments. Although there are 11 million 6-lingual model was trained and ensembled but we train separate NMT models for each source language. sentences we use only 2 million for training since 13 We also try vocabularies of size 16k and 32k but they
The model and training details are as follows:

• BPE vocabulary size: 8k13 (separate models for source and target) for the ILCI and IWSLT corpora settings, and 16k for the UN corpus setting. When training the BPE model for the source languages we learn a single shared BPE model so that the total source side vocabulary size is 8k (or 16k as applicable). In the case of languages that use the same script this allows for cognate sharing, thereby reducing the overall vocabulary size.
• Embeddings: 620 nodes.
• RNN for encoders and decoders: LSTM with 1 layer, 1000 nodes output. Each encoder is a bidirectional RNN.
• In the case of multiple encoders, one for each language, each encoder has its own separate vocabulary.
• Attention: 500 nodes hidden layer. In the case of the multi encoder approach there is a separate attention mechanism per encoder.
• Batch size: 64 for single source, 16 for 2 sources and 8 for 3 sources and above for the IWSLT and ILCI corpora settings; 32 for single source and 16 for 2 sources for the UN corpus setting.
• Training steps: 10k14 for 1 source, 15k for 2 sources and 40k for 5 sources when using the IWSLT and ILCI corpora; 200k for 1 source and 400k for 2 sources for the UN corpus setting, to ensure that in both cases the models get saturated with respect to their learning capacity.
• Optimization algorithm: Adam with an initial learning rate of 0.01.
• Choosing the best model: Evaluate the model on the development set and select the one with the best BLEU (Papineni et al., 2002) after reversing the BPE segmentation on the output of the NMT model (a sketch of this selection step is given after this list).

13 We also try vocabularies of size 16k and 32k but they take longer to train and overfit badly in a low resource setting.
14 We observed that the models start overfitting around 7k-8k iterations.
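The model selection bullet above can be pictured as follows. The "@@ " continuation marker is the subword-nmt convention and is assumed here; `bleu_score` stands in for the multi-bleu.pl script mentioned in footnote 4 and is a hypothetical helper, not part of any specific library.

```python
# Sketch of checkpoint selection: undo BPE on each checkpoint's development-set output
# and keep the checkpoint with the highest BLEU.
import re

def undo_bpe(line):
    # "fo@@ o bar" -> "foo bar"
    return re.sub(r"@@( |$)", "", line)

def select_best_checkpoint(checkpoint_outputs, references, bleu_score):
    """checkpoint_outputs: dict mapping a checkpoint name to its list of BPE-segmented
    development translations; bleu_score(hypotheses, references) returns a float."""
    best_name, best_bleu = None, float("-inf")
    for name, hyp_lines in checkpoint_outputs.items():
        hypotheses = [undo_bpe(line) for line in hyp_lines]
        score = bleu_score(hypotheses, references)
        if score > best_bleu:
            best_name, best_bleu = name, score
    return best_name, best_bleu
```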
We train and evaluate the following NMT models using the ILCI corpus:

• One source to one target: 5 models (baselines).
• Two source to one target: 10 models (5 source languages, choosing 2 at a time).
• Five source to one target: 1 model.

For evaluation we translate the test set sentences using a beam search decoder with a beam of size 16 for all corpora settings15. In the IWSLT corpus setting we did not try various combinations of source languages as we did in the ILCI corpus setting. We train and evaluate the following NMT models for each N-lingual corpus:

• One source to one target: N-1 models (baselines; 2 for the trilingual corpus, 3 for the 4-lingual corpus and 4 for the 5-lingual corpus).
• N-1 source to one target: 3 models (1 for the trilingual, 1 for the 4-lingual and 1 for the 5-lingual corpus).

Similarly, for the UN corpus setting we only tried the following 2 source combinations: French+Spanish, French+Arabic, French+Russian and Russian+Arabic. The target language is English.

15 We performed evaluation using beam sizes 4, 8, 12 and 16 but found that the differences in BLEU between beam sizes 12 and 16 are small and gains in BLEU for beam sizes beyond 16 are insignificant.

For the results of the ILCI corpus setting, refer to Table 5 for the BLEU scores for individual language pairs. Table 2 contains the BLEU scores for all combinations of source languages, two at a time.
Source 1 \ Source 2 | En (11.08)         | Mr (24.60)         | Ta (10.37)         | Te (16.55)
Bn (19.14)          | 20.70*/19.45/19.10 | 29.02/30.10*/27.33 | 19.85/20.79*/18.26 | 22.73/24.83*/22.14
En (11.08)          | -                  | 25.56*/23.06/26.01 | 14.03/15.05*/13.30 | 18.91/19.68*/17.53
Mr (24.60)          | -                  | -                  | 25.64*/24.70/23.79 | 27.62/28.00*/26.83
Ta (10.37)          | -                  | -                  | -                  | 18.14/19.11*/17.34
All five sources    | 31.56*/30.29/28.31 |                    |                    |

Table 2: BLEU scores for the two source to one target setting for all language combinations and for five source to one target using the ILCI corpus. The languages are Bengali (Bn), English (En), Marathi (Mr), Tamil (Ta), Telugu (Te) and Hindi (Hi). Each cell in the upper right triangle lists the BLEU scores of a. our proposed approach, b. the multi source ensembling approach and c. the multi encoder multi source approach, in that order. The highest score in each cell is marked with an asterisk (*).

Each cell in the upper right triangle contains the BLEU score and the difference compared to the best BLEU obtained using either of the languages for the two source to one target setting, where the source languages are specified in the leftmost and topmost cells. The last row of Table 2 contains the BLEU score for the setting which uses all 5 source languages and the difference compared to the best BLEU obtained using any single source language. For the results of the IWSLT corpus setting, refer to Table 3. Finally, refer to Table 4 for the UN corpus setting.

Corpus type              | Language pair | tst2010            | tst2013
3 lingual (191381 lines) | Fr-En         | 19.72              | 22.05
                         | De-En         | 16.19              | 16.13
                         | 2 sources     | 22.56*/18.64/22.03 | 24.02*/18.45/23.92
4 lingual (84301 lines)  | Fr-En         | 9.02               | 7.78
                         | De-En         | 7.58               | 5.45
                         | Ar-En         | 6.53               | 5.25
                         | 3 sources     | 11.70/12.86*/10.30 | 9.16/9.48*/7.30
5 lingual (45684 lines)  | Fr-En         | 6.69               | 6.36
                         | De-En         | 5.76               | 3.86
                         | Ar-En         | 4.53               | 2.92
                         | Cs-En         | 4.56               | 3.40
                         | 4 sources     | 8.34/9.23*/7.79    | 6.67*/6.49/5.92

Table 3: BLEU scores for the baseline and N source settings using the IWSLT corpus. The languages are French (Fr), German (De), Arabic (Ar), Czech (Cs) and English (En). Each N-source cell lists the BLEU scores of a. our proposed approach, b. the multi source ensembling approach and c. the multi encoder multi source approach, in that order. The highest score in each cell is marked with an asterisk (*).

Language pair | BLEU  | Language combination | BLEU
Es-En         | 49.20 | Es+Fr-En             | 49.93*+ / 46.65 / 47.39
Fr-En         | 40.52 | Fr+Ru-En             | 43.99*+ / 40.63 / 42.12+
Ar-En         | 40.58 | Fr+Ar-En             | 43.85+ / 41.13+ / 44.06*+
Ru-En         | 38.94 | Ar+Ru-En             | 41.66+ / 43.12+ / 43.69*+

Table 4: BLEU scores for the baseline and 2 source settings using the UN corpus. The languages are Spanish (Es), French (Fr), Russian (Ru), Arabic (Ar) and English (En). Each 2-source cell lists the BLEU scores of a. our proposed approach, b. the multi source ensembling approach and c. the multi encoder multi source approach, in that order. The highest score in each cell is marked with an asterisk (*). Entries with a plus mark (+) next to them indicate BLEU scores that are statistically significant.

4.3 Analysis

From Tables 2, 3 and 4 it is clear that our simple source sentence concatenation based approach is able to leverage multiple languages, leading to significant improvements compared to the BLEU scores obtained using any of the individual source languages. The ensembling and the multi encoder approaches also lead to improvements in BLEU. It should be noted that in a resource poor scenario ensembling outperforms all other approaches, but in a resource rich scenario our method as well as the multi encoder are much [...]

[...] speaking) compared to the other languages, and thus when used together they help obtain an improvement of 4.39 BLEU points compared to when Marathi is used as the only source language (24.63). It is also surprising to note that Marathi [...]

[...] which we plan to investigate in the future.

Finally, in the large training corpus setting, where we used approximately 2 million training sentences, we also obtained statistically significant (p [...]
[...] significant gains in BLEU.

Figure 2: Attention Visualization for ILCI corpus setting with all 5 source languages.

Building on this observation we believe that the gains we obtained by using all 5 source languages were mostly due to Bengali, Telugu and Marathi, whereas the NMT system learns to practically ignore Tamil and English. However, there does not seem to be any detrimental effect of using English and Tamil.

Figure 3: Attention Visualization for UN corpus setting with French and Spanish as source languages.

From Figure 3 it can be seen that this observation also holds in the UN corpus setting for French+Spanish to English, where the attention mechanism gives a higher weight to Spanish words compared to French words. It is also interesting to note that the attention can potentially be used to extract a multilingual dictionary, simply by learning an N-source NMT system and then generating a dictionary by extracting the words from the source sentence that receive the highest attention for each target word generated.
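The dictionary extraction idea mentioned above can be sketched as follows, assuming access to the per-sentence attention matrices of the trained multi-source model; this is only an illustration of the idea, not an experiment reported in the paper.

```python
# Sketch of attention-based dictionary extraction: for every generated target word, take
# the source word (from the concatenated multilingual input) that received the highest
# attention weight, and keep the most frequent source word per target word.
import numpy as np
from collections import Counter

def extract_dictionary(examples):
    """examples: iterable of (source_tokens, target_tokens, attention) triples, where
    attention has shape (len(target_tokens), len(source_tokens))."""
    pair_counts = Counter()
    for source_tokens, target_tokens, attention in examples:
        attention = np.asarray(attention)
        for t, target_word in enumerate(target_tokens):
            s = int(np.argmax(attention[t]))
            pair_counts[(target_word, source_tokens[s])] += 1
    dictionary = {}
    for (target_word, source_word), _ in pair_counts.most_common():
        dictionary.setdefault(target_word, source_word)  # keep the most frequent pairing
    return dictionary
```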
5 Conclusion

In this paper, we have proposed and evaluated a simple approach for "Multi-Source Neural Machine Translation" that does not modify the NMT system architecture, in a resource poor as well as a resource rich setting, using the ILCI, IWSLT and UN corpora. We have compared our approach with two other previously proposed approaches and shown that it is highly effective, domain and language independent, and that the gains are significant. We furthermore observed, by visualizing attention, that NMT focuses on some languages while practically ignoring others, indicating that language relatedness is one of the aspects that should be considered in a multilingual MT scenario. In the future we plan on conducting a full scale investigation of the language relatedness phenomenon by considering even more languages in resource rich as well as resource poor scenarios.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, USA.

M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, and M. Federico. 2015. The IWSLT 2015 evaluation campaign. In Proceedings of the Twelfth International Workshop on Spoken Language Translation (IWSLT).

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), Trento, Italy, pages 261-268.

Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pages 1724-1734. http://www.aclweb.org/anthology/D14-1179.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation 49(2):375-395. https://doi.org/10.1007/s10579-014-9287-y.

Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2016. Kyoto University participation to WAT 2016. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), Osaka, Japan, pages 166-174. http://aclweb.org/anthology/W16-4616.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of NAACL HLT 2016, San Diego, California, USA, pages 866-875. http://aclweb.org/anthology/N/N16/N16-1101.pdf.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman-Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of EMNLP 2016, Austin, Texas, USA, pages 268-277. http://aclweb.org/anthology/D/D16/D16-1026.pdf.

Ekaterina Garmash and Christof Monz. 2016. Ensemble learning for multi-source neural machine translation. In Proceedings of COLING 2016, Osaka, Japan, pages 1409-1418. http://aclweb.org/anthology/C16-1133.

Girish Nath Jha. 2010. The TDIL program and the Indian Language Corpora Initiative (ILCI). In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR abs/1611.04558. http://arxiv.org/abs/1611.04558.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, Phuket, Thailand, pages 79-86. http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL 2007.

Franz Josef Och and Hermann Ney. 2001. Statistical multi-source translation. In Proceedings of MT Summit, volume 8, pages 253-258.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311-318. https://doi.org/10.3115/1073083.1073135.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In Proceedings of ICASSP 2012, Kyoto, Japan, pages 5149-5152. http://dx.doi.org/10.1109/ICASSP.2012.6289079.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany. http://aclweb.org/anthology/P/P16/P16-1009.pdf.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, pages 1715-1725. http://www.aclweb.org/anthology/P16-1162.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), pages 3104-3112. http://dl.acm.org/citation.cfm?id=2969033.2969173.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.

Micha Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of NAACL HLT 2016, San Diego, California, pages 30-34. http://www.aclweb.org/anthology/N16-1004.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of EMNLP 2016, Austin, Texas, USA, pages 1568-1575. http://aclweb.org/anthology/D/D16/D16-1163.pdf.