An Effective Approach to Unsupervised Machine Translation
Mikel Artetxe, Gorka Labaka, Eneko Agirre
IXA NLP Group
University of the Basque Country (UPV/EHU)
{mikel.artetxe, gorka.labaka, e.agirre}@ehu.eus

arXiv:1902.01313v1 [cs.CL] 4 Feb 2019

Abstract

While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a theoretically well-founded unsupervised tuning method, and incorporating a joint refinement procedure. Moreover, we use our improved SMT system to initialize a dual NMT model, which is further fine-tuned through on-the-fly back-translation. Together, we obtain large improvements over the previous state-of-the-art in unsupervised machine translation. For instance, we get 22.5 BLEU points in English-to-German WMT 2014, 5.5 points more than the previous best unsupervised system, and 0.5 points more than the (supervised) shared task winner back in 2014.

1 Introduction

The recent advent of neural sequence-to-sequence modeling has resulted in significant progress in the field of machine translation, with large improvements in standard benchmarks (Vaswani et al., 2017; Edunov et al., 2018) and the first solid claims of human parity in certain settings (Hassan et al., 2018). Unfortunately, these systems rely on large amounts of parallel corpora, which are only available for a few combinations of major languages like English, German and French.

Aiming to remove this dependency on parallel data, a recent research line has managed to train unsupervised machine translation systems using monolingual corpora only. The first such systems were based on Neural Machine Translation (NMT), and combined denoising autoencoding and back-translation to train a dual model initialized with cross-lingual embeddings (Artetxe et al., 2018c; Lample et al., 2018a). Nevertheless, these early systems were later superseded by Statistical Machine Translation (SMT) based approaches, which induced an initial phrase-table through cross-lingual embedding mappings, combined it with an n-gram language model, and further improved the system through iterative back-translation (Lample et al., 2018b; Artetxe et al., 2018b).

In this paper, we develop a more principled approach to unsupervised SMT, addressing several deficiencies of previous systems by incorporating subword information, applying a theoretically well-founded unsupervised tuning method, and developing a joint refinement procedure. In addition to that, we use our improved SMT approach to initialize an unsupervised NMT system, which is further improved through on-the-fly back-translation.

Our experiments on WMT 2014/2016 French-English and German-English show the effectiveness of our approach, as our proposed system outperforms the previous state-of-the-art in unsupervised machine translation by 5-7 BLEU points in all these datasets and translation directions. Our system also outperforms the supervised WMT 2014 shared task winner in English-to-German, and is around 2 BLEU points behind it in the rest of the translation directions, suggesting that unsupervised machine translation can be a usable alternative in practical settings.

The remainder of this paper is organized as follows. Section 2 first discusses related work on the topic. Section 3 then describes our principled unsupervised SMT method, while Section 4 discusses our hybridization method with NMT. We then present the experiments done and the results obtained in Section 5, and Section 6 concludes the paper.
2 Related work

Early attempts to build machine translation systems with monolingual corpora go back to statistical decipherment (Ravi and Knight, 2011; Dou and Knight, 2012). These methods see the source language as ciphertext produced by a noisy channel model that first generates the original English text and then probabilistically replaces the words in it. The English generative process is modeled using an n-gram language model, and the channel model parameters are estimated using either expectation maximization or Bayesian inference. This basic approach was later improved by incorporating syntactic knowledge (Dou and Knight, 2013) and word embeddings (Dou et al., 2015). Nevertheless, these methods were only shown to work in limited settings, being most often evaluated in word-level translation.

More recently, the task got renewed interest after the concurrent work of Artetxe et al. (2018c) and Lample et al. (2018a) on unsupervised NMT which, for the first time, obtained promising results in standard machine translation benchmarks using monolingual corpora only. Both methods build upon the recent work on unsupervised cross-lingual embedding mappings, which independently train word embeddings in two languages and learn a linear transformation to map them to a shared space through self-learning (Artetxe et al., 2017, 2018a) or adversarial training (Conneau et al., 2018). The resulting cross-lingual embeddings are used to initialize a shared encoder for both languages, and the entire system is trained using a combination of denoising autoencoding, back-translation and, in the case of Lample et al. (2018a), adversarial training. This method was further improved by Yang et al. (2018), who use two language-specific encoders sharing only a subset of their parameters, and incorporate a local and a global generative adversarial network.

Nevertheless, it was later argued that the modular architecture of phrase-based SMT was more suitable for this problem, and Lample et al. (2018b) and Artetxe et al. (2018b) adapted the same principles discussed above to train an unsupervised SMT model, obtaining large improvements over the original unsupervised NMT systems. More concretely, both approaches learn cross-lingual n-gram embeddings from monolingual corpora based on the mapping method discussed earlier, and use them to induce an initial phrase-table that is combined with an n-gram language model and a distortion model. This initial system is then refined through iterative back-translation (Sennrich et al., 2016) which, in the case of Artetxe et al. (2018b), is preceded by an unsupervised tuning step. Our work identifies some deficiencies in these previous systems, and proposes a more principled approach to unsupervised SMT that incorporates subword information, uses a theoretically better founded unsupervised tuning method, and applies a joint refinement procedure, outperforming these previous systems by a substantial margin.

Very recently, some authors have tried to combine both SMT and NMT to build hybrid unsupervised machine translation systems. This idea was already explored by Lample et al. (2018b), who aided the training of their unsupervised NMT system by combining standard back-translation with synthetic parallel data generated by unsupervised SMT. Marie and Fujita (2018) go further and use synthetic parallel data from unsupervised SMT to train a conventional NMT system from scratch. The resulting NMT model is then used to augment the synthetic parallel corpus through back-translation, and a new NMT model is trained on top of it from scratch, repeating the process iteratively. Ren et al. (2019) follow a similar approach, but use SMT as posterior regularization at each iteration. As shown later in our experiments, our proposed NMT hybridization obtains substantially larger absolute gains than all these previous approaches, even if our initial SMT system is stronger and thus more challenging to improve upon.

3 Principled unsupervised SMT

Phrase-based SMT is formulated as a log-linear combination of several statistical models: a translation model, a language model, a reordering model and a word/phrase penalty. As such, building an unsupervised SMT system requires learning these different components from monolingual corpora. As it turns out, this is straightforward for most of them: the language model is learned from monolingual corpora by definition; the word and phrase penalties are parameterless; and one can drop the standard lexical reordering model at a small cost and do with the distortion model alone, which is also parameterless. This way, the main challenge left is learning the translation model, that is, building the phrase-table.
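As a concrete illustration of this log-linear formulation, the following minimal Python sketch (ours, not part of the paper; the feature values and weights are made-up assumptions) scores one translation hypothesis as a weighted sum of log feature scores. The weights of this combination are exactly what the unsupervised tuning procedure of Section 3.3 adjusts:

```python
import math

# Made-up component scores for a single translation hypothesis: probabilities
# from the translation and language models, plus a parameterless penalty.
features = {
    "translation_model": 0.02,  # phrase translation probability
    "language_model": 0.001,    # n-gram language model probability
    "distortion": 0.5,          # distortion (reordering) score
}
word_penalty = 7                # number of words in the output

# Hypothetical log-linear weights; tuning (e.g. MERT) searches over these.
weights = {"translation_model": 1.0, "language_model": 0.8,
           "distortion": 0.3, "word_penalty": -0.1}

score = sum(weights[k] * math.log(v) for k, v in features.items())
score += weights["word_penalty"] * word_penalty
print(f"log-linear score: {score:.3f}")  # used to rank competing hypotheses
```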
Our proposed method starts by building an initial phrase-table through cross-lingual embedding mappings (Section 3.1). This initial phrase-table is then extended by incorporating subword information, addressing one of the main limitations of previous unsupervised SMT systems (Section 3.2). Having done that, we adjust the weights of the underlying log-linear model through a novel unsupervised tuning procedure (Section 3.3). Finally, we further improve the system by jointly refining two models in opposite directions (Section 3.4).

3.1 Initial phrase-table

So as to build our initial phrase-table, we follow Artetxe et al. (2018b) and learn n-gram embeddings for each language independently, map them to a shared space through self-learning, and use the resulting cross-lingual embeddings to extract and score phrase pairs.

More concretely, we train our n-gram embeddings using phrase2vec,[1] a simple extension of skip-gram that applies the standard negative sampling loss of Mikolov et al. (2013) to bigram-context and trigram-context pairs in addition to the usual word-context pairs.[2] Having done that, we map the embeddings to a cross-lingual space using VecMap[3] with identical initialization (Artetxe et al., 2018a), which builds an initial solution by aligning identical words and iteratively improves it through self-learning. Finally, we extract translation candidates by taking the 100 nearest neighbors of each source phrase, and score them by applying the softmax function over their cosine similarities:

    \phi(\bar{f} \mid \bar{e}) = \frac{\exp\left(\cos(\bar{e}, \bar{f}) / \tau\right)}{\sum_{\bar{f}'} \exp\left(\cos(\bar{e}, \bar{f}') / \tau\right)}

where the temperature τ is estimated using maximum likelihood estimation over a dictionary induced in the reverse direction. In addition to the phrase translation probabilities in both directions, the forward and reverse lexical weightings are also estimated by aligning each word in the target phrase with the one in the source phrase most likely generating it, and taking the product of their respective translation probabilities. The reader is referred to Artetxe et al. (2018b) for more details.

[1] https://github.com/artetxem/phrase2vec
[2] So as to keep the model size within a reasonable limit, we restrict the vocabulary to the most frequent 200,000 unigrams, 400,000 bigrams and 400,000 trigrams.
[3] https://github.com/artetxem/vecmap
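To make the scoring step concrete, here is a small Python sketch of our understanding of it (not the authors' code; the toy vectors and the fixed τ = 0.1 are assumptions, since the paper estimates τ by maximum likelihood):

```python
import numpy as np

def phrase_translation_probs(src_vec, cand_vecs, tau=0.1):
    """Score the translation candidates of one source phrase: a softmax over
    cosine similarities with temperature tau (fixed here for illustration)."""
    src = src_vec / np.linalg.norm(src_vec)
    cands = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    cos = cands @ src                # cosine similarity to each candidate
    logits = cos / tau
    logits -= logits.max()           # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: one source phrase embedding, three candidate target embeddings
# (in the paper, the 100 nearest neighbors in the mapped cross-lingual space).
rng = np.random.default_rng(0)
print(phrase_translation_probs(rng.normal(size=300), rng.normal(size=(3, 300))))
```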
3.2 Adding subword information

An inherent limitation of existing unsupervised SMT systems is that words are taken as atomic units, making it impossible to exploit character-level information. This is reflected in the known difficulty of these models to translate named entities, as it is very challenging to discriminate among related proper nouns based on distributional information alone, yielding translation errors like "Sunday Telegraph" → "The Times of London" (Artetxe et al., 2018b).

So as to overcome this issue, we propose to incorporate subword information once the initial alignment is done at the word/phrase level. For that purpose, we add two additional weights to the initial phrase-table that are analogous to the lexical weightings, but use a character-level similarity function instead of word translation probabilities:

    score(\bar{f} \mid \bar{e}) = \prod_i \max\left(\epsilon, \max_j sim(f_i, e_j)\right)

where ε = 0.3 guarantees a minimum similarity score, as we want to favor translation candidates that are similar at the character level without excessively penalizing those that are not. In our case, we use a simple similarity function that normalizes the Levenshtein distance lev(·) (Levenshtein, 1966) by the length of the words len(·):

    sim(f, e) = 1 - \frac{lev(f, e)}{\max(len(f), len(e))}

We leave the exploration of more elaborate similarity functions and, in particular, learnable metrics (McCallum et al., 2005) for future work.
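A minimal Python sketch of this character-level score follows (ours, for illustration; only the ε = 0.3 floor is taken from the text above):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def sim(f: str, e: str) -> float:
    """Levenshtein distance normalized by the longer word's length."""
    return 1.0 - levenshtein(f, e) / max(len(f), len(e))

def char_score(f_phrase, e_phrase, eps=0.3):
    """Character-level analogue of the lexical weighting: for each source
    word, take its best similarity to any target word, floored at eps."""
    prod = 1.0
    for f in f_phrase:
        prod *= max(eps, max(sim(f, e) for e in e_phrase))
    return prod

print(char_score(["Sunday", "Telegraph"], ["Sunday", "Telegraph"]))  # 1.0
print(char_score(["Sunday", "Telegraph"], ["Times", "London"]))      # much lower
```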
3.3 Unsupervised tuning

Having trained the underlying statistical models independently, SMT tuning aims to adjust the weights of their resulting log-linear combination to optimize some evaluation metric like BLEU on a parallel validation corpus, which is typically done through Minimum Error Rate Training or MERT (Och, 2003). Needless to say, this cannot be done in strictly unsupervised settings, but we argue that it would still be desirable to optimize some unsupervised criterion that is expected to correlate well with test performance. Unfortunately, neither of the existing unsupervised SMT systems does so: Artetxe et al. (2018b) use a heuristic that builds two initial models in opposite directions, uses one of them to generate a synthetic parallel corpus through back-translation (Sennrich et al., 2016), and applies MERT to tune the model in the reverse direction, iterating until convergence, whereas Lample et al. (2018b) do not perform any tuning at all. In what follows, we propose a more principled approach to tuning that defines an unsupervised criterion and an optimization procedure that is guaranteed to converge to a local optimum of it.

Inspired by previous work on CycleGANs (Zhu et al., 2017) and dual learning (He et al., 2016), our method takes two initial models in opposite directions, and defines an unsupervised optimization objective that combines a cyclic consistency loss and a language model loss over the two monolingual corpora E and F:

    \mathcal{L} = \mathcal{L}_{cycle}(E) + \mathcal{L}_{cycle}(F) + \mathcal{L}_{lm}(E) + \mathcal{L}_{lm}(F)

The cyclic consistency loss captures the intuition that the translation of a translation should be close to the original text. So as to quantify this, we take a monolingual corpus in the source language, translate it to the target language and back to the source language, and compute its BLEU score taking the original text as reference:

    \mathcal{L}_{cycle}(E) = 1 - \mathrm{BLEU}(T_{F \rightarrow E}(T_{E \rightarrow F}(E)), E)

At the same time, the language model loss captures the intuition that machine translation should produce fluent text in the target language. For that purpose, we estimate the per-word entropy in the target language corpus using an n-gram language model, and penalize machine translated text whose per-word entropy falls below it as follows:[4]

    \mathcal{L}_{lm}(E) = LP \cdot \max(0, H(F) - H(T_{E \rightarrow F}(E)))^2

where the length penalty LP = LP(E) · LP(F) penalizes excessively long translations:[5]

    LP(E) = \max\left(1, \frac{len(T_{F \rightarrow E}(T_{E \rightarrow F}(E)))}{len(E)}\right)

So as to minimize the combined loss function, we adapt MERT to jointly optimize the parameters of the two models. In its basic form, MERT approximates the search space for each source sentence through an n-best list, and performs a form of coordinate descent by computing the optimal value for each parameter through an efficient line search method and greedily taking the step that leads to the largest gain. The process is repeated iteratively until convergence, augmenting the n-best list with the updated parameters at each iteration so as to obtain a better approximation of the full search space. Given that our optimization objective combines two translation systems T_{F→E}(T_{E→F}(E)), this would require generating an n-best list for T_{E→F}(E) first and, for each entry on it, generating a new n-best list with T_{F→E}, yielding a combined n-best list with N² entries. So as to make it more efficient, we propose an alternating optimization approach where we fix the parameters of one model and optimize the other with standard MERT. Thanks to this, we do not need to expand the search space of the fixed model, so we can do with an n-best list of N entries alone. Having done that, we fix the parameters of the opposite model and optimize the other, iterating until convergence.

[4] We initially tried to directly minimize the entropy of the generated text, but this worked poorly in our preliminary experiments. More concretely, the behavior of the optimization algorithm was very unstable, as it tended to excessively focus on either the cyclic consistency loss or the language model loss at the cost of the other, and we found it very difficult to find the right balance between the two factors.
[5] Without this penalization, the system tended to produce unnecessary tokens (e.g. quotes) that looked natural in their context, which served to minimize the per-word perplexity of the output. Minimizing the overall perplexity instead of the per-word perplexity did not solve the problem, as the opposite phenomenon arose (i.e. the system tended to produce excessively short translations).
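The following schematic Python sketch shows one way to compute this criterion (a simplification we wrote, not the authors' implementation; sacrebleu stands in as the BLEU scorer, and the translation callables fwd/bwd and the per_word_entropy language model helper are assumed to be provided by the surrounding system):

```python
import sacrebleu

def cycle_loss(corpus_src, fwd, bwd):
    """L_cycle = 1 - BLEU(bwd(fwd(corpus)), corpus), with BLEU scaled to [0, 1].
    fwd/bwd are assumed sentence-level translation callables."""
    round_trip = [bwd(fwd(s)) for s in corpus_src]
    return 1.0 - sacrebleu.corpus_bleu(round_trip, [corpus_src]).score / 100.0

def length_penalty(corpus_src, fwd, bwd):
    """LP(E) = max(1, len(round trip) / len(original)), counting words."""
    round_trip = [bwd(fwd(s)) for s in corpus_src]
    n = lambda corpus: sum(len(s.split()) for s in corpus)
    return max(1.0, n(round_trip) / n(corpus_src))

def lm_loss(corpus_src, corpus_tgt, fwd, bwd, per_word_entropy):
    """L_lm: squared gap between the per-word entropy of real target text and
    that of translated text, scaled by LP = LP(E) * LP(F); per_word_entropy
    is an assumed n-gram LM helper returning entropy per word."""
    gap = max(0.0, per_word_entropy(corpus_tgt)
                   - per_word_entropy([fwd(s) for s in corpus_src]))
    lp = length_penalty(corpus_src, fwd, bwd) * length_penalty(corpus_tgt, bwd, fwd)
    return lp * gap ** 2

# Combined criterion minimized by the alternating MERT procedure:
# L = cycle_loss(E, t_ef, t_fe) + cycle_loss(F, t_fe, t_ef)
#     + lm_loss(E, F, t_ef, t_fe, H) + lm_loss(F, E, t_fe, t_ef, H)
```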
3.4 Joint refinement

Constrained by the lack of parallel corpora, the procedure described so far makes important simplifications that could compromise its potential performance: its phrase-table is somewhat unnatural (e.g. the translation probabilities are estimated from cross-lingual embeddings rather than actual frequency counts) and it lacks a lexical reordering model altogether. So as to overcome this issue, existing unsupervised SMT methods generate a synthetic parallel corpus through back-translation and use it to train a standard SMT system from scratch, iterating until convergence.
An obvious drawback of this approach is that the back-translated side will contain ungrammatical n-grams that will end up in the induced phrase-table. One could argue that this should be innocuous as long as the ungrammatical n-grams are in the source side, as they should never occur in real text and their corresponding entries in the phrase-table should therefore not be used. However, ungrammatical source phrases do ultimately affect the estimation of the backward translation probabilities, including those of grammatical phrases. For instance, let's say that the target phrase "dos gatos" has been aligned 10 times with "two cats" and 90 times with "two cat". While the ungrammatical phrase-table entry two cat - dos gatos should never be picked, the backward probability estimation of two cats - dos gatos is still affected by it (it would be 0.1 instead of 1.0 in this example).

We argue that, ultimately, the backward probability estimations can only be meaningful when all source phrases are grammatical (so the probabilities of all plausible translations sum to one) and, similarly, the forward probability estimations can only be meaningful when all target phrases are grammatical. Following this observation, we propose an alternative approach that jointly refines both translation directions. More concretely, we use the initial systems to build two synthetic corpora in opposite directions.[6] Having done that, we independently extract phrase pairs from each synthetic corpus, and build a phrase-table by taking their intersection. The forward probabilities are estimated in the parallel corpus with the synthetic source side, while the backward probabilities are estimated in the one with the synthetic target side. Not only does this guarantee that the probability estimates are meaningful as discussed previously, but it also discards the ungrammatical phrases altogether, as both the source and the target n-grams must have occurred in the original monolingual texts to be present in the resulting phrase-table. We repeat this process for a total of 3 iterations.

[6] For efficiency purposes, we restrict the size of each synthetic parallel corpus to 10 million sentence pairs.
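To illustrate why this fixes the example above, here is a toy Python sketch of our reading of the joint refinement idea (a deliberate simplification we wrote, not the actual Moses-based pipeline; phrase extraction is reduced to counting pre-extracted pairs):

```python
from collections import Counter

def cond_probs(pairs, given_index):
    """Relative-frequency estimate P(other side | given side) from phrase
    pairs, where given_index selects the conditioning side (0=src, 1=tgt)."""
    counts = Counter(pairs)
    totals = Counter(p[given_index] for p in pairs)
    return {p: c / totals[p[given_index]] for p, c in counts.items()}

# Corpus A: synthetic English source, real Spanish target. Back-translation
# noise puts the ungrammatical "two cat" on the source side.
corpus_a = [("two cats", "dos gatos")] * 10 + [("two cat", "dos gatos")] * 90
# Corpus B: real English source, synthetic Spanish target. "two cat" cannot
# occur here, because the English side is real monolingual text.
corpus_b = [("two cats", "dos gatos")] * 100

# Keep only phrase pairs extracted from BOTH corpora: this drops the
# ungrammatical ("two cat", "dos gatos") entry altogether.
table = set(corpus_a) & set(corpus_b)

biased = cond_probs(corpus_a, 1)[("two cats", "dos gatos")]   # 0.1, the problem
forward = {p: q for p, q in cond_probs(corpus_a, 0).items() if p in table}
backward = {p: q for p, q in cond_probs(corpus_b, 1).items() if p in table}
print(biased, backward[("two cats", "dos gatos")])            # 0.1 vs. 1.0
```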
4 NMT hybridization

While the rigid and modular design of SMT provides a very suitable framework for unsupervised machine translation, NMT has proven to be a fairly superior paradigm in supervised settings, outperforming SMT by a large margin in standard benchmarks. As such, the choice of SMT over NMT also imposes a hard ceiling on the potential performance of these approaches, as unsupervised SMT systems inherit the very same limitations of their supervised counterparts (e.g. the locality and sparsity problems). For that reason, we argue that SMT provides a more appropriate architecture to find an initial alignment between the languages, but NMT is ultimately a better architecture to model the translation process.

Following this observation, we propose a hybrid approach that uses unsupervised SMT to warm up a dual NMT model trained through iterative back-translation. More concretely, we first train two SMT systems in opposite directions as described in Section 3, and use them to assist the training of another two NMT systems in opposite directions. These NMT systems are trained following an iterative process where, at each iteration, we alternately update the model in each direction by performing a single pass over a synthetic parallel corpus built through back-translation (Sennrich et al., 2016).[7] In the first iteration, the synthetic parallel corpus is entirely generated by the SMT system in the opposite direction but, as training progresses and the NMT models get better, we progressively switch to a synthetic parallel corpus generated by the reverse NMT model. More concretely, iteration t uses N_smt = N · max(0, 1 - t/a) synthetic parallel sentences from the reverse SMT system, where the parameter a controls the number of transition iterations from SMT to NMT back-translation. The remaining N - N_smt sentences are generated by the reverse NMT model. Inspired by Edunov et al. (2018), we use greedy decoding for half of them, which produces more fluent and predictable translations, and random sampling for the other half, which produces more varied translations. In our experiments, we use N = 1,000,000 and a = 30, and perform a total of 60 such iterations. At test time, we use beam search decoding with an ensemble of all checkpoints from every 10 iterations.

[7] Note that we do not train a new model from scratch each time, but continue training the model from the previous iteration.
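The transition schedule itself is simple enough to state in a few lines of Python; this sketch (ours) just instantiates the formula above with the values N = 1,000,000 and a = 30 used in our experiments:

```python
N = 1_000_000  # synthetic sentence pairs generated per iteration
a = 30         # transition iterations from SMT to NMT back-translation

def schedule(t: int) -> tuple[int, int]:
    """Sentences back-translated by the reverse SMT vs. NMT model at iteration t.
    Of the NMT share, half is decoded greedily and half by random sampling."""
    n_smt = int(N * max(0.0, 1.0 - t / a))
    return n_smt, N - n_smt

for t in (0, 15, 30, 60):
    n_smt, n_nmt = schedule(t)
    print(f"iteration {t:2d}: SMT={n_smt:>9,}  NMT={n_nmt:>9,}")
# iteration  0: SMT=1,000,000  NMT=        0
# iteration 15: SMT=  500,000  NMT=  500,000
# iteration 30: SMT=        0  NMT=1,000,000  (pure NMT back-translation onward)
```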
5 Experiments and results

In order to make our experiments comparable to previous work, we use the French-English and German-English datasets from the WMT 2014 shared task. More concretely, our training data consists of the concatenation of all News Crawl monolingual corpora from 2007 to 2013, which make a total of 749 million tokens in French, 1,606 million in German, and 2,109 million in English, from which we take a random subset of 2,000 sentences for tuning (Section 3.3). Preprocessing is done using standard Moses tools, and involves punctuation normalization, tokenization with aggressive hyphen splitting, and truecasing.

Our SMT implementation is based on Moses,[8] and we use the KenLM (Heafield et al., 2013) tool included in it to estimate our 5-gram language model with modified Kneser-Ney smoothing. Our unsupervised tuning implementation is based on Z-MERT (Zaidan, 2009), and we use FastAlign (Dyer et al., 2013) for word alignment within the joint refinement procedure. Finally, we use the big transformer implementation from fairseq[9] for our NMT system, training with a total batch size of 20,000 tokens across 8 GPUs with the exact same hyperparameters as Ott et al. (2018).

We use newstest2014 as our test set for French-English, and both newstest2014 and newstest2016 (from WMT 2016[10]) for German-English. Following common practice, we report tokenized BLEU scores as computed by the multi-bleu.perl script included in Moses. In addition to that, we also report detokenized BLEU scores as computed by SacreBLEU[11] (Post, 2018), which is equivalent to the official mteval-v13a.pl script.

We next present the results of our proposed system in comparison to previous work in Section 5.1. Section 5.2 then compares the obtained results to those of different supervised systems. Finally, Section 5.3 presents some translation examples from our system.

[8] http://www.statmt.org/moses/
[9] https://github.com/pytorch/fairseq
[10] Note that it is only the test set that is from WMT 2016. All the training data comes from WMT 2014 News Crawl, so it is likely that our results could be further improved by using the more extensive monolingual corpora from WMT 2016.
[11] SacreBLEU signature: BLEU+case.mixed+lang.LANG+numrefs.1+smooth.exp+test.TEST+tok.13a+version.1.2.11, with LANG ∈ {fr-en, en-fr, de-en, en-de} and TEST ∈ {wmt14/full, wmt16}.

5.1 Main results

Table 1 reports the results of the proposed system in comparison to previous work. As it can be seen, our full system obtains the best published results in all cases, outperforming the previous state-of-the-art by 5-7 BLEU points in all datasets and translation directions.

                                       WMT-14                     WMT-16
                             fr-en   en-fr   de-en   en-de    de-en   en-de
NMT
  Artetxe et al. (2018c)      15.6    15.1    10.2     6.6      -       -
  Lample et al. (2018a)       14.3    15.1     -       -      13.3     9.6
  Yang et al. (2018)          15.6    17.0     -       -      14.6    10.9
  Lample et al. (2018b)       24.2    25.1     -       -      21.0    17.2
SMT
  Artetxe et al. (2018b)      25.9    26.2    17.4    14.1    23.1    18.2
  Lample et al. (2018b)       27.2    28.1     -       -      22.9    17.9
  Marie and Fujita (2018)*     -       -       -       -      20.2    15.5
  Proposed system             28.4    30.1    20.1    15.8    25.4    19.7
    detok. SacreBLEU*         27.9    27.8    19.7    14.7    24.8    19.4
SMT + NMT
  Lample et al. (2018b)       27.7    27.6     -       -      25.2    20.2
  Marie and Fujita (2018)*     -       -       -       -      26.7    20.0
  Ren et al. (2019)           28.9    29.5    20.4    17.0    26.3    21.7
  Proposed system             33.5    36.2    27.0    22.5    34.4    26.9
    detok. SacreBLEU*         33.2    33.6    26.4    21.2    33.8    26.4

Table 1: Results of the proposed method in comparison to previous work (BLEU). Overall best results are in bold, the best ones in each group are underlined. * Detokenized BLEU equivalent to the official mteval-v13a.pl script. The rest use tokenized BLEU with multi-bleu.perl (or similar).

A substantial part of this improvement comes from our more principled unsupervised SMT approach, which outperforms all previous SMT-based systems by around 2 BLEU points. Nevertheless, it is the NMT hybridization that brings the largest gains, improving the results of this initial SMT system by 5-9 BLEU points.
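For reference, detokenized scores of this kind can be computed with the sacrebleu Python package mentioned above; this minimal sketch (ours; the hypothesis and reference strings are placeholders) uses its default tokenization, which matches mteval-v13a.pl:

```python
import sacrebleu

# Placeholder detokenized system outputs and references (one sentence each).
hypotheses = ["The M23 was born of a mutiny in April 2012."]
references = [["The M23 was born of an April 2012 mutiny."]]

# sacrebleu defaults to the 13a tokenizer, matching mteval-v13a.pl.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU in the 0-100 range
```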
As shown in Table 2, our absolute gains are considerably larger than those of previous hybridization methods, even if our initial SMT system is substantially better and thus more difficult to improve upon. This way, our initial SMT system is about 4-5 BLEU points above that of Marie and Fujita (2018), yet our absolute gain on top of it is around 2.5 BLEU points higher. When compared to Lample et al. (2018b), we obtain an absolute gain of 5-6 BLEU points in both French-English directions while they do not get any clear improvement, and we obtain an improvement of 7-9 BLEU points in both German-English directions, in contrast with the 2.3 BLEU points they obtain.

                                        WMT-14                     WMT-16
                                  fr-en        en-fr        de-en        en-de
Lample et al. (2018b)
  Initial SMT                     27.2         28.1         22.9         17.9
  + NMT hybrid                    27.7 (+0.5)  27.6 (-0.5)  25.2 (+2.3)  20.2 (+2.3)
Marie and Fujita (2018)
  Initial SMT                      -            -           20.2         15.5
  + NMT hybrid                     -            -           26.7 (+6.5)  20.0 (+4.5)
Proposed system
  Initial SMT                     28.4         30.1         25.4         19.7
  + NMT hybrid                    33.5 (+5.1)  36.2 (+6.1)  34.4 (+9.0)  26.9 (+7.2)

Table 2: NMT hybridization results for different unsupervised machine translation systems (BLEU).

More generally, it is interesting that pure SMT systems perform better than pure NMT systems, yet the best results are obtained by initializing an NMT system with an SMT system. This suggests that the rigid and modular architecture of SMT might be more suitable to find an initial alignment between the languages, but the final system should ultimately be based on NMT for optimal results.

5.2 Comparison with supervised systems

So as to put our results into perspective, Table 3 reports the results of different supervised systems in the same WMT 2014 test set. More concretely, we include the best results from the shared task itself, which reflect the state-of-the-art in machine translation back in 2014; those of Vaswani et al. (2017), who introduced the now predominant transformer architecture; and those of Edunov et al. (2018), who apply back-translation at a large scale and hold the current best results in the test set.

                                         WMT-14
                              fr-en    en-fr    de-en    en-de
Unsupervised
  Proposed system              33.5     36.2     27.0     22.5
    detok. SacreBLEU*          33.2     33.6     26.4     21.2
Supervised
  WMT best*                    35.0     35.8     29.0     20.6†
  Vaswani et al. (2017)         -       41.0      -       28.4
  Edunov et al. (2018)          -       45.6      -       35.0

Table 3: Results of the proposed method in comparison to different supervised systems (BLEU). * Detokenized BLEU equivalent to the official mteval-v13a.pl script. The rest use tokenized BLEU with multi-bleu.perl (or similar). † Results in the original test set from WMT 2014, which slightly differs from the full test set used in all subsequent work. Our proposed system obtains 22.4 BLEU points (21.1 detokenized) in that same subset.

As it can be seen, our unsupervised system outperforms the WMT 2014 shared task winner in English-to-German, and is around 2 BLEU points behind it in the other translation directions. This shows that unsupervised machine translation is already competitive with the state-of-the-art in supervised machine translation in 2014. While the field of machine translation has undergone great progress in the last 5 years, and the gap between our unsupervised system and the current state-of-the-art in supervised machine translation is still large as reflected by the other results, this suggests that unsupervised machine translation can be a usable alternative in practical settings.

5.3 Qualitative results

Table 4 shows some translation examples from our proposed system in comparison to those reported by Artetxe et al. (2018b). We choose the exact same sentences reported by Artetxe et al. (2018b), which were randomly taken from newstest2014, so they should be representative of the general behavior of both systems.
Example 1
  Source: D'autres révélations ont fait état de documents divulgués par Snowden selon lesquels la NSA avait intercepté des données et des communications émanant du téléphone portable de la chancelière allemande Angela Merkel et de ceux de 34 autres chefs d'État.
  Reference: Other revelations cited documents leaked by Snowden that the NSA monitored German Chancellor Angela Merkel's cellphone and those of up to 34 other world leaders.
  Artetxe et al. (2018b): Other disclosures have reported documents disclosed by Snowden suggested the NSA had intercepted the communications and data from the mobile phone of German Chancellor Angela Merkel and those of 32 other heads of state.
  Proposed system: Other revelations have pointed to documents disclosed by Snowden that the NSA had intercepted data and communications emanating from German Chancellor Angela Merkel's mobile phone and those of 34 other heads of state.

Example 2
  Source: La NHTSA n'a pas pu examiner la lettre d'information aux propriétaires en raison de l'arrêt de 16 jours des activités gouvernementales, ce qui a ralenti la croissance des ventes de véhicules en octobre.
  Reference: NHTSA could not review the owner notification letter due to the 16-day government shutdown, which tempered auto sales growth in October.
  Artetxe et al. (2018b): The NHTSA could not consider the letter of information to owners because of halting 16-day government activities, which slowed the growth in vehicle sales in October.
  Proposed system: NHTSA said it could not examine the letter of information to owners because of the 16-day halt in government operations, which slowed vehicle sales growth in October.

Example 3
  Source: Le M23 est né d'une mutinerie, en avril 2012, d'anciens rebelles, essentiellement tutsi, intégrés dans l'armée en 2009 après un accord de paix.
  Reference: The M23 was born of an April 2012 mutiny by former rebels, principally Tutsis who were integrated into the army in 2009 following a peace agreement.
  Artetxe et al. (2018b): M23 began as a mutiny in April 2012, former rebels, mainly Tutsi integrated into the national army in 2009 after a peace deal.
  Proposed system: The M23 was born into a mutiny in April 2012, of former rebels, mostly Tutsi, embedded in the army in 2009 after a peace deal.

Example 4
  Source: Tunks a déclaré au Sunday Telegraph de Sydney que toute la famille était «extrêmement préoccupée» du bien-être de sa fille et voulait qu'elle rentre en Australie.
  Reference: Tunks told Sydney's Sunday Telegraph the whole family was "extremely concerned" about his daughter's welfare and wanted her back in Australia.
  Artetxe et al. (2018b): Tunks told The Times of London from Sydney that the whole family was "extremely concerned" of the welfare of her daughter and wanted it to go in Australia.
  Proposed system: Tunks told the Sunday Telegraph in Sydney that the whole family was "extremely concerned" about her daughter's well-being and wanted her to go into Australia.

Table 4: Randomly chosen translation examples from French→English newstest2014 in comparison to those reported by Artetxe et al. (2018b).

While not perfect, our proposed system produces generally fluent translations that accurately capture the meaning of the original text. In line with our quantitative results, this suggests that unsupervised machine translation can be a usable alternative in practical settings.

Compared to Artetxe et al. (2018b), our translations are generally more fluent, which is not surprising given that they are produced by an NMT system rather than an SMT system. In addition to that, the system of Artetxe et al. (2018b) has some adequacy issues when translating named entities and numerals (e.g. 34 → 32, Sunday Telegraph → The Times of London), which we do not observe for our proposed system in these examples.

6 Conclusions and future work

In this paper, we identify several deficiencies in previous unsupervised SMT systems, and propose a more principled approach that addresses them by incorporating subword information, using a theoretically well-founded unsupervised tuning method, and developing a joint refinement procedure. In addition to that, we use our improved SMT approach to initialize a dual NMT model that is further improved through on-the-fly back-translation. Our experiments show the effectiveness of our approach, as we improve the previous state-of-the-art in unsupervised machine translation by 5-7 BLEU points in French-English and German-English WMT 2014 and 2016.

In the future, we would like to explore learnable similarity functions like the one proposed by McCallum et al. (2005) to compute the character-level scores in our initial phrase-table. In addition to that, we would like to incorporate a language modeling loss during NMT training similar to He et al. (2016). Finally, we would like to adapt our approach to more relaxed scenarios with multiple languages and/or small parallel corpora.
Acknowledgments

This research was partially supported by the Spanish MINECO (UnsupNMT TIN2017-91692-EXP, cofunded by EU FEDER), the UPV/EHU (excellence research group), and the NVIDIA GPU grant program. Mikel Artetxe enjoys a doctoral grant from the Spanish MECD.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451-462, Vancouver, Canada. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789-798. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018b. Unsupervised statistical machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3632-3642, Brussels, Belgium. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018c. Unsupervised neural machine translation. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).

Qing Dou and Kevin Knight. 2012. Large scale decipherment for out-of-domain machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 266-275, Jeju Island, Korea. Association for Computational Linguistics.

Qing Dou and Kevin Knight. 2013. Dependency-based decipherment for resource-limited machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1668-1676, Seattle, Washington, USA. Association for Computational Linguistics.

Qing Dou, Ashish Vaswani, Kevin Knight, and Chris Dyer. 2015. Unifying Bayesian inference and vector space models for improved decipherment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 836-845, Beijing, China. Association for Computational Linguistics.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644-648, Atlanta, Georgia. Association for Computational Linguistics.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489-500, Brussels, Belgium. Association for Computational Linguistics.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems 29, pages 820-828.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 690-696, Sofia, Bulgaria. Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039-5049, Brussels, Belgium. Association for Computational Linguistics.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Benjamin Marie and Atsushi Fujita. 2018. Unsupervised neural machine translation initialized by unsupervised statistical machine translation. arXiv preprint arXiv:1810.12703.

Andrew McCallum, Kedar Bellare, and Fernando Pereira. 2005. A conditional random field for discriminatively-trained finite-state string edit distance. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 388-395.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111-3119.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167, Sapporo, Japan. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 1-9, Brussels, Belgium. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186-191, Brussels, Belgium. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 12-21, Portland, Oregon, USA. Association for Computational Linguistics.

Shuo Ren, Zhirui Zhang, Shujie Liu, Ming Zhou, and Shuai Ma. 2019. Unsupervised neural machine translation with SMT as posterior regularization. arXiv preprint arXiv:1901.04112.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86-96, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Unsupervised neural machine translation with weight sharing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46-55. Association for Computational Linguistics.

Omar Zaidan. 2009. Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79-88.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV).