Abu-MaTran at WMT 2014 Translation Task: Two-step Data Selection and RBMT-Style Synthetic Rules

Raphael Rubino*, Antonio Toral†, Víctor M. Sánchez-Cartagena*‡, Jorge Ferrández-Tordera*, Sergio Ortiz-Rojas*, Gema Ramírez-Sánchez*, Felipe Sánchez-Martínez‡, Andy Way†

* Prompsit Language Engineering, S.L., Elche, Spain — {rrubino,vmsanchez,jferrandez,sortiz,gramirez}@prompsit.com
† NCLT, School of Computing, Dublin City University, Ireland — {atoral,away}@computing.dcu.ie
‡ Dep. Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Spain — fsanchez@dlsi.ua.es

Abstract

This paper presents the machine translation systems submitted by the Abu-MaTran project to the WMT 2014 translation task. The language pair concerned is English–French, with a focus on French as the target language. The French to English translation direction is also considered, based on the word alignment computed in the other direction. Large language and translation models are built using all the datasets provided by the shared task organisers, as well as the monolingual data from LDC. To build the translation models, we apply a two-step data selection method based on bilingual cross-entropy difference and vocabulary saturation, considering each parallel corpus individually. Synthetic translation rules are extracted from the development sets and used to train another translation model. We then interpolate the translation models, minimising the perplexity on the development sets, to obtain our final SMT system. Our submission for the English to French translation task was ranked second amongst nine teams and a total of twenty submissions.

1 Introduction

This paper presents the systems submitted by the Abu-MaTran project (runs named DCU-Prompsit-UA) to the WMT 2014 translation task for the English–French language pair. Phrase-based statistical machine translation (SMT) systems were submitted for the two translation directions, with the focus on the English to French direction. Language models (LMs) and translation models (TMs) are trained using all the data provided by the shared task organisers, as well as the Gigaword monolingual corpora distributed by LDC.

To train the LMs, the monolingual corpora and the target side of the parallel corpora are first used individually to train models. The individual models are then interpolated according to perplexity minimisation on the development sets.

To train the TMs, a baseline is first built using the News Commentary parallel corpus. Second, each remaining parallel corpus is processed individually using bilingual cross-entropy difference (Axelrod et al., 2011) in order to separate pseudo in-domain and out-of-domain sentence pairs, the pseudo out-of-domain instances being further filtered with the vocabulary saturation approach (Lewis and Eetemadi, 2013). Third, synthetic translation rules are automatically extracted from the development set and used to train another translation model, following a novel approach (Sánchez-Cartagena et al., 2014). Finally, we interpolate the four translation models (baseline, in-domain, filtered out-of-domain and rules) by minimising the perplexity obtained on the development sets, and investigate the best tuning and decoding parameters.

The remainder of this paper is organised as follows: the datasets and tools used in our experiments are described in Section 2. Details about the LMs and TMs are given in Section 3 and Section 4 respectively. Finally, we evaluate the performance of the final SMT system with different tuning and decoding parameters in Section 5, before presenting conclusions in Section 6.
2 Datasets and Tools

We use all the monolingual and parallel datasets in English and French provided by the shared task organisers, as well as the LDC Gigaword corpora for the same languages.[1] For each language, a true-case model is trained on all the data using the train-truecaser.perl script included in the MOSES tool-kit (Koehn et al., 2007).

Punctuation marks of all the monolingual and parallel corpora are then normalised using the script normalize-punctuation.perl provided by the organisers, before the corpora are tokenised and true-cased using the scripts distributed with the MOSES tool-kit. The same pre-processing steps are applied to the development and test sets. As development sets, we use all the test sets from previous years of WMT, from 2008 to 2013 (newstest2008-2013). Finally, the training parallel corpora are cleaned using the script clean-corpus-n.perl, keeping sentences longer than 1 word and shorter than 80 words, with a length ratio between the two sides of each sentence pair lower than 4 (a sketch of these criteria follows Table 1).[2] Statistics of the corpora used in our experiments after pre-processing are presented in Table 1.

Corpus                  Sentences (k)   Words (M)
Monolingual Data – English
Europarl v7             2,218.2         59.9
News Commentary v8      304.2           7.4
News Shuffled 2007      3,782.5         90.2
News Shuffled 2008      12,954.5        308.1
News Shuffled 2009      14,680.0        347.0
News Shuffled 2010      6,797.2         157.8
News Shuffled 2011      15,437.7        358.1
News Shuffled 2012      14,869.7        345.5
News Shuffled 2013      21,688.4        495.2
LDC afp                 7,184.9         869.5
LDC apw                 8,829.4         1,426.7
LDC cna                 618.4           45.7
LDC ltw                 986.9           321.1
LDC nyt                 5,327.7         1,723.9
LDC wpb                 108.8           20.8
LDC xin                 5,121.9         423.7
Monolingual Data – French
Europarl v7             2,190.6         63.5
News Commentary v8      227.0           6.5
News Shuffled 2007      119.0           2.7
News Shuffled 2008      4,718.8         110.3
News Shuffled 2009      4,366.7         105.3
News Shuffled 2010      1,846.5         44.8
News Shuffled 2011      6,030.1         146.1
News Shuffled 2012      4,114.4         100.8
News Shuffled 2013      9,256.3         220.2
LDC afp                 6,793.5         784.5
LDC apw                 2,525.1         271.3
Parallel Data
10^9 Corpus             21,327.1        549.0 (EN) / 642.5 (FR)
Common Crawl            3,168.5         76.0 (EN) / 82.7 (FR)
Europarl v7             1,965.5         52.5 (EN) / 56.7 (FR)
News Commentary v9      181.3           4.5 (EN) / 5.3 (FR)
UN                      12,354.7        313.4 (EN) / 356.5 (FR)

Table 1: Data statistics after pre-processing of the monolingual and parallel corpora used in our experiments.

[1] LDC2011T07 English Gigaword Fifth Edition, LDC2011T10 French Gigaword Third Edition.
[2] This ratio was empirically chosen based on word fertility between English and French.
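To make the cleaning criteria concrete, here is a minimal Python sketch of the same three tests (the function keep_pair is ours, for illustration only; in practice the Moses script clean-corpus-n.perl performs this step, and its exact boundary handling may differ):

```python
def keep_pair(src: str, trg: str) -> bool:
    """Mirror the clean-corpus-n.perl settings used above: both sides
    longer than 1 token, shorter than 80 tokens, and a length ratio
    between the two sides below 4."""
    len_src, len_trg = len(src.split()), len(trg.split())
    if min(len_src, len_trg) <= 1 or max(len_src, len_trg) >= 80:
        return False
    return max(len_src, len_trg) / min(len_src, len_trg) < 4.0

# usage: keep only the pairs that pass all three tests
corpus = [("the cat sat .", "le chat était assis .")]
cleaned = [pair for pair in corpus if keep_pair(*pair)]
```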
For training LMs, we use KenLM (Heafield et al., 2013) and the SRILM tool-kit (Stolcke et al., 2011). For training TMs, we use MOSES (Koehn et al., 2007) version 2.1 with MGIZA++ (Och and Ney, 2003; Gao and Vogel, 2008). These tools are used with default parameters except when explicitly stated otherwise.

The decoder used to generate translations is MOSES, with feature weights optimised by MERT (Och, 2003). As our approach relies on training individual TMs, one for each parallel corpus, our final TM is obtained by linearly interpolating the individual ones. The interpolation of TMs is performed with the script tmcombine.py, minimising the cross-entropy between the TM and the concatenated development sets from 2008 to 2012 (noted newstest2008-2012), as described in Sennrich (2012). Finally, we make use of the findings from WMT 2013 brought by the winning team (Durrani et al., 2013) and decide to use the Operation Sequence Model (OSM), based on minimal translation units and Markov chains over sequences of operations, implemented in MOSES and introduced by Durrani et al. (2011).

3 Language Models

The LMs are trained in the same way for both languages. First, each monolingual and parallel corpus is considered individually (except the parallel versions of Europarl and News Commentary) and used to train a 5-gram LM with modified Kneser-Ney smoothing. We then interpolate the individual LMs using the script compute-best-mix available with the SRILM tool-kit (Stolcke et al., 2011), based on their perplexity scores on the concatenation of the development sets from 2008 to 2012 (the 2013 set is held out for the tuning of the TMs).
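The weights produced by compute-best-mix are the coefficients of a linear mixture of the component LMs. As a rough illustration of what such a tool optimises, the following sketch estimates mixture weights by EM, assuming the per-token log10 probabilities of each component LM on the development set have already been dumped (the function name and data layout are our own):

```python
def mixture_weights(token_logprobs, iterations=50):
    """EM estimation of linear interpolation weights, in the spirit of
    SRILM's compute-best-mix. token_logprobs[i][t] is the log10
    probability that component LM i assigns to development token t;
    all component lists cover the same tokens."""
    n_lms = len(token_logprobs)
    n_tokens = len(token_logprobs[0])
    probs = [[10.0 ** lp for lp in lm] for lm in token_logprobs]
    weights = [1.0 / n_lms] * n_lms
    for _ in range(iterations):
        posteriors = [0.0] * n_lms
        for t in range(n_tokens):
            # E-step: responsibility of each LM for this token
            mix = sum(w * probs[i][t] for i, w in enumerate(weights))
            for i, w in enumerate(weights):
                posteriors[i] += w * probs[i][t] / mix
        # M-step: re-normalise responsibilities into new weights
        weights = [p / n_tokens for p in posteriors]
    return weights

# toy usage: two LMs scoring three development tokens
w = mixture_weights([[-1.0, -2.0, -0.5], [-0.8, -1.5, -1.2]])
```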
The final LM for French contains all word sequences from 1-grams to 5-grams present in the training corpora, without any pruning. However, with the computing resources at our disposal, the English LMs could not be interpolated without pruning non-frequent n-grams. Thus, n-grams with n ∈ [3; 5] occurring fewer than twice were removed. Details about the final LMs are given in Table 2.

           1-gram   2-gram   3-gram   4-gram   5-gram
English    13.4     198.6    381.2    776.3    1,068.7
French     6.0      75.5     353.2    850.8    1,354.0

Table 2: Statistics, in millions of n-grams, of the interpolated LMs.

4 Translation Models

In this Section, we describe the TMs trained for the shared task. First, we present the two-step data selection process, which aims to (i) separate in-domain and out-of-domain parallel sentences and (ii) reduce the total amount of out-of-domain data. Second, we detail a novel approach for the automatic extraction of translation rules and their use to enrich the phrase table.

4.1 Parallel Data Filtering and Vocabulary Saturation

Amongst the parallel corpora provided by the shared task organisers, only News Commentary can be considered in-domain with regard to the development and test sets. We use this training corpus to build our baseline SMT system. The other parallel corpora are individually filtered using bilingual cross-entropy difference (Moore and Lewis, 2010; Axelrod et al., 2011). This data filtering method relies on four LMs, two in the source and two in the target language, which aim to model particular features of in-domain and out-of-domain sentences.

We build the in-domain LMs using the source and target sides of the News Commentary parallel corpus. Out-of-domain LMs are trained on a vocabulary-constrained subset of each remaining parallel corpus individually using the SRILM tool-kit, which leads to eight models (four in the source language and four in the target language).[3]

Then, for each out-of-domain parallel corpus, we compute the bilingual cross-entropy difference of each sentence pair as:

    [H_in(S_src) − H_out(S_src)] + [H_in(S_trg) − H_out(S_trg)]    (1)

where S_src and S_trg are the source and target sides of a sentence pair, and H_in and H_out are the cross-entropies of the in-domain and out-of-domain LMs given each side of the sentence pair.

[3] The subsets contain the same number of sentences and the same vocabulary as News Commentary.
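A compact sketch of this scoring step, querying the four LMs through KenLM's Python bindings (the paper trains these filtering LMs with SRILM; the file names and the `pairs` iterable below are placeholders):

```python
import kenlm  # KenLM Python bindings, used here purely for illustration

lm_in_src = kenlm.Model("news-commentary.src.arpa")   # placeholder paths
lm_out_src = kenlm.Model("subset.src.arpa")
lm_in_trg = kenlm.Model("news-commentary.trg.arpa")
lm_out_trg = kenlm.Model("subset.trg.arpa")

def cross_entropy(lm, sentence):
    # Model.score returns log10 P(sentence); normalise per token (+1 for </s>)
    return -lm.score(sentence) / (len(sentence.split()) + 1)

def bilingual_ced(src, trg):
    """Equation (1): the lower the score, the more in-domain the pair."""
    return (cross_entropy(lm_in_src, src) - cross_entropy(lm_out_src, src)) \
         + (cross_entropy(lm_in_trg, trg) - cross_entropy(lm_out_trg, trg))

# pairs: any iterable of (source, target) sentence strings
ranked = sorted(pairs, key=lambda st: bilingual_ced(*st))
```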
The sentence pairs are then ranked, and the lowest-scoring ones are taken to train the pseudo in-domain TMs. However, the cross-entropy difference threshold required to split a corpus in two parts (pseudo in-domain and out-of-domain) is usually set empirically by testing several subset sizes of the top-ranked sentence pairs. This method is costly in our setup, as it would lead to training and evaluating multiple SMT systems for each of the pseudo in-domain parallel corpora. In order to save time and computing power, we consider as pseudo in-domain only those sentence pairs with a bilingual cross-entropy difference below 0, i.e. those deemed more similar to the in-domain LMs than to the out-of-domain LMs (H_in < H_out). A sample of the distribution of scores for the out-of-domain corpora is shown in Figure 1. The resulting pseudo in-domain corpora are used to train individual TMs, as detailed in Table 3.

[Figure 1: Sample of ranked sentence pairs (10k) from each of the out-of-domain parallel corpora (Common Crawl, Europarl, 10^9, UN), plotted by bilingual cross-entropy difference.]

Corpus          Sentences (k)   BLEUdev
Baseline        181.3           27.76
Common Crawl    208.3           27.73
Europarl        142.0           27.63
10^9 Corpus     1,442.4         30.29
UN              642.4           28.91
Interpolation   -               30.78

Table 3: Number of sentence pairs and BLEU scores reported by MERT on English–French newstest2013 for the pseudo in-domain corpora obtained by filtering the out-of-domain corpora with bilingual cross-entropy difference. The interpolation of pseudo in-domain models is evaluated in the last row.

The results obtained using the pseudo in-domain data show BLEU (Papineni et al., 2002) scores comparable or superior to the baseline. The Europarl subset is slightly lower than the baseline, while the subset taken from the 10^9 corpus reaches the highest BLEU of the individual systems (30.29), mainly due to its size: this subset is ten times larger than the one taken from Europarl. The last row of Table 3 shows the BLEU score obtained after interpolating the four pseudo in-domain translation models. This system outperforms the best individual pseudo in-domain one by 0.5 absolute points.
After evaluating the pseudo in-domain parallel data, the remaining sentence pairs of each corpus are considered out-of-domain according to our filtering approach. However, they may still contain useful information, so we make use of these corpora by building individual TMs for each corpus, in the same way we built the pseudo in-domain models. The total amount of remaining data (more than 33 million sentence pairs) makes the training process costly in terms of time and computing power. In order to reduce these costs, sentence pairs with a bilingual cross-entropy difference higher than 10 were filtered out, as we noticed that most of the sentences above this threshold contain noise (non-alphanumeric characters, foreign languages, etc.).

We also limit the size of the remaining data by applying the vocabulary saturation method (Lewis and Eetemadi, 2013); a code sketch follows at the end of this subsection. For the out-of-domain subset of each corpus, we traverse the sentence pairs in the order in which they are ranked by cross-entropy difference and filter out those pairs for which each 1-gram has already been seen at least 10 times. The out-of-domain subset of each parallel corpus is then used to train a TM, and these TMs are interpolated to create the pseudo out-of-domain TM. The results reported by MERT on the newstest2013 development set are detailed in Table 4.

Corpus          Sentences (k)   BLEUdev
Baseline        181.3           27.76
Common Crawl    1,598.7         29.84
Europarl        461.9           28.87
10^9 Corpus     5,153.0         30.50
UN              1,707.3         29.03
Interpolation   -               31.37

Table 4: Number of sentence pairs and BLEU scores reported by MERT on English–French newstest2013 for the pseudo out-of-domain corpora obtained by filtering the out-of-domain corpora with bilingual cross-entropy difference, keeping sentence pairs below an entropy score of 10 and applying vocabulary saturation. The interpolation of pseudo out-of-domain models is evaluated in the last row.

Mainly due to the sizes of the pseudo out-of-domain subsets, the reported BLEU scores are higher than the baseline for the four individual SMT systems as well as for the interpolated one, the latter outperforming the baseline by 3.61 absolute points. Compared to the results obtained with the pseudo in-domain data, we observe a slight improvement of the BLEU scores using the pseudo out-of-domain data. However, despite the considerably larger sizes of the latter datasets, the BLEU scores are not that much higher. For instance, with the 10^9 corpus, the pseudo in-domain and out-of-domain subsets contain 1.4 and 5.1 million sentence pairs respectively, and the two systems reach 30.3 and 30.5 BLEU. These scores indicate that the pseudo in-domain data is used more efficiently on the English–French newstest2013 development set.
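The following sketch illustrates the saturation filter described above. The paper does not state whether the 1-gram counts accumulate over all traversed pairs or only over kept ones; this sketch assumes the latter, which matches the goal of saturating the vocabulary of the selected subset:

```python
from collections import Counter

def saturate(ranked_pairs, threshold=10):
    """Traverse sentence pairs in bilingual cross-entropy difference
    order; drop a pair once every 1-gram it contains has already been
    seen at least `threshold` times in the kept data."""
    counts = Counter()
    kept = []
    for src, trg in ranked_pairs:
        tokens = src.split() + trg.split()
        # keep the pair only if it still contributes an unsaturated 1-gram
        if any(counts[tok] < threshold for tok in tokens):
            kept.append((src, trg))
            counts.update(tokens)
    return kept
```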
4.2 Extraction of Translation Rules

A synthetic phrase table based on shallow-transfer MT rules and dictionaries is built as follows. First, a set of shallow-transfer rules is inferred from the concatenation of the newstest2008-2012 development corpora, exactly in the same way as in the UA-Prompsit submission to this translation shared task (Sánchez-Cartagena et al., 2014). In summary, rules are obtained from a set of bilingual phrases extracted from the parallel corpus after its morphological analysis and part-of-speech disambiguation with the tools in the Apertium rule-based MT platform (Forcada et al., 2011). The extraction algorithm commonly used in phrase-based SMT is followed, with some added heuristics which ensure that the extracted bilingual phrases are compatible with the bilingual dictionary. Then, many different rules are generated from each bilingual phrase, each encoding a different degree of generalisation over the particular example it was extracted from. Finally, the minimum set of rules which correctly reproduces all the bilingual phrases is found with integer linear programming (Garfinkel and Nemhauser, 1972).
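As a heavily simplified illustration of this last step, the rule selection can be cast as a set-cover integer programme, shown here with the generic PuLP modelling library. The actual formulation in Sánchez-Cartagena et al. (2014) involves additional constraints (for instance, for rules that match a phrase but would translate it incorrectly), which are omitted here:

```python
import pulp  # generic ILP modelling library, used for illustration only

def minimal_rule_set(n_rules, phrases, reproduces):
    """Select the smallest set of rules reproducing every bilingual
    phrase; reproduces[p] lists the indices of the rules that correctly
    reproduce phrase p."""
    prob = pulp.LpProblem("minimal_rules", pulp.LpMinimize)
    x = [pulp.LpVariable(f"rule_{i}", cat="Binary") for i in range(n_rules)]
    prob += pulp.lpSum(x)  # objective: number of selected rules
    for p in phrases:
        # every bilingual phrase must be reproduced by >= 1 selected rule
        prob += pulp.lpSum(x[i] for i in reproduces[p]) >= 1
    prob.solve()
    return [i for i in range(n_rules) if x[i].value() == 1]

# toy usage: 3 rules, 2 phrases; rule 0 reproduces both phrases
print(minimal_rule_set(3, ["p1", "p2"], {"p1": [0, 1], "p2": [0, 2]}))
```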
Once the rules have been inferred, the phrase table is built from them and the original rule-based MT dictionaries, following the method of Sánchez-Cartagena et al. (2011), which produced one of the winning systems[4] (together with two online SMT systems) in the pairwise manual evaluation of the WMT11 English–Spanish translation task (Callison-Burch et al., 2011). This phrase table is then interpolated with the baseline TM, and the results are presented in Table 5. A slight improvement over the baseline is observed, which motivates the use of synthetic rules in our final MT system. The small size of this improvement may be related to the limited coverage of the Apertium dictionaries: the English–French bilingual dictionary has a low number of entries compared to more mature language pairs in Apertium, which have around 20 times more bilingual entries.

System            BLEUdev
Baseline          27.76
Baseline+Rules    28.06

Table 5: BLEU scores reported by MERT on English–French newstest2013 for the baseline SMT system standalone and with automatically extracted translation rules.

[4] No other system was found statistically significantly better using the sign test at p ≤ 0.1.

5 Tuning and Decoding

We present in this Section a short selection of our experiments, amongst more than 15 different configurations, conducted on the interpolation of TMs and on tuning and decoding parameters. We first interpolate the four TMs (the baseline, the pseudo in-domain and out-of-domain models, and the translation rules), minimising the perplexity obtained on the concatenated development sets from 2008 to 2012 (newstest2008-2012), as sketched below. We investigate the use of OSM trained on pseudo in-domain data only or on all the available parallel data. Finally, we vary the number of n-best hypotheses used by MERT.
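As a sketch of the model combination performed at this step: with the weights fixed by perplexity minimisation, the linear interpolation of phrase-table probabilities reduces to a weighted sum per phrase pair. tmcombine.py additionally handles the full set of Moses features and treats lexical weights specially, which this illustration omits (all names below are our placeholders):

```python
def interpolate_phrase_tables(tables, weights):
    """Linearly interpolate one probability feature (e.g. p(e|f)):
    tables is a list of dicts mapping (source, target) phrase pairs to
    a probability; pairs absent from a table contribute 0."""
    combined = {}
    for pair in set().union(*tables):  # union of all phrase pairs
        combined[pair] = sum(w * t.get(pair, 0.0)
                             for t, w in zip(tables, weights))
    return combined

# toy usage with two tiny tables and placeholder interpolation weights
t1 = {("chat", "cat"): 0.9, ("chien", "dog"): 0.8}
t2 = {("chat", "cat"): 0.7}
merged = interpolate_phrase_tables([t1, t2], [0.6, 0.4])
```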
Results obtained on the development set newstest2013 are reported in Table 6. These scores show that adding OSM to the interpolated translation models slightly degrades BLEU. However, when the number of n-best hypotheses considered by MERT is increased to 200, the SMT system with OSM outperforms the systems evaluated previously in our experiments. Adding the synthetic translation rules degrades BLEU (as indicated by the last row of the table), so we decided to submit two systems to the shared task: one without and one with synthetic rules. By submitting a system without synthetic rules, we also ensure that our SMT system is constrained according to the shared task guidelines.

System                      BLEUdev
Baseline                    27.76
+ pseudo in + pseudo out    31.93
+ OSM                       31.90
+ MERT 200-best             32.21
+ Rules                     32.10

Table 6: BLEU scores reported by MERT on the English–French newstest2013 development set.

As MERT is not suitable when a large number of features is used (our system uses 19 features), we switch to the Margin Infused Relaxed Algorithm (MIRA) for our submitted systems (Watanabe et al., 2007). The development set used is newstest2012, as we aim to select the best decoding parameters according to the scores obtained when decoding the newstest2013 corpus, after de-truecasing and de-tokenising with the scripts distributed with MOSES. This setup allowed us to compare our results with those of last year's participants in the translation shared task. We pick the decoding parameters leading to the best results in terms of BLEU and decode the official WMT14 test set, newstest2014. The results are reported in Table 7. Results on newstest2013 show that the investigation of decoding parameters leads to an overall improvement of 0.1 BLEU absolute. The results on newstest2014 show that adding the synthetic rules did not improve BLEU and slightly degraded TER (Snover et al., 2006).
System                              BLEU (13a)   TER
newstest2013
Best tuning                         31.02        60.77
cube-pruning (pop-limit 10000)      31.04        60.71
increased table-limit (100)         31.06        60.77
monotonic reordering                31.07        60.69
Best decoding                       31.14        60.66
newstest2014
Best decoding                       34.90        54.70
Best decoding + Rules               34.90        54.80

Table 7: Case-sensitive results obtained with our final English–French SMT system on newstest2013 when experimenting with different decoding parameters. The best parameters are kept to translate the WMT14 test set (newstest2014), and the official results are reported in the last two rows.

In addition to our English→French submission, we submitted a French→English translation. Our French→English MT system is built on the alignments obtained for the English→French direction. The training processes of the two systems are identical, except for the synthetic rules, which are not extracted for the French→English direction. The tuning and decoding parameters for this translation direction are the best ones obtained in our previous experiments on this shared task. The case-sensitive scores obtained for French→English on newstest2014 are 35.0 BLEU (13a) and 53.1 TER, which ranks us at the fifth position for this translation direction.

6 Conclusion

We have presented the MT systems developed by the Abu-MaTran project for the WMT14 translation shared task. We focused on the French–English language pair, and particularly on the English→French direction. We used a two-step data selection process based on bilingual cross-entropy difference and vocabulary saturation, as well as a novel approach for the extraction of synthetic translation rules and their use to enrich the phrase table. For the LMs and the TMs, we relied on training individual models per corpus and interpolating them by minimising perplexity on the development set. Finally, we made use of the findings of WMT13 by including an OSM model.

Our English→French translation system was ranked second amongst nine teams and a total of twenty submissions, while our French→English submission was ranked fifth. As future work, we plan to investigate the effect of adding to the phrase table synthetic translation rules based on larger dictionaries. We would also like to study the link between OSM and the different decoding parameters implemented in MOSES, as we observed inconsistent results in our experiments.

Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain Adaptation via Pseudo In-domain Data Selection. In Proceedings of EMNLP, pages 355–362.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of WMT, pages 22–64.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A Joint Sequence Translation Model with Integrated Reordering. In Proceedings of ACL/HLT, pages 1045–1054.

Nadir Durrani, Barry Haddow, Kenneth Heafield, and Philipp Koehn. 2013. Edinburgh's Machine Translation Systems for European Language Pairs. In Proceedings of WMT, pages 112–119.

Mikel L. Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M. Tyers. 2011. Apertium: A Free/Open-source Platform for Rule-based Machine Translation. Machine Translation, 25(2):127–144.

Qin Gao and Stephan Vogel. 2008. Parallel Implementations of Word Alignment Tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57.

Robert S. Garfinkel and George L. Nemhauser. 1972. Integer Programming, volume 4. Wiley, New York.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. In Proceedings of ACL, pages 690–696.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL, Interactive Poster and Demonstration Sessions, pages 177–180.
William D. Lewis and Sauleh Eetemadi. 2013. Dramatically Reducing Training Data Size Through Vocabulary Saturation. In Proceedings of WMT, pages 281–291.

Robert C. Moore and William Lewis. 2010. Intelligent Selection of Language Model Training Data. In Proceedings of ACL, pages 220–224.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19–51.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of ACL, volume 1, pages 160–167.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL, pages 311–318.

Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, and Juan Antonio Pérez-Ortiz. 2011. Integrating Shallow-transfer Rules into Phrase-based Statistical Machine Translation. In Proceedings of MT Summit XIII, pages 562–569.

Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez. 2014. The UA-Prompsit Hybrid Machine Translation System for the 2014 Workshop on Statistical Machine Translation. In Proceedings of WMT.

Rico Sennrich. 2012. Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation. In Proceedings of EACL, pages 539–549.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of AMTA, pages 223–231.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at Sixteen: Update and Outlook. In Proceedings of ASRU.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online Large-margin Training for Statistical Machine Translation. In Proceedings of EMNLP.