The HiFST System for the EuroParl Spanish-to-English Task∗

Gonzalo Iglesias⋆   Adrià de Gispert‡   Eduardo R. Banga⋆   William Byrne‡

⋆ University of Vigo. Dept. of Signal Processing and Communications. Vigo, Spain. {giglesia,erbanga}@gts.tsc.uvigo.es
‡ University of Cambridge. Dept. of Engineering. CB2 1PZ Cambridge, U.K. {ad465,wjb31}@eng.cam.ac.uk

∗ This work was supported in part by the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022. G. Iglesias supported by Spanish Government research grant BES-2007-15956 (project TEC2006-13694-C03-03).

Abstract: In this paper we present results for the Europarl Spanish-to-English translation task. We use HiFST, a novel hierarchical phrase-based translation system implemented with finite-state technology that creates target lattices rather than k-best lists.

Keywords: Statistical Machine Translation, Hierarchical Phrase-Based Translation, Rule Patterns

1 Introduction

In this paper we report translation results in the EuroParl Spanish-to-English translation task using HiFST, a novel hierarchical phrase-based translation system that builds word translation lattices guided by a CYK parser based on a synchronous context-free grammar induced from automatic word alignments. HiFST has achieved state-of-the-art results for Arabic-to-English and Chinese-to-English translation tasks (Iglesias et al., 2009a). This decoder is easily implemented with Weighted Finite-State Transducers (WFSTs) by means of standard operations such as union, concatenation, epsilon removal, determinization, minimization and shortest-path, available for instance in the OpenFst libraries (Allauzen et al., 2007).

In every CYK cell we build a single, minimal word lattice containing all possible translations of the source sentence span covered by that cell. When derivations contain non-terminals, arcs refer to lower-level lattices for memory efficiency, working effectively as pointers. These are only expanded to the actual translations if pruning is required during search; expansion is otherwise only carried out at the upper-most cell, after the full CYK grid has been traversed.
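To make the finite-state character of the decoder concrete, here is a minimal OpenFst (C++) sketch of the standard operations listed above, applied to a toy lattice; the symbol ids, lattice contents and file name are invented for the example and are not HiFST code.

```cpp
// Toy illustration of the WFST operations HiFST builds on:
// union, epsilon removal, determinization, minimization, shortest path.
#include <fst/fstlib.h>

int main() {
  // A trivial one-word acceptor: state 0 --(word id 1)--> state 1.
  fst::StdVectorFst a;
  a.AddState();
  a.AddState();
  a.SetStart(0);
  a.SetFinal(1, fst::StdArc::Weight::One());
  a.AddArc(0, fst::StdArc(1, 1, fst::StdArc::Weight::One(), 1));

  fst::StdVectorFst b = a;  // a second lattice with the same translation

  fst::Union(&a, b);        // merge alternative translations
  fst::RmEpsilon(&a);       // remove the epsilon arcs Union introduces

  fst::StdVectorFst det;
  fst::Determinize(a, &det);  // recombine identical hypotheses
  fst::Minimize(&det);        // compact the lattice

  fst::StdVectorFst best;
  fst::ShortestPath(det, &best);  // extract the 1-best translation
  best.Write("best.fst");
  return 0;
}
```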
1.1 Related Work

Hierarchical translation systems (Chiang, 2007), or simply Hiero, share many common features with standard phrase-based systems (Koehn, Och, and Marcu, 2003), such as feature-based translation and strong target language models. However, they differ greatly in that they guide their reordering model by a powerful synchronous context-free grammar, which leads to flexible and highly lexicalized reorderings. They yield state-of-the-art performance for some of the most complex translation tasks, such as Chinese-to-English. Thus, Hiero decoding is one of the dominant trends in the field of Statistical Machine Translation.

We summarize some extensions to the basic approach to put our work in context.

Hiero Search Refinements: Huang and Chiang (2007) offer several refinements to cube pruning to improve translation speed. Venugopal et al. (2007) introduce a Hiero variant with relaxed constraints for hypothesis recombination during parsing; speed and results are comparable to those of cube pruning, as described by Chiang (2007). Li and Khudanpur (2008) report significant improvements in translation speed by taking unseen n-grams into account within cube pruning to minimize language model requests. Dyer et al. (2008) extend the translation of source sentences to the translation of input lattices, following Chappelier et al. (1999).

Extensions to Hiero: Several authors describe extensions to Hiero, to incorporate additional syntactic information (Zollmann and Venugopal, 2006; Zhang and Gildea, 2006; Shen, Xu, and Weischedel, 2008; Marton and Resnik, 2008), or to combine it with discriminative latent models (Blunsom, Cohn, and Osborne, 2008).

Analysis and Contrastive Experiments: Zollmann et al. (2008) compare phrase-based, hierarchical and syntax-augmented decoders for translation of Arabic, Chinese, and Urdu into English. Lopez (2008) explores whether lexical reordering or the phrase discontiguity inherent in hierarchical rules explains improvements over phrase-based systems. Hierarchical translation has also been used to great effect in combination with other translation architectures, e.g. (Sim et al., 2007; Rosti et al., 2007).

WFSTs for Translation: There is extensive work in using Weighted Finite-State Transducers for machine translation (Bangalore and Riccardi, 2001; Casacuberta, 2001; Kumar and Byrne, 2005; Mathias and Byrne, 2006; Graehl, Knight, and May, 2008).

This paper proceeds as follows. Section 2 describes the HiFST system; section 3 provides details of the translation task and optimization. Section 4 discusses different strategies to reduce the size of the grammar. Finally, section 5 shows rescoring results, after which we conclude.

2 HiFST System Description

In brief, HiFST is a hierarchical decoder that builds target word lattices guided by a synchronous context-free grammar consisting of a set R = {R_r} of rules R_r : N → ⟨γ_r, α_r⟩ / p_r. A priori, N represents any non-terminal; in this paper, N can be either S, X, or V. As usual, the special glue rules S → ⟨X, X⟩ and S → ⟨S X, S X⟩ are included. T denotes the terminals (words), and the grammar builds parses based on strings γ, α ∈ ({S, X, V} ∪ T)+, where we follow Chiang's (2007) general restrictions on the grammar.

Figure 1: The HiFST System

As shown in Figure 1, the system performs translation in three main steps. The first step is a variant of the classic Cocke-Younger-Kasami (CYK) algorithm closely related to CYK+ (Chappelier and Rajman, 1998), in which hypothesis recombination without pruning is performed and back-pointers are maintained. Although the model is a synchronous grammar, in this stage only the source language sentence is parsed, using the corresponding context-free grammar with rules N → γ. Each cell in the CYK grid is specified by a non-terminal symbol and a position in the grid: (N, x, y), which spans s_x^{x+y-1} on the source sentence s_1 ... s_J.

For the second step, we use a recursive algorithm to construct word lattices with all possible translations produced by the hierarchical rules. Construction proceeds by traversing the CYK grid along the back-pointers established in parsing. In each cell (N, x, y) of the CYK grid, we build a target language word lattice L(N, x, y). This lattice contains every translation of s_x^{x+y-1} from every derivation headed by N. For each rule in this cell, a simple acceptor is built based on the target side, as the result of standard concatenation of acceptors that encode either terminals or non-terminals. A word is encoded trivially with an acceptor of two states bound by a single transition arc. In turn, a non-terminal corresponds to a lattice retrieved by means of its back-pointer to a lower-level cell. If this lower-level lattice has not been required previously, it has to be built first. Once we have all the acceptors (one per rule) that apply to (N, x, y), we obtain the cell lattice L(N, x, y) by unioning all these acceptors. For a source sentence with J words, the lattice we are looking for is at the top cell (S, 1, J).
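The construction just described maps naturally onto OpenFst calls. The sketch below assembles one cell lattice; the Rule struct, the pointer-label convention and the omission of the rule weights p_r are simplifying assumptions for illustration and do not reproduce the actual HiFST code.

```cpp
#include <fst/fstlib.h>
#include <vector>

// Hypothetical stand-in for a rule: the target side as symbol ids,
// where ids at or above kPointerBase denote non-terminals, i.e.
// pointer arcs referring to lower-level cell lattices.
struct Rule {
  std::vector<int> target;
};

const int kPointerBase = 1000000;  // assumed id range for pointer arcs

// Two-state acceptor for a single symbol (word or pointer arc).
fst::StdVectorFst SymbolAcceptor(int symbol) {
  fst::StdVectorFst f;
  f.AddState();
  f.AddState();
  f.SetStart(0);
  f.SetFinal(1, fst::StdArc::Weight::One());
  f.AddArc(0, fst::StdArc(symbol, symbol, fst::StdArc::Weight::One(), 1));
  return f;
}

// Acceptor for one rule: standard concatenation over its target side.
fst::StdVectorFst RuleAcceptor(const Rule& rule) {
  fst::StdVectorFst f = SymbolAcceptor(rule.target[0]);
  for (size_t i = 1; i < rule.target.size(); ++i) {
    fst::Concat(&f, SymbolAcceptor(rule.target[i]));
  }
  return f;
}

// Cell lattice L(N, x, y): the union of all rule acceptors in the cell,
// then reduced by epsilon removal, determinization and minimization.
fst::StdVectorFst BuildCellLattice(const std::vector<Rule>& rules) {
  fst::StdVectorFst lattice = RuleAcceptor(rules[0]);
  for (size_t i = 1; i < rules.size(); ++i) {
    fst::Union(&lattice, RuleAcceptor(rules[i]));
  }
  fst::RmEpsilon(&lattice);
  fst::StdVectorFst out;
  fst::Determinize(lattice, &out);
  fst::Minimize(&out);
  return out;
}
```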
The final translation lattice L(S, 1, J) can grow very large after the pointer arcs are expanded. Therefore, in the third step we apply a word-based language model via WFST composition, and perform likelihood-based pruning (Allauzen et al., 2007) based on the combined translation and language model scores.
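The following sketch shows the shape of this third step in OpenFst terms, under the assumption that the language model has been compiled offline into a weighted acceptor (with, e.g., backoff modelled by epsilon arcs); the file names and the beam width of 10 are illustrative.

```cpp
#include <fst/fstlib.h>

int main() {
  // Assumed inputs: an expanded translation lattice and a language
  // model acceptor; the file names are illustrative.
  fst::StdVectorFst* lattice = fst::StdVectorFst::Read("lattice.fst");
  fst::StdVectorFst* lm = fst::StdVectorFst::Read("lm.fst");
  if (!lattice || !lm) return 1;

  // Composition requires sorted arcs on one side of the pair.
  fst::ArcSort(lattice, fst::OLabelCompare<fst::StdArc>());

  fst::StdVectorFst scored;
  fst::Compose(*lattice, *lm, &scored);  // combined TM + LM scores

  // Likelihood-based pruning: drop every path whose cost is more than
  // a beam of 10 (in the tropical semiring) worse than the best path.
  fst::Prune(&scored, fst::StdArc::Weight(10.0));

  scored.Write("pruned.fst");
  delete lattice;
  delete lm;
  return 0;
}
```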
This method can be seen as a generalization of the k-best algorithm with cube pruning (Chiang, 2007), as the performance of the cube pruning search is clearly limited by the size k of each k-best list. For small tasks where k is sufficiently large compared to the number of translations of each derivation, the search could be exhaustive. On the other hand, for reasonably large tasks, the inventory of hierarchical phrases is much bigger than standard phrase pair tables, and using a large k is impossible without exponentially increasing memory requirements and decoding time. In practice, values of (no more than) k=1000 or k=10000 are used. This results in search errors.

Search errors have two negative consequences. Clearly, translation quality is undermined, as the obtained first-best hypothesis is suboptimal given the models. Additionally, the quality of the obtained k-best list is also suboptimal, which limits the margin of gain potentially achieved by subsequent re-ranking strategies (such as high-order language model rescoring or Minimum Bayes Risk).

2.1 Skeleton Lattices

As explained previously, the lattice building step uses a recursive algorithm. If this were actually carried out over full word lattices, memory consumption would increase dramatically. Fortunately, we can avoid this by using arcs that encode a reference to lower-level lattices, thus working as pointers. In other words, for each cell we actually build a skeleton of the desired lattice. This skeleton lattice is a mixture of words and pointers to lower-level cell lattices, but WFST operations are still possible. In this way, skeleton lattices in each cell can be greatly reduced through operations such as epsilon removal, determinization and minimization. Expansion is ideally carried out on the final skeleton lattice in (S, 1, J) using a single recursive replacement operation. Nevertheless, redundancy still remains in the final expanded lattice, as concatenation and union of sublattices with different spans typically lead to repeated hypotheses.

The use of the lattice pointer arc was inspired by the 'lazy evaluation' techniques developed by Mohri et al. (2000). Its implementation uses the infrastructure provided by the OpenFst libraries for delayed composition, etc.
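A sketch of the single recursive replacement described above, using OpenFst's Replace operation, follows; labels and file names are invented, and the exact Replace signature varies across OpenFst releases (the classic three-argument form is used here).

```cpp
#include <fst/fstlib.h>
#include <utility>
#include <vector>

int main() {
  // Assumed inputs: a top-cell skeleton lattice whose pointer arcs
  // carry the label kSubLattice, and the lower-level lattice they
  // refer to. Labels and file names are illustrative.
  fst::StdVectorFst* top = fst::StdVectorFst::Read("skeleton_S_1_J.fst");
  fst::StdVectorFst* sub = fst::StdVectorFst::Read("cell_X_2_3.fst");
  if (!top || !sub) return 1;

  const fst::StdArc::Label kRoot = 1000000;        // id of the top cell
  const fst::StdArc::Label kSubLattice = 1000001;  // a pointer-arc label

  std::vector<std::pair<fst::StdArc::Label, const fst::Fst<fst::StdArc>*>>
      pairs = {{kRoot, top}, {kSubLattice, sub}};

  // Replace() substitutes sub-lattices recursively, so pointer arcs may
  // themselves lead to skeletons containing further pointer arcs.
  fst::StdVectorFst expanded;
  fst::Replace(pairs, &expanded, kRoot);

  expanded.Write("expanded.fst");
  delete top;
  delete sub;
  return 0;
}
```

Using the delayed ReplaceFst class instead of the constructive Replace call gives the lazy, on-demand expansion in the spirit of Mohri et al.'s lazy evaluation.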
2.2 Pruning in Lattice Building

Ideally, the pruning-in-search performed during lattice building should be avoided, but this is not always possible, depending on the complexity of the grammar and the size of the sentence. In HiFST this pruning-in-search is triggered selectively by three concurrent factors: the number of states of a given lattice in a cell, the size of the source-side span (in words), and the non-terminal category. The pruning-in-search consists of applying the language model via composition, likelihood-based pruning, and subsequent removal of the language model scores.

3 Development and Tuning

We present results on the Spanish-to-English translation shared task of the ACL 2008 Workshop on Statistical Machine Translation, WMT (Callison-Burch et al., 2008). The parallel corpus statistics are summarized in Table 1. Specifically, throughout all our experiments we use the Europarl dev2006 and test2008 sets for development and test, respectively.

Table 1: Parallel corpora statistics.

        sentences   words   vocab
  ES      1.30M     38.2M    140k
  EN                35.7M    106k

Training was performed using lower-cased data. Word alignments were generated using GIZA++ (Och, 2003) over a stemmed¹ version of the parallel text. After unioning the Viterbi alignments, the stems were replaced with their original words, and phrase-based rules of up to five source words in length were extracted (Koehn, Och, and Marcu, 2003). Hierarchical rules with up to two non-contiguous non-terminals in the source side are then extracted, applying the restrictions described by Chiang (2007).

¹ We used the Snowball stemmer, available at http://snowball.tartarus.org.

The Europarl language model is a Kneser-Ney (Kneser and Ney, 1995) smoothed 4-gram back-off language model with default cutoffs, estimated over the concatenation of the Europarl and News language model training data.

Minimum error training (Och, 2003) under BLEU (Papineni et al., 2001) is used to optimize the feature weights of the decoder with respect to the dev2006 development set. We obtain a k-best list from the translation lattice and extract each feature score with an aligner variant of a k-best cube-pruning decoder. This variant produces very efficiently the most probable rule segmentation that generated each output hypothesis, along with each feature contribution. The following features are optimized:

• Target language model
• Number of usages of the glue rule
• Word and rule insertion penalties
• Word deletion scale factor
• Source-to-target and target-to-source translation models
• Source-to-target and target-to-source lexical models
• Three rule count features inspired by Bender et al. (2007) that classify rule occurrences (one, two, or more than two times, respectively)

4 Grammar Design

In order to work with a grammar that is reasonably small yet competitive in performance, we apply three filtering strategies successfully used for Chinese-to-English and Arabic-to-English translation tasks (Iglesias et al., 2009b):

• Pattern and mincount-per-class filtering
• Hiero Shallow model
• Filtering by number of translations

We also add deletion rules, i.e. rules that delete single words. In the following subsections we explain these strategies.

4.1 Filtering by Patterns and Mincounts

Even after applying the rule extraction constraints proposed by Chiang (2007), our initial grammar G for dev2006 exceeds 138M rules, of which only 1M are simple phrase-based rules. With the following procedure we reduce the size of the initial grammar with a more informed criterion than general mincount filtering.

A rule pattern is obtained simply by replacing every sequence of terminals by a single symbol 'w' (indicating a word, i.e. a terminal string, w ∈ T+). Every rule of the grammar has one unique corresponding rule pattern. For instance, the rule

⟨X2 en X1 ocasiones, on X1 occasions X2⟩

corresponds to the rule pattern

⟨X2 w X1 w, w X1 w X2⟩

We further classify patterns by number of non-terminals N_nt and number of elements N_e (non-terminals and substrings of terminals). There are 5 possible classes: N_nt.N_e = 1.2, 1.3, 2.3, 2.4, 2.5. We apply different mincount filterings to each class.
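The pattern extraction itself is a simple string transformation; a sketch follows. The token conventions (non-terminals spelled X1, X2, V, S, and anything else treated as a terminal) are an assumption for illustration only.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Crude test: a token is a non-terminal if it starts like X1, X2, V, S.
bool IsNonTerminal(const std::string& tok) {
  return !tok.empty() && (tok[0] == 'X' || tok[0] == 'V' || tok[0] == 'S');
}

// Collapse every maximal run of terminals in a rule side to 'w',
// leaving non-terminals intact.
std::string RulePattern(const std::string& side) {
  std::istringstream in(side);
  std::vector<std::string> out;
  std::string tok;
  bool in_terminal_run = false;
  while (in >> tok) {
    if (IsNonTerminal(tok)) {
      out.push_back(tok);
      in_terminal_run = false;
    } else if (!in_terminal_run) {
      out.push_back("w");  // first terminal of a run
      in_terminal_run = true;
    }                      // further terminals in the run are skipped
  }
  std::string joined;
  for (size_t i = 0; i < out.size(); ++i) {
    if (i) joined += ' ';
    joined += out[i];
  }
  return joined;
}

int main() {
  // <X2 en X1 ocasiones, on X1 occasions X2> -> <X2 w X1 w, w X1 w X2>
  std::cout << RulePattern("X2 en X1 ocasiones") << " , "
            << RulePattern("on X1 occasions X2") << std::endl;
  return 0;
}
```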
Our first working grammar was built by excluding the patterns reported in Table 2 and limiting the number of translations per source side to 20. In brief, we have filtered out identical patterns (corresponding to rules with the same source and target pattern) and some monotonic non-terminal patterns (rule patterns in which non-terminals do not reorder from source to target). Identical patterns encompass a large number of rules, and we have not been able to improve performance by using them in other translation tasks. Additionally, we have also applied mincount filtering to N_nt.N_e = 1.3 and N_nt.N_e = 2.4.

Table 2: Rules excluded from grammar G.

  Excluded Rules                                        Types
  ⟨X1 w, X1 w⟩ , ⟨w X1, w X1⟩                         1530797
  ⟨X1 w X2, ∗⟩                                         737024
  ⟨X1 w X2 w, X1 w X2 w⟩ , ⟨w X1 w X2, w X1 w X2⟩    41600246
  ⟨w X1 w X2 w, ∗⟩                                   45162093
  N_nt.N_e = 1.3, mincount=5                         39013887
  N_nt.N_e = 2.4, mincount=10                         6836855
4.2 Deletion Rules

It has been found experimentally that statistical machine translation systems tend to benefit from allowing a small number of deletions. In other words, allowing some input words to be ignored (untranslated, or translated to NULL) can improve translation output. For this purpose, we add to our grammar one deletion rule for each source-language word, i.e. synchronous rules with the target side set to a special tag identifying the null word. In practice, this represents a huge increase in the search space, as any number of consecutive words could be left untranslated. To control this undesired situation, we limit the number of consecutively deleted words to one. This is achieved by means of standard composition with the simple finite-state transducer shown in Figure 2, which accepts all words but disallows the null word immediately after having accepted one.

Figure 2: Consecutive Null filtering.
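One possible encoding of the Figure 2 filter as an OpenFst acceptor is sketched below; the symbol-id conventions are assumptions, and in practice the filter would be built over the decoder's target vocabulary.

```cpp
#include <fst/fstlib.h>

// Sketch of the Figure 2 filter: state 0 means the last symbol was a
// regular word (or the start), state 1 means the last symbol was the
// null tag. Only regular words may leave state 1, so two consecutive
// deletions are impossible. Ids are illustrative: null_id marks a
// deletion, word ids run over 1..vocab_size.
fst::StdVectorFst DeletionFilter(int vocab_size, int null_id) {
  fst::StdVectorFst f;
  f.AddState();  // state 0
  f.AddState();  // state 1
  f.SetStart(0);
  f.SetFinal(0, fst::StdArc::Weight::One());
  f.SetFinal(1, fst::StdArc::Weight::One());
  for (int w = 1; w <= vocab_size; ++w) {
    if (w == null_id) continue;
    // A regular word always returns to state 0.
    f.AddArc(0, fst::StdArc(w, w, fst::StdArc::Weight::One(), 0));
    f.AddArc(1, fst::StdArc(w, w, fst::StdArc::Weight::One(), 0));
  }
  // The null tag is only allowed from state 0.
  f.AddArc(0, fst::StdArc(null_id, null_id, fst::StdArc::Weight::One(), 1));
  return f;
}
// Composing this filter with a translation lattice removes every path
// containing two or more consecutive null tags.
```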
4.3 Hiero Shallow Model

A traditional Hiero grammar (X → ⟨γ, α⟩ with γ, α ∈ ({X} ∪ T)+, in addition to the glue rules) allows rule nesting limited only by the maximum word span. In contrast, the Hiero Shallow model allows hierarchical rules to be applied only once, on top of phrase-based rules for the same given span. Stated more formally, for s, t ∈ T+ and γ_s, α_s ∈ ({V} ∪ T)+, the Hiero Shallow grammar consists of three kinds of rules:

X → ⟨γ_s, α_s⟩
X → ⟨V, V⟩
V → ⟨s, t⟩

Shifting from one grammar to the other is as simple as rewriting the X non-terminals in γ, α to V (a sketch follows at the end of this subsection). It is easy to imagine the impact this has in terms of speed, as it drastically reduces the size of the search space. In this sense, using a Hiero Shallow grammar can be seen as a filtering technique. Whether it has a negative impact on performance or not depends on each translation task: for instance, it was not useful for Chinese-to-English, as this task takes advantage of nested rules to find better reorderings encompassing a large number of words. On the other hand, a Spanish-to-English translation task is not expected to require big reorderings; thus, as a premise, it is a good candidate for this kind of grammar. In effect, Table 3 shows that a hierarchical shallow grammar yields the same performance as full hierarchical translation.

Table 3: Performance of Hiero Full versus Hiero Shallow grammars.

  Hiero Model    dev2006       test2008
  Shallow        33.65/7.852   33.65/7.877
  Full           33.63/7.849   33.66/7.880
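The rewriting itself can be sketched as a plain string transformation, assuming space-separated rule sides with non-terminals spelled X1, X2, ...; a terminal word beginning with a capital X would be mishandled by this crude test.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Rewrite X non-terminals in a rule side to V, turning a full Hiero
// rule body into its shallow counterpart.
std::string ShallowSide(const std::string& side) {
  std::istringstream in(side);
  std::string tok, out;
  while (in >> tok) {
    if (tok[0] == 'X') tok[0] = 'V';  // X1 -> V1, X2 -> V2, ...
    if (!out.empty()) out += ' ';
    out += tok;
  }
  return out;
}

int main() {
  // X -> <X2 en X1 ocasiones, ...> becomes X -> <V2 en V1 ocasiones, ...>
  std::cout << ShallowSide("X2 en X1 ocasiones") << std::endl;
  return 0;
}
```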
4.4 Filtering by Number of Translations

Filtering rules by a fixed number of translations per source side (NT) allows faster decoding with the same performance. As stated before, the previous experiments for this task used a convenient baseline filtering of 20 translations. In our experience, this has been a good threshold for the NIST 2008 Arabic-to-English and Chinese-to-English translation tasks (Iglesias et al., 2009b; Iglesias et al., 2009a). In Table 4 we compare the performance of our shallow grammar under different filterings, i.e. by 30 and 40 translations respectively. Interestingly, the grammar with 30 translations yields a slight improvement, but widening to 40 translations does not further improve translation performance.

Table 4: Performance of G1 when varying the filter by number of translations, NT.

  NT    dev2006       test2008
  20    33.65/7.852   33.65/7.877
  30    33.61/7.849   33.75/7.896
  40    33.63/7.853   33.73/7.883

4.5 Revisiting Patterns and Class Mincounts

In order to review the grammar design decisions taken in Section 4, and to assess their impact on translation quality, we consider three competing grammars: G1, G2 and G3. G1 is the shallow grammar with NT = 20 already used (the baseline, 3.65M rules). G2 is a subset of G1 with mincount filtering of 5 applied to N_nt.N_e = 2.3 and N_nt.N_e = 2.5. With this smaller grammar (3.25M rules) we would like to evaluate whether we can obtain the same performance. G3 (4.42M rules) is a superset of G1 to which the identical pattern ⟨X1 w, X1 w⟩ has been added. Table 5 shows translation performance with each of them. The decrease in performance for G2 is not surprising: the rules filtered out of G2 belong to reordered non-terminal rule patterns (N_nt.N_e = 2.3 and N_nt.N_e = 2.5) and to some highly lexicalized monotonic non-terminal patterns from N_nt.N_e = 2.5, with three subsequences of words. More interesting is the comparison between G1 and G3, where we see that this extra identical rule pattern produces a degradation in performance.

Table 5: Contrastive performance with three slightly different grammars.

       dev2006       test2008
  G1   33.65/7.852   33.65/7.877
  G2   33.47/7.838   33.65/7.877
  G3   33.09/7.787   33.14/7.808

5 Rescoring and Final Results

After translation with optimized feature weights, we carry out the following two rescoring steps on the output lattice (the MBR decision rule is written out after this list):

• Large-LM rescoring. We build sentence-specific zero-cutoff stupid-backoff (Brants et al., 2007) 5-gram language models, estimated using ∼4.7B words of English newswire text, and apply them to rescore the word lattices generated by HiFST.

• Minimum Bayes Risk (MBR). We rescore the first 1000-best hypotheses with MBR, taking the negative sentence-level BLEU score as the loss function (Kumar and Byrne, 2004).
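For reference, the MBR decision rule being applied can be written in its standard form (Kumar and Byrne, 2004), where E ranges over the evidence space (here, the 1000-best list) and F is the source sentence; this is the textbook formulation, not notation taken from this paper:

```latex
% Minimum Bayes Risk decoding: choose the hypothesis of lowest
% expected loss, with L(E, E') the negative sentence-level BLEU.
\hat{E} = \operatorname*{arg\,min}_{E' \in \mathcal{E}}
          \sum_{E \in \mathcal{E}} P(E \mid F) \, L(E, E')
```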
Table 6 shows results for our best Hiero model so far (using G1 with NT = 30) and the subsequent rescoring steps. Gains from the large language models are more modest than those from MBR, possibly due to the domain discrepancy between the EuroParl and the additional newswire data. Table 7 contains examples extracted from dev2006. Scores are state-of-the-art, comparable to the top submissions in the WMT08 shared-task results (Callison-Burch et al., 2008).

Table 6: EuroParl Spanish-to-English translation results (lower-cased IBM BLEU / NIST) after MET and subsequent rescoring steps.

           dev2006       test2008
  HiFST    33.61/7.849   33.75/7.896
  +5gram   33.66/7.902   33.90/7.954
  +MBR     33.87/7.901   34.24/7.962

6 Conclusions

HiFST is a hierarchical phrase-based translation system that builds target word lattices. Based on well-known standard WFST operations (such as union, concatenation and composition, among others), it is easy to implement, e.g. with the OpenFst library. The compact representation of multiple translation hypotheses in lattice form requires less pruning and allows better hypothesis recombination (i.e. by standard determinization) than cube pruning decoders, yielding fewer search errors and reduced overall memory use relative to k-best lists. For the EuroParl Spanish-to-English translation task, we have shown that full Hiero grammars perform similarly to Hiero Shallow grammars, which allow only one level of hierarchical rules and thus yield much faster decoding times. We also present results with language-model rescoring and MBR. The performance of our system is state-of-the-art for Spanish-to-English (Callison-Burch et al., 2008).
Table 7: Examples from the EuroParl Spanish-to-English dev2006 set.

  Spanish: Estoy de acuerdo con él en cuanto al papel central que debe conservar en el futuro la comisión como garante del interés general comunitario.
  English: I agree with him about the central role that must be retained in the future the commission as guardian of the general interest of the community.

  Spanish: Por ello, creo que es muy importante que el presidente del eurogrupo -que nosotros hemos querido crear- conserve toda su función en esta materia.
  English: I therefore believe that it is very important that the president of the eurogroup - which we have wanted to create - retains all its role in this area.

  Spanish: Creo por este motivo que el método del convenio es bueno y que en el futuro deberá utilizarse mucho más.
  English: I therefore believe that the method of the convention is good and that in the future must be used much more.

References

Allauzen, Cyril, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of CIAA, pages 11–23.

Bangalore, Srinivas and Giuseppe Riccardi. 2001. A finite-state approach to machine translation. In Proceedings of NAACL.

Bender, Oliver, Evgeny Matusov, Stefan Hahn, Sasa Hasan, Shahram Khadivi, and Hermann Ney. 2007. The RWTH Arabic-to-English spoken language translation system. In Proceedings of ASRU, pages 396–401.

Blunsom, Phil, Trevor Cohn, and Miles Osborne. 2008. A discriminative latent variable model for statistical machine translation. In Proceedings of ACL-HLT, pages 200–208.

Brants, Thorsten, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of EMNLP-ACL, pages 858–867.

Callison-Burch, Chris, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2008. Further meta-evaluation of machine translation. In Proceedings of the ACL Workshop on Statistical Machine Translation, pages 70–106.

Casacuberta, Francisco. 2001. Finite-state transducers for speech-input translation. In Proceedings of ASRU.

Chappelier, Jean-Cédric and Martin Rajman. 1998. A generalized CYK algorithm for parsing stochastic CFG. In Proceedings of TAPD, pages 133–137.

Chappelier, Jean-Cédric, Martin Rajman, Ramón Aragüés, and Antoine Rozenknop. 1999. Lattice parsing for speech recognition. In Proceedings of TALN, pages 95–104.

Chiang, David. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Dyer, Christopher, Smaranda Muresan, and Philip Resnik. 2008. Generalizing word lattice translation. In Proceedings of ACL-HLT, pages 1012–1020.

Graehl, Jonathan, Kevin Knight, and Jonathan May. 2008. Training tree transducers. Computational Linguistics, 34(3):391–427.

Huang, Liang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of ACL, pages 144–151.

Iglesias, Gonzalo, Adrià de Gispert, Eduardo R. Banga, and William Byrne. 2009a. Hierarchical phrase-based translation with weighted finite state transducers. In Proceedings of NAACL, pages 433–441.

Iglesias, Gonzalo, Adrià de Gispert, Eduardo R. Banga, and William Byrne. 2009b. Rule filtering by pattern for efficient hierarchical translation. In Proceedings of EACL, pages 380–388.

Kneser, R. and H. Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP, volume 1, pages 181–184.

Koehn, Philipp, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL, pages 48–54.

Kumar, Shankar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of HLT-NAACL, pages 169–176.

Kumar, Shankar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proceedings of HLT-EMNLP, pages 161–168.
Li, Zhifei and Sanjeev Khudanpur. 2008. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In Proceedings of the ACL-HLT Second Workshop on Syntax and Structure in Statistical Translation, pages 10–18.

Lopez, Adam. 2008. Tera-scale translation models via pattern matching. In Proceedings of COLING, pages 505–512.

Marton, Yuval and Philip Resnik. 2008. Soft syntactic constraints for hierarchical phrased-based translation. In Proceedings of ACL-HLT, pages 1003–1011.

Mathias, Lambert and William Byrne. 2006. Statistical phrase-based speech translation. In Proceedings of ICASSP.

Mohri, Mehryar, Fernando Pereira, and Michael Riley. 2000. The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231:17–32.

Och, Franz J. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL, pages 160–167.

Rosti, Antti-Veikko, Necip Fazil Ayan, Bing Xiang, Spyros Matsoukas, Richard Schwartz, and Bonnie Dorr. 2007. Combining outputs from multiple machine translation systems. In Proceedings of HLT-NAACL, pages 228–235.

Shen, Libin, Jinxi Xu, and Ralph Weischedel. 2008. A new string-to-dependency machine translation algorithm with a target dependency language model. In Proceedings of ACL-HLT, pages 577–585.

Sim, Khe Chai, William Byrne, Mark Gales, Hichem Sahbi, and Phil Woodland. 2007. Consensus network decoding for statistical machine translation system combination. In Proceedings of ICASSP, volume 4, pages 105–108.

Venugopal, Ashish, Andreas Zollmann, and Stephan Vogel. 2007. An efficient two-pass approach to synchronous-CFG driven statistical MT. In Proceedings of HLT-NAACL, pages 500–507.

Zhang, Hao and Daniel Gildea. 2006. Synchronous binarization for machine translation. In Proceedings of HLT-NAACL, pages 256–263.

Zollmann, Andreas and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the NAACL Workshop on Statistical Machine Translation, pages 138–141.

Zollmann, Andreas, Ashish Venugopal, Franz Och, and Jay Ponte. 2008. A systematic comparison of phrase-based, hierarchical and syntax-augmented statistical MT. In Proceedings of COLING, pages 1145–1152.