Abu-MaTran at WMT 2014 Translation Task: Two-step Data Selection and RBMT-Style Synthetic Rules

Raphael Rubino*, Antonio Toral†, Víctor M. Sánchez-Cartagena*‡, Jorge Ferrández-Tordera*, Sergio Ortiz-Rojas*, Gema Ramírez-Sánchez*, Felipe Sánchez-Martínez‡, Andy Way†

* Prompsit Language Engineering, S.L., Elche, Spain — {rrubino,vmsanchez,jferrandez,sortiz,gramirez}@prompsit.com
† NCLT, School of Computing, Dublin City University, Ireland — {atoral,away}@computing.dcu.ie
‡ Dep. Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Spain — fsanchez@dlsi.ua.es

Abstract

This paper presents the machine translation systems submitted by the Abu-MaTran project to the WMT 2014 translation task. The language pair concerned is English–French, with a focus on French as the target language. The French to English translation direction is also considered, based on the word alignment computed in the other direction. Large language and translation models are built using all the datasets provided by the shared task organisers, as well as the monolingual data from LDC. To build the translation models, we apply a two-step data selection method based on bilingual cross-entropy difference and vocabulary saturation, considering each parallel corpus individually. Synthetic translation rules are extracted from the development sets and used to train another translation model. We then interpolate the translation models, minimising the perplexity on the development sets, to obtain our final SMT system. Our submission for the English to French translation task was ranked second amongst nine teams and a total of twenty submissions.

1 Introduction

This paper presents the systems submitted by the Abu-MaTran project (runs named DCU-Prompsit-UA) to the WMT 2014 translation task for the English–French language pair. Phrase-based statistical machine translation (SMT) systems were submitted for the two translation directions, with the focus on the English to French direction. Language models (LMs) and translation models (TMs) are trained using all the data provided by the shared task organisers, as well as the Gigaword monolingual corpora distributed by LDC.

To train the LMs, the monolingual corpora and the target side of the parallel corpora are first used individually to train models. The individual models are then interpolated according to perplexity minimisation on the development sets.

To train the TMs, a baseline is first built using the News Commentary parallel corpus. Second, each remaining parallel corpus is processed individually using bilingual cross-entropy difference (Axelrod et al., 2011) in order to separate pseudo in-domain and out-of-domain sentence pairs, the pseudo out-of-domain instances being further filtered with the vocabulary saturation approach (Lewis and Eetemadi, 2013). Third, synthetic translation rules are automatically extracted from the development set and used to train another translation model, following a novel approach (Sánchez-Cartagena et al., 2014). Finally, we interpolate the four translation models (baseline, in-domain, filtered out-of-domain and rules) by minimising the perplexity obtained on the development sets, and investigate the best tuning and decoding parameters.

The remainder of this paper is organised as follows: the datasets and tools used in our experiments are described in Section 2. Details about the LMs and TMs are given in Section 3 and Section 4 respectively. Finally, we evaluate the performance of the final SMT system with different tuning and decoding parameters in Section 5, before presenting conclusions in Section 6.
2 Datasets and Tools

We use all the monolingual and parallel datasets in English and French provided by the shared task organisers, as well as the LDC Gigaword corpora for the same languages.[1] For each language, a true-case model is trained on all the data using the train-truecaser.perl script included in the MOSES tool-kit (Koehn et al., 2007).

Punctuation marks of all the monolingual and parallel corpora are then normalised using the script normalize-punctuation.perl provided by the organisers, before the corpora are tokenised and true-cased using the scripts distributed with the MOSES tool-kit. The same pre-processing steps are applied to the development and test sets. As development sets, we use all the test sets from previous years of WMT, from 2008 to 2013 (newstest2008-2013). Finally, the training parallel corpora are cleaned using the script clean-corpus-n.perl, keeping sentences longer than 1 word and shorter than 80 words, with a length ratio between the two sides of each sentence pair lower than 4 (a sketch of these criteria follows Table 1).[2] Statistics of the corpora used in our experiments after pre-processing are presented in Table 1.

Corpus                  Sentences (k)   Words (M)
Monolingual Data – English
Europarl v7             2,218.2         59.9
News Commentary v8      304.2           7.4
News Shuffled 2007      3,782.5         90.2
News Shuffled 2008      12,954.5        308.1
News Shuffled 2009      14,680.0        347.0
News Shuffled 2010      6,797.2         157.8
News Shuffled 2011      15,437.7        358.1
News Shuffled 2012      14,869.7        345.5
News Shuffled 2013      21,688.4        495.2
LDC afp                 7,184.9         869.5
LDC apw                 8,829.4         1,426.7
LDC cna                 618.4           45.7
LDC ltw                 986.9           321.1
LDC nyt                 5,327.7         1,723.9
LDC wpb                 108.8           20.8
LDC xin                 5,121.9         423.7
Monolingual Data – French
Europarl v7             2,190.6         63.5
News Commentary v8      227.0           6.5
News Shuffled 2007      119.0           2.7
News Shuffled 2008      4,718.8         110.3
News Shuffled 2009      4,366.7         105.3
News Shuffled 2010      1,846.5         44.8
News Shuffled 2011      6,030.1         146.1
News Shuffled 2012      4,114.4         100.8
News Shuffled 2013      9,256.3         220.2
LDC afp                 6,793.5         784.5
LDC apw                 2,525.1         271.3
Parallel Data
10^9 Corpus             21,327.1        549.0 (EN) / 642.5 (FR)
Common Crawl            3,168.5         76.0 (EN) / 82.7 (FR)
Europarl v7             1,965.5         52.5 (EN) / 56.7 (FR)
News Commentary v9      181.3           4.5 (EN) / 5.3 (FR)
UN                      12,354.7        313.4 (EN) / 356.5 (FR)

Table 1: Data statistics after pre-processing of the monolingual and parallel corpora used in our experiments.

[1] LDC2011T07 English Gigaword Fifth Edition, LDC2011T10 French Gigaword Third Edition.
[2] This ratio was empirically chosen based on word fertility between English and French.
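To make the cleaning criteria concrete, here is a minimal Python sketch of the same three tests (the function keep_pair is ours, for illustration only; in practice the Moses script clean-corpus-n.perl performs this step, and its exact boundary handling may differ):

```python
def keep_pair(src: str, trg: str) -> bool:
    """Mirror the clean-corpus-n.perl settings used above: both sides
    longer than 1 token, shorter than 80 tokens, and a length ratio
    between the two sides below 4."""
    len_src, len_trg = len(src.split()), len(trg.split())
    if min(len_src, len_trg) <= 1 or max(len_src, len_trg) >= 80:
        return False
    return max(len_src, len_trg) / min(len_src, len_trg) < 4.0

# usage: keep only the pairs that pass all three tests
corpus = [("the cat sat .", "le chat était assis .")]
cleaned = [pair for pair in corpus if keep_pair(*pair)]
```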
For training LMs, we use KenLM (Heafield et al., 2013) and the SRILM tool-kit (Stolcke et al., 2011). For training TMs, we use MOSES (Koehn et al., 2007) version 2.1 with MGIZA++ (Och and Ney, 2003; Gao and Vogel, 2008). These tools are used with default parameters except when explicitly stated otherwise.

The decoder used to generate translations is MOSES, with feature weights optimised by MERT (Och, 2003). As our approach relies on training individual TMs, one for each parallel corpus, our final TM is obtained by linearly interpolating the individual ones. The interpolation of TMs is performed with the script tmcombine.py, minimising the cross-entropy between the TM and the concatenated development sets from 2008 to 2012 (noted newstest2008-2012), as described in Sennrich (2012). Finally, we make use of the findings from WMT 2013 brought by the winning team (Durrani et al., 2013) and decide to use the Operation Sequence Model (OSM), based on minimal translation units and Markov chains over sequences of operations, implemented in MOSES and introduced by Durrani et al. (2011).

3 Language Models

The LMs are trained in the same way for both languages. First, each monolingual and parallel corpus is considered individually (except the parallel versions of Europarl and News Commentary) and used to train a 5-gram LM with modified Kneser-Ney smoothing. We then interpolate the individual LMs using the script compute-best-mix available with the SRILM tool-kit (Stolcke et al., 2011), based on their perplexity scores on the concatenation of the development sets from 2008 to 2012 (the 2013 set is held out for the tuning of the TMs).
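The weights produced by compute-best-mix are the coefficients of a linear mixture of the component LMs. As a rough illustration of what such a tool optimises, the following sketch estimates mixture weights by EM, assuming the per-token log10 probabilities of each component LM on the development set have already been dumped (the function name and data layout are our own):

```python
def mixture_weights(token_logprobs, iterations=50):
    """EM estimation of linear interpolation weights, in the spirit of
    SRILM's compute-best-mix. token_logprobs[i][t] is the log10
    probability that component LM i assigns to development token t;
    all component lists cover the same tokens."""
    n_lms = len(token_logprobs)
    n_tokens = len(token_logprobs[0])
    probs = [[10.0 ** lp for lp in lm] for lm in token_logprobs]
    weights = [1.0 / n_lms] * n_lms
    for _ in range(iterations):
        posteriors = [0.0] * n_lms
        for t in range(n_tokens):
            # E-step: responsibility of each LM for this token
            mix = sum(w * probs[i][t] for i, w in enumerate(weights))
            for i, w in enumerate(weights):
                posteriors[i] += w * probs[i][t] / mix
        # M-step: re-normalise responsibilities into new weights
        weights = [p / n_tokens for p in posteriors]
    return weights

# toy usage: two LMs scoring three development tokens
w = mixture_weights([[-1.0, -2.0, -0.5], [-0.8, -1.5, -1.2]])
```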
The final LM for French contains all word sequences from 1-grams to 5-grams present in the training corpora, without any pruning. However, with the computing resources at our disposal, the English LMs could not be interpolated without pruning non-frequent n-grams. Thus, n-grams with n ∈ [3; 5] occurring fewer than twice were removed. Details about the final LMs are given in Table 2.

           1-gram   2-gram   3-gram   4-gram   5-gram
English    13.4     198.6    381.2    776.3    1,068.7
French     6.0      75.5     353.2    850.8    1,354.0

Table 2: Statistics, in millions of n-grams, of the interpolated LMs.

4 Translation Models

In this Section, we describe the TMs trained for the shared task. First, we present the two-step data selection process, which aims to (i) separate in-domain and out-of-domain parallel sentences and (ii) reduce the total amount of out-of-domain data. Second, we detail a novel approach for the automatic extraction of translation rules and their use to enrich the phrase table.

4.1 Parallel Data Filtering and Vocabulary Saturation

Amongst the parallel corpora provided by the shared task organisers, only News Commentary can be considered in-domain with regard to the development and test sets. We use this training corpus to build our baseline SMT system. The other parallel corpora are individually filtered using bilingual cross-entropy difference (Moore and Lewis, 2010; Axelrod et al., 2011). This data filtering method relies on four LMs, two in the source and two in the target language, which aim to model particular features of in-domain and out-of-domain sentences.

We build the in-domain LMs using the source and target sides of the News Commentary parallel corpus. Out-of-domain LMs are trained on a vocabulary-constrained subset of each remaining parallel corpus individually using the SRILM tool-kit, which leads to eight models (four in the source language and four in the target language).[3]

Then, for each out-of-domain parallel corpus, we compute the bilingual cross-entropy difference of each sentence pair as:

    [H_in(S_src) − H_out(S_src)] + [H_in(S_trg) − H_out(S_trg)]    (1)

where S_src and S_trg are the source and target sides of a sentence pair, and H_in and H_out are the cross-entropies of the in-domain and out-of-domain LMs given each side of the sentence pair.

[3] The subsets contain the same number of sentences and the same vocabulary as News Commentary.
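A compact sketch of this scoring step, querying the four LMs through KenLM's Python bindings (the paper trains these filtering LMs with SRILM; the file names and the `pairs` iterable below are placeholders):

```python
import kenlm  # KenLM Python bindings, used here purely for illustration

lm_in_src = kenlm.Model("news-commentary.src.arpa")   # placeholder paths
lm_out_src = kenlm.Model("subset.src.arpa")
lm_in_trg = kenlm.Model("news-commentary.trg.arpa")
lm_out_trg = kenlm.Model("subset.trg.arpa")

def cross_entropy(lm, sentence):
    # Model.score returns log10 P(sentence); normalise per token (+1 for </s>)
    return -lm.score(sentence) / (len(sentence.split()) + 1)

def bilingual_ced(src, trg):
    """Equation (1): the lower the score, the more in-domain the pair."""
    return (cross_entropy(lm_in_src, src) - cross_entropy(lm_out_src, src)) \
         + (cross_entropy(lm_in_trg, trg) - cross_entropy(lm_out_trg, trg))

# pairs: any iterable of (source, target) sentence strings
ranked = sorted(pairs, key=lambda st: bilingual_ced(*st))
```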
The sentence pairs are then ranked, and the lowest-scoring ones are taken to train the pseudo in-domain TMs. However, the cross-entropy difference threshold required to split a corpus in two parts (pseudo in-domain and out-of-domain) is usually set empirically by testing several subset sizes of the top-ranked sentence pairs. This method is costly in our setup, as it would lead to training and evaluating multiple SMT systems for each of the pseudo in-domain parallel corpora. In order to save time and computing power, we consider as pseudo in-domain only those sentence pairs with a bilingual cross-entropy difference below 0, i.e. those deemed more similar to the in-domain LMs than to the out-of-domain LMs (H_in < H_out). A sample of the distribution of scores for the out-of-domain corpora is shown in Figure 1. The resulting pseudo in-domain corpora are used to train individual TMs, as detailed in Table 3.

[Figure 1: Sample of ranked sentence pairs (10k) from each of the out-of-domain parallel corpora (Common Crawl, Europarl, 10^9, UN), plotted by bilingual cross-entropy difference.]

Corpus          Sentences (k)   BLEUdev
Baseline        181.3           27.76
Common Crawl    208.3           27.73
Europarl        142.0           27.63
10^9 Corpus     1,442.4         30.29
UN              642.4           28.91
Interpolation   -               30.78

Table 3: Number of sentence pairs and BLEU scores reported by MERT on English–French newstest2013 for the pseudo in-domain corpora obtained by filtering the out-of-domain corpora with bilingual cross-entropy difference. The interpolation of pseudo in-domain models is evaluated in the last row.

The results obtained using the pseudo in-domain data show BLEU (Papineni et al., 2002) scores comparable or superior to the baseline. The Europarl subset is slightly lower than the baseline, while the subset taken from the 10^9 corpus reaches the highest BLEU of the individual systems (30.29), mainly due to its size: this subset is ten times larger than the one taken from Europarl. The last row of Table 3 shows the BLEU score obtained after interpolating the four pseudo in-domain translation models. This system outperforms the best individual pseudo in-domain one by 0.5 absolute points.
After evaluating the pseudo in-domain parallel data, the remaining sentence pairs of each corpus are considered out-of-domain according to our filtering approach. However, they may still contain useful information, so we make use of these corpora by building individual TMs for each corpus, in the same way we built the pseudo in-domain models. The total amount of remaining data (more than 33 million sentence pairs) makes the training process costly in terms of time and computing power. In order to reduce these costs, sentence pairs with a bilingual cross-entropy difference higher than 10 were filtered out, as we noticed that most of the sentences above this threshold contain noise (non-alphanumeric characters, foreign languages, etc.).

We also limit the size of the remaining data by applying the vocabulary saturation method (Lewis and Eetemadi, 2013); a code sketch follows at the end of this subsection. For the out-of-domain subset of each corpus, we traverse the sentence pairs in the order in which they are ranked by cross-entropy difference and filter out those pairs for which each 1-gram has already been seen at least 10 times. The out-of-domain subset of each parallel corpus is then used to train a TM, and these TMs are interpolated to create the pseudo out-of-domain TM. The results reported by MERT on the newstest2013 development set are detailed in Table 4.

Corpus          Sentences (k)   BLEUdev
Baseline        181.3           27.76
Common Crawl    1,598.7         29.84
Europarl        461.9           28.87
10^9 Corpus     5,153.0         30.50
UN              1,707.3         29.03
Interpolation   -               31.37

Table 4: Number of sentence pairs and BLEU scores reported by MERT on English–French newstest2013 for the pseudo out-of-domain corpora obtained by filtering the out-of-domain corpora with bilingual cross-entropy difference, keeping sentence pairs below an entropy score of 10 and applying vocabulary saturation. The interpolation of pseudo out-of-domain models is evaluated in the last row.

Mainly due to the sizes of the pseudo out-of-domain subsets, the reported BLEU scores are higher than the baseline for the four individual SMT systems as well as for the interpolated one, the latter outperforming the baseline by 3.61 absolute points. Compared to the results obtained with the pseudo in-domain data, we observe a slight improvement of the BLEU scores using the pseudo out-of-domain data. However, despite the considerably larger sizes of the latter datasets, the BLEU scores are not that much higher. For instance, with the 10^9 corpus, the pseudo in-domain and out-of-domain subsets contain 1.4 and 5.1 million sentence pairs respectively, and the two systems reach 30.3 and 30.5 BLEU. These scores indicate that the pseudo in-domain data is used more efficiently on the English–French newstest2013 development set.
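The following sketch illustrates the saturation filter described above. The paper does not state whether the 1-gram counts accumulate over all traversed pairs or only over kept ones; this sketch assumes the latter, which matches the goal of saturating the vocabulary of the selected subset:

```python
from collections import Counter

def saturate(ranked_pairs, threshold=10):
    """Traverse sentence pairs in bilingual cross-entropy difference
    order; drop a pair once every 1-gram it contains has already been
    seen at least `threshold` times in the kept data."""
    counts = Counter()
    kept = []
    for src, trg in ranked_pairs:
        tokens = src.split() + trg.split()
        # keep the pair only if it still contributes an unsaturated 1-gram
        if any(counts[tok] < threshold for tok in tokens):
            kept.append((src, trg))
            counts.update(tokens)
    return kept
```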
4.2 Extraction of Translation Rules

A synthetic phrase table based on shallow-transfer MT rules and dictionaries is built as follows. First, a set of shallow-transfer rules is inferred from the concatenation of the newstest2008-2012 development corpora, exactly in the same way as in the UA-Prompsit submission to this translation shared task (Sánchez-Cartagena et al., 2014). In summary, rules are obtained from a set of bilingual phrases extracted from the parallel corpus after its morphological analysis and part-of-speech disambiguation with the tools in the Apertium rule-based MT platform (Forcada et al., 2011). The extraction algorithm commonly used in phrase-based SMT is followed, with some added heuristics which ensure that the extracted bilingual phrases are compatible with the bilingual dictionary. Then, many different rules are generated from each bilingual phrase, each encoding a different degree of generalisation over the particular example it was extracted from. Finally, the minimum set of rules which correctly reproduces all the bilingual phrases is found with integer linear programming (Garfinkel and Nemhauser, 1972).
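As a heavily simplified illustration of this last step, the rule selection can be cast as a set-cover integer programme, shown here with the generic PuLP modelling library. The actual formulation in Sánchez-Cartagena et al. (2014) involves additional constraints (for instance, for rules that match a phrase but would translate it incorrectly), which are omitted here:

```python
import pulp  # generic ILP modelling library, used for illustration only

def minimal_rule_set(n_rules, phrases, reproduces):
    """Select the smallest set of rules reproducing every bilingual
    phrase; reproduces[p] lists the indices of the rules that correctly
    reproduce phrase p."""
    prob = pulp.LpProblem("minimal_rules", pulp.LpMinimize)
    x = [pulp.LpVariable(f"rule_{i}", cat="Binary") for i in range(n_rules)]
    prob += pulp.lpSum(x)  # objective: number of selected rules
    for p in phrases:
        # every bilingual phrase must be reproduced by >= 1 selected rule
        prob += pulp.lpSum(x[i] for i in reproduces[p]) >= 1
    prob.solve()
    return [i for i in range(n_rules) if x[i].value() == 1]

# toy usage: 3 rules, 2 phrases; rule 0 reproduces both phrases
print(minimal_rule_set(3, ["p1", "p2"], {"p1": [0, 1], "p2": [0, 2]}))
```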
Once the rules have been inferred, the phrase table is built from them and the original rule-based MT dictionaries, following the method of Sánchez-Cartagena et al. (2011), which produced one of the winning systems[4] (together with two online SMT systems) in the pairwise manual evaluation of the WMT11 English–Spanish translation task (Callison-Burch et al., 2011). This phrase table is then interpolated with the baseline TM, and the results are presented in Table 5. A slight improvement over the baseline is observed, which motivates the use of synthetic rules in our final MT system. The small size of this improvement may be related to the limited coverage of the Apertium dictionaries: the English–French bilingual dictionary has a low number of entries compared to more mature language pairs in Apertium, which have around 20 times more bilingual entries.

System            BLEUdev
Baseline          27.76
Baseline+Rules    28.06

Table 5: BLEU scores reported by MERT on English–French newstest2013 for the baseline SMT system standalone and with automatically extracted translation rules.

[4] No other system was found statistically significantly better using the sign test at p ≤ 0.1.

5 Tuning and Decoding

We present in this Section a short selection of our experiments, amongst more than 15 different configurations, conducted on the interpolation of TMs and on tuning and decoding parameters. We first interpolate the four TMs (the baseline, the pseudo in-domain and out-of-domain models, and the translation rules), minimising the perplexity obtained on the concatenated development sets from 2008 to 2012 (newstest2008-2012), as sketched below. We investigate the use of OSM trained on pseudo in-domain data only or on all the available parallel data. Finally, we vary the number of n-best hypotheses used by MERT.
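As a sketch of the model combination performed at this step: with the weights fixed by perplexity minimisation, the linear interpolation of phrase-table probabilities reduces to a weighted sum per phrase pair. tmcombine.py additionally handles the full set of Moses features and treats lexical weights specially, which this illustration omits (all names below are our placeholders):

```python
def interpolate_phrase_tables(tables, weights):
    """Linearly interpolate one probability feature (e.g. p(e|f)):
    tables is a list of dicts mapping (source, target) phrase pairs to
    a probability; pairs absent from a table contribute 0."""
    combined = {}
    for pair in set().union(*tables):  # union of all phrase pairs
        combined[pair] = sum(w * t.get(pair, 0.0)
                             for t, w in zip(tables, weights))
    return combined

# toy usage with two tiny tables and placeholder interpolation weights
t1 = {("chat", "cat"): 0.9, ("chien", "dog"): 0.8}
t2 = {("chat", "cat"): 0.7}
merged = interpolate_phrase_tables([t1, t2], [0.6, 0.4])
```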
Results obtained on the development set newstest2013 are reported in Table 6. These scores show that adding OSM to the interpolated translation models slightly degrades BLEU. However, when the number of n-best hypotheses considered by MERT is increased to 200, the SMT system with OSM outperforms the systems evaluated previously in our experiments. Adding the synthetic translation rules degrades BLEU (as indicated by the last row of the table), so we decided to submit two systems to the shared task: one without and one with synthetic rules. By submitting a system without synthetic rules, we also ensure that our SMT system is constrained according to the shared task guidelines.

System                      BLEUdev
Baseline                    27.76
+ pseudo in + pseudo out    31.93
+ OSM                       31.90
+ MERT 200-best             32.21
+ Rules                     32.10

Table 6: BLEU scores reported by MERT on the English–French newstest2013 development set.

As MERT is not suitable when a large number of features is used (our system uses 19 features), we switch to the Margin Infused Relaxed Algorithm (MIRA) for our submitted systems (Watanabe et al., 2007). The development set used is newstest2012, as we aim to select the best decoding parameters according to the scores obtained when decoding the newstest2013 corpus, after de-truecasing and de-tokenising with the scripts distributed with MOSES. This setup allowed us to compare our results with those of last year's participants in the translation shared task. We pick the decoding parameters leading to the best results in terms of BLEU and decode the official WMT14 test set, newstest2014. The results are reported in Table 7. Results on newstest2013 show that the investigation of decoding parameters leads to an overall improvement of 0.1 BLEU absolute. The results on newstest2014 show that adding the synthetic rules did not improve BLEU and slightly degraded TER (Snover et al., 2006).
System                              BLEU (13a)   TER
newstest2013
Best tuning                         31.02        60.77
cube-pruning (pop-limit 10000)      31.04        60.71
increased table-limit (100)         31.06        60.77
monotonic reordering                31.07        60.69
Best decoding                       31.14        60.66
newstest2014
Best decoding                       34.90        54.70
Best decoding + Rules               34.90        54.80

Table 7: Case-sensitive results obtained with our final English–French SMT system on newstest2013 when experimenting with different decoding parameters. The best parameters are kept to translate the WMT14 test set (newstest2014), and the official results are reported in the last two rows.

In addition to our English→French submission, we submitted a French→English translation. Our French→English MT system is built on the alignments obtained for the English→French direction. The training processes of the two systems are identical, except for the synthetic rules, which are not extracted for the French→English direction. The tuning and decoding parameters for this translation direction are the best ones obtained in our previous experiments on this shared task. The case-sensitive scores obtained for French→English on newstest2014 are 35.0 BLEU (13a) and 53.1 TER, which ranks us at the fifth position for this translation direction.

6 Conclusion

We have presented the MT systems developed by the Abu-MaTran project for the WMT14 translation shared task. We focused on the French–English language pair, and particularly on the English→French direction. We used a two-step data selection process based on bilingual cross-entropy difference and vocabulary saturation, as well as a novel approach for the extraction of synthetic translation rules and their use to enrich the phrase table. For the LMs and the TMs, we relied on training individual models per corpus and interpolating them by minimising perplexity on the development set. Finally, we made use of the findings of WMT13 by including an OSM model.

Our English→French translation system was ranked second amongst nine teams and a total of twenty submissions, while our French→English submission was ranked fifth. As future work, we plan to investigate the effect of adding to the phrase table synthetic translation rules based on larger dictionaries. We would also like to study the link between OSM and the different decoding parameters implemented in MOSES, as we observed inconsistent results in our experiments.

Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain Adaptation via Pseudo In-domain Data Selection. In Proceedings of EMNLP, pages 355–362.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of WMT, pages 22–64.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A Joint Sequence Translation Model with Integrated Reordering. In Proceedings of ACL/HLT, pages 1045–1054.

Nadir Durrani, Barry Haddow, Kenneth Heafield, and Philipp Koehn. 2013. Edinburgh's Machine Translation Systems for European Language Pairs. In Proceedings of WMT, pages 112–119.

Mikel L. Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M. Tyers. 2011. Apertium: A Free/Open-source Platform for Rule-based Machine Translation. Machine Translation, 25(2):127–144.

Qin Gao and Stephan Vogel. 2008. Parallel Implementations of Word Alignment Tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57.

Robert S. Garfinkel and George L. Nemhauser. 1972. Integer Programming, volume 4. Wiley, New York.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. In Proceedings of ACL, pages 690–696.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL, Interactive Poster and Demonstration Sessions, pages 177–180.
William D. Lewis and Sauleh Eetemadi. 2013. Dramatically Reducing Training Data Size Through Vocabulary Saturation. In Proceedings of WMT, pages 281–291.

Robert C. Moore and William Lewis. 2010. Intelligent Selection of Language Model Training Data. In Proceedings of ACL, pages 220–224.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19–51.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of ACL, volume 1, pages 160–167.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL, pages 311–318.

Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, and Juan Antonio Pérez-Ortiz. 2011. Integrating Shallow-transfer Rules into Phrase-based Statistical Machine Translation. In Proceedings of MT Summit XIII, pages 562–569.

Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez. 2014. The UA-Prompsit Hybrid Machine Translation System for the 2014 Workshop on Statistical Machine Translation. In Proceedings of WMT.

Rico Sennrich. 2012. Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation. In Proceedings of EACL, pages 539–549.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of AMTA, pages 223–231.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at Sixteen: Update and Outlook. In Proceedings of ASRU.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online Large-margin Training for Statistical Machine Translation. In Proceedings of EMNLP.