English-to-Hindi system description for WMT 2014: Deep Source-Context Features for Moses
Marta R. Costa-jussà (1), Parth Gupta (2), Rafael E. Banchs (3) and Paolo Rosso (2)
(1) Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico
(2) NLE Lab, PRHLT Research Center, Universitat Politècnica de València
(3) Human Language Technology, Institute for Infocomm Research, Singapore
marta@nlp.cic.ipn.mx, {pgupta,prosso}@dsic.upv.es, rembanchs@i2r.a-star.edu.sg

Abstract

This paper describes the IPN-UPV participation in the English-to-Hindi translation task of the WMT 2014 International Evaluation Campaign. The system presented is based on Moses and enhanced with deep learning by means of a source-context feature function. This feature depends on the input sentence to be translated, which makes it more challenging to adapt into the Moses framework. This work reports the experimental details of the system, putting special emphasis on how the feature function is integrated in Moses and how the deep learning representations are trained and used.

1 Introduction

This paper describes the joint participation of the Instituto Politécnico Nacional (IPN) and the Universitat Politècnica de València (UPV), in cooperation with the Institute for Infocomm Research (I2R), in the 9th Workshop on Statistical Machine Translation (WMT 2014). In particular, our participation was in the English-to-Hindi translation task.

Our baseline system is a standard phrase-based SMT system built with Moses (Koehn et al., 2007). Starting from this system, we propose to introduce a source-context feature function inspired by previous works (R. Costa-jussà and Banchs, 2011; Banchs and Costa-jussà, 2011). The main novelty of this work is that the source-context feature is computed in a new deep representation.

The rest of the paper is organized as follows. Section 2 presents the motivation of this semantic feature and the description of how the source-context feature function is added to Moses. Section 3 explains how both the latent semantic indexing and the deep representation of sentences are used to better compute similarities among source contexts. Section 4 details the WMT experimental framework and results, which prove the relevance of the proposed technique. Finally, section 5 reports the main conclusions of this system description paper.

2 Integration of a deep source-context feature function in Moses

This section reports the motivation and description of the source-context feature function, together with the explanation of how it is integrated in Moses.

2.1 Motivation and description

Source context information in the phrase-based system is limited to the length of the translation units (phrases). Also, all training sentences contribute equally to the final translation.

We propose a source-context feature function which measures the similarity between the input sentence and all training sentences. In this way, the translation unit should be extended from source|||target to source|||target|||trainingsentence, with trainingsentence being the sentence from which the source and target phrases were extracted. The measured similarity is used to favour those translation units that have been extracted from training sentences that are similar to the current sentence to be translated, and to penalize those translation units that have been extracted from unrelated or dissimilar training sentences, as shown in Figure 1.

In the proposed feature, sentence similarity is measured by means of the cosine distance in a reduced-dimension vector-space model, which is constructed either by means of standard latent semantic analysis or using the deep representation described in section 3.
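As an illustration, the following minimal Python sketch (hypothetical code, not the authors' implementation) shows how such a feature could score translation units by the cosine similarity between the input sentence and the training sentence each unit was extracted from; the sentence vectors, the toy values and the phrase-pair bookkeeping are assumptions made only for this example.

```python
# Hypothetical sketch of the source-context feature of section 2.1 (not the
# authors' implementation). It assumes every training sentence and the input
# sentence have already been mapped to low-dimensional vectors by one of the
# models of section 3 (LSI or the deep autoencoder).
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def source_context_scores(input_vec, train_vecs, unit_origins):
    """Score each translation unit source|||target|||trainingsentence by the
    similarity between the input sentence and the training sentence the unit
    was extracted from; repeated units keep the largest value (step 5 of the
    procedure in section 2.2). unit_origins maps (source, target) pairs to
    the indices of the training sentences they were extracted from."""
    return {unit: max(cosine(input_vec, train_vecs[i]) for i in origins)
            for unit, origins in unit_origins.items()}

# Toy usage with 2-dimensional sentence vectors (invented values):
train_vecs = np.array([[0.9, 0.1],   # sentence S1 of Figure 1
                       [0.2, 0.8]])  # sentence S2 of Figure 1
input_vec = np.array([0.3, 0.7])     # sentence to be translated
units = {("book", "EktAb"): [1], ("book", "aArE?ft krnA"): [0]}
print(source_context_scores(input_vec, train_vecs, units))
```

With these toy vectors, the unit extracted from S2 receives the higher score, which is the behaviour the feature is intended to encourage.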
Figure 1: Illustration of the method. Two training pairs are shown: S1 "we could not book the room in time" with its Hindi translation T1, and S2 "I want to write the book in time" with its Hindi translation T2. For the input sentence "i am reading a nice book", the translation unit book|||EktAb is extracted from S2, while the unit book|||aArE?ft krnA is extracted from S1.

2.2 Integration in Moses

As defined in the section above and, previously, in (R. Costa-jussà and Banchs, 2011; Banchs and Costa-jussà, 2011), the value of the proposed source-context similarity feature depends on each individual input sentence to be translated by the system: we compute the similarity between the source input sentence and all the source training sentences.

This definition implies that the feature function depends on the input sentence to be translated. To implement this requirement, we followed our previous implementation of an off-line version of the proposed methodology which, although very inefficient in practice, allows us to evaluate the impact of the source-context feature on a state-of-the-art phrase-based translation system. This practical implementation follows the procedure below:

1. Two sentence similarity matrices are computed: one between sentences in the development and training sets, and the other between sentences in the test and training sets.

2. Each matrix entry mij contains the similarity score between the ith sentence in the training set and the jth sentence in the development (or test) set.

3. For each sentence s in the test and development sets, a phrase-pair list LS of all potential phrases that can be used during decoding is extracted from the aligned training set.

4. The corresponding source-context similarity values are assigned to each phrase in the lists LS according to the values in the corresponding similarity matrices.

5. Each phrase list LS is collapsed into a phrase table TS by removing repetitions (when removing repeated entries in the list, the largest value of the source-context similarity feature is retained).

6. Each phrase table is completed by adding the standard feature values (which are computed in the standard manner).

7. Moses is used on a sentence-per-sentence basis, using a different translation table for each development (or test) sentence.

3 Representation of Sentences

We represent the sentences of the source language in the latent space by means of linear and non-linear dimensionality reduction techniques. Such models can be seen as topic models where the low-dimensional embedding of the sentences represents conditional latent topics.

3.1 Deep Autoencoders

The non-linear dimensionality reduction technique we employ is based on the concept of deep learning, specifically deep autoencoders. Autoencoders are three-layer networks (input layer, hidden layer and output layer) which try to learn an identity function. In the neural-network representation of the autoencoder (Rumelhart et al., 1986), the visible layer corresponds to the input layer and the hidden layer corresponds to the latent features. The autoencoder tries to learn an abstract representation of the data in the hidden layer in such a way that the reconstruction error is minimized. When the dimension of the hidden layer is sufficiently small, the autoencoder is able to generalise and derive a powerful low-dimensional representation of the data.

We consider a bag-of-words representation of the text sentences, where the visible layer is a binary feature vector v whose component vi corresponds to the presence or absence of the ith word. We use binary restricted Boltzmann machines (RBMs) to construct the autoencoder, as shown in (Hinton et al., 2006). The latent representation of the input sentence can be obtained as shown below:

p(h|v) = σ(W v + b)    (1)

where W is the symmetric weight matrix between the visible and hidden layers, b is the hidden-layer bias vector, and σ(x) is the sigmoid logistic function 1/(1 + exp(−x)).
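As a concrete illustration of equation (1), the following minimal sketch maps a binary bag-of-words vector to its latent representation; the weight matrix, bias and toy vocabulary are assumed placeholder values, not the parameters trained in this work.

```python
# Sketch of equation (1): latent representation of a binary bag-of-words vector.
# W and b are assumed to come from an already trained RBM/autoencoder; the toy
# values below are placeholders, not the trained model.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def latent_representation(v, W, b):
    """p(h|v) = sigmoid(W v + b), with W of shape (hidden_dim, vocab_size)
    mapping the visible (input) layer to the hidden (latent) layer."""
    return sigmoid(W @ v + b)

# Toy example: vocabulary of 5 terms, 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 5))
b = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # word present (1) / absent (0)
print(latent_representation(v, W, b))     # 3-dimensional latent vector
```

Stacking several such mappings on top of one another yields the deep sentence representation described in the remainder of this section.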
Autoencoders with a single hidden layer do not have any advantage over linear methods like PCA (Bourlard and Kamp, 1988); hence we consider a deep autoencoder built by stacking multiple RBMs on top of each other (Hinton and Salakhutdinov, 2006). Autoencoders had always been difficult to train through back-propagation until greedy layer-wise pre-training was found (Hinton and Salakhutdinov, 2006; Hinton et al., 2006; Bengio et al., 2006). The pre-training initialises the network parameters in such a way that fine-tuning them through back-propagation becomes very effective and efficient (Erhan et al., 2010).

3.2 Latent Semantic Indexing

As a linear dimensionality reduction technique, latent semantic indexing (LSI) is used to represent sentences in the abstract space (Deerwester et al., 1990). The term-sentence matrix X is created, where xij denotes the occurrence of the ith term in the jth sentence. The matrix X is factorized using the singular value decomposition (SVD) method to obtain the top m principal components, and the sentences are represented in this m-dimensional latent space.

4 Experiments

This section describes the experiments carried out in the context of WMT 2014. For English-Hindi, the parallel training data was collected by Charles University and consisted of 3.6M English words and 3.97M Hindi words. There was a monolingual corpus for Hindi coming from different sources, which consisted of 790.8M Hindi words. In addition, there was a development corpus of news data translated specifically for the task, which consisted of 10.3m English words and 10.1m Hindi words. For internal experimentation we built a test set extracted from the training set: we randomly selected 429 sentences from the training corpus which appeared only once, removed them from training and used them as our internal test set.

The monolingual Hindi corpus was used to build a larger language model. The language model was computed as an interpolation of the language model trained on the Hindi part of the bilingual corpus (3.97M words) and the language model trained on the monolingual Hindi corpus (790.8M words). The interpolation was optimised on the development set provided by the organizers. Both interpolated language models were 5-grams with Kneser-Ney smoothing.

The preprocessing of the corpus was done with the standard tools from Moses. English was lowercased and tokenized. Hindi was tokenized with the simple tokenizer provided by the organizers. We cleaned the corpus using standard parameters (i.e. we keep sentences between 1 and 80 words in length).

For training, we used the default Moses options, which include: the grow-diag-final-and word alignment symmetrization, lexicalized reordering, relative frequencies (conditional and posterior probabilities) with phrase discounting, lexical weights and phrase bonus for the translation model (with phrases up to length 10), a language model (see details above) and a word bonus model. Optimisation was done using the MERT algorithm available in Moses. Optimisation is slow because of the way the integration of the feature function is done, which requires one phrase table for each input sentence.

During translation, we dropped unknown words and used the option of minimum Bayes risk decoding. Postprocessing consisted in de-tokenizing Hindi using the standard detokenizer of Moses (the English version).

4.1 Autoencoder training

The architecture of the autoencoder we consider was n-500-128-500-n, where n is the vocabulary size. The training sentences were stemmed, stopwords were removed, and terms with a sentence frequency (the total number of training sentences in which the term appears) below 20 were also removed. This resulted in a vocabulary size of n = 7299.

The RBMs were pretrained using Contrastive Divergence (CD) with step size 1 (Hinton, 2002). After pretraining, the RBMs were stacked on top of each other and unrolled to create the deep autoencoder (Hinton and Salakhutdinov, 2006). During the fine-tuning stage, we backpropagated the reconstruction error to update the network parameters. The sizes of the mini-batches during pretraining and fine-tuning were 25 and 100, respectively. Weight decay was used to prevent overfitting. Additionally, in order to encourage sparsity in the hidden units, Kullback-Leibler sparsity regularization was used. We used a GPU-based implementation of the autoencoder (NVIDIA GeForce GTX Titan with 5 GiB memory and 2688 CUDA cores) to train the models, which took around 4.5 hours for full training.
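To make the pretraining step more concrete, the sketch below outlines a single Contrastive Divergence (CD-1) update for one binary RBM layer. It is an illustrative outline under assumed hyper-parameters (learning rate, weight-decay value, random seed), not the GPU implementation used in this work, and it omits the KL sparsity term for brevity.

```python
# Illustrative CD-1 update for one binary RBM layer (a sketch, not the GPU
# implementation used in this work; hyper-parameter values are assumptions).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b_h, b_v, lr=0.05, weight_decay=1e-4, rng=None):
    """One Contrastive Divergence step with a single Gibbs step (CD-1) on a
    mini-batch V of binary bag-of-words rows.
    Shapes: V (batch, vocab), W (vocab, hidden), b_h (hidden,), b_v (vocab,)."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden probabilities and sampled hidden states.
    h_prob = sigmoid(V @ W + b_h)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: reconstruct the visible layer and recompute hidden probs.
    v_recon = sigmoid(h_sample @ W.T + b_v)
    h_recon = sigmoid(v_recon @ W + b_h)
    # Approximate gradient and parameter update, with weight decay on W.
    batch = V.shape[0]
    W += lr * ((V.T @ h_prob - v_recon.T @ h_recon) / batch - weight_decay * W)
    b_h += lr * (h_prob - h_recon).mean(axis=0)
    b_v += lr * (V - v_recon).mean(axis=0)
    return W, b_h, b_v

# Toy mini-batch of 25 "sentences" over a 50-word vocabulary, 16 hidden units.
rng = np.random.default_rng(1)
V = (rng.random((25, 50)) < 0.1).astype(float)
W, b_h, b_v = rng.normal(scale=0.01, size=(50, 16)), np.zeros(16), np.zeros(50)
W, b_h, b_v = cd1_update(V, W, b_h, b_v, rng=rng)
```

In the setup described above, such an update would be applied layer by layer (first to the n-500 RBM, then to the 500-128 RBM on the hidden activations of the first) before the stacked network is unrolled and fine-tuned with back-propagation.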
4.2 Results

Table 1 shows the improvements in terms of BLEU of adding the deep source context over the baseline system for English-to-Hindi (En2Hi) on the development and test sets. Adding source-context information using deep learning outperforms the latent semantic analysis methodology.

En2Hi            Dev     Test
baseline         9.42    14.99
+lsi             9.83    15.12
+deep context    10.40   15.43†

Table 1: BLEU scores for the En2Hi translation task. † depicts statistical significance.

                 cp      pp      scf
cry|||ronA       0.23    0.06    0.85
cry|||cFK        0.15    0.04    0.90

Table 3: Probability values (conditional, cp, and posterior, pp, as standard features in a phrase-based system) and source-context feature (scf) values for the word cry and two Hindi translations.

Improvements over the baseline system are reported in the task from English to Hindi. As further work, we will implement our feature function in Moses using suffix arrays in order to make it more efficient.
References

Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Geoffrey E. Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180.

Marta R. Costa-jussà and Rafael E. Banchs. 2011. The BM-I2R Haitian-Créole-to-English translation system description for the WMT 2011 evaluation campaign. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 452–456, Edinburgh, Scotland, July. Association for Computational Linguistics.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature, 323(6088):533–536.