Incorporating Distinct Translation System Outputs into Statistical and Transformer Model

Mani Bansal, D.K. Lobiyal
Jawaharlal Nehru University, Hauz Khas South, New Delhi, 110067, India

Abstract
Finding the correct translation of an input sentence is not an easy task in Machine Translation, a core problem of Natural Language Processing (NLP). Hybridizing different translation models has been found to handle this problem effectively. This paper presents an approach that exploits several translation models by combining their outputs with a statistical machine translation (SMT) system and a Transformer model. First, we obtain Google Translator and Bing Microsoft Translator outputs as external system outputs. Then, the outputs of those models are fed into the SMT and Transformer systems. Finally, the combined output is generated by analyzing the Google Translator, Bing, SMT and Transformer outputs. Prior work has used system combination, but no existing approach has tried to combine statistical and Transformer systems with other translation systems. Experimental results on English-Hindi and Hindi-English show significant improvement.

Keywords
Machine Translation, Transformer, Statistical Machine Translation, Google Translator, Bing Microsoft Translator, BLEU

ISIC'21: International Semantic Intelligence Conference, February 25-27, 2021, New Delhi, India
✉ manibansal1991@gmail.com (M. Bansal); lobiyal@gmail.com (D.K. Lobiyal)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. INTRODUCTION

Machine Translation (MT) is a major area of Natural Language Processing. There are various translation approaches, each with its pros and cons. One of the well-established approaches is Statistical Machine Translation (SMT). The statistical system [1] is structured for adequacy and for handling out-of-vocabulary words. Neural Machine Translation (NMT) is a breakthrough that reduces post-editing effort [2] and helps in dealing with the syntactic structure of a sentence; NMT [3] also outputs more fluent translations. Therefore, we build a hybrid system that combines Statistical and Transformer (NMT with a multi-head self-attention architecture) outputs to refine the machine translation output.
Combining these approaches into one is not an easy task. Using either SMT or the Transformer alone does not solve all issues. NMT over-translates and under-translates to some extent, and long-distance dependencies, phrase repetitions, translation adequacy for rare words and word alignment problems are also observed in neural systems.
SMT [4], in contrast, handles long-distance dependency issues but is unable to integrate additional information from the source text; such additional information helps to disambiguate word senses and named entities. We evaluate the proposed architecture on English-Hindi and Hindi-English datasets. In it, the outputs of Bing Microsoft Translator and Google Translator are given as input to the Statistical and Transformer models in order to analyze the improvement in the combined target output. If the external translator outputs are obtained for the English to Hindi (Eng-Hi) direction, then the Statistical and Transformer systems use the reverse language pair, i.e. Hindi to English (Hi-Eng), as input. Therefore, the output of the external translators can easily be merged with the input of the other two systems, i.e. Statistical and Transformer.
The paper is organized as follows. Section 2 gives a brief overview of hybrid approaches proposed for Machine Translation. Section 3 elaborates our proposed approach. The experiments undertaken in this study are discussed along with the results obtained in Section 4. In Section 5, the conclusion is presented.

2. RELATED WORK

Many methods have been presented in the literature for machine translation, and researchers have combined different translation techniques [5] to improve translation quality. We have identified that most of the related studies take SMT as the baseline; very few studies in the literature show combination with NMT.
Example-based, knowledge-based and lexical transfer systems were combined using a chart manager in [6], which selected the best group of edges with the help of a chart-walk algorithm (1994). The authors in [7] computed a consensus translation by voting on a confusion network; they produced word alignments of the original machine translation hypotheses in pairs to build the confusion network. The Minimum Bayes' risk system combination (MBRSC) method [8] gathers the benefits of two methods, combination of sub-sequences and selection of sentences, and uses the best subsequences to generate the best translation. The lattice-based system combination model [9] allows phrase alignments and uses a lattice to encode all candidate translations; the earlier confusion network processed word-level translations, whereas the lattice expresses n-to-n mappings for phrase-based translation. In the hybrid architecture of [10], every target hypothesis was paraphrased using various approaches to obtain fused translations for each target, and a final selection was made among all fused translations.
Multi-engine machine translation amalgamates the outputs of several MT systems into a single corrected translation [11]; it consists of a search space, a beam search decoder with its features, and many accessories. NMT decoding lacks a mechanism to guarantee that all source words are translated and favors short translations; therefore, the authors in [12] incorporate an SMT translation model and an n-gram language model under a log-linear framework.

3. PROPOSED APPROACH

3.1. BASIC TRANSFORMER MACHINE TRANSLATION

The Transformer model [13] accepts a source language sentence X = (x1, x2, ..., xN) as input and outputs a target language sentence Y = (y1, y2, ..., yM). NMT constructs a neural network that translates X into Y by learning the objective function p(Y|X) from a parallel corpus. The Transformer is an encoder-decoder model in which the encoder generates the intermediate representation ht (t = 1, 2, ..., N) from the source sentence X and the
decoder generates the target sentence Y from the intermediate representation ht:

ht = TransformerEncoder(X)    (1)
Y = TransformerDecoder(ht)    (2)

The encoder and decoder are each made up of a stack of six layers. Each encoder layer consists of a multi-head self-attention sub-layer and a feed-forward sub-layer, whereas each decoder layer consists of three sub-layers: apart from the two sub-layers of the encoder, the decoder embeds a cross-lingual multi-head attention layer over the encoder stack output.
Both the self-attention mechanism and the cross-lingual attention are computed as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (3)

Here, Q is the query matrix, K and V are the key and value matrices of the encoder or decoder, and d_k is the dimension of the key vectors. The product of query and key represents the similarity between each element of Q and K; it is converted to a probability by the softmax function, which can be treated as the weight of the attention of Q to K. Self-attention captures the degree of association between words in the input sentence by using Q, K and V in the encoder; in a similar way, self-attention in the decoder captures the degree of association between words of the output sentence by using the query, key and value in the decoder. The cross-lingual attention mechanism computes the degree of association between a word in the source and in the target language sentence by using the query of the decoder and the output of the last encoder layer as key and value. In multi-head self-attention with h heads, Q, K and V are linearly projected to h subspaces, and the attention function is applied in parallel on each subspace. The concatenation of these heads is projected back to a space with the original dimension:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (4)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (5)

where W_i^Q ∈ R^(d_model × d_k), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v) and W^O ∈ R^(h·d_v × d_model) are weight matrices and Concat concatenates the head outputs. Multi-head attention learns information from different representation subspaces at different positions. Because the Transformer has no recurrent or convolutional structure, it uses a position encoding (PE) to encode the position of each word in a sentence. PE is calculated as follows:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    (6)
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))    (7)

where i is the dimension index and pos is the absolute position of the word.
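To make Equations (3), (6) and (7) concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention and sinusoidal position encoding. The function names, shapes and the toy input are our own illustration under the standard Transformer formulation [13]; they are not taken from the authors' implementation.

```python
# Minimal sketch of Eq. (3) (scaled dot-product attention) and
# Eqs. (6)-(7) (sinusoidal position encoding). Illustrative only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (3): softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of values

def positional_encoding(max_len, d_model):
    """Eqs. (6)-(7): sine on even dimensions, cosine on odd (even d_model assumed)."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                  # absolute positions
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# Toy usage: 5 tokens with d_model = 8, self-attention over the sentence.
x = np.random.randn(5, 8) + positional_encoding(5, 8)
print(scaled_dot_product_attention(x, x, x).shape)     # (5, 8)
```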
Fig 1. Translation System Architecture

3.2. COMBINING MACHINE TRANSLATION OUTPUT WITH TRANSFORMER MODEL

The inputs of the transformer system are the translated outputs of different (external) translation methods together with the same source sentence, as shown in Fig 1. We use three encoders: one for the Google output text, one for the Bing output, and one for the source language sentence. The concatenation generated by the three encoders is fed into a conventional Transformer decoder. The Google output text encoder takes X1 = (x11, x12, ..., x1N) as input, the Bing output sentence is represented as X2 = (x21, x22, ..., x2N), and the third Transformer encoder accepts the source language sentence X3 = (x31, x32, ..., x3N) as input. The Transformer encoders then generate the intermediate representations hg, hb and ht from X1, X2 and X3:

hg = TransformerEncoder(X1)    (8)
hb = TransformerEncoder(X2)    (9)
ht = TransformerEncoder(X3)    (10)

After the intermediate representations of the inputs are obtained, they are concatenated in the composition layer [14]. The concatenation h is the output of the proposed encoder and is fed to the decoder of our model:

h = Concat(hg, hb, ht)    (11)

The decoder generates the combined target language output using the above expressions.
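The multi-encoder architecture of Equations (8)-(11) can be sketched with standard PyTorch building blocks as below. This is a hedged illustration of our reading of the section, not the authors' released code: the class and parameter names are ours, a single embedding table is assumed to be shared across the three input streams, and position encodings and attention masks are omitted for brevity. The layer count, head count and model dimension follow the settings reported in Section 4.

```python
# Sketch of three Transformer encoders (Google output, Bing output, source text)
# whose representations are concatenated along the time axis (the "composition
# layer") and attended to by a standard Transformer decoder. Illustrative only.
import torch
import torch.nn as nn

class MultiSourceTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # assumed shared vocabulary

        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers)

        self.enc_google = make_encoder()   # encodes the Google translation X1
        self.enc_bing = make_encoder()     # encodes the Bing translation X2
        self.enc_source = make_encoder()   # encodes the source sentence X3
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x_google, x_bing, x_source, y_prev):
        h_g = self.enc_google(self.embed(x_google))      # Eq. (8)
        h_b = self.enc_bing(self.embed(x_bing))          # Eq. (9)
        h_t = self.enc_source(self.embed(x_source))      # Eq. (10)
        h = torch.cat([h_g, h_b, h_t], dim=1)            # Eq. (11): composition layer
        dec = self.decoder(self.embed(y_prev), h)        # decoder cross-attends to h
        return self.out(dec)                             # next-token logits
```

Because the three encoded sequences are concatenated along the time axis, the decoder's cross-attention can freely mix evidence from the external translations and the source sentence when generating each target word.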
3.3. BASIC STATISTICAL MACHINE TRANSLATION

The phrase translation method, based on Bayes' rule, forms the basis of statistical translation [15]. The best translation output sentence e_best is formulated as follows:

e_best = argmax_e P(e|f) = argmax_e [P(f|e) · P_LM(e)]    (12)

where f is the source sentence and e is the target sentence, and P_LM(e) and P(f|e) are the language model (LM) and the translation model (TM), respectively. The input text f is partitioned uniformly into a sequence of T phrases f_1, ..., f_T, and each foreign phrase f_i is translated into an English phrase e_i. The translation model P(f|e) is decomposed as:

P(f|e) = Π_{i=1}^{T} φ(f_i|e_i) · d(α_i − β_{i−1})    (13)

The phrase translation is modelled by the probability distribution φ(f_i|e_i), and the relative distortion probability distribution d(α_i − β_{i−1}) scores the reordering of the output phrases, where α_i is the start position of the foreign phrase translated into the i-th English phrase and β_{i−1} is the end position of the foreign phrase translated into the (i−1)-th English phrase.

3.4. COMBINING MACHINE TRANSLATION OUTPUT WITH STATISTICAL SYSTEM

The statistical combination approach uses three modules: alignment, decoding and scoring. The alignment module produces string alignments of the hypotheses generated by the different machine translation systems. The decoding step builds hypotheses from the aligned strings of the previous step using a beam search algorithm. The final scoring step estimates the final hypotheses.

3.4.1. ALIGNMENT

The single best outputs d1, d2, ..., dm from each of the m participating systems are taken into consideration. We take sentence pairs di and dj and align strings between them; for m sentences, all possible sentence pairs need to be aligned. A string w1 in sentence di is aligned to a string w2 in sentence dj under two conditions: first, w1 and w2 are identical; second, w1 and w2 have a Wu and Palmer [16] similarity score greater than δ. The METEOR [17] alignment configuration is used to align the sentences.

3.4.2. DECODING

In decoding, the best outputs of the participating systems are combined to form a set of hypotheses. The first word of a sentence is used to start a hypothesis. At any moment, the search can either shift to a different sentence or continue adding new words from the previous sentence: a word w added to the hypothesis is taken from the best output di, or the search shifts to a different output dj, in which case the first leftover word from that best output sentence is added to the next hypothesis. With this method, a hypothesis can be built from various system outputs. If a hypothesis contains at most a single word from each set of aligned words, there is little possibility of duplication.
Because the search space switches easily between sentences, maintaining adequacy and fluency is difficult. Therefore, hypothesis length, language model probability, and the number of n-gram matches between the individual system's output and the hypothesis are used as features for a complete hypothesis.

3.4.3. BEAM SEARCH

In this search, a number of equal-length hypotheses is assigned to each beam. Hypotheses are recombined based on feature state, on the history of the hypothesis up to the length requested by the features, and on search space hashing. Pointers to the recombined hypotheses, which are packed into a single hypothesis, are maintained; this enables extraction of the k-best hypotheses.

3.4.4. SCORING

The decoding step outputs m-best lists. Language model probability [18] and Word Mover's Distance [19] are used to score the m-best list. The m-best list is represented as h1, h2, h3, ..., hp, and the score of each hi is calculated. The minimum error rate training (MERT) [20] method is used to estimate the feature weights.
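To illustrate the scoring step, the sketch below ranks an m-best list with a weighted linear combination of features. The specific feature set (a language-model log-probability, the Word Mover's Distance to the closest system output, and a length term), the default weights and the function names are our own assumptions for illustration; in the paper the feature weights are tuned with MERT [20], and the LM and WMD components follow [18] and [19].

```python
# Hedged sketch of Section 3.4.4: score each hypothesis h_i in the m-best list
# with a weighted combination of features and keep the highest-scoring one.
# The weights shown are placeholders; MERT would be used to tune them.
from typing import Callable, List

def score_hypothesis(hyp: str,
                     system_outputs: List[str],
                     lm_logprob: Callable[[str], float],
                     wmd: Callable[[str, str], float],
                     weights=(1.0, -0.5, 0.1)) -> float:
    w_lm, w_wmd, w_len = weights
    fluency = lm_logprob(hyp)                                 # language-model feature
    adequacy = min(wmd(hyp, out) for out in system_outputs)   # distance to closest system output
    return w_lm * fluency + w_wmd * adequacy + w_len * len(hyp.split())

def best_hypothesis(m_best: List[str],
                    system_outputs: List[str],
                    lm_logprob: Callable[[str], float],
                    wmd: Callable[[str, str], float]) -> str:
    # Pick the hypothesis h_i with the highest combined score.
    return max(m_best, key=lambda h: score_hypothesis(h, system_outputs, lm_logprob, wmd))
```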
4. EXPERIMENTATION AND RESULTS

The proposed approach is tested on the HindEnCorp [21] dataset, which contains 273,880 sentences. For the training and development sets, we use 272,880 sentences (267,880 for training + 5k for tuning) for the statistical system and the transformer; the test set contains 1k sentences. The outputs of Google Translate (https://translate.google.com/) and Bing Microsoft Translator (https://www.bing.com/translate) are combined with SMT and the transformer, so our proposed architecture is trained along with the outputs of the various translated sentences.
We also trained and tested our approach on one more dataset, from ILCC (Institute for Language, Cognition and Computation), for English to Hindi, which contains 43,396 sentences. For Hindi to English translation, we used the TDIL-DC (Indian Language Technology Proliferation and Deployment Centre) dataset, divided into a tourism (25k sentences) and a health (25k sentences) domain; the Hindi to English language pair was therefore trained and tested on 50k sentences.
We train the Statistical Machine Translation system with Kneser-Ney smoothing [22] for the probability distribution of a 4-gram language model (LM) using IRSTLM [23]. The Moses decoder [24] finds the highest-scoring sentence for the phrase-based system; the model learns word alignments using GIZA++ [25] with the grow-diag-final heuristic.
The Transformer system (https://github.com/tensorflow/tensor2tensor) contains six encoder and six decoder layers, eight attention heads, and a feed-forward inner-layer size of 2048 with dropout = 0.1. The hidden state and word embedding dimension d_model is 512. We limit the maximum sentence length to 256, and the input and output tokens per batch are limited to 2048. We used the Adam optimizer [26] with β1 = 0.90, β2 = 0.98 and ϵ = 10^-9, a length penalty α = 0.6, and beam search with a beam size of 4.
The Bilingual Evaluation Understudy (BLEU) [27] is the primary evaluation metric. It evaluates the quality of machine-translated text against human translations: scores are calculated for translated sentences by comparing them with good-quality reference sentences and are normalized over the complete dataset to estimate overall quality. BLEU always produces a score between 0 and 1 but reports it in percentage form; the closer the score is to 1, the better the accuracy of the system.
Results on different test sets are obtained for the Hindi to English (Hi-Eng) and English to Hindi (Eng-Hi) language pairs. It is evident from Table 1 that translation system combination gives better results than the individual systems, i.e. SMT with Google and Bing improves over SMT alone by approximately 4 BLEU points for all language pairs. The output of an individual system contains some erroneous or un-translated words, but selecting the best phrase among the different translated outputs (Google and Bing) produced by the participating systems makes the target sentence more accurate.

Table 1: Translation results in terms of BLEU score for multiple methods on different datasets.

Models                               Eng-Hi (HindEnCorp)  Hi-Eng (HindEnCorp)  Eng-Hi (ILCC)  Hi-Eng (TDIL-DC)
SMT                                  9.41                 8.03                 7.88           7.14
Transformer                          12.52                11.89                8.36           9.33
Google                               11.20                10.42                8.14           7.56
Bing                                 11.92                10.61                9.65           8.07
SMT + Google + Bing                  15.37                14.70                11.92          10.82
Transformer + Google + Bing          14.36                13.85                11.04          10.51
SMT + Transformer + Google + Bing    13.09                12.66                10.16          9.29

We also observe in Table 1 that increasing the number of MT systems does not improve accuracy: the BLEU scores of SMT, Transformer, Google and Bing together are about 2 points lower than those of SMT, Google and Bing. The BLEU scores achieved by using SMT, Bing Microsoft Translator and Google Translator together are the highest, and the scores of Transformer, Google and Bing are better than using all translation models together, with BLEU improved by about 1 point. The scores obtained on the TDIL-DC and ILCC datasets are lower than on HindEnCorp because those datasets are much smaller. The overall accuracy of our translation output using the reverse language pair is improved by combining the better parts of the outputs. However, the BLEU scores do not improve much beyond this in our approach; the main reason is that errors made by the external machine translation systems are also reflected in the combination. Hence, by removing these errors, we will try to achieve better results in future.
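For reference, corpus-level BLEU scores of the kind reported in Table 1 can be computed, for example, with the sacrebleu package; the file names below are placeholders, as the paper does not specify the exact evaluation script used.

```python
# Small sketch of corpus-level BLEU evaluation (the paper's primary metric).
import sacrebleu

with open("system_output.hi", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]       # one translated sentence per line
with open("reference.hi", encoding="utf-8") as f:
    references = [line.strip() for line in f]       # matching reference sentences

# sacrebleu takes the hypothesis list and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")                   # reported on a 0-100 (percentage) scale
```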
5. CONCLUSION

In this paper, we investigated the approach of combining various translation outputs with statistical machine translation and the Transformer to improve the final translation. The proposed method increases the complexity of the overall system. Experiments on Hindi to English and English to Hindi show that incorporating different system outputs achieves better results than the individual systems. In future, we will extend this approach to the translation of other language pairs and to tasks such as text abstraction and sentence compression. We will also try to incorporate the BERT model into neural-based English to Hindi translation and will explore graph-based encoder-decoder translation methods.

REFERENCES

[1] D. Jiang, A Statistical Translation Approach by Network Model, in: Recent Developments in Intelligent Computing, Communication and Devices, Springer, Singapore, 2019, pp. 325-331.
[2] A. Toral, M. Wieling, A. Way, Post-editing Effort of a Novel with Statistical and Neural Machine Translation. Frontiers in Digital Humanities, 5.9 (2018). doi: 10.3389/fdigh.2018.00009.
[3] L. Zhou, W. Hu, J. Zhang, C. Zong, Neural System Combination for Machine Translation. arXiv preprint arXiv:1704.06393 (2017). doi: 10.18653/v1/P17-2060.
[4] W. He, Z. He, H. Wu, H. Wang, Improved Neural Machine Translation with SMT Features, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[5] S. Attri, T. Prasad, G. Ramakrishna, HiPHET: A Hybrid Approach to Translate Code Mixed Language (Hinglish) to Pure Languages (Hindi and English). Computer Science, 21.3 (2020). doi: 10.7494/csci.2020.21.3.3624.
[6] S. Nirenburg, R. Frederking, Toward Multi-Engine Machine Translation, in: Human Language Technology: Proceedings of a Workshop, Plainsboro, New Jersey, 1994.
[7] E. Matusov, N. Ueffing, H. Ney, Computing Consensus Translation for Multiple Machine Translation Systems Using Enhanced Hypothesis Alignment, in: 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006.
[8] J. González-Rubio, F. Casacuberta, Minimum Bayes' Risk Subsequence Combination for Machine Translation. Pattern Analysis and Applications (2015) 523-533. doi: 10.1007/s10044-014-0387-5.
[9] Y. Feng, Y. Liu, H. Mi, Q. Liu, Y. Lu, Lattice-Based System Combination for Statistical Machine Translation, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 1105-1113.
[10] W.Y. Ma, K. McKeown, System Combination for Machine Translation through Paraphrasing, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1053-1058. doi: 10.18653/v1/D15-1122.
[11] J. Zhu, M. Yang, S. Li, T. Zhao, Sentence-Level Paraphrasing for Machine Translation System Combination, in: International Conference of Pioneering Computer Scientists, Engineers and Educators, Springer, Singapore, 2016, pp. 612-620. doi: 10.1007/978-981-10-2053-7_54.
[12] K. Heafield, A. Lavie, Combining Machine Translation Output with Open Source: The Carnegie Mellon Multi-Engine Machine Translation Scheme. The Prague Bulletin of Mathematical Linguistics, 2010, pp. 27-36. doi: 10.2478/v10108-010-0008-4.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is All You Need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[14] A. Currey, K. Heafield, Incorporating Source Syntax into Transformer-Based Neural Machine Translation, in: Proceedings of the Fourth Conference on Machine Translation, vol. 1, 2019, pp. 24-33.
[15] P. Koehn, Statistical Machine Translation. Cambridge University Press, 2009.
[16] Z. Wu, M. Palmer, Verb Semantics and Lexical Selection. arXiv preprint cmp-lg/9406033 (1994). doi: 10.3115/981732.981751.
[17] S. Banerjee, A. Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65-72.
[18] T. Brants, A.C. Popat, P. Xu, F.J. Och, J. Dean, Large Language Models in Machine Translation, 2007.
[19] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From Word Embeddings to Document Distances, in: International Conference on Machine Learning, 2015, pp. 957-966.
[20] O. Zaidan, Z-MERT: A Fully Configurable Open Source Tool for Minimum Error Rate Training of Machine Translation Systems. The Prague Bulletin of Mathematical Linguistics, 91.1 (2009) 79-88. doi: 10.2478/v10108-009-0018-2.
[21] O. Bojar, V. Diatka, P. Rychlý, P. Straňák, V. Suchomel, A. Tamchyna, D. Zeman, HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation, in: Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014, pp. 3550-3555.
[22] R. Kneser, H. Ney, Improved Backing-off for m-gram Language Modeling, in: 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 1, IEEE, 1995, pp. 181-184. doi: 10.1109/ICASSP.1995.479394.
[23] M. Federico, N. Bertoldi, M. Cettolo, IRSTLM: An Open Source Toolkit for Handling Large Scale Language Models, in: Ninth Annual Conference of the International Speech Communication Association, 2008.
[24] H. Hoang, P. Koehn, Design of the Moses Decoder for Statistical Machine Translation, in: Proceedings of Software Engineering, Testing, and Quality Assurance for Natural Language Processing, 2008, pp. 58-65.
[25] F.J. Och, H. Ney, A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29.1 (2003) 19-51.
[26] D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2015).
[27] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: A Method for Automatic Evaluation of Machine Translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311-318. doi: 10.3115/1073083.1073135.