Incorporating Distinct Translation System Outputs into Statistical and Transformer Model

Mani Bansal, D.K. Lobiyal
Jawaharlal Nehru University, Hauz Khas South, New Delhi, 110067, India

Abstract
Finding the correct translation of an input sentence is not an easy task in Machine Translation, a core problem of Natural Language Processing (NLP). Hybridizing different translation models has been found to handle this problem effectively. This paper presents an approach that exploits several translation models by combining their outputs with a statistical machine translation (SMT) system and a Transformer model. First, we obtain Google Translator and Bing Microsoft Translator outputs as external system outputs. Then, the outputs of those models are fed into the SMT and Transformer systems. Finally, the combined output is generated by analyzing the Google Translator, Bing, SMT and Transformer outputs. Prior work has used system combination, but no existing approach has tried to combine statistical and Transformer systems with other translation systems. Experimental results on English-Hindi and Hindi-English show significant improvement.

Keywords
Machine Translation, Transformer, Statistical Machine Translation, Google Translator, Bing Microsoft Translator, BLEU

ISIC'21: International Semantic Intelligence Conference, February 25-27, 2021, New Delhi, India
✉ manibansal1991@gmail.com (M. Bansal); lobiyal@gmail.com (D.K. Lobiyal)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. INTRODUCTION

Machine Translation (MT) is a major area of Natural Language Processing. There are various translation approaches, each with its pros and cons. One of the well-established approaches is Statistical Machine Translation (SMT). The statistical system [1] is structured for adequacy and for handling out-of-vocabulary words. Neural Machine Translation (NMT) is a breakthrough that reduces post-editing effort [2] and helps in dealing with the syntactic structure of a sentence; NMT [3] also outputs more fluent translations. Therefore, we build a hybrid system that combines Statistical and Transformer (NMT with a multi-head self-attention architecture) outputs to refine the machine translation output.
Combining these approaches into one is not an easy task. Using either SMT or the Transformer alone does not solve all issues. NMT over-translates and under-translates to some extent, and long-distance dependencies, phrase repetitions, translation adequacy for rare words and word alignment problems are also observed in neural systems.
SMT [4], in contrast, handles long-distance dependency issues but is unable to integrate additional information from the source text; such additional information helps to disambiguate word senses and named entities. We evaluate the proposed architecture on English-Hindi and Hindi-English datasets. In it, the outputs of Bing Microsoft Translator and Google Translator are given as input to the Statistical and Transformer models in order to analyze the improvement in the combined target output. If the external translator outputs are obtained for the English to Hindi (Eng-Hi) direction, then the Statistical and Transformer systems use the reverse language pair, i.e. Hindi to English (Hi-Eng), as input. Therefore, the output of the external translators can easily be merged with the input of the other two systems, i.e. Statistical and Transformer.
The paper is organized as follows. Section 2 gives a brief overview of hybrid approaches proposed for Machine Translation. Section 3 elaborates our proposed approach. The experiments undertaken in this study are discussed along with the results obtained in Section 4. In Section 5, the conclusion is presented.

2. RELATED WORK

Many methods have been presented in the literature for machine translation, and researchers have combined different translation techniques [5] to improve translation quality. We have identified that most of the related studies take SMT as the baseline; very few studies in the literature show combination with NMT.
Example-based, knowledge-based and lexical transfer systems were combined using a chart manager in [6], which selected the best group of edges with the help of a chart-walk algorithm (1994). The authors in [7] computed a consensus translation by voting on a confusion network; they produced word alignments of the original machine translation hypotheses in pairs to build the confusion network. The Minimum Bayes' risk system combination (MBRSC) method [8] gathers the benefits of two methods, combination of sub-sequences and selection of sentences, and uses the best subsequences to generate the best translation. The lattice-based system combination model [9] allows phrase alignments and uses a lattice to encode all candidate translations; the earlier confusion network processed word-level translations, whereas the lattice expresses n-to-n mappings for phrase-based translation. In the hybrid architecture of [10], every target hypothesis was paraphrased using various approaches to obtain fused translations for each target, and a final selection was made among all fused translations.
Multi-engine machine translation amalgamates the outputs of several MT systems into a single corrected translation [11]; it consists of a search space, a beam search decoder with its features, and many accessories. NMT decoding lacks a mechanism to guarantee that all source words are translated and favors short translations; therefore, the authors in [12] incorporate an SMT translation model and an n-gram language model under a log-linear framework.

3. PROPOSED APPROACH

3.1. BASIC TRANSFORMER MACHINE TRANSLATION

The Transformer model [13] accepts a source language sentence X = (x1, x2, ..., xN) as input and outputs a target language sentence Y = (y1, y2, ..., yM). NMT constructs a neural network that translates X into Y by learning the objective function p(Y|X) from a parallel corpus. The Transformer is an encoder-decoder model in which the encoder generates the intermediate representation ht (t = 1, 2, ..., N) from the source sentence X and the
decoder generates the target sentence Y from the intermediate representation ht:

ht = TransformerEncoder(X)    (1)
Y = TransformerDecoder(ht)    (2)

The encoder and decoder are each made up of a stack of six layers. Each encoder layer consists of a multi-head self-attention sub-layer and a feed-forward sub-layer, whereas each decoder layer consists of three sub-layers: apart from the two sub-layers of the encoder, the decoder embeds a cross-lingual multi-head attention layer over the encoder stack output.
Both the self-attention mechanism and the cross-lingual attention are computed as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (3)

Here, Q is the query matrix, K and V are the key and value matrices of the encoder or decoder, and d_k is the dimension of the key vectors. The product of query and key represents the similarity between each element of Q and K; it is converted to a probability by the softmax function, which can be treated as the weight of the attention of Q to K. Self-attention captures the degree of association between words in the input sentence by using Q, K and V in the encoder; in a similar way, self-attention in the decoder captures the degree of association between words of the output sentence by using the query, key and value in the decoder. The cross-lingual attention mechanism computes the degree of association between a word in the source and in the target language sentence by using the query of the decoder and the output of the last encoder layer as key and value. In multi-head self-attention with h heads, Q, K and V are linearly projected to h subspaces, and the attention function is applied in parallel on each subspace. The concatenation of these heads is projected back to a space with the original dimension:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (4)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (5)

where W_i^Q ∈ R^(d_model × d_k), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v) and W^O ∈ R^(h·d_v × d_model) are weight matrices and Concat concatenates the head outputs. Multi-head attention learns information from different representation subspaces at different positions. Because the Transformer has no recurrent or convolutional structure, it uses a position encoding (PE) to encode the position of each word in a sentence. PE is calculated as follows:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    (6)
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))    (7)

where i is the dimension index and pos is the absolute position of the word.
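To make Equations (3), (6) and (7) concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention and sinusoidal position encoding. The function names, shapes and the toy input are our own illustration under the standard Transformer formulation [13]; they are not taken from the authors' implementation.

```python
# Minimal sketch of Eq. (3) (scaled dot-product attention) and
# Eqs. (6)-(7) (sinusoidal position encoding). Illustrative only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (3): softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of values

def positional_encoding(max_len, d_model):
    """Eqs. (6)-(7): sine on even dimensions, cosine on odd (even d_model assumed)."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                  # absolute positions
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# Toy usage: 5 tokens with d_model = 8, self-attention over the sentence.
x = np.random.randn(5, 8) + positional_encoding(5, 8)
print(scaled_dot_product_attention(x, x, x).shape)     # (5, 8)
```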
Fig 1. Translation System Architecture

3.2. COMBINING MACHINE TRANSLATION OUTPUT WITH TRANSFORMER MODEL

The inputs of the transformer system are the translated outputs of different (external) translation methods together with the same source sentence, as shown in Fig 1. We use three encoders: one for the Google output text, one for the Bing output, and one for the source language sentence. The concatenation generated by the three encoders is fed into a conventional Transformer decoder. The Google output text encoder takes X1 = (x11, x12, ..., x1N) as input, the Bing output sentence is represented as X2 = (x21, x22, ..., x2N), and the third Transformer encoder accepts the source language sentence X3 = (x31, x32, ..., x3N) as input. The Transformer encoders then generate the intermediate representations hg, hb and ht from X1, X2 and X3:

hg = TransformerEncoder(X1)    (8)
hb = TransformerEncoder(X2)    (9)
ht = TransformerEncoder(X3)    (10)

After the intermediate representations of the inputs are obtained, they are concatenated in the composition layer [14]. The concatenation h is the output of the proposed encoder and is fed to the decoder of our model:

h = Concat(hg, hb, ht)    (11)

The decoder generates the combined target language output using the above expressions.
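The multi-encoder architecture of Equations (8)-(11) can be sketched with standard PyTorch building blocks as below. This is a hedged illustration of our reading of the section, not the authors' released code: the class and parameter names are ours, a single embedding table is assumed to be shared across the three input streams, and position encodings and attention masks are omitted for brevity. The layer count, head count and model dimension follow the settings reported in Section 4.

```python
# Sketch of three Transformer encoders (Google output, Bing output, source text)
# whose representations are concatenated along the time axis (the "composition
# layer") and attended to by a standard Transformer decoder. Illustrative only.
import torch
import torch.nn as nn

class MultiSourceTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # assumed shared vocabulary

        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers)

        self.enc_google = make_encoder()   # encodes the Google translation X1
        self.enc_bing = make_encoder()     # encodes the Bing translation X2
        self.enc_source = make_encoder()   # encodes the source sentence X3
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x_google, x_bing, x_source, y_prev):
        h_g = self.enc_google(self.embed(x_google))      # Eq. (8)
        h_b = self.enc_bing(self.embed(x_bing))          # Eq. (9)
        h_t = self.enc_source(self.embed(x_source))      # Eq. (10)
        h = torch.cat([h_g, h_b, h_t], dim=1)            # Eq. (11): composition layer
        dec = self.decoder(self.embed(y_prev), h)        # decoder cross-attends to h
        return self.out(dec)                             # next-token logits
```

Because the three encoded sequences are concatenated along the time axis, the decoder's cross-attention can freely mix evidence from the external translations and the source sentence when generating each target word.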
3.3. BASIC STATISTICAL MACHINE TRANSLATION

The phrase translation method, based on Bayes' rule, forms the basis of statistical translation [15]. The best translation output sentence e_best is formulated as follows:

e_best = argmax_e P(e|f) = argmax_e [P(f|e) · P_LM(e)]    (12)

where f is the source sentence and e is the target sentence, and P_LM(e) and P(f|e) are the language model (LM) and the translation model (TM), respectively. The input text f is partitioned uniformly into a sequence of T phrases f_1, ..., f_T, and each foreign phrase f_i is translated into an English phrase e_i. The translation model P(f|e) is decomposed as:

P(f|e) = Π_{i=1}^{T} φ(f_i|e_i) · d(α_i − β_{i−1})    (13)

The phrase translation is modelled by the probability distribution φ(f_i|e_i), and the relative distortion probability distribution d(α_i − β_{i−1}) scores the reordering of the output phrases, where α_i is the start position of the foreign phrase translated into the i-th English phrase and β_{i−1} is the end position of the foreign phrase translated into the (i−1)-th English phrase.

3.4. COMBINING MACHINE TRANSLATION OUTPUT WITH STATISTICAL SYSTEM

The statistical combination approach uses three modules: alignment, decoding and scoring. The alignment module produces string alignments of the hypotheses generated by the different machine translation systems. The decoding step builds hypotheses from the aligned strings of the previous step using a beam search algorithm. The final scoring step estimates the final hypotheses.

3.4.1. ALIGNMENT

The single best outputs d1, d2, ..., dm from each of the m participating systems are taken into consideration. We take sentence pairs di and dj and align strings between them; for m sentences, all possible sentence pairs need to be aligned. A string w1 in sentence di is aligned to a string w2 in sentence dj under two conditions: first, w1 and w2 are identical; second, w1 and w2 have a Wu and Palmer [16] similarity score greater than δ. The METEOR [17] alignment configuration is used to align the sentences.

3.4.2. DECODING

In decoding, the best outputs of the participating systems are combined to form a set of hypotheses. The first word of a sentence is used to start a hypothesis. At any moment, the search can either shift to a different sentence or continue adding new words from the previous sentence: a word w added to the hypothesis is taken from the best output di, or the search shifts to a different output dj, in which case the first leftover word from that best output sentence is added to the next hypothesis. With this method, a hypothesis can be built from various system outputs. If a hypothesis contains at most a single word from each set of aligned words, there is little possibility of duplication.
Because the search space switches easily between sentences, maintaining adequacy and fluency is difficult. Therefore, hypothesis length, language model probability, and the number of n-gram matches between the individual system's output and the hypothesis are used as features for a complete hypothesis.

3.4.3. BEAM SEARCH

In this search, a number of equal-length hypotheses is assigned to each beam. Hypotheses are recombined based on feature state, on the history of the hypothesis up to the length requested by the features, and on search space hashing. Pointers to the recombined hypotheses, which are packed into a single hypothesis, are maintained; this enables extraction of the k-best hypotheses.

3.4.4. SCORING

The decoding step outputs m-best lists. Language model probability [18] and Word Mover's Distance [19] are used to score the m-best list. The m-best list is represented as h1, h2, h3, ..., hp, and the score of each hi is calculated. The minimum error rate training (MERT) [20] method is used to estimate the feature weights.
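To illustrate the scoring step, the sketch below ranks an m-best list with a weighted linear combination of features. The specific feature set (a language-model log-probability, the Word Mover's Distance to the closest system output, and a length term), the default weights and the function names are our own assumptions for illustration; in the paper the feature weights are tuned with MERT [20], and the LM and WMD components follow [18] and [19].

```python
# Hedged sketch of Section 3.4.4: score each hypothesis h_i in the m-best list
# with a weighted combination of features and keep the highest-scoring one.
# The weights shown are placeholders; MERT would be used to tune them.
from typing import Callable, List

def score_hypothesis(hyp: str,
                     system_outputs: List[str],
                     lm_logprob: Callable[[str], float],
                     wmd: Callable[[str, str], float],
                     weights=(1.0, -0.5, 0.1)) -> float:
    w_lm, w_wmd, w_len = weights
    fluency = lm_logprob(hyp)                                 # language-model feature
    adequacy = min(wmd(hyp, out) for out in system_outputs)   # distance to closest system output
    return w_lm * fluency + w_wmd * adequacy + w_len * len(hyp.split())

def best_hypothesis(m_best: List[str],
                    system_outputs: List[str],
                    lm_logprob: Callable[[str], float],
                    wmd: Callable[[str, str], float]) -> str:
    # Pick the hypothesis h_i with the highest combined score.
    return max(m_best, key=lambda h: score_hypothesis(h, system_outputs, lm_logprob, wmd))
```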
4. EXPERIMENTATION AND RESULTS

The proposed approach is tested on the HindEnCorp [21] dataset, which contains 273,880 sentences. For the training and development sets, we use 272,880 sentences (267,880 for training + 5k for tuning) for the statistical system and the transformer; the test set contains 1k sentences. The outputs of Google Translate (https://translate.google.com/) and Bing Microsoft Translator (https://www.bing.com/translate) are combined with SMT and the transformer, so our proposed architecture is trained along with the outputs of the various translated sentences.
We also trained and tested our approach on one more dataset, from ILCC (Institute for Language, Cognition and Computation), for English to Hindi, which contains 43,396 sentences. For Hindi to English translation, we used the TDIL-DC (Indian Language Technology Proliferation and Deployment Centre) dataset, divided into a tourism (25k sentences) and a health (25k sentences) domain; the Hindi to English language pair was therefore trained and tested on 50k sentences.
We train the Statistical Machine Translation system with Kneser-Ney smoothing [22] for the probability distribution of a 4-gram language model (LM) using IRSTLM [23]. The Moses decoder [24] finds the highest-scoring sentence for the phrase-based system; the model learns word alignments using GIZA++ [25] with the grow-diag-final heuristic.
The Transformer system (https://github.com/tensorflow/tensor2tensor) contains six encoder and six decoder layers, eight attention heads, and a feed-forward inner-layer size of 2048 with dropout = 0.1. The hidden state and word embedding dimension d_model is 512. We limit the maximum sentence length to 256, and the input and output tokens per batch are limited to 2048. We used the Adam optimizer [26] with β1 = 0.90, β2 = 0.98 and ϵ = 10^-9, a length penalty α = 0.6, and beam search with a beam size of 4.
The Bilingual Evaluation Understudy (BLEU) [27] is the primary evaluation metric. It evaluates the quality of machine-translated text against human translations: scores are calculated for translated sentences by comparing them with good-quality reference sentences and are normalized over the complete dataset to estimate overall quality. BLEU always produces a score between 0 and 1 but reports it in percentage form; the closer the score is to 1, the better the accuracy of the system.
Results on different test sets are obtained for the Hindi to English (Hi-Eng) and English to Hindi (Eng-Hi) language pairs. It is evident from Table 1 that translation system combination gives better results than the individual systems, i.e. SMT with Google and Bing improves over SMT alone by approximately 4 BLEU points for all language pairs. The output of an individual system contains some erroneous or un-translated words, but selecting the best phrase among the different translated outputs (Google and Bing) produced by the participating systems makes the target sentence more accurate.

Table 1: Translation results in terms of BLEU score for multiple methods on different datasets.

Models                               Eng-Hi (HindEnCorp)  Hi-Eng (HindEnCorp)  Eng-Hi (ILCC)  Hi-Eng (TDIL-DC)
SMT                                  9.41                 8.03                 7.88           7.14
Transformer                          12.52                11.89                8.36           9.33
Google                               11.20                10.42                8.14           7.56
Bing                                 11.92                10.61                9.65           8.07
SMT + Google + Bing                  15.37                14.70                11.92          10.82
Transformer + Google + Bing          14.36                13.85                11.04          10.51
SMT + Transformer + Google + Bing    13.09                12.66                10.16          9.29

We also observe in Table 1 that increasing the number of MT systems does not improve accuracy: the BLEU scores of SMT, Transformer, Google and Bing together are about 2 points lower than those of SMT, Google and Bing. The BLEU scores achieved by using SMT, Bing Microsoft Translator and Google Translator together are the highest, and the scores of Transformer, Google and Bing are better than using all translation models together, with BLEU improved by about 1 point. The scores obtained on the TDIL-DC and ILCC datasets are lower than on HindEnCorp because those datasets are much smaller. The overall accuracy of our translation output using the reverse language pair is improved by combining the better parts of the outputs. However, the BLEU scores do not improve much beyond this in our approach; the main reason is that errors made by the external machine translation systems are also reflected in the combination. Hence, by removing these errors, we will try to achieve better results in future.
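For reference, corpus-level BLEU scores of the kind reported in Table 1 can be computed, for example, with the sacrebleu package; the file names below are placeholders, as the paper does not specify the exact evaluation script used.

```python
# Small sketch of corpus-level BLEU evaluation (the paper's primary metric).
import sacrebleu

with open("system_output.hi", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]       # one translated sentence per line
with open("reference.hi", encoding="utf-8") as f:
    references = [line.strip() for line in f]       # matching reference sentences

# sacrebleu takes the hypothesis list and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")                   # reported on a 0-100 (percentage) scale
```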
5. CONCLUSION

In this paper, we investigated the approach of combining various translation outputs with statistical machine translation and the Transformer to improve the final translation. The proposed method increases the complexity of the overall system. Experiments on Hindi to English and English to Hindi show that incorporating different system outputs achieves better results than the individual systems. In future, we will extend this approach to the translation of other language pairs and to tasks such as text abstraction and sentence compression. We will also try to incorporate the BERT model into neural-based English to Hindi translation and will explore graph-based encoder-decoder translation methods.

REFERENCES

[1] D. Jiang, A Statistical Translation Approach by Network Model, in: Recent Developments in Intelligent Computing, Communication and Devices, Springer, Singapore, 2019, pp. 325-331.
[2] A. Toral, M. Wieling, A. Way, Post-editing Effort of a Novel with Statistical and Neural Machine Translation. Frontiers in Digital Humanities, 5.9 (2018). doi: 10.3389/fdigh.2018.00009.
[3] L. Zhou, W. Hu, J. Zhang, C. Zong, Neural System Combination for Machine Translation. arXiv preprint arXiv:1704.06393 (2017). doi: 10.18653/v1/P17-2060.
[4] W. He, Z. He, H. Wu, H. Wang, Improved Neural Machine Translation with SMT Features, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[5] S. Attri, T. Prasad, G. Ramakrishna, HiPHET: A Hybrid Approach to Translate Code Mixed Language (Hinglish) to Pure Languages (Hindi and English). Computer Science, 21.3 (2020). doi: 10.7494/csci.2020.21.3.3624.
[6] S. Nirenburg, R. Frederking, Toward Multi-Engine Machine Translation, in: Human Language Technology: Proceedings of a Workshop, Plainsboro, New Jersey, 1994.
[7] E. Matusov, N. Ueffing, H. Ney, Computing Consensus Translation for Multiple Machine Translation Systems Using Enhanced Hypothesis Alignment, in: 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006.
[8] J. González-Rubio, F. Casacuberta, Minimum Bayes' Risk Subsequence Combination for Machine Translation. Pattern Analysis and Applications (2015) 523-533. doi: 10.1007/s10044-014-0387-5.
[9] Y. Feng, Y. Liu, H. Mi, Q. Liu, Y. Lu, Lattice-Based System Combination for Statistical Machine Translation, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 1105-1113.
[10] W.Y. Ma, K. McKeown, System Combination for Machine Translation through Paraphrasing, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1053-1058. doi: 10.18653/v1/D15-1122.
[11] J. Zhu, M. Yang, S. Li, T. Zhao, Sentence-Level Paraphrasing for Machine Translation System Combination, in: International Conference of Pioneering Computer Scientists, Engineers and Educators, Springer, Singapore, 2016, pp. 612-620. doi: 10.1007/978-981-10-2053-7_54.
[12] K. Heafield, A. Lavie, Combining Machine Translation Output with Open Source: The Carnegie Mellon Multi-Engine Machine Translation Scheme. The Prague Bulletin of Mathematical Linguistics, 2010, pp. 27-36. doi: 10.2478/v10108-010-0008-4.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is All You Need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[14] A. Currey, K. Heafield, Incorporating Source Syntax into Transformer-Based Neural Machine Translation, in: Proceedings of the Fourth Conference on Machine Translation, vol. 1, 2019, pp. 24-33.
[15] P. Koehn, Statistical Machine Translation. Cambridge University Press, 2009.
[16] Z. Wu, M. Palmer, Verb Semantics and Lexical Selection. arXiv preprint cmp-lg/9406033 (1994). doi: 10.3115/981732.981751.
[17] S. Banerjee, A. Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65-72.
[18] T. Brants, A.C. Popat, P. Xu, F.J. Och, J. Dean, Large Language Models in Machine Translation, 2007.
[19] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From Word Embeddings to Document Distances, in: International Conference on Machine Learning, 2015, pp. 957-966.
[20] O. Zaidan, Z-MERT: A Fully Configurable Open Source Tool for Minimum Error Rate Training of Machine Translation Systems. The Prague Bulletin of Mathematical Linguistics, 91.1 (2009) 79-88. doi: 10.2478/v10108-009-0018-2.
[21] O. Bojar, V. Diatka, P. Rychlý, P. Straňák, V. Suchomel, A. Tamchyna, D. Zeman, HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation, in: Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014, pp. 3550-3555.
[22] R. Kneser, H. Ney, Improved Backing-off for m-gram Language Modeling, in: 1995 International Conference on Acoustics, Speech, and Signal Processing, vol. 1, IEEE, 1995, pp. 181-184. doi: 10.1109/ICASSP.1995.479394.
[23] M. Federico, N. Bertoldi, M. Cettolo, IRSTLM: An Open Source Toolkit for Handling Large Scale Language Models, in: Ninth Annual Conference of the International Speech Communication Association, 2008.
[24] H. Hoang, P. Koehn, Design of the Moses Decoder for Statistical Machine Translation, in: Proceedings of Software Engineering, Testing, and Quality Assurance for Natural Language Processing, 2008, pp. 58-65.
[25] F.J. Och, H. Ney, A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29.1 (2003) 19-51.
[26] D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2015).
[27] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: A Method for Automatic Evaluation of Machine Translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311-318. doi: 10.3115/1073083.1073135.