Incorporating Distinct Translation System Outputs into Statistical and Transformer Model

Mani Bansal, D.K. Lobiyal
Jawaharlal Nehru University, Hauz Khas South, New Delhi, 110067, India

Abstract
Finding the correct translation of an input sentence in Machine Translation is not an easy task in Natural Language Processing (NLP). Hybridization of different translation models has been found to handle this problem effectively. This paper presents an approach that takes advantage of various translation models by combining their outputs with statistical machine translation (SMT) and a Transformer method. First, we obtain Google Translator and Bing Microsoft Translator outputs as external system outputs. Then, the outputs of those models are fed into the SMT and Transformer systems. Finally, the combined output is generated by analyzing the Google Translator, Bing, SMT, and Transformer outputs. Prior work has used system combination, but no existing approach tries to combine the statistical and Transformer systems with other translation systems. Experimental results on the English-Hindi and Hindi-English language pairs show significant improvement.

Keywords
Machine Translation, Transformer, Statistical Machine Translation, Google Translator, Bing Microsoft Translator, BLEU.

ISIC'21: International Semantic Intelligence Conference, February 25-27, 2021, New Delhi, India
✉ manibansal1991@gmail.com (M. Bansal); lobiyal@gmail.com (D.K. Lobiyal)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. INTRODUCTION

Machine Translation is a major area of Natural Language Processing. There are various translation approaches, each with its own pros and cons. One established approach to Machine Translation (MT) is Statistical Machine Translation (SMT). The statistical system [1] is structured for adequacy and for handling out-of-vocabulary words. Neural Machine Translation (NMT) is a breakthrough that reduces post-editing effort [2] and helps in dealing with the syntactic structure of a sentence; NMT [3] outputs more fluent translations. Therefore, we build a hybrid system that combines Statistical and Transformer (NMT with multi-head self-attention architecture) outputs to refine the machine translation output.

Combining these approaches into one system is not an easy task. Using either SMT or the Transformer alone does not solve all issues. NMT over-translates and under-translates to some extent; long-distance dependencies, phrase repetitions, translation adequacy for rare words, and word alignment problems are also observed in neural systems. SMT [4] handles long-term dependency issues but is unable to integrate additional information from the source text.

Such additional information in the source text helps to disambiguate word senses and resolve named-entity problems. The proposed architecture is evaluated on English-Hindi and Hindi-English datasets: the outputs of Bing Microsoft Translator and Google Translator are given as input to the Statistical and Transformer models to analyze the improvement in the combined target output. If the external translator outputs are obtained for the English-to-Hindi (Eng-Hi) language pair, then the Statistical and Transformer systems take the reverse language pair, i.e. Hindi-to-English (Hi-Eng), as input. Therefore, the output of the external translators can easily be merged with the input of the other two systems, i.e. the Statistical and Transformer models.

The paper is organized as follows: Section 2 gives a brief introduction to hybrid approaches proposed for Machine Translation. Section 3 elaborates our proposed approach. The experiments undertaken in this study, along with the results obtained, are discussed in Section 4. In Section 5, the conclusion is presented.

2. RELATED WORK

Many methods have been presented in the literature for machine translation. Researchers have combined different translation techniques [5] to improve translation quality. We have identified that most related studies take SMT as the baseline; very few studies in the literature show combination with NMT.

The example-based, knowledge-based, and lexical-transfer systems were combined using a chart manager in [6], and the best group of edges was selected with the help of a chart-walk algorithm. The authors of [7] computed a consensus translation by voting on a confusion network; they produced word alignments of the original machine translation hypotheses in pairs to build the confusion network.

The Minimum Bayes' risk system combination (MBRSC) method [8] gathers the benefits of two methods, combination of sub-sequences and selection of sentences, and uses the best subsequences to generate the best translation. The lattice-based system combination model [9] allows for phrase alignments and uses a lattice to encode all candidate translations; whereas the earlier proposed confusion network processed word-level translation, the lattice expresses n-to-n mappings for phrase-based translation. In the hybrid architecture of [10], every target hypothesis was paraphrased using various approaches to obtain fused translations for each target, and a final selection was made among all fused translations.

Multi-engine machine translation amalgamates the outputs of several MT systems into a single corrected translation [11]. It consists of a search space, a beam search decoder with its features, and many accessories. Since NMT decoding lacks a mechanism to guarantee that all source words are translated and favors short translations, the authors of [12] incorporate an SMT translation model and an n-gram language model under a log-linear framework.

3. PROPOSED APPROACH

3.1. BASIC TRANSFORMER MACHINE TRANSLATION

The Transformer model [13] accepts a source language sentence X = (x1, x2, ..., xN) as input and outputs a target language sentence Y = (y1, y2, ..., yM). The NMT system constructs a neural network that translates X into Y by learning the objective function p(Y|X) from a parallel corpus. The Transformer is an encoder-decoder model in which the encoder generates the intermediate representation ht (t = 1, 2, ..., N) from the source sentence X and the decoder generates the target sentence Y from the intermediate representation ht:

    ht = TransformerEncoder(X)     (1)
    Y = TransformerDecoder(ht)     (2)

The encoder and decoder are each made up of a stack of six layers. Each encoder layer consists of a multi-head self-attention sub-layer and a feed-forward sub-layer, whereas each decoder layer consists of three sub-layers: apart from the two sub-layers of the encoder, the decoder embeds a cross-lingual multi-head attention layer over the encoder stack output.

Both the self-attention mechanism and the cross-lingual attention are computed as follows:

    Attention(Q, K, V) = softmax(QK^T / √dk) V     (3)

Here, Q represents the Query vectors, and K and V represent the Key and Value vectors of the encoder and decoder, respectively; dk is the dimension of the key vectors. The product of Query and Key represents the similarity between each element of Q and K, and it is converted to a probability by the softmax function, which can be treated as the weight of the attention of Q on K.

The self-attention captures the degree of association between words in the input sentence by using the Q, K, and V of the encoder. In a similar way, the self-attention in the decoder captures the degree of association between words of the output sentence by using the Query, Key, and Value of the decoder. The cross-lingual attention mechanism computes the degree of association between a word in the source and the target language sentence by using the Query of the decoder, and the output of the last layer of the encoder as Key and Value. In multi-head self-attention with h heads, Q, K, and V are linearly projected into h subspaces, and the attention function is applied in parallel on each subspace. The concatenation of these heads is projected back to a space of the original dimension:

    MultiHead(Q, K, V) = Concat(head1, ..., headh) W^O     (4)

    headi = Attention(Q Wi^Q, K Wi^K, V Wi^V)     (5)

where Wi^Q ∈ R^(dmodel×dk), Wi^K ∈ R^(dmodel×dk), Wi^V ∈ R^(dmodel×dv), and W^O ∈ R^(h·dv×dmodel) are weight matrices, and Concat is a function that concatenates matrices. Multi-head attention learns information from different representation subspaces at different positions. The Transformer uses position encoding (PE) to encode the position-related information of each word in a sentence, because the Transformer does not have any recurrent or convolutional structure. PE is calculated as follows:

    PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))     (6)

    PE(pos, 2i + 1) = cos(pos / 10000^(2i/dmodel))     (7)

where i is the dimension index and pos is the absolute position of the word.

Fig 1. Translation System Architecture
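
To make the attention equations concrete, the following is a minimal NumPy sketch of equations (3)-(7): scaled dot-product attention, the multi-head projection, and the sinusoidal position encoding. Shapes and names are illustrative only (dmodel is assumed even), not the paper's implementation.

    import numpy as np

    def attention(Q, K, V):                            # eq. (3)
        d_k = K.shape[-1]
        scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)          # softmax over the keys
        return w @ V

    def multi_head(Q, K, V, Wq, Wk, Wv, Wo):           # eqs. (4)-(5)
        # Wq, Wk, Wv are lists of per-head projection matrices.
        heads = [attention(Q @ wq, K @ wk, V @ wv)
                 for wq, wk, wv in zip(Wq, Wk, Wv)]
        return np.concatenate(heads, axis=-1) @ Wo     # project back to dmodel

    def position_encoding(max_len, d_model):           # eqs. (6)-(7)
        pos = np.arange(max_len)[:, None]              # absolute positions
        two_i = np.arange(0, d_model, 2)[None, :]      # even dimensions 2i
        angles = pos / 10000 ** (two_i / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe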

3.2. COMBINING MACHINE TRANSLATION OUTPUT WITH TRANSFORMER MODEL

The Transformer system takes as input the translated outputs of the external translation systems together with the source sentence itself, as shown in Fig 1. We therefore use three encoders: one for the Google output text, one for the Bing output, and one for the source language sentence. The concatenation generated by the three encoders is fed into a conventional Transformer decoder.

The Google output text encoder takes X1 = (x11, x12, ..., x1N) as input, the Bing output sentence is represented as X2 = (x21, x22, ..., x2N), and the third encoder accepts the source language sentence X3 = (x31, x32, ..., x3N). The Transformer encoders generate the intermediate representations hg, hb, and ht from X1, X2, and X3, respectively:

    hg = TransformerEncoder(X1)     (8)
    hb = TransformerEncoder(X2)     (9)
    ht = TransformerEncoder(X3)     (10)

These intermediate representations are then concatenated in the composition layer [14]. The concatenation h is the output of the proposed encoder and is fed to the decoder of our model:

    h = Concat(hg, hb, ht)     (11)

The decoder generates the combined target language output from this representation.

3.3. BASIC STATISTICAL MACHINE TRANSLATION

The phrase translation method, based on Bayes' rule, forms the basis of statistical translation [15]. The best translation output sentence ebest is formulated as follows:

    ebest = argmax_e P(e|f) = argmax_e [P(f|e) PLM(e)]     (12)

where f is the source sentence and e is the target sentence. PLM(e) and P(f|e) are the language model (LM) and the translation model (TM), respectively. The input text f is partitioned uniformly into a sequence of T phrases f̄1..f̄T. Each foreign phrase f̄i is translated into an English phrase ēi. The translation model P(f|e) is decomposed into:

    P(f̄1..f̄T | ē1..ēT) = ∏(i=1..T) φ(f̄i | ēi) d(αi − βi−1)     (13)

where the phrase translation is modelled by the probability distribution φ(f̄i | ēi), and the relative distortion probability distribution d(αi − βi−1) governs the reordering of the output phrases.
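
As a toy illustration of the noisy-channel decision rule (eq. 12) and the phrase-based decomposition (eq. 13), the sketch below scores one candidate segmentation. The phrase table, distortion parameter, and language model stub are invented for the example; a real system estimates them from parallel data.

    # Toy phrase table phi(f_i | e_i); the probabilities are invented.
    PHRASE_TABLE = {("mera ghar", "my house"): 0.7,
                    ("mera ghar", "my home"): 0.3}

    def lm_prob(e_words):
        return 0.01 ** len(e_words)                    # stand-in for P_LM(e)

    def model_score(pairs, starts, ends, alpha=0.8):
        """P(f|e) * P_LM(e) for one segmentation (eqs. 12-13)."""
        score, prev_end = 1.0, 0
        for (f, e), a, b in zip(pairs, starts, ends):
            score *= PHRASE_TABLE.get((f, e), 1e-9)    # phi(f_i | e_i)
            score *= alpha ** abs(a - prev_end - 1)    # d(alpha_i - beta_{i-1})
            prev_end = b
        e_words = [w for _, e in pairs for w in e.split()]
        return score * lm_prob(e_words)                # times P_LM(e), eq. (12)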

3.4. COMBINING MACHINE TRANSLATION OUTPUT WITH STATISTICAL SYSTEM

The statistical combination approach uses three modules: alignment, decoding, and scoring. The alignment module produces string alignments of the hypotheses generated by the different machine translation systems. The decoding step builds hypotheses from the aligned strings of the previous step using a beam search algorithm. The final scoring step estimates the final hypotheses.

3.4.1. ALIGNMENT

The single best outputs d1, d2, ..., dm from each of the m participating systems are taken into consideration. We take sentence pairs di and dj, and strings between the sentence pairs are aligned. For m sentences, m(m−1)/2 possible sentence pairs need to be aligned. A string w1 in sentence di is aligned to a string w2 in sentence dj under two conditions: first, w1 and w2 are identical; second, w1 and w2 have a Wu and Palmer [16] similarity score > δ. METEOR [17] is used for the sentence alignment configuration.
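
The second alignment condition can be checked with WordNet's Wu-Palmer measure. A sketch assuming NLTK (with the wordnet corpus installed); the threshold δ is left as a tunable parameter, since its value is not fixed here.

    from nltk.corpus import wordnet as wn

    def alignable(w1, w2, delta=0.8):
        if w1 == w2:                       # condition 1: identical strings
            return True
        syn1, syn2 = wn.synsets(w1), wn.synsets(w2)
        if not syn1 or not syn2:
            return False
        best = max((a.wup_similarity(b) or 0.0)
                   for a in syn1 for b in syn2)
        return best > delta                # condition 2: Wu-Palmer score > delta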

3.4.2. DECODING

In decoding, the best outputs of the participating systems are combined to form a set of hypotheses. The first word of a sentence is used to start a hypothesis. At any moment, the search can either shift to a different sentence or continue adding new words from the current sentence. A word w added to the hypothesis is taken from the best output di, or the search shifts to a different output dj; on shifting, the first left-over word from that output sentence is added to the next hypothesis. With this method, a hypothesis can be built from various system outputs. If a hypothesis contains at most a single word from each set of aligned words, there is little possibility of duplication.

The search space switches easily between sentences, and thus maintaining adequacy and fluency is difficult. Therefore, hypothesis length, language model probability, and the number of n-gram matches between each individual system's output and the hypothesis are used as features for a complete hypothesis.
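
A simplified sketch of this hypothesis-building step: each hypothesis tracks how far it has consumed each system output, and it can be extended with the next unread word of any output, which covers both "continue" and "shift". The data structures are our simplification of the method described above.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Hypothesis:
        words: tuple       # target words chosen so far
        positions: tuple   # next unread position in each system output

    def extend(hyp, outputs):
        """Yield successors of hyp over the system outputs d_1, ..., d_m."""
        for j, out in enumerate(outputs):
            pos = hyp.positions[j]
            if pos < len(out):             # continue in d_j, or shift to it
                new_pos = hyp.positions[:j] + (pos + 1,) + hyp.positions[j + 1:]
                yield Hypothesis(hyp.words + (out[pos],), new_pos)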

3.4.3. BEAM SEARCH

In this search, hypotheses of equal length are assigned to a beam. Hypotheses are recombined according to feature state: the history of a hypothesis up to the length requested by the features, and search-space hashing. Pointers to the recombined hypotheses, which are packed into a single hypothesis, are maintained; this enables extraction of the k-best list.

3.4.4. SCORING

The decoding step outputs m-best lists. Language model probability [18] and Word Mover's Distance [19] are used to score the m-best list. The m-best list is represented as h1, h2, h3, ..., hp, and the score of each hi is calculated. The minimum error rate training (MERT) [20] method is used to calculate the feature weights.
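
The scoring step can be viewed as a weighted feature combination over the m-best list, sketched below. The feature set and weights are illustrative stand-ins: real weights would be tuned with MERT [20], and lm_logprob / wmd would come from the models of [18] and [19].

    def lm_logprob(sent):                  # stand-in for the LM of [18]
        return -2.0 * len(sent.split())

    def wmd(sent):                         # stand-in for Word Mover's Distance [19]
        return 1.0

    def features(hyp):
        return {"lm": lm_logprob(hyp),
                "length": float(len(hyp.split())),
                "wmd": -wmd(hyp)}

    def score(hyp, weights):
        return sum(weights[k] * v for k, v in features(hyp).items())

    weights = {"lm": 1.0, "length": 0.2, "wmd": 0.5}   # MERT-tuned in practice
    m_best = ["he goes to school", "he go to school"]
    best = max(m_best, key=lambda h: score(h, weights))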

4. EXPERIMENTATION AND RESULTS

The proposed approach is tested on the HindEnCorp [21] dataset, which contains 273,880 sentences. For the training and development sets, we use 272,880 sentences (267,880 for training + 5k for tuning) for the statistical system and the Transformer. The test set contains 1k sentences. The output of Google Translate (https://translate.google.com/) and Bing Microsoft Translator (https://www.bing.com/translate) is combined with SMT and the Transformer. Our proposed architecture is trained together with the outputs of the various translation systems.

We also trained and tested our approach on a dataset from ILCC (Institute for Language, Cognition and Computation) for English-to-Hindi, which contains 43,396 sentences. For Hindi-to-English translation, we used a TDIL-DC (Indian Language Technology Proliferation and Deployment Centre) dataset divided into tourism (25k sentences) and health (25k sentences) domains; the Hindi-to-English language pair was therefore trained and tested on 50k sentences.

We train the Statistical Machine Translation system with Kneser-Ney smoothing [22] for the probability distribution of a 4-gram language model (LM) built with IRSTLM [23]. The Moses decoder [24] finds the highest-scoring sentence for the phrase-based system. The model learns word alignments using GIZA++ [25] with the grow-diag-final heuristic.

The Transformer system (implemented with tensor2tensor, https://github.com/tensorflow/tensor2tensor) contains six encoder and six decoder layers, eight attention heads, and a feed-forward inner-layer size of 2048, with dropout = 0.1. The hidden state and word embedding dimension dmodel is 512. We limit the maximum sentence length to 256, and the input and output tokens per batch are limited to 2048. We used the Adam optimizer [26] with β1 = 0.90, β2 = 0.98, and ϵ = 10⁻⁹. Further, we used a length penalty α = 0.6 and beam search with a size of 4.
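
For reference, the reported Transformer settings collected in one place; the key names are ours, not tensor2tensor's flag names.

    transformer_config = {
        "encoder_layers": 6,
        "decoder_layers": 6,
        "attention_heads": 8,
        "ffn_inner_dim": 2048,
        "dropout": 0.1,
        "d_model": 512,               # hidden state / embedding dimension
        "max_sentence_length": 256,
        "tokens_per_batch": 2048,     # input and output tokens per batch
        "adam": {"beta1": 0.90, "beta2": 0.98, "epsilon": 1e-9},
        "length_penalty_alpha": 0.6,
        "beam_size": 4,
    }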

The Bilingual Evaluation Understudy (BLEU) [27] is selected as the primary evaluation metric. It evaluates the quality of machine-translated text against human translation: BLEU scores are calculated for translated sentences by comparing them with good-quality reference sentences, and the scores are normalized over the complete dataset to estimate overall quality. BLEU always produces a score between 0 and 1, but it is usually reported in percentage form; the closer the BLEU score is to 1, the better the accuracy of the system.
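
For example, a corpus-level BLEU score can be computed with the sacrebleu toolkit, one common implementation of [27]; the sentences below are toy data.

    import sacrebleu

    hypotheses = ["the cat sat on the mat"]
    references = [["the cat is sitting on the mat"]]   # one reference stream

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.2f}")   # reported on the 0-100 percentage scale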

The results on the different test sets are reported for the Hindi-to-English (Hi-Eng) and English-to-Hindi (Eng-Hi) language pairs. It is evident from Table 1 that translation system combination shows better results than any individual system; i.e., SMT with Google and Bing improved by approximately 4 BLEU points over SMT alone in all language pairs. The output of an individual system contains some erroneous or untranslated words, but the selection of the best phrase among the different translated outputs (Google and Bing) generated by the participating systems makes the target sentence more accurate.

Table 1:
Translation results (BLEU score) of the different methods on each dataset.

    Models                               HindEnCorp           ILCC     TDIL-DC
                                       Eng-Hi   Hi-Eng       Eng-Hi    Hi-Eng
    SMT                                 9.41     8.03         7.88      7.14
    Transformer                        12.52    11.89         8.36      9.33
    Google                             11.20    10.42         8.14      7.56
    Bing                               11.92    10.61         9.65      8.07
    SMT + Google + Bing                15.37    14.70        11.92     10.82
    Transformer + Google + Bing        14.36    13.85        11.04     10.51
    SMT + Transformer + Google + Bing  13.09    12.66        10.16      9.29

We also observe in Table 1 that increasing the number of MT systems does not help to improve accuracy; i.e., the BLEU scores of SMT, Transformer, Google, and Bing together are about 2 points lower than those of SMT, Google, and Bing. The BLEU scores achieved by using SMT, Bing Microsoft, and Google Translator together are the highest. The scores of Transformer, Google, and Bing are also better than using all translation models together, with BLEU scores higher by about 1 point. The scores obtained on the TDIL-DC and ILCC datasets are lower than on HindEnCorp because these datasets are much smaller. The overall accuracy of our translation output using the reverse language pair is improved by combining the better parts of the outputs. However, the BLEU scores do not improve greatly under our approach: the main reason is that errors made by the external machine translation systems are also reflected in the combination. Hence, by removing these errors, we will try to achieve better results in future.

5. CONCLUSION

In this paper, we investigated the approach of combining various translation outputs with statistical machine translation and the Transformer to improve the final translation. The proposed method increases the complexity of the overall system. Experiments on Hindi-to-English and English-to-Hindi show that incorporating different system outputs achieves better results than any individual system. In future, we will extend this approach to other language pairs and to tasks such as text abstraction and sentence compression. We will also try to incorporate the BERT model into neural-based English-to-Hindi translation and will explore graph-based encoder-decoder translation methods.

REFERENCES

[1] D. Jiang, A Statistical Translation Approach by Network Model, in: Recent Developments in Intelligent Computing, Communication and Devices, Springer, Singapore, 2019, pp. 325-331.
[2] A. Toral, M. Wieling, A. Way, Post-editing Effort of a Novel with Statistical and Neural Machine Translation. Frontiers in Digital Humanities, 5.9 (2018). doi:10.3389/fdigh.2018.00009.
[3] L. Zhou, W. Hu, J. Zhang, C. Zong, Neural System Combination for Machine Translation. arXiv preprint arXiv:1704.06393 (2017). doi:10.18653/v1/P17-2060.
[4] W. He, Z. He, H. Wu, H. Wang, Improved Neural Machine Translation with SMT Features, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[5] S. Attri, T. Prasad, G. Ramakrishna, HiPHET: A Hybrid Approach to Translate Code Mixed Language (Hinglish) to Pure Languages (Hindi and English). Computer Science, 21.3 (2020). doi:10.7494/csci.2020.21.3.3624.
[6] S. Nirenburg, R. Frederking, Toward Multi-Engine Machine Translation, in: Human Language Technology: Proceedings of a Workshop, Plainsboro, New Jersey, 1994.
[7] E. Matusov, N. Ueffing, Computing Consensus Translation for Multiple Machine Translation Systems using Enhanced Hypothesis Alignment, in: 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006.
[8] J. González-Rubio, C. Francisco, Minimum Bayes' Risk Subsequence Combination for Machine Translation. Pattern Analysis and Applications (2015) 523-533. doi:10.1007/s10044-014-0387-5.
[9] Y. Feng, Y. Liu, H. Mi, Q. Liu, Y. Lu, Lattice-Based System Combination for Statistical Machine Translation, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 1105-1113.
[10] W.Y. Ma, K. McKeown, System Combination for Machine Translation through Paraphrasing, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1053-1058. doi:10.18653/v1/D15-1122.
[11] J. Zhu, M. Yang, S. Li, T. Zhao, Sentence-Level Paraphrasing for Machine Translation System Combination, in: International Conference of Pioneering Computer Scientists, Engineers and Educators, Springer, Singapore, 2016, pp. 612-620. doi:10.1007/978-981-10-2053-7_54.
[12] K. Heafield, A. Lavie, Combining Machine Translation Output with Open Source: The Carnegie Mellon Multi-Engine Machine Translation Scheme. The Prague Bulletin of Mathematical Linguistics, 2010, pp. 27-36. doi:10.2478/v10108-010-0008-4.

[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is All You Need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[14] A. Currey, K. Heafield, Incorporating Source Syntax into Transformer-Based Neural Machine Translation, in: Proceedings of the Fourth Conference on Machine Translation, vol. 1, 2019, pp. 24-33.
[15] P. Koehn, Statistical Machine Translation. Cambridge University Press, 2009.
[16] Z. Wu, M. Palmer, Verb Semantics and Lexical Selection. arXiv preprint cmp-lg/9406033 (1994). doi:10.3115/981732.981751.
[17] S. Banerjee, A. Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65-72.
[18] T. Brants, A.C. Popat, P. Xu, F.J. Och, J. Dean, Large Language Models in Machine Translation, 2007.
[19] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From Word Embeddings to Document Distances, in: International Conference on Machine Learning, 2015, pp. 957-966.
[20] O. Zaidan, Z-MERT: A Fully Configurable Open Source Tool for Minimum Error Rate Training of Machine Translation Systems. The Prague Bulletin of Mathematical Linguistics, 91.1 (2009), 79-88. doi:10.2478/v10108-009-0018-2.
[21] O. Bojar, V. Diatka, P. Rychlý, P. Straňák, V. Suchomel, A. Tamchyna, D. Zeman, HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation, in: Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014, pp. 3550-3555.
[22] R. Kneser, H. Ney, Improved Backing-off for m-gram Language Modeling, in: 1995 International Conference on Acoustics, Speech, and Signal Processing, IEEE, 1995, vol. 1, pp. 181-184. doi:10.1109/ICASSP.1995.479394.
[23] M. Federico, N. Bertoldi, M. Cettolo, IRSTLM: An Open Source Toolkit for Handling Large Scale Language Models, in: Ninth Annual Conference of the International Speech Communication Association, 2008.
[24] H. Hoang, P. Koehn, Design of the Moses Decoder for Statistical Machine Translation, in: Proceedings of Software Engineering, Testing, and Quality Assurance for Natural Language Processing, 2008, pp. 58-65.
[25] F.J. Och, H. Ney, A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29.1 (2003), 19-51.
[26] D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2015).
[27] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: A Method for Automatic Evaluation of Machine Translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311-318. doi:10.3115/1073083.1073135.