
COMP90042 LECTURE 22

MT: PHRASE BASED &
NEURAL ENCODER-DECODER
Copyright 2018 The University of Melbourne
OVERVIEW
‣ Phrase based SMT
    ‣ Scoring formula
    ‣ Decoding algorithm

‣ Neural network ‘encoder-decoder’

WORD- AND PHRASE-BASED MT
‣ Seen word based models of translation
    ‣ now used for alignment, but not actual translation
    ‣ overly simplistic formulation

‣ Phrase based MT
    ‣ treats n-grams as translation units, referred to as
      ‘phrases’ (not linguistic phrases though)

[Figure: phrase-based MT — the input is segmented into phrases, translated one-to-one, and possibly reordered. Fig from Koehn09]
PHRASE VS WORD BASED MT
‣ Phrase-pairs memorise:
    ‣ common translation fragments (have access to local
      context in choosing lexical translation)
    ‣ common reordering patterns (making up for naïve models
      of reordering)

[Figure: example phrase pairs — “did not slap” ↔ “no dio una bofetada”; “the green witch” ↔ “la bruja verde”]
FINDING & SCORING PHRASE PAIRS
‣ “Extract” phrase pairs as contiguous chunks in word
  aligned text; then
‣ compute counts over the whole corpus
‣ normalise counts to produce ‘probabilities’
‣ E.g.,
      p(will stay in the house | im haus bleibt)
         = c(will stay in the house; im haus bleibt) / c(im haus bleibt)

[Figure: extracting a phrase pair from a word alignment between “michael geht davon aus , dass er im haus bleibt” and “michael assumes that he will stay in the house” — the English phrase “assumes that” and the German phrase “geht davon aus , dass” are aligned because their words are aligned only to each other. Fig from Koehn09]
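The counting and normalising steps above can be sketched in a few lines of Python. The phrase pairs and counts here are toy data invented for illustration; in practice `extracted` would come from running phrase extraction over the whole word-aligned corpus.

```python
from collections import Counter

# Toy list of extracted (English, German) phrase pairs.
extracted = [
    ("will stay in the house", "im haus bleibt"),
    ("will stay in the house", "im haus bleibt"),
    ("stays in the house", "im haus bleibt"),
    ("the house", "im haus"),
]

pair_counts = Counter(extracted)              # c(e, f)
de_counts = Counter(f for _, f in extracted)  # c(f)

def phrase_prob(en, de):
    """p(e | f) = c(e, f) / c(f), as on the slide."""
    return pair_counts[(en, de)] / de_counts[de]
```

For example, `phrase_prob("will stay in the house", "im haus bleibt")` gives 2/3 on this toy corpus, since the German phrase is seen three times, twice with that English translation.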
THE PHRASE-TABLE
‣ The phrase-table consists of all phrase-pairs and
  their scores, which forms the search space for
  decoding
‣ E.g., for natuerlich it may contain the following
  translations

                      Translation       Probability p(e|f)
                        of course             0.5
                        naturally             0.3
                       of course ,            0.15
                       , of course ,          0.05

‣ generally a massive list with many millions of phrase-pairs
DECODING
            E*, A* = argmax_{E,A} score(E, A, F)
    ‣ A describes the segmentation of F into phrases;
      and the re-ordering of their translations to produce E

‣ The score function is a product of the
    ‣ translation “probability”, P(F|E), split into phrase-pairs
    ‣ language model probability, P(E), over full sentence E
    ‣ distortion cost, d(start_i, end_{i-1}), measuring amount of
      reordering between adjacent phrase-pairs

‣ Search problem
    ‣ find translation E* with the best overall score
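A minimal sketch of this scoring function in Python, working in log space. The phrase probabilities and the stand-in language model below are toy values made up for illustration, and the distortion penalty is a simple exponential decay in the jump distance (one common choice, not the only one).

```python
import math

# Toy phrase-translation probabilities (invented for illustration).
phrase_prob = {("er", "he"): 0.7, ("geht", "goes"): 0.6,
               ("ja nicht", "does not"): 0.5, ("nach hause", "home"): 0.8}

def lm_logprob(words):
    # Stand-in for a real n-gram language model score.
    return -0.5 * len(words)

def derivation_score(derivation, alpha=0.9):
    """Log score combining the three slide components: phrase
    translation probability, distortion cost, and LM probability.
    derivation = [(src_start, src_end, src_phrase, tgt_phrase), ...]"""
    score, prev_end, target = 0.0, 0, []
    for start, end, src, tgt in derivation:
        score += math.log(phrase_prob[(src, tgt)])            # translation
        score += abs(start - prev_end - 1) * math.log(alpha)  # distortion
        prev_end = end
        target += tgt.split()
    return score + lm_logprob(target)                         # language model
```

With this scoring, a monotone derivation scores higher than one using the same phrase-pairs in a jumbled order, since every deviation from left-to-right coverage pays a distortion penalty.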
TRANSLATION PROCESS
‣ Score the translations based on translation
  probabilities (step 2), reordering (step 3) and
  language model scores (steps 2 & 3).
                          er       geht      ja       nicht        nach       hause

    1: segment            er        geht          ja nicht            nach hause

    2: translate         he          go        does not                   home

    3: order             he         does not            go                home

Figure from Koehn, 2009
SEARCH PROBLEM

[Figure: translation options for “er geht ja nicht nach hause” — each source word or span has many candidate translations, e.g. er → he / it / , he; geht → is / are / goes / go; ja nicht → not / is not / does not; nach hause → home / at home / return home. Figure from Koehn, 2009]

‣ Cover all source words exactly once; visited in any order; and
  with any segmentation into “phrases”
‣ Choose a translation from phrase-table options
‣ Leads to millions of possible translations…
DYNAMIC PROGRAMMING SOLUTION
‣ Akin to Viterbi algorithm
    ‣ factor out repeated computation
      (like Viterbi for HMMs, “chart” used in parsing)
    ‣ efficiently solve the maximisation problem

‣ Aim is to translate every word of the input once
    ‣ searching over every segmentation into phrases;
    ‣ the translations of each phrase; and
    ‣ all possible ordering of the phrases

PHRASE-BASED DECODING
         er             geht             ja       nicht           nach    hause

                                                          Start with empty state

Figure from Koehn, 2009
PHRASE-BASED DECODING
       er             geht              ja      nicht           nach   hause

                            are
                                                        Expand by choosing
                                                        input span and
                                                        generating translation
Figure from Koehn, 2009
PHRASE-BASED DECODING
        er             geht              ja       nicht           nach   hause

                             he

                             are
                                                          Consider all possible
                              it                          options to start the
                                                          translation
Figure from Koehn, 2009
PHRASE-BASED DECODING
          er            geht                 ja              nicht          nach     hause

                                                   Continue to expand states, visiting
                                                   uncovered words. Generating
                                                   outputs left to right.

[Figure: hypothesis lattice — states expand through “he / are / it”, then “yes / goes / does not”, “go / to”, and finally “home”. Figure from Koehn, 2009]
PHRASE-BASED DECODING
       er              geht              ja              nicht          nach     hause

Read off translation from best complete derivation by backtracking

[Figure: completed hypothesis lattice with the best path marked. Figure from Koehn, 2009]
REPRESENTING TRANSLATION STATE
‣ Need to record
    ‣ translation of phrase
    ‣ which words are translated in bit-vector
    ‣ last n-1 words in E, so that an n-gram LM can compute the
      probability of subsequent words
    ‣ end position of the last phrase translated in the source,
      for scoring distortion in next step

‣ Together these allow the score computation to be
  factorised
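The state on this slide can be captured in a small Python class. This is a sketch with illustrative names, not code from any particular toolkit; it fixes a trigram LM (n = 3) so only the last two target words are kept.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    """Decoder state, with exactly the fields listed on the slide."""
    coverage: int          # bit-vector: which source words are translated
    lm_context: tuple      # last n-1 target words, for the n-gram LM
    last_end: int          # source end position of the last phrase
    score: float           # accumulated log score

    def covers(self, i):
        return bool(self.coverage >> i & 1)

    def extend(self, start, end, words, delta, n=3):
        """Translate source span [start, end], emitting `words` with
        score increment `delta`."""
        cov = self.coverage
        for i in range(start, end + 1):
            cov |= 1 << i
        return Hypothesis(cov, (self.lm_context + words)[-(n - 1):],
                          end, self.score + delta)
```

Because two hypotheses with the same coverage, LM context, and last end position can never be distinguished by any future expansion, the best-scoring one can safely absorb the other — this is exactly the factorisation the slide refers to.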
COMPLEXITY
‣ Full search is intractable
    ‣ word-based and phrase-based decoding is NP-complete
      — arises from arbitrary reordering

‣ A solution is to prune the search space
    ‣ use beam search, a form of approximate search
    ‣ maintaining no more than k options (“hypotheses”)
    ‣ pruning over translations that cover a given number of
      input words

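The pruning step on this slide can be sketched as follows: group hypotheses into stacks by the number of source words they cover, and keep at most k per stack (histogram pruning). Hypotheses are represented here as bare (coverage bit-vector, log score) pairs for illustration.

```python
import heapq

def prune(hypotheses, k):
    """Keep at most the k best hypotheses per count of covered
    source words. hypotheses: list of (coverage, log_score)."""
    stacks = {}
    for cov, score in hypotheses:
        # Number of covered words = number of set bits in the coverage.
        stacks.setdefault(bin(cov).count("1"), []).append((cov, score))
    return {n: heapq.nlargest(k, hs, key=lambda h: h[1])
            for n, hs in stacks.items()}
```

Grouping by covered-word count before pruning matters: a hypothesis covering two words usually scores worse than one covering a single word, so comparing them in a single beam would unfairly discard the longer one.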
PHRASE-BASED MT SUMMARY
‣ Start with sentence-aligned parallel text
    1. learn word alignments
    2. extract phrase-pairs from word alignments &
       normalise counts
    3. learn a language model

‣ Now decode test sentences using
  beam-search (where 2 & 3 above form part of
  scoring function)

NEURAL MACHINE TRANSLATION
‣ Phrase-based approach is rather complicated!
‣ Neural approach poses question:
    ‣ Can we throw away all this complexity, instead learn a
      single model to directly translate from source to target?

‣ Using deep learning with neural networks
    ‣ learn robust representations of words and sentences
    ‣ attempts to generate words in the target given “deep”
      (vector/matrix) representation of the source

ENCODER-DECODER MODELS
‣ So-called “sequence2sequence” models combine:
    ‣ encoder which represents the source sentence as a
      vector or matrix of real values
        ‣ akin to word2vec’s method for learning word vectors

    ‣ decoder which predicts the word sequence in the target
        ‣ framed as a language model, albeit conditioned on the encoder
          representation

Copyright 2018 The University of Melbourne
RECURRENT NEURAL NETWORKS (RNNS)

[Figure: an RNN reads inputs x1 … x4 after a START symbol; the final hidden state is the sentence vector c]

What is a vector representation of a sequence?
Slide credit: Duh, Dyer et al. 2015
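A minimal pure-Python sketch of this idea, with random toy parameters (real systems learn embeddings and weights by gradient descent, and use far larger hidden sizes): a vanilla RNN updates h_t = tanh(W h_{t-1} + U x_t), and the final h serves as the sentence vector c.

```python
import math
import random

random.seed(0)
d = 4  # toy hidden size
vocab = ["Aller", "Anfang", "ist", "schwer", "STOP"]
# Random toy parameters: word embeddings plus recurrent (W) and input (U) weights.
emb = {w: [random.gauss(0, 0.5) for _ in range(d)] for w in vocab}
W = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]
U = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]

def encode(words):
    """Vanilla RNN: h_t = tanh(W h_{t-1} + U x_t); return final h as c."""
    h = [0.0] * d
    for w in words:
        x = emb[w]
        h = [math.tanh(sum(W[i][j] * h[j] + U[i][j] * x[j]
                           for j in range(d))) for i in range(d)]
    return h

c = encode(["Aller", "Anfang", "ist", "schwer", "STOP"])
```

Note that, unlike a bag of word vectors, the RNN is order-sensitive: reordering the input words yields a different c.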
RNN ENCODER-DECODERS

[Figure: encoder RNN reads “Aller Anfang ist schwer STOP” and produces the vector c]

What is the probability of a sequence?
Slide credit: Duh, Dyer et al. 2015
RNN ENCODER-DECODERS

[Figure: the decoder RNN, conditioned on c, starts from START and generates “Beginnings are difficult STOP”]

What is the probability of a sequence?
Slide credit: Duh, Dyer et al. 2015
RNN ATTENTION MODEL

[Figure, animated across five slides: the decoder attends over the encoder states of “Aller Anfang ist schwer STOP” while generating “Beginnings”, “are”, “difficult” and finally “STOP”]

What is the probability of a sequence?
Slide credit: Duh, Dyer et al. 2015
APPLICATIONS OF SEQ2SEQ
‣ Machine translation
‣ Summarisation (document as input)
‣ Speech recognition & speech synthesis
‣ Image captioning & image generation
‣ Word morphology (over characters)
    ‣ e.g., study → student; receive → recipient;
            play → player; pay → payer/payee

‣ Generating source code from text & more….
EVALUATION: DID IT WORK?
‣ Given input in Persian
 ,‫ هنر امپرسیونیسم‬,‫ رقص باله‬,‫ تلویزیون‬,‫ملبورن مهد و مرکز پیدایش صنعت فیملسازی و سیمنا‬
 ‫سبکهای مختلف رقص مثل نیو وگ و ملبورن شافل در استرالیا و مرکز مهم موزﯾﮏ کالسﯾﮏ و امروزی در‬
  .‫این کشوراست‬

‣ Google translate outputs the English
 Melbourne cradle and center of origin of the film industry and cinema, television,
 ballet, art, impressionism, various dance styles such as New Vogue and the
 Melbourne Shuffle in Australia and an important center of classical and
 contemporary music in this country.
‣ Ask a bilingual speaker to judge? Ask them to rate two components
    ‣   fluency: follows grammar of English, and semantically coherent

    ‣   adequacy: contains the same information as the original source document

    ‣   or edit the sentence until it is adequate, and measure #changes, time spent, etc.
REUSABLE EVALUATION
‣ What if we have one (or several) good translations,
  e.g.
                        Referred to as Australia's “cultural capital” it
                        is the birthplace of Australian
                        impressionism, Australian rules football, the
                        Australian film and television industries, and
                        Australian contemporary dance such as the
                        Melbourne Shuffle. It is recognised as a
                        UNESCO City of Literature and a major
                        centre for street art, music and theatre.

‣ We can use this text to evaluate many different MT
  system outputs for the same input
AUTOMATIC EVALUATION
‣ How many words are shared between the output:
       Melbourne cradle and center of origin of the film industry and cinema,
       television, ballet, art, impressionism, various dance styles such as New
       Vogue and the Melbourne Shuffle in Australia and an important center of
       classical and contemporary music in this country.

‣ And the reference:
       Referred to as Australia’s “cultural capital” it is the birthplace of Australian
       impressionism, Australian rules football, the Australian film and television
       industries, and Australian contemporary dance such as the Melbourne
       Shuffle. It is recognised as a UNESCO City of Literature and a major centre for
       street art, music and theatre.

MT EVALUATION: BLEU
‣ BLEU measures closeness of translation to one or
    more references
    ‣ defined as:
          BLEU = bp ⨉ (prec1-gram ⨉ prec2-gram ⨉ prec3-gram ⨉ prec4-gram)^(1/4)
    ‣ geometric mean of 1, 2, 3 & 4-gram precisions
        ‣ precn-gram = num n-grams correct / num n-grams predicted in output
        ‣ numerator clipped to #occurrences of each n-gram in the reference
    ‣ and a brevity penalty to hedge against short outputs
        ‣ bp = min ( 1, output length / reference length )

‣ Correlates with human judgements of fluency &
    adequacy
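The slide's definition can be implemented directly for a single sentence pair. This is a sketch of sentence-level BLEU exactly as defined above (real BLEU is usually computed at corpus level, often with smoothing for zero n-gram counts):

```python
import math
from collections import Counter

def bleu(output, reference, max_n=4):
    """Brevity penalty times the geometric mean of clipped
    1..max_n-gram precisions, per the slide's definition."""
    out, ref = output.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        out_ngrams = Counter(tuple(out[i:i + n]) for i in range(len(out) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each n-gram's count to its count in the reference.
        correct = sum(min(c, ref_ngrams[g]) for g, c in out_ngrams.items())
        if not correct:
            return 0.0  # any zero n-gram precision zeroes the product
        log_prec += math.log(correct / sum(out_ngrams.values())) / max_n
    bp = min(1.0, len(out) / len(ref))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0; dropping the final word of a six-word reference leaves all n-gram precisions at 1 but incurs a brevity penalty of 5/6, which is precisely why the penalty exists.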
SUMMARY
‣ Word vs phrase based MT
    ‣ Components of phrase-based approach
    ‣ Decoding algorithm

‣ Neural encoder-decoder
‣ Evaluation using BLEU
‣ Reading
    ‣ JM2 25.7 – 25.9
    ‣ Neural Machine Translation and Sequence-to-sequence
      Models: A Tutorial, Neubig 2017, Sections 7 & 8
      https://arxiv.org/abs/1703.01619