COMP90042 LECTURE 22
MT: PHRASE BASED & NEURAL ENCODER-DECODER
Copyright 2018 The University of Melbourne
2 OVERVIEW
‣ Phrase-based SMT
  ‣ Scoring formula
  ‣ Decoding algorithm
‣ Neural network ‘encoder-decoder’
3 WORD- AND PHRASE-BASED MT
‣ Seen word-based models of translation
  ‣ now used for alignment, but not actual translation
  ‣ overly simplistic formulation
‣ Phrase-based MT
  ‣ treats n-grams as translation units, referred to as ‘phrases’ (not linguistic phrases though)
[Figure from Koehn, 2009 (Figure 5.1): phrase-based machine translation, where the input is segmented into phrases, translated one-to-one into English phrases, and possibly reordered]
4 PHRASE VS WORD-BASED MT
‣ Phrase-pairs memorise:
  ‣ common translation fragments (have access to local context in choosing lexical translations)
  ‣ common reordering patterns (making up for naïve models of reordering)
[Figure: word-based vs phrase-based alignment of “no dio una bofetada la bruja verde” ↔ “did not slap the green witch”, where multi-word units such as “no dio una bofetada” ↔ “did not slap” are translated as phrases]
5 FINDING & SCORING PHRASE PAIRS
‣ “Extract” phrase pairs as contiguous chunks in word-aligned text; then
‣ compute counts over the whole corpus
‣ normalise counts to produce ‘probabilities’
‣ E.g., p(will stay in the house | im haus bleibt) = c(will stay in the house, im haus bleibt) / c(im haus bleibt)
[Figure from Koehn, 2009: extracting a phrase pair from the word alignment of “michael geht davon aus , dass er im haus bleibt” and “michael assumes that he will stay in the house”; e.g. the English phrase “assumes that” and the German phrase “geht davon aus , dass” form a pair because their words are aligned only to each other]
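A minimal sketch of the count-and-normalise step, assuming the extraction step has produced a list of (english_phrase, foreign_phrase) tuples; the function and variable names here are illustrative, not the lecture's code:

```python
from collections import Counter, defaultdict

def phrase_table(extracted_pairs):
    """Turn extracted phrase-pair tuples into p(e|f) estimates."""
    pair_counts = Counter(extracted_pairs)
    f_counts = Counter(f for _, f in extracted_pairs)
    table = defaultdict(dict)
    for (e, f), c in pair_counts.items():
        # p(e|f) = c(e, f) / c(f), as on the slide
        table[f][e] = c / f_counts[f]
    return table
```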
6 THE PHRASE-TABLE
‣ The phrase-table consists of all phrase-pairs and their scores, which forms the search space for decoding
‣ E.g., a phrase translation table may contain the following English translations for the German natuerlich:

  Translation      Probability p(e|f)
  of course        0.5
  naturally        0.3
  of course ,      0.15
  , of course ,    0.05

‣ generally a massive list with many millions of phrase-pairs
7 DECODING
E*, A* = argmax_{E,A} score(E, A, F)
‣ A describes the segmentation of F into phrases, and the re-ordering of their translations to produce E
‣ The score function is a product of the:
  ‣ translation “probability”, P(F|E), split into phrase-pairs
  ‣ language model probability, P(E), over the full sentence E
  ‣ distortion cost, d(start_i, end_{i-1}), measuring the amount of reordering between adjacent phrase-pairs
‣ Search problem
  ‣ find the translation E* with the best overall score
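A sketch of this factorised score for one derivation; the phrase-pair tuple layout, the callables `tp` (phrase translation probability) and `lm` (language model), and the exponential distortion penalty with base `alpha` are all assumptions for illustration:

```python
def derivation_score(phrase_pairs, tp, lm, alpha=0.9):
    """Score one derivation: list of (foreign, english, src_start, src_end)."""
    english = [w for _, e, _, _ in phrase_pairs for w in e]
    total = lm(english)                              # P(E) over the full sentence
    prev_end = -1
    for f, e, start, end in phrase_pairs:
        total *= tp(f, e)                            # translation "probability" per pair
        total *= alpha ** abs(start - prev_end - 1)  # distortion cost d(start_i, end_{i-1})
        prev_end = end
    return total
```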
8 TRANSLATION PROCESS
‣ Score the translations based on translation probabilities (step 2), reordering (step 3) and language model scores (steps 2 & 3).

  input:        er geht ja nicht nach hause
  1: segment    er | geht | ja nicht | nach hause
  2: translate  he | go | does not | home
  3: order      he does not go home

Figure from Koehn, 2009
9 SEARCH PROBLEM
[Figure from Koehn, 2009: the lattice of translation options for “er geht ja nicht nach hause”, e.g. er → he / it / , it; geht → goes / go / is; ja nicht → does not / is not; nach hause → home / to home / after house; …]
‣ Cover all source words exactly once; visited in any order; and with any segmentation into “phrases”
‣ Choose a translation from the phrase-table options
‣ Leads to millions of possible translations…
Figure from Koehn, 2009
10 DYNAMIC PROGRAMMING SOLUTION
‣ Akin to the Viterbi algorithm
  ‣ factor out repeated computation (like Viterbi for HMMs, or the “chart” used in parsing)
  ‣ efficiently solve the maximisation problem
‣ Aim is to translate every word of the input once
  ‣ searching over every segmentation into phrases;
  ‣ the translations of each phrase; and
  ‣ all possible orderings of the phrases
11 PHRASE-BASED DECODING
er geht ja nicht nach hause
‣ Start with the empty state
Figure from Koehn, 2009
12 PHRASE-BASED DECODING
er geht ja nicht nach hause
‣ Expand by choosing an input span and generating a translation (e.g., er → are)
Figure from Koehn, 2009
13 PHRASE-BASED DECODING
er geht ja nicht nach hause
‣ Consider all possible options to start the translation (e.g., er → he / are / it)
Figure from Koehn, 2009
14 PHRASE-BASED DECODING
er geht ja nicht nach hause
‣ Continue to expand states, visiting uncovered words, and generating outputs left to right
[Figure from Koehn, 2009: the search graph grows, with partial hypotheses such as yes / he goes home and are / does not go home]
15 PHRASE-BASED DECODING
er geht ja nicht nach hause
‣ Read off the translation from the best complete derivation by back-tracking
[Figure from Koehn, 2009: back-tracking the best path through the same search graph, giving “he does not go home”]
16 REPRESENTING TRANSLATION STATE
‣ Need to record:
  ‣ the translation of the phrase
  ‣ which source words have been translated, in a bit-vector
  ‣ the last n-1 words of E, so that an n-gram LM can compute the probability of subsequent words
  ‣ the end position of the last translated phrase in the source, for scoring distortion in the next step
‣ Together these allow the score computation to be factorised
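A concrete sketch of this state as a record; the field names are illustrative assumptions, but each field corresponds to one item in the list above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    translation: str    # target words generated so far
    coverage: int       # bit-vector over source words already translated
    lm_context: tuple   # last n-1 target words, for the n-gram LM
    last_end: int       # source end position of the last phrase, for distortion
    score: float        # accumulated partial score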
17 COMPLEXITY
‣ Full search is intractable
  ‣ word-based and phrase-based decoding is NP-complete, which arises from allowing arbitrary reordering
‣ A solution is to prune the search space
  ‣ use beam search, a form of approximate search
  ‣ maintaining no more than k options (“hypotheses”)
  ‣ pruning over translations that cover a given number of input words
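A sketch of that pruning step, reusing the Hypothesis record above: hypotheses are grouped by how many source words they cover, and each group keeps only the best k. A hypothetical helper, not the lecture's decoder:

```python
def prune(hypotheses, k=100):
    """Group hypotheses by #covered source words; keep the top k per group."""
    beams = {}
    for h in hypotheses:
        n_covered = bin(h.coverage).count("1")  # set bits in the coverage vector
        beams.setdefault(n_covered, []).append(h)
    return {n: sorted(beam, key=lambda h: h.score, reverse=True)[:k]
            for n, beam in beams.items()}
```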
20 PHRASE-BASED MT SUMMARY
‣ Start with sentence-aligned parallel text:
  1. learn word alignments
  2. extract phrase-pairs from word alignments & normalise counts
  3. learn a language model
‣ Then decode test sentences using beam search (where 2 & 3 above form part of the scoring function)
21 NEURAL MACHINE TRANSLATION
‣ The phrase-based approach is rather complicated!
‣ The neural approach poses the question:
  ‣ can we throw away all this complexity, and instead learn a single model that translates directly from source to target?
‣ Using deep learning with neural networks
  ‣ learn robust representations of words and sentences
  ‣ attempt to generate words in the target given a “deep” (vector/matrix) representation of the source
22 ENCODER-DECODER MODELS
‣ So-called “sequence2sequence” models combine:
  ‣ an encoder, which represents the source sentence as a vector or matrix of real values
    ‣ akin to word2vec’s method for learning word vectors
  ‣ a decoder, which predicts the word sequence in the target
    ‣ framed as a language model, albeit conditioned on the encoder representation
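A minimal PyTorch sketch of such an encoder-decoder, for orientation only: layer sizes, names and the choice of GRU are illustrative assumptions, not the lecture's exact model:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)  # scores over the target vocabulary

    def forward(self, src_ids, tgt_ids):
        # Encode the source into a single vector c (the final hidden state)
        _, c = self.encoder(self.src_embed(src_ids))
        # Decode: a language model over the target, conditioned on c
        hidden, _ = self.decoder(self.tgt_embed(tgt_ids), c)
        return self.out(hidden)  # logits at each target position
```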
RECURRENT NEURAL NETWORKS (RNNS)
[Figure: an RNN reads input tokens x1 … x4 starting from a START symbol; its final hidden state gives a vector c summarising the sequence]
What is a vector representation of a sequence?
Slide credit: Duh, Dyer et al. 2015
RNN ENCODER-DECODERS
[Figures: the encoder RNN reads “Aller Anfang ist schwer STOP” into a vector c; the decoder RNN, initialised with c, then generates “START Beginnings are difficult STOP”]
What is the probability of a sequence?
Slide credit: Duh, Dyer et al. 2015
RNN ATTENTION MODEL
[Figures: the encoder reads “Aller Anfang ist schwer STOP”; at each step the decoder attends back over the encoder states, rather than a single vector c, and generates the next target word, producing “Beginnings”, “are”, “difficult”, then STOP]
What is the probability of a sequence?
Slide credit: Duh, Dyer et al. 2015
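A sketch of the attention step in these figures, using simple dot-product scoring (one of several possible scoring functions; shapes and names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    # decoder_state: (batch, dim); encoder_states: (batch, src_len, dim)
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))  # (batch, src_len, 1)
    weights = F.softmax(scores.squeeze(2), dim=1)   # normalise over source positions
    # Context vector: a weighted average of the encoder states, recomputed
    # for every target word instead of one fixed vector c
    return torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)  # (batch, dim)
```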
31 APPLICATIONS OF SEQ2SEQ
‣ Machine translation
‣ Summarisation (document as input)
‣ Speech recognition & speech synthesis
‣ Image captioning & image generation
‣ Word morphology (over characters)
  ‣ e.g., study → student; receive → recipient; play → player; pay → payer/payee
‣ Generating source code from text, & more…
32 EVALUATION: DID IT WORK?
‣ Given input in Persian:
  ملبورن مهد و مرکز پیدایش صنعت فیلمسازی و سینما، تلویزیون، رقص باله، هنر امپرسیونیسم، سبکهای مختلف رقص مثل نیو وگ و ملبورن شافل در استرالیا و مرکز مهم موزیک کلاسیک و امروزی در این کشور است.
‣ Google Translate outputs the English:
  Melbourne cradle and center of origin of the film industry and cinema, television, ballet, art, impressionism, various dance styles such as New Vogue and the Melbourne Shuffle in Australia and an important center of classical and contemporary music in this country.
‣ Ask a bilingual speaker to judge? Ask them to rate two components:
  ‣ fluency: follows the grammar of English, and is semantically coherent
  ‣ adequacy: contains the same information as the original source document
  ‣ or edit the sentence until it is adequate, and measure the #changes, time spent, etc.
33 REUSABLE EVALUATION
‣ What if we have one (or several) good translations? E.g.:
  Referred to as Australia’s “cultural capital”, it is the birthplace of Australian impressionism, Australian rules football, the Australian film and television industries, and Australian contemporary dance such as the Melbourne Shuffle. It is recognised as a UNESCO City of Literature and a major centre for street art, music and theatre.
‣ We can use this text to evaluate many different MT system outputs for the same input
34 AUTOMATIC EVALUATION
‣ How many words are shared between the output:
  Melbourne cradle and center of origin of the film industry and cinema, television, ballet, art, impressionism, various dance styles such as New Vogue and the Melbourne Shuffle in Australia and an important center of classical and contemporary music in this country.
‣ and the reference:
  Referred to as Australia’s “cultural capital”, it is the birthplace of Australian impressionism, Australian rules football, the Australian film and television industries, and Australian contemporary dance such as the Melbourne Shuffle. It is recognised as a UNESCO City of Literature and a major centre for street art, music and theatre.
35 MT EVALUATION: BLEU
‣ BLEU measures the closeness of a translation to one or more references
‣ defined as: BLEU = bp ⨉ (prec_1-gram ⨉ prec_2-gram ⨉ prec_3-gram ⨉ prec_4-gram)^(1/4)
  ‣ a geometric mean (an equally weighted average in log space) of the 1, 2, 3 & 4-gram precisions
  ‣ prec_n-gram = number of correct n-grams / number of n-grams predicted in the output
    ‣ numerator clipped to the #occurrences of the n-gram in the reference
  ‣ and a brevity penalty to hedge against short outputs
    ‣ bp = min(1, output length / reference length)
‣ Correlates with human judgements of fluency & adequacy
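A minimal sketch of this BLEU definition, with clipped n-gram precision and the slide's simple brevity penalty (output and reference as token lists; standard implementations differ in details such as smoothing):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(output, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        out_counts, ref_counts = ngrams(output, n), ngrams(reference, n)
        # Clip each n-gram's count to its count in the reference
        correct = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        precisions.append(correct / max(1, sum(out_counts.values())))
    bp = min(1.0, len(output) / len(reference))  # brevity penalty
    score = bp
    for p in precisions:                         # geometric mean of precisions
        score *= p ** (1.0 / max_n)
    return score
```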
36 SUMMARY
‣ Word vs phrase-based MT
‣ Components of the phrase-based approach
‣ Decoding algorithm
‣ Neural encoder-decoder
‣ Evaluation using BLEU
‣ Reading
  ‣ JM2, Sections 25.7–25.9
  ‣ Neural Machine Translation and Sequence-to-sequence Models: A Tutorial, Neubig 2017, Sections 7 & 8, https://arxiv.org/abs/1703.01619