IAAA / PSTALN Machine translation - Benoit Favre - last generated on January 20, 2020 - lab session page
IAAA / PSTALN: Machine translation
Benoit Favre
Aix-Marseille Université, LIS/CNRS
Last generated on January 20, 2020

Definition

What is machine translation (MT)?
▶ Produce a translated version of a text from a source language into a target language
▶ Word-, sentence-, paragraph- or document-level translation

Formalization
▶ x = x1 … xn: sequence of words in the source language (e.g. Chinese)
▶ y = y1 … ym: sequence of words in the target language (e.g. English)
▶ Objective: find f such that y = f(x)

Why is it hard?
▶ Non-synchronous n-to-m symbol generation
▶ One-to-many / many-to-one word translation
▶ Things move around
▶ Some phrases do not translate

Yet people are quite good at it
▶ But learning a new language takes a lot of effort

History of MT

1950s: development of computers/AI in the West is driven by the idea of translating from Russian to English
▶ Link with cryptography
1960-1980: restricted domains
▶ Bilingual dictionaries + rules to order words
1980-2000: statistical approaches
▶ Translate from examples through statistical models
2000-2010: speech translation
▶ DARPA projects: high-volume article/blog translation, dialogues with translation
2010+: neural machine translation
▶ Neural language model rescoring
▶ Sequence-to-sequence decoding
▶ Attention mechanisms

The translation pyramid

[Figure: the translation pyramid, from a source sentence in English ("I like soup") to a target sentence in French ("J'aime la soupe"). Translation can operate at increasing levels of abstraction: lexical (I → Je, to like → aimer, soup → la soupe), syntactic (subject-verb-object → sujet-verbe-complément), semantic (like(person(me), edible(soup)) → aimer(personne(moi), comestible(soupe))), and a language-independent interlingua at the top.]

Machine translation (the legacy approach)

Definitions
▶ source: text in the source language (e.g. Chinese)
▶ target: text in the target language (e.g. English)

Phrase-based statistical translation
▶ Decouple word translation and word ordering

    P(target|source) = P(source|target) × P(target) / P(source)

Model components
▶ P(source|target): translation model
▶ P(target): language model
▶ P(source): ignored because it is constant for a given input

Language model (LM)

Objective
▶ Find a function that ranks a word sequence according to its likelihood of being proper language
▶ Compute the probability that a text originates from a corpus

    P(w1 … wn) = P(wn | wn−1 … w1) P(wn−1 … w1)
               = P(wn | wn−1 … w1) P(wn−1 | wn−2 … w1) …
               = P(w1) ∏i P(wi | wi−1 … w1)

    P(le chat boit du lait) = P(le) × P(chat|le) × P(boit|le chat)
                              × P(du|le chat boit) × P(lait|le chat boit du)

N-gram LM

Apply the Markov chain limited-horizon approximation

    P(word_i | history(1, i−1)) ≈ P(word_i | history(i−k, i−1))
    P(wi | w1 … wi−1) ≈ P(wi | wi−k … wi−1)

For k = 2:

    P(le chat boit du lait) ≈ P(le) × P(chat|le) × P(boit|le chat)
                              × P(du|chat boit) × P(lait|boit du)

Estimation:

    P(boit|le chat) = nb(le chat boit) / nb(le chat)

An n-gram LM (n = k + 1) uses n words for estimation

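As a rough illustration (not from the original slides), these relative-frequency estimates can be obtained by plain counting. The toy corpus below is invented, and a bigram model (k = 1) is used instead of the trigram of the example above for brevity:

```python
from collections import Counter

corpus = ["le chat boit du lait", "le chat dort", "le chien boit du lait"]  # invented toy corpus
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    # maximum-likelihood estimate: P(w | prev) = nb(prev w) / nb(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("chat", "le"))   # 2/3 on this toy corpus
```
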
LM smoothing

Example, bigram model (2-gram):

    P(la chaise boit du lait) = P(la) × P(chaise|la) × P(boit|chaise) × …

How to deal with unseen events?

Pseudo-count method (Laplace smoothing), with N the number of possible events (vocabulary size):

    Ppseudo(boit|chaise) = (nb(chaise boit) + 1) / (nb(chaise) + N)

Interpolation methods:

    Pinterpol(boit|chaise) = λchaise P(boit|chaise) + (1 − λchaise) P(boit)

Backoff methods: like interpolation, but only applied when events are not observed

Most popular approach: "modified Kneser-Ney" [James et al, 2000]

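A small sketch of Laplace smoothing and simple interpolation, on the same invented toy corpus as in the previous sketch; the fixed λ is an arbitrary choice (real systems estimate it, e.g. with modified Kneser-Ney):

```python
from collections import Counter

corpus = ["le chat boit du lait", "le chat dort", "le chien boit du lait"]
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

V = len(unigrams)   # number of possible events for Laplace smoothing

def p_laplace(w, prev):
    # add-one smoothing: unseen bigrams get a small but non-zero probability
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def p_interpolated(w, prev, lam=0.8):
    # interpolate the bigram estimate with the unigram estimate
    p_uni = unigrams[w] / sum(unigrams.values())
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

print(p_laplace("boit", "chaise"))        # "chaise" was never seen: 1 / (0 + V)
print(p_interpolated("boit", "chaise"))   # falls back on the unigram P(boit)
```
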
Neural language model

Train a (potentially recurrent) classifier to predict the next word.

[Figure: an unrolled recurrent network reading ⟨start⟩ w1 w2 … w6 and predicting w1 w2 … ⟨end⟩, i.e. the same sequence shifted by one position.]

In training, two possible regimes:
▶ Use the true word to predict the next word (teacher forcing)
▶ Use the predicted word from the previous time step

[Figure: the two regimes on the unrolled network: feeding the gold words back as inputs vs. feeding back the model's own predictions.]

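A minimal sketch of the teacher-forcing regime in PyTorch (the sizes, names and GRU choice are illustrative, not the course's reference implementation): the network reads the gold words and is trained to predict the same sequence shifted by one position.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h=None):
        h_seq, h = self.rnn(self.emb(x), h)
        return self.out(h_seq), h          # logits for every position

model = RNNLM()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

batch = torch.randint(0, vocab_size, (8, 20))   # fake word ids: 8 sentences of 20 tokens
inputs, targets = batch[:, :-1], batch[:, 1:]   # teacher forcing: gold history as input
logits, _ = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```
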
Softmax approximations

When the vocabulary is large (> 10,000 words), the softmax layer gets too expensive
▶ Stores an h × |V| matrix in GPU memory
▶ Training time gets very long

Turn the problem into a sequence of decisions
▶ Hierarchical softmax

Turn the problem into a small set of binary decisions
▶ Noise contrastive estimation, sampled softmax...
▶ → Pair the target word against a small set of randomly selected words (see the sketch below)

More here: http://sebastianruder.com/word-embeddings-softmax

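A hedged sketch of the sampled-softmax idea in PyTorch: the true next word is scored against a handful of randomly drawn negatives instead of the full vocabulary. The uniform negative sampler and the omission of the log Q(w) correction are simplifications; all names are made up.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(hidden, targets, out_weight, out_bias, num_sampled=64):
    """Score the true word against a few random negatives instead of the
    whole vocabulary (no correction for the sampling distribution here)."""
    batch, vocab = hidden.size(0), out_weight.size(0)
    negatives = torch.randint(0, vocab, (batch, num_sampled))          # uniform negatives
    # negatives may occasionally hit the true word; real implementations mask this
    candidates = torch.cat([targets.unsqueeze(1), negatives], dim=1)   # true word at index 0
    w = out_weight[candidates]                                         # (batch, 1+k, hidden)
    b = out_bias[candidates]                                           # (batch, 1+k)
    logits = torch.bmm(w, hidden.unsqueeze(2)).squeeze(2) + b          # (batch, 1+k)
    labels = torch.zeros(batch, dtype=torch.long)                      # correct class is index 0
    return F.cross_entropy(logits, labels)

# toy usage with a made-up 50k-word vocabulary
hidden = torch.randn(8, 256)
targets = torch.randint(0, 50000, (8,))
out_weight, out_bias = torch.randn(50000, 256), torch.randn(50000)
print(sampled_softmax_loss(hidden, targets, out_weight, out_bias))
```
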
Perplexity

How good is a language model?
1 Intrinsic metric: compute the probability of a validation corpus
2 Extrinsic metric: use it in a system and measure the system's performance

Perplexity (PPL) is an intrinsic measure
▶ If you had a die with one word per face, how often would you get the correct next word for a validation context?
▶ Lower is better
▶ Only comparable for LMs trained with the same vocabulary

    PPL(w1 … wn) = p(w1 … wn)^(−1/n)
                 = [ ∏_{i=1}^{n} p(wi | wi−1 … w1) ]^(−1/n)

    PPL(w1 … wn) = 2^( −(1/n) ∑_{i=1}^{n} log2 p(wi | wi−1 … w1) )

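A small sketch of the PPL computation, assuming we already have one log-probability per token from some model (natural logarithms are used here, which is equivalent to the base-2 formulation above):

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities p(w_i | history), one per token."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# a model that always assigns probability 1/100 to the next word has PPL 100
print(perplexity([math.log(1 / 100)] * 50))   # -> 100.0
```
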
Limits of language modeling

Train a language model on the One Billion Word benchmark
▶ "Exploring the Limits of Language Modeling", Jozefowicz et al., 2016
▶ 800k different words
▶ Best model → 3 weeks on 32 GPUs
▶ PPL: perplexity evaluation metric (lower is better)

    System                    PPL
    RNN-2048                  68.3
    Interpolated KN 5-gram    67.6
    LSTM-512                  32.2
    2-layer LSTM-2048         30.6
      + CNN inputs            30.0
      + CNN softmax           39.8

Byte-pair encoding (BPE)

Word language models
▶ Large decision layer
▶ Unknown-words problem

Character language models
▶ Don't know about words
▶ Require stability over a long history

Word-piece models
▶ Split words into smaller pieces
▶ Frequent tokens are modeled as one piece
▶ Can factor morphology

Byte pair encoding [Shibata et al, 1999]
1 Start with an alphabet containing all characters
  ⋆ Split words into characters
2 Repeat up to the desired alphabet size (typically 10-30k):
  1 Compute the most frequent 2-gram (a, b)
  2 Add a new symbol γ(a,b) to the alphabet
  3 Replace all occurrences of (a, b) with γ(a,b) in the corpus

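A compact sketch of the BPE learning loop described above. The word-frequency input format and the "</w>" end-of-word marker are common conventions rather than something specified on the slide:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """words: mapping word -> frequency. Returns the list of learned merges."""
    # represent each word as a tuple of symbols, starting from its characters
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair (a, b)
        merges.append(best)
        merged = best[0] + best[1]              # the new symbol gamma(a, b)
        new_vocab = {}
        for symbols, freq in vocab.items():     # replace (a, b) by the new symbol
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"lower": 2, "low": 5, "newest": 6, "widest": 3}, 5))
```
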
Generation from a LM

Given a language model, how can we generate text?

Start with input x = ⟨start⟩ and hidden state h = 0. Repeat until x = ⟨end⟩:
1 Compute logits and the new hidden state: y, h ← model(h, x)
2 Introduce temperature: y′ = y/θ
3 Make a distribution: p = softmax(y′)
4 Draw a symbol from the multinomial distribution: s̃ ∼ p
  1 Draw v ∼ Uniform(0, 1)
  2 Take s̃ = the smallest s such that ∑_{i=0}^{s} pi ≥ v (inverse transform sampling)
5 x ← s̃

Temperature θ modifies the distribution (θ = 0.7 is a good value)
▶ θ < 1 gives more conservative results
▶ θ > 1 leads to more variability

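A small sketch of the body of this sampling loop, assuming the logits for the next symbol are already available (the names and the numpy implementation are illustrative):

```python
import numpy as np

def sample_next(logits, theta=0.7, rng=np.random.default_rng()):
    """Draw the next symbol index from softmax(logits / theta)."""
    y = np.asarray(logits, dtype=float) / theta      # 2. temperature
    p = np.exp(y - y.max())
    p /= p.sum()                                     # 3. softmax -> distribution p
    v = rng.uniform()                                # 4.1 v ~ Uniform(0, 1)
    return int(np.searchsorted(np.cumsum(p), v))     # 4.2 smallest s with cumulative p >= v

# toy usage: fake logits over a 5-symbol vocabulary
print(sample_next([2.0, 1.0, 0.5, 0.1, -1.0]))
```
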
Neural LM: conclusions

Use a (recurrent) classifier to predict the next word given the history
▶ Typically trained on the true history

Evaluation
▶ Perplexity, but it is not really related to downstream usefulness

Large decision layer for realistic vocabularies
▶ Softmax approximations
▶ Maybe words are not the best representation

Translation model

How to compute P(source|target) = P(s1, …, sn | t1, …, tm)?

    P(s1, …, sn | t1, …, tm) = nb(s1, …, sn → t1, …, tm) / ∑_x nb(x → t1, …, tm)

Piecewise translation:

    P(I am your father → Je suis ton père) = P(I → je) × P(am → suis)
                                             × P(your → ton) × P(father → père)

To compute those probabilities
▶ We need an alignment between source and target words

Bitexts

Example of a sentence-aligned French-English bitext:

French: je déclare reprise la session du parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes voeux en espérant que vous avez passé de bonnes vacances .
English: i declare resumed the session of the european parliament adjourned on friday 17 december 1999 , and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .

French: comme vous avez pu le constater , le grand " bogue de l' an 2000 " ne s' est pas produit . en revanche , les citoyens d' un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles .
English: although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .

French: vous avez souhaité un débat à ce sujet dans les prochains jours , au cours de cette période de session .
English: you have requested a debate on this subject in the course of the next few days , during this part-session .

IBM alignment model 1

Let s = s1 … sn be the source sentence and t = t1 … tm the target sentence.
Let P(si → t_a(i)) be the probability that the word si is aligned with (translated as) t_a(i).

We try to compute an alignment a:

    P(a|s, t) = P(a, s|t) / P(s|t)

We can write

    P(s|t) = ∑_a P(a, s|t)

So everything depends on P(a, s|t). Definition of the IBM1 model:

    P(a, s|t) = ∏_i P(si → t_a(i))

Determine the alignment

[Figure: a learned translation dictionary (e.g. we → nous 0.3695, we → avons 0.3210, we → devons 0.2824, do → veuillez 0.2707, do → pensez-vous 0.2317, do → dis-je 0.2145, do → ne 0.0425, not → pas 0.4126, not → non 0.3249, not → ne 0.2721, ...) and the resulting word alignments between "we do not know what is happening ." and "nous ne savons pas ce qui se passe .".]

Chicken-and-egg problem
▶ If we had an alignment, we could compute the translation probabilities
▶ If we had the translation probabilities, we could compute an alignment
▶ → use Expectation-Maximization (EM)

IBM1 alignment pseudo-code (as runnable Python)

from collections import defaultdict

def ibm1(bitext, iterations=10):
    # bitext: list of (target_sentence, source_sentence) pairs, each a list of words
    source_vocab = {s for _, source in bitext for s in source}
    target_vocab = {t for target, _ in bitext for t in target}

    # uniform initial probabilities P(t|s); any uniform value works
    prob = {(t, s): 1.0 / len(target_vocab)
            for s in source_vocab for t in target_vocab}

    for _ in range(iterations):          # stands in for "while not converged()"
        count = defaultdict(float)       # expected counts of (t, s) pairs
        total = defaultdict(float)       # expected counts of source words

        for target, source in bitext:    # traverse the bitext (E step)
            total_sent = {t: sum(prob[(t, s)] for s in source) for t in target}
            for t in target:
                for s in source:
                    c = prob[(t, s)] / total_sent[t]
                    count[(t, s)] += c
                    total[s] += c

        for (t, s) in prob:              # update the probabilities (M step)
            prob[(t, s)] = count[(t, s)] / total[s]

    return prob

IBM models 2+

▶ Model 1: lexical translation
▶ Model 2: absolute reordering
▶ Model 3: fertility
▶ Model 4: relative reordering
▶ Models 5-6, HMM: learn to align
...

[Figure: fertility and distortion illustrated on an aligned sentence pair, with probabilities such as P(fertility|s2) and P(distortion|s2, s3).]

We can hope for an alignment error rate < 30%
Software: Giza++, berkeleyaligner

Which direction?

[Figure: word-alignment matrices between the English sentence "i would like your advice about rule 143 concerning inadmissibility ." and its French translation, aligned in both directions (English → French and French → English), together with the fusion (symmetrization) of the two alignments.]

Phrase table

[Figure: word alignments between "we do not know what is happening ." and "nous ne savons pas ce qui se passe .", and the phrase table extracted from them:]

    we > nous
    do not know > ne savons pas
    what > ce qui
    is happening > se passe
    we do not know > nous ne savons pas
    what is happening > ce qui se passe

Compute a translation probability for all known phrases (an extension of n-gram language models)
▶ Combine with the LM and find the best translation with a decoding algorithm

Decoding problem

Given the source text and the models, find the best translation

    t̂ = argmax_t P(t) P(s|t)

Decoding process
1 For each segment of the source, generate all possible translations
2 Combine and reorder the translated pieces
3 Apply the language model
4 Score each complete translation

Very large search space
▶ Requires lots of tricks and optimization
▶ Pruning of the least probable translations
▶ Notable implementation: the Moses decoder [Koehn et al, 2006]

Decoding (2)

[Figure: phrase-based decoding of "tension rises in egypt 's capital" into "la tension augmente dans la capitale de l' égypte", combining the translation model, the distortion (reordering) model and the language model.]

Stat-MT: conclusions

Machine translation is trainable from bitexts
▶ Large quantities of translation memories are available
▶ Use alignment to infer the latent links between languages

Split the problem
▶ Segment translation (translation model)
▶ Segment ordering (language model)

The search space is large
▶ Decoders are complex
▶ Require lots of pruning and approximations

Estimation is hard
▶ Pointwise maximum-likelihood probability estimation
▶ How to deal with unseen events?

Neural machine translation (NMT)

Phrase-based translation
▶ Same coverage problem as with word n-grams
▶ Alignment is still wrong in 30% of cases
▶ A lot of tricks to make it work
▶ Researchers progressively introduced neural networks
  ⋆ Language model
  ⋆ Phrase translation probability estimation
▶ The Google Translate approach until mid-2016

End-to-end approach to machine translation
▶ Can we directly input source words and generate target words?

Encoder-decoder framework

Generalization of the conditioned language model
▶ Build a representation of the source, then generate the target sentence
▶ Also called the seq2seq framework (a minimal sketch follows below)

But still limited for translation
▶ Bad for long sentences
▶ How to account for unknown words?
▶ How to make use of alignments?

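A minimal seq2seq sketch in PyTorch, only to make the framework concrete: the source sentence is compressed into the final encoder state, which initializes a conditioned target-side language model. The sizes, the GRU choice and the absence of attention are simplifying assumptions:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the source is compressed into the final
    encoder state, which conditions a target-side language model."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hid=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt_in):
        _, h = self.encoder(self.src_emb(src))        # h: sentence representation
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_states)                   # logits for each target position

model = Seq2Seq(src_vocab=5000, tgt_vocab=6000)
src = torch.randint(0, 5000, (4, 12))       # fake source batch
tgt_in = torch.randint(0, 6000, (4, 10))    # gold target shifted right (teacher forcing)
logits = model(src, tgt_in)                 # (4, 10, 6000)
print(logits.shape)
```
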
Interlude: pointer networks

The decision is an offset in the input
▶ The number of classes depends on the length of the input
▶ The decision depends on a hidden state in the input and a hidden state in the output
▶ Encoder state ej, decoder state di

    yi = softmax_j(v⊤ tanh(W ej + U di))

Oriol Vinyals, Meire Fortunato, Navdeep Jaitly, "Pointer Networks", arXiv:1506.03134

Attention mechanisms

Loosely based on the human visual attention mechanism
▶ Let the neural network focus on parts of the input to make its decision
▶ Learn what to attend to based on what has been produced so far

    αi = softmax_j(f_align(di, ej))
    attn_i = ∑_j αi,j ej
    yi = softmax(W [attn_i ⊕ di] + b)

Additive attention (+):

    f_align(di, ej) = v⊤ tanh(W1 di + W2 ej)

Multiplicative attention (×):

    f_align(di, ej) = di⊤ W3 ej

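A sketch of both scoring functions and of the context computation in PyTorch, for a single decoder step over one sentence (tensor shapes and layer names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hid = 128
enc_states = torch.randn(1, 15, hid)   # e_j: encoder states for one 15-word source sentence
dec_state = torch.randn(1, hid)        # d_i: the current decoder state

# multiplicative attention: f_align(d_i, e_j) = d_i^T W3 e_j
W3 = nn.Linear(hid, hid, bias=False)
scores_mul = torch.bmm(W3(enc_states), dec_state.unsqueeze(2)).squeeze(2)           # (1, 15)

# additive attention: f_align(d_i, e_j) = v^T tanh(W1 d_i + W2 e_j)
W1, W2, v = nn.Linear(hid, hid), nn.Linear(hid, hid), nn.Linear(hid, 1, bias=False)
scores_add = v(torch.tanh(W1(dec_state).unsqueeze(1) + W2(enc_states))).squeeze(2)  # (1, 15)

alpha = F.softmax(scores_mul, dim=1)                           # alignment weights alpha_{i,j}
attn = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)    # attn_i = sum_j alpha_{i,j} e_j
features = torch.cat([attn, dec_state], dim=1)                 # [attn_i ⊕ d_i] for the output layer
print(alpha.shape, attn.shape, features.shape)
```
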
Machine translation with attention

Learns the word-to-word alignment

How to deal with unknown words

If you don't have attention
▶ Introduce unk symbols for low-frequency words
▶ Realign them to the input a posteriori
▶ Use a large translation dictionary, or copy the word if it is a proper name

With attention-based MT, extract α as an alignment parameter
▶ Then translate the aligned input word directly

What about morphologically rich languages?
▶ Reduce the vocabulary size by translating word factors
  ⋆ Byte pair encoding algorithm
▶ Use a word-level RNN to transliterate the word

Zero-shot machine translation

How to deal with the quadratic need for parallel data?
▶ n languages → n² pairs
▶ So far, people have been using a pivot language (x → English → y)

Parameter sharing across language pairs
▶ Many to one → share the target weights
▶ One to many → share the source weights
▶ Many to many → train a single system for all pairs

Zero-shot learning
▶ Use a token to identify the target language
▶ Let the model learn to recognize the source language
▶ Can process pairs never seen in training!
▶ The model learns an "interlingua"
▶ Can also handle code switching

"Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation", Johnson et al., arXiv:1611.04558

Attention is all you need

Attention treats words as a bag
▶ An RNN is needed to convey word order

Maybe we can encode position information as embeddings
▶ Absolute position
▶ Relative position
▶ Absolute and relative position?
  ⋆ → use sinusoids of different frequencies and phases (see the sketch below)

Multiple attention heads
▶ Allow the network to focus on multiple phenomena

Multiple layers of attention
▶ Encode variables conditioned on subsets of the inputs

Transformer networks [Vaswani et al, 2017, arXiv:1706.03762]
▶ Encoder-decoder with multiple layers of multi-head attention
▶ http://jalammar.github.io/illustrated-transformer/

BERT / GPT-2
▶ Encoders trained on language modeling tasks

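A sketch of the sinusoidal position embeddings mentioned above, following the formulas of Vaswani et al.; the resulting matrix is added to the word embeddings before the first attention layer (the numpy implementation is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoids of different frequencies:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)   # (50, 512)
```
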
NMT: conclusions

Machine translation
▶ Transform the source into the target language
▶ Sequence-to-sequence (encoder-decoder) framework

Attention mechanisms
▶ Learn to align inputs and outputs
▶ Can look at all words of the input

Self-attention
▶ Transformer / BERT

Zero-shot learning
▶ Evades the "language pair" requirement
▶ Can be interesting for all of NLP