IAAA / PSTALN
Machine translation
Benoit Favre
Aix-Marseille Université, LIS/CNRS
last generated on January 20, 2020
Definition
What is machine translation (MT)?
▶ Write a translated version of a text from a source to a target language
▶ Word, sentence, paragraph, document-level translation
Formalization
▶ x = x1 . . . xn : sequence of words in the source language (ex: Chinese)
▶ y = y1 . . . ym : sequence of words in the target language (ex: English)
▶ Objective: find f such that y = f(x)
Why is it hard?
▶ Non-synchronous n to m symbol generation
▶ One-to-many / many-to-one word translation
▶ Things move around
▶ Some phrases do not translate
Yet people are quite good at it
▶ But learning a new language takes a lot of effort
History of MT
1950s: the development of computers/AI in the West was driven by the idea of
translating from Russian to English
▶ Link with cryptography
1960-1980: Reduced domains
▶ Bilingual dictionaries + rules to order words
1980-2000: Statistical approaches
▶ Translate from examples through statistical models
2000-2010: Translate speech
▶ DARPA projects: high volume article/blog translation, dialogues with
translation
2010+: Neural machine translation
▶ Neural language model rescoring
▶ Sequence-to-sequence decoding
▶ Attention mechanisms
The translation pyramid
[Figure: the translation (Vauquois) pyramid, from the source sentence "I like soup" (English) to the target sentence "J'aime la soupe" (French)]
▶ Interlingua (top of the pyramid)
▶ Semantic transfer: like(person(me), edible(soup)) ↔ aimer(personne(moi), comestible(soupe))
▶ Syntactic transfer: subject-verb-object ↔ sujet-verbe-complément
▶ Lexical transfer: I ↔ Je, to like ↔ aimer, soup ↔ la soupe
Machine translation (the legacy approach)
Definitions
source: text in the source language (ex: Chinese)
target: text in the target language (ex: English)
Phrase-based statistical translation
Decouple word translation and word ordering
P(target|source) = P(source|target) × P(target) / P(source)
Model components
P(source|target) = translation model
P(target) = language model
P(source) = ignored because constant for an input
Language model (LM)
Objective
▶ Find function that ranks a word sequence according to its likelihood of
being proper language
▶ Compute the probability that a text originates from a given corpus
P(w1 . . . wn) = P(wn|wn−1 . . . w1) × P(wn−1 . . . w1)
             = P(wn|wn−1 . . . w1) × P(wn−1|wn−2 . . . w1) × P(wn−2 . . . w1)
             = P(w1) × ∏i P(wi|wi−1 . . . w1)

P(le chat boit du lait) = P(le)
                        × P(chat|le)
                        × P(boit|le chat)
                        × P(du|le chat boit)
                        × P(lait|le chat boit du)
N-gram LM
Apply a Markov (limited-horizon) approximation

P(word(i)|history(1, i − 1)) ≃ P(word(i)|history(i − k, i − 1))
P(wi|w1 . . . wi−1) ≃ P(wi|wi−k . . . wi−1)

For k = 2

P(le chat boit du lait) ≃ P(le) × P(chat|le) × P(boit|le chat)
                        × P(du|chat boit) × P(lait|boit du)

Estimation (maximum likelihood from counts)

P(boit|le chat) = nb(le chat boit) / nb(le chat)

An n-gram LM (n = k + 1) uses n words for each estimate
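To make the estimation formula concrete, here is a minimal count-based bigram LM (k = 1) in Python; the toy corpus, the padding symbols and the function name are illustrative choices, not part of the slides.

from collections import defaultdict

def train_bigram_lm(sentences):
    # nb[(u, v)] counts the bigram "u v", nb_context[u] counts the context u
    nb = defaultdict(int)
    nb_context = defaultdict(int)
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        for u, v in zip(words, words[1:]):
            nb[u, v] += 1
            nb_context[u] += 1
    # maximum-likelihood estimate P(v|u) = nb(u v) / nb(u)
    return {(u, v): c / nb_context[u] for (u, v), c in nb.items()}

lm = train_bigram_lm([["le", "chat", "boit", "du", "lait"],
                      ["le", "chat", "dort"]])
print(lm[("le", "chat")])   # P(chat | le) = 1.0 on this toy corpus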
LM Smoothing
Example, bigram model (2-gram) :
P(la chaise boit du lait) = P(la) × P(chaise|la) × P(boit|chaise) × . . .
How to deal with unseen events
Method of pseudo-counts (Laplace smoothing), with N the number of events
receiving a pseudo-count (i.e. the size of the vocabulary):

Ppseudo(boit|chaise) = (nb(chaise boit) + 1) / (nb(chaise) + N)

Interpolation methods, mixing with the lower-order (unigram) estimate:

Pinterpol(boit|chaise) = λchaise P(boit|chaise) + (1 − λchaise) P(boit)

Backoff methods: like interpolation, but the lower-order model is used only
when events are not observed
Most popular approach: “modified Kneser-Ney” [James et al, 2000]
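A minimal sketch of the two smoothing formulas above in Python; it assumes count and probability dictionaries in the style of the previous sketch, and the function names and the λ value are illustrative.

def laplace_prob(w, context, nb_bigram, nb_context, vocab_size):
    # Ppseudo(w|context) = (nb(context w) + 1) / (nb(context) + N), with N = vocab_size
    return (nb_bigram.get((context, w), 0) + 1) / (nb_context.get(context, 0) + vocab_size)

def interpolated_prob(w, context, p_bigram, p_unigram, lam=0.7):
    # mix the bigram estimate with the unigram estimate of the predicted word
    return lam * p_bigram.get((context, w), 0.0) + (1 - lam) * p_unigram.get(w, 0.0)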
Neural language model
Train a (potentially recurrent) classifier to predict the next word
[Figure: a recurrent classifier reads ⟨start⟩, w1, . . . , w6 and predicts w1, . . . , w6, ⟨end⟩ at each step]
In training, two possible regimes:
▶ Use true word to predict next word
▶ Use predicted word from previous slot
[Figure: the same network under the two regimes, with inputs taken either from the true previous words (teacher forcing) or from the words predicted at the previous step]
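As an illustration of the first (teacher-forcing) regime, here is a minimal recurrent LM sketch in PyTorch; the class name, layer sizes and the random batch are illustrative, not prescribed by the slides.

import torch
import torch.nn as nn

class RecurrentLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decision = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):            # tokens: (batch, length)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.decision(hidden)      # next-word logits at each position

# teacher forcing: the input is the true sequence shifted by one position
model = RecurrentLM(vocab_size=10000)
batch = torch.randint(0, 10000, (4, 20))          # fake token ids
logits = model(batch[:, :-1])                     # predict tokens 1..19 from 0..18
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), batch[:, 1:].reshape(-1))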
Softmax approximations
When vocabulary is large (> 10000), the softmax layer gets too
expensive
▶ Stores an h × |V| matrix in GPU memory
▶ Training time gets very long
Turn the problem into a sequence of decisions
▶ Hierarchical softmax
Turn the problem into a small set of binary decisions
▶ Noise contrastive estimation, sampled softmax...
▶ → Pit the target word against a small set of randomly selected words
More here:
http://sebastianruder.com/word-embeddings-softmax
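A rough sketch of the sampled-softmax idea: score only the true word and a few random negatives instead of the whole vocabulary. It omits the corrections for the sampling distribution used by the real estimators, so it only shows the shape of the computation; all names and sizes are illustrative.

import torch

def sampled_softmax_loss(hidden, targets, out_embed, out_bias, num_sampled=64):
    # hidden: (batch, h), targets: (batch,), out_embed: (V, h), out_bias: (V,)
    batch = hidden.size(0)
    noise = torch.randint(0, out_embed.size(0), (batch, num_sampled))   # random negatives
    candidates = torch.cat([targets.unsqueeze(1), noise], dim=1)        # (batch, 1 + K)
    w = out_embed[candidates]                                           # (batch, 1 + K, h)
    b = out_bias[candidates]                                            # (batch, 1 + K)
    logits = torch.einsum("bkh,bh->bk", w, hidden) + b                  # scores for candidates only
    labels = torch.zeros(batch, dtype=torch.long)                       # true word sits at index 0
    return torch.nn.functional.cross_entropy(logits, labels)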
Perplexity
How good is a language model?
1 Intrinsic metric: compute the probability of a validation corpus
2 Extrinsic metric: use it in a system and compute its performance
Perplexity (PPL) is an intrinsic measure
▶ If you had a die with one word per face, how often would you pick the correct next word for a validation context?
▶ Lower is better
▶ Only comparable for LM trained with the same vocabulary
PPL(w1 . . . wn) = P(w1 . . . wn)^(−1/n)
               = ∏i=1..n P(wi|wi−1 . . . w1)^(−1/n)

PPL(w1 . . . wn) = 2^( −(1/n) ∑i=1..n log2 score(i) )
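A minimal sketch of the perplexity computation from per-word model probabilities; the probability values in the example are made up.

import math

def perplexity(word_probs):
    # word_probs: P(wi | history) for each word of the validation text
    n = len(word_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** avg_neg_log2

print(perplexity([0.2, 0.1, 0.25, 0.05]))   # more surprise -> higher perplexity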
Limits of language modeling
Train a language model on the One Billion Word benchmark
▶ “Exploring the Limits of Language Modeling", Jozefowicz et al. 2016
▶ 800k different words
▶ Best model → 3 weeks on 32 GPUs
▶ PPL: perplexity evaluation metric (lower is better)
System PPL
RNN-2048 68.3
Interpolated KN 5-GRAM 67.6
LSTM-512 32.2
2-layer LSTM-2048 30.6
+ CNN inputs 30.0
+ CNN softmax 39.8
Byte-pair encoding (BPE)
Word language models: large decision layer, unknown-word problem
Character language models: no notion of words, require stability over a long history
Word-piece models
▶ Split words in smaller pieces
▶ Frequent tokens are modeled as one piece
▶ Can factor morphology
Byte pair encoding [Shibata et al, 1999]
1 Start with alphabet containing all characters
⋆ Split words as characters
2 Repeat until the desired alphabet size is reached (typically 10-30k; see the sketch after this list)
1 Compute most frequent 2-gram (a, b)
2 Add to alphabet new symbol γ(a,b)
3 Replace all occurrences of (a, b) with γ(a,b) in corpus
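A minimal sketch of this learning loop in Python on a toy word list; the function name and the example words are illustrative.

from collections import Counter

def learn_bpe(words, num_merges):
    # each word is a tuple of symbols, initially its characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():              # count adjacent symbol pairs
            for a, b in zip(symbols, symbols[1:]):
                pairs[a, b] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]               # most frequent pair
        merges.append((a, b))
        new_corpus = Counter()
        for symbols, freq in corpus.items():              # replace (a, b) by the merged symbol
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(learn_bpe(["lower", "lowest", "newer", "newest"], num_merges=5))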
Generation from LM
Given a language model, how can we generate text?
Start with input x = ⟨start⟩, hidden state h = 0
Repeat until x = ⟨end⟩:
1 Compute logits and new hidden state y, h ← model(h, x)
2 Introduce temperature: y′ = y/θ
3 Make a distribution: p = softmax(y′)
4 Draw a symbol from the multinomial distribution s̃ ∼ p
1 Draw v ∼ Uniform(0, 1)
2 Compute s̃ = min{ s : ∑i=0..s pi ≥ v }
5 x ← s̃
Temperature θ reshapes the distribution (θ = 0.7 is a common value; see the sampling sketch below)
▶ θ < 1 gives more conservative results
▶ θ > 1 leads to more variability
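A minimal sampling loop in PyTorch following the steps above; the model(x, h) interface (returning logits and a new hidden state) and the default values are assumptions made for the sketch.

import torch

def sample_sequence(model, start_id, end_id, theta=0.7, max_len=100):
    # feed the previous symbol back until <end> is produced
    x = torch.tensor([start_id])
    h = None
    output = []
    for _ in range(max_len):
        logits, h = model(x, h)                               # assumed interface
        p = torch.softmax(logits.squeeze() / theta, dim=-1)   # temperature-scaled distribution
        s = torch.multinomial(p, num_samples=1)               # draw from the multinomial
        if s.item() == end_id:
            break
        output.append(s.item())
        x = s
    return output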
Neural LM: conclusions
Use (recurrent) classifier to predict next word given history
▶ Typically train on true history
Evaluation
▶ Perplexity, but not really related to downstream usefulness
Large decision layer for realistic vocabulary
▶ Softmax approximations
▶ Maybe words are not the best representation
Translation model
How to compute P(source|target) = P(s1 , . . . , sn |t1 , . . . , tn ) ?
P(s1, . . . , sn|t1, . . . , tn) = nb(s1, . . . , sn → t1, . . . , tn) / ∑x nb(x → t1, . . . , tn)

Piecewise translation

P(I am your father → Je suis ton père) = P(I → je) × P(am → suis)
                                       × P(your → ton)
                                       × P(father → père)
To compute those probabilities
▶ Need for alignment between source and target words
Bitexts
French: je déclare reprise la session du parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes voeux en espérant que vous avez passé de bonnes vacances .
English: i declare resumed the session of the european parliament adjourned on friday 17 december 1999 , and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .

French: comme vous avez pu le constater , le grand " bogue de l' an 2000 " ne s' est pas produit . en revanche , les citoyens d' un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles .
English: although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .

French: vous avez souhaité un débat à ce sujet dans les prochains jours , au cours de cette période de session .
English: you have requested a debate on this subject in the course of the next few days , during this part-session .
IBM alignment model 1
Let s = s1 . . . sn be the source sentence and t = t1 . . . tm the target sentence
Let P(si → ta(i)) be the probability that source word si is aligned with target word ta(i)
We try to compute an alignment a:

P(a|s, t) = P(a, s|t) / P(s|t)

We can write

P(s|t) = ∑a P(a, s|t)

So everything depends on P(a, s|t).
Definition of the IBM1 model:

P(a, s|t) = ∏i P(si → ta(i))
Determine the alignment
[Figure: learned translation dictionary and word alignment of "we do not know what is happening ." with "nous ne savons pas ce qui se passe ."]

Dictionary (translation probabilities):
we → nous 0.3695, we → avons 0.3210, we → devons 0.2824, ...
do → veuillez 0.2707, do → pensez-vous 0.2317, do → dis-je 0.2145, do → ne 0.0425, ...
not → pas 0.4126, not → non 0.3249, not → ne 0.2721, ...
Chicken and egg problem
▶ If we had an alignment we could compute translation probabilities
▶ If we had translation probabilities, we could compute an alignment
▶ → use Expectation-Maximization (EM)
IBM1 alignment pseudo-code
from collections import defaultdict

# assumed inputs: bitext = list of (target_sentence, source_sentence) word lists,
# source_words / target_words = the two vocabularies, converged() = a stopping test
prob = {}
for t in target_words:                          # uniform initial probabilities
    for s in source_words:
        prob[t, s] = 1.0 / len(source_words)

while not converged():                          # EM iterations
    count = defaultdict(float)                  # expected (target, source) counts
    total = defaultdict(float)                  # expected source-word counts
    for target, source in bitext:               # traverse bitexts (E-step)
        total_sent = {t: sum(prob[t, s] for s in source) for t in target}
        for t in target:
            for s in source:
                c = prob[t, s] / total_sent[t]
                count[t, s] += c
                total[s] += c
    for s in source_words:                      # update probabilities (M-step)
        for t in target_words:
            if total[s] > 0:
                prob[t, s] = count[t, s] / total[s]
IBM models 2+
Model 1 : lexical
Model 2 : absolute reordering
Model 3 : fertility
Model 4 : relative reordering
Models 5-6, HMM, learn to align ...
[Figure: model 3 fertility, P(fertility|s2), one source word generating several target words; model 4 relative reordering, P(distortion|s2, s3)]
We can hope for an alignment error rate below 30%
Software: Giza++, berkeleyaligner
Which direction?
[Figure: word alignment matrices between "i would like your advice about rule 143 concerning inadmissibility ." and "je voudrais vous demander un conseil au sujet de l' article 143 , qui concerne l' irrecevabilité .", computed in both directions (English > French and French > English), and their fusion (symmetrization) combining the two]
Phrase table
[Figure: symmetrized word alignment between "we do not know what is happening ." and "nous ne savons pas ce qui se passe .", from which consistent phrase pairs are extracted]

"Phrase table" extracted from the alignment:
we > nous
do not know > ne savons pas
what > ce qui
is happening > se passe
we do not know > nous ne savons pas
what is happening > ce qui se passe
Compute translation probability for all known phrases (an extension of
n-gram language models)
▶ Combine with LM and find best translation with decoding algorithm
Decoding problem
Given source text and model, find best translation
t̂ = argmaxt P(t) P(s|t)
Decoding process
1 For each segment of the source, generate all possible translations
2 Combine and reorder translated pieces
3 Apply language model
4 Score each complete translation
Very large search space
▶ Requires lots of tricks and optimization
▶ Pruning of least probable translations
▶ Notable implementation: Moses decoder [Koehn et al, 2006]
Decoding (2)
[Figure: decoding "tension rises in egypt 's capital" into "la tension augmente dans la capitale de l' égypte": the translation model proposes candidate phrase translations ("la tension", "capitale", "de l' égypte", ...), the distortion model scores reordering, and the language model scores the target bigrams ("la tension", "tension augmente", "augmente dans", "dans la", "la capitale", "capitale de", "de l'", "l' égypte")]
Stat-MT: conclusions
Machine translation trainable from bi-texts
▶ Large quantities of translation memories available
▶ Use alignment to infer latent link between languages
Split problem
▶ Segment translation (translation model)
▶ Segment ordering (language model)
Search space is large
▶ Decoders are complex
▶ Require lots of pruning and approximations
Estimation is hard
▶ Pointwise maximum likelihood probability estimation
▶ How to deal with unseen events?
Neural machine translation (NMT)
Phrase-based translation
▶ Same coverage problem as with word-ngrams
▶ Alignment still wrong in 30% of cases
▶ A lot of tricks to make it work
▶ Researchers progressively introduced neural networks
⋆ Language model
⋆ Phrase translation probability estimation
▶ This was the Google Translate approach until mid-2016
End-to-end approach to machine translation
▶ Can we directly input source words and generate target words?
Encoder-decoder framework
Generalisation of the conditioned language model
▶ Build a representation of the source, then generate the target sentence
▶ Also called the seq2seq framework (a minimal sketch follows below)
But still limited for translation
▶ Bad for long sentences
▶ How to account for unknown words?
▶ How to make use of alignments?
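A minimal seq2seq sketch in PyTorch, conditioning a target-side language model on the final encoder state; the class name, layer choices and sizes are illustrative, not the architecture used by any particular system.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.output = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        _, state = self.encoder(self.src_embed(src))              # encode source into one state
        hidden, _ = self.decoder(self.tgt_embed(tgt_in), state)   # condition the target LM on it
        return self.output(hidden)                                # logits over the target vocabulary

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 12))        # fake source token ids
tgt = torch.randint(0, 8000, (2, 10))        # fake target token ids (shifted, starting with <start>)
logits = model(src, tgt)                     # shape (2, 10, 8000)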
Interlude: Pointer networks
Decision is an offset in the input
▶ Number of classes dependent on the length of the input
▶ Decision depends on hidden state in input and hidden state in output
▶ Encoder state ej , decoder state di
yi = softmax(v ⊺ tanh(Wej + Udi ))
Oriol Vinyals, Meire Fortunato, Navdeep Jaitly, “Pointer Networks", arXiv:1506.03134
Attention mechanisms
Loosely based on human visual attention mechanism
▶ Let neural network focus on aspects of the input to make its decision
▶ Learn what to attend based on what it has produced so far
αi = softmaxj(falign(di, ej))
attni = ∑j αi,j ej
yi = softmax(W[attni ⊕ di] + b)

Additive attention
f+align(di, ej) = v⊺ tanh(W1 di + W2 ej)

Multiplicative attention
f×align(di, ej) = di⊺ W3 ej
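A minimal PyTorch sketch of the two scoring functions and the resulting attention summary; the class and function names, and the tensor shapes, are illustrative assumptions.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # f_align(d_i, e_j) = v^T tanh(W1 d_i + W2 e_j)
    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, d_i, e):               # d_i: (batch, dim), e: (batch, len, dim)
        scores = self.v(torch.tanh(self.W1(d_i).unsqueeze(1) + self.W2(e))).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)            # alignment weights over source positions
        return torch.einsum("bj,bjd->bd", alpha, e)      # attn_i = sum_j alpha_ij e_j

def multiplicative_attention(d_i, e, W3):
    # f_align(d_i, e_j) = d_i^T W3 e_j
    scores = torch.einsum("bd,de,bje->bj", d_i, W3, e)
    alpha = torch.softmax(scores, dim=-1)
    return torch.einsum("bj,bjd->bd", alpha, e)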
Machine translation with attention
▶ Learns the word-to-word alignment
How to deal with unknown words
If you don’t have attention
▶ Introduce unk symbols for low frequency words
▶ Realign them to the input a posteriori
▶ Use large translation dictionary or copy if proper name
With attention-based MT, extract the attention weights α as an alignment
▶ Then translate input word directly
What about morphologically rich languages?
▶ Reduce vocabulary size by translating word factors
⋆ Byte pair encoding algorithm
▶ Use word-level RNN to transliterate word
Zero-shot machine translation
How to deal with the quadratic need for parallel data?
▶ n languages → n² pairs
▶ So far, people have been using a pivot language (x → English → y)
Parameter sharing across language pairs
▶ Many to one → share the target weights
▶ One to many → share the source weights
▶ Many to many → train single system for all pairs
Zero-shot learning
▶ Use a special token prepended to the input to identify the target language
▶ Let model learn to recognize source language
▶ Can process pairs never seen in training!
▶ The model learns the “interlingua"
▶ Can also handle code switching
"Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot
Translation", Johnson et al., arXiv:1611.04558
Attention is all you need
Attention treats words as a bag
▶ Need RNN to convey word order
Maybe we can encode position information as embeddings
▶ Absolute position
▶ Relative position
▶ Absolute and relative position?
⋆ → use sinusoids of different frequencies and phases (see the sketch below)
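A minimal sketch of such sinusoidal position encodings; the function name is illustrative, and the encodings are meant to be added to the word embeddings.

import math
import torch

def sinusoidal_positions(max_len, dim):
    # one sinusoid per pair of dimensions, with geometrically spaced frequencies
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)          # (max_len, 1)
    freq = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim))
    enc = torch.zeros(max_len, dim)
    enc[:, 0::2] = torch.sin(pos * freq)        # even dimensions: sine
    enc[:, 1::2] = torch.cos(pos * freq)        # odd dimensions: cosine
    return enc

print(sinusoidal_positions(max_len=50, dim=8).shape)   # torch.Size([50, 8])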
Multiple attention heads
▶ Allow network to focus on multiple phenomena
Multiple layers of attention
▶ Encode variables conditioned on subsets of inputs
Transformer networks [Vaswani et al, 2017, arXiv:1706.03762]
▶ Encoder-decoder with multiple layers of multi-head attention
▶ http://jalammar.github.io/illustrated-transformer/
BERT / GPT-2
▶ Transformer encoders (BERT) or decoders (GPT-2) pretrained with language-modeling objectives
NMT: conclusions
Machine translation
▶ Transform source to target language
▶ Sequence to sequence (encoder-decoder) framework
Attention mechanisms
▶ Learn to align inputs and outputs
▶ Can look at all words from input
Self-attention
▶ Transformer / BERT
Zero-shot learning
▶ Evade the “language-pair" requirement
▶ Can be interesting in all of NLP