On the Weaknesses of Reinforcement Learning for Neural Machine Translation
Leshem Choshen¹, Lior Fox¹, Zohar Aizenbud¹, Omri Abend¹,²
¹School of Computer Science and Engineering, ²Department of Cognitive Sciences
The Hebrew University of Jerusalem
{leshem.choshen, lior.fox, zohar.aizenbud}@mail.huji.ac.il, oabend@cs.huji.ac.il

arXiv:1907.01752v2 [cs.CL] 9 Nov 2019

Abstract

Reinforcement learning (RL) is frequently used to increase performance in text generation tasks, including machine translation (MT), notably through the use of Minimum Risk Training (MRT) and Generative Adversarial Networks (GAN). However, little is known about what and how these methods learn in the context of MT. We prove that one of the most common RL methods for MT does not optimize the expected reward, and show that other methods take an infeasibly long time to converge. In fact, our results suggest that RL practices in MT are likely to improve performance only where the pre-trained parameters are already close to yielding the correct translation. Our findings further suggest that observed gains may be due not to the training signal but to unrelated effects, namely changes in the shape of the output distribution.

1 Introduction

Reinforcement learning (RL) is an appealing path for advancement in Machine Translation (MT), as it allows training systems to optimize non-differentiable score functions, common in MT evaluation, and to tackle the "exposure bias" (Ranzato et al., 2015) in standard training, namely that the model is not exposed during training to incorrectly generated tokens, and is thus unlikely to recover from generating such tokens at test time. These motivations have led to much interest in RL for text generation in general and MT in particular. Various policy gradient methods have been used, notably REINFORCE
(Williams, 1992) and variants thereof (e.g., Ranzato et al., 2015; Edunov et al., 2018) and Minimum Risk Training (MRT; e.g., Och, 2003; Shen et al., 2016). Another popular use of RL is for training GANs (Yang et al., 2018; Tevet et al., 2018). See §2. Nevertheless, despite increasing interest and strong results, little is known about what accounts for these performance gains, and the training dynamics involved.

We present the following contributions. First, our theoretical analysis shows that commonly used approximation methods are theoretically ill-founded, and may converge to parameter values that do not minimize the risk, nor are local minima thereof (§2.3).

Second, using both naturalistic experiments and carefully constructed simulations, we show that performance gains observed in the literature likely stem not from making target tokens the most probable, but from unrelated effects, such as increasing the peakiness of the output distribution (i.e., the probability mass of the most probable tokens). We do so by comparing a setting where the reward is informative with one where it is constant. In §4 we discuss this peakiness effect (PKE).

Third, we show that promoting the target token to be the mode is likely to take a prohibitively long time. The only case we find where improvements are likely is where the target token is among the first 2-3 most probable tokens according to the pretrained model. These findings suggest that REINFORCE (§5) and CMRT (§6) are likely to improve over the pre-trained model only under the best possible conditions, i.e., where the pre-trained model is "nearly" correct.

We conclude by discussing other RL practices in MT which should be avoided for practical and theoretical reasons, and briefly discuss alternative RL approaches that will allow RL to tackle a larger class of errors in pre-trained models (§7).

2 RL in Machine Translation

2.1 Setting and Notation

An MT system generates tokens y = (y_1, ..., y_n) from a vocabulary V one token at a time.
The probability of generating y_i given the preceding tokens y_{<i} and the source sentence x is P_θ(y_i | y_{<i}, x), where θ denotes the model's parameters. [...]

[...] REINFORCE is adversarial training with discrete [...]
Since the pretrained model (or policy) is peaky, exploration of other potentially more rewarding tokens will be limited, hampering convergence. Intuitively, REINFORCE increases the probabilities of successful (positively rewarding) observations, weighing updates by how rewarding they were. When sampling a handful of tokens in each context (source sentence x and generated prefix y_{<i}), [...]

[...] it here. Given a sample S, the gradient of R̃ is given by

$$\nabla \tilde{R} \;=\; \alpha \sum_{i=1}^{k} Q(y_i)\, r(y_i)\, \nabla \log P(y_i) \;-\; \mathbb{E}_{Q}[r]\,\nabla \log Z(S) \qquad (3)$$

where $Z(S) = \sum_i P(y_i)^{\alpha}$. See Appendix B.

Comparing Equations (2) and (3), the differences between REINFORCE and CMRT are [...]
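To make the comparison concrete, the following is a minimal sketch (not the authors' implementation) of the two gradient estimates for a single softmax distribution over logits: a plain Monte Carlo REINFORCE estimate (the standard form of Equation (2)) and the CMRT estimate of Equation (3). The vocabulary size, reward array, and one-step usage at the end are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_p(probs, y):
    # d/d(logits) of log softmax(logits)[y] = one_hot(y) - probs
    g = -probs.copy()
    g[y] += 1.0
    return g

def reinforce_grad(logits, reward, k=20):
    # Plain Monte Carlo policy gradient: mean of r(y) * grad log P(y)
    p = softmax(logits)
    sample = rng.choice(len(p), size=k, p=p)
    return np.mean([reward[y] * grad_log_p(p, y) for y in sample], axis=0)

def cmrt_grad(logits, reward, k=20, alpha=0.005):
    # Equation (3): alpha * sum_i Q(y_i) r(y_i) grad log P(y_i) - E_Q[r] grad log Z(S),
    # with Q(y_i) = P(y_i)^alpha / Z(S) and Z(S) = sum_i P(y_i)^alpha.
    p = softmax(logits)
    sample = rng.choice(len(p), size=k, p=p)      # S: sample with replacement
    q = p[sample] ** alpha
    q /= q.sum()                                  # Q over the sample
    r = reward[sample]
    glogp = np.stack([grad_log_p(p, y) for y in sample])
    grad_log_Z = alpha * (q[:, None] * glogp).sum(axis=0)
    return alpha * (q * r) @ glogp - (q * r).sum() * grad_log_Z

# Toy usage: one gradient-ascent step on a small vocabulary.
V = 100
logits = rng.normal(size=V)
reward = np.zeros(V)
reward[3] = 1.0                                   # pretend token 3 is the target
logits += 0.1 * reinforce_grad(logits, reward)    # or: 0.1 * cmrt_grad(logits, reward)
```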
4.1 Controlled Simulations

[...] θ = {θ_j}_{j∈V} are the model's parameters. This model simulates the top of any MT decoder that ends with a softmax layer, as essentially all NMT decoders do. To make experiments realistic, we use similar parameters to those reported for the influential Transformer NMT system (Vaswani et al., 2017). Specifically, the size of V (distinct BPE tokens) is 30715, and the initial values for θ were sampled from 1000 sets of logits taken from decoding the standard newstest2013 development set with a pretrained Transformer model. The model was pretrained on the WMT2015 training data (Bojar et al., 2015). Hyperparameters are reported in Appendix C. We define one of the tokens in V to be the target token and denote it y_best. We assign a deterministic token-level reward; this makes learning easier than when relying on approximations, and thus makes our predictions optimistic. We experiment with two reward functions:

1. Simulated Reward: r(y) = 2 for y = y_best, r(y) = 1 if y is one of the 10 initially highest-scoring tokens, and r(y) = 0 otherwise. This simulates a condition where the pretrained model is of decent but sub-optimal quality. r here is at the scale of popular rewards used in MT, such as GAN-based rewards or BLEU (which are between 0 and 1).

2. Constant Reward: r is constantly equal to 1 for all tokens. This setting aims to confirm that PKE is not a result of the signal carried by the reward.

Experiments with the first setting were run 100 times, each time for 50K steps, updating θ after each step. With the second setting, it is sufficient [...] Where REINFORCE was not close to converging after 50K steps, the same was true after 1M steps.

We evaluate the peakiness of a distribution in terms of the probability of the most probable token (the mode), the total probability of the ten most probable tokens, and the entropy of the distribution (lower entropy indicates more peakiness).

Results. The distributions become peakier in terms of all three measures: on average, the probability of the mode and of the 10 most probable tokens increases, and the entropy decreases. Figure 1a presents the histogram of the difference in the probability of the 10 most probable tokens in the Constant Reward setting, after a single step. Figure 1b depicts similar statistics for the mode. The average entropy of the pretrained model is 2.9 and is reduced to 2.85 after one REINFORCE step.

The Simulated Reward setting shows similar trends. For example, entropy decreases from 3 to about 0.001 in 100K steps. This extreme decrease suggests the policy becomes effectively deterministic. PKE is achieved in a few hundred steps, usually before other effects become prominent (see Figure 2), and is stronger than for Constant Reward.

4.2 NMT Experiments

We turn to analyzing a real-world application of REINFORCE to NMT. Important differences between this and the previous simulations are: (1) it is rare in NMT for REINFORCE to sample from the same conditional distribution more than a handful of times, given the number of source sentences x and sentence prefixes y_{<i} [...]
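The following is a minimal sketch of the kind of controlled simulation described in §4.1 above, under simplifying assumptions that are not the paper's: the initial logits are drawn from a Gaussian rather than taken from a pretrained Transformer, plain SGD with the learning rate of 0.1 reported for the simulations replaces the optimizers of Appendix C, and the step count is illustrative. It also shows the three peakiness measures (mode probability, top-10 probability, entropy).

```python
import numpy as np

rng = np.random.default_rng(0)
V = 30715            # vocabulary size reported in the paper
LR = 0.1             # the smallest LR the paper found to make progress in simulation
STEPS = 50_000       # the paper runs 50K updates; reduce for a quick check

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def peakiness(p):
    top = np.sort(p)[::-1]
    return {"mode": top[0], "top10": top[:10].sum(),
            "entropy": -(p * np.log(p + 1e-12)).sum()}

# Stand-in initialization: the paper draws initial logits from a pretrained
# Transformer decoding newstest2013; a Gaussian is used here for illustration.
logits = rng.normal(scale=2.0, size=V)
p0 = softmax(logits)

# Simulated Reward: 2 for y_best, 1 for the 10 initially most probable tokens, else 0.
top10 = np.argsort(p0)[::-1][:10]
y_best = top10[1]                 # e.g. the initially second most probable token
reward = np.zeros(V)
reward[top10] = 1.0
reward[y_best] = 2.0

print("before:", peakiness(p0))
for _ in range(STEPS):
    p = softmax(logits)
    y = rng.choice(V, p=p)        # sample a single token
    grad = -p
    grad[y] += 1.0                # grad of log P(y) w.r.t. the logits
    logits = logits + LR * reward[y] * grad   # one REINFORCE step (plain SGD)
p_final = softmax(logits)
print("after:", peakiness(p_final))
print("y_best became the mode:", int(np.argmax(p_final)) == int(y_best))
```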
Figure 1 (panels: (a) Top 10, (b) Mode): A histogram of the update size (x-axis) to the total probability of the 10 most probable tokens (left) or the most probable token (right) in the Constant Reward setting. An update is overwhelmingly more probable to increase this probability than to decrease it.

Figure 2: Token probabilities through REINFORCE training, in the controlled simulations in the Simulated Reward setting. The left/center/right figures correspond to simulations where the target token (y_best) was initially the second/third/fourth most probable token. The green line corresponds to the target token, yellow lines to medium-reward tokens and red lines to no-reward tokens.

Results. Results indicate an increase in the peakiness of the conditional distributions. Our results are based on a sample of 1000 contexts from the pretrained model, and another (independent) sample from the reinforced model.

Indeed, the modes of the conditional distributions tend to increase. Figure 3 presents the distribution of the modes' probability in the reinforced conditional distributions compared with the pretrained model, showing a shift of probability mass towards higher probabilities for the mode, following RL. Another indication of the increased peakiness is the decrease in the average entropy of P_θ, which was reduced from 3.45 in the pretrained model to an average of 2.82 following RL. This more modest reduction in entropy (compared to §4.1) might also suggest that the procedure did not converge to the optimal value for θ, as then we would have expected the entropy to substantially drop, if not to 0 (overfit), then to the average entropy of valid next tokens (given the source and a prefix of the sentence).

5 Performance following REINFORCE

We now turn to assessing under what conditions it is likely that REINFORCE will lead to an improvement in the performance of an NMT system. As in the previous section, we use both controlled simulations and NMT experiments.

5.1 Controlled Simulations

We use the same model and experimental setup described in Section 4.1, this time only exploring the Simulated Reward setting, as a Constant Reward is not expected to converge to any meaningful θ. Results are averaged over 100 conditional distributions sampled from the pretrained model.

Caution should be exercised when determining the learning rate (LR). Common LRs used in the NMT literature are of the scale of 10^-4.
However, in our simulations, no LR smaller than 0.1 yielded any improvement in R. We thus set the LR to 0.1. We note that in our simulations, a higher learning rate means faster convergence, as our reward is noise-free: it is always highest for the best option. In practice, increasing the learning rate may deteriorate results, as it may cause the system to overfit to the sampled instances. Indeed, when increasing the learning rate in our NMT experiments (see below) by an order of magnitude, early stopping caused the RL procedure to stop without any parameter updates.

Figure 2 shows how P_θ changes over the first 50K steps of REINFORCE (probabilities are averaged over 100 repetitions), for cases where y_best was initially the second, third or fourth most probable token. Although these are the easiest settings for REINFORCE, and despite the high learning rate, it fails to make y_best the mode of the distribution [...]

Figure 3: The cumulative distribution of the probability of the most likely token in the NMT experiments. The green distribution corresponds to the pretrained model, and the blue corresponds to the reinforced model. The y-axis is the proportion of conditional probabilities with a mode of value ≤ x (the x-axis). Note that a lower cumulative percentage means a more peaked output distribution.

Figure 4: Cumulative percentage of contexts where the pretrained model ranks y_best in rank x or below and where it does not rank y_best first (x = 0). In about half the cases it is ranked fourth or below.

5.2 NMT Experiments

[...] The BLEU score of the pretrained model is 30.31; the reinforced one is 30.73.

Analyzing what words were influenced by the RL procedure, we begin by computing the cumulative probability of the target token y_best being ranked lower than a given rank according to the pretrained model. Results (Figure 4) show that in about half of the cases, y_best is not among the top three choices of the pretrained model, and we thus expect it not to gain substantial probability following REINFORCE, according to our simulations.

We next turn to comparing the ranks the reinforced model assigns to the target tokens with their ranks according to the pretrained model. Figure 5 presents the difference between the probability that y_best is ranked at a given rank following RL and the probability that it is ranked there initially. Results indicate that indeed more target tokens are ranked first, and fewer second, but little consistent shift of probability mass occurs otherwise across the first ten ranks. It is possible that RL has managed to push y_best in some cases between very low ranks [...]. We thus expect that even in cases where RL yields an improvement in BLEU, it may partially result from reward-independent factors, such as PKE.⁴

⁴ We tried several other reward functions as well, all of which got BLEU scores of 30.73–30.84. This improvement is very stable across metrics, trials and pretrained models.
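A small sketch of the rank analysis behind Figures 4 and 5 is given below. It assumes hypothetical arrays of conditional distributions from the pretrained and reinforced models and a target token per context; the toy data is random and only demonstrates the shapes, whereas the paper uses the 1000 sampled contexts described above.

```python
import numpy as np

def rank_of_target(probs, targets):
    # probs: (n_contexts, vocab); rank 0 = most probable token in that context
    order = np.argsort(-probs, axis=1)
    return np.argmax(order == targets[:, None], axis=1)

def rank_histogram(ranks, max_rank=10):
    # Fraction of contexts whose target sits at each of the first ranks
    # (everything beyond max_rank is lumped into the last bin).
    return np.bincount(np.minimum(ranks, max_rank), minlength=max_rank + 1) / len(ranks)

# Toy demonstration with random distributions; in the paper these would be the
# conditional distributions of the pretrained and reinforced models over the
# same contexts, with `targets` holding y_best for each context.
rng = np.random.default_rng(0)
pretrained = rng.dirichlet(np.ones(200), size=50)
reinforced = rng.dirichlet(np.ones(200), size=50)
targets = rng.integers(0, 200, size=50)

before = rank_histogram(rank_of_target(pretrained, targets))
after = rank_histogram(rank_of_target(reinforced, targets))
print("cumulative rank distribution (cf. Figure 4):", np.cumsum(before))
print("per-rank shift (cf. Figure 5):", after - before)
```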
Figure 5: Difference between the ranks of y_best in the reinforced and the pretrained model. Each column x corresponds to the difference between the probability that y_best is ranked at rank x in the reinforced model and the same probability in the pretrained model.

6 Experiments with Contrastive MRT

In §2.3 we showed that CMRT does not, in fact, maximize R, and so does not enjoy the same theoretical guarantees as REINFORCE and similar policy gradient methods. However, it is still the RL procedure of choice in much recent work on NMT. We therefore repeat the simulations described in §4 and §5, assessing the performance of MRT in these conditions. We experiment with α = 0.005 and k = 20, common settings in the literature, and average over 100 trials.

Figure 6 shows how the distribution P_θ changes over the course of 50K update steps to θ, where y_best is taken to be the second and third initially most probable token (Simulated Reward setting). Results are similar in trend to those obtained with REINFORCE: MRT succeeds in pushing y_best to be the highest-ranked token if it was initially second, but struggles where it was initially ranked third or below. We only observe a small PKE in MRT. This is probably due to the contrastive effect, which means that tokens that were not sampled do not lose probability mass.

Figure 6: The probability of different tokens following CMRT, in the controlled simulations in the Simulated Reward setting. The left/right figures correspond to simulations where the target token (y_best) was initially the second/third most probable token. The green line corresponds to the target token, yellow lines to medium-reward tokens and red lines to tokens with r(y) = 0.

All graphs we present here allow sampling the same token more than once in each batch (i.e., S is a sample with replacement); simulations with deduplication show similar results.

7 Discussion

In this paper, we showed that the type of distributions used in NMT entails that promoting the target token to be the mode is likely to take a prohibitively long time for existing RL practices, except under the best conditions (where the pretrained model is "nearly" correct). This leads us to conclude that observed improvements from using RL for NMT are likely due either to fine-tuning the most probable tokens in the pretrained model (an effect which may be more easily achieved using reranking methods, and which uses but little of the power of RL methods), or to effects unrelated to the signal carried by the reward, such as PKE. Another contribution of this paper is in showing that CMRT does not optimize the expected reward and is thus theoretically unmotivated.

A number of reasons lead us to believe that in our NMT experiments, improvements are not due to the reward function, but to artefacts such as PKE. First, subtracting a constant baseline from r, so as to make the expected reward zero, disallows learning. This is surprising, as REINFORCE, generally and in our simulations, converges faster where the reward is centered around zero, and so the fact that this procedure disallows learning here hints that other factors are in play. As PKE can be observed even where the reward is constant (if the expected reward is positive; see §4.1), this suggests PKE may play a role here. Second, we observe more peakiness in the reinforced model, and in such cases we expect improvements in BLEU (Caccia et al., 2018). Third, we achieve similar results with a constant reward in our NMT experiments (§5.2). Fourth, our controlled simulations show that asymptotic convergence is not reached in any but the easiest conditions (§5.1).

Our analysis further suggests that gradient clipping, sometimes used in NMT (Zhang et al., 2016), is expected to hinder convergence further. It should be avoided when using REINFORCE, as it violates REINFORCE's assumptions.
The per-token sampling as done in our experiments is more exploratory than beam search (Wu et al., 2018), reducing PKE. Furthermore, the latter does not sample from the behavior policy, but does not properly account for being off-policy in the parameter updates. Adding the reference to the sample S, which some implementations allow (Sennrich et al., 2017), may help reduce the problems of never sampling the target tokens.
However, as Edunov et al. (2018) point out, this practice may lower results, as it may destabilize training by leading the model to improve over outputs it cannot generalize over, as they are very different from anything the model assigns a high probability to, at the cost of other outputs.

8 Conclusion

The standard MT scenario poses several uncommon challenges for RL. First, the action space in MT problems is a high-dimensional discrete space (generally the size of the vocabulary of the target language, or the product thereof for sentences). This contrasts with the more common scenario studied by contemporary RL methods, which focuses mostly on much smaller discrete action spaces (e.g., video games; Mnih et al., 2015, 2016) or continuous action spaces of relatively low dimension (e.g., simulation of robotic control tasks; Lillicrap et al., 2015). Second, the reward for MT is naturally very sparse: almost all possible sentences are "wrong" (hence, not rewarding) in a given context. Finally, it is common in MT to use RL for tuning a pretrained model. Using a pretrained model ameliorates the last problem. But then, these pretrained models are in general quite peaky, and because training is done on-policy – that is, actions are sampled from the same model being optimized – exploration is inherently limited.

Here we argued that, taken together, these challenges result in significant weaknesses for current RL practices for NMT, which may ultimately prevent them from being truly useful. At least some of these challenges have been widely studied in the RL literature, with numerous techniques developed to address them, but they have not yet been adopted in NLP. We turn to discuss some of them.

Off-policy methods, in which observations are sampled from a different policy than the one currently being optimized, are prominent in RL (Watkins and Dayan, 1992; Sutton and Barto, 1998), and were also studied in the context of policy gradient methods (Degris et al., 2012; Silver et al., 2014). In principle, such methods allow learning from a more "exploratory" policy. Moreover, a key motivation for using α in CMRT is smoothing; off-policy sampling allows smoothing while keeping convergence guarantees.

In its basic form, exploration in REINFORCE relies on stochasticity in the action selection (in MT, this is due to sampling). More sophisticated exploration methods have been extensively studied, for example using measures of the exploratory usefulness of states or actions (Fox et al., 2018), or relying on parameter-space noise rather than action-space noise (Plappert et al., 2017).

For MT, an additional challenge is that even effective exploration (sampling diverse sets of observations) may not be enough, since the state-action space is too large to be effectively covered, with almost all sentences being not rewarding. Recently, diversity-based and multi-goal methods for RL were proposed to tackle similar challenges (Andrychowicz et al., 2017; Ghosh et al., 2018; Eysenbach et al., 2019). We believe the adoption of such methods is a promising path forward for the application of RL in NLP.
References

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058.

Shiqi Shen Ayana, Zhiyuan Liu, and Maosong Sun. 2016. Neural headline generation with minimum risk training. arXiv preprint arXiv:1604.01904.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In WMT@EMNLP.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language GANs falling short. arXiv preprint arXiv:1811.02549.

Thomas Degris, Martha White, and Richard S. Sutton. 2012. Off-policy actor-critic. arXiv preprint arXiv:1205.4839.

Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 355–364. Association for Computational Linguistics.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. 2019. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations.

Lior Fox, Leshem Choshen, and Yonatan Loewenstein. 2018. DORA the explorer: Directed outreaching reinforcement action-selection. ICLR, abs/1804.04012.

Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. 2018. Divide-and-conquer reinforcement learning. In International Conference on Learning Representations.

Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating visual explanations. In ECCV.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, Copenhagen, Denmark. Association for Computational Linguistics.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2016. Optimization of image description metrics using policy gradient methods. CoRR, abs/1612.00370.

Peter Makarov and Simon Clematide. 2018. Neural transition-based string transduction for limited-resource setting in morphology. In COLING.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Graham Neubig. 2016. Lexicons and minimum risk training for neural machine translation: NAIST-CMU at WAT2016. In WAT@COLING.

Graham Neubig, Matthias Sperber, Xinyi Wang, Matthieu Felix, Austin Matthews, Sarguna Padmanabhan, Ye Qi, Devendra Singh Sachan, Philip Arthur, Pierre Godard, John Hewitt, Rachid Riad, and Liming Wang. 2018. XNMT: The extensible neural machine translation toolkit. In AMTA.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 160–167. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. 2017. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905.

O. Press, A. Bar, B. Bogin, J. Berant, and L. Wolf. 2017. Language generation with recurrent generative adversarial networks without pre-training. In First Workshop on Learning to Generate Natural Language@ICML.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.

Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. 2017. Grammatical error correction with neural reinforcement learning. arXiv preprint arXiv:1707.00299.

Philip Schulz, Wilker Aziz, and Trevor Cohn. 2018. A stochastic decoder for neural machine translation. In ACL.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In EACL.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1692. Association for Computational Linguistics.

Shiqi Shen, Yang Liu, and Maosong Sun. 2017. Optimizing non-decomposable evaluation metrics for neural machine translation. Journal of Computer Science and Technology, 32:796–804.

Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. 2017. Speaking the same language: Matching machine to human captions by adversarial training. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4155–4164. IEEE.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In ICML.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement learning: An introduction. MIT Press.

G. Tevet, G. Habib, V. Shwartz, and J. Berant. 2018. Evaluating text GANs as language models. arXiv preprint arXiv:1810.12686.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning, 8(3-4):279–292.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. A study of reinforcement learning for neural machine translation. In EMNLP.

Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2017. Adversarial neural machine translation. arXiv preprint arXiv:1704.06933.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Improving neural machine translation with conditional sequence generative adversarial nets. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1346–1355. Association for Computational Linguistics.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.

Jiacheng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng, Maosong Sun, Huanbo Luan, and Yang Liu. 2017. THUMT: An open source toolkit for neural machine translation. arXiv preprint arXiv:1706.06415.

Yizhe Zhang, Zhe Gan, and Lawrence Carin. 2016. Generating text via adversarial training. In NIPS Workshop on Adversarial Training, volume 21.
A Contrastive MRT does not Maximize the Expected Reward

We hereby detail a simple example in which following the Contrastive MRT method (see §2.3) does not converge to the parameter value that maximizes R.

Let θ be a real number in [0, 0.5], and let P_θ be a family of distributions over three values a, b, c such that

$$P_\theta(x) = \begin{cases} \theta & x = a \\ 2\theta^2 & x = b \\ 1 - \theta - 2\theta^2 & x = c \end{cases}$$

Let r(a) = 1, r(b) = 0, r(c) = 0.5. The expected reward as a function of θ is

$$R(\theta) = \theta + 0.5\,(1 - \theta - 2\theta^2),$$

which is uniquely maximized by θ* = 0.25.

Table 1 details the possible samples of size k = 2, their probabilities, the corresponding R̃, and its gradient. Standard numerical methods show that E[∇R̃] over possible samples S is positive for θ ∈ (0, γ) and negative for θ ∈ (γ, 0.5], where γ ≈ 0.295. This means that for any initialization of θ ∈ (0, 0.5], Contrastive MRT will converge to γ if the learning rate is sufficiently small. For θ = 0, R̃ ≡ 0.5 and there are no gradient updates, so the method converges to θ = 0. Neither of these values maximizes R(θ).

We further note that resorting to maximizing E[R̃] instead does not maximize R(θ) either. Indeed, plotting E[R̃] as a function of θ for this example yields a maximum at θ ≈ 0.32.

Table 1: The gradient of R̃ for each possible sample S (batch size k = 2). Rows correspond to the different sampled outcomes; ∇R̃ is the gradient of R̃ given the corresponding S.

S        P(S)              R̃                    ∇R̃
{a, b}   4θ³               1/(1+2θ)             −2/(1+2θ)²
{a, c}   2θ(1−θ−2θ²)       0.5 + θ/(2−4θ²)      (2θ²+1)/(2(1−2θ²)²)
{b, c}   4θ²(1−θ−2θ²)      (1−θ−2θ²)/(2−2θ)     (θ²−2θ)/(1−θ)²
{a, a}   θ²                1                    0
{b, b}   4θ⁴               0                    0
{c, c}   (1−θ−2θ²)²        0.5                  0

B Deriving the Gradient of R̃

Given S, recall the definition of R̃:

$$\tilde{R}(\theta, S) = \sum_{i=1}^{k} Q_{\theta,S}(y_i)\, r(y_i)$$

Taking the derivative w.r.t. θ:

$$\begin{aligned}
\nabla\tilde{R} &= \sum_{i=1}^{k} r(y_i)\,\frac{\alpha P(y_i)^{\alpha-1}\,\nabla P(y_i)\, Z(S) - \nabla Z(S)\, P(y_i)^{\alpha}}{Z(S)^2} \\
&= \sum_{i=1}^{k} r(y_i)\left(\frac{\alpha\,\nabla P(y_i)}{P(y_i)}\, Q(y_i) - Q(y_i)\,\frac{\nabla Z(S)}{Z(S)}\right) \\
&= \sum_{i=1}^{k} r(y_i)\, Q(y_i)\left(\alpha\,\nabla\log P(y_i) - \nabla\log Z(S)\right) \\
&= \alpha \sum_{i=1}^{k} r(y_i)\, Q(y_i)\,\nabla\log P(y_i) \;-\; \mathbb{E}_Q[r]\,\nabla\log Z(S)
\end{aligned}$$

C NMT Implementation Details

True casing and tokenization were used (Koehn et al., 2007), including escaping HTML symbols; the "-" character that represents a compound was changed into a separate token of "=". Some preprocessing applied before ours converted the latter to ##AT##-##AT##, but standard tokenizers process that into 11 different tokens, which over-represents the significance of that character when BLEU is calculated. BPE (Sennrich et al., 2016) extracted 30715 tokens. For the MT experiments we used 6 layers in the encoder and the decoder. The size of the embeddings was 512. Gradient clipping with a threshold of 5 was used for pre-training (see the Discussion on why not to use it in training). We did not use attention dropout, but a residual dropout rate of 0.1 was used. In pretraining and training, sentences of more than 50 tokens were discarded. Pretraining and training were considered finished when BLEU did not increase on the development set for 10 consecutive evaluations; evaluation was done every 1000 and 5000 batches, with batch sizes of 100 and 256, for pretraining and training respectively. The learning rate was 0.01 for RMSProp (Tieleman and Hinton, 2012) in pretraining, and 0.005 for Adam with decay (Kingma and Ba, 2015) in training. 4000 learning-rate warm-up steps were used. Pretraining took about 7 days on 4 GPUs; afterwards, training took roughly the same time. Monte Carlo used 20 sentence rolls per word.
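The counterexample of Appendix A can be checked numerically. The sketch below enumerates all samples of size k = 2 (with replacement), computes the expected gradient of R̃ with α = 1 (which reproduces the entries of Table 1), and locates where it changes sign; the grid resolution and the finite-difference step are illustrative choices, not part of the paper.

```python
import itertools
import numpy as np

def P(theta):
    # The family of distributions over {a, b, c} defined in Appendix A
    return {"a": theta, "b": 2 * theta**2, "c": 1 - theta - 2 * theta**2}

r = {"a": 1.0, "b": 0.0, "c": 0.5}

def R_tilde(theta, S, alpha=1.0):
    p = P(theta)
    Z = sum(p[y] ** alpha for y in S)
    return sum((p[y] ** alpha / Z) * r[y] for y in S)

def expected_grad(theta, k=2, eps=1e-6):
    # E_S[ d R_tilde / d theta ], enumerating all samples of size k (with replacement)
    total = 0.0
    for S in itertools.product("abc", repeat=k):
        prob_S = np.prod([P(theta)[y] for y in S])
        total += prob_S * (R_tilde(theta + eps, S) - R_tilde(theta - eps, S)) / (2 * eps)
    return total

thetas = np.linspace(0.01, 0.49, 97)
grads = np.array([expected_grad(t) for t in thetas])
R = thetas + 0.5 * (1 - thetas - 2 * thetas**2)

# The expected CMRT update changes sign near 0.295 (the fixed point gamma),
# while the true expected reward R(theta) peaks at theta* = 0.25.
print("E[grad R_tilde] changes sign near theta =", round(thetas[np.argmax(grads < 0)], 3))
print("R(theta) is maximized at theta =", round(thetas[np.argmax(R)], 3))
```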
Figure 7: The probability of different tokens following REINFORCE, in the controlled simulations in the Constant Reward setting. The left/center/right figures correspond to simulations where the target token (y_best) was initially the second/third/fourth most probable token. The green line corresponds to the target token, yellow lines to medium-reward tokens and red lines to tokens with r(y) = 0.

D Detailed Results for the Constant Reward Setting

We present graphs for the Constant Reward setting in Figures 7 and 8. Trends are similar to the ones obtained for the Simulated Reward setting.

Figure 8: Difference between the ranks of y_best in the model reinforced with a constant reward and the pretrained model. Each column x corresponds to the difference between the probability that y_best is ranked at rank x in the reinforced model and the same probability in the pretrained model.