On the Weaknesses of Reinforcement Learning for Neural Machine Translation
Leshem Choshen¹, Lior Fox¹, Zohar Aizenbud¹, Omri Abend¹,²
¹School of Computer Science and Engineering, ²Department of Cognitive Sciences
The Hebrew University of Jerusalem
{leshem.choshen, lior.fox, zohar.aizenbud}@mail.huji.ac.il, oabend@cs.huji.ac.il

arXiv:1907.01752v2 [cs.CL] 9 Nov 2019

Abstract

Reinforcement learning (RL) is frequently used to increase performance in text generation tasks, including machine translation (MT), notably through the use of Minimum Risk Training (MRT) and Generative Adversarial Networks (GAN). However, little is known about what and how these methods learn in the context of MT. We prove that one of the most common RL methods for MT does not optimize the expected reward, and show that other methods take an infeasibly long time to converge. In fact, our results suggest that RL practices in MT are likely to improve performance only where the pre-trained parameters are already close to yielding the correct translation. Our findings further suggest that observed gains may be due not to the training signal but to unrelated effects, namely changes in the shape of the output distribution.

1 Introduction

Reinforcement learning (RL) is an appealing path for advancement in Machine Translation (MT), as it allows training systems to optimize non-differentiable score functions, common in MT evaluation, and to tackle the "exposure bias" (Ranzato et al., 2015) in standard training, namely that the model is not exposed during training to incorrectly generated tokens, and is thus unlikely to recover from generating such tokens at test time. These motivations have led to much interest in RL for text generation in general and MT in particular. Various policy gradient methods have been used, notably REINFORCE
(Williams, 1992) and variants thereof (e.g., Ranzato et al., 2015; Edunov et al., 2018) and Minimum Risk Training (MRT; e.g., Och, 2003; Shen et al., 2016). Another popular use of RL is for training GANs (Yang et al., 2018; Tevet et al., 2018). See §2. Nevertheless, despite increasing interest and strong results, little is known about what accounts for these performance gains, and the training dynamics involved.

We present the following contributions. First, our theoretical analysis shows that commonly used approximation methods are theoretically ill-founded, and may converge to parameter values that do not minimize the risk, nor are local minima thereof (§2.3).

Second, using both naturalistic experiments and carefully constructed simulations, we show that performance gains observed in the literature likely stem not from making target tokens the most probable, but from unrelated effects, such as increasing the peakiness of the output distribution (i.e., the probability mass of the most probable tokens). We do so by comparing a setting where the reward is informative with one where it is constant. In §4 we discuss this peakiness effect (PKE).

Third, we show that promoting the target token to be the mode is likely to take a prohibitively long time. The only case we find where improvements are likely is where the target token is among the first 2-3 most probable tokens according to the pretrained model. These findings suggest that REINFORCE (§5) and CMRT (§6) are likely to improve over the pre-trained model only under the best possible conditions, i.e., where the pre-trained model is "nearly" correct.

We conclude by discussing other RL practices in MT which should be avoided for practical and theoretical reasons, and briefly discuss alternative RL approaches that will allow RL to tackle a larger class of errors in pre-trained models (§7).

2 RL in Machine Translation

2.1 Setting and Notation

An MT system generates tokens y = (y_1, ..., y_n) from a vocabulary V one token at a time.
The probability of generating y_i given the preceding tokens y_{<i} and the source sentence x is P_θ(y_i | y_{<i}, x), where θ denotes the model's parameters. [...]

[...] REINFORCE is adversarial training with discrete [...]
Since the pretrained model (or policy) is peaky, exploration of other potentially more rewarding tokens will be limited, hampering convergence. Intuitively, REINFORCE increases the probabilities of successful (positively rewarding) observations, weighing updates by how rewarding they were. When sampling a handful of tokens in each context (source sentence x and generated prefix y_{<i}), [...]

[...] it here. Given a sample S, the gradient of R̃ is given by

$$\nabla \tilde{R} \;=\; \alpha \sum_{i=1}^{k} Q(y_i)\, r(y_i)\, \nabla \log P(y_i) \;-\; \mathbb{E}_{Q}[r]\,\nabla \log Z(S) \qquad (3)$$

where $Z(S) = \sum_i P(y_i)^{\alpha}$. See Appendix B.

Comparing Equations (2) and (3), the differences between REINFORCE and CMRT are [...]
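To make the comparison concrete, the following is a minimal sketch (not the authors' implementation) of the two gradient estimates for a single softmax distribution over logits: a plain Monte Carlo REINFORCE estimate (the standard form of Equation (2)) and the CMRT estimate of Equation (3). The vocabulary size, reward array, and one-step usage at the end are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_p(probs, y):
    # d/d(logits) of log softmax(logits)[y] = one_hot(y) - probs
    g = -probs.copy()
    g[y] += 1.0
    return g

def reinforce_grad(logits, reward, k=20):
    # Plain Monte Carlo policy gradient: mean of r(y) * grad log P(y)
    p = softmax(logits)
    sample = rng.choice(len(p), size=k, p=p)
    return np.mean([reward[y] * grad_log_p(p, y) for y in sample], axis=0)

def cmrt_grad(logits, reward, k=20, alpha=0.005):
    # Equation (3): alpha * sum_i Q(y_i) r(y_i) grad log P(y_i) - E_Q[r] grad log Z(S),
    # with Q(y_i) = P(y_i)^alpha / Z(S) and Z(S) = sum_i P(y_i)^alpha.
    p = softmax(logits)
    sample = rng.choice(len(p), size=k, p=p)      # S: sample with replacement
    q = p[sample] ** alpha
    q /= q.sum()                                  # Q over the sample
    r = reward[sample]
    glogp = np.stack([grad_log_p(p, y) for y in sample])
    grad_log_Z = alpha * (q[:, None] * glogp).sum(axis=0)
    return alpha * (q * r) @ glogp - (q * r).sum() * grad_log_Z

# Toy usage: one gradient-ascent step on a small vocabulary.
V = 100
logits = rng.normal(size=V)
reward = np.zeros(V)
reward[3] = 1.0                                   # pretend token 3 is the target
logits += 0.1 * reinforce_grad(logits, reward)    # or: 0.1 * cmrt_grad(logits, reward)
```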
4.1 Controlled Simulations

[...] θ = {θ_j}_{j∈V} are the model's parameters. This model simulates the top of any MT decoder that ends with a softmax layer, as essentially all NMT decoders do. To make experiments realistic, we use similar parameters to those reported for the influential Transformer NMT system (Vaswani et al., 2017). Specifically, the size of V (distinct BPE tokens) is 30715, and the initial values for θ were sampled from 1000 sets of logits taken from decoding the standard newstest2013 development set with a pretrained Transformer model. The model was pretrained on the WMT2015 training data (Bojar et al., 2015). Hyperparameters are reported in Appendix C. We define one of the tokens in V to be the target token and denote it y_best. We assign a deterministic token-level reward; this makes learning easier than when relying on approximations, and thus makes our predictions optimistic. We experiment with two reward functions:

1. Simulated Reward: r(y) = 2 for y = y_best, r(y) = 1 if y is one of the 10 initially highest-scoring tokens, and r(y) = 0 otherwise. This simulates a condition where the pretrained model is of decent but sub-optimal quality. r here is at the scale of popular rewards used in MT, such as GAN-based rewards or BLEU (which are between 0 and 1).

2. Constant Reward: r is constantly equal to 1 for all tokens. This setting aims to confirm that PKE is not a result of the signal carried by the reward.

Experiments with the first setting were run 100 times, each time for 50K steps, updating θ after each step. With the second setting, it is sufficient [...] Where REINFORCE was not close to converging after 50K steps, the same was true after 1M steps.

We evaluate the peakiness of a distribution in terms of the probability of the most probable token (the mode), the total probability of the ten most probable tokens, and the entropy of the distribution (lower entropy indicates more peakiness).

Results. The distributions become peakier in terms of all three measures: on average, the probability of the mode and of the 10 most probable tokens increases, and the entropy decreases. Figure 1a presents the histogram of the difference in the probability of the 10 most probable tokens in the Constant Reward setting, after a single step. Figure 1b depicts similar statistics for the mode. The average entropy of the pretrained model is 2.9 and is reduced to 2.85 after one REINFORCE step.

The Simulated Reward setting shows similar trends. For example, entropy decreases from 3 to about 0.001 in 100K steps. This extreme decrease suggests the policy becomes effectively deterministic. PKE is achieved in a few hundred steps, usually before other effects become prominent (see Figure 2), and is stronger than for Constant Reward.

4.2 NMT Experiments

We turn to analyzing a real-world application of REINFORCE to NMT. Important differences between this and the previous simulations are: (1) it is rare in NMT for REINFORCE to sample from the same conditional distribution more than a handful of times, given the number of source sentences x and sentence prefixes y_{<i} [...]
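The following is a minimal sketch of the kind of controlled simulation described in §4.1 above, under simplifying assumptions that are not the paper's: the initial logits are drawn from a Gaussian rather than taken from a pretrained Transformer, plain SGD with the learning rate of 0.1 reported for the simulations replaces the optimizers of Appendix C, and the step count is illustrative. It also shows the three peakiness measures (mode probability, top-10 probability, entropy).

```python
import numpy as np

rng = np.random.default_rng(0)
V = 30715            # vocabulary size reported in the paper
LR = 0.1             # the smallest LR the paper found to make progress in simulation
STEPS = 50_000       # the paper runs 50K updates; reduce for a quick check

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def peakiness(p):
    top = np.sort(p)[::-1]
    return {"mode": top[0], "top10": top[:10].sum(),
            "entropy": -(p * np.log(p + 1e-12)).sum()}

# Stand-in initialization: the paper draws initial logits from a pretrained
# Transformer decoding newstest2013; a Gaussian is used here for illustration.
logits = rng.normal(scale=2.0, size=V)
p0 = softmax(logits)

# Simulated Reward: 2 for y_best, 1 for the 10 initially most probable tokens, else 0.
top10 = np.argsort(p0)[::-1][:10]
y_best = top10[1]                 # e.g. the initially second most probable token
reward = np.zeros(V)
reward[top10] = 1.0
reward[y_best] = 2.0

print("before:", peakiness(p0))
for _ in range(STEPS):
    p = softmax(logits)
    y = rng.choice(V, p=p)        # sample a single token
    grad = -p
    grad[y] += 1.0                # grad of log P(y) w.r.t. the logits
    logits = logits + LR * reward[y] * grad   # one REINFORCE step (plain SGD)
p_final = softmax(logits)
print("after:", peakiness(p_final))
print("y_best became the mode:", int(np.argmax(p_final)) == int(y_best))
```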
Figure 1 (panels: (a) Top 10, (b) Mode): A histogram of the update size (x-axis) to the total probability of the 10 most probable tokens (left) or the most probable token (right) in the Constant Reward setting. An update is overwhelmingly more probable to increase this probability than to decrease it.

Figure 2: Token probabilities through REINFORCE training, in the controlled simulations in the Simulated Reward setting. The left/center/right figures correspond to simulations where the target token (y_best) was initially the second/third/fourth most probable token. The green line corresponds to the target token, yellow lines to medium-reward tokens and red lines to no-reward tokens.

Results. Results indicate an increase in the peakiness of the conditional distributions. Our results are based on a sample of 1000 contexts from the pretrained model, and another (independent) sample from the reinforced model.

Indeed, the modes of the conditional distributions tend to increase. Figure 3 presents the distribution of the modes' probability in the reinforced conditional distributions compared with the pretrained model, showing a shift of probability mass towards higher probabilities for the mode, following RL. Another indication of the increased peakiness is the decrease in the average entropy of P_θ, which was reduced from 3.45 in the pretrained model to an average of 2.82 following RL. This more modest reduction in entropy (compared to §4.1) might also suggest that the procedure did not converge to the optimal value for θ, as then we would have expected the entropy to substantially drop, if not to 0 (overfit), then to the average entropy of valid next tokens (given the source and a prefix of the sentence).

5 Performance following REINFORCE

We now turn to assessing under what conditions it is likely that REINFORCE will lead to an improvement in the performance of an NMT system. As in the previous section, we use both controlled simulations and NMT experiments.

5.1 Controlled Simulations

We use the same model and experimental setup described in Section 4.1, this time only exploring the Simulated Reward setting, as a Constant Reward is not expected to converge to any meaningful θ. Results are averaged over 100 conditional distributions sampled from the pretrained model.

Caution should be exercised when determining the learning rate (LR). Common LRs used in the NMT literature are of the scale of 10^-4.
However, in our simulations, no LR smaller than 0.1 yielded any improvement in R. We thus set the LR to 0.1. We note that in our simulations, a higher learning rate means faster convergence, as our reward is noise-free: it is always highest for the best option. In practice, increasing the learning rate may deteriorate results, as it may cause the system to overfit to the sampled instances. Indeed, when increasing the learning rate in our NMT experiments (see below) by an order of magnitude, early stopping caused the RL procedure to stop without any parameter updates.

Figure 2 shows how P_θ changes over the first 50K steps of REINFORCE (probabilities are averaged over 100 repetitions), for cases where y_best was initially the second, third or fourth most probable token. Although these are the easiest settings for REINFORCE, and despite the high learning rate, it fails to make y_best the mode of the distribution [...]

Figure 3: The cumulative distribution of the probability of the most likely token in the NMT experiments. The green distribution corresponds to the pretrained model, and the blue corresponds to the reinforced model. The y-axis is the proportion of conditional probabilities with a mode of value ≤ x (the x-axis). Note that a lower cumulative percentage means a more peaked output distribution.

Figure 4: Cumulative percentage of contexts where the pretrained model ranks y_best in rank x or below and where it does not rank y_best first (x = 0). In about half the cases it is ranked fourth or below.

5.2 NMT Experiments

[...] The BLEU score of the pretrained model is 30.31; the reinforced one is 30.73.

Analyzing what words were influenced by the RL procedure, we begin by computing the cumulative probability of the target token y_best being ranked lower than a given rank according to the pretrained model. Results (Figure 4) show that in about half of the cases, y_best is not among the top three choices of the pretrained model, and we thus expect it not to gain substantial probability following REINFORCE, according to our simulations.

We next turn to comparing the ranks the reinforced model assigns to the target tokens with their ranks according to the pretrained model. Figure 5 presents the difference between the probability that y_best is ranked at a given rank following RL and the probability that it is ranked there initially. Results indicate that indeed more target tokens are ranked first, and fewer second, but little consistent shift of probability mass occurs otherwise across the first ten ranks. It is possible that RL has managed to push y_best in some cases between very low ranks [...]. We thus expect that even in cases where RL yields an improvement in BLEU, it may partially result from reward-independent factors, such as PKE.⁴

⁴ We tried several other reward functions as well, all of which got BLEU scores of 30.73–30.84. This improvement is very stable across metrics, trials and pretrained models.
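A small sketch of the rank analysis behind Figures 4 and 5 is given below. It assumes hypothetical arrays of conditional distributions from the pretrained and reinforced models and a target token per context; the toy data is random and only demonstrates the shapes, whereas the paper uses the 1000 sampled contexts described above.

```python
import numpy as np

def rank_of_target(probs, targets):
    # probs: (n_contexts, vocab); rank 0 = most probable token in that context
    order = np.argsort(-probs, axis=1)
    return np.argmax(order == targets[:, None], axis=1)

def rank_histogram(ranks, max_rank=10):
    # Fraction of contexts whose target sits at each of the first ranks
    # (everything beyond max_rank is lumped into the last bin).
    return np.bincount(np.minimum(ranks, max_rank), minlength=max_rank + 1) / len(ranks)

# Toy demonstration with random distributions; in the paper these would be the
# conditional distributions of the pretrained and reinforced models over the
# same contexts, with `targets` holding y_best for each context.
rng = np.random.default_rng(0)
pretrained = rng.dirichlet(np.ones(200), size=50)
reinforced = rng.dirichlet(np.ones(200), size=50)
targets = rng.integers(0, 200, size=50)

before = rank_histogram(rank_of_target(pretrained, targets))
after = rank_histogram(rank_of_target(reinforced, targets))
print("cumulative rank distribution (cf. Figure 4):", np.cumsum(before))
print("per-rank shift (cf. Figure 5):", after - before)
```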
Figure 5: Difference between the ranks of y_best in the reinforced and the pretrained model. Each column x corresponds to the difference between the probability that y_best is ranked at rank x in the reinforced model and the same probability in the pretrained model.

6 Experiments with Contrastive MRT

In §2.3 we showed that CMRT does not, in fact, maximize R, and so does not enjoy the same theoretical guarantees as REINFORCE and similar policy gradient methods. However, it is still the RL procedure of choice in much recent work on NMT. We therefore repeat the simulations described in §4 and §5, assessing the performance of MRT in these conditions. We experiment with α = 0.005 and k = 20, common settings in the literature, and average over 100 trials.

Figure 6 shows how the distribution P_θ changes over the course of 50K update steps to θ, where y_best is taken to be the second and third initially most probable token (Simulated Reward setting). Results are similar in trend to those obtained with REINFORCE: MRT succeeds in pushing y_best to be the highest-ranked token if it was initially second, but struggles where it was initially ranked third or below. We only observe a small PKE in MRT. This is probably due to the contrastive effect, which means that tokens that were not sampled do not lose probability mass.

Figure 6: The probability of different tokens following CMRT, in the controlled simulations in the Simulated Reward setting. The left/right figures correspond to simulations where the target token (y_best) was initially the second/third most probable token. The green line corresponds to the target token, yellow lines to medium-reward tokens and red lines to tokens with r(y) = 0.

All graphs we present here allow sampling the same token more than once in each batch (i.e., S is a sample with replacement); simulations with deduplication show similar results.

7 Discussion

In this paper, we showed that the type of distributions used in NMT entails that promoting the target token to be the mode is likely to take a prohibitively long time for existing RL practices, except under the best conditions (where the pretrained model is "nearly" correct). This leads us to conclude that observed improvements from using RL for NMT are likely due either to fine-tuning the most probable tokens in the pretrained model (an effect which may be more easily achieved using reranking methods, and which uses but little of the power of RL methods), or to effects unrelated to the signal carried by the reward, such as PKE. Another contribution of this paper is in showing that CMRT does not optimize the expected reward and is thus theoretically unmotivated.

A number of reasons lead us to believe that in our NMT experiments, improvements are not due to the reward function, but to artefacts such as PKE. First, subtracting a constant baseline from r, so as to make the expected reward zero, disallows learning. This is surprising, as REINFORCE, generally and in our simulations, converges faster where the reward is centered around zero, and so the fact that this procedure disallows learning here hints that other factors are in play. As PKE can be observed even where the reward is constant (if the expected reward is positive; see §4.1), this suggests PKE may play a role here. Second, we observe more peakiness in the reinforced model, and in such cases we expect improvements in BLEU (Caccia et al., 2018). Third, we achieve similar results with a constant reward in our NMT experiments (§5.2). Fourth, our controlled simulations show that asymptotic convergence is not reached in any but the easiest conditions (§5.1).

Our analysis further suggests that gradient clipping, sometimes used in NMT (Zhang et al., 2016), is expected to hinder convergence further. It should be avoided when using REINFORCE, as it violates REINFORCE's assumptions.
The per-token sampling as done in our experiments is more exploratory than beam search (Wu et al., 2018), reducing PKE. Furthermore, the latter does not sample from the behavior policy, but does not properly account for being off-policy in the parameter updates. Adding the reference to the sample S, which some implementations allow (Sennrich et al., 2017), may help reduce the problems of never sampling the target tokens.
However, as Edunov et al. (2018) point out, this practice may lower results, as it may destabilize training by leading the model to improve over outputs it cannot generalize over, as they are very different from anything the model assigns a high probability to, at the cost of other outputs.

8 Conclusion

The standard MT scenario poses several uncommon challenges for RL. First, the action space in MT problems is a high-dimensional discrete space (generally the size of the vocabulary of the target language, or the product thereof for sentences). This contrasts with the more common scenario studied by contemporary RL methods, which focuses mostly on much smaller discrete action spaces (e.g., video games; Mnih et al., 2015, 2016) or continuous action spaces of relatively low dimension (e.g., simulation of robotic control tasks; Lillicrap et al., 2015). Second, the reward for MT is naturally very sparse: almost all possible sentences are "wrong" (hence, not rewarding) in a given context. Finally, it is common in MT to use RL for tuning a pretrained model. Using a pretrained model ameliorates the last problem. But then, these pretrained models are in general quite peaky, and because training is done on-policy – that is, actions are sampled from the same model being optimized – exploration is inherently limited.

Here we argued that, taken together, these challenges result in significant weaknesses for current RL practices for NMT, which may ultimately prevent them from being truly useful. At least some of these challenges have been widely studied in the RL literature, with numerous techniques developed to address them, but they have not yet been adopted in NLP. We turn to discuss some of them.

Off-policy methods, in which observations are sampled from a different policy than the one currently being optimized, are prominent in RL (Watkins and Dayan, 1992; Sutton and Barto, 1998), and were also studied in the context of policy gradient methods (Degris et al., 2012; Silver et al., 2014). In principle, such methods allow learning from a more "exploratory" policy. Moreover, a key motivation for using α in CMRT is smoothing; off-policy sampling allows smoothing while keeping convergence guarantees.

In its basic form, exploration in REINFORCE relies on stochasticity in the action selection (in MT, this is due to sampling). More sophisticated exploration methods have been extensively studied, for example using measures of the exploratory usefulness of states or actions (Fox et al., 2018), or relying on parameter-space noise rather than action-space noise (Plappert et al., 2017).

For MT, an additional challenge is that even effective exploration (sampling diverse sets of observations) may not be enough, since the state-action space is too large to be effectively covered, with almost all sentences being not rewarding. Recently, diversity-based and multi-goal methods for RL were proposed to tackle similar challenges (Andrychowicz et al., 2017; Ghosh et al., 2018; Eysenbach et al., 2019). We believe the adoption of such methods is a promising path forward for the application of RL in NLP.
References

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058.

Shiqi Shen Ayana, Zhiyuan Liu, and Maosong Sun. 2016. Neural headline generation with minimum risk training. arXiv preprint arXiv:1604.01904.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In WMT@EMNLP.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language GANs falling short. arXiv preprint arXiv:1811.02549.

Thomas Degris, Martha White, and Richard S. Sutton. 2012. Off-policy actor-critic. arXiv preprint arXiv:1205.4839.

Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 355–364. Association for Computational Linguistics.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. 2019. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations.

Lior Fox, Leshem Choshen, and Yonatan Loewenstein. 2018. DORA the explorer: Directed outreaching reinforcement action-selection. ICLR, abs/1804.04012.

Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. 2018. Divide-and-conquer reinforcement learning. In International Conference on Learning Representations.

Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating visual explanations. In ECCV.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, Copenhagen, Denmark. Association for Computational Linguistics.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2016. Optimization of image description metrics using policy gradient methods. CoRR, abs/1612.00370.

Peter Makarov and Simon Clematide. 2018. Neural transition-based string transduction for limited-resource setting in morphology. In COLING.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Graham Neubig. 2016. Lexicons and minimum risk training for neural machine translation: NAIST-CMU at WAT2016. In WAT@COLING.

Graham Neubig, Matthias Sperber, Xinyi Wang, Matthieu Felix, Austin Matthews, Sarguna Padmanabhan, Ye Qi, Devendra Singh Sachan, Philip Arthur, Pierre Godard, John Hewitt, Rachid Riad, and Liming Wang. 2018. XNMT: The extensible neural machine translation toolkit. In AMTA.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 160–167. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. 2017. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905.

O. Press, A. Bar, B. Bogin, J. Berant, and L. Wolf. 2017. Language generation with recurrent generative adversarial networks without pre-training. In First Workshop on Learning to Generate Natural Language@ICML.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.

Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. 2017. Grammatical error correction with neural reinforcement learning. arXiv preprint arXiv:1707.00299.

Philip Schulz, Wilker Aziz, and Trevor Cohn. 2018. A stochastic decoder for neural machine translation. In ACL.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In EACL.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1692. Association for Computational Linguistics.

Shiqi Shen, Yang Liu, and Maosong Sun. 2017. Optimizing non-decomposable evaluation metrics for neural machine translation. Journal of Computer Science and Technology, 32:796–804.

Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. 2017. Speaking the same language: Matching machine to human captions by adversarial training. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4155–4164. IEEE.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In ICML.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement learning: An introduction. MIT Press.

G. Tevet, G. Habib, V. Shwartz, and J. Berant. 2018. Evaluating text GANs as language models. arXiv preprint arXiv:1810.12686.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning, 8(3-4):279–292.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. A study of reinforcement learning for neural machine translation. In EMNLP.

Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2017. Adversarial neural machine translation. arXiv preprint arXiv:1704.06933.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Improving neural machine translation with conditional sequence generative adversarial nets. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1346–1355. Association for Computational Linguistics.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.

Jiacheng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng, Maosong Sun, Huanbo Luan, and Yang Liu. 2017. THUMT: An open source toolkit for neural machine translation. arXiv preprint arXiv:1706.06415.

Yizhe Zhang, Zhe Gan, and Lawrence Carin. 2016. Generating text via adversarial training. In NIPS Workshop on Adversarial Training, volume 21.
A Contrastive MRT does not Maximize the Expected Reward

We hereby detail a simple example in which following the Contrastive MRT method (see §2.3) does not converge to the parameter value that maximizes R.

Let θ be a real number in [0, 0.5], and let P_θ be a family of distributions over three values a, b, c such that

$$P_\theta(x) = \begin{cases} \theta & x = a \\ 2\theta^2 & x = b \\ 1 - \theta - 2\theta^2 & x = c \end{cases}$$

Let r(a) = 1, r(b) = 0, r(c) = 0.5. The expected reward as a function of θ is

$$R(\theta) = \theta + 0.5\,(1 - \theta - 2\theta^2),$$

which is uniquely maximized by θ* = 0.25.

Table 1 details the possible samples of size k = 2, their probabilities, the corresponding R̃, and its gradient. Standard numerical methods show that E[∇R̃] over possible samples S is positive for θ ∈ (0, γ) and negative for θ ∈ (γ, 0.5], where γ ≈ 0.295. This means that for any initialization of θ ∈ (0, 0.5], Contrastive MRT will converge to γ if the learning rate is sufficiently small. For θ = 0, R̃ ≡ 0.5 and there are no gradient updates, so the method converges to θ = 0. Neither of these values maximizes R(θ).

We further note that resorting to maximizing E[R̃] instead does not maximize R(θ) either. Indeed, plotting E[R̃] as a function of θ for this example yields a maximum at θ ≈ 0.32.

Table 1: The gradient of R̃ for each possible sample S (batch size k = 2). Rows correspond to the different sampled outcomes; ∇R̃ is the gradient of R̃ given the corresponding S.

S        P(S)              R̃                    ∇R̃
{a, b}   4θ³               1/(1+2θ)             −2/(1+2θ)²
{a, c}   2θ(1−θ−2θ²)       0.5 + θ/(2−4θ²)      (2θ²+1)/(2(1−2θ²)²)
{b, c}   4θ²(1−θ−2θ²)      (1−θ−2θ²)/(2−2θ)     (θ²−2θ)/(1−θ)²
{a, a}   θ²                1                    0
{b, b}   4θ⁴               0                    0
{c, c}   (1−θ−2θ²)²        0.5                  0

B Deriving the Gradient of R̃

Given S, recall the definition of R̃:

$$\tilde{R}(\theta, S) = \sum_{i=1}^{k} Q_{\theta,S}(y_i)\, r(y_i)$$

Taking the derivative w.r.t. θ:

$$\begin{aligned}
\nabla\tilde{R} &= \sum_{i=1}^{k} r(y_i)\,\frac{\alpha P(y_i)^{\alpha-1}\,\nabla P(y_i)\, Z(S) - \nabla Z(S)\, P(y_i)^{\alpha}}{Z(S)^2} \\
&= \sum_{i=1}^{k} r(y_i)\left(\frac{\alpha\,\nabla P(y_i)}{P(y_i)}\, Q(y_i) - Q(y_i)\,\frac{\nabla Z(S)}{Z(S)}\right) \\
&= \sum_{i=1}^{k} r(y_i)\, Q(y_i)\left(\alpha\,\nabla\log P(y_i) - \nabla\log Z(S)\right) \\
&= \alpha \sum_{i=1}^{k} r(y_i)\, Q(y_i)\,\nabla\log P(y_i) \;-\; \mathbb{E}_Q[r]\,\nabla\log Z(S)
\end{aligned}$$

C NMT Implementation Details

True casing and tokenization were used (Koehn et al., 2007), including escaping HTML symbols; the "-" character that represents a compound was changed into a separate token of "=". Some preprocessing applied before ours converted the latter to ##AT##-##AT##, but standard tokenizers process that into 11 different tokens, which over-represents the significance of that character when BLEU is calculated. BPE (Sennrich et al., 2016) extracted 30715 tokens. For the MT experiments we used 6 layers in the encoder and the decoder. The size of the embeddings was 512. Gradient clipping with a threshold of 5 was used for pre-training (see the Discussion on why not to use it in training). We did not use attention dropout, but a residual dropout rate of 0.1 was used. In pretraining and training, sentences of more than 50 tokens were discarded. Pretraining and training were considered finished when BLEU did not increase on the development set for 10 consecutive evaluations; evaluation was done every 1000 and 5000 batches, with batch sizes of 100 and 256, for pretraining and training respectively. The learning rate was 0.01 for RMSProp (Tieleman and Hinton, 2012) in pretraining, and 0.005 for Adam with decay (Kingma and Ba, 2015) in training. 4000 learning-rate warm-up steps were used. Pretraining took about 7 days on 4 GPUs; afterwards, training took roughly the same time. Monte Carlo used 20 sentence rolls per word.
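The counterexample of Appendix A can be checked numerically. The sketch below enumerates all samples of size k = 2 (with replacement), computes the expected gradient of R̃ with α = 1 (which reproduces the entries of Table 1), and locates where it changes sign; the grid resolution and the finite-difference step are illustrative choices, not part of the paper.

```python
import itertools
import numpy as np

def P(theta):
    # The family of distributions over {a, b, c} defined in Appendix A
    return {"a": theta, "b": 2 * theta**2, "c": 1 - theta - 2 * theta**2}

r = {"a": 1.0, "b": 0.0, "c": 0.5}

def R_tilde(theta, S, alpha=1.0):
    p = P(theta)
    Z = sum(p[y] ** alpha for y in S)
    return sum((p[y] ** alpha / Z) * r[y] for y in S)

def expected_grad(theta, k=2, eps=1e-6):
    # E_S[ d R_tilde / d theta ], enumerating all samples of size k (with replacement)
    total = 0.0
    for S in itertools.product("abc", repeat=k):
        prob_S = np.prod([P(theta)[y] for y in S])
        total += prob_S * (R_tilde(theta + eps, S) - R_tilde(theta - eps, S)) / (2 * eps)
    return total

thetas = np.linspace(0.01, 0.49, 97)
grads = np.array([expected_grad(t) for t in thetas])
R = thetas + 0.5 * (1 - thetas - 2 * thetas**2)

# The expected CMRT update changes sign near 0.295 (the fixed point gamma),
# while the true expected reward R(theta) peaks at theta* = 0.25.
print("E[grad R_tilde] changes sign near theta =", round(thetas[np.argmax(grads < 0)], 3))
print("R(theta) is maximized at theta =", round(thetas[np.argmax(R)], 3))
```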
Figure 7: The probability of different tokens following REINFORCE, in the controlled simulations in the Constant Reward setting. The left/center/right figures correspond to simulations where the target token (y_best) was initially the second/third/fourth most probable token. The green line corresponds to the target token, yellow lines to medium-reward tokens and red lines to tokens with r(y) = 0.

D Detailed Results for the Constant Reward Setting

We present graphs for the Constant Reward setting in Figures 7 and 8. Trends are similar to the ones obtained for the Simulated Reward setting.

Figure 8: Difference between the ranks of y_best in the model reinforced with a constant reward and the pretrained model. Each column x corresponds to the difference between the probability that y_best is ranked at rank x in the reinforced model and the same probability in the pretrained model.