Nested-Wasserstein Self-Imitation Learning for Sequence Generation
Ruiyi Zhang (Duke University), Changyou Chen (University at Buffalo), Zhe Gan (Microsoft Dynamics 365 AI Research), Zheng Wen (DeepMind), Wenlin Wang (Duke University), Lawrence Carin (Duke University). Contact: ryzhang@cs.duke.edu

arXiv:2001.06944v1 [cs.CL] 20 Jan 2020. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

Abstract

Reinforcement learning (RL) has been widely studied for improving sequence-generation models. However, the conventional rewards used for RL training typically cannot capture sufficient semantic information and therefore render model bias. Further, the sparse and delayed rewards make RL exploration inefficient. To alleviate these issues, we propose the concept of nested-Wasserstein distance for distributional semantic matching. To further exploit it, a novel nested-Wasserstein self-imitation learning framework is developed, encouraging the model to exploit historical high-rewarded sequences for enhanced exploration and better semantic matching. Our solution can be understood as approximately executing proximal policy optimization with Wasserstein trust-regions. Experiments on a variety of unconditional and conditional sequence-generation tasks demonstrate the proposed approach consistently leads to improved performance.

1 Introduction

Sequence generation is an important research topic in machine learning, covering a wide range of applications, including machine translation [Bahdanau et al., 2015, Cho et al., 2014, Sutskever et al., 2014], image captioning [Anderson et al., 2017, Vinyals et al., 2015, Xu et al., 2015], and text summarization [Paulus et al., 2017, Rush et al., 2015]. Standard sequence generation follows an auto-regressive model design under maximum likelihood estimation (MLE) learning [Huszár, 2015, Sutskever et al., 2014, Wiseman and Rush, 2016]. That is, models are trained to maximize the expected log-likelihood of the next word conditioned on its preceding ground-truth partial sentence. However, when testing, the generated partial sequence is fed to the generator to draw the next token. Such a discrepancy between training and testing, commonly known as exposure bias, leads to accumulated approximation errors along the sequence-generation trajectory [Bengio et al., 2015, Ranzato et al., 2016].

To address exposure bias, reinforcement learning (RL) techniques have been introduced [Ranzato et al., 2016]. Unlike MLE, which only leverages training examples, RL can also exploit samples drawn from the current policy. Improvements are gained from reinforcing the training towards more-plausible generations, typically based on a user-specified reward function [Ranzato et al., 2016, Yu et al., 2017]. However, the manually designed rewards often target specific desirable properties in sequence generation (e.g., matching n-gram overlap between generated sequences and ground-truth references), which unintentionally induces extra bias and is often criticized as a bad proxy for human evaluation [Wang et al., 2018a, Hu et al., 2019]. Concerns have also been raised w.r.t. efficient exploration in sequence generation. In existing RL-based methods for sequence generation [Bahdanau et al., 2017, Ranzato et al., 2016, Rennie et al., 2016], all experiences are treated as equivalent. However, merely relying on policy samples to explore often leads to forgetting a high-reward trajectory, unless it can be re-sampled frequently [Liang et al., 2018]. This problem becomes severe in the sparse-reward setting in sequence generation, i.e., the reward is only available after the whole sentence is generated.

Motivated by the above observations, we present a novel nested-Wasserstein Self-Imitation Learning (WSIL) framework for sequence generation. Specifically, we propose the nested-Wasserstein distance, a generalization of the Wasserstein distance, and exploit it to measure distance between the behavior policy and the artificial policy defined by the replay buffer to encourage self-imitation. The nested-Wasserstein distance is well suited for distributional semantic matching between two (sequence) distributions whose samples are still discrete distributions, as in the case of sequence generation. The proposed WSIL is inspired by and derived from the policy optimization with Wasserstein trust-regions [Zhang et al., 2018b]. It provides a novel reward function to match the generated sequences with the high-reward sequences in the replay buffer, encouraging distributional semantic matching rather than simple n-gram overlapping.

The main contributions of this paper are summarized as follows. (i) A novel nested-Wasserstein self-imitation learning framework is developed for sequence generation, exploiting historical good explorations for better future exploration. (ii) A novel nested-Wasserstein distance is introduced for sequence generation via distributional semantic matching, effectively alleviating the model training bias imposed by conventional rewards. (iii) Extensive empirical evaluation is performed on both unconditional and conditional text generation tasks, demonstrating consistent performance improvement over existing state-of-the-art approaches.
2 Background

Sequence-generation model. We consider the problem of discrete sequence generation, which learns to generate a sequence Y = (y_1, ..., y_T) \in \mathcal{Y} of length T, possibly conditioned on context X. Here each y_t is a token from vocabulary \mathcal{A}. Pairs (X, Y) are used for training a sequence-generation model. We are particularly interested in applications to text generation, where Y is a sentence and each y_t is a word. Starting from the initial state s_0, a recurrent neural network (RNN) produces a sequence of states (s_1, ..., s_T) given an input sequence-feature representation (e(y_1), ..., e(y_T)), where e(\cdot) denotes a word embedding mapping a token to its d-dimensional feature representation. The states are recursively updated with a function known as the cell: s_t = h_\theta(s_{t-1}, e(y_t)), where \theta denotes the model parameters. Popular implementations include Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] and the Gated Recurrent Unit (GRU) [Cho et al., 2014]. In order to generate a sequence Y^s from a (trained) model, one iteratively applies the following operations:

    y^s_{t+1} \sim \text{Multi}(\text{softmax}(g(s_t))),   (1)
    s_t = h(s_{t-1}, e(y^s_t)),   (2)

where Multi(\cdot) denotes a multinomial distribution. In conditional generation, s_0 is initialized with Enc(X), where Enc(\cdot) encodes the relevant information from the context [Bahdanau et al., 2017, Cho et al., 2014]. For unconditional generation, one typically draws s_0 from a standard Gaussian distribution.

Sequence generation as an RL problem. Sequence generation can be considered as an RL problem with deterministic state transitions and sparse rewards. It can be formulated as a Markov decision process (MDP) M = \langle S, A, P, r \rangle, where S is the state space, A is the action space, P is the deterministic environment dynamics, and r(s, y) is a reward function. The policy \pi_\theta, parameterized by \theta, maps each state s \in S to a probability distribution over A. The objective is to maximize the expected reward, defined as:

    J(\pi_\theta) = \mathbb{E}_{Y \sim \pi_\theta}[r(Y)] = \sum_{t=1}^{T} \mathbb{E}_{(s_t, y_t) \sim \pi_\theta}[r(s_t, y_t)],   (3)

where Y \triangleq (s_1, y_1, ..., s_T, y_T) is a trajectory from policy \pi_\theta with y_t \in A, r(Y) represents the reward for a sentence Y, and r(s_t, y_t) is the step-wise reward. RL seeks to learn an optimal policy that maximizes the expected total reward J(\pi_\theta).

Optimal transport on discrete domains. The optimal transport (OT) distance W_c(\mu, \nu) is a discrepancy score that measures the distance between two probability distributions \mu(\cdot) and \nu(\cdot) w.r.t. a cost function c(\cdot, \cdot). Specifically, we consider two discrete distributions \mu \triangleq \sum_{i=1}^{n} u_i \delta_{z_i} and \nu \triangleq \sum_{j=1}^{m} v_j \delta_{z'_j}, with \delta_z the Dirac delta function centered on z. The weight vectors u = \{u_i\}_{i=1}^{n} \in \Delta_n and v = \{v_j\}_{j=1}^{m} \in \Delta_m respectively belong to the n- and m-dimensional simplex, i.e., \sum_{i=1}^{n} u_i = \sum_{j=1}^{m} v_j = 1. Accordingly, the Wasserstein distance is equivalent to solving the following minimization problem:

    W_c(\mu, \nu) = \min_{\mathbf{T} \in \Gamma(\mu, \nu)} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathbf{T}_{ij} \cdot c(z_i, z'_j) = \min_{\mathbf{T} \in \Gamma(\mu, \nu)} \langle \mathbf{T}, \mathbf{C} \rangle,   (4)

where \Gamma(\mu, \nu) denotes the set of transport plans satisfying the marginal constraints \sum_{j=1}^{m} \mathbf{T}_{ij} = u_i and \sum_{i=1}^{n} \mathbf{T}_{ij} = v_j, \langle \cdot, \cdot \rangle represents the Frobenius dot-product, and \mathbf{C} is the cost matrix defined by \mathbf{C}_{ij} = c(z_i, z'_j). Intuitively, the OT distance is the minimal cost of transporting mass from \mu to \nu.
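To make the transport-plan formulation in (4) concrete, the following minimal NumPy/SciPy sketch (ours, not from the paper) solves the discrete OT problem exactly as a linear program for two small point clouds with uniform weights; the function name `ot_distance` and the toy points are our own. The paper itself relies on the faster IPOT approximation introduced in Section 3, so this is only an illustration of the constraints in (4).

import numpy as np
from scipy.optimize import linprog

def ot_distance(Z, Zp, cost):
    """Solve the discrete OT problem in (4) exactly as a linear program.

    Z:  (n, d) support points of mu, with uniform weights 1/n
    Zp: (m, d) support points of nu, with uniform weights 1/m
    cost: callable c(z, z') -> scalar transport cost
    """
    n, m = len(Z), len(Zp)
    C = np.array([[cost(z, zp) for zp in Zp] for z in Z])  # cost matrix C_ij

    # Decision variable: vectorized transport plan T with n*m entries.
    # Equality constraints enforce the row marginals (1/n) and column marginals (1/m).
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0          # sum_j T_ij = 1/n
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                   # sum_i T_ij = 1/m
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    T = res.x.reshape(n, m)                        # optimal plan T*
    return float((T * C).sum()), T

# Toy usage with two clouds of 2-D "embeddings".
mu_pts = np.array([[0.0, 0.0], [1.0, 0.0]])
nu_pts = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
dist, plan = ot_distance(mu_pts, nu_pts, lambda a, b: np.linalg.norm(a - b))
print(dist)

The exact linear program is only practical for very small supports; it is shown here purely to make the marginal constraints explicit.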
3 Distributional Semantic Matching

We first consider evaluating a sentence from syntactic and semantic perspectives. Conventional metric rewards (e.g., BLEU) can capture the syntactic structure better, since exact matching of words (or short phrases) to the reference sequences is encouraged, which induces strong bias in many cases. As such, we focus on semantic matching and propose the nested-Wasserstein distance, which defines the distance between two sequence distributions. The nested-Wasserstein distance provides a natural way to manifest semantic matching compared with the conventional rewards used in existing RL-based sequence models. Alternatively, we can train a discriminator to learn the reward model, but empirically it only rewards high-quality generations, even though they may be characterized by mode collapse [He et al., 2019]. This undermines diversity, an important aspect in evaluation.

To better understand the issue, consider the example on sequence matching in Table 1. One can also use a naive way of semantic matching, i.e., measuring a distance between average word embeddings. It is clear that while the first candidate sentence has a similar syntactic structure to the reference, the second candidate sentence is more semantically consistent with the reference. However, popular hard-matching metrics [Papineni et al., 2002, Vedantam et al., 2015] and the naive method consistently indicate the first candidate is a better match to the reference. The above contradiction can be alleviated if the reward metric is more semantic-aware. So motivated, the remainder of this section is devoted to a discussion of the design and implementation of Wasserstein rewards. The general idea is to match the semantic features via minimizing the Wasserstein distance between hypothesis sentences and their references in the semantic space. A nested version of the Wasserstein distance arises when integrating the distributional semantic matching into the objective of sequence-distribution matching.

Table 1: Comparison of different rewards at the sequence level (higher is better).
Reference: There are six freshmen playing football.
Candidate 1 (C1): There are six freshmen reading papers.
Candidate 2 (C2): Six freshmen are playing soccer.

        BLEU    ROUGE-L    CIDEr    Naive    Wasserstein
C1      36.8    50.0       163.7    84.1     76.3
C2      0.0     35.8       55.9     42.5     80.1

Figure 1: Illustration of the nested-Wasserstein distance (W_{nc}) over distributions of sequences (P_Y), showing how the distance is defined in a nested manner to measure the distance between sequence distributions; c_{cos} is the word ground metric and W_c is the sequence ground metric. The top of the figure illustrates the Wasserstein reward when comparing two candidate sentences with a reference sentence, which automatically matches semantically similar words; dominant edges, determined by the optimal transport matrix T, are shown in dark blue.

Definition 1 (Wasserstein Distance between Sequence Pairs). Consider sequence Y = (y_1, ..., y_T) as a discrete distribution p_Y = \frac{1}{T} \sum_t \delta_{e(y_t)} in the semantic space, with the length-normalized point mass placed at the word embedding z_t = e(y_t) of each token y_t from the sequence Y. Given a hypothesis sequence Y w.r.t. a reference sequence Y', we define the Wasserstein distance as W_c(p_Y, p_{Y'}) \triangleq \min_{\mathbf{T}} \langle \mathbf{T}, \mathbf{C} \rangle between p_Y and p_{Y'} with cost c(z, z'). When the cosine distance c_{cos}(z, z') = 1 - \frac{z^\top z'}{\|z\|_2 \|z'\|_2} is used as our cost, we define the Wasserstein reward as r_s(Y, Y') \triangleq \langle \mathbf{T}^*, 1 - \mathbf{C} \rangle, where \mathbf{T}^* is the optimal transport matrix.

Nested-Wasserstein distance. Our ultimate goal is to measure the distance between two policy distributions instead of sequence pairs. Given two sets of sequences from two policies, one aims to incorporate the semantic information between sequences into the distance measure. To this end, we propose the nested-Wasserstein distance in Definition 2. Figure 1 illustrates the nested-Wasserstein distance, considering both word- and sequence-level matching with the Wasserstein distance.

Definition 2 (Nested-Wasserstein Distance). Consider two sets of sequences \mathcal{Y} = \{Y_i\}_{i=1}^{K} and \mathcal{Y}' = \{Y'_j\}_{j=1}^{K'} drawn from two sequence distributions P_Y and P_{Y'}, where K and K' are the numbers of sequences in \mathcal{Y} and \mathcal{Y}'. The nested-Wasserstein distance, denoted W_{nc}(P_Y, P_{Y'}), is a metric measuring the distance between P_Y and P_{Y'} defined in a nested manner:

    W_{nc}(P_Y, P_{Y'}) \triangleq \min_{\mathbf{T}^s} \sum_{i=1}^{K} \sum_{j=1}^{K'} \mathbf{T}^s_{ij} W_c(p_{Y_i}, p_{Y'_j}),   (5)

where \mathbf{T}^s_{ij} \geq 0 satisfies \sum_{j} \mathbf{T}^s_{ij} = \frac{1}{K} and \sum_{i} \mathbf{T}^s_{ij} = \frac{1}{K'}, and W_c(\cdot, \cdot) denotes the c-Wasserstein distance defined in (4).

Remark 1. The word "nested" comes from the definition in (5), which essentially consists of two nested levels of Wasserstein distances. The proposed nested-Wasserstein distance brings in the semantic information via the distance measure W_c in the first-level distance. Note that we have omitted the expectation over samples in (5) for simplicity, as we essentially use a single set of samples to approximate W_{nc}(\cdot, \cdot) in algorithms.

Sample-based estimation of the nested-Wasserstein distance. Computing the exact Wasserstein distance is computationally intractable [Arjovsky et al., 2017, Genevay et al., 2018, Salimans et al., 2018], let alone the proposed nested-Wasserstein distance. Fortunately, we can employ the recently proposed IPOT algorithm [Xie et al., 2018] to obtain an efficient approximation. Specifically, IPOT applies proximal gradient descent to solve for the optimal transport matrix \mathbf{T} via iterative optimization, i.e., \mathbf{T}^{(t+1)} = \arg\min_{\mathbf{T} \in \Pi(\mu, \nu)} \{\langle \mathbf{T}, \mathbf{C} \rangle + \gamma \cdot D_{KL}(\mathbf{T}, \mathbf{T}^{(t)})\}, where 1/\gamma > 0 is the generalized step size and the generalized KL-divergence D_{KL}(\mathbf{T}, \mathbf{T}^{(t)}) = \sum_{i,j} \mathbf{T}_{ij} \log \frac{\mathbf{T}_{ij}}{\mathbf{T}^{(t)}_{ij}} - \sum_{i,j} \mathbf{T}_{ij} + \sum_{i,j} \mathbf{T}^{(t)}_{ij} is used as the proximity metric. Standard Sinkhorn iterations [Cuturi, 2013] are used to solve the above sub-problem. IPOT was designed to approximately calculate the standard Wasserstein distance; here we extend it to calculate the nested-Wasserstein distance by applying IPOT twice in a nested manner, i.e., at the sequence and distribution levels, respectively. The full IPOT procedure is summarized as Algorithm 2 in Appendix B.
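The sketch below is one plausible NumPy rendering of the IPOT scheme of Algorithm 2 (Appendix B), used here to compute the sequence-level Wasserstein reward r_s(Y, Y') = <T*, 1 - C> under the cosine cost, with each sentence treated as a uniform distribution over its word embeddings as in Definition 1. The function names, the value of gamma, and the fixed iteration counts are our choices, not the paper's.

import numpy as np

def cosine_cost(Z, Zp):
    """Pairwise cosine cost c_cos(z, z') = 1 - <z, z'> / (||z|| ||z'||)."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Zpn = Zp / np.linalg.norm(Zp, axis=1, keepdims=True)
    return 1.0 - Zn @ Zpn.T

def ipot_plan(C, gamma=1.0, n_outer=50, n_inner=1):
    """Approximate the optimal transport plan for cost matrix C with IPOT:
    proximal-point iterations, each solved by a few Sinkhorn updates."""
    n, m = C.shape
    A = np.exp(-C / gamma)               # kernel of the proximal sub-problem
    T = np.ones((n, m))                  # T^(1)
    sigma = np.full(m, 1.0 / m)
    for _ in range(n_outer):
        Q = A * T                        # Hadamard product
        for _ in range(n_inner):         # inner Sinkhorn iterations
            delta = 1.0 / (n * (Q @ sigma))
            sigma = 1.0 / (m * (Q.T @ delta))
        T = np.diag(delta) @ Q @ np.diag(sigma)
    return T

def wasserstein_reward(emb_hyp, emb_ref):
    """Sequence-level reward r_s(Y, Y') = <T*, 1 - C> under the cosine cost.

    emb_hyp, emb_ref: (T, d) word-embedding matrices of the two sentences,
    each viewed as a uniform discrete distribution over its tokens.
    """
    C = cosine_cost(emb_hyp, emb_ref)
    T = ipot_plan(C)
    return float((T * (1.0 - C)).sum())

# Toy usage with random "word embeddings".
rng = np.random.default_rng(0)
print(wasserstein_reward(rng.normal(size=(6, 300)), rng.normal(size=(8, 300))))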
4 Nested-Wasserstein Self-Imitation Learning

Purely adopting the nested-Wasserstein distance as the reward in a standard policy-gradient method is not effective, because the syntactic information is missing. Specifically, we consider sequences generated from a conditional behavior policy \pi_{\theta,X}, parameterized by \theta with the conditional variable X. For example, in image captioning, each sequence is generated conditioned on a given image. For unconditional generation, the conditional variable is empty. Instead of combining the rewards with different weights [Liu et al., 2017, Pasunuru et al., 2017], we present the nested-Wasserstein Self-Imitation Learning (WSIL) framework, which provides a novel way of leveraging both syntactic (metric) and semantic (Wasserstein) information.

The overall idea of the proposed nested-Wasserstein self-imitation learning is to define a Wasserstein trust-region between the current policy (a.k.a. behavior policy) and the artificial policy defined by the replay buffer. Intuitively, the Wasserstein trust-region encourages the self-imitation of historical high-reward sequences, which provides semantic signals to guide training, in addition to the stabilizing effect from trust-region optimization. Furthermore, a replay buffer is used to store high-reward historical sequences, whose induced conditional policy is denoted \pi_{B,X}. Our new objective function with a Wasserstein trust-region is defined as:

    J(\pi_\theta) = \mathbb{E}_{X \sim p_d}\{\mathbb{E}_{Y^s \sim \pi_{\theta,X}}[r(Y^s)] - \lambda \cdot W_{nc}(\pi_{\theta,X}, \pi_{B,X})\},   (6)

where W_{nc} is the nested-Wasserstein distance defined in Definition 2, and r(\cdot) can be a metric reward between Y^s and the ground-truth references \mathcal{Y}. With a little abuse of notation, but for conciseness, we use \pi_\theta to denote both the policy and the distribution over the sequences. Distinct from classic trust-region policy optimization, which defines the trust region based on KL-divergence [Schulman et al., 2015], WSIL defines the trust region based on the nested-Wasserstein distance between the behavior policy \pi_{\theta,X} and the artificial policy \pi_{B,X}. Note when K = K' = 1, the nested-Wasserstein distance degenerates to the definition of the Wasserstein distance between two sequences.

Figure 2: Illustration of the proposed nested-Wasserstein Self-Imitation Learning (WSIL) framework, where Wasserstein self-imitation rewards are defined to encourage the generator to imitate samples from the replay buffer. The standard RL framework is given in the gray dotted box.

Remark 2 (Unconditional Generation). By considering samples (features) themselves as discrete distributions, we replace the mean square difference over features of sequence pairs, i.e., the Euclidean norm, with the Wasserstein distance. Then for the distributions of sequences, we again adopt the Wasserstein distance as in WGAN [Arjovsky et al., 2017] but in the discrete domain. Thus, the Wasserstein distance is defined in a nested manner.

Remark 3 (Conditional Generation). We replace the exact matching of sequence pairs with metric rewards in RL training with the Wasserstein distance. In this case, we are matching two conditional distributions with the Wasserstein distance, instead of matching the generated sentence with all reference sentences by average. This is a more suitable way as a generated sentence does not necessarily need to match all the references. For simplicity, we sometimes omit the first expectation \mathbb{E}_{X \sim p_d}.

With the proposed nested-Wasserstein distance, we propose the Wasserstein self-imitation scheme in (6), as illustrated in Figure 2. We seek to use historical high-reward sequences to define a "self-imitation" reward function, which is then combined with the original reward function to update the generator with policy-gradient methods. Intuitively, higher self-imitation rewards are achieved when the generated sequences are close to historical high-reward sequences. Thus the generator is guided to perform self-imitation, and we call this method indirect nested-Wasserstein self-imitation learning (WSIL-I). The word "indirect" comes from the mechanism that historical sequences interact with the policy indirectly via the self-imitation reward.

WSIL-I incorporates a self-imitation reward, denoted r_s(Y^s, Y^b), into the objective function. Here Y^b denotes a sample from the replay buffer and Y^s denotes a sample from the current policy. To this end, we replace the Wasserstein distance W_c in the nested-Wasserstein distance with r_s(Y^s, Y^b) in the general objective (6). Specifically, we define the two sets of sample sequences from \pi_{\theta,X} and \pi_{B,X} to be \mathcal{Y}^s \triangleq \{Y^s_i\}_{i=1}^{K} and \mathcal{Y}^b \triangleq \{Y^b_j\}_{j=1}^{K'}, with sizes K and K', respectively. Here Y^s_i \sim \pi_{\theta,X} and Y^b_j \sim \pi_{B,X}, \forall j; \{Y^s_i\}_{i=1}^{K} and \mathcal{Y}^b will be used in calculating the nested-Wasserstein distance. Let r_{ns}(Y^s_i, \mathcal{Y}^b) \triangleq \sum_j \mathbf{T}^s_{ij} r_s(Y^s_i, Y^b_j) be the nested-Wasserstein reward, with \mathbf{T}^s = \{\mathbf{T}^s_{ij}\} the optimal weights at the distribution level. Based on (6), the objective of WSIL-I is adapted to be:

    J_I(\pi_\theta) \triangleq \mathbb{E}_{X \sim p_d} \mathbb{E}_{Y^s \sim \pi_{\theta,X}}[r(Y^s) + \lambda r_{ns}(Y^s, \mathcal{Y}^b)],   (7)

where r is the original RL reward and r_{ns} is the nested-Wasserstein reward.
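A minimal sketch of how the self-imitation reward r_ns might be computed in practice is shown below. It assumes the `wasserstein_reward` and `ipot_plan` helpers from the earlier sketch are in scope, and it treats 1 - r_s as the sequence-level cost when solving for the distribution-level plan T^s; that choice, like the function name, is our reading of the construction above rather than the paper's released implementation.

import numpy as np

def nested_wasserstein_rewards(hyp_embs, buf_embs, gamma=1.0):
    """Distribution-level self-imitation rewards r_ns(Y_i^s, Y^b).

    hyp_embs: list of K  (T_i, d) embedding matrices sampled from pi_theta
    buf_embs: list of K' (T_j, d) embedding matrices from the replay buffer
    Returns one scalar reward per generated sequence.
    """
    # Sequence-level Wasserstein rewards r_s(Y_i^s, Y_j^b) for every pair.
    R = np.array([[wasserstein_reward(h, b) for b in buf_embs] for h in hyp_embs])
    # Distribution-level plan T^s, using 1 - r_s as the cost between sequences.
    Ts = ipot_plan(1.0 - R, gamma=gamma)
    # r_ns(Y_i^s, Y^b) = sum_j T^s_ij r_s(Y_i^s, Y_j^b); note each row of T^s
    # carries total mass 1/K, matching the definition above.
    return (Ts * R).sum(axis=1)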
Since not all historically explored samples are helpful for updating the current policy, we only consider a subset of the high-reward sequences when performing self-imitation. Using K trajectories sampled i.i.d. from \pi_\theta and introducing a baseline b, the gradient estimate of WSIL-I is expressed as:

    \nabla_\theta J_I(\pi_\theta) \approx -\sum_{k=1}^{K} \big[(r(Y^s_k) - b) \nabla_\theta \log \pi_\theta(Y^s_k) + \lambda r_{ns}(Y^s_k, \mathcal{Y}^b) \nabla_\theta \log \pi_\theta(Y^s_k)\big].   (8)

In practice, the indicator \mathbb{I}[r(Y^b) > r(Y^s)] is combined with the nested-Wasserstein rewards, where \mathbb{I}(\cdot) = 1 if the condition is satisfied and 0 otherwise; b is the baseline used to stabilize training. If the reward of a historical high-reward sequence is greater than the current one (i.e., r(Y^b) > r(Y^s)), the generator learns to imitate this high-reward sequence. Otherwise, the update based on the historical sequence is not performed, due to the \mathbb{I}(\cdot) operator. This encourages the agent to only imitate its good historical explorations. We have also developed another way to implement (direct) WSIL (WSIL-D), as discussed in Appendix A. Algorithm 1 describes the general implementation procedure of WSIL.

Algorithm 1: Nested-Wasserstein Self-Imitation.
Require: Generator policy \pi_\theta; a sequence dataset D = \{Y_{1...T}\}_1^N; a possibly empty condition \mathcal{X} = \{X\}_1^N.
  Initialize \pi_\theta and replay buffer B.
  Pretrain generator \pi_\theta with MLE.
  repeat
    Generate K sequences \mathcal{Y}^s = \{Y^s_k\}_{k=1}^K, where Y^s_k \sim \pi_\theta.
    Update replay buffer B using \mathcal{Y}^s.
    if Self-Imitation then
      Sample K' sequences \mathcal{Y}^b = \{Y^b_j\}_{j=1}^{K'}, where Y^b_j \sim \pi_B.
      Estimate the OT matrices \mathbf{T} and \mathbf{T}^s via IPOT.
      Compute r_{ns}(Y^s_k, \mathcal{Y}^b) and update \pi_\theta with (8).
    else
      Update the generator \pi_\theta with (3) using \mathcal{Y}^s.
    end if
  until the algorithm converges

Exploration Efficiency. The exploration space of MLE is the set of examples in the training set [Tan et al., 2018], i.e., no exploration is performed in supervised training. In contrast, standard policy optimization [Ranzato et al., 2016] basically allows the whole exploration space. However, the exploration may become inefficient since it may be too flexible, and some good sequences observed in history tend to be less explored and imitated due to the sparse rewards. Our proposed WSIL aims to provide more efficient and systematic exploration. It allows whole-space exploration, but re-weights the exploration space to focus more on the exploration that may provide better performance, via the Wasserstein trust-region.

Figure 3: Exploration space of different methods (MLE, RL, WSIL). Circle: ground truth; Star: high-reward sequences.

Increasing Self-Imitation. According to the theory of Wasserstein gradient flows [Villani, 2008], 1/\lambda can be interpreted as a generalized decaying learning rate. With more exploration, \lambda becomes larger, and the algorithm should focus more on self-imitation learning, providing a guideline to balance standard RL training and self-imitation learning. More details are provided in Appendix B. Practically, nested-Wasserstein provides weak supervision focusing on semantic matching, which is reasonable since the historical high-reward sequences contain some noise.
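To connect (8) and the indicator trick to code, the following hedged PyTorch sketch shows a surrogate loss whose gradient corresponds to the WSIL-I estimate (up to the averaging constant and the sign convention used when minimizing a loss). The function name `wsil_i_loss` and the assumption that per-sequence log-probabilities and rewards are precomputed are ours.

import torch

def wsil_i_loss(logp, reward, ns_reward, buffer_reward, baseline, lam=0.5):
    """Surrogate loss for the WSIL-I policy-gradient update in (8),
    with the indicator I[r(Y^b) > r(Y^s)] gating the imitation term.

    logp:          (K,) sum of log pi_theta(y_t | s_t) over each sampled sequence
    reward:        (K,) metric reward r(Y_k^s)
    ns_reward:     (K,) nested-Wasserstein self-imitation reward r_ns(Y_k^s, Y^b)
    buffer_reward: (K,) reward of the matched replay-buffer sequences
    baseline:      scalar or (K,) baseline b (e.g., greedy-decoding reward)
    """
    gate = (buffer_reward > reward).float()          # imitate only better history
    advantage = (reward - baseline) + lam * gate * ns_reward
    # REINFORCE surrogate: minimizing this performs the policy-gradient update.
    return -(advantage.detach() * logp).mean()

# Usage (shapes only): loss = wsil_i_loss(logp, r, rns, rb, b); loss.backward()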
5 Related Work

Optimal transport. Kusner et al. [2015] proposed the word mover's distance (WMD) and first applied optimal transport (OT) to NLP; OT has also been employed to improve topic modeling [Huang et al., 2016]. The transportation cost is usually defined as Euclidean distance, and the OT distance is approximated by solving a Kantorovich-Rubinstein dual [Gulrajani et al., 2017] or a less-accurate lower bound [Kusner et al., 2015]. Yurochkin et al. [2019] proposed a hierarchical OT representation for documents, but the hierarchy was at the word and topic level, based on the WMD. Our work considers the nested-Wasserstein distance, presenting an efficient IPOT-based implementation for OT distance approximation [Xie et al., 2018], and successfully uses it to guide sequence generation.

Self-Imitation Learning. Experience replay has been widely considered in RL. Deterministic policy gradient [Silver et al., 2014, Lillicrap et al., 2016] performs experience replay, but is limited to continuous control. Actor-critic approaches [Konda and Tsitsiklis, 2000] can also utilize a replay buffer to improve performance. Prioritized experience replay [Schaul et al., 2015] samples trajectories based on the temporal-difference error, and we adopt it in our implementation. These approaches indiscriminately buffer all experiences, while the approach proposed here only buffers high-reward experience. Further, episodic control [Lengyel and Dayan, 2008] can be regarded as an extreme way of exploiting past experience, trying to reproduce its best past decisions, but retrieving states leads to poor efficiency and generalization in testing. Self-imitation learning was first applied in Atari games and MuJoCo [Oh et al., 2018, Gangwani et al., 2018], reporting performance improvements w.r.t. sparse rewards. Compared with that work, our solution considers a novel self-imitation learning scheme in the context of sequence generation.

RL for Sequence Generation. RL techniques have been explored in detail for sequence generation. For example, a Seq2Seq model can be trained by directly optimizing BLEU/ROUGE scores via policy gradient [Ranzato et al., 2016, Bahdanau et al., 2017]. Furthermore, Rennie et al. [2016] baseline the actor with the reward of a greedy-decoding sequence for the REINFORCE method. Model-based RL and hierarchical RL have also been studied for sequence generation [Zhang et al., 2018a, Huang et al., 2019]. Further, a learned discriminator (or critic) can also be used to provide sequence-level guidance. By constructing different objectives, previous work [Yu et al., 2017, Lin et al., 2017, Guo et al., 2017, Fedus et al., 2018] combines the policy-gradient algorithm with the original GAN training procedure. However, mode-collapse problems make the training of these methods challenging. By contrast, we propose the use of self-imitation learning, and maintain a replay buffer to exploit past good explorations.

6 Experiments

We evaluate the proposed method on both unconditional and conditional text-generation tasks, considering standard benchmark datasets. Our approach achieves state-of-the-art results on unconditional text generation and video captioning. We also observe improved performance on image captioning, though relying on much simpler features compared to prior state-of-the-art methods. We also perform ablation studies to understand the improvements brought by self-imitation and Wasserstein rewards individually. Details of the datasets, experimental setup and model architectures are provided in Appendix C.

Implementation Details. A few key techniques are required for successful model training. (i) The reward from a greedy-decoding sentence is used as the baseline [Rennie et al., 2016] in conditional text generation; in unconditional text generation, a constant baseline is used. (ii) A single large replay buffer is maintained for unconditional generation, and multiple replay buffers are maintained for different conditions in conditional generation. (iii) For each pair of sentences, the shorter one should be padded to the same length as the longer one for a balanced optimal transport, which is a key implementation technique; a minimal sketch of this padding is given below.
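The following small sketch illustrates implementation detail (iii). Padding with a zero embedding vector is an assumption on our part; any fixed <pad> embedding would serve the same purpose of giving both sides the same number of uniformly weighted points.

import numpy as np

def pad_for_balanced_ot(emb_a, emb_b, pad_vec=None):
    """Pad the shorter sentence's embedding matrix to the longer one's length
    so both sides carry the same number of (uniformly weighted) points.

    emb_a, emb_b: (T_a, d) and (T_b, d) word-embedding matrices.
    pad_vec: embedding used for padding (zero vector by default, an assumption).
    """
    d = emb_a.shape[1]
    pad_vec = np.zeros(d) if pad_vec is None else pad_vec
    T = max(len(emb_a), len(emb_b))
    pad = lambda e: np.vstack([e, np.tile(pad_vec, (T - len(e), 1))]) if len(e) < T else e
    return pad(emb_a), pad(emb_b)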
Demonstration of nested-Wasserstein. Figure 4 shows the optimal matching at the word level (T) and sentence level (T^s). It is interesting to see that all similar words (e.g., bike and cycle) are matched with each other (higher weights), which cannot be achieved via exact hard-matching metrics. At the distribution level, we show an example from captioning tasks, where we have five reference and five hypothesis sentences. Traditional methods will match a hypothesis sentence to each of the references and average over them, while our method performs distributional semantic matching, i.e., only matching similar references instead of all of them. For example, the third hypothesis is almost entirely matched with the fifth reference, because they are more similar. This is reasonable, because the references are usually very different, and matching equally with all of them is confusing for the generator. As shown in Figure 5, CIDEr focuses more on local fluency and equal matching with all references, while nested-Wasserstein performs distributional semantic matching. More examples are provided in the Appendix.

Figure 4: Demonstration of the nested-Wasserstein distance at the word level (left) and sentence level (right).

Figure 5: An example of image captioning. The right generated sentence is better but is given a lower CIDEr.
6.1 Unconditional Text Generation

We compare our approach with a number of related RL-based GAN models for unconditional text generation [Guo et al., 2017, Lin et al., 2017, Yu et al., 2017, Zhang et al., 2017]. Our implementation is developed based on the LeakGAN model, by incorporating Wasserstein self-imitation learning. All baseline experiments are performed on the texygen platform [Zhu et al., 2018]. The corpus-level BLEU score is employed to evaluate the generated sentences. Specifically, we follow the strategy in Yu et al. [2017], Guo et al. [2017] and adopt the BLEU score referenced by the test set (test-BLEU) and by the generated samples themselves (self-BLEU) to evaluate the quality of generated samples. Test-BLEU evaluates the goodness of generated samples, and self-BLEU measures their diversity. The BLEU scores of 1000 generated sentences are averaged to obtain the final score for each model. A good generator should achieve both a high test-BLEU score and a low self-BLEU score. Following previous work [Guo et al., 2017], we test the proposed method on short and long text generation using the Image COCO and EMNLP2017 WMT News datasets. The BLEU scores of the different methods are provided in Tables 2 and 3.

Table 2: Test-BLEU (higher is better) and Self-BLEU (lower is better) scores on Image COCO.
Method                        Test-BLEU-2/3/4/5                Self-BLEU-2/3/4
MLE [Caccia et al., 2018]     0.902 / 0.706 / 0.470 / 0.392    0.787 / 0.646 / 0.485
SeqGAN [Yu et al., 2017]      0.820 / 0.604 / 0.361 / 0.211    0.807 / 0.577 / 0.278
RankGAN [Lin et al., 2017]    0.852 / 0.637 / 0.389 / 0.248    0.822 / 0.592 / 0.230
TextGAN [Zhang et al., 2017]  0.910 / 0.728 / 0.484 / 0.306    0.806 / 0.548 / 0.217
FMGAN [Chen et al., 2018]     0.911 / 0.782 / 0.584 / 0.382    0.834 / 0.643 / 0.405
LeakGAN [Guo et al., 2017]    0.922 / 0.797 / 0.602 / 0.416    0.912 / 0.825 / 0.689
WSIL-D (ours)                 0.917 / 0.774 / 0.576 / 0.393    0.797 / 0.569 / 0.284
WSIL-I (ours)                 0.922 / 0.778 / 0.576 / 0.396    0.813 / 0.600 / 0.326

Table 3: Test-BLEU (higher is better) and Self-BLEU (lower is better) scores on EMNLP2017 WMT News.
Method                        Test-BLEU-2/3/4/5                Self-BLEU-2/3/4
MLE [Caccia et al., 2018]     0.905 / 0.701 / 0.464 / 0.278    0.764 / 0.522 / 0.295
SeqGAN [Yu et al., 2017]      0.630 / 0.354 / 0.164 / 0.087    0.728 / 0.411 / 0.139
RankGAN [Lin et al., 2017]    0.723 / 0.440 / 0.210 / 0.107    0.672 / 0.346 / 0.119
TextGAN [Zhang et al., 2017]  0.777 / 0.529 / 0.305 / 0.161    0.806 / 0.662 / 0.448
FMGAN [Chen et al., 2018]     0.913 / 0.751 / 0.512 / 0.315    0.830 / 0.682 / 0.427
LeakGAN [Guo et al., 2017]    0.923 / 0.757 / 0.546 / 0.335    0.837 / 0.683 / 0.513
SIL-D (ours)                  0.875 / 0.634 / 0.401 / 0.243    0.724 / 0.466 / 0.256
SIL-I (ours)                  0.869 / 0.633 / 0.399 / 0.242    0.710 / 0.455 / 0.263
WSIL-D (ours)                 0.931 / 0.736 / 0.503 / 0.317    0.795 / 0.553 / 0.299
WSIL-I (ours)                 0.926 / 0.726 / 0.492 / 0.307    0.815 / 0.595 / 0.380

Analysis. Compared with the other methods, LeakGAN, WSIL-D and WSIL-I achieve comparable test-BLEU scores, demonstrating high-quality generated sentences. However, LeakGAN tends to over-fit on training data, leading to much higher (worse) self-BLEU scores. Our proposed methods, by contrast, show good diversity of the generated text, with lower self-BLEU scores. Other baselines obtain both low self-BLEU and low test-BLEU scores, corresponding to more random generations.

Ablation Study. We conduct ablation studies on EMNLP2017 WMT News to investigate the improvements brought by each part of WSIL. First, we test the benefits of the two types of self-imitation schemes. We compare RL training with (i) self-imitation (SIL-D and SIL-I), where only a replay buffer and conventional matching (features extracted from a neural network) are employed; and (ii) Wasserstein self-imitation (WSIL-D and WSIL-I). Results are shown in Table 3. We observe that the self-imitation strategy, with specific replay-buffer construction, can alleviate the discrepancies between reward-model bias and conventional rewards (e.g., self-BLEU). Without Wasserstein rewards, we achieve lower self-BLEU at the sacrifice of test-BLEU. When combined with Wasserstein rewards, WSIL-D and WSIL-I show superior performance relative to the baselines. The randomly generated samples in Appendix D and the human evaluations further validate this.

Sweep the Temperature. To better evaluate the proposed method, we follow Caccia et al. [2018] and evaluate the trade-off between quality and diversity. We use the F1-BLEU score as a metric, which considers both quality and diversity, and is defined as the harmonic mean of the BLEU score and 1 - Self-BLEU:

    \text{F1-BLEU} = \frac{2 \times \text{BLEU} \times (1 - \text{Self-BLEU})}{\text{BLEU} + (1 - \text{Self-BLEU})}.   (9)

Figure 6 indicates that WSIL is consistently better than the MLE model on the F1-BLEU-4 score.
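Equation (9) is straightforward to compute; the tiny helper below (name ours) makes the trade-off explicit, using the WSIL-D numbers from Table 3 as the worked example.

def f1_bleu(test_bleu, self_bleu):
    """F1-BLEU from (9): harmonic mean of quality (test-BLEU) and
    diversity (1 - self-BLEU)."""
    diversity = 1.0 - self_bleu
    return 2.0 * test_bleu * diversity / (test_bleu + diversity)

# e.g., WSIL-D on EMNLP2017 WMT News (Table 3): BLEU-4 = 0.503, Self-BLEU-4 = 0.299
print(f1_bleu(0.503, 0.299))  # ~0.586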
Figure 6: F1-BLEU-4 when sweeping the temperature for unconditional generation (left); CIDEr scores of video captioning on the validation set (right).

Human Evaluation. Simply relying on the above metrics is not sufficient to evaluate the proposed method [Caccia et al., 2018]. Following previous work [Guo et al., 2017], we performed an additional human evaluation on the EMNLP2017 WMT News dataset using Amazon Mechanical Turk. We required all workers to be native English speakers, with an approval rate higher than 95% and at least 100 assignments completed. Previous work has shown higher scores for LeakGAN compared with other baselines [Guo et al., 2017]; we therefore mainly focus on the comparison of our methods with LeakGAN. We randomly sampled 200 sentences from each model, and asked 5 different workers to score each sentence on a scale of 1 to 5, considering its readability and meaning. Results are shown in Table 6, which indicates better performance of the proposed WSIL.

Table 6: Results of human evaluation.
Method      Human score
MLE         2.97 ± 0.05
LeakGAN     2.63 ± 0.05
SIL-D       2.54 ± 0.05
SIL-I       2.55 ± 0.05
WSIL-D      3.49 ± 0.05
WSIL-I      3.41 ± 0.05
Real        4.11 ± 0.04

6.2 Conditional Text Generation

Video Captioning. We conduct experiments on the MSR-VTT dataset [Xu et al., 2016] for video captioning. MSR-VTT is a large-scale video dataset consisting of 20 video categories, split into 6513 and 3487 clips for the training and testing sets. Each video is annotated with about 20 captions. For each video, we sample frames at 3 fps and extract Inception-v4 [Szegedy et al., 2017] features from these sampled frames. We report BLEU-4 [Papineni et al., 2002], CIDEr [Vedantam et al., 2015], and METEOR [Banerjee and Lavie, 2005] scores. Results are summarized in Table 4. Consistent improvements are observed with the WSIL framework. WSIL-D performs slightly better than WSIL-I, with both yielding much higher optimized CIDEr and METEOR scores than SCST. This indicates that Wasserstein self-imitation can improve the semantic matching between generated sentences and their references, while achieving reasonable exact-matching-based metric scores.

Table 4: Video captioning results on MSR-VTT (BLEU-4 / METEOR / ROUGE-L / CIDEr).
ED-LG [Yao et al., 2015]       35.2 / 25.2 / -    / -
SA-LSTM [Xu et al., 2016]      36.6 / 25.9 / -    / -
SCST [Pasunuru et al., 2017]   40.5 / 28.4 / 61.4 / 51.7
MBP [Wang et al., 2018b]       41.3 / 28.7 / 61.7 / 48.0
Our implementations:
MLE                            39.2 / 27.8 / 59.8 / 46.6
MIXER [Ranzato et al., 2016]   40.2 / 27.9 / 60.8 / 50.3
SCST [Rennie et al., 2016]     40.7 / 27.9 / 61.6 / 51.3
WSIL-D                         42.5 / 29.0 / 62.4 / 52.1
WSIL-I                         41.6 / 28.4 / 62.0 / 52.2

Image Captioning. We consider image captioning using the COCO dataset [Lin et al., 2014], which contains 123,287 images in total, each annotated with at least 5 captions. Following Karpathy's split [Karpathy and Fei-Fei, 2015], 113,287 images are used for training and 5,000 images are used for validation and testing. We follow the implementation of the SCST approach [Rennie et al., 2016], and use extracted image tags [Gan et al., 2017] as image features (encoder). We report BLEU-k (k from 1 to 4) [Papineni et al., 2002], CIDEr [Vedantam et al., 2015], and METEOR [Banerjee and Lavie, 2005] scores. Results are summarized in Table 5. Compared with the MLE baseline, RL-based methods significantly increase the overall performance under all evaluation metrics. We choose CIDEr as the optimizing metric, since it performs best [Rennie et al., 2016]. Our proposed WSIL shows improvement on most metrics compared with the SCST baseline. Examples of generated captions are provided in Appendix E.

Table 5: Image captioning results on COCO (BLEU-4 / METEOR / ROUGE-L / CIDEr).
S & T [Vinyals et al., 2015]   27.7 / 23.7 / -    / 85.5
OT [Chen et al., 2019]         31.0 / 24.6 / -    / 94.7
Adaptive [Lu et al., 2017]     33.2 / 26.6 / -    / 108.5
TD [Anderson et al., 2017]     33.3 / 26.3 / 55.3 / 111.4
Our implementations:
MLE                            28.8 / 24.4 / 52.0 / 91.3
MIXER [Ranzato et al., 2016]   30.8 / 24.7 / 52.9 / 101.2
SCST [Rennie et al., 2016]     32.1 / 25.4 / 53.9 / 105.5
WSIL-D                         31.8 / 25.7 / 54.0 / 107.4
WSIL-I                         32.0 / 25.6 / 53.9 / 107.6

7 Conclusions

We have proposed a novel Wasserstein self-imitation learning framework for sequence generation, to alleviate the sparse-reward problem of RL methods and the model-training bias imposed by conventional rewards. This is done by encouraging self-imitation and semantic matching in policy learning. Further, our method can be approximately interpreted as policy optimization with Wasserstein trust-regions. Experiments on unconditional and conditional text generation demonstrate consistent performance improvement over strong baselines. For future work, the proposed method has the potential to be applied to other interesting sequence-generation tasks, such as program synthesis [Liang et al., 2018].
Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments. The research was supported in part by DARPA, DOE, NIH, NSF and ONR.

References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and VQA. In CVPR, 2017.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.
Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, 2005.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS, 2015.
Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. Language GANs falling short. arXiv:1811.02549, 2018.
Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, Yizhe Zhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. Adversarial text generation via feature-mover's distance. In NeurIPS, 2018.
Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, and Lawrence Carin. Improving sequence-to-sequence learning via optimal transport. In ICLR, 2019.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
William Fedus, Ian Goodfellow, and Andrew M. Dai. MaskGAN: Better text generation via filling in the _. In ICLR, 2018.
Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In CVPR, 2017.
Tanmay Gangwani, Qiang Liu, and Jian Peng. Learning self-imitating diverse policies. arXiv:1805.10309, 2018.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In AISTATS, 2018.
Xiaodong Gu, Kyunghyun Cho, Jungwoo Ha, and Sunghun Kim. DialogWAE: Multimodal response generation with conditional Wasserstein auto-encoder. In ICLR, 2019.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.
Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. In AAAI, 2017.
Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. In ICLR, 2019.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
Junjie Hu, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, and Graham Neubig. What makes a good story? Designing composite rewards for visual storytelling. arXiv:1909.05316, 2019.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Controllable text generation. In ICML, 2017.
Gao Huang, Chuan Guo, Matt J. Kusner, Yu Sun, Fei Sha, and Kilian Q. Weinberger. Supervised word mover's distance. In NeurIPS, 2016.
Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, and Xiaodong He. Hierarchically structured reinforcement learning for topically coherent visual story generation. In AAAI, 2019.
Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv:1511.05101, 2015.
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In NeurIPS, 2000.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, 2015.
Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. In NeurIPS, 2008.
Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, and Ni Lao. Memory augmented policy optimization for program synthesis with generalization. In NeurIPS, 2018.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, et al. Continuous control with deep reinforcement learning. In ICLR, 2016.
Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. In NeurIPS, 2017.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of SPIDEr. In ICCV, 2017.
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.
Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. In LREC, 2018.
Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. In ICML, 2018.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
Ramakanth Pasunuru and Mohit Bansal. Reinforced video captioning with entailment rewards. In NAACL, 2017.
Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. In ICLR, 2017.
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2016.
Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv:1509.00685, 2015.
Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In ICLR, 2015.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, 2015.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NeurIPS, 2014.
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
Bowen Tan, Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric Xing. Connecting the dots between MLE and RL for sequence generation. arXiv:1811.09740, 2018.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
Cédric Villani. Optimal Transport: Old and New. Springer Science & Business Media, 2008.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. Topic-guided variational autoencoders for text generation. In NAACL, 2019.
Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL, 2018a.
Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang. Video captioning via hierarchical reinforcement learning. In CVPR, 2018b.
Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization. In EMNLP, 2016.
Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for Wasserstein distance. arXiv:1802.04307, 2018.
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In CVPR, 2015.
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin Solomon. Hierarchical optimal transport for document representation. In NeurIPS, 2019.
Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Liqun Chen, Dinghan Shen, Guoyin Wang, and Lawrence Carin. Sequence generation with guider network. arXiv:1811.00696, 2018a.
Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as Wasserstein gradient flows. In ICML, 2018b.
Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, 2017.
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In SIGIR, 2018.
Supplementary Material of "Nested-Wasserstein Self-Imitation Learning for Sequence Generation"

A More Details about WSIL

Direct nested-Wasserstein Self-Imitation Learning. Direct Wasserstein self-imitation learning (WSIL-D) weights the original rewards with outputs from the behavior policy for sequences in the replay buffer B. The sequences from the replay buffer are directly used as pseudo-samples to update the generator [Liang et al., 2018]. Similarly, define r_{ns}(Y^s, \mathcal{Y}) \triangleq \sum_j \mathbf{T}^s_j r_s(Y^s, Y_j), with \mathbf{T}^s = \{\mathbf{T}^s_j\} the optimal weights, to be the nested-Wasserstein reward between the sequence Y^s and the ground-truth references \mathcal{Y}. The general objective (6) is then extended to the objective for WSIL-D:

    J_D(\pi_\theta) \triangleq \mathbb{E}_{Y^s \sim \pi_{\theta,X}}[r(Y^s)] + \lambda \mathbb{E}_{Y^b \sim \pi_{B,X}}[r_{ns}(Y^b, \mathcal{Y}) \pi_\theta(Y^b)],   (10)-(11)

where r is the original RL reward and r_{ns} is the nested-Wasserstein reward. Based on the objective (10)-(11), we update the generator with the standard RL loss and the self-imitation loss alternately, with a hyperparameter \lambda that controls the update frequency:

    \nabla_\theta J_D(\pi_\theta) \approx -\sum_{k=1}^{K} \big[(r(Y^s_k) - b) \nabla_\theta \log \pi_\theta(Y^s_k)\big] - \lambda \sum_{j=1}^{K'} \big[(r_{ns}(Y^b_j, \mathcal{Y}) - b_s)_{+} \nabla_\theta \log \pi_\theta(Y^b_j)\big],   (12)

where (\cdot)_{+} = \max(\cdot, 0), and b_s and b are baselines used to reduce the variance of the gradient estimates. In practice, (\cdot)_{+} means that WSIL-D only imitates the sequences in the replay buffer with higher rewards. Intuitively, direct self-imitation implicitly imposes larger weights on good simulated data for training, to exploit good historical explorations. The main difference between WSIL-D and its indirect counterpart is that sequences from the replay buffer are not used to compute the self-imitation rewards, but are used to evaluate the policy. Intuitively, WSIL-D changes the data distribution to explore the good history more efficiently.

Algorithm 2: IPOT for Wasserstein Rewards.
 1: Input: feature vectors \mu = \{z_i\}_{i=1}^{n}, \nu = \{z'_j\}_{j=1}^{m}, and generalized stepsize 1/\gamma
 2: \sigma = \frac{1}{m} \mathbf{1}_m, \mathbf{T}^{(1)} = \mathbf{1}_n \mathbf{1}_m^\top
 3: \mathbf{C}_{ij} = c(z_i, z'_j), \mathbf{A}_{ij} = e^{-\mathbf{C}_{ij}/\gamma}
 4: for t = 1, 2, 3, ... do
 5:   Q = \mathbf{A} \odot \mathbf{T}^{(t)}  // \odot is the Hadamard product
 6:   for k = 1, ..., K do
 7:     \delta = \frac{1}{n Q \sigma}, \sigma = \frac{1}{m Q^\top \delta}
 8:   end for
 9:   \mathbf{T}^{(t+1)} = \text{diag}(\delta) Q \text{diag}(\sigma)
10: end for
11: Return \langle \mathbf{T}, 1 - \mathbf{C} \rangle

B Implementation Details

Replay Buffer Construction. In our algorithm, a metric is required to select high-reward historical demonstrations, which are stored in the replay buffer D. There are different ways of evaluating sentences for this purpose (a minimal buffer sketch is given after the list below):

i) For unconditional generation with synthetic data, following Chen et al. [2018], we adopt the negative log-likelihood (NLL) to measure model performance, as there exists an oracle data distribution. For this experiment, the replay buffer is constructed from generated sentences that achieved a higher reward from the learned discriminator.

ii) For unconditional generation with real data, since we use the test-BLEU and self-BLEU scores to evaluate generated sentences, we maintain a single large replay buffer with the F1-BLEU score as the selection criterion, which captures the quality-diversity trade-off [Gu et al., 2019]. The F1-BLEU score is defined as the harmonic mean of the BLEU score and 1 - Self-BLEU:

    \text{F1-BLEU} = \frac{2 \times \text{BLEU} \times (1 - \text{Self-BLEU})}{\text{BLEU} + (1 - \text{Self-BLEU})}.   (13)

iii) For conditional generation with captioning tasks, we maintain a small (K' = 5 sequences) replay buffer for each conditional input; the total buffer may seem large, but we only need to store sequences of token indexes, which is very efficient. Here we use the nested-Wasserstein rewards as the metric.

iv) For conditional generation with non-parallel style transfer, we maintain a large replay buffer storing successfully transferred pairs, and we define a metric which considers both the accuracy and content preservation: p(Right Style) x BLEU.
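The following hedged sketch shows one way such a buffer could be organized: a per-condition store that keeps only the K' highest-scoring sequences, with the scoring function being whichever selection criterion from the list above applies. The class name and the heap-based bookkeeping are our own choices.

import heapq

class TopKReplayBuffer:
    """Keep only the K' highest-reward sequences for each condition X
    (image id, video id, or a single dummy key for unconditional generation)."""

    def __init__(self, k=5):
        self.k = k
        self.buffers = {}                       # condition -> min-heap of (score, seq)

    def add(self, condition, sequence, score):
        heap = self.buffers.setdefault(condition, [])
        item = (score, list(sequence))          # store token indexes only
        if len(heap) < self.k:
            heapq.heappush(heap, item)
        elif score > heap[0][0]:                # replace the current worst entry
            heapq.heapreplace(heap, item)

    def sample(self, condition):
        """Return the stored high-reward sequences Y^b for a condition."""
        return [seq for _, seq in self.buffers.get(condition, [])]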
Balance between RL and self-imitation. According to the theory of Wasserstein gradient flows [Villani, 2008], 1/\lambda defined in (6) can be interpreted as a generalized decaying learning rate. With more exploration, \lambda becomes larger, and the algorithm should focus more on self-imitation learning. In practice, we perform one self-imitation update for every 10 RL training updates; as training proceeds, we increase the frequency of self-imitation, and finally update the generator with one step of self-imitation followed by one step of standard RL training.

The trick of soft-argmax. Recall that in sequence generation, one first samples a token based on the policy, then feeds its token embedding into the RNN to compute the logits of the next token, and repeats this process until the stop token is generated. Instead of using the embedding of a sampled token, the soft-argmax trick feeds the RNN with a weighted average of the embeddings of the most-likely tokens. In particular, let \mathbf{E} be the word-embedding matrix, g_t the logits under the current policy, and s_t the hidden state of the policy \pi_\theta. With the soft-argmax trick, the state vector is updated by

    \tilde{y}_t = \mathbf{E} \cdot \text{softmax}(g_t / \beta),   (14)
    \tilde{s}_t = h(\tilde{s}_{t-1}, \tilde{y}_t),   (15)

where 0 < \beta < 1 is the annealing factor; in practice, we set \beta = 0.01.
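A minimal PyTorch sketch of one decoding step with (14)-(15) is given below. The function name, the assumed shapes, and the use of a GRUCell stand in for whatever recurrent cell the actual model uses; they are illustrative assumptions, not the released implementation.

import torch
import torch.nn.functional as F

def soft_argmax_step(logits, embedding, cell, state, beta=0.01):
    """One decoding step with the soft-argmax trick of (14)-(15).

    logits:    (batch, vocab) logits g_t from the current policy
    embedding: (vocab, d) word-embedding matrix E
    cell:      a recurrent cell, e.g. torch.nn.GRUCell(d, hidden)
    state:     (batch, hidden) previous hidden state s_{t-1}
    """
    # (14): feed a temperature-sharpened mixture of embeddings instead of
    # the embedding of a single sampled token.
    y_soft = F.softmax(logits / beta, dim=-1) @ embedding
    # (15): update the hidden state with the soft input.
    return cell(y_soft, state)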
Discriminator implementation. In unconditional generation, instead of using policy gradient with the output of the discriminator as rewards, we use the soft-argmax trick [Hu et al., 2017], since the policy gradient is not stable enough and the soft-argmax trick gives better performance in our extensive experiments.

Nested-Wasserstein rewards implementation. In conditional generation, the Wasserstein reward is implemented on top of the COCO evaluation tools, and we use fastText [Mikolov et al., 2018] as the fixed word embedding to compute the reward. In practice, we use K = 5, chosen by a hyper-parameter search over {3, 5, 8, 10}. We will release this code, which is as easy to use as other metrics. For unconditional generation, we use the learned word embedding with its gradient stopped, so that the embedding and the Wasserstein trust-region are jointly optimized.

We also conduct experiments on synthetic data similar to Yu et al. [2017], where our implementation is based on LeakGAN. The result is shown in Figure 3, where WSIL-I and WSIL-D show better performance than LeakGAN. Specifically, LeakGAN is not stable during training and its negative log-likelihood increases after 150 epochs; compared with LeakGAN, WSIL-I and WSIL-D are more stable.

C Experimental Setup

Conditional text generation. We consider image captioning using the COCO dataset [Lin et al., 2014], which contains 123,287 images in total, each annotated with at least 5 captions. Following Karpathy's split [Karpathy and Fei-Fei, 2015], 113,287 images are used for training and 5,000 images are used for validation and testing. We follow the implementation of the SCST approach [Rennie et al., 2016], and use extracted image tags [Gan et al., 2017, Wang et al., 2019] as image features (encoder). The learning rate of the generator is 0.0002 and the maximum sequence length is set to 25. For video captioning, the learning rate of the generator is 0.0001 and the maximum sequence length is set to 30. We use fixed image features and do not finetune the image encoder, following previous work. A one-layer LSTM with 1024 units is used as the decoder, and the word-embedding dimension is set to 512.

Unconditional text generation. We use the COCO dataset [Lin et al., 2014], in which most sentences have a length of about 10. Since we consider unconditional text generation, only the image captions are used as training data. After preprocessing, the training dataset consists of 27,842 words and 417,126 sentences. We use 120,000 randomly sampled sentences as the training set and 10,000 as the test set. For the COCO dataset, the learning rate of the generator is 0.0002, the learning rate of the manager is 0.0002 (following LeakGAN), and the maximum sequence length is set to 25. Following Zhu et al. [2018], we also use the News section of the EMNLP2017 WMT4 dataset as training data, which consists of 646,459 words and 397,726 sentences; after preprocessing, the training dataset contains 5,728 words and 278,686 sentences. The learning rate of the generator is 0.0002, the learning rate of the manager is 0.0002, and the maximum sequence length is set to 50. The number of hidden units in both the generator and manager LSTMs is set to 128, and the dimension of the word embedding is 300. The discriminator is a CNN with its structure specified in Table 7.

Settings of human evaluation. We perform human evaluation using Amazon Mechanical Turk, evaluating text quality based on readability and meaningfulness (whether the sentences make sense). We ask the workers to rate each input sentence with a score from 1 to 5, with the criteria listed in Table C. We require all workers to be native English speakers, with an approval rate higher than 95% and at least 100 assignments completed.