OptAGAN: Entropy-based finetuning on text VAE-GAN
Paolo Tirotta, Department of Statistics, Alma Mater Studiorum University of Bologna, paolo.tirotta@gmail.com
Stefano Lodi, Department of Computer Science, Alma Mater Studiorum University of Bologna, stefano.lodi@unibo.it

arXiv:2109.00239v1 [cs.CL] 1 Sep 2021

Abstract

Transfer learning through large pre-trained models has changed the landscape of current applications in natural language processing (NLP). Recently Optimus, a variational autoencoder (VAE) which combines two pre-trained models, BERT and GPT-2, has been released, and its combination with generative adversarial networks (GANs) has been shown to produce novel, yet very human-looking text. The Optimus and GANs combination avoids the troublesome application of GANs to the discrete domain of text, and prevents the exposure bias of standard maximum likelihood methods. We combine the training of GANs in the latent space with the finetuning of the decoder of Optimus for single word generation. This approach lets us model both the high-level features of the sentences and the low-level word-by-word generation. We finetune using reinforcement learning (RL) by exploiting the structure of GPT-2 and by adding entropy-based intrinsically motivated rewards to balance between quality and diversity. We benchmark the results of the VAE-GAN model, and show the improvements brought by our RL finetuning on three widely used datasets for text generation, with results that greatly surpass the current state-of-the-art for the quality of the generated texts.

1 Introduction

Unsupervised text generation finds its use in a plethora of real-world applications, ranging from machine translation (Wu et al., 2016), to summarization (Allahyari et al., 2017) and dialogue generation (Li et al., 2016). A general approach to modelling text sequences is to autoregressively generate the next token given the previous ones, and the most successful and widespread technique is to train a model using maximum likelihood estimation (MLE). This approach, however, is not without fault. At training time the model learns to generate a token given the ground truth, while at inference time it takes as input its own generated sequence of words. This dissimilarity leads to the so-called exposure bias (Bengio et al., 2015), where the accumulation of errors during inference can produce poor outputs. Furthermore, the loss function of MLE is very strict. For each sequence, only the token accounted for by the training sample is considered as correct (Press et al., 2017), and the model learns precisely to mimic the given samples, often leading to quite dull and homogeneous outputs.

An alternative to MLE methods are generative adversarial networks (GANs) (Goodfellow et al., 2014), where a generator learns to create outputs that can fool a discriminator into believing they are real. Thus, GANs do not have the strict loss function of MLE, and do not suffer from exposure bias, as they learn to sample during training. Nonetheless, the application of GANs to the text realm has been rather complicated. Due to the discreteness of text, the sampling of each token results in a non-differentiable function, which does not allow the loss of the discriminator to be backpropagated. Countermeasures include the use of reinforcement learning (RL) (Yu et al., 2016; Lin et al., 2017; Guo et al., 2017; de Masson d'Autume et al., 2020), the use of the Gumbel-Softmax relaxation (Kusner and Hernández-Lobato, 2016; Nie et al., 2019), or avoiding the discrete space altogether and working with continuous embeddings using autoencoders (Zhao et al., 2017; Donahue and Rumshisky, 2018; Haidar et al., 2019). However, methods which utilize RL often rely on MLE pre-training, and usually do not improve over it (Caccia et al., 2020). Instead, for both the approaches using the Gumbel-Softmax distribution, and even more so for autoencoders, the discriminator considers a continuous representation of text, so it is not able to judge effectively the single word-by-word generation.
In the past few years, natural language processing (NLP) applications have found huge improvements with the introduction of the attention mechanism and the transformer architecture, with notable examples of BERT, GPT-2 and GPT-3 among others (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020). These kinds of language models are large deep neural networks that are able to understand the dependencies between words thanks to attention and are trained over huge amounts of unannotated data. As such, pre-trained language models provide better language understanding than recurrent neural networks, can be very easily finetuned on a downstream task, including text generation, and have reached state-of-the-art results in many areas. Recently Optimus, a text variational autoencoder (VAE), that is, an autoencoder which maps sentences to a meaningful latent space, has been proposed (Li et al., 2020). It combines BERT and GPT-2, as encoder and decoder respectively, and can be employed both as a generative model and as a tool for language understanding tasks.

In this work, we aim to benchmark the results obtained from combining Optimus and GANs, as indicated in the original paper. In doing so, we also investigate the GAN structure and compare the adaptive update strategy presented in (Ouyang and Agam, 2020) with the standard update strategy of GANs. Furthermore, we combine the training in the continuous space with the finetuning of the decoder of Optimus in the discrete text space, in a similar fashion as done in ConCreteGAN (Kim et al., 2020). However, differently from most approaches which use RL, we do not use REINFORCE, but add an additional value head to GPT-2, which outputs the intermediate rewards (Ziegler et al., 2019). Moreover, we modify the reward function by considering the entropy of the model when generating tokens, and favour diversity in the output by adding an intrinsic reward.

Thus, our model OptAGAN¹ is able to model the higher-level sentence structure and has more control over single word generation, in a way that favours both quality and diversity for the generated sentences. We measure such criteria using standard automatic metrics: BLEU for quality, Backwards-BLEU for diversity, and Fréchet distance, on which we also present further analysis. We consider the image caption dataset COCO, the Stanford Natural Language Inference (SNLI) dataset, and the EMNLP News 2017 dataset for unconditional text generation, and also provide results for the conditional review dataset YELP.

¹ Opt(imus) A(ugmented) GAN. The implementation can be found at https://github.com/Egojr/optagan

Results show that the base VAE-GAN model already improves over other GAN methods, especially with regards to quality. OptAGAN further improves over these results, and manages to handle the quality-diversity tradeoff very well. Moreover, we show a further experiment that helps understand the strengths and weaknesses of our finetuning approach.

2 Background

In this section we introduce the mathematical notation and briefly describe the main theoretical tools which are used in OptAGAN. We also present an overview of the other methods of text generation.

2.1 Variational autoencoders

VAEs are generative models formed by two independent models, an encoder q_\phi and a decoder p_\theta. The encoder is tasked with mapping the input x to a latent space z that allows for interpolation. The decoder maps from z to \tilde{x}, providing an approximation of the original input. Thanks to the introduction of a local variation from sampling the encoder output, it is possible to induce a smooth latent representation of the inputs, which differs from the rigid space of autoencoders.

Optimus. Optimus combines the autoregressive nature of GPT-2 text generation with the latent representation produced by the encoder, such that text generation is done as:

p_\theta(x \mid z) = \prod_{t=1}^{n} p_\theta(x_t \mid x_1, \dots, x_{t-1}, z),   (1)

where the probability of each token is estimated conditionally on the latent embedding and the previous tokens. The latent vector z, which comes from the output of the BERT encoder, controls the high-level characteristics of the sentence, such as length, tense, style and topic, and allows for the guided generation of text.
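To make Eq. (1) concrete, the following is a minimal, self-contained PyTorch sketch of a latent-conditioned causal language model: a toy stand-in for the Optimus decoder in which a projected latent vector z is added to every token embedding. The injection mechanism, layer sizes and names are illustrative assumptions, not the actual Optimus/GPT-2 implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentConditionedDecoder(nn.Module):
    """Toy stand-in for the Optimus decoder: a causal language model whose
    every step is conditioned on a sentence-level latent vector z (Eq. 1)."""
    def __init__(self, vocab_size=100, d_model=64, latent_size=32, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.z_proj = nn.Linear(latent_size, d_model)   # inject z at every position
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, z):
        # tokens: (batch, seq_len), z: (batch, latent_size)
        seq_len = tokens.size(1)
        h = self.embed(tokens) + self.z_proj(z).unsqueeze(1)          # broadcast z over time
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.decoder(h, mask=causal_mask)                          # causal self-attention
        return self.lm_head(h)                                         # logits for p(x_t | x_<t, z)

# Factorised likelihood of Eq. (1): sum of per-token log-probabilities.
model = LatentConditionedDecoder()
tokens = torch.randint(0, 100, (2, 10))
z = torch.randn(2, 32)
logits = model(tokens[:, :-1], z)
log_p = F.log_softmax(logits, dim=-1).gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
print("log p_theta(x | z) per sentence:", log_p.sum(dim=1))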
2.2 Generative adversarial networks

GANs are also generative models formed by two models: a generator and a discriminator. Differently from VAEs, the generator G samples from a random variable to produce outputs that can fool the discriminator D into believing they are real, while the discriminator is constantly learning to distinguish between the real and generated data. The objective of the two models can be formulated as:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_D}[\log D(x)] + \mathbb{E}_{\epsilon \sim p_\epsilon}[\log(1 - D(G(\epsilon)))],   (2)

where p_D and p_\epsilon are the distributions of the data and of the input noise of the generator, respectively. When combining GANs with autoregressive text generation, the operation of sampling the next tokens is non-differentiable, so the application of GANs relies on either the use of policy gradient algorithms, or the use of continuous approximations, such as the combination of GANs with autoencoders or VAEs.

2.3 Reinforcement learning

Approaches that use policy gradient algorithms to allow the training of GANs consider the generator as the policy \pi_\theta to train, the sampling of a token as an action A from a state S, and the output of the discriminator as the reward R. For each sentence generated by a model a reward is calculated; however, as the discriminator only scores finished sentences, the intermediate rewards, usually referring to the word-by-word generation, are obtained through the REINFORCE algorithm and Monte-Carlo rollout. Optimization of the parameters is performed through gradient ascent:

\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)   (3)

\nabla_\theta J(\theta_t) \propto \mathbb{E}_\pi[G_t \nabla_\theta \ln \pi(A_t \mid S_t, \theta)],   (4)

where the gradient of the REINFORCE objective \nabla_\theta J(\theta_t) is proportional to the discounted returns G_t = \sum_{j=0}^{t} \gamma^j R_j, so that higher-return actions are favoured, and is inversely proportional to the probability of being selected, so that higher-probability actions are not at an advantage compared to low-probability ones. Equation 4 also provides a value that can be sampled at each time step and only depends on the policy \pi.
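As a concrete illustration of Eqs. (3)-(4), the sketch below computes a REINFORCE loss for one sampled sentence, with the discriminator score of the finished sentence as the only non-zero reward and a standard discounted return-to-go; the names and shapes are illustrative, and this is not the paper's exact Monte-Carlo rollout procedure.

import torch

def reinforce_loss(log_probs, final_reward, gamma=0.99):
    """log_probs: (seq_len,) log pi(A_t | S_t) of the sampled tokens.
    final_reward: scalar discriminator score for the finished sentence."""
    seq_len = log_probs.size(0)
    rewards = torch.zeros(seq_len)
    rewards[-1] = final_reward                  # only the finished sentence is scored
    # discounted return accumulated from the end of the sentence backwards
    returns = torch.zeros(seq_len)
    running = 0.0
    for t in reversed(range(seq_len)):
        running = rewards[t] + gamma * running
        returns[t] = running
    # gradient ascent on E[G_t * log pi]  ==  gradient descent on its negative
    return -(returns.detach() * log_probs).sum()

log_probs = torch.log(torch.rand(12)).requires_grad_()  # placeholder policy log-probs
loss = reinforce_loss(log_probs, final_reward=0.7)
loss.backward()                                          # theta_{t+1} = theta_t + alpha * grad J
print(log_probs.grad)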
2.4 Related work

Many works have dealt with the training of GANs in the discrete realm, starting with SeqGAN (Yu et al., 2016), LeakGAN (Guo et al., 2017) and RankGAN (Lin et al., 2017), all of which share a similar structure, mostly differing in the form of the discriminator, and require MLE pre-training followed by adversarial training with REINFORCE. ScratchGAN (de Masson d'Autume et al., 2020) is the first model to show that MLE pre-training can be avoided by carefully combining existing techniques. Other works that train GANs using continuous relaxations include ARAE (Zhao et al., 2017) and LATEXT-GAN (Haidar et al., 2019), which use autoencoders to learn a continuous latent representation. Models based on the Gumbel-Softmax distribution are RelGAN (Nie et al., 2019) and GSGAN (Kusner and Hernández-Lobato, 2016). On the comparison of these methods and the evaluation metrics, Caccia et al. (2020) and Semeniuta et al. (2019) have shown the inadequacy of current GANs when compared to MLE and the need for metrics that can better measure the quality and diversity of the models. Among works exploring text GAN models and RL, ColdGANs (Scialom et al., 2020) delve deeper into the effects of temperature on text generation. An approach similar to ours, which involves the use of large pre-trained models and RL, is TextGAIL (Wu et al., 2021), where GPT-2 as the generator and RoBERTa as the discriminator are used for the task of text generation.

Regarding text VAEs, Optimus (Li et al., 2020) is the first large pre-trained model of its kind, whereas previously researchers had tried developing VAEs using either recurrent neural networks (Bowman et al., 2015b) or semi-amortized inference (Kim et al., 2018).

3 OptAGAN

The architecture that we present in this work is composed of three main processes, as can be seen from Figure 1. Each process is independent of the others, so each part is trained sequentially.

• In order to fully utilize the strengths of Optimus, we finetune both the encoder and the decoder on the target dataset. The end results are a more separated and distinct latent space for each sentence, and a decoder which better reconstructs the original sentences.

• Next, we train the GAN model composed of the generator and the discriminator. In the case of conditional generation, we also add a classifier, whose loss is then passed to the generator. Both the generator and the discriminator only consider the continuous latent embeddings, so they are much lighter and faster to train compared to other text GANs.

• Finally, we finetune the decoder on discrete text using a value head, which estimates the reward of each single token in a sentence for the generated sequences. The estimated rewards are also augmented by considering the model entropy of each generated token. The gradient is then passed to the decoder through simple policy gradient.
Figure 1: Structure of our proposed model OptAGAN. In red is the main process to generate new text, while in black we show the other models taking part in the training of OptAGAN. Finally, in blue is the last bit of the RL finetuning process. (Legend: x = original text, x̃ = reconstructed text, x̂ = generated text, z = latent variable, ẑ = generated latent variable, ε = noise, R = rewards.)

The structure of both the generator and the discriminator is a simple feed-forward neural network. Current literature does not give clear answers about which loss function specification for GANs is best for continuous data such as text latent embeddings. Our experiments show that Wasserstein GANs with gradient penalty (WGAN-GP) perform the best over sliced Wasserstein distances, which produce very homogeneous outputs. Thus, the loss function to optimize for the generator and discriminator is:

\min_G \max_D \; \mathbb{E}_{x \sim p_D}[D(x)] - \mathbb{E}_{\epsilon \sim p(\epsilon)}[D(G(\epsilon))] + \lambda \, \mathbb{E}_{\hat{x} \sim p(\hat{x})}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big].   (5)

3.1 Update strategy

When training GANs, the common update strategy is to have k update steps, usually in the range [5, 10], for the discriminator for each step of the generator, as this has been shown to give stable training for GANs. We indeed also use this update method, but we also experiment with an adaptive update strategy proposed in (Ouyang and Agam, 2020), where the choice to update the discriminator or the generator is given by a comparison of the loss change ratios of the two networks:

r_G = \frac{|L_G^c - L_G^p|}{L_G^p + c}, \qquad r_D = \frac{|L_D^c - L_D^p|}{L_D^p + c},   (6)

where the relative change between the current loss L^c and previous loss L^p of each network is used in determining which one gets updated. We also add an arbitrarily small constant c in case the losses are too close to 0. A weight λ can also be introduced, so that if r_D > λ r_G the discriminator is updated, and vice versa. Contrary to the original paper, which suggests a value λ ≥ 1, we notice that at the beginning of training there is a stark imbalance in the number of updates between the networks, resulting in slower convergence. To better balance the training, we end up using a value of λ < 1, which converges to 1 with each passing epoch of training.
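The following sketch puts Eq. (5) and Eq. (6) together: WGAN-GP losses for small feed-forward networks operating on latent vectors, and the loss-change-ratio rule used to decide which network to update. Layer sizes, helper names and the absolute value in the denominator (added here to guard against negative Wasserstein losses) are our own assumptions rather than the exact OptAGAN configuration.

import torch
import torch.nn as nn

latent_size = 768
G = nn.Sequential(nn.Linear(100, 512), nn.ReLU(), nn.Linear(512, latent_size))
D = nn.Sequential(nn.Linear(latent_size, 512), nn.ReLU(), nn.Linear(512, 1))

def gradient_penalty(D, real, fake, gp_lambda=10.0):
    # Penalty term of Eq. (5), computed on random interpolations of real and fake latents.
    alpha = torch.rand(real.size(0), 1)
    inter = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(inter).sum(), inter, create_graph=True)[0]
    return gp_lambda * ((grads.norm(2, dim=1) - 1) ** 2).mean()

def wgan_gp_losses(real_z, noise):
    fake_z = G(noise)
    d_loss = D(fake_z.detach()).mean() - D(real_z).mean() \
             + gradient_penalty(D, real_z, fake_z.detach())
    g_loss = -D(fake_z).mean()
    return d_loss, g_loss

def choose_update(curr_g, prev_g, curr_d, prev_d, lam=0.9, c=1e-8):
    # Eq. (6): relative loss change of each network decides which one is updated.
    r_g = abs(curr_g - prev_g) / (abs(prev_g) + c)
    r_d = abs(curr_d - prev_d) / (abs(prev_d) + c)
    return "discriminator" if r_d > lam * r_g else "generator"

real_z, noise = torch.randn(8, latent_size), torch.randn(8, 100)
d_loss, g_loss = wgan_gp_losses(real_z, noise)
print(choose_update(g_loss.item(), 1.0, d_loss.item(), 1.0))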
3.2 Value head

Due to the high computational costs of implementing a text discriminator with a vocabulary of size equal to GPT-2's, we rely on an external value head, whose scalar output for each token corresponds to the intermediate reward. Let ρ be the hidden states of the decoder and ν the value head; the rewards calculated from an external metric R_{ex} and each state S_t are:

R_t = \nu(S_t \mid \rho), \quad \text{with } \nu(S_t \mid \rho) \text{ minimizing } \sum_{t=0}^{T} \nu(S_t \mid \rho) - R_{ex}.   (7)

The external head takes as input the frozen hidden states of GPT-2. Freezing the hidden parameters is necessary because we are not modelling the parameters of the VAE model during RL. The loss is then calculated according to a MAE objective and passed back to the value head. This process estimates how much each token contributes to the reward, and compared to a text discriminator it is faster to train and showed better results.

3.3 Entropy-based rewards

Our external rewards are calculated based on the quality metric BLEU, which we briefly describe in Section 4.1. Under many considered reward specifications, which included the addition of diversity metrics, maximum entropy RL, changes of temperature, or a combination of these, the increase in quality is counterbalanced by a drop in the diversity of the generated sentences. One approach that managed to balance the quality-diversity trade-off was the addition of an intrinsically motivated penalty based on the confidence of the model when generating the token, calculated by the entropy. If again we consider the hidden states ρ, the last layer calculating the logits as our policy π with parameters θ, and the entropy as H(π(·|ρ, S_t)), we calculate the intrinsic rewards and the performance objective as:

R_t^{in} = \log(\mathrm{clamp}(H(\pi(\cdot \mid \rho, S_t)), 0.2, 1))   (8)

\nabla_\theta J(\theta_t) = \mathbb{E}_\pi\Big[\Big(\sum_{k=0}^{t} \gamma^k R_k + R_t^{in}\Big) \nabla_\theta \ln \pi(A_t \mid \rho, S_t, \theta)\Big].   (9)

This specification favours high-reward actions with high entropy, while low-entropy actions have to have a high enough reward to be able to keep their high probability, resulting in a more diverse generation. As a rule of thumb, we found that penalties should be lower than the maximum overall reward.
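A minimal sketch of Sections 3.2-3.3, under our own naming assumptions: a linear value head over frozen decoder states trained with an MAE objective against the external BLEU reward (Eq. 7), the clamped-entropy intrinsic reward of Eq. (8), and the resulting per-step weighting of the policy-gradient loss of Eq. (9).

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, vocab_size = 768, 50257
value_head = nn.Linear(hidden_size, 1)

def value_head_loss(hidden_states, external_reward):
    # hidden_states: (seq_len, hidden_size), frozen GPT-2 states.
    per_token = value_head(hidden_states).squeeze(-1)           # R_t = nu(S_t | rho)
    return (per_token.sum() - external_reward).abs()            # MAE between summed R_t and R_ex

def intrinsic_rewards(logits):
    # Eq. (8): clamp the per-step entropy to [0.2, 1] and take its log as a penalty.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)   # H(pi(.|rho, S_t))
    return torch.log(torch.clamp(entropy, 0.2, 1.0))

def policy_gradient_loss(log_probs, per_token_rewards, intrinsic, gamma=1.0):
    # Eq. (9): cumulative discounted extrinsic rewards plus the intrinsic term weight log pi.
    t = torch.arange(log_probs.size(0), dtype=torch.float)
    discounted = torch.cumsum((gamma ** t) * per_token_rewards, dim=0)
    weight = (discounted + intrinsic).detach()
    return -(weight * log_probs).sum()

hidden = torch.randn(12, hidden_size)                            # frozen decoder states
logits = torch.randn(12, vocab_size)                             # policy logits per step
log_probs = F.log_softmax(logits, dim=-1).max(dim=-1).values     # log pi of the chosen tokens
vh_loss = value_head_loss(hidden, external_reward=0.6)
pg_loss = policy_gradient_loss(log_probs, value_head(hidden).squeeze(-1).detach(),
                               intrinsic_rewards(logits))
print(vh_loss.item(), pg_loss.item())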
4 Experimental settings

In this section, we introduce the automatic metrics and the datasets used for evaluation. For comparison we consider an MLE model, SeqGAN and RankGAN, as implemented by the benchmarking platform Texygen (Zhu et al., 2018), and ScratchGAN. We also provide further details on our RL finetuning and present issues with the current evaluation metrics.

4.1 Evaluation metrics

BLEU is a metric that measures the overlapping n-grams between a hypothesis text and all the reference texts. The final score is calculated as the average of the scores over all hypothesis sentences. Studies have shown (Semeniuta et al., 2019; Caccia et al., 2020) that BLEU can only detect small syntax problems, resulting in poor correlation with human evaluations; however, it still remains the standard when evaluating the quality of generated texts. To measure diversity we utilize Backwards-BLEU (BBLEU), where the generated texts are the reference and the test set becomes the hypothesis, giving a measure of how much the test set is represented. Additionally, we consider the Fréchet Distance with the InferSent embedding model (FID). It has been shown that FID responds better than BLEU at identifying mode collapse and changes in word usage. However, we show that it can be biased due to its distributional assumptions, mainly for differences in sentence length distribution. Nonetheless, it can be useful in identifying very homogeneous outputs, especially in conjunction with BLEU and BBLEU scores.

4.2 Datasets

We consider three of the most widely used datasets for unconditional text generation: image COCO (Chen et al., 2015), Stanford Natural Language Inference (SNLI) (Bowman et al., 2015a) and the EMNLP News 2017 dataset². Moreover, we consider the YELP review dataset for conditional text generation (Asghar, 2016).

² http://www.statmt.org/wmt17/

                          COCO   SNLI   EMNLP   YELP
Conditional               no     no     no      yes
Average sentence length   11.3   9.7    28.8    96.4
Size of train set         10k    100k   270k    100k
Size of dev set           10k    10k    10k     10k
Size of test set          10k    10k    10k     10k

Table 1: Average sentence length of the train set and number of sentences for each of the datasets used for evaluation.

Each of the datasets presents different challenges when training: COCO and SNLI are small and medium-sized datasets with short sentences, respectively. EMNLP is a large dataset with longer sentences. Lastly, the YELP dataset is a conditional, medium-sized dataset with very long sentences. The preprocessing on the datasets is minimal and limited to the YELP dataset.
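For reference, a small sketch of how BLEU and Backwards-BLEU can be computed with NLTK; the tokenisation, smoothing and toy sentences are illustrative and may differ from the evaluation scripts actually used for the results below.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

generated = ["a man rides a bike on the beach".split(),
             "a woman holds a red umbrella".split()]
test_set  = ["a man is riding a bike near the beach".split(),
             "a woman is holding an umbrella in the rain".split()]

smooth = SmoothingFunction().method1
weights = (1/3, 1/3, 1/3)   # BLEU-3

# BLEU (quality): generated sentences scored against the whole test set as references.
bleu = corpus_bleu([test_set] * len(generated), generated,
                   weights=weights, smoothing_function=smooth)

# Backwards-BLEU (diversity): reference and hypothesis roles are swapped, so the
# score measures how well the generated corpus covers the test data.
bbleu = corpus_bleu([generated] * len(test_set), test_set,
                    weights=weights, smoothing_function=smooth)
print(f"BLEU-3 {bleu:.3f}  BBLEU-3 {bbleu:.3f}")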
Metrics      MLE    SeqGAN  RankGAN  ScratchGAN  Standard  Adaptive  OptAGAN
BLEU-2 ↑     0.829  0.796   0.764    0.835       0.825     0.816     0.860
BLEU-3 ↑     0.548  0.471   0.399    0.556       0.554     0.544     0.605
BLEU-4 ↑     0.304  0.228   0.159    0.313       0.285     0.284     0.356
BBLEU-2 ↑    0.840  0.762   0.728    0.824       0.765     0.805     0.841
BBLEU-3 ↑    0.563  0.450   0.383    0.545       0.488     0.526     0.586
BBLEU-4 ↑    0.317  0.221   0.163    0.303       0.261     0.278     0.350
FID ↓        0.926  1.934   3.509    0.466       1.153     0.784     0.674

Table 2: EMNLP results of the automatic metrics for OptAGAN, the two base VAE-GAN models with standard and adaptive updates, and the other models implemented for comparison.

Metrics      MLE    SeqGAN  RankGAN  ScratchGAN  Standard  Adaptive  OptAGAN
BLEU-2 ↑     0.841  0.838   0.784    0.795       0.867     0.888     0.889
BLEU-3 ↑     0.635  0.599   0.514    0.564       0.693     0.726     0.727
BLEU-4 ↑     0.428  0.380   0.309    0.363       0.484     0.524     0.525
BBLEU-2 ↑    0.843  0.768   0.771    0.800       0.786     0.764     0.764
BBLEU-3 ↑    0.639  0.546   0.523    0.564       0.570     0.547     0.548
BBLEU-4 ↑    0.433  0.347   0.347    0.362       0.374     0.361     0.362
FID ↓        0.376  0.919   1.486    0.539       1.424     1.764     1.765

Table 3: SNLI results of the automatic metrics for OptAGAN, the two base VAE-GAN models with standard and adaptive updates, and the other models implemented for comparison.

Metrics      MLE    SeqGAN  RankGAN  ScratchGAN  Standard  Adaptive  OptAGAN
BLEU-2 ↑     0.854  0.866   0.834    0.862       0.917     0.919     0.920
BLEU-3 ↑     0.664  0.667   0.641    0.678       0.792     0.792     0.794
BLEU-4 ↑     0.459  0.452   0.423    0.478       0.627     0.617     0.620
BBLEU-2 ↑    0.844  0.813   0.775    0.791       0.755     0.775     0.778
BBLEU-3 ↑    0.656  0.616   0.553    0.573       0.550     0.579     0.584
BBLEU-4 ↑    0.455  0.415   0.344    0.377       0.368     0.396     0.399
FID ↓        0.588  0.756   2.347    1.001       2.297     1.908     1.903

Table 4: COCO results of the automatic metrics for OptAGAN, the two base VAE-GAN models with standard and adaptive updates, and the other models implemented for comparison.

4.3 Experimental results

Tables 2, 3 and 4 show the results for the quality and diversity of the models. The changes between standard and adaptive updates mostly favour the adaptive one, with larger gains in diversity, with the exception of the SNLI dataset. Additional considerations about the two updates can be found in Appendix B. Therefore, we only apply the RL finetuning on the adaptively trained model to obtain the OptAGAN results. Our approach shows improvements under all metrics, albeit small for the COCO and SNLI datasets. In comparison with the other GAN models, OptAGAN boasts the highest quality, and average or higher diversity. In comparison with the MLE model, the quality-diversity trade-off favours our model for quality, and the MLE approach for diversity. Notably, for the EMNLP dataset, the curiosity-driven finetuning allows OptAGAN to surpass all models for both BLEU and BBLEU. We believe that the difference in the magnitude of change between the EMNLP task and the COCO and SNLI ones is due to the starting quality of the model. In fact, as the RL finetuning slightly prioritizes quality, that is, increases in BLEU score, over diversity, the actual changes in the word-by-word generation are very few for those two datasets. Compared to the other methods, ours is also better able to reproduce longer sequences of words, as the growing differences between the 2-, 3- and 4-gram metrics show.
Regarding the FID scores, they mostly show the same behaviour as BBLEU. However, we show in Section 4.5 that FID is biased by the sentence length distribution, which, in contrast with other methods such as ScratchGAN, we do not model. From the computational cost point of view, the full training of our model took at most 24 hours using a single Tesla V100 GPU. Additional details about the training can be found in Appendix A.

4.4 Conditional generation

For the conditional generation task we follow the same procedure as the unconditional one, with the only exception of the addition of a classifier network to better model the GAN generation depending on the label. The results of Table 5 are very similar to the ones of COCO and SNLI, where the entropy regularized finetuning performs slightly better than the base adaptive VAE-GAN model, with small gains in diversity and quality. We present some examples of the generated sentences in Appendix C.

Metrics    Standard  Adaptive  OptAGAN
BLEU-2     0.880     0.886     0.887
BLEU-3     0.675     0.683     0.685
BLEU-4     0.448     0.456     0.458
BBLEU-2    0.860     0.854     0.854
BBLEU-3    0.666     0.660     0.661
BBLEU-4    0.453     0.449     0.451
FID        2.598     2.539     2.562

Table 5: YELP results of the automatic metrics for the base VAE-GAN models and the entropy regularized OptAGAN.

4.5 Analysis of the FID space

We discuss problems with the FID metric by showing a heatmap of the two-dimensional principal component analysis (PCA) representation and the length distribution of the sentences for the SNLI dataset. Previous works (de Masson d'Autume et al., 2020) already investigated the dependency of the FID scores on the length distribution, which can overshadow other problems with the generated samples. We further reinforce those analyses, as a prime example of this issue can be seen in Figures 2 and 3, where the FID scores get progressively better (lower) the closer the length distribution of the generated sentences is to the one of the test data, while ignoring the fact that the actual distribution may not match the test one. In fact, the score for the ScratchGAN model is much lower than the one for OptAGAN, although the distribution of the sentences of ScratchGAN in the space misses most of the distribution of the test sentences, as can be seen from the PCA representation. Although only 10-15% of the overall variability over the InferSent embedding dimensions is explained by PCA, there is a huge mismatch that is not addressed by the use of the Fréchet distance, which favours homogeneous length distributions over correct representation of the space.

Figure 2: PCA representation of the FID space of the SNLI test set (top left), OptAGAN (top right), ScratchGAN (bottom left) and MLE (bottom right) generated data.

Figure 3: Length distribution of the four datasets and the FID scores for the SNLI dataset.
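A sketch of the kind of analysis used in this section: a Fréchet distance between Gaussian fits of sentence embeddings, and a two-dimensional PCA projection of the same embeddings. Random arrays stand in for the InferSent embeddings of the test and generated sentences; the formula is the standard Fréchet (FID-style) distance, not the exact evaluation script of the paper.

import numpy as np
from scipy.linalg import sqrtm
from sklearn.decomposition import PCA

def frechet_distance(emb_a, emb_b):
    # Fréchet distance between Gaussian fits of two embedding sets.
    mu_a, mu_b = emb_a.mean(0), emb_b.mean(0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):          # numerical noise can give tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * covmean))

test_emb = np.random.randn(1000, 64)      # placeholder for InferSent test embeddings
gen_emb = np.random.randn(1000, 64) * 0.8 + 0.5   # placeholder for generated embeddings

print("Frechet distance:", frechet_distance(test_emb, gen_emb))

# 2-D PCA fitted on the test embeddings; generated embeddings are projected into the
# same space to compare the two distributions (as in Figure 2).
pca = PCA(n_components=2).fit(test_emb)
test_2d, gen_2d = pca.transform(test_emb), pca.transform(gen_emb)
print("explained variance ratio:", pca.explained_variance_ratio_)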
4.6 RL finetuning experiment

In order to fully gauge the strengths and weaknesses of our entropy penalty approach, we set up an experiment where we train a GAN model using the pre-trained Optimus encoder and decoder that are not finetuned on the dataset, so that we can finetune a lower-quality model. We compare this model with two curiosity-driven regularized models finetuned for 1000 and 5000 epochs, respectively. We also use a higher learning rate, as we are not interested in preserving the structure of the original sentences, and present the results in Table 6.

Metrics   VAEGAN  RL 1000  RL 5000
BLEU-2    0.641   0.707    0.829
BLEU-3    0.370   0.444    0.624
BLEU-4    0.198   0.243    0.402
BBLEU-2   0.752   0.752    0.717
BBLEU-3   0.478   0.497    0.479
BBLEU-4   0.266   0.286    0.283
FID       2.441   2.369    2.892

Table 6: Results of an experiment where we start with an unoptimized VAE-GAN model and finetune it with our RL strategy for 1000 and 5000 epochs on the COCO dataset.

The starting model achieves much worse results than the fully optimized OptAGAN, especially with regards to the BLEU score. After 1000 epochs of RL finetuning, the model improves over both quality and diversity. However, when finetuning for longer, the model cannot balance anymore between exploration and exploitation. We believe this behaviour is due to the bias that the model has in finding high-reward tokens. The RL finetuning does not evaluate all the tokens in the vocabulary, so there might be multiple good-scoring tokens which are never considered during the finetuning. This means that our approach is limited by the quality of the original model. A countermeasure that might prevent this kind of behaviour could be to increase the penalty for models with higher average entropy, and to tune its value depending on the use case, to further encourage heterogeneous generation.

5 Conclusions and discussion

In this work, we benchmark the combination of Optimus and GANs for a text VAE-GAN model, with results that already surpass current methods for the quality of generated texts. We further improve this baseline using entropy-based curiosity-driven rewards to improve both the quality and the diversity of the model. This novel approach could benefit many models utilizing RL for text generation, and supplementary research could be done into exploring advantage policy gradient, or proximal policy optimization with intrinsic rewards. This specification of the reward also allows researchers to prioritize quality, diversity, or to balance between both. Due to our limited computational resources, we utilize smaller batch sizes than we would otherwise have preferred, as larger batch sizes could help in reducing the high-variance gradients of these approaches. Moreover, further research on the automatic metrics could be beneficial not only for evaluation, but also for better reward signals to improve the speed and quality of the finetuning.

References

Mehdi Allahyari, Seyed Amin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys J. Kochut. 2017. Text summarization techniques: A brief survey. CoRR, abs/1707.02268.

Nabiha Asghar. 2016. Yelp dataset challenge: Review rating prediction. CoRR, abs/1605.05362.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2015b. Generating sentences from a continuous space. CoRR, abs/1511.06349.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2020. Language GANs falling short.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325.

Cyprien de Masson d'Autume, Mihaela Rosca, Jack Rae, and Shakir Mohamed. 2020. Training language GANs from scratch.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

David Donahue and Anna Rumshisky. 2018. Adversarial text generation without reinforcement learning. CoRR, abs/1810.06640.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial networks.

Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2017. Long text generation via adversarial training with leaked information.

Md. Akmal Haidar, Mehdi Rezagholizadeh, Alan Do-Omri, and Ahmad Rashid. 2019. Latent code and text-based generative adversarial networks for soft-text generation. CoRR, abs/1904.07293.

Yanghoon Kim, Seungpil Won, Seunghyun Yoon, and Kyomin Jung. 2020. Collaborative training of GANs in continuous and discrete spaces for text generation.

Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, and Alexander M. Rush. 2018. Semi-amortized variational autoencoders.

Matt J. Kusner and José Miguel Hernández-Lobato. 2016. GANs for sequences of discrete elements with the Gumbel-Softmax distribution.

Chunyuan Li, Xiang Gao, Yuan Li, Baolin Peng, Xiujun Li, Yizhe Zhang, and Jianfeng Gao. 2020. Optimus: Organizing sentences via pre-trained modeling of a latent space.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. CoRR, abs/1606.01541.

Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. CoRR, abs/1705.11001.

Weili Nie, Nina Narodytska, and Ankit Patel. 2019. RelGAN: Relational generative adversarial networks for text generation. In International Conference on Learning Representations.

Xu Ouyang and Gady Agam. 2020. Accelerated WGAN update strategy with loss change rate balancing.

Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. 2017. Language generation with recurrent generative adversarial networks without pre-training. CoRR, abs/1706.01399.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. ColdGANs: Taming language GANs with cautious sampling strategies.

Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. 2019. On accurate evaluation of GANs for language generation.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Qingyang Wu, Lei Li, and Zhou Yu. 2021. TextGAIL: Generative adversarial imitation learning for text generation.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2016. SeqGAN: Sequence generative adversarial nets with policy gradient. CoRR, abs/1609.05473.

Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. 2017. Adversarially regularized autoencoders for generating discrete structures. CoRR, abs/1706.04223.

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. CoRR, abs/1802.01886.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. CoRR, abs/1909.08593.
A Training details

For the finetuning of Optimus, we follow the original work and train for one epoch with the hyperparameters that give the best reconstruction quality, namely:

• Pre-trained model epoch = 508523
• Training epochs³ = 1
• Learning rate = 5 · 10⁻⁵
• Batch size = 5
• Latent size = 768
• β = 0
• Annealing ratio = 0.5
• Ratio increase = 0.25

³ Due to the size of the COCO dataset, it is the only one where we finetune for 5 epochs.

We follow a similar approach for the training of the GAN part of the model, where we use standard hyperparameters for the training of WGAN-GP. The best performing epoch of the GAN, according to the sum of the BLEU and BBLEU partial scores with 500 texts for both reference and hypothesis, is saved.

• Training epochs = 50
• Learning rate = 10⁻⁴
• Batch size = 256
• Latent size = 768
• Maximum sequence length = 100
• Number of blocks of generator and discriminator = 10
• Gradient penalty λ = 10

Finally, these are the details for the entropy regularized finetuning. Empirical results showed next to no difference for the BLEU n-gram choice, so we choose 1-gram due to slightly faster computational times, which also translates into a clearer understanding of the intermediate values. Moreover, we use a small learning rate, in order to keep the same structure as the original sentences.

• BLEU reward n-grams = 1
• Finetuning epochs = 1000
• Learning rate = 10⁻⁶
• Batch size = 32
• Epochs of value head pre-training = 200
• Learning rate of value head pre-training = 10⁻⁴
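For convenience, the hyperparameters above are collected here into plain Python dictionaries; the keys are descriptive labels, not the actual argument names of the OptAGAN scripts.

optimus_finetuning = {
    "pretrained_model_epoch": 508523, "training_epochs": 1, "learning_rate": 5e-5,
    "batch_size": 5, "latent_size": 768, "beta": 0.0,
    "annealing_ratio": 0.5, "ratio_increase": 0.25,
}
gan_training = {
    "training_epochs": 50, "learning_rate": 1e-4, "batch_size": 256,
    "latent_size": 768, "max_sequence_length": 100,
    "num_blocks_generator_discriminator": 10, "gradient_penalty_lambda": 10,
}
rl_finetuning = {
    "bleu_reward_ngrams": 1, "finetuning_epochs": 1000, "learning_rate": 1e-6,
    "batch_size": 32, "value_head_pretraining_epochs": 200,
    "value_head_pretraining_lr": 1e-4,
}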
B Adaptive strategy details

Figures 4 and 5 show the validation BLEU and BBLEU for the COCO and YELP datasets. We show the results over these two datasets due to the stark difference between them.

Figure 4: Differences in BLEU scores (BLEU, B-BLEU and their sum, for the standard and adaptive updates) during training, evaluated on the COCO validation set.

Figure 5: Differences in BLEU scores (BLEU, B-BLEU and their sum, for the standard and adaptive updates) during training, evaluated on the YELP validation set.

It is evident that the adaptive strategy is slightly slower to converge. Since the YELP dataset is 10 times larger than COCO, in both cases after about 200,000 samples the two models reach the same quality. The remainder of the training appears to be very stable for the standard update, with little to no change in both scores. Meanwhile, the adaptive updates look slightly more volatile, with changes that impact the two scores both positively and negatively, usually following the quality-diversity trade-off principle.

C Generated samples

We include several randomly sampled sentences from OptAGAN for each dataset:

Dataset   Sentences
COCO      a man posing with a bike inside of a forest .
          a man sitting on a swinging chair pulled by some purple ducks .
          a woman holding a bird over some flowers on the beach .
SNLI      the man is sweating because they are blue .
          the man is getting more thank with the dog .
          a man in white clothes stands next to a marketplace where he can store plastic .
EMNLP     Republican presidential nominee , Donald Trump , has said that 20 to 30 years might be the way to try and narrow it out .
          Whether or not agenda minutes can deliver , it would , therefore , encourage the majority of Scottish MPs to think about that .
          Earlier in the day , Bell travelled to Sydney ' s Supreme Court and was effectively blocking a vote of no - one that would produce the album .

Table 7: Examples of the generated unconditional sentences from OptAGAN trained on the COCO, SNLI and EMNLP datasets.

Stars   Generated reviews
1       horrible experience at this mcdonalds . i have no idea what they are trying to sell you , but if you want anything other than vanilla bean puree and no rice at all , do yourself a favor and go here . for the most part , the waiters do not give a shit .
2       i have been to some great buffets , but this was mediocre at best . my husband ordered a turkey sandwich and it was just like anyone else's sandwich . for the service , there wasn't much seating . the food and the _UNK were stale , awful bread . something to do if you go to vegas for dinner and want to have some classic awesomeness ... maybe try a dim sum instead .
3       this place is pretty good . i like the burgers , the selection is pretty good and they have a ginormous amount of steak . being a vegetarian , i took one bite of everything i ordered and went back again . for the friday afternoon rush - they brought me the french fries instead of the turkey , a mousse and mgr .
4       loved this place . i had the corned beef burrito , which was very good , though a bit greasy and lacking . they have a good selection of veggie options and happy hour specials and are very attentive to your meal . it's clean , spacious , and the ambiance is great . i have definitely come here when my friends visit vegas to see if they have any other option . the best part about going here is the sitting area outside where you can hang out while eating all you could eat .
5       love this place . i have visited all of the great restaurants that offer all sorts of flavors , and to top it all off , all the facilities are extremely clean . _UNK is the guy . he came for a quick check up , took my dad to the bar and came out free of charge !! seriously !

Table 8: Examples of the generated sentences from OptAGAN trained on the YELP dataset, conditioned on the label.