DGST: a Dual-Generator Network for Text Style Transfer - Association ...
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
DGST: a Dual-Generator Network for Text Style Transfer Xiao Li♠ , Guanyi Chen♥ , Chenghua Lin♠∗, Ruizhe Li♠ ♠ Department of Computer Science, University of Sheffield ♥ Department of Information and Computing Sciences, Utrecht University {xiao.li, c.lin, r.li}@sheffield.ac.uk, g.chen@uu.nl Abstract ing sentiment words, which is so-called the “pivot word” (Li et al., 2018; Wu et al., 2019). There are We propose DGST, a novel and simple Dual- also works which add extra components for con- Generator network architecture for text Style Transfer. Our model employs two generators straining the content from being changed too much. only, and does not rely on any discriminators These include models like autoencoder (Lample or parallel corpus for training. Both quantita- et al., 2019; Dai et al., 2019), part-of-speech tive and qualitative experiments on the Yelp preservation, and the content conditional language and IMDb datasets show that our model gives model (Tian et al., 2018). In order to achieve high- competitive performance compared to several quality style transfer, existing works normally re- strong baselines with more complicated archi- sort to adding additional inner or outer structures tecture designs. such as additional adversarial networks or data pre- 1 Introduction processing steps (e.g. generating pseudo-parallel corpora). This inevitably increases the complex- Attribute style transfer is a task which seeks to ity of the model and raises the bar of training data change a stylistic attribute of text, while preserving requirement. its attribute-independent information. Sentiment In this paper, we propose a novel and simple transfer is a typical example of such kind, which model architecture for text style transfer, which focuses on controlling the sentiment polarity of employs two generators only. In contrast to some the input text (Shen et al., 2017). Given a review of the dominant approaches to style transfer such as “the service was very poor”, a successful sentiment CycleGAN (Zhu et al., 2017), our model does not transferrer should covert the negative sentiment employ any discriminators and yet can be trained of the input to positive (e.g., replacing the phrase without requiring any parallel corpus. We achieve “very poor” with “pretty good”), while keeping all this by developing a novel sentence noisification other information unchanged (e.g., the aspect “ser- approach called neighbourhood sampling, which vice” should not being changed to “food”). can introduce noise to each input sentence dynam- Without supervised signals from parallel data, ically. The nosified sentences are then used to a transferrer must be supervised in a way to en- train our style transferrers in the way similar to the sure that the generated texts belongs to a certain training of denoising autoencoders (Vincent et al., style category (i.e., transfer intensity). There is a 2008). Both quantitative and qualitative evaluation growing body of studies to intensify the target style on the Yelp and IMDb benchmark datasets show by means of adversarial training (Fu et al., 2018), that DGST gives competitive performance com- variational autoencoder (John et al., 2019; Li et al., pared to several strong baselines which have more 2019; Fang et al., 2019), generative adversarial complicated model design. The code of DGST is nets (Shen et al., 2017; Zhao et al., 2018; Yang available at: https://xiao.ac/proj/dgst. et al., 2018), or subspace matrix projection (Li et al., 2020) 2 Methodology Furthermore, in order to boost the preservation of non-attribute information during style transfor- Suppose we have two non-parallel corpora X and mation, some works explicitly focus on modify- Y with style Sx and Sy , the goal is training two ∗ Corresponding author transferrers, each of which can (i) transfer a sen- 7131 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 7131–7136, November 16–20, 2020. c 2020 Association for Computational Linguistics
responsible for transferring the input text to the text + noise (') ℒ& target style (detailed in §2.2). (in S!) ( to S! ) ($) ℒ& 2.1 Preserving the Content of Input Text = Reconstruction e + nois This section discusses our loss function which + no ise enforces our transferrers to preserve the style- = independent information of the input. A common ($) ℒ" solution to this problem is to use the reconstruc- text + noise ( to S" ) ℒ" (') tion loss of the autoencoders (Dai et al., 2019), (in S" ) which is also known as the identity loss (Zhu et al., 2017). However, too much emphasis on preserv- Figure 1: The general architecture of DGST, in which ing the content would hinder the style transferring “=” means no back-propagation of gradients. ability of the transferrers. To balance our model’s capability in content preservation and transfer in- tensity, we instead first train our transferrers in the tence from one style (either Sx and Sy ) to another way of training denoising autoencoders (DAE, Vin- (i.e., transfer intensity); and (ii) preserve the style- cent et al., 2008), which has been proved to help independent context during the transformation (i.e., preserving the style independent content of input preservation). Specifically, we denote the two trans- text (Shen et al., 2020). More specifically, we train ferrers f and g. f : X → Y transfers a sentence f (or g; we use f as an example in the rest of this x ∈ X with style Sx to y ∗ with style Sy . Likewise, section) by feeding it with a noisy sentence ẙ as g : Y → X transfers a sentence y ∈ Y with style input, where ẙ is noisified from y ∈ Y and f is Sy to x∗ with Sx . To obtain good style transfer expected to reconstruct y. performance, f and g need to achieve both a high Different from previous works which use DAE in transfer intensity and a high preservation, which style transfer or MT (Artetxe et al., 2018; Lample can be formulated as follows: et al., 2019), we propose a novel sentence noisifi- ∀x, ∀x0 ∈ X , ∀y, ∀y 0 ∈ Y cation approach, named neighbourhood sampling, which introduces noise to each sentence dynami- y ∗ = f (x) ∈ Y, x∗ = g(y) ∈ X (1) cally. For a sentence y, we define Uα (y, γ) as a ∗ 0 ∗ 0 D(y ||x) ≤ D(y ||x), D(x ||y) ≤ D(x ||y) (2) neighbourhood of y, which is a set of sentences consisting of y and all variations of noisified y with Here D(x||y) is a function that measures the ab- the same noise intensity γ (which will be explained stract distance between sentences in terms of the later). The size of the neighbourhood Uα (y, γ) is minimum edit distance, where the editing opera- determined by the proportion (denoted by m) of tions Φ includes word-level replacement, insertion, tokens in y that are modified using the editing op- and deletion (i.e., the Hamming distance or the erations in Φ. Here the proportion m is sampled Levenshtein distance). On the one hand, Eq. 1 from a Folded Normal Distribution F. We hereby requires the transferred text should fall within the define that the average value of m (i.e., the mean of target style spaces (i.e., X or Y). On the other hand, F) is the noise intensity γ. Formally, m is defined Eq. 2 constrains the transferred text from changing as: too much, i.e., to preserve the style-independent m02 2 − πγ information. m ∼ F(m0 ; γ) = e 2 (3) πγ Inspired by CycleGAN (Zhu et al., 2017), our model (sketched in Figure 1) is trained by a cyclic That said, a neighbourhood Uα (y, γ) would be con- process: for each transferrer, a text is transferred structed using y and all sentences that are created to the target style, and then back-transferred to the by modifying (m × length(y)) words in y, from source style using another transferrer. In order to which we sample ẙ, i.e., a noisified sentence of y: transfer a sentence to a target style while preserving ẙ ∼ Uα (y, γ). Analogously, we could also con- the style-independent information, we formulate struct a neighbourhood Uβ (x, γ) for x ∈ X and two sets of training objectives: one set ensures sample x̊ from it. Using these noisified data as that the generated sentences is preserved as much inputs, we then train our transferrers f and g in as possible (detailed in §2.1) and the other set is the way of DAE by optimising the following recon- 7132
struction objectives: Dataset Yelp IMDb Positive Negative Positive Negative (c) Lf = Ey∼Y,ẙ∼Uα (y,γ) D(y||f (ẙ)) Train 266,041 177,218 178,869 187,597 (4) Dev 2,000 2,000 2,000 2,000 L(c) g = Ex∼X,x̊∼Uβ (x,γ) D(x||g(x̊)) Test 500 500 1,000 1,000 With Eq. 4, we essentially encourages the generator Table 1: Statistics of Datasets. to preserve the input as much as possible. 2.2 Transferring Text Styles (Yelp), which consists of restaurants and business Making use of non-parallel datasets, we train f and reviews together with their sentiment polarity (i.e., g in an iterative process. Let M = {g(y)|y ∈ Y } positive or negative), and the IMDb Movie Review be the range of g when the input is all sentences in Dataset (IMDb), which consists of online movie the training set Y . Similarly, we can define N = reviews. For Yelp, we split the dataset following Li {f (x)|x ∈ X}. During the training cycle of f , g et al. (2018), who also provided human produced will be kept unchanged. We first feed each sentence reference sentences for evaluation. For IMDb, we y (y ∈ Y ) to g, which tries to transfer y to the target follow the pre-processing and data splitting proto- style X (i.e. ideally x∗ = g(y) ∈ X ). In this way, col of Dai et al. (2019). Detailed dataset statistics we obtain M which is composed of all x∗ for each is given in Table 1. y ∈ Y . Next, we sample x̊∗ (a noised sentence of Evaluation Protocol. Following the standard eval- x∗ ) based on x∗ via the neighbourhood sampling, uation practice, we evaluate the performance of i.e., x̊∗ ∼ Uα (x∗ , γ) = Uα (g(y), γ). We use M̊ to our model on the textual style transfer task from represent the collection of x̊∗ . Similarly, we obtains two aspects: (1) Transfer Intensity: a style classi- N and N̊ using the aforementioned procedures fier is employed for quantifying the intensity of during the training cycle for g. the transferred text. In our work, we use Fast- Instead of directly using the sentences from X Text (Joulin et al., 2017) trained on the training for training, we use M̊ to train f by forcing f to set of Yelp; (2) Content Preservation: to validate transfer each x̊∗ back to the corresponding original whether the style-independent context is preserved y. In parallel, N̊ is utilised to train g. We repre- by the transferrer, we calculate self -BLEU, which sent the aforementioned operation as the transfer computes a BLEU score (Papineni et al., 2002) by objective. comparing inputs and outputs of a system. A higher self -BLEU score indicates more tokens from the (t) Lf = Eα,y∼Y,x̊∗ ∼Uα (g(y),γ) D(y||f (x̊∗ )) sources are retained, henceforth, better preserva- (5) L(t) ∗ tion of the contents. In addition, we also use ref - g = Eβ,x∼X,ẙ ∗ ∼Uβ (f (x),γ) D(x||g(ẙ )) BLEU, which compares the system outputs and the The main difference between Eq. 4 and Eq. 5 is how references written by human beings. Uα (·, γ) and Uβ (·, γ) are constructed, i.e., Uα (y, γ) and Uβ (x, γ) in Eq. 4 compared to Uα (g(y), γ) and 3.2 Experimental Results Uβ (f (x), γ) in Eq. 5. Finally, the overall loss of In our experiment, the two transferrers (f and g) DGST is the sum of the four partial losses: are Stacked BiLSTM-based sequence-to-sequence (c) (t) models, i.e., both 4-layer BiLSTM for the encoder L = Lf + Lf + L(c) (t) g + Lg (6) and decoder. The noise intensity γ is set to 0.3 in the first 50 epochs and 0.03 in the following During optimisation, we freeze g when optimis- epochs. ing f , and vice versa. Also with the reconstruction As shown in Table 2, for the Yelp dataset objective, x∗ must to be sampled first, and then our model defeats all baselines models (apart passed x̊∗ into f ; in contrast, it is not necessary to from StyleTransformer (Multi-Class)) on both ref - sample according to y when we obtain x∗ = g(y). BLEU and self -BLEU. In addition, as shown in 3 Experiment Table 2, our model works remarkably well on both transfer intensity and preservation without requir- 3.1 Setup ing adversarial training or reinforcement learning, Dataset. We evaluated our model on two bench- or external offline sentiment classifiers (as in Dai mark datasets, namely, the Yelp review dataset et al. (2019)). Besides, the current version of our 7133
Yelp IMDb Model acc. ref -BLEU self -BLEU acc. self -BLEU RetrieveOnly (Li et al., 2018) 92.6 0.4 0.7 n/a n/a TemplateBased (Li et al., 2018) 84.3 13.7 44.1 n/a n/a DeleteOnly (Li et al., 2018) 85.7 9.7 28.6 n/a n/a DeleteAndRetrieve (Li et al., 2018) 87.7 10.4 29.1 55.8 55.4 ControlledGen (Hu et al., 2017) 88.8 14.3 45.7 94.1 62.1 CycleRL (Xu et al., 2018) 88.0 2.8 7.2 97.8 4.9 StyleTransformer (Conditional) (Dai et al., 2019) 93.7 17.1 45.3 86.6 66.2 StyleTransformer (Multi-Class) (Dai et al., 2019) 87.7 20.3 54.9 80.3 70.5 DGST 88.0 18.7 54.5 70.1 70.2 Table 2: Automatic evaluation results on Yelp and IMDb corpora, most of which are from Dai et al. (2019). Yelp positive → negative input this golf club is one of the best in my opinion . output this golf club is one of the worst in my opinion . input i definitely recommend this place to others ! output i do not recommend this to anyone ! Yelp negative → positive input the garlic bread was bland and cold . output the garlic bread was tasty and fresh . input my dish was pretty salty and could barely taste the garlic crab . output my dish was pretty good and could even taste the garlic crab . IMDb positive → negative input a timeless classic , one of the best films of all time . output a complete disaster , one of the worst films of all time . input and movie is totally backed up by the excellent music both in background and in songs by monty . output the movie is totally messed up by the awful music both in background and in songs by chimps . IMDb negative → positive input this one is definitely one for my “ worst movies ever ” list . output this one is definitely one of my “ best movies ever ” list . input i found this movie puerile and silly , as well as predictable . output i found this movie credible and funny , as well as tragic . Table 3: Example results from our model for the sentiment style transfer on the Yelp and IMDb datasets. model is built upon fundamental BiLSTM, which 3.3 Ablation Study is a likely explanation of why we lose to the SOTA To confirm the validity of our model, we did an (i.e., StyleTransformer (Multi-Class)) for a small ablation study on Yelp by eliminating or modifying margin, which are based on the Transformer archi- a certain component (e.g., objective functions, or tecture (Vaswani et al., 2017) with much higher sampling neighbourhood). We tested the following capacity. For the IMDb dataset, comparing to other variations: 1) full-model: the proposed model; 2) systems, our model obtained moderate accuracy but no-tran: the model without the transfer objective; competitive self-BLEU score (70.2), i.e., slightly 3) no-rec: the model without the reconstruction lower than StyleTransformer. Table 3 lists several objective; 4) rec-no-noise: the model adding no examples for style transfer in sentiment for both noise when optimising the reconstruction objective; datasets. By examining the results, we can see that 5) tran-no-noise: the model adding no noise when DGST is quite effective in transferring the senti- optimising the transfer objective; 6) pre-noise: the ment polarity of the input sentence while maintain- model trained by adding noise to y first and then ing the non-sentiment information. feeding the nosified sentences ẙ to g (or x̊ to f ) in Eq. 5. In this study, the transferrers are the simplest LSTM-based sequence-to-sequence models. The hidden size and γ are set to 256 and 0.3, respec- 7134
positive → negative negative → positive input it is a cool place , with lots to see and try . so , that was my one and only time ordering the benedict there . full-model it is a sad place , with lots to see and something . so , that was my one and best time in the shopping there . no-rec no no , , num . so , that was my one and time time over the there there . rec-no-noise it is a cool place , with me to see and try . service was very friendly . no-tran it is a loud place , with lots to try and see . so , that was my only and first visit ordering the there ) . tran-no-noise it is a noisy place , with lots to try and see . so , that was my one and time time ordering the ordering there . pre-noise it is a cool place , with lots to see to try . so , that ’s one one and my only the the day there . input it is the most authentic thai in the valley . even if i was insanely drunk , i could n’t force this pizza down . full-model it is the most overrated thai in the valley . even if i was n’t hungry , i ’ll definitely enjoy this pizza here . no-rec i was in the the the the food . she was perfect . rec-no-noise it is the most authentic thai in the valley . even if i was n’t , , i could n’t recommend this pizza . . no-tran it is the most authentic thai in the valley . even if i was n’t , , i could n’t get this pizza down . tran-no-noise it is the most common thai in the valley . even if i was hungry hungry , i could n’t love this pizza shop . pre-noise it is the most thai thai in the valley . even if i was n’t hungry , i could n’t recommend this pizza down . Table 4: Example transferred from the ablation study. Model Variants self -BLEU acc. negative review. no-rec 0.0 98.9 rec-no-noise 41.9 73.1 4 Conclusion no-tran 98.0 4.2 tran-no-noise 35.6 82.9 In this paper, we propose a novel and simple dual- pre-noise 38.9 76.8 generator network architecture for text style trans- full-model 37.2 86.3 fer, which does not rely on any discriminators or parallel corpus for training. Extensive exper- Table 5: Evaluation results for the ablation study. iments on two public datasets show that our model yields competitive performance compared to sev- tively. eral strong baselines, despite of our simpler model Results. Table 5 depicts the results of the ablation architecture design. study. As we can see, eliminating the reconstruc- Acknowledgements tion or transfer objectives would damage preserva- tion and transfer intensity, respectively. As for the We would like to thank all the anonymous review- use of noise, the results of the rec-no-noise model ers for their insightful comments. This work is shows that the noise in the reconstruction objec- supported by the award made by the UK Engineer- tive helps balance our model’s ability in content ing and Physical Sciences Research Council (Grant preservation and transfer intensity. For the trans- number: EP/P011829/1). fer objective, omitting noise (tran-no-sp) would reduce the transfer intensity while placing noise in the wrong position (pre-noise) reduces it yet again. References Case Study. Transferred sentences produced by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and each model variant in the ablation study are listed Kyunghyun Cho. 2018. Unsupervised neural ma- chine translation. In International Conference on in Table 4. The model without correction objec- Learning Representations. tive (no-corr) collapsed and as a result it gener- ates irrelevant sentences to the inputs most of the Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing time. When neighbourhood sampling is dropped Huang. 2019. Style transformer: Unpaired text style transfer without disentangled latent represen- in either corrective or transfer objectives, the trans- tation. In Proceedings of the 57th Annual Meet- fer intensity is reduced. These models, including ing of the Association for Computational Linguis- rec-no-noise, tran-no-noise, and pre-noise, tend tics, pages 5997–6007, Florence, Italy. Association to substitute random words, and result in reduced for Computational Linguistics. transfer intensity (i.e., style words are either not Le Fang, Chunyuan Li, Jianfeng Gao, Wen Dong, and modified or still express the same sentiment af- Changyou Chen. 2019. Implicit deep latent vari- ter modification) and preservation. For example, able models for text generation. In Proceedings of the 2019 Conference on Empirical Methods in Nat- when transferring from negative to positive, rec-no- ural Language Processing and the 9th International noise replace “force” to “recommend” resulting Joint Conference on Natural Language Processing “I couldn’t recommend this pizza”, which is still a (EMNLP-IJCNLP), pages 3937–3947. 7135
Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, Tianxiao Shen, Jonas Mueller, Regina Barzilay, and and Rui Yan. 2018. Style transfer in text: Explo- Tommi Jaakkola. 2020. Educating text autoen- ration and evaluation. In Thirty-Second AAAI Con- coders: Latent representation guidance via denois- ference on Artificial Intelligence. ing. In Proceedings of Machine Learning and Sys- tems 2020, pages 9129–9139. Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward Youzhi Tian, Zhiting Hu, and Zhou Yu. 2018. Struc- controlled generation of text. In Proceedings tured content preservation for unsupervised text of the 34th International Conference on Machine style transfer. arXiv preprint arXiv:1810.06526. Learning-Volume 70, pages 1587–1596. JMLR. org. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Kaiser, and Illia Polosukhin. 2017. Attention is all Vechtomova. 2019. Disentangled representation you need. In Advances in neural information pro- learning for non-parallel text style transfer. In Pro- cessing systems, pages 5998–6008. ceedings of the 57th Annual Meeting of the Associa- tion for Computational Linguistics, pages 424–434. Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and Armand Joulin, Edouard Grave, Piotr Bojanowski, and composing robust features with denoising autoen- Tomas Mikolov. 2017. Bag of tricks for efficient coders. In Proceedings of the 25th international con- text classification. In Proceedings of the 15th Con- ference on Machine learning, pages 1096–1103. ference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Pa- Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, pers, pages 427–431. Association for Computational and Songlin Hu. 2019. Mask and infill: Apply- Linguistics. ing masked language model for sentiment transfer. In Proceedings of the Twenty-Eighth International Guillaume Lample, Sandeep Subramanian, Eric Smith, Joint Conference on Artificial Intelligence, IJCAI- Ludovic Denoyer, Marc’Aurelio Ranzato, and Y- 19, pages 5271–5277. International Joint Confer- Lan Boureau. 2019. Multiple-attribute text rewrit- ences on Artificial Intelligence Organization. ing. In International Conference on Learning Rep- resentations. Jingjing Xu, Xu Sun, Qi Zeng, Xiaodong Zhang, Xu- ancheng Ren, Houfeng Wang, and Wenjie Li. 2018. Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Unpaired sentiment-to-sentiment translation: A cy- Delete, retrieve, generate: a simple approach to sen- cled reinforcement learning approach. In Proceed- timent and style transfer. In Proceedings of the 2018 ings of the 56th Annual Meeting of the Association Conference of the North American Chapter of the for Computational Linguistics (Volume 1: Long Pa- Association for Computational Linguistics: Human pers), pages 979–988, Melbourne, Australia. Asso- Language Technologies, Volume 1 (Long Papers), ciation for Computational Linguistics. pages 1865–1874, New Orleans, Louisiana. Associ- Zichao Yang, Zhiting Hu, Chris Dyer, Eric P Xing, and ation for Computational Linguistics. Taylor Berg-Kirkpatrick. 2018. Unsupervised text style transfer using language models as discrimina- Ruizhe Li, Xiao Li, Chenghua Lin, Matthew Collinson, tors. In Advances in Neural Information Processing and Rui Mao. 2019. A stable variational autoen- Systems, pages 7287–7298. coder for text modelling. In Proceedings of the 12th International Conference on Natural Language Gen- Junbo Zhao, Yoon Kim, Kelly Zhang, Alexander Rush, eration, pages 594–599. and Yann LeCun. 2018. Adversarially regularized autoencoders. volume 80 of Proceedings of Ma- Xiao Li, Chenghua Lin, Ruizhe Li, Chaozheng Wang, chine Learning Research, pages 5902–5911, Stock- and Frank Guerin. 2020. Latent space factorisation holmsmässan, Stockholm Sweden. PMLR. and manipulation via matrix subspace projection. In Proceedings of Machine Learning and Systems 2020, Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A pages 3211–3221. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Pro- Kishore Papineni, Salim Roukos, Todd Ward, and Wei- ceedings of the IEEE international conference on Jing Zhu. 2002. Bleu: a method for automatic eval- computer vision, pages 2223–2232. uation of machine translation. In Proceedings of the 40th annual meeting on association for compu- tational linguistics, pages 311–318. Association for Computational Linguistics. Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in neural informa- tion processing systems, pages 6830–6841. 7136
You can also read