Multi-Style Transfer with Discriminative Feedback on Disjoint Corpus
Navita Goyal, Balaji Vasan Srinivasan, Anandhavelu N, Abhilasha Sancheti
Adobe Research, India
{navgoyal, balsrini, anandvn, sancheti}@adobe.com

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3500–3510, June 6–11, 2021. ©2021 Association for Computational Linguistics

Abstract

Style transfer has been widely explored in natural language generation with non-parallel corpus by directly or indirectly extracting a notion of style from the source and target domain corpus. A common shortcoming of existing approaches is the prerequisite of joint annotations across all the stylistic dimensions under consideration. Availability of such datasets across a combination of styles limits the extension of these setups to multiple style dimensions. While cascading single-dimensional models across multiple styles is a possibility, it suffers from content loss, especially when the style dimensions are not completely independent of each other. In our work, we relax this requirement of jointly annotated data across multiple styles by using independently acquired data across different style dimensions without any additional annotations. We initialize an encoder-decoder setup with a transformer-based language model pre-trained on a generic corpus and enhance its re-writing capability to multiple target style dimensions by employing multiple style-aware language models as discriminators. Through quantitative and qualitative evaluation, we show the ability of our model to control styles across multiple style dimensions while preserving the content of the input text. We compare it against baselines involving cascaded state-of-the-art uni-dimensional style transfer models.

1 Introduction

Style transfer is a popular task in natural language processing and has been studied on attributes like age or gender (Subramanian et al., 2018), styles emanating from social constructs like formality (Rao and Tetreault, 2018) and politeness (Madaan et al., 2020), linguistic styles based on author writing style (Syed et al., 2020), or psycho-linguistic styles based on personality types (Mairesse and Walker, 2011). While early style transfer frameworks were modeled as a supervised learning task on a parallel corpus, state-of-the-art models are semi-supervised/unsupervised and operate on non-parallel corpus. These models achieve style transfer by aligning the source and target distributions of sentences from non-parallel corpus (Shen et al., 2017), disentangling the content space from the style space in latent representation (Hu et al., 2017), or employing self-reconstruction (Dai et al., 2019) and back-translation (Lample et al., 2018) objectives to achieve pseudo-supervision with non-parallel corpus. Recent works have also modeled this in a self-supervised manner where rewriting (transfer) is achieved by utilizing corpus from the target style alone (Syed et al., 2020). These wide studies have also led to the curation and benchmarking of non-parallel datasets for various style dimensions, such as sentiment (Li et al., 2018), formality (Rao and Tetreault, 2018), politeness (Danescu-Niculescu-Mizil et al., 2013), excitement (Sancheti et al., 2020), etc. But availability of data with joint tagging across multiple styles is limited and has restricted the ability of existing approaches to scale from single-dimensional transfer to multiple style dimensions. In this paper, we propose a multi-dimensional style transfer approach that can work off partially labelled data for style transfer across multiple dimensions simultaneously.
The work by Subramanian et al. (2018) attempts style transfer with multiple attributes such as age, gender, and sentiment simultaneously. However, their approach relies on a corpus tagged with each of these three style dimensions. In contrast to this and other similar explorations in multi-style transfer, our approach does not require jointly labelled data across all the stylistic dimensions in the source and/or target corpus. We focus on the problem where independent corpus is available across different stylistic dimensions (say sentiment and formality) and we achieve style transfer spanning different stylistic dimensions (say, make a sentence more positive and formal). While state-of-the-art approaches can be extended to achieve this by sequentially transferring one style after another, such an extension is limited because different style dimensions are not necessarily independent of each other. In aspects that are not independent, changing one style aspect of the text might affect another aspect under consideration, making a sequential brute-force approach non-ideal. As we show in our experiments later, the cascaded setup also lacks common grounding between the content from different styles, leading to erratic changes in content. We circumvent this by grounding our framework on the linguistic understanding of a large language model. Our model builds an understanding of the interplay between the different styles by incorporating multiple discriminative language models (LMs) with a language model-based encoder-decoder setup. The key contributions of this paper are:

1) An encoder-decoder setup with multiple language models as discriminators, with each entity harnessing the language understanding from a large pre-trained transformer model.

2) Relaxing the requirement of jointly labelled data for multi-style transfer, by leveraging independently acquired disjoint corpus for different styles.

3) Achieving better style control with better content preservation in multi-dimensional style transfer than a cascaded setup of state-of-the-art uni-dimensional style transfer models.
2 Related Work

One line of work in style transfer attempts to learn a disentangled latent representation for style and content, and to transfer style by manipulating the latent representation of style (Shen et al., 2017). Although these approaches perform well with one style at a time, they do not trivially scale to multi-dimensional style transfer.

Several other works develop unsupervised approaches for style transfer by employing Denoising Autoencoding (DAE) (Fu et al., 2017) and back-translation (BT) (Lample et al., 2018) losses to develop interaction, and hence transfer, between the source and target domains. Subramanian et al. (2018) extend this approach to multiple styles by conditioning on the average of the embeddings of each target attribute and using a combination of DAE and back-translation techniques. DAE takes as input a sentence x from style s and tries to reconstruct x from its corrupted version x̃. This relies on the assumption that the input sentence x is from a certain style combination s = {s_1, s_2, . . . , s_k}. Similarly, the back-translation (BT) objective, with input sentence x from style s, first estimates x′ = f(x, s′), where s ≠ s′, and then reconstructs x from x′ via x̃ = f(x′, s). Thus, these approaches are inherently dependent on knowledge of the annotation of each sentence with all the style combinations. Dai et al. (2019) achieve state-of-the-art style transfer in single style dimensions by employing a transformer-based model in conjunction with a classifier-based discriminator. In addition to discriminator losses, their proposed technique uses self-reconstruction and cycle reconstruction losses, which, similar to the DAE and BT losses, are also reliant on the availability of jointly annotated data to be extendable to a multiple style setup.
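To make the dependence on joint annotations concrete, the back-translation signal used by these approaches can be sketched as follows. This is an illustrative outline, not code from the cited works; `model.transfer` and `model.nll` are assumed interfaces for a style-conditioned rewriter.

```python
def back_translation_loss(model, x, s, s_prime):
    """Back-translation pseudo-supervision for non-parallel style transfer.

    x is assumed to be annotated with its full style combination s; the model
    first rewrites x into another style s' and is then trained to recover x
    from that intermediate output. `model.transfer` and `model.nll` are
    hypothetical interfaces standing in for the systems cited above.
    """
    x_prime = model.transfer(x, target_style=s_prime)    # x' = f(x, s'), with s != s'
    return model.nll(target=x, source=x_prime, style=s)  # train f to rebuild x from x'
```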
Language modeling is integral to several natural language generation (NLG) tasks like text summarization, spelling correction, image captioning, etc. The model architecture for these tasks has evolved from n-gram based methods to Recurrent Neural Networks to transformer architectures. The introduction of Transformer-based architectures accompanied by generative pre-training (Radford, 2018) has led to strong improvements in many downstream generation and GLUE (Wang et al., 2018) tasks. Generative pre-training aims to adapt a large Transformer language model to a large unsupervised corpus. This capability of generative pre-training is exploited in many large language models like BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2018), and ERNIE 2.0 (Sun et al., 2020), which have the ability to perform tasks like reading comprehension (Xu et al., 2019), summarization (Liu and Lapata, 2019), question answering (Rajpurkar et al., 2016) and translation (Clinchant et al., 2019) in zero-shot and few-shot settings.

Recently, these pre-trained generative language models have been explored in translation (Conneau and Lample, 2019) and style transfer tasks (Syed et al., 2020). Conneau and Lample (2019) develop cross-lingual models for unsupervised machine translation by initializing the encoder and decoder with a pre-trained language model trained on the Masked Language Modeling (MLM) (Devlin et al., 2019) objective and fine-tuning the encoder-decoder framework with adversarial training. Syed et al. (2020) extend this to the stylized re-writing task by employing DAE during fine-tuning. The joint encoder-decoder framework learns to reconstruct sentences in the target domain from their noisy versions using the DAE objective. As previously discussed, the DAE objective is reliant on the corpus being tagged for the target domain style (or combination of styles) and restricts the generalization of this setup to multiple attributes. We overcome this by employing discriminative language models to assist the decoder with feedback for the various target styles.
Shen et al. (2017) show that even with non-parallel data, the content distribution across the source and target style is shared. Based on this, a language model trained on the target style will have high perplexity on transferred text if it does not match the target style and low perplexity otherwise. Yang et al. (2018) exploit this ability of language models to replace standard binary classifier-based discriminators with an implicitly trained language model as discriminator. They show that using the language model as a structured discriminator allows for more stable training by eliminating the adversarial step. We extend this idea to a multi-discriminator approach. Training a LM on a combination of target styles is not possible in the absence of a jointly labelled dataset. Due to this, we use multiple discriminators, one for each of the target styles. Since with multiple styles the underlying corpus is independently acquired, the variation in content distribution across different styles is more noticeable. Consequently, a LM independently trained on one of the target styles might have high perplexity even if the transferred sentence fits the corresponding target style, due to the content space of the source sentence. To equip the discriminative LM with a more generalized notion of content, we use a large transformer-based LM pre-trained on a large unsupervised corpus to establish a generic content distribution before style-oriented fine-tuning.

3 Approach

Our proposed approach has two key elements — a Transformer-based encoder-decoder model initialized with a pre-trained Transformer Language Model and fine-tuned on the DAE loss to achieve style transfer (Section 3.1), and multiple language models as discriminators stacked together to enable multi-style transfer (Section 3.2).

3.1 Pre-trained LM as Encoder-Decoder

Similar to Syed et al. (2020), we first pre-train a Transformer-based language model with the Masked Language Modeling (MLM) objective on English Wikipedia data extracted using WikiExtractor (https://github.com/attardi/wikiextractor). This equips the LM with the ability to predict masked words over a large corpus. Masked Language Modeling leverages the bidirectional context of the input, thus enabling better language understanding. Following the Masked Language Modeling objective of Devlin et al. (2019), we randomly sample 15% of the tokens from the text stream and replace them with the [MASK] token 80% of the time, with a random token 10% of the time, and keep them unchanged 10% of the time, with the objective of predicting the original identity of the masked word based on its bidirectional context.
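As a concrete illustration of the masking scheme just described, the sketch below applies BERT-style corruption to a sequence of token IDs: 15% of positions are selected, of which 80% become [MASK], 10% become a random token, and 10% stay unchanged. The vocabulary size and mask ID are assumed inputs; this is an illustrative sketch, not the authors' preprocessing code.

```python
import random

def mlm_mask(token_ids, vocab_size, mask_id,
             select_prob=0.15, mask_prob=0.8, random_prob=0.1):
    """Return (corrupted_ids, labels); labels are -100 for unselected positions."""
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < select_prob:
            labels.append(tok)                  # predict the original token here
            r = random.random()
            if r < mask_prob:
                corrupted.append(mask_id)       # 80%: replace with [MASK]
            elif r < mask_prob + random_prob:
                corrupted.append(random.randrange(vocab_size))  # 10%: random token
            else:
                corrupted.append(tok)           # 10%: keep unchanged
        else:
            labels.append(-100)                 # position ignored by the MLM loss
            corrupted.append(tok)
    return corrupted, labels
```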
To enable style transfer from a given sentence to a target style, we use independently trained language models (LMs) to initialize the encoder and decoder and connect these with randomly initialized attention layers to arrive at an encoder-decoder setup. As discussed by Syed et al. (2020), the Transformer architecture (Vaswani et al., 2017) allows such independent initialization by implicitly aligning the encoder and decoder layers via the attention mechanism.

Pre-training an encoder-only transformer on a generative task and then leveraging it to initialize both the encoder and the decoder, as opposed to pre-training a joint encoder-decoder model, has several advantages. Transformer-based models with encoder-only (Devlin et al., 2019) or decoder-only (Radford et al., 2018) blocks have been shown to perform well in generative pre-training tasks. Pre-training a single transformer block on a generative task and then utilizing it as both the encoder and decoder blocks has lower computational cost than training the entire encoder-decoder stack jointly. Moreover, this also enables us to use the same pre-trained model to initialize both the style transfer module and the discriminator models, explained in the following section. This is not only computationally more efficient but also closely ties the underlying language distributions of the two modules. This is expected to make the discriminative feedback more effective while fine-tuning the transfer model for multiple styles.

In Syed et al. (2020)'s setup, both the encoder and decoder in the style transfer module are initialized with the pre-trained language model (trained on the MLM objective). Instead, we initialize the decoder with the language model fine-tuned on the target style using the Causal Language Modeling (CLM) objective, before training the joint encoder-decoder model, as detailed in Section 3.2. The encoder is initialized with the pre-trained model directly. Aligning the decoder to the distribution of the target style helps speed up the fine-tuning process, as the decoder is more adept at generating stylized outputs. This does not add computational overhead, as these fine-tuned models are repurposed as discriminators for stylistic feedback (Section 3.2).
[Figure 1: Model Architecture — Left: generative pre-training using the MLM objective, and fine-tuning of the encoder-decoder LM with the DAE objective and multiple discriminative losses; Right: discriminator fine-tuning with the causal language modeling (next-token prediction) objective. The color of each model block represents the pre-trained model used for initialization prior to fine-tuning.]

To instill style-awareness into the encoder-decoder setup initialized with pre-trained Transformer models, we fine-tune it with the Denoising Autoencoder (DAE) loss using the target-domain corpus. In the case of multiple styles, we use a randomized mixture of the target-domain corpus from each of the target styles. Under the DAE objective, the encoder takes a noisy masked version x̃ of the text x as input and attempts to fill in the mask tokens as per the MLM objective that it was pre-trained on. In turn, the decoder re-creates a stylistic version of the original sentence from this noisy output of the encoder. The overall training objective is

    L_DAE(θ_G) = E_{x∼T}[− log P_{θ_G}(x | x̃)],        (1)

where θ_G are the trainable parameters of the encoder-decoder model. The noisy version x̃ of a sentence x from the target corpus T is obtained by dropping tokens from x with probability p_drop and masking tokens with probability p_mask. In conjunction, the encoder and decoder enable style transfer to the target style. The noteworthy aspect here is that the model has no sense of the source style and is trained to generate sentences that match the style of the target-domain corpus with which it is trained.
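A minimal sketch of the noising step behind Equation 1: tokens are dropped with probability p_drop and masked with probability p_mask, and the encoder-decoder is trained to reconstruct the original sentence from this corrupted input. The `model.negative_log_likelihood` call is an assumed seq2seq interface, not the actual training code.

```python
import random

def add_noise(tokens, p_drop=0.1, p_mask=0.1, mask_token="[MASK]"):
    """Build the noisy version x~ used by the DAE objective."""
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue                      # drop the token entirely
        elif r < p_drop + p_mask:
            noisy.append(mask_token)      # replace with the mask symbol
        else:
            noisy.append(tok)
    return noisy

def dae_loss(model, x):
    """L_DAE = -log P(x | x~): reconstruct x from its corrupted version (Equation 1)."""
    return model.negative_log_likelihood(source=add_noise(x), target=x)
```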
3.2 Fine-tuned LM as discriminators

To extend the single-dimensional style transfer setup above to a multi-dimensional setting, we use language models as discriminators to provide feedback to the model, given the partially annotated nature of the input data. As opposed to a classifier-based discriminator, a language model as discriminator takes into account the wider language distribution of the target style. Additionally, such a setup allows us to use only the target style corpus for training the transfer model, whereas a classifier would require both source and target style corpus to distinguish whether a sentence is from one style or the other. Inspired by Yang et al. (2018), we fine-tune a language model on the target style s_i, so that the language model is equipped with the language distribution of the target-domain data. This entails generating the probability of the next token given the previous tokens — also known as the Causal Language Modeling (CLM) objective (Conneau and Lample, 2019). The training loss for the LM for target style s_i, with corresponding corpus T_i, is

    E_{x∼T_i}[− Σ_{t=1}^{n} log P_LM(x_t | x_1, . . . , x_{t−1})]        (2)

We show in our experiments that such a fine-tuning step transforms the language distribution of this language model to style s_i, and it hence serves as a soft discriminator for our framework. We exploit this capability of language models to imbibe the style of the fine-tuning corpus by employing language models as style discriminators for transferred sentences. This is based on the idea that if the transferred sentence does not fit well in the target style, then the perplexity of the language model fine-tuned on that style will be high (Section 4.1).
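The causal fine-tuning loss of Equation 2 is ordinary next-token prediction, and the same quantity later doubles as the reward in Equation 4. Below is a minimal PyTorch sketch under the assumption that `lm` maps a sequence of token IDs to next-token logits; it is an illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def clm_loss(lm, token_ids):
    """Sum of -log P(x_t | x_<t) over one sentence; Equation 2 averages this over T_i.

    `lm` is assumed to be a causal LM returning logits of shape
    (seq_len, vocab_size) for a 1-D LongTensor input of token IDs.
    """
    inputs, targets = token_ids[:-1], token_ids[1:]
    logits = lm(inputs)                                   # (seq_len - 1, vocab_size)
    return F.cross_entropy(logits, targets, reduction="sum")

def sentence_log_prob(lm, token_ids):
    """log P_LM(x) = sum_t log P(x_t | x_<t); also the reward r(x) of Equation 4."""
    return -clm_loss(lm, token_ids)
```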
For k-dimensional style transfer with target styles s = {s_1, s_2, . . . , s_k}, we independently fine-tune k language models, one on each of the target styles. As discussed in Yang et al. (2018), we are able to forgo adversarial training for the discriminator, since the fine-tuned discriminative language model is implicitly capable of assigning high perplexity to negative (out-of-style) samples, as shown in Section 4.1. For the transferred sentence x′, the training objective for each target style s_i is

    argmin_{θ_G} L_{s_i} = E_{x∼T, x′∼P_{θ_G}(x)}[− Σ_{t=1}^{n} log P_{LM_i}(x′_t | x′_1, . . . , x′_{t−1})]        (3)

This dictates that the transferred sentence x′ should have low perplexity on the language model fine-tuned on style s_i, for each target style s_i. However, we cannot directly find the argmin over θ_G using gradient descent because of the discrete sampling of x′ ∼ P_{θ_G}(x). To account for this, we use a policy-gradient reinforcement learning approach based on the REINFORCE algorithm (Sutton et al., 1999). The reward for an input sequence x under the style discriminator LM_i is calculated as

    r(x) = Σ_{t=1}^{n} log P_{LM_i}(x_t | x_1, . . . , x_{t−1})        (4)

Using these rewards, the RL objective is to minimize the loss L_{s_i} given by

    L_{s_i} = E_{x∼T, x′∼P_{θ_G}(x)}[(r(x′) − r(x)) (− log P_{θ_G}(x′ | x̃))]        (5)

for style s_i, where P_{θ_G}(x | x̃) is as in Equation 1 and r(x′) is the reward of Equation 4 for the transferred sentence x′. The reward r(x) represents the baseline reward obtained by greedily scoring the input sequence x with the style discriminator LM_i.

For the style combination s = {s_1, s_2, . . . , s_k}, the joint encoder-decoder model is trained on a randomized mixture of data from each of the target-domain corpus. The mixture is thus agnostic of the individual style of each of the sentences, and the discriminative LM for each style guides the generation towards that specific style by rewarding style adherence in the transferred sentence. The randomized mixture of training corpus across styles allows for a unified and cohesive understanding of multiple styles by diversifying the rewards from different discriminators across samples. The overall training loss for the joint encoder-decoder model is

    L = λ_DAE E_{x∼T}[− log P_{θ_G}(x | x̃)] + Σ_{i=1}^{k} λ_i L_{s_i},        (6)

where L_{s_i} is as defined in Equation 5, and λ_DAE and {λ_i}_{i=1}^{k} are hyper-parameters.

The overall training process is summarized in Figure 1. First, we pre-train a transformer model with the Masked Language Modeling objective, as shown in Figure 1 (Left). We then initialize each discriminator model with this pre-trained language model and fine-tune it with the Causal Language Modeling objective, shown in Figure 1 (Right), for each target style. Finally, we initialize the encoder and decoder of the style transfer module with the pre-trained and the style-specific fine-tuned language models, respectively. In the case of multiple styles, the decoder can be initialized with the language model fine-tuned with the CLM loss on the mixture of data from the target styles, i.e., the CLM loss in Equation 2 with x ∼ T. The joint encoder-decoder model (Figure 1 (Centre)) is then trained with a combination of the DAE objective and rewards from the fine-tuned discriminators of the respective target styles.
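Putting Equations 4 to 6 together, one training step of the joint model can be sketched as follows. The `generator.sample` and `generator.nll` interfaces and the `sentence_log_prob` helper (from the previous sketch) are assumptions for illustration, not the authors' released code; the essential points are that the reward is the discriminator log-likelihood of the sampled transfer (Equation 4), the input sentence's own score serves as the baseline, and the advantage-weighted term is added to the DAE loss with the λ weights of Equation 6.

```python
def training_step(generator, discriminators, x, noisy_x, lambdas, lambda_dae):
    """One update of the joint encoder-decoder on a sentence x from the style mixture.

    discriminators: k style-specific causal LMs (one per target style s_i).
    lambdas:        k weights for the per-style losses; lambda_dae weights the DAE term.
    sentence_log_prob(lm, ids) is the helper from the previous sketch; rewards are
    treated as constants, so no gradient flows through the discriminators.
    """
    loss = lambda_dae * generator.nll(source=noisy_x, target=x)   # DAE term of Eq. 6

    x_transferred = generator.sample(noisy_x)   # x' ~ P_theta(x): discrete, so REINFORCE

    for lm_i, lambda_i in zip(discriminators, lambdas):
        reward = sentence_log_prob(lm_i, x_transferred)   # r(x')           (Equation 4)
        baseline = sentence_log_prob(lm_i, x)             # r(x): baseline reward
        # Advantage-weighted NLL of the transferred sequence (Equation 5)
        loss = loss + lambda_i * (reward - baseline) * generator.nll(
            source=noisy_x, target=x_transferred)
    return loss
```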
4 Experiments

We experiment with a combination of sentiment and formality styles. For sentiment, we use a mixture of the IMDB (Maas et al., 2011) and Yelp (Li et al., 2018) datasets with 300k examples in each of the positive and negative sentiment classes. For formality, we use the GYAFC corpus (Rao and Tetreault, 2018), which has 104k examples in each of the formal and informal classes. The test set has 3000 and 4849 examples for sentiment and formality respectively, following the data splits of Dai et al. (2019) and Rao and Tetreault (2018). For both datasets, the training corpus is non-parallel and the test corpus has human-written references available, which we use for content evaluation (Section 4.2).

For pre-training, we use a 12-layer Transformer model with 512 hidden units, 16 heads, a dropout rate of 0.1 and learned positional embeddings. We train our models with the Adam optimizer and a learning rate of 10^-4. To handle large vocabulary sizes, we use Byte Pair Encoding (BPE) (Sennrich et al., 2016) learned on the Wikipedia dataset. The λs in Equation 6 are determined by hyper-parameter tuning on the validation set, with style transfer accuracy (Section 4.2) as the search criterion.
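For reference, the reported architecture and optimization settings are collected below into a single configuration sketch; settings the paper does not state (batch size, number of training steps, vocabulary size) are deliberately left out rather than guessed.

```python
# Settings reported above; anything not stated in the paper is intentionally omitted.
PRETRAIN_CONFIG = {
    "num_layers": 12,             # 12-layer Transformer
    "hidden_size": 512,           # 512 hidden units
    "num_attention_heads": 16,
    "dropout": 0.1,
    "positional_embeddings": "learned",
    "tokenizer": "BPE learned on the Wikipedia pre-training corpus",
    "optimizer": "Adam",
    "learning_rate": 1e-4,
}
# The loss weights lambda_DAE and lambda_i of Equation 6 are tuned on the validation
# set, using style-transfer accuracy (Section 4.2) as the search criterion.
```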
4.1 Style-awareness of Language Models

To evaluate the style variation across language models fine-tuned on different styles, we compare the generations of the fine-tuned models. For single-dimensional style evaluation, we generate sentences from models fine-tuned on the negative corpus and on the positive corpus and compare the style accuracy of the generated sentences. The style accuracy is evaluated by employing a FastText (Joulin et al., 2016) classifier trained on the corresponding style dimension. For instance, the classifier for evaluating sentiment accuracy is trained on sentiment corpus tagged with the positive and negative classes in the IMDB and Yelp data. Table 1 shows the accuracy of sentences generated by a model fine-tuned on style s_i as belonging to class s_i. For both sentiment and formality, the fine-tuned language models are able to generate text faithful to the target style dimension. Thus, we conclude that the language models trained on style s_i are able to capture the essence of the corresponding style reasonably well.

Style/Dimension | Sentiment % | Formality %
Positive        | 71.41       | 67.09
Negative        | 76.17       | 75.59

Table 1: Accuracy of sentences generated by the model fine-tuned on style s_i, as the percentage of generated sentences labelled as class s_i by the classifier trained on the corresponding style dimension.

These accuracies are an indication of the style awareness of these fine-tuned LMs. We therefore employ the perplexities of these fine-tuned language models to gauge the style of the input text and guide our style transfer model. As discussed in Section 3.2, a model fine-tuned on corpus from a certain style is expected to have high perplexity on sentences not from that style and low perplexity otherwise. To this end, we experiment with two models independently fine-tuned on the positive and the negative corpus. We calculate the perplexity of each of these models on the test corpus from the same style and from the opposite style. As seen in Table 2, the perplexity of each model is substantially lower on the same corpus than on the opposite corpus. This implies that a language model fine-tuned on the positive corpus shows higher perplexity for negative sentences and lower perplexity for positive sentences, and vice versa. This corroborates the effectiveness of these fine-tuned language models as discriminators for training the style transfer module.

Fine-tuning corpus | Test corpus: Same ↓ | Test corpus: Opposite ↑
Positive           | 6.9275              | 9.6850
Negative           | 7.7131              | 9.9637

Table 2: Perplexity of the test corpus on models fine-tuned on the positive and negative corpus (rows). The column Same represents a test corpus from the same style as the fine-tuning corpus, leading to lower perplexities; Opposite represents a test corpus of the opposite polarity to the fine-tuning corpus, leading to higher perplexity.
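The check behind Table 2 is a corpus-level perplexity comparison. A minimal sketch is shown below, reusing the `clm_loss` helper from the Section 3.2 sketch; tokenization and the corpus containers are assumed.

```python
import math

def corpus_perplexity(lm, corpus):
    """exp(average per-token NLL) of a tokenized corpus under a causal LM."""
    total_nll, total_tokens = 0.0, 0
    for token_ids in corpus:
        total_nll += float(clm_loss(lm, token_ids))   # sum of -log P(x_t | x_<t)
        total_tokens += len(token_ids) - 1            # number of predictions made
    return math.exp(total_nll / max(total_tokens, 1))

# Expected pattern (Table 2): a LM fine-tuned on the positive corpus should give
# corpus_perplexity(lm_pos, positive_test) < corpus_perplexity(lm_pos, negative_test),
# and symmetrically for the LM fine-tuned on the negative corpus.
```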
4.2 Evaluation metrics

We measure the performance of our model and the baselines in terms of style control, content preservation and fluency. To measure the accuracy of style transfer, we train two FastText (https://github.com/facebookresearch/fastText) classifiers independently for sentiment and formality using the training corpus, as described in Section 4.1. These classifiers have accuracies of 93.74% and 88.95% respectively on the test corpus of the corresponding datasets. We note that formality as a style is more intricate, so we additionally use the lexical scoring of Brooke et al. (2010) to evaluate formality, which uses a formality lexicon to assign a formality score between −1 (informal) and 1 (formal) to each word and averages these. We scale these scores between 0–100, where a higher (100) lexical score signifies formal style and a lower (0) score signifies informal style. For the informal target style, we report the lexical score as 100 − n (where n is the scaled score), so that a higher average lexical score signifies a better transfer for either polarity.

To measure content preservation on transfer, we calculate the BLEU score (Papineni et al., 2002) between the transferred sentence and the input sentence (self-BLEU). Besides this, we also calculate the BLEU score between the transferred sentence generated by our model and the corresponding human-reference transferred sentence, available for the GYAFC and Yelp corpus (ref-BLEU). Since both these corpora account for transfer across only one style dimension each, the provided references are only a partial indication of the expected outcome. This is also apparent from the low ref-BLEU scores for our model as well as the baselines. Since the results are presented on the aggregated dataset from both these style dimensions, this evaluation is still able to provide a reasonable indication of content preservation.

To measure the fluency of the text, we calculate the perplexity assigned to the generated text sequence by a language model trained on the training corpus, as is standard in the style transfer literature (Dai et al., 2019; Subramanian et al., 2018). The perplexity is a measure of the log likelihood of the generated sentence under the language model; a lower perplexity is indicative of a more fluent sentence. We use a generative transformer-based language model trained on the dataset combined from the two styles.
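The content-preservation and style-accuracy metrics above can be sketched as follows. The BLEU calls use NLTK; the classifier wrapper is an assumed interface (a trained FastText model wrapped to return a single label would fit). This is an illustrative sketch, not the exact evaluation script used in the paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def self_bleu(source_tokens, transferred_tokens):
    """BLEU of the transferred sentence against the input (content preservation)."""
    return sentence_bleu([source_tokens], transferred_tokens, smoothing_function=smooth)

def ref_bleu(reference_tokens, transferred_tokens):
    """BLEU of the transferred sentence against a human reference, where available."""
    return sentence_bleu([reference_tokens], transferred_tokens, smoothing_function=smooth)

def transfer_accuracy(classifier, transferred_sentences, target_label):
    """Fraction of outputs a pre-trained style classifier assigns to the target style.

    `classifier.predict` returning a single label per sentence is an assumed interface.
    """
    hits = sum(1 for s in transferred_sentences if classifier.predict(s) == target_label)
    return hits / len(transferred_sentences)
```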
Model | Classifier ↑ Sentiment | Classifier ↑ Formality | Lexical Scoring ↑ Formality | BLEU-self ↑ | BLEU-ref ↑ | Perplexity ↓
Cascaded Style Transformer (Dai et al., 2019) | 72.17 | 64.08 | 81.29 | 0.6066 | 0.3479 | 8.8657
Adapted Rewriting LM (Syed et al., 2020)      | 52.59 | 36.39 | 72.21 | 0.7917 | 0.4259 | 6.5963
Cascaded Discriminative LM                    | 69.30 | 48.18 | 83.02 | 0.6634 | 0.3579 | 6.6846
Joint Discriminative LM (ours)                | 79.78 | 65.33 | 85.39 | 0.7710 | 0.4136 | 6.4574

Table 3: Quantitative comparison of our proposed approach (Joint Discriminative LM) against the Cascaded Style Transformer (Dai et al., 2019), the Cascaded Discriminative LM method, and multi-style transfer using the Adapted Rewriting LM (Syed et al., 2020). The classifier and lexical-scoring columns measure style accuracy, BLEU-self/-ref measure content preservation, and perplexity measures fluency. An upward arrow signifies that higher is better and vice versa. A score near 100 on formality lexical scoring implies the transferred text is close in formality to the target corpus.

4.3 Automatic Evaluation

Dai et al. (2019) use a transformer-based model (Style Transformer) for single-dimensional style transfer. We train two independent Style Transformer models for sentiment and formality transfer and then perform transfer one after the other to compare results with our model. We term this the Cascaded Style Transformer setup. The Style Transformer model has been shown to have state-of-the-art performance in single-dimensional style transfer; thus it provides an estimate of the performance of sequential single-style transfer. We also experiment with the Adapted Rewriting LM (Syed et al., 2020) as another baseline. Their work on style rewriting to match an author-specific style does not require explicit annotations for the various aspects that constitute an author's style, but is based on the assumption that the training corpus reflects the target style. In this context, we train their framework on the mixture of data from the respective target styles and report the performance. These are the closest baselines to our proposed approach, since other works dealing with multi-style transfer assume the presence of a jointly annotated dataset, which is a stronger assumption that we aim to relax. In addition to our proposed model with multiple style transfer, we also train our encoder-decoder architecture with a single discriminative LM for one style at a time and perform a two-stage transfer (Cascaded Discriminative LM), similar to the Cascaded Style Transformer (Dai et al., 2019) setup.
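Schematically, the difference between the cascaded baselines and the proposed setup amounts to composing two single-style rewriters versus a single call to one multi-style rewriter; the `transfer` interfaces below are assumed for illustration only.

```python
def cascaded_transfer(sentiment_model, formality_model, sentence, sentiment, formality):
    """Two-stage baseline: each single-style model is applied in sequence, so the
    second hop can further distort content produced by the first."""
    intermediate = sentiment_model.transfer(sentence, target_style=sentiment)
    return formality_model.transfer(intermediate, target_style=formality)

def joint_transfer(multi_style_model, sentence, sentiment, formality):
    """Proposed setup: a single encoder-decoder, trained with one discriminator per
    style dimension, rewrites the sentence once for the full style combination."""
    return multi_style_model.transfer(sentence, target_styles=(sentiment, formality))
```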
We also experiment the cascaded models, the Discriminative LM scores with Adapted Rewriting LM (Syed et al., 2020) as marginally better on content preservation than the another baseline. Their work on style rewriting to Style Transformer model. We attribute this to ini- match author-specific style does not require explicit tialization with the same pre-trained LM resulting annotations for the various aspects that constitutes in shared content space in the underlying single an author’s style, but is based on the assumption style transfer models. However, due to independent that the training corpus reflects the target style. In training of the two single style transfer models, they this context, we train their framework on the mix- are not able to model interplay between these styles ture of data from the respective target styles and and hence perform worse on style control than our report the performance. These are the closest base- proposed model trained jointly on multiple styles. lines to our proposed approach, since other works Our model also scores better on fluency, as seen dealing with multi-style transfer assume presence in Table 3. This is also apparent from the exam- 3506
Target style: Positive+Formal
  Source: That's not funny. I don't think she will appreciate it.
  Style Transformer: So funny movie. I really like it.
  Our model (multi-style): That was very funny. I am sure she'll like it.

  Source: Give your brother some money and tell him to take a hike.
  Style Transformer: Just give your brother some time and it will be good again.
  Our model (multi-style): Give your brother some money and request him to leave.

Target style: Negative+Formal
  Source: An intelligent, rewarding film that I look forward to watching again.
  Style Transformer: ludicrous, shallow film that look forward to watching again.
  Our model (multi-style): An unintelligent, poor film that I would not look forward to watching again.

  Source: super friendly staff, quick service and amazing and simple food was done right!
  Style Transformer: says wait staff, quick not amazing before overcooked food done were okay.
  Our model (multi-style): dirty staff and slow service and simple food was not done right.

Target style: Positive+Informal
  Source: You need to separate the bad thing and move on.
  Style Transformer: need to the great thing and move on.
  Our model (multi-style): You need to enjoy the good stuff and move on.

  Source: The evening started out slow.
  Style Transformer: The evening spent in professional show.
  Our model (multi-style): The evening began amazing.

Target style: Negative+Informal
  Source: Great food recommendations steak and tuna were both great.
  Style Transformer: terrible food 9am steak and were both terrible.
  Our model (multi-style): Disappointing food recommendations steak and tuna were horrible.

  Source: That person in hilarious.
  Style Transformer: You person in worse!
  Our model (multi-style): That guy in so boring.

Table 4: Qualitative results for transfer to different target style combinations across models. (In the original PDF, colors highlight the transferred segments corresponding to the underlined input segments; bold text highlights adherence to the target formality in text generated by our model.)

Model | Style Accuracy: Sentiment | Style Accuracy: Formality | Content Preservation | Transfer Quality | Fluency
Cascaded Style Transformer (Dai et al., 2019) | 3.5909 | 2.7424 | 3.2803 | 2.7424 | 2.9318
Joint Discriminative LM (Our Model)           | 3.8561 | 3.0379 | 4.1061 | 4.1894 | 4.1091

Table 5: Results of the human evaluation across different metrics. Each value is the average rating on a scale between 1 (Very bad) and 5 (Very good).

4.4 Human evaluation

To augment the automatic evaluation results, we conduct a human study to evaluate the model outputs across various dimensions such as content preservation, style control, fluency, and overall transfer quality. Based on the comparable style control of the Cascaded Style Transformer and our proposed approach on automatic metrics, we compare the transfer quality of these two models in a small-scale human study. We select 40 sentences, with 10 examples from each combination of sentiment and formality as the target style, and collect annotations from 4–5 participants for each example. Of the resulting annotations, more than 85% favoured our results over the baseline. The average participant rating across the different dimensions is shown in Table 5. We test the statistical significance of these results using the z-test statistic. With α = 0.05, the preferences indicated in the human study are significant across all metrics.
These results are in line with our automatic evaluations and add confidence to the efficacy of our proposed approach in achieving style transfer across multiple dimensions.
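The paper reports a z-test at α = 0.05 without further detail; one plausible instantiation for the preference counts is a one-sample proportion z-test against a 50% no-preference null, sketched below. The annotation count used in the example is a placeholder, not a figure from the paper.

```python
import math

def one_proportion_z_test(successes, n, p_null=0.5):
    """z statistic and two-sided p-value for an observed preference rate vs. p_null."""
    p_hat = successes / n
    z = (p_hat - p_null) / math.sqrt(p_null * (1 - p_null) / n)
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example with placeholder counts: 85% of 160 pairwise annotations favouring one system.
z, p = one_proportion_z_test(successes=136, n=160)
print(f"z = {z:.2f}, p = {p:.6f}")   # far below alpha = 0.05 for these counts
```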
5 Conclusion and Future Work

We propose an approach that extends existing style transfer work to the multiple-style setting without imposing any extra constraints on the availability of datasets. Our method makes use of disjoint corpus from separate styles to enable one-step transfer across multiple target styles. We exploit multiple discriminative language models with an encoder-decoder framework, all emerging from large transformer-based language models pre-trained on the Masked Language Modeling objective and fine-tuned separately for transfer and discriminative purposes. We show that the unified single-step transfer approach is able to achieve better transfer while offering much better content preservation, which is paramount to any style transfer task.

Further improvements are in scope for adding modularity to the proposed transfer module. In the current setup, each version of the model is trained for a specific combination of target style(s). The utility of such a model would increase manifold with the added ease of transfer across multiple style combinations within a single model. This could be attempted by employing a controlled language model as a unified discriminator for multiple styles, which would be the subject of further research.

Ethics Statement. We recognise the ethical implications of employing large language models trained on data infused with unchecked biases. As with any generative task, style transfer too suffers from potential misuse for fact distortion, plagiarism and more. The paper aims at establishing the academic utility of the proposed framework. To meet ethical standards, this solution has to be coupled with strict misrepresentation, offensiveness and bias checks.
References

Julian Brooke, Tong Wang, and Graeme Hirst. 2010. Automatic acquisition of lexical formality. In Coling 2010: Posters, pages 90–98, Beijing, China. Coling 2010 Organizing Committee.

Stephane Clinchant, Kweon Woo Jung, and Vassilina Nikoulina. 2019. On the use of BERT for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 108–117, Hong Kong. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing Huang. 2019. Style transformer: Unpaired text style transfer without disentangled latent representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5997–6007, Florence, Italy. Association for Computational Linguistics.

Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 250–259, Sofia, Bulgaria. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2017. Style transfer in text: Exploration and evaluation. arXiv preprint arXiv:1711.06861.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1587–1596, Sydney, Australia. PMLR.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049, Brussels, Belgium. Association for Computational Linguistics.

Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1865–1874, New Orleans, Louisiana. Association for Computational Linguistics.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

Aman Madaan, Amrith Setlur, Tanmay Parekh, Barnabas Poczos, Graham Neubig, Yiming Yang, Ruslan Salakhutdinov, Alan W Black, and Shrimai Prabhumoye. 2020. Politeness transfer: A tag and generate approach. arXiv preprint arXiv:2004.14257.

François Mairesse and Marilyn A. Walker. 2011. Controlling user perceptions of linguistic style: Trainable generation of personality traits. Computational Linguistics, 37(3):455–488.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Alec Radford. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, New Orleans, Louisiana. Association for Computational Linguistics.

Abhilasha Sancheti, Kundan Krishna, Balaji Vasan Srinivasan, and Anandhavelu Natarajan. 2020. Reinforced rewards framework for text style transfer. In Advances in Information Retrieval, pages 545–560, Cham. Springer International Publishing.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems 30, pages 6830–6841. Curran Associates, Inc.

Sandeep Subramanian, Guillaume Lample, Eric Michael Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2018. Multiple-attribute text style transfer. CoRR, abs/1811.00552.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE 2.0: A continual pre-training framework for language understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8968–8975.

Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS'99), pages 1057–1063, Cambridge, MA, USA. MIT Press.

Bakhtiyar Syed, Gaurav Verma, Balaji Vasan Srinivasan, Anandhavelu Natarajan, and Vasudeva Varma. 2020. Adapting language models for non-parallel author-stylized rewriting. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):9008–9015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 6000–6010.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Hu Xu, Bing Liu, Lei Shu, and Philip Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2324–2335, Minneapolis, Minnesota. Association for Computational Linguistics.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, and Taylor Berg-Kirkpatrick. 2018. Unsupervised text style transfer using language models as discriminators. In Advances in Neural Information Processing Systems 31, pages 7287–7298. Curran Associates, Inc.