FUDGE: Controlled Text Generation With Future Discriminators

Kevin Yang (UC Berkeley, yangk@berkeley.edu)    Dan Klein (UC Berkeley, klein@berkeley.edu)

arXiv:2104.05218v1 [cs.CL] 12 Apr 2021

Abstract

We propose Future Discriminators for Generation (FUDGE), a flexible and modular method for controlled text generation. Given a pre-existing model G for generating text from a distribution of interest, FUDGE enables conditioning on a desired attribute a (for example, formality) while requiring access only to G's output logits. FUDGE learns an attribute predictor operating on a partial sequence, and uses this predictor's outputs to adjust G's original probabilities. We show that FUDGE models terms corresponding to a Bayesian decomposition of the conditional distribution of G given attribute a. Moreover, FUDGE can easily compose predictors for multiple desired attributes. We evaluate FUDGE on three tasks — couplet completion in poetry, topic control in language generation, and formality change in machine translation — and observe gains in all three tasks.

1 Introduction

Recent advances in large pretrained language models allow us to generate increasingly realistic text by modeling a distribution P(X) over natural language sequences X. The distribution P(X) may be truly unconditional, as is common in language modeling, or it may model P(X|I) conditioned on some input I, as in machine translation or summarization.

We are frequently interested in controlled text generation, the task of generating text conditioned on an additional desirable attribute a which is not already built into P(X). That is, we would like to model P(X|a) (or possibly P(X|I, a); henceforth we will drop I from the notation for simplicity). For example, P(X) may be a pretrained translation model for Spanish inputs I to English outputs X, but we may wish to additionally constrain the outputs to possess a new attribute a, e.g., formality, which we did not optimize for during training.

Unfortunately, once we have already obtained an unconditioned P(X) defined as the output distribution of some large generative model G, it is nontrivial to add conditioning on a new attribute a without either training a new model from scratch or fine-tuning with additional data. Although in principle we can trivially sample from P(X|a) via rejection sampling from P(X), rejection sampling may be highly inefficient in practice. On the other hand, while generating according to attribute a, P(X) should be left otherwise intact: in the previous translation formality example, it is pointless to generate formal English outputs if they do not preserve the original Spanish meaning.

In light of these concerns, we propose Future Discriminators for Generation (FUDGE), a flexible and modular method for modeling P(X|a) which accesses only the output probabilities of the generative model G which defines P(X). FUDGE learns a binary predictor for whether attribute a will become true in the complete future, based on an incomplete sequence prefix (Sec. 3). Multiplying the output probabilities of this predictor with G's original probabilities and then renormalizing yields a model for the desired P(X|a) via Bayes' Rule.

We run experiments on three controlled text generation tasks — couplet completion in poetry, topic control in language generation, and formality change in machine translation — showing our method's broad applicability. Additionally, we demonstrate the modularity of FUDGE by composing multiple attribute constraints in both the couplet and topic control tasks. In our experiments, we find that FUDGE is highly effective at attribute control, outperforming both a baseline which directly fine-tunes G and also a strong gradient-based method (PPLM (Dathathri et al., 2019)). Our code is available at https://github.com/yangkevin2/naacl-2021-fudge-controlled-generation.

2 Related Work

Ideally, a controlled text generation method should efficiently control for a while preserving P(X)
as much as possible. Recent work on controlled text generation has greatly advanced our ability to control for a required attribute a flexibly and cheaply, with varying degrees of modification to the original model G which defines P(X).

One line of work fine-tunes a pretrained model for a desired attribute (Ficler and Goldberg, 2017; Yu et al., 2017; Ziegler et al., 2019). The result is a class-conditional language model (CCLM). However, it is difficult to isolate the desired attribute from the distribution shift between G and the fine-tuning dataset (Hu et al., 2017; John et al., 2018; Lazaridou et al., 2020), i.e., it is nontrivial to preserve the desirable qualities of the P(X) modeled by G. One may also need to fine-tune separately for each attribute of interest. CTRL (Keskar et al., 2019) partially addresses these issues by providing 55 attribute control codes for a large language model trained from scratch, although this is expensive. Very recently, GEDI (Krause et al., 2020) achieves strong performance by using CCLM generators as discriminators, though it relies on several heuristics. More broadly, text generation models for style transfer (Hu et al., 2017; Lample et al., 2018b; Dai et al., 2019a), summarization (See et al., 2017; Gehrmann et al., 2018; Zaheer et al., 2020), and machine translation (Lample et al., 2018a; Ng et al., 2019; Lewis et al., 2019) can also be viewed as CCLM's for different "attributes."

A second type of approach instead conditions on a desired attribute by backpropagating gradients, either to directly modify model activations (Dathathri et al., 2019; Liu et al., 2020) or to find a trigger string (Wallace et al., 2019, 2020). Such methods often exhibit a high degree of attribute control, and can be used in adversarial attacks (Wallace et al., 2020). In fact, Subramani et al. (2019) show that by carefully modifying the latent state, one can cause the base G to produce arbitrary outputs.

A third class of methods, referred to as weighted decoding (WD), assumes access only to P(X) (i.e., G's output logits), and operates directly on these logits (Ghazvininejad et al., 2017; Holtzman et al., 2018; Cohn-Gordon et al., 2018; Shen et al., 2019). Compared to other approaches, WD methods are relatively interpretable in how they obtain P(X|a) from P(X), but prior WD implementations have been observed to perform poorly in controlled text generation (See et al., 2019; Dathathri et al., 2019). While FUDGE shares a Bayesian motivation with other WD methods, FUDGE follows the Bayesian factorization more closely in implementation (Sec. 3). The key distinguishing feature of FUDGE is that it models whether attribute a will be true in the future, rather than in the present. We find that FUDGE substantially outperforms previous WD approaches in our experiments (Sec. 4.2).

3 Future Discriminators for Generation

We now explain the details of our proposed method, Future Discriminators for Generation (FUDGE), and show that it corresponds to modeling the desired conditional distribution P(X|a).

For a given language generation task, assume we have an autoregressive model G (e.g., a large pretrained language model) which models P(xi | x1:i-1) for tokens x1 ... xi. Letting X = x1:n denote a completed sequence, G can sample from P(X) = P(x1:n) one token at a time by factoring P(X):

    P(X) = ∏_{i=1}^{n} P(xi | x1:i-1)

To condition on attribute a, we instead model P(X|a). This requires a model for P(xi | x1:i-1, a), modifying the previous factorization:

    P(X|a) = ∏_{i=1}^{n} P(xi | x1:i-1, a)

If we model P(xi | x1:i-1, a) directly, we obtain a class-conditional language model (CCLM). We can learn the CCLM by e.g., fine-tuning G depending on the available data, possibly with some structural modification to G to accommodate conditioning.

However, FUDGE instead relies on the following Bayesian factorization, exchanging xi and a conditioned on x1:i-1:

    P(xi | x1:i-1, a) ∝ P(a | x1:i) P(xi | x1:i-1)

The second term is exactly the quantity modeled by the base G. It then suffices to model the first term, P(a | x1:i), with a binary classifier B for the attribute a given a prefix x1:i. Intuitively, one can view B as rescoring or reranking G's original hypotheses.

We emphasize that although B takes a prefix x1:i as input, it predicts whether attribute a will in the future be satisfied for the completed generation x1:n. For instance, suppose we are given a dataset of examples {(x1:n, a′)} with a′ being the values of binary indicators for the desired a (i.e., if a is formality, then a′ is 0 or 1 when x1:n is informal
or formal respectively). For each training example (x1:n, a′), we train our classifier B using all pairs (x1:i, a′); that is, we construct a separate example from each prefix x1:i of x1:n. Our approach contrasts with previous methods such as Dathathri et al. (2019), which greedily optimize for a on the immediate extension x1:i+1. One particular benefit is that FUDGE naturally plans for the future: in the example for generating text on the "space" topic in Table 6, FUDGE writes about a "mysterious ship" despite "ship" itself not being in the given "space"-topic bag of words, because "mysterious ship" easily leads into a mention of one of the targeted "space" words ("Earth"). Similarly, in the first couplet completion example in Table 3, FUDGE needs to rhyme with "fear" after exactly ten syllables. After seven syllables, it could reasonably generate the word "clear," but it first generates the adverb "pretty" in order to set up the generation of "clear" as the tenth syllable.

Figure 1: Illustration of one decoding step in FUDGE, for an example where the desired attribute a is formality. A large pretrained model G (dark blue) outputs unconditioned probabilities. Our binary predictor (red) predicts whether the eventual completed sequence will be formal for each possible continuation (computed for each candidate x3, e.g., "want"; holding a fixed). The probabilities for each x3 are multiplied (purple) and then renormalized to obtain P(x3 | x1:2, a), from which we sample the next token x3 = "prefer."

FUDGE's implementation is shown schematically in Figure 1, and is quite simple in practice. FUDGE just needs to learn a B (red in Figure 1) sharing tokenization with G (dark blue). It then converts B's output into probabilities (red table in Figure 1), and multiplies with the original output probabilities from G (dark blue table), to obtain unnormalized probabilities P(xi, a | x1:i-1) (purple table). Finally, renormalizing over the output vocabulary yields the desired distribution P(xi | x1:i-1, a). In practice, we operate in the log-probability space for numerical stability.

To improve computational efficiency, we typically choose B to be lightweight relative to G. We also consider only the top 200 possibilities for xi according to G at each step, as a cheap approximation to the full distribution, and find that this works well in practice (see Appendix H for ablations on the top-200 pruning). In each task in Sec. 4, running FUDGE on the test set takes no more than 15 minutes on a single Quadro RTX 6000 GPU.

Finally, as with other controlled generation approaches such as Dathathri et al. (2019), it is likely that augmenting FUDGE with reranking approaches such as rejection sampling could improve output quality at the cost of compute time, although we do not comprehensively evaluate such extensions in this work.

3.1 Advantages and Limitations

We highlight several additional potential advantages of FUDGE compared to directly modeling P(xi | x1:i-1, a) via e.g., a fine-tuned CCLM:

1. FUDGE requires access only to P(X) (i.e., G's output logits) rather than G itself.

2. G can be freely swapped out for any other model that shares the same tokenization when larger models become available.

3. Given multiple conditionally independent attributes with predictors for each, FUDGE can easily condition on the combination of these attributes in a modular fashion by summing their output log-probabilities (Sec. 4.1, 4.2).

Unfortunately, like previous methods, FUDGE cannot fully guarantee that all outputs possess the desired attribute a. In FUDGE's case, this is due to the approximation inherent in modeling P(a | x1:i), as well as only considering the top 200 possible xi for computational efficiency.
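To make one decoding step concrete, the following is a minimal sketch of the procedure illustrated in Figure 1, not the authors' released implementation. It assumes an externally supplied `attribute_logit_fn` (any callable wrapping the binary classifier B) and a base model that has already produced next-token logits; the function name, shapes, and the top-k value of 200 are taken from the description above, everything else is illustrative.

```python
import torch
import torch.nn.functional as F


def fudge_decode_step(next_token_logits, prefix_ids, attribute_logit_fn, topk=200):
    """One FUDGE decoding step (a sketch).

    next_token_logits: (vocab_size,) logits from the base model G given prefix_ids.
    prefix_ids: (prefix_len,) token ids generated so far.
    attribute_logit_fn: callable taking a (topk, prefix_len + 1) batch of candidate
        prefixes and returning (topk,) binary-classifier logits for B, i.e. scores
        for whether attribute a will hold in the completed future sequence.
    Returns the sampled next token id as a Python int.
    """
    # Work in log-probability space for numerical stability.
    log_probs = F.log_softmax(next_token_logits, dim=-1)

    # Cheap approximation: only rescore G's top-k candidate next tokens.
    top_log_probs, top_ids = log_probs.topk(topk)

    # Build the candidate prefixes x_{1:i}, one per top-k continuation.
    candidates = torch.cat(
        [prefix_ids.unsqueeze(0).expand(topk, -1), top_ids.unsqueeze(1)], dim=1
    )

    # log P(a | x_{1:i}) for each candidate, from the future discriminator B.
    attr_log_probs = F.logsigmoid(attribute_logit_fn(candidates))

    # Bayes' rule: P(x_i | x_{1:i-1}, a) ∝ P(a | x_{1:i}) P(x_i | x_{1:i-1}).
    combined = top_log_probs + attr_log_probs

    # Renormalize over the pruned candidate set and sample the next token.
    next_idx = torch.multinomial(F.softmax(combined, dim=-1), 1)
    return top_ids[next_idx].item()
```

Repeating this step token by token, with the prefix extended after each sample, yields a full generation from the approximated P(X|a); the base model G is never modified.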
4 Experiments 4. Distinctness of completions, measured as the number of unique unigrams, bigrams, and tri- We run experiments on a range of controlled text grams across all samples, divided by the total generation tasks to evaluate the effectiveness of our number of words (Li et al., 2015). proposed method: poetry couplet completion (Sec. 4.1), topic-controlled language generation (Sec. At test time, we decode until the model gener- 4.2), and machine translation formality change ates ten syllables followed by an end-of-sentence (Sec. 4.3). For each task we discuss the evalua- punctuation mark, or after the eleventh syllable (an tion setup, the specific details of our method and automatic failure, since iambic pentameter requires baselines, and finally experimental results. exactly ten syllables). Overall, because we define a using a rule-based 4.1 Poetry Couplet Completion F which is accessible during training, our formula- tion of couplet completion is a relatively clean task for evaluating the effectiveness of F UDGE. So long as men can breathe or eyes can see, So long lives this and this gives life to thee. 4.1.1 Method and Baselines Table 1: An example couplet by William Shakespeare. F UDGE Instantiation. The obvious approach is to Every second syllable is stressed, following iambic me- learn a predictor for F directly. However, the three ter, and the last words of each line (see/thee) rhyme. components of a — meter, rhyme, and sentence- ending — should be roughly independent. Thus we We begin with English poetry generation, a task assume conditional independence, and demonstrate that emphasizes well-formedness, and which has the modularity of F UDGE by constructing three been studied in different forms by many previ- separate predictors to be combined at test time: ous works (Zhang and Lapata, 2014; Wang et al., 1. B1 (x1:i ) takes a text prefix x1:i , and predicts 2016; Ghazvininejad et al., 2016, 2017). Our task whether the completion x1:n of prefix x1:i will here is couplet completion. Given the first line of be in iambic meter. The model is an LSTM an iambic pentameter couplet (e.g., Table 1), the followed by a linear output layer. model must generate a second line which (1) sat- isfies iambic pentameter, (2) rhymes with the first 2. B2 (x1:i , t, r) takes prefix x1:i , the number of line, and (3) ends a sentence. The desired attribute syllables t between xi and xn for n ≥ i, a is defined as possessing all three properties, as and a rhyme sound r.3 It predicts whether evaluated by a rule-based checker F (Appendix the completion x1:n has the rhyme sound r A). Our test set is a collection of prefix lines of at the end of token xn . The model is an couplets, collected from the ending couplet of each LSTM with attention dependent on t and r, of Shakespeare’s 154 sonnets. followed by a shallow feedforward network, Metrics. We consider four metrics. and is trained via noise-contrastive estimation (Gutmann and Hyvärinen, 2010).4 1. Success, the fraction of couplet completions with the desired attribute a, as checked by F. 3. B3 (x1:i , t) takes prefix x1:i and the number This is the main metric. of syllables t between xi and xn for n ≥ i, and predicts whether xn ends a sentence. The 2. Grammaticality, the probability of grammati- model is an LSTM followed by a shallow feed- cality given by a Roberta-based CoLA gram- forward network. maticality model (Liu et al., 2019; Warstadt The predictors vary in architecture because B2 and et al., 2019), averaged over all outputs. 
B3 require inputs other than x1:i — in truth, they are families of related predictors. We find that per- 3. Perplexity of the completion conditioned on formance is not overly sensitive to the particulars the prefix. Following Dathathri et al. (2019), of the predictor architectures (Appendix D). since our models use GPT2-Medium (Radford 3 Two words have the same “rhyme sound” r if they rhyme et al., 2019) as G, we evaluate perplexity using according to the CMU Pronouncing Dictionary (Weide, 1998). GPT (Radford et al., 2018).2 4 The output logits from B2 are unnormalized, but this does not affect F UDGE after they are added to the output logits 2 See Appendix E for other perplexity measurements. of G and softmaxed for sampling.
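At test time the three predictors above are composed simply by adding their log-probabilities to G's logits before renormalizing (Sec. 4.1.1). The sketch below shows that composition under stated assumptions: it treats all three as binary classifiers whose log-sigmoid outputs are summed (the paper notes B2's noise-contrastive logits are unnormalized but are added in the same way), and the function name and per-candidate logit tensors are illustrative rather than from the released code.

```python
import torch.nn.functional as F


def couplet_adjusted_logits(g_logits, iambic_logits, rhyme_logits, eos_logits):
    """Compose independent attribute predictors by summing log-probabilities.

    g_logits: (k,) base logits from G for the top-k candidate next tokens.
    iambic_logits, rhyme_logits, eos_logits: (k,) binary-classifier logits from
        B1 (iambic meter), B2 (rhyme sound at the tenth syllable), and
        B3 (sentence end), one score per candidate continuation.
    Returns (k,) renormalized log-probabilities over the candidate set.
    """
    combined = (
        F.log_softmax(g_logits, dim=-1)
        + F.logsigmoid(iambic_logits)  # log P(iambic meter | prefix + candidate)
        + F.logsigmoid(rhyme_logits)   # log P(correct rhyme | prefix + candidate)
        + F.logsigmoid(eos_logits)     # log P(sentence ends  | prefix + candidate)
    )
    return F.log_softmax(combined, dim=-1)
```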
Correctness Text Quality Diversity Method Success ↑ Grammar ↑ Perplexity ↓ Dist-1 ↑ Dist-2 ↑ Dist-3 ↑ G 0 0.52 44.3 ± 42.2 0.35 0.74 0.77 F INETUNE 0.21 0.44 55.8 ± 98.3 0.35 0.74 0.78 P PLM 0 0.54 60.8 ± 66.1 0.40 0.78 0.78 F UDGE 0.44 0.44 70.9 ± 89.4 0.40 0.79 0.78 Shakespeare 0.45 0.29 333.8 ± 418.9 0.44 0.81 0.79 Table 2: Couplet completion results. Success (main metric), grammaticality, perplexity, and distinctness of differ- ent methods, tested on 154 prefix lines from Shakespeare sonnets. F UDGE substantially outperforms automated baselines on success and maintains high diversity, although quality unsurprisingly suffers compared to the base G due to the difficult constraint F. Note Shakespeare’s work is often “incorrect” due to the narrowness of our metric F;6 he also scores poorly on text quality because our evaluation models are intended for more modern English. To train the discriminators, we sample a dataset rhyme sound at the end of xn , and (3) whether of 10 million generations of varied length from a sentence ends with xn . A special token is GPT2-Medium. From these generations, we sam- inserted 10 syllables from the end of x1:n . ple random subsequences x1:n of roughly 10 to 30 syllables and truncate t ≤ 10 ending syllables. 3. P PLM (Dathathri et al., 2019), which uses shal- These truncations become inputs x1:i to the predic- low predictors learned from G’s top-level hid- tors. For simplicity, we did not balance the class den layer to modify G’s states toward increas- labels for e.g., the iambic predictor during training, ing probability of the desired attribute via gra- although it is likely that doing so would improve dient ascent. We decompose the predictors performance. into the same iambic, rhyme sound, and end- At test time, we extract r from the given first of-sentence predictors as for F UDGE, inserting line of the couplet, and initialize t = 10, updating an additional hidden layer in the shallow pre- at each step. We then modify the output logits of dictor when needed to incorporate additional G by simply adding the log-probabilities from B1 , input (the desired rhyme sound and/or number B2 , and B3 , demonstrating the ease of composing of syllables until end-of-sentence). constraints in F UDGE. 4. Shakespeare’s original couplet completions. Baselines. We compare to four baselines.5 1. G, the original GPT2-Medium. All non-Shakespeare methods use top-k sam- pling with k = 10. 2. F INETUNE, a CCLM which finetunes G on 4.1.2 Results similar inputs to those used for B2 in F UDGE. Since it is not obvious how to compose multi- Even though our GPT2-Medium-generated train- ple CCLM’s for different attributes, we train ing dataset is completely different from the test a single CCLM for all desired properties to- domain, and contains essentially zero examples of gether. We condition by prefixing the input correct couplets, F UDGE is able to learn the desired with (1) whether the last 10 syllables of the attribute. As shown in Table 2, F UDGE greatly out- original untruncated x1:n are iambic, (2) the performs all automated baselines in success rate. Surprisingly, the P PLM baseline achieves zero 5 A system like Hafez (Ghazvininejad et al., 2016, 2017), success. We find that its iambic and rhyme pre- which enforces meter and rhyme at each decoding step using a hard constraint, could achieve perfect success rate. 
How- dictors are very poor, so we hypothesize that the ever, this approach relies on the meter and rhyme attributes relevant information is not easily extractable from being “prefix-checkable” at the word level: one can guarantee success by simply never selecting a word which immediately the last hidden layer of G. In contrast, F UDGE’s violates the constraint. This is often the case for simple rule- predictors operate directly on the raw text. based constraints, but not for many other interesting attributes, Funnily enough, F UDGE even matches Shake- such as the topic and formality attributes in our subsequent experiments. To preserve generality, F UDGE does not rely on speare according to F, although this is largely due this “prefix-checkable” property, and neither do our baselines. to the narrowness of F and should not be taken se-
riously.6 Similarly, the grammaticality and perplex- tionally, use rate of words in W is a poor metric, ity metrics are designed for our automated base- because a model can score highly by e.g., simply re- lines, and thus assign poor scores to Shakespeare’s turning the words in W, without generalizing to the antiquated and flowery style. full topic that W approximates. Instead, we adopt a F UDGE also maintains relatively fluent genera- notion of success which requires the model to gen- tion despite lower grammaticality and perplexity eralize the bag W to the full topic. The remaining compared to G. See Table 3 for two successful ex- metrics are measures of quality and diversity. amples. Interestingly, F UDGE also increases diver- sity compared to G, perhaps due to the difficult con- 1. Success, the average number of distinct words straint F forcing F UDGE to use lower-probability in a heldout bag W 0 which appear in the model regions of the base distribution P (X). output. Specifically, for each word in W, we add to W 0 the closest GloVe (Pennington et al., And even thence thou wilt be stol’n, I fear, 2014) word by cosine similarity, such that for this shall be the end. That’s pretty clear. the new word does not contain (and is not Or, if they sleep, thy picture in my sight contained by) any word in W. (This excludes I will be glad to look upon the night. e.g., most plurals.) Usage of distinct words in W 0 measures the model’s ability to generalize Table 3: Two examples of successful couplet comple- tions (in purple) generated by F UDGE. W to other on-topic words, of which W 0 is a non-exhaustive set. This is our main metric. Finally, it is possible (and trivial) to adjust the conditioning strength in F UDGE by multiplying the 2. Grammaticality, identical to the couplet task. binary predictors’ output logits by a constant. How- 3. Perplexity, identical to the couplet task. ever, this deviates from our Bayesian factorization of P (X|a), and we do not do so. 4. Distinctness, defined as in the couplet task. However, it is calculated separately within 4.2 Topic-Controlled Language Generation the 60 generations for each topic, and then Next, we explore topic control in English language averaged over the 7 topics. generation. The desired attribute a is to be on-topic for a given topic, such as science or politics. To Additionally, following the evaluation procedure facilitate comparison with prior work, we largely of prior work such as (Dathathri et al., 2019), we follow the setup of P PLM (Dathathri et al., 2019): run human evaluations via Amazon Mechanical the model is provided an approximation to the topic Turk for F UDGE against each baseline, comparing at test time, in the form of a bag of on-topic words topic control and fluency. For each pairwise com- W. The goal is to sample text according to the topic parison, we ask 3 workers to evaluate each of 420 approximated by W, starting from a generic prefix. paired outputs. Workers were asked to mark which There are 7 topics (space, politics, military, legal, generation is more on topic (first, second, both, or science, religion, and computers) and 20 prefixes, neither), and to rate each generation’s fluency on and the model generates 3 80-token7 samples from a Likert scale from 1 to 5. We report the average each topic-prefix pair, for a total of 420 generations. fraction of outputs marked as on-topic as well as Metrics. Unfortunately, we cannot easily con- the average fluency rating for each method. 
struct a rule-based F for being “on-topic.” Addi- 6 4.2.1 Method and Baselines We define F using somewhat narrow criteria (Appendix A), which capture only a subset of what Shakespeare consid- F UDGE Instantiation. Since we model topics as ered to be well-written couplets. The purpose of this task is to bags of words, F UDGE uses a binary predictor evaluate F UDGE’s ability to satisfy a difficult well-formedness constraint compared to automated baselines, rather than to B(x1:i , w) which takes a prefix x1:i and word w, perfectly capture the human notion of an iambic pentameter and classifies whether w appears in the future xi:n couplet. Thus Shakespeare is marked wrong when he (1) uses for n ≥ i. (Since it is desirable to stay on topic archaic pronunciations, (2) uses loose rhymes, (3) elides syl- lables to fit meter, or (4) uses words missing from the CMU even after successfully getting on topic, we use xi:n Pronouncing Dictionary. See Appendix A.1 for details. Of rather than x1:n .) Training examples (x1:i , w) are course, Shakespeare is only included as a whimsical point of sampled from the same dataset of 10 million GPT2- reference; our generations obviously do not hold a candle to Shakespeare’s originals. Medium generations used for the couplet task, and 7 All models and baselines use GPT2 tokenization. B is trained using noise-contrastive estimation. B
On-Topic Text Quality Diversity Method Success ↑ Grammar ↑ Perplexity ↓ Dist-1 ↑ Dist-2 ↑ Dist-3 ↑ G 0.22 0.81 37.1 ± 26.9 0.35 0.78 0.92 F INETUNE 0.28 0.74 24.9 ± 13.7 0.29 0.70 0.88 W DEC 0.14 0.59 33.8 ± 33.7 0.16 0.42 0.55 P PLM 0.48 0.78 43.1 ± 23.7 0.35 0.78 0.92 F UDGE 0.59 0.79 40.7 ± 26.3 0.34 0.75 0.91 Table 4: Topic control results. Success (main metric), grammaticality, perplexity, and distinctness for different methods. F INETUNE and W DEC often degenerate into repeating the given bag of words W; this is ill-captured by perplexity, but results in poor grammaticality and distinctness. F UDGE substantially outperforms all baselines on success, including the strong gradient-based P PLM baseline, while preserving high quality and diversity. is a lightweight LSTM-based classifier similar to 4. P PLM (Dathathri et al., 2019), which modifies B2 from the couplet task. the activations of G to make the desired bag of At test time, we can compose individual-word words more likely at the immediate next posi- constraints if we assume conditional independence tion. We use their method without reranking between words (although this may be imperfect). for fair comparison. Given a bag of N words {w1 . . . wN } and pre- fix x1:i , we could condition on all words in the All methods use top-k sampling with k = 10, bag appearing in the future by adding all log- following Dathathri et al. (2019)’s setup. probabilities log P (w1 |x1:i ) . . . log P (wN |x1:i ) to 4.2.2 Results G’s logits. However, topic control does not require every word to appear; perhaps some number λ of Method Topic Fluency on-topic words is enough to be “on-topic.” There- fore, we model the topic constraint as selecting a G 0.16 4.11 random subset of λ words from the original bag, F UDGE 0.78 4.30 and requiring that only those λ words all appear. F INETUNE 0.24 3.95 Since each of the N words is selected with proba- F UDGE 0.76 4.22 bility Nλ , the quantity we add to the base G logits is λ PN W DEC 0.49 2.50 N j=1 log P (wj |x1:i ) in expectation. In our ex- F UDGE 0.75 4.21 periments we use λ = 4, based on a fantasy-topic bag of words used for validation (Appendix C). P PLM 0.45 4.05 Baselines. We compare to four baselines. F UDGE 0.74 4.16 1. G, the original GPT2-Medium. Table 5: Topic control human evaluations, pairwise comparisons. F UDGE achieves a substantially higher 2. F INETUNE, which finetunes G on the same fraction of on-topic outputs compared to each baseline, inputs used for F UDGE. The future word is in addition to higher average fluency (rated 1 to 5). given as a prefix for conditioning. At test time, we compute logits for each prefix in the given F UDGE achieves the highest success by a sub- W and use the average as the true logits, as stantial margin (Table 4), and outperforms all base- an ad hoc way to condition on the full W. lines on human evaluations in both topic relevance 3. W DEC, a simple weighted decoding imple- and fluency (Table 5). F UDGE simultaneously pre- mentation which greedily considers only the serves high quality and diversity according to auto- immediate next token when optimizing for a. mated metrics. Table 6 shows two examples. Instead of using B, W DEC just adds a fixed Unsurprisingly, G performs poorly on success. λW DEC to the logit for each word in W. Note W DEC and F INETUNE also perform poorly, in suc- W DEC requires a to be well-defined at the cess and especially in distinctness. 
WDEC frequently degenerates into repeating the given words in the bag W, despite tuning λWDEC (Appendix C).
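For reference, the topic-bag adjustment described in Sec. 4.2.1 (adding λ/N times the summed future-word log-probabilities to G's logits, in expectation over a random size-λ subset of the bag) can be sketched as follows. This is an illustrative sketch, not the released implementation; it assumes a predictor that already returns a matrix of per-candidate, per-bag-word binary logits, and λ = 4 as used in the paper.

```python
import torch.nn.functional as F


def topic_adjusted_logits(g_logits, word_logits, lam=4.0):
    """Topic-bag conditioning as an expectation over a random size-lam subset.

    g_logits: (k,) base logits from G for the top-k candidate next tokens.
    word_logits: (k, N) binary-classifier logits; entry [c, j] scores whether bag
        word w_j will appear in the future continuation of candidate c.
    lam: roughly how many bag words should appear for the text to be "on-topic".
    Returns (k,) renormalized log-probabilities over the candidate set.
    """
    n_words = word_logits.size(1)
    # log P(w_j appears in the future | prefix + candidate), for each bag word.
    future_log_probs = F.logsigmoid(word_logits)
    # In expectation over a random subset of lam of the N bag words, the added
    # quantity is (lam / N) * sum_j log P(w_j | x_{1:i}).
    adjustment = (lam / n_words) * future_log_probs.sum(dim=1)
    return F.log_softmax(F.log_softmax(g_logits, dim=-1) + adjustment, dim=-1)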
Space: The issue focused on the original plot, which was Translation Corpus (Post et al., 2013), a collection about a mysterious ship that would land on Earth, and of transcribed Spanish conversations with English would lead to humanity’s first interstellar expedition. The original plan called for humanity to use the spacecraft to translations. Both the source Spanish and target colonize outer space and build the first city on Mars. But English are highly informal and disfluent. Salesky this idea fell by the wayside in the final drafts.\n\n"It was et al. (2019) augment the Fisher dataset with addi- just not a very popular idea and it wasn’ tional parallel English translations, rewritten to be Politics: The issue focused on whether the two institutions were operating within the bounds set by the constitution more fluent (and hence more formal); see Table 7 and the law.\n\nThe Constitutional Court said that both for an example. Our task is to translate the origi- governments "have a duty to ensure the integrity of the nal informal Spanish to into more formal English. electoral process and its effective administration, especially in light of the current political climate that is threatening However, we assume that Salesky et al. (2019)’s the functioning of elections" fluent references are unavailable during training. Table 6: The first output from F UDGE when using the entonces de verdad sí sí pero entonces tu estudiando para prefix “The issue focused on” for two topics. We use es es digo es más porque es exactamente red to highlight words in the given bag of words W Then, if it’s business, but then you are a student for a PHD, along with obvious forms (e.g., plurals), and cyan for the Master’s is that exactly. other on-topic words, including related words not in the If it’s business, then you are a student for a PhD. The heldout bag W 0 . More examples in Appendix J. masters is exactly that. Table 7: An example from the Fisher dataset. F INETUNE also suffers from repetition, which ap- Top: The original Spanish transcription. pears to be the result of distribution shift from fine- Middle: The original English translation. tuning. Our fine-tuning dataset was built by sam- Bottom: Salesky et al. (2019)’s more fluent version. pling directly from the original P (X) modeled by G to mitigate distribution shift, but it is well-known Metrics. The desired attribute a is formality, that language model generations are more repeti- but we cannot sacrifice the source sentence’s mean- tive than natural language (Holtzman et al., 2018, ing. The latter requirement makes generation more 2019). We hypothesize that F INETUNE, being fine- constrained than in the couplet and topic tasks, so tuned on language model generations rather than perplexity and distinctness are less relevant. In- natural language, amplifies this repetitiveness. This stead, we use the following: repetition is reflected in the poor grammaticality for both F INETUNE and especially W DEC. In contrast, 1. BLEU Score (Papineni et al., 2002), using two F UDGE does not touch the original P (X), largely of Salesky et al. (2019)’s fluent references per avoiding F INETUNE’s distribution shift problem on test example. This is our main metric. this task. 2. Formality, the average probability that the Finally, F UDGE outperforms the strong gradient- model’s outputs are formal, according to an based P PLM method, despite requiring access only evaluator trained on the Family/Relationships to G’s output logits. 
Non-reliance on gradients domain of the GYAFC formality dataset (Rao means F UDGE is also many times faster than P PLM, and Tetreault, 2018). The evaluator is an which takes a few hours compared to F UDGE’s LSTM followed by a linear layer. 15 minutes for the full set of 420 generations on our hardware. Sometimes we do not even have 4.3.1 Method and Baselines gradients: for example, gradients are unavailable F UDGE Instantiation. We assume that the at- in the API for GPT3 at time of writing. tribute a, formality, is conditionally independent from the original conditioning in G, i.e., the mean- 4.3 Machine Translation Formality Change ing of the Spanish input. F UDGE uses a binary Finally, we turn to a somewhat more challenging predictor B(x1:n ) which classifies whether the text task, changing formality in machine translation starting with prefix x1:n is written in a formal style. — specifically, from informal to formal. Given a B is an LSTM followed by a linear layer, trained source sentence written in an informal and con- on the Entertainment/Music domain of GYAFC. versational style, the goal is to output a transla- At test time, F UDGE directly augments G’s log- tion which is also more formal. We test on the its using log-probabilities from B. G is a pre- Fisher and CALLHOME Spanish–English Speech trained Marian (Junczys-Dowmunt et al., 2018)
transformer model for Spanish-English. We evalu- the couplet task, one could adjust the strength of ate both when G is fine-tuned on the original Fisher the formality control in F UDGE, although this is training dataset (i.e., using the original targets, not unprincipled from the view of modeling P (X|a). Salesky et al. (2019)’s more fluent targets) as well Moreover, while F UDGE and G achieve similar as zero-shot with no fine-tuning, which is challeng- BLEU after fine-tuning G, F UDGE achieves higher ing due to the highly informal and disfluent text. BLEU compared to G when G is not fine-tuned on Baselines. We compare to two baselines. the Fisher training set. In the latter case, controlling for formality somewhat remedies the struggles of 1. G, the original machine translation model. G when not fine-tuned on such disfluent text. 2. G + ST, a pipeline consisting of G followed In contrast, the G + ST baseline achieves near- by a style transfer model. Our style transfer perfect formality but less than half the BLEU of model is T5 (Raffel et al., 2020), fine-tuned G, due to the style transfer model overfitting to on the same GYAFC Entertainment/Music do- the GYAFC Entertainment/Music dataset. This main that we used to train B in F UDGE. is similar to the distribution shift issue that we observed in topic control for F INETUNE, an issue Since we do not assume access to Salesky et al. which F UDGE largely avoids. Nevertheless, there (2019)’s more formal targets during training, it is remains substantial room for improvement on this difficult to apply P PLM to this task: P PLM’s pre- difficult task. dictor would operate on the pretrained translation model’s hidden states, thus requiring a Spanish- Spanish que era lo que tenía que tienes que hacer G that was what you had to do English translation dataset with both formal and F UDGE That was what you had to do informal English.8 We omit F INETUNE for the Reference What’s there to do? same reason. In contrast, F UDGE requires only the Spanish ah en mi en inglaterra por ejemplo original English dataset with formality annotations. G Ah, in my, in England, for example. F UDGE Ah, in England, for example. All methods use greedy decoding. Reference In England, for example? 4.3.2 Results Table 9: Example translations by G (fine-tuned on the Fisher dataset) and F UDGE using the same G. Origi- G (No fine-tune) G (Fine-tune) nal Spanish and Salesky et al. (2019) references also Method BLEU ↑ Form. ↑ BLEU ↑ Form. ↑ shown. In this setting, F UDGE achieves similar BLEU to G while increasing formality. While F UDGE often G 16.98 0.45 22.03 0.41 simply corrects punctuation or capitalization (top), it G + ST 7.87 0.96 9.63 0.97 also makes more complex adjustments (bottom). More F UDGE 17.96 0.51 22.18 0.48 examples in Appendix L. Table 8: Machine translation formality results. BLEU (main metric) and average formality for different meth- 5 Discussion ods, with and without fine-tuning G on the Fisher do- main. F UDGE increases the formality of translations F UDGE is a principled approach to controlled text compared to the base model G while preserving or in- generation which models P (X|a) by closely fol- creasing BLEU score. Conversely, G with style transfer lowing a Bayesian factorization, thus preserving overfits to the formality data, resulting in near-perfect the base P (X) as much as possible. F UDGE formality but losing the original meaning. 
achieves strong performance on a wide range of different tasks: poetry couplet completion, topic As shown in Table 8, F UDGE increases the for- control, and informal-to-formal machine transla- mality of outputs compared to G, even though the tion. Additionally, F UDGE can easily compose test-time formality predictor is trained on a dif- different attributes in a modular fashion: the meter, ferent domain (Family/Relationships, rather than rhyme, and end-of-sentence constraints for couplet Entertainment/Music). Note that formality unsur- completion, and the individual words within each prisingly decreases after fine-tuning G, simply due topic bag for topic control. In principle, F UDGE is to the informality of the fine-tuning dataset. As in applicable to any controlled generation task where 8 We nevertheless ran P PLM in a somewhat convoluted we can train discriminators for the desired attribute setup, but found that it performed poorly (Appendix B). or attributes.
6 Ethics of Controlled Text Generation Rosanne Liu. 2019. Plug and play language mod- els: a simple approach to controlled text generation. We recognize that strong controlled generation arXiv preprint arXiv:1912.02164. methods have the potential to produce harmful out- Jessica Ficler and Yoav Goldberg. 2017. Controlling puts and/or misinformation when used adversari- linguistic style aspects in neural language genera- ally (Wallace et al., 2019, 2020). However, such tion. arXiv preprint arXiv:1707.02633. methods can also be a powerful tool for mitigating harmful biases learned by large pretrained language Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018. Bottom-up abstractive summarization. models (Radford et al., 2019; Brown et al., 2020), arXiv preprint arXiv:1808.10792. for example by detoxifying language (Dathathri et al., 2019; Krause et al., 2020). Overall, we be- Marjan Ghazvininejad, Xing Shi, Yejin Choi, and lieve it is still beneficial to continue research into Kevin Knight. 2016. Generating topical poetry. In Proceedings of the 2016 Conference on Empirical general controlled text generation methods such as Methods in Natural Language Processing, pages F UDGE. 1183–1191. Acknowledgements Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. 2017. Hafez: an interactive poetry We thank Daniel Fried, David Gaddy, Eric Wallace, generation system. In Proceedings of ACL 2017, System Demonstrations, pages 43–48. Kevin Lin, Nicholas Tomlin, Ruiqi Zhong, and the three anonymous reviewers for their helpful Michael Gutmann and Aapo Hyvärinen. 2010. Noise- comments and feedback, which aided us in greatly contrastive estimation: A new estimation principle improving the paper. We also thank the authors of for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artifi- Dathathri et al. (2019) for clarifying our questions cial Intelligence and Statistics, pages 297–304. about their topic control setup. This work was supported by Berkeley AI Research, DARPA under Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and agreement HR00112020054, and the NSF through Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. a fellowship to the first author. The content does not necessarily reflect the position or the policy Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine of the government, and no official endorsement Bosselut, David Golub, and Yejin Choi. 2018. should be inferred. Learning to write with cooperative discriminators. arXiv preprint arXiv:1805.06087. Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan References Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. arXiv preprint Tom B Brown, Benjamin Mann, Nick Ryder, Melanie arXiv:1703.00955. Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Askell, et al. 2020. Language models are few-shot Vechtomova. 2018. Disentangled representation learners. arXiv preprint arXiv:2005.14165. learning for non-parallel text style transfer. arXiv preprint arXiv:1808.04339. Reuben Cohn-Gordon, Noah Goodman, and Christo- pher Potts. 2018. Pragmatically informative im- Marcin Junczys-Dowmunt, Roman Grundkiewicz, age captioning with character-level inference. arXiv Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, preprint arXiv:1804.05417. Tom Neckermann, Frank Seide, Ulrich Germann, Al- ham Fikri Aji, Nikolay Bogoychev, et al. 2018. 
Mar- Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing ian: Fast neural machine translation in c++. arXiv Huang. 2019a. Style transformer: Unpaired text preprint arXiv:1804.00344. style transfer without disentangled latent representa- tion. arXiv preprint arXiv:1905.05621. Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Car- conditional transformer language model for control- bonell, Quoc V Le, and Ruslan Salakhutdinov. lable generation. arXiv preprint arXiv:1909.05858. 2019b. Transformer-xl: Attentive language mod- els beyond a fixed-length context. arXiv preprint Ben Krause, Akhilesh Deepak Gotmare, Bryan Mc- arXiv:1901.02860. Cann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2020. Gedi: Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Generative discriminator guided sequence genera- Hung, Eric Frank, Piero Molino, Jason Yosinski, and tion. arXiv preprint arXiv:2009.06367.
Guillaume Lample, Myle Ott, Alexis Conneau, Lu- Jeffrey Pennington, Richard Socher, and Christopher D dovic Denoyer, and Marc’Aurelio Ranzato. 2018a. Manning. 2014. Glove: Global vectors for word rep- Phrase-based & neural unsupervised machine trans- resentation. In Proceedings of the 2014 conference lation. arXiv preprint arXiv:1804.07755. on empirical methods in natural language process- ing (EMNLP), pages 1532–1543. Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc’Aurelio Ranzato, and Y- Matt Post, Gaurav Kumar, Adam Lopez, Damianos Lan Boureau. 2018b. Multiple-attribute text rewrit- Karakos, Chris Callison-Burch, and Sanjeev Khu- ing. In International Conference on Learning Rep- danpur. 2013. Improved speech-to-text translation resentations. with the fisher and callhome spanish–english speech translation corpus. In Proc. IWSLT. Jey Han Lau, Trevor Cohn, Timothy Baldwin, Julian Brooke, and Adam Hammond. 2018. Deep-speare: Alec Radford, Karthik Narasimhan, Tim Salimans, and A joint neural model of poetic language, meter and Ilya Sutskever. 2018. Improving language under- rhyme. arXiv preprint arXiv:1807.03491. standing by generative pre-training. Angeliki Lazaridou, Anna Potapenko, and Olivier Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Tieleman. 2020. Multi-agent communication meets Dario Amodei, and Ilya Sutskever. 2019. Language natural language: Synergies between functional models are unsupervised multitask learners. OpenAI and structural language learning. arXiv preprint blog, 1(8):9. arXiv:2005.07064. Colin Raffel, Noam Shazeer, Adam Roberts, Kather- Mike Lewis, Yinhan Liu, Naman Goyal, Mar- ine Lee, Sharan Narang, Michael Matena, Yanqi jan Ghazvininejad, Abdelrahman Mohamed, Omer Zhou, Wei Li, and Peter J. Liu. 2020. Exploring Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. the limits of transfer learning with a unified text-to- Bart: Denoising sequence-to-sequence pre-training text transformer. Journal of Machine Learning Re- for natural language generation, translation, and search, 21(140):1–67. comprehension. arXiv preprint arXiv:1910.13461. Sudha Rao and Joel Tetreault. 2018. Dear sir or Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, madam, may i introduce the gyafc dataset: Corpus, and Bill Dolan. 2015. A diversity-promoting objec- benchmarks and metrics for formality style transfer. tive function for neural conversation models. arXiv arXiv preprint arXiv:1803.06535. preprint arXiv:1510.03055. Elizabeth Salesky, Matthias Sperber, and Alex Waibel. Ruibo Liu, Guangxuan Xu, Chenyan Jia, Weicheng 2019. Fluent translations from disfluent speech Ma, Lili Wang, and Soroush Vosoughi. 2020. Data in end-to-end speech translation. arXiv preprint boost: Text data augmentation through reinforce- arXiv:1906.00556. ment learning guided conditional generation. In Proceedings of the 2020 Conference on Empirical Abigail See, Peter J Liu, and Christopher D Man- Methods in Natural Language Processing (EMNLP), ning. 2017. Get to the point: Summarization pages 9031–9041. with pointer-generator networks. arXiv preprint arXiv:1704.04368. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Abigail See, Stephen Roller, Douwe Kiela, and Jason Luke Zettlemoyer, and Veselin Stoyanov. 2019. Weston. 2019. What makes a good conversation? Roberta: A robustly optimized bert pretraining ap- how controllable attributes affect human judgments. proach. arXiv preprint arXiv:1907.11692. arXiv preprint arXiv:1902.08654. 
Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Sheng Shen, Daniel Fried, Jacob Andreas, and Dan Michael Auli, and Sergey Edunov. 2019. Face- Klein. 2019. Pragmatically informative text gener- book fair’s wmt19 news translation task submission. ation. arXiv preprint arXiv:1904.01301. arXiv preprint arXiv:1907.06616. Nishant Subramani, Samuel Bowman, and Kyunghyun Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Cho. 2019. Can unconditional language models re- Jing Zhu. 2002. Bleu: a method for automatic eval- cover arbitrary sentences? In Advances in Neural In- uation of machine translation. In Proceedings of the formation Processing Systems, pages 15258–15268. 40th annual meeting of the Association for Compu- tational Linguistics, pages 311–318. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial trig- Adam Paszke, Sam Gross, Francisco Massa, Adam gers for attacking and analyzing nlp. arXiv preprint Lerer, James Bradbury, Gregory Chanan, Trevor arXiv:1908.07125. Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, Eric Wallace, Mitchell Stern, and Dawn Song. high-performance deep learning library. In Ad- 2020. Imitation attacks and defenses for black- vances in neural information processing systems, box machine translation systems. arXiv preprint pages 8026–8037. arXiv:2004.15015.
Qixin Wang, Tianyi Luo, Dong Wang, and Chao Xing. 2016. Chinese song iambics generation with neural attention-based model. arXiv preprint arXiv:1604.06274. Alex Warstadt, Amanpreet Singh, and Samuel R Bow- man. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641. Robert L Weide. 1998. The cmu pronouncing dictionary. URL: http://www. speech. cs. cmu. edu/cgibin/cmudict. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow- icz, et al. 2019. Huggingface’s transformers: State- of-the-art natural language processing. ArXiv, pages arXiv–1910. Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In Thirty-first AAAI conference on artificial intelligence. Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer se- quences. arXiv preprint arXiv:2007.14062. Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680. Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Chris- tiano, and Geoffrey Irving. 2019. Fine-tuning lan- guage models from human preferences. arXiv preprint arXiv:1909.08593.
A Details of F for Couplet Completion same is true for our grammaticality and perplexity metrics.) We provide the full details of the function F we use One source of error is words which are out-of- to check iambic pentameter, rhyme, and sentence- vocabulary for the CMU Pronouncing Dictionary. ending in our couplet completion task. Note that Such words are almost never generated by either iambic pentameter consists of two components: F UDGE or our automated baselines, but appear in a iambic meter as well as containing exactly ten syl- fifth of Shakespeare’s lines, resulting in failures on lables. the iambic meter and syllable checks. Nevertheless, most of Shakespeare’s “errors” are 1. Iambic meter: Given a phrase, we obtain the the result of real — though slight — deviations sequence of stresses (0 for unstressed, 1 for from our very strict definitions of meter and rhyme. stressed, 2 for secondary stress) for each word, In particular, he frequently (1) elides syllables to according to the CMU Pronouncing Dictio- fit meter, and (2) uses loose rhymes; both “error” nary (Weide, 1998). If any word does not types are likely exacerbated by differences between exist in the dictionary (almost never for non- archaic and modern pronunciations. The example Shakespeare methods) we return False. We in Table 10 illustrates both types of “errors.” Al- treat 2 as ambiguous stress, and additionally though such deviations are often acceptable to a change 1 to 2 for any monosyllabic words, human, they are difficult to capture in an automatic i.e. we allow monosyllabic stressed words metric, and we do not allow such deviations in F. to be unstressed but not vice versa. Finally, Again, Shakespeare is only included as a whimsical we check that all syllables at even indices (0- point of reference, and not as a serious baseline to indexed) are unstressed or ambiguous, and all be compared to. syllables at odd indices are stressed or am- biguous. But here’s the joy; my friend and I are one; Sweet flattery! then she loves but me alone. 2. Number of syllables: We count the number of syllables in each word based on the number of Table 10: An example couplet by William Shakespeare, stresses according to the CMU Pronouncing illustrating two common deviations from the narrow Dictionary. If a word does not exist in the dic- definition of correctness we use in F. For this exam- tionary, we estimate the number of syllables ple to follow iambic meter, one must read “flattery” in only two syllables. Moreover, “one/alone” is a loose by rounding the number of letters divided by (non-perfect) rhyme, at least in modern English. 3 to the nearest integer. 3. Rhyme: Two words rhyme if and only if they B P PLM Baseline in Machine Translation both exist in the CMU Pronouncing Dictio- nary and are a perfect rhyme according to the As discussed in the main text, it is difficult to dictionary. apply P PLM in our machine translation setup, in which P (a|X) is learned from an English formal- 4. Sentence-ending: We check if the output ends ity dataset without parallel Spanish. Since P (X) with a period, question mark, or exclamation is a Spanish-English translation model, we must mark. obtain hidden states for training P PLM’s P (a|X) by first “backtranslating” English into Spanish, ac- Of course, both F UDGE and F INETUNE will fit cessing a second pretrained translator. For this to whatever output is given by F. 
The purpose purpose we use a second pretrained Marian trans- of the couplet task is to check F UDGE’s ability former from HuggingFace (https://huggingface.co/ to fit a difficult well-formedness constraint. We Helsinki-NLP/opus-mt-en-es). Additionally, we simply design an F that corresponds to true iambic needed to tune their suggested hyperparameters. pentameter rhymes in most cases. During evaluation, we observe that P PLM makes some reasonable modifications for formality com- A.1 Shakespeare Evaluation pared to the base P (X), like changing “hard” to Shakespeare himself performs somewhat poorly “difficult,” but such improvements are also accom- according to F, which is designed with the auto- panied by occasional disfluencies and/or repetitions mated baselines in mind, not for Shakespeare. (The (although such problems plague all methods to
some degree). Overall, while P PLM achieves simi- lightweight process of constructing and training lar BLEU to F UDGE, it is substantially less formal predictors, noise-contrastive estimation is a natural (Table 11). choice for the rhyme and future word predictors: we avoid softmaxing over the output dimension G (Fine-tune) during training. (This is primarily relevant for the future word predictor, as the number of distinct Method BLEU ↑ Form. ↑ rhyme sounds is not too large, but we use noise- P PLM 21.94 0.40 contrastive estimation for both for consistency’s F UDGE 22.18 0.48 sake.) For the P PLM baseline, we used step size 0.01 Table 11: P PLM baseline in machine translation formal- for both couplet completion and MT after tuning, ity on the fine-tuned G. and kept their other hyperparameters fixed. For topic control we simply evaluated their provided C Hyperparameter Choices generations instead of rerunning their model. F UDGE has essentially one hyperparameter in our D Ablations on Predictor Architectures topic control task, λ, which controls the strength of conditioning and corresponds to the number of Some variation in predictor architectures is neces- words in the bag which should appear in the future. sary due to the diversity of our tasks (as evidenced To choose λ in topic control, we used a separate by the difficulties in adapting PPLM). Specifically, validation bag of words (on the topic of fantasy; Ap- while our core predictor architecture is word em- pendix K.4) to select a reasonable λ for our main beddings followed by LSTM and output layer, task- paper experiments (λ = 4). Unlike in the main specific architectures vary because some “predic- paper where we use heldout bags W 0 to measure tors” are actually families of related predictors. We success, during validation we simply use the origi- model such families as a single predictor taking nal bag. We use a set of 60 generations, considering additional input (e.g., rhyme sound in poetry); this values ranging from 1 to 6 (Table 12), although the is needed in our poetry and topic tasks. result may be somewhat noisy. Of course, different On these two tasks, we provide ablations with choices of λ result in different tradeoffs (Appendix more homogenized predictors: additional inputs G). are simply embedded and concatenated to each in- We also optimized the conditioning strength put word embedding. The difference is relatively λW DEC for the W DEC baseline on the same fantasy small in both cases (Tables 14 and 15). F UDGE - bag of words, considering values ranging from 1 to M OD indicates the ablated version of F UDGE. 32. We selected the only value (4) which achieved reasonable success without a total collapse in diver- E Alternative Perplexity Measurements sity (Table 13), but diversity still collapsed when tested on our seven main test bags of words. On the couplet completion task, we additionally We do not optimize any model hyperparameters measure perplexity using Transformer-XL (Dai in the couplet completion and informal-to-formal et al., 2019b) and using a GPT model fine-tuned translation tasks. LSTM’s and feedforward net- on Shakespearean language as generated by (Lau works are 3 layers (including the output layer of di- et al., 2018). We measure using Transformer-XL mension 1) and 300-dimensional unless otherwise on the topic control task as well. Relative perplex- specified. 
They are bidirectional (150-dimensional ities between most models remain largely similar in each direction) for the couplet rhyme predictor when switching between GPT and Transformer- and the topic control future words predictor, and XL, with a few exceptions. Compared to the base otherwise unidirectional. Attention mechanisms GPT, Shakespeare’s perplexity naturally decreases use key-query-value attention. For the rhyme and while other models’ perplexities increase when future words predictors the output hidden state is measured with Shakespeare-finetuned GPT. The multiplied element-wise by the embedding of the highly repetitive and disfluent W DEC baseline is rhyme sound or future word, then concatenated to rightly punished for this behavior when measured the embeddings, before the final feedforward net- by Transformer-XL. P PLM also obtains slightly work. Since a selling point of our method is the lower perplexity than F UDGE on topic control when