A Temporal Variational Model for Story Generation
A Temporal Variational Model for Story Generation

David Wilmot (School of Informatics, University of Edinburgh, david.wilmot@ed.ac.uk)
Frank Keller (School of Informatics, University of Edinburgh, keller@inf.ed.ac.uk)

arXiv:2109.06807v1 [cs.CL] 14 Sep 2021

Abstract

Recent language models can generate interesting and grammatically correct text in story generation but often lack plot development and long-term coherence. This paper experiments with a latent vector planning approach based on a TD-VAE (Temporal Difference Variational Autoencoder), using the model for conditioning and reranking for text generation. The results demonstrate strong performance in automatic cloze and swapping evaluations. The human judgments show stories generated with TD-VAE reranking improve on a GPT-2 medium baseline and show comparable performance to a hierarchical LSTM reranking model. Conditioning on the latent vectors proves disappointing and deteriorates performance in human evaluation because it reduces the diversity of generation, and the models don't learn to progress the narrative. This highlights an important difference between technical task performance (e.g. cloze) and generating interesting stories.

1 Introduction

There has been huge recent success using large language models such as BERT (Devlin et al., 2019) and the GPT family (Radford, 2018; Radford et al., 2019; Brown et al., 2020) in a diverse range of language understanding and language generation tasks. In story generation, recent work has focused on using neural language generation either directly using language models, seq2seq models (Roemmele, 2016), or as part of a hierarchical model (Fan et al., 2018). However, these models often generate stories that lack both diversity and long-range coherence (See et al., 2019; Holtzman et al., 2018).

A story is a series of events (Abbott, 2008). But a story is not just a series of events. It involves (Ryan, 2014) sudden switches in the plot, contrasts between the goals and results of characters' actions, self-contradiction, repetition of narrative sequences, and elements of the narrated happenings that have multiple meanings. The plot also requires events to be significant and unpredictable (Schmid, 2003) while transgressing seemingly impenetrable barriers (Lotman et al., 1977), e.g. a hero on a dangerous quest or an unlikely marriage. Writing a good story requires tracking a complex set of entities and events.

Some more recent neural work has tried to address this challenge by using a multi-step planning approach: a story skeleton is generated using semantic role labelling (SRL; Fan et al. 2019), then a surface sentence is generated from each point on the plan. Numerous alternatives to SRL for story generation plans have been tried, including keyword extraction (Yao et al., 2019; Goldfarb-Tarrant et al., 2019), compression (Xu et al., 2018), event representation (Martin et al., 2018; Ammanabrolu et al., 2020), and, for non-neural models, graph-based generation (Li et al., 2013). Goldfarb-Tarrant et al. (2020) additionally use a suite of individually trained re-rankers on top of a multistage SRL planning pipeline.

Latent state-space models (Koller and Friedman, 2009) abstract away the relationship between the observed input and output into a probabilistic latent state. In time series modelling, autoregressive models infer within the original observation space (x). In contrast, state-space models learn and infer temporal relations in the abstract state (z) and then reconstruct the observed state (x). This is in principle similar to the story planning skeletons that use representations such as SRL. The limitation of using keywords or SRL is that the model has to commit to a specific representation a priori. In this paper, we propose to use latent representations instead. The advantage of using a latent state for planning is that the model can learn the most valuable features to infer in the planning process. These representations can be richer than a predetermined simplification such as keywords or SRL.

Specifically, we use the Temporal Difference
Variational Autoencoder (TD-VAE; Gregor et al. 2019), a state-space model which learns an abstract state z, which is projected forward to a future state z_tn and then reconstructed to the input x_tn. t is a temporal representation referring to the index of a sentence within a story. For the internal workings of TD-VAE see Figure 1. State-space models have been the subject of much recent interest and development (Tang and Salakhutdinov, 2013; Chung et al., 2015; Ha and Schmidhuber, 2018; Kim et al., 2019). The intuition behind state-space models is that they can learn an abstract state containing the most salient features of the environment, removing irrelevant details. These models have found much success in complex RL learning environments where the simplified latent state aids policy planning. While these models have been used for dialogue generation (Serban et al., 2017), their use for other NLP tasks has been limited.

Figure 1: The main TD-VAE standalone architecture, reproduced from Gregor et al. (2019). A belief state about the world is produced online from observations (x) with a recurrent network; two time points separated by an interval are chosen; the belief at t2 is sampled to imagine a specific possible state of the world (a bootstrap state); the inference network (smoothing) infers what the state at t1 would have been, given the imagined state at t2; the state prediction network (forward model) projects the state at t1 forward to t2; the decoder network (observation model) reconstructs the observation from the state; and the gradient of the loss to be minimised is calculated over these components.

In TD-VAE, z is a Bayesian representation implemented using a variational autoencoder (Kingma and Welling, 2014). Unlike autoregressive state-space models such as those proposed by Graves (2013) or Doerr et al. (2018) that only predict the next time step, TD-VAE can make jumpy predictions for multiple time steps ahead, which can reduce the errors from an autoregressive rollout. VAE models have been used in story generation and planning. Li et al. (2019) use a conditional VAE that learns a latent global state cache which text generation conditions on. Wang et al. (2020) add a planning mechanism to a conditional VAE, but unlike their proposal, there is a separate keyword skeleton for story planning, and the planning isn't directly in the latent state.

Overall we apply our TD-VAE model to text generation in two ways: as a reranker and by directly conditioning on the latent vectors for text generation. Our TD-VAE model is used directly as a discriminator to rerank generated candidate sentences (Holtzman et al., 2018) as part of a beam search over sampled continuations. To allow comparison with two-step planning methods, we also use Pseudo Self-Attention (PSA; Ziegler et al. 2019) to inject the expected latent state directly as history in the transformer model. This conditions directly on the latent vectors as a planning mechanism (TD-VAE Conditioning); Li et al. (2020) employ a similar approach. We find that the reranker model can improve text generation and perform well on
automated coherence evaluations. The second approach, the conditioning model, performs worse than other baselines because conditioning reduces the diversity of the text generated.

2 Method

2.1 Approach

Our overall approach is to build a hierarchical model on top of GPT-2. On top of GPT-2 is a sentence encoder based on transformers that builds rich sentence representations. The top level of the model is the TD-VAE, which learns to draw inferences between the sentence time steps in the story. Figure 2 illustrates our architecture. Code and configuration for this paper are available on Github at https://github.com/dwlmt/knowledgeable-stories.

Figure 2: TD-VAE hierarchical architecture showing the multi-layer architecture: the base layer is GPT-2. From this, a sentence encoder infers sentence embeddings. The top level is TD-VAE, which learns to infer and reconstruct future sentence embedding states. It should be noted that the sentence embeddings are concatenated when fed into the TD-VAE loss, but the respective loss maximises the dot product.

2.2 TD-VAE

The top TD-VAE layer in the architecture infers a future time step t2 from an earlier time step t1 (see Figure 2). TD-VAE enables jumpy predictions of state, projecting forward multiple time steps of latent vectors from each position, conditioning from t1 to tn. The input is the sentence embedding e_t, which encapsulates the world's state as the reader observes it. These sentence representations are compressed into beliefs b_t1 and b_t2 using a stacked unidirectional LSTM. These beliefs are expectations about the state of the world in the future, i.e., they model what the reader expects to happen. These expectations aren't mapped directly to a future sentence state but via an abstract latent state z_t. The model can sample using variational inference from the latent distribution p_B(z_t2 | b_t2) to guess the latent state for a future point in the story, and given the future latent state reconstruct the sentence embedding at this future point in the story, p(e_t2 | z_t2). This allows the model to infer what the sentences at 2, 3, 4, ... steps in the future will be like, without generating intermediate concrete sentence representations, and so it functions as an abstract planning mechanism.

The reconstruction loss of the decoder network is given in (1). It is designed to maximise the probability of the sentence embedding reconstruction given the latent state at that time step. This is constrained by the second part, a bottleneck which prevents too much of the concrete sentence state e_t2 being encoded in the latent state z_t2.

  ℓ_recon = log p(e_t2 | z_t2) − log p_B(z_t2 | b_t2)   (1)

To predict t2, the model should estimate what the state of the world should be at t1 for t2 to have happened. To do that, a smoothing distribution q(z_t1 | z_t2, b_t1, b_t2) is calculated. The notion behind this is that the forward prediction during training should be based on what did happen rather than unrealised possibilities. (2) is the transition distribution that projects the latent state forward into the future; it is the jumpy prediction:

  ℓ_jumpy = p(z_t2 | z_t1)   (2)

Finally, (3) is the KL divergence between the smoothed state and what could have been known at t1. Minimising this divergence indirectly trains the belief distribution to learn what the state of the world should be at t1 to anticipate it at t2:

  ℓ_smooth = KL(q(z_t1 | z_t2, b_t1, b_t2) || p(z_t1 | b_t1))   (3)

Combining these losses produces the overall loss:

  ℓ(t1, t2) = E[ log p_D(e_t2 | z_t2) + log p_B(z_t1 | b_t1) + log p_T(z_t2 | z_t1) − log p_B(z_t2 | b_t2) − log p_S(z_t1 | z_t2, b_t1, b_t2) ]   (4)

Training is performed by taking n samples randomly selected (default 100) for each batch using a linear sample of time steps between t+1 and t+k. This allows the model to learn to make jumpy predictions more than t+1 steps ahead. We use the hierarchical version of the TD-VAE model where layers are stacked so that each layer samples from those below it.
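Below is a minimal sketch of how the combined objective in (4) could be computed, assuming diagonal Gaussian heads for the belief (p_B), smoothing (p_S), transition (p_T), and decoder (p_D) networks. The module names, dimensions, and Gaussian parameterisation are illustrative assumptions rather than the paper's AllenNLP implementation, and the sign is flipped so the quantity can be minimised directly.

```python
# Minimal sketch of the combined TD-VAE objective in Eq. (4), negated for
# minimisation. GaussianHead, the layer sizes, and the diagonal-Normal
# parameterisation are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianHead(nn.Module):
    """Maps an input vector to a diagonal Gaussian distribution."""

    def __init__(self, in_dim: int, z_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.log_sigma = nn.Linear(in_dim, z_dim)

    def forward(self, x: torch.Tensor) -> Normal:
        return Normal(self.mu(x), self.log_sigma(x).exp())


def tdvae_loss(belief, smooth, transition, decoder, b_t1, b_t2, e_t2):
    """belief/smooth/transition/decoder are GaussianHead-like modules."""
    p_b2 = belief(b_t2)                                   # p_B(z_t2 | b_t2)
    z_t2 = p_b2.rsample()                                 # imagined future state
    q_s = smooth(torch.cat([z_t2, b_t1, b_t2], dim=-1))   # p_S(z_t1 | z_t2, b_t1, b_t2)
    z_t1 = q_s.rsample()
    p_b1 = belief(b_t1)                                   # p_B(z_t1 | b_t1)
    p_t = transition(z_t1)                                # p_T(z_t2 | z_t1)
    p_d = decoder(z_t2)                                   # p_D(e_t2 | z_t2)

    objective = (p_d.log_prob(e_t2).sum(-1)               # + log p_D(e_t2 | z_t2)
                 + p_b1.log_prob(z_t1).sum(-1)            # + log p_B(z_t1 | b_t1)
                 + p_t.log_prob(z_t2).sum(-1)             # + log p_T(z_t2 | z_t1)
                 - p_b2.log_prob(z_t2).sum(-1)            # - log p_B(z_t2 | b_t2)
                 - q_s.log_prob(z_t1).sum(-1))            # - log p_S(z_t1 | z_t2, b_t1, b_t2)
    return -objective.mean()                              # minimise the negative
```

In the full model, b_t1 and b_t2 would be the outputs of the stacked unidirectional LSTM over sentence embeddings described above, and the heads would be stacked hierarchically.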
2.3 Sentence Encoding

In the middle layer of the architecture (see Figure 2), we encode a sentence representation for each sentence in the batch. The sentence representation needs to be independent and detached (from gradient propagation) in the TD-VAE layer above; otherwise, TD-VAE will force the sentence representations to be degenerate.

The sentence representations are based primarily on Quick-Thoughts (Logeswaran and Lee, 2018), inspired by Skip-Thoughts (Kiros et al., 2015). For the sentence representations, let there be two encoded vectors from two functions u = f(s) and v = g(s), with s being the sequence of word tokens representing the sentence output by GPT-2. f(s) and g(s) can be any functions that can convert a sequence of inputs into a single vector. Unlike Quick-Thoughts, f(s) and g(s) are stacked autoregressive transformers (Vaswani et al., 2017) that use embeddings to encode position. The weights are not shared between the two encoders. To produce a single vector per sentence, u and v are concatenated across the embeddings from the GPT-2 layer. (Mean pooling was also experimented with but proved inferior and is not used in any reported results.)

For a batch of sentences, the loss for a given candidate sentence is the negative log probability of the dot product between f(s_ti) and g(s_tp), normalised by all the other sentences in the batch, indexed by i; see (5). For each sentence in the batch there are two correct candidates, the previous and following sentence, given by the target position s_tp in (5).

  ℓ_sent(s_ti, s_tp) = − log [ exp(f(s_ti) · g(s_tp)) / Σ_{b∈B} exp(f(s_ti) · g(s_tb)) ]   (5)

The intuition behind this loss is that it encourages vectors to be similar to their neighbours while dissimilar to sentences further away in the batch. Making representations in narratives more localised is desirable since it is the specific events described by a sentence that we want the TD-VAE model to learn to project forward to plan the plot. A custom sentence encoder is preferred to a model such as Sentence-BERT (Reimers and Gurevych, 2019), as it allows us to take advantage of the finetuning of the underlying language model.

To enrich the sentence representations, we finetune on the SNLI datasets (Bowman et al., 2015; Williams et al., 2018) and the common sense reasoning dataset ATOMIC (Sap et al., 2019). As per InferSent (Conneau et al., 2017), we concatenate the sentence embedding vectors together, apply a single projection layer W, softmax(W[u; v; |u − v|]), and train using binary cross-entropy. The complete sentence loss is ℓ_sent + ℓ_snli + ℓ_atomic. The motivation for this finetuning is that supervised tasks have been shown to improve sentence representations (Cer et al., 2018; Reimers and Gurevych, 2019), and both entailment and common sense inference are relevant to the story domain.
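As an illustration of (5), the sketch below computes all pairwise dot products f(s_i) · g(s_b) for a batch and applies a cross-entropy loss over them, which is the negative log-softmax of (5). For simplicity only the following sentence is used as the positive target, and the encoder outputs are random stand-in tensors; this is not the paper's exact implementation.

```python
# Illustrative sketch of the sentence loss in Eq. (5): each f(s_i) should score
# its neighbouring target g(s_p) higher than the other sentences in the batch.
# u and v are random stand-ins for the two transformer encoders' outputs.
import torch
import torch.nn.functional as F


def sentence_loss(u: torch.Tensor, v: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    scores = u @ v.t()                       # dot products f(s_i) . g(s_b) for all pairs
    return F.cross_entropy(scores, targets)  # negative log-softmax over the batch, as in Eq. (5)


batch, dim = 8, 1024
u, v = torch.randn(batch, dim), torch.randn(batch, dim)
targets = torch.arange(1, batch + 1).clamp(max=batch - 1)  # index of the following sentence (toy)
loss = sentence_loss(u, v, targets)
```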
2.4 Language Model

The language model is Generative Pre-Training 2 (GPT-2; Radford et al. 2019), using the medium model with 345 million parameters as a base. For text generation, an auto-regressive language model is preferred to a bidirectional and masked one, such as BERT (Devlin et al., 2019). The model is fine-tuned using the original LM loss in (6) on the story datasets, where w_j is a particular encoded word piece in a passage of text.

  ℓ_lm = − Σ_j log P(w_j | w_{j−1}, w_{j−2}, ...)   (6)

Training GPT-2 works best on longer blocks of text, whereas the hierarchical model, outlined later, encodes sentences as vectors and trains predictions based on the whole story. To train GPT-2 efficiently, the language model training is run distinct from the hierarchical training, with the training process alternating per batch between the two. This allows GPT-2 to be fine-tuned with a longer context of 1024 word tokens.

2.5 Conditional Generation

Sentence vectors e_t+n for future states are calculated by sampling forward with the TD-VAE model via the state space z_t of the model and reconstructing a vector in the sentence embedding space e_t. To condition text generation on predicted sentence vectors, the e_t sentence representations are projected into the GPT-2 transformer space. A hidden space h_mem is created by applying a weight matrix W_m to e_t, with W_m ∈ R^(L·H·P), where P is the dimensionality of the sentence vector TD-VAE projections e_t, L is the number of layers in the GPT-2 transformer, and H is the size of the transformer hidden state. The hidden state is shared across transformer heads; this method is Pseudo Self-Attention (Ziegler et al., 2019). The approach allows the hidden state to be prefixed directly onto the GPT-2 hidden state when generating, or spliced into it when generating longer sequences. Thus the latent state can be used to control text generation. With models that use conditional generation, there is an additional training step. For alternating training iterations, the e_t sentence representations are projected into the GPT vector space as described, concatenated before the GPT representation, and then fine-tuned using the LM objective. Thus the training process learns to use the latent vectors to improve text generation directly.
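The sketch below shows the shape of the conditioning projection described above: a sentence vector of dimensionality P is mapped by a single weight matrix to one hidden vector per GPT-2 layer (L layers of hidden size H), which could then be prefixed onto the transformer's hidden-state history. The dimensions (GPT-2 medium: 24 layers, hidden size 1024) and the plain-tensor output are illustrative assumptions; the real model splices this memory into GPT-2's attention in the manner of Pseudo Self-Attention rather than returning a tensor.

```python
# Sketch of the conditioning projection: a sentence vector of size P is mapped
# to one hidden vector per GPT-2 layer (L layers of hidden size H). Dimensions
# follow GPT-2 medium but are assumptions about the paper's setup.
import torch
import torch.nn as nn


class LatentToHistory(nn.Module):
    def __init__(self, sent_dim: int = 1024, n_layers: int = 24, hidden: int = 1024):
        super().__init__()
        self.n_layers, self.hidden = n_layers, hidden
        self.proj = nn.Linear(sent_dim, n_layers * hidden)  # W_m: R^P -> R^(L*H)

    def forward(self, e_t: torch.Tensor) -> torch.Tensor:
        # (batch, P) -> (batch, L, 1, H): a one-token prefix for each layer
        return self.proj(e_t).view(-1, self.n_layers, 1, self.hidden)


h_mem = LatentToHistory()(torch.randn(2, 1024))  # shape (2, 24, 1, 1024)
```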
2.6 Datasets

WritingPrompts (Fan et al., 2018) is a dataset that consists of circa 300k short stories of circa 50 sentences in length from a Reddit creative writing group. Only WritingPrompts is used for evaluation in the following experiments, whereas the following datasets are used during training: the CMU Books (Bamman and Smith, 2013) and Movies Corpora (Bamman et al., 2013), which are summaries of whole novels and movies, and the CBT dataset (Hill et al., 2016) of children's books. Longer story forms may also provide improvements over short stories, so the full processed text from Schmoop (Chaudhury et al., 2019), the Books Corpus (Zhu et al., 2015), and the Film Corpus (Lin and Walker, 2011) are also used. In practice, models trained on multiple datasets performed better on automatic evaluation than models trained on a single dataset, and so only these results are presented in the four experiments. For training, we use the existing dataset split for WritingPrompts.

We train models by sampling batches from each dataset in proportion to the dataset size. During training, batches of four blocks of 100 sentences for each story were used. For the TD-VAE training, per block the model randomly samples 200 example states to reconstruct, where the target state is randomly sampled between t+1 and t+5 sentences ahead using a linear sampling scheme. SGD with Nesterov momentum was used to train the model, with a learning rate of 0.01 and momentum of 0.9. Models were trained with 10,000 batches per epoch (note this is not the whole dataset), with early stopping once the model failed to improve for three epochs, and the learning rate is halved after every epoch without improvement.

2.7 Implementation Details

The models are implemented on the AllenNLP framework (Gardner et al., 2018) on Pytorch (Paszke et al., 2019). The final evaluated models used the finetuned GPT-2 medium model for the LM layer, six layers of transformer with a 1024 embedding size (matching GPT-2) for the sentence encoding layer, and top layers for the respective models: LSTM/Transformer/TD-VAE. The primary TD-VAE model has 485M tunable parameters, of which circa 345M are from GPT-2.
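As a rough illustration of the training setup described above, the target offset sampling and optimiser might look as follows. The "linear sampling scheme" is read here as linearly decreasing weights over offsets t+1 to t+5, and the placeholder model is an assumption; the paper's code may implement both differently.

```python
# Rough sketch of the jump-offset sampling and optimiser settings; the exact
# sampling scheme and model are assumptions.
import random
import torch


def sample_jump(max_jump: int = 5) -> int:
    # Offset t+1 is most likely, t+5 least likely (one reading of "linear sampling").
    weights = [max_jump - k + 1 for k in range(1, max_jump + 1)]
    return random.choices(range(1, max_jump + 1), weights=weights, k=1)[0]


model = torch.nn.Linear(4, 4)  # placeholder for the hierarchical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
# Halve the learning rate after an epoch without improvement (call scheduler.step(val_metric)).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)
```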
Figure 3: For conditional text generation, the expected e_t2 vector is the TD-VAE inferred state for the following sentence. It is concatenated with the existing prompt, and sentences are sampled from the conditioned vector. This happens iteratively as part of a beam search (beam 10) where the most likely sentences are kept based on how close they are to the expected e_t2 vectors using cosine similarity. The final single story is the most likely path through the beam search.

2.8 Planning Generation

Baseline text generation uses top-p sampling (or nucleus sampling) from Holtzman et al. (2020) on the underlying GPT-2 model. This approach restricts the sampled tokens in generation to the top cumulative probability distribution tokens. Holtzman et al. found top-p sampling generates sentences with better coherence and diversity than top-k sampling methods (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019). The default top-p threshold in our experiments is 0.925.

Most of the discrete plan-based methods generate text in two phases; the first generates a discrete plan for the stories, the second realises the generated text from the plan. Figure 3 illustrates how this can be extended with TD-VAE and latent planning. Given a prompt, TD-VAE can project forward the most likely sentence vectors to n steps in the future. These vectors can then be conditioned on using the projection layer described, by splicing the projected vector into the hidden state history of the transformer. The advantage is that the latent state information is incorporated as a natural part of the GPT-2 history, and so architectural changes are not required for GPT. Most planning approaches generate a template for the whole story in advance, but this implementation is incremental: TD-VAE projects forward over n sentences incrementally, choosing the next sentence and then projecting forward again.

In addition to conditioning on these latent vectors, we employ a beam search (k = 10 for the evaluated stories) reranking approach that generates n candidate sentences (100 per time step for the evaluated stories) and then reranks them by their cosine similarity with the reconstruction from the latent state z, minimising the reconstruction error. For text generation, both conditioning and reranking can be applied independently. However, in practice, conditioning worked better when combined with sampling multiple candidate sentences and reranking. Only stories generated with either reranking or reranking with conditioning are judged in the human evaluation.
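A simplified sketch of the generate-and-rerank step described above: candidate sentences are sampled from GPT-2 under a top-p filter, then scored by cosine similarity against the sentence embedding TD-VAE expects next. The helper callables (the sentence sampler and encoder) are stand-ins, the filter operates on a single logits vector, and the full system additionally keeps the k best paths over the story in a beam search.

```python
# Sketch of generate-and-rerank: sample candidates under a top-p filter, then
# keep those closest (cosine similarity) to the TD-VAE-predicted embedding.
# encode_sentence is a stand-in callable for the sentence encoder.
import torch
import torch.nn.functional as F


def top_p_filter(logits: torch.Tensor, top_p: float = 0.925) -> torch.Tensor:
    """Mask out the tail of a 1-D logits vector beyond cumulative probability top_p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p
    remove[1:] = remove[:-1].clone()   # shift so the first token above the threshold is kept
    remove[0] = False
    filtered = logits.clone()
    filtered[sorted_idx[remove]] = float("-inf")
    return filtered


def rerank(candidates, encode_sentence, expected_embedding: torch.Tensor, beam: int = 10):
    """Order sampled candidate sentences by similarity to the expected embedding."""
    scores = torch.stack([F.cosine_similarity(encode_sentence(c), expected_embedding, dim=-1)
                          for c in candidates])
    order = torch.argsort(scores, descending=True)
    return [candidates[int(i)] for i in order[:beam]]
```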
2.9 Baselines

The first baseline is the GPT-2 model without the hierarchical layers above it. Additionally, two other architectures have been used with the same hierarchical setup but using an LSTM (Hochreiter and Schmidhuber, 1997) or Transformer (Vaswani et al., 2017) discriminator instead of TD-VAE at the story level. These models are trained to discriminate between the correct following sentence and all the other sentences in the batch by maximising the correct dot product versus the incorrect ones between the sentence representation and the encoded vector state of the LSTM or the transformer. The loss is the same as the sentence loss in (5), except that the concatenated e_t vector is used with the LSTM or Transformer state vector as u and v. This is the same loss as used by Wilmot and Keller (2020) for modelling story suspense. Story generation is performed by sampling n sentences using GPT and reranking for the following sentence using the softmax over the vector dot products of the candidates as a probability distribution.
3 Experiments

3.1 Automatic Structural Evaluation

BLEU (Xu et al., 2018) and similar metrics are problematic for evaluating story generation, as it is unlikely there will be much overlap between well-written stories, even if they share a common prompt. ROC cloze endings have been widely used for evaluation (Schwartz et al., 2017), and we also use cloze and swapping for automated structural evaluations. For swap, we randomly swap two sentences in each story. For mutation, we randomly sample a sentence generated by GPT-2 at that point in the story, conditioned on the story up to that point; the rationale is that just randomly selecting a cloze alternative sentence makes it too easy. There are two versions of the task: in the easy version, the task is to distinguish the original from the modified version of the story. For the harder version, the task is to identify which of the sentences have been modified (mutated or swapped) by determining if they are in the top K least likely candidates across the whole story, with varying K values reported.
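For concreteness, the perturbed evaluation instances described above could be constructed along the following lines, treating a story as a list of sentences, with sample_sentence standing in for the GPT-2 sampler conditioned on the story prefix. This is an illustrative reading of the setup, not the exact evaluation code.

```python
# Illustrative construction of the swap and mutation evaluation instances;
# sample_sentence stands in for the GPT-2 sampler conditioned on a prefix.
import random
from typing import Callable, List


def swap_two(story: List[str], rng: random.Random) -> List[str]:
    i, j = rng.sample(range(len(story)), 2)   # pick two distinct sentence positions
    perturbed = list(story)
    perturbed[i], perturbed[j] = perturbed[j], perturbed[i]
    return perturbed


def mutate_one(story: List[str], sample_sentence: Callable[[List[str]], str],
               rng: random.Random) -> List[str]:
    i = rng.randrange(len(story))
    perturbed = list(story)
    perturbed[i] = sample_sentence(story[:i])  # replace with a sampled continuation of the prefix
    return perturbed
```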
The results are given in Table 1. Random results are obtained by randomly choosing the K most likely sentences. The evaluation is on 400 random stories of length between 25 and 75 sentences from the WritingPrompts test set. The LM model's results are calculated using the perplexity of a two-sentence sliding window. Each of the other models' probabilities is based on the softmax of a distance metric; therefore, the mutated or swapped sentence should be the furthest away from expectations using cosine distance. For the hierarchical LSTM, the transformer discriminator is used to predict the correct answer as per conditional generation, and for the LM, lower perplexity is used.

The easy story cloze task has strong results of greater than 0.9 for most LSTM, transformer and TD-VAE models in telling the original from the modified story, so this task is not that useful. The language model (LM) performs much worse in identifying the original story, so all the hierarchical models are improved. This is perhaps not surprising with cloze, as sampling from the same LM makes it unlikely this will perform well in detecting its own sampled sentences from the original. However, the comparison with the hierarchical models is tougher: for the hard task, at K-1, K-5, and K-10, the TD-VAE model shows a clear improvement over both the transformer and LSTM models on the swap and cloze tasks. This shows that TD-VAE is able to infer coherence well and is an improvement on the baseline hierarchical models.

3.2 Generated Story Evaluation

Inferring coherence is quite different from the model's influence on story generation, which we evaluate using human judgements on MTurk. We take the writing prompt and the first 20 sentences for 50 stories. For each of these stories, we generate continuations of 5 sentences. We evaluate using these long contexts because some initial tests of generating long stories from just the writing prompt found that the stories generated by the different models varied so much that it was hard for the annotators to compare them reliably. A shorter extension to a longer context also requires the model to have a strong understanding of the long context to generate a coherent continuation.

Five models are evaluated: Gold, the actual human-written story from the dataset; LM, the fine-tuned GPT-2 medium model; LSTM Reranking, which generates multiple sentences using GPT and reranks and selects the most likely candidate using a beam search (LSTM Reranking is used instead of the transformer model because the automated benchmarks are similar, so evaluating both would be superfluous); TD-VAE Reranking, which uses beam search like the LSTM Reranking model but based on TD-VAE; and TD-VAE Conditioning, which differs in that it directly conditions on the expected latent vectors. So the TD-VAE expectations change the text generated, unlike the Reranking model, which only filters out less suitable sentences generated by GPT-2. Mechanical Turk workers are asked to rank the five options from best to worst using three criteria: Overall (a subjective judgement of their overall impression of the story), Coherence (internal consistency of characters, places, actions and plot development), and Fluency (how well written the story is, including its use of expressive language). Additionally, MTurk workers were asked to summarise the story and give reasons for their overall choice. This served to filter poor quality submissions and to give insight into the factors they considered important. Five annotators judged each story after screening out and re-collecting low-quality annotations. Overall inter-annotator agreement for the task is 0.35 using Krippendorff's alpha; this is fairly low, but the task is inherently subjective.
Table 1: Results of the hard cloze and swap tasks for models trained on all datasets. Confidence intervals at 0.05. Averages are means across all the other tasks.

Task     Model    Easy ± (CI)   K-1 ± (CI)    K-5 ± (CI)    K-10 ± (CI)   Avg ± (CI)
Swap 2   Rand     .500 (.050)   .037 (.020)   .190 (.040)   .377 (.049)   .276 (.045)
         LM       .545 (.050)   .045 (.022)   .239 (.043)   .473 (.050)   .326 (.047)
         LSTM     .946 (.024)   .035 (.019)   .226 (.042)   .441 (.050)   .412 (.049)
         TRANS    .953 (.022)   .058 (.024)   .238 (.041)   .427 (.049)   .419 (.049)
         TD-VAE   .979 (.000)   .058 (.024)   .279 (.045)   .496 (.050)   .453 (.050)
Mut 1    Rand     .500 (.050)   .018 (.014)   .094 (.030)   .188 (.039)   .200 (.040)
         LM       .640 (.048)   .026 (.017)   .121 (.033)   .255 (.044)   .261 (.044)
         LSTM     .932 (.026)   .106 (.031)   .187 (.039)   .303 (.046)   .382 (.048)
         TRANS    .966 (.019)   .020 (.015)   .217 (.042)   .379 (.049)   .396 (.049)
         TD-VAE   .931 (.026)   .134 (.035)   .356 (.048)   .556 (.050)   .494 (.050)

Figure 4: Continuation story generation results. Bar markers split into groups at p < 0.05 significance using a pairwise t-test. The exception is LM fluency, marked with a star, which is statistically the same as LSTM R and TD-VAE R.

Figure 4 shows how the models were ranked by annotators on a scale of 1 to 5, i.e., lower is better. There isn't much difference between the different questions (overall, coherence, and fluency), except that fluency is relatively better for the base GPT-2 language model than overall and coherence. The gold standard stories are better than all model-generated continuations. The two reranking models, LSTM and TD-VAE R, are significantly better than the base LM model, which means reranking improves the story over straight sampling. The TD-VAE C model that uses the planning mechanism doesn't improve on the base LM, and its fluency is worse.

The main question is why TD-VAE Conditioning can perform well on automated evaluation (cloze and swap) but not improve story generation compared to the LSTM-based reranking model. Inspection of the stories used for evaluation suggests that the conditioning model is more repetitive than other models and uses more repeated linguistic constructs in continuations. We analysed the linguistic diversity of the generated continuations by counting unique nouns and verbs (see Table 2). We found that the TD-VAE C model has far less diversity than any other model and differs markedly from the human-written stories. For both TD-VAE models, the number of unique coreferences is lower than for other models. In contrast, the coreference chain length is similar, implying that the TD-VAE models are more repetitive when generating coreferences and are less likely to introduce new entities, again reducing text diversity. On the other hand, TD-VAE R outperforms both the LM and LSTM models in terms of noun and verb diversity. The reduction can be seen in the appendix examples in A.2.
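A sketch of how the noun and verb diversity statistics reported in Table 2 could be computed. NLTK tagging and Porter stemming are used purely as an illustration, since the paper does not specify the tooling; the coreference statistics would additionally require a coreference resolver.

```python
# Illustrative computation of unique nouns and verbs (after stemming) per 100
# tokens; requires the NLTK "punkt" and POS tagger data to be downloaded.
from nltk import pos_tag, word_tokenize
from nltk.stem import PorterStemmer


def unique_nouns_verbs_per_100(text: str):
    stemmer = PorterStemmer()
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    nouns = {stemmer.stem(w.lower()) for w, t in tagged if t.startswith("NN")}
    verbs = {stemmer.stem(w.lower()) for w, t in tagged if t.startswith("VB")}
    scale = 100.0 / max(len(tokens), 1)
    return len(nouns) * scale, len(verbs) * scale
```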
Table 2: Statistics for the average number of unique nouns and verbs (after stemming) and coreferences per 100 tokens, and the length of coreference mention chains per story. Meteor and BLEU are measured against the Gold stories. These stats are on a more extensive set of 400 stories and text of up to 2000 characters per generation.

Model      Nouns   Verbs   Coreferences   Coreference Chains   Meteor   BLEU
Gold       4.08    2.04    2.23           4.17                 -        -
LM         2.48    1.29    2.06           4.69                 .101     .628
LSTM       2.59    1.21    2.18           5.02                 .107     .619
TD-VAE R   3.19    1.41    1.61           4.00                 .090     .705
TD-VAE C   2.07    .885    1.65           3.99                 .083     .639

For completeness, we also report Meteor and BLEU scores (also in Table 2); we find the TD-VAE models outperform the LM baseline and the LSTM-based hierarchical model in terms of BLEU, while the pattern is reversed for Meteor. We have already pointed out the problem with using these metrics for story evaluation.

If a planning mechanism works, we would expect it to increase lexical diversity, as the planning would condition the LM to explore new areas and develop the plot. In contrast, it would appear the opposite is happening here. It has been noted by Welleck et al. (2020), amongst others, that large language models can produce dull and repetitive text and often copy from the context. The failure of the conditioning model could be an analogue, where TD-VAE can train well by expecting the future text to be similar in topic and style to the text seen. This would lead conditioning on the latent vectors to produce more dull and repetitive text than the LM on its own, which appears to be the case for TD-VAE C. There seems to be a clear split between TD-VAE used in a ranking model and TD-VAE conditioned on. The strong automated performance and the improvement on the base LM in human evaluation show that TD-VAE R can determine coherent continuations and improve story generation. However, when conditioned on, it produces text more similar to the existing text, not interesting and eventful plots. Part of this could come from the richness of the latent representations: in a keyword planning system, the plot skeleton conditioned on is quite loose, for example from Wang et al. (2020): lake → friends → water → swim → shore. This allows many possible realisations and is not that restrictive of the LM. In contrast, sentence embeddings represent many more linguistic and semantic properties, so learning to condition on them can cause generated text to be repetitive and limited.

4 Conclusion

Overall this paper has shown that the TD-VAE model can outperform alternative LSTM and transformer models on automatic coherence evaluation. When used for reranking, we also find that the TD-VAE model can improve the text generation of a baseline LM. However, the conditioning TD-VAE model does not improve on the baseline because the latent vector conditioning reduces the diversity of the text generated, which is crucial for exciting storytelling. While the ranking model shows some improvement and the models show they can improve comprehension, latent planning did not work as intended to produce better stories. If it had, then further comparison with state-of-the-art multi-stage planning or newer VAE text generation models would have been warranted with human evaluation. A warning to emphasise from this paper is that improvements in technical measures such as BLEU or cloze can worsen generated text. Guan and Huang (2020) and Guan et al. (2021) have found many of these automated measures, including ROUGE, Perplexity for stories, and more sophisticated ones such as BERT-Score, correlate poorly with human judgement.

In narrative text, there are continual shifts in the text in terms of topic, events, characters, and places that are hard to anticipate from sentence to sentence. Simply inferring coherence is not enough; instead, the generation process must push the story in new directions while remaining coherent with what has gone before. The challenge for future work is to adapt latent state models to focus on plot progression rather than keeping the benefits of having a model self-learn the most salient representations. One avenue for this may be the success of recent discrete VQ-VAE models (Razavi et al., 2019; Van den Oord et al., 2017). Just as SRL or keyword-based planning forces a simplified representation, VQ-
VAE or similar discrete models may force a sim- Gretchen Krueger, Tom Henighan, Rewon Child, plified representation. This could make learning Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, plot progression more straightforward and hence Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin improve the quality of generated stories. Chess, Jack Clark, Christopher Berner, Sam Mc- Candlish, Alec Radford, Ilya Sutskever, and Dario Acknowledgments Amodei. 2020. Language models are few-shot learn- ers. We like to thank the Mechanical Turk workers who evaluated the generated stories. Wilmot’s work is Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Con- funded by an EPSRC doctoral training award. stant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Ethics Statement Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175. The evaluation study reported in Section 3.2 in- volved participants recruited through Amazon Me- Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, and Sanja Fidler. 2019. The shmoop corpus: chanical Turk. All participants gave informed A dataset of stories with loosely aligned summaries. consent, received appropriate compensation, and CoRR, abs/1912.13082. were informed that participation was voluntary. Some texts could contain mildly offensive lan- Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. guage, which participants were also informed about A recurrent latent variable model for sequential data. before taking part. The study had received prior In Advances in Neural Information Processing Sys- approval by the ethics panel of the School of Infor- tems, volume 28. Curran Associates, Inc. matics at the University of Edinburgh. Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representa- tions. In Proceedings of the Eleventh International References Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language H.P. Abbott. 2008. The Cambridge Introduction to Nar- Resources Association (ELRA). rative. Cambridge Introductions to Literature. Cam- bridge University Press. Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised Prithviraj Ammanabrolu, Ethan Tien, Wesley Cheung, learning of universal sentence representations from Zhaochen Luo, William Ma, Lara J. Martin, and natural language inference data. In Proceedings of Mark O. Riedl. 2020. Story realization: Expanding the 2017 Conference on Empirical Methods in Nat- plot events into sentences. Proceedings of the AAAI ural Language Processing, pages 670–680, Copen- Conference on Artificial Intelligence, 34(05):7375– hagen, Denmark. Association for Computational 7382. Linguistics. David Bamman, Brendan O’Connor, and Noah A. Alexis Conneau, German Kruszewski, Guillaume Lam- Smith. 2013. Learning latent personas of film char- ple, Loïc Barrault, and Marco Baroni. 2018. What acters. In Proceedings of the 51st Annual Meeting of you can cram into a single $&!#* vector: Probing the Association for Computational Linguistics (Vol- sentence embeddings for linguistic properties. In ume 1: Long Papers), pages 352–361, Sofia, Bul- Proceedings of the 56th Annual Meeting of the As- garia. Association for Computational Linguistics. 
sociation for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Aus- David Bamman and Noah A. Smith. 2013. New align- tralia. Association for Computational Linguistics. ment methods for discriminative book summariza- tion. CoRR, abs/1305.1319. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Samuel R. Bowman, Gabor Angeli, Christopher Potts, deep bidirectional transformers for language under- and Christopher D. Manning. 2015. A large anno- standing. In Proceedings of the 2019 Conference tated corpus for learning natural language inference. of the North American Chapter of the Association In Proceedings of the 2015 Conference on Empiri- for Computational Linguistics: Human Language cal Methods in Natural Language Processing, pages Technologies, Volume 1 (Long and Short Papers), 632–642, Lisbon, Portugal. Association for Compu- pages 4171–4186, Minneapolis, Minnesota. Associ- tational Linguistics. ation for Computational Linguistics. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Andreas Doerr, Christian Daniel, Martin Schiegg, Duy Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Nguyen-Tuong, Stefan Schaal, Marie-Eve Toussaint, Neelakantan, Pranav Shyam, Girish Sastry, Amanda and Sebastian Trimpe. 2018. Probabilistic recurrent Askell, Sandhini Agarwal, Ariel Herbert-Voss, state-space models. In ICML. 10
Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hi- David Ha and Jürgen Schmidhuber. 2018. Recurrent erarchical neural story generation. In Proceedings world models facilitate policy evolution. In Ad- of the 56th Annual Meeting of the Association for vances in Neural Information Processing Systems, Computational Linguistics (Volume 1: Long Papers), volume 31. Curran Associates, Inc. pages 889–898, Melbourne, Australia. Association for Computational Linguistics. Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The goldilocks principle: Reading Angela Fan, Mike Lewis, and Yann Dauphin. 2019. children’s books with explicit memory representa- Strategies for structuring story generation. In Pro- tions. In ICLR. ceedings of the 57th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 2650– Sepp Hochreiter and Jürgen Schmidhuber. 1997. 2660, Florence, Italy. Association for Computa- Long short-term memory. Neural Computation, tional Linguistics. 9(8):1735–1780. Matt Gardner, Joel Grus, Mark Neumann, Oyvind Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Pe- Yejin Choi. 2020. The curious case of neural text de- ters, Michael Schmitz, and Luke Zettlemoyer. 2018. generation. In International Conference on Learn- AllenNLP: A deep semantic natural language pro- ing Representations. cessing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1– Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine 6, Melbourne, Australia. Association for Computa- Bosselut, David Golub, and Yejin Choi. 2018. tional Linguistics. Learning to write with cooperative discriminators. In ACL. Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Armand Joulin, Edouard Grave, Piotr Bojanowski, and Weischedel, and Nanyun Peng. 2020. Content plan- Tomas Mikolov. 2017. Bag of tricks for efficient ning for neural story generation with aristotelian text classification. In Proceedings of the 15th Con- rescoring. In Proceedings of the 2020 Conference ference of the European Chapter of the Association on Empirical Methods in Natural Language Process- for Computational Linguistics: Volume 2, Short Pa- ing (EMNLP), pages 4319–4338, Online. Associa- pers, pages 427–431. Association for Computational tion for Computational Linguistics. Linguistics. Seraphina Goldfarb-Tarrant, Haining Feng, and Taesup Kim, Sungjin Ahn, and Yoshua Bengio. 2019. Nanyun Peng. 2019. Plan, write, and revise: an Variational temporal abstraction. In Advances in interactive system for open-domain story generation. Neural Information Processing Systems, volume 32. In Proceedings of the 2019 Conference of the Curran Associates, Inc. North American Chapter of the Association for Computational Linguistics (Demonstrations), pages Diederik P. Kingma and Max Welling. 2014. Auto- 89–97. encoding variational bayes. CoRR, abs/1312.6114. Alex Graves. 2013. Generating sequences with recur- Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, rent neural networks. CoRR, abs/1308.0850. Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Karol Gregor, George Papamakarios, Frederic Besse, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, Lars Buesing, and Theophane Weber. 2019. Tempo- and R. Garnett, editors, Advances in Neural Informa- ral difference variational auto-encoder. In Interna- tion Processing Systems 28, pages 3294–3302. Cur- tional Conference on Learning Representations. ran Associates, Inc. Jian Guan and Minlie Huang. 
2020. UNION: An Un- Daphne Koller and Nir Friedman. 2009. Probabilistic referenced Metric for Evaluating Open-ended Story graphical models: principles and techniques. MIT Generation. In Proceedings of the 2020 Conference press. on Empirical Methods in Natural Language Process- ing (EMNLP), pages 9157–9166, Online. Associa- Boyang Li, Stephen Lee-Urban, George Johnston, and tion for Computational Linguistics. Mark O. Riedl. 2013. Story generation with crowd- sourced plot graphs. In Proceedings of the Twenty- Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu, Seventh AAAI Conference on Artificial Intelligence, Wenbiao Ding, Xiaoxi Mao, Changjie Fan, and AAAI’13, page 598–604. AAAI Press. Minlie Huang. 2021. OpenMEVA: A benchmark for evaluating open-ended story generation metrics. Chunyuan Li, Xiang Gao, Yuan Li, Baolin Peng, Xiu- In Proceedings of the 59th Annual Meeting of the jun Li, Yizhe Zhang, and Jianfeng Gao. 2020. Opti- Association for Computational Linguistics and the mus: Organizing sentences via pre-trained modeling 11th International Joint Conference on Natural Lan- of a latent space. In Proceedings of the 2020 Con- guage Processing (Volume 1: Long Papers), pages ference on Empirical Methods in Natural Language 6394–6407, Online. Association for Computational Processing (EMNLP), pages 4678–4699, Online. As- Linguistics. sociation for Computational Linguistics. 11
Juntao Li, Lidong Bing, L. Qiu, Dongmin Chen, Atomic: An atlas of machine commonsense for if- Dongyan Zhao, and Rui Yan. 2019. Learning to then reasoning. In AAAI. write stories with thematic consistency and wording novelty. In AAAI. Wolf Schmid. 2003. Narrativity and eventfulness. What is narratology, pages 17–33. Grace I. Lin and Marilyn A. Walker. 2011. All the world’s a stage: Learning character models from Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila film. In AIIDE. Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A Lajanugen Logeswaran and Honglak Lee. 2018. An ef- case study of the ROC story cloze task. In Proceed- ficient framework for learning sentence representa- ings of the 21st Conference on Computational Natu- tions. In International Conference on Learning Rep- ral Language Learning (CoNLL 2017), pages 15–25, resentations. Vancouver, Canada. Association for Computational Linguistics. Iu. M. Lotman, G. Lenhoff, and R. Vroon. 1977. The structure of the artistic text. Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Lara J. Martin, Prithviraj Ammanabrolu, William Han- Yerukola, and Christopher D. Manning. 2019. Do cock, Shruti Singh, Brent Harrison, and Mark O. massively pretrained language models make better Riedl. 2018. Event representations for automated storytellers? In CoNLL. story generation with deep neural nets. In AAAI. Iulian Serban, Alessandro Sordoni, Ryan Lowe, Lau- Adam Paszke, Sam Gross, Francisco Massa, Adam rent Charlin, Joelle Pineau, Aaron C. Courville, and Lerer, James Bradbury, Gregory Chanan, Trevor Yoshua Bengio. 2017. A hierarchical latent variable Killeen, Zeming Lin, Natalia Gimelshein, Luca encoder-decoder model for generating dialogues. In Antiga, Alban Desmaison, Andreas Kopf, Edward AAAI. Yang, Zachary DeVito, Martin Raison, Alykhan Te- jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Charlie Tang and Russ R Salakhutdinov. 2013. Learn- Junjie Bai, and Soumith Chintala. 2019. Py- ing stochastic feedforward neural networks. In Ad- torch: An imperative style, high-performance deep vances in Neural Information Processing Systems, learning library. In H. Wallach, H. Larochelle, volume 26. Curran Associates, Inc. A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Gar- nett, editors, Advances in Neural Information Pro- Aaron Van den Oord, Oriol Vinyals, and koray cessing Systems 32, pages 8024–8035. Curran Asso- kavukcuoglu. 2017. Neural discrete representation ciates, Inc. learning. In Advances in Neural Information Pro- cessing Systems, volume 30. Curran Associates, Inc. Alec Radford. 2018. Improving language understand- ing by generative pre-training. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Kaiser, and Illia Polosukhin. 2017. Attention is all Dario Amodei, and Ilya Sutskever. 2019. Language you need. In Advances in Neural Information Pro- models are unsupervised multitask learners. OpenAI cessing Systems, volume 30. Curran Associates, Inc. Blog, 1(8):9. Lin Wang, Juntao Li, Rui Yan, and Dongyan Zhao. Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. 2020. Plan-CVAE: A planning-based conditional 2019. Generating diverse high-fidelity images with variational autoencoder for story generation. In vq-vae-2. In NeurIPS. Proceedings of the 19th Chinese National Confer- ence on Computational Linguistics, pages 892–902, Nils Reimers and Iryna Gurevych. 2019. Sentence- Haikou, China. 
Chinese Information Processing So- BERT: Sentence embeddings using Siamese BERT- ciety of China. networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Di- and the 9th International Joint Conference on Natu- nan, Kyunghyun Cho, and Jason Weston. 2020. Neu- ral Language Processing (EMNLP-IJCNLP), pages ral text generation with unlikelihood training. In 3982–3992, Hong Kong, China. Association for International Conference on Learning Representa- Computational Linguistics. tions. Melissa Roemmele. 2016. Writing stories with help from recurrent neural networks. In AAAI. Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sen- Marie-Laure Ryan. 2014. Possible worlds. De tence understanding through inference. In Proceed- Gruyter. ings of the 2018 Conference of the North American Chapter of the Association for Computational Lin- Maarten Sap, Ronan Le Bras, Emily Allaway, Chan- guistics: Human Language Technologies, Volume dra Bhagavatula, Nicholas Lourie, Hannah Rashkin, 1 (Long Papers), pages 1112–1122. Association for Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. Computational Linguistics. 12
David Wilmot and Frank Keller. 2020. Modelling sus- pense in short stories as uncertainty reduction over neural representation. In Proceedings of the 58th An- nual Meeting of the Association for Computational Linguistics, pages 1763–1788, Online. Association for Computational Linguistics. Jingjing Xu, Xuancheng Ren, Yi Zhang, Qi Zeng, Xi- aoyan Cai, and Xu Sun. 2018. A skeleton-based model for promoting coherence among sentences in narrative story generation. In Proceedings of the 2018 Conference on Empirical Methods in Natu- ral Language Processing, pages 4306–4315, Brus- sels, Belgium. Association for Computational Lin- guistics. Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan- and-write: Towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial In- telligence, volume 33, pages 7378–7385. Zhuosheng Zhang, Yu-Wei Wu, Zhao Hai, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiaodong Zhou. 2020. Semantics-aware bert for language under- standing. In AAAI. Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV), pages 19– 27. Zachary M. Ziegler, Luke Melas-Kyriazi, Sebastian Gehrmann, and Alexander M. Rush. 2019. Encoder- agnostic adaptation for conditional language genera- tion. CoRR, abs/1908.06938. 13
A Story Continuation Evaluation Prompt [ WP ] A woman investigates her hus- band ’s murder ... while receiving insufferable A.1 Instructions vague help from his ghost . “ Would you like a The following are the instructions provided to the blanket Mrs . Everson ? ” the police officer asked MTurk Annotators: in a calm and collected voice . “ No , im fine thank Read a writing prompt and the first part of a you ” she responded , holding back her tears . The story. Then read 5 different continuations of the police officer nodded and made his way over to the story. Rank the story continuations from 1 (Best) group of officers surrounding the body on the floor . to 5 (Worst) for a set of criteria. Judgements will Mrs. Everson sat , quietly waiting for the next offi- be subjective but try to judge each according to cer to ask her a question , “ leave ” a voice appeared the criteria given. You are only ranking how well in her ear as if emanating from the wall behind her . each continuation follows on from the story, not A shiver ran down her spine and she quickly turned the story itself. You may do multiple HITS with to see the blank wall , as she looked back a tall different stories but please do no more than 20 in man stood before her , “ Mrs. Eve Everson ? ” the total. Estimated time is 10 minutes per HIT. The man requested . “ Yes ” she responded looking up prompts often begin "[WP]" which is just a con- through watery eyes . The man ran through the vention used on the forum the stories are collected standards , he introduced himself as the local DI , from. Rank the stories according to the following and began to present her with a series of questions criteria: . “ Did your husband have any enemies Mrs . Ev- erson ? ” . “ No ” She responded quickly , “ yes ” • Coherence: Narrative coherence is the degree came the voice from the wall , she quickly turned to which the continuation makes sense given again to see nothing once again . “ Are you alright what has gone before. Coherent stories are Mrs . Everson ? ” asked the DI “ ah , yes , sorry internally consistent, and involve a plot de- it was just ... nothing ” she responded nervously . “ velopment and progression with characters, Did you notice anything suspicious leading up to places and actions that are meaningful in the the incident ? ” The DI asked leaping right back context of the prior story. into questioning mode , “ ah no , nothing I can thi.. • Fluency: How well does the continuation fol- ”. low on from the story, is it natural, does it Gold she began before the voice cut her off “ follow in terms of style, grammar, use of liter- Think Eve , THINK ! ” it screamed this time , ary devices, and expressive language. and with the added volume the voice became rec- ognizable . Eve leapt from the chair and fell back • Overall: This is an overall judgement of how wards onto the ground , she scrambled away from good and interesting you think the continua- the wall , staring at it intensely , screaming out to tions are in terms of criteria such as character, it “ DAVE ! DAVE ! ” . ... plot and how well written they are. LM -’ve noticed at all. Must have been some Note, the rankings for each continuation must kind of mistake on your part... or is this just an- be unique for each question category - Overall, other case of mistaken identity. ” Now that he was Coherence, and Fluency. 
No two continuations can back in his own voice, the old wizard resumed his be given the same rank, and so one continuation conversation : “ The wisest of us have fallen to must get each of the ranks 1-5. thinking this over. Why have not the Wise Ones The three figures 5,6, 7 are screenshots for the declared this thing a danger? One only knows that AWS task showing the story prompt, a ranking box we are unfortunate, for it does not take much skill for one continuation, and the submission questions. to hunt down the Bad One. ’ ‘ We may at least care A.2 Example Generated Stories for our own lives, as the ogresses do,’thought they. So they agreed to go with the Bad One, and the two The following are two random examples of eval- started on their way. uated stories and the continuations generated by each model. These examples are presented in a LSTM - That’s just normal stuff all the time fixed order, but in the real task the different model these days. You don’t talk too much after dark, continuations were shuffled. what are you doing out here?! ” he said, heading 14
Figure 5: The story prompt shown to AWS Turkers. Figure 6: The continuation and dropdown ranking choices. Figure 7: Checker validation question and the submit button.
to the door. Walking further into the old room the clear it and to activate his collar . The small bell Man familiarly began to open her hair, dress and rang out , scaring the Horrors back toward the trees shoes, several tresses flying madly about her. He . They were n’t stopped for long though . A brave saw through the cloud, after all the time they had horror approached from the forest moving toward spent together the flame consumed him, but in all the house Jingles protected . The cat pounced onto those years before, this woman still kept talking. it with as fierce a growl as his small body could She began to rub her hands together, feeling her make . His claws cut into the creature ’s gooey hair dancing, her dresses floating, yet he remained flesh , slicing the shadowy skin . Black blood fell so entranced. At last he ran them off as fast as to the ground and evaporated into a puff of black thought, sure to leave no trail. ... smoke .. TD-VAE R and rubbing his chin. The young Gold More horrors came out , rushing for Jingles woman began forcing past the police officers star- . Every young kitten knew about the Night Horrors ing at her with wide eyed bewilderment and longing , and as soon as they were old enough to fight they eyes, almost nauseous. She silently wandered over began training . Most kittens were lucky enough to where she went down the opposite corner next to have a sparing partner in their parent or their to the fence. This time however she stopped short siblings . Jingles came from a large litter and he of finding where Mrs. Everson had been hidden, was no exception . He was ready for the horrors very well probably upstairs though she might not charging him . be dead. She raised an eyebrow asking the man if the officer was lady like that, the young woman did LM , flecks of black smoke dancing in the air. not know, but he looked at her derisively for not Omar had killed the magician, he was sure. It actually knowing. was he who had spoken ill of the Red Knight. He wanted to kill the Red Knight, and the two would TD-VAE C he replied angrily. Mrs. Everson have hated each other as brothers. But a little trick- could not hide her worry, trying to console herself ery, something he had learnt from a magician in that no one was paying any attention to her at all. Crete, allowed him to escape and do what he had ; Once upon a time, a magician walked down an intended to do. For a whole month all happened avenue in our town. ; He asked a young woman just as the two brothers intended. what happened to her. ; At one point, he told the woman, you had knocked her head down. LSTM Jingle bells rang in his ears as he jumped The second example: away, brushing up against the blackened remains of his coat and fur Blurry lights blinded him once Prompt [ WP ] My cat comes home nearly every more and he cursed softly into his night ears, All be- morning with cuts and scratches all over . Write cause he feared The Grudge, the curse long gnawed about what he could be getting up to on his nights on his bones They lured him back with stories about out . The wind picked up and the leaves of the small how *the streets were no place for bad men to roam cluster of trees began to rustle . Jingles could see wild and cruel man to terrorize a terrified girl. Lit- the shadows moving through the trees , coming in tle did he know they only made him afraid, and fed his direction . * It ’ll be another rough night . * He the paranoia with sad tales of candy shops filled thought . 
The small bell on his collar had been little with children and toddlers. He had grown so ac- defense against the recent Night Horrors . Jingles customed to life being the same as a petrified child, briefly considered recruiting some alley cats to his that eventually, the grudges abandoned him forever. neighborhood . As it was there were n’t enough He made himself miserable, now and again trying cats in the area . The mercenaries typically stayed to hide in his own room and begging anyone who in the downtown areas of the cities , living off of would listen to get off his case and tell the truth. ... the scraps of humans and what little payment the local cats could afford them . To get some into the TD-VAE R Disgusted with himself he began to suburbs Jingles would have to come up with more throw open the door to the house. The house- food somehow . Jingles would have to give up part hold alarm woke him first when his private room of his own rations to pay them . The Horrors were window sprang open. A second surprised shadow closer now , their shadowy figures peeking past the approached with a single dimple on its forehead. treeline . Jingles shook his head violently , both to Then two lumps that looked like heads jostled to 16
You can also read