Modelling Suspense in Short Stories as Uncertainty Reduction over Neural Representation
David Wilmot and Frank Keller
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
david.wilmot@ed.ac.uk, keller@inf.ed.ac.uk

Abstract

Suspense is a crucial ingredient of narrative fiction, engaging readers and making stories compelling. While there is a vast theoretical literature on suspense, it is computationally not well understood. We compare two ways of modelling suspense: surprise, a backward-looking measure of how unexpected the current state is given the story so far; and uncertainty reduction, a forward-looking measure of how unexpected the continuation of the story is. Both can be computed either directly over story representations or over their probability distributions. We propose a hierarchical language model that encodes stories and computes surprise and uncertainty reduction. Evaluating against short stories annotated with human suspense judgements, we find that uncertainty reduction over representations is the best predictor, resulting in near human accuracy. We also show that uncertainty reduction can be used to predict suspenseful events in movie synopses.

1 Introduction

As current NLP research expands to include longer, fictional texts, it becomes increasingly important to understand narrative structure. Previous work has analyzed narratives at the level of characters and plot events (e.g., Gorinski and Lapata, 2018; Martin et al., 2018). However, systems that process or generate narrative texts also have to take into account what makes stories compelling and enjoyable. We follow a literary tradition that makes And then? (Forster, 1985; Rabkin, 1973) the primary question and regards suspense as a crucial factor of storytelling. Studies show that suspense is important for keeping readers' attention (Khrypko and Andreae, 2011), promotes readers' immersion and suspension of disbelief (Hsu et al., 2014), and plays a big part in making stories enjoyable and interesting (Oliver, 1993; Schraw et al., 2001). Computationally less well understood, suspense has only sporadically been used in story generation systems (O'Neill and Riedl, 2014; Cheong and Young, 2014).

Suspense, intuitively, is a feeling of anticipation that something risky or dangerous will occur; this includes the idea both of uncertainty and of jeopardy. Take the play Romeo and Juliet, in which dramatic suspense is created throughout: the initial duel, the meeting at the masquerade ball, the marriage, the fight in which Tybalt is killed, and the sleeping potions leading to the death of Romeo and Juliet. At each moment, the audience is invested in something being at stake and wonders how it will end.

This paper aims to model suspense in computational terms, with the ultimate goal of making it deployable in NLP systems that analyze or generate narrative fiction. We start from the assumption that concepts developed in psycholinguistics to model human language processing at the word level (Hale, 2001, 2006) can be generalised to the story level to capture suspense; we call this the Hale model. This assumption is supported by work in economics that uses similar concepts to model suspense in games (Ely et al., 2015; Li et al., 2018), which we call the Ely model. Common to both approaches is the idea that suspense is a form of expectation: in games, we expect to win or lose, while in stories, we expect that the narrative will end a certain way.

We will therefore compare two ways of modelling narrative suspense: surprise, a backward-looking measure of how unexpected the current state is given the story so far; and uncertainty reduction, a forward-looking measure of how unexpected the continuation of the story is. Both measures can be computed either directly over story representations, or indirectly over the probability distributions over such representations. We propose a hierarchical language model based on Generative Pre-Training (GPT, Radford et al., 2018) to encode story-level representations, and develop an inference scheme that uses these representations to compute both surprise and uncertainty reduction.

For evaluation, we use the WritingPrompts corpus of short stories (Fan et al., 2018), part of which we annotate with human sentence-by-sentence judgements of suspense. We find that surprise over representations and over probability distributions both predict suspense judgements. However, uncertainty reduction over representations is better, resulting in near human-level accuracy. We also show that our models can be used to predict turning points, i.e., major narrative events, in movie synopses (Papalampidi et al., 2019).
2 Related Work

In narratology, uncertainty over outcomes is traditionally seen as suspenseful (e.g., O'Neill, 2013; Zillmann, 1996; Abbott, 2008). Other authors claim that suspense can exist without uncertainty (e.g., Smuts, 2008; Hoeken and van Vliet, 2000; Gerrig, 1989) and that readers feel suspense even when they read a story for the second time (Delatorre et al., 2018), which is unexpected if suspense is uncertainty; this is referred to as the paradox of suspense (Prieto-Pablos, 1998; Yanal, 1996). Considering Romeo and Juliet again, in the first view suspense is motivated primarily by uncertainty over what will happen. Who will be hurt or killed in the fight? What will happen after the marriage? However, at the beginning of the play we are told "from forth the fatal loins of these two foes, a pair of star-crossed lovers take their life", and so the suspense is more about being invested in the plot than not knowing the outcome, aligning more with the second view: suspense can exist without uncertainty. We do not address the paradox of suspense directly in this paper, but we are guided by the debate to operationalise methods that encompass both views. The Hale model is closer to the traditional model of suspense as being about uncertainty. In contrast, the Ely model is more in line with the second view, in which uncertainty matters less than consequentially different outcomes.

In NLP, suspense is studied most directly in natural language generation, with systems such as Dramatis (O'Neill and Riedl, 2014) and Suspenser (Cheong and Young, 2014), two planning-based story generators that use the theory of Gerrig and Bernardo (1994) that suspense is created when a protagonist faces obstacles that reduce successful outcomes. Our approach, in contrast, models suspense using general language models fine-tuned on stories, without planning and domain knowledge. The advantage is that the model can be trained on large volumes of available narrative text without requiring expensive annotations, making it more generalisable.

Other work emphasises the role of characters and their development in story understanding (Bamman et al., 2014, 2013; Chaturvedi et al., 2017; Iyyer et al., 2016) or summarisation (Gorinski and Lapata, 2018). A further important element of narrative structure is plot, i.e., the sequence of events in which characters interact. Neural models have explicitly modelled events (Martin et al., 2018; Harrison et al., 2017; Rashkin et al., 2018) or the results of actions (Roemmele and Gordon, 2018; Liu et al., 2018a,b). On the other hand, some neural generation models (Fan et al., 2018) just use a hierarchical model on top of a language model; our architecture follows this approach.

3 Models of Suspense

3.1 Definitions

In order to formalise measures of suspense, we assume that a story consists of a sequence of sentences. These sentences are processed one by one, and the sentence at the current timepoint t is represented by an embedding e_t (see Section 4 for how embeddings are computed). Each embedding is associated with a probability P(e_t). Continuations of the story are represented by a set of possible next sentences, whose embeddings are denoted by e^i_{t+1}.

The first measure of suspense we consider is surprise (Hale, 2001), which in the psycholinguistic literature has been successfully used to predict word-based processing effort (Demberg and Keller, 2008; Roark et al., 2009; Van Schijndel and Linzen, 2018a,b). Surprise is a backward-looking predictor: it measures how unexpected the current word is given the words that preceded it (i.e., the left context). Hale formalises surprise as the negative log of the conditional probability of the current word. For stories, we compute surprise over sentences. As our sentence embeddings e_t include information about the left context e_1, ..., e_{t-1}, we can write Hale surprise as:

    S^Hale_t = -log P(e_t)    (1)

An alternative measure for predicting word-by-word processing effort used in psycholinguistics is entropy reduction (Hale, 2006).
This measure is forward-looking: it captures how much the current word changes our expectations about the words we will encounter next (i.e., the right context). Again, we compute entropy at the story level, i.e., over sentences instead of over words. Given a probability distribution over possible next sentences P(e^i_{t+1}), we calculate the entropy of that distribution. Entropy reduction is the change of that entropy from one sentence to the next:

    H_t = -∑_i P(e^i_{t+1}) log P(e^i_{t+1})
    U^Hale_t = H_{t-1} - H_t    (2)

Note that we follow Frank (2013) in computing entropy over surface strings, rather than over parse states as in Hale's original formulation.

In the economics literature, Ely et al. (2015) have proposed two measures that are closely related to Hale surprise and entropy reduction. At the heart of their theory of suspense is the notion of belief in an end state. Games are a good example: the state of a tennis game changes with each point being played, making a win more or less likely. Ely et al. define surprise as the amount of change from the previous time step to the current time step. Intuitively, large state changes (e.g., one player suddenly comes close to winning) are more surprising than small ones. Representing the state at time t as e_t, Ely surprise is defined as:

    S^Ely_t = (e_t - e_{t-1})^2    (3)

Ely et al.'s approach can be adapted for modelling suspense in stories if we assume that each sentence in a story changes the state (the characters, places, events in a story, etc.). States e_t then become sentence embeddings, rather than beliefs in end states, and Ely surprise is the distance between the current embedding e_t and the previous embedding e_{t-1}. In this paper, we will use L1 and L2 distances; other authors (Li et al., 2018) experimented with information gain and KL divergence, but found worse performance when modelling suspense in games. Just like Hale surprise, Ely surprise models backward-looking prediction, but over representations, rather than over probabilities.

Ely et al. also introduce a measure of forward-looking prediction, which they define as the expected difference between the current state e_t and the next state e_{t+1}:

    U^Ely_t = E[(e_t - e^i_{t+1})^2] = ∑_i P(e^i_{t+1}) (e_t - e^i_{t+1})^2    (4)

This is closely related to Hale entropy reduction, but the expectation is computed over states (sentence embeddings in our case), rather than over probability distributions. Intuitively, this measure captures how much the uncertainty about the rest of the story is reduced by the current sentence. We refer to the forward-looking measures in Equations (2) and (4) as Hale and Ely uncertainty reduction, respectively.

Ely et al. also suggest versions of their measures in which each state is weighted by a value α_t, thus accounting for the fact that some states may be more inherently suspenseful than others:

    S^αEly_t = α_t (e_t - e_{t-1})^2
    U^αEly_t = E[α^i_{t+1} (e_t - e^i_{t+1})^2]    (5)

We stipulate that sentences with high emotional valence are more suspenseful, as emotional involvement heightens readers' experience of suspense. This can be captured in Ely et al.'s framework by assigning the αs the scores of a sentiment classifier.
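To make the four measures concrete, here is a minimal Python/NumPy sketch of Equations (1) to (5), assuming sentence embeddings are given as NumPy arrays and candidate continuations come with a probability distribution. Following the adaptation above, the squared difference is replaced by an L1 or L2 distance; all function names are our own.

```python
import numpy as np

def hale_surprise(p_t):
    """Eq. (1): negative log probability of the current sentence."""
    return -np.log(p_t)

def hale_uncertainty_reduction(p_next_prev, p_next):
    """Eq. (2): U_t = H_{t-1} - H_t over next-sentence distributions."""
    entropy = lambda p: -np.sum(p * np.log(p))
    return entropy(np.asarray(p_next_prev)) - entropy(np.asarray(p_next))

def ely_surprise(e_t, e_prev, ord=1):
    """Eq. (3), with the squared difference replaced by L1/L2 distance."""
    return np.linalg.norm(e_t - e_prev, ord=ord)

def ely_uncertainty_reduction(e_t, e_next, p_next, alpha=None, ord=1):
    """Eqs. (4)/(5): expected distance from e_t to the candidate
    continuation embeddings e_next (one row per candidate), optionally
    weighted by per-continuation sentiment values alpha."""
    dists = np.linalg.norm(np.asarray(e_next) - e_t, ord=ord, axis=1)
    if alpha is not None:
        dists = np.asarray(alpha) * dists
    return float(np.sum(np.asarray(p_next) * dists))
```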
3.2 Modelling Approach

We now need to show how to compute the surprise and uncertainty reduction measures introduced in the previous section. This involves building a model that processes stories sentence by sentence, and assigns each sentence an embedding that encodes the sentence and its preceding context, as well as a probability. These outputs can then be used to compute a surprise value for the sentence.

Furthermore, the model needs to be able to generate a set of possible next sentences (story continuations), each with an embedding and a probability. Generating upcoming sentences is potentially very computationally expensive, since the number of continuations grows exponentially with the number of future time steps. As an alternative, we can therefore sample possible next sentences from a corpus and use the model to assign them embeddings and probabilities. Both of these approaches produce sets of upcoming sentences, which we can then use to compute uncertainty reduction. While we have so far only talked about the next sentences, we will also experiment with uncertainty reduction computed using longer rollouts.
[Figure 1: Architecture of our hierarchical model, showing the components word_enc (GPT), sent_enc (RNN), and story_enc (RNN), with affine fusion layers combining each word representation with the story embedding for generation. See text for explanation of the components.]

4 Model

4.1 Architecture

Our overall approach leverages contextualised language models, which are a powerful tool in NLP when pretrained on large amounts of text and fine-tuned on a specific task (Peters et al., 2018; Devlin et al., 2019). Specifically, we use Generative Pre-Training (GPT, Radford et al., 2018), a model which has proved successful in generation tasks (Radford et al., 2019; See et al., 2019).

Hierarchical Model. Previous work found that hierarchical models show strong performance in story generation (Fan et al., 2018) and understanding tasks (Cai et al., 2017). The language model and hierarchical encoders we use are unidirectional, which matches the incremental way in which human readers process stories when they experience suspense.

Figure 1 depicts the architecture of our hierarchical model. (Model code and scripts for evaluation are available at https://github.com/dwlmt/Story-Untangling/tree/acl-2020-dec-submission.) It builds a chain of representations that anticipates what will come next in a story, allowing us to infer measures of suspense. For a given sentence, we use GPT as our word encoder (word_enc in Figure 1), which turns each word in a sentence into a word embedding w_i. Then, we use an RNN (sent_enc) to turn the word embeddings of the sentence into a sentence embedding γ_i. Each sentence is represented by the hidden state of its last word, which is then fed into a second RNN (story_enc) that computes a story embedding. The overall story representation is the hidden state of its last sentence. Crucially, this model also gives us e_t, a contextualised representation of the current sentence at point t in the story, with which we compute surprise and uncertainty reduction.

Model training includes a generative loss ℓ_gen to improve the quality of the sentences generated by the model. We concatenate the word representations w_j for all word embeddings in the latest sentence with the latest story embedding e_max(t). This is run through affine ELU layers to produce enriched word embedding representations, analogous to the Deep Fusion model (Gülçehre et al., 2015), with story state instead of a translation model. The related Cold Fusion approach (Sriram et al., 2018) proved inferior.

Loss Functions. To obtain the discriminatory loss ℓ_disc for a particular sentence s in a batch, we compute the dot product of all the story embeddings e in the batch, and then take the cross-entropy across the batch with the correct next sentence:

    ℓ_disc(e^{i=s}_{t+1}) = -log [ exp(e^{i=s}_{t+1} · e_t) / ∑_i exp(e^i_{t+1} · e_t) ]    (6)

Modelled on Quick Thoughts (Logeswaran and Lee, 2018), this forces the model to maximise the dot product of the correct next sentence versus other sentences in the same story and negative examples from other stories, and so encourages representations that anticipate what happens next.

The generative loss in Equation (7) is a standard LM loss, where w_j are the GPT word embeddings from the sentence and e_max(t) is the story context that each word is concatenated with:

    ℓ_gen = -∑_j log P(w_j | w_{j-1}, w_{j-2}, ...; e_max(t))    (7)

The overall loss is ℓ_disc + ℓ_gen. More advanced generation losses (e.g., Zellers et al., 2019) could be used, but are an order of magnitude slower.

4.2 Inference

We compute the measures of surprise and uncertainty reduction introduced in Section 3.1 using the output of the story encoder story_enc. In addition to the contextualised sentence embeddings e_t, this requires their probabilities P(e_t) and a distribution over alternative continuations P(e^i_{t+1}). We implement a recursive beam search over a tree of future sentences in the story, looking between one and three sentences ahead (rollout).
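A minimal PyTorch sketch of the discriminatory loss in Equation (6), under the assumption that the true next sentence for each context is the matching row of a batch of candidate embeddings; the function name and batching convention are our own.

```python
import torch
import torch.nn.functional as F

def discriminatory_loss(context_embs, next_embs):
    """Quick Thoughts-style loss (Eq. 6): logits[i, j] is the dot product
    of story context i with candidate next sentence j; the diagonal holds
    the correct pairs, and every other row acts as a negative example."""
    logits = context_embs @ next_embs.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```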
The probability is calculated using the same method as the discriminatory loss, but with the cosine similarity rather than the dot product of the embeddings e_t and e^i_{t+1} fed into a softmax function. We found that cosine outperformed the dot product at inference, as the resulting probability distribution over continuations is less concentrated.
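A sketch of how the continuation distribution might be computed at inference time, per the description above (cosine similarities fed into a softmax); this is our reading of the text, not the released code.

```python
import torch
import torch.nn.functional as F

def continuation_distribution(e_t, candidates):
    """P(e^i_{t+1}): softmax over cosine similarities between the current
    story embedding e_t and each candidate continuation embedding."""
    sims = F.cosine_similarity(candidates, e_t.unsqueeze(0), dim=1)
    return F.softmax(sims, dim=0)
```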
5 Methods

Dataset. The overall goal of this work is to test whether the psycholinguistic and economic theories introduced in Section 3 are able to capture the human intuition of suspense. For this, it is important to use actual stories which were written by authors with the aim of being engaging and interesting. Some of the story datasets used in NLP do not meet this criterion; for example, ROC Cloze (Mostafazadeh et al., 2016) is not suitable because the stories are very short (five sentences), lack naturalness, and are written by crowdworkers to fulfill narrow objectives, rather than to elicit reader engagement and interest. A number of authors have also pointed out technical issues with such artificial corpora (Cai et al., 2017; Sharma et al., 2018).

Instead, we use WritingPrompts (Fan et al., 2018), a corpus of circa 300k short stories from the /r/WritingPrompts subreddit. These stories were created as an exercise in creative writing, resulting in stories that are interesting, natural, and of suitable length. The original split of the data into 90% train, 5% development, and 5% test was used. Pre-processing steps are described in Appendix A.

Annotation. To evaluate the predictions of our model, we selected 100 stories each from the development and test sets of the WritingPrompts corpus, such that each story was between 25 and 75 sentences in length. Each sentence of these stories was judged for narrative suspense; five master workers from Amazon Mechanical Turk annotated each story after reading instructions and completing a training phase. They read one sentence at a time and provided a suspense judgement using a five-point scale consisting of Big Decrease in suspense (1% of the cases), Decrease (11%), Same (50%), Increase (31%), and Big Increase (7%). In contrast to prior work (Delatorre et al., 2018), a relative rather than absolute scale was used. Relative judgements are easier to make while reading, though in practice the suspense curves generated are very similar, with a long upward trajectory and a flattening or dip near the end. After finishing a story, annotators had to write a short summary of the story.

In the instructions, suspense was framed as dramatic tension, as pilot annotations showed that the term suspense was too closely associated with murder mystery and related genres. Annotators were asked to take the character's perspective when reading, to achieve stronger inter-annotator agreement and align closely with literary notions of suspense. During training, all workers had to annotate a test story and achieve 85% accuracy before they could continue. Full instructions and the training story are in Appendix B.

The inter-annotator agreement α (Krippendorff, 2011) was 0.52 and 0.57 for the development and test sets, respectively. Given the inherently subjective nature of the task, this is substantial agreement. This was achieved after screening out and replacing annotators who had low agreement for the stories they annotated (mean α < 0.35), showed suspiciously low reading times (mean RT < 600 ms per sentence), or whose story summaries indicated low-quality annotation.

Training and Inference. Training used SGD with Nesterov momentum (Sutskever et al., 2013), with a learning rate of 0.01 and a momentum of 0.9. Models were run with early stopping based on the mean of the accuracies of the training tasks. For each batch, 50-sentence blocks from two different stories were chosen, to ensure that the negative examples in the discriminatory loss include easy (other stories) and difficult (same story) sentences.

We used the pretrained GPT weights but fine-tuned the encoder and decoder weights on our task. For the RNN components of our hierarchical model, we experimented with both GRU (Chung et al., 2015) and LSTM (Hochreiter and Schmidhuber, 1997) variants.
The GRU model had two layers in both sent_enc and story_enc; the LSTM model had four layers each in sent_enc and story_enc. Both had two fusion layers, and the size of the hidden layers for both model variants was 768. We give the results of both variants on the tasks of sentence generation and sentence discrimination in Table 1.

| | GRU | LSTM |
|---|---|---|
| Loss | 5.84 | 5.90 |
| Discriminatory Acc. | 0.55 | 0.54 |
| Discriminatory Acc. k = 10 | 0.68 | 0.68 |
| Generative Acc. | 0.37 | 0.46 |
| Generative Acc. k = 10 | 0.85 | 0.85 |
| Cosine Similarity | 0.48 | 0.50 |
| L2 Distance | 1.73 | 1.59 |
| Number of Epochs | 4 | 2 |

Table 1: For accuracy, the baseline probability is 1 in 99; k = 10 is the accuracy over the top 10 sentences of the batch. From the best epoch of training on the WritingPrompts development set.

Both perform similarly, with a slightly worse loss for the LSTM variant, but faster training and better generation accuracy. Overall, model performance is strong: the LSTM variant picks out the correct sentence 54% of the time and generates it 46% of the time. This indicates that our architecture successfully captures the structure of stories.

At inference time, we obtained a set of story continuations either by random sampling or by generation. Random sampling means that n sentences were selected from the corpus and used as continuations. For generation, sentences were generated using top-k sampling (with k = 50) using the GPT language model and the approach of Radford et al. (2019), which generates better output than beam search (Holtzman et al., 2018) and can outperform a decoder (See et al., 2019). For generation, we used up to 300 words as context, enriched with the story sentence embeddings from the corresponding points in the story. For rollouts of one sentence, we generated 100 possibilities at each step; for rollouts of two, 50 possibilities; and for rollouts of three, 25 possibilities. This keeps what is an expensive inference process manageable.

Importance. We follow Ely et al. in evaluating weighted versions of their surprise and uncertainty reduction measures, S^αEly_t and U^αEly_t (see Equation (5)). We obtain the α_t values by taking the sentiment scores assigned by the VADER sentiment classifier (Hutto and Gilbert, 2014) to each sentence and multiplying them by 1.0 for positive sentiment and 2.0 for negative sentiment. The stronger negative weighting reflects the observation that negative consequences can be more important than positive ones (O'Neill, 2013; Kahneman and Tversky, 2013).
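A sketch of the sentiment weighting, assuming the vaderSentiment package and its compound score; the 1.0/2.0 positive/negative factors are from the paper, while taking the magnitude of the score is our assumption.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

_analyzer = SentimentIntensityAnalyzer()

def alpha_weight(sentence):
    """alpha_t: VADER compound score magnitude, weighted by 1.0 if the
    sentence is positive and by 2.0 if it is negative (assumed mapping)."""
    compound = _analyzer.polarity_scores(sentence)["compound"]
    return abs(compound) * (2.0 if compound < 0 else 1.0)
```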
Baselines. We test a number of baselines as alternatives to surprise and uncertainty reduction derived from our hierarchical model. These baselines also reflect how much change occurs from one sentence to the next in a story: WordOverlap is the Jaccard similarity between the two sentences, GloveSim is the cosine similarity between the averaged Glove (Pennington et al., 2014) word embeddings of the two sentences, and GPTSim is the cosine similarity between the GPT embeddings of the two sentences. The α baseline is the weighted VADER sentiment score.

6 Results

6.1 Narrative Suspense

Task. The annotator judgements are relative (the amount of decrease or increase in suspense from sentence to sentence), but the model predictions are absolute values. We could convert the model predictions into discrete categories, but this would fail to capture the overall arc of the story. Instead, we convert the relative judgements into absolute suspense values, where J_t = j_1 + ... + j_t is the absolute value for sentence t and j_1, ..., j_t are the relative judgements for sentences 1 to t. We use -0.2 for Big Decrease, -0.1 for Decrease, 0 for Same, 0.1 for Increase, and 0.2 for Big Increase. (These values were fitted with predictions, or cross-worker annotation, using 5-fold cross-validation and an L1 loss to optimise the mapping. A constraint is placed so that Same is 0, increases are positive and decreases are negative, with a minimum 0.05 distance between them.) Both the absolute suspense judgements and the model predictions are normalised by converting them to z-scores.

To compare model predictions and absolute suspense values, we use Spearman's ρ (Sen, 1968) and Kendall's τ (Kendall, 1975). Rank correlation is preferred because we are interested in whether human annotators and models view the same parts of the story as more or less suspenseful; also, rank correlation methods are good at detecting trends. We compute ρ and τ between the model predictions and the judgements of each of the annotators (i.e., five times for five annotators), and then take the average. We then average these values again over the 100 stories in the test or development sets. As the human upper bound, we compute the mean pairwise correlation of the five annotators.
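The evaluation pipeline described above can be sketched as follows, using SciPy for the rank correlations; the label-to-increment mapping is the one given in the text, and the helper names are our own.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr, zscore

LABEL_VALUES = {"Big Decrease": -0.2, "Decrease": -0.1, "Same": 0.0,
                "Increase": 0.1, "Big Increase": 0.2}

def absolute_suspense(labels):
    """J_t = j_1 + ... + j_t: cumulative relative judgements, z-scored."""
    return zscore(np.cumsum([LABEL_VALUES[l] for l in labels]))

def story_correlations(model_scores, annotator_labels):
    """Mean Kendall tau and Spearman rho of one story's model predictions
    against each annotator's absolute suspense curve."""
    preds = zscore(np.asarray(model_scores, dtype=float))
    taus, rhos = [], []
    for labels in annotator_labels:  # one label sequence per annotator
        gold = absolute_suspense(labels)
        taus.append(kendalltau(preds, gold).correlation)
        rhos.append(spearmanr(preds, gold).correlation)
    return np.mean(taus), np.mean(rhos)
```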
Results. Figure 2 shows the surprise and uncertainty reduction measures and human suspense judgements for an example story (text and further examples in Appendix C).

[Figure 2: Story 27. Human suspense judgements and the S^Hale, S^Ely, U^Ely, and U^αEly measures plotted by sentence. Solid lines: generated alternative continuations; dashed lines: sampled alternative continuations.]

We performed model selection using the correlations on the development set, which are given in Table 2. We experimented with all the measures introduced in Section 3.1, computing sets of alternative sentences either using generated continuations (Gen) or continuations sampled from the corpus (Cor), except for S^Ely, which can be computed without alternatives. We compared the LSTM and GRU variants (see Section 4) and experimented with rollouts of up to three sentences. We tried L1 and L2 distance for the Ely measures, but only report L1, which always performed better.

| Prediction | Model | Roll | τ↑ | ρ↑ |
|---|---|---|---|---|
| Human | | | .553 | .614 |
| WordOverlap (baseline) | | 1 | .017 | .026 |
| GloveSim (baseline) | | 1 | .017 | .029 |
| GPTSim (baseline) | | 1 | .021 | .031 |
| α (baseline) | | 1 | .024 | .036 |
| S^Hale-Gen | GRU | 1 | .145 | .182 |
| | LSTM | 1 | .434 | .529 |
| S^Hale-Cor | GRU | 1 | .177 | .214 |
| | LSTM | 1 | .580 | .675 |
| U^Hale-Gen | GRU | 1 | .036 | .055 |
| | LSTM | 1 | .009 | .016 |
| U^Hale-Cor | GRU | 1 | .048 | .050 |
| | LSTM | 1 | .066 | .094 |
| S^Ely | GRU | 1 | .484 | .607 |
| | LSTM | 1 | .427 | .539 |
| S^αEly | GRU | 1 | .089 | .123 |
| | LSTM | 1 | .115 | .156 |
| U^Ely-Gen | GRU | 1 | .241 | .161 |
| | GRU | 2 | .304 | .399 |
| | LSTM | 1 | .610 | .698 |
| | LSTM | 2 | .393 | .494 |
| U^Ely-Cor | GRU | 1 | .229 | .264 |
| | GRU | 2 | .512 | .625 |
| | GRU | 3 | .515 | .606 |
| | LSTM | 1 | .594 | .678 |
| | LSTM | 2 | .564 | .651 |
| | LSTM | 3 | .555 | .645 |
| U^αEly-Gen | GRU | 1 | .216 | .124 |
| | GRU | 2 | .219 | .216 |
| | LSTM | 1 | .474 | .604 |
| | LSTM | 2 | .316 | .418 |
| U^αEly-Cor | GRU | 1 | .205 | .254 |
| | GRU | 2 | .365 | .470 |
| | LSTM | 1 | .535 | .642 |
| | LSTM | 2 | .425 | .534 |

Table 2: Development set results for WritingPrompts for generated (Gen) or corpus sampled (Cor) alternative continuations; α indicates sentiment weighting.

Discussion. On the development set (see Table 2), we observe that all baselines perform poorly, indicating that distance between simple sentence representations or raw sentiment values does not model suspense. We find that Hale surprise S^Hale performs well, reaching a maximum ρ of .675 on the development set. Hale uncertainty reduction U^Hale, however, performs consistently poorly. Ely surprise S^Ely also performs well, reaching a similar value to Hale surprise. Overall, Ely uncertainty reduction U^Ely is the strongest performer, with ρ = .698, numerically outperforming the human upper bound.

Some other trends are clear from the development set: using GRUs reduces performance in all cases but one; a rollout of more than one never leads to an improvement; and sentiment weighting (prefix α in the table) always reduces performance, as it introduces considerable noise (see Figure 2). We therefore eliminate the models that correspond to these settings when we evaluate on the test set.

For the test set results in Table 3, we also report upper and lower confidence bounds computed using the Fisher Z-transformation (p < 0.05). On the test set, U^Ely is again the best measure, with a correlation statistically indistinguishable from human performance (based on CIs). We find that absolute correlations are higher on the test set, presumably reflecting the higher human upper bound.
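For reference, Fisher Z-transformation confidence bounds can be computed as below; this is the standard construction, since the exact averaging over annotators and stories behind the reported intervals is not spelled out in the paper.

```python
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, alpha=0.05):
    """Confidence interval for a correlation coefficient r computed over
    n observations, via the Fisher z-transformation."""
    z = np.arctanh(r)
    half_width = norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    return np.tanh(z - half_width), np.tanh(z + half_width)
```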
Overall, we conclude that our hierarchical architecture successfully models human suspense judgements on the WritingPrompts dataset. The overall best predictor is U^Ely, uncertainty reduction computed over story representations. This measure combines the probability of continuations (as in S^Hale) with the distance between story embeddings (as in S^Ely), which are both good predictors in their own right. This finding supports the theoretical claim that suspense is an expectation over the change in future states of a game or a story, as advanced by Ely et al. (2015).

| Prediction | τ↑ | ρ↑ |
|---|---|---|
| Human | .652 (.039) | .711 (.033) |
| S^Hale-Gen | .407 (.089) | .495 (.081) |
| S^Hale-Cor | .454 (.085) | .523 (.079) |
| U^Hale-Gen | .036 (.102) | .051 (.102) |
| U^Hale-Cor | .061 (.100) | .088 (.101) |
| S^Ely | .391 (.092) | .504 (.082) |
| U^Ely-Gen | .620 (.067) | .710 (.053) |
| U^Ely-Cor | .605 (.069) | .693 (.056) |

Table 3: Test set results for WritingPrompts for generated (Gen) or corpus sampled (Cor) continuations. LSTM with rollout one; brackets: confidence intervals.

6.2 Movie Turning Points

Task and Dataset. An interesting question is whether the peaks in suspense in a story correspond to important narrative events. Such events are sometimes called turning points (TPs) and occur at certain positions in a movie according to screenwriting theory (Cutting, 2016). A corpus of movie synopses annotated with turning points is available in the form of the TRIPOD dataset (Papalampidi et al., 2019). We can therefore test whether surprise or uncertainty reduction predicts TPs in TRIPOD. As our model is trained on a corpus of short stories, this will also serve as an out-of-domain evaluation.

Papalampidi et al. (2019) assume five TPs: 1. Opportunity, 2. Change of Plans, 3. Point of No Return, 4. Major Setback, and 5. Climax. They derive a prior distribution of TP positions from their test set, and use this to constrain predicted turning points to windows around these prior positions. We follow this approach and select as the predicted TP the sentence with the highest surprise or uncertainty reduction value within a given constrained window (see the sketch at the end of this section). We report the same baselines as in the previous experiment, as well as the Theory Baseline, which uses screenwriting theory to predict where in a movie a given TP should occur (e.g., the Point of No Return theoretically occurs 50% of the way through the movie). This baseline is hard to beat (Papalampidi et al., 2019).

| Model | Dev D↓ | Test D↓ |
|---|---|---|
| Human | Not reported | 4.30 (3.43) |
| Theory Baseline | 9.65 (0.94) | 7.47 (3.42) |
| TAM | 7.11 (1.71) | 6.80 (2.63) |
| WordOverlap | 13.9 (1.45) | 12.7 (3.13) |
| GloveSim | 10.2 (0.74) | 10.4 (2.54) |
| GPTSim | 16.8 (1.47) | 18.1 (4.71) |
| α | 11.3 (1.24) | 11.2 (2.67) |
| S^Hale-Gen | 8.27 (0.68) | 8.72 (2.27) |
| U^Hale-Gen | 10.9 (1.02) | 10.69 (3.66) |
| S^Ely | 9.54 (0.56) | 9.01 (1.92) |
| S^αEly | 9.95 (0.78) | 9.54 (2.76) |
| U^Ely-Gen | 8.75 (0.76) | 8.38 (1.53) |
| U^Ely-Cor | 8.74 (0.76) | 8.50 (1.69) |
| U^αEly-Gen | 8.80 (0.61) | 7.84 (3.34) |
| U^αEly-Cor | 8.61 (0.68) | 7.78 (1.61) |

Table 4: TP prediction on the TRIPOD development and test sets. D is the normalised distance to the gold standard; CI in brackets.

Results and Discussion. Figure 3 plots both gold standard and predicted TPs for a sample movie synopsis (text and further examples in Appendix D). The results on the TRIPOD development and test sets are reported in Table 4 (we report both due to the small number of synopses in TRIPOD). We use our best LSTM model with a rollout of one; the distance measure for Ely surprise and uncertainty reduction is now the L2 distance, as it outperformed L1 on TRIPOD. We report results in terms of D, the normalised distance between gold standard and predicted TP positions.

On the test set, the best performing model, with D = 7.78, is U^αEly-Cor, with U^αEly-Gen only slightly worse. It is outperformed by TAM, the best model of Papalampidi et al. (2019), which however requires TP annotation at training time. U^αEly-Cor is close to the Theory Baseline on the test set, an impressive result given that our model has no TP supervision and is trained on a different domain. The fact that models with sentiment weighting (prefix α) perform well here indicates that turning points often have an emotional resonance as well as being suspenseful.
[Figure 3: Movie "15 Minutes". The S^Hale, S^Ely, U^Ely, and U^αEly measures plotted by synopsis sentence; diamond: theory baseline, star: gold TP annotations, triangles: predicted TPs.]

7 Conclusions

Our overall findings suggest that by implementing concepts from psycholinguistic and economic theory, we can predict human judgements of suspense in storytelling. That uncertainty reduction (U^Ely) outperforms probability-only (S^Hale) and state-only (S^Ely) surprise suggests that, while consequential state change is of primary importance for suspense, the probability distribution over the states is also a necessary factor. Uncertainty reduction therefore captures the view of suspense as reducing paths to a desired outcome, with more consequential shifts as the story progresses (O'Neill and Riedl, 2014; Ely et al., 2015; Perreault, 2018). This is more in line with the Smuts (2008) desire-frustration view of suspense, where uncertainty is secondary.

Strong psycholinguistic claims about suspense are difficult to make due to several weaknesses in our approach, which highlight directions for future research: the proposed model does not have a higher-level understanding of event structure; most likely it picks up the textual cues that accompany dramatic changes in the text. One strand of further work is therefore analysis: text could be artificially manipulated using structural changes, for example by switching the order of sentences, mixing multiple stories, including a summary at the beginning that foreshadows the work, masking key suspenseful words, or paraphrasing. An analogue of this would be adversarial examples as used in computer vision. Additional annotations, such as how certain readers are about the outcome of the story, may also be helpful in better understanding the relationship between suspense and uncertainty. Automated interpretability methods, as proposed by Sundararajan et al. (2017), could shed further light on models' predictions.

The recent success of language models in wide-ranging NLP tasks (e.g., Radford et al., 2019) has shown that language models are capable of learning semantically rich information implicitly. However, generating plausible future continuations is an essential part of the model. In text generation, Fan et al. (2019) have found that explicitly incorporating coreference and structured event representations into generation produces more coherent generated text. A more sophisticated model would incorporate similar ideas.

Autoregressive models that generate step-by-step alternatives for future continuations are computationally impractical for longer rollouts and are not cognitively plausible. They also differ from the Ely et al. (2015) conception of suspense, which is in terms of Bayesian beliefs over a longer-term future state, not step by step. There is much recent work (e.g., Ha and Schmidhuber, 2018; Gregor et al., 2019) on state-space approaches that model beliefs as latent states using variational methods. In principle, these would avoid the brute-force calculation of a rollout, and conceptually, anticipating longer-term states aligns with theories of suspense. Related tasks, such as inverting the understanding of suspense to utilise the models in generating more suspenseful stories, may also prove fruitful.

This paper is a baseline that demonstrates how modern neural network models can implicitly represent text meaning and be useful in a narrative context without recourse to supervision. It provides a springboard to further interesting applications and research on suspense in storytelling.

Acknowledgments

The authors would like to thank the anonymous reviewers, Pinelopi Papalampidi and David Hodges for reviews of the annotation task, the AMT annotators, and Mirella Lapata, Ida Szubert, and Elizabeth Nielson for comments on the paper. Wilmot's work is funded by an EPSRC doctoral training award.
References

H. Porter Abbott. 2008. The Cambridge Introduction to Narrative. Cambridge University Press.

David Bamman, Brendan O'Connor, and Noah A. Smith. 2013. Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 352–361.

David Bamman, Ted Underwood, and Noah A. Smith. 2014. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 370–379, Baltimore, Maryland.

Zheng Cai, Lifu Tu, and Kevin Gimpel. 2017. Pay attention to the ending: Strong neural baselines for the ROC story cloze task. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 616–622.

Snigdha Chaturvedi, Mohit Iyyer, and Hal Daume III. 2017. Unsupervised learning of evolving relationships between literary characters. In Thirty-First AAAI Conference on Artificial Intelligence.

Yun-Gyung Cheong and R. Michael Young. 2014. Suspenser: A story generation system for suspense. IEEE Transactions on Computational Intelligence and AI in Games, 7(1):39–52.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075.

James E. Cutting. 2016. Narrative theory and the dynamics of popular movies. Psychonomic Bulletin & Review, 23(6):1713–1743.

Pablo Delatorre, Carlos León, Alberto G. Salguero, Manuel Palomo-Duarte, and Pablo Gervás. 2018. Confronting a paradox: A new perspective of the impact of uncertainty in suspense. Frontiers in Psychology, 9:1392.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 101(2):193–210.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.

Jeffrey Ely, Alexander Frankel, and Emir Kamenica. 2015. Suspense and surprise. Journal of Political Economy, 123(1):215–260.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia.

Angela Fan, Mike Lewis, and Yann Dauphin. 2019. Strategies for structuring story generation. In ACL.

Edward Morgan Forster. 1985. Aspects of the Novel, volume 19. Houghton Mifflin Harcourt.

Stefan L. Frank. 2013. Uncertainty reduction as a measure of cognitive load in sentence comprehension. Topics in Cognitive Science, 5(3):475–494.

Richard J. Gerrig. 1989. Suspense in the absence of uncertainty. Journal of Memory and Language, 28(6):633–648.

Richard J. Gerrig and Allan B. I. Bernardo. 1994. Readers as problem-solvers in the experience of suspense. Poetics, 22(6):459–472.

Philip John Gorinski and Mirella Lapata. 2018. What's this movie about? A joint neural network architecture for movie content analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1770–1781, New Orleans, Louisiana.

Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, and Theophane Weber. 2019. Temporal difference variational auto-encoder. In International Conference on Learning Representations.

Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.

David Ha and Jürgen Schmidhuber. 2018. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc. https://worldmodels.github.io.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the 2nd Conference of the North American Chapter of the Association for Computational Linguistics, volume 2, pages 159–166, Pittsburgh, PA.

John Hale. 2006. Uncertainty about the rest of the sentence. Cognitive Science, 30(4):643–672.

Brent Harrison, Christopher Purdy, and Mark O. Riedl. 2017. Toward automated story generation with Markov chain Monte Carlo methods and deep neural networks. In Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Hans Hoeken and Mario van Vliet. 2000. Suspense, curiosity, and surprise: How discourse structure influences the affective and cognitive processing of a story. Poetics, 27(4):277–286.

Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649, Melbourne, Australia.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

C. T. Hsu, M. Conrad, and A. M. Jacobs. 2014. Fiction feelings in Harry Potter: Haemodynamic response in the mid-cingulate cortex correlates with immersive reading experience. NeuroReport, 25:1356–1361.

Clayton J. Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534–1544.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Daniel Kahneman and Amos Tversky. 2013. Prospect theory: An analysis of decision under risk. In Handbook of the Fundamentals of Financial Decision Making: Part I, pages 99–127. World Scientific.

M. G. Kendall. 1975. Rank Correlation Measures. Charles Griffin, London.

Y. Khrypko and P. Andreae. 2011. Towards the problem of maintaining suspense in interactive narrative. In Proceedings of the 7th Australasian Conference on Interactive Entertainment, pages 5:1–5:3.

Klaus Krippendorff. 2011. Computing Krippendorff's alpha-reliability. Annenberg School for Communication Departmental Papers, Philadelphia.

Zhiwei Li, Neil Bramley, and Todd M. Gureckis. 2018. Modeling dynamics of suspense and surprise. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018, Madison, WI, USA.

Chunhua Liu, Haiou Zhang, Shan Jiang, and Dong Yu. 2018a. DEMN: Distilled-exposition enhanced matching network for story comprehension. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong.

Fei Liu, Trevor Cohn, and Timothy Baldwin. 2018b. Narrative modeling with memory chains and semantic supervision. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 278–284, Melbourne, Australia.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.

Lara J. Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark O. Riedl. 2018. Event representations for automated story generation with deep neural nets. In Thirty-Second AAAI Conference on Artificial Intelligence.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California.

M. B. Oliver. 1993. Exploring the paradox of the enjoyment of sad films. Human Communication Research, 19:315–342.

Brian O'Neill. 2013. A Computational Model of Suspense for the Augmentation of Intelligent Story Generation. Ph.D. thesis, Georgia Institute of Technology.

Brian O'Neill and Mark Riedl. 2014. Dramatis: A computational model of suspense. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 944–950, Québec City, Québec, Canada.

Pinelopi Papalampidi, Frank Keller, and Mirella Lapata. 2019. Movie plot analysis via turning point identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1707–1717, Hong Kong, China.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Joseph Perreault. 2018. The Universal Structure of Plot Content: Suspense, Magnetic Plot Elements, and the Evolution of an Interesting Story. Ph.D. thesis, University of Idaho.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana.

Juan A. Prieto-Pablos. 1998. The paradox of suspense. Poetics, 26(2):99–113.

Eric S. Rabkin. 1973. Narrative Suspense: "When Slim Turned Sideways...". University of Michigan Press, Ann Arbor.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://openai.com/blog/language-unsupervised/.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 463–473, Melbourne, Australia.

Brian Roark, Asaf Bachrach, Carlos Cardenas, and Christophe Pallier. 2009. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 324–333, Singapore.

Melissa Roemmele and Andrew Gordon. 2018. An encoder-decoder approach to predicting causal relations in stories. In Proceedings of the First Workshop on Storytelling, pages 50–59, New Orleans, Louisiana.

G. Schraw, T. Flowerday, and S. Lehman. 2001. Increasing situational interest in the classroom. Educational Psychology Review, 13:211–224.

Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D. Manning. 2019. Do massively pretrained language models make better storytellers? arXiv preprint arXiv:1909.10705.

Pranab Kumar Sen. 1968. Estimates of the regression coefficient based on Kendall's tau. Journal of the American Statistical Association, 63(324):1379–1389.

Rishi Sharma, James Allen, Omid Bakhshandeh, and Nasrin Mostafazadeh. 2018. Tackling the story ending biases in the story cloze test. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 752–757.

Aaron Smuts. 2008. The desire-frustration theory of suspense. The Journal of Aesthetics and Art Criticism, 66(3):281–290.

Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2018. Cold fusion: Training seq2seq models together with language models. In Interspeech 2018, pages 387–391, Hyderabad, India.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 3319–3328.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147.

Marten van Schijndel and Tal Linzen. 2018a. Can entropy explain successor surprisal effects in reading? CoRR, abs/1810.11481.

Marten van Schijndel and Tal Linzen. 2018b. Modeling garden path effects without explicit hierarchical syntax. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018, Madison, WI, USA.

Robert J. Yanal. 1996. The paradox of suspense. The British Journal of Aesthetics, 36(2):146–159.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. CoRR, abs/1905.12616.

Dolf Zillmann. 1996. The psychology of suspense in dramatic exposition. In Suspense: Conceptualizations, Theoretical Analyses, and Empirical Explorations, pages 199–231.
A Pre-processing

WritingPrompts comes from a public forum of short stories and so is naturally noisy. Story authors often use punctuation in unusual ways to mark out sentence or paragraph boundaries, and there are many spelling mistakes. Some of these cause problems with the GPT model and in some circumstances can cause it to crash. To improve the quality, sentence demarcations are left as they are in the original WritingPrompts dataset, but some sentences are cleaned up and others skipped over. Skipping is also why there are sometimes gaps in the graph plots: the sentence was ignored during training and inference. The pre-processing steps are as follows. Where substitutions are made rather than ignoring the sentence, the token is replaced by the spaCy (Honnibal and Montani, 2017) POS tag.

1. English language: Some phrases in sentences can be non-English. Whatthelang (Joulin et al., 2016) is used to filter out these sentences.

2. Non-dictionary words: PyDictionary and PyEnchant are used to check whether each word is a dictionary word. If not, it is replaced.

3. Repeating symbols: Some authors mark out sections by using a string of characters such as *************** or !!!!!!!!!!!!. This can cause the PyTorch GPT implementation to break, so repeating characters are replaced with a single one.

4. Ignoring sentences: If, after all of these replacements, there are not three or more GPT word pieces (ignoring the POS replacements), then the sentence is skipped. The same processing applies to generating sentences at inference time. Occasionally the generated sentences can be nonsense, so the same criteria are used to exclude them.
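A sketch of how the repeated-symbol rule (step 3) might be implemented; the paper describes the rule but not the code, so the regex is our assumption.

```python
import re

def collapse_repeats(text):
    """Replace runs of the same punctuation/symbol character
    (e.g. '*****' or '!!!!!!') with a single instance."""
    return re.sub(r"([^\w\s])\1+", r"\1", text)
```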
B Mechanical Turk Written Instructions

These are the actual instructions given to the Mechanical Turk annotators, plus the example in Table 5:

INSTRUCTIONS: For the first HIT there will be an additional training step to pass. This will take about 5 minutes. After this you will receive a code which you can enter in the code box to bypass the training for subsequent HITS. Other stories are in separate HITS; please search for "Story dramatic tension, reading sentence by sentence" to find them. The training completion code will work for all related HITS.

You will read a short story and for each sentence be asked to assess how the dramatic tension increases, decreases or stays the same. Each story will take an estimated 8-10 minutes. Judge each sentence on how the dramatic tension has changed as felt by the main characters in the story, not what you as a reader feel. Dramatic tension is the excitement or anxiousness over what will happen to the characters next; it is anticipation.

Increasing levels of each of the following increase the level of dramatic tension:

• Uncertainty: How uncertain are the characters involved about what will happen next? Put yourself in the characters' shoes; judge the change in the tension based on how the characters perceive the situation.

• Significance: How significant are the consequences of what will happen to the central characters of the story?

An Example: Take a dramatic moment in a story, such as a character that needs to walk along a dangerous cliff path. When the character first realises they will encounter danger, the tension will rise, and then the tension will increase further. Other details such as falling rocks or slips will increase the tension further to a peak. When the cliff edge has been navigated safely, the tension will drop. The pattern will be the same with a dramatic event such as a fight, argument, accident, or romantic moment, where the tension will rise to a peak and then fall away as the tension is resolved.

You will be presented with one sentence at a time. Once you have read the sentence, you will press one of five keys to judge the increase or decrease in dramatic tension that this sentence caused. You will use five levels (with keyboard shortcuts in brackets):

• Big Decrease (A): A sudden decrease in the dramatic tension of the situation. In the cliff example, the person reaching the other side safely.

• Decrease (S): A slow decrease in the level of tension, a more gradual drop. For example, the cliff walker sees an easier route out.

• Same (Space): Stays at a similar level. In the cliff example, an ongoing description of the event.

• Increase (K): A gradual increase in the tension. Loose rocks fall nearby the cliff walker.

• Big Increase (L): A more sudden dramatic increase, such as an argument. The cliff walker suddenly slips and falls.

POST ACTUAL INSTRUCTIONS: In addition to the suspense annotation, the following review questions were asked:

• Please write a summary of the story in one or two sentences.

• Do you think the story is interesting or not? And why? One or two sentences.

• How interesting is the story? 1-5

The main purpose of this was to test whether the MTurk annotators were comprehending the stories and not trying to cheat by skipping over. Some further work can be done to tie these into the suspense measures and also the WritingPrompts prompts.

| Annotation | Sentence |
|---|---|
| NA | Clancy Marguerian, 154, private first class of the 150+ army, sits in his foxhole. |
| Increase | Tired, cold, wet and hungry, the only thing preventing him from laying down his rifle and walking towards the enemy lines in surrender is the knowledge that however bad he has it here, life as a 50-100 POW is surely much worse. |
| Increase | He's fighting to keep his eyes open and his rifle ready when the mortar shells start landing near him. |
| Same | He hunkers lower. |
| Increase | After a few minutes under the barrage, Marguerian hears hurried footsteps, a grunt, and a thud as a soldier leaps into the foxhole. |
| Same | The man's uniform is tan, he must be a 50-100. |
| Big Increase | The two men snarl and grab at each other, grappling in the small foxhole. |
| Same | Abruptly, their faces come together. |
| Decrease | "Clancy?" |
| Decrease | "Rob?" |
| Big Decrease | Rob Hall, 97, Corporal in the 50-100 army grins, as the situation turns from life or death struggle, to a meeting of two college friends. |
| Decrease | He lets go of Marguerian's collar. |
| Same | "Holy shit Clancy, you're the last person I expected to see here" |
| Same | "Yeah" |
| | "Shit man, I didn't think I'd ever see Mr. 'volunteers every saturday morning at the food shelf', not after The Reorganization at least" |
| Same | "Yeah Rob, it is something isn't it" |
| Decrease | "Man, I'm sorry, I tried to kill you there". |

Table 5: One of the training annotation examples given to Mechanical Turk workers. The annotation labels are the recommended labels. This is an extract from a validation set WritingPrompts story.

C Writing Prompts Examples

The story numbers are from the full WritingPrompts test set. Since random sampling was done from these for evaluation, the numbers are not in a contiguous block. There are a couple of nonsense sentences or entirely punctuation sentences; in the model these are excluded in pre-processing, but they are included here to match the sentence segmentation. There are also some unusual breaks such as "should n't"; this is because of the word segmentation produced by the spaCy tokenizer.

C.1 Story 27

This is Story 27 from the test set, shown in Figure 4; it is the same as the example in the main text:

0. As I finished up my research on Alligator breeding habits for a story I was tasked with writing, a bell began to ring loudly throughout the office.