Inferring the Reader: Guiding Automated Story Generation with Commonsense Reasoning

Page created by Earl Castillo

Lifestyle

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Inferring the Reader: Guiding Automated Story Generation with Commonsense Reasoning

Inferring the Reader: Guiding Automated Story Generation with
                                                                    Commonsense Reasoning

                                                               Xiangyu Peng∗, Siyan Li∗, Sarah Wiegreffe, and Mark Riedl
                                                                            Georgia Institute of Technology
                                                                                   Atlanta, GA 30332
                                                                   {xpeng62, sli613, swiegreffe6}@gatech.edu
                                                                                riedl@cc.gatech.edu

                                                                 Abstract

                                            Transformer-based language model ap-
arXiv:2105.01311v1 [cs.CL] 4 May 2021

                                            proaches to automated story generation
                                            currently provide state-of-the-art results.
                                            However, they still suffer from plot incoher-
                                            ence when generating narratives over time,
                                            and critically lack basic commonsense reason-
                                            ing. Furthermore, existing methods generally
                                            focus only on single-character stories, or fail
                                            to track characters at all. To improve the co-
                                            herence of generated narratives and to expand
                                            the scope of character-centric narrative gener-
                                            ation, we introduce Commonsense-inference
                                            Augmented neural StoryTelling (CAST), a
                                            framework for introducing commonsense
                                            reasoning into the generation process while           Figure 1: Overview of the CAST system. 1. A text
                                            modeling the interaction between multiple             prompt starts the story generation process. 2. The sys-
                                            characters. We find that our CAST method              tem infers facts about the characters’ intents. 3. A lan-
                                            produces significantly more coherent and              guage model generates candidate continuations. 4. In-
                                            on-topic two-character stories, outperforming         ferences about the candidates are matched against the
                                            baselines in dimensions including plot plausi-        previous inferences and the best candidate is added to
                                            bility and staying on topic. We also show how         the story.
                                            the CAST method can be used to further train
                                            language models that generate more coherent
                                            stories and reduce computation cost.                  These techniques have improved with the adoption
                                                                                                  of Transformer-based models, such as GPT-2 (Rad-
                                        1   Introduction                                          ford et al., 2019). While GPT-2 and similar neural
                                                                                                  language models are considered highly fluent from
                                        AI storytelling is a crucial component of computa-        a grammatical standpoint, they are prone to gener-
                                        tional creativity. Humans use storytelling to enter-      ating repetitive or generic continuations (Holtzman
                                        tain, share experiences, educate, and to facilitate       et al., 2019). Furthermore, as the length of the story
                                        social bonding. For an intelligent system to be un-       grows, these models can lose coherence. One rea-
                                        able to generate a story limits its ability to interact   son for these phenomena is that language models
                                        with humans in naturalistic ways. Automated Story         generate continuations by sampling from a learned
                                        Generation, in particular, has been a grand chal-         distribution Pθ (tokn |tokn−k , ..., tokn−2 , tokn−1 ).
                                        lenge in artificial intelligence, requiring a system      Human readers, however, do not perceive the coher-
                                        to construct a sequence of sentences that can be          ence of a narrative as a function of the likelihood
                                        read and understood as a story.                           of seeing particular words based on the occurrence
                                           A common approach to story generation is to use        of previous words.
                                        neural language models (Roemmele, 2016; Khalifa
                                                                                                     Previous attempts to enhance the coherence
                                        et al., 2017; Clark et al., 2018; Martin et al., 2018).
                                                                                                  of generated stories use conditioning on content-
                                            ∗
                                                Equal contributions                               relevant features such as plot outlines (Fan et al.,

2018; Peng et al., 2018; Rashkin et al., 2020), or     erative model trained on triples from ATOMIC and
character emotional arcs (Brahman and Chaturvedi,      infers relations about sentences broadly divided
2020). These achieve coherence through adher-          into four categories: (1) Causes of a person’s ac-
ence to a manually given high-level plan. The high     tions (preconditions of the event), (2) Attributes of
level plan can be generated (Yao et al., 2019; Am-     a person, (3) Effects of actions on a person (post-
manabrolu et al., 2020), which only elevates the       conditions of the event), and (4) Effects of actions
challenges of maintaining coherence to a higher        on others (postconditions of the event). We pro-
level of abstraction. Neural language models           pose to infer character intentions and effects of
can also be fine-tuned on extraneous signals such      actions using COMET to inform the generation of
as commonsense knowledge or progression re-            subsequent sentences by a neural language model.
wards (Guan et al., 2020; Tambwekar et al., 2019),        To address the challenge of maintaining coher-
which improves the distribution but still relies       ence in language-model-based story generation,
solely on sampling.                                    we propose a novel two-character story genera-
   The latent state of neural language models used     tion method, Commonsense inference Augmented
to generate subsequent story continuations are un-     neural StoryTelling (CAST), that infers the causal
likely to relate to a human reader’s mental model of   relations between events as well as the intents and
the state of a story world. Studies of human reader    motivations of characters in the story context so
comprehension (Trabasso and Van Den Broek,             far in order to generate story continuations that are
1985; Graesser et al., 1991, 1994) show that read-     more coherent to readers. CAST uses these inferred
ers comprehend stories by tracking the relations       causal relations and character intentions to make
between events. Reader comprehension relies on         more informed choices about potential story con-
the tracking of at least four types of relations be-   tinuations generated by a neural language model
tween events: (1) causal consequence, (2) goal         (GPT-2). We hypothesize that stricter, more explicit
hierarchies, (3) goal initiation, and (4) character    constraints during generation should result in more
intentions. The perceived coherence of a story is      coherent narratives than generating via sampling
thus a function of the reader being able to compre-    from a distribution alone, even if the distribution is
hend how events correlate to each other causally       fine-tuned.
or how they follow characters’ pursuits of implicit       To evaluate the efficacy of our proposed method,
goals. Cognitive studies of human reading com-         we conduct a series of human-participant experi-
prehension (Trabasso and Van Den Broek, 1985;          ments specifically measuring perceptions of logical
Graesser et al., 1991, 1994) suggest that readers      coherence of CAST against a number of fine-tuned
track the goals and motivations of story world char-   variants of GPT-2. Results indicate that the CAST
acters as well as causal consequences of actions.      method significantly increases the perception of
We hypothesize that a story generation system that     generated stories over baselines.
makes decisions on how to continue a story based
on tracking and reasoning about character inten-       2   Related Work
tions and action consequences will generate more
                                                       Neural networks—recurrent and transformer-
coherent stories.
                                                       based—have been used to produce stories (Roem-
   However, stories don’t always explicitly declare    mele, 2016; Khalifa et al., 2017; Martin et al., 2018;
the goals and motivations of characters; sentences     Clark et al., 2018). In these systems a neural lan-
describing character actions are not explicitly an-    guage model learns to approximate the distribu-
notated with the characters’ motivations and goals.    tion Pθ (tokn |tokn−k , ..., tokn−2 , tokn−1 ). Stories
Readers must infer the characters’ goals and the       are generated by providing a context sequence and
relationship between their actions and those goals.    sampling from the distribution. When the language
The ability to use basic knowledge about goals         model is trained on a corpus of stories, the gen-
about what is happening in the world falls within      erated text tends to also be a story. Sometimes
the study of commonsense inference. ATOMIC             generation is done hierarchically (Yao et al., 2019;
(Sap et al., 2019) is a commonsense knowledge          Ammanabrolu et al., 2020). However, coherence is
base that contains logical relationships concern-      not guaranteed; statistical sampling from a distribu-
ing mental states, attributes, and events. COMET       tion is not constrained to making logical transitions
(Bosselut et al., 2019) is a transformer-based gen-    because the rich relationships that readers make to

perceive coherence are not modeled. Other artifacts language model and obtain a set of common-
of the sampling process include new characters be- sense inferences.
ing arbitrarily introduced at any time, characters
being forgotten, and repetitions. 3. We match commonsense inference sets be-
To control the generation, sometimes a high- tween si−1 and c using a matching criteria
level plot outline is given and a language model is grounded in theory, and produce a score for c.
conditioned or otherwise guided by the high-level
plot outline (Fan et al., 2018; Peng et al., 2018; 4. If the score is above a user-specified thresh-
Rashkin et al., 2020; Brahman and Chaturvedi, old, c is selected to be si and is appended
2020). These high-level guidance specifications to the generation history. Otherwise, steps 2
turn the story generation problem into a supervised through 4 are repeated until a viable candidate
learning problem. We do not consider these ap- is found.
proaches further in this paper because we do not
assume the existence of any guidance specification. 5. Repeat steps 1 through 4 until k − 1 sentences
The story generation system by Guan et have been generated.
al. (2020) first fine-tunes GPT-2 on the ATOMIC
An overview of the pipeline is given in Figure 2.
dataset and then fine-tunes a second time on the
The system is simple but effective—CAST reasons
ROCStories corpus (Mostafazadeh et al., 2016) to
about whether there are sufficient relations between
induce GPT-2 to generate short stories. The au-
the candidate sentence and the previous sentence
thors use multi-task learning during the second
to support reader comprehension. Unlike models
fine-tuning stage with an auxiliary objective to dis-
trained or fine-tuned on commonsense knowledge,
tinguish true and engineered false stories. Even
the filtering technique provides a hard constraint
with such specialized learning, the resulting model
that is likely to persist throughout a generated story
still fails to avoid logical errors, repeats pieces of
of arbitrary length without degeneration, nor is it
narratives, and introduces unrelated entities. This
limited in application to the maximum story length
demonstrates the need for a stronger inductive bias
seen during training.
on how commonsense knowledge is used in story
generation. Brahman and Chaturvedi (2020) gen-
3.1 Language Model
erate stories that follow a given emotional arc for
a protagonist, using COMET to infer the protago- We fine-tune GPT-2 on a pre-processed version of
nist’s emotions. The C2PO system (Ammanabrolu the ROCStories corpus (Mostafazadeh et al., 2016)
et al., 2021) uses COMET to generate successor to encourage story generation. Following Guan
and predecessor events instead of a language model, et al. (2020), we pre-process the corpus to remove
performing a bi-directional search from a given character names, replacing them with [MALE] or
start event and a given end event. However, C2PO [FEMALE]. In our pipeline, we enforce the telling
generates plots made up of short statements of char- of a two-character narrative in an interleaving fash-
acter intentions. ion wherein characters take turns being the subject
of each sentence. This is a simplifying assump-
3 The CAST Method tion, but fits the pattern that occurs in many stories
in which one character acts and another character
The conventional set up for neural k-sentence story
reacts. We achieve the turn-taking by condition-
generation is: given the first sentence s1 of a story—
ing the language model generation with both the
a prompt—generate the subsequent k − 1 sen-
previous story context and the tag denoting the
tences, s2 , s3 , ..., sk . The Commonsense inference
character who is to take a turn. For example if
Augmented neural StoryTelling (CAST) method is
the last sentence in the story is “[MALE] was upset
as follows. To generate sentence si with 2 ≤ i ≤ k:
with [FEMALE].”, we append the tag of the oppo-
1. We condition a fine-tuned language model site character to the context, i.e., [FEMALE]. Since
on the story up to the current sentence the next sentence now starts with the tag associ-
[s1 , . . . , si−1 ] followed by a token signifying ated with the opposite character of the one who
the subject of sentence i. acted in the previous sentence, the language model
will be biased toward describing the action for that
2. We sample a sentence candidate c from the character.

previous
                                                                               COMET
                                                                                           inferences
                           "[MALE] had a big
         s1 ... si-1 si
                          crush on [FEMALE]."
                                                                                                        no   Match?     yes       si+1

                                                Generate                                    candidate
  "[FEMALE]"              context                           candidate          COMET
                                                candidate                                  inferences
                                                                                                                oWant → xIntent
                                 "[MALE] had a big crush          "[FEMALE] wanted to                           oReact → xAttr
           PREFIX
                                on [FEMALE]. [FEMALE]"            go to prom with him."

           Figure 2: The overall procedure of generating two-character narratives with the CAST pipeline.

 Type          Description                                                relations:
 xIntent The reason why PersonX would cause the event
 xNeed   What PersonX might need to do before the event                        • The event in si−1 affects the wants of a char-
 xAttr   How PersonX might be described given the event                          acter, which manifests as an intention of the
 oReact The reaction of Others to the event                                      primary character in c (oWant→xIntent).
 oEffect The effect the event has on Others
 oWant   What Others may want to do after the event                            • An effect of the event in si−1 manifests itself
                                                                                 as something the primary character needs in c
Table 1: Definitions of the six types of ATOMIC rela-
                                                                                 (oEffect→xNeed).
tions that CAST uses. See Sap et al. (2019) for the full
set. The top set describes preconditions of the current
                                                                               • A reaction to the event in si−1 is expected
sentence and the bottom set describes postconditions
of the previous sentence. Dotted lines show how we                               and matches some property of the primary
match postconditions to preconditions during the filter-                         character in c (oReact→xAttr).
ing phase.
                                                                         In this way, we create hard constraints via a
                                                                         form of chaining that allows us to filter a set
3.2     Commonsense Inferences                                           of potential sentence generations to find one
                                                                         that adequately matches the expected inferences.
To produce commonsense inferences for each sen-                          The oWant→xIntent pattern corresponds to how
tence, we use the COMET model (Bosselut et al.,                          readers track how characters form and carry
2019) to infer a set of ATOMIC (Sap et al., 2019)                        out intentions in reader comprehension models
relations for each generated sentence. COMET                             (Graesser et al., 1991). The oEffect→xNeed and
generates commonsense inferences for a single sen-                       oReact→xAttr correspond to how readers track
tence, referring to PersonX as the sentence’s sub-                       causal consequences in reader comprehension mod-
ject, and Others as other characters in the story.                       els (Trabasso and Van Den Broek, 1985).
Due to our two-character closed-world assumption,                           To filter out “unqualified” sentence candidates
we assume Others to refer to the second character.                       generated by the language model, we match the
   We identify six of ATOMIC’s nine relation                             inference types described in Table 1 and their ar-
types that are useful for creating coherent relations                    guments. For example, suppose sentence si−1 is
between story events: xIntent, xNeed, xAttr,                             “[MALE] gives [FEMALE] a burger”. Its oEffect
oReact, oEffect, and oWant. Table 1 provides                             is “have a burger”. If sentence si is “[FEMALE]
the definitions. We treat the first three relation                       eats the burger”, it will have an xNeed that is also
types as preconditions of an event because they                          “have a burger”, which represents the prerequisite
infer facts that might need to be true for an event to                   of “having a burger” in order to “eat a burger”.
be enacted, such as a character having an intention,                        In practice, we find that simple string match-
a goal, or a property. We categorize the final three                     ing does not adequately capture when two inferred
as postconditions of an event because they infer                         relations’ arguments have slightly different phras-
things that might change once an event is enacted,                       ing (e.g., “to sleep” versus “sleeping”). We define
such as a character reaction, an effect of the action,                   a match as the semantic similarity between two
or a character forming an intention.                                     inferences exceeding a certain threshold. We en-
   Once we have inferred relations for the previous                      code each relation argument into a fixed-length
sentence si−1 and current candidate sentence c,                          vector representation, and then compute the cosine
we look for specific patterns between the sets of                        similarity. We use Sentence-BERT (Reimers and

Gurevych, 2019) for encoding, as it is designed Sentence Pair Inferences
for semantic similarity tasks and performs better [MALE] had a big crush on [FEMALE].
None
than the traditional BERT on sentence similarity [FEMALE] was constantly yelling at him.
benchmarks. [MALE] had a big crush on [FEMALE]. oReact→xAttr
We use 80% semantic similarity as our lower- [FEMALE] wanted to go to prom with him. oWant→xIntent
bound. We find this value correctly considers the in- [MALE] took [FEMALE] fishing in summer.
None
[FEMALE] got very sick.
ferences listed above as matches, but excludes less-
related inferences. This value can also be tuned to [MALE] took [FEMALE] fishing in summer.oReact→xAttr
[FEMALE] was loving it. oWant→xIntent
increase or decrease the strength of the matching
constraint at the cost of runtime computation—a Table 2: Examples of generated labeled sentence pairs.
higher threshold requires more candidate sentences Sentence pairs with no matching inferences are labeled
generated by the language model to be evaluated. with 0, while those with over 2 matching inference
In order to balance computation time and quality pairs are labeled with 1.
of the match, we only require two of three inference
type pairs to match between a seed and a candidate
without matches incur more loss. The first sentence
sentence. In practice, we find this results in higher
of the pair is masked so that loss is only calculated
quality sentences than if we only require one out
on the logits that produce the successor sentence.
of three pairs to match, but is significantly more
efficient than three out of three. For maximal fil- 4 Experiments
tering strength and best coherence, all three pairs
would be required to match. However, in our pilot We conducted three experiments to evaluate the
experiments, the model is rarely able to generate CAST technique. The first experiment assesses
any “qualified” sentence within 100 sentence can- whether fine-tuning neural language models on
didates, when required to match all three pairs. commonsense knowledge alone can significantly
address the challenge of generating coherent nar-
3.3 Fine-tuning on Commonense Inferences ratives. The second experiment compares CAST
CAST generates candidate sentences until one to two unconstrained neural language models story
passes the inference matching process. Each candi- generators: GPT-2 and the work by Guan et al.
date considered requires a set of commonsense in- (2020). The third experiment assesses whether
ferences and semantic matching—computationally CAST can be used to fine-tune a neural language
expensive processes. To increase the probability model. Story examples can be found in Table 3 and
that the neural language model generates candidate Appendix B.
sentences that will pass the filter, we fine-tune GPT-
Metrics. Purdy et al. (2018) proposed a set of
2 using a policy gradient reinforcement learning
questions for evaluating story generation systems
technique inspired by Peng et al. (2020).
that includes dimensions such a logical coherence,
As CAST generates sentences, it stores sentence
loyalty to plot, and enjoyability. A variation of
pairs along with a label, 0 for no relation matches,
these questions have been used in evaluations of
or 1 for two or more relation matches. Some ex-
other story generation systems (Tambwekar et al.,
amples are given in Table 2. After generating a
2019; Ammanabrolu et al., 2020, 2021). We mod-
full story of k sentences, GPT-2 is fine-tuned on
ify a subset of these questions to simplify the lan-
the labeled sentence pairs using the reinforcement
guage and focus more on participants’ overall im-
learning technique. To punish the generation of
pressions of the narrative coherence:
“unqualified” sentences, we use the following loss
function: • The story FOLLOWS A SINGLE TOPIC
• The story AVOIDS REPETITION

losss (s) if label = 1 • The story uses INTERESTING LANGUAGE
lossRL (s) =
losss (s) × ρ if label = 0 • Which story makes better LOGICAL SENSE?
(1) We conduct our studies on a crowdworking plat-
form. Only those who pass a screening question
where losss (s) is the cross-entropy loss given by are qualified for the study. Participants must also
Radford et al. (2018) and ρ is a constant (ρ ≥ 1) explain their preferences with more than 50 char-
controlling the strength of punishment—sentences acters of writing. This helps filter out low-quality

Seed Prompt: erated by GPT-ROC and GPT-ATOMIC-ROC and
Bob invited Alice to hang out. answered the set of questions introduced earlier for
CAST: each story, providing ratings for each question on a
Alice planned a nice dinner for Bob.
Bob and Alice spent all evening cooking dinner together.
scale from 1-5. At least five crowd workers rated
Alice was happy to see her dinner cooked. each story.
Bob was impressed with how delicious her dinner was. The results, summarized in Table 4, show that
GPT-ROC: GPT-ROC-ATOMIC performs slightly better than
Alice thought Bob was funny.
Bob got mad and threatened Alice with punches. GPT-ROC on some questions, but slightly worse
Alice ended up running away from Bob. on plausible order. However, none of these dif-
Bob was awarded the fun they had together.
:::::::::::::::::::::::::::::::: ferences are statistically significant as deterimined
CAST-RL: by a Mann-Whitney U-test. This result indicates
Alice invited Bob to hang out. that fine-tuning on a static commonsense knowl-
Bob agreed, and was happy to meet her.
Alice was very happy and liked Bob. edge base alone is not sufficient to improve story
Bob and Alice still hang out after that. coherence. For this reason, we use GPT-ROC as
Guan et al. (2020) the base language model for CAST in subsequent
Alice thought she would like it. experiments.
Bob only wanted to show her the interest of a nerdy rlist.
Alice seemed to:::
like::::
Bob ::::
asked::::
Bob ::::::
because :::
she :::
likes::::
him.
Bob ended up meeting Alice after
::::::::::::::::::::::::::::::
school. 4.2 Experiment 2: CAST

Table 3: Story examples generated by CAST, CAST- In this set of experiments we assess whether the
RL, GOT-ROC and Guan et al. (2020). Story gener- CAST constraint method improves story coherence
ated by CAST follows a single topic (bolded) – cook- over the GPT-ROC and Guan et al. (2020) baselines.
ing dinner, and shows a good plot coherence. However, We recruited 116 participants. Each participant
story generated by GPT-ROC fails to maintain plot co- read a randomly selected subset of 6 stories—one
herence (:::::::::
underlined) during generation. CAST-RL gen- from CAST and one from one of the baselines,
erates relatively more repetitive, boring but logical co-
GPT-ROC or Guan et al.. For each of the four ques-
herent narrative (in italic). Guan et al. (2020) also suf-
fers in plot coherence. tions described in Section 4, participants answered
which story best met the criteria. Participants read
either 5-sentence or 10-sentence stories. To gen-
responses and ensures the validity of the study. erate the stories, we again randomly selected 30
2-character prompts from the ROCStories corpus
Baselines. We evaluate CAST against three base- to seed the systems. The 10-sentence stories allow
lines. The first is GPT-2-small fine-tuned on ROC- us to look at the effect of story length on coherence
Stories (GPT-ROC), following prior work (Guan for CAST and the baseline GPT-ROC. They were
et al., 2020). The second baseline is GPT-2- generated on the same 30 prompts described above.
small fine-tuned on the ATOMIC dataset and then The results are shown in Table 5 (top). The re-
fine-tuned a second time on ROCStories (GPT- sults indicate that our model performs significantly
ATOMIC-ROC). The third baseline is the system better than baseline, GPT-ROC, on the “Logical
by Guan et al. (2020), which also fine-tunes GPT- Sense” and “Single Topic” dimensions. We con-
2-small on ATOMIC and ROCStories, but uses a clude that CAST improves the coherence of gener-
multi-objective training procedure. ated narratives and stays more on topic, while re-
taining comparably interesting language and avoid-
4.1 Experiment 1: Commonsense ance of repetition (neither of which are statisti-
Fine-Tuning cally significantly different). When we increase
The first human-participant study assesses whether the lengths of stories to 10 sentences from 5, a
fine-tuning on the ATOMIC commonsense greater proportion of participants prefer CAST to
database alone is sufficient to improve 2-character the baseline in the dimensions of “Logical Sense”
story generation coherence. 20 two-character first and “Single Topic” (∼ 10% increase). This shows
sentences were randomly sampled from ROCSto- that stories generated by CAST retain coherence
ries as seed sentences to produce 20 stories per longer relative to the baseline language model op-
baseline system. We recruited 34 participants. erating without constraints.
Each participant read 10 5-sentence stories gen- CAST also significantly outperforms Guan et al.

Model               Single Plot     Plausible Order       Avoids Repetition      Enjoyable
                  GPT-ROC             3.025 ± 1.180     2.825 ± 1.188           3.525 ± 1.031       2.575 ± 1.041
              GPT-ATOMIC-ROC          3.213 ± 1.122     2.750 ± 1.153           3.713 ± 0.814       2.600 ± 1.165

Table 4: Human participant evaluation results for experiment 1. Each system was conditioned on the same 20
test-set prompts. 34 participants rated each story on a Likert scale (1-5). A confidence level of 95% over the scores
is presented. None of the models’ score differences are statistically significant at p < 0.10 by the Mann-Whitney
U test.

                             Story Logical Sense     Single Topic  Avoid Repetition Interesting LanguageNum
           Models
                            lengthWin %Lose %Tie %Win %Lose %Tie %Win %Lose %Tie %Win %Lose % Tie % resp.
                                 5 53.7**17.0      29.3 52.4**14.3        33.3 18.5   25.3   56.2   30.6   23.8     45.6   147
      CAST vs GPT-ROC
                                 10 63.3**19.3     17.4 55.0**14.7        30.3 33.9   48.6   17.5   34.9   34.9     30.2   109
                                 5 64.8**19.4      15.8 62.5**18.5        19.0 26.9   31.9 41.2     27.4 34.4       38.2   216
      CAST vs Guan et al.
                                 10 73.9**11.3     14.8 86.7**7.6         5.7 27.6    50.5** 21.9   29.5 48.6*      21.9   115

GPT-ROC-RL vs GPT-ROC            5   68.1**21.3    10.6 63.8**20.2        16.0 33.0   39.4   27.6   41.5   29.8     28.7   94
CAST-RL vs GPT-ROC-RL            5   42.7   42.7   14.6 31.1   25.6       43.3 26.7   35.6   37.7   24.4   25.6     50.0   89
      CAST-RL vs CAST            5   45.2   38.1   16.7 36.5   31.0       32.5 27.2   39.2   33.6   23.2 40.8*      36.0   126

Table 5: Human-participant evaluation results for experiments 2 and 3, showing the percentage of participants that
preferred the first system, second system, or thought the systems were equal. Each system was conditioned on the
same 30 test-set prompts. * indicates results are significant at p < 0.05 confidence level; ** at p < 0.01 using a
Wilcoxan sign test on win-lose pairs.

          Model      Train Set          Test Set                the neural language model and (b) reducing overall
         CAST        20.43 ± 3.40*      28.31 ± 3.39*           computation time by foregoing the inference and se-
        CAST-RL      8.72 ± 1.95*       11.06 ± 1.81*           mantic matching steps. We hypothesize fine-tuning
                                                                on CAST labels can increase robustness by generat-
Table 6: The average number of sentence candidates              ing continuations that will pass the filtering process
generated before finding a match. A confidence level
                                                                with higher likelihood. This in turn decreases in
of 95% over 5 random-seed runs for each model is pre-
sented. Each system was conditioned on the same 30              the average number of sentence candidates needed
(training) and 10 (test) 2-character prompts from ROC-          to be generated to find a match.
Stories. * indicates the difference between CAST and               After generating 30 5-sentence stories with
CAST-RL is statistically significant at p < 0.01 using          CAST using 40 different random seeds, we train
the Mann-Whitney U test.                                        GPT-ROC with the 2 × 4 × 30 × 40 generated la-
                                                                beled sentence pairs as described in Section 3.3.1
in the “Logical Sense” and “Single Topic” dimen-                This results in a new fine-tuned language model,
sions. When increasing story length to 10 sen-                  which we refer to as GPT-ROC-RL. 40 different
tences, the coherence of stories generated by the               seeds for the language model are used to gener-
Guan et al. system drops substantially compared to              ate sentence pairs in order to simulate fine-tuning
our model. Our model is worse at avoiding repeti-               on very small datasets. Consistent with previous
tion and using interesting language, though these               experiments, we use GPT-ROC as our baseline.
differences are only statisically significant on the               To evaluate whether fine-tuning on CAST labels
10-sentence stories. We consider this a valuable                improves story coherence, we recruited 59 partic-
tradeoff for logical coherence.                                 ipants and replicated the protocol in Section 4.2,
                                                                except participants read pairs of stories from GPT-
4.3     Experiment 3: RL-Based Fine-Tuning                      ROC, GPT-ROC-RL, CAST, or CAST-RL (GPT-
Having established that CAST improves coherence,                ROC-RL with CAST filtering on top).
we ask whether reinforcement-based fine-tuning                     Table 5 (bottom half) shows the percentage of
(Section 3.3) can be used to teach a language model                   1
                                                                     We generate 2 “matched” sentence pairs during each sen-
to model the CAST filtering method directly. In                  tence generation, one pair of each label. When k = 5, 4
doing so we hope to (a) increase the robustness of               sentences are generated to form a 5-sentence story.

times stories from each system are preferred for CAST method can be applied to any neural lan-
each metric. GPT-ROC-RL improves coherence guage model or be used to create a labeled fine-
over GPT-ROC to approximately the same degree tuning dataset. The approach presents a first at-
as CAST does over GPT-ROC, indicating that RL tempt at building text generation systems that rea-
fine-tuning can be an alternative to constraint-based son about their reader along multiple dimensions.
filtering on short (5-sentence) stories. CAST-RL
does not significantly improve the logical sense of
stories over GPT-ROC-RL, indicating that the re- References
inforcement learning process has generalized the Prithviraj Ammanabrolu, Wesley Cheung, William
CAST inferential constraints into the language Broniec, and Mark O Riedl. 2021. Automated sto-
model. GPT-ROC-RL performs better on avoid- rytelling via causal, commonsense plot ordering. In
Proceedings of AAAI, volume 35.
ing repetition and interesting language than CAST-
RL, implying more robust language generation. Fi- Prithviraj Ammanabrolu, Ethan Tien, Wesley Cheung,
nally, CAST-RL does improve the logical sense Zhaochen Luo, William Ma, Lara J Martin, and
and single topic a small amount over CAST, but Mark O Riedl. 2020. Story realization: Expanding
plot events into sentences. In Proceedings of AAAI,
not at a statisically significant level. Despite being
volume 34.
trained on CAST labels, the filtering process can
still improve coherence. However, it does so at the Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chai-
expense of repetition and interesting language. tanya Malaviya, Asli Celikyilmaz, and Yejin Choi.
2019. Comet: Commonsense transformers for au-
Importantly, Table 6 shows that fine-tuning on tomatic knowledge graph construction. In Proceed-
CAST labels reduces the average number of can- ings of the 57th ACL, pages 4762–4779.
didates that need to be generated for a “match”
to 11 from 28. The RL fine-tuning procedure re- Faeze Brahman and Snigdha Chaturvedi. 2020. Mod-
eling protagonist emotions for emotion-aware sto-
duces CAST’s computation by 60.93%. Further
rytelling. In Proceedings of EMNLP, pages 5277–
fine-tuning results in an overfit language model 5294.
but this model is sufficient to tie models explicitly
using constraints (e.g., CAST-RL) on the issue of Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018.
coherence. Tables 5 and 6 taken together show the Neural text generation in stories using entity repre-
sentations as context. In Proceedings of NAACL-
benefit of CAST-RL for efficiency at no cost to HTL, pages 2250–2260.
desired model characteristics.
Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hi-
5 Conclusions erarchical neural story generation. In Proceedings of
the 56th ACL, pages 889–898.
Neural language models generate content based on Art Graesser, Kathy L. Lang, and Richard M. Roberts.
the likelihood of tokens given a historical context. 1991. Question answering in the context of sto-
Human readers, on the other hand, use complex ries. Journal of Experimental Psychology: General,
inferential processes to connect the relationships 120(3):254–277.
between events. This mismatch between genera- Art Graesser, Murray Singer, and Tom Trabasso. 1994.
tive models and reader comprehension is one of the Constructing inferences during narrative text com-
reasons why stories generated by neural language prehension. Psychological Review, 101(3):371–
models lose coherence over time. We present a 395.
new approach to neural story generation that at- Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and
tempts to model the inferred event relations and Minlie Huang. 2020. A knowledge-enhanced pre-
character intentions of human readers in order to training model for commonsense story generation.
generate more coherent stories. The CAST method Transactions of the ACL, 8:93–108.
provides hard constraints to neural language model
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and
generation that results in greater story coherence, Yejin Choi. 2019. The curious case of neural text de-
which crucially does not degenerate as the story generation. In International Conference on Learn-
gets longer. We further show that we can train ing Representations.
neural language models to emulate the logic of the
Ahmed Khalifa, Gabriella AB Barros, and Julian
CAST constraints using reinforcement learning. Togelius. 2017. Deeptingle. arXiv preprint
Our approach is relatively straightforward; the arXiv:1705.03557.

Lara Martin, Prithviraj Ammanabrolu, Xinyu Wang, Tom Trabasso and Paul Van Den Broek. 1985. Causal
William Hancock, Shruti Singh, Brent Harrison, and thinking and the representation of narrative events.
Mark Riedl. 2018. Event representations for auto- Journal of memory and language, 24(5):612–630.
mated story generation with deep neural nets. In
Proceedings of AAAI, volume 32. Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin
Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong and-write: Towards better automatic storytelling. In
He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Proceedings of AAAI, volume 33, pages 7378–7385.
Pushmeet Kohli, and James Allen. 2016. A corpus
and evaluation framework for deeper understanding
of commonsense stories. In Proceedings of NAACL-
HLT, pages 839–849.

Nanyun Peng, Marjan Ghazvininejad, Jonathan May,
and Kevin Knight. 2018. Towards controllable story
generation. In Proceedings of the First Workshop on
Storytelling, pages 43–49.

Xiangyu Peng, Siyan Li, Spencer Frazier, and Mark
Riedl. 2020. Reducing non-normative text genera-
tion from language models. In Proceedings of the
13th INLG, pages 374–383.

Christopher Purdy, Xinyu Wang, Larry He, and Mark
Riedl. 2018. Predicting generated story quality with
quantitative measures. In Proceedings of the AAAI
Conference on Artificial Intelligence and Interactive
Digital Entertainment, volume 14.

Alec Radford, Karthik Narasimhan, Tim Salimans, and
Ilya Sutskever. 2018. Improving language under-
standing by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language
models are unsupervised multitask learners.

Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and
Jianfeng Gao. 2020. Plotmachines: Outline-
conditioned generation with dynamic plot state
tracking. In Proceedings of EMNLP, pages 4274–
4295.

Nils Reimers and Iryna Gurevych. 2019. Sentence-
bert: Sentence embeddings using siamese bert-
networks. In Proceedings of EMNLP-IJCNLP,
pages 3982–3992.

Melissa Roemmele. 2016. Writing stories with help
from recurrent neural networks. In Proceedings of
AAAI, volume 30.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chan-
dra Bhagavatula, Nicholas Lourie, Hannah Rashkin,
Brendan Roof, Noah A Smith, and Yejin Choi. 2019.
Atomic: An atlas of machine commonsense for if-
then reasoning. In Proceedings of AAAI, volume 33,
pages 3027–3035.

Pradyumna Tambwekar, Murtaza Dhuliawala, Lara J
Martin, Animesh Mehta, Brent Harrison, and
Mark O Riedl. 2019. Controllable neural story plot
generation via reinforcement learning. In Proceed-
ings of the 28th IJCAI.

A Implementation Details inferences for each inference type, which results
in a higher percent of matched inferences in our
A.1 Data
preliminary experiments. We also qualitatively ob-
We use the preprocessed ROCStories corpus of serve that matching on a larger set of inferences
5-sentence stories (588,966 stories) and joint (as shown in the demo5 ) more often results in at
ATOMIC and ConceptNet dataset converted to tem- least one or a few high-quality inferences, due to
plate sentences (1,174,267 train/66,856 dev/73,083 COMET having some error.
test), both provided by Guan et al.2 We shuffle As mentioned in the body of the text, we use a
and split the ROCStories dataset into 80% (78,528) semantic similarity threshold of 80% and require 2
train and 20% (19,633) test sets. of 3 inferences to match. Runtime is feasible due
Following Guan et al. (2020), character names in to matching on two out of three inference filters
the ROCStories corpus are replaced with [MALE] and using the 5-beam COMET output. However, in
or [FEMALE] tags. This prevents skewed predic- some rare cases, no matching next-sentence candi-
tions due to the presence of certain names in a date can be found. If no qualified sentence is found
small dataset such as ROCStories and also allows after 50 generated candidates, in order to avoid
us to focus on 2-character stories without having potentially infinite search we loosen the filtering
to perform NER on generated sentences to remove strength to match only one pair of inferences.
extraneously generated names outside of the two
main characters. It also allows a direct comparison A.4 Preliminary Study Questions
to prior work. After a story is generated, we replace • The story FOLLOWS A SINGLE PLOT.
the [MALE] and [FEMALE] tags with user-inputted • The story’s events OCCUR IN A PLAUSI-
names, assuming the subject and object of the first BLE ORDER.
sentence are the subsequently-generated tags. • The story AVOIDS REPETITION.
• The story IS ENJOYABLE.
A.2 Models
Following Guan et al., we use the small version A.5 RL Fine-Tuning
of GPT-2 with 124M parameters as the base for We generate 30 5-sentence stories with CAST, us-
all fine-tuned models. When fine-tuning GPT-2 on ing 40 different seeds. 2 sentence pairs are gen-
either ROCStories and the commonsense knowl- erated when generating each following sentence
edge resources (done separately), we train with a and each 5-sentence story is composed of 4 follow-
batch-size of 16, a learning rate of 0.00005, and ing sentence. Hence, 2 × 4 × 30 × 40 sentence
using the Adam optimizer with gradient clipping at pairs are generated to fine-tune GPT-ROC with re-
a max norm of 1. All models were trained on sin- inforcement learning technique for 1 iteration. We
gle GeForce RTX 2080 GPUs in Pytorch using the use ρ = 0.001 in Section 3.3 to punish the ”un-
Huggingface Transformers library.3 We replicate match” sentence pairs.
the multi-task baseline of Guan et al. in Tensorflow
using their provided code.4 We train with early B Additional Story Outputs
stopping on the dev set (80% train,10% dev,10%
Additional 5-sentence story examples are as fol-
test split) loss with a patience of 10 epochs. Both
lows,
models converge within 1-2 epochs. All other train-
ing details are kept the same.
We use top-p sampling (Holtzman et al., 2019) Seed Prompt:
with a value of 0.9, a temperature of 1, and a max Bob and Alice went hiking together.
length of 20 tokens per sentence to sample from all
models. CAST:
Alice was excited because she planned a picnic.
A.3 CAST Bob was excited because he was staying up late at
When producing commonsense inferences from night.
COMET, we use “beam-5” setting to generate 5 Alice felt much better about her camping trip.
Bob spent the night playing in the woods.
2
https://cloud.tsinghua.edu.cn/d/670f7787b6554f308226/
3
https://huggingface.co/transformers/
4 5
https://github.com/thu-coai/CommonsenseStoryGen https://mosaickg.apps.allenai.org/comet atomic

GPT-ROC:
Alice was being very loud.
Bob told Alice they did not agree. Seed Prompt:
Alice ordered pizza to be delivered. Bob took Alice fishing in the summer.
Bob showed up later and Alice was pleased.
CAST:
CAST-RL: Alice waited at the lake for hours.
Alice decided to take a group photo. Bob got another boat and brought it back home.
Bob shared the picture with Alice. Alice and Bob fished the next day.
Alice came over and smiled at Bob. Bob enjoyed the fishing and had a great time.
Bob was so happy that he took the picture.
GPT-ROC:
Guan et al. (2020): Alice took his friend’s boat.
Alice thought that the hike was cool. Bob brought with him to the water.
Bob was pretty cool. Alice stuck her hand out a river but the fishing got
Alice reminded him of the great fall, especially in stuck.
the forest. Bob snapped his fishing line and the boat stopped
in the middle.
Bob was cute, but Alice was lost.

CAST-RL:
Alice caught ten fish.
Seed Prompt:
Bob was pleased that he caught ten fish.
Bob asked Alice to come over for dinner.
Alice was very pleased to have a good fish.
Bob is glad that he caught ten fish.
CAST:
Alice wanted to go to a party.
Guan et al. (2020):
Bob agreed to go with her.
Alice got seasick.
Alice and Bob went to the concert.
Bob asked Alice to help him.
Bob bought a bottle of wine, and they enjoyed their
Alice went to Bob ’s house to tow Bob ’s boat.
time.
Bob got a surfboard for a big fishing trip.

GPT-ROC:
Alice said yes. Seed Prompt:
Bob put on his clothing. Bob was engaged to Alice.
Alice walked out of the restaurant.
Bob didn’t give Alice her favorite shirt. CAST:
Alice wanted a diamond ring.
CAST-RL: Bob thought that ring was the perfect diamond.
Alice said yes, so he brought her food. Alice loved the diamond ring and decided to buy it.
Bob had dinner with Alice that night. Bob went to the store and bought the ring.
Alice said yes to the dinner.
Bob was happy that he invited her to dinner with GPT-ROC:
him. Alice invited him to her wedding.
Bob refused and immediately broke down.
Guan et al. (2020): Alice stormed out of the door.
Alice drove and came over Bob ’s house in time. Bob was very upset that he would never see her
Bob decided to let Alice choose the foods. again.
Alice ate some dog food and then she met her hus-
band. CAST-RL:
Bob offered Alice a side of anything he had for Alice was going on a date with Bob.
dinner. Bob agreed to the date and married her.

Alice now has two kids and needs the baby. Alice informed Bob she was very excited for him.
Bob is very happy he finally realized she was un- Bob was very appreciative.
happy. Alice said it was a perfect time.
Bob and Alice had a wonderful first date.
Guan et al. (2020): Alice was surprised she was invited to his birthday.
Alice just got a new husband. Bob was thrilled.
Bob wondered what to expect. Alice was happy to have a new boyfriend.
Alice bought the nice wedding ring from a daycare.
Bob wore it to the wedding. GPT-ROC:
Alice asked him to meet her at the restaurant.
Bob refused to get her out of the car.
Seed Prompt: Alice began to panic and began to argue with Bob.
Bob took Alice to school today. Bob ended up never talking to Alice.
Alice began to return the phone and she called the
CAST: restaurant.
Alice was a freshman. Bob got a new date with Alice.
Bob took her to the library. Alice left Bob with a very happy reunion.
Alice was excited to take the long bus ride to the Bob is so happy he is now dating Alice.
library. Alice is so happy she loves her new boyfriend.
Bob continued the long walk through the library.
Guan et al. (2020):
GPT-ROC: Alice and Bob didn’t realize that Bob loved movies.
Alice really needed Bob to come with her. Bob’s friend, Alice, saw the movie with Alice.
Bob knew he couldn’t get her to come with him. Alice and Bob were good friends.
Alice saw Bob grow up and change his attitude. Bob’s friend, Alice, screamed for Alice.
Bob is grounded because Bob is getting away with Alice was their eyes met her eyes.
everything. Bob and Alice’s grandparents were never match.
Alice because they had ever though, they thought
CAST-RL: the end the words said, she passed with Alice.
Alice had been waiting for hours at school. Bob and they agreed that they went out the eyes.
Bob was relieved to see her take the bus to school. Alice felt they got puppies and they do the thought
Alice was excited to see the bus arrive. she was their eyes Alice had.
Bob was proud of her for taking the bus to school.

Guan et al. (2020): Seed Prompt:
Alice was nervous because Bob didn’t know how Bob broke up with Alice.
to tap dance.
Bob was excited for his first attempt at tap dancing. CAST:
Alice was impressed and wanted to learn. Alice was very hurt and very angry.
Bob started his first dance at the dance school. Bob knew he would have to live with his pain.
Alice said she could give him a relief.
Additional 10-sentence story examples are as fol- Bob felt much better after that.
lows: Alice felt even better in peace.
Bob is now happy with himself.
Seed Prompt: Alice felt much better.
Bob went on her very first date with Alice. Bob will make himself a happy man.
Alice feels much better after having peace.
CAST:
Alice was very excited about having a date with GPT-ROC:
Bob. Alice had a fake boyfriend.
Bob was excited and eager. Bob found out Alice was missing.

Alice promised to find her. Bob had a great time.
Bob found her in an alley. Alice had some great food and wine.
Alice never broke up with Bob again. Bob couldn’t wait to see Alice again.
Bob never broke up with Alice. Alice let him take a bath.
Alice was so happy to see him again. Bob felt smarter and had a good time again.
Bob is very happy being close to her. Alice also got his great seeing Alice tattoo.
Alice is happy that he’s okay now. Bob had a good time.
Alice tattooy from his favorite j tomorrow too.
Guan et al. (2020):
Alice didn’t want to get it.
Bob thought that he was too thin. Seed Prompt:
Alice said he could give back inside a horse. Bob went over to play at Alice’s house.
Bob could bring along for a ride and to return to
the river. CAST:
Alice had to drive for three days and she realized Alice liked to watch movies.
she was cheating. Bob was always there watching movies.
Bob and Alice got into a fight. Alice saw a movie she wanted to see.
Alice ran away with Alice and grounded. Bob was very glad to buy that movie.
Bob and sent Alice. Alice was so happy with the movie he bought.
Alice and a girl kicked each other. Bob was excited to be watching the movie with
Alice.
Alice could not wait to see the movie with Bob.
Seed Prompt: Bob and Alice had so many fun.
Bob went to see Alice today. Alice laughed and thanked Bob for the great movie.

CAST: GPT-ROC:
Alice told him about the movie she loved to watch. Alice and Bob played soccer.
Bob grabbed a movie poster from the library. Bob’s mom told him to throw a kick.
Alice shared her own reaction to the movie. Alice kicked him in the leg.
Bob watched the movie that night. Bob was glad that he never played there again. Al-
Alice felt very satisfied after the movie was over. ice’s mom apologized for the incident.
Bob went home happy that he watched the movie. Bob’s mom never learned to kick in the leg again.
Alice has been a huge fan of the movie ever since. Alice was very glad he never stopped playing soc-
Bob looks forward to a nice vacation at the movies. cer with her.
Alice is glad that she shared her own comments to Bob was happy that he was never kicked again.
the movie. Alice still still plays soccer every now and loves
hertime.
GPT-ROC:
Alice wanted to buy her a necklace. Guan et al. (2020):
Bob felt bad for her. Alice was a baby dolls.
Alice got the necklace. Bob told Alice he could save a doll.
Bob made it. Alice went to the video game section to change the
Alice was very happy with the necklace. game.
Bob was glad that she got to see Alice. Bob was amazed to see who did the laundry to the
Alice will never forget her great necklace. last minute.
Bob will never forget the necklace that Alice Alice waved and proposed the doll girls.
bought. Bob wished her that she was getting a doll.
Alice is glad that she bought her necklace. Alice helped him with the $40 and she gave her
$100, but by buttering buttering a doll.
Guan et al. (2020): Bob’s doll broke.
Alice was nice and gregarious. Alice had an ruined doll and they would buy her.

You can also read