Harry Potter and the Action Prediction Challenge from Natural Language


                                                             David Vilares                             Carlos Gómez-Rodríguez
                                                     Universidade da Coruña, CITIC                  Universidade da Coruña, CITIC
                                                     Departamento de Computación                    Departamento de Computación
                                                      Campus de Elviña s/n, 15071                    Campus de Elviña s/n, 15071
                                                            A Coruña, Spain                                A Coruña, Spain
                                                      david.vilares@udc.es                            carlos.gomez@udc.es

Abstract
We explore the challenge of action prediction from textual descriptions of scenes, a testbed
to approximate whether text inference can be used to predict upcoming actions. As a case
study, we consider the world of the Harry Potter fantasy novels and infer what spell will be
cast next given a fragment of a story. Spells act as keywords that abstract actions (e.g.
‘Alohomora’ to open a door) and denote a response to the environment. This idea is used to
automatically build HPAC, a corpus containing 82 836 samples and 85 actions. We then
evaluate different baselines. Among the tested models, an LSTM-based approach obtains the
best performance for frequent actions and large scene descriptions, but approaches such as
logistic regression behave well on infrequent actions.

1   Introduction
Natural language processing (NLP) has achieved significant advances in reading comprehension
tasks (Chen et al., 2016; Salant and Berant, 2017). These are partially due to embedding
methods (Mikolov et al., 2013; Devlin et al., 2018) and neural networks (Rosenblatt, 1958;
Hochreiter and Schmidhuber, 1997; Vaswani et al., 2017), but also to the availability of new
resources and challenges. For instance, in cloze-form tasks (Hermann et al., 2015; Bajgar et
al., 2016), the goal is to predict the missing word given a short context. Weston et al.
(2015) presented bAbI, a set of proxy tasks for reading comprehension. In the SQuAD corpus
(Rajpurkar et al., 2016), the aim is to answer questions given a Wikipedia passage. Kocisky
et al. (2018) introduce NarrativeQA, where answering the questions requires processing entire
stories. In a related line, Frermann et al. (2017) use fictional crime scene investigation
data, from the CSI series, to define a task where the models try to answer the question:
‘who committed the crime?’.
   In an alternative line of work, script induction (Schank and Abelson, 1977) has also been
a useful approach to evaluate inference and semantic capabilities of NLP systems. Here, a
model processes a document to infer new sequences that reflect events that are statistically
probable (e.g. go to a restaurant, be seated, check the menu, ...). For example, Chambers and
Jurafsky (2008) introduce narrative event chains, a representation of structured knowledge of
a set of events occurring around a protagonist. They then propose a method to learn
statistical scripts, and also introduce two different evaluation strategies. With a related
aim, Pichotta and Mooney (2014) propose a multi-event representation of statistical scripts
to be able to consider multiple entities. These same authors (Pichotta and Mooney, 2016) have
also studied the abilities of recurrent neural networks for learning scripts, generating
upcoming events given a raw sequence of tokens, using BLEU (Papineni et al., 2002) for
evaluation.
   This paper explores instead a new task: action prediction from natural language
descriptions of scenes. The challenge is addressed as follows: given a natural language input
sequence describing the scene, such as a piece of a story coming from a transcript, the goal
is to infer which action is most likely to happen next.

Contribution We introduce a fictional-domain English corpus set in the world of Harry Potter
novels. The domain is motivated by the existence of a variety of spells in these literary
books, associated with keywords that can be seen as unambiguous markers for actions that
potentially relate to the previous context. This is used to automatically create a natural
language corpus coming from hundreds of users, with different styles, interests and writing
skills. We then train a number of standard baselines to predict upcoming actions, a task that
requires awareness of the context. In particular, we test a number of generic models, from a
simple logistic regression to neural models. Experiments shed some light on their strengths
and weaknesses and how these are related to the frequency of each action, the existence of
other semantically related actions and the length of the input story.

2       HPAC: The Harry Potter’s Action prediction Corpus
To build an action prediction corpus, we need to: (1) consider the set of actions, and (2)
collect data where these occur. Data should come from different users, to approximate a real
natural language task. Also, it needs to be annotated, determining that a piece of text ends
up triggering an action. These tasks are however time consuming, as they require annotators
to read vast amounts of large texts. In this context, machine comprehension resources usually
establish a compromise between their complexity and the costs of building them (Hermann et
al., 2015; Kocisky et al., 2018).

2.1      Domain motivation
We rely on an intuitive idea that uses transcripts from the Harry Potter world to build up a
corpus for textual action prediction. The domain has a set of desirable properties to
evaluate reading comprehension systems, which we now review.
   Harry Potter novels define a variety of spells. These are keywords cast by witches and
wizards to achieve purposes, such as turning on a light (‘Lumos’), unlocking a door
(‘Alohomora’) or killing (‘Avada Kedavra’). They abstract complex and non-ambiguous actions.
Their use also makes it possible to build an automatic and self-annotated corpus for action
prediction. The moment a spell occurs in a text represents a response to the environment, and
hence, it can be used to label the preceding text fragment as a scene description that ends
up triggering that action. Table 1 illustrates it with some examples from the original books.
   This makes it possible to consider texts from the magic world of Harry Potter as the
domain for the action prediction corpus, and the spells as the set of eligible actions.1
Determining the length of the preceding context, namely snippet, that has to be considered as
the scene description is however not trivial. This paper considers experiments (§4) using
snippets with the 32, 64, 96 and 128 previous tokens to an action. We provide the needed
scripts to rebuild the corpus using arbitrary lengths.2

2.2      Data crawling
The number of occurrences of spells in the original Harry Potter books is small (432
occurrences), which makes it difficult to train and test a machine learning model. However,
the amount of available fan fiction for this saga makes it possible to create a large corpus.
For HPAC, we used fan fiction (and only fan fiction texts) from
https://www.fanfiction.net/book/Harry-Potter/ and a version of the crawler by Milli and
Bamman (2016).3 We collected Harry Potter stories written in English and marked with the
status ‘completed’. From these we extracted a total of 82 836 spell occurrences, which we
used to obtain the scene descriptions. Table 2 details the statistics of the corpus (see also
Appendix A). Note that, similar to Twitter corpora, fan fiction stories can be deleted over
time by users or admins, causing losses in the dataset.4

Preprocessing We tokenized the samples with Stanford CoreNLP (Manning et al., 2014) and merged
the occurrences of multi-word spells into a single token.

3       Models
This work addresses the task as a classification problem, and in particular as a sequence to
label classification problem. For this reason, we rely on standard models used for this type
of task: multinomial logistic regression, a multi-layered perceptron, convolutional neural
networks and long short-term memory networks. We outline the essentials of each of these
models, but will treat them as black boxes. In a related line, Kaushik and Lipton (2018)
discuss the need to provide rigorous baselines that help better understand the improvement
coming from future and complex models, and also the need not to demand architectural novelty
when introducing new datasets.
   Although not done in this work, an alternative (but also natural) way to address the task
is as a

    1 Note that the corpus is built in an automatic way and some occurrences might not
correspond to actions, but for example, to a description of the spell or even some false
positive samples. Related to this, we have not censored the content of the stories, so some
of them might contain adult content.
    2 https://github.com/aghie/hpac
    3 Due to the website’s Terms of Service, the corpus cannot be directly released.
    4 They can also be modified, making it unfeasible to retrieve some of the samples.
Text fragment                                                                                                        Action
 Ducking under Peeves, they ran for their lives, right to the end of the corridor where they slammed into a door      Unlock the
 - and it was locked. ‘This is it!’ Ron moaned, as they pushed helplessly at the door, ‘We’re done for! This is       door
 the end!’ They could hear footsteps, Filch running as fast as he could toward Peeves’s shouts. ‘Oh, move over’,
 Hermione snarled. She grabbed Harry’s wand, tapped the lock, and whispered, ‘Alohomora’.
 And then, without warning, Harry’s scar exploded with pain. It was agony such as he had never felt in all his        Kill a target
 life; his wand slipped from his fingers as he put his hands over his face; his knees buckled; he was on the ground
 and he could see nothing at all; his head was about to split open. From far away, above his head, he heard a
 high, cold voice say, ‘Kill the spare.’ A swishing noise and a second voice, which screeched the words to the
 night: ‘Avada Kedavra’
 Harry felt himself being pushed hither and thither by people whose faces he could not see. Then he heard Ron         Turn on a
 yell with pain. ‘What happened?’ said Hermione anxiously, stopping so abruptly that Harry walked into her.           light
 ‘Ron, where are you? Oh, this is stupid’ - ‘Lumos’

      Table 1: Examples from the Harry Potter books showing how spells map to reactions to the environment.
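
The self-annotation idea of Section 2.1 (a spell occurrence labels the text that precedes it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released corpus scripts: the names `merge_spells` and `extract_samples`, the whitespace tokenisation, the underscore used to join multi-word spells and the three-spell inventory are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical mini-inventory for the example; the corpus itself covers 85 spells.
SPELLS = {"alohomora", "lumos", "avada kedavra"}


def merge_spells(tokens: List[str], spells: Iterable[str]) -> List[str]:
    """Merge each multi-word spell into a single token, e.g. 'avada_kedavra'."""
    multi = sorted((s.split() for s in spells if " " in s), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for words in multi:
            window = [t.lower() for t in tokens[i:i + len(words)]]
            if window == words:
                merged.append("_".join(words))
                i += len(words)
                break
        else:  # no multi-word spell starts at position i
            merged.append(tokens[i])
            i += 1
    return merged


def extract_samples(tokens: List[str], spells: Iterable[str],
                    size: int) -> List[Tuple[List[str], str]]:
    """Pair every spell occurrence with the `size` tokens that precede it."""
    labels = {s.replace(" ", "_") for s in spells}
    tokens = merge_spells(tokens, spells)
    samples = []
    for i, tok in enumerate(tokens):
        if tok.lower() in labels:
            snippet = tokens[max(0, i - size):i]  # scene description
            samples.append((snippet, tok.lower()))  # (snippet, action)
    return samples


text = ("She grabbed the wand , tapped the lock , and whispered , Alohomora . "
        "Then a voice screeched Avada Kedavra")
samples = extract_samples(text.split(), SPELLS, size=4)
for snippet, action in samples:
    print(action, "<-", " ".join(snippet))
```

Changing `size` plays the role of the snippet length s used in the experiments (32, 64, 96 or 128 tokens).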

  Statistics                Training         Dev        Test
  #Actions                        85          83          84
  #Samples                    66 274       8 279       8 283
  #Tokens (s=32)           2 111 180     263 573     263 937
  #Unique tokens (s=32)       33 067      13 075      13 207
  #Tokens (s=128)          8 329 531   1 040 705   1 041 027
  #Unique tokens (s=128)      60 379      25 146      25 285

Table 2: Corpus statistics: s is the length of the snippet.

special case of language modelling, where the output vocabulary is restricted to the size of
the ‘action’ vocabulary. Also, note that the performance for this task is not expected to
achieve a perfect accuracy, as there may be situations where more than one action is
reasonable, and also because writers tell a story playing with elements such as surprise or
uncertainty.
   The source code for the models can be found in the GitHub repository mentioned above.

Notation $w_{1:n}$ denotes a sequence of words $w_1, ..., w_n$ that represents the scene,
with $w_i \in V$. $F_\theta(\cdot)$ is a function parametrized by $\theta$. The task is cast
as $F : V^n \to A$, where $A$ is the set of actions.

3.1    Machine learning models
The input sentence $w_{1:n}$ is encoded as a one-hot vector, $v$ (total occurrence weighting
scheme).

Multinomial Logistic Regression Let $\mathrm{MLR}_\theta(v)$ be an abstraction of a
multinomial logistic regression parametrized by $\theta$, the output for an input $v$ is
computed as the $\arg\max_{a \in A} P(y = a \mid v)$, where $P(y = a \mid v)$ is a softmax
function, i.e.,

$$P(y = a \mid v) = \frac{e^{W_a \cdot v}}{\sum_{a' \in A} e^{W_{a'} \cdot v}}.$$

MultiLayer Perceptron We use one hidden layer with a rectifier activation function
($\mathrm{relu}(x) = \max(0, x)$). The output is computed as
$\mathrm{MLP}_\theta(v) = \mathrm{softmax}(W_2 \cdot \mathrm{relu}(W \cdot v + b) + b_2)$.

3.2      Sequential models
The input sequence is represented as a sequence of word embeddings, $w_{1:n}$, where $w_i$ is
a concatenation of an internal embedding learned during the training process for the word
$w_i$, and a pretrained embedding extracted from GloVe (Pennington et al., 2014)5, that is
further fine-tuned.

Long short-term memory network (Hochreiter and Schmidhuber, 1997): The output for an element
$w_i$ also depends on the output of $w_{i-1}$. The $\mathrm{LSTM}_\theta(w_{1:n})$6 takes as
input a sequence of word embeddings and produces a sequence of hidden outputs, $h_{1:n}$
($h_i$ size set to 128). The last output of the $\mathrm{LSTM}_\theta$, $h_n$, is fed to a
$\mathrm{MLP}_\theta$.

Convolutional Neural Network (LeCun et al., 1995; Kim, 2014). It captures local properties
over continuous slices of text by applying a convolution layer made of different filters. We
use a wide convolution, with a window slice size of length 3 and 250 different filters. The
convolutional layer uses a relu as the activation function. The output is fed to a max
pooling layer, whose output vector is passed again as input to a $\mathrm{MLP}_\theta$.

4       Experiments
Setup All $\mathrm{MLP}_\theta$’s have 128 input neurons and 1 hidden layer. We trained up
to 15 epochs using mini-batches (size=16), Adam (lr=0.001) (Kingma and Ba, 2015) and early
stopping.
   Table 3 shows the macro and weighted F-scores for the models considering different snippet
sizes.7

    5 6B.zip
    6 n is set to be equal to the length of the snippet.
    7 As we have addressed the task as a classification problem, we will use precision,
recall and F-score as the evaluation metrics.
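
The two formulas of Section 3.1 can be made concrete with a small forward-pass sketch in plain Python. It is a toy illustration, not the authors' implementation: the dimensions, the weight values and the helper names (`softmax`, `matvec`, `mlr_probs`, `mlp_probs`) are assumptions, the input `v` is read as a vector of raw word counts (our interpretation of the "total occurrence" weighting), and training is omitted.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    """exp(s_a) / sum over a' of exp(s_a'), shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Matrix-vector product; row a of W plays the role of W_a."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(x: Vector) -> Vector:
    return [max(0.0, xi) for xi in x]


def mlr_probs(W: Matrix, v: Vector) -> Vector:
    """P(y = a | v) for every action a, i.e. softmax(W v)."""
    return softmax(matvec(W, v))


def mlp_probs(W: Matrix, b: Vector, W2: Matrix, b2: Vector, v: Vector) -> Vector:
    """softmax(W2 . relu(W . v + b) + b2) with one hidden layer."""
    hidden = relu([h + bi for h, bi in zip(matvec(W, v), b)])
    return softmax([o + bi for o, bi in zip(matvec(W2, hidden), b2)])


# Toy setup (assumed): vocabulary of 4 words, 3 actions, 2 hidden units.
v = [2.0, 0.0, 1.0, 0.0]  # word counts for one snippet
W_mlr = [[0.5, 0.0, 0.1, 0.0],
         [0.0, 0.3, 0.0, 0.2],
         [-0.2, 0.0, 0.4, 0.0]]
p_mlr = mlr_probs(W_mlr, v)
mlr_action = max(range(len(p_mlr)), key=p_mlr.__getitem__)  # arg max over actions

W = [[0.1, 0.0, -0.3, 0.0], [0.2, 0.1, 0.0, 0.0]]
b = [0.0, 0.1]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b2 = [0.0, 0.0, 0.0]
p_mlp = mlp_probs(W, b, W2, b2, v)
mlp_action = max(range(len(p_mlp)), key=p_mlp.__getitem__)
```

In the sequential models, the same `mlp_probs`-style layer would sit on top of the last LSTM output or of the max-pooled convolution features instead of the count vector.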
To diminish the impact of random seeds and local minima in neural networks, results are
averaged across 5 runs.8 ‘Base’ is a majority-class model that maps everything to ‘Avada
Kedavra’, the most common action in the training set. This helps test whether the models
predict above chance performance. When using short snippets (size=32), disparate models such
as our MLR, MLP and LSTMs achieve a similar performance. As the snippet size is increased,
the LSTM-based approach shows a clear improvement on the weighted scores9, something that
happens only marginally for the rest. However, from Table 3 it is hard to find out what the
approaches are actually learning to predict.

                          Macro                  Weighted
  Snippet  Model      P      R      F        P      R      F
     -     Base      0.1    1.2    0.2      1.3   11.5    2.4
    32     MLR      18.7   11.6   13.1     28.9   31.4   28.3
           MLP      19.1    9.8   10.3     31.7   32.1   28.0
           LSTM     13.7    9.7    9.5     29.1   32.2   28.6
           CNN       9.9    7.8    7.3     24.6   29.2   24.7
    64     MLR      20.6   12.3   13.9     29.9   32.1   29.0
           MLP      17.9    9.5    9.8     31.2   32.7   27.9
           LSTM     13.3   10.3   10.2     30.3   33.9   30.4
           CNN       9.8    7.8    7.4     25.0   29.9   25.4
    96     MLR      20.4   13.3   14.6     30.3   32.0   29.3
           MLP      16.9    9.5    9.8     30.2   32.6   27.8
           LSTM     14.0   10.5   10.3     30.6   34.5   30.7
           CNN      10.2    7.1    6.9     25.2   29.4   24.4
   128     MLR      19.6   12.1   12.9     30.0   31.7   28.2
           MLP      18.9    9.9   10.3     31.4   32.9   28.0
           LSTM     14.4   10.5   10.5     31.3   35.1   31.1
           CNN       8.8    7.8    7.1     24.8   30.2   25.0

 Table 3: Macro and weighted F-scores over 5 runs.

   To shed some light, Table 4 shows their performance according to a ranking metric, recall
at k. The results show that the LSTM-based approach is the top performing model, but the MLP
obtains just slightly worse results. Recall at 1 is in both cases low, which suggests that
the task is indeed complex and that using just LSTMs is not enough. It is also possible to
observe that even if the models have difficulty correctly predicting the action as a first
option, they develop a certain sense of the scene and consider the right one among their top
choices. Table 5 delves into this by splitting the performance of the model into infrequent
and frequent actions (above the average, i.e. those that occur more than 98 times in the
training set, a total of 20 actions). There is a clear gap between the performance on these
two groups of actions, with a ∼50 points difference in recall at 5. Also, a simple logistic
regression performs similarly to the LSTM on the infrequent actions.

  Snippet  Model     R@1    R@2    R@5   R@10
     -     Base     11.5     -      -      -
    32     MLR      31.4   43.7   60.3   73.5
           MLP      32.1   44.3   61.5   74.9
           LSTM     32.2   44.3   61.5   74.7
           CNN      29.2   41.1   58.1   71.6
    64     MLR      32.1   44.9   61.9   74.3
           MLP      32.7   46.0   63.5   76.6
           LSTM     33.9   46.1   63.1   75.7
           CNN      29.9   41.8   59.0   72.2
    96     MLR      32.0   44.5   60.7   74.6
           MLP      32.6   45.6   63.4   76.6
           LSTM     34.5   46.9   63.7   76.1
           CNN      29.3   41.9   59.5   72.8
   128     MLR      31.7   44.5   61.0   74.3
           MLP      32.9   45.8   63.2   76.9
           LSTM     35.1   47.4   64.4   76.9
           CNN      30.2   42.3   59.6   72.8

 Table 4: Averaged recall at k over 5 runs.

                          Frequent               Infrequent
  Snippet  Model     Fwe    R@1    R@5      Fwe    R@1    R@5
     -     Base      3.7   14.5     -       0.0    0.0     -
    32     MLR      35.8   37.1   70.5     14.8    9.5   23.0
           MLP      35.9   38.1   71.9     13.2    9.4   21.8
           LSTM     37.1   38.4   71.6     11.7    8.6   23.0
           CNN      33.1   35.5   69.3      7.1    5.2   15.2
    64     MLR      36.7   37.9   71.8     14.9    9.9   24.0
           MLP      36.4   39.2   74.5     11.0    7.9   21.6
           LSTM     39.2   40.3   73.0     12.4    9.4   25.4
           CNN      33.9   36.4   70.6      6.9    5.2   15.1
    96     MLR      36.4   37.4   70.1     17.1   11.7   25.1
           MLP      36.2   39.1   74.0     11.0    7.9   23.1
           LSTM     39.6   41.1   73.7     12.4    9.6   25.8
           CNN      32.7   35.8   71.6      6.3    4.8   13.7
   128     MLR      35.4   37.2   70.5     15.4   10.7   25.0
           MLP      36.5   39.5   74.0     11.1    8.2   22.3
           LSTM     40.3   41.9   74.4     12.3    9.5   26.2
           CNN      33.7   36.9   71.4      6.5    5.0   14.6

 Table 5: Performance on frequent (those that occur above the average) and infrequent
 actions.

Error analysis10 Some of the misclassifications made by the LSTM approach were semantically
related actions and counter-actions. For example, ‘Colloportus’ (to close a door) was never
predicted. The most common misclassification (14 out of 41) was ‘Alohomora’ (to unlock a
door), which was 5 times more frequent in the training corpus. Similarly, ‘Nox’ (to
extinguish the light from a wand) was correctly predicted 6 times, while 36
misclassifications correspond to ‘Lumos’ (to light a place using a wand), which was 6 times
more frequent in the training set. Other less frequent spells that denote vision and guidance
actions, such as ‘Point me’ (the wand acts as a compass pointing North) and ‘Homenum revelio’
(to reveal a human presence) were also mainly misclassified as ‘Lumos’. This is an indicator
that the LSTM approach has difficulty disambiguating among semantically related actions,
especially if their occurrence was unbalanced in the training set. This issue is in line with
the tendency observed for recall at k. Spells intended for much more specific purposes,
according to the books, obtained a performance significantly higher than the average, e.g.
F-score(‘Riddikulus’)=63.54, F-score(‘Expecto Patronum’)=55.49 and
F-score(‘Obliviate’)=47.45.

     8 Some macro F-scores do not lie within the Precision and Recall due to this issue.
     9 For each label, we compute their average, weighted by the number of true instances for
each label. The F-score might not lie between precision and recall.
    10 Made over one of the runs from the LSTM-based approach and setting the snippet size to
128 tokens.

[...] with this dataset could be transferred to real-world actions (i.e. real-domain setups),
or if such transfer is not possible and a model needs to be trained from scratch.

Acknowledgments
This work has received support from the TELEPARES-UDC project (FFI2014-51978-C2-2-R) and the
ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, and from Xunta de Galicia (ED431B
2017/01), and from the European Research Council (ERC), under the European Union’s Horizon
2020 research and innovation programme (FASTPARSE, grant agreement No 714150).
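
Recall at k, the ranking metric reported in Tables 4 and 5, counts a sample as correct when the gold action is among the k highest-scoring actions. A minimal sketch follows; the function name `recall_at_k` and the toy scores are assumptions made for illustration and are not values from the experiments.

```python
from typing import Dict, List


def recall_at_k(scores: List[Dict[str, float]], gold: List[str], k: int) -> float:
    """Fraction of samples whose gold action is among the k top-scored actions."""
    hits = 0
    for sample_scores, label in zip(scores, gold):
        # Rank the candidate actions of this sample from most to least likely.
        ranked = sorted(sample_scores, key=sample_scores.get, reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(gold)


# Hypothetical model outputs for three snippets.
scores = [
    {"lumos": 0.6, "nox": 0.3, "alohomora": 0.1},
    {"lumos": 0.5, "nox": 0.4, "alohomora": 0.1},
    {"lumos": 0.7, "alohomora": 0.2, "colloportus": 0.1},
]
gold = ["lumos", "nox", "colloportus"]

r1 = recall_at_k(scores, gold, k=1)
r2 = recall_at_k(scores, gold, k=2)
r3 = recall_at_k(scores, gold, k=3)
```

In this toy case the rarer counterparts (‘Nox’, ‘Colloportus’) are only recovered at larger k, which mirrors the kind of confusion with a more frequent related spell described in the error analysis.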
Action                  #Training   #Dev   #Test   Action                   #Training   #Dev   #Test
AVADA KEDAVRA                7937    986    954    CRUCIO                        7852    931    980
ACCIO                        4556    595    562    LUMOS                         4159    505    531
STUPEFY                      3636    471    457    OBLIVIATE                     3200    388    397
EXPELLIARMUS                 2998    377    376    LEGILIMENS                    1938    237    247
EXPECTO PATRONUM             1796    212    242    PROTEGO                       1640    196    229
SECTUMSEMPRA                 1596    200    189    ALOHOMORA                     1365    172    174
INCENDIO                     1346    163    186    SCOURGIFY                     1317    152    166
REDUCTO                      1313    171    163    IMPERIO                       1278    159    144
WINGARDIUM LEVIOSA           1265    158    154    PETRIFICUS TOTALUS            1253    175    134
SILENCIO                     1145    153    136    REPARO                        1124    159    137
MUFFLIATO                    1005    108     92    AGUAMENTI                      796     84     86
FINITE INCANTATEM             693     90     75    INCARCEROUS                    686     99     87
NOX                           673     82     80    RIDDIKULUS                     655     81     88
DIFFINDO                      565     90     82    IMPEDIMENTA                    552     88     79
LEVICORPUS                    535     63     68    EVANESCO                       484     53     59
SONORUS                       454     66     73    POINT ME                       422     57     69
EPISKEY                       410     55     59    CONFRINGO                      359     52     48
ENGORGIO                      342     52     41    COLLOPORTUS                    269     26     41
RENNERVATE                    253     24     33    PORTUS                         238     22     31
TERGEO                        235     23     26    MORSMORDRE                     219     29     38
EXPULSO                       196     23     20    HOMENUM REVELIO                188     30     24
MOBILICORPUS                  176     20     14    RELASHIO                       174     20     27
LOCOMOTOR                     172     24     19    AVIS                           166     17     29
RICTUSEMPRA                   159     16     26    IMPERVIUS                      149     26     13
OPPUGNO                       144     18       7   FURNUNCULUS                    137     20     20
SERPENSORTIA                  133     14     15    CONFUNDO                       130     17     21
LOCOMOTOR MORTIS              127     14     15    TARANTALLEGRA                  126     11     17
REDUCIO                       117     13     22    QUIETUS                        108     15     17
LANGLOCK                       99     12     19    GEMINIO                         78      5     10
FERULA                         78      6     10    ORCHIDEOUS                      76      7       5
DENSAUGEO                      67     13       8   LIBERACORPUS                    63      7       5
APARECIUM                      63     14     10    ANAPNEO                         62      6       5
FLAGRATE                       59      4     11    DELETRIUS                       59     12       6
OBSCURO                        57     11       7   PRIOR INCANTATO                 56      4       3
DEPRIMO                        51      2       2   SPECIALIS REVELIO               50     11       6
WADDIWASI                      45      5       8   PROTEGO TOTALUM                 44      9       5
DURO                           36      4       4   SALVIO HEXIA                    36      8       5
DEFODIO                        34      2       6   PIERTOTUM LOCOMOTOR             30      4       3
GLISSEO                        26      4       3   MOBILIARBUS                     25      3       4
REPELLO MUGGLETUM              23      2       5   ERECTO                          23      7       5
CAVE INIMICUM                  19      5       2   DESCENDO                        19      0       1
PROTEGO HORRIBILIS             18      7       5   METEOLOJINX RECANTO             10      3       1
PESKIPIKSI PESTERNOMI           7      0       0
                          Table 6: Label distribution for the HPAC corpus
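
Because the label distribution in Table 6 is heavily skewed, macro and weighted averages of per-action scores can differ considerably, and an averaged F-score need not lie between the averaged precision and recall, as the footnotes to Table 3 point out. The sketch below illustrates both effects with made-up counts; `prf` and `averaged` are hypothetical helper names and the two labels are placeholders, not corpus statistics.

```python
from typing import Dict, Tuple

# label -> (true positives, false positives, false negatives); toy numbers.
counts: Dict[str, Tuple[int, int, int]] = {
    "frequent_spell": (90, 60, 10),
    "rare_spell": (1, 0, 9),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F-score of a single label."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def averaged(counts: Dict[str, Tuple[int, int, int]],
             weighted: bool) -> Tuple[float, ...]:
    """Macro average, or average weighted by the number of true instances."""
    support = {lab: tp + fn for lab, (tp, fp, fn) in counts.items()}
    total = sum(support.values()) if weighted else len(counts)
    sums = [0.0, 0.0, 0.0]
    for lab, c in counts.items():
        w = support[lab] if weighted else 1
        for i, value in enumerate(prf(*c)):
            sums[i] += w * value
    return tuple(s / total for s in sums)


macro_p, macro_r, macro_f = averaged(counts, weighted=False)
weighted_p, weighted_r, weighted_f = averaged(counts, weighted=True)
print(f"macro    P={macro_p:.3f} R={macro_r:.3f} F={macro_f:.3f}")
print(f"weighted P={weighted_p:.3f} R={weighted_r:.3f} F={weighted_f:.3f}")
```

With these toy counts the macro F-score falls below both the macro precision and the macro recall, because it averages per-label F-scores rather than being computed from the averaged precision and recall.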
