Narrative Modeling with Memory Chains and Semantic Supervision
Fei Liu, Trevor Cohn, Timothy Baldwin
School of Computing and Information Systems
The University of Melbourne, Victoria, Australia
fliu3@student.unimelb.edu.au, t.cohn@unimelb.edu.au, tb@ldwin.net

arXiv:1805.06122v1 [cs.CL] 16 May 2018

Abstract

Story comprehension requires a deep semantic understanding of the narrative, making it a challenging task. Inspired by previous studies on the ROC Story Cloze Test, we propose a novel method, tracking various semantic aspects with external neural memory chains while encouraging each to focus on a particular semantic aspect. Evaluated on the task of story ending prediction, our model demonstrates superior performance to a collection of competitive baselines, setting a new state of the art.[1]

[1] Code available at http://github.com/liufly/narrative-modeling.

Context: Sam loved his old belt. He matched it with everything. Unfortunately he gained too much weight. It became too small.
Coherent Ending: Sam went on a diet.
Incoherent Ending: Sam was happy.

Figure 1: Story Cloze Test example.

1 Introduction

Story narrative comprehension has been a long-standing challenge in artificial intelligence (Winograd, 1972; Turner, 1994; Schubert and Hwang, 2000). The difficulties of this task arise from the necessity of understanding not only narratives, but also commonsense and normative social behaviour (Charniak, 1972). Of particular interest in this paper is the work by Mostafazadeh et al. (2016) on understanding commonsense stories in the form of a Story Cloze Test: given a short story, we must predict the most coherent sentential ending from two options (e.g. see Figure 1).

Many attempts have been made to solve this problem, based either on linear classifiers with handcrafted features (Schwartz et al., 2017; Chaturvedi et al., 2017), or representation learning via deep learning models (Mihaylov and Frank, 2017; Bugert et al., 2017; Mostafazadeh et al., 2017). Another widely used component of competitive systems is language model-based features, which require training on large corpora in the story domain.

The current state-of-the-art approach of Chaturvedi et al. (2017) is based on understanding the context from three perspectives: (1) event sequence, (2) sentiment trajectory, and (3) topic consistency. Chaturvedi et al. (2017) adopt external tools to recognise relevant aspect-triggering words, and manually design features to incorporate them into the classifier. While identifying triggers has been made easy by the use of various linguistic resources, crafting such features is time consuming and requires domain-specific knowledge along with repeated experimentation.

Inspired by the argument for tracking the dynamics of events, sentiment and topic, we propose to address the issues identified above with multiple external memory chains, each responsible for a single aspect. Building on Recurrent Entity Networks (EntNets), shown by Henaff et al. (2017) to outperform LSTMs on reasoning-focused question answering and cloze-style reading comprehension, we introduce a novel multi-task learning objective, encouraging each chain to focus on a particular aspect. While still making use of external linguistic resources, we do not extract or design features from them but rather utilise such tools to generate labels. The generated labels are then used to guide training so that each chain focuses on tracking a particular aspect. At test time, our model is free of feature engineering such that, once trained, it can be easily deployed to unseen data without preprocessing. Moreover, our approach also differs in the lack of a ROC Stories language model component, eliminating the need for large, domain-specific training corpora.
Evaluated on the Story Cloze Test, our model outperforms a collection of competitive baselines, achieving state-of-the-art performance under more modest data requirements.

2 Methodology

In the story cloze test, given a story of length L, consisting of a sequence of context sentences c = {c_1, c_2, ..., c_L}, we are interested in predicting the coherent ending to the story out of two ending options o_1 and o_2. Following previous studies (Schwartz et al., 2017), we frame this problem as a binary classification task. Assuming o_1 and o_2 are the logical and inconsistent endings respectively, with their associated labels being y_1 and y_2, we assign y_1 = 1 and y_2 = 0. At test time, given a pair of possible endings, the system returns the one with the higher score as the prediction. In this section, we first describe the model architecture and then detail how we identify aspect-triggering words in text and incorporate them in training.

2.1 Proposed Model

First, to take neighbouring contexts into account, we process context sentences and ending options at the word level with a bi-directional gated recurrent unit ("Bi-GRU": Chung et al. (2014)): \overrightarrow{h}_i = \overrightarrow{\mathrm{GRU}}(w_i, \overrightarrow{h}_{i-1}), where w_i is the embedding of the i-th word in the story or ending option. The backward hidden representation \overleftarrow{h}_i is computed analogously but with a different set of GRU parameters. We simply take the sum h_i = \overrightarrow{h}_i + \overleftarrow{h}_i as the representation at time i. An ending option, denoted o, is represented by taking the sum of the final states of the same Bi-GRU over the ending option word sequence.

This representation is then taken as input to a Recurrent Entity Network ("EntNet": Henaff et al. (2017)), capable of tracking the state of the world with external memory. Developed in the context of machine reading comprehension, EntNet maintains a number of "memory chains" — one for each entity — and dynamically updates their representations as it progresses through a story. Here, we borrow the concept of "entity" and use each chain to track a unique aspect. An illustration of EntNet is provided in Figure 2.

Figure 2: Illustration of EntNet with a single memory chain at time i. \phi and \sigma represent Equations 1 and 2, while circled nodes L, C, \odot and + depict the location term, content term, Hadamard product, and addition, respectively.

Memory update candidate. At every time step i, the internal memory update candidate \overrightarrow{\tilde{m}}^j_i for the j-th memory chain is computed as:

    \overrightarrow{\tilde{m}}^j_i = \phi(\overrightarrow{U} \overrightarrow{m}^j_{i-1} + \overrightarrow{V} k^j + \overrightarrow{W} h_i)    (1)

where k^j is the embedding for the j-th entity (key), \overrightarrow{U}, \overrightarrow{V} and \overrightarrow{W} are trainable parameters, and \phi is the parametric ReLU (He et al., 2015).

Memory update. The update of each memory chain is controlled by a gating mechanism:

    \overrightarrow{g}^j_i = \sigma(h_i \cdot k^j + h_i \cdot \overrightarrow{m}^j_{i-1})    (2)
    \overrightarrow{\mathring{m}}^j_i = \overrightarrow{m}^j_{i-1} + \overrightarrow{g}^j_i \odot \overrightarrow{\tilde{m}}^j_i    (3)

where \odot denotes the Hadamard product (resulting in the unnormalised memory representation \overrightarrow{\mathring{m}}^j_i), and \sigma is the sigmoid function. The gate \overrightarrow{g}^j_i controls how much the memory chain should be updated, a decision factoring in two elements: (1) the "location" term, quantifying how related the current input h_i (the output of the Bi-GRU at time i) is to the key k^j of the entity being tracked; and (2) the "content" term, measuring the similarity between the input and the current state \overrightarrow{m}^j_{i-1} of the entity tracked by the j-th memory chain.

Normalisation. We normalise each memory representation: \overrightarrow{m}^j_i = \overrightarrow{\mathring{m}}^j_i / \|\overrightarrow{\mathring{m}}^j_i\|, where \|\overrightarrow{\mathring{m}}^j_i\| denotes the Euclidean norm of \overrightarrow{\mathring{m}}^j_i. In doing so, we allow the model to forget: as \overrightarrow{m}^j_i is constrained to lie on the surface of the unit sphere, adding the new information carried by \overrightarrow{\tilde{m}}^j_i and then normalising inevitably causes forgetting, in that the cosine distance between the original and updated memory decreases.
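To make the chain dynamics concrete, the following is a minimal NumPy sketch of a single forward-direction update step (Equations 1–3 plus normalisation). It is an illustrative sketch only: the parameters are random, the parametric ReLU slope is fixed rather than learned, and the variable names are ours rather than taken from the released code.

```python
import numpy as np

def prelu(x, alpha=0.1):
    # parametric ReLU; the slope is learned in the paper, fixed here for brevity
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_step(m_prev, k, h, U, V, W):
    """One forward-direction update of a single memory chain (Eqs. 1-3 plus normalisation).
    m_prev, k, h are d-dimensional vectors; U, V, W are d x d matrices."""
    m_tilde = prelu(U @ m_prev + V @ k + W @ h)       # Eq. (1): update candidate
    g = sigmoid(h @ k + h @ m_prev)                   # Eq. (2): location + content gate (scalar)
    m_ring = m_prev + g * m_tilde                     # Eq. (3): gated, unnormalised memory
    m_new = m_ring / (np.linalg.norm(m_ring) + 1e-8)  # project back onto the unit sphere
    return m_new, g

# toy usage with random parameters
d = 4
rng = np.random.default_rng(0)
m, k, h = rng.normal(size=(3, d))
U, V, W = rng.normal(size=(3, d, d))
m, g = memory_step(m, k, h, U, V, W)
```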
Bi-directionality. In contrast to EntNet, we apply the above steps in both directions, scanning a story both forward and backward, with arrows denoting the processing direction. \overleftarrow{m}^j_i is computed analogously to \overrightarrow{m}^j_i but with a different set of parameters, and m^j_i = \overrightarrow{m}^j_i + \overleftarrow{m}^j_i. We further define g^j_i to be the average of the gate values of both directions at time i for the j-th chain:

    g^j_i = (\overrightarrow{g}^j_i + \overleftarrow{g}^j_i) / 2    (4)

which will be used for semantic supervision.

Final classifier. The final prediction \hat{y} for an ending option given its context story is performed by incorporating the last states of all memory chains in the form of a weighted sum u:

    p^j = \mathrm{softmax}((k^j)^\top W_{att} o)    (5)
    u = \sum_j p^j m^j_T    (6)

where T denotes the total number of words in a story and W_{att} is a trainable weight matrix. We then transform u to get the final prediction:

    \hat{y} = \sigma(R \phi(H u + o))    (7)

where R and H are trainable weight matrices.
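The scoring step can likewise be sketched in a few lines of NumPy. This is a hedged illustration under our own naming, assuming d-dimensional vectors and J memory chains; the gate average of Equation 4 is used only in the training loss (Section 2.3), so it does not appear here.

```python
import numpy as np

def prelu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score_ending(m_fwd_T, m_bwd_T, keys, o, W_att, H, R):
    """Score one ending option o given the final memory states.
    m_fwd_T, m_bwd_T, keys: (J, d) arrays; o: (d,); W_att, H: (d, d); R: (d,)."""
    m_T = m_fwd_T + m_bwd_T                 # combine the two directions, as for m^j_i
    p = softmax(keys @ W_att @ o)           # Eq. (5): attention weight per chain
    u = p @ m_T                             # Eq. (6): weighted sum of final memories
    y_hat = sigmoid(R @ prelu(H @ u + o))   # Eq. (7): coherence score in (0, 1)
    return float(y_hat)

# at test time, the ending with the higher score is returned as the prediction:
# prediction = max([o1, o2], key=lambda o: score_ending(m_fwd_T, m_bwd_T, keys, o, W_att, H, R))
```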
2.2 Semantically Motivated Memory Chains

In order to encourage each chain to focus on a particular aspect, we supervise the activation of each update gate (Equation 2) with binary labels. Intuitively, for the sentiment chain, a sentiment-bearing word (e.g. amazing) receives a label of 1, allowing the model to open the gate and therefore change the internal state relevant to this aspect, while a neutral one (e.g. school) should not trigger the activation, with an assigned label of 0. To achieve this, we use three memory chains to independently track each of: (1) event sequence, (2) sentiment trajectory, and (3) topical consistency. To supervise the memory update gates of these chains, we design three sequences of binary labels: l^j = {l^j_1, l^j_2, ..., l^j_T} for j ∈ [1, 3], representing event, sentiment, and topic, with l^j_i ∈ {0, 1}. The label at time i for the j-th aspect is only assigned a value of 1 if the word is a trigger for that particular aspect: l^j_i = 1; otherwise l^j_i = 0. During training, we utilise such sequences of binary labels to supervise the memory update gate activations of each chain. Specifically, each chain is encouraged to open its gate only when it sees a trigger bearing information semantically sensitive to that particular aspect. Note that at test time, we do not apply such supervision. This effectively becomes a binary tagging task for each memory chain independently and results in individual memory chains being sensitive to only a specific set of triggers bearing one of the three types of signals.

While largely inspired by Chaturvedi et al. (2017), our approach differs in how we extract and use such features. Also note that, while still making use of external linguistic resources to detect trigger words, our approach lets the memory chains decide how such words should influence the final prediction, as opposed to the handcrafted conditional probability features of Chaturvedi et al. (2017).

To identify the trigger words, we use external linguistic tools, and mark trigger words for each aspect with a label of 1. An example is presented in Table 1, noting that the same word can act as a trigger for multiple aspects.

                   Ricky  fell  while  hiking  in  the  woods
    l^Event          0     1     0       1     0    0     1
    l^Sentiment      0     1     0       0     0    0     0
    l^Topic          1     1     0       1     0    0     1

Table 1: An example of the linguistically motivated memory chain supervision binary labels.

Event sequence. We parse each sentence into its FrameNet representation with SEMAFOR (Das et al., 2010), and identify each frame target (word or phrase tokens evoking a frame).

Sentiment trajectory. Following Chaturvedi et al. (2017), we utilise a pre-compiled list of sentiment words (Liu et al., 2005). To take negation into account, we parse each sentence with the Stanford Core NLP dependency parser (Manning et al., 2014; Chen and Manning, 2014) and include negation words as trigger words.

Topical consistency. We process each sentence with the Stanford Core NLP POS tagger and identify nouns and verbs, following Chaturvedi et al. (2017).
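The label construction can be pictured with the sketch below. To keep it self-contained, the external tools named above (SEMAFOR, the Liu et al. (2005) lexicon, and the Stanford Core NLP tagger/parser) are replaced by toy placeholder functions and word sets of our own; the real pipeline queries those tools instead. On the Table 1 sentence the placeholders happen to reproduce the labels shown there.

```python
# Placeholder sets/functions stand in for the external tools; the real pipeline
# queries SEMAFOR, the sentiment lexicon and the Stanford Core NLP tagger/parser.
SENTIMENT_WORDS = {"loved", "unfortunately", "happy", "fell"}  # toy lexicon stand-in
NEGATION_WORDS = {"not", "never", "no"}                        # toy stand-in for parsed negations

def frame_targets(tokens):
    # stand-in for SEMAFOR frame-target identification
    return {"fell", "hiking", "woods"}

def nouns_and_verbs(tokens):
    # stand-in for the Stanford Core NLP POS tagger
    return {"ricky", "fell", "hiking", "woods"}

def make_labels(tokens):
    """Return one binary label per token for each supervised chain (event, sentiment, topic)."""
    events, topics = frame_targets(tokens), nouns_and_verbs(tokens)
    labels = {"event": [], "sentiment": [], "topic": []}
    for tok in tokens:
        t = tok.lower()
        labels["event"].append(int(t in events))
        labels["sentiment"].append(int(t in SENTIMENT_WORDS or t in NEGATION_WORDS))
        labels["topic"].append(int(t in topics))
    return labels

print(make_labels("Ricky fell while hiking in the woods".split()))
# {'event': [0, 1, 0, 1, 0, 0, 1], 'sentiment': [0, 1, 0, 0, 0, 0, 0], 'topic': [1, 1, 0, 1, 0, 0, 1]}
```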
2.3 Training Loss

In addition to the cross entropy loss of the final prediction of right/wrong endings, we also take into account the memory update gate supervision of each chain by adding a second term. More formally, the model is trained to minimise the loss:

    L = \mathrm{XEntropy}(y, \hat{y}) + \alpha \sum_{i,j} \mathrm{XEntropy}(l^j_i, g^j_i)

where \hat{y} and g^j_i are defined in Equations 7 and 4 respectively, y is the gold label for the current ending option o, and l^j_i is the semantic supervision binary label at time i for the j-th memory chain. In our experiments, we empirically set \alpha to 0.5 based on development data.
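A compact sketch of this objective for a single (story, ending) pair, under our own naming and assuming the gate activations and labels have already been collected into arrays:

```python
import numpy as np

def xent(y, p, eps=1e-8):
    # (binary) cross entropy between gold labels y in {0, 1} and probabilities p
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def total_loss(y, y_hat, gate_labels, gate_values, alpha=0.5):
    """Training loss for one (story, ending) pair.
    gate_labels, gate_values: (T, 3) arrays holding l^j_i and the averaged gates g^j_i."""
    ending_loss = xent(y, y_hat)                     # supervise the final prediction
    gate_loss = xent(gate_labels, gate_values).sum() # supervise gates over time steps and chains
    return ending_loss + alpha * gate_loss
```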
3 Experiments

3.1 Experimental Setup

Dataset. To test the effectiveness of our model, we employ the Story Cloze Test dataset of Mostafazadeh et al. (2016). The development and test sets each consist of 1,871 4-sentence stories, each with a pair of ending options. Consistent with previous studies, we split the development set into a training and a validation set (for early stopping), resulting in 1,683 and 188 stories respectively. Note that while most current approaches make use of the much larger training set, comprised of 100K 5-sentence ROC stories (with coherent endings only, also released as part of the dataset), to train a language model, we make no use of this data.

Model configuration. We initialise our model with word2vec embeddings (300-D, pre-trained on 100B words of Google News text, not updated during training: Mikolov et al. (2013a,b)). In addition to the three supervised chains, we also add a "free" chain, unconstrained to any semantic aspect. Training is carried out over 200 epochs with the FTRL optimiser (McMahan et al., 2013), a batch size of 128 and a learning rate of 0.1. We use the following hyper-parameters for weight matrices in both directions: R ∈ R^{300×1}; H, U, V and W are all matrices of size R^{300×300}; and the hidden size of the Bi-GRU is 300. Dropout is applied to the output of \phi in the final classifier (Equation 7) with a rate of 0.2. Moreover, we employ the technique introduced by Gal and Ghahramani (2016), where the same dropout mask is applied at every step to the input w_i to the Bi-GRU and the input h_i to the memory chains, with rates of 0.5 and 0.2 respectively. Lastly, to curb overfitting, we regularise the last layer (Equation 7) with an L2 penalty on its weights: \lambda\|R\| where \lambda = 0.001.
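For reference, the reported hyper-parameters gathered in one place, as a plain summary in code form; the key names are ours for illustration and are not taken from the released implementation.

```python
# Reported hyper-parameters from Section 3.1 (key names are ours, for illustration).
CONFIG = {
    "embedding_dim": 300,        # frozen word2vec embeddings
    "hidden_size": 300,          # Bi-GRU and memory dimensionality
    "num_chains": 4,             # event, sentiment, topic + one free chain
    "optimizer": "FTRL",
    "learning_rate": 0.1,
    "batch_size": 128,
    "epochs": 200,
    "dropout_classifier": 0.2,   # on the output of phi in Eq. (7)
    "dropout_word_input": 0.5,   # same mask at every step (Gal and Ghahramani, 2016)
    "dropout_memory_input": 0.2,
    "l2_lambda": 0.001,          # penalty on R in the last layer
    "alpha": 0.5,                # weight of the gate-supervision loss term
}
```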
Evaluation. Following previous studies, we evaluate performance in terms of coherent ending prediction accuracy: #correct / #total. Here, we report the average accuracy over 5 runs with different random seeds. Echoing Melis et al. (2018), we also observe some variance in model performance even when training is carried out with the same random seed, which is largely due to the non-deterministic ordering of floating-point operations in our environment (TensorFlow (Abadi et al., 2015) with a single GPU). Therefore, to account for this randomness, we further train our model 5 times for each random seed and select the model with the best validation performance.

We benchmark against a collection of strong baselines, including the top-3 systems of the 2017 LSDSem workshop shared task (Mostafazadeh et al., 2017): MSAP (Schwartz et al., 2017), HCM (Chaturvedi et al., 2017)[2], and TBMIHAYLOV (Mihaylov and Frank, 2017). The first two primarily rely on a simple logistic regression classifier, both taking linguistic and probability features from a ROC Stories domain-specific neural language model. TBMIHAYLOV is LSTM-based; we also include DSSM (Mostafazadeh et al., 2016). We additionally implement a bi-directional EntNet (Henaff et al., 2017) with the same hyper-parameter settings as our model and no semantic supervision.[3]

[2] We take the performance reported in a subsequent paper, 3.2% better than the original submission to the shared task.
[3] EntNet is also subject to the same model selection criterion described above.

3.2 Results and Discussion

The experimental results are shown in Table 2.

    Model          Acc.   SD
    DSSM           58.5   —
    TBMIHAYLOV     72.8   —
    MSAP           75.2   —
    HCM            77.6   —
    EntNet†        77.6   ±0.5
    Our model†     78.5   ±0.5

Table 2: Performance on the Story Cloze test set. † = average accuracy over 5 runs, SD = standard deviation, "—" = not reported.

State-of-the-art results. Our model outperforms a collection of strong baselines, setting a new state of the art. Note that this is achieved without any external linguistic resources at test time or domain-specific language model-based probability features, highlighting the effectiveness of the proposed model.

EntNet vs. our model. We see clear improvements of the proposed method over EntNet, an absolute gain of 0.9% in accuracy. This validates the hypothesis that encouraging each memory chain to focus on a unique, well-defined aspect is beneficial.

Ablation study. We further perform an ablation study to analyse the utility of each component, with results in Table 3. The uni-directional variant, with its performance down to 77.8, is inferior to its bi-directional cousin. Removing all semantic supervision (resulting in 4 free memory chains) is also damaging to accuracy (dropping to 77.6). Among the different types of semantic supervision, we see various degrees of performance degradation, with removing sentiment trajectory being the most detrimental, reflecting its value to the task. Interestingly, we observe an improvement when removing event sequence supervision from consideration. We suspect that this is mainly due to the noise introduced by the rather inaccurate FrameNet representation output of SEMAFOR (F1 = 61.4% in frame identification, as reported in Das et al. (2010)). While replacing SEMAFOR with the SemLM approach of Peng and Roth (2016) and Chaturvedi et al. (2017) (with the extended frame definition to include explicit discourse markers, e.g. but) may potentially reduce the amount of noise, we leave this exercise for future work.

    Model                         Acc.   SD
    Our model                     78.5   ±0.5
      −bi-directionality          77.8   ±0.7
      −all semantic supervision   77.6   ±0.5
      −event sequence             78.7   ±0.2
      −sentiment trajectory       75.9   ±0.4
      −topical consistency        77.3   ±0.4
      −free chain                 77.0   ±0.4

Table 3: Ablation study: average accuracy over 5 runs on the Story Cloze test set (all subject to the same model selection criterion as our model).

Discussion. To better understand the benefits of the proposed method, we visualise the learned word representations (output representations of the Bi-GRU, h_i) and the keys for EntNet and our model in Figure 3. With EntNet, while sentiment words form a loose cluster, words bearing event and topic signal are placed in close proximity and are largely indistinguishable. With our model, on the other hand, the degree of separation is much clearer. The intersection of a small portion of the event and topic groups is largely due to the fact that both aspects include verbs. Another interesting perspective is the location of the automatically learned keys (displayed as diamonds): while all the keys with EntNet end up overlapping, indicating little difference among them, the keys learned by our method demonstrate semantic diversity, with each placed within its respective cluster. Note that the free key is learned to complement the event key, a difficult challenge given the two disjoint clusters.

Figure 3: t-SNE scatter plot of aspect-triggering words (output representations h_i from the Bi-GRU, 500 randomly sampled per aspect) and the learned keys: EntNet (left) vs. our model (right). Best viewed in colour.

4 Conclusion

In this paper, we have proposed a novel model for tracking various semantic aspects with external memory chains. While requiring less domain-specific training data, our model achieves state-of-the-art performance on the task of ROC Story Cloze ending prediction, beating a collection of strong baselines.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback, and gratefully acknowledge the support of an Australian Government Research Training Program Scholarship. This work was also supported in part by the Australian Research Council.
References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Michael Bugert, Yevgeniy Puzikov, Andreas Rücklé, Judith Eckle-Kohler, Teresa Martin, Eugenio Martínez-Cámara, Daniil Sorokin, Maxime Peyrard, and Iryna Gurevych. 2017. LSDSem 2017: Exploring data generation methods for the story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem 2017), pages 56–61, Valencia, Spain.

Eugene Charniak. 1972. Toward a model of children's story comprehension. Technical report, Massachusetts Institute of Technology, Cambridge, USA.

Snigdha Chaturvedi, Haoruo Peng, and Dan Roth. 2017. Story comprehension for predicting what happens next. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 1603–1614, Copenhagen, Denmark.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 740–750, Doha, Qatar.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of the NIPS 2014 Deep Learning and Representation Learning Workshop.

Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic frame-semantic parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), pages 948–956, Los Angeles, USA.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS 2016), pages 1027–1035, Barcelona, Spain.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), pages 1026–1034, Washington, DC, USA.

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2017. Tracking the world state with recurrent entity networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France.

Bing Liu, Minqing Hu, and Junsheng Cheng. 2005. Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference on World Wide Web (WWW 2005), pages 342–351, Chiba, Japan.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. 2013. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2013), pages 1222–1230, Chicago, USA.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, Canada.

Todor Mihaylov and Anette Frank. 2017. Story cloze ending selection baselines and data examination. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem 2017), pages 87–92, Valencia, Spain.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations (ICLR 2013), Scottsdale, USA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS 2013), pages 3111–3119, Stateline, USA.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), pages 839–849, San Diego, USA.

Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. 2017. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem 2017), pages 46–51, Valencia, Spain.

Haoruo Peng and Dan Roth. 2016. Two discourse driven language models for semantics. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 290–300, Berlin, Germany.

Lenhart K. Schubert and Chung Hee Hwang. 2000. Episodic logic meets Little Red Riding Hood: A comprehensive, natural representation for language understanding. In Lucja Iwanska and Stuart C. Shapiro, editors, Natural Language Processing and Knowledge Representation: Language for Knowledge and Knowledge for Language, pages 111–174. MIT/AAAI Press, Menlo Park, USA.

Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 15–25, Vancouver, Canada.

Scott R. Turner. 1994. The Creative Process: A Computer Model of Storytelling and Creativity. Lawrence Erlbaum, Hillsdale, USA.

Terry Winograd. 1972. Understanding natural language. Cognitive Psychology, 3(1):1–191.