Event Representation with Sequential, Semi-Supervised Discrete Variables
Mehdi Rezaee and Francis Ferraro
Department of Computer Science
University of Maryland, Baltimore County
Baltimore, MD 21250 USA
rezaee1@umbc.edu, ferraro@umbc.edu

arXiv:2010.04361v3 [cs.LG] 13 Apr 2021

Abstract

Within the context of event modeling and understanding, we propose a new method for neural sequence modeling that takes partially-observed sequences of discrete, external knowledge into account. We construct a sequential neural variational autoencoder, which uses Gumbel-Softmax reparametrization within a carefully defined encoder, to allow for successful backpropagation during training. The core idea is to allow semi-supervised external discrete knowledge to guide, but not restrict, the variational latent parameters during training. Our experiments indicate that our approach not only outperforms multiple baselines and the state-of-the-art in narrative script induction, but also converges more quickly.

Figure 1: An overview of event modeling, where the observed events (black text) are generated via a sequence of semi-observed random variables. In this case, the random variables are directed to take on the meaning of semantic frames that can be helpful to explain the events. Unobserved frames are in orange and observed frames are in blue. Some connections are more important than others, indicated by the weighted arrows. (The depicted events are "Crash occurred at station," "Train collided with train," and "Killed passengers and injured," the last generated from the unobserved [Killing] frame.)
1 Introduction

Event scripts are a classic way of summarizing events, participants, and other relevant information as a way of analyzing complex situations (Schank and Abelson, 1977). To learn these scripts we must be able to group similar events together, learn common patterns/sequences of events, and learn to represent an event's arguments (Minsky, 1974). While continuous embeddings can be learned for events and their arguments (Ferraro et al., 2017; Weber et al., 2018a), the direct inclusion of more structured, discrete knowledge is helpful in learning event representations (Ferraro and Van Durme, 2016). Obtaining fully accurate structured knowledge can be difficult, so when the external knowledge is neither sufficiently reliable nor present, a natural question arises: how can our models use the knowledge that is present?

Generative probabilistic models provide a framework for doing so: external knowledge is a random variable, which can be observed or latent, and the data/observations are generated (explained) from it. Knowledge that is discrete, sequential, or both—such as for script learning—complicates the development of neural generative models.

In this paper, we provide a successful approach for incorporating partially-observed, discrete, sequential external knowledge in a neural generative model. We specifically examine event sequence modeling augmented by semantic frames. Frames (Minsky, 1974, i.a.) are a semantic representation designed to capture the common and general knowledge we have about events, situations, and things. They have been effective in providing crucial information for modeling and understanding the meaning of events (Peng and Roth, 2016; Ferraro et al., 2017; Padia et al., 2018; Zhang et al., 2020). Though we focus on semantic frames as our source of external knowledge, we argue this work is applicable to other similar types of knowledge.

We examine the problem of modeling observed event tuples as a partially observed sequence of semantic frames. Consider the following three events, preceded by their corresponding bracketed frames, from Fig. 1:

[EVENT] crash occurred at station.
[IMPACT] train collided with train.
[KILLING] killed passengers and injured.

We can see that even without knowing the [KILLING] frame, the [EVENT] and [IMPACT] frames can help predict the word killed in the third event; the frames summarize the events and can be used as guidance for the latent variables to represent the data. On the other hand, words like crash, station and killed from the first and third events play a role in predicting [IMPACT] in the second event. Overall, to successfully represent events, beyond capturing the event-to-event connections, we propose to consider all the information from frames to events and from frames to frames.
In this work, we study the effect of tying discrete, sequential latent variables to partially-observable, noisy (imperfect) semantic frames. Like Weber et al. (2018b), our semi-supervised model is a bidirectional auto-encoder, with a structured collection of latent variables separating the encoder and decoder, and attention mechanisms on both the encoder and decoder. Rather than applying vector quantization, we adopt a Gumbel-Softmax (Jang et al., 2017) ancestral sampling method to easily switch between the observed frames and latent ones, where we inject the observed frame information into the Gumbel-Softmax parameters before sampling. Overall, our contributions are:

• We demonstrate how to learn a VAE that contains sequential, discrete, and partially-observed latent variables.
• We show that adding partially-observed, external, semantic frame knowledge to our structured, neural generative model leads to improvements over the current state-of-the-art on recent core event modeling tasks. Our approach leads to faster training convergence.
• We show that our semi-supervised model, though developed as a generative model, can effectively predict the labels that it may not observe. Additionally, we find that our model outperforms a discriminatively trained model with full supervision.

2 Related Work

Our work builds on both event modeling and latent generative models. In this section, we outline relevant background and related work.

2.1 Latent Generative Modeling

Generative latent variable models learn a mapping from the low-dimensional hidden variables f to the observed data points x, where the hidden representation captures the high-level information needed to explain the data. Mathematically, the joint probability p(x, f; θ) factorizes as

    p(x, f; θ) = p(f) p(x | f; θ),   (1)

where θ represents the model parameters. Since in practice maximizing the log-likelihood is intractable, we approximate the posterior by defining q(f | x; φ) and maximize the ELBO (Kingma and Welling, 2013) as a surrogate objective:

    L_{θ,φ} = E_{q(f|x;φ)} [ log ( p(x, f; θ) / q(f | x; φ) ) ].   (2)

In this paper, we are interested in studying a specific case: the input x is a sequence of T tokens, and we have M sequential discrete latent variables z = {z_m}_{m=1}^{M}, where each z_m takes one of F discrete values. While there have been effective proposals for unsupervised optimization of θ and φ, we focus on learning partially observed sequences of these variables. That is, we assume that in the training phase some values are observed while others are latent. We incorporate this partially observed, external knowledge into the φ parameters to guide the inference. The inferred latent variables are later used to reconstruct the tokens.

Kingma et al. (2014) generalized VAEs to the semi-supervised setup, but they assume that the dataset can be split into observed and unobserved samples and define separate loss functions for each case; in our work, we allow portions of a sequence to be latent. Teng et al. (2020) characterized semi-supervised VAEs via sparse latent variables; see Mousavi et al. (2019) for an in-depth study of additional sparse models.

Of the approaches that have been developed for handling discrete latent variables in a neural model (Vahdat et al., 2018a,b; Lorberbom et al., 2019, i.a.), we use the Gumbel-Softmax reparametrization (Jang et al., 2017; Maddison et al., 2016).
This approximates a discrete draw with logits π as softmax((π + g)/τ), where g is a vector of Gumbel(0, 1) draws and τ is an annealing temperature that allows targeted behavior; its ease, customizability, and efficacy are big advantages.
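To make the reparametrization concrete, the short sketch below (our own illustration in PyTorch, not code from this paper) draws one relaxed sample using exactly the softmax((π + g)/τ) form above; the frame count of 500 matches the setting used later in §4, and PyTorch's built-in torch.nn.functional.gumbel_softmax implements the same draw.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Relaxed one-hot draw: softmax((logits + g) / tau) with g ~ Gumbel(0, 1)."""
    u = torch.rand_like(logits).clamp_min(1e-20)   # U ~ Uniform(0, 1)
    g = -torch.log(-torch.log(u))                  # Gumbel(0, 1) noise
    return F.softmax((logits + g) / tau, dim=-1)

pi = torch.randn(500)              # logits over F = 500 frames
f = gumbel_softmax_sample(pi)      # lies on the simplex; lower tau -> closer to one-hot
```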
2.2 Event Modeling

Sequential event modeling, as in this paper, can be viewed as a type of script or schema induction (Schank and Abelson, 1977) via language modeling techniques. Mooney and DeJong (1985) provided an early analysis of an explanatory schema-generating system to process narratives, and Pichotta and Mooney (2016) applied an LSTM-based model to predict the event arguments. Modi (2016) proposed a neural network model to predict randomly missing events, while Rudinger et al. (2015) showed how neural language modeling can be used for sequential event prediction. Weber et al. (2018a) and Ding et al. (2019) used tensor-based decomposition methods for event representation. Weber et al. (2020) studied causality in event modeling via a latent neural language model.

Previous work has also examined how to incorporate or learn various forms of semantic representations while modeling events. Cheung et al. (2013) introduced an HMM-based model to explain event sequences via latent frames. Materna (2012), Chambers (2013) and Bamman et al. (2013) provided structured graphical models to learn event models over syntactic dependencies; Ferraro and Van Durme (2016) unified and generalized these approaches to capture varying levels of semantic forms and representation. Kallmeyer et al. (2018) proposed a Bayesian network based on a hierarchical dependency between syntactic dependencies and frames. Ribeiro et al. (2019) provided an analysis of clustering predicates and their arguments to infer semantic frames.

Variational autoencoders and attention networks (Kingma and Welling, 2013; Bahdanau et al., 2014) allowed Bisk et al. (2019) to use RNNs with attention to capture abstract and concrete script representations. Weber et al. (2018b) came up with a recurrent autoencoder model (HAQAE), which used vector quantization to learn hierarchical dependencies among discrete latent variables and an observed event sequence. Kiyomaru et al. (2019) suggested generating next events using a conditional VAE-based model. In another thread of research, Chen et al. (2018) utilized labels in conjunction with latent variables, but unlike Weber et al. (2018b), their model's latent variables are conditionally independent and do not form a hierarchy. Sønderby et al. (2016) proposed a sequential latent structure with Gaussian latent variables. Liévin et al. (2019), similar to our model structure, provided an analysis of hierarchical relaxed categorical sampling, but for the unsupervised setting.

Figure 2: Our encoder (left) and decoder (right). The orange nodes mean that the frame is latent (I_m = 0), while the blue nodes indicate observed frames (I_m is one-hot). In the encoder, the RNN hidden vectors are aligned with the frames to predict the next frame. The decoder utilizes the inferred frame information in the reconstruction.
3 Method

Our focus in this paper is modeling sequential event structure. In this section, we describe our variational autoencoder model and demonstrate how partially-observed external knowledge can be injected into the learning process. We provide an overview of our joint model in §3.1 and Fig. 2. Our model operates on sequences of events: it consists of an encoder (§3.2) that encodes the sequence of events as a new sequence of frames (higher-level, more abstract representations), and a decoder (§3.3) that learns how to reconstruct the original sequence of events from the representation provided by the encoder. During training (§3.4), the model can make use of partially-observed sequential knowledge to enrich the representations produced by the encoder. In §3.5 we summarize the novel aspects of our model.

3.1 Model Setup

We define each document as a sequence of M events. In keeping with previous work on event representation, each event is represented as a lexicalized 4-tuple: the core event predicate (verb), two main arguments (subject and object), and an event modifier (if applicable) (Pichotta and Mooney, 2016; Weber et al., 2018b). For simplicity, we can write each document as a sequence of T words w = {w_t}_{t=1}^{T}, where T = 4M and each w_t is from a vocabulary of size V. Fig. 1 gives an example of 3 events: during learning (but not testing) our model would have access to some, but not all, frames to lightly guide training (in this case, the first two frames but not the third).

While lexically rich, this 4-tuple representation is limited in the knowledge that can be directly encoded. Therefore, our model assumes that each document w can be explained jointly with a collection of M random variables f_m: f = {f_m}_{m=1}^{M}. The joint probability for our model factorizes as

    p(w, f) = ∏_{t=1}^{T} p(w_t | f, w_{<t}) ∏_{m=1}^{M} p(f_m | f_{m-1}),   (3)

so each token is generated conditioned on the full frame collection f and the preceding tokens, while the frames themselves form a left-to-right chain.
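The factorization in Eq. (3) translates directly into a sum of log terms; the sketch below is our own illustration (not the paper's code), and it bakes in the uniform frame prior that §3.4 later adopts simply to keep the example self-contained.

```python
import math
import torch

def joint_log_prob(token_log_probs: torch.Tensor, num_frames: int = 500) -> torch.Tensor:
    """log p(w, f) = sum_t log p(w_t | f, w_<t) + sum_m log p(f_m | f_{m-1}).
    token_log_probs: [T] per-token conditional log-probabilities (T = 4M);
    under a uniform chain prior, each frame term is simply -log F."""
    M = token_log_probs.numel() // 4
    return token_log_probs.sum() + M * (-math.log(num_frames))
```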
Following Eq. (2), the training objective is the ELBO for this model,

    L = E_{q(f|w)} [ log p(w | f) + log ( p(f) / q(f | w) ) ],   (4)

where the reconstruction term learns to generate the observed events w across all valid encodings f, while the KL term uses the prior p(f) to regularize q. Optimizing Eq. (4) is in general intractable, so we sample S chains of variables f^(1), ..., f^(S) from q(f | w) and approximate Eq. (4) as

    L ≈ (1/S) Σ_s [ log p(w | f^(s)) + α_q log ( p(f^(s)) / q(f^(s) | w) ) + α_c L_c(q(f^(s) | w)) ],   (5)

where α_q and α_c are fixed weights on the KL and classification terms defined in §3.4.

As our model is designed to allow the injection of external knowledge I, we define the variational distribution as q(f | w; I). In our experiments, I_m is a binary vector encoding which (if any) frame is observed for event m. For example, in Fig. 1 the frames of the first two events are observed, so I_1 and I_2 are one-hot while I_3 = 0.

3.2 Encoder

Alg. 1 shows how the encoder computes each frame f_m given the previous frame f_{m-1}, the current frame observation I_m, and the encoder GRU hidden states H. The encoder first embeds the previous frame as e_{f_{m-1}} = f_{m-1}^T E_F (line 1).
Algorithm 1 Encoder: compute the next frame f_m given the previous frame f_{m-1}. We compute and return a hidden frame representation γ_m, and f_m via a continuous Gumbel-Softmax reparametrization.
Input: previous frame f_{m-1} ∈ R^F, current frame observation I_m, encoder GRU hidden states H ∈ R^{T×d_h}
Parameters: W_in ∈ R^{d_h×d_e}, W_out ∈ R^{F×d_h}, frame embeddings E_F ∈ R^{F×d_e}
Output: f_m, γ_m
1: e_{f_{m-1}} ← f_{m-1}^T E_F
2: α ← Softmax(H W_in e_{f_{m-1}}^T)               ▷ Attn. scores
3: c_m ← H^T α                                     ▷ Context vector
4: γ'_m ← W_out ( tanh(W_in e_{f_{m-1}}) + tanh(c_m) )
5: γ_m ← γ'_m + ‖γ'_m‖ I_m                          ▷ Observation
6: q(f_m | f_{m-1}) ← GumbelSoftmax(γ_m)
7: f_m ∼ q(f_m | f_{m-1})                           ▷ f_m ∈ R^F

Next, the encoder calculates the similarity between e_{f_{m-1}} and all recurrent hidden states H (line 2). After deriving the attention scores, the weighted average of hidden states (c_m) summarizes the role of tokens in influencing the frame f_m for the mth event (line 3). We then combine the previous frame embedding e_{f_{m-1}} and the current context vector c_m to obtain a representation γ'_m for the mth event (line 4).

While γ'_m may be an appropriate representation if no external knowledge is provided, our encoder needs to be able to inject any provided external knowledge I_m. Our model defines a chain of variables—some of which may be observed and some of which may be latent—so care must be taken to preserve the gradient flow within the network. We note that an initial strategy of solely using I_m instead of f_m (whenever I_m is provided) is not sufficient to ensure gradient flow. Instead, we incorporate the observed information given by I_m by adding it to the encoder logits before drawing f_m samples (line 5). This remedy motivates the encoder to softly increase the importance of the observed frames during training. Finally, we draw f_m from the Gumbel-Softmax distribution (line 7).

For example, in Fig. 1, when the model knows that [IMPACT] is triggered, it increases the value of [IMPACT] in γ_m to encourage [IMPACT] to be sampled, but it does not prevent other frames from being sampled. On the other hand, when a frame is not observed in training, such as for the third event ([KILLING]), γ_m is not adjusted.

Since each draw f_m from a Gumbel-Softmax is a simplex vector, given learnable frame embeddings E_F we can obtain an aggregate frame representation e_m by calculating e_m = f_m^T E_F. This can be thought of as roughly extracting a single row of E_F for low-entropy draws f_m, and mixing many frames into the representation for high-entropy draws. Via the temperature hyperparameter, the Gumbel-Softmax allows us to control the entropy.
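The following PyTorch sketch of one Alg. 1 step is our own rendering (the shapes follow Table 4 in the appendix; it is not the released implementation). It shows how the observation I_m only shifts the logits rather than replacing the draw, which is what keeps gradients flowing through f_m.

```python
import torch
import torch.nn.functional as F

def encoder_step(f_prev, I_m, H, W_in, W_out, E_F, tau=0.5):
    """One step of Alg. 1. f_prev: [F] simplex draw for event m-1; I_m: [F] one-hot
    observation (all zeros if the frame is latent); H: [T, dh] encoder GRU states."""
    e_prev = f_prev @ E_F                                   # [de]  line 1: frame embedding
    alpha = F.softmax(H @ (W_in @ e_prev), dim=-1)          # [T]   line 2: attention scores
    c_m = H.t() @ alpha                                     # [dh]  line 3: context vector
    gamma = W_out @ (torch.tanh(W_in @ e_prev) + torch.tanh(c_m))   # [F] line 4
    gamma = gamma + gamma.norm() * I_m                      # line 5: inject observed frame
    f_m = F.gumbel_softmax(gamma, tau=tau)                  # lines 6-7: relaxed draw
    return f_m, gamma
```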
3.3 Decoder

Our decoder (Alg. 2) must be able to reconstruct the input event token sequence from the frame representations f = (f_1, ..., f_M) computed by the encoder. In contrast to the encoder, the decoder is relatively simple: we use an auto-regressive (left-to-right) GRU to produce a hidden representation z_t for each token we need to reconstruct, but we enrich that representation via an attention mechanism over f. Specifically, we use both f and the same learned frame embeddings E_F from the encoder to compute inferred, contextualized frame embeddings E_M = f E_F ∈ R^{M×d_e}. For each output token (time step t), we align the decoder GRU hidden state z_t with the rows of E_M (line 1). After calculating the scores for each frame embedding, we obtain the output context vector c_t (line 2), which is used in conjunction with the decoder hidden state z_t to generate the token w_t (line 5). In Fig. 1, the collection of all three frames and the tokens from the first event will be used to predict the Train token from the second event.

Algorithm 2 Decoder: to (re)generate each token in the event sequence, we compute an attention α̂ over the sequence of frame random variables f (from Alg. 1). This attention weights each frame's contribution to generating the current word.
Input: E_M ∈ R^{M×d_e} (computed as f E_F), decoder's current hidden state z_t ∈ R^{d_h}
Parameters: Ŵ_in ∈ R^{d_e×d_h}, Ŵ_out ∈ R^{V×d_e}, E_F
Output: w_t
1: α̂ ← Softmax(E_M Ŵ_in z_t)                      ▷ Attn. scores
2: c_t ← E_M^T α̂                                   ▷ Context vector
3: g ← Ŵ_out ( tanh(Ŵ_in z_t) + tanh(c_t) )
4: p(w_t | f; z_t) ∝ exp(g)
5: w_t ∼ p(w_t | f; z_t)
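A matching sketch of one Alg. 2 step (again our own rendering with the Table 4 shapes, not the authors' code); the two tanh terms on line 3 are the additive attention discussed further in §4.2.

```python
import torch
import torch.nn.functional as F

def decoder_step(frames, z_t, W_in_hat, W_out_hat, E_F):
    """One step of Alg. 2. frames: [M, F] stacked draws from Alg. 1; z_t: [dh] decoder GRU state."""
    E_M = frames @ E_F                                      # [M, de] contextualized frame embeddings
    alpha_hat = F.softmax(E_M @ (W_in_hat @ z_t), dim=-1)   # [M]  line 1: attention over frames
    c_t = E_M.t() @ alpha_hat                               # [de] line 2: context vector
    g = W_out_hat @ (torch.tanh(W_in_hat @ z_t) + torch.tanh(c_t))  # [V] line 3: token logits
    w_t = torch.multinomial(F.softmax(g, dim=-1), 1)        # lines 4-5: sample a token id
    return w_t, g
```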
3.4 Training Process

We now analyze the different terms in Eq. (5). In our experiments, we set the number of samples S to 1. From Eq. (6), and using the sampled sequence of frames f_1, f_2, ..., f_M from our encoder, we approximate the reconstruction term as

    L_w = Σ_t log p(w_t | f_1, f_2, ..., f_M; z_t),

where z_t is the decoder GRU's hidden representation after having reconstructed the previous t − 1 words in the event sequence.

Looking at the KL term, we define the prior frame-to-frame distribution as p(f_m | f_{m-1}) = 1/F, and let the variational distribution capture the dependency between the frames. A similar strategy has been exploited successfully by Chen et al. (2018) to make computation simpler. We see the computational benefit of a uniform prior: splitting the KL term into E_{q(f|w)} log p(f) − E_{q(f|w)} log q(f|w) allows us to neglect the first term. For the second term, we normalize the Gumbel-Softmax logits γ_m, i.e., γ̄_m = Softmax(γ_m), and compute

    L_q = −E_{q(f|w)} log q(f|w) ≈ −Σ_m γ̄_m^T log γ̄_m.

L_q encourages the entropy of the variational distribution to be high, which makes it hard for the encoder to distinguish the true frames from the wrong ones. We add a fixed and constant regularization coefficient α_q to decrease the effect of this term (Bowman et al., 2015). We define the classification loss as

    L_c = −Σ_{I_m > 0} I_m^T log γ̄_m,

to encourage q to be good at predicting any frames that were actually observed. We weight L_c by a fixed coefficient α_c. Summing these losses together, we arrive at our objective function:

    L = L_w + α_q L_q + α_c L_c.   (7)
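Read as code, the three terms of Eq. (7) can be assembled as below. This is a sketch under our own sign conventions (we form a single loss to minimize, so the log-likelihood and entropy enter with negative signs), and the tensor names are assumptions rather than the paper's released code.

```python
import torch
import torch.nn.functional as F

def training_loss(token_logits, targets, gammas, I, alpha_q=0.1, alpha_c=0.1):
    """token_logits: [T, V] decoder outputs; targets: [T] gold token ids;
    gammas: [M, F] encoder logits; I: [M, F] observed one-hot frames (zero rows if latent)."""
    # Lw: reconstruction log-likelihood of the event tokens.
    log_lik = -F.cross_entropy(token_logits, targets, reduction="sum")
    # Lq: entropy of the normalized logits (the uniform prior reduces the KL to this term).
    gamma_bar = F.softmax(gammas, dim=-1)
    entropy = -(gamma_bar * gamma_bar.clamp_min(1e-20).log()).sum()
    # Lc: cross-entropy against the frames that were actually observed.
    observed = I.sum(dim=-1) > 0
    frame_xent = -(I[observed] * gamma_bar[observed].clamp_min(1e-20).log()).sum()
    # Maximize the log-likelihood and entropy, fit the observed frames: minimize the negation.
    return -(log_lik + alpha_q * entropy) + alpha_c * frame_xent
```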
3.5 Relation to prior event modeling

A number of efforts have leveraged frame induction for event modeling (Cheung et al., 2013; Chambers, 2013; Ferraro and Van Durme, 2016; Kallmeyer et al., 2018; Ribeiro et al., 2019). These methods are restricted to explicit connections between events and their corresponding frames; they do not capture all the possible connections between the observed events and frames. Weber et al. (2018b) proposed a hierarchical unsupervised attention structure (HAQAE) that corrects for this. HAQAE uses vector quantization (Van Den Oord et al., 2017) to capture sequential event structure via tree-based latent variables.

Our model is related to HAQAE, though with important differences. While HAQAE relies on unsupervised deterministic inference, we aim to incorporate the frame information in a softer, guided fashion. The core differences are: our model supports partially-observed frame sequences (i.e., semi-supervised learning); the linear-chain connection among the event variables in our model is simpler than the tree-based structure in HAQAE; while both works use attention mechanisms in the encoder and decoder, our attention mechanism is based on addition rather than concatenation; and our handling of latent discrete variables is based on the Gumbel-Softmax reparametrization, rather than vector quantization. We discuss these differences further in Appendix C.1.

4 Experimental Results

We test the performance of our model on a portion of the Concretely Annotated Wikipedia dataset (Ferraro et al., 2014), which is a dump of English Wikipedia that has been annotated with the outputs of more than a dozen NLP analytics; we use it because it has readily-available FrameNet annotations provided via SemaFor (Das et al., 2014). Our training data has 457k documents, our validation set has 16k documents, and our test set has 21k documents. More than 99% of the frames are concentrated in the 500 most common frames, so we set F = 500. Nearly 15% of the events did not have any frame, many of which were due to auxiliary/modal verb structures; as a result, we did not include them. For all the experiments, the vocabulary size V is set to 40k and the number of events M is 5; this is to maintain comparability with HAQAE. For documents that had more than 5 events, we extracted the first 5 events that had frames. For both the validation and test datasets, we set I_m = 0 for all the events; frames are only observed during training.

Documents are fed to the model as a sequence of events with verb, subject, object and modifier elements. The events are separated with a special separating <TUP> token and missing elements are represented with a special NONE token. In order to facilitate semi-supervised training and examine the impact of frame knowledge, we introduce a user-set value ε: in each document, for event m, the true value of the frame is preserved in I_m with probability ε, while with probability 1 − ε we set I_m = 0. This is set and fixed prior to each experiment.
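The ε-masking of the automatically extracted frames is simple to implement; the helper below is our own sketch (function and variable names are hypothetical, not taken from the released code).

```python
import torch

def make_observations(frame_ids, epsilon, num_frames=500):
    """Build I for one document: each gold frame id is kept as a one-hot observation
    with probability epsilon and dropped (an all-zero row) otherwise."""
    I = torch.zeros(len(frame_ids), num_frames)
    for m, fid in enumerate(frame_ids):
        if torch.rand(1).item() < epsilon:
            I[m, fid] = 1.0
    return I

I_train = make_observations([12, 407, 33, 5, 250], epsilon=0.7)  # training document
I_eval = make_observations([12, 407, 33, 5, 250], epsilon=0.0)   # validation/test: no frames
```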
For all the experiments we set α_q and α_c to 0.1, found empirically on the validation data.

Setup. We represent words by their pretrained GloVe 300-dimensional embeddings and use gradient clipping at 5.0 to prevent exploding gradients. We use a two-layer bidirectional GRU for the encoder, and a two-layer unidirectional GRU for the decoder (with a hidden dimension of 512 for both). See Appendix B for additional computational details; our code is available at https://github.com/mmrezaee/SSDVAE.

Baselines. In our experiments, we compare our proposed method against the following:

• RNNLM: a sequence-to-sequence language model with the same structure used in our own model: a bidirectional GRU cell with two layers, a hidden dimension of 512, gradient clipping at 5 and GloVe 300 embeddings to represent words.
• RNNLM+ROLE (Pichotta and Mooney, 2016): the same structure as RNNLM, but the role of each token (verb, subject, object, modifier), as a learnable embedding vector, is concatenated to the token embedding before being fed to the model. The embedding dimension for roles is 300.
• HAQAE (Weber et al., 2018b): the work most similar to ours. For fairness, we seed HAQAE with the same dimension GRUs and pretrained embeddings.

4.1 Evaluations

To measure the effectiveness of our proposed model for event representation, we first report perplexity and Inverse Narrative Cloze.

Perplexity. We summarize our per-word perplexity results in Table 1a, which compares our event chain model, with varying ε values, to the three baselines. (In our case, perplexity provides an indication of the model's ability to predict the next event and its arguments.) Recall that ε refers to the (average) percent of frames observed during training. During evaluation no frames are observed; this ensures a fair comparison to our baselines.

Table 1: Validation and test results for per-word perplexity (Table 1a: lower is better) and inverse narrative cloze accuracy (Table 1b: higher is better). Recall that ε is the (average) percent of frames observed during training, though during evaluation no frames are observed.

Table 1a: Validation and test per-word perplexities (lower is better). We always outperform RNNLM and RNNLM+ROLE, and outperform HAQAE when automatically extracted frames are sufficiently available during training (ε ∈ {0.7, 0.8, 0.9}).

Model        ε    Valid          Test
RNNLM        -    61.34 ±2.05    61.80 ±4.81
RNNLM+ROLE   -    66.07 ±0.40    60.99 ±2.11
HAQAE        -    24.39 ±0.46    21.38 ±0.25
Ours         0.0  41.18 ±0.69    36.28 ±0.74
Ours         0.2  38.52 ±0.83    33.31 ±0.63
Ours         0.4  37.79 ±0.52    33.12 ±0.54
Ours         0.5  35.84 ±0.66    31.11 ±0.85
Ours         0.7  24.20 ±1.07    21.19 ±0.76
Ours         0.8  23.68 ±0.75    20.77 ±0.73
Ours         0.9  22.52 ±0.62    19.84 ±0.52

Table 1b: Inverse Narrative Cloze scores (higher is better), averaged across 3 runs, with standard deviation reported. Some frame observation (ε = 0.4) is most effective across Wikipedia and NYT, though we outperform our baselines, including the SOTA, at any level of frame observation. For the NYT dataset, we first trained the model on the Wikipedia dataset and then ran the tests on the NYT valid and test inverse narrative cloze datasets.

Model        ε    Wiki Valid     Wiki Test      NYT Valid      NYT Test
RNNLM        -    20.33 ±0.56    21.37 ±0.98    18.11 ±0.41    17.86 ±0.80
RNNLM+ROLE   -    19.57 ±0.68    19.69 ±0.97    17.56 ±0.10    17.95 ±0.25
HAQAE        -    29.18 ±1.40    24.88 ±1.35    20.5  ±1.31    22.11 ±0.49
Ours         0.0  43.80 ±2.93    45.75 ±3.47    29.40 ±1.17    28.63 ±0.37
Ours         0.2  45.78 ±1.53    44.38 ±2.10    29.50 ±1.13    29.30 ±1.45
Ours         0.4  47.65 ±3.40    47.88 ±3.59    30.01 ±1.27    30.61 ±0.37
Ours         0.5  42.38 ±2.41    40.18 ±0.90    29.36 ±1.58    29.95 ±0.97
Ours         0.7  38.40 ±1.20    39.08 ±1.55    29.15 ±0.95    30.13 ±0.66
Ours         0.8  39.48 ±3.02    38.96 ±2.75    29.50 ±0.30    30.33 ±0.81
Ours         0.9  35.61 ±0.62    35.56 ±1.70    28.41 ±0.29    29.01 ±0.84

As clearly seen, our model outperforms the other baselines across both the validation and test datasets. We find that increasing the observation probability ε consistently yields performance improvements. For any value of ε we outperform RNNLM and RNNLM+ROLE. HAQAE outperforms our model for ε ≤ 0.5, while we outperform HAQAE for ε ∈ {0.7, 0.8, 0.9}.
This suggests that while the tree-based latent structure can be helpful when external, semantic knowledge is not sufficiently available during training, a simpler linear structure can be successfully guided by that knowledge when it is available. Finally, recall that the external frame annotations are provided automatically, without human curation: this suggests that our model does not require perfect, curated annotations. These observations support the hypothesis that frame observations, in conjunction with latent variables, provide a benefit to event modeling.
Inverse Narrative Cloze. This task was proposed by Weber et al. (2018b) to evaluate the ability of models to classify the legitimate sequence of events over detractor sequences. For this task, we created two Wiki-based datasets from our validation and test datasets, each with 2k samples. Each sample has 6 options in which the first events are the same and only one of the options represents the actual sequence of events. All the options have a fixed size of 6 events, and the one with the lowest perplexity is selected as the correct one. We also consider two NYT inverse narrative cloze datasets that are publicly available (https://git.io/Jkm46). All the models are trained on the Wiki dataset and then classifications are done on the NYT dataset (no NYT training data was publicly available).

Table 1b presents the results for this task. Our method tends to achieve a superior classification score over all the baselines, even for small ε. Our model also yields performance improvements on the NYT validation and test datasets. We observe that the inverse narrative cloze scores for the NYT datasets are almost independent of ε. We suspect this is due to the different domains between training (Wikipedia) and testing (newswire).

Note that while our model's perplexity improved monotonically as ε increased, we do not see monotonic changes, with respect to ε, for this task. By examining computed quantities from our model, we observed both that a high ε resulted in very low entropy attention and that frames very often attended to the verb of the event—it learned this association despite never being explicitly directed to. While this is a benefit to localized next-word prediction (i.e., perplexity), it is detrimental to inverse narrative cloze. On the other hand, lower ε resulted in slightly higher attention entropy, suggesting that less peaky attention allows the model to capture more of the entire event sequence and improve global coherence.
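Concretely, each inverse narrative cloze instance is scored by running the trained language model over all six candidate sequences and picking the lowest-perplexity one; the snippet below is a sketch of that selection rule with made-up scores.

```python
import torch

def pick_option(per_word_nll):
    """per_word_nll: average per-word negative log-likelihood of each of the 6 options.
    The option with the lowest perplexity (equivalently, lowest NLL) is predicted
    as the true continuation."""
    return int(torch.argmin(torch.exp(torch.as_tensor(per_word_nll))))

pred = pick_option([3.1, 3.8, 4.0, 3.6, 4.2, 3.9])  # -> 0; correct if option 0 is the gold sequence
```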
4.2 Qualitative Analysis of Attention

To illustrate the effectiveness of our proposed attention mechanism, in Table 2 we show the most likely frames given tokens (β_enc), and tokens given frames (β_dec). We define β_enc = W_out tanh(H^T) and β_dec = Ŵ_out tanh(E_M^T), where β_enc ∈ R^{F×T} provides a contextual token-to-frame soft clustering matrix for each document and, analogously, β_dec ∈ R^{V×M} provides a frame-to-word soft clustering contextualized in part by the inferred frames. We argue that these clusters are useful for analyzing and interpreting the model and its predictions. Our experiments demonstrate that the frames in the encoder (Table 2, top) mostly attend to the verbs and, similarly, the decoder utilizes expected and reasonable frames to predict the next verb. Note that we have not restricted the frame-to-token connections: the attention mechanism makes the ultimate decision for these connections.

Table 2: Results for the outputs of the attention layers when ε = 0.7. The upper part shows β_enc (frames given tokens) and the lower part shows β_dec (tokens given frames). Each row shows the top 5 entries for each cluster.

β_enc (frames given tokens)
  kills      KILLING  DEATH  HUNTING_SUCCESS_OR_FAILURE  HIT_TARGET  ATTACK
  paper      SUBMITTING_DOCUMENTS  SUMMARIZING  SIGN  DECIDING  EXPLAINING_THE_FACTS
  business   COMMERCE_PAY  COMMERCE_BUY  EXPENSIVENESS  RENTING  REPORTING

β_dec (tokens given frames)
  CAUSATION              raise  rendered  caused  induced  brought
  PERSONAL_RELATIONSHIP  dated  dating  married  divorced  widowed
  FINISH_COMPETITION     lost  compete  won  competed  competes

We note that these clusters are a result of our attention mechanisms. Recall that in both the encoder and decoder algorithms, after computing the context vectors, we use the addition of two tanh(·) functions with the goal of separating the GRU hidden states and the frame embeddings (line 3). This is a different computation from the bilinear attention mechanism (Luong et al., 2015), which applies the tanh(·) function over a concatenation. Our additive approach was inspired by the neural topic modeling method of Dieng et al. (2016), which similarly uses additive factors to learn an expressive and predictive neural component and the classic "topics" (distributions/clusters over words) that traditional topic models excel at finding. While theoretical guarantees are beyond our scope, qualitative analyses suggest that our additive attention lets the model learn reasonable soft clusters of tokens into frame-based "topics." See Table 6 in the Appendix for an empirical comparison and validation of our use of addition rather than concatenation in the attention mechanisms.
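The contrast between the two scoring rules can be seen in a few lines; the weights below are random and purely illustrative, and the concatenation branch is our paraphrase of the Luong-style alternative evaluated in Table 6.

```python
import torch

dh, de, V = 512, 300, 40000
z_t, c_t = torch.randn(dh), torch.randn(de)      # decoder state and frame context

W_in_hat = torch.randn(de, dh)
W_out_hat = torch.randn(V, de)

# Additive form used here (Alg. 2, line 3): the GRU state and the frame context stay in
# separate tanh factors, which the paper credits for the topic-like clusters in Table 2.
g_add = W_out_hat @ (torch.tanh(W_in_hat @ z_t) + torch.tanh(c_t))

# Concatenation-based alternative: a single tanh over the joined vector.
W_cat = torch.randn(de, dh + de)
g_cat = W_out_hat @ torch.tanh(W_cat @ torch.cat([z_t, c_t]))
```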
4.3 How Discriminative Is A Latent Node?

Though we develop a generative model, we want to make sure the latent nodes are capable of leveraging the frame information in the decoder. We examine this by assessing the ability of a single latent node to classify the frame for an event. We repurpose the Wikipedia language modeling dataset into a new training set with 1,643,656 samples, a validation set with 57,261 samples and a test set with 75,903 samples. We used 500 frame labels. Each sample is a single event. We fixed the number of latent nodes to be one. We use RNNLM and RNNLM+ROLE as baselines, adding a linear classifier layer followed by the softplus function on top of the bidirectional GRU's last hidden vectors and a dropout of 0.15 on the logits. We trained all the models on the aforementioned training dataset, and tuned the hyperparameters on the validation dataset.

Table 3: Accuracy and macro precision and F1-score, averaged across three different runs. We present standard deviations in Table 5.

Model        ε     Valid Acc  Valid Prec  Valid F1   Test Acc  Test Prec  Test F1
RNNLM        -     0.89       0.73        0.66       0.88      0.71       0.65
RNNLM+ROLE   -     0.89       0.75        0.69       0.88      0.74       0.68
Ours         0.00  0.00       0.00        0.00       0.00      0.00       0.00
Ours         0.20  0.59       0.27        0.28       0.58      0.27       0.28
Ours         0.40  0.77       0.49        0.50       0.77      0.49       0.50
Ours         0.50  0.79       0.51        0.48       0.79      0.50       0.48
Ours         0.70  0.85       0.69        0.65       0.84      0.69       0.65
Ours         0.80  0.86       0.77        0.74       0.85      0.76       0.74
Ours         0.90  0.87       0.81        0.78       0.86      0.81       0.79

Figure 3: Validation NLL during training of our model with ε ∈ {0.2, 0.7, 0.9} and HAQAE (the red curve). Epochs are displayed with a square-root transform.

We trained the RNNLM and RNNLM+ROLE baselines in a purely supervised way, whereas our model mixed supervised (discriminative) and unsupervised (generative) training. The baselines observed all of the frame labels in the training set; our model only observed frame values in training with probability ε, which it predicted from γ_m. The parameters leading to the highest accuracy were chosen to evaluate the classification on the test dataset. The results for this task are summarized in Table 3. Our method is an attention-based model which captures all the dependencies in each event to construct the latent representation, but the baselines are autoregressive models.
Our encoder acts like a discriminative classifier to predict the frames, which are later used in the decoder to construct the events. We expect the model's performance to be comparable to RNNLM and RNNLM+ROLE in terms of classification when ε is high. Our model with larger ε tends to achieve better performance in terms of macro precision and macro F1-score.

4.4 Training Speed

Our experiments show that our proposed approach converges faster than the existing HAQAE model. For fairness, we used the same data loader, a batch size of 100, a learning rate of 10^-3 and the Adam optimizer. On average each iteration takes 0.2951 seconds for HAQAE and 0.2958 seconds for our model. From Fig. 3 we can see that for sufficiently high values of ε our model converges both better, in terms of negative log-likelihood (NLL), and faster—though for small ε, our model still converges much faster. The reasons for this can be boiled down to utilizing the Gumbel-Softmax rather than VQ-VAE, and to injecting information in the form of frames jointly.

5 Conclusion

We showed how to learn a semi-supervised VAE with partially observed, sequential, discrete latent variables. We used the Gumbel-Softmax and a modified attention mechanism to learn a highly effective event language model (low perplexity), a predictor of how an initial event may progress (improved inverse narrative cloze), and a task-based classifier (outperforming fully supervised systems). We believe that future work could extend our method by incorporating other sources or types of knowledge (such as entity or "commonsense" knowledge), and by using other forms of prior distributions, such as "plug-and-play" priors (Guo et al., 2019; Mohammadi et al., 2021; Laumont et al., 2021).

Acknowledgements and Funding Disclosure

We would like to thank the anonymous reviewers for their comments, questions, and sug-
gestions. Some experiments were conducted on Dipanjan Das, Desai Chen, André FT Martins, Nathan the UMBC HPCF, supported by the National Sci- Schneider, and Noah A Smith. 2014. Frame- semantic parsing. Computational linguistics, ence Foundation under Grant No. CNS-1920079. 40(1):9–56. We’d also like to thank the reviewers for their com- ments and suggestions. This material is based in Adji B Dieng, Chong Wang, Jianfeng Gao, and John part upon work supported by the National Science Paisley. 2016. Topicrnn: A recurrent neural net- Foundation under Grant Nos. IIS-1940931 and IIS- work with long-range semantic dependency. arXiv preprint arXiv:1611.01702. 2024878. This material is also based on research that is in part supported by the Air Force Research Xiao Ding, Kuo Liao, Ting Liu, Zhongyang Li, and Laboratory (AFRL), DARPA, for the KAIROS pro- Junwen Duan. 2019. Event representation learning gram under agreement number FA8750-19-2-1003. enhanced with external commonsense knowledge. In Proceedings of the 2019 Conference on Empirical The U.S.Government is authorized to reproduce Methods in Natural Language Processing and the and distribute reprints for Governmental purposes 9th International Joint Conference on Natural Lan- notwithstanding any copyright notation thereon. guage Processing (EMNLP-IJCNLP), pages 4896– The views and conclusions contained herein are 4905. those of the authors and should not be interpreted Francis Ferraro, Adam Poliak, Ryan Cotterell, and Ben- as necessarily representing the official policies or jamin Van Durme. 2017. Frame-based continuous endorsements, either express or implied, of the Air lexical semantics through exponential family tensor Force Research Laboratory (AFRL), DARPA, or factorization and semantic proto-roles. In Proceed- ings of the 6th Joint Conference on Lexical and Com- the U.S. Government. putational Semantics (*SEM 2017), pages 97–103, Vancouver, Canada. Association for Computational Linguistics. References Francis Ferraro, Max Thomas, Matthew R. Gormley, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Travis Wolfe, Craig Harman, and Benjamin Van gio. 2014. Neural machine translation by jointly Durme. 2014. Concretely Annotated Corpora. In learning to align and translate. arXiv preprint 4th Workshop on Automated Knowledge Base Con- arXiv:1409.0473. struction (AKBC), Montreal, Canada. David Bamman, Brendan O’Connor, and Noah A. Francis Ferraro and Benjamin Van Durme. 2016. A Smith. 2013. Learning latent personas of film char- unified bayesian model of scripts, frames and lan- acters. In ACL. guage. In Thirtieth AAAI Conference on Artificial Yonatan Bisk, Jan Buys, Karl Pichotta, and Yejin Choi. Intelligence. 2019. Benchmarking hierarchical script knowledge. In Proceedings of the 2019 Conference of the North Bichuan Guo, Yuxing Han, and Jiangtao Wen. 2019. American Chapter of the Association for Computa- Agem: Solving linear inverse problems via deep pri- tional Linguistics: Human Language Technologies, ors and sampling. Advances in Neural Information Volume 1 (Long and Short Papers), pages 4077– Processing Systems, 32:547–558. 4085. Eric Jang, Shixiang Gu, and Ben Poole. 2017. Cate- Samuel R Bowman, Luke Vilnis, Oriol Vinyals, An- gorical reparameterization with gumbel-softmax. In drew M Dai, Rafal Jozefowicz, and Samy Ben- ICLR. gio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. Laura Kallmeyer, Behrang QasemiZadeh, and Jackie Chi Kit Cheung. 2018. Coarse lexical frame acquisi- Nathanael Chambers. 2013. 
Event schema induction tion at the syntax–semantics interface using a latent- with a probabilistic entity-driven model. In Proceed- variable pcfg model. In Proceedings of the Seventh ings of the 2013 Conference on Empirical Methods Joint Conference on Lexical and Computational Se- in Natural Language Processing, pages 1797–1807. mantics, pages 130–141. Mingda Chen, Qingming Tang, Karen Livescu, and Diederik P Kingma and Max Welling. 2013. Auto- Kevin Gimpel. 2018. Variational sequential labelers encoding variational bayes. arXiv preprint for semi-supervised learning. In Proceedings of the arXiv:1312.6114. 2018 Conference on Empirical Methods in Natural Language Processing, pages 215–226. Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised Jackie Chi Kit Cheung, Hoifung Poon, and Lucy learning with deep generative models. In Advances Vanderwende. 2013. Probabilistic frame induction. in neural information processing systems, pages NAACL HLT 2013, pages 837–846. 3581–3589.
Hirokazu Kiyomaru, Kazumasa Omura, Yugo Mu- First Workshop on Fact Extraction and VERification rawaki, Daisuke Kawahara, and Sadao Kurohashi. (FEVER), pages 161–165, Brussels, Belgium. Work- 2019. Diversity-aware event prediction based on a shop at EMNLP. conditional variational autoencoder with reconstruc- tion. In Proceedings of the First Workshop on Com- Haoruo Peng and Dan Roth. 2016. Two discourse monsense Inference in Natural Language Process- driven language models for semantics. In Proceed- ing, pages 113–122. ings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- Rémi Laumont, Valentin De Bortoli, Andrés Almansa, pers), pages 290–300, Berlin, Germany. Association Julie Delon, Alain Durmus, and Marcelo Pereyra. for Computational Linguistics. 2021. Bayesian imaging using plug & play pri- ors: when langevin meets tweedie. arXiv preprint Karl Pichotta and Raymond J Mooney. 2016. Learn- arXiv:2103.04715. ing statistical scripts with lstm recurrent neural net- works. In Thirtieth AAAI Conference on Artificial Valentin Liévin, Andrea Dittadi, Lars Maaløe, and Ole Intelligence. Winther. 2019. Towards hierarchical discrete vari- ational autoencoders. In 2nd Symposium on Ad- Eugénio Ribeiro, Vânia Mendonça, Ricardo Ribeiro, vances in Approximate Bayesian Inference (AABI). David Martins de Matos, Alberto Sardinha, Ana Lú- cia Santos, and Luísa Coheur. 2019. L2f/inesc-id at Guy Lorberbom, Andreea Gane, Tommi Jaakkola, and semeval-2019 task 2: Unsupervised lexical semantic Tamir Hazan. 2019. Direct optimization through frame induction using contextualized word represen- argmax for discrete variational auto-encoder. In Ad- tations. In SemEval@NAACL-HLT. vances in Neural Information Processing Systems, pages 6203–6214. Rachel Rudinger, Pushpendre Rastogi, Francis Ferraro, and Benjamin Van Durme. 2015. Script induction as Minh-Thang Luong, Hieu Pham, and Christopher D language modeling. In EMNLP. Manning. 2015. Effective approaches to attention- based neural machine translation. In Proceedings of Roger C Schank and Robert P Abelson. 1977. Scripts, the 2015 Conference on Empirical Methods in Natu- plans, goals, and understanding: an inquiry into hu- ral Language Processing, pages 1412–1421. man knowledge structures. Chris J Maddison, Andriy Mnih, and Yee Whye Teh. Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, 2016. The concrete distribution: A continuous relax- Søren Kaae Sønderby, and Ole Winther. 2016. Lad- ation of discrete random variables. arXiv preprint der variational autoencoders. In Advances in neural arXiv:1611.00712. information processing systems, pages 3738–3746. Jiří Materna. 2012. LDA-Frames: an unsupervised ap- Michael Teng, Tuan Anh Le, Adam Scibior, and Frank proach to generating semantic frames. In Compu- Wood. 2020. Semi-supervised sequential generative tational Linguistics and Intelligent Text Processing, models. In Conference on Uncertainty in Artificial pages 376–387. Springer. Intelligence, pages 649–658. PMLR. Marvin Minsky. 1974. A framework for representing Arash Vahdat, Evgeny Andriyash, and William knowledge. MIT-AI Laboratory Memo 306. Macready. 2018a. Dvae#: Discrete variational au- toencoders with relaxed boltzmann priors. In Ad- Ashutosh Modi. 2016. Event embeddings for seman- vances in Neural Information Processing Systems, tic script modeling. In Proceedings of The 20th pages 1864–1874. SIGNLL Conference on Computational Natural Lan- guage Learning, pages 75–83. 
Arash Vahdat, William G Macready, Zhengbing Bian, Amir Khoshaman, and Evgeny Andriyash. Narges Mohammadi, Marvin M Doyley, and Mujdat 2018b. Dvae++: Discrete variational autoencoders Cetin. 2021. Ultrasound elasticity imaging using with overlapping transformations. arXiv preprint physics-based models and learning-based plug-and- arXiv:1802.04920. play priors. arXiv preprint arXiv:2103.14096. Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neu- Raymond J Mooney and Gerald DeJong. 1985. Learn- ral discrete representation learning. In Advances ing schemata for natural language processing. In IJ- in Neural Information Processing Systems, pages CAI, pages 681–687. 6306–6315. Seyedahmad Mousavi, Mohammad Mehdi Rezaee Noah Weber, Niranjan Balasubramanian, and Taghiabadi, and Ramin Ayanzadeh. 2019. A survey Nathanael Chambers. 2018a. Event representations on compressive sensing: Classical results and recent with tensor-based compositions. In Thirty-Second advancements. arXiv preprint arXiv:1908.01014. AAAI Conference on Artificial Intelligence. Ankur Padia, Francis Ferraro, and Tim Finin. 2018. Noah Weber, Rachel Rudinger, and Benjamin Team UMBC-FEVER : Claim verification using se- Van Durme. 2020. Causal inference of script knowl- mantic lexical resources. In Proceedings of the edge. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP). Noah Weber, Leena Shekhar, Niranjan Balasubrama- nian, and Nathanael Chambers. 2018b. Hierarchi- cal quantized representations for script generation. In Proceedings of the 2018 Conference on Empiri- cal Methods in Natural Language Processing, pages 3783–3792. Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, and Lei Li. 2020. Variational template ma- chine for data-to-text generation. arXiv preprint arXiv:2002.01127. Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2020. Semantics-aware BERT for language understanding. In AAAI.
A Table of Notation

Table 4: Notation used in this paper.

Symbol   Description                                            Dimension
F        Frame vocabulary size                                  ℕ
V        Token vocabulary size                                  ℕ
M        Number of events, per sequence                         ℕ
T        Number of words, per sequence (4M)                     ℕ
d_e      Frame embedding dim.                                   ℕ
d_h      RNN hidden state dim.                                  ℕ
E_F      Learned frame embeddings                               R^{F×d_e}
E_M      Embeddings of sampled frames (E_M = f E_F)             R^{M×d_e}
I_m      Observed frame (external one-hot)                      R^F
f_m      Sampled frame (simplex)                                R^F
H        RNN hidden states from the encoder                     R^{T×d_h}
z_t      RNN hidden state from the decoder                      R^{d_h}
W_in     Learned frames-to-hidden-states weights (encoder)      R^{d_h×d_e}
Ŵ_in     Learned hidden-states-to-frames weights (decoder)      R^{d_e×d_h}
W_out    Contextualized frame emb. (encoder)                    R^{F×d_h}
Ŵ_out    Learned frames-to-words weights (decoder)              R^{V×d_e}
α        Attention scores over hidden states (encoder)          R^T
α̂        Attention scores over frame emb. (decoder)             R^M
c_m      Context vector (encoder)                               R^{d_h}
c_t      Context vector (decoder)                               R^{d_e}
γ_m      Gumbel-Softmax params. (encoder)                       R^F
w_t      Tokens (one-hot)                                       R^V

Figure 4: (a) The VQ-VAE in HAQAE works based on the minimum distance between the query vector and the embeddings. (b) Our approach first draws a Gumbel-Softmax sample and then extracts the relevant row by vector-to-matrix multiplication.

B Computing Infrastructure

We used the Adam optimizer with an initial learning rate of 10^-3 and early stopping (lack of validation performance improvement for 10 iterations). We represent events by their pretrained GloVe 300 embeddings and utilized gradient clipping at 5.0 to prevent exploding gradients. The Gumbel-Softmax temperature is fixed to τ = 0.5. We did not use dropout or batch norm on any layer. We used two layers of bidirectional GRU cells with a hidden dimension of 512 for the encoder module and a unidirectional GRU with the same configuration for the decoder. Each model was trained using a single GPU (a TITAN RTX, an RTX 2080 Ti, or a Quadro 8000), though we note that neither our models nor the baselines required the full memory of any GPU (e.g., our models used roughly 6GB of GPU memory for a batch of 100 documents).

C Additional Insights into Novelty

We have previously mentioned how our work is most similar to HAQAE (Weber et al., 2018b). In this section, we provide a brief overview of HAQAE (Appendix C.1) and then highlight differences (Fig. 4), with examples (Appendix C.2).
C.1 Overview of HAQAE

HAQAE provides an unsupervised tree structure based on the vector quantization variational autoencoder (VQ-VAE) over M latent variables. Each latent variable z_i is defined in the embedding space denoted as e. The variational distribution to approximate z = {z_1, z_2, ..., z_M} given tokens x is defined as follows:

    q(z|x) = q_0(z_0|x) ∏_{i=1}^{M-1} q_i(z_i | parent_of(z_i), x).

The encoder calculates the attention over the input RNN hidden states h_x and the parent of z_i to define q_i(z_i = k | z_{i-1}, x):

    q_i(z_i = k | z_{i-1}, x) = 1 if k = argmin_j ‖ g_i(x, parent_of(z_i)) − e_{ij} ‖_2, and 0 otherwise,

where g_i(x, parent_of(z_i)) computes bilinear attention between h_x and parent_of(z_i). In this setting, the variational distribution is deterministic: right after deriving the latent query vector, it is compared with a lookup embedding table and the row with minimum distance is selected; see Fig. 4a. The decoder reconstructs the tokens as p(x_i|z), calculating attention over the latent variables and the RNN hidden states. A reconstruction loss L^R_j and a "commit" loss L^C_j (Weber et al., 2018b) force g_j(x, parent_of(z_i)) to be close to the embedding referred to by q_i(z_i). Both the L^R_j and L^C_j terms rely on a deterministic mapping that is based on a nearest-neighbor computation, which makes it difficult to inject guiding information into the latent variable.
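The difference between the two inference mechanisms in Fig. 4 can be summarized in a short sketch (ours; the embedding table and query are random placeholders): the VQ-VAE lookup is a hard argmin, whereas the Gumbel-Softmax draw keeps a differentiable simplex that observed frames can nudge.

```python
import torch
import torch.nn.functional as F

E = torch.randn(500, 300)          # latent/frame embedding table
query = torch.randn(300)           # encoder query for one latent variable

# (a) HAQAE / VQ-VAE: deterministic nearest-neighbor lookup (Fig. 4a).
idx = torch.argmin(((E - query) ** 2).sum(dim=-1))
e_vq = E[idx]

# (b) This paper (Fig. 4b): relaxed categorical draw, then soft row extraction e_m = f_m^T E_F.
logits = E @ query                 # illustrative scores over the 500 rows
f_m = F.gumbel_softmax(logits, tau=0.5)
e_ours = f_m @ E
```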
C.2 Frame Vector Norm

Like HAQAE, we use embeddings E_F instead of directly using the Gumbel-Softmax frame samples. In our model definition, each frame f_m = [f_{m,1}, f_{m,2}, ..., f_{m,F}] sampled from the Gumbel-Softmax distribution is a simplex vector:

    0 ≤ f_{m,i} ≤ 1,   Σ_{i=1}^{F} f_{m,i} = 1,   (8)

so ‖f_m‖_p = (Σ_{i=1}^{F} f_{m,i}^p)^{1/p} ≤ F^{1/p}. After sampling a frame simplex vector, we multiply it by the frame embedding matrix E_F. With an appropriate temperature for the Gumbel-Softmax, the simplex vector is approximately one-hot and the multiplication maps it to the embedding space without any limitation on the norm.

D More Results

In this section, we provide additional quantitative and qualitative results, to supplement what was presented in §4.

D.1 Standard Deviation for Frame Prediction (§4.3)

In Table 5, we show the average results of the classification metrics with their corresponding standard deviations.

Table 5: Accuracy and macro precision and F1-score, averaged across 3 runs, with standard deviation reported.

Model        ε     Valid Acc      Valid Prec     Valid F1       Test Acc       Test Prec      Test F1
RNNLM        -     0.89 ±0.001    0.73 ±0.004    0.66 ±0.008    0.88 ±0.005    0.71 ±0.014    0.65 ±0.006
RNNLM+ROLE   -     0.89 ±0.004    0.75 ±0.022    0.69 ±0.026    0.88 ±0.005    0.74 ±0.017    0.68 ±0.020
Ours         0.00  0.00 ±0.001    0.00 ±0.001    0.00 ±0.000    0.00 ±0.001    0.00 ±0.000    0.00 ±0.000
Ours         0.20  0.59 ±0.020    0.27 ±0.005    0.28 ±0.090    0.58 ±0.010    0.27 ±0.050    0.28 ±0.010
Ours         0.40  0.77 ±0.012    0.49 ±0.130    0.50 ±0.010    0.77 ±0.010    0.49 ±0.060    0.50 ±0.010
Ours         0.50  0.79 ±0.010    0.51 ±0.080    0.48 ±0.080    0.79 ±0.009    0.50 ±0.080    0.48 ±0.050
Ours         0.70  0.85 ±0.007    0.69 ±0.050    0.65 ±0.040    0.84 ±0.013    0.69 ±0.050    0.65 ±0.020
Ours         0.80  0.86 ±0.003    0.77 ±0.013    0.74 ±0.006    0.85 ±0.005    0.76 ±0.008    0.74 ±0.010
Ours         0.90  0.87 ±0.002    0.81 ±0.007    0.78 ±0.006    0.86 ±0.011    0.81 ±0.020    0.79 ±0.010

D.2 Importance of Using Addition rather than Concatenation in Attention

To demonstrate the effectiveness of our proposed attention mechanism, we compare the addition against the concatenation (regular bilinear attention) method. We report the results on the Wikipedia dataset in Table 6. The experiments indicate that the regular bilinear attention structure with larger ε obtains worse performance. These results confirm the claim that the proposed approach benefits from the addition structure.

Table 6: Validation and test per-word perplexities for additive vs. bilinear (concatenation) attention (lower is better).

ε     Addition Valid   Addition Test   Concatenation Valid   Concatenation Test
0.0   41.18 ±0.69      36.28 ±0.74     48.86 ±0.13           42.76 ±0.08
0.2   38.52 ±0.83      33.31 ±0.63     48.73 ±0.18           42.76 ±0.06
0.4   37.79 ±0.52      33.12 ±0.54     49.34 ±0.32           43.18 ±0.21
0.5   35.84 ±0.66      31.11 ±0.85     49.83 ±0.24           43.90 ±0.32
0.7   24.20 ±1.07      21.19 ±0.76     52.41 ±0.38           45.97 ±0.48
0.8   23.68 ±0.75      20.77 ±0.73     55.13 ±0.31           48.27 ±0.21
0.9   22.52 ±0.62      19.84 ±0.52     58.63 ±0.25           51.39 ±0.34

D.3 Generated Sentences

Recall that the reconstruction loss is

    L_w ≈ (1/S) Σ_s log p(w | f_1^(s), f_2^(s), ..., f_M^(s); z),   where f_m^(s) ∼ q(f_m | f_{m-1}, I_m, w).

Based on these formulations, we provide some examples of generated scripts. Given the seed, the model first predicts the first frame f_1, then it predicts the next verb v_1 and similarly samples the tokens one by one. During event generation, if the sampled token is unknown, the decoder samples again. As we see in Table 7, the generated events and frame samples are consistent, which shows the ability of the model to represent events.
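The generation procedure just described can be written as a small loop; this pseudocode-level sketch is ours, with `sample_frame` and `sample_token` standing in for the frame chain and the decoder of Alg. 2.

```python
def generate_events(sample_frame, sample_token, num_events=5, unk_id=0):
    """Alternately draw the next frame and then the 4 tokens (verb, subject, object,
    modifier) of its event, re-sampling whenever the unknown token is drawn."""
    frames, tokens = [], []
    prev_frame = None
    for _ in range(num_events):
        prev_frame = sample_frame(prev_frame)      # f_m from the frame chain
        frames.append(prev_frame)
        for _ in range(4):
            w = sample_token(frames, tokens)
            while w == unk_id:                     # reject <unk> and draw again
                w = sample_token(frames, tokens)
            tokens.append(w)
    return frames, tokens
```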
D.4 Inferred Frames

Using Alg. 1, we can inspect the frames sampled during training and validation. In Table 8, we provide some examples of frame inference for both training and validation examples. We observe that for training examples, when ε > 0.5, almost all the observed frames and inferred frames are the same; in other words, the model prediction is almost the same as the ground truth. Interestingly, the model is more flexible in sampling the latent frames (orange ones). In Table 8a the model estimates HAVE_ASSOCIATED instead of the POSSESSION frame. In Table 8b, instead of PROCESS_START we have ACTIV-