Capturing Event Argument Interaction via A Bi-Directional Entity-Level Recurrent Decoder
Xi Xiangyu1,2, Wei Ye1,†, Shikun Zhang1,†, Quanxiu Wang3, Huixing Jiang2, Wei Wu2
1 National Engineering Research Center for Software Engineering, Peking University, Beijing, China
2 Meituan Group, Beijing, China
3 RICH AI, Beijing, China
{xixy,wye,zhangsk}@pku.edu.cn
† Corresponding authors.

Abstract

Capturing interactions among event arguments is an essential step towards robust event argument extraction (EAE). However, existing efforts in this direction suffer from two limitations: 1) The argument role type information of contextual entities is mainly utilized as training signals, ignoring the potential merits of directly adopting it as semantically rich input features; 2) The argument-level sequential semantics, which implies the overall distribution pattern of argument roles over an event mention, is not well characterized. To tackle the above two bottlenecks, we formalize EAE as a Seq2Seq-like learning problem for the first time, where a sentence with a specific event trigger is mapped to a sequence of event argument roles. A neural architecture with a novel Bi-directional Entity-level Recurrent Decoder (BERD) is proposed to generate argument roles by incorporating contextual entities' argument role predictions, like a word-by-word text generation process, thereby distinguishing implicit argument distribution patterns within an event more accurately.

1 Introduction

Event argument extraction (EAE), which aims to identify the entities serving as event arguments and classify the roles they play in an event, is a key step towards event extraction (EE). For example, given that the word "fired" triggers an Attack event in the sentence "In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel", EAE needs to identify that "Baghdad", "cameraman", "American tank", and "Palestine Hotel" are arguments with Place, Target, Instrument, and Target as their roles respectively.

Recently, deep learning models have been widely applied to event argument extraction and have achieved significant progress (Chen et al., 2015; Nguyen et al., 2016; Sha et al., 2018; Yang et al., 2019; Wang et al., 2019b; Zhang et al., 2020; Du and Cardie, 2020). Many efforts have been devoted to improving EAE by better characterizing argument interaction, which can be categorized into two paradigms. The first one, named inter-event argument interaction in this paper, concentrates on mining information about the target entity (candidate argument) in the context of other event instances (Yu et al., 2011; Nguyen et al., 2016), e.g., the evidence that a Victim argument for a Die event is often the Target argument for an Attack event in the same sentence. The second one is intra-event argument interaction, which exploits the relationship of the target entity with other entities in the same event instance (Yu et al., 2011; Sha et al., 2016, 2018). We focus on the second paradigm in this paper.

Despite their promising results, existing methods for capturing intra-event argument interaction suffer from two bottlenecks.

(1) The argument role type information of contextual entities is underutilized. As two representative explorations, dBRNN (Sha et al., 2018) uses an intermediate tensor layer to capture latent interaction between candidate arguments, and RBPB (Sha et al., 2016) estimates whether two candidate arguments belong to the same event or not, which serves as a constraint in a beam-search-based prediction algorithm. Generally, these works use the argument role type information of contextual entities as auxiliary supervision signals for training to refine the input representation.
However, one intuitive observation is that the argument role types can be utilized straightforwardly as semantically rich input features, just as entity type information is. To verify this intuition, we conduct an experiment on the ACE 2005 English corpus, in which CNN (Nguyen and Grishman, 2015) is utilized as a baseline. For an entity, we incorporate the ground-truth roles of its contextual arguments into the baseline model's input representation, obtaining the model CNN(w. role type). As expected, CNN(w. role type) outperforms CNN significantly, as shown in Table 1.[1]

[1] In this experiment we skip the event detection phase and directly assume all the triggers are correctly recognized.

Model               P     R     F1
CNN                 57.8  55.0  56.3
CNN(w. role type)   59.8  60.3  60.0

Table 1: Experimental results of CNN and its variant on ACE 2005.

The challenge of this method lies in knowing the ground-truth roles of contextual entities in the inference (or testing) phase, which is one possible reason why existing works do not investigate this direction. Here we can simply use predicted argument roles to approximate the corresponding ground truth at inference time. We believe that the noise brought by prediction is tolerable, considering the stimulating effect of using argument roles directly as input.

(2) The distribution pattern of multiple argument roles within an event is not well characterized. For events with many entities, distinguishing the overall appearance patterns of argument roles is essential to make accurate role predictions. In dBRNN (Sha et al., 2018), however, there is no specific design involving constraints or interaction among multiple prediction results, though the argument representation fed into the final classifier is enriched with synthesized information (the tensor layer) from other arguments. RBPB (Sha et al., 2016) explicitly leverages simple correlations inside each argument pair, ignoring more complex interactions in the whole argument sequence. Therefore, we need a more reliable way to learn the sequential semantics of argument roles in an event.

To address the above two challenges, we formalize EAE as a Seq2Seq-like learning problem (Bahdanau et al., 2014) of mapping a sentence with a specific event trigger to a sequence of event argument roles. To fully utilize both left- and right-side argument role information, inspired by the bi-directional decoder for machine translation (Zhang et al., 2018), we propose a neural architecture with a novel Bi-directional Entity-level Recurrent Decoder (BERD) to generate event argument roles entity by entity. The predicted argument role of an entity is fed into the decoding module for the next or previous entity recurrently, like a text generation process. In this way, BERD can identify candidate arguments in a way that is more consistent with the implicit distribution pattern of multiple argument roles within a sentence, similar to text generation models that learn to generate word sequences following certain grammatical rules or text styles.

The contributions of this paper are:

1. We formalize the task of event argument extraction as a Seq2Seq-like learning problem for the first time, where a sentence with a specific event trigger is mapped to a sequence of event argument roles.

2. We propose a novel architecture with a Bi-directional Entity-level Recurrent Decoder (BERD) that is capable of leveraging the argument role predictions of left- and right-side contextual entities and distinguishing argument roles' overall distribution pattern.

3. Extensive experimental results show that our proposed method outperforms several competitive baselines on the widely-used ACE 2005 dataset. BERD's superiority is more significant given more entities in a sentence.

2 Problem Formulation

Most previous works formalize EAE as either a word-level sequence labeling problem (Nguyen et al., 2016; Zeng et al., 2016; Yang et al., 2019) or an entity-oriented classic classification problem (Chen et al., 2015; Wang et al., 2019b). We formalize EAE as a Seq2Seq-like learning problem as follows. Let S = {w_1, ..., w_n} be a sentence, where n is the sentence length and w_i is the i-th token. Also, let E = {e_1, ..., e_k} be the entity mentions in the sentence, where k is the number of entities. Given that an event triggered by t ∈ S is detected in the ED stage, EAE needs to map the sentence with the event to a sequence of argument roles R = {y_1, ..., y_k}, where y_i denotes the argument role that entity e_i plays in the event.

3 The Proposed Approach

We employ an encoder-decoder architecture for the problem defined above, which is similar to most Seq2Seq models in machine translation (Vaswani et al., 2017; Zhang et al., 2018), automatic text summarization (Song et al., 2019; Shi et al., 2021), and speech recognition (Tüske et al., 2019; Hannun et al., 2019) from a high-level perspective.
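To make the formulation concrete, the sketch below spells out the input/output structure of one EAE instance using the running example from the introduction. It is a minimal illustration in Python; the field names are ours rather than the paper's or ACE 2005's actual data schema.

```python
# Minimal sketch of the Seq2Seq-like EAE formulation (Section 2), using the
# running example from the introduction. Field names are illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class EAEInstance:
    sentence: List[str]   # S = {w_1, ..., w_n}
    trigger: str          # t in S, detected by the event detection (ED) stage
    event_type: str       # event type of t, e.g. Attack
    entities: List[str]   # E = {e_1, ..., e_k}, candidate arguments
    roles: List[str]      # R = {y_1, ..., y_k}, one role label per entity

example = EAEInstance(
    sentence="In Baghdad , a cameraman died when an American tank fired "
             "on the Palestine Hotel".split(),
    trigger="fired",
    event_type="Attack",
    entities=["Baghdad", "cameraman", "American tank", "Palestine Hotel"],
    roles=["Place", "Target", "Instrument", "Target"],
)

# The model maps (sentence, trigger, entities) to the role sequence `roles`,
# one decoding step per entity, rather than classifying each entity in isolation.
assert len(example.entities) == len(example.roles)
```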
Figure 1: The detailed architecture of our proposed approach. The figure depicts a concrete case where a sentence contains an Attack event (triggered by "fired") and 4 candidate arguments {e_1, e_2, e_3, e_4}. The encoder on the left converts the sentence into intermediate continuous representations. Then the forward decoder and backward decoder generate the argument role sequences in a left-to-right and a right-to-left manner (denoted by →y_i and ←y_i) respectively. A classifier is finally adopted to make the final prediction y_i. The forward and backward decoders share the instance feature extractor and generate the same instance representation x_i for the i-th entity. The histograms in green and brown denote the probability distributions generated by the forward decoder and backward decoder respectively. The orange histograms denote the final predictions. Note that the histograms are for illustration only and do not represent the true probability distributions.

In particular, as Figure 1 shows, our architecture consists of an encoder that converts the sentence S with a specific event trigger into an intermediate vectorized representation, and a decoder that generates a sequence of argument roles entity by entity. The decoder is an entity-level recurrent network whose number of decoding steps is fixed to the number of entities in the corresponding sentence. On each decoding step, we feed the prediction results of the previously-processed entity into the recurrent unit to make a prediction for the current entity. Since the predicted results of both left- and right-side entities can be potentially valuable information, we further incorporate a bidirectional decoding mechanism that effectively integrates a forward decoding process and a backward decoding process.

3.1 Encoder

Given the sentence S = (w_1, ..., w_n) containing a trigger t ∈ S and k candidate arguments E = {e_1, ..., e_k}, an encoder is adopted to encode the word sequence into a sequence of continuous representations as follows[2]:

H = (h_1, ..., h_n) = F(w_1, ..., w_n)    (1)

where F(·) is the neural network used to encode the sentence. In this paper, we select BERT (Devlin et al., 2019) as the encoder. Considering that the representation H does not contain event type information, which is essential for predicting argument roles, we append a special phrase denoting the event type of t to each input sequence, such as "# ATTACK #".

[2] The trigger word can be detected by any event detection model, which is not the scope of this paper.
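To illustrate the encoder, here is a small sketch that encodes the sentence with a pretrained BERT from the HuggingFace transformers library after attaching an event-type phrase. Passing the phrase as a second text segment is our assumption; the paper only states that the phrase is appended to each input sequence.

```python
# Sketch of the BERT encoder of Eq. (1) with an appended event-type phrase.
# Feeding "# ATTACK #" as a second segment is an assumption about how the
# phrase is attached; the paper does not spell out the exact construction.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

sentence = ("In Baghdad, a cameraman died when an American tank fired "
            "on the Palestine Hotel")
event_type_phrase = "# ATTACK #"   # marks the event type of the trigger t

inputs = tokenizer(sentence, event_type_phrase, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# H = (h_1, ..., h_n): one contextual vector per word piece.
H = outputs.last_hidden_state      # shape: (1, sequence_length, 768)
```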
3.2 Decoder

Different from traditional token-level Seq2Seq models, we use a bi-directional entity-level recurrent decoder (BERD) with a classifier to generate a sequence of argument roles entity by entity. BERD consists of a forward and a backward recurrent decoder, which exploit the same recurrent unit architecture, described as follows.

3.2.1 Recurrent Unit

The recurrent unit is designed to explicitly utilize two kinds of information: (1) the instance information, which contains the sentence, event, and candidate argument (denoted by S, t, e); and (2) the contextual argument information, which consists of the argument roles of other entities (denoted by A). The recurrent unit exploits two corresponding feature extractors as follows.

Instance Feature Extractor. Given the representation H generated by the encoder, dynamic multi-pooling (Chen et al., 2015) is applied to extract the max values of three split parts, which are decided by the event trigger and the candidate argument. The three hidden embeddings are aggregated into an instance feature representation x as follows:

[x_{1,p_t}]_i = max{[h_1]_i, ..., [h_{p_t}]_i}
[x_{p_t+1,p_e}]_i = max{[h_{p_t+1}]_i, ..., [h_{p_e}]_i}    (2)
[x_{p_e+1,n}]_i = max{[h_{p_e+1}]_i, ..., [h_n]_i}
x = [x_{1,p_t}; x_{p_t+1,p_e}; x_{p_e+1,n}]

where [·]_i is the i-th value of a vector, and p_t, p_e are the positions of the trigger t and the candidate argument e.

Argument Feature Extractor. To incorporate previously-generated arguments, we exploit a CNN to extract an argument feature x^a from the contextual argument information A. Together with the instance feature x, it serves as the input feature representation for the argument role classifier, which estimates the role that e plays in the event as follows:

p = f(W[x; x^a] + b)
o = Softmax(p)    (5)

where W and b are weight parameters and o is the probability distribution over the role label space. For the sake of simplicity, in the rest of the paper we use Unit(S, t, e, A) to represent the calculation of the probability distribution o by the recurrent unit with S, t, e, A as inputs.

3.2.2 Forward Decoder

Given the sentence S with k candidate arguments E = {e_1, ..., e_k}, the forward decoder exploits the above recurrent unit and generates the argument role sequence in a left-to-right manner. The conditional probability of the argument role sequence is calculated as follows:

P(R|E, S, t) = ∏_{i=1}^{k} p(y_i | e_i; R_{<i}, S, t)    (6)

where R_{<i} denotes the argument roles predicted for the entities on the left of e_i. The argument feature extracted by the recurrent units of the forward decoder is denoted as →x^a_i.
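The recurrent unit above can be read as dynamic multi-pooling (Eq. 2) followed by a small classifier over the concatenated instance and argument features (Eq. 5). The PyTorch sketch below is our own paraphrase under stated assumptions: the dimensions are illustrative, the activation f is assumed to be tanh, and the CNN-based argument feature x^a is taken as given because its extractor is only partially described above.

```python
# PyTorch sketch of the recurrent unit: dynamic multi-pooling of H around the
# trigger and candidate-argument positions (Eq. 2), then the role classifier
# over [x; x^a] (Eq. 5). Dimensions are illustrative; f = tanh is an assumption;
# the argument feature x_a is assumed to be produced by a separate CNN.
import torch
import torch.nn as nn

class RecurrentUnit(nn.Module):
    def __init__(self, hidden_dim: int, arg_feat_dim: int, num_roles: int):
        super().__init__()
        self.classifier = nn.Linear(3 * hidden_dim + arg_feat_dim, num_roles)

    @staticmethod
    def dynamic_multi_pooling(H: torch.Tensor, p_t: int, p_e: int) -> torch.Tensor:
        """Max-pool the three segments of H delimited by the trigger and argument."""
        lo, hi = sorted((p_t, p_e))
        seg1 = H[: lo + 1].max(dim=0).values         # [h_1 .. h_lo]
        seg2 = H[lo + 1 : hi + 1].max(dim=0).values   # [h_{lo+1} .. h_hi]
        seg3 = H[hi + 1 :].max(dim=0).values          # [h_{hi+1} .. h_n]
        return torch.cat([seg1, seg2, seg3], dim=-1)  # x in Eq. (2)

    def forward(self, H, p_t, p_e, x_a):
        x = self.dynamic_multi_pooling(H, p_t, p_e)                 # instance feature
        p = torch.tanh(self.classifier(torch.cat([x, x_a], dim=-1)))  # Eq. (5)
        return torch.softmax(p, dim=-1)                             # o: role distribution

# Toy usage: 12 tokens, hidden size 8, 4-dim argument feature, 36 role labels.
unit = RecurrentUnit(hidden_dim=8, arg_feat_dim=4, num_roles=36)
H = torch.randn(12, 8)                        # encoder output for one sentence
o = unit(H, p_t=7, p_e=2, x_a=torch.randn(4))
print(o.shape)                                # torch.Size([36])
```

Note that the sketch assumes the trigger and the candidate argument are interior tokens, so none of the three pooled segments is empty.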
3.2.3 Backward Decoder

Similarly, the backward decoder generates the argument role sequence in a right-to-left manner; the probability distribution over the role label space for the i-th entity e_i is calculated as follows:

←y_i = Unit(S, t, e_i, ←A_i)    (9)

where ←A_i denotes the contextual argument information of the i-th decoding step, which contains the previously-predicted argument roles R_{>i}. We update ←A_{i-1} by labeling e_i as g(←y_i) for the next step i-1. The argument feature extracted by the recurrent units of the backward decoder is denoted as ←x^a_i.

3.2.4 Classifier

To utilize both left- and right-side argument information, a classifier is adopted to combine the argument features of both decoders and make the final prediction for each entity e_i as follows:

p_i = f(W_c [x_i; →x^a_i; ←x^a_i] + b_c)
y_i = Softmax(p_i)    (10)

where y_i denotes the final probability distribution for e_i, and W_c and b_c are weight parameters.

3.3 Training and Optimization

As seen, the forward decoder and backward decoder in BERD mainly play two important roles. The first is to yield intermediate argument features for the final classifier, and the second is to make the initial predictions fed into the argument feature extractor. Since the initial predictions of the two decoders are crucial to generating accurate argument features, we need to optimize their own classifiers in addition to the final classifier.

We use →p(y_i|e_i) and ←p(y_i|e_i) to represent the probability of e_i playing role y_i estimated by the forward and backward decoder respectively, and p(y_i|e_i) denotes the final probability of e_i playing role y_i estimated by Equation 10. The optimization objective function is defined as follows:

J(θ) = − Σ_{S∈D} Σ_{t∈S} Σ_{e_i∈E_S} [ α log p(y_i|e_i; R_{≠i}, S, t) + β log →p(y_i|e_i) + γ log ←p(y_i|e_i) ]    (11)

where D denotes the training set and t ∈ S denotes the trigger word detected by the previous event detection model in sentence S. E_S represents the entity mentions in S. α, β and γ are the weights for the losses of the final classifier, the forward decoder and the backward decoder respectively.
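Read as code, Eq. (11) is simply a weighted sum of three cross-entropy terms, one for the final classifier and one for each decoder. The sketch below is our own illustration with random tensors standing in for the real model outputs; it is not the authors' implementation.

```python
# Sketch of the training objective in Eq. (11): a weighted sum of the final
# classifier's loss and the two decoders' own losses for one event. The logits
# are random placeholders for the scores from Eq. (10) and the two decoders.
import torch
import torch.nn.functional as F

alpha, beta, gamma = 1.0, 0.5, 0.5            # loss weights (see Section 4.1.2)
k, num_roles = 4, 36                          # 4 candidate arguments in the event

gold = torch.randint(0, num_roles, (k,))      # gold role index per entity
final_logits = torch.randn(k, num_roles, requires_grad=True)  # classifier, Eq. (10)
fwd_logits = torch.randn(k, num_roles, requires_grad=True)    # forward decoder
bwd_logits = torch.randn(k, num_roles, requires_grad=True)    # backward decoder

# cross_entropy applies log-softmax internally, so each term is -log p(y_i | e_i);
# reduction="sum" matches the sum over entities in Eq. (11).
loss = (alpha * F.cross_entropy(final_logits, gold, reduction="sum")
        + beta * F.cross_entropy(fwd_logits, gold, reduction="sum")
        + gamma * F.cross_entropy(bwd_logits, gold, reduction="sum"))
loss.backward()  # gradients flow to the classifier and both decoders
```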
During training, we apply the teacher forcing mechanism, where gold argument information is fed into BERD's recurrent units, enabling parallel computation and greatly accelerating the training process. Once the model is trained, we first use the forward decoder with a greedy search to sequentially generate a sequence of argument roles in a left-to-right manner. Then, the backward decoder performs decoding in the same way but in a right-to-left manner. Finally, the classifier combines both left- and right-side argument features and makes a prediction for each entity, as Equation 10 shows.
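The inference procedure just described can be summarized as two greedy unidirectional passes followed by a final combination step. The sketch below is a schematic paraphrase: the recurrent units and the final classifier are abstracted into callables, and all names are ours rather than the authors' code.

```python
# Schematic sketch of BERD's inference (Sections 3.2-3.3): a greedy
# left-to-right pass, a greedy right-to-left pass, then a final classifier
# that sees both directions' predictions. Feature extraction is abstracted away.
from typing import Callable, Dict, List, Sequence

def greedy_pass(entities: Sequence[str],
                unit: Callable[[str, Dict[int, int]], List[float]],
                reverse: bool = False) -> List[int]:
    """One unidirectional pass; `unit` mirrors Unit(S, t, e, A), mapping an
    entity and the roles predicted so far to a distribution over role labels."""
    order = range(len(entities) - 1, -1, -1) if reverse else range(len(entities))
    roles: List[int] = [0] * len(entities)
    context: Dict[int, int] = {}              # A: argument roles predicted so far
    for i in order:
        probs = unit(entities[i], dict(context))
        roles[i] = max(range(len(probs)), key=probs.__getitem__)  # greedy argmax
        context[i] = roles[i]                 # feed the prediction back in
    return roles

def berd_decode(entities, fwd_unit, bwd_unit, final_classifier):
    fwd_roles = greedy_pass(entities, fwd_unit)                # left-to-right
    bwd_roles = greedy_pass(entities, bwd_unit, reverse=True)  # right-to-left
    # Final prediction per entity combines both directions (cf. Eq. (10)).
    return [final_classifier(e, fwd_roles, bwd_roles, i)
            for i, e in enumerate(entities)]

# Toy usage with two role labels (0 = "N/A", 1 = "Target"): the dummy unit
# prefers "Target" only for the first entity it labels in each pass.
dummy_unit = lambda entity, ctx: [0.4, 0.6] if not ctx else [0.9, 0.1]
final = lambda e, fwd, bwd, i: "Target" if 1 in (fwd[i], bwd[i]) else "N/A"
print(berd_decode(["e1", "e2", "e3"], dummy_unit, dummy_unit, final))
# -> ['Target', 'N/A', 'Target']
```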
4 Experiments

4.1 Experimental Setup

4.1.1 Dataset

Following most works on EAE (Nguyen et al., 2016; Sha et al., 2018; Yang et al., 2019; Du and Cardie, 2020), we evaluate our models on the most widely-used ACE 2005 dataset, which contains 599 documents annotated with 33 event subtypes and 35 argument roles. We use the same test set containing 40 newswire documents, a development set containing 30 randomly selected documents, and a training set with the remaining 529 documents. We notice that Wang et al. (2019b) used the TAC KBP dataset, which we cannot access online or acquire from them due to copyright. We believe experimenting with settings consistent with most related works (e.g., 27 out of 37 top papers used only the ACE 2005 dataset in the last four years) should yield convincing empirical results.

4.1.2 Hyperparameters

We adopt BERT (Devlin et al., 2019) as the encoder and the proposed bi-directional entity-level recurrent decoder as the decoder for the experiment. The hyperparameters used in the experiment are listed below.

BERT. The hyperparameters of BERT are the same as the BERT_BASE model[4]. We use a dropout probability of 0.1 on all layers.

Argument Feature Extractor. The dimensions of the word embedding, position embedding, event type embedding and argument role embedding for each token are 100, 5, 5 and 10 respectively. We utilize 300 convolution kernels with size 3. The GloVe embeddings (Pennington et al., 2014) are utilized for the initialization of word embeddings[5].

Training. Adam with a learning rate of 6e-05, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01 and a learning rate warmup proportion of 0.1 is used for optimization. We set the number of training epochs and the batch size to 40 and 30 respectively. Besides, we apply dropout with rate 0.5 on the concatenated feature representations. The loss weights α, β and γ are set to 1.0, 0.5 and 0.5 respectively.

[4] https://github.com/google-research/bert
[5] https://nlp.stanford.edu/projects/glove/

4.2 Baselines

We compare our method against the following four baselines. The first two are state-of-the-art models that predict arguments separately without considering argument interaction. We also implement two variants of DMBERT utilizing the latest inter-event and intra-event argument interaction methods, named BERT(Inter) and BERT(Intra) respectively.

1. DMBERT, which adopts BERT as the encoder and generates a representation for each entity mention based on the dynamic multi-pooling operation (Wang et al., 2019a). The candidate arguments are predicted separately.

2. HMEAE, which utilizes the concept hierarchy of argument roles and applies hierarchical modular attention for event argument extraction (Wang et al., 2019b).

3. BERT(Inter), which enhances DMBERT with the inter-event argument interaction adopted by Nguyen et al. (2016). Memory matrices are introduced to store dependencies among event triggers and argument roles.

4. BERT(Intra), which incorporates the intra-event argument interaction adopted by Sha et al. (2018) into DMBERT. The tensor layer and self-matching attention matrix with the same settings are applied in the experiment.

Following previous work (Wang et al., 2019b), we use a pipelined approach for event extraction and implement DMBERT as the event detection model. The same event detection model is used for all the baselines to ensure a fair comparison.

Note that Nguyen et al. (2016) use the last word to represent the entity mention[6], which may lead to insufficient semantic information and inaccurate evaluation, considering that entity mentions may consist of multiple words and overlap with each other. We instead sum the hidden embeddings of all words when collecting lexical features for each entity mention.

[6] Sha et al. (2018) do not introduce the details.

Model         P     R     F1
DMBERT        56.9  57.4  57.2
HMEAE         62.2  56.6  59.3
BERT(Inter)   58.4  57.1  57.8
BERT(Intra)   56.4  61.2  58.7
BERD          59.1  61.5  60.3

Table 2: Overall performance on ACE 2005 (%).

4.3 Main Results

The performance of BERD and the baselines is shown in Table 2 (statistically significant with p < 0.05), from which we have several main observations. (1) Compared with the latest best-performing baseline HMEAE, our method BERD achieves an absolute improvement of 1.0 F1, clearly achieving competitive performance. (2) Incorporating argument interactions brings significant improvements over vanilla DMBERT. For example, BERT(Intra) gains a 1.5 F1 improvement compared with DMBERT, which has the same architecture except for argument interaction. (3) Intra-event argument interaction brings more benefit than inter-event interaction (57.8 for BERT(Inter) vs. 58.7 for BERT(Intra) vs. 60.3 for BERD). (4) Compared with BERT(Inter) and BERT(Intra), our proposed BERD achieves the most significant improvements. We attribute this solid enhancement to BERD's novel Seq2Seq-like architecture that effectively exploits the argument roles of contextual entities.

4.4 Effect of Entity Numbers

To further investigate how our method improves performance, we conduct a comparison and analysis of the effect of entity numbers. Specifically, we first divide the event instances of the test set into subsets based on the number of entities in an event. Since events with a specific number of entities may be too few, results on subsets covering a range of entity numbers yield more robust and convincing conclusions. To make the number of events in all subsets as balanced as possible, we finally obtain a division into four subsets, whose entity numbers are in the ranges [1,3], [4,6], [7,9], and [10,] and whose event quantities account for 28.4%, 28.2%, 25.9%, and 17.5%, respectively.

The performance of all models on the four subsets is shown in Figure 2, from which we can observe a general trend that BERD outperforms the other baselines more significantly as more entities appear in an event.
Figure 2: Comparison on four subsets with different ranges of entity numbers. F1-Score (%) is listed.

More entities usually mean more complex contextual information for a candidate argument, which leads to performance degradation. BERD alleviates this degradation better because of its capability of capturing the argument role information of contextual entities. We notice that BERT(Intra) also outperforms DMBERT significantly on Subset-4, which demonstrates the effectiveness of intra-event argument interaction.

Note that the performance on Subset-1 is worse than that on Subset-2, which looks like an outlier. The reason is that the performance of the first-stage event detection model on Subset-1 is much poorer (e.g., an F1 score of 32.8 for events with one entity).

4.5 Effect of Overlapping Entity Mentions

Though the performance improvement can be easily observed, it is nontrivial to quantitatively verify how BERD captures the distribution pattern of multiple argument roles within an event. In this section, we partly investigate this problem by exploring the effect of overlapping entities. Since there is usually only one entity serving as an argument among multiple overlapping entities, we believe sophisticated EAE models should identify this pattern. Therefore, we divide the test set into two subsets (Subset-O and Subset-N) based on whether an event contains overlapping entity mentions and check all models' performance on these two subsets. Table 3 shows the results, from which we can find that all baselines perform worse on Subset-O. This is a natural result since multiple overlapping entities usually have similar representations, making the pattern mentioned above challenging to capture. BERD performs well on both Subset-O and Subset-N, and its superiority on Subset-O over the baselines is more significant. We attribute this to BERD's capability of distinguishing argument distribution patterns.

Model         Subset-O  Subset-N
DMBERT        56.4      59.4
HMEAE         58.8      59.6
BERT(Inter)   57.3      58.8
BERT(Intra)   58.5      59.2
BERD          60.5      60.1

Table 3: Comparison on sentences with and without overlapping entities (Subset-O vs. Subset-N). F1-Score (%) is listed.

4.6 Effect of the Bidirectional Decoding

To further investigate the effectiveness of the bidirectional decoding process, we exclude the backward decoder or the forward decoder from BERD and obtain two models with only a unidirectional decoder, whose performance is shown in the "-w/ Forward Decoder" and "-w/ Backward Decoder" lines of Table 4. From the results, we observe that: (1) When decoding with only the forward or backward decoder, the performance decreases by 1.6 and 1.3 F1 respectively. The results clearly demonstrate the superiority of the bidirectional decoding mechanism. (2) Though the two model variants suffer performance degradation, they still outperform DMBERT significantly, once again verifying that exploiting contextual argument information, even in only one direction, is beneficial to EAE.

Model                       P     R     F1
DMBERT                      56.9  57.4  57.2
BERD                        59.1  61.5  60.3
-w/ Forward Decoder         58.0  59.4  58.7
-w/ Backward Decoder        58.3  59.8  59.0
-w/ Forward Decoder x2      56.8  61.1  58.9
-w/ Backward Decoder x2     57.2  61.0  59.1
-w/o Recurrent Mechanism    55.3  60.0  57.4

Table 4: Ablation study on ACE 2005 dataset (%).

Considering that the number of model parameters decreases when excluding the forward/backward decoder, we build another two model variants with two decoders of the same direction (denoted by "-w/ Forward Decoder x2" and "-w/ Backward Decoder x2"), whose parameter numbers are exactly equal to BERD's. Table 4 shows that the two enlarged single-direction models have similar performance to their original versions.
We can conclude that the improvement comes from the complementarity of the two decoders with different directions, rather than from an increase in model parameters.

Besides, we exclude the recurrent mechanism by preventing the argument role predictions of contextual entities from being fed into the decoding module, obtaining another model variant named "-w/o Recurrent Mechanism". The performance degradation clearly shows the value of the recurrent decoding process that incorporates argument role information.

4.7 Case Study and Error Analysis

To promote understanding of our method, we demonstrate three concrete examples in Figure 3.

S1: Tens of thousands of destitute Africans try to enter Spain illegally each year by crossing the perilous Strait of Gibraltar to reach the southern mainland or by sailing northwest to the Canary Islands out in the Atlantic
(the perilous Strait of Gibraltar, N/A, Destination ✗, Destination ✗, N/A)
(the southern mainland, N/A, Destination ✗, Destination ✗, N/A)
(the Canary Islands out in the Atlantic, Destination, Destination, Destination, Destination)

S2: Ankara police chief Ercument Yilmaz visited the site of the morning blast but refused to say if a bomb had caused the explosion
(Ankara, N/A, Artifact ✗, N/A, N/A)
(Ankara police, N/A, Artifact ✗, N/A, N/A)
(Ankara police chief, N/A, Artifact ✗, Artifact ✗, N/A)
(Ankara police chief Ercument Yilmaz, Artifact, Artifact, Artifact, Artifact)

S3: Prison authorities have given the nod for Anwar to be taken home later in the afternoon to marry his eldest daughter, Nurul Izzah, to engineer Raja Ahmad Sharir Iskandar in a traditional Malay ceremony, he said
(home, Place, Place, Place, Time-Within ✗)
(later in the afternoon, Time-Within, Time-Within, Time-Within, N/A ✗)

Figure 3: Case study. Entities and triggers are highlighted by green and purple respectively. Each tuple (E, G, P1, P2, P3) denotes the predictions for an entity E with gold label G, where P1, P2 and P3 denote the predictions of DMBERT, BERT(Intra) and BERD respectively. Incorrect predictions are marked with ✗.

Sentence S1 contains a Transport event triggered by "sailing". DMBERT and BERT(Intra) assign the Destination role to the candidate arguments "the perilous Strait of Gibraltar", "the southern mainland" and "the Canary Islands out in the Atlantic", the first two of which are mislabeled. It is an unusual pattern for a Transport event to contain multiple destinations. DMBERT and BERT(Intra) fail to recognize such patterns, showing that they cannot capture this type of correlation among prediction results well. Our BERD, however, leverages previous predictions to generate argument roles entity by entity in a sentence, successfully avoiding this unusual pattern.

S2 contains a Transport event triggered by "visited", and 4 nested entities exist in the phrase "Ankara police chief Ercument Yilmaz". Since these nested entities share the same sentence context, it is not strange that DMBERT wrongly predicts such entities as the same argument role Artifact. Thanks to the bidirectional entity-level recurrent decoder, our method can recognize the distribution pattern of arguments better and hence correctly identifies these nested entities as false instances. In this case, BERD reduces 3 false-positive predictions compared with DMBERT, confirming the results and analysis of Table 3.

As a qualitative error analysis, the last example S3 demonstrates that incorporating previous predictions may also lead to an error propagation problem. S3 contains a Marry event triggered by "marry". The entity "home" is mislabeled with the Time-Within role by BERD, and this wrong prediction is then used as an argument feature to identify the entity "later in the afternoon", whose role is Time-Within. As analyzed in the first case, BERD tends to avoid repetitive roles in a sentence, leading to this entity incorrectly being predicted as N/A.

5 Related Work

We have covered research on EAE in Section 1; related work that inspires our technical design is mainly introduced in the following.

Though our recurrent decoder is entity-level, our bidirectional decoding mechanism is inspired by bidirectional decoders in token-level Seq2Seq models, e.g., for machine translation (Zhou et al., 2019), speech recognition (Chen et al., 2020) and scene text recognition (Gao et al., 2019).

We formalize the task of EAE as a Seq2Seq-like learning problem instead of a classic classification problem or sequence labeling problem. We have found that some works in other fields also perform classification or sequence labeling in a Seq2Seq manner. For example, Yang et al. (2018) formulate the multi-label classification task as a sequence generation problem to capture the correlations between labels. Daza and Frank (2018) explore an encoder-decoder model for semantic role labeling. We are the first to employ a Seq2Seq-like architecture to solve the EAE task.
6 Conclusion

We have presented BERD, a neural architecture with a Bidirectional Entity-level Recurrent Decoder that achieves competitive performance on the task of event argument extraction (EAE). One main characteristic that distinguishes our techniques from previous works is that we formalize EAE as a Seq2Seq-like learning problem instead of a classic classification or sequence labeling problem. The novel bidirectional decoding mechanism enables BERD to utilize both the left- and right-side argument predictions effectively to generate a sequence of argument roles that better follows the overall distribution patterns over a sentence.

As pioneer research that introduces a Seq2Seq-like architecture into the EAE task, BERD also faces some open questions. For example, since we use gold argument roles as prediction results during training, how to alleviate the exposure bias problem is worth investigating. We are also interested in incorporating our techniques into more sophisticated models that jointly extract triggers and arguments.

Acknowledgements

We thank the anonymous reviewers for their valuable comments. This research was supported by the National Key Research and Development Program of China (No. 2019YFB1405802) and the central government guided local science and technology development fund projects (science and technology innovation base projects) No. 206Z0302G.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Xi Chen, Songyang Zhang, Dandan Song, Peng Ouyang, and Shouyi Yin. 2020. Transformer with bidirectional decoder for speech recognition. pages 1773–1777.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the ACL and the 7th IJCNLP, pages 167–176.

Angel Daza and Anette Frank. 2018. A sequence-to-sequence model for semantic role labeling. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 207–216.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Xinya Du and Claire Cardie. 2020. Event extraction by answering (almost) natural questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 671–683.

Yunze Gao, Yingying Chen, Jinqiao Wang, and Hanqing Lu. 2019. Gate-based bidirectional interactive decoding network for scene text recognition. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2273–2276.

Awni Hannun, Ann Lee, Qiantong Xu, and Ronan Collobert. 2019. Sequence-to-sequence speech recognition with time-depth separable convolutions. Proc. Interspeech 2019, pages 3785–3789.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the NAACL, pages 300–309.

Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 365–371.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Lei Sha, Jing Liu, Chin-Yew Lin, Sujian Li, Baobao Chang, and Zhifang Sui. 2016. RBPB: Regularization-based pattern balancing method for event extraction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1224–1234.

Lei Sha, Feng Qian, Baobao Chang, and Zhifang Sui. 2018. Jointly extracting event triggers and arguments by dependency-bridge RNN and tensor-based argument interaction. In Proceedings of the 32nd AAAI, New Orleans, Louisiana, USA, pages 5916–5923.

Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and Chandan K. Reddy. 2021. Neural abstractive text summarization with sequence-to-sequence models. ACM Transactions on Data Science, 2(1):1–37.

Shengli Song, Haitao Huang, and Tongxiao Ruan. 2019. Abstractive text summarization using LSTM-CNN based deep learning. Multimedia Tools and Applications, 78(1):857–875.

Zoltán Tüske, Kartik Audhkhasi, and George Saon. 2019. Advancing sequence-to-sequence based speech recognition. In INTERSPEECH, pages 3780–3784.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Xiaozhi Wang, Xu Han, Zhiyuan Liu, Maosong Sun, and Peng Li. 2019a. Adversarial training for weakly supervised event detection. In Proceedings of the 2019 Conference of the NAACL, pages 998–1008.

Xiaozhi Wang, Ziqi Wang, Xu Han, Zhiyuan Liu, Juanzi Li, Peng Li, Maosong Sun, Jie Zhou, and Xiang Ren. 2019b. HMEAE: Hierarchical modular event argument extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5781–5787.

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence generation model for multi-label classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3915–3926.

S. Yang, D. W. Feng, L. B. Qiao, Z. G. Kan, and D. S. Li. 2019. Exploring pre-trained language models for event extraction and generation. In Proceedings of the 57th Annual Meeting of the ACL.

Hong Yu, Zhang Jianfeng, Ma Bin, Yao Jianmin, Zhou Guodong, and Zhu Qiaoming. 2011. Using cross-entity inference to improve event extraction. In Proceedings of the 49th Annual Meeting of the ACL, pages 1127–1136. Association for Computational Linguistics.

Ying Zeng, Honghui Yang, Yansong Feng, Zheng Wang, and Dongyan Zhao. 2016. A convolution BiLSTM neural network model for Chinese event extraction. In NLUIA, pages 275–287. Springer.

Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. 2018. Asynchronous bidirectional decoding for neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Zhisong Zhang, Xiang Kong, Zhengzhong Liu, Xuezhe Ma, and Eduard Hovy. 2020. A two-step approach for implicit event argument detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7479–7485.

Long Zhou, Jiajun Zhang, and Chengqing Zong. 2019. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics, 7:91–105.