Improving Relation Extraction with Knowledge-attention
Pengfei Li (1), Kezhi Mao (1), Xuefeng Yang (3), and Qi Li (2)
(1) School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
(2) Interdisciplinary Graduate School, Nanyang Technological University, Singapore
(3) ZhuiYi Technology, Shenzhen, China
{pli006,ekzmao,liqi0024}@ntu.edu.sg, ryan@wezhuiyi.com
∗ Corresponding author.
arXiv:1910.02724v2 [cs.CL] 4 Mar 2020
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China.

Abstract

While attention mechanisms have been proven effective in many NLP tasks, the majority of them are data-driven. We propose a novel knowledge-attention encoder which incorporates prior knowledge from external lexical resources into deep neural networks for the relation extraction task. Furthermore, we present three effective ways of integrating knowledge-attention with self-attention to maximize the utilization of both knowledge and data. The proposed relation extraction system is end-to-end and fully attention-based. Experiment results show that the proposed knowledge-attention mechanism has complementary strengths with self-attention, and our integrated models outperform existing CNN, RNN, and self-attention based models. State-of-the-art performance is achieved on TACRED, a complex and large-scale relation extraction dataset.

1 Introduction

Relation extraction aims to detect the semantic relationship between two entities in a sentence. For example, given the sentence "James Dobson has resigned as chairman of Focus On The Family, which he founded thirty years ago.", the goal is to recognize the organization-founder relation held between "Focus On The Family" and "James Dobson". The various relations between entities extracted from large-scale unstructured texts can be used for ontology and knowledge base population (Chen et al., 2018a; Fossati et al., 2018), as well as for facilitating downstream tasks that require relational understanding of texts, such as question answering (Yu et al., 2017) and dialogue systems (Young et al., 2018).

Traditional feature-based and kernel-based approaches require extensive feature engineering (Suchanek et al., 2006; Qian et al., 2008; Rink and Harabagiu, 2010). Deep neural networks such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) can explore more complex semantics and extract features automatically from raw texts for relation extraction tasks (Xu et al., 2016; Vu et al., 2016; Lee et al., 2017). Recently, attention mechanisms have been introduced into deep neural networks to improve their performance (Zhou et al., 2016; Wang et al., 2016; Zhang et al., 2017). In particular, the Transformer proposed by Vaswani et al. (2017) is based solely on self-attention and has demonstrated better performance than traditional RNNs (Bilan and Roth, 2018; Verga et al., 2018). However, deep neural networks normally require sufficient labeled data to train their numerous model parameters. Scarce or low-quality training data limits a model's ability to recognize complex relations and also causes overfitting.

A recent study (Li and Mao, 2019) shows that incorporating prior knowledge from external lexical resources into deep neural networks can reduce the reliance on training data and improve relation extraction performance. Motivated by this, we propose a novel knowledge-attention mechanism, which transforms texts from word semantic space into relational semantic space by attending to relation indicators that are useful in recognizing different relations. The relation indicators are automatically generated from lexical knowledge bases and represent the keywords and cue phrases of different relation expressions.
While the existing self-attention encoder learns internal semantic features by attending to the input texts themselves, the proposed knowledge-attention encoder captures the linguistic clues of different relations based on external knowledge. Since the two attention mechanisms complement each other, we integrate them into a single model to maximize the utilization of both knowledge and data, and achieve optimal performance for relation extraction.
In summary, the main contributions of this paper are: (1) We propose the knowledge-attention encoder, a novel attention mechanism which incorporates prior knowledge from external lexical resources to effectively capture the informative linguistic clues for relation extraction. (2) To take advantage of both knowledge-attention and self-attention, we propose three integration strategies: multi-channel attention, softmax interpolation, and knowledge-informed self-attention. Our final models are fully attention-based and can be easily set up for end-to-end training. (3) We present a detailed analysis of the knowledge-attention encoder. Results show that it has complementary strengths with the self-attention encoder, and the integrated models achieve state-of-the-art results for relation extraction.

2 Related Works

We focus here on deep neural networks for relation extraction since they have demonstrated better performance than traditional feature-based and kernel-based approaches.

Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are the earliest and most commonly used approaches for relation extraction. Zeng et al. (2014) showed that a CNN with position embeddings is effective for relation extraction. Similarly, CNNs with multiple filter sizes (Nguyen and Grishman, 2015), a pairwise ranking loss function (dos Santos et al., 2015), and auxiliary embeddings (Lee et al., 2017) were proposed to improve performance. Zhang and Wang (2015) proposed a bi-directional RNN with max pooling to model sequential relations. Instead of modeling the whole sentence, running an RNN over sub-dependency trees (e.g., the shortest dependency path between two entities) has been shown to be effective in capturing long-distance relation patterns (Xu et al., 2016; Miwa and Bansal, 2016). Zhang et al. (2018) proposed graph convolution over dependency trees and achieved state-of-the-art results on the TACRED dataset.

Recently, attention mechanisms have been widely applied to CNNs (Wang et al., 2016; Han et al., 2018) and RNNs (Zhou et al., 2016; Zhang et al., 2017; Du et al., 2018). The improved performance demonstrated the effectiveness of attention mechanisms in deep neural networks. In particular, Vaswani et al. (2017) proposed a solely self-attention-based model called the Transformer, which is more effective than RNNs in capturing long-distance features since it is able to draw global dependencies without regard to their distances in the sequences. Bilan and Roth (2018) first applied a self-attention encoder to the relation extraction task and achieved competitive results on the TACRED dataset. Verga et al. (2018) used self-attention to encode long contexts spanning multiple sentences for biological relation extraction. However, a self-attention encoder requires more attention heads and layers to capture complex semantic and syntactic information, since learning is based solely on training data; hence more high-quality training data and computational power are needed. Our work utilizes knowledge from external lexical resources to improve a deep neural network's ability to capture informative linguistic clues.

External knowledge has been shown to be effective in neural networks for many NLP tasks. Existing works focus on utilizing external knowledge to improve embedding representations (Chen et al., 2015; Liu et al., 2015; Sinoara et al., 2019), CNNs (Toutanova et al., 2015; Wang et al., 2017; Li and Mao, 2019), and RNNs (Ahn et al., 2016; Chen et al., 2016, 2018b; Shen et al., 2018). Our work is the first to incorporate knowledge into the Transformer through a novel knowledge-attention mechanism to improve its performance on the relation extraction task.

3 Knowledge-attention Encoder

We present the proposed knowledge-attention encoder in this section. Relation indicators are first generated from external lexical resources (Section 3.1); then the input texts are transformed from word semantic space into relational semantic space by attending to the relation indicators using the knowledge-attention mechanism (Section 3.2); finally, position-aware attention is used to summarize the input sequence by taking both relation semantics and relative positions into consideration (Section 3.3).

3.1 Relation Indicators Generation

Relation indicators represent the keywords or cue phrases of various relation types, which are essential for the knowledge-attention encoder to capture the linguistic clues of a certain relation from texts.
Figure 1: Knowledge-attention process (left) and multi-head structure (right) of the knowledge-attention encoder.

We utilize two publicly available lexical resources, FrameNet [1] and Thesaurus.com [2], to find such lexical units.

FrameNet is a large lexical knowledge base which categorizes English words and sentences into higher-level semantic frames (Ruppenhofer et al., 2006). Each frame is a conceptual structure describing a type of event, object, or relation. FrameNet contains over 1200 semantic frames, many of which represent various semantic relations. For each relation type in our relation extraction task, we first find all the relevant semantic frames by searching FrameNet (refer to the Appendix for the detailed semantic frames used). Then we extract all the lexical units involved in these frames, which are exactly the keywords or phrases that are often used to express such a relation. Thesaurus.com is the largest online thesaurus, with over 3 million synonyms and antonyms. It also offers the flexibility to filter search results by relevance, POS tag, word length, and complexity. To broaden the coverage of relation indicators, we utilize the synonyms in Thesaurus.com to extend the lexical units extracted from FrameNet. To reduce noise, only the most relevant synonyms with the same POS tag are selected.

Relation indicators are generated based on the word embeddings and POS tags of the lexical units. Formally, given a word in a lexical unit, we find its word embedding w_i ∈ R^{d_w} and POS embedding t_i ∈ R^{d_t} by looking up the word embedding matrix W^{wrd} ∈ R^{d_w × V^{wrd}} and the POS embedding matrix W^{pos} ∈ R^{d_t × V^{pos}} respectively, where d_w and d_t are the dimensions of the word and POS embeddings, V^{wrd} is the vocabulary size [3], and V^{pos} is the total number of POS tags. The corresponding relation indicator is formed by concatenating the word embedding and the POS embedding, k_i = [w_i, t_i]. If a lexical unit contains multiple words (i.e., a phrase), the corresponding relation indicator is formed by averaging the embeddings of all its words. Eventually, around 3000 relation indicators (including 2000 synonyms) are generated: K = {k_1, k_2, ..., k_m}.

[1] https://framenet.icsi.berkeley.edu/fndrupal
[2] https://www.thesaurus.com
[3] The same word embedding matrix is used for relation indicators and input texts, hence the vocabulary also includes all the words in the training corpus.
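To make the construction concrete, the following Python sketch shows one way to turn mined lexical units into the indicator matrix K under the scheme described above (word + POS embedding concatenation, averaging over multi-word phrases). The function name `build_relation_indicators` and the data structures for lexical units and lookup tables are our own assumptions; the paper does not specify an implementation.

```python
import numpy as np

def build_relation_indicators(lexical_units, word_emb, pos_emb, word2id, pos2id):
    """Assemble K = {k_1, ..., k_m} from lexical units mined from FrameNet frames
    and extended with Thesaurus.com synonyms. Hypothetical input format:
    each unit is a (tokens, pos_tag) pair."""
    indicators = []
    for tokens, pos_tag in lexical_units:
        vecs = [word_emb[word2id[t]] for t in tokens if t in word2id]
        if not vecs or pos_tag not in pos2id:
            continue  # skip units whose words are all out of vocabulary
        w = np.mean(vecs, axis=0)          # average word embeddings over a phrase
        t = pos_emb[pos2id[pos_tag]]       # POS embedding of the lexical unit
        indicators.append(np.concatenate([w, t]))   # k_i = [w_i, t_i]
    return np.stack(indicators)            # shape (m, d_w + d_t)
```

With d_w = 300 and d_t = 30 (the dimensions reported later in the appendix), each indicator would be a 330-dimensional vector.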
3.2 Knowledge Attention

3.2.1 Knowledge-attention process

In a typical attention mechanism, a query (q) is compared with the keys (K) in a set of key-value pairs and the corresponding attention weights are calculated. The attention output is the weighted sum of the values (V) using the attention weights. In our proposed knowledge-attention encoder, the queries are the input texts and the key-value pairs are both relation indicators. The detailed process of knowledge-attention is shown in Figure 1 (left).

Formally, given the text input x = {x_1, x_2, ..., x_n}, the input embeddings Q = {q_1, q_2, ..., q_n} are generated by concatenating each word's word embedding and POS embedding, in the same way as relation indicator generation in Section 3.1. The hidden representations H = {h_1, h_2, ..., h_n} are obtained by attending to the knowledge indicators K, as shown in Equation 1. The final knowledge-attention outputs are obtained by subtracting the mean of the relation indicators from the hidden representations, as shown in Equation 2.

H = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V    (1)

\mathrm{knwl}(Q, K, V) = H - \frac{1}{m}\sum_{i=1}^{m} k_i    (2)

where knwl denotes the knowledge-attention process, m is the number of relation indicators, and d_k is the dimension of the key/query vectors, used as a scaling factor as in Vaswani et al. (2017). Subtracting the mean of the relation indicators results in small outputs for irrelevant words. More importantly, the resulting output will be close to the related relation indicators and further apart from the other relation indicators in relational semantic space. Therefore, the proposed knowledge-attention mechanism is effective in capturing the linguistic clues of the relations represented by the relation indicators in the relational semantic space.

3.2.2 Multi-head knowledge-attention

Inspired by the multi-head attention in the Transformer (Vaswani et al., 2017), we also use multi-head knowledge-attention, which first linearly transforms Q, K and V h times and then performs h knowledge-attentions simultaneously, as shown in Figure 1 (right). Different from the Transformer encoder, we use the same linear transformation for Q and K in each head to keep the correspondence between queries and keys.

\mathrm{head}_i = \mathrm{knwl}(QW_i^Q, KW_i^Q, VW_i^V)    (3)

where W_i^Q, W_i^V ∈ R^{d_k × (d_k/h)} and i ∈ [1, 2, ..., h]. Besides, only one residual connection, from the input embeddings to the outputs of the position-wise feed-forward network, is used. We also mask the outputs of padding tokens using zero vectors.

The multi-head structure in knowledge-attention allows the model to jointly attend inputs to different relational semantic subspaces with different contributions of relation indicators. This is beneficial in recognizing complex relations where various compositions of relation indicators are needed.
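The PyTorch sketch below is our own illustration of Equations 1-3, not the authors' code: a shared per-head projection for queries and keys, a separate projection for values, scaled dot-product attention against the relation indicators, and subtraction of the indicator mean. In the multi-head setting we subtract the mean of the projected values, which is one consistent reading of Equation 2; the position-wise feed-forward layer, residual connection, and normalization mentioned above are omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class MultiHeadKnowledgeAttention(nn.Module):
    """Illustrative multi-head knowledge-attention (Equations 1-3)."""
    def __init__(self, d_k, n_heads):
        super().__init__()
        assert d_k % n_heads == 0
        d_head = d_k // n_heads
        # Shared W_i^Q for queries AND keys (Equation 3), separate W_i^V for values.
        self.w_qk = nn.ModuleList([nn.Linear(d_k, d_head, bias=False) for _ in range(n_heads)])
        self.w_v = nn.ModuleList([nn.Linear(d_k, d_head, bias=False) for _ in range(n_heads)])

    def forward(self, Q, K, pad_mask=None):
        # Q: (batch, n, d_k) token embeddings; K: (m, d_k) relation indicators.
        # pad_mask: (batch, n) with 1.0 for real tokens and 0.0 for padding.
        heads = []
        for w_qk, w_v in zip(self.w_qk, self.w_v):
            q, k, v = w_qk(Q), w_qk(K), w_v(K)
            scores = torch.matmul(q, k.t()) / math.sqrt(k.size(-1))   # (batch, n, m)
            H = torch.matmul(torch.softmax(scores, dim=-1), v)        # Equation 1
            heads.append(H - v.mean(dim=0))                           # Equation 2 on projected values
        out = torch.cat(heads, dim=-1)                                # (batch, n, d_k)
        if pad_mask is not None:
            out = out * pad_mask.unsqueeze(-1)   # zero vectors at padding positions
        return out
```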
3.3 Position-aware Attention

It has been proven that the relative position information of each token with respect to the two target entities is beneficial for the relation extraction task (Zeng et al., 2014). We modify the position-aware attention originally proposed by Zhang et al. (2017) to incorporate such relative position information and determine the importance of each token to the final sentence representation.

Assume the relative position of token x_i to a target entity is \hat{p}_i. We apply the position binning function (Equation 4) to make it easier for the model to distinguish long and short relative distances.

p_i = \begin{cases} \hat{p}_i & \text{if } |\hat{p}_i| \le 2 \\ \frac{\hat{p}_i}{|\hat{p}_i|}\left\lceil \log_2 |\hat{p}_i| + 1 \right\rceil & \text{if } |\hat{p}_i| > 2 \end{cases}    (4)

After getting the relative positions p_i^s and p_i^o to the two entities of interest (subject and object respectively), we map them to position embeddings based on a shared position embedding matrix W^p. The two embeddings are concatenated to form the final position embedding for token x_i: p_i = [p_i^s, p_i^o].

Position-aware attention is performed on the outputs of knowledge-attention O ∈ R^{n × d_k}, taking the corresponding relative position embeddings P ∈ R^{n × d_p} into consideration:

f = O^T \mathrm{softmax}\left(\tanh(OW_o + PW_p)\, c\right)    (5)

where W_o ∈ R^{d_k × d_a}, W_p ∈ R^{d_p × d_a}, d_a is the attention dimension, and c ∈ R^{d_a} is a context vector learned by the neural network.
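The following sketch mirrors Equations 4 and 5 for a single (unbatched) sentence; it is an illustration under our own naming, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def bin_position(p_hat: int) -> int:
    """Equation 4: keep short distances exact, compress long ones logarithmically."""
    if abs(p_hat) <= 2:
        return p_hat
    sign = 1 if p_hat > 0 else -1
    return sign * math.ceil(math.log2(abs(p_hat)) + 1)

class PositionAwareAttention(nn.Module):
    """Equation 5: summarize encoder outputs O using content and position embeddings."""
    def __init__(self, d_k, d_p, d_a):
        super().__init__()
        self.w_o = nn.Linear(d_k, d_a, bias=False)   # W_o
        self.w_p = nn.Linear(d_p, d_a, bias=False)   # W_p
        self.c = nn.Parameter(torch.randn(d_a))      # learned context vector c

    def forward(self, O, P):
        # O: (n, d_k) knowledge-attention outputs; P: (n, d_p) concatenated
        # subject/object relative-position embeddings.
        scores = torch.tanh(self.w_o(O) + self.w_p(P)) @ self.c   # (n,)
        alpha = torch.softmax(scores, dim=0)
        return O.t() @ alpha                                      # f in R^{d_k}
```

For example, bin_position(-1) stays -1, while bin_position(7) becomes 4 (the ceiling of log2(7) + 1, with the original sign).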
4 Integrating Knowledge-attention with Self-attention

The self-attention encoder proposed by Vaswani et al. (2017) learns internal semantic features by modeling pair-wise interactions within the texts themselves, which is effective in capturing long-distance dependencies. Our proposed knowledge-attention encoder has the complementary strength of precisely capturing the linguistic clues of relations based on external knowledge. Therefore, it is beneficial to integrate the two models to maximize the utilization of both external knowledge and training data. In this section, we propose three integration approaches, as shown in Figure 2, and each approach has its own advantages.

Figure 2: Three ways of integrating knowledge-attention with self-attention: multi-channel attention and softmax interpolation (top), as well as knowledge-informed self-attention (bottom).

4.1 Multi-channel Attention

In this approach, self-attention and knowledge-attention are treated as two separate channels that model the sentence from different perspectives. After applying position-aware attention, two feature vectors f_1 and f_2 are obtained from self-attention and knowledge-attention respectively. We apply another attention mechanism, called multi-channel attention, to integrate the feature vectors.

In multi-channel attention, the feature vectors are first fed into a fully connected neural network to obtain their hidden representations h_i. Then attention weights are calculated using a learnable context vector c, which reflects the importance of each feature vector for the final relation classification. Finally, the feature vectors are integrated based on the attention weights, as shown in Equation 6.

r = \sum_i \mathrm{softmax}(h_i^T c)\, h_i    (6)

After obtaining the integrated feature vector r, we pass it to a softmax classifier to determine the relation class. The model is trained using stochastic gradient descent with momentum and learning rate decay to minimize the cross-entropy loss.

The main advantage of this approach is flexibility. Since the two channels process information independently, their input components do not need to be the same. Besides, we can add features from other sources (e.g., subject and object categories) to multi-channel attention so that the final decision is based on all the information sources.

4.2 Softmax Interpolation

Similar to multi-channel attention, softmax interpolation also uses two independent channels for self-attention and knowledge-attention. Instead of integrating the feature vectors, we make two independent predictions using two softmax classifiers based on the feature vectors from the two channels. The loss function is defined as the total cross-entropy loss of the two classifiers. The final prediction is obtained using an interpolation of the two softmax distributions:

p = \beta \cdot p_1 + (1 - \beta) \cdot p_2    (7)

where p_1 and p_2 are the softmax distributions obtained from self-attention and knowledge-attention respectively, and β is the priority weight assigned to self-attention.

Since knowledge-attention focuses on capturing the keywords and cue phrases of relations, its precision is higher than that of self-attention while its recall is lower. The proposed softmax interpolation approach is able to take advantage of both attention mechanisms and balance precision and recall by adjusting the priority weight β.
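Both integration strategies reduce to a few lines; the sketch below is illustrative (class and function names are ours) and uses ReLU for the fully connected layer, matching the activation reported in the appendix.

```python
import torch
import torch.nn as nn

class MultiChannelAttention(nn.Module):
    """Equation 6: weight the channel feature vectors with a learnable context vector."""
    def __init__(self, d_f, d_h):
        super().__init__()
        self.fc = nn.Linear(d_f, d_h)            # fully connected layer producing h_i
        self.c = nn.Parameter(torch.randn(d_h))  # context vector

    def forward(self, features):                 # features: (num_channels, d_f)
        h = torch.relu(self.fc(features))        # hidden representations h_i
        alpha = torch.softmax(h @ self.c, dim=0) # one attention weight per channel
        return (alpha.unsqueeze(-1) * h).sum(dim=0)   # integrated vector r

def interpolate_predictions(p_self, p_knwl, beta=0.8):
    """Equation 7: blend the two softmax distributions; beta is the priority
    weight given to the self-attention channel."""
    return beta * p_self + (1.0 - beta) * p_knwl
```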
4.3 Knowledge-informed Self-attention

Since knowledge-attention and self-attention share similar structures, it is also possible to integrate them into a single channel. We propose the knowledge-informed self-attention encoder, which incorporates knowledge-attention into every self-attention head to jointly model the semantic relations based on both knowledge and data.

The structure of knowledge-informed self-attention is shown in Figure 3. Formally, given the text input matrix Q ∈ R^{n × d_k} and the knowledge indicators K ∈ R^{m × d_k}, the output of each attention head is calculated as follows:

\mathrm{head}_i = \mathrm{knwl}(QW_i^Q, KW_i^Q, KW_i^V) + \mathrm{self}(QW_i^{Q_s}, QW_i^{K_s}, QW_i^{V_s})    (8)

where knwl and self denote knowledge-attention and self-attention respectively, and all the linear transformation weight matrices have the dimensionality W ∈ R^{d_k × (d_k/h)}.
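A per-head sketch of Equation 8 is shown below (illustrative PyTorch, hypothetical class name); it reuses the scaled dot-product and mean-subtraction pattern from the knowledge-attention sketch above, and omits the multi-head concatenation and feed-forward layers.

```python
import math
import torch
import torch.nn as nn

def scaled_attention(q, k, v):
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(q.size(-1))
    return torch.matmul(torch.softmax(scores, dim=-1), v)

class KnowledgeInformedHead(nn.Module):
    """Equation 8: one head combining knowledge-attention (over indicators K)
    with standard self-attention (over the tokens themselves)."""
    def __init__(self, d_k, d_head):
        super().__init__()
        self.w_qk = nn.Linear(d_k, d_head, bias=False)   # shared W_i^Q for the knwl term
        self.w_v = nn.Linear(d_k, d_head, bias=False)    # W_i^V for the knwl term
        self.w_qs = nn.Linear(d_k, d_head, bias=False)   # W_i^{Qs}
        self.w_ks = nn.Linear(d_k, d_head, bias=False)   # W_i^{Ks}
        self.w_vs = nn.Linear(d_k, d_head, bias=False)   # W_i^{Vs}

    def forward(self, Q, K):
        # Q: (n, d_k) token embeddings; K: (m, d_k) relation indicators.
        v_k = self.w_v(K)
        knwl = scaled_attention(self.w_qk(Q), self.w_qk(K), v_k) - v_k.mean(dim=0)
        slf = scaled_attention(self.w_qs(Q), self.w_ks(Q), self.w_vs(Q))
        return knwl + slf   # head_i
```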
Figure 3: Knowledge-informed self-attention structure. Q and K represent the input matrix and knowledge indicators respectively, and h is the number of attention heads.

Since each self-attention head is aided with prior knowledge through knowledge-attention, the knowledge-informed self-attention encoder is able to capture more lexical and semantic information than a single attention encoder.

5 Experiment and Analysis

5.1 Baseline Models

To study the performance of our proposed models, the following baseline models are used for comparison:

CNN-based models, including: (1) CNN: the classical convolutional neural network for sentence classification (Kim, 2014). (2) CNN-PE: a CNN with position embeddings dedicated to relation classification (Nguyen and Grishman, 2015). (3) GCN: a graph convolutional network over the pruned dependency trees of the sentence (Zhang et al., 2018).

RNN-based models, including: (1) LSTM: a long short-term memory network that models the text sequentially; classification is based on the last hidden output. (2) PA-LSTM: a position-aware attention mechanism similar to ours is used to summarize the LSTM outputs (Zhang et al., 2017).

A CNN-RNN hybrid model, the contextualized GCN (C-GCN), where the input vectors are obtained using a bi-directional LSTM network (Zhang et al., 2018).

A self-attention-based model (Self-attn), which uses a self-attention encoder to model the input sentence. Our implementation is based on Bilan and Roth (2018), where several modifications are made to the original Transformer encoder, including the use of relative positional encodings instead of absolute sinusoidal encodings, as well as other configurations such as the residual connection, activation function and normalization.

For our model, we evaluate both the proposed knowledge-attention encoder (Knwl-attn) and the integrated models with self-attention, including multi-channel attention (MCA), softmax interpolation (SI) and knowledge-informed self-attention (KISA).
5.2 Experiment Settings

We conduct our main experiments on TACRED, a large-scale relation extraction dataset introduced by Zhang et al. (2017). TACRED contains over 106k sentences with hand-annotated subject and object entities as well as the relations between them. It is a very complex relation extraction dataset with 41 relation types and a no_relation class for when no relation holds between the entities. The dataset is well suited for real-world relation extraction since it is unbalanced, with 79.5% no_relation samples, and multiple relations between different entity pairs can exist in one sentence. Besides, the samples are normally long sentences, with an average length of 36.2 words.

Since the dataset is already partitioned into train (68,124 samples), dev (22,631 samples) and test (15,509 samples) sets, we tune model hyper-parameters on the dev set and evaluate the model on the test set. The evaluation metrics are micro-averaged precision, recall and F1 score. For fair comparison, we select the model with the median F1 score on the dev set from 5 independent runs, the same as Zhang et al. (2017). The same "entity mask" strategy is used, which replaces the subject (or object) entity with special NER-SUBJ (or NER-OBJ) tokens to avoid overfitting on specific entities and to provide entity type information.

Besides TACRED, another dataset, SemEval2010-Task8 (Hendrickx et al., 2009), is used to evaluate the generalization ability of our proposed model. This dataset is significantly smaller and simpler than TACRED, with 8000 training samples and 2717 testing samples. It contains 9 directed relations and 1 other relation (19 relation classes in total). We use the official macro-averaged F1 score as the evaluation metric.

We use a one-layer encoder with 6 attention heads for both knowledge-attention and self-attention, since further increasing the number of layers and attention heads degrades the performance. For softmax interpolation, we choose β = 0.8 to balance precision and recall. Word embeddings are fine-tuned from pre-trained GloVe (Pennington et al., 2014) with a dimensionality of 300. Dropout (Srivastava et al., 2014) is used during training to alleviate overfitting. Other model hyper-parameters and training details are described in the Appendix due to space limitations.
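As an illustration of the "entity mask" preprocessing described above, the sketch below replaces entity tokens with type-bearing placeholders. Token lists, span indices and NER tags follow TACRED's annotation scheme, but the helper itself and the exact placeholder format are our assumptions, not the authors' code.

```python
def mask_entities(tokens, subj_span, obj_span, subj_ner, obj_ner):
    """Replace subject/object tokens with NER-SUBJ / NER-OBJ placeholders so the
    model cannot memorize specific entity strings but still sees entity types."""
    masked = list(tokens)
    for i in range(subj_span[0], subj_span[1] + 1):
        masked[i] = f"{subj_ner}-SUBJ"
    for i in range(obj_span[0], obj_span[1] + 1):
        masked[i] = f"{obj_ner}-OBJ"
    return masked

# Example (hypothetical sentence and spans):
# mask_entities(["James", "Dobson", "founded", "Focus", "On", "The", "Family"],
#               (0, 1), (3, 6), "PERSON", "ORGANIZATION")
# -> ["PERSON-SUBJ", "PERSON-SUBJ", "founded",
#     "ORGANIZATION-OBJ", "ORGANIZATION-OBJ", "ORGANIZATION-OBJ", "ORGANIZATION-OBJ"]
```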
5.3 Results and Analysis

5.3.1 Results on the TACRED dataset

Table 1: Micro-averaged precision (P), recall (R) and F1 score on the TACRED dataset. †, ‡ and †† mark results reported in (Zhang et al., 2017), (Zhang et al., 2018) and (Bilan and Roth, 2018) respectively. ∗ marks statistically significant improvements over Self-attn with p < 0.01 under a one-tailed t-test.

Model               P     R     F1
CNN†              72.1  50.3  59.2
CNN-PE†           68.2  55.4  61.1
GCN‡              69.8  59.0  64.0
LSTM†             61.4  61.7  61.5
PA-LSTM†          65.7  64.5  65.1
C-GCN‡            69.9  63.3  66.4
Self-attn††       64.6  68.6  66.5
Knwl-attn         70.0  63.1  66.4
Knwl+Self (MCA)   68.4  66.1  67.3*
Knwl+Self (SI)    67.1  68.4  67.8*
Knwl+Self (KISA)  69.4  66.0  67.7*

Table 1 shows the results of the baselines as well as our proposed models on the TACRED dataset. Our proposed knowledge-attention encoder outperforms all CNN-based and RNN-based models by at least 1.3 F1. Meanwhile, it achieves results comparable to C-GCN and the self-attention encoder, which are the current state-of-the-art single-model systems.

Compared with the self-attention encoder, the knowledge-attention encoder yields higher precision but lower recall. This is reasonable: since the knowledge-attention encoder focuses on capturing the significant linguistic clues of relations based on external knowledge, it produces high precision for the predicted relations, similar to rule-based systems. The self-attention encoder is able to capture more long-distance dependency features by learning from data, resulting in better recall. By integrating self-attention and knowledge-attention using the proposed approaches, a more balanced precision and recall can be obtained, suggesting the complementary effects of the self-attention and knowledge-attention mechanisms. The integrated models improve performance by at least 0.9 F1 and achieve new state-of-the-art results among all single end-to-end models.

Comparing the three integrated models, softmax interpolation (SI) achieves the best performance. More interestingly, we found that precision and recall can be controlled by adjusting the priority weight β. Figure 4 shows the impact of β on precision, recall and F1 score: as β increases, precision decreases and recall increases. Therefore, we can choose a small β for a relation extraction system that requires high precision, and a large β for a system requiring better recall. The F1 score reaches its highest value when precision and recall are balanced (β = 0.8).

Figure 4: Change of precision, recall and F1 score on the dev set as the priority weight β in softmax interpolation changes.

Knowledge-informed self-attention (KISA) has performance comparable to softmax interpolation, without the need for hyper-parameter tuning, since knowledge-attention and self-attention are integrated into a single channel. The performance gain over the self-attention encoder is 1.2 F1 with much improved precision, demonstrating the effectiveness of incorporating knowledge-attention into self-attention to jointly model the sentence based on both knowledge and data.
The performance gain is the lowest for multi-channel attention (MCA). However, the model is more flexible in that features from other information sources can be easily added to further improve its performance. Table 2 shows the results of adding NER embeddings of each token to the self-attention channel, and entity (subject and object) categorical embeddings to multi-channel attention as additional feature vectors. We use dimensionalities of 30 and 60 for the NER and entity categorical embeddings respectively, and the two embedding matrices are learned by the neural network. Results show that adding NER and entity categorical information to the MCA integrated model improves the F1 score by 0.2 and 0.5 respectively, and adding both improves precision significantly, resulting in a new best F1 score.
Table 2: Results of adding NER embeddings and entity categorical embeddings to the multi-channel attention (MCA) integrated model.

Model               P     R     F1
MCA               68.4  66.1  67.3
+ NER             68.7  66.2  67.5
+ Entity category 70.0  65.8  67.8
+ Both            70.1  66.0  67.9

Table 3: Macro-averaged F1 score on the SemEval2010-Task8 dataset. ∗ marks statistically significant improvements over Self-attn with p < 0.01 under a one-tailed t-test.

Model             mask entity   keep entity
Self-attn            76.8          83.1
Knwl-attn            76.1          82.3
Knwl+Self (MCA)      77.4*         84.0*
Knwl+Self (SI)       78.0*         84.3*
Knwl+Self (KISA)     77.5*         84.0*

Table 4: Ablation study on the knowledge-attention encoder. Results are the median F1 scores of 5 independent runs on the dev set of TACRED.

Model                             Dev F1
Knwl-attn Encoder                  66.5
1. − Multi-head structure          64.6
2. − Synonym relation indicators   64.7
3. − Relation indicators mean      65.0
4. − Output masking                65.8
5. − Entity masking                65.4
6. − Relative positions            63.0

5.3.2 Results on the SemEval2010-Task8 dataset

We use the SemEval2010-Task8 dataset to evaluate the generalization ability of our proposed model. Experiments are conducted in two manners: masking or keeping the entities of interest. Results in Table 3 show that the "entity mask" strategy degrades the performance, indicating that there are strong correlations between the entities of interest and the relation classes in the SemEval2010-Task8 dataset. Although the results of keeping the entities are better, the model tends to memorize these entities instead of focusing on learning the linguistic clues of relations, which results in poor generalization for sentences with unseen entities.

Regardless of whether the entity mask is used, by incorporating the knowledge-attention mechanism our model improves the performance of self-attention by a statistically significant margin, especially with the softmax interpolation integrated model. The results on SemEval2010-Task8 are consistent with those on TACRED, demonstrating the effectiveness and robustness of our proposed method.

5.3.3 Ablation study

To study the contributions of specific components of the knowledge-attention encoder, we perform ablation experiments on the dev set of TACRED. The results of the knowledge-attention encoder with and without certain components are shown in Table 4.
It is observed that: (1) The proposed multi-head knowledge-attention structure significantly outperforms the single-head version, demonstrating the effectiveness of jointly attending texts to different relational semantic subspaces. (2) The synonyms improve the performance of knowledge-attention since they broaden the coverage of relation indicators and form a robust relational semantic space. (3) Subtracting the relation indicators' mean vector from the attention hidden representations helps to suppress the activation of irrelevant words and results in a better representation of each word for capturing the linguistic clues of relations. (4-5) The two masking strategies are helpful for our model: output masking eliminates the effects of the padding tokens, and entity masking avoids entity overfitting while providing entity type information. (6) The relative position embedding term in position-aware attention contributes a significant amount of F1 score, showing that positional information is particularly important for the relation extraction task.

5.3.4 Attention visualization

To verify the complementary effects of the knowledge-attention encoder and the self-attention encoder, we compare the attention weights the two encoders assign to words.
Table 5: Attention visualization for the knowledge-attention encoder (first row of each pair) and the self-attention encoder (second row). Words are highlighted based on the attention weights assigned to them; best viewed in color. The three sample sentences, their true relations, and the predictions are: (1) "SUBJ-PERSON graduated in 1992 from the OBJ-ORGANIZATION with a degree in computer science and had worked as a systems analyst at a Pittsburgh law firm since 1999." (per:schools attended; both encoders correct); (2) "OBJ-PERSON, a public affairs and government relations strategist, was executive director of the SUBJ-ORGANIZATION Policy Institute from 2005 to 2010." (org:top members/employees; knowledge-attention correct, self-attention wrong); (3) "Founded in 1992 in Schaumburg, Illinois, the SUBJ-ORGANIZATION is one of the largest Chinese-American associations of professionals in the OBJ-COUNTRY." (org:country of headquarters; knowledge-attention wrong, self-attention correct).

Table 5 presents the attention visualization results on sample sentences. For each sample sentence, the attention weights from the knowledge-attention encoder are visualized first, followed by those from the self-attention encoder. It is observed that the knowledge-attention encoder focuses more on the specific keywords or cue phrases of certain relations, such as "graduated", "executive director" and "founded", while the self-attention encoder attends to a wide range of words in the sentence and pays more attention to the words surrounding the target entities, especially the words indicating the syntactic structure, such as "is", "in" and "of". Therefore, the knowledge-attention encoder and the self-attention encoder have complementary strengths that focus on different perspectives for relation extraction.

5.3.5 Error analysis

To investigate the limitations of our proposed model and provide insights for future research, we analyze the errors produced by the system on the test set of TACRED.
For the knowledge-attention encoder, 58% of the errors are false negatives (FN), due to its limited ability to capture long-distance dependencies and linguistic clues unseen during training. For our integrated model [4], which takes the benefits of both self-attention and knowledge-attention, FN is reduced by 10%. However, false positives (FP) are not improved, due to overfitting that leads to wrong predictions. Many errors are caused by multiple entities with different relations co-occurring in one sentence; our model may mistake irrelevant entities for a relation pair. We also observed that many FP errors are due to confusion between related relations such as "city of death" and "city of residence"; more data or knowledge is needed to distinguish "death" from "residence". Besides, some errors are caused by imperfect annotations.

[4] We observed similar error behaviors for the three proposed integrated models.

6 Conclusion and Future Work

We introduce the knowledge-attention encoder, which effectively incorporates prior knowledge from external lexical resources for relation extraction. The proposed knowledge-attention mechanism transforms texts from word space into relational semantic space and captures the informative linguistic clues of relations effectively. Furthermore, we show the complementary strengths of knowledge-attention and self-attention, and propose three different ways of integrating them to maximize the utilization of both knowledge and data. The proposed models are fully attention-based end-to-end systems and achieve state-of-the-art results on the TACRED dataset, outperforming existing CNN, RNN, and self-attention based models.

In future work, besides lexical knowledge, we will incorporate conceptual knowledge from encyclopedic knowledge bases into the knowledge-attention encoder to capture the high-level semantics of texts. We will also apply knowledge-attention to other tasks such as text classification, sentiment analysis and question answering.
References

Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. arXiv preprint arXiv:1608.00318.

Ivan Bilan and Benjamin Roth. 2018. Position-aware self-attention with relative positional encodings for slot filling. arXiv preprint arXiv:1807.03052.

Muhao Chen, Yingtao Tian, Xuelu Chen, Zijun Xue, and Carlo Zaniolo. 2018a. On2vec: Embedding-based relation prediction for ontology population. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 315-323. SIAM.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018b. Neural natural language inference models enhanced with external knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2406-2417.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Jianfeng Gao, and Li Deng. 2016. Knowledge as a teacher: Knowledge-guided structural attention networks. arXiv preprint arXiv:1609.03286.

Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. 2015. Revisiting word embedding for contrasting meaning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 106-115.

Jinhua Du, Jingguang Han, Andy Way, and Dadong Wan. 2018. Multi-level structured self-attentions for distantly supervised relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2216-2225.

Marco Fossati, Emilio Dorigatti, and Claudio Giuliano. 2018. N-ary relation extraction for simultaneous t-box and a-box knowledge base augmentation. Semantic Web, 9(4):413-439.

Xu Han, Pengfei Yu, Zhiyuan Liu, Maosong Sun, and Peng Li. 2018. Hierarchical relation extraction with coarse-to-fine grained attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2236-2245.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94-99. Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751.

Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2017. Mit at semeval-2017 task 10: Relation extraction with convolutional neural networks. arXiv preprint arXiv:1704.01523.

Pengfei Li and Kezhi Mao. 2019. Knowledge-oriented convolutional neural network for causal relation extraction from natural language texts. Expert Systems with Applications, 115:512-523.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1501-1511.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1105-1116.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39-48.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Longhua Qian, Guodong Zhou, Fang Kong, Qiaoming Zhu, and Peide Qian. 2008. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 697-704. Association for Computational Linguistics.

Bryan Rink and Sanda Harabagiu. 2010. Utd: Classifying semantic relations by combining lexical and semantic resources. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 256-259. Association for Computational Linguistics.

Josef Ruppenhofer, Michael Ellsworth, Myriam Schwarzer-Petruck, Christopher R Johnson, and Jan Scheffczyk. 2006. Framenet ii: Extended theory and practice.

Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 626-634.

Ying Shen, Yang Deng, Min Yang, Yaliang Li, Nan Du, Wei Fan, and Kai Lei. 2018. Knowledge-aware attentive neural network for ranking question answer pairs. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 901-904. ACM.

Roberta A Sinoara, Jose Camacho-Collados, Rafael G Rossi, Roberto Navigli, and Solange O Rezende. 2019. Knowledge-enhanced document embeddings for text classification. Knowledge-Based Systems, 163:955-971.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958.

Fabian M Suchanek, Georgiana Ifrim, and Gerhard Weikum. 2006. Combining linguistic and statistical analysis to extract relations from web documents. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 712-717. ACM.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499-1509.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 872-884.

Ngoc Thang Vu, Heike Adel, Pankaj Gupta, et al. 2016. Combining recurrent and convolutional neural networks for relation classification. In Proceedings of NAACL-HLT, pages 534-539.

Jin Wang, Zhongyuan Wang, Dawei Zhang, and Jun Yan. 2017. Combining knowledge with deep convolutional neural networks for short text classification. In IJCAI, pages 2915-2921.

Linlin Wang, Zhu Cao, Gerard De Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention cnns.

Yan Xu, Ran Jia, Lili Mou, Ge Li, Yunchuan Chen, Yangyang Lu, and Zhi Jin. 2016. Improved relation classification by deep recurrent neural networks with data augmentation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1461-1470.

Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. 2018. Augmenting end-to-end dialogue systems with commonsense knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence.

Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved neural relation detection for knowledge base question answering. arXiv preprint arXiv:1704.06194.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation classification via convolutional deep neural network.

Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.

Yuhao Zhang, Peng Qi, and Christopher D Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205-2215.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35-45.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 207-212.
A FrameNet Frames Used for Relation Extraction on the TACRED Dataset

Relation Types — FrameNet Frames
org:alternate names, per:alternate names — Being named, Name conferral, Namesake, Referring by name, Simple naming
org:city of headquarters, org:country of headquarters, org:stateorprovince of headquarters — Being located, Locale, Locale by characteristic entity, Locale by collocation, Locale by event, Locale by ownership, Locale by use, Locale closure, Locating, Locative relation, Spatial co-location
org:founded, org:founded by — Intentionally create
org:dissolved — Location in time, Relative time, Time vector, Timespan, Temporal collocation, Temporal subregion
org:member of, org:members — Becoming a member, Membership
org:political/religious affiliation, per:religion — People by religion, Religious belief, Political locales
org:subsidiaries, org:parents — Part whole, Partitive, Inclusion
org:shareholders — Capital stock
org:top members/employees, org:number of employees/members, per:employee of — Leadership, Working a post, Employing, People by vocation, Cardinal numbers
per:age — Age
per:cause of death, per:date of death, per:city of death, per:country of death, per:stateorprovince of death — Death, Cause harm
per:charges — Notification of charges, Committing crime, Criminal investigation
per:children, per:parents, per:other family, per:siblings — Kinship
per:cities of residence, per:countries of residence, per:stateorprovinces of residence — Expected location of person, Residence
per:date of birth, per:city of birth, per:country of birth, per:stateorprovince of birth — Being born, Giving birth
per:origin — Origin, People by origin
per:schools attended — Education teaching
per:spouse — Personal relationship, Forming relationships
per:title — Performers and roles

B Hyper-parameter Settings and Training Details

The dimensions of the POS and position embeddings are both 30. The inner-layer dimension of the position-wise feed-forward network is 130. The dimension of the relative positional encoding within knowledge-attention and self-attention is 50. The attention dimension is 200 in position-aware attention and 100 in multi-channel attention. The fully connected network before the softmax has a dimensionality of 100. We use ReLU for all nonlinear activation functions. The dropout rate is 0.4 for the input embeddings, attention outputs and position-wise feed-forward outputs, and 0.1 for the attention weights. The models are trained using stochastic gradient descent with a learning rate of 0.1 and a momentum of 0.9. The learning rate is decayed by a factor of 0.9 after 15 epochs if the F1 score on the dev set does not improve. The batch size is set to 100 and we train the model for 70 epochs.
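For convenience, the hyper-parameters listed above can be gathered into a single configuration object. The dictionary below is a hypothetical summary: the key names are ours, and only the values are taken from this appendix and Section 5.2.

```python
# Hypothetical training/model configuration assembled from the appendix.
CONFIG = {
    "word_emb_dim": 300,              # fine-tuned GloVe embeddings
    "pos_emb_dim": 30,
    "position_emb_dim": 30,
    "ffn_inner_dim": 130,
    "relative_pos_enc_dim": 50,
    "encoder_layers": 1,
    "attention_heads": 6,
    "position_aware_attn_dim": 200,
    "multi_channel_attn_dim": 100,
    "pre_softmax_fc_dim": 100,
    "activation": "relu",
    "dropout_io": 0.4,                # embeddings, attention outputs, FFN outputs
    "dropout_attn_weights": 0.1,
    "optimizer": "sgd",
    "learning_rate": 0.1,
    "momentum": 0.9,
    "lr_decay": 0.9,                  # after epoch 15, if dev F1 does not improve
    "batch_size": 100,
    "epochs": 70,
    "softmax_interpolation_beta": 0.8,
}
```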