Improving Relation Extraction with Knowledge-attention
Pengfei Li (1), Kezhi Mao (1), Xuefeng Yang (3), and Qi Li (2)
(1) School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
(2) Interdisciplinary Graduate School, Nanyang Technological University, Singapore
(3) ZhuiYi Technology, Shenzhen, China
{pli006,ekzmao,liqi0024}@ntu.edu.sg, ryan@wezhuiyi.com
∗ Corresponding author.
arXiv:1910.02724v2 [cs.CL] 4 Mar 2020
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China.

Abstract

While attention mechanisms have been proven effective in many NLP tasks, the majority of them are data-driven. We propose a novel knowledge-attention encoder which incorporates prior knowledge from external lexical resources into deep neural networks for the relation extraction task. Furthermore, we present three effective ways of integrating knowledge-attention with self-attention to maximize the utilization of both knowledge and data. The proposed relation extraction system is end-to-end and fully attention-based. Experiment results show that the proposed knowledge-attention mechanism has complementary strengths with self-attention, and our integrated models outperform existing CNN, RNN, and self-attention based models. State-of-the-art performance is achieved on TACRED, a complex and large-scale relation extraction dataset.

1 Introduction

Relation extraction aims to detect the semantic relationship between two entities in a sentence. For example, given the sentence "James Dobson has resigned as chairman of Focus On The Family, which he founded thirty years ago.", the goal is to recognize the organization-founder relation held between "Focus On The Family" and "James Dobson". The various relations between entities extracted from large-scale unstructured texts can be used for ontology and knowledge base population (Chen et al., 2018a; Fossati et al., 2018), as well as for facilitating downstream tasks that require relational understanding of texts, such as question answering (Yu et al., 2017) and dialogue systems (Young et al., 2018).

Traditional feature-based and kernel-based approaches require extensive feature engineering (Suchanek et al., 2006; Qian et al., 2008; Rink and Harabagiu, 2010). Deep neural networks such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) can explore more complex semantics and extract features automatically from raw texts for relation extraction tasks (Xu et al., 2016; Vu et al., 2016; Lee et al., 2017). Recently, attention mechanisms have been introduced into deep neural networks to improve their performance (Zhou et al., 2016; Wang et al., 2016; Zhang et al., 2017). In particular, the Transformer proposed by Vaswani et al. (2017) is based solely on self-attention and has demonstrated better performance than traditional RNNs (Bilan and Roth, 2018; Verga et al., 2018). However, deep neural networks normally require sufficient labeled data to train their numerous model parameters. Scarce or low-quality training data limits a model's ability to recognize complex relations and also causes overfitting.

A recent study (Li and Mao, 2019) shows that incorporating prior knowledge from external lexical resources into deep neural networks can reduce the reliance on training data and improve relation extraction performance. Motivated by this, we propose a novel knowledge-attention mechanism, which transforms texts from word semantic space into relational semantic space by attending to relation indicators that are useful in recognizing different relations. The relation indicators are automatically generated from lexical knowledge bases and represent the keywords and cue phrases of different relation expressions.
While the existing self-attention encoder learns internal semantic features by attending to the input texts themselves, the proposed knowledge-attention encoder captures the linguistic clues of different relations based on external knowledge. Since the two attention mechanisms complement each other, we integrate them into a single model to maximize the utilization of both knowledge and data, and achieve optimal performance for relation extraction.
In summary, the main contributions of this paper are: (1) We propose the knowledge-attention encoder, a novel attention mechanism which incorporates prior knowledge from external lexical resources to effectively capture the informative linguistic clues for relation extraction. (2) To take advantage of both knowledge-attention and self-attention, we propose three integration strategies: multi-channel attention, softmax interpolation, and knowledge-informed self-attention. Our final models are fully attention-based and can be easily set up for end-to-end training. (3) We present a detailed analysis of the knowledge-attention encoder. Results show that it has complementary strengths with the self-attention encoder, and the integrated models achieve state-of-the-art results for relation extraction.

2 Related Works

We focus here on deep neural networks for relation extraction since they have demonstrated better performance than traditional feature-based and kernel-based approaches.

Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are the earliest and most commonly used approaches for relation extraction. Zeng et al. (2014) showed that a CNN with position embeddings is effective for relation extraction. Similarly, CNNs with multiple filter sizes (Nguyen and Grishman, 2015), a pairwise ranking loss function (dos Santos et al., 2015), and auxiliary embeddings (Lee et al., 2017) were proposed to improve performance. Zhang and Wang (2015) proposed a bi-directional RNN with max pooling to model sequential relations. Instead of modeling the whole sentence, running an RNN over sub-dependency trees (e.g., the shortest dependency path between two entities) has been shown to be effective in capturing long-distance relation patterns (Xu et al., 2016; Miwa and Bansal, 2016). Zhang et al. (2018) proposed graph convolution over dependency trees and achieved state-of-the-art results on the TACRED dataset.

Recently, attention mechanisms have been widely applied to CNNs (Wang et al., 2016; Han et al., 2018) and RNNs (Zhou et al., 2016; Zhang et al., 2017; Du et al., 2018). The improved performance demonstrated the effectiveness of attention mechanisms in deep neural networks. In particular, Vaswani et al. (2017) proposed a solely self-attention-based model called the Transformer, which is more effective than RNNs in capturing long-distance features since it is able to draw global dependencies without regard to their distances in the sequences. Bilan and Roth (2018) first applied a self-attention encoder to the relation extraction task and achieved competitive results on the TACRED dataset. Verga et al. (2018) used self-attention to encode long contexts spanning multiple sentences for biological relation extraction. However, a self-attention encoder requires more attention heads and layers to capture complex semantic and syntactic information, since learning is based solely on training data; hence more high-quality training data and computational power are needed. Our work utilizes knowledge from external lexical resources to improve a deep neural network's ability to capture informative linguistic clues.

External knowledge has been shown to be effective in neural networks for many NLP tasks. Existing works focus on utilizing external knowledge to improve embedding representations (Chen et al., 2015; Liu et al., 2015; Sinoara et al., 2019), CNNs (Toutanova et al., 2015; Wang et al., 2017; Li and Mao, 2019), and RNNs (Ahn et al., 2016; Chen et al., 2016, 2018b; Shen et al., 2018). Our work is the first to incorporate knowledge into the Transformer through a novel knowledge-attention mechanism to improve its performance on the relation extraction task.

3 Knowledge-attention Encoder

We present the proposed knowledge-attention encoder in this section. Relation indicators are first generated from external lexical resources (Section 3.1); then the input texts are transformed from word semantic space into relational semantic space by attending to the relation indicators using the knowledge-attention mechanism (Section 3.2); finally, position-aware attention is used to summarize the input sequence by taking both relation semantics and relative positions into consideration (Section 3.3).

3.1 Relation Indicators Generation

Relation indicators represent the keywords or cue phrases of various relation types, which are essential for the knowledge-attention encoder to capture the linguistic clues of a certain relation from texts.
Figure 1: Knowledge-attention process (left) and multi-head structure (right) of the knowledge-attention encoder.

We utilize two publicly available lexical resources, FrameNet [1] and Thesaurus.com [2], to find such lexical units.

FrameNet is a large lexical knowledge base which categorizes English words and sentences into higher-level semantic frames (Ruppenhofer et al., 2006). Each frame is a conceptual structure describing a type of event, object, or relation. FrameNet contains over 1200 semantic frames, many of which represent various semantic relations. For each relation type in our relation extraction task, we first find all the relevant semantic frames by searching FrameNet (refer to the Appendix for the detailed semantic frames used). Then we extract all the lexical units involved in these frames, which are exactly the keywords or phrases that are often used to express such a relation. Thesaurus.com is the largest online thesaurus, with over 3 million synonyms and antonyms. It also offers the flexibility to filter search results by relevance, POS tag, word length, and complexity. To broaden the coverage of relation indicators, we utilize the synonyms in Thesaurus.com to extend the lexical units extracted from FrameNet. To reduce noise, only the most relevant synonyms with the same POS tag are selected.

Relation indicators are generated based on the word embeddings and POS tags of the lexical units. Formally, given a word in a lexical unit, we find its word embedding w_i ∈ R^{d_w} and POS embedding t_i ∈ R^{d_t} by looking up the word embedding matrix W^{wrd} ∈ R^{d_w × V^{wrd}} and the POS embedding matrix W^{pos} ∈ R^{d_t × V^{pos}} respectively, where d_w and d_t are the dimensions of the word and POS embeddings, V^{wrd} is the vocabulary size [3], and V^{pos} is the total number of POS tags. The corresponding relation indicator is formed by concatenating the word embedding and the POS embedding, k_i = [w_i, t_i]. If a lexical unit contains multiple words (i.e., a phrase), the corresponding relation indicator is formed by averaging the embeddings of all its words. Eventually, around 3000 relation indicators (including 2000 synonyms) are generated: K = {k_1, k_2, ..., k_m}.

[1] https://framenet.icsi.berkeley.edu/fndrupal
[2] https://www.thesaurus.com
[3] The same word embedding matrix is used for relation indicators and input texts, hence the vocabulary also includes all the words in the training corpus.
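To make the construction concrete, the following Python sketch shows one way to turn mined lexical units into the indicator matrix K under the scheme described above (word + POS embedding concatenation, averaging over multi-word phrases). The function name `build_relation_indicators` and the data structures for lexical units and lookup tables are our own assumptions; the paper does not specify an implementation.

```python
import numpy as np

def build_relation_indicators(lexical_units, word_emb, pos_emb, word2id, pos2id):
    """Assemble K = {k_1, ..., k_m} from lexical units mined from FrameNet frames
    and extended with Thesaurus.com synonyms. Hypothetical input format:
    each unit is a (tokens, pos_tag) pair."""
    indicators = []
    for tokens, pos_tag in lexical_units:
        vecs = [word_emb[word2id[t]] for t in tokens if t in word2id]
        if not vecs or pos_tag not in pos2id:
            continue  # skip units whose words are all out of vocabulary
        w = np.mean(vecs, axis=0)          # average word embeddings over a phrase
        t = pos_emb[pos2id[pos_tag]]       # POS embedding of the lexical unit
        indicators.append(np.concatenate([w, t]))   # k_i = [w_i, t_i]
    return np.stack(indicators)            # shape (m, d_w + d_t)
```

With d_w = 300 and d_t = 30 (the dimensions reported later in the appendix), each indicator would be a 330-dimensional vector.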
3.2 Knowledge Attention

3.2.1 Knowledge-attention process

In a typical attention mechanism, a query (q) is compared with the keys (K) in a set of key-value pairs and the corresponding attention weights are calculated. The attention output is the weighted sum of the values (V) using the attention weights. In our proposed knowledge-attention encoder, the queries are the input texts and the key-value pairs are both relation indicators. The detailed process of knowledge-attention is shown in Figure 1 (left).

Formally, given the text input x = {x_1, x_2, ..., x_n}, the input embeddings Q = {q_1, q_2, ..., q_n} are generated by concatenating each word's word embedding and POS embedding, in the same way as relation indicator generation in Section 3.1. The hidden representations H = {h_1, h_2, ..., h_n} are obtained by attending to the knowledge indicators K, as shown in Equation 1. The final knowledge-attention outputs are obtained by subtracting the mean of the relation indicators from the hidden representations, as shown in Equation 2.

H = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V    (1)

\mathrm{knwl}(Q, K, V) = H - \frac{1}{m}\sum_{i=1}^{m} k_i    (2)

where knwl denotes the knowledge-attention process, m is the number of relation indicators, and d_k is the dimension of the key/query vectors, used as a scaling factor as in Vaswani et al. (2017). Subtracting the mean of the relation indicators results in small outputs for irrelevant words. More importantly, the resulting output will be close to the related relation indicators and further apart from the other relation indicators in relational semantic space. Therefore, the proposed knowledge-attention mechanism is effective in capturing the linguistic clues of the relations represented by the relation indicators in the relational semantic space.

3.2.2 Multi-head knowledge-attention

Inspired by the multi-head attention in the Transformer (Vaswani et al., 2017), we also use multi-head knowledge-attention, which first linearly transforms Q, K and V h times and then performs h knowledge-attentions simultaneously, as shown in Figure 1 (right). Different from the Transformer encoder, we use the same linear transformation for Q and K in each head to keep the correspondence between queries and keys.

\mathrm{head}_i = \mathrm{knwl}(QW_i^Q, KW_i^Q, VW_i^V)    (3)

where W_i^Q, W_i^V ∈ R^{d_k × (d_k/h)} and i ∈ [1, 2, ..., h]. Besides, only one residual connection, from the input embeddings to the outputs of the position-wise feed-forward network, is used. We also mask the outputs of padding tokens using zero vectors.

The multi-head structure in knowledge-attention allows the model to jointly attend inputs to different relational semantic subspaces with different contributions of relation indicators. This is beneficial in recognizing complex relations where various compositions of relation indicators are needed.
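The PyTorch sketch below is our own illustration of Equations 1-3, not the authors' code: a shared per-head projection for queries and keys, a separate projection for values, scaled dot-product attention against the relation indicators, and subtraction of the indicator mean. In the multi-head setting we subtract the mean of the projected values, which is one consistent reading of Equation 2; the position-wise feed-forward layer, residual connection, and normalization mentioned above are omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class MultiHeadKnowledgeAttention(nn.Module):
    """Illustrative multi-head knowledge-attention (Equations 1-3)."""
    def __init__(self, d_k, n_heads):
        super().__init__()
        assert d_k % n_heads == 0
        d_head = d_k // n_heads
        # Shared W_i^Q for queries AND keys (Equation 3), separate W_i^V for values.
        self.w_qk = nn.ModuleList([nn.Linear(d_k, d_head, bias=False) for _ in range(n_heads)])
        self.w_v = nn.ModuleList([nn.Linear(d_k, d_head, bias=False) for _ in range(n_heads)])

    def forward(self, Q, K, pad_mask=None):
        # Q: (batch, n, d_k) token embeddings; K: (m, d_k) relation indicators.
        # pad_mask: (batch, n) with 1.0 for real tokens and 0.0 for padding.
        heads = []
        for w_qk, w_v in zip(self.w_qk, self.w_v):
            q, k, v = w_qk(Q), w_qk(K), w_v(K)
            scores = torch.matmul(q, k.t()) / math.sqrt(k.size(-1))   # (batch, n, m)
            H = torch.matmul(torch.softmax(scores, dim=-1), v)        # Equation 1
            heads.append(H - v.mean(dim=0))                           # Equation 2 on projected values
        out = torch.cat(heads, dim=-1)                                # (batch, n, d_k)
        if pad_mask is not None:
            out = out * pad_mask.unsqueeze(-1)   # zero vectors at padding positions
        return out
```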
3.3 Position-aware Attention

It has been proven that the relative position information of each token with respect to the two target entities is beneficial for the relation extraction task (Zeng et al., 2014). We modify the position-aware attention originally proposed by Zhang et al. (2017) to incorporate such relative position information and determine the importance of each token to the final sentence representation.

Assume the relative position of token x_i to a target entity is \hat{p}_i. We apply the position binning function (Equation 4) to make it easier for the model to distinguish long and short relative distances.

p_i = \begin{cases} \hat{p}_i & \text{if } |\hat{p}_i| \le 2 \\ \frac{\hat{p}_i}{|\hat{p}_i|}\left\lceil \log_2 |\hat{p}_i| + 1 \right\rceil & \text{if } |\hat{p}_i| > 2 \end{cases}    (4)

After getting the relative positions p_i^s and p_i^o to the two entities of interest (subject and object respectively), we map them to position embeddings based on a shared position embedding matrix W^p. The two embeddings are concatenated to form the final position embedding for token x_i: p_i = [p_i^s, p_i^o].

Position-aware attention is performed on the outputs of knowledge-attention O ∈ R^{n × d_k}, taking the corresponding relative position embeddings P ∈ R^{n × d_p} into consideration:

f = O^T \mathrm{softmax}\left(\tanh(OW_o + PW_p)\, c\right)    (5)

where W_o ∈ R^{d_k × d_a}, W_p ∈ R^{d_p × d_a}, d_a is the attention dimension, and c ∈ R^{d_a} is a context vector learned by the neural network.
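The following sketch mirrors Equations 4 and 5 for a single (unbatched) sentence; it is an illustration under our own naming, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def bin_position(p_hat: int) -> int:
    """Equation 4: keep short distances exact, compress long ones logarithmically."""
    if abs(p_hat) <= 2:
        return p_hat
    sign = 1 if p_hat > 0 else -1
    return sign * math.ceil(math.log2(abs(p_hat)) + 1)

class PositionAwareAttention(nn.Module):
    """Equation 5: summarize encoder outputs O using content and position embeddings."""
    def __init__(self, d_k, d_p, d_a):
        super().__init__()
        self.w_o = nn.Linear(d_k, d_a, bias=False)   # W_o
        self.w_p = nn.Linear(d_p, d_a, bias=False)   # W_p
        self.c = nn.Parameter(torch.randn(d_a))      # learned context vector c

    def forward(self, O, P):
        # O: (n, d_k) knowledge-attention outputs; P: (n, d_p) concatenated
        # subject/object relative-position embeddings.
        scores = torch.tanh(self.w_o(O) + self.w_p(P)) @ self.c   # (n,)
        alpha = torch.softmax(scores, dim=0)
        return O.t() @ alpha                                      # f in R^{d_k}
```

For example, bin_position(-1) stays -1, while bin_position(7) becomes 4 (the ceiling of log2(7) + 1, with the original sign).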
4 Integrating Knowledge-attention with Self-attention

The self-attention encoder proposed by Vaswani et al. (2017) learns internal semantic features by modeling pair-wise interactions within the texts themselves, which is effective in capturing long-distance dependencies. Our proposed knowledge-attention encoder has the complementary strength of precisely capturing the linguistic clues of relations based on external knowledge. Therefore, it is beneficial to integrate the two models to maximize the utilization of both external knowledge and training data. In this section, we propose three integration approaches, as shown in Figure 2, and each approach has its own advantages.

Figure 2: Three ways of integrating knowledge-attention with self-attention: multi-channel attention and softmax interpolation (top), as well as knowledge-informed self-attention (bottom).

4.1 Multi-channel Attention

In this approach, self-attention and knowledge-attention are treated as two separate channels that model the sentence from different perspectives. After applying position-aware attention, two feature vectors f_1 and f_2 are obtained from self-attention and knowledge-attention respectively. We apply another attention mechanism, called multi-channel attention, to integrate the feature vectors.

In multi-channel attention, the feature vectors are first fed into a fully connected neural network to obtain their hidden representations h_i. Then attention weights are calculated using a learnable context vector c, which reflects the importance of each feature vector for the final relation classification. Finally, the feature vectors are integrated based on the attention weights, as shown in Equation 6.

r = \sum_i \mathrm{softmax}(h_i^T c)\, h_i    (6)

After obtaining the integrated feature vector r, we pass it to a softmax classifier to determine the relation class. The model is trained using stochastic gradient descent with momentum and learning rate decay to minimize the cross-entropy loss.

The main advantage of this approach is flexibility. Since the two channels process information independently, their input components do not need to be the same. Besides, we can add features from other sources (e.g., subject and object categories) to multi-channel attention so that the final decision is based on all the information sources.

4.2 Softmax Interpolation

Similar to multi-channel attention, softmax interpolation also uses two independent channels for self-attention and knowledge-attention. Instead of integrating the feature vectors, we make two independent predictions using two softmax classifiers based on the feature vectors from the two channels. The loss function is defined as the total cross-entropy loss of the two classifiers. The final prediction is obtained using an interpolation of the two softmax distributions:

p = \beta \cdot p_1 + (1 - \beta) \cdot p_2    (7)

where p_1 and p_2 are the softmax distributions obtained from self-attention and knowledge-attention respectively, and β is the priority weight assigned to self-attention.

Since knowledge-attention focuses on capturing the keywords and cue phrases of relations, its precision is higher than that of self-attention while its recall is lower. The proposed softmax interpolation approach is able to take advantage of both attention mechanisms and balance precision and recall by adjusting the priority weight β.
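Both integration strategies reduce to a few lines; the sketch below is illustrative (class and function names are ours) and uses ReLU for the fully connected layer, matching the activation reported in the appendix.

```python
import torch
import torch.nn as nn

class MultiChannelAttention(nn.Module):
    """Equation 6: weight the channel feature vectors with a learnable context vector."""
    def __init__(self, d_f, d_h):
        super().__init__()
        self.fc = nn.Linear(d_f, d_h)            # fully connected layer producing h_i
        self.c = nn.Parameter(torch.randn(d_h))  # context vector

    def forward(self, features):                 # features: (num_channels, d_f)
        h = torch.relu(self.fc(features))        # hidden representations h_i
        alpha = torch.softmax(h @ self.c, dim=0) # one attention weight per channel
        return (alpha.unsqueeze(-1) * h).sum(dim=0)   # integrated vector r

def interpolate_predictions(p_self, p_knwl, beta=0.8):
    """Equation 7: blend the two softmax distributions; beta is the priority
    weight given to the self-attention channel."""
    return beta * p_self + (1.0 - beta) * p_knwl
```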
4.3 Knowledge-informed Self-attention

Since knowledge-attention and self-attention share similar structures, it is also possible to integrate them into a single channel. We propose the knowledge-informed self-attention encoder, which incorporates knowledge-attention into every self-attention head to jointly model the semantic relations based on both knowledge and data.

The structure of knowledge-informed self-attention is shown in Figure 3. Formally, given the text input matrix Q ∈ R^{n × d_k} and the knowledge indicators K ∈ R^{m × d_k}, the output of each attention head is calculated as follows:

\mathrm{head}_i = \mathrm{knwl}(QW_i^Q, KW_i^Q, KW_i^V) + \mathrm{self}(QW_i^{Q_s}, QW_i^{K_s}, QW_i^{V_s})    (8)

where knwl and self denote knowledge-attention and self-attention respectively, and all the linear transformation weight matrices have the dimensionality W ∈ R^{d_k × (d_k/h)}.
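A per-head sketch of Equation 8 is shown below (illustrative PyTorch, hypothetical class name); it reuses the scaled dot-product and mean-subtraction pattern from the knowledge-attention sketch above, and omits the multi-head concatenation and feed-forward layers.

```python
import math
import torch
import torch.nn as nn

def scaled_attention(q, k, v):
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(q.size(-1))
    return torch.matmul(torch.softmax(scores, dim=-1), v)

class KnowledgeInformedHead(nn.Module):
    """Equation 8: one head combining knowledge-attention (over indicators K)
    with standard self-attention (over the tokens themselves)."""
    def __init__(self, d_k, d_head):
        super().__init__()
        self.w_qk = nn.Linear(d_k, d_head, bias=False)   # shared W_i^Q for the knwl term
        self.w_v = nn.Linear(d_k, d_head, bias=False)    # W_i^V for the knwl term
        self.w_qs = nn.Linear(d_k, d_head, bias=False)   # W_i^{Qs}
        self.w_ks = nn.Linear(d_k, d_head, bias=False)   # W_i^{Ks}
        self.w_vs = nn.Linear(d_k, d_head, bias=False)   # W_i^{Vs}

    def forward(self, Q, K):
        # Q: (n, d_k) token embeddings; K: (m, d_k) relation indicators.
        v_k = self.w_v(K)
        knwl = scaled_attention(self.w_qk(Q), self.w_qk(K), v_k) - v_k.mean(dim=0)
        slf = scaled_attention(self.w_qs(Q), self.w_ks(Q), self.w_vs(Q))
        return knwl + slf   # head_i
```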
Figure 3: Knowledge-informed self-attention structure. Q and K represent the input matrix and knowledge indicators respectively, and h is the number of attention heads.

Since each self-attention head is aided with prior knowledge through knowledge-attention, the knowledge-informed self-attention encoder is able to capture more lexical and semantic information than a single attention encoder.

5 Experiment and Analysis

5.1 Baseline Models

To study the performance of our proposed models, the following baseline models are used for comparison:

CNN-based models, including: (1) CNN: the classical convolutional neural network for sentence classification (Kim, 2014). (2) CNN-PE: a CNN with position embeddings dedicated to relation classification (Nguyen and Grishman, 2015). (3) GCN: a graph convolutional network over the pruned dependency trees of the sentence (Zhang et al., 2018).

RNN-based models, including: (1) LSTM: a long short-term memory network that models the text sequentially; classification is based on the last hidden output. (2) PA-LSTM: a position-aware attention mechanism similar to ours is used to summarize the LSTM outputs (Zhang et al., 2017).

A CNN-RNN hybrid model, the contextualized GCN (C-GCN), where the input vectors are obtained using a bi-directional LSTM network (Zhang et al., 2018).

A self-attention-based model (Self-attn), which uses a self-attention encoder to model the input sentence. Our implementation is based on Bilan and Roth (2018), where several modifications are made to the original Transformer encoder, including the use of relative positional encodings instead of absolute sinusoidal encodings, as well as other configurations such as the residual connection, activation function and normalization.

For our model, we evaluate both the proposed knowledge-attention encoder (Knwl-attn) and the integrated models with self-attention, including multi-channel attention (MCA), softmax interpolation (SI) and knowledge-informed self-attention (KISA).
5.2 Experiment Settings

We conduct our main experiments on TACRED, a large-scale relation extraction dataset introduced by Zhang et al. (2017). TACRED contains over 106k sentences with hand-annotated subject and object entities as well as the relations between them. It is a very complex relation extraction dataset with 41 relation types and a no_relation class for when no relation holds between the entities. The dataset is well suited for real-world relation extraction since it is unbalanced, with 79.5% no_relation samples, and multiple relations between different entity pairs can exist in one sentence. Besides, the samples are normally long sentences, with an average length of 36.2 words.

Since the dataset is already partitioned into train (68,124 samples), dev (22,631 samples) and test (15,509 samples) sets, we tune model hyper-parameters on the dev set and evaluate the model on the test set. The evaluation metrics are micro-averaged precision, recall and F1 score. For fair comparison, we select the model with the median F1 score on the dev set from 5 independent runs, the same as Zhang et al. (2017). The same "entity mask" strategy is used, which replaces the subject (or object) entity with special NER-SUBJ (or NER-OBJ) tokens to avoid overfitting on specific entities and to provide entity type information.

Besides TACRED, another dataset, SemEval2010-Task8 (Hendrickx et al., 2009), is used to evaluate the generalization ability of our proposed model. This dataset is significantly smaller and simpler than TACRED, with 8000 training samples and 2717 testing samples. It contains 9 directed relations and 1 other relation (19 relation classes in total). We use the official macro-averaged F1 score as the evaluation metric.

We use a one-layer encoder with 6 attention heads for both knowledge-attention and self-attention, since further increasing the number of layers and attention heads degrades the performance. For softmax interpolation, we choose β = 0.8 to balance precision and recall. Word embeddings are fine-tuned from pre-trained GloVe (Pennington et al., 2014) with a dimensionality of 300. Dropout (Srivastava et al., 2014) is used during training to alleviate overfitting. Other model hyper-parameters and training details are described in the Appendix due to space limitations.
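As an illustration of the "entity mask" preprocessing described above, the sketch below replaces entity tokens with type-bearing placeholders. Token lists, span indices and NER tags follow TACRED's annotation scheme, but the helper itself and the exact placeholder format are our assumptions, not the authors' code.

```python
def mask_entities(tokens, subj_span, obj_span, subj_ner, obj_ner):
    """Replace subject/object tokens with NER-SUBJ / NER-OBJ placeholders so the
    model cannot memorize specific entity strings but still sees entity types."""
    masked = list(tokens)
    for i in range(subj_span[0], subj_span[1] + 1):
        masked[i] = f"{subj_ner}-SUBJ"
    for i in range(obj_span[0], obj_span[1] + 1):
        masked[i] = f"{obj_ner}-OBJ"
    return masked

# Example (hypothetical sentence and spans):
# mask_entities(["James", "Dobson", "founded", "Focus", "On", "The", "Family"],
#               (0, 1), (3, 6), "PERSON", "ORGANIZATION")
# -> ["PERSON-SUBJ", "PERSON-SUBJ", "founded",
#     "ORGANIZATION-OBJ", "ORGANIZATION-OBJ", "ORGANIZATION-OBJ", "ORGANIZATION-OBJ"]
```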
5.3 Results and Analysis

5.3.1 Results on the TACRED dataset

Table 1: Micro-averaged precision (P), recall (R) and F1 score on the TACRED dataset. †, ‡ and †† mark results reported in (Zhang et al., 2017), (Zhang et al., 2018) and (Bilan and Roth, 2018) respectively. ∗ marks statistically significant improvements over Self-attn with p < 0.01 under a one-tailed t-test.

Model               P     R     F1
CNN†              72.1  50.3  59.2
CNN-PE†           68.2  55.4  61.1
GCN‡              69.8  59.0  64.0
LSTM†             61.4  61.7  61.5
PA-LSTM†          65.7  64.5  65.1
C-GCN‡            69.9  63.3  66.4
Self-attn††       64.6  68.6  66.5
Knwl-attn         70.0  63.1  66.4
Knwl+Self (MCA)   68.4  66.1  67.3*
Knwl+Self (SI)    67.1  68.4  67.8*
Knwl+Self (KISA)  69.4  66.0  67.7*

Table 1 shows the results of the baselines as well as our proposed models on the TACRED dataset. Our proposed knowledge-attention encoder outperforms all CNN-based and RNN-based models by at least 1.3 F1. Meanwhile, it achieves results comparable to C-GCN and the self-attention encoder, which are the current state-of-the-art single-model systems.

Compared with the self-attention encoder, the knowledge-attention encoder yields higher precision but lower recall. This is reasonable: since the knowledge-attention encoder focuses on capturing the significant linguistic clues of relations based on external knowledge, it produces high precision for the predicted relations, similar to rule-based systems. The self-attention encoder is able to capture more long-distance dependency features by learning from data, resulting in better recall. By integrating self-attention and knowledge-attention using the proposed approaches, a more balanced precision and recall can be obtained, suggesting the complementary effects of the self-attention and knowledge-attention mechanisms. The integrated models improve performance by at least 0.9 F1 and achieve new state-of-the-art results among all single end-to-end models.

Comparing the three integrated models, softmax interpolation (SI) achieves the best performance. More interestingly, we found that precision and recall can be controlled by adjusting the priority weight β. Figure 4 shows the impact of β on precision, recall and F1 score: as β increases, precision decreases and recall increases. Therefore, we can choose a small β for a relation extraction system that requires high precision, and a large β for a system requiring better recall. The F1 score reaches its highest value when precision and recall are balanced (β = 0.8).

Figure 4: Change of precision, recall and F1 score on the dev set as the priority weight β in softmax interpolation changes.

Knowledge-informed self-attention (KISA) has performance comparable to softmax interpolation, without the need for hyper-parameter tuning, since knowledge-attention and self-attention are integrated into a single channel. The performance gain over the self-attention encoder is 1.2 F1 with much improved precision, demonstrating the effectiveness of incorporating knowledge-attention into self-attention to jointly model the sentence based on both knowledge and data.
The performance gain is the lowest for multi-channel attention (MCA). However, the model is more flexible in that features from other information sources can be easily added to further improve its performance. Table 2 shows the results of adding NER embeddings of each token to the self-attention channel, and entity (subject and object) categorical embeddings to multi-channel attention as additional feature vectors. We use dimensionalities of 30 and 60 for the NER and entity categorical embeddings respectively, and the two embedding matrices are learned by the neural network. Results show that adding NER and entity categorical information to the MCA integrated model improves the F1 score by 0.2 and 0.5 respectively, and adding both improves precision significantly, resulting in a new best F1 score.
Table 2: Results of adding NER embeddings and entity categorical embeddings to the multi-channel attention (MCA) integrated model.

Model               P     R     F1
MCA               68.4  66.1  67.3
+ NER             68.7  66.2  67.5
+ Entity category 70.0  65.8  67.8
+ Both            70.1  66.0  67.9

Table 3: Macro-averaged F1 score on the SemEval2010-Task8 dataset. ∗ marks statistically significant improvements over Self-attn with p < 0.01 under a one-tailed t-test.

Model             mask entity   keep entity
Self-attn            76.8          83.1
Knwl-attn            76.1          82.3
Knwl+Self (MCA)      77.4*         84.0*
Knwl+Self (SI)       78.0*         84.3*
Knwl+Self (KISA)     77.5*         84.0*

Table 4: Ablation study on the knowledge-attention encoder. Results are the median F1 scores of 5 independent runs on the dev set of TACRED.

Model                             Dev F1
Knwl-attn Encoder                  66.5
1. − Multi-head structure          64.6
2. − Synonym relation indicators   64.7
3. − Relation indicators mean      65.0
4. − Output masking                65.8
5. − Entity masking                65.4
6. − Relative positions            63.0

5.3.2 Results on the SemEval2010-Task8 dataset

We use the SemEval2010-Task8 dataset to evaluate the generalization ability of our proposed model. Experiments are conducted in two manners: masking or keeping the entities of interest. Results in Table 3 show that the "entity mask" strategy degrades the performance, indicating that there are strong correlations between the entities of interest and the relation classes in the SemEval2010-Task8 dataset. Although the results of keeping the entities are better, the model tends to memorize these entities instead of focusing on learning the linguistic clues of relations, which results in poor generalization for sentences with unseen entities.

Regardless of whether the entity mask is used, by incorporating the knowledge-attention mechanism our model improves the performance of self-attention by a statistically significant margin, especially with the softmax interpolation integrated model. The results on SemEval2010-Task8 are consistent with those on TACRED, demonstrating the effectiveness and robustness of our proposed method.

5.3.3 Ablation study

To study the contributions of specific components of the knowledge-attention encoder, we perform ablation experiments on the dev set of TACRED. The results of the knowledge-attention encoder with and without certain components are shown in Table 4.
It is observed that: (1) The proposed multi-head knowledge-attention structure significantly outperforms the single-head version, demonstrating the effectiveness of jointly attending texts to different relational semantic subspaces. (2) The synonyms improve the performance of knowledge-attention since they broaden the coverage of relation indicators and form a robust relational semantic space. (3) Subtracting the relation indicators' mean vector from the attention hidden representations helps to suppress the activation of irrelevant words and results in a better representation of each word for capturing the linguistic clues of relations. (4-5) The two masking strategies are helpful for our model: output masking eliminates the effects of the padding tokens, and entity masking avoids entity overfitting while providing entity type information. (6) The relative position embedding term in position-aware attention contributes a significant amount of F1 score, showing that positional information is particularly important for the relation extraction task.

5.3.4 Attention visualization

To verify the complementary effects of the knowledge-attention encoder and the self-attention encoder, we compare the attention weights the two encoders assign to words.
Table 5: Attention visualization for the knowledge-attention encoder (first row of each pair) and the self-attention encoder (second row). Words are highlighted based on the attention weights assigned to them; best viewed in color. The three sample sentences, their true relations, and the predictions are: (1) "SUBJ-PERSON graduated in 1992 from the OBJ-ORGANIZATION with a degree in computer science and had worked as a systems analyst at a Pittsburgh law firm since 1999." (per:schools attended; both encoders correct); (2) "OBJ-PERSON, a public affairs and government relations strategist, was executive director of the SUBJ-ORGANIZATION Policy Institute from 2005 to 2010." (org:top members/employees; knowledge-attention correct, self-attention wrong); (3) "Founded in 1992 in Schaumburg, Illinois, the SUBJ-ORGANIZATION is one of the largest Chinese-American associations of professionals in the OBJ-COUNTRY." (org:country of headquarters; knowledge-attention wrong, self-attention correct).

Table 5 presents the attention visualization results on sample sentences. For each sample sentence, the attention weights from the knowledge-attention encoder are visualized first, followed by those from the self-attention encoder. It is observed that the knowledge-attention encoder focuses more on the specific keywords or cue phrases of certain relations, such as "graduated", "executive director" and "founded", while the self-attention encoder attends to a wide range of words in the sentence and pays more attention to the words surrounding the target entities, especially the words indicating the syntactic structure, such as "is", "in" and "of". Therefore, the knowledge-attention encoder and the self-attention encoder have complementary strengths that focus on different perspectives for relation extraction.

5.3.5 Error analysis

To investigate the limitations of our proposed model and provide insights for future research, we analyze the errors produced by the system on the test set of TACRED.
For the knowledge-attention encoder, 58% of the errors are false negatives (FN), due to its limited ability to capture long-distance dependencies and linguistic clues unseen during training. For our integrated model [4], which takes the benefits of both self-attention and knowledge-attention, FN is reduced by 10%. However, false positives (FP) are not improved, due to overfitting that leads to wrong predictions. Many errors are caused by multiple entities with different relations co-occurring in one sentence; our model may mistake irrelevant entities for a relation pair. We also observed that many FP errors are due to confusion between related relations such as "city of death" and "city of residence"; more data or knowledge is needed to distinguish "death" from "residence". Besides, some errors are caused by imperfect annotations.

[4] We observed similar error behaviors for the three proposed integrated models.

6 Conclusion and Future Work

We introduce the knowledge-attention encoder, which effectively incorporates prior knowledge from external lexical resources for relation extraction. The proposed knowledge-attention mechanism transforms texts from word space into relational semantic space and captures the informative linguistic clues of relations effectively. Furthermore, we show the complementary strengths of knowledge-attention and self-attention, and propose three different ways of integrating them to maximize the utilization of both knowledge and data. The proposed models are fully attention-based end-to-end systems and achieve state-of-the-art results on the TACRED dataset, outperforming existing CNN, RNN, and self-attention based models.

In future work, besides lexical knowledge, we will incorporate conceptual knowledge from encyclopedic knowledge bases into the knowledge-attention encoder to capture the high-level semantics of texts. We will also apply knowledge-attention to other tasks such as text classification, sentiment analysis and question answering.
References

Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. arXiv preprint arXiv:1608.00318.

Ivan Bilan and Benjamin Roth. 2018. Position-aware self-attention with relative positional encodings for slot filling. arXiv preprint arXiv:1807.03052.

Muhao Chen, Yingtao Tian, Xuelu Chen, Zijun Xue, and Carlo Zaniolo. 2018a. On2vec: Embedding-based relation prediction for ontology population. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 315-323. SIAM.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018b. Neural natural language inference models enhanced with external knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2406-2417.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Jianfeng Gao, and Li Deng. 2016. Knowledge as a teacher: Knowledge-guided structural attention networks. arXiv preprint arXiv:1609.03286.

Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. 2015. Revisiting word embedding for contrasting meaning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 106-115.

Jinhua Du, Jingguang Han, Andy Way, and Dadong Wan. 2018. Multi-level structured self-attentions for distantly supervised relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2216-2225.

Marco Fossati, Emilio Dorigatti, and Claudio Giuliano. 2018. N-ary relation extraction for simultaneous t-box and a-box knowledge base augmentation. Semantic Web, 9(4):413-439.

Xu Han, Pengfei Yu, Zhiyuan Liu, Maosong Sun, and Peng Li. 2018. Hierarchical relation extraction with coarse-to-fine grained attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2236-2245.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94-99. Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751.

Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2017. Mit at semeval-2017 task 10: Relation extraction with convolutional neural networks. arXiv preprint arXiv:1704.01523.

Pengfei Li and Kezhi Mao. 2019. Knowledge-oriented convolutional neural network for causal relation extraction from natural language texts. Expert Systems with Applications, 115:512-523.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1501-1511.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1105-1116.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39-48.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Longhua Qian, Guodong Zhou, Fang Kong, Qiaoming Zhu, and Peide Qian. 2008. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 697-704. Association for Computational Linguistics.

Bryan Rink and Sanda Harabagiu. 2010. Utd: Classifying semantic relations by combining lexical and semantic resources. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 256-259. Association for Computational Linguistics.

Josef Ruppenhofer, Michael Ellsworth, Myriam Schwarzer-Petruck, Christopher R Johnson, and Jan Scheffczyk. 2006. Framenet ii: Extended theory and practice.

Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 626-634.

Ying Shen, Yang Deng, Min Yang, Yaliang Li, Nan Du, Wei Fan, and Kai Lei. 2018. Knowledge-aware attentive neural network for ranking question answer pairs. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 901-904. ACM.

Roberta A Sinoara, Jose Camacho-Collados, Rafael G Rossi, Roberto Navigli, and Solange O Rezende. 2019. Knowledge-enhanced document embeddings for text classification. Knowledge-Based Systems, 163:955-971.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958.

Fabian M Suchanek, Georgiana Ifrim, and Gerhard Weikum. 2006. Combining linguistic and statistical analysis to extract relations from web documents. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 712-717. ACM.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499-1509.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 872-884.

Ngoc Thang Vu, Heike Adel, Pankaj Gupta, et al. 2016. Combining recurrent and convolutional neural networks for relation classification. In Proceedings of NAACL-HLT, pages 534-539.

Jin Wang, Zhongyuan Wang, Dawei Zhang, and Jun Yan. 2017. Combining knowledge with deep convolutional neural networks for short text classification. In IJCAI, pages 2915-2921.

Linlin Wang, Zhu Cao, Gerard De Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention cnns.

Yan Xu, Ran Jia, Lili Mou, Ge Li, Yunchuan Chen, Yangyang Lu, and Zhi Jin. 2016. Improved relation classification by deep recurrent neural networks with data augmentation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1461-1470.

Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. 2018. Augmenting end-to-end dialogue systems with commonsense knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence.

Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved neural relation detection for knowledge base question answering. arXiv preprint arXiv:1704.06194.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation classification via convolutional deep neural network.

Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.

Yuhao Zhang, Peng Qi, and Christopher D Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205-2215.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35-45.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 207-212.
A FrameNet Frames Used for Relation Extraction on the TACRED Dataset

Relation Types — FrameNet Frames
org:alternate names, per:alternate names — Being named, Name conferral, Namesake, Referring by name, Simple naming
org:city of headquarters, org:country of headquarters, org:stateorprovince of headquarters — Being located, Locale, Locale by characteristic entity, Locale by collocation, Locale by event, Locale by ownership, Locale by use, Locale closure, Locating, Locative relation, Spatial co-location
org:founded, org:founded by — Intentionally create
org:dissolved — Location in time, Relative time, Time vector, Timespan, Temporal collocation, Temporal subregion
org:member of, org:members — Becoming a member, Membership
org:political/religious affiliation, per:religion — People by religion, Religious belief, Political locales
org:subsidiaries, org:parents — Part whole, Partitive, Inclusion
org:shareholders — Capital stock
org:top members/employees, org:number of employees/members, per:employee of — Leadership, Working a post, Employing, People by vocation, Cardinal numbers
per:age — Age
per:cause of death, per:date of death, per:city of death, per:country of death, per:stateorprovince of death — Death, Cause harm
per:charges — Notification of charges, Committing crime, Criminal investigation
per:children, per:parents, per:other family, per:siblings — Kinship
per:cities of residence, per:countries of residence, per:stateorprovinces of residence — Expected location of person, Residence
per:date of birth, per:city of birth, per:country of birth, per:stateorprovince of birth — Being born, Giving birth
per:origin — Origin, People by origin
per:schools attended — Education teaching
per:spouse — Personal relationship, Forming relationships
per:title — Performers and roles

B Hyper-parameter Settings and Training Details

The dimensions of the POS and position embeddings are both 30. The inner-layer dimension of the position-wise feed-forward network is 130. The dimension of the relative positional encoding within knowledge-attention and self-attention is 50. The attention dimension is 200 in position-aware attention and 100 in multi-channel attention. The fully connected network before the softmax has a dimensionality of 100. We use ReLU for all nonlinear activation functions. The dropout rate is 0.4 for the input embeddings, attention outputs and position-wise feed-forward outputs, and 0.1 for the attention weights. The models are trained using stochastic gradient descent with a learning rate of 0.1 and a momentum of 0.9. The learning rate is decayed by a factor of 0.9 after 15 epochs if the F1 score on the dev set does not improve. The batch size is set to 100 and we train the model for 70 epochs.
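For convenience, the hyper-parameters listed above can be gathered into a single configuration object. The dictionary below is a hypothetical summary: the key names are ours, and only the values are taken from this appendix and Section 5.2.

```python
# Hypothetical training/model configuration assembled from the appendix.
CONFIG = {
    "word_emb_dim": 300,              # fine-tuned GloVe embeddings
    "pos_emb_dim": 30,
    "position_emb_dim": 30,
    "ffn_inner_dim": 130,
    "relative_pos_enc_dim": 50,
    "encoder_layers": 1,
    "attention_heads": 6,
    "position_aware_attn_dim": 200,
    "multi_channel_attn_dim": 100,
    "pre_softmax_fc_dim": 100,
    "activation": "relu",
    "dropout_io": 0.4,                # embeddings, attention outputs, FFN outputs
    "dropout_attn_weights": 0.1,
    "optimizer": "sgd",
    "learning_rate": 0.1,
    "momentum": 0.9,
    "lr_decay": 0.9,                  # after epoch 15, if dev F1 does not improve
    "batch_size": 100,
    "epochs": 70,
    "softmax_interpolation_beta": 0.8,
}
```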