Improving Aspect Term Extraction with Bidirectional Dependency Tree Representation

Huaishao Luo1, Tianrui Li*1, Bing Liu2, Bin Wang1, and Herwig Unger3

1 School of Information Science and Technology, Southwest Jiaotong University
huaishaoluo@gmail.com, trli@swjtu.edu.cn, binwang007@gmail.com
2 Department of Computer Science, University of Illinois at Chicago
liub@uic.edu
3 Faculty of Mathematics and Computer Science, FernUniversität in Hagen
herwig.unger@gmail.com

arXiv:1805.07889v2 [cs.CL] 5 May 2019

Abstract

Aspect term extraction is one of the important subtasks in aspect-based sentiment analysis. Previous studies have shown that using a dependency tree structure representation is promising for this task. However, most dependency tree structures involve only one directional propagation on the dependency tree. In this paper, we first propose a novel bidirectional dependency tree network to extract dependency structure features from the given sentences. The key idea is to explicitly incorporate both representations gained separately from the bottom-up and top-down propagation on the given dependency syntactic tree. An end-to-end framework is then developed to integrate the embedded representations with BiLSTM plus CRF to learn both tree-structured and sequential features to solve the aspect term extraction problem. Experimental results demonstrate that the proposed model outperforms state-of-the-art baseline models on four benchmark SemEval datasets.

Table 1: Examples of user reviews with aspect terms marked in bold.

No. | Reviews
1   | The design and atmosphere are just as good.
2   | The staff is very kind and well trained, they're fast, they are always prompt to jump behind the bar and fix drinks, they know details of every item in the menu and make excellent recommendation.
3   | I love the operating system and the preloaded software.
4   | There also seemed to be a problem with the hard disc, as certain times windows loads but claims to not be able to find any drivers or files.

1 Introduction

Aspect term extraction (ATE) is the task of extracting the attributes (or aspects) of an entity upon which people have expressed opinions. It is one of the most important subtasks in aspect-based sentiment analysis (Liu, 2012). As the examples in Table 1 show, "design", "atmosphere", "staff", "bar", "drinks", and "menu" in the first two sentences are aspect terms of the restaurant reviews, and "operating system", "preloaded software", "hard disc", "windows", and "drivers" in the last two sentences are aspect terms of the laptop reviews.

Existing methods for ATE can be divided into unsupervised and supervised approaches. The unsupervised approach is mainly based on topic modeling (Lin and He, 2009; Brody and Elhadad, 2010; Moghaddam and Ester, 2011; Chen et al., 2013; Chen and Liu, 2014; Chen et al., 2014), syntactic rules (Wang and Wang, 2008; Zhang et al., 2010; Wu et al., 2009; Qiu et al., 2011; Liu et al., 2013), and lifelong learning (Chen et al., 2014; Wang et al., 2016a; Liu et al., 2016; Shu et al., 2017). The supervised approach is mainly based on Conditional Random Fields (CRF) (Lafferty et al., 2001; Jakob and Gurevych, 2010; Choi and Cardie, 2010; Li et al., 2010; Mitchell et al., 2013; Giannakopoulos et al., 2017).

Figure 1: Examples of dependency relations (generated by the basic dependencies of Stanford CoreNLP 3.8.0). Each node is a word, and each edge is the dependency relation between two words. a) Speaking of the browser, it too has problems. b) I love the operating system and the preloaded software.

This paper focuses on CRF-based models, which regard ATE as a sequence labeling task. There are three main types of features that have been used in previous CRF-based models for ATE. The first type is the traditional natural language features, e.g., syntactic structures and lexical features (Toh and Su, 2016; Hamdan et al., 2015; Toh and Su, 2015; Balage Filho and Pardo, 2014; Jakob and Gurevych, 2010; Shu et al., 2017). The second type is the cross-domain knowledge based features, which are useful because there are plenty of shared aspects across domains although each entity/product is different (Jakob and Gurevych, 2010; Mitchell et al., 2013; Shu et al., 2017). The final type is the deep learning features learned by deep learning models, which have proven very useful for ATE in recent years (Giannakopoulos et al., 2017; Liu et al., 2015a; Wang et al., 2016b; Yin et al., 2016; Ye et al., 2017; Li and Lam, 2017; Wang et al., 2017b,a).

Figure 2: An example of a constituency tree (generated by the constituency parse of Stanford CoreNLP 3.8.0). Each node with the blue background is a real word in the sentence: Speaking of the browser, it too has problems.

The deep learning features generally include sequential representation and tree-structured representation features. Sequential representation
means the word order of a sentence. Tree-
structured representation features come from the
syntax structure of a sentence, which represents the internal logical relations between words. Figure 1 shows two examples of the dependency structure, in which each node is a word of the sentence, and each edge is a dependency relation between words. For example, the relation Speaking --nmod--> browser means Speaking is a nominal modifier of browser. Such a relation is useful in ATE. For instance, given system as an aspect term, software can be extracted as an aspect term through the relation system --conj--> software in Figure 1 b), because conj means system and software are connected by a coordinating conjunction (e.g., and). However, the tree-structured representation in previous work only considered a single direction of propagation (bottom-up propagation) trained on the parse trees with shared weights. We further exploit the capability of the tree-structured representation by considering top-down propagation, which means that given software as an aspect term, system can be extracted as an aspect term through the relation software --conj^{-1}--> system, where conj^{-1} is the inverse relation of conj, introduced to distinguish the two directions of propagation. Compared with the sequential representation, the tree-structured representation is capable of capturing long-range dependency relations between words, especially in long sentences such as the second and fourth reviews in Table 1.

In this paper, we first enhance the tree-structured representation using a bidirectional gate control mechanism, which originates from bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997; Gers et al., 1999), and then fuse the tree-structured and the sequential information to perform aspect term extraction. By combining the two steps into one, we propose a novel framework named bidirectional dependency tree conditional random fields (BiDTreeCRF). Specifically, BiDTreeCRF is an incremental framework which consists of three main components. The first component is a bidirectional dependency tree network (BiDTree), which is an extension of the recursive neural network in (Socher et al., 2011). Its goal is to extract the tree-structured representation from the dependency tree of a given sentence. The second component is the BiLSTM, whose input is the output of BiDTree. The tree-structured and sequential information is fused in this layer. The last component is the CRF, which is used to generate labels. To the best of our knowledge, this is the first work to fuse tree-structured and sequential information to solve the ATE problem. This new model results in major improvements for ATE over the existing baseline models.

The proposed BiDTree is constructed based on the dependency tree. Compared with many other methods based on the constituency tree (Figure 2) (Irsoy and Cardie, 2013; Tai et al., 2015; Teng and Zhang, 2016; Chen et al., 2017), BiDTree focuses more directly on the dependency relation between words, because all nodes in the dependency tree are input words themselves, whereas the constituency tree focuses on identified phrases and their recursive structure.

The two main contributions of this paper are as follows.

• It proposes a novel bidirectional recursive neural network, BiDTree, which enhances the tree-structured representation by constructing a bidirectional propagation mechanism on the dependency tree. Thus, BiDTree can capture more effective tree-structured representation features and gain better performance.

• It proposes the incremental framework BiDTreeCRF, which can incorporate both the syntactic information and the sequential information. These pieces of information are fed into the CRF layer for aspect term extraction. The integrated model can be effectively trained in an end-to-end fashion.

2 Model Description

The architecture of the proposed framework is shown in Figure 3. Its sample input is the dependency relations presented in Figure 1. As described in Section 1, BiDTreeCRF consists of three modules (or components): BiDTree, BiLSTM, and CRF. These modules are described in detail in Sections 2.2 and 2.3.

2.1 Problem Statement

We are given a review sentence from a particular domain, denoted by S = {w_1, w_2, ..., w_i, ..., w_N}, where N is the sentence length. For any word w_i ∈ S, the task of ATE is to find a label t_i ∈ T corresponding to it, where T = {B-AP, I-AP, O}. "B-AP", "I-AP", and "O" stand for the beginning of an aspect term, the inside of an aspect term, and other words, respectively. For example, "The/O picture/B-AP quality/I-AP is/O very/O good/O ./O" is a sentence with labels (or tags), where the aspect term is picture quality. This BIO encoding scheme is widely used in NLP tasks, and such tasks are often solved using CRF-based methods (Liu et al., 2015a; Wang et al., 2016b; Irsoy and Cardie, 2013, 2014).

2.2 Bidirectional Dependency Tree Network

Since BiDTree is built on the dependency tree, a sentence should first be converted to a dependency-based parse tree. As the left part of Figure 1 shows, each node in the dependency tree represents a word and connects to at least one other node/word. Each node has one and only one head word, e.g., Speaking is the head of browser, has is the head of Speaking, and the head word of has is ROOT¹. The edge between each node and its head word is a syntactic dependency relation, e.g., nmod between browser and Speaking is used for nominal modifiers of nouns or clausal predicates. Syntactic relations in Figure 3 are shown as dotted black lines.

After generating a dependency tree, each word w_i is initialized with a feature vector x_{w_i} ∈ R^d, which corresponds to a column of a pre-trained word embedding E ∈ R^{d×|V|}, where d is the dimension of the word vector and |V| is the size of the vocabulary. As described above, each relation of a dependency tree starts from a head word and points to its dependent words. This can be formulated as follows: the governor node p and its dependent nodes c_1, c_2, ..., c_i, ..., c_{n_p} are connected by r_{pc_1}, r_{pc_2}, ..., r_{pc_i}, ..., r_{pc_{n_p}}, where n_p is the number of dependent nodes belonging to p, and r_{pc_i} ∈ L, where L is a set of syntactic relations.

¹ We hide it for simplicity.

Figure 3: An illustration of the BiDTree and BiDTreeCRF architecture. Left: BiDTree architecture, including bottom-up propagation and top-down propagation; r means the syntactic relation (e.g., nmod, case, and det); x is the word; s and h denote cell memory and hidden state, respectively. Right: BiDTreeCRF has three modules: BiDTree, BiLSTM, and CRF.
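To make the governor-dependent formulation above concrete, the following sketch (illustrative only, not the authors' code) encodes the dependency relations of Figure 1 a) as (relation, governor, dependent) triples and collects the dependent set C(p) of each governor; the helper name `dependent_sets` is an assumption:

```python
from collections import defaultdict

# (relation, governor, dependent) triples for the sentence in Figure 1 a),
# "Speaking of the browser, it too has problems.", as a parser such as
# Stanford CoreNLP would produce them.
triples = [
    ("advcl", "has", "Speaking"),
    ("nmod", "Speaking", "browser"),
    ("case", "browser", "of"),
    ("det", "browser", "the"),
    ("punct", "has", ","),
    ("nsubj", "has", "it"),
    ("advmod", "has", "too"),
    ("dobj", "has", "problems"),
    ("punct", "has", "."),
]

def dependent_sets(triples):
    """C(p): map each governor p to its (relation, dependent) pairs."""
    c = defaultdict(list)
    for rel, gov, dep in triples:
        c[gov].append((rel, dep))
    return c

C = dependent_sets(triples)
# Bottom-up propagation visits C(p) before p; the top-down direction
# reverses each edge and prefixes the relation with "I-" (e.g., I-nmod).
top_down = [("I-" + rel, dep, gov) for rel, gov, dep in triples]
```

Here browser has the two dependents of (case) and the (det), exactly the C(p) used later in the worked example of Eq. (12).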

Typical members of L include nmod, case, det, and nsubj. The syntactic relation information not only serves as features encoded in the network but also as a guide for the selection of training weights.

BiDTree works in two directions using LSTMs: a bottom-up LSTM and a top-down LSTM. The bottom-up LSTM is shown with solid black arrows and the top-down LSTM with dotted black arrows at the lower portion of Figure 3. It should be noted that they differ not only in the direction but also in the governor node and dependent nodes. Specifically, each node of the top-down LSTM owns only one dependent node, but a node of the bottom-up LSTM generally owns more than one dependent node. As shown in Formula (1), we concatenate the output $h_{w_i}^{\uparrow}$ of the bottom-up LSTM and the output $h_{w_i}^{\downarrow}$ of the top-down LSTM into $h_{w_i}$ as the output of BiDTree for word $w_i$:

$h_{w_i} = [h_{w_i}^{\uparrow}; h_{w_i}^{\downarrow}]. \quad (1)$

This allows BiDTree to capture the global syntactic context.

Let $C(p) = \{c_1, c_2, \ldots, c_i, \ldots, c_{n_p}\}$ be the set of dependent nodes of node p described above. Under this notation, the bottom-up LSTM of BiDTree first encodes the governor word and the related syntactic relations:

$T_i = W^{\uparrow(i)} x_{w_p} + \sum_{k \in C(p)} W^{\uparrow(i)}_{r^{\uparrow}(k)} r_k^{\uparrow}, \quad (2)$
$T_o = W^{\uparrow(o)} x_{w_p} + \sum_{k \in C(p)} W^{\uparrow(o)}_{r^{\uparrow}(k)} r_k^{\uparrow}, \quad (3)$
$T_{fk} = W^{\uparrow(f)} x_{w_p} + W^{\uparrow(f)}_{r^{\uparrow}(k)} r_k^{\uparrow}, \quad (4)$
$T_u = W^{\uparrow(u)} x_{w_p} + \sum_{k \in C(p)} W^{\uparrow(u)}_{r^{\uparrow}(k)} r_k^{\uparrow}. \quad (5)$

Then, the bottom-up LSTM transition equations of BiDTree are as follows:

$i_p = \sigma\big(T_i + \sum_{k \in C(p)} U^{\uparrow(i)}_{r^{\uparrow}(k)} h_k^{\uparrow} + b^{\uparrow(i)}\big), \quad (6)$
$o_p = \sigma\big(T_o + \sum_{k \in C(p)} U^{\uparrow(o)}_{r^{\uparrow}(k)} h_k^{\uparrow} + b^{\uparrow(o)}\big), \quad (7)$
$f_{pk} = \sigma\big(T_{fk} + U^{\uparrow(f)}_{r^{\uparrow}(k)} h_k^{\uparrow} + b^{\uparrow(f)}\big), \quad (8)$
$u_p = \tanh\big(T_u + \sum_{k \in C(p)} U^{\uparrow(u)}_{r^{\uparrow}(k)} h_k^{\uparrow} + b^{\uparrow(u)}\big), \quad (9)$
$s_p^{\uparrow} = i_p \odot u_p + \sum_{l \in C(p)} f_{pl} \odot s_l^{\uparrow}, \quad (10)$
$h_p^{\uparrow} = o_p \odot \tanh(s_p^{\uparrow}), \quad (11)$

where $i_p$ is the input gate, $o_p$ is the output gate, and $f_{pk}$ and $f_{pl}$ are the forget gates, which are extended from the standard LSTM (Hochreiter and Schmidhuber, 1997; Gers et al., 1999). $s_p^{\uparrow}$ and $s_l^{\uparrow}$ are the memory cell states, $h_p^{\uparrow}$ and $h_k^{\uparrow}$ are the hidden states, $\sigma$ denotes the logistic function, $\odot$ means element-wise multiplication, $W^{\uparrow(*)}$, $W^{\uparrow(*)}_{r^{\uparrow}(k)}$, and $U^{\uparrow(*)}_{r^{\uparrow}(k)}$ are weight matrices, and $b^{\uparrow(*)}$ are bias vectors.
Here $r^{\uparrow}(k)$ is a mapping function that maps a syntactic relation type to its corresponding parameter matrix, and $* \in \{i, o, f, u\}$. Specially, the syntactic relation $r_k^{\uparrow}$ is encoded into the network like the word vector $x_{w_p}$ but initialized randomly. The size of $r_k^{\uparrow}$ is the same as that of $x_{w_p}$ in our experiments.

The top-down LSTM has the same transition equations as the bottom-up LSTM, except for the direction and the number of dependent nodes. In particular, the syntactic relation type of the top-down LSTM is opposite to that of the bottom-up LSTM, and we distinguish them by adding the prefix "I-", e.g., I-nmod for nmod. This leads to the difference of $r^{\downarrow}(k)$ and the parameter matrices. In this paper, all weights and bias vectors of BiDTree are set to size d × d and d dimensions, respectively. The output $h_{w_i}$ is thus a 2d-dimensional vector.

As an instance, we give the concrete formulas of the bottom-up propagation in Figure 3 a), which are used to calculate the output of the word "browser". In the bottom-up direction, the words "of" and "the" are related to the target word "browser" by the relations "case" and "det", respectively. Thus, $x_4$ is $x_{browser}$, and $r_2^{\uparrow}$ and $r_3^{\uparrow}$ mean $r_{case}$ and $r_{det}$, respectively. Likewise, the subscripts 2, 3, and 4 of $s^{\uparrow}$ and $h^{\uparrow}$ are replaced with their corresponding words "of", "the", and "browser" to facilitate understanding. So, the output of "browser" in the bottom-up direction is calculated as follows:

$T_i = W^{\uparrow(i)} x_{browser} + W^{\uparrow(i)}_{case} r_{case} + W^{\uparrow(i)}_{det} r_{det},$
$T_o = W^{\uparrow(o)} x_{browser} + W^{\uparrow(o)}_{case} r_{case} + W^{\uparrow(o)}_{det} r_{det},$
$T_{f(case)} = W^{\uparrow(f)} x_{browser} + W^{\uparrow(f)}_{case} r_{case},$
$T_{f(det)} = W^{\uparrow(f)} x_{browser} + W^{\uparrow(f)}_{det} r_{det},$
$T_u = W^{\uparrow(u)} x_{browser} + W^{\uparrow(u)}_{case} r_{case} + W^{\uparrow(u)}_{det} r_{det},$
$i_p = \sigma\big(T_i + U^{\uparrow(i)}_{case} h_{of}^{\uparrow} + U^{\uparrow(i)}_{det} h_{the}^{\uparrow} + b^{\uparrow(i)}\big),$
$o_p = \sigma\big(T_o + U^{\uparrow(o)}_{case} h_{of}^{\uparrow} + U^{\uparrow(o)}_{det} h_{the}^{\uparrow} + b^{\uparrow(o)}\big),$
$f_{p(case)} = \sigma\big(T_{f(case)} + U^{\uparrow(f)}_{case} h_{of}^{\uparrow} + b^{\uparrow(f)}\big),$
$f_{p(det)} = \sigma\big(T_{f(det)} + U^{\uparrow(f)}_{det} h_{the}^{\uparrow} + b^{\uparrow(f)}\big),$
$u_p = \tanh\big(T_u + U^{\uparrow(u)}_{case} h_{of}^{\uparrow} + U^{\uparrow(u)}_{det} h_{the}^{\uparrow} + b^{\uparrow(u)}\big),$
$s_{browser}^{\uparrow} = i_p \odot u_p + f_{p(case)} \odot s_{of}^{\uparrow} + f_{p(det)} \odot s_{the}^{\uparrow},$
$h_{browser}^{\uparrow} = o_p \odot \tanh(s_{browser}^{\uparrow}). \quad (12)$

The top-down propagation of "browser" has the same formulas but with a different direction. Specifically, the word "Speaking" is related to the target word "browser" by the relation "I-nmod". Thus, $x_4$ is $x_{browser}$ and $r_4^{\downarrow}$ refers to $r_{I\text{-}nmod}$.

Figure 4: LSTM Unit

Figure 5: Bidirectional LSTM

The formula for BiDTree is similar to the dependency layer in (Miwa and Bansal, 2016); the main difference is the design of the parameters of the forget gate. Their work defines a parameterization of the k-th forget gate $f_{pk}$ of the dependent node with parameter matrices $U^{\uparrow(f)}_{r^{\uparrow}(k)r^{\uparrow}(l)}$². The whole equation corresponding to Eq. (8) is as follows:

$f_{pk} = \sigma\big(T_{fk} + \sum_{l \in C(p)} U^{\uparrow(f)}_{r^{\uparrow}(k)r^{\uparrow}(l)} h_l^{\uparrow} + b^{\uparrow(f)}\big). \quad (13)$

As Tai et al. (2015) mentioned, for a large number of dependent nodes $n_p$, using additional parameters for flexible control of information propagation from dependent to governor is impractical. Considering that the proposed framework has a variable number of typed dependent nodes, we use Eq. (8) instead of Eq. (13) to reduce the computation cost. Another difference between their formulas and ours is that we encode the syntactic relation into our network, namely the second term of Eqs. (2)-(5), which is proven effective in this paper.

² The same symbols are used for easy comparison.

2.3 Integration with Bidirectional LSTM

As the second module, BiLSTM (Graves and Schmidhuber, 2005) keeps the sequential context of the dependency information between words. As Figure 4 demonstrates, an LSTM unit operates at the j-th word.
receives the output of BiDTree hw j , the previous                  where Ψj (y0 , y, g) = exp(WyT0 ,y g + by0 ,y ) is the po-
hidden state h j−1 , and the previous memory cell                   tential of pair (y0 , y). W and b are weight and bias,
c j−1 to calculate new hidden state h j and the new                 respectively.
memory cell c j using the following equations:                         Conventionally, the training process is using
                                                                  maximum conditional likelihood estimation. The
    i j = σ W (i) hw j +U (i) h j−1 + b(i) ,        (14)            log-likelihood is computed as follows:
                                               
    o j = σ W (o) hw j +U (o) h j−1 + b(o) ,        (15)                       L (W, b) = ∑ log p (y|g;W, b).            (22)
                                                                                             j
    f j = σ W ( f ) hw j +U ( f ) h j−1 + b( f ) ,  (16)
                                                                  The last labeling results are generated with the
    u j = tanh W (u) hw j +U (u) h j−1 + b(u) , (17)                highest conditional probability:
      cj = ij   uj + fj     c j−1 ,                        (18)
                                                                                 y∗ = arg max p(y|g;W, b).               (23)
    hj = oj      tanh(c j ),                               (19)                         y∈Y(g)

where i j , o j , f j are gates having the same mean-               This process is usually solved efficiently by the
ings as their counterparts in BiDTree, W (∗) with                   Viterbi algorithm.
size d × 2d, U (∗) with size d × d are weight ma-
trices, and b(∗) are d-dimensional bias vectors.                    2.5   Decoding from Labeling Results
∗ ∈ {i, o, f , u}. We also concatenate the hidden                   Once the labeling results are generated, the last
states generated by LSTM cells in both directions                   step to obtain the aspect terms of the given sen-
belonging to the same word as the output vector,                    tence is decoding the labeled sequence. Accord-
which is expressed as follows:                                      ing to the mean of elements in T , it is convenient
                             h→
                              − ← −i                                to get the aspect terms. For example, to a sen-
                        gj = hj; hj              (20)               tence “w1 w2 w3 w4 ”, if the labeling sequence is
                                                                    “B-AP B-AP I-AP O” then (“w1 ”, 1, 2) and (“w2
The architecture of BiLSTM is shown in Figure 5.
                                                                    w3 ”, 2, 4) are target aspect terms. For the above
Also, each g j is reduced to |T| dimensions by a full
                                                                    triple, the first element is the real aspect term, and
connection layer so as to pass to the subsequent
                                                                    the second element and the last element are the be-
layers in our implementation.
                                                                    ginning (inclusive) and ending (exclusive) index in
2.4     Integration with CRF                                        the sentence, respectively. Algorithm 1 gives this
                                                                    process in detail.
The learned features actually are hybrid features
containing both tree-structured and sequential in-                  2.6   Loss and Model Training
formation. All these features are fed into the
                                                                    We equivalently use the negative of L (W, b) in Eq.
last CRF layer to predict the label of each word.
                                                                    (22) as the error to do minimization optimization.
Linear-chain CRF is adopted here. Formally, let
                                                                    Thus, the loss is as follows:
g = {g1 , g2 , . . . , g j , . . . , gN } represent the output
features extracted by BiDTree and BiLSTM layer.                                  L = − ∑ log p (y|g;W, b).               (24)
The goal of CRF is to decode the best chain of                                             j
labels y = {t1 ,t2 , . . . ,t j , . . . ,tN }, where t j has been
described in Section 2.1. As a discriminant graph-                  Then, the loss of the entire model is:
ical model, CRF benefits from considering the cor-                                                  λ
relations between labels/tags in the neighborhood,                                  J (Θ) = L +       kΘk2 ,             (25)
                                                                                                    2
which is widely used in sequence labeling or tag-
ging tasks (Huang et al., 2015; Ma and Hovy,                        where Θ represents the model parameters contain-
2016). Let Y(g) denote all possible labels and                      ing all weight matrices W , U and bias vectors b,
y0 ∈ Y(g). The probability of CRF p(y|g;W, b) is                    and λ is the regularization parameter.
computed as follows:                                                  We update all parameters for BiDTreeCRF from
                                                                    top to bottom by propagating the errors through
                         ∏Nj=1 Ψj (y j−1 , y j , g)                 the CRF to the hidden layers of BiLSTM and
      p(y|g;W, b) =                                      ,
                    ∑y0 ∈Y(g) ∏Nj=1 Ψj (y0j−1 , y0j , g)            then to BiDTree via backpropagation through time
                                                        (21)        (BPTT) (Goller and Kuchler, 1996). Finally, we
use Adam (Kingma et al., 2014) for optimization with gradient clipping. The L2-regularization factor λ is set to 0.001 empirically. The mini-batch size is 20 and the initial learning rate is 0.001. We also employ dropout (Srivastava et al., 2014) on the outputs of the BiDTree and BiLSTM layers with a dropout rate of 0.5. All weights W, U and bias terms b are trainable parameters. Early stopping (Caruana et al., 2000) is used based on the performance on the validation sets, with a patience of 5 epochs in our experiments. At the same time, the initial embeddings are fine-tuned during the training process; that is, the word embeddings are modified by back-propagating gradients. We implement BiDTreeCRF using the TensorFlow library (Abadi et al., 2016), and all computations are done on an NVIDIA Tesla K80 GPU. The overall procedure of BiDTreeCRF is summarized in Algorithm 2.

Algorithm 1 Decoding from the Labeling Sequence
Input: A labeling sequence τ = {t1, t2, ..., ti, ..., tN} and its corresponding sentence S = {w1, w2, ..., wi, ..., wN}.
Output: A list of aspect term triples
 1: result ← ()
 2: temp ← ""
 3: start ← 0
 4: for i = 1; i ≤ N; i++ do
 5:   if ti = "O" and temp ≠ "" then
 6:     result ← result + (w_{start:i}, start, i)
 7:     temp ← ""
 8:     start ← 0
 9:   else
10:     if ti = "B-AP" then
11:       if temp ≠ "" then
12:         result ← result + (w_{start:i}, start, i)
13:       end if
14:       temp ← ti
15:       start ← i
16:     end if
17:   end if
18: end for
19: if temp ≠ "" then
20:   result ← result + (w_{start:i}, start, i)
21: end if
22: return result

Algorithm 2 BiDTreeCRF Training Algorithm
Input: A set of review sentences S from a particular domain; S = {w1, w2, ..., wi, ..., wN} is one of the elements of S.
Output: Learned BiDTreeCRF model
 1: Construct a dependency tree for each sentence S in S using the Stanford Parser Package.
 2: Initialize all learnable parameters Θ
 3: repeat
 4:   Select a batch of instances Sb from S
 5:   for each sentence S ∈ Sb do
 6:     Use BiDTree (1-11) to generate h
 7:     Use BiLSTM (14-20) to generate g
 8:     Compute L(W, b) through (21-22)
 9:   end for
10:   Use the backpropagation algorithm to update the parameters Θ by minimizing the objective (25) in batch update mode
11: until the stopping criterion is met

3   Experiments

In this section, we conduct experiments to evaluate the effectiveness of the proposed framework.

3.1   Datasets and Experiment Setup

We conduct experiments on four benchmark SemEval datasets. The detailed statistics of the datasets are summarized in Table 2. L-14 and R-14 are from SemEval 2014³ (Pontiki et al., 2014), R-15 is from SemEval 2015⁴ (Pontiki et al., 2015), and R-16 is from SemEval 2016⁵ (Pontiki et al., 2016). L-14 contains laptop reviews, and R-14, R-15, and R-16 all contain restaurant reviews. These datasets have been officially divided into three parts: a training set, a validation set, and a test set. These divisions are kept for a fair comparison. All these datasets contain annotated aspect terms, which are used to generate the sequence labels in the experiments. We use the Stanford Parser Package⁶ to generate dependency trees. The evaluation metric is the F1 score, the same as for the baseline methods.

Table 2: Datasets from SemEval; #S means the number of sentences, #T means the number of aspect terms; L-14, R-14, R-15, and R-16 are short for Laptops 2014, Restaurants 2014, Restaurants 2015, and Restaurants 2016, respectively.

Datasets   Train   Val    Test    Total
L-14 #S    2,945   100     800    3,845
R-14 #S    2,941   100     800    3,841
R-15 #S    1,315    48     685    2,048
R-16 #S    2,000    48     676    2,724
L-14 #T    2,304    54     654    3,012
R-14 #T    3,595    98   1,134    4,827
R-15 #T    1,654    57     845    2,556
R-16 #T    2,507    66     859    3,432

In order to initialize the word vectors, we train word embeddings with a bag-of-words based model (CBOW) (Mikolov et al., 2013) on Amazon reviews⁷ and Yelp reviews⁸, which are in-domain corpora for the laptop and restaurant domains, respectively. The Amazon review dataset contains 142.8M reviews, and the Yelp review dataset contains 2.2M restaurant reviews. Both corpora are used to train embeddings with gensim⁹, which contains an implementation of CBOW. The parameter min_count is 10 and iter is 200 in our experiments. We set the dimension of the word vectors to 300 based on the conclusion drawn in (Wang et al., 2016b). The experimental results on dimension settings for the proposed model also show that 300 is a suitable choice, which provides a good trade-off between effectiveness and efficiency.

3.2   Baseline Methods and Results

To validate the performance of our proposed model on aspect term extraction, we compare it against a number of baselines:

• IHS RD, DLIREC(U), EliXa(U), and NLANGP(U): The top system for L-14 in the SemEval Challenge 2014 (Chernyshevich, 2014), the top system for R-14 in the SemEval Challenge 2014 (Toh and Wang, 2014), the top system for R-15 in the SemEval Challenge 2015 (Vicente et al., 2015), and the top system for R-16 in the SemEval Challenge 2016 (Toh and Su, 2016), respectively. All of these systems share the same property: they are trained on a variety of lexicon and syntactic features, which is labor-intensive compared with the end-to-end fashion of a neural network. U means using additional resources without any constraint, such as lexicons or additional training data.

• WDEmb: It uses word embedding, linear context embedding, and dependency path embedding to enhance a CRF (Yin et al., 2016).

• RNCRF-O, RNCRF+F: They both extract tree-structured features using a recursive neural network as the CRF input. RNCRF-O is a model trained without opinion labels. RNCRF+F is trained not only with opinion labels but also with some hand-crafted features (Wang et al., 2016b).

• DTBCSNN+F: A convolution stacked neural network built on dependency trees to capture syntactic features. Its results are produced by the inference layer (Ye et al., 2017).

• MIN: MIN is an LSTM-based deep multi-task learning framework, which jointly handles the extraction tasks of aspects and opinions via memory interactions (Li and Lam, 2017).

• CMLA, MTCA: CMLA is a multilayer attention network, which exploits relations between aspect terms and opinion terms without any parsers or linguistic resources for preprocessing (Wang et al., 2017b). MTCA is a multi-task attention model, which learns shared information among different tasks (Wang et al., 2017a).

• LSTM+CRF, BiLSTM+CRF: They are proposed by (Huang et al., 2015) and produce state-of-the-art (or close to it) accuracy on POS, chunking, and NER datasets. We borrow them as baselines for ATE.

• BiLSTM+CNN: BiLSTM+CNN¹⁰ is the Bidirectional LSTM-CNNs-CRF model from (Ma and Hovy, 2016). Compared with BiLSTM+CRF above, BiLSTM+CNN encodes char embeddings with a CNN and obtained state-of-the-art performance on POS tagging and named entity recognition (NER). We borrow this method as a baseline for ATE. The window size of the CNN is 3, the number of filters is 30, and the dimension of the char embedding is 100.

³ http://alt.qcri.org/semeval2014/task4/
⁴ http://alt.qcri.org/semeval2015/task12/
⁵ http://alt.qcri.org/semeval2016/task5/
⁶ https://nlp.stanford.edu/software/lex-parser.html
⁷ http://jmcauley.ucsd.edu/data/amazon/
⁸ https://www.yelp.com/academic_dataset
⁹ https://radimrehurek.com/gensim/models/word2vec.html
¹⁰ We use this abbreviation for the sake of typesetting.
Table 3: Comparison on F1 scores. ‘-’ indicates the results were not available in their papers.¹²
           Models                                                        L-14       R-14         R-15        R-16
           IHS RD (Chernyshevich, 2014)                                  74.55      79.62          -           -
           DLIREC(U) (Toh and Wang, 2014)                                73.78      84.01          -           -
           EliXa(U) (Vicente et al., 2015)                                 -          -          70.05         -
           NLANGP(U) (Toh and Su, 2016)                                    -          -          67.12       72.34
           WDEmb (Yin et al., 2016)                                      75.16      84.97        69.73         -
           RNCRF-O (Wang et al., 2016b)                                  74.52      82.73          -           -
           RNCRF+F (Wang et al., 2016b)                                  78.42      84.93          -           -
           DTBCSNN+F (Ye et al., 2017)                                   75.66      83.97          -           -
           MIN (Li and Lam, 2017)                                        77.58        -            -         73.44
           CMLA (Wang et al., 2017b)                                     77.80      85.29        70.73         -
           MTCA (Wang et al., 2017a)                                     69.14        -          71.31       73.26
           LSTM+CRF                                                      73.43      81.80        66.03       70.31
           BiLSTM+CRF                                                    76.10      82.38        65.96       70.11
           BiLSTM+CNN                                                    78.97      83.87        69.64       73.36
           BiDTreeCRF#1                                                  80.36      85.08        69.44       73.74
           BiDTreeCRF#2                                                  80.22      85.31        68.61       74.01
           BiDTreeCRF#3                                                  80.57      84.83        70.83       74.49
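The scores in Table 3 are span-level F1 over extracted aspect terms. As a simplified illustration (our own sketch; the official SemEval evaluation scripts remain the authoritative implementation), exact-match F1 over the (term, start, end) triples of Section 2.5 can be computed as:

```python
def span_f1(predicted, gold):
    """Exact-match precision/recall/F1 over extracted aspect-term spans.

    `predicted` and `gold` are collections of (term, start, end) triples;
    a prediction counts as correct only if it matches a gold span exactly.
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: exactly matching spans
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, one correct and one spurious prediction against a single gold span gives precision 0.5, recall 1.0, and F1 ≈ 0.67.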

For our proposed model, there are three variants depending on whether the weight matrices of Eqs. (2-9) are shared or not.¹¹ BiDTreeCRF#1 shares all weight matrices, namely W∗↑(i,o,f,u) = W↑(i,o,f,u) and U∗↑(i,o,f,u) = U↑(i,o,f,u), which means the mapping function r↑(k) is unused. BiDTreeCRF#2 shares the weight matrices of Eqs. (2-3, 5) and Eqs. (6-7, 9) while excluding Eqs. (4, 8). BiDTreeCRF#3 keeps Eqs. (2-9) and does not share any weight matrices. The different types of weight sharing correspond to different ways of transmitting information. BiDTreeCRF#1 shares all weight matrices, which means the dependent words of a head word are undifferentiated and the syntactic relations, e.g., nmod and case, are out of consideration. BiDTreeCRF#2 treats the forget gates differently, which means each dependent word is controlled by its syntactic relation when transmitting its hidden state to the next node. BiDTreeCRF#3 further treats all gates differently. This elaborate information flow under the control of syntactic relations proves to be effective.

¹¹ The code is publicly available at https://github.com/ArrowLuo/BiDTree
¹² We report the best results from the original papers, and keep the officially divided datasets and the evaluation program the same to make the comparison fair.

The comparison results are given in Table 3. In this table, the F1 score of the proposed model is the average of 20 runs with the same hyperparameters, which have been described in Section 2.6 and are used throughout our experiments. We report the results of L-14 initialized with the Amazon Embedding. For the other datasets, we initialize with the Yelp Embedding since they are all restaurant reviews. We also show the embedding comparison below.

Compared to the best systems in the 2014, 2015, and 2016 SemEval ABSA challenges, BiDTreeCRF#3 achieves 6.02%, 0.82%, 0.78%, and 2.15% F1 score gains over IHS RD, DLIREC(U), EliXa(U), and NLANGP(U) on L-14, R-14, R-15, and R-16, respectively. Specifically, BiDTreeCRF#3 outperforms WDEmb by 5.41% on L-14 and 1.10% on R-15, and outperforms RNCRF-O by 6.05% and 2.10% on L-14 and R-14, respectively. Even compared with RNCRF+F and DTBCSNN+F, which exploit additional hand-crafted features, BiDTreeCRF#3 on L-14 and BiDTreeCRF#2 on R-14, without other linguistic features (e.g., POS), still achieve 2.15% and 4.91%, and 0.38% and 1.34% improvements, respectively. MIN is trained via memory interactions, CMLA and MTCA are designed as multi-task models, and all three of these methods use more labels and share information among different tasks. Compared with them, BiDTreeCRF#3 still gives the best score for L-14 and R-16 and a competitive score for R-15
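The difference between the sharing schemes of the three variants can be made concrete with a toy example (our own illustration with arbitrary numbers, not actual model parameters): under full sharing (BiDTreeCRF#1) every dependent sends its hidden state upward through one matrix, while the fully typed variant (BiDTreeCRF#3) selects a matrix by dependency relation, as the mapping r↑(k) does.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# BiDTreeCRF#1: a single matrix U shared by every dependency relation,
# so the message a dependent sends upward ignores its relation type.
U_shared = [[1.0, 0.0], [0.0, 1.0]]

# BiDTreeCRF#3: one matrix per relation, selected by the mapping r(k)
# for dependency type k (values here are arbitrary illustrations).
U_typed = {
    "case": [[0.5, 0.0], [0.0, 0.5]],
    "det": [[0.0, 1.0], [1.0, 0.0]],
}

h_dep = [1.0, 2.0]  # hidden state of a dependent node

shared_msg = matvec(U_shared, h_dep)       # same for every relation
case_msg = matvec(U_typed["case"], h_dep)  # relation-specific messages
det_msg = matvec(U_typed["det"], h_dep)
```

Here the `case` and `det` dependents produce different messages from the same hidden state, which is exactly the relation-specific information flow the typed variants add.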
[Figures 6-9: grouped bar charts of F1 Score (%) on L-14, R-14, R-15, and R-16 for BiDepsCRF#1, BiDepsCRF#2, and BiDepsCRF#3 under each setting.]

Figure 6: Amazon Embedding vs. Yelp Embedding (E-Amazon vs. E-Yelp) with syntactic relation.
Figure 7: Amazon Embedding vs. Yelp Embedding (E-Amazon vs. E-Yelp) without syntactic relation.
Figure 8: With syntactic relation vs. without syntactic relation (With-Rel vs. No-Rel) with Amazon Embedding.
Figure 9: With syntactic relation vs. without syntactic relation (With-Rel vs. No-Rel) with Yelp Embedding.

and BiDTreeCRF#2 achieves the state-of-the-art score for R-14, although our model is designed as a single-task model. Moreover, BiDTreeCRF#3 outperforms LSTM+CRF and BiLSTM+CRF on all datasets by 7.14%, 3.03%, 4.80%, and 4.18%, and by 4.47%, 2.45%, 4.87%, and 4.38%, respectively, and these improvements are significant (p < 0.05). Considering that BiLSTM+CRF can be seen as BiDTreeCRF#3 without the BiDTree layer, all the results support that BiDTree can extract syntactic information effectively.

As we can see, the different variants of the proposed model perform differently on the four datasets. In particular, BiDTreeCRF#3 is more powerful than the other variants on L-14, R-15, and R-16, and BiDTreeCRF#2 is more effective on R-14. We believe that R-15 being a small dataset with some "NULL" aspect terms is the reason why the performance gaps between the baselines on it are small; it is a hard dataset on which to improve the score. Thus, it is an inspiring result even though BiDTreeCRF#3, without other auxiliary information (e.g., opinion terms), is a little worse than MTCA. Besides, BiDTreeCRF#3 outperforms BiLSTM+CNN even without char embeddings. Note that we did not tune the hyperparameters of BiDTreeCRF for practical purposes because this tuning process is time-consuming.

3.3   Ablation Experiments

To test the effect of each component of BiDTreeCRF, the following ablation experiments on different layers of BiDTreeCRF#3 are performed: (1) DTree-up: the bottom-up propagation of BiDTree is connected to the BiLSTM and CRF layers. (2) DTree-down: the top-down propagation of BiDTree is connected to the BiLSTM and CRF layers. (3) BiDTree+CRF: the BiLSTM layer is removed compared to BiDTreeCRF. The initial word embeddings are the same as before. The comparison results are shown in Table 4. Comparing BiDTreeCRF with DTree-up and DTree-down, it is obvious that BiDTree is more competitive than any single-directional dependency network, which is the original motivation of the proposed BiDTreeCRF. The fact that BiDTreeCRF outperforms BiDTree+CRF indicates that the BiLSTM layer is effective in extracting sequential information on top of BiDTree. On the other hand, the fact that BiDTreeCRF outperforms BiLSTM+CRF shows that the dependency syntactic information extracted by BiDTree is extremely useful for the aspect term extraction task. All the above improvements are significant (p < 0.05) under the statistical t-test.

Table 4: F1-scores of ablation experiments on BiDTreeCRF.

Models         L-14    R-14    R-15    R-16
BiLSTM+CRF     76.10   82.38   65.96   70.11
BiDTree+CRF    71.29   81.09   64.09   67.87
DTree-up       78.96   84.47   68.69   72.42
DTree-down     78.46   84.41   68.75   72.91
BiDTreeCRF#3   80.57   84.83   70.83   74.49

[Figure 10: Sensitivity studies on word embeddings. Top: F1 Score of BiDTreeCRF#3 with different word vector dimensions d on Electronics Amazon Embedding. Bottom: F1 Score of]

3.4   Word Embeddings & Syntactic Relation

Since word embeddings are an important contributing factor for learning with less data, we also conduct comparative experiments on word embeddings. Additionally, the syntactic relation (the second terms of Eqs. (2-5)) is also adopted as a comparison criterion. The experimental setup, e.g., mini-batch size and learning rate, is the
same as the previous setup and no other changes         BiDTreeCRF#3 with different word vector dimen-
but word embeddings and with/without integrating        sions d on Yelp Embedding.
syntactic relation knowledge.
   Figure 6 and Figure 7 illustrate a compari-
                                                        ferent dimensions (ranging from 50 to 450, with
son between Amazon Embedding and Yelp Em-
                                                        the increment of 50) are involved. The sensitivity
bedding. Each figure involves three variants of
                                                        plots on the four datasets are given in Figure 10
BiDTreeCRF on four datasets. All of them show
                                                        using Amazon Embedding and Yelp Embedding,
that Amazon Embedding is always superior to
                                                        respectively. It is worth mentioning that Amazon
Yelp Embedding for L-14, and Yelp Embedding
                                                        Embedding here is only trained from reviews of
has an absolute advantage over Amazon Embed-
                                                        electronics products considering the time cost. Al-
ding for R-14, R-15, and R-16. The fact that Yelp
                                                        though the score is a little lower than the embed-
Embedding is in-domain for restaurant and Ama-
                                                        ding trained from the whole Amazon review cor-
zon Embedding is in-domain for laptop indicates
                                                        pus, the conclusion still holds. The figure shows
that in-domain embedding is more effective than
                                                        that 300 is a suitable dimension size for the pro-
out-domain embedding.
                                                        posed model. It also proves the stability and ro-
   Figure 8 and Figure 9 show a comparison of dif-
                                                        bustness of our model.
ferent syntactic relation conditions. Figure 8 is a
comparison using Amazon Embedding, and Fig-
                                                        3.6   Case Study
ure 9 is a comparison using Yelp Embedding. The
fact that the model with syntactic relation wins 7      Table 5 shows some examples from the L-
out of 12 in Figure 8 and 9 out of 12 in Figure 9       14 dataset to demonstrate the effectiveness of
comparing with the model without syntactic rela-        BiDTreeCRF. The first column contains the re-
tion indicates the syntactic relation information is    views, and the corresponding aspect terms are
useful for performance improvement.                     marked with bold font. The second column de-
                                                        scribes some dependency relations related to the
3.5   Sensitivity Test                                  aspect terms. The third column and the last col-
We conduct the sensitivity test on the dimension        umn are the extraction results of BiDTreeCRF and
d of word embeddings of BiDTreeCRF#3. Dif-              BiLSTM, respectively. On the whole, the pro-
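Both taggers output one B/I/O label per token, and the aspect terms reported as extraction results are recovered by decoding contiguous B-I spans. This decoding step is standard sequence-labeling practice rather than anything specific to the paper; the toy example below is our own:

```python
# Standard BIO-span decoding (a sketch, not the paper's code): sequence
# labelers such as BiLSTM+CRF emit one B/I/O tag per token; aspect terms
# are recovered as the contiguous B-I... runs.

def decode_bio(tokens, tags):
    """Collect each B tag and its I continuations into one aspect term."""
    terms, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":                    # a new term starts here
            if current:
                terms.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:      # continue the open term
            current.append(tok)
        else:                             # O (or a stray I) closes the term
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms

tokens = "I tried several monitors and several HDMI cables".split()
tags   = ["O", "O", "O", "B", "O", "O", "B", "I"]
print(decode_bio(tokens, tags))   # ['monitors', 'HDMI cables']
```

The single-token term "monitors" and the multi-word term "HDMI cables" both fall out of the same span rule, which is why a tagger that misses a single B or I label loses a whole term, as the comparison below illustrates.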
Table 5: Extraction comparison between BiDTreeCRF and BiLSTM (the ground-truth aspect terms in each text are marked in bold).

(1) Text: "Other than not being a fan of click pads (industry standard these days) and the lousy internal speakers, it's hard for me to find things about this notebook I don't like, especially considering the $350 price tag."
    Dependency relationships: click ←compound− pads; internal ←amod− speakers; price ←compound− tag
    BiDTreeCRF: click pads, internal speakers, price tag
    BiLSTM: internal speakers, price tag

(2) Text: "Keyboard responds well to presses."
    Dependency relationships: Keyboard ←nsubj− responds
    BiDTreeCRF: Keyboard
    BiLSTM: Keyboard, responds

(3) Text: "I am please with the products ease of use; out of the box ready; appearance and functionality."
    Dependency relationships: ease −nmod→ use −case→ of; appearance −cc→ and; appearance −conj→ functionality
    BiDTreeCRF: use, appearance, functionality
    BiLSTM: use, functionality

(4) Text: "With the softwares supporting the use of other OS makes it much better."
    Dependency relationships: use −nmod→ OS −case→ of; the ←det− softwares; softwares ←nsubj− supporting
    BiDTreeCRF: softwares, OS
    BiLSTM: softwares, use, OS

(5) Text: "I tried several monitors and several HDMI cables and this was the case each time."
    Dependency relationships: monitors −cc→ and; monitors −conj→ cables; cables −compound→ HDMI
    BiDTreeCRF: monitors, HDMI cables
    BiLSTM: HDMI cables
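The head-modifier relations listed in Table 5 suggest why multi-word terms such as "click pads" are recoverable from the parse: a compound or amod modifier attaches directly to the head noun. As a hedged illustration (the triples are transcribed from the table; the helper and its merging rule are our own simplification, not the model itself), merging such modifiers into their heads already yields the candidate phrases:

```python
# Sketch: given dependency triples (child, relation, head), attach
# "compound"/"amod" modifiers to their head nouns to form candidate
# multi-word aspect terms. Triples are taken from Table 5; the function
# name and merging rule are our own, not BiDTreeCRF.

def merge_modifiers(tokens, triples, relations=("compound",)):
    """E.g. compound(click -> pads) yields the phrase 'click pads'."""
    mods = {}
    for child, rel, head in triples:
        if rel in relations:
            mods.setdefault(head, []).append(child)
    phrases = []
    for tok in tokens:                       # keep sentence order
        if tok in mods:
            phrases.append(" ".join(mods[tok] + [tok]))
    return phrases

tokens = ["click", "pads", "internal", "speakers", "price", "tag"]
triples = [("click", "compound", "pads"),
           ("internal", "amod", "speakers"),
           ("price", "compound", "tag")]

print(merge_modifiers(tokens, triples, relations=("compound", "amod")))
# ['click pads', 'internal speakers', 'price tag']
```

In practice such triples come from a dependency parser; the point of the sketch is only that the tree makes the phrase boundary explicit, which a purely sequential tagger has to infer.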

On the whole, the proposed BiDTreeCRF extracts aspect terms better than BiLSTM, with fewer omissions and errors. In the first example, BiLSTM misses the aspect term "click pads", but its inner relation is similar to that of price ←compound− tag, which BiDTreeCRF can treat as a significant feature; thus BiDTreeCRF extracts it accurately. Likewise, through the relation Keyboard ←nsubj− responds, BiDTreeCRF avoids labeling "responds" as an aspect term. The same word "use" appears in the third and fourth examples, but only one occurrence is a real aspect term; the reason is reflected in the two relations ease −nmod→ use −case→ of and use −nmod→ OS −case→ of. In the final example, "monitors" and "cables" stand in a coordinate relation because of monitors −conj→ cables, and thus they are extracted simultaneously by BiDTreeCRF, whereas BiLSTM extracts only one of them. All of the above analysis gives supporting evidence that our proposed BiDTreeCRF, constructed on the dependency tree, is useful and can take advantage of the relations between words to improve ATE performance.

4   Related Work

As an important and practically very useful topic, sentiment analysis has been extensively studied in the literature (Hu and Liu, 2004; Cambria, 2016), especially the ATE. There are several main approaches to solving the ATE problem. Hu and Liu (2004) extracted frequently occurring nouns and noun phrases as aspect terms using frequent pattern mining. Qiu et al. (2011) and Liu et al. (2015b) proposed rule-based approaches exploiting either hand-crafted or automatically generated rules about syntactic relations between aspect terms (also called targets) and sentiment words, based on the idea that an opinion or sentiment must have a target (Liu, 2012). Chen et al. (2014) adopted topic modeling to address the ATE, employing probabilistic graphical models based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its variants. All of the above methods are based on unsupervised learning. In supervised learning, ATE is mainly regarded as a sequence labeling problem and solved by hidden Markov models (Jin et al., 2009) or CRF. However, traditional supervised methods need to design some lexical and
syntactic features artificially to improve performance. Neural networks are an effective approach to this problem.

Recent work has shown that neural networks can indeed achieve competitive performance on the ATE. Irsoy and Cardie (2013) applied a deep Elman-type Recurrent Neural Network (RNN) to extract opinion expressions and showed that deep RNNs outperform CRF, semi-CRF, and shallow RNNs. Liu et al. (2015a) further experimented with more advanced RNN variants with fine-tuned embeddings; moreover, they pointed out that employing other linguistic features (e.g., POS) can yield better results. Differently from these works, Poria et al. (2016) used a 7-layer deep convolutional neural network (CNN) to tag each word in opinionated sentences with an aspect or non-aspect label; some linguistic patterns were also used to improve labeling accuracy. Attention mechanisms and memory interaction are also effective for ATE. Li and Lam (2017) adopted two LSTMs, equipped with extended memories and neural memory operations, for jointly handling the extraction of aspects and opinions via memory interactions. Wang et al. (2017b) proposed a multi-layer attention network for the aspect and opinion term co-extraction task, which exploits indirect relations between terms for more precise information extraction. He et al. (2017) presented an unsupervised neural attention model to discover coherent aspects; its key idea is to exploit the distribution of word co-occurrences through neural word embeddings and to use an attention mechanism to de-emphasize irrelevant words during training. However, RNNs and CNNs based on the sequence structure of a sentence cannot effectively and directly capture tree-based syntactic information, which better reflects the syntactic properties of natural language and is hence very important to the ATE.

Some tree-based neural networks have also been proposed. For example, Yin et al. (2016) designed a word embedding method that considers not only the linear context but also the dependency context; the resulting embeddings are used in a CRF for extracting aspect terms. This model shows that syntactic information among words yields better performance than other representative methods for ATE. However, it involves a two-stage process and is not an end-to-end system trained directly from the dependency path information to the final ATE tags. In contrast, our proposed BiDTreeCRF is an end-to-end deep learning model and does not need any hand-crafted features. Wang et al. (2016b) integrated a dependency tree and a CRF into a unified framework for explicit aspect and opinion term co-extraction. However, a single directional propagation on the dependency tree is not enough to represent complete tree-structured syntactic information. Instead of a full connection on each layer of the dependency tree, we use a bidirectional propagation mechanism to extract information, which proves effective in our experiments. Ye et al. (2017) proposed a tree-based convolution to capture the syntactic features of sentences, which makes it hard to keep sequential information. We fuse the tree-structured and sequential information rather than relying on a single representation to address the ATE efficiently.

This paper is also related to several other models which are constructed on constituency trees and used for other NLP tasks, e.g., translation (Chen et al., 2017), relation extraction (Miwa and Bansal, 2016), relation classification (Liu et al., 2015c), and syntactic language modeling (Tai et al., 2015; Teng and Zhang, 2016; Zhang et al., 2016). However, we have different models and different applications.

5   Conclusion

In this paper, an end-to-end framework, BiDTreeCRF, was introduced. The framework can efficiently extract dependency syntactic information through bottom-up and top-down propagation in dependency trees. By combining this dependency syntactic information with the advantages of BiLSTM and CRF, we achieve state-of-the-art performance on four benchmark datasets without using any other linguistic features. Three variants of the proposed model have been evaluated and shown to be more effective than the existing state-of-the-art baseline methods; the distinction between these variants lies in whether they share weights during training. Our results suggest that dependency syntactic information may also be useful for aspect term and opinion term co-extraction, as well as other sequence labeling tasks. Additional linguistic features (e.g., POS) and character embeddings could further boost the performance of the proposed model.
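The bottom-up and top-down passes that the framework performs over the dependency tree can be caricatured with plain sums in place of LSTM compositions. This is a hedged sketch of the information flow only (tree, values, and function names are our own toy assumptions, not the model's equations): each node ends up with a summary of its whole subtree and a summary of its ancestor path.

```python
# Toy bidirectional propagation over a dependency tree (our own sketch;
# the real BiDTree composes LSTM states rather than summing numbers).

def bottom_up(tree, node, values, out):
    """Aggregate each node's subtree into out[node] (bottom-up pass)."""
    total = values[node]
    for child in tree.get(node, []):
        total += bottom_up(tree, child, values, out)
    out[node] = total
    return total

def top_down(tree, node, inherited, values, out):
    """Push ancestor information down to every node (top-down pass)."""
    out[node] = inherited + values[node]
    for child in tree.get(node, []):
        top_down(tree, child, out[node], values, out)

# Toy dependency tree rooted at "responds", as in Table 5's second example.
tree = {"responds": ["Keyboard", "well"]}
values = {"responds": 1, "Keyboard": 2, "well": 3}
up, down = {}, {}
bottom_up(tree, "responds", values, up)
top_down(tree, "responds", 0, values, down)
print(up["responds"], down["Keyboard"])   # 6 3
```

The root's bottom-up state sees every word in the sentence, while each leaf's top-down state sees its full ancestor path; combining both is what distinguishes the bidirectional scheme from a single-direction tree network.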
References

Martín Abadi, Ashish Agarwal, Paul Barham, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Pedro Paulo Balage Filho and Thiago Alexandre Salgueiro Pardo. 2014. NILC USP: Aspect extraction using semantic labels. In SemEval@COLING, pages 433-436.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3(1):993-1022.

Samuel Brody and Noemie Elhadad. 2010. An unsupervised aspect-sentiment model for online reviews. In NAACL-HLT, pages 804-812. Association for Computational Linguistics.

Erik Cambria. 2016. Affective computing and sentiment analysis. IEEE Intelligent Systems, 31(2):102-107.

Rich Caruana, Steve Lawrence, and C. Lee Giles. 2000. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402-408.

Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017. Improved neural machine translation with a syntax-aware encoder and decoder. In ACL, pages 1936-1945.

Zhiyuan Chen and Bing Liu. 2014. Topic modeling using topics from many domains, lifelong learning and big data. In ICML, pages 703-711.

Zhiyuan Chen, Arjun Mukherjee, and Bing Liu. 2014. Aspect extraction with automated prior knowledge learning. In ACL, pages 347-358.

Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. 2013. Exploiting domain knowledge in aspect extraction. In EMNLP, pages 1655-1667. Association for Computational Linguistics.

Maryna Chernyshevich. 2014. IHS R&D Belarus: Cross-domain extraction of product features using CRF. In SemEval@COLING, pages 309-313. Association for Computational Linguistics.

Yejin Choi and Claire Cardie. 2010. Hierarchical sequential learning for extracting opinions and their attributes. In ACL, pages 269-274. Association for Computational Linguistics.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451-2471.

Athanasios Giannakopoulos, Claudiu Musat, Andreea Hossmann, and Michael Baeriswyl. 2017. Unsupervised aspect term extraction with B-LSTM & CRF using automatically labelled datasets. In WASSA, pages 180-188.

Christoph Goller and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the IEEE International Conference on Neural Networks, pages 347-352.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610.

Hussam Hamdan, Patrice Bellot, and Frédéric Béchet. 2015. Lsislif: CRF and logistic regression for opinion target extraction and sentiment polarity analysis. In SemEval@NAACL-HLT, pages 753-758. Association for Computational Linguistics.

Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2017. An unsupervised neural attention model for aspect extraction. In ACL, pages 388-397.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In KDD, pages 168-177. ACM.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Ozan Irsoy and Claire Cardie. 2013. Bidirectional recursive neural networks for token-level labeling with structure. arXiv preprint arXiv:1312.0493.

Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In EMNLP, pages 720-728. Association for Computational Linguistics.

Niklas Jakob and Iryna Gurevych. 2010. Extracting opinion targets in a single- and cross-domain setting with conditional random fields. In EMNLP, pages 1035-1045. Association for Computational Linguistics.

Wei Jin, Hung Hay Ho, and Rohini K. Srihari. 2009. A novel lexicalized HMM-based learning framework for web opinion mining. In ICML, pages 465-472.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282-289.

Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu, Ying-Ju Xia, Shu Zhang, and Hao Yu. 2010. Structure-aware review mining and summarization. In COLING, pages 653-661. Association for Computational Linguistics.

Xin Li and Wai Lam. 2017. Deep multi-task learning for aspect term extraction with memory interaction. In EMNLP, pages 2876-2882. Association for Computational Linguistics.

Chenghua Lin and Yulan He. 2009. Joint sentiment/topic model for sentiment analysis. In CIKM, pages 375-384. ACM.

Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1-167.

Pengfei Liu, Shafiq Joty, and Helen Meng. 2015a. Fine-grained opinion mining with recurrent neural networks and word embeddings. In EMNLP, pages 1433-1443. Association for Computational Linguistics.

Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2013. A logic programming approach to aspect extraction in opinion mining. In Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, pages 276-283. IEEE.

Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2015b. Automated rule selection for aspect extraction in opinion mining. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 1291-1297. AAAI Press.

Qian Liu, Bing Liu, Yuanlin Zhang, Doo Soon Kim, and Zhiqiang Gao. 2016. Improving opinion aspect extraction using semantic similarity and aspect associations. In AAAI, pages 2986-2992.

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015c. A dependency-based neural network for relation classification. arXiv preprint arXiv:1507.04646.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL, pages 1064-1074. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Margaret Mitchell, Jacqui Aguilar, Theresa Wilson, and Benjamin Van Durme. 2013. Open domain targeted sentiment. In EMNLP, pages 1643-1654.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In ACL, pages 1105-1116. Association for Computational Linguistics.

Samaneh Moghaddam and Martin Ester. 2011. ILDA: Interdependent LDA model for learning latent aspects and their ratings from online product reviews. In SIGIR, pages 665-674. ACM.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In SemEval@NAACL-HLT, pages 486-495. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, et al. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In SemEval@NAACL-HLT, pages 19-30. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In SemEval@COLING, pages 27-35. Association for Computational Linguistics.

Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2016. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems, 108:42-49.

Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics, 37(1):9-27.

Lei Shu, Hu Xu, and Bing Liu. 2017. Lifelong learning CRF for supervised aspect extraction. In ACL, pages 148-154.

Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In ICML, pages 129-136.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929-1958.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL-AFNLP, pages 1556-1566. Association for Computational Linguistics.

Zhiyang Teng and Yue Zhang. 2016. Bidirectional tree-structured LSTM with head lexicalization. arXiv preprint arXiv:1611.06788.

Zhiqiang Toh and Jian Su. 2015. NLANGP: Supervised machine learning system for aspect category classification and opinion target extraction. In SemEval@NAACL-HLT, pages 496-501. Association for Computational Linguistics.