Improving Aspect Term Extraction with Bidirectional Dependency Tree Representation
Huaishao Luo1, Tianrui Li1*, Bing Liu2, Bin Wang1, and Herwig Unger3
1 School of Information Science and Technology, Southwest Jiaotong University
huaishaoluo@gmail.com, trli@swjtu.edu.cn, binwang007@gmail.com
2 Department of Computer Science, University of Illinois at Chicago
liub@uic.edu
3 Faculty of Mathematics and Computer Science, Fern University in Hagen
herwig.unger@gmail.com

arXiv:1805.07889v2 [cs.CL] 5 May 2019

Abstract

Aspect term extraction is one of the important subtasks in aspect-based sentiment analysis. Previous studies have shown that using dependency tree structure representation is promising for this task. However, most dependency tree structures involve only one directional propagation on the dependency tree. In this paper, we first propose a novel bidirectional dependency tree network to extract dependency structure features from the given sentences. The key idea is to explicitly incorporate both representations gained separately from the bottom-up and top-down propagation on the given dependency syntactic tree. An end-to-end framework is then developed to integrate the embedded representations and BiLSTM plus CRF to learn both tree-structured and sequential features to solve the aspect term extraction problem. Experimental results demonstrate that the proposed model outperforms state-of-the-art baseline models on four benchmark SemEval datasets.

Table 1: Example of user reviews with aspect terms marked in bold.

No. | Reviews
1 | The design and atmosphere are just as good.
2 | The staff is very kind and well trained, they're fast, they are always prompt to jump behind the bar and fix drinks, they know details of every item in the menu and make excellent recommendation.
3 | I love the operating system and the preloaded software.
4 | There also seemed to be a problem with the hard disc, as certain times windows loads but claims to not be able to find any drivers or files.

1 Introduction

Aspect term extraction (ATE) is the task of extracting the attributes (or aspects) of an entity upon which people have expressed opinions. It is one of the most important subtasks in aspect-based sentiment analysis (Liu, 2012). As the examples in Table 1 show, "design", "atmosphere", "staff", "bar", "drinks", and "menu" in the first two sentences are aspect terms of the restaurant reviews, and "operating system", "preloaded software", "hard disc", "windows", and "drivers" in the last two sentences are aspect terms of the laptop reviews.

Existing methods for ATE can be divided into unsupervised and supervised approaches. The unsupervised approach is mainly based on topic modeling (Lin and He, 2009; Brody and Elhadad, 2010; Moghaddam and Ester, 2011; Chen et al., 2013; Chen and Liu, 2014; Chen et al., 2014), syntactic rules (Wang and Wang, 2008; Zhang et al., 2010; Wu et al., 2009; Qiu et al., 2011; Liu et al., 2013), and lifelong learning (Chen et al., 2014; Wang et al., 2016a; Liu et al., 2016; Shu et al., 2017). The supervised approach is mainly based on Conditional Random Fields (CRF) (Lafferty et al., 2001; Jakob and Gurevych, 2010; Choi and Cardie, 2010; Li et al., 2010; Mitchell et al., 2013; Giannakopoulos et al., 2017).

This paper focuses on CRF-based models, which regard ATE as a sequence labeling task. There are three main types of features that have been used in previous CRF-based models for ATE. The first type is the traditional natural language features, e.g., syntactic structures and lexical features (Toh and Su, 2016; Hamdan et al., 2015; Toh and Su, 2015; Balage Filho and Pardo, 2014; Jakob and Gurevych, 2010; Shu et al., 2017). The second type is the cross-domain knowledge based features, which are useful because there are plenty of shared aspects across domains although each entity/product is different (Jakob and Gurevych, 2010; Mitchell et al., 2013; Shu et al., 2017). The final type is the deep learning features learned by deep learning models, which have proven very useful for ATE in recent years (Giannakopoulos et al., 2017; Liu et al., 2015a; Wang et al., 2016b; Yin et al., 2016; Ye et al., 2017; Li and Lam, 2017; Wang et al., 2017b,a).

Figure 1: Examples of dependency relations (generated by the basic dependencies of Stanford CoreNLP 3.8.0) for the sentences a) "Speaking of the browser, it too has problems." and b) "I love the operating system and the preloaded software." Each node is a word, and each edge is the dependency relation between two words.

Figure 2: An example of a constituency tree (generated by the constituency parse of Stanford CoreNLP 3.8.0) for the sentence "Speaking of the browser, it too has problems." Each node with the blue background is a real word in the sentence.

The deep learning features generally include sequential representation and tree-structured representation features. Sequential representation means the word order of a sentence. Tree-structured representation features come from the syntax structure of a sentence, which represents the internal logical relations between words. Figure 1 shows two examples of the dependency structure, in which each node is a word of the sentence, and each edge is a dependency relation between words. For example, the relation Speaking --nmod--> browser means Speaking is a nominal modifier of browser. Such a relation is useful in ATE. For instance, given system as an aspect term, software can be extracted as an aspect term through the relation system --conj--> software in Figure 1 b), because conj means system and software are connected by a coordinating conjunction (e.g., and). However, the tree-structured representation in previous work only considered a single direction of propagation (bottom-up propagation) trained on the parse trees with shared weights. We further exploit the capability of the tree-structured representation by considering top-down propagation, which means that given software as an aspect term, system can be extracted as an aspect term through the relation software --conj^-1--> system, where conj^-1 is the inverse relation of conj for the purpose of distinguishing different directions of propagation. Compared with the sequential representation, the tree-structured representation is capable of obtaining the long-range dependency relation between words, especially for long sentences like the second and fourth reviews in Table 1.

In this paper, we first enhance the tree-structured representation using a bidirectional gate control mechanism which originates from bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997; Gers et al., 1999) and then fuse the tree-structured and the sequential information to perform the aspect term extraction. By combining the two steps into one, we propose a novel framework named bidirectional dependency tree conditional random fields (BiDTreeCRF).
Specifically, BiDTreeCRF is an incremental framework, which consists of three main components. The first component is a bidirectional dependency tree network (BiDTree), which is an extension of the recursive neural network in (Socher et al., 2011). Its goal is to extract the tree-structured representation from the dependency tree of a given sentence. The second component is the BiLSTM, whose input is the output of BiDTree. The tree-structured and sequential information is fused in this layer. The last component is the CRF, which is used to generate labels. To the best of our knowledge, this is the first work to fuse tree-structured and sequential information to solve the ATE. This new model results in major improvements for ATE over the existing baseline models.

The proposed BiDTree is constructed based on the dependency tree. Compared with many other methods based on the constituency tree (Figure 2) (Irsoy and Cardie, 2013; Tai et al., 2015; Teng and Zhang, 2016; Chen et al., 2017), BiDTree focuses more directly on the dependency relation between words, because all nodes in the dependency tree are input words themselves, whereas the constituency tree focuses on identified phrases and their recursive structure.

The two main contributions of this paper are as follows.

• It proposes a novel bidirectional recursive neural network BiDTree, which enhances the tree-structured representation by constructing a bidirectional propagation mechanism on the dependency tree. Thus, BiDTree can capture more effective tree-structured representation features and gain better performance.

• It proposes the incremental framework BiDTreeCRF, which can incorporate both the syntactic information and the sequential information. These pieces of information are fed into the CRF layer for aspect term extraction. The integrated model can be effectively trained in an end-to-end fashion.

2 Model Description

The architecture of the proposed framework is shown in Figure 3. Its sample input is the dependency relations presented in Figure 1. As described in Section 1, BiDTreeCRF consists of three modules (or components): BiDTree, BiLSTM, and CRF. These modules are described in detail in Sections 2.2 and 2.3.

2.1 Problem Statement

We are given a review sentence from a particular domain, denoted by $S = \{w_1, w_2, \dots, w_i, \dots, w_N\}$, where $N$ is the sentence length. For any word $w_i \in S$, the task of ATE is to find a label $t_i \in T$ corresponding to it, where $T = \{\text{B-AP}, \text{I-AP}, \text{O}\}$. "B-AP", "I-AP", and "O" stand for the beginning of an aspect term, the inside of an aspect term, and other words, respectively. For example, "The/O picture/B-AP quality/I-AP is/O very/O good/O ./O" is a sentence with labels (or tags), where the aspect term is picture quality. This BIO encoding scheme is widely used in NLP tasks, and such tasks are often solved using CRF-based methods (Liu et al., 2015a; Wang et al., 2016b; Irsoy and Cardie, 2013, 2014).
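To make the labeling scheme concrete, the following minimal Python sketch (added here for illustration; it is not part of the original paper or the authors' code) pairs the example sentence with its BIO tags:

```python
# Minimal illustration of the BIO scheme from Section 2.1 (a sketch, not the authors' code).
sentence = ["The", "picture", "quality", "is", "very", "good", "."]
tags     = ["O",   "B-AP",    "I-AP",    "O",  "O",    "O",    "O"]

TAG_SET = {"B-AP", "I-AP", "O"}          # the label set T
assert len(sentence) == len(tags) and set(tags) <= TAG_SET

# Prints "The/O picture/B-AP quality/I-AP is/O very/O good/O ./O"
print(" ".join(f"{w}/{t}" for w, t in zip(sentence, tags)))
```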
2.2 Bidirectional Dependency Tree Network

Since BiDTree is built on the dependency tree, a sentence should be converted to a dependency-based parse tree first. As the left part of Figure 1 shows, each node in the dependency tree represents a word and connects to at least one other node/word. Each node has one and only one head word, e.g., Speaking is the head of browser, has is the head of Speaking, and the head word of has is ROOT (hidden in the figures for simplicity). The edge between each node and its head word is a syntactic dependency relation, e.g., nmod between browser and Speaking is used for nominal modifiers of nouns or clausal predicates. Syntactic relations in Figure 3 are shown as dotted black lines.

After generating a dependency tree, each word $w_i$ is initialized with a feature vector $x_{w_i} \in \mathbb{R}^d$, which corresponds to a column of a pre-trained word embedding $E \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of the word vector and $|V|$ is the size of the vocabulary. As described above, each relation of a dependency tree starts from a head word and points to its dependent words. This can be formulated as follows: the governor node $p$ and its dependent nodes $c_1, c_2, \dots, c_i, \dots, c_{n_p}$ are connected by $r_{pc_1}, r_{pc_2}, \dots, r_{pc_i}, \dots, r_{pc_{n_p}}$, where $n_p$ is the number of dependent nodes belonging to $p$, and $r_{pc_i} \in L$, where $L$ is a set of syntactic relations such as nmod, case, det, nsubj, and so on. The syntactic relation information not only serves as features encoded in the network but also as a guide for the selection of training weights.
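As a hypothetical sketch of how such a dependency tree can be represented in code (the words and relation labels follow Figure 1 a); the data structure itself is our own illustrative choice, not something specified in the paper):

```python
# Hypothetical encoding of the dependency tree for "Speaking of the browser, it too has problems."
# Each governor word p maps to its list of (dependent c, relation r_pc), i.e., the edges of Figure 1 a).
dep_tree = {
    "has":      [("Speaking", "advcl"), ("it", "nsubj"), ("too", "advmod"),
                 ("problems", "dobj"), (",", "punct"), (".", "punct")],
    "Speaking": [("browser", "nmod")],
    "browser":  [("of", "case"), ("the", "det")],
}

def children(p):
    """Return C(p), the dependents of governor p together with their relations (empty for leaves)."""
    return dep_tree.get(p, [])

print(children("browser"))  # [('of', 'case'), ('the', 'det')]
```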
Figure 3: An illustration of the BiDTree and BiDTreeCRF architecture. Left: BiDTree architecture, including bottom-up propagation and top-down propagation; r means the syntactic relation (e.g., nmod, case, and det); x is the word; s and h denote cell memory and hidden state, respectively. Right: BiDTreeCRF has three modules: BiDTree, BiLSTM, and CRF.

BiDTree works in two directions using LSTMs: a bottom-up LSTM and a top-down LSTM. The bottom-up LSTM is shown with solid black arrows and the top-down LSTM is shown with dotted black arrows at the lower portion of Figure 3. It should be noted that they differ not only in direction but also in the governor node and dependent nodes. Specifically, each node of the top-down LSTM only owns one dependent node, but a node of the bottom-up LSTM generally owns more than one dependent node. As shown in Formula (1), we concatenate the output $h_{w_i}^{\uparrow}$ of the bottom-up LSTM and the output $h_{w_i}^{\downarrow}$ of the top-down LSTM into $h_{w_i}$ as the output of BiDTree for word $w_i$:

$$h_{w_i} = [h_{w_i}^{\uparrow}; h_{w_i}^{\downarrow}]. \qquad (1)$$

This allows BiDTree to capture the global syntactic context.

Let $C(p) = \{c_1, c_2, \dots, c_i, \dots, c_{n_p}\}$ be the set of dependent nodes of node $p$ described above. Under these symbolic conventions, the bottom-up LSTM of BiDTree first encodes the governor word and the related syntactic relations:

$$T_i^{\uparrow(i)} = W^{\uparrow(i)} x_{w_p} + \sum_{k \in C(p)} W_{r^{\uparrow}(k)}^{\uparrow(i)} r_k^{\uparrow}, \qquad (2)$$
$$T_o^{\uparrow(o)} = W^{\uparrow(o)} x_{w_p} + \sum_{k \in C(p)} W_{r^{\uparrow}(k)}^{\uparrow(o)} r_k^{\uparrow}, \qquad (3)$$
$$T_{fk}^{\uparrow(f)} = W^{\uparrow(f)} x_{w_p} + W_{r^{\uparrow}(k)}^{\uparrow(f)} r_k^{\uparrow}, \qquad (4)$$
$$T_u^{\uparrow(u)} = W^{\uparrow(u)} x_{w_p} + \sum_{k \in C(p)} W_{r^{\uparrow}(k)}^{\uparrow(u)} r_k^{\uparrow}. \qquad (5)$$

Then, the bottom-up LSTM transition equations of BiDTree are as follows:

$$i_p = \sigma\Big(T_i^{\uparrow(i)} + \sum_{k \in C(p)} U_{r^{\uparrow}(k)}^{\uparrow(i)} h_k^{\uparrow} + b^{\uparrow(i)}\Big), \qquad (6)$$
$$o_p = \sigma\Big(T_o^{\uparrow(o)} + \sum_{k \in C(p)} U_{r^{\uparrow}(k)}^{\uparrow(o)} h_k^{\uparrow} + b^{\uparrow(o)}\Big), \qquad (7)$$
$$f_{pk} = \sigma\Big(T_{fk}^{\uparrow(f)} + U_{r^{\uparrow}(k)}^{\uparrow(f)} h_k^{\uparrow} + b^{\uparrow(f)}\Big), \qquad (8)$$
$$u_p = \tanh\Big(T_u^{\uparrow(u)} + \sum_{k \in C(p)} U_{r^{\uparrow}(k)}^{\uparrow(u)} h_k^{\uparrow} + b^{\uparrow(u)}\Big), \qquad (9)$$
$$s_p^{\uparrow} = i_p \odot u_p + \sum_{l \in C(p)} f_{pl} \odot s_l^{\uparrow}, \qquad (10)$$
$$h_p^{\uparrow} = o_p \odot \tanh(s_p^{\uparrow}), \qquad (11)$$

where $i_p$ is the input gate, $o_p$ is the output gate, and $f_{pk}$ and $f_{pl}$ are the forget gates, which are extended from the standard LSTM (Hochreiter and Schmidhuber, 1997; Gers et al., 1999). $s_p^{\uparrow}$ and $s_l^{\uparrow}$ are the memory cell states, $h_p^{\uparrow}$ and $h_k^{\uparrow}$ are the hidden states, $\sigma$ denotes the logistic function, $\odot$ means element-wise multiplication, $W^{\uparrow(*)}$, $W_{r^{\uparrow}(k)}^{\uparrow(*)}$, and $U_{r^{\uparrow}(k)}^{\uparrow(*)}$ are weight matrices, $b^{\uparrow(*)}$ are bias vectors, and $r^{\uparrow}(k)$ is a mapping function that maps a syntactic relation type to its corresponding parameter matrix, with $* \in \{i, o, f, u\}$. Specially, the syntactic relation $r_k^{\uparrow}$ is encoded into the network like the word vector $x_{w_p}$ but initialized randomly. The size of $r_k^{\uparrow}$ is the same as that of $x_{w_p}$ in our experiments.
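The following simplified numpy sketch of the bottom-up pass is our own illustration of Eqs. (2)-(11), not the authors' TensorFlow implementation: it uses a toy dimension, randomly initialized parameters, placeholder word embeddings, and the small tree from Figure 1 a).

```python
# Simplified bottom-up recursion over a dependency tree (sketch of Eqs. 2-11).
import numpy as np

d = 8
rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

gates = ("i", "o", "f", "u")
relations = ("advcl", "nsubj", "advmod", "dobj", "punct", "nmod", "case", "det")
W   = {g: rng.normal(scale=0.1, size=(d, d)) for g in gates}                          # W^(g)
b   = {g: np.zeros(d) for g in gates}                                                 # b^(g)
W_r = {(g, r): rng.normal(scale=0.1, size=(d, d)) for g in gates for r in relations}  # W^(g)_{r(k)}
U_r = {(g, r): rng.normal(scale=0.1, size=(d, d)) for g in gates for r in relations}  # U^(g)_{r(k)}
r_vec = {r: rng.normal(scale=0.1, size=d) for r in relations}                         # relation embeddings r_k
emb = {}
x = lambda w: emb.setdefault(w, rng.normal(scale=0.1, size=d))                        # stand-in for x_w

def bottom_up(p, tree):
    """Return (s_p, h_p) for governor p, recursing over its dependents C(p)."""
    kids = [(c, r) + bottom_up(c, tree) for c, r in tree.get(p, [])]   # [(c, r, s_c, h_c), ...]
    T = {g: W[g] @ x(p) + sum(W_r[(g, r)] @ r_vec[r] for _, r, _, _ in kids) for g in ("i", "o", "u")}
    i_p = sigmoid(T["i"] + sum(U_r[("i", r)] @ h for _, r, _, h in kids) + b["i"])   # Eq. (6)
    o_p = sigmoid(T["o"] + sum(U_r[("o", r)] @ h for _, r, _, h in kids) + b["o"])   # Eq. (7)
    u_p = np.tanh(T["u"] + sum(U_r[("u", r)] @ h for _, r, _, h in kids) + b["u"])   # Eq. (9)
    s_p = i_p * u_p                                                                   # Eq. (10), first term
    for _, r, s_c, h_c in kids:                        # one forget gate per dependent, Eqs. (4) and (8)
        f_pk = sigmoid(W["f"] @ x(p) + W_r[("f", r)] @ r_vec[r] + U_r[("f", r)] @ h_c + b["f"])
        s_p = s_p + f_pk * s_c
    return s_p, o_p * np.tanh(s_p)                                                    # Eq. (11)

tree = {"has": [("Speaking", "advcl"), ("it", "nsubj"), ("too", "advmod"), ("problems", "dobj")],
        "Speaking": [("browser", "nmod")], "browser": [("of", "case"), ("the", "det")]}
s_root, h_root = bottom_up("has", tree)
print(h_root.shape)  # (8,)
```

A top-down pass with inverse ("I-"-prefixed) relations, computed analogously and concatenated with the bottom-up output as in Eq. (1), completes the BiDTree representation.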
The top-down LSTM has the same transition equations as the bottom-up LSTM, except for the direction and the number of dependent nodes. Particularly, the syntactic relation type of the top-down LSTM is opposite to that of the bottom-up LSTM, and we distinguish them by adding the prefix "I-", e.g., I-nmod for nmod. This leads to a different mapping function $r^{\downarrow}(k)$ and different parameter matrices. In this paper, all weights and bias vectors of BiDTree are set to size $d \times d$ and $d$ dimensions, respectively. The output $h_{w_i}$ is thus a $2d$-dimensional vector.

As an instance, we give the concrete formulas of the bottom-up propagation in Figure 3 a), which are used to calculate the output of the word "browser". In the bottom-up direction, the words "of" and "the" are related to the target word "browser" by the relations "case" and "det", respectively. Thus, $x_4$ is $x_{browser}$, and $r_2^{\uparrow}$ and $r_3^{\uparrow}$ mean $r_{case}$ and $r_{det}$, respectively. Likewise, the subscripts 2, 3, and 4 of $s^{\uparrow}$ and $h^{\uparrow}$ are replaced with their corresponding words "of", "the", and "browser" to facilitate understanding. So, the output of "browser" in the bottom-up direction is calculated as follows:

$$T_i^{\uparrow(i)} = W^{\uparrow(i)} x_{browser} + W_{case}^{\uparrow(i)} r_{case} + W_{det}^{\uparrow(i)} r_{det},$$
$$T_o^{\uparrow(o)} = W^{\uparrow(o)} x_{browser} + W_{case}^{\uparrow(o)} r_{case} + W_{det}^{\uparrow(o)} r_{det},$$
$$T_{f(case)}^{\uparrow(f)} = W^{\uparrow(f)} x_{browser} + W_{case}^{\uparrow(f)} r_{case},$$
$$T_{f(det)}^{\uparrow(f)} = W^{\uparrow(f)} x_{browser} + W_{det}^{\uparrow(f)} r_{det},$$
$$T_u^{\uparrow(u)} = W^{\uparrow(u)} x_{browser} + W_{case}^{\uparrow(u)} r_{case} + W_{det}^{\uparrow(u)} r_{det},$$
$$i_p = \sigma\big(T_i^{\uparrow(i)} + U_{case}^{\uparrow(i)} h_{of}^{\uparrow} + U_{det}^{\uparrow(i)} h_{the}^{\uparrow} + b^{\uparrow(i)}\big),$$
$$o_p = \sigma\big(T_o^{\uparrow(o)} + U_{case}^{\uparrow(o)} h_{of}^{\uparrow} + U_{det}^{\uparrow(o)} h_{the}^{\uparrow} + b^{\uparrow(o)}\big),$$
$$f_{p(case)} = \sigma\big(T_{f(case)}^{\uparrow(f)} + U_{case}^{\uparrow(f)} h_{of}^{\uparrow} + b^{\uparrow(f)}\big),$$
$$f_{p(det)} = \sigma\big(T_{f(det)}^{\uparrow(f)} + U_{det}^{\uparrow(f)} h_{the}^{\uparrow} + b^{\uparrow(f)}\big),$$
$$u_p = \tanh\big(T_u^{\uparrow(u)} + U_{case}^{\uparrow(u)} h_{of}^{\uparrow} + U_{det}^{\uparrow(u)} h_{the}^{\uparrow} + b^{\uparrow(u)}\big),$$
$$s_{browser}^{\uparrow} = i_p \odot u_p + f_{p(case)} \odot s_{of}^{\uparrow} + f_{p(det)} \odot s_{the}^{\uparrow},$$
$$h_{browser}^{\uparrow} = o_p \odot \tanh(s_{browser}^{\uparrow}). \qquad (12)$$

The top-down propagation of "browser" has the same formulas but with a different direction. Specifically, the word "Speaking" is related to the target word "browser" by the relation "I-nmod". Thus, $x_4$ is $x_{browser}$ and $r_4^{\downarrow}$ refers to $r_{\text{I-nmod}}$.
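As a small illustration of how the top-down view can be derived from the same tree (our own sketch, reusing the dictionary representation from the earlier snippet; the "I-" prefix convention is the paper's, the code is not):

```python
# Build the top-down view: each node receives exactly one incoming edge from its governor,
# and the relation is renamed with the "I-" prefix (e.g., nmod becomes I-nmod).
def invert(dep_tree):
    down = {}
    for governor, edges in dep_tree.items():
        for dependent, rel in edges:
            down[dependent] = (governor, "I-" + rel)   # exactly one governor per node
    return down

dep_tree = {"Speaking": [("browser", "nmod")], "browser": [("of", "case"), ("the", "det")]}
print(invert(dep_tree)["browser"])   # ('Speaking', 'I-nmod')
```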
The formulas of BiDTree are similar to the dependency layer in (Miwa and Bansal, 2016); the main difference is the design of the parameters of the forget gate. Their work defines a parameterization of the $k$-th forget gate $f_{pk}$ of the dependent node with parameter matrices $U_{r^{\uparrow}(k) r^{\uparrow}(l)}^{\uparrow(f)}$ (the same symbols are used here for easy comparison). The whole equation corresponding to Eq. (8) is as follows:

$$f_{pk} = \sigma\Big(T_{fk}^{\uparrow(f)} + \sum_{l \in C(p)} U_{r^{\uparrow}(k) r^{\uparrow}(l)}^{\uparrow(f)} h_k^{\uparrow} + b^{\uparrow(f)}\Big). \qquad (13)$$

As Tai et al. (2015) mentioned, for a large number of dependent nodes $n_p$, using additional parameters for flexible control of information propagation from dependent to governor is impractical. Considering that the proposed framework has a variable number of typed dependent nodes, we use Eq. (8) instead of Eq. (13) to reduce the computation cost. Another difference between their formulas and ours is that we encode the syntactic relation into our network, namely the second term of Eqs. (2-5), which is proven effective in this paper.

Figure 4: LSTM Unit.

Figure 5: Bidirectional LSTM.

2.3 Integration with Bidirectional LSTM

As the second module, BiLSTM (Graves and Schmidhuber, 2005) keeps the sequential context of the dependency information between words. As Figure 4 demonstrates, the LSTM unit at the $j$-th word
receives the output of BiDTree $h_{w_j}$, the previous hidden state $h_{j-1}$, and the previous memory cell $c_{j-1}$ to calculate the new hidden state $h_j$ and the new memory cell $c_j$ using the following equations:

$$i_j = \sigma\big(W^{(i)} h_{w_j} + U^{(i)} h_{j-1} + b^{(i)}\big), \qquad (14)$$
$$o_j = \sigma\big(W^{(o)} h_{w_j} + U^{(o)} h_{j-1} + b^{(o)}\big), \qquad (15)$$
$$f_j = \sigma\big(W^{(f)} h_{w_j} + U^{(f)} h_{j-1} + b^{(f)}\big), \qquad (16)$$
$$u_j = \tanh\big(W^{(u)} h_{w_j} + U^{(u)} h_{j-1} + b^{(u)}\big), \qquad (17)$$
$$c_j = i_j \odot u_j + f_j \odot c_{j-1}, \qquad (18)$$
$$h_j = o_j \odot \tanh(c_j), \qquad (19)$$

where $i_j$, $o_j$, and $f_j$ are gates having the same meanings as their counterparts in BiDTree, $W^{(*)}$ of size $d \times 2d$ and $U^{(*)}$ of size $d \times d$ are weight matrices, $b^{(*)}$ are $d$-dimensional bias vectors, and $* \in \{i, o, f, u\}$. We also concatenate the hidden states generated by the LSTM cells in both directions belonging to the same word as the output vector, which is expressed as follows:

$$g_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]. \qquad (20)$$

The architecture of BiLSTM is shown in Figure 5. Also, each $g_j$ is reduced to $|T|$ dimensions by a fully connected layer so as to pass to the subsequent layers in our implementation.

2.4 Integration with CRF

The learned features are actually hybrid features containing both tree-structured and sequential information. All these features are fed into the last CRF layer to predict the label of each word. A linear-chain CRF is adopted here. Formally, let $g = \{g_1, g_2, \dots, g_j, \dots, g_N\}$ represent the output features extracted by the BiDTree and BiLSTM layers. The goal of the CRF is to decode the best chain of labels $y = \{t_1, t_2, \dots, t_j, \dots, t_N\}$, where $t_j$ has been described in Section 2.1. As a discriminant graphical model, the CRF benefits from considering the correlations between labels/tags in the neighborhood, and it is widely used in sequence labeling or tagging tasks (Huang et al., 2015; Ma and Hovy, 2016). Let $Y(g)$ denote all possible label sequences and $y' \in Y(g)$. The probability $p(y|g;W,b)$ of the CRF is computed as follows:

$$p(y|g;W,b) = \frac{\prod_{j=1}^{N} \Psi_j(y_{j-1}, y_j, g)}{\sum_{y' \in Y(g)} \prod_{j=1}^{N} \Psi_j(y'_{j-1}, y'_j, g)}, \qquad (21)$$

where $\Psi_j(y', y, g) = \exp(W_{y',y}^{\top} g + b_{y',y})$ is the potential of the pair $(y', y)$, and $W$ and $b$ are the weight and bias, respectively.

Conventionally, the training process uses maximum conditional likelihood estimation. The log-likelihood is computed as follows:

$$\mathcal{L}(W,b) = \sum_{j} \log p(y|g;W,b). \qquad (22)$$

The final labeling results are generated with the highest conditional probability:

$$y^{*} = \arg\max_{y \in Y(g)} p(y|g;W,b). \qquad (23)$$

This process is usually solved efficiently by the Viterbi algorithm.
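The paper only states that decoding uses the Viterbi algorithm; the following minimal numpy sketch is our illustration of the dynamic program behind Eq. (23), with hypothetical emission and transition scores as inputs (e.g., emission[j, y] built from the features $g_j$ of Section 2.4):

```python
# Viterbi decoding for a linear-chain CRF (illustrative sketch; the score parameterization is assumed).
import numpy as np

def viterbi(emission, transition):
    """emission: (N, K) per-position label scores; transition: (K, K) label-to-label scores."""
    N, K = emission.shape
    score = emission[0].copy()                 # best score of a path ending in each label
    back = np.zeros((N, K), dtype=int)         # backpointers
    for j in range(1, N):
        cand = score[:, None] + transition + emission[j][None, :]   # rows: previous label, cols: next
        back[j] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for j in range(N - 1, 0, -1):
        path.append(int(back[j, path[-1]]))
    return path[::-1]                          # label indices t_1 ... t_N

labels = ["B-AP", "I-AP", "O"]
rng = np.random.default_rng(1)
print([labels[i] for i in viterbi(rng.normal(size=(6, 3)), rng.normal(size=(3, 3)))])
```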
2.5 Decoding from Labeling Results

Once the labeling results are generated, the last step to obtain the aspect terms of the given sentence is decoding the labeled sequence. According to the meaning of the elements in $T$, it is convenient to get the aspect terms. For example, for a sentence "w1 w2 w3 w4", if the labeling sequence is "B-AP B-AP I-AP O", then ("w1", 1, 2) and ("w2 w3", 2, 4) are the target aspect terms. In each of these triples, the first element is the real aspect term, and the second and last elements are the beginning (inclusive) and ending (exclusive) indices in the sentence, respectively. Algorithm 1 gives this process in detail.

2.6 Loss and Model Training

We equivalently use the negative of $\mathcal{L}(W,b)$ in Eq. (22) as the error to perform minimization. Thus, the loss is as follows:

$$L = -\sum_{j} \log p(y|g;W,b). \qquad (24)$$

Then, the loss of the entire model is:

$$J(\Theta) = L + \frac{\lambda}{2}\|\Theta\|^{2}, \qquad (25)$$

where $\Theta$ represents the model parameters containing all weight matrices $W$, $U$ and bias vectors $b$, and $\lambda$ is the regularization parameter.

We update all parameters of BiDTreeCRF from top to bottom by propagating the errors through the CRF to the hidden layers of BiLSTM and then to BiDTree via backpropagation through time (BPTT) (Goller and Kuchler, 1996). Finally, we use Adam (Kingma et al., 2014) for optimization with gradient clipping. The L2-regularization factor $\lambda$ is set to 0.001 empirically. The mini-batch size is 20 and the initial learning rate is 0.001. We also employ dropout (Srivastava et al., 2014) on the outputs of the BiDTree and BiLSTM layers with a dropout rate of 0.5. All weights $W$, $U$ and bias terms $b$ are trainable parameters. Early stopping (Caruana et al., 2000) is used based on performance on the validation sets; its patience is 5 epochs in our experiments. At the same time, initial embeddings are fine-tuned during the training process, which means the word embeddings are modified by back-propagating gradients. We implement BiDTreeCRF using the TensorFlow library (Abadi et al., 2016), and all computations are done on an NVIDIA Tesla K80 GPU. The overall procedure of BiDTreeCRF is summarized in Algorithm 2.

Algorithm 1: Decoding from the Labeling Sequence
Input: A labeling sequence τ = {t1, t2, ..., ti, ..., tN} and its corresponding sentence S = {w1, w2, ..., wi, ..., wN}.
Output: A list of aspect term triples.
 1: result ← ()
 2: temp ← ""
 3: start ← 0
 4: for i = 1; i ≤ N; i++ do
 5:   if ti = "O" and temp ≠ "" then
 6:     result ← result + (w_start:i, start, i)
 7:     temp ← ""
 8:     start ← 0
 9:   else
10:     if ti = "B-AP" then
11:       if temp ≠ "" then
12:         result ← result + (w_start:i, start, i)
13:       end if
14:       temp ← ti
15:       start ← i
16:     end if
17:   end if
18: end for
19: if temp ≠ "" then
20:   result ← result + (w_start:i, start, i)
21: end if
22: return result
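A compact Python rendering of Algorithm 1 (our sketch added for illustration; it mirrors the pseudocode but is not the authors' released code, and it uses 0-based indices, whereas the paper's example is 1-based):

```python
def decode_aspects(tags, words):
    """Turn a BIO label sequence into (aspect term, begin, end) triples,
    with begin inclusive and end exclusive, as in Algorithm 1."""
    result, start = [], None
    for i, tag in enumerate(tags):
        if start is not None and tag != "I-AP":           # close the currently open span
            result.append((" ".join(words[start:i]), start, i))
            start = None
        if tag == "B-AP":                                  # open a new span
            start = i
    if start is not None:                                  # a span running to the end of the sentence
        result.append((" ".join(words[start:]), start, len(words)))
    return result

# "B-AP B-AP I-AP O" over "w1 w2 w3 w4" -> [('w1', 0, 1), ('w2 w3', 1, 3)]
print(decode_aspects(["B-AP", "B-AP", "I-AP", "O"], ["w1", "w2", "w3", "w4"]))
```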
Algorithm 2: BiDTreeCRF Training Algorithm
Input: A set of review sentences from a particular domain, where each sentence is S = {w1, w2, ..., wi, ..., wN}.
Output: Learned BiDTreeCRF model.
 1: Construct a dependency tree for each sentence S using the Stanford Parser Package.
 2: Initialize all learnable parameters Θ
 3: repeat
 4:   Select a batch of instances S_b from the set of review sentences
 5:   for each sentence S ∈ S_b do
 6:     Use BiDTree (Eqs. 1-11) to generate h
 7:     Use BiLSTM (Eqs. 14-20) to generate g
 8:     Compute L(W, b) through Eqs. (21-22)
 9:   end for
10:   Use the backpropagation algorithm to update the parameters Θ by minimizing the objective (25) in batch update mode
11: until the stopping criterion is met

3 Experiments

In this section, we conduct experiments to evaluate the effectiveness of the proposed framework.

3.1 Datasets and Experiment Setup

We conduct experiments using four benchmark SemEval datasets. The detailed statistics of the datasets are summarized in Table 2. L-14 and R-14 are from SemEval 2014 (Pontiki et al., 2014; http://alt.qcri.org/semeval2014/task4/), R-15 is from SemEval 2015 (Pontiki et al., 2015; http://alt.qcri.org/semeval2015/task12/), and R-16 is from SemEval 2016 (Pontiki et al., 2016; http://alt.qcri.org/semeval2016/task5/). L-14 contains laptop reviews, and R-14, R-15, and R-16 all contain restaurant reviews. These datasets have been officially divided into three parts: a training set, a validation set, and a test set. These divisions are kept for a fair comparison. All these datasets contain annotated aspect terms, which are used to generate sequence labels in the experiments. We use the Stanford Parser Package (https://nlp.stanford.edu/software/lex-parser.html) to generate dependency trees. The evaluation metric is the F1 score, the same as for the baseline methods.
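As an illustration of the span-level evaluation (our sketch; exact matching of predicted and gold aspect-term spans is an assumption about the protocol, which the paper does not spell out), precision, recall, and F1 can be computed from the decoded spans as follows:

```python
def span_f1(pred_spans, gold_spans):
    """Exact-match F1 over sets of (begin, end) spans."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 0.0 if precision + recall == 0.0 else 2 * precision * recall / (precision + recall)

print(span_f1({(1, 3), (5, 6)}, {(1, 3), (7, 9)}))  # 0.5
```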
Table 2: Datasets from SemEval; #S means the number of sentences, #T means the number of aspect terms; L-14, R-14, R-15, and R-16 are short for Laptops 2014, Restaurants 2014, Restaurants 2015 and Restaurants 2016, respectively.

Datasets | Train | Val | Test | Total
L-14 #S | 2,945 | 100 | 800 | 3,845
R-14 #S | 2,941 | 100 | 800 | 3,841
R-15 #S | 1,315 | 48 | 685 | 2,048
R-16 #S | 2,000 | 48 | 676 | 2,724
L-14 #T | 2,304 | 54 | 654 | 3,012
R-14 #T | 3,595 | 98 | 1,134 | 4,827
R-15 #T | 1,654 | 57 | 845 | 2,556
R-16 #T | 2,507 | 66 | 859 | 3,432

In order to initialize word vectors, we train word embeddings with a bag-of-words based model (CBOW) (Mikolov et al., 2013) on Amazon reviews (http://jmcauley.ucsd.edu/data/amazon/) and Yelp reviews (https://www.yelp.com/academic_dataset), which are in-domain corpora for laptop and restaurant, respectively. The Amazon review dataset contains 142.8M reviews, and the Yelp review dataset contains 2.2M restaurant reviews. All these embeddings are trained with gensim (https://radimrehurek.com/gensim/models/word2vec.html), which contains an implementation of CBOW. The parameter min_count is 10 and iter is 200 in our experiments. We set the dimension of word vectors to 300 based on the conclusion drawn in (Wang et al., 2016b). The experimental results about dimension settings for the proposed model also show that 300 is a suitable choice, which provides a good trade-off between effectiveness and efficiency.

3.2 Baseline Methods and Results

To validate the performance of our proposed model on aspect term extraction, we compare it against a number of baselines:

• IHS_RD, DLIREC(U), EliXa(U), and NLANGP(U): The top system for L-14 in SemEval Challenge 2014 (Chernyshevich, 2014), the top system for R-14 in SemEval Challenge 2014 (Toh and Wang, 2014), the top system for R-15 in SemEval Challenge 2015 (Vicente et al., 2015), and the top system for R-16 in SemEval Challenge 2016 (Toh and Su, 2016), respectively. All of these systems have the same property: they are trained on a variety of lexicon and syntactic features, which is labor-intensive compared with the end-to-end fashion of neural networks. U means using additional resources without any constraint, such as lexicons or additional training data.

• WDEmb: It uses word embedding, linear context embedding and dependency path embedding to enhance CRF (Yin et al., 2016).

• RNCRF-O, RNCRF-F: They both extract tree-structured features using a recursive neural network as the CRF input. RNCRF-O is a model trained without opinion labels. RNCRF-F is trained not only using opinion labels but also some hand-crafted features (Wang et al., 2016b).

• DTBCSNN+F: A convolution stacked neural network built on dependency trees to capture syntactic features. Its results are produced by the inference layer (Ye et al., 2017).

• MIN: MIN is an LSTM-based deep multi-task learning framework, which jointly handles the extraction tasks of aspects and opinions via memory interactions (Li and Lam, 2017).

• CMLA, MTCA: CMLA is a multilayer attention network, which exploits relations between aspect terms and opinion terms without any parsers or linguistic resources for preprocessing (Wang et al., 2017b). MTCA is a multi-task attention model, which learns shared information among different tasks (Wang et al., 2017a).

• LSTM+CRF, BiLSTM+CRF: They are proposed by (Huang et al., 2015) and produce state-of-the-art (or close to) accuracy on POS, chunking and NER data sets. We borrow them for the ATE as baselines.

• BiLSTM+CNN: BiLSTM+CNN (we use this abbreviation for the sake of typesetting) is the Bidirectional LSTM-CNNs-CRF model from (Ma and Hovy, 2016). Compared with BiLSTM+CRF above, BiLSTM+CNN encodes char embeddings by a CNN and obtained state-of-the-art performance on the tasks of POS tagging and named entity recognition (NER). We borrow this method for the ATE as a baseline. The window size of the CNN is 3, the number of filters is 30, and the dimension of the char embedding is 100.
Table 3: Comparison on F1 scores. '-' indicates the results were not available in the original papers. We report the best results from the original papers, and keep the officially divided datasets and the evaluation program the same to make the comparison fair.

Models | L-14 | R-14 | R-15 | R-16
IHS_RD (Chernyshevich, 2014) | 74.55 | 79.62 | - | -
DLIREC(U) (Toh and Wang, 2014) | 73.78 | 84.01 | - | -
EliXa(U) (Vicente et al., 2015) | - | - | 70.05 | -
NLANGP(U) (Toh and Su, 2016) | - | - | 67.12 | 72.34
WDEmb (Yin et al., 2016) | 75.16 | 84.97 | 69.73 | -
RNCRF-O (Wang et al., 2016b) | 74.52 | 82.73 | - | -
RNCRF+F (Wang et al., 2016b) | 78.42 | 84.93 | - | -
DTBCSNN+F (Ye et al., 2017) | 75.66 | 83.97 | - | -
MIN (Li and Lam, 2017) | 77.58 | - | - | 73.44
CMLA (Wang et al., 2017b) | 77.80 | 85.29 | 70.73 | -
MTCA (Wang et al., 2017a) | 69.14 | - | 71.31 | 73.26
LSTM+CRF | 73.43 | 81.80 | 66.03 | 70.31
BiLSTM+CRF | 76.10 | 82.38 | 65.96 | 70.11
BiLSTM+CNN | 78.97 | 83.87 | 69.64 | 73.36
BiDTreeCRF#1 | 80.36 | 85.08 | 69.44 | 73.74
BiDTreeCRF#2 | 80.22 | 85.31 | 68.61 | 74.01
BiDTreeCRF#3 | 80.57 | 84.83 | 70.83 | 74.49

For our proposed model, there are three variants depending on whether the weight matrices of Eqs. (2-9) are shared or not (the code is publicly available at https://github.com/ArrowLuo/BiDTree). BiDTreeCRF#1 shares all weight matrices, namely $W_{*}^{\uparrow(i,o,f,u)} = W^{\uparrow(i,o,f,u)}$ and $U_{*}^{\uparrow(i,o,f,u)} = U^{\uparrow(i,o,f,u)}$, which means the mapping function $r^{\uparrow}(k)$ is not used. BiDTreeCRF#2 shares the weight matrices of Eqs. (2-3, 5) and Eqs. (6-7, 9) while excluding Eqs. (4, 8). BiDTreeCRF#3 keeps Eqs. (2-9) and does not share any weight matrices. The different types of weight sharing mean different ways of information transmission. BiDTreeCRF#1 shares weight matrices, which indicates that the dependent words of a head word are undifferentiated and the syntactic relations, e.g., nmod and case, are out of consideration. BiDTreeCRF#2 treats the forget gates differently, which indicates that each dependent word is controlled by its syntactic relation when transmitting its hidden state to its next node. BiDTreeCRF#3 further treats all gates differently. The elaborate information flow under the control of syntactic relations is proved to be efficient.

The comparison results are given in Table 3. In this table, the F1 score of the proposed model is the average of 20 runs with the same hyper-parameters that have been described in Section 2.6 and are used throughout our experiments. We report the results of L-14 initialized with the Amazon Embedding. For the other datasets, we initialize with the Yelp Embedding since they are all restaurant reviews. We also show the embedding comparison below.

Compared to the best systems in the 2014, 2015 and 2016 SemEval ABSA challenges, BiDTreeCRF#3 achieves 6.02%, 0.82%, 0.78%, and 2.15% F1 score gains over IHS_RD, DLIREC(U), EliXa(U) and NLANGP(U) on L-14, R-14, R-15, and R-16, respectively. Specifically, BiDTreeCRF#3 outperforms WDEmb by 5.41% on L-14 and 1.10% on R-15, and outperforms RNCRF-O by 6.05% and 2.10% on L-14 and R-14, respectively. Even compared with RNCRF+F and DTBCSNN+F, which exploit additional hand-crafted features, BiDTreeCRF#3 on L-14 and BiDTreeCRF#2 on R-14, without other linguistic features (e.g., POS), still achieve 2.15%, 4.91% and 0.38%, 1.34% improvements, respectively. MIN is trained via memory interactions, CMLA and MTCA are designed as multi-task models, and all of these three methods use more labels and share information among different tasks. Compared with them, BiDTreeCRF#3 still gives the best score for L-14 and R-16 and a competitive score for R-15,
and BiDTreeCRF#2 achieves the state-of-the-art score for R-14, although our model is designed as a single-task model. Moreover, BiDTreeCRF#3 outperforms LSTM+CRF and BiLSTM+CRF on all datasets by 7.14%, 3.03%, 4.80%, and 4.18%, and 4.47%, 2.45%, 4.87%, and 4.38%, respectively, and these improvements are significant (p < 0.05). Considering the fact that BiLSTM+CRF can be seen as BiDTreeCRF#3 without the BiDTree layer, all the results support that BiDTree can extract syntactic information effectively.

As we can see, the different variants of the proposed model have different performances on the four datasets. In particular, BiDTreeCRF#3 is more powerful than the other variants on L-14, R-15, and R-16, and BiDTreeCRF#2 is more effective on R-14. We believe the fact that R-15 is a small dataset with some "NULL" aspect terms is the reason that the performances of the baselines have only a small gap between them; it shows that it is a hard dataset on which to improve the score. Thus, it is an inspiring result even though BiDTreeCRF#3 is a little worse than MTCA without other auxiliary information (e.g., opinion terms). Besides, BiDTreeCRF#3 outperforms BiLSTM+CNN even without char embeddings. Note that we did not tune the hyperparameters of BiDTreeCRF for practical purposes because this tuning process is time-consuming.

3.3 Ablation Experiments

To test the effect of each component of BiDTreeCRF, the following ablation experiments on different layers of BiDTreeCRF#3 are performed: (1) DTree-up: the bottom-up propagation of BiDTree is connected to BiLSTM and the CRF layer. (2) DTree-down: the top-down propagation of BiDTree is connected to BiLSTM and the CRF layer. (3) BiDTree+CRF: the BiLSTM layer is not used compared to BiDTreeCRF. The initial word embeddings are the same as before. The comparison results are shown in Table 4. Comparing BiDTreeCRF with DTree-up and DTree-down, it is obvious that BiDTree is more competitive than any single-directional dependency network, which is the original motivation of the proposed BiDTreeCRF. The fact that BiDTreeCRF outperforms BiDTree+CRF indicates that the BiLSTM layer is effective in extracting sequential information on top of BiDTree. On the other hand, the fact that BiDTreeCRF outperforms BiLSTM+CRF shows that the dependency syntactic information extracted by BiDTree is extremely useful in the aspect term extraction task. All the above improvements are significant (p < 0.05) under the statistical t-test.

Table 4: F1 scores of ablation experiments on BiDTreeCRF.

Models | L-14 | R-14 | R-15 | R-16
BiLSTM+CRF | 76.10 | 82.38 | 65.96 | 70.11
BiDTree+CRF | 71.29 | 81.09 | 64.09 | 67.87
DTree-up | 78.96 | 84.47 | 68.69 | 72.42
DTree-down | 78.46 | 84.41 | 68.75 | 72.91
BiDTreeCRF#3 | 80.57 | 84.83 | 70.83 | 74.49

3.4 Word Embeddings & Syntactic Relation

Since word embeddings are an important contributing factor for learning with less data, we also conduct comparative experiments on word embeddings. Additionally, the syntactic relation (the second terms of Eqs. (2-5)) is also adopted as a comparison criterion. The experimental setup, e.g., mini-batch size and learning rate, is the same as the previous setup, and nothing is changed except the word embeddings and whether syntactic relation knowledge is integrated.

Figure 6: Amazon Embedding vs. Yelp Embedding (E-Amazon vs. E-Yelp) with syntactic relation. [F1 Score (%) of BiDTreeCRF#1/#2/#3 on L-14, R-14, R-15, R-16.]

Figure 7: Amazon Embedding vs. Yelp Embedding (E-Amazon vs. E-Yelp) without syntactic relation. [F1 Score (%) of BiDTreeCRF#1/#2/#3 on L-14, R-14, R-15, R-16.]

Figure 8: With syntactic relation vs. without syntactic relation (With-Rel vs. No-Rel) with Amazon Embedding. [F1 Score (%) of BiDTreeCRF#1/#2/#3 on L-14, R-14, R-15, R-16.]

Figure 9: With syntactic relation vs. without syntactic relation (With-Rel vs. No-Rel) with Yelp Embedding. [F1 Score (%) of BiDTreeCRF#1/#2/#3 on L-14, R-14, R-15, R-16.]

Figure 6 and Figure 7 illustrate a comparison between Amazon Embedding and Yelp Embedding. Each figure involves the three variants of BiDTreeCRF on the four datasets. All of them show that Amazon Embedding is always superior to Yelp Embedding for L-14, and Yelp Embedding has an absolute advantage over Amazon Embedding for R-14, R-15, and R-16. The fact that Yelp Embedding is in-domain for restaurant and Amazon Embedding is in-domain for laptop indicates that in-domain embedding is more effective than out-of-domain embedding.

Figure 8 and Figure 9 show a comparison of different syntactic relation conditions. Figure 8 is a comparison using Amazon Embedding, and Figure 9 is a comparison using Yelp Embedding. The fact that the model with syntactic relation wins 7 out of 12 comparisons in Figure 8 and 9 out of 12 in Figure 9 against the model without syntactic relation indicates that the syntactic relation information is useful for performance improvement.

3.5 Sensitivity Test

We conduct a sensitivity test on the dimension d of the word embeddings of BiDTreeCRF#3. Different dimensions (ranging from 50 to 450, with an increment of 50) are involved. The sensitivity plots on the four datasets are given in Figure 10, using Amazon Embedding and Yelp Embedding, respectively. It is worth mentioning that the Amazon Embedding here is only trained from reviews of electronics products considering the time cost. Although the score is a little lower than with the embedding trained from the whole Amazon review corpus, the conclusion still holds. The figure shows that 300 is a suitable dimension size for the proposed model. It also proves the stability and robustness of our model.

Figure 10: Sensitivity studies on word embeddings. Top: F1 score of BiDTreeCRF#3 with different word vector dimensions d using the Electronics Amazon Embedding. Bottom: F1 score of BiDTreeCRF#3 with different word vector dimensions d using the Yelp Embedding.

3.6 Case Study

Table 5 shows some examples from the L-14 dataset to demonstrate the effectiveness of BiDTreeCRF. The first column contains the reviews, and the corresponding aspect terms are marked in bold. The second column describes some dependency relations related to the aspect terms. The third column and the last column are the extraction results of BiDTreeCRF and BiLSTM, respectively.
On the whole, the proposed BiDTreeCRF can extract aspect terms better than BiLSTM, with fewer omissions and errors.

Table 5: Extraction comparison between BiDTreeCRF and BiLSTM. The ground-truth aspect terms are marked in bold in the original.

Text | Dependency Relationships | BiDTreeCRF | BiLSTM
Other than not being a fan of click pads (industry standard these days) and the lousy internal speakers, it's hard for me to find things about this notebook I don't like, especially considering the $350 price tag. | click <--compound-- pads; internal <--amod-- speakers; price <--compound-- tag | click pads, internal speakers, price tag | internal speakers, price tag
Keyboard responds well to presses. | Keyboard <--nsubj-- responds | Keyboard | Keyboard responds
I am please with the products ease of use; out of the box ready; appearance and functionality. | ease --nmod--> use --case--> of; appearance --cc--> and; appearance --conj--> functionality | use, appearance, functionality | use, appearance, functionality
With the softwares supporting the use of other OS makes it much better. | use --nmod--> OS --case--> of; the <--det-- softwares; softwares <--nsubj-- supporting | softwares, OS | softwares, use, OS
I tried several monitors and several HDMI cables and this was the case each time. | monitors --cc--> and; monitors --conj--> cables; cables --compound--> HDMI | monitors, HDMI cables | cables

In the first example, BiLSTM misses the aspect term "click pads", but its inner relation (click <--compound-- pads) is similar to price <--compound-- tag, which BiDTreeCRF can treat as a significant feature; thus BiDTreeCRF can extract it accurately. Likewise, through the relation Keyboard <--nsubj-- responds, BiDTreeCRF can avoid taking "responds" as an aspect term. For the same word "use" in the third example and the fourth example, one is a real aspect term and the other is not; the reason is reflected in these two relations: ease --nmod--> use --case--> of and use --nmod--> OS --case--> of. In the final example, "monitors" and "cables" are in an equivalence relation because of monitors --conj--> cables, and thus they are extracted simultaneously by BiDTreeCRF instead of only one of them being extracted by BiLSTM. All of the above analysis gives supporting evidence that our proposed BiDTreeCRF, constructed on the dependency tree, is useful and can take advantage of the relations between words to improve the ATE performance.

4 Related Work

As an important and practically very useful topic, sentiment analysis has been extensively studied in the literature (Hu and Liu, 2004; Cambria, 2016), especially the ATE. There are several main approaches to solving the ATE problem. Hu and Liu (2004) extracted aspect terms that are frequently occurring nouns and noun phrases using frequent pattern mining. Qiu et al. (2011) and Liu et al. (2015b) proposed to use a rule-based approach exploiting either hand-crafted or automatically generated rules about some syntactic relations between aspect terms (also called targets) and sentiment words, based on the idea that an opinion or sentiment must have a target (Liu, 2012). Chen et al. (2014) adopted topic modeling to address the ATE, which employs some probabilistic graphical models based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its variants. All of the above methods are based on unsupervised learning. For supervised learning, ATE is mainly regarded as a sequential labeling problem, and solved by hidden Markov models (Jin et al., 2009) or CRF.
However, traditional supervised methods need to artificially design lexical and syntactic features to improve performance. Neural networks are an effective approach to this problem.

Recent work showed that neural networks can indeed achieve competitive performance on the ATE. Irsoy and Cardie (2013) applied a deep Elman-type Recurrent Neural Network (RNN) to extract opinion expressions and showed that a deep RNN outperforms CRF, semi-CRF and a shallow RNN. Liu et al. (2015a) further experimented with more advanced RNN variants with fine-tuned embeddings. Moreover, they pointed out that employing other linguistic features (e.g., POS) can get better results. Different from these works, Poria et al. (2016) used a 7-layer deep convolutional neural network (CNN) to tag each word with an aspect or non-aspect label in opinionated sentences. Some linguistic patterns were also used to improve labeling accuracy. Attention mechanisms and memory interactions are also effective methods for ATE. Li and Lam (2017) adopted two LSTMs for jointly handling the extraction tasks of aspects and opinions via memory interactions. These LSTMs are equipped with extended memories and neural memory operations. Wang et al. (2017b) proposed a multi-layer attention network to deal with the aspect and opinion terms co-extraction task, which exploits the indirect relations between terms for more precise information extraction. He et al. (2017) presented an unsupervised neural attention model to discover coherent aspects. Its key idea is to exploit the distribution of word co-occurrences through the use of neural word embeddings and to use an attention mechanism to de-emphasize irrelevant words during training. However, RNNs and CNNs based on the sequence structure of a sentence cannot effectively and directly capture the tree-based syntactic information, which better reflects the syntactic properties of natural language and hence is very important to the ATE.

Some tree-based neural networks have been proposed by researchers. For example, Yin et al. (2016) designed a word embedding method that considers not only the linear context but also the dependency context information. The resulting embeddings are used in CRF for extracting aspect terms. This model proves that syntactic information among words yields better performance than other representative features for ATE. However, it involves a two-stage process, which is not an end-to-end system trained directly from the dependency path information to the final ATE tags. On the contrary, our proposed BiDTreeCRF is an end-to-end deep learning model and it does not need any hand-crafted features. Wang et al. (2016b) integrated a dependency tree and CRF into a unified framework for explicit aspect and opinion terms co-extraction. However, a single directional propagation on the dependency tree is not enough to represent complete tree-structured syntactic information. Instead of the full connection on each layer of the dependency tree, we use a bidirectional propagation mechanism to extract information, which is proved to be effective in our experiments. Ye et al. (2017) proposed a tree-based convolution to capture the syntactic features of sentences, which makes it hard to keep sequential information. We fuse the tree-structured and sequential information rather than only using a single representation to address the ATE efficiently. This paper is also related to several other models which are constructed on constituency trees and used to accomplish other NLP tasks, e.g., translation (Chen et al., 2017), relation extraction (Miwa and Bansal, 2016), relation classification (Liu et al., 2015c) and syntactic language modeling (Tai et al., 2015; Teng and Zhang, 2016; Zhang et al., 2016). However, we have different models and also different applications.

5 Conclusion

In this paper, an end-to-end framework, BiDTreeCRF, was introduced. The framework can efficiently extract dependency syntactic information through bottom-up and top-down propagation in dependency trees. By combining the dependency syntactic information with the advantages of BiLSTM and CRF, we achieve state-of-the-art performance on four benchmark datasets without using any other linguistic features. Three variants of the proposed model have been evaluated and shown to be more effective than the existing state-of-the-art baseline methods. The distinction between these variants depends on whether they share weights during training. Our results suggest that the dependency syntactic information may also be used in aspect term and aspect opinion co-extraction, and in other sequence labeling tasks. Additional linguistic features (e.g., POS) and char embeddings may further boost the performance of the proposed model.
References

Martín Abadi, Ashish Agarwal, Paul Barham, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Pedro Paulo Balage Filho and Thiago Alexandre Salgueiro Pardo. 2014. NILC USP: Aspect extraction using semantic labels. In SemEval@COLING, pages 433-436.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. JMLR, 3(1):993-1022.

Samuel Brody and Noemie Elhadad. 2010. An unsupervised aspect-sentiment model for online reviews. In NAACL-HLT, pages 804-812. Association for Computational Linguistics.

Erik Cambria. 2016. Affective computing and sentiment analysis. IEEE Intelligent Systems, 31(2):102-107.

Rich Caruana, Steve Lawrence, and C. Lee Giles. 2000. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402-408.

Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017. Improved neural machine translation with a syntax-aware encoder and decoder. In ACL, pages 1936-1945.

Zhiyuan Chen and Bing Liu. 2014. Topic modeling using topics from many domains, lifelong learning and big data. In ICML, pages 703-711.

Zhiyuan Chen, Arjun Mukherjee, and Bing Liu. 2014. Aspect extraction with automated prior knowledge learning. In ACL, pages 347-358.

Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. 2013. Exploiting domain knowledge in aspect extraction. In EMNLP, pages 1655-1667. Association for Computational Linguistics.

Maryna Chernyshevich. 2014. IHS R&D Belarus: Cross-domain extraction of product features using CRF. In SemEval@COLING, pages 309-313. Association for Computer Linguistics.

Yejin Choi and Claire Cardie. 2010. Hierarchical sequential learning for extracting opinions and their attributes. In ACL, pages 269-274. Association for Computational Linguistics.

Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with lstm. Neural Computation, 12(10):2451-2471.

Athanasios Giannakopoulos, Claudiu Musat, Andreea Hossmann, and Michael Baeriswyl. 2017. Unsupervised aspect term extraction with b-lstm & crf using automatically labelled datasets. In WASSA, pages 180-188.

Christoph Goller and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of IEEE International Conference on Neural Networks, pages 347-352.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602-610.

Hussam Hamdan, Patrice Bellot, and Frédéric Béchet. 2015. Lsislif: CRF and logistic regression for opinion target extraction and sentiment polarity analysis. In SemEval@NAACL-HLT, pages 753-758. Association for Computer Linguistics.

Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2017. An unsupervised neural attention model for aspect extraction. In ACL, pages 388-397.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In KDD, pages 168-177. ACM.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.

Ozan Irsoy and Claire Cardie. 2013. Bidirectional recursive neural networks for token-level labeling with structure. arXiv preprint arXiv:1312.0493.

Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In EMNLP, pages 720-728. Association for Computer Linguistics.

Niklas Jakob and Iryna Gurevych. 2010. Extracting opinion targets in a single- and cross-domain setting with conditional random fields. In EMNLP, pages 1035-1045. Association for Computational Linguistics.

Wei Jin, Hung Hay Ho, and Rohini K Srihari. 2009. A novel lexicalized hmm-based learning framework for web opinion mining. In ICML, pages 465-472.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282-289.

Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu, Ying-Ju Xia, Shu Zhang, and Hao Yu. 2010. Structure-aware review mining and summarization. In COLING, pages 653-661. Association for Computational Linguistics.

Xin Li and Wai Lam. 2017. Deep multi-task learning for aspect term extraction with memory interaction. In EMNLP, pages 2876-2882. Association for Computational Linguistics.

Chenghua Lin and Yulan He. 2009. Joint sentiment/topic model for sentiment analysis. In CIKM, pages 375-384. ACM.

Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1-167.

Pengfei Liu, Shafiq Joty, and Helen Meng. 2015a. Fine-grained opinion mining with recurrent neural networks and word embeddings. In EMNLP, pages 1433-1443. Association for Computational Linguistics.

Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2013. A logic programming approach to aspect extraction in opinion mining. In Proceedings of 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, pages 276-283. IEEE.

Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2015b. Automated rule selection for aspect extraction in opinion mining. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 1291-1297. AAAI Press.

Qian Liu, Bing Liu, Yuanlin Zhang, Doo Soon Kim, and Zhiqiang Gao. 2016. Improving opinion aspect extraction using semantic similarity and aspect associations. In AAAI, pages 2986-2992.

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015c. A dependency-based neural network for relation classification. arXiv preprint arXiv:1507.04646.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In ACL, pages 1064-1074. Association for Computer Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Margaret Mitchell, Jacqui Aguilar, Theresa Wilson, and Benjamin Van Durme. 2013. Open domain targeted sentiment. In EMNLP, pages 1643-1654.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using lstms on sequences and tree structures. In ACL, pages 1105-1116. Association for Computer Linguistics.

Samaneh Moghaddam and Martin Ester. 2011. ILDA: interdependent lda model for learning latent aspects and their ratings from online product reviews. In SIGIR, pages 665-674. ACM.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. Semeval-2014 task 4: Aspect based sentiment analysis. In SemEval@COLING, pages 27-35. Association for Computer Linguistics.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. Semeval-2015 task 12: Aspect based sentiment analysis. In SemEval@NAACL-HLT, pages 486-495. Association for Computer Linguistics.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, et al. 2016. Semeval-2016 task 5: Aspect based sentiment analysis. In SemEval@NAACL-HLT, pages 19-30. Association for Computer Linguistics.

Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2016. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems, 108:42-49.

Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics, 37(1):9-27.

Lei Shu, Hu Xu, and Bing Liu. 2017. Lifelong learning crf for supervised aspect extraction. In ACL, pages 148-154.

Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In ICML, pages 129-136.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929-1958.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL-AFNLP, pages 1556-1566. Association for Computer Linguistics.

Zhiyang Teng and Yue Zhang. 2016. Bidirectional tree-structured lstm with head lexicalization. arXiv preprint arXiv:1611.06788.

Zhiqiang Toh and Jian Su. 2015. NLANGP: supervised machine learning system for aspect category classification and opinion target extraction. In SemEval@NAACL-HLT, pages 496-501. Association for Computer Linguistics.