ASAP: A Chinese Review Dataset Towards Aspect Category Sentiment Analysis and Rating Prediction
Jiahao Bu, Lei Ren, Shuang Zheng, Yang Yang, Jingang Wang, Fuzheng Zhang, Wei Wu
NLP Center, Meituan
{bujiahao, renlei04, zhengshuang04, yangyang113, wangjingang02, zhangfuzheng, wuwei30}@meituan.com

arXiv:2103.06605v1 [cs.CL] 11 Mar 2021

Abstract

Sentiment analysis has attracted increasing attention in e-commerce. The sentiment polarities underlying user reviews are of great value for business intelligence. Aspect category sentiment analysis (ACSA) and review rating prediction (RP) are two essential tasks for detecting fine-to-coarse sentiment polarities. ACSA and RP are highly correlated and usually employed jointly in real-world e-commerce scenarios. However, most public datasets are constructed for ACSA and RP separately, which may limit further exploitation of both tasks. To address the problem and advance related research, we present ASAP, a large-scale Chinese restaurant review dataset including 46,730 genuine reviews from a leading online-to-offline (O2O) e-commerce platform in China. Besides a 5-star scale rating, each review is manually annotated according to its sentiment polarities towards 18 pre-defined aspect categories. We hope the release of the dataset can shed some light on the field of sentiment analysis. Moreover, we propose an intuitive yet effective joint model for ACSA and RP. Experimental results demonstrate that the joint model outperforms state-of-the-art baselines on both tasks.

1 Introduction

With the rapid development of e-commerce, the massive user reviews available on e-commerce platforms are becoming valuable resources for both customers and merchants. Aspect-based sentiment analysis (ABSA) on user reviews is a fundamental and challenging task which attracts interest from both academia and industry (Hu and Liu, 2004; Ganu et al., 2009; Jo and Oh, 2011; Kiritchenko et al., 2014). According to whether the aspect terms are explicitly mentioned in the text, ABSA can be further classified into aspect term sentiment analysis (ATSA) and aspect category sentiment analysis (ACSA); we focus on the latter, which is more widely used in industry. Specifically, given the review "Although the fish is delicious, the waiter is horrible!", the ACSA task aims to infer that the sentiment polarity over the aspect category food is positive while the opinion over the aspect category service is negative.

The user interfaces of e-commerce platforms are more intelligent than ever before with the help of ACSA techniques. For example, Figure 1 presents the detail page of a coffee shop on a popular e-commerce platform in China. The upper aspect-based sentiment text-boxes display the aspect categories (e.g., food, sanitation) mentioned frequently in user reviews and the aggregated sentiment polarities on these aspect categories (the orange ones represent positive and the blue ones represent negative). Customers can filter the corresponding reviews effectively by clicking the aspect-based sentiment text-boxes they care about (e.g., the orange filled text-box "卫生条件好" (good sanitation)). Our user survey based on 7,824 valid questionnaires demonstrates that 80.08% of customers agree that the aspect-based sentiment text-boxes are helpful to their decision-making on restaurant choices. Besides, merchants can keep track of their cuisine and service quality with the help of the aspect-based sentiment text-boxes. Most e-commerce platforms, such as Taobao [1], Dianping [2], and Koubei [3], deploy similar user interfaces to improve the user experience.

Users also publish their overall 5-star scale ratings together with their reviews. Figure 1 displays a sample 5-star rating of the coffee shop. In comparison to fine-grained aspect sentiment, the overall review rating is usually a coarse-grained synthesis of the opinions on multiple aspects. Rating prediction (RP), which aims to predict the "seeing stars" of reviews (Jin et al., 2016; Li et al., 2018; Wu et al., 2019a), also has wide applications.

[1] https://www.taobao.com/
[2] https://www.dianping.com/
[3] https://www.koubei.com/
For example, to ensure that the aspect-based sentiment text-boxes are accurate, unreliable reviews should be removed before the ACSA algorithms are applied. Given a piece of user review, we can predict a rating for it based on the overall sentiment polarity underlying the text. We assume that the predicted rating of a review should be consistent with its ground-truth rating as long as the review is reliable. If the predicted rating and the user rating of a review disagree with each other explicitly, the reliability of the review is doubtful. Figure 2 demonstrates an example review of low reliability. In summary, RP can help merchants detect unreliable reviews.

Figure 1: The user interface of a coffee shop on a popular e-commerce App. The top aspect-based sentiment text-boxes display aspect categories and sentiment polarities. The orange text-boxes are positive, while the blue ones are negative. The reviews mentioning the clicked aspect category (e.g., good sanitation), together with their ratings, are shown below. The text spans mentioning the aspect categories are also highlighted.

Figure 2: A content-rating disagreement case. The review holds a 2-star rating while all the mentioned aspects are super positive.

Therefore, both ACSA and RP are of great importance for business intelligence in e-commerce, and they are highly correlated and complementary. ACSA focuses on predicting a review's underlying sentiment polarities on different aspect categories, while RP focuses on predicting the user's overall feelings from the review content. We reckon these two tasks are highly correlated and that better performance can be achieved by considering them jointly.

As far as we know, current public datasets are constructed for ACSA and RP separately, which limits further joint exploration of ACSA and RP. To address the problem and advance the related research, this paper presents a large-scale Chinese restaurant review dataset for Aspect category Sentiment Analysis and rating Prediction, denoted as ASAP for short. All the reviews in ASAP are collected from the aforementioned e-commerce platform. There are 46,730 restaurant reviews attached with 5-star scale ratings. Each review is manually annotated according to its sentiment polarities towards 18 fine-grained aspect categories. To the best of our knowledge, ASAP is the largest Chinese review dataset towards both the ACSA and RP tasks.

We implement several state-of-the-art (SOTA) baselines for ACSA and RP and evaluate their performance on ASAP. To make a fair comparison, we also perform ACSA experiments on the widely used SemEval-2014 restaurant review dataset (Pontiki et al., 2014). Since BERT (Devlin et al., 2018) has achieved great success in several natural language understanding tasks including sentiment analysis (Xu et al., 2019; Sun et al., 2019; Jiang et al., 2019), we propose a joint model that employs the fine-to-coarse semantic capability of BERT. Our joint model outperforms the competing baselines on both tasks.

Our main contributions can be summarized as follows. (1) We present ASAP, a large-scale Chinese review dataset towards aspect category sentiment analysis and rating prediction, including as many as 46,730 real-world restaurant reviews annotated with 18 pre-defined aspect categories. Our dataset has been uploaded to the softconf system and will be released at https://github.com/XXXX [4]. (2) We explore the performance of widely used models for ACSA and RP on ASAP. (3) We propose a joint learning model for the ACSA and RP tasks. Our model achieves the best results on both the ASAP and SemEval RESTAURANT datasets.

[4] Anonymized for double-blind review.

2 Related Work and Datasets

Aspect Category Sentiment Analysis. ACSA (Zhou et al., 2015; Movahedi et al., 2019; Ruder et al., 2016; Hu et al., 2018) aims to predict the sentiment polarities over all aspect categories mentioned in a text.
The series of SemEval datasets, consisting of user reviews from e-commerce websites, has been widely used and has pushed forward related research (Wang et al., 2016; Ma et al., 2017; Xu et al., 2019; Sun et al., 2019; Jiang et al., 2019). The SemEval-2014 task-4 dataset (SE-ABSA14) (Pontiki et al., 2014) is composed of laptop and restaurant reviews. The restaurant subset includes 5 aspect categories (i.e., Food, Service, Price, Ambience and Anecdotes/Miscellaneous) and 4 polarity labels (i.e., Positive, Negative, Conflict and Neutral); the laptop subset is not suitable for ACSA. The SemEval-2015 task-12 dataset (SE-ABSA15) (Pontiki et al., 2015) builds upon SE-ABSA14 and defines each aspect category as a combination of an entity type E and an attribute type A (e.g., Food#Style Options). The SemEval-2016 task-5 dataset (SE-ABSA16) (Pontiki et al., 2016) extends SE-ABSA15 to new domains and to languages other than English. MAMS (Jiang et al., 2019) tailors SE-ABSA14 to make it more challenging: each sentence contains at least two aspects with different sentiment polarities.

Compared with the prosperity of English resources, high-quality Chinese datasets are not rich enough. "ChnSentiCorp" (Tan and Zhang, 2008), "IT168TEST" (Zagibalov and Carroll, 2008), "Weibo" [5], and "CTB" (Li et al., 2014) are 4 popular Chinese datasets for general sentiment analysis. However, aspect category information is not annotated in these datasets. Zhao et al. (2014) present two Chinese ABSA datasets for consumer electronics (mobile phones and cameras); nevertheless, the two datasets only contain 400 documents (about 4,000 sentences), in which each sentence mentions at most one aspect category. Peng et al. (2017) summarize the available Chinese ABSA datasets; most of them are constructed through rule-based or machine-learning-based approaches, which inevitably introduce additional noise into the datasets. ASAP excels above these Chinese datasets in both quantity and quality.

Rating Prediction. Rating prediction (RP) aims to predict the "seeing stars" of reviews, which represent their overall ratings. In comparison to fine-grained aspect sentiment, the overall review rating is usually a coarse-grained synthesis of the opinions on multiple aspects. Ganu et al. (2009), Li et al. (2011) and Chen et al. (2018) formulate this task as a text classification or regression problem. Considering the importance of opinions on multiple aspects in reviews, recent years have seen numerous works (Jin et al., 2016; Cheng et al., 2018; Li et al., 2018; Wu et al., 2019a) utilizing aspect information to improve rating prediction performance. This trend also inspires the motivation of ASAP.

Most RP datasets are crawled from real-world review websites and created specifically for RP. The Amazon Product Review dataset (McAuley and Leskovec, 2013), containing product reviews and metadata from Amazon, has been widely used for RP (Cheng et al., 2018; McAuley and Leskovec, 2013). Another popular English dataset comes from the Yelp Dataset Challenge 2017 [6], which includes reviews of local businesses in 12 metropolitan areas across 4 countries. Openrice [7] is a Chinese RP dataset composed of 168,142 reviews. Neither the English nor the Chinese datasets annotate fine-grained aspect category sentiment polarities. We hope the release of ASAP can bridge the gap between ACSA and RP.

[5] http://tcci.ccf.org.cn/conference/2014/pages/page04_dg.html
[6] http://www.yelp.com/dataset_challenge/
[7] https://www.openrice.com

3 Dataset Collection and Analysis

3.1 Data Construction & Curation

We collect reviews from one of the most popular O2O e-commerce platforms in China, which allows users to publish coarse-grained star ratings and write fine-grained reviews for restaurants (or places of interest) they have visited. In their reviews, users comment on multiple aspects either explicitly or implicitly, including ambience, price, food, service, and so on.

First, we randomly retrieve a large volume of user reviews from popular restaurants holding more than 50 user reviews each. Then, 4 pre-processing steps are performed to ensure the ethics, quality, and reliability of the reviews. (1) User information (e.g., user ids, usernames, avatars, and post times) is removed due to privacy considerations. (2) Short reviews with fewer than 50 Chinese characters, as well as lengthy reviews with more than 1,000 Chinese characters, are filtered out. (3) If the ratio of non-Chinese characters within a review is over 70%, the review is discarded. (4) To detect low-quality reviews (e.g., advertising texts), we build a BERT-based classifier, which achieves an accuracy of 97% on a held-out test set; the reviews detected as low-quality by the classifier are discarded as well.
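Steps (2) and (3) are simple deterministic filters. The sketch below shows one way they could be implemented; the regular expression, thresholds, and function names are our own illustration, not the authors' released code.

```python
import re

# CJK Unified Ideographs; a rough proxy for "Chinese characters".
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")

def keep_review(text: str) -> bool:
    """Return True if a review survives the length and language-ratio filters."""
    n_zh = len(CHINESE_CHAR.findall(text))
    # Step (2): drop reviews shorter than 50 or longer than 1,000 Chinese characters.
    if n_zh < 50 or n_zh > 1000:
        return False
    # Step (3): drop reviews whose non-Chinese character ratio exceeds 70%.
    if (len(text) - n_zh) / len(text) > 0.70:
        return False
    return True
```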
3.2 Aspect Categories

Since the reviews already hold users' star ratings, this section mainly introduces our annotation details for ACSA. In the SE-ABSA14 restaurant dataset (denoted as RESTAURANT for simplicity), there are 5 coarse-grained aspect categories: food, service, price, ambience and miscellaneous. After an in-depth analysis of the collected reviews, we find that the aspect categories mentioned by users are rather diverse and fine-grained. Take the text "...The restaurant holds a high-end decoration but is quite noisy since a wedding ceremony was being held in the main hall... (...环境看起来很高大上的样子，但是因为主厅在举办婚礼非常混乱，感觉特别吵...)" in Table 2 for example: the reviewer actually expresses opposite sentiment polarities on two fine-grained aspect categories related to ambience. The restaurant's decoration is very high-end (Positive), while it is very noisy due to an ongoing ceremony (Negative). Therefore, we summarize the frequently mentioned aspects and refine the 5 coarse-grained categories into 18 fine-grained categories. We replace miscellaneous with location, since we find that users usually review a restaurant's location (e.g., whether the restaurant is easy to reach by public transportation). We denote each aspect category in the form "Coarse-grained Category#Fine-grained Category", such as "Food#Taste" and "Ambience#Decoration". The full list of aspect categories and their definitions is given in Table 1.

Table 1: The full list of 18 aspect categories and definitions.

| Aspect category | Definition |
| Food#Taste (口味) | Food taste |
| Food#Appearance (外观) | Food appearance |
| Food#Portion (分量) | Food portion |
| Food#Recommend (推荐程度) | Whether the food is worth being recommended |
| Price#Level (价格水平) | Price level |
| Price#Cost effective (性价比) | Whether the restaurant is cost-effective |
| Price#Discount (折扣力度) | Discount strength |
| Location#Downtown (位于商圈附近) | Whether the restaurant is located near downtown |
| Location#Easy to find (是否容易寻找) | Whether the restaurant is easy to find |
| Location#Transportation (交通方便) | Convenient public transportation to the restaurant |
| Service#Queue (排队时间) | Whether the queue time is acceptable |
| Service#Hospitality (服务人员态度) | Waiters'/waitresses' attitude and hospitality |
| Service#Parking (停车方便) | Parking convenience |
| Service#Timely (点菜/上菜速度) | Order/serving time |
| Ambience#Decoration (装修) | Decoration level |
| Ambience#Noise (嘈杂情况) | Whether the restaurant is noisy |
| Ambience#Space (就餐空间) | Dining space and seat size |
| Ambience#Sanitary (卫生情况) | Sanitary condition |

3.3 Annotation Guidelines & Process

Bearing in mind the 18 pre-defined aspects, assessors are asked to annotate the sentiment polarities towards the mentioned aspect categories of each review. Given a review, when an aspect category is mentioned within the review either explicitly or implicitly, the sentiment polarity over that aspect category is labeled as 1 (Positive), 0 (Neutral) or -1 (Negative), as shown in Table 2.

We hire 20 vendor assessors, 2 project managers, and 1 expert reviewer to perform the annotations. Each assessor attends a training session to ensure an intact understanding of the annotation guidelines. Three rounds of annotation are conducted sequentially. First, we randomly split the whole dataset into 10 groups, and every group is assigned to 2 assessors, who annotate independently. Second, each group is split into 2 subsets according to the annotation results, denoted as Sub-Agree and Sub-Disagree: Sub-Agree comprises the examples on which the two assessors agree, and Sub-Disagree comprises the examples on which they disagree. Sub-Agree is reviewed by assessors from other groups, and the controversial examples found during this review are considered difficult cases. Sub-Disagree is reviewed by the 2 project managers independently, who then discuss to reach an agreed annotation; the examples that cannot be resolved after discussion are also considered difficult cases. Third, for each group, the difficult examples from the two subsets are delivered to the expert reviewer for a final decision. More details on difficult cases and the annotation guidelines are given in Appendix A.2.

Finally, the ASAP corpus consists of 46,730 real-world user reviews, which we randomly split into a training set (36,850 reviews), a validation set (4,940) and a test set (4,940). Table 2 presents an example review of ASAP and the corresponding annotations on the 18 aspect categories.

Table 2: A review example in ASAP, with the overall star rating and aspect category sentiment polarity annotations ("-" marks an aspect category that is not mentioned).

Review (translated): "With convenient traffic, the restaurant holds a high-end decoration, but is quite noisy because a wedding ceremony was being held in the main hall. Impressed by its delicate decoration and grand appearance though, we had to wait for a while at the weekend. However, considering its high price level, the taste is unexceptional. We ordered the Kung Pao Prawn; the taste was acceptable and the serving size is enough, but the shrimp is not fresh. In terms of service, you could not expect too much due to the massive number of customers there. By the way, the free-served fruit cup was nice. Generally speaking, it is a typical wedding banquet restaurant rather than a comfortable place to dine with friends."

Original review: 交通还挺方便的，环境看起来很高大上的样子，但是因为主厅在举办婚礼非常混乱，特别吵感觉，但是装修的还不错，感觉很精致的装修，门面很气派，周末去的时候还需要等位。味道的话我觉得还可以但是跟价格比起来就很一般了，性价比挺低的，为了去吃宫保虾球的，但是我觉得也就那样吧虾不是特别新鲜，不过虾球很大，味道还行。服务的话由于人很多所以也顾不过来上菜的速度不快，但是有送水果杯还挺好吃的。总之就是典型的婚宴餐厅不是适合普通朋友吃饭的地方了。

Rating: 3-Star

| Aspect category | Label |
| Location#Transportation (交通方便) | 1 |
| Location#Downtown (位于商圈附近) | - |
| Location#Easy to find (是否容易寻找) | - |
| Service#Queue (排队时间) | - |
| Service#Hospitality (服务人员态度) | - |
| Service#Parking (停车方便) | - |
| Service#Timely (点菜/上菜速度) | -1 |
| Price#Level (价格水平) | 0 |
| Price#Cost effective (性价比) | -1 |
| Price#Discount (折扣力度) | - |
| Ambience#Decoration (装修) | 1 |
| Ambience#Noise (嘈杂情况) | -1 |
| Ambience#Space (就餐空间) | 1 |
| Ambience#Sanitary (卫生情况) | - |
| Food#Portion (分量) | 1 |
| Food#Taste (口味) | 1 |
| Food#Appearance (外观) | - |
| Food#Recommend (推荐程度) | - |

3.4 Dataset Analysis

Figure 3 presents the distribution of the 18 aspect categories in ASAP.

Figure 3: The distribution of 18 fine-grained aspect categories in ASAP.

Because ASAP concentrates on the restaurant domain, 94.7% of reviews mention Food#Taste, as expected. Users also pay great attention to fine-grained aspect categories such as Service#Hospitality, Price#Level and Ambience#Decoration. The distribution demonstrates the advantages of ASAP, as users' fine-grained preferences reflect the pros and cons of restaurants more precisely.

The statistics of ASAP are presented in Table 3. We also include a tailored SE-ABSA14 RESTAURANT dataset for reference. Note that we remove the reviews holding aspect categories with the sentiment polarity "conflict" from the original RESTAURANT dataset.

Table 3: The statistics and label/rating distribution of ASAP and RESTAURANT. Review lengths are counted in Chinese characters and English words, respectively. Sentences are segmented by periods in ASAP, while RESTAURANT is a sentence-level dataset.

| Dataset | Split | Reviews | Avg. sentences/review | Avg. aspects/review | Avg. length | Positive | Negative | Neutral | 1-star | 2-star | 3-star | 4-star | 5-star |
| ASAP | Train | 36,850 | 8.6 | 5.8 | 319.7 | 133,721 | 27,425 | 52,225 | 1,219 | 1,258 | 5,241 | 13,362 | 15,770 |
| ASAP | Dev | 4,940 | 8.7 | 5.9 | 319.9 | 18,176 | 3,733 | 7,192 | 151 | 166 | 784 | 1,734 | 2,105 |
| ASAP | Test | 4,940 | 8.3 | 5.7 | 317.1 | 17,523 | 3,813 | 7,026 | 165 | 173 | 717 | 1,867 | 2,018 |
| RESTAURANT | Train | 2,855 | 1 | 1.2 | 15.2 | 2,150 | 822 | 498 | - | - | - | - | - |
| RESTAURANT | Test | 749 | 1 | 1.3 | 15.6 | 645 | 215 | 94 | - | - | - | - | - |

Compared with RESTAURANT, ASAP excels in the quantity of training instances, which supports the exploration of recent data-intensive deep neural models. ASAP is a review-level dataset, while RESTAURANT is a sentence-level dataset. The average length of reviews in ASAP is much longer, so the reviews tend to contain richer aspect information. In ASAP, reviews contain 5.8 aspect categories on average, which is 4.7 times that of RESTAURANT. Both review-level ACSA and RP are more challenging than their sentence-level counterparts. Take the review in Table 2 for example: it contains several sentiment polarities towards multiple aspect categories. In addition to aspect category sentiment annotations, ASAP also includes the overall user rating of each review. With the help of ASAP, ACSA and RP can be further optimized either separately or jointly.

4 Methodology

4.1 Problem Formulation

We use $D$ to denote the collection of user reviews in the training data. Given a review $R$ consisting of a sequence of words $\{w_1, w_2, ..., w_Z\}$, ACSA aims to predict the sentiment polarity $y_i \in \{Positive, Neutral, Negative\}$ of review $R$ with respect to each mentioned aspect category $a_i$, $i \in \{1, 2, ..., N\}$. $Z$ denotes the length of review $R$, and $N$ is the number of pre-defined aspect categories (i.e., 18 in this paper). Suppose there are $K$ mentioned aspect categories in $R$. We define a mask vector $[p_1, p_2, ..., p_N]$ to indicate the occurrence of aspect categories: when the aspect category $a_i$ is mentioned in $R$, $p_i = 1$; otherwise $p_i = 0$, so that $\sum_{i=1}^{N} p_i = K$. RP aims to predict the 5-star rating score $g$, which represents the overall rating of the given review $R$.
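To make the formulation concrete, the annotations of a review can be packed into a label vector y and the mask vector p. The sketch below is our own illustration (the aspect list follows Table 1; the mapping of polarities to class ids is our choice, not specified by the paper):

```python
import numpy as np

# The 18 pre-defined aspect categories from Table 1.
ASPECTS = [
    "Location#Transportation", "Location#Downtown", "Location#Easy to find",
    "Service#Queue", "Service#Hospitality", "Service#Parking", "Service#Timely",
    "Price#Level", "Price#Cost effective", "Price#Discount",
    "Ambience#Decoration", "Ambience#Noise", "Ambience#Space", "Ambience#Sanitary",
    "Food#Portion", "Food#Taste", "Food#Appearance", "Food#Recommend",
]

def encode(annotations):
    """Map {aspect: 1 | 0 | -1} to an N-dim label vector y and mask vector p.

    Unmentioned aspects keep a placeholder label and get p_i = 0, so they are
    filtered out of the ACSA loss by the gate described in Section 4.2."""
    N = len(ASPECTS)
    y = np.zeros(N, dtype=np.int64)    # placeholder class id for unmentioned aspects
    p = np.zeros(N, dtype=np.float32)
    for aspect, polarity in annotations.items():
        i = ASPECTS.index(aspect)
        y[i] = polarity + 1            # {-1, 0, 1} -> class ids {0, 1, 2}
        p[i] = 1.0
    return y, p

# A fragment of the Table 2 example: K = p.sum() aspects enter the loss.
y, p = encode({"Food#Taste": 1, "Ambience#Noise": -1, "Price#Level": 0})
assert p.sum() == 3
```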
4.2 Joint Model

Given a user review, ACSA focuses on predicting its underlying sentiment polarities on different aspect categories, while RP focuses on predicting the user's overall feelings from the review content. We reckon these two tasks are highly correlated and that better performance can be achieved by considering them jointly.

The advent of BERT has established the success of the "pre-training and then fine-tuning" paradigm for NLP tasks. BERT-based models have achieved impressive results in ACSA (Xu et al., 2019; Sun et al., 2019; Jiang et al., 2019). Review rating prediction can be deemed a single-sentence classification (regression) task, which can also be addressed with BERT. Therefore, we propose a joint learning model to address ACSA and RP in a multi-task learning manner. Our joint model employs the fine-to-coarse semantic representation capability of the BERT encoder. Figure 4 illustrates the framework of our joint model.

Figure 4: The framework of the proposed joint learning model. The right part of the dotted vertical line is used to predict multiple aspect category sentiment polarities, while the left part is used to predict the review rating.

ACSA. As shown in Figure 4, the token embeddings of the input review are generated by a shared BERT encoder. Briefly, let $H \in \mathbb{R}^{d \times Z}$ be the matrix consisting of the token embedding vectors $\{h_1, ..., h_Z\}$ that BERT produces, where $d$ is the size of the hidden layers and $Z$ is the length of the given review. Since different aspect category information is dispersed across the content of $R$, we add an attention-pooling layer (Wang et al., 2016) to aggregate the related token embeddings dynamically for every aspect category. The attention-pooling layer helps the model focus on the tokens most related to the target aspect categories:

$M_i^a = \tanh(W_i^a H)$ (1)

$\alpha_i = \mathrm{softmax}(\omega_i^T M_i^a)$ (2)

$r_i = \tanh(W_i^p H \alpha_i^T)$ (3)

where $W_i^a \in \mathbb{R}^{d \times d}$, $M_i^a \in \mathbb{R}^{d \times Z}$, $\omega_i \in \mathbb{R}^d$, $\alpha_i \in \mathbb{R}^Z$, $W_i^p \in \mathbb{R}^{d \times d}$, and $r_i \in \mathbb{R}^d$. $\alpha_i$ is the vector of attention weights over all tokens, which selectively attends to the regions of the aspect-category-related tokens, and $r_i$ is the attentive representation of the review with respect to the $i$-th aspect category $a_i$, $i \in \{1, 2, ..., N\}$. Then we have

$\hat{y}_i = \mathrm{softmax}(W_i^q r_i + b_i^q)$ (4)

where $W_i^q \in \mathbb{R}^{C \times d}$ and $b_i^q \in \mathbb{R}^C$ are trainable parameters of the softmax layer, and $C$ is the number of labels (i.e., 3 in our task).
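To make the shapes concrete, here is a minimal PyTorch sketch of the attention-pooling head in Eqs. (1)-(4). It is our reading of the equations rather than the authors' released implementation; the class and attribute names are ours, and we realize the per-aspect parameters $W_i^a$, $\omega_i$, $W_i^p$, $W_i^q$ as per-aspect modules.

```python
import torch
import torch.nn as nn

class AspectAttentionPooling(nn.Module):
    """Per-aspect attention pooling over BERT token embeddings, Eqs. (1)-(4)."""

    def __init__(self, d: int, n_aspects: int = 18, n_labels: int = 3):
        super().__init__()
        self.Wa = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(n_aspects))
        self.omega = nn.Parameter(torch.randn(n_aspects, d))
        self.Wp = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(n_aspects))
        self.clf = nn.ModuleList(nn.Linear(d, n_labels) for _ in range(n_aspects))

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, Z, d) token embeddings from the shared BERT encoder.
        logits = []
        for i in range(len(self.Wa)):
            M = torch.tanh(self.Wa[i](H))                    # Eq. (1): (batch, Z, d)
            alpha = torch.softmax(M @ self.omega[i], dim=1)  # Eq. (2): (batch, Z)
            pooled = (alpha.unsqueeze(1) @ H).squeeze(1)     # H * alpha^T: (batch, d)
            r = torch.tanh(self.Wp[i](pooled))               # Eq. (3)
            logits.append(self.clf[i](r))                    # Eq. (4), before softmax
        return torch.stack(logits, dim=1)                    # (batch, N, C)
```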
Hence, the ACSA loss for a given review $R$ is defined as

$loss_{ACSA} = -\frac{1}{K} \sum_{i=1}^{N} \sum_{C} p_i\, y_i \log \hat{y}_i$ (5)

If the aspect category $a_i$ is not mentioned in $R$, $y_i$ is set to a random value. The $p_i$ serves as a gate function, which filters out the random $y_i$ and ensures that only the mentioned aspect categories participate in the calculation of the loss function.

Rating Prediction. Since the objective of RP is to predict the review rating based on the review content, we adopt the [CLS] embedding $h_{[cls]} \in \mathbb{R}^d$ that BERT produces as the representation of the input review, where $d$ is the size of the hidden layers in the BERT encoder:

$\hat{g} = \beta^T \tanh(W^r h_{[cls]} + b^r)$ (6)

Hence, the RP loss for a given review $R$ is defined as

$loss_{RP} = |g - \hat{g}|$ (7)

where $W^r \in \mathbb{R}^{d \times d}$, $b^r \in \mathbb{R}^d$ and $\beta \in \mathbb{R}^d$ are trainable parameters. The final loss of our joint model is

$loss = loss_{ACSA} + loss_{RP}$ (8)
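Read together, Eqs. (5)-(8) amount to a masked cross-entropy plus an absolute-error term. A minimal PyTorch sketch, assuming the (batch, N, C) logits from the module above and the (y, p) encoding sketched earlier (function and argument names are ours):

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, y, p, g_hat, g):
    """Eqs. (5)-(8). logits: (batch, N, C); y: (batch, N) class ids;
    p: (batch, N) mention mask; g_hat, g: (batch,) predicted/gold ratings."""
    # Eq. (5): cross-entropy per (review, aspect), gated by p and averaged
    # over the K mentioned aspects of each review.
    ce = F.cross_entropy(logits.transpose(1, 2), y, reduction="none")  # (batch, N)
    loss_acsa = (p * ce).sum(dim=1) / p.sum(dim=1).clamp(min=1.0)
    # Eq. (7): absolute error between the gold and predicted rating.
    loss_rp = (g - g_hat).abs()
    # Eq. (8): the two objectives are simply summed.
    return (loss_acsa + loss_rp).mean()
```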
5 Experiments

We perform an extensive set of experiments to evaluate the performance of our joint model on ASAP and RESTAURANT. Ablation studies are also conducted to probe the interactive influence between ACSA and RP.

5.1 ACSA

Baseline Models. We implement several ACSA baselines for comparison. According to the structures of their encoders, these models are classified into non-BERT-based models and BERT-based models. The non-BERT-based models include TextCNN (Kim, 2014), BiLSTM+Attn (Zhou et al., 2016), ATAE-LSTM (Wang et al., 2016) and CapsNet (Sabour et al., 2017). The BERT-based models include vanilla BERT (Devlin et al., 2018), QA-BERT (Sun et al., 2019) and CapsNet-BERT (Jiang et al., 2019). Due to the page limit, the implementation details of all experimental models are presented in Appendix A.3.

Evaluation Metrics. Following the settings of RESTAURANT (Pontiki et al., 2014), we adopt Macro-F1 and Accuracy (Acc) as evaluation metrics.
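A small sketch of how these two metrics could be computed; the paper does not spell out its evaluation code, so restricting the computation to (review, aspect) pairs whose aspect is actually mentioned is our assumption:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def acsa_metrics(y_true, y_pred, mask):
    """Macro-F1 and Accuracy over the mentioned (review, aspect) pairs.

    y_true, y_pred: (n_reviews, N) polarity class ids; mask: (n_reviews, N)
    with 1 where the aspect category is mentioned."""
    keep = np.asarray(mask, dtype=bool)
    gold, pred = np.asarray(y_true)[keep], np.asarray(y_pred)[keep]
    return f1_score(gold, pred, average="macro"), accuracy_score(gold, pred)
```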
Experimental Results & Analysis. We report the performance of the aforementioned models on ASAP and RESTAURANT in Table 4.

Table 4: The experimental results of ACSA models on ASAP and RESTAURANT. Best scores are boldfaced.

| Category | Model | ASAP Macro-F1 | ASAP Acc. | RESTAURANT Macro-F1 | RESTAURANT Acc. |
| Non-BERT-based | TextCNN (Kim, 2014) | 60.41% | 71.10% | 70.56% | 82.29% |
| Non-BERT-based | BiLSTM+Attn (Zhou et al., 2016) | 70.53% | 77.78% | 70.85% | 81.97% |
| Non-BERT-based | ATAE-LSTM (Wang et al., 2016) | 76.60% | 81.94% | 70.15% | 82.12% |
| Non-BERT-based | CapsNet (Sabour et al., 2017) | 75.54% | 81.66% | 71.84% | 82.63% |
| BERT-based | Vanilla-BERT (Devlin et al., 2018) | 79.18% | 84.09% | 79.22% | 87.63% |
| BERT-based | QA-BERT (Sun et al., 2019) | 79.44% | 83.92% | 80.89% | 88.89% |
| BERT-based | CapsNet-BERT (Jiang et al., 2019) | 78.92% | 83.74% | 80.94% | 89.00% |
| BERT-based | Joint Model (w/o RP) | 80.75% | 85.15% | 82.01% | 89.62% |
| BERT-based | Joint Model | 80.78% | 85.19% | - | - |

Generally, the BERT-based models outperform the non-BERT-based models on both datasets. The two variants of our joint model perform better than vanilla-BERT, QA-BERT, and CapsNet-BERT, which proves the advantages of our joint learning model. Given a user review, vanilla-BERT, QA-BERT, and CapsNet-BERT treat the pre-defined aspect categories independently, while our joint model combines them within a multi-task learning framework. On one hand, the encoder-sharing setting enables knowledge transfer among different aspect categories. On the other hand, our joint model is more efficient than the other competitors, especially when the number of aspect categories is large. The ablation of RP (i.e., joint model (w/o RP)) still outperforms all the other baselines. The introduction of RP to ACSA brings a marginal improvement; this is reasonable considering that the essential objective of RP is to estimate the overall sentiment polarity instead of the fine-grained sentiment polarities. We also visualize the attention weights produced by our joint model on the example of Table 2 in Appendix A.4 for reference.

5.2 Rating Prediction

We compare several RP models on ASAP, including TextCNN (Kim, 2014), BiLSTM+Attn (Zhou et al., 2016) and ARP (Wu et al., 2019b). ARP is the SOTA aspect-aware rating prediction model in the literature. The data pre-processing and implementation details are identical to those of the ACSA experiments.

Evaluation Metrics. We adopt Mean Absolute Error (MAE) and Accuracy (obtained by mapping the predicted rating score to the nearest rating category) as evaluation metrics.
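A sketch of the two RP metrics as we read them; the exact rounding rule for "nearest category" is our assumption:

```python
import numpy as np

def rp_metrics(pred, gold):
    """MAE on the raw scores; Accuracy after snapping predictions to the
    nearest star in {1, ..., 5}."""
    pred, gold = np.asarray(pred, dtype=float), np.asarray(gold, dtype=float)
    mae = np.abs(pred - gold).mean()
    nearest = np.clip(np.rint(pred), 1, 5)  # map to the nearest rating category
    acc = (nearest == gold).mean()
    return float(mae), float(acc)

# e.g. rp_metrics([4.3, 2.6], [4, 3]) -> (0.35, 1.0)
```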
Experimental Results & Analysis. The experimental results of the comparative RP models are presented in Table 5.

Table 5: Experimental results of RP models on ASAP. Best scores are boldfaced.

| Model | MAE | Acc. |
| TextCNN (Kim, 2014) | .5814 | 52.99% |
| BiLSTM+Attn (Zhou et al., 2016) | .5737 | 54.38% |
| ARP (Wu et al., 2019b) | .5620 | 54.76% |
| Joint Model (w/o ACSA) | .4421 | 60.08% |
| Joint Model | .4266 | 61.26% |

Our joint model, which combines ACSA and RP, outperforms the other models considerably. On one hand, the performance improvement is expected since our joint model is built upon BERT. On the other hand, the ablation of ACSA (i.e., joint model (w/o ACSA)) brings a performance degradation of RP on both metrics. We can conclude that the fine-grained aspect category sentiment prediction of a review indeed helps the model predict its overall rating more accurately.

This section conducts preliminary experiments to evaluate classical ACSA and RP models on our proposed ASAP dataset. We believe there still exists much room for improvement on both tasks, and we leave this for future work.

6 Conclusion

This paper presents ASAP, a large-scale Chinese restaurant review dataset towards the aspect category sentiment analysis (ACSA) and rating prediction (RP) tasks. ASAP consists of 46,730 restaurant user reviews with star ratings from a leading online-to-offline (O2O) e-commerce platform in China. Each review is manually annotated according to its sentiment polarities on 18 fine-grained aspect categories, such as Food#Taste, Ambience#Noise and Service#Hospitality. Both ACSA and RP are of great value for business intelligence in e-commerce, while high-quality Chinese datasets for ACSA and RP are unexpectedly in short supply. We hope the release of ASAP can push forward related research and applications. Although ASAP can be utilized as an evaluation benchmark for ACSA and RP separately, a unique advantage of ASAP is that each review holds both fine-grained aspect-based sentiment polarity annotations and an overall user rating. Besides evaluating ACSA and RP models on ASAP separately, we also propose a joint model to address ACSA and RP synthetically, which outperforms other state-of-the-art baselines considerably.
References

Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference, pages 1583-1592.

Zhiyong Cheng, Ying Ding, Lei Zhu, and Mohan Kankanhalli. 2018. Aspect-aware latent factor model: Rating prediction with ratings and reviews. In Proceedings of the 2018 World Wide Web Conference, pages 639-648.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Gayatree Ganu, Noemie Elhadad, and Amélie Marian. 2009. Beyond the stars: Improving rating predictions using review text content. In WebDB, volume 9, pages 1-6. Citeseer.

Mengting Hu, Shiwan Zhao, Li Zhang, Keke Cai, Zhong Su, Renhong Cheng, and Xiaowei Shen. 2018. CAN: Constrained attention networks for multi-aspect sentiment analysis. CoRR, abs/1812.10735.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168-177.

Qingnan Jiang, Lei Chen, Ruifeng Xu, Xiang Ao, and Min Yang. 2019. A challenge dataset and effective models for aspect-based sentiment analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6281-6286.

Zhipeng Jin, Qiudan Li, Daniel D. Zeng, YongCheng Zhan, Ruoran Liu, Lei Wang, and Hongyuan Ma. 2016. Jointly modeling review content and aspect ratings for review rating prediction. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 893-896.

Yohan Jo and Alice H. Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 815-824. ACM.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Svetlana Kiritchenko, Xiaodan Zhu, Colin Cherry, and Saif Mohammad. 2014. NRC-Canada-2014: Detecting aspects and sentiment in customer reviews. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 437-442.

Changliang Li, Bo Xu, Gaowei Wu, Saike He, Guanhua Tian, and Hongwei Hao. 2014. Recursive deep learning for sentiment analysis over social data. In 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), volume 2, pages 180-185. IEEE.

Fangtao Li, Nathan Nan Liu, Hongwei Jin, Kai Zhao, Qiang Yang, and Xiaoyan Zhu. 2011. Incorporating reviewer and product information for review rating prediction. In Twenty-Second International Joint Conference on Artificial Intelligence.

Junjie Li, Haitong Yang, and Chengqing Zong. 2018. Document-level multi-aspect sentiment classification by jointly modeling users, aspects, and overall ratings. In Proceedings of the 27th International Conference on Computational Linguistics, pages 925-936, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. Interactive attention networks for aspect-level sentiment classification. In International Joint Conference on Artificial Intelligence, pages 4068-4074.

Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165-172.

Sajad Movahedi, Erfan Ghadery, Heshaam Faili, and Azadeh Shakery. 2019. Aspect category detection via topic-attention network.

Haiyun Peng, Erik Cambria, and Amir Hussain. 2017. A review of sentiment analysis research in Chinese language. Cognitive Computation, 9(4):423-435.

Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In International Workshop on Semantic Evaluation (SemEval 2015).

Maria Pontiki, Dimitrios Galanis, John Pavlopoulos, Haris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In International Workshop on Semantic Evaluation (SemEval 2014).

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, AL-Smadi Mohammad, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, et al. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19-30.

Sebastian Ruder, Parsa Ghaffari, and John G. Breslin. 2016. A hierarchical model of reviews for aspect-based sentiment analysis.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856-3866.

Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. arXiv preprint arXiv:1903.09588.

Songbo Tan and Jin Zhang. 2008. An empirical study of sentiment analysis for Chinese documents. Expert Systems with Applications, 34(4):2622-2629.

Yequan Wang, Minlie Huang, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Empirical Methods in Natural Language Processing, pages 606-615.

Chuhan Wu, Fangzhao Wu, Junxin Liu, Yongfeng Huang, and Xing Xie. 2019a. ARP: Aspect-aware neural review rating prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2169-2172.

Chuhan Wu, Fangzhao Wu, Junxin Liu, Yongfeng Huang, and Xing Xie. 2019b. ARP: Aspect-aware neural review rating prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2169-2172.

Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. arXiv preprint arXiv:1904.02232.

Taras Zagibalov and John A. Carroll. 2008. Unsupervised classification of sentiment and objectivity in Chinese text. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume I.

Yanyan Zhao, Bing Qin, and Ting Liu. 2014. Creating a fine-grained corpus for Chinese sentiment analysis. IEEE Intelligent Systems, 30(1):36-43.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207-212.

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2015. Representation learning for aspect category detection in online reviews. In Twenty-Ninth AAAI Conference on Artificial Intelligence.