Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction
Zeyu Li¹, Wei Cheng², Reema Kshetramade¹, John Houser¹, Haifeng Chen², and Wei Wang¹
¹ Department of Computer Science, University of California, Los Angeles
² NEC Labs America
{zyli,weiwang}@cs.ucla.edu  {reemakshe,johnhouser}@ucla.edu  {weicheng,haifeng}@nec-labs.com
arXiv:2109.03821v1 [cs.IR] 7 Sep 2021

Abstract

Compliments and concerns in reviews are valuable for understanding users' shopping interests and their opinions with respect to specific aspects of certain items. Existing review-based recommenders favor large and complex language encoders that can only learn latent and uninterpretable text representations. They lack explicit user-attention and item-property modeling, which however could provide valuable information beyond the ability to recommend items. Therefore, we propose a tightly coupled two-stage approach, including an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). Unsupervised ASPE mines Aspect-Sentiment pairs (AS-pairs) and APRE predicts ratings using AS-pairs as concrete aspect-level evidence. Extensive experiments on seven real-world Amazon Review Datasets demonstrate that ASPE can effectively extract AS-pairs, which enables APRE to deliver superior accuracy over the leading baselines.

1 Introduction

Reviews and ratings are valuable assets for the recommender systems of e-commerce websites since they directly describe the users' subjective feelings about their purchases. Learning user preferences from such feedback is straightforward and efficacious. Previous research on review-based recommendation has been fruitful (Chin et al., 2018; Chen et al., 2018; Bauman et al., 2017; Liu et al., 2019). Cutting-edge natural language processing (NLP) techniques are applied to extract the latent user sentiments, item properties, and the complicated interactions between the two components.

However, existing approaches have disadvantages that leave room for improvement. Firstly, they dismiss the phenomenon that users may hold different attentions toward various properties of the merchandise. An item property is the combination of an aspect of the item and the characteristic associated with it. Users may show strong attention to certain properties but indifference to others. The attended advantageous or disadvantageous properties can dominate the attitude of users and, consequently, decide their generosity in rating.

Table 1 exemplifies the impact of user attitude using three real reviews for a headset. Three aspects are covered: microphone quality, comfortableness, and sound quality. The microphone quality is controversial: R2 and R3 criticize it but R1 praises it. The sole disagreement between R1 and R2 is on the microphone, which is the major concern of R2 and results in the divergence of ratings (5 stars vs. 3 stars). However, R3 neglects that disadvantage and grades highly (5 stars) for the superior comfortableness indicated by the metaphor of "pillow".

Secondly, understanding user motivations in granular item properties provides valuable information beyond the ability to recommend items. It requires aspect-based NLP techniques to extract explicit and definitive aspects. However, existing aspect-based models mainly use latent or implicit aspects (Chin et al., 2018) whose real semantics are unjustifiable. Similar to Latent Dirichlet Allocation (LDA, Blei et al., 2003), the semantics of the derived aspects (topics) are mutually overlapped (Huang et al., 2020b). These models undermine the resultant aspect distinctiveness and lead to uninterpretable and sometimes counterintuitive results. The root of the problem is the lack of large review corpora with aspect and sentiment annotations. The existing ones are either too small or too domain-specific (Wang and Pan, 2018) to be applied to general use cases. Progress on sentiment term extraction (Dai and Song, 2019; Tian et al., 2020; Chen et al., 2020a) takes advantage of neural networks and linguistic knowledge and partially makes it possible to use unsupervised term annotation to tackle the lack-of-huge-corpus issue.
Reviews | Microphone | Comfort | Sound
R1 [5 stars]: Comfortable. Very high quality sound. ... Mic is good too. There is a switch to mute your mic. ... I wear glasses and these are comfortable with my glasses on. ... | good (satisfied) | comfortable | high quality (praising)
R2 [3 stars]: I love the comfort, sound, and style but the mic is complete junk! | complete junk (angry) | love | love
R3 [5 stars]: ... But this one feels like a pillow, there's nothing wrong with the audio and it does the job. ... con is that the included microphone is pretty bad. | pretty bad (unsatisfied) | like a pillow (enjoyable) | nothing wrong

Table 1: Example reviews of a headset with three aspects, namely microphone quality, comfort level, and sound quality, highlighted specifically. The extracted sentiments are on the right. R1 vs. R2: Different users react differently (microphone quality) to the same item due to distinct personal attentions and, consequently, give divergent ratings. R1 vs. R3: A user can still rate an item highly due to special attention to particular aspects (comfort level) regardless of certain unsatisfactory or indifferent properties (microphone and sound qualities).

In this paper, we seek to understand how reviews and ratings are affected by users' perception of item properties in a fine-grained way and discuss how to utilize these findings transparently and effectively in rating prediction. We propose a two-stage recommender with an unsupervised Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). ASPE extracts (aspect, sentiment) pairs (AS-pairs) from reviews. The pairs are fed into APRE as explicit user attention and item property carriers indicating both frequencies and sentiments of aspect mentions. APRE encodes the text with a contextualized encoder and processes implicit text features and the annotated AS-pairs with a dual-channel rating regressor. ASPE and APRE jointly extract explicit aspect-based attentions and properties and solve the rating prediction with great performance.

Aspect-level user attitude differs from user preference. The user attitudes produced by the interactions of user attentions and item properties are sophisticated and granular sentiments and rationales for interpretation (see Sections 4.4 and A.3.5). Preferences, on the contrary, are coarse sentiments such as like, dislike, or neutral. Preference-based models may infer that R1 and R3 are written by headset lovers because of the high ratings. Instead, attitude-based methods further understand that it is the comfortableness that matters to R3 rather than the item being a headset. Aspect-level attitude modeling is more accurate, informative, and personalized than preference modeling.

Note. Due to the page limit, some supportive materials, marked by "†", are presented in the Supplementary Materials. We strongly recommend readers check out these materials. The source code of our work is available on GitHub at https://github.com/zyli93/ASPE-APRE.

2 Related Work

Our work is related to four lines of literature located in the overlap of ABSA and Recommender Systems.

2.1 Aspect-based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) (Xu et al., 2020; Wang et al., 2018) predicts sentiments toward aspects mentioned in the text. Natural language is modeled by graphs in (Zhang et al., 2019; Wang et al., 2020), such as Pointwise Mutual Information (PMI) graphs and dependency graphs. Phan and Ogunbona (2020) and Tang et al. (2020) utilize contextualized language encoding to capture the context of aspect terms. Chen et al. (2020b) focuses on the consistency of the emotion surrounding the aspects, and Du et al. (2020) equips pre-trained BERT with domain-awareness of sentiments. Our work is informed by this progress, which utilizes PMI, dependency trees, and BERT for syntax feature extraction and language encoding.

2.2 Aspect or Sentiment Term Extraction

Aspect and sentiment term extraction is a presupposition of ABSA. However, manually annotating data for training, which requires the hard labor of experts, is only feasible on small datasets in particular domains such as Laptop and Restaurant (Pontiki et al., 2014, 2015), which are overused in ABSA. Recently, RINANTE (Dai and Song, 2019) and SDRN (Chen et al., 2020a) automatically extract both terms using rule-guided data augmentation and double-channel opinion-relation co-extraction, respectively. However, these supervised approaches are too domain-specific to generalize to out-of-domain or open-domain corpora. Conducting domain adaptation from small labeled corpora to unlabeled open corpora only produces suboptimal results (Wang and Pan, 2018).
SKEP (Tian et al., 2020) exploits an unsupervised PMI+seed strategy to coarsely label sentimentally polarized tokens as sentiment terms, showing that the unsupervised method is advantageous when annotated corpora are insufficient in the domain of interest.

Compared to the above models, our ASPE has two merits: it is (1) unsupervised and hence free from expensive data labeling, and (2) generalizable to different domains by combining three different labeling methods.

2.3 Aspect-based Recommendation

Aspect-based recommendation is a relevant task with a major difference: specific terms indicating sentiments are not extracted; only the aspects are needed (Hou et al., 2019; Guan et al., 2019; Huang et al., 2020a; Chin et al., 2018). Some disadvantages are summarized as follows. First, the aspect extraction tools are usually outdated and inaccurate, such as LDA (Hou et al., 2019), TF-IDF (Guan et al., 2019), and word embedding-based similarity (Huang et al., 2020a). Second, the representation of sentiment is scalar-based, which is coarser than the embedding-based representation used in our work.

2.4 Rating Prediction

Rating prediction is an important task in recommendation. Related approaches utilize text mining algorithms to build user and item representations and predict ratings (Kim et al., 2016; Zheng et al., 2017; Chen et al., 2018; Chin et al., 2018; Liu et al., 2019; Bauman et al., 2017). However, the text features learned are latent and unable to provide explicit hints for explaining user interests.

3 ASPE and APRE

3.1 Problem Formulation

Review-based rating prediction involves two major entities: users and items. A user u writes a review r_{u,t} for an item t and rates a score s_{u,t}. Let R_u denote all reviews given by u and R_t denote all reviews received by t. A rating regressor takes in a tuple of a review-and-rate event (u, t) and the review sets R_u and R_t to estimate the rating score s_{u,t}.

3.2 Unsupervised ASPE

We combine three separate methods to label AS-pairs without the need for supervision, namely PMI-based, neural network-based (NN-based), and language knowledge- or lexicon-based methods. The framework is visualized in Figure 1.

[Figure 1: Pipeline of ASPE. Sentiment terms (ST) from the PMI-based, NN-based, and lexicon-based methods (ST_PMI, ST_NN, ST_Lex) are combined; dependency-based rules (amod, nsubj+acomp) over the review text produce AS-pair candidates, which are then filtered and merged into the final AS-pair extractions.]

3.2.1 Sentiment Term Extraction

PMI-based method. Pointwise Mutual Information (PMI) originates from Information Theory and is adapted into NLP (Zhang et al., 2019; Tian et al., 2020) to measure statistical word associations in corpora. It determines the sentiment polarities of words using a small number of carefully selected positive and negative seeds (s^+ and s^-) (Tian et al., 2020). It first extracts candidate sentiment terms satisfying the part-of-speech patterns of Turney (2002) and then measures the polarity of each candidate term w by

    Pol(w) = \sum_{s^+} PMI(w, s^+) - \sum_{s^-} PMI(w, s^-).    (1)

Given a sliding window-based context sampler ctx, the PMI(·,·) between words is defined by

    PMI(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)},    (2)

where p(·), the probability estimated by token counts, is defined by p(w_1, w_2) = |{ctx | w_1, w_2 ∈ ctx}| / (total #ctx) and p(w_1) = |{ctx | w_1 ∈ ctx}| / (total #ctx). Afterward, we collect the top-q sentiment tokens with strong polarities, both positive and negative, as ST_PMI.

NN-based method. As discussed in Section 2, co-extraction models (Dai and Song, 2019) can accurately label AS-pairs only in the training domain. For sentiment terms with consistent semantics across different domains, such as good and great, NN methods can still provide robust extraction recall. In this work, we take a pretrained SDRN (Chen et al., 2020a) as the NN-based method to generate ST_NN. The pretrained SDRN is considered an off-the-shelf tool, similar to the pretrained BERT, and is irrelevant to our rating prediction data. Therefore, we argue that ASPE is unsupervised for open-domain rating prediction.
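A minimal sketch of the PMI-based polarity scoring in Eqs. (1) and (2), assuming the corpus has already been split into sliding-window contexts; the POS-pattern candidate filtering of Turney (2002) is omitted, and the function name, seed lists, and top-q cutoff are illustrative rather than the released implementation.

```python
from collections import Counter
from itertools import combinations
import math

def pmi_polarity(contexts, pos_seeds, neg_seeds, top_q=50):
    """Score candidate terms by sum of PMI with positive seeds minus sum of PMI
    with negative seeds, with probabilities estimated from context counts."""
    n_ctx = len(contexts)
    word_df = Counter()   # number of contexts containing each word
    pair_df = Counter()   # number of contexts containing each word pair
    for ctx in contexts:
        uniq = set(ctx)
        word_df.update(uniq)
        pair_df.update(frozenset(p) for p in combinations(sorted(uniq), 2))

    def pmi(w1, w2):
        joint = pair_df[frozenset((w1, w2))]
        if joint == 0 or word_df[w1] == 0 or word_df[w2] == 0:
            return 0.0
        return math.log((joint / n_ctx) / ((word_df[w1] / n_ctx) * (word_df[w2] / n_ctx)))

    def polarity(w):
        return sum(pmi(w, s) for s in pos_seeds) - sum(pmi(w, s) for s in neg_seeds)

    candidates = [w for w in word_df if w not in pos_seeds | neg_seeds]
    ranked = sorted(candidates, key=polarity)
    # strongest negative and strongest positive candidates together form ST_PMI
    return set(ranked[:top_q]) | set(ranked[-top_q:])

# Toy usage with illustrative contexts and seeds.
contexts = [["sound", "is", "great"], ["mic", "is", "terrible"], ["great", "value"]]
st_pmi = pmi_polarity(contexts, pos_seeds={"good", "great"},
                      neg_seeds={"bad", "terrible"}, top_q=1)
```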
Knowledge-based method. The PMI- and NN-based methods have shortcomings: the PMI-based method depends on the seed selection, and the accuracy of the NN-based method deteriorates when the applied domain is distant from the training data. As compensation, we integrate a sentiment lexicon ST_Lex summarized by linguists, since expert knowledge is widely used in unsupervised learning. Examples of linguistic lexicons include SentiWordNet (Baccianella et al., 2010) and Opinion Lexicon (Hu and Liu, 2004). The latter is used in this work.

Building the sentiment term set. The three sentiment term subsets are joined to build an overall sentiment set used in AS-pair generation: ST = ST_PMI ∪ ST_NN ∪ ST_Lex. The three sets compensate for the discrepancies of the other methods and expand the coverage of terms, as shown in Table 10†.

3.2.2 Syntactic AS-pair Extraction

To extract AS-pairs, we first label AS-pair candidates using dependency parsing and then filter out non-sentiment-carrying candidates using ST.¹ Dependency parsing extracts the syntactic relations between the words. Some nouns are considered potential aspects and are modified by adjectives with two types of dependency relations shown in Figure 2: amod and nsubj+acomp. The pairs of nouns and their modifying adjectives compose the AS-pair candidates. Similar techniques are widely used in unsupervised aspect extraction models (Tulkens and van Cranenburgh, 2020; Dai and Song, 2019). AS-pair candidates are noisy since not all adjectives in them carry sentiment inclination. ST comes into use to filter out non-sentiment-carrying AS-pair candidates whose adjective is not in ST. The remaining candidates form the AS-pair set. Admittedly, the dependency-based extraction of (noun, adj.) pairs is suboptimal and causes missing aspect or sentiment terms. An implicit module is designed to remedy this issue. Open-domain AS-pair co-extraction is blocked by the lack of public labeled data and is left for future work.

We introduce ItemTok as a special aspect token of the nsubj+acomp rule where nsubj is a pronoun of the item, such as it and they. Infrequent aspect terms with fewer than c occurrences are ignored to reduce sparsity. We use WordNet synsets (Miller, 1995) to merge synonym aspects; the aspect with the most synonyms is selected as the representative of that aspect set.

[Figure 2: Two dependency-based rules for AS-pair candidate extraction. amod relation: "Amazing sound and quality, all in one headset." yields the candidates (sound, amazing) and (quality, amazing). nsubj+acomp relation: "Sound quality is superior and comfort is excellent." yields the candidates (sound quality, superior) and (comfort, excellent).]

Discussion. ASPE is different from Aspect Extraction (AE) (Tulkens and van Cranenburgh, 2020; Luo et al., 2019; Wei et al., 2020; Ma et al., 2019; Angelidis and Lapata, 2018; Xu et al., 2018; Shu et al., 2017; He et al., 2017a), which extracts aspects only and infers sentiment polarities in {pos, neg, (neu)}. AS-pair co-extraction, however, offers more diversified emotional signals than the bipolar sentiment measurement of AE.

¹ Section A.2.1† explains this procedure in detail with the pseudocode of Algorithm 1†.
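A minimal sketch of the two dependency rules above using spaCy, assuming the sentiment set ST has already been built; the dependency labels follow spaCy's English models, compound nouns (e.g., "sound quality") are not merged, and the function and parameter names are illustrative rather than the paper's released code.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_as_candidates(text, sentiment_terms, item_pronouns={"it", "they"}):
    """Apply the amod and nsubj+acomp rules, then keep only candidates whose
    adjective appears in the sentiment term set ST."""
    candidates = []
    for tok in nlp(text):
        # Rule 1 (amod): adjective modifying a noun, e.g., "amazing sound" -> (sound, amazing)
        if tok.dep_ == "amod" and tok.head.pos_ in {"NOUN", "PROPN"}:
            candidates.append((tok.head.text.lower(), tok.text.lower()))
        # Rule 2 (nsubj+acomp): "comfort is excellent" -> (comfort, excellent)
        if tok.dep_ == "acomp":
            for subj in (c for c in tok.head.children if c.dep_ == "nsubj"):
                aspect = "ItemTok" if subj.text.lower() in item_pronouns else subj.text.lower()
                candidates.append((aspect, tok.text.lower()))
    # Filter out candidates whose adjective does not carry sentiment.
    return [(a, s) for a, s in candidates if s in sentiment_terms]

pairs = extract_as_candidates(
    "Sound quality is superior and comfort is excellent.",
    sentiment_terms={"superior", "excellent", "amazing"},
)
```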
3.3 APRE

APRE, depicted in Figure 3, predicts ratings given reviews and the corresponding AS-pairs. It first encodes language into embeddings, then learns explicit and implicit features, and finally computes the score regression. One distinctive feature of APRE is that it explicitly models the aspect information by incorporating a d_a-dimensional aspect representation a_i ∈ R^{d_a} in each side of the substructures for review encoding. Let A^{(u)} = {a_1^{(u)}, ..., a_k^{(u)}} denote the k aspect embeddings for users and A^{(t)} for items. k is decided by the number of unique aspects in the AS-pair set.

[Figure 3: Pipeline of APRE, including a user review encoder and an item review encoder, each containing an implicit channel and an aspect-based explicit channel. Internal details of the item encoder are identical to the counterpart of the user encoder and hence omitted.]

Language encoding. The reviews are encoded into low-dimensional token embedding sequences by a fixed pre-trained BERT (Devlin et al., 2019), a powerful transformer-based contextualized language encoder. For each review r in R_u or R_t, the resulting encoding H^0 ∈ R^{(|r|+2)×d_e} consists of (|r|+2) d_e-dimensional contextualized vectors:

    H^0 = {h^0_{[CLS]}, h^0_1, ..., h^0_{|r|}, h^0_{[SEP]}}.

[CLS] and [SEP] are two special tokens indicating the starts and separators of sentences. We use a trainable linear transformation, h^1_i = W_{ad}^T h^0_i + b_{ad}, to adapt the BERT output representation H^0 to our task as H^1, where W_{ad} ∈ R^{d_e×d_f}, b_{ad} ∈ R^{d_f}, and d_f is the transformed dimension of internal features. BERT encodes the token semantics based upon the context, which resolves the polysemy of certain sentiment terms; e.g., "cheap" is positive for price but negative for quality. This step transforms the sentiment encoding to attention-property modeling.
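A minimal sketch of this language encoding step, assuming the Hugging Face transformers package: the frozen BERT produces H^0 and a trainable linear layer adapts it to H^1. The d_f value and variable names are illustrative.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():          # BERT stays fixed; only the adapter is trained
    p.requires_grad = False

d_e, d_f = bert.config.hidden_size, 128          # d_f: transformed feature dimension
adapter = torch.nn.Linear(d_e, d_f)              # h^1_i = W_ad^T h^0_i + b_ad

def encode_review(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        h0 = bert(**inputs).last_hidden_state    # H^0: (1, |r|+2, d_e), incl. [CLS]/[SEP]
    return adapter(h0)                           # H^1: (1, |r|+2, d_f)

h1 = encode_review("The sound quality is superior and the mic is pretty bad.")
```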
Explicit aspect-level attitude modeling. For aspect a among the k total aspects, we pull out all the contextualized representations of the sentiment words² that modify a and aggregate them into a single embedding of aspect a in r as

    h^{(a)}_{u,r} = \sum_j h^1_j,  where w_j ∈ ST ∩ r and w_j modifies a.

An observation by Chen et al. (2020b) suggests that users tend to use semantically consistent words for the same aspect in reviews. Therefore, sum-pooling can nicely handle both the sentiments and the frequencies of term mentions. Aspects that are not mentioned by r will have h^{(a)}_{u,r} = 0. To completely picture user u's attention to all aspects, we aggregate all reviews from u, i.e., R_u, using review-wise aggregation weighted by α^{(a)}_{u,r} given in the equation below. α^{(a)}_{u,r} indicates the significance of each review's contribution to the overall understanding of u's attention to aspect a:

    α^{(a)}_{u,r} = \frac{\exp(\tanh(w_{ex}^T [h^{(a)}_{u,r}; a^{(u)}]))}{\sum_{r' ∈ R_u} \exp(\tanh(w_{ex}^T [h^{(a)}_{u,r'}; a^{(u)}]))},

where [·;·] denotes the concatenation of tensors and w_{ex} ∈ R^{d_f + d_a} is a trainable weight. With the usefulness distribution of α^{(a)}_{u,r}, we aggregate the h^{(a)}_{u,r} of r ∈ R_u by weighted average pooling:

    g^{(a)}_u = \sum_{r ∈ R_u} α^{(a)}_{u,r} h^{(a)}_{u,r}.

Now we obtain the user attention representation for aspect a, g^{(a)}_u ∈ R^{d_f}. We use G_u ∈ R^{d_f×k} to denote the matrix of the g^{(a)}_u. The item-tower architecture is omitted in Figure 3 since the item property modeling shares the identical computing procedure; it generates the item property representations g^{(a)}_t of G_t. Mutual attention (Liu et al., 2019; Tay et al., 2018; Dong et al., 2020) is not utilized since the generation of the user attention encodings G_u is independent of the item properties and vice versa.

² BERT uses a WordPiece tokenizer that can break an out-of-vocabulary word into shorter word pieces. If a sentiment word is broken into word pieces, we use the representation of the first word piece produced.
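A small PyTorch sketch of this explicit channel for one user and one aspect, assuming the adapted token representations H^1 per review and a per-token mask marking the sentiment words that modify the aspect; tensor shapes and names are illustrative.

```python
import torch

def explicit_aspect_repr(h1_list, modifier_masks, aspect_emb, w_ex):
    """h1_list: list of (L_r, d_f) adapted token reps, one per review in R_u.
    modifier_masks: list of (L_r,) 0/1 masks of sentiment words modifying aspect a.
    aspect_emb: (d_a,) shared aspect embedding a^(u).  w_ex: (d_f + d_a,) weight.
    Returns g_u^(a): (d_f,) via sum-pooling and review-wise weighted averaging."""
    h_a = torch.stack([(m.unsqueeze(-1) * h).sum(dim=0)              # sum-pooling per review
                       for h, m in zip(h1_list, modifier_masks)])    # (|R_u|, d_f)
    scores = torch.tanh(
        torch.cat([h_a, aspect_emb.expand(len(h1_list), -1)], dim=-1) @ w_ex)
    alpha = torch.softmax(scores, dim=0)                             # review-wise weights
    return (alpha.unsqueeze(-1) * h_a).sum(dim=0)

# Toy usage: two reviews, d_f = 4, d_a = 3.
h1_list = [torch.randn(6, 4), torch.randn(9, 4)]
masks = [torch.tensor([0, 1, 0, 0, 1, 0.]), torch.zeros(9)]
g_u_a = explicit_aspect_repr(h1_list, masks, torch.randn(3), torch.randn(7))
```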
Implicit review representation. It is acknowledged by the existing works in Section 2 that implicit semantic modeling is critical because some emotions are conveyed without explicit sentiment word mentions. For example, "But this one feels like a pillow ..." in R3 of Table 1 does not contain any sentiment tokens but expresses strong satisfaction with the comfortableness, which would be missed by the extractive annotation-based ASPE. In APRE, we combine a global feature h^1_{[CLS]}, a local context feature h_{cnn} ∈ R^{n_c} learned by a convolutional neural network (CNN) with output channel size n_c, kernel size n_k, and max pooling, and two token-level features, the average and max pooling of H^1, to build a comprehensive multi-granularity review representation v_{u,r}:

    v_{u,r} = [h^1_{[CLS]}; h_{cnn}; MaxPool(H^1); AvgPool(H^1)],
    h_{cnn} = MaxPool(ReLU(Conv1D(H^1))).
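A sketch of this multi-granularity implicit review representation, assuming H^1 of shape (L, d_f); the channel and kernel sizes are placeholders rather than the paper's tuned values.

```python
import torch
import torch.nn as nn

class ImplicitReviewEncoder(nn.Module):
    """Builds v_{u,r} = [h_[CLS]; h_cnn; MaxPool(H^1); AvgPool(H^1)] from H^1."""
    def __init__(self, d_f=128, n_c=64, n_k=3):
        super().__init__()
        self.conv = nn.Conv1d(d_f, n_c, kernel_size=n_k, padding=n_k // 2)

    def forward(self, h1):                        # h1: (L, d_f), position 0 is [CLS]
        h_cls = h1[0]                             # global feature, (d_f,)
        conv_out = torch.relu(self.conv(h1.t().unsqueeze(0)))   # (1, n_c, L)
        h_cnn = conv_out.max(dim=-1).values.squeeze(0)          # local context, (n_c,)
        h_max = h1.max(dim=0).values              # token-level max pooling, (d_f,)
        h_avg = h1.mean(dim=0)                    # token-level average pooling, (d_f,)
        return torch.cat([h_cls, h_cnn, h_max, h_avg])          # (3*d_f + n_c,)

v_ur = ImplicitReviewEncoder()(torch.randn(12, 128))
```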
We apply review-wise aggregation without aspects for the latent review embedding v_u:

    β_{u,r} = \frac{\exp(\tanh(w_{im}^T v_{u,r}))}{\sum_{r' ∈ R_u} \exp(\tanh(w_{im}^T v_{u,r'}))},
    v_u = \sum_{r ∈ R_u} β_{u,r} v_{u,r},

where β_{u,r} is the counterpart of α^{(a)}_{u,r} in the implicit channel, w_{im} ∈ R^{d_{im}} is a trainable parameter, and d_{im} = 3d_f + n_c. Using similar steps, we can also obtain v_t for the item implicit embeddings.

Rating regression and optimization. The implicit features v_u and v_t and the explicit features G_u and G_t compose the input to the rating predictor to estimate the score s_{u,t} by

    ŝ_{u,t} = b_u + b_t + F_{im}([v_u; v_t]) + ⟨γ, F_{ex}([G_u; G_t])⟩,

where b_u and b_t are bias terms, F_{im}([v_u; v_t]) is the implicit feature term, and ⟨γ, F_{ex}([G_u; G_t])⟩ is the explicit feature term. F_{im}: R^{2d_{im}} → R and F_{ex}: R^{2d_f×k} → R^k are multi-layer fully-connected neural networks with ReLU activation and dropout to avoid overfitting. They model user attention and item property interactions in the implicit and explicit channels, respectively. ⟨·,·⟩ denotes the inner product. γ ∈ R^k and {b_u, b_t} ∈ R are trainable parameters. The optimization objective for the trainable parameter set Θ, with an L2 regularization weighted by λ, is

    J(Θ) = \sum_{r_{u,t} ∈ R} (s_{u,t} - ŝ_{u,t})^2 + L2\text{-reg}(λ).

J(Θ) is optimized by back-propagation learning methods such as Adam (Kingma and Ba, 2014).
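A sketch of this prediction head under stated assumptions: F_ex is read here as an MLP applied aspect-wise to the concatenated user and item aspect columns, and the L2 term is approximated with Adam's weight decay; dimensions, dropout rate, and names are illustrative.

```python
import torch
import torch.nn as nn

class RatingHead(nn.Module):
    """Combines implicit (v_u, v_t) and explicit (G_u, G_t) features into s_hat."""
    def __init__(self, d_im=448, d_f=128, k=20):
        super().__init__()
        self.f_im = nn.Sequential(nn.Linear(2 * d_im, d_im), nn.ReLU(),
                                  nn.Dropout(0.2), nn.Linear(d_im, 1))
        self.f_ex = nn.Sequential(nn.Linear(2 * d_f, d_f), nn.ReLU(),
                                  nn.Dropout(0.2), nn.Linear(d_f, 1))
        self.gamma = nn.Parameter(torch.ones(k))   # per-aspect contribution weights

    def forward(self, v_u, v_t, G_u, G_t, b_u, b_t):
        implicit = self.f_im(torch.cat([v_u, v_t])).squeeze(-1)
        # apply F_ex aspect-wise over the k concatenated user/item aspect columns
        per_aspect = self.f_ex(torch.cat([G_u, G_t], dim=0).t()).squeeze(-1)   # (k,)
        return b_u + b_t + implicit + torch.dot(self.gamma, per_aspect)

# Training objective: squared error; L2 regularization via weight decay, e.g.
# loss = (s_ut - s_hat).pow(2).mean()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=lam)
```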
4 Experiments

4.1 Experimental Setup

Datasets. We use seven datasets from the Amazon Review Datasets (He and McAuley, 2016)³: AutoMotive (AM), Digital Music (DM), Musical Instruments (MI), Pet Supplies (PS), Sports and Outdoors (SO), Toys and Games (TG), and Tools and Home improvement (TH). Their statistics are shown in Table 2. We use 8:1:1 as the train, validation, and test ratio for all experiments. Users and items with fewer than 5 reviews and reviews with fewer than 5 words are removed to reduce data sparsity.

Baseline models. Thirteen baselines in the traditional and deep learning categories are compared with the proposed framework. The pre-deep-learning traditional approaches predict ratings solely based upon the entity IDs. Table 3 introduces their basic profiles, which are extended in Section A.3.3†. Specifically, AHN-B refers to AHN using pretrained BERT as the input embedding encoder. It is included to test the impact of the input encoders.

Evaluation metric. We use Mean Square Error (MSE) for performance evaluation. Given a test set R_test, the MSE is defined by

    MSE = \frac{1}{|R_{test}|} \sum_{(u,r) ∈ R_{test}} (ŝ_{u,r} - s_{u,r})^2.

Reproducibility. We provide instructions to reproduce the AS-pair extraction of ASPE and the rating prediction of the baselines and APRE in Section A.3.1†. The source code of our models is publicly available on GitHub⁴.

³ https://jmcauley.ucsd.edu/data/amazon
⁴ https://github.com/zyli93/ASPE-APRE
Dataset | Abbr. | #Reviews | #Users | #Items | Density | Ttl. #W | #R/U | #R/T | #W/R
AutoMotive | AM | 20,413 | 2,928 | 1,835 | 3.419×10^-3 | 1.77M | 6.274 | 10.011 | 96.583
Digital Music | DM | 111,323 | 14,138 | 11,707 | 6.053×10^-4 | 5.69M | 7.087 | 8.558 | 56.828
Musical Instruments | MI | 10,226 | 1,429 | 900 | 7.156×10^-3 | 0.96M | 6.440 | 10.226 | 103.958
Pet Supplies | PS | 157,376 | 19,854 | 8,510 | 8.383×10^-4 | 14.23M | 7.134 | 16.644 | 100.469
Sports and Outdoors | SO | 295,434 | 35,590 | 18,357 | 4.070×10^-4 | 26.38M | 7.471 | 14.484 | 99.199
Toys and Games | TG | 167,155 | 19,409 | 11,924 | 6.500×10^-4 | 17.16M | 7.751 | 12.616 | 114.047
Tools and Home improv. | TH | 134,129 | 16,633 | 10,217 | 7.103×10^-4 | 15.02M | 7.258 | 11.815 | 124.429

Table 2: The statistics of the seven real-world datasets. (W: Words; U: Users; T: iTems; R: Reviews.)

Model | Reference | Cat. | U/T ID | Review
MF | - | Trad. | X |
WRMF | Hu et al. (2008) | Trad. | X |
FM | Rendle (2010) | Trad. | X |
ConvMF | Kim et al. (2016) | Deep | X | X
NeuMF | He et al. (2017b) | Deep | X |
D-CNN | Zheng et al. (2017) | Deep | | X
D-Attn | Seo et al. (2017) | Deep | | X
NARRE | Chen et al. (2018) | Deep | X | X
ANR | Chin et al. (2018) | Deep | | X
MPCN | Tay et al. (2018) | Deep | X | X
DAML | Liu et al. (2019) | Deep | | X
AHN | Dong et al. (2020) | Deep | X | X
AHN-B | Same as AHN | Deep | X | X

Table 3: Basics of the compared baselines. Models' input is marked by "X". "U" and "T" denote Users and iTems. D-CNN represents DeepCoNN. AHN-B denotes the variant of AHN with BERT embeddings.

AM | DM | MI | PS | SO | TG | TH
ItemTok | song | ItemTok | ItemTok | ItemTok | ItemTok | ItemTok
product | ItemTok | sound | dog | knife | toy | light
time | album | guitar | food | quality | game | tool
car | music | string | cat | product | piece | quality
look | time | quality | toy | size | quality | price
price | sound | tone | time | price | child | product
quality | voice | price | product | look | color | bulb
light | track | pedal | price | bag | part | battery
oil | lyric | tuner | treat | fit | fun | size
battery | version | cable | water | light | size | flashlight

Table 4: High-frequency aspects of the corpora.

4.2 AS-pair Extraction of ASPE

We present the extraction performance of the unsupervised ASPE. The distributions of the frequencies of extracted AS-pairs in Figure 5 follow the trend of Zipf's Law with a deviation common to natural languages (Li, 1992), meaning that ASPE performs consistently across domains. We show the qualitative results of term extraction separately.

[Figure 4: Sources of sentiment terms from AM (Venn diagram of the PMI-based, NN-based, and Lexicon methods). Figure 5: Frequency rank vs. frequency of AS-pairs across the seven datasets.]

Sentiment terms. Generally, the AS-pair statistics given in Table 9† on the different datasets are quantitatively consistent with the data statistics in Table 2 regardless of domain. Figure 4 is a Venn diagram showing the sources of the sentiment terms extracted by ASPE from AM. All three methods are efficacious and contribute uniquely, which can also be verified by Table 10† in Section A.3.2†.

Aspect terms. Table 4 presents the most frequent aspect terms of all datasets. ItemTok is ranked top as users tend to describe overall feelings about items. Domain-specific terms (e.g., car in AM) and general terms (e.g., price, quality, and size) are intermingled, illustrating the comprehensive coverage and the high accuracy of the results of ASPE.
4.3 Rating Prediction of APRE

Comparisons with baselines. For the task of review-based rating prediction, a percentage increase above 1% in performance is considered significant (Chin et al., 2018; Tay et al., 2018). According to Table 5, our model outperforms all baseline models, including AHN-B, on all datasets by a minimum of 1.337% on MI and a maximum of 4.061% on TG, which are significant improvements. This demonstrates (1) the superior capability of our model to make accurate rating predictions in different domains (Ours vs. the rest), and (2) that the performance improvement is NOT because of the use of BERT (Ours vs. AHN-B). AHN-B underperforms the original word2vec-based AHN⁵ because the weights of the word2vec vectors are trainable while the BERT embeddings are fixed, which reduces the parameter capacity. Within the baseline models, deep-learning-based models are generally stronger than entity ID-based traditional methods, and recent ones tend to perform better.

⁵ The authors of AHN also confirmed this observation.

Models | AM | DM | MI | PS | SO | TG | TH
Traditional models
MF | 1.986 | 1.715 | 2.085 | 2.048 | 2.084 | 1.471 | 1.631
WRMF | 1.327 | 0.537 | 1.358 | 1.629 | 1.371 | 1.068 | 1.216
FM | 1.082 | 0.436 | 1.146 | 1.458 | 1.212 | 0.922 | 1.050
Deep learning-based models
ConvMF | 1.046 | 0.407 | 1.075 | 1.458 | 1.026 | 0.986 | 1.104
NeuMF | 0.901 | 0.396 | 0.903 | 1.294 | 0.893 | 0.841 | 1.072
D-Attn | 0.816 | 0.403 | 0.835 | 1.264 | 0.897 | 0.887 | 0.980
D-CNN | 0.809 | 0.390 | 0.861 | 1.250 | 0.894 | 0.835 | 0.975
NARRE | 0.826 | 0.374 | 0.837 | 1.425 | 0.990 | 0.908 | 0.958
MPCN | 0.815 | 0.447 | 0.842 | 1.300 | 0.929 | 0.898 | 0.969
ANR | 0.806 | 0.381 | 0.845 | 1.327 | 0.906 | 0.844 | 0.981
DAML | 0.829 | 0.372 | 0.837 | 1.247 | 0.893 | 0.820 | 0.962
AHN-B | 0.810 | 0.385 | 0.840 | 1.270 | 0.896 | 0.829 | 0.976
AHN | 0.802 | 0.376 | 0.834 | 1.252 | 0.887 | 0.822 | 0.967
Our models and percentage improvements
Ours | 0.791 | 0.359 | 0.823 | 1.218 | 0.863 | 0.788 | 0.936
∆(%) | 1.390 | 3.621 | 1.337 | 2.381 | 2.784 | 4.061 | 2.350
Val. | 0.790 | 0.362 | 0.821 | 1.216 | 0.860 | 0.790 | 0.933
Ablation studies
w/o EX | 0.814 | 0.379 | 0.833 | 1.244 | 0.882 | 0.796 | 0.965
w/o IM | 0.798 | 0.374 | 0.863 | 1.226 | 0.873 | 0.798 | 0.956

Table 5: MSE of the baselines, our model (Ours for test and Val. for validation), and variants. The row of ∆ calculates the percentage improvements over the best baselines. All reported improvements over the best baselines are statistically significant with p-value < 0.01.

Ablation study. Ablation studies answer the question of which channel, explicit or implicit, contributes to the superior performance and to what extent. We measure their contributions with the rows w/o EX and w/o IM in Table 5. w/o EX presents the best MSEs of an APRE variant without explicit features under the default settings; the impact of AS-pairs is nullified. w/o IM, in contrast, shows the best MSEs of an APRE variant only leveraging the explicit channel while removing the implicit one. We observe that the optimal performances of the single-channel variants all fall behind those of the dual-channel model, which reflects positive contributions from both channels. w/o IM has lower MSEs than w/o EX on several datasets, showing that the explicit channel can supply comparatively more performance improvement than the implicit channel. It also suggests that the costly latent review encoding can be less effective than aspect-sentiment level user and item profiling, which is a useful finding.

Hyper-parameter sensitivity. A number of hyper-parameter settings are of interest, e.g., dropout, learning rate (LR), internal feature dimensions (d_a, d_f, n_c, and n_k), and the regularization weight λ of the L2-reg in J(Θ). We run each set of sensitivity experiments 10 times and report the average performance. We tune the dropout rate in [0, 0.1, 0.2, 0.3, 0.4, 0.5] and the LR⁶ in [0.0001, 0.0005, 0.001, 0.005, 0.01] with the other hyper-parameters set to default, and report in Figure 6 the minimum MSEs and the epoch numbers (Ep.) on AM. For dropout, we find the balance between its effects on avoiding overfitting and on reducing active parameters at 0.2. Larger dropouts need more training epochs. For LR, we also target a balance between the training instability of large LRs and the overfitting concern of small LRs; thus 0.001 is selected. Larger LRs plateau earlier with fewer epochs while smaller LRs plateau later with more. Figure 7† analyzes hyper-parameter sensitivities to changes of the internal feature dimensions (d_a, d_f, and n_c), the CNN kernel size n_k, and the L2-reg weight λ.

[Figure 6: Hyper-parameter search and sensitivity. (a) Dropout rate vs. MSE and number of epochs. (b) Initial learning rate vs. MSE and number of epochs.]

⁶ The reported LRs are initial values since Adam and a LR scheduler adjust them dynamically along the training.

Efficiency. A brief run time analysis of APRE is given in Table 6. The model runs fast with all data in GPU memory, such as on AM and MI, which demonstrates the efficiency of our model and the room for improvement in run time on datasets that cannot fit in the GPU memory. The efficiency of ASPE is less critical since it only runs once for each dataset.

AM | DM | MI | PS | SO | TG | TH
127s* | 31min | 90s* | 36min | 90min | 51min | 35min

Table 6: Per-epoch run time of APRE on the seven datasets. The run time of AM and MI, denoted by "*", is disproportional to their sizes since they can fit into the GPU memory for acceleration.

4.4 Case Study for Interpretation

Finally, we showcase an interpretation procedure of the rating estimation for an instance in AM: how does APRE predict u*'s rating for a smart driving assistant t* using the output AS-pairs of ASPE? We select seven example aspect categories with all review snippets mentioning those categories. Each category is a set of similar aspect terms, e.g., {look, design} and {beep, sound}. Without loss of generality, we refer to the categories as aspects.

Table 7 presents the aspects and review snippets given by u* and received by t*, with AS-pair annotations. Three aspects, {battery, install, look}, are shared. Each side has two unique aspects never mentioned by the reviews of the other side: {material, smell} of u* and {price, sound} of t*.

From reviews given by user u* (all aspects are attended, ✓):
- battery [To t1]: "After leaving this attached to my car for two days of non-use I have a dead battery. Never had a dead battery ..., so I am blaming this device."
- install [To t2]: "This was unbelievably easy to install. I have done ... The real key ... the installation is so easy." [To t3]: "There were many installation options, but once ..., they clicked on easily."
- look [To t3]: "It was not perfect and not shiny, but it did look better." [To t4]: "It takes some elbow grease, but the results are remarkable."
- material [To t5]: "The plastic however is very thin and the cap is pretty cheap." [To t6]: "Great value. ... They are very hard plastic, so they don't mark up panels."
- smell [To t7]: "This has a terrible smell that really lingers awhile. It goes on green. ..."

From reviews received by item t*:
- battery [From u1]: "The reason this won't work on an iPhone 4 or ... because it uses low power Bluetooth, ..." (✗)
- install [From u2]: "Your mileage and gas mileage and cost of fuel is tabulated for each trip- Installation is pretty simple - but it ..." (✓)
- look [From u3]: "Driving habits, fuel efficiency, and engine health are nice features. The overall design is nice and easy to navigate." (✓)
- price [From u4]: "In fact, there are similar products to this available at a much lower price that do work with ..." (✗)
- sound [From u5]: "The Link device makes an audible sound when you go over 70 mpg, brake hard, or accelerate too fast." (✓) [From u6]: "Also, the beep the link device makes ... sounds really cheapy." (✗)

Table 7: Examples of reviews given by u* and received by t* with Aspect-Sentiment pair mentions as well as other sentiment evidence on seven example aspects.

APRE measures the aspect-level contributions of user-attention and item-property interactions by the last term of the s_{u,t} prediction, i.e., ⟨γ, F_ex([G_u; G_t])⟩. The contribution of the i-th aspect is calculated by the i-th dimension of γ times the i-th value of F_ex([G_u; G_t]), which is shown in Table 8. The top two rows summarize the attentions of u* and the properties of t*. Inferred Impact states the interactional effects of user attentions and item properties based on our assumption that attended aspects bear stronger impacts on the final prediction.

Aspects | material | smell | battery | install | look | price | sound
Attn. of u* | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | n/a
Prop. of t* | n/a | n/a | ✗ | ✓ | ✓ | ✗ | ✓/✗
Inferred Impact | Unk. | Unk. | Neg. | Pos. | Pos. | Unk. | Unk.
γ_i F_ex(·)_i (×10^-2) | 1.0 | 0.8 | -0.8 | 1.9 | 1.5 | 0.9 | 0.6

Table 8: Attention and property summaries, inferred impacts, and the learned aspect-level contributions.

On the overlapping aspects, the inferior property of battery produces the only negative score (-0.008), whereas the advantages on install and look create positive scores (0.019 and 0.015), which is consistent with the inferred impact. Other aspects, either unknown to the user attentions or to the item properties, contribute relatively less: t*'s unappealing price accounts for the small score 0.009 and the mixed property of sound accounts for the 0.006.

This case study demonstrates the usefulness of the numbers that add up to ŝ_{u,t}. Although small in scale, they carry significant information about the valued or disliked aspects in u*'s perception of t*. This process of decomposition is a great way to interpret the model prediction at an aspect-level granularity, which is a capacity that the other baseline models do not enjoy.

In Section A.3.5†, another case study indicates that a certain imperfect item property without user attention affects the rating only inconsiderably, although the aspect is mentioned by the user's reviews.
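A tiny sketch of the decomposition used in this case study: each aspect's contribution is the element-wise product γ_i · F_ex(·)_i. The names are illustrative and the numbers are toy values chosen only to reproduce the scale of Table 8.

```python
import torch

def aspect_contributions(gamma, f_ex_out, aspect_names):
    """Decompose the explicit term <gamma, F_ex([G_u; G_t])> into per-aspect scores."""
    scores = gamma * f_ex_out            # element-wise: gamma_i * F_ex(.)_i
    return dict(zip(aspect_names, scores.tolist()))

contrib = aspect_contributions(
    gamma=torch.tensor([0.5, 0.5, 0.4]),
    f_ex_out=torch.tensor([0.02, 0.038, -0.02]),
    aspect_names=["material", "install", "battery"],
)
# e.g., {'material': 0.01, 'install': 0.019, 'battery': -0.008}
```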
5 Conclusion

In this work, we propose a tightly coupled two-stage review-based rating predictor, consisting of an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). ASPE extracts aspect-sentiment pairs (AS-pairs) from reviews, and APRE learns explicit user attentions and item properties as well as implicit sentence semantics to predict the rating. Extensive quantitative and qualitative experimental results demonstrate that ASPE accurately and comprehensively extracts AS-pairs without using domain-specific training data and that APRE outperforms the state-of-the-art recommender frameworks and explains the prediction results by taking advantage of the extracted AS-pairs.

Several challenges are left open, such as fully or weakly supervised open-domain AS-pair extraction and an end-to-end design for joint AS-pair extraction and rating prediction. We leave these problems for future work.

Acknowledgement

We would like to thank the reviewers for their helpful comments. The work was partially supported by NSF DGE-1829071 and NSF IIS-2106859.
Broader Impact Statement

This paper proposes a rating prediction model that has great potential to be widely applied to recommender systems with reviews due to its high accuracy. In the meantime, it tries to relieve the unjustifiability issue of black-box neural networks by suggesting what aspects of an item a user may feel satisfied or dissatisfied with. The recommender system can better understand the rationale behind users' reviews so that the merits of items can be carried forward while the defects can be fixed. As far as we are concerned, this is the first work that takes care of both rating prediction and rationale understanding utilizing NLP techniques.

We then address the generalizability and deployment issues. The reported experiments are conducted on different domains in English with distinct review styles and diverse user populations. We observe that our model performs consistently, which supports its generalizability. Ranging from smaller to larger datasets, we have not noticed any potential deployment issues. Instead, we notice that stronger computational resources can greatly speed up the training and inference and scale up the problem size while keeping the major execution pipeline unchanged.

In terms of potential harms and misuses, we believe they and their consequences involve two perspectives: (1) the harm of generating inaccurate or suboptimal results from this recommender, and (2) the risk of misuse (attack) of this model to reveal user identity. For point (1), the potential risk of suboptimal results has little impact on the major function of online shopping websites since recommenders are only in charge of suggestive content. For point (2), our model does not involve user and item ID modeling. Also, we aggregate the user reviews in the representation space so that user identity is hard to infer through reverse-engineering attacks. In all, we believe our model has little risk of causing dysfunction of online shopping platforms or leakage of user identities.

References

Stefanos Angelidis and Mirella Lapata. 2018. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation.

Konstantin Bauman, Bing Liu, and Alexander Tuzhilin. 2017. Aspect based recommendations: Recommending items with the most valuable aspects based on user reviews. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research.

Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference.

Shaowei Chen, Jie Liu, Yu Wang, Wenzheng Zhang, and Ziming Chi. 2020a. Synchronous double-channel recurrent network for aspect-opinion pair extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Xiao Chen, Changlong Sun, Jingjing Wang, Shoushan Li, Luo Si, Min Zhang, and Guodong Zhou. 2020b. Aspect sentiment classification with document-level sentiment preference modeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Jin Yao Chin, Kaiqi Zhao, Shafiq Joty, and Gao Cong. 2018. ANR: Aspect-based neural recommender. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management.

Hongliang Dai and Yangqiu Song. 2019. Neural aspect and opinion term extraction with mined rules as weak supervision. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT.

Xin Dong, Jingchao Ni, Wei Cheng, Zhengzhang Chen, Bo Zong, Dongjin Song, Yanchi Liu, Haifeng Chen, and Gerard de Melo. 2020. Asymmetrical hierarchical networks with attentive interactions for interpretable review-based recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence.

Chunning Du, Haifeng Sun, Jingyu Wang, Qi Qi, and Jianxin Liao. 2020. Adversarial and domain-aware BERT for cross-domain sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Xinyu Guan, Zhiyong Cheng, Xiangnan He, Yongfeng Zhang, Zhibo Zhu, Qinke Peng, and Tat-Seng Chua. 2019. Attentive aspect modeling for review-aware recommendation. ACM Transactions on Information Systems (TOIS), 37(3):1–27.