Dual Preference Distribution Learning for Item Recommendation
XUE DONG, School of Software, Shandong University, China
XUEMENG SONG*, School of Computer Science and Technology, Shandong University, China
NA ZHENG, School of Computer Science and Technology, Shandong University, China
YINWEI WEI, School of Computing, National University of Singapore, Singapore
ZHONGZHOU ZHAO, DAMO Academy, Alibaba Group, China
HONGJUN DAI, School of Software, Shandong University, China

arXiv:2201.09490v2 [cs.IR] 26 May 2022

Abstract: Recommender systems automatically recommend to users items that they are likely to prefer; the goal is to represent the user and the item and to model their interaction. Existing methods have primarily learned the user's preferences and the item's features as vectorized representations, and modeled the user-item interaction by the similarity of these representations. In fact, a user's different preferences are related, and such relations can help to better understand the user's preferences. Toward this end, we propose to represent the user's preference with a multivariate Gaussian distribution, and to model the user-item interaction by the probability density at the item in the user's preference distribution. In this manner, the mean vector of the Gaussian distribution captures the center of the user's preferences, while its covariance matrix captures the relations among these preferences. In particular, we propose a dual preference distribution learning framework (DUPLE), which captures the user's preferences to both items and attributes, each with a Gaussian distribution. As a byproduct, identifying the user's preference to specific attributes enables us to explain why an item is recommended to the user. Extensive quantitative and qualitative experiments on six public datasets demonstrate the effectiveness of the proposed DUPLE method.

CCS Concepts: • Information systems → Retrieval models and ranking; Recommender systems; • Mathematics of computing → Distribution functions; Computing most probable explanation.

Additional Key Words and Phrases: Recommender System, Preference Distribution Learning, Explainable Recommendation.

ACM Reference Format: Xue Dong, Xuemeng Song, Na Zheng, Yinwei Wei, Zhongzhou Zhao, and Hongjun Dai. 2018. Dual Preference Distribution Learning for Item Recommendation. J. ACM 37, 4, Article 111 (August 2018), 23 pages. https://doi.org/XXXXXXX.XXXXXXX

* Corresponding author.

Authors' addresses: Xue Dong, dongxue.sdu@gmail.com, School of Software, Shandong University, Jinan, China, 250101; Xuemeng Song, sxmustc@gmail.com, School of Computer Science and Technology, Shandong University, Qingdao, China, 266237; Na Zheng, zhengnagrape@gmail.com, School of Computer Science and Technology, Shandong University, Qingdao, China, 266237; Yinwei Wei, weiyinwei@hotmail.com, School of Computing, National University of Singapore, Singapore; Zhongzhou Zhao, zhongzhou.zhaozz@alibaba-inc.com, DAMO Academy, Alibaba Group, Zhejiang, China, 311121; Hongjun Dai, dahogn@sdu.edu.cn, School of Software, Shandong University, Jinan, China, 250101.
1 INTRODUCTION

To save users from having to sift through items on the web by themselves, recommender systems have attracted much research attention [21, 25, 30, 32, 44, 47]. Specifically, these systems aim to recommend to users items that they probably prefer, according to their historical interactions with items. The key to fulfilling this goal is to effectively represent the user and the item, and to predict their potential interactions on that basis [46]. Existing studies have primarily represented the item and the user with vectorized representations, where the different dimensions of the item representation capture the item's different features, while those of the user representation indicate the user's corresponding preferences to these features. Accordingly, the user-item interaction can be modeled by the similarity of their representations via techniques such as the inner product [21, 46], Euclidean distance [42], and non-linear neural networks [16].

In a sense, with either the inner product or the Euclidean distance, each dimension of the user and item representations is treated separately and equally. Nevertheless, the user's preferences to different item features are in fact related. For example, if a user has watched many action movies, most of which star the actor Jackie Chan, then we can learn that the user's preferences to the features action movie and Jackie Chan are positively related. With non-linear neural networks [16], the relations among different dimensions can be learned implicitly via multiple fully-connected layers, but it is hard to exhibit what the networks actually learn. Therefore, in this work, we aim to explicitly model the relations among the user's preferences to different features for better recommendation. In particular, we propose to represent the user's preferences with a multivariate Gaussian distribution rather than a vectorized representation. In this manner, the mean vector of the Gaussian distribution captures the center of the user's preferences, and its covariance matrix captures the relations among these preferences. However, this is non-trivial due to the following two challenges. 1) How to accurately model the user's preference based on the Gaussian distribution is one major challenge. 2) The user's preferences to items usually come from his/her judgements towards the item's attributes (e.g., genres of a movie) [31, 58]. Therefore, how to seamlessly integrate the user's preferences to the general item and the specific attribute to strengthen the recommendation in terms of both quality and interpretability constitutes another crucial challenge.

To address these challenges, we propose a DUal Preference distribution LEarning framework (DUPLE), shown in Fig. 1, which jointly learns the user's preferences to the general item and the specific attribute, each with a probabilistic distribution. Notably, in our context, we name the user's preferences to the general item and the specific attribute as the general and specific preferences, respectively. As illustrated in Fig. 1, DUPLE consists of two key components: the parameter construction and general-specific transformation modules. In particular, to capture the user's general preference, we learn a general preference distribution by the parameter construction module, which constructs the essential parameters, i.e., the mean vector and covariance matrix, based on the user's ID embedding.
We regard the probability density of the item's ID embedding in this distribution as the user's general preference. To capture the user's specific preference, we first design a general-specific transformation module that predicts the item's attribute representation (depicting what attributes the item has) from its ID embedding, supervised by the items' attribute labels. This transformation, bridging the gap between the general item and its specific attributes, enables us to derive a specific preference distribution from the general one (the detailed mathematical proof is given in Section 3.3). Similarly, the probability density of the item's attribute representation in the specific preference distribution can be regarded as the user's specific preference to the attributes. Ultimately, we fuse the user's general and specific preferences with a linear combination to predict the user-item interaction, based on which we can perform the item recommendation.
Fig. 1. Illustration of the proposed dual preference distribution learning framework (DUPLE), which jointly learns the user's preferences to the general item (blue flow) and the specific attribute (green flow) for better recommendation.

Extensive experiments on six datasets derived from the Amazon Product [30] and MovieLens [13] collections demonstrate the superiority of our DUPLE over state-of-the-art methods. Besides, DUPLE can further infer what attributes the user prefers, and explain why the model recommends an item to the user. To be more specific, since the specific preference distribution learns the user's preferences to item attributes, we can summarize a preferred attribute profile for the user. Accordingly, the overlap between the attributes that the user prefers and those an item has can serve as the explanation for recommending the item to the user. We summarize our main contributions as follows:

• We propose a dual preference distribution learning framework (DUPLE) that captures the user's preferences to both the general item and specific attributes with probabilistic distributions. To the best of our knowledge, this is the first attempt to capture the relations of the user's preferences with the covariance matrix of the distribution. Learning such information, we can better model the user's preferences and gain better recommendation performance.
• Benefiting from learning the user's specific preference distribution, we can summarize the preferred attributes of the user, and on that basis provide the explanation for the recommendation, namely, the reasons why we recommend the item to the user.
• Extensive qualitative and quantitative experiments conducted on six real-world datasets have proven both the effectiveness and the explainability of the proposed DUPLE method. We will release our code to facilitate other researchers.

2 RELATED WORK

In this section, we briefly introduce traditional recommender systems in Subsection 2.1 and explainable recommender systems in Subsection 2.2.

2.1 Recommender Systems

Early studies utilize Collaborative Filtering techniques [37] to capture the user's preferences from the interaction relationships between users and items. Matrix Factorization (MF) is the most popular method [21, 22, 33].
It focuses on factorizing the user rating matrix into a user matrix and an item matrix, and predicting the user-item interaction by the similarity of their representations. Considering that different users may have different rating habits, Koren et al. [21] introduced user and item biases into matrix factorization, achieving better performance. Several other researchers argued that different users may have similar preferences to items. Consequently, Yang et al. [54] and Chen et al. [5] clustered users into several groups according to their historical item interactions and separately captured the common preferences of users in each group. Recently, owing to the strong performance of graph convolutional networks, many approaches have resorted to constructing a graph of users and items according to their historical interactions, and exploring the high-order connectivity of user-item interactions [41, 44, 46-48]. For example, Wang et al. [46] treated users and items as nodes and their interaction histories as edges between nodes, and then proposed a three-layer embedding propagation scheme to propagate messages from items to the user and then back to items. Wang et al. [47] designed an intent-aware interaction graph that disentangles the item representation into several factors to capture the user's different intents.

Beyond directly learning the user's preferences from historical user-item interactions, other researchers leverage the item's rich context information to capture the user's detailed preferences to items. Several of them incorporated the visual information of items to improve the recommendation performance [15, 30, 56]. For example, He et al. [15] enriched the item's representation by extracting a visual feature from its image with a CNN-based network and adding it into matrix factorization. Yang et al. [56] further highlighted the region of the item image that the user is probably interested in. Other studies explore the item attributes [2, 31, 58] or user reviews [6, 39, 52] to learn the user's preferences to the item's specific aspects. For example, Pan et al. [31] utilized the attribute representations as a regularization for learning the item's representation when modeling the user-item interaction. In this manner, the user's preferences to attributes can be tracked via the path from the user to the item and then to its attributes. Chen et al. [6] proposed a co-attentive multi-task learning model for recommending items to users and generating the user's reviews of items. Besides, multi-modal data has been proven to be important for recommendation [10, 23, 49]. For example, Wei et al. [49] constructed a user-item graph on each modality to learn the user's modal-specific preferences and incorporated all the preferences to predict the user-item interaction.

Although the approaches above are able to capture the user's preferences to items or to specific contents of the item (e.g., attributes), they fail to explicitly model the relations among the user's different preferences, which is beneficial for better recommendation. In this work, we propose to capture the user's preferences with a multivariate Gaussian distribution, and to model the user-item interaction with the probability density of the item in the user's preference distribution.
In this manner, the relations of the user's preferences can be captured by the covariance matrix of the distribution.

2.2 Explainable Recommender Systems

Recommender systems trained by deep neural networks are often perceived as black boxes that can only predict a recommendation. To make recommendations more transparent and trustworthy, explainable recommender systems [4, 58], which focus on both what to recommend and why, are therefore gaining popularity. Initial approaches could provide rough reasons for recommending an item based on similar items or users [34, 36]. Thereafter, researchers have attempted to provide users with more explicit reasons, e.g., explaining recommendations with item attributes, reviews, and reasoning paths.

Interpretation with Item Attributes. This group of explainable recommender systems considers that a user may like an item because of certain of its attributes, e.g., "you may like Harry Potter because it is an adventure movie".
Mainstream approaches in this research line have therefore been dedicated to bridging the gap between users and attributes [4, 17, 31, 55]. In particular, Wang et al. [45] employed a tree-based model to learn explicit decision rules from item attributes, and designed an embedding model to generalize to unseen decision rules on users and items. Benefiting from the attention mechanism, several researchers [4, 31] learned the item embedding as a fusion of its attribute embeddings. By checking the attention weights, these methods can infer how each attribute contributes to a high or low rating score.

Interpretation with Reviews. These methods leverage reviews as extra information about the user-item interaction, from which the user's attitude towards an item can be inferred [8, 9, 28, 35, 50]. In particular, several approaches aggregated the review texts of users/items and adopted the attention mechanism to learn the user/item embeddings [12, 28, 35, 50]. Based on the attention weights, these models can highlight the highly weighted words in reviews as explanations. Instead of highlighting review words as explanations, other studies attempt to automatically generate reviews for a user-item pair [6, 9, 24, 29]. Specifically, Costa et al. [9] designed a character-level recurrent neural network with long short-term memories that generates review explanations for a user-item pair. Li et al. [24] proposed a more comprehensive model to generate tips in review systems. Inspired by the human information processing model in cognitive psychology, Chen et al. [6] developed an encoder-selector-decoder architecture, which exploits the correlations between recommendation and explanation through co-attentive multi-task learning.

Interpretation with Reasoning Paths. This kind of approach constructs a user-item interaction graph and aims to find an explicit path on the graph that traces the decision-making process [1, 14, 18, 44, 51, 52]. In particular, Ai et al. [1] constructed a user-item knowledge graph of users, items, and multi-type relations (e.g., purchase and belong-to), and generated explanations by finding the shortest path from the user to the item. Wang et al. [44] proposed a Knowledge Graph Attention Network (KGAT) that explicitly models the high-order relations in the knowledge graph in an end-to-end manner. Xian et al. [51] proposed a reinforcement reasoning approach over knowledge graphs for interpretable recommendation, where an agent starts from a user and is trained to reach the correct items with high rewards. Further, considering that users and items have different intrinsic characteristics, He et al. [14] designed a two-stage representation learning algorithm for learning better representations of heterogeneous nodes. Yang et al. [57] proposed a Hierarchical Attention Graph Convolutional Network (HAGERec) that uses a hierarchical attention mechanism to exploit and adjust the contribution of each neighbor to a node in the knowledge graph.

Despite their achievements in explainable recommendation, existing approaches mainly adopt a discriminative method to provide an explanation, i.e., they produce an explanation for a given user-item pair. In fact, however, users select items according to their inherent preferences.
Therefore, we argue that an explainable recommender system should be generative: it should first summarize the user's preferences, so that the explanation can then be easily derived from the overlap between the user's preferences and the item's properties.

3 THE PROPOSED DUPLE MODEL

We now present our proposed dual preference distribution learning framework (DUPLE), which is illustrated in Fig. 1. It is composed of two key modules: 1) the parameter construction module, which learns a general preference distribution to capture the user's preferences to general items by constructing its essential parameters (i.e., the mean vector and covariance matrix); and 2) the general-specific transformation module, which bridges the gap between the item and its attributes and allows us to learn the user's specific preference distribution by transforming the parameters of the general preference distribution into their specific counterparts. In the rest of this section, we first briefly define our recommendation problem in Subsection 3.1. Then, we detail the two key modules in Subsections 3.2 and 3.3, respectively. Finally, we illustrate the user-item interaction modeling in Subsection 3.4, followed by the explainable recommendation in Subsection 3.5.
3.1 Problem Definition

Without losing generality, suppose that we have a set of users U, a set of items I, and a set of item attributes A that can be applied to describe all the items in I. Each user u ∈ U is associated with a set of items I_u that the user has historically liked. Each item i ∈ I is annotated with a set of attributes A_i. Following mainstream recommender models [16, 22, 47], we describe a user u (an item i) with an ID embedding vector e_u ∈ R^d (e_i ∈ R^d), where d denotes the embedding dimension. Besides, to describe the item with its attributes, we represent the item i by a bag-of-words embedding of its attributes a_i ∈ R^{|A|}, where the k-th element a_{i,k} = 1 indicates that the item has the k-th attribute in A.

Inputs: The inputs of the dual preference distribution learning framework (DUPLE) consist of three parts: the set of users U, the set of items I, and the set of item attributes A.

Outputs: Given a user u and an item i with its attributes A_i, DUPLE predicts the user-item interaction from both the general item and the specific attribute perspectives.

To improve readability, we declare the notations used in this paper. We use calligraphic letters (e.g., X) to represent sets, bold capital letters (e.g., X) and bold lowercase letters (e.g., x) to represent matrices and vectors, respectively, and non-bold letters (e.g., x) to denote scalars. The main notations are summarized in Table 1.

Table 1. Summary of the Main Notations.
  Notation               Explanation
  U, I, A                The sets of users, items, and attributes, respectively.
  I_u                    The set of historically interacted items of the user u ∈ U.
  A_i                    The set of attributes of the item i ∈ I.
  A_u                    The summarized preferred attribute profile of the user u.
  D                      The training set of triplets (u, i, j), u ∈ U, i ∈ I_u, j ∉ I_u.
  e_u, e_i               ID embeddings of the user u and the item i, respectively.
  a_i                    Bag-of-words embedding of the item i.
  μ_u, Σ_u               Mean vector and covariance matrix of the user u's general preference distribution.
  μ̂_u, Σ̂_u               Mean vector and covariance matrix of the user u's specific preference distribution.
  p^G_{u,i}, p^S_{u,i}   The general and specific preferences of the user u to the item i, respectively.
  p_{u,i}                The final preference of the user u to the item i.
  Θ                      The to-be-learned set of parameters.
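To make the attribute representation concrete, the following minimal Python sketch (a toy illustration with made-up attribute names, not code from the paper) builds the bag-of-words vector a_i described above.

```python
import numpy as np

# Toy attribute vocabulary standing in for the global set A.
attribute_vocab = ["adventure", "animation", "comedy", "drama"]
attr_index = {a: k for k, a in enumerate(attribute_vocab)}

def bag_of_words(item_attributes):
    """Return a_i, whose k-th element is 1 iff the item has the k-th attribute in A."""
    a = np.zeros(len(attribute_vocab), dtype=np.float32)
    for attr in item_attributes:
        a[attr_index[attr]] = 1.0
    return a

print(bag_of_words({"adventure", "animation"}))  # -> [1. 1. 0. 0.]
```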
3.2 Parameter Construction Module

The user u's general preference distribution G(μ_u, Σ_u) captures his/her preferences to general items. More specifically, the mean vector μ_u refers to the center of the user's general preference, while the covariance matrix Σ_u stands for the relationship among the latent variables affecting the user's preferences to items. The parameter construction module constructs the mean vector and the covariance matrix separately, as they differ in form and mathematical properties.

In particular, the mean vector is a d-dimensional vector indicating the center of the general preference. Therefore, we simply adopt one fully connected layer to map the user's ID embedding to the mean vector of the general preference distribution μ_u ∈ R^d as follows,

    \mu_u = W_\mu e_u + b_\mu,    (1)

where W_μ ∈ R^{d×d} and b_μ ∈ R^d are the non-zero parameters that map the user's ID embedding.

As the covariance matrix is a symmetric and positive semi-definite matrix according to its mathematical properties, it is difficult to construct it directly from the user's ID embedding. Instead, we propose to first derive a low-rank matrix L_u ∈ R^{d×d'}, d' < d, from the user's ID embedding as a bridge, and then construct the covariance matrix through the following equation,

    \Sigma_u = L_u L_u^T,    (2)

where L_u = [l_1; l_2; ...; l_{d'}] is arranged by d' column vectors, each of which is derived by a column-specific transformation as follows,

    l_k = W_k e_u + b_k,  k = 1, 2, ..., d',    (3)

where W_k ∈ R^{d×d} and b_k ∈ R^d are the non-zero parameters that derive the k-th column of L_u. These parameters and the user's ID embedding (i.e., a non-zero vector) guarantee that l_k is a non-zero vector. With simple algebraic derivation, the covariance matrix derived according to Eqn. (2) is symmetric, as \Sigma_u^T = (L_u L_u^T)^T = L_u L_u^T = \Sigma_u, and positive semi-definite, as x^T \Sigma_u x = x^T L_u L_u^T x = ||L_u^T x||^2 ≥ 0, ∀ x ≠ 0, x ∈ R^d.

3.3 General-Specific Transformation

The general-specific transformation module predicts the item's attribute representation from its ID embedding. Since it forms a bridge between the item and its attributes, we can derive the user's specific preference distribution from the general one by transforming its mean vector and covariance matrix to their specific counterparts. Compared with introducing another branch, analogous to the parameter construction module, that constructs the specific preference distribution anew (i.e., learning its parameters μ̂_u and Σ̂_u directly from the user's ID embedding), this design has the following benefits. (1) The user's general preference toward an item often comes from his/her specific judgments toward the item's attributes; it is therefore natural to derive the specific preference distribution by referring to the general one. (2) Learning the general and specific preferences of the user within one branch allows the knowledge learned by each to be mutually referenced. (3) It facilitates modeling the specific preference directly from the pure general preference, which enables testing cases that lack item attribute details.

In particular, following the approach in [31], we adopt a linear mapping as the general-specific transformation to predict the item i's attribute representation â_i from its ID embedding as follows,

    \hat{a}_i = M e_i,    (4)

where M ∈ R^{|A|×d} is the parameter of the general-specific transformation.

We adopt the ground-truth attribute label vector of the item to supervise the general-specific transformation learning. Specifically, we enforce the predicted attribute representation â_i to be as close as possible to its ground-truth attribute label vector a_i through the following objective function,

    L_A = - \sum_{i, j \in I, i \neq j} \log(\sigma(s(\hat{a}_i, a_i) - s(\hat{a}_i, a_j))),    (5)

where a_j is the ground-truth attribute label vector of another item j, and σ(·) denotes the sigmoid function. We estimate the similarity between the item i's attribute representation and a ground-truth attribute label vector with their cosine similarity, i.e., s(\hat{a}_i, a_i) = \hat{a}_i^T a_i / (||\hat{a}_i|| ||a_i||). By minimizing this objective function, the item i's attribute representation â_i is able to indicate what attributes the item has.
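To make the two learnable components concrete, the following PyTorch sketch (our reading of Eqns. (1)-(5), not the authors' released code; all class and variable names are ours) constructs μ_u and Σ_u from the user ID embedding and predicts â_i from the item ID embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DupleParameters(nn.Module):
    def __init__(self, d=64, d_prime=8, n_attrs=1000):
        super().__init__()
        self.mean_layer = nn.Linear(d, d)                         # Eqn. (1): mu_u = W_mu e_u + b_mu
        self.col_layers = nn.ModuleList(
            [nn.Linear(d, d) for _ in range(d_prime)])            # Eqn. (3): one mapping per column of L_u
        self.M = nn.Linear(d, n_attrs, bias=False)                # Eqn. (4): general-specific transformation

    def general_distribution(self, e_u):                          # e_u: (batch, d)
        mu_u = self.mean_layer(e_u)                               # (batch, d)
        L_u = torch.stack([layer(e_u) for layer in self.col_layers], dim=-1)  # (batch, d, d')
        sigma_u = L_u @ L_u.transpose(-1, -2)                     # Eqn. (2): symmetric and PSD by construction
        return mu_u, sigma_u

    def attribute_representation(self, e_i):                      # e_i: (batch, d)
        return self.M(e_i)                                        # a_hat_i, Eqn. (4)

def attribute_loss(a_hat_i, a_i, a_j):
    # Eqn. (5): pull a_hat_i towards its own attribute label a_i and away from another item's label a_j.
    s_pos = F.cosine_similarity(a_hat_i, a_i, dim=-1)
    s_neg = F.cosine_similarity(a_hat_i, a_j, dim=-1)
    return -torch.log(torch.sigmoid(s_pos - s_neg)).mean()
```

Note that with d' < d the product L_u L_u^T is rank-deficient, so for the density in Eqn. (7) below to be well defined an implementation would likely add a small diagonal term to Σ_u; the paper does not discuss this detail, so we flag it as an assumption.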
We now introduce how to construct the user u's specific preference distribution G(μ̂_u, Σ̂_u) from the general one based on the general-specific transformation. Different from the general preference distribution, each dimension of the specific preference distribution refers to a specific item attribute. Therefore, the mean vector μ̂_u ∈ R^{|A|}, i.e., the center of the user's specific preference, indicates what attributes the user prefers, and the covariance matrix Σ̂_u ∈ R^{|A|×|A|} stands for the relations among these preferences. Technically, we project the parameters of the general preference distribution, i.e., μ_u and Σ_u, into their specific counterparts as follows,

    \hat{\mu}_u = M \mu_u,  \quad  \hat{\Sigma}_u = M \Sigma_u M^T.    (6)

In this manner, the transformed parameters μ̂_u and Σ̂_u are exactly the mean vector and covariance matrix of the specific preference distribution. The rationality proof is given as follows.

• Argument 1. μ̂_u is the mean vector of the user u's specific preference distribution.
• Proof 1. Referring to the mathematical definition of the mean vector, and the additivity and homogeneity of the linear mapping, we have

    \hat{\mu}_u = M \mu_u = M \Big( \frac{1}{|I_u|} \sum_{i \in I_u} e_i \Big)
                = \frac{1}{|I_u|} \sum_{i \in I_u} M e_i
                = \frac{1}{|I_u|} \sum_{i \in I_u} \hat{a}_i.    □

• Argument 2. Σ̂_u is the covariance matrix of the user u's specific preference distribution.
• Proof 2. Referring to the mathematical definition of the covariance matrix, and the additivity and homogeneity of the linear mapping, we have

    \hat{\Sigma}_u = M \Sigma_u M^T
                   = M \Big( \frac{1}{|I_u|} \sum_{i \in I_u} (e_i - \mu_u)(e_i - \mu_u)^T \Big) M^T
                   = \frac{1}{|I_u|} \sum_{i \in I_u} M (e_i - \mu_u)(e_i - \mu_u)^T M^T
                   = \frac{1}{|I_u|} \sum_{i \in I_u} \big( M(e_i - \mu_u) \big)\big( M(e_i - \mu_u) \big)^T
                   = \frac{1}{|I_u|} \sum_{i \in I_u} (M e_i - M \mu_u)(M e_i - M \mu_u)^T
                   = \frac{1}{|I_u|} \sum_{i \in I_u} (\hat{a}_i - \hat{\mu}_u)(\hat{a}_i - \hat{\mu}_u)^T.    □
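The two arguments above are simply the standard pushforward of an empirical mean and covariance through a linear map; the small NumPy check below (not from the paper) verifies them numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_attrs, n_items = 8, 5, 200
M = rng.normal(size=(n_attrs, d))            # general-specific transformation, Eqn. (4)
E = rng.normal(size=(n_items, d))            # ID embeddings of one user's interacted items

mu = E.mean(axis=0)                          # empirical mean of the general embeddings
sigma = (E - mu).T @ (E - mu) / n_items      # empirical (biased) covariance, as in the proofs

A_hat = E @ M.T                              # predicted attribute representations a_hat_i = M e_i
mu_hat, sigma_hat = M @ mu, M @ sigma @ M.T  # Eqn. (6)

print(np.allclose(mu_hat, A_hat.mean(axis=0)))                                  # True
print(np.allclose(sigma_hat, (A_hat - mu_hat).T @ (A_hat - mu_hat) / n_items))  # True
```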
Algorithm 1 Dual Preference Distribution Learning Framework (DUPLE).
Input: The sets of users U and items I; the training set D; the set of attributes A_i of each item i ∈ I; the number of optimization steps T.
Output: The parameters Θ of DUPLE.
1: Initialize the user embeddings e_u, u ∈ U, the item embeddings e_i, i ∈ I, and the model parameters Θ.
2: for t in [1, ..., T] do
3:   Randomly draw a mini-batch of training triplets (u, i, j) from D.
4:   Calculate the mean vector μ_u and covariance matrix Σ_u of the general preference distribution through Eqns. (1) and (2), respectively.
5:   Calculate the mean vector μ̂_u and covariance matrix Σ̂_u of the specific preference distribution through Eqn. (6).
6:   Predict the user-item interaction through Eqn. (9).
7:   Update the parameters: Θ ← Θ − γ ∇_Θ L, where γ is the learning rate and L is defined in Eqn. (11).
8: end for

3.4 User-Item Interaction Modeling

We predict the user-item interaction by the probability density of the item in both the general and the specific preference distributions. In particular, in the user u's general preference distribution, the probability density of the item i's ID embedding serves as a proxy for the general preference p^G_{u,i} of the user u toward the item i. Formally, we define p^G_{u,i} as follows,

    p^G_{u,i} = P(e_i | \mu_u, \Sigma_u)
              = \frac{1}{\sqrt{(2\pi)^d |\Sigma_u|}} \exp\Big( -\frac{1}{2} (e_i - \mu_u)^T \Sigma_u^{-1} (e_i - \mu_u) \Big),    (7)

where |\Sigma_u| and \Sigma_u^{-1} are the determinant and the inverse of the covariance matrix of the user's general preference distribution defined in Eqn. (2), respectively. We define the specific preference p^S_{u,i} of the user u toward the item i as the probability density of the item i's attribute representation â_i in a similar form,

    p^S_{u,i} = P(\hat{a}_i | \hat{\mu}_u, \hat{\Sigma}_u)
              = \frac{1}{\sqrt{(2\pi)^{|A|} |\hat{\Sigma}_u|}} \exp\Big( -\frac{1}{2} (\hat{a}_i - \hat{\mu}_u)^T \hat{\Sigma}_u^{-1} (\hat{a}_i - \hat{\mu}_u) \Big).    (8)

Ultimately, the user-item interaction between the user u and the item i is defined as the linear combination of the two preferences,

    p_{u,i} = \lambda \, p^G_{u,i} + (1 - \lambda) \, p^S_{u,i},    (9)

where λ ∈ [0, 1] is a hyper-parameter adjusting the trade-off between the two terms.

Regarding the optimization of the preference modeling, we enforce the interaction between a user and his/her historically interacted item to be larger than that with items having no historical interaction, through the following objective function,

    L_R = - \sum_{(u, i, j) \in D} \log \Big( \frac{p_{u,i}}{p_{u,i} + p_{u,j}} \Big),    (10)

where D = {(u, i, j) | u ∈ U, i ∈ I_u, j ∉ I_u} is the set of training triplets, in which the item i is a preferred item of the user and the item j is an item the user has not interacted with. The triplet (u, i, j) indicates that the user u prefers the item i over the item j. It is worth noting that the number of un-interacted items paired with each preferred item can be increased.
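As a concrete illustration, the sketch below (our reconstruction of Eqns. (7)-(10); the exact form of the pairwise loss in Eqn. (10) is partly garbled in the source, so the fraction-style objective used here is our best reading, and all function names are ours) scores a user-item pair with the two Gaussian densities and forms the training objective.

```python
import math
import torch

def gaussian_density(x, mu, sigma):
    """Probability density of x under N(mu, sigma); x and mu: (d,), sigma: (d, d)."""
    d = x.shape[-1]
    diff = (x - mu).unsqueeze(-1)                                      # (d, 1)
    maha = (diff.transpose(-1, -2) @ torch.linalg.inv(sigma) @ diff).squeeze()
    norm = math.sqrt((2 * math.pi) ** d) * torch.sqrt(torch.linalg.det(sigma))
    return torch.exp(-0.5 * maha) / norm

def interaction(e_i, a_hat_i, mu_u, sigma_u, mu_hat_u, sigma_hat_u, lam=0.5):
    p_general = gaussian_density(e_i, mu_u, sigma_u)                   # Eqn. (7)
    p_specific = gaussian_density(a_hat_i, mu_hat_u, sigma_hat_u)      # Eqn. (8)
    return lam * p_general + (1.0 - lam) * p_specific                  # Eqn. (9)

def ranking_loss(p_pos, p_neg, eps=1e-12):
    # Eqn. (10): push the preferred item's score above the un-interacted item's score.
    return -torch.log(p_pos / (p_pos + p_neg) + eps)
```

In practice one would likely work with log-densities rather than raw densities for numerical stability, especially when the dimension is large; the paper does not state how this is handled, so that choice is left open here.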
Ultimately, the final objective function of the proposed method is defined as follows,

    L = \min_{\Theta} (L_A + L_R),    (11)

where L_A and L_R are defined in Eqns. (5) and (10), respectively, and Θ refers to the set of to-be-learned parameters of the proposed framework. The detailed training process of DUPLE is summarized in Algorithm 1.

Fig. 2. Illustration of the explanation generation of the proposed DUPLE. We first derive the user's preferred attribute profile from the mean vector of the learned specific preference distribution. We then use the overlapping attributes between the learned attribute profile and the recommended item's attributes as the explanation.

3.5 Explanation Generation

In addition to the item recommendation, DUPLE can, as a byproduct, explain why an item is recommended to a user, as shown in Fig. 2. In particular, in the specific preference distribution, each dimension refers to a specific attribute, and the mean vector indicates the center of the user's specific preference. Accordingly, we can infer what attributes the user likes and summarize a preferred attribute profile for the user. Technically, suppose that the user u prefers K attributes in total; we then define the preferred attribute profile A_u as follows,

    \{k_1, k_2, ..., k_K\} = \arg\max_K(\hat{\mu}_u), \quad \hat{\mu}_u \in R^{|A|},
    A_u = \{a_{k_1}, a_{k_2}, ..., a_{k_K}\},    (12)

where arg max_K(·) is the function returning the indices of the K largest elements of the vector, and a_k is the k-th attribute in A. After building the preferred attribute profile of the user u, for a given recommended item i associated with its attributes A_i, we can derive the reason why the user likes the item by checking the overlap between A_u and A_i. Suppose that A_u ∩ A_i = {a_m, a_n}; then we can provide the explanation for the user u as "you may like the item for its attributes a_m and a_n".

Besides, it is worth mentioning that the diagonal elements of the covariance matrix of the specific preference distribution can capture the significance of the user's different preferences to attributes. Specifically, if the value of the diagonal element in one dimension is small, a tiny change in that dimension will sharply affect the prediction of the user-item interaction, and we can infer that the user's preference to the attribute in the corresponding dimension is "strict", i.e., significant.
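A minimal sketch of this procedure (helper and variable names are ours, and the attribute vocabulary is assumed to be indexed consistently with μ̂_u) is given below.

```python
import numpy as np

def preferred_attribute_profile(mu_hat_u, attribute_vocab, K=5):
    """Eqn. (12): take the attributes with the K largest entries of the specific mean vector."""
    top_k = np.argsort(-mu_hat_u)[:K]
    return {attribute_vocab[k] for k in top_k}

def explain(mu_hat_u, item_attributes, attribute_vocab, K=5):
    """Explain a recommended item by the overlap between the profile A_u and the item's attributes A_i."""
    overlap = preferred_attribute_profile(mu_hat_u, attribute_vocab, K) & set(item_attributes)
    if overlap:
        return "You may like this item for its attributes: " + ", ".join(sorted(overlap)) + "."
    return None  # no overlapping attribute, so no attribute-level explanation is produced
```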
Table 2. Statistics of the six public datasets (after preprocessing), including the numbers of users (#user), items (#item), their interactions (#rating), and attributes (#attri.), as well as the density of each dataset.

  Amazon Product Dataset      #user    #item     #rating     #attri.   density
  Women's Clothing            19,972   285,508   326,968     1,095     0.01%
  Men's Clothing              4,807    43,832    70,723      985       0.03%
  Cell Phone & Accessories    9,103    51,497    132,422     1,103     0.03%

  MovieLens Dataset           #user    #item     #rating     #attri.   density
  MovieLens-small             579      6,296     48,395      761       1.33%
  MovieLens-1M                5,950    3,532     574,619     587       2.73%
  MovieLens-10M               66,028   10,254    4,980,475   489       0.74%

4 EXPERIMENTS

In this section, we first introduce the dataset details and experimental settings in Subsections 4.1 and 4.2, respectively. We then conduct extensive experiments to answer the following research questions: (1) Does DUPLE outperform the state-of-the-art methods? (2) How do different variants of the Gaussian distribution perform? (3) What are the learned relations among the user's preferences? (4) How is the explainable ability of DUPLE for item recommendation?

4.1 Dataset and Pre-processing

To verify the effectiveness of DUPLE, we adopted six public datasets of various sizes and densities: Women's Clothing, Men's Clothing, Cell Phones & Accessories, MovieLens-small, MovieLens-1M, and MovieLens-10M. The former three datasets are derived from the Amazon Product dataset [30], where each item is associated with a product image and a textual description. The latter three are released by the MovieLens dataset [13], where each movie has a title, published year, and genre information. User ratings in all six datasets range from 1 to 5. To obtain reliable preferred items for each user, following the studies [26, 27], we only kept the user's ratings that are larger than 3. Meanwhile, similar to the studies [11, 19], for each dataset we filtered out users and items that have fewer than 10 interactions to ensure the dataset quality.

Since there is no attribute annotation in any of the datasets above, following the studies [43, 50], we adopted the high-frequency words in the items' textual information as the ground-truth attributes of items. Specifically, for each dataset, we regarded the words that appear in the textual descriptions of more than 0.1% of the items as high-frequency words. Notably, stopwords (like "the") and noisy characters (like "/") are not considered. The final statistics of the datasets are listed in Table 2, including the numbers of users (#user), items (#item), their interactions (#rating), and attributes (#attri.), as well as the density of each dataset. Similar to the study [47], we calculated the dataset density as #rating / (#user × #item). To gain a better understanding of the datasets, we further provide the distribution of users with different numbers of interacted items in Fig. 3, whose horizontal axis refers to the number range of interacted items. We omitted users that have more than 120 interacted items from Fig. 3 because their proportion is too small.
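A hedged pandas sketch of this preprocessing is given below (column names and the single-pass filtering are our assumptions; the paper does not state whether the 10-interaction filter is applied repeatedly).

```python
import pandas as pd

def preprocess(ratings: pd.DataFrame, min_interactions=10, positive_threshold=3):
    # Keep only reliable positive feedback, i.e., ratings larger than 3.
    df = ratings[ratings["rating"] > positive_threshold]
    # Drop users and items with fewer than 10 interactions (one filtering pass).
    df = df[df.groupby("user_id")["item_id"].transform("count") >= min_interactions]
    df = df[df.groupby("item_id")["user_id"].transform("count") >= min_interactions]
    # Density as reported in Table 2: #rating / (#user x #item).
    density = len(df) / (df["user_id"].nunique() * df["item_id"].nunique())
    return df, density
```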
Fig. 3. Data distribution of users with different numbers of interacted items in the six datasets. The horizontal and vertical axes refer to the number range of interacted items and the user proportion, respectively.

4.2 Experimental Settings

Baselines. We compared the proposed DUPLE method with the following baselines.

• VBPR [15]. This method represents the item with both the ID embedding and a visual content feature (released at http://jmcauley.ucsd.edu/data/amazon/links.html). The user's preference to an item is measured by the inner product between their corresponding representations. Ultimately, VBPR adopts the BPR framework to enforce that the user's preference toward a preferred item is larger than that toward another item.
• AMR [40]. To enhance the robustness of recommendation, this approach adds adversarial learning to the VBPR model. Specifically, it trains the network to defend against an adversary that adds perturbations to the item image.
• DeepICF+a [53]. This method assesses the user's preference toward an item by evaluating the visual similarity between the given item and the user's historically preferred items. It sums the visual features of the historically preferred items weighted by their similarities to the given item, and feeds the result into a multi-layer perceptron to predict the user-item interaction.
• NARRE [3]. This method adds the item's attributes into matrix factorization. It enriches the item representation with an attribute representation, computed as the weighted summation of the word embeddings of the item's attributes. Based on the learned weights for the attribute representations, NARRE can capture the user's specific preferences to attributes.
• AMCF [38]. This method also adds the item's attributes into matrix factorization by enriching the item representation with an attribute representation. Different from NARRE, AMCF regards attributes as orthogonal vectors, used only as a regularization for the item representation learning. Similarly, according to the attention weights, AMCF can capture the user's specific preferences to different attributes.

It is worth noting that the first three baselines, i.e., VBPR, AMR, and DeepICF+a, only focus on the user's general preference, and are termed single-preference-based (SP-based) methods. The remaining methods, i.e., NARRE, AMCF, and our proposed DUPLE, consider both the general and specific preferences when recommending items, and are termed dual-preference-based (DP-based) methods.

Data Split. We adopted the widely used leave-one-out evaluation [7, 11, 16] to split the training, validation, and testing sets. In particular, for each user, we randomly selected one item from his/her preferred items for validation and one for testing, and left the rest for training. During validation and testing, to avoid heavy computation over all user-item pairs, following the studies [11, 16], we composed the candidate item set of one ground-truth item and 100 randomly selected negative items that the user has not interacted with.
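The sketch below (data structures and the per-user negative sampling are our assumptions) illustrates this leave-one-out split and the 101-item candidate list used for testing.

```python
import random

def leave_one_out_split(user_items, all_items, n_negatives=100, seed=0):
    """user_items: dict mapping each user to the list of his/her preferred items."""
    rng = random.Random(seed)
    splits = {}
    for user, items in user_items.items():
        items = list(items)
        rng.shuffle(items)
        test_item, val_item, train_items = items[0], items[1], items[2:]
        # 100 negatives the user has never interacted with, plus the ground-truth test item.
        negatives = rng.sample(sorted(set(all_items) - set(items)), n_negatives)
        splits[user] = {"train": train_items, "val": val_item, "test": test_item,
                        "test_candidates": [test_item] + negatives}
    return splits
```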
Fig. 4. Training curves of the proposed DUPLE model on the six datasets. The training loss is shown as the grey line (right vertical axis); MRR and HR@10 on the validation set are shown as the orange and blue lines, respectively (left vertical axis).

Evaluation Metrics. We adopted the Area Under the Curve (AUC), Mean Reciprocal Rank (MRR), Hit Rate (HR@10), and Normalized Discounted Cumulative Gain (NDCG@10), with the ranking list truncated at 10, to comprehensively evaluate the item recommendation performance. In particular, AUC indicates the classification ability of the model in terms of distinguishing the user's likes and dislikes, while MRR, HR@10, and NDCG@10 reflect the ranking ability of the model in terms of top-N recommendation.

Implementation Details. For all methods, following the study [11], we unified the dimension of the item's and user's ID embeddings as 64 for a fair comparison. Besides, to obtain powerful representations of the user and item, we added two fully connected layers to transform the raw ID embeddings of the user and item into condensed ones before feeding them into the network. We used random normal initialization for the parameters and trained them with the Adam optimizer [20], a learning rate of 10^-4, and a batch size of 128. The number of optimization steps differs across datasets, because a larger dataset needs more steps to converge. We show the curves of the training loss in Eqn. (11) and the metrics on the validation set for the six datasets in Fig. 4; without loss of generality, we report MRR and NDCG@10. For our method, we tuned the trade-off parameter λ in Eqn. (9) from 0 to 1 with a stride of 0.1 for each dataset, and the dimension d' in Eqn. (2) from [2, 4, 8, 16, 32].
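Under the leave-one-out protocol, each test case has exactly one relevant item among 101 candidates, so the ranking metrics described above reduce to simple functions of the ground-truth item's rank. The sketch below (not the authors' evaluation code) shows one way to compute them.

```python
import math

def rank_metrics(scores, ground_truth_index, k=10):
    """scores: predicted preference for each candidate; exactly one candidate is the ground truth."""
    gt_score = scores[ground_truth_index]
    # Rank of the ground-truth item among all candidates (1 = best).
    rank = 1 + sum(s > gt_score for i, s in enumerate(scores) if i != ground_truth_index)
    n_negatives = len(scores) - 1
    auc = (len(scores) - rank) / n_negatives        # fraction of negatives ranked below the ground truth
    mrr = 1.0 / rank
    hr_k = 1.0 if rank <= k else 0.0
    ndcg_k = 1.0 / math.log2(rank + 1) if rank <= k else 0.0
    return auc, mrr, hr_k, ndcg_k
```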
4.3 Comparison of Baselines (RQ1)

For each method, we report the average results of three runs with different random initializations of the model parameters. Furthermore, we conducted one-sample t-tests to justify the significance of the improvements. Tables 3 and 4 show the performance of the baselines and our proposed DUPLE method on the Amazon and MovieLens datasets, respectively. The best and second-best results are in bold and underlined, respectively. The row "%improv" indicates the relative improvement of DUPLE over the best baseline, and statistical significance is marked with the superscript ∗.

Table 3. Experiment results on the Amazon datasets. The best and second-best results are in bold and underlined, respectively. %improv is the relative improvement of our method over the strongest baseline. The superscript ∗ denotes statistical significance with p-value < 0.05.

  Women's Clothing
    Method       AUC      MRR      HR@10    NDCG@10
    VBPR         71.42    13.10    29.45    14.85
    AMR          72.68    15.14    34.22    16.39
    DeepICF+a    74.46    20.20    40.21    23.43
    NARRE        72.04    15.21    34.52    18.15
    AMCF         73.61    14.73    35.68    17.69
    DUPLE        77.48∗   20.35∗   42.43∗   23.60∗
    %improv      +4.07    +0.76    +5.54    +0.73

  Men's Clothing
    Method       AUC      MRR      HR@10    NDCG@10
    VBPR         71.08    13.63    31.07    15.72
    AMR          73.84    15.82    32.67    15.49
    DeepICF+a    74.82∗   17.31∗   38.23∗   22.41
    NARRE        69.84    14.02    31.07    16.28
    AMCF         73.66    15.70    35.95    18.47
    DUPLE        76.00∗   19.35∗   40.95    23.62
    %improv      +1.58    +11.84   +7.13    +5.40

  Cell Phone & Accessories
    Method       AUC      MRR      HR@10    NDCG@10
    VBPR         73.42    15.83    34.07    18.17
    AMR          74.42    16.38    36.43    20.97
    DeepICF+a    76.12    16.71    38.33    20.01
    NARRE        74.37    15.66    35.33    19.74
    AMCF         75.01    14.90    34.98    19.46
    DUPLE        80.65∗   20.38∗   45.47∗   25.14∗
    %improv      +5.95    +21.97   +18.63   +25.64

From Tables 3 and 4, we have the following observations:

(1) AMR outperforms VBPR on all the datasets. This indicates that adding perturbations makes the network more robust and gives it better generalization ability.

(2) DeepICF+a generally achieves better performance than VBPR and AMR on the datasets derived from the Amazon Product collection. This indicates that considering the contents of items that a user historically prefers is beneficial to comprehensively capturing the user's preference. On the contrary, on the datasets derived from MovieLens, DeepICF+a underperforms VBPR and AMR. This suggests that in the dense MovieLens datasets, where each user may have many interacted items, information redundancy can arise and decrease the model performance.

(3) On the MovieLens datasets, the DP-based methods (i.e., NARRE, AMCF, and DUPLE) on average perform better than the SP-based methods (i.e., VBPR, AMR, and DeepICF+a). This shows that jointly learning the user's general and specific preferences helps to better understand the user's overall preferences and yields better recommendation performance.

(4) However, on the datasets derived from Amazon Product, the DP-based baselines unexpectedly underperform DeepICF+a. This may be because, on these datasets, DeepICF+a incorporates the item's visual feature to enhance the item representation, which is essential for visually-oriented recommendation tasks (e.g., fashion recommendation).

(5) DUPLE outperforms all the baselines in terms of all metrics across the different datasets, which demonstrates the superiority of our proposed framework over existing methods. This may be due to the fact that, by modeling the user's preferences with probabilistic distributions, DUPLE is capable of exploring the correlations among the items and attributes that a user likes, thereby gaining better performance. This also suggests that there indeed exist latent correlations among the items and attributes that a user prefers.
Table 4. Experiment results on the MovieLens datasets. The best and second-best results are in bold and underlined, respectively. %improv is the relative improvement of our method over the strongest baseline. The superscript ∗ denotes statistical significance with p-value < 0.05.

  MovieLens-small
    Method       AUC      MRR      HR@10    NDCG@10
    VBPR         72.04    12.12    29.36    13.49
    AMR          74.66    12.66    32.70    13.85
    DeepICF+a    75.03    12.56    31.34    16.69
    NARRE        76.30    13.23    31.84    17.41
    AMCF         76.06    11.91    31.15    17.04
    DUPLE        77.48∗   20.35∗   42.43∗   23.60∗
    %improv      +3.60    +12.89   +16.56   +19.47

  MovieLens-1M
    Method       AUC      MRR      HR@10    NDCG@10
    VBPR         73.95    15.26    33.83    17.23
    AMR          74.73    15.37    34.21    18.52
    DeepICF+a    72.65    13.87    31.88    14.71
    NARRE        77.54    15.87    37.11    17.97
    AMCF         77.90    15.42    38.18    18.79
    DUPLE        78.21    17.39∗   39.85∗   20.49∗
    %improv      +0.41    +9.64    +4.38    +9.05

  MovieLens-10M
    Method       AUC      MRR      HR@10    NDCG@10
    VBPR         76.19    15.17    34.97    17.47
    AMR          78.12    16.15    38.47    18.39
    DeepICF+a    74.06    13.73    32.19    14.86
    NARRE        82.12    19.39    43.60    21.36
    AMCF         80.50    16.68    40.91    20.68
    DUPLE        82.71    21.09∗   47.23∗   24.62∗
    %improv      +0.72    +8.77    +8.34    +15.26

4.4 Comparison of Variants (RQ2)

To verify that the user's different preferences are related, i.e., that the dimensions of the user's two preference distributions are dependent, we introduced a variant of our model, termed DUPLE-diag, in which the covariance matrices of the two distributions (i.e., Σ_u and Σ̂_u) are constrained to be diagonal, i.e., all off-diagonal elements of the covariance matrices are zeros. Besides, to further demonstrate that the user's different preferences contribute differently to predicting the user-item interaction, we designed the DUPLE-iden variant with Σ_u = Σ̂_u = E, where E is an identity matrix. Formally, according to Eqns. (7) and (8), by omitting constant terms, the user u's general and specific preferences in DUPLE-iden can be simplified as p^G_{u,i} ∝ -\frac{1}{2} ||e_i - \mu_u||^2 and p^S_{u,i} ∝ -\frac{1}{2} ||\hat{a}_i - \hat{\mu}_u||^2, respectively. Intuitively, DUPLE-iden leverages the Euclidean distance to measure the user's preference, whose philosophy is similar to that of the baselines. For each method, we report the average result of three runs with different random initializations of the model parameters. The results are shown in Table 5, and the detailed analysis is given as follows.
Table 5. Comparison of different variants of DUPLE. DUPLE-iden and DUPLE-diag set the covariance matrices of the preference distributions to identity and diagonal matrices, respectively. The best results are highlighted in bold.

  Women's Clothing
    Method        AUC      MRR      HR@10    NDCG@10
    DUPLE-iden    72.23    18.48    37.14    21.09
    DUPLE-diag    75.19    18.43    40.14    21.22
    DUPLE         77.48    20.35    42.43    23.60

  Men's Clothing
    Method        AUC      MRR      HR@10    NDCG@10
    DUPLE-iden    72.80    19.27    38.86    20.38
    DUPLE-diag    75.76    18.16    41.26    22.16
    DUPLE         76.00    19.35    40.95    23.62

  Cell Phone & Accessories
    Method        AUC      MRR      HR@10    NDCG@10
    DUPLE-iden    77.84    19.49    41.10    24.00
    DUPLE-diag    80.64    20.99    46.64    25.65
    DUPLE         80.65    20.38    45.47    25.14

  MovieLens-small
    Method        AUC      MRR      HR@10    NDCG@10
    DUPLE-iden    77.26    14.89    36.26    19.32
    DUPLE-diag    78.33    15.13    36.09    18.11
    DUPLE         79.07    16.43    38.86    20.80

  MovieLens-1M
    Method        AUC      MRR      HR@10    NDCG@10
    DUPLE-iden    76.38    16.56    37.65    19.57
    DUPLE-diag    76.53    15.66    36.87    17.57
    DUPLE         78.21    17.39    39.85    20.49

  MovieLens-10M
    Method        AUC      MRR      HR@10    NDCG@10
    DUPLE-iden    79.16    17.57    40.32    20.93
    DUPLE-diag    80.07    18.48    41.84    21.74
    DUPLE         82.71    21.09    47.23    24.62

(1) DUPLE-diag performs worse than DUPLE. The reason may be that, equipped with a non-diagonal covariance matrix, the specific preference distribution in DUPLE is able to capture the relations among the user's preferences, which helps to better understand the user's preferences and gain better performance.

(2) DUPLE-iden performs comparably with the existing baselines on average. The reason may be that, with the identity covariance matrix setting, DUPLE-iden, like the baselines, measures the user's preferences based on the Euclidean distance between the user and item representations, and hence achieves similar performance.

(3) On the Cell Phone & Accessories dataset, DUPLE unexpectedly performs worse than DUPLE-diag on average. This may be attributed to the fact that the user's preferences to items in this category rarely interact with each other. For example, whether a user prefers a black phone will not be influenced by whether he/she prefers an LCD screen. Therefore, using more parameters to capture such relations of the user's preferences only decreases the performance of DUPLE.

4.5 Visualization of User's Preference Relations (RQ3)

To gain a deeper insight into the relations among the user's preferences, we visualized the learned covariance matrix of the specific preference distribution of a randomly selected user in the Women's Clothing and MovieLens-1M datasets, respectively. For clarity, instead of visualizing the relations of the user's preferences to all the attributes, we randomly picked 20 attributes and visualized their corresponding correlation coefficients in the covariance matrix of the specific preference distribution of each dataset as heat maps in Fig. 5. A darker blue color indicates a higher relation between two of the user's preferences. We circled the four most prominent preference pairs, where the two pairs with the highest relation are surrounded by yellow boxes, and the two pairs with the lowest relation by red boxes.
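The sketch below (not from the paper; function and variable names are ours) shows how such correlation coefficients can be obtained from the learned specific covariance matrix for a chosen subset of attributes.

```python
import numpy as np

def correlation_submatrix(sigma_hat_u, attribute_indices):
    """Correlation coefficients, in [-1, 1], for the selected attribute dimensions of Sigma_hat_u."""
    sub = sigma_hat_u[np.ix_(attribute_indices, attribute_indices)]
    std = np.sqrt(np.diag(sub))
    return sub / np.outer(std, std)
```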
Fig. 5. Relation visualization of the user's preferences for a random user in the (a) Women's Clothing and (b) MovieLens-1M datasets, respectively. A darker color indicates a higher relation; the two highest and two lowest preference pairs are surrounded by yellow and red boxes, respectively.

From Fig. 5, we can see that in the Women's Clothing dataset, this user's preferences to the attributes black and sexy, as well as silk and sexy, are highly related, while those to casual and sexy, as well as sweatshirt and sexy, are less related. This is reasonable, as a user who prefers sexy garments is more likely to also like garments in black, but is unlikely to like casual garments. As for the MovieLens-1M dataset, the user's preference to the attribute Princess is highly related to the preferences to the attributes Friends and Women, while the user's preferences to the attribute Scifi are mutually exclusive with those to the attributes Animation and Princess. These relations uncovered by DUPLE also make sense. Overall, these observations demonstrate that the covariance matrix of one's specific preference distribution is able to capture the relations among his/her preferences.

4.6 Explainable Recommendation (RQ4)

To evaluate the explainable ability of our proposed DUPLE, we first provide examples of our explainable recommendation in Subsection 4.6.1, and then report a subjective psycho-visual test judging the explainability of the proposed DUPLE in Subsection 4.6.2.

4.6.1 Explainable Recommendation Examples. Fig. 6 shows four examples of our explainable recommendation; for each, the user's historically preferred items, including images and textual descriptions, are listed in the left column, the learned user's preferred attribute profile is shown in the center column, and an item associated with the recommendation explanation is given in the right column. The users in Fig. 6, from top to bottom, are from the Men's Clothing, Women's Clothing, Cell Phone & Accessories, and MovieLens-1M datasets, respectively. The reason why we only provide an example for the MovieLens-1M dataset rather than all three of MovieLens-small, MovieLens-1M, and MovieLens-10M is that the items in these three datasets are all movies. From Fig. 6, we have the following observations.

(1) The summarized users' preferred attribute profiles in the center column are in line with the users' historical preferences. For example, the first user has bought many sporty shoes and upper clothes, based on which we can infer that he likes sports and prefers the sporty style. These inferences are consistent with the preferred attributes DUPLE summarizes, e.g., balance and running. In addition, the fourth user has watched several animations like