TransRec: Learning Transferable Recommendation from Mixture-of-Modality Feedback

arXiv:2206.06190v1 [cs.IR] 13 Jun 2022

Jie Wang1,3∗ Fajie Yuan2† Mingyue Cheng4 Joemon M. Jose1 Chenyun Yu6 Beibei Kong3 Zhijin Wang5 Bo Hu3 Zang Li3

1 University of Glasgow 2 Westlake University 3 Platform and Content Group, Tencent 4 University of Science and Technology of China 5 Jimei University 6 Sun Yat-sen University

j.wang.9@research.gla.ac.uk yuanfajie@westlake.edu.cn mycheng@mail.ustc.edu.cn joemon.jose@glasgow.ac.uk {echokong, harryyfhu, gavinzli}@tencent.com yuchy35@mail.sysu.edu.cn zhijinecnu@gmail.com

Abstract

Learning big models and then transferring them has become the de facto practice in computer vision (CV) and natural language processing (NLP). However, such a unified paradigm is uncommon in recommender systems (RS). A critical issue that hampers this is that standard recommendation models are built on unshareable identity data, where both users and their interacted items are represented by unique IDs. In this paper, we study a novel scenario where a user's interaction feedback involves mixture-of-modality (MoM) items. We present TransRec, a straightforward modification of the popular ID-based RS framework. TransRec learns directly from MoM feedback in an end-to-end manner, and thus enables effective transfer learning under various scenarios without relying on overlapped users or items. We empirically study the transfer ability of TransRec across four different real-world recommendation settings. Besides, we study its effects when scaling the size of the source and target data. Our results suggest that learning recommenders from MoM feedback provides a promising way to realize universal recommender systems. Our code and datasets will be made available.

1 Introduction

Mainstream recommender systems (RS) model domain-specific user behaviors and generate item recommendations only within the same platform. Such specialized RS are well-established in the literature [6, 8, 18, 19, 22], yet they routinely suffer from some intrinsic limitations, such as low accuracy for cold and new items [44], and heavy manual work and high cost when training separate models from scratch [46]. Hence, developing general-purpose recommendation models that are useful to many systems has significant practical value. Such models are also popular in the computer vision (CV) [15, 10] and natural language processing (NLP) [9, 3] literature, recently termed foundation models (FM) [2].

Despite their remarkable progress, there has yet to be a recognized learning paradigm for building general-purpose models for recommender systems (gpRS). One paramount reason is that existing RS models are mainly dominated by ID-based collaborative filtering (CF) approaches, where users/items are represented by unique IDs assigned by the recommendation platform. Thereby, well-trained RS models can only serve the current system, because neither users nor items are easily shared across different private systems, e.g.

∗ Work was done when Jie Wang was a visiting scholar at Westlake University and intern at Platform and Content Group, Tencent.
† Corresponding author. Fajie Yuan designed the research and Jie Wang led the experiments.

Preprint. Under review.
TikTok3 and YouTube4. Even in special cases where userIDs or itemIDs from two platforms can be shared, it is still hard to realize the desired transferability, since users/items on the two platforms overlap only partially. That is, a small user and item overlap leads to very limited transfer effects. Recent attempts such as PeterRec [44], the lifelong Conure model [46] and STAR [33] fall exactly into this category.

To address the above problems, we explore modality content-based recommendation, where items are represented by a modality encoder (e.g. BERT [9] and ResNet [15]) rather than an ID embedding. By modeling modality features, recommendation models intuitively have the potential to transfer across domains for this modality in a broader sense, i.e. no longer relying on overlapping and shared-ID information. Moreover, the revolution of big encoder networks in NLP and computer vision is also potentially beneficial to multi-modal item recommendation, and might even bring about a paradigm shift for RS from ID-based CF back to content-based recommendation.

In this paper, we study a common yet unexplored recommendation scenario where user behaviors are composed of items with mixed-modal features, e.g. the items a user interacts with can be texts or images or both. Such a scenario is common in many practical recommender systems such as feeds recommendation, where a recommended feed can be a piece of news, an image or a micro-video.

We claim that developing recommendation models based on user feedback with mixture-of-modality items is a vital way towards transferable and general-purpose recommendation. To verify this, we design TransRec, a simple yet representative framework for modeling mixture-of-modality feedback. TransRec is a direct modification of the classical two-tower ID-based DSSM [21] model, where one tower represents users and the other represents items. To eliminate the non-transferable ID features, items are encoded by a modality encoder instead of ID embeddings, whereas users are represented by a sequence of items instead of user embeddings, as shown in Figure 1. We train TransRec, including both user and item encoders, in an end-to-end manner rather than using frozen features pre-extracted from modality encoders.

More importantly, we perform broad empirical studies on TransRec, the first RS regime enabling effective transfer across modalities and domains. Specifically, we first train TransRec on a large-scale source dataset collected from a commercial website, where a user's feedback contains textual or visual modalities, or both. Then we evaluate the pre-trained TransRec on a first target dataset collected from a different platform but with similar user feedback formats. Second, we evaluate TransRec on a second target dataset where items have only one modality. Third, we evaluate TransRec with still one modality, but with additional user/item features, to verify its flexibility. At last, we evaluate TransRec on another target dataset that is very different from the source domain, to verify its generality. Beyond this, we empirically examine the performance of TransRec with different scaling strategies on the source and target datasets. Our results confirm that TransRec, learning from MoM feedback, is effective for various transfer learning tasks. To date, TransRec is probably the closest model towards gpRS. We believe its success will point out a new way towards foundation models in the RS domain.
To summarize, our contributions are as follows: (1) we identify the important fact that learning from MoM feedback has the potential to realize gpRS; (2) we design TransRec, the first recommendation model that realizes cross-modality and cross-domain recommendation; (3) we evaluate the transferability of TransRec across four types of recommendation scenarios; (4) we study the effects of TransRec when scaling the data, and provide useful insights; (5) we make our code, datasets5 and pre-trained parameters of TransRec available for future research.

2 Related Work

In this section, we briefly review the progress of gpRS and its core techniques: self-supervised pre-training (SSP) and transfer learning (TF).

General-purpose Recommendation. Big foundation models have achieved astounding feats in the NLP and CV communities [2]. BERT [9], GPT-3 [3], ResNet [15] and various Vision Transformers [10, 1, 27] have almost dominated the two fields because of their superb performance and transferability.

3 https://www.tiktok.com/
4 https://www.youtube.com/
5 For privacy reasons, the datasets are provided only by email with permission for research purposes.
Figure 1: Illustration of the training process in TransRec. Here, the inner product is employed to compute the preference between users and candidate items.

By contrast, very few efforts have been devoted to foundation recommendation models, which are key towards gpRS. Some works adopt multi-task learning (MTL) to combine multiple objectives so as to obtain a more general representation model [28, 47, 37, 31]. However, typical MTL is only useful for the trained tasks and cannot be directly transferred to newly arriving recommendation tasks or scenarios. PeterRec [44] proposed the first pre-training-then-fine-tuning paradigm for learning and transferring general-purpose user representations, following closely the BERT model. Following it, Conure [46] introduced the 'one person, one model, one world' idea and claimed that recommendation models benefit from lifelong learning. Similar cross-domain recommendation works include [26, 29, 40, 49, 4, 30, 33, 7]. However, these models are all based on the shared-ID assumption in the source and target domains, which is difficult to satisfy in practice. Distinct from these works, some preprint papers [34, 35, 41] devised gpRS by leveraging textual information. [39] learned user representations based on text and image modalities, with images processed into frozen features beforehand. The most recent preprint paper P5 [12] formulates various recommendation-related tasks (e.g. rating prediction, item recommendation and explanation generation) as a unified text-to-text paradigm with the same language modeling objective for pre-training. However, to the best of our knowledge, there exists no general-purpose recommendation model that is trained from various modality feedback in an end-to-end fashion.

SSP & TF. Recent years have seen a growing interest in the paradigm of upstream SSP and downstream TF. In terms of pre-training, existing literature can be broadly categorized into two classes: BERT/GPT-like generative pre-training [9, 3] and discriminative pre-training [14, 5, 38]. Compared with supervised learning, SSP methods are more likely to learn general-purpose representations, since they are trained directly on self-generated labels rather than explicit task-specific supervision. Recommender systems are a natural fit for self-supervised learning given the large amount of implicit user feedback. Both generative pre-training [44, 46, 36, 12] and contrastive pre-training [13, 48, 42] are popular, although many of them [48, 42, 36] only investigated pre-training for the current task rather than pursuing general-purpose recommendation. Regarding transfer learning, current works mainly adopt parameter freezing [17], full fine-tuning [9, 16], adapter fine-tuning [44, 20] and prompting [3, 11] to adapt an upstream SSP model to downstream tasks. In this paper, we consider both generative and contrastive pre-training for learning the gpRS model, and apply full fine-tuning for domain adaptation.

3 The TransRec Framework

In this section, we first formulate the recommendation tasks involving mixture-of-modality feedback. Then, we introduce the TransRec framework in detail.
3.1 Problem Definition

Assume that we are given two categories of domains: a source domain S and target domains T = {T_1, T_2, ..., T_N}. In the source (target) domain, suppose there exist a user set U_s (U_t) and an item set V_s (V_t), involving |U_s| (|U_t|) users and |V_s| (|V_t|) items. In both domains, the content features of items are recorded with a modality set M = {a, b} containing textual and visual modalities, denoted as a and b. Following the setting of item-based collaborative filtering [44], users can be represented by the sequence of their historical interaction records C_u = {c_{1,m}, ..., c_{n,m}}, where n indicates the sequence length and m ∈ M.

Figure 2: Illustration of the transferring process. TransRec first pre-trains a unified recommender in the source domain, and then serves the target domain with the pre-trained network.

The goal of this work is to learn a generic recommender from the source domain S that can be transferred to the N target domains T without any overlap of userIDs or itemIDs. As illustrated in Figure 2, by learning from M, the trained model can be applied to the following domains: a single-modality domain C = {c_{1,a}, ..., c_{n,a}} (c_{i,a} ∈ V_t^a) in domain T^a, a single-modality domain C = {c_{1,b}, ..., c_{n,b}} (c_{i,b} ∈ V_t^b) in domain T^b, and a mixed-modality domain C = {c_{1,m}, ..., c_{n,m}} (c_{i,m} ∈ V_t^m, m ∈ M) in domain T^m. Suppose we extend the source domain to four modalities M = {a, b, c, d} = {text, vision, audio, video}. The target domain can then be served with 2^4 - 1 = 15 modality combinations, i.e. {a}, {b}, {c}, {a, b}, {a, b, c}, ..., {a, b, c, d}, which covers a majority of existing multimedia modalities.
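As a quick illustration of this count (every non-empty subset of the four modalities, i.e. 2^4 - 1 = 15), the short Python snippet below enumerates the servable combinations; it is our own illustration, not part of the paper:

```python
from itertools import combinations

# Extended modality set M = {a, b, c, d} from the problem definition.
modalities = ["text", "vision", "audio", "video"]

# Every non-empty subset of M is a feedback type a target domain may expose.
subsets = [set(c) for r in range(1, len(modalities) + 1)
           for c in combinations(modalities, r)]

print(len(subsets))  # 15 = 2**4 - 1 servable modality combinations
```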
3.2 TransRec Architecture

To verify our claim, we develop TransRec on the most popular two-tower recommendation architecture, a.k.a. DSSM [21]. To eliminate ID features, we represent both users and items with item modality contents. That is, the item tower of DSSM is replaced with an item modality encoder network (e.g. BERT for text and ResNet for images), and the user tower is replaced with a user encoder network that directly models an ordered collection of item interactions rather than explicit userID data. In this sense, TransRec is also a sequential recommendation (SR) model [19, 38].6

Formally, given a user interaction sequence C = {c_{1,m}, ..., c_{n+l,m}} from a mixed-modal scenario, TransRec takes two sub-sequences C^u = {c_{1,m}, ..., c_{n,m}} and C^e = {c_{n+1,m}, ..., c_{n+l,m}} as inputs. Z^u = {z_1, ..., z_n} and Z^e = {z_{n+1}, ..., z_{n+l}} are the item representations of C^u and C^e produced by the item encoder E_i. The item representations in Z^u are then fed into the user encoder E_u to obtain the user representation U^u. U^u and Z^e are used to compute relevance scores. The process can be formulated as:

$$Z^e = E_i(C^e), \quad Z^u = E_i(C^u), \qquad (1)$$

$$U^u = E_u(Z^u), \quad R_{u,e} = U^u \cdot Z^e, \qquad (2)$$

where R_{u,e} = {r_{u,1}, r_{u,2}, ..., r_{u,l}}; r_{u,t} denotes the relevance score between U^u and the t-th item of the sub-sequence C^e, and R_{u,e} captures the relation between the user and his next interaction sequence. Next, we describe the components of our model in detail.

Item Encoder. Given an MoM scenario, the item encoders of TransRec take individual modality content as input. We consider two types of modalities in this paper, i.e. textual tokens and image pixels. Extensions to more modalities are left for future work.

6 It is worth mentioning that there exist various types of SR frameworks, e.g. the popular CPC framework [19, 38], NextItNet- and SASRec-style autoregressive frameworks [45, 22], and the BERT4Rec-style denoising framework [36]. The reason we extend the DSSM or CPC framework is mainly its flexibility in incorporating various user and item features. In fact, TransRec fits most sequential recommender models.
For an item with a textual modality (i.e. c_{i,t}), we adopt BERT-base [9] to model its token content t = [t_1, t_2, ..., t_k]. Then we apply an attention network as a pooling layer to obtain the final textual representation Z_{i,t}:

$$Z_{i,t} = \mathrm{SelfAtt}(\mathrm{BERT}(t)). \qquad (3)$$

Similarly, we apply ResNet-18 [15] to encode the visual pixels of an image, denoted as v. Then we perform average pooling, followed by an MLP layer. The visual representation is given by:

$$Z_{i,v} = \mathrm{MLP}(\mathrm{ResNet}(v)). \qquad (4)$$

User Encoder. For the user encoder, we again use the BERT architecture (denoted as BERT_u), where each token embedding is the representation from an item encoder rather than the original word ID embedding. A position embedding P = {p_1, ..., p_n} is added to model the sequential patterns of user behaviors. The specific process is formulated as follows:

$$S^u = Z^u + P^u, \qquad (5)$$

$$U^u = E_u(S^u) = \mathrm{BERT}_u(S^u). \qquad (6)$$

TransRec can incorporate user features by simply concatenating them with the user embedding, and item features by concatenating them with the item embedding. By contrast, typical sequential recommendation models such as NextItNet and SASRec cannot merge user features in a straightforward way.

3.3 Optimization

Inspired by the pre-training and fine-tuning paradigm, we apply two-stage training for TransRec: first pre-training the user encoder network, and then training the whole framework.

Stage 1: User Encoder Pre-training. We pre-train the user encoder network in a self-supervised manner. Specifically, we apply left-to-right generative pre-training to predict the next item in the interaction sequence, similarly to SASRec and NextItNet [22, 45]. We choose unidirectional pre-training rather than the BERT-style bidirectional objective (i.e. Eq. (6)) simply because unidirectional pre-training converges much faster without loss of precision. We use the softmax cross-entropy loss as the objective function:

$$\tilde{y}_t = \mathrm{softmax}(\mathrm{ReLU}(S'_t W^U + b^U)), \qquad (7)$$

$$L_{UEP} = -\sum_{u \in U} \sum_{t \in [1, \ldots, n]} y_t \log(\tilde{y}_t), \qquad (8)$$

where W^U and b^U are the projection matrix and bias term, and S'_t is the representation of the last hidden layer at position t.

Stage 2: End-to-End Training. TransRec is trained in an end-to-end manner by fine-tuning the parameters of both the user and item encoders. This is significantly different from many multi-modal recommendation tasks that pre-extract modality features before training models [17]. End-to-end training enables a better adaptation of textual and visual features to the current recommendation domain. Specifically, we use the Contrastive Predictive Coding (CPC) [38] learning method. Given a sequence of user interactions C, we divide the sequence into the sub-sequences C^u and C^e to encode the relationship between them. The binary cross-entropy loss function is as follows:

$$L_{CPC} = -\sum_{u \in U} \sum_{t=n+1}^{n+l} \left[ \log(\sigma(r_{u,t})) + \sum_{g=1}^{j} \log(1 - \sigma(r_{u,g})) \right], \qquad (9)$$

where g indexes the j randomly sampled negative items [32, 43] used during model training.
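To make the two towers and the two training stages concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(9). It assumes the Hugging Face transformers and torchvision packages for the BERT-base and ResNet-18 backbones; the class names, the exact pooling details, and the helper signatures are our illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18
from transformers import BertModel

D = 256  # shared embedding size, as in the paper's experimental setup

class ItemEncoder(nn.Module):
    """E_i: maps a text item (Eq. 3) or an image item (Eq. 4) into R^D."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_attn = nn.MultiheadAttention(768, num_heads=4, batch_first=True)
        self.text_proj = nn.Linear(768, D)
        cnn = resnet18(weights="IMAGENET1K_V1")
        self.resnet = nn.Sequential(*list(cnn.children())[:-1])  # keep avg-pool, drop fc
        self.img_mlp = nn.Sequential(nn.Linear(512, D), nn.ReLU(), nn.Linear(D, D))

    def encode_text(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.text_attn(h, h, h)          # attention network as pooling (Eq. 3)
        return self.text_proj(h.mean(dim=1))    # one plausible reading of SelfAtt(BERT(t))

    def encode_image(self, pixels):             # pixels: (B, 3, H, W)
        feat = self.resnet(pixels).flatten(1)   # average-pooled 512-d ResNet features
        return self.img_mlp(feat)               # Eq. 4

class UserEncoder(nn.Module):
    """E_u: BERT-style Transformer over item representations plus positions (Eqs. 5-6)."""
    def __init__(self, n_layers=4, n_heads=4, max_len=25):
        super().__init__()
        self.pos = nn.Embedding(max_len, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, z_u):                     # z_u: (B, n, D) item representations Z^u
        pos = self.pos(torch.arange(z_u.size(1), device=z_u.device))
        return self.encoder(z_u + pos)          # S^u = Z^u + P^u, then BERT_u(S^u)

def next_item_loss(s_t, W_U, b_U, targets):
    """Stage 1 (Eqs. 7-8): left-to-right next-item prediction. s_t: (B, D) last
    hidden states S'_t; W_U: (D, |V|) projection; targets: (B,) next-item ids."""
    logits = F.relu(s_t @ W_U + b_U)            # ReLU(S'_t W^U + b^U)
    return F.cross_entropy(logits, targets)     # cross_entropy applies the softmax

def cpc_loss(u, z_pos, z_neg):
    """Stage 2 (Eq. 9): binary cross-entropy over inner-product scores.
    u: (B, D) user vectors U^u; z_pos: (B, l, D) items of C^e;
    z_neg: (B, l, j, D) with j sampled negatives per positive position."""
    r_pos = torch.einsum("bd,bld->bl", u, z_pos)    # r_{u,t} for positives
    r_neg = torch.einsum("bd,bljd->blj", u, z_neg)  # r_{u,g} for negatives
    return -(F.logsigmoid(r_pos).sum(-1)
             + F.logsigmoid(-r_neg).sum((-1, -2))).mean()
```

In use, each item in C^u would be encoded with ItemEncoder (text or image branch, depending on its modality), the stacked representations passed through UserEncoder, a hidden state taken as U^u, and the result scored against the l positive items in C^e plus the sampled negatives via cpc_loss.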
4 Experiments

To verify the effectiveness of our proposed TransRec, we conduct empirical experiments and evaluate pre-trained recommenders on four types of downstream tasks.

4.1 Experiments for The Source Domain

Datasets. The source data is news recommendation data collected from QQBrowser7 from 14th to 17th December 2020. We collected around 25 million user-item interactions, involving about 1 million randomly sampled users and 133,000 interacted items. Each interaction denotes observed feedback at a certain time, including full plays and clicks. For each user, we construct the behavior sequence using her most recent 25 ordered interactions. In addition to the ID information, the dataset has rich content features to represent each item. More precisely, the items in a user session can be videos only, news only, or both; however, each individual item is either a video or a piece of news, i.e. it contains only one modality. Note that we represent video items by their cover images. The statistics are presented in Table 1.

7 https://browser.qq.com/

Table 1: Characteristics of the source dataset. 'Form' indicates the item category. 'All' covers all users in the dataset, including the above three types. For example, the first row denotes that there are 765,895 users whose interacted items always have two-modal (i.e. textual and visual) features.

Form    | Modality      | User      | Item    | Interaction
Mixed   | Text + Vision | 765,895   | -       | 19,233,882
Article | Text          | 133,107   | -       | 3,327,463
Video   | Vision        | 123,897   | -       | 2,996,048
All     | Text + Vision | 1,022,899 | 133,107 | 25,557,393

Evaluation Metrics. We use the typical leave-one-out strategy [18] for evaluation, where the last item of a user's interaction sequence is held out as test data, and the item before the last one is used as validation data. Unlike many previous methods that evaluate on a small set of randomly sampled items, which may lead to inconsistencies with the non-sampled version [23], we rank the entire item set without the inaccurate sampled measures. We apply Hit Ratio (HR) [44] and Normalized Discounted Cumulative Gain (NDCG) [45] to measure the performance of each method. Our evaluation protocol is consistent across the source and downstream tasks.

Implementation Details. To ensure fair evaluation, each hyper-parameter is tuned on the validation set. We set the embedding size to 256. All networks are optimized with the Adam optimizer and a learning rate of 1e-4. The batch size is set to 512 in pre-training. All baseline models are either tuned on the validation set or use the suggested settings from the original papers. We train all models until convergence and save the parameters when they reach the highest accuracy on the validation set. We set the number of user encoder (i.e. BERT) layers to 4 and the number of attention heads to 4. A similar setup is adopted for all downstream tasks.

Results. Before evaluating TransRec on the downstream datasets, we first examine its performance in the source domain. The purpose here is not to demonstrate that TransRec achieves state-of-the-art recommendation accuracy. Instead, we investigate whether learning from modality contents has advantages over traditional ID-based methods, i.e. IDRec. Throughout this paper, we use IDRec to denote the network that has a similar architecture to TransRec, but with the item encoder replaced by an ID embedding layer.

Table 2: Results on the source dataset. The terms below have the same meaning as in Table 1. TransRec- denotes TransRec without the first-stage user encoder pre-training.

Method    | Modality | HR@10  | NDCG@10
IDRec     | ID       | 0.0230 | 0.0118
TransRec- | Vision   | 0.0540 | 0.0281
TransRec- | Text     | 0.0536 | 0.0280
TransRec- | Mixed    | 0.0530 | 0.0275
TransRec- | All      | 0.0532 | 0.0276
TransRec  | Vision   | 0.1128 | 0.0553
TransRec  | Text     | 0.0582 | 0.0272
TransRec  | Mixed    | 0.0679 | 0.0326
TransRec  | All      | 0.0699 | 0.0334

Table 2 shows the results of IDRec and TransRec in terms of HR@10 and NDCG@10. Two important observations can be made: (1) Both TransRec and TransRec- largely outperform IDRec (e.g. 0.0699 vs. 0.0230, and 0.0532 vs. 0.0230 on HR@10), which demonstrates the strong potential of learning from modality content data. The higher results of TransRec are presumably attributable to three key factors: large-scale training data, powerful item encoder networks, and an end-to-end training fashion. (2) TransRec exceeds TransRec- under all modality settings (e.g. 0.0699 vs. 0.0532, and 0.0679 vs. 0.0530), which evidences the effectiveness of user encoder pre-training (see Section 3.3). The results of (1) further motivate us to develop modality content-based recommendation models for downstream tasks.
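As a reference for the full-ranking protocol above, HR@K and NDCG@K can be computed per user as follows; this is our own sketch of the standard leave-one-out definitions, not the authors' evaluation code.

```python
import numpy as np

def hr_ndcg_at_k(scores, target, k=10):
    """HR@K and NDCG@K for one user under leave-one-out full ranking.
    scores: (|V|,) model scores over the entire item set;
    target: index of the single held-out ground-truth item."""
    rank = int((scores > scores[target]).sum())      # 0-based rank of the target
    hit = rank < k
    hr = 1.0 if hit else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if hit else 0.0   # IDCG = 1 with one relevant item
    return hr, ndcg

# Reported numbers are averages over all test users, e.g.:
# hr10, ndcg10 = np.mean([hr_ndcg_at_k(s, t) for s, t in eval_set], axis=0)
```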
4.2 Experiments for The Target Domains

All the target datasets below come from recommender systems different from the source domain.

TN-mixed: Collected from Tencent News (TN)8, where an interacted item can be either a piece of news or a video (represented by its cover image). Similar to the source domain, a user's interacted item set contains mixture-of-modality features, i.e. both visual and textual features.

TN-video/text: These two datasets contain only single-modality items. For example, a user's interactions in TN-video include only videos, while TN-text includes only textual items, i.e. news. They are used to evaluate TransRec's generality for single-modal item recommendation. Given that users and items in real-world recommender systems have various additional features, we introduce two types of user features (gender and age) and one item feature (category) for TN-text.

Douyin: Collected from Douyin9 (the Chinese version of TikTok), a well-known short-video recommendation application. Unlike all previous datasets, the positive user feedback in Douyin contains only comment behaviors. In addition, the video genres and cover image sizes in Douyin are vastly different from those of the source domain. Table 3 summarizes the statistics of the downstream datasets.

8 https://news.qq.com/
9 https://www.douyin.com/

Table 3: Datasets for the downstream recommendation tasks. Except for Douyin, interactions in the other datasets are mainly users' clicking or watching behaviors.

Domain   | Modality      | User    | Item
TN-mixed | Text + Vision | 49,639  | 48,383
TN-video | Vision        | 47,004  | 50,053
TN-text  | Text          | 49,033  | 49,142
Douyin   | Vision        | 100,000 | 66,228

Baselines. In addition to IDRec, we also report IDGru [19] and IDNext [45], two popular ID-based sequential recommendation baselines, as a reference. We emphasize again that the purpose of this study is neither to propose a more advanced neural recommendation architecture nor to pursue state-of-the-art results. The key purpose is to show that: (1) learning from modality content features instead of ID features achieves transferable recommendation across different domains; (2) learning from MoM feedback rather than single-modal or multi-modal feedback achieves generic recommendation across different modalities.

Results. The overall results are shown in Table 4, which covers four recommendation scenarios. Two important observations can be made: (1) TransRec performs largely better than training from scratch; (2) TransRec also performs better than IDRec. The results suggest that training on the source domain brings much better results for TransRec on all target datasets. By analyzing these scenarios, we conclude that TransRec, learning from MoM feedback, can be broadly transferred to various recommendation scenarios, including the source-like mixed-modal scenario (TN-mixed), the single-modal scenario (TN-video), the scenario with additional features (TN-text), and the scenario with very different modality content (Douyin).

4.3 Scaling Effects

The larger the source dataset, the stronger the representation TransRec achieves.
To investigate the scaling effect of the pre-training dataset in the source domain, we compare the results of TransRec on downstream tasks when changing the size of the QQBrowser dataset. Specifically, we extract 20% and 50% of the user sequences from the original dataset, conduct the same training process on them, and then evaluate their transferability.

Figure 3: Convergence trend when scaling the source data (HR@10 per training epoch on TN-mixed and TN-text, comparing Train-from-scratch, 20%, 50%, and full-data TransRec).
Table 4: Comparison of recommendation results on the four downstream domains. Train-from-scratch denotes training on the target datasets with randomly initialized parameters; it shares exactly the same network architecture and hyper-parameters as TransRec. The best results are bolded.

Domain   | Metric  | IDRec  | IDGru  | IDNext | Train-from-scratch | TransRec
TN-mixed | HR@10   | 0.0210 | 0.0281 | 0.0334 | 0.0428 | 0.0478
TN-mixed | NDCG@10 | 0.0100 | 0.0143 | 0.0167 | 0.0213 | 0.0239
TN-video | HR@10   | 0.0267 | 0.0357 | 0.0406 | 0.0336 | 0.0424
TN-video | NDCG@10 | 0.0125 | 0.0206 | 0.0208 | 0.0173 | 0.0221
TN-text  | HR@10   | 0.0192 | -      | -      | 0.0500 | 0.0597
TN-text  | NDCG@10 | 0.0090 | -      | -      | 0.0255 | 0.0303
Douyin   | HR@10   | 0.0019 | 0.0140 | 0.0200 | 0.0205 | 0.0259
Douyin   | NDCG@10 | 0.0011 | 0.0068 | 0.0095 | 0.0101 | 0.0126

Table 5: Results on the target data when scaling the source corpus.

Domain   | Metric  | Train-from-scratch | 20%    | 50%    | TransRec
TN-mixed | HR@10   | 0.0428 | 0.0448 | 0.0474 | 0.0485
TN-mixed | NDCG@10 | 0.0213 | 0.0227 | 0.0237 | 0.0245
TN-video | HR@10   | 0.0336 | 0.0400 | 0.0417 | 0.0424
TN-video | NDCG@10 | 0.0173 | 0.0209 | 0.0214 | 0.0221
TN-text  | HR@10   | 0.0500 | 0.0543 | 0.0581 | 0.0597
TN-text  | NDCG@10 | 0.0255 | 0.0281 | 0.0292 | 0.0303
Douyin   | HR@10   | 0.0205 | 0.0233 | 0.0254 | 0.0260
Douyin   | NDCG@10 | 0.0101 | 0.0113 | 0.0120 | 0.0126

Table 5 shows the results for the four downstream recommendation tasks. First, recommendation accuracy clearly improves when more source data is used for training. For example, on the TN-mixed dataset, HR@10 grows from 0.0428 to 0.0448 with 20% of the source data, and from 0.0448 to 0.0474 with 50%. This property of TransRec is desirable, since it implies that scaling up the source dataset is an effective way to improve downstream tasks. We also depict the convergence behavior in Figure 3, which shows the improvements more clearly.

The smaller the downstream dataset, the larger the improvement TransRec achieves. We study the effects of TransRec when scaling down the target data, aiming to verify whether TransRec can alleviate the insufficient-data problem and thus improve recommendation performance. Specifically, we decrease the size of the target datasets to 20% and 60% of the original data. Results are shown in Table 6 and Figure 4. It can be seen that (1) recommendation accuracy increases with more training data, and (2) larger performance gains are achieved with less training data.

Figure 4: Comparison of convergence when scaling the TN-mixed dataset (HR@10 per training epoch with 10,000, 30,000, and 49,639 training sequences, comparing Fine-tuning and Train-from-scratch).
Table 6: Comparison of relative performance improvements on downstream tasks with varied target data sizes. 'Num. Sample' denotes the number of user behavior sequences used for training. 'Improv.' indicates the relative improvement of TransRec over Train-from-scratch.

Domain   | Num. Sample | HR@10 Train-from-scratch | HR@10 TransRec | Improv. | NDCG@10 Train-from-scratch | NDCG@10 TransRec | Improv.
TN-mixed | 10,000 | 0.0261 | 0.0385 | 47.51% | 0.0126 | 0.0193 | 53.17%
TN-mixed | 30,000 | 0.0354 | 0.0448 | 26.55% | 0.0176 | 0.0223 | 26.70%
TN-mixed | 49,639 | 0.0428 | 0.0485 | 13.32% | 0.0213 | 0.0245 | 15.02%
TN-text  | 10,000 | 0.0393 | 0.0597 | 51.91% | 0.0201 | 0.0262 | 30.35%
TN-text  | 30,000 | 0.0453 | 0.0549 | 21.19% | 0.0230 | 0.0283 | 23.04%
TN-text  | 49,033 | 0.0500 | 0.0597 | 19.40% | 0.0255 | 0.0303 | 18.82%

This suggests that when a recommender system is short of training data, transferring the (user-item) matching relationship from a large source dataset is of great help.

4.4 End-to-End vs. Frozen Features

Traditional multimodal and multimedia recommendation has been well studied [17]. Due to high computing cost and the limited power of earlier text/image encoder networks, prior art tends to extract frozen modality features and then feed them into a CTR or recommendation model. This practice is especially common in industrial applications with billions of training examples [8, 6]. However, we want to explore whether end-to-end learning is superior to learning from frozen features.

Table 7: End-to-end training vs. frozen features.

Method             | Manner  | TN-mixed HR@10 | TN-mixed NDCG@10 | TN-text HR@10 | TN-text NDCG@10 | TN-video HR@10 | TN-video NDCG@10
Train-from-scratch | Frozen  | 0.0334 | 0.0167 | 0.0350 | 0.0176 | 0.0037 | 0.0017
Train-from-scratch | End2end | 0.0428 | 0.0213 | 0.0500 | 0.0255 | 0.0336 | 0.0173
TransRec           | Frozen  | 0.0359 | 0.0181 | 0.0411 | 0.0206 | 0.0040 | 0.0023
TransRec           | End2end | 0.0485 | 0.0245 | 0.0597 | 0.0303 | 0.0424 | 0.0221

The results are in Table 7. Clearly, end-to-end learning achieves consistent improvements, although fine-tuning BERT and ResNet is computationally more expensive than using pre-extracted features. In particular, we notice that frozen textual features degrade results less severely than frozen visual features. This may imply that the textual features generated by BERT are more general than the visual features generated by ResNet. This also aligns with findings in the NLP and CV fields: fine-tuning all parameters is in general better than fine-tuning only the classification layer (with the backbone network frozen).
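For reference, the 'Frozen' rows in Table 7 correspond to disabling gradients for the modality backbones, which in PyTorch takes only a few lines; the sketch below assumes an item encoder exposing its BERT and ResNet backbones as attributes (our naming from the earlier sketch, not the paper's code).

```python
def freeze_backbones(item_encoder):
    """Frozen-feature variant: stop gradient flow through BERT and ResNet so that
    only the pooling/projection layers and the user encoder are trained."""
    for backbone in (item_encoder.bert, item_encoder.resnet):
        backbone.eval()                      # fix dropout and batch-norm statistics
        for p in backbone.parameters():
            p.requires_grad_(False)

# End-to-end training is the default: keep requires_grad=True everywhere and pass
# all parameters to the optimizer, at a higher per-step compute cost.
```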
5 Conclusion, Limitations, and Future Works

In this paper, we study a new recommendation scenario where user feedback contains items with mixture-of-modality features. We develop TransRec, the first recommendation model that learns from MoM feedback in an end-to-end manner. To show its transfer ability, we conduct empirical studies on four types of downstream recommendation tasks. Our results verify that TransRec is a generic model that can be broadly transferred to improve many recommendation tasks, as long as the modality has been trained in the source domain. Our work has significant practical implications towards universal recommender systems with the goal of realizing 'One Model to Serve All' [46, 33].

One limitation is that we examine TransRec with only two types of modality features (vision and text). As a result, it can only serve three scenarios: vision-only, text-only, and vision-text. Since both video and audio data can be represented as images [24, 25], TransRec can intuitively be extended to scenarios where items involve more modalities. For example, suppose four distinct modalities (vision, text, audio and video) are available in user feedback; then TransRec has the potential to serve at most 15 types of scenarios, which might cover most modalities of multimedia data. This is an interesting future direction and may realize a more general recommender system.
References

[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.

[2] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[4] Lei Chen, Fajie Yuan, Jiaxi Yang, Xiangnan He, Chengming Li, and Min Yang. User-specific adaptive fine-tuning for cross-domain recommendations. IEEE Transactions on Knowledge and Data Engineering, 2021.

[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10, 2016.

[7] Mingyue Cheng, Fajie Yuan, Qi Liu, Xin Xin, and Enhong Chen. Learning transferable user representations with sequential behaviors via contrastive pre-training. In 2021 IEEE International Conference on Data Mining (ICDM), pages 51–60. IEEE, 2021.

[8] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198, 2016.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[11] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.

[12] Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). arXiv preprint arXiv:2203.13366, 2022.

[13] Jie Gu, Feng Wang, Qinghui Sun, Zhiquan Ye, Xiaoxiao Xu, Jingmin Chen, and Jun Zhang. Exploiting behavioral consistence for universal user representation. arXiv preprint arXiv:2012.06146, 2020.

[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[16] Ruining He and Julian McAuley.
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507–517, 2016.

[17] Ruining He and Julian McAuley. Vbpr: Visual bayesian personalized ranking from implicit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

[18] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182, 2017.

[19] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.

[20] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.

[21] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338, 2013.
[22] Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pages 197–206. IEEE, 2018.

[23] Walid Krichene and Steffen Rendle. On sampled metrics for item recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1748–1757, 2020.

[24] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706–6713, 2019.

[25] Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario Lucic, Yi Tay, Adrian Weller, and Mostafa Dehghani. Polyvit: Co-training vision transformers on images, videos and audio. arXiv preprint arXiv:2111.12993, 2021.

[26] Jian Liu, Pengpeng Zhao, Fuzhen Zhuang, Yanchi Liu, Victor S Sheng, Jiajie Xu, Xiaofang Zhou, and Hui Xiong. Exploiting aesthetic preference in deep cross networks for cross-domain recommendation. In Proceedings of The Web Conference 2020, pages 2768–2774, 2020.

[27] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

[28] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1930–1939, 2018.

[29] Muyang Ma, Pengjie Ren, Yujie Lin, Zhumin Chen, Jun Ma, and Maarten de Rijke. π-net: A parallel information-sharing network for shared-account cross-domain sequential recommendations. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 685–694, 2019.

[30] Tong Man, Huawei Shen, Xiaolong Jin, and Xueqi Cheng. Cross-domain recommendation: An embedding and mapping approach. In IJCAI, volume 17, pages 2464–2470, 2017.

[31] Yabo Ni, Dan Ou, Shichen Liu, Xiang Li, Wenwu Ou, Anxiang Zeng, and Luo Si. Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 596–605, 2018.

[32] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618, 2012.

[33] Xiang-Rong Sheng, Liqin Zhao, Guorui Zhou, Xinyao Ding, Binding Dai, Qiang Luo, Siran Yang, Jingshan Lv, Chi Zhang, Hongbo Deng, et al. One model to serve all: Star topology adaptive recommender for multi-domain ctr prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 4104–4113, 2021.

[34] Kyuyong Shin, Hanock Kwak, Kyung-Min Kim, Minkyu Kim, Young-Jin Park, Jisu Jeong, and Seungjae Jung. One4all user representation for recommender systems in e-commerce. arXiv preprint arXiv:2106.00573, 2021.

[35] Kyuyong Shin, Hanock Kwak, Kyung-Min Kim, Su Young Kim, and Max Nihlen Ramstrom. Scaling law for recommendation models: Towards general-purpose user representations. arXiv preprint arXiv:2111.11294, 2021.

[36] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang.
Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1441–1450, 2019.

[37] Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. In Fourteenth ACM Conference on Recommender Systems, pages 269–278, 2020.

[38] Aaron Van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[39] Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. Mm-rec: Multimodal news recommendation. arXiv preprint arXiv:2104.07407, 2021.

[40] Chuhan Wu, Fangzhao Wu, Tao Qi, Jianxun Lian, Yongfeng Huang, and Xing Xie. Ptum: Pre-training user model from unlabeled user behaviors via self-supervision. arXiv preprint arXiv:2010.01494, 2020.

[41] Chuhan Wu, Fangzhao Wu, Yang Yu, Tao Qi, Yongfeng Huang, and Xing Xie. Userbert: Contrastive user model pre-training. arXiv preprint arXiv:2109.01274, 2021.

[42] Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 726–735, 2021.
[43] Fajie Yuan, Guibing Guo, Joemon M Jose, Long Chen, Haitao Yu, and Weinan Zhang. Lambdafm: Learning optimal ranking with factorization machines using lambda surrogates. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 227–236, 2016.

[44] Fajie Yuan, Xiangnan He, Alexandros Karatzoglou, and Liguang Zhang. Parameter-efficient transfer from sequential behaviors for user modeling and recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1469–1478, 2020.

[45] Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M Jose, and Xiangnan He. A simple convolutional generative network for next item recommendation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 582–590, 2019.

[46] Fajie Yuan, Guoxiao Zhang, Alexandros Karatzoglou, Joemon Jose, Beibei Kong, and Yudong Li. One person, one model, one world: Learning continual user representation without forgetting. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 696–705, 2021.

[47] Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. Recommending what video to watch next: A multitask ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 43–51, 2019.

[48] Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 1893–1902, 2020.

[49] Feng Zhu, Yan Wang, Chaochao Chen, Guanfeng Liu, Mehmet Orgun, and Jia Wu. A deep framework for cross-domain and cross-system recommendations. arXiv preprint arXiv:2009.06215, 2020.