Context Uncertainty in Contextual Bandits with Applications to Recommender Systems

Hao Wang(1), Yifei Ma(2), Hao Ding(2), Yuyang Wang(2)
(1) Department of Computer Science, Rutgers University
(2) AWS AI Lab
hoguewang@gmail.com, {yifeim,haodin,yuyawang}@amazon.com

arXiv:2202.00805v1 [cs.LG] 1 Feb 2022
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Recurrent neural networks have proven effective in modeling sequential user feedback for recommender systems. However, they usually focus solely on item relevance and fail to effectively explore diverse items for users, therefore harming the system performance in the long run. To address this problem, we propose a new type of recurrent neural network, dubbed recurrent exploration networks (REN), to jointly perform representation learning and effective exploration in the latent space. REN tries to balance relevance and exploration while taking into account the uncertainty in the representations. Our theoretical analysis shows that REN can preserve the rate-optimal sublinear regret even when there exists uncertainty in the learned representations. Our empirical study demonstrates that REN can achieve satisfactory long-term rewards on both synthetic and real-world recommendation datasets, outperforming state-of-the-art models.

Introduction

Modeling and predicting sequential user feedback is a core problem in modern e-commerce recommender systems. In this regard, recurrent neural networks (RNN) have shown great promise since they can naturally handle sequential data (Hidasi et al. 2016; Quadrana et al. 2017; Belletti, Chen, and Chi 2019; Ma et al. 2020). While these RNN-based models can effectively learn representations in the latent space and achieve satisfactory immediate recommendation accuracy, they typically focus solely on relevance and fall short of effective exploration in the latent space, leading to poor performance in the long run. For example, a recommender system may keep recommending action movies to a user once it learns that she likes such movies. This may increase immediate rewards, but the lack of exploration in other movie genres can certainly be detrimental to long-term rewards.

So, how does one effectively explore diverse items for users while retaining the representation power offered by RNN-based recommenders? We note that the learned representations in the latent space are crucial for these models' success. Therefore we propose recurrent exploration networks (REN) to explore diverse items in the latent space learned by RNN-based models. REN tries to balance relevance and exploration during recommendations using the learned representations.

One roadblock is that effective exploration relies heavily on well-learned representations, which in turn require sufficient exploration; this is a chicken-and-egg problem. In a case where the RNN learns unreasonable representations (e.g., all items have the same representation), exploration in the latent space is meaningless. To address this problem, we enable REN to also take into account the uncertainty of the learned representations during recommendations. Essentially, items whose representations have higher uncertainty can be explored more often. Such a model can be seen as a contextual bandit algorithm that is aware of the uncertainty for each context. Our contributions are as follows:

1. We propose REN as a new type of RNN to balance relevance and exploration during recommendation, yielding satisfactory long-term rewards.
2. Our theoretical analysis shows that there is an upper confidence bound related to the uncertainty in learned representations. With such a bound implemented in the algorithm, REN can achieve the same rate-optimal sublinear regret. To the best of our knowledge, we are the first to study regret bounds under "context uncertainty".
3. Experiments of joint learning and exploration on both synthetic and real-world temporal datasets show that REN significantly improves long-term rewards over state-of-the-art RNN-based recommenders.

Related Work

Deep Learning for Recommender Systems. Deep learning (DL) has been playing a key role in modern recommender systems (Salakhutdinov, Mnih, and Hinton 2007; van den Oord, Dieleman, and Schrauwen 2013; Wang, Wang, and Yeung 2015; Wang, Shi, and Yeung 2015, 2016; Li and She 2017; Chen et al. 2019; Fang et al. 2019; Tang et al. 2019; Ding et al. 2021; Gupta et al. 2021). (Salakhutdinov, Mnih, and Hinton 2007) uses restricted Boltzmann machines to perform collaborative filtering in recommender systems. Collaborative deep learning (CDL) (Wang, Wang, and Yeung 2015; Wang, Shi, and Yeung 2016; Li and She 2017) is devised as a Bayesian deep learning model (Wang and Yeung 2016, 2020; Wang 2017) to significantly improve recommendation performance.
In terms of sequential (or session-based) recommender systems (Hidasi et al. 2016; Quadrana et al. 2017; Bai, Kolter, and Koltun 2018; Li et al. 2017; Liu et al. 2018; Wu et al. 2019; Ma et al. 2020), GRU4Rec (Hidasi et al. 2016) was first proposed to use gated recurrent units (GRU) (Cho et al. 2014), an RNN variant with a gating mechanism, for recommendation. Since then, follow-up works such as hierarchical GRU (Quadrana et al. 2017), temporal convolutional networks (TCN) (Bai, Kolter, and Koltun 2018), and hierarchical RNN (HRNN) (Ma et al. 2020) have improved accuracy with the help of cross-session information (Quadrana et al. 2017), causal convolutions (Bai, Kolter, and Koltun 2018), and control signals (Ma et al. 2020). We note that our REN does not assume a specific RNN architecture (e.g., GRU or TCN) and is therefore compatible with different RNN-based (or more generally DL-based) models, as shown in later sections.

Contextual Bandits. Contextual bandit algorithms such as LinUCB (Li et al. 2010) and its variants (Yue and Guestrin 2011; Agarwal et al. 2014; Li, Karatzoglou, and Gentile 2016; Kveton et al. 2017; Foster et al. 2018; Korda, Szorenyi, and Li 2016; Mahadik et al. 2020; Zhou, Li, and Gu 2019) have been proposed to tackle the exploitation-exploration trade-off in recommender systems and successfully improve upon context-free bandit algorithms (Auer 2002). Similar to (Auer 2002), theoretical analysis shows that LinUCB variants can achieve a rate-optimal regret bound (Chu et al. 2011). However, these methods either assume observed context (Zhou, Li, and Gu 2019) or are incompatible with neural networks (Li, Karatzoglou, and Gentile 2016; Yue and Guestrin 2011). In contrast, REN as a contextual bandit algorithm runs in the latent space and assumes user models based on RNN; therefore it is compatible with state-of-the-art RNN-based recommender systems.

Diversity-Inducing Models. Various works have focused on inducing diversity in recommender systems (Nguyen et al. 2014; Antikacioglu and Ravi 2017; Wilhelm et al. 2018; Bello et al. 2018). Usually such a system consists of a submodular function, which measures the diversity among items, and a relevance prediction model, which predicts relevance between users and items. Examples of submodular functions include the probabilistic coverage function (Hiranandani et al. 2019) and facility location diversity (FILD) (Tschiatschek, Djolonga, and Krause 2016), while relevance prediction models can be Gaussian processes (Vanchinathan et al. 2014), linear regression (Yue and Guestrin 2011), etc. These models typically focus on improving diversity among recommended items in a slate at the cost of accuracy. In contrast, REN's goal is to optimize for long-term rewards through improving diversity between previous and recommended items. We include some slate generation in our real-data experiments for completeness.

Recurrent Exploration Networks

In this section we first describe the general notation and how an RNN can be used for recommendation, briefly review determinantal point processes (DPP) as a diversity-inducing model as well as their connection to exploration in contextual bandits, and then introduce our proposed REN framework.

Notation and RNN-Based Recommender Systems

Notation. We consider the problem of sequential recommendation where the goal is to predict the item a user interacts with (e.g., clicks or purchases) at time t, denoted as e_{k_t}, given her previous interaction history E_t = [e_{k_τ}]_{τ=1}^{t-1}. Here k_t is the index of the item at time t, e_{k_t} ∈ {0,1}^K is a one-hot vector indicating an item, and K is the total number of items. We denote the item embedding (encoding) for e_{k_t} as x_{k_t} = f_e(e_{k_t}), where f_e(·) is the encoder, a part of the RNN. Correspondingly we have X_t = [x_{k_τ}]_{τ=1}^{t-1}. Strictly speaking, in an online setting where the model updates at every time step t, x_k also changes over time; until the Theoretical Analysis section we use x_k as shorthand for x_{t,k} for simplicity. We use ||z||_∞ = max_i |z^{(i)}| to denote the L_∞ norm, where the superscript (i) denotes the i-th entry of the vector z.

RNN-Based Recommender Systems. Given the interaction history E_t, the RNN generates the user embedding at time t as θ_t = R([x_{k_τ}]_{τ=1}^{t-1}), where x_{k_τ} = f_e(e_{k_τ}) ∈ R^d and R(·) is the recurrent part of the RNN. Assuming tied weights, the score for each candidate item is then computed as p_{k,t} = x_k^⊤ θ_t. As the last step, the recommender system recommends the items with the highest scores to the user. Note that the subscript k indexes the items and is equivalent to an 'action', usually denoted as a, in the context of bandit algorithms.

Determinantal Point Processes for Diversity and Exploration

Determinantal point processes (DPP) consider an item selection problem where each item is represented by a feature vector x_t. Diversity is achieved by picking a subset of items to cover the maximum volume spanned by the items, measured by the log-determinant of the corresponding kernel matrix, ker(X_t) = log det(I_K + X_t X_t^⊤), where I_K is included to prevent singularity. Intuitively, DPP penalizes colinearity, which is an indicator that the topics of one item are already covered by the other topics in the full set. The log-determinant of a kernel matrix is also a submodular function (Friedland and Gaubert 2013), which implies a (1 − 1/e)-optimality guarantee for greedy solutions. The greedy algorithm for DPP via the matrix determinant lemma is

argmax_k log det(I_d + X_t^⊤ X_t + x_k x_k^⊤) − log det(I_d + X_t^⊤ X_t)   (1)
= argmax_k log(1 + x_k^⊤ (I_d + X_t^⊤ X_t)^{−1} x_k)   (2)
= argmax_k √(x_k^⊤ (I_d + X_t^⊤ X_t)^{−1} x_k).   (3)

Interestingly, note that √(x_k^⊤ (I_d + X_t^⊤ X_t)^{−1} x_k) has the same form as the confidence interval in LinUCB (Li et al. 2010), a commonly used contextual bandit algorithm to boost exploration and achieve long-term rewards, suggesting a connection between diversity and long-term rewards (Yue and Guestrin 2011). Intuitively, this makes sense in recommender systems, since encouraging diversity relative to user history (as well as diversity within a slate of recommendations in our experiments) naturally explores user interests previously unknown to the model, leading to much higher long-term rewards, as shown in the Experiments section.
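To make the equivalence in Eqns. (1)-(3) concrete, the short numerical sketch below (ours, not from the original paper; NumPy, with random vectors standing in for learned embeddings) checks that the greedy DPP gain equals log(1 + x^⊤ (I_d + X^⊤ X)^{−1} x) via the matrix determinant lemma, so ranking candidates by either quantity, or by the LinUCB-style square-root term, yields the same greedy choice (log and square root are both monotone).

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, K = 8, 20, 5                      # latent dim, history length, #candidates
X = rng.normal(size=(t, d))             # user-history item embeddings (rows)
candidates = rng.normal(size=(K, d))    # candidate item embeddings

A = np.eye(d) + X.T @ X                 # A = I_d + X^T X
base = np.linalg.slogdet(A)[1]          # log det(I_d + X^T X)

for x in candidates:
    # Greedy DPP gain: log det(I_d + X^T X + x x^T) - log det(I_d + X^T X)
    gain = np.linalg.slogdet(A + np.outer(x, x))[1] - base
    # Matrix determinant lemma: log(1 + x^T A^{-1} x)
    lemma = np.log(1.0 + x @ np.linalg.solve(A, x))
    assert np.isclose(gain, lemma)      # identical up to numerical error

# LinUCB-style exploration bonus: the square root of the same quadratic form,
# so the argmax over candidates is unchanged.
bonus = np.sqrt([x @ np.linalg.solve(A, x) for x in candidates])
print(bonus.argmax())
```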
Recurrent Exploration Networks

Exploration Term. Based on the intuition above, we can modify the user-item score p_{k,t} = x_k^⊤ θ_t to include a diversity (exploration) term, leading to the new score

p_{k,t} = x_k^⊤ θ_t + λ_d √(x_k^⊤ (I_d + X_t^⊤ X_t)^{−1} x_k),   (4)

where the first term is the relevance score and the second term is the exploration score (measuring diversity between previous and recommended items). Here θ_t = R(X_t) = R([x_{k_τ}]_{τ=1}^{t-1}) is the RNN's hidden state at time t representing the user embedding, and the hyperparameter λ_d balances the two terms.

Uncertainty Term for Context Uncertainty. At first blush, given the user history, a system using Eqn. 4 will recommend items that are (1) relevant to the user's interest and (2) diverse from the user's previous items. However, this only works when the item embeddings x_k are correctly learned. Unfortunately, the quality of the learned item embeddings, in turn, relies heavily on the effectiveness of exploration, leading to a chicken-and-egg problem. To address this problem, one also needs to consider the uncertainty of the learned item embeddings. Assuming the item embedding x_k ~ N(μ_k, Σ_k), where Σ_k = diag(σ_k^2), we have the final score for REN:

p_{k,t} = μ_k^⊤ θ_t + λ_d √(μ_k^⊤ (I_d + D_t^⊤ D_t)^{−1} μ_k) + λ_u ||σ_k||_∞,   (5)

where θ_t = R(D_t) = R([μ_{k_τ}]_{τ=1}^{t-1}) and D_t = [μ_{k_τ}]_{τ=1}^{t-1}. The term σ_k quantifies the uncertainty in each dimension of x_k, meaning that items whose embeddings REN is uncertain about are more likely to be recommended. Therefore, with the third term, REN can naturally balance among relevance, diversity (relative to user history), and uncertainty during exploration.

Algorithm 1: Recurrent Exploration Networks (REN)
1: Input: λ_d, λ_u, initialized REN model with the encoder, i.e., R(·) and f_e(·).
2: for t = 1, 2, ..., T do
3:   Obtain item embeddings from REN:
4:     μ_{k_τ} ← f_e(e_{k_τ}) for all τ ∈ {1, 2, ..., t − 1}.
5:   Obtain the current user embedding from REN:
6:     θ_t ← R(D_t).
7:   Compute A_t ← I_d + Σ_{τ∈Ψ_t} μ_{k_τ} μ_{k_τ}^⊤.
8:   Obtain candidate items' embeddings from REN:
9:     μ_k ← f_e(e_k), where k ∈ [K].
10:  Obtain candidate items' uncertainty estimates σ_k, where k ∈ [K].
11:  for k ∈ [K] do
12:    Obtain the score for item k at time t:
13:      p_{k,t} ← μ_k^⊤ θ_t + λ_d √(μ_k^⊤ A_t^{−1} μ_k) + λ_u ||σ_k||_∞.
14:  end for
15:  Recommend item k_t ← argmax_k p_{k,t} and collect user feedback.
16:  Update the REN model R(·) and f_e(·) using the collected user feedback.
17: end for

Putting It All Together. Algorithm 1 shows an overview of REN. Note that the difference between REN and traditional RNN-based recommenders lies only in the inference stage. During training (Line 16 of Algorithm 1), one can train REN with only the relevance term, using models such as GRU4Rec and HRNN. In the experiments, we use the uncertainty estimates diag(σ_k) = 1/√(n_k) I_d, where n_k is item k's total number of impressions (i.e., the number of times item k has been recommended) across all users. The intuition is that the more frequently item k is recommended, the more frequently its embedding x_k gets updated, and the faster σ_k decreases. (There are some caveats in general: diag(σ_k) ∝ I_d assumes that all coordinates of x_k shrink at the same rate, whereas REN's exploration mechanism associates n_k with the total variance of an item's features; this may not ensure that all feature dimensions are equally explored. See (Jun et al. 2019) for a different algorithm that analyzes exploration of a low-rank feature space.) Our preliminary experiments show that 1/√(n_k) does decrease at the rate of O(1/√t), meaning that the assumption in Lemma 4 is satisfied. From a Bayesian perspective, 1/√(n_k) may not accurately reflect the uncertainty of the learned x_k, which is a limitation of our model. In principle, one can learn σ_k from data using the reparameterization trick (Kingma and Welling 2014) with a Gaussian prior on x_k and examine whether σ_k satisfies the assumption in Lemma 4; this would be interesting future work.

Linearity in REN. REN only needs a linear bandit model; REN's output x_k^⊤ θ_t is linear w.r.t. θ_t and x_k. Note that NeuralUCB (Zhou, Li, and Gu 2019) is a powerful nonlinear extension of LinUCB, i.e., its output is nonlinear w.r.t. θ and x_k. Extending REN's output from x_k^⊤ θ_t to a nonlinear function f(x_k, θ_t) as in NeuralUCB is also interesting future work. (In other words, we did not fully explain why x can be shared between a nonlinear RNN and uncertainty bounds based on linear models. On the other hand, we did observe promising empirical results, which may encourage interested readers to dive deeper into different theoretical analyses.)

Beyond RNN. Note that our methods and theory go beyond RNN-based models and can be naturally extended to any latent factor model, including transformers, MLPs, and matrix factorization. The key is the user embedding θ_t = R(X_t), which can be instantiated with an RNN, a transformer, or a matrix-factorization model.
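A minimal sketch of the inference step in Algorithm 1 is shown below (ours, not the paper's code; NumPy). It assumes the base recurrent model already provides the mean item embeddings μ_k and the user embedding θ_t, and uses the impression-count heuristic diag(σ_k) = 1/√(n_k) I_d described above; all function and variable names are our own.

```python
import numpy as np

def ren_scores(theta_t, item_means, impression_counts, history_means,
               lambda_d=0.01, lambda_u=0.1):
    """Compute REN scores p_{k,t} (Eqn. 5) for all candidate items.

    theta_t:           (d,)     user embedding from the recurrent model R(.)
    item_means:        (K, d)   candidate item embedding means mu_k
    impression_counts: (K,)     n_k, number of times item k has been recommended
    history_means:     (t-1, d) embeddings of the user's previous items (rows of D_t)
    """
    d = theta_t.shape[0]
    A_t = np.eye(d) + history_means.T @ history_means      # A_t = I_d + D_t^T D_t
    A_inv = np.linalg.inv(A_t)

    relevance = item_means @ theta_t                        # mu_k^T theta_t
    diversity = np.sqrt(np.einsum('kd,de,ke->k', item_means, A_inv, item_means))
    # Heuristic uncertainty: sigma_k = 1/sqrt(n_k) on every coordinate, so
    # ||sigma_k||_inf = 1/sqrt(n_k); the max(., 1) guard is our own choice.
    uncertainty = 1.0 / np.sqrt(np.maximum(impression_counts, 1))

    return relevance + lambda_d * diversity + lambda_u * uncertainty

# Recommend the item with the highest score (Line 15 of Algorithm 1):
# k_t = ren_scores(theta_t, item_means, impression_counts, history_means).argmax()
```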
Theoretical Analysis

With REN's connection to contextual bandits, we can prove that with proper λ_d and λ_u, Eqn. 5 is actually an upper confidence bound that leads to long-term rewards with a rate-optimal regret bound.

Reward Uncertainty versus Context Uncertainty. Note that unlike existing works, which primarily consider the randomness from the reward, we take into consideration the uncertainty resulting from the context (content) (Mi et al. 2019; Wang, Xingjian, and Yeung 2016), i.e., context uncertainty. In CDL (Wang, Wang, and Yeung 2015; Wang, Shi, and Yeung 2016), it is shown that such content information is crucial in DL-based recommender systems, and so is the associated uncertainty. More specifically, existing works assume a deterministic x and only assume randomness in the reward, i.e., they assume that r = x^⊤ θ + ε, and therefore r's randomness is independent of x. The problem with this formulation is that the model only has a point estimate of the item embedding x, but has no uncertainty estimate for x. We find that such uncertainty estimation is crucial for exploration; if the model is uncertain about x, it can then explore more on the corresponding item.

To facilitate analysis, we follow common practice (Auer 2002; Chu et al. 2011) and divide the REN procedure into a "BaseREN" stage (Algorithm 2) and a "SupREN" stage. Essentially, SupREN introduces S = ln T levels of elimination (indexed by s) to filter out low-quality items and to ensure that the assumption holds (see the Supplement for details of SupREN).

Algorithm 2: BaseREN: Basic REN Inference at Step t
1: Input: α, Ψ_t ⊆ {1, 2, ..., t − 1}.
2: Obtain item embeddings from REN: μ_{τ,k_τ} ← f_e(e_{τ,k_τ}) for all τ ∈ Ψ_t.
3: Obtain user embedding: θ_t ← R(D_t).
4: A_t ← I_d + Σ_{τ∈Ψ_t} μ_{τ,k_τ} μ_{τ,k_τ}^⊤.
5: Obtain candidate items' embeddings: μ_{t,k} ← f_e(e_{t,k}), where k ∈ [K].
6: Obtain candidate items' uncertainty estimates σ_{t,k}, where k ∈ [K].
7: for k ∈ [K] do
8:   s_{t,k} ← √(μ_{t,k}^⊤ A_t^{−1} μ_{t,k}).
9:   w_{t,k} ← (α + 1) s_{t,k} + (4√d + 2√(ln(TK/δ))) ||σ_{t,k}||_∞.
10:  r̂_{t,k} ← θ_t^⊤ μ_{t,k}.
11: end for
12: Recommend item k ← argmax_k r̂_{t,k} + w_{t,k}.

In this section, we first provide a high-probability bound for BaseREN with uncertain embeddings (context), and then derive an upper bound on the regret. As mentioned in the Notation section, in the online setting where the model updates at every time step t, x_k also changes over time; therefore in this section we use x_{t,k}, μ_{t,k}, Σ_{t,k}, and σ_{t,k} in place of x_k, μ_k, Σ_k, and σ_k to be rigorous.

Assumption 1. Assume there exists an optimal θ*, with ||θ*|| ≤ 1, and x*_{t,k} such that E[r_{t,k}] = x*_{t,k}^⊤ θ*. Further assume that there is an effective distribution N(μ_{t,k}, Σ_{t,k}) such that x*_{t,k} ~ N(μ_{t,k}, Σ_{t,k}), where Σ_{t,k} = diag(σ_{t,k}^2). Thus, the true underlying context is unavailable, but we are aided with the knowledge that it is generated by a multivariate normal with known parameters. (Here we omit the identifiability issue of x*_{t,k} and assume that there is a unique x*_{t,k} for clarity.)

Upper Confidence Bound for Uncertain Embeddings

For simplicity, denote the item embedding (context) as x_{t,k}, where t indexes the rounds (time steps) and k indexes the items. We define:

s_{t,k} = √(μ_{t,k}^⊤ A_t^{−1} μ_{t,k}) ∈ R_+,   D_t = [μ_{τ,k_τ}]_{τ∈Ψ_t} ∈ R^{|Ψ_t|×d},
y_t = [r_{τ,k_τ}]_{τ∈Ψ_t} ∈ R^{|Ψ_t|×1},   A_t = I_d + D_t^⊤ D_t,
b_t = D_t^⊤ y_t,   r̂_{t,k} = μ_{t,k}^⊤ θ̂_t = μ_{t,k}^⊤ A_t^{−1} b_t,   (6)

where y_t is the collected user feedback. Lemma 1 below shows that with λ_d = 1 + α = 1 + √((1/2) ln(2TK/δ)) and λ_u = 4√d + 2√(ln(TK/δ)), Eqn. 5 is an upper confidence bound with high probability, i.e., it upper-bounds the true reward with high probability, which makes it a reasonable score for recommendations.

Lemma 1 (Confidence Bound). With probability at least 1 − 2δ/T, we have for all k ∈ [K] that

|r̂_{t,k} − x*_{t,k}^⊤ θ*| ≤ (α + 1) s_{t,k} + (4√d + 2√(ln(TK/δ))) ||σ_{t,k}||_∞,

where ||σ_{t,k}||_∞ = max_i |σ_{t,k}^{(i)}| is the L_∞ norm.

The proof is in the Supplement. The upper confidence bound above provides important insight into why Eqn. 5 is reasonable as a final score to select items in Algorithm 1, as well as into the choice of the hyperparameters λ_d and λ_u.

RNN to Estimate θ_t. REN uses an RNN to approximate A_t^{−1} b_t (useful in the proof of Lemma 1) in Eqn. 6. Note that a linear RNN with tied weights and a single time step is equivalent to linear regression (LR); therefore the RNN is a more general model for estimating θ_t. Compared to LR, RNN-based recommenders can naturally incorporate new user history by incrementally updating the hidden state (θ_t in REN), without the need to solve a linear system. Interestingly, one can also view the RNN's recurrent computation as a simulation (approximation) of solving such a system via iterative updates.
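For intuition on the quantities in Eqn. 6 and Algorithm 2, the sketch below (ours; NumPy) builds the ridge-style statistics A_t and b_t from the (embedding, reward) pairs kept in Ψ_t and scores candidates by r̂_{t,k} + w_{t,k}. REN itself replaces the linear estimate A_t^{−1} b_t with the RNN's hidden state, as discussed above; the function name and the concrete δ value are illustrative assumptions.

```python
import numpy as np

def base_ren_select(D_t, y_t, item_means, item_sigmas, alpha, d, T, K, delta=0.1):
    """One BaseREN step: estimate theta, then pick argmax_k r_hat + w (Algorithm 2).

    D_t:         (|Psi_t|, d) embeddings mu_{tau,k_tau} of the rounds kept in Psi_t
    y_t:         (|Psi_t|,)   observed rewards r_{tau,k_tau}
    item_means:  (K, d)       candidate embedding means mu_{t,k}
    item_sigmas: (K, d)       per-coordinate uncertainties sigma_{t,k}
    """
    A_t = np.eye(d) + D_t.T @ D_t            # A_t = I_d + D_t^T D_t
    b_t = D_t.T @ y_t                        # b_t = D_t^T y_t
    theta_hat = np.linalg.solve(A_t, b_t)    # ridge-style estimate A_t^{-1} b_t

    A_inv = np.linalg.inv(A_t)
    s = np.sqrt(np.einsum('kd,de,ke->k', item_means, A_inv, item_means))
    width = (alpha + 1) * s + (4 * np.sqrt(d) + 2 * np.sqrt(np.log(T * K / delta))) \
            * np.abs(item_sigmas).max(axis=1)        # ||sigma_{t,k}||_inf per item
    r_hat = item_means @ theta_hat
    return int(np.argmax(r_hat + width))
```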
Regret Bound

Lemma 1 above provides an estimate of the reward's upper bound at time t. Based on this estimate, one natural next step is to analyze the regret after all T rounds. Formally, we define the regret of the algorithm after T rounds as

B(T) = Σ_{t=1}^T r_{t,k_t^*} − Σ_{t=1}^T r_{t,k_t},   (7)

where k_t^* is the optimal item (action) at round t, i.e., the one maximizing E[r_{t,k}] = x*_{t,k}^⊤ θ*, and k_t is the action chosen by the algorithm at round t. Similar to (Auer 2002), SupREN calls BaseREN as a subroutine. In this subsection, we derive the regret bound for SupREN with uncertain item embeddings.

Lemma 2. With probability 1 − 2δS, for any t ∈ [T] and any s ∈ [S], we have: (1) |r̂_{t,k} − E[r_{t,k}]| ≤ w_{t,k} for any k ∈ [K], (2) k_t^* ∈ Â_s, and (3) E[r_{t,k_t^*}] − E[r_{t,k}] ≤ 2^{3−s} for any k ∈ Â_s.

Lemma 3. In BaseREN, we have: (1 + α) Σ_{t∈Ψ_{T+1}} s_{t,k_t} ≤ 5 · (1 + α^2) √(d |Ψ_{T+1}|).

Lemma 4. Assuming ||σ_{1,k}||_∞ = 1 and ||σ_{t,k}||_∞ ≤ 1/√t for any k and t, then for any k we have the upper bound: Σ_{t∈Ψ_{T+1}} ||σ_{t,k}||_∞ ≤ √(|Ψ_{T+1}|).

Essentially, Lemma 2 links the regret B(T) to the width of the confidence bound w_{t,k} (Line 9 of Algorithm 2, or the last two terms of Eqn. 5). Lemma 3 and Lemma 4 then connect w_{t,k} to √(|Ψ_{T+1}|) ≤ √T, which is sublinear in T; this is the key to achieving a sublinear regret bound. Note that Â_s is defined inside the SupREN algorithm in the Supplement. Interestingly, Lemma 4 states that the uncertainty only needs to decrease at the rate 1/√t, which is consistent with our choice of diag(σ_k) = 1/√(n_k) I_d above, where n_k is item k's total number of impressions across all users. As the last step, Lemma 5 and Theorem 1 below build on all the lemmas above to derive the final sublinear regret bound.

Lemma 5. For all s ∈ [S],

|Ψ^{(s)}_{T+1}| ≤ 2^s · (5(1 + α^2) √(d |Ψ^{(s)}_{T+1}|) + 4√(dT) + 2√(T ln(TK/δ))).

Theorem 1. If SupREN is run with α = √((1/2) ln(2TK/δ)), then with probability at least 1 − δ, the regret of the algorithm is

B(T) ≤ 2√T + 92 · (1 + ln(2TK(2 ln T + 2)/δ))^{3/2} √(Td) = O(√(T d ln^3(KT ln(T)/δ))).

The full proofs of all lemmas and the theorem are in the Supplement. Theorem 1 shows that even with uncertainty in the item embeddings (i.e., context uncertainty), our proposed REN can achieve the same rate-optimal sublinear regret bound.

Experiments

In this section, we evaluate our proposed REN on both synthetic and real-world datasets.

Experiment Setup and Compared Methods

Joint Learning and Exploration Procedure in Temporal Data. To effectively verify REN's capability to boost long-term rewards, we adopt an online experiment setting where data is divided into different time intervals [T_0, T_1), [T_1, T_2), ..., [T_{M−1}, T_M]. The RNN (including REN and its baselines) is then trained and evaluated in a rolling manner: (1) the RNN is trained using data in [T_0, T_1); (2) the RNN is evaluated using data in [T_1, T_2) and collects feedback (rewards) for its recommendations; (3) the RNN uses the newly collected feedback from [T_1, T_2) to finetune the model; (4) the previous two steps are repeated using data from the next time interval. Note that, different from traditional offline, one-step evaluation, which corresponds only to Steps (1) and (2), our setting performs joint learning and exploration on temporal data, and is therefore more realistic and closer to production systems.

Long-Term Rewards. Since the goal is to evaluate long-term rewards, we are mostly interested in the rewards during the last (few) time intervals. Conventional RNN-based recommenders do not perform exploration and therefore saturate much more easily at a relatively low reward. In contrast, REN with its effective exploration can achieve nearly optimal rewards in the end.

Compared Methods. We compare REN variants with state-of-the-art RNN-based recommenders including GRU4Rec (Hidasi et al. 2016), TCN (Bai, Kolter, and Koltun 2018), and HRNN (Ma et al. 2020). Since REN can use any RNN-based recommender as a base model, we evaluate three REN variants in the experiments: REN-G, REN-T, and REN-H, which use GRU4Rec, TCN, and HRNN as base models, respectively. Additionally, as an ablation study, we also evaluate REN-1,2, an REN variant without the third term of Eqn. 5, and REN-1,3, one without the second term of Eqn. 5. Both REN-1,2 and REN-1,3 use GRU4Rec as the base model. As references we also include Oracle, which always achieves optimal rewards, and Random, which randomly recommends one item from the full set. For REN variants we choose λ_d from {0.001, 0.005, 0.01, 0.05, 0.1} and set λ_u = 10 λ_d. Other hyperparameters in the RNN base models are kept the same for fair comparison (see the Supplement for more details on neural network architectures, hyperparameters, and their sensitivity analysis).

Connection to Reinforcement Learning (RL) and Bandits. REN-1,2 (in Fig. 2) can be seen as a simplified version of 'randomized least-squares value iteration' (an RL approach proposed in (Osband, Van Roy, and Wen 2016)) or an adapted version of contextual bandits, while REN-1,3 (in Fig. 2) is an advanced version of ε-greedy exploration in RL. Note that REN is orthogonal to RL (Shi et al. 2019) and bandit methods.

Simulated Experiments

Datasets. Following the setting described above, we start with three synthetic datasets, namely SYN-S, SYN-M, and SYN-L, which allow complete control over the simulated environments. We assume 8-dimensional latent vectors, unknown to the models, for each user and item, and use the inner product between user and item latent vectors as the reward. Specifically, for each latent user vector θ*, we randomly choose 3 entries to set to 1/√3 and set the rest to 0, keeping ||θ*||_2 = 1. We generate C(8,2) = 28 unique item latent vectors: each item latent vector x_k* has 2 entries set to 1/√2 and the other 6 entries set to 0, so that ||x_k*||_2 = 1. We assume 15 users in our datasets. SYN-S contains exactly 28 items, while SYN-M repeats each unique item latent vector 10 times, yielding 280 items in total; similarly, SYN-L repeats each 50 times, yielding 1400 items in total. The purpose of allowing different items to have identical latent vectors is to investigate REN's capability to explore in the compact latent space rather than the large item space. All users have a history length of 60.
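The sketch below (ours; NumPy) generates latent vectors following the description above; the paper's exact sampling code is not reproduced here, so treat this as an illustrative reconstruction.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d, n_users = 8, 15

# User latent vectors: 3 random coordinates set to 1/sqrt(3), rest 0 (unit norm).
users = np.zeros((n_users, d))
for u in range(n_users):
    users[u, rng.choice(d, size=3, replace=False)] = 1.0 / np.sqrt(3)

# Item latent vectors: all C(8,2) = 28 ways to place 1/sqrt(2) in two coordinates.
items = np.zeros((28, d))
for i, (a, b) in enumerate(combinations(range(d), 2)):
    items[i, [a, b]] = 1.0 / np.sqrt(2)

# SYN-M / SYN-L duplicate each unique item vector 10 / 50 times.
syn_m = np.repeat(items, 10, axis=0)   # 280 items
syn_l = np.repeat(items, 50, axis=0)   # 1400 items

# Ground-truth reward of item k for user u is the inner product theta*^T x_k*.
rewards = users @ items.T              # shape (15, 28)
```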
[Figure 1: Results for different methods on SYN-S (left, 28 items), SYN-M (middle, 280 items), and SYN-L (right, 1400 items). One time step represents one interaction step, where in each interaction step the model recommends 3 items to the user and the user interacts with one of them. In all cases, REN models with diversity-based exploration lead to final convergence, whereas models without exploration get stuck at local optima.]

Simulated Environments. With the generated latent vectors, the simulated environment runs as follows: at each time step t, the environment randomly chooses one user and feeds the user's interaction history X_t (or D_t) into the RNN recommender. The recommender then recommends the top 4 items to the user. The user selects the item with the highest ground-truth reward θ*^⊤ x_k*, after which the recommender collects the selected item along with its reward and finetunes the model.

Results. Fig. 1 shows the rewards over time for different methods. Results are averaged over 3 runs, and we plot the rolling average with a window size of 100 to prevent clutter. As expected, conventional RNN-based recommenders saturate at around the 500-th time step, while all REN variants successfully achieve nearly optimal rewards in the end. One interesting observation is that REN variants obtain rewards lower than the Random baseline at the beginning, meaning that they sacrifice immediate rewards to perform exploration in exchange for long-term rewards.

[Figure 2: Ablation study on different terms of REN, showing reward over time on SYN-S and SYN-L for REN-1,2, REN-1,3, and the full REN-1,2,3 (i.e., REN-G), with Oracle as reference.]

Ablation Study. Fig. 2 shows the rewards over time for REN-G (i.e., REN-1,2,3), REN-1,2, and REN-1,3 on SYN-S and SYN-L. We observe that REN-1,2, with only the relevance (first) and diversity (second) terms of Eqn. 5, saturates prematurely on SYN-S. On the other hand, the reward of REN-1,3, with only the relevance (first) and uncertainty (third) terms, barely increases over time on SYN-L. In contrast, the full REN-G works on both SYN-S and SYN-L. This is because, without the uncertainty term, REN-1,2 fails to effectively choose items with uncertain embeddings to explore, while REN-1,3 ignores diversity in the latent space and tends to explore items that have rarely been recommended; such exploration directly in the item space only works when the number of items is small, e.g., on SYN-S.

Hyperparameters. For the base models GRU4Rec, TCN, and HRNN, we use identical network architectures and hyperparameters whenever possible, following (Hidasi et al. 2016; Bai, Kolter, and Koltun 2018; Ma et al. 2020). Each RNN consists of an encoding layer, a core RNN layer, and a decoding layer. We set the number of hidden neurons to 32 for all models, including REN variants. Fig. 1 in the Supplement shows REN-G's performance for different λ_d (note that we fix λ_u = 10 λ_d) on SYN-S, SYN-M, and SYN-L. We observe stable REN performance across a wide range of λ_d. As expected, REN-G's performance is closer to GRU4Rec when λ_d is small.
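To make the protocol under "Simulated Environments" above concrete, a minimal sketch of the interaction loop is given below (ours; NumPy). The `recommend_fn` and `finetune_fn` callables are placeholders for whichever recommender (e.g., a REN variant scored as sketched earlier) and update routine are used; they are our assumptions, not the paper's implementation.

```python
import numpy as np

def run_simulation(users, items, recommend_fn, finetune_fn, n_steps=900,
                   slate_size=4, seed=0):
    """Simulated environment loop: pick a random user, recommend a slate, let the
    user click the item with the highest ground-truth reward, then finetune.

    users, items:  ground-truth latent vectors, shapes (n_users, d) and (n_items, d)
    recommend_fn:  (user_id, history) -> indices of recommended items (the model)
    finetune_fn:   (user_id, item_id, reward) -> None (model update; placeholder)
    """
    rng = np.random.default_rng(seed)
    histories = [[] for _ in range(len(users))]
    collected_rewards = []
    for _ in range(n_steps):
        u = rng.integers(len(users))                         # random user
        slate = recommend_fn(u, histories[u])[:slate_size]   # model's top items
        truth = items[slate] @ users[u]                       # ground-truth rewards
        chosen = slate[int(np.argmax(truth))]                 # user picks the best
        reward = float(items[chosen] @ users[u])
        histories[u].append(chosen)
        finetune_fn(u, chosen, reward)
        collected_rewards.append(reward)
    return np.asarray(collected_rewards)
```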
Real-World Experiments

MovieLens-1M. We use MovieLens-1M (Harper and Konstan 2016), containing 3,900 movies and 6,040 users, with an experiment setting similar to the one described in the Experiment Setup section. Each user has 120 interactions, and we follow the joint learning and exploration procedure described there to evaluate all methods (more details in the Supplement). All models recommend 10 items at each round for a chosen user, and precision@10 is used as the reward. Fig. 3(left) shows the rewards over time averaged over all 6,040 users. As expected, REN variants with different base models are able to achieve higher long-term rewards compared to their non-REN counterparts.

[Figure 3: Rewards (precision@10, MRR, and recall@100, respectively) over time on MovieLens-1M (left), Trivago (middle), and Netflix (right). One time step represents 10 recommendations to a user, one hour of data, and 100 recommendations to a user for MovieLens-1M, Trivago, and Netflix, respectively.]

[Figure 4: Rewards over time on Netflix for HRNN, REN-H, Oracle, and Random. One time step represents 100 recommendations to a user.]

Trivago. We also evaluate the proposed methods on Trivago (more details are available at https://recsys.trivago.cloud/challenge/dataset/), a hotel recommendation dataset with 730,803 users, 926,457 items, and 910,683 interactions. We use a subset with 57,778 users, 387,348 items, and 108,713 interactions, and slice the data into M = 48 one-hour time intervals for the online experiment (see the Supplement for details on data pre-processing). Different from MovieLens-1M, Trivago has impression data available: at each time step, besides which item is clicked by the user, we also know which 25 items were shown to the user. Such information makes the online evaluation more realistic, as we now know the ground-truth feedback if an arbitrary subset of the 25 items is presented to the user. At each time step of the online experiments, all methods choose 10 items from the 25 impressions to recommend to the current user and collect the feedback on these 10 items as data for finetuning. We pretrain the model using all 25 items from the first 13 hours before starting the online evaluation. Fig. 3(middle) shows the mean reciprocal rank (MRR), the official metric of the RecSys Challenge, for different methods. As expected, the baseline RNNs (e.g., GRU4Rec) suffer from a drastic drop in rewards because agents are allowed to recommend only 10 items and they choose to focus only on relevance; this inevitably ignores valuable items and harms accuracy. In contrast, REN variants (e.g., REN-G) can effectively balance relevance and exploration for the 10 recommended items at each time step, achieving higher long-term rewards. Interestingly, we also observe that REN variants have better stability in performance compared to the RNN baselines.

Netflix. Finally, we also use Netflix (https://www.kaggle.com/netflix-inc/netflix-prize-data) to evaluate how REN performs in the slate recommendation setting without finetuning at each time step, i.e., skipping Step (3) of the joint learning procedure in the Experiment Setup section. We pretrain REN on data from half of the users and evaluate on the other half. At each time step, REN generates 100 mutually diversified items for one slate following Eqn. 5, with p_{k,t} updated after every item generation. Fig. 3(right) shows recall@100 as the reward for different methods, demonstrating REN's promising exploration ability when no finetuning is allowed (more results in the Supplement). Fig. 4(left) shows similar trends with recall@100 as the reward on the same holdout item set, indicating that the collected set contributes to building better user embedding models. Fig. 4(middle) shows that the additional exploration power comes without significant harm to the user's immediate rewards on the exploration set, where the recommendations are served. In fact, we used a relatively large exploration coefficient, λ_d = λ_u = 0.005, which starts to affect recommendation results at the sixth position. With additional hyperparameter tuning, we found that to achieve better rewards on the exploration set, we may choose smaller λ_d = 0.0007 and λ_u = 0.0008. Fig. 4(right) then shows significantly higher recalls, close to the oracle performance, where all of the users' histories are known and used as inputs to predict the top-100 personalized recommendations (the gap between the oracle and 100% recall lies in the model approximation errors). Note that, for fair presentation of the tuned results, we switched the exploration set and the holdout set and used a different test user group, consisting of 1,543 users. We believe that the tuned results generalize to new users and items, but we also note that the Netflix dataset still has a significant popularity bias, and we therefore recommend larger exploration coefficients for real online systems. The inference cost is 175 milliseconds to pick the top-100 items from 8,000 evaluation items; it includes 100 sequential linear system solutions with 50 embedding dimensions, which can be further improved by selecting multiple items at a time during slate generation.
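To illustrate the sequential slate generation used in the Netflix experiment, where p_{k,t} is updated after every picked item, a minimal sketch (ours; NumPy) is given below. It greedily grows the slate and adds each chosen embedding to A_t, so later picks are pushed toward latent directions not yet covered; the function name and the uncertainty input are our own assumptions, not the paper's code.

```python
import numpy as np

def generate_slate(theta_t, item_means, item_sigma_inf, history_means,
                   slate_size=100, lambda_d=0.005, lambda_u=0.005):
    """Greedy slate generation following Eqn. 5, updating A_t after each pick."""
    d = theta_t.shape[0]
    A_t = np.eye(d) + history_means.T @ history_means
    relevance = item_means @ theta_t
    chosen = []
    available = np.ones(len(item_means), dtype=bool)
    for _ in range(slate_size):
        A_inv = np.linalg.inv(A_t)
        diversity = np.sqrt(np.einsum('kd,de,ke->k', item_means, A_inv, item_means))
        scores = relevance + lambda_d * diversity + lambda_u * item_sigma_inf
        scores[~available] = -np.inf                 # never pick an item twice
        k = int(np.argmax(scores))
        chosen.append(k)
        available[k] = False
        # Treat the picked item as part of the "history", so the next pick is
        # diversified against it as well as against the user's past items.
        A_t += np.outer(item_means[k], item_means[k])
    return chosen
```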
Conclusion

We propose the REN framework to balance relevance and exploration during recommendation. Our theoretical analysis and empirical results demonstrate the importance of considering uncertainty in the learned representations for effective exploration and for improving long-term rewards. We provide an upper confidence bound on the estimated rewards, along with its corresponding regret bound, and show that REN can achieve the same rate-optimal sublinear regret even in the presence of uncertain representations. Future work could investigate the possibility of learned uncertainty in representations, extension to Thompson sampling, nonlinearity of the reward w.r.t. θ_t, and applications beyond recommender systems, e.g., robotics and conversational agents.

Acknowledgement

The authors thank Tim Januschowski, Alex Smola, the AWS AI Personalize and ML Forecast teams, as well as the reviewers/SPC/AC for the constructive comments that improved the paper. We are also grateful for the RecoBandits package provided by Bharathan Blaji, Saurabh Gupta, and Jing Wang to facilitate the simulated environments. HW is partially supported by NSF Grant IIS-2127918 and an Amazon Faculty Research Award.

References

Agarwal, A.; Hsu, D. J.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. E. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. In ICML, 1638-1646.
Antikacioglu, A.; and Ravi, R. 2017. Post processing recommender systems for diversity. In KDD, 707-716.
Auer, P. 2002. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 3: 397-422.
Bai, S.; Kolter, J. Z.; and Koltun, V. 2018. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. CoRR, abs/1803.01271.
Belletti, F.; Chen, M.; and Chi, E. H. 2019. Quantifying Long Range Dependence in Language and User Behavior to Improve RNNs. In KDD, 1317-1327.
Bello, I.; Kulkarni, S.; Jain, S.; Boutilier, C.; Chi, E.; Eban, E.; Luo, X.; Mackey, A.; and Meshi, O. 2018. Seq2Slate: Re-ranking and slate optimization with RNNs. arXiv preprint arXiv:1810.02019.
Chen, M.; Beutel, A.; Covington, P.; Jain, S.; Belletti, F.; and Chi, E. H. 2019. Top-k off-policy correction for a REINFORCE recommender system. In WSDM, 456-464.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP, 1724-1734.
Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. In AISTATS, 208-214.
Ding, H.; Ma, Y.; Deoras, A.; Wang, Y.; and Wang, H. 2021. Zero-Shot Recommender Systems. arXiv preprint arXiv:2105.08318.
Fang, H.; Zhang, D.; Shu, Y.; and Guo, G. 2019. Deep Learning for Sequential Recommendation: Algorithms, Influential Factors, and Evaluations. arXiv preprint arXiv:1905.01997.
Foster, D. J.; Agarwal, A.; Dudík, M.; Luo, H.; and Schapire, R. E. 2018. Practical Contextual Bandits with Regression Oracles. In ICML, 1534-1543.
Friedland, S.; and Gaubert, S. 2013. Submodular spectral functions of principal submatrices of a hermitian matrix, extensions and applications. Linear Algebra and its Applications, 438(10): 3872-3884.
Gupta, S.; Wang, H.; Lipton, Z.; and Wang, Y. 2021. Correcting exposure bias for link recommendation. In ICML.
Harper, F. M.; and Konstan, J. A. 2016. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4): 19.
Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2016. Session-based Recommendations with Recurrent Neural Networks. In ICLR.
Hiranandani, G.; Singh, H.; Gupta, P.; Burhanuddin, I. A.; Wen, Z.; and Kveton, B. 2019. Cascading Linear Submodular Bandits: Accounting for Position Bias and Diversity in Online Learning to Rank. In UAI, 248.
Jun, K.-S.; Willett, R.; Wright, S.; and Nowak, R. 2019. Bilinear Bandits with Low-rank Structure. In ICML, 3163-3172.
Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In ICLR.
Korda, N.; Szorenyi, B.; and Li, S. 2016. Distributed clustering of linear bandits in peer to peer networks. In ICML, 1301-1309.
Kveton, B.; Szepesvári, C.; Rao, A.; Wen, Z.; Abbasi-Yadkori, Y.; and Muthukrishnan, S. 2017. Stochastic Low-Rank Bandits. CoRR, abs/1712.04644.
Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; and Ma, J. 2017. Neural attentive session-based recommendation. In CIKM, 1419-1428.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW, 661-670.
Li, S.; Karatzoglou, A.; and Gentile, C. 2016. Collaborative Filtering Bandits. In SIGIR, 539-548.
Li, X.; and She, J. 2017. Collaborative Variational Autoencoder for Recommender Systems. In KDD, 305-314.
Liu, Q.; Zeng, Y.; Mokhosi, R.; and Zhang, H. 2018. STAMP: Short-term attention/memory priority model for session-based recommendation. In KDD, 1831-1839.
Ma, Y.; Narayanaswamy, M. B.; Lin, H.; and Ding, H. 2020. Temporal-Contextual Recommendation in Real-Time. In KDD.
Mahadik, K.; Wu, Q.; Li, S.; and Sabne, A. 2020. Fast distributed bandits for online recommendation systems. In SC, 1-13.
Mi, L.; Wang, H.; Tian, Y.; and Shavit, N. 2019. Training-Free Uncertainty Estimation for Neural Networks. arXiv e-prints, arXiv-1910.
Nguyen, T. T.; Hui, P.-M.; Harper, F. M.; Terveen, L.; and Konstan, J. A. 2014. Exploring the filter bubble: the effect of using recommender systems on content diversity. In WWW, 677-686.
Osband, I.; Van Roy, B.; and Wen, Z. 2016. Generalization and exploration via randomized value functions. In ICML, 2377-2386.
Quadrana, M.; Karatzoglou, A.; Hidasi, B.; and Cremonesi, P. 2017. Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks. In RecSys, 130-137.
Salakhutdinov, R.; Mnih, A.; and Hinton, G. E. 2007. Restricted Boltzmann machines for collaborative filtering. In ICML, 791-798.
Shi, J.-C.; Yu, Y.; Da, Q.; Chen, S.-Y.; and Zeng, A.-X. 2019. Virtual-Taobao: Virtualizing real-world online retail environment for reinforcement learning. In AAAI, 4902-4909.
Tang, J.; Belletti, F.; Jain, S.; Chen, M.; Beutel, A.; Xu, C.; and Chi, E. H. 2019. Towards neural mixture recommender for long range dependent user sequences. In WWW, 1782-1793.
Tschiatschek, S.; Djolonga, J.; and Krause, A. 2016. Learning Probabilistic Submodular Diversity Models Via Noise Contrastive Estimation. In AISTATS, 770-779.
van den Oord, A.; Dieleman, S.; and Schrauwen, B. 2013. Deep content-based music recommendation. In NIPS, 2643-2651.
Vanchinathan, H. P.; Nikolic, I.; Bona, F. D.; and Krause, A. 2014. Explore-exploit in top-N recommender systems via Gaussian processes. In RecSys, 225-232.
Wang, H. 2017. Bayesian Deep Learning for Integrated Intelligence: Bridging the Gap between Perception and Inference. Ph.D. thesis, Hong Kong University of Science and Technology.
Wang, H.; Shi, X.; and Yeung, D.-Y. 2015. Relational stacked denoising autoencoder for tag recommendation. In AAAI, 3052-3058.
Wang, H.; Shi, X.; and Yeung, D.-Y. 2016. Collaborative recurrent autoencoder: Recommend while learning to fill in the blanks. In NIPS, 415-423.
Wang, H.; Wang, N.; and Yeung, D.-Y. 2015. Collaborative deep learning for recommender systems. In KDD, 1235-1244.
Wang, H.; Xingjian, S.; and Yeung, D.-Y. 2016. Natural-parameter networks: A class of probabilistic neural networks. In NIPS, 118-126.
Wang, H.; and Yeung, D.-Y. 2016. Towards Bayesian deep learning: A framework and some existing methods. TKDE, 28(12): 3395-3408.
Wang, H.; and Yeung, D.-Y. 2020. A Survey on Bayesian Deep Learning. ACM Computing Surveys (CSUR), 53(5): 1-37.
Wilhelm, M.; Ramanathan, A.; Bonomo, A.; Jain, S.; Chi, E. H.; and Gillenwater, J. 2018. Practical diversified recommendations on YouTube with determinantal point processes. In CIKM, 2165-2173.
Wu, S.; Tang, Y.; Zhu, Y.; Wang, L.; Xie, X.; and Tan, T. 2019. Session-based recommendation with graph neural networks. In AAAI, 346-353.
Yue, Y.; and Guestrin, C. 2011. Linear Submodular Bandits and their Application to Diversified Retrieval. In NIPS, 2483-2491.
Zhou, D.; Li, L.; and Gu, Q. 2019. Neural Contextual Bandits with Upper Confidence Bound-Based Exploration. arXiv preprint arXiv:1911.04462.