Context Uncertainty in Contextual Bandits with Applications to Recommender Systems

Hao Wang [1], Yifei Ma [2], Hao Ding [2], Yuyang Wang [2]
[1] Department of Computer Science, Rutgers University   [2] AWS AI Lab
hoguewang@gmail.com, {yifeim,haodin,yuyawang}@amazon.com

Abstract

Recurrent neural networks have proven effective in modeling sequential user feedback for recommender systems. However, they usually focus solely on item relevance and fail to effectively explore diverse items for users, therefore harming the system performance in the long run. To address this problem, we propose a new type of recurrent neural network, dubbed recurrent exploration networks (REN), to jointly perform representation learning and effective exploration in the latent space. REN tries to balance relevance and exploration while taking into account the uncertainty in the representations. Our theoretical analysis shows that REN can preserve the rate-optimal sublinear regret even when there exists uncertainty in the learned representations. Our empirical study demonstrates that REN can achieve satisfactory long-term rewards on both synthetic and real-world recommendation datasets, outperforming state-of-the-art models.

Introduction

Modeling and predicting sequential user feedback is a core problem in modern e-commerce recommender systems. In this regard, recurrent neural networks (RNN) have shown great promise since they can naturally handle sequential data (Hidasi et al. 2016; Quadrana et al. 2017; Belletti, Chen, and Chi 2019; Ma et al. 2020). While these RNN-based models can effectively learn representations in the latent space to achieve satisfactory immediate recommendation accuracy, they typically focus solely on relevance and fall short of effective exploration in the latent space, leading to poor performance in the long run. For example, a recommender system may keep recommending action movies to a user once it learns that she likes such movies. This may increase immediate rewards, but the lack of exploration in other movie genres can certainly be detrimental to long-term rewards.

So, how does one effectively explore diverse items for users while retaining the representation power offered by RNN-based recommenders? We note that the learned representations in the latent space are crucial for these models' success. Therefore we propose recurrent exploration networks (REN) to explore diverse items in the latent space learned by RNN-based models. REN tries to balance relevance and exploration during recommendations using the learned representations.

One roadblock is that effective exploration relies heavily on well-learned representations, which in turn require sufficient exploration; this is a chicken-and-egg problem. In a case where the RNN learns unreasonable representations (e.g., all items have the same representation), exploration in the latent space is meaningless. To address this problem, we enable REN to also take into account the uncertainty of the learned representations during recommendations. Essentially, items whose representations have higher uncertainty can be explored more often. Such a model can be seen as a contextual bandit algorithm that is aware of the uncertainty for each context. Our contributions are as follows:

1. We propose REN as a new type of RNN to balance relevance and exploration during recommendation, yielding satisfactory long-term rewards.
2. Our theoretical analysis shows that there is an upper confidence bound related to the uncertainty in learned representations. With such a bound implemented in the algorithm, REN can achieve the same rate-optimal sublinear regret. To the best of our knowledge, we are the first to study regret bounds under "context uncertainty".
3. Experiments of joint learning and exploration on both synthetic and real-world temporal datasets show that REN significantly improves long-term rewards over state-of-the-art RNN-based recommenders.

Related Work

Deep Learning for Recommender Systems. Deep learning (DL) has been playing a key role in modern recommender systems (Salakhutdinov, Mnih, and Hinton 2007; van den Oord, Dieleman, and Schrauwen 2013; Wang, Wang, and Yeung 2015; Wang, Shi, and Yeung 2015, 2016; Li and She 2017; Chen et al. 2019; Fang et al. 2019; Tang et al. 2019; Ding et al. 2021; Gupta et al. 2021). (Salakhutdinov, Mnih, and Hinton 2007) uses restricted Boltzmann machines to perform collaborative filtering in recommender systems. Collaborative deep learning (CDL) (Wang, Wang, and Yeung 2015; Wang, Shi, and Yeung 2016; Li and She 2017) is devised as a Bayesian deep learning model (Wang and Yeung 2016, 2020; Wang 2017) to significantly improve recommendation performance.
In terms of sequential (or session-based) recommender systems (Hidasi et al. 2016; Quadrana et al. 2017; Bai, Kolter, and Koltun 2018; Li et al. 2017; Liu et al. 2018; Wu et al. 2019; Ma et al. 2020), GRU4Rec (Hidasi et al. 2016) was first proposed to use gated recurrent units (GRU) (Cho et al. 2014), an RNN variant with gating mechanism, for recommendation. Since then, follow-up works such as hierarchical GRU (Quadrana et al. 2017), temporal convolutional networks (TCN) (Bai, Kolter, and Koltun 2018), and hierarchical RNN (HRNN) (Ma et al. 2020) have tried to achieve improvement in accuracy with the help of cross-session information (Quadrana et al. 2017), causal convolutions (Bai, Kolter, and Koltun 2018), as well as control signals (Ma et al. 2020). We note that our REN does not assume specific RNN architectures (e.g., GRU or TCN) and is therefore compatible with different RNN-based (or more generally DL-based) models, as shown in later sections.

Contextual Bandits. Contextual bandit algorithms such as LinUCB (Li et al. 2010) and its variants (Yue and Guestrin 2011; Agarwal et al. 2014; Li, Karatzoglou, and Gentile 2016; Kveton et al. 2017; Foster et al. 2018; Korda, Szorenyi, and Li 2016; Mahadik et al. 2020; Zhou, Li, and Gu 2019) have been proposed to tackle the exploitation-exploration trade-off in recommender systems and successfully improve upon context-free bandit algorithms (Auer 2002). Similar to (Auer 2002), theoretical analysis shows that LinUCB variants could achieve a rate-optimal regret bound (Chu et al. 2011). However, these methods either assume observed context (Zhou, Li, and Gu 2019) or are incompatible with neural networks (Li, Karatzoglou, and Gentile 2016; Yue and Guestrin 2011). In contrast, REN as a contextual bandit algorithm runs in the latent space and assumes user models based on RNN; therefore it is compatible with state-of-the-art RNN-based recommender systems.

Diversity-Inducing Models. Various works have focused on inducing diversity in recommender systems (Nguyen et al. 2014; Antikacioglu and Ravi 2017; Wilhelm et al. 2018; Bello et al. 2018). Usually such a system consists of a submodular function, which measures the diversity among items, and a relevance prediction model, which predicts relevance between users and items. Examples of submodular functions include the probabilistic coverage function (Hiranandani et al. 2019) and facility location diversity (FILD) (Tschiatschek, Djolonga, and Krause 2016), while relevance prediction models can be Gaussian processes (Vanchinathan et al. 2014), linear regression (Yue and Guestrin 2011), etc. These models typically focus on improving diversity among recommended items in a slate at the cost of accuracy. In contrast, REN's goal is to optimize for long-term rewards through improving diversity between previous and recommended items. We include some slate generation in our real-data experiments for completeness.
Recurrent Exploration Networks

In this section we first describe the general notation and how an RNN can be used for recommendation, briefly review determinantal point processes (DPP) as a diversity-inducing model as well as their connection to exploration in contextual bandits, and then introduce our proposed REN framework.

Notation and RNN-Based Recommender Systems

Notation. We consider the problem of sequential recommendation where the goal is to predict the item a user interacts with (e.g., clicks or purchases) at time t, denoted as e_{k_t}, given her previous interaction history E_t = [e_{k_tau}]_{tau=1}^{t-1}. Here k_t is the index of the item at time t, e_{k_t} ∈ {0, 1}^K is a one-hot vector indicating an item, and K is the total number of items. We denote the item embedding (encoding) for e_{k_t} as x_{k_t} = f_e(e_{k_t}), where f_e(·) is the encoder, a part of the RNN. Correspondingly we have X_t = [x_{k_tau}]_{tau=1}^{t-1}. Strictly speaking, in an online setting where the model updates at every time step t, x_k also changes over time; in this section we use x_k as a shorthand for x_{t,k} for simplicity. We use ||z||_∞ = max_i |z^{(i)}| to denote the L∞ norm, where the superscript (i) denotes the i-th entry of the vector z.

RNN-Based Recommender Systems. Given the interaction history E_t, the RNN generates the user embedding at time t as θ_t = R([x_{k_tau}]_{tau=1}^{t-1}), where x_{k_tau} = f_e(e_{k_tau}) ∈ R^d, and R(·) is the recurrent part of the RNN. Assuming tied weights, the score for each candidate item is then computed as p_{k,t} = x_k^T θ_t. As the last step, the recommender system recommends the items with the highest scores to the user. Note that the subscript k indexes the items and is equivalent to an 'action', usually denoted as a, in the context of bandit algorithms.
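To make the tied-weight scoring above concrete, here is a minimal PyTorch sketch (our illustration, not the authors' code) of an RNN recommender that encodes the interaction history into θ_t and scores every candidate item as p_{k,t} = x_k^T θ_t; the class name, dimensions, and example indices are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class TiedGRURecommender(nn.Module):
    """GRU4Rec-style scorer: encoder f_e is an item embedding table, R(.) is a GRU,
    and the same embedding table is reused (tied weights) to score candidates."""
    def __init__(self, num_items: int, emb_dim: int = 32):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, emb_dim)        # f_e(.)
        self.rnn = nn.GRU(emb_dim, emb_dim, batch_first=True)   # R(.)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, t-1) item indices -> scores: (batch, num_items)
        x = self.item_emb(history)        # X_t = [x_{k_tau}], shape (batch, t-1, d)
        _, h = self.rnn(x)                # h: (1, batch, d)
        theta_t = h[-1]                   # user embedding theta_t = R(X_t)
        return theta_t @ self.item_emb.weight.T   # p_{k,t} = x_k^T theta_t for all k

model = TiedGRURecommender(num_items=5, emb_dim=8)
scores = model(torch.tensor([[0, 2, 4]]))          # one user, 3 past interactions
top_items = scores.topk(k=3, dim=-1).indices       # recommend highest-scoring items
```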
Determinantal Point Processes for Diversity and Exploration

Determinantal point processes (DPP) consider an item selection problem where each item is represented by a feature vector x_t. Diversity is achieved by picking a subset of items to cover the maximum volume spanned by the items, measured by the log-determinant of the corresponding kernel matrix, ker(X_t) = \log\det(I_K + X_t X_t^\top), where I_K is included to prevent singularity. Intuitively, DPP penalizes colinearity, which is an indicator that the topics of one item are already covered by the other topics in the full set. The log-determinant of a kernel matrix is also a submodular function (Friedland and Gaubert 2013), which implies a (1 - 1/e)-optimal guarantee for greedy solutions. The greedy algorithm for DPP via the matrix determinant lemma is

    \mathrm{argmax}_k \; \log\det(I_d + X_t^\top X_t + x_k x_k^\top) - \log\det(I_d + X_t^\top X_t)    (1)
  = \mathrm{argmax}_k \; \log(1 + x_k^\top (I_d + X_t^\top X_t)^{-1} x_k)    (2)
  = \mathrm{argmax}_k \; \sqrt{x_k^\top (I_d + X_t^\top X_t)^{-1} x_k}.    (3)

Interestingly, note that \sqrt{x_k^\top (I_d + X_t^\top X_t)^{-1} x_k} has the same form as the confidence interval in LinUCB (Li et al. 2010), a commonly used contextual bandit algorithm to boost exploration and achieve long-term rewards, suggesting a connection between diversity and long-term rewards (Yue and Guestrin 2011). Intuitively, this makes sense in recommender systems: encouraging diversity relative to the user history (as well as diversity within a slate of recommendations, as in our experiments) naturally explores user interests previously unknown to the model, leading to much higher long-term rewards, as shown in the Experiments section.
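As a concrete reading of Eqn. 3, the following numpy sketch (ours, not the paper's code) greedily picks the candidate whose feature vector adds the most volume to the history, using exactly the LinUCB-style width \sqrt{x_k^\top (I_d + X_t^\top X_t)^{-1} x_k}; the function name and random data are illustrative assumptions.

```python
import numpy as np

def greedy_dpp_pick(history: np.ndarray, candidates: np.ndarray) -> int:
    """history: (t-1, d) features of past items X_t; candidates: (K, d) item features.
    Returns the index maximizing sqrt(x_k^T (I_d + X_t^T X_t)^{-1} x_k) (Eqn. 3)."""
    d = candidates.shape[1]
    A = np.eye(d) + history.T @ history              # I_d + X_t^T X_t
    A_inv = np.linalg.inv(A)
    # Per-candidate bonus; identical in form to LinUCB's confidence width.
    bonus = np.sqrt(np.einsum("kd,de,ke->k", candidates, A_inv, candidates))
    return int(np.argmax(bonus))

rng = np.random.default_rng(0)
X_t = rng.normal(size=(6, 8))                        # 6 past items, d = 8
items = rng.normal(size=(28, 8))                     # 28 candidate items
print(greedy_dpp_pick(X_t, items))
```

Adding the chosen vector to `history` and repeating yields a diversified subset; the same quantity reappears below as REN's exploration term.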
Recurrent Exploration Networks

Exploration Term. Based on the intuition above, we can modify the user-item score p_{k,t} = x_k^T θ_t to include a diversity (exploration) term, leading to the new score

    p_{k,t} = x_k^\top \theta_t + \lambda_d \sqrt{x_k^\top (I_d + X_t^\top X_t)^{-1} x_k},    (4)

where the first term is the relevance score and the second term is the exploration score (measuring diversity between previous and recommended items). θ_t = R(X_t) = R([x_{k_tau}]_{tau=1}^{t-1}) is the RNN's hidden state at time t representing the user embedding. The hyperparameter λ_d aims to balance the two terms.

Uncertainty Term for Context Uncertainty. At first blush, given the user history, the system using Eqn. 4 will recommend items that are (1) relevant to the user's interest and (2) diverse from the user's previous items. However, this only works when the item embeddings x_k are correctly learned. Unfortunately, the quality of the learned item embeddings, in turn, relies heavily on the effectiveness of exploration, leading to a chicken-and-egg problem. To address this problem, one also needs to consider the uncertainty of the learned item embeddings. Assuming the item embedding x_k ~ N(μ_k, Σ_k), where Σ_k = diag(σ_k^2), we have the final score for REN:

    p_{k,t} = \mu_k^\top \theta_t + \lambda_d \sqrt{\mu_k^\top (I_d + D_t^\top D_t)^{-1} \mu_k} + \lambda_u \|\sigma_k\|_\infty,    (5)

where θ_t = R(D_t) = R([μ_{k_tau}]_{tau=1}^{t-1}) and D_t = [μ_{k_tau}]_{tau=1}^{t-1}. The term σ_k quantifies the uncertainty for each dimension of x_k, meaning that items whose embeddings REN is uncertain about are more likely to be recommended. Therefore, with the third term, REN can naturally balance among relevance, diversity (relative to the user history), and uncertainty during exploration.

Algorithm 1: Recurrent Exploration Networks (REN)
 1: Input: λ_d, λ_u, initialized REN model with the encoder, i.e., R(·) and f_e(·).
 2: for t = 1, 2, ..., T do
 3:   Obtain item embeddings from REN:
 4:     μ_{k_tau} ← f_e(e_{k_tau}) for all tau ∈ {1, 2, ..., t - 1}.
 5:   Obtain the current user embedding from REN:
 6:     θ_t ← R(D_t).
 7:   Compute A_t ← I_d + Σ_{tau∈Ψ_t} μ_{k_tau} μ_{k_tau}^T.
 8:   Obtain candidate items' embeddings from REN:
 9:     μ_k ← f_e(e_k), where k ∈ [K].
10:   Obtain candidate items' uncertainty estimates σ_k, where k ∈ [K].
11:   for k ∈ [K] do
12:     Obtain the score for item k at time t:
13:       p_{k,t} ← μ_k^T θ_t + λ_d sqrt(μ_k^T A_t^{-1} μ_k) + λ_u ||σ_k||_∞.
14:   end
15:   Recommend item k_t ← argmax_k p_{k,t} and collect user feedback.
16:   Update the REN model R(·) and f_e(·) using the collected user feedback.
17: end

Putting It All Together. Algorithm 1 shows the overview of REN. Note that the difference between REN and traditional RNN-based recommenders lies only in the inference stage. During training (Line 16 of Algorithm 1), one can train REN with only the relevance term, using models such as GRU4Rec and HRNN. In the experiments, we use the uncertainty estimates diag(σ_k) = (1/√n_k) I_d, where n_k is item k's total number of impressions (i.e., the number of times item k has been recommended) for all users. The intuition is that the more frequently item k is recommended, the more frequently its embedding x_k gets updated, and the faster σ_k decreases.[1] Our preliminary experiments show that 1/√n_k does decrease at the rate of O(1/√t), meaning that the assumption in Lemma 4 is satisfied. From the Bayesian perspective, 1/√n_k may not accurately reflect the uncertainty of the learned x_k, which is a limitation of our model. In principle, one can learn σ_k from data using the reparameterization trick (Kingma and Welling 2014) with a Gaussian prior on x_k and examine whether σ_k satisfies the assumption in Lemma 4; this would be interesting future work.

[1] There are some caveats in general. diag(σ_k) ∝ I_d assumes that all coordinates of x shrink at the same rate. However, the REN exploration mechanism associates n_k with the total variance of the features of an item. This may not ensure that all feature dimensions are equally explored. See (Jun et al. 2019) for a different algorithm that analyzes exploration of the low-rank feature space.

Linearity in REN. REN only needs a linear bandit model; REN's output x_k^T θ_t is linear w.r.t. θ_t and x_k. Note that NeuralUCB (Zhou, Li, and Gu 2019) is a powerful nonlinear extension of LinUCB, i.e., its output is nonlinear w.r.t. θ and x_k. Extending REN's output from x_k^T θ_t to a nonlinear function f(x_k, θ_t) as in NeuralUCB is also interesting future work.[2]

[2] In other words, we did not fully explain why x can be shared between the nonlinear RNN and the uncertainty bounds based on linear models. On the other hand, we did observe promising empirical results, which may encourage interested readers to dive deeper into different theoretical analyses.

Beyond RNN. Note that our methods and theory go beyond RNN-based models and can be naturally extended to any latent factor model, including transformers, MLPs, and matrix factorization. The key is the user embedding θ_t = R(X_t), which can be instantiated with an RNN, a transformer, or a matrix-factorization model.
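The scoring step of Algorithm 1 (Line 13) can be sketched in a few lines of numpy; this is our own illustration rather than the released implementation, and it hard-codes the impression-count heuristic diag(σ_k) = (1/√n_k) I_d described above. The default λ values are placeholders.

```python
import numpy as np

def ren_scores(theta_t, mu_history, mu_candidates, n_impressions,
               lam_d=0.01, lam_u=0.01 * np.sqrt(10.0)):
    """theta_t: (d,) user embedding; mu_history: (t-1, d) past item embeddings D_t;
    mu_candidates: (K, d) candidate embeddings; n_impressions: (K,) counts n_k."""
    d = mu_candidates.shape[1]
    A_t = np.eye(d) + mu_history.T @ mu_history                 # I_d + D_t^T D_t
    A_inv = np.linalg.inv(A_t)
    relevance = mu_candidates @ theta_t                         # mu_k^T theta_t
    diversity = np.sqrt(np.einsum("kd,de,ke->k",
                                  mu_candidates, A_inv, mu_candidates))
    # Impression-count heuristic: ||sigma_k||_inf = 1/sqrt(n_k); unseen items get 1,
    # matching the assumption ||sigma_{1,k}||_inf = 1 used later in Lemma 4.
    sigma_inf = 1.0 / np.sqrt(np.maximum(n_impressions, 1))
    return relevance + lam_d * diversity + lam_u * sigma_inf    # Eqn. 5

rng = np.random.default_rng(1)
scores = ren_scores(rng.normal(size=8), rng.normal(size=(10, 8)),
                    rng.normal(size=(28, 8)), rng.integers(0, 50, size=28))
recommended = np.argsort(scores)[::-1][:4]                      # top-4 items
```

Dropping the last term gives the ablated variant later called REN-1,2, and dropping the diversity term gives REN-1,3.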
Theoretical Analysis

With REN's connection to contextual bandits, we can prove that with proper λ_d and λ_u, Eqn. 5 is actually an upper confidence bound that leads to long-term rewards with a rate-optimal regret bound.

Reward Uncertainty versus Context Uncertainty. Note that, unlike existing works which primarily consider the randomness from the reward, we take into consideration the uncertainty resulting from the context (content) (Mi et al. 2019; Wang, Xingjian, and Yeung 2016), i.e., context uncertainty. In CDL (Wang, Wang, and Yeung 2015; Wang, Shi, and Yeung 2016), it is shown that such content information is crucial in DL-based recommender systems, and so is the associated uncertainty. More specifically, existing works assume a deterministic x and only assume randomness in the reward, i.e., they assume that r = x^T θ + ε, and therefore r's randomness is independent of x. The problem with this formulation is that it treats x as deterministic, so the model only has a point estimate of the item embedding x but no uncertainty estimate for it. We find that such uncertainty estimation is crucial for exploration: if the model is uncertain about x, it can then explore more on the corresponding item.

To facilitate analysis, we follow common practice (Auer 2002; Chu et al. 2011) and divide the procedure of REN into 'BaseREN' (Algorithm 2) and 'SupREN' stages correspondingly. Essentially, SupREN introduces S = ln T levels of elimination (with s as an index) to filter out low-quality items and ensures that the assumption holds (see the Supplement for details of SupREN).

In this section, we first provide a high-probability bound for BaseREN with uncertain embeddings (context), and then derive an upper bound for the regret. As mentioned in the Notation paragraph, in the online setting where the model updates at every time step t, x_k also changes over time. Therefore, in this section we use x_{t,k}, μ_{t,k}, Σ_{t,k}, and σ_{t,k} in place of the earlier x_k, μ_k, Σ_k, and σ_k to be rigorous.

Assumption 1. Assume there exists an optimal θ*, with ||θ*|| ≤ 1, and x*_{t,k} such that E[r_{t,k}] = x*_{t,k}^T θ*. Further assume that there is an effective distribution N(μ_{t,k}, Σ_{t,k}) such that x*_{t,k} ~ N(μ_{t,k}, Σ_{t,k}), where Σ_{t,k} = diag(σ_{t,k}^2). Thus, the true underlying context is unavailable, but we are aided with the knowledge that it is generated by a multivariate normal with known parameters.[3]

[3] Here we omit the identifiability issue of x*_{t,k} and assume that there is a unique x*_{t,k} for clarity.
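The toy numpy snippet below (ours) illustrates what Assumption 1 means operationally: the model only sees the mean μ_{t,k}, while rewards are generated from a true context drawn around that mean; all numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)       # ||theta*|| <= 1

mu = rng.normal(size=d)                        # learned (mean) item embedding mu_{t,k}
sigma = 0.3 * np.ones(d)                       # per-dimension context uncertainty

x_star = rng.normal(loc=mu, scale=sigma)       # true context x*_{t,k} ~ N(mu, diag(sigma^2))
reward = x_star @ theta_star                   # reward generated from the true context
plug_in = mu @ theta_star                      # what a point-estimate model would predict
print(reward, plug_in)
```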
Upper Confidence Bound for Uncertain Embeddings

For simplicity, denote the item embedding (context) as x_{t,k}, where t indexes the rounds (time steps) and k indexes the items. We define

    s_{t,k} = \sqrt{\mu_{t,k}^\top A_t^{-1} \mu_{t,k}} \in \mathbb{R}_+,    D_t = [\mu_{\tau,k_\tau}]_{\tau \in \Psi_t} \in \mathbb{R}^{|\Psi_t| \times d},
    y_t = [r_{\tau,k_\tau}]_{\tau \in \Psi_t} \in \mathbb{R}^{|\Psi_t| \times 1},    A_t = I_d + D_t^\top D_t,
    b_t = D_t^\top y_t,    \hat{r}_{t,k} = \mu_{t,k}^\top \hat{\theta}_t = \mu_{t,k}^\top A_t^{-1} b_t,    (6)

where y_t is the collected user feedback. Lemma 1 below shows that with λ_d = 1 + α = 1 + \sqrt{\frac{1}{2} \ln\frac{2TK}{\delta}} and λ_u = 4\sqrt{d} + 2\sqrt{\ln\frac{TK}{\delta}}, Eqn. 5 is the upper confidence bound with high probability, meaning that Eqn. 5 upper bounds the true reward with high probability, which makes it a reasonable score for recommendations.

Lemma 1 (Confidence Bound). With probability at least 1 - 2δ/T, we have for all k ∈ [K] that

    |\hat{r}_{t,k} - x_{t,k}^{*\top} \theta^*| \le (\alpha + 1) s_{t,k} + \left(4\sqrt{d} + 2\sqrt{\ln\frac{TK}{\delta}}\right) \|\sigma_{t,k}\|_\infty,

where ||σ_{t,k}||_∞ = max_i |σ_{t,k}^{(i)}| is the L∞ norm.

The proof is in the Supplement. The upper confidence bound above provides important insight on why Eqn. 5 is reasonable as a final score to select items in Algorithm 1, as well as on the choice of the hyperparameters λ_d and λ_u.

Algorithm 2: BaseREN: Basic REN Inference at Step t
 1: Input: α, Ψ_t ⊆ {1, 2, ..., t - 1}.
 2: Obtain item embeddings from REN: μ_{tau,k_tau} ← f_e(e_{tau,k_tau}) for all tau ∈ Ψ_t.
 3: Obtain user embedding: θ_t ← R(D_t).
 4: A_t ← I_d + Σ_{tau∈Ψ_t} μ_{tau,k_tau} μ_{tau,k_tau}^T.
 5: Obtain candidate items' embeddings: μ_{t,k} ← f_e(e_{t,k}), where k ∈ [K].
 6: Obtain candidate items' uncertainty estimates σ_{t,k}, where k ∈ [K].
 7: for k ∈ [K] do
 8:   s_{t,k} ← sqrt(μ_{t,k}^T A_t^{-1} μ_{t,k}).
 9:   w_{t,k} ← (α + 1) s_{t,k} + (4√d + 2 sqrt(ln(TK/δ))) ||σ_{t,k}||_∞.
10:   r̂_{t,k} ← θ_t^T μ_{t,k}.
11: end
12: Recommend item k ← argmax_k (r̂_{t,k} + w_{t,k}).

RNN to Estimate θ_t. REN uses the RNN to approximate A_t^{-1} b_t (useful in the proof of Lemma 1) in Eqn. 6. Note that a linear RNN with tied weights and a single time step is equivalent to linear regression (LR); therefore the RNN is a more general model to estimate θ_t. Compared to LR, RNN-based recommenders can naturally incorporate new user history by incrementally updating the hidden states (θ_t in REN), without the need to solve a linear system. Interestingly, one can also see the RNN's recurrent computation as a simulation (approximation) of solving such equations via iterative updating.
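Below is a small numpy sketch (our paraphrase of Algorithm 2, not the paper's code) that computes the ridge estimate \hat{θ}_t = A_t^{-1} b_t from Eqn. 6 and the width w_{t,k} from Lemma 1, then picks argmax_k of \hat{r}_{t,k} + w_{t,k}; the toy inputs are assumptions.

```python
import numpy as np

def base_ren_step(D_t, y_t, mu_cand, sigma_cand, alpha, T, K, delta=0.1):
    """D_t: (|Psi_t|, d) past contexts; y_t: (|Psi_t|,) observed rewards;
    mu_cand: (K, d) candidate means; sigma_cand: (K, d) per-dimension uncertainties."""
    d = mu_cand.shape[1]
    A_t = np.eye(d) + D_t.T @ D_t                         # Eqn. 6
    b_t = D_t.T @ y_t
    theta_hat = np.linalg.solve(A_t, b_t)                 # theta_hat = A_t^{-1} b_t
    A_inv = np.linalg.inv(A_t)
    r_hat = mu_cand @ theta_hat                           # predicted reward r_hat_{t,k}
    s = np.sqrt(np.einsum("kd,de,ke->k", mu_cand, A_inv, mu_cand))
    lam_u = 4.0 * np.sqrt(d) + 2.0 * np.sqrt(np.log(T * K / delta))
    w = (alpha + 1.0) * s + lam_u * np.abs(sigma_cand).max(axis=1)   # Line 9
    return int(np.argmax(r_hat + w))                      # Line 12

rng = np.random.default_rng(2)
T, K, d = 1000, 28, 8
alpha = np.sqrt(0.5 * np.log(2 * T * K / 0.1))            # the alpha used in Theorem 1
k_pick = base_ren_step(rng.normal(size=(20, d)), rng.normal(size=20),
                       rng.normal(size=(K, d)), 0.2 * np.ones((K, d)), alpha, T, K)
```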
Regret Bound

Lemma 1 above provides an estimate of the reward's upper bound at time t. Based on this estimate, one natural next step is to analyze the regret after all T rounds. Formally, we define the regret of the algorithm after T rounds as

    B(T) = \sum_{t=1}^{T} r_{t,k_t^*} - \sum_{t=1}^{T} r_{t,k_t},    (7)

where k_t^* is the optimal item (action) k at round t that maximizes E[r_{t,k}] = x*_{t,k}^T θ*, and k_t is the action chosen by the algorithm at round t. Similar to (Auer 2002), SupREN calls BaseREN as a subroutine. In this subsection, we derive the regret bound for SupREN with uncertain item embeddings.

Lemma 2. With probability 1 - 2δS, for any t ∈ [T] and any s ∈ [S], we have: (1) |r̂_{t,k} - E[r_{t,k}]| ≤ w_{t,k} for any k ∈ [K], (2) k_t^* ∈ Â_s, and (3) E[r_{t,k_t^*}] - E[r_{t,k}] ≤ 2^{3-s} for any k ∈ Â_s.

Lemma 3. In BaseREN, we have: (1 + \alpha) \sum_{t \in \Psi_{T+1}} s_{t,k_t} \le 5 \cdot (1 + \alpha^2) \sqrt{d |\Psi_{T+1}|}.

Lemma 4. Assuming ||σ_{1,k}||_∞ = 1 and ||σ_{t,k}||_∞ ≤ 1/√t for any k and t, then for any k, we have the upper bound: \sum_{t \in \Psi_{T+1}} \|\sigma_{t,k}\|_\infty \le \sqrt{|\Psi_{T+1}|}.

Essentially, Lemma 2 links the regret B(T) to the width of the confidence bound w_{t,k} (Line 9 of Algorithm 2, or the last two terms of Eqn. 5). Lemma 3 and Lemma 4 then connect w_{t,k} to \sqrt{|\Psi_{T+1}|} \le \sqrt{T}, which is sublinear in T; this is the key to achieving a sublinear regret bound. Note that Â_s is defined inside Algorithm 2 (SupREN) of the Supplement.

Interestingly, Lemma 4 states that the uncertainty only needs to decrease at the rate 1/√t, which is consistent with our earlier choice of diag(σ_k) = (1/√n_k) I_d, where n_k is item k's total number of impressions for all users. As the last step, Lemma 5 and Theorem 1 below build on all the lemmas above to derive the final sublinear regret bound.

Lemma 5. For all s ∈ [S],

    |\Psi_{T+1}^{(s)}| \le 2^s \cdot \left(5 (1 + \alpha^2) \sqrt{d |\Psi_{T+1}^{(s)}|} + 4\sqrt{dT} + 2\sqrt{T \ln\frac{TK}{\delta}}\right).

Theorem 1. If SupREN is run with α = \sqrt{\frac{1}{2} \ln\frac{2TK}{\delta}}, then with probability at least 1 - δ, the regret of the algorithm is

    B(T) \le 2\sqrt{T} + 92 \cdot \left(1 + \ln\frac{2TK(2\ln T + 2)}{\delta}\right)^{3/2} \sqrt{Td} = O\left(\sqrt{T d \ln^3\left(\frac{KT\ln(T)}{\delta}\right)}\right).

The full proofs of all lemmas and the theorem are in the Supplement. Theorem 1 shows that even with uncertainty in the item embeddings (i.e., context uncertainty), our proposed REN can achieve the same rate-optimal sublinear regret bound.
Experiments

In this section, we evaluate our proposed REN on both synthetic and real-world datasets.

Experiment Setup and Compared Methods

Joint Learning and Exploration Procedure in Temporal Data. To effectively verify REN's capability to boost long-term rewards, we adopt an online experiment setting where the data is divided into different time intervals [T_0, T_1), [T_1, T_2), ..., [T_{M-1}, T_M]. The RNN (including REN and its baselines) is then trained and evaluated in a rolling manner: (1) the RNN is trained using data in [T_0, T_1); (2) the RNN is evaluated on data in [T_1, T_2) and collects feedback (rewards) for its recommendations; (3) the RNN uses the newly collected feedback from [T_1, T_2) to finetune the model; (4) repeat the previous two steps using data from the next time interval. Note that, different from traditional offline, one-step evaluation, which corresponds to only Steps (1) and (2), our setting performs joint learning and exploration on temporal data and is therefore more realistic and closer to production systems.
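The rolling protocol above can be summarized by the following schematic Python loop (ours); `train`, `recommend`, and `collect_feedback` are placeholder callables standing in for the model-specific steps, not functions from the paper.

```python
def rolling_evaluation(model, intervals, train, recommend, collect_feedback):
    """intervals: per-interval interaction data [D_0, D_1, ..., D_{M-1}].
    collect_feedback is assumed to return a list of (item, reward) pairs."""
    train(model, intervals[0])                 # Step (1): initial training on [T0, T1)
    rewards_per_interval = []
    for data in intervals[1:]:
        recs = recommend(model, data)          # Step (2): recommend and observe rewards
        feedback = collect_feedback(recs, data)
        rewards_per_interval.append(sum(r for _, r in feedback))
        train(model, feedback)                 # Step (3): finetune on the new feedback
    return rewards_per_interval                # Step (4) is the loop itself
```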
Long-Term Rewards. Since the goal is to evaluate long-term rewards, we are mostly interested in the rewards during the last (few) time intervals. Conventional RNN-based recommenders do not perform exploration and therefore tend to saturate at a relatively low reward. In contrast, REN, with its effective exploration, can achieve nearly optimal rewards in the end.

Compared Methods. We compare REN variants with state-of-the-art RNN-based recommenders including GRU4Rec (Hidasi et al. 2016), TCN (Bai, Kolter, and Koltun 2018), and HRNN (Ma et al. 2020). Since REN can use any RNN-based recommender as a base model, we evaluate three REN variants in the experiments: REN-G, REN-T, and REN-H, which use GRU4Rec, TCN, and HRNN as base models, respectively. Additionally, as an ablation study, we also evaluate REN-1,2, an REN variant without the third term of Eqn. 5, and REN-1,3, one without the second term of Eqn. 5. Both REN-1,2 and REN-1,3 use GRU4Rec as the base model. As references we also include Oracle, which always achieves optimal rewards, and Random, which randomly recommends one item from the full set. For REN variants we choose λ_d from {0.001, 0.005, 0.01, 0.05, 0.1} and set λ_u = √10 λ_d. Other hyperparameters in the RNN base models are kept the same for fair comparison (see the Supplement for more details on neural network architectures, hyperparameters, and their sensitivity analysis).

Connection to Reinforcement Learning (RL) and Bandits. REN-1,2 (in Fig. 2) can be seen as a simplified version of 'randomized least-squares value iteration' (an RL approach proposed in (Osband, Van Roy, and Wen 2016)) or an adapted version of contextual bandits, while REN-1,3 (in Fig. 2) is an advanced version of ε-greedy exploration in RL. Note that REN is orthogonal to RL (Shi et al. 2019) and bandit methods.

Simulated Experiments

Datasets. Following the setting described above, we start with three synthetic datasets, namely SYN-S, SYN-M, and SYN-L, which allow complete control over the simulated environments. We assume 8-dimensional latent vectors, which are unknown to the models, for each user and item, and use the inner product between user and item latent vectors as the reward. Specifically, for each latent user vector θ*, we randomly choose 3 entries to set to 1/√3 and set the rest to 0, keeping ||θ*||_2 = 1. We generate C(8,2) = 28 unique item latent vectors. Each item latent vector x*_k has 2 entries set to 1/√2 and the other 6 entries set to 0, so that ||x*_k||_2 = 1. We assume 15 users in our datasets. SYN-S contains exactly 28 items, while SYN-M repeats each unique item latent vector 10 times, yielding 280 items in total. Similarly, SYN-L repeats each vector 50 times, yielding 1400 items in total. The purpose of allowing different items to have identical latent vectors is to investigate REN's capability to explore in the compact latent space rather than in the large item space. All users have a history length of 60.
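The SYN-S/M/L construction can be reproduced with a few lines of numpy; the sketch below is our reading of the description above (function names and the random seed are assumptions).

```python
import numpy as np
from itertools import combinations

def make_users(num_users=15, d=8, seed=0):
    rng = np.random.default_rng(seed)
    users = np.zeros((num_users, d))
    for u in range(num_users):
        idx = rng.choice(d, size=3, replace=False)     # 3 active entries per user
        users[u, idx] = 1.0 / np.sqrt(3.0)             # so that ||theta*||_2 = 1
    return users

def make_items(d=8, repeats=1):
    base = []
    for i, j in combinations(range(d), 2):             # C(8, 2) = 28 unique vectors
        v = np.zeros(d)
        v[[i, j]] = 1.0 / np.sqrt(2.0)                 # so that ||x*_k||_2 = 1
        base.append(v)
    return np.repeat(np.array(base), repeats, axis=0)  # repeats=10 -> SYN-M, 50 -> SYN-L

users, items = make_users(), make_items(repeats=1)     # SYN-S: 15 users, 28 items
rewards = users @ items.T                              # ground-truth reward matrix
```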
Simulated Environments. With the generated latent vectors, the simulated environment runs as follows: at each time step t, the environment randomly chooses one user and feeds the user's interaction history X_t (or D_t) into the RNN recommender. The recommender then recommends the top 4 items to the user. The user selects the item with the highest ground-truth reward θ*^T x*_k, after which the recommender collects the selected item with its reward and finetunes the model.

Results. Fig. 1 shows the rewards over time for different methods. Results are averaged over 3 runs, and we plot the rolling average with a window size of 100 to prevent clutter. As expected, conventional RNN-based recommenders saturate at around the 500-th time step, while all REN variants successfully achieve nearly optimal rewards in the end. One interesting observation is that REN variants obtain rewards lower than the 'Random' baseline at the beginning, meaning that they sacrifice immediate rewards to perform exploration in exchange for long-term rewards.

[Figure 1 (plots omitted): Results for different methods in SYN-S (left, with 28 items), SYN-M (middle, with 280 items), and SYN-L (right, with 1400 items). One time step represents one interaction step, where in each interaction step the model recommends 3 items to the user and the user interacts with one of them. In all cases, REN models with diversity-based exploration lead to final convergence, whereas models without exploration get stuck at local optima.]

Ablation Study. Fig. 2 shows the rewards over time for REN-G (i.e., REN-1,2,3), REN-1,2, and REN-1,3 in SYN-S and SYN-L. We observe that REN-1,2, with only the relevance (first) and diversity (second) terms of Eqn. 5, saturates prematurely in SYN-S. On the other hand, the reward of REN-1,3, with only the relevance (first) and uncertainty (third) terms, barely increases over time in SYN-L. In contrast, the full REN-G works in both SYN-S and SYN-L. This is because, without the uncertainty term, REN-1,2 fails to effectively choose items with uncertain embeddings to explore, while REN-1,3 ignores diversity in the latent space and tends to explore items that have rarely been recommended; such exploration directly in the item space only works when the number of items is small, e.g., in SYN-S.

[Figure 2 (plot omitted): Ablation study on different terms of REN. 'REN-1,2,3' refers to the full 'REN-G' model.]

Hyperparameters. For the base models GRU4Rec, TCN, and HRNN, we use identical network architectures and hyperparameters whenever possible, following (Hidasi et al. 2016; Bai, Kolter, and Koltun 2018; Ma et al. 2020). Each RNN consists of an encoding layer, a core RNN layer, and a decoding layer. We set the number of hidden neurons to 32 for all models, including REN variants. Fig. 1 in the Supplement shows REN-G's performance for different λ_d (note that we fix λ_u = √10 λ_d) in SYN-S, SYN-M, and SYN-L. We observe stable REN performance across a wide range of λ_d. As expected, REN-G's performance is closer to GRU4Rec when λ_d is small.
Real-World Experiments

MovieLens-1M. We use MovieLens-1M (Harper and Konstan 2016), containing 3,900 movies and 6,040 users, with an experiment setting similar to the one above. Each user has 120 interactions, and we follow the joint learning and exploration procedure described above to evaluate all methods (more details in the Supplement). All models recommend 10 items at each round for a chosen user, and precision@10 is used as the reward. Fig. 3 (left) shows the rewards over time averaged over all 6,040 users. As expected, REN variants with different base models are able to achieve higher long-term rewards compared to their non-REN counterparts.

[Figure 3 (plots omitted): Rewards (precision@10, MRR, and recall@100, respectively) over time on MovieLens-1M (left), Trivago (middle), and Netflix (right). One time step represents 10 recommendations to a user, one hour of data, and 100 recommendations to a user for MovieLens-1M, Trivago, and Netflix, respectively.]

Trivago. We also evaluate the proposed methods on Trivago[4], a hotel recommendation dataset with 730,803 users, 926,457 items, and 910,683 interactions. We use a subset with 57,778 users, 387,348 items, and 108,713 interactions, and slice the data into M = 48 one-hour time intervals for the online experiment (see the Supplement for details on data pre-processing). Different from MovieLens-1M, Trivago has impression data available: at each time step, besides which item is clicked by the user, we also know which 25 items are shown to the user. Such information makes the online evaluation more realistic, as we then know the ground-truth feedback if an arbitrary subset of the 25 items is presented to the user. At each time step of the online experiments, all methods choose 10 items from the 25 impressions to recommend to the current user and collect the feedback on these 10 items as data for finetuning. We pretrain the model using all 25 items from the first 13 hours before starting the online evaluation. Fig. 3 (middle) shows the mean reciprocal rank (MRR), the official metric used in the RecSys Challenge, for different methods. As expected, the baseline RNNs (e.g., GRU4Rec) suffer from a drastic drop in rewards because agents are allowed to recommend only 10 items and choose to focus only on relevance. This inevitably ignores valuable items and harms accuracy. In contrast, REN variants (e.g., REN-G) can effectively balance relevance and exploration for these 10 recommended items at each time step, achieving higher long-term rewards. Interestingly, we also observe that REN variants have better stability in performance compared to the RNN baselines.

[4] More details are available at https://recsys.trivago.cloud/challenge/dataset/.
   Trivago. We also evaluate the proposed methods on Trivago⁴, a hotel recommendation dataset with 730,803 users, 926,457 items, and 910,683 interactions. We use a subset with 57,778 users, 387,348 items, and 108,713 interactions and slice the data into M = 48 one-hour time intervals for the online experiment (see the Supplement for details on data pre-processing). Different from MovieLens-1M, Trivago has impression data available: at each time step, besides which item is clicked by the user, we also know which 25 items were shown to the user. Such information makes the online evaluation more realistic, as we know the ground-truth feedback for any subset of the 25 items presented to the user. At each time step of the online experiments, all methods choose 10 of the 25 items to recommend to the current user and collect the feedback on these 10 items as data for finetuning. We pretrain the model using all 25 items from the first 13 hours before starting the online evaluation. Fig. 3(middle) shows the mean reciprocal rank (MRR), the official metric of the RecSys Challenge, for the different methods. As expected, baseline RNNs (e.g., GRU4Rec) suffer a drastic drop in rewards: because agents are allowed to recommend only 10 items, they focus only on relevance, which inevitably ignores valuable items and harms accuracy. In contrast, REN variants (e.g., REN-G) effectively balance relevance and exploration for these 10 recommended items at each time step, achieving higher long-term rewards. Interestingly, we also observe that REN variants have better stability in performance compared to the RNN baselines.

⁴ More details are available at https://recsys.trivago.cloud/challenge/dataset/.
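To make this protocol explicit, here is a minimal sketch of such an impression-based online loop, assuming a per-step record of the 25 impressed items and the logged click; the model.recommend / model.finetune interface and the record keys are hypothetical stand-ins rather than the paper's implementation.

def online_evaluation(model, impression_log, n_recommend=10):
    """Replay logged impressions: pick n_recommend of the 25 shown items,
    score against the logged click, then finetune on the observed feedback."""
    rewards = []
    for step in impression_log:
        user = step["user"]
        candidates = step["impressions"]        # the 25 items shown at this time step
        clicked = step["clicked_item"]          # ground-truth feedback
        chosen = model.recommend(user, candidates, k=n_recommend)  # ordered list of 10 items
        # Reciprocal rank of the clicked item if it was recommended, else 0 (MRR-style reward).
        reward = 1.0 / (chosen.index(clicked) + 1) if clicked in chosen else 0.0
        rewards.append(reward)
        feedback = [(item, float(item == clicked)) for item in chosen]
        model.finetune(user, feedback)          # online update on the collected data
    return rewards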
   Netflix. Finally, we also use Netflix⁵ to evaluate how REN performs in the slate recommendation setting and without finetuning at each time step, i.e., skipping Step (3) in Sec. 12. We pretrain REN on data from half of the users and evaluate on the other half. At each time step, REN generates 100 mutually diversified items for one slate following Eqn. 5, with p_{k,t} updated after every item generation. Fig. 3(right) shows recall@100 as the reward for the different methods, demonstrating REN's promising exploration ability when no finetuning is allowed (more results in the Supplement). Fig. 4(left) shows similar trends with recall@100 as the reward on the same holdout item set, which shows that the collected set contributes to building better user embedding models. Fig. 4(middle) shows that the additional exploration power comes without significant harm to the user's immediate rewards on the exploration set, where the recommendations are served. In fact, we used a relatively large exploration coefficient, λ_d = λ_u = 0.005, which starts to affect recommendation results at the sixth position. With additional hyperparameter tuning, we found that smaller coefficients, λ_d = 0.0007 and λ_u = 0.0008, achieve better rewards on the exploration set. Fig. 4(right) shows significantly higher recalls, close to the oracle performance where all of the users' histories are known and used as inputs to predict the top-100 personalized recommendations.⁶ Note that, for a fair presentation of the tuned results, we switched the exploration set and the holdout set and used a different test user group consisting of 1,543 users. We believe that the tuned results generalize to new users and items, but we also note that the Netflix dataset still has a significant popularity bias; we therefore recommend using larger exploration coefficients in real online systems. The inference cost is 175 milliseconds to pick the top-100 items from 8,000 evaluation items. This includes 100 sequential linear function solutions with 50 embedding dimensions, which can be further improved by selecting multiple items at a time during slate generation.

⁵ https://www.kaggle.com/netflix-inc/netflix-prize-data
⁶ The gap between the oracle and 100% recall lies in the model approximation errors.
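The sequential slate construction above can be pictured with the sketch below: each of the 100 positions is filled greedily by a relevance-plus-exploration score, and the inverse covariance is updated after every pick so that later choices favor diverse, uncertain directions. The variable names, the collapsing of the separate weights λ_d and λ_u into a single bonus coefficient, and the Sherman–Morrison rank-1 update standing in for the "100 sequential linear function solutions" are our illustrative assumptions, not the paper's exact implementation.

import numpy as np

def generate_slate(theta, A_inv, item_embs, slate_size=100, lam=0.005):
    """Greedy, uncertainty-aware slate generation: pick the item maximizing
    relevance + lam * sqrt(x^T A^{-1} x), then rank-1-update A^{-1} with the
    chosen embedding so later picks are pushed toward unexplored directions."""
    A_inv = A_inv.copy()
    remaining = list(range(len(item_embs)))
    slate = []
    for _ in range(slate_size):
        X = item_embs[remaining]                                   # (n_remaining, d)
        relevance = X @ theta                                      # estimated rewards
        quad = np.einsum("ij,jk,ik->i", X, A_inv, X)               # x^T A^{-1} x per item
        bonus = np.sqrt(np.maximum(quad, 0.0))                     # exploration/diversity bonus
        best = remaining[int(np.argmax(relevance + lam * bonus))]
        slate.append(best)
        remaining.remove(best)
        x = item_embs[best][:, None]                               # (d, 1)
        # Sherman–Morrison update of (A + x x^T)^{-1}, replacing a full solve per position.
        A_inv -= (A_inv @ x) @ (x.T @ A_inv) / (1.0 + x.T @ A_inv @ x)
    return slate

# Example with random embeddings: 8,000 candidate items, 50 dimensions, a slate of 100.
rng = np.random.default_rng(0)
slate = generate_slate(rng.normal(size=50), np.eye(50), rng.normal(size=(8000, 50)))

Selecting several items per covariance update, as the text suggests, would amortize the per-position update and reduce the quoted latency further.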
                        Conclusion
We propose the REN framework to balance relevance and exploration during recommendation. Our theoretical analysis and empirical results demonstrate the importance of accounting for uncertainty in the learned representations for effective exploration and improved long-term rewards. We provide an upper confidence bound on the estimated rewards, along with its corresponding regret bound, and show that REN can achieve the same rate-optimal sublinear regret even in the presence of uncertain representations. Future work could investigate the possibility of learned uncertainty in the representations, extensions to Thompson sampling, nonlinearity of the reward w.r.t. θ_t, and applications beyond recommender systems, e.g., robotics and conversational agents.
                    Acknowledgement
The authors thank Tim Januschowski, Alex Smola, the AWS AI's Personalize Team and ML Forecast Team, as well as the reviewers/SPC/AC, for the constructive comments that helped improve the paper. We are also grateful for the RecoBandits package provided by Bharathan Blaji, Saurabh Gupta, and Jing Wang to facilitate the simulated environments. HW is partially supported by NSF Grant IIS-2127918 and an Amazon Faculty Research Award.
                        References
Agarwal, A.; Hsu, D. J.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. E. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. In ICML, 1638–1646.
Antikacioglu, A.; and Ravi, R. 2017. Post processing recommender systems for diversity. In KDD, 707–716.
Auer, P. 2002. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 3: 397–422.
Bai, S.; Kolter, J. Z.; and Koltun, V. 2018. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. CoRR, abs/1803.01271.
Belletti, F.; Chen, M.; and Chi, E. H. 2019. Quantifying Long Range Dependence in Language and User Behavior to Improve RNNs. In KDD, 1317–1327.
Bello, I.; Kulkarni, S.; Jain, S.; Boutilier, C.; Chi, E.; Eban, E.; Luo, X.; Mackey, A.; and Meshi, O. 2018. Seq2slate: Re-ranking and slate optimization with RNNs. arXiv preprint arXiv:1810.02019.
Chen, M.; Beutel, A.; Covington, P.; Jain, S.; Belletti, F.; and Chi, E. H. 2019. Top-k off-policy correction for a REINFORCE recommender system. In WSDM, 456–464.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP, 1724–1734.
Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. In AISTATS, 208–214.
Ding, H.; Ma, Y.; Deoras, A.; Wang, Y.; and Wang, H. 2021. Zero-Shot Recommender Systems. arXiv preprint arXiv:2105.08318.
Fang, H.; Zhang, D.; Shu, Y.; and Guo, G. 2019. Deep Learning for Sequential Recommendation: Algorithms, Influential Factors, and Evaluations. arXiv preprint arXiv:1905.01997.
Foster, D. J.; Agarwal, A.; Dudík, M.; Luo, H.; and Schapire, R. E. 2018. Practical Contextual Bandits with Regression Oracles. In ICML, 1534–1543.
Friedland, S.; and Gaubert, S. 2013. Submodular spectral functions of principal submatrices of a Hermitian matrix, extensions and applications. Linear Algebra and its Applications, 438(10): 3872–3884.
Gupta, S.; Wang, H.; Lipton, Z.; and Wang, Y. 2021. Correcting exposure bias for link recommendation. In ICML.
Harper, F. M.; and Konstan, J. A. 2016. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4): 19.
Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2016. Session-based Recommendations with Recurrent Neural Networks. In ICLR.
Hiranandani, G.; Singh, H.; Gupta, P.; Burhanuddin, I. A.; Wen, Z.; and Kveton, B. 2019. Cascading Linear Submodular Bandits: Accounting for Position Bias and Diversity in Online Learning to Rank. In UAI, 248.
Jun, K.-S.; Willett, R.; Wright, S.; and Nowak, R. 2019. Bilinear Bandits with Low-rank Structure. In ICML, 3163–3172.
Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In ICLR.
Korda, N.; Szorenyi, B.; and Li, S. 2016. Distributed clustering of linear bandits in peer to peer networks. In ICML, 1301–1309.
Kveton, B.; Szepesvári, C.; Rao, A.; Wen, Z.; Abbasi-Yadkori, Y.; and Muthukrishnan, S. 2017. Stochastic Low-Rank Bandits. CoRR, abs/1712.04644.
Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; and Ma, J. 2017. Neural attentive session-based recommendation. In CIKM, 1419–1428.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW, 661–670.
Li, S.; Karatzoglou, A.; and Gentile, C. 2016. Collaborative Filtering Bandits. In SIGIR, 539–548.
Li, X.; and She, J. 2017. Collaborative Variational Autoencoder for Recommender Systems. In KDD, 305–314.
Liu, Q.; Zeng, Y.; Mokhosi, R.; and Zhang, H. 2018. STAMP: Short-term attention/memory priority model for session-based recommendation. In KDD, 1831–1839.
Ma, Y.; Narayanaswamy, M. B.; Lin, H.; and Ding, H. 2020. Temporal-Contextual Recommendation in Real-Time. In KDD.
Mahadik, K.; Wu, Q.; Li, S.; and Sabne, A. 2020. Fast distributed bandits for online recommendation systems. In SC, 1–13.
Mi, L.; Wang, H.; Tian, Y.; and Shavit, N. 2019. Training-Free Uncertainty Estimation for Neural Networks. arXiv e-prints, arXiv–1910.
Nguyen, T. T.; Hui, P.-M.; Harper, F. M.; Terveen, L.; and Konstan, J. A. 2014. Exploring the filter bubble: The effect of using recommender systems on content diversity. In WWW, 677–686.
Osband, I.; Van Roy, B.; and Wen, Z. 2016. Generalization and exploration via randomized value functions. In ICML, 2377–2386.
Quadrana, M.; Karatzoglou, A.; Hidasi, B.; and Cremonesi, P. 2017. Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks. In RecSys, 130–137.
Salakhutdinov, R.; Mnih, A.; and Hinton, G. E. 2007. Restricted Boltzmann machines for collaborative filtering. In ICML, 791–798.
Shi, J.-C.; Yu, Y.; Da, Q.; Chen, S.-Y.; and Zeng, A.-X. 2019. Virtual-Taobao: Virtualizing real-world online retail environment for reinforcement learning. In AAAI, 4902–4909.
Tang, J.; Belletti, F.; Jain, S.; Chen, M.; Beutel, A.; Xu, C.; and Chi, E. H. 2019. Towards neural mixture recommender for long range dependent user sequences. In WWW, 1782–1793.
Tschiatschek, S.; Djolonga, J.; and Krause, A. 2016. Learning Probabilistic Submodular Diversity Models Via Noise Contrastive Estimation. In AISTATS, 770–779.
van den Oord, A.; Dieleman, S.; and Schrauwen, B. 2013. Deep content-based music recommendation. In NIPS, 2643–2651.
Vanchinathan, H. P.; Nikolic, I.; Bona, F. D.; and Krause, A. 2014. Explore-exploit in top-N recommender systems via Gaussian processes. In RecSys, 225–232.
Wang, H. 2017. Bayesian Deep Learning for Integrated Intelligence: Bridging the Gap between Perception and Inference. Ph.D. thesis, Hong Kong University of Science and Technology.
Wang, H.; Shi, X.; and Yeung, D.-Y. 2015. Relational stacked denoising autoencoder for tag recommendation. In AAAI, 3052–3058.
Wang, H.; Shi, X.; and Yeung, D.-Y. 2016. Collaborative recurrent autoencoder: Recommend while learning to fill in the blanks. In NIPS, 415–423.
Wang, H.; Wang, N.; and Yeung, D.-Y. 2015. Collaborative deep learning for recommender systems. In KDD, 1235–1244.
Wang, H.; Xingjian, S.; and Yeung, D.-Y. 2016. Natural-parameter networks: A class of probabilistic neural networks. In NIPS, 118–126.
Wang, H.; and Yeung, D.-Y. 2016. Towards Bayesian deep learning: A framework and some existing methods. TKDE, 28(12): 3395–3408.
Wang, H.; and Yeung, D.-Y. 2020. A Survey on Bayesian Deep Learning. ACM Computing Surveys (CSUR), 53(5): 1–37.
Wilhelm, M.; Ramanathan, A.; Bonomo, A.; Jain, S.; Chi, E. H.; and Gillenwater, J. 2018. Practical diversified recommendations on YouTube with determinantal point processes. In CIKM, 2165–2173.
Wu, S.; Tang, Y.; Zhu, Y.; Wang, L.; Xie, X.; and Tan, T. 2019. Session-based recommendation with graph neural networks. In AAAI, 346–353.
Yue, Y.; and Guestrin, C. 2011. Linear Submodular Bandits and their Application to Diversified Retrieval. In NIPS, 2483–2491.
Zhou, D.; Li, L.; and Gu, Q. 2019. Neural Contextual Bandits with Upper Confidence Bound-Based Exploration. arXiv preprint arXiv:1911.04462.