                         Xing Yi Liangjie Hong Erheng Zhong Nathan Nan Liu Suju Rajan
                                 {xingyi, liangjie, erheng, nanliu, suju}
                                       Personalization Sciences, Yahoo Labs, Sunnyvale, CA 94089, USA

Many internet companies, such as Yahoo, Facebook, Google and Twitter, rely on
content recommendation systems to deliver the most relevant content items to
individual users through personalization. Delivering such personalized user
experiences is believed to increase the long term engagement of users. While
there has been a lot of progress in designing effective personalized recommender
systems, by exploiting user interests and historical interaction data through
implicit (item click) or explicit (item rating) feedback, directly optimizing for
users’ satisfaction with the system remains challenging. In this paper, we explore
the idea of using item-level dwell time as a proxy to quantify how likely a content
item is relevant to a particular user. We describe a novel method to compute
accurate dwell time based on client-side and server-side logging and demonstrate
how to normalize dwell time across different devices and contexts. In addition, we
describe our experiments in incorporating dwell time into state-of-the-art learning
to rank techniques and collaborative filtering models that obtain competitive
performances in both offline and online settings.

Categories and Subject Descriptors: H.3.5 [Information Storage and Retrieval]:
Online Information Services
General Terms: Theory, Experimentation
Keywords: Content Recommendation, Personalization, Dwell Time, Learning to Rank,
Collaborative Filtering

Figure 1: A snapshot of Yahoo’s homepage in U.S. where the content stream is
highlighted in red.

1. INTRODUCTION
   Content recommendation systems play a central role in today’s Web ecosystems.
Companies like Yahoo, Google, Facebook and Twitter are striving to deliver the most
relevant content items to individual users. For example, visitors to these sites are
presented with a stream of articles, slideshows and videos that they may be
interested in viewing. With a personalized recommendation system, these companies
aim to better predict and rank content of interest to users by using historical user
interactions on the respective sites. The underlying belief is that personalization
increases long term user engagement and as a side-benefit, also drives up other
aspects of the services, for instance, revenue. Therefore, there has been a lot of
work in designing and improving personalized recommender systems.
   Traditionally, simplistic user feedback signals, such as click through rate (CTR)
on items or user-item ratings, have been used to quantify users’ interest and
satisfaction. Based on these readily available signals, most content recommendation
systems essentially optimize for CTR or attempt to fill in a sparse user-item rating
matrix with missing ratings. Specifically for the latter case, with the success of
the Netflix Prize competition, matrix-completion based methods have dominated the
field of recommender systems. However, in many content recommendation tasks users
rarely provide explicit ratings or direct feedback (such as ‘like’ or ‘dislike’)
when consuming frequently updated online content. Thus, explicit user ratings are
too sparse to be usable as input for matrix factorization approaches. On the other
hand, item CTR as implicit user interest signal does not capture any post-click user
engagement. For example, users may have clicked on an item by mistake or because of
link bait, but are truly not engaged with the content being presented. Thus, it is
arguable that leveraging the noisy click-based user engagement signal for
recommendation can achieve the best
   We argue that the amount of time that users spend on content items, the dwell
time, is an important metric to measure user engagement on content and should be
used as a proxy to user satisfaction for recommended content, complementing and/or
replacing click based signals. However, utilizing dwell time in a personalized
recommender system introduces a number of new research and engineering challenges.
For instance, a fundamental question would be how to measure dwell time effectively.
Furthermore, different users exhibit different content consumption behaviors even
for the same piece of content on the same device. In addition, for the same user,
depending on the nature of the content item and the context, the user’s content
consumption behavior can be significantly different. Therefore, it would be
beneficial to normalize dwell time across different devices and contexts. Also,
recommender systems usually employ machine learning-to-rank (MLR) techniques and
collaborative filtering (CF) models, with a wide range of features, to obtain
state-of-the-art performance. Using dwell time in these frameworks is not
straightforward.
   In this paper, we use the problem of recommending items for the content feed or
stream on Yahoo’s homepage, shown in Figure 1, as a running example to demonstrate
how dwell time can be embedded into a personalized recommendation system. We have
designed several approaches for accurately computing item-level user content
consumption time from large-scale web browsing log data. We first use the log data
to determine when each item page gains or loses the user’s attention. Capturing the
user’s attention on an item page enables us to compute per-user item-level dwell
time. In addition, we leverage content consumption dwell time distributions of
different content types for normalizing users’ engagement signals, so that we can
use this engagement signal for recommending multiple content type items to the user
in the content stream. We then incorporate dwell time into machine learning-to-rank
(MLR) techniques and collaborative filtering (CF) models. For MLR, we propose to use
per-user item-level dwell time as the learning target, which can be easily
considered in all existing MLR models, and demonstrate that it can result in better
performances. For CF, we use dwell time as a form of implicit feedback from users
and demonstrate that a state-of-the-art matrix factorization model that incorporates
this information can yield competitive and even better performances than the
click-optimized counterpart. To be more specific, we have made the following
contributions in this paper:
    • A novel method to compute fine-grained item-level user content consumption
       time for better understanding users’ interests is proposed.
    • A novel solution to normalize dwell time for multiple content-type items
       across different devices is proposed and presented.
    • An empirical study of dwell time in the context of content recommendation
       system is presented in this paper.
    • An MLR framework to utilize dwell time is proposed and its effectiveness in
       real-world settings is demonstrated.
    • A CF framework to utilize dwell time is proposed and its effectiveness
       against non-trivial baselines is presented.
The paper is organized as follows. In §2, we review three related research
directions. In §3, we demonstrate how dwell time can be measured and present some of
its interesting characteristics. In the following sections, we show two important
use cases for dwell time. In §4, we show how dwell time can be used in MLR to obtain
performance superior to that of the models that optimize CTR. In §5, we plug dwell
time into the state-of-the-art CF models and demonstrate that we can obtain
competitive performance. We conclude the paper in §6.

2. RELATED WORK
   In this section, we review three related research directions. First, we examine
how dwell time is studied and used in web search or IR domains. We will carefully
analyze which of the existing practices and experiences, on dwell time computation,
can be utilized in the context of personalization. We then list several pointers for
MLR as it has been extensively studied in the past decade, followed by a brief
discussion on CF, paying special attention to how implicit user feedback is used in
CF.
   Dwell Time in Other Domains: A significant amount of previous research on web
search has investigated using post-click dwell time of each search result as an
indicator of its relevance for web queries and how it can be applied for different
web search tasks. All such previous research focused on examining the dwell time’s
utility for improving search results. For instance, White and Kelly [15]
demonstrated that using dwell time can potentially help improve the performance of
implicit relevance feedback. Kim et al. [8] and Xu et al. [17, 18] showed that using
webpage-level dwell time can help personalize the ranking of the web search results.
Liu et al. [11] investigated the necessity of using different thresholds of dwell
time, in order to derive meaningful document relevance information from the dwell
time for helping different web search tasks. To the best of our knowledge, we are
the first to use dwell time for personalized content recommendation. Furthermore, we
consider different types of content (news articles, slideshows and videos), present
several approaches to accurately measure content consumption time, and use the dwell
time for understanding users’ daily habits and interests. Most recently, Youtube has
started to use users’ video time spent instead of the click event¹ to better measure
the users’ engagement with video content. In contrast, we focus on
we-focus-on-watch-time.html

   Learning To Rank in Web Search: The field of MLR has significantly matured in the
past decade, mainly due to the popularity of search engines. Liu [12] and Li [10]
provide an in-depth survey on this topic. Here, we point out that a fundamental
issue with all existing MLR models is that they all optimize for relevance, an
abstract yet important concept in IR. In the standard setting, the “relevance”
between a particular query and a list of documents is objective and the same for all
users. For IR, “relevance” is judged by human experts through a manual process and
is difficult to scale to millions of real queries. In order to personalize IR, a
natural alternative to “relevance” is to optimize CTR. In this paper, we explore the
possibility of optimizing for dwell time under the existing framework of gradient
boosted decision trees [6]. However, other MLR models can also be used such as
pair-wise models (e.g., RankBoost [5] and RankNet [3]) and list-wise models (e.g.,
AdaRank [16] and ListNet [4]). Note that we do not seek to propose new MLR models,
but instead show the advantage of utilizing dwell time in existing models.

   Collaborative Filtering: In CF systems, users’ satisfaction with the items is
usually not considered. Almost all previous work in CF (e.g., [9, 1]) takes only
explicit feedback such as ratings, or “implicit” click-based feedback into account.
Hu et al. [7] considered implicit feedback signals, such as whether a user clicks or
reviews an item, and incorporated them into the matrix factorization framework.
Rendle et al. [13] proposed a learning algorithm for binary implicit feedback
datasets, which is essentially similar to AUC optimization. None of them went beyond
binary implicit feedback to investigate the interactions between users and items.
The approach of Yin et al. [19] is the closest to our work. In that paper, the
authors used a graphical model on the explicit feedback signals and dwell time
data to predict the user’s score. Our work is different in that our model does not
require the presence of explicit user feedback.

Table 1: Client-side Logging Example
     User Behaviors                                            Client-side Events
     A user opens a news article page.                         {DOM-ready, t1}
     He reads the article for several seconds.                 {Focus, t2}
     He switches to another browser tab or a window to
     read other articles.                                      {Blur, t3}
     He goes back to the article page and comments on it.      {Focus, t4}
     He closes the article page, or clicks the back button
     to go to another page.                                    {BeforeUnload, t5}

Table 2: Comparison of dwell time measurement. The first two columns are for LE, the
middle two columns are for FB and the last two columns are for client-side logs.
Each row contains data from a day.
     #        DT. (LE)     #        DT. (FB)     #        DT. (C)
     3,322    86.5         3,197    134.4        3,410    130.3
     5,711    85.4         5,392    132.6        5,829    124.0

3. MEASURING ITEM DWELL TIME
   In this section, we describe how dwell time can be measured from web logs and
show its basic characteristics.

[Figure 2 plot omitted; legend: Devices (Desktop, Tablet, Mobile); X-axis ticks: 0.0, 2.5, 5.0, 7.5]
Figure 2: The (un)normalized distribution of log of dwell time for articles across
different devices. The X-axis is the log of dwell time and the Y-axis is the counts
(removed for proprietary reasons).

3.1 Dwell Time Computation
   Accurately computing item-level dwell time from web-scale user browsing activity
data is a challenging problem. As an example, most modern browsers have a
multi-tabbed interface in which users can open multiple stories simultaneously and
switch between them. In the multi-tabbed setting, figuring out the tab that captured
the user’s attention is non-trivial. In this paper we describe two complementary
methods to derive dwell time, one via client-side logging and the other via
server-side logging. We have also conducted a simple study comparing these two
approaches. Although client-side logging can capture fine-grained user behavior and
has the potential of being highly useful, it depends heavily on the browser
implementation and can lose large amounts of data. Therefore, when the client-side
data is not available, we resort to reasonable approximation methods through
server-side logging. Thus, we can reliably compute dwell time in a real world
setting.

   Client-Side Dwell Time: Client-side logging utilizes JavaScript/DOM events² to
record how users interact with the content story pages. Let us imagine the scenario
demonstrated in Table 1 where the left column is a sequence of user interactions
with a news article and the right column contains the corresponding client-side
events, in the form of {event name, time stamp} tuples. In these events, DOM-ready
indicates the ready-time of the body of the page, which can be considered as the
start of the dwell time. Focus indicates that the user’s focus was back on the body
of the news article. Blur means that the article body lost the user’s attention.
BeforeUnload is the time point immediately prior to the page being unloaded. Based
on these events, we can compute dwell time on the client-side by simply accumulating
the time differences between each Focus event and the following Blur (or
BeforeUnload) event. From the above example, we have the dwell time as:
(t3 − t2) + (t5 − t4). We can clearly see that the client-side approach can
accurately capture users’ actual attention even in multi-tabbed modern browsers. The
major drawback of client-side logging is that it relies on the correctness of
JavaScript execution and on servers successfully receiving and logging client-sent
data. Data loss in this client-server interaction can be very high, for example,
because of loss in internet connection. In addition, users may also disable
JavaScript in their browsers.

   Server-Side Dwell Time: When client-side logs are not available, we resort to
server-side logging to infer users’ attention on item pages. The computation of
dwell time on the server side is built on a number of heuristics. One approach is to
simulate client-side user attention events by identifying pseudo Focus and Blur
events from server logs. Consider the following sequence of logging events:

       {i, Click, t1} → {j, Click, t2} → {k, Click, t3}

where each event is a tuple of an item id, an event type and a timestamp. The dwell
time for i and j can be computed as t2 − t1 and t3 − t2 respectively. A more
complicated example is:

       {i, Click, t1} → {j, Click, t2} → {k, Click, t3} →
       {i, Comment, t4} → {n, Click, t5}

where the dwell time for page i can be computed as (t2 − t1) + (t5 − t4). We denote
this as the FB (Focus/Blur) method. Another, simpler heuristic is the LE (Last
Event) method, which takes the last event of the page as the end-page event and
computes the interval between the first-event timestamp and the last one. From the
example above, the dwell time of page i by LE would be t4 − t1. Both approximation
methods have their own weaknesses: 1) The FB approach can over-estimate the dwell
time because servers do not know the exact time the target story page loses its user
attention. For the above example, if the last click happens on some other page, the
(t5 − t4) interval could include some user time spent outside the target page. The
FB approach might also under-estimate the dwell time because servers also do not
accurately know the time
[Figure 3 plot omitted; X-axis: Article Length; Y-axis: Average Dwell Time; legend: Devices (Desktop, Tablet, Mobile)]
Figure 3: The relationship between the average dwell time and the article length
where X-axis is the binned article length and the Y-axis is binned average dwell
time.

[Figure 4 plot omitted; X-axis: The Number of Photos; Y-axis: Average Dwell Time; legend: Devices (Desktop, Tablet, Mobile)]
Figure 4: The relationship between the average dwell time and the number of photos
on a slideshow where X-axis is the binned number of photos and the Y-axis is binned
average dwell time.
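
The client-side accumulation and the two server-side heuristics (FB and LE) of Section 3.1 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the production pipeline: the function names, the tuple layouts and the numeric timestamps are hypothetical and chosen here for the example. Only the event names and the formulas (t3 − t2) + (t5 − t4), (t2 − t1) + (t5 − t4) and t4 − t1 come from the text. Events are assumed to be sorted by timestamp and to belong to a single user session.

```python
from typing import List, Optional, Tuple

# Hypothetical record layouts (assumptions for this sketch, not a real log schema).
ClientEvent = Tuple[str, float]       # (event name, timestamp in seconds)
ServerEvent = Tuple[str, str, float]  # (item id, event type, timestamp in seconds)


def client_dwell_time(events: List[ClientEvent]) -> float:
    """Sum the intervals from each Focus to the next Blur or BeforeUnload."""
    total = 0.0
    focus_start: Optional[float] = None
    for name, ts in events:
        if name == "Focus" and focus_start is None:
            focus_start = ts
        elif name in ("Blur", "BeforeUnload") and focus_start is not None:
            total += ts - focus_start
            focus_start = None
    return total


def fb_dwell_time(events: List[ServerEvent], item: str) -> float:
    """FB heuristic: an event on `item` acts as a pseudo Focus and the next
    event on any other item acts as a pseudo Blur."""
    total = 0.0
    start: Optional[float] = None
    for item_id, _event_type, ts in events:
        if item_id == item:
            if start is None:
                start = ts
        elif start is not None:
            total += ts - start
            start = None
    return total


def le_dwell_time(events: List[ServerEvent], item: str) -> float:
    """LE heuristic: interval between the first and the last event on `item`."""
    stamps = [ts for item_id, _event_type, ts in events if item_id == item]
    return max(stamps) - min(stamps) if stamps else 0.0


# Table 1 scenario with made-up timestamps t1..t5 = 0, 2, 12, 40, 70.
client_log = [("DOM-ready", 0.0), ("Focus", 2.0), ("Blur", 12.0),
              ("Focus", 40.0), ("BeforeUnload", 70.0)]

# Second server-side example with made-up timestamps t1..t5 = 0, 30, 50, 90, 120.
server_log = [("i", "Click", 0.0), ("j", "Click", 30.0), ("k", "Click", 50.0),
              ("i", "Comment", 90.0), ("n", "Click", 120.0)]

print(client_dwell_time(client_log))   # (t3 - t2) + (t5 - t4)
print(fb_dwell_time(server_log, "i"))  # (t2 - t1) + (t5 - t4)
print(le_dwell_time(server_log, "i"))  # t4 - t1
```

In this sketch an interval that is opened but never closed (for example, a final event on the target item with no later event) contributes nothing, which is one way the server-side estimates can fall short of the true dwell time.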

the target page gains user attention. For the above example, if users have returned
to the target page and read it for some additional time before commenting at t4, the
dwell time computation will not include the additional time. 2) The LE approach does
not consider the scenario in which the users’ reading focus could switch among
multiple browser page tabs, thus over-estimating the dwell time. On the other hand,
because the LE approach conservatively uses the last event on the target page to
compute dwell time and servers do not know when the user abandons or closes the page
(without the client-side unload event), it can also under-estimate the dwell time.

   Because both approaches could over-estimate or under-estimate the item-level
dwell time, we conducted a simple comparison study among FB, LE and the client-side
logging. The results are shown in Table 2. The purpose of this study is to explore
which server-side approach can be used to better approximate the client-side
logging. We use two days’ server-side logging events and client-side logging events
for article pages, and compute the average dwell time by each method. Note that even
for the same time period, different approaches use different sets of events to
compute dwell time (see above example) and client-side events can be lost. Thus, the
total number of articles considered varies (the first, the third and the fifth
column). From the table, we can see that the average dwell time computed by the FB
approach is very close to the client-side logging. Meanwhile, the LE approach
greatly under-estimates the dwell time, compared with the client-side events. This
result

Shapiro–Wilk test [14] reject such an assertion. A further study of its formal
distribution is in future work. Regardless of its normality, we observe that the
bell-curve pattern holds for different time periods and different types of devices
(see Figure 5 and 6, which we will discuss later).
   Since dwell time approximates the time users spend on an item, it is natural to
assume that given the same content quality, a longer news article would attract
longer average dwell time across all users. In order to demonstrate this behavior,
we investigate this issue on text-heavy news articles³, and plot a scatter-plot of
average dwell time per article versus article length in Figure 3 where X-axis is the
length of article and the Y-axis is the average dwell time of that particular
article from all users. In order to show things clearly, the dwell times and article
lengths are binned into smaller buckets where each point represents a bucket. We
show the scatter-plot of the dwell time against the length of the article on
different devices, namely desktop, tablet and mobile devices. The black line is a
fitted linear line for a particular device type with the 0.95 confidence interval in
the grey area. From the figure, it is very clear that the length of the article has
good linear correlation with the average dwell time across devices. Also, matching
our intuition, the average dwell time on desktop is longer for long articles and the
reading behavior on tablet and mobile devices is similar. Furthermore, the
correlation becomes weaker when articles are very long: for desktop when the article
is longer than 1,000 words, the plot has big variance; this indicates that users may
have run out of their time-budget to consume the complete long story. Although the
high
shows: through simulating users’ reading attention switch events                                            correlation between the length of articles and average dwell time
from server-side, the FB approach better handles item-level dwell                                           naturally leads to using the length of articles as a feature to pre-
time computation in multi-tabbed modern browser setting. There-                                             dict average dwell time, we point out based on the observed data:
fore, we now use FB as a relative reliable fall-back proxy to mea-                                          (1) per-user dwell time (rather than binned average dwell time over
sure the item-level dwell time from server-side logging events when                                         all users) has little correlation with the article length; and (2) long
client-side logging is not available.                                                                       dwell time may not necessarily reflect that users are really inter-
                                                                                                            ested in the article. In other words, content length alone can hardly
3.2 Dwell Time Analysis                                                                                     explain the per-user per-item dwell time, and we need to be care-
   In order to understand the nature of it, we analyze per-item per-                                        ful of the bias of dwell time based user engagement measurements
user dwell time from a large real-world data collection from Yahoo.                                         towards long length content stories. (We will revisit this issue in
We plot the unnormalized distribution of log of dwell time in Fig-                                          §3.4.)
ure 2. The data used for this figure is from one month’s Yahoo
homepage sample traffic. It is obvious that the log of dwell time                                           3
                                                                                                              For other content types such as slideshow and video, the content
follows a bell-curve. Many would guess the distribution of log of                                           length could be the number of slides in the slideshow and the video
dwell time is a Gaussian distribution. However, Q-Q plot and also                                           clip’s raw duration, respectively.
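The binning and line fitting described for the article-length analysis can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `binned_means` and `fit_line`, the fixed bucket width, and the toy numbers are all hypothetical and are not taken from the paper's data or tooling.

```python
from collections import defaultdict


def binned_means(lengths, dwell_times, bin_width):
    """Group articles into length buckets and average both axes per bucket."""
    buckets = defaultdict(list)
    for length, dwell in zip(lengths, dwell_times):
        buckets[length // bin_width].append((length, dwell))
    points = []
    for key in sorted(buckets):
        items = buckets[key]
        mean_len = sum(l for l, _ in items) / len(items)
        mean_dwell = sum(d for _, d in items) / len(items)
        points.append((mean_len, mean_dwell))
    return points


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: article length (words) and average dwell time (seconds).
lengths = [100, 150, 300, 350, 500, 550, 700, 750]
dwell = [30.0, 40.0, 70.0, 80.0, 110.0, 120.0, 150.0, 160.0]

points = binned_means(lengths, dwell, bin_width=200)  # one point per bucket
slope, intercept = fit_line(points)
print(points, slope, intercept)
```

Fitting on bucket averages rather than raw per-user observations is what makes the trend look clean; as the text notes, the same fit on per-user dwell times would show little correlation.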
[Plots for Figures 5 and 6 omitted. X-axis: LogDwellTime; legend: Devices (Desktop, Tablet, Mobile).]

Figure 5: The (un)normalized distribution of log of dwell time for slideshows across different devices. The X-axis is the log of dwell time and the Y-axis is the counts (removed for proprietary reasons).

Figure 6: The (un)normalized distribution of log of dwell time for videos across different devices. The X-axis is the log of dwell time and the Y-axis is the counts.

   For slideshows, a natural assumption would be that the larger the number of photos/slides, the longer the average dwell time these items would receive over all users. We show the relationship between the number of photos and the average dwell time on slideshows in Figure 4. Again, we binned the number of photos and the average dwell time. The correlation is clearly not as strong as for the length of articles. For videos, we observe a similarly weak correlation between the duration of a video clip and its average dwell time.

3.3 Normalized Dwell Time
   As may be obvious, users' consumption of content items varies by context. For example, in historical data, we found that users have on average less dwell time per article on mobile or tablet devices than on desktops. Also, users on average spend less time per slideshow than per article. Indeed, different content types, by their nature, would result in different browsing behaviors; thus we would expect different dwell times among these content types. In order to extract comparable user engagement signals, we introduce the normalized user dwell time to handle users' different content consumption behaviors on different devices for personalization. The technique discussed here can also be used to blend multiple content sources (e.g., slideshows and articles) into a unified stream.
   Although the distributions of users' per-item dwell time (from all users) differ across content types, we found that each content type's distribution remains similar over a long time period. To demonstrate this observation, we further plot the log of dwell time of two important types of content, slideshows and videos, in Figure 5 and Figure 6, respectively. Similar to the article case, we do not report the absolute values for both types. However, the patterns are again obvious. In all these cases, the log of dwell time has a Gaussian-like distribution. Indeed, most of the dwell time distributions for each content type on different device platforms surprisingly share a similar pattern. The same conclusion holds for different lengths of the time period. Also, we can easily see that the peak of log of dwell time is highest for videos, followed by articles and slideshows, which matches our intuitive understanding of these three types of content items.
   Thus, the basic idea is that, for each consumed item, we would like to extract its dwell-time-based user engagement level such that it is comparable across different contexts (e.g., content types, devices, instrumentations, etc.). We do this by normalizing out the variance of the dwell time due to differences in context. In particular, we adopt the following procedure to normalize dwell time into a comparable space:
    1. For each content consumption context $C$, collect the historical per-item time spent data and compute the mean $\mu_C$ and standard deviation $\sigma_C$, both in log space.
    2. Given a new content item $i$'s time spent $t_i$ in its context $C_i$, compute the z-value in log space: $z_i = \frac{\log(t_i) - \mu_{C_i}}{\sigma_{C_i}}$.
    3. Compute the normalized dwell time of item $i$ in the article space: $t_{i,\mathrm{article}} = \exp(\mu_{\mathrm{article}} + \sigma_{\mathrm{article}} \times z_i)$.
In other words, all other types of items are now "comparable" after this transformation, and the normalized user engagement signals are then used for training recommendation models to handle different content types and can be deployed in different contexts.

3.4 Predicting Dwell Time
   The average dwell time of a content item can be viewed as one of the item's inherent characteristics, which provides important average user engagement information on how much time a user will spend on this item. Predicting average dwell time for each content item can help label items when their dwell time is not available. For example, content items that have never been shown to users (such as new items) will not have available dwell time. As another example, a user's dwell time on her clicked story may not always be computable because there may be no subsequent server-side events from the same user. Therefore, leveraging predicted average dwell time can greatly improve the "coverage" (or alleviate the missing data issue). Not handling these situations could degrade the effectiveness of applying dwell time in personalization applications. In this sub-section, we present a machine learning method to predict dwell time of article stories using simple features.
   The features we consider are the topical category of the article and the context in which the article would be shown (e.g., desktop, tablet or mobile). We use Support Vector Regression (SVR)⁴ models to predict dwell time. The model is trained from a sample of user-article interaction data. We show the features and their corresponding weights in Table 3. Most features are categorical and we use log(Dwell Time) as the model response. We can loosely interpret the weights of these features as how much that feature contributes to the article's average dwell time prediction. The feature weights match our current expectation for average users' article reading behavior: longer articles can lead to higher predicted average dwell time; people spend a longer time reading articles on desktop devices than mobile devices; more serious topics can lead users to dwell longer. Potentially, the predicted average dwell time could be leveraged to normalize the dwell-time-based user engagement signal (as discussed in §3.3); however, this is non-trivial as the interplay between the dwell time features and users' experience is not obvious. For example, will recommending more serious topics that have long average dwell time lead to better or worse user experience? We leave answering this question for future work.

Table 3: Features and corresponding weights for predicted dwell time. The features are shown in the order of magnitude of weights. The left column shows positive weights and the right negative weights.

   Name              Weight    Name                Weight
   Desktop           1.280     Apparel             -0.001
   Mobile            1.033     Hobbies             -0.010
   Tablet            0.946     Travel & Tourism    -0.039
   Content Length    0.218     Technology          -0.040
   Transportation    0.136     Environment         -0.065
   Politics          0.130     Beauty              -0.094
   Science           0.111     Finance             -0.151
   Culture           0.100     Food                -0.173
   Real Estate       0.088     Entertainment       -0.191

4 ~cjlin/liblinear/

   In this paper, we use the Gradient Boosted Decision Tree (GBDT) algorithm [6] to learn the ranking functions. GBDT is an additive regression algorithm consisting of an ensemble of trees, fitted to current residuals, gradients of the loss function, in a forward step-wise manner. It iteratively fits an additive model as:

   $f_t(x) = T_t(x; \Theta) + \lambda \sum \beta_t T_t(x; \Theta_t)$   (1)

such that a certain loss function $L(y_i, f_T(x_i))$ (e.g., square loss, logistic loss) is minimized, where $T_t(x; \Theta_t)$ is a tree at iteration $t$, weighted by a parameter $\beta_t$, with a finite number of parameters $\Theta_t$, and $\lambda$ is the learning rate. At iteration $t$, tree $T_t(x; \beta)$ is induced to fit the negative gradient by least squares. That is:

   $\hat{\Theta} = \arg\min \sum_{i} w_i \left(-G_{it} - \beta_t T_t(x_i; \Theta)\right)^2$   (2)

where $w_i$ is the weight for data instance $i$, which is usually set to 1, and $G_{it}$ is the gradient over the current prediction function: $G_{it} = \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{t-1}}$. The optimal weights of tree $\beta_t$ are determined by $\beta_t = \arg\min_{\beta} \sum_{i}^{N} L(y_i, f_{t-1}(x_i) + \beta T(x_i, \theta))$. For more details about GBDT, please refer to [20]. As mentioned above, if we use click/non-click as responses, we simply treat $x_i = x_{q,d}$ and $y_i = y_{d,u}$. In fact, all previous research on MLR-based content recommendation systems has focused on using click-based information for training and evaluation. For example, Bian et al. [2] and Agarwal et al. [1] have used users' click/view data in the Today module in Yahoo for optimizing CTR for content recommendation.
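As a rough illustration of the boosting step in Eqs. (1) and (2), the sketch below runs a few rounds under square loss, for which the negative gradient is simply the residual between the response and the current prediction. Several things here are assumptions made for illustration and are not the paper's implementation: the tree is a depth-1 stump on a single feature, the names `fit_stump` and `boost` are hypothetical, and the line search for the tree weight uses the closed form that holds for square loss only.

```python
def fit_stump(xs, targets, weights):
    """Weighted least-squares regression stump on a single feature.

    Returns (threshold, left_value, right_value) minimizing
    sum_i w_i * (target_i - T(x_i))**2 over midpoints between
    consecutive distinct feature values.
    """
    def wmean(idx):
        total = sum(weights[i] for i in idx)
        return sum(weights[i] * targets[i] for i in idx) / total

    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [i for i, x in enumerate(xs) if x <= thr]
        right = [i for i, x in enumerate(xs) if x > thr]
        lv, rv = wmean(left), wmean(right)
        err = sum(
            weights[i] * (targets[i] - (lv if xs[i] <= thr else rv)) ** 2
            for i in range(len(xs))
        )
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    return best[1:]


def boost(xs, ys, weights, n_rounds, learning_rate):
    """Forward step-wise boosting with square loss L = (y - f)^2 / 2."""
    preds = [0.0] * len(xs)
    trees = []
    for _ in range(n_rounds):
        # Negative gradient of the square loss at the current prediction.
        neg_grad = [y - f for y, f in zip(ys, preds)]
        thr, lv, rv = fit_stump(xs, neg_grad, weights)
        tree_out = [lv if x <= thr else rv for x in xs]
        # Closed-form line search for beta (valid for square loss).
        denom = sum(t * t for t in tree_out)
        beta = sum(r * t for r, t in zip(neg_grad, tree_out)) / denom if denom else 0.0
        preds = [f + learning_rate * beta * t for f, t in zip(preds, tree_out)]
        trees.append((thr, lv, rv, beta))
    return preds, trees


# Hypothetical toy data: one feature, binary responses, unit instance weights.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 1.0, 1.0]
ws = [1.0, 1.0, 1.0, 1.0]

preds_one, trees_one = boost(xs, ys, ws, n_rounds=1, learning_rate=1.0)
preds_slow, _ = boost(xs, ys, ws, n_rounds=3, learning_rate=0.5)
print(preds_one, preds_slow)
```

The `weights` argument plays the role of $w_i$ in Eq. (2): with all weights equal to 1 the stump is an ordinary least-squares fit, and non-uniform weights make some instances count more when the tree is induced.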
   In this section, we investigate how to leverage item-level dwell time to train machine-learned ranking (MLR) models for content recommendation.

   The Basic MLR Setting: In traditional MLR, a query $q$ is represented as a feature vector $\mathbf{q}$ while a document $d$ is represented as a feature vector $\mathbf{d}$. A function $g$ takes these two feature vectors and outputs a feature vector $x_{q,d} = g(\mathbf{q}, \mathbf{d})$ for this query-document pair $(q, d)$. Note that $g$ could be as simple as a concatenation. Each query-document pair has a response $y_{q,d}$, which in traditional IR is usually the relevance judgment. Typically this judgment is common to all users, that is, there is no user-specific personalization. Depending on the particular paradigm (e.g., point-wise, pair-wise or list-wise), a machine-learned model imposes a loss function $l$ which takes one or all documents belonging to a query $q$ as the input, approximating the individual relevance judgment, pair-wise relevance preferences or the whole list ordering. In the context of content recommendation, we can simply borrow the idea of MLR by treating user interests as queries and articles (or other types of items) as documents. Although this formulation looks promising, there are two challenges. One is how to construct a feature vector for queries (users) and the second is how to utilize user activities to infer relevancy between users and documents. The discussion of the first question is out of this paper's scope. Here, we focus on the second question. While the definition of relevance judgments might be unambiguous in IR, it is not straightforward in the context of content personalization. One cheap and easy approach is to use users' click-through data as relevance judgments. Essentially, in such a case, we use $y_{d,u} \in \{0, 1\}$, a binary variable, to indicate whether an article $d$ (the "document" in the IR setting) is clicked by the user $u$ (the "query" in the IR setting). Under this formalism, an MLR model indeed optimizes click-through rate (CTR).

   Dwell Time for MLR: There are two intuitive ways to incorporate dwell time into MLR frameworks. Let $\gamma_d$ be the average dwell time for article $d$. Taking the GBDT algorithm mentioned above, we could: 1) use the per-article dwell time as the response, treating $y_i = h(\gamma_d)$; or 2) use the per-article dwell time as the weight for sample instances, treating $w_i = h(\gamma_d)$, where the function $h$ is a transformation of the dwell time. In both cases, we promote articles that have high average dwell time and try to learn models that can optimize for user engagement. In all our experiments, we found that $h(x) = \log(x)$ yields the best performance.
   We first show the effectiveness of the MLR model in an offline experiment. We use data from a bucket of traffic of a Yahoo property and split it uniformly at random into training and test sets, using a 70-30 split. We repeat this sampling multiple times and the average results across all train-test splits are shown in Table 4. The first observation is that either method of using dwell time, as learning target or as instance weight, can improve three major ranking metrics. The second observation is that dwell time as an instance weight leads to the best performance. We further validate these findings in online buckets, shown in Figure 7. Without disclosing the absolute numbers, we show the same three buckets with respect to two types of performance metrics: 1) CTR (shown on the top) and 2) a user engagement metric (shown on the bottom). The user engagement metric is a proprietary one, which can be explained as the quality of users' engagement with Yahoo homepage's content stream. Each data point represents the metric on a particular day. We report the bucket metrics for a three-month period between June 2013 and August 2013. Initially, the three buckets were running the same linear model and we can see from the first three data points (three-day data) that both CTR and the user engagement metric are similar. Then, we update the models as follows: 1) A: a linear model optimizes click/non-click, 2) B: a GBDT model optimizes click and

[Plot omitted. Legend: Buckets (A, B, C); panel label: User Engagement.]

instance, users come to Yahoo’s homepage to consume news items such as articles, videos and slideshows by only browsing and clicking on particular items that they are interested in without providing any explicit rating feedback, even though the user interface allows users to “like” or “dislike” the item. Therefore, users’ feedback under this context is implicit.
   In this paper, we propose to use dwell time of users rather than asking them to give ratings or using click information as implicit rating feedback. User feedback is represented as a M × N sparse matrix I, where M is the number of users, and N is the number
                                                                                                                                                                  ●● ●●
                                                                                                                                                                                                                                        of items, and each entry in the matrix is one user feedback, which
                                                                                                                                                                                                                                        is denoted as ri,j . For dwell time, ri,j ∈ R or [0, 6] for normal-
                                                                                      ●     ●
                                                                                            ●●       ●●                              ●                            ●                       ●
    ●                                                                 ●                                               ●          ●● ●                                                                   ●
    ●                                             ●                                                  ● ●              ●       ●●                                         ●
                                                                                                 ●                                 ●                                                      ●           ●
                                                                              ●               ●          ●●                        ●                                                                  ●●
                                                                                                          ●                                                                   ●

                                                                                                                                                                                                                                        ized dwell time; for click/view, ri,j ∈ [0, 1]. Formally, we aim to
                                                  ●●                                          ●●                              ●●                                                              ●
                          ●                                                                                                                                   ●                   ●
                          ●                                                                     ●        ●                        ●
                          ●                                                   ●                                                                               ●                       ●       ●
                                                      ●                           ●                                                                                                           ●
                                                      ●                           ●                                                                           ●                   ●

                                                                                                                                                                                                  ●                                     predict the unobserved entries in the matrix based on the observed
            Jun 15                                                            Jul 01                                          Jul 15                                                      Aug 01                                        data. Rank-based matrix factorization is used. We decompose the
                                                                                                                                                                                                                                        sparse matrix I as U and V , to minimize the following objective
Figure 7: The relative performance comparison between three
buckets. The top figure shows the relative CTR difference and
the bottom figure shows the relative user engagement differ-

$$\arg\min \sum_{i=1}^{M} \sum_{r_{i,j}} U_i (V_j - V_k)^T + \lambda \left( |U|^2 + |V|^2 \right) \qquad (3)$$
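As a minimal sketch, the expression in Eq. (3) can be evaluated on a toy feedback matrix as follows. The function name `ranking_objective`, the use of NaN for unobserved entries, the factor shapes ($U$ is $M \times d$, $V$ is $N \times d$), the reading of $|\cdot|^2$ as the squared Frobenius norm, and the choice of item pairs $(j, k)$ with $r_{i,j} > r_{i,k}$ among the items a user has interacted with are all assumptions made for illustration. The code only evaluates the expression as printed; it does not perform the optimization.

```python
import numpy as np


def ranking_objective(R, U, V, lam):
    """Evaluate the pairwise term plus regularizer of Eq. (3), as printed.

    R   : (M, N) float array of feedback r_{i,j}; np.nan marks unobserved entries.
    U, V: (M, d) and (N, d) latent factor matrices (shapes are an assumption).
    lam : regularization weight (lambda).
    """
    total = 0.0
    for i in range(R.shape[0]):
        items = np.flatnonzero(~np.isnan(R[i]))   # items user i has feedback for
        ratings = R[i, items]
        scores = V[items] @ U[i]                  # U_i V_j^T for each observed j
        # ASSUMPTION: the inner sum ranges over pairs (j, k) with r_{i,j} > r_{i,k}.
        prefer = ratings[:, None] > ratings[None, :]
        diff = scores[:, None] - scores[None, :]  # U_i (V_j - V_k)^T
        total += diff[prefer].sum()
    # ASSUMPTION: |.|^2 is read as the squared Frobenius norm.
    return total + lam * (np.sum(U ** 2) + np.sum(V ** 2))


# Toy data (hypothetical): 2 users, 3 items, normalized dwell time in [0, 6].
R = np.array([[5.0, 1.0, np.nan],
              [np.nan, 2.0, 4.0]])
U = np.array([[1.0, 0.0],
              [0.0, 1.0]])
V = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0]])
value = ranking_objective(R, U, V, lam=0.1)
```

On this toy input the pairwise term sums to 3.0 and the regularizer contributes 0.1 × 17 = 1.7, so `value` is 4.7.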
