Beyond Clicks: Dwell Time for Personalization
Beyond Clicks: Dwell Time for Personalization

Xing Yi, Liangjie Hong, Erheng Zhong, Nathan Nan Liu, Suju Rajan
{xingyi, liangjie, erheng, nanliu, suju}@yahoo-inc.com
Personalization Sciences, Yahoo Labs, Sunnyvale, CA 94089, USA

ABSTRACT
Many internet companies, such as Yahoo, Facebook, Google and Twitter, rely on content recommendation systems to deliver the most relevant content items to individual users through personalization. Delivering such personalized user experiences is believed to increase the long-term engagement of users. While there has been a lot of progress in designing effective personalized recommender systems, by exploiting user interests and historical interaction data through implicit (item click) or explicit (item rating) feedback, directly optimizing for users' satisfaction with the system remains challenging. In this paper, we explore the idea of using item-level dwell time as a proxy to quantify how likely a content item is relevant to a particular user. We describe a novel method to compute accurate dwell time based on client-side and server-side logging and demonstrate how to normalize dwell time across different devices and contexts. In addition, we describe our experiments in incorporating dwell time into state-of-the-art learning-to-rank techniques and collaborative filtering models that obtain competitive performances in both offline and online settings.

Figure 1: A snapshot of Yahoo's homepage in the U.S., where the content stream is highlighted in red.

Categories and Subject Descriptors: H.3.5 [Information Storage and Retrieval]: Online Information Services
General Terms: Theory, Experimentation
Keywords: Content Recommendation, Personalization, Dwell Time, Learning to Rank, Collaborative Filtering

1. INTRODUCTION
Content recommendation systems play a central role in today's Web ecosystems. Companies like Yahoo, Google, Facebook and Twitter are striving to deliver the most relevant content items to individual users. For example, visitors to these sites are presented with a stream of articles, slideshows and videos that they may be interested in viewing. With a personalized recommendation system, these companies aim to better predict and rank content of interest to users by using historical user interactions on the respective sites. The underlying belief is that personalization increases long-term user engagement and, as a side benefit, also drives up other aspects of the services, for instance, revenue.

Therefore, there has been a lot of work in designing and improving personalized recommender systems. Traditionally, simplistic user feedback signals, such as click-through rate (CTR) on items or user-item ratings, have been used to quantify users' interest and satisfaction. Based on these readily available signals, most content recommendation systems essentially optimize for CTR or attempt to fill in a sparse user-item rating matrix with missing ratings. Specifically for the latter case, with the success of the Netflix Prize competition, matrix-completion based methods have dominated the field of recommender systems. However, in many content recommendation tasks users rarely provide explicit ratings or direct feedback (such as 'like' or 'dislike') when consuming frequently updated online content. Thus, explicit user ratings are too sparse to be usable as input for matrix factorization approaches. On the other hand, item CTR as an implicit user interest signal does not capture any post-click user engagement.
For example, users may have clicked on an item by mistake or because of link bait, but are truly not engaged with the content being presented. Thus, it is questionable whether leveraging this noisy click-based user engagement signal for recommendation can achieve the best long-term user experience. In fact, a recommender system needs different strategies to optimize short-term metrics like CTR and long-term metrics like how many visits a user would make over several months. Thus, it becomes critical to identify signals and metrics that truly capture user satisfaction and to optimize these accordingly.

We argue that the amount of time that users spend on content items, the dwell time, is an important metric to measure user engagement on content and should be used as a proxy for user satisfaction with recommended content, complementing and/or replacing click-based signals.
However, utilizing dwell time in a personalized recommender system introduces a number of new research and engineering challenges. For instance, a fundamental question is how to measure dwell time effectively. Furthermore, different users exhibit different content consumption behaviors even for the same piece of content on the same device. In addition, for the same user, depending on the nature of the content item and the context, the user's content consumption behavior can be significantly different. Therefore, it is beneficial to normalize dwell time across different devices and contexts. Also, recommender systems usually employ machine learning-to-rank (MLR) techniques and collaborative filtering (CF) models, with a wide range of features, to obtain state-of-the-art performance. Using dwell time in these frameworks is not straightforward.

In this paper, we use the problem of recommending items for the content feed or stream on Yahoo's homepage, shown in Figure 1, as a running example to demonstrate how dwell time can be embedded into a personalized recommendation system. We have designed several approaches for accurately computing item-level user content consumption time from large-scale web browsing log data. We first use the log data to determine when each item page gains or loses the user's attention. Capturing the user's attention on an item page enables us to compute per-user item-level dwell time. In addition, we leverage the content consumption dwell time distributions of different content types for normalizing users' engagement signals, so that we can use this engagement signal for recommending items of multiple content types to the user in the content stream. We then incorporate dwell time into machine learning-to-rank (MLR) techniques and collaborative filtering (CF) models. For MLR, we propose to use per-user item-level dwell time as the learning target, which can be easily considered in all existing MLR models, and demonstrate that it can result in better performance. For CF, we use dwell time as a form of implicit feedback from users and demonstrate that a state-of-the-art matrix factorization model incorporating this information can yield competitive and even better performance than the click-optimized counterpart. To be more specific, we make the following contributions in this paper:
• A novel method to compute fine-grained item-level user content consumption time for better understanding users' interests is proposed.
• A novel solution to normalize dwell time for items of multiple content types across different devices is proposed and presented.
• An empirical study of dwell time in the context of a content recommendation system is presented.
• An MLR framework to utilize dwell time is proposed and its effectiveness in real-world settings is demonstrated.
• A CF framework to utilize dwell time is proposed and its effectiveness against non-trivial baselines is presented.

The paper is organized as follows. In §2, we review three related research directions. In §3, we demonstrate how dwell time can be measured and present some of its interesting characteristics. In the following sections, we show two important use cases for dwell time. In §4, we show how dwell time can be used in MLR to obtain superior performance over models that optimize CTR. In §5, we plug dwell time into state-of-the-art CF models and demonstrate that we can obtain competitive performance. We conclude the paper in §6.

2. RELATED WORK
In this section, we review three related research directions. First, we examine how dwell time is studied and used in web search or IR domains. We carefully analyze which of the existing practices and experiences on dwell time computation can be utilized in the context of personalization. We then list several pointers for MLR, as it has been extensively studied in the past decade, followed by a brief discussion of CF, paying special attention to how implicit user feedback is used in CF.

Dwell Time in Other Domains: A significant amount of previous research on web search has investigated using the post-click dwell time of each search result as an indicator of its relevance for web queries, and how it can be applied to different web search tasks. All such previous research focused on examining dwell time's utility for improving search results. For instance, White and Kelly [15] demonstrated that using dwell time can potentially help improve the performance of implicit relevance feedback. Kim et al. [8] and Xu et al. [17, 18] showed that using webpage-level dwell time can help personalize the ranking of web search results. Liu et al. [11] investigated the necessity of using different thresholds of dwell time in order to derive meaningful document relevance information from the dwell time for different web search tasks. To the best of our knowledge, we are the first to use dwell time for personalized content recommendation. Furthermore, we consider different types of content (news articles, slideshows and videos), present several approaches to accurately measure content consumption time, and use the dwell time for understanding users' daily habits and interests.
Most recently, YouTube has started to use users' video time spent instead of the click event¹ to better measure users' engagement with video content. In contrast, we focus on personalized content recommendation across multiple content types.

Learning To Rank in Web Search: The field of MLR has significantly matured in the past decade, mainly due to the popularity of search engines. Liu [12] and Li [10] provide in-depth surveys on this topic. Here, we point out that a fundamental issue with all existing MLR models is that they all optimize for relevance, an abstract yet important concept in IR. In the standard setting, the "relevance" between a particular query and a list of documents is objective and the same for all users. In IR, "relevance" is judged by human experts through a manual process and is difficult to scale to millions of real queries. In order to personalize IR, a natural alternative to "relevance" is to optimize CTR. In this paper, we explore the possibility of optimizing for dwell time under the existing framework of gradient boosted decision trees [6]. However, other MLR models could also be used, such as pair-wise models (e.g., RankBoost [5] and RankNet [3]) and list-wise models (e.g., AdaRank [16] and ListNet [4]). Note that we do not seek to propose new MLR models, but instead show the advantage of utilizing dwell time in existing models.

Collaborative Filtering: In CF systems, users' satisfaction with the items is usually not considered. Almost all previous work in CF (e.g., [9, 1]) takes only explicit feedback such as ratings, or "implicit" click-based feedback, into account. Hu et al. [7] considered implicit feedback signals, such as whether a user clicks on or reviews an item, and incorporated them into the matrix factorization framework. Rendle et al.
[13] proposed a learning algorithm for binary implicit feedback datasets, which is essentially similar to AUC optimization. None of these went beyond binary implicit feedback to investigate the interactions between users and items. The approach of Yin et al. [19] is the closest to our work. In that paper, the authors used a graphical model on the explicit feedback signals and dwell time data to predict the user's score. Our work is different in that our model does not require the presence of explicit user feedback.

¹ http://youtubecreator.blogspot.com/2012/08/youtube-now-why-we-focus-on-watch-time.html
3. MEASURING ITEM DWELL TIME
In this section, we describe how dwell time can be measured from web logs and show its basic characteristics.

3.1 Dwell Time Computation
Accurately computing item-level dwell time from web-scale user browsing activity data is a challenging problem. As an example, most modern browsers have a multi-tabbed interface in which users can open multiple stories simultaneously and switch between them. In the multi-tabbed setting, figuring out the tab that captured the user's attention is non-trivial. In this paper we describe two complementary methods to derive dwell time, one via client-side logging and the other via server-side logging. We have also conducted a simple study comparing these two approaches. Although client-side logging can capture fine-grained user behavior and has the potential of being highly useful, it depends heavily on browser implementation and risks large amounts of data loss. Therefore, when the client-side data is not available, we resort to reasonable approximation methods through server-side logging. Thus, we can reliably compute dwell time in a real-world setting.

Client-Side Dwell Time: Client-side logging utilizes Javascript/DOM events² to record how users interact with the content story pages. Let us imagine the scenario demonstrated in Table 1, where the left column is a sequence of user interactions with a news article and the right column contains the corresponding client-side events, in the form of {event name, time stamp} tuples.

Table 1: Client-side Logging Example
User Behaviors                                                              | Client-side Events
A user opens a news article page.                                           | {DOM-ready, t1}
He reads the article for several seconds.                                   | {Focus, t2}
He switches to another browser tab or window to read other articles.        | {Blur, t3}
He goes back to the article page and comments on it.                        | {Focus, t4}
He closes the article page, or clicks the back button to go to another page.| {BeforeUnload, t5}

In these events, DOM-ready indicates the ready-time of the body of the page, which can be considered as the start of the dwell time. Focus indicates that the user's focus was back on the body of the news article. Blur means that the article body lost the user's attention. BeforeUnload is the time point immediately prior to the page being unloaded. Based on these events, we can compute dwell time on the client side by simply accumulating the time differences between Focus and Blur events. From the above example, we have the dwell time as: (t3 − t2) + (t5 − t4). We can clearly see that the client-side approach can accurately capture users' actual attention even in multi-tabbed modern browsers. The major drawback of client-side logging is that it relies on the correctness of Javascript execution and on servers successfully receiving and logging client-sent data. Data loss in this client-server interaction can be very high, for example, because of loss of internet connection. In addition, users may also disable Javascript in their browsers.

² http://en.wikipedia.org/wiki/DOM_events
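To make the accumulation rule concrete, the following is a minimal sketch of the client-side computation, replaying a list of {event name, timestamp} tuples and summing the intervals between a Focus and the next Blur or BeforeUnload. This is our illustration under the semantics described above, not the paper's production logging code; timestamps are assumed to be in seconds.

```python
# Sketch: accumulate client-side dwell time from (event name, timestamp) tuples.
# Event names follow Table 1; dwell accrues between a Focus and the next
# Blur or BeforeUnload. Illustrative only, not Yahoo's production code.

def client_side_dwell(events):
    """events: list of (name, ts) tuples sorted by time, ts in seconds."""
    dwell, focus_start = 0.0, None
    for name, ts in events:
        if name == "Focus":
            focus_start = ts
        elif name in ("Blur", "BeforeUnload") and focus_start is not None:
            dwell += ts - focus_start
            focus_start = None
    return dwell

# Table 1's example with concrete timestamps t1..t5. DOM-ready marks page
# readiness; the example's accumulation starts at the first Focus.
events = [("DOM-ready", 0), ("Focus", 2), ("Blur", 30),
          ("Focus", 45), ("BeforeUnload", 80)]
print(client_side_dwell(events))  # (30-2) + (80-45) = 63, i.e. (t3-t2)+(t5-t4)
```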
Server-Side Dwell Time: When client-side logs are not available, we resort to server-side logging to infer users' attention on item pages. The computation of dwell time on the server side is built on a number of heuristics. One approach is to simulate client-side user attention events by identifying pseudo Focus and Blur events from server logs. Consider the following sequence of logging events:

{i, Click, t1} → {j, Click, t2} → {k, Click, t3}

where each event is a tuple of an item id, an event type and a timestamp. The dwell times for i and j can be computed as t2 − t1 and t3 − t2, respectively. A more complicated example is:

{i, Click, t1} → {j, Click, t2} → {k, Click, t3} → {i, Comment, t4} → {n, Click, t5}

where the dwell time for page i can be computed as (t2 − t1) + (t5 − t4). We denote this the FB (Focus/Blur) method. Another, simpler heuristic is the LE (Last Event) method, which takes the last event on the page as the end-page event and computes the interval between the first-event timestamp and the last one. From the example above, the dwell time of page i by LE would be t4 − t1.
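The two heuristics can be stated compactly in code. The sketch below is our reading of the FB and LE rules applied to a per-user log of (item id, event type, timestamp) tuples; the exact event grammar of the production logs is not specified in the paper, so treat this as illustrative.

```python
# Sketch: server-side FB and LE dwell-time heuristics over a per-user event
# log of (item_id, event_type, ts) tuples sorted by time. Illustrative only.
from collections import defaultdict

def fb_dwell(events):
    """FB (Focus/Blur): an event on a different item acts as a pseudo-Blur for
    the current item; any event on an item acts as a pseudo-Focus for it."""
    dwell = defaultdict(float)
    cur_item, cur_start = None, None
    for item, _etype, ts in events:
        if cur_item is None:
            cur_item, cur_start = item, ts
        elif item != cur_item:
            dwell[cur_item] += ts - cur_start   # pseudo-Blur of the current item
            cur_item, cur_start = item, ts      # pseudo-Focus of the new item
        # same-item event: attention is assumed continuous, keep cur_start
    return dict(dwell)                          # the final segment has no end event

def le_dwell(events):
    """LE (Last Event): interval between the first and last event per item."""
    first, last = {}, {}
    for item, _etype, ts in events:
        first.setdefault(item, ts)
        last[item] = ts
    return {i: last[i] - first[i] for i in first}

log = [("i", "Click", 0), ("j", "Click", 10), ("k", "Click", 25),
       ("i", "Comment", 40), ("n", "Click", 70)]
print(fb_dwell(log))  # i: (10-0)+(70-40) = 40, matching (t2-t1)+(t5-t4)
print(le_dwell(log))  # i: 40-0 = 40 (t4-t1); single-event items get 0
```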
Both approximation methods have their own weaknesses. 1) The FB approach can over-estimate the dwell time because servers do not know the exact time the target story page loses the user's attention. For the above example, if the last click happens on some other page, the (t5 − t4) interval could include some time the user spent outside the target page. The FB approach might also under-estimate the dwell time because servers also do not accurately know the time the target page gains user attention. For the above example, if the user returned to the target page and read it for some additional time before commenting at t4, the dwell time computation will not include that additional time. 2) The LE approach does not consider the scenario in which the user's reading focus switches among multiple browser page tabs, thus over-estimating the dwell time. On the other hand, because the LE approach conservatively uses the last event on the target page to compute dwell time, and servers do not know when the user abandons or closes the page (without the client-side unload event), it can also under-estimate the dwell time.

Because both approaches could over- or under-estimate the item-level dwell time, we conducted a simple comparison study among FB, LE and client-side logging. The results are shown in Table 2. The purpose of this study is to explore which server-side approach better approximates client-side logging. We use two days' server-side and client-side logging events for article pages, and compute the average dwell time by each method. Note that even for the same time period, different approaches use different sets of events to compute dwell time (see the example above) and client-side events can be lost. Thus, the total number of articles considered varies (the first, third and fifth columns).

Table 2: Comparison of dwell time measurement. The first two columns are for LE, the middle two columns are for FB and the last two columns are for client-side logs. Each row contains data from a day.
# (LE) | DT. (LE) | # (FB) | DT. (FB) | # (C)  | DT. (C)
3,322  | 86.5     | 3,197  | 134.4    | 3,410  | 130.3
5,711  | 85.4     | 5,392  | 132.6    | 5,829  | 124.0

From the table, we can see that the average dwell time computed by the FB approach is very close to that of the client-side logging. Meanwhile, the LE approach greatly under-estimates the dwell time compared with the client-side events. This result shows that, by simulating users' reading attention switch events from the server side, the FB approach better handles item-level dwell time computation in the multi-tabbed modern browser setting. Therefore, we use FB as a relatively reliable fall-back proxy to measure item-level dwell time from server-side logging events when client-side logging is not available.

3.2 Dwell Time Analysis
In order to understand the nature of dwell time, we analyze per-item per-user dwell time from a large real-world data collection from Yahoo. We plot the unnormalized distribution of the log of dwell time in Figure 2. The data used for this figure is from one month's Yahoo homepage sample traffic.

Figure 2: The (un)normalized distribution of the log of dwell time for articles across different devices (desktop, tablet, mobile). The X-axis is the log of dwell time and the Y-axis is the counts (removed for proprietary reasons).

It is obvious that the log of dwell time follows a bell curve. Many would guess that the distribution of the log of dwell time is Gaussian. However, a Q-Q plot and the Shapiro–Wilk test [14] both reject such an assertion. A further study of its formal distribution is left for future work. Regardless of its normality, we observe that the bell-curve pattern holds for different time periods and different types of devices (see Figures 5 and 6, which we will discuss later).
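The normality check itself is routine; a small sketch with SciPy, using synthetic dwell times in place of the proprietary Yahoo sample, looks roughly as follows.

```python
# Sketch: checking whether log dwell time is Gaussian, as in Sec. 3.2.
# `dwell_seconds` is a synthetic stand-in for the proprietary sample; on the
# paper's real data, both checks reject normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dwell_seconds = rng.lognormal(mean=4.5, sigma=0.8, size=5000)

log_dwell = np.log(dwell_seconds)
w, p = stats.shapiro(log_dwell[:500])  # Shapiro-Wilk suits modest sample sizes
print(f"W = {w:.4f}, p = {p:.4g}")     # a small p-value rejects normality

# Q-Q plot against the normal distribution (requires matplotlib):
# import matplotlib.pyplot as plt
# stats.probplot(log_dwell, dist="norm", plot=plt); plt.show()
```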
Since dwell time approximates the time users spend on an item, it is natural to assume that, given the same content quality, a longer news article would attract a longer average dwell time across all users. In order to demonstrate this behavior, we investigate this issue on text-heavy news articles³ and plot average dwell time per article versus article length in Figure 3, where the X-axis is the length of the article and the Y-axis is the average dwell time of that particular article over all users. In order to show things clearly, the dwell times and article lengths are binned into smaller buckets, where each point represents a bucket. We show the scatter-plot of the dwell time against the length of the article on different devices, namely desktop, tablet and mobile devices. The black line is a fitted linear line for a particular device type with the 0.95 confidence interval in the grey area.

Figure 3: The relationship between the average dwell time and the article length, where the X-axis is the binned article length and the Y-axis is the binned average dwell time.

From the figure, it is very clear that the length of the article has a good linear correlation with the average dwell time across devices. Also, matching our intuition, the average dwell time on desktop is longer for long articles, and the reading behavior on tablet and mobile devices is similar. Furthermore, the correlation becomes weaker when articles are very long: on desktop, when the article is longer than 1,000 words, the plot has large variance; this indicates that users may have run out of their time-budget to consume the complete long story. Although the high correlation between the length of articles and average dwell time naturally leads to using the length of articles as a feature to predict average dwell time, we point out, based on the observed data, that: (1) per-user dwell time (rather than binned average dwell time over all users) has little correlation with the article length; and (2) a long dwell time may not necessarily reflect that users are really interested in the article. In other words, content length alone can hardly explain the per-user per-item dwell time, and we need to be careful of the bias of dwell time based user engagement measurements towards long content stories. (We will revisit this issue in §3.4.)

³ For other content types such as slideshow and video, the content length could be the number of slides in the slideshow and the video clip's raw duration, respectively.

For slideshows, a natural assumption would be that the larger the number of photos/slides, the longer the average dwell time these items would receive over all users. We demonstrate the relationship between the number of photos and the average dwell time on slideshows in Figure 4. Again, we binned the number of photos and the average dwell time. It is clear that the correlation is not as strong as for the length of articles. For videos, we also observe similarly weak correlations between the duration of a video clip and its average dwell time.

Figure 4: The relationship between the average dwell time and the number of photos in a slideshow, where the X-axis is the binned number of photos and the Y-axis is the binned average dwell time.

3.3 Normalized Dwell Time
As may be obvious, users' consumption of content items varies by context. For example, in historical data, we found that users have on average less dwell time per article on mobile or tablet devices than on desktops. Also, users on average spend less time per slideshow than per article. Indeed, different content types, by their nature, result in different browsing behaviors; thus we would expect different dwell times among these content types. In order to extract comparable user engagement signals, we introduce the normalized user dwell time to handle users' different content consumption behaviors on different devices for personalization. The technique discussed here can also be used to blend multiple content sources (e.g., slideshows and articles) into a unified stream.
Although the distribution of users' per-item dwell time (from all users) for each content type is different, we found that each content type's distribution remains similar over a long time period. To demonstrate this observation, we further plot the log of dwell time of two important types of content, slideshows and videos, in Figure 5 and Figure 6, respectively. Similar to the article case, we do not report the absolute values for either type. However, the patterns are again obvious: in all these cases, the log of dwell time has a Gaussian-like distribution. Indeed, most of the dwell time distributions for the different content types on different device platforms all, surprisingly, share a similar pattern. The same conclusion holds for different lengths of the time period. Also, we can easily see that the peak of the log of dwell time is highest for videos, followed by articles and slideshows, which matches our intuitive understanding of these three types of content items.

Figure 5: The (un)normalized distribution of the log of dwell time for slideshows across different devices. The X-axis is the log of dwell time and the Y-axis is the counts (removed for proprietary reasons).

Figure 6: The (un)normalized distribution of the log of dwell time for videos across different devices. The X-axis is the log of dwell time and the Y-axis is the counts.

Thus, the basic idea is that for each consumed item, we would like to extract its dwell time based user engagement level such that it is comparable across different contexts (e.g., content types, devices, instrumentations, etc.). We do this by normalizing out the variance of the dwell time due to differences in context. In particular, we adopt the following procedure to normalize dwell time into a comparable space:
1. For each content consumption context C, collect the historical per-item time spent data and compute the mean µC and standard deviation σC, both in log space.
2. Given a new content item i's time spent ti in its context Ci, compute the z-value in log space: zi = (log(ti) − µCi) / σCi.
3. Compute the normalized dwell time of item i in the article space: ti,article = exp(µarticle + σarticle × zi).
In other words, all other types of items are now "comparable" after this transformation, and the normalized user engagement signals are then used for training recommendation models that handle different content types and can be deployed in different contexts.
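A compact sketch of this three-step procedure is shown below; the context keys, the reference "article" context name, and the synthetic historical data are all assumptions for illustration.

```python
# Sketch of the normalization procedure above: per-context z-scores in log
# space, mapped into the article context. Context keys and data are invented.
import numpy as np

def fit_context_stats(dwell_by_context):
    """dwell_by_context: {context: array of historical per-item dwell times}."""
    return {c: (np.log(t).mean(), np.log(t).std())  # step 1: mu_C, sigma_C
            for c, t in dwell_by_context.items()}

def normalize_to_article_space(t_i, context_i, ctx_stats, ref="article_desktop"):
    mu_c, sigma_c = ctx_stats[context_i]
    mu_a, sigma_a = ctx_stats[ref]
    z_i = (np.log(t_i) - mu_c) / sigma_c            # step 2: z-value in log space
    return float(np.exp(mu_a + sigma_a * z_i))      # step 3: back to seconds

rng = np.random.default_rng(1)
hist = {"article_desktop": rng.lognormal(4.8, 0.7, 10000),
        "video_mobile":    rng.lognormal(4.0, 1.0, 10000)}
ctx_stats = fit_context_stats(hist)
# 90 s watching a video on mobile, expressed as an "article-equivalent" dwell:
print(normalize_to_article_space(90.0, "video_mobile", ctx_stats))
```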
3.4 Predicting Dwell Time
The average dwell time of a content item can be viewed as one of the item's inherent characteristics, which provides important information on how much time an average user will spend on the item. Predicting the average dwell time for each content item can help label items when their dwell time is unavailable or missing. For example, content items that have never been shown to users (such as new items) will not have an available dwell time. As another example, a user's dwell time on her clicked story may not always be computable because there may be no subsequent server-side events from the same user. Therefore, leveraging predicted average dwell time can greatly improve the "coverage" (or alleviate the missing data issue). Not handling these situations could degrade the effectiveness of applying dwell time in personalization applications. In this sub-section, we present a machine learning method to predict the dwell time of article stories using simple features.

The features we consider are the topical category of the article and the context in which the article would be shown (e.g., desktop, tablet or mobile). We use Support Vector Regression (SVR)⁴ models to predict dwell time. The model is trained on a sample of user-article interaction data. We show the features and their corresponding weights in Table 3. Most features are categorical and we use log(Dwell Time) as the model response.

Table 3: Features and corresponding weights for predicted dwell time. The features are shown in order of magnitude of the weights. The left columns show positive weights and the right columns negative weights.
Name           | Weight | Name             | Weight
Desktop        | 1.280  | Apparel          | -0.001
Mobile         | 1.033  | Hobbies          | -0.010
Tablet         | 0.946  | Travel & Tourism | -0.039
Content Length | 0.218  | Technology       | -0.040
Transportation | 0.136  | Environment      | -0.065
Politics       | 0.130  | Beauty           | -0.094
Science        | 0.111  | Finance          | -0.151
Culture        | 0.100  | Food             | -0.173
Real Estate    | 0.088  | Entertainment    | -0.191

We can loosely interpret the weights of these features as how much each feature contributes to the article's average dwell time prediction. The feature weights match our expectations of average users' article reading behavior: longer articles lead to a higher predicted average dwell time; people spend longer reading articles on desktop devices than on mobile devices; and more serious topics lead users to dwell longer. Potentially, the predicted average dwell time could be leveraged to normalize the dwell time based user engagement signal (as discussed in §3.3); however, this is non-trivial, as the interplay between the dwell time features and users' experience is not obvious. For example, will recommending more serious topics that have a long average dwell time lead to a better or worse user experience? We leave answering this question for future work.

⁴ http://www.csie.ntu.edu.tw/~cjlin/liblinear/
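To illustrate the setup (not the authors' exact LIBLINEAR configuration), here is a sketch with scikit-learn's LinearSVR, which wraps the same liblinear solver family, on invented training rows.

```python
# Sketch: predicting log(dwell time) from device, topical category and article
# length, in the spirit of Sec. 3.4. The feature names and toy data are
# invented; LinearSVR stands in for the LIBLINEAR SVR the paper cites.
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR

train_x = [{"device": "Desktop", "category": "Politics", "length": 850},
           {"device": "Mobile",  "category": "Finance",  "length": 300},
           {"device": "Tablet",  "category": "Science",  "length": 600},
           {"device": "Desktop", "category": "Food",     "length": 400}]
train_y = np.log([140.0, 45.0, 95.0, 60.0])   # response: log(dwell time)

model = make_pipeline(DictVectorizer(sparse=False), LinearSVR())
model.fit(train_x, train_y)

# Predicted average dwell time (in seconds) for a never-shown article:
new_item = {"device": "Desktop", "category": "Science", "length": 700}
print(np.exp(model.predict([new_item])[0]))
```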
4. USE CASE I: LEARNING TO RANK
In this section, we investigate how to leverage item-level dwell time to train machine-learned ranking (MLR) models for content recommendation.

The Basic MLR Setting: In traditional MLR, a query q is represented as a feature vector q, while a document d is represented as a feature vector d. A function g takes these two feature vectors and outputs a feature vector xq,d = g(q, d) for the query-document pair (q, d). Note that g could be as simple as a concatenation. Each query-document pair has a response yq,d, which in traditional IR is usually the relevance judgment. Typically this judgment is common to all users, that is, there is no user-specific personalization. Depending on the particular paradigm (e.g., point-wise, pair-wise or list-wise), a machine-learned model imposes a loss function l which takes one or all documents belonging to a query q as input, approximating the individual relevance judgment, pair-wise relevance preferences or the whole list ordering. In the context of content recommendation, we can simply borrow the idea of MLR by treating user interests as queries and articles (or other types of items) as documents. Although this formulation looks promising, there are two challenges. One is how to construct a feature vector for queries (users), and the second is how to utilize user activities to infer relevancy between users and documents. The discussion of the first question is out of this paper's scope. Here, we focus on the second question. While the definition of relevance judgments might be unambiguous in IR, it is not straightforward in the context of content personalization. One cheap and easy approach is to use users' click-through data as relevance judgments. Essentially, in such a case, we use yd,u ∈ {0, 1}, a binary variable, to indicate whether an article d (the "document" in the IR setting) is clicked by the user u (the "query" in the IR setting). Under this formalism, an MLR model indeed optimizes CTR.

In this paper, we use the Gradient Boosted Decision Tree (GBDT) algorithm [6] to learn the ranking functions. GBDT is an additive regression algorithm consisting of an ensemble of trees, fitted to the current residuals, the gradients of the loss function, in a forward step-wise manner. It iteratively fits an additive model

    f_t(x) = f_{t−1}(x) + λ β_t T_t(x; Θ_t),  so that  f_T(x) = Σ_{t=1}^{T} λ β_t T_t(x; Θ_t)    (1)

such that a certain loss function L(y_i, f_T(x_i)) (e.g., square loss, logistic loss) is minimized, where T_t(x; Θ_t) is a tree at iteration t, weighted by a parameter β_t, with a finite number of parameters Θ_t, and λ is the learning rate. At iteration t, the tree T_t(x; Θ_t) is induced to fit the negative gradient by least squares. That is:

    Θ̂_t = arg min_Θ Σ_{i=1}^{N} w_i (−G_{it} − β_t T_t(x_i; Θ))²    (2)

where w_i is the weight for data instance i, which is usually set to 1, and G_{it} is the gradient of the loss over the current prediction function: G_{it} = [∂L(y_i, f(x_i)) / ∂f(x_i)] evaluated at f = f_{t−1}. The optimal weight β_t of the tree is determined by β_t = arg min_β Σ_{i=1}^{N} L(y_i, f_{t−1}(x_i) + β T(x_i; Θ)). For more details about GBDT, please refer to [20]. As mentioned above, if we use click/non-click as responses, we simply treat x_i = x_{q,d} and y_i = y_{d,u}. In fact, all previous research on MLR-based content recommendation systems has focused on using click-based information for training and evaluation. For example, Bian et al. [2] and Agarwal et al. [1] used users' click/view data in the Today module on Yahoo for optimizing CTR for content recommendation.
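For concreteness, here is a bare-bones version of the fitting loop in Eqs. (1)-(2) for squared loss, where the negative gradient reduces to the residual y − f(x). The per-tree weight β_t is folded into the fitted leaf values, and the instance weights w_i are exposed so that the dwell-time variant discussed next is one argument away. This is a didactic sketch, not the production GBDT.

```python
# Sketch: a minimal GBDT loop following Eqs. (1)-(2) for squared loss, where
# the negative gradient -G_it is the residual y_i - f(x_i). beta_t is folded
# into the fitted leaf values; w_i are the instance weights of Eq. (2).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, T=100, lam=0.1, w=None, max_depth=3):
    w = np.ones(len(y)) if w is None else w
    f = np.full(len(y), np.average(y, weights=w))   # f_0: (weighted) mean
    trees = []
    for _ in range(T):
        neg_grad = y - f                            # -G_t for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, neg_grad, sample_weight=w)      # Eq. (2): weighted LS fit
        f += lam * tree.predict(X)                  # Eq. (1): shrinkage step
        trees.append(tree)
    return trees, f

rng = np.random.default_rng(2)
X = rng.random((500, 5))
y = 3 * X[:, 0] + np.sin(5 * X[:, 1]) + 0.1 * rng.standard_normal(500)
trees, fitted = gbdt_fit(X, y)
print(np.mean((y - fitted) ** 2))   # training MSE shrinks as T grows
```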
Dwell Time for MLR: There are two intuitive ways to incorporate dwell time into MLR frameworks. Let γ_d be the average dwell time for article d. Taking the GBDT algorithm mentioned above, we could: 1) use the per-article dwell time as the response, treating y_i = h(γ_d); or 2) use the per-article dwell time as the weight for sample instances, treating w_i = h(γ_d), where the function h is a transformation of the dwell time. In both cases, we promote articles that have a high average dwell time and try to learn models that optimize for user engagement. In all our experiments, we found that h = log(x) yields the best performance.
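Both variants map directly onto any GBDT implementation that accepts instance weights. A sketch with scikit-learn's GradientBoostingRegressor on synthetic data follows; the feature matrix, click labels and dwell values are all invented, and log1p in the weight variant is our small tweak to keep weights positive.

```python
# Sketch of the two variants: (1) dwell time as the response y_i = h(gamma_d),
# (2) dwell time as the instance weight w_i on a click response. Data is
# synthetic; h = log is the transformation the authors found to work best
# (log1p in the weight variant keeps weights positive, our tweak).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 1000
x = rng.random((n, 8))                             # x_{q,d}: user-item features
clicks = (rng.random(n) < 0.3).astype(float)       # y_{d,u} in {0, 1}
dwell = np.where(clicks > 0, rng.lognormal(4.5, 0.8, n), 1.0)

# Variant 1: regress directly on the (log) dwell time.
target_model = GradientBoostingRegressor().fit(x, np.log(dwell))

# Variant 2: click response with dwell-weighted instances.
weight_model = GradientBoostingRegressor().fit(
    x, clicks, sample_weight=np.log1p(dwell))

# Rank a user's candidate items by predicted score, descending:
print(np.argsort(-weight_model.predict(x[:10])))
```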
We first show the effectiveness of the MLR model in an offline experiment. We use data from a bucket of traffic of a Yahoo property and split it uniformly at random into training and test sets, using a 70-30 split. We repeat this sampling multiple times, and the average results across all train-test splits are shown in Table 4. The first observation is that either method of using dwell time, as the learning target or as the instance weight, can improve three major ranking metrics. The second observation is that dwell time as an instance weight leads to the best performance.

We further validate these findings in online buckets, shown in Figure 7. Without disclosing absolute numbers, we show the same three buckets with respect to two types of performance metrics: 1) CTR (shown on the top) and 2) a user engagement metric (shown on the bottom). The user engagement metric is a proprietary one, which can be explained as the quality of users' engagement with Yahoo homepage's content stream. Each data point represents the metric on a particular day. We report the bucket metrics for a three-month period between June 2013 and August 2013. Initially, the three buckets were running the same linear model, and we can see from the first three data points (three-day data) that both CTR and the user engagement metric are similar. Then, we updated the models as follows: 1) A: a linear model that optimizes click/non-click, 2) B: a GBDT model that optimizes click, and 3) C: a GBDT model that optimizes dwell time.

Figure 7: The relative performance comparison between three buckets. The top figure shows the relative CTR difference and the bottom figure shows the relative user engagement difference.
5. USE CASE II: COLLABORATIVE FILTERING
In many content recommendation settings, users rarely provide explicit feedback. For instance, users come to Yahoo's homepage to consume news items such as articles, videos and slideshows by only browsing and clicking on particular items that they are interested in, without providing any explicit rating feedback, even though the user interface allows users to "like" or "dislike" an item. Therefore, users' feedback in this context is implicit.

In this paper, we propose to use users' dwell time, rather than asking them to give ratings or using click information, as implicit rating feedback. User feedback is represented as an M × N sparse matrix I, where M is the number of users and N is the number of items, and each entry in the matrix is one user feedback value, denoted r_{i,j}. For dwell time, r_{i,j} ∈ R, or [0, 6] for normalized dwell time; for click/view, r_{i,j} ∈ [0, 1]. Formally, we aim to predict the unobserved entries in the matrix based on the observed data. Rank-based matrix factorization is used. We decompose the sparse matrix I into U and V, minimizing the following objective function:

    arg min_{U,V} Σ_{i=1}^{M} Σ_{r_{i,j} < r_{i,k}} U_i (V_j − V_k)^T + λ(|U|² + |V|²)    (3)

where, for each user i, the inner sum ranges over item pairs (j, k) whose observed feedback satisfies r_{i,j} < r_{i,k}.
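Since the transcription does not spell out the optimizer, the following is one plausible stochastic-gradient sketch of a pairwise objective in the spirit of Eq. (3): for each user i and item pair with r_{i,j} < r_{i,k}, the score U_i V_j^T is pushed below U_i V_k^T through a smooth logistic surrogate. The surrogate, the adjacent-pair sampling and all hyperparameters are our assumptions.

```python
# Sketch: SGD on a pairwise rank-based MF objective in the spirit of Eq. (3).
# For r_ij < r_ik we push U_i (V_j - V_k)^T negative via the smooth surrogate
# log(1 + exp(d)); surrogate and hyperparameters are assumptions.
import numpy as np

def rank_mf(ratings, M, N, K=16, lr=0.05, lam=0.01, epochs=50, seed=0):
    """ratings: list of (user, item, r), r = normalized dwell time in [0, 6]."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((M, K))
    V = 0.1 * rng.standard_normal((N, K))
    by_user = {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
    for _ in range(epochs):
        for u, items in by_user.items():
            # adjacent pairs only, for brevity; all pairs in general
            for (j, rj), (k, rk) in zip(items, items[1:]):
                if rj == rk:
                    continue
                if rj > rk:                       # reorder so that r_j < r_k
                    j, k = k, j
                d = U[u] @ (V[j] - V[k])          # should become negative
                g = 1.0 / (1.0 + np.exp(-d))      # d/dd of log(1 + exp(d))
                uu = U[u].copy()
                U[u] -= lr * (g * (V[j] - V[k]) + lam * U[u])
                V[j] -= lr * (g * uu + lam * V[j])
                V[k] -= lr * (-g * uu + lam * V[k])
    return U, V

U, V = rank_mf([(0, 0, 5.1), (0, 1, 1.2), (1, 1, 4.0), (1, 2, 0.5)], M=2, N=3)
print(U @ V.T)   # predicted preference scores; rank items per user, descending
```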
The results are shown in Table 5, where the metrics are averaged across multiple weeks or days. We also vary the latent dimension K for both the click version and the dwell time version, and report only the best performance across different K values. We can observe that for all evaluation methods and all metrics, the model that optimizes dwell time has performance comparable to the one that optimizes clicks. In addition, the performance of the dwell time based models is consistently better than that of the click based ones. One plausible reason for the small overall improvement is that content features or user-side information may be needed to better predict dwell time based ratings. Deeper analysis of, and experimentation on, the benefit of dwell time based CF models is future work.

Table 5: Performance for Collaborative Filtering

Performance for Monthly Prediction
Signal               | MAP    | NDCG   | NDCG@10
Click as Target      | 0.3773 | 0.7439 | 0.7434
Dwell Time as Target | 0.3779 | 0.7457 | 0.7451

Performance for Weekly Prediction
Signal               | MAP    | NDCG   | NDCG@10
Click as Target      | 0.6275 | 0.5820 | 0.5813
Dwell Time as Target | 0.6287 | 0.5832 | 0.5826

Performance for Daily Prediction
Signal               | MAP    | NDCG   | NDCG@10
Click as Target      | 0.6275 | 0.5578 | 0.5570
Dwell Time as Target | 0.6648 | 0.5596 | 0.5589
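For reference, the metrics in Table 5 follow their standard definitions; a small sketch computing NDCG@k and average precision per user from relevance labels listed in predicted rank order (binarized relevance is assumed for MAP):

```python
# Sketch: standard MAP and NDCG, computed per user from relevance labels
# listed in the order the model ranked the items (then averaged over users).
import numpy as np

def dcg(rels):
    rels = np.asarray(rels, dtype=float)
    return np.sum((2.0 ** rels - 1) / np.log2(np.arange(2, len(rels) + 2)))

def ndcg_at(rels_in_rank_order, k=None):
    r = rels_in_rank_order[:k] if k else rels_in_rank_order
    ideal = sorted(rels_in_rank_order, reverse=True)[:len(r)]
    return dcg(r) / dcg(ideal) if dcg(ideal) > 0 else 0.0

def average_precision(binary_rels_in_rank_order):
    hits, score = 0, 0.0
    for rank, rel in enumerate(binary_rels_in_rank_order, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / max(hits, 1)

rels = [3, 0, 2, 0, 1]   # labels of items in predicted rank order
print(ndcg_at(rels, k=10), average_precision([r > 0 for r in rels]))
```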
6. DISCUSSION AND CONCLUSIONS
In this paper, we demonstrated how dwell time can be computed from large-scale web logs and how it can be incorporated into a personalized recommendation system. Several approaches were proposed for accurately computing item-level user content consumption time from both client-side and server-side logging data. In addition, we exploited the dwell time distributions of different content types to normalize users' engagement signals into the same space. For MLR, we proposed using per-user per-item dwell time as the learning target and demonstrated that it can result in better performance. For CF, we used dwell time as a form of implicit feedback from users and demonstrated how it can be incorporated into a state-of-the-art matrix factorization model, yielding competitive and even better performance than the click-optimized counterpart.

For future work, we would like to design dwell time based user engagement metrics and explore how to optimize these metrics directly. We would also like to investigate better ways to normalize dwell time. This will enable us to extract better user engagement signals for training recommendation systems, thereby optimizing for long-term user satisfaction.

7. REFERENCES
[1] D. Agarwal, B.-C. Chen, P. Elango, and R. Ramakrishnan. Content recommendation on web portals. Communications of the ACM, 56(6):92–101, June 2013.
[2] J. Bian, A. Dong, X. He, S. Reddy, and Y. Chang. User action interpretation for online content optimization. IEEE TKDE, 25(9):2161–2174, Sept. 2013.
[3] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of ICML, pages 89–96, 2005.
[4] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of ICML, pages 129–136, 2007.
[5] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933–969, Dec. 2003.
[6] J. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, Feb. 2002.
[7] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of ICDM, pages 263–272, 2008.
[8] Y. Kim, A. Hassan, R. W. White, and I. Zitouni. Modeling dwell time to predict click-level satisfaction. In Proceedings of WSDM, pages 193–202, 2014.
[9] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009.
[10] H. Li. Learning to rank for information retrieval and natural language processing. Synthesis Lectures on Human Language Technologies, 4(1):1–113, 2011.
[11] C. Liu, J. Liu, N. Belkin, M. Cole, and J. Gwizdka. Using dwell time as an implicit measure of usefulness in different task types. Proceedings of the American Society for Information Science and Technology, 48(1):1–4, 2011.
[12] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
[13] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of UAI, pages 452–461, 2009.
[14] S. S. Shapiro and M. B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3-4):591–611, 1965.
[15] R. W. White and D. Kelly. A study on the effects of personalization and task information on implicit feedback performance. In Proceedings of CIKM, pages 297–306, 2006.
[16] J. Xu and H. Li. AdaRank: A boosting algorithm for information retrieval. In Proceedings of SIGIR, pages 391–398, 2007.
[17] S. Xu, H. Jiang, and F. C. M. Lau. Mining user dwell time for personalized web search re-ranking. In Proceedings of IJCAI, pages 2367–2372, 2011.
[18] S. Xu, Y. Zhu, H. Jiang, and F. C. M. Lau. A user-oriented webpage ranking algorithm based on user attention time. In Proceedings of AAAI, pages 1255–1260, 2008.
[19] P. Yin, P. Luo, W.-C. Lee, and M. Wang. Silence is also evidence: Interpreting dwell time for recommendation from psychological perspective. In Proceedings of SIGKDD, pages 989–997, 2013.
[20] Z. Zheng, K. Chen, G. Sun, and H. Zha. A regression framework for learning ranking functions using relative relevance judgments. In Proceedings of SIGIR, pages 287–294, 2007.