GetJar Mobile Application Recommendations with Very Sparse Datasets
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
GetJar Mobile Application Recommendations with Very Sparse Datasets Kent Shi Kamal Ali GetJar Inc. GetJar Inc. San Mateo, CA, USA San Mateo, CA, USA kent@getjar.com kamal@getjar.com ABSTRACT Keywords The Netflix competition of 2006 [2] has spurred significant Recommender system, mobile application, evaluation, sparse activity in the recommendations field, particularly in ap- data, PCA proaches using latent factor models [3, 5, 8, 12]. However, the near ubiquity of the Netflix and the similar MovieLens datasets1 may be narrowing the generality of lessons learned 1. INTRODUCTION in this field. At GetJar, our goal is to make appealing rec- In the last few years, there has been a tremendous amount ommendations of mobile applications (apps). For app usage, of growth in the mobile app space, particularly in the An- we observe a distribution that has higher kurtosis (heavier droid platform. As of January 2012, there are more than head and longer tail) than that for the aforementioned movie 400,000 apps hosted on Google’s app store:2 Google Play datasets. This happens primarily because of the large dis- (formerly known as Android Market). However, Google Play parity in resources available to app developers and the low provides little personalization beyond location-based tailor- cost of app publication relative to movies. ing of catalogs. That means all users from a given country In this paper we compare a latent factor (PureSVD) and will see the same list of apps regardless of their tastes and a memory-based model with our novel PCA-based model, preferences. which we call Eigenapp. We use both accuracy and variety Since most users typically navigate no more than a few as evaluation metrics. PureSVD did not perform well due pages when browsing the store, lack of personalization lim- to its reliance on explicit feedback such as ratings, which we its exposure for the majority of the apps. By analyzing do not have. Memory-based approaches that perform vec- the usage of apps on a sample of devices, we find that this tor operations in the original high dimensional space over- space is dominated by a few apps, which unsurprisingly are predict popular apps because they fail to capture the neigh- ones that have been “featured” recently on the front page of borhood of less popular apps. They have high accuracy due Google Play. to the concentration of mass in the head, but did poorly GetJar, founded in 2004, is the largest free app store in in terms of variety of apps exposed. Eigenapp, which ex- the world. It provides mobile apps to users of all mobile ploits neighborhood information in low dimensional spaces, platforms. We have recently begun to focus on the Android did well both on precision and variety, underscoring the im- platform due to its openness and surging market share. Our portance of dimensionality reduction to form quality neigh- goal is to become an attractive destination for Android apps borhoods in high kurtosis distributions. by providing high quality personalization as a means to app discovery. Categories and Subject Descriptors 1.1 Challenges H.2.8 [Database Management]: Database Applications— While recommendations techniques, especially those using Data mining; H.3.3 [Information Storage and Retrieval]: collaborative filtering, have been common since the early Information Search and Retrieval—Information filtering 1990s [6] and have been deployed on a number of e-commerce websites such as Amazon.com [9], recommendation in the General Terms emerging app domain is a task beset by unique challenges Algorithms, Experimentation, Performance mainly due to the greater kurtosis in the distribution of app usage data. 1 http://www.grouplens.org/node/73 From anonymous usage data collected at GetJar, we find that there are a few well-known apps popular among a large number of users, but the vast majority of apps are rarely Permission to make digital or hard copies of all or part of this work for used by most users. Figure 1(a) shows a comparison of the personal or classroom use is granted without fee provided that copies are data distribution between the movie (Netflix) and app (Get- not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy Jar) domains. Note the plot only includes apps that have otherwise, or republish, to post on servers or to redistribute to lists, been recently used by GetJar users. This constitutes approx- requires prior specific permission and/or a fee. imately 55,000 apps, or about 14% of all apps. The movie KDD’12, August 12–16, 2012, Beijing, China. 2 Copyright 2012 ACM 978-1-4503-1462-6/12/08... $15.00. http://www.distimo.com/blog/2012_01_ google-android-market-tops-400000-applications 204
100% 100% GetJar GetJar Netflix Netflix 10% 10% Percent of Users Percent of Users 1% 1% 0.1% 0.1% 0.01% 0.01% 0 20% 40% 60% 80% 100% 0.01% 0.1% 1% 10% 100% Percent of Items Percent of Items (a) (b) Figure 1: (a) Distribution of items (GetJar apps or Netflix movies) in terms of percentage of total users, with items sorted by popularity. (b) Distributions of items plotted in log-log scale. at the first percentile (rank 177) is rated by 20% of Net- app stores rely on developers to categorize their own apps flix users. In contrast, the app at the first percentile (rank using a fixed inventory of labels. This leads to a small num- 550) is used only by 0.6% of GetJar users. Furthermore, ber of categories and a large number of apps within each, the movie at the first percentile has 42% as many users as causing only the top few apps in each category to ever have the most popular movie, but app at the first percentile has significant visibility. Search is also ineffective because we only 1.3% as many users as the most popular app. There- find that most users don’t know what to search for. About fore, even though there are over 400,000 available apps, in 90% of search queries at GetJar are titles (or close variants) reality only a few thousand of them are being used in any of popular apps, which means search currently is not being significant sense. used as an effective vehicle to discover new apps. The same data is plotted in Figure 1(b), this time using a log scale for both axes. We can see that the GetJar curve is almost a straight line in log-log space, indicating that the 1.2 Goal and evaluation criteria frequencies can be approximated by a Zipf distribution[17]. Users visit GetJar hoping to find interesting and useful This figure definitively shows the qualitative difference in dis- apps. But as we have seen, common strategies such as tribution: App distribution is linear in log-log space whereas browsing and searching, which have worked well for other movie distribution isn’t. Traditional collaborative filtering e-commerce sites don’t work as well in domains where many techniques [9, 14] or even the newer latent factor models [3, items remain under-publicized. Our goal is to use personal- 5, 8, 12, 13] were not designed to handle this level of sparsity. ization to help users find a greater variety of appealing apps. There are at least three reasons for this difference. First, Our prototype recommendation system recommends a top- the disparity in available resources among the app develop- N list of apps to each user based on her recent app usage. ers is larger than that of movie producers. This is mainly We judge the quality of the recommendations primarily by due to the cost (time and money) of publishing apps being accuracy, which represents the ability of the recommender much lower than that for releasing movies. Second, due to to predict the presence of an app on the user’s device. To the less mature nature of the smart phone space, most ca- increase the exposure of under-publicized apps, the recom- sual users are unaware of the full capabilities of their device mender is also evaluated on its ability to recommend tail or what apps are available for it. This is in contrast to other apps as well as the variety of the apps it recommend. domains such as movies, where there are numerous outlets A number of app stores currently offer personalized app dedicated to reviewing or promoting those products. Third, recommendations, most notably the Apple App Store and discovery mechanisms in the app space are less effective and the Amazon Appstore. However, little is known about how mature compared to those of other domains. they generate their recommendations. Furthermore, we are Today, most app stores offer three ways for users to dis- unaware of any publications on mobile app recommenda- cover apps: (1) Listings of apps sorted by the number of tions. downloads or a similar trending metric, (2) Category-based The rest of the paper is organized as follows: Section 2 browsing and (3) Keyword-based searching. We know that will review how the data was collected and some of its prop- the number of apps that can be exposed using listings is erties; Section 3 will provide details of the algorithms that limited, and that methods 2 and 3 are not as effective as we we considered; Section 4 will provide the experimental setup would like. Browsing by category is only useful if the ontol- and results; and finally Sections 5 and 6 provide discussion ogy of categories is rich, as in the case of Amazon. But most and conclusions. 205
2. THE GETJAR DATA 100% GetJar The data we report upon in this paper comes from server Netflix log files at GetJar where all personally identifying infor- 80% mation had been stripped, but information pertaining to Percent of Total Rating/Usage a single source can be uniquely identified up to a common anonymous identifier. The apps we report here include those 60% hosted on GetJar as well as those on Google Play. For the purposes of this study, we rely upon app usage data rather than installation data. The reason we choose 40% not to use installation data is that it is a poor indicator of interest since many app installations are experimental from a user’s perspective. A significant fraction of our users are 20% found to uninstall an app on the same day as they installed it. Also, there is another significant fraction of users that have a vast number of installed apps that never get used. Many users are new to the mobile app space and thus are 0 likely experimenting with a variety of apps. We restrict our 0.01% 0.1% 1% 10% 100% data to recent app usage to account for this fact, that users’ tastes for apps can change more rapidly than for traditional Percent of Items domains such as movies and music. We are only interested in recommending apps that reflect their current tastes or Figure 2: Cumulative distribution of items in terms interests. of percentage of total usage, the curves can be The observation period for data used for this study is from viewed as the integral of the curves in Figure 1. November 7 to November 21, 2011. We find that varying length of the observation period by a few days makes almost Dataset Users Items Usages/Ratings Density no difference in the number of apps used by the users.3 In an GetJar 101,106 55,020 1.99M 0.04% effort to reduce noise in the data from apps that were being GetJar* 101,031 7,304 1.82M 0.25% trialed by users, we filtered out apps that were not used other Netflix 480,189 17,770 100M 1.18% than on the day of installation. We further cleaned the data by removing users that joined or left midway during the Table 1: Size of user-item matrices for Netflix and observation period and those that were not associated with GetJar dataset. GetJar* denotes the GetJar dataset a pre-determined list of legitimate devices. The resultant including only apps that have been used by more dataset contains 101,106 users. For each user we used the list than 20 users. of apps and the number of days each app was used during the observation period. The total number of unique apps used by all users during the interval satisfying our constraints was 55,020. Jar dataset, as previously alluded to, is primarily due to the low cost of publishing apps compared to the cost of releas- 2.1 Data sparsity and long tail ing a movie. This encourages developers to release as many As we have already illustrated in Figure 1, our data is apps as possible to increase the chances of their apps being extremely sparse and that the vast majority of apps have discovered by search. This strategy often leads to apps be- low usage. While it is well known that sparsity and a long ing published multiple times with different titles but similar tail [1] are two characteristics of all e-commerce data, these functionalities. This also encourages the proliferation of a are especially pronounced in our dataset. large number of apps tailored for very specific needs (e.g. Figure 2 plots the cumulative distribution of the items ringtone apps dedicated to music by specific artists) as op- in terms of the total amount of usage. We can see that posed to general apps (e.g. a single ringtone app containing the GetJar dataset is far more head-heavy compared to the music by all artists). Netflix dataset, with the top 1% of apps accounting for 58% Given that we have little or no usage information on the of usage in contrast to Netflix where the top 1% of movies bulk of the tail apps, it makes recommending them a very contribute to 22% of all ratings. An even more selective difficult task. In order to ensure that the recommended apps subset - the 100 most popular apps - account for 30% of will have certain amount of support, for this study, we lim- total app usage. For the GetJar dataset, we define the head ited our app population by only including apps with more to be the top 100 apps and the remaining apps to be the than 20 users. This reduces the number of apps from 55,030 tail. to 7,304. Even though this pruning process removed 87% of One major reason for this difference is that many apps are apps (or 98% if we include apps with no usage), it is note- used every day, but movies are seldom watched more than worthy that only 9% of the total usage was thus eliminated once or twice. Thus Netflix users may be more likely to ex- from our modeling. Table 1 shows the size and density of the plore new items relative to GetJar users. Another reason user-item matrices before and after our pruning. It shows is that the Netflix data was collected over a much longer that even after rejecting the bottom 87% of the apps, the period of time. The reason for the longer tail in the Get- GetJar* dataset is still much sparser relative to Netflix. 3 We use the more convenient word users to denote their 2.2 Usage versus ratings anonymized identifiers. Another difference between the GetJar dataset and the 206
Netflix dataset is that movie ratings is an explicit feedback Number of Common Users Dataset for interest whereas days of usage is implicit [11]. The ben- 0 1 2-10 11-20 > 20 efit of an explicit rating system is that it is well-defined and GetJar* 83.2% 9.1% 6.6% 0.6% 0.6% standardized, thus generating a more accurate measurement Netflix 0.2% 0.4% 33.8% 22.2% 43.3% of interest compared to implicit feedbacks such as days of usage. The latter can be influenced by a number of factors Table 2: Breakdown of number of common users for such as mood, unforeseen events, or logging errors. Fur- the GetJar and Netflix datasets. For n items, the 2 thermore, there is also correlation between usage and cate- total number of item pairs is n 2−n . gory - we find that “social” apps are consistently the most heavily used apps among nearly all users. This is because “social” apps need to be used often in order to serve their purpose, but apps in categories such as “productivity” are items in user space or that of users in item space. A user- seldom needed on a continuous basis. So while it is safe to user or item-item similarity matrix is computed for pairs assume that a user enjoyed a movie that she rated highly and recommendations are generated based on these similari- relatively to one rated lowly, the same cannot be said for ties. Latent factor models are more sophisticated approaches a user that used a “social” app more than a “productivity” where the user-item matrix is decomposed via matrix fac- app. torization techniques such as Singular Value Decomposition We choose not to use ratings because it has a number of (SVD). Latent factors are then extracted and used to gen- drawbacks in the mobile app domain. Most importantly, it erate predictions. is very difficult to collect for a large number of users with- We evaluated both the above approaches using our data. out forceful intervention. Furthermore, since users’ taste in In addition, we developed a hybrid system using Princi- apps may change and many app developers frequently up- pal Components Analysis (PCA) which we call Eigenapp. date their apps with new features or functionalities, ratings These three algorithms were also compared against a non- may become obsolete in as little as one month. Finally, ob- personalized baseline recommendation system that serves serving ratings on Google Play, we find they are polarized, the most popular items. with the vast majority of ratings being either 1 or 5. This is likely due to fragmentation of the Android platform,4 re- 3.1 Non-personalized models sulting in most ratings being given based on whether the Non-personalized models are those that serve the same app worked (5) or not (1) for the user. list of items to all users. They commonly sort items by Due to the influence of the Netflix competition, most re- the number of purchases, profit margin, click-through rate search in the recommendations community has been geared (CTR), or other similar metrics. In this paper, our non- toward rating prediction by means of minimizing root mean personalized baseline algorithm sorts items by popularity, square error (RMSE). However, Cremonesi et. al [3] re- where popularity is defined as the number of distinct users ported that improving RMSE does not translate into im- that have used the item during the observation period. provement in accuracy for the top-N task. On the Netflix and MovieLens datasets, the predictive accuracy of a naive 3.2 Memory-based models most popular list is comparable to those by sophisticated There are two types of memory-based models: Item-based algorithms optimized for RMSE. We tried the same using and user-based. Item-based models find similarities between the GetJar dataset but substituting days of usage for rat- items, and for a given user they recommend items that are ings, and found that algorithms optimized for RMSE actu- similar to items she already owns. User-based models find ally performed far worse than a simple most popular list. similarities between users, and for a given user they recom- With that said, days of usage can still be used for neigh- mend items owned by her most similar users. borhood approaches, provided that there still exists some Computationally, item-based models are more scalable be- correlation between it and interest. A part of this study is cause there are usually far fewer items than users, as is the to evaluate the usefulness of this metric. Thus, for our ex- case in the mobile app space. In addition, there is research periments, we used two versions of the user-item matrix. In showing that item-based algorithms generally perform bet- the first version, each cell represents the number of days the ter than user-based algorithms [9, 14]. Hence, our memory- app was used, and in the second, each cell is a binary indi- based model uses the item-based approach. cator of usage during the observation period. We’d like to Two of the most common neighborhood similarity met- see if the additional granularity provided by the days of us- rics in current use are the Pearson correlation coefficient age will generate better recommendations than when using and cosine similarity. The Pearson correlation coefficient is a binary indicator. computed for a pair of items based on the set of users that have used both. Since the vast majority of our items reside 3. MODELS in the long tail, many of those items are unlikely to share common users with most other items. Two common recommendation approaches in use today Table 2 presents the distribution of number of common are those using memory-based models and latent factor mod- users in the GetJar and Netflix datasets. The table shows els. Memory-based models leverage the neighborhood of that 83.2% of item pairs in the GetJar dataset have zero 4 users in common, whereas that same percentage for Net- There are many manufacturers that produce Android de- flix is 0.2%. For GetJar, more than 90% of item pairs have vices with various hardware specifications and tweaks of the operating system. This makes it difficult for developers to to one or no common users. Thus it is impossible to compute test their apps on all devices, resulting in apps not working correlations for these item pairs. In addition, the vast ma- as intended on many devices. jority of the remaining item pairs share 10 or fewer users, 207
meaning that the sample correlation estimate is likely to be Examples of this approach include [5, 8, 12, 13]. We inaccurate due to poor support. In contrast, the published tried [5] and [13] by substituting days of usage for ratings, Netflix dataset has less than 1% of movie pairs sharing 1 and then sorting the predictions to generate a top-N rec- or fewer common users and about 65% of movie pairs share ommended list. The results were by far the worst of all more than 10 common users. Since the Pearson correlation algorithms, for reasons explained in Section 2.2. We expect coefficient is undefined for 90% of our item pairs, we will use similar results for other rating prediction based algorithms. cosine similarity. The only latent factor top-N algorithm we are aware of Let R denote the m × n user-item matrix where m is the is PureSVD [3]. The algorithm works by replacing all miss- number of users and n is the number of items. From R, we ing values (those with no ratings) in R with 0, and then compute an item-item similarity matrix S, whose (i, j) entry factorizing R via SVD: is: R = U · Σ · VT (4) r∗,i · r∗,j si,j = (1) r∗,i 2 · r∗,j 2 Then affinity between user u and item i can be computed by: where r∗,i and r∗,j are the ith and jth columns respectively of R. Cosine similarity does not require items to share com- tu,i = ru,∗ · Q · qTi (5) mon users. In such case it will simply produce a similarity where Q stands for the top k singular vectors extracted from of 0. However, it still suffers from low overlap support. The V and qi is the row in Q corresponding to item i. Note that closest neighbors for a less popular item will often occur tu,i is simply an association measure and not a predicted by coincidence simply because they are the only ones that rating. A top-N list can then be made for user u by selecting produced non-zero similarity scores. the N items with the highest affinity score to u. Using S, the affinity tu,i between user u and item i is the PureSVD is the only latent factor algorithm we evaluated sum of similarities between i and items used by u: that was able to generate reasonable recommendations. The main reason for this is that, unlike the other algorithms, tu,i = si,j (2) PureSVD is not optimized for RMSE based rating prediction j∈Iu but rather the relative ordering of items produced by the where Iu is the set of items used by u. For a given user, all association scores. items are sorted by their affinity score in order to produce a top-N list.5 3.4 Eigenapp model We made two slight modifications to the above method Of the two previously mentioned approaches, memory- that produced better results. First, the item-item similarity based models yielded far better results despite only hav- scores si,j were normalized before being used in equation (2). ing neighborhoods for popular items. We want to improve Deshpande et al. [4] suggested using a normalization such the result of memory-based models by borrowing ideas from that the sum of the similarities add up to 1. However, we the latent factor models. Along these lines, we used dimen- found that normalizing using z-score worked much better for sionality reduction techniques to extract meaningful features the GetJar dataset, producing the asymmetric similarity: from the items and then applied memory-based techniques to generate recommendations in this reduced space. si,j − s∗,j si,j = (3) Our neighborhood is still item-based, but items are now σs∗,j represented using features instead of users. Similar to [3], we also replace all missing values in R with 0. Given the where s∗,j is the average similarity to item j and σs∗,j is the large disparity in app frequencies, we normalized the item standard deviation of similarities to item j. Second, for each vectors to prevent the features from being based on only item candidate i, instead of summing over all items in Iu , popular items. This is done by normalizingeach column in we only considered the l nearest items, which are those with the greatest normalized similarly scores to i. This has the R to have zero mean and length of one: 2 u ru,i = 0 and effect of noise reduction by discarding weakly related items u ru,i = 1. We denote this new normalized user-item to the given i. For the GetJar dataset, we find that setting matrix as R and apply PCA to R for feature extraction. l = 5 seemed to work the best. PCA is performed via eigen decomposition of the covari- ance matrix C. C is computed by first calculating the mean 3.3 Latent factor models item vector b with bu = n1 i ru,i . Then remove the mean Latent factor models work by factorizing the user-item by forming matrix A where each cell au,i = ru,i − bu and T matrix R into two lower rank matrices: user factors and item finally compute C = AA . Note that C is an m × m matrix factors. These models are often used for rating predictions, with the number of users m likely to be a very large num- where a rating ru,i for user u on item i can be predicted ber. This makes eigen decomposition practically impossible by taking the inner product of their respective vectors in in time and space. Observing that the number of items n the user factors and item factors. User bias and item bias is likely to be much lower, we used the same procedure as are commonly removed by subtracting the row and column in Eigenface [16] to optimize the process. The procedure means from R prior to the factorization step. The biases works by first conducting eigen decomposition on the n × n are added back on to the inner product to generate the final matrix AT A obtaining eigenvectors v∗ and eigenvalues λ∗ prediction. such that for each j: 5 AT Avj∗ = λ∗j vj∗ (6) In equation (2), users that use a greater number of items will have more summands, but since we’re only interested Multiplying both sides by A, we get: in the relative order of items for a given user, the varying number of summands does not pose a problem. AAT (Avj∗ ) = λ∗j (Avj∗ ) (7) 208
POP POP 0.020 0.25 MEM BIN MEM BIN MEM DAY MEM DAY PureSVD BIN PureSVD BIN PureSVD DAY PureSVD DAY 0.20 Eigenapp BIN Eigenapp BIN 0.015 Eigenapp DAY Eigenapp DAY Precision 0.15 Recall 0.010 0.10 0.005 0.05 0.000 0.00 0.0 0.2 0.4 0.6 0.8 1.0 0 10 20 30 40 50 Recall N (a) Precision-Recall (b) Recall at N Figure 3: (a) Precision-recall curves and (b) Recall at N curves using all users in the test set. We see that vectors vj = Avj∗ are the eigenvectors for C. the number of apps with some minimum amount of usage is From there, we normalize each vj to length one and keep unlikely to increase significantly with more users, we do not only the k eigenvectors with the highest corresponding eigen- believe this will pose a problem. values. The eigenvectors represent the dimensions with the Eigenapp is similar to another PCA based algorithm Eigen- largest variances, or the dimensions that can best differenti- taste [7]. The main difference is that Eigentaste, which was ate the items. Alternatively, these eigenvectors can also be evaluated on the Jester joke dataset,7 requires a gauge item viewed as item features, items with similar projected values set where every user must have rated every item in the gauge on a particular eigenvector are likely to be similar in cer- set. Coming up with such a gauge set is impossible for our tain attributes. We will denote these eigenvectors as eige- application, much less one that is representative. In addi- napps. Finally, we can project all the items onto the reduced tion, Eigentaste uses a user-based neighborhood approach to eigenspace by D = vA. D is a k × n matrix, where each col- generate recommendations, whereas Eigenapp utilizes item- umn contains the projected values of the item onto each of based neighborhoods. the eigenapps. The values can be viewed as the coefficients or weights of the eigenapp for the items. By observing sev- eral rows in D, apps with high projected values in these 4. EVALUATION eigenapps are often similar types of apps. This was useful in We evaluated the four types of models from Section 3: preliminary validation showing that the Eigenapp approach Non-personalized (POP), Memory-based (MEM), PureSVD indeed captured latent item features. and Eigenapp, using the GetJar dataset. The experiment is Item-item similarities can be computed using equation (1) set up by randomly dividing the users into five equal sized except that we use D instead of R. Since D is dense, groups. Four of the groups are used for training, and the similarity scores will likely be non-zero for all item pairs. remaining one for evaluation. Using the training set, we Once the item-item similarity matrix S has been computed, compute the item-item similarity matrix S for MEM and the remainder of the algorithm is identical to the memory- Eigenapp, item factor matrix Q for PureSVD, and the list based algorithm described in Section 3.2. We find that of most popular items for POP. The number of eigenvectors the computed neighborhood in the reduced eigenspace is of used for Eigenapp and number of singular vectors used for much better quality compared to the one computed using PureSVD are both 300. For each user in the test set, we sort the memory-based methods in the non-reduced space. How- the apps by install time. We feed the first M − 1 apps to the ever, neighborhood quality is still better for popular items model to generate its recommendation list of N apps. Then than for less popular items, likely due to better support. We we check if the left out app is in the recommended list (all also find that the quality of neighborhood improves when we algorithms make sure to exclude from their recommendation increase the number of eigenapps used, and that the neigh- list the M − 1 apps known to already be installed for the borhood becomes relatively stable after k = 200. given user). This procedure is repeated on all 5 possible ways The computation complexity of this algorithm, up to gen- of dividing the user groups, allowing every group to be used erating S, is O(mn2 ). Using the current GetJar dataset, that as the evaluation group once, and thus a recommendation process took about 11 minutes on an Intel Core i7 machine list for every user exists. using the Eigen library.6 However, since the computation Two forms of user-item matrix R were considered for the of S is the offline phase of the recommender system, and experiments, as described in Section 2.2. The first version 6 7 http://eigen.tuxfamily.org http://eigentaste.berkeley.edu/dataset 209
POP POP MEM BIN MEM BIN 0.20 0.015 MEM DAY MEM DAY PureSVD BIN PureSVD BIN PureSVD DAY PureSVD DAY Eigenapp BIN Eigenapp BIN 0.15 Eigenapp DAY Eigenapp DAY Precision 0.010 Recall 0.10 0.005 0.05 0.000 0.00 0.0 0.2 0.4 0.6 0.8 1.0 0 10 20 30 40 50 Recall N (a) Precision-Recall (b) Recall at N Figure 4: (a) Precision-recall curves and (b) Recall at N curves after removing the 100 most popular items. using days of usage will be denoted as DAY, and the bina- 4.2 Accuracy of less popular items rized version will be denoted as BIN. Given the overwhelming exposure popular apps receive Accuracy is the first evaluation criterion we used because today in the Android ecosystem, many users will use them we want our recommendations to be relevant to user’s inter- simply because those are the only apps they know. Thus est and preferences. However, user satisfaction is not solely using a popular app may not be a strong indicator of in- dependent on accuracy [10]. In particular, given the domi- terest relative to less popular apps. In order to measure nance of the popular apps in this domain, it is important to precision and recall on the “tail”, we redrew the precision- expose apps in the tail. With that in mind, we also evalu- recall curves by excluding the 100 most popular apps from ated the accuracy of the models in recommending tail apps, the recommended list of each user. Note therefore, that hu and the variety of the apps recommended. will always be 0 for users whose relevant items are among the 100 most popular apps. Thus those users were removed 4.1 Accuracy for this experiment. The accuracies of the models were evaluated by the stan- Figure 4(a) shows the precision-recall curves after remov- dard precision-recall methodology. Since we have only one ing the 100 most popular items. The figure shows that Eige- relevant item to be predicted for each user (the left out app), napp has the highest accuracy for this tail subset. MEM is we set hu equal to 1 if the relevant item is in the top-N list now second, followed by PureSVD and POP. Recall at N for user u and 0 otherwise. Precision and recall at each N shown in Figure 4(b) shows a similar picture, but it is worth is computed by: noting that relative to Figure 3(b), recall dropped for every m algorithm with the exception of PureSVD. This shows it is u=1 hu precision(N ) = (8) more difficult to recommend relevant tail apps than head m m ·N apps. u=1 hu recall(N ) = (9) Using the two types of user-item matrix (BIN and DAY) m still achieved similar performance for all three algorithms, where m is the number of users. but it appears Eigenapp and PureSVD yielded slightly bet- Figure 3(a) illustrates the precision-recall curves for the ter results using BIN compared to DAY. algorithms . As we can see, the best performer was MEM de- spite using an item-item similarity matrix consisting mostly of zeros. A close second was Eigenapp, followed by POP 4.3 Presentation and PureSVD. Figure 3(b) illustrates the recall at each N , The impression that the recommended list makes to the up to N = 50. This figure shows the percentage of users user is also important to their satisfaction [10]. An artifact whose missing app was identified in the top-N. When N is of our methodology for predicting the left-out item means 10, MEM identified the missing app for about 11% of users, that we penalize algorithms for predicting items that the Eigenapp identified the missing app for about 10% of users, user may have liked had she known about them. Since it and POP and PureSVD identified the missing app for about is impossible for us to know which of the “irrelevant” items 7% and 4% of users respectively. (those that do not correspond to the left out item) in the The two types of user-item matrix (BIN and DAY) made top-N are potentially interesting ones, we can only judge the little difference in the global accuracy of any of the three al- diversity of items that are presented. In this study, we are gorithms. Indicating that the additional signals contributed interested in recommending a diverse list of apps from all by number of days of usage do not outweigh its inaccuracies. popularity spectrums. 210
Popularity Rank 10 Algorithm 1-50 51-100 101-500 501-1000 >1000 POP 100% 0 0 0 0 MM BIN 85% 5% 6% 2% 2% 8 MM DAY 80% 6% 8% 3% 4% PS BIN
list of apps. This is because all item vectors are normalized 8. REFERENCES prior to applying PCA, thus usage of less popular apps can [1] C. Anderson. The Long Tail: Why the Future of be captured by the top eigenvectors. That makes it possible Business Is Selling Less of More. Hyperion, 2006. for the less popular apps to be among the closest neigh- [2] J. Bennett and S. Lanning. The netflix prize. In bors of the popular apps. This is particularly important for Proceedings of KDD Cup and Workshop, pages 3–6, exposure of the less popular apps, because given the domi- 2007. nance of the popular apps, only apps that are close to one of [3] P. Cremonesi, Y. Koren, and R. Turrin. Performance the popular apps can make frequent appearances at the top of recommender algorithms on top-n recommendation of the recommended lists. Using traditional memory-based tasks. In Proceedings of the fourth ACM conference on models, the popular apps form a tight cluster (relative to Recommender systems, RecSys ’10, pages 39–46, New the less popular apps) in its neighborhood, thus making it York, NY, USA, 2010. ACM. difficult for less popular apps to surface to the top of the recommended lists for many users. [4] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143–177, Jan. 2004. 6. CONCLUSION [5] S. Funk. Netflix update: Try this at home. With increasing numbers of people switching to smart http://sifter.org/˜simon/journal/20061211.html, 2006. phones, the mobile application space is an emerging domain [6] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. for recommendation systems. Due to the wide disparity in Using collaborative filtering to weave an information resources among app publishers, the apps that large compa- tapestry. Commun. ACM, 35(12):61–70, Dec. 1992. nies develop receive far more exposure than those developed [7] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. by individual developers. This results in app usage being Eigentaste: A constant time collaborative filtering dominated by a few popular apps. The problem is further algorithm. Inf. Retr., 4(2):133–151, July 2001. exacerbated by existing apps stores using non-personalized [8] Y. Koren. Factorization meets the neighborhood: a ranking mechanisms. While that approach may help most multifaceted collaborative filtering model. In users find high quality and essential apps quickly, it is less Proceedings of the 14th ACM SIGKDD international effective in recommending apps to users who are in an ex- conference on Knowledge discovery and data mining, ploratory mode KDD ’08, pages 426–434, New York, NY, USA, 2008. In this study, we used app-usage as our metric. Given ACM. the characteristics of this data, we found that traditional [9] G. Linden, B. Smith, and J. York. Amazon.com memory-based approaches heavily favor popular apps con- recommendations: Item-to-item collaborative filtering. trary to our mission. On the other hand, latent factor IEEE Internet Computing, 7:76–80, 2003. models that were developed based on the Netflix data per- [10] S. M. McNee, J. Riedl, and J. A. Konstan. Being formed quite poorly accuracy-wise. We find that the Eige- accurate is not enough: how accuracy metrics have napp model performed the best in accuracy and in promo- hurt recommender systems. In CHI ’06 extended tion of less well known apps in the tail of our dataset. abstracts on Human factors in computing systems, A system using the Eigenapp model is currently in internal CHI EA ’06, pages 1097–1101, New York, NY, USA, trials at GetJar. It presents a personalized app list to users 2006. ACM. along with a non-personalized most popular list. The first [11] D. W. Oard and J. Kim. Implicit feedback for list is elicited when users are in an exploratory mode and recommender systems. In Proceedings of the AAAI the second when they are looking for the most sought-after Workshop on Recommender Systems, pages 81–83, apps. We plan to open this system for general use in the 1998. second half of 2012. Simultaneously, we are also working continuously to improve our system. [12] A. Paterek. Improving regularized singular value A limitation of the current model is that it includes only decomposition for collaborative filtering. In apps with certain minimum of usage, a condition that most Proceedings of KDD Cup and Workshop, pages 39–42, apps do not satisfy. While the set of apps included probably 2007. contains most of the potentially interesting ones, it is pos- [13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. sible that we removed some interesting niche apps, or high Application of dimensionality reduction in quality apps by individual developers that were not exposed recommender system – a case study. In Proceedings of due to lack of marketing. The latter case is particularly the ACM WebKDD Workshop, 2000. important to us. We are currently exploring content-based [14] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. models that extract useful features from app metadata and Item-based collaborative filtering recommendation plan to combine the results of the collaborative and content- algorithms. In Proceedings of the 10th international based approaches in future work. conference on World Wide Web, WWW ’01, pages 285–295, New York, NY, USA, 2001. ACM. [15] G. Shani and A. Gunawardana. Evaluating 7. ACKNOWLEDGEMENTS recommendation systems. Recommender Systems The authors would like to thank Anand Venkataraman for Handbook, pages 257–297, 2011. guidance, edits and help with revisions. Chris Dury provided [16] M. Turk and A. Pentland. Eigenfaces for recognition. valuable feedback and Sunil Yarram helped during various J. Cognitive Neuroscience, 3(1):71–86, Jan. 1991. stages of data preparation. [17] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949. 212
You can also read