GetJar Mobile Application Recommendations with Very Sparse Datasets

Page created by Nancy Hawkins

IT & Technique

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

GetJar Mobile Application Recommendations with Very
                        Sparse Datasets

                                     Kent Shi                                                        Kamal Ali
                                 GetJar Inc.                                                       GetJar Inc.
                             San Mateo, CA, USA                                                San Mateo, CA, USA
                             kent@getjar.com                                                   kamal@getjar.com

ABSTRACT                                                                          Keywords
The Netﬂix competition of 2006 [2] has spurred signiﬁcant                         Recommender system, mobile application, evaluation, sparse
activity in the recommendations ﬁeld, particularly in ap-                         data, PCA
proaches using latent factor models [3, 5, 8, 12]. However,
the near ubiquity of the Netﬂix and the similar MovieLens
datasets1 may be narrowing the generality of lessons learned
                                                                                  1. INTRODUCTION
in this ﬁeld. At GetJar, our goal is to make appealing rec-                          In the last few years, there has been a tremendous amount
ommendations of mobile applications (apps). For app usage,                        of growth in the mobile app space, particularly in the An-
we observe a distribution that has higher kurtosis (heavier                       droid platform. As of January 2012, there are more than
head and longer tail) than that for the aforementioned movie                      400,000 apps hosted on Google’s app store:2 Google Play
datasets. This happens primarily because of the large dis-                        (formerly known as Android Market). However, Google Play
parity in resources available to app developers and the low                       provides little personalization beyond location-based tailor-
cost of app publication relative to movies.                                       ing of catalogs. That means all users from a given country
  In this paper we compare a latent factor (PureSVD) and                          will see the same list of apps regardless of their tastes and
a memory-based model with our novel PCA-based model,                              preferences.
which we call Eigenapp. We use both accuracy and variety                             Since most users typically navigate no more than a few
as evaluation metrics. PureSVD did not perform well due                           pages when browsing the store, lack of personalization lim-
to its reliance on explicit feedback such as ratings, which we                    its exposure for the majority of the apps. By analyzing
do not have. Memory-based approaches that perform vec-                            the usage of apps on a sample of devices, we ﬁnd that this
tor operations in the original high dimensional space over-                       space is dominated by a few apps, which unsurprisingly are
predict popular apps because they fail to capture the neigh-                      ones that have been “featured” recently on the front page of
borhood of less popular apps. They have high accuracy due                         Google Play.
to the concentration of mass in the head, but did poorly                             GetJar, founded in 2004, is the largest free app store in
in terms of variety of apps exposed. Eigenapp, which ex-                          the world. It provides mobile apps to users of all mobile
ploits neighborhood information in low dimensional spaces,                        platforms. We have recently begun to focus on the Android
did well both on precision and variety, underscoring the im-                      platform due to its openness and surging market share. Our
portance of dimensionality reduction to form quality neigh-                       goal is to become an attractive destination for Android apps
borhoods in high kurtosis distributions.                                          by providing high quality personalization as a means to app
                                                                                  discovery.
Categories and Subject Descriptors                                                1.1   Challenges
H.2.8 [Database Management]: Database Applications—                                 While recommendations techniques, especially those using
Data mining; H.3.3 [Information Storage and Retrieval]:                           collaborative ﬁltering, have been common since the early
Information Search and Retrieval—Information ﬁltering                             1990s [6] and have been deployed on a number of e-commerce
                                                                                  websites such as Amazon.com [9], recommendation in the
General Terms                                                                     emerging app domain is a task beset by unique challenges
Algorithms, Experimentation, Performance                                          mainly due to the greater kurtosis in the distribution of app
                                                                                  usage data.
1
    http://www.grouplens.org/node/73                                                From anonymous usage data collected at GetJar, we ﬁnd
                                                                                  that there are a few well-known apps popular among a large
                                                                                  number of users, but the vast majority of apps are rarely
Permission to make digital or hard copies of all or part of this work for         used by most users. Figure 1(a) shows a comparison of the
personal or classroom use is granted without fee provided that copies are         data distribution between the movie (Netﬂix) and app (Get-
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
                                                                                  Jar) domains. Note the plot only includes apps that have
otherwise, or republish, to post on servers or to redistribute to lists,          been recently used by GetJar users. This constitutes approx-
requires prior specific permission and/or a fee.                                  imately 55,000 apps, or about 14% of all apps. The movie
KDD’12, August 12–16, 2012, Beijing, China.                                       2
Copyright 2012 ACM 978-1-4503-1462-6/12/08... $15.00.                               http://www.distimo.com/blog/2012_01_
                                                                                  google-android-market-tops-400000-applications

                                                                            204

100%

                                                                                                    100%
                                                                GetJar                                                                          GetJar
                                                                Netflix                                                                         Netflix
                     10%

                                                                                                    10%
  Percent of Users

                                                                                 Percent of Users
                     1%

                                                                                                    1%
                     0.1%

                                                                                                    0.1%
                     0.01%

                                                                                                    0.01%
                             0   20%   40%        60%     80%      100%                                       0.01%   0.1%        1%      10%      100%

                                       Percent of Items                                                                Percent of Items
                                       (a)                                                                             (b)

Figure 1: (a) Distribution of items (GetJar apps or Netﬂix movies) in terms of percentage of total users,
with items sorted by popularity. (b) Distributions of items plotted in log-log scale.

at the ﬁrst percentile (rank 177) is rated by 20% of Net-                       app stores rely on developers to categorize their own apps
ﬂix users. In contrast, the app at the ﬁrst percentile (rank                    using a ﬁxed inventory of labels. This leads to a small num-
550) is used only by 0.6% of GetJar users. Furthermore,                         ber of categories and a large number of apps within each,
the movie at the ﬁrst percentile has 42% as many users as                       causing only the top few apps in each category to ever have
the most popular movie, but app at the ﬁrst percentile has                      signiﬁcant visibility. Search is also ineﬀective because we
only 1.3% as many users as the most popular app. There-                         ﬁnd that most users don’t know what to search for. About
fore, even though there are over 400,000 available apps, in                     90% of search queries at GetJar are titles (or close variants)
reality only a few thousand of them are being used in any                       of popular apps, which means search currently is not being
signiﬁcant sense.                                                               used as an eﬀective vehicle to discover new apps.
   The same data is plotted in Figure 1(b), this time using
a log scale for both axes. We can see that the GetJar curve
is almost a straight line in log-log space, indicating that the                 1.2                    Goal and evaluation criteria
frequencies can be approximated by a Zipf distribution[17].                        Users visit GetJar hoping to ﬁnd interesting and useful
This ﬁgure deﬁnitively shows the qualitative diﬀerence in dis-                  apps. But as we have seen, common strategies such as
tribution: App distribution is linear in log-log space whereas                  browsing and searching, which have worked well for other
movie distribution isn’t. Traditional collaborative ﬁltering                    e-commerce sites don’t work as well in domains where many
techniques [9, 14] or even the newer latent factor models [3,                   items remain under-publicized. Our goal is to use personal-
5, 8, 12, 13] were not designed to handle this level of sparsity.               ization to help users ﬁnd a greater variety of appealing apps.
   There are at least three reasons for this diﬀerence. First,                  Our prototype recommendation system recommends a top-
the disparity in available resources among the app develop-                     N list of apps to each user based on her recent app usage.
ers is larger than that of movie producers. This is mainly                      We judge the quality of the recommendations primarily by
due to the cost (time and money) of publishing apps being                       accuracy, which represents the ability of the recommender
much lower than that for releasing movies. Second, due to                       to predict the presence of an app on the user’s device. To
the less mature nature of the smart phone space, most ca-                       increase the exposure of under-publicized apps, the recom-
sual users are unaware of the full capabilities of their device                 mender is also evaluated on its ability to recommend tail
or what apps are available for it. This is in contrast to other                 apps as well as the variety of the apps it recommend.
domains such as movies, where there are numerous outlets                           A number of app stores currently oﬀer personalized app
dedicated to reviewing or promoting those products. Third,                      recommendations, most notably the Apple App Store and
discovery mechanisms in the app space are less eﬀective and                     the Amazon Appstore. However, little is known about how
mature compared to those of other domains.                                      they generate their recommendations. Furthermore, we are
   Today, most app stores oﬀer three ways for users to dis-                     unaware of any publications on mobile app recommenda-
cover apps: (1) Listings of apps sorted by the number of                        tions.
downloads or a similar trending metric, (2) Category-based                         The rest of the paper is organized as follows: Section 2
browsing and (3) Keyword-based searching. We know that                          will review how the data was collected and some of its prop-
the number of apps that can be exposed using listings is                        erties; Section 3 will provide details of the algorithms that
limited, and that methods 2 and 3 are not as eﬀective as we                     we considered; Section 4 will provide the experimental setup
would like. Browsing by category is only useful if the ontol-                   and results; and ﬁnally Sections 5 and 6 provide discussion
ogy of categories is rich, as in the case of Amazon. But most                   and conclusions.

                                                                          205

2. THE GETJAR DATA

100%
GetJar
The data we report upon in this paper comes from server Netflix
log ﬁles at GetJar where all personally identifying infor-

80%
mation had been stripped, but information pertaining to

Percent of Total Rating/Usage
a single source can be uniquely identiﬁed up to a common
anonymous identiﬁer. The apps we report here include those

60%
hosted on GetJar as well as those on Google Play.
For the purposes of this study, we rely upon app usage
data rather than installation data. The reason we choose

40%
not to use installation data is that it is a poor indicator of
interest since many app installations are experimental from
a user’s perspective. A signiﬁcant fraction of our users are

20%
found to uninstall an app on the same day as they installed
it. Also, there is another signiﬁcant fraction of users that
have a vast number of installed apps that never get used.
Many users are new to the mobile app space and thus are

0
likely experimenting with a variety of apps. We restrict our 0.01% 0.1% 1% 10% 100%
data to recent app usage to account for this fact, that users’
tastes for apps can change more rapidly than for traditional Percent of Items
domains such as movies and music. We are only interested
in recommending apps that reflect their current tastes or Figure 2: Cumulative distribution of items in terms
interests. of percentage of total usage, the curves can be
The observation period for data used for this study is from viewed as the integral of the curves in Figure 1.
November 7 to November 21, 2011. We find that varying
length of the observation period by a few days makes almost Dataset Users Items Usages/Ratings Density
no difference in the number of apps used by the users.3 In an GetJar 101,106 55,020 1.99M 0.04%
effort to reduce noise in the data from apps that were being GetJar* 101,031 7,304 1.82M 0.25%
trialed by users, we filtered out apps that were not used other Netflix 480,189 17,770 100M 1.18%
than on the day of installation. We further cleaned the data
by removing users that joined or left midway during the Table 1: Size of user-item matrices for Netflix and
observation period and those that were not associated with GetJar dataset. GetJar* denotes the GetJar dataset
a pre-determined list of legitimate devices. The resultant including only apps that have been used by more
dataset contains 101,106 users. For each user we used the list than 20 users.
of apps and the number of days each app was used during the
observation period. The total number of unique apps used
by all users during the interval satisfying our constraints was
55,020. Jar dataset, as previously alluded to, is primarily due to the
low cost of publishing apps compared to the cost of releas-
2.1 Data sparsity and long tail ing a movie. This encourages developers to release as many
As we have already illustrated in Figure 1, our data is apps as possible to increase the chances of their apps being
extremely sparse and that the vast majority of apps have discovered by search. This strategy often leads to apps be-
low usage. While it is well known that sparsity and a long ing published multiple times with different titles but similar
tail [1] are two characteristics of all e-commerce data, these functionalities. This also encourages the proliferation of a
are especially pronounced in our dataset. large number of apps tailored for very specific needs (e.g.
Figure 2 plots the cumulative distribution of the items ringtone apps dedicated to music by specific artists) as op-
in terms of the total amount of usage. We can see that posed to general apps (e.g. a single ringtone app containing
the GetJar dataset is far more head-heavy compared to the music by all artists).
Netflix dataset, with the top 1% of apps accounting for 58% Given that we have little or no usage information on the
of usage in contrast to Netflix where the top 1% of movies bulk of the tail apps, it makes recommending them a very
contribute to 22% of all ratings. An even more selective difficult task. In order to ensure that the recommended apps
subset - the 100 most popular apps - account for 30% of will have certain amount of support, for this study, we lim-
total app usage. For the GetJar dataset, we define the head ited our app population by only including apps with more
to be the top 100 apps and the remaining apps to be the than 20 users. This reduces the number of apps from 55,030
tail. to 7,304. Even though this pruning process removed 87% of
One major reason for this difference is that many apps are apps (or 98% if we include apps with no usage), it is note-
used every day, but movies are seldom watched more than worthy that only 9% of the total usage was thus eliminated
once or twice. Thus Netflix users may be more likely to ex- from our modeling. Table 1 shows the size and density of the
plore new items relative to GetJar users. Another reason user-item matrices before and after our pruning. It shows
is that the Netflix data was collected over a much longer that even after rejecting the bottom 87% of the apps, the
period of time. The reason for the longer tail in the Get- GetJar* dataset is still much sparser relative to Netflix.

3
We use the more convenient word users to denote their 2.2 Usage versus ratings
anonymized identiﬁers. Another diﬀerence between the GetJar dataset and the

206

Netflix dataset is that movie ratings is an explicit feedback Number of Common Users
Dataset
for interest whereas days of usage is implicit [11]. The ben- 0 1 2-10 11-20 > 20
efit of an explicit rating system is that it is well-defined and GetJar* 83.2% 9.1% 6.6% 0.6% 0.6%
standardized, thus generating a more accurate measurement Netflix 0.2% 0.4% 33.8% 22.2% 43.3%
of interest compared to implicit feedbacks such as days of
usage. The latter can be influenced by a number of factors Table 2: Breakdown of number of common users for
such as mood, unforeseen events, or logging errors. Fur- the GetJar and Netflix datasets. For n items, the
2
thermore, there is also correlation between usage and cate- total number of item pairs is n 2−n .
gory - we find that “social” apps are consistently the most
heavily used apps among nearly all users. This is because
“social” apps need to be used often in order to serve their
purpose, but apps in categories such as “productivity” are items in user space or that of users in item space. A user-
seldom needed on a continuous basis. So while it is safe to user or item-item similarity matrix is computed for pairs
assume that a user enjoyed a movie that she rated highly and recommendations are generated based on these similari-
relatively to one rated lowly, the same cannot be said for ties. Latent factor models are more sophisticated approaches
a user that used a “social” app more than a “productivity” where the user-item matrix is decomposed via matrix fac-
app. torization techniques such as Singular Value Decomposition
We choose not to use ratings because it has a number of (SVD). Latent factors are then extracted and used to gen-
drawbacks in the mobile app domain. Most importantly, it erate predictions.
is very difficult to collect for a large number of users with- We evaluated both the above approaches using our data.
out forceful intervention. Furthermore, since users’ taste in In addition, we developed a hybrid system using Princi-
apps may change and many app developers frequently up- pal Components Analysis (PCA) which we call Eigenapp.
date their apps with new features or functionalities, ratings These three algorithms were also compared against a non-
may become obsolete in as little as one month. Finally, ob- personalized baseline recommendation system that serves
serving ratings on Google Play, we find they are polarized, the most popular items.
with the vast majority of ratings being either 1 or 5. This
is likely due to fragmentation of the Android platform,4 re- 3.1 Non-personalized models
sulting in most ratings being given based on whether the Non-personalized models are those that serve the same
app worked (5) or not (1) for the user. list of items to all users. They commonly sort items by
Due to the influence of the Netflix competition, most re- the number of purchases, profit margin, click-through rate
search in the recommendations community has been geared (CTR), or other similar metrics. In this paper, our non-
toward rating prediction by means of minimizing root mean personalized baseline algorithm sorts items by popularity,
square error (RMSE). However, Cremonesi et. al [3] re- where popularity is defined as the number of distinct users
ported that improving RMSE does not translate into im- that have used the item during the observation period.
provement in accuracy for the top-N task. On the Netflix
and MovieLens datasets, the predictive accuracy of a naive 3.2 Memory-based models
most popular list is comparable to those by sophisticated
There are two types of memory-based models: Item-based
algorithms optimized for RMSE. We tried the same using
and user-based. Item-based models find similarities between
the GetJar dataset but substituting days of usage for rat-
items, and for a given user they recommend items that are
ings, and found that algorithms optimized for RMSE actu-
similar to items she already owns. User-based models find
ally performed far worse than a simple most popular list.
similarities between users, and for a given user they recom-
With that said, days of usage can still be used for neigh-
mend items owned by her most similar users.
borhood approaches, provided that there still exists some
Computationally, item-based models are more scalable be-
correlation between it and interest. A part of this study is
cause there are usually far fewer items than users, as is the
to evaluate the usefulness of this metric. Thus, for our ex-
case in the mobile app space. In addition, there is research
periments, we used two versions of the user-item matrix. In
showing that item-based algorithms generally perform bet-
the first version, each cell represents the number of days the
ter than user-based algorithms [9, 14]. Hence, our memory-
app was used, and in the second, each cell is a binary indi-
based model uses the item-based approach.
cator of usage during the observation period. We’d like to
Two of the most common neighborhood similarity met-
see if the additional granularity provided by the days of us-
rics in current use are the Pearson correlation coefficient
age will generate better recommendations than when using
and cosine similarity. The Pearson correlation coefficient is
a binary indicator.
computed for a pair of items based on the set of users that
have used both. Since the vast majority of our items reside
3. MODELS in the long tail, many of those items are unlikely to share
common users with most other items.
Two common recommendation approaches in use today
Table 2 presents the distribution of number of common
are those using memory-based models and latent factor mod-
users in the GetJar and Netflix datasets. The table shows
els. Memory-based models leverage the neighborhood of
that 83.2% of item pairs in the GetJar dataset have zero
4 users in common, whereas that same percentage for Net-
There are many manufacturers that produce Android de- flix is 0.2%. For GetJar, more than 90% of item pairs have
vices with various hardware specifications and tweaks of the
operating system. This makes it difficult for developers to to one or no common users. Thus it is impossible to compute
test their apps on all devices, resulting in apps not working correlations for these item pairs. In addition, the vast ma-
as intended on many devices. jority of the remaining item pairs share 10 or fewer users,

207

meaning that the sample correlation estimate is likely to be Examples of this approach include [5, 8, 12, 13]. We
inaccurate due to poor support. In contrast, the published tried [5] and [13] by substituting days of usage for ratings,
Netflix dataset has less than 1% of movie pairs sharing 1 and then sorting the predictions to generate a top-N rec-
or fewer common users and about 65% of movie pairs share ommended list. The results were by far the worst of all
more than 10 common users. Since the Pearson correlation algorithms, for reasons explained in Section 2.2. We expect
coefficient is undefined for 90% of our item pairs, we will use similar results for other rating prediction based algorithms.
cosine similarity. The only latent factor top-N algorithm we are aware of
Let R denote the m × n user-item matrix where m is the is PureSVD [3]. The algorithm works by replacing all miss-
number of users and n is the number of items. From R, we ing values (those with no ratings) in R with 0, and then
compute an item-item similarity matrix S, whose (i, j) entry factorizing R via SVD:
is:
R = U · Σ · VT (4)
r∗,i · r∗,j
si,j = (1)
r∗,i 2 · r∗,j 2 Then affinity between user u and item i can be computed
by:
where r∗,i and r∗,j are the ith and jth columns respectively
of R. Cosine similarity does not require items to share com- tu,i = ru,∗ · Q · qTi (5)
mon users. In such case it will simply produce a similarity where Q stands for the top k singular vectors extracted from
of 0. However, it still suffers from low overlap support. The V and qi is the row in Q corresponding to item i. Note that
closest neighbors for a less popular item will often occur tu,i is simply an association measure and not a predicted
by coincidence simply because they are the only ones that rating. A top-N list can then be made for user u by selecting
produced non-zero similarity scores. the N items with the highest affinity score to u.
Using S, the affinity tu,i between user u and item i is the PureSVD is the only latent factor algorithm we evaluated
sum of similarities between i and items used by u: that was able to generate reasonable recommendations. The
main reason for this is that, unlike the other algorithms,
tu,i = si,j (2) PureSVD is not optimized for RMSE based rating prediction
j∈Iu
but rather the relative ordering of items produced by the
where Iu is the set of items used by u. For a given user, all association scores.
items are sorted by their affinity score in order to produce
a top-N list.5
3.4 Eigenapp model
We made two slight modifications to the above method Of the two previously mentioned approaches, memory-
that produced better results. First, the item-item similarity based models yielded far better results despite only hav-
scores si,j were normalized before being used in equation (2). ing neighborhoods for popular items. We want to improve
Deshpande et al. [4] suggested using a normalization such the result of memory-based models by borrowing ideas from
that the sum of the similarities add up to 1. However, we the latent factor models. Along these lines, we used dimen-
found that normalizing using z-score worked much better for sionality reduction techniques to extract meaningful features
the GetJar dataset, producing the asymmetric similarity: from the items and then applied memory-based techniques
to generate recommendations in this reduced space.
si,j − s∗,j
si,j = (3) Our neighborhood is still item-based, but items are now
σs∗,j represented using features instead of users. Similar to [3],
we also replace all missing values in R with 0. Given the
where s∗,j is the average similarity to item j and σs∗,j is the
large disparity in app frequencies, we normalized the item
standard deviation of similarities to item j. Second, for each
vectors to prevent the features from being based on only
item candidate i, instead of summing over all items in Iu ,
popular items. This is done by normalizingeach column in
we only considered the l nearest items, which are those with
the greatest normalized similarly scores to i. This has the
R to have zero mean and length of one:
2 u ru,i = 0 and

effect of noise reduction by discarding weakly related items u ru,i = 1. We denote this new normalized user-item
to the given i. For the GetJar dataset, we find that setting matrix as R and apply PCA to R for feature extraction.
l = 5 seemed to work the best. PCA is performed via eigen decomposition of the covari-
ance matrix C. C is computed by first calculating the mean

3.3 Latent factor models item vector b with bu = n1 i ru,i . Then remove the mean

Latent factor models work by factorizing the user-item by forming matrix A where each cell au,i = ru,i − bu and
T
matrix R into two lower rank matrices: user factors and item finally compute C = AA . Note that C is an m × m matrix
factors. These models are often used for rating predictions, with the number of users m likely to be a very large num-
where a rating ru,i for user u on item i can be predicted ber. This makes eigen decomposition practically impossible
by taking the inner product of their respective vectors in in time and space. Observing that the number of items n
the user factors and item factors. User bias and item bias is likely to be much lower, we used the same procedure as
are commonly removed by subtracting the row and column in Eigenface [16] to optimize the process. The procedure
means from R prior to the factorization step. The biases works by first conducting eigen decomposition on the n × n
are added back on to the inner product to generate the final matrix AT A obtaining eigenvectors v∗ and eigenvalues λ∗
prediction. such that for each j:
5 AT Avj∗ = λ∗j vj∗ (6)
In equation (2), users that use a greater number of items
will have more summands, but since we’re only interested Multiplying both sides by A, we get:
in the relative order of items for a given user, the varying
number of summands does not pose a problem. AAT (Avj∗ ) = λ∗j (Avj∗ ) (7)

208

POP                                           POP

                0.020

                                                                                               0.25
                                                            MEM BIN                                       MEM BIN
                                                            MEM DAY                                       MEM DAY
                                                            PureSVD BIN                                   PureSVD BIN
                                                            PureSVD DAY                                   PureSVD DAY

                                                                                               0.20
                                                            Eigenapp BIN                                  Eigenapp BIN
                0.015

                                                            Eigenapp DAY                                  Eigenapp DAY
    Precision

                                                                                               0.15
                                                                                      Recall
                0.010

                                                                                               0.10
                0.005

                                                                                               0.05
                0.000

                                                                                               0.00
                        0.0   0.2      0.4            0.6   0.8       1.0                             0     10           20       30   40   50

                                             Recall                                                                           N
                               (a) Precision-Recall                                                              (b) Recall at N

                    Figure 3: (a) Precision-recall curves and (b) Recall at N curves using all users in the test set.

We see that vectors vj = Avj∗ are the eigenvectors for C.                         the number of apps with some minimum amount of usage is
From there, we normalize each vj to length one and keep                           unlikely to increase signiﬁcantly with more users, we do not
only the k eigenvectors with the highest corresponding eigen-                     believe this will pose a problem.
values. The eigenvectors represent the dimensions with the                           Eigenapp is similar to another PCA based algorithm Eigen-
largest variances, or the dimensions that can best diﬀerenti-                     taste [7]. The main diﬀerence is that Eigentaste, which was
ate the items. Alternatively, these eigenvectors can also be                      evaluated on the Jester joke dataset,7 requires a gauge item
viewed as item features, items with similar projected values                      set where every user must have rated every item in the gauge
on a particular eigenvector are likely to be similar in cer-                      set. Coming up with such a gauge set is impossible for our
tain attributes. We will denote these eigenvectors as eige-                       application, much less one that is representative. In addi-
napps. Finally, we can project all the items onto the reduced                     tion, Eigentaste uses a user-based neighborhood approach to
eigenspace by D = vA. D is a k × n matrix, where each col-                        generate recommendations, whereas Eigenapp utilizes item-
umn contains the projected values of the item onto each of                        based neighborhoods.
the eigenapps. The values can be viewed as the coeﬃcients
or weights of the eigenapp for the items. By observing sev-
eral rows in D, apps with high projected values in these                          4. EVALUATION
eigenapps are often similar types of apps. This was useful in                        We evaluated the four types of models from Section 3:
preliminary validation showing that the Eigenapp approach                         Non-personalized (POP), Memory-based (MEM), PureSVD
indeed captured latent item features.                                             and Eigenapp, using the GetJar dataset. The experiment is
   Item-item similarities can be computed using equation (1)                      set up by randomly dividing the users into ﬁve equal sized
except that we use D instead of R. Since D is dense,                              groups. Four of the groups are used for training, and the
similarity scores will likely be non-zero for all item pairs.                     remaining one for evaluation. Using the training set, we
Once the item-item similarity matrix S has been computed,                         compute the item-item similarity matrix S for MEM and
the remainder of the algorithm is identical to the memory-                        Eigenapp, item factor matrix Q for PureSVD, and the list
based algorithm described in Section 3.2. We ﬁnd that                             of most popular items for POP. The number of eigenvectors
the computed neighborhood in the reduced eigenspace is of                         used for Eigenapp and number of singular vectors used for
much better quality compared to the one computed using                            PureSVD are both 300. For each user in the test set, we sort
the memory-based methods in the non-reduced space. How-                           the apps by install time. We feed the ﬁrst M − 1 apps to the
ever, neighborhood quality is still better for popular items                      model to generate its recommendation list of N apps. Then
than for less popular items, likely due to better support. We                     we check if the left out app is in the recommended list (all
also ﬁnd that the quality of neighborhood improves when we                        algorithms make sure to exclude from their recommendation
increase the number of eigenapps used, and that the neigh-                        list the M − 1 apps known to already be installed for the
borhood becomes relatively stable after k = 200.                                  given user). This procedure is repeated on all 5 possible ways
  The computation complexity of this algorithm, up to gen-                        of dividing the user groups, allowing every group to be used
erating S, is O(mn2 ). Using the current GetJar dataset, that                     as the evaluation group once, and thus a recommendation
process took about 11 minutes on an Intel Core i7 machine                         list for every user exists.
using the Eigen library.6 However, since the computation                             Two forms of user-item matrix R were considered for the
of S is the oﬄine phase of the recommender system, and                            experiments, as described in Section 2.2. The ﬁrst version
6                                                                                 7
    http://eigen.tuxfamily.org                                                        http://eigentaste.berkeley.edu/dataset

                                                                            209

POP                                        POP
                                                          MEM BIN                                    MEM BIN

                                                                                          0.20
              0.015                                       MEM DAY                                    MEM DAY
                                                          PureSVD BIN                                PureSVD BIN
                                                          PureSVD DAY                                PureSVD DAY
                                                          Eigenapp BIN                               Eigenapp BIN

                                                                                          0.15
                                                          Eigenapp DAY                               Eigenapp DAY
  Precision

              0.010

                                                                                 Recall

                                                                                          0.10
              0.005

                                                                                          0.05
              0.000

                                                                                          0.00
                      0.0   0.2      0.4            0.6   0.8       1.0                          0     10           20       30   40    50

                                           Recall                                                                        N
                             (a) Precision-Recall                                                           (b) Recall at N

Figure 4: (a) Precision-recall curves and (b) Recall at N curves after removing the 100 most popular items.

using days of usage will be denoted as DAY, and the bina-                       4.2         Accuracy of less popular items
rized version will be denoted as BIN.                                              Given the overwhelming exposure popular apps receive
   Accuracy is the ﬁrst evaluation criterion we used because                    today in the Android ecosystem, many users will use them
we want our recommendations to be relevant to user’s inter-                     simply because those are the only apps they know. Thus
est and preferences. However, user satisfaction is not solely                   using a popular app may not be a strong indicator of in-
dependent on accuracy [10]. In particular, given the domi-                      terest relative to less popular apps. In order to measure
nance of the popular apps in this domain, it is important to                    precision and recall on the “tail”, we redrew the precision-
expose apps in the tail. With that in mind, we also evalu-                      recall curves by excluding the 100 most popular apps from
ated the accuracy of the models in recommending tail apps,                      the recommended list of each user. Note therefore, that hu
and the variety of the apps recommended.                                        will always be 0 for users whose relevant items are among
                                                                                the 100 most popular apps. Thus those users were removed
4.1 Accuracy                                                                    for this experiment.
   The accuracies of the models were evaluated by the stan-                        Figure 4(a) shows the precision-recall curves after remov-
dard precision-recall methodology. Since we have only one                       ing the 100 most popular items. The ﬁgure shows that Eige-
relevant item to be predicted for each user (the left out app),                 napp has the highest accuracy for this tail subset. MEM is
we set hu equal to 1 if the relevant item is in the top-N list                  now second, followed by PureSVD and POP. Recall at N
for user u and 0 otherwise. Precision and recall at each N                      shown in Figure 4(b) shows a similar picture, but it is worth
is computed by:                                                                 noting that relative to Figure 3(b), recall dropped for every
                                   m                                           algorithm with the exception of PureSVD. This shows it is
                                      u=1 hu
                   precision(N ) =                          (8)                 more diﬃcult to recommend relevant tail apps than head
                                   m m
                                        ·N
                                                                                apps.
                                      u=1 hu
                      recall(N ) =                          (9)                    Using the two types of user-item matrix (BIN and DAY)
                                       m                                        still achieved similar performance for all three algorithms,
where m is the number of users.                                                 but it appears Eigenapp and PureSVD yielded slightly bet-
   Figure 3(a) illustrates the precision-recall curves for the                  ter results using BIN compared to DAY.
algorithms . As we can see, the best performer was MEM de-
spite using an item-item similarity matrix consisting mostly
of zeros. A close second was Eigenapp, followed by POP                          4.3         Presentation
and PureSVD. Figure 3(b) illustrates the recall at each N ,                        The impression that the recommended list makes to the
up to N = 50. This ﬁgure shows the percentage of users                          user is also important to their satisfaction [10]. An artifact
whose missing app was identiﬁed in the top-N. When N is                         of our methodology for predicting the left-out item means
10, MEM identiﬁed the missing app for about 11% of users,                       that we penalize algorithms for predicting items that the
Eigenapp identiﬁed the missing app for about 10% of users,                      user may have liked had she known about them. Since it
and POP and PureSVD identiﬁed the missing app for about                         is impossible for us to know which of the “irrelevant” items
7% and 4% of users respectively.                                                (those that do not correspond to the left out item) in the
   The two types of user-item matrix (BIN and DAY) made                         top-N are potentially interesting ones, we can only judge the
little diﬀerence in the global accuracy of any of the three al-                 diversity of items that are presented. In this study, we are
gorithms. Indicating that the additional signals contributed                    interested in recommending a diverse list of apps from all
by number of days of usage do not outweigh its inaccuracies.                    popularity spectrums.

                                                                          210

Popularity Rank

                                                                                  10
 Algorithm
               1-50    51-100 101-500 501-1000         >1000
 POP          100%       0        0          0           0
 MM BIN       85%       5%       6%         2%          2%

                                                                                  8
 MM DAY       80%       6%       8%         3%          4%
 PS BIN

list of apps. This is because all item vectors are normalized 8. REFERENCES
prior to applying PCA, thus usage of less popular apps can [1] C. Anderson. The Long Tail: Why the Future of
be captured by the top eigenvectors. That makes it possible Business Is Selling Less of More. Hyperion, 2006.
for the less popular apps to be among the closest neigh-
[2] J. Bennett and S. Lanning. The netflix prize. In
bors of the popular apps. This is particularly important for
Proceedings of KDD Cup and Workshop, pages 3–6,
exposure of the less popular apps, because given the domi-
2007.
nance of the popular apps, only apps that are close to one of
[3] P. Cremonesi, Y. Koren, and R. Turrin. Performance
the popular apps can make frequent appearances at the top
of recommender algorithms on top-n recommendation
of the recommended lists. Using traditional memory-based
tasks. In Proceedings of the fourth ACM conference on
models, the popular apps form a tight cluster (relative to
Recommender systems, RecSys ’10, pages 39–46, New
the less popular apps) in its neighborhood, thus making it
York, NY, USA, 2010. ACM.
difficult for less popular apps to surface to the top of the
recommended lists for many users. [4] M. Deshpande and G. Karypis. Item-based top-n
recommendation algorithms. ACM Trans. Inf. Syst.,
22(1):143–177, Jan. 2004.
6. CONCLUSION [5] S. Funk. Netflix update: Try this at home.
With increasing numbers of people switching to smart http://sifter.org/˜simon/journal/20061211.html, 2006.
phones, the mobile application space is an emerging domain [6] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry.
for recommendation systems. Due to the wide disparity in Using collaborative filtering to weave an information
resources among app publishers, the apps that large compa- tapestry. Commun. ACM, 35(12):61–70, Dec. 1992.
nies develop receive far more exposure than those developed [7] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins.
by individual developers. This results in app usage being Eigentaste: A constant time collaborative filtering
dominated by a few popular apps. The problem is further algorithm. Inf. Retr., 4(2):133–151, July 2001.
exacerbated by existing apps stores using non-personalized [8] Y. Koren. Factorization meets the neighborhood: a
ranking mechanisms. While that approach may help most multifaceted collaborative filtering model. In
users find high quality and essential apps quickly, it is less Proceedings of the 14th ACM SIGKDD international
effective in recommending apps to users who are in an ex- conference on Knowledge discovery and data mining,
ploratory mode KDD ’08, pages 426–434, New York, NY, USA, 2008.
In this study, we used app-usage as our metric. Given ACM.
the characteristics of this data, we found that traditional [9] G. Linden, B. Smith, and J. York. Amazon.com
memory-based approaches heavily favor popular apps con- recommendations: Item-to-item collaborative filtering.
trary to our mission. On the other hand, latent factor IEEE Internet Computing, 7:76–80, 2003.
models that were developed based on the Netflix data per-
[10] S. M. McNee, J. Riedl, and J. A. Konstan. Being
formed quite poorly accuracy-wise. We find that the Eige-
accurate is not enough: how accuracy metrics have
napp model performed the best in accuracy and in promo-
hurt recommender systems. In CHI ’06 extended
tion of less well known apps in the tail of our dataset.
abstracts on Human factors in computing systems,
A system using the Eigenapp model is currently in internal
CHI EA ’06, pages 1097–1101, New York, NY, USA,
trials at GetJar. It presents a personalized app list to users
2006. ACM.
along with a non-personalized most popular list. The first
[11] D. W. Oard and J. Kim. Implicit feedback for
list is elicited when users are in an exploratory mode and
recommender systems. In Proceedings of the AAAI
the second when they are looking for the most sought-after
Workshop on Recommender Systems, pages 81–83,
apps. We plan to open this system for general use in the
1998.
second half of 2012. Simultaneously, we are also working
continuously to improve our system. [12] A. Paterek. Improving regularized singular value
A limitation of the current model is that it includes only decomposition for collaborative filtering. In
apps with certain minimum of usage, a condition that most Proceedings of KDD Cup and Workshop, pages 39–42,
apps do not satisfy. While the set of apps included probably 2007.
contains most of the potentially interesting ones, it is pos- [13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.
sible that we removed some interesting niche apps, or high Application of dimensionality reduction in
quality apps by individual developers that were not exposed recommender system – a case study. In Proceedings of
due to lack of marketing. The latter case is particularly the ACM WebKDD Workshop, 2000.
important to us. We are currently exploring content-based [14] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.
models that extract useful features from app metadata and Item-based collaborative filtering recommendation
plan to combine the results of the collaborative and content- algorithms. In Proceedings of the 10th international
based approaches in future work. conference on World Wide Web, WWW ’01, pages
285–295, New York, NY, USA, 2001. ACM.
[15] G. Shani and A. Gunawardana. Evaluating
7. ACKNOWLEDGEMENTS recommendation systems. Recommender Systems
The authors would like to thank Anand Venkataraman for Handbook, pages 257–297, 2011.
guidance, edits and help with revisions. Chris Dury provided [16] M. Turk and A. Pentland. Eigenfaces for recognition.
valuable feedback and Sunil Yarram helped during various J. Cognitive Neuroscience, 3(1):71–86, Jan. 1991.
stages of data preparation. [17] G. K. Zipf. Human Behavior and the Principle of
Least Effort. Addison-Wesley, 1949.

212

You can also read