Multidimensional Characterization of Expert Users in the Yelp Review Network
Cheng Han Lee, Department of Computer Science, University of Illinois at Urbana-Champaign, clee17@illinois.edu
Sean Massung, Department of Computer Science, University of Illinois at Urbana-Champaign, massung1@illinois.edu

* Submitted as the semester project for CS 598hs, Fall 2014.

ABSTRACT
In this paper, we propose a multidimensional model that integrates text analysis, temporal information, network structure, and user metadata to effectively predict experts from a large collection of user profiles. We make use of the Yelp Academic Dataset, which provides us with a rich social network of bidirectional friendships, full review text including formatting, timestamped activities, and user metadata (such as votes and other information) in order to analyze and train our classification models. Through our experiments, we hope to develop a feature set that can be used to accurately predict whether a user is a Yelp expert (also known as an 'elite' user) or a normal user. We show that each of the four feature types is able to capture a signal that a user is an expert user. In the end, we combine all feature sets together in an attempt to raise the classification accuracy even higher.

Keywords
network mining, text mining, expert finding, social network analysis, time series analysis

1. INTRODUCTION
Expert finding seeks to locate users in a particular domain that have more qualifications or knowledge (expertise) than the average user. Usually, the number of experts is very low compared to the overall population, making this a challenging problem. Expert finding is especially important in medical, legal, and even governmental situations. In our work, we focus on the Yelp academic dataset [8] since it has many rich features that are unavailable in other domains. In particular, we have full review content, timestamps, the friend graph, and user metadata—this allows us to use techniques from text mining, time series analysis, social network analysis, and classical machine learning.

From a user's perspective, it's important to find an expert reviewer to give a fair or useful review of a business that may be a future destination. From a business's perspective, expert reviewers should be great summarizers and able to explain exactly how to improve their store or restaurant. In both cases, it's much more efficient to find the opinion of an expert reviewer than sift through hundreds of thousands of potentially useless or spam reviews.

Yelp is a crowd-sourced business review site as well as a social network, consisting of several objects: users, reviews, and businesses. Users write text reviews accompanied by a star rating for businesses they visit. Users also have bidirectional friendships as well as one-directional fans. We consider the social network to consist of the bidirectional friendships since each user consents to the friendship of the other user. Additionally, popular users are much less likely to know their individual fans, making this connection much weaker. Each review object is annotated with a time stamp, so we are able to investigate trends temporally.

The purpose of this work is to investigate and analyze the Yelp dataset and find potentially interesting patterns that we can exploit in our future expert-finding system. The key question we hope to answer is:

Given a network of Yelp users, who is an elite user?

To answer the above question, we have to first address the following:

1. How does the text in expert reviews differ from text in normal reviews?
2. How does the average number of votes per review for a user change over time?
3. Are elite users the first to review a new business?
4. Does the social network structure suggest whether a user is an elite user?
5. Does user metadata available from Yelp have any indication about a user's status?

The structure of this paper is as follows: in section 2, we discuss related work. In sections 3, 4, 5, and 6, we discuss the four different dimensions of the Yelp dataset.
For the first three feature types, we use text analysis, temporal analysis, and social network analysis respectively. The user metadata is already in a quantized format, so we simply overview the fields available. Section 7 details running experiments on the proposed features on balanced (number of experts is equal to the number of normal users) and unbalanced (number of experts is much less) data. Finally, we end with conclusions and future work in section 8.

2. RELATED WORK

Paper                       Text?   Time?   Graph?
Sun et al. 2009 [13]                        X
Bozzon et al. 2013 [3]                      X
Zhang et al. 2007 [17]                      X
Choudhury et al. 2009 [5]   X       X
Balog et al. 2009 [1]       X
Ehrlich et al. 2007 [6]     X               X

Figure 1: Comparison of features used in previous work.

RankClus [13] integrates clustering and ranking on heterogeneous information networks. Within each cluster, a ranking of nodes is created, so the top-ranked nodes could be considered experts for a given cluster. For example, consider the DBLP bibliographic network. Clusters are formed based on authors who share coauthors, and within each cluster there is a ranking of authoritative authors (experts in their field).

Clustering and ranking are defined to mutually enhance each other, since conditional rank is used as a clustering feature and cluster membership is used as an object feature. In order to determine the final configuration, an expectation-maximization algorithm is used to iteratively update cluster and ranking assignments.

This work is relevant to our Yelp dataset if we consider clusters to be the business categories, and experts to be domain experts. However, the Yelp categories are not well-defined since some category labels overlap, so some extra processing may be necessary to deal with this issue.

Expert Finding in Social Networks [3] considers Facebook, LinkedIn, and Twitter as domains where experts reside. Instead of labeling nodes from the entire graph as experts, a subset of candidate nodes is considered and they are ranked according to an expertise score. These expertise scores are obtained through link relation types defined on each social network, such as creates, contains, annotates, owns, etc. To rank experts, they used a vector-space retrieval model common in information retrieval [10] and evaluated with popular IR metrics such as MAP, MRR, and NDCG [10]. Their vector space consisted of resources, related entities, and expertise measures. They concluded that profile information is a less useful determiner for expertise than their extracted relations, and that resources created by others including the target are also quite useful.

A paper on "Expertise Networks" [17] begins with a large study on analyzing a question and answer forum; typical centrality measures such as PageRank [11] and HITS [9] are used to initially find expert users. Then, other features describing these expert users are defined or extracted in order to create an "ExpertiseRank" algorithm, which (as far as we can tell) is essentially PageRank. This algorithm was then evaluated by human raters and it was found that ExpertiseRank had slightly smaller errors than the other measures (including HITS, but it was not evaluated against PageRank).

While the result of ExpertiseRank is unsurprising, we would be unable to directly use it or PageRank since the Yelp social network is undirected; running PageRank on an undirected network approximates degree centrality.

A paper on Expert Language Models [1] builds two different language models by invoking Bayes' Theorem. The conditional probability of a candidate given a specific query is estimated by representing it using a multinomial probability distribution over the vocabulary terms. A candidate model θ_ca is inferred for each candidate ca, such that the probability of a term used in the query given the candidate model is p(t|θ_ca). For one of the models, they assumed that the document and the candidate are conditionally independent, and for the other model they use the probability p(t|d, ca), which is based on the strength of the co-occurrence between a term and a candidate in a particular document.

In terms of the modeling techniques used above, we can adopt a similar method whereby the candidate in the Expert Language Models [1] is a Yelp user, and we will determine the extent to which a review characterizes an elite or normal user.

For the paper on interesting YouTube commenters [5], the goal is to determine a real scalar value corresponding to each conversation to measure its interestingness. The model comprises detecting conversational themes using a mixture model approach, determining 'interestingness' of participants and conversations based on a random walk model, and lastly, establishing the consequential impact of 'interestingness' via different metrics. The paper could be useful to us for characterizing reviews and Yelp users in terms of 'interestingness'. An intuitive conjecture is that 'elite' users should be ones with a high 'interestingness' level and, likewise, they should post reviews that are interesting.

In summary, Fig 1 shows a comparison of the related work surveyed and which aspects of the dataset each examined.

3. TEXT ANALYSIS
The text analysis examines the reviews written by each user in order to extract features from the unstructured text content. Common text processing techniques such as indexing, categorization, and language modeling are explored in the next sections.

3.1 Datasets
First, we preprocessed the text by lowercasing, removing stop words, and performing stemming with the Porter2 English stemmer. Text was tokenized as unigram bag-of-words by the MeTA toolkit (http://meta-toolkit.github.io/meta/).
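A minimal Python sketch of an equivalent preprocessing step is shown below. It is illustrative only: our experiments use MeTA, and the tokenizing regular expression, the NLTK stop word list, and the function names here are assumptions rather than the exact configuration.

# Illustrative unigram bag-of-words preprocessing (lowercase, stop word
# removal, Porter2/"english" Snowball stemming). Not the MeTA pipeline.
import re
from collections import Counter

from nltk.corpus import stopwords                 # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer    # "english" corresponds to Porter2

STOP_WORDS = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")

def tokenize(review_text: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z']+", review_text.lower())
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]

def bag_of_words(review_text: str) -> Counter:
    """Unigram bag-of-words representation of a single review."""
    return Counter(tokenize(review_text))

if __name__ == "__main__":
    print(bag_of_words("The tacos were AMAZING -- definitely recommended!"))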
Dataset   Docs      Len_avg   |V|       Raw   Index
All       659,248   81.8      164,311   480   81
Elite     329,624   98.8      125,137   290   46
Normal    329,624   64.9      95,428    190   37

Figure 2: Comparison of the three text datasets of Yelp reviews in terms of corpus size, average document length (in words), vocabulary size, raw data size, and indexed data size (both in MB).

Confusion Matrix
          classified as elite   classified as normal
elite     0.665                 0.335
normal    0.252                 0.748

Class     F1 Score   Precision   Recall
Elite     0.694      0.665       0.725
Normal    0.718      0.748       0.691
Total     0.706      0.706       0.708

Figure 3: Confusion matrix and classification accuracy on normal vs elite reviews.

We created three datasets that are used in this section:

• All: contains all the elite reviews (reviews written by an elite user) and an equal number of normal reviews
• Elite: contains only reviews written by elite users
• Normal: contains only reviews written by normal users

Elite and Normal together make up All; this is to ensure the analyses run on these corpora have a balanced class distribution. Overall, there were 1,125,458 reviews, consisting of 329,624 elite reviews and 795,834 normal reviews. Thus, the set of normal reviews was randomly shuffled and truncated to 329,624.

The number of elite users is far fewer than the ratio of written reviews may suggest; this is because elite users write many more reviews on average than normal users. A summary of the three datasets is found in Fig 2.

3.2 Classification
We tested how easy it is to distinguish between an elite review and a non-elite (normal) review with a simple supervised classification task. We used the dataset All described in the previous section along with each review's true label to train an SVM classifier. Evaluation was performed with five-fold cross validation and had a baseline of 50% accuracy. Results of this categorization experiment are displayed in Fig 3.

The confusion matrix tells us that it was slightly easier to classify normal reviews, though the overall accuracy was acceptable at just over 70%. Precision and recall had opposite maximums for normal and elite, though overall the F1 scores were similar. Recall that the F1 score is the harmonic mean of precision P and recall R:

F1 = 2PR / (P + R)

Since this is just a baseline classifier, we expect it is possible to achieve higher accuracy using more advanced features such as n-grams of words or grammatical features like part-of-speech tags or parse tree productions. However, this initial experiment is to determine whether elite and non-elite reviews can be separated based on text alone, with no regard to the author or context. Since the accuracy of this default model is 70%, it seems that text will make a useful subset of the overall features to predict expertise.

Furthermore, remember that this classification experiment is not about whether a user is elite or not, but rather whether a review has been written by an elite user; it would be very straightforward to extend this problem to classify users instead, where each user is a combination of all reviews that he or she writes. In fact, this is what we do in section 7, where we are concerned with identifying elite users.

3.3 Language Model
We now turn to the next text analysis method: unigram language models. A language model is simply a distribution of words given some context. In our example, we will define three language models—each based on a corpus described in section 3.1.

The background language model (or "collection" language model) simply represents the All corpus. We define a smoothed collection language model p_C(w) as

p_C(w) = (count(w, C) + 1) / (|C| + |V|)

This creates a distribution p_C(·) over each word w ∈ V. Here, C is the corpus and V is the vocabulary (each unique word in C), so |C| is the total number of words in the corpus and |V| is the number of unique terms.

The collection language model essentially shows the probability of a word occurring in the entire corpus. Thus, we can sort the outcomes in the distribution by their assigned probabilities to get the most frequent words. Unsurprisingly, these words are the common stop words with no real content information. However, we will use this background model to filter out words specific to elite or normal users.

We now define another unigram language model to represent the probability of seeing a word w in a corpus θ ∈ {elite, normal}. We create a normalized language model score per word using the smoothed background model defined previously:

score(w, θ) = (count(w, θ) / |θ|) / p_C(w) = count(w, θ) / (p_C(w) · |θ|)
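To make the two definitions above concrete, the following sketch computes the smoothed collection model and the normalized score over plain word-count dictionaries; the Counter-based data layout and the function names are assumptions made for the example, not part of our actual toolchain.

# Smoothed collection model and normalized per-word score from section 3.3.
from collections import Counter

def collection_model(all_counts: Counter) -> dict:
    """p_C(w) = (count(w, C) + 1) / (|C| + |V|) over the All corpus."""
    c_size = sum(all_counts.values())        # |C|: total tokens
    v_size = len(all_counts)                 # |V|: unique terms
    denom = c_size + v_size
    return {w: (c + 1) / denom for w, c in all_counts.items()}

def lm_scores(theta_counts: Counter, p_c: dict) -> dict:
    """score(w, theta) = count(w, theta) / (p_C(w) * |theta|)."""
    theta_size = sum(theta_counts.values())  # |theta|
    return {w: c / (p_c[w] * theta_size)
            for w, c in theta_counts.items() if w in p_c}

# Usage: rank the words most indicative of the elite corpus.
# p_c = collection_model(all_counts)
# top_elite = sorted(lm_scores(elite_counts, p_c).items(),
#                    key=lambda kv: kv[1], reverse=True)[:20]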
Background   Normal     Elite
the          gorsek     uuu
and          forks)     aloha!!!
a            yu-go      **recommendations**
i            sabroso    meter:
to           (***       **summary**
was          eloff      carin
of           -/+        no1dp
is           jeph       (lyrics
for          deirdra    friends!!!!!
it           ruffin'    **ordered**
in           josefa     8/20/2011
that         ubox       rickie
my           waite      kuge
with         again!!    ;]]]
but          optionz    #365
this         ecig       g
you          nulook     *price
we           gtr        visits):
they         shiba      r
on           kenta      ik

Figure 4: Top 20 tokens from each of the three language models.

The goal of the language model score is to find unigram tokens that are very indicative of their respective categories; using a language model this way can be seen as a form of feature selection. Fig 4 shows a comparison of the top twenty words from each of the three models.

These default language models did not reveal very clear differences in word usage between the two categories, despite the elite users using a larger vocabulary as shown in Fig 2. The singular finding was that the elite language model shows that its users are more likely to segment their reviews into different sections, discussing different aspects of the business: for example, recommendations, summary, ordered, or price.

Also, it may appear that there are a good deal of nonsense words in the top words from each language model. However, upon closer inspection, these words are actually valid given some domain knowledge of the Yelp dataset. For example, the top word "gorsek" in the normal language model is the last name of a normal user who always signs his posts. Similarly, "sabroso" is a Spanish word meaning delicious that a particular user likes to say in his posts. Similar arguments can be made for other words in the normal language model. In the elite model, "uuu" was originally "\uuu/", an emoticon that an elite user is fond of. "No1DP" is a Yelp username that is often referred to by a few other elite users in their review text.

Work on supervised and unsupervised review aspect segmentation has been done before [15, 16], and it may be applicable in our case since there are clear boundaries in aspect mentions. Another approach would be to add a boolean feature has_aspects that detects whether a review is segmented in the style popular among elite users.

3.4 Typographical Features
Based partly on the experiments performed in section 3.3, we now define typographical features of the review text. We call a feature a 'typographical' feature if it is a trait that can't be detected by a unigram words tokenizer and is indicative of the style of review writing.

We use the following six style or typographical features:

• Average review length. We calculate review length as the number of whitespace-delimited tokens in a review. Average review length is simply the average of this count across all of a user's reviews.

• Average review sentiment. We used sentiment valence scores [12] to calculate the sentiment of an entire review. The sentiment valence score is < 0 if the overall sentiment is negative and > 0 if the overall sentiment is positive.

• Paragraph rate. Based on the language model analysis, we included a feature to detect whether paragraph segmentation was used in a review. We simply count the rate of multiple newline characters per review per user.

• List rate. Again based on the language model analysis, we add this feature to detect whether a bulleted list is included in the review. We define a list as the beginning of a line followed by '*' or '-' before alpha characters.

• All caps. The rate of words in all capital letters. We suspect very high rates of capital letters will indicate spam or useless reviews.

• Bad punctuation. Again, this feature is to detect less serious reviews in an attempt to find spam. A basic example of bad punctuation is not starting a new sentence with a capital letter.

Although the number of features here is low, we hope that the added meaning behind each one is more informative than a single unigram words feature.
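A rough sketch of how these six features could be extracted per user is shown below; the regular expressions, the `valence` lexicon argument, and the heuristics are illustrative assumptions, since the exact rules are not specified above.

# Illustrative extraction of the six typographical features from section 3.4.
import re
from statistics import mean

def typographic_features(reviews: list[str], valence: dict[str, float]) -> dict:
    def sentiment(text: str) -> float:
        # Sum word-level valence scores; the lexicon itself is supplied by the caller.
        return sum(valence.get(w, 0.0) for w in text.lower().split())

    def all_caps_rate(text: str) -> float:
        words = text.split()
        caps = [w for w in words if len(w) > 1 and w.isupper()]
        return len(caps) / len(words) if words else 0.0

    def bad_punct(text: str) -> int:
        # Example heuristic: sentence-ending punctuation followed by a lowercase letter.
        return len(re.findall(r"[.!?]\s+[a-z]", text))

    return {
        "avg_length":     mean(len(r.split()) for r in reviews),
        "avg_sentiment":  mean(sentiment(r) for r in reviews),
        "paragraph_rate": mean(len(re.findall(r"\n\s*\n", r)) for r in reviews),
        "list_rate":      mean(len(re.findall(r"^\s*[*-]\s*\w", r, re.M)) for r in reviews),
        "all_caps_rate":  mean(all_caps_rate(r) for r in reviews),
        "bad_punct_rate": mean(bad_punct(r) for r in reviews),
    }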
4. TEMPORAL ANALYSIS
In this section, we look at how features change temporally by making use of the time stamps on reviews as well as tips. This allows us to analyze the activity of a user over time, as well as how the average number of votes the user has received changes with each review posted.

Elite vs Normal Users Statistics
               useful votes   funny votes   cool votes
elite users    616            361           415
normal users   20             7             7

Figure 5: Average number of votes per category for elite and normal users.

Logistic Regression Summary
            elite users   normal users
training    2005          2005
testing     18040         18040

Figure 6: Summary of training and testing data for logistic regression.

Confusion Matrix
          classified as elite   classified as normal
elite     0.64                  0.36
normal    0.26                  0.74

Figure 7: Summary of results for logistic regression.

4.1 Average Votes-per-review Over Time
The average number of votes-per-review varies with each review posted by a user. To gather this data, we grouped the reviews in the Yelp dataset by user and ordered each user's reviews by the date each was posted.

The goal was to try to predict whether a user is an "Elite" or "Normal" user using the votes-per-review vs. review number plot. The motivation for this was that after processing the data, we found that the number of votes on average was significantly greater for elite users than for normal users, as shown in Fig 5. Thus, we decided to find out whether any trend exists in how the average number of votes grows with each review posted by users from both categories. We hypothesized that elite users should have an increasing average number of votes over time.

On the y-axis, we have υ_i, which is the votes-per-review after a user posts his ith review. This is defined as the sum of the number of "useful" votes, "cool" votes and "funny" votes divided by the number of reviews by the user up to that point in time. On the x-axis, we have the review count.

Using the Yelp dataset, we plotted a scatter plot for each user. Visual inspection of the graphs did not show any obvious trends in how the average number of votes per review varied with each review being posted by the user.

We then proceeded to perform a logistic regression using the following variables:

P_increase = count(increases) / count(reviews)

µ = (Σ_i υ_i) / count(reviews)

where count(increases) is the number of times the average votes-per-review increased (i.e. υ_{i+1} > υ_i) after a user posts a review, count(reviews) is the number of reviews the user has made, and the sum in µ runs over all of the user's reviews.

Both the training and testing sets consist only of users with at least one review. For each user, we calculated the variables P_increase and µ. The training and testing data are shown in Fig 6. 10% of users with at least one review became part of the training data and the remaining 90% were used to test.

There was an accuracy of 0.69 on the testing set. The results are shown in Fig 7. Given that the overall accuracy of our model is relatively high at 0.69, we can hypothesize that P_increase is higher for elite users than for normal users. This means that each review that an elite user posts tends to be a "quality" review that receives enough votes to increase the running average of votes-per-review for this user. The second hypothesis is that the mean of the running average votes-per-review for elite users is higher than that of normal users. This is supported by the data shown in Fig 5, where the average votes for elite users are higher than for normal users.
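The following sketch computes the two regression inputs, P_increase and µ, from a single user's reviews. The input representation—a date-ordered list of per-review vote totals (useful + funny + cool)—is an assumption for illustration.

# Section 4.1 features: P_increase (fraction of reviews after which the running
# votes-per-review average rose) and mu (mean of that running average).
def votes_per_review_features(vote_totals: list[int]) -> tuple[float, float]:
    running_avgs = []          # upsilon_i after the user's i-th review
    increases = 0
    total_votes = 0
    for i, votes in enumerate(vote_totals, start=1):
        total_votes += votes
        avg = total_votes / i
        if running_avgs and avg > running_avgs[-1]:
            increases += 1
        running_avgs.append(avg)

    n = len(vote_totals)
    p_increase = increases / n if n else 0.0
    mu = sum(running_avgs) / n if n else 0.0
    return p_increase, mu

# Usage: a user whose reviews earned 3, 0, 5, and 2 votes in posting order.
# p, mu = votes_per_review_features([3, 0, 5, 2])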
4.2 User Review Rank
For the second part of our temporal analysis, we look at the rank of each review a user has posted. Using 0-indexing, if a review has rank r for business b, the review was the rth review written for business b.

Our hypothesis was that an elite user should be one of the first few users to write a review for a restaurant, because elite users are more likely to find new restaurants to review. Also, based on the dataset, elite users write approximately 230 more reviews on average than normal users, thus it is more likely that elite users will be one of the first users to review a business. Over time, since there are more normal users, the ratio of elite to normal users will decrease as more normal users write reviews.

To verify this, we calculated the percentage of elite reviews at each rank across the top 10,000 businesses, where the top business is defined as the business with the most reviews. The number of ranks we look at is the minimum number of reviews of a single business among the top 10,000 businesses. The plot is shown in Fig 8.

Figure 8: Plot of the probability of being an elite user for reviews at rank r.

Given that the dataset consists of approximately 10% elite users, the plot shows us that it is more likely for an elite user to be among the first few reviewers of a business.

We calculated a score for each user which is a function of the rank of each review of the user, and we included this as a feature in the SVM. For each review of a user, we find the total number of reviews of the business that this review belongs to. We take the total review count of this business and subtract the rank of the review from it. We then sum this value over each review to assign a score to the user. Based on our hypothesis, since elite users will more likely have a lower rank for each review than normal users, the score for elite users should therefore be higher than for normal users.

The score for a user is defined as follows:

score = Σ_rev (review_count(business_of(rev)) − rank(rev))

We subtract the rank from the total review count so that, based on our hypothesis, elite users will end up having a higher score.

4.3 User Tip Rank
A tip is a short chunk of text that a user can submit to a restaurant via any Yelp mobile application. Using 0-indexing, if a tip has rank r for business b, the tip was the rth tip written for business b. Similar to the review rank, we hypothesized that an elite user should be one of the first few tippers (people who give a tip) of a restaurant. We plotted the same kind of graph, which shows the percentage of elite tips at each rank across the top 10,000 businesses, where the top business is defined as the business with the most tips. The plot is shown in Fig 9.

Figure 9: Plot of the probability of being an elite user for tips at rank r.

Given that the dataset consists of approximately 10% elite users, the plot shows us that it is more likely for an elite user to be among the first few tippers of a business. Furthermore, for this specific dataset, elite users only make up approximately 25% of the total number of tips, yet for the top 10,000 businesses, they make up more than 25% of the tips at almost all the ranks shown in Fig 9.

We then calculated a score for each user based on the rank of each tip of the user, and we included this as a feature in the SVM. The score is defined as follows:

score = Σ_tip (tip_count(business_of(tip)) − rank(tip))

(The equation for this score follows the same reasoning as in the user review rank section.)

4.4 Review Activity Window
In this section, we look at the distribution of a user's activity over time. The window we look at is between the user's join date and the end date, defined as the last date of any review posted in the entire dataset. For each user, we find the interval in days between each review, including the join date and end date. For example, if the user has two reviews on date1 and date2, where date2 is after date1, the interval durations will be: date1 − joinDate, date2 − date1, and endDate − date2. So for n reviews, we get n + 1 intervals. Based on the list of intervals, we calculate a score. For this feature, we hypothesize that the lower the score, the more likely the user is an elite user.

The score is defined as:

score = (var(intervals) + avg(intervals)) / days_on_yelp

where var(intervals) is the variance of all the interval values, avg(intervals) is their average, and days_on_yelp is the number of days a user has been on Yelp.

For the variance, the hypothesis is that for elite users the variance will tend to be low, as we hypothesize that elite users post regularly. For normal users, the variance will be high, possibly due to irregular posting and long periods of inactivity between posts.

We also look at the average value of the intervals. This is because if we were to only look at variance, a user who writes a review every two days would get the same variance (zero) as a user who writes a review every day. As such, the average of the intervals will account for this by increasing the score.

Finally, we divide the score by the number of days the user has been on Yelp. This is to account for situations where a user makes a post every week but has only been on Yelp for three weeks, versus a user who also makes a post every week but has been on Yelp for a year. The user who has been on Yelp for a year will then get a lower value for this score (elite user).
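A small sketch of the activity-window score follows, assuming review dates are available as date objects; the field names are illustrative and do not match the dataset's JSON schema.

# Section 4.4 activity-window score:
#   score = (var(intervals) + avg(intervals)) / days_on_yelp
from datetime import date
from statistics import mean, pvariance

def activity_window_score(join_date: date, review_dates: list[date], end_date: date) -> float:
    ordered = sorted(review_dates)
    # n reviews yield n + 1 gaps: join -> first, between reviews, last -> end.
    boundaries = [join_date] + ordered + [end_date]
    intervals = [(b - a).days for a, b in zip(boundaries, boundaries[1:])]
    days_on_yelp = (end_date - join_date).days or 1
    return (pvariance(intervals) + mean(intervals)) / days_on_yelp

# Usage: a user who joined 2012-01-01 and reviewed twice before the dataset end.
# s = activity_window_score(date(2012, 1, 1),
#                           [date(2012, 3, 1), date(2012, 3, 15)],
#                           date(2014, 12, 31))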
5. SOCIAL NETWORK ANALYSIS
The Yelp social network is the user friendship graph. This data is available in the latest version of the Yelp academic dataset. We used the graph library from the same toolkit that was used to do the text analysis in section 3.

We make an assumption that users on the Yelp network don't become friends at random; that is, we hypothesize that users become friends if they think their friendship is mutually beneficial. In this model, we think one user will become friends with another user if he or she thinks the other user is worth knowing (i.e., is a "good reviewer"). We believe this is a fair assumption to make, since the purpose of the Yelp website is to provide quality reviews for both businesses and users. One potential downside we can see is users becoming friends just because they are friends in real life, or in a different social network.

5.1 Network Centrality
Since our goal is to find "interesting" or "elite" users, we use three network centrality measures to identify central (important) nodes. We would like to find out if elite users are more likely to be central nodes in their friendship network. We'd also like to investigate whether the results of the three centrality measures are correlated. Next, we briefly summarize each measure. For a more in-depth discussion of centrality (including the measures we use), we suggest the reader consult [7]. For our centrality calculations we considered the graph of 123,369 users that wrote at least one review.

Degree centrality for a user u is simply the degree of node u. In our network, this is the same value as the number of friends. Therefore, it makes sense that users with more friends are more important (or active) than those that have fewer or no friends. Degree centrality can be calculated almost instantly.

Betweenness centrality for a node u essentially captures the number of shortest paths between all pairs of nodes that pass through u. In this case, a user being an intermediary between many user pairs signifies importance. Betweenness centrality is very expensive to calculate, even using an O(mn) algorithm [4]. This algorithm is part of the toolkit we used, and it took two hours to run on 3.0 GHz processors with 24 threads.

Eigenvector centrality operates under the assumption that important nodes are connected to other important nodes. PageRank [11] is a simple extension to eigenvector centrality. If a graph is represented as an adjacency matrix A, then the (i, j)th cell is 1 if there is an edge between i and j, and 0 otherwise. This notation is convenient when defining eigenvector centrality for a node u, denoted as x_u:

x_u = (1/λ) Σ_{i=1}^{n} A_{iu} x_i

Since this can be rewritten as Ax = λx, we can solve for the eigenvector centrality values with power iteration, which converges in a small number of iterations and is quite fast. The eigenvector centralities for the Yelp social network were calculated in less than 30 seconds.

Degree Centrality
Name       Reviews   Useful   Friends   Fans    Elite
Walker     240       6,166    2,917     142     Y
Kimquyen   628       7,489    2,875     128     Y
Katie      985       23,030   2,561     1,068   Y
Philip     706       4,147    2,551     86      Y
Gabi       1,440     12,807   2,550     420     Y

Betweenness Centrality
Name       Reviews   Useful   Friends   Fans    Elite
Gabi       1,440     12,807   2,550     420     Y
Philip     706       4,147    2,551     86      Y
Lindsey    906       7,641    1,617     348     Y
Jon        230       2,709    1,432     60      Y
Walker     240       6,166    2,917     142     Y

Eigenvector Centrality
Name       Reviews   Useful   Friends   Fans    Elite
Kimquyen   628       7,489    2,875     128     Y
Carol      505       2,740    2,159     163     Y
Sam        683       9,142    1,960     100     Y
Alina      329       2,096    1,737     141     N
Katie      985       23,030   2,561     1,068   Y

Figure 10: Comparison of the top-ranked users as defined by the three centrality measures on the social network.

Fig 10 shows the comparison of the top five ranked users based on each centrality score. The top five users of each centrality shared some names: Walker, Gabi, and Philip in degree and betweenness; Kimquyen and Katie in degree and eigenvector; betweenness and eigenvector shared no users in the top five (though not shown, there are some that are the same in the range six to ten).

The top users defined by the centrality measures are almost all elite users, even though elite users only make up about 8% of the dataset. The only exception here is Alina from eigenvector centrality. Her other statistics look like they fit in with the other elite users, so perhaps this could be a prediction that Alina will be elite in the year 2015.

The next step is to use these social network features to predict elite users.
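For illustration, the three centrality features could be computed with networkx as sketched below. This is not the graph library used in our experiments, and the sampled betweenness (parameter k) is a shortcut that deviates from the exact O(mn) algorithm reported above.

# Degree, betweenness, and eigenvector centrality over the friendship graph.
import networkx as nx

def centrality_features(friend_pairs: list[tuple[str, str]]) -> dict[str, tuple[float, float, float]]:
    g = nx.Graph()
    g.add_edges_from(friend_pairs)          # undirected, bidirectional friendships

    degree = nx.degree_centrality(g)
    # Exact betweenness is O(mn); sampling k pivot nodes keeps this sketch tractable.
    betweenness = nx.betweenness_centrality(g, k=min(500, len(g)))
    eigenvector = nx.eigenvector_centrality(g, max_iter=200)

    return {u: (degree[u], betweenness[u], eigenvector[u]) for u in g}

# Usage:
# feats = centrality_features([("u1", "u2"), ("u2", "u3"), ("u1", "u3")])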
5.2 Weighted Networks
Adding weighted links between users could definitely enhance the graph representation. The attributes which could potentially be weighted are fans and votes. Additionally, if we had some tie strength of friendship based on communication or profile views, we could use weighted centrality measures for this aspect as well.

Unfortunately, we have no way to define the strength of the friendship between two users, since we only have the information present in the Yelp academic dataset. As for the votes and fans, in the Yelp academic dataset we are only given a raw number for these values, as opposed to the actual links for the social network. If we had this additional information, we could add those centrality measures to the friendship graph centrality measures for an enriched social network feature set.

6. USER METADATA
User metadata is information that is already part of the JSON Yelp user object. It is possible to see all the metadata by visiting the Yelp website and viewing specific numerical fields.

• Votes. Votes are ways to show a specific type of appreciation towards a user. There are three types of votes: funny, useful, and cool. There is no specific definition for what each means.

• Review count. This is simply the total number of reviews that a user has written.

• Number of friends. The total number of friends in a user's friendship graph. This feature is duplicated in the degree centrality measure of the social network analysis.

• Number of fans. The total number of fans a user has.

• Average rating. The average star rating in [1, 5] the user gives a business.

• Number of compliments. According to Yelp, the compliment button is "an easy way to send some good vibes." This is separate from a review. In fact, users get compliments from other users based on particular reviews that they write.

We hope to use these metadata features in order to classify users as elite. We already saw in section 5 that some metadata fields seemed to be correlated with network centrality measures as well as a user's status, so it seems like they will be informative features.

7. EXPERIMENTS
We now run experiments to test whether each feature generation method is a viable candidate to distinguish between elite and normal users. As mentioned before, the number of elite users is much smaller than the number of total users; about 8% of all 252,898 users are elite. This presents us with a very imbalanced class distribution. Since using the entire user base to classify elite users has such a high baseline (92% accuracy), we also truncate the dataset to a balanced class distribution with a total of 40,090 users, giving an alternate baseline of 50% accuracy. Both datasets are used for all future experiments.

As described in section 3.1, we use the MeTA toolkit (http://meta-toolkit.github.io/meta/) to do the text tokenization, class balancing, and five-fold cross-validation with SVM. SVM is implemented here as stochastic gradient descent with hinge loss.

Confusion Matrix: Balanced Text Features
          classified as elite   classified as normal
elite     0.651                 0.349
normal    0.124                 0.876
Overall accuracy: 76.7%, baseline 50%

Confusion Matrix: Unbalanced Text Features
          classified as elite   classified as normal
elite     0.582                 0.418
normal    0.039                 0.961
Overall accuracy: 91.8%, baseline 92%

Figure 11: Confusion matrices for normal vs elite users on balanced and unbalanced datasets.

Confusion Matrix: Balanced Temporal Features
          classified as elite   classified as normal
elite     0.790                 0.210
normal    0.320                 0.680
Overall accuracy: 73.5%, baseline 50%

Confusion Matrix: Unbalanced Temporal Features
          classified as elite   classified as normal
elite     0.267                 0.733
normal    0.067                 0.933
Overall accuracy: 88%, baseline 92%

Figure 12: Confusion matrices for normal vs elite users on balanced and unbalanced datasets.

7.1 Text Features
We represent users as a collection of all their review text. Based on the previous experiments, we saw that it was possible to classify a single review as being written by an elite or normal user. Now, we want to classify users based on all their reviews as either an elite or normal user. Figure 11 shows the results of the text classification task. Using the balanced dataset we achieve about 77% accuracy, compared to barely achieving the baseline accuracy on the full dataset.

Since the text features are so high dimensional, we performed some basic feature selection by selecting the most frequent features from the dataset. Before feature selection, we had an accuracy on the balanced dataset of about 70%. Using the top 100, 250, and 500 features all resulted in a similar accuracy of around 76%. We use the reduced feature set of 250 in the experimental results in the rest of this paper.

7.2 Temporal Features
The temporal features consist of features derived from changes in the average number of votes per review posted, the sum of the ranks of a user's reviews as well as tips, and the distribution of reviews posted over the lifetime of a user. Using these features, we obtained the results shown in Figure 12.
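A rough scikit-learn analogue of this experimental setup is sketched below (our actual pipeline runs in MeTA); the random undersampling used to balance the classes, the hyperparameters, and the feature-matrix construction are our assumptions for the example.

# Balanced five-fold cross-validation with an SVM trained by SGD on hinge loss.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def balanced_cv_accuracy(X: np.ndarray, y: np.ndarray, seed: int = 42) -> float:
    """X: per-user feature matrix, y: 1 for elite, 0 for normal (elites assumed the minority)."""
    rng = np.random.default_rng(seed)
    elite_idx = np.flatnonzero(y == 1)
    normal_idx = rng.choice(np.flatnonzero(y == 0), size=len(elite_idx), replace=False)
    keep = np.concatenate([elite_idx, normal_idx])

    clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=1000)  # regularized linear SVM
    scores = cross_val_score(clf, X[keep], y[keep], cv=5, scoring="accuracy")
    return scores.mean()

# Usage (with an already-built per-user feature matrix):
# print(balanced_cv_accuracy(X_users, y_users))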
7.3 Graph Features
Figure 13 shows the results using the centrality measures from the social network. Although there are only three features, Figure 10 showed that there is potentially a correlation between elite status and high-valued centrality measures. The three graph features alone were able to predict whether a user was elite using the balanced dataset with almost 80% accuracy. Again, results were lower compared to the baseline when using the full user set.

Confusion Matrix: Balanced Graph Features
          classified as elite   classified as normal
elite     0.842                 0.158
normal    0.251                 0.749
Overall accuracy: 79.6%, baseline 50%

Confusion Matrix: Unbalanced Graph Features
          classified as elite   classified as normal
elite     0.311                 0.689
normal    0.075                 0.925
Overall accuracy: 87.6%, baseline 92%

Figure 13: Confusion matrices for normal vs elite users on balanced and unbalanced datasets.

Confusion Matrix: Balanced Metadata Features
          classified as elite   classified as normal
elite     0.959                 0.041
normal    0.083                 0.917
Overall accuracy: 93.8%, baseline 50%

Confusion Matrix: Unbalanced Metadata Features
          classified as elite   classified as normal
elite     0.880                 0.120
normal    0.097                 0.903
Overall accuracy: 90.1%, baseline 92%

Figure 14: Confusion matrices for normal vs elite users on balanced and unbalanced datasets.

7.4 Metadata Features
Using only the six metadata features from the original Yelp JSON file gave surprisingly high accuracy at almost 94% for the balanced classes. In fact, the metadata features had the highest precision for both the elite and normal classes. The unbalanced accuracy was near the baseline.

Confusion Matrix: Balanced All Features
          classified as elite   classified as normal
elite     0.754                 0.256
normal    0.111                 0.889
Overall accuracy: 82.2%, baseline 50%

Confusion Matrix: Unbalanced All Features
          classified as elite   classified as normal
elite     0.976                 0.024
normal    0.731                 0.269
Overall accuracy: 92%, baseline 92%

Figure 15: Confusion matrices for normal vs elite users on balanced and unbalanced datasets with all features present.

             Text   Temp.   Graph   Meta   All
Balanced     .767   .735    .796    .938   .822*
Unbalanced   .918   .880    .876    .901   .920

Figure 16: Final results summary for all features and feature combinations on balanced and unbalanced data. *Excluding just the text features resulted in 90.4% accuracy.

7.5 Feature Combination and Discussion
To combine features, we simply concatenated the feature vectors for all the previous features and used the same splits and classifier as before. Figure 15 shows the breakdown of this classification experiment. Additionally, we summarize all results by final accuracy in Figure 16.

Unfortunately, it looks like the combined feature vectors did not significantly improve the classification accuracy on the balanced dataset as expected. Initially, we thought that this might be due to overfitting, which is why we reduced the number of text features from over 70,000 to 250. Using the 70,000 text features combined with the other feature types resulted in about 70% accuracy; with the top 250 features, we achieved 82.2% as shown in the tables. For the unbalanced dataset, it seems that the results did improve to reach the difficult baseline.

Using all combined features except the text features resulted in 90.4% accuracy, suggesting there is some sort of disagreement between "predictive" text features and all other predictive features. Thus, removing the text features yielded a much higher result, approaching the accuracy of just the Yelp metadata features.

Since we dealt with some overfitting issues, we made sure that the classifier used regularization. Regularization ensures that weights for specific features do not become too high if it seems that they are incredibly predictive of the class label. Fortunately (or unfortunately), the classifier we used does employ regularization, so there is nothing further we could do to attempt to increase the performance.

8. CONCLUSION AND FUTURE WORK
We investigated several different feature types to attempt to classify elite users in the Yelp network. We found that all of our features were able to distinguish between the two user types. However, when combined, we weren't able to make an improvement in accuracy on the class-balanced dataset over the best-performing single feature type.

In the text analysis, we can investigate different classifiers to improve the classification accuracy. For example, k-nearest neighbor could be a good approach since it is nonlinear and we have a relatively small number of dimensions after reducing the text features. The text analysis could also be extended with the aid of topic modeling [2]. One output from this algorithm acts to cluster documents into separate topics; a document is then represented as a mixture of these topics, and each document's mixture can be used as a feature for the classifier.
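As a sketch of that extension, the snippet below derives per-user topic mixtures with scikit-learn's LDA implementation; the topic count and vectorizer settings are illustrative choices, not settings used in this paper.

# Per-user topic-mixture features from LDA, as suggested above.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_mixture_features(user_texts: list[str], n_topics: int = 20):
    """Return an (n_users, n_topics) matrix of topic proportions per user."""
    vectorizer = CountVectorizer(max_features=5000, stop_words="english")
    counts = vectorizer.fit_transform(user_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)      # each row is a mixture over topics

# Usage: concatenate each user's reviews into one document, then append the
# resulting topic proportions to the other feature vectors before training.
# topic_feats = topic_mixture_features(concatenated_reviews_per_user)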
In the temporal analysis, we made some assumptions about the 'elite' data provided by the Yelp dataset. The data tells us for which years the user was 'elite', and we made a simplifying assumption that as long as a user has at least one year of elite status, the user is currently and has always been an elite user. For instance, if a user was only elite in the year 2010, we treated the user's review back in 2008 as an elite review. Also, we could have made use of more advanced models like the vector autoregression model (VAR) [14], which might allow us to improve the analysis of votes per review over time. One possible way would be to look at all the votes-per-review plots of users in the dataset and run the model using this data. Finally, in the network analysis, we can consider different network features such as the clustering coefficient or similarity via random walks.

The graph features would certainly benefit from added weights, but as mentioned in section 5, we unfortunately do not have this data. Social graph structure can also be created with more information about fans and votes.

Finally, since the metadata features were by far the best-performing, it would be an interesting auxiliary problem to predict their values via a regression using the other feature types we created.

APPENDIX
A. REFERENCES
[1] Krisztian Balog, Leif Azzopardi, and Maarten de Rijke. A language modeling framework for expert finding. Inf. Process. Manage., 45(1):1–19, January 2009.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
[3] Alessandro Bozzon, Marco Brambilla, Stefano Ceri, Matteo Silvestri, and Giuliano Vesci. Choosing the right crowd: Expert finding in social networks. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, pages 637–648, New York, NY, USA, 2013. ACM.
[4] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163–177, 2001.
[5] Munmun De Choudhury, Hari Sundaram, Ajita John, and Dorée Duncan Seligmann. What makes conversations interesting?: Themes, participants and consequences of conversations in online social media. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 331–340, New York, NY, USA, 2009. ACM.
[6] Kate Ehrlich, Ching-Yung Lin, and Vicky Griffiths-Fisher. Searching for experts in the enterprise: Combining text and social network analysis. In Proceedings of the 2007 International ACM Conference on Supporting Group Work, GROUP '07, pages 117–126, New York, NY, USA, 2007. ACM.
[7] Jiawei Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[8] Yelp Inc. Yelp Dataset Challenge, 2014. http://www.yelp.com/dataset_challenge.
[9] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, September 1999.
[10] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[11] Larry Page, Sergey Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1998.
[12] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2):1–135, January 2008.
[13] Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu. RankClus: Integrating clustering with ranking for heterogeneous information network analysis. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT '09, pages 565–576, New York, NY, USA, 2009. ACM.
[14] Hiro Y. Toda and Peter C. B. Phillips. Vector autoregression and causality. Cowles Foundation Discussion Papers 977, Cowles Foundation for Research in Economics, Yale University, May 1991.
[15] Hongning Wang, Yue Lu, and ChengXiang Zhai. Latent aspect rating analysis on review text data: A rating regression approach. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 783–792, New York, NY, USA, 2010. ACM.
[16] Hongning Wang, Yue Lu, and ChengXiang Zhai. Latent aspect rating analysis without aspect keyword supervision. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 618–626, New York, NY, USA, 2011. ACM.
[17] Jun Zhang, Mark S. Ackerman, and Lada Adamic. Expertise networks in online communities: Structure and algorithms. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 221–230, New York, NY, USA, 2007. ACM.