Estimating the Prevalence of Deception in Online Review Communities
Myle Ott, Dept. of Computer Science, Cornell University, Ithaca, NY 14850, myleott@cs.cornell.edu
Claire Cardie, Depts. of Computer Science and Information Science, Cornell University, Ithaca, NY 14850, cardie@cs.cornell.edu
Jeff Hancock, Depts. of Communication and Information Science, Cornell University, Ithaca, NY 14850, jeff.hancock@cornell.edu

ABSTRACT
Consumers' purchase decisions are increasingly influenced by user-generated online reviews [3]. Accordingly, there has been growing concern about the potential for posting deceptive opinion spam—fictitious reviews that have been deliberately written to sound authentic, to deceive the reader [15]. But while this practice has received considerable public attention and concern, relatively little is known about the actual prevalence, or rate, of deception in online review communities, and less still about the factors that influence it. We propose a generative model of deception which, in conjunction with a deception classifier [15], we use to explore the prevalence of deception in six popular online review communities: Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor, and Yelp. We additionally propose a theoretical model of online reviews based on economic signaling theory [18], in which consumer reviews diminish the inherent information asymmetry between consumers and producers by acting as a signal to a product's true, unknown quality. We find that deceptive opinion spam is a growing problem overall, but with different growth rates across communities. These rates, we argue, are driven by the different signaling costs associated with deception for each review community, e.g., posting requirements. When measures are taken to increase signaling cost, e.g., filtering reviews written by first-time reviewers, deception prevalence is effectively reduced.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing; J.4 [Computer Applications]: Social and Behavioral Sciences—economics, psychology; K.4.1 [Computers and Society]: Public Policy Issues—abuse and crime involving computers; K.4.4 [Computers and Society]: Electronic Commerce

General Terms
Algorithms, Experimentation, Measurement, Theory

Keywords
Deceptive opinion spam, Deception prevalence, Gibbs sampling, Online reviews, Signaling theory

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2012, April 16–20, 2012, Lyon, France. ACM 978-1-4503-1229-5/12/04.

1. INTRODUCTION
Consumers rely increasingly on user-generated online reviews to make, or reverse, purchase decisions [3]. Accordingly, there appears to be widespread and growing concern among both businesses and the public [12, 14, 16, 19, 20, 21] regarding the potential for posting deceptive opinion spam—fictitious reviews that have been deliberately written to sound authentic, to deceive the reader [15]. Perhaps surprisingly, however, relatively little is known about the actual prevalence, or rate, of deception in online review communities, and less still is known about the factors that can influence it. On the one hand, the relative ease of producing reviews, combined with the pressure for businesses, products, and services to be perceived in a positive light, might lead one to expect that a preponderance of online reviews are fake. One can argue, on the other hand, that a low rate of deception is required for review sites to serve any value.¹

The focus of spam research in the context of online reviews has been primarily on detection. Jindal and Liu [8], for example, train models using features based on the review text, reviewer, and product to identify duplicate opinions.² Yoo and Gretzel [23] gather 40 truthful and 42 deceptive hotel reviews and, using a standard statistical test, manually compare the psychologically relevant linguistic differences between them. While useful, these approaches do not focus on the prevalence of deception in online reviews. Indeed, empirical, scholarly studies of the prevalence of deceptive opinion spam have remained elusive.

One reason is the difficulty in obtaining reliable gold-standard annotations for reviews, i.e., trusted labels that tag each review as either truthful (real) or deceptive (fake). One option for producing gold-standard labels, for example, would be to rely on the judgements of human annotators. Recent studies, however, show that deceptive opinion spam is not easily identified by human readers [15]; this is especially the case when considering the overtrusting nature of most human judges, a phenomenon referred to in the psychological deception literature as a truth bias [22].

¹ It is worth pointing out that a review site containing deceptive reviews might still serve value, for example, if there remains enough truthful content to produce reasonable aggregate comparisons between offerings.
² Duplicate (or near-duplicate) opinions are opinions that appear more than once in the corpus with the same (or similar) text. However, simply because a review is duplicated does not make it deceptive. Furthermore, it seems unlikely that either duplication or plagiarism characterizes the majority of fake reviews. Moreover, such reviews are potentially detectable via off-the-shelf plagiarism detection software.
To help illustrate the non-trivial nature of identifying deceptive content, given below are two positive reviews of the Hilton Chicago Hotel, one of which is truthful, and the other of which is deceptive opinion spam:

1. "My husband and I stayed in the Hilton Chicago and had a very nice stay! The rooms were large and comfortable. The view of Lake Michigan from our room was gorgeous. Room service was really good and quick, eating in the room looking at that view, awesome! The pool was really nice but we didnt get a chance to use it. Great location for all of the downtown Chicago attractions such as theaters and museums. Very friendly staff and knowledgable, you cant go wrong staying here."

2. "We loved the hotel. When I see other posts about it being shabby I can't for the life of me figure out what they are talking about. Rooms were large with TWO bathrooms, lobby was fabulous, pool was large with two hot tubs and huge gym, staff was courteous. For us, the location was great–across the street from Grant Park with a great view of Buckingham Fountain and close to all the museums and theatres. I'm sure others would rather be north of the river closer to the Magnificent Mile but we enjoyed the quieter and more scenic location. Got it for $105 on Hotwire. What a bargain for such a nice hotel."

Answer: See footnote.³

The difficulty of detecting which of these reviews is fake is consistent with recent large meta-analyses demonstrating the inaccuracy of human judgments of deception, with accuracy rates typically near chance [1]. In particular, humans have a difficult time identifying deceptive messages from cues alone, and as such, it is not surprising that research on estimating the prevalence of deception (see Section 8.2) has generally relied on self-report methods, even though such reports are difficult and expensive to obtain, especially in large-scale settings, e.g., the web [5]. More importantly, self-report methods, such as diaries and large-scale surveys, have several methodological concerns, including social desirability bias and self-deception [4]. Furthermore, there are considerable disincentives to revealing one's own deception in the case of online reviews, such as being permanently banned from a review portal, or harming a business's reputation.

Recently, automated approaches (see Section 4.1) have emerged to reliably label reviews as truthful vs. deceptive: Ott et al. [15] train an n-gram–based text classifier using a corpus of truthful and deceptive reviews—the former culled from online review communities and the latter generated using Amazon Mechanical Turk (http://www.mturk.com). Their resulting classifier is nearly 90% accurate.

In this work, we present a general framework (see Section 2) for estimating the prevalence of deception in online review communities. Given a classifier that distinguishes truthful from deceptive reviews (like that described above), and inspired by studies of disease prevalence [9, 10], we propose a generative model of deception (see Section 3) that jointly models the classifier's uncertainty as well as the ground-truth deceptiveness of each review. Inference for this model, which we perform via Gibbs sampling, allows us to estimate the prevalence of deception in the underlying review community, without relying on either self-reports or gold-standard annotations.

We further propose a theoretical component to the framework based on signaling theory from economics [18] (see Section 6) and use it to reason about the factors that influence deception prevalence in online review communities. In our context, signaling theory interprets each review as a signal to the product's true, unknown quality; thus, the goal of consumer reviews is to diminish the inherent information asymmetry between consumers and producers. Very briefly, according to a signaling theory approach, deception prevalence should be a function of the costs and benefits that accrue from producing a fake review. We hypothesize that review communities with low signaling cost, such as communities that make it easy to post a review, and large benefits, such as highly trafficked sites, will exhibit more deceptive opinion spam than those with higher signaling costs, such as communities that establish additional requirements for posting reviews, and lower benefits, such as low site traffic.

We apply our approach to the domain of hotel reviews. In particular, we examine hotels from the Chicago area, restricting attention to positive reviews only, and instantiate the framework for six online review communities (see Section 5): Expedia (http://www.expedia.com), Hotels.com (http://www.hotels.com), Orbitz (http://www.orbitz.com), Priceline (http://www.priceline.com), TripAdvisor (http://www.tripadvisor.com), and Yelp (http://www.yelp.com). We find first that the prevalence of deception indeed varies by community. However, because it is not possible to validate these estimates empirically (i.e., the gold-standard rate of deception in each community is unknown), we focus our discussion instead on the relative differences in the rate of deception between communities. Here, the results confirm our hypotheses and suggest that deception is most prevalent in communities with a low signal cost. Importantly, when measures are taken to increase a community's signal cost, we find dramatic reductions in our estimates of the rate of deception in that community.

2. FRAMEWORK
In this section, we propose a framework to estimate the prevalence, or rate, of deception among reviews in six online review communities. Since reviews in these communities do not have gold-standard annotations of deceptiveness, and neither human judgements nor self-reports of deception are reliable in this setting (see discussion in Section 1), our framework instead estimates the rates of deception in these communities using the output of an imperfect, automated deception classifier. In particular, we utilize a supervised machine learning classifier, which has been shown recently by Ott et al. [15] to be nearly 90% accurate at detecting deceptive opinion spam in a class-balanced dataset.

A similar framework has been used previously in studies of disease prevalence, in which gold-standard diagnostic testing is either too expensive, or impossible to perform [9, 10]. In such cases, it is therefore necessary to estimate the prevalence of disease in the population using a combination of an imperfect diagnostic test, and estimates of the test's positive and negative recall rates.⁴

³ The first review is deceptive opinion spam.
⁴ Recall rates of an imperfect diagnostic test are unlikely to be known precisely. However, imprecise estimates can often be obtained, especially in cases where it is feasible to perform gold-standard testing on a small subpopulation.
Our proposed framework is summarized here, with each step discussed in greater detail in the corresponding section:

1. Data (Section 5): Assume given a set of labeled training reviews, D^train = {(x_i, y_i)}_{i=1}^{N^train}, where, for each review i, y_i ∈ {0, 1} gives the review's label (0 for truthful, 1 for deceptive), and x_i ∈ R^|V| gives the review's feature vector representation, for some feature space of size |V|. Similarly, assume given a set of labeled truthful development reviews, D^dev = {(x_i, 0)}_{i=1}^{N^dev}, and a set of unlabeled test reviews, D^test = {x_i}_{i=1}^{N^test}.

2. Deception Classifier (Section 4.1): Using the labeled training reviews, D^train, learn a supervised deception classifier, f : R^|V| → {0, 1}.

3. Classifier Sensitivity and Specificity (Section 4.2): By cross-validation on D^train, estimate the sensitivity (deceptive recall) of the deception classifier, f, as:

   η = Pr(f(x_i) = 1 | y_i = 1).   (1)

   Then, use D^dev to estimate the specificity (truthful recall) of the deception classifier, f, as:

   θ = Pr(f(x_i) = 0 | y_i = 0).   (2)

4. Prevalence Models (Section 3): Finally, use f, η, θ, and either the Naïve Prevalence Model (Section 3.1), or the generative Bayesian Prevalence Model (Section 3.2), to estimate the prevalence of deception, denoted π, among reviews in D^test. Note that if we had gold-standard labels, {y_i}_{i=1}^{N^test}, the gold-standard prevalence of deception would be:

   π* = (1 / N^test) Σ_{i=1}^{N^test} y_i.   (3)

[Figure 1: The Bayesian Prevalence Model in plate notation, with nodes α, π*, y_i, β, η*, γ, θ*, and f(x_i), and a plate over the N^test reviews. Shaded nodes represent observed variables, and arrows denote dependence. For example, f(x_i) is observed, and depends on η*, θ*, and y_i.]

3. PREVALENCE MODELS
In Section 2, we propose a framework to estimate the prevalence of deception in a group of reviews using only the output of a noisy deception classifier. Central to this framework is the Prevalence Model, which models the uncertainty of the deception classifier, and ultimately produces the desired prevalence estimate. In this section, we propose two competing Prevalence Models, which can be used interchangeably in our framework.

3.1 Naïve Prevalence Model
The Naïve Prevalence Model (naïve) estimates the prevalence of deception in a corpus of reviews by correcting the output of a noisy deception classifier according to the classifier's known performance characteristics.

Formally, for a given deception classifier, f, let π_f be the fraction of reviews in D^test for which f makes a positive prediction, i.e., the fraction of reviews for which f predicts deceptive. Also, let the sensitivity (deceptive recall) and specificity (truthful recall) of f be given by η and θ, respectively. Then, we can write the expectation of π_f as:

   E[π_f] = E[(1 / N^test) Σ_{x ∈ D^test} δ[f(x) = 1]]
          = (1 / N^test) Σ_{x ∈ D^test} E[δ[f(x) = 1]]
          = η π* + (1 − θ)(1 − π*),   (4)

where π* is the true (latent) rate of deception, and δ[a = b] is the Kronecker delta function, which is equal to 1 when a = b, and 0 otherwise.

If we rearrange Equation 4 in terms of π*, and replace the expectation of π_f with the observed value, we get the Naïve Prevalence Model estimator:

   π_naïve = (π_f − (1 − θ)) / (η − (1 − θ)).   (5)

Intuitively, Equation 5 corrects the raw classifier output, given by π_f, by subtracting from it the false positive rate, given by 1 − θ, and dividing the result by the difference between the true and false positive rates, given by η − (1 − θ). Notice that when f is an oracle,⁵ i.e., when η = θ = 1, the Naïve Prevalence Model estimate correctly reduces to the oracle rate given by f, i.e., π_naïve = π_f = π*.

⁵ An oracle is a classifier that does not make mistakes, and always predicts the true, gold-standard label.
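To make the correction in Equation 5 concrete, the following minimal sketch applies the Naïve Prevalence Model estimator to a classifier's raw positive-prediction rate. It is written in Python, and the numeric inputs are illustrative values chosen for the example, not figures taken from the paper:

```python
def naive_prevalence(pi_f, eta, theta):
    """Naive Prevalence Model estimator (Equation 5).

    pi_f  -- observed fraction of test reviews the classifier labels deceptive
    eta   -- estimated sensitivity (deceptive recall) of the classifier
    theta -- estimated specificity (truthful recall) of the classifier
    """
    # Subtract the false positive rate (1 - theta), then rescale by the
    # difference between the true and false positive rates.
    return (pi_f - (1.0 - theta)) / (eta - (1.0 - theta))

# Illustrative values only: a classifier with 90% sensitivity and 89%
# specificity that flags 14% of the test reviews as deceptive.
print(naive_prevalence(pi_f=0.14, eta=0.90, theta=0.89))  # ~0.038

# The estimate is not restricted to [0, 1]: it goes negative whenever
# pi_f < 1 - theta, which motivates the Bayesian model in Section 3.2.
print(naive_prevalence(pi_f=0.08, eta=0.90, theta=0.89))  # negative
```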
3.2 Bayesian Prevalence Model
Unfortunately, the Naïve Prevalence Model estimate, π_naïve, is not restricted to the range [0, 1]. Specifically, it is negative when π_f < 1 − θ, and greater than 1 when π_f > η. Furthermore, the Naïve Prevalence Model makes the unrealistic assumption that the estimates of the classifier's sensitivity (η) and specificity (θ), obtained using the procedure discussed in Section 4.2 and Appendix B, are exact.
The Bayesian Prevalence Model (bayes) addresses these limitations by modeling the generative process through which deception occurs, or, equivalently, the joint probability distribution of the observed and latent data. In particular, bayes models the observed classifier output, the true (latent) rate of deception (π*), as well as the classifier's true (latent) sensitivity (η*) and specificity (θ*). Formally, bayes assumes that our data was generated according to the following generative story:

• Sample the true rate of deception: π* ∼ Beta(α)
• Sample the classifier's true sensitivity: η* ∼ Beta(β)
• Sample the classifier's true specificity: θ* ∼ Beta(γ)
• For each review i:
  – Sample the ground-truth deception label: y_i ∼ Bernoulli(π*)
  – Sample the classifier's output: f(x_i) ∼ Bernoulli(η*) if y_i = 1, and f(x_i) ∼ Bernoulli(1 − θ*) if y_i = 0

The corresponding graphical model is given in plate notation in Figure 1. Notice that by placing Beta prior distributions on π*, η*, and θ*, bayes enables us to encode our prior knowledge about the true rate of deception, as well as our uncertainty about the estimates of the classifier's sensitivity and specificity. This is discussed further in Section 4.2.

A similar model has been proposed by Joseph et al. [10] for studies of disease prevalence, in which it is necessary to estimate the prevalence of disease in a population given only an imperfect diagnostic test. However, that model samples the total number of true positives and false negatives, while our model samples the y_i individually. Accordingly, while pilot experiments confirm that the two models produce identical results, the generative story of our model, given above, is comparatively much more intuitive.

3.2.1 Inference
While exact inference is intractable for the Bayesian Prevalence Model, a popular alternative way of approximating the desired posterior distribution is with Markov Chain Monte Carlo (MCMC) sampling, and more specifically Gibbs sampling. Gibbs sampling works by sampling each variable, in turn, from the conditional distribution of that variable given all other variables in the model. After repeating this procedure for a fixed number of iterations, the desired posterior distribution can be approximated from samples in the chain by: (1) discarding a number of initial burn-in iterations, and (2) since adjacent samples in the chain are often highly correlated, thinning the number of remaining samples according to a sampling lag.

The conditional distributions of each variable given the others can be derived from the joint distribution, which can be read directly from the graph. Based on the graphical representation of bayes, given in Figure 1, the joint distribution of the observed and latent variables is just:

   Pr(f(x), y, π*, η*, θ*; α, β, γ) = Pr(f(x) | y, η*, θ*) · Pr(y | π*) · Pr(π* | α) · Pr(η* | β) · Pr(θ* | γ),   (6)

where each term is given according to the sampling distributions specified in the generative story in Section 3.2.

A common technique to simplify the joint distribution, and the sampling process, is to integrate out (collapse) variables that do not need to be sampled. If we integrate out π*, η*, and θ* from Equation 6, we can derive a Gibbs sampler that only needs to sample the y_i's at each iteration. The resulting sampling equations, and the corresponding Bayesian Prevalence Model estimate of the prevalence of deception, π_bayes, are given in greater detail in Appendix A.
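As a sanity check on the generative story above, the short sketch below simulates data forward from the model and verifies that the resulting positive-prediction rate behaves as Equation 4 predicts. It is written in Python/NumPy; the sample size and prior pseudo-counts are illustrative assumptions for this sketch, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative prior pseudo-counts, each given as <failures, successes> to
# match the <beta_0, beta_1> convention used in Appendix B.
alpha = (1.0, 1.0)     # uniform prior on the true rate of deception
beta = (10.0, 90.0)    # sensitivity prior, mean 0.9
gamma = (11.0, 89.0)   # specificity prior, mean 0.89
n_test = 5000

# Generative story from Section 3.2: draw the latent rates, then per-review
# ground-truth labels and classifier outputs.
pi_star = rng.beta(alpha[1], alpha[0])      # true rate of deception
eta_star = rng.beta(beta[1], beta[0])       # true sensitivity
theta_star = rng.beta(gamma[1], gamma[0])   # true specificity

y = rng.binomial(1, pi_star, size=n_test)                  # y_i ~ Bernoulli(pi*)
p_positive = np.where(y == 1, eta_star, 1.0 - theta_star)  # Pr(f(x_i) = 1 | y_i)
f_x = rng.binomial(1, p_positive)                          # observed classifier output

# Equation 4: E[pi_f] = eta* pi* + (1 - theta*)(1 - pi*); the empirical rate
# of positive predictions should be close to this value.
print(f_x.mean(), eta_star * pi_star + (1 - theta_star) * (1 - pi_star))
```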
4. DECEPTION DETECTION

4.1 Deception Classifier
The next component of the framework given in Section 2 is the deception classifier, which predicts whether each unlabeled review is truthful (real) or deceptive (fake). Following previous work [15], we assume given some amount of labeled training reviews, so that we can train deception classifiers using a supervised learning algorithm.

Previous work has shown that Support Vector Machines (SVM) trained on n-gram features perform well in deception detection tasks [8, 13, 15]. Following Ott et al. [15], we train linear SVM classifiers using the LIBSVM [2] software package, and represent reviews using unigram and bigram bag-of-words features. While more sophisticated and purpose-built classifiers might achieve better performance, pilot experiments suggest that the Prevalence Models (see Section 3) are not heavily affected by minor differences in classifier performance. Furthermore, the simple approach just outlined has been previously evaluated to be nearly 90% accurate at detecting deception in a balanced dataset [15]. Reference cross-validated classifier performance appears in Table 1.

Table 1: Reference 5-fold cross-validated performance of an SVM deception detection classifier in a balanced dataset of TripAdvisor reviews, given by Ott et al. [15]. F-score corresponds to the harmonic mean of precision and recall.

  metric               performance
  Accuracy             89.6%
  Deceptive Precision  89.1%
  Deceptive Recall     90.3%
  Deceptive F-score    89.7%
  Truthful Precision   90.1%
  Truthful Recall      89.0%
  Truthful F-score     89.6%
  Baseline Accuracy    50%
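A classifier along these lines can be approximated with off-the-shelf tools. The sketch below uses scikit-learn (our choice for illustration; the paper itself uses LIBSVM directly) to train a linear SVM on unigram and bigram bag-of-words features; the function names are our own:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

def train_deception_classifier(train_texts, train_labels, C=1.0):
    """Linear SVM over unigram + bigram bag-of-words features.

    train_labels: 0 = truthful, 1 = deceptive.
    """
    vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
    X_train = vectorizer.fit_transform(train_texts)
    # scikit-learn's SVC wraps LIBSVM; a linear kernel mirrors the setup above.
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_train, train_labels)
    return vectorizer, clf

def predict_deceptive(vectorizer, clf, texts):
    """Return 0/1 predictions (1 = deceptive) for a list of review texts."""
    return clf.predict(vectorizer.transform(texts))
```

Any comparable linear classifier over the same features could be substituted; as noted above, the Prevalence Models are not very sensitive to small differences in classifier accuracy.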
4.2 Classifier Sensitivity and Specificity
Both Prevalence Models introduced in Section 3 can utilize knowledge of the underlying deception classifier's sensitivity (η*), i.e., deceptive recall rate, and specificity (θ*), i.e., truthful recall rate. While it is not possible to obtain gold-standard values for these parameters, we can obtain rough estimates of their values (denoted η and θ, respectively) through a combination of cross-validation, and evaluation on a labeled development set. For the Naïve Prevalence Model, the estimates are used directly, and are assumed to be exact.

For the Bayesian Prevalence Model, we adopt an empirical Bayesian approach and use the estimates to inform the corresponding Beta priors via their hyperparameters, β and γ, respectively. The full procedure is given in Appendix B.

5. DATA
In this section, we briefly discuss each of the three kinds of data used by our framework introduced in Section 2. Corpus statistics are given in Table 2. Following Ott et al. [15], we excluded all reviews with fewer than 150 characters, as well as all non-English reviews.⁶

⁶ Language was identified by the Language Detection Library: http://code.google.com/p/language-detection/.

Table 2: Corpus statistics for unlabeled test reviews from six online review communities.

  community        # hotels   # reviews
  Expedia          100        4,341
  Hotels.com       103        6,792
  Orbitz           97         1,777
  Priceline        98         4,027
  TripAdvisor      104        9,602
  Yelp             103        1,537
  Mechanical Turk  20         400

Table 3: Signal costs associated with six online review communities, sorted approximately from highest signal cost to lowest. Posting cost is High if users are required to purchase a product before reviewing it, and Low otherwise. Exposure benefit is Low, Medium, or High based on the number of reviews in the community (see Table 2).

  community    posting cost   exposure benefit
  Orbitz       High           Low
  Priceline    High           Medium
  Expedia      High           Medium
  Hotels.com   High           Medium
  Yelp         Low            Low
  TripAdvisor  Low            High

5.1 Training Reviews (D^train)
Training a supervised deception classifier requires labeled training data. Following Ott et al. [15], we build a balanced set of 800 training reviews, containing 400 truthful reviews from six online review communities, and 400 gold-standard deceptive reviews from Amazon Mechanical Turk.

Deceptive Reviews: In Section 1, we discuss some of the difficulties associated with obtaining gold-standard labels of deception, including the inaccuracy of human judgements, and the problems with self-reports of deception. To avoid these difficulties, Ott et al. [15] have recently created 400 gold-standard deceptive reviews using Amazon's Mechanical Turk service. In particular, they paid one US dollar ($1) to each of 400 unique Mechanical Turk workers to write a fake positive (5-star) review for one of the 20 most heavily-reviewed Chicago hotels on TripAdvisor. Each worker was given a link to the hotel's website, and instructed to write a convincing review from the perspective of a satisfied customer. Any submission found to be plagiarized was rejected. Any submission with fewer than 150 characters was discarded. To date, this is the only publicly-available⁷ gold-standard deceptive opinion spam dataset. As such, we choose it to be our sole source of labeled deceptive reviews for training our supervised deception classifiers. Note that these same reviews are used to estimate the resulting classifier sensitivity (deceptive recall), via the cross-validation procedure given in Appendix B.

⁷ http://www.cs.cornell.edu/~myleott/op_spam

Truthful Reviews: Many of the same challenges that make it difficult to obtain gold-standard deceptive reviews, also apply to obtaining truthful reviews. Related work [8, 11] has hypothesized that the relative impact of spam reviews is smaller for heavily-reviewed products, and that therefore spam should be less common among them. For consistency with our labeled deceptive review data, we simply label as truthful all positive (5-star) reviews of the 20 previously chosen Chicago hotels. We then draw a random sample of size 400, and take that to be our labeled truthful training data.

5.2 Development Reviews (D^dev)
By training on deceptive and truthful reviews from the same 20 hotels, we are effectively controlling our classifier for topic. However, because this training data is not representative of Chicago hotel reviews in general, it is important that we do not use it to estimate the resulting classifier's specificity (truthful recall). Accordingly, as specified in our framework (Section 2), classifier specificity is instead estimated on a separate, labeled truthful development set, which we draw uniformly at random from the unlabeled reviews in each review community. For consistency with the sensitivity estimate, the size of the draw is always 400 reviews.

5.3 Test Reviews (D^test)
The last data component of our framework is the set of test reviews, among which to estimate the prevalence of deception. To avoid evaluating reviews that are too different from our training data in either sentiment (due to negative reviews), or topic (due to reviews of hotels outside Chicago), we constrain each community's test set to contain only positive (5-star) Chicago hotel reviews. This unfortunately disqualifies our estimates of each community's prevalence of deception from being representative of all hotel reviews. Notably, estimates of the prevalence of deception among negative reviews might be very different from our estimates, due to the distinct motives of posting deceptive positive vs. negative reviews. We discuss this further in Section 9.

6. SIGNAL THEORY
In terms of economic theory, the role of review communities is to reduce the inherent information asymmetry [18] between buyers and sellers in online marketplaces, by providing buyers with a priori knowledge of the underlying quality of the products being sold [7]. It follows that if reviews regularly failed to reduce this information asymmetry, or, worse, convey false information, then they would cease to be of value to the user.
[Figure 2: Graph of Naïve estimates of deception prevalence versus time, for six online review communities: (a) Orbitz, (b) Priceline, (c) Expedia, (d) Hotels.com, (e) Yelp, (f) TripAdvisor. Blue (a–d) and red (e–f) graphs correspond to high and low posting cost communities, respectively.]

Given that review communities are, in fact, valued by users [3], it seems unlikely that the prevalence of deception among them is large.

Nonetheless, there is widespread concern about the prevalence of deception in online reviews, rightly or wrongly, and further, deceptive reviews can be cause for concern even in small quantities, e.g., if they are concentrated in a single review community. We propose that by framing reviews as signals—voluntary communications that serve to convey information about the signaler [18], we can reason about the factors underlying deception by manipulating the distinct signal costs associated with truthful vs. deceptive reviews.

Specifically, we claim that for a positive review to be posted in a given review community, there must be an incurred signal cost, that is increased by:

1. The posting cost for posting the review in a given review community, i.e., whether users are required to purchase a product prior to reviewing it (high cost) or not (low cost). Some sites, for example, allow anyone to post reviews about any hotel, making the review cost effectively zero. Other sites, however, require the purchase of the hotel room before a review can be written, raising the cost from zero to the price of the room.

and decreased by:

2. The exposure benefit of posting the review in that review community, i.e., the benefit derived from other users reading the review, which is proportional to the size of the review community's audience. Review sites with more traffic have greater exposure benefit.

Observe that both the posting cost and the exposure benefit depend entirely on the review community. An overview of these factors for each of the six review communities is given in Table 3.

Based on the signal cost function just defined, we propose two hypotheses:

• Hypothesis 1: Review communities that have low signal costs (low posting requirements, high exposure), e.g., TripAdvisor and Yelp, will have more deception than communities with high signal costs, e.g., Orbitz.

• Hypothesis 2: Increasing the signal cost will decrease the prevalence of deception.

7. EXPERIMENTAL SETUP
The framework described in Section 2 is instantiated for the six review communities introduced in Section 5. In particular, we first train our SVM deception classifier following the procedure outlined in Section 4.1. An important step when training SVM classifiers is setting the cost parameter, C. We set C using a nested 5-fold cross-validation procedure, and choose the value that gives the best average balanced accuracy, defined as (1/2)(sensitivity + specificity).

We then estimate the classifier's sensitivity, specificity, and hyperparameters, using the procedure outlined in Section 4.2 and Appendix B. Based on those estimates, we then estimate the prevalence of deception among reviews in our test set using the Naïve and the Bayesian Prevalence Models. Gibbs sampling for the Bayesian Prevalence Model is performed using Equations 7 and 8 (given in Appendix A) for 70,000 iterations, with a burn-in of 20,000 iterations, and a sampling lag of 50. We use an uninformative (uniform) prior for π*, i.e., α = ⟨1, 1⟩. Multiple runs are performed to verify the stability of the results.
[Figure 3: Graph of Bayesian estimates of deception prevalence versus time, for six online review communities: (a) Orbitz, (b) Priceline, (c) Expedia, (d) Hotels.com, (e) Yelp, (f) TripAdvisor. Blue (a–d) and red (e–f) graphs correspond to high and low posting cost communities, respectively. Error bars show Bayesian 95% credible intervals.]

8. RESULTS AND DISCUSSION
Estimates of the prevalence of deception for six review communities over time, given by the Naïve Prevalence Model, appear in Figure 2. Blue graphs (a–d) correspond to communities with High posting cost (see Table 3), i.e., communities for which you are required to book a hotel room before posting a review, while red graphs (e–f) correspond to communities with Low posting cost, i.e., communities that allow any user to post reviews for any hotel.

In agreement with Hypothesis 1 (given in Section 6), it is clear from Figure 2 that deceptive opinion spam is decreasing or stationary over time for High posting cost review communities (blue graphs, a–d). In contrast, review communities that allow any user to post reviews for any hotel, i.e., Low posting cost communities (red graphs, e–f), are seeing growth in their rate of deceptive opinion spam.

Unfortunately, as discussed in Section 3.1, we observe that the prevalence estimates produced by the Naïve Prevalence Model are often negative. This occurs when the rate at which the classifier makes positive predictions is below the classifier's estimated false positive rate, suggesting both that the estimated false positive rate of the classifier is perhaps overestimated, and that the classifier's estimated specificity (truthful recall rate, given by θ) is perhaps underestimated. We address this further in Section 8.1.

The Bayesian Prevalence Model, on the other hand, encodes the uncertainty in the estimated values of the classifier's sensitivity and specificity through two Beta priors, and in particular their hyperparameters, β and γ. Estimates of the prevalence of deception for the six review communities over time, given by the Bayesian Prevalence Model, appear in Figure 3. Blue (a–d) and red (e–f) graphs, as before, correspond to communities with High and Low posting costs, respectively.

In agreement with Hypothesis 1 (Section 6), we again find that Low signal cost communities, e.g., TripAdvisor, seem to contain larger quantities and accelerated growth of deceptive opinion spam when compared to High signal cost communities, e.g., Orbitz. Interestingly, communities with a blend of signal costs appear to have medium rates of deception that are neither growing nor declining, e.g., Hotels.com, which has a rate of deception of ≈ 2%.

To test Hypothesis 2, i.e., that increasing the signal cost will decrease the prevalence of deception, we need to increase the signal cost, as we have defined it in Section 6. Thus, it is necessary to either increase the posting cost, or decrease the exposure benefit. And while we have no control over a community's exposure benefit, we can increase the posting cost by, for example, hiding all reviews written by users who have not posted at least two reviews. Essentially, by requiring users to post more than one review in order for their review to be displayed, we are increasing the posting cost and, accordingly, the signal cost as well.

Bayesian Prevalence Model estimates for TripAdvisor for varying signal costs appear in Figure 4. In particular, we give the estimated prevalence of deception over time after removing reviews written by first-time review writers, and after removing reviews written by first- or second-time review writers. In agreement with Hypothesis 2, we see a clear reduction in the prevalence of deception over time on TripAdvisor after removing these reviews, with rates dropping from ≈ 6%, to ≈ 5%, and finally to ≈ 4%, suggesting that an increased signal cost may indeed help to reduce the prevalence of deception in online review communities.
[Figure 4: Graph of Bayesian estimates of deception prevalence versus time, for TripAdvisor, with reviews written by new users excluded: (a) all reviews, (b) first-time reviewers excluded, (c) first-time and second-time reviewers excluded. Excluding reviews written by first- or second-time reviewers increases the signal cost, and decreases the prevalence of deception.]

8.1 Assumptions and Limitations
In this work we have made a number of assumptions, a few of which we will now highlight and discuss.

First, we note that our unlabeled test set, D^test, overlaps with our labeled truthful training set, D^train. Consequently, we will underestimate the prevalence of deception, because the overlapping reviews will be more likely to be classified at test time as truthful, having been seen in training as being truthful. Excluding these overlapping reviews from the test set results in overestimating the prevalence of deception, based on the hypothesis that the overlapping reviews, chosen from the 20 most highly-reviewed Chicago hotels, are more likely to be truthful to begin with.

Second, we observe that our development set, D^dev, containing labeled truthful reviews, is not gold-standard. Unfortunately, while it is necessary to obtain a uniform sample of reviews in order to fairly estimate the classifier's truthful recall rate (specificity), such review samples are inherently unlabeled. This can be problematic if the underlying rate of deception is high among the reviews from which the development set is sampled, because the specificity will then be underestimated. Indeed, our Naïve Prevalence Model regularly produces negative estimates, suggesting that the estimated classifier specificity may indeed be underestimated, possibly due to deceptive reviews in the development set.

Third, our proposal for increasing the signal cost, by hiding reviews written by first- or second-time reviewers, is not ideal. While our results confirm that hiding these reviews will cause an immediate reduction in deception prevalence, the increase in signal cost might be insufficient to discourage new deception, once deceivers become aware of the increased posting requirements.

Fourth, in this work we have only considered a limited version of the deception prevalence problem. In particular, we have only considered positive Chicago hotel reviews, and our classifier is trained on deceptive reviews coming only from Amazon Mechanical Turk. Both negative reviews as well as deceptive reviews obtained by other means are likely to be different in character than the data used in this study.

8.2 Implications for Psychological Research
The current research also represents a novel approach to a long-standing and ongoing debate around deception prevalence in the psychological literature. In one of the first large-scale studies looking at how often people lie in everyday communication, DePaulo et al. [4] used a diary method to calculate the average number of lies told per day. At the end of seven days participants told approximately one to two lies per day, with more recent studies replicating this general finding [6], suggesting that deception is frequent in human communication. More recently, Serota et al. [17] conducted a large scale representative survey of Americans asking participants how often they lied in the last 24 hours. While they found the same average deception rate as previous research (approximately 1.65 lies per day), they discovered that the data was heavily skewed, with 60 percent of the participants reporting no lies at all. They concluded that rather than deception prevalence being spread evenly across the population, there are instead a few prolific liars. Unfortunately, both sides of this debate have relied solely on self-report data.

The current approach offers a novel method for assessing deception prevalence that does not require self-report, but can provide insight into the prevalence of deception in human communication more generally. At the same time, the question raised by the psychological research also mirrors an important point regarding the prevalence of deception in online reviews: are a few deceptive reviews posted by many people, or are there many deceptive reviews told by only a few? That is, do some hotels have many fake reviews while others are primarily honest? Or, is there a little bit of cheating by most hotels? This kind of individualized modeling represents an important next step in this line of research.

9. CONCLUSION
In this work, we have presented a general framework for estimating the prevalence of deception in online review communities, based on the output of a noisy deception classifier. Using this framework, we have explored the prevalence of deception among positive reviews in six popular online review communities, and provided the first empirical study of the magnitude, and influencing factors of deceptive opinion spam.

We have additionally proposed a theoretical model of online reviews as a signal to a product's true (unknown) quality, based on economic signaling theory.
Specifically, we have defined the signal cost of positive online reviews as a function of the posting costs and exposure benefits of the review community in which it is posted. Based on this theory, we have further suggested two hypotheses, both of which are supported by our findings. In particular, we find first that review communities with low signal costs (low posting requirements, high exposure) have more deception than communities with comparatively higher signal costs. Second, we find that by increasing the signal cost of a review community, e.g., by excluding reviews written by first- or second-time reviewers, we can effectively reduce both the prevalence and the growth rate of deception in that community.

Future work might explore other methods for manipulating the signal costs associated with posting online reviews, and the corresponding effects on deception prevalence. For example, some sites, such as Angie's List (http://www.angieslist.com/), charge a monthly access fee in order to browse or post reviews, and future work might study the effectiveness of such techniques at deterring deception.

10. ACKNOWLEDGMENTS
This work was supported in part by National Science Foundation Grant NSCC-0904913, and the Jack Kent Cooke Foundation. We also thank, alphabetically, Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bin Lu, Karthik Raman, Lu Wang, and Ainur Yessenalina, as well as members of the Cornell NLP seminar group and the WWW reviewers for their insightful comments, suggestions and advice on various aspects of this work.

11. REFERENCES
[1] C. Bond and B. DePaulo. Accuracy of deception judgments. Personality and Social Psychology Review, 10(3):214, 2006.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[3] Cone. 2011 Online Influence Trend Tracker. Online: http://www.coneinc.com/negative-reviews-online-reverse-purchase-decisions, August 2011.
[4] B. DePaulo, D. Kashy, S. Kirkendol, M. Wyer, and J. Epstein. Lying in everyday life. Journal of Personality and Social Psychology, 70(5):979, 1996.
[5] J. Hancock. Digital deception: The practice of lying in the digital age. Deception: Methods, Contexts and Consequences, pages 109–120, 2009.
[6] J. Hancock, J. Thom-Santelli, and T. Ritchie. Deception and design: The impact of communication technology on lying behavior. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 129–134. ACM, 2004.
[7] N. Hu, L. Liu, and J. Zhang. Do online reviews affect product sales? The role of reviewer characteristics and temporal effects. Information Technology and Management, 9(3):201–214, 2008.
[8] N. Jindal and B. Liu. Opinion spam and analysis. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 219–230. ACM, 2008.
[9] W. Johnson, J. Gastwirth, and L. Pearson. Screening without a "gold standard": The Hui-Walter paradigm revisited. American Journal of Epidemiology, 153(9):921, 2001.
[10] L. Joseph, T. Gyorkos, and L. Coupal. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. American Journal of Epidemiology, 141(3):263, 1995.
[11] E. Lim, V. Nguyen, N. Jindal, B. Liu, and H. Lauw. Detecting product review spammers using rating behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 939–948. ACM, 2010.
[12] D. Meyer. Fake reviews prompt Belkin apology. http://news.cnet.com/8301-1001_3-10145399-92.html, Jan. 2009.
[13] R. Mihalcea and C. Strapparava. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 309–312. Association for Computational Linguistics, 2009.
[14] C. Miller. Company settles case of reviews it faked. http://www.nytimes.com/2009/07/15/technology/internet/15lift.html, July 2009.
[15] M. Ott, Y. Choi, C. Cardie, and J. Hancock. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 309–319. Association for Computational Linguistics, 2011.
[16] B. Page. Amazon withdraws ebook explaining how to manipulate its sales rankings. http://www.guardian.co.uk/books/2011/jan/05/amazon-ebook-manipulate-kindle-rankings, Jan. 2011.
[17] K. Serota, T. Levine, and F. Boster. The prevalence of lying in America: Three studies of self-reported lies. Human Communication Research, 36(1):2–25, 2010.
[18] M. Spence. Job market signaling. The Quarterly Journal of Economics, 87(3):355, 1973.
[19] D. Streitfeld. In a race to out-rave, 5-star web reviews go for $5. http://www.nytimes.com/2011/08/20/technology/finding-fake-reviews-online.html, Aug. 2011.
[20] D. Streitfeld. For $2 a star, an online retailer gets 5-star product reviews. http://www.nytimes.com/2012/01/27/technology/for-2-a-star-a-retailer-gets-5-star-reviews.html, Jan. 2012.
[21] A. Topping. Historian Orlando Figes agrees to pay damages for fake reviews. http://www.guardian.co.uk/books/2010/jul/16/orlando-figes-fake-amazon-reviews, July 2010.
[22] A. Vrij. Detecting lies and deceit: Pitfalls and opportunities. Wiley-Interscience, 2008.
[23] K. Yoo and U. Gretzel. Comparison of deceptive and truthful travel reviews. Information and Communication Technologies in Tourism 2009, pages 37–47, 2009.
APPENDIX

A. GIBBS SAMPLER FOR BAYESIAN PREVALENCE MODEL
Gibbs sampling of the Bayesian Prevalence Model, introduced in Section 3.2, is performed according to the following conditional distributions:

   Pr(y_i = 1 | f(x), y_(−i); α, β, γ) ∝ (α_1 + N_1^(−i)) · (β_{f(x_i)} + X_{f(x_i)}^(−i)) / (Σβ + N_1^(−i)),   (7)

and,

   Pr(y_i = 0 | f(x), y_(−i); α, β, γ) ∝ (α_0 + N_0^(−i)) · (γ_{1−f(x_i)} + Y_{f(x_i)}^(−i)) / (Σγ + N_0^(−i)),   (8)

where,

   X_k^(−i) = Σ_{j≠i} σ[y_j = 1] · σ[f(x_j) = k],
   Y_k^(−i) = Σ_{j≠i} σ[y_j = 0] · σ[f(x_j) = k],
   N_1^(−i) = X_0^(−i) + X_1^(−i),
   N_0^(−i) = Y_0^(−i) + Y_1^(−i).

After sampling, we reconstruct the collapsed variables to yield the Bayesian Prevalence Model estimate of the prevalence of deception:

   π_bayes = (α_1 + N_1) / (Σα + N^test).   (9)

Estimates of the classifier's sensitivity and specificity are similarly given by:

   η_bayes = (β_1 + X_1) / (Σβ + N_1),   (10)
   θ_bayes = (γ_1 + Y_0) / (Σγ + N_0).   (11)

B. ESTIMATING CLASSIFIER SENSITIVITY AND SPECIFICITY
We estimate the sensitivity and specificity of our deception classifier via the following procedure:

1. Assume given a labeled training set, D^train, containing N^train reviews of n hotels. Also assume given a development set, D^dev, containing labeled truthful reviews.

2. Split D^train into n folds, D_1^train, ..., D_n^train, of sizes given by N_1^train, ..., N_n^train, respectively, such that D_j^train contains all (and only) reviews of hotel j. Let D_(−j)^train contain all reviews except those of hotel j.

3. Then, for each hotel j:
   (a) Train a classifier, f_j, from reviews in D_(−j)^train, and use it to classify reviews in D_j^train.
   (b) Let |TP|_j correspond to the observed number of true positives, i.e.:

       |TP|_j = Σ_{(x,y) ∈ D_j^train} σ[y = 1] · σ[f_j(x) = 1].   (12)

   (c) Similarly, let |FN|_j correspond to the observed number of false negatives.

4. Calculate the aggregate number of true positives (|TP|) and false negatives (|FN|), and compute the sensitivity (deceptive recall) as:

       η = |TP| / (|TP| + |FN|).   (13)

5. Train a classifier using all reviews in D^train, and use it to classify reviews in D^dev.

6. Let the resulting number of true negative and false positive predictions in D^dev be given by |TN|^dev and |FP|^dev, respectively, and compute the specificity (truthful recall) as:

       θ = |TN|^dev / (|TN|^dev + |FP|^dev).   (14)

For the Bayesian Prevalence Model, we observe that the posterior distribution of a variable with an uninformative (uniform) Beta prior, after observing a successes and b failures, is just Beta(a + 1, b + 1), i.e., a and b are pseudo counts. Based on this observation, we set the hyperparameters β and γ, corresponding to the classifier's sensitivity (deceptive recall) and specificity (truthful recall), respectively, to:

   β = ⟨|FN| + 1, |TP| + 1⟩,
   γ = ⟨|FP|^dev + 1, |TN|^dev + 1⟩.
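A compact sketch of the collapsed Gibbs sampler defined by Equations 7–11, together with the hyperparameter settings from Appendix B, is given below (Python/NumPy). The default iteration counts are the ones reported in Section 7, while the function and variable names are our own; this is an illustrative implementation under those assumptions, not the authors' code:

```python
import numpy as np

def bayes_prevalence(f_x, tp, fn, fp_dev, tn_dev, alpha=(1.0, 1.0),
                     n_iter=70000, burn_in=20000, lag=50, seed=0):
    """Collapsed Gibbs sampler for the Bayesian Prevalence Model (Appendix A).

    f_x            -- array of 0/1 classifier outputs on the unlabeled test reviews
    tp, fn         -- aggregate true positives / false negatives from cross-validation
    fp_dev, tn_dev -- false positives / true negatives on the development set
    alpha          -- Beta prior <alpha_0, alpha_1> on pi*; (1, 1) is uniform
    Returns the posterior mean of pi_bayes (Equation 9) over the retained samples.
    """
    rng = np.random.default_rng(seed)
    f_x = np.asarray(f_x, dtype=int)
    n = len(f_x)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.array([fn + 1.0, tp + 1.0])           # <beta_0, beta_1> (Appendix B)
    gamma = np.array([fp_dev + 1.0, tn_dev + 1.0])  # <gamma_0, gamma_1> (Appendix B)

    y = rng.integers(0, 2, size=n)                  # random initial labels
    # X[k] = #{j : y_j = 1, f(x_j) = k};  Y[k] = #{j : y_j = 0, f(x_j) = k}
    X = np.array([np.sum((y == 1) & (f_x == k)) for k in (0, 1)], dtype=float)
    Y = np.array([np.sum((y == 0) & (f_x == k)) for k in (0, 1)], dtype=float)

    kept = []
    for it in range(n_iter):
        for i in range(n):
            k = f_x[i]
            # Remove review i from the count statistics.
            if y[i] == 1:
                X[k] -= 1
            else:
                Y[k] -= 1
            n1, n0 = X.sum(), Y.sum()
            # Equations 7 and 8 (unnormalized).
            p1 = (alpha[1] + n1) * (beta[k] + X[k]) / (beta.sum() + n1)
            p0 = (alpha[0] + n0) * (gamma[1 - k] + Y[k]) / (gamma.sum() + n0)
            y[i] = 1 if rng.random() < p1 / (p1 + p0) else 0
            # Add review i back.
            if y[i] == 1:
                X[k] += 1
            else:
                Y[k] += 1
        if it >= burn_in and (it - burn_in) % lag == 0:
            # Equation 9: reconstruct pi_bayes from the current label counts.
            kept.append((alpha[1] + y.sum()) / (alpha.sum() + n))
    return float(np.mean(kept))
```

As written, the double loop favors clarity over speed; for corpora the size of Table 2 one would vectorize the inner loop, or reduce n_iter while prototyping.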