Consumer recommendation dynamics in online retail business under logistic regression and naïve Bayes analyses
Irina GEORGESCU
Bucharest University of Economics, Bucharest, Romania
irina.georgescu@csie.ase.ro

Jani KINNUNEN
Åbo Akademi University, Turku, Finland
jani.kinnunen@abo.fi

Abstract. Competitive businesses need to study the behavior of their current and potential customer base. Relevant behavioral data can be obtained online, where purchase decisions are increasingly made, often based on product reviews, ratings and recommendations available in social media networks. The original data consists of 23486 customer reviews from an online retail clothing business, with ten variables/features describing the reviewing customers, the products under review and the feedback on the reviews; about half of the dataset is analyzed after cleaning. To find out which features are the most important factors leading to a recommendation, the naïve Bayes and logistic regression methods are applied. Earlier research has shown that the sentiment of textual reviews and the given numerical ratings are key factors in the decision to recommend or not recommend products. The focus of this paper is to identify and rank-order the most relevant (numerical) factors affecting the review process leading to a recommendation. After applying the logistic regression classifier, we found that rating, positive feedback count and age are statistically significant factors, in that order. The results support online retailers, as well as manufacturers, in adjusting their product portfolios and marketing efforts to obtain recommendations for their products, reach potential customers and expose them to the given recommendations, leading to positive purchase decisions. Further, the results indicate some future research opportunities.

Keywords: consumers, social media, logistic regression, naïve Bayes, ROC curve.
10.2478/icas-2021-0012, pp 120-128, ISSN 2668-6309 | Proceedings of the 14th International Conference on Applied Statistics 2020 | No 1, 2020

Introduction

In this paper we discuss the role of social media in developing successful businesses based on consumers’ opinions. The paper continues previous research by Androniceanu et al. (2020a), where we studied the same issue with sentiment analysis and lexicon-based approaches. Here we use the same dataset, a public dataset called Women’s Clothing E-Commerce Review collected by Nick Brooks (Brooks, 2018a, 2018b), in order to analyze customers’ reviews of fashion items. The original dataset contains 23486 customer reviews and 10 variables, both textual and numerical. Previous research on this dataset has also been done by Agarap (2020), who performed a sentiment analysis and implemented a bidirectional neural network for sentiment classification. After cleaning the dataset, we worked with almost half of the data.

In this paper we apply supervised methods, namely binomial logistic regression and the naïve Bayes classifier, in order to find the best prediction model. We consider RecommendedIND as the class variable, having two values: 1 if the review is positive and 0 if the review is negative. The predictors considered in this approach are Rating (a qualitative variable having values from 1 to 5), Age of consumers and PositiveFeedbackCount (a numerical variable counting the positive reviews of a fashion item). For each classifier we divide the dataset into two subsets: the training set, representing 75% of the data, and the test set, representing 25% of the data. The classifier is trained on the training set and then run on the test set to predict the class membership of the customers’ reviews (1 = recommended, 0 = not recommended). The real class and the predicted class are compared by means of the confusion matrix. The ROC curve is drawn and the area under the curve (AUC) is determined in order to assess the classifier’s performance.

Literature review

With the existence of big datasets, consumers have found that social networks can bring flows of useful information for forming an opinion on the latest products and services. In their turn, companies can make use of social media platforms and technology available worldwide, such as Facebook, Google, Instagram, Spotify and Twitter, but also of those provided by marketing companies (Andzulis et al., 2012). Such online platforms influence the behavior of sellers and buyers, the sales rules and sales practices. Vaitkevicius et al. (2019) showed how the ease of shopping and the availability of product information and reviews, along with pricing, drive demand growth in online shopping. Fotiadis and Stylos (2017) investigated online experience factors and the influence of social platforms on the use of information about parks and on online sales. They built a questionnaire about visitors’ satisfaction with the E-da World theme park in Taiwan and applied the theoretical framework of TAM (Technological Acceptance Model). Chen et al. (2010) studied mobile phone customer satisfaction and experience as a basis for recommendations; they view a customer not only as a service user but also as a partner of the service provider, due to the importance of customer recommendations to potential customers.

Product and service recommendations are relevant performance indicators for companies whose products are exposed to reviews, and they are also important for other potential customers reading the reviews and recommendations (cf. Siering et al., 2018). However, consumers differ in how much weight they give to recommendations; e.g., Androniceanu et al. (2020b) found that customers in richer countries gave less weight to reviews in their purchase decisions, unlike customers shopping online. Further, although major companies use e-commerce, there are regions with a limited volume of online transactions, such as the Arab world. A study by Chivandi et al. (2018) asserts that 47% of small businesses do not frequently use social media platforms, while 25% do not use them at all. Chivandi et al. (2019) discussed how social media platforms, through brand awareness strategies, shape consumer behavior in online purchases.

Siering et al. (2018) studied online recommendations focusing on airline service quality and reviews, and on which factors in customer responses, i.e. the textual reviews, can predict whether customers will recommend the airline or not. They used sentiment analysis. Similarly, Androniceanu et al. (2020a) studied an online clothing business using sentiment analysis of the textual data of the given reviews. The key factor in predicting the decision to give a recommendation or not was the positive sentiment score of each review together with the given rating. Next, we extend the study of Androniceanu et al. (2020a) using the same dataset.
Methodology

In this section we briefly present the two classifiers used for this dataset: binomial logistic regression and naïve Bayes.

Binomial logistic regression

Following Sperandei (2014) and McHugh (2009), we briefly present the binomial logistic regression model. For simplicity, consider a logistic model with two predictors $x_1, x_2$ and a binary dependent variable $Y$, whose probability of success is denoted by $p = P(Y=1)$ and whose probability of failure is $1-p = P(Y=0)$. The parameters $b_0, b_1, b_2$ of the model are estimated from the data. The log odds of the event that the response variable takes the value 1 is $\ln\frac{p}{1-p}$, and we assume a linear relationship between the log odds and the predictors $x_1, x_2$. The logistic model is represented by this linear relationship:

$$\ln\frac{p}{1-p} = b_0 + b_1 x_1 + b_2 x_2,$$

equivalent to

$$\frac{p}{1-p} = e^{b_0 + b_1 x_1 + b_2 x_2},$$

from which the probability of success $p$ is derived:

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2)}}.$$

Log odds are difficult to interpret, therefore one computes the exponentials of the coefficients $b_0, b_1, b_2$. The exponential of a coefficient gives the factor by which the odds are multiplied when the associated predictor increases by one unit.

Naïve Bayes classifier

Following Mitchell (2020), a naïve Bayes classifier refers to a joint distribution over a response variable $Y$ and a set of known random variables $X_1, \ldots, X_n$. The random variables are assumed conditionally independent given the label $Y$:

$$P(X_1, \ldots, X_n, Y) = P(Y) \prod_i P(X_i \mid Y).$$

According to Bayes’ rule, the probability that $Y$ takes the value $y_k$ is:

$$P(Y = y_k \mid X_1, \ldots, X_n) = \frac{P(Y = y_k)\, P(X_1, \ldots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \ldots, X_n \mid Y = y_j)} = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)} \qquad (1)$$

Given a new observation $X = (X_1, \ldots, X_n)$, and estimating the distributions $P(Y)$ and $P(X_i \mid Y)$ from the training set, one can compute the probability that $Y$ takes any value $y_k$.
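As a minimal sketch of how equation (1) can be evaluated for numeric predictors, the Python snippet below assumes Gaussian class-conditional distributions $P(X_i \mid Y)$ (the form reported for the fitted naïve Bayes model in the results). The function names and all parameter values are hypothetical and chosen only for illustration; they are not estimates from the paper.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def naive_bayes_posterior(x, priors, params):
    """Posterior P(Y = y_k | x) following equation (1).

    x      : list of feature values (X_1, ..., X_n)
    priors : dict mapping class label -> P(Y = label)
    params : dict mapping class label -> list of (mean, sd), one per feature
    """
    joint = {}
    for label, prior in priors.items():
        likelihood = 1.0
        # Conditional independence: multiply the per-feature densities.
        for value, (mean, sd) in zip(x, params[label]):
            likelihood *= gaussian_pdf(value, mean, sd)
        joint[label] = prior * likelihood
    total = sum(joint.values())  # denominator of equation (1)
    return {label: value / total for label, value in joint.items()}

# Hypothetical priors and (mean, sd) pairs for two features (rating, age).
priors = {0: 0.2, 1: 0.8}
params = {0: [(2.5, 1.0), (42.0, 12.0)],
          1: [(4.5, 0.7), (43.0, 12.0)]}

posterior = naive_bayes_posterior([5, 40], priors, params)
predicted = max(posterior, key=posterior.get)  # class with the largest posterior
print(posterior, predicted)
```

The posteriors sum to one by construction, and picking the class with the largest posterior corresponds to choosing the most probable label for the new observation.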
The naïve Bayes classification rule (Mitchell, 2020) is:

$$Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}. \qquad (2)$$

Rule (2) can be simplified (Mitchell, 2020), since the denominator does not depend on the value $y_k$:

$$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k).$$

The ROC (receiver operating characteristic) curve and the AUC (area under the curve) measure the performance of the classifier. The ROC curve summarizes the confusion matrix at all threshold values. AUC takes values in the unit interval, indicating how well the classifier separates the positive and negative classes. AUC=1 indicates a perfect classifier, AUC=0.5 corresponds to random guessing, and AUC=0 indicates that all predictions are wrong. Both axes of the ROC curve range between 0 and 1. Specificity measures the proportion of negative class observations correctly predicted as negative. Sensitivity measures the proportion of positive class observations correctly predicted as positive. The Oy axis measures the sensitivity, while the Ox axis measures 1-specificity. The kappa statistic measures how far the observed accuracy exceeds the accuracy expected by chance, for example from always selecting the most common class. Kappa is at most 1, and values near 0 indicate agreement no better than chance.

Results and discussions

In the data analysis we remove the identifiers Nr and ClothingID, since they carry no predictive information and would only confuse the machine learning algorithms. The training set contains 5695 rows (75%), while the test set contains 2847 rows (25%). The formula on which each classifier is built is RecommendedIND ~ Rating + Age + PositiveFeedbackCount.

Binomial logistic regression

The logistic regression model is deduced from the following output:

Figure no. 1. Logistic regression output
Source: own calculations.
We remark that the response variable RecommendedIND is a factor variable with two levels, 0 and 1. Thus in our case we estimate the probability of having a positive review. The output in Figure 1 includes measures of fit such as the summary of deviance residuals and the AIC (Akaike Information Criterion). The coefficients of each predictor and the intercept are also included. The coefficients of Rating and Age are positive and statistically significant, at the significance levels of 0.001 and 0.1, respectively. The coefficient of PositiveFeedbackCount is negative and significant at the level of 0.05. A one-unit increase in Rating or Age increases the log odds by 3.44 and 0.009, respectively. A one-unit increase in PositiveFeedbackCount decreases the log odds by 0.02. The log odds are difficult to interpret, therefore the exponentials of the coefficients are computed.

Figure no. 2. Example of the logistic regression coefficients
Source: own computation.

If Rating increases by 1, the odds of the review being positive are multiplied by about 31.2. If Age increases by 1 year, the odds of the review being positive are multiplied by 1.0091, while if PositiveFeedbackCount increases by 1, the odds are multiplied by 0.979, i.e. they decrease by about 2.1%. The confusion matrix for the test set is depicted in Figure 3.

Figure no. 3. Confusion matrix for logistic regression classifier
Source: own computation.

Out of 503 negative reviews, 477 have been correctly predicted as negative. Out of 2344 positive reviews, 2188 have been correctly predicted as positive. The classifier accuracy is 93.6%. The kappa value of 80.1% is high, indicating agreement well beyond chance.
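The figures quoted above can be checked with a few lines of arithmetic. The sketch below exponentiates the rounded coefficients given in the text and recomputes accuracy and Cohen's kappa from the confusion matrix counts. The off-diagonal counts are not stated explicitly; they are inferred here from the stated class totals (503 negative and 2344 positive reviews), which is an assumption. Function and variable names are illustrative.

```python
import math

# Rounded coefficients as quoted in the text (log odds scale).
coefficients = {"Rating": 3.44, "Age": 0.009, "PositiveFeedbackCount": -0.02}

# Exponentiating gives the multiplicative change in the odds per unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

def accuracy_and_kappa(tn, fp, fn, tp):
    """Accuracy and Cohen's kappa for a 2x2 confusion matrix."""
    n = tn + fp + fn + tp
    observed = (tn + tp) / n
    # Agreement expected by chance from the row and column totals.
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# Correct predictions quoted in the text; errors inferred from class totals.
tn, tp = 477, 2188
fp = 503 - tn    # negative reviews predicted as positive (inferred)
fn = 2344 - tp   # positive reviews predicted as negative (inferred)

accuracy, kappa = accuracy_and_kappa(tn, fp, fn, tp)
print(odds_ratios)
print(round(accuracy, 3), round(kappa, 3))
```

With the rounded coefficients the odds ratios come out at roughly 31.2, 1.009 and 0.980; the small differences from the reported 1.0091 and 0.979 are consistent with the coefficients having been rounded in the text. The recomputed accuracy is about 0.936 and kappa about 0.80, in line with the reported values.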
Figure no. 4. ROC and AUC for logistic regression classifier
Source: own computation.

In Figure 4, AUC=0.941, meaning that a randomly chosen positive review receives a higher predicted probability than a randomly chosen negative review about 94.1% of the time.

Naïve Bayes classifier

The model computes the a-priori probabilities that indicate the class distribution of the data. Since the three predictors are numeric, we obtain the means (the first column in Figure 5) and the standard deviations (the second column in Figure 5) of the conditional Gaussian distributions.

Figure no. 5. Summary of the naïve Bayes model
Source: own computation.
The next step is to predict the class of each review in the test set. As the threshold for classifying a review as positive or negative, we declare a review positive when the probability of its being positive is greater than 0.75.

Figure no. 6. Confusion matrix for naïve Bayes classifier
Source: own computation.

Out of 503 negative reviews, 477 have been correctly predicted as negative. Out of 2344 positive reviews, 2176 have been correctly predicted as positive. The classifier accuracy is 93.2%. The kappa value of 78.9% is high.

Figure no. 7. ROC and AUC for naïve Bayes classifier
Source: own computation.
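The AUC reported with the ROC curves can be read as a ranking probability: the share of (positive, negative) pairs in which the positive review receives the higher score. The sketch below computes it that way and also applies the 0.75 probability threshold used for the naïve Bayes predictions. The scores and labels are hypothetical toy data, and the function names are illustrative rather than taken from the paper.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def classify(scores, threshold=0.75):
    """Label a review positive (1) when its score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]

# Hypothetical predicted probabilities and true labels (toy data).
scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

auc = auc_by_ranking(scores, labels)
predicted = classify(scores)
print(auc, predicted)
```

Note that the AUC does not depend on the chosen threshold, whereas the confusion matrix, accuracy and kappa do; this is why the two kinds of measures are reported side by side.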
In Figure 7, AUC=0.938, meaning that a randomly chosen positive review receives a higher predicted probability than a randomly chosen negative review about 93.8% of the time.

Conclusion

In this paper we analyzed customer reviews on fashion by employing two supervised learning methods, logistic regression and naïve Bayes, to classify whether or not a review recommends the fashion item. Both classifiers showed a very high accuracy, about 93%-94%, with a kappa statistic of about 79%-80%. These results can give companies indications of customers’ opinions about their products and services and of how to improve their quality and range. By means of machine learning, intelligent systems can build customer profiles and identify purchasing habits, contributing to sales increases and to business strategies that fulfill various criteria. Some limitations of machine learning technology relate to ethics, namely customer privacy and lack of transparency.

References

Agarap, A.F.M. (2020). Statistical analysis on e-commerce reviews with sentiment classification using bidirectional recurrent neural networks. Preprint. Available at https://arxiv.org/pdf/1805.03687.pdf

Androniceanu, A., Georgescu, I., & Kinnunen, J. (2020a). The key role of social media in identifying consumer opinions for building sustainable competitive advantages. In Meiselwitz, G. (ed.) Social Computing and Social Media. Participation, User Experience, Consumer Experience and Applications of Social Computing. HCII 2020. Lecture Notes in Computer Science, vol. 12195, Springer, Cham, 261-277.

Androniceanu, A., Kinnunen, J., Georgescu, I., & Androniceanu, A.-M. (2020b). Multidimensional analysis of consumer behaviour on the European digital market. In Sroka, W. (ed.) Perspectives on Consumer Behaviour. Contributions to Management Science. Springer, Cham. https://doi.org/10.1007/978-3-030-47380-8_4

Andzulis, J. M., Panagopoulos, N. G., & Rapp, A. (2012). A review of social media and implications for the sales process. Journal of Personal Selling & Sales Management, 2(3), 305-316.

Brooks, N. (2018a). Guided numeric and text exploration E-commerce. Available at: https://www.kaggle.com/nicapotato/guided-numeric-and-text-exploration- commerce

Brooks, N. (2018b). Women’s e-commerce clothing review. Available at: https://www.kaggle.com/nicapoto/womens-ecommerce-clothing-views

Chen, W-K., Huang, H-C., & Chou, S-C. (2010). Understanding consumer recommendation behavior. In Pousttchi, K. & Wiedemann, D.G. (eds.) Handbook of Research on Mobile Marketing Management, IGI Global, 401-416.

Chivandi, A., Vafana, S., Samuel, O.G., & Muchie, M. (2018). Social media innovation consumption of hair products in South Africa; African female perception. Journal of Retail and Consumer Services, JJRC2018759.

Chivandi, A., Samuel, M. O., & Muchie, M. (2019). Social media, consumer behaviour and service marketing. Consumer Behaviour and Service Marketing. Matthew Reyes. IntechOpen. Available at: https://www.intechopen.com/books/consumer-behavior- andmarketing/socialmediaconsumer-behavior-and-service-marketing

Fotiadis, A. K., & Stylos, N. (2017). The effects of online social networking on retailer consumer dynamics in the attraction industry: The case of ‘E-da’ theme park, Taiwan. Technological Forecasting and Social Change, 124, 283-294.

McHugh, M.L. (2009). The odds ratio: calculation, usage, and interpretation. Biochem Med., 19(2), 120-126.

Mitchell, T. (2020). Machine Learning (2nd ed.). McGraw Hill.

Siering, M., Deokar, A.V., & Janze, C. (2018). Disentangling consumer recommendations: Explaining and predicting airline recommendations based on online reviews. Decision Support Systems, 107, March 2018, 52-63.

Sperandei, S. (2014). Understanding logistic regression analysis. Biochem Med., 24(1), 12-18.

Vaitkevicius, S., Mazeikiene, E., Bilan, S., Navickas, V., & Savaneviciene, A. (2019). Economic demand formation motives in online-shopping. Inzinerine Ekonomika Engineering Economics, 30(5), 631-640.