Predicting Influencer Actual Reach Using Linear Regression - SAM KHOGASTEH EDVIN WIOREK
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2021
Predicting Influencer Actual Reach Using Linear Regression
SAM KHOGASTEH
EDVIN WIOREK
KTH SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
DEGREE PROJECT IN INDUSTRIAL ENGINEERING AND MANAGEMENT, 15 CREDITS, STOCKHOLM, SWEDEN 2021

Predicting Influencer Actual Reach Using Linear Regression (June 2021)

Sam Khogasteh, samkz@kth.se, Edvin Wiorek, edvinwi@kth.se

Abstract—The influencer marketing industry has seen tremendous growth in recent years, yet the effectiveness of this marketing form is still largely unexplored. This report explores how various performance measures are linked to the reach of social media pages, using the linear regression model. Three different data sets were collected, either manually or through web scraping. By splitting these data sets into training and test data, we examined the degree to which the linear regression model can predict the actual reach, the page views and the weekly growth of an influencer.

We concluded that there is a statistically significant correlation between multiple performance metrics of a social media page and the actual reach or the page views of that account. This study is however limited by its narrow data set and time frame, warranting future research in order to further establish the degree of this correlation.

The results of this study can benefit companies in their process of selecting influencers to collaborate with, as well as in determining the expected return on investment for a particular collaboration. This can in turn lead to a more efficient, authentic and transparent marketplace, and to consumers being less exposed to advertisement from misleading and malicious influencers.

Sammanfattning—In recent years the influencer marketing industry has grown drastically, yet the effectiveness of this form of marketing remains relatively unexplored. This report uses linear regression to explore how various performance measures are linked to the reach of profiles on social media. The data sets were collected manually or through web scraping. By splitting the data sets into training data and test data, we examined the degree to which the linear regression model can predict actual reach, page views and a profile's growth over one week.

We concluded that there is a statistically significant correlation between several performance measures of a profile page and the number of page views of that account. The study is however limited by its data set and time span, which motivates future studies to further establish the degree of correlation.

The results of the study can benefit companies in their process of choosing which influencers to collaborate with, as well as in determining the expected return on a specific collaboration. This can in turn contribute to a more efficient, authentic and transparent market, which also leaves consumers less exposed to marketing from misleading and malicious influencers.

I. INTRODUCTION

Digital marketing is an increasingly popular way for organizations to advertise their products and services. The novelty of this marketing platform opens new possibilities and realms of exploration. This thesis aims to study sponsored content on Instagram, and is performed partly at an organization. The organization conducts digital retail business through their own website as well as physically through separate shops. For disclosure purposes they will be referred to as the organization in this thesis. They focus largely on a single product and target consumers who are active on social media platforms and who are drawn to branded products.

The organization is interested in improving their advertisement targeting and gaining insight into the possible actual reach of their collaborations. They work with influencers on platforms such as Instagram, and would benefit from insights that could help predict the extent of actual reach of a partner's sponsored content. This study could be utilized for assessing which influencers the organization should collaborate with, as well as what monetary compensation said partners should receive. This domain is relatively unexplored, and marketing on social media through the use of influencers has only existed for a few years. This thesis intends to address one of the most significant questions organizations are faced with as they advertise their goods and services, ultimately simplifying this process enough for an organization to approximate the potential actual reach of a collaboration.

II. SOCIAL AND ETHICAL ASPECTS

From a social perspective, this study can support influencers who have expanded their follower base in an organic and permissible manner, while influencers who have purchased parts of their follower base in order to mislead consumers and organizations will be deprived of paid marketing deals. As a consequence, the issues addressed in this study are relevant from an ethical perspective, and the study seeks to favour transparency and authenticity on social media platforms. The study therefore connects to the United Nations' Sustainable Development Goal number 12: to contribute to responsible consumption and production.¹ As organizations can more easily evaluate which influencers are more deceptive, consumers on platforms such as Instagram will be exposed to advertisement from influencers of greater credibility. This can in turn encourage sustainable consumer patterns.

¹ United Nations, "Transforming our World: The 2030 Agenda for Sustainable Development," 2015. [Online] Available: https://sdgs.un.org/publications/transforming-our-world-2030-agenda-sustainable-development-17981

Transparency in social media is a contemporary problem, and it is a common occurrence that consumers and marketers gain a flawed understanding of an authority on a platform and act accordingly. The number of followers can be misleading as to the actual reach
or purchasing power that a profile yields, which can be problematic as consumers tend to be more influenced by accounts with seemingly large followings. By increasing the knowledge of the authenticity of profiles, corporations can construct more favourable collaborations and consumer misinformation can be reduced.

Collection of personal data is often regarded as an ethical grey area, though this study includes only data that is either public or willingly sent directly by an influencer who is fully aware of its intended use. The study could be expanded to also explore the follower base of the profile, for example by recording the demographics of users exposed to the sponsored content and of customers purchasing products. Because this would be done without the permission of the consumers, it was deemed unethical. Web scraping is also considered a controversial method; it was however utilized in the study as it was judged not to infringe on personal data or on any website's terms of service.

A. Research Questions & Challenges

The research questions this thesis aims to answer are the following: How can an influencer's reach be predicted using machine learning with the account's engagement-based metrics as input data? How strong is the prediction according to methods for evaluating machine learning algorithms?

The task involves collecting metadata about influencers through the mentioned organization, as well as about other social media accounts from various websites. Potential challenges lie, for instance, in using connections between the input data in order to find the most fundamental assumptions for the regression analysis. The assignment requires rigorous interpretation and assessment of the results, for example the degree to which connections between the input data represent real connections and how well they can be applied in a general context.

III. THEORY

In this section, the theoretical framework for this thesis is presented. Certain theories that have historically been used for evaluating social media statistics are explained as well.

A. Multiple Linear Regression

Linear regression is a model that expresses the relationship between variables using an approximated function created through the minimization of an error measure. For instance, in a least-squares estimation the sum of the squared distances between each data point and the fitted function is minimized. The regression is a form of supervised machine learning that identifies possible correlation and can be used to predict outcomes. The response variable $y$, which depends on $k$ independent variables $x_1, x_2, \dots, x_k$, with an error term $\varepsilon$, gives the model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

A model which involves only one explanatory variable is referred to as a simple linear regression, where the dependent variable is a scalar. If the model consists of several independent variables it is called a multiple linear regression. In the case of the dependent variable being a vector the model is instead called a multivariate regression. A regression model will only show a correlation between variables, since it is impossible to determine causality without further analysis [1].

B. Evaluation of regression model

The method of least squares assesses the fitted function by calculating the aggregate squared distance between the predicted values and the true values of the data points (MSE) [2]. There are many forms of error measurement for a regression, such as MSE, Root Mean Square Error and Mean Absolute Error, which have subtle differences between them. The R-squared metric measures the proportion of the variation in the dependent variable that is explained by the model. It ranges between 0%, where the model explains none of the variation, and 100%, where it explains all of it. This value gives a different indication of performance depending on the area that is researched. For instance, the R-squared value is expected to be lower for data sets where human error is involved. Within nonlinear models the error is usually calculated by iterating through multiple different linear error computations. The data should be split into representative training and test data in order to determine the strength of the model in a general context. In such a case, mean evaluation metrics from different combinations of training and test data are assessed.

C. Digital Marketing

1) Influencer marketing: Influencer marketing is a marketing form where organizations advertise through social media users that have amassed credibility and trust within their following. These users are referred to as social media influencers (SMIs) or influencers. In their social media marketing campaigns organizations create collaborations where the influencer posts sponsored content on a platform and personally endorses a particular service or product [3].

2) Performance indicators in digital marketing: Engagement rate, a measurement of engagement relative to the number of followers, is a performance indicator within digital marketing that describes how many social interactions an influencer gets on their platform, indicating the quality of the actor rather than the quantity of followers. The number of social interactions relates to the content, the social role and the outreach of an influencer's network. Engagement rate includes likes as well as higher valued "talk abouts" such as comments and shares [4].

Actual reach refers to the number of users exposed to an advertisement during a set period of time. For Instagram, it is the number of unique users who saw a post on a certain day. This metric is therefore closely tied to the extent to which brand awareness is actually increased by a sponsored post.

IV. PREVIOUS STUDIES

In the article Using Random Forest model to predict image engagement rate the authors M. Lazic and F. Eder [6] classified
images on Instagram with the intention of understanding what types of posts receive the highest engagement. The data for each post consisted of image data and engagement data. The image data was extracted using Google Cloud Vision API, which assigned features to each image. Using a random forest algorithm, a model was trained on the data set to predict engagement based on image features. The low regression accuracy of the result was explained by the authors as a consequence of method limitations, too few features, and unpredictability. Lazic and Eder propose to solve this problem by examining a combination of metadata such as engagement, number of followers and account reach in order to predict the reach of a specific sponsored post.

A similar conclusion was reached by T. Sweet, A. Rothwell and X. Luo [7] in the article Machine Learning Techniques for Brand-Influencer Matchmaking on the Instagram Social Network. Their work aimed to match companies with brand influencers using a machine learning algorithm. To predict the profitability of influencer collaborations, the similarity between the influencer's content and the organization's brand was measured. The study utilized a k-Nearest Neighbours method to create a model generating a list of the best suited users for a certain brand. Again, the authors indicate that an improvement would be to include metadata about the user. For example, they suggest the number of likes per post, the frequency of posts, and other variables explaining user audience interaction.

The document System and method for evaluating the true reach of social media influencers [8] describes a patented method for evaluating the number of accounts an influencer on social media reaches. The inventors use random forest methodology which classifies influencers as a good or bad match for a brand, depending on data about the account. The input variables of the algorithm are of interest, as seen in the document's example tree. The number of posts of the account is the root, branching out into number of followers or engagement rate, while the remaining nodes consist of the average number of likes per post or the number of unique views per post. The document acknowledges that input variables considered important include: the number of fans and followers the influencer has, the number of likes and comments the influencer's posts receive, the total reach of the influencer, and the similarity between the account's content and the brand. These variables are similar to the suggestions of Lazic and Eder as well as Sweet, Rothwell and Luo, in regard to which data should be used in future studies on predicting the success of sponsored content on social media platforms.

V. IMPLEMENTATION

In this section the practical process of the study is described.

A. Data Collection

1) Organization Data Collection: For the organization's data, each entry came from an actual collaboration the organization had pursued in recent months. Each influencer's number of followers and rate of follower engagement was modeled to approximate the actual reach that the organization's sponsored content receives. Because the actual reach is not publicly available, this data was provided manually by the organization. After each collaboration the influencer sends a screenshot containing their analytics. The relevant data, containing 75 influencer profiles, was collected and analyzed.

2) Web data collection: For the second part of the analysis a larger collection of data was obtained. The goal was to acquire data which described the audience, engagement, and reach of Instagram accounts. The first two attributes are readily available; the reach attribute, however, is not publicly accessible. The challenge was therefore to represent the potential views and success of an Instagram marketing collaboration with an accessible feature. The features chosen were growth rate and page views. The growth rate was measured by the change in follower count over a week. Page views describes the number of times an Instagram account's profile page has been visited. It is theorized that a high follower growth rate and a large number of page views are a result of a high actual reach. For example, the number of page views indicates the actual activity surrounding an account. Growth rate was accessible and could indicate actual activity as long as the growth was reached organically. This is why those attributes were deemed to be suitable alternative features for the web data regressions.

For the accounts included, a web scraping program was constructed to extract the information of each profile from the underlying HTML code. The web collection resulted in two separate data sets, the first containing the engagement rate and weekly follower growth of 30,000 profile pages. The second data set covered the dependent variable page views and the independent variables followers and engagement rate. Around 5,000 data points were collected for this data set.

B. Data selection

For the web data collection, the Instagram accounts from which the data was extracted were selected roughly based on those with the highest number of followers on the platform, to the extent that they were available. This caused the analysis to center around accounts with large follower groups. The organization data was determined by which collaborations the organization chose to pursue during the time of research.

C. Treatment of data

For the organization's data, some missing data points were replaced with a mean value based on the remaining rows. For the web data however, missing values were disregarded. Clear input errors, such as accounts listed as having zero followers or no engagement rate, were also removed.

D. Regressions

Three main regressions were performed as a part of the study, considering the different dependent variables actual reach, growth, and page views. The first and third dependent variables were modeled by different explanatory variables, resulting in five regression models described below.
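Before any of these models is fitted, the data treatment from subsection C has to be applied. The following is a minimal sketch of that step, not the authors' code. It assumes pandas, hypothetical column names (`followers`, `engagement_rate`, `actual_reach`, `page_views`), a per-column mean for the imputation, and that "disregarded" means the affected rows are dropped; none of these details are stated in the thesis.

```python
import numpy as np
import pandas as pd


def clean_organization_data(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values with the column mean of the remaining rows."""
    cleaned = df.copy()
    numeric_cols = cleaned.select_dtypes(include="number").columns
    # DataFrame.mean() skips NaN, so each mean is based on the remaining rows.
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())
    return cleaned


def clean_web_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and rows with obvious input errors."""
    cleaned = df.dropna()
    # Assumption: zero followers or a zero engagement rate is an input error.
    valid = (cleaned["followers"] > 0) & (cleaned["engagement_rate"] > 0)
    return cleaned[valid].reset_index(drop=True)


# Tiny made-up frames, used only to show the behaviour of the two helpers.
org = pd.DataFrame({
    "followers": [10_000, 20_000, np.nan],
    "engagement_rate": [0.02, np.nan, 0.04],
    "actual_reach": [900.0, 1500.0, 2100.0],
})
web = pd.DataFrame({
    "followers": [5_000_000, 0, 2_000_000, np.nan],
    "engagement_rate": [0.01, 0.03, 0.0, 0.02],
    "page_views": [1200, 300, 800, 500],
})

org_clean = clean_organization_data(org)
web_clean = clean_web_data(web)
```

Mean imputation keeps all 75 organization profiles available for the regression at the cost of pulling imputed rows towards the average, whereas the much larger web data sets can afford to lose incomplete rows.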
The five regression models are:

$$\begin{aligned}
\text{Actual reach} &= \beta_0 + \beta_1\,\text{Engagement} + \varepsilon \\
\text{Actual reach} &= \beta_0 + \beta_1\,\text{Followers} + \varepsilon \\
\text{Weekly growth} &= \beta_0 + \beta_1\,\text{Engagement rate} + \varepsilon \\
\text{Page views} &= \beta_0 + \beta_1\,\text{Engagement} + \varepsilon \\
\text{Page views} &= \beta_0 + \beta_1\,\text{Followers} + \beta_2\,\text{Engagement rate} + \varepsilon
\end{aligned}$$

The engagement variable was calculated as followers multiplied by the engagement rate. The regressions were processed through the linear model module of the sklearn library. The results of each regression are presented in section VI.

E. Regression evaluation

A fifth of the data was set aside for testing while the models were trained. The process of fitting a single regression on a given train/test split was repeated 10 times in order to average the metrics, thus reducing the effect of randomness. The metrics module of sklearn was used for mean absolute error and R-squared, while the library Statsmodels supplied the statistical metrics.

F. Industrial Engineering Perspective

From an industrial engineering perspective this regression is intended to predict, with some degree of precision, the actual reach of a particular influencer based on their publicly available data points. This prediction can be utilized by an organization to simplify and improve the process of choosing collaboration partners. This is analyzed from the perspective of Porter's five forces.

VI. RESULTS

A. Regression results

Table I displays the average R-squared and the average mean absolute error over ten iterations of each regression. The dependent and independent variables are also stated in the table. Table II shows the independent variable coefficients of each regression, their standard errors, the t-value, the p-value and the confidence interval for the coefficients.

B. Scatter plots

For the regressions with a single independent variable, the regression line is plotted against the actual values of the test data in figures 1 through 4.

Figure 1 concerns the organization regression with engagement as the explanatory variable. It reveals a linear trend for the data points with smaller engagement values, indicating a linear correlation between the engagement of an account and its actual reach.

Fig. 1. Actual reach as a function of engagement.

Figure 2 concerns the organization regression with followers as the explanatory variable, that is, the same data points as figure 1 but with followers instead of engagement on the horizontal axis. The actual values are now scattered further from the prediction line, visualizing the effect of engagement rate for predicting actual reach.

Fig. 2. Actual reach as a function of followers.

Figure 3 plots the web regression of engagement rate and weekly growth, and illustrates the lack of a linear trend between engagement rate and weekly growth rate. Most data points are clustered around low engagement rate values yet show widely varying growth rates, whereas the regression line indicates only a slight predictive property at the larger engagement rates, albeit with a tendency to overestimate the growth rates. This reveals a lack of linearity in the relationship between the variables, or the absence of any correlation at all.

Figure 4 is a plot of the web regression of engagement and page views. The figure indicates only a slight linear relationship in the data, which reflects the regression metrics mentioned earlier.

C. Residual versus Fitted plots

Figures 5 through 7 display residual plots against fitted values for three of the regressions.

Figure 5 contains the residual plot of the organization regression. It shows asymmetry, which indicates a possibility
of a nonlinear data pattern. Assuming this is not due to the small number of observations in the regression, the residuals appear to become increasingly negative, implying that a nonlinear model assumption would perform better.

Figure 6, which concerns the residual plot of the weekly growth regression, shows a slight decrease, indicating possible non-linearity. The higher the engagement rate becomes, the less of an effect it has on the weekly growth rate. Although this decrease is small, it is plausible that the correlation is logarithmic rather than linear.

Figure 7 plots the residuals against the fitted values of the page views regression. It also illustrates slight non-linear patterns in the data, as the residuals do not appear to be randomly scattered around the x-axis. The observed trend is clearly negative, indicating a heteroscedasticity problem in the predictor variable. Perhaps an assumed logarithmic relationship between the explanatory variable and the dependent variable would improve the model.

TABLE I
OVERALL REGRESSION METRICS

Model              Dependent variable  Independent variables  Average R-squared  Average MAE
Org. regression 1  Actual reach        Engagement             0.7011             14453.12
Org. regression 2  Actual reach        Followers              0.3809             10156.02
WG regression      Weekly growth %     Engagement rate %      0.0077             1.03
PV regression 1    Page views          Engagement             0.3679             1226.41
PV regression 2    Page views          ER & Followers         0.3788             1100.12

TABLE II
REGRESSION VARIABLE METRICS

Model              Variable       Coefficient  Std error  t-value  P>|t|  Conf. interval (2.5 %, 97.5 %)
Org. regression 1  Constant (β0)  3055.35      19262.78   1.56     0.124  [-856.47, 6967.17]
                   Engagement     4.26         0.31       13.72    0.00   [3.64, 4.88]
Org. regression 2  Constant (β0)  -3144.71     3915.51    -0.803   0.425  [-10900, 4658]
                   Followers      0.4074       0.058      6.98     0.00   [0.291, 0.524]
WG regression      Constant (β0)  0.402        0.014      29.614   0.000  [0.375, 0.429]
                   ER %           0.045        0.003      15.217   0.000  [0.039, 0.050]
PV regression 1    Constant (β0)  633.183      64.763     9.777    0.000  [506.212, 760.154]
                   Engagement     0.0063       0.000      49.061   0.000  [0.006, 0.007]
PV regression 2    Constant (β0)  179.41       63.62      2.82     0.005  [54.67, 304.14]
                   Followers      0.0002       3.71e-06   48.928   0.000  [0.000, 0.000]
                   ER %           13670        995.04     13.736   0.000  [11700, 15600]

Fig. 3. Weekly relative growth as a function of relative engagement.

Fig. 4. Page views as a function of engagement.

VII. DISCUSSION

A. Regressions

Considering the organization's data set, a high correlation was found between the actual reach of an influencer and their
engagement. The regressions on this data set also showed that followers had a correlation to actual reach, although a significantly lower one than the correlation to engagement. This indicates the predictive properties the engagement rate has on actual reach. The results of the page views regressions indicate that influencers who have a higher engagement rate in relation to their followers also, to some extent, see more profile views. Lastly, the linear regression on the largest data set showed virtually no correlation between the weekly follower growth rate of an influencer and their engagement rate.

Fig. 5. Residual vs. fitted values for actual reach

Fig. 6. Residual vs. fitted values for weekly growth

Fig. 7. Residual vs. fitted values for page views

Continuing the analysis of the regression results, the p-values of the coefficients were often very low. A reported p-value of 0.000 means that the value is below 0.0005 after rounding, so the null hypothesis can be rejected even with an alpha value of 0.01. This suggests that there is a correlation between page views and the number of followers and the engagement rate, respectively. Since the confidence interval for the engagement rate coefficient does not include 0, the value under the null hypothesis, we can reject the null hypothesis and, at a significance level of 5% (the interval spans the 2.5% and 97.5% bounds), conclude that there is a correlation between engagement rate and the page views of an influencer. The results are therefore statistically significant.

The linear regression using the organization's data set shows a statistically significant correlation with high prediction accuracy between the amount of engagement the influencer gets and the number of people that view their sponsored content. Although the data set needs to be larger for the results to be more reliable, the high R-squared value of this regression shows promise for future models using these variables to be useful tools for predicting actual reach with narrow confidence intervals. These initial results open the door for a wide array of new methods to quantify the quality of an influencer and the value they can bring to an organization. The information can help advertisers estimate how much reach a particular influencer has as well as the expected return on capital. Additionally, this predictive capability could simplify the marketing process for companies, allowing them to reliably and accurately choose the most valuable influencers to collaborate with, while reducing personnel costs and avoiding unprofitable investments. On the same note, by quantifying influencer audience quality, the higher quality influencers can increase their bargaining power and easily differentiate themselves using simple measurement numbers. This results in a more transparent and efficient marketplace for both parties, even benefiting the potential customers by only exposing them to sponsored content from trusted and transparent influencers.

Considering the page views regression, engagement factors had a slight predictive capability for the number of profile visits. Assuming that influencers who get a high number of profile views will also have a higher number of post views, using page views as a public predictive variable opens the possibility of using larger data sets and thus obtaining more reliable regressions. However, we speculate that there is a dissonance between the page view and actual reach variables, which is supported by the regression scores of this study. The correlation between page views and the actual reach of an account is a task for future studies.

On the other hand, no correlation was found between the weekly follower growth rate of an influencer and the engagement rate of their account. We speculate that this is due to the high prevalence of inorganic growth through buying fake followers. This can drastically increase the number of followers an influencer has and lead to artificially high growth
rates that are not a result of high user activity on their account. These issues can be mitigated by measuring the growth rate over a longer time frame, for example months rather than weeks. By doing this, even accounts that have inorganically acquired their followers will see a more realistic growth rate as it is distributed over several weeks, ultimately resulting in the regression not having to fit to extreme values and possibly resulting in a higher correlation between growth and account activity. This further emphasises the need for evaluating metrics beyond followers or growth of followers in order to assess the actual reach of an account. Since it is common for organizations to use growth metrics and even absolute followers as key variables for determining collaboration partners, these results strongly indicate that other metrics must be used to actually predict the increased brand awareness of a particular collaboration.

B. Industrial engineering perspective

From the perspective of Porter's five forces, the main insights revolved around the bargaining power of suppliers and buyers. This study connects to increasing the bargaining power of the advertising organization and puts pressure on influencers to compete with each other and acquire the highest quality of audience possible. Since social media marketing is a relatively new phenomenon and since it is difficult to track the quality and quantity of traffic that a particular influencer gets, there is no clear established method of determining the right pay for their advertisement services. This leads to organizations working to maximize their bargaining power by developing methods to approximate and predict how much reach their sponsored content will get. Assuming a certain conversion rate for purchases, one can then calculate the rate of return required for a profitable advertising campaign and set their pay accordingly, ultimately maximizing revenue while reducing cost through avoiding unprofitable collaborations.²

² CFI, "Bargaining Power of Buyers". [Online] Available: https://corporatefinanceinstitute.com/resources/knowledge/strategy/bargaining-power-of-buyers

C. Future research

Suggestions for future research include further examining the connection between page views and actual reach. This however requires the influencer to personally send their analytics to the researchers, increasing the difficulty of collecting sufficient data for a machine learning algorithm.

Among the linear regressions performed, the sizes of the different data sets varied significantly. The data set used by the organization had fewer than 100 data points, and despite the results showing a correlation between the dependent and the independent variables, further studies using a larger number of data points are necessary. As more data points are added one can examine the amount at which the R-squared values of different regression iterations converge, and at that point conclude that the data set is large enough for the result to be reliable.

For the linear regressions using larger data sets the quantity of data was considered sufficient, but it is important to note concerns regarding quality. Since the data was obtained using web scraping from third-party web services there is no way to fully verify the validity of the information, although a sample data set had been cross-checked with other third-party platforms. For future research it is recommended to create a web scraping program that extracts all necessary information directly from the Instagram page to avoid these validity issues. Also, considering these regressions, the residual plots indicate the potential for non-linear models to perform better, which is worth examining in future studies.

Another consideration when examining the growth variable in future research is to increase the time interval over which the growth is measured. Measuring on a weekly scale might lead to misleading results, since there are many possible variables that can distort this measurement in the shorter term. Also, as concluded from the residual versus fitted plots of the results section, there is a possibility that nonlinear model assumptions could improve the performance of similar regressions. We therefore suggest that future research investigate exponential and logarithmic relationships between the performance metrics.

As some of the previous work mentioned, a potential improvement of the study would be to also consider the similarity between influencer content and brand as a factor in a campaign's success at increasing brand awareness. Previous studies also make the case for ensemble learning methodology such as random decision forests for predicting actual reach, which could improve the performance of the model. For such studies it is however necessary that a larger number of input variables are assessed. Furthermore, it is important to note that this study does not address the connection between a large, engaging audience and purchasing customers. By tracking website traffic as a response to an influencer campaign the organization can better understand the correlation between the actual post reach of the sponsored content and the resulting customer interactions with the organization.

VIII. CONCLUSION

In this study three sets of Instagram user data were analyzed to model the actual reach of their posts or indicators of actual reach. With the intention of increasing the predictive abilities of organizations in influencer marketing collaborations, linear regressions were executed and evaluated for different sets of metadata variables. Predicting the actual reach of a post using the engagement measure of an account showed promising results when analyzing the collaborations of a retail business, however the scale of the data was insufficient for drawing any definite conclusions. These results nevertheless indicate a possible increased ability for companies to calculate returns on social media marketing collaborations, reducing the occurrence of unprofitable campaigns. A larger regression of growth rate on engagement rate showed that growth, at that time frame, is not a good indicator of the interaction the content of an account receives, which was speculated to be caused by accounts that grow their follower base inorganically. The publicly accessible page views metric indicated a significant correlation with engagement, while analyzing the correlation of page views to actual reach is a task for future studies. This study is a
step towards creating a more transparent and fair marketplace between advertisers and influencers on social media platforms.

REFERENCES

[1] D.C. Montgomery, E. Peck, & G. Vining, “REGRESSION AND MODEL BUILDING” in Introduction to linear regression analysis, 5th ed., New York City, NY, USA, Wiley, 2012, pp. 21-28.
[2] G.S. Handelman, H. Kok, R. Chandra, A. Razavi, S. Huang, M. Brooks, M. Lee, & H. Asadi, “Peering Into the Black Box of Artificial Intelligence: Evaluation Metrics of Machine Learning Methods”, American Journal of Roentgenology, vol. 212, no. 1, pp. 38-43, Jan. 2019, DOI: 10.2214/AJR.18.20224.
[3] C. Stubb, A. Nyström, & J. Colliander, “Influencer marketing: The impact of disclosing sponsorship compensation justification on sponsored content effectiveness”, Journal of Communication Management, vol. 23, no. 2, pp. 109-122, May 2019, DOI: 10.1108/JCOM-11-2018-0119.
[4] K. Peters, Y. Chen, A. Kaplan, B. Ognibeni, & K. Pauwels, “Social Media Metrics — A Framework and Guidelines for Managing Social Media”, Journal of Interactive Marketing, vol. 27, no. 4, pp. 281-298, 2013, DOI: 10.1016/j.intmar.2013.09.007.
[5] M. Lazic & F. Eder, “Using Random Forest model to predict image engagement rate”, EECS, KTH, Stockholm, 2018. [Online]. Available: https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1215409&dswid=3348
[6] T. Sweet, A. Rothwell, & X. Luo, “Machine Learning Techniques for Brand-Influencer Matchmaking on the Instagram Social Network”, Cornell University, Ithaca, 2019. [Online]. Available: https://arxiv.org/abs/1901.05949
[7] D. Sullivan, J. Heenan, & R. Akulshin, “System and method for evaluating the true reach of social media influencers”, Patent Number WO2019010379, Oct. 1, 2019. [Online]. Available: https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2019010379
TRITA-EECS-EX-2021:354 www.kth.se