Predicting Influencer Actual Reach Using Linear Regression - SAM KHOGASTEH EDVIN WIOREK

DEGREE PROJECT IN TECHNOLOGY,
FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2021

Predicting Influencer Actual
Reach Using Linear
Regression
SAM KHOGASTEH

EDVIN WIOREK

KTH
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
DEGREE PROJECT IN INDUSTRIAL ENGINEERING AND MANAGEMENT, 15 CREDITS, STOCKHOLM, SWEDEN 2021                                                     1

    Predicting Influencer Actual Reach Using Linear
                 Regression (June 2021)
                              Sam Khogasteh, samkz@kth.se, Edvin Wiorek, edvinwi@kth.se

   Abstract: The influencer marketing industry has seen tremendous growth in recent years, yet the effectiveness of this form of marketing is still largely unexplored. This report explores how various performance measures are linked to the reach of social media pages, using the linear regression model. Three different data sets were collected manually or by web scraping. By splitting these data sets into training and test data, we examined the degree to which the linear regression model can predict the actual reach, the page views and the weekly growth of an influencer.
   We concluded that there is a statistically significant correlation between multiple performance metrics of a social media page and the actual reach or the page views of that account. This study is, however, limited by its narrow data set and time frame, warranting future research to further establish the degree of this correlation.
   The results of this study can benefit companies in their process of selecting influencers to collaborate with, as well as in determining the expected return on investment for a particular collaboration. This can in turn lead to a more efficient, authentic and transparent marketplace, and to consumers being less exposed to advertisements from misleading and malicious influencers.

   Sammanfattning: In recent years, the influencer marketing industry has grown drastically, yet the effectiveness of this form of marketing remains relatively unexplored. This report uses linear regression to explore how various performance measures are linked to the reach of profiles on social media. The data sets were collected manually or by web scraping. By splitting the data sets into training and test data, we examined the degree to which the linear regression model can predict actual reach, page views and a profile's growth over one week.
   We concluded that there is a statistically significant correlation between several performance measures of a profile page and the number of page views of that account. The study is, however, limited by its data set and time span, which motivates future studies to further establish the degree of correlation.
   The results of the study can benefit companies in choosing which influencers to collaborate with, as well as in determining the expected return of a specific collaboration. This can in turn contribute to a more efficient, authentic and transparent market, which also leaves consumers less exposed to marketing from misleading and malicious influencers.

                       I. INTRODUCTION

   Digital marketing is an increasingly popular way for organizations to advertise their products and services. The novelty of this marketing platform opens new possibilities and realms of exploration. This thesis aims to study sponsored content on Instagram, and is performed partly at an organization. The organization conducts digital retail business through its own website as well as physically through separate shops. For disclosure purposes it will be referred to as the organization in this thesis. It focuses largely on a single product and targets consumers who are active on social media platforms and who are drawn to branded products.
   The organization is interested in improving its advertisement targeting and gaining insights into the possible actual reach of its collaborations. It works with influencers on platforms such as Instagram, and would benefit from insights that could help predict the extent of actual reach of a partner's sponsored content. This study could be used for assessing which influencers the organization should collaborate with, as well as what monetary compensation said partners should receive. This domain is relatively unexplored, and marketing on social media through the use of influencers has only existed for a few years. This thesis intends to address one of the most significant questions organizations are faced with as they advertise their goods and services, ultimately simplifying this process enough for an organization to approximate the potential actual reach of a collaboration.

                       II. SOCIAL AND ETHICAL ASPECTS

   From a social perspective, this study can support influencers who have expanded their follower base in an organic and permissible manner, while influencers who have purchased parts of their follower base in order to mislead consumers and organizations will be deprived of paid marketing deals. As a consequence, the issues addressed in this study are relevant from an ethical perspective, and the study seeks to favour transparency and authenticity on social media platforms. The study therefore connects to the United Nations' Sustainable Development Goal number 12, to contribute to responsible consumption and production.¹ As organizations can more easily evaluate which influencers are deceptive, users of platforms such as Instagram will be exposed to advertisements from more credible influencers. This can in turn encourage sustainable consumer patterns. Transparency in social media is a contemporary problem, and it is a common occurrence that consumers and marketers form a flawed understanding of an account's authority on a platform and act accordingly. The number of followers can be a misleading indicator of the actual reach

   ¹ United Nations, “Transforming our World: The 2030 Agenda for Sustainable Development,” 2015. [Online]. Available: https://sdgs.un.org/publications/transforming-our-world-2030-agenda-sustainable-development-17981

or purchasing power that a profile yields, an aspect which can be problematic as consumers tend to be more influenced by accounts with seemingly large follower bases. By increasing the knowledge of the authenticity of profiles, corporations can construct more favourable collaborations and consumer misinformation can be reduced.
   Collection of personal data is often regarded as an ethical grey area, though this study includes only data that is either public or willingly sent directly by the influencer, who is fully aware of its intended use. The study could be expanded to also explore the follower base of the profile, for example by recording the demographics of users exposed to the sponsored content and of customers purchasing products. Since this could be done without the consumers' permission, it was deemed unethical. Web scraping is also considered a controversial method. It was nevertheless used in the study, as it was judged not to infringe on personal data or on any website's terms of service.

A. Research Questions & Challenges
   The research questions this thesis aims to answer are the following: How can an influencer's reach be predicted using machine learning with the account's engagement-based metrics as input data? How strong is the prediction, based on methods of evaluation of machine learning algorithms?
   The task involves collecting metadata about the influencers through the mentioned organization, as well as about other social media accounts from various websites. Potential challenges lie, for instance, in using connections between the input data in order to find the most fundamental assumptions for the regression analysis. The assignment requires rigorous interpretation and assessment of the results, for example the degree to which connections between the input data represent real connections and how well they can be applied in a general context.

                         III. THEORY
   In this section, the theoretical framework for this thesis is presented. Certain theories that have historically been used for evaluating social media statistics are explained as well.

A. Multiple Linear Regression
   Linear regression is a model that expresses the relationship between variables using an approximated function created through the minimization of an error parameter. For instance, the sum of the squared distances between each data point and the fitted function is minimized in a least-squares estimation. The regression is a form of supervised machine learning that identifies possible correlation and can be used to predict outcomes. The response variable $y$, which depends on $k$ independent variables $x_1, x_2, \ldots, x_k$, with an error term $\varepsilon$, gives the model

          $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \varepsilon$$

   A model which involves only one explanatory variable is referred to as a simple linear regression, where the dependent variable is a scalar. If the model consists of several independent variables it is called a multiple linear regression. In the case of the dependent variable being a vector, the model is instead called a multivariate regression. A regression model will only show a correlation between variables, since it is impossible to determine causality without further analysis [1].

B. Evaluation of regression model
   The method of least squares assesses the fitted function by calculating the squared distance between the predicted values and the true values, summarized for example as the mean squared error (MSE) [2]. There are many forms of error measurement for a regression, such as MSE, Root Mean Square Error and Mean Absolute Error, as well as other measurements that have subtle differences between them. The R-squared metric measures the proportion of the variation in the dependent variable which is explained by the model. It ranges between 0%, where there is no correlation, and 100%, where there is a perfect correlation. This value gives a different indication of performance depending on the area that is researched. For instance, the R-squared value is expected to be lower for data sets where human error is involved. Within nonlinear models the error is usually calculated by iterating through multiple different linear error computations. The data should be split into representative training and test data in order to determine the strength of the model in a general context. In such a case, the mean evaluation metrics from different combinations of training and test data are assessed.

C. Digital Marketing
   1) Influencer marketing: Influencer marketing is a marketing form where organizations advertise through social media users that have amassed credibility and trust within their following. These users are referred to as social media influencers (SMIs) or influencers. In their social media marketing campaigns, organizations create collaborations where the influencer posts sponsored content on a platform and personally endorses a particular service or product [3].
   2) Performance indicators in digital marketing: Engagement rate, a measurement of engagement relative to the number of followers, is a performance indicator within digital marketing that describes how many social interactions an influencer gets on their platform, indicating the quality of the actor rather than the quantity of followers. The number of social interactions relates to the content, the social role and the outreach of an influencer's network. Engagement rate includes likes as well as higher valued “talk abouts” such as comments and shares [4].
   Actual reach refers to the number of users exposed to an advertisement during a set period of time. For Instagram, it is the number of unique users who saw a post on a certain day. This metric is therefore closely tied to the extent to which brand awareness is actually increased by a sponsored post.

                     IV. PREVIOUS STUDIES
   In the article Using Random Forest model to predict image engagement rate the authors M. Lazic and F. Eder [6] classified
SAM KHOGASTEH, EDVIN WIOREK : PREDICTING INFLUENCER ACTUAL REACH USING LINEAR REGRESSION (JUNE 2021)                                  3

images on Instagram with the intention to understand what types of posts receive the highest engagement. The data for each post consisted of image data and engagement data. The image data was extracted using Google Cloud Vision API, which assigned features to each image. Through the use of a random forest algorithm, the data set was trained and classified to predict engagement based on image features. The low regression accuracy of the result was explained by the authors as a consequence of method limitations, too few features, and unpredictability. Lazic and Eder propose to solve this problem by examining a combination of metadata such as engagement, number of followers and account reach in order to predict the reach of a specific sponsored post.
   A similar conclusion was reached by authors T. Sweet, A. Rothwell and X. Luo [7] in the article Machine Learning Techniques for Brand-Influencer Matchmaking on the Instagram Social Network. The study aimed to match companies with brand influencers using a machine learning algorithm. To predict the profitability of influencer collaborations, the similarity between the influencer's content and the organization's brand was measured. The study utilized a k-Nearest Neighbours method to create a model generating a list of the best suited users for a certain brand. Again, the authors indicate that an improvement would be to include metadata about the user. For example, they suggest the number of likes per post, frequency of posts, and other variables explaining user audience interaction.
   The document System and method for evaluating the true reach of social media influencers [8] describes a patented method for evaluating the number of accounts an influencer on social media reaches. The inventors use a random forest methodology which classifies influencers as a good or bad match for a brand, depending on data about the account. The input variables of the algorithm, seen in the document's example tree, are of interest. The number of posts of the account is the root, branching out into number of followers or engagement rate, while the remaining nodes consist of the average number of likes per post or the number of unique views per post. The document acknowledges that input variables considered important include: the number of fans and followers the influencer has, the number of likes and comments the influencer's posts receive, the total reach of the influencer, and the similarity between the account's content and the brand. These variables have similarities to the suggestions of Lazic and Eder as well as Sweet, Rothwell and Luo, in regard to which data should be used in future studies on predicting the success of sponsored content on social media platforms.

                     V. IMPLEMENTATION
   In this section the practical process of the study is described.

A. Data Collection
   1) Organization Data Collection: For the organization's data, each entry came from actual collaborations the organization had pursued in recent months. Each influencer's number of followers and rate of follower engagement were modeled to approximate the actual reach that the organization's sponsored content receives. Because the actual reach is not publicly available, this data was provided manually by the organization. After each collaboration the influencer sends a screenshot containing their analytics. The relevant data, containing 75 influencer profiles, was collected and analyzed.
   2) Web data collection: For the second part of the analysis a larger collection of data was obtained. The goal was to acquire data which described the audience, engagement, and reach of Instagram accounts. The former two attributes are readily available; the reach attribute, however, is not publicly accessible. The challenge was therefore to represent the potential views and success of an Instagram marketing collaboration with an accessible feature. The features chosen were growth rate and page views. The growth rate was measured as the change in follower count over a week. The page views describe the number of times an Instagram account's profile page has been visited. It is theorized that a high follower growth rate and a large number of page views are a result of a high actual reach. For example, the number of page views indicates the actual activity surrounding an account. Growth rate was accessible and could indicate actual activity as long as the growth was reached organically. This is why those attributes were deemed to be suitable alternative features for the web data regressions.
   For the accounts included, a web scraping program was constructed to extract the information of each profile from the underlying HTML code. The web collection resulted in two separate data sets, the first containing the engagement rate and weekly follower growth of 30 thousand profile pages. The second data set covered the dependent variable page views and the independent variables followers and engagement rate. Around 5 thousand data points were collected for this data set.

B. Data selection
   For the web data collection, the Instagram accounts from which the data was extracted were selected roughly based on those with the highest number of followers on the platform, to the extent to which they were available. This caused the analysis to center around accounts with large follower groups. The organization data was determined by which collaborations the organization chose to pursue during the time of research.

C. Treatment of data
   For the organization's data, some missing data points were replaced with a mean value based on the remaining rows. For the web data, however, rows with missing values were discarded. Clear input errors, such as accounts listed as having zero followers or no engagement rate, were also removed.

D. Regressions
   Three main regressions were performed as a part of the study, considering the different dependent variables actual reach, growth, and page views. The first and third dependent variables were modeled by different explanatory variables, resulting in five regression models described below.

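          $$\text{Actual reach} = \beta_0 + \beta_1 \, \text{Engagement} + \varepsilon$$

          $$\text{Actual reach} = \beta_0 + \beta_1 \, \text{Followers} + \varepsilon$$

          $$\text{Weekly growth} = \beta_0 + \beta_1 \, \text{Engagement rate} + \varepsilon$$

          $$\text{Page views} = \beta_0 + \beta_1 \, \text{Engagement} + \varepsilon$$

          $$\text{Page views} = \beta_0 + \beta_1 \, \text{Followers} + \beta_2 \, \text{Engagement rate} + \varepsilon$$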
   The engagement variable was calculated as followers multiplied by the engagement rate. The regressions were processed through the linear model module of the sklearn library. The results of each regression are presented in section VI.

Fig. 1.   Actual reach as a function of engagement.
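
   As an illustration of the engagement feature and the sklearn fit described above, the following is a minimal sketch rather than the authors' code. The follower counts, engagement rates and the synthetic target (built from an assumed intercept of 3000 and slope of 4) are hypothetical values chosen only to make the example runnable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical accounts: follower counts and engagement rates (as fractions).
followers = np.array([12_000, 45_000, 80_000, 150_000, 300_000], dtype=float)
engagement_rate = np.array([0.050, 0.032, 0.028, 0.015, 0.010])

# Engagement as defined in the text: followers multiplied by engagement rate.
engagement = followers * engagement_rate

# Synthetic target from an assumed linear relation, for illustration only.
actual_reach = 3000.0 + 4.0 * engagement

# Simple linear regression: Actual reach = beta_0 + beta_1 * Engagement.
model = LinearRegression()
model.fit(engagement.reshape(-1, 1), actual_reach)

intercept = float(model.intercept_)
slope = float(model.coef_[0])
print(f"beta_0 = {intercept:.2f}, beta_1 = {slope:.2f}")
```

   Because the synthetic target is exactly linear in engagement, the fit recovers the assumed intercept and slope; real collaboration data would of course leave a residual.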

E. Regression evaluation
   A fifth of the data was set aside for testing while the models were trained. Each regression, with its train/test split, was repeated 10 times in order to average the metrics and thus reduce the effect of randomness. The sklearn metrics module was used for mean absolute error and R-squared, while the Statsmodels library supplied the statistical metrics.
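
   The evaluation procedure above can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: it assumes that each of the 10 repetitions draws a new random 80/20 split, and it uses synthetic data. The helper name `average_metrics` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def average_metrics(X, y, n_repeats=10, test_size=0.2):
    """Average R-squared and MAE over repeated random train/test splits."""
    r2_values, mae_values = [], []
    for seed in range(n_repeats):
        # Assumption: a fresh random split is drawn for every repetition.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(X_train, y_train)
        y_pred = model.predict(X_test)
        r2_values.append(r2_score(y_test, y_pred))
        mae_values.append(mean_absolute_error(y_test, y_pred))
    return float(np.mean(r2_values)), float(np.mean(mae_values))


# Synthetic data for illustration: a noisy linear relation.
rng = np.random.default_rng(0)
X = rng.uniform(100, 5000, size=(200, 1))
y = 3000.0 + 4.0 * X[:, 0] + rng.normal(0.0, 500.0, size=200)

mean_r2, mean_mae = average_metrics(X, y)
print(f"average R-squared = {mean_r2:.4f}, average MAE = {mean_mae:.2f}")
```

   Averaging over several splits makes the reported metrics less dependent on which rows happen to land in the test set, which matters most for small data sets such as the 75 organization profiles.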

F. Industrial Engineering Perspective
   From an industrial engineering perspective this regression is intended to, with some degree of precision, predict the actual reach of a particular influencer based on their publicly available data points. This prediction can be utilized by an organization to simplify and improve the process of choosing collaboration partners. This is analyzed from the perspective of Porter's five forces.

Fig. 2.   Actual reach as a function of followers.
                         VI. RESULTS
A. Regression results
   Table I displays the average R-squared and the average mean absolute error over ten iterations of each regression. The dependent and independent variables are also stated in the table. Table II shows the independent variable coefficients of each regression, together with the standard error, t-value, p-value and confidence interval of each coefficient.

B. Scatter plots
   For the regressions with a single independent variable, the regression line and the actual values of the test data are plotted in Figures 1 through 4.
   Figure 1 concerns the organization regression set with engagement as the explanatory variable. It reveals a linear trend for the data points with smaller engagement values, indicating a linear correlation between the engagement of an account and its actual reach.
   Figure 2 concerns the same organization data points as Figure 1, but with followers as the explanatory variable. The actual values are now scattered further from the prediction line, visualizing the effect of engagement rate for predicting actual reach.
   Figure 3 plots the web regression of engagement rate and weekly growth, and illustrates the lack of a linear trend between engagement rate and weekly growth rate. Most data points are clustered around low engagement rate values yet show widely varying growth rates, whereas the regression line indicates only a slight predictive property at the larger engagement rates, albeit with a tendency to overestimate the growth rates. This reveals a lack of linearity in the relationship between the variables, or the absence of any correlation at all.
   Figure 4 is a plot of the web regression of engagement and page views. The figure indicates only a slight linear relationship in the data, which reflects the regression metrics mentioned earlier.

C. Residual versus Fitted plots
   Figures 5 through 7 display residual plots against fitted values for three of the regressions.
   Figure 5 contains the residual plot of the organization regression. It shows asymmetry, which indicates a possibility

                                  TABLE I
                        OVERALL REGRESSION METRICS

 Regression          Dependent variable   Independent variables         Average R-squared   Average MAE
 Org. regression 1   Actual reach         Engagement                    0.7011              14453.12
 Org. regression 2   Actual reach         Followers                     0.3809              10156.02
 WG regression       Weekly growth (%)    Engagement rate (%)           0.0077              1.03
 PV regression 1     Page views           Engagement                    0.3679              1226.41
 PV regression 2     Page views           Engagement rate, Followers    0.3788              1100.12

                                  TABLE II
                        REGRESSION VARIABLE METRICS

 Regression / term     Coefficient   Std error   t-value   P>|t|   95 % confidence interval
 Org. regression 1
   Constant (β0)       3055.35       1962.78     1.56      0.124   [-856.47, 6967.17]
   Engagement          4.26          0.31        13.72     0.00    [3.64, 4.88]
 Org. regression 2
   Constant (β0)       -3144.71      3915.51     -0.803    0.425   [-10900, 4658]
   Followers           0.4074        0.058       6.98      0.00    [0.291, 0.524]
 WG regression
   Constant (β0)       0.402         0.014       29.614    0.000   [0.375, 0.429]
   ER %                0.045         0.003       15.217    0.000   [0.039, 0.050]
 PV regression 1
   Constant (β0)       633.183       64.763      9.777     0.000   [506.212, 760.154]
   Engagement          0.0063        0.000       49.061    0.000   [0.006, 0.007]
 PV regression 2
   Constant (β0)       179.41        63.62       2.82      0.005   [54.67, 304.14]
   Followers           0.0002        3.71e-06    48.928    0.000   [0.000, 0.000]
   ER %                13670         995.04      13.736    0.000   [11700, 15600]

Fig. 3.   Weekly relative growth as a function of relative engagement.

Fig. 4.   Page views as a function of engagement.

of a nonlinear data pattern. Assuming it is not due to the small number of observations in the regression, the residual distribution appears to become increasingly negative, which implies that a nonlinear model assumption would perform better.
   Figure 6, which concerns the residual plot of the weekly growth regression, shows a slight decrease, indicating possible non-linearity. The higher the engagement rate becomes, the less of an effect it has on the weekly growth rate. Although this decrease is small, it is plausible that the correlation is logarithmic rather than linear.
   Figure 7 plots the residuals against fitted values of the page views regression. It also illustrates slight non-linear patterns in the data, as the residuals do not appear to be randomly scattered around the x-axis. The observed trend is clearly negative, indicating a heteroscedasticity problem in the predictor variable. Perhaps an assumed logarithmic relationship between the explanatory variable and the dependent variable would improve the model.

                         VII. DISCUSSION
A. Regressions
   Considering the organization's data set, a high correlation was found between the actual reach of an influencer and their

Fig. 5.   Residual vs. fitted values for actual reach

Fig. 6.   Residual vs. fitted values for weekly growth

Fig. 7.   Residual vs. fitted values for page views

shows a statistically significant correlation with high prediction accuracy between the amount of engagement the influencer gets and the number of people that view their sponsored content. Although the data set needs to be larger for the results to be more reliable, the high R-squared value of this regression shows promise for future models using these variables to be useful tools for predicting actual reach with narrow confidence intervals. These initial results open the door for a wide array of new methods to quantify the quality of an influencer and the value they can bring to an organization. The information can help advertisers estimate how much reach a particular influencer has as well as the expected return on capital. Additionally, this predictive capability could simplify the marketing process for companies, allowing them to reliably and accurately choose the most valuable influencers to collaborate with, while reducing personnel costs and avoiding unprofitable investments. On the same note, by quantifying influencer
engagement. The regressions on this data set also showed                  audience quality, the higher quality influencers can increase
that followers had a correlation to actual reach, however                 their bargaining power and easily differentiate themselves
significantly lower than the correlation to engagement,. This             using simple measurement numbers. This results in a more
indicates the predictive properties the engagement rate has on            transparent and efficient marketplace for both parties, even
actual reach. The results of the page views regressions indicate          benefiting the potential customers by only exposing them to
that influencers who have higher engagement rate in relation              sponsored content from trusted and transparent influencers.
to their followers also, to some extent, see more profile views.             Considering the page views regression, engagement factors
Lastly, the linear regression of the largest data set showed              had a slight predictive capability for amount of profile visits.
virtually no correlation between the weekly follower growth               Assuming that influencers who get a high amount of profile
rate of an influencer and their engagement rate.                          views will also have a higher amount of post views, using
   Continuing the analysis of the regression results, the p-              page views as a public predictive variable opens the possibility
values of coefficients were often very low. A p-value of                  for use of larger data sets and thus more reliable regressions.
0.000 indicates that the value is very low and that the null              However, there is speculated to be a dissonance between
hypothesis can therefore be rejected even with an alpha-value             the page view and actual reach variables, supported by the
of 0.01. Suggesting that there is a correlation between page              regression scores of this study. The correlation between page
views and the amount of followers and engagement rate,                    views and the actual reach of an account is a task for future
respectively. Since the confidence interval for the engagement            studies.
rate coefficient does not include 0, or the null hypothesis, it              On the other hand there was no correlation found between
means that we can reject the null hypothesis and with an alpha            the weekly follower growth rate of an influencer and the
value of 2.5% conclude that there is a correlation between                engagement rate of their account. We speculate that this is
engagement rate and the page views of an influencer. The                  due to the high prevalence of inorganic growth through buying
results are therefore statistically significant.                          fake followers. This can drastically increase the number of
   The linear regression using the organization’s data set,               followers an influencer has and lead to artificially high growth
rates that are not a result of having a high user activity on their account. These issues can
be mitigated through measuring the growth rate over a longer time frame, for example months,
rather than weeks. By doing this, even accounts that have inorganically acquired their
followers will see a more realistic growth rate as it is distributed over several weeks,
ultimately resulting in the regression not having to fit to extreme values and possibly
resulting in a higher correlation between growth and account activity. This further emphasises
the need for evaluating metrics beyond followers or growth of followers in order to assess the
actual reach of an account. Since it is common for organizations to use growth metrics and
even absolute followers as key variables for determining collaboration partners, these results
strongly indicate that other metrics must be used to actually predict the increased brand
awareness of a particular collaboration.

B. Industrial engineering perspective
   From the perspective of Porter’s five forces, the main insights revolved around the
bargaining power of suppliers and buyers. This study connects to increasing the bargaining
power of the advertising organization and puts pressure on influencers to compete with each
other and acquire the highest quality of audience possible. Since social media marketing is a
relatively new phenomenon and since it is difficult to track the quality and quantity of
traffic that a particular influencer gets, there is no clear established method of determining
the right pay for their advertisement services. This leads to organizations working to
maximize their bargaining power by developing methods to approximate and predict how much
reach their sponsored content will get. Assuming a certain conversion rate for purchases, one
can then calculate the rate of return required for a profitable advertising campaign and set
their pay accordingly, ultimately maximizing revenue while reducing cost through avoiding
unprofitable collaborations.²

   ² CFI, “Bargaining Power of Buyers”. [Online]. Available:
https://corporatefinanceinstitute.com/resources/knowledge/strategy/bargaining-power-of-buyers

C. Future research
   Suggestions for future research include further examining the connection between page views
and actual reach. This however requires the influencer to personally send their analytics to
the researchers, increasing the difficulty of collecting sufficient data for a machine
learning algorithm.
   Among the linear regressions performed, the sizes of the data sets varied significantly.
The data set used by the organization had fewer than 100 data points and, despite the results
showing a correlation between the dependent and the independent variables, further studies
using a larger number of data points are necessary. As more data points are added one can
examine the sample size at which the R-squared values of different regression iterations
converge, and at that point conclude that the data set is large enough for the result to be
reliable.
   For the linear regressions using larger data sets the quantity of data was considered
sufficient, but it is important to note concerns regarding quality. Since the data was
obtained using web scraping from third-party web services there is no way to fully verify the
validity of the information, although a sample data set had been cross-checked with other
third-party platforms. For future research it is recommended to create a web scraping program
that extracts all necessary information directly from the Instagram page to avoid these
validity issues. Also, considering these regressions, the residual plots indicate the
potential for non-linear models to perform better, which is worth examining in future studies.
   Another consideration when examining the growth variable in future research is to increase
the time interval over which the growth has taken place. Measuring on a weekly scale might
lead to misleading results, since there are many possible variables that can distort this
measurement in the shorter term.
   Also, as concluded in the residual versus fitted plots of the results section, there is a
possibility that nonlinear model assumptions could improve the performance of similar
regressions. We therefore suggest that future research investigate exponential and logarithmic
relationships between the performance metrics.
   As some of the previous work mentioned, a potential improvement of the study would be to
also consider the similarity between influencer content and brand as a factor for a campaign’s
success in increasing brand awareness. Previous studies also make the case for ensemble
learning methodology such as random decision forests for predicting actual reach, which could
improve the performance of the model. For such studies it is however necessary that a larger
number of input variables are assessed. Furthermore, it is important to note that this study
does not address the connection between a large, engaging audience and purchasing customers.
By tracking website traffic as a response to an influencer campaign the organization can
better understand the correlation between actual post reach of the sponsored content and the
resulting customer interactions with the organization.

VIII. CONCLUSION
   In this study three sets of Instagram user data were analyzed to model the actual reach of
their posts or indicators of actual reach. With the intention of increasing the predictive
abilities for organizations in influencer marketing collaborations, linear regressions were
executed and evaluated for different sets of metadata variables. Predicting the actual reach
of a post using the engagement measure of an account showed promising results from analyzing
the collaborations of a retail business; however, the scale of the data set was insufficient
for drawing any definite conclusions. These results nevertheless indicate a possible increased
ability for companies to calculate returns on social media marketing collaborations, reducing
the occurrence of unprofitable campaigns. A larger regression of growth rate to engagement
rate showed that growth, over that time frame, is not a good indicator of the interaction the
content of an account receives, which was speculated to be caused by accounts that grow their
follower base inorganically. The publicly accessible page views metric indicated a significant
correlation with engagement, while analyzing the correlation of page views to actual reach is
a task for future studies. This study is a
step towards creating a more transparent and fair marketplace
between advertisers and influencers on social media platforms.
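
   The convergence check suggested under Future research, adding data points until the
R-squared values of successive regression iterations stop changing, could be sketched roughly
as follows. This is a minimal illustration and not the procedure used in this study: the
function names, the tolerance `tol` and the window length are assumptions made for the example,
and the engagement and reach figures would have to come from a real data set.

```python
import numpy as np


def r_squared(x, y):
    """R-squared of a least-squares line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def r_squared_path(x, y, sizes):
    """R-squared of the regression refitted on the first n observations, for each n."""
    return [r_squared(x[:n], y[:n]) for n in sizes]


def converged_size(sizes, scores, tol=0.01, window=3):
    """First size from which `window` consecutive changes in R-squared stay below `tol`.

    Both `tol` and `window` are illustrative choices, not values from the study.
    """
    for i in range(len(scores) - window):
        steps = [abs(scores[j + 1] - scores[j]) for j in range(i, i + window)]
        if all(step < tol for step in steps):
            return sizes[i]
    return None  # no stable stretch found among the sizes examined
```

   Given arrays `engagement` and `reach` in random order, one would call
`r_squared_path(engagement, reach, sizes)` for an increasing list of sample sizes and pass the
result to `converged_size`; a return value of `None` suggests that more data is needed.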

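   The suggestion to investigate logarithmic relationships can be illustrated by fitting the
same least-squares line once on the raw explanatory variable and once on its logarithm, and
comparing the R-squared values. The sketch below uses synthetic data generated under an
assumed logarithmic relationship; the variable names, coefficients and noise level are
hypothetical and are not results from this study.

```python
import numpy as np


def fit_r_squared(feature, target):
    """R-squared of a least-squares line of target on feature."""
    slope, intercept = np.polyfit(feature, target, 1)
    fitted = slope * feature + intercept
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - np.mean(target)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic example: growth depends on the logarithm of the engagement rate.
rng = np.random.default_rng(1)
engagement_rate = rng.uniform(0.5, 20.0, size=500)  # percent, made-up range
weekly_growth = 2.0 * np.log(engagement_rate) + rng.normal(0.0, 0.3, size=500)

r2_linear = fit_r_squared(engagement_rate, weekly_growth)       # y ~ x
r2_log = fit_r_squared(np.log(engagement_rate), weekly_growth)  # y ~ log(x)

print(f"linear R-squared: {r2_linear:.3f}, log-linear R-squared: {r2_log:.3f}")
```

   On real data, a clearly higher R-squared for the log-transformed fit, together with a
residual plot without a visible trend, would support the logarithmic assumption; similar
values for both fits would not.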
REFERENCES
    [1] D.C. Montgomery, E. Peck, & G. Vining, “REGRESSION AND MODEL BUILDING” in
        Introduction to linear regression analysis, 5th ed., New York City, NY, USA,
        Wiley, 2012, pp. 21-28.
    [2] G.S. Handelman, H. Kok, R. Chandra, A. Razavi, S. Huang, M. Brooks, M. Lee, &
        H. Asadi, “Peering Into the Black Box of Artificial Intelligence: Evaluation
        Metrics of Machine Learning Methods”, American Journal of Roentgenology, vol. 212,
        no. 1, pp. 38-43, Jan. 2019, DOI: 10.2214/AJR.18.20224.
    [3] C. Stubb, A. Nyström, & J. Colliander, “Influencer marketing: The impact of
        disclosing sponsorship compensation justification on sponsored content
        effectiveness”, Journal of Communication Management, vol. 23, no. 2, pp. 109-122,
        May 2019, DOI: 10.1108/JCOM-11-2018-0119.
    [4] K. Peters, Y. Chen, A. Kaplan, B. Ognibeni, & K. Pauwels, “Social Media Metrics —
        A Framework and Guidelines for Managing Social Media”, Journal of Interactive
        Marketing, vol. 27, no. 4, pp. 281-298, 2013, DOI: 10.1016/j.intmar.2013.09.007.
    [5] M. Lazic & F. Eder, “Using Random Forest model to predict image engagement rate”,
        EECS, KTH, Stockholm, 2018. [Online]. Available:
        https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1215409dswid=3348
    [6] T. Sweet, A. Rothwell, & X. Luo, “Machine Learning Techniques for Brand-Influencer
        Matchmaking on the Instagram Social Network”, Cornell University, Ithaca, 2019.
        [Online]. Available: https://arxiv.org/abs/1901.05949
    [7] System and method for evaluating the true reach of social media influencers, by
        D. Sullivan, J. Heenan, & R. Akulshin, (2019, Oct 1) Patent Number WO2019010379
        [Online]. Available:
        https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2019010379
TRITA-EECS-EX-2021:354

www.kth.se