Degree Project in Technology, First Cycle, 15 credits
Stockholm, Sweden 2020

Analysis of Performance Measures affecting the economic success on the PGA Tour using multiple linear regression

JOHANNES HÖGBOM
AUGUST REGNELL

KTH School of Engineering Sciences
Analysis of Performance Measures affecting the economic success on the PGA Tour using multiple linear regression

Johannes Högbom
August Regnell

Degree Projects in Applied Mathematics and Industrial Economics (15 hp)
Degree Programme in Industrial Engineering and Management (300 hp)
KTH Royal Institute of Technology, 2020
Supervisors at KTH: Alessandro Mastrototaro, Julia Liljegren
Examiner at KTH: Sigrid Källblad Nordin
TRITA-SCI-GRU 2020:112
MAT-K 2020:013

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci
Abstract

This bachelor thesis examined the relationship between performance measures and prize money earnings on the PGA Tour. Using regression analysis and data from the 2004 through 2019 seasons retrieved from the PGA Tour website, this thesis examined whether prize money could be predicted. Starting with 102 covariates, comprehensively covering all aspects of the game, the model was reduced to 13 covariates, with Driving Distance being the most prominent. Favoring simplicity, the final model reached an $R^2_{Adj}$ of 0.6918 and was discussed with regard to relevance, reliability and usability.

This thesis further analyzed how the introduction of ShotLink, the technology responsible for the vast statistical database surrounding the PGA Tour, has affected golf in general and the PGA Tour in particular. Analyses of how ShotLink has affected golf on different levels, both for players and for other stakeholders, were conducted. These show developments on multiple levels: in how statistics are used, in golf-related technologies, broadcasts and the betting market, and for both amateur and PGA Tour golfers. The analysis of the latter, using statistics from the PGA Tour website, showed a significant improvement in scoring average since ShotLink's inception.
Analys av prestationsmått som påverkar den ekonomiska framgången på PGA-Touren med hjälp av multipel linjär regression

Sammanfattning

Detta kandidatexamensarbete undersökte relationen mellan prestationsmått och prispengar på PGA-Touren. Genom regressionsanalys och data från säsongerna 2004 till och med 2019, hämtade från PGA-Tourens hemsida, undersökte detta arbete om prispengar kunde predikteras. Startandes med 102 kovariat, täckandes alla aspekter av spelet, reducerades modellen sedan till 13 kovariat, med Driving Distance mest framträdande. Till förmån för simplicitet resulterade detta i ett $R^2_{Adj}$ på 0.6918. Den slutliga modellen diskuterades sedan gällande relevans, reliabilitet och användbarhet.

Vidare analyserar detta arbete hur ShotLinks entré, tekniken ansvarig för den omfattande statistikdatabas som omger PGA-Touren, har påverkat golf generellt och PGA-Touren specifikt. Analyser gällande hur ShotLink har påverkat golf på olika nivåer, både för spelare och andra intressenter, genomfördes. Dessa visar utvecklingar på flera fronter: hur statistik används, golfrelaterade teknologier, mediasändningar, bettingmarknaden samt både amatörspelare och spelare på PGA-Touren. Den senare analysen, genom användande av statistik från PGA-Tourens hemsida, visade på en signifikant förbättring i genomsnittsscore sedan ShotLink infördes.
Contents

1 Introduction
  1.1 Background
  1.2 Aim
  1.3 Research Questions
  1.4 Limitations
2 Theoretical Framework
  2.1 Multiple Linear Regression
    2.1.1 Assumptions
    2.1.2 Ordinary Least Squares
  2.2 Model Errors
    2.2.1 Multicollinearity
    2.2.2 Endogeneity
    2.2.3 Heteroscedasticity
  2.3 Variable Selection
    2.3.1 All possible regressions
    2.3.2 Stepwise regression methods
  2.4 Model Validation
    2.4.1 Normality
    2.4.2 Cook's Distance
    2.4.3 Hypothesis testing
      2.4.3.1 Shapiro-Wilk test
      2.4.3.2 F-statistic
      2.4.3.3 t-statistic
      2.4.3.4 Confidence Interval
    2.4.4 Transformation
      2.4.4.1 Heuristic approach
      2.4.4.2 Box-Cox Transformation
    2.4.5 $R^2$ and $R^2_{Adj}$
    2.4.6 AIC and BIC
    2.4.7 Mallow's $C_p$
    2.4.8 VIF
    2.4.9 Cross Validation
3 Methodology
  3.1 Data Collection
    3.1.1 Pre-Processing
    3.1.2 A first look at the data
  3.2 Variable Selection
    3.2.1 Largest possible model and initial transformations
    3.2.2 Stepwise regression
    3.2.3 All possible regressions
  3.3 Promising Models
    3.3.1 Model Validation
    3.3.2 Variable Selection
    3.3.3 Multicollinearity
    3.3.4 Possible transformations
    3.3.5 Cross Validation
4 Results
  4.1 Final Model
  4.2 Impact of Performance Measures
    4.2.1 Example: Tiger Woods
    4.2.2 Example: Joe Meanson
5 Analysis & Discussion
  5.1 Analysis of Final Model
  5.2 Data Pre-Processing
  5.3 Handling of Outliers
  5.4 Model Development
6 Industrial Engineering and Management Part
  6.1 Introduction
    6.1.1 Background
    6.1.2 Aim and research questions
  6.2 Analysis & Discussion
    6.2.1 Effect on product Golf
    6.2.2 Effect on players
    6.2.3 Effect on betting market
7 Conclusion
8 References
Appendices
1 Introduction

1.1 Background

The PGA Tour is the organizer of the main professional golf tours played by men in North America. These tours, including the PGA Tour, PGA Tour Champions, Korn Ferry Tour, PGA Tour Canada, PGA Tour Latinoamérica, and PGA Tour China, are annual series of golf tournaments. We have chosen to include only the PGA Tour, as it is the largest tour in financial terms, has the largest viewership, and arguably also the best golf players.

The most common tournament format on the tour is stroke play, where each player's total number of strokes is counted over four rounds played from Thursday to Sunday. When the rounds are finished, the player with the fewest strokes taken wins. A tie is usually settled by the leaders playing a playoff, consisting of a selected number of holes until one of them gets a better score. Typically, 72-hole tournaments also enforce a cut after 36 holes, where the cut line is set at the 65th lowest-scoring professional. If more than 70 players make the cut, due to several players having the same score as the 65th best player, there will usually be another cut after 54 holes with the same rules.[1]

Prize money is paid out to all players making the cut and further depends on the player's final placement. Finishing first usually gives a payout of $630,000 to $2,250,000 (18% of the total prize pool), while finishing 65th gives a payout of $7,525 to $26,875 (0.215% of the total prize pool).[2] As one can see in figure 1, a player's share of the prize pool grows roughly exponentially as the finishing position improves, which will be taken into account later in the project.
Figure 1: Prize money distribution for the PGA Tour

For many players prize money is a major part of their income, making it essential for them to stay on the PGA Tour. As the costs of coaching, training, equipment, travel and housing can be large, an understanding of how much money one is predicted to earn can be vital when planning for a season ahead; lowering the standard of any of these may directly result in worse performance. However, for famous and very successful players, such as Tiger Woods, sponsorships also make up a large part of total income.

With the help of ShotLink, the PGA Tour records more than a hundred performance measures per player. Some of the best-known measures are:

• Driving distance - how far the ball travels from the tee when driven, usually measured on two holes per round.

• Driving accuracy - the percentage of drives ending up on the fairway.

• Sand saves - the percentage of times a player takes two or fewer strokes to hole the ball from a greenside bunker.

• Greens in regulation - the percentage of holes on which the player reaches the surface of the green in at least two strokes fewer than par.

• Scrambling - the percentage of times a player misses the green in regulation but still makes par or better.
1.2 Aim

This project aims, through the use of multiple linear regression analysis with ordinary least squares (OLS), to determine which performance measures drive economic success on the PGA Tour. Further, answering to what extent the different performance measures drive success could be greatly beneficial for individual players: given where a player stands on the performance measures, training can be focused where it yields the largest impact on income.

Moreover, from an industrial engineering and management standpoint, this project aims to investigate the effects of the introduction of ShotLink technology from different perspectives. Further information regarding this part of the report is found in section 6.

1.3 Research Questions

The research questions are thus:

1. To what extent do different performance measures drive economic success on the PGA Tour?

And three further questions belonging to the industrial engineering and management part:

2. How has the introduction of ShotLink technology affected the product Golf?

3. How have the players, as products, been affected by the introduction of ShotLink technology?

4. What has the increase in information and its availability supplied by ShotLink meant for the betting market?

1.4 Limitations

The more successful a player has been during a period, the more data is available for that period. Players first need to play well enough to be included in the lineup of a PGA Tour event, and thereafter succeed in reaching the latter rounds of the event. Further, as we use prize money as our response variable, all players who have not won any prize money are excluded. Finally, the difficulty of extracting data from the PGA Tour's website has limited the amount of data used for this report, which may have an impact on the results.
2 Theoretical Framework

2.1 Multiple Linear Regression

Multiple Linear Regression (MLR) is a method used to investigate and model the relationship between a dependent response variable and multiple explanatory variables. The statistical technique models a linear relationship between the dependent variable y and the independent variables $x_j$, $j = 1, 2, \dots, k$. For observation $i$ the relationship can be described as

$$y_i = \sum_{j=0}^{k} x_{ij}\beta_j + \varepsilon_i \qquad (1)$$

where the $\beta_j$ are referred to as regression coefficients, $\beta_0$ being the y-intercept (constant term) and $\beta_j$, $j \neq 0$, the slope coefficients; they are estimated through the regression procedure. The independent variables $x_{ij}$, also known as covariates, and the dependent response variable $y_i$ are given, while the residuals $\varepsilon_i$ are stochastic variables.[3] With n observations and k covariates the model, in matrix notation, is described as follows

$$Y = X\beta + \varepsilon \qquad (2)$$

where

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \qquad (3)$$

2.1.1 Assumptions

The classical linear regression model (CLR) is based on five assumptions.[4]

1. The relationship between the response y and the regressors $x_j$ is linear, plus an error term ε, and the coefficients form the constant vector β, as in (2).

2. The error term ε has mean zero; i.e. $E(\varepsilon_i) = 0$ for all i.

3. The error term ε has constant variance $\sigma^2$ and any two error terms are uncorrelated; i.e. $Var(\varepsilon_i) = \sigma^2$ and $Corr(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$.
4. The observations of the independent variables are fixed in repeated samples; i.e. resampling with the same independent variable values is possible.

5. The number of observations n exceeds the number of independent variables k, and there are no exact linear relationships between the $x_j$'s.

Further, to construct confidence intervals and perform hypothesis tests the CLR is extended by assumption 6; the extended model is referred to as classical normal linear regression (CNLR).[3][4]

6. The errors ε, and thus the dependent variable y, are normally distributed.

A model should always be validated with regard to these assumptions, as model inadequacies can have serious consequences.

2.1.2 Ordinary Least Squares

Ordinary Least Squares (OLS) is a method of estimating the regression coefficients $\beta_j$; the estimates are denoted $\hat\beta_j$. The coefficients are determined by the principle of least squares and minimize the sum of squared residuals ($SS_{Res}$, i.e. $\min\, e^t e = \|e\|^2$). Thus $\hat\beta$ is obtained as a solution to the normal equations

$$X^t e = 0 \qquad (4)$$

where

$$e = Y - X\hat\beta \qquad (5)$$

Inserting (5) into (4) gives

$$X^t (Y - X\hat\beta) = 0$$
$$X^t Y - X^t X\hat\beta = 0 \qquad (6)$$
$$X^t X\hat\beta = X^t Y$$
$$\hat\beta = (X^t X)^{-1} X^t Y$$

The OLS estimator $\hat\beta$ is the Best Linear Unbiased Estimator (BLUE) of β, i.e. $E[\hat\beta] = \beta$, and it minimizes the variance of any linear combination of the estimated coefficients.[3]
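As a small illustration of the normal equations, the R sketch below (using simulated data rather than the thesis data) computes $\hat\beta = (X^tX)^{-1}X^tY$ directly and compares it with the estimate from R's built-in lm(), which is the routine used later in this project.

```r
# Minimal sketch: OLS via the normal equations, checked against lm().
# The data is simulated and stands in for the PGA Tour data used in the thesis.
set.seed(1)
n <- 200
X <- cbind(1, matrix(rnorm(n * 3), ncol = 3))   # design matrix with intercept column
beta <- c(2, 0.5, -1, 0.3)
y <- as.vector(X %*% beta + rnorm(n, sd = 0.5))

beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # beta_hat = (X^t X)^(-1) X^t y
fit <- lm(y ~ X[, -1])                          # lm() adds its own intercept column
cbind(normal_equations = drop(beta_hat), lm = coef(fit))
```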
2.2 Model Errors

2.2.1 Multicollinearity

If there is no linear relationship between the independent variables, i.e. the $x_j$'s are orthogonal, inference is made with ease. Most frequently, however, they are not orthogonal, and in some situations they are nearly perfectly linearly related; i.e. one covariate can be explained through a linear combination of the others. When such relationships exist among the covariates, multicollinearity is said to exist. There are two main problems with multicollinearity:[3]

1. It increases the variances and covariances of the estimates $\hat\beta_j$.

2. It tends to enlarge the magnitudes of the estimates $\hat\beta_j$.

For 1, consider the bivariate case where $X^tX$ and $X^tY$ are in correlation form. Since $Var(\hat\beta_j) = \sigma^2 ((X^tX)^{-1})_{jj}$, where $((X^tX)^{-1})_{jj} = \frac{1}{1 - r_{12}^2}$, the variance diverges as $|r_{12}| \to 1$.

For 2, we see that the expected squared distance from $\hat\beta$ to $\beta$ increases with collinearity, as

$$E\big[\|\hat\beta - \beta\|^2\big] = \sum_j E\big[(\hat\beta_j - \beta_j)^2\big] = \sum_j Var(\hat\beta_j) = \sigma^2\,\mathrm{tr}\big((X^tX)^{-1}\big)$$

and therefore the magnitudes of the $\hat\beta_j$'s are overestimated, since $E\big[\|\hat\beta\|^2\big] = \|\beta\|^2 + \sigma^2\,\mathrm{tr}\big((X^tX)^{-1}\big)$.

2.2.2 Endogeneity

Endogeneity arises when there exists correlation between the error term and one or more covariates, effectively violating assumption 2 and yielding inconsistent results from the OLS regression. This mainly occurs under two conditions: when important variables are omitted (referred to as "omitted variable bias") and when the dependent variable is a predictor of x and not simply a response (referred to as "simultaneity bias").[3]

2.2.3 Heteroscedasticity

Heteroscedasticity means non-constant variance around mean zero among the error terms ε, effectively breaking assumptions 2 and 3. This affects the standard deviations and significance levels of the estimated $\hat\beta$'s and reduces the validity of hypothesis testing.
By plotting the residuals versus the fitted values and observing patterns indicating non-constant variance over the fitted values (e.g. a cone shape), heteroscedasticity can be detected; it can be dealt with by, for example, adding variables or transforming existing variables to improve homoscedasticity.[3]

2.3 Variable Selection

With a large number of possible regressors, most likely of varying levels of importance, methods for reducing the number of regressors are used. Further, even though it does not guarantee elimination of multicollinearity, variable selection is the most frequently used technique for handling problems where multicollinearity is present to a high degree. We face conflicting goals: wanting a model containing many regressors to maximize the information content when estimating the response, but also wanting a simple model to reduce the variance in the prediction of ŷ and lower the cost of data collection and model building. Variable selection methods are used to balance these goals.[3]

2.3.1 All possible regressions

The procedure referred to as all possible regressions fits all regression equations using 1 to k (all candidate) regressors. One can then evaluate the different equations using characteristics such as $R^2$, $R^2_{Adj}$ and Mallow's $C_p$ (covered later in the report). This is possible when k is small, but the number of possible regression equations increases rapidly, as there are $2^k$ possible equations. Therefore, for k larger than about 20-30 it is infeasible with today's computing power, and other methods are used to decrease the number of candidates until all possible regressions becomes feasible.[3]

2.3.2 Stepwise regression methods

There are three stepwise regression method categories:[3]

1. Forward selection - a procedure that starts with zero regressors and adds one at a time, based on which regressor has the largest correlation with the response y given the previously added regressors. Computationally this selection is made using the F-statistic (covered later in the report): at each step the regressor yielding the largest partial F-statistic (the F-statistic comparing the model with and without that regressor) is added, until this statistic no longer exceeds a preselected cut-off value.
2. Backward elimination - a procedure that starts with all regressors and removes one at a time, based on the smallest partial F-statistic, until that statistic exceeds a preselected value.

3. Stepwise regression - a modification of the forward selection procedure, where at each step the previously added regressors are re-evaluated based on their partial F-statistics (e.g. the second regressor added may be redundant given the five added thereafter). For stepwise regression there are two preselected cut-off values, one for entering (as in forward selection) and one for removal (as in backward elimination).

In this report we use stepwise regression.

2.4 Model Validation

2.4.1 Normality

As mentioned in assumption 6, the residuals need to be normally distributed for confidence intervals and hypothesis tests to be valid. One way to visualize potential normality problems is with a normal quantile-quantile plot (normal Q-Q plot). By plotting the standardized residuals against the theoretical quantiles, normality problems show up as deviations from the straight line. Figure 2 shows a case where normality is present (left) and one where it is not (right). Further, a hypothesis test for normality in a sample is described under the section Hypothesis testing.[3]

Figure 2: Normal Q-Q plot
2.4.2 Cook's Distance

When validating a model, diagnostics of influential points, with regard to both leverage and influence, are important. Points with high leverage, i.e. points with an unusual independent-variable value, illustrated by point A in figure 3, will not affect the value of $\hat\beta$ but will affect statistics such as $R^2$. Points with high influence, i.e. points with an unusual relation between the independent and dependent variables, illustrated by point B in figure 3, will affect $\hat\beta$ as they pull the model in their direction.[3]

Figure 3: Illustration of points of leverage and influence

A concept introduced by Ralph Dennis Cook in 1977 is Cook's distance, or Cook's D.[5] Often denoted $D_i$, it is used to detect influential outliers among predictor variables. By removing the ith data point and recalculating the regression, it illustrates how much the removal affects the regression model. The formula for Cook's D is

$$D_i = \frac{(\hat y_{(i)} - \hat y)^t (\hat y_{(i)} - \hat y)}{p\,MS_{Res}} \qquad (7)$$

and can be rewritten as

$$D_i = \frac{r_i^2}{p}\,\frac{h_{ii}}{1 - h_{ii}}, \qquad i = 1, 2, \dots, n \qquad (8)$$

Hence, disregarding the constant term p, $D_i$ depends on the square of the ith studentized residual and on the ratio $\frac{h_{ii}}{1 - h_{ii}}$, the distance from the predictor vector $x_i$ to the centroid of the remaining data. By combining residual size and location in x-space, Cook's distance is a suitable tool for detecting outliers, with values $D_i > 1$ usually considered influential.[3]
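A minimal R sketch of this diagnostic is given below; here 'fit' stands for any fitted lm object (for instance the one from the OLS sketch above), and the cut-offs shown are the usual rules of thumb rather than values taken from the thesis.

```r
# Flagging influential observations with Cook's distance for a fitted lm() model.
d <- cooks.distance(fit)
plot(d, type = "h", ylab = "Cook's distance")
abline(h = 1, lty = 2)              # the conservative cut-off D_i > 1 mentioned above
which(d > 1)                        # observations flagged as influential
which(d > 4 / length(d))            # a softer 4/n rule of thumb, flags more candidates
```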
2.4.3 Hypothesis testing

Hypothesis testing is a procedure for testing the plausibility of a hypothesis and involves the following steps:[6]

1. State the null hypothesis $H_0$ and the alternative hypothesis $H_1$.

2. Determine the test size, i.e. decide on a one- or two-tailed test and define the level of significance α under which $H_0$ is rejected.

3. Compute the appropriate test statistic and derive its distribution. This step involves the construction of confidence intervals.

4. From 2 and 3 a decision is then made, either rejecting or accepting $H_0$.

2.4.3.1 Shapiro-Wilk test

A Shapiro-Wilk test tests $H_0$ that a sample $x_1, \dots, x_n$ is normally distributed. The test statistic is

$$W = \frac{\big(\sum_{i=1}^{n} a_i x_{(i)}\big)^2}{\sum_{i=1}^{n} (x_i - \bar x)^2} \qquad (9)$$

where $x_{(i)}$ is the ith order statistic, i.e. the ith smallest number in the sample of $x_i$'s, and $\bar x$ is the sample mean. Further, the coefficients $a_i$ are given by

$$(a_1, \dots, a_n) = \frac{m^t V^{-1}}{C}$$

where C is the Euclidean norm $C = \|V^{-1} m\|$, the vector m consists of the expected values of the order statistics of i.i.d. random variables sampled from the standard normal distribution, and V is the covariance matrix of those normal order statistics. For a Shapiro-Wilk test $H_0$ is that the sample is normally distributed, and thus if the p-value is less than the chosen significance level α, $H_0$ is rejected.[7]

2.4.3.2 F-statistic

By definition $SS_R$ (sum of squares due to regression) $= \sum_{i=1}^{n} (\hat y_i - \bar y)^2$, and under the assumption that $Var(\varepsilon) = \sigma^2 I$, $\frac{SS_R}{\sigma^2}$ follows a (non-central) $\chi^2$ distribution with k degrees of freedom and non-centrality parameter $\lambda = \frac{1}{\sigma^2}\,\beta_R^t X_C^t X_C \beta_R$, where C indicates that the matrix X is reduced by removing the column of ones and thereafter centered, and R indicates that β is reduced by removing the intercept $\beta_0$.
Further, by definition $SS_{Res}$ (sum of squared residuals) $= \sum_{i=1}^{n} (y_i - \hat y_i)^2$, and $\frac{SS_{Res}}{\sigma^2}$ follows a $\chi^2$ distribution with $(n - p)$ degrees of freedom.

An F-statistic is the ratio of two independent $\chi^2$ random variables, each divided by its degrees of freedom. As $\frac{SS_R}{\sigma^2}$ and $\frac{SS_{Res}}{\sigma^2}$ are independent under the assumption $Var(\varepsilon) = \sigma^2 I$, the F-statistic is formed by

$$F_0 = \frac{SS_R / (k\sigma^2)}{SS_{Res} / ((n-p)\sigma^2)} = \frac{MS_R}{MS_{Res}} \sim F_{k,\,n-p} \qquad (10)$$

and $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ is rejected if $F_0 > F_{\alpha,k,n-p}$; the F-test can thus be seen as a test of the significance of the regression.[3] This result is often summarized in the ANOVA (analysis-of-variance) table seen in table 1.

Table 1: ANOVA table

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F0
Regression             SS_R              k                     MS_R           MS_R / MS_Res
Residual               SS_Res            n - p                 MS_Res
Total                  SS_T              n - 1

For the partial F-statistic referred to under the section Variable Selection, the statistic is instead formed by

$$F = \frac{\big(SS_R^{\text{full}} - SS_R^{\text{reduced}}\big) / \Delta p}{MS_{Res}^{\text{full}}} \sim F_{\Delta p,\, n-p} \qquad (11)$$

where Δp is the difference in the number of regressors, p is the number of regressors in the full (non-reduced) model, and $H_0$ is in this case that the Δp regressors under consideration are zero.[8]

2.4.3.3 t-statistic

After the F-test we have determined that at least one of the regressors is important, and the logical next step is to ask which ones. The hypothesis test for the significance of an individual regressor is the t-test. The hypotheses are $H_0: \beta_j = 0$ and $H_1: \beta_j \neq 0$, and if $H_0$ is rejected the regressor j stays in the model. The test statistic is defined as

$$t_0 = \frac{\hat\beta_j}{\sqrt{\hat\sigma^2 (X^tX)^{-1}_{jj}}} \sim t_{n-p} \qquad (12)$$

since $SE(\hat\beta_j) = \sqrt{Var(\hat\beta_j)} = \sqrt{\sigma^2 (X^tX)^{-1}_{jj}}$, which can be estimated using $\hat\sigma^2 = MS_{Res}$.[3]
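In R, both the overall F-test and the partial F-test of equation (11) are available without computing the sums of squares by hand. The sketch below uses simulated data; x3 has no real effect, so the partial test comparing the nested models should not be significant.

```r
# Overall F-test (equation 10) and partial F-test (equation 11) via nested lm() fits.
set.seed(2)
n  <- 150
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1 + x2)          # drops x3, so delta-p = 1
summary(full)$fstatistic            # overall F-statistic with its degrees of freedom
anova(reduced, full)                # partial F-test of H0: the dropped coefficient is zero
```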
2.4.3.4 Confidence Interval

Under the assumption that the errors $\varepsilon_i$ are i.i.d. normally distributed with mean zero and variance $\sigma^2$, confidence intervals (CI) can be constructed for the regression coefficients $\beta_j$, $j = 0, 1, \dots, k$. As $\hat\beta \sim N(\beta, \sigma^2 (X^tX)^{-1})$, the t-statistic for $\beta_j$ equals

$$t_0 = \frac{\hat\beta_j - \beta_j}{\sqrt{\hat\sigma^2 (X^tX)^{-1}_{jj}}} \sim t_{n-p} \qquad (13)$$

where $\hat\sigma^2 = MS_{Res}$. Based on (13), a 100(1 - α) percent CI for $\beta_j$ is defined as

$$\hat\beta_j \pm t_{\alpha/2,\,n-p}\,\sqrt{\hat\sigma^2 (X^tX)^{-1}_{jj}} \qquad (14)$$

[3]

2.4.4 Transformation

The basic CLR assumptions 2 and 3, regarding the model errors having mean zero, constant variance and being uncorrelated, can be shown to be violated during model adequacy checks, indicating heteroscedasticity. In this case data transformations may be used as a tool to improve homoscedasticity. Both the regressors and the response can be subject to transformations, and the transformations may be chosen heuristically, by trial and error, or by analytical procedures.[3]

2.4.4.1 Heuristic approach

By observing the relationship between the response y and one of the regressors $x_j$, a pattern may suggest a certain transformation of the regressor. In figure 4 the pattern suggests the transformation x' = 1/x, which linearizes the relationship, as seen in figure 5.
Figure 4: Before transformation
Figure 5: After transformation

2.4.4.2 Box-Cox Transformation

An analytical approach is the Box-Cox transformation, which aims to transform the response y using the power transformation $y^\lambda$, where λ is the parameter determined by the Box-Cox method. George Box and Sir David Cox showed how the regression coefficients and λ can be estimated simultaneously by maximum likelihood.[9]

Problems arise, for obvious reasons, as λ approaches zero. Thus the appropriate procedure is to use

$$y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda \dot y^{\lambda - 1}}, & \lambda \neq 0 \\[2ex] \dot y \ln(y), & \lambda = 0 \end{cases} \qquad (15)$$

where $\dot y = \big(\prod_{i=1}^{n} y_i\big)^{1/n}$ is the geometric mean of the observations, to fit the model $Y^{(\lambda)} = X\beta + \varepsilon$. The λ found through the maximum-likelihood procedure is the λ for which the residual sum of squares $SS_{Res}(\lambda)$ is minimized. Further, an estimate of the confidence interval can be found, which is beneficial in two cases: primarily, if the λ found is for example 1.496 but 1.5 is part of the CI, one might opt for 1.5 to simplify calculations; and secondly, if λ = 1 is part of the CI, a transformation may not be necessary. Such an interval consists of the λ that satisfy

$$L(\hat\lambda) - L(\lambda) \le \frac{1}{2}\chi^2_{\alpha,1} \qquad (16)$$

where $\hat\lambda$ is the ML estimate.[3]
2.4.5 $R^2$ and $R^2_{Adj}$

Another way to evaluate model adequacy is the coefficient of determination $R^2$, defined as

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T} \qquad (17)$$

It is thus often referred to as the proportion of variation explained by the regression, and it follows that $0 \le R^2 \le 1$, where 1 indicates that the regression model explains all variation in the response, and vice versa. However, when comparing and evaluating different models with varying numbers of regressors, $R^2$ is biased. Moreover, $R^2$ never decreases as more regressors are added and thus generally favors models with more regressors; by relying solely on $R^2$ one risks adding more variables than necessary and overfitting the model. Therefore the use of $R^2_{Adj}$ is favored, defined as

$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)} \qquad (18)$$

Hence, $R^2_{Adj}$ will only increase if adding variables reduces the residual mean square ($MS_{Res}$); it penalizes the addition of unhelpful regressors, effectively making $R^2_{Adj}$ a better tool for comparing competing models of different complexity.[3]

2.4.6 AIC and BIC

The Akaike Information Criterion (AIC) estimates the out-of-sample prediction error and thereby gives the relative quality of the statistical model given the data. It is defined as

$$AIC = 2p - 2\ln(\hat L) \qquad (19)$$

where p is the number of estimated parameters in the model and $\hat L$ is the maximum value of the likelihood function for the model. The lower the score, the better the model, as it effectively maximizes the expected information. However, when the sample size is small, AIC is more likely to overfit, which the corrected criterion $AIC_c$ was developed to address. It is defined as

$$AIC_c = AIC + \frac{2p^2 + 2p}{n - p} \qquad (20)$$

where n is the sample size. Note that $AIC_c$ converges to AIC as $n \to \infty$.
There are also various Bayesian Information Criteria (BIC) that impose greater penalties for adding regressors as the sample size increases; one is the Schwarz criterion, defined as

$$BIC_{Sch} = p\ln(n) - 2\ln(\hat L) \qquad (21)$$

[3]

2.4.7 Mallow's Cp

Mallow's $C_p$ presents a criterion based on variance and bias and is defined as

$$C_p = \frac{SS_{Res}(p)}{\hat\sigma^2} - n + 2p \qquad (22)$$

It can be shown that if the p-term model has zero bias, the expected value of $C_p$ equals p. Thus, visualizing different models by their $C_p$, as in figure 6, can aid in model selection. For models above the line $C_p = p$, bias is present. However, some bias may be accepted in return for a simpler model, and in such a case model C may be preferred over model A in figure 6, although C includes some bias.[3]

Figure 6: Mallow's Cp
2.4.8 VIF

To deal with the problem of multicollinearity, or rather to diagnose a model with respect to it, variance inflation factors (VIF) can be used. The VIF for a regressor indicates the combined effect of the dependencies among the regressors on the variance of that term. Hence one or more large VIFs (experience suggests that values exceeding 5-10 are large) indicate the presence of multicollinearity and thus a poorly estimated regression model. Comparing VIFs between models, as well as between a reduced and a non-reduced model, is therefore desirable for evaluation. The VIFs are mathematically defined as the diagonal elements of the $(X^tX)^{-1}$ matrix in correlation form, i.e. $VIF_j = \frac{1}{1 - r_j^2}$, where $0 \le |r_j| \le 1$; $|r_j|$ will be close to one, inflating $VIF_j$, when the regressor j is nearly linearly dependent on the other regressors.[3]

2.4.9 Cross Validation

One method of estimating the prediction error of a model is cross validation, a so-called resampling method. When cross validation is used, the dataset is divided into two sets: the training set and the testing set. The model is fitted with the training set, and the testing set is used as "new" data to estimate the prediction error. Repeated k-fold cross validation uses k equally large subsets (divisions of the original observations), each of which in turn serves as the testing set so that an average prediction error can be calculated. This is repeated m times with different divisions of the original dataset, leading to $k \times m$ cross validations per model.[3]
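A minimal sketch of the VIF diagnostic is shown below using car::vif(); the data is simulated so that one regressor is nearly collinear with another. (A cross-validation sketch is given later, in connection with section 3.3.5.)

```r
# Variance inflation factors for a fitted model; x2 is built to be nearly collinear
# with x1, so its VIF (and x1's) should clearly exceed the 5-10 rule of thumb.
library(car)
set.seed(3)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
x3 <- rnorm(100)
y  <- 1 + x1 + x3 + rnorm(100)
vif(lm(y ~ x1 + x2 + x3))
```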
3 Methodology

3.1 Data Collection

The data was collected from the PGA Tour's website for the years 2004 to 2019.[10] The performance measures of 40 players per year were extracted, resulting in a total of 640 data points. The data extracted from the website was saved as PDFs, one per player and season, and Visual Basic was used, via Excel, to transform them into an R-compatible CSV file.

3.1.1 Pre-Processing

Firstly, our response, season prize money, was adjusted for US inflation, making it possible to compare data points from different years. Secondly, after the data extraction was done, there were 102 performance measures per data point. Through a qualitative variable selection, the number of performance measures shrank from 102 to 56 and the number of observations from 640 to 581; the reasons for this are further discussed in section 5.2.

3.1.2 A first look at the data

To get an initial grasp of the predictors that might be included in the final model, all predictors were plotted individually against Official Money and their correlations were calculated. This heuristic approach pointed to several promising predictors: long approaches (several different predictors), Greens In Regulation Percentage and Driving Distance. These predictors all had an absolute correlation with Official Money in the range 0.4-0.6.
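The sketch below outlines this pre-processing and screening step. The file names and column names (OfficialMoney, EventsPlayed, Year and a CPI table) are hypothetical stand-ins, since the CSV built from the PGA Tour PDFs is not reproduced here.

```r
# Hypothetical pre-processing sketch: inflation-adjust prize money and screen
# predictors by their correlation with the response.
pga <- read.csv("pga_2004_2019.csv")              # one row per player and season (assumed)
cpi <- read.csv("us_cpi.csv")                     # assumed columns: Year, CPI

# Express all prize money in 2019 dollars
pga$CPI   <- cpi$CPI[match(pga$Year, cpi$Year)]
pga$Money <- pga$OfficialMoney * cpi$CPI[cpi$Year == 2019] / pga$CPI

# Drop incomplete rows and rank predictors by absolute correlation with prize money
pga   <- na.omit(pga)
preds <- setdiff(names(pga), c("Player", "Year", "OfficialMoney", "Money", "CPI"))
corrs <- sapply(preds, function(v) cor(pga[[v]], pga$Money))
head(sort(abs(corrs), decreasing = TRUE), 10)     # the most promising predictors
```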
Figure 7: Plot of Greens In Regulation Percentage against Official Money

However, it should be noted that this process in no way measured any form of multicollinearity. It is thus quite possible that these initially promising predictors will not be included in the final model.

3.2 Variable Selection

D. C. Montgomery, E. A. Peck, and G. G. Vining present a strategy for variable selection and model building, which was used in this project. It is as follows:[3]

1. Fit the largest model possible to the data.

2. Perform a thorough analysis of this model.

3. Determine if a transformation of the response or of some of the regressors is necessary.
4. Determine if all possible regressions are feasible.

• If all possible regressions are feasible, perform all possible regressions, using suitable criteria to rank the best subset models.

• If all possible regressions are not feasible, use stepwise selection techniques to generate the largest model such that all possible regressions become feasible. Then perform all possible regressions as outlined above.

5. Compare and contrast the best models recommended by each criterion.

6. Perform a thorough analysis of the "best" models, usually 3-5 models.

7. Explore the need for further transformations.

8. Discuss with subject-matter experts the relative advantages and disadvantages of the final set of models.

Furthermore, for step 6 in the list above, the authors suggest the following sub-steps:[3]

1. Are the usual diagnostic checks for model adequacy satisfactory? For example, do the residual plots indicate unexplained structure or outliers, or are there one or more high-leverage points that may be controlling the fit? Do these plots suggest other possible transformations of the response or of some of the regressors?

2. Which equations appear most reasonable? Do the regressors in the best model make sense in light of the problem environment? Which models make the most sense from subject-matter theory?

3. Which models are most usable for the intended purpose? For example, a model intended for prediction that contains a regressor unobservable at the time the prediction needs to be made is unusable. Another example is a model that includes a regressor whose cost of collection is prohibitive.

4. Are the regression coefficients reasonable? In particular, are the signs and magnitudes of the coefficients realistic and the standard errors relatively small?

5. Is there still a problem with multicollinearity?
3.2.1 Largest possible model and initial transformations

As a first step after collecting and processing the data, a model consisting of all 56 predictors was fitted to the response variable with the built-in R function lm(). This initial model had an $R^2_{Adj}$ value of 0.589, worrying residuals and high VIFs (several over 10, the highest being Rough Proximity's at 122), which hinted that variable selection and some transformations were in order.

Firstly, as discussed earlier in the report, the prize money on the PGA Tour grows exponentially with ranking, which is why we took the logarithm of the response Official Money. Secondly, since the number of events played by a player most likely acts as a multiplicative factor in their total earnings, we also took the logarithm of the predictor Events Played. Lastly, potential outliers were identified using measures such as Cook's distance, leverage and residuals. The outliers were removed, which resulted in a model with $R^2_{Adj}$ = 0.786 and the residuals presented in figure 8.

Figure 8: Residuals of initial model vs their respective fitted values

The boxcox() function in the EnvStats library was used to check whether any further transformation of the response was necessary. As λ = 1 was outside the confidence interval given by the Box-Cox optimization, a transformation was performed.
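A sketch of this step is given below, continuing the hypothetical pga data frame from the pre-processing sketch; MASS::boxcox() is used here as a stand-in for the EnvStats boxcox() call mentioned in the text, and the column names are again assumptions.

```r
# Full model on the log-transformed response, plus a Box-Cox check of whether a
# further power transformation of log(Money) is needed.
pga$logMoney  <- log(pga$Money)
pga$logEvents <- log(pga$EventsPlayed)            # EventsPlayed is an assumed column name

full <- lm(logMoney ~ . - Money - OfficialMoney - CPI - Player - Year - EventsPlayed,
           data = pga)                            # all remaining columns as predictors
summary(full)$adj.r.squared

library(MASS)
bc <- boxcox(full, lambda = seq(0.5, 4, by = 0.05))   # profile log-likelihood (cf. figure 9)
bc$x[which.max(bc$y)]                                 # lambda maximizing the likelihood
```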
Figure 9: Log-likelihood graph for the Box-Cox parameter λ with the 95% confidence interval shown

After the Box-Cox transformation the residuals of the fitted model looked as follows:

Figure 10: Residuals of initial model vs their respective fitted values after Box-Cox transformation with λ = 2.57
3.2.2 Stepwise regression

Stepwise regression was performed with the help of the olsrr function ols_step_both_p(). The significance level for an entering variable was set to α = 0.05, and the significance level for an exiting variable to α = 0.01, which resulted in 20 predictors. Noteworthy is that the reduced model did not lose a significant amount of accuracy: the model consisting of 20 variables had an $R^2_{Adj}$ value of 0.748. As 20 is within the feasible range of predictors for an all-possible-regressions analysis (section 2.3.1), no further heuristic variable selection was performed.

Figure 11: The graph shows how $R^2_{Adj}$ varies with every step of the stepwise regression, that is, with every addition or removal of a predictor.
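A sketch of this call is shown below; note that the entry/removal arguments have changed names between olsrr releases (older versions use pent/prem, newer ones p_enter/p_remove), so the exact call should be checked against the installed version.

```r
# Stepwise (both-directions) selection on the full model fitted above.
library(olsrr)
step_fit <- ols_step_both_p(full, pent = 0.05, prem = 0.01)   # argument names may differ by version
step_fit                  # selected predictors and fit statistics per step
plot(step_fit)            # adjusted R^2 and other criteria at each step (cf. figure 11)
```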
3.2.3 All possible regressions

With the predictors selected by the stepwise regression, all possible regressions was performed with ols_step_all_possible() from the olsrr package. Since there were 20 possible predictors to include, $2^{20}$ = 1,048,576 models were tested and ranked by different measures, including AIC, BIC, $R^2_{Adj}$ and Mallow's $C_p$. (Running all possible regressions took a little more than 80 hours.) Since these measures all depend on the mean squared error and penalize the number of predictors, they all recommend the same model within a subset of models with an equal number of predictors. However, when comparing models with different numbers of predictors, they do differ.

In the following graph all tested models are plotted with the number of predictors on the x-axis and their respective $R^2_{Adj}$ on the y-axis. The red triangles and the corresponding numbers represent the best model for every number of predictors.

Figure 12: The graph shows the result of the all possible regressions, where every blue point represents one model.

One clearly sees that $R^2_{Adj}$ improves significantly up to about 12 predictors. This also held for the other measures, indicating that choosing a model with fewer than 12 predictors may jeopardize its accuracy.
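The corresponding olsrr call is sketched below; 'reduced' stands for the 20-predictor model produced by the stepwise step. With 2^20 candidate models the run time is very long (roughly 80 hours here), so ols_step_best_subset(), which keeps only the best model per model size, can be a cheaper alternative.

```r
# All possible regressions over the 20 remaining predictors (computationally heavy).
library(olsrr)
all_fits <- ols_step_all_possible(reduced)
plot(all_fits)     # R^2, adjusted R^2, Cp, AIC, ... against model size (cf. figure 12)
```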
The 10 best models, with the number of predictors varying between 11 and 20, were selected for further viewing. These are presented in table 2.

Table 2: Summary of initial models

Model number      Number of predictors    $R^2_{Adj}$    AIC         $C_p$
1048575 (I)       20                      0.7527425      6944.485    21.00000
1048555 (II)      19                      0.7505097      6948.753    25.07500
1048365 (III)     18                      0.7480356      6953.531    29.71749
1047225 (IV)      17                      0.7452406      6958.984    35.11195
1042380 (V)       16                      0.7423691      6964.539    40.70396
1026876 (VI)      15                      0.7365307      6976.610    53.11051
988116 (VII)      14                      0.7301641      6989.534    66.77582
910596 (VIII)     13                      0.7244185      7000.821    79.06578
784626 (IX)       12                      0.7164660      7016.402    96.48104
616666 (X)        11                      0.6999963      7048.286    133.5952

As said above, $R^2_{Adj}$ is quite stable as the number of predictors decreases from 20 to 12 variables. However, when the model is reduced further, the decrease in $R^2_{Adj}$ accelerates, indicating that the model shouldn't be reduced to fewer than 12 predictors. In contrast, all models seem to be biased according to Mallow's $C_p$, and the bias increases quickly for every predictor removed. This also indicates that the model shouldn't be reduced too much. Considering all the measures, and simultaneously trying to keep the size of the models down for simplicity's sake, we chose to continue with models III-VIII for further analysis.
3.3 Promising Models

Table 3: Predictors included in the chosen models

Predictor                              III  IV   V    VI   VII  VIII
Sand Save Percentage                   X    X    X    X    X    X
Approaches From > 200 Yards            X    X    X    X    X    X
Average Distance Of Putts Made         X    X    X    X    X    X
Events Played                          X    X    X    X    X    X
Driving Distance                       X    X    X    X    X    X
Scrambling From 10 - 20 Yards          X    X    X    X    X
Approaches From 175 - 200 Yards        X    X    X    X    X    X
Scrambling From > 30 Yards             X    X    X    X    X    X
Driving Accuracy Percentage            X    X    X    X    X    X
Approaches From 50 - 125 Yards         X    X    X    X    X    X
Fairway Proximity                      X    X    X    X    X    X
Approaches From 150 - 175 Yards        X    X    X    X    X    X
Approaches From 125 - 150 Yards        X    X    X    X    X    X
Putting - Inside 10'                   X    X    X    X    X    X
Approaches From > 275 Yards            X    X    X    X
Putting From 7'                        X    X    X
Scrambling From < 10 Yards             X    X
Longest Putts                          X
Approaches From > 200 Yards (Rough)
Approaches From 50 - 75 Yards

The six models chosen in the previous section are summarized in table 3, which shows which of the 20 variables selected by the stepwise regression are included in the respective models.
3.3.1 Model Validation

None of the six models seemed to have any major problems regarding residuals or influential points. There was no obvious heteroscedasticity (see figure 24), and the point with the highest potential of being an influential point had a Cook's distance below 0.08 (see figure 26). In addition, the Q-Q plots did not show any glaring evidence of the residuals deviating from a normal distribution (see figure 25).

3.3.2 Variable Selection

Looking at the reasonableness of the included predictors from a golfing perspective, and at the magnitudes and signs of their estimated coefficients, we concluded that Longest Putts, Putting From 7', Approaches From > 275 Yards, and Fairway Proximity should be excluded from the final model. Further, looking at the significance of the remaining predictors, we concluded that we should choose a model in which the variables Scrambling From < 10 Yards, Longest Putts and Approaches From > 200 Yards (Rough) are not included. The reasons for this are discussed further in section 5.4.

3.3.3 Multicollinearity

Looking at the variance inflation factors of the predictors, one could see that all but one of them fall in the range 1-3. Fairway Proximity, however, has a VIF of around 7, depending on which model the VIFs are calculated for, indicating that it correlates with at least one of the other predictors. The bar plots of model III's and model VIII's VIFs are shown in figures 13 and 14 respectively. Firstly, one sees that Fairway Proximity's bar sticks out from the rest. Secondly, one also sees that the VIFs of model VIII are slightly smaller than those of model III, since there should be less overall correlation when fewer predictors are included in the model. This, together with the points discussed in the previous section, made us choose model VII with Fairway Proximity removed as the final model. This will be discussed further in section 5.4.
Figure 13: VIFs of model III
Figure 14: VIFs of model VIII

3.3.4 Possible transformations

Once again we looked for necessary transformations of the response (now for the final model) using the Box-Cox method. The suggested exponent was 1.16, and one clearly sees in figure 15 that λ = 1 lies within the 95% CI.

Figure 15: Log-likelihood graph for the Box-Cox parameter λ with the 95% confidence interval shown.

This indicates that no further transformation of the response variable is necessary, and the λ for the final model remains 2.57.
3.3.5 Cross Validation

Repeated five-fold cross validation with 10 repetitions was performed on model V. The average $R^2_{Adj}$ of the models based on the re-sampled datasets, $\bar R^2_{Adj}$ = 0.6695, did not differ significantly from the value without any re-sampling, $R^2_{Adj}$ = 0.6918, indicating a good fit. In addition, the root mean square error and the mean absolute error were quite small in comparison to the average response value, namely 11.37% and 8.95% of it respectively.
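The resampling itself can be done with the caret package, as sketched below; final_formula and pga stand in for the final 13-predictor formula and data set, which are not reproduced here.

```r
# Repeated five-fold cross validation (10 repetitions) of a linear model.
library(caret)
set.seed(2020)
ctrl   <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
cv_fit <- train(final_formula, data = pga, method = "lm", trControl = ctrl)
cv_fit$results[, c("RMSE", "MAE", "Rsquared")]   # averages over the 5 x 10 resamples
```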
4 Results

4.1 Final Model

The final model consists of 13 predictors and an intercept. Its regression equation is described in equation (23) below, where y is the response Official Money.

$$
\begin{aligned}
(\ln(y))^{2.57} = {} & -3905.0545 \\
& + 441.1038 \cdot x_{\text{Sand Save Percentage}} \\
& - 26.7400 \cdot x_{\text{Approaches From > 200 Yards}} \\
& + 21.5155 \cdot x_{\text{Average Distance Of Putts Made}} \\
& + 151.2015 \cdot \ln(x_{\text{Events Played}}) \\
& + 8.9043 \cdot x_{\text{Driving Distance}} \\
& + 429.1561 \cdot x_{\text{Scrambling From 10 - 20 Yards}} \\
& - 36.8731 \cdot x_{\text{Approaches From 175 - 200 Yards}} \\
& + 302.4759 \cdot x_{\text{Scrambling From > 30 Yards}} \\
& + 759.1370 \cdot x_{\text{Driving Accuracy Percentage}} \\
& - 40.2363 \cdot x_{\text{Approaches From 50 - 125 Yards}} \\
& - 26.9964 \cdot x_{\text{Approaches From 150 - 175 Yards}} \\
& - 22.7777 \cdot x_{\text{Approaches From 125 - 150 Yards}} \\
& + 1858.3662 \cdot x_{\text{Putting - Inside 10'}}
\end{aligned}
\qquad (23)
$$

This can be rewritten as

$$y = \exp\Big(\big(\hat\beta_0 + \sum_{i=1}^{13}\hat\beta_i x_i\big)^{1/\lambda}\Big) \qquad (24)$$

where the indices represent the names of the predictors in equation (23), $\hat\beta_i$ their coefficients, and λ = 2.57. In the following table, table 4, the 13 predictors of the final model and its intercept are summarized.
Table 4: Summary of coefficients in the final model

Predictor [j]                        β̂j            Std. Error    p-value
(Intercept)                          -3,905.0545    394.8379
Sand Save Percentage                 441.1038
Approaches From > 200 Yards          -26.7400       4.8073        4.10e-08
Average Distance Of Putts Made       21.5155        3.5099        1.65e-09
Events Played                        151.2015       12.8785
Driving Distance                     8.9043
Scrambling From 10 - 20 Yards        429.1561
Approaches From 175 - 200 Yards      -36.8731
Scrambling From > 30 Yards           302.4759
Driving Accuracy Percentage          759.1370
Approaches From 50 - 125 Yards       -40.2363
Approaches From 150 - 175 Yards      -26.9964
Approaches From 125 - 150 Yards      -22.7777
Putting - Inside 10'                 1,858.3662
4.2 Impact of Performance Measures

With these results we will now try to answer the original question: which performance measures drive PGA Tour prize money? This was done by improving the performance measures one by one by a percentage of their respective standard deviation (taken over all data points). For the performance measure Events Played, however, we instead add one additional event. An explanation of why this method was chosen can be found in section 5.4.

We will look at two different players: Tiger Woods in 2005 (his best year in prize money earned, adjusted for inflation) and Joe Meanson (a fictional player who is average in every performance measure).

Table 6: A summary of the standard deviations of the predictors in the final model and the performance measures of the two players that will be looked into. Official Money is calculated with equation (24).

Predictor                              σ        Tiger Woods    Joe Meanson
Sand Save Percentage [%]               0.0626   0.54           0.50
Approaches From > 200 Yards [m]        1.156    14.45          15.34
Average Distance Of Putts Made [m]     1.321    24.64          22.50
Events Played                          5.975    20             22.61
Driving Distance [yds]                 9.578    316.1          291.74
Scrambling From 10 - 20 Yards [%]      0.0503   0.70           0.65
Approaches From 175 - 200 Yards [m]    0.724    8.33           10.10
Scrambling From > 30 Yards [%]         0.0654   0.29           0.29
Driving Accuracy Percentage [%]        0.0530   0.55           0.62
Approaches From 50 - 125 Yards [m]     0.499    5.03           5.67
Approaches From 150 - 175 Yards [m]    0.601    8.15           8.41
Approaches From 125 - 150 Yards [m]    0.559    6.05           6.98
Putting - Inside 10' [%]               0.0134   0.89           0.87
Official Money [$]                              14,038,107     1,588,413
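The numbers in tables 7 and 8 can be reproduced along the lines of the sketch below. It assumes beta0, beta and x hold the intercept, the coefficient vector from equation (23)/table 4 and a player's predictor values on the model's scale (with Events Played already log-transformed), and it takes an "improvement" to mean moving a measure in the direction the model rewards, which matches the sign conventions behind tables 7 and 8.

```r
# Predicted prize money from equation (24) and the gain from improving one
# performance measure by a fraction of its standard deviation.
lambda <- 2.57

predict_money <- function(beta0, beta, x) {
  exp((beta0 + sum(beta * x))^(1 / lambda))            # equation (24)
}

money_gain <- function(beta0, beta, x, j, sigma_j, frac = 0.167) {
  x_new    <- x
  x_new[j] <- x[j] + sign(beta[j]) * frac * sigma_j    # move x_j in the earning-increasing direction
  predict_money(beta0, beta, x_new) - predict_money(beta0, beta, x)
}
```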
4.2.1 Example: Tiger Woods

The following table (table 7) shows Tiger Woods' expected increase in prize money earned over the 2005 season if one of the performance measures in the final model were improved by a multiple of its standard deviation.

Table 7: How much Tiger Woods' prize money would increase if one of his performance measures were improved by a factor times that measure's standard deviation. 3 significant digits.

Predictor                          σj                      0.167σj              0.01σj
Sand Save Percentage               1,970,000 (+14.0%)      314,000 (+2.2%)      18,600 (+0.13%)
Approaches From > 200 Yards        2,220,000 (+15.8%)      352,000 (+2.5%)      20,800 (+0.15%)
Average Distance Of Putts Made     2,030,000 (+14.5%)      323,000 (+2.3%)      19,100 (+0.14%)
Events Played                                              504,000 (+3.6%)
Driving Distance                   6,920,000 (+49.3%)      990,000 (+7.1%)      57,500 (+0.41%)
Scrambling From 10 - 20 Yards      1,520,000 (+10.8%)      245,000 (+1.7%)      14,500 (+0.10%)
Approaches From 175 - 200 Yards    1,900,000 (+13.6%)      304,000 (+2.2%)      18,000 (+0.13%)
Scrambling From > 30 Yards         1,390,000 (+9.9%)       224,000 (+1.6%)      13,300 (+0.09%)
Driving Accuracy Percentage        2,960,000 (+21.1%)      460,000 (+3.3%)      27,100 (+0.19%)
Approaches From 50 - 125 Yards     1,410,000 (+10.1%)      228,000 (+1.6%)      13,500 (+0.10%)
Approaches From 150 - 175 Yards    1,130,000 (+8.1%)       184,000 (+1.3%)      10,900 (+0.08%)
Approaches From 125 - 150 Yards    880,000 (+6.3%)         144,000 (+1.0%)      8,560 (+0.06%)
Putting - Inside 10'               1,760,000 (+12.6%)      282,000 (+2.0%)      16,700 (+0.12%)
The case where the improvement of the performance measures is 0.167σj is shown in the bar plot in figure 16.

Figure 16: How much Tiger Woods' prize money would increase if one particular performance measure were improved by 0.167σj (which equals 1 for Events Played, to 3 significant digits).
4.2.2 Example: Joe Meanson

The following table (table 8) shows Joe Meanson's expected increase in prize money earned over a season if one of the performance measures in the final model were improved by a multiple of its standard deviation.

Table 8: How much Joe Meanson's prize money would increase if one of his performance measures were improved by a factor times that measure's standard deviation. 3 significant digits.

Predictor                          σj                    0.167σj             0.01σj
Sand Save Percentage               292,000 (+17.7%)      46,000 (+2.8%)      2,720 (+0.16%)
Approaches From > 200 Yards        330,000 (+20.0%)      51,600 (+3.1%)      3,040 (+0.18%)
Average Distance Of Putts Made     302,000 (+18.3%)      47,400 (+2.9%)      2,800 (+0.17%)
Events Played                                            65,500 (+4.0%)
Driving Distance                   1,060,000 (+64.0%)    146,000 (+8.8%)     8,410 (+0.51%)
Scrambling From 10 - 20 Yards      225,000 (+13.6%)      35,900 (+2.2%)      2,130 (+0.13%)
Approaches From 175 - 200 Yards    282,000 (+17.1%)      44,500 (+2.7%)      2,630 (+0.16%)
Scrambling From > 30 Yards         205,000 (+12.4%)      32,900 (+2.0%)      1,950 (+0.12%)
Driving Accuracy Percentage        441,000 (+26.7%)      67,500 (+4.1%)      3,970 (+0.24%)
Approaches From 50 - 125 Yards     208,000 (+12.6%)      33,400 (+2.0%)      1,980 (+0.12%)
Approaches From 150 - 175 Yards    167,000 (+10.1%)      26,900 (+1.6%)      1,600 (+0.10%)
Approaches From 125 - 150 Yards    130,000 (+7.9%)       21,100 (+1.3%)      1,250 (+0.08%)
Putting - Inside 10'               261,000 (+15.8%)      41,400 (+2.5%)      2,450 (+0.15%)
The case where the improvement of the performance measures is 0.167σj is shown in the bar plot in figure 17.

Figure 17: How much Joe Meanson's prize money would increase if one particular performance measure were improved by 0.167σj (which equals 1 for Events Played).
5 Analysis & Discussion

5.1 Analysis of Final Model

The final model consists of 13 variables, which are summarized in table 9.

Table 9: Predictors included in the final model.

Driving...             Approaches From...    Short Game                        Miscellaneous
Distance                50 - 125 Yards        Sand Save Percentage              Events Played
Accuracy Percentage     125 - 150 Yards       Scrambling From 10 - 20 Yards
                        150 - 175 Yards       Scrambling From > 30 Yards
                        175 - 200 Yards       Putting - Inside 10'
                        > 200 Yards           Average Distance Of Putts Made

Looking only at this table, one could conclude that approaching and short game are the most important categories in golf for PGA Tour players, since they together make up 10 of the 13 predictors in the model. However, according to figures 16 and 17 it seems to be the opposite, since Driving Distance, Events Played and Driving Accuracy Percentage are expected to have the largest impact on the players' income. This may be because approaching and short game are divided into more performance measures than other parts of golf, making e.g. the driving predictors more comprehensive. This may indicate that an improvement of one standard deviation in one of the driving predictors is a much greater actual improvement than one in one of the approaching predictors.

Furthermore, approach shots hit from a shorter distance should generally end up closer to the hole. If all approaches (regardless of distance) had an equal impact on prize money, their respective estimated coefficients should therefore decrease in absolute value as the distance increases. This wasn't the case; sorted by the absolute value of their coefficients from largest to smallest, the approach predictors come in the order 50 - 125, 175 - 200, 150 - 175, > 200 and 125 - 150. This indicates that the respective approach predictors don't have an equivalent impact on the response variable Official Money, which is backed by tables 7 and 8, where Approaches From > 200 Yards was shown to have the largest impact.

One interesting question is whether, according to the final model, an increase (+1) in Events Played would lead to a linear increase in expected prize money equal to the player's current average prize money per event. Looking at the players discussed in section 4.2, we see that Tiger Woods' expected
earnings per event was $702,000, while his expected earnings for playing one more event was $504,000. Similarly, Joe Meanson's expected earnings per event was $70,300, while his expected earnings for playing one more event was $65,500. This may indicate that the additional events one adds to one's schedule are not as profitable as the previous ones, perhaps because players tend to choose the events at which they expect to earn a lot first. These events may have a large prize pool or may fit that specific player's playstyle. This analysis was made possible by us not using Official Money Per Event as the response, since that enabled us to use Events Played as a predictor.

Another interesting observation is that no approaching-from-the-rough or left/right-tendency predictors survived to the final model. This probably indicates that they are not good at predicting prize money, since the predictors of the final model were chosen by looking at their relevance, reasonableness, significance level and effect on multicollinearity. However, it is not impossible that they do have an impact on prize money, but it is not big enough to detect in the shadow of the other predictors.

While on the topic of predictors that were excluded from the final model, the careful reader will notice that Greens In Regulation Percentage is missing from the final model even though it was deemed promising in section 3.1.2. This particular predictor was removed during the stepwise regression at step 12, when the model consisted of 11 variables, as one can see in figure 11. It was probably removed due to multicollinearity with short game and approaching predictors. Since it was removed before the model reached the size of the final model, we are fairly confident that not too much precision was lost from it not being part of the all possible regressions.

Lastly, the intercept is seemingly quite large in absolute terms relative to the other coefficients, calculated to −3,905 compared to the second largest, $\hat\beta_{\text{Putting - Inside 10'}}$ = 1,858. One might, from equation (24), expect a large negative intercept to indicate that Official Money would be close to 0 when the other predictors are set to 0. However, one would then fail to account for the exponent $\frac{1}{\lambda}$, which, being less than 1 (λ = 2.57), requires $\hat\beta_0 + \sum_{i=1}^{13}\hat\beta_i x_i$ to be larger than 0, i.e. the sum $\sum_{i=1}^{13}\hat\beta_i x_i$ to be larger than 3,905, for the expression to produce a real value. This allows a set of performance measures to yield no prize money at all, which is expected, since not all players win prize money regardless of having recorded performance measures.
5.2 Data Pre-Processing

We started off with 102 predictors and 640 data points; however, several data points were incomplete and some of the performance measures were only recorded for a small group of players. Thus, performance measures for which many data points had missing values were removed, as were incomplete data points. Performance measures that directly reflect score (such as Average Score and Birdie Average) were also removed, as they cannot point to a specific part of the game that a player can improve. Furthermore, we also chose to exclude all Strokes Gained performance measures, since we wanted more precise predictors; e.g. Strokes Gained Approach-The-Green and Strokes Gained Putting are combinations of all approach-the-green and putting predictors respectively. All these measures shrank the number of predictors to 56 and the number of data points to 581.

While the PGA Tour website provided an abundance of data points and predictors, the data was very labour-intensive to extract. It had to be done manually and took about one minute per data point, which stopped us from collecting more data than may have been necessary. We do, however, think that the roughly 600 data points were enough to perform a solid analysis.

5.3 Handling of Outliers

Five data points in the data set were deemed to be outliers, since they had large residuals, large Cook's distances and deviations in the Q-Q plot. These were data points 11, 128, 326, 415 and 585, which are evenly distributed between 1 and 640, indicating that there is no trend relating outliers to year (the data set is sorted by year). They did, however, have a noticeable effect on the precision of the model; when they were removed, the first model, consisting of all 56 variables, increased its adjusted $R^2$ value by 0.0113, from 0.7749 to 0.7862.