Skill, Luck and Hot Hands on the PGA Tour

June 21, 2005

Robert A. Connolly
Kenan-Flagler Business School
CB3490, McColl Building
UNC - Chapel Hill
Chapel Hill, NC 27599-3490
(919) 962-0053 (phone)
(919) 962-5539 (fax)
connollr@bschool.unc.edu

Richard J. Rendleman, Jr.
Kenan-Flagler Business School
CB3490, McColl Building
UNC - Chapel Hill
Chapel Hill, NC 27599-3490
(919) 962-3188 (phone)
(919) 962-2068 (fax)
richard_rendleman@unc.edu

We thank Tom Gresik and seminar participants at the University of Notre Dame for helpful comments. We provide special thanks to Carl Ackermann and David Ravenscraft, who provided significant input and assistance during the early stages of this study. We also thank Mustafa Gultekin for computational assistance and Yuedong Wang and Douglas Bates for assistance in formulating, programming and testing portions of our estimation procedures. Please direct all correspondence to Richard Rendleman.
1. INTRODUCTION

Like all sports, outcomes in golf involve elements of both skill and luck. Perhaps the highest level of skill in golf is displayed on the PGA Tour. Even among these highly skilled players, however, a small portion of each 18-hole score can be attributed to luck, or what players and commentators often refer to as "good and bad breaks." The purpose of our work is to determine the extent to which skill and luck combine to determine 18-hole scores in PGA Tour events. We are also interested in the question of whether PGA players experience "hot or cold hands," or runs of exceptionally good or bad scores, relative to those predicted by their statistically-estimated skill levels.

From a psychological standpoint, understanding the extent to which luck plays a role in determining 18-hole golf scores is important. Clearly, a player would not want to make swing changes to correct abnormally high scores that were due, primarily, to bad luck. Similarly, a player who shoots a low round should not get discouraged if he cannot sustain that level of play, especially if good luck was the primary reason for his score. At the same time, it is important for a player to know whether his general (mean) skill level is improving or deteriorating over time and to understand that deviations from past scoring patterns may not be due to luck alone.

From a policy standpoint, it seems reasonable that the PGA Tour and other professional golf tournament organizations should attempt to minimize the role of luck in determining tournament outcomes and qualification for play in Tour events. Ideally, the PGA Tour should be comprised of the most highly-skilled players. Also, the Tour should strive to conduct tournaments that reward the most highly skilled players rather than those who experience the greatest luck.

In some cases, luck in a round of golf can easily be identified. In the final round of the 2002 Bay Hill Invitational, David Duval's approach shot to the 16th hole hit the pin and bounced back into a water hazard fronting the green. Duval took a nine on the hole. Few would argue that Duval's score of nine was due to bad judgment or a sudden change in his skill level, and it is highly unlikely that Duval made swing changes or changes to his general approach to the game to correct the type of problem he incurred on the 16th hole. In contrast, consider the good fortune experienced by Craig Perks in the final round of the 2002 Players Championship, when he chipped in on holes 16 and 18 and sunk a 28-foot putt for
birdie on the 17th hole en route to victory. Was Perks' victory, his first on the PGA Tour, a reflection of exceptional ability or four consecutive days of good luck? The fact that his best season since 2002 placed him 146th on the PGA Tour's Official Money List suggests that luck may have been the overriding factor. Similarly, as of June 2005, Ben Curtis has had only one top-10 finish since winning the 2003 British Open. At this point, few would argue that Curtis's victory was anything other than a fluke.

In many situations, specific occurrences of good and bad luck may not be as obvious as in the examples above. Luck simply occurs in small quantities as part of the game. Even players as highly skilled as Tiger Woods and Vijay Singh cannot produce perfect swings on every shot. As a result, a certain human element of luck is introduced to a shot even before contact with the ball is made. (An excellent article on the role of luck in golf, with a focus on how the ultimate outcome of the duel between Tiger Woods and Chris DiMarco in the 2005 Masters was determined as much by luck as by skill, is provided in Jenkins (2005).)

Our work draws on the rich literature in sports statistics. Klaassen and Magnus (2001) model the probability of winning a tennis point as a function of player quality, context (situational) variables, and a random component. They note that failing to account for quality differences will create pseudo-dependence in scoring outcomes, because winning reflects, in part, player quality, and player quality is generally persistent. Parameter estimates from their dynamic, panel random effects model using four years of data from Wimbledon matches suggest that the iid hypothesis for tennis points is a good approximation, provided there are controls for player quality. The most important advantage of their approach from our perspective is that it supports a decomposition of actual performance into two parts: an expected score based on skill and an unexpected residual portion.

In this paper, we decompose individual golfers' scores into skill-based and unexpected components by using Wang's (1998) smoothing spline model to estimate each player's mean skill level as a function of time while simultaneously estimating the correlation in the random error structure of fitted player scores and the relative difficulty of each round. The fitted spline values of this model provide estimates of expected player scores. We define luck as deviations from these estimated scores, whether positive or negative, and explore its statistical properties. Our tests show that after adjusting a player's score for his general skill level and the relative difficulty of the course on the day a round is played, the average (and median) standard deviation of residual golf scores on the PGA Tour is approximately 2.7 strokes per 18-hole round, ranging
between 2.1 and 3.5 strokes per round per player. We find clear evidence of positive first-order autocorrelation in the error structure for over 13 percent of the golfers and, therefore, conclude that a significant number of PGA Tour players experience hot and cold hands. We also apply some traditional hot hands tests to the autocorrelated residuals and come to similar conclusions. However, after removing the effects of first-order autocorrelation from the residual scores, we find little additional evidence of hot hands. This suggests that any statistical study of sporting performance should estimate skill dynamics and deviations from normal performance while simultaneously accounting for the relative difficulty of the task and the potential autocorrelation in unexpected performance outcomes.

The remainder of the paper is organized as follows. In Section 2 we describe our data and criteria for including players in our statistical samples. In Section 3 we present the results of a number of fixed-effects model specifications to identify the variables that are important in predicting player performance. In Section 4 we formulate and test the cubic spline-based model for estimating player skill. This model becomes the basis for our analysis of player skill, luck and hot hands summarized in Section 5. A final section provides a summary and concluding comments.

2. DATA

We have collected individual 18-hole scores for every player in every stroke-play PGA Tour event for the years 1998-2001, a total of 76,456 scores distributed among 1,405 players. Our data include all stroke play events for which participants receive credit for earning official PGA Tour money, even though some of the events, including all four "majors," are not actually run by the PGA Tour. The data were collected, primarily, from www.pgatour.com, www.golfweek.com, www.golfonline.com, www.golfnews.augustachronicle.com, www.insidetheropes.com, and www.golftoday.com. When we were unable to obtain all necessary data from these sources, we checked national and local newspapers, and in some instances, contacted tournament headquarters directly.

Our data cover scores of players who made and missed cuts. (Although there are a few exceptions, after the second round of a typical PGA Tour event, the field is reduced to the 70 players, including ties, with the lowest total scores after the first two rounds.) The data also include scores for players who withdrew from tournaments and who were disqualified; as long as we have a score, we use it. We also gathered data on where each round was played. This is especially important for
tournaments such as the Bob Hope Chrysler Classic and AT&T Pebble Beach National Pro-Am, which are played on more than one course.

The great majority of the players represented in the sample are not regular members of the PGA Tour. Nine players in the sample recorded only one 18-hole score over the 1998-2001 period, and 565 players recorded only two scores. Most of these 565 players qualified to play in a single US Open, British Open or PGA Championship, missed the cut after the first two rounds and subsequently played in no other PGA Tour events. As illustrated in the top panel of Figure 1, 1,069 players, 76.1 percent of the players in the sample, recorded 50 or fewer 18-hole scores. As illustrated in the bottom panel, this same group recorded 5,895 scores, which amounts to only 7.7 percent of the total number of scores in the sample. 1,162 players, representing 82.7 percent of the players in the sample, recorded 100 or fewer scores. The total number of scores recorded by this group was 12,849, or 16.8 percent of the sample. The greatest number of scores recorded by a single player was 457, by Fred Funk.

The players who recorded 50 or fewer scores represent a mix of established senior players such as Hale Irwin (19 scores), Jim Thorpe (18) and Larry Nelson (17), relatively inactive middle-aged players including Bobby Clampett (16), Jerry Pate (16) and Seve Ballesteros (22), "old-timers" such as Arnold Palmer (32), Raymond Floyd (24) and Gary Player (18), up-and-coming stars including Adam Scott (32) and Chad Campbell (20), established European Tour players such as Andrew Coltart (49), Niclas Fasth (48) and Robert Karlsson (46), and a large number of players whom most readers of this paper, including those who follow the PGA Tour, have probably never heard of. Clearly, this group, which accounts for 76.1 percent of the players in the sample but only 7.7 percent of the scores, is not representative of typical PGA Tour participants. Therefore, including them in the sample could cause distortions in estimating the statistical properties of golf scores on the Tour.

In estimating the skill levels of individual players, we employ the smoothing spline model of Wang (1998), which adjusts for correlation in random errors. Simulations by Wang indicate that 50 sample observations is probably too small, and that approximately 100 or more observations are required to obtain dependable cubic-spline-based mean estimates. After examining player names and the number of rounds recorded by each player, we have concluded that a sample of players who recorded more than 90 scores is reasonably homogeneous and likely to meet the minimum sample size requirements of the cubic spline methodology.
3. MODEL IDENTIFICATION

The model used in our tests, based on Wang's smoothing spline, is computationally intensive and requires approximately a full day to run on a 3.19 GHz PC operating under Windows XP. Because of its computational requirements, we employ two simpler models to identify the relevant variables and appropriate model form for our tests. In model 1, we employ a fixed-effects multiple regression to predict a player's 18-hole score for a given round as a function of the level of skill displayed by the player throughout the entire 1998-2001 period and the relative difficulty of each round. In model 2, each player's skill level is allowed to change (approximately) by calendar year, but otherwise the model is identical to model 1. We show that including approximate calendar year-based time dependency significantly improves the predictive power of the model.

We also estimate various versions of model 2 to determine whether alternative model specifications might be warranted. Using model 2, we show that the best way to estimate the relative difficulty of a given golf round is through a round-course interaction dummy variable. Although most tournaments on the PGA Tour are played on a single course, several are played on more than one course. Using round-course interactions to predict the relative difficulty of each round is significantly more powerful than employing a dummy for the round alone (without interacting with the course), the course alone (without interacting with the round), or the tournament alone (without interacting with either the course or the round). We also show that individual player performance is not affected by the courses on which a player plays.

3.1. Model 1

Let s_{i,j} denote the 18-hole score of player i (with i = 1, ..., n, ordered alphabetically) in PGA Tour round-course interaction j (with j = 1, ..., m). A round-course interaction is defined as the interaction between a regular 18-hole round of play in a specific tournament and the course on which the round is played. For most tournaments, only one course is used and, therefore, there is only one such interaction per round. However, over the sample period, 25 of 182 tournaments were played on more than one course. For example, in the Bob Hope Chrysler Classic, the first four rounds are played on four different courses using a rotation that assigns each tournament participant to each of the four courses over the first four days. A cut is made after the fourth round, and a final round is played the
fifth day on a single course. Thus, the Bob Hope tournament consists of 17 round-course interactions – four for each of the first four days of play and one additional interaction for the fifth and final day.

According to model 1, the predicted score for player i in connection with round-course interaction j is determined by the following fixed-effects multiple regression model:

$$s_{i,j} = \alpha + \sum_{k=2}^{n} \beta_k p_k + \sum_{c=2}^{m} \gamma_c r_c + \varepsilon_{i,j} \qquad (1)$$

In (1), p_k is a dummy variable that takes on a value of 1 if player i = k and zero otherwise, r_c is a dummy that takes on a value of 1 if round-course interaction j = c and zero otherwise, and ε_{i,j} is an error term with E(ε_{i,j}) = 0. In the model, the first round of the 1998 Mercedes Championships, played on a single course, is round-course interaction j = 1. Therefore, the regression intercept, α, represents the expected score of the first player (k = 1) in the first round of the 1998 Mercedes Championships, β_k represents the differential amount that player k > 1 would be expected to shoot in the first round of the 1998 Mercedes, and γ_c denotes the additional amount that all players, including the first player, would be expected to shoot in connection with any other round-course combination. The only restriction placed on the data is that a player must have recorded at least four 18-hole scores, the minimum number required for the regression to be of full rank, to be included in the regression estimate. With this restriction, the data include 75,054 scores for 810 players recorded over a possible 848 round-course interactions, although no single player participated in more than 457 rounds.

Berry (2001) employs a similarly-constructed random effects model that takes into account the skill level of each player and the intrinsic difficulty of each round to measure the performance of Tiger Woods relative to others on the PGA Tour. However, Berry's model does not take account of the potential for a given round to be played on more than one course.

It should be noted that when estimating round-course interaction coefficients, no specific information about course conditions, adverse weather, etc. is taken into account. Nevertheless, if such conditions combine to produce abnormal scores in a given 18-hole round, the effects of these conditions should be reflected in the estimated coefficients. Under model 1, the highest round-course interaction coefficient is associated with the portion of the third round of the 1999 AT&T Pebble Beach National Pro-Am played on the Pebble Beach course. For this particular round-course interaction, wind and rain combined to make the famed course extremely difficult and forced tournament officials to cancel the event during the fourth round due to unplayable conditions. (Our final analysis shows that the Pebble Beach course played 5.8 strokes more difficult on the third day than on the first two days.)
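To make the structure of (1) concrete, the sketch below estimates a toy version of the model with off-the-shelf regression tools. This is not the paper's estimation code, and the data frame, column names and values are all hypothetical; the point is simply that player and round-course fixed effects can be encoded as categorical dummies, with the first level of each absorbed into the intercept, exactly as in (1).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: one row per 18-hole score, with a player id and a
# round-course interaction id. The real data set has 75,054 such rows.
scores = pd.DataFrame({
    "score":  [68, 71, 70, 73, 69, 72, 74, 70, 67, 75, 71, 69],
    "player": ["P1", "P1", "P2", "P2", "P3", "P3",
               "P1", "P2", "P3", "P1", "P2", "P3"],
    "rc":     ["RC1", "RC2", "RC1", "RC2", "RC1", "RC2",
               "RC3", "RC3", "RC3", "RC4", "RC4", "RC4"],
})

# C() builds fixed-effect dummies; the omitted first player and first
# round-course interaction play the roles of k = 1 and j = 1 in (1).
fit = smf.ols("score ~ C(player) + C(rc)", data=scores).fit()
print(fit.params)  # intercept = alpha; C(player)[...] = beta_k; C(rc)[...] = gamma_c
```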
3.2. Model 2

In the previous model, a player's skill coefficient is assumed to be constant throughout the entire 1998-2001 period. In model 2, a player's skill coefficient is allowed to vary through time on an approximate calendar year basis. In principle, we wish to estimate a single regression equation with separate skill coefficients estimated for each player in each calendar year. However, experiments with a model of this form reveal that a number of players whose scores are included in the estimation of model 1 would have to be removed from the original sample for the regression to be of full rank.

To overcome the potential problem of singularity, while maintaining the same sample employed in the estimation of model 1, we employ the following method for allowing an individual player's skill coefficient to vary through time. If a full calendar year has passed, at least 25 scores were used in the estimation of the previous skill coefficient for the same player, and at least 25 additional scores have been recorded for the same player, then a new incremental skill coefficient for that player is estimated. For players who participated actively in all four years of the sample, this procedure results in the estimation of different mean skill levels in each of the four years. Although the 25-score criterion is somewhat arbitrary, it is not critical, since the model we will actually use for estimating player scores is based on Wang's smoothing spline methodology and does not use the 25-score criterion.

As before, let s_{i,j} denote the 18-hole score of player i in connection with PGA Tour round-course interaction j. Define p_{k,t(k)} as a dummy variable that takes on a value of 1 if player i = k and the player-specific (approximate) calendar year-based time period equals or exceeds t(k) = 1, ..., T(k), and zero otherwise, where T(k) denotes the total number of player-specific (approximate) calendar year-based time periods attributable to player k. Thus, if i = k, p_{k,1} = 1 in all time periods, p_{k,2} = 1 in periods 2 through T(k) but zero otherwise, p_{k,3} = 1 in periods 3 through T(k) but zero otherwise, and p_{k,4} = 1 in period 4 only. The sketch below illustrates this cumulative dummy coding.
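A minimal sketch of the coding just described, for a hypothetical player observed in three periods; the function name and inputs are ours, not the paper's:

```python
import numpy as np

def incremental_skill_dummies(periods, T):
    """Model-2-style dummies for one player: column t is 1 whenever the
    observation's period index is t or later, so each coefficient measures
    the incremental change in mean skill entering that period."""
    periods = np.asarray(periods)
    return np.column_stack([(periods >= t).astype(int) for t in range(1, T + 1)])

# A player with two rounds in each of three periods: column 1 is all ones
# (base skill level), and columns 2 and 3 switch on in periods 2 and 3.
print(incremental_skill_dummies([1, 1, 2, 2, 3, 3], T=3))
```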
Finally, let β_{k,t(k)} denote the value of incremental skill coefficient t(k) for player k. Then, the basic version of model 2, denoted as 2.1, is specified as follows:

$$s_{i,j} = \alpha + \sum_{t(k)=2}^{T(k)} \beta_{1,t(k)}\, p_{1,t(k)} + \sum_{k=2}^{n} \sum_{t(k)=1}^{T(k)} \beta_{k,t(k)}\, p_{k,t(k)} + \sum_{c=2}^{m} \gamma_c r_c + \varepsilon_{i,j} \qquad (2.1)$$

As in model 1, the first player in the sample is player k = 1, and the first round of the 1998 Mercedes Championships is round-course combination j = 1. Therefore, the regression intercept, α, represents the first player's expected score in the first round of the 1998 Mercedes Championships. The first summation picks up potential incremental changes in skill for the first player starting in his second estimation period, typically 1999. In all forms of model 2, the skill coefficient β_{k,1} is estimated in each of the T(k) estimation periods for all players k > 1. Starting in the second period, the coefficient β_{k,2} is estimated for all players k = 1, ..., n for whom T(k) > 1, and so on.

To determine the best way to estimate the relative difficulty of each round, we estimate three additional forms of model 2, denoted as models 2.2 through 2.4. In model 2.2 we substitute 728 round dummies for the 848 round-course interaction dummies in model 2.1. We define a "round" as a round of play in a specific tournament, regardless of whether the round is played on more than one course. Therefore, this specification ignores differences in scores for the same round that are played on different courses. In model 2.3 we substitute 182 tournament dummies for the round-course interaction dummies in model 2.1. Each tournament dummy denotes a particular tournament played in a given year. For example, there are four tournament dummies associated with the Masters, played in each of the four years 1998-2001. According to this specification, the relative difficulty of each round is determined by the tournament and does not vary from day to day within the tournament or over multiple courses that might be used for the tournament. Finally, in model 2.4 we substitute 77 course dummies, which indicate the courses on which the rounds are played, for the round-course interaction dummies in model 2.1. As such, this specification assumes that the relative difficulty of a round is determined strictly by the course on which the round is played and that the difficulty of the course does not change from day to day within a tournament or even from year to year.

Table 1 summarizes the performance of these models using both a full sample of players (75,054 observations) and a sample restricted to the players who recorded more than 90 scores (64,364 observations).
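The model comparisons reported below from Table 1 are standard nested-model F-tests. A minimal sketch of the computation, with purely illustrative sums of squared errors rather than the paper's actual values:

```python
from scipy import stats

def nested_f_test(rss_restricted, rss_full, df_extra, n_obs, n_params_full):
    """F-test of a restricted linear model (e.g., model 1) against a nested
    fuller model (e.g., model 2.1) that adds df_extra coefficients."""
    df_resid = n_obs - n_params_full
    f_stat = ((rss_restricted - rss_full) / df_extra) / (rss_full / df_resid)
    p_value = stats.f.sf(f_stat, df_extra, df_resid)
    return f_stat, p_value

# Illustrative numbers only: 75,054 observations, 2,224 coefficients in the
# full model, and 300 hypothetical incremental skill coefficients under test.
print(nested_f_test(rss_restricted=560_000.0, rss_full=555_000.0,
                    df_extra=300, n_obs=75_054, n_params_full=2_224))
```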
Using the full sample, the adjusted R² is 0.2996 for model 1 and 0.3077 for model 2.1. The F-statistic for testing the difference between model 2.1 and model 1 is 2.44, which, given the large sample size, is significant at any normal testing level. This indicates that there is added value to including (approximate) calendar year-based time variation in mean player skill levels. F-statistics for testing model 2.1 against alternative specifications using dummies for rounds, tournaments and courses, rather than for round-course interactions, are 5.66, 9.66 and 10.67, respectively, indicating that model 2.1, which uses round-course interactions, is superior. Very similar results are obtained for the same tests using the restricted sample. It should be noted that Berry's (2001) model for predicting player scores is equivalent to our model 1 using random effects for actual rounds rather than fixed effects for round-course interactions. The results in Table 1 indicate that including calendar-year-based time variation in player skill levels and adjusting scores for the relative difficulty of round-course interactions is a superior model specification.

We also attempted to estimate a fifth version of model 2 to test whether there is added value to including interactions among players and courses. With the full sample of players, the number of coefficients to be estimated increases from 2,224 using model 2.1 to 17,201 when player-course interactions are included. With the sample restricted to only those players who competed in more than 90 rounds, the number of coefficients increases from 848 to 13,383. Unfortunately, the regressions are of such large scale that we were unable to obtain sufficient resources to estimate them on the university's statistical mainframe computer. Therefore, we address the issue of player-course interactions by including course dummies in random effects analyses of individual player residual scores from our final spline-based estimation model. These tests, described in Section 4.1, show no evidence of significant player-course interaction effects. Also, tests of player performance in adjacent rounds of the same tournament, summarized in Table 5 and discussed in Section 5, provide additional evidence of no significant effects.

4. THE SPLINE-BASED ESTIMATION MODEL

Based on the previous analysis, we conclude that the best way to adjust player scores for the relative difficulty of each round is to use round-course interactions rather than rounds (ignoring courses), courses (ignoring rounds) or tournaments (ignoring both rounds and courses). We also conclude that our estimation model should reflect time-dependent estimates of mean player skill. However, we recognize that allowing player skill to change only by calendar year is somewhat arbitrary.
Therefore, in model 3 we employ a more general specification of time dependency in mean player skill that does not require skill changes to occur only at the turn of the year or at any other pre-specified points in time.

In model 3 we estimate each player's mean skill level as a function of "golf time" using the restricted maximum likelihood (REML) version of Wang's (1998) smoothing spline model, which adjusts for correlated random errors. Player k's "golf time" counts successive competitive PGA Tour rounds of golf played by player k regardless of how these rounds are sequenced in actual calendar time. Inasmuch as there are likely to be gaps in actual time between some adjacent points in golf time, it is unlikely that random errors around individual player spline fits follow higher-order autoregressive processes. At the same time, however, we do not want to rule out the possibility that the random errors may be correlated.

Model 3, which uses Wang's REML model for estimating mean individual player skill as a function of golf time, is formulated as follows. Define player k's golf time as the sequence g(k) = 1, ..., G(k), and let g_k(j) denote the mapping of the sequence j = 1, ..., m to the sequence g(k) = 1, ..., G(k) such that g_k(j) represents the "golf time" of all round-course interactions j for which player k recorded an 18-hole score. For example, assume that player k's first three scores were recorded in connection with round-course interactions 6, 7, and 16. Then g_k(6) = 1, g_k(7) = 2, and g_k(16) = 3. With this mapping, model 3 becomes:

$$s_{i,j} = \sum_{k=1}^{n} \Big( f_k\big[g_k(j)\big] + \theta_k\big[g_k(j)\big] \Big) p_k + \sum_{c=2}^{m} (\gamma_c + \xi_c)\, r_c + \kappa_{i,j} = \sum_{k=1}^{n} \Big( f_k\big[g_k(j)\big] + \theta_k\big[g_k(j)\big] \Big) p_k + \sum_{c=2}^{m} \omega_c r_c + \kappa_{i,j},$$

with ω_c = γ_c + ξ_c and E(κ_{i,j}) = 0. In model 3, f_k(g_k[j]) is the smoothing spline function applied to player k's golf scores over golf time g(k) = 1, ..., G(k), and θ_k(g_k[j]) is the random error associated with the spline fit for player k evaluated with respect to round-course interaction j as mapped into golf time g_k(j), with θ_k = (θ_k[1], ..., θ_k[G(k)])′ ~ N(0, σ_k² W_k⁻¹) and σ_k² unknown. Note that the intercept is absorbed by the f_k(g_k[j]) terms.
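We cannot reproduce Wang's REML smoothing spline estimator here, but the sketch below conveys the central idea of estimating a smooth skill trend and an AR(1) error parameter together. It is a simplified stand-in only: a fixed-degrees-of-freedom regression spline with feasible GLS iteration replaces the smoothing spline with REML-chosen smoothness, and all data are simulated.

```python
import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

# Simulate one player's round-course-adjusted scores over "golf time".
rng = np.random.default_rng(0)
G = 200
t = np.arange(G)
skill = 71.0 - 0.004 * t                  # slowly improving mean skill f_k
phi_true = 0.2                            # AR(1) coefficient of the theta errors
eta = rng.normal(0.0, 2.7, G)             # white-noise eta component
theta = np.zeros(G)
for g in range(1, G):
    theta[g] = phi_true * theta[g - 1] + eta[g]
scores = skill + theta

# Cubic regression spline basis with fixed df, standing in for the smoothing spline.
X = dmatrix("cr(t, df=4)", {"t": t}, return_type="dataframe")

# GLSAR alternates between fitting the spline coefficients and re-estimating the
# AR(1) parameter, a rough analogue of the simultaneous estimation in model 3.
fit = sm.GLSAR(scores, X, rho=1).iterative_fit(maxiter=10)
print("estimated AR(1) coefficient:", float(fit.model.rho[0]))
fitted = np.asarray(fit.fittedvalues)
print("fitted skill at start and end of golf time:", fitted[0], fitted[-1])
```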
It is convenient to re-express θ_k(g_k[j]) as θ_k(g_k[j]) = ϕ_k(g_k[j]) + η_k(g_k[j]), where ϕ_k(g_k[j]) represents the predicted error associated with player k's spline fit as of golf time g_k(j) as a function of past errors, θ_k(1), ..., θ_k(g_k[j] − 1), and η_k(g_k[j]) denotes the remaining residual error. Substituting this expression, model 3 becomes:

$$s_{i,j} = \sum_{k=1}^{n} \Big( f_k\big[g_k(j)\big] + \varphi_k\big[g_k(j)\big] + \eta_k\big[g_k(j)\big] \Big) p_k + \sum_{c=2}^{m} \omega_c r_c + \kappa_{i,j} \qquad (3)$$

It is important to recognize that in estimating model 3, the autocorrelation in random errors about each spline fit is not removed. Instead, the spline fits and the autocorrelation parameters associated with the random error around the spline fits are estimated simultaneously. Model 3 represents a generalized additive model with random effects η = (η_1, ..., η_n)′ and ξ = (ξ_1, ..., ξ_m)′ associated with players i = 1, ..., n and round-course interactions j = 1, ..., m, respectively. Although the smoothing spline functions allow for general autocorrelation structures in θ_i, we assume that the θ_i follow player-specific AR(1) processes. The methodology for estimating model 3 is described in the appendix.

4.1. General Properties of Model 3

Throughout the remainder of the paper, frequent reference is made to two different types of residual errors from the spline fits of model 3. To avoid confusion, we define the two residual error types here. In model 3, θ_i(g_i[j]) is the total residual error associated with spline fit i as of golf time g_i(j); ϕ_i(g_i[j]) represents the predicted error associated with player i's spline fit as of golf time g_i(j) as a function of past errors, θ_i(1), ..., θ_i(g_i[j] − 1); and η_i(g_i[j]) represents the error component of the spline fit that is not predictable from past residual errors. We assume that the error θ_i(g_i[j]) follows an AR(1) process with first-order autocorrelation coefficient φ_i. The spline fit f_i and correlation φ_i are estimated simultaneously using Wang's (1998) smoothing spline methodology. For some of our tests we focus on the residual errors η_i(g_i[j]), which we refer to as η errors. If the assumed AR(1) process properly captures the correlation structure of residual player scores, η errors should represent white noise.
For other tests, we focus on the autocorrelated residual errors θ_i(g_i[j]), which we refer to as θ errors.

In formulating model 3, we assume that an AR(1) process describes the residual error structure about the cubic spline fit for each player. If this is a valid assumption, the η errors for each player should be serially uncorrelated; if it is not, there may be additional autocorrelation in the η errors. To check for possible model misspecification in connection with the AR(1) assumption, we estimated autocorrelation coefficients of order 1-5 on the η errors of each player. In no instance were any estimated coefficients statistically significant at the 5 percent level. In addition, we computed the Ljung-Box (1978) Q statistic associated with the η errors for each player for lags equal to the minimum of 10 or 5 percent of the number of rounds played. Ljung (1986) suggests that no more than 10 lags should be used for the test, and Burns (2002) suggests that the number of lags should not exceed 5 percent of the length of the series. Only 6 of 253 Q statistics were significant at the 5 percent level. As a final diagnostic, we ran random effects models relating θ and η errors to course dummy variables for each of the 253 players, assigning a dummy to a course if a player played the course at least twice, the minimum required for a model to be of full rank. None of the F-tests associated with the 253 random effects tests were significant at the 5 percent level for either type of error. Therefore, we conclude that the autocorrelation properties of the spline function have been properly specified and that it is unnecessary to include player-course interactions in the estimation model.
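A minimal sketch of the Q-statistic diagnostic just described, applied to simulated white-noise residuals; it assumes the statsmodels implementation of the Ljung-Box test and uses the paper's lag rule of min(10, 5 percent of rounds):

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

def eta_white_noise_check(eta_resid):
    """Ljung-Box Q test on one player's eta residuals; large p-values are
    consistent with the AR(1) spec having captured all autocorrelation."""
    lags = max(1, min(10, len(eta_resid) // 20))  # min(10, 5% of rounds played)
    return acorr_ljungbox(eta_resid, lags=[lags])

# For genuine white noise, roughly 5 percent of players should reject at the
# 5 percent level; the paper finds only 6 rejections out of 253.
rng = np.random.default_rng(1)
print(eta_white_noise_check(rng.normal(0.0, 2.7, 250)))
```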
Figure 2 shows plots of spline fits for four selected players. All plots in the upper panel show 18-hole scores reduced by the effects of round-course interactions (connected by jagged lines) along with corresponding spline fits (smooth lines). The same spline fits (smooth lines) are shown in the lower panel along with predicted scores adjusted for round-course interactions (connected by jagged lines), computed as a function of prior residual errors, θ_i(g_i[j]), and the first-order autocorrelation coefficient, φ_i, estimated in connection with the spline fit. The scale of each plot is different. Therefore, the spline fits in the lower panel appear as stretched versions of the corresponding fits in the upper panel, and, visually, the plots for the four players are not directly comparable.

The two plots for Chris Smith reflect 10.7067 degrees of freedom used in connection with the estimate of his spline function, the largest degrees-of-freedom value among the 253 players. In contrast, the spline fit for Tiger Woods, estimated with 2.0011 degrees of freedom, is more typical; almost 71 percent of the 253 spline fits use 2.25 or fewer degrees of freedom and have the same general appearance as that estimated for Woods. The fit for Ian Leggatt reflects φ = −0.2791, the most negative first-order autocorrelation coefficient estimated in connection with the 253 splines. Also, the standard deviation of the θ residual errors around the spline fit for Leggatt is 2.14 strokes per round, the lowest among the 253 players. Interestingly, after adjusting for prior θ errors and first-order autocorrelation, φ, the predicted score for Bob Estes, shown as the jagged line in the lower panel, is the lowest among all 253 players at the end of the 1998-2001 sample period.

Figure 3 shows six histograms that help to summarize the 253 cubic spline-based mean player skill functions. The first histogram shows the degrees of freedom for the various spline fits. Three of the fits have exactly two degrees of freedom, while the degrees of freedom for 179, or 71 percent, of the splines are less than 2.25. This implies that a large majority of the spline fits are essentially linear functions of time, such as that illustrated in Figure 2 for Tiger Woods. (It should be noted that for each spline, an additional degree of freedom, not accounted for in the histograms, is used up in estimating the AR(1) correlation structure of the residuals.) Fifty-three of the spline fits have three or more degrees of freedom, implying a departure from linearity. Twelve of the splines have five or more degrees of freedom, and as noted earlier, the largest number of degrees of freedom is 10.71 for Chris Smith.

The histogram in the lower left-hand panel of Figure 3 shows the distribution of first-order autocorrelation coefficients estimated in connection with the 253 splines. As illustrated there, the residual θ errors of 158, or 62 percent, of the spline fits are positively correlated.

We employ the bootstrap method to test the significance of individual player spline fits against alternative specifications of player skill. To maintain consistency in our testing methods, we apply the same set of bootstrap samples to test the significance levels of the first-order autocorrelation coefficients estimated in connection with each fit. All bootstrap tests are based on balanced sampling of 40 samples per player. Wang and Wahba (1995) describe how the bootstrap method can be used in connection with smoothing splines that are estimated without taking into account the autocorrelation in residual errors. We modify the method outlined in Wang and Wahba so that the bootstrap samples are based on η residuals, which adjust predicted scores for autocorrelation in prior θ residuals.
Forty bootstrap samples is the minimum necessary to estimate two-sided 95 percent confidence intervals. Although 40 samples is well below the number required to estimate precise confidence intervals for each individual player, a total of 40 × 253 = 10,120 bootstrap samples are taken over all 253 players, requiring over three days of computation time using a 3.19 GHz computer running under Windows XP. The 10,120 total bootstrap samples should be more than sufficient to draw general inferences about statistical significance within the overall population of PGA Tour players. After sorting the correlation coefficients computed via the 40 bootstrap samples from lowest to highest, the number significantly negative is the number of players for which the 39th correlation coefficient is negative, and the number significantly positive is the number of players for which the second correlation coefficient is positive. We find that six of 95 negative autocorrelation coefficients are significant at the 5 percent level, while 32 of 158 positive coefficients are significant. Although the implications of positive autocorrelation are developed in more detail later, based on the number of players displaying significant positive autocorrelation, it is clear that a substantial number of players exhibit hot and cold hands in their golfing performance over the sample period.

The four remaining histograms in Figure 3 summarize bootstrap tests of the 253 cubic spline fits against the following alternative methods of estimating a player's 18-hole score (after subtracting the effects of round-course interactions):

1. The player's mean score.
2. The player's mean score in each approximate calendar year period as defined by model 2.
3. The player's score estimated as a linear function of time.
4. The player's score estimated as a quadratic function of time.

For each of the four tests we form a test statistic,

$$\hat{\zeta} = \frac{RSE(alt)}{G - df_{alt}} - \frac{RSE(spline)}{G - df_{spline} - 1},$$

where G is the number of 18-hole scores for a given player, df_alt and df_spline are the numbers of degrees of freedom associated with the estimation of the alternative and spline models, respectively, and RSE(alt) and RSE(spline) are the total residual squared errors from the alternative and spline models. We subtract 1 in the denominator of the second term to account for the additional degree of freedom associated with the estimation of the first-order autocorrelation coefficient. For the purposes of this test, RSE(spline) is based on η errors, since these errors reflect the complete information set derived from each spline fit.
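In code, the statistic is a one-liner; a sketch with illustrative inputs that are not taken from the paper's data:

```python
def zeta_hat(rse_alt, rse_spline, G, df_alt, df_spline):
    """Comparison statistic from the text: positive values favor the spline fit.
    The extra -1 charges the spline model for estimating the AR(1) coefficient."""
    return rse_alt / (G - df_alt) - rse_spline / (G - df_spline - 1.0)

# Illustrative: a player with 250 rounds whose spline fit beats a linear trend.
print(zeta_hat(rse_alt=2100.0, rse_spline=1900.0, G=250, df_alt=2, df_spline=4.5))
```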
The test statistic ζ̂ is suggested by Efron and Tibshirani (1998, pp. 190-192 and p. 200 [problem 14.12]) for testing the predictive power of an estimation model formulated with two different sets of independent variables. The test statistic is computed for each of 40 bootstrap samples per player. In a one-sided test of whether the spline fit is a superior specification to the alternative model, ζ̂ should be positive in 38 or more of the 40 bootstrap samples.

The first line of Table 2 shows that the spline model is significantly superior (at the 5 percent level) to the player's mean score for approximately 30 percent of the 253 players in the sample. It is significantly superior to a linear time trend for 8 percent of the players, to the mean score computed for each (approximate) calendar year for 5 percent, and to a quadratic time trend for 5 percent of the players. The second line of the table shows that the percentage of players for which the alternative model is generally superior (but not statistically superior) is 30 percent and 17 percent for the linear and quadratic time trends, respectively, and only 8 percent and 4 percent for the mean score and mean score per (approximate) calendar year. (For a given player, the alternative model is generally superior if ζ̂ is positive for fewer than 20 of 40 bootstrap samples.) Although the spline model is superior to the alternative models at the 5 percent level for a relatively small percentage of players, it "beats" the alternative models a much larger proportion of the time than could be predicted by chance. Therefore, we conclude that the spline model is superior to any of the four alternative specifications.

5. PLAYER PERFORMANCE

5.1 Skill

Based on bootstrap sampling, the preceding analysis shows that approximately 30 percent of cubic spline-based player score estimates are significantly superior at the 5 percent level to estimates based solely on mean scores. Moreover, the mean squared residual error for each of the original cubic spline fits, measured as RSE(spline)/(G − df_spline − 1), is less than the mean squared residual about the mean, measured as RSE(mean)/(G − 1), for 70 percent of the players in the sample. These results provide strong evidence that the skill levels of PGA Tour players change through time.
For many players, such as Tiger Woods, the relationship between the player's average skill level and time is well approximated by a linear time trend, after adjusting for autocorrelation in residual errors. For others, such as Chris Smith, Ian Leggatt and Bob Estes, the relationship between mean player skill and time is more complex and cannot easily be modeled by a simple parametric time relationship.

The player-specific cubic spline functions provide point estimates of expected scores, adjusted for round-course interactions, at the end of the 1998-2001 sample period, and, therefore, can be used to rank the players at the end of 2001. Table 3 provides a summary of the best players among the sample of 253 as of the end of 2001, based on the cubic spline point estimates shown in column 1 of the table. Column 2 shows estimates of player scores after adjusting for autocorrelation in θ residual errors around each spline fit. The values in column 1 can be thought of as estimates of mean player skill at the end of the sample period. In contrast, the values in column 2 can be thought of as estimates of each player's last score as a function of his ending mean skill level and the correlation in random errors about prior mean skill estimates.

The Official World Golf Ranking is also shown for each player as of 11/04/01, the ranking date that corresponds to the end of the official 2001 PGA Tour season. The World Golf Ranking is based on a player's most recent rolling two years of performance, with points awarded based on position of finish in qualifying worldwide golf events on nine different tours. The ranking does not reflect actual player scores. Details of the ranking methodology can be found in the "about" section of the Official World Golf Ranking web site, www.officialworldgolfranking.com.

Despite being based on two entirely different criteria, the two ranking methods produce similar lists of players. Mike Weir is the only player in the top 10 of the Official World Golf Ranking (number 10) as of 11/04/01 whose name does not appear in Table 3 (he is actually 21st according to our ranking), and only five players listed in Table 3 were not ranked among the Official World Golf Ranking's top 20. Among the 20 players listed in Table 3, the predicted score for Bob Estes, adjusted for autocorrelation in residual errors (column 2), is the lowest. Although it may come as a surprise that a player with as little name recognition as Estes would be predicted to shoot the lowest score at the end of 2001, the plots of Estes' spline function in Figure 2 show why: Estes exhibited marked improvement over the last quarter of the sample period, and the improvement was sufficiently pronounced that his estimated spline function picked it up.
Table 4 lists the 10 players showing the greatest improvement and the 10 showing the greatest deterioration in skill over the 1998-2001 sample period, based on differences between beginning and ending spline-based estimates of player scores adjusted for round-course interactions. Cameron Beckman was the most improved player, improving by 3.36 strokes from the beginning of 1998 to the end of 2001. Chris DiMarco and Mike Weir, relatively unknown in 1998 but now recognized among the world's elite players, were the fifth and sixth most improved players over the sample period. Among the 10 golfers whose predicted scores went up the most were Lanny Wadkins (8.18 strokes, born 1949), Fuzzy Zoeller (2.99 strokes, 1951), Keith Fergus (2.85 strokes, 1954), Bobby Wadkins (2.74 strokes, 1951), Craig Stadler (2.65 strokes, 1953), Tom Watson (2.64 strokes, 1949), and David Edwards (2.32 strokes, 1956). Clearly, deterioration in player skill appears to be a function of age, with a substantial amount of deterioration occurring during a golfer's mid-to-late 40's. But despite the natural deterioration that occurs with age, 133 of the 253 players in the sample actually improved from the beginning of the sample period to the end.

5.2 Luck

To what extent does luck play a role in determining success or failure on the PGA Tour? Based on the premise that our model for predicting 18-hole scores is correct, luck represents deviations between actual and predicted 18-hole scores, either positive or negative, that are not sustainable from one round to the next. Even if a golfer plays at an unusually high skill level in a given round and cannot point to any specific instances of good luck to explain his performance, we would consider him to have been lucky if his high skill level cannot be sustained. In this view, luck is a random variable with temporal independence.

5.2.1 Average Residual Scores in Adjacent Rounds. In the analysis that follows, we test whether deviations between actual and predicted 18-hole scores are sustainable from one round to the next by examining whether players who record exceptionally good or bad scores in one round of a tournament can be expected to continue their exceptional performance in subsequent rounds of the same tournament. For this test we examine the performance of players in all adjacent tournament rounds that are not divided by a cut. A typical PGA Tour tournament involves four rounds of play with a cut being made after the second round. Therefore, for a typical tournament, we compare performance between rounds 1 and 2 and also between rounds 3 and 4. We do not compare performance between rounds 2 and 3, however, because approximately half of the players who participate in round 2 are cut and do not continue for a third round.
For the Bob Hope Chrysler Classic, which involves five rounds of play with a cut being made after the fourth round, we compare performance between rounds 1 and 2, rounds 2 and 3, and rounds 3 and 4, but do not compare performance between rounds 4 and 5. A similar procedure is employed for other tournaments that do not employ a cut after the second round.

For the purposes of this test, we sort all θ and η residual scores in the first of each pair of qualifying rounds and place each player into one of 20 categories of approximately equal size based on the ranking of residuals in the first round of each qualifying two-round pair. Within each sort category we compute the average residual score in both the first and second of the two adjacent rounds. Table 5 summarizes the results of this test for both θ and η residuals.

The section of the table that summarizes the analysis for θ residuals shows a very slight tendency for scores in the 20 sort categories to carry over from one round to the next. In sort categories 1-10, the average first-round θ residual score is negative, and the average second-round θ residual is also negative in each of these categories. For example, on average, players in sort category 1 have a first-round θ residual score of -5.184, meaning that players in the first group scored 5.184 strokes better (lower) than predicted by model 3. If a portion of the 5.184 strokes represents a change in skill for these players, or a competitive advantage some players enjoy on the course they played in the first round, rather than random good luck, scores in the next round should continue to be lower than predicted. Otherwise, scores of players in the first sort group should revert back to normal. In the next adjacent round (not divided by a cut), the average θ residual for these same players is -0.097. This same pattern, negative average residual first-round scores followed by negative average residual second-round scores, is present throughout the first ten sort categories. However, all of the average second-round residuals are very close to zero, with the largest, in absolute value, being 0.135 strokes for sort category 2.

To shed further light on the relationship between residual scores in adjacent rounds, we run a simple least squares regression within each sort group in which the second residual in each pair is regressed against the first. If players within a group tend to continue with similar performance, this regression coefficient should be positive and significant. However, as shown in Table 5, there is no discernible pattern among the regression coefficients within the first ten sort groups, and none of the regressions is significant.
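A sketch of the sorting procedure on simulated residual pairs; the column names are ours, and the residuals are drawn as independent noise, so the group means of the second-round residuals should sit near zero, as in Table 5:

```python
import numpy as np
import pandas as pd

# Hypothetical frame of qualifying adjacent-round pairs: resid1 and resid2 are
# a player's residual scores in the first and second rounds of each pair.
rng = np.random.default_rng(2)
pairs = pd.DataFrame({"resid1": rng.normal(0.0, 2.7, 5000),
                      "resid2": rng.normal(0.0, 2.7, 5000)})

# 20 sort categories of approximately equal size, based on first-round residuals.
pairs["group"] = pd.qcut(pairs["resid1"], q=20, labels=range(1, 21))

# Mean residuals per group, plus the within-group slope of resid2 on resid1
# (the simple regression coefficient used in the text).
grouped = pairs.groupby("group", observed=True)
means = grouped[["resid1", "resid2"]].mean()
slopes = grouped[["resid1", "resid2"]].apply(
    lambda g: g["resid2"].cov(g["resid1"]) / g["resid1"].var())
print(means.assign(slope=slopes))
```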
Among sort groups 11-20, all average first-round θ residuals are positive, and seven of ten average second-round residuals are positive. But except in sort categories 19 and 20, with average second-round residuals of 0.219 and 0.328, respectively, the average second-round residuals would be of little if any significance in golf. One possible explanation for the different pattern of average residuals in categories 19 and 20, and the significant positive regression coefficient in sort group 20, is that players who have performed poorly prior to a cut may take more chances in an attempt to make the cut. If riskier play tends to be accompanied by higher scores, this pattern should emerge. An alternative explanation is that many players in the last two sort groups may consider it a foregone conclusion that they will miss the cut and, perhaps, do not give their best efforts in the second of the two adjacent rounds. Although the two explanations are quite different, both involve players changing their normal method of play when the second of the two rounds is followed by a cut.

We can test for this tendency by separating the average residual scores into those for which the second of the two adjacent rounds is immediately followed by a cut and those for which a cut does not occur after the second of the two rounds. Although not shown in the table, for all sort groups except groups 19 and 20, there is little difference in performance when a cut occurs after the second of two qualifying adjacent rounds (panel 2) and when it does not (panel 3). However, for sort categories 19 and 20, the average θ residuals in the second of the two adjacent rounds are 0.381 and 0.406, respectively, when a cut occurs immediately after the second of the two rounds, and -0.021 and 0.212, respectively, when a cut does not follow the second round. Moreover, the regression of second-round residuals against first-round residuals is significant for sort category 20 only when the second of the two rounds is followed by a cut. Taken together, this evidence suggests that tournament participants who perform exceptionally poorly in the earliest round(s) may change their method of play in the round before a cut to reflect the high probability of being cut. Otherwise, the threat of being cut does not appear to affect player performance.

It should be noted that if a player's exceptionally good (or bad) score, as measured by his θ residual in the first of two adjacent rounds, is due to an advantage (or disadvantage) in playing a particular course, the advantage should carry over into the next round, provided the next round is played on the same course. Since the large majority of adjacent rounds summarized in Table 5 involve play on the same course, the fact that the average second-round residual in each sort category is essentially zero provides additional evidence that player-course interactions do not play a significant role in determining 18-hole scores on the PGA Tour.
Table 5 also summarizes the results of identical tests of η residuals. Unless the AR(1) model does not adequately capture the correlation structure of player scores around their respective spline fits and/or there are significant player-course interaction effects, η residuals should represent white noise uncorrelated from one round to the next. Table 5 shows that regardless of the first-round sort category, average residual scores in the second of two adjacent rounds are very close to zero, and that there is no discernible relationship between the signs of average first-round scores and those of adjacent second-round scores. As with θ residuals, we run a least squares regression within each sort group in which the second residual in each pair is regressed against the first. Only one coefficient is significant at the 5 percent level – that for sort group 4. Moreover, the fact that the average second-round residual in group 4 is so close to zero (-0.029) indicates that any tendency for the direction of abnormal performance to persist in this sort group is not accompanied by a level of performance that would make much difference in determining a player's 18-hole score. Therefore, we conclude that significant differences between actual and predicted scores are due, primarily, to luck, and that these differences cannot be expected to persist.

5.2.2 How Much Luck Does it Take to Win a PGA Tour Event? It is interesting to consider how much luck it takes to win a PGA Tour event or to at least finish among the top players. Table 6 summarizes the actual scores and θ residual scores on a round-by-round basis for the top 40 finishers in the 2001 Players Championship. We focus on θ residual scores rather than η residuals so that we can determine the extent to which each player's scores deviate from their time-dependent mean values. The pattern of residual scores exhibited in this table is typical of those for all 182 tournaments. Among the top 40 finishers, the total residual score for almost all players is negative; only Phil Mickelson, Colin Montgomerie and Steve Flesch have positive total residual scores. Thus, to have won this tournament, and almost all others in the sample, not only must one have played better than normal, but one must have also played sufficiently well (or with sufficient luck) to overcome the collective good luck of many other participants in the same event.

Over all 182 tournaments, the average total θ residual score for winners and first-place ties was -9.92 strokes, with the total residual ranging from -2.14 strokes for Tiger Woods in the 2000 AT&T Pebble Beach Pro-Am to -23.48 strokes for Mark Calcavecchia in the 2001 Phoenix Open. (Similarly, the average η residual was -9.80 strokes.) Table 7 summarizes the highest 20 total θ residual scores per tournament for winners and first-place ties.
It is noteworthy that no player won a tournament after recording a positive total θ residual score. Although not shown in the table, Tiger Woods' total η residual score for the 2000 AT&T Pebble Beach event was +0.32 strokes, but this is the only positive total residual using either the θ or η measure. It is also noteworthy that Woods' name appears 11 times in Table 7 and that all the other players on the list (Phil Mickelson, Sergio Garcia, David Duval, and Davis Love III) are among the world's most recognizable players. Thus, over the 1998-2001 period, only Tiger, and perhaps a handful of other top players, were able to win tournaments without experiencing exceptionally good luck.

5.2.3 Standard Deviation of Residual Scores. Figure 4 summarizes the standard deviation of θ residual errors among all 253 players in the sample. The range of standard deviations is 2.14 to 3.45 strokes per round, with a median of 2.69 strokes. John Daly and Phil Mickelson, both well-known for their aggressive play and propensities to take risks, have the third and 15th highest standard deviations, respectively. Ian Leggatt has the lowest standard deviation. Chris Riley and Jeff Sluman, both known as very conservative players, have the second and fifth lowest deviations, respectively.

It is interesting to consider whether average scores and standard deviations of θ residual errors are correlated. A least squares regression of each player's mean spline-based estimate of skill over the entire sample period against the standard deviation of his θ residuals yields: expected score = 68.22 + 1.14 × standard deviation, with adjusted R² = 0.067, F = 19.02 and p-value = 0.00002. Thus, there is a tendency for greater variation in player scores to be associated with slightly higher average scores.

5.2.4 Effect of Round-Course Interactions. Figure 5 summarizes the distribution of the 848 random round-course interaction coefficients estimated in connection with model 3. The coefficients range in value from -3.924 to 6.946, implying almost an 11-stroke difference between the relative difficulty of the most difficult and easiest rounds played on the Tour during the 1998-2001 period.

Over this period, 25 tournaments were played on more than one course. In multiple-course tournaments, players are (more or less) randomly assigned to a group that rotates among all courses used for the tournament. By the time each rotation is completed, all players will have played the same courses. At that time a cut is made, and participants who survive the cut finish the tournament on the same course. Although every attempt is made to set up the courses so that they play with approximately the same level of difficulty in each round, tournament officials cannot control the weather and, therefore, there is no guarantee that the rotation assignments will all play with the same approximate levels of difficulty.
Figure 6 shows the distribution of the difference in the sum of round-course interaction coefficients for the easiest and most difficult rotations for the 25 tournaments that were played on more than one course. For seven of the 25 tournaments, the difference was less than 0.50 strokes; on average, the difference for these tournaments was sufficiently small that a player's total score should have been the same regardless of the course rotation to which he was assigned. At the extreme, there was a 5.45-stroke differential between the relative difficulties of the easiest and most difficult rotation assignments in the 1999 AT&T Pebble Beach Pro-Am. Within this tournament, two of six possible rotations played with round-course interaction coefficients that totaled 10.41 and 10.23 strokes, while the totals for the remaining four rotations fell between 4.96 and 5.72 strokes. The two difficult rotations involved playing the famed Pebble Beach course on the third day of the tournament. On a day described as one of the nastiest since the tournament began in 1947 (www.golftoday.co.uk/tours/tours99/pebblebeach/round3report.html), the adverse weather conditions had a much greater effect on scores recorded on the Pebble Beach course than on the other two courses, Spyglass and Poppy Hills. According to David Duval (www.golftoday.co.uk), "This is the stuff we stopped playing in last year. … It's the type of day you don't want, for the sole reason that luck becomes a big factor."

It is interesting that the top nine finishers in this tournament all played one of the four easiest rotations, as did four of the five players who tied for 10th place. Among the top 20 finishers, only two played one of the two difficult rotations. Clearly, for this particular tournament Duval was right: the luck of the draw had more to do with determining the top money winners than the actual skill exhibited by the players.

5.3 Hot and Cold Hands

Hot and cold hands represent the tendency for abnormally good and poor performance to persist over time and have been the focus of a number of statistical studies of sports. While some had argued for the presence of a hot hand in basketball, Gilovich, Vallone and Tversky (1985) argued that this belief might exist even though actual shooting was consistent with a random process of hits/misses (wins/losses). In other words, people may find systematic patterns in what is actually random data. Larkey, Smith, and Kadane (1989) disputed this finding. Wardrop (1995) notes that
You can also read