A Predictive Model for NFL Rookie Quarterback Fantasy Football Points - Steve Bronder and Alex Polinsky
A Predictive Model for NFL Rookie Quarterback Fantasy Football Points Steve Bronder and Alex Polinsky Duquesne University Economics Department
Abstract
This analysis designs a model that predicts the fantasy football points of NFL rookie quarterbacks. Theoretical models for judging the quality of football players are limited. We theorize that a player's skill can be isolated by assessing his fantasy football points. The model predicts, within a margin of error, how well a quarterback will perform in his rookie year based on his college career statistics. Data were gathered for quarterbacks in the NFL drafts from 2009 to 2012. Our linear model uses college career statistics to predict fantasy football scores. We find that the percentage of pass completions, average yards per game, and rushing touchdowns are significant predictors of a rookie quarterback's NFL fantasy points. The model does not suffer from heteroskedasticity, but it does have a large constant and a large standard error for the constant.
1 Background
Statistical analysis in baseball has become so popular that there is now an Oscar-nominated movie and a best-selling book about it. Moneyball is the story of how the Oakland Athletics' General Manager, facing a tight budget and the loss of his star players, used statistical analysis to hire undervalued players. The Oakland Athletics, despite having such a small budget, managed to make it to the playoffs that year. Since then, almost every baseball team in the United States has adopted sabermetrics, the statistical analysis of baseball.

While sabermetrics has dominated baseball, other sports such as football have been wary of bringing statistical analysis into team decision-making. Most decisions football teams make, whether which players to draft or what play to run on fourth down, are made heuristically by the team's manager or head coach. Before each season, each team goes through a draft of seven rounds in which it chooses NCAA athletes to join the team. Estimating how well a rookie will perform in the NFL is very difficult because, unlike baseball, where each at bat takes place between only the pitcher and the batter, there are twenty-two players on the field affecting every detail of the play. Each player can change the outcome of a play wildly, which makes the statistics for any individual player difficult to assess. The challenge of measuring a single player's value is what sparked our interest.

2 Model
In this model, we regress rookie NFL quarterbacks' fantasy points on the players' college performance. Our data consist of metrics such as rushing touchdowns, passing touchdowns, passing completion percentage, and rushing yards. Using fantasy points as our dependent variable allowed us to remove some of the omitted variable bias that is inherent in our data.
When fantasy points are calculated, it does not matter whether yards gained, for example, were made on a first down or a fourth down. We will use a standard linear regression, which is appropriate because we are only interested in how fantasy points change with our explanatory variables. Unlike in baseball, there is not much theory in football on which to base our models. As such, we will perform a stepwise regression and use the Bayesian Information Criterion (BIC) to determine the proper model. The BIC measures the relative quality of a model by weighing goodness of fit against complexity.

Some problems we run into are heteroskedasticity and omitted variable bias. Omitted variable bias will likely be prominent because of the nature of the sport. Unlike baseball, which is an individual sport hidden in a team sport, football is solely a team sport. If a quarterback is on a very good team, he will inherently have a better fantasy score than a quarterback on a worse team because, for example, he may have better linemen blocking for him. It may not be possible to correct for omitted variable bias, as that type of data is unavailable. This may be a good area for future research. To test for heteroskedasticity we will use the Breusch-Pagan test. If we find heteroskedasticity, we will use robust standard errors to correct for it or respecify the model.

3 Regression Specification
First, we used the R package leaps to perform forward and backward stepwise regressions. When fitting models, it is always possible to increase $R^2$ by adding variables; however, this can lead to overfitting. The BIC addresses this problem by introducing a penalty term for the number of parameters in the model. The lower a model's BIC, the more efficient the model. Figure 1 shows the BICs from our stepwise regressions. As an independent variable goes from white to black, it was increasingly significant as other variables were dropped.

Figure 1: BIC Plot for Stepwise Regression on College Metrics

Figure 1 suggests that the most efficient model is the following equation:

$$\mathrm{FTSPTS} = \beta + \mathrm{AY.A}(X_1) + \mathrm{TD.2}(X_2) \tag{1}$$
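The trade-off between fit and complexity that the BIC encodes can be illustrated with a small sketch. It is written in Python rather than the R used for the analysis, and it uses one common form of the BIC for Gaussian linear models, n ln(RSS/n) + k ln(n); the leaps package may report a differently scaled value. The function name `bic`, the sample size, and the residual sums of squares are all hypothetical and are not taken from the paper's data.

```python
import math


def bic(rss: float, n_obs: int, n_params: int) -> float:
    """BIC for a Gaussian linear model, dropping additive constants.

    rss      -- residual sum of squares of the fitted model
    n_obs    -- number of observations
    n_params -- number of estimated coefficients (including the intercept)
    """
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


# HYPOTHETICAL fits on the same 32 observations (not the paper's data).
N_OBS = 32
candidates = {
    "two predictors": {"rss": 1000.0, "n_params": 3},   # intercept + 2 slopes
    "four predictors": {"rss": 980.0, "n_params": 5},   # intercept + 4 slopes
}

scores = {
    name: bic(c["rss"], N_OBS, c["n_params"]) for name, c in candidates.items()
}
best = min(scores, key=scores.get)  # lower BIC is preferred
print(best, scores)
```

In this made-up comparison the larger model fits slightly better, but the improvement is too small to offset the ln(n) penalty for each extra coefficient, so the smaller model has the lower BIC.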
In Equation 1, FTSPTS, AY.A, TD.2, and β are rookie quarterback fantasy points, average yards per game, rushing touchdowns, and a constant, respectively. While this model was the most efficient, a Breusch-Pagan test for heteroskedasticity rejected the null hypothesis of homoskedasticity at the .01 level, so we conclude that the model suffers from heteroskedasticity. With this knowledge, we returned to Figure 1 and tried the model represented by the following equation:

$$\mathrm{FTSPTS} = \beta + \mathrm{Pct}(X_1) + \mathrm{Y.A}(X_2) + \mathrm{TD.2}(X_3) \tag{2}$$

Pct and Y.A are a quarterback's pass completion percentage and average yards per completion over his college career, respectively. The Breusch-Pagan test for this model returned a p-value of .1063. Because this p-value exceeds the .10 level, we fail to reject the null hypothesis and conclude that this model does not suffer from heteroskedasticity. Figure 2 plots the residuals against leverage for Equation 2.

Figure 2: Residuals versus Leverage for Equation 2

Figure 2 shows excessive leverage for Nick Foles and Cam Newton. This means that the predictor values for these two data points are far from the average predictor values. These two data points are why we come close to rejecting the null hypothesis of the Breusch-Pagan test. The Cook's distance in the graph shows us what type of data points we should attempt to find to best decrease the chance of our model suffering from heteroskedasticity. Finding more star players like Cam Newton and Nick Foles would decrease the chance of heteroskedasticity.

4 Results Analysis
Below are the outputs for Equation 2.

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)  -783.9780     222.5539     -3.52     0.0015 **
Pct             7.4641       4.1502      1.80     0.0829 .
Y.A            52.6316      22.2688      2.36     0.0253 *
TD.2           13.0181       6.2696      2.08     0.0472 *
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For every one percentage point increase in pass completion percentage, there is a 7.46 point increase in fantasy points. For every one yard increase in average yards per completion, there is a 52.63 point increase in fantasy points. For each additional rushing touchdown, there is a 13.02 point increase in fantasy points. All variables are found to be significant: percentage of completed passes at the .1 level, average yards and rushing touchdowns at the .05 level, and the intercept at the .01 level.

One thing that is interesting to note is our large negative intercept and its large standard error. This may be due in part to the low number of data points we were able to obtain. One way to correct for this and increase the predictive power of our model would be to collect more data. Attempting to remove the constant led to a heteroskedastic model in which the independent variables became insignificant.

To test for multicollinearity, we computed the variance inflation factor (VIF) for each variable. All variables had a VIF of less than 5, so we conclude that our model does not suffer from multicollinearity. Because the model suffers from neither heteroskedasticity nor multicollinearity, we can establish confidence intervals on our parameters. Table 2 contains the 95 percent confidence intervals.

                 2.5 %     97.5 %
(Intercept)   -1239.86    -328.10
Pct              -1.04      15.97
TD.2              0.18      25.86
Y.A               7.02      98.25
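As a rough check on how the reported numbers fit together, the following Python sketch (illustrative only; the analysis itself was run in R) rebuilds the confidence intervals as estimate plus or minus a critical t value times the standard error, and evaluates Equation 2 for one made-up quarterback. The coefficients and standard errors are copied from the regression output. The critical value 2.0484, the 97.5th percentile of a t distribution with 28 degrees of freedom, is an assumption: the paper does not state its sample size, and this value was chosen only because it reproduces the reported intervals. The function names and the example inputs are hypothetical.

```python
# Estimates and standard errors as reported in the regression output for Equation 2.
COEF = {"(Intercept)": -783.9780, "Pct": 7.4641, "Y.A": 52.6316, "TD.2": 13.0181}
SE = {"(Intercept)": 222.5539, "Pct": 4.1502, "Y.A": 22.2688, "TD.2": 6.2696}

# ASSUMPTION: 97.5th percentile of a t distribution with 28 degrees of freedom.
# The paper does not report its degrees of freedom.
T_CRIT = 2.0484


def conf_int(name: str) -> tuple[float, float]:
    """95 percent confidence interval: estimate +/- T_CRIT * standard error."""
    half_width = T_CRIT * SE[name]
    return COEF[name] - half_width, COEF[name] + half_width


def predict(pct: float, yards_avg: float, rush_td: float) -> float:
    """Point prediction of rookie fantasy points from Equation 2."""
    return (
        COEF["(Intercept)"]
        + COEF["Pct"] * pct
        + COEF["Y.A"] * yards_avg
        + COEF["TD.2"] * rush_td
    )


for name in COEF:
    low, high = conf_int(name)
    print(f"{name:12s} {low:10.2f} {high:10.2f}")

# HYPOTHETICAL college career line, not a real player.
print(round(predict(pct=60.0, yards_avg=8.0, rush_td=10), 2))
```

Under this assumed critical value the computed intervals agree with the reported ones to two decimal places. The interval for Pct includes zero, which is consistent with that variable being significant only at the .1 level.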
5 Conclusion
In conclusion, our model was fit on a small sample and showed neither multicollinearity nor heteroskedasticity. We narrowed our choice of independent variables to limit possible bias from our variable choice. The model has an R-squared of 0.436, meaning it explains roughly 44 percent of the variation in rookie fantasy points. However, the large standard error of our constant implies that collecting more data would substantially increase the predictive power of our model. We tested how predictive our model is on the player Matthew Ryan. The model's prediction was a mere fifteen fantasy points away from his actual point total, which lends some support to our regression model. While this is just one data point, it does fit within our model's confidence intervals. Further data collection and back testing against this model would help us identify the strength and prediction interval of the linear regression.
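The back testing suggested above could be organized along the following lines. This is a minimal Python sketch, not the authors' procedure: it applies the reported Equation 2 coefficients to held-out quarterbacks and summarizes the misses as a mean absolute error. The `backtest` helper and every row in `holdout` are hypothetical placeholders; real college statistics and rookie fantasy totals would have to be supplied.

```python
# Coefficients as reported for Equation 2.
INTERCEPT, B_PCT, B_YA, B_TD2 = -783.9780, 7.4641, 52.6316, 13.0181


def predict(pct: float, yards_avg: float, rush_td: float) -> float:
    """Point prediction of rookie fantasy points from Equation 2."""
    return INTERCEPT + B_PCT * pct + B_YA * yards_avg + B_TD2 * rush_td


def backtest(rows):
    """Return per-player absolute errors and their mean.

    Each row is (name, pct, yards_avg, rush_td, actual_fantasy_points).
    """
    errors = {
        name: abs(predict(pct, ya, td) - actual)
        for name, pct, ya, td, actual in rows
    }
    mean_abs_error = sum(errors.values()) / len(errors)
    return errors, mean_abs_error


# HYPOTHETICAL hold-out players; none of these numbers come from the paper.
holdout = [
    ("QB A", 60.0, 8.0, 10, 230.0),
    ("QB B", 64.5, 7.5, 4, 120.0),
    ("QB C", 58.0, 8.4, 15, 300.0),
]

errors, mae = backtest(holdout)
print(errors)
print(f"mean absolute error: {mae:.1f} fantasy points")
```

A single close prediction, such as the fifteen-point miss reported for Matthew Ryan, corresponds to one entry in `errors`; averaging over many held-out players gives a steadier picture of predictive accuracy.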