A Predictive Model for NFL Rookie Quarterback Fantasy Football Points - Steve Bronder and Alex Polinsky


Duquesne University Economics Department
Abstract
    This analysis designs a model that predicts the fantasy football points of NFL rookie
quarterbacks. Theoretical models for judging the quality of football players are
limited. We theorize that a player's skill can be isolated by assessing his fantasy
football points. The model predicts, within a margin of error, how well a quarterback
will perform in his rookie year based on his college career statistics. Data were
gathered for quarterbacks selected in the NFL drafts from 2009 to 2012. Our linear model
uses college career statistics to predict fantasy football scores.
We find that the percentage of pass completions, average yards per completion, and
rushing touchdowns are significant predictors of a rookie quarterback's NFL fantasy
points. The model does not suffer from heteroskedasticity, but it does have a large
constant and a large standard error for the constant.

1 Background
    Statistical analysis in baseball has become so popular that there is now an
Oscar-nominated movie and a best-selling book about it. Moneyball is the story of how the
Oakland Athletics' general manager, facing a tight budget and the loss of his star
players, used statistical analysis to sign undervalued players. The Oakland Athletics,
despite having such a small budget, managed to make it to the playoffs that year. Since
then, almost every baseball team in the United States has adopted sabermetrics, the
statistical analysis of baseball.
While sabermetrics has dominated baseball, other sports such as football have been wary
of incorporating statistical analysis into team decision-making. Most decisions
football teams make, whether which players to draft or what play to run on fourth
down, are made heuristically by the team's general manager or head coach. Before
each season, teams go through a draft of seven rounds in which they choose NCAA
athletes to join their team. Estimating how well a rookie will perform in the
NFL is very difficult because, unlike baseball, where each at-bat takes place between only
the pitcher and batter, there are twenty-two players on the field affecting
every detail of the play. This makes the statistics for each individual player difficult to
assess, since each player can change the outcome of a play wildly. The challenge of
measuring a single player's value is what sparked our interest.

2 Model
    In this model, we regress NFL rookie quarterbacks' fantasy points on the players'
college performance. Our data consist of metrics such as rushing touchdowns, passing
touchdowns, passing completion percentage, and rushing yards. Using fantasy points as
our dependent variable allows us to remove some of the omitted variable bias that is
inherent in our data, because fantasy scoring ignores game context: when fantasy points
are calculated, it does not matter whether yards gained, for example, were made on a
first down or a fourth down.
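To make the point about game context concrete, here is a minimal sketch of how a season's fantasy points might be computed from season totals alone. The scoring weights are an assumption (a common default in many leagues); the paper does not state which scoring system it used, and the `SeasonStats` type and `fantasy_points` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SeasonStats:
    """Season totals for one quarterback (hypothetical container)."""
    passing_yards: int
    passing_tds: int
    interceptions: int
    rushing_yards: int
    rushing_tds: int


# Assumed scoring weights; leagues differ, and the paper does not give its own.
PASSING_YARDS_PER_POINT = 25
POINTS_PER_PASSING_TD = 4
POINTS_PER_INTERCEPTION = -2
RUSHING_YARDS_PER_POINT = 10
POINTS_PER_RUSHING_TD = 6


def fantasy_points(stats: SeasonStats) -> float:
    """Score a season from totals only; down and distance never enter."""
    return (
        stats.passing_yards / PASSING_YARDS_PER_POINT
        + stats.passing_tds * POINTS_PER_PASSING_TD
        + stats.interceptions * POINTS_PER_INTERCEPTION
        + stats.rushing_yards / RUSHING_YARDS_PER_POINT
        + stats.rushing_tds * POINTS_PER_RUSHING_TD
    )


# Hypothetical season line, not taken from the paper's data.
print(fantasy_points(SeasonStats(4000, 25, 10, 300, 3)))
```

Because the function only sees totals, a yard gained on first down and a yard gained on fourth down contribute identically, which is the property the text relies on.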
We use a standard linear regression. A linear regression is appropriate
because we are only interested in how fantasy points change with our explanatory
variables. Unlike in baseball, there is little theory in football on which to base our
models. As such, we perform a stepwise regression and use the Bayesian Information
Criterion (BIC) to select the model. The BIC measures the relative quality of a model by
weighing its goodness of fit against its complexity.
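As a rough illustration of the fit-versus-complexity trade-off, the sketch below uses the usual Gaussian linear-model form of the BIC, n ln(RSS/n) + k ln(n), which drops additive constants that are the same for every model fitted to the same data. The function name and the residual sums of squares are hypothetical and are not taken from the paper's data.

```python
import math


def bic_gaussian(rss: float, n_obs: int, n_params: int) -> float:
    """BIC of a Gaussian linear model, up to a constant shared by all models.

    rss      : residual sum of squares of the fitted model
    n_obs    : number of observations
    n_params : number of estimated coefficients (including the intercept)
    """
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


# Hypothetical comparison: adding a fourth coefficient lowers the RSS a
# little, but not enough to offset the ln(n) penalty for the extra parameter.
small_model = bic_gaussian(rss=1000.0, n_obs=32, n_params=3)
large_model = bic_gaussian(rss=990.0, n_obs=32, n_params=4)
print(small_model < large_model)
```

The model with the lower BIC is preferred, so in this made-up case the smaller model wins even though the larger one fits slightly better.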
Some problems we run into are heteroskedasticity and omitted variable bias. Omitted
variable bias will likely be prominent because of the nature of the sport. Unlike baseball,
which is an individual-based sport hidden in a team sport, football is solely a team sport.
If a quarterback is on a very good team he will inherently have a better fantasy score
than a quarterback on a worse team because, for example, he may have better linemen
to block for his passes.
It may not be possible to correct for omitted variable bias, as that type of data is
unavailable. It may be a good place to pursue research in the future. To test for
heteroskedasticity we will use the Breusch-Pagan test. If we find heteroskedasticity, we
will use robust standard errors to correct for it or look to respecify the model.
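The idea behind the Breusch-Pagan test can be sketched in a few lines: fit the regression, regress the squared residuals on the predictors, and compare n times the auxiliary R-squared with a chi-square distribution. The version below is a simplified, single-predictor sketch of the studentized (Koenker) form, written with the standard library only; it is not the authors' code, the paper's model has several predictors, and the data are synthetic.

```python
import math


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Coefficient of determination of the simple regression of y on x."""
    a, b = simple_ols(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan_one_regressor(x, y):
    """Return (LM statistic, p-value) for a one-predictor regression."""
    a, b = simple_ols(x, y)
    squared_residuals = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # LM = n * R^2 of the auxiliary regression; clamp tiny negative rounding.
    lm = max(0.0, len(x) * r_squared(x, squared_residuals))
    # With one predictor, LM is chi-square with 1 degree of freedom, whose
    # upper tail probability equals erfc(sqrt(LM / 2)).
    return lm, math.erfc(math.sqrt(lm / 2.0))


# Synthetic data: the error spread grows with x in the first series and
# stays constant in the second.
x = list(range(1, 21))
y_fanning = [2.0 * xi + 0.5 * xi * (-1) ** xi for xi in x]
y_constant = [2.0 * xi + 0.5 * (-1) ** xi for xi in x]

lm_fan, p_fan = breusch_pagan_one_regressor(x, y_fanning)
lm_const, p_const = breusch_pagan_one_regressor(x, y_constant)
print(p_fan < 0.05, p_const < 0.05)
```

A small p-value leads to rejecting the null hypothesis of constant error variance; a large one means the test found no evidence of heteroskedasticity.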

3 Regression Specification
    First, we used the R package leaps to perform forward and backward stepwise
regressions. When fitting models, it is always possible to increase R^2 by adding
variables; however, this can lead to overfitting. The BIC addresses this problem by
introducing a penalty term for the number of parameters in the model. The lower a
model's BIC, the more efficient the model. The chart below shows the BICs from our
stepwise regressions. As each independent variable goes from white to black, it means
the variable was increasingly significant as other variables were dropped.

            Figure 1: BIC Plot for Stepwise Regression on College Metrics

   The graph above suggests that the most efficient model is the following equation:

            $$\mathrm{FTSPTS} = \beta + \mathrm{AY.A}(X_1) + \mathrm{TD.2}(X_2) \tag{1}$$

Here FTSPTS, AY.A, TD.2, and β are rookie quarterback fantasy points, average yards per
game, rushing touchdowns, and a constant, respectively. While this model was the most
efficient, a Breusch-Pagan test for heteroskedasticity rejected the null hypothesis of
homoskedasticity at the .01 level, so we concluded that the model suffered from
heteroskedasticity. With this knowledge, we returned to the chart above and decided to
try the model represented by the following equation:

        $$\mathrm{FTSPTS} = \beta + \mathrm{Pct}(X_1) + \mathrm{Y.A}(X_2) + \mathrm{TD.2}(X_3) \tag{2}$$

Pct and Y.A are the pass completion percentage a quarterback earned and the average
number of yards per completion in his college career, respectively. Running the
Breusch-Pagan test for heteroskedasticity on this model returned a p-value of .1063.
Because this p-value exceeds the .10 level, we fail to reject the null hypothesis of
homoskedasticity and conclude that this model does not suffer from heteroskedasticity.
Figure 2 plots the residuals versus leverage for equation 2.

                  Figure 2: Residuals versus Leverage for Equation 2

   Figure 2 shows high leverage for Nick Foles and Cam Newton. This means that the
predictor values for these two observations lie far from the average predictor values,
giving them the potential to strongly influence the fit. These two observations are why
we are close to rejecting the null hypothesis of the Breusch-Pagan test. The Cook's
distance contours in the graph show us what type of data points we should attempt to
find to best decrease the chance of our model suffering from heteroskedasticity. Finding
more star players like Cam Newton and Nick Foles would decrease the chance of
heteroskedasticity.
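Leverage can be illustrated with a one-predictor simplification, where the hat value of observation i is 1/n + (x_i - mean)^2 / Sxx. The sketch below uses made-up completion percentages, not the paper's data, and flags points using the common rule of thumb of twice the average leverage; the paper's own model has three predictors and would need the full hat matrix.

```python
def leverages(x):
    """Hat values for a simple regression with an intercept and one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def high_leverage_indices(x, n_params=2):
    """Indices whose leverage exceeds twice the average leverage (2p / n)."""
    threshold = 2.0 * n_params / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > threshold]


# Hypothetical completion percentages; the last player is far from the rest.
completion_pct = [55.0, 58.0, 60.0, 61.0, 62.0, 63.0, 64.0, 75.0]

hat_values = leverages(completion_pct)
flagged = high_leverage_indices(completion_pct)
print(flagged)
```

The hat values always sum to the number of estimated coefficients, so a point is unusual only relative to that average share; here only the last, most extreme observation is flagged.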

4 Results Analysis
   Below are the outputs for equation 2.

                                 Estimate     Std. Error    t value    Pr(>|t|)
                (Intercept)     -783.9780       222.5539       -3.52   0.0015**
                        Pct        7.4641         4.1502        1.80     0.0829.
                       Y.A        52.6316        22.2688        2.36    0.0253*
                      TD.2        13.0181         6.2696        2.08    0.0472*
              Signif. codes:   '***' 0.001    '**' 0.01    '*' 0.05    '.' 0.1

    For every one percentage point increase in pass completion percentage, predicted
fantasy points increase by 7.46, holding the other variables constant. For every one-yard
increase in average yards per completion, predicted fantasy points increase by 52.63.
For each additional rushing touchdown, predicted fantasy points increase by 13.02. All
variables are significant at the .1 level or better: the percentage of completed passes
is significant at .1, average yards and rushing touchdowns are significant at .05, and
the intercept is significant at .01.
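The fitted equation can be applied directly to a college stat line. The sketch below plugs the reported estimates for equation 2 into a small function; the coefficients come from the table above, while the function name and the example stat line are hypothetical and do not describe any real player.

```python
# Estimates for equation 2, copied from the regression output above.
COEFFICIENTS = {
    "intercept": -783.9780,
    "Pct": 7.4641,    # pass completion percentage
    "Y.A": 52.6316,   # average yards per completion (the paper's definition)
    "TD.2": 13.0181,  # rushing touchdowns
}


def predict_fantasy_points(pct, yards_avg, rushing_tds):
    """Predicted rookie fantasy points from college career statistics."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["Pct"] * pct
        + COEFFICIENTS["Y.A"] * yards_avg
        + COEFFICIENTS["TD.2"] * rushing_tds
    )


# Hypothetical stat line: 65 percent completions, 8.0 yards, 5 rushing TDs.
baseline = predict_fantasy_points(pct=65.0, yards_avg=8.0, rushing_tds=5)

# Raising completion percentage by one point moves the prediction by the
# Pct coefficient, with the other inputs held fixed.
one_more_point = predict_fantasy_points(pct=66.0, yards_avg=8.0, rushing_tds=5)
print(round(baseline, 1), round(one_more_point - baseline, 2))
```

The large negative intercept means the prediction is only sensible for stat lines in the range of the sample; very low inputs would give negative predicted points.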
One thing that is interesting to note is our large negative intercept and its large
standard error. This may be due in part to the small number of observations we were able
to obtain. One way to address this and increase the predictive power of our model
would be to collect more data. Attempting to remove the constant introduced
heteroskedasticity, and the independent variables became insignificant.
To check for multicollinearity, we computed variance inflation factors (VIFs). All
variables had a VIF of less than 5, so we concluded that our model does not suffer
from multicollinearity.
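For reference, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the remaining predictors, so the cutoff of 5 corresponds to an auxiliary R^2 of 0.8. The sketch below shows the two-predictor special case, where that R^2 is simply the squared correlation; the helper names and the data are hypothetical, and the paper's three-predictor model would require a full auxiliary regression for each variable.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def vif_from_r_squared(r_squared):
    """Variance inflation factor implied by an auxiliary R-squared."""
    return 1.0 / (1.0 - r_squared)


# Hypothetical predictor columns, not the paper's data.
completion_pct = [58.0, 60.5, 62.0, 63.5, 66.0, 68.5]
rushing_tds = [4.0, 9.0, 2.0, 7.0, 3.0, 8.0]

r = pearson_r(completion_pct, rushing_tds)
print(vif_from_r_squared(r * r))
```

A VIF of 1 means a predictor is uncorrelated with the others; values that climb toward or past 5 indicate that its coefficient's variance is being inflated by overlap with the other predictors.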
Because the model does not suffer from heteroskedasticity or multicollinearity, we can
establish confidence intervals on our parameters. Table 2 contains the 95 percent
confidence intervals.

                                                2.5 %     97.5 %
                              (Intercept)    -1239.86    -328.10
                                      Pct        -1.04     15.97
                                    TD.2          0.18     25.86
                                     Y.A          7.02     98.25

5 Conclusion
    In conclusion, our model was estimated on a small sample and showed no evidence of
multicollinearity or heteroskedasticity. We narrowed our choice of independent variables
to limit possible omitted variable bias. The model explains a moderate share of the
variation in rookie fantasy points, with an R-squared of 0.436. However, the large
standard error of our constant suggests that collecting more data would help increase the
predictive power of our model.
We tested how well our model predicts for one player, Matthew Ryan. The model's
prediction was only fifteen fantasy points away from his actual point total, which offers
some support for the validity of our regression model. While this is just one
observation, it falls within the model's confidence intervals. Further data collection
and back-testing against this model would help us identify its strength and its
prediction interval.
