Lewis University
Dr. James Girard Summer Undergraduate Research
                  Program 2022
       Faculty Mentor - Project Application:
 Predictive Modeling and Analysis of Sports Using
           Linear Algebra-based Models
                                     Dr. Amanda Harsy
              Department of Engineering, Computing, and Mathematics Sciences

                                       January 7, 2022

                                           Abstract
     Ranking sports teams and predicting post-season results from seasonal games can be chal-
 lenging. Among the many mathematically inspired sports ranking systems, linear algebra-based
 methods like Colley, Massey, and Markov Chain models are relatively simple and can easily be
 introduced to undergraduate students who have taken a linear algebra course. At their most
 basic level, these methods are useful for sports rankings, but unfortunately, they are not par-
 ticularly strong at predicting future outcomes of games. One way to possibly improve these
 methods for ranking and predicting future outcomes is by introducing weights or features to
 these systems and by using cross validation to help determine the quality of our models. In
 this research, we will create and test the predictive power of linear algebra-based models using
 sports data of the student’s choosing.
1     Project Description
1.1   Introduction and Background
Ranking sports teams can be a challenging task and using straight win percentage can be misleading
at times. For example, a team that wins 9/10 games will have the same win percentage as a team
that wins 90/100. Sports analysts have turned to mathematics to help give more accurate sports
rankings. There are many methods for ranking sports [1, 4, 5, 12], and algorithmic ranking systems
are published online at sites like www.mratings.com, www.thepredictiontracker, www.bcsfootball.org,
and www.masseyratings.com. Among the many mathematically inspired sports
ranking systems, linear algebra-based methods are among the most elegantly simple. A variety
of methods are used and some are dependent on complex mathematical or statistical tools, but
weighted Colley and Massey Methods and Markov Chain Models are very accessible for undergrad-
uates who have taken linear algebra. Other mathematicians have explored the Colley and Massey
methods in basketball and football. For example, Dr. Tim Chartier of Davidson College has used
both the Colley and Massey Methods to predict the NCAA March Madness Men’s Basketball tour-
nament [2]. In [10], Pasteur extended the Colley Method by including weights for home and away
games and more recent games to predict college football outcomes. Another linear algebra and
probability-based method involves the use of Markov chains to predict final team rankings [6, 7].
For example, Paul Kvam and Joel Sokol developed a model which used a combination of logistic
regression and Markov chains to rank NCAA Basketball teams [7]. These particular models have
popular applications in sports such as NCAA Basketball and NFL football [8]. This project in-
volves creating and testing the accuracy of different probability and linear-algebra-based models
from sports data, but techniques from this research can translate to broader scenarios which involve
ranking or predictive modeling.

1.2   Research Aims and Goals
Each linear algebra-based method has different strengths and weaknesses and one model may
be better suited for a particular sport or data set. This is not surprising. In [1], Barrow et al.
evaluated eight different predictive sports ranking methods and found that the predictive power
of these methods differed depending on the data set and on the type of scoring and situation
for the different sports leagues. Thus, part of this research is determining which method is better
suited for the particular data set and research questions. Furthermore, we can improve our
linear algebra-based models by introducing weights or adding features to these systems. As such,
this research involves creating different weighting systems which will extend both the Colley and
Massey Methods and test their ability to predict future outcomes using sports data. Because we
will be dealing with large data sets, we will use MATLAB or Python to help with these large
matrix systems. The student may also need to use diagonalization or other matrix decomposition
methods if necessary to solve these systems or may want to incorporate some artificial intelligence
or machine learning techniques into their code (depending on their background). Finally, since we
are creating predictive models, it is important to incorporate some form of cross-validation or way
to test our model. Different projects have used different cross validation methods. For example,
one project split their data into a training set, cross-validation set, and testing set. Another used
a nested cross validation method since his project had temporal data.
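
To make the idea of validating on temporal data concrete, here is a minimal Python sketch of a forward-chaining split, one common way to respect the order of games. It is only an illustration: the function name, the fold sizes, and the assumption that games are already sorted by date are ours, and it is not the exact scheme used in the past projects. In a nested scheme, a loop like this would typically be run again inside each training set to tune weights.

```python
from typing import Iterator, List, Tuple


def forward_chaining_splits(
    n_games: int, n_folds: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists for games sorted by date.

    Each fold trains on every game before a cutoff and validates on the
    next block, so a model is never scored on games that precede its
    training data. (Hypothetical helper, for illustration only.)
    """
    block = n_games // (n_folds + 1)
    if block == 0:
        raise ValueError("need at least n_folds + 1 games")
    for k in range(1, n_folds + 1):
        cutoff = k * block
        # The last fold absorbs any leftover games at the end of the season.
        end = n_games if k == n_folds else cutoff + block
        yield list(range(cutoff)), list(range(cutoff, end))


splits = list(forward_chaining_splits(n_games=10, n_folds=4))
for train, val in splits:
    print(len(train), "training games, validate on games", val)
```

Each later fold sees more of the season, which mirrors how a rating would actually be used to predict upcoming games.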

1.3     Proposed Research Plan
1.4     Preliminary Results and Past Projects by Lewis Students:
I have worked with 12 Lewis students on this research. During the spring of 2017, I worked
with two Lewis golfers, Hannah Schultz and Rachel Sweeney, to extend the Colley and Massey
Methods of sports rating to women’s golf teams in the Great Lakes Valley Conference. Their
initial results found that some of the stats weighted in basketball and football don’t have the
same impact in predicting golf results (supporting the findings of Barrow et al. [1]). For example, whereas
winning on the road may be an important stat to weight in basketball, it isn’t as much of
a predictor for the GLVC golf matches since the teams play on the same courses. We suspect
that each sport will have its own set of stats that are more predictive of its game outcomes
than the stats that matter in golf, basketball, or other sports. For example, many softball games are played
as double headers. Perhaps winning the second game of the double header is more indicative
of success in the end of the season tournaments? We asked this question and more during my
2017 SURE project (Lewis’ Summer Undergraduate Research Experience). During SURE 2017,
Carley Maupin explored whether past and present seasonal results would transfer to the post-season
conference tournament. During the 2017 fall semester, three student researchers, Austin Buente,
Hannah Schultz, and Marissa Koronkiewicz, tested the predictive power of using a weighted Massey
Method to predict golf tournament results from the Great Lakes Valley Conference. Specifically,
they tested whether or not incorporating the “dropped score” for each team is a better predictor
for future tournaments. They also explored different weights on the individual golf scores. These
students presented their research at the 2018 Joint Mathematics Meetings. Last spring, I also had
two student researchers, Adrian Siwy and Brandon Joutras, incorporate some artificial intelligence
methods to the systems made by Buente, Schultz, and Koronkiewicz. During the summer of
2018, my SURE student, Kevin Gannon, continued the work done with the 2017 golf project.
Gannon added more data, incorporated a nested cross-validation scheme, and tested other features
beyond individual scores like round variance, course length, and whether it was a conference game.
In particular, he found that incorporating course length improved the basic predictability of the
unweighted Massey Method. Marco Pettinato continued working on Maupin’s softball project by
testing additional factors, adding more data (including individual statistics), and incorporating
principal component analysis into the model. He found that our softball model improved when using
PCA compared with basic feature extraction. This past summer, Megan Vesta, Sheila
Leisak, and Maria Del Corral created a Markov-Chain Model to explore whether win streaks helped
predict end of the season rankings for NCAA Division I baseball.

1.4.1    Experimental Design and Methodology
Below we outline some of the linear algebra-based methods often used in this research. Success in
this research is determined by comparing our model’s outcomes with actual seasonal data. What
is nice about this research is that a negative answer is still informative (for example, showing that
something does not seem to be predictive). Another nice aspect is that students have access to the
typical software used in this research.

1.4.1.1   The Colley Method
The Colley Method was developed by Dr. Wesley Colley. This method applies Laplace’s Rule of
Succession so that an untried team has a win percentage of 50% instead of the undefined 0/0 which
untried teams have with regular win percentage. That is, instead of using the regular win percentage
$\frac{w_i}{t_i}$, where $w_i$ is the total number of wins for Team $i$ and $t_i$ is the total number of games played
by Team $i$, the Colley win percentage is $\frac{w_i + 1}{t_i + 2}$. Dr. Colley also incorporates strength of schedule
into his rating system by approximating half the number of games played by the sum of the ratings of
the teams played [3].

To set up his system, Colley sets the rating for each team, $r_i$, equal to its Colley win percentage.
That is, $r_i = \frac{w_i + 1}{t_i + 2}$. In order to incorporate strength of schedule, Colley does the following
algebraic manipulation (note $l_i$ denotes the total number of losses for Team $i$) to get the following
relationship:
$$(2 + t_i) r_i = 1 + \frac{w_i - l_i}{2} + \frac{w_i + l_i}{2} \tag{1}$$
Note that $\frac{w_i + l_i}{2}$ is just the total number of games played by Team $i$ divided by 2. This is where
Colley made an adjustment to his system in order to incorporate strength of schedule. He does
this by replacing $\frac{w_i + l_i}{2}$ with the sum of the ratings of the teams played by Team $i$. Thus our new
Colley system has the following set up. For each Team $i$, we get the following equation:
$$(2 + t_i) r_i = 1 + \frac{w_i - l_i}{2} + S, \tag{2}$$
where $S$ is the sum of the ratings of teams played by Team $i$.

These equations form a nice symmetric matrix system, Cr = b. After solving this system, we order
the team rankings by the value of the ratings with the largest rating corresponding with the top
ranked team.
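
As an illustration of how equation (2) becomes the matrix system Cr = b, the following Python sketch builds C and b from a list of game results and solves for the ratings. It is a minimal sketch under our own assumptions: teams are numbered from 0, each game is a (winner, loser) pair with no ties, and the three-team season is made up for the example. Moving S to the left side of equation (2) puts 2 + t_i on the diagonal of C and subtracts 1 from the off-diagonal entry for every meeting between two teams.

```python
import numpy as np


def colley_ratings(n_teams, games):
    """Solve the Colley system C r = b.

    `games` is a list of (winner, loser) team-index pairs (no ties).
    Hypothetical helper written for this illustration.
    """
    C = 2.0 * np.eye(n_teams)  # the "2" in (2 + t_i)
    b = np.ones(n_teams)       # the "1" in 1 + (w_i - l_i) / 2
    for winner, loser in games:
        C[winner, winner] += 1.0  # t_i grows by one for both teams
        C[loser, loser] += 1.0
        C[winner, loser] -= 1.0   # opponent's rating (a term of S) moved to the left side
        C[loser, winner] -= 1.0
        b[winner] += 0.5          # (w_i - l_i) / 2 goes up for the winner
        b[loser] -= 0.5           # and down for the loser
    return np.linalg.solve(C, b)


# Toy season with three made-up teams: 0 beat 1, 1 beat 2, 0 beat 2.
ratings = colley_ratings(3, [(0, 1), (1, 2), (0, 2)])
ranking = np.argsort(-ratings)  # team indices, best first
print(ratings, ranking)
```

In this toy season the undefeated team receives the highest rating and the ratings average to 0.5, which is the rating every team starts with before any games are played.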

1.4.1.2   The Massey Method
Note that Colley’s method doesn’t incorporate scores into its system. On the other hand, Kenneth
Massey’s method of sports rating creates a system of equations which is based on the score of the
games played. Massey originally created this method for ranking college football teams [11, 8].
Using the Massey Method, for each game, the difference in the ratings correlates to the point
differential [8]. This is represented as a system: $r_i - r_j = b_k$, where $r_i$ is the rating for team $i$,
$r_j$ is the rating for team $j$, and $b_k$ is the score differential for game $k$. Note also that it is easy to
incorporate a cap on the score differential to curb the impact of blowouts. Each
row of our system of equations represents a particular game from the data set. Our columns
represent the different teams being ranked. This usually means in practice that our system has
more rows than columns. Furthermore, this system, $Xr = b$, will often be inconsistent (rarely does
a team beat a team twice by the same score). Therefore the
Massey Method uses the method of least squares to solve its system [9]. To find a least squares
solution, we solve the normal equation which is created by multiplying both sides of our original
equation $Xr = b$ by the transpose of $X$ [2]. The transpose matrix, denoted $X^T$, is a matrix whose
rows are the columns of the original matrix $X$. Thus we now solve the system $X^T X r = X^T b$.
This system has multiple least squares solutions [2]: every row of $X$ contains one $1$ and one $-1$,
so adding the same constant to every rating leaves $Xr$ unchanged.
To fix this, Massey needed to force his system to have a unique solution. He did this by replacing
one of the rows with an equation which was not in the span of the other rows of the system (that is,
not a linear combination of the other rows). He could replace any row, but for consistency, Massey
replaces the last row of the system with $\sum_{i=1}^{m} r_i = 0$, where $m$ is the number of teams [2, 4]. This
ultimately means that the ratings of the teams should add up to 0. As with Colley, after we solve our system of equations,
we put the ratings in order from largest to smallest in order to create our ranking of the teams.
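
The steps above can be sketched in Python as follows. This is a minimal illustration rather than the code used in past projects: the function name, the (team_i, team_j, margin) game format, and the three made-up games are our own assumptions, and the schedule is assumed to be connected (every team is linked to every other through a chain of games) so that the adjusted system has a unique solution.

```python
import numpy as np


def massey_ratings(n_teams, games):
    """Least squares Massey ratings.

    `games` is a list of (team_i, team_j, margin) with
    margin = score_i - score_j. Hypothetical helper for illustration.
    """
    X = np.zeros((len(games), n_teams))
    b = np.zeros(len(games))
    for k, (i, j, margin) in enumerate(games):
        X[k, i] = 1.0   # row k encodes r_i - r_j = b_k
        X[k, j] = -1.0
        b[k] = margin
    M = X.T @ X         # normal equations: (X^T X) r = X^T b
    p = X.T @ b
    M[-1, :] = 1.0      # replace the last row with sum of ratings = 0
    p[-1] = 0.0
    return np.linalg.solve(M, p)


# Made-up results: 0 beat 1 by 7, 1 beat 2 by 3, 0 beat 2 by 4.
games = [(0, 1, 7), (1, 2, 3), (0, 2, 4)]
ratings = massey_ratings(3, games)
ranking = np.argsort(-ratings)  # team indices, best first
print(ratings, ranking)
```

A cap on blowouts could be added by clipping `margin` before it is stored in `b`.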

1.4.1.3   Weighted Colley and Massey Models
While at their most basic level, Colley and Massey Methods are useful for sports rankings, they are
not particularly strong at predicting future outcomes of games. One way to improve these methods
for ranking and predicting future outcomes is by introducing weights to these systems [2, 4]. For
example, we can weight a win achieved on the road higher than winning a home game or weight
wins at the end of the season higher than wins at the start of the season. In this research we usually
incorporate multiple weights and multiple factors and then use cross validation to help us improve
our models.
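
One common way to introduce weights into the Massey Method is weighted least squares, where each game k receives a weight and we solve $X^T W X r = X^T W b$ with $W$ the diagonal matrix of game weights. The Python sketch below illustrates this under our own assumptions: the function name, the game format, and the particular weights (the last game counted twice as much as the others, as if it were the most recent) are hypothetical choices for the example, not the weighting systems developed in past projects.

```python
import numpy as np


def weighted_massey_ratings(n_teams, games, weights):
    """Massey ratings from the weighted normal equations.

    `games` is a list of (team_i, team_j, margin) with
    margin = score_i - score_j, and `weights` holds one nonnegative
    weight per game. Hypothetical helper for illustration.
    """
    X = np.zeros((len(games), n_teams))
    b = np.zeros(len(games))
    for k, (i, j, margin) in enumerate(games):
        X[k, i], X[k, j], b[k] = 1.0, -1.0, margin
    W = np.diag(weights)
    M = X.T @ W @ X     # weighted normal equations
    p = X.T @ W @ b
    M[-1, :] = 1.0      # same fix as the unweighted method: ratings sum to 0
    p[-1] = 0.0
    return np.linalg.solve(M, p)


games = [(0, 1, 7), (1, 2, 3), (0, 2, 4)]
uniform = weighted_massey_ratings(3, games, [1.0, 1.0, 1.0])
recent = weighted_massey_ratings(3, games, [1.0, 1.0, 2.0])
print(uniform, recent)
```

With all weights equal to 1 the result matches the unweighted Massey ratings, so candidate weightings can be compared against that baseline during cross validation.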

1.4.1.4   Markov Chains
In addition to Colley and Massey Methods, we have also used Markov chain-based models. This
past summer, my students and I developed a Markov chain model based on Paul Kvam and
Joel Sokol’s model in [7]. The goal of Kvam and Sokol’s model was to predict $r_x^H$, the probability that
Team A is better than Team B, given Team A beat Team B by $x$ points at A. In order to
estimate this value, they calculated $s_x^H$, the probability that Team A beat B at B given that Team
A beat Team B at A by $x$ points [7]. This was then used to set up the Markov chain transition
matrix using the following equations:
$$t_{ij} = \frac{1}{N_i}\left[\sum_{g=(i,j)} \left(1 - r^R_{x(g)}\right) + \sum_{g=(j,i)} \left(1 - r^H_{-x(g)}\right)\right], \quad \text{for all } j \neq i,$$
$$t_{ii} = \frac{1}{N_i}\left[\sum_{j}\sum_{g=(i,j)} r^R_{x(g)} + \sum_{j}\sum_{g=(j,i)} r^H_{-x(g)}\right].$$
In the above equations, we let $r_x^R = 1 - r_{-x}^H$ be the probability a team that wins on the road by
$x$ points is better than its opponent. A negative value of $x$ signifies that the team lost the game.
Furthermore, an ordered pair $(i, j)$ is used to denote a game where the visiting team is listed first.
Finally, $x(g)$ is the difference between the home team’s score and the visiting team’s score in any
game $g$. We then use the steady state vector of this transition matrix to predict the final ranking
of teams from our data set. Similarly to the Colley and Massey Models, the team with the highest
steady state probability is ranked first, the team with the second highest probability is ranked
second, etc.
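
To illustrate the last step, the Python sketch below finds the steady state vector of a transition matrix by power iteration and ranks the teams by it. The sketch makes several assumptions of its own: the 3 x 3 matrix T is a made-up example rather than one built from the probabilities above, its rows are taken to sum to 1 (so entry (i, j) is the probability of moving from team i to team j), and the function name and tolerance are hypothetical.

```python
import numpy as np


def steady_state(T, tol=1e-12, max_iter=10_000):
    """Approximate the vector pi with pi = pi T by power iteration.

    Assumes T is row-stochastic and the chain has a unique steady state.
    """
    n = T.shape[0]
    pi = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = pi @ T
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi  # best estimate if the tolerance was not reached


# Made-up transition matrix for three teams (each row sums to 1).
T = np.array([
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.4, 0.3, 0.3],
])
pi = steady_state(T)
ranking = np.argsort(-pi)  # team indices, highest steady state probability first
print(pi, ranking)
```

For larger data sets the same vector could also be obtained from an eigenvector routine, since it is a left eigenvector of T with eigenvalue 1.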

This model can be modified and adjusted to answer a variety of questions. One question we
recently explored was whether or not win streaks would help predict end of the season standings in
men’s college baseball. We adjusted Kvam and Sokol’s model to incorporate the probability that
Team A will win its next game, given Team A is on a win streak of x games. After finding the
steady state vector of this new transition matrix, we found that four teams were correctly predicted
to be in the top five teams by this linear model. We are planning to add more data and features
to this model in the future.

1.5     Mentorship Plan
Since this research is an ongoing project, there are several options students can pursue to further
this research. My past student researchers have been applying these methods to NCAA Division
I or II data from softball, baseball, or golf teams and NHL Hockey. Students working on this
project can continue the work from previous student projects or they can pick their own sports
team data and create and test their own weights and research questions about the data. In both
cases, students can build off the previous work done by students from past years and adjust it to
their project. Having the researchers review past student projects and research other sports ranking
models helps them decide on their own research questions and goals. It also provides nice sample
toy-problems which students can use to help them learn techniques which will be used in their own
projects. Sometimes, I will also create toy examples to help students understand methods they
have read about or seen in past research. To help keep the students engaged in the research, the
students will meet at least three times a week with Dr. Harsy in addition to attending the SURE
seminars and the weekly Math Research Seminars. During the Math Research Seminars, students
will present their current progress on their research to the other students and faculty working on
math research. This allows them to practice presenting and gives them a chance to get feedback
from other mathematicians. Most students present bi-weekly. Students will also be expected to
keep a research journal and complete a final write-up and research report/paper by the end of the
summer. The schedules below provide details and timelines for the students and will help keep them
accountable and organized with completing their work. Many of the students who have worked on
this research have presented their results at local, regional, or national mathematics conferences
and I would expect the students in this project to present at similar conferences over the following
year.

1.6     Proposed Timeline with Project Goals
1.6.1    General Timeline
Weeks 1-2: Review past student projects and read their papers. Complete a Lit Review. Pick
Specific Research Questions and Data Set. Begin to collect and clean data. If appropriate, create
some toy data sets which can be used to practice using models from past projects/other papers.

Weeks 3-5: Adjust previously created code or create their own code for their specific project and
research questions.

Weeks 6-8: Introduce and test weights/features for rating methods

Week 9: Finalize and Summarize Results, attend and present at the MAA MathFest Conference.

Week 10: Finish main draft of paper and prepare for SURE presentation

1.6.2   Timeline for Student Write-Up/Paper Task List:
Note this is just a task list and the student should not obsess about any of these tasks. None need
to be perfect, all will evolve.

Week 1: Do a literature search for the current research in Sports Analytics, specifically using linear
algebra methods like Colley and Massey or Markov Chains. Complete a Lit Review of past student
projects. Complete a typed 1-3 paragraph synopsis of what might be useful in at least 5 of the
papers/books/articles found.

Week 2: Work with Dr. Harsy to decide on the focus of their research and the point of their
paper. At this point they should have a title page and the TeX template set up. This is just a
working title and can be changed.

Week 3: Make a concept map of the ideas for the paper and how they interconnect. Decide how
to thread your storyline through this map.

Week 4: First draft of definitions and background terminology and notation, three examples, or
other foundation materials should be added to the report.

Week 5: Complete further work on definitions and outline of content. Start including examples
or data or initial results.

Week 6: Write a draft of the introduction. This will incorporate material from the paragraphs
you wrote during the literature search. Continue adding supporting materials. Start working on any
figures you want to add to the paper.

Week 7: Draft a conclusion and typeset. Insert figures, review expository structure. If possible,
write a 15 min talk with slides, and use what you learn from the talk to improve the organization,
figures, and exposition of the paper.

Week 8: Do an overall editing pass to complete the first draft. Continue to add further results if
needed.

Week 9: Ask someone such as Dr. Harsy, Dr. Stephenson, Dr. Meyer, or Dr. Szczurek to give you
feedback. Having a reader from outside of your project is often helpful at this stage. Finalize SURE
presentation.

Week 10: Edit the second draft. Put it away for at least a week, edit again (have someone give
feedback again). Do final editing and then possibly submit to a journal.

References
 [1] Daniel Barrow, Ian Drayer, Peter Elliott, Garren Gaut, and Braxton Osting. Ranking rank-
     ings: an empirical comparison of the predictive power of sports ranking methods. Journal of
     Quantitative Analysis in Sports, 9(2):187–202, 2013.

 [2] Tim Chartier, Erich Kreutzer, Amy Langville, and Kathryn Pedings. Bracketology: How can
     math help? Mathematics and Sports, 43(67):55–70, 2010.

 [3] W Colley. Colley’s bias free college football ranking method, 2002.

 [4] Joseph A Gallian. Mathematics and sports. Number 43. MAA, 2010.

 [5] Anjela Y Govan, Amy N Langville, and Carl D Meyer. Offense-defense approach to ranking
     team sports. Journal of Quantitative Analysis in Sports, 5(1), 2009.

 [6] Jason Kolbush and Joel Sokol. A logistic regression/Markov chain model for American college
     football. International Journal of Computer Science in Sport, 16(3):185–196, 2017.

 [7] Paul Kvam and Joel S Sokol. A logistic regression/Markov chain model for NCAA basketball.
     Naval Research Logistics (NRL), 53(8):788–803, 2006.

 [8] Amy N Langville and Carl D Meyer. Who’s #1?: The science of rating and ranking. Princeton
     University Press, 2012.

 [9] Kenneth Massey. Statistical models applied to the rating of sports teams. Bluefield College,
     1997.

[10] R Drew Pasteur. Extending the Colley method to generate predictive football rankings.
     Mathematics and Sports, pages pp–117, 2010.

[11] James E Salzman and JB Ruhl. Who’s number one? 2009.

[12] Baback Vaziri, Shaunak Dabadghao, Yuehwern Yih, and Thomas L Morin. Properties of sports
     ranking methods. Journal of the Operational Research Society, 69(5):776–787, 2018.

2    Budget
Students may need to have MATLAB added to their personal computers which hasn’t been a
problem in the past few years of doing this research since Lewis has university licenses. Thus, I am
requesting no funds for supplies.

3    Description of any additional Funding you will be using for your
     proposed research and how they will be used in this project.
I have no additional funding for this research.

4    Criteria for Student Applicants
Since this research involves linear algebra, programming, and some data analytics, it is appropriate
for students who have taken Linear Algebra and have some programming background. This project
may especially be of interest to students majoring in mathematics, data science, and/or computer
science, but is open to all majors.

5    SURE Seminars
I am interested in Presentation Skills, Preparing for Graduate School, Literature Search and Library
Resources, Resume Writing and Marketing YOU, and Interview Skills. I would also be willing to
do a LaTeX seminar.

6    SURE Agreement
The James Girard Summer Undergraduate Research Program is designed to support the execution
of this proposed project by the faculty mentor and a single undergraduate student. After review
of faculty proposals, selected projects will be advertised to Lewis University students, and all in-
terested undergraduates will then be required to apply into the program, denoting the project for
which they would like to be considered. Student applications will be reviewed for completeness by
the program director, and then forwarded to the appropriate faculty mentor for final selection of a
candidate. Faculty may submit up to 2 projects for funding through the program. Although faculty
mentors may also mentor additional students in the summer not funded through the program, the
weekly program events and presentations will be exclusive for students in the program.

By submitting this application, you are agreeing to the following responsibilities of a S.U.R.E.
Faculty Mentor:

    • Working closely with your student to ensure a worthwhile educational experience. Regular
      interactions with your student (a minimum of once a week, but more frequently is encouraged)
      are an expectation. Interaction with other mentors and students is strongly encouraged.

    • Participating in the welcome and orientation day.

    • Leading at least one of the weekly workshops for the entire group of S.U.R.E. participants.

    • Writing at least one blog related to your area of expertise for the program website.

    • Participating in the Summer Research Symposium.

This application will be reviewed by a faculty panel for acceptance into the program – determi-
nation of selected projects will be communicated after review. Project descriptions will then be
made available to Lewis University undergraduate students, who can apply to the program and
specific projects online via our website. Student applicants will be matched with mentors using a
selection process where mentors rank interested students based on their applications and students
rank projects based on their interests.

Any questions and all completed applications should be sent to Brittany Stephenson (S.U.R.E.
Director) at bstephenson@lewisu.edu
