Lewis University Dr. James Girard Summer Undergraduate Research Program 2022 Faculty Mentor - Project Application: Predictive Modeling and ...
Lewis University Dr. James Girard Summer Undergraduate Research Program 2022
Faculty Mentor - Project Application: Predictive Modeling and Analysis of Sports Using Linear Algebra-based Models
Dr. Amanda Harsy
Department of Engineering, Computing, and Mathematics Sciences
January 7, 2022

Abstract
Ranking sports teams and predicting post-season results from seasonal games can be challenging. Among the many mathematically inspired sports ranking systems, linear algebra-based methods like Colley, Massey, and Markov Chain models are relatively simple and can easily be introduced to undergraduate students who have taken a linear algebra course. At their most basic level, these methods are useful for sports rankings, but unfortunately, they are not particularly strong at predicting future outcomes of games. One way to possibly improve these methods for ranking and predicting future outcomes is by introducing weights or features to these systems and by using cross validation to help determine the quality of our models. In this research, we will create and test the predictive power of linear algebra-based models using sports data of the student’s choosing.
1 Project Description

1.1 Introduction and Background
Ranking sports teams can be a challenging task, and using straight win percentage can be misleading at times. For example, a team that wins 9/10 games will have the same win percentage as a team that wins 90/100. Sports analytics has turned to mathematics to help give more accurate sports rankings. There are many methods for ranking sports teams [1, 4, 5, 12]. Rankings from sites like www.mratings.com, www.thepredictiontracker, www.bcsfootball.org, www.masseyratings.com, and other algorithmic ranking systems are published online. Among the many mathematically inspired sports ranking systems, linear algebra-based methods are among the most elegantly simple. A variety of methods are used, and some depend on complex mathematical or statistical tools, but weighted Colley and Massey Methods and Markov Chain Models are very accessible for undergraduates who have taken linear algebra. Other mathematicians have explored the Colley and Massey methods in basketball and football. For example, Dr. Tim Chartier of Davidson College has used both the Colley and Massey Methods to predict the NCAA March Madness Men’s Basketball tournament [2]. In [10], Pasteur extended the Colley Method by including weights for home and away games and for more recent games to predict college football outcomes. Another linear algebra and probability-based method involves the use of Markov chains to predict final team rankings [6, 7]. For example, Paul Kvam and Joel Sokol developed a model which used a combination of logistic regression and Markov chains to rank NCAA Basketball teams [7]. These particular models have popular applications in sports such as NCAA Basketball and NFL football [8]. This project involves creating and testing the accuracy of different probability and linear algebra-based models from sports data, but techniques from this research can translate to broader scenarios which involve ranking or predictive modeling.
1.2 Research Aims and Goals
Each linear algebra-based method has different strengths and weaknesses, and one model may be better suited for a particular sport or data set. This is not surprising. In [1], Barrow et al. evaluated eight different predictive sports ranking methods and found that the predictive power of these methods differed depending on the data set and on the type of scoring and situation for the different sports leagues. Thus, part of this research is determining which method is better suited for the particular data set and research questions. Furthermore, we can improve our linear algebra-based models by introducing weights or adding features to these systems. As such, this research involves creating different weighting systems which extend both the Colley and Massey Methods and testing their ability to predict future outcomes using sports data. Because we will be dealing with large data sets, we will use MATLAB or Python to help with these large matrix systems. The student may also need to use diagonalization or other matrix decomposition methods to solve these systems, or may want to incorporate some artificial intelligence or machine learning techniques into their code (depending on their background). Finally, since we are creating predictive models, it is important to incorporate some form of cross-validation or another way to test our model. Different projects have used different cross-validation methods. For example, one project split its data into a training set, a cross-validation set, and a testing set. Another used a nested cross-validation method since the project had temporal data.
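As a rough illustration of how temporal data might be split for validation, the sketch below generates forward-chaining splits, where each fold trains only on games played before the games it is tested on. This is an assumption about what such a scheme could look like, not the scheme used in any past project; the function name `forward_chaining_splits` and the fold sizes are hypothetical.

```python
def forward_chaining_splits(n_games, n_folds):
    """Yield (train_indices, test_indices) for games sorted by date.

    Each fold trains on every game before a cutoff and tests on the
    next block of games, so no future result leaks into training.
    """
    fold_size = n_games // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough games for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover games.
        test_end = n_games if k == n_folds else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


# Hypothetical usage: 10 games in chronological order, 4 folds.
splits = list(forward_chaining_splits(n_games=10, n_folds=4))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

A model's weights could then be tuned on each training block and scored on the matching test block, with the scores averaged across folds.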
1.3 Proposed Research Plan

1.4 Preliminary Results and Past Projects by Lewis Students
I have worked with 12 Lewis students on this research. During the spring of 2017, I worked with two Lewis golfers, Hannah Schultz and Rachel Sweeney, to extend the Colley and Massey Methods of sports rating to women’s golf teams in the Great Lakes Valley Conference (GLVC). Their initial results found that some of the stats weighted in basketball and football don’t have the same impact in predicting golf results (supporting the findings of Barrow et al. [1]). For example, whereas winning on the road may be an important stat to weight in basketball, it is not as much of a predictor for the GLVC golf matches since the teams play on the same courses. We suspect that each sport will have its own set of stats that are most predictive of game outcomes. For example, many softball games are played as double headers. Perhaps winning the second game of the double header is more indicative of success in the end-of-season tournaments? We asked this question and more during my 2017 SURE project (Lewis’ Summer Undergraduate Research Experience). During SURE 2017, Carley Maupin explored whether past and present seasonal results would transfer to the post-season conference tournament. During the 2017 fall semester, three student researchers, Austin Buente, Hannah Schultz, and Marissa Koronkiewicz, tested the predictive power of using a weighted Massey Method to predict golf tournament results from the Great Lakes Valley Conference. Specifically, they tested whether or not incorporating the “dropped score” for each team is a better predictor for future tournaments. They also explored different weights on the individual golf scores. These students presented their research at the 2018 Joint Mathematics Meetings.
Last spring, I also had two student researchers, Adrian Siwy and Brandon Joutras, incorporate some artificial intelligence methods into the systems made by Buente, Schultz, and Koronkiewicz. During the summer of 2018, my SURE student, Kevin Gannon, continued the work done with the 2017 golf project. Gannon added more data, incorporated a nested cross-validation scheme, and tested other features beyond individual scores, like round variance, course length, and whether it was a conference game. In particular, he found that incorporating course length improved the basic predictability of the unweighted Massey Method. Marco Pettinato continued working on Maupin’s softball project by testing additional factors, adding more data (including individual statistics), and incorporating principal component analysis (PCA) into the model. He found that our softball model improved when using PCA compared with using basic feature extraction. This past summer, Megan Vesta, Sheila Leisak, and Maria Del Corral created a Markov Chain Model to explore whether win streaks helped predict end-of-season rankings for NCAA Division I baseball.

1.4.1 Experimental Design and Methodology
Below we outline some of the linear algebra-based methods often used in this research. Success in this research is determined by comparing our model’s outcomes with actual seasonal data. What is nice about this research is that a negative answer is still informative (for example, showing that something does not seem to be predictive). Another nice aspect is that students have access to the typical software used in this research.
1.4.1.1 The Colley Method
The Colley Method was developed by Dr. Wesley Colley. This method applies Laplace’s Rule of Succession so that an untried team has a win percentage of 50% instead of the undefined 0/0 which untried teams have with regular win percentage. That is, instead of using the regular win percentage $\frac{w_i}{t_i}$, where $w_i$ is the total number of wins for Team $i$ and $t_i$ is the total number of games played by Team $i$, the Colley win percentage is $\frac{w_i + 1}{t_i + 2}$. Dr. Colley also incorporates strength of schedule into his rating system by approximating half the number of games played by the sum of the ratings of the teams played [3].

To set up his system, Colley sets the rating for each team, $r_i$, equal to its Colley win percentage. That is, $r_i = \frac{w_i + 1}{t_i + 2}$. In order to incorporate strength of schedule, Colley does the following algebraic manipulation (note $l_i$ denotes the total number of losses for Team $i$) to get the following relationship:

\[ (2 + t_i)\, r_i = 1 + \frac{w_i - l_i}{2} + \frac{w_i + l_i}{2} \tag{1} \]

Note that $\frac{w_i + l_i}{2}$ is just the total number of games played by Team $i$ divided by 2. This is where Colley made an adjustment to his system in order to incorporate strength of schedule. He does this by replacing $\frac{w_i + l_i}{2}$ with the sum of the ratings of the teams played by Team $i$. Thus our new Colley system has the following setup. For each Team $i$, we get the following equation:

\[ (2 + t_i)\, r_i = 1 + \frac{w_i - l_i}{2} + S, \tag{2} \]

where $S$ is the sum of the ratings of the teams played by Team $i$. These equations form a nice symmetric matrix system, $Cr = b$. After solving this system, we order the teams by the value of their ratings, with the largest rating corresponding to the top-ranked team.

1.4.1.2 The Massey Method
Note that Colley’s method doesn’t incorporate scores into its system. On the other hand, Kenneth Massey’s method of sports rating creates a system of equations which is based on the scores of the games played. Massey originally created this method for ranking college football teams [11, 8].
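To make the Colley system $Cr = b$ from Section 1.4.1.1 concrete, here is a minimal Python sketch on a made-up three-team schedule. Moving $S$ in equation (2) to the left-hand side gives diagonal entries $2 + t_i$ and off-diagonal entries equal to minus the number of games between Teams $i$ and $j$. The function name `colley_ratings` and the toy results are assumptions for illustration only, not code from past projects.

```python
import numpy as np


def colley_ratings(n_teams, games):
    """Solve the Colley system C r = b.

    games is a list of (winner, loser) pairs of team indices.
    """
    C = 2.0 * np.eye(n_teams)   # the "2 +" part of each diagonal entry
    b = np.ones(n_teams)        # the leading 1 on each right-hand side
    for winner, loser in games:
        C[winner, winner] += 1  # each game adds 1 to t_i for both teams
        C[loser, loser] += 1
        C[winner, loser] -= 1   # opponent ratings moved to the left side
        C[loser, winner] -= 1
        b[winner] += 0.5        # (w_i - l_i) / 2 goes up for the winner
        b[loser] -= 0.5         # and down for the loser
    return np.linalg.solve(C, b)


# Toy schedule (hypothetical): Team 0 beats Teams 1 and 2, Team 1 beats Team 2.
games = [(0, 1), (0, 2), (1, 2)]
ratings = colley_ratings(3, games)
ranking = np.argsort(-ratings).tolist()  # largest rating is ranked first
print(ratings, ranking)
```

Weights of the kind discussed later in this proposal could be introduced by replacing the increments of 1 and 0.5 with per-game weights, though that variation is not shown here.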
Using the Massey Method, for each game, the difference in the ratings correlates to the point differential [8]. This is represented as a system of equations of the form $r_i - r_j = b_k$, where $r_i$ is the rating for Team $i$, $r_j$ is the rating for Team $j$, and $b_k$ is the score differential for game $k$. Note also that it is easy to incorporate a cap on the score differential to curb the impact of blowouts. Each row of our system of equations represents a particular game from the data set. Our columns represent the different teams being ranked. This usually means in practice that our system has more rows than columns. Furthermore, this system, $Xr = b$, will often be inconsistent (rarely does a team beat another team twice by the same score, and note that this is true in the example). Therefore the Massey Method uses the method of least squares to solve its system [9]. To find a least squares solution, we solve the normal equation, which is created by multiplying both sides of our original equation $Xr = b$ by the transpose of $X$ [2]. The transpose matrix, denoted $X^T$, is a matrix whose rows are the columns of the original matrix $X$. Thus we now solve the system $X^T X r = X^T b$. Sometimes there are multiple least squares solutions [2], which is the case for the example above. To fix this, Massey needed to force his system to have a unique solution. He did this by replacing one of the rows with an equation which was not in the span of the other rows of the system (that is, not a linear combination of the other rows). He could replace any row, but for consistency, Massey replaces the last row of the system with $\sum_{i=1}^{m} r_i = 0$ [2, 4]. This ultimately means that the ratings of the teams should add up to 0. As with Colley, after we solve our system of equations, we put the ratings in order from largest to smallest in order to create our ranking of the teams.

1.4.1.3 Weighted Colley and Massey Models
While at their most basic level Colley and Massey Methods are useful for sports rankings, they are not particularly strong at predicting future outcomes of games. One way to improve these methods for ranking and predicting future outcomes is by introducing weights to these systems [2, 4]. For example, we can weight a win achieved on the road higher than a win at home, or weight wins at the end of the season higher than wins at the start of the season. In this research we usually incorporate multiple weights and multiple factors and then use cross validation to help us improve our models.

1.4.1.4 Markov Chains
In addition to Colley and Massey Methods, we have also used Markov chain-based models. This past summer, my students and I developed a Markov chain model based on Paul Kvam and Joel Sokol’s model in [7]. The goal of Kvam and Sokol’s model was to predict $r_x^H$, the probability that Team A is better than Team B, given that Team A beat Team B by $x$ points at A.
In order to estimate this value, they calculated $s_x^H$, the probability that Team A beats Team B at B given that Team A beat Team B at A by $x$ points [7]. This was then used to set up the Markov chain transition matrix using the following equations:

\[ t_{ij} = \frac{1}{N_i} \left[ \sum_{g=(i,j)} \left(1 - r^R_{x(g)}\right) + \sum_{g=(j,i)} \left(1 - r^H_{-x(g)}\right) \right], \quad \text{for all } j \neq i, \]

\[ t_{ii} = \frac{1}{N_i} \left[ \sum_{j} \sum_{g=(i,j)} r^R_{x(g)} + \sum_{j} \sum_{g=(j,i)} r^H_{-x(g)} \right]. \]

In the above equations, $N_i$ is the number of games played by Team $i$, and we let $r_x^R = 1 - r_{-x}^H$ be the probability that a team that wins on the road by $x$ points is better than its opponent. A negative value of $x$ signifies that the team lost the game. Furthermore, an ordered pair $(i, j)$ is used to denote a game where the visiting team is listed first. Finally, $x(g)$ is the difference between the home team’s score and the visiting team’s score in any game $g$. We then use the steady state vector of this transition matrix to predict the final ranking of teams from our data set. Similarly to the Colley and Massey Models, the team with the highest steady state probability is ranked first, the team with the second highest probability is ranked second, etc.
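The last step, ranking by the steady state vector, can be sketched in Python as follows. The transition matrix here is a made-up row-stochastic example, not one built from estimated $r_x^H$ values, and the function name `steady_state` is hypothetical. The sketch solves $\pi T = \pi$ together with $\sum_i \pi_i = 1$ as one linear system; power iteration would be another reasonable choice.

```python
import numpy as np


def steady_state(T):
    """Return the steady state vector pi with pi @ T = pi and sum(pi) = 1.

    T is assumed to be a row-stochastic transition matrix of an
    irreducible chain, so the steady state is unique.
    """
    n = T.shape[0]
    # Stack (T^T - I) pi = 0 with the normalization row sum(pi) = 1.
    A = np.vstack([T.T - np.eye(n), np.ones(n)])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return pi


# Toy 3-team transition matrix (hypothetical values; each row sums to 1).
T = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
pi = steady_state(T)
ranking = np.argsort(-pi).tolist()  # highest steady state probability first
print(pi, ranking)
```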
This model can be modified and adjusted to answer a variety of questions. One question we recently explored was whether or not win streaks would help predict end-of-season standings in men’s college baseball. We adjusted Kvam and Sokol’s model to incorporate the probability that Team A will win its next game, given that Team A is on a win streak of $x$ games. After finding the steady state vector of this new transition matrix, we found that four teams were correctly predicted to be in the top five by this linear model. We are planning to add more data and features to this model in the future.

1.5 Mentorship Plan
Since this research is an ongoing project, there are several options students can pursue to further this research. My past student researchers have been applying these methods to NCAA Division I or II data from softball, baseball, or golf teams and to NHL hockey. Students working on this project can continue the work from previous student projects, or they can pick their own sports team data and create and test their own weights and research questions about the data. In both cases, students can build off the previous work done by students from past years and adjust it to their project. Having the researchers review past student projects and research other sports ranking models helps them decide on their own research questions and goals. It also provides nice sample toy problems which students can use to help them learn techniques which will be used in their own projects. Sometimes, I will also create toy examples to help students understand methods they have read about or seen in past research. To help keep the students engaged in the research, the students will meet at least three times a week with Dr. Harsy in addition to attending the SURE seminars and the weekly Math Research Seminars. During the Math Research Seminars, students will present their current progress on their research to the other students and faculty working on math research.
This allows them to practice presenting and gives them a chance to get feedback from other mathematicians. Most students present bi-weekly. Students will also be expected to keep a research journal and complete a final write-up and research report/paper by the end of the summer. The schedules below provide details and timelines for the students and will help keep them accountable and organized with completing their work. Many of the students who have worked on this research have presented their results at local, regional, or national mathematics conferences, and I would expect the students in this project to present at similar conferences over the following year.

1.6 Proposed Timeline with Project Goals

1.6.1 General Timeline
Weeks 1-2: Review past student projects and read their papers. Complete a Lit Review. Pick Specific Research Questions and Data Set. Begin to collect and clean data. If appropriate, create some toy data sets which can be used to practice using models from past projects/other papers.
Weeks 3-5: Adjust previously created code or create their own code for their specific project and research questions.
Weeks 6-8: Introduce and test weights/features for rating methods.
Week 9: Finalize and Summarize Results, attend and present at the MAA MathFest Conference.
Week 10: Finish main draft of paper and prepare for SURE presentation.

1.6.2 Timeline for Student Write-Up/Paper
Task List: Note this is just a task list and the student should not obsess about any of these tasks. None need to be perfect; all will evolve.
Week 1: Do a literature search for the current research in Sports Analytics, specifically using linear algebra methods like Colley and Massey or Markov Chains. Complete a Lit Review of past student projects. Complete a typed 1-3 paragraph synopsis of what might be useful in at least 5 of the papers/books/articles found.
Week 2: Work with Dr. Harsy to decide on the focus of their research and the point of their paper. At this point they should have a title page and the TeX template set up. This is just a working title and can be changed.
Week 3: Make a concept map of the ideas for the paper and how they interconnect. Decide how to thread your storyline through this map.
Week 4: First draft of definitions and background terminology and notation, three examples, or other foundation materials should be added to the report.
Week 5: Complete further work on definitions and outline of content. Start including examples or data or initial results.
Week 6: Write a draft of the introduction. This will incorporate material from the paragraphs you wrote during the literature search. Continue adding supporting materials. Start working on any figures you want to add to the paper.
Week 7: Draft a conclusion and typeset. Insert figures, review expository structure. If possible, write a 15 min talk with slides, and use what you learn from the talk to improve the organization, figures, and exposition of the paper.
Week 8: Do an overall editing pass to complete the first draft. Continue to add further results if needed.
Week 9: Have someone give you some feedback, like Dr. Harsy, Dr. Stephenson, Dr. Meyer, or Dr. Szczurek. Having someone outside of your project is often helpful at this stage. Finalize SURE presentation.
Week 10: Edit the second draft. Put it away for at least a week, edit again (have someone give feedback again). Do final editing and then possibly submit to a journal.
References
[1] Daniel Barrow, Ian Drayer, Peter Elliott, Garren Gaut, and Braxton Osting. Ranking rankings: an empirical comparison of the predictive power of sports ranking methods. Journal of Quantitative Analysis in Sports, 9(2):187–202, 2013.
[2] Tim Chartier, Erich Kreutzer, Amy Langville, and Kathryn Pedings. Bracketology: How can math help? Mathematics and Sports, 43(67):55–70, 2010.
[3] W Colley. Colley’s bias free college football ranking method, 2002.
[4] Joseph A Gallian. Mathematics and sports. Number 43. MAA, 2010.
[5] Anjela Y Govan, Amy N Langville, and Carl D Meyer. Offense-defense approach to ranking team sports. Journal of Quantitative Analysis in Sports, 5(1), 2009.
[6] Jason Kolbush and Joel Sokol. A logistic regression/Markov chain model for American college football. International Journal of Computer Science in Sport, 16(3):185–196, 2017.
[7] Paul Kvam and Joel S Sokol. A logistic regression/Markov chain model for NCAA basketball. Naval Research Logistics (NRL), 53(8):788–803, 2006.
[8] Amy N Langville and Carl D Meyer. Who’s #1?: the science of rating and ranking. Princeton University Press, 2012.
[9] Kenneth Massey. Statistical models applied to the rating of sports teams. Bluefield College, 1997.
[10] R Drew Pasteur. Extending the Colley method to generate predictive football rankings. Mathematics and Sports, pages pp–117, 2010.
[11] James E Salzman and JB Ruhl. Who’s number one? 2009.
[12] Baback Vaziri, Shaunak Dabadghao, Yuehwern Yih, and Thomas L Morin. Properties of sports ranking methods. Journal of the Operational Research Society, 69(5):776–787, 2018.

2 Budget
Students may need to have MATLAB added to their personal computers, which hasn’t been a problem in the past few years of doing this research since Lewis has university licenses. Thus, I am requesting no funds for supplies.

3 Description of any additional Funding you will be using for your proposed research and how they will be used in this project.
I have no additional funding for this research.
4 Criteria for Student Applicants
Since this research involves linear algebra, programming, and some data analytics, it is appropriate for students who have taken Linear Algebra and have some programming background. This project may especially be of interest to students majoring in mathematics, data science, and/or computer science, but is open to all majors.

5 SURE Seminars
I am interested in Presentation Skills, Preparing for Graduate School, Literature Search and Library Resources, Resume Writing and Marketing YOU, and Interview Skills. I would also be willing to do a LaTeX seminar.

6 SURE Agreement
The James Girard Summer Undergraduate Research Program is designed to support the execution of this proposed project by the faculty mentor and a single undergraduate student. After review of faculty proposals, selected projects will be advertised to Lewis University students, and all interested undergraduates will then be required to apply into the program, denoting the project for which they would like to be considered. Student applications will be reviewed for completeness by the program director, and then forwarded to the appropriate faculty mentor for final selection of a candidate. Faculty may submit up to 2 projects for funding through the program. Although faculty mentors may also mentor additional students in the summer not funded through the program, the weekly program events and presentations will be exclusive for students in the program.

By submitting this application, you are agreeing to the following responsibilities of a S.U.R.E. Faculty Mentor:
• Working closely with your student to ensure a worthwhile educational experience. Regular interactions with your student (a minimum of once a week, but more frequently is encouraged) are an expectation. Interaction with other mentors and students is strongly encouraged.
• Participating in the welcome and orientation day.
• Leading at least one of the weekly workshops for the entire group of S.U.R.E. participants.
• Writing at least one blog related to your area of expertise for the program website.
• Participating in the Summer Research Symposium.

This application will be reviewed by a faculty panel for acceptance into the program; determination of selected projects will be communicated after review. Project descriptions will then be made available to Lewis University undergraduate students, who can apply to the program and specific projects online via our website. Student applicants will be matched with mentors using a selection process where mentors rank interested students based on their applications and students rank projects based on their interests.
Any questions and all completed applications should be sent to Brittany Stephenson (S.U.R.E. Director) at bstephenson@lewisu.edu