A Machine Learning Approach to March Madness∗

Jared Forsyth, Andrew Wilde
CS 478, Winter 2014
Department of Computer Science
Brigham Young University

∗These match the formatting instructions of IJCAI-07. The support of IJCAI, Inc. is acknowledged.

Abstract

The aim of this experiment was to learn which learning model, feature selection technique, ordering process, and distance criterion provided the best classification accuracy in predicting the outcome of any NCAA Men's Basketball Tournament match. Ninety-four features were selected from ESPN.com for each team accepted into the tournament for the last four years. Ordering processes were tested against the baseline, and random ordering increased classification accuracy from 0.61 to 0.63. Random ordering was then used to test a variety of feature reduction techniques. Random forest feature reduction performed best, increasing accuracy from 0.63 to 0.7322 when used in conjunction with kNN and limiting the number of features to five. Using Manhattan distance as opposed to Euclidean distance further increased accuracy from 0.7322 to 0.7362.

1 Introduction

Since 1939, the best colleges and universities across the United States have participated in a yearly tournament called the NCAA Men's Basketball Championship. This basketball tournament has become one of the most popular and famous sporting tournaments in the United States. Millions of people around the country participate in contests in which the participants make their best guesses at who will win each game throughout the tournament. These contests have become so popular that more than 11 million brackets were filled out on ESPN.com in 2014. One billion dollars was even offered to anyone who achieved a "perfect bracket" (guessed every game correctly).

Every game is unpredictable, and teams that are supposedly "better" sometimes end up losing. This is called an upset in basketball lingo, and it happens regularly throughout the tournament. Because of these upsets, it can be difficult to correctly guess the winner of each game. The tournament can be so unpredictable that the time period over which it runs has been termed March Madness.

Since there are 64 games played in the NCAA tournament, it is nearly impossible to predict a perfect bracket. The High Point Enterprise, a morning paper from North Carolina, stated that "you have a much greater chance of winning the lottery, shooting a hole-in-one in golf or being struck by lightning". They estimated that the chances of predicting a perfect bracket are 1 in 9.2 quintillion.
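For context, the quoted odds are consistent with a simple back-of-the-envelope calculation (ours, not the newspaper's): a 64-team single-elimination bracket contains 63 games, and treating every game as an unpredictable coin flip gives

2^63 = 9,223,372,036,854,775,808 ≈ 9.2 × 10^18

possible brackets, i.e. roughly 9.2 quintillion.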
It became clear that developing a model that provided perfect win/loss classification was unrealistic, so instead we focused on improving the prediction accuracy of individual games. Data was collected from a number of sources to help with the learning. Certain organizations, such as ESPN.com and NCAA.com, keep mounds of statistical information on every team throughout the regular season. By collecting these statistical measurements and running them through numerous learning models, prediction accuracy could drastically improve.

Note that the problem at hand is not classification of individual teams, but rather predicting the outcome of a match between any two teams. We discuss the impact of this distinction and our steps to deal with it in the methods section.

2 Methods

2.1 Data Source

We selected our data from the ESPN.com website, one of the largest and most popular sources for sports news in general, and particularly for information about March Madness. ESPN.com is listed as the third search result in Google (after two ncaa.com results) for the query "march madness". ESPN.com not only publishes the results of the March Madness tournament, but also keeps track of a vast number of statistics about each team and player that participates in the NCAA tournament. Each team is associated with a unique ID on ESPN.com, making it trivial to associate match results with team statistics.

Though the NCAA March Madness competition has existed for many decades, ESPN changed the way it calculates the statistics in 2010, leaving us with four years of data. Each championship contains 64 matches, yielding nearly 300 matches for evaluation, each match representing a single instance in our machine learning problem. Also included in the instance are the statistics provided by ESPN.com for the two competing teams, of which there are 47 per team. This results in 94 features per instance, of which several are derived from other features.

We considered growing the data set by including matches that occurred during the season preceding the tournament, but given the specificity of the problem, we decided that the addition of such instances would not help the algorithms generalize on championship games. Additionally, many of the features taken from ESPN are calculated based on a team's performance during the preceding season, which would not apply to matches that occurred during that season.
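The paper does not include the code that joins match results to team statistics; the sketch below is our illustration of the assembly step described above, assuming each team's 47 season statistics are already available as a vector keyed by its ESPN team ID (the IDs and stat values shown are placeholders, not real ESPN data).

```python
# Illustrative sketch (not the authors' code): build one 94-feature instance
# per tournament game by concatenating the two teams' 47 season statistics,
# looked up by their ESPN team IDs.
from typing import Dict, List, Tuple

def build_instance(team_stats: Dict[int, List[float]],
                   team1_id: int, team2_id: int,
                   team1_won: bool) -> Tuple[List[float], int]:
    """Concatenate both teams' stat vectors; the label is 1 if team1 won."""
    features = team_stats[team1_id] + team_stats[team2_id]  # 47 + 47 = 94
    return features, int(team1_won)

# Hypothetical example with placeholder IDs and stat values.
team_stats = {
    150: [0.482, 74.1] + [0.0] * 45,    # 47 stats for one team
    2305: [0.455, 70.3] + [0.0] * 45,   # 47 stats for the other
}
x, y = build_instance(team_stats, 150, 2305, team1_won=True)
print(len(x), y)  # 94 1
```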
2.2 Selected Models

Given that the comparison classification problem is sufficiently different from typical problems, we wanted to evaluate several different learning algorithms to see which would fare the best. All of the input features are continuous, with a boolean output class, which fits many common algorithms well.

Accordingly, we selected Naive Bayes, kNN, Decision Tree, SVM, CN2 Induction, Random Forest, Logistic Regression, and Backpropagated Neural Network to form our base set of models. Additionally, we ran all tests with a simple learner that output the majority class (stochastically) to form a baseline.

The features describing each team are listed in Table 1. These statistics refer to a team's performance as a whole and do not pertain to the performance of individual players.

Feature Name   Feature Description
2P             Two point field goals made for season
2PA            Two point field goal attempts for season
2PM            Two point field goals made for season
3P             Total three point field goals for season
3PA            Three point attempts
3PM            Three pointers made for season
APG            Assists per game
ASM            Average scoring margin
AST            Assists for season
AST/TO         Assists to turnover ratio
BLK            Blocks for season
BLK/PF         Blocks to personal foul ratio
BLKPG          Blocks per game
CFRP           Conference RPI
CFSS           Conference strength of schedule
DEF            Defensive rating
DEFQ           Defensive quotient
DRPG           Defensive rebounds per game
FG             Field goal percentage
FT             Total free throws made for season
FTA            Total free throw attempts for season
FTM            Free throws made for season
GP             Games played for season
LRPI           Road and neutral game RPI
NCRP           Non-conference RPI
NCSS           Non-conference strength of schedule
OFF            Offensive rating
OFFQ           Offensive quotient
ORPG           Average offensive rebounds per game
PF             Personal fouls for season
PPG            Points per game
PPS            Points per shot
RK             Rank at the end of the season
RPG            Rebounds per game
RPI            Rating percentage index
SOS            Strength of schedule
ST/TO          Steals to turnover ratio
STPG           Steals per game
TOPG           Turnovers per game

Table 1: Feature Descriptions

3 Initial Results

For our initial results, we ran all of the learners with their default parameters (as supplied by the Orange computation platform) using all 94 features and a feature-based ordering scheme where "team1" was the team with the lower "strength of schedule" score. This yielded abysmal results; the best learner was kNN, which achieved 61% accuracy, barely above the baseline of 60%. All of the other learners tied with the baseline or did significantly worse.
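The default-parameter runs were done inside the Orange platform; as a hedged approximation of the same protocol (our scikit-learn sketch on synthetic stand-in data, with CN2 induction and the backpropagated network omitted because they have no direct scikit-learn equivalents), the experiment looks roughly like this:

```python
# Sketch of the initial evaluation protocol: default-parameter learners
# compared against a stochastic baseline with 10-fold cross validation.
# (Our approximation; the paper used Orange, and the data below is synthetic.)
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 94))      # ~300 tournament games x 94 features
y = rng.integers(0, 2, size=300)    # 1 = the first-listed team won

models = {
    # "stratified" samples predictions from the training class distribution
    "baseline (stochastic)": DummyClassifier(strategy="stratified"),
    "kNN": KNeighborsClassifier(),
    "naive bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name:22s} {acc:.3f}")
```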
4 Improvements

4.1 The Problem of Comparison

One of the features of this machine learning task is that its goal is comparison between two items (in this case teams), where each team has a number of features associated with it. The problem of classification, and specifically binary classification, is to determine whether an instance A belongs to one of two classes, for example X or Y. What is unique about the situation of comparison is that each instance A is comprised of an ordered pair of "teams" B and C. For every instance BC => X, the reversed pair CB will always be Y, and vice versa.

This adds a layer of complexity to what the learner must learn; for every logical input feature f, fB and fC are implicitly linked, and their reverse is linked to the reverse of the outcome. The setup of the learners forces B and C to be ordered, though the data set is not inherently ordered; for each game there are just two teams and an indication of which team won. An obvious option would be to put the winner first and the loser second, but then all classifiers would simply classify every outcome as "WIN"; the ordering must be something that can be trivially determined for novel data.

We investigated several methods of approaching this issue, including feature-based ordering, arbitrary ordering, random ordering, and reverse duplication (data doubling); a sketch of each scheme follows the descriptions below.

Feature-based Ordering

Initially, we sorted the teams based on a single feature that we thought was most useful, the "strength of schedule" feature that is computed by ESPN.com using a secret algorithm known only to them. This feature generally carries a lot of weight when people build their brackets by hand, so we ordered the teams such that the team with the lower "strength of schedule" came first.

Arbitrary Ordering

Our second approach was to order the teams alphabetically, such that the team to come "first" was the one whose name came first alphabetically. This was arbitrary, but not random; it introduced a bias that was completely unrelated to the problem at hand.

Random Ordering

A third approach was to order the teams randomly, essentially removing any bias from the ordering. Removing that bias would mean one less thing that a learner has to discover. This does not help the learner understand that there is reversibility in the data set, however, which could potentially damage results.

Data Doubling

The final method was to insert both orderings: both BC => X and CB => Y. This would give learners the most possible information and hopefully also remove any bias. While testing learners, both orderings of a test game were excluded from the training set; if BC were left in the training set when testing on CB, the learners achieved nearly 100% accuracy.
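The paper does not show the ordering code; the following sketch is ours, with an assumed position for the strength-of-schedule value in each stat vector, and shows one way the four schemes could be implemented.

```python
# Illustrative implementations of the four ordering schemes (our sketch, not
# the authors' code). Each game is (winner_name, winner_stats, loser_name,
# loser_stats); each scheme returns (features, label) pairs where label == 1
# means the first-listed team won.
import random

SOS_INDEX = 0  # assumed position of "strength of schedule" in the stat vector

def instance(first_stats, second_stats, first_won):
    return first_stats + second_stats, int(first_won)

def feature_based(game):
    w_name, w_stats, l_name, l_stats = game
    if w_stats[SOS_INDEX] <= l_stats[SOS_INDEX]:   # lower SOS listed first
        return [instance(w_stats, l_stats, True)]
    return [instance(l_stats, w_stats, False)]

def arbitrary(game):
    w_name, w_stats, l_name, l_stats = game
    if w_name <= l_name:                           # alphabetical by team name
        return [instance(w_stats, l_stats, True)]
    return [instance(l_stats, w_stats, False)]

def random_order(game, rng=random.Random(0)):
    w_name, w_stats, l_name, l_stats = game
    if rng.random() < 0.5:                         # coin-flip ordering
        return [instance(w_stats, l_stats, True)]
    return [instance(l_stats, w_stats, False)]

def doubled(game):
    w_name, w_stats, l_name, l_stats = game        # both orderings, mirrored labels
    return [instance(w_stats, l_stats, True),
            instance(l_stats, w_stats, False)]

# Hypothetical game with two 47-value stat vectors.
game = ("Akron", [0.31] + [0.0] * 46, "Duke", [0.12] + [0.0] * 46)
print(feature_based(game)[0][1], len(doubled(game)))  # 0 2
```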
Comparison of Orderings

When tested with the default settings of eight different machine learning algorithms, the different ordering schemes resulted in significant differences in accuracy. Figure 1 shows the absolute accuracy of each learner for each of the four ordering schemes, and Figure 2 shows the relative accuracy compared with a baseline learner (majority) for that ordering. Accuracy was determined using 10-fold cross validation.

In this comparison, difference from the baseline is a more useful metric than absolute performance. In a case where the ordering resulted in 90% of the instances being the same class, many learners would doubtless do very well, perhaps in the 80-90% range. This is far less impressive, however, than a learner that achieves 60% accuracy on a data set that is split 50-50.

Figure 1: Performance of Various Learners on Different Orderings

Figure 2: Performance of Various Learners against the Baseline

The large amount of variance for a given learner across the different orderings indicates that ordering has a large impact on a learner's ability to learn the problem. The various learners reacted differently to each of the orderings, as some were better able to cope with or take advantage of the bias introduced by a given scheme.

For the arbitrary (alphabetical by team name) ordering, almost all of the learners were more accurate than the baseline. Because the ordering was fairly uncorrelated with the learning goal, there was not much of an extra bias that had to be overcome. However, it still fared worse than the random ordering, which was (by definition) uncorrelated and unbiased.

Feature-based ordering was least helpful; almost all of the learners did worse than the baseline. The "strength of schedule" feature was more correlated with the learning goal than any other ordering; it resulted in a 60-40 win-loss split in our data set. The majority learner therefore achieved 60% accuracy, making all others comparatively much worse.

Random ordering was most effective across the board, both in terms of absolute accuracy and in terms of improvement over the baseline. Given that none of the algorithms used are designed to take advantage of ordering bias in a comparison instance, the best method was simply to remove the bias.

Data doubling was surprisingly ineffective. Even though there were twice as many instances for a learner to learn from and the ordering bias was effectively neutralized, the problem was apparently just as hard to learn as when there was ordering bias.
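One practical detail of the data-doubling scheme is worth making concrete: both orderings of a game must land on the same side of any train/test split. A grouped cross-validation split keyed by game ID enforces this; the sketch below is our illustration in scikit-learn terms on synthetic data, not the paper's implementation.

```python
# Keep the two mirrored copies of each doubled game together during cross
# validation, so a learner never trains on the reverse of a game it is being
# tested on (our sketch with synthetic data; not the authors' code).
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
n_games = 150
X_games = rng.normal(size=(n_games, 94))          # one row per game, team1 then team2

# Doubling: original ordering plus the mirrored ordering with a flipped label.
X = np.vstack([X_games, np.hstack([X_games[:, 47:], X_games[:, :47]])])
y = np.concatenate([np.ones(n_games), np.zeros(n_games)])
groups = np.concatenate([np.arange(n_games), np.arange(n_games)])  # game IDs

scores = []
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
    model = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(f"mean accuracy: {np.mean(scores):.3f}")
```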
4.2 Feature Reduction

From an initial feature space of 94 dimensions, we tested several different methods of reduction, in which we took the top n attributes as ranked by four different algorithms: Gain Ratio, ReliefF, Linear SVM Weights, and Random Forests. For each algorithm we tested taking different numbers of attributes (always the top n) for each of the learners. Figure 3 shows the best accuracy achieved by each learner for each reduction technique, for n < 45.

Figure 3: Comparison of Reduction Methods
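As a hedged illustration of this top-n selection step (our scikit-learn sketch on synthetic data; the paper used Orange's rankers, and Gain Ratio and ReliefF have no scikit-learn built-ins, so only the random forest and linear SVM rankings are shown):

```python
# Rank features and keep the top n, using random forest importances or
# linear SVM weights (illustrative; synthetic data stands in for the real set).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

def top_n_by_random_forest(X, y, n):
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1][:n]

def top_n_by_svm_weights(X, y, n):
    svm = LinearSVC(dual=False).fit(X, y)
    return np.argsort(np.abs(svm.coef_[0]))[::-1][:n]

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 94))
y = rng.integers(0, 2, size=300)

keep = top_n_by_random_forest(X, y, 5)   # indices of the 5 most important features
X_reduced = X[:, keep]
print(keep, X_reduced.shape)             # e.g. [...] (300, 5)
```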
ReliefF scoring proved to be the least helpful, and was the only method that actually decreased accuracy when the number of features was reduced; the other three methods managed to achieve significant gains over the full feature set.

Using Gain Ratio to reduce feature size resulted in modest accuracy improvements for over half of the learners. However, it was only able to reduce the feature set down to about 15 features before all the learners began to lose accuracy dramatically.

The two reduction methods that involved more complex computation resulted in the most impressive accuracy gains. They also succeeded in reducing the number of features by over 80%.

Reduction using the Support Vector Machine resulted in dramatic improvements for several learners. The SVM learner improved by 8%, and the neural network and logistic regression algorithms both improved by 6%. In each case, the feature space was reduced from 94 features to between 5 and 10, dramatically speeding up learning time as well. It is interesting that the SVM learner improved when using SVM preprocessing; the second learner was apparently able to benefit from the work done by the previous one, while at the same time exploring new territory to achieve better accuracy.

Random forest feature reduction proved to be the most successful, both in terms of accuracy and in the amount of reduction achieved. The kNN algorithm was boosted by 10% when run on only the top 5 features from the random forest weights, a 95% reduction in feature space size. All learners except logistic regression and SVM achieved similar improvements in accuracy. As was the case with SVM reduction, the random forest reduction resulted in a significant improvement for the Random Forest algorithm.

Figure 4: Benefit of Reduction to Individual Learners

4.3 Model Tweaks

The kNN learner achieved the greatest accuracy after applying feature reduction, so we chose to focus on it for more granular parameter adjustment. We experimented with two different distance metrics, Euclidean and Manhattan, at varying values of k to isolate the ideal parameters. Results are displayed in Figure 5. The greatest accuracy (73.62%) was reached by using the Manhattan distance metric and a k value of 130.

Figure 5: Nearest Neighbor Accuracies by # Neighbors
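A parameter sweep like the one summarized in Figure 5 could be reproduced roughly as follows (our sketch in scikit-learn terms, on stand-in data with five features, not the authors' Orange code).

```python
# Sweep distance metric and k for kNN with 10-fold cross validation
# (illustrative sketch on synthetic data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))        # stand-in for the 5 selected features
y = rng.integers(0, 2, size=300)

for metric in ("euclidean", "manhattan"):
    for k in (60, 80, 100, 120, 130, 140):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        acc = cross_val_score(knn, X, y, cv=10, scoring="accuracy").mean()
        print(f"{metric:9s} k={k:3d}  accuracy={acc:.4f}")
```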
5 Results

Numerous tests were performed, altering the ordering sequence, training model, features, and type of distance (Euclidean vs. Manhattan). The data was trained using SVM, CN2 rules, neural network, classification tree, logistic regression, naive Bayes, random forest, k-nearest neighbor, or simple majority. The features were selected according to those that provided the highest SVM weights; in other words, if the data set was trained using 10 features, then the 10 features with the highest SVM weights were used. Increasing classification accuracy (CA) was our primary goal.

5.1 Ordering

Each ordering technique was run with each learning model. The results can be seen in Table 2.

Model                 Arbitrary  Random  Feature  Double
SVM                   0.6047     0.6195  0.5813   0.6071
CN2 rules             0.5734     0.6417  0.555    0.5854
Random Forest         0.5249     0.6189  0.5171   0.5
Neural Network        0.6043     0.6379  0.5736   0.5783
Naive Bayes           0.6201     0.6313  0.6077   0.5613
Logistic regression   0.5852     0.6533  0.6004   0.6034
Majority              0.5246     0.517   0.6001   0.5528
kNN                   0.5816     0.6312  0.6117   0.5411
Classification Tree   0.5506     0.6229  0.55     0.609

Table 2: Ordering influence on training models

Random ordering gave us our best results for every training model. The classification accuracy increased from the baseline of 0.61 to 0.63. Other ordering techniques introduced a type of bias that random ordering did not. These results were attained without implementing any feature reduction techniques.

5.2 Feature Reduction

SVM

Features were chosen according to their SVM weights: if only ten features were selected, the top ten features with the highest SVM weights were used. Random ordering was used in creating each instance. The results are shown in Table 3.

                      Top N Features
Model                 All    35     10     9      5
SVM                   0.619  0.679  0.702  0.698  0.588
CN2 rules             0.641  0.663  0.562  0.592  0.569
Random Forest         0.618  0.660  0.611  0.641  0.551
Neural Network        0.637  0.667  0.687  0.706  0.641
Naive Bayes           0.631  0.645  0.634  0.623  0.6
Logistic regression   0.653  0.668  0.691  0.720  0.607
kNN                   0.631  0.672  0.656  0.657  0.603
Classification Tree   0.622  0.555  0.547  0.570  0.572

Table 3: Features were reduced using the best features according to SVM weights.

Random Forest

Features were chosen by constructing a large number of decision trees and choosing the features that averaged the most importance among the decision trees. These results can be viewed in Table 4.

                      Top N Features
Model                 All     15      5       4
SVM                   0.6195  0.6452  0.6228  0.6226
CN2 rules             0.6417  0.6687  0.6835  0.6905
Random Forest         0.6189  0.6903  0.6719  0.6571
Neural Network        0.6379  0.6493  0.6942  0.6946
Naive Bayes           0.6313  0.6755  0.6909  0.6835
Logistic regression   0.6533  0.6379  0.6187  0.6187
kNN                   0.6312  0.6872  0.7322  0.7024
Classification Tree   0.6229  0.5774  0.6756  0.5936

Table 4: Features were selected as those most important in a random forest

ReliefF

Instances are chosen at random, and the weights of feature relevance are adjusted according to each instance's nearest neighbors. Table 5 shows the results using ReliefF scoring.

                      Top N Features
Model                 All     65      15      5
SVM                   0.6195  0.6085  0.5991  0.5651
CN2 rules             0.6417  0.5813  0.5969  0.4828
Random Forest         0.6189  0.616   0.5665  0.5132
Neural Network        0.6379  0.6303  0.6305  0.5694
Naive Bayes           0.6313  0.6496  0.6074  0.566
Logistic regression   0.6533  0.6528  0.6377  0.5991
kNN                   0.6312  0.6199  0.5701  0.5963
Classification Tree   0.6229  0.5595  0.5661  0.5131

Table 5: Features selected using ReliefF scoring

Gain Ratio

Gain ratio ranks the features that provide the most information gain, but does not take combinations of features into account. Table 6 shows feature selection using gain ratio.

                      Top N Features
Model                 All     35      15      5
SVM                   0.6195  0.6566  0.6298  0.6148
CN2 rules             0.6417  0.6342  0.6567  0.5698
Random Forest         0.6189  0.6386  0.6835  0.5922
Neural Network        0.6379  0.6224  0.6684  0.615
Naive Bayes           0.6313  0.6642  0.6528  0.6225
Logistic regression   0.6533  0.6349  0.6115  0.6074
kNN                   0.6312  0.6952  0.6989  0.6679
Classification Tree   0.6229  0.6342  0.5671  0.5775

Table 6: Features selected using gain ratio.

Feature Reduction Summary

Random forest provided the best results when used in conjunction with kNN and limiting the number of features used in training to five. The classification accuracy increased from 0.63 to 0.7322, an improvement of more than 10 percentage points.
5.3 Euclidean Distance vs. Manhattan Distance

Table 7 shows the differences between using Euclidean distance and Manhattan distance for k-nearest neighbor training on randomly ordered instances, limiting the number of features used in training to five.

# Neighbors   Euclidean   Manhattan
60            0.7174      0.6984
80            0.7209      0.7098
100           0.7322      0.7211
120           0.7171      0.7286
130           0.7095      0.7362
140           0.7020      0.7246

Table 7: Nearest Neighbor Distance Algorithms and K values

Using Manhattan distance and the nearest 130 neighbors increased our previous best result from 0.7322 to 0.7362.
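Putting the best-performing pieces from Sections 5.1-5.3 together (random ordering, random forest reduction to five features, and kNN with Manhattan distance and k = 130), the final configuration might look roughly like the sketch below. This is our reconstruction in scikit-learn terms, not the original code; in a more careful version, the feature selection would ideally be nested inside the cross-validation folds to avoid selection bias.

```python
# End-to-end sketch of the best reported configuration (our reconstruction,
# run here on synthetic stand-in data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 94))       # randomly ordered instances, 94 features
y = rng.integers(0, 2, size=300)     # 1 = the first-listed team won

# 1. Random forest feature reduction down to the five most important features.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top5 = np.argsort(forest.feature_importances_)[::-1][:5]

# 2. kNN with Manhattan distance and k = 130 on the reduced feature set,
#    scored with 10-fold cross validation.
knn = KNeighborsClassifier(n_neighbors=130, metric="manhattan")
acc = cross_val_score(knn, X[:, top5], y, cv=10, scoring="accuracy").mean()
print(f"mean 10-fold accuracy: {acc:.4f}")
```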
6 Conclusion

The results of our experiments show that random ordering used in conjunction with k-nearest neighbor, Manhattan distance, and random forest as the feature reduction algorithm provides the best classification accuracy. Our best classification accuracy, using this combination, was 0.7362. The following sections describe possible reasons why these choices provided the best results among all experiments that were tested.

6.1 Style Theory

In basketball, it is believed that certain teams match up better against certain other teams. For example, one team may be extremely quick while another team is extremely tall. If the first team can alter the game so that it is played at a fast pace, then the game style is in its favor. Essentially, specialties on a basketball team translate to success against opponents with certain characteristics.

6.2 K-nearest Neighbor & Style Theory

K-nearest neighbor most likely provided the best results in comparison to the other training models because of this theory: each instance is compared against the instances most similar to it.
In other words, quick teams that play tall teams are compared to other match-ups that involved quick teams playing tall teams. The consistencies of those match-ups and their associated outcomes mean more to an instance that shares their characteristics.

6.3 Combination of Features Theory

Some features, when combined, prove to be more beneficial to the success of a team than others. For example, a team that has a low number of steals per game, shoots a large number of field goals, and shoots a high field goal percentage will have more success than a team with a combination of a large number of steals, a high field goal percentage, and a low number of shots taken each game. Learning models that take the combination of features into account when classifying instances will perform better than those that do not.

6.4 Random Forest and Combination of Features Theory

Random forest feature reduction takes into account how features interact with one another. This allowed us to use the features that provided the best results when used in combination with other features. Thus, we were better able to account for the importance of a combination of features and use that as part of the training process.

6.5 Perfect Bracket Probabilities

Even with large amounts of statistical information, there are some parts of a basketball game that simply cannot be predicted. Injuries cannot be forecast, and the mental state of the players cannot be observed. However, these things can drastically affect the outcome of a basketball match. Perfect brackets will therefore continue to be impossible to calculate, no matter how much data is provided to the learning models.

7 Future Work

We believe that our classification accuracy could improve to 85% if the proper features were used in training. Vital statistical information we were unable to gather includes the following:

• Individual statistics for players on each team
• Venue of the match, including the distance
• Importance of a match-up
• Win/loss record for the last five, ten, and fifteen games
• Average number of players who play consistently

All of this data is expected to play a role in how teams perform. However, this type of data was out of our scope to accumulate. Organizations with more access to this type of data will be able to study whether these features have a profound effect on the success of a basketball team.

This research would also benefit greatly from more data. Since only the statistics for the last four tournaments were accessible in a consistent format, our training models risked overfitting. We would recommend that much more data be gathered to help avoid overfitting. However, since basketball continually changes over the years, we recommend not using data from more than ten years back.

We would also recommend experimenting with using several feature reduction techniques together, as opposed to applying the reduction techniques one at a time.