Regression Analysis on NBA Players Background and Performance using Gaussian Processes
Can NBA-drafts be improved by taking socioeconomic background into consideration?

LUDVIG PERSSON LEJON, ludper@kth.se
FREDRIK BERNTSSON, fbernts@kth.se

Degree Project in Engineering Physics at KTH School of Computer Sciences and Communications
Supervisors: Petter Ögren (General), Carl-Henrik Ek (Machine Learning)
Examiner: Mårten Olsson
TRITA xxx yyyy-nn
Abstract

In modern society it is well known that an individual's background matters for her career, but should it be taken into consideration in recruiting processes in general, and in the recruiting of NBA players in particular? Previous research shows that white basketball players from high-income families have a 75% higher chance of becoming an NBA player than white basketball players from low-income families. In this paper, we examine whether there is a connection between an NBA player's background and the chances of succeeding in the NBA, given that the player has been picked in the NBA-draft. The results were obtained using machine learning algorithms based on Gaussian Processes. They show that draft decisions will not be improved by taking socioeconomic background into consideration.
Referat

Regression analysis with Gaussian processes of NBA players' success and socioeconomic background

In today's society there is an awareness that background matters for an individual's career, but should it be taken into account in recruiting processes in general and in the recruiting of NBA players in particular? Research has previously shown that white American basketball players from a high-income background have a 75% greater chance of reaching the NBA than white basketball players from an American low-income background. In this report we have examined whether there is a connection between the environment a player grew up in and the possibility of succeeding as an NBA player, given that the player has been picked in the NBA-draft. The results were produced using machine learning algorithms derived from Gaussian processes. These results show that the choice of players in the draft is not improved by taking socioeconomic background into account.
Contents

1 Introduction
  1.1 Background
  1.2 Data Analysis
  1.3 NBA-Draft
  1.4 Previous Research on Player Background
  1.5 Research Question
  1.6 Delimitations
2 Parameters
  2.1 Socioeconomic Background Parameters
  2.2 Successful Draft Pick Parameter
    2.2.1 How to Measure General Success
    2.2.2 Introducing BPL-index and BPL-level
3 Method
  3.1 Data Scraping Method
  3.2 Regression Method Requirements
  3.3 Possible alternative Approach: Neural Networks
  3.4 Non Technical Description of Gaussian Processes
  3.5 Technical Description of Gaussian Processes
    3.5.1 Important Definitions
    3.5.2 What This Means
    3.5.3 Worth Mentioning About the Multivariable Case
  3.6 How to use Gaussian Processes as Regression
4 Results
  4.1 Games Played versus Single Hometown Parameter
5 Analysis
  5.1 Single Parameter Analysis
  5.2 Multivariable Analysis
6 Conclusions
  6.1 Recommendations
  6.2 General Discussion
    6.2.1 Sociology
    6.2.2 Data
    6.2.3 Method
    6.2.4 Success measure
Bibliography
A Graphs
B Data variables
1. Introduction

This introductory chapter gives a detailed background on how we arrived at the research question "Can the NBA-draft picks, by looking at players' performance and socioeconomic background data, be improved using Gaussian Processes as regression?".

1.1 Background

In businesses all around the world, recruiting competent people is a key factor for success [27]. When modern companies recruit, they do not only take hard facts (e.g. number of products sold the previous year, number of employees or grades) into consideration. They also consider the personality traits of their prospective employees [11], and the standard way of doing this is through interviews and references.

From the field of sociology, the personality traits taken into consideration in job interviews can partly be traced back to socioeconomic background. In "Le sens pratique" [6], the French sociologist and anthropologist Pierre Bourdieu thoroughly describes the term habitus. Sociologist Donald Broady provides a short version in which habitus is explained as a result of social experiences and collective memories, a way of moving and thinking that imprints people's minds and bodies [8]. A tangible observation directly related to habitus is that students' different career paths and levels of success can be traced back to their background [7]. Furthermore, studies connect specific city traits with achievements in different areas; one example is that wages are systematically higher in cities with rich linguistic diversity [21].

With this in mind, we ask whether it is possible to find specific examples that connect the chances of succeeding with hometown traits using machine learning algorithms. This is, however, too broad for this paper, so we narrow it down to something more feasible in size. One way of limiting the scope of our research is to focus on a niche. We look for a niche with:
i) The possibility to compare individuals' level of success. This is important since success in general cannot be measured easily.

ii) Easy access to the information that lets us compare success. This is important for two main reasons: firstly, an easily accessed dataset increases confirmability, and with it the trustworthiness of the research. Secondly, using easily accessed data saves time.

On i), success in sports has been measured previously [23]. In fields other than sports, success is often measured by comparing salaries, for example in the research by Thomas W. H. Ng, Lillian T. Eby, Kelly L. Sorensen and Daniel C. Feldman on career success [28]. As for wish list item ii), within the timeframe of this research, finding individuals' salaries and connecting them with their hometowns would require a tremendous amount of work, and the result would not be especially confirmable since we do not want to publish these individuals' personal data. Sports statistics, however, are easily accessed and fit the second criterion well.

We will examine how hometown traits affect the possibility of succeeding in a certain sport, since there is a connection between city traits and success via the common denominator known as personality. There are over 3,000 sports in the World Sports Encyclopedia [17], so which one should we choose? We would like a sport that has:

i) Valid statistics. Not every sport is a good choice statistically. An extreme example is fishing, where statistics tell us less than in other sports since local conditions heavily influence the outcome.

ii) Teams and not individuals. Activities in many organizations are designed entirely around groups [19], which could make it possible to connect our results from the world of sports to ordinary working life.
iii) Players who can be traced to their hometowns, and for whom hometown data is accessible.

iv) A considerable number of practitioners, since a larger sport minimizes the selection bias among practitioners.

v) One highest league. This is important for comparison: several of the European football leagues are considered very good, but it is hard to tell who is the better scorer between a player who scored 12 goals in the Spanish Primera Division in 1996 and a player who scored 16 goals the same year in the German Bundesliga.
This wish list leaves a couple of alternatives, but we chose basketball since i) it has statistics for all players back to 1947 [4], ii) an overwhelming majority of the players are from the US [4], which iii) makes them easier to trace than if they were of different nationalities, iv) the NBA had over 17,000,000 visitors in the 2011-2012 season, and v) the NBA is considered substantially better than other leagues. An indication of this is that the average NBA franchise is now worth $634 million [3] and that its arenas hold over 20,000 visitors [30], compared to 5,000 in the Italian league [29].

1.2 Data Analysis

When NBA talent scouts look for new talent, they look for players who fulfill a combination of subjective and objective criteria for what a basketball player needs in order to become a top player. The gut feeling on which the subjective criteria are based is described and discussed by Nobel Prize laureate Daniel Kahneman [15], who claims that we are not able to take all facts into consideration when making decisions, which often leads to biased decisions.

In recent years pattern recognition has developed and is now used in everything from the US Postal Service, where handwritten letters are read automatically [25], to determining which Bordeaux wine will become the most expensive twenty years after the production date [1].

The often biased decisions described by Kahneman could be reduced by analysing patterns. Since computer analysis tools are strong at identifying patterns, we will use a modern computer analysis tool based on Gaussian Processes to investigate the connections between hometown traits and the success level of the player. This should lead to decisions that are less influenced by the sometimes misleading gut feeling.

1.3 NBA-Draft

The NBA-draft is an event that occurs every year. Its purpose is to enroll non-NBA players into the NBA.
Basketball players who have not previously played in the NBA are eligible for the NBA-draft. Typically this means that foreign players and players in the college leagues apply for the NBA-draft, where a total of about 50 players are recruited to NBA teams each year.

The NBA teams pick players in an order determined by a lottery. The probability distribution of the pick order depends heavily on the results of the previous season: an NBA team with a low score is likely to be among the first teams to pick.
The players picked early in the draft are expected to perform better than players picked late in the draft. This is discussed further and shown in section 2.2.2.

1.4 Previous Research on Player Background

A number of studies have been conducted in the specific area of sports and background. One example is the study of the likelihood of reaching the NBA, which tells us that players from low-income African American families have a 37% lower chance of reaching the NBA than players from high-income African American families. The difference in chance between white low- and high-income families is 75% in favor of the high-income families [10].

However, after interviewing sports journalist Mattias Lühr (Expressen), we found out that North American teams look almost exclusively at on-field performance and physical performance (e.g. points made, +/- statistics or 40-yard time) rather than personality traits before they draft players.

Despite this previous research, we do not know whether a player from a low-income family is more or less likely to succeed in the NBA than a player from a high-income family, given that both have been drafted.

1.5 Research Question

Given all this background, the question we ask is "Can the NBA-draft picks, by looking at players' performance and socioeconomic background data, be improved using Gaussian Processes as regression?".

1.6 Delimitations

The scope of this research is limited to players born in the US who have played at least one game in the NBA.
2. Parameters

In order to measure draft pick success, this chapter introduces the Berntsson Persson Lejon (BPL) level and the BPL index. It also explains in depth why income, education, housing and criminal activity are the variables chosen to represent socioeconomic background.

The players examined are all players drafted in the years 1990-1999 who have played one or more games in the NBA and were born in the US.

2.1 Socioeconomic Background Parameters

Determining a player's chance of succeeding is not only a question of physiological profiling. Such profiling can help determine the likelihood of success, but the determinants of success are multi-factorial [14], so we take socioeconomic factors into consideration.

To be able to focus on socioeconomic background, we have to know what it actually means. The Oxford Dictionaries define the word "socioeconomic" as "Relating to or concerned with the interaction of social and economic factors" [26]. But what is a viable way of measuring socioeconomic background from a statistical point of view? Previous research measuring socioeconomic status has often used income, education and occupation as defining factors [32], but there are also well-cited articles that use variables such as social support [18] in their definition of socioeconomic background.

The data we have collected on the background of the NBA players in our study contains the variables in appendix B. However, the more variables we include for each player, the more players we need. Increasing the number of American NBA players drafted in the years between 1990 and 1999 is, unfortunately for our research, impossible. So we have to choose a subset of variables from our data set to test against the player's likelihood of success.

G.A. Kaplan and J.E. Keil claim that the most frequent way of measuring socioeconomic status is to look only at educational level, because education data is easily accessed. On the other hand, they also stress that combining several ways of measuring socioeconomic status has merit [16].

Another simple and powerful way to distinguish different socioeconomic statuses is housing tenure [13]. Research has also shown that the level of criminal activity is correlated with class status [2].

Summarizing what other socioeconomic research bases its arguments on, we say that socioeconomic background is measured by:

i) Income
ii) Education
iii) Occupation
iv) Housing tenure
v) Criminal activity
vi) Population

The quantitative measurements of these background traits will be based on the following parameters from the player's hometown:

i) Population
ii) Income per Capita
iii) Household Income
iv) Home Appreciation
v) Homes Owned
vi) Housing Vacant
vii) Homes Rented
viii) Violent Crime
ix) 2 Year College
x) 4 Year College
xi) Graduate Degree
xii) Highschool Graduate

2.2 Successful Draft Pick Parameter

2.2.1 How to Measure General Success

Games played has previously been used to measure success in team sports. Other measures tend to favor a certain type of player, whereas the number of games played favors the players whom the coach believes are good enough to play. We follow the example of Tingling, Masri and Martell's research on the NHL-draft [23], where they argue in favor of measuring the number of games played. Three of their arguments are also valid for basketball:

i) It is verifiable and easy to measure
ii) It is easy to compare players across positions and teams
iii) Players who do not contribute to the team are unlikely to play

This is why games played is part of the way we measure the success level of a draft pick. The vast majority of the players in our data have retired, hence the draft year will not affect the success measure. An in-depth discussion of success is given in section 6.2.4.

2.2.2 Introducing BPL-index and BPL-level

The BPL-index is the expected number of games played for a player drafted at a specific draft position. It is calculated from a set of drafted players where every player is represented by two numbers:

i) Draft pick number.
ii) Number of games played in career.

On this data set, an exponential regression is performed with $x$ as the draft pick number and $y$ as the number of games played, which yields the exponential function

$$y_{bpl}(x) = e^{ax}$$

This function gives the expected number of games played for a specific draft number and is denoted the BPL-index.
The BPL-level is a measurement of draft pick success. It is obtained from formula 2.1, where $n_d$ is the number of games played by the drafted player, $y_{bpl}(p)$ is the exponential function defining the BPL-index and $p$ is the draft pick number of the measured player.

$$\text{BPL-level} = \frac{n_d - y_{bpl}(p)}{y_{bpl}(p)} \tag{2.1}$$

The BPL-level is a dimensionless number for each player. A positive BPL-level represents a player who has played more games than expected, and a negative BPL-level represents a draft pick who has played fewer games than expected. The BPL-level is the chosen method of measuring draft pick success. Figure 2.1 shows a regression where the BPL-index is shown in blue.

Figure 2.1: BPL index in blue, calculated from the NBA players drafted in the years 1990-1999 (x-axis: Pick Number; y-axis: Games Played; legend: BPL-index, Player)
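The BPL-index fit and formula 2.1 can be illustrated with a small sketch. It is not the code used in this study: it assumes the exponential regression is done by ordinary least squares on the logarithm of games played, and it adds an intercept `b` to the exponent (an assumption, so that the curve is not forced through one game at pick 0). The function names and the toy numbers are hypothetical.

```python
import math

def fit_bpl_index(picks, games):
    """Least-squares fit of ln(games) = a * pick + b; returns (a, b).

    Assumes every player has played at least one game, so the log exists.
    """
    logs = [math.log(g) for g in games]
    n = len(picks)
    mean_x = sum(picks) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in picks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(picks, logs))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def bpl_index(pick, a, b):
    """Expected number of games played for a given draft pick number."""
    return math.exp(a * pick + b)

def bpl_level(games_played, pick, a, b):
    """Formula 2.1: relative deviation from the expected number of games."""
    expected = bpl_index(pick, a, b)
    return (games_played - expected) / expected

# Toy data, invented for illustration only.
toy_picks = [1, 10, 20, 30, 40, 50]
toy_games = [900, 700, 450, 300, 200, 120]
a, b = fit_bpl_index(toy_picks, toy_games)
print("BPL-level of a pick-20 player with 600 games:", bpl_level(600, 20, a, b))
```

A player who has played exactly twice the expected number of games gets a BPL-level of 1, and a player with half the expected number gets -0.5.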
3. Method

This chapter discusses the choice of method for addressing the research question (see 1.5), including a brief motivation and some limitations of the method. Data is collected through data scraping, and the collected data is analysed using regression based on Gaussian Processes (GP).

3.1 Data Scraping Method

Given all the information available online and the computing power that can be bought at affordable prices, a natural approach is to try to find the relevant information through data scraping. Data scraping is the practice of examining large pre-existing databases in order to generate new information [26]. We use data scraping to collect selected variables from online content.

Many different types of programs fall under the wide definition of data scrapers, which is why we do not describe data scrapers in general but instead describe our particular data scraper on a conceptual level. The popular general-purpose scripting language PHP has some built-in functions that make it very easy to program data scrapers [24]. The data scraper is first set to visit a certain page and scan through all the HTML on that page; the HTML is parsed as an XML document. From this XML document we can identify the interesting variables and save them into our dataset, and we can also follow links and repeat the same procedure on the linked page. Once all the data is collected, we present it in a MATLAB-compatible way so that we can analyse it in MATLAB.

3.2 Regression Method Requirements

In order to answer the research question in section 1.5, we will predict a player's rate of success given some traits of his hometown. To accomplish this we want to use a method that:
i) Gives a quantitative result
ii) Is robust
iii) Has a straightforward workflow

Criteria i) and ii) are close to self-explanatory, and we will not discuss them further. The third criterion, however, rules out some algorithms, as we will see when we discuss Neural Networks in section 3.3.

Gaussian Processes meet these criteria and are therefore the method of choice. They are discussed further in sections 3.4 to 3.6.

3.3 Possible alternative Approach: Neural Networks

In this section we give a brief introduction to the field of Neural Networks, discuss their properties and explain why we chose GP instead of Neural Networks.

In order to describe Neural Networks (NN), we begin by describing the smallest part of the NN, namely the neuron. The neuron can be described by the function

$$y_j = \sum_{i=1}^{n} w_i x_i + w_0 \tag{3.1}$$

where $w_i$ is the weight for the input $x_i$ and $w_0$ is a bias term. If we put together several neurons we get what is called a Neural Network, which may look like figure 3.1. By convention, the bias term $w_0$ is omitted in the illustration.

The leftmost column of circles in figure 3.1 is the input data, denoted by $x_i$. The middle column is a hidden layer of neurons; the value generated by a neuron in the hidden layer is denoted by $y_j$. Observe that the bias term in equation 3.1 is not explicitly shown. The last column is the output data, denoted by $z_k$; the bias term is not explicitly shown for this column either. If we now apply the equation of a single neuron to an NN, we get the equation (with summation over repeated indices implied)

$$z_k = w'_{kj} y_j + c'_k \;\rightarrow\; z_k = w'_{kj} (w_{ij} x_i + c_j) + c'_k \tag{3.2}$$

For a set of training data $[x_i, o_k]$, we can train the weights $w_{ij}$ and $w'_{kj}$ as well as the bias terms $c_j$ and $c'_k$ so that the output $z_k$ of the NN corresponds to the targets $o_k$. If this training is done correctly, we will be able to predict the outcome $z_k^*$ from some input data $x_i^*$.
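As a minimal illustration of equation 3.2, the sketch below evaluates a network with one hidden layer by applying the neuron of equation 3.1 twice. Like the equations in the text, it has no nonlinear activation function. The function names, the weight layout (one row of weights per receiving neuron) and the numbers are assumptions made for this example.

```python
def layer(weights, biases, inputs):
    """One layer of neurons: out[j] = sum_i weights[j][i] * inputs[i] + biases[j]."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, w_hidden, c_hidden, w_out, c_out):
    """Equation 3.2: the hidden values y feed the output layer that produces z."""
    y = layer(w_hidden, c_hidden, x)
    z = layer(w_out, c_out, y)
    return z

# Two inputs, three hidden neurons, one output (all values hypothetical).
w_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_hidden = [0.0, 0.0, 1.0]
w_out = [[1.0, 1.0, 1.0]]
c_out = [0.5]
print(forward([1.0, 2.0], w_hidden, c_hidden, w_out, c_out))
```

Training, which is where the trial and error discussed in this section comes in, would adjust the four weight and bias arrays until the outputs match the targets; that step is not shown here.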
Some very interesting applications have been built using Neural Networks, such as identifying handwritten characters [25]. But since no rigorous theory has been developed for how to design a Neural Network, constructing one relies heavily on trial and error as well as black box programming [31]. Hence criterion iii), to have a straightforward workflow, is not fulfilled for the Neural Network.

Figure 3.1: Schematic illustration of a neural network

3.4 Non Technical Description of Gaussian Processes

Machine learning is a type of artificial intelligence where the computer program is able to predict a certain outcome based on examples given to the program. Gaussian Processes is a machine learning method where the prediction is based on normal distributions, as illustrated in figure 3.3. In Gaussian Processes we say that the likelihood of a certain outcome is normally distributed, and based on the training set (i.e. the examples given to the program) we are able to approximate the standard deviation and from there describe a function that best fits the training set.

A simple explanation is that we start with an arbitrarily chosen value at $x_0$ and from there say that the next value at $x_1$ is drawn randomly from a normal distribution based on the previous point $x_0$, and so it continues for $x_2, ..., x_n$. The more values that are generated, the less impact $x_0$ has on the outcome. The functions of different colors in figure 3.2 are examples of random functions generated by this method.
Figure 3.2: Some random functions

Figure 3.3: An intuitive illustration of the concepts of Gaussian processes
3.5 Technical Description of Gaussian Processes

3.5.1 Important Definitions

In this section we introduce some important definitions and basic explanations that are necessary for further discussion of Gaussian Processes.

Assume that we are about to measure some data points $[x_1, y_1], [x_2, y_2], ..., [x_n, y_n]$. We believe that $y_i$ can be described by

$$y_i = [w_j \phi(x_j)]_i \tag{3.3}$$

where $\phi(x)$ is fixed, but typically non-linear, and is called the basis function. $w$ has a normal prior distribution

$$w_i \in N(\mu_i, \alpha^{-1} I) \tag{3.4}$$

Assuming that the mean of $w_i$ is zero is no restriction, since we can add a bias contribution to $\phi_i(x_j)$. Now we are able to calculate the expected value and the covariance of $y_i$

$$E[y_i] = E[w_i \phi(x_i)] = \phi(x_j) E[w_j] = \mu_i \tag{3.5}$$

$$Cov[y_i] = E[y_i y_j] = \phi(x_i) E[w_i w_j] \phi(x_j) = \alpha^{-1} \phi(x_j) \phi(x_i) \tag{3.6}$$

We define

$$k(x_i, x_j) = \alpha^{-1} \phi(x_j) \phi(x_i) \tag{3.7}$$

which we call the kernel function. Now we have everything we need in order to define a specific Gaussian process

$$y_i \in GP(\mu_i, k(x_i, x_j)) \tag{3.8}$$

The kernel we will use is a variant of the squared exponential, which is a commonly used kernel

$$k(x_i, x_j) = \theta_2 e^{-\frac{1}{\theta_1} |x_i - x_j|^2} \tag{3.9}$$

where $\theta_1$ and $\theta_2$ are hyperparameters. Some functions generated using this kernel can be seen in figure 3.2.

In any practical use, though, the kernel $k(x_i, x_j)$ will only be evaluated at a finite number of points, and the kernel may then be represented as a matrix, which we denote

$$C_{ij} = k(x_i, x_j) \tag{3.10}$$
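To make equations 3.9 and 3.10 concrete, the following sketch builds the kernel matrix for a few input points and draws one random function from the zero-mean prior. It is an illustration only: the Cholesky-based sampling, the small `jitter` added to the diagonal for numerical stability, the function names and the default hyperparameter values are assumptions and are not taken from this study.

```python
import math
import random

def sq_exp_kernel(xi, xj, theta1=1.0, theta2=1.0):
    """Equation 3.9: theta2 * exp(-|xi - xj|^2 / theta1)."""
    return theta2 * math.exp(-((xi - xj) ** 2) / theta1)

def kernel_matrix(xs, theta1=1.0, theta2=1.0, jitter=1e-6):
    """Equation 3.10, plus a small diagonal jitter (an assumption) for stability."""
    n = len(xs)
    return [[sq_exp_kernel(xs[i], xs[j], theta1, theta2) + (jitter if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def sample_prior(xs, rng, theta1=1.0, theta2=1.0):
    """Draw f = L u with u standard normal, so that Cov[f] = L L^T = C."""
    l = cholesky(kernel_matrix(xs, theta1, theta2))
    u = [rng.gauss(0.0, 1.0) for _ in xs]
    return [sum(l[i][k] * u[k] for k in range(i + 1)) for i in range(len(xs))]

xs = [0.5 * i for i in range(10)]
print(sample_prior(xs, random.Random(0)))
```

Repeating the last call with different seeds gives different smooth curves of the kind shown in figure 3.2; a larger `theta1` makes neighbouring values more strongly correlated and the curves smoother.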
3.5.2 What This Means

In this section we discuss what Gaussian processes do in practice, with a focus on intuitive understanding.

Assume that we are about to measure some data points $([x_1, y_1], ..., [x_n, y_n])$. We want to find some function $y_i = f(x_j)$, but we have no idea what this relation may look like. A naïve approach to finding the best function $f(x_j)$ is to try every available function. It turns out that this approach is not as crazy as it may seem, because here we can make use of the tools provided by Gaussian processes.

We can make a prior guess of the first value $y_1$. Since we do not know anything about $y_1$ beforehand, we might as well guess that it is zero, see point $x_0$ in figure 3.3. This is no limitation since we can choose our zero level arbitrarily.

What was proposed in the previous paragraph is not without conditions: we say that similar points $x_k$ and $x_l$ (i.e. $k \approx l$) will generate similar values $y_k$ and $y_l$. In our case this means that the point $x_1$ in figure 3.3 is more like $x_0$ than $x_2$ and $x_3$ are. We can observe this decrease in dependency in the increasing variance of the shaded areas in figure 3.3.

3.5.3 Worth Mentioning About the Multivariable Case

The theory of Gaussian Processes is not limited to one-dimensional features but in theory allows any number of feature dimensions. The theory described in section 3.5.1 is based on the single variable case, but only very small changes are required in order to describe the multivariable case. The computational complexity is $O(n^3)$ in the number of data points $n$, since we calculate the inverse of a matrix, and hence the computational load sets the limit for how many data points and features we can add. Another issue that arises when the dimensionality increases is that Euclidean norms may become unsatisfactory for high-dimensional vector spaces.
One kernel that is commonly used when fitting multivariate data, and that incorporates Automatic Relevance Determination (ARD), is the kernel

$$k(x_i, x_j) = \theta_0 \exp\left( -\frac{1}{2} \sum_{n=1}^{D} \theta_n (x_{in} - x_{jn})^2 \right) \tag{3.11}$$

where $D$ denotes the number of dimensions of the input data and the hyperparameters $\theta_n$ are the weights determining the relevance of each dimension. If the outcome depends strongly on a certain feature, that feature's $\theta$-value will be large. The hyperparameters are determined by a maximum likelihood algorithm.
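The ARD kernel of equation 3.11 is short enough to write out directly. The sketch below is only an illustration with hypothetical names and values; it shows how a weight $\theta_n = 0$ makes the kernel ignore dimension $n$ entirely.

```python
import math

def ard_kernel(xi, xj, theta0, thetas):
    """Equation 3.11: theta0 * exp(-0.5 * sum_n thetas[n] * (xi[n] - xj[n])^2)."""
    s = sum(t * (a - b) ** 2 for t, a, b in zip(thetas, xi, xj))
    return theta0 * math.exp(-0.5 * s)

# The two points differ only in the second feature, whose weight is zero,
# so the kernel treats them as identical.
print(ard_kernel([0.0, 0.0], [0.0, 5.0], theta0=1.0, thetas=[1.0, 0.0]))
# The same difference in a feature with a large weight gives a low similarity.
print(ard_kernel([0.0, 0.0], [0.0, 5.0], theta0=1.0, thetas=[1.0, 2.0]))
```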
3.6 How to use Gaussian Processes as Regression

In our case, and in most other cases where Gaussian Processes are applicable, we are not interested in generating a random function from scratch but in performing a regression on some measured data. We want to find the most likely distribution of weights $w_j$ given some data $t_i$. We assume that

$$t_i = y_i + \epsilon_i \tag{3.12}$$

where $\epsilon_i$ is a noise term which we assume to be normally distributed, $\epsilon_i \in N(0, \beta^{-1})$. We may express the conditional probability as $p(w_j | t_i)$. Using Bayes' rule, this can be rewritten as

$$p(w_j | t_i, \theta_k) = \frac{p(t_i | w_j) \, p(w_j | \theta_k)}{p(t_i | \theta_k)} \tag{3.13}$$

where $\theta_k$ are hyperparameters such as the characteristic length or the internal weights between the different dimensions of the input data. The denominator is independent of $w_j$, so it is possible to omit it and instead express the probability as

$$p(w_j | t_i, \theta_k) \propto p(t_i | w_j) \, p(w_j | \theta_k) \tag{3.14}$$

and reintroduce the normalization constant when convenient, typically when all calculations are made.

The first probability factor is normally distributed since $\epsilon_i$ is normally distributed. The second probability factor is normally distributed as well, as stated in equation 3.4. Hence the probability distribution in equation 3.14 is normally distributed too.

Now, by using an algorithm to find the set of $w_j$ that is most likely, we can calculate its mean function $\mu_i^{(posterior)}$ and covariance function $C_{ij}^{(posterior)}$ and get a posterior prediction at some points $x_j$.

For further reading about GP we recommend the books by Rasmussen and Williams [9] and Bishop [5], and visiting the homepage gaussianprocesses.org.
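The regression step can be sketched in a few lines. Note that this sketch does not solve for the weights $w_j$ explicitly: it uses the equivalent kernel (function-space) form of the standard GP regression equations found in the recommended textbooks, with a zero prior mean, the squared exponential kernel of equation 3.9 and noise precision $\beta$ as in equation 3.12. The function names, default values and toy data are assumptions for illustration, not the implementation used in this study.

```python
import math

def kernel(a, b, theta1=1.0, theta2=1.0):
    """Squared exponential kernel, equation 3.9."""
    return theta2 * math.exp(-((a - b) ** 2) / theta1)

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def solve_cholesky(l, b):
    """Solve (L L^T) x = b by forward and then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

def gp_predict(xs, ts, x_star, beta=100.0, theta1=1.0, theta2=1.0):
    """Posterior mean and variance of the noise-free function value at x_star."""
    n = len(xs)
    # Covariance of the noisy targets: kernel matrix plus noise variance 1/beta.
    c = [[kernel(xs[i], xs[j], theta1, theta2) + (1.0 / beta if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    l = cholesky(c)
    k_star = [kernel(x, x_star, theta1, theta2) for x in xs]
    mean = sum(k * a for k, a in zip(k_star, solve_cholesky(l, ts)))
    var = kernel(x_star, x_star, theta1, theta2) - sum(
        k * v for k, v in zip(k_star, solve_cholesky(l, k_star)))
    return mean, var

xs = [0.0, 1.0, 2.0]    # toy inputs
ts = [0.5, -0.2, 0.3]   # toy noisy targets
for x_star in (1.0, 1.5, 5.0):
    print(x_star, gp_predict(xs, ts, x_star))
```

Close to the training points the predicted variance is small, and far away from them the prediction falls back to the prior (mean zero, variance $\theta_2$). This is the behaviour behind the widening shaded areas in figure 3.3.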
4. Results

4.1 Games Played versus Single Hometown Parameter

Results for all twelve hometown parameters are presented in appendix A. We use figure 4.1 to explain what the graphs show.

Figure 4.1: Success vs High School Graduates (x-axis: quantitative representation of Highschool Graduate; y-axis: BPL-level; legend: 90 %, 80 %, 70 %, 60 % and 51 % confidence intervals, Player, Mean games played per draft position, US median (vertical), NBA-players median (vertical))

Starting with the information on the two axes, the x-axis shows the share of high school graduates in each player's hometown in normalized form. The normalization is

$$\frac{\text{High school graduates in hometown (\%)} - \text{Mean high school graduates in the US (\%)}}{\text{Mean high school graduates in the US (\%)}}$$

The y-axis shows the BPL-level, with 0 corresponding to the BPL-index (see section 2.2.2). Each red dot represents a player, so a dot with a positive BPL-level means that the player has played more games than the mean number of games played for that player's draft position.
The vertical red line represents the mean percentage of high school graduates among the NBA players' hometowns. So a red dot with an x-value greater than that of the vertical red line represents a player from a town with a higher percentage of high school graduates than the arithmetic mean amongst NBA players.

The vertical black line represents the mean percentage of high school graduates in the whole US. So a red dot with an x-value greater than that of the vertical black line represents a player from a town with a higher percentage of high school graduates than the US mean.

A black vertical line positioned to the left of the red vertical line means that the mean percentage of high school graduates in the US is lower than the mean percentage of high school graduates in the NBA players' hometowns.

The different confidence intervals are marked with different colors. The confidence intervals represent the probability that a function from the distribution will lie within the coloured area.
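For a normal distribution, a band such as the ones described above follows directly from the mean and standard deviation at each x-value. The sketch below assumes that the bands are central (symmetric) pointwise intervals, which the text does not state explicitly, and uses hypothetical names and values.

```python
from statistics import NormalDist

def central_interval(mean, std, level):
    """Central interval that contains a normal variable with probability `level`.

    `level` must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * std, mean + z * std

# Band edges at one x-value with predicted mean 0.0 and standard deviation 1.0.
for level in (0.51, 0.60, 0.70, 0.80, 0.90):
    print(level, central_interval(0.0, 1.0, level))
```

Under this assumption, the BPL-index (the line at BPL-level 0) is enclosed by a band at a given x-value exactly when 0 lies between the two returned edges.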
5. Analysis Only by inspection and with very little background knowledge it is possible to anal- yse the results with the single socioeconomic variables. The multivariable analysis demands a deeper understanding of GP. 5.1 Single Parameter Analysis If we study each and every different level of confidence interval we see that the BPL-index is enclosed by almost every level of confidence intervall plotted in the graph. This means that there is no significant correlation between the socioeconomic parameters and the BPL-level. 5.2 Multivariable Analysis Since we were not able to see any correlation between the rate of success and the fea- tures independently, we seek a joint distrubution such that the success is a function of the background variables x1 , ..., x12 described in section 2.1. success = success(x1 , ..., x12 ) (5.1) In this multivariate case we will use the kernel 3.11 in order to take advantage of the ARD benefits. Since it is difficult for us to prior determine the values of the hyperparameter, we will vary the hyperparameters and find the combination of hyperparameters that minimizes the error when calculating the GP. The error of the GP can be seen in figure 5.1. We can clearly see that the er- ror is converging and hence that the hyperparameters are optimal. The weights of the different features can be seen in Table 5.2 The magnitude does not differ by at 19
least one order of magnitude, and hence we cannot neglect any of them at this stage.
Figure 5.1: The error plotted versus the number of evaluations, when minimizing hyperparameters
Figure 5.2: Weights of different features
0.8872 0.8206 1.1392 1.3566 1.1973 1.2648 1.0076 1.4203 1.3271 1.3759 1.0337 1.2628
The mean and standard deviation cannot easily be displayed in this high-dimensional case in the same way as in the single-variable case. In order to view the result we will instead sample from this GP. We can see the result of the sampling in figure 5.3. The two left plots show the same data, as do the two right plots; within each pair the only difference is the scale of the axes. The x-axis shows the expected value: of the mean in subfigures 1 and 3, and of the standard deviation (std) in subfigures 2 and 4. The corresponding sampled values are plotted on the y-axis. Each black dot corresponds to a series of samples with identical in-data; its x-position shows the expected value for the
given in-data, and its y-position shows the mean of the samples for that in-data. The red lines are the function $f(x) = x$. In subfigures 1 and 2 we can see two points located on those lines. Those points are in fact several points located very close to each other, as we can see in subfigures 3 and 4, where the scale is smaller. We see that the expected mean value does not depend on the in-data, since the function values are independent of the in-data. The sampled means and stds lie very close to the expected values for the same reason.
Figure 5.3: The mean and std, both expected and sampled, of the multivariable GP. We enumerate the subfigures 1, 2, 3, 4, where 1 is the upper left, 2 the upper right, 3 the lower left and 4 the lower right.
In conclusion, we can say with great confidence that there is no correlation between the different hometown traits and the chances of beating the BPL-index.
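Two of the checks in this section can be reproduced with a few lines of code. The sketch below first computes how far apart the twelve weights of figure 5.2 are in orders of magnitude, and then mimics the expected-versus-sampled comparison of figure 5.3 by drawing normal samples for a single input. The predictive mean and std used for the sampling are hypothetical values chosen for illustration, and treating the samples for one input as independent normal draws is an assumption; neither is taken from the thesis code.

```python
import math
import random
import statistics

# Weights as listed in figure 5.2.
WEIGHTS = [0.8872, 0.8206, 1.1392, 1.3566, 1.1973, 1.2648,
           1.0076, 1.4203, 1.3271, 1.3759, 1.0337, 1.2628]


def orders_of_magnitude_spread(weights):
    """Return log10(max/min), the spread of the weights in orders of magnitude."""
    return math.log10(max(weights) / min(weights))


def sample_mean_and_std(expected_mean, expected_std, n_samples, rng):
    """Draw normal samples and return their empirical mean and std."""
    draws = [rng.gauss(expected_mean, expected_std) for _ in range(n_samples)]
    return statistics.fmean(draws), statistics.stdev(draws)


spread = orders_of_magnitude_spread(WEIGHTS)
print(f"max/min weight ratio: {max(WEIGHTS) / min(WEIGHTS):.2f} "
      f"({spread:.2f} orders of magnitude)")

# Hypothetical predictive mean and std for one input (illustration only).
rng = random.Random(0)  # fixed seed so the run is repeatable
expected_mean, expected_std = 1.0, 0.6
sampled_mean, sampled_std = sample_mean_and_std(
    expected_mean, expected_std, 20000, rng)
print(f"mean: expected {expected_mean:.3f}, sampled {sampled_mean:.3f}")
print(f"std:  expected {expected_std:.3f}, sampled {sampled_std:.3f}")
```

A spread below one order of magnitude is consistent with the statement that no feature can be neglected, and sampled values close to the expected ones correspond to points lying near the line $f(x) = x$.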
6. Conclusions
As described in the analysis, we can with great confidence state that there is no correlation between hometown traits and beating the BPL-index. The answer to our research question in section 1.5 is that we cannot improve draft picks with this method and these parameters. This result is not surprising, since we know from previous research that the difference in resources has taken its toll before the NBA draft. Despite the lack of spectacular results, these types of studies are important in order to understand how individuals are affected by the society in which they grow up.
6.1 Recommendations
An interesting next step in this interdisciplinary study, with one leg in machine learning and the other in sociology, would be to use similar machine learning methods on background parameters in industries other than sports. Since data scraping is a very powerful tool for fast data collection and Gaussian Processes are well suited for pattern recognition, only an analysis of which variables to collect is needed in order to apply this method to the analysis of other industries. It would also be interesting to use the same regression on the players’ physical traits.
6.2 General Discussion
The general discussion and criticism can be divided into four parts: sociology, the data, the chosen method, and the heavy reliance on games played as the measure of success.
6.2.1 Sociology
Beginning with the sociology criticism, we want to be transparent about our lack of background and previous knowledge in the field of sociology. We have conducted a brief interview with a Prof. Emer. at the Department of Sociology at Uppsala University to make sure that our interpretations on the sociology side are not way off. The main criticism is that we have not taken the individual player’s family background into account, but rather the background of the area in which they grew up.
6.2.2 Data
We have used data from the towns in which the players were born. Some of the players have most certainly moved to a different town while they were young, and that town would have been more accurate to define as their hometown. We have also used current hometown data. Since the players were drafted between 1990 and 1999, they grew up roughly between 1975 and 1995, which in turn means that our analysis is based on data that is up to almost 40 years off.
6.2.3 Method
The kernel we have used is the squared exponential (SE), or one closely related to SE:
$k(x_i, x_j) = \theta_0 \cdot \exp\left(-\theta_n (x_i - x_j)^2\right)$ (6.1)
This kernel is commonly used and generates a smooth function, as we could see in figure 3.2. Even though these characteristics make this kernel a good choice, we cannot guarantee that our results hold for every possible kernel. We also do not know whether we preserved the characteristics of the data in the pre-processing. Pre-processing is often necessary if the data is too irregular. For example, in figure A.2 almost all points are located at the far left; this is because well-populated cities like New York or Los Angeles are few but large in comparison to most other US cities. Translation and scaling preserve many characteristics of the data and are hence very convenient ways to pre-process the data. However, they are also the only ways in which we have pre-processed our data.
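The translation and scaling mentioned above can be sketched as follows. This is a minimal illustration that assumes the pre-processing amounts to subtracting the mean and dividing by the standard deviation of each variable; the exact transformation used in the thesis is not stated here, and the population figures and the helper names `standardize` and `ranks` are hypothetical.

```python
import statistics


def standardize(values):
    """Translate by the mean and scale by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot scale a constant feature")
    return [(v - mean) / std for v in values]


def ranks(values):
    """Return the rank (0 = smallest) of every entry; ties are not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


# Hypothetical town populations: a few very large cities, many small ones.
populations = [8_000_000, 3_800_000, 250_000, 90_000, 40_000, 12_000, 5_000]
scaled = standardize(populations)

print([round(s, 2) for s in scaled])
# The ordering of the towns is unchanged by translation and scaling.
print(ranks(populations) == ranks(scaled))
```

Because the transformation is affine, it makes the variable dimensionless without changing the shape of its distribution, so a few very large cities still dominate the range after scaling.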
6.2.4 Success measure
Using games played as the sole measure of success is something we have received quite a few questions about during our research. In our opinion it is the “least bad” choice. The most important questions we had to answer for ourselves were:
i) Why do we not use salaries as a measure of success?
ii) Why do we not use more statistics, such as points scored, +/- statistics etc.?
iii) Some players are really great but are injured for long periods of their careers; this will not show in our data.
These are all relevant questions, but they are more problematic than they first appear. Our response to i) is that the players’ salaries are limited in the first place by the “Collective Bargaining Agreement Between the National Basketball Association (NBA) and the National Basketball Players Association (NBPA)” [20]. This agreement includes salary caps, revenue distribution and player contracts, among other things, which leads to a situation where salaries are not a good measure of success. In answer to ii), we do not use other statistics because they tend to favour a certain type of player and could be misleading. One example is that Steve Nash was voted most valuable player (MVP) for the 05-06 season, in which he scored an average of 18.8 points per game (ppg) [4], even though Kobe Bryant scored an average of 35.0 ppg [4]. Both Nash and Bryant are held to be great players and were both top candidates for the 2006 MVP award [22], even though Bryant averaged 86% more ppg than Nash. What they also had in common was that they played many games: Nash played 78 games [4] and Bryant played 80 games [4] that season. To continue on ii), we do not use a combination of statistics, since determining what the most valid combination would look like would be too large a project; we do, however, encourage such research, since it could be of great help in the future.
As for iii), we try to measure how a certain background is reflected in a player’s chances of succeeding, and we have in this research presumed that, for an NBA player, there is no correlation between the chances of becoming injured and hometown background.
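The comparison between Nash and Bryant in the answer to ii) is simple arithmetic and can be checked directly. The sketch below uses only the figures quoted in the text; the helper name `percent_higher` is hypothetical.

```python
def percent_higher(value, baseline):
    """Return by how many percent `value` exceeds `baseline`."""
    return (value - baseline) / baseline * 100.0


# Per-game averages and games played as quoted in the text for the 05-06 season.
nash_ppg, bryant_ppg = 18.8, 35.0
nash_games, bryant_games = 78, 80

print(f"Points per game, Bryant vs Nash: "
      f"{percent_higher(bryant_ppg, nash_ppg):.0f}% higher")
print(f"Games played, Bryant vs Nash: "
      f"{percent_higher(bryant_games, nash_games):.1f}% higher")
```

The two players differ widely in points per game but only marginally in games played, which is the property that motivates games played as the success measure.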
Bibliography
[1] Orley Ashenfelter. Predicting the quality and prices of Bordeaux wine. The Economic Journal, 118(529):F174–F184, 2008.
[2] B. McGarvey, W. F. Gabrielli, P. M. Bentler, and S. A. Mednick. Rearing social class, education, and criminality - a multiple indicator model. Journal of Abnormal Psychology, 90:354–365, 1981.
[3] Kurt Badenhausen. As Stern says goodbye, Knicks, Lakers set records as NBA’s most valuable teams. http://www.forbes.com/sites/kurtbadenhausen/2014/01/22/as-stern-says-goodbye-knicks-lakers-set-records-as-nbas-most-valuable-teams/. [Online; January 2014].
[4] Basketball-Reference.com. NBA and ABA basketball statistics and history. http://www.basketball-references.com. [Online; February 2014].
[5] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[6] Pierre Bourdieu. Le Sens commun. Les Editions de Minuit, 1980.
[7] Pierre Bourdieu. Les Héritiers : Les étudiants et la culture. Les Editions de Minuit, 1984.
[8] Donald Broady. Sociologi och epistemologi. Pierre Bourdieus forfattarskap och den historiska epistemologin. HLS Forlag, 1991.
[9] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[10] Joshua Kjerulf Dubrow and jimi adams. Hoop inequalities: Race, class and family structure background and the odds of playing in the National Basketball Association. International Review for the Sociology of Sport, 2010.
[11] Jac Fitz-enz. ROI of Human Capital. AMACOM, American Management Association, 2000.
[12] gaussianprocesses.org. The Gaussian processes web site. http://www.gaussianprocess.org/. [Online; May 2014].
[13] Robert M. Hauser. Measuring socioeconomic status in studies of child development. Child Development, 65:1541–1545, 1994.
[14] Deborah G. Hoare. Predicting success in junior elite basketball players - the contribution of anthropometric and physiological attributes. Journal of Science and Medicine in Sport, 3:391–405, 2000.
[15] Daniel Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011.
[16] G. A. Kaplan and J. E. Keil. Socioeconomic factors and cardiovascular disease: a review of the literature. Circulation, 88(4):1973–98, 1993.
[17] Wojciech Liponski. World Sports Encyclopedia. Quarto Publishing Group USA, 2003.
[18] Jane D. McLeod and Ronald C. Kessler. Socioeconomic status differences in vulnerability to undesirable life events. Journal of Health and Social Behavior, 31:162–172, 1990.
[19] Richard L. Moreland, Linda Argote, and Ranjani Krishnan. Training people to work in groups. In R. Scott Tindale, Linda Heath, John Edwards, Emil J. Posavac, Fred B. Bryant, Yolanda Suarez-Balcazar, Eaaron Henderson-King, and Judith Myers, editors, Theory and Research on Small Groups, volume 4 of Social Psychological Applications to Social Issues, pages 37–60. Springer US, 2002.
[20] NBA.com. Highlights of the 2011 collective bargaining agreement between the National Basketball Association (NBA) and the National Basketball Players Association (NBPA). http://www.nba.com/media/CBA101_9.12.pdf. [Online; March 2014].
[21] Gianmarco I. P. Ottaviano and Giovanni Peri. Cities and cultures. Journal of Urban Economics, 58:304–337, 2005.
[22] Kevin Pelton. Numbers don’t lie. http://sportsillustrated.cnn.com/2006/writers/82games/04/13/mvp/2.html. [Online; April 2014].
[23] Peter Tingling, Kamal Masri, and Matthew Martell. Does order matter? An empirical analysis of NHL draft decisions. Sport, Business and Management: An International Journal, 1:155–171, 2011.
[24] php.net. PHP: Documentation. http://php.net/docs.php. [Online; March 2014].
[25] Sargur N. Srihari and Edward J. Kuebert. Integration of hand-written address interpretation technology into the United States Postal Service remote computer reader system. In Proceedings of the 4th International Conference on Document Analysis and Recognition, ICDAR ’97, pages 892–896, Washington, DC, USA, 1997. IEEE Computer Society.
[26] Angus Stevenson. Oxford Dictionary of English. Oxford University Press, 2010.
[27] Robert I. Sutton. The No Asshole Rule: Building a Civilized Workplace and Surviving One That Isn’t. Business Plus, 2006.
[28] T. W. H. Ng, L. T. Eby, K. L. Sorensen, and D. C. Feldman. Predictors of objective and subjective career success: A meta-analysis. Personnel Psychology, 58:367–408, 2005.
[29] Wikipedia. Lega Basket Serie A. http://en.wikipedia.org/wiki/Lega_Basket_Serie_A. [Online; May 2014].
[30] Wikipedia. List of National Basketball Association arenas. http://en.wikipedia.org/wiki/List_of_National_Basketball_Association_arenas. [Online; May 2014].
[31] wikipedia.com. Artificial neural network. http://en.wikipedia.org/wiki/Artificial_neural_network#Criticism. [Online; April 2014].
[32] M. A. Winkleby, D. E. Jatulis, E. Frank, and S. P. Fortmann. Socioeconomic status and health: how education, income, and occupation contribute to risk factors for cardiovascular disease. American Journal of Public Health, 82:816–820, 1992.
A. Graphs
This appendix contains graphs representing a single hometown parameter versus success, where both axes are dimensionless and normalized.
Figure A.1: Highschool Graduate. Plot title: Highschool Graduate vs Success. x-axis: Quantitative representation of Highschool Graduate. Plot label: BPL-level. Legend: 90 % Confidence interval; 80 % Confidence interval; 70 % Confidence interval; 60 % Confidence interval; 51 % Confidence interval; Player; Mean games played per draft position; US median (vertical); NBA-players median (vertical).
The legend of figure A.1 also applies to the following figures. A more detailed description of the figures is given in Chapter 4.
Figure A.2: Population. Plot title: Population vs Success.
Figure A.3: Income per Capita. Plot title: Income per Capita vs Success. x-axis: Quantitative representation of Income per Capita. Plot label: BPL-level.
Figure A.4: Household Income. Plot title: Household Income vs Success. x-axis: Quantitative representation of Household Income. Plot label: BPL-level.
Figure A.5: Home Appreciation. Plot title: Home Appreciation vs Success. x-axis: Quantitative representation of Home Appreciation. Plot label: BPL-level.
Figure A.6: Homes Owned. Plot title: Homes Owned vs Success. x-axis: Quantitative representation of Homes Owned. Plot label: BPL-level.
Figure A.7: Housing Vacant. Plot title: Housing Vacant vs Success. x-axis: Quantitative representation of Housing Vacant. Plot label: BPL-level.
Figure A.8: Homes Rented. Plot title: Homes Rented vs Success. x-axis: Quantitative representation of Homes Rented. Plot label: BPL-level.
Figure A.9: Violent Crime. Plot title: Violent Crime vs Success. x-axis: Quantitative representation of Violent Crime. Plot label: BPL-level.
Figure A.10: 2 Year College. Plot title: 2 Year College vs Success. x-axis: Quantitative representation of 2 Year College. Plot label: BPL-level.
Figure A.11: 4 Year College. Plot title: 4 Year College vs Success. x-axis: Quantitative representation of 4 Year College. Plot label: BPL-level.
Figure A.12: Graduate Degree. Plot title: Graduate Degree vs Success. x-axis: Quantitative representation of Graduate Degree. Plot label: BPL-level.
B. Data variables
Population; Pop. Density; Pop. Change; Median Age; Households; Household Size; Male Population; Female Population; Married Population; Single Population; Air Quality; Water Quality; Superfund Sites; Physicians per 100k; Unemployment Rate; Recent Job Growth; Future Job Growth; Sales Taxes; Income Taxes; Income per Cap.; Household Income; Income Less Than 15K; Income between 15K and 25K; Income between 25K and 35K; Income between 35K and 50K; Income between 50K and 75K; Income between 75K and 100K; Income between 100K and 150K; Income between 150K and 250K; Income between 250K and 500K; Income greater than 500K; Management, Business, and Financial Operations; Professional and Related Occupations; Service; Sales and Office; Farming, Fishing, and Forestry; Construction, Extraction, and Maintenance; Production, Transportation, and Material Moving; Median Home Age; Median Home Cost; Home Appreciation; Homes Owned; Housing Vacant; Homes Rented; Property Tax Rate; Property Tax Rate Less Than $20,000; Property Tax Rate $20,000 to $39,999; Property Tax Rate $40,000 to $59,999; Property Tax Rate $60,000 to $79,999; Property Tax Rate $80,000 to $99,999; Property Tax Rate $100,000 to $149,999; Property Tax Rate $150,000 to $199,999; Property Tax Rate $200,000 to $299,999; Property Tax Rate $300,000 to $399,999; Property Tax Rate $400,000 to $499,999; Property Tax Rate $500,000 to $749,999; Property Tax Rate $1,000,000 or more; 1999 to October 2005; 1995 to 1998; 1990 to 1994; 1980 to 1989; 1970 to 1979; 1960 to 1969; 1950 to 1959; 1940 to 1949; 1939 or Earlier; Violent Crime; Property Crime; Rainfall (in.); Snowfall (in.); Precipitation Days; Sunny Days; Avg. July High; Avg. Jan. Low; Comfort Index (higher=better); UV Index; Elevation ft.; School Expend.; Pupil/Teacher Ratio; Students per Librarian; Students per Counselor; 2 yr College Grad.; 4 yr College Grad.; Graduate Degrees; High School Grads.; Commute Time; Auto (alone); Carpool; Mass Transit; Work at Home; Commute Less Than 15 min.; Commute 15 to 29 min.; Commute 30 to 44 min.; Commute 45 to 59 min.; Commute greater than 60 min.; Overall; Food; Utilities; Miscellaneous; Percent Religious; Catholic; Protestant; LDS; Baptist; Episcopalian; Pentecostal; Lutheran; Methodist; Presbyterian; Other Christian; Jewish; Eastern; Islam; Democrat; Republican; Independent; Other