Soccer Analytics - SFU Summit
Soccer Analytics by Lucas Yifan Wu M.Sc., Simon Fraser University, 2018 B.Sc., Simon Fraser University, 2017 Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Department of Statistics and Actuarial Science Faculty of Science © Lucas Yifan Wu 2022 SIMON FRASER UNIVERSITY Fall 2022 Copyright in this work is held by the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.
Declaration of Committee
Name: Lucas Yifan Wu
Degree: Doctor of Philosophy
Thesis title: Soccer Analytics
Committee:
Chair: Liangliang Wang, Associate Professor, Statistics and Actuarial Science
Timothy Swartz, Supervisor, Professor, Statistics and Actuarial Science
Boxin Tang, Committee Member, Professor, Statistics and Actuarial Science
Oliver Schulte, Examiner, Professor, Computing Science
Ian McHale, External Examiner, Professor, Management School, University of Liverpool
Abstract
This thesis consists of a compilation of four projects all related to soccer. The first short chapter investigates how to obtain reliable speed measurements from player tracking data. The second chapter considers the problem of crossing the ball in soccer. In recent years, some research suggests that there exists a negative correlation between crossing and scoring. However, correlation does not imply causation. There are various factors that affect the decision of crossing. In the crossing problem, an experimenter cannot assign whether a player crosses or does not cross the ball during a particular crossing opportunity because matches are observational studies. For this reason, we use a causal inference framework to investigate the causal relationship between crossing and shots. Our findings suggest that crossing remains an effective tactic for increasing shot probabilities. The third chapter considers the evaluation of off-the-ball actions in soccer. There are numerous statistics and metrics that have been proposed to evaluate the performance of players in team sports based on actions involving the ball. In soccer, a player typically has possession of the ball for less than three minutes during a game. In this chapter, we develop methods that analyze the activities of players that are "off-the-ball". Then a defensive anticipation metric is developed based on the tenet that moving faster to the expected location is better than moving slower. The last chapter considers the problem of pitch control in soccer. With the availability of tracking data, one of the most intriguing ideas in soccer is to model how much space the player or the team owns at any given time, which is known as pitch control or field ownership in the soccer analytics community. This project first conducts a literature review on various approaches for the determination of pitch control and introduces a new field ownership metric that takes into account associated movement dynamics, such as speed, acceleration and change of direction.
Keywords: Sports Analytics; Player Tracking Data; Causal Inference; Machine Learning; Pitch Control.
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my senior supervisor Dr. Tim Swartz as I am deeply indebted to his continual support and guidance. This thesis would not have been possible without him. He saw the potential in me, drafted me as his PhD student and encouraged me to pursue a career in Sports Analytics. I am extremely grateful to my examining committee, Dr. Boxin Tang, Dr. Oliver Schulte and Dr. Ian McHale, for their thorough reading and valuable comments on my thesis. Special thanks to Dr. Liangliang Wang for chairing my defence.
I would also like to thank my All-Star teammates Dani Chu, Matthew Reyers, James Thomson and Meyappan Subbaiah. Without these amazing teammates, it would have been impossible to win the Big Data Bowl. Many thanks to the former and current SFU Sports Analytics members who help to make SFU a Sports Analytics hub, Dr. Dave Clarke, Dr. Peter Chow-White, Dr. Tim Swartz, Dr. Thomas Loughin, Dr. Luke Bornn, Dr. Oliver Schulte, Dr. Peter Tingling, Dr. Aaron Danielson, Dr. Harsha Perera, Dr. Jacob Mortensen, Dr. Nate Sandholtz, Sarah Bailey, Matthew Van Bommel, Steven Wu, Peter Tea, Kevin Floyd, Robert Nguyen, Denis Beausoleil, Daniel Daly Grafstein, Chris Li, Ken Peng, Nirodha Espasinghege Dona, Aaron Pearson, Robyn Ritchie, Ryker Moreau, Elijah Cavan, Brendan Kumagi, James Thomson, Dani Chu and Matthew Reyers.
I am grateful to all the faculty members in the department of Statistics and Actuarial Science who oversaw a kid hanging around for years, especially Dr. Dave Campbell for sparking my interest in machine learning. In addition, I would like to thank all my lovely friends and fellow MSc and PhD students for all the tears, laughter, fears and hopes we shared.
I would like to extend my sincere thanks to Dr. Doug Fearing, Dr. Luke Bornn and all of my co-workers at Zelus Analytics for their support and help throughout the pandemic. Special shout-out to COVID-19, which made everyone's life much more difficult, but we have grown stronger together.
Last but not least, I would like to thank my girlfriend and family, especially my parents, for their unconditional love and support.
Table of Contents
Declaration of Committee
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Introduction
1.2 Organization of the thesis
2 The Calculation of Player Speed from Tracking Data
2.1 Introduction
2.2 Data
2.2.1 Speed Calculations
2.3 Exploratory Analyses
2.3.1 Soccer Example
2.3.2 NFL Football Example
2.4 Discussion
3 A Contextual Analysis of Crossing the Ball in Soccer
3.1 Introduction
3.2 Data Preprocessing
3.2.1 Defining Crossing Opportunities
3.2.2 Crafting Situational Variables
3.2.3 Outcome Variable
3.3 A Model for the Crossing Decision
3.4 The Intended Target Model
3.5 Causal Inference
3.5.1 Propensity Score Matching
3.5.2 Results
3.6 Discussion
4 Evaluation of Off-the-Ball Actions in Soccer
4.1 Introduction
4.2 Data
4.3 Methods
4.3.1 Rationale of the Approach
4.3.2 Prediction of Velocities
4.3.3 Computational Overview
4.3.4 Derivation of a Metric for Defensive Anticipation
4.4 Results and Assessment
4.4.1 Reliability
4.4.2 Validity
4.5 Discussion
4.5.1 Connections to Existing Literature
4.5.2 Future Research
5 Pitch Control
5.1 Introduction
5.2 Literature Review
5.3 A New Metric for Pitch Control
5.3.1 Criteria for Pitch Control
5.3.2 Timing of the Ball
5.3.3 Timing of Players
5.4 An Example
5.4.1 Computation
5.5 Accuracy of the Metric
5.6 Discussion
Bibliography
Appendix A Code used for Establishing Pitch Control
List of Tables
Table 2.1 A sample of soccer tracking data from the CSL.
Table 2.2 A sample of football tracking data from the NFL.
Table 3.1 A subset of situational variables relevant to crossing which form the columns of the design matrix Z. All distances are measured in metres.
Table 3.2 Estimates and standard errors for the parameters corresponding to model (3.1). The third column provides the estimate multiplied by the mean value of its corresponding covariate. The fourth column marginal effect is the product of the estimate and the standard deviation of the corresponding z terms.
Table 3.3 The key situational variables that are relevant to crossing success as modeled in Section 3.4. All distances are measured in metres, speed is measured in metres/second, angles are measured in degrees, and areas are measured in squared metres.
Table 3.4 Estimates of the parameters from the intended target model and other related statistics. The estimates describe associations between spatio-temporal features and the successful completion of an attempted cross.
Table 4.1 The defensive anticipation metric P calculated during even and odd weeks for players on Shandong Luneng during the 2019 season.
Table 4.2 The defensive anticipation metric P given by (4.2) for 10 players on Shandong Luneng who received the most playing time during the 2019 CSL season. We also provide comparison metrics involving aggression during the 2019 season, namely the total number of fouls committed, tackles made and the number of interceptions.
Table 5.1 The determination of pitch control at a given location given time inequalities involving $t_b$, $t_h$ and $t_r$.
Table 5.2 The classification of 7901 intended passes according to whether pitch control (PC) was designated to the intended team, the opponent or neither team.
List of Figures
Figure 2.1 Path of a player over a 29-second interval based on location data recorded at 10 hertz.
Figure 2.2 Estimated speed (∆ = 1) of the player corresponding to the path in Figure 2.1 over a 29-second interval.
Figure 2.3 Estimated speed (∆ = 4) of the player corresponding to the path in Figure 2.1 over a 29-second interval.
Figure 2.4 The red-lined plots correspond to speed and acceleration estimates (∆ = 1) for Brandin Cooks of the NFL during a 7-second time interval. The analogous blue-lined plots correspond to ∆ = 2.
Figure 3.1 Examples of possession sequences (a) with a crossing attempt and (b) without a crossing attempt.
Figure 3.2 Panels (a) and (b) present output from the intended target model. These diagrams provide a way for teams to study the spatial configurations of players and the ball during crossing opportunities.
Figure 3.3 The directed acyclic graph describes the crossing problem. The variables $Z_T$ are causes of T, but not Y. The variables $Z_{TY}$ are common causes for T and Y. And, the variables $Z_Y$ are causes for Y, but not T.
Figure 3.4 After matching, histograms of the two groups (treatment and control) are depicted where the horizontal variable is the propensity score.
Figure 3.5 After matching, smoothed plots of the shot variable Y for both groups with respect to the propensity score.
Figure 4.1 Correlation of predicted speed at time t and actual speed at time t−∆ where time is measured in seconds. The blue dashed line corresponds to the selected value ∆ = 0.5 seconds.
Figure 4.2 Geometric diagram which illustrates the components of the statistic p in equation (4.1). Imagine a player who is located at the origin (0, 0). The observed velocity of the player is shown by the blue vector pointing towards (2, 4). The predicted velocity of an average player is shown by the yellow vector pointing towards (8, 4). The perpendicular line indicates the projection of the observed velocity vector on the predicted velocity vector. Using equation (4.1), the defensive anticipation value, p, is equal to −0.6, which can be interpreted as a 60% reduction compared to the average player.
Figure 4.3 Plot of predicted velocities (purple arrows) and observed velocities (black arrows) at a given instant in time. The blue team is in possession, the yellow team is defending and the red dot corresponds to the ball.
Figure 4.4 Density plots of (4.2) based on playing position. For each player, the defensive anticipation metric (4.2) was calculated for all matches in the 2019 CSL season. We observe that central midfielders have slightly larger defensive anticipation values than other players on average, and there is more variability amongst the forwards than the other playing positions.
Figure 4.5 Scatterplots of the defensive anticipation metric (4.2) plotted against player interceptions and tackles made during the 2019 CSL season.
Figure 4.6 Plot of the defensive anticipation metric (4.2) averaged over all CSL players during 10-minute intervals.
Figure 5.1 Voronoi diagram based on n = 5 points generated on the unit square.
Figure 5.2 Voronoi diagram applied to a given snapshot of a soccer game based on the location of the 22 players on the pitch. The shaded orange and purple areas correspond to the dominant regions for the home and away teams, respectively.
Figure 5.3 The distribution of maximum speed and maximum acceleration of all players in the Chinese Super League in 2019.
Figure 5.4 Current velocity vectors for the example depicted in Figure 5.2.
Figure 5.5 The left plot uses colors to depict the time that it takes a stationary player to reach field locations given the current location marked with a dot. The right plot does likewise but introduces an initial velocity (arrow) for the player.
Figure 5.6 Pitch control diagram using the proposed methods for the example depicted in Figure 5.2.
Chapter 1
Introduction
1.1 Introduction
Sports analytics is an emerging field that combines sports with multidisciplinary knowledge and expertise, such as statistics, computing science, sports science and business, to support decisions in player evaluation, injury prevention, business operations, etc. The book and movie Moneyball was one of the first influences that put sports analytics in front of the eyes of the public. The Moneyball movement started in baseball and swept across multiple sports within a few years.
Humans are often clouded by personal judgement when making decisions. This was also featured in the movie Moneyball, "People are overlooked for a variety of biased reasons and perceived flaws - age, appearance, personality." One of the common biases is recency bias, where we tend to weight the most recent event more heavily than we should. For example, when a player has just made a poster dunk, we are more likely to remember that highlight and downweight the fact that he gave away five easy layups to his opponents earlier. In the end, we might only remember one moment of brilliance and come away with the perception that the player had an amazing game. It is fairly easy to find many similar examples in sports of how this type of bias can hinder player evaluation.
Baseball is one of the earliest sports that embraced the idea of using numbers to inform decisions. Moving beyond relying on pure instinct to evaluate players was a huge leap for sports analytics. Back in the early days, the only available data were box score statistics involving summary statistics of a few categories. As teams recognized the importance of getting more granular data, they began to collect event data, which provide finer details on the sequence of events and the players involved in each recorded event. I would argue that this was the first wave of evolution in sports analytics, where there was a shift of mindset to adopt numbers to analyze player performance objectively. The second wave of evolution came with the accessibility of tracking data. Although event data provide a rich amount of contextual information, event data do not describe what other players are doing when they do not possess the ball or are not involved
in the recorded event. Tracking data fill the gap by collecting detailed information, such as the x,y coordinates, of the ball and all players on the field multiple times per second. The availability of spatio-temporal tracking data unlocks a new world for researchers to explore and lets them tackle questions that they were previously unable to answer. Plenty of interesting research has been done using tracking data in baseball, basketball, soccer and football since then.
1.2 Organization of the thesis
In this thesis, there are four chapters that follow. The common theme connecting these chapters is soccer analytics, where we identify interesting research problems in soccer and attempt to solve them using statistics and computing. One of the challenges among these chapters is handling the enormously rich data sets in soccer that track every player's detailed movement on the field at a rate of multiple times per second.
Chapter 2 is a short chapter which investigates how to obtain reliable speed measurements from player tracking data. This chapter has been published as the following research article:
• Wu, L. and Swartz, T.B. (2022). The calculation of player speed from tracking data. International Journal of Sports Science & Coaching, 0(0).
Chapter 3 considers the problem of crossing the ball in soccer. In recent years, some research suggests that there exists a negative correlation between crossing and scoring. However, correlation does not imply causation. There are various factors that affect the decision of crossing, including the position of the cross, the defensive pressure on the crosser, the distance between the crosser and his teammates, the score differential, the number of defenders in the box, etc. In general, randomized controlled trials are the gold standard approach to estimate the causal effects of a treatment on an outcome. In the crossing problem, an experimenter cannot assign whether a player crosses or does not cross the ball during a particular crossing opportunity because matches are observational studies. For this reason, we use a well-established method under the causal inference framework, propensity score matching, to investigate the causal relationship between crossing and shots. This is one of the few papers that considers a causal inference approach in team sport, and it utilizes player tracking data to identify and measure confounding variables. Our findings suggest that crossing remains an effective tactic for increasing shot probabilities. This chapter has been published as the following research article:
• Wu, L., Danielson, A., Hu, J.X. and Swartz, T.B. (2021). A contextual analysis of crossing the ball in soccer. Journal of Quantitative Analysis in Sports, 17(1), 57-66.
Chapter 4 considers the evaluation of off-the-ball actions in soccer. There are numerous statistics and metrics that have been proposed to evaluate the performance of players in
team sports based on actions involving the ball. In soccer, a player typically has possession of the ball for less than three minutes during a game. In this chapter, we develop methods that analyze the activities of players that are "off-the-ball". Specifically, we propose a metric to measure defensive anticipation in soccer. The analogy in chess is that when you are planning your next move, you always try to anticipate the moves of your opponent. Similarly in soccer, we try to conceptualize the idea of anticipation for defensive players using expected movements at the next moment given a snapshot of the game. The expected movement at the next moment is a function of the spatio-temporal snapshot of the match prior to that moment in time. This provides a new way to evaluate the performance of players off-the-ball. We used machine learning models to learn the non-linear relationship between the contextual variables and velocity from a massive set of game instances. The output from the model, which we term the predicted (expected) velocity, represents where the player is expected to move and how fast he is expected to move on average. Then a metric is developed by comparing the player's actual velocity with the predicted velocity of a typical player in this situation. The interpretation of the defensive anticipation metric is based on the tenet that moving faster to the expected location is better than moving slower. This chapter is under revision at Statistica Applicata - Italian Journal of Applied Statistics:
• Wu, L. and Swartz, T.B. (2022). Evaluation of off-the-ball actions in soccer. Manuscript under review.
Chapter 5 considers the problem of pitch control in soccer. With the availability of tracking data, one of the most intriguing ideas in soccer is to model how much space a player owns at any given time, which is known as pitch control or field ownership in the soccer analytics community. This chapter first reviews various approaches for the determination of pitch control and introduces a new metric that takes into account associated movement dynamics of the ball and players. With the pitch control model, we can determine whether the home team, the road team or neither team has control at any given location on the field. This approach is generally applicable to invasion sports and is illustrated in the context of soccer. This chapter has been submitted to Scientific Reports:
• Wu, L. and Swartz, T.B. (2022). A New Metric for Pitch Control based on an Intuitive Motion Model. Manuscript under review.
Chapter 2
The Calculation of Player Speed from Tracking Data
2.1 Introduction
In the past decade, the advent of player tracking data has sparked a revolution in sports analytics (Morgulev, Azar and Lidor 2018). With player tracking data, analysts have access to the Cartesian coordinates of each player on the pitch where the observations are recorded frequently (e.g. 10 times per second). The availability of such detailed data provides opportunities to investigate sporting questions that were previously unimaginable. Gudmundsson and Horton (2017) provide a review paper on spatio-temporal analyses used in invasion sports where player tracking data are available.
Currently, player tracking systems are expensive, and consequently, tracking data are only collected in "big" sports such as basketball (the National Basketball Association), soccer (various leagues and competitions), football (the National Football League) and hockey (the National Hockey League). Tracking data are not only collected during matches but also during workout sessions where fitness, training and health considerations are main concerns.
Tracking data are typically proprietary and are supplied by service providers using various technologies (Torres-Ronda et al. 2022). There are four prominent technologies: (1) global positioning systems (GPS), (2) local positioning systems (LPS), (3) inertial measurement units (IMU) and (4) optical tracking (OT) systems. OT systems are fundamentally different as they do not require wearable devices and do not directly determine player coordinates. Instead, OT technology requires advanced camera systems and player recognition software to evaluate player coordinates. No matter which technology is utilized, tracking systems begin with the collection of the (x, y) coordinates of participants measured at frequent time intervals. With the coordinates, various statistics can be calculated or approximated (e.g. speed, acceleration, distance travelled, etc.).
In this paper, we are concerned with derivative calculations associated with tracking data coordinates. Specifically, we are interested in the approximation of player speed which is an important statistic in sports analytics and sports science. For example, Wu and Swartz (2022) require player speeds in soccer to assess off-the-ball activity. They introduce a measure which addresses defensive anticipation. Buchheit et al. (2014) use regression methodology to determine factors that are associated with player speed in soccer. For example, horizontal force and horizontal power were seen to be associated with speed. Oliva-Lozano et al. (2020) characterize positional differences in soccer based on acceleration and sprint profiles. Related to speed, Shen, Santo and Akande (2022) analyze pace of play in soccer, and conclude that pace increases with decreasing team quality, which indicates the importance of playing with pace. From a training and performance perspective, Ferrari Bravo et al. (2008) demonstrate that sprint-training significantly increases both aerobic and anaerobic performances in soccer. Naturally, different applications require different levels of accuracy. For example, in sports science, critical velocity is an active research field which relies on highly accurate measurements of speed (Peng, Clarke and Swartz 2022).
Much has been written on the accuracy of various tracking data technologies. For example, Mara et al. (2017) considered the displacement accuracy of an OT system, Tan, Polglaze and Peeling (2021) investigated the validity and accuracy of a GPS system, and Pino-Ortega et al. (2022) provided a review of the validity and reliability of LPS systems against other devices. Massard, Eggars and Lovell (2017) questioned the need for sprint testing based on the comparison of GPS match and field-testing data. However, all of these investigations rely on some measure of the truth against which tracking measurements are compared.
What should experimenters do if they do not have access to the truth and they are unsure of the accuracy of speed calculations obtained from tracking data? This paper introduces some simple principles from exploratory data analysis that assist experimenters in obtaining more reliable estimates of speed.
In Section 2.2, we describe the datasets upon which our methods are illustrated, and we describe how player speed is calculated from tracking data coordinates. In Section 2.3, some simple exploratory plots are introduced that help the analyst obtain more reliable speed calculations. We conclude with a short discussion in Section 2.4.
2.2 Data
We have access to tracking data from matches during the 2019 season of the Chinese Super League (CSL). The CSL uses OT technology (previously discussed) provided by Stats Perform where observations were recorded 10 times per second. The tracking data consist of roughly one million rows per match measured on 7 variables. Each row corresponds to a particular player at a given instant in time. The soccer tracking data were initially provided as xml files, and were processed in R for further analysis. In Table 2.1, we present three rows of the soccer tracking data. Here we observe x-y coordinates and player identifiers at every 1/10th of a second. The entries are mostly intuitive except perhaps for the x-y coordinates which refer to the player location on a 105m by 68m soccer field. For example, (x, y) = (−52.5, 0) corresponds to the middle of the goal line on the left-hand side of the soccer field.

gameID                Time  x   y     IdActor  IsBall  IdHalf  JerseyNumber
WUHAN-BEIJI-01032019  30    -4  -9.6  345354   FALSE   1       25
WUHAN-BEIJI-01032019  30.1  -4  -9.5  345354   FALSE   1       25
WUHAN-BEIJI-01032019  30.2  -4  -9.4  345354   FALSE   1       25

Table 2.1: A sample of soccer tracking data from the CSL.

Our second dataset corresponds to tracking data from the National Football League (NFL). Unlike the OT soccer data, the NFL data were based on GPS technology, but were also collected using 10 hertz sampling frames. The data were used in the 2019 Big Data Bowl competition and are publicly available at https://github.com/nfl-football-ops/Big-Data-Bowl. Here we use data corresponding to a single deep pass play by the wide receiver Brandin Cooks of the New England Patriots taken from a 7-second interval during the September 7, 2017 match against the Kansas City Chiefs. In Table 2.2, we present three rows of the football tracking data. Here we observe a similar structure to the tracking data in soccer. The football tracking data include the x-y coordinates for players measured in yards where x refers to the player position along the long axis of the field ranging from 0 to 120 yards, and y refers to the player position along the short axis of the field ranging from 0 to 53.3 yards. For instance, (x, y) = (0, 0) corresponds to the bottom left of the football field. The remaining variables in Table 2.2 are mostly intuitive where dis corresponds to distance travelled from the previous frame (i.e. previous 1/10th second) and dir corresponds to the angle of player motion in degrees. The frame.id is the frame identifier for each frame which resets to 1 for each play.

gameId      playId  frame.id  x      y      dis   dir     event         playerId  displayName    jerseyNumber
2017090700  160     40        53.78  10.82  0.77  239.36  pass_forward  2543498   Brandin Cooks  14
2017090700  160     41        53.11  10.45  0.76  238.66  NA            2543498   Brandin Cooks  14
2017090700  160     42        52.44  10.08  0.76  237.76  NA            2543498   Brandin Cooks  14

Table 2.2: A sample of football tracking data from the NFL.

2.2.1 Speed Calculations
We emphasize that the approach that we introduce is general and straightforward. It can be utilized using any tracking technology in any sport. However, knowledge of the sport dictates our interpretation of the exploratory plots.
Consider then a particular player where our interest concerns the calculation of their speed. If (x(t), y(t)) denotes the location of the player at time t, then the player’s speed at time t is defined by

$$ s(t) = \lim_{\Delta \to 0} \frac{\sqrt{(x(t+\Delta) - x(t-\Delta))^2 + (y(t+\Delta) - y(t-\Delta))^2}}{2\Delta}. \tag{2.1} $$

In words, formula (2.1) is the limiting change in distance travelled with respect to time. Of course, (2.1) is a mathematical expression based on taking a limit, and is not a quantity that can be calculated from data. Instead, with tracking data, the player’s locations are obtained at regular times which are denoted by (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Here, the subscripts i = 1, ..., n of the Cartesian coordinates refer to the time increments. Therefore, assuming that t corresponds to an observed time increment from the tracking data, it is reasonable to approximate s(t) in (2.1) by

$$ \hat{s}(t) = \frac{\sqrt{(x_{t+\Delta} - x_{t-\Delta})^2 + (y_{t+\Delta} - y_{t-\Delta})^2}}{2\Delta} \tag{2.2} $$

where ∆ = 1, 2, ... is an increment that needs to be specified. In our illustration with 10 hertz data, the value ∆ = 1 corresponds to 1/10th of a second.

We have simplified the discussion by referring to speed. The approximation of velocity is also of interest where velocity has a directional component in addition to the scalar quantity speed. Note that acceleration calculations are also important, and are obtained as derivatives of speed.

2.3 Exploratory Analyses

Whereas the estimand s(t) in (2.1) is an instantaneous speed, its estimate ŝ(t) in (2.2) is an average speed taken over the time period 2∆. It may therefore appear that smaller values of ∆ will yield better estimates. However, this needs to be balanced against the fact that player coordinates (x_t, y_t) are subject to measurement error as is the time interval 2∆. Therefore, inaccuracies in the speed estimates are propagated from inaccuracies in the raw data.
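As an illustration of the estimate (2.2), the following is a minimal sketch and not the code used in this work. The function name central_speed and the argument hz are hypothetical; the sketch assumes frames are equally spaced and converts the increment ∆ (counted in frames) to seconds using the sampling rate.

```python
import math

def central_speed(xs, ys, delta=1, hz=10.0):
    """Estimate speed at each frame with the central difference in (2.2).

    xs, ys : coordinate sequences of equal length, one entry per frame
    delta  : half-width of the window in frames (the increment in (2.2))
    hz     : sampling rate in frames per second (assumed to be 10 here)
    Returns a list with None wherever the window runs off either end.
    """
    if delta < 1:
        raise ValueError("delta must be a positive integer")
    n = len(xs)
    window_seconds = 2 * delta / hz  # time elapsed between the two end frames
    speeds = [None] * n
    for t in range(delta, n - delta):
        dx = xs[t + delta] - xs[t - delta]
        dy = ys[t + delta] - ys[t - delta]
        speeds[t] = math.hypot(dx, dy) / window_seconds
    return speeds
```

With coordinates in metres the result is in metres per second. The first and last ∆ frames have no estimate because the window in (2.2) is centred on t.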
To theoretically investigate the magnitude of error in speed via measurement error in the numerator of (2.2), we consider the true speed ∆_l/(2∆), which denotes the change in location ∆_l divided by the change in time, where ∆ denotes the previously defined incremental step size in time. With measurement error present, we denote the observed speed (∆_l ± E)/(2∆) where E denotes a fixed error in the location measurement corresponding to the device.
Then relative error RE is given by

$$ RE = \frac{\left| \Delta_l/(2\Delta) - (\Delta_l \pm E)/(2\Delta) \right|}{\Delta_l/(2\Delta)} = \frac{|E|}{\Delta_l}. \tag{2.3} $$

We note that the relative error (2.3) is smaller for larger speeds (i.e. greater changes in location ∆_l). For example, when ∆ = 1, consider a true location displacement ∆_l = 8 metres which is incorrectly measured as 9 metres. Then the actual speed is 8.0 metres/sec (fast), the observed speed is 9.0 metres/sec, and the measurement error is E = 1 metre. This results in relative error RE = 0.125. For contrast, when ∆ = 1, consider a true location displacement ∆_l = 2 metres which is incorrectly measured as 3 metres. Then the actual speed is 2.0 metres/sec (slow), the observed speed is 3.0 metres/sec, and the measurement error is E = 1 metre. This results in relative error RE = 0.50.

2.3.1 Soccer Example

To begin our investigation, Figure 2.1 provides a plot of the locations of a player from the CSL dataset taken during a 29-second interval where he is known to be running fast during portions of the interval. When a player is running fast, it is physically impossible to make sharp turns, and therefore, the smoothness of the path suggests apparent accuracy in the location measurements.

Figure 2.1: Path of a player over a 29-second interval based on location data recorded at 10 hertz.

However, when we take the path locations in Figure 2.1, and estimate speeds (2.2) using ∆ = 1, there seems to be a significant accuracy problem. Figure 2.2 provides a plot of estimated speed versus time for the selected path. In Figure 2.2, we observe that there are many instances where a player has a recorded speed which increases (or decreases) by
roughly 1.0 metre per second in the subsequent 1/10th second, and then returns to the baseline speed 1/10th of a second later. When speeds are recorded in the (0,8) metres per second range, frequent fluctuations of this magnitude do not seem plausible. The problem here is that the location measurements were recorded to one decimal place on the metres scale, and therefore, there is inaccuracy in (2.2) when dividing by 2∆ which corresponds to 0.2 seconds.

Figure 2.2: Estimated speed (∆ = 1) of the player corresponding to the path in Figure 2.1 over a 29-second interval. (Axes: estimated speed in m/s versus time elapsed in seconds.)

A remedy to the estimation of the instantaneous speed s(t) is to increase the time increment ∆ surrounding t. Increasing the length of the time interval 2∆ results in less fluctuation in the estimated speeds which is desirable. However, this is done at the expense of moving in the direction from instantaneous speeds to average speeds. We have found that the approximation ∆ = 4 works well in this application. Figure 2.3 provides the analogous plot to Figure 2.2 where the time intervals have been widened to intervals of length 0.8 seconds. In Figure 2.3, we observe that the fluctuations are less pronounced, and that the plot of estimated speed versus time is smoother. For example, the fluctuations during the interval 16-18 seconds in Figure 2.2 are less believable than what is observed in Figure 2.3.

We refer back to the theoretical analysis of relative error at the beginning of Section 2.3. In this example, we have seen that we prefer the time increment ∆ = 4 over ∆ = 1. With ∆ = 4, speed ∆_l/(2∆) = 8 metres/sec and location measurement error E = 1 metre, this implies ∆_l = 64 metres and relative error RE = E/64 = 0.015625. With ∆ = 1, speed ∆_l/(2∆) = 8 metres/sec and location measurement error E = 1 metre, this implies ∆_l = 16 metres and relative error RE = E/16 = 0.0625.
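The two relative error calculations above can be reproduced with a few lines of code. This is a minimal sketch with a hypothetical function name, relative_error, which follows the convention of the text that speed equals displacement divided by 2∆ and applies (2.3).

```python
def relative_error(speed, delta, error):
    """Relative error (2.3) of a speed estimate.

    speed : true speed, taken as displacement / (2 * delta) as in the text
    delta : time increment used in (2.2)
    error : fixed location measurement error E, in metres
    """
    displacement = speed * 2 * delta  # the displacement implied by the speed
    return abs(error) / displacement

for delta in (1, 4):
    print(delta, relative_error(speed=8.0, delta=delta, error=1.0))
# prints:
# 1 0.0625
# 4 0.015625
```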
Therefore, ∆ = 4 is preferred over ∆ = 1 in reducing relative error. This exercise can be repeated for any speed. Issues which arise in speed measurements are a consequence of the fact that speed is the derivative of position, and that position is not measured with sufficient accuracy. In applications where acceleration measurements are important, one can imagine even greater
challenges since acceleration is the derivative of speed. This is illustrated in the following example.

Figure 2.3: Estimated speed (∆ = 4) of the player corresponding to the path in Figure 2.1 over a 29-second interval. (Axes: estimated speed in m/s versus time elapsed in seconds.)

2.3.2 NFL Football Example

In the second example, we first note that the running patterns of an NFL wide receiver differ from those of a soccer player. Typically, the wide receiver sprints over a short time interval and does not make many changes of direction. This has implications for the estimation of speed. In Figure 2.4, we provide the estimated speed and acceleration estimates for Brandin Cooks based on a 7-second pass route. The red-lined plots correspond to estimates based on ∆ = 1 (i.e. intervals of 0.2 seconds), and the blue-lined plots correspond to estimates based on ∆ = 2 (i.e. intervals of 0.4 seconds). Using ∆ = 1, the speed estimates appear satisfactory as there are no unrealistic fluctuations between successive estimates. When we compare the speed estimates using ∆ = 1 to ∆ = 2, there is no apparent improvement in the speed estimates. This suggests that ∆ = 1 may be adequate for this application which is a different conclusion than with the soccer data. This may point to either a difference between the OT technology versus the GPS technology, or the intrinsic differences between the motions of soccer players compared to wide receivers in football.

When we look at the acceleration plots in Figure 2.4, it appears that ∆ = 1 may exhibit untenable fluctuations in acceleration, especially around the 5.5 second mark. For example, from the 5.2-second to 5.9-second mark, there is a change in acceleration in each successive time step, and the acceleration follows an unlikely fluctuating pattern of up, down, up, down, down, up and down (i.e. five changes in direction).
From the 5.2-second to 5.9-second mark with ∆ = 2, we observe the more believable pattern of up, up, down, same, same, down
and same (i.e. only one change in direction). With respect to the estimation of acceleration, ∆ = 2 is preferred over ∆ = 1.

Figure 2.4: The red-lined plots correspond to speed and acceleration estimates (∆ = 1) for Brandin Cooks of the NFL during a 7-second time interval. The analogous blue-lined plots correspond to ∆ = 2. (Panels: estimated speed in m/s and estimated acceleration in m/s^2, each versus time elapsed in seconds.)

2.4 Discussion

Tracking data have provided opportunities to study problems in sports analytics which were once unimaginable. However, sound tracking data analyses require data that are reliable, and the reliability of tracking data statistics often degrades with increasingly complex statistics. We have provided some simple principles from exploratory data analysis to help experimenters derive more reliable estimates of player speed. The same principles can be utilized in the calculation of velocities and accelerations.

The principles developed here are general and can be used with any type of player tracking system in any sport. The experimenter needs to consider the estimands of interest. The experimenter also requires domain knowledge of the sport to assess whether the resultant variations in the estimates are reasonable.

An avenue of future research may involve the implementation of statistical methods to smooth estimates of speed and acceleration. For example, one might consider the Hodrick-Prescott filter to smooth estimates of speed (Hodrick and Prescott 1997).

Instead of having experimenters manually estimate speed from (x, y) coordinates, some tracking data providers automatically provide speed statistics. Coleman (2018) describes the procedure that the data provider Opta uses in calculating top speeds for players in soccer: “The speed in kilometers per hour for a given frame is based on the previous 15 frame-to-frame speeds.
Out of the 15 frame-to-frame speeds, the four highest and the four lowest values are discarded and the result is an average of the remaining seven values.” Given that speed is of great importance in sports analytics, we suggest that it would be
good practice for the providers to be explicit about the derivation and justification of their speed calculations.
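The procedure quoted from Coleman (2018) amounts to a trimmed mean over a trailing window. The sketch below is only an interpretation of that quotation and not the provider's actual code; the function name trimmed_frame_speed and its arguments are hypothetical, and the handling of frames with fewer than 15 preceding speeds is an assumption.

```python
def trimmed_frame_speed(frame_speeds, window=15, trim=4):
    """Trimmed mean of the most recent frame-to-frame speeds.

    frame_speeds : frame-to-frame speeds up to the current frame
                   (km/h in the quoted description)
    window       : number of trailing speeds considered (15 in the quotation)
    trim         : number of highest and of lowest values discarded (4 each)
    Returns None when fewer than `window` speeds are available (an assumption).
    """
    if len(frame_speeds) < window:
        return None
    recent = sorted(frame_speeds[-window:])
    kept = recent[trim:window - trim]  # the 7 middle values by default
    return sum(kept) / len(kept)
```

For example, a single spurious reading of 50 km/h among fourteen readings of 5 km/h is discarded by the trimming, so the reported value stays at 5 km/h.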
Chapter 3

A Contextual Analysis of Crossing the Ball in Soccer

3.1 Introduction

The sport of soccer (association football) has a long history dating back to 1863 when the Laws of the Game were codified by the Football Association in England. Throughout the history of the sport, tactics have evolved with the intention of providing a competitive advantage (Wilson 2013). As a strategy, the action of crossing the ball in soccer has always been a staple of the game that has been thought to produce goals. A crossed ball occurs when a player (normally situated in a wide area of the attacking third of the pitch) kicks the ball towards the box with the intention that an attacking teammate will score.

However, in recent years, research has been carried out that casts doubt on the benefits of crossing the ball. Vecer (2014) provides a persuasive argument that the overall effect of crossing the ball has a strong negative impact on scoring. Vecer (2014) uses both aggregate crossing statistics and multilevel Poisson regression to study the impact of crossing. In the analyses, there is a suggestion that crossing (when executed properly) is valuable; however, the rate of bad crosses greatly exceeds the rate of good crosses, and this is a primary argument against crossing. Vecer (2014) also demonstrates that missed scoring opportunities due to open crossing are associated with the quality of the attacking team. In recent years, teams have become more reluctant to cross the ball. For example, Vecer (2014) states that the number of open crosses in the German Bundesliga dropped from 12.0 per match in the 2009/2010 season to 8.9 per match in the 2015/2016 season, a decrease exceeding 25%. Vecer (2014) analyzes the efficiency of crossing and finds that 14.5% of the goals scored were the results of open crosses in the English Premier League. We found a similar story in the Chinese Super League, where 16.9% of the goals were scored from open crosses in the 2019 season.
Sarkar (2018) investigates crosses from a game theoretic perspective. They assume the attacking team can cross the ball or not, and the defending team can utilize an offside
trap or not. The vector of equilibrium strategies determines the probabilities of the possible outcomes. Somewhat surprisingly, Sarkar (2018) suggests that teams that are good at aspects of executing a cross should cross the ball less often. Sarkar (2018) and Sarkar and Chakraborty (2018) also confirm the inverse relationship between the number of crosses and the number of goals scored in a match. Other papers that have provided nuanced views on the negative effects of crossing include Liu et al. (2015) and Oberstone (2009).

Given the longstanding history of crossing the ball in soccer, the conclusions reached by Vecer (2014) and Sarkar (2018) have been surprising to many, including the authors of this paper. We hypothesize that there are contexts in which crossing the ball in soccer is a beneficial strategy. Knowing when to cross the ball is a step in the direction of effective playing strategy.

Our contextual investigation is made possible by the availability of player tracking data. Player tracking data in soccer consist of the (x, y) coordinates of the ball and the 22 players on the pitch recorded at regular and frequent time intervals. Player tracking data in sport are the catalysts for big data analyses and do not form part of the analyses by Vecer (2014) and Sarkar (2018). Gudmundsson and Horton (2017) provide a review paper on spatio-temporal analyses used in invasion sports (including soccer) where player tracking data are available. The analysis of player tracking data has been particularly prominent in the sport of basketball; see for example, Miller et al. (2014).

Although tactical decisions are a fundamental aspect of sport, sporting decisions are not typically based on the results of randomized designs, the bread and butter of causal inference.
Clearly, in professional sport, match outcomes are important and coaches would be unwilling to implement a tactic in a random selection of games and then implement an alternative tactic in a remaining subset of games. There are many approaches that estimate causal effects with observational data (see Pearl 2009), but these methods have not received much attention in the sports analytics literature. One exception is the work of Yam and Lopez (2019) who investigate the impact of “going for it” on fourth down in the National Football League as opposed to punting or kicking a field goal. Their approach is based on matching propensity scores and covariates associated with game situations. As another example, Toumi and Lopez (2019) use propensity score matching and Bayesian additive regression trees to estimate the causal effects of zone-entry decisions in the National Hockey League.

Our work uses spatio-temporal data to investigate three aspects of the crossing problem in soccer. First, we investigate the spatio-temporal conditions that lead to crossing. Then we introduce an intended target model that investigates crossing success. Finally, a contextual analysis is provided that assesses the benefits of crossing in various situations. The analysis is based on causal inference techniques and suggests that crossing remains an effective tactic in particular contexts.

Section 3.2 introduces the dataset. We outline the steps involved in converting the player tracking data into features that are used in the ensuing analyses. The resultant design matrix
consists of rows that correspond to crossing opportunities and columns (covariates) that are believed to be related to aspects of crossing. Our analysis is based on various assumptions used in the definition of a crossing opportunity and on the definition of outcomes arising from crossing opportunities. In cases where the rationale for the assumptions is less clear, we introduce tuning parameters so that analyses can be carried out using a range of values of the tuning parameters.

Section 3.3 is concerned with the spatio-temporal conditions that lead a player to cross the ball. We develop a logistic regression model which relates the attempt (or non-attempt) to cross the ball to covariates (situational variables) which are believed to be related to the crossing decision. We observe that the model makes physical sense according to our understanding of soccer. The fitted model provides evidence of the rich information embedded in the player tracking data. The logistic model is subsequently used in the causal analysis of Section 3.5.

Section 3.4 develops an intended target model. The model introduces additional covariates that are relevant to the probability of success of a cross. The analysis concerns a sender (the player contemplating the cross) and potential receivers (players to whom the cross may be intended). The intended target model provides insight into the player to whom a cross ought to be made. Again, the fitted model aligns with our understanding of soccer. The information gleaned from the model may benefit players and coaches in terms of tactical decisions.

In Section 3.5, we first review concepts needed to apply basic causal inference techniques to the crossing problem. Then we use propensity score matching to assess whether crossing is beneficial. Our results are nuanced as crossing is seen to be beneficial in particular circumstances, and these circumstances are those when a player is more likely to cross.
We therefore see that the intuition of soccer players involving the decision to cross corresponds to good decision making. And importantly, we dispel the notion that crossing is not a valuable tactic in soccer. Some concluding remarks are then provided in Section 3.6.

3.2 Data Preprocessing

Statistical analyses begin with the existence of a dataset. However, with big data, the preprocessing of data has become an integral part of statistical practice that defines the types of models and analyses that can be entertained. In this paper, we have a big data problem where both event data and player tracking data are analyzed based on the 30 regular season matches of the 2017 season for Shandong Taishan Luneng FC of the Chinese Super League. Event data and tracking data are collected independently where event data consist of occurrences such as tackles and passes, and these are manually recorded along with auxiliary information whenever an “event” takes place. Both event data and tracking data have timestamps so that the two files can be compared
for internal consistency. In the Shandong Luneng dataset, tracking data are obtained from the use of optical recognition software. The Shandong Luneng tracking data consist of roughly 1,000,000 rows per match measured on 7 variables where the data are recorded every 1/10th of a second. Each row corresponds to a particular player at a given time. Although the inferences gained via our analyses are specific to Shandong Luneng, it is plausible that some of the broad insights may hold more generally in high-level soccer competitions.

3.2.1 Defining Crossing Opportunities

Vecer (2014) suggests that there are alternative strategies to crossing that are more beneficial in terms of goal scoring. These strategies include attacking through the center of the pitch (via dribbling and passing) and shooting. Vecer (2014) also states that when the attacking team enters the final third of the pitch, various options are more or less open. We focus on this assumption in our analysis. In particular, we utilize event and player tracking data to define a crossing opportunity. We define a crossing opportunity to be an occasion where a player has possession of the ball in a potential crossing zone and has the opportunity to either cross or not cross the ball. Also, we record covariates that describe the relevant circumstances at the time of each crossing opportunity.

Soccer is a fluid game where events frequently occur. Following Bransen, Van Haaren and van de Velden (2019), we define a possession sequence as a sequence of events involving possession of the ball by the same team. A possession sequence concludes with a change of possession or a stoppage. In our dataset, the length of a possession sequence ranges up to 19 events.

We begin by restricting our crossing analysis to occasions when the offensive team retains possession in a wide position of an attacking third of the pitch (i.e. within 13.85 metres of the sideline).
We refer to these two regions (on the opposite sides of the field) as the potential crossing zones, which are highlighted in blue in Figure 3.1. We are interested in the segment of possession sequences in the blue region. Only in these segments is it possible to cross the ball. After restricting our analysis to possessions in the potential crossing zones, we identify the final event that occurred in the zone, and we record the spatio-temporal information of all players at that moment. The last event in the potential crossing zone will be either a cross or non-cross (i.e. pass or dribble). In particular, we remove possession sequences that correspond to corner kicks and free kicks. Note that corner kicks and free kicks are not open crosses, but could possibly occur in a wide position of an attacking third of the pitch. We have N = 2225 final events in potential crossing zones throughout the 30 matches.
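The definition of a potential crossing zone can be sketched in code. This is only an illustration under stated assumptions: it assumes the pitch-centred coordinate system of Chapter 2 (a 105m by 68m pitch with the origin at the centre spot), takes the attacking third to be the final 35 metres in the direction of attack, and uses hypothetical names (in_crossing_zone, attacking_right) that do not appear in the data.

```python
PITCH_LENGTH = 105.0  # metres, along the x axis
PITCH_WIDTH = 68.0    # metres, along the y axis
WIDE_MARGIN = 13.85   # maximum distance from the sideline, in metres

def in_crossing_zone(x, y, attacking_right=True):
    """Return True if (x, y) lies in a potential crossing zone.

    Coordinates are assumed to be centred on the pitch, so x lies in
    [-52.5, 52.5] and y lies in [-34, 34]. attacking_right states whether
    the team in possession attacks towards positive x (an assumption).
    """
    if abs(x) > PITCH_LENGTH / 2 or abs(y) > PITCH_WIDTH / 2:
        return False  # off the pitch
    if not attacking_right:
        x = -x  # mirror so that the attack always runs towards positive x
    in_attacking_third = x >= PITCH_LENGTH / 2 - PITCH_LENGTH / 3
    near_sideline = PITCH_WIDTH / 2 - abs(y) <= WIDE_MARGIN
    return in_attacking_third and near_sideline
```

The final event of each possession segment that satisfies this test would then be labelled as a cross or a non-cross using the event data.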
Figure 3.1: Examples of possession sequences with (a) a crossing attempt and (b) without a crossing attempt.
3.2.2 Crafting Situational Variables

Building on previous research that evaluates passing ability (Szczepanski and McHale 2016, Power et al. 2017), we propose variables specific to the context of crossing. It is a tenet of soccer that time and space are paramount factors that lead to improved attacking outcomes. From the tracking data, it is possible to determine the location and velocity of both the ball and the player of interest. The location and velocity measurements form the basis for the situational variables presented in Table 3.1. Recall that the situational variables τ, z1, ..., z9 form the columns of a design matrix Z where the rows of Z are crossing opportunities corresponding to the final event in a possession sequence occurring in potential crossing zones.

Most of the situational variables in Table 3.1 are self-explanatory. The variable z2 (nearest defender distance) is a measure of defensive pressure on the sender. However, it does not account for situations where multiple defenders are covering the sender, nor for the location of the defender relative to the sender; a defender standing one metre in front of the sender is a very different situation from a defender standing one metre behind. The variable z3 indicating the space controlled within 2 metres by the sender has been introduced using ideas from Fernandez and Bornn (2018) and Fernandez et al. (2019). Although we experimented with many other crossing variables, the variables presented in Table 3.1 are those that provided excellent fit for the logistic model of Section 3.3.
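The distance-based variables of Table 3.1 (z2, z4 and z5) can be computed directly from a single frame of tracking data. The following is a minimal sketch, not the preprocessing code used here; the function names are hypothetical, positions are assumed to be (x, y) pairs in metres on a pitch-centred 105m by 68m coordinate system, and the endline is assumed to be the goal line being attacked.

```python
import math

HALF_LENGTH = 52.5  # half of the 105 m pitch length, in metres

def nearest_distance(point, others):
    """Smallest Euclidean distance from point to any position in others."""
    return min(math.hypot(point[0] - ox, point[1] - oy) for ox, oy in others)

def distance_variables(sender, defenders, teammates, attacking_right=True):
    """Compute z2, z4 and z5 for one crossing opportunity.

    sender     : (x, y) of the player in possession
    defenders  : list of (x, y) positions of defending players
    teammates  : list of (x, y) positions of the sender's teammates
    """
    x = sender[0] if attacking_right else -sender[0]
    return {
        "z2": nearest_distance(sender, defenders),  # nearest defender
        "z4": nearest_distance(sender, teammates),  # nearest teammate
        "z5": HALF_LENGTH - x,                      # distance to the endline
    }
```

As noted in the text, z2 computed this way ignores where the defender stands relative to the sender, which is part of the motivation for the space control variable z3.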
Variable     Definition of Variable
τ = 1 (0)    the ball is crossed (not crossed)
z1           score differential wrt the team in possession
z2           distance between the sender and nearest defender
z3           space controlled by the sender
z4           distance between the sender and nearest teammate
z5           distance between the sender and the endline
z6           ratio of the number of offensive players to defensive players in the box
z7           indicator variable corresponding to whether the sender is a defender
z8           indicator variable corresponding to whether the sender is a midfielder
z9           indicator variable for last 10 minutes of a half

Table 3.1: A subset of situational variables relevant to crossing which form the columns of the design matrix Z. All distances are measured in metres.

3.2.3 Outcome Variable

We require a response variable that allows us to assess whether crossing is beneficial. The obvious candidate is the variable Y1 = 1 (0) according to whether a crossing opportunity led (did not lead) to a goal. Although scoring and preventing goals is the primary objective of soccer teams, goal scoring is a rare event with only 2.5-3.0 goals scored per game on average in top European soccer leagues. Therefore, it is difficult to tease out subtle inferences when goal scoring is used as the dependent variable.
Alternative indicator variables that we have considered for a response variable are whether a crossing opportunity led to a shot on goal Y2 and whether a crossing opportunity led to a shot Y3. The variable Y2 is more common than Y1 and Y3 is more common than Y2. For this reason, we prefer the response variable Y = Y3. We note that shot statistics (as opposed to goal statistics) are prevalent in the hockey analytics literature and are referred to as Fenwick and Corsi (Vollman, Awad and Fyffe 2016).

Clearly, shots do not necessarily occur immediately after a cross. Therefore, we introduce a tuning parameter k where a success (shot attempt) is defined as having occurred within the next k events. If the team maintains possession after the ball exits the potential crossing zone and a shot attempt occurs within the next k events, then Y = 1, otherwise Y = 0. In this application, we set k = 5. The idea to let the play “unfold” was used by Schuckers and Curro (2013) in the context of player evaluation in hockey.

Using the above definition for Y, we observed 274 shots arising from the N = 2225 crossing opportunities. With the choice k = 5, it took 2.61 seconds on average for a shot to occur after a cross. Also, the offensive team retained possession (and did not cross the ball) 14.92% of the time (332 out of the 2225 cases). We recognize that k is a tuning parameter and we have experimented with different values for k, such as k = 4, 6, 7 and found little difference in the results. Another possible way of defining the response variable involves the consideration of time until a shot occurs. For example, Espasinghege Dona and Swartz (2022) define Y according to whether a shot occurs by the end of possession.

3.3 A Model for the Crossing Decision

We first consider how T (i.e. the variable denoting the decision to cross) depends on situational variables as expressed by Z (see Table 3.1).
For this, we consider a logistic regression model based on the N = 2225 crossing opportunities where T ∼ Bernoulli(p_T) and

$$ \text{logit}(p_T) = \lambda_0 + \lambda Z. \tag{3.1} $$

Parameter estimates and standard errors for the significant terms corresponding to model (3.1) are given in Table 3.2. To get a sense of the relative importance of the terms, the third column in Table 3.2 provides the parameter estimate multiplied by the mean value of its corresponding covariate. A notable observation is that given a crossing opportunity, crossing the ball is less frequent than not crossing the ball. For example, when the mean values of the covariates are substituted into the fitted equation corresponding to (3.1), the probability of a cross is Prob(T = 1) = 0.130. We also note that all of the parameters in Table 3.2 are highly significant except for z1 (p-value = 0.040) and z9 (p-value = 0.051).

The coefficients in Table 3.2 also correspond to our soccer intuition. For example, we see that an increase in the ratio of offensive players in the box to defensive players in the box leads to an increased probability of crossing (i.e. positive coefficient of z6). The most