Soccer Analytics - SFU Summit

Soccer Analytics
                                       by

                           Lucas Yifan Wu
                    M.Sc., Simon Fraser University, 2018
                     B.Sc., Simon Fraser University, 2017

               Thesis Submitted in Partial Fulfillment of the
                       Requirements for the Degree of
                             Doctor of Philosophy

                                     in the
               Department of Statistics and Actuarial Science
                               Faculty of Science

                          © Lucas Yifan Wu 2022
                   SIMON FRASER UNIVERSITY
                                   Fall 2022

Copyright in this work is held by the author. Please ensure that any reproduction
 or re-use is done in accordance with the relevant national copyright legislation.
Declaration of Committee

Name:           Lucas Yifan Wu

Degree:         Doctor of Philosophy

Thesis title:   Soccer Analytics

Committee:      Chair:   Liangliang Wang
                         Associate Professor, Statistics and Actuarial
                         Science

                Timothy Swartz
                Supervisor
                Professor, Statistics and Actuarial Science

                Boxin Tang
                Committee Member
                Professor, Statistics and Actuarial Science

                Oliver Schulte
                Examiner
                Professor, Computing Science

                Ian McHale
                External Examiner
                Professor, Management School
                University of Liverpool

Abstract

This thesis is a compilation of four projects, all related to soccer. The first short
chapter investigates how to obtain reliable speed measurements from player tracking data.
The second chapter considers the problem of crossing the ball in soccer. In recent years,
some research has suggested that there is a negative correlation between crossing and
scoring. However, correlation does not imply causation, and various factors affect the
decision to cross. Because matches are observational studies, an experimenter cannot assign
whether a player crosses or does not cross the ball during a particular crossing
opportunity. For this reason, we use a causal inference framework to investigate the causal
relationship of crossing on shots. Our findings suggest that crossing remains an effective
tactic for increasing shot probabilities.
The third chapter considers the evaluation of off-the-ball actions in soccer. Numerous
statistics and metrics have been proposed to evaluate the performance of players in team
sports based on actions involving the ball. In soccer, however, players typically do not
have possession of the ball for even three minutes during a game. In this chapter, we
develop methods that analyze the activities of players who are “off-the-ball”. A defensive
anticipation metric is then developed based on the tenet that moving faster to the expected
location is better than moving slower.
The last chapter considers the problem of pitch control in soccer. With the availability
of tracking data, one of the most intriguing ideas in soccer is to model how much space
a player or team owns at any given time, which is known as pitch control or field
ownership in the soccer analytics community. This chapter first reviews the literature
on various approaches for the determination of pitch control and then introduces a new
field ownership metric that takes into account the associated movement dynamics, such as
speed, acceleration and change of direction.

Keywords: Sports Analytics; Player Tracking Data; Causal Inference; Machine Learning;
Pitch Control.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my senior supervisor
Dr. Tim Swartz, as I am deeply indebted to him for his continual support and guidance. This
thesis would not have been possible without him. He saw the potential in me, drafted me as
his PhD student and encouraged me to pursue a career in Sports Analytics.
   I am extremely grateful to my examining committee, Dr. Boxin Tang, Dr. Oliver Schulte
and Dr. Ian McHale, for their thorough reading and valuable comments on my thesis.
Special thanks to Dr. Liangliang Wang for chairing my defence.
   I would also like to thank my All-Star teammates Dani Chu, Matthew Reyers, James
Thomson and Meyappan Subbaiah. Without these amazing teammates, it would have been
impossible to win the Big Data Bowl. Many thanks to the former and current SFU Sports
Analytics members who help to make SFU a Sports Analytics hub: Dr. Dave Clarke, Dr. Peter
Chow-White, Dr. Tim Swartz, Dr. Thomas Loughin, Dr. Luke Bornn, Dr. Oliver Schulte,
Dr. Peter Tingling, Dr. Aaron Danielson, Dr. Harsha Perera, Dr. Jacob Mortensen, Dr.
Nate Sandholtz, Sarah Bailey, Matthew Van Bommel, Steven Wu, Peter Tea, Kevin Floyd,
Robert Nguyen, Denis Beausoleil, Daniel Daly Grafstein, Chris Li, Ken Peng, Nirodha
Espasinghege Dona, Aaron Pearson, Robyn Ritchie, Ryker Moreau, Elijah Cavan, Brendan
Kumagi, James Thomson, Dani Chu and Matthew Reyers.
   I am grateful to all the faculty members in the Department of Statistics and Actuarial
Science who oversaw a kid hanging around for years, especially Dr. Dave Campbell for
sparking my interest in machine learning. In addition, I would like to thank all my lovely
friends and fellow MSc and PhD students for all the tears, laughter, fears and hopes we
shared.
   I would like to extend my sincere thanks to Dr. Doug Fearing, Dr. Luke Bornn and all
of my co-workers at Zelus Analytics for their support and help throughout the pandemic.
   Special shout-out to COVID-19, which made everyone’s life much more difficult, but we
have grown stronger together.
   Last but not least, I would like to thank my girlfriend and family, especially my
parents, for their unconditional love and support.

Table of Contents

Declaration of Committee

Abstract

Acknowledgements

Table of Contents

List of Tables

List of Figures

1 Introduction
   1.1   Introduction
   1.2   Organization of the thesis

2 The Calculation of Player Speed from Tracking Data
   2.1   Introduction
   2.2   Data
         2.2.1   Speed Calculations
   2.3   Exploratory Analyses
         2.3.1   Soccer Example
         2.3.2   NFL Football Example
   2.4   Discussion

3 A Contextual Analysis of Crossing the Ball in Soccer
   3.1   Introduction
   3.2   Data Preprocessing
         3.2.1   Defining Crossing Opportunities
         3.2.2   Crafting Situational Variables
         3.2.3   Outcome Variable
   3.3   A Model for the Crossing Decision
   3.4   The Intended Target Model
   3.5   Causal Inference
         3.5.1   Propensity Score Matching
         3.5.2   Results
   3.6   Discussion

4 Evaluation of Off-the-Ball Actions in Soccer
   4.1   Introduction
   4.2   Data
   4.3   Methods
         4.3.1   Rationale of the Approach
         4.3.2   Prediction of Velocities
         4.3.3   Computational Overview
         4.3.4   Derivation of a Metric for Defensive Anticipation
   4.4   Results and Assessment
         4.4.1   Reliability
         4.4.2   Validity
   4.5   Discussion
         4.5.1   Connections to Existing Literature
         4.5.2   Future Research

5 Pitch Control
   5.1   Introduction
   5.2   Literature Review
   5.3   A New Metric for Pitch Control
         5.3.1   Criteria for Pitch Control
         5.3.2   Timing of the Ball
         5.3.3   Timing of Players
   5.4   An Example
         5.4.1   Computation
   5.5   Accuracy of the Metric
   5.6   Discussion

Bibliography

Appendix A Code used for Establishing Pitch Control

List of Tables

 Table 2.1 A sample of soccer tracking data from the CSL.
 Table 2.2 A sample of football tracking data from the NFL.

 Table 3.1 A subset of situational variables relevant to crossing which form the
            columns of the design matrix Z. All distances are measured in metres.
 Table 3.2 Estimates and standard errors for the parameters corresponding to
            model (3.1). The third column provides the estimate multiplied by the
            mean value of its corresponding covariate. The fourth column, marginal
            effect, is the product of the estimate and the standard deviation of the
            corresponding z terms.
 Table 3.3 The key situational variables that are relevant to crossing success as
            modeled in Section 3.4. All distances are measured in metres, speed is
            measured in metres/second, angles are measured in degrees, and areas
            are measured in squared metres.
 Table 3.4 Estimates of the parameters from the intended target model and other
            related statistics. The estimates describe associations between
            spatio-temporal features and the successful completion of an attempted
            cross.

 Table 4.1 The defensive anticipation metric P calculated during even and odd
            weeks for players on Shandong Luneng during the 2019 season.
 Table 4.2 The defensive anticipation metric P given by (4.2) for 10 players on
            Shandong Luneng who received the most playing time during the 2019
            CSL season. We also provide comparison metrics involving aggression
            during the 2019 season, namely the total number of fouls committed,
            tackles made and the number of interceptions.

 Table 5.1 The determination of pitch control at a given location given time
            inequalities involving t_b, t_h and t_r.
 Table 5.2 The classification of 7901 intended passes according to whether pitch
            control (PC) was designated to the intended team, the opponent or
            neither team.

List of Figures

 Figure 2.1   Path of a player over a 29-second interval based on location data
              recorded at 10 hertz.
 Figure 2.2   Estimated speed (∆ = 1) of the player corresponding to the path in
              Figure 2.1 over a 29-second interval.
 Figure 2.3   Estimated speed (∆ = 4) of the player corresponding to the path in
              Figure 2.1 over a 29-second interval.
 Figure 2.4   The red-lined plots correspond to speed and acceleration estimates
              (∆ = 1) for Brandin Cooks of the NFL during a 7-second time
              interval. The analogous blue-lined plots correspond to ∆ = 2.

 Figure 3.1   Examples of possession sequences (a) with a crossing attempt and
              (b) without a crossing attempt.
 Figure 3.2   Panels (a) and (b) present output from the intended target model.
              These diagrams provide a way for teams to study the spatial
              configurations of players and the ball during crossing opportunities.
 Figure 3.3   The directed acyclic graph describes the crossing problem. The
              variables Z_T are causes of T, but not Y. The variables Z_TY are common
              causes of T and Y. The variables Z_Y are causes of Y, but not T.
 Figure 3.4   After matching, histograms of the two groups (treatment and control)
              are depicted where the horizontal variable is the propensity score.
 Figure 3.5   After matching, smoothed plots of the shot variable Y for both
              groups with respect to the propensity score.

 Figure 4.1   Correlation of predicted speed at time t and actual speed at time t − ∆
              where time is measured in seconds. The blue dashed line corresponds
              to the selected value ∆ = 0.5 seconds.
 Figure 4.2   Geometric diagram which illustrates the components of the statistic
              p in equation (4.1). Imagine a player who is located at the origin
              (0, 0). The observed velocity of the player is shown by the blue vector
              pointing towards (2, 4). The predicted velocity of an average player
              is shown by the yellow vector pointing towards (8, 4). The perpendicular
              line indicates the projection of the observed velocity vector
              on the predicted velocity vector. Using equation (4.1), the defensive
              anticipation value, p, is equal to −0.6, which can be interpreted as a
              60% reduction compared to the average player.
 Figure 4.3   Plot of predicted velocities (purple arrows) and observed velocities
              (black arrows) at a given instant in time. The blue team is in
              possession, the yellow team is defending and the red dot corresponds to
              the ball.
 Figure 4.4   Density plots of (4.2) based on playing position. For each player,
              the defensive anticipation metric (4.2) was calculated for all matches
              in the 2019 CSL season. We observe that central midfielders have
              slightly larger defensive anticipation values than other players on
              average, and there is more variability amongst the forwards than the
              other playing positions.
 Figure 4.5   Scatterplots of the defensive anticipation metric (4.2) plotted against
              player interceptions and tackles made during the 2019 CSL season.
 Figure 4.6   Plot of the defensive anticipation metric (4.2) averaged over all CSL
              players during 10-minute intervals.

 Figure 5.1   Voronoi diagram based on n = 5 points generated on the unit square.
 Figure 5.2   Voronoi diagram applied to a given snapshot of a soccer game based
              on the location of the 22 players on the pitch. The shaded orange
              and purple areas correspond to the dominant regions for the home and
              away teams, respectively.
 Figure 5.3   The distribution of maximum speed and maximum acceleration of
              all players in the Chinese Super League in 2019.
 Figure 5.4   Current velocity vectors for the example depicted in Figure 5.2.
 Figure 5.5   The left plot uses colors to depict the time that it takes a stationary
              player to reach field locations given the current location marked with
              a dot. The right plot does likewise but introduces an initial velocity
              (arrow) for the player.
 Figure 5.6   Pitch control diagram using the proposed methods for the example
              depicted in Figure 5.2.

Chapter 1

Introduction

1.1    Introduction
Sports analytics is an emerging field that combines sports with multidisciplinary knowledge
and expertise, such as statistics, computing science, sports science and business, to
support decisions in player evaluation, injury prevention, business operations, etc. The book
and movie Moneyball were among the first influences that put sports analytics in front of
the public. The Moneyball movement started in baseball and swept across multiple sports
within a few years.
   Human decisions are often clouded by personal judgement. This was also
featured in the movie Moneyball: "People are overlooked for a variety of biased reasons and
perceived flaws - age, appearance, personality." One common bias is recency bias,
where we tend to weight the most recent event more heavily than we should. For
example, when a player has just made a poster dunk, we are more likely to remember that
highlight and downweight the fact that he gave away five easy layups to his opponents
earlier. In the end, we might only remember one moment of brilliance and come away with the
perception that the player had an amazing game. It is easy to find many similar
examples in sports where this type of bias hinders player evaluation.
   Baseball was one of the earliest sports to embrace the idea of using numbers to inform
decisions. Moving beyond relying on pure instinct to evaluate players was a huge leap for
sports analytics. In the early days, the only available data were box score statistics
involving summary statistics of a few categories. As teams recognized the importance of
more granular data, they began to collect event data, which provide finer detail on
the sequence of events and the players involved in each recorded event.
   I would argue that this was the first wave of evolution in sports analytics, marked by
a shift in mindset towards using numbers to analyze player performance
objectively. The second wave of evolution came with the accessibility of tracking data.
Although event data provide a rich amount of contextual information, they do not
describe what other players are doing when they do not possess the ball or are not involved
in the recorded event. Tracking data fill this gap by recording detailed information, such
as the (x, y) coordinates of the ball and all players on the field, multiple times per
second. The availability of spatio-temporal tracking data unlocks a new world for
researchers to explore and allows them to tackle questions that they were previously unable
to answer. Plenty of interesting research has since been done using tracking data in
baseball, basketball, soccer and football.

1.2        Organization of the thesis
Four chapters follow this introduction. The common theme connecting these
chapters is soccer analytics: we identify interesting research problems in soccer and
attempt to solve them using statistics and computing. One challenge shared by these
chapters is handling the enormously rich data sets in soccer that track every player’s
detailed movement on the field multiple times per second.
    Chapter 2 is a short chapter which investigates how to obtain reliable speed
measurements from player tracking data. This chapter has been published as the following
research article:

   • Wu, L. and Swartz, T.B. (2022). The calculation of player speed from tracking data.
        International Journal of Sports Science & Coaching, 0(0).

    Chapter 3 considers the problem of crossing the ball in soccer. In recent years, some
research has suggested that there is a negative correlation between crossing and scoring.
However, correlation does not imply causation. Various factors affect the decision to
cross, including the position of the cross, the defensive pressure on the crosser,
the distance between the crosser and his teammates, the score differential, the number of
defenders in the box, etc. In general, randomized controlled trials are the gold standard
approach for estimating the causal effect of a treatment on an outcome. In the crossing
problem, however, matches are observational studies, and an experimenter cannot assign
whether a player crosses or does not cross the ball during a particular crossing
opportunity. For this reason, we use propensity score matching, a well-established method
within the causal inference framework, to investigate the causal relationship of crossing
on shots. This is one of the few papers that consider a causal inference approach in team
sport, and it utilizes player tracking data to identify and measure confounding variables.
Our findings suggest that crossing remains an effective tactic for increasing shot
probabilities. This chapter has been published as the following research article:

   • Wu, L., Danielson, A., Hu, J.X. and Swartz, T.B. (2021). A contextual analysis of
        crossing the ball in soccer. Journal of Quantitative Analysis in Sports, 17(1), 57-66.
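
    To make the matching idea concrete, the following is a minimal sketch rather than the
procedure used in Chapter 3. It assumes that a propensity score (the estimated probability
of crossing) has already been attached to every crossing opportunity, and it performs greedy
one-to-one nearest-neighbour matching within a caliper. The function names, the caliper of
0.05 and the toy data are all hypothetical.

```python
def match_on_propensity(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, control: lists of (propensity_score, outcome) tuples, where
    outcome is 1 if the possession led to a shot and 0 otherwise.
    Returns (treated_unit, control_unit) pairs whose scores differ by at
    most `caliper` (an assumed tolerance, not a value from the thesis).
    """
    available = list(control)
    pairs = []
    for unit in sorted(treated, key=lambda u: u[0]):
        if not available:
            break
        # Closest remaining control unit in terms of propensity score.
        best = min(available, key=lambda c: abs(c[0] - unit[0]))
        if abs(best[0] - unit[0]) <= caliper:
            pairs.append((unit, best))
            available.remove(best)
    return pairs


def effect_on_treated(pairs):
    """Mean outcome difference (treated minus control) over matched pairs."""
    return sum(t[1] - c[1] for t, c in pairs) / len(pairs)


# Toy data, invented purely for illustration.
crosses = [(0.80, 1), (0.60, 1), (0.40, 0), (0.95, 1)]
non_crosses = [(0.78, 0), (0.62, 1), (0.41, 0), (0.10, 0), (0.30, 1)]

pairs = match_on_propensity(crosses, non_crosses)
att = effect_on_treated(pairs)
print(len(pairs), att)
```

    In this toy example the cross with score 0.95 has no control within the caliper and is
left unmatched, which illustrates why matched comparisons are restricted to opportunities
where both decisions were plausible.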

    Chapter 4 considers the evaluation of off-the-ball actions in soccer. Numerous
statistics and metrics have been proposed to evaluate the performance of players in
team sports based on actions involving the ball. In soccer, however, players typically do
not have possession of the ball for even three minutes during a game. In this chapter, we
develop methods that analyze the activities of players who are “off-the-ball”. Specifically,
we propose a metric to measure defensive anticipation in soccer. An analogy is chess: when
you are planning your next move, you always try to anticipate the moves of your
opponent. Similarly in soccer, we conceptualize anticipation for defensive
players using their expected movements at the next moment given a snapshot of the game. The
expected movement at the next moment is a function of the spatio-temporal snapshot of the
match prior to that moment in time. This provides a new way to evaluate the performance
of players off-the-ball. We use machine learning models to learn the non-linear relationship
between the contextual variables and velocity from a massive set of game instances. The
output from the model, which we term the predicted (expected) velocity, represents where
the player is expected to move and how fast he is expected to move on average. A
metric is then developed by comparing the player’s actual velocity with the predicted
velocity of a typical player in the same situation. The interpretation of the defensive
anticipation metric is based on the tenet that moving faster to the expected location is
better than moving slower. This chapter is under revision at Statistica Applicata - Italian
Journal of Applied Statistics:

   • Wu, L. and Swartz, T.B. (2022). Evaluation of off-the-ball actions in soccer. Manuscript
      under review.
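
    The comparison of actual and predicted velocity can be illustrated with a small
sketch. Equation (4.1) is not reproduced in this overview, so the form below is an
assumption: the observed velocity is projected onto the predicted velocity, expressed
relative to the length of the predicted velocity, and reduced by one. This assumed form
reproduces the value p = −0.6 quoted in the caption of Figure 4.2 for an observed velocity
of (2, 4) and a predicted velocity of (8, 4). The function name is hypothetical.

```python
def anticipation(observed, predicted):
    """Relative projection of the observed velocity onto the predicted velocity.

    Assumed form (not taken from equation (4.1) itself):
        p = (observed . predicted) / |predicted|^2 - 1
    so p = 0 matches the average player along the predicted direction,
    p > 0 is faster and p < 0 is slower.
    """
    ox, oy = observed
    px, py = predicted
    norm_sq = px * px + py * py
    if norm_sq == 0:
        raise ValueError("predicted velocity must be non-zero")
    return (ox * px + oy * py) / norm_sq - 1.0


# Vectors taken from the caption of Figure 4.2.
p = anticipation((2.0, 4.0), (8.0, 4.0))
print(round(p, 6))
```

    Under this assumed form, a player whose observed velocity equals the predicted
velocity obtains p = 0.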

   Chapter 5 considers the problem of pitch control in soccer. With the availability of
tracking data, one of the most intriguing ideas in soccer is to model how much space a
player owns at any given time, which is known as pitch control or field ownership in the
soccer analytics community. This chapter first reviews various approaches for the
determination of pitch control and then introduces a new metric that takes into account the
associated movement dynamics of the ball and players. With the pitch control model, we can
determine whether the home team, the road team or neither team has control at any given
location on the field. This approach is generally applicable to invasion sports and is
illustrated in the context of soccer. This chapter has been submitted to Scientific Reports:

   • Wu, L. and Swartz, T.B. (2022). A New Metric for Pitch Control based on an Intuitive
      Motion Model. Manuscript under review.

Chapter 2

The Calculation of Player Speed
from Tracking Data

2.1     Introduction
In the past decade, the advent of player tracking data has sparked a revolution in sports
analytics (Morgulev, Azar and Lidor 2018). With player tracking data, analysts have access
to the Cartesian coordinates of each player on the pitch where the observations are recorded
frequently (e.g. 10 times per second). The availability of such detailed data provides
opportunities to investigate sporting questions that were previously unimaginable.
Gudmundsson and Horton (2017) provide a review paper on spatio-temporal analyses used in
invasion sports where player tracking data are available.
   Currently, player tracking systems are expensive, and consequently, tracking data are
only collected in “big” sports such as basketball (the National Basketball Association),
soccer (various leagues and competitions), football (the National Football League) and
hockey (the National Hockey League). Tracking data are not only collected during matches
but also during workout sessions where fitness, training and health considerations are the
main concerns.
   Tracking data are typically proprietary and are supplied by service providers using
various technologies (Torres-Ronda et al. 2022). There are four prominent technologies: (1)
global positioning systems (GPS), (2) local positioning systems (LPS), (3) inertial
measurement units (IMU) and (4) optical tracking (OT) systems. OT systems are fundamentally
different as they do not require wearable devices and do not directly determine player
coordinates. Instead, OT technology requires advanced camera systems and player recognition
software to evaluate player coordinates. No matter which technology is utilized, tracking
systems begin with the collection of the (x, y) coordinates of participants measured at
frequent time intervals. With the coordinates, various statistics can be calculated or
approximated (e.g. speed, acceleration, distance travelled, etc.).
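
   As a rough illustration of how such statistics follow from the coordinates, the sketch
below approximates speed by dividing the straight-line distance between two recorded
positions by the elapsed time. The lag parameter, which sets how many frames apart the two
positions are, is an assumption made for this sketch; the definition used in this chapter is
given in Section 2.2.1. The function name and the short track are hypothetical.

```python
import math


def speeds(times, xs, ys, lag=1):
    """Approximate speed (m/s) for each frame i >= lag.

    Speed is taken as the straight-line distance between the positions at
    frames i - lag and i, divided by the elapsed time between those frames.
    """
    out = []
    for i in range(lag, len(times)):
        dist = math.hypot(xs[i] - xs[i - lag], ys[i] - ys[i - lag])
        out.append(dist / (times[i] - times[i - lag]))
    return out


# Hypothetical track sampled at 10 Hz: the player moves 0.1 m along y per frame.
times = [30.0, 30.1, 30.2, 30.3, 30.4]
xs = [-4.0, -4.0, -4.0, -4.0, -4.0]
ys = [-9.6, -9.5, -9.4, -9.3, -9.2]

v1 = speeds(times, xs, ys, lag=1)
v2 = speeds(times, xs, ys, lag=2)
print(v1, v2)
```

   A larger lag averages over a longer time window, which smooths measurement noise at the
cost of temporal resolution.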

   In this paper, we are concerned with derivative calculations associated with tracking
data coordinates. Specifically, we are interested in the approximation of player speed,
which is an important statistic in sports analytics and sports science. For example, Wu and
Swartz (2022) require player speeds in soccer to assess off-the-ball activity, and they
introduce a measure which addresses defensive anticipation. Buchheit et al. (2014) use
regression methodology to determine factors that are associated with player speed in soccer;
for instance, horizontal force and horizontal power were seen to be associated with speed.
Oliva-Lozano et al. (2020) characterize positional differences in soccer based on
acceleration and sprint profiles. Related to speed, Shen, Santo and Akande (2022) analyze
pace of play in soccer, and conclude that pace increases with decreasing team quality, which
indicates the importance of playing with pace. From a training and performance perspective,
Ferrari Bravo et al. (2008) demonstrate that sprint-training significantly increases both
aerobic and anaerobic performances in soccer. Naturally, different applications require
different levels of accuracy. In sports science, for example, critical velocity is an active
research field which relies on highly accurate measurements of speed (Peng, Clarke and
Swartz 2022).
   Much has been written on the accuracy of various tracking data technologies. For ex-
ample, Mara et al. (2017) considered the displacement accuracy of an OT system, Tan,
Polglaze and Peeling (2021) investigated the validity and accuracy of a GPS system, and
Pino-Ortega et al. (2022) provided a review of the validity and reliability of LPS systems
against other devices. Massard, Eggars and Lovell (2017) questioned the need for sprint
testing based on the comparison of GPS match and field-testing data. However, all of these
investigations rely on some measure of the truth against which tracking measurements are
compared. What should experimenters do if they do not have access to the truth and they
are unsure of the accuracy of speed calculations obtained from tracking data? This paper
introduces some simple principles from exploratory data analysis that assist experimenters
in obtaining more reliable estimates of speed.
   In Section 2.2, we describe the datasets upon which our methods are illustrated, and we
describe how player speed is calculated from tracking data coordinates. In Section 2.3, some
simple exploratory plots are introduced that help the analyst obtain more reliable speed
calculations. We conclude with a short discussion in Section 2.4.

2.2     Data
We have access to tracking data from matches during the 2019 season of the Chinese Su-
per League (CSL). The CSL uses OT technology (previously discussed) provided by Stats
Perform where observations were recorded 10 times per second. The tracking data consist
of roughly one million rows per match measured on 7 variables. Each row corresponds to a
particular player at a given instant in time. The soccer tracking data were initially provided
as xml files, and were processed in R for further analysis. In Table 2.1, we present three
rows of the soccer tracking data. Here we observe x-y coordinates and player identifiers
at every 1/10th of a second. The entries are mostly intuitive except perhaps for the x-y
coordinates which refer to the player location on a 105m by 68m soccer field. For example,
(x, y) = (−52.5, 0) corresponds to the middle of the goal line on the left hand side of the
soccer field.

 gameID                                    Time     x      y          IdActor    IsBall         IdHalf   JerseyNumber
 WUHAN-BEIJI-01032019                      30       -4     -9.6       345354     FALSE          1        25
 WUHAN-BEIJI-01032019                      30.1     -4     -9.5       345354     FALSE          1        25
 WUHAN-BEIJI-01032019                      30.2     -4     -9.4       345354     FALSE          1        25

                       Table 2.1: A sample of soccer tracking data from the CSL.

    Our second dataset corresponds to tracking data from the National Football League
(NFL). Unlike the OT soccer data, the NFL data were based on GPS technology, but were
also collected using 10 hertz sampling frames. The data were used in the 2019 Big Data
Bowl competition and are publicly available at
https://github.com/nfl-football-ops/Big-Data-Bowl. Here we use data corresponding to a
single deep pass play by the wide receiver
Brandin Cooks of the New England Patriots taken from a 7-second interval during the
September 7, 2017 match against the Kansas City Chiefs. In Table 2.2, we present three
rows of the football tracking data. Here we observe a similar structure to the tracking data
in soccer. The football tracking data include the x-y coordinates for players measured in
yards where x refers to the player position along the long axis of the field ranging from 0 to
120 yards, and y refers to the player position along the short axis of the field ranging from
0 to 53.3 yards. For instance, (x, y) = (0, 0) corresponds to the bottom left of the football
field. The remaining variables in Table 2.2 are mostly intuitive where dis corresponds to
distance travelled from the previous frame (i.e. previous 1/10th second) and dir corresponds
to the angle of player motion in degrees. The frame.id is the frame identifier for each frame
which resets to 1 for each play.

 gameId       playId    frame.id   x        y       dis    dir        event          playerId   displayName     jerseyNumber
 2017090700   160       40         53.78    10.82   0.77   239.36     pass_forward   2543498    Brandin Cooks   14
 2017090700   160       41         53.11    10.45   0.76   238.66     NA             2543498    Brandin Cooks   14
 2017090700   160       42         52.44    10.08   0.76   237.76     NA             2543498    Brandin Cooks   14

                    Table 2.2: A sample of football tracking data from the NFL.

2.2.1     Speed Calculations
We emphasize that the approach that we introduce is general and straightforward. It can
be utilized using any tracking technology in any sport. However, knowledge of the sport
dictates our interpretation of the exploratory plots.
Consider then a particular player where our interest concerns the calculation of their
speed. If (x(t), y(t)) denotes the location of the player at time t, then the player’s speed at
time t is defined by

    $$ s(t) = \lim_{\Delta \to 0} \frac{\sqrt{(x(t+\Delta) - x(t-\Delta))^2 + (y(t+\Delta) - y(t-\Delta))^2}}{2\Delta}. \tag{2.1} $$

   In words, formula (2.1) is the limiting change in distance travelled with respect to time.
Of course, (2.1) is a mathematical expression based on taking a limit, and is not a quantity
that can be calculated from data. Instead, with tracking data, the player’s locations are
obtained at regular times which are denoted by (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Here, the
subscripts i = 1, ..., n of the Cartesian coordinates refer to the time increments. Therefore,
assuming that t corresponds to an observed time increment from the tracking data, it is
reasonable to approximate s(t) in (2.1) by

    $$ \hat{s}(t) = \frac{\sqrt{(x_{t+\Delta} - x_{t-\Delta})^2 + (y_{t+\Delta} - y_{t-\Delta})^2}}{2\Delta} \tag{2.2} $$

where ∆ = 1, 2, ... is an increment (in frames) that needs to be specified. In our illustration
with 10 hertz data, the value ∆ = 1 corresponds to 1/10th of a second, so that the denominator
2∆ corresponds to 0.2 seconds.
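
The approximation (2.2) can be sketched in a few lines of Python. This is only an illustration and not the code used in the thesis (which reports processing in R): the function name `estimate_speeds`, the `hz` argument and the choice to return `None` at the edges of the series are assumptions made here.

```python
import math

def estimate_speeds(xs, ys, delta=1, hz=10.0):
    """Central-difference speed estimates following (2.2).

    xs, ys : coordinates of one player, sampled at `hz` frames per second
    delta  : increment in frames (delta = 1 means 1/10th of a second at 10 hertz)
    Returns a list with one entry per frame; frames without a full window get None.
    """
    if delta < 1:
        raise ValueError("delta must be a positive integer")
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    window_seconds = 2 * delta / hz  # the denominator 2*Delta expressed in seconds
    speeds = [None] * len(xs)
    for t in range(delta, len(xs) - delta):
        dx = xs[t + delta] - xs[t - delta]
        dy = ys[t + delta] - ys[t - delta]
        speeds[t] = math.hypot(dx, dy) / window_seconds
    return speeds

# The three rows of Table 2.1 (coordinates in metres, 10 hertz)
speeds = estimate_speeds([-4.0, -4.0, -4.0], [-9.6, -9.5, -9.4], delta=1, hz=10.0)
print(speeds)
```

For the three rows of Table 2.1, only the middle frame has a full window, and its estimate is approximately 1 metre per second (0.2 metres covered in 0.2 seconds).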
   We have simplified the discussion by referring to speed. The approximation of velocity
is also of interest where velocity has a directional component in addition to the scalar
quantity speed. Note that acceleration calculations are also important, and are obtained as
derivatives of speed.

2.3     Exploratory Analyses
Whereas the estimand s(t) in (2.1) is an instantaneous speed, its estimate ŝ(t) in (2.2)
is an average speed taken over the time period 2∆. It may therefore appear that smaller
values of ∆ will yield better estimates. However, this needs to be balanced against the fact
that player coordinates (xt , yt ) are subject to measurement error as is the time interval 2∆.
Therefore, inaccuracies in the speed estimates are propagated from inaccuracies in the raw
data.
   To theoretically investigate the magnitude of error in speed via measurement error in
the numerator of (2.2), we consider the true speed ∆l /(2∆), which is the change in
location ∆l divided by the change in time, where ∆ denotes the previously defined incremental
step size in time. With measurement error present, we denote the observed speed by (∆l ± E)/(2∆)
where E denotes a fixed error in the location measurement corresponding to the device.
Then the relative error RE is given by

    $$ RE = \frac{\left| \Delta_l/(2\Delta) - (\Delta_l \pm E)/(2\Delta) \right|}{\Delta_l/(2\Delta)} = \frac{|E|}{\Delta_l}. \tag{2.3} $$

We note that the relative error (2.3) is smaller for larger speeds (i.e. greater changes in
location ∆l ). For example, when ∆ = 1, consider a true location displacement ∆l = 8
metres which is incorrectly measured as 9 metres. Then the actual speed is 8.0 metres/sec
(fast), the observed speed is 9.0 metres/sec, and the measurement error is E = 1 metre.
This results in relative error RE = 0.125. For contrast, when ∆ = 1, consider a true location
displacement ∆l = 2 metres which is incorrectly measured as 3 metres. Then the actual
speed is 2.0 metres/sec (slow), the observed speed is 3.0 metres/sec, and the measurement
error is E = 1 metre. This results in relative error RE = 0.50.
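
The two relative errors above can be checked numerically. The sketch below is an illustration only; the function names are hypothetical, and the second function simply evaluates the first line of (2.3) to show that the time window cancels.

```python
def relative_error(true_displacement, location_error):
    """Relative error (2.3): |E| / Delta_l."""
    if true_displacement <= 0:
        raise ValueError("true displacement must be positive")
    return abs(location_error) / true_displacement

def relative_error_from_speeds(true_displacement, location_error, window):
    """Same quantity computed from true and observed speeds over a common time window."""
    true_speed = true_displacement / window
    observed_speed = (true_displacement + location_error) / window
    return abs(true_speed - observed_speed) / true_speed

fast = relative_error(8.0, 1.0)  # displacement of 8 metres measured as 9 metres
slow = relative_error(2.0, 1.0)  # displacement of 2 metres measured as 3 metres
print(fast, slow)
```

Because the window appears in both the numerator and the denominator of (2.3), `relative_error_from_speeds` returns the same value for any positive `window`.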

2.3.1   Soccer Example
To begin our investigation, Figure 2.1 provides a plot of the locations of a player from the
CSL dataset taken during a 29-second interval where he is known to be running fast during
portions of the interval. When a player is running fast, it is physically impossible to make
sharp turns, and therefore, the smoothness of the path suggests apparent accuracy in the
location measurements.

[Figure: path of the player on the pitch, with the starting point marked]

Figure 2.1: Path of a player over a 29-second interval based on location data recorded at 10
hertz.

   However, when we take the path locations in Figure 2.1, and estimate speeds (2.2) using
∆ = 1, there seems to be a significant accuracy problem. Figure 2.2 provides a plot of
estimated speed versus time for the selected path. In Figure 2.2, we observe that there
are many instances where a player has a recorded speed which increases (or decreases) by
roughly 1.0 metre per second in the subsequent 1/10th second, and then returns to the
baseline speed 1/10th of a second later. When speeds are recorded in the (0,8) metres per
second range, frequent fluctuations of this magnitude do not seem plausible. The problem
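here is that the location measurements were recorded to one decimal place on the metres
scale, and therefore, there is inaccuracy in (2.2) when dividing by 2∆ which corresponds to
0.2 seconds. For instance, a change of 0.1 metres in the measured displacement shifts the
estimated speed by 0.5 metres per second.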

[Figure: estimated speed (m/s) plotted against time elapsed in seconds]

Figure 2.2: Estimated speed (∆ = 1) of the player corresponding to the path in Figure 2.1
over a 29-second interval.

   A remedy to the estimation of the instantaneous speed s(t) is to increase the time
increment ∆ surrounding t. Increasing the length of the time interval 2∆ results in less
fluctuation in the estimated speeds which is desirable. However, this is done at the expense
of moving in the direction from instantaneous speeds to average speeds. We have found that
the choice ∆ = 4 works well in this application. Figure 2.3 provides the analogous
plot to Figure 2.2 where the time intervals have been widened to intervals of length 0.8
seconds. In Figure 2.3, we observe that the fluctuations are less pronounced, and that the
plot of estimated speed versus time is smoother. For example, the fluctuations during the
interval 16-18 seconds in Figure 2.2 are less believable than what is observed in Figure 2.3.
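
The effect of recording locations to one decimal place can be reproduced with a toy simulation. This is a sketch under stated assumptions and does not use the CSL data: a hypothetical player runs in a straight line at a constant 5.03 metres per second, positions are rounded to 0.1 metres, and speeds are estimated by central differences with ∆ = 1 and ∆ = 4.

```python
def central_speeds(xs, delta, hz=10.0):
    """One-dimensional version of (2.2) for positions sampled at hz frames per second."""
    window_seconds = 2 * delta / hz
    return [abs(xs[t + delta] - xs[t - delta]) / window_seconds
            for t in range(delta, len(xs) - delta)]

TRUE_SPEED = 5.03  # metres per second (hypothetical constant speed)
HZ = 10.0

# 29 seconds of straight-line running, recorded to one decimal place in metres
xs = [round(TRUE_SPEED * i / HZ, 1) for i in range(290)]

spreads = {}
for delta in (1, 4):
    estimates = central_speeds(xs, delta, HZ)
    spreads[delta] = max(estimates) - min(estimates)
    print(f"delta={delta}: min={min(estimates):.3f}, "
          f"max={max(estimates):.3f}, spread={spreads[delta]:.3f}")
```

Although the true speed never changes, rounding alone makes the estimates jump between neighbouring values, and the size of the jump shrinks as the window 2∆ widens. Real paths also contain genuine changes of speed, so a wider window trades this rounding noise against more averaging.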
   We refer back to the theoretical analysis of relative error at the beginning of Section
2.3. In this example, we have seen that we prefer the time increment ∆ = 4 over ∆ = 1.
With ∆ = 4, speed ∆l /2∆ = 8 metres/sec and location measurement error E = 1 metre,
this implies ∆l = 64 metres and relative error RE = E/64 = 0.015625. With ∆ = 1, speed
∆l /2∆ = 8 metres/sec and location measurement error E = 1 metre, this implies ∆l = 16
metres and relative error RE = E/16 = 0.0625. Therefore, ∆ = 4 is preferred over ∆ = 1
in reducing relative error. This exercise can be repeated for any speed.
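
The comparison can be written for a general speed. The short derivation below is a sketch that follows the conventions of the example above: v denotes the speed ∆l /(2∆), and the location error E is treated as fixed.

```latex
\Delta_l = 2\Delta\, v
\quad\Longrightarrow\quad
RE = \frac{|E|}{\Delta_l} = \frac{|E|}{2\Delta\, v},
\qquad
\frac{RE(\Delta = 4)}{RE(\Delta = 1)} = \frac{1}{4}.
```

With v = 8 and E = 1 this reproduces the values 1/16 = 0.0625 for ∆ = 1 and 1/64 = 0.015625 for ∆ = 4. Under these assumptions the relative error is inversely proportional to ∆ at any fixed speed.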
   Issues which arise in speed measurements are a consequence of the fact that speed is
the derivative of position, and that position is not measured with sufficient accuracy. In
applications where acceleration measurements are important, one can imagine even greater
challenges since acceleration is the derivative of speed. This is illustrated in the following
example.

[Figure: estimated speed (m/s) plotted against time elapsed in seconds]

Figure 2.3: Estimated speed (∆ = 4) of the player corresponding to the path in Figure 2.1
over a 29-second interval.

2.3.2    NFL Football Example
In the second example, we first note that the running patterns of an NFL wide receiver differ
from those of a soccer player. Typically, the wide receiver sprints over a short time interval
and does not make many changes of direction. This has implications for the estimation of
speed.
   In Figure 2.4, we provide speed and acceleration estimates for Brandin
Cooks based on a 7-second pass route. The red-lined plots correspond to estimates based
on ∆ = 1 (i.e. intervals of 0.2 seconds), and the blue-lined plots correspond to estimates
based on ∆ = 2 (i.e. intervals of 0.4 seconds). Using ∆ = 1, the speed estimates appear
satisfactory as there are no unrealistic fluctuations between successive estimates. When we
compare the speed estimates using ∆ = 1 to ∆ = 2, there is no apparent improvement in
the speed estimates. This suggests that ∆ = 1 may be adequate for this application which
is a different conclusion than with the soccer data. This may point to either a difference
between the OT technology versus the GPS technology, or the intrinsic differences between
the motions of soccer players compared to wide receivers in football.
   When we look at the acceleration plots in Figure 2.4, it appears that ∆ = 1 may exhibit
untenable fluctuations in acceleration, especially around the 5.5 second mark. For example,
from the 5.2-second to 5.9-second mark, there is a change in acceleration in each successive
time step, and the acceleration follows an unlikely fluctuating pattern of up, down, up, down,
down, up and down (i.e. five changes in direction). From the 5.2-second to 5.9-second mark
with ∆ = 2, we observe the more believable pattern of up, up, down, same, same, down
and same (i.e. only one change in direction). With respect to the estimation of acceleration,
∆ = 2 is preferred over ∆ = 1.
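
The check described above, counting how often a series switches between rising and falling, is easy to automate. The following sketch is illustrative only: the function names are hypothetical, the speed values are made up, and acceleration is obtained by applying a central difference to the speed series, in the same spirit as (2.2).

```python
def central_difference(values, delta=1, hz=10.0):
    """Signed central difference of a series sampled at hz frames per second."""
    window_seconds = 2 * delta / hz
    return [(values[t + delta] - values[t - delta]) / window_seconds
            for t in range(delta, len(values) - delta)]

def count_reversals(series):
    """Count switches between rising and falling steps; flat steps are ignored."""
    reversals = 0
    previous_sign = 0
    for earlier, later in zip(series, series[1:]):
        step = later - earlier
        sign = (step > 0) - (step < 0)
        if sign == 0:
            continue
        if previous_sign != 0 and sign != previous_sign:
            reversals += 1
        previous_sign = sign
    return reversals

# Hypothetical speeds in m/s at 10 hertz (not taken from the NFL data)
speeds = [3.0, 3.4, 3.9, 4.3, 4.8, 5.1, 5.5, 5.8, 6.0, 6.1]
acceleration = central_difference(speeds, delta=1)
print(count_reversals(acceleration))
```

A series that moves up, down, up, down, down, up, down gives a count of five, and one that moves up, up, down, same, same, down, same gives a count of one, matching the counts quoted above. What count is plausible remains a judgement that depends on knowledge of the sport.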

[Figure: two panels plotting estimated speed (m/s) and estimated acceleration (m/s^2) against
time elapsed in seconds, each with lines for delta = 1 and delta = 2]

Figure 2.4: The red-lined plots correspond to speed and acceleration estimates (∆ = 1) for
Brandin Cooks of the NFL during a 7-second time interval. The analogous blue-lined plots
correspond to ∆ = 2.

2.4     Discussion
Tracking data have provided opportunities to study problems in sports analytics which
were once unimaginable. However, sound tracking data analyses require data that are reliable,
and the reliability of tracking data statistics often degrades with increasingly complex
statistics. We have provided some simple principles from exploratory data analysis to help
experimenters derive more reliable estimates of player speed. The same principles can be
utilized in the calculation of velocities and accelerations.
   The principles developed here are general and can be used with any type of player track-
ing system in any sport. The experimenter needs to consider the estimands of interest. The
experimenter also requires domain knowledge of the sport to assess whether the resultant
variations in the estimates are reasonable.
   An avenue of future research may involve the implementation of statistical methods to
smooth estimates of speed and acceleration. For example, one might consider the Hodrick-
Prescott filter to smooth estimates of speed (Hodrick and Prescott 1997).
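
As an indication of what such smoothing might look like, the sketch below implements the standard Hodrick-Prescott trend, which minimizes the sum of squared deviations from the data plus λ times the sum of squared second differences of the trend. It is not part of the analysis in this chapter: the function name, the example speeds and the value of λ are assumptions, and a suitable λ for 10 hertz speed data would need to be investigated.

```python
def hp_trend(y, lam):
    """Hodrick-Prescott trend: solve (I + lam * D'D) x = y, D = second-difference operator."""
    n = len(y)
    if n < 3:
        return [float(v) for v in y]
    # Build the pentadiagonal matrix A = I + lam * D'D
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1.0
    for r in range(n - 2):
        coeffs = ((r, 1.0), (r + 1, -2.0), (r + 2, 1.0))
        for i, ci in coeffs:
            for j, cj in coeffs:
                A[i][j] += lam * ci * cj
    # Banded Gaussian elimination (A is symmetric positive definite, so no pivoting)
    b = [float(v) for v in y]
    for k in range(n):
        for i in range(k + 1, min(k + 3, n)):
            factor = A[i][k] / A[k][k]
            for j in range(k, min(k + 3, n)):
                A[i][j] -= factor * A[k][j]
            b[i] -= factor * b[k]
    # Back substitution within the band
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(A[i][j] * x[j] for j in range(i + 1, min(i + 3, n)))
        x[i] = (b[i] - tail) / A[i][i]
    return x

# Hypothetical noisy speed estimates in m/s
noisy = [5.0, 5.5, 5.0, 5.5, 5.0, 5.0, 5.5, 5.0, 5.5, 5.5, 5.0, 5.5]
smooth = hp_trend(noisy, lam=10.0)
print([round(v, 2) for v in smooth])
```

Larger values of `lam` give a smoother trend, and a series that is exactly linear is returned unchanged because its second differences are zero.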
   Instead of having experimenters manually estimate speed from (x, y) coordinates, some
tracking data providers automatically provide speed statistics. Coleman (2018) describes
the procedure that the data provider Opta uses in calculating top speeds for players in
soccer: “The speed in kilometers per hour for a given frame is based on the previous 15
frame-to-frame speeds. Out of the 15 frame-to-frame speeds, the four highest and the four
lowest values are discarded and the result is an average of the remaining seven values.”
Given that speed is of great importance in sports analytics, we suggest that it would be
good practice for the providers to be explicit about the derivation and justification of
their speed calculations.
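
The procedure quoted from Coleman (2018) amounts to a trimmed mean. The sketch below is one reading of that quotation and is not Opta's code: the function name is hypothetical, and whether the window includes the current frame is an assumption, since the quotation does not say.

```python
def trimmed_frame_speed(frame_speeds, window=15, trim=4):
    """Average of the middle values among the last `window` frame-to-frame speeds.

    With window=15 and trim=4, the four highest and four lowest values are
    discarded and the remaining seven are averaged.
    """
    if len(frame_speeds) < window:
        raise ValueError("not enough frame-to-frame speeds for the window")
    recent = sorted(frame_speeds[-window:])
    kept = recent[trim:window - trim]
    return sum(kept) / len(kept)

# Hypothetical frame-to-frame speeds in km/h with one spurious spike
speeds_kmh = [20.0] * 14 + [100.0]
print(trimmed_frame_speed(speeds_kmh))
```

In this made-up example the spurious value of 100 is discarded and the result is 20.0, which shows how trimming guards against isolated measurement errors.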

Chapter 3

A Contextual Analysis of Crossing
the Ball in Soccer

3.1       Introduction
The sport of soccer (association football) has a long history dating back to 1863 when
the Laws of the Game were codified by the Football Association in England. Throughout
the history of the sport, tactics have evolved with the intention of providing a competitive
advantage (Wilson 2013). As a strategy, the action of crossing the ball in soccer has always
been a staple of the game that has been thought to produce goals. A crossed ball occurs
when a player (normally situated in a wide area of the attacking third of the pitch) kicks
the ball towards the box with the intention that an attacking teammate will score.
   However, in recent years, research has been carried out that casts doubt on the benefits
of crossing the ball. Vecer (2014) provides a persuasive argument that the overall effect of
crossing the ball has a strong negative impact on scoring. Vecer (2014) uses both aggregate
crossing statistics and multilevel Poisson regression to study the impact of crossing. In the
analyses, there is a suggestion that crossing (when executed properly) is valuable; however,
the rate of bad crosses greatly exceeds the rate of good crosses, and this is a primary
argument against crossing. Vecer (2014) also demonstrates that missed scoring opportunities
due to open crossing are associated with the quality of the attacking team. In recent years,
teams have become more reluctant to cross the ball. For example, Vecer (2014) states that
the number of open crosses in the German Bundesliga dropped from 12.0 per match in
the 2009/2010 season to 8.9 per match in the 2015/2016 season, a decrease exceeding 25%.
Vecer (2014) analyzes the efficiency of crossing and finds that 14.5% of the goals scored
were the result of open crosses in the English Premier League. We found a similar story in
the Chinese Super League, where 16.9% of the goals were scored from open crosses in the 2019
season.
   Sarkar (2018) investigates crosses from a game theoretic perspective. They assume the
attacking team can cross the ball or not, and the defending team can utilize an offside
trap or not. The vector of equilibrium strategies determines the probabilities of the possi-
ble outcomes. Somewhat surprisingly, Sarkar (2018) suggests that teams that are good at
aspects of executing a cross should cross the ball less often. Sarkar (2018) and Sarkar and
Chakraborty (2018) also confirm the inverse relationship between the number of crosses and
the number of goals scored in a match. Other papers that have provided nuanced views on
the negative effects of crossing include Liu et al. (2015) and Oberstone (2009).
   Given the longstanding history of crossing the ball in soccer, the conclusions reached
by Vecer (2014) and Sarkar (2018) have been surprising to many, including the authors of
this paper. We hypothesize that there are contexts in which crossing the ball in soccer is a
beneficial strategy. Knowing when to cross the ball is a step in the direction of effective
playing strategy. Our contextual investigation is made possible by the availability of player
tracking data. Player tracking data in soccer consist of the (x, y) coordinates of the ball and
the 22 players on the pitch recorded at regular and frequent time intervals. Player tracking
data in sport are the catalysts for big data analyses and do not form part of the analyses by
Vecer (2014) and Sarkar (2018). Gudmundsson and Horton (2017) provide a review paper
on spatio-temporal analyses used in invasion sports (including soccer) where player tracking
data are available. The analysis of player tracking data has been particularly prominent in
the sport of basketball; see for example, Miller et al. (2014).
   Although tactical decisions are a fundamental aspect of sport, sporting decisions are
not typically based on the results of randomized designs, the bread and butter of causal
inference. Clearly, in professional sport, match outcomes are important and coaches would
be unwilling to implement a tactic in a random selection of games and then implement an
alternative tactic in a remaining subset of games. There are many approaches that estimate
causal effects with observational data (see Pearl 2009), but these methods have not received
much attention in the sports analytics literature. One exception is the work of Yam and
Lopez (2019) who investigate the impact of “going for it” on fourth down in the National
Football League as opposed to punting or kicking a field goal. Their approach is based
on matching propensity scores and covariates associated with game situations. As another
example, Toumi and Lopez (2019) use propensity score matching and Bayesian additive
regression trees to estimate the causal effects of zone-entry decisions in the National Hockey
League.
   Our work uses spatio-temporal data to investigate three aspects of the crossing problem
in soccer. First, we investigate the spatio-temporal conditions that lead to crossing. Then we
introduce an intended target model that investigates crossing success. Finally, a contextual
analysis is provided that assesses the benefits of crossing in various situations. The analysis
is based on causal inference techniques and suggests that crossing remains an effective tactic
in particular contexts.
   Section 3.2 introduces the dataset. We outline the steps involved in converting the player
tracking data into features that are used in the ensuing analyses. The resultant design matrix
consists of rows that correspond to crossing opportunities and columns (covariates) that are
believed to be related to aspects of crossing. Our analysis is based on various assumptions used
in the definition of a crossing opportunity and on the definition of outcomes arising from
crossing opportunities. In cases where the rationale for the assumptions is less clear, we
introduce tuning parameters so that analyses can be carried out using a range of values of
the tuning parameters.
   Section 3.3 is concerned with the spatio-temporal conditions that lead a player to cross
the ball. We develop a logistic regression model which relates the attempt (or non-attempt)
to cross the ball to covariates (situational variables) which are believed to be related to the
crossing decision. We observe that the model makes physical sense according to our under-
standing of soccer. The fitted model provides evidence of the rich information embedded in
the player tracking data. The logistic model is subsequently used in the causal analysis of
Section 3.5.
   Section 3.4 develops an intended target model. The model introduces additional covari-
ates that are relevant to the probability of success of a cross. The analysis concerns a sender
(the player contemplating the cross) and potential receivers (players to whom the cross
may be intended). The intended target model provides insight into the player to whom a cross
ought to be made. Again, the fitted model aligns with our understanding of soccer. The information
gleaned from the model may benefit players and coaches in terms of tactical decisions.
   In Section 3.5, we first review concepts needed to apply basic causal inference tech-
niques to the crossing problem. Then we use propensity score matching to assess whether
crossing is beneficial. Our results are nuanced as crossing is seen to be beneficial in par-
ticular circumstances, and these circumstances are those when a player is more likely to
cross. We therefore see that the intuition of soccer players involving the decision to cross
corresponds to good decision making. And importantly, we dispel the notion that crossing
is not a valuable tactic in soccer.
   Some concluding remarks are then provided in Section 3.6.

3.2     Data Preprocessing
Statistical analyses begin with the existence of a dataset. However, with big data, the pre-
processing of data has become an integral part of statistical practice that defines the types
of models and analyses that can be entertained.
   In this paper, we have a big data problem where both event data and player tracking
data are analyzed based on the 30 regular season matches of the 2017 season for Shandong
Taishan Luneng FC of the Chinese Super League. Event data and tracking data are collected
independently where event data consists of occurrences such as tackles and passes, and these
are manually recorded along with auxiliary information whenever an “event” takes place.
Both event data and tracking data have timestamps so that the two files can be compared
for internal consistency. In the Shandong Luneng dataset, tracking data are obtained from
the use of optical recognition software. The Shandong Luneng tracking data consists of
roughly 1,000,000 rows per match measured on 7 variables where the data are recorded every
1/10th of a second. Each row corresponds to a particular player at a given time. Although
the inferences gained via our analyses are specific to Shandong Luneng, it is plausible that
some of the broad insights may hold more generally in high-level soccer competitions.

3.2.1    Defining Crossing Opportunities
Vecer (2014) suggests that there are alternative strategies to crossing that are more beneficial
in terms of goal scoring. These strategies include attacking through the center of the pitch
(via dribbling and passing) and shooting.
    Vecer (2014) also states that when the attacking team enters the final third of the pitch,
various options are more or less open. We focus on this assumption in our analysis. In
particular, we utilize event and player tracking data to define a crossing opportunity. We
define a crossing opportunity to be an occasion where a player has possession of the ball in
a potential crossing zone and has the opportunity to either cross or not cross the ball. Also,
we record covariates that describe the relevant circumstances at the time of each crossing
opportunity.
    Soccer is a fluid game where events frequently occur. Following Bransen, Van Haaren
and van de Velden (2019), we define a possession sequence as a sequence of events involving
possession of the ball by the same team. A possession sequence concludes with a change of
possession or a stoppage. In our dataset, the length of a possession sequence ranges up to
19 events.
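
The definition of a possession sequence can be sketched in code. The event format below is hypothetical (a list of dictionaries with a `team` field and an optional `stoppage` flag) and is not the schema of the event data used in this thesis.

```python
def split_possession_sequences(events):
    """Group consecutive events into possession sequences.

    A sequence ends when the team in possession changes or when an event
    is flagged as a stoppage (the flagged event closes its sequence).
    """
    sequences = []
    current = []
    for event in events:
        if current and event["team"] != current[-1]["team"]:
            sequences.append(current)
            current = []
        current.append(event)
        if event.get("stoppage", False):
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences

# Hypothetical event stream
events = [
    {"team": "A", "type": "pass"},
    {"team": "A", "type": "dribble"},
    {"team": "B", "type": "tackle"},
    {"team": "B", "type": "foul", "stoppage": True},
    {"team": "B", "type": "pass"},
    {"team": "A", "type": "interception"},
]
lengths = [len(seq) for seq in split_possession_sequences(events)]
print(lengths)
```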
    We begin by restricting our crossing analysis to occasions when the offensive team retains
possession in a wide position of an attacking third of the pitch (i.e. within 13.85 metres of
the sideline). We refer to these two regions (on the opposite sides of the field) as the potential
crossing zones, which are highlighted in blue in Figure 3.1. We are interested in the segment
of possession sequences in the blue region. Only in these segments is it possible to cross the
ball.
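
Membership in a potential crossing zone can be expressed as a simple coordinate check. The sketch below assumes the coordinate system described in Chapter 2 (origin at the centre of a 105m by 68m pitch), takes the attacking third to be the final 35 metres, and uses a hypothetical `attacking_right` flag for the direction of play; the conventions in the Shandong Luneng data may differ.

```python
PITCH_LENGTH = 105.0  # metres
PITCH_WIDTH = 68.0    # metres
WIDE_MARGIN = 13.85   # metres from the sideline

def in_potential_crossing_zone(x, y, attacking_right=True):
    """True if (x, y) lies in a wide area of the attacking third."""
    if not attacking_right:
        x = -x  # mirror so that the attack always runs towards positive x
    third_start = PITCH_LENGTH / 2 - PITCH_LENGTH / 3
    in_attacking_third = third_start <= x <= PITCH_LENGTH / 2
    distance_to_sideline = PITCH_WIDTH / 2 - abs(y)
    is_wide = 0 <= distance_to_sideline <= WIDE_MARGIN
    return in_attacking_third and is_wide

print(in_potential_crossing_zone(40.0, 30.0))   # wide and in the attacking third
print(in_potential_crossing_zone(40.0, 0.0))    # central, so not in a crossing zone
```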
    After restricting our analysis to possessions in the potential crossing zones, we identify
the final event that occurred in the zone, and we record the spatio-temporal information of
all players at that moment. The last event in the potential crossing zone will be either a
cross or non-cross (i.e. pass or dribble).
    In particular, we remove possession sequences that correspond to corner kicks and free
kicks. Note that corner kicks and free kicks are not open crosses, but could possibly occur
in a wide position of an attacking third of the pitch. We have N = 2225 final events in
potential crossing zones throughout the 30 matches.

Figure 3.1: Examples of possession sequences with (a) a crossing attempt and (b) without
a crossing attempt.
3.2.2    Crafting Situational Variables
Building on previous research that evaluates passing ability (Szczepanski and McHale 2016,
Power et al. 2017), we propose variables specific to the context of crossing.
   It is a tenet of soccer that time and space are paramount factors that lead to improved
attacking outcomes. From the tracking data, it is possible to determine the location and
velocity of both the ball and the player of interest. The location and velocity measurements
form the basis for the situational variables presented in Table 3.1. Recall that the situational
variables τ, z1 , ..., z9 form the columns of a design matrix Z where the rows of Z are crossing
opportunities corresponding to the final event in a possession sequence occurring in potential
crossing zones. Most of the situational variables in Table 3.1 are self-explanatory. The
variable z2 (nearest defender distance) is a measure of defensive pressure on the sender.
However, it does not account for situations where multiple defenders are covering the
sender, nor for the location of the defender relative to the sender. A defender standing one
metre in front of the sender exerts very different pressure from one standing one metre
behind. The variable z3, indicating the space controlled within 2 metres by the sender, has
been introduced using ideas from Fernandez and Bornn (2018) and Fernandez et al. (2019).
Although we experimented with many other crossing variables, the variables presented in
Table 3.1 are those that provided an excellent fit for the logistic model of Section 3.3.

   Variable                                  Definition of Variable
   τ = 1 (0)    -   the ball is crossed (not crossed)
      z1        -   score differential wrt the team in possession
      z2        -   distance between the sender and nearest defender
      z3        -   space controlled by the sender
      z4        -   distance between the sender and nearest teammate
      z5        -   distance between the sender and the endline
      z6        -   ratio of the number of offensive players to defensive players in the box
      z7        -   indicator variable corresponding to whether the sender is a defender
      z8        -   indicator variable corresponding to whether the sender is a midfielder
      z9        -   indicator variable for last 10 minutes of a half

Table 3.1: A subset of situational variables relevant to crossing which form the columns of
the design matrix Z. All distances are measured in metres.
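
As an illustration of how the distance-based variables in Table 3.1 could be computed from a single frame of tracking data, the following minimal Python sketch evaluates z2, z4 and z5. The coordinate system, the pitch length of 105 metres, the player positions and the helper name nearest_distance are all assumptions made for this example; they are not taken from the thesis data or code. The last two lines illustrate the limitation of z2 noted above: a defender one metre in front of the sender and one a metre behind yield the same value.

```python
import math

def nearest_distance(sender, others):
    """Euclidean distance (metres) from the sender to the closest point in `others`."""
    return min(math.dist(sender, p) for p in others)

# Hypothetical positions in metres: x runs along the pitch length, y along the width.
sender = (95.0, 10.0)
defenders = [(96.0, 10.0), (90.0, 14.0)]
teammates = [(92.0, 6.0), (99.0, 30.0)]
pitch_length = 105.0  # assumed; the attacking endline sits at x = pitch_length

z2 = nearest_distance(sender, defenders)   # distance to nearest defender
z4 = nearest_distance(sender, teammates)   # distance to nearest teammate
z5 = pitch_length - sender[0]              # distance to the endline

# z2 ignores direction: both configurations below give the same pressure value.
in_front = nearest_distance(sender, [(96.0, 10.0)])
behind = nearest_distance(sender, [(94.0, 10.0)])
```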

3.2.3    Outcome Variable
We require a response variable that allows us to assess whether crossing is beneficial. The
obvious candidate is the variable Y1 = 1(0) according to whether a crossing opportunity led
(did not lead) to a goal. Although scoring and preventing goals is the primary objective of
soccer teams, goal scoring is a rare event with only 2.5-3.0 goals scored per game on average
in top European soccer leagues. Therefore, it is difficult to tease out subtle inferences when
goal scoring is used as the dependent variable.

Alternative indicator variables that we have considered for the response are Y2, whether
a crossing opportunity led to a shot on goal, and Y3, whether a crossing opportunity led to
a shot. The event Y2 = 1 is more common than Y1 = 1, and Y3 = 1 is more common than
Y2 = 1. Because a more frequent outcome makes subtle effects easier to detect, we prefer
the response variable Y = Y3 . We note that shot statistics (as opposed to goal statistics)
are prevalent in the hockey analytics literature and are
referred to as Fenwick and Corsi (Vollman, Awad and Fyffe 2016).
   Clearly, shots do not necessarily occur immediately after a cross. Therefore, we introduce
a tuning parameter k where a success (shot attempt) is defined as having occurred within
the next k events. If the team maintains possession after the ball exits the potential crossing
zone and a shot attempt occurs within the next k events, then Y = 1, otherwise Y = 0. In
this application, we set k = 5. The idea to let the play “unfold” was used by Schuckers and
Curro (2013) in the context of player evaluation in hockey. Using the above definition for Y ,
we observed 274 shots arising from the N = 2225 crossing opportunities. With the choice
k = 5, it took 2.61 seconds on average for a shot to occur after a cross. Also, the offensive
team retained possession (and did not cross the ball) 14.92% of the time (332 out of the
2225 cases). We recognize that k is a tuning parameter and we have experimented with
different values for k, such as k = 4, 6, 7, and found little difference in the results. Another
possible way of defining the response variable involves the consideration of time until a shot
occurs. For example, Espasinghege Dona and Swartz (2022) define Y according to whether
a shot occurs by the end of possession.
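
The definition of Y with the tuning parameter k can be sketched as follows. This is a minimal Python illustration only: the event representation (a list of dictionaries with "team" and "type" keys) and the function name label_response are hypothetical, and the sketch treats any event by the opposing team as a loss of possession, which is a simplifying assumption rather than the rule used in the thesis.

```python
def label_response(events, start, k=5):
    """Return 1 if the team acting at index `start` takes a shot within the
    next k events without the other team recording an event first, else 0."""
    team = events[start]["team"]
    for event in events[start + 1 : start + 1 + k]:
        if event["team"] != team:
            return 0  # assumed rule: an opposing event means possession was lost
        if event["type"] == "shot":
            return 1
    return 0  # no shot within the k-event window

# Hypothetical possession sequence: a cross, a pass, then a shot by team A.
sequence = [
    {"team": "A", "type": "cross"},
    {"team": "A", "type": "pass"},
    {"team": "A", "type": "shot"},
]
y = label_response(sequence, start=0, k=5)
```

With the counts reported above, 274 successes out of N = 2225 opportunities corresponds to a success rate of roughly 12.3%.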

3.3     A Model for the Crossing Decision
We first consider how T (i.e. the variable denoting the decision to cross, written τ in Table 3.1) depends on situa-
tional variables as expressed by Z (see Table 3.1). For this, we consider a logistic regression
model based on the N = 2225 crossing opportunities where T ∼ Bernoulli(pT ) and

                                  logit(pT ) = λ0 + λZ .                                  (3.1)

   Parameter estimates and standard errors for the significant terms corresponding to
model (3.1) are given in Table 3.2. To get a sense of the relative importance of the terms,
the third column in Table 3.2 provides the parameter estimate multiplied by the mean value
of its corresponding covariate. A notable observation is that given a crossing opportunity,
crossing the ball is less frequent than not crossing the ball. For example, when the mean
values of the covariates are substituted into the fitted equation corresponding to (3.1), the
probability of a cross is Prob(T = 1) = 0.130. We also note that all of the parameters in
Table 3.2 are highly significant except for z1 (p-value = 0.040) and z9 (p-value = 0.051).
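
To make the two calculations in the preceding paragraph concrete, the minimal Python sketch below shows (i) how a "relative importance" term is formed as the estimate multiplied by the covariate mean and (ii) how a fitted probability is obtained from the logit scale. The intercept, coefficients and covariate means are hypothetical placeholders chosen for illustration; they are NOT the estimates of Table 3.2. The only value taken from the text is the reported probability 0.130, which corresponds to a linear predictor of about -1.90.

```python
import math

def inv_logit(eta):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Map a probability to the linear-predictor (log-odds) scale."""
    return math.log(p / (1.0 - p))

# Hypothetical values for illustration only; not the estimates of Table 3.2.
intercept = -2.0
coefs = {"z2": 0.10, "z6": 0.50}
means = {"z2": 3.0, "z6": 0.8}

# Relative importance: parameter estimate times the mean of its covariate.
contributions = {name: coefs[name] * means[name] for name in coefs}

# Fitted probability of a cross with covariates set to their mean values.
eta = intercept + sum(contributions.values())
p_cross = inv_logit(eta)

# Linear predictor implied by the probability reported in the text.
eta_reported = logit(0.130)
```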
   The coefficients in Table 3.2 also correspond to our soccer intuition. For example, we
see that an increase in the ratio of offensive players in the box to defensive players in the
box leads to an increased probability of crossing (i.e. positive coefficient of z6 ). The most
