Predicting player movements in soccer using Deep Learning
A comparative study between LSTM and GRU on a real-life sports case
Joris Verpalen
ANR: 115394
Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science in Communication and Information Sciences,
Master track Data Science Business & Governance,
at the school of humanities and digital sciences
of Tilburg University
Thesis Committee:
Prof. Eric Postma & Sebastian Olier
May 13th, 2019
1. Preface

I hereby present my master's thesis on predicting player movements in soccer using deep learning. This study was performed in partial fulfillment of the requirements for the Master of Science in Communication and Information Sciences and, more specifically, for the master track Data Science Business and Governance (DSBG) at Tilburg University.
I want to thank my supervisors Prof. Dr. Eric Postma and Sebastian Olier for their feedback, guidance, and support in writing this thesis. Additionally, I would like to thank A.C. Vu, a friend of mine who was willing to help with cleaning and constructing the .csv data used for this study. I also want to thank I. Maat, a friend of mine who works as an app developer and advised me on the programming issues I encountered during this process.
2. Abstract

Analytics is becoming increasingly important in the domain of sports. The way analytics is performed in sports has changed rapidly in recent years, mainly because of the availability of better technology and the development of many applications in the field of computer science. One way of analyzing sports and performance is through the prediction of players' positions and their movement on the field. The most well-known application in which predicting future movement is an important aspect is multi-object tracking. Recently, deep learning techniques have often been applied for this purpose, mainly LSTM and GRU, since these methods are known to deal with long-term dependencies in a reliable fashion. However, to the best of my knowledge, these techniques have not yet been applied in a multi-object tracking setting in a real-life soccer case. Therefore, this study aims to compare the performance of the LSTM and the GRU in a real-life soccer setting, using sensor data of the players' positions, and to pave the way for multi-object tracking using deep learning in soccer. The research question of this study is as follows:

"To what extent does the use of LSTM and GRU contribute to the prediction of player movements in soccer?"
To answer this research question, we used a data set containing the x, y coordinates of the players of a professional soccer club in Norway. After pre-processing the data, six experiments are performed to test the predictive ability of the LSTM and GRU in this setting. The results of the first two experiments show that the data needs more appropriate scaling to be suitable for learning and prediction. Therefore, the data is converted to absolute differences in coordinates between two subsequent measurements. Based on this data, four experiments are performed: 1) varying the number of timestamps in the input sequences, 2) doubling the time between timestamps, 3) testing how far into the future we can predict, and 4) predicting the players' trajectories 40 timestamps into the future. The results show that both the LSTM and the GRU are well capable of predicting the next change in the players' positions. However, on longer input sequences, the predictive performance decreases. This may be due to the depth of the models, which can be seen as a limitation of this study. On average, however, the LSTM and GRU perform equally well, predicting the future positions of the soccer players with a low error compared to the benchmark. We can therefore conclude that both the LSTM and the GRU are suitable deep learning techniques for predicting movement in a soccer setting. This study only used sensor data of the soccer players to predict movement; in multi-object tracking, however, visual data is usually taken as input. Therefore, the most important suggestion for future research is to use video or image data to extract the features (i.e., soccer players) and use these as input for prediction. In this way, a next step in applying deep learning techniques to multi-object tracking in soccer can be realized.
Contents

1. Preface
2. Abstract
3. Predicting player movement in soccer
   3.1 Research question
      3.1.1 Goal
      3.1.2 Research question
   3.2 Overview study
   3.3 Approach
4. Related work
   4.1 Predicting future movement and positions
   4.2 Predicting future movement and positions in sports
   4.3 The use of deep learning to predict future movement and positions
5. Methods
   5.1 Data set description
   5.2 Pre-processing data
   5.3 Algorithms and software
6. Experimental set-up
   6.1 Model description
   6.2 Hyperparameter settings
   6.3 Experiments to perform
   6.4 Evaluation metrics
      6.4.1 Metrics
      6.4.2 Benchmark
7. Results
   7.1 Experiment 1: Varying number of timestamps as input sequences
   7.2 Experiment 2: Doubling the time between timestamps
   7.3 Experiment 3: Varying number of timestamps as input sequences when using absolute differences between coordinates
   7.4 Experiment 4: Doubling the time between timestamps when using absolute differences between coordinates
   7.5 Experiment 5: How far can we predict?
   7.6 Experiment 6: Predicting longer trajectories
8. Discussion & Limitations
9. Conclusions
10. Future work
11. References
3. Predicting player movement in soccer

The analysis of team and player performance in sports has changed rapidly in recent years. This is mainly because of the availability of better technology and the development of many applications in the field of computer science (Cust, Sweeting, Ball, & Robertson, 2019). In soccer, the demand for automated analysis has increased rapidly because it provides valuable information in at least two ways. First, it provides information for managers and athletes in terms of individual or team performance and for the development of proper tactics. Second, it can provide useful insights for spectators, helping them better understand a soccer match (Kim, Moon, Lee, Nam, & Jung, 2018).
One way of analyzing sports and performance is through the prediction of players' positions and their
movements in the field. The most well-known application where predicting future movement is key is
in object tracking, which is a domain within computer vision. The goal of object tracking is to infer
trajectories of persons as they move around (Sadeghian, Alahi, & Savarese, 2017). This basically means
that the interest is in predicting the next movement and the accompanying new location of the object.
In object tracking, we can distinguish between two types of data. First, visual data such as videos or images is widely used in many sports analytics applications. Second, data from IMUs (Inertial Measurement Units), i.e., wearable sensors, is widely used as well. Here, athletes are usually equipped with sensor belts to gather information about their position, speed, distance covered, heart rate, and so on. These sensors can thus collect information, such as movement patterns, in a reliable fashion (Camomilla, Bergamini, Fantozzi, & Vannozzi, 2018).
A development in analytics is that deep learning is being used more frequently. Using predictive approaches with deep learning on IMU or sensor data is a rapidly developing field of interest as well (Xiang, Alahi, & Savarese, 2015). Examples of deep learning applications in multi-object tracking can be found outside the sports domain; in particular, predicting the movements of pedestrians, cars, and so on is widely studied (Kim, et al., 2017; Alahi, et al., 2016). Predicting the movement of pedestrians especially has many similarities to a soccer setting. First, the interest is in multiple objects (i.e., people) who look very similar (e.g., due to matching outfits). Second, in both settings people are very close to each other (Gade & Moeslund, 2018). Third, in tracking pedestrians as well as athletes, sudden changes in motion appear very often, which makes it harder to infer trajectories and predict their next positions (Ben Shitrit, Berclaz, Fleuret, & Fua, 2011).
In multi-object tracking, results have shown that convolutional neural networks and recurrent neural networks perform reasonably well (Ma, et al., 2018). Based on this, Ma et al. (2018) opted for the use of GRUs (gated recurrent units) and found better results compared to other deep learning methods. Other studies on multi-object tracking opt for the use of LSTM (long short-term memory) to infer trajectories (Kim, et al., 2017; Alahi, et al., 2016; Xue, Huynh, & Reynolds, 2018). From a theoretical point of view, this makes sense because gated recurrent units and long short-term memory networks are better able to deal with long-term dependencies than plain recurrent neural networks. Within soccer, the application of deep learning methods to predict movement and future positions appears to be limited. Some examples can be found in other sports, such as basketball, where Shah and Romijnders (2016) applied long short-term memory to model the trajectories of a basketball, with only the coordinates of the ball as input variables.
Since the application of deep learning methods to predict future movement and positions in soccer seems to be relatively new, we will attempt to pave the way for this in a real-life soccer setting. In this study, we will compare the application of LSTM with the use of GRU to infer trajectories and predict the movement of multiple athletes. To the best of my knowledge, these methods have not been applied to a real-life soccer case. Therefore, it is interesting to compare these models, which have already proven themselves in other settings, and see how they perform in a real-life soccer setting.

In their paper, Pettersen et al. (2014) presented a dataset of sensor data (i.e., GPS coordinates of the soccer players' positions). Predicting the movement and future positions of all the soccer players using their sensor dataset could therefore be an interesting application that contributes to the sports analytics domain and paves the way for visual multi-object tracking (i.e., using video or image data) in a soccer setting.
3.1 Research question

3.1.1 Goal

The goal of this study is to investigate whether the use of LSTM and GRU, which have been applied in other settings before, can contribute to the prediction of the future movements of soccer players. Therefore, the research question of this study can be defined as follows:

3.1.2 Research question

Research question: "To what extent does the use of long short-term memory and gated recurrent unit contribute to the prediction of player movements in soccer?"
3.2 Overview study

The rest of this study is organized as follows. In chapter 4, an overview of the related work is presented, with an extensive review of the relevant literature. In chapter 5, we start by describing the data set used for this study; additionally, a brief explanation of the data pre-processing is given, and the algorithms and software used for this study are described. Next, chapter 6 presents the experimental set-up and the metrics used for evaluating the performance of the methods. Subsequently, the results of the experiments are presented, compared, and visualized in chapter 7, and these results are discussed in chapter 8, together with the limitations of this study. In chapter 9, the conclusions that can be drawn from this study are elaborated on and, additionally, recommendations for future work are given in chapter 10. Finally, an overview of the references used in this study is presented in chapter 11.
3.3 Approach

The approach for answering the research question of this study starts by delving further into the work that has already been done on this topic. This gives us a good idea of what has been done on this topic already and how we can embed our study in the literature and contribute to the further development of analytics in sports using deep learning. Then we explore our dataset and, if needed, modify the data to make it appropriate for this study. When the data is prepared and ready for use, we evaluate the available code from GitHub, the Deep Learning lectures at Tilburg University, and other sources for constructing the LSTM and Gated Recurrent Unit models. When the models are constructed, we run our dataset through these models and evaluate their performance by comparing them to each other. This allows us to create a base case for predicting players' movements using deep learning in soccer. The results provide the input to draw conclusions, elaborate on limitations, and provide the reader with suggestions for future work.
4. Related work

This chapter gives an overview of the related work that has already been done within the domain of this study. We discuss three things. First, we define the domain of predicting future movements. Second, we make a link to sports and elaborate on the challenges associated with predicting future movements in sports. Third, we conclude this chapter by elaborating on the trend of applying deep learning techniques to predict movement and future positions, and on the work that has been done in and outside the domain of sports.
4.1 Predicting future movement and positions

The problem of tracking moving targets and predicting future movements is actively studied nowadays. Applications can be found in autonomous robots such as self-driving cars, where the importance lies in predicting the future trajectories of other objects or humans in order to avoid collisions (Nikhil & Tran Morris, 2018). This problem of trajectory prediction can be viewed as a sequence generation task, where the main interest is in predicting the future positions of people or objects at different time instances based on their past positions (Alahi, et al., 2016). The most widely recognized field where predicting future movement is the main interest is object tracking, a domain in the field of computer vision. The goal of object tracking is to infer the trajectories of objects or persons as they move around (Sadeghian, Alahi, & Savarese, 2017). The basic idea is to predict the next movement and the accompanying new position or location of the object. Especially multi-object tracking is a challenging task, since it involves multiple objects or persons to be tracked instead of just one (Kim, Moon, Lee, Nam, & Jung, 2018).
From the literature, we can derive that two types of data are usually used for object tracking. First, visual data such as videos or images; this is the most used data type for this purpose. Second, data from IMUs (Inertial Measurement Units), i.e., sensors, is sometimes used as well. Here, the objects are equipped with tools to gather information about their position, speed, distance covered, and more. With these tools, we can collect information, such as movement patterns, in a reliable fashion (Camomilla, Bergamini, Fantozzi, & Vannozzi, 2018). Considering that the development of algorithms to model sensory inputs and infer predictions based on them is still an unsolved problem (Felsen, Agrawal, & Malik, 2017), this is an interesting area for further study. In the next paragraph, we narrow this problem down to the sports analytics domain.
4.2 Predicting future movement and positions in sports

In many team sports, one team tries to score a goal while the defending team constantly tries to estimate the next move of the attacking team in order to prevent them from scoring, and vice versa. This can be described as the human activity of making predictions or inferences about the future and acting accordingly based on these inferences (human intelligence). Automated analysis of player movements can help managers and athletes make better tactical decisions, which benefits individual as well as team performance. The analysis of team and player performance in sports has changed rapidly in recent years, mainly because of the availability of better technology and the development of many applications in the field of computer science (Cust, Sweeting, Ball, & Robertson, 2019).
One of the tools for analyzing sports is automatic tracking systems. With the use of player tracking, valuable information can be discovered about a player's or team's movement during a match (Lara, Vieira, Misuta, Moura, & Barros, 2018). Tracking systems in which predictions are made about players' or the ball's trajectories have been studied for many sports, including American football (Lee & Kitani, 2016), basketball (Shah & Romijnders, 2016; Zhao, Yang, Chevalier, Shah, & Romijnders, 2018), soccer (Barber & Carré, 2010), and table tennis (Zhang, Xu, & Tan, 2010). Object tracking in team sports is particularly difficult. For example, in soccer, multi-object tracking (i.e., tracking multiple players at the same time) is particularly challenging due to the number of objects to be tracked (Pettersen, et al., 2014). Additionally, soccer is a dynamic sport in which abrupt changes in a player's motion occur frequently, and there are many cases where players are very close to each other, which makes it harder to determine which trajectory belongs to which player (Kim, Moon, Lee, Nam, & Jung, 2018).
4.3 The use of deep learning to predict future movement and positions

Applying predictive approaches such as multi-object tracking with deep learning is a rapidly developing field of interest (Xiang, Alahi, & Savarese, 2015). Therefore, we now discuss some of the applications of deep learning within the domain of predicting future movement and positions, including applications of deep learning methods for multi-object tracking within sports.

Examples of deep learning applications to multi-object tracking can be found in multiple domains; in particular, predicting the movements of pedestrians, cars, and so on is widely studied (Kim, et al., 2017; Alahi, et al., 2016). Inferring the trajectories of pedestrians especially has many similarities to team sports such as soccer. One reason is that in both cases, the interest is in tracking multiple objects that look very similar in a crowded space. Second, in both settings, the objects to be tracked are very close to each other (Gade & Moeslund, 2018). Third, in tracking pedestrians as well as athletes such as soccer players, sudden changes in motion occur frequently, which makes it harder to infer trajectories and predict future positions (Ben Shitrit, Berclaz, Fleuret, & Fua, 2011).
In multi-object tracking, results have shown that convolutional neural networks and recurrent neural networks perform reasonably well. However, Ma et al. (2018) opted for the use of GRUs and found better results compared to other deep learning methods. Other studies that focus on multi-object tracking opt for the use of LSTM to infer trajectories and predict future movement and positions (Kim, et al., 2017; Alahi, et al., 2016; Xue, Huynh, & Reynolds, 2018). The better performance of GRUs and LSTMs compared to RNNs or CNNs makes sense from a theoretical point of view, because GRUs and LSTMs are better capable of dealing with long-term dependencies than plain recurrent neural networks, for example. Multi-object tracking using deep learning has already been applied in many sports domains. In basketball, for example, researchers used LSTM to infer the trajectories of the basketball itself, with only the coordinates of the basketball as input variables (Shah & Romijnders, 2016; Zhao, Yang, Chevalier, Shah, & Romijnders, 2018). Another example is water polo, where Felsen, Agrawal, and Malik (2017) used RNNs to track water polo players. In volleyball, trajectory prediction of the volleyball has been performed using neural networks (Suda, Makino, & Shinoda, 2019).
However, the use of LSTMs and GRUs for multi-object tracking in a sports setting, and especially in soccer, appears to be limited thus far. Since the application of these deep learning methods to predict future movement and positions in a soccer setting seems to be relatively new, this study attempts to pave the way in a real-life soccer case. To the best of my knowledge, these methods have not been applied in such a setting using only sensor data of the players' positions on the field. Using LSTM and GRU on sensor data alone could be a first step towards multi-object tracking with these methods in a real-life soccer setting.
5. Methods

In this section, we describe our research method. We start with a description of the data set used for this study. Furthermore, we explain how this data is pre-processed to make it suitable for the experiments conducted within this study and the accompanying research goal. We conclude this chapter by briefly explaining the algorithms and software used for this study; a more in-depth explanation of the algorithms and software, with the accompanying experiments, is given in chapter 6.
5.1 Data set description

The dataset used for this study is a player positions dataset (Pettersen, et al., 2014) of professional soccer players. The data was gathered at the Alfheim Stadium, the stadium of Tromsø IL, a professional soccer club in Norway. An overview of the size of the pitch can be found in figure 1. The data set contains body sensor data of the players of Tromsø IL, with information about their position on the field, expressed in Cartesian coordinates¹, speed, acceleration, sprint distance, heading, direction, energy consumption, total distance covered, a unique (anonymized) player ID, and a timestamp. The data is gathered with a ZXY Sports Tracking System, captured every 5 milliseconds, and stored in csv files. Additionally, the tag-ids of the players have been randomized in order to anonymize the individual soccer players. An example of the raw sensor data can be found in figure 2.

¹ Cartesian coordinates specify each object (i.e., soccer player) by a set of numerical values, which are the shortest distances from the object to the reference point.

Figure 1: An overview of the pitch at Alfheim Stadium, Norway (Pettersen, et al., 2014). From this we can derive that the x, y coordinates lie between (0, 0) and (105, 68).
Timestamp        ID  X_pos    Y_pos     Heading  Direction  Energy    Speed   Total_distance
7-11-2013 21:05   2  35.2178  30.15277  1.82346  -1.7425    142.9266  0.3286  203.4843
7-11-2013 21:05   6  47.1549  19.9495   2.1015    2.0128    243.2256  0.5106  278.1391
7-11-2013 21:05   7  41.7908  49.8916   1.9047   -1.6392    155.0570  0.5365  380.0208
7-11-2013 21:05   8  53.2812  41.9960   2.6029   -3.0881    165.0328  1.0327  287.2322
7-11-2013 21:05  10  39.4537  28.8088   1.2822    2.7842    330.9677  1.4894  333.4538

Figure 2: An example of the raw ZXY sensor data. From left to right, the columns describe: the timestamp of the measurement, the anonymized player ID, the position in meters on the x-axis of the field, the position in meters on the y-axis of the field, the heading and direction in which the player is going, the energy level, the speed, and the total distance covered so far.
The ZXY data are captured every 5 milliseconds. For example, the first measurement is taken at 2013-11-03 18:01:09, whereas the second measurement takes place at 2013-11-03 18:01:09:05. However, it occurs relatively often that not all players' coordinates are recorded every 5 milliseconds and, as a consequence, the number of coordinates available differs per player. Because the goal of this study is to predict the future movement and positions of all 11 soccer players, not all data is useful, and some pre-processing is needed to make this data set suitable for the experiments conducted within this study. These pre-processing actions are discussed in the next paragraph.
Player ID       1       2       5       7       8       9      10      12      13      14
# Coordinates  44.555  56.436  56.242  56.273  56.478  55.399  56.505  47.478  56.156  56.499

Figure 3: An example of the anonymized player IDs and the number of x, y coordinates available per player. From this, we can conclude that not all players have the same number of coordinates available, which means that not all data is useful for this study.
5.2 Pre-processing data

The ZXY data, which comprise the actual positions of the soccer players on the field, are defined in x, y coordinates, where x represents the position in meters on the x-axis of the field and y the position in meters on the y-axis (see figure 1). As described in section 5.1, the ZXY sensor data contain anonymized player IDs, which enables us to predict each player ID's position based on its x, y coordinates. However, not all player IDs have the same number of coordinates (see figure 3), because it occurs relatively often that not all players' coordinates are recorded every 5 milliseconds. Since the goal of this study is to predict the future positions of all eleven players at the same time, only part of the data is useful for this study. Additionally, this study intends to use the coordinates of at most 100 subsequent timestamps of all 11 players as input for the model and, as output, to predict the coordinates (i.e., the players' positions) at the next timestamp. Therefore, we only extracted the timestamps that have coordinates of all 11 players and can complete a sequence of 100 subsequent timestamps. The selection of these timestamps and accompanying player IDs is performed in Excel. First, the data of the player IDs that are not useful (e.g., substitutes) are deleted. Second, all timestamps are given a timestamp ID (i.e., timestamp 0.00.05 = 1, timestamp 0.00.10 = 2, and so on).
Time  X player 1  Y player 1  X player 2  Y player 2  X player 3  Y player 3  X player 4  Y player 4
   1     26.60       29.40       35.61       30.32       41.61       38.79       28.52       39.67
   2     26.66       29.51       35.53       30.35       41.69       38.77       28.58       39.61
   3     26.62       29.53       35.56       30.39       41.63       38.71       28.65       39.57
   4     26.70       29.59       35.59       30.47       41.57       38.63       28.67       39.52
   5     26.73       29.52       35.71       30.45       41.50       38.61       28.69       39.52
After cleaning the data, we have the remaining coordinates of the 11 player IDs that we need. Next, we copied the timestamp IDs and deleted all duplicates to obtain a list of unique timestamp IDs. Using COUNTIF statements, we counted the number of unique players per timestamp. All timestamps that do not have the coordinates of all 11 player IDs are deleted. Finally, the data is transposed so that each row contains a timestamp and the x, y coordinates of all eleven player IDs. For an example of the data after the pre-processing stage, see figure 4.
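The filtering steps above (assigning timestamp IDs, counting players per timestamp with COUNTIF, keeping only complete timestamps, and transposing) could equally be expressed in Python with pandas. The sketch below uses toy data and assumed column names, not the actual files of this study:

```python
import pandas as pd

# Toy long-format stand-in for the cleaned ZXY data: one row per
# (timestamp, player) measurement. Column names are assumed.
long = pd.DataFrame({
    "t":  [1, 1, 2, 2, 2, 3, 3],
    "id": [1, 2, 1, 2, 3, 1, 3],
    "x":  [26.60, 35.61, 26.66, 35.53, 41.69, 26.62, 41.63],
    "y":  [29.40, 30.32, 29.51, 30.35, 38.77, 29.53, 38.71],
})

n_players = long["id"].nunique()            # 3 here; 11 in the study

# COUNTIF equivalent: number of distinct players recorded per timestamp.
counts = long.groupby("t")["id"].nunique()

# Keep only the timestamps at which every player was recorded.
complete_t = counts[counts == n_players].index
complete = long[long["t"].isin(complete_t)]

# Transpose step: one row per timestamp, one x and one y column per player.
wide = complete.pivot(index="t", columns="id", values=["x", "y"])
print(wide.shape)  # (1, 6): only timestamp 2 has all three players
```

In this toy example, only timestamp 2 contains measurements for all players, so a single complete row survives, mirroring how incomplete timestamps are dropped from the real data set.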
Figure 4: An overview of part of the data set after pre-processing in Excel. The first column shows the timestamp at which the data are gathered by the sensors; the subsequent column pairs show the x-coordinate and y-coordinate in meters of players 1 to 4.

Due to the large amount of data, the pre-processing is executed on an offsite server, accessed via a virtual machine running on an octa-core Intel Xeon CPU @ 3.33 GHz with 16 GB RAM. After pre-processing the data, what remains is the data set that serves as input for the model. The data consists
of 56,808 timestamps with the x, y coordinates of eleven players. From these timestamps, multiple
overlapping input sequences are generated. For example, in the setting in which 5 timestamps are
used as input, the first sequence consists of timestamps 1, 2, 3, 4, 5 and the second of timestamps
2, 3, 4, 5, 6, and so on. Sliding the window one timestamp at a time in this way yields nearly as many
sequences as there are timestamps. The same holds when we experiment with longer sequences (i.e., 20,
50, and 100 timestamps, respectively). The sequence-generating process is executed in Python.
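This sliding-window construction can be sketched in Python, assuming the pre-processed data as a NumPy array with one row per timestamp and 22 coordinate columns (eleven players times x and y); the function name and the toy array below are illustrative only:

```python
import numpy as np

def make_sequences(data, seq_len):
    """Slide a window of `seq_len` timestamps over the coordinate array.

    `data` has one row per timestamp and one column per coordinate
    (22 columns for the x, y positions of eleven players). Returns
    inputs X of shape (T - seq_len, seq_len, features) and targets y
    of shape (T - seq_len, features): each target is the timestamp
    immediately after its window.
    """
    X = np.stack([data[i:i + seq_len] for i in range(len(data) - seq_len)])
    y = data[seq_len:]
    return X, y

# Toy illustration with 10 timestamps and 2 coordinate columns.
toy = np.arange(20).reshape(10, 2)
X, y = make_sequences(toy, 5)
print(X.shape, y.shape)  # (5, 5, 2) (5, 2)
```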
Additionally, this study intends to experiment with better-scaled data as well. The reason for this is the
frequency at which coordinates are generated (i.e., every 5 milliseconds). Because of this, we expect
little difference in coordinates between two subsequent timestamps, which may have a negative impact
on the ability of our models to learn trajectories. Therefore, we additionally make a copy of the pre-
processed data and convert the coordinates into the differences in coordinates between
subsequent timestamps. For example, when the x, y coordinate at timestamp 1 is (48, 29) and at
timestamp 2 it is (47, 32), the new difference coordinate becomes -1 for the x-coordinate
and +3 for the y-coordinate. (Although we refer to these as absolute differences throughout, the values are signed displacements, as this example shows.)
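A minimal sketch of this conversion using NumPy's signed differencing (the function name is illustrative; note that the resulting values are signed, as in the example above):

```python
import numpy as np

def to_differences(coords):
    """Per-timestamp displacement of every coordinate column.

    `coords` holds the x, y positions (in meters) per timestamp; the
    result has one row fewer and contains the signed change between
    consecutive timestamps.
    """
    return np.diff(coords, axis=0)

# The example from the text: (48, 29) at timestamp 1, (47, 32) at timestamp 2.
print(to_differences(np.array([[48, 29], [47, 32]])))  # [[-1  3]]
```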
5.3 Algorithms and software
The goal of this study is to test the predictive ability of Long Short-Term Memory (LSTM) and
Gated Recurrent Unit (GRU) models in a real-life soccer setting and to pave the way for predicting future
movements and positions in the sports domain using deep learning. The LSTM and the GRU are therefore
the most important algorithms used in this study. These models are built in Python, supported by Keras
and TensorFlow. In the next chapter we further elaborate on the models and the algorithms used for this study.
6. Experimental set-up
In this section we describe the models that we propose for this study and the tasks that the models
should perform. Additionally, we present the algorithms that we use as well as the parameters of the
models. Finally, the metrics used for evaluating and comparing the performance of both models
are described and motivated.
6.1 Model description
Long Short-Term Memory (LSTM) is a modification of the RNN that is able to learn long-term
dependencies. The LSTM was introduced by Hochreiter and Schmidhuber (1997) and its main goal is to
remember information over longer time periods. The Gated Recurrent Unit (GRU), introduced by Cho et al. (2014),
is likewise an improvement on the recurrent neural network and is quite similar to the LSTM.
A GRU can be trained to retain relevant information from the past while discarding information that is not
relevant for the prediction. This works as follows: it has a reset gate
(to determine how the new input should be combined with the previous memory) and an update gate (which is
responsible for determining the amount of previous information to keep). Like the LSTM, the GRU thus aims
to learn long-term dependencies through gating units that regulate the information flow
within the unit; the main difference is that the GRU does so without the separate memory cells that the
LSTM encompasses (Chung, Gulcehre, Cho, & Bengio, 2014). For an overview of the LSTM and GRU, see figure 5.
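The gating mechanism described above can be illustrated with a single GRU step in plain NumPy (a didactic sketch following the formulation of Cho et al. (2014), not the Keras implementation used in this study; parameter names and shapes are chosen for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, p):
    """One GRU step for input x and previous hidden state h.

    p maps names to weights: W* act on the input, U* on the previous
    state, b* are biases, for the update gate z, the reset gate r,
    and the candidate state.
    """
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])  # update gate: how much old state to keep
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])  # reset gate: how much old state feeds the candidate
    cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1 - z) * h + z * cand  # blend previous state and candidate

# With all-zero parameters, both gates are 0.5 and the candidate is 0,
# so the new state is simply half the old one.
p = {k: np.zeros((3, 2)) for k in ("Wz", "Wr", "Wh")}
p.update({k: np.zeros((3, 3)) for k in ("Uz", "Ur", "Uh")})
p.update({k: np.zeros(3) for k in ("bz", "br", "bh")})
h_new = gru_cell(np.zeros(2), np.ones(3), p)
print(h_new)  # [0.5 0.5 0.5]
```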
The task of both models, LSTM and GRU, in this setting is to predict the x, y coordinates of all eleven
soccer players at timestamp t, based on an input sequence of n timestamps. For example, the x, y
coordinates at times t1, t2, t3, t4, t5 are used to predict the x, y coordinates at time t6. More
variations can be thought of and will be experimented with; these are described in paragraph
6.3. First, we discuss the hyperparameters used for our models in the next paragraph.
Figure 5: An overview of the LSTM and GRU
6.2 Hyperparameter settings
The (hyper)parameters of interest for this study consist of the train-test set ratio, the number
of input sequences (i.e., how many timestamps are used as input for prediction), the number of nodes
per layer, the number of hidden layers, and the learning rate. We will now discuss the selection of these
(hyper)parameters.
The train-test set ratio is set at 80-20, since the study of Ma et al. (2018) also used this ratio;
moreover, this ratio is often chosen in other studies on object tracking as well. The number of
timestamps used as input for the model is varied in order to see how the model performs when the
input sequence grows. However, the dataset consists of 56,808 timestamps and, consequently, one
must keep in mind not to generate sequences that are so long that too little data remains for properly
training the model. Therefore, we run the models on sequences with 5, 20, 50, and 100 timestamps
as input. The number of hidden layers is set at 4, since Ma et al. (2018) showed that this number of
layers performs best. The number of nodes per hidden layer is set at 22. Additionally, the learning rate
is set at l = 0.001, with the rectified linear unit as activation function and the mean squared error as
loss function. In the next paragraph we describe the experiments performed for this study.
6.3 Experiments to perform
The experiments performed for this study are six-fold. We start with a naïve
approach. As explained in the pre-processing section (paragraph 5.2), the player coordinates are
measured every 5 milliseconds, which means that we expect little difference in coordinates between
two subsequent timestamps. We may therefore expect our models to behave poorly when using the raw
coordinates as input data. We start by assessing the predictive performance on these coordinates
while varying the number of timestamps used as input. That is, we predict the x, y coordinates at
timestamp t based on input sequences of 5, 20, 50, or 100 timestamps. For example, the x, y
coordinates at times t1, t2, t3, t4, t5 are used to predict the x, y coordinates of all eleven soccer players at time t6.
Next, we test the performance of both the LSTM and the GRU with a varying number of timestamps as
input, just like in the previous experiment (i.e., 5, 20, 50, and 100 timestamps). This time, however,
the interval between timestamps is doubled. Because measurements occur so frequently, there is little
time between two timestamps, which makes prediction easy for both the LSTM and the GRU:
movements are limited, and the models may simply use the last input coordinates as prediction values.
By doubling the time between two timestamps (i.e., by deleting every second timestamp), predicting
the movement and the next positions of all 11 soccer players becomes more challenging.
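Deleting every second timestamp amounts to simple array slicing; a toy sketch (array sizes are illustrative):

```python
import numpy as np

# Toy coordinate array: 10 timestamps, 2 columns.
coords = np.arange(20).reshape(10, 2)

# Keeping every second row (0, 2, 4, ...) doubles the interval
# between the remaining measurements.
thinned = coords[::2]
print(thinned.shape)  # (5, 2)
```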
In the third experiment we opt for a less naïve approach by creating a better scaling of the x, y
coordinates. For this experiment we modify the coordinates into the changes in
coordinates between timestamps, in order to predict the change in the next coordinates. For example, an x, y coordinate at
timestamp t1 = [x = 25, y = 37] and an x, y coordinate at timestamp t2 = [x = 26, y = 36] result in
a change of x = +1 and y = -1. The reason for this is better scaling: we predict
change (i.e., movement) instead of exact coordinates. These change coordinates
are again used to assess the predictive performance with input sequences of 5, 20, 50, and 100
timestamps.
The fourth experiment is a combination of experiments 2 and 3: assess the predictive performance of
both the GRU and the LSTM when doubling the time between measurements and using the
differences between coordinates as input. Again, this is conducted using input sequences of
length 5, 20, 50, and 100.
Fifth, we test the predictive performance of both models by evaluating how far they can
predict into the future. This works as follows: we use 5 timestamps as input to predict the
x, y coordinates at a timestamp further in the future. For example, the x, y coordinates at
times t1, t2, t3, t4, and t5 are used to predict the x, y coordinates 10, 20, 50, or 100 timestamps
further (i.e., t+10, t+20, t+50, or t+100). Again, this is performed with the differences between
timestamps described in experiment 3.
Finally, we construct a predicted trajectory of all eleven soccer players. Basically, this means that we
use 5 timestamps as input sequences in order to predict the coordinates at the sixth timestamp: t1, t2,
t3, t4, t5 → predicted t6. Then, we remove t1 and add the predicted t6 to predict t7: t2, t3, t4, t5,
predicted t6 → t7. We repeat this process to construct a trajectory of 40 timestamps.
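The rolling procedure can be sketched as follows (a minimal illustration with a stand-in predictor that simply repeats the last row, i.e., the benchmark behaviour; the function name is illustrative):

```python
import numpy as np

def roll_out(window, predict, n_steps):
    """Autoregressive trajectory construction.

    `window` holds the last observed timestamps (rows) and `predict`
    is any model mapping a window to the next timestamp's values.
    Each prediction is appended to the window while the oldest row is
    dropped, so later predictions are based on earlier ones.
    """
    window = window.copy()
    steps = []
    for _ in range(n_steps):
        nxt = predict(window)
        steps.append(nxt)
        window = np.vstack([window[1:], nxt])
    return np.array(steps)

# Stand-in predictor: repeat the last row of the window.
window = np.arange(10, dtype=float).reshape(5, 2)
traj = roll_out(window, lambda w: w[-1], 40)
print(traj.shape)  # (40, 2)
```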
We now turn to describing our evaluation metrics used to assess the predictive performance of both
our models in all experiments.
6.4 Evaluation metrics
6.4.1 Metrics
Because predicting future movements and positions in a soccer setting using the LSTM
and GRU on the players' coordinates alone is relatively new, the performance of both models is
compared head-to-head in order to determine which model performs best in a real-life soccer setting
based on x, y coordinates. The metrics used to evaluate the models' performance are the following:
MSE, RMSE, and MAE (see figure 6 for the formulas of these error metrics).
These metrics are chosen based on other studies that predict future movements or positions in other
settings. Since the models aim to predict coordinates or the difference in coordinates (movement),
the interest lies in lowering the error metrics. Therefore, MAE, MSE, and RMSE are suitable metrics for
evaluation in this setting.
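These metrics can be computed directly (a minimal NumPy sketch; function names are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average absolute deviation per coordinate."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error: penalises large deviations more strongly."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error: MSE back on the scale of the data."""
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([1.0, 2.0])
y_pred = np.array([2.0, 4.0])
print(mae(y_true, y_pred), mse(y_true, y_pred))  # 1.5 2.5
```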
6.4.2 Benchmark
In addition to the evaluation metrics described in the previous paragraph, we compare the
performance of our models against a benchmark that can be seen as the base case. Because the
timestamps are close to each other (we obtain player coordinates every 5 milliseconds), we may
expect only very small changes in coordinates between timestamps. Therefore, we compare our
predictions to the last input coordinates and see if there are any deviations. For the experiments
where we use the differences in player coordinates between timestamps as input (i.e., experiments
3 to 6), this means that we compare the performance of our models against a base case of no change.
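A sketch of this base case (assuming our reading that, for differenced data, "no change" means predicting zero displacement; the function name and shapes are illustrative):

```python
import numpy as np

def baseline_mae(X, y, differenced):
    """MAE of the base case.

    For raw coordinates the base case repeats the last input timestamp;
    for differenced data it predicts no change (zero displacement).
    X has shape (n, seq_len, features); y has shape (n, features).
    """
    pred = np.zeros_like(y) if differenced else X[:, -1, :]
    return np.mean(np.abs(y - pred))

# Toy data: 4 sequences of 5 timestamps with 2 features each.
X = np.ones((4, 5, 2))
y = np.full((4, 2), 2.0)
print(baseline_mae(X, y, False), baseline_mae(X, y, True))  # 1.0 2.0
```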
7. Results
In this section, the results of the experiments described in the previous section are presented.
We compare the predictive performance of the two models, LSTM and GRU, using
the metrics defined earlier. The results are described per experiment.
Figure 6: Evaluation metrics for assessing model performance
7.1 Experiment 1: Varying number of timestamps as input sequences
The first experiment was to assess the predictive performance of both models when using the
players' coordinates of a varying number of timestamps as input to predict the coordinates
of the soccer players at the next timestamp. We assessed the predictive performance of the LSTM
and the GRU with the number of input timestamps set at 5, 20, 50, and 100, respectively. The
results look good at first sight. However, when comparing them to the base case described in
section 6.4, we see that the results are exactly the same. This means that the model is using the last
input coordinates as its prediction, which can be explained by the very small changes in coordinates
between timestamps. Because the timestamps are recorded at a high frequency (i.e., every
5 milliseconds), the change between coordinates is so small that the model learns to return the last
input coordinates as predictions.
7.2 Experiment 2: Doubling the time between timestamps
For the second experiment, we vary the number of timestamps used as input in the same way as
in the first experiment (i.e., 5, 20, 50, 100). Additionally, we doubled the time between the timestamps
by using sequences with larger gaps in time. For example, when using 5 timestamps as input, we use
a sequence of timestamps t1, t3, t5, t7, and t9 to predict the coordinates of the 11 players at t11.
Again, these results are compared against the base case (i.e., using the last input coordinates as
prediction). Just like in experiment 1, doubling the time between timestamps does not lead to a model
that learns to predict anything beyond the benchmark. Instead, the models learn to return the last
input coordinates, since these give the lowest error metrics. Experiments 1 and 2 therefore show that
this scaling of the data is not suitable for the models to learn from; consequently, for the remaining
experiments the data is converted to the changes in coordinates between two timestamps.
7.3 Experiment 3: Varying number of timestamps as input sequences when using absolute differences between coordinates
For this experiment we modify the coordinates into the changes in coordinates between timestamps,
in order to predict the change in the next coordinates, expressed in units of 10 centimeters. For
example, an x, y coordinate at timestamp t1 = [25.8, 37.7] and an x, y coordinate at timestamp
t2 = [25.9, 37.4] result in a change of x = +1 and y = -3. These change coordinates are then used to
assess the predictive performance with input sequences of 5, 20, 50, and 100 timestamps, just like in
experiments 1 and 2. The results show a decrease in all error metrics relative to the benchmark (see
figure 7). This can be explained by the better scaling of the player coordinates, which clearly benefits
the models compared to the previous two experiments. To illustrate how to interpret the results,
consider LSTM-5, where 5 input timestamps were used to predict t6 with the LSTM model (see the
first block of figure 7). Here, the mean absolute error is MAE = 0.6080. This should be interpreted as
follows: when the model predicts +1 on the x-coordinate, it predicts that the player's x-coordinate
will change by 10 centimeters (e.g., from 24.8 to 24.9). Consequently, the mean absolute error in
centimeters is 0.6080 * 10 = 6.080 centimeters.
5 timestamps as input MAE MSE RMSE
GRU 0.6101 2.8702 1.6942
LSTM 0.6080 2.8765 1.6960
Baseline 0.9243 6.745 2.5971
20 timestamps as input MAE MSE RMSE
GRU 0.5999 2.8996 1.7028
LSTM 0.5987 2.8553 1.6898
Baseline 0.9194 6.6693 2.5825
50 timestamps as input MAE MSE RMSE
GRU 0.6074 2.9366 1.7137
LSTM 0.6042 2.9094 1.7057
Baseline 0.9244 6.7081 2.5900
100 timestamps as input MAE MSE RMSE
GRU 0.6184 2.9503 1.7176
LSTM 0.6140 2.9428 1.7155
Baseline 0.9293 6.7198 2.5923
Figure 7: The results of experiment 3. From top to bottom: the results using 5, 20, 50, and 100 timestamps as input. From
these results we can derive that the LSTM performs slightly better, since it shows lower error metrics.
7.4 Experiment 4: Doubling the time between timestamps when using absolute differences between coordinates
The fourth experiment is a combination of experiments 2 and 3: assess the
predictive performance of both the GRU and the LSTM when doubling the time between
measurements and using the differences between coordinates as input. Again, this
experiment is conducted using input sequences of length 5, 20, 50, and 100.
The results, depicted in figure 8, again show a decrease in the error metrics compared to the
benchmark. This can be explained by the better scaling of the coordinates compared to experiments 1
and 2. Additionally, increasing the time between two coordinate measurements ensures that there
are bigger differences in coordinates between timestamps, which benefits the models' predictive
performance. What is noticeable is that on the longer input sequences (i.e., 50 and 100), the
models still predict well compared to our benchmark, but the error metrics move closer to the
benchmark results. This may be due to the relatively large amount of input data, from which the
models find it hard to derive structure in a setting with just 4 hidden layers and 22 nodes per
hidden layer.
5 timestamps as input MAE MSE RMSE
GRU 0.8349 3.9281 1.9819
LSTM 0.8374 3.9475 1.9868
Baseline 1.3178 10.8778 3.2982
20 timestamps as input MAE MSE RMSE
GRU 0.8634 4.2253 2.0555
LSTM 0.8470 4.0884 2.0220
Baseline 1.3156 10.8481 3.2936
50 timestamps as input MAE MSE RMSE
GRU 0.8725 4.3106 2.0762
LSTM 0.8605 4.1748 2.0432
Baseline 1.3211 10.9151 3.3038
100 timestamps as input MAE MSE RMSE
GRU 0.8811 4.3531 2.0864
LSTM 0.8698 4.2200 2.0543
Baseline 1.3293 10.9828 3.3140
Figure 8: The results of experiment 4. From top to bottom: the results using 5, 20, 50, and 100 timestamps as input. On
average, the LSTM models perform slightly better, except in the first case.
7.5 Experiment 5: How far can we predict?
Testing how far we can predict works as follows: we use 5 timestamps as input to predict the
x, y coordinates at a timestamp further in the future. For example, the x, y coordinates at times
t1, t2, t3, t4, and t5 are used to predict the x, y coordinates 10, 20, 50, or 100 timestamps further
(i.e., t+10, t+20, t+50, or t+100). Again, this is performed with the differences between timestamps
described in experiment 3. Figure 9 shows the results of this experiment. Both LSTM and GRU perform
approximately equally well, and for all horizons both models clearly outperform the benchmark.
Notably, both the LSTM and the GRU perform only slightly worse when predicting further into the
future (compare, for example, their performance on t+10 and t+100). Comparing this to the previous
experiment, it is possible that the models there were not deep enough to deal with the large amount
of input data; with only 5 input timestamps, the results here are more stable.
t = + 10 MAE MSE RMSE
GRU 0.6203 2.9065 1.7048
LSTM 0.6204 2.9085 1.7054
Baseline 0.9210 6.7297 2.5942
t = + 20 MAE MSE RMSE
GRU 0.6248 2.8738 1.6952
LSTM 0.6257 2.8850 1.6985
Baseline 0.9306 6.7886 2.6055
t = + 50 MAE MSE RMSE
GRU 0.6226 2.8863 1.6989
LSTM 0.6180 2.8772 1.6962
Baseline 0.9181 6.6564 2.5800
t = + 100 MAE MSE RMSE
GRU 0.6187 2.8522 1.6888
LSTM 0.6157 2.8576 1.6904
Baseline 0.9172 6.6168 2.5723
Figure 9: Results of experiment 5. Based on 5 timestamps as input we predict (from top to bottom): the 5 + 10 = 15th
timestamp, the 5 + 20 = 25th timestamp, the 5 + 50 = 55th timestamp, and the 5 + 100 = 105th timestamp. The GRU
performs slightly better in the first two scenarios, whereas the LSTM performs slightly better in the last two scenarios.
7.6 Experiment 6: Predicting longer trajectories
In this experiment we constructed a trajectory of all eleven soccer players. Basically, this means
that we use 5 timestamps as input in order to predict the coordinates at the sixth
timestamp: t1, t2, t3, t4, t5 → predicted t6. Then we remove t1 and add the predicted t6 to predict
t7: t2, t3, t4, t5, predicted t6 → t7. We repeat this process until we have constructed a trajectory of 40
timestamps. The results, depicted in figure 10, show that the GRU performs only slightly better than
the LSTM. The differences are extremely small, which means that both the GRU and the LSTM
could be suitable models for trajectory prediction. It should be noted, however, that the error metrics
increase compared to the other experiments. This makes intuitive sense, because in this experiment
predictions are used as input for making new predictions: when predictions that deviate from the
ground truth are fed back in as input, the errors of subsequent predictions compound.
Trajectory prediction MAE MSE RMSE
GRU 0.1988 0.9682 0.9840
LSTM 0.1999 0.9691 0.9844
Baseline 0.3750 1.9318 1.3899
Figure 10: Results of experiment 6. The GRU performs slightly better than the LSTM on trajectory prediction; both
outperform the benchmark.
8. Discussion & Limitations
The experiments performed and the resulting predictive performance of the LSTM
and the GRU show multiple things. First, comparing experiments 1 and 2 to experiments 3, 4, and
5 shows that converting the players' coordinates into the differences between coordinates
at subsequent timestamps benefits the models. In experiments 1 and 2, the results show that the models
essentially learn to return the last input coordinates as predictions, which means they are not
learning anything. This was expected, however, because the time between coordinate measurements
is extremely small (i.e., 5 milliseconds). The changes in coordinates between timestamps are
therefore extremely small and, consequently, 'predicting' the same coordinates as the last input,
i.e., predicting no change, yields the lowest error metrics.
To ensure that both the LSTM and the GRU are capable of learning, we therefore converted the
coordinates into the changes between two coordinates (e.g., t1 = (98.2; 47.3), t2 = (98.1; 47.5)
is converted to x = -1 and y = +2). Then we let the models predict the change in coordinates for the
next timestamp. The results show that both models learn well and outperform the benchmark,
with the LSTM performing slightly better on almost all experiments, resulting in lower error metrics. For
example, the LSTM with 5 timestamps as input predicting t6 (see the first block of figure 7)
shows an MAE of 0.6080, which means that predictions deviate only a few centimeters
from the ground truth. This is because of the better scaling of the coordinates. Additionally, doubling
the time between coordinates (i.e., using t1, t3, t5, t7, and t9 instead of t1, t2, t3, t4, and t5 for
prediction) shows an even better predictive performance. Though both the LSTM and the GRU
perform better than the benchmark on all input sizes, a downside of the results of experiments 3
and 4 is that the longer input sequences (i.e., 50 and 100 timestamps as input) perform
slightly worse than the shorter input sequences. The models seem to have some
trouble dealing with more information, which may be due to their depth or simply
because there is too much input to derive the right structure from. This is supported by
experiment 5, where we used 5 timestamps as input and predicted t+10, t+20, t+50, and t+100;
those results appear more stable because of the limited number of input coordinates. Therefore,
a limitation of this study is the depth of the models for some experiments. By making deeper models,
the GRU and LSTM might be better able to deal with more input sequences and increase predictive
performance. Additionally, in experiment 5 we assessed how far the models can predict into the future.
However, this experiment did not model trajectories based on predictions only (i.e., make predictions
and use these predictions as input to model the next movement). In this way, one could model an
entire trajectory instead of just predicting one timestep ahead. Since this kind of experiment is performed in
many multi-object tracking studies, we conducted it as well, in experiment 6. Here, both
the LSTM and the GRU perform approximately equally well on a trajectory prediction of 40 timestamps.
However, the error metrics increase significantly. This is because the models use predictions as input
for making new predictions: if the predictions used as input are inaccurate, the next predictions will
be inaccurate as well. This is a limitation of this experiment. Another important
limitation of this study is that it is hard to compare the results to full multi-object tracking studies based
on visual data. This is because this study intends to predict coordinates, which are numerical data.
These coordinates have two decimal places (e.g., 28.47), which makes a prediction hard to classify as
simply right or wrong, whereas many multi-object tracking studies based on visual data have bounding
boxes as ground truth and can therefore verify whether a prediction is right or wrong. In other words,
this study assesses how far a prediction is off the ground-truth coordinates, whereas studies that build
a complete multi-object tracking algorithm on visual data assess whether a prediction is right or wrong.
Therefore, it is hard to conclude whether our models are better than those of these other studies.
However, this study intended to pave the way for multi-object tracking in soccer using deep learning
and should be seen as a first step in this area; using these models on a dataset that also contains
visuals would be an interesting direction for future work. Overall, when comparing the performance
of the LSTM to the GRU specifically, the LSTM usually performs a little better, so we would advise
starting with further investigation of the LSTM in such a setting.
9. Conclusions
From the results of the experiments conducted in this study, we can draw multiple
conclusions. First, raw coordinates measured in meters are not suitable for
making future predictions with deep learning models such as the LSTM and GRU. By scaling
the data properly (in this study, by calculating the differences between coordinates at subsequent
measurements) the data becomes more appropriate for prediction. In doing so, one also overcomes
the burden of too frequent measurements. As stated earlier, the differences between coordinates at
subsequent timestamps are extremely small, because the coordinates are measured very often.
Deleting every nth measurement can overcome this problem as well, but at the cost of a significant
decrease in the data available for properly training the models. Therefore, we conclude that converting
the x, y coordinates into the differences between two coordinates at subsequent times is the most
suitable solution to this problem.
Second, we can conclude that both the LSTM and the GRU perform well on the task of predicting
movement and future positions, with the LSTM performing slightly better, as reflected by lower error
metrics on almost all experimental results. However, from experiment 4 we can conclude that the
LSTM and GRU with 4 hidden layers are not deep enough when dealing with longer input sequences
(i.e., 50 or 100). While the results still outperform the benchmark, the predictive performance
decreases compared to the shorter input sequences (i.e., 5 or 20). Deeper models may therefore
enhance the predictive performance of both the LSTM and the GRU on longer input sequences.
Additionally, this study assessed predictive performance by predicting not only one step ahead, but
entire trajectories, by feeding predictions back into the model as input. In that way, an entire
ground-truth trajectory could be compared with a predicted trajectory. In this experiment, too, the
models perform rather well, although it must be noted that the error metrics increase compared to
the other experiments.
Comparing both models across the experiments conducted in this study, we can conclude that
the LSTM usually performs slightly better than the GRU. However, the differences between LSTM and
GRU are rather small and, therefore, both could be suitable for predicting movement and
player positions in a soccer setting. It remains hard to assess whether our models perform better
than complete multi-object tracking algorithms, since those studies usually have bounding boxes as
ground-truth data and therefore often assess performance with MOTA (multi-object tracking
accuracy), i.e., they examine whether positions or movements are predicted right or wrong. In this
study we only predict coordinates (numerical data) with two decimal places (e.g., 27.48), so an
exactly right prediction is much harder to make. In the next section, we give some recommendations
for future work that can help to further pave the way for deep learning models to predict movement
and positions in a soccer setting.
10. Future work
This study attempted to predict movement and future positions of soccer players using deep learning,
thereby paving the way for multi-object tracking in soccer using deep learning. Since this can be
seen as a starting point, some suggestions for future work that would further contribute to multi-object
tracking in soccer can be made. First, in addition to the experiments performed in this study, we
suggest assessing the predictive performance of both the GRU and the LSTM with more hidden
layers. Especially when dealing with longer input sequences, we saw that the input information is
too large for the models to properly infer structure from and to remain as stable as with shorter input
sequences. Second, this study used input sequences to predict only one x, y coordinate in the future
of all eleven players. However, it would be interesting to model entire trajectories by using predicted
coordinates as input for predicting the next coordinates. In this way, one would be able to predict
entire trajectories and compare them to the ground-truth trajectory. This would give us a deeper
understanding of the predictive performance over a longer period of time instead of just one time
period. With the experiments conducted within this study, we established a basis by finding that the
LSTM and GRU are capable of predicting further into the future (experiment 5). This is further
developed by modeling trajectories in experiment 6. However, here trajectories of only 40 timestamps
are predicted. It would be interesting to model longer trajectories in future work to see if the models
remain stable over time. This is especially true, since we concluded that on longer sequences our
models are not deep enough. Third, for this study we used a cleaned data set where all the x, y
coordinates of the soccer players are available. However, it would be interesting to see how the model
deals with players that move out of the soccer pitch (e.g., due to a temporary injury treatment) or with
substitutes. Objects disappearing or re-appearing in the space under study is one of the challenges
that accompany multi-object tracking, and tackling this challenge would even further contribute
to the development of deep learning models for multi-object tracking in soccer. Fourth, this study is
performed using only ZXY sensor data (i.e., player coordinates that indicate their position in meters in
the soccer pitch). However, the vast majority of the multi-object tracking algorithms in the literature
that use deep learning techniques make use of video or image data as input. Here, the athletes,
pedestrians, and so on are usually identified using pre-trained networks like VGG-16 or ResNet-50 in
all images. Then the coordinates of the identified objects are used as input for the LSTM or GRU models
with bounding-box coordinates or x, y coordinates used as ground-truth. Of course, this study served
as a starting point for multi-object tracking in a real-life soccer setting. The next step, however, could
be to use these models on a video or image data set in order to further develop the use of deep learning
for multi-object tracking in a soccer setting.
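The trajectory experiments described above rest on a simple mechanism: a model that predicts the next (x, y) position from a window of past positions is applied iteratively, feeding each prediction back into the input window to extend the horizon (40 time steps in experiment 6). The sketch below illustrates that rollout loop with numpy only; the `constant_velocity` predictor is a hypothetical stand-in for a trained LSTM or GRU, not the thesis' actual model.

```python
import numpy as np

def rollout(predict_next, history, n_steps):
    """Iteratively extend a trajectory: each predicted (x, y) is fed
    back into the sliding input window, as in multi-step trajectory
    prediction. `predict_next` stands in for a trained recurrent model."""
    window = list(history)           # list of (x, y) positions in meters
    preds = []
    for _ in range(n_steps):
        nxt = predict_next(np.asarray(window))
        preds.append(nxt)
        window = window[1:] + [nxt]  # slide the window one step forward
    return np.asarray(preds)

# Hypothetical baseline predictor: last position plus last displacement.
def constant_velocity(window):
    return tuple(window[-1] + (window[-1] - window[-2]))

history = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]   # observed positions
traj = rollout(constant_velocity, history, n_steps=40)
print(traj.shape)   # (40, 2)
print(traj[-1])     # [42. 21.]
```

In the actual setup, `predict_next` would be a recurrent network trained on the ZXY sensor data; the rollout loop itself is unchanged, which is why error accumulation over long horizons becomes the central stability question.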
11. References
Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., & Savarese, S. (2016). Social LSTM: Human
trajectory prediction in crowded spaces. Proceedings of the IEEE conference on computer vision
and pattern recognition, 961-971.
Barber, S., & Carré, M. (2010). The effect of surface geometry on soccer ball trajectories. Sports
Engineering 13(1), 47-55.
Ben Shitrit, H., Berclaz, J., Fleuret, F., & Fua, P. (2011). Tracking Multiple People Under Global
Appearance Constraints. 2011 International Conference on Computer Vision, 137-144.
Camomilla, V., Bergamini, E., Fantozzi, S., & Vannozzi, G. (2018). Trends supporting the in-field use of
wearable inertial sensors for sport performance evaluation: A systematic review. Sensors
18(3), 873.
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014).
Learning phrase representations using RNN encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural
networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Cust, E., Sweeting, A., Ball, K., & Robertson, S. (2019). Machine and deep learning for sport-specific
movement recognition: a systematic review of model development and performance. Journal
of sports sciences 37(5), 568-600.
Felsen, P., Agrawal, P., & Malik, J. (2017). What will happen next? Forecasting player moves in sports
videos. Proceedings of the IEEE International Conference on Computer Vision, 3342-3351.
Gade, R., & Moeslund, T. B. (2018). Constrained multi-target tracking for team sports activities. IPSJ
Transactions on Computer Vision and Applications, 2-11.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation 9(8), 1735-
1780.
Kim, B., Kang, C., Kim, J., Lee, S., Chung, C., & Choi, J. (2017). Probabilistic vehicle trajectory prediction
over occupancy grid map via recurrent neural network. IEEE 20th International Conference on
Intelligent Transportation Systems (ITSC), 399-404.
Kim, W., Moon, S.-W., Lee, J., Nam, D.-W., & Jung, C. (2018). Multiple player tracking in soccer videos:
an adaptive multiscale sampling approach. Multimedia Systems, 611-623.
Lara, J., Vieira, C., Misuta, M., Moura, F., & Barros, R. (2018). Validation of a video-based system for
automatic tracking of tennis players. International Journal of Performance Analysis in Sport,
137-150.
Lee, N., & Kitani, K. (2016). Predicting wide receiver trajectories in american football. IEEE Winter
Conference on Applications of Computer Vision (WACV), 1-9.
Ma, C., Yang, C., Yang, F., Zhuang, Y., Zhang, Z., Jia, H., & Xie, X. (2018). Trajectory factory: Tracklet
cleaving and re-connection by deep siamese bi-gru for multiple object tracking. IEEE
International Conference on Multimedia and Expo (ICME), 1-6.
Nikhil, N., & Tran Morris, B. (2018). Convolutional Neural Network for Trajectory Prediction.
Proceedings of the European Conference on Computer Vision (ECCV).
Pettersen, S. A., Johansen, D., Johansen, H., Berg-Johansen, V., Gaddam, V. R., Mortensen, A., . . .
Halvorsen, P. (2014). Soccer video and player position dataset. Proceedings of the International
Conference on Multimedia Systems (MMSys), 18-23.
Sadeghian, A., Alahi, A., & Savarese, S. (2017). Tracking the untrackable: Learning to track multiple cues
with long-term dependencies. Proceedings of the IEEE International Conference on Computer
Vision, 300-311.
Shah, R., & Romijnders, R. (2016). Applying deep learning to basketball trajectories. arXiv preprint
arXiv:1608.03793.
Suda, S., Makino, Y., & Shinoda, H. (2019). Prediction of Volleyball Trajectory Using Skeletal Motions
of Setter Player. Proceedings of the 10th Augmented Human International Conference, 16.
Xiang, Y., Alahi, A., & Savarese, S. (2015). Learning to Track: Online Multi-Object Tracking by Decision
Making. Proceedings of the IEEE international conference on computer vision, 4705-4713.
Xue, H., Huynh, D., & Reynolds, M. (2018). SS-LSTM: a hierarchical LSTM model for pedestrian
trajectory prediction. IEEE Winter Conference on Applications of Computer Vision (WACV),
1186-1194.
Zhang, Z., Xu, D., & Tan, M. (2010). Visual measurement and prediction of ball trajectory for table
tennis robot. IEEE Transactions on Instrumentation and Measurement 59(2), 3195-3205.
Zhao, Y., Yang, R., Chevalier, G., Shah, R., & Romijnders, R. (2018). Applying deep bidirectional LSTM
and mixture density network for basketball trajectory prediction. Optik 158, 266-272.