Modeling player skill in Starcraft II
Tetske Avontuur
ANR: 282263
HAIT Master Thesis series nr. 12-004
THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF ARTS IN COMMUNICATION AND INFORMATION SCIENCES,
MASTER TRACK HUMAN ASPECTS OF INFORMATION TECHNOLOGY,
AT THE SCHOOL OF HUMANITIES
OF TILBURG UNIVERSITY
Thesis committee:
Dr. Ir. P.H.M. Spronck
Dr. M.M. van Zaanen
Prof. dr. E.O. Postma
Tilburg University
School of Humanities
Department of Communication and Information Sciences
Tilburg center for Cognition and Communication (TiCC)
Tilburg, The Netherlands
June, 2012
Abstract

Starcraft II is a popular real-time strategy (RTS) game in which many players compete with each other. Based on their performance, the players are ranked in one of seven leagues. In this thesis, we aim to predict which league a player competes in, based on observations of his in-game behavior. We do this by building a player model using a classification algorithm.
We gathered 1297 game replays uploaded by players from 3 different websites. We ensured that they all had
the same patch version and that league information about the players was up-to-date. To describe the data, we
used features that measure skill. We picked these features based on cognitive research and our knowledge of
the game. We then performed a pretest using 4 different algorithms to see which one could be suitable to solve
our problem. We chose SMO as our classifier. The weighted accuracy was 47.3% (SD = 2.19), meaning that we
were able to correctly classify 47.3% of the instances. This is significantly better than the weighted baseline of
25.5% (SD = 1.05). We used InfoGain to examine what features are most important in solving our problem. We
then tested from what moment in the game it is possible to predict players’ skill. The results showed that
performance does not change significantly over time. This indicates that, at least from 2.5 minutes into the game,
time does not significantly impact the performance of the classifier. We conclude that it is possible to predict
players’ skill by building a player model using a classification algorithm. We can do this fairly early in the game.
Preface

During the summer of 2011, I had a lot of free time. Since this summer was just like a typical autumn, I had lots
of opportunity to play computer games. I spent a lot of time on one game in particular: Starcraft II. Not only did
I play this game extensively, I also watched the rise of Starcraft II shoutcasters. My personal favorites are Husky
and Day[9] because of their extensive knowledge of the game, in addition to their great enthusiasm. This way, I
learned a lot about the game and I realized that I could do more than just play the game; I could use it as a tool
for research.
Performing the research for this thesis was a lot of work. It took a lot of time and effort to gather and organize
data, and I had to redo a lot of my experiments. Despite the sometimes frustrating moments, I look back on my
thesis with great pleasure. I would like to thank my supervisor Pieter Spronck for his guidance during the entire
process. However, I would like to thank him most of all for his enthusiasm about games and AI. I would also like
to thank my second reader Menno van Zaanen for his help during the last days of this thesis.
I would also like to thank my parents for supporting me all these years, my boyfriend for putting up with me
during the final weeks of this thesis, and my friends for making my life in Tilburg a memorable experience.
Tetske
June, 2012
Contents

1. Introduction 6
1.1 Problem statement and research questions ......................................................................................... 6
1.2 Outline ................................................................................................................................................ 7
2. Background 8
2.1 Starcraft II ........................................................................................................................................... 8
2.1.1 Rating system ................................................................................................................................. 8
2.1.2 Reasons for choosing Starcraft II ................................................................................................... 10
2.2 Expertise ........................................................................................................................................... 10
2.2.1 Expertise in Chess ........................................................................................................................ 10
2.2.2 Expertise in action video games ................................................................................................... 11
2.2.3 Expertise in other domains ........................................................................................................... 11
2.3 Player modeling ................................................................................................................................ 12
2.3.1 Player modeling in classic games .................................................................................................. 12
2.3.2 Player modeling in computer games ............................................................................................. 12
2.4 Sequential Minimal Optimization Algorithm .................................................................... 13
3. Experimental setup 15
3.1 Data set ............................................................................................................................................ 15
3.1.1 Data Selection .............................................................................................................................. 16
3.1.2 Data extraction ............................................................................................................................ 17
3.1.3 Feature set ................................................................................................................................... 18
3.2 Measurement ................................................................................................................................... 22
3.3 Pretest .............................................................................................................................................. 23
3.4 Player model ..................................................................................................................................... 25
3.4.1 Performance of SMO .................................................................................................................... 25
3.4.2 InfoGain ....................................................................................................................................... 25
3.4.3 Time............................................................. 26
3.4.4 Server .......................................................... 26
4. Results 27
4.1 Player model ..................................................................................................................................... 27
4.1.1 Error Distribution .............................................................................................................................. 28
4.2 InfoGain .......................................................................................................................... 30
4.3 Time ................................................................................................................................................. 31
4.4 Server ............................................................................................................................................... 32
5. Discussion 33
5.1 Player model ..................................................................................................................................... 33
5.2 InfoGain ............................................................................................................................................ 33
5.3 Time ................................................................................................................................................. 34
5.4 Server ............................................................................................................................................... 34
5.5 Implementation in AI ........................................................................................................................ 34
6. Conclusions 36
6.1 Selection of the dataset .................................................................................................................... 36
6.2 Finding a classification method ......................................................................................................... 36
6.3 Predicting skill ................................................................................................................................... 37
6.4 The influence of game progression .................................................................................................... 37
6.5 Artificial Intelligence ......................................................................................................................... 37
6.6 Answering the problem statement .................................................................................................... 38
6.7 Limitations and recommendations for future work ............................................................................ 38
Literature 40
Appendix A 42
Appendix B 43
1. Introduction

In regular sports, people not only want to practice their sport, they also want to compete with others and
see who the best sportsperson is. The same holds for competitive computer gaming, also called eSports, where
players of a game participate in a competition to see who is the best. In Western culture, eSports are still an emerging field. One important reason is that competitive gaming there is mainly associated with first-person shooters (Wagner, 2006). In the last few years, this view has shifted with the release of the popular, highly
competitive real-time strategy game Starcraft II, selling 3 million copies in the first month of its release (Blizzard
Entertainment, 2010).
Besides the standard online laddering competition embedded in the game, various Starcraft II tournaments and
competitions are held, both offline and online. This motivates gamers to train their skills and increase their
performance. The standard ladder of the game divides gamers into 7 distinct leagues, roughly representing the
skill level of players. With the large number of Starcraft II players and the classification of players according to
their performance, we now have the opportunity to study skill differences on a large scale in a real-time
strategy game.
Skill difference between novices and experts has been researched in many domains. Studies in chess suggest
that because of their extensive knowledge, experts are better at recognizing patterns, they make decisions from
these patterns more quickly, and they can make general assumptions based on chunks of information (Gobet
& Simon, 1998). Research on pilots and athletes shows that experts also see related cues faster, make fewer
errors, and pay less attention to unrelated cues than novices (Schriver, Morrow, Wickens & Talleur, 2008;
Chaddock, Neider, Voss, Gaspar & Kramer, 2011).
To examine how these differences affect gameplay in fast paced computer games, we can create a model of
players according to their skill. We call this a player model: an abstracted description of the behavior of players
in a game. Next to preference, strategies and weak spots, player modeling can be used to describe the skill of a
player (Van den Herik, Donkers & Spronck, 2005). The current research focuses on building a player model that
can predict the skill of a player.
1.1 Problem statement and research questions

Our problem statement is as follows:
To what extent can we use a player model to accurately distinguish a novice player from an expert player in a
real-time strategy game?
In this research, we use the commercial real-time strategy game Starcraft II as a tool to solve this problem. The
first step is to find a selection of replay files that is suitable for building a player model. We need to establish
when a replay is appropriate to use and what data we want to extract from it. This leads us to our first research
question:
1. For Starcraft II, what selection of player data is appropriate as a data set to build a player model upon?
We then need to find one or more classification methods that are suitable to model players in the game
Starcraft II. These methods should be able to deal with sets of instances that are related to each other in that
they describe the behavior of a player in one game. We also want to know what features are most informative
to the classification algorithms, so at least one of the methods should give us an output that provides us with
this information. This gives us our second research question:
2. For Starcraft II, what is a suitable classification method to build a player model that predicts the
player’s skill level according to the player’s behavior?
If we have found such a method, we can test it to see to what extent its results are accurate. We need to pick
one or more measures that are appropriately informative about the performance of the classification method.
This leads to our third research question:
3. How accurately can a player model predict a player’s skill?
If we have an indication of the effectiveness of the player model, we can examine how the performance of the
algorithm changes over time. This gives us the fourth research question:
4. How does the performance of our player model change as the game progresses?
One other point of interest is to examine if we can implement the outcome of our research into an AI.
Therefore, our fifth research question is:
5. To what extent can we use the player model to build an AI?
1.2 Outline

We begin our research in Chapter 2 by discussing the real-time strategy game Starcraft II and its system for
rating players’ skill. We then describe the main features of expertise to see what distinguishes a novice from an
expert. We follow by elaborating on player modeling. We conclude Chapter 2 with an explanation of the SMO
learning algorithm. In Chapter 3, we describe our experiment. We first describe how our data is gathered and
selected. We then discuss the feature set we selected. We explain how we compared algorithms to examine
which one is most suitable for our problem. We then elaborate on the construction of our player model. In
Chapter 4, we present the results of the tests described in Chapter 3. We discuss these results in Chapter 5.
Finally, we present our conclusions, limitations and recommendations for future work in Chapter 6.
2. Background

In this chapter we elaborate on the four topics relevant to this research. In Section 2.1 we describe our first
topic: the real-time strategy game Starcraft II. In Section 2.2 we define the differences between novices and
experts. In Section 2.3 we elaborate on opponent modeling in classic and computer games. In Section 2.4 we
describe the techniques that we use for the classification of our data.
2.1 Starcraft II

Starcraft II is a real-time strategy game in which players aim to destroy their enemy by building a base
and an army. Players can choose 1 out of 3 races to play with. These races are Terran, Protoss, and Zerg. The
Terran are humans, the Protoss are alien humanoids with highly advanced technology, and the Zerg are a
collection of assimilated creatures who use biological adaptation instead of technology.
For anything the player builds, he needs to gather 2 types of resources: minerals and gas. These resources are
used to construct buildings which in turn can be used to produce units. At the start of the game, not all units
and buildings are available. New construction options can be unlocked by making certain buildings. This means
that some units and buildings are available at the start of the game while others become available later in the
game. This is also called a tier: the point in time at which certain units and buildings become available.
In order to play the game well, the player must engage in macro- and micro-management. Macro-management determines the economic strength of a player, which depends on the construction of buildings, the gathering of resources, and the composition of units. Micro-management determines how well a player is able to locally
control small groups and individual units. This includes movements and attacks that are issued by the player
(Blizzard Entertainment, 2010). The success of the macro and micro-management of a player heavily depends
on the strategy a player chooses to follow. For example, if a player chooses to rush his opponent by making
fighting units very early in the game, his economy will suffer. On the other hand, if a player chooses to focus on
having a strong economy before building an adequately sized army, he runs the risk of being overrun by his
opponent.
Players can play against others on Blizzard’s multiplayer system Battle.net. There are four regions, each with
their own ladder: one for Europe and Russia, one for North and Latin America, one for Korea and Taiwan, and one for South-East Asia (Blizzard Entertainment, 2011). Players can play on the ladder or set up a custom game. Games that
are played on the ladder are ranked. The system automatically matches players of about the same skill to play
each other. Custom games are unranked. Here, players can choose their own opponents.
2.1.1 Rating system
The main rating system in Starcraft II is the division into leagues. Each ladder is divided into 7 leagues: bronze,
silver, gold, platinum, diamond, master, and grandmaster. The bronze league is the lowest league and contains
people who are still learning to play the game. This league contains 20% of the population that is active on the
ladder. When players improve their skill, they get promoted into the silver, gold, and platinum leagues. Each of
these leagues also contains 20% of the population. The diamond, master, and grandmaster leagues in total
contain the final 20% of the population. The diamond league contains 18%, the master league contains the top 2% of all players, and the grandmaster league consists of the top 200 players of the game
(Blizzard Entertainment, 2011). Each server has its own ranking. As there are 4 regions, there are a total of 800
grandmasters in the world.
When a player first starts playing on the ladder, he has to play 5 placement matches. After these matches, the
system evaluates how many were won and lost. Based on those numbers, the matchmaking system places the
player in a league. From that moment on, players can climb or fall into another league by winning or losing
matches. If a player wins about 50% of his matches, he is in the right league. If he wins more than 50%, the
system will eventually match him with players from higher leagues. If the player keeps winning, he will be
promoted into the next league.
There are two systems that determine a player’s league. One is the ladder point system. Players earn or lose ladder points by winning or losing matches. If their score passes a certain threshold, players are placed in
another league. Their ladder points are then reset. The amount of points that is won or lost is determined by
the skill difference with the opponent. More weight is assigned to a win against a stronger opponent and less
weight is assigned to a win against a weaker opponent. Players also have a bonus pool that is filled over time. If
a player wins, and there are sufficient points in his bonus pool, the points he wins are doubled by taking points
out of the bonus pool. This helps players that have not played for a while climb up the ladder and prevents
them from having to play weaker opponents. A table with an indication of the amount of points that is required
to move up a league can be found in Table 2.1 (Blizzard Entertainment, 2012).
League transition Indication of required points
Bronze to Silver 1200
Silver to Gold 800
Gold to Platinum 800
Platinum to Diamond 800
Diamond to Master 900
Master to near Grandmaster 1400
Table 2.1 Points required to climb leagues
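The point mechanics described above can be sketched in code. The function below is a simplified, hypothetical model of a single ladder-point gain, including the bonus-pool doubling; Blizzard has not published its actual formulas, so the logic and numbers are illustrative only.

```python
def ladder_update(points, bonus_pool, base_gain):
    """Apply one win's worth of ladder points.

    If the bonus pool holds enough points, the gain is doubled by
    draining the pool, as described for Starcraft II's ladder.
    Simplified, illustrative model -- not Blizzard's real formula.
    """
    if bonus_pool >= base_gain:
        points += 2 * base_gain           # base gain plus matching bonus
        bonus_pool -= base_gain
    else:
        points += base_gain + bonus_pool  # drain whatever bonus is left
        bonus_pool = 0
    return points, bonus_pool

# A player with 12 bonus points winning a match worth 10 base points
# gains 20 points and keeps 2 bonus points:
print(ladder_update(100, 12, 10))  # -> (120, 2)
```

This captures why infrequent players climb quickly on their return: their accumulated bonus pool doubles several consecutive wins.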
Next to the ladder system, there is the hidden rating system, also called hidden matchmaking rating (MMR).
The points in this system are never reset. The game matches players based on their hidden rating. If a player
wins, his hidden rating increases, and if he loses, it decreases. If the player wins against an opponent in a higher league, more weight is assigned to this win and his hidden rating increases more than if the opponent was equally skilled. If a player wins consistently, he will eventually be promoted. However, if a player beats weaker players in a higher league but loses against average players in that league, his rating will not increase
consistently and he will not be promoted. All leagues are based on a range of hidden rating points. If a player’s
points fall within this range, and he consistently wins, he is placed in the corresponding league.
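The hidden rating thus uses a weighted update: a win against a stronger opponent moves the rating more than a win against an equal one. Blizzard's MMR formula is not public, so the sketch below uses a standard Elo-style update purely to illustrate that weighting; the k-factor and rating scale are assumptions.

```python
def elo_update(rating, opponent, won, k=32):
    """One Elo-style rating update (illustrative; not Blizzard's formula).

    `expected` is the predicted win probability. Beating a much stronger
    opponent (low expected score) yields a larger rating gain.
    """
    expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
    return rating + k * ((1 if won else 0) - expected)

# Beating a stronger opponent gains more than beating an equal one:
equal    = elo_update(2000, 2000, won=True)  # expected 0.5  -> +16
stronger = elo_update(2000, 2400, won=True)  # expected ~0.09 -> ~+29
print(equal, stronger)
```

The same asymmetry explains the promotion behavior in the text: consistent wins against average players of a higher league raise the rating steadily, while only beating its weakest members does not.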
If we assume the Starcraft II population is distributed normally over the leagues, we can get an indication of the
MMR range for every league. Because the bronze league encompasses the bottom 20% of all players, the MMR
range is larger for this league than for the silver league, which contains the following 20% of all players. This
also means that the skill difference between the silver, gold, and platinum leagues is smaller than the skill
difference for the other leagues. The entire graph is shown in Figure 2.1.
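Under this normality assumption, the league cut-offs follow directly from the cumulative shares named above (20% per league up to platinum, then 18% for diamond). The mean and standard deviation below are invented for illustration, since the real MMR scale is hidden.

```python
from statistics import NormalDist

# Hypothetical MMR scale: mean and spread are invented for illustration.
mmr = NormalDist(mu=2500, sigma=800)

# Cumulative share of the ladder population at each league's upper bound.
upper_bounds = {
    "bronze": 0.20, "silver": 0.40, "gold": 0.60,
    "platinum": 0.80, "diamond": 0.98,   # 80% + diamond's 18%
}

for league, share in upper_bounds.items():
    print(f"{league:>8}: up to {mmr.inv_cdf(share):.0f} MMR")
```

Because the normal distribution is flat in its tails and steep around the mean, the resulting silver, gold, and platinum bands are the narrowest, matching the skill-difference observation in the text.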
Figure 2.1 Distribution of skill range. Grandmaster league is not shown as it consists of the top 200 players in the master league.1
Every league is split into a set of divisions. Each division contains 100 randomly assigned players with their
ranking displayed. If a player wins games, his rating points increase and he rises one or more positions within
his division. This way, players have a notion of progress and accomplishment which encourages them to
continue playing. In this research, divisions and division points were not taken into account because a player’s
ranking in a division depends on his frequency of play and when he was placed in that division. It also depends
on the time of measurement as the division points are reset at the start of every season.
2.1.2 Reasons for choosing Starcraft II
We have 3 reasons for choosing Starcraft II as a tool in this research. First, there is a great degree of skill
involved in playing this game. Second, players are divided into leagues according to their skill level based on
their wins and losses in the game. This gives us a reliable and objective grouping of players who are roughly
evenly skilled. This helps us to determine the performance of our opponent models. Third, a large number of replays is available on several websites, which allows us to gather data.
2.2 Expertise

According to the theory of deliberate practice (Ericsson, Krampe & Tesch-Römer, 1993), expertise is a question of 10 years or 10,000 hours of deliberate practice, rather than talent. In this time, experts gain extensive
knowledge about a specific domain. This knowledge gives experts the ability to recognize patterns. This also
helps them to make general assumptions based on chunks of information (Gobet & Simon, 1998). This could
also apply to experts in real-time strategy games where players have to make quick decisions but have little
information about the opponent.
2.2.1 Expertise in Chess
One of the first studies on expertise in Chess was conducted by De Groot (1946/1965) who studied the thought
process of chess players by asking them to think out loud about making their next move in a chess game. He found that there is no difference in depth of search between Grandmasters and to-be masters, although the quality of the produced moves was better for Grandmasters. He also found that Grandmasters were better at remembering existing, non-random board configurations. This was later confirmed by Campitelli, Gobet, Williams, and Parker (2007), and by Reingold, Charness, Pomplun, and Stampe (2001). Both research groups performed an eye-gazing study. The latter found that expert chess players fixate more on the area between pieces than on the individual pieces themselves. With this technique, experts are better at remembering existing board configurations than novices and intermediates.

1 [Untitled graph of population density and MMR]. Retrieved May 2012, from: http://us.battle.net/sc2/en/forum/topic/2112234276
Chase and Simon (1973) examined expert memory by asking expert and novice players to remember a number
of random and existing board configurations. They found that expert players were better at remembering
existing configurations while they performed equally well to the novices at random compositions. Experts were
also better at reproducing the position of a chess piece after viewing it for five seconds. This research led to
the chunking theory: the idea that experts make decisions based on a large number of chunks of information
that are stored in their long term memory. These chunks help experts to recognize patterns quicker and
accordingly make quicker decisions. Gobet and Simon (1998) extended this theory into the template theory
where a set of chunks together forms a larger and more complex structure in memory. This allows
Grandmasters to memorize relevant information faster, recognize board positions quicker and consequently
make faster decisions. Gobet and Simon (1996) also studied Grandmasters playing 6 opponents at the same
time. Rather than engaging in deep search to look ahead, Grandmasters reduce search space by using
recognition patterns, based on their extensive knowledge of the game and their opponents. This has also been
confirmed in research by Campitelli and Gobet (2004).
2.2.2 Expertise in action video games
One important difference between chess and modern video games is the pace of the game. In modern action or
strategy games, several events follow each other quickly and may occur simultaneously. Players are required to
react to these events in a timely and accurate manner. Green and Bavelier (2005, 2006) examined the difference in the allocation of attention between gamers and non-gamers. They conducted a functional field of view task to measure how well a person can locate a central target whilst being distracted by a number of visible elements and one other central task. Gamers with experience in these kinds of games had enhanced attentional
resources: they were better able to pay attention to several moving stimuli at the same time. Gamers also had
better visuospatial attention: they were better able to localize one or two central tasks among a number of
distractions (Green and Bavelier, 2006). When non-gamers were asked to play an action game for ten hours,
they showed significant improvement in attentional resources and visuospatial attention. In other words,
experience with a game improves players’ ability to multi-task. Dye, Green and Bavelier (2009) examined the allocation of attention to a number of
alerting cues. They found that action gamers responded quicker to these events and attended to them more
accurately than non-gamers.
2.2.3 Expertise in other domains
Research on pilots has shown that experts make decisions that are based on related cues faster and with fewer
errors than novices. Experienced pilots are also more likely to direct their attention to relevant cues rather than
irrelevant cues (Schriver, Morrow, Wickens & Talleur, 2008). Other research has shown that highly ranked
athletes process information faster than regular people when crossing a street (Chaddock, Neider, Voss, Gaspar
& Kramer, 2011). An eye gaze study in cricket has shown that more experienced batsmen need less time to
predict where the ball will hit the ground (Land & McLeod, 2000).
2.3 Player modeling

According to Bakkes, Spronck and Van Lankveld (2012), a player model is an abstracted description of a player’s
behavior in a game environment. Hence, an opponent model gives us knowledge about the characteristics of
the opponent. This can be preferences, strategies, strengths, weaknesses or skill (Van den Herik et al., 2005).
We can use this knowledge to create an AI that adapts itself to its opponent. In Subsection 2.3.1 we describe opponent modeling in classic games. In Subsection 2.3.2 we discuss opponent modeling in computer games
and the goal of opponent modeling in this thesis.
2.3.1 Player modeling in classic games
In research on classic games, the main goal of player modeling is to create strong AI. Carmel and Markovitch
(1993) and Iida, Uiterwijk and Van den Herik (1993) simultaneously studied opponent modeling to create AI
that searches for the best solution, given a certain game state. Donkers, Uiterwijk and Van den Herik (2001) used
a game tree to find the best solution. Richards and Amir (2007) examined probabilistic opponent modeling in
Scrabble. They took observations of previously played games and predicted what tiles the opponent would hold. Their method performed significantly better at playing Scrabble than Quackle, an open-source Scrabble AI that is
based on the architecture of the Scrabble AI program Maven.
Creating strong AI by means of player modeling gives us useful research on modeling expert opponents in
games. Van der Werf, Uiterwijk, Postma and Van den Herik (2002) focused on teaching a machine to predict moves in the game Go by observing human expert play. The system constructed in the experiment performed well on the prediction of regular moves but worse on the prediction of unusual moves. Kocsis, Uiterwijk,
Postma and Van den Herik (2003) used a neural network to predict which move in chess is best given a certain
position. The moves that are most likely to be the best should be considered first. They found that it is possible
to predict the best moves in mid-game from patterns in the training data.
2.3.2 Player modeling in computer games
Research on player modeling focuses on increasing the effectiveness of artificial agents. In computer games,
another point of interest is raising the entertainment value (Van den Herik et al., 2005). A player model can
have 2 different roles: it can either inform the player during the game or it can serve as an artificial opponent
(Van den Herik et al., 2005). If the AI has an opponent role, it is important that there is a balance: the AI should
not be too hard or too easy to beat. A strong imbalance leads to the player losing his interest and thus lowers
the entertainment value of the game (Houlette, 2004).
Some research on player modeling in computer games has been conducted. Schadd, Bakkes and Spronck (2007)
examined player modeling in the real-time strategy game Spring. They were able to successfully classify the
strategy of a player using hierarchical opponent models. Drachen, Canossa and Yannakakis (2009) collected data
from 1365 Tomb Raider: Underworld players. They used a self-organizing map as an unsupervised learning
technique to categorize players into 4 types. They showed that it is possible to cluster player data based on
patterns in game play.
The goal of player modeling in this thesis is to automatically classify human players according to their skill. We
can use this knowledge to create AI that mimics the skill of a human player. This makes the behavior of the
artificial opponent more realistic and thus more interesting to human players (Van den Herik et al., 2005;
Schadd et al., 2007; Houlette, 2004).
2.4 Sequential Minimal Optimization Algorithm
In order to see which algorithm is most suited for our classification task, we tested 4 algorithms on a subset of
the data. We compared the k-nearest neighbor algorithm IBk, the support vector machine Sequential Minimal
Optimization (SMO), the decision tree J48, and the ensemble learner RandomForest. IBk is a nearest neighbor
learner which classifies instances based on their similarity to previously seen instances. J48 builds a tree that
splits the data at each node on a single attribute, choosing the attribute that separates the classes best.
RandomForest is an ensemble learner that grows a forest of random decision trees. At each node, a
predetermined number of random attributes is considered and the most informative one is picked. After all
trees are grown, a majority vote determines the class of an instance. The SMO classifier
is described below. We picked SMO as our algorithm to build the player model, based on the pretest described
in Subsection 3.3.
The SMO classifier is an optimized support vector machine (SVM). A standard SVM uses linear functions to
classify data in a feature space that is not linearly separable. A kernel is used to transform this space into a
vector space in which the data are linearly separable, so that the classes can be separated with a straight line. If
we then transform the vector space back to the feature space, the separating line no longer appears straight.
This is best described with an illustration. In Figure 2.1, we see 12 instances in the original feature space. In
Figure 2.2, we have used a kernel to transform the space into a higher-dimensional vector space in which the
data are linearly separable. In Figure 2.3, we have transformed the vector space back to the feature space,
showing that the separating line no longer looks straight.
Figure 2.1 Feature Space Figure 2.2 Vector Space
Figure 2.3 Transformation back into Feature Space
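The kernel idea sketched in Figures 2.1-2.3 can be illustrated with a toy example of our own (not part of the thesis experiments): XOR-style points that no straight line separates in 2-D become linearly separable after a hand-picked feature map appends the product of the coordinates.

```python
# Toy illustration of the kernel idea: XOR-style data that no straight line
# separates in 2-D becomes linearly separable after adding a product feature.

def feature_map(x, y):
    """Map a 2-D point into 3-D by appending the product x*y."""
    return (x, y, x * y)

# Four points, labelled by the XOR of the coordinate signs:
# not linearly separable in the original 2-D space.
points = [((-1, -1), 0), ((1, 1), 0), ((-1, 1), 1), ((1, -1), 1)]

# In the mapped space, the hyperplane z = 0 separates the classes:
# class 0 has x*y = 1, class 1 has x*y = -1.
def classify(x, y):
    _, _, z = feature_map(x, y)
    return 0 if z > 0 else 1

assert all(classify(*p) == label for p, label in points)
```

A real SVM learns the separating hyperplane from data instead of having it hand-picked, but the transform-separate-transform-back structure is the same.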
Instances in vector space are called vectors. The line that separates the classes is called a hyperplane. The SVM
looks for the hyperplane with the greatest distance to both classes: the maximum margin. This margin is
determined by the vectors that lie closest to the boundary with the other class; these vectors are called support
vectors. Each of the support vectors pushes the hyperplane as far away as possible, creating the maximum
margin. This is shown in Figure 2.4: the triangles represent the support vectors, the thin lines represent the
maximum margin, and the thick line represents the decision boundary.
Figure 2.4 Support Vectors
SVMs perform binary classification. The standard way to handle multiple classes is one-versus-all classification:
an instance is either a member of a class or it is not. SMO instead uses pair-wise classification for multiclass
problems: it forms a pair for every possible combination of classes and creates a hyperplane for each pair. This
has proven to lead to more accurate results than one-versus-all classification (Park & Fürnkranz, 2007).
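The pair-wise scheme can be sketched in a few lines of Python. The class names and 1-D feature values below are invented, and the per-pair "classifier" is a trivial midpoint threshold rather than an SVM; the point is only the structure: one binary rule per pair of classes, combined by majority vote.

```python
# Sketch of pair-wise (one-versus-one) multiclass classification: one binary
# decision rule per pair of classes, combined by a majority vote.
from itertools import combinations
from collections import Counter

# Hypothetical 1-D training data: class label -> list of feature values.
train = {"bronze": [1, 2], "gold": [5, 6], "master": [9, 10]}

def midpoint_rule(a, b):
    """Binary rule separating classes a and b at the midpoint of their means."""
    mean = lambda xs: sum(xs) / len(xs)
    threshold = (mean(train[a]) + mean(train[b])) / 2
    lo, hi = (a, b) if mean(train[a]) < mean(train[b]) else (b, a)
    return lambda x: lo if x < threshold else hi

# One binary rule per unordered pair of classes, as SMO does for multiclass.
rules = [midpoint_rule(a, b) for a, b in combinations(train, 2)]

def predict(x):
    votes = Counter(rule(x) for rule in rules)
    return votes.most_common(1)[0][0]
```

With three classes this builds three rules; with the seven leagues of this thesis, SMO builds 21 pairwise hyperplanes.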
The dimensionality of the vector space depends on the number of features that is used to describe the data.
Training an SVM requires solving a large quadratic programming problem; with many features and classes, this
becomes a complex optimization that requires a lot of time and computing power. SMO solves this by breaking
the problem down into a series of smaller sub-problems, which can be solved analytically instead of
numerically (Platt, 1998).
In short, SMO is an optimized SVM. It uses pair-wise classification, which leads to more accurate results, and it
breaks the quadratic programming problem into small sub-problems, so that less time and computing power is
needed for classification and large data sets with many attributes and classes can be processed.
3. Experimental setup
We begin this chapter by explaining how we collected and selected our data. We elaborate on the features we
used to describe the data set. This is described in Subsection 3.1. In Subsection 3.2 we describe the method we
used to evaluate the classifiers. In Subsection 3.3, we report how we performed the pretest, what the results
were and what algorithm we picked. In Subsection 3.4, we explain how we built the final player model and how
we examined the importance of attributes by applying InfoGain. We also elaborate on the impact of time and
the server on the performance of the algorithm.
3.1 Data set
For this research, a large database of Starcraft II game replays is needed. We collected this database by
repeatedly checking the websites gamereplays.org, drop.sc, and sc2rep.com for two months. The replays we
collected were manually uploaded by players or by tournament organizers. For each game, we used the data of
both players to diminish any imbalance in the data set. Hence, 50% of the instances contain data about players
who won, and 50% of the data is about players who lost.
We collected a total of 1297 games. This is only a small portion of all games that were played during our
collection phase, which ran from 25 November 2011 up to and including 13 February 2012. We can make a
rough estimate of the number of games that were played during this time. The main screen of Starcraft II
displays in real time how many games are being played worldwide. We measured these numbers on one Friday,
one Sunday and one Monday, each day at 2 different times: around noon and in the evening. Based on the data
we gathered, we assume that an average game lasts 15 minutes and that games are played for about 14 hours a
day. We took the average of about 8750 games running at any moment, multiplied it by 4 to get the number of
games each hour, and multiplied that by 14 hours to arrive at 490.000 games each day. We multiplied this
number by the number of days in our collection phase, which is 104, resulting in an estimated 50.960.000
games. This number includes games played on servers other than the American and European ones, so we
divide it by 2, leaving 25.480.000 games. That number in turn covers 5 other game types besides 1v1 games, so
we take 1/6th of it as an estimate for the 1v1 games. This results in an estimated 4.246.667 1v1 games on
American and European servers. As we only have 1297 games, we have gathered about 0.03% of all played
games.
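The estimation arithmetic above can be reproduced directly; the figures below are the ones quoted in the text.

```python
# Reproducing the rough estimate of the number of 1v1 games played during
# the collection phase, using the figures quoted in the text.
concurrent_games = 8750                  # average games running at any moment
games_per_hour = concurrent_games * 4    # an average game lasts ~15 minutes
games_per_day = games_per_hour * 14      # games are played ~14 hours a day
total_games = games_per_day * 104        # days counted for the collection phase

games_us_eu = total_games / 2            # keep American and European servers
games_1v1 = games_us_eu / 6              # 1v1 is one of six game types

assert games_per_day == 490_000
assert round(games_1v1) == 4_246_667
print(f"collected fraction: {1297 / games_1v1:.4%}")
```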
All replays were compared to each other to filter out duplicates. This left a total of 1297 games from
1590 players in 7 different leagues. All games were played either on the American server (63.3%) or the
European server (36.8%). Out of the entire set of games, 74.6% was matched automatically by the system
(AutoMM), and 25.4% was matched manually (Private or Public). This resulted in a data set consisting of
49,908 instances, each describing one minute in the game. A specified overview can be found in Table 3.1.
league      | instances | percentage | games | players | American server | European server | AutoMM | Private/Public
bronze      | 4082      | 8.2%       | 85    | 139     | 70.6%           | 29.4%           | 94.1%  | 5.9%
silver      | 3979      | 8.0%       | 96    | 147     | 75.0%           | 25.0%           | 99.0%  | 1.0%
gold        | 5195      | 10.4%      | 122   | 195     | 67.2%           | 33.6%           | 92.6%  | 7.4%
platinum    | 8066      | 16.2%      | 196   | 282     | 67.3%           | 32.7%           | 87.8%  | 12.2%
diamond     | 10088     | 20.2%      | 244   | 321     | 64.8%           | 35.2%           | 85.2%  | 14.8%
master      | 12747     | 25.5%      | 383   | 427     | 76.2%           | 23.8%           | 58.0%  | 42.0%
grandmaster | 5751      | 11.5%      | 171   | 79      | 22.2%           | 77.8%           | 5.3%   | 94.7%
total       | 49908     | 100%       | 1297  | 1590    | 63.3%           | 36.8%           | 74.6%  | 25.4%
Table 3.1 Distribution of the data set
Considering that the grandmaster league encompasses the top 200 players of the master league, and that the
master league consists of 2% of all active players, the distribution of our data set is skewed with 25.5% of the
data consisting of master games and 11.5% consisting of grandmaster games. Games from the higher leagues
are more often uploaded than games from lower leagues. Players in the master and grandmaster league often
play in tournaments. The games played in these tournaments are often put on the internet so anyone can
watch them. Players in the 2 highest leagues also have fans for whom they upload replays. Players from lower
leagues upload games less often. This could be due to the fact that they are new to the game and are not
involved in the Starcraft II culture available outside the game.
Another reason could be that lower-ranked players play less frequently than higher-ranked players. The
theory of deliberate practice states that people become more skilled by engaging in deliberate practice
(Ericsson et al., 1993). It could be that higher-ranked players are skilled because they practice more, so that
more games are played, and thus available, in the higher leagues than in the lower leagues. This can only be a
partial explanation, because the differences are relatively large: there are far fewer players in the 2 highest
leagues than in the lower leagues.
3.1.1 Data Selection
Some requirements were set for the selection of the data. Firstly, we were only interested in 2-player games.
These games should be played on the American or European servers. All games should have the same version
of the game and they have to be played between 25 November 2011 and 13 February 2012. Information about
the league of players should be available. Games should be played between players within the same league.
We only used games that were played on American or European servers, due to a significant difference in skill
between Asian and Western players. The overall level of play in Asia is higher than in Europe and America,
which means that the level of play within one league differs between Asian and Western servers. Therefore, we
cannot rely on league as a measure of skill if we consider players from both Asian and Western servers. There is
a difference in skill level between Europeans and Americans as well, but it is far smaller than the difference
between Western and Asian players.
Games played in the bronze league include games that are played in the optional “practice league”. This is a
league in name only, as practice players are officially placed in the bronze league. Practice games are played at a
slower pace and on simplified maps. Players are allowed a maximum of 50 games in this league before
proceeding to standard games.
We only used 2-player games that were either set up through automatic matchmaking by the system (ladder
games) or set up manually (private or public games). In ladder games, two players are matched by the system
according to their league or hidden rating. In private or public games, players choose their own opponent.
Because players are placed in a league by playing ladder games, we can assume that their league is a sufficiently
accurate measure of skill in private or public games as well.
We only used games between players within the same league. When two players are matched by the system, it
takes their hidden rating into account. This means that for example a silver player could be matched with a gold
player. We do not know whether the silver player is about to be promoted or the gold player is about to be
demoted. Therefore, we cannot be as certain about their skill level as we can when they both play in the same
league. This uncertainty is less relevant for a numerical classification, but it can lower the performance for a
nominal classification.
Starcraft II is regularly patched by Blizzard to balance the game and make it more interesting to players. These
patches include changes in upgrades, units and buildings. These changes may affect the way the game is played.
All selected games were played in version 1.4.2.20141 of the game to avoid any differences that could be
attributed to the patch version.
We only used games played in season 4 or 5. Season 4 ran from 25 October 2011 up to and including 19
December 2011, and season 5 ran from 20 December 2011 up to and including 13 February 2012. We limited
our data set to this time because during the collection phase, we only had league information on players for
those two seasons. For most of season 4 and the entire season 5, the game was at version 1.4.2.20141, which is
convenient because it allows us to use almost all replays from seasons 4 and 5.
3.1.2 Data extraction
We used the sc2reader Python library to extract game and player information from the replay files. We first
extracted the Battle.net URLs of all players from the games. We then made a list containing the name of each
player together with his rank in seasons 4 and 5. For each game, we then determined when it was played, who
the players were, and what their league was. If the game was not from season 4 or 5, or the players were not in
the same league, we disregarded it. We then checked the patch version of the game and only selected games
from version 1.4.2.20141 of the game.
We grouped the games according to their league. If both players are in the same league, the game is regarded
as having been played in that league. For example, if both players of a game are in the silver league, that game
is regarded as a silver league game.
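This grouping step can be sketched as follows; the game representation (a dict with a list of players) and the helper name are our own, not code from the thesis pipeline.

```python
# Sketch of the league-grouping step: a game is kept only if both players
# are in the same league, and is then labelled with that league.
def assign_league(game):
    """Return the game's league, or None if the players' leagues differ.

    `game` is assumed to be a dict with a 'players' list of league names.
    """
    leagues = {player["league"] for player in game["players"]}
    return leagues.pop() if len(leagues) == 1 else None

same = {"players": [{"league": "silver"}, {"league": "silver"}]}
mixed = {"players": [{"league": "silver"}, {"league": "gold"}]}
assert assign_league(same) == "silver"
assert assign_league(mixed) is None
```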
After the replay files were selected and ordered, we used the command-line interface of the program Sc2gears
to print the in-game actions of all players. The replay files only contain information about the interactions of
the players with the game interface. Hence, we did not have information about how many units were killed,
how many structures were razed, what the available supply was, or how many resources were gathered. We
could only see how many buildings or units were ordered to be built, not how many were actually completed.
For example, if a command is issued to build a unit, but the building that produces this unit is destroyed, the
unfinished unit in production is lost too. This means that we can only count how many units, buildings and
upgrades were ordered for production. We can use this information to calculate how many resources and
supply the issued commands cost, but this does not tell us the actual costs, because we do not know whether
something was completed.
We now have a data set, which has to be divided into a training set and a test set. We have 2 different kinds of
sets: a pretest set and a 5-fold set. For the pretest set, we take 70% of the data as the training set and 30% as
the test set. For the 5-fold set, we take 80% of the data as our training set and 20% as our test set. We do this 5
times, so we end up with 5 training sets and 5 complementary test sets.
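Because the split is made at the level of whole games rather than individual time slices (see also Subsection 3.4.1), it can be sketched as below; `five_fold_game_split` is a hypothetical helper of ours, not code from the thesis.

```python
# Sketch of the 5-fold split at the level of whole games (not individual
# time slices), so that no game is divided between training and test data.
import random

def five_fold_game_split(game_ids, seed=0):
    """Yield (train, test) lists of game ids with an 80/20 split, five times."""
    ids = list(game_ids)
    random.Random(seed).shuffle(ids)
    fold_size = len(ids) // 5
    for k in range(5):
        test = ids[k * fold_size:(k + 1) * fold_size]
        train = [g for g in ids if g not in test]
        yield train, test

folds = list(five_fold_game_split(range(100)))
assert len(folds) == 5
assert all(len(test) == 20 and len(train) == 80 for train, test in folds)
```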
3.1.3 Feature set
After collecting the data set, we had to create an appropriate feature set to feed to the algorithms. Next to
general information about the game and the players, the replay files only contain information about the players'
interactions with the user interface. Therefore, we have to base our features on those interactions. We only
selected features that are applicable to all three races, because we want a general method to measure a
player's skill.
The instances in our data set are time slices. Every instance describes one minute of gameplay. There are 3
different kinds of features. Firstly, we have features that give general information. These features are not
affected by the in-game behavior of the players. Secondly, we have features that describe what has happened
during the minute that is described in the instance. Thirdly, we have features that describe what happened
during the entire game up to and including the minute that is described in the instance.
We can explain this with an example. Suppose the instance number is 2. Because the first 90 seconds of the
game are excluded, the per-minute features of this instance describe everything that happened during its own
minute, between 2.5 and 3.5 minutes into the game. The cumulative features describe everything that
happened during the entire game up to that point, between 1.5 and 3.5 minutes into the game.
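Under this convention, the windows covered by an instance can be computed directly; `windows` is a small helper of ours, with all times in minutes from the start of the game.

```python
# Computing the time windows (in minutes) covered by instance n: the first
# 90 seconds of each game are excluded, so instance n's per-minute window
# starts 1.5 + (n - 1) minutes into the game.
def windows(n):
    per_minute = (1.5 + (n - 1), 1.5 + n)   # just this instance's own minute
    cumulative = (1.5, 1.5 + n)             # whole game up to the instance
    return per_minute, cumulative

assert windows(2) == ((2.5, 3.5), (1.5, 3.5))
assert windows(1) == ((1.5, 2.5), (1.5, 2.5))
```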
General features
We want some general information about the players that might affect their behavior. Our first feature is of
course the league a player is in. Then there is the issue of race. Each race works in a different way. This might
affect the number of commands that is issued. The server on which the game is played might also be of
influence. We assume that the difference in skill between American and European players is not large but we
do not know the actual size of the difference. Near the end of the game, the winner usually has a stronger
economy than the loser. Therefore, we want to know whether or not a player won the game. Time is also
important because more game states are possible as the game progresses. There might be differences as to
when a player develops new technologies. This gives us the 6 features that are described in Table 3.2.
Feature            | Possible values                                              | Description
League             | bronze, silver, gold, platinum, diamond, master, grandmaster | Describes the player's skill
Server             | American, European                                           | Server where the game was played
Player race        | terran, protoss, zerg                                        | Race the player picked
Opponent race      | terran, protoss, zerg                                        | Race the opponent picked
Winner             | yes, no                                                      | Whether the player won the game
Minute of the game | 0 - 86                                                       | Each instance describes one minute in the game; the first 90 seconds are excluded from measurement
Table 3.2 General features
Visuospatial attention and motor skills
Because experts tend to have better motor skills and visuospatial attention, and make faster decisions than
novices, we selected the number of actions a player issues per minute as a feature. We also calculate how many
of the issued actions are actually effective. We use two measures: one for the overall game up to and including
the instance, and one for just the minute described in the instance. We also measure the overall redundancy:
the fraction of issued commands that were not effective. All features relating to issued and effective
commands are described in Table 3.3.
Overall feature | Possible values | Per-minute feature | Possible values | Description
average macro   | 0 - 87          | macro              | 0 - 131         | Average macro actions
average micro   | 0 - 269         | micro              | 1 - 372         | Average micro actions
average apm     | 0 - 289         | apm                | 1 - 388         | Average macro actions + average micro actions
total eapm      | 0 - 165         | eapm               | 0 - 191         | Average effective actions
redundancy      | 0 - 0.747       | -                  | -               | (total average actions per minute - total average effective actions per minute) / total average actions per minute
Table 3.3 Features measuring the number of commands issued by the player
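The redundancy feature from Table 3.3 can be written out directly; the APM/EAPM values in the example are made up.

```python
# The redundancy feature from Table 3.3: the fraction of issued actions
# that were not effective, computed from average APM and EAPM.
def redundancy(apm, eapm):
    """(average actions - average effective actions) / average actions."""
    return (apm - eapm) / apm if apm else 0.0

assert redundancy(200, 120) == 0.4   # 40% of issued actions were redundant
assert redundancy(150, 150) == 0.0   # every issued action was effective
```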
To decide whether an action is effective, we created a set of rules based on the preconditions used by the
Starcraft II analysis tool Sc2gears. We also look at the number of macro and micro actions that are issued. A
macro action is usually any action that costs the player minerals and/or gas, and thus describes the player's
economy. A micro action is any other action, and describes the player's skill at micro-managing his units. The
specific rules for deciding whether an action is macro or micro are based on the rules used by Sc2gears. The
rule sets for macro, micro, and effective actions can be found in Appendix A.
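As a minimal, hypothetical version of the macro/micro distinction (the real rule sets in Appendix A follow Sc2gears and are more detailed), an action that spends minerals or gas could be counted as macro and everything else as micro:

```python
# Hypothetical simplification of the macro/micro rules: a macro action is
# anything that spends minerals or gas; everything else counts as micro.
def action_kind(action):
    """`action` is assumed to be a dict with 'minerals' and 'gas' costs."""
    if action.get("minerals", 0) or action.get("gas", 0):
        return "macro"
    return "micro"

assert action_kind({"name": "train scv", "minerals": 50, "gas": 0}) == "macro"
assert action_kind({"name": "move", "minerals": 0, "gas": 0}) == "micro"
```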
Another way to measure motor skills and visuospatial attention is to look at a player's hotkey use. Hotkeys can
be used to quickly perform different actions on different parts of the map. If a player uses hotkeys more often,
this could indicate that the player is more proficient in the game. Therefore, we include the use of hotkeys in
our feature set. There are three ways in which a player can use hotkeys: he can assign a group to a hotkey, he
can add new buildings and/or units to an already assigned hotkey, and he can select an assigned hotkey. The
features related to hotkey usage are described in Table 3.4.
Overall feature                 | Possible values | Per-minute feature | Possible values | Description
Total hotkeys selected          | 0 - 5487        | Hotkeys selected   | 0 - 237         | Total number of times hotkeys were selected
Total hotkeys set               | 0 - 176         | Hotkeys set        | 0 - 21          | Total number of hotkeys set
Total hotkeys added             | 0 - 199         | Hotkeys added      | 0 - 14          | Total number of times new units or buildings were added to a hotkey
Total different hotkeys         | 0 - 17          | -                  | -               | Total number of different hotkeys that were set
Hotkeys selected per set hotkey | 0 - 1501        | -                  | -               | Total number of selections / total number of hotkeys set
Table 3.4 Features related to hotkey usage
Economy
In Starcraft II, having a strong economy is key to winning a game. If a player does not have enough resources to
spend, his army will be small and he will lose the game. On the other hand, if a player invests only in his
economy, chances are that the opponent will have an army first and kill the player. A skilled player knows how
to find an effective balance. We include the number of bases, the number of workers, and the spending rate in
our feature set to measure the strength of a player's economy. We also calculate the number of workers per
gas collection building and the amount of minerals spent per worker, to measure the balance between workers
and resources. The features regarding economy are described in Table 3.5.
Overall feature                 | Possible values | Per-minute feature | Possible values | Description
Total number of bases           | 1 - 20          | -                  | -               | Number of bases a player built during the game
Total number of workers         | 6 - 188         | Number of workers  | 0 - 22          | Number of workers a player built during the game
Total resources spent           | 0 - 111755      | Resources          | 0 - 5150        | Minerals spent + gas spent
Total minerals spent            | 0 - 72400       | Minerals spent     | 0 - 3600        | Amount of minerals spent by the player during the game
Total gas spent                 | 0 - 26600       | Gas spent          | 0 - 2050        | Amount of gas spent by the player during the game
Workers per collection building | 0 - 90          | -                  | -               | Total number of workers built / total number of gas collection buildings built
Spent minerals per worker       | 0 - 1084        | -                  | -               | Total amount of minerals spent / total number of workers built
Table 3.5 Features describing economy
Technology
Next to a strong economy, players can strengthen their position through technology. Players can upgrade the
strength of their units and research special abilities, and new, stronger units become available to players as
they advance in the technology tree. Therefore, the numbers of upgrades and abilities researched are included
in the feature set. We also want to know the tier a player is in; in our case, this is a general description of a
player's advancement in the technology tree. These three attributes are described in Table 3.6.
Overall feature           | Possible values | Description
Total number of upgrades  | 0 - 16          | Number of improvements researched
Total number of research  | 0 - 11          | Number of special abilities researched
Tier                      | 0 - 3           | General level of advancement in the technology tree
Table 3.6 Features related to technological development
Strategy
Another important factor in Starcraft II is play style, also called strategy. Players can be offensive, defensive, or
anywhere in between. This determines what and how many units are built and how many resources are
invested in defense. A good player knows how to balance offense and defense. We have included the number
of different units, the total number of fighting units, and the number of defensive structures as features to
measure this balance. Next to gas and minerals, units also cost supply. Therefore, we have also included the
supply that was provided and used. The features regarding strategy are described in Table 3.7.
Overall feature                            | Possible values | Per-minute feature | Possible values | Description
Total supply used                          | 6 - 1066        | Supply used        | 0 - 51          | Amount of supply used by the units that were built
Total supply gained                        | 10 - 756        | -                  | -               | Amount of supply gained by constructing supply buildings
Different units                            | 0 - 14          | -                  | -               | Any unit that was built, including workers, transport units and detection units
Total number of fighting units             | 0 - 695         | -                  | -               | Any unit that can fight, except for workers
Total number of defensive structures built | 0 - 69          | -                  | -               | Any building that can do damage to nearby enemies
Table 3.7 Features describing strategy
We now have a total of 44 features, including the class. We divided each game into slices with a length of one
minute. Some features were calculated for only the minute described in one slice; others were calculated for
the entire game up to and including the minute described in the slice. We do not describe the first 90 seconds
of a game, because only a limited number of actions is possible during this time, resulting in an almost equal
game state for all players, regardless of their skill. The first 90 seconds are excluded from the calculations of the
average (effective) actions per minute, but not from the other features that describe more than one minute, for
example the total amount of minerals collected during the game. Games are recorded in game time, which is
faster than normal time. In this research, we used normal time for our measurements, because it is easier to
use in our calculations.
3.2 Measurement
To measure the performance of the classifiers, we use accuracy. This measure tells us how many instances were
classified correctly out of a set of instances. We calculate the accuracy per class by dividing the number of
correctly classified instances in that class by the total number of instances in that class (TP / all instances in the
class). We then add up the accuracies of all classes and divide the sum by the number of classes to get the
average accuracy.
Because the data in our set are not evenly distributed over the classes, our data set is skewed. We can correct
for this by using a weighted accuracy. This is calculated by multiplying the accuracy of each class with a
corresponding weight, equal to the percentage of instances in the test set that belong to that class (instances in
the class / total number of instances). The weighted class accuracies are then added up to give us the weighted
average accuracy.
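As a concrete check of this computation, here is a small Python sketch with made-up counts; note that weighting each class's accuracy by its share of the test set makes the weighted accuracy equal to the plain fraction of correctly classified instances.

```python
import math

# Weighted accuracy: per-class accuracy weighted by each class's share of
# the test set, computed from illustrative (made-up) counts.
def weighted_accuracy(correct, totals):
    """`correct` and `totals` map each class to counts on the test set."""
    n = sum(totals.values())
    return sum((totals[c] / n) * (correct[c] / totals[c]) for c in totals)

correct = {"silver": 30, "master": 90}
totals = {"silver": 100, "master": 100}
# With these weights, the result equals the overall fraction correct.
assert math.isclose(weighted_accuracy(correct, totals), 120 / 200)
```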
To measure how well the algorithms actually perform, we need to compare their performance to a baseline.
Because our data set is skewed, we use a frequency baseline. This baseline automatically assigns all instances to
the most frequent class which in our case is the master league.
Whenever we want to examine whether there are significant differences in any of our experiments except for
the pretest, we first use the Shapiro-Wilk test of normality to examine whether the results are distributed
normally across the 5 folds. The null hypothesis of this test states that the samples come from normally
distributed data; if p < .05, the deviation from normality is significant and the null hypothesis should be
rejected.
According to Witten and Hall (2011), we should use a paired t-test to compare 2 algorithms that were tested
using an x-fold cross validation. This test examines whether the difference between the 2 classifiers is
significant across the folds. We do not need to test for equality of variance, since this test assumes that both
measurements were conducted on the exact same subsets of data. If the t-test suggests a significant difference,
we can calculate the size of the effect by converting Cohen's d into the effect size r. If we want to test several
measures on the same subset of data, and the number of members in each group is small, we should use a
k-related samples test.
If we want to compare the performance of an algorithm on different subsets of the data, we need to use a one-
way ANOVA. This test compares the means of all groups with each other to see whether they are significantly
different. Before we can conduct this test, we have to examine whether the variances of the subsets are equal
by conducting Levene's test. If the variances are equal, and the ANOVA suggests that the difference is
significant, we can use η2 to compute the effect size. Cohen suggests that for both the effect measures r and
η2, an effect size of .10 can be considered small, a size of .30 medium, and a size of .50 large (as cited in Field,
2009, p. 57).
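The d-to-r conversion can be written out as follows. The means and pooled standard deviation below are invented for illustration; the formula r = d / sqrt(d^2 + 4) is the standard conversion for equal group sizes.

```python
import math

# Converting Cohen's d into the effect size r, as used to interpret the
# t-test results (r of about .10 is small, .30 medium, .50 large).
def cohens_d(mean1, mean2, pooled_sd):
    return (mean1 - mean2) / pooled_sd

def d_to_r(d):
    """Standard conversion for equal group sizes: r = d / sqrt(d^2 + 4)."""
    return d / math.sqrt(d ** 2 + 4)

d = cohens_d(47.3, 25.5, 20.0)    # made-up means and pooled SD
assert math.isclose(d, 1.09)
assert 0.4 < d_to_r(d) < 0.5      # a fairly large effect on Cohen's scale
```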
3.3 Pretest
We want to find a classification method that is suitable for modeling players according to skill. To find one, we
perform a pretest in which we compare the performance of a number of algorithms. We are not interested in
optimizing any of the algorithms; that would be interesting if one wanted to use the classifier to build an AI, but
we only want to examine the basic performance of a classifier that seems decent enough. Therefore, we do not
perform an x-fold cross validation. Rather, we use 70% of our data set to train the classifiers and 30% to test
them.
We used 4 different classification algorithms on the test set to see which one is most suitable. We used 3 single
algorithms and 1 ensemble learner. The single algorithms we tested are: IBk, SMO, and J48. The ensemble
learner is RandomForest. For each classifier we used the default settings. We also used the original, unaltered
feature set. An overview of the accuracy of all classifiers is given in Table 4.1.
League             | J48   | IBk   | SMO   | RandomForest | Baseline
bronze             | 47.7% | 33.6% | 64.6% | 52.3%        | -
silver             | 22.8% | 28.7% | 24.6% | 26.8%        | -
gold               | 14.9% | 17.2% | 15.5% | 16.1%        | -
platinum           | 25.4% | 25.3% | 42.7% | 26.1%        | -
diamond            | 30.9% | 31.1% | 34.7% | 34.7%        | -
master             | 45.8% | 45.8% | 55.0% | 52.2%        | -
grandmaster        | 51.3% | 41.4% | 58.9% | 45.4%        | -
unweighted average | 34.1% | 31.9% | 42.3% | 36.2%        | 28.4%
weighted average   | 35.7% | 34.2% | 44.0% | 38.6%        | 28.4%
Table 4.1 Accuracy of the performance of the classifiers
If we plot the accuracy of all classifiers for each league, we get the graph in Figure 4.1. We can see that all
classifiers follow roughly the same curve; only for the platinum league is the performance of SMO markedly
higher than that of the other algorithms.
Figure 4.1 Accuracy of the classifiers for each league
All algorithms except IBk seem to perform well on the bronze league. For all algorithms, performance was
worst for the gold league; after the gold league, performance gradually increases for most classifiers, and SMO
shows a particularly steep increase. All classifiers had a relatively strong performance for the master and
grandmaster leagues. The players in these leagues together encompass the top 2% of Starcraft II players, and as
experts tend to play more consistently than novices, it is easier to model their behavior.
This distribution in performance is in line with the nature of the data set. If we look at Figure 2.1 we can see
that within the gold league, the skill difference between players is the smallest. For the silver and platinum
leagues it is somewhat larger. This makes it hard for classifiers to distinguish players between these leagues.
The bronze league encompasses the largest range in skill difference, which makes it easier to classify players in
this league. However, this league also has the largest error distance of all classes. It might be that some players
are new to the game. Based on their performance in the 5 initial placement matches, they were placed in the
bronze league. However, these players might have a steep learning curve. Because of this, they might still be in
the bronze league although their skill has improved a lot and they actually play at the level of a higher league.
Based on these results, we choose SMO as our classifier. It seems to have a decent performance on this test set.
It also seems to outperform the other classifiers for the prediction of the platinum league. For the other
classes, its performance seems to be about the same as the performance of the other algorithms.
3.4 Player model
In the previous sections, we described the collection and structure of our data set. We have also performed a
pretest to pick an algorithm that seems suitable; we chose SMO based on the results described in Section 3.3.
We now test the performance of SMO using a 5-fold cross validation. This is discussed in Subsection 3.4.1. We
use InfoGain to examine which features contribute the most to solving our classification problem. This is
described in Subsection 3.4.2. We then examine to what extent the time of measurement is important. We
create subsets of the test set by removing instances from specific time slices. We then re-test our player model
on these subsets. We describe this in Subsection 3.4.3. Finally, we try to predict on which server a game was
played. We describe this in Subsection 3.4.4.
3.4.1 Performance of SMO
To test the model, a 10-fold cross validation is commonly used, since it has been proven to be a statistically
robust method that usually gives the best estimate of error (Witten & Hall, 2011). However,
due to a lack of time, we limited our experiment to a 5-fold cross validation. We sampled the data for each set
randomly ourselves instead of using Weka, because we wanted the training and test sets to consist of entire
games. The x-fold cross validation embedded in Weka uses stratified holdout. This method aims at taking a
sample whose class distribution matches that of the entire set as closely as possible. However, this means that it does not leave
out entire games but rather parts of games. This is not representative of our data set, since we always have an
entire game of a player to classify and not a few random minutes.
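The game-level sampling described above can be sketched as follows. This is an illustrative reconstruction in Python, not the procedure actually used in the thesis; the number of games, instances per game, and the helper name `game_level_folds` are all invented for the example.

```python
import random

def game_level_folds(game_ids, k=5, seed=42):
    """Assign each instance to a fold, keeping whole games together."""
    games = sorted(set(game_ids))
    random.Random(seed).shuffle(games)
    fold_of_game = {g: i % k for i, g in enumerate(games)}
    return [fold_of_game[g] for g in game_ids]

# 10 hypothetical games with 4 time-slice instances each
game_ids = [g for g in range(10) for _ in range(4)]
folds = game_level_folds(game_ids, k=5)

# every game's instances end up in exactly one fold,
# so no game is split between training and test data
for g in range(10):
    assert len({f for f, gid in zip(folds, game_ids) if gid == g}) == 1
```

The key property is that fold membership is decided per game rather than per instance, which is what Weka's built-in stratified splitting does not guarantee.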
We build the player model with SMO using the default settings of Weka. These settings are not necessarily
optimal. However, our goal is to examine whether it is possible to build a player model, not to find the
optimal solution. After building and testing the model using a 5-fold cross validation, we examine the
performance and the errors that were made. We first elaborate on how the errors are distributed. We then
calculate the average classification distance and average error distance by multiplying the average confusion table with a
linear weight table. The results of these tests are reported in Section 4.1.
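The weight-table computation can be sketched as follows. The 3-class confusion matrix below is an invented toy example (the actual problem has 7 classes), and `error_distances` is a hypothetical helper name, not code from the thesis.

```python
def error_distances(confusion):
    """Average classification distance and average error distance per class."""
    n = len(confusion)
    # linear weight table: distance between actual and predicted class
    weight = [[abs(i - j) for j in range(n)] for i in range(n)]
    avg_class, avg_error = [], []
    for i, row in enumerate(confusion):
        weighted = sum(c * w for c, w in zip(row, weight[i]))
        total = sum(row)
        errors = total - row[i]          # misclassified instances of class i
        avg_class.append(weighted / total)
        avg_error.append(weighted / errors if errors else 0.0)
    return avg_class, avg_error

# toy 3-class confusion matrix (rows: actual class, columns: predicted class)
conf = [[8, 2, 0],
        [1, 6, 3],
        [0, 2, 7]]
avg_class, avg_error = error_distances(conf)
# e.g. for the first class: (2 * 1) / 10 = 0.2 average classification
# distance, and 2 / 2 = 1.0 average distance of its misclassifications
```

An average error distance of 1.0 for a class would mean every misclassified instance of that class landed in a neighboring class.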
3.4.2 InfoGain
InfoGain is a technique that evaluates each attribute by how much it reduces the uncertainty (entropy) about
the class. An attribute that yields a larger information gain is more informative. After all attributes are evaluated,
they are ranked according to their information gain: the most informative attribute is placed at the
first position and the least informative one is placed last. In order to test how removing attributes affects the
accuracy of the classifier and the time it takes to build the model, we built the model using different subsets of
the feature set. We removed 0, 20, and 40 attributes to see how this affects time and accuracy. The machine we
use for this test has a 64-bit Intel Core i7 processor @2.20 GHz with 6050 MB RAM. We ran a 64-bit version of
Weka with a maximum heap size of 4048 MB on a 64-bit version of Windows 7. The results of the tests
regarding InfoGain are reported in Section 4.2.
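As an illustration of the quantity that an InfoGain ranking is based on, the sketch below computes the information gain of one discrete attribute: the entropy of the class distribution minus the conditional entropy given the attribute. The toy labels and attribute values are invented and are not thesis data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Entropy of the class minus conditional entropy given the attribute."""
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = ['bronze', 'bronze', 'gold', 'gold']
attr   = ['low', 'low', 'high', 'high']   # perfectly separates the classes
print(info_gain(attr, labels))            # → 1.0
```

A perfectly separating attribute has a gain equal to the class entropy (here 1 bit); an attribute independent of the class has a gain of 0.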
3.4.3 Time
When we know the performance of the classifier on the test set, we can examine the impact of time. We want
to know if the accuracy of the classifier improves during the game. We begin by examining how the
performance changes when we remove the first instance of every game. This shows how well the
classifier performs after at least 2.5 minutes have been played. We perform a 5-fold cross validation to test this:
we remove the first instance from all test sets and apply the models to these subsets. We then
remove the first and second instances and test again. We repeat this process for the first 10 instances of every
game. The results of these tests are reported in Subsection 4.3.
3.4.4 Server
In our research, we used replay files from games that were played on either the American or the European
server. We want to know to what extent we can predict on which server a game was played. If we can do this
accurately, this means that there is a difference in behavior between players from different continents. We use
a 5-fold cross validation to build and test a model that predicts the server. We also try to predict the skill of a player
by using the entire feature set except for ‘server’ to build a model. We use a 5-fold cross validation to test
whether removing the feature ‘server’ has a significant effect on performance. The results of these tests are
reported in Subsection 4.4.
4. Results
In this section, we report the results of our experiments. We begin by examining which classification method is
most suitable to model players according to skill. In Subsection 4.1, we report how accurately SMO is able to
build a player model using a 5-fold cross validation. In Subsection 4.2, we discuss the ranking of the attributes
using InfoGain. In Subsection 4.3, we elaborate on the influence of time on the performance of the classifier. In
Subsection 4.4, we report the importance of the server regions.
4.1 Player model
If we evaluate our model using a 5-fold cross validation, on average 44.9% of instances are classified correctly
with a standard deviation of 2.67. The average accuracy for the baseline is 25.5% with a standard deviation of
1.05. The average weighted accuracy for SMO is 47.3% with a standard deviation of 2.2. The accuracy of SMO
for each class based on the average of the 5-fold cross validation is shown in Table 4.1.
League              SMO accuracy      Baseline accuracy
bronze              69.6%
silver              25.8%
gold                10.6%
platinum            40.2%
diamond             42.9%
master              63.3%
grandmaster         62.1%
average             44.9% (2.67)      25.5% (1.05)
weighted average    47.3% (2.19)
Table 4.1 Accuracy scores for SMO and the baseline. The standard deviation is reported between brackets
We want to know if the difference between the baseline accuracy and the performance of SMO is significant.
The Shapiro-Wilk test of normality shows that p > .05 for the weighted accuracy of SMO (M = 47.27, SD =
2.19), the unweighted accuracy (M = 44.93, SD = 2.67), and the baseline (M = 25.55, SD = 1.05). This
means that it is likely that the data is normally distributed. An overview of the results of this test is given in
Table 4.2.
Measure Statistic df p
SMO weighted accuracy 0.941 5 0.674
SMO unweighted accuracy 0.971 5 0.882
Baseline accuracy 0.892 5 0.365
Table 4.2 Results for the Shapiro-Wilk normality test
We can now use a paired t-test to examine if SMO performs significantly better than the baseline. This test
suggests that both the unweighted accuracy of SMO, with t(4) = 14.35, p < .001, and the weighted accuracy of
SMO, with t(4) = 32.10, p < .001, are significantly better than the baseline. The effect size for both the
weighted and the unweighted accuracy is .99.
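A paired t-test over per-fold accuracies can be sketched as below. The five fold accuracies are invented placeholders, not the values obtained in this experiment, and `paired_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic and Cohen's d for two matched samples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    d_mean, d_sd = mean(diffs), stdev(diffs)
    t = d_mean / (d_sd / sqrt(len(diffs)))   # df = n - 1
    cohens_d = d_mean / d_sd                 # effect size of the difference
    return t, cohens_d

smo      = [46.1, 48.0, 45.5, 49.2, 47.6]   # hypothetical per-fold accuracies
baseline = [25.0, 26.1, 24.8, 26.3, 25.4]

t, d = paired_t(smo, baseline)
# compare t against the critical value for df = 4 (2.776 at p = .05, two-tailed)
```

Because the same folds are used for both methods, the paired test on the per-fold differences is more sensitive than an unpaired comparison of the two accuracy lists.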
4.1.1 Error Distribution
The classes we use in our data set are ordinal. This means that class C is more distant from class A than class B is.
Therefore, it is preferable that a misclassified instance is placed in a neighboring class rather than in a more distant class. In Table 4.3,
we report the average confusion table for the 5-fold cross validation. We can see that on average, 67.0% of the
wrongly classified instances are placed in a neighboring class.
   a    b    c    d    e    f    g    <-- classified as
 567  135   39   57   16    2    0  | a
 206  202  127  185   63   12    1  | b
  80  195  109  394  178   69   14  | c
  60   99  130  649  391  262   23  | d
  16   28   77  319  865  609  102  | e
   9   17   24  196  471 1615  217  | f
   0    0    2    9   95  331  713  | g
Table 4.3 Average confusion table for the 5-fold cross validation
We can see that as leagues become more distant from each other, fewer errors are made. For each league, we can
calculate the average distance of the actual class to the assigned classes. If an instance is placed in a
neighboring class, we assign a weight of 1 to this error. If an instance is placed one class next to the neighboring
class, we assign a weight of 2 to this error. If we do this for all classes, we get the linear weight table that is
displayed in Table 4.4.
   a  b  c  d  e  f  g    <-- classified as
   0  1  2  3  4  5  6  | a
   1  0  1  2  3  4  5  | b
   2  1  0  1  2  3  4  | c
   3  2  1  0  1  2  3  | d
   4  3  2  1  0  1  2  | e
   5  4  3  2  1  0  1  | f
   6  5  4  3  2  1  0  | g
Table 4.4 Weight table to calculate the average misclassification
For the confusion table of each fold, we multiply each entry by the corresponding weight in Table 4.4. The
sum of each row is then divided by the total number of instances in the corresponding class. This gives us the
average classification distance for each league, shown in the second column of Table 4.5. A distance lower than 1
means that most of the instances were classified correctly. To get an average error distance for each league, we
sum the weighted errors and divide this number by the total number of errors. The average error
distance for each class is shown in the third column of Table 4.5. A distance of 1 means that all wrongly
classified instances were placed in a neighboring class. All reported numbers are averaged over the 5 folds.
League         Average classification    Average distance of misclassifications
Bronze         0.75                      1.95
Silver         1.34                      1.75
Gold           1.47                      1.61
Platinum       1.11                      1.67
Diamond        0.92                      1.35
Master         0.60                      1.35
Grandmaster    0.38                      1.19
Average        0.94                      1.55
Table 4.5 Average distance of classification and misclassification per league based on a 5-fold cross validation
We can see that on average, the distance of the misclassifications is 1.55. This means that
most misclassified instances were placed in a neighboring class. We can also see that no class scored an average error distance of
2.00 or more, meaning that for all classes, most misclassified instances were placed in a neighboring class.
4.2 InfoGain
InfoGain tells us how informative each attribute is. It ranks the features in the set according to the information
they provide in solving the classification problem. We ranked the features based on the average weight for the
5 training sets used in the 5-fold cross validation. This gives us the list of ranked features that is reported in
Appendix B. If we perform the Shapiro-Wilk test for normality, we can see that the weights for all attributes
except total minerals spent, amount of minerals spent per worker, and tier are distributed normally with p >
.05. A Friedman test suggests that there is a significant difference in weight between the attributes χ2(42) =
210.00, p <.001.
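The Friedman statistic used here can be computed directly from the per-fold attribute weights, as sketched below. The 5-folds-by-3-attributes weight matrix is invented for illustration, and the sketch assumes no ties within a fold.

```python
def friedman_chi2(matrix):
    """Friedman chi-square; matrix[i][j] = weight of attribute j in fold i."""
    n, k = len(matrix), len(matrix[0])
    rank_sums = [0.0] * k
    for row in matrix:
        # rank attributes within each fold (rank 1 = smallest weight)
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    # standard Friedman statistic: 12/(n k (k+1)) * sum(R_j^2) - 3 n (k+1)
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)

weights = [[0.55, 0.40, 0.20],   # hypothetical weights, fold 1
           [0.57, 0.41, 0.22],
           [0.54, 0.39, 0.21],
           [0.56, 0.42, 0.19],
           [0.58, 0.38, 0.23]]
print(friedman_chi2(weights))    # → 10.0
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; a large value means the attributes are ranked consistently differently across folds.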
We have plotted the average weights for all attributes in Figure 4.1. The x-axis contains the rank number of the
attribute. We can see that between the first attribute (average micro over the game) and the second attribute
(average apm over the game), the weight drops from 0.569 to 0.511. A paired t-test suggests that the difference
in weight is significant with t(4) = 34.71, p < .001 with an effect size of .99.
Figure 4.1 Weight values for the attributes. The rank number is displayed on the x-axis
After the eighth ranked feature (total hotkeys selected over the game), the weight drops from 0.415 for the
eighth feature to 0.256 for the ninth feature (amount of selections per hotkey). To examine if this is a significant
difference, we perform a paired t-test between the scores of these 2 attributes. We have already established
that the weights within these groups are distributed normally with p > .05. The t-test indicates that the
difference in weight between the 2 attributes is significant with t(4) = 59.30, p < .001 and an effect size of .99.
The top 8 features are ranked in the same order for all 5 folds.
If we remove features from the feature set, the time it takes to build the model will drop, but performance will
drop too. We want to know how much time we can gain and how much the performance of SMO will drop. We
built models using 3 subsets of the feature set: 1 with the complete feature set, 1 with the worst 20 features
removed, and 1 with the worst 40 features removed. Figure 4.2 shows the accuracy for these feature sets. If we
compare the weighted accuracy for the entire feature set (M = 47.27, SD = 2.19) with the set where we
removed 20 features (M = 43.75, SD = 1.05) with a paired t-test, we can see that performance drops
significantly with t(4) = 2.93, p < .05, and an effect size of .82.
Figure 4.2 Accuracy after removing attributes
On our machine, it takes an average of 2.8 seconds to build a model based on 3 predictive features. If we do not
remove any attributes, it takes 2398.3 seconds to build the model, as we can see in Table 4.6. If we remove the
worst 20 attributes, it takes 49.1 seconds to build the model.
Removed attributes    Seconds to build model
0                     2398.3
20                    49.1
40                    2.8
Table 4.6 Seconds to build the model after removing features
4.3 Time
We want to know from what moment in the game SMO can model players according to their skill and if time
has an influence on performance. We have plotted the average weighted accuracy scores of 10 different
subsets to examine when performance increases. We first examined the performance on the entire training and
test set. We then incrementally removed instances to see how the accuracy changes. We started by removing
all first instances, then all first and second instances, and so on, up to and including the 9th instance.
The graph is shown in Figure 4.3. The blue line represents the accuracy for SMO and the red line represents the
baseline.
Figure 4.3 Accuracy of SMO for different time periods; the number on the x-axis shows the instances that were omitted
We want to know if there is a significant change in performance for SMO if the game progresses. We first test if
the data is normally distributed. The Shapiro-Wilk test shows that this is indeed the case for all subsets of the
data with p > .05 for each individual group and for all groups together. If we perform a Levene’s test on the
data, we see that p > .05, meaning that variance is equal over the groups. We can now perform an ANOVA test
to examine if there is any significant difference in the performance of SMO over time. The null hypothesis states
that there is no difference between groups and is rejected if p < .05. For this test, we get a value of F(8) = 0.082
with p > .05 which suggests that time does not seem to have a significant effect on the accuracy of the classifier.
We also tested if the baseline has a normal distribution. Since p > .05 for each individual subset and for the
overall average of the baseline, we can say that the data is distributed normally. Levene’s test shows that the
variance is equal for all groups with p > .05. If we now perform an ANOVA, we see that the baseline does not
seem to change significantly over time with F(8) = 1.71 and p > .05.
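The one-way ANOVA used for these comparisons can be sketched as follows. The accuracy values for the three illustrative time subsets are invented, and `one_way_f` is a hypothetical helper name.

```python
from statistics import mean

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    grand = mean(v for g in groups for v in g)
    k = len(groups)
    n = sum(len(g) for g in groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# hypothetical SMO accuracies for three time subsets (5 folds each)
subsets = [[47.0, 46.5, 47.8, 46.9, 47.3],
           [47.1, 46.8, 47.5, 47.0, 47.2],
           [46.9, 47.2, 47.4, 46.6, 47.1]]
f = one_way_f(subsets)
# an F well below the critical value for df = (2, 12) (3.89 at p = .05)
# gives no evidence that accuracy differs between the subsets
```

A non-significant F here corresponds to the conclusion above: removing early instances does not measurably change the classifier's accuracy.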
4.4 Server
We built a new player model using SMO that predicts whether a game was played on the American or the
European server. On average, 65.3% (SD = 1.48) of the instances in the test sets belonged to the most frequent
class: the American server. If we count the true positives for both classes and divide this by the total number of
instances, on average, 71.7% (SD = 1.20) of all instances are classified correctly. The data is normally distributed
with p > .05. A t-test suggests that the difference in accuracy is significant with t(4) = 19.43, p < .001
and an effect size of .99. However, the percentage of games on each server is not equal for each league. For all
leagues but grandmaster, the majority of the games were played on the American server. In the grandmaster
league, the majority of games were played on the European server. If we assign all data to the American
server for the first 6 leagues, and all data to the European server for the last league, we get a corrected
baseline. The accuracy values of this baseline are the same as SMO for each fold with the same mean and
standard deviation (M = 71.73, SD = 1.20). This indicates that SMO has the same accuracy as the corrected
baseline and does not perform significantly better or worse than the baseline.
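The corrected baseline described above can be computed by taking the per-league majority server, as sketched below. The per-league game counts are invented placeholders, not the actual counts from the data set.

```python
def corrected_baseline_accuracy(counts):
    """counts: {league: (games_on_american, games_on_european)}.

    Predict the majority server within each league and return the
    resulting overall accuracy.
    """
    correct = sum(max(us, eu) for us, eu in counts.values())
    total = sum(us + eu for us, eu in counts.values())
    return correct / total

counts = {
    'bronze': (80, 20), 'silver': (75, 25), 'gold': (70, 30),
    'platinum': (65, 35), 'diamond': (60, 40), 'master': (55, 45),
    'grandmaster': (30, 70),   # grandmaster majority on the European server
}
print(round(corrected_baseline_accuracy(counts), 3))   # → 0.679
```

Because the majority server differs per league, this baseline is stronger than simply predicting the overall majority class.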
According to InfoGain, the server is the 14th most important feature for our model. We can examine its
influence further by constructing a model that contains the entire feature set except for the server. If we use
this model on the test set, we can see that the weighted accuracy is 44.4% with a standard deviation of 2.05. Table
4.7 shows the accuracy and standard deviation of the weighted accuracy for SMO and the baseline.
          All features      Without server
SMO       47.3% (2.19)      44.4% (2.05)
Baseline  25.5% (1.05)      25.5% (1.05)
Table 4.7 Weighted accuracy for SMO with and without the feature "server"; the standard deviation is shown between brackets
The data is distributed normally with p > .05 using the Shapiro-Wilk test. We use a paired t-test to see if removing
the server has a significant effect on the performance of SMO. The results of the test suggest that the
weighted accuracy of SMO using all features (M = 47.27, SD = 2.19) is not significantly better than when we
remove the attribute ‘server’ (M = 44.39, SD = 2.05) with t(4) = 2.15, p > .05. If we compare the weighted
accuracy of the ‘without server’ feature set with the baseline, the t-test suggests that it is significantly better
than the baseline with t(4) = 18.31, p < .05.
5. Discussion
In this chapter, we discuss the results of our experiments. In Subsection 5.1, we elaborate on the performance
of SMO. We also discuss the error distribution. Our tests with InfoGain are discussed in Subsection 5.2. In
Subsection 5.3, we elaborate on how time influences performance. The importance of the server on which a
game was played is discussed in Subsection 5.4. We conclude this chapter by elaborating on the
implementation of our results into AI in Subsection 5.5.
5.1 Player model
We used SMO to build the player model. We tested the algorithm using a 5-fold cross validation and the results
suggest that its performance is significantly better than the baseline. The effect size indicated that the
difference in accuracy between SMO and the baseline is large. These results suggest that SMO is significantly
better at predicting players’ skill than could be expected by chance.
If we look at the distribution of errors for the test set, we can see that most misclassified instances were placed
in a neighboring class, with an average error distance of 1.55. We can attribute this distribution of errors to the
nature of the data set. If a player beats better players, his rating rises more. If he wins consistently,
he will eventually be promoted. If a player wins against weaker players in a higher league but loses against average
players in that league, his rating will not increase consistently and he will not be promoted. The system does
this to prevent players around the borders of a league from constantly switching leagues. It could also be the case
that a player has a relatively high hidden rating but only a small number of ladder points because he does
not play often. This means that two players in different leagues might have the same hidden rating, with one of them
not yet promoted because of a lack of ladder points. Therefore, for each league in our data set, a percentage
of players belong to another league or could be placed in either one of 2 leagues.
In our research, we used a linear weight table to calculate the error distance. If the results of this test were to
be implemented in an AI, an exponential weight table might also be suitable. Such a table assigns an
exponentially larger weight to errors that are more distant. Thus, it is much worse to classify a grandmaster as a
bronze player than to classify a gold player as bronze. If we were to build an adaptive AI, it would be much
worse to pit a grandmaster against an easy AI than a gold player.
5.2 InfoGain
Applying InfoGain to optimize the performance of SMO by rearranging the original feature set has no effect.
However, it does tell us something about which attributes contribute most to solving the classification problem.
The single most informative feature, average micro over the game, has a significantly higher weight than the
second ranked feature, average apm. Average micro describes the average amount of issued commands that
involve micro actions over the entire game history up to the minute of the instance. The second (average apm)
and third (average eapm) ranked attributes also describe an average number over the past behavior in the
game. This indicates that the game history is an important factor in determining the skill of a player.
Between the eighth and ninth feature, the weight drops significantly. The first 8 features are ranked in the same
order for all folds. We included these 8 best features in the set because they measure motor skills. A gamer
must have excellent control over his mouse and keyboard to issue a large number of commands and use a large
number of hotkeys in a short period of time. This suggests that motor skill is another attribute of expertise that applies to
the domain of real-time strategy games.
The sixth (hotkey selections during the past minute) and eighth (amount of hotkeys selected during the entire
game) ranked features also measure the visuospatial attention of players. The features regarding apm and eapm
also measure this to some extent since they also include the use of hotkeys. Visuospatial attention is an
important attribute of expertise in many domains where several cues in different locations ask for attention at
the same time. The outcome of our research suggests that this may also be an important attribute of expertise
in real-time strategy games.
Removing 20 features from the feature set to build the model significantly reduces accuracy.
However, it also saves time: building the model takes almost 50 times less time than with the
entire feature set.
5.3 Time
Removing instances from the test set does not improve performance. Because the first instance of every game
covers the period from 1.5 until 2.5 minutes into the game, we can say that from 2.5 minutes into the game, we can give a
reasonably accurate prediction of a player's skill.
On average, the length of our games over the entire data set is about 15 minutes. This means that on average,
we can predict players’ skill after about 16.7% of the game is played. We can use this knowledge to scale the
difficulty of an AI during the game if the right strategy is used. As economy is an important factor in Starcraft II,
we could create a basic AI with an average economy. The AI can then choose what to do with its resources. If it
turns out that the player is a novice, it can play in such a way that only a small portion of the available resources
is used. This play style is somewhat similar to the behavior of novices since generally they do not spend their
resources effectively. However, if the opponent is an expert, the AI can strengthen its economy and increase its
spending. As expert players usually take more time to create a strong economy than novices, it is an acceptable
strategy for the AI to do the same.
5.4 Server
SMO does not perform significantly better than the corrected baseline in predicting on which server a game
was played. We also examined what happens if we try to predict players’ skill without the feature ‘server’. After
using the entire feature set except for “server” to build a model, performance did not seem to drop
significantly. This indicates that the server may not be an important factor in solving the classification problem.
This means that the in-game behavior of European and American players may not be significantly different. The
importance of server could be further examined by building a separate model for data from each server. If this
significantly improves the performance of the classifier, this would mean that the game behavior of players on
the American and European server is significantly different.
5.5 Implementation in AI
The InfoGain filter showed that features describing issued actions and effective actions are the most
informative features. Although we cannot directly implement apm or eapm into the behavior of an AI, we can
use them to measure a human player's skill. This information can be used to adjust the difficulty of the AI to the
level of the player, to make the game more interesting. One way to do this is described in Section 5.3.
The average number of micro actions that were issued seems to be the most important feature in predicting
players' skill. This feature describes how quickly a player was able to micro-manage his units. If a player is good
at micro-managing his units, he is better at destroying the opponent's units and keeping his own units alive than
if he lets the game AI fight the battles on a micro level. This indicates that it might be possible to make an AI more
interesting by adjusting the way it micro-manages units. An AI could do this more effectively when pitted
against a strong player and less effectively when pitted against a weak player. However, we must note that the
number of actions does not tell us anything about the quality of those actions. From our personal experience
with the game, we know that effective micro-management can outperform the game AI. Further research is
needed to examine to what extent the quality of players' micro-management changes with skill.
Although the effective issued actions per minute (eapm) is in the top 3 of informative features and thus an
effective measure of a player's skill, it is hard to directly implement into an AI. One cannot simply
raise the number of effective actions in order to make an AI play at a higher level. An effective action is an
action that was not only issued but also performed. This says nothing about the quality of the action. For
example, building 20 spawning pools during one game is not relevant since only one is needed. Raising the
number of effective actions by making the AI build 20 spawning pools will raise its eapm but does not make it
play better. However, if the AI is using a solid strategy, making it perform actions quicker or slower may affect its
level of play. Further research might investigate to what extent the measure eapm can be applied to artificial
agents.
Directly implementing the issued actions per minute (apm) into the behavior of an AI is not useful. This
measure describes how many commands were issued. This includes selecting buildings and units without giving
them actual commands. It does not contribute much if we tell an AI to select and deselect a lot of units and
buildings since these actions have no effect on the game.
6. Conclusions
In this chapter, we present the answers to the 5 research questions that we formulated in Section 1.1. These
questions aim to solve our problem statement:
To what extent can we use a player model to accurately distinguish a novice player from an expert player in a
real-time strategy game?
The 5 questions that we asked to solve this problem are:
1. For Starcraft II, what selection of player data is appropriate as a data set to build a player model upon?
2. For Starcraft II, what is a suitable classification method to build a player model that predicts the player’s
skill level according to the player’s behavior?
3. How accurately can a player model predict a player’s skill?
4. How does the performance of our player model change as the game progresses?
5. To what extent can we use the player model to build an AI?
These research questions are answered in the first 5 subsections of this chapter. Each subsection answers one
question. We attempt to solve the problem statement in Subsection 6.6. We end this chapter by discussing the
limitations of this thesis and giving recommendations for future work in Subsection 6.7.
6.1 Selection of the dataset
We gathered data from 3 different websites. Most of this data was uploaded by players themselves. Because we
only had access to manually uploaded replay files, our data set contains only an estimated 0.03% of all games
that were played during our collection phase. The distribution of the data over the leagues was skewed
compared to the distribution of the population, since we were unable to gather as many games from the lower
leagues as from the higher ones.
To remove as much noise from the data set as possible, we limited our data to one patch version of the game.
This version of the patch was live during seasons 4 and 5. These were also the seasons for which we could gather
league information. We only used games between players within the same league to limit the number of
players whose skill is around the border of a league. All games were played on either the American or European
server. We chose to use 2 servers so we could collect more data. The difference in level of play between American and
European players is not large enough to make a significant impact on the performance of our classifier.
We can conclude that we were able to collect a data set of games that are comparable to each other. However,
we were able to collect only a small percentage of all games that were played during our collection phase. The
distribution of our data set over the leagues was skewed compared to the distribution of the population.
6.2 Finding a classification method
We established the main properties of expertise that are applicable in this domain. Our feature set was
composed according to these properties and according to properties of the game itself. We then compared 4
different algorithms to see which one could be suitable for our problem. We chose SMO as it seemed to have a
somewhat better performance than the other classifiers on the platinum league whilst having about the same
performance on all other leagues.
37
For all classifiers, it seemed that the gold league was hardest to predict. This is in line with the nature of the
data set as the gold league spans the shortest range in skill difference. This means that there is not much
difference between players in the middle of a league and players near the class border. This makes prediction
harder. All classifiers seemed to have a decent performance on classifying the bronze, master, and grandmaster
leagues. As the bronze league encompasses the largest span in skill difference, and master and grandmaster
players tend to play more consistently, this result is also in line with the nature of the data set.
In short, we can conclude that based on the methods we tested on the pretest set, SMO seemed the most
suitable classifier. All algorithms followed about the same curve for their accuracy over the leagues. The gold
league seemed hardest to predict. The bronze, master, and grandmaster leagues seemed easiest to predict.
6.3 Predicting skill
Using a 5-fold cross validation, the average weighted accuracy of SMO was 47.3% (SD = 2.19) with a baseline of
25.5% (SD = 1.05). This is a significant improvement. The average error distance for the misclassified instances
was 1.55. This means that most misclassified instances are placed in a neighboring league. Due to the nature of
the rating system, players whose skill is around a class border are hard to classify. It is uncertain in what league
they should be placed. Strong players in one league could have about the same skill as weak players in the next
league. We do not know the percentage of players that could be seen as ambiguous. Therefore, we cannot say
what percentage of players can be classified with certainty.
To conclude, we can say that it is possible to build a player model to predict players’ skill. Most misclassified
instances were placed in a neighboring league which can partly be attributed to the nature of the rating system.
Since we do not know what percentage of players is around a class border, we cannot say what the upper limit
of our player model is.
6.4 The influence of game progression
For our classification problem, we already skipped the first 90 seconds of every game. We then incrementally
removed instances from every game in the test set of each fold to examine how this changes performance. The
accuracy of SMO did not change significantly. This means that at least from 2.5 minutes into
the game, time does not have a big influence on performance. We can get a decent prediction after 2.5 minutes
have been played. If we compare this to the average game length in our data set, on average we can predict skill after
about 16.7% of the game is played. After that time, performance marginally increases until the 8th instance, or
9.5 minutes into the game, when performance flattens. We only examined this until the 10th instance, so we
cannot say if performance would suddenly increase later on.
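The per-instance evaluation described above could be sketched as follows. This is an illustration rather than the actual experimental code: the `games` structure (ordered per-game lists of instances, the first taken at 2.5 minutes) and the `predict` function are assumptions.

```python
# Illustrative sketch: measure how prediction accuracy evolves with game
# time by scoring a fitted classifier separately on each time-sliced
# instance index across all games. Data layout is an assumption.

def accuracy_per_instance_index(games, predict, max_index=10):
    """Accuracy for the 1st, 2nd, ..., max_index-th instance of each game.

    games: dict mapping a game id to its ordered list of (features, league)
    instances; predict: any fitted classifier, features -> league."""
    accuracies = []
    for i in range(max_index):
        slices = [g[i] for g in games.values() if len(g) > i]
        if not slices:
            break  # no game has this many instances
        correct = sum(predict(x) == y for x, y in slices)
        accuracies.append(correct / len(slices))
    return accuracies
```

A flat resulting curve is what the result above corresponds to: accuracy at the first instance is already close to accuracy at the tenth.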
In short, we can conclude that from at least 2.5 minutes into the game, we can give a decent prediction of a
player's skill. From that moment, performance does not improve significantly over time.
6.5 Artificial Intelligence
We can use our player model to measure a human player's skill. This information could be used to scale the
difficulty of an AI to the skill of the player. One way to do this is to create a basic AI with an average economy
and an average level of micro-management. Because we can predict the skill of a player after 2.5 minutes, the AI
could then adjust its behavior to match the player. We cannot directly implement the results of our test into an
AI, because the main features that describe expertise measure the number of commands that were issued; they
do not tell us anything about the quality of those actions. Since we did not actually build and test an AI, we can
only give an example of how our player model could be used.
In short, we can model a player's skill from at least 2.5 minutes into the game. This indicates that it could be
possible to adjust the behavior of an AI to a player's skill early on in the game.
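As a purely hypothetical illustration of this idea, the sketch below maps a predicted league to an AI intensity setting. The `DIFFICULTY` table, the `BasicAI` placeholder, and the `predict_league` function are all invented for the example; they are not part of the thesis or of any existing game AI.

```python
# Hypothetical sketch: use a league prediction made ~2.5 minutes into a
# game to re-scale an AI's economy and micro-management intensity.
# All names and numbers below are illustrative assumptions.

DIFFICULTY = {
    "bronze": 0.30, "silver": 0.40, "gold": 0.50, "platinum": 0.60,
    "diamond": 0.70, "master": 0.85, "grandmaster": 1.00,
}

class BasicAI:
    """Placeholder AI whose behavior intensity can be re-scaled."""
    def __init__(self):
        self.intensity = 0.5  # start at an average level, as suggested above

    def set_intensity(self, value):
        self.intensity = value

def adapt_to_player(features, predict_league, ai):
    """Predict the player's league from observed features and re-scale the AI."""
    league = predict_league(features)
    ai.set_intensity(DIFFICULTY[league])
    return league
```

The design point is that the AI starts at an average setting and only commits to a difficulty level once enough of the game has been observed for the prediction to be reliable.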
6.6 Answering the problem statement
Based on our research, we can say that it is possible to build a player model of a player's skill. In our
research, we chose the SMO classification algorithm as a suitable classifier. We did not use cross-
validation or any optimization methods to test which classifier is most suitable or what the highest possible
performance is. SMO might perform more strongly when optimized, or another optimized algorithm might be
more effective. Further research is needed to examine this.
We do not know the upper limit of the classifier's performance, because we do not know what percentage of
players is near a class border. Based on our data, we can say that we can make a fairly accurate model of
expert players. Because our data set is skewed, we cannot say to what extent our research generalizes to
all leagues and to the entire set of games that were played. We cannot directly implement our results into an AI,
because the main features in our set describe the number of actions that were issued but say nothing about
the quality of those actions. Because we can give a fairly accurate prediction early on in the game, an AI
could adjust its behavior to a player early on. Since we did not test this, we can only advise on how our
results might be implemented in an AI.
Therefore, the answer to the problem statement is that we can build a player model of a player's skill for our
data set. We can do this from at least 2.5 minutes into the game. The model for high-ranked players is fairly
accurate. However, we do not know the upper limit of the model. We also do not know to what extent our
model is an accurate representation of the entire population, since we only used a small and skewed subset of
data. Our results indicate that an AI might be able to automatically adapt to a player's skill early on in the game,
but we did not implement this in an AI.
6.7 Limitations and recommendations for future work
This research shows that we can model the skill of players. This is an encouragement for further and more
extensive research in the domain of player modeling. Although our research has its limitations, further research
might lift some of them.
Because the replay files were intentionally uploaded and we only used a small percentage of all games played,
we do not know to what extent our data set is representative of all games that are played. The
distribution of the data over the leagues is skewed compared to the actual distribution of the population: we
have more games for the higher leagues than for the lower ones. One possible partial explanation is that
higher-ranked players play more often; it could also simply be that lower-ranked players upload their games
less often. Either way, more extensive research is needed to examine how this research scales up to the
entire set of games that are played.
Players near the border of each class are hard to classify. The class borders are nominal, but differences in
player skill are gradual. We do not know what percentage of players is near a class border, and therefore we
cannot say what the upper limit of our player model is. Further research could look at data sets with
more information about the actual skill of players.
In our research, we did not attempt to optimize any of the classifiers. Further research could attempt to
optimize the classifiers to examine how this improves performance. Other classifiers might be just as
suitable as, or better than, the one we used. Furthermore, the performance of SMO itself could be boosted
by optimization techniques.
We chose an algorithm for our research by constructing a model based on 70% of the data with the remaining
30% serving as our test set. This only indicates the performance of the algorithms on that particular test set.
We used 5-fold cross-validation to examine the performance of our classifier. However, 10-fold cross-
validation has proven to be the most effective method to assess algorithms. Further research could use this
technique to compare and test algorithms, as it provides a better estimate of the accuracy of the classifiers.
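The suggested 10-fold cross-validation could be sketched in pure Python as follows. This is a simplified stand-in for the Weka procedure used in the thesis; `fit` is assumed to be any function that trains on a list of (features, label) pairs and returns a predict function.

```python
# Illustrative sketch of k-fold cross-validation (k = 10 by default).
# A stand-in for Weka's built-in procedure, not the thesis code.
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, data, k=10):
    """Mean held-out accuracy of `fit` over k folds.

    fit: callable taking a training list of (features, label) pairs and
    returning a predict function (features -> label)."""
    folds = kfold_indices(len(data), k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        predict = fit(train)
        test = [data[j] for j in test_idx]
        accuracies.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(accuracies) / k
```

Each instance is held out exactly once, so the mean over folds gives a less optimistic estimate of accuracy than a single 70/30 split.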
The outcome of our research suggests that motor skills and visuospatial attention are important attributes of
expertise in real-time strategy games. Further research could focus on one of these two topics to examine how
exactly they influence expert skill in our domain.
We cannot directly implement the results of our test into an AI, because the main features that describe
expertise measure the number of commands that were issued; they do not tell us anything
about the quality of those actions. One way to gather this information is to include more information on
players’ strategies and tactics. One example is including the build order of a player. This is the order in which
buildings and units are built during the first few minutes of the game. Further research could incorporate more
functional and applicable behavior into the model. Our results can be regarded as an indication as to how an AI
might use our model. Further research is needed to examine to what extent it is possible to automatically adapt
an AI to a player's skill during the game.
Literature
Blizzard Entertainment. (2010). Starcraft® II: Wings of Liberty™ one-month sales break 3 million mark [press
release]. Retrieved from http://us.blizzard.com/en-us/company/press/pressreleases.html?id=2847879
Blizzard Entertainment. (2011). Master League. Retrieved from http://sea.battle.net/sc2/en/blog/116151#blog
Blizzard Entertainment. (2011). Region Linking. Retrieved from http://us.battle.net/sc2/en/blog/3085760
Blizzard Entertainment. (2012). Climbing the Ladder: How to earn a League Promotion. Retrieved from
http://us.battle.net/sc2/en/blog/4798491/Climbing_the_Ladder_How_to_earn_a_League_Promotion-
4_5_2012#blog
Campitelli, G., & Gobet, F. (2004). Adaptive expert decision making: Skilled chess players search more and
deeper, Journal of the International Computer Games Association 27(4), 209-216.
Campitelli, G., Gobet, F., Williams, G., & Parker, A. (2007). Integration of perceptual input and visual imagery in
chess players: Evidence from eye movements. Swiss Journal of Psychology, 66, 201–213.
Carmel, D., & Markovitch, S. (1993). Learning models of opponent's strategy in game playing. Proceedings of
the AAAI fall symposium on games: planning and learning, 140-147.
Chaddock, L., Neider, M., Voss, M., Gaspar, J., & Kramer, A. (2011). Do athletes excel at everyday tasks?
Medicine & Science in Sports & Exercise, 43(10), 1920-1926. doi: 10.1249/MSS.0b013e318218ca74
Chase, W. G., & Simon, H. A. (1973). Perception in chess. Cognitive Psychology, 4(1), 55-81.
Donkers, H., Uiterwijk, J., & Van den Herik, H., (2001). Probabilistic opponent-model search. Information
Sciences, 135(3–4), 123–149.
Drachen, A., Canossa, A., & Yannakakis, G. N. (2009). Player Modeling using Self-Organization in Tomb Raider:
Underworld. Proceedings of the IEEE Symposium on Computational Intelligence and Games (CIG2009),
Milano, Italy. Retrieved from http://www.itu.dk/~yannakakis/CIG09_IOI.pdf
Ericsson, K., & Tesch-Römer, C. (1993). The role of deliberate practice in the acquisition of expert performance.
Psychological Review 100(3), 363-406.
Field, A. (2009). Discovering Statistics Using SPSS (Third Edition). Sage.
Gobet, F., & Simon, H. (1996). The Roles of Recognition Processes and Look-Ahead Search in Time-Constrained
Expert Problem Solving: Evidence from Grand-Master-Level Chess. Psychological Science, 7(1) 52-55.
Gobet, F., & Simon, H. (1998). Expert Chess Memory: Revisiting the Chunking Hypothesis. Memory, 6(3),
225-255. Retrieved from http://dx.doi.org/10.1080/741942359
Houlette, R. (2004). Player Modeling for Adaptive Games. AI Game Programming Wisdom II, Charles River
Media, Inc, 557-566.
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques
(Third Edition). Morgan Kaufmann.
Iida, H., Uiterwijk, J., & Van den Herik, J. (1993). Opponent-Model Search. Technical Reports in Computer Science,
CS 93-03
Kocsis, L., Uiterwijk, J., Postma, E., & Van den Herik, J. (2003). The Neural MoveMap Heuristic in Chess. In
Schaeffer, J.; Müller, M., & Björnsson, Y. (Ed.), Computers and games: Lecture notes in computer science,
154-170. doi: 10.1007/978-3-540-40031-8_11
Land, M., & McLeod, P. (2000). From eye-movements to actions: how batsmen hit the ball. Nature
Neuroscience, 3(12), 1340-1345.
Nethaera. (2010). The Big Picture on Macro and Micro. Retrieved from
http://us.battle.net/sc2/en/blog/748044#blog
Richards, M., & Amir, E. (2007). Opponent modeling in Scrabble. Proceedings of the 20th
International Joint Conference on Artificial Intelligence (IJCAI 2007).
Schadd, F., Bakkes, S., & Spronck, P. (2007). Opponent Modeling in Real-Time Strategy Games, Proceedings of
the 8th International Conference on Intelligent Games and Simulation (GAMEON'2007), 61-68.
Schriver, A., Morrow, D., Wickens, C., & Talleur, D. (2008). Expertise Differences in Attentional Strategies Related
to Pilot Decision Making. Human Factors: The Journal of the Human Factors and Ergonomics Society, 50,
864. doi: 10.1518/001872008X374974
Van den Herik, J., Donkers, J., & Spronck, P. (2005). Opponent modelling and commercial games. Proceedings of
the IEEE 2005 Symposium on Computational Intelligence and Games, 15-25.
Van der Werf, E., Uiterwijk, J., Postma, E., & Van den Herik, J. (2002). Local move prediction in Go. In Schaeffer,
Computers and games: Lecture notes in computer science, 393-412.
Park, S.-H., & Fürnkranz, J. (2007). Efficient pairwise classification. In Kok, J., Koronacki, J., Lopez de Mantaras,
R., Matwin, S., Mladenič, D., & A. Skowron (Eds.), Proceedings of 18th European conference on machine
learning (ECML-07), 658–665.
Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. Technical
Report MSR-TR-98-14, Microsoft Research.
Reingold, E., Charness, N., Pomplun, M., & Stampe, D. (2001). Visual Span in Expert Chess Players: Evidence
From Eye Movements. Psychological Science 2001, 12(48) doi: 10.1111/1467-9280.00309
Wagner, M. (2006) On the scientific relevance of eSports. Proceedings of the 2006 International Conference on
Internet Computing and Conference on Computer Game Development, 437-440
Yannakakis, G. & Hallam, J. (2006). "Towards Capturing and Enhancing Entertainment in Computer Games".
Proceedings of the 4th Hellenic Conference on Artificial Intelligence, Lecture Notes in Artificial Intelligence,
432–442.
Appendix A
To decide whether an action is effective, we adjusted the preconditions used by the tool Sc2Gears. This
resulted in the following set of rules:
Our rules:
• If one of the commands train, research, upgrade or hatch is canceled within 1 second, both the issued
action and the cancel action are considered ineffective.
• If the same command is repeated more than twice within one second, the first command is considered
ineffective. The commands that this rule applies to are: right click, stop, hold position, move, patrol,
scan, attack, set rally point, set worker rally point, hold fire, halt, attack, stop, land, activate or
deactivate auto-repair, neural parasite, activate or deactivate auto-heal, charge, attack structure,
activate or deactivate auto-attack structure.
• Switching away from a selected unit or reselecting the same units too fast (within 0.25 sec) without
giving them any commands is considered ineffective (by 'too fast' we mean there is not even time to
check the state of the units and optionally react to it accordingly); double tapping a hotkey to center a
group of units is NOT considered ineffective.
• If one of the following commands is repeated consecutively, regardless of the time frame, it is
considered ineffective:
the same research, the same upgrade; Gather Resources, Return Cargo, Cloak, Decloak, Siege Mode,
Tank Mode, Assault Mode, Fighter Mode, Burrow, Unburrow, Phasing Mode, Transport Mode, Generate
Creep, Stop Generating Creep, Weapons Free, Cancel an Addon, Mutate into Lair, Cancel Lair Upgrade,
Mutate into Hive, Cancel Hive Upgrade, Mutate into Greater Spire, Cancel Greater Spire Upgrade,
Upgrade to Planetary Fortress, Cancel Planetary Fortress Upgrade, Upgrade to Orbital Command,
Cancel Orbital Command Upgrade, Salvage, Lift Off, Uproot, Root, Lower, Raise, Archon Warp (High
Templar or Dark Templar)
• We make blocks of 1 second to make our decisions. If, for example, one right click was issued in second
1 and 2 right clicks were issued in second 2, they are all counted as effective, because our time count is
not specific enough to say whether they were issued within 0.5 seconds of each other. If 3 right
clicks are issued within the same second, they are considered ineffective, because their average spacing
of 1/3 ≈ 0.33 seconds means the repetition was too fast.
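One possible reading of the 1-second-block rule can be sketched as follows. The data layout is an assumption, and this version flags all repeats in an over-full block, following the three-right-clicks example above rather than the "first command only" phrasing:

```python
# Illustrative sketch of the 1-second-block repetition rule: the same
# command issued more than twice within the same whole second is spaced
# at most ~0.33 s apart, below the 0.5 s threshold, so it is flagged as
# ineffective. Input layout is an assumption, not the Sc2Gears format.
from collections import defaultdict

def flag_spam(commands):
    """commands: list of (timestamp_seconds, command_name) tuples.
    Returns a parallel list of booleans, True meaning 'considered
    ineffective' under the repetition rule."""
    per_block = defaultdict(list)            # (second, command) -> indices
    for i, (t, name) in enumerate(commands):
        per_block[(int(t), name)].append(i)
    flags = [False] * len(commands)
    for indices in per_block.values():
        if len(indices) > 2:                 # more than twice in one second
            for i in indices:
                flags[i] = True
    return flags
```

Grouping by whole-second blocks mirrors the coarse time resolution described above: two clicks straddling a block boundary are never flagged, because the data cannot show whether they fell within 0.5 seconds of each other.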
We use the following rules to determine whether an action is considered macro or micro:
• Any action that costs minerals and/or gas is considered to be a macro action. This includes training
units, constructing buildings, upgrades, researches, and salvaging buildings. Canceling macro actions
and merging archons are also considered macro actions.
• Any other action is considered to be a micro action. Repair actions are considered micro actions
because they involve the micro-management of worker units, despite the fact that repairing costs money.
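A minimal sketch of the macro/micro rule, assuming the action metadata (resource cost, whether it cancels a macro action, whether it is an archon merge) is available from the replay parser; the field names are illustrative:

```python
# Illustrative sketch of the macro/micro classification rule above.
# Field names and the exception sets are assumptions for the example.

MICRO_EXCEPTIONS = {"repair"}  # micro despite costing resources

def classify_action(name, costs_resources,
                    is_macro_cancel=False, is_archon_merge=False):
    """Return 'macro' or 'micro' for one action: anything that costs
    minerals/gas is macro (plus cancels of macro actions and archon
    merges); everything else, including repairs, is micro."""
    if name in MICRO_EXCEPTIONS:
        return "micro"
    if costs_resources or is_macro_cancel or is_archon_merge:
        return "macro"
    return "micro"
```

Checking the micro exceptions first is what encodes the special case: repair costs resources but is still micro-management of worker units.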
Appendix B
The list of features as ranked by InfoGain.
weight ranking attribute
0.569 1 average micro
0.511 2 average apm
0.509 3 total eapm
0.497 4 apm micro
0.493 5 eapm
0.467 6 delta select
0.446 7 apm
0.415 8 total hotkeys select
0.256 9 selections per key set
0.233 10 total hotkeys set
0.212 11 different keys
0.163 12 delta set
0.091 13 redundancy
0.082 14 server
0.081 15 average macro
0.072 16 apm micro
0.043 17 total hotkeys add
0.042 18 delta minerals
0.037 19 delta workers
0.034 20 total workers built
0.034 21 delta resources
0.031 22 delta supply used
0.028 23 defensive structures
0.028 24 workers per geyser
0.028 25 geysers total
0.027 26 instance
0.023 27 minerals per worker
0.022 28 delta add
0.018 29 total bases built
0.015 30 total upgrades
0.015 31 total supply gained
0.015 32 research
0.014 33 opponent race
0.014 34 total resources
0.014 35 total minerals spent
0.013 36 total gas spent
0.013 37 player race
0.013 38 total supply used
0.013 39 delta gas
0.012 40 total number of fighting units
0.010 41 tier
0.010 42 different units
0.001 43 winner
league total instances Distribution of our data set Actual distribution
bronze 4082 8.2% 20%
silver 3979 8.0% 20%
gold 5195 10.4% 20%
platinum 8066 16.2% 20%
diamond 10088 20.2% 18%
master 12747 25.5% 2%
grandmaster 5751 11.5% Top 200 players
total 49908 100% 100%