Modeling player skill in Starcraft II
Tetske Avontuur
ANR: 282263
HAIT Master Thesis series nr. 12-004
THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF ARTS IN COMMUNICATION AND INFORMATION SCIENCES,
MASTER TRACK HUMAN ASPECTS OF INFORMATION TECHNOLOGY,
AT THE SCHOOL OF HUMANITIES
OF TILBURG UNIVERSITY
Thesis committee:
Dr. Ir. P.H.M. Spronck
Dr. M.M. van Zaanen
Prof. dr. E.O. Postma
Tilburg University
School of Humanities
Department of Communication and Information Sciences
Tilburg center for Cognition and Communication (TiCC)
Tilburg, The Netherlands
June, 2012
Abstract

Starcraft II is a popular real-time strategy (RTS) game in which many players compete with each other. Based on their performance, the players are ranked in one of seven leagues. In this thesis, we aim to predict which league a player competes in, based on observations of his in-game behavior. We do this by building a player model using a classification algorithm.
We gathered 1297 game replays uploaded by players from 3 different websites. We ensured that they all had
the same patch version and that league information about the players was up-to-date. To describe the data, we
used features that measure skill. We picked these features based on cognitive research and our knowledge of
the game. We then performed a pretest using 4 different algorithms to see which one could be suitable to solve
our problem. We chose SMO as our classifier. The weighted accuracy was 47.3% (SD = 2.19), meaning that we
were able to correctly classify 47.3% of the instances. This is significantly better than the weighted baseline of
25.5% (SD = 1.05). We used InfoGain to examine what features are most important in solving our problem. We
then tested from what moment in the game it is possible to predict players’ skill. The results showed that
performance does not change significantly over time. This indicates that, at least from 2.5 minutes into the game,
time does not significantly impact the performance of the classifier. We conclude that it is possible to predict
players’ skill by building a player model using a classification algorithm. We can do this fairly early in the game.
Preface

During the summer of 2011, I had a lot of free time. Since this summer was just like a typical autumn, I had lots
of opportunity to play computer games. I spent a lot of time on one game in particular: Starcraft II. Not only did
I play this game extensively, I also watched the rise of Starcraft II shoutcasters. My personal favorites are Husky
and Day[9] because of their extensive knowledge of the game, in addition to their great enthusiasm. This way, I
learned a lot about the game and I realized that I could do more than just play the game; I could use it as a tool
for research.
Performing the research for this thesis was a lot of work. It took a lot of time and effort to gather and organize
data, and I had to redo a lot of my experiments. Despite the sometimes frustrating moments, I look back on my
thesis with great pleasure. I would like to thank my supervisor Pieter Spronck for his guidance during the entire
process. However, I would like to thank him most of all for his enthusiasm about games and AI. I would also like
to thank my second reader Menno van Zaanen for his help during the last days of this thesis.
I would also like to thank my parents for supporting me all these years, my boyfriend for putting up with me
during the final weeks of this thesis, and my friends for making my life in Tilburg a memorable experience.
Tetske
June, 2012
Contents

1. Introduction 6
1.1 Problem statement and research questions ......................................................................................... 6
1.2 Outline ................................................................................................................................................ 7
2. Background 8
2.1 Starcraft II ........................................................................................................................................... 8
2.1.1 Rating system ................................................................................................................................. 8
2.1.2 Reasons for choosing Starcraft II ................................................................................................... 10
2.2 Expertise ........................................................................................................................................... 10
2.2.1 Expertise in Chess ........................................................................................................................ 10
2.2.2 Expertise in action video games ................................................................................................... 11
2.2.3 Expertise in other domains ........................................................................................................... 11
2.3 Player modeling ................................................................................................................................ 12
2.3.1 Player modeling in classic games .................................................................................................. 12
2.3.2 Player modeling in computer games ............................................................................................. 12
2.4 Sequential Minimal Optimization Algorithm .................................................................... 13
3. Experimental setup 15
3.1 Data set ............................................................................................................................................ 15
3.1.1 Data Selection .............................................................................................................................. 16
3.1.2 Data extraction ............................................................................................................................ 17
3.1.3 Feature set ................................................................................................................................... 18
3.2 Measurement ................................................................................................................................... 22
3.3 Pretest .............................................................................................................................................. 23
3.4 Player model ..................................................................................................................................... 25
3.4.1 Performance of SMO .................................................................................................................... 25
3.4.2 InfoGain ....................................................................................................................................... 25
3.4.3 Time............................................................. 26
3.4.4 Server .......................................................... 26
4. Results 27
4.1 Player model ..................................................................................................................................... 27
4.1.1 Error Distribution .............................................................................................................................. 28
4.2 InfoGain .......................................................................................................................... 30
4.3 Time ................................................................................................................................................. 31
4.4 Server ............................................................................................................................................... 32
5. Discussion 33
5.1 Player model ..................................................................................................................................... 33
5.2 InfoGain ............................................................................................................................................ 33
5.3 Time ................................................................................................................................................. 34
5.4 Server ............................................................................................................................................... 34
5.5 Implementation in AI ........................................................................................................................ 34
6. Conclusions 36
6.1 Selection of the dataset .................................................................................................................... 36
6.2 Finding a classification method ......................................................................................................... 36
6.3 Predicting skill ................................................................................................................................... 37
6.4 The influence of game progression .................................................................................................... 37
6.5 Artificial Intelligence ......................................................................................................................... 37
6.6 Answering the problem statement .................................................................................................... 38
6.7 Limitations and recommendations for future work ............................................................................ 38
Literature 40
Appendix A 42
Appendix B 43
1. Introduction

In regular sports, people not only want to practice their sport, they also want to compete with others and
see who the best sportsperson is. The same holds for competitive computer gaming, also called eSports, where
players of a game participate in a competition to see who is the best. In Western culture, eSports are still an emerging field. One important reason is that competitive gaming there is mainly associated with first-person shooters (Wagner, 2006). In the last few years, this view has shifted with the release of the popular, highly
competitive real-time strategy game Starcraft II, selling 3 million copies in the first month of its release (Blizzard
Entertainment, 2010).
Besides the standard online laddering competition embedded in the game, various Starcraft II tournaments and
competitions are held, both offline and online. This motivates gamers to train their skills and increase their
performance. The standard ladder of the game divides gamers into 7 distinct leagues, roughly representing the
skill level of players. With the large number of Starcraft II players and the classification of players according to
their performance, we now have the opportunity to study skill differences on a large scale in a real-time
strategy game.
Skill difference between novices and experts has been researched in many domains. Studies in chess suggest
that because of their extensive knowledge, experts are better at recognizing patterns, they make decisions from
these patterns more quickly, and they can make general assumptions based on chunks of information (Gobet
& Simon, 1998). Research on pilots and athletes shows that experts also see related cues faster, make fewer
errors, and pay less attention to unrelated cues than novices (Schriver, Morrow, Wickens & Talleur, 2008;
Chaddock, Neider, Voss, Gaspar & Kramer, 2011).
To examine how these differences affect gameplay in fast paced computer games, we can create a model of
players according to their skill. We call this a player model: an abstracted description of the behavior of players
in a game. Next to preference, strategies and weak spots, player modeling can be used to describe the skill of a
player (Van den Herik, Donkers & Spronck, 2005). The current research focuses on building a player model that
can predict the skill of a player.
1.1 Problem statement and research questions

Our problem statement is as follows:
To what extent can we use a player model to accurately distinguish a novice player from an expert player in a
real-time strategy game?
In this research, we use the commercial real-time strategy game Starcraft II as a tool to solve this problem. The
first step is to find a selection of replay files that is suitable for building a player model. We need to establish
when a replay is appropriate to use and what data we want to extract from it. This leads us to our first research
question:
1. For Starcraft II, what selection of player data is appropriate as a data set to build a player model upon?
We then need to find one or more classification methods that are suitable to model players in the game
Starcraft II. These methods should be able to deal with sets of instances that are related to each other in that
they describe the behavior of a player in one game. We also want to know what features are most informative
to the classification algorithms, so at least one of the methods should give us an output that provides us with
this information. This gives us our second research question:
2. For Starcraft II, what is a suitable classification method to build a player model that predicts the
player’s skill level according to the player’s behavior?
If we have found such a method, we can test it to see to what extent its results are accurate. We need to pick
one or more measures that are appropriately informative about the performance of the classification method.
This leads to our third research question:
3. How accurately can a player model predict a player’s skill?
If we have an indication of the effectiveness of the player model, we can examine how the performance of the
algorithm changes over time. This gives us the fourth research question:
4. How does the performance of our player model change as the game progresses?
One other point of interest is to examine if we can implement the outcome of our research into an AI.
Therefore, our fifth research question is:
5. To what extent can we use the player model to build an AI?
1.2 Outline

We begin our research in Chapter 2 by discussing the real-time strategy game Starcraft II and its system for
rating players’ skill. We then describe the main features of expertise to see what distinguishes a novice from an
expert. We follow by elaborating on player modeling. We conclude Chapter 2 with an explanation of the SMO
learning algorithm. In Chapter 3, we describe our experiment. We first describe how our data is gathered and
selected. We then discuss the feature set we selected. We explain how we compared algorithms to examine
which one is most suitable for our problem. We then elaborate on the construction of our player model. In
Chapter 4, we present the results of the tests described in Chapter 3. We discuss these results in Chapter 5.
Finally, we present our conclusions, limitations and recommendations for future work in Chapter 6.
2. Background

In this chapter we elaborate on the four topics relevant to this research. In Section 2.1 we describe our first
topic: the real-time strategy game Starcraft II. In Section 2.2 we define the differences between novices and
experts. In Section 2.3 we elaborate on opponent modeling in classic and computer games. In Section 2.4 we
describe the techniques that we use for the classification of our data.
2.1 Starcraft II

Starcraft II is a real-time strategy game in which players aim to destroy their enemy by building a base
and an army. Players can choose 1 out of 3 races to play with. These races are Terran, Protoss, and Zerg. The
Terran are humans, the Protoss are alien humanoids with highly advanced technology, and the Zerg are a
collection of assimilated creatures who use biological adaptation instead of technology.
For anything the player builds, he needs to gather 2 types of resources: minerals and gas. These resources are
used to construct buildings which in turn can be used to produce units. At the start of the game, not all units
and buildings are available. New construction options can be unlocked by making certain buildings. This means
that some units and buildings are available at the start of the game while others become available later in the
game. This is also called a tier: the point in time at which certain units and buildings become available.
In order to play the game well, the player must engage in macro- and micro-management. Macro-management determines the economic strength of a player, which depends on the construction of buildings, the gathering of resources, and the composition of units. Micro-management determines how well a player is able to locally
control small groups and individual units. This includes movements and attacks that are issued by the player
(Blizzard Entertainment, 2010). The success of the macro and micro-management of a player heavily depends
on the strategy a player chooses to follow. For example, if a player chooses to rush his opponent by making
fighting units very early in the game, his economy will suffer. On the other hand, if a player chooses to focus on
having a strong economy before building an adequately sized army, he runs the risk of being overrun by his
opponent.
Players can play against others on Blizzard’s multiplayer system Battle.net. There are four regions, each with
their own ladder: one for Europe and Russia, one for North and Latin America, one for Korea and Taiwan, and one for South-East Asia (Blizzard Entertainment, 2011). Players can play on the ladder or set up a custom game. Games that
are played on the ladder are ranked. The system automatically matches players of about the same skill to play
each other. Custom games are unranked. Here, players can choose their own opponents.
2.1.1 Rating system
The main rating system in Starcraft II is the division into leagues. Each ladder is divided into 7 leagues: bronze,
silver, gold, platinum, diamond, master, and grandmaster. The bronze league is the lowest league and contains
people who are still learning to play the game. This league contains 20% of the population that is active on the
ladder. When players improve their skill, they get promoted into the silver, gold, and platinum leagues. Each of
these leagues also contains 20% of the population. The diamond, master, and grandmaster leagues in total
contain the final 20% of the population. The diamond league contains 18%, the master league contains the top 2% of all players, and the grandmaster league consists of the top 200 players of the game
(Blizzard Entertainment, 2011). Each server has its own ranking. As there are 4 regions, there are a total of 800
grandmasters in the world.
When a player first starts playing on the ladder, he has to play 5 placement matches. After these matches, the
system evaluates how many were won and lost. Based on those numbers, the matchmaking system places the
player in a league. From that moment on, players can climb or fall into another league by winning or losing
matches. If a player wins about 50% of his matches, he is in the right league. If he wins more than 50%, the
system will eventually match him with players from higher leagues. If the player keeps winning, he will be
promoted into the next league.
There are two systems that determine a player’s league. One is the ladder point system. Players earn or lose ladder points by winning or losing matches. If their score passes a certain threshold, players are placed in
another league. Their ladder points are then reset. The amount of points that is won or lost is determined by
the skill difference with the opponent. More weight is assigned to a win against a stronger opponent and less
weight is assigned to a win against a weaker opponent. Players also have a bonus pool that is filled over time. If
a player wins, and there are sufficient points in his bonus pool, the points he wins are doubled by taking points
out of the bonus pool. This helps players that have not played for a while climb up the ladder and prevents
them from having to play weaker opponents. A table with an indication of the amount of points that is required
to move up a league can be found in Table 2.1 (Blizzard Entertainment, 2012).
League transition Indication of required points
Bronze to Silver 1200
Silver to Gold 800
Gold to Platinum 800
Platinum to Diamond 800
Diamond to Master 900
Master to near Grandmaster 1400
Table 2.1 Points required to climb leagues
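The point mechanics described above can be sketched in code. The function below is a simplified, hypothetical model of a single ladder-point gain, including the bonus-pool doubling; Blizzard has not published its actual formulas, so the logic and numbers are illustrative only.

```python
def ladder_update(points, bonus_pool, base_gain):
    """Apply one win's worth of ladder points.

    If the bonus pool holds enough points, the gain is doubled by
    draining the pool, as described for Starcraft II's ladder.
    Simplified, illustrative model -- not Blizzard's real formula.
    """
    if bonus_pool >= base_gain:
        points += 2 * base_gain           # base gain plus matching bonus
        bonus_pool -= base_gain
    else:
        points += base_gain + bonus_pool  # drain whatever bonus is left
        bonus_pool = 0
    return points, bonus_pool

# A player with 12 bonus points winning a match worth 10 base points
# gains 20 points and keeps 2 bonus points:
print(ladder_update(100, 12, 10))  # -> (120, 2)
```

This captures why infrequent players climb quickly on their return: their accumulated bonus pool doubles several consecutive wins.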
Next to the ladder system, there is the hidden rating system, also called hidden matchmaking rating (MMR).
The points in this system are never reset. The game matches players based on their hidden rating. If a player
wins, his hidden rating increases, and if he loses, it decreases. If the player wins against an opponent in a higher league, more weight is assigned to this win and his hidden rating increases more than if the opponent was equally skilled. If a player wins consistently, he will eventually be promoted. However, if a player beats weaker players in a higher league but loses against average players in that league, his rating will not increase
consistently and he will not be promoted. All leagues are based on a range of hidden rating points. If a player’s
points fall within this range, and he consistently wins, he is placed in the corresponding league.
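The hidden rating thus uses a weighted update: a win against a stronger opponent moves the rating more than a win against an equal one. Blizzard's MMR formula is not public, so the sketch below uses a standard Elo-style update purely to illustrate that weighting; the k-factor and rating scale are assumptions.

```python
def elo_update(rating, opponent, won, k=32):
    """One Elo-style rating update (illustrative; not Blizzard's formula).

    `expected` is the predicted win probability. Beating a much stronger
    opponent (low expected score) yields a larger rating gain.
    """
    expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
    return rating + k * ((1 if won else 0) - expected)

# Beating a stronger opponent gains more than beating an equal one:
equal    = elo_update(2000, 2000, won=True)  # expected 0.5  -> +16
stronger = elo_update(2000, 2400, won=True)  # expected ~0.09 -> ~+29
print(equal, stronger)
```

The same asymmetry explains the promotion behavior in the text: consistent wins against average players of a higher league raise the rating steadily, while only beating its weakest members does not.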
If we assume the Starcraft II population is distributed normally over the leagues, we can get an indication of the
MMR range for every league. Because the bronze league encompasses the bottom 20% of all players, the MMR
range is larger for this league than for the silver league, which contains the following 20% of all players. This
also means that the skill difference between the silver, gold, and platinum leagues is smaller than the skill
difference for the other leagues. The entire graph is shown in Figure 2.1.
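Under this normality assumption, the league cut-offs follow directly from the cumulative shares named above (20% per league up to platinum, then 18% for diamond). The mean and standard deviation below are invented for illustration, since the real MMR scale is hidden.

```python
from statistics import NormalDist

# Hypothetical MMR scale: mean and spread are invented for illustration.
mmr = NormalDist(mu=2500, sigma=800)

# Cumulative share of the ladder population at each league's upper bound.
upper_bounds = {
    "bronze": 0.20, "silver": 0.40, "gold": 0.60,
    "platinum": 0.80, "diamond": 0.98,   # 80% + diamond's 18%
}

for league, share in upper_bounds.items():
    print(f"{league:>8}: up to {mmr.inv_cdf(share):.0f} MMR")
```

Because the normal distribution is flat in its tails and steep around the mean, the resulting silver, gold, and platinum bands are the narrowest, matching the skill-difference observation in the text.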
Figure 2.1 Distribution of skill range. Grandmaster league is not shown as it consists of the top 200 players in the master league.1
Every league is split into a set of divisions. Each division contains 100 randomly assigned players with their
ranking displayed. If a player wins games, his rating points increase and he rises one or more positions within
his division. This way, players have a notion of progress and accomplishment which encourages them to
continue playing. In this research, divisions and division points were not taken into account because a player’s
ranking in a division depends on his frequency of play and when he was placed in that division. It also depends
on the time of measurement as the division points are reset at the start of every season.
2.1.2 Reasons for choosing Starcraft II
We have 3 reasons for choosing Starcraft II as a tool in this research. First, there is a great degree of skill
involved in playing this game. Second, players are divided into leagues according to their skill level based on
their wins and losses in the game. This gives us a reliable and objective grouping of players who are roughly
evenly skilled. This helps us to determine the performance of our opponent models. Third, a large number of replays is available on several websites, which allows us to gather data.
2.2 Expertise

According to the theory of deliberate practice (Ericsson, Krampe & Tesch-Römer, 1993), expertise is a question of 10 years or 10,000 hours of deliberate practice, rather than talent. In this time, experts gain extensive
knowledge about a specific domain. This knowledge gives experts the ability to recognize patterns. This also
helps them to make general assumptions based on chunks of information (Gobet & Simon, 1998). This could
also apply to experts in real-time strategy games where players have to make quick decisions but have little
information about the opponent.
2.2.1 Expertise in Chess
One of the first studies on expertise in Chess was conducted by De Groot (1946/1965) who studied the thought
process of chess players by asking them to think out loud about making their next move in a chess game. He found that there is no difference in depth of search between Grandmasters and to-be masters, although the quality of the produced moves was better for Grandmasters. He also found that Grandmasters were better at remembering existing, non-random board configurations. This was later confirmed by Campitelli, Gobet, Williams, and Parker (2007), and by Reingold, Charness, Pomplun, and Stampe (2001). Both research groups performed an eye-gazing study. The latter found that expert chess players fixate more on the area between pieces than on the individual pieces themselves. With this technique, experts are better at remembering existing board configurations than novices and intermediates.

1 [Untitled graph of population density and MMR]. Retrieved May 2012, from: http://us.battle.net/sc2/en/forum/topic/2112234276
Chase and Simon (1973) examined expert memory by asking expert and novice players to remember a number
of random and existing board configurations. They found that expert players were better at remembering
existing configurations while they performed equally well to the novices at random compositions. Experts were
also better at reproducing the position of a chess piece after viewing it for five seconds. This research led to
the chunking theory: the idea that experts make decisions based on a large number of chunks of information
that are stored in their long term memory. These chunks help experts to recognize patterns quicker and
accordingly make quicker decisions. Gobet and Simon (1998) extended this theory into the template theory
where a set of chunks together forms a larger and more complex structure in memory. This allows
Grandmasters to memorize relevant information faster, recognize board positions quicker and consequently
make faster decisions. Gobet and Simon (1996) also studied Grandmasters playing 6 opponents at the same
time. Rather than engaging in deep search to look ahead, Grandmasters reduce search space by using
recognition patterns, based on their extensive knowledge of the game and their opponents. This has also been
confirmed in research by Campitelli and Gobet (2004).
2.2.2 Expertise in action video games
One important difference between chess and modern video games is the pace of the game. In modern action or
strategy games, several events follow each other quickly and may occur simultaneously. Players are required to
react to these events in a timely and accurate manner. Green and Bavelier (2005, 2006) examined the difference in the allocation of attention between gamers and non-gamers. They conducted a functional field of view task to measure how well a person can locate a central target whilst being distracted by a number of visible elements and one other central task. Gamers with experience in these kinds of games had enhanced attentional
resources: they were better able to pay attention to several moving stimuli at the same time. Gamers also had
better visuospatial attention: they were better able to localize one or two central tasks among a number of
distractions (Green and Bavelier, 2006). When non-gamers were asked to play an action game for ten hours,
they showed significant improvement in attentional resources and visuospatial attention. In other words,
experience with a game improves players’ ability to multi-task. Dye, Green and Bavelier (2009) examined the allocation of attention to a number of
alerting cues. They found that action gamers responded quicker to these events and attended to them more
accurately than non-gamers.
2.2.3 Expertise in other domains
Research on pilots has shown that experts make decisions that are based on related cues faster and with fewer
errors than novices. Experienced pilots are also more likely to direct their attention to relevant cues rather than
irrelevant cues (Schriver, Morrow, Wickens & Talleur, 2008). Other research has shown that highly ranked
athletes process information faster than regular people when crossing a street (Chaddock, Neider, Voss, Gaspar
& Kramer, 2011). An eye gaze study in cricket has shown that more experienced batsmen need less time to
predict where the ball will hit the ground (Land & McLeod, 2000).
2.3 Player modeling

According to Bakkes, Spronck and Van Lankveld (2012), a player model is an abstracted description of a player’s
behavior in a game environment. Hence, an opponent model gives us knowledge about the characteristics of
the opponent. This can be preferences, strategies, strengths, weaknesses or skill (Van den Herik et al., 2005).
We can use this knowledge to create an AI that adapts itself to its opponent. In Subsection 2.3.1 we describe opponent modeling in classic games. In Subsection 2.3.2 we discuss opponent modeling in computer games
and the goal of opponent modeling in this thesis.
2.3.1 Player modeling in classic games
In research on classic games, the main goal of player modeling is to create strong AI. Carmel and Markovitch
(1993) and Iida, Uiterwijk and Van den Herik (1993) simultaneously studied opponent modeling to create AI
that searches for the best solution, given a certain game state. Donkers, Uiterwijk and Van den Herik (2001) used
a game tree to find the best solution. Richards and Amir (2007) examined probabilistic opponent modeling in
Scrabble. They took observations of previously played games and predicted what tiles the opponent would hold. Their method performed significantly better at playing Scrabble than Quackle, an open-source Scrabble AI that is
based on the architecture of the Scrabble AI program Maven.
Creating strong AI by means of player modeling gives us useful research on modeling expert opponents in
games. Van der Werf, Uiterwijk, Postma and Van den Herik (2002) focused on teaching a machine to predict moves in the game Go by observing human expert play. The system constructed in the experiment performed well on the prediction of regular moves but worse on the prediction of unusual moves. Kocsis, Uiterwijk,
Postma and Van den Herik (2003) used a neural network to predict which move in chess is best given a certain
position. The moves that are most likely to be the best should be considered first. They found that it is possible
to predict the best moves in mid-game from patterns in the training data.
2.3.2 Player modeling in computer games
Research on player modeling focuses on increasing the effectiveness of artificial agents. In computer games,
another point of interest is raising the entertainment value (Van den Herik et al., 2005). A player model can
have 2 different roles: it can either inform the player during the game or it can serve as an artificial opponent
(Van den Herik et al., 2005). If the AI has an opponent role, it is important that there is a balance: the AI should
not be too hard or too easy to beat. A strong imbalance leads to the player losing his interest and thus lowers
the entertainment value of the game (Houlette, 2004).
Some research on player modeling in computer games has been conducted. Schadd, Bakkes and Spronck (2007)
examined player modeling in the real-time strategy game Spring. They were able to successfully classify the
strategy of a player using hierarchical opponent models. Drachen, Canossa and Yannakakis (2009) collected data
from 1365 Tomb Raider: Underworld players. They used a self-organizing map as an unsupervised learning
technique to categorize players into 4 types. They showed that it is possible to cluster player data based on
patterns in game play.
The goal of player modeling in this thesis is to automatically classify human players according to their skill. We
can use this knowledge to create AI that mimics the skill of a human player. This makes the behavior of the
artificial opponent more realistic and thus more interesting to human players (Van den Herik et al., 2005;
Schadd et al., 2007; Houlette, 2004).
2.4 Sequential Minimal Optimization Algorithm
In order to see which algorithm is most suited for our classification task, we tested 4 algorithms on a subset of
the data. We compared the k-nearest neighbor algorithm IBk, the support vector machine Sequential Minimal
Optimization (SMO), the decision tree J48, and the ensemble learner RandomForest. IBk is a nearest neighbor
learner which classifies instances based on their similarity to previously seen instances. J48 builds a tree that
splits the data at each node on a single attribute, choosing the attribute that separates the classes best.
RandomForest is an ensemble learner that grows a forest of random decision trees. At each node, a
predetermined number of random attributes is considered and the most informative one is picked. After all
trees are grown, a majority vote determines the class of an instance. The SMO classifier
is described below. We picked SMO as our algorithm to build the player model, based on the pretest described
in Subsection 3.3.
The SMO classifier is an optimized support vector machine (SVM). A standard SVM uses linear functions to
classify data in a feature space that is not linearly separable. A kernel is used to transform this space into a
vector space in which the data are linearly separable, so that the classes can be separated with a straight line. If
we then transform the vector space back to the feature space, the separating line no longer appears straight.
This is best described with an illustration. In Figure 2.1, we see 12 instances in the original feature space. In
Figure 2.2, we have used a kernel to transform the space into a higher-dimensional vector space in which the
data are linearly separable. In Figure 2.3, we have transformed the vector space back to the feature space,
showing that the separating line no longer looks straight.
Figure 2.1 Feature Space Figure 2.2 Vector Space
Figure 2.3 Transformation back into Feature Space
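The kernel idea sketched in Figures 2.1-2.3 can be illustrated with a toy example of our own (not part of the thesis experiments): XOR-style points that no straight line separates in 2-D become linearly separable after a hand-picked feature map appends the product of the coordinates.

```python
# Toy illustration of the kernel idea: XOR-style data that no straight line
# separates in 2-D becomes linearly separable after adding a product feature.

def feature_map(x, y):
    """Map a 2-D point into 3-D by appending the product x*y."""
    return (x, y, x * y)

# Four points, labelled by the XOR of the coordinate signs:
# not linearly separable in the original 2-D space.
points = [((-1, -1), 0), ((1, 1), 0), ((-1, 1), 1), ((1, -1), 1)]

# In the mapped space, the hyperplane z = 0 separates the classes:
# class 0 has x*y = 1, class 1 has x*y = -1.
def classify(x, y):
    _, _, z = feature_map(x, y)
    return 0 if z > 0 else 1

assert all(classify(*p) == label for p, label in points)
```

A real SVM learns the separating hyperplane from data instead of having it hand-picked, but the transform-separate-transform-back structure is the same.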
Instances in vector space are called vectors. The line that separates the classes is called a hyperplane. The SVM
looks for the hyperplane with the greatest distance to both classes: the maximum margin. This margin is
determined by the vectors that lie closest to the boundary with the other class; these vectors are called support
vectors. Each of the support vectors pushes the hyperplane as far away as possible, creating the maximum
margin. This is shown in Figure 2.4: the triangles represent the support vectors, the thin lines represent the
maximum margin, and the thick line represents the decision boundary.
Figure 2.4 Support Vectors
SVMs perform binary classification. The standard way to handle multiple classes is one-versus-all classification:
an instance is either a member of a class or it is not. SMO instead uses pair-wise classification for multiclass
problems: it forms a pair for every possible combination of classes and creates a hyperplane for each pair. This
has proven to lead to more accurate results than one-versus-all classification (Park & Fürnkranz, 2007).
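The pair-wise scheme can be sketched in a few lines of Python. The class names and 1-D feature values below are invented, and the per-pair "classifier" is a trivial midpoint threshold rather than an SVM; the point is only the structure: one binary rule per pair of classes, combined by majority vote.

```python
# Sketch of pair-wise (one-versus-one) multiclass classification: one binary
# decision rule per pair of classes, combined by a majority vote.
from itertools import combinations
from collections import Counter

# Hypothetical 1-D training data: class label -> list of feature values.
train = {"bronze": [1, 2], "gold": [5, 6], "master": [9, 10]}

def midpoint_rule(a, b):
    """Binary rule separating classes a and b at the midpoint of their means."""
    mean = lambda xs: sum(xs) / len(xs)
    threshold = (mean(train[a]) + mean(train[b])) / 2
    lo, hi = (a, b) if mean(train[a]) < mean(train[b]) else (b, a)
    return lambda x: lo if x < threshold else hi

# One binary rule per unordered pair of classes, as SMO does for multiclass.
rules = [midpoint_rule(a, b) for a, b in combinations(train, 2)]

def predict(x):
    votes = Counter(rule(x) for rule in rules)
    return votes.most_common(1)[0][0]
```

With three classes this builds three rules; with the seven leagues of this thesis, SMO builds 21 pairwise hyperplanes.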
The dimensionality of the vector space depends on the number of features that is used to describe the data.
Training an SVM requires solving a large quadratic programming problem; with many features and classes, this
becomes a complex optimization that requires a lot of time and computing power. SMO solves this by breaking
the problem down into a series of smaller sub-problems, which can be solved analytically instead of
numerically (Platt, 1998).
In short, SMO is an optimized SVM. It uses pair-wise classification, which leads to more accurate results, and it
breaks the quadratic programming problem into small sub-problems, so that less time and computing power is
needed for classification and large data sets with many attributes and classes can be processed.
3. Experimental setup
We begin this chapter by explaining how we collected and selected our data. We elaborate on the features we
used to describe the data set. This is described in Subsection 3.1. In Subsection 3.2 we describe the method we
used to evaluate the classifiers. In Subsection 3.3, we report how we performed the pretest, what the results
were and what algorithm we picked. In Subsection 3.4, we explain how we built the final player model and how
we examined the importance of attributes by applying InfoGain. We also elaborate on the impact of time and
the server on the performance of the algorithm.
3.1 Data set
For this research, a large database of Starcraft II game replays is needed. We collected this database by
repeatedly checking the websites gamereplays.org, drop.sc, and sc2rep.com for two months. The replays we
collected were manually uploaded by players or by tournament organizers. For each game, we used the data of
both players to diminish any imbalance in the data set. Hence, 50% of the instances contain data about players
who won, and 50% of the data is about players who lost.
We collected a total of 1297 games. This is only a small portion of all games that were played during our
collection phase, which ran from 25 November 2011 up to and including 13 February 2012. We can make a
rough estimate of the number of games that were played during this time. The main screen of Starcraft II
displays in real time how many games are being played worldwide. We measured these numbers on one Friday,
one Sunday and one Monday, each day at 2 different times: around noon and in the evening. Based on the data
we gathered, we assume that an average game lasts 15 minutes and that games are played for about 14 hours a
day. We took the average of about 8750 games running at any moment, multiplied it by 4 to get the number of
games each hour, and multiplied that by 14 hours to arrive at 490.000 games each day. We multiplied this
number by the number of days in our collection phase, which is 104, resulting in an estimated 50.960.000
games. This number includes games played on servers other than the American and European ones, so we
divide it by 2, leaving 25.480.000 games. That number in turn covers 5 other game types besides 1v1 games, so
we take 1/6th of it as an estimate for the 1v1 games. This results in an estimated 4.246.667 1v1 games on
American and European servers. As we only have 1297 games, we have gathered about 0.03% of all played
games.
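The estimation arithmetic above can be reproduced directly; the figures below are the ones quoted in the text.

```python
# Reproducing the rough estimate of the number of 1v1 games played during
# the collection phase, using the figures quoted in the text.
concurrent_games = 8750                  # average games running at any moment
games_per_hour = concurrent_games * 4    # an average game lasts ~15 minutes
games_per_day = games_per_hour * 14      # games are played ~14 hours a day
total_games = games_per_day * 104        # days counted for the collection phase

games_us_eu = total_games / 2            # keep American and European servers
games_1v1 = games_us_eu / 6              # 1v1 is one of six game types

assert games_per_day == 490_000
assert round(games_1v1) == 4_246_667
print(f"collected fraction: {1297 / games_1v1:.4%}")
```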
All replays were compared to each other to filter out duplicates. This left a total of 1297 games from
1590 players in 7 different leagues. All games were played either on the American server (63.3%) or the
European server (36.8%). Out of the entire set of games, 74.6% was matched automatically by the system
(AutoMM), and 25.4% was matched manually (Private or Public). This resulted in a data set consisting of
49,908 instances, each describing one minute in the game. A specified overview can be found in Table 3.1.
league      | instances | percentage | games | players | American server | European server | AutoMM | Private/Public
bronze      | 4082      | 8.2%       | 85    | 139     | 70.6%           | 29.4%           | 94.1%  | 5.9%
silver      | 3979      | 8.0%       | 96    | 147     | 75.0%           | 25.0%           | 99.0%  | 1.0%
gold        | 5195      | 10.4%      | 122   | 195     | 67.2%           | 33.6%           | 92.6%  | 7.4%
platinum    | 8066      | 16.2%      | 196   | 282     | 67.3%           | 32.7%           | 87.8%  | 12.2%
diamond     | 10088     | 20.2%      | 244   | 321     | 64.8%           | 35.2%           | 85.2%  | 14.8%
master      | 12747     | 25.5%      | 383   | 427     | 76.2%           | 23.8%           | 58.0%  | 42.0%
grandmaster | 5751      | 11.5%      | 171   | 79      | 22.2%           | 77.8%           | 5.3%   | 94.7%
total       | 49908     | 100%       | 1297  | 1590    | 63.3%           | 36.8%           | 74.6%  | 25.4%
Table 3.1 Distribution of the data set
Considering that the grandmaster league encompasses the top 200 players of the master league, and that the
master league consists of 2% of all active players, the distribution of our data set is skewed with 25.5% of the
data consisting of master games and 11.5% consisting of grandmaster games. Games from the higher leagues
are more often uploaded than games from lower leagues. Players in the master and grandmaster league often
play in tournaments. The games played in these tournaments are often put on the internet so anyone can
watch them. Players in the 2 highest leagues also have fans for whom they upload replays. Players from lower
leagues upload games less often. This could be due to the fact that they are new to the game and are not
involved in the Starcraft II culture available outside the game.
Another reason could be that lower-ranked players play less frequently than higher-ranked players. The
theory of deliberate practice states that people become more skilled by engaging in deliberate practice
(Ericsson et al., 1993). It could be that higher-ranked players are skilled because they practice more, so that
more games are played, and thus available, in the higher leagues than in the lower leagues. This can only be a
partial explanation, because the differences are relatively large: there are far fewer players in the 2 highest
leagues than in the lower leagues.
3.1.1 Data Selection
Some requirements were set for the selection of the data. Firstly, we were only interested in 2-player games.
These games should be played on the American or European servers. All games should have the same version
of the game and they have to be played between 25 November 2011 and 13 February 2012. Information about
the league of players should be available. Games should be played between players within the same league.
We only used games that were played on American or European servers, due to a significant difference in skill
between Asian and Western players. The overall level of play in Asia is higher than in Europe and America,
which means that the level of play within one league differs between Asian and Western servers. Therefore, we
cannot rely on league as a measure of skill if we consider players from both Asian and Western servers. There is
a difference in skill level between Europeans and Americans as well, but it is far smaller than the difference
between Western and Asian players.
Games played in the bronze league include games that are played in the optional “practice league”. This is a
league in name only, as practice players are officially placed in the bronze league. Practice games are played at a
slower pace and on simplified maps. Players are allowed a maximum of 50 games in this league before
proceeding to standard games.
We only used 2-player games that were either set up through automatic matchmaking by the system (ladder
games) or set up manually (private or public games). In ladder games, two players are matched by the system
according to their league or hidden rating. In private or public games, players choose their own opponent.
Because players are placed in a league by playing ladder games, we can assume that their league is a sufficiently
accurate measure of skill in private or public games as well.
We only used games between players within the same league. When two players are matched by the system, it
takes their hidden rating into account. This means that for example a silver player could be matched with a gold
player. We do not know whether the silver player is about to be promoted or the gold player is about to be
demoted. Therefore, we cannot be as certain about their skill level as we can when they both play in the same
league. This uncertainty is less relevant for a numerical classification, but it can lower the performance for a
nominal classification.
Starcraft II is regularly patched by Blizzard to balance the game and make it more interesting to players. These
patches include changes in upgrades, units and buildings. These changes may affect the way the game is played.
All selected games were played in version 1.4.2.20141 of the game to avoid any differences that could be
attributed to the patch version.
We only used games played in season 4 or 5. Season 4 ran from 25 October 2011 up to and including 19
December 2011, and season 5 ran from 20 December 2011 up to and including 13 February 2012. We limited
our data set to this time because during the collection phase, we only had league information on players for
those two seasons. For most of season 4 and the entire season 5, the game was at version 1.4.2.20141, which is
convenient because it allows us to use almost all replays from seasons 4 and 5.
3.1.2 Data extraction
We used the sc2reader Python library to extract game and player information from the replay files. We first
extracted the Battle.net URLs of all players from the games. We then made a list containing the name of each
player together with his rank in seasons 4 and 5. For each game, we then determined when it was played, who
the players were, and what their league was. If the game was not from season 4 or 5, or the players were not in
the same league, we disregarded it. We then checked the patch version of the game and only selected games
from version 1.4.2.20141 of the game.
We grouped the games according to their league. If both players are in the same league, the game is regarded
as having been played in that league. For example, if both players of a game are in the silver league, that game
is regarded as a silver league game.
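This grouping step can be sketched as follows; the game representation (a dict with a list of players) and the helper name are our own, not code from the thesis pipeline.

```python
# Sketch of the league-grouping step: a game is kept only if both players
# are in the same league, and is then labelled with that league.
def assign_league(game):
    """Return the game's league, or None if the players' leagues differ.

    `game` is assumed to be a dict with a 'players' list of league names.
    """
    leagues = {player["league"] for player in game["players"]}
    return leagues.pop() if len(leagues) == 1 else None

same = {"players": [{"league": "silver"}, {"league": "silver"}]}
mixed = {"players": [{"league": "silver"}, {"league": "gold"}]}
assert assign_league(same) == "silver"
assert assign_league(mixed) is None
```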
After the replay files were selected and ordered, we used the command-line interface of the program Sc2gears
to print the in-game actions of all players. The replay files only contain information about the interactions of
the players with the game interface. Hence, we did not have information about how many units were killed,
how many structures were razed, what the available supply was, or how many resources were gathered. We
could only see how many buildings or units were ordered to be built, not how many were actually completed.
For example, if a command is issued to build a unit, but the building that produces this unit is destroyed, the
unfinished unit in production is lost too. This means that we can only count how many units, buildings and
upgrades were ordered for production. We can use this information to calculate how many resources and
supply the issued commands cost, but this does not tell us the actual costs, because we do not know whether
something was completed.
We now have a data set, which has to be divided into a training set and a test set. We have 2 different kinds of
sets: a pretest set and a 5-fold set. For the pretest set, we take 70% of the data as the training set and 30% as
the test set. For the 5-fold set, we take 80% of the data as our training set and 20% as our test set. We do this 5
times, so we end up with 5 training sets and 5 complementary test sets.
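Because the split is made at the level of whole games rather than individual time slices (see also Subsection 3.4.1), it can be sketched as below; `five_fold_game_split` is a hypothetical helper of ours, not code from the thesis.

```python
# Sketch of the 5-fold split at the level of whole games (not individual
# time slices), so that no game is divided between training and test data.
import random

def five_fold_game_split(game_ids, seed=0):
    """Yield (train, test) lists of game ids with an 80/20 split, five times."""
    ids = list(game_ids)
    random.Random(seed).shuffle(ids)
    fold_size = len(ids) // 5
    for k in range(5):
        test = ids[k * fold_size:(k + 1) * fold_size]
        train = [g for g in ids if g not in test]
        yield train, test

folds = list(five_fold_game_split(range(100)))
assert len(folds) == 5
assert all(len(test) == 20 and len(train) == 80 for train, test in folds)
```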
3.1.3 Feature set
After collecting the data set, we had to create an appropriate feature set to feed to the algorithms. Next to
general information about the game and the players, the replay files only contain information about the players'
interactions with the user interface. Therefore, we have to base our features on those interactions. We only
selected features that are applicable to all three races, because we want a general method to measure a
player's skill.
The instances in our data set are time slices. Every instance describes one minute of gameplay. There are 3
different kinds of features. Firstly, we have features that give general information. These features are not
affected by the in-game behavior of the players. Secondly, we have features that describe what has happened
during the minute that is described in the instance. Thirdly, we have features that describe what happened
during the entire game up to and including the minute that is described in the instance.
We can explain this with an example. Suppose the instance number is 2. Because the first 90 seconds of the
game are excluded, the per-minute features of this instance describe everything that happened during its own
minute, between 2.5 and 3.5 minutes into the game. The cumulative features describe everything that
happened during the entire game up to that point, between 1.5 and 3.5 minutes into the game.
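Under this convention, the windows covered by an instance can be computed directly; `windows` is a small helper of ours, with all times in minutes from the start of the game.

```python
# Computing the time windows (in minutes) covered by instance n: the first
# 90 seconds of each game are excluded, so instance n's per-minute window
# starts 1.5 + (n - 1) minutes into the game.
def windows(n):
    per_minute = (1.5 + (n - 1), 1.5 + n)   # just this instance's own minute
    cumulative = (1.5, 1.5 + n)             # whole game up to the instance
    return per_minute, cumulative

assert windows(2) == ((2.5, 3.5), (1.5, 3.5))
assert windows(1) == ((1.5, 2.5), (1.5, 2.5))
```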
General features
We want some general information about the players that might affect their behavior. Our first feature is of
course the league a player is in. Then there is the issue of race. Each race works in a different way. This might
affect the number of commands that is issued. The server on which the game is played might also be of
influence. We assume that the difference in skill between American and European players is not large but we
do not know the actual size of the difference. Near the end of the game, the winner usually has a stronger
economy than the loser. Therefore, we want to know whether or not a player won the game. Time is also
important because more game states are possible as the game progresses. There might be differences as to
when a player develops new technologies. This gives us the 6 features that are described in Table 3.2.
Feature            | Possible values                                              | Description
League             | bronze, silver, gold, platinum, diamond, master, grandmaster | Describes the player's skill
Server             | American, European                                           | Server where the game was played
Player race        | terran, protoss, zerg                                        | Race the player picked
Opponent race      | terran, protoss, zerg                                        | Race the opponent picked
Winner             | yes, no                                                      | Whether the player won the game
Minute of the game | 0 - 86                                                       | Each instance describes one minute in the game; the first 90 seconds are excluded from measurement
Table 3.2 General features
Visuospatial attention and motor skills
Because experts tend to have better motor skills and visuospatial attention, and make faster decisions than
novices, we selected the number of actions a player issues per minute as a feature. We also calculate how many
of the issued actions are actually effective. We use two measures: one for the overall game up to and including
the instance, and one for just the minute described in the instance. We also measure the overall redundancy:
the fraction of issued commands that were not effective. All features relating to issued and effective
commands are described in Table 3.3.
Overall feature | Possible values | Per-minute feature | Possible values | Description
average macro   | 0 - 87          | macro              | 0 - 131         | Average macro actions
average micro   | 0 - 269         | micro              | 1 - 372         | Average micro actions
average apm     | 0 - 289         | apm                | 1 - 388         | Average macro actions + average micro actions
total eapm      | 0 - 165         | eapm               | 0 - 191         | Average effective actions
redundancy      | 0 - 0.747       | -                  | -               | (total average actions per minute - total average effective actions per minute) / total average actions per minute
Table 3.3 Features measuring the number of commands issued by the player
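The redundancy feature from Table 3.3 can be written out directly; the APM/EAPM values in the example are made up.

```python
# The redundancy feature from Table 3.3: the fraction of issued actions
# that were not effective, computed from average APM and EAPM.
def redundancy(apm, eapm):
    """(average actions - average effective actions) / average actions."""
    return (apm - eapm) / apm if apm else 0.0

assert redundancy(200, 120) == 0.4   # 40% of issued actions were redundant
assert redundancy(150, 150) == 0.0   # every issued action was effective
```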
To decide whether an action is effective, we created a set of rules based on the preconditions used by the
Starcraft II analysis tool Sc2gears. We also look at the number of macro and micro actions that are issued. A
macro action is usually any action that costs the player minerals and/or gas, and thus describes the player's
economy. A micro action is any other action, and describes the player's skill at micro-managing his units. The
specific rules for deciding whether an action is macro or micro are based on the rules used by Sc2gears. The
rule sets for macro, micro, and effective actions can be found in Appendix A.
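As a minimal, hypothetical version of the macro/micro distinction (the real rule sets in Appendix A follow Sc2gears and are more detailed), an action that spends minerals or gas could be counted as macro and everything else as micro:

```python
# Hypothetical simplification of the macro/micro rules: a macro action is
# anything that spends minerals or gas; everything else counts as micro.
def action_kind(action):
    """`action` is assumed to be a dict with 'minerals' and 'gas' costs."""
    if action.get("minerals", 0) or action.get("gas", 0):
        return "macro"
    return "micro"

assert action_kind({"name": "train scv", "minerals": 50, "gas": 0}) == "macro"
assert action_kind({"name": "move", "minerals": 0, "gas": 0}) == "micro"
```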
Another way to measure motor skills and visuospatial attention is to look at a player's hotkey use. Hotkeys can
be used to quickly perform different actions on different parts of the map. If a player uses hotkeys more often,
this could indicate that the player is more proficient in the game. Therefore, we include the use of hotkeys in
our feature set. There are three ways in which a player can use hotkeys: he can assign a group to a hotkey, he
can add new buildings and/or units to an already assigned hotkey, and he can select an assigned hotkey. The
features related to hotkey usage are described in Table 3.4.
Overall feature                 | Possible values | Per-minute feature | Possible values | Description
Total hotkeys selected          | 0 - 5487        | Hotkeys selected   | 0 - 237         | Total number of times hotkeys were selected
Total hotkeys set               | 0 - 176         | Hotkeys set        | 0 - 21          | Total number of hotkeys set
Total hotkeys added             | 0 - 199         | Hotkeys added      | 0 - 14          | Total number of times new units or buildings were added to a hotkey
Total different hotkeys         | 0 - 17          | -                  | -               | Total number of different hotkeys that were set
Hotkeys selected per set hotkey | 0 - 1501        | -                  | -               | Total number of selections / total number of hotkeys set
Table 3.4 Features related to hotkey usage
Economy
In Starcraft II, having a strong economy is key to winning a game. If a player does not have enough resources to
spend, his army will be small and he will lose the game. On the other hand, if a player invests only in his
economy, chances are that the opponent will have an army first and kill the player. A skilled player knows how
to find an effective balance. We include the number of bases, the number of workers, and the spending rate in
our feature set to measure the strength of a player's economy. We also calculate the number of workers per
gas collection building and the amount of minerals spent per worker, to measure the balance between workers
and resources. The features regarding economy are described in Table 3.5.
Overall feature                 | Possible values | Per-minute feature | Possible values | Description
Total number of bases           | 1 - 20          | -                  | -               | Number of bases a player built during the game
Total number of workers         | 6 - 188         | Number of workers  | 0 - 22          | Number of workers a player built during the game
Total resources spent           | 0 - 111755      | Resources          | 0 - 5150        | Minerals spent + gas spent
Total minerals spent            | 0 - 72400       | Minerals spent     | 0 - 3600        | Amount of minerals spent by the player during the game
Total gas spent                 | 0 - 26600       | Gas spent          | 0 - 2050        | Amount of gas spent by the player during the game
Workers per collection building | 0 - 90          | -                  | -               | Total number of workers built / total number of gas collection buildings built
Spent minerals per worker       | 0 - 1084        | -                  | -               | Total amount of minerals spent / total number of workers built
Table 3.5 Features describing economy
Technology
Next to a strong economy, players can strengthen their position through technology. Players can upgrade the
strength of their units and research special abilities, and new, stronger units become available to players as
they advance in the technology tree. Therefore, the numbers of upgrades and abilities researched are included
in the feature set. We also want to know the tier a player is in; in our case, this is a general description of a
player's advancement in the technology tree. These three attributes are described in Table 3.6.
Overall feature           | Possible values | Description
Total number of upgrades  | 0 - 16          | Number of improvements researched
Total number of research  | 0 - 11          | Number of special abilities researched
Tier                      | 0 - 3           | General level of advancement in the technology tree
Table 3.6 Features related to technological development
Strategy
Another important factor in Starcraft II is play style, also called strategy. Players can be offensive, defensive, or
anywhere in between. This determines what and how many units are built and how many resources are
invested in defense. A good player knows how to balance offense and defense. We have included the number
of different units, the total number of fighting units, and the number of defensive structures as features to
measure this balance. Next to gas and minerals, units also cost supply. Therefore, we have also included the
supply that was provided and used. The features regarding strategy are described in Table 3.7.
Overall feature                            | Possible values | Per-minute feature | Possible values | Description
Total supply used                          | 6 - 1066        | Supply used        | 0 - 51          | Amount of supply used by the units that were built
Total supply gained                        | 10 - 756        | -                  | -               | Amount of supply gained by constructing supply buildings
Different units                            | 0 - 14          | -                  | -               | Any unit that was built, including workers, transport units and detection units
Total number of fighting units             | 0 - 695         | -                  | -               | Any unit that can fight, except for workers
Total number of defensive structures built | 0 - 69          | -                  | -               | Any building that can do damage to nearby enemies
Table 3.7 Features describing strategy
We now have a total of 44 features, including the class. We divided each game into slices with a length of one
minute. Some features were calculated for only the minute described in one slice; others were calculated for
the entire game up to and including the minute described in the slice. We do not describe the first 90 seconds
of a game, because only a limited number of actions is possible during this time, resulting in an almost equal
game state for all players, regardless of their skill. The first 90 seconds are excluded from the calculations of the
average (effective) actions per minute, but not from the other features that describe more than one minute, for
example the total amount of minerals collected during the game. Games are recorded in game time, which is
faster than normal time. In this research, we used normal time for our measurements, because it is easier to
use in our calculations.
3.2 Measurement
To measure the performance of the classifiers, we use accuracy. This measure tells us how many instances were
classified correctly out of a set of instances. We calculate the accuracy per class by dividing the number of
correctly classified instances in that class by the total number of instances in that class (TP / all instances in the
class). We then add up the accuracies of all classes and divide the sum by the number of classes to get the
average accuracy.
Because the data in our set are not evenly distributed over the classes, our data set is skewed. We can correct
for this by using a weighted accuracy. This is calculated by multiplying the accuracy of each class with a
corresponding weight, equal to the percentage of instances in the test set that belong to that class (instances in
the class / total number of instances). The weighted class accuracies are then added up to give us the weighted
average accuracy.
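As a concrete check of this computation, here is a small Python sketch with made-up counts; note that weighting each class's accuracy by its share of the test set makes the weighted accuracy equal to the plain fraction of correctly classified instances.

```python
import math

# Weighted accuracy: per-class accuracy weighted by each class's share of
# the test set, computed from illustrative (made-up) counts.
def weighted_accuracy(correct, totals):
    """`correct` and `totals` map each class to counts on the test set."""
    n = sum(totals.values())
    return sum((totals[c] / n) * (correct[c] / totals[c]) for c in totals)

correct = {"silver": 30, "master": 90}
totals = {"silver": 100, "master": 100}
# With these weights, the result equals the overall fraction correct.
assert math.isclose(weighted_accuracy(correct, totals), 120 / 200)
```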
To measure how well the algorithms actually perform, we need to compare their performance to a baseline.
Because our data set is skewed, we use a frequency baseline. This baseline automatically assigns all instances to
the most frequent class which in our case is the master league.
Whenever we want to examine whether there are significant differences in any of our experiments except for
the pretest, we first use the Shapiro-Wilk test of normality to examine whether the results are distributed
normally across the 5 folds. The null hypothesis of this test states that the samples come from normally
distributed data; if p < .05, the deviation from normality is significant and the null hypothesis should be
rejected.
According to Witten and Hall (2011), we should use a paired t-test to compare 2 algorithms that were tested
using an x-fold cross validation. This test examines whether the difference between the 2 classifiers is
significant across the folds. We do not need to test for equality of variance, since this test assumes that both
measurements were conducted on the exact same subsets of data. If the t-test suggests a significant difference,
we can calculate the size of the effect by converting Cohen's d into the effect size r. If we want to test several
measures on the same subset of data, and the number of members in each group is small, we should use a
k-related samples test.
If we want to compare the performance of an algorithm on different subsets of the data, we need to use a one-
way ANOVA. This test compares the means of all groups with each other to see whether they are significantly
different. Before we can conduct this test, we have to examine whether the variances of the subsets are equal
by conducting Levene's test. If the variances are equal, and the ANOVA suggests that the difference is
significant, we can use η2 to compute the effect size. Cohen suggests that for both the effect measures r and
η2, an effect size of .10 can be considered small, a size of .30 medium, and a size of .50 large (as cited in Field,
2009, p. 57).
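The d-to-r conversion can be written out as follows. The means and pooled standard deviation below are invented for illustration; the formula r = d / sqrt(d^2 + 4) is the standard conversion for equal group sizes.

```python
import math

# Converting Cohen's d into the effect size r, as used to interpret the
# t-test results (r of about .10 is small, .30 medium, .50 large).
def cohens_d(mean1, mean2, pooled_sd):
    return (mean1 - mean2) / pooled_sd

def d_to_r(d):
    """Standard conversion for equal group sizes: r = d / sqrt(d^2 + 4)."""
    return d / math.sqrt(d ** 2 + 4)

d = cohens_d(47.3, 25.5, 20.0)    # made-up means and pooled SD
assert math.isclose(d, 1.09)
assert 0.4 < d_to_r(d) < 0.5      # a fairly large effect on Cohen's scale
```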
3.3 Pretest
We want to find a classification method that is suitable for modeling players according to skill. To find one, we
perform a pretest in which we compare the performance of a number of algorithms. We are not interested in
optimizing any of the algorithms; that would be interesting if one wanted to use the classifier to build an AI, but
we only want to examine the basic performance of a classifier that seems decent enough. Therefore, we do not
perform an x-fold cross validation. Rather, we use 70% of our data set to train the classifiers and 30% to test
them.
We used 4 different classification algorithms on the test set to see which one is most suitable. We used 3 single
algorithms and 1 ensemble learner. The single algorithms we tested are: IBk, SMO, and J48. The ensemble
learner is RandomForest. For each classifier we used the default settings. We also used the original, unaltered
feature set. An overview of the accuracy of all classifiers is given in Table 4.1.
League             | J48   | IBk   | SMO   | RandomForest | Baseline
bronze             | 47.7% | 33.6% | 64.6% | 52.3%        | -
silver             | 22.8% | 28.7% | 24.6% | 26.8%        | -
gold               | 14.9% | 17.2% | 15.5% | 16.1%        | -
platinum           | 25.4% | 25.3% | 42.7% | 26.1%        | -
diamond            | 30.9% | 31.1% | 34.7% | 34.7%        | -
master             | 45.8% | 45.8% | 55.0% | 52.2%        | -
grandmaster        | 51.3% | 41.4% | 58.9% | 45.4%        | -
unweighted average | 34.1% | 31.9% | 42.3% | 36.2%        | 28.4%
weighted average   | 35.7% | 34.2% | 44.0% | 38.6%        | 28.4%
Table 4.1 Accuracy of the performance of the classifiers
If we plot the accuracy of all classifiers for each league, we get the graph in Figure 4.1. We can see that all
classifiers follow roughly the same curve; only for the platinum league is the performance of SMO markedly
higher than that of the other algorithms.
Figure 4.1 Accuracy of the classifiers for each league
All algorithms except IBk seem to perform well on the bronze league. For all algorithms, performance was
worst for the gold league; after the gold league, performance gradually increases for most classifiers, and SMO
shows a particularly steep increase. All classifiers had a relatively strong performance for the master and
grandmaster leagues. The players in these leagues together encompass the top 2% of Starcraft II players, and as
experts tend to play more consistently than novices, it is easier to model their behavior.
This distribution in performance is in line with the nature of the data set. If we look at Figure 2.1 we can see
that within the gold league, the skill difference between players is the smallest. For the silver and platinum
leagues it is somewhat larger. This makes it hard for classifiers to distinguish players between these leagues.
The bronze league encompasses the largest range in skill difference, which makes it easier to classify players in
this league. However, this league also has the largest error distance of all classes. It might be that some players
are new to the game. Based on their performance in the 5 initial placement matches, they were placed in the
bronze league. However, these players might have a steep learning curve. Because of this, they might still be in
the bronze league although their skill has improved a lot and they actually play at the level of a higher league.
Based on these results, we choose SMO as our classifier. It seems to have a decent performance on this test set.
It also seems to outperform the other classifiers for the prediction of the platinum league. For the other
classes, its performance seems to be about the same as the performance of the other algorithms.
3.4 Player model
In the previous sections, we described the collection and structure of our data set. We have also performed a
pretest to pick an algorithm that seems suitable; we chose SMO based on the results described in Section 3.3.
We now test the performance of SMO using a 5-fold cross validation. This is discussed in Subsection 3.4.1. We
use InfoGain to examine which features contribute the most to solving our classification problem. This is
described in Subsection 3.4.2. We then examine to what extent the time of measurement is important. We
create subsets of the test set by removing instances from specific time slices. We then re-test our player model
on these subsets. We describe this in Subsection 3.4.3. Finally, we try to predict on which server a game was
played. We describe this in Subsection 3.4.4.
3.4.1 Performance of SMO
To test the model, a 10-fold cross validation is commonly used, since it has been proven to be a statistically
robust method that usually gives the best estimate of error (Witten & Hall, 2011). However,
due to a lack of time, we limited our experiment to a 5-fold cross validation. We sampled the data for each set
randomly ourselves instead of using Weka, because we wanted the training and test sets to consist of entire
games. The x-fold cross validation embedded in Weka uses stratified holdout. This method aims at taking a
sample whose class distribution matches that of the entire set as closely as possible. However, this means that it does not leave
out entire games but rather parts of games. This is not representative of our data set, since we always have an
entire game of a player to classify and not a few random minutes.
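The game-level sampling described above can be sketched as follows. This is an illustrative reconstruction in Python, not the procedure actually used in the thesis; the number of games, instances per game, and the helper name `game_level_folds` are all invented for the example.

```python
import random

def game_level_folds(game_ids, k=5, seed=42):
    """Assign each instance to a fold, keeping whole games together."""
    games = sorted(set(game_ids))
    random.Random(seed).shuffle(games)
    fold_of_game = {g: i % k for i, g in enumerate(games)}
    return [fold_of_game[g] for g in game_ids]

# 10 hypothetical games with 4 time-slice instances each
game_ids = [g for g in range(10) for _ in range(4)]
folds = game_level_folds(game_ids, k=5)

# every game's instances end up in exactly one fold,
# so no game is split between training and test data
for g in range(10):
    assert len({f for f, gid in zip(folds, game_ids) if gid == g}) == 1
```

The key property is that fold membership is decided per game rather than per instance, which is what Weka's built-in stratified splitting does not guarantee.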
We build the player model with SMO using the default settings of Weka. These settings are not necessarily
optimal. However, our goal is to examine whether it is possible to build a player model, not to find the
optimal solution. After building and testing the model using a 5-fold cross validation, we examine the
performance and the errors that were made. We first elaborate on how the errors are distributed. We then
calculate the average classification distance and average error distance by multiplying the average confusion table with a
linear weight table. The results of these tests are reported in Section 4.1.
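The weight-table computation can be sketched as follows. The 3-class confusion matrix below is an invented toy example (the actual problem has 7 classes), and `error_distances` is a hypothetical helper name, not code from the thesis.

```python
def error_distances(confusion):
    """Average classification distance and average error distance per class."""
    n = len(confusion)
    # linear weight table: distance between actual and predicted class
    weight = [[abs(i - j) for j in range(n)] for i in range(n)]
    avg_class, avg_error = [], []
    for i, row in enumerate(confusion):
        weighted = sum(c * w for c, w in zip(row, weight[i]))
        total = sum(row)
        errors = total - row[i]          # misclassified instances of class i
        avg_class.append(weighted / total)
        avg_error.append(weighted / errors if errors else 0.0)
    return avg_class, avg_error

# toy 3-class confusion matrix (rows: actual class, columns: predicted class)
conf = [[8, 2, 0],
        [1, 6, 3],
        [0, 2, 7]]
avg_class, avg_error = error_distances(conf)
# e.g. for the first class: (2 * 1) / 10 = 0.2 average classification
# distance, and 2 / 2 = 1.0 average distance of its misclassifications
```

An average error distance of 1.0 for a class would mean every misclassified instance of that class landed in a neighboring class.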
3.4.2 InfoGain
InfoGain is a technique that evaluates each attribute by how much it reduces the uncertainty (entropy) about
the class. An attribute that yields a larger information gain is more informative. After all attributes are evaluated,
they are ranked according to their information gain: the most informative attribute is placed at the
first position and the least informative one is placed last. In order to test how removing attributes affects the
accuracy of the classifier and the time it takes to build the model, we built the model using different subsets of
the feature set. We removed 0, 20, and 40 attributes to see how this affects time and accuracy. The machine we
use for this test has a 64-bit Intel Core i7 processor @2.20 GHz with 6050 MB RAM. We ran a 64-bit version of
Weka with a maximum heap size of 4048 MB on a 64-bit version of Windows 7. The results of the tests
regarding InfoGain are reported in Section 4.2.
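As an illustration of the quantity that an InfoGain ranking is based on, the sketch below computes the information gain of one discrete attribute: the entropy of the class distribution minus the conditional entropy given the attribute. The toy labels and attribute values are invented and are not thesis data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Entropy of the class minus conditional entropy given the attribute."""
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = ['bronze', 'bronze', 'gold', 'gold']
attr   = ['low', 'low', 'high', 'high']   # perfectly separates the classes
print(info_gain(attr, labels))            # → 1.0
```

A perfectly separating attribute has a gain equal to the class entropy (here 1 bit); an attribute independent of the class has a gain of 0.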
3.4.3 Time
When we know the performance of the classifier on the test set, we can examine the impact of time. We want
to know if the accuracy of the classifier improves during the game. We begin by examining how the
performance changes when we remove the first instance of every game. This shows how well the
classifier performs after at least 2.5 minutes have been played. We perform a 5-fold cross validation to test this:
we remove the first instance from all test sets and apply the models to these subsets. We then
remove the first and second instances and test again. We repeat this process for the first 10 instances of every
game. The results of these tests are reported in Subsection 4.3.
3.4.4 Server
In our research, we used replay files from games that were played on either the American or the European
server. We want to know to what extent we can predict on which server a game was played. If we can do this
accurately, this means that there is a difference in behavior between players from different continents. We use
a 5-fold cross validation to build and test a model that predicts the server. We also try to predict the skill of a player
by using the entire feature set except for ‘server’ to build a model. We use a 5-fold cross validation to test
whether removing the feature ‘server’ has a significant effect on performance. The results of these tests are
reported in Subsection 4.4.
4. Results
In this section, we report the results of our experiments. We begin by examining which classification method is
most suitable to model players according to skill. In Subsection 4.1, we report how accurately SMO is able to
build a player model using a 5-fold cross validation. In Subsection 4.2, we discuss the ranking of the attributes
using InfoGain. In Subsection 4.3, we elaborate on the influence of time on the performance of the classifier. In
Subsection 4.4, we report the importance of the server regions.
4.1 Player model
If we evaluate our model using a 5-fold cross validation, on average 44.9% of instances are classified correctly
with a standard deviation of 2.67. The average accuracy for the baseline is 25.5% with a standard deviation of
1.05. The average weighted accuracy for SMO is 47.3% with a standard deviation of 2.2. The accuracy of SMO
for each class based on the average of the 5-fold cross validation is shown in Table 4.1.
League              SMO accuracy      Baseline accuracy
bronze              69.6%
silver              25.8%
gold                10.6%
platinum            40.2%
diamond             42.9%
master              63.3%
grandmaster         62.1%
average             44.9% (2.67)      25.5% (1.05)
weighted average    47.3% (2.19)
Table 4.1 Accuracy scores for SMO and the baseline. The standard deviation is reported between brackets
We want to know if the difference between the baseline accuracy and the performance of SMO is significant.
The Shapiro-Wilk test of normality shows that p > .05 for the weighted accuracy of SMO (M = 47.27, SD =
2.19), the unweighted accuracy (M = 44.93, SD = 2.67), and the baseline (M = 25.55, SD = 1.05). This
means that it is likely that the data is normally distributed. An overview of the results of this test is given in
Table 4.2.
Measure Statistic df p
SMO weighted accuracy 0.941 5 0.674
SMO unweighted accuracy 0.971 5 0.882
Baseline accuracy 0.892 5 0.365
Table 4.2 Results for the Shapiro-Wilk normality test
We can now use a paired t-test to examine if SMO performs significantly better than the baseline. This test
suggests that both the unweighted accuracy of SMO, with t(4) = 14.35, p < .001, and the weighted accuracy of
SMO, with t(4) = 32.10, p < .001, are significantly better than the baseline. The effect size for both the
weighted and the unweighted accuracy is .99.
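A paired t-test over per-fold accuracies can be sketched as below. The five fold accuracies are invented placeholders, not the values obtained in this experiment, and `paired_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic and Cohen's d for two matched samples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    d_mean, d_sd = mean(diffs), stdev(diffs)
    t = d_mean / (d_sd / sqrt(len(diffs)))   # df = n - 1
    cohens_d = d_mean / d_sd                 # effect size of the difference
    return t, cohens_d

smo      = [46.1, 48.0, 45.5, 49.2, 47.6]   # hypothetical per-fold accuracies
baseline = [25.0, 26.1, 24.8, 26.3, 25.4]

t, d = paired_t(smo, baseline)
# compare t against the critical value for df = 4 (2.776 at p = .05, two-tailed)
```

Because the same folds are used for both methods, the paired test on the per-fold differences is more sensitive than an unpaired comparison of the two accuracy lists.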
4.1.1 Error Distribution
The classes we use in our data set are ordinal. This means that class C is more distant from class A than class B is.
Therefore, it is preferable that a misclassified instance is placed in a neighboring class rather than in a more distant class. In Table 4.3,
we report the average confusion table for the 5-fold cross validation. We can see that on average, 67.0% of the
wrongly classified instances are placed in a neighboring class.
   a    b    c    d    e    f    g    <-- classified as
 567  135   39   57   16    2    0  | a
 206  202  127  185   63   12    1  | b
  80  195  109  394  178   69   14  | c
  60   99  130  649  391  262   23  | d
  16   28   77  319  865  609  102  | e
   9   17   24  196  471 1615  217  | f
   0    0    2    9   95  331  713  | g
Table 4.3 Average confusion table for the 5-fold cross validation
We can see that as leagues become more distant from each other, fewer errors are made. For each league, we can
calculate the average distance of the actual class to the assigned classes. If an instance is placed in a
neighboring class, we assign a weight of 1 to this error. If an instance is placed one class next to the neighboring
class, we assign a weight of 2 to this error. If we do this for all classes, we get the linear weight table that is
displayed in Table 4.4.
   a  b  c  d  e  f  g    <-- classified as
   0  1  2  3  4  5  6  | a
   1  0  1  2  3  4  5  | b
   2  1  0  1  2  3  4  | c
   3  2  1  0  1  2  3  | d
   4  3  2  1  0  1  2  | e
   5  4  3  2  1  0  1  | f
   6  5  4  3  2  1  0  | g
Table 4.4 Weight table to calculate the average misclassification
For the confusion table of each fold, we multiply each entry by the corresponding weight in Table 4.4. The
sum of each row is then divided by the total number of instances in the corresponding class. This gives us the
average classification distance for each league, shown in the second column of Table 4.5. A distance lower than 1
means that most of the instances were classified correctly. To get an average error distance for each league, we
sum the weighted errors and divide this number by the total number of errors. The average error
distance for each class is shown in the third column of Table 4.5. A distance of 1 means that all wrongly
classified instances were placed in a neighboring class. All reported numbers are averaged over the 5 folds.
League         Average classification    Average distance of misclassifications
Bronze         0.75                      1.95
Silver         1.34                      1.75
Gold           1.47                      1.61
Platinum       1.11                      1.67
Diamond        0.92                      1.35
Master         0.60                      1.35
Grandmaster    0.38                      1.19
Average        0.94                      1.55
Table 4.5 Average distance of classification and misclassification per league based on a 5-fold cross validation
We can see that on average, the distance of the misclassifications is 1.55. This means that
most misclassified instances were placed in a neighboring class. We can also see that no class scored an average error distance of
2.00 or more, meaning that for all classes, most misclassified instances were placed in a neighboring class.
4.2 InfoGain
InfoGain tells us how informative each attribute is. It ranks the features in the set according to the information
they provide in solving the classification problem. We ranked the features based on the average weight for the
5 training sets used in the 5-fold cross validation. This gives us the list of ranked features that is reported in
Appendix B. If we perform the Shapiro-Wilk test for normality, we can see that the weights for all attributes
except total minerals spent, amount of minerals spent per worker, and tier are distributed normally with p >
.05. A Friedman test suggests that there is a significant difference in weight between the attributes χ2(42) =
210.00, p <.001.
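The Friedman statistic used here can be computed directly from the per-fold attribute weights, as sketched below. The 5-folds-by-3-attributes weight matrix is invented for illustration, and the sketch assumes no ties within a fold.

```python
def friedman_chi2(matrix):
    """Friedman chi-square; matrix[i][j] = weight of attribute j in fold i."""
    n, k = len(matrix), len(matrix[0])
    rank_sums = [0.0] * k
    for row in matrix:
        # rank attributes within each fold (rank 1 = smallest weight)
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    # standard Friedman statistic: 12/(n k (k+1)) * sum(R_j^2) - 3 n (k+1)
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)

weights = [[0.55, 0.40, 0.20],   # hypothetical weights, fold 1
           [0.57, 0.41, 0.22],
           [0.54, 0.39, 0.21],
           [0.56, 0.42, 0.19],
           [0.58, 0.38, 0.23]]
print(friedman_chi2(weights))    # → 10.0
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; a large value means the attributes are ranked consistently differently across folds.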
We have plotted the average weights for all attributes in Figure 4.1. The x-axis contains the rank number of the
attribute. We can see that between the first attribute (average micro over the game) and the second attribute
(average apm over the game), the weight drops from 0.569 to 0.511. A paired t-test suggests that the difference
in weight is significant with t(4) = 34.71, p < .001 with an effect size of .99.
Figure 4.1 Weight values for the attributes. The rank number is displayed on the x-axis
After the eighth ranked feature (total hotkeys selected over the game), the weight drops from 0.415 for the
eighth feature to 0.256 for the ninth feature (amount of selections per hotkey). To examine if this is a significant
difference, we perform a paired t-test between the scores of these 2 attributes. We have already established
that the weights within these groups are distributed normally with p > .05. The t-test indicates that the
difference in weight between the 2 attributes is significant with t(4) = 59.30, p < .001 and an effect size of .99.
The top 8 features are ranked in the same order for all 5 folds.
If we remove features from the feature set, the time it takes to build the model will drop, but performance will
drop too. We want to know how much time we can gain and how much the performance of SMO will drop. We
built models using 3 subsets of the feature set: 1 with the complete feature set, 1 with the worst 20 features
removed, and 1 with the worst 40 features removed. Figure 4.2 shows the accuracy for these feature sets. If we
compare the weighted accuracy for the entire feature set (M = 47.27, SD = 2.19) with the set where we
removed 20 features (M = 43.75, SD = 1.05) with a paired t-test, we can see that performance drops
significantly with t(4) = 2.93, p < .05, and an effect size of .82.
Figure 4.2 Accuracy after removing attributes
On our machine, it takes an average of 2.8 seconds to build a model based on 3 predictive features. If we do not
remove any attributes, it takes 2398.3 seconds to build the model, as we can see in Table 4.6. If we remove the
worst 20 attributes, it takes 49.1 seconds to build the model.
Removed attributes    Seconds to build model
0                     2398.3
20                    49.1
40                    2.8
Table 4.6 Seconds to build the model after removing features
4.3 Time
We want to know from what moment in the game SMO can model players according to their skill and if time
has an influence on performance. We have plotted the average weighted accuracy scores of 10 different
subsets to examine when performance increases. We first examined the performance on the entire training and
test set. We then incrementally removed instances to see how the accuracy changes. We started by removing
all first instances, then all first and second instances, and so on, up to and including the 9th instance.
The graph is shown in Figure 4.3. The blue line represents the accuracy for SMO and the red line represents the
baseline.
Figure 4.3 Accuracy of SMO for different time periods; the number on the x-axis shows the instances that were omitted
We want to know if there is a significant change in performance for SMO if the game progresses. We first test if
the data is normally distributed. The Shapiro-Wilk test shows that this is indeed the case for all subsets of the
data with p > .05 for each individual group and for all groups together. If we perform a Levene’s test on the
data, we see that p > .05, meaning that variance is equal over the groups. We can now perform an ANOVA test
to examine if there is any significant difference in the performance of SMO over time. The null hypothesis states
that there is no difference between groups and is rejected if p < .05. For this test, we get a value of F(8) = 0.082
with p > .05 which suggests that time does not seem to have a significant effect on the accuracy of the classifier.
We also tested if the baseline has a normal distribution. Since p > .05 for each individual subset and for the
overall average of the baseline, we can say that the data is distributed normally. Levene’s test shows that the
variance is equal for all groups with p > .05. If we now perform an ANOVA, we see that the baseline does not
seem to change significantly over time with F(8) = 1.71 and p > .05.
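The one-way ANOVA used for these comparisons can be sketched as follows. The accuracy values for the three illustrative time subsets are invented, and `one_way_f` is a hypothetical helper name.

```python
from statistics import mean

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    grand = mean(v for g in groups for v in g)
    k = len(groups)
    n = sum(len(g) for g in groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# hypothetical SMO accuracies for three time subsets (5 folds each)
subsets = [[47.0, 46.5, 47.8, 46.9, 47.3],
           [47.1, 46.8, 47.5, 47.0, 47.2],
           [46.9, 47.2, 47.4, 46.6, 47.1]]
f = one_way_f(subsets)
# an F well below the critical value for df = (2, 12) (3.89 at p = .05)
# gives no evidence that accuracy differs between the subsets
```

A non-significant F here corresponds to the conclusion above: removing early instances does not measurably change the classifier's accuracy.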
4.4 Server
We built a new player model using SMO that predicts whether a game was played on the American or the
European server. On average, 65.3% (SD = 1.48) of the instances in the test sets belonged to the most frequent
class: the American server. If we count the true positives for both classes and divide this by the total number of
instances, on average, 71.7% (SD = 1.20) of all instances are classified correctly. The data is normally distributed
with p > .05. A t-test suggests that the difference in accuracy is significant with t(4) = 19.43, p < .001
and an effect size of .99. However, the percentage of games on each server is not equal for each league. For all
leagues but grandmaster, the majority of the games were played on the American server. In the grandmaster
league, the majority of games were played on the European server. If we assign all data to the American
server for the first 6 leagues, and all data to the European server for the last league, we get a corrected
baseline. The accuracy values of this baseline are the same as SMO for each fold with the same mean and
standard deviation (M = 71.73, SD = 1.20). This indicates that SMO has the same accuracy as the corrected
baseline and does not perform significantly better or worse than the baseline.
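The corrected baseline described above can be computed by taking the per-league majority server, as sketched below. The per-league game counts are invented placeholders, not the actual counts from the data set.

```python
def corrected_baseline_accuracy(counts):
    """counts: {league: (games_on_american, games_on_european)}.

    Predict the majority server within each league and return the
    resulting overall accuracy.
    """
    correct = sum(max(us, eu) for us, eu in counts.values())
    total = sum(us + eu for us, eu in counts.values())
    return correct / total

counts = {
    'bronze': (80, 20), 'silver': (75, 25), 'gold': (70, 30),
    'platinum': (65, 35), 'diamond': (60, 40), 'master': (55, 45),
    'grandmaster': (30, 70),   # grandmaster majority on the European server
}
print(round(corrected_baseline_accuracy(counts), 3))   # → 0.679
```

Because the majority server differs per league, this baseline is stronger than simply predicting the overall majority class.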
According to InfoGain, the server is the 14th most important feature for our model. We can examine its
influence further by constructing a model that contains the entire feature set except for the server. If we use
this model on the test set, we can see that the weighted accuracy is 44.4% with a standard deviation of 2.05. Table
4.7 shows the accuracy and standard deviation of the weighted accuracy for SMO and the baseline.
          All features      Without server
SMO       47.3% (2.19)      44.4% (2.05)
Baseline  25.5% (1.05)      25.5% (1.05)
Table 4.7 Weighted accuracy for SMO with and without the feature "server"; the standard deviation is shown between brackets
The data is distributed normally with p > .05 using the Shapiro-Wilk test. We use a paired t-test to see if removing
the server has a significant effect on the performance of SMO. The results of the test suggest that the
weighted accuracy of SMO using all features (M = 47.27, SD = 2.19) is not significantly better than when we
remove the attribute ‘server’ (M = 44.39, SD = 2.05) with t(4) = 2.15, p > .05. If we compare the weighted
accuracy of the ‘without server’ feature set with the baseline, the t-test suggests that it is significantly better
than the baseline with t(4) = 18.31, p < .05.
5. Discussion
In this chapter, we discuss the results of our experiments. In Subsection 5.1, we elaborate on the performance
of SMO. We also discuss the error distribution. Our tests with InfoGain are discussed in Subsection 5.2. In
Subsection 5.3, we elaborate on how time influences performance. The importance of the server on which a
game was played is discussed in Subsection 5.4. We conclude this chapter by elaborating on the
implementation of our results into AI in Subsection 5.5.
5.1 Player model
We used SMO to build the player model. We tested the algorithm using a 5-fold cross validation and the results
suggest that its performance is significantly better than the baseline. The effect size indicated that the
difference in accuracy between SMO and the baseline is large. These results suggest that SMO is significantly
better at predicting players’ skill than could be expected by chance.
If we look at the distribution of errors for the test set, we can see that most misclassified instances were placed
in a neighboring class, with an average error distance of 1.55. We can attribute this distribution of errors to the
nature of the data set. If a player beats better players, his rating rises more. If he wins consistently,
he will eventually be promoted. If a player wins against weaker players in a higher league but loses against average
players in that league, his rating will not increase consistently and he will not be promoted. The system does
this to prevent players around the borders of a league from constantly switching leagues. It could also be the case
that a player has a relatively high hidden rating but only a small number of ladder points because he does
not play often. This means that two players in different leagues might have the same hidden rating, with one of them
not yet promoted because of a lack of ladder points. Therefore, for each league in our data set, a percentage
of players belong to another league or could be placed in either one of 2 leagues.
In our research, we used a linear weight table to calculate the error distance. If the results of this test were to
be implemented in an AI, an exponential weight table might also be suitable. Such a table assigns an
exponentially larger weight to errors that are more distant. Thus, it is much worse to classify a grandmaster as a
bronze player than to classify a gold player as bronze. If we were to build an adaptive AI, it would be much
worse to pit a grandmaster against an easy AI than a gold player.
5.2 InfoGain
Applying InfoGain to optimize the performance of SMO by rearranging the original feature set has no effect.
However, it does tell us something about which attributes contribute most to solving the classification problem.
The single most informative feature, average micro over the game, has a significantly higher weight than the
second ranked feature, average apm. Average micro describes the average amount of issued commands that
involve micro actions over the entire game history up to the minute of the instance. The second (average apm)
and third (average eapm) ranked attributes also describe an average number over the past behavior in the
game. This indicates that the game history is an important factor in determining the skill of a player.
Between the eighth and ninth feature, the weight drops significantly. The first 8 features are ranked in the same
order for all folds. We included these 8 best features in the set because they measure motor skills. A gamer
must have excellent control over his mouse and keyboard to issue a large number of commands and use a large
number of hotkeys in a short period of time. This suggests that motor skill is another attribute of expertise that applies to
the domain of real-time strategy games.
The sixth (hotkey selections during the past minute) and eighth (amount of hotkeys selected during the entire
game) ranked features also measure the visuospatial attention of players. The features regarding apm and eapm
also measure this to some extent since they also include the use of hotkeys. Visuospatial attention is an
important attribute of expertise in many domains where several cues in different locations ask for attention at
the same time. The outcome of our research suggests that this may also be an important attribute of expertise
in real-time strategy games.
Removing 20 features from the feature set to build the model significantly reduces accuracy.
However, it also saves time: building the model takes almost 50 times less time than with the
entire feature set.
5.3 Time
Removing instances from the test set does not improve performance. Because the first instance of every game
covers the period from 1.5 until 2.5 minutes into the game, we can say that from 2.5 minutes into the game, we can give a
reasonably accurate prediction of a player's skill.
On average, the length of our games over the entire data set is about 15 minutes. This means that on average,
we can predict players’ skill after about 16.7% of the game is played. We can use this knowledge to scale the
difficulty of an AI during the game if the right strategy is used. As economy is an important factor in Starcraft II,
we could create a basic AI with an average economy. The AI can then choose what to do with its resources. If it
turns out that the player is a novice, it can play in such a way that only a small portion of the available resources
is used. This play style is somewhat similar to the behavior of novices since generally they do not spend their
resources effectively. However, if the opponent is an expert, the AI can strengthen its economy and increase its
spending. As expert players usually take more time to create a strong economy than novices, it is an acceptable
strategy for the AI to do the same.
5.4 Server
SMO does not perform significantly better than the corrected baseline in predicting on which server a game
was played. We also examined what happens if we try to predict players’ skill without the feature ‘server’. After
using the entire feature set except for “server” to build a model, performance did not seem to drop
significantly. This indicates that the server may not be an important factor in solving the classification problem.
This means that the in-game behavior of European and American players may not be significantly different. The
importance of server could be further examined by building a separate model for data from each server. If this
significantly improves the performance of the classifier, this would mean that the game behavior of players on
the American and European server is significantly different.
5.5 Implementation in AI
The InfoGain filter showed that features describing issued actions and effective actions are the most
informative features. Although we cannot directly implement apm or eapm into the behavior of an AI, we can
use them to measure a human player's skill. This information can be used to adjust the difficulty of the AI to the
level of the player, to make the game more interesting. One way to do this is described in Section 5.3.
The average number of micro actions that were issued seems to be the most important feature in predicting
players' skill. This feature describes how quickly a player was able to micro-manage his units. If a player is good
at micro-managing his units, he is better at destroying the opponent's units and keeping his own units alive than
if he lets the game AI fight the battles on a micro level. This indicates that it might be possible to make an AI more
interesting by adjusting the way it micro-manages units. An AI could do this more effectively when pitted
against a strong player and less effectively when pitted against a weak player. However, we must note that the
number of actions does not tell us anything about the quality of those actions. From our personal experience
with the game, we know that effective micro-management can outperform the game AI. Further research is
needed to examine to what extent the quality of players' micro-management changes with skill.
Although the effective issued actions per minute (eapm) is in the top 3 of informative features and thus an
effective measure of a player's skill, it is hard to directly implement into an AI. One cannot simply
raise the number of effective actions in order to make an AI play at a higher level. An effective action is an
action that was not only issued but also performed. This says nothing about the quality of the action. For
example, building 20 spawning pools during one game is not relevant since only one is needed. Raising the
number of effective actions by making the AI build 20 spawning pools will raise its eapm but does not make it
play better. However, if the AI is using a solid strategy, making it perform actions quicker or slower may affect its
level of play. Further research might investigate to what extent the measure eapm can be applied to artificial
agents.
Directly implementing the issued actions per minute (apm) into the behavior of an AI is not useful. This
measure describes how many commands were issued. This includes selecting buildings and units without giving
them actual commands. It does not contribute much if we tell an AI to select and deselect a lot of units and
buildings since these actions have no effect on the game.
6. Conclusions
In this chapter, we present the answers to the 5 research questions that we formulated in Section 1.1. These
questions aim to solve our problem statement:
To what extent can we use a player model to accurately distinguish a novice player from an expert player in a
real-time strategy game?
The 5 questions that we asked to solve this problem are:
1. For Starcraft II, what selection of player data is appropriate as a data set to build a player model upon?
2. For Starcraft II, what is a suitable classification method to build a player model that predicts the player’s
skill level according to the player’s behavior?
3. How accurately can a player model predict a player’s skill?
4. How does the performance of our player model change as the game progresses?
5. To what extent can we use the player model to build an AI?
These research questions are answered in the first 5 subsections of this chapter. Each subsection answers one
question. We attempt to solve the problem statement in Subsection 6.6. We end this chapter by discussing the
limitations of this thesis and giving recommendations for future work in Subsection 6.7.
6.1 Selection of the dataset
We gathered data from 3 different websites. Most of this data was uploaded by players themselves. Because we
only had access to manually uploaded replay files, our data set contains only an estimated 0.03% of all games
that were played during our collection phase. The distribution of the data over the leagues was skewed
compared to the distribution of the population, since we were unable to gather as many games from the lower
leagues as from the higher ones.
To remove as much noise from the data set as possible, we limited our data to one patch version of the game.
This version of the patch was live during seasons 4 and 5. These were also the seasons for which we could gather
league information. We only used games between players within the same league to limit the number of
players whose skill is around the border of a league. All games were played on either the American or European
server. We chose to use 2 servers so we could collect more data. The difference in level of play between American and
European players is not large enough to make a significant impact on the performance of our classifier.
We can conclude that we were able to collect a data set of games that are comparable to each other. However,
we were able to collect only a small percentage of all games that were played during our collection phase. The
distribution of our data set over the leagues was skewed compared to the distribution of the population.
6.2 Finding a classification method
We established the main properties of expertise that are applicable in this domain. Our feature set was
composed according to these properties and according to properties of the game itself. We then compared 4
different algorithms to see which one could be suitable for our problem. We chose SMO as it seemed to have a
somewhat better performance than the other classifiers on the platinum league whilst having about the same
performance on all other leagues.
37
For all classifiers, it seemed that the gold league was hardest to predict. This is in line with the nature of the
data set as the gold league spans the shortest range in skill difference. This means that there is not much
difference between players in the middle of a league and players near the class border. This makes prediction
harder. All classifiers seemed to have a decent performance on classifying the bronze, master, and grandmaster
leagues. As the bronze league encompasses the largest span in skill difference, and master and grandmaster
players tend to play more consistently, this result is also in line with the nature of the data set.
In short, we can conclude that based on the methods we tested on the pretest set, SMO seemed the most
suitable classifier. All algorithms followed about the same curve for their accuracy over the leagues. The gold
league seemed hardest to predict. The bronze, master, and grandmaster leagues seemed easiest to predict.
6.3 Predicting skill
Using a 5-fold cross validation, the average weighted accuracy of SMO was 47.3% (SD = 2.19) with a baseline of
25.5% (SD = 1.05). This is a significant improvement. The average error distance for the misclassified instances
was 1.55. This means that most misclassified instances are placed in a neighboring league. Due to the nature of
the rating system, players whose skill is around a class border are hard to classify. It is uncertain in what league
they should be placed. Strong players in one league could have about the same skill as weak players in the next
league. We do not know the percentage of players that could be seen as ambiguous. Therefore, we cannot say
what percentage of players can be classified with certainty.
To conclude, we can say that it is possible to build a player model to predict players’ skill. Most misclassified
instances were placed in a neighboring league which can partly be attributed to the nature of the rating system.
Since we do not know what percentage of players is around a class border, we cannot say what the upper limit
of our player model is.
6.4 The influence of game progression
For our classification problem, we already skipped the first 90 seconds of every game. We then incrementally
removed instances from every game in the test set of each fold to examine how this changes performance. The
accuracy of SMO did not change significantly. This means that at least from 2.5 minutes into
the game, time does not have a big influence on performance. We can get a decent prediction after 2.5 minutes
have been played. If we compare this to the average game length in our data set, on average we can predict skill after
about 16.7% of the game is played. After that time, performance marginally increases until the 8th instance, or
9.5 minutes into the game, when performance flattens. We only examined this until the 10th instance, so we
cannot say if performance would suddenly increase later on.
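The per-instance evaluation described above could be sketched as follows. This is an illustration rather than the actual experimental code: the `games` structure (ordered per-game lists of instances, the first taken at 2.5 minutes) and the `predict` function are assumptions.

```python
# Illustrative sketch: measure how prediction accuracy evolves with game
# time by scoring a fitted classifier separately on each time-sliced
# instance index across all games. Data layout is an assumption.

def accuracy_per_instance_index(games, predict, max_index=10):
    """Accuracy for the 1st, 2nd, ..., max_index-th instance of each game.

    games: dict mapping a game id to its ordered list of (features, league)
    instances; predict: any fitted classifier, features -> league."""
    accuracies = []
    for i in range(max_index):
        slices = [g[i] for g in games.values() if len(g) > i]
        if not slices:
            break  # no game has this many instances
        correct = sum(predict(x) == y for x, y in slices)
        accuracies.append(correct / len(slices))
    return accuracies
```

A flat resulting curve is what the result above corresponds to: accuracy at the first instance is already close to accuracy at the tenth.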
In short, we can conclude that from at least 2.5 minutes into the game, we can give a decent prediction of a
player's skill. From that moment, performance does not improve significantly over time.
6.5 Artificial Intelligence
We can use our player model to measure a human player's skill. This information could be used to scale the
difficulty of an AI to the skill of the player. One way to do this is to create a basic AI with an average economy
and an average level of micro-management. Because we can predict the skill of a player after 2.5 minutes, the AI
could then adjust its behavior to match the player. We cannot directly implement the results of our test into an
AI, because the main features that describe expertise measure the number of commands that were issued; they
do not tell us anything about the quality of those actions. Since we did not actually build and test an AI, we can
only give an example of how our player model could be used.
In short, we can model a player's skill from at least 2.5 minutes into the game. This indicates that it could be
possible to adjust the behavior of an AI to a player's skill early on in the game.
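As a purely hypothetical illustration of this idea, the sketch below maps a predicted league to an AI intensity setting. The `DIFFICULTY` table, the `BasicAI` placeholder, and the `predict_league` function are all invented for the example; they are not part of the thesis or of any existing game AI.

```python
# Hypothetical sketch: use a league prediction made ~2.5 minutes into a
# game to re-scale an AI's economy and micro-management intensity.
# All names and numbers below are illustrative assumptions.

DIFFICULTY = {
    "bronze": 0.30, "silver": 0.40, "gold": 0.50, "platinum": 0.60,
    "diamond": 0.70, "master": 0.85, "grandmaster": 1.00,
}

class BasicAI:
    """Placeholder AI whose behavior intensity can be re-scaled."""
    def __init__(self):
        self.intensity = 0.5  # start at an average level, as suggested above

    def set_intensity(self, value):
        self.intensity = value

def adapt_to_player(features, predict_league, ai):
    """Predict the player's league from observed features and re-scale the AI."""
    league = predict_league(features)
    ai.set_intensity(DIFFICULTY[league])
    return league
```

The design point is that the AI starts at an average setting and only commits to a difficulty level once enough of the game has been observed for the prediction to be reliable.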
6.6 Answering the problem statement
Based on our research, we can say that it is possible to build a player model of a player's skill. In our
research, we chose the SMO classification algorithm as a suitable classifier. We did not use cross-
validation or any optimization methods to test which classifier is most suitable or what the highest possible
performance is. SMO might perform more strongly when optimized, or another optimized algorithm might be
more effective. Further research is needed to examine this.
We do not know the upper limit of the classifier's performance, because we do not know what percentage of
players is near a class border. Based on our data, we can say that we can make a fairly accurate model of
expert players. Because our data set is skewed, we cannot say to what extent our research generalizes to
all leagues and to the entire set of games that were played. We cannot directly implement our results into an AI,
because the main features in our set describe the number of actions that were issued but say nothing about
the quality of those actions. Because we can give a fairly accurate prediction early on in the game, an AI
could adjust its behavior to a player early on. Since we did not test this, we can only advise on how our
results might be implemented in an AI.
Therefore, the answer to the problem statement is that we can build a player model of a player's skill for our
data set. We can do this from at least 2.5 minutes into the game. The model for high-ranked players is fairly
accurate. However, we do not know the upper limit of the model. We also do not know to what extent our
model is an accurate representation of the entire population, since we only used a small and skewed subset of
data. Our results indicate that an AI might be able to automatically adapt to a player's skill early on in the game,
but we did not implement this in an AI.
6.7 Limitations and recommendations for future work
This research shows that we can model the skill of players. This is an encouragement for further and more
extensive research in the domain of player modeling. Although our research has its limitations, further research
might lift some of them.
Because the replay files were intentionally uploaded and we only used a small percentage of all games played,
we do not know to what extent our data set is representative of all games that are played. The
distribution of the data over the leagues is skewed compared to the actual distribution of the population: we
have more games for the higher leagues than for the lower ones. One possible partial explanation is that
higher-ranked players play more often; it could also simply be that lower-ranked players upload their games
less often. Either way, more extensive research is needed to examine how this research scales up to the
entire set of games that are played.
Players near the border of each class are hard to classify. The class borders are nominal, but differences in
player skill are gradual. We do not know what percentage of players is near a class border, and therefore we
cannot say what the upper limit of our player model is. Further research could look at data sets with
more information about the actual skill of players.
In our research, we did not attempt to optimize any of the classifiers. Further research could attempt to
optimize the classifiers to examine how this improves performance. Other classifiers might be just as
suitable as, or better than, the one we used. Furthermore, the performance of SMO itself could be boosted
by optimization techniques.
We chose an algorithm for our research by constructing a model based on 70% of the data with the remaining
30% serving as our test set. This only indicates the performance of the algorithms on that particular test set.
We used 5-fold cross-validation to examine the performance of our classifier. However, 10-fold cross-
validation has proven to be the most effective method to assess algorithms. Further research could use this
technique to compare and test algorithms, as it provides a better estimate of the accuracy of the classifiers.
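The suggested 10-fold cross-validation could be sketched in pure Python as follows. This is a simplified stand-in for the Weka procedure used in the thesis; `fit` is assumed to be any function that trains on a list of (features, label) pairs and returns a predict function.

```python
# Illustrative sketch of k-fold cross-validation (k = 10 by default).
# A stand-in for Weka's built-in procedure, not the thesis code.
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, data, k=10):
    """Mean held-out accuracy of `fit` over k folds.

    fit: callable taking a training list of (features, label) pairs and
    returning a predict function (features -> label)."""
    folds = kfold_indices(len(data), k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        predict = fit(train)
        test = [data[j] for j in test_idx]
        accuracies.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(accuracies) / k
```

Each instance is held out exactly once, so the mean over folds gives a less optimistic estimate of accuracy than a single 70/30 split.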
The outcome of our research suggests that motor skills and visuospatial attention are important attributes of
expertise in real-time strategy games. Further research could focus on one of these two topics to examine how
exactly they influence expert skill in our domain.
We cannot directly implement the results of our test into an AI, because the main features that describe
expertise measure the number of commands that were issued; they do not tell us anything
about the quality of those actions. One way to gather this information is to include more information on
players’ strategies and tactics. One example is including the build order of a player. This is the order in which
buildings and units are built during the first few minutes of the game. Further research could incorporate more
functional and applicable behavior into the model. Our results can be regarded as an indication as to how an AI
might use our model. Further research is needed to examine to what extent it is possible to automatically adapt
an AI to a player's skill during the game.
Literature
Blizzard Entertainment. (2010). Starcraft® II: Wings of Liberty™ one-month sales break 3 million mark [press
release]. Retrieved from http://us.blizzard.com/en-us/company/press/pressreleases.html?id=2847879
Blizzard Entertainment. (2011). Master League. Retrieved from http://sea.battle.net/sc2/en/blog/116151#blog
Blizzard Entertainment. (2011). Region Linking. Retrieved from http://us.battle.net/sc2/en/blog/3085760
Blizzard Entertainment. (2012). Climbing the Ladder: How to earn a League Promotion. Retrieved from
http://us.battle.net/sc2/en/blog/4798491/Climbing_the_Ladder_How_to_earn_a_League_Promotion-
4_5_2012#blog
Campitelli, G., & Gobet, F. (2004). Adaptive expert decision making: Skilled chess players search more and
deeper, Journal of the International Computer Games Association 27(4), 209-216.
Campitelli, G., Gobet, F., Williams, G., & Parker, A. (2007). Integration of perceptual input and visual imagery in
chess players: Evidence from eye movements. Swiss Journal of Psychology, 66, 201–213.
Carmel, D., & Markovitch, S. (1993). Learning models of opponent's strategy in game playing. Proceedings of
the AAAI fall symposium on games: planning and learning, 140-147.
Chaddock, L., Neider, M., Voss, M., Gaspar, J., & Kramer, A. (2011). Do athletes excel at everyday tasks?
Medicine & Science in Sports & Exercise, 43(10), 1920-1926. doi: 10.1249/MSS.0b013e318218ca74
Chase, W. G., & Simon, H. A. (1973). Perception in chess. Cognitive Psychology, 4(1), 55-81.
Donkers, H., Uiterwijk, J., & Van den Herik, H., (2001). Probabilistic opponent-model search. Information
Sciences, 135(3–4), 123–149.
Drachen, A., Canossa, A., & Yannakakis, G. N. (2009). Player Modeling using Self-Organization in Tomb Raider:
Underworld. Proceedings of the IEEE Symposium on Computational Intelligence and Games (CIG2009),
Milano, Italy. Retrieved from http://www.itu.dk/~yannakakis/CIG09_IOI.pdf
Ericsson, K., & Tesch-Römer, C. (1993). The role of deliberate practice in the acquisition of expert performance.
Psychological Review 100(3), 363-406.
Field, A. (2009). Discovering Statistics Using SPSS (Third Edition). Sage.
Gobet, F., & Simon, H. (1996). The Roles of Recognition Processes and Look-Ahead Search in Time-Constrained
Expert Problem Solving: Evidence from Grand-Master-Level Chess. Psychological Science, 7(1) 52-55.
Gobet, F., & Simon, H. (1998). Expert Chess Memory: Revisiting the Chunking Hypothesis. Memory, 6(3),
225-255. Retrieved from http://dx.doi.org/10.1080/741942359
Houlette, R. (2004). Player Modeling for Adaptive Games. AI Game Programming Wisdom II, Charles River
Media, Inc, 557-566.
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques
(Third Edition). Morgan Kaufmann.
Iida, H., Uiterwijk, J., & Van den Herik, J. (1993). Opponent-Model Search. Technical Reports in Computer Science,
CS 93-03
Kocsis, L., Uiterwijk, J., Postma, E., & Van den Herik, J. (2003). The Neural MoveMap Heuristic in Chess. In
Schaeffer, J.; Müller, M., & Björnsson, Y. (Ed.), Computers and games: Lecture notes in computer science,
154-170. doi: 10.1007/978-3-540-40031-8_11
Land, M., & McLeod, P. (2000). From eye-movements to actions: how batsmen hit the ball. Nature
Neuroscience, 3(12), 1340-1345.
Nethaera. (2010). The Big Picture on Macro and Micro. Retrieved from
http://us.battle.net/sc2/en/blog/748044#blog
Richards, M., & Amir, E. (2007). Opponent modeling in Scrabble. Proceedings of the 20th
International Joint Conference on Artificial Intelligence (IJCAI 2007).
Schadd, F., Bakkes, S., & Spronck, P. (2007). Opponent Modeling in Real-Time Strategy Games, Proceedings of
the 8th International Conference on Intelligent Games and Simulation (GAMEON'2007), 61-68.
Schriver, A., Morrow, D., Wickens, C., & Talleur, D. (2008). Expertise Differences in Attentional Strategies Related
to Pilot Decision Making. Human Factors: The Journal of the Human Factors and Ergonomics Society, 50,
864. doi: 10.1518/001872008X374974
Van den Herik, J., Donkers, J., & Spronck, P. (2005). Opponent modelling and commercial games. Proceedings of
the IEEE 2005 Symposium on Computational Intelligence and Games, 15-25.
Van der Werf, E., Uiterwijk, J., Postma, E., & Van den Herik, J. (2002). Local move prediction in Go. In Schaeffer,
Computers and games: Lecture notes in computer science, 393-412.
Park, S.-H., & Fürnkranz, J. (2007). Efficient pairwise classification. In Kok, J., Koronacki, J., Lopez de Mantaras,
R., Matwin, S., Mladenič, D., & A. Skowron (Eds.), Proceedings of 18th European conference on machine
learning (ECML-07), 658–665.
Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. Technical
Report MSR-TR-98-14, Microsoft Research.
Reingold, E., Charness, N., Pomplun, M., & Stampe, D. (2001). Visual Span in Expert Chess Players: Evidence
From Eye Movements. Psychological Science 2001, 12(48) doi: 10.1111/1467-9280.00309
Wagner, M. (2006) On the scientific relevance of eSports. Proceedings of the 2006 International Conference on
Internet Computing and Conference on Computer Game Development, 437-440
Yannakakis, G. & Hallam, J. (2006). "Towards Capturing and Enhancing Entertainment in Computer Games".
Proceedings of the 4th Hellenic Conference on Artificial Intelligence, Lecture Notes in Artificial Intelligence,
432–442.
Appendix A
To decide whether an action is effective, we adjusted the preconditions used by the tool Sc2Gears. This
resulted in the following set of rules:
Our rules:
• If one of the commands train, research, upgrade or hatch is canceled within 1 second, both the issued
action and the cancel action are considered ineffective.
• If the same command is repeated more than twice within one second, the first command is considered
ineffective. The commands that this rule applies to are: right click, stop, hold position, move, patrol,
scan, attack, set rally point, set worker rally point, hold fire, halt, attack, stop, land, activate or
deactivate auto-repair, neural parasite, activate or deactivate auto-heal, charge, attack structure,
activate or deactivate auto-attack structure.
• Switching away from a selected unit or reselecting the same units too fast (within 0.25 sec) without
giving them any commands is considered ineffective (by 'too fast' we mean there is not even time to
check the state of the units and optionally react to it accordingly); double tapping a hotkey to center a
group of units is NOT considered ineffective.
• If one of the following commands is repeated consecutively, regardless of the time frame, it is
considered ineffective:
the same research, the same upgrade; Gather Resources, Return Cargo, Cloak, Decloak, Siege Mode,
Tank Mode, Assault Mode, Fighter Mode, Burrow, Unburrow, Phasing Mode, Transport Mode, Generate
Creep, Stop Generating Creep, Weapons Free, Cancel an Addon, Mutate into Lair, Cancel Lair Upgrade,
Mutate into Hive, Cancel Hive Upgrade, Mutate into Greater Spire, Cancel Greater Spire Upgrade,
Upgrade to Planetary Fortress, Cancel Planetary Fortress Upgrade, Upgrade to Orbital Command,
Cancel Orbital Command Upgrade, Salvage, Lift Off, Uproot, Root, Lower, Raise, Archon Warp (High
Templar or Dark Templar)
• We make blocks of 1 second to make our decisions. If, for example, one right click was issued in second
1 and 2 right clicks were issued in second 2, they are all counted as effective, because our time count is
not specific enough to say whether they were issued within 0.5 seconds of each other. If 3 right
clicks are issued within the same second, they are considered ineffective, because their average spacing
of 1/3 ≈ 0.33 seconds means the repetition was too fast.
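One possible reading of the 1-second-block rule can be sketched as follows. The data layout is an assumption, and this version flags all repeats in an over-full block, following the three-right-clicks example above rather than the "first command only" phrasing:

```python
# Illustrative sketch of the 1-second-block repetition rule: the same
# command issued more than twice within the same whole second is spaced
# at most ~0.33 s apart, below the 0.5 s threshold, so it is flagged as
# ineffective. Input layout is an assumption, not the Sc2Gears format.
from collections import defaultdict

def flag_spam(commands):
    """commands: list of (timestamp_seconds, command_name) tuples.
    Returns a parallel list of booleans, True meaning 'considered
    ineffective' under the repetition rule."""
    per_block = defaultdict(list)            # (second, command) -> indices
    for i, (t, name) in enumerate(commands):
        per_block[(int(t), name)].append(i)
    flags = [False] * len(commands)
    for indices in per_block.values():
        if len(indices) > 2:                 # more than twice in one second
            for i in indices:
                flags[i] = True
    return flags
```

Grouping by whole-second blocks mirrors the coarse time resolution described above: two clicks straddling a block boundary are never flagged, because the data cannot show whether they fell within 0.5 seconds of each other.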
We use the following rules to determine whether an action is considered macro or micro:
• Any action that costs minerals and/or gas is considered to be a macro action. This includes training
units, constructing buildings, upgrades, researches, and salvaging buildings. Canceling macro actions
and merging archons are also considered macro actions.
• Any other action is considered to be a micro action. Repair actions are considered micro actions
because they involve the micro-management of worker units, despite the fact that repairing costs money.
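A minimal sketch of the macro/micro rule, assuming the action metadata (resource cost, whether it cancels a macro action, whether it is an archon merge) is available from the replay parser; the field names are illustrative:

```python
# Illustrative sketch of the macro/micro classification rule above.
# Field names and the exception sets are assumptions for the example.

MICRO_EXCEPTIONS = {"repair"}  # micro despite costing resources

def classify_action(name, costs_resources,
                    is_macro_cancel=False, is_archon_merge=False):
    """Return 'macro' or 'micro' for one action: anything that costs
    minerals/gas is macro (plus cancels of macro actions and archon
    merges); everything else, including repairs, is micro."""
    if name in MICRO_EXCEPTIONS:
        return "micro"
    if costs_resources or is_macro_cancel or is_archon_merge:
        return "macro"
    return "micro"
```

Checking the micro exceptions first is what encodes the special case: repair costs resources but is still micro-management of worker units.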
Appendix B
The list of features as ranked by InfoGain.
weight ranking attribute
0.569 1 average micro
0.511 2 average apm
0.509 3 total eapm
0.497 4 apm micro
0.493 5 eapm
0.467 6 delta select
0.446 7 apm
0.415 8 total hotkeys select
0.256 9 selections per key set
0.233 10 total hotkeys set
0.212 11 different keys
0.163 12 delta set
0.091 13 redundancy
0.082 14 server
0.081 15 average macro
0.072 16 apm micro
0.043 17 total hotkeys add
0.042 18 delta minerals
0.037 19 delta workers
0.034 20 total workers built
0.034 21 delta resources
0.031 22 delta supply used
0.028 23 defensive structures
0.028 24 workers per geyser
0.028 25 geysers total
0.027 26 instance
0.023 27 minerals per worker
0.022 28 delta add
0.018 29 total bases built
0.015 30 total upgrades
0.015 31 total supply gained
0.015 32 research
0.014 33 opponent race
0.014 34 total resources
0.014 35 total minerals spent
0.013 36 total gas spent
0.013 37 player race
0.013 38 total supply used
0.013 39 delta gas
0.012 40 total number of fighting units
0.010 41 tier
0.010 42 different units
0.001 43 winner
league total instances Distribution of our data set Actual distribution
bronze 4082 8.2% 20%
silver 3979 8.0% 20%
gold 5195 10.4% 20%
platinum 8066 16.2% 20%
diamond 10088 20.2% 18%
master 12747 25.5% 2%
grandmaster 5751 11.5% Top 200 players
total 49908 100% 100%