Machine Learning in Computer Game Players
Chikayama & Taura Lab.
M1 Ayato Miki
Outline
1. Introduction
2. Computer Game Players
3. Machine Learning in Computer Game Players
4. Tuning Evaluation Functions
◦ Supervised Learning
◦ Reinforcement Learning
◦ Evolutionary Algorithms
5. Conclusion
1. Introduction
Improvements in computer game players
◦ DEEP BLUE defeated Kasparov in 1997
◦ GEKISASHI and TANASE SHOGI at WCSC 2008
Strong computer game players are usually developed by strong human players
◦ Input heuristics manually
◦ Devote a lot of time and energy to tuning
Machine Learning for Games
Machine learning enables automatic tuning using a large amount of data
The developer does not need to be an expert at the game
2. Computer Game Players
Games
Game Trees
Game Tree Search
Evaluation Function
Games
Turn-based games
◦ ex. tic-tac-toe, chess, shogi, poker, mahjong…
Additional classification
◦ two-player or otherwise
◦ zero-sum or otherwise
◦ deterministic or non-deterministic
◦ perfect or imperfect information
Game Tree Model
Game Trees
[Figure: a game tree. Nodes alternate between the player's turn and the opponent's turn; edges are moves such as move 1 and move 2.]
Game Tree Search
ex. Minimax search algorithm
[Figure: a minimax tree. Max nodes take the maximum of their children and Min nodes the minimum; the leaf values propagate up through Min values 5 and 3 to a root (Max) value of 5.]
Game Tree Search
It is difficult to search all the way to the leaf nodes
◦ shogi has about 10^220 possible positions
Instead, stop the search at a practicable depth and "evaluate" the frontier nodes
◦ using an evaluation function
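The search above can be sketched as plain minimax over a small explicit tree (a hedged illustration; a real player cuts off at a fixed depth and calls the evaluation function at the frontier):

```python
# A minimal minimax sketch over an explicit game tree: a leaf is a number
# (an evaluation-function value) and an inner node is a list of children.
# The tree below mirrors the slide's example and is illustrative only.
def minimax(node, maximizing=True):
    if not isinstance(node, list):                 # leaf: return its evaluation
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Max root over two Min nodes: min(8, 5) = 5 and min(3, 6) = 3, so the root is 5.
tree = [[8, 5], [3, 6]]
print(minimax(tree))  # → 5
```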
Evaluation Function
Estimates the superiority of a position
Elements
◦ feature vector of the position
◦ parameter vector

V(s) = ω · f(s)

f(s) : feature vector of position s
ω : parameter vector
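A linear evaluation function of this form can be sketched as follows (the feature values and weights below are made-up illustrations):

```python
# A minimal linear evaluation V(s) = ω · f(s): the score of a position is
# the inner product of its feature vector and the parameter vector.
def evaluate(features, weights):
    """Score a position from its feature vector f(s) and parameters ω."""
    return sum(w * x for w, x in zip(weights, features))

f_s = [2, -1, 3]         # f(s): feature vector of some position (invented)
omega = [1.0, 0.5, 2.0]  # ω: parameter vector (invented)
print(evaluate(f_s, omega))  # → 7.5
```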
3. Machine Learning in Computer Game Players
Initial work
◦ Samuel's research [1959]
Learning objective
◦ What do computer game players learn?
Samuel's Checker Player [1959]
Many useful techniques
◦ Rote learning
◦ Quiescence search
◦ 3-layer neural network evaluation function
And some machine learning techniques
◦ Learning through self-play
◦ Temporal-difference learning
◦ Comparison training
Learning Objective
Opening book
Search control
Evaluation function
Learning Evaluation Functions
Automatic construction of the evaluation function
◦ Construct and select a feature vector automatically
◦ ex. GLEM [Buro, 1998]
◦ Difficult
Tuning evaluation function parameters
◦ Make a feature vector manually and tune its parameters automatically
◦ Easy and effective
4. Tuning Evaluation Functions
Supervised Learning
Reinforcement Learning
Evolutionary Algorithms
Supervised Learning
Provide the program with example positions and their exact evaluation values
Adjust the parameters so as to minimize the error between the evaluation function outputs and the exact values
[Figure: example positions labeled with exact values such as 20, 50, 10, and 40; the error is the gap between V(s) and the exact value.]
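A minimal sketch of this tuning loop for a linear evaluation V(s) = ω · f(s), using gradient descent on the squared error (the training positions and feature vectors below are invented for illustration):

```python
# Supervised tuning sketch: adjust ω to minimize the squared error between
# V(s) = ω · f(s) and the exact labels, by stochastic gradient descent.
def train(data, n_features, lr=0.05, epochs=500):
    w = [0.0] * n_features
    for _ in range(epochs):
        for features, target in data:
            v = sum(wi * xi for wi, xi in zip(w, features))
            err = v - target                 # V(s) − exact value
            for i, xi in enumerate(features):
                w[i] -= lr * err * xi        # descend the squared-error gradient
    return w

# Invented labeled positions: two feature elements, exact values as labels.
data = [([1, 0], 20.0), ([0, 1], 50.0), ([1, 1], 70.0)]
w = train(data, 2)
print([round(x, 1) for x in w])  # → [20.0, 50.0]
```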
Difficulty of Hard Supervised Training
Manually labeling positions is costly
Exact quantitative evaluation is difficult
→ Consider a softer approach
Comparison Training
A soft form of supervised training
Requires only the relative order of the possible moves
◦ Easier and more intuitive
Bonanza [Hoki, 2006]
Comparison training using records of expert games
Simple relative order: the expert move > other moves
Bonanza Method
Based on optimal control theory
Minimize the cost function J:

J(ω; s_1, …, s_N) = Σ_{i=1}^{N} l(s_i, ω)

l(s_i, ω) : error function
N : total number of example positions
s_i : example positions in the records
Bonanza Method
Error function:

l(s, ω) = Σ_{m=1}^{M} T[ ξ(s'_m, ω) − ξ(s'_0, ω) ]

s'_m : child position after possible move m
s'_0 : child position after the move played in the record
M : total number of possible moves
ξ : minimax search value
T : order discriminant function
Order Discriminant Function
Sigmoid function:

T(x) = 1 / (1 + e^(−kx))

◦ k is a parameter that controls the gradient
◦ When k → ∞, T(x) becomes the step function
◦ In that case, the error function means "the number of moves that were considered to be better than the move in the record"
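A small sketch of this comparison loss: it softly counts how many sibling moves are valued above the expert move (the search values here are made-up stand-ins for ξ(s'_m, ω)):

```python
import math

# Comparison-loss sketch: sum T(v_m − v_expert) over the non-expert moves,
# where T is the sigmoid order discriminant function.
def order_loss(expert_value, other_values, k=5.0):
    def T(x):  # sigmoid; approaches the step function as k → ∞
        return 1.0 / (1.0 + math.exp(-k * x))
    return sum(T(v - expert_value) for v in other_values)

# With a large k, the loss ≈ the number of moves ranked above the expert move:
# here only 0.9 exceeds 0.8, so the (rounded) loss is 1.
print(round(order_loss(0.8, [0.9, 0.5, 0.2], k=100.0)))  # → 1
```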
Bonanza
30,000 professional game records and 30,000 high-rating game records from SHOGI CLUB 24 were used
The weight parameters of about 10,000 feature elements were tuned
Bonanza won the World Computer Shogi Championship 2006
Problem of Supervised Learning
It is costly to accumulate a training data set
◦ Manual labeling takes a lot of time
◦ Using expert records has been successful
But what if there are not enough expert records?
◦ New games
◦ Minor games
Another approach needs no training set
◦ ex. Reinforcement Learning (next)
Reinforcement Learning
The learner gets "a reward" from the environment
In the domain of games, the reward is the final outcome (win/lose)
Reinforcement learning requires only objective information about the game
Reinforcement Learning
[Figure: the final outcome (e.g. +100, +200, −100) is the only reward, and it must be propagated back from the end of the game through all the earlier positions with decaying values.]
Inefficient in games…
Temporal-Difference Learning
[Figure: each value estimate is updated toward the reward plus the next position's estimate, instead of waiting for the final outcome.]

TD error = r + V(s_{t+1}) − V(s_t)
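A minimal TD(0) update sketch on a toy four-state chain (the states, reward, and step size are invented for illustration):

```python
# One TD(0) update: move V(s) toward r + V(s_next) by step size alpha.
def td_update(V, s, s_next, r, alpha=0.5):
    td_error = r + V[s_next] - V[s]   # TD error = r + V(s_{t+1}) − V(s_t)
    V[s] += alpha * td_error
    return V

V = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
# Play one episode 0 → 1 → 2 → 3; only the final transition is rewarded.
for s, s_next, r in [(0, 1, 0.0), (1, 2, 0.0), (2, 3, 1.0)]:
    td_update(V, s, s_next, r)
print(V[2])  # → 0.5: the final reward starts propagating backwards
```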
TD-Gammon [Tesauro, 1992]
Trained through self-play

Version       | Features                   | Strength
TD-Gammon 0.0 | Raw board information      | Top of computer players
TD-Gammon 1.0 | Plus additional heuristics | World-championship level
Problems of Reinforcement Learning
Falling into a local optimum
◦ Lack of playing variation
Solutions
◦ Add intentional randomness
◦ Play against various players (computer/human)
Credit Assignment Problem (CAP)
◦ Not clear which action was effective
Evolutionary Algorithm
1. Initialize population
2. Randomly vary individuals
3. Evaluate "fitness"
4. Apply selection (then repeat from 2)
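The cycle above can be sketched as a minimal elitist evolutionary loop (the toy fitness function, population size, and mutation scale are made up; this is not the exact setup of any particular study):

```python
import random

# Evolutionary-loop sketch: initialize, mutate, evaluate fitness, select.
def evolve(fitness, n_params=3, pop_size=10, generations=50, sigma=0.1):
    population = [[random.uniform(-1, 1) for _ in range(n_params)]
                  for _ in range(pop_size)]                 # 1. initialize
    for _ in range(generations):
        offspring = [[w + random.gauss(0, sigma) for w in parent]
                     for parent in population]              # 2. randomly vary
        everyone = population + offspring
        everyone.sort(key=fitness, reverse=True)            # 3. evaluate fitness
        population = everyone[:pop_size]                    # 4. apply selection
    return population[0]

# Toy objective: parameters should drift toward [1, 1, 1].
random.seed(0)
best = evolve(lambda w: -sum((x - 1) ** 2 for x in w))
print(best)  # the best individual approaches [1, 1, 1]
```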
Research of Fogel et al. [2004]
An evolutionary algorithm for a chess player
Using an open-source chess program
◦ Attempt to tune its parameters
Initialization
Create 10 initial parents
◦ Initialize parameters with random values
Variation
Create 10 offspring from each surviving parent by mutating the parental parameters:

ω'_i = ω_i + N(0, s'_i)

N(μ, σ) : Gaussian random variable
s'_i : strategy parameter
Evaluate Fitness and Selection
Each player plays ten games against randomly selected opponents
◦ Select 10 opponents randomly
The ten best players become the parents of the next generation
Material value
Positional value
Weights and biases of three neural networks
43
Tuned Parameters
Three Neural Networks
Each network has 3 layers (16 inputs, 10 hidden units, 1 output)
◦ Input = arrangement of a specific area (front 2 rows, back 2 rows, or center 4×4 square)
◦ Hidden = 10 units
◦ Output = worth of the area arrangement
Result
10 independent trials (each with 50 generations)
Initial rating = 2066 (Expert)
◦ the rating of the open-source player
Best rating = 2437 (Senior Master)
But the program cannot yet compete with the strongest chess programs (rating ~2800)
Characteristics

Method                 | Advantages               | Disadvantages
Supervised Learning    | Direct and effective     | Manual labeling cost
Reinforcement Learning | Wide application         | Local optima, CAP
Evolutionary Algorithm | Wide application, no CAP | Indirect, random dispersion
Future Work
Automatic position labeling
◦ Using records or computer play
Sophisticated rewards
◦ Consider the opponent's strength
◦ Move analysis for credit assignment
Experiments in other games