Demo Play: http://youtu.be/rA9SgaoY1jY
TETRIS WAR
Team 4
2008.12.1
JungKyu Lee
SangYun Kim
Shinnyue Kang
Contents

• General idea description
  – Approximation using a feature-based MDP
  – Policy iteration
• Apply to Tetris
  – Problem description
  – MDP formulation
  – Feature-based MDP formulation
• Result
• Conclusion
General Idea

• Infinite-horizon MDP with discount factor α, 0 < α ≤ 1
• Goal: find a policy that maximizes the value function (cost-to-go vector) V
  – Let the policy be π = {μ0, μ1, ...}, where each μt : X → A maps states to actions
Cost-to-go value

• Definition of the optimal cost-to-go vector V*:
  V*(i) = max_π E[ Σ_{t≥0} α^t r(i_t, μ_t(i_t)) | i_0 = i ]
• By Bellman's optimality equation,
  V*(i) = max_{a∈A} Σ_j p_ij(a) ( r(i, a, j) + α V*(j) )
• Using an optimal stationary policy π* = {μ*, μ*, ...},
• the optimality equation is given as
  V*(i) = Σ_j p_ij(μ*(i)) ( r(i, μ*(i), j) + α V*(j) )
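The Bellman backup can be sketched for a small generic MDP; the arrays P, R and the discount α below are illustrative stand-ins (Tetris itself is far too large for this tabular form):

```python
import numpy as np

# Minimal sketch of the Bellman optimality backup on a small generic MDP.
# P[a] is an |X| x |X| transition matrix and R[a] an |X| x |X| reward matrix;
# both are hypothetical stand-ins, not the (huge) Tetris model itself.
def bellman_backup(V, P, R, alpha):
    """One sweep of V(i) <- max_a sum_j P_ij(a) * (R(i,a,j) + alpha * V(j))."""
    q = np.stack([(P[a] * (R[a] + alpha * V[None, :])).sum(axis=1)
                  for a in range(len(P))])   # Q(a, i) for every action
    return q.max(axis=0)                     # greedy over actions

# Repeating the backup until V stops changing gives the fixed point V*
# (assuming alpha < 1 so the backup is a contraction).
def value_iteration(P, R, alpha, tol=1e-8):
    V = np.zeros(P[0].shape[0])
    while True:
        V_new = bellman_backup(V, P, R, alpha)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```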
Policy iteration

• Policy iteration improves the current policy μ_t greedily:
  μ_{t+1}(i) = argmax_{a∈A} Σ_j p_ij(a) ( r(i, a, j) + α V_{μ_t}(j) )
• The value is updated as follows:
  V_{t+1} = V_{μ_{t+1}}
• The vector V_{μ_{t+1}} has components given by
  V_{μ_{t+1}}(i) = E[ Σ_{k≥0} α^k r(i_k, μ_{t+1}(i_k)) | i_0 = i ]
• Temporal difference (TD) associated with each transition (i, j) under μ_{t+1}:
  d(i, j) = r(i, μ_{t+1}(i), j) + α V(j) − V(i)
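A minimal tabular sketch of this loop, reusing the same hypothetical small-MDP arrays; exact policy evaluation stands in here for the simulation-based evaluation used later in the project:

```python
import numpy as np

# Tabular policy iteration matching the update above. P, R, alpha are the
# same hypothetical small-MDP arrays as in the previous sketch.
def policy_iteration(P, R, alpha, iters=100):
    n_actions, n_states = len(P), P[0].shape[0]
    mu = np.zeros(n_states, dtype=int)             # current policy mu_t
    for _ in range(iters):
        # Policy evaluation: solve V = r_mu + alpha * P_mu V exactly.
        P_mu = np.array([P[mu[i]][i] for i in range(n_states)])
        r_mu = np.array([(P[mu[i]][i] * R[mu[i]][i]).sum()
                         for i in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - alpha * P_mu, r_mu)
        # At this V, the TD  d(i,j) = r + alpha*V(j) - V(i)  is zero in
        # expectation along transitions generated by mu.
        # Policy improvement: mu_{t+1}(i) = argmax_a sum_j P_ij(a)(r + alpha V(j)).
        q = np.stack([(P[a] * (R[a] + alpha * V[None, :])).sum(axis=1)
                      for a in range(n_actions)])
        mu_new = q.argmax(axis=0)
        if np.array_equal(mu_new, mu):
            break                                  # policy is stable: done
        mu = mu_new
    return mu, V
```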
Tetris

• Board size
  – Width 10; height 22
• Blocks
  – 7 pieces, each drawn with a predetermined probability
• Score
  – (number of erased lines)² × 100
• Actions
  – Left, Right, Rotate, No move
MDP formulation

• MDP model for Tetris
  – States: X = {wall configuration + current piece}
  – Actions: A = {rotate, right, left, no move}
  – Transitions: deterministic new wall after (i, a), plus a uniformly random new piece
  – Reward: r(i, a, j) = number of lines removed after (i, a)
• A value function can be computed only on the set of wall configurations.
• The optimal value function V* is the best average score!
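A sketch of this transition structure; `place_piece` is a hypothetical helper for the deterministic drop, and pieces are drawn uniformly here for simplicity:

```python
import random

# Sketch of the Tetris MDP described above. `place_piece` is a hypothetical
# helper that drops the current piece onto the wall after the chosen
# rotation/translation action and returns (new_wall, lines_removed).
PIECES = list(range(7))   # the 7 tetrominoes, drawn uniformly here

def step(state, action, place_piece):
    wall, piece = state
    new_wall, lines_removed = place_piece(wall, piece, action)  # deterministic part
    next_piece = random.choice(PIECES)                          # random part
    reward = lines_removed                                      # r(i, a, j)
    return (new_wall, next_piece), reward
```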
Approximation for Tetris

• The number of states is far too large to compute the value function exactly
  – Use a feature-based MDP instead
• Features (as extracted in the sketch below)
  – Height of each column of the board
  – Absolute height difference between adjacent columns
  – Maximum column height
  – Number of holes
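A possible extraction of these features, assuming the board is a 22×10 boolean numpy array with row 0 at the top; the trailing constant bias feature is an extra assumption of this sketch, not listed above:

```python
import numpy as np

# Feature extraction for a board given as a boolean array of shape
# (height=22, width=10), True where a cell is occupied; row 0 is the top.
def features(board):
    h, w = board.shape
    tops = np.argmax(board, axis=0)                     # first occupied row
    heights = np.where(board.any(axis=0), h - tops, 0)  # 0 for empty columns
    diffs = np.abs(np.diff(heights))                    # adjacent-column differences
    max_height = heights.max()
    # Holes: empty cells lying below the top occupied cell of their column.
    holes = sum(int((~board[h - heights[c]:, c]).sum()) for c in range(w))
    # Trailing 1.0 is a constant bias feature (an assumption of this sketch).
    return np.concatenate([heights, diffs, [max_height, holes], [1.0]])
```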
Value function

• We define an approximate value function Ṽ using the features above
• where φ(i) is the vector of features for state i and w is a weight vector
• Finally, Ṽ is linear in the weights:
  Ṽ(i, w) = Σ_k w_k φ_k(i) = wᵀ φ(i)
• Our decision rule is as follows (see the sketch below):
  μ(i) = argmax_{a∈A} [ r(i, a) + Ṽ(f(i, a), w) ], where f(i, a) is the wall reached from state i under action a
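A sketch of this decision rule, building on the `features` extractor above; `successor` is a hypothetical helper returning the deterministic next wall and the immediate reward for an action:

```python
import numpy as np

# Greedy one-step lookahead with the linear value function V~(i) = w . phi(i).
# `features` is the extractor sketched earlier; `successor` is a hypothetical
# helper returning (next_board, reward) for placing the piece under an action.
def value(w, board):
    return float(w @ features(board))     # V~(i) = sum_k w_k * phi_k(i)

def choose_action(w, state, actions, successor):
    def score(a):
        next_board, reward = successor(state, a)
        return reward + value(w, next_board)
    return max(actions, key=score)        # argmax_a [ r(i,a) + V~(f(i,a)) ]
```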
Weight vector

• Iterate on the weights so that Ṽ approaches the optimal value function V* (Ṽ → V*) within policy iteration
• The new weight vector minimizes a least-squares criterion over the simulated games (equation 1):
  w_{t+1} = argmin_w Σ_{m=1}^{M} Σ_{k=0}^{N_m−1} ( Ṽ(i_k^m, w) − Ṽ(i_k^m, w_t) − Σ_{q=k}^{N_m−1} λ^{q−k} d(i_q^m, i_{q+1}^m) )²
  – M : number of games
  – i_0^m, …, i_{N_m}^m : state sequence of the m-th game
  – i_{N_m}^m : termination state of game m
Minimum squared error technique using the pseudoinverse

• To solve equation (1):
• Goal: find a weight vector a satisfying the following system:
  Y a = b
  – d : # of features (columns of Y)
  – n : # of samples (rows of Y)
• Formal solution, if Y were square and nonsingular:
  a = Y⁻¹ b
• Error vector:
  e = Y a − b
Squared-error criterion function

• Minimize the squared length of the error vector e = Y a − b
• Define the error criterion function
  J(a) = ‖Y a − b‖² = Σ_{i=1}^{n} (aᵀ y_i − b_i)²
• Using the gradient to simplify,
  ∇J(a) = Σ_{i=1}^{n} 2 (aᵀ y_i − b_i) y_i = 2 Yᵀ (Y a − b)
• The necessary condition ∇J(a) = 0 yields
  Yᵀ Y a = Yᵀ b, so a = (Yᵀ Y)⁻¹ Yᵀ b = Y† b
  – Y† : pseudoinverse of Y
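A small numpy check of this solution; the data below are made up purely for illustration:

```python
import numpy as np

# Demonstration that the pseudoinverse solves min_a ||Y a - b||^2.
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 5))      # n = 20 samples, d = 5 features
b = rng.normal(size=20)

a = np.linalg.pinv(Y) @ b         # a = Y† b
# Equivalent, and numerically preferred in practice:
a_lstsq, *_ = np.linalg.lstsq(Y, b, rcond=None)

assert np.allclose(a, a_lstsq)
# The normal equations Yᵀ Y a = Yᵀ b hold at the minimizer:
assert np.allclose(Y.T @ Y @ a, Y.T @ b)
```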
Apply to the Tetris problem

• Let each row of Y be a feature vector φ(i_k^m)ᵀ, and let the corresponding entry of b be the target Ṽ(i_k^m, w_t) + Σ_{q=k}^{N_m−1} λ^{q−k} d(i_q^m, i_{q+1}^m)
• Equation (1) then becomes the least-squares problem Y w ≈ b, solved by w = Y† b
  – M : # of games
  – n : # of samples (board states visited over all M games)
  – φ : feature vector
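A sketch tying equation (1) to the pseudoinverse solve; `play_game` is a hypothetical helper returning the boards visited in one simulated game and the temporal differences along its transitions, and `features` is the extractor sketched earlier:

```python
import numpy as np

# One weight update built from M simulated games, as in equation (1).
# `play_game(w)` is a hypothetical helper returning (boards, tds), where
# tds[q] = d(i_q, i_{q+1}) along the game; len(tds) == len(boards) - 1.
def update_weights(w, play_game, n_games, lam=0.6):
    rows, targets = [], []
    for _ in range(n_games):                    # M games
        boards, tds = play_game(w)
        n = len(tds)
        for k in range(n):
            phi = features(boards[k])           # one row of Y: phi(i_k^m)
            # Matching entry of b: V~(i_k, w) + sum_{q>=k} lam^(q-k) * d_q.
            target = float(w @ phi) + sum(lam ** (q - k) * tds[q]
                                          for q in range(k, n))
            rows.append(phi)
            targets.append(target)
    Y, b = np.array(rows), np.array(targets)    # n samples in total
    return np.linalg.pinv(Y) @ b                # new w = Y† b
```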
Simulation Result

• Settings
  – λ = 0.6
  – Tested over 100 games, using random seeds 0 to 100
• The simple TD algorithm is our heuristic baseline
• Our learning algorithm improves the heuristic algorithm's score by 2010%
Simulation result

[Results figure omitted.]
Conclusion

• Goal of the project
  – Build an algorithm that achieves the highest average score
• Our learning algorithm is powerful
  – Its average and maximum scores compare favorably with the heuristic algorithm
• Problem: deviation
  – Deviation: difference between the highest and lowest scores
  – Our learning algorithm gives a large deviation
• Suggestion
  – Reduce the deviation without lowering the average score