Reinforcement Learning and Markov Decision Processes: A Quick Introduction
Hector Munoz-Avila, Stephen Lee-Urban
www.cse.lehigh.edu/~munoz/InSyTe


Page 1

Reinforcement Learning and Markov Decision Processes: A Quick Introduction

Hector Munoz-Avila

Stephen Lee-Urban
www.cse.lehigh.edu/~munoz/InSyTe

Page 2

Outline

• Introduction
  • Adaptive Game AI
  • Domination games in Unreal Tournament©
  • Reinforcement Learning
• Adaptive Game AI with Reinforcement Learning
  • RETALIATE – architecture and algorithm
• Empirical Evaluation
• Final Remarks – Main Lessons

Page 3

Introduction

Adaptive Game AI, Unreal Tournament, Reinforcement Learning

Page 4

Adaptive AI in Games

|                              | Without (shipped) Learning: Non-Stochastic | Without (shipped) Learning: Stochastic | With Learning: Offline | With Learning: Online |
|------------------------------|--------------------------------------------|----------------------------------------|------------------------|-----------------------|
| Symbolic (FOL, etc.)         | Scripts                                    | HTN Planning                           | Trained VS             | Decision Tree         |
| Sub-Symbolic (weights, etc.) | Stored NNs                                 | Genetic Alg.                           | RL offline             | RL online             |

In this class: Using Reinforcement Learning to accomplish Online Learning of Game AI for team-based First-Person Shooters

HTNbots: we presented this before

Lee-Urban et al., ICAPS 2007

http://www.youtube.com/watch?v=yO9CcEujJ64

Page 5

Adaptive Game AI and Learning

Learning – Motivation
• Combinatorial explosion of possible situations:
  • Tactics (e.g., the competing team’s tactics)
  • Game worlds (e.g., the map where the game is played)
  • Game modes (e.g., domination, capture the flag)
• Little time for development

Learning – the “Cons”
• Difficult to control and predict Game AI
• Difficult to test

Page 6

Unreal Tournament© (UT)

Online FPS developed by Epic Games Inc. in 1999

Six gameplay modes including team deathmatch and domination games

Gamebots: a client-server architecture for controlling bots, started by the U.S.C. Information Sciences Institute (ISI)

Page 7

UT Domination Games

A number of fixed domination locations.

Ownership: the team of the last player to step into the location

Scoring: a team point is awarded for every five seconds a location remains controlled

Winning: the first team to reach a pre-determined score (here, 50)

(top-down view)

Page 8

Reinforcement Learning

Page 9

Some Introductory RL Videos

• http://demo.viidea.com/ijcai09_paduraru_rle/
• http://www.youtube.com/watch?v=NR99Hf9Ke2c
• http://demo.viidea.com/ijcai09_littman_rlrl/

Page 10

Reinforcement Learning

Agents learn policies through rewards and punishments

Policy - Determines what action to take from a given state (or situation)

Agent’s goal is to maximize returns (example)

Tabular Techniques: we maintain a “Q-Table”:

Q-table: State × Action → value
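A minimal sketch of such a table in Python (the string encodings for states and actions here are hypothetical, not the RETALIATE implementation):

```python
# Tabular Q-table sketch: keys are (state, action) pairs.
from collections import defaultdict

# Unseen entries default to 0.5, the optimistic initial value used
# later on the RETALIATE initialization slide.
Q = defaultdict(lambda: 0.5)

Q[("EFE", ("L1", "L2", "L3"))] = 0.8   # learned value of one action in state "EFE"
print(Q[("EFE", ("L1", "L2", "L3"))])  # 0.8
print(Q[("FFF", ("L1", "L1", "L1"))])  # 0.5 (default initial value)
```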

Page 11

The DOM Game

[Map figure: domination points, walls, and spawn points]

Let’s write on the blackboard: a policy for this map and a potential Q-table

Page 12

Example of a Q-Table

[Table: rows are STATES, columns are ACTIONS; each cell holds the value of taking that action in that state. Highlighted for state “EFE” (enemy controls 2 DOM points): a “good” action, a “bad” action, and the best action identified so far.]

Page 13

Reinforcement Learning Problem

[Q-table: rows are STATES, columns are ACTIONS]

How can we identify, for every state, which is the BEST action to take over the long run?
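Given such a table, the greedy readout is straightforward; a sketch, assuming the dictionary-style Q above and an explicit list of actions:

```python
def greedy_action(Q, state, actions):
    """Pick the action with the highest learned Q-value in `state`."""
    return max(actions, key=lambda a: Q[(state, a)])
```

The harder part, which Q-learning addresses, is filling in the values so that this greedy choice is also the best choice over the long run.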

Page 14

Let Us Model the Problem of Finding the Best Build Order for a Zerg Rush as a Reinforcement Learning Problem
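One possible formulation, as a sketch for this exercise (all names and thresholds are illustrative, not a worked solution):

```python
# Hypothetical MDP formulation of the Zerg-rush build-order problem.
# State: a discretized summary of the economy and army.
def make_state(drones, overlords, zerglings, has_spawning_pool):
    return (drones, overlords, zerglings, has_spawning_pool)

# Actions: what to spend the next resources on.
ACTIONS = ["build_drone", "build_overlord", "build_spawning_pool", "build_zergling"]

# Reward: -1 per step until the rush is ready (say, six zerglings),
# so maximizing return means finding the fastest build order.
def reward(state):
    drones, overlords, zerglings, has_pool = state
    return 0 if zerglings >= 6 else -1
```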

Page 15

Adaptive Game AI with RL

RETALIATE (Reinforced Tactic Learning in Agent-Team Environments)

Page 16

The RETALIATE Team

• Controls two or more UT bots
• Commands bots to execute actions through the GameBots API
• The UT server provides sensory (state and event) information about the UT world and controls all gameplay
• Gamebots acts as middleware between the UT server and the Game AI

[Architecture diagram: UT server ↔ GameBots API ↔ the RETALIATE team (plug-in bots) and the opponent team (plug-in bots)]

Page 17

The RETALIATE Algorithm

1. Initialize/restore the state-action table and the initial state.
2. Begin game.
3. Observe state.
4. Choose an action:
   • with probability ε: a random applicable action
   • with probability 1 − ε: the applicable action with the maximum value in the state-action table
5. Execute the action.
6. Calculate the reward and update the state-action table.
7. Game over? If no, go to step 3; if yes, stop.
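A sketch of this loop in Python, assuming hypothetical environment hooks (observe_state, applicable_actions, execute, reward, game_over) standing in for the GameBots interface:

```python
import random

EPSILON = 0.1  # exploration probability (illustrative; the slide does not give a value)

def retaliate_episode(env, Q, alpha=0.2, gamma=0.9):
    """One game of the RETALIATE-style loop; `env` methods are stand-ins."""
    s = env.observe_state()
    while not env.game_over():
        actions = env.applicable_actions(s)
        if random.random() < EPSILON:
            a = random.choice(actions)                 # explore
        else:
            a = max(actions, key=lambda x: Q[(s, x)])  # exploit
        env.execute(a)
        s2 = env.observe_state()
        r = env.reward(s, s2)
        # Q-learning update (detailed on the "Rewards and Utilities" slide).
        best_next = max(Q[(s2, a2)] for a2 in env.applicable_actions(s2))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```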

Page 18


Initialization

• Game model (states): n is the number of domination points; a state is (Owner_1, Owner_2, …, Owner_n), where each Owner_i is one of: Team 1, Team 2, …, None
• Actions: m is the number of bots in the team; an action is (goto_1, goto_2, …, goto_m), where each goto_j is one of: loc 1, loc 2, …, loc n
• For all states s and for all actions a: Q[s, a] ← 0.5
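A sketch of this initialization for n = 3 domination points and m = 3 bots (plain Python enumerations; the counts match the growth analysis a few slides ahead):

```python
from itertools import product

N_LOC, N_BOTS = 3, 3
OWNERS = ["Team 1", "Team 2", "None"]
LOCS = [f"loc {i}" for i in range(1, N_LOC + 1)]

# States: one owner per domination point -> 3^3 = 27 states.
STATES = list(product(OWNERS, repeat=N_LOC))
# Actions: one target location per bot -> 3^3 = 27 joint actions.
ACTIONS = list(product(LOCS, repeat=N_BOTS))

# Q[s, a] <- 0.5 for all states s and all actions a.
Q = {(s, a): 0.5 for s in STATES for a in ACTIONS}
print(len(Q))  # 729 entries
```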

Page 19


Rewards and Utilities

• U(s) = F(s) − E(s), where F(s) is the number of friendly locations and E(s) is the number of enemy-controlled locations
• R = U(s′) − U(s)
• Standard Q-learning (Sutton & Barto, 1998):
  Q(s, a) ← Q(s, a) + α(R + γ max_a′ Q(s′, a′) − Q(s, a))
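As a sketch, the same reward and update in Python (states encoded as tuples of owners as above; "Team 1" is taken as the friendly team purely for illustration):

```python
def utility(state, friendly="Team 1", enemy="Team 2"):
    """U(s) = F(s) - E(s): friendly minus enemy-controlled locations."""
    return sum(o == friendly for o in state) - sum(o == enemy for o in state)

def q_update(Q, s, a, s2, next_actions, alpha=0.2, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
    r = utility(s2) - utility(s)  # R = U(s') - U(s)
    best_next = max(Q[(s2, a2)] for a2 in next_actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```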

Page 20

Rewards and Utilities (cont.)

The step-size parameter α was set to 0.2; the discount-rate parameter γ was set close to 0.9.

Thus, most recent state-reward pairs are considered more important than earlier state-reward pairs

Page 21

State Information and Actions

State information:
• Position (x, y, z)
• Player scores, team scores
• Domination location ownership
• Map, time limit, score limit, max # teams, max team size
• Navigation (path nodes, …), reachability
• Items (id, type, location, …)
• Events (hear, incoming, …)

Actions:
• SetWalk, RunTo, Stop
• Jump, Strafe, TurnTo, Rotate
• Shoot, StopShoot, ChangeWeapon

Page 22

Managing (State × Action) Growth

Our table:
• States: ({E,F,N}, {E,F,N}, {E,F,N}) = 27
• Actions: ({L1, L2, L3}, …) = 27
• Table size: 27 × 27 = 729
• Generally: 3^#loc × #loc^#bot

Adding health, discretized (high, med, low):
• States: (…, {h,m,l}) = 27 × 3 = 81
• Actions: ({L1, L2, L3, Health}, …) = 4^3 = 64
• Table size: 81 × 64 = 5184
• Generally: 3^(#loc+1) × (#loc+1)^#bot

The number of locations and the size of the team frequently vary.
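The growth formulas above, as a quick sanity check in Python:

```python
def table_size(n_loc, n_bots, with_health=False):
    """Return (#states, #actions) under the formulas on this slide."""
    if with_health:
        return 3 ** (n_loc + 1), (n_loc + 1) ** n_bots
    return 3 ** n_loc, n_loc ** n_bots

s, a = table_size(3, 3)
print(s, a, s * a)   # 27 27 729
s, a = table_size(3, 3, with_health=True)
print(s, a, s * a)   # 81 64 5184
```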

Page 23

Empirical Evaluation

Opponents, Performance Curves, Videos

Page 24

The Competitors

| Team Name        | Description |
|------------------|-------------|
| HTNBot           | HTN planning; we discussed this previously |
| OpportunisticBot | Bots go from one domination location to the next; if a location is under the control of the opponent’s team, the bot captures it |
| PossessiveBot    | Each bot is assigned a single domination location that it attempts to capture and hold during the whole game |
| GreedyBot        | Attempts to recapture any location that is taken by the opponent |
| RETALIATE        | Reinforcement learning |

Page 25

Summary of Results

Against the opportunistic, possessive, and greedy control strategies, RETALIATE won all 3 games in the tournament. Within the first half of the first game, RETALIATE developed a competitive strategy.

[Chart: score (0–60) vs. game instance (1–10), 5 runs of 10 games; RETALIATE vs. the opportunistic, possessive, and greedy opponents]

Page 26

Summary of Results: HTNBots vs RETALIATE (Round 1)

[Chart: score (−10 to 60) over time; curves for RETALIATE, HTNbots, and their difference]

Page 27

Summary of Results: HTNBots vs RETALIATE (Round 2)

[Chart: score (−10 to 60) over time; curves for RETALIATE, HTNbots, and their difference]

Page 28

Video: Initial Policy

(top-down view)

RETALIATE vs. Opponent

http://www.cse.lehigh.edu/~munoz/projects/RETALIATE/BadStrategy.wmv

Page 29

Video: Learned Policy

RETALIATE vs. Opponent

http://www.cse.lehigh.edu/~munoz/projects/RETALIATE/GoodStrategy.wmv

Page 30

Final Remarks

Lessons Learned, Future Work

Page 31

Final Remarks (1)

From our work with RETALIATE we learned the following lessons, beneficial to any real-world application of RL for these kinds of games:
• Separate individual bot behavior from team strategies.
• Model the problem of learning team tactics through a simple state formulation.

Page 32

Final Remarks (2)

• It is very hard to predict all strategies beforehand. As a result, RETALIATE was able to find a weakness and exploit it to produce a winning strategy that HTNBots could not counter.
• On the other hand, HTNBots produced winning strategies against the other opponents from the beginning, while it took RETALIATE half a game in some situations.
• Tactics emerging from RETALIATE might be difficult to predict, so a game developer will have a hard time maintaining the Game AI.

Page 33

Thank you!

Questions?