
Page 1

COMP538 Reinforcement Learning: Recent Development

Group 7:

Chan Ka Ki (cski@ust.hk)

Fung On Tik Andy (cpegandy@ust.hk)

Li Yuk Hin ([email protected])

Instructor: Nevin L. Zhang

Page 2

Outline
Introduction: three solving methods
Main considerations:
  Exploration vs. exploitation (directed / undirected exploration)
  Function approximation
Planning and learning: direct RL vs. indirect RL, Dyna-Q and prioritized sweeping
Conclusion on recent development

Page 3

Introduction
The agent interacts with its environment: goal-directed learning from interaction.
[Diagram: the AI agent, in state s(t), takes action a; the environment returns reward r and the next state s(t + 1).]

Page 4

Key Features
The agent is NOT told which actions to take; it learns by itself, by trial and error, from experience, and must both explore and exploit.
Exploitation: the agent takes the best action according to its current knowledge.
Exploration: the agent deliberately takes an action other than the current best in order to gain more knowledge.

Page 5

Elements of RL
Policy: what to do
Reward: what is good
Value: what is good because it predicts reward
Model: what follows what

Page 6

Dynamic Programming
Model-based: computes optimal policies given a perfect model of the environment as a Markov decision process (MDP).
Bootstraps: updates estimates based in part on other learned estimates, without waiting for a final outcome.
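To make the "model-based, bootstrapping" point concrete, here is a minimal value-iteration sketch in Python; the dictionary layout of P, R, and actions, and the parameters gamma and theta, are assumptions for illustration, not from the slides.

# Minimal value-iteration sketch (model-based, bootstrapping).
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the expected immediate reward;
# actions[s] lists the actions available in s.
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Full backup over all actions and successors, bootstrapping on the current V.
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in actions[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V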

Page 7

Dynamic Programming
[Backup diagram: a full one-step backup over all successor states of the current state.]

Page 8

Monte Carlo
Model-free; does NOT bootstrap; the entire episode is used.
Only one choice at each state (unlike DP, which backs up over all successors).
The time required to estimate one state does not depend on the total number of states.
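A minimal every-visit Monte Carlo prediction sketch, assuming each finished episode is available as a list of (state, reward) pairs; the incremental-average bookkeeping is an illustrative choice, not from the slides.

def mc_episode_update(V, visit_counts, episode, gamma=1.0):
    """episode: list of (state, reward_following_state) pairs for one complete episode.
    V and visit_counts can be collections.defaultdict(float) and defaultdict(int)."""
    G = 0.0
    for state, reward in reversed(episode):   # the entire episode is needed before any update
        G = gamma * G + reward                # return following this visit
        visit_counts[state] += 1
        V[state] += (G - V[state]) / visit_counts[state]   # sample average, no bootstrapping
    return V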

Page 9

Monte Carlo
[Backup diagram: a single sampled trajectory followed to the end of the episode.]

Page 10

Temporal Difference
Model-free; bootstraps; only a partial episode (one step of real experience) is needed per update.
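A minimal TD(0) prediction sketch for contrast with the Monte Carlo update above; the step size alpha and the dictionary V are illustrative assumptions.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Bootstraps: the target r + gamma * V[s_next] uses the current estimate of the next state,
    # so only one step of real experience is needed, not the whole episode.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V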

Page 11

Temporal Difference
[Backup diagram: one sampled step, then bootstrap from the estimated value of the successor state.]

Page 12

Example: Driving home

Page 13

Driving home
[Figure: changes recommended by Monte Carlo methods vs. changes recommended by TD methods.]

Page 14

N-step TD Prediction
MC and TD are extreme cases!

Page 15

Averaging N-step Returns
n-step methods were introduced to help with understanding TD(λ).
Idea: back up an average of several returns, e.g. half of the 2-step return and half of the 4-step return:
R_t^avg = (1/2) R_t^(2) + (1/2) R_t^(4)
This is called a complex backup: draw each component and label it with the weight for that component.

Page 16

Forward View of TD(λ)
TD(λ) is a method for averaging all n-step backups, weighting each n-step return by λ^(n-1) (time since visitation).
The λ-return:
R_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^(n-1) R_t^(n)
Backup using the λ-return:
ΔV_t(s_t) = α [ R_t^λ - V_t(s_t) ]

Page 17

Forward View of TD(λ)
Look forward from each state to determine its update from future states and rewards.

Page 18

Backward View of TD(λ)
The forward view is for theory; the backward view is for mechanism.
A new variable called the eligibility trace, e_t(s).
On each step, decay all traces by γλ and increment the trace for the current state by 1 (accumulating trace):
e_t(s) = γλ e_{t-1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t-1}(s) + 1    if s = s_t

Page 19

Backward View
Shout the TD error δ_t backwards over time; the strength of your voice decreases with temporal distance by γλ:
δ_t = r_{t+1} + γ V_t(s_{t+1}) - V_t(s_t)
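Putting the trace-decay rule and the TD error together, here is a minimal tabular backward-view TD(λ) sketch; the defaultdict-based V and e tables and the parameter values are assumptions for illustration.

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One backward-view TD(lambda) step; V and e are collections.defaultdict(float) tables."""
    delta = r + gamma * V[s_next] - V[s]   # TD error, "shouted" backwards over time
    e[s] += 1.0                            # accumulating trace for the current state
    for state in list(e):                  # every trace decays by gamma * lambda
        V[state] += alpha * delta * e[state]
        e[state] *= gamma * lam
    return V, e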

Page 20

Forward View = Backward View
The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating.

Page 21

Adaptive Exploration in Reinforcement Learning

Relu Patrascu
Department of Systems Design Engineering
University of Waterloo
Waterloo, Ontario, Canada
[email protected]

Deborah Stacey
Dept. of Computing and Information Science
University of Guelph
Ontario
[email protected]

Page 22

Objectives
Explain the trade-off between exploitation and exploration.
Introduce two categories of exploration methods:
  Undirected exploration: ε-greedy exploration
  Directed exploration: counter-based exploration, past-success directed exploration
Function approximation: backpropagation algorithm and Fuzzy ARTMAP

Page 23

Introduction
Main problem: how to make the learning process adapt to a non-stationary environment?
Sub-problems:
  How to balance exploitation and exploration when the environment changes?
  How can the function approximators adapt to the environment?

Page 24

Exploitation and Exploration
Exploit or explore?
To maximize reward, a learner must exploit the knowledge it already has.
Exploring means taking an action with a small immediate reward that may yield more reward in the long run.
An example: choosing a job.
  Suppose you are working at a small company with a $25,000 salary.
  You have another offer from a large enterprise, but it starts at only $12,000.
  Keeping the job at the small company guarantees a stable income.
  Working at the enterprise may offer more opportunities for promotion, which would increase your income in the long run.

Page 25

Undirected Exploration
No bias: purely random.
E.g. ε-greedy exploration: when it explores, it chooses equally among all actions, so it is as likely to choose the worst-appearing action as the next-to-best.
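A minimal ε-greedy selection sketch in Python illustrating the "purely random" exploration step; the Q-table keyed by (state, action) pairs is an assumption for illustration.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        # Undirected: chooses equally among ALL actions, even the worst-appearing one.
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # otherwise exploit current knowledge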

Page 26

Directed Exploration
Memorizes exploration-specific knowledge; exploration is biased by features of the learning process.
E.g. counter-based techniques favor actions that lead to states that have not been frequently visited.
The main idea is to encourage the learner to explore:
  parts of the state space that have not been sampled often
  parts that have not been sampled recently
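For contrast with ε-greedy, a minimal counter-based selection sketch; the visit-count table and the bonus weight beta are illustrative assumptions, not the exact rule from the literature.

def counter_based_action(Q, counts, state, actions, beta=1.0):
    # Directed: bias toward actions whose (state, action) pair has been tried least often.
    return max(actions, key=lambda a: Q[(state, a)] + beta / (1.0 + counts[(state, a)]))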

Page 27

Past-Success Directed Exploration
Based on ε-greedy exploration, with the bias adapted to the environment from the learning process:
  Increase the exploitation rate if the agent receives reward at an increasing rate.
  Increase the exploration rate when it stops receiving reward.
The average discounted reward reflects the amount and frequency of received immediate rewards; the further back in time a reward is, the less effect it has on the average.

Page 28

Past-Success Directed Exploration
The average discounted reward is defined as:
r̄_t = ( Σ_{k=1}^{t} γ^(t-k) r_k ) / ( Σ_{k=1}^{t} γ^(t-k) )
where γ ∈ (0, 1] is the discount factor and r_t is the reward received at time t.
Apply it to the ε-greedy algorithm: the exploration rate ε_t(s) is set as a decreasing function of r̄_t, using the constants 0.8 and 0.1.
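A sketch of the idea under stated assumptions: the exact mapping from average discounted reward to ε is not fully legible on the slide, so the decreasing function below (using the slide's constants 0.8 and 0.1) is illustrative only, not the paper's exact formula.

import math

def update_discounted_average(num, den, r, gamma=0.9):
    """Incrementally maintain r_bar_t = num / den, the average discounted reward defined above."""
    num = gamma * num + r        # numerator:   sum over k of gamma^(t-k) * r_k
    den = gamma * den + 1.0      # denominator: sum over k of gamma^(t-k)
    return num, den

def adaptive_epsilon(num, den):
    """Illustrative decreasing function of r_bar; the constants 0.8 and 0.1 come from the slide."""
    r_bar = num / den if den > 0 else 0.0
    return 0.1 + 0.8 * math.exp(-max(r_bar, 0.0))   # more recent reward -> less exploration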

Page 29

Gradient Descent Method
Why use a gradient descent method?
  RL applications typically use a table to store the value function; a large number of states makes this practically impossible.
  Solution: use a function approximator to predict the values.
Error backpropagation algorithm:
  Catastrophic interference: it cannot learn incrementally in a non-stationary environment; acquiring new knowledge makes it forget much of its previous knowledge.

Page 30

Gradient Descent Method

Initialize w arbitrarily and e = 0Repeat (for each episode):

Initialize sPass s through each network and obtain Qa

a arg maxa Qa

With probability : a a random action A(s)Repeat (for each step of episode):

e eea ea wQa

Take action a, observe reward, r and next state, s’ r – Qa

Pass s’ through each network and obtain Q’a

a’ arg maxa Q’a

With probability : a a random action A(s’) + Q’a

w w + ea a’until s’ is terminal

 

where a’ arg maxa Q’a means a’ is set to the action for which the expression is maximal, in this case the highest Q’ is a constant step size parameter named the learning ratewQa is the partial derivative of Qa with respect to the weights w the discount factor e the vector of eligibility traces (0, 1] is the eligibility trace parameter

Page 31

Fuzzy ARTMAP
ARTMAP (Adaptive Resonance Theory mapping) maps between an input vector and an output pattern.
It is a neural network specifically designed to deal with the stability/plasticity dilemma.
This dilemma means a neural network is unable to learn new information without damaging what was learned previously, similar to catastrophic interference.

Page 32

Experiments
Gridworld with a non-stationary environment.
The learning agent can move up, down, left, or right.
Two gates: the agent must pass through one of them to get from the start state to the goal state.
For the first 1000 episodes, gate 1 is open and gate 2 is closed; for episodes 1001-5000, gate 1 is closed and gate 2 is open.
This tests how well the algorithm adapts to the changed environment.

Page 33

Results
Backpropagation algorithm: after the 1000th episode, the average discounted reward drops rapidly and monotonically, and the agent surges to maximum exploitation.
Fuzzy ARTMAP: after the 1000th episode, the reward drops for a few episodes and then goes back to high values, with a temporary surge in exploration.

Page 34

Planning and Learning
Objectives:
  Use of environment models
  Integration of planning and learning methods

Page 35

Models
Model: anything the agent can use to predict how the environment will respond to its actions.
Distribution model: a description of all possibilities and their probabilities, e.g. P^a_{ss'} and R^a_{ss'} for all s, s', and a ∈ A(s).
Sample model: produces sample experiences, e.g. a simulation model or a set of data.
Both types of model can be used to produce simulated experience.
Sample models are often much easier to obtain.

Page 36

Planning
Planning: any computational process that uses a model to create or improve a policy.
We take the following view:
  all state-space planning methods involve computing value functions, either explicitly or implicitly
  they all apply backups to simulated experience
[Diagram: model → (planning) → policy; expanded as model → simulated experience → (backups) → values → policy.]

Page 37

Learning, Planning, and Acting
Two uses of real experience:
  model learning: to improve the model
  direct RL: to directly improve the value function and policy
Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning.

Page 38

Direct vs. Indirect RL
Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions.
Direct methods are simpler and are not affected by bad models.
But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.

Page 39

The Dyna-Q Architecture (Sutton 1990)

Page 40

The Dyna-Q Architecture (Sutton 1990)
Dyna uses experience to build the model (T, R), uses experience to adjust the policy, and uses the model to adjust the policy.
For each interaction with the environment, experiencing <s, a, s', r>:
1. Use experience to adjust the policy (direct RL):
   Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
2. Use experience to update the model (T, R):
   Model(s,a) ← (s', r)
3. Use the model to simulate experience and adjust the policy (planning), repeated N times:
   s ← a random previously observed state; a ← a random action previously taken in s
   (s', r) ← Model(s, a)
   Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
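A minimal tabular Dyna-Q sketch combining the three steps above; the data structures (a defaultdict Q-table, a dict model) and the parameter values are assumptions for illustration.

import random

def dyna_q_step(Q, model, s, a, r, s_next, actions, n_planning=5, alpha=0.1, gamma=0.95):
    """One real interaction <s, a, s', r> followed by n_planning simulated (planning) updates.
    Q can be a collections.defaultdict(float); model maps (s, a) -> (s_next, r)."""
    def backup(s, a, r, s_next):
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])

    backup(s, a, r, s_next)            # 1. direct RL from real experience
    model[(s, a)] = (s_next, r)        # 2. model learning (deterministic model)
    for _ in range(n_planning):        # 3. planning: replay random previously observed pairs
        (ps, pa), (ps_next, pr) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps_next)
    return Q, model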

Page 41

The Dyna-Q Algorithm
[Algorithm figure, with the steps labelled: direct RL, model learning, planning.]

Page 42

Dyna-Q Snapshots: Midway in 2nd Episode

Page 43

Dyna-Q Properties
The Dyna algorithm requires about N times the computation of Q-learning per instance, but this is typically vastly less than that of a naive model-based method.
N can be determined by the relative speed of computation and of taking actions.
What if the environment changes? It may become harder or easier.

Page 44

Blocking Maze
The changed environment is harder.

Page 45

Shortcut Maze
The changed environment is easier.

Page 46

What is Dyna-Q+?
It uses an "exploration bonus":
  Keeps track of the time since each state-action pair was tried for real.
  An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting.
  The agent actually "plans" how to visit long-unvisited states.
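A small sketch of the exploration bonus applied during planning; the specific kappa * sqrt(tau) form follows Sutton and Barto's description of Dyna-Q+ and is an assumption here, since the slide does not show the formula.

import math

def planning_reward_with_bonus(r, last_tried_step, current_step, kappa=0.001):
    """Bonus grows with the time since (s, a) was last tried for real."""
    tau = current_step - last_tried_step
    return r + kappa * math.sqrt(tau)   # assumed Dyna-Q+ style bonus, not shown on the slide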

Page 47

Prioritized Sweeping
The updating of the model is no longer random; instead, additional information is stored in the model in order to make an appropriate choice of which update to perform.
Store the change ΔV(s) of each state value, and use it to modify the priority of the predecessors of s, according to their transition probabilities T(s, a, s').
[Example figure: state-value changes Δ = 10 and Δ = 5 propagate to the predecessor states s1-s5; the resulting priority order is s4, s5, s2, s1, s3 (high to low).]
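A condensed prioritized-sweeping planning sketch using a priority queue over state-action pairs; the deterministic model, the predecessor bookkeeping, and the threshold theta are assumptions for illustration.

import heapq

def prioritized_sweeping_planning(Q, model, predecessors, pqueue, actions,
                                  n_updates=5, alpha=0.1, gamma=0.95, theta=1e-4):
    """Pop the highest-priority (s, a) pairs, back them up, then queue their predecessors.
    model[(s, a)] = (next_state, reward); predecessors[s] = set of (s_pred, a_pred) leading to s."""
    for _ in range(n_updates):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)        # priorities are stored negated (heapq is a min-heap)
        s_next, r = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        for (sp, ap) in predecessors[s]:         # predecessors' priorities may have changed
            rp = model[(sp, ap)][1]
            priority = abs(rp + gamma * max(Q[(s, b)] for b in actions) - Q[(sp, ap)])
            if priority > theta:
                heapq.heappush(pqueue, (-priority, (sp, ap)))
    return Q, pqueue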

Page 48

Prioritized Sweeping

Page 49

Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per environmental interaction.

Page 50

Full and Sample (One-Step) Backups

Page 51

Summary
Emphasized the close relationship between planning and learning.
Important distinction between distribution models and sample models.
Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning.

Page 52

RL Recent Development: Problem Modeling

                          Model of environment
                          Known                        Unknown
Completely observable     MDP                          Traditional RL
Partially observable      Partially Observable MDP     Hidden State RL

Page 53

Research Topics
Exploration-exploitation tradeoff
Problem of delayed reward (credit assignment)
Input generalization: function approximators
Multi-agent reinforcement learning: global goal vs. local goals, achieving several goals in parallel, agent cooperation and communication

Page 54

RL Application: TD-Gammon
Tesauro 1992, 1994, 1995, ...
30 pieces and 24 locations imply an enormous number of configurations.
Effective branching factor of 400.
Uses the TD(λ) algorithm with a multi-layer neural network.
Plays near the level of the world's strongest grandmasters.

Page 55

RL Application: Elevator Dispatching
Crites and Barto 1996

Page 56

RL Application: Elevator Dispatching
Conservatively about 10^22 states:
  18 hall call buttons: 2^18 combinations
  positions and directions of cars: 18^4 (rounding to the nearest floor)
  motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
  40 car buttons: 2^40
18 discretized real numbers are available giving the elapsed time since each hall button was pushed.
The set of passengers riding each car and their destinations is observable only through the car buttons.

Page 57

RL Applications
Dynamic channel allocation: Singh and Bertsekas 1997
Job-shop scheduling: Zhang and Dietterich 1995, 1996

Page 58

Q & A