
Reinforcement Learning 1

COMP538 Reinforcement Learning: Recent Development

Group 7:

Chan Ka Ki (cski@ust.hk)

Fung On Tik Andy (cpegandy@ust.hk)

Li Yuk Hin (tonyli@ust.hk)

Instructor: Nevin L. Zhang

Reinforcement Learning 2

Outline
- Introduction
- 3 Solving Methods
- Main Considerations
  - Exploration vs. Exploitation: Directed / Undirected Exploration, Function Approximation
  - Planning and Learning: Direct RL vs. Indirect RL, Dyna-Q and Prioritized Sweeping
- Conclusion on recent development

Reinforcement Learning 3

Introduction
- Agent interacts with the environment
- Goal-directed learning from interaction

[Diagram: the AI agent in state s(t) takes action a; the environment returns reward r and next state s(t+1).]

Reinforcement Learning 4

Key Features
- The agent is NOT told which actions to take, but learns by itself:
  - by trial-and-error
  - from experience
  - by exploring and exploiting
- Exploitation = the agent takes the best action based on its current knowledge
- Exploration = the agent takes a non-best action to gain more knowledge

Reinforcement Learning 5

Elements of RL
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what

Reinforcement Learning 6

Dynamic Programming
- Model-based: computes optimal policies given a perfect model of the environment as a Markov decision process (MDP)
- Bootstraps: updates estimates based in part on other learned estimates, without waiting for a final outcome
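To make the DP idea concrete, here is a minimal value-iteration sketch. The MDP representation (a dict `P[s][a]` of `(prob, next_state, reward)` triples) and the `gamma`/`theta` values are illustrative assumptions, not from the slides.

```python
# Minimal value-iteration sketch: full one-step backups over a known model.
def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}                      # initialize value estimates
    while True:
        delta = 0.0
        for s in P:
            # full backup: look one step ahead over all actions and successors
            best = max(
                sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                        # stop when estimates have converged
            return V
```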

Reinforcement Learning 7

Dynamic Programming

[Backup diagram for DP: a shallow, full-width backup over all possible one-step successors, with T marking terminal states.]

Reinforcement Learning 8

Monte Carlo
- Model-free
- Does NOT bootstrap
- Entire episode included
- Only one choice at each state (unlike DP)
- Time required to estimate one state does not depend on the total number of states
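A minimal first-visit Monte Carlo prediction sketch for this idea. It assumes each episode is a list of `(state, reward)` pairs, where the reward is the one received after leaving that state; that representation and `gamma` are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

# First-visit Monte Carlo prediction: average complete returns, no bootstrapping.
def mc_prediction(episodes, gamma=0.9):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        # return following each time step, computed backwards from the episode end
        G = 0.0
        returns_after = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns_after[t] = G
        # first-visit: only the first occurrence of a state contributes its return
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns_after[t]
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```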

Reinforcement Learning 9

Monte Carlo

[Backup diagram for Monte Carlo: a single sampled trajectory followed all the way to the terminal state T.]

Reinforcement Learning 10

Temporal Difference
- Model-free
- Bootstraps
- Partial episode included
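A minimal tabular TD(0) update sketch, showing the bootstrapping step: update V(s) toward r + γV(s') without waiting for the episode to end. The dict `V` and the step size `alpha` are illustrative assumptions.

```python
# One TD(0) prediction update from a single observed transition (s, r, s').
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # TD error: bootstrapped target minus the current estimate
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    return V
```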

Reinforcement Learning 11

Temporal Difference

[Backup diagram for TD: a single sampled step, bootstrapping from the value estimate at the next state; T marks terminal states.]

Reinforcement Learning 12

Example: Driving home

Reinforcement Learning 13

Driving home: changes recommended by Monte Carlo methods vs. changes recommended by TD methods (figure).

Reinforcement Learning 14

N-step TD Prediction
- MC and TD are extreme cases!

Reinforcement Learning 15

Averaging N-step Returns
- n-step methods were introduced to help with understanding TD(λ)
- Idea: back up an average of several returns, e.g. half of the 2-step return and half of the 4-step return:
  $R_t^{avg} = \frac{1}{2} R_t^{(2)} + \frac{1}{2} R_t^{(4)}$
- Called a complex backup: draw each component and label it with the weight for that component

Reinforcement Learning 16

Forward View of TD(λ)
- TD(λ) is a method for averaging all n-step backups, weighted by $\lambda^{n-1}$ (time since visitation)
- λ-return: $R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
- Backup using the λ-return: $\Delta V_t(s_t) = \alpha \left[ R_t^{\lambda} - V_t(s_t) \right]$
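A small sketch of the λ-return as defined above, assuming the n-step returns $R_t^{(n)}$ for n = 1..N (with the N-step return reaching the end of the episode) are already available in a list; that input is an illustrative assumption.

```python
# Lambda-return: (1-lam) weighted sum of n-step returns, with the remaining
# weight lam^(N-1) given to the final (episode-terminating) return.
def lambda_return(n_step_returns, lam=0.9):
    N = len(n_step_returns)
    R = sum((1 - lam) * lam ** (n - 1) * n_step_returns[n - 1] for n in range(1, N))
    R += lam ** (N - 1) * n_step_returns[N - 1]
    return R
```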

Reinforcement Learning 17

Forward View of TD(λ)
- Look forward from each state to determine its update from future states and rewards (figure).

Reinforcement Learning 18

Backward View of TD(λ)
- The forward view was for theory; the backward view is for mechanism
- New variable called the eligibility trace $e_t(s)$
- On each step, decay all traces by γλ and increment the trace for the current state by 1 (accumulating trace):
  $$e_t(s) = \begin{cases} \gamma \lambda\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma \lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$

Reinforcement Learning 19

Backward View
- "Shout" the TD error $\delta_t$ backwards over time
- The strength of your voice decreases with temporal distance by γλ:
  $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
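A minimal tabular TD(λ) step with an accumulating trace, matching the backward-view equations above. The dicts `V` and `e` and the parameter values are illustrative assumptions.

```python
# One backward-view TD(lambda) update from a single observed transition.
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error, "shouted backwards"
    e[s] = e.get(s, 0.0) + 1.0                               # accumulating trace for current state
    for state in list(e):
        V[state] = V.get(state, 0.0) + alpha * delta * e[state]  # credit by trace strength
        e[state] *= gamma * lam                              # decay all traces by gamma*lambda
    return V, e
```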

Reinforcement Learning 20

Forward View = Backward View
- The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

Adaptive Exploration in Reinforcement Learning

Relu Patrascu, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada (relu@pami.uwaterloo.ca)

Deborah Stacey, Dept. of Computing and Information Science, University of Guelph, Ontario, Canada (dastacey@uoguelph.ca)

Reinforcement Learning 22

Objectives
- Explain the trade-off between exploitation and exploration
- Introduce two categories of exploration methods:
  - Undirected exploration: ε-greedy exploration
  - Directed exploration: counter-based exploration, past-success directed exploration
- Function approximation: backpropagation algorithm and Fuzzy ARTMAP

Reinforcement Learning 23

Introduction
- Main problem: how to make the learning process adapt to a non-stationary environment?
- Sub-problems:
  - How to balance exploitation and exploration when the environment changes?
  - How can the function approximators adapt to the environment?

Reinforcement Learning 24

Exploitation and Exploration
- Exploit or explore?
  - To maximize reward, a learner must exploit the knowledge it already has
  - Exploring an action with small immediate reward may yield more reward in the long run
- An example: choosing a job
  - Suppose you are working at a small company with a $25,000 salary
  - You have another offer from a large enterprise, but it starts at only $12,000
  - Keeping the job at the small company guarantees a stable income
  - Working at the enterprise may offer more opportunities for promotion, which increase income in the long run

Reinforcement Learning 25

Undirected Exploration
- No bias; purely random
- E.g. ε-greedy exploration: when it explores, it chooses equally among all actions, so it is as likely to choose the worst-appearing action as it is to choose the next-to-best (see the sketch below)
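A minimal ε-greedy selection sketch: with probability ε pick uniformly at random (undirected exploration), otherwise pick the greedy action. The dict `Q` mapping actions to value estimates is an illustrative assumption.

```python
import random

# Epsilon-greedy action selection over estimated action values.
def epsilon_greedy(Q, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)            # purely random: worst action as likely as next-to-best
    return max(actions, key=lambda a: Q.get(a, 0.0))
```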

Reinforcement Learning 26

Directed Exploration
- Memorizes exploration-specific knowledge
- Biased by some features of the learning process
- E.g. counter-based techniques: favor actions that lead to a state that has not been frequently visited (see the sketch below)
- The main idea is to encourage the learner to explore:
  - parts of the state space that have not been sampled often
  - parts that have not been sampled recently
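A small counter-based exploration sketch along the lines described above: prefer the action whose predicted next state has been visited least often. The `predict_next` model function and the visit-count dict are illustrative assumptions, not from the slides.

```python
# Counter-based action choice: bias toward rarely visited successor states.
def counter_based_action(s, actions, visit_count, predict_next):
    # visit_count[state] should be incremented each time a state is actually visited
    return min(actions, key=lambda a: visit_count.get(predict_next(s, a), 0))
```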

Reinforcement Learning 27

Past-Success Directed Exploration
- Based on ε-greedy exploration
- Biases the exploration rate using feedback from the learning process, so it adapts to the environment:
  - Increase the exploitation rate if reward is received at an increasing rate
  - Increase the exploration rate when reward stops being received
- Average discounted reward:
  - Reflects the amount and frequency of received immediate rewards
  - The further back in time a reward was received, the less effect it has on the average

Reinforcement Learning 28

Past-Success Directed Exploration
- The average discounted reward is defined as:
  $$\bar{r}_t = \frac{\sum_{k=1}^{t} \gamma^{t-k} r_k}{\sum_{k=1}^{t} \gamma^{t-k}}$$
  where γ ∈ (0, 1] is the discount factor and $r_t$ is the reward received at time t
- Apply it to the ε-greedy algorithm:
  $$\epsilon_t = 0.8\, e^{-\bar{r}_t(s)} + 0.1$$
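A sketch of the past-success idea: maintain the discounted average of received rewards incrementally and map it to an exploration rate. The incremental update follows the definition above; treat the 0.8/0.1 scaling constants as an assumption recovered from the garbled slide.

```python
import math

# Adaptive exploration rate driven by the average discounted reward.
class PastSuccessExploration:
    def __init__(self, gamma=0.9):
        self.gamma = gamma
        self.weighted_sum = 0.0   # running sum of gamma^(t-k) * r_k
        self.weight_total = 0.0   # running sum of gamma^(t-k)

    def observe(self, reward):
        # discount old contributions, then add the newest reward with weight 1
        self.weighted_sum = self.gamma * self.weighted_sum + reward
        self.weight_total = self.gamma * self.weight_total + 1.0

    def epsilon(self):
        r_bar = self.weighted_sum / self.weight_total if self.weight_total else 0.0
        return 0.8 * math.exp(-r_bar) + 0.1   # more recent reward -> less exploration
```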

Reinforcement Learning 29

Gradient Descent Method
- Why use a gradient descent method?
  - RL applications use a table to store the value function
  - A large number of states makes this practically impossible
  - Solution: use a function approximator to predict the value, e.g. the error backpropagation algorithm
- Catastrophic interference:
  - Cannot learn incrementally in a non-stationary environment
  - Acquiring new knowledge makes it forget much of its previous knowledge

Reinforcement Learning 30

Gradient Descent Method (gradient-descent Sarsa(λ) with one network per action)

Initialize w arbitrarily and e = 0
Repeat (for each episode):
    Initialize s
    Pass s through each network and obtain Q_a
    a ← argmax_a Q_a
    With probability ε: a ← a random action ∈ A(s)
    Repeat (for each step of episode):
        e ← γλe;  e_a ← e_a + ∇_w Q_a
        Take action a, observe reward r and next state s'
        δ ← r − Q_a
        Pass s' through each network and obtain Q'_a
        a' ← argmax_a Q'_a
        With probability ε: a' ← a random action ∈ A(s')
        δ ← δ + γ Q'_{a'}
        w ← w + αδe
        a ← a'
    until s' is terminal

where a' ← argmax_a Q'_a means a' is set to the action for which the expression is maximal (the highest Q'), α is a constant step-size parameter (the learning rate), ∇_w Q_a is the gradient of Q_a with respect to the weights w, γ is the discount factor, e is the vector of eligibility traces, and λ ∈ (0, 1] is the eligibility-trace parameter.
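A runnable sketch of the loop above, simplified by replacing the per-action backpropagation networks with one linear approximator per action (Q_a = w_a · φ(s)). The `env` and `phi` interfaces (env.reset() → s, env.step(a) → (s_next, r, done), phi(s) → feature vector) are assumptions for illustration, not the slide's setup.

```python
import numpy as np

# Gradient-descent Sarsa(lambda) with linear function approximation.
def sarsa_lambda(env, phi, n_actions, n_features,
                 alpha=0.01, gamma=0.95, lam=0.9, eps=0.1, episodes=100):
    w = np.zeros((n_actions, n_features))
    for _ in range(episodes):
        e = np.zeros_like(w)                     # eligibility traces over all weights
        s = env.reset()
        a = _select(w, phi(s), n_actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x = phi(s)
            e *= gamma * lam                     # decay all traces
            e[a] += x                            # gradient of Q_a w.r.t. w_a is phi(s)
            delta = r - w[a] @ x                 # start the TD error with r - Q_a
            if not done:
                a_next = _select(w, phi(s_next), n_actions, eps)
                delta += gamma * w[a_next] @ phi(s_next)   # add gamma * Q'_{a'}
                s, a = s_next, a_next
            w += alpha * delta * e               # gradient-descent weight update
    return w

def _select(w, x, n_actions, eps):
    # epsilon-greedy over the approximated action values
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(w @ x))
```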

Reinforcement Learning 31

Fuzzy ARTMAP
- ARTMAP: Adaptive Resonance Theory mapping between an input vector and an output pattern
- A neural network specifically designed to deal with the stability/plasticity dilemma
- This dilemma means a neural network is unable to learn new information without damaging what was learned previously, similar to catastrophic interference

Reinforcement Learning 32

Experiments
- Gridworld with a non-stationary environment
- The learning agent can move up, down, left or right
- Two gates: the agent must pass through one of them to get from the start state to the goal state
- For the first 1000 episodes, gate 1 is open and gate 2 is closed
- For episodes 1001-5000, gate 1 is closed and gate 2 is open
- Tests how well the algorithm adapts to the changed environment

Reinforcement Learning 33

Results
- Backpropagation algorithm, after the 1000th episode:
  - The average discounted reward drops rapidly and monotonically
  - Exploration surges to its maximum
- Fuzzy ARTMAP, after the 1000th episode:
  - The reward drops for a few episodes and then goes back to high values
  - Only a temporary surge in exploration

Reinforcement Learning 34

Planning and Learning

Objectives:
- Use of environment models
- Integration of planning and learning methods

Reinforcement Learning 35

Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution model: a description of all possibilities and their probabilities, e.g. $P_{ss'}^{a}$ and $R_{ss'}^{a}$ for all $s$, $s'$, and $a \in A(s)$
- Sample model: produces sample experiences, e.g. a simulation model or a set of data
- Both types of models can be used to produce simulated experience
- Often sample models are much easier to obtain

Reinforcement Learning 36

Planning
- Planning: any computational process that uses a model to create or improve a policy

  model --(planning)--> policy

- We take the following view:
  - All state-space planning methods involve computing value functions, either explicitly or implicitly
  - They all apply backups to simulated experience

  model --> simulated experience --(backups)--> values --> policy

Reinforcement Learning 37

Learning, Planning, and Acting
- Two uses of real experience:
  - Model learning: to improve the model
  - Direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning.

Reinforcement Learning 38

Direct vs. Indirect RL
- Indirect methods: make fuller use of experience; get a better policy with fewer environment interactions
- Direct methods: simpler; not affected by bad models
- But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel

Reinforcement Learning 39

The Dyna-Q Architecture (Sutton 1990)

Reinforcement Learning 40

The Dyna-Q Architecture (Sutton 1990)
- Dyna uses real experience to build the model (T, R), uses experience to adjust the policy, and uses the model to adjust the policy
- For each interaction with the environment, experiencing <s, a, s', r>:
  1. Use experience to adjust the policy (direct RL): Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
  2. Use experience to update the model (T, R): Model(s, a) ← (s', r)
  3. Use the model to simulate experience and adjust the policy (planning), repeated N times: pick a previously observed s and a at random; (s', r) ← Model(s, a); Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
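A minimal tabular Dyna-Q sketch following the three steps above: direct RL from real experience, model learning, then N planning updates drawn from the learned model. The dict-based `Q` and `model` and the parameter values are illustrative assumptions.

```python
import random

# One Dyna-Q step for a real transition (s, a, s', r), followed by N planning updates.
def dyna_q_step(Q, model, s, a, r, s_next, actions, alpha=0.1, gamma=0.95, N=5):
    def best(state):
        return max(Q.get((state, b), 0.0) for b in actions)

    # 1. direct RL: Q-learning update from the real transition
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best(s_next) - Q.get((s, a), 0.0))
    # 2. model learning: remember what this state-action pair led to
    model[(s, a)] = (s_next, r)
    # 3. planning: replay N simulated transitions drawn from the model
    for _ in range(N):
        (ps, pa), (ps_next, pr) = random.choice(list(model.items()))
        Q[(ps, pa)] = Q.get((ps, pa), 0.0) + alpha * (pr + gamma * best(ps_next) - Q.get((ps, pa), 0.0))
    return Q, model
```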

Reinforcement Learning 41

The Dyna-Q Algorithm

[Algorithm pseudocode figure, with its steps labeled: direct RL, model learning, and planning.]

Reinforcement Learning 42

Dyna-Q Snapshots: Midway in 2nd Episode

Reinforcement Learning 43

Dyna-Q Properties
- The Dyna algorithm requires about N times the computation of Q-learning per instance
- But this is typically vastly less than that of a naïve model-based method
- N can be determined by the relative speed of computation and of taking actions
- What if the environment changes? It may change to become harder or easier.

Reinforcement Learning 44

Blocking Maze
- The changed environment is harder

Reinforcement Learning 45

Shortcut Maze
- The changed environment is easier

Reinforcement Learning 46

What is Dyna-Q+?
- Uses an "exploration bonus":
  - Keeps track of the time since each state-action pair was tried for real
  - An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting (see the sketch below)
  - The agent actually "plans" how to visit long-unvisited states
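A sketch of the exploration bonus used during planning in Dyna-Q+. The slide only says "the longer unvisited, the more reward"; the κ·√τ form is the common textbook choice (Sutton and Barto) and is an assumption here.

```python
import math

# Exploration bonus for a simulated transition whose (s, a) pair was last tried
# for real `last_tried` time steps ago.
def bonus_reward(r, last_tried, current_time, kappa=0.001):
    tau = current_time - last_tried          # time since (s, a) was tried for real
    return r + kappa * math.sqrt(tau)        # modeled reward plus exploration bonus
```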

Reinforcement Learning 47

Prioritized Sweeping
- The planning updates are no longer chosen at random
- Instead, store additional information in the model in order to make an appropriate choice of update
- Store the change of each state value, ΔV(s), and use it to set the priority of the predecessors of s, according to their transition probability T(s, a, s')

[Diagram: e.g. ΔV = 10 for one state and ΔV = 5 for another; the affected state-action pairs are queued by priority S4, S5, S2, S1, S3, from high to low.]
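A sketch of the prioritized-sweeping planning loop: instead of updating random state-action pairs, pop the pair with the largest expected value change and push its predecessors whose priority exceeds a threshold. The data structures (`model`, `predecessors`, the heap) are illustrative assumptions.

```python
import heapq

# Prioritized planning: process the highest-priority state-action pairs first.
def prioritized_sweeping(Q, model, predecessors, pqueue, actions,
                         alpha=0.1, gamma=0.95, theta=1e-4, N=5):
    def best(state):
        return max(Q.get((state, b), 0.0) for b in actions)

    for _ in range(N):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)            # highest-priority pair first
        s_next, r = model[(s, a)]
        delta = r + gamma * best(s_next) - Q.get((s, a), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
        # the change at s may matter to every predecessor (sp, ap) known to lead to s
        for sp, ap, rp in predecessors.get(s, []):
            priority = abs(rp + gamma * best(s) - Q.get((sp, ap), 0.0))
            if priority > theta:
                heapq.heappush(pqueue, (-priority, (sp, ap)))   # max-heap via negation
    return Q
```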

Reinforcement Learning 48

Prioritized Sweeping

Reinforcement Learning 49

Prioritized Sweeping vs. Dyna-Q
- Both use N = 5 backups per environmental interaction

Reinforcement Learning 50

Full and Sample (One-Step) Backups

Reinforcement Learning 51

Summary
- Emphasized the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning

Reinforcement Learning 52

RL Recent Development: Problem Modeling

Model of environment:            Known                        Unknown
Completely observable state:     MDP                          Traditional RL
Partially observable state:      Partially Observable MDP     Hidden State RL

Reinforcement Learning 53

Research topics
- Exploration-exploitation tradeoff
- Problem of delayed reward (credit assignment)
- Input generalization: function approximators
- Multi-agent reinforcement learning:
  - Global goal vs. local goal
  - Achieving several goals in parallel
  - Agent cooperation and communication

Reinforcement Learning 54

RL Application: TD-Gammon (Tesauro 1992, 1994, 1995, ...)
- 30 pieces and 24 locations imply an enormous number of configurations
- Effective branching factor of 400
- TD(λ) algorithm with a multi-layer neural network
- Plays near the level of the world's strongest grandmasters

Reinforcement Learning 55

RL Application: Elevator Dispatching (Crites and Barto 1996)

Reinforcement Learning 56

RL Application

Elevator Dispatching: conservatively about 10^22 states
- 18 hall call buttons: 2^18 combinations
- Positions and directions of cars: 18^4 (rounding to the nearest floor)
- Motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
- 40 car buttons: 2^40
- 18 discretized real numbers are available giving the elapsed time since each hall button was pushed
- The set of passengers riding each car and their destinations is observable only through the car buttons

Reinforcement Learning 57

RL Application
- Dynamic Channel Allocation (Singh and Bertsekas 1997)
- Job-Shop Scheduling (Zhang and Dietterich 1995, 1996)

Reinforcement Learning 58

Q & A
