
Reinforcement Learning 1

COMP538 Reinforcement Learning: Recent Development

Group 7:

Chan Ka Ki (cski@ust.hk)

Fung On Tik Andy (cpegandy@ust.hk)

Li Yuk Hin (tonyli@ust.hk)

Instructor: Nevin L. Zhang

Reinforcement Learning 2

Outline
- Introduction
- 3 Solving Methods
- Main Considerations
  - Exploration vs. Exploitation: Directed / Undirected Exploration, Function Approximation
  - Planning and Learning: Direct RL vs. Indirect RL, Dyna-Q and Prioritized Sweeping
- Conclusion on recent development

Reinforcement Learning 3

Introduction
- Agent interacts with the environment
- Goal-directed learning from interaction

[Diagram: the AI agent in state s(t) takes action a; the environment returns reward r and next state s(t+1).]

Reinforcement Learning 4

Key Features
- The agent is NOT told which actions to take, but learns by itself:
  - by trial-and-error
  - from experience
  - by exploring and exploiting
- Exploitation = the agent takes the best action based on its current knowledge
- Exploration = the agent takes a non-best action to gain more knowledge

Reinforcement Learning 5

Elements of RL
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what

Reinforcement Learning 6

Dynamic Programming
- Model-based: computes optimal policies given a perfect model of the environment as a Markov decision process (MDP)
- Bootstraps: updates estimates based in part on other learned estimates, without waiting for a final outcome
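To make the DP idea concrete, here is a minimal value-iteration sketch. The MDP representation (a dict `P[s][a]` of `(prob, next_state, reward)` triples) and the `gamma`/`theta` values are illustrative assumptions, not from the slides.

```python
# Minimal value-iteration sketch: full one-step backups over a known model.
def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}                      # initialize value estimates
    while True:
        delta = 0.0
        for s in P:
            # full backup: look one step ahead over all actions and successors
            best = max(
                sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                        # stop when estimates have converged
            return V
```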

Reinforcement Learning 7

Dynamic Programming

[Backup diagram for DP: a shallow, full-width backup over all possible one-step successors, with T marking terminal states.]

Reinforcement Learning 8

Monte Carlo
- Model-free
- Does NOT bootstrap
- Entire episode included
- Only one choice at each state (unlike DP)
- Time required to estimate one state does not depend on the total number of states
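A minimal first-visit Monte Carlo prediction sketch for this idea. It assumes each episode is a list of `(state, reward)` pairs, where the reward is the one received after leaving that state; that representation and `gamma` are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

# First-visit Monte Carlo prediction: average complete returns, no bootstrapping.
def mc_prediction(episodes, gamma=0.9):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        # return following each time step, computed backwards from the episode end
        G = 0.0
        returns_after = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns_after[t] = G
        # first-visit: only the first occurrence of a state contributes its return
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns_after[t]
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```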

Reinforcement Learning 9

Monte Carlo

[Backup diagram for Monte Carlo: a single sampled trajectory followed all the way to the terminal state T.]

Reinforcement Learning 10

Temporal Difference
- Model-free
- Bootstraps
- Partial episode included
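A minimal tabular TD(0) update sketch, showing the bootstrapping step: update V(s) toward r + γV(s') without waiting for the episode to end. The dict `V` and the step size `alpha` are illustrative assumptions.

```python
# One TD(0) prediction update from a single observed transition (s, r, s').
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # TD error: bootstrapped target minus the current estimate
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    return V
```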

Reinforcement Learning 11

Temporal Difference

[Backup diagram for TD: a single sampled step, bootstrapping from the value estimate at the next state; T marks terminal states.]

Reinforcement Learning 12

Example: Driving home

Reinforcement Learning 13

Driving home: changes recommended by Monte Carlo methods vs. changes recommended by TD methods (figure).

Reinforcement Learning 14

N-step TD Prediction
- MC and TD are extreme cases!

Reinforcement Learning 15

Averaging N-step Returns
- n-step methods were introduced to help with understanding TD(λ)
- Idea: back up an average of several returns, e.g. half of the 2-step return and half of the 4-step return:
  $R_t^{avg} = \frac{1}{2} R_t^{(2)} + \frac{1}{2} R_t^{(4)}$
- Called a complex backup: draw each component and label it with the weight for that component

Reinforcement Learning 16

Forward View of TD(λ)
- TD(λ) is a method for averaging all n-step backups, weighted by $\lambda^{n-1}$ (time since visitation)
- λ-return: $R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
- Backup using the λ-return: $\Delta V_t(s_t) = \alpha \left[ R_t^{\lambda} - V_t(s_t) \right]$
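A small sketch of the λ-return as defined above, assuming the n-step returns $R_t^{(n)}$ for n = 1..N (with the N-step return reaching the end of the episode) are already available in a list; that input is an illustrative assumption.

```python
# Lambda-return: (1-lam) weighted sum of n-step returns, with the remaining
# weight lam^(N-1) given to the final (episode-terminating) return.
def lambda_return(n_step_returns, lam=0.9):
    N = len(n_step_returns)
    R = sum((1 - lam) * lam ** (n - 1) * n_step_returns[n - 1] for n in range(1, N))
    R += lam ** (N - 1) * n_step_returns[N - 1]
    return R
```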

Reinforcement Learning 17

Forward View of TD(λ)
- Look forward from each state to determine its update from future states and rewards (figure).

Reinforcement Learning 18

Backward View of TD(λ)
- The forward view was for theory; the backward view is for mechanism
- New variable called the eligibility trace $e_t(s)$
- On each step, decay all traces by γλ and increment the trace for the current state by 1 (accumulating trace):
  $$e_t(s) = \begin{cases} \gamma \lambda\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma \lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$

Reinforcement Learning 19

Backward View
- "Shout" the TD error $\delta_t$ backwards over time
- The strength of your voice decreases with temporal distance by γλ:
  $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
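A minimal tabular TD(λ) step with an accumulating trace, matching the backward-view equations above. The dicts `V` and `e` and the parameter values are illustrative assumptions.

```python
# One backward-view TD(lambda) update from a single observed transition.
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error, "shouted backwards"
    e[s] = e.get(s, 0.0) + 1.0                               # accumulating trace for current state
    for state in list(e):
        V[state] = V.get(state, 0.0) + alpha * delta * e[state]  # credit by trace strength
        e[state] *= gamma * lam                              # decay all traces by gamma*lambda
    return V, e
```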

Reinforcement Learning 20

Forward View = Backward View
- The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

Adaptive Exploration in Reinforcement Learning

Relu Patrascu, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada (relu@pami.uwaterloo.ca)

Deborah Stacey, Dept. of Computing and Information Science, University of Guelph, Ontario, Canada (dastacey@uoguelph.ca)

Reinforcement Learning 22

Objectives
- Explain the trade-off between exploitation and exploration
- Introduce two categories of exploration methods:
  - Undirected exploration: ε-greedy exploration
  - Directed exploration: counter-based exploration, past-success directed exploration
- Function approximation: backpropagation algorithm and Fuzzy ARTMAP

Reinforcement Learning 23

Introduction
- Main problem: how to make the learning process adapt to a non-stationary environment?
- Sub-problems:
  - How to balance exploitation and exploration when the environment changes?
  - How can the function approximators adapt to the environment?

Reinforcement Learning 24

Exploitation and Exploration
- Exploit or explore?
  - To maximize reward, a learner must exploit the knowledge it already has
  - Exploring an action with small immediate reward may yield more reward in the long run
- An example: choosing a job
  - Suppose you are working at a small company with a $25,000 salary
  - You have another offer from a large enterprise, but it starts at only $12,000
  - Keeping the job at the small company guarantees a stable income
  - Working at the enterprise may offer more opportunities for promotion, which increase income in the long run

Reinforcement Learning 25

Undirected Exploration
- No bias; purely random
- E.g. ε-greedy exploration: when it explores, it chooses equally among all actions, so it is as likely to choose the worst-appearing action as it is to choose the next-to-best (see the sketch below)
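A minimal ε-greedy selection sketch: with probability ε pick uniformly at random (undirected exploration), otherwise pick the greedy action. The dict `Q` mapping actions to value estimates is an illustrative assumption.

```python
import random

# Epsilon-greedy action selection over estimated action values.
def epsilon_greedy(Q, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)            # purely random: worst action as likely as next-to-best
    return max(actions, key=lambda a: Q.get(a, 0.0))
```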

Reinforcement Learning 26

Directed Exploration
- Memorizes exploration-specific knowledge
- Biased by some features of the learning process
- E.g. counter-based techniques: favor actions that lead to a state that has not been frequently visited (see the sketch below)
- The main idea is to encourage the learner to explore:
  - parts of the state space that have not been sampled often
  - parts that have not been sampled recently
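A small counter-based exploration sketch along the lines described above: prefer the action whose predicted next state has been visited least often. The `predict_next` model function and the visit-count dict are illustrative assumptions, not from the slides.

```python
# Counter-based action choice: bias toward rarely visited successor states.
def counter_based_action(s, actions, visit_count, predict_next):
    # visit_count[state] should be incremented each time a state is actually visited
    return min(actions, key=lambda a: visit_count.get(predict_next(s, a), 0))
```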

Reinforcement Learning 27

Past-Success Directed Exploration
- Based on ε-greedy exploration
- Biases the exploration rate using feedback from the learning process, so it adapts to the environment:
  - Increase the exploitation rate if reward is received at an increasing rate
  - Increase the exploration rate when reward stops being received
- Average discounted reward:
  - Reflects the amount and frequency of received immediate rewards
  - The further back in time a reward was received, the less effect it has on the average

Reinforcement Learning 28

Past-Success Directed Exploration
- The average discounted reward is defined as:
  $$\bar{r}_t = \frac{\sum_{k=1}^{t} \gamma^{t-k} r_k}{\sum_{k=1}^{t} \gamma^{t-k}}$$
  where γ ∈ (0, 1] is the discount factor and $r_t$ is the reward received at time t
- Apply it to the ε-greedy algorithm:
  $$\epsilon_t = 0.8\, e^{-\bar{r}_t(s)} + 0.1$$
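A sketch of the past-success idea: maintain the discounted average of received rewards incrementally and map it to an exploration rate. The incremental update follows the definition above; treat the 0.8/0.1 scaling constants as an assumption recovered from the garbled slide.

```python
import math

# Adaptive exploration rate driven by the average discounted reward.
class PastSuccessExploration:
    def __init__(self, gamma=0.9):
        self.gamma = gamma
        self.weighted_sum = 0.0   # running sum of gamma^(t-k) * r_k
        self.weight_total = 0.0   # running sum of gamma^(t-k)

    def observe(self, reward):
        # discount old contributions, then add the newest reward with weight 1
        self.weighted_sum = self.gamma * self.weighted_sum + reward
        self.weight_total = self.gamma * self.weight_total + 1.0

    def epsilon(self):
        r_bar = self.weighted_sum / self.weight_total if self.weight_total else 0.0
        return 0.8 * math.exp(-r_bar) + 0.1   # more recent reward -> less exploration
```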

Reinforcement Learning 29

Gradient Descent Method
- Why use a gradient descent method?
  - RL applications use a table to store the value function
  - A large number of states makes this practically impossible
  - Solution: use a function approximator to predict the value, e.g. the error backpropagation algorithm
- Catastrophic interference:
  - Cannot learn incrementally in a non-stationary environment
  - Acquiring new knowledge makes it forget much of its previous knowledge

Reinforcement Learning 30

Gradient Descent Method (gradient-descent Sarsa(λ) with one network per action)

Initialize w arbitrarily and e = 0
Repeat (for each episode):
    Initialize s
    Pass s through each network and obtain Q_a
    a ← argmax_a Q_a
    With probability ε: a ← a random action ∈ A(s)
    Repeat (for each step of episode):
        e ← γλe;  e_a ← e_a + ∇_w Q_a
        Take action a, observe reward r and next state s'
        δ ← r − Q_a
        Pass s' through each network and obtain Q'_a
        a' ← argmax_a Q'_a
        With probability ε: a' ← a random action ∈ A(s')
        δ ← δ + γ Q'_{a'}
        w ← w + αδe
        a ← a'
    until s' is terminal

where a' ← argmax_a Q'_a means a' is set to the action for which the expression is maximal (the highest Q'), α is a constant step-size parameter (the learning rate), ∇_w Q_a is the gradient of Q_a with respect to the weights w, γ is the discount factor, e is the vector of eligibility traces, and λ ∈ (0, 1] is the eligibility-trace parameter.
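A runnable sketch of the loop above, simplified by replacing the per-action backpropagation networks with one linear approximator per action (Q_a = w_a · φ(s)). The `env` and `phi` interfaces (env.reset() → s, env.step(a) → (s_next, r, done), phi(s) → feature vector) are assumptions for illustration, not the slide's setup.

```python
import numpy as np

# Gradient-descent Sarsa(lambda) with linear function approximation.
def sarsa_lambda(env, phi, n_actions, n_features,
                 alpha=0.01, gamma=0.95, lam=0.9, eps=0.1, episodes=100):
    w = np.zeros((n_actions, n_features))
    for _ in range(episodes):
        e = np.zeros_like(w)                     # eligibility traces over all weights
        s = env.reset()
        a = _select(w, phi(s), n_actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x = phi(s)
            e *= gamma * lam                     # decay all traces
            e[a] += x                            # gradient of Q_a w.r.t. w_a is phi(s)
            delta = r - w[a] @ x                 # start the TD error with r - Q_a
            if not done:
                a_next = _select(w, phi(s_next), n_actions, eps)
                delta += gamma * w[a_next] @ phi(s_next)   # add gamma * Q'_{a'}
                s, a = s_next, a_next
            w += alpha * delta * e               # gradient-descent weight update
    return w

def _select(w, x, n_actions, eps):
    # epsilon-greedy over the approximated action values
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(w @ x))
```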

Reinforcement Learning 31

Fuzzy ARTMAP
- ARTMAP: Adaptive Resonance Theory mapping between an input vector and an output pattern
- A neural network specifically designed to deal with the stability/plasticity dilemma
- This dilemma means a neural network is unable to learn new information without damaging what was learned previously, similar to catastrophic interference

Reinforcement Learning 32

Experiments
- Gridworld with a non-stationary environment
- The learning agent can move up, down, left or right
- Two gates: the agent must pass through one of them to get from the start state to the goal state
- For the first 1000 episodes, gate 1 is open and gate 2 is closed
- For episodes 1001-5000, gate 1 is closed and gate 2 is open
- Tests how well the algorithm adapts to the changed environment

Reinforcement Learning 33

Results
- Backpropagation algorithm, after the 1000th episode:
  - The average discounted reward drops rapidly and monotonically
  - Exploration surges to its maximum
- Fuzzy ARTMAP, after the 1000th episode:
  - The reward drops for a few episodes and then goes back to high values
  - Only a temporary surge in exploration

Reinforcement Learning 34

Planning and Learning

Objectives:
- Use of environment models
- Integration of planning and learning methods

Reinforcement Learning 35

Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution model: a description of all possibilities and their probabilities, e.g. $P_{ss'}^{a}$ and $R_{ss'}^{a}$ for all $s$, $s'$, and $a \in A(s)$
- Sample model: produces sample experiences, e.g. a simulation model or a set of data
- Both types of models can be used to produce simulated experience
- Often sample models are much easier to obtain

Reinforcement Learning 36

Planning
- Planning: any computational process that uses a model to create or improve a policy

  model --(planning)--> policy

- We take the following view:
  - All state-space planning methods involve computing value functions, either explicitly or implicitly
  - They all apply backups to simulated experience

  model --> simulated experience --(backups)--> values --> policy

Reinforcement Learning 37

Learning, Planning, and Acting
- Two uses of real experience:
  - Model learning: to improve the model
  - Direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning.

Reinforcement Learning 38

Direct vs. Indirect RL
- Indirect methods: make fuller use of experience; get a better policy with fewer environment interactions
- Direct methods: simpler; not affected by bad models
- But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel

Reinforcement Learning 39

The Dyna-Q Architecture (Sutton 1990)

Reinforcement Learning 40

The Dyna-Q Architecture (Sutton 1990)
- Dyna uses real experience to build the model (T, R), uses experience to adjust the policy, and uses the model to adjust the policy
- For each interaction with the environment, experiencing <s, a, s', r>:
  1. Use experience to adjust the policy (direct RL): Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
  2. Use experience to update the model (T, R): Model(s, a) ← (s', r)
  3. Use the model to simulate experience and adjust the policy (planning), repeated N times: pick a previously observed s and a at random; (s', r) ← Model(s, a); Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
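A minimal tabular Dyna-Q sketch following the three steps above: direct RL from real experience, model learning, then N planning updates drawn from the learned model. The dict-based `Q` and `model` and the parameter values are illustrative assumptions.

```python
import random

# One Dyna-Q step for a real transition (s, a, s', r), followed by N planning updates.
def dyna_q_step(Q, model, s, a, r, s_next, actions, alpha=0.1, gamma=0.95, N=5):
    def best(state):
        return max(Q.get((state, b), 0.0) for b in actions)

    # 1. direct RL: Q-learning update from the real transition
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best(s_next) - Q.get((s, a), 0.0))
    # 2. model learning: remember what this state-action pair led to
    model[(s, a)] = (s_next, r)
    # 3. planning: replay N simulated transitions drawn from the model
    for _ in range(N):
        (ps, pa), (ps_next, pr) = random.choice(list(model.items()))
        Q[(ps, pa)] = Q.get((ps, pa), 0.0) + alpha * (pr + gamma * best(ps_next) - Q.get((ps, pa), 0.0))
    return Q, model
```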

Reinforcement Learning 41

The Dyna-Q Algorithm

[Algorithm pseudocode figure, with its steps labeled: direct RL, model learning, and planning.]

Reinforcement Learning 42

Dyna-Q Snapshots: Midway in 2nd Episode

Reinforcement Learning 43

Dyna-Q Properties
- The Dyna algorithm requires about N times the computation of Q-learning per instance
- But this is typically vastly less than that of a naïve model-based method
- N can be determined by the relative speed of computation and of taking actions
- What if the environment changes? It may change to become harder or easier.

Reinforcement Learning 44

Blocking Maze
- The changed environment is harder

Reinforcement Learning 45

Shortcut Maze
- The changed environment is easier

Reinforcement Learning 46

What is Dyna-Q+?
- Uses an "exploration bonus":
  - Keeps track of the time since each state-action pair was tried for real
  - An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting (see the sketch below)
  - The agent actually "plans" how to visit long-unvisited states
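A sketch of the exploration bonus used during planning in Dyna-Q+. The slide only says "the longer unvisited, the more reward"; the κ·√τ form is the common textbook choice (Sutton and Barto) and is an assumption here.

```python
import math

# Exploration bonus for a simulated transition whose (s, a) pair was last tried
# for real `last_tried` time steps ago.
def bonus_reward(r, last_tried, current_time, kappa=0.001):
    tau = current_time - last_tried          # time since (s, a) was tried for real
    return r + kappa * math.sqrt(tau)        # modeled reward plus exploration bonus
```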

Reinforcement Learning 47

Prioritized Sweeping
- The planning updates are no longer chosen at random
- Instead, store additional information in the model in order to make an appropriate choice of update
- Store the change of each state value, ΔV(s), and use it to set the priority of the predecessors of s, according to their transition probability T(s, a, s')

[Diagram: e.g. ΔV = 10 for one state and ΔV = 5 for another; the affected state-action pairs are queued by priority S4, S5, S2, S1, S3, from high to low.]
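A sketch of the prioritized-sweeping planning loop: instead of updating random state-action pairs, pop the pair with the largest expected value change and push its predecessors whose priority exceeds a threshold. The data structures (`model`, `predecessors`, the heap) are illustrative assumptions.

```python
import heapq

# Prioritized planning: process the highest-priority state-action pairs first.
def prioritized_sweeping(Q, model, predecessors, pqueue, actions,
                         alpha=0.1, gamma=0.95, theta=1e-4, N=5):
    def best(state):
        return max(Q.get((state, b), 0.0) for b in actions)

    for _ in range(N):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)            # highest-priority pair first
        s_next, r = model[(s, a)]
        delta = r + gamma * best(s_next) - Q.get((s, a), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
        # the change at s may matter to every predecessor (sp, ap) known to lead to s
        for sp, ap, rp in predecessors.get(s, []):
            priority = abs(rp + gamma * best(s) - Q.get((sp, ap), 0.0))
            if priority > theta:
                heapq.heappush(pqueue, (-priority, (sp, ap)))   # max-heap via negation
    return Q
```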

Reinforcement Learning 48

Prioritized Sweeping

Reinforcement Learning 49

Prioritized Sweeping vs. Dyna-Q
- Both use N = 5 backups per environmental interaction

Reinforcement Learning 50

Full and Sample (One-Step) Backups

Reinforcement Learning 51

Summary
- Emphasized the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning

Reinforcement Learning 52

RL Recent Development: Problem Modeling

Model of environment:            Known                        Unknown
Completely observable state:     MDP                          Traditional RL
Partially observable state:      Partially Observable MDP     Hidden State RL

Reinforcement Learning 53

Research topics
- Exploration-exploitation tradeoff
- Problem of delayed reward (credit assignment)
- Input generalization: function approximators
- Multi-agent reinforcement learning:
  - Global goal vs. local goal
  - Achieving several goals in parallel
  - Agent cooperation and communication

Reinforcement Learning 54

RL Application: TD-Gammon (Tesauro 1992, 1994, 1995, ...)
- 30 pieces and 24 locations imply an enormous number of configurations
- Effective branching factor of 400
- TD(λ) algorithm with a multi-layer neural network
- Plays near the level of the world's strongest grandmasters

Reinforcement Learning 55

RL Application: Elevator Dispatching (Crites and Barto 1996)

Reinforcement Learning 56

RL Application

Elevator Dispatching: conservatively about 10^22 states
- 18 hall call buttons: 2^18 combinations
- Positions and directions of cars: 18^4 (rounding to the nearest floor)
- Motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
- 40 car buttons: 2^40
- 18 discretized real numbers are available giving the elapsed time since each hall button was pushed
- The set of passengers riding each car and their destinations is observable only through the car buttons

Reinforcement Learning 57

RL Application
- Dynamic Channel Allocation (Singh and Bertsekas 1997)
- Job-Shop Scheduling (Zhang and Dietterich 1995, 1996)

Reinforcement Learning 58

Q & A
