
Page 1: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Sample-based Planning for Continuous Action Markov Decision Processes [on robots]

Ari Weinstein


Page 2: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Reinforcement Learning (RL)

• Agent takes an action in the world, gets information including numerical reward; how does it learn to maximize that reward?

• The fundamental concept is exploration vs. exploitation: the agent must take actions in the world in order to learn about it, but must eventually use what was learned to get high reward

• Bandits (stateless), Markov Decision Processes (state)


Page 3: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

The Goal
• I want to be here:
• Most RL algorithms are here [Knox Stone 09]:
• Some RL is done with robots, but it's rare, partly because it's hard:

Page 4: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Overview

• RL Basics (bandits / Markov decision processes)
• Planning
  – Bandits
  – MDPs (novel)
• Model Building
• Exploring
• Acting (novel)

Composing pieces in this manner is novel

Page 5: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

k-armed Bandits
• Agent selects from k arms, each with a distribution over rewards
• Call the arm pulled at step t a_t, and the reward at t r_t ~ R(a_t)
• The regret is the difference in reward between the arm pulled and the optimal arm; we want cumulative regret to increase sub-linearly in t (see the sketch below)

$a^* = \arg\max_a E[R(a)]$

$\mathrm{regret} = \sum_{t \ge 0} \big( R(a^*) - r_t \big)$
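To make the regret definition concrete, here is a minimal simulation sketch of a k-armed Gaussian bandit that tracks cumulative regret. The UCB1 selection rule is my illustrative choice, not something from the slides, though its exploration bonus mirrors the one HOO uses on the next slide.

```python
import math
import random

def run_bandit(means, steps=1000):
    """Simulate a k-armed Gaussian bandit with UCB1 and track cumulative regret:
    regret_T = sum_t (R(a*) - r_t), with a* = argmax_a E[R(a)]."""
    k = len(means)
    best = max(means)                        # expected reward of the optimal arm a*
    counts, est = [0] * k, [0.0] * k
    regret, curve = 0.0, []
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1                        # pull each arm once to initialize
        else:
            a = max(range(k),
                    key=lambda i: est[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = random.gauss(means[a], 1.0)      # r_t ~ R(a_t)
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]   # running mean of observed rewards
        regret += best - r
        curve.append(regret)
    return curve

# Example: cumulative regret should grow sub-linearly in t.
# print(run_bandit([0.1, 0.5, 0.9])[-1])
```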

Page 6: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Hierarchical Optimistic Optimization (HOO) [Bubeck et al. 08]
• Partition the action space by a tree
  – Keep track of rewards for each subtree
• Blue is the bandit, red is the decomposition of the HOO tree
  – Thickness represents estimated reward
• Tree grows deeper and builds estimates at high resolution where reward is highest

Page 7: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

HOO continued

• Exploration bonuses for the number of samples and the size of each subregion
  – Regions with large volume and few samples are unknown, and vice versa
• Pull an arm in the region with maximal B-value (below)
• Has optimal regret of sqrt(t), independent of action dimension

$U_{h,i}(t) = \hat{\mu}_{h,i}(t) + \sqrt{\dfrac{2 \ln t}{N_{h,i}(t)}} + \nu_1 \rho^h$

$B_{h,i}(t) = \min\Big( U_{h,i}(t),\ \max\big( B_{h+1,2i-1}(t),\ B_{h+1,2i}(t) \big) \Big)$
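A minimal sketch of the HOO selection rule, assuming a one-dimensional action space on [0, 1], binary splits at the midpoint, and uniform sampling within the selected region. It is simplified relative to Bubeck et al. (for example, the full algorithm refreshes the confidence terms of all nodes as t grows), and the names (Node, hoo_round) are mine.

```python
import math
import random

class Node:
    """A node of the HOO tree covering the action interval [lo, hi]."""
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.n = 0                 # samples that fell in this subtree
        self.mean = 0.0            # empirical mean reward of those samples
        self.children = None
        self.B = float("inf")      # unvisited regions are maximally optimistic

    def split(self):
        mid = (self.lo + self.hi) / 2.0
        self.children = [Node(self.lo, mid, self.depth + 1),
                         Node(mid, self.hi, self.depth + 1)]

def hoo_round(root, t, reward_fn, nu1=1.0, rho=0.5):
    """One round of simplified HOO: descend by B-values, pull, update the path."""
    # 1. Descend: follow the child with the larger B-value until an unexpanded node.
    path, node = [root], root
    while node.children is not None:
        node = max(node.children, key=lambda c: c.B)
        path.append(node)
    # 2. Pull an arm uniformly inside the selected region and observe a reward.
    x = random.uniform(node.lo, node.hi)
    r = reward_fn(x)
    node.split()                               # grow the tree at the selected node
    # 3. Update counts and means along the path, then recompute U- and B-values.
    for nd in path:
        nd.n += 1
        nd.mean += (r - nd.mean) / nd.n
    for nd in reversed(path):
        U = nd.mean + math.sqrt(2.0 * math.log(t) / nd.n) + nu1 * rho ** nd.depth
        nd.B = U if nd.children is None else min(U, max(c.B for c in nd.children))
    return x, r

# Example: maximize a noisy 1-D reward over [0, 1].
# root = Node(0.0, 1.0, 0)
# for t in range(1, 500):
#     hoo_round(root, t, lambda a: -(a - 0.3) ** 2 + random.gauss(0.0, 0.05))
```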

Page 8: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Markov Decision Processes
• Composed of:
  – States S (s, s' ∈ S)
  – Actions A (a ∈ A)
  – Transition distribution T(s'|s,a)
  – Reward function R(s,a)
  – Discount factor 0 < γ < 1
• Goal is to find a policy π, a mapping from states to actions, maximizing the expected long-term discounted reward (below), where r_t is the reward at time t
• Maximize long-term reward, but favor immediate reward more heavily; future rewards are decayed by γ. How much long-term reward is possible is measured by the value function

$E\big[ \sum_{t \ge 0} \gamma^t r_t \big]$

Page 9: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Value Function

• Value of a state s under policy π:

• Q-value of an action a under the same definition:

• Optimally,

$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s'|s, \pi(s))\, V^{\pi}(s')$

$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} T(s'|s, a)\, V^{\pi}(s')$

$V^{*}(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s'|s, a)\, V^{*}(s') \Big]$

$\pi^{*}(s) = \arg\max_a Q^{*}(s, a)$
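A minimal sketch of value iteration applying these Bellman equations. The tabular representation (dense arrays T and R over finite S and A) is an assumption made for illustration; the thesis itself targets continuous spaces where this is not possible.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Compute V* and a greedy policy for a finite MDP.

    T: array of shape (|S|, |A|, |S|), T[s, a, s'] = T(s' | s, a)
    R: array of shape (|S|, |A|),      R[s, a]
    """
    V = np.zeros(T.shape[0])
    while True:
        Q = R + gamma * (T @ V)      # Q(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) V(s')
        V_new = Q.max(axis=1)        # V*(s)  = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # pi*(s) = argmax_a Q*(s,a)
        V = V_new
```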

Page 10: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Sample-based Planning [Kearns Mansour 99]

• In the simplest case, the agent can query the domain with any <s,a> and get <r,s'>
• Flow (sketched below):
  – Domain informs agent of the current state, s
  – Agent queries the domain for any number of <s,a,r,s'>
  – Agent informs the domain of the true action to take, a
  – Domain informs agent of the new state, s'
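A small sketch of the query flow above. The interface names (reset, query, step, plan) are assumptions for illustration, not an API from the talk; the point is only that planner queries are free and do not advance the real environment, while exactly one true action is sent per step.

```python
class PlanningLoop:
    """Sketch of the query flow: free generative-model queries, one true action per step."""
    def __init__(self, domain, planner):
        self.domain = domain    # assumed interface: reset(), query(s, a) -> (r, s'), step(a) -> (s, r)
        self.planner = planner  # assumed interface: plan(s, query_fn) -> a

    def run_episode(self, horizon):
        s = self.domain.reset()                    # domain reports the current state s
        total = 0.0
        for _ in range(horizon):
            # The planner may call domain.query(s, a) -> (r, s') any number of times;
            # these calls do not advance the real environment.
            a = self.planner.plan(s, self.domain.query)
            s, r = self.domain.step(a)             # the single true action for this step
            total += r
        return total
```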

Page 11: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Planning with HOO (HOLOP)

• Call this approach HOLOP – Hierarchical Open Loop Optimistic Planning
• Can treat the n-step planning problem as one large optimization problem
• Probability of splitting for a particular value of n is proportional to γ^n
• Use HOO to optimize the n-step plan, and then take the action recommended for the first step (see the sketch below)
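A sketch of HOLOP's outer loop under the assumption of a one-dimensional action in [-1.5, 1.5] (the double-integrator range used later): evaluate open-loop action sequences by their discounted return under the model, optimize over sequences, and execute only the first action. Random search stands in for the HOO optimizer purely to keep the sketch short; the real algorithm optimizes the same objective with the HOO tree.

```python
import random

def holop_action(s0, query, horizon, gamma, budget=500, action_range=(-1.5, 1.5)):
    """Optimize an open-loop action sequence against the model, return its first action."""
    lo, hi = action_range

    def rollout_return(actions):
        # Discounted return r_1 + gamma*r_2 + ... of an open-loop sequence under the model.
        s, total, discount = s0, 0.0, 1.0
        for a in actions:
            r, s = query(s, a)               # generative-model call: <s,a> -> <r,s'>
            total += discount * r
            discount *= gamma
        return total

    best_seq, best_val = None, float("-inf")
    for _ in range(budget):                  # random search standing in for HOO
        seq = [random.uniform(lo, hi) for _ in range(horizon)]
        val = rollout_return(seq)
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq[0]                       # only the first action is executed; replan next step
```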

Page 12: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

1-Step Lookahead in HOLOP

• Just maximizing the immediate reward, r_1

• 1 dimensional; horizontal axis is splitting immediate action


Page 13: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

2-Step Lookahead in HOLOP

• Maximizing r_1 + γ r_2

• 2 dimensional; horizontal axis is splitting immediate action, vertical is splitting next action


Page 14: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

3-Step Lookahead in HOLOP

• Maximizing r_1 + γ r_2 + γ² r_3

• 3 dimensional; horizontal axis is splitting immediate action, vertical is splitting next action, depth is third action


Page 15: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Properties of HOLOP
• Planning of HOO/HOLOP (regret) improves at a rate of sqrt(t), independent of n
• Cost independent of |S|
  – Open loop control means agnostic to state
• Anytime planner
• Open loop control means guarantees don't exist in noisy domains

Page 16: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Learning System Update
• If a generative model is available, can use HOLOP directly

Page 17: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

HOLOP in Practice: Double Integrator Domain
[Santamaría et al. 98]
• Object with position (p) and velocity (v). Control acceleration (a). R((p,v), a) = -(p² + a²)
  – Stochasticity introduced with noise added to the action command
• Planning done to 50 steps
• As an anytime planner, can be stopped and asked for an action at any time (left)
• Performance degrades similarly to optimal as noise varies (right)
  – Action corrupted by a +/- amount on the x-axis, uniformly distributed. Action range is [-1.5, 1.5]
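A minimal sketch of one step of the double-integrator dynamics with a noisy, clipped action. The reward matches the slide; the Euler integration, time step, and clipping to [-1.5, 1.5] are my assumptions.

```python
import random

def double_integrator_step(p, v, a, dt=0.05, noise=0.05):
    """One step of the double integrator: control acceleration under action noise."""
    a = max(-1.5, min(1.5, a + random.uniform(-noise, noise)))  # noisy, clipped action
    reward = -(p ** 2 + a ** 2)       # R((p, v), a) = -(p^2 + a^2), as on the slide
    return (p + dt * v, v + dt * a), reward
```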

Page 18: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Building a Model: KD Trees [Moore 91]

• HOLOP needs a model – where does it come from?
• A KD tree is a simple method of partitioning a space
• At each node, a split is made in some dimension of the region that node represents
  – Various rules exist for deciding when and where to split
• To make an estimate, find the leaf that the query point falls in, and use some method to make an estimate
  – Commonly the mean is used; I used linear regression
• This is used to build models of the reward and transitions (see the sketch below)
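A minimal KD-tree function approximator in the spirit described above: split a leaf when it holds too many samples, on its widest dimension, and predict from the leaf. For brevity this sketch splits at the midpoint of the widest dimension and predicts with the leaf mean, whereas the talk uses linear regression in each leaf; the splitting rule and max_leaf threshold are my assumptions.

```python
import numpy as np

class KDTreeRegressor:
    """Leaf-splitting KD tree: split on the widest dimension at its midpoint."""
    def __init__(self, max_leaf=16):
        self.max_leaf = max_leaf
        self.X, self.y = [], []
        self.dim = self.val = None        # split dimension / threshold once internal
        self.left = self.right = None

    def add(self, x, y):
        if self.left is not None:         # internal node: route the sample down
            (self.left if x[self.dim] <= self.val else self.right).add(x, y)
            return
        self.X.append(np.asarray(x, dtype=float))
        self.y.append(float(y))
        if len(self.X) > self.max_leaf:
            X = np.array(self.X)
            spread = X.max(axis=0) - X.min(axis=0)
            if spread.max() == 0.0:       # all points identical: nothing to split on
                return
            self.dim = int(np.argmax(spread))
            lo, hi = X[:, self.dim].min(), X[:, self.dim].max()
            self.val = float((lo + hi) / 2.0)
            self.left = KDTreeRegressor(self.max_leaf)
            self.right = KDTreeRegressor(self.max_leaf)
            for xi, yi in zip(self.X, self.y):
                (self.left if xi[self.dim] <= self.val else self.right).add(xi, yi)
            self.X, self.y = [], []

    def predict(self, x):
        if self.left is not None:
            return (self.left if x[self.dim] <= self.val else self.right).predict(x)
        return float(np.mean(self.y)) if self.y else 0.0
```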

Page 19: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

KD Trees Approximating a Gaussian
• Samples drawn i.i.d. from a Gaussian, labeled with the pdf of the Gaussian at each point
• Piecewise linear fit of the function

Page 20: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Learning System Update
• Model and Environment are now 2 pieces
  – A generative model is not required
• Model learns from the environment when true <s,a,r,s'> samples are available
• HOLOP uses the learned model to plan

Page 21: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Efficient Exploration
• Multi-resolution Exploration (MRE) [Nouri Littman 08] is a method that allows any RL algorithm to have efficient exploration
• Partitions the space by a tree, with each node representing knownness based on the sample density in its leaf
• When using samples, treat a sample as having a transition to S_max with probability calculated by the tree. S_max is a state with a self-transition and the maximum possible reward
• While doing rollouts in HOLOP, MRE perturbs results in order to drive the agent to explore

Page 22: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Learning System Update
• When HOLOP queries the model for <s,a,r,s'>, MRE can step in and lie, saying a transition to S_max occurs instead (sketched below)
  – Happens with high probability if <s,a> is not well known
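A toy sketch of the MRE idea of lying to the planner: score how well known a <s,a> region is, and with probability 1 - knownness report a transition to S_max instead of the model's answer. A fixed grid stands in for MRE's adaptive knownness tree here, and the knownness formula is a simplification of Nouri and Littman's; state and action are assumed to be tuples of floats.

```python
import random
from collections import defaultdict

class GridKnownness:
    """Toy knownness measure: discretize <s,a> into cells and call a cell 'known'
    in proportion to its sample count (MRE itself uses an adaptive tree)."""
    def __init__(self, cell_size=0.25, samples_to_known=10):
        self.cell_size, self.samples_to_known = cell_size, samples_to_known
        self.counts = defaultdict(int)

    def _cell(self, sa):
        return tuple(int(x // self.cell_size) for x in sa)

    def add(self, sa):
        self.counts[self._cell(sa)] += 1

    def knownness(self, sa):
        return min(1.0, self.counts[self._cell(sa)] / float(self.samples_to_known))

def perturbed_query(model_query, knownness, s, a, s_max, r_max):
    """With probability 1 - knownness(<s,a>), lie to the planner: report a transition
    to S_max, an absorbing state with maximum reward, to draw rollouts toward
    poorly known regions. Otherwise answer from the learned model."""
    if random.random() > knownness.knownness(tuple(s) + tuple(a)):
        return r_max, s_max
    return model_query(s, a)
```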

Page 23: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Double Integrator
+/- 0.05 units of uniformly distributed noise on actions
• Explosion from discretization causes slow learning
• Near-optimal behavior with 10 trajectories, 2000 total samples
• Discrete algorithms have either fast convergence to poor results or slow convergence to good results

Page 24: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

3-Link Swimmer [Tassa et al.]
• Big domain: 2 action dimensions, 9 state dimensions
  – Model building needs 11 input dimensions, 10 output dimensions
• Tested a number of algorithms in this domain; HOLOP has the best performance
• Rmax still worse than HOLOP after 120 trials

Page 25: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Next Step: Doing it all Quickly
• In simulations, can replan with HOLOP at every step
• In real-time robotics control, can't stop the world while the algorithm plans
• Need a method of caching planning online. This is tricky as the model is updated online – when is the policy updated?
• As with the other algorithms discussed (HOO, KD Trees, MRE), trees are used here
  – Adaptively partitioning a space based on sample concentration is both very powerful and very efficient

Page 26: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

[Preliminary]

TCP: Tree-based Cached Planning
Algorithm:
• Start with a root that covers the entire state space, initialized with an exploratory action
• As samples are experienced, add them to the tree, partitioning nodes based on some splitting rule
• Child nodes inherit the action from their parent on creation
Running it:
• Can ask the tree for an action to take from a state; the leaf returns its cached action
  – If the planner is not busy, request it to plan from the center of the area the leaf represents, in a different thread (see the sketch below)
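A sketch of the TCP idea: leaves cache one action per region of state space, queries are answered immediately from the cache, and an idle planner is asked (in another thread) to refresh the leaf's action from the center of its region. The splitting rule, class names, and locking scheme are my assumptions; plan_fn stands for a planner such as HOLOP.

```python
import threading

class TCPNode:
    """A region of state space with one cached action; splits when it has many samples."""
    def __init__(self, lows, highs, action):
        self.lows, self.highs = list(lows), list(highs)
        self.action = action            # cached action, inherited by children on split
        self.samples = []
        self.children = None            # (dim, mid, left, right) after a split

    def center(self):
        return [(l + h) / 2.0 for l, h in zip(self.lows, self.highs)]

    def leaf_for(self, s):
        node = self
        while node.children is not None:
            dim, mid, left, right = node.children
            node = left if s[dim] <= mid else right
        return node

    def add_sample(self, s, max_leaf=16):
        leaf = self.leaf_for(s)
        leaf.samples.append(list(s))
        if len(leaf.samples) > max_leaf:              # simple count-based splitting rule
            widths = [h - l for l, h in zip(leaf.lows, leaf.highs)]
            dim = widths.index(max(widths))
            mid = (leaf.lows[dim] + leaf.highs[dim]) / 2.0
            left_high, right_low = list(leaf.highs), list(leaf.lows)
            left_high[dim], right_low[dim] = mid, mid
            left = TCPNode(leaf.lows, left_high, leaf.action)   # children inherit action
            right = TCPNode(right_low, leaf.highs, leaf.action)
            for p in leaf.samples:
                (left if p[dim] <= mid else right).samples.append(p)
            leaf.children, leaf.samples = (dim, mid, left, right), []

class TCPController:
    """Answer action queries instantly from the cache; replan opportunistically."""
    def __init__(self, root, plan_fn):
        self.root, self.plan_fn = root, plan_fn       # plan_fn(state) -> action (e.g. HOLOP)
        self.busy = threading.Lock()

    def get_action(self, s):
        leaf = self.root.leaf_for(s)
        if self.busy.acquire(blocking=False):         # planner idle: refresh this leaf
            threading.Thread(target=self._replan, args=(leaf,), daemon=True).start()
        return leaf.action                            # always return immediately

    def _replan(self, leaf):
        try:
            leaf.action = self.plan_fn(leaf.center()) # plan from the center of the region
        finally:
            self.busy.release()
```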

Page 27: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Cached Policies: Double Integrator with Generative Model
• Shade indicates the cached action (black = -1.5, white = 1.5)
• Rewards: [-44.3, -5.5, -2.2, -4.6, -21.1, -2.6] (-1.3 optimal)

Page 28: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Close Up
Order is red, green, blue, yellow, magenta, cyan
• In policy error, black indicates 0 error, white indicates the maximum possible error in the policy
• Minimal error along the optimal trajectory – errors off the optimal trajectory are acceptable

Page 29: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Learning System Update: Our Happy Forest

• Agent acts based on policy cached in TCP

• TCP sends request for updated policy, for state s


Page 30: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Conclusions
• There are many existing classes of RL algorithms, but almost all fail at least one requirement of real-time robotics control. My approach addresses all requirements
• Hierarchical Open Loop Optimistic Planning is introduced:
  – Operates directly in continuous action spaces, and is agnostic to state. No need for tuning
  – No function approximation of the value function
• Tree-based Cached Planning is introduced:
  – Develops policies for regions when enough data is available to accurately determine the policy
  – Opportunistic updating of the policy allows for real-time polling, with the policy updated frequently

Page 31: Sample-based Planning for  Continuous Action  Markov Decision Processes [on robots]

Citations
• Bubeck et al. 08: Bubeck, S., Munos, R., Stoltz, G., Szepesvári, C. Online Optimization in X-Armed Bandits. NIPS 2008.
• Kearns Mansour 99: Kearns, M., Mansour, Y., Ng, A. A Sparse Sampling Algorithm for Near-Optimal Planning in Large MDPs. IJCAI 1999.
• Knox Stone 09: Knox, W. B., Stone, P. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. K-CAP 2009.
• Moore 91: Moore, A. W. Efficient Memory-based Learning for Robot Control. PhD thesis, University of Cambridge, 1991.
• Nouri Littman 08: Nouri, A., Littman, M. L. Multi-resolution Exploration in Continuous Spaces. NIPS 2008.
• Santamaría et al. 98: Santamaría, J. C., Sutton, R., Ram, A. Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces. Adaptive Behavior 6, 1998.
• Tassa et al.: Tassa, Y., Erez, T., Smart, W. D. Receding Horizon Differential Dynamic Programming. Advances in Neural Information Processing Systems 21, 2007.