Page 1: Sample-based Planning for Continuous Action Markov Decision Processes

Sample-based Planning for Continuous Action Markov Decision Processes

[email protected] [email protected] [email protected]

Ari Weinstein, Chris Mansley, Michael L. Littman

Rutgers Laboratory for Real-Life Reinforcement Learning

Page 2: Sample-based Planning for Continuous Action Markov Decision Processes

Motivation

• Sample-based planning:
  – Planning cost independent of the size of the state space
  – Sometimes the MDP is too large to solve globally

• Continuous action MDPs:
  – A common setting, but few RL algorithms exist for it
  – Imagine riding in a car where the gas and brakes are on/off switches
• If we have, or can learn, the dynamics of a continuous action domain, how do we plan in it?

Page 3: Sample-based Planning for Continuous Action Markov Decision Processes

Sample-based planning for finite MDPs

• Don’t care about regions of the state space far from the current state
• Requires a generative model (interface sketched below)
  – Ask for a <s, a, r, s’> for any <s, a> at any time

• Sparse sampling [Kearns et al. 1999]
  – PAC-style guarantees
  – Too expensive in practice
• Monte-Carlo tree search
  – Weaker theoretical guarantees (generally)
  – In practice, more useful
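For concreteness, a minimal Python sketch of what "generative model" means here: a black box that returns a sampled reward and next state for any queried state-action pair. The `GenerativeModel` protocol and the toy `NoisyChain` domain are illustrative assumptions, not code from the talk.

```python
from typing import Protocol, Tuple
import random

class GenerativeModel(Protocol):
    """Black-box simulator: given (s, a), sample a reward and next state."""
    def sample(self, state, action) -> Tuple[float, object]:
        ...

class NoisyChain:
    """Toy 1-D chain, here only to illustrate the interface."""
    def sample(self, state: int, action: int) -> Tuple[float, int]:
        # action in {-1, +1}; 10% chance the move slips in the other direction
        slip = -action if random.random() < 0.1 else action
        next_state = max(0, min(10, state + slip))
        reward = 1.0 if next_state == 10 else 0.0
        return reward, next_state
```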

Page 4: Sample-based Planning for Continuous Action Markov Decision Processes

Monte-Carlo Tree [DAG] Search

• Possible trajectories (rollouts) through an MDP can be encoded by a DAG
  – Layered in depths, with all states appearing at each depth
  – Edges contain actions and rewards
• Explore the DAG so that a high-value action is taken (a possible node encoding is sketched below)
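One way such a rollout DAG could be stored, purely as an illustrative sketch (the `NodeStats` fields and names are assumptions, not the authors' data structures): nodes are keyed by <state, depth> and per-action counts and value estimates live on the edges.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class NodeStats:
    """Statistics stored at one <state, depth> node of the rollout DAG."""
    visits: int = 0
    # per-action counts and running means of discounted returns (the edge data)
    action_counts: Dict[int, int] = field(default_factory=lambda: defaultdict(int))
    action_values: Dict[int, float] = field(default_factory=lambda: defaultdict(float))

# the DAG itself: one record per (state, depth) key, shared across rollouts
tree: Dict[Tuple[object, int], NodeStats] = defaultdict(NodeStats)
```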

Page 5: Sample-based Planning for Continuous Action Markov Decision Processes

Upper Confidence bounds applied to Trees (UCT) [Kocsis, Szepesvári 2006]

• An instance of Monte-Carlo tree search
• Leverages the bandit literature
  – Places a bandit agent similar to UCB1 at each <state, depth> in the rollout tree [Auer et al. 2002]
  – (only illustrated at the root)
• Action selection according to:
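The selection equation on the original slide is an image and is not in this transcript; the standard UCT rule from Kocsis and Szepesvári (2006), with exploration constant $C_p$, is

$$a = \operatorname*{arg\,max}_{a'} \left[ Q(s, a') + C_p \sqrt{\frac{\ln n(s)}{n(s, a')}} \right],$$

where $n(s)$ counts visits to the node, $n(s, a')$ counts how often action $a'$ was tried there, and $Q(s, a')$ is the mean return observed after taking $a'$.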

Page 6: Sample-based Planning for Continuous Action Markov Decision Processes

Continuous action spaces

• Most canonical RL domains are continuous action MDPs – why ignore that?
  – Hillcar, pole balancing, acrobot, double integrator, robotics…
• Coarse discretization is not good enough
  – Infinite regret relative to the continuous optimum
  – Want to focus samples in the optimal region

Page 7: Sample-based Planning for Continuous Action Markov Decision Processes

Hierarchical Optimistic Optimization (HOO) [Bubeck et al. 2008]

• Partition the action space, similar to a KD-tree (node structure sketched below)
  – Keep track of rewards for each subtree
• Blue is the bandit, red is the decomposition of the HOO tree
  – Thickness represents estimated reward
• The tree grows deeper and builds estimates at higher resolution where reward is highest
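A rough Python sketch of the kind of tree HOO maintains over a one-dimensional action interval (field names and the bisection rule are illustrative assumptions; see Bubeck et al. 2008 for the actual algorithm):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HOONode:
    """One cell of the hierarchical partition of the action space."""
    lo: float                        # left edge of this cell's action interval
    hi: float                        # right edge
    count: int = 0                   # number of pulls that landed in this cell
    mean_reward: float = 0.0         # running mean of rewards observed here
    left: Optional["HOONode"] = None
    right: Optional["HOONode"] = None

    def split(self) -> None:
        """Bisect the interval; the two children cover its halves."""
        mid = 0.5 * (self.lo + self.hi)
        self.left, self.right = HOONode(self.lo, mid), HOONode(mid, self.hi)

    def update(self, reward: float) -> None:
        """Fold one observed reward into this cell's statistics."""
        self.count += 1
        self.mean_reward += (reward - self.mean_reward) / self.count
```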

Page 8: Sample-based Planning for Continuous Action Markov Decision Processes

HOO continued

• Exploration bonuses based on the number of samples and the size of each subregion
  – Regions with large volume and few samples are unknown, and vice versa
• Pull an arm in the region with the maximal B-value (see below)
• Has optimal regret, independent of the action dimension
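The B-value formula on the original slide is an image; in the notation of Bubeck et al. (2008), a node $(h, i)$ at depth $h$ of the partition tree is scored as

$$U_{h,i}(n) = \hat{\mu}_{h,i}(n) + \sqrt{\frac{2 \ln n}{T_{h,i}(n)}} + \nu_1 \rho^{h}, \qquad
B_{h,i}(n) = \min\Big\{ U_{h,i}(n),\ \max\big\{ B_{h+1,2i-1}(n),\, B_{h+1,2i}(n) \big\} \Big\},$$

where $\hat{\mu}_{h,i}$ is the empirical mean reward in the cell, $T_{h,i}$ the number of pulls in it, $n$ the total number of pulls, and $\nu_1 \rho^{h}$ bounds how much the payoff can vary inside a depth-$h$ cell. An arm is pulled by following the child with the larger B-value from the root down to a leaf.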

Page 9: Sample-based Planning for Continuous Action Markov Decision Processes

HOOT [Weinstein, Mansley, Littman 2010]

• Hierarchical Optimistic Optimization applied to Trees
• Ideas follow from UCT
• Instead of UCB, places a HOO agent at each <state, depth> in the rollout tree
  – Results in continuous action planning (sketched below)
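A minimal sketch of the resulting planner, assuming the `GenerativeModel` interface sketched earlier and a hypothetical `HOO` agent class exposing `select()` and `update()`; this illustrates the idea rather than reproducing the authors' implementation.

```python
def hoot_plan(model, state, depth, gamma, hoo_agents):
    """One HOOT rollout: a HOO bandit per (state, depth) chooses the continuous action."""
    if depth == 0:
        return 0.0
    agent = hoo_agents[(state, depth)]     # hypothetical HOO wrapper with select()/update()
    action = agent.select()                # descend maximal B-values, return an action to try
    reward, next_state = model.sample(state, action)
    ret = reward + gamma * hoot_plan(model, next_state, depth - 1, gamma, hoo_agents)
    agent.update(action, ret)              # credit the sampled return back to that action
    return ret

# Planning step (usage sketch): run many rollouts from the current state,
# then act greedily with respect to the root agent's estimates.
# hoo_agents = defaultdict(lambda: HOO(action_lo, action_hi))   # hypothetical constructor
# for _ in range(2048):
#     hoot_plan(model, current_state, horizon, 0.99, hoo_agents)
```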

Page 10: Sample-based Planning for Continuous Action Markov Decision Processes

Benefits of HOOT

• Planning cost independent of the size of the state space
• Continuous action planning
• Adaptive partitioning of the action space allows for more efficient tree search
  – Fewer samples wasted on suboptimal actions
• Good performance in high-dimensional action spaces
• Good horizon depth

Page 11: Sample-based Planning for Continuous Action Markov Decision Processes

Experiments

• Domains: D-double integrator and D-link swimmer
• Number of samples to the generative model fixed to 2048 and 8192 per planning step, respectively
• Since both are discrete-state planners, each state dimension is coarsely discretized into 20 divisions

Page 12: Sample-based Planning for Continuous Action Markov Decision Processes

D-Double Integrator [Santamaría et al. 1998]

• Object with position and velocity; control its acceleration. Reward is −(p² + a²) (dynamics sketched below)
• Shows the consequence of poor action discretization
• The explosion in the number of finite actions causes failure
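An illustrative sketch of a D-dimensional double integrator as a generative model (the time step, control bounds, and class name are assumptions, not the paper's exact setup):

```python
import numpy as np

class DoubleIntegrator:
    """D independent point masses: state = (positions p, velocities v), action = accelerations a."""
    def __init__(self, D: int, dt: float = 0.05):
        self.D, self.dt = D, dt

    def sample(self, state: np.ndarray, action: np.ndarray):
        p, v = state[:self.D], state[self.D:]
        a = np.clip(action, -1.0, 1.0)           # bounded continuous control
        v_next = v + self.dt * a
        p_next = p + self.dt * v_next
        reward = -float(p @ p + a @ a)           # the slide's -(p^2 + a^2), summed over dimensions
        return reward, np.concatenate([p_next, v_next])
```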

Page 13: Sample-based Planning for Continuous Action Markov Decision Processes

D-link Swimmer [Tassa et al. 2007]

• Swim the head from the start to the goal
• For D links, there are D−1 action dimensions and 2D+4 state dimensions
  – 5 continuous action and 16 continuous state dimensions in the most complex case
  – Difficult to get good coverage with standard RL methods
• With more dimensions, UCT fails while HOOT improves significantly

Page 14: Sample-based Planning for Continuous Action Markov Decision Processes

In the interest of full disclosure

• Bad (undirected) exploration
• Theoretical analysis is difficult (nonstationarity)
• Degenerate behavior due to vMin, vMax scaling
• UCT also has these problems

Page 15: Sample-based Planning for Continuous Action Markov Decision Processes

Conclusions

• HOOT is a planner that operates directly in continuous action spaces
  – Local solution of the MDP means costs independent of the size of the state space
  – No action discretization tuning
• Coarse discretization is not good enough even in simple MDPs, even when tuned
• Coarse discretization explodes in high dimensions, making planning almost impossible
• Future work:
  – HOOT for continuous state spaces
  – Using optimizers in place of max for continuous action RL algorithms of other forms

Page 16: Sample-based Planning for Continuous Action Markov Decision Processes

References

• Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, 2006.
• Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47, 2002.
• Kearns, M., Mansour, Y., and Ng, A. A sparse sampling algorithm for near-optimal planning in large MDPs. In IJCAI, 1999.
• Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Online optimization in X-armed bandits. In NIPS, 2008.
• Santamaría, J. C., Sutton, R., and Ram, A. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6, 1998.
• Tassa, Y., Erez, T., and Smart, W. D. Receding horizon differential dynamic programming. In Advances in Neural Information Processing Systems 21, 2007.