
Applying Online Search Techniques to Reinforcement Learning

Scott Davies, Andrew Ng, and Andrew Moore

Carnegie Mellon University

The Agony of Continuous State Spaces

• Learning useful value functions for continuous-state optimal control problems can be difficult
 – Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
 – Accurate value functions can be very expensive to compute, even in relatively low-dimensional spaces with perfectly accurate state transition models

Combining Value Functions With Online Search

• Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent’s current position to compensate for value function inaccuracies

• We examine two different types of search:
 – "Local" searches, in which the agent performs a finite-depth look-ahead search
 – "Global" searches, in which the agent searches for trajectories all the way to goal states

Typical One-Step “Search”

Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes

    R_T + γ V(x_T)

where R_T is the reward accumulated along T, x_T is the state at the end of T, and γ is the discount factor.

This takes O(|A|) time, where A is the set of possible actions.

Given a perfect V(x), this would lead to optimal behavior.
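A minimal sketch of this one-step selection, assuming a deterministic model step(x, a) that returns the predicted next state and reward, and a callable value-function approximator V (both names are placeholders):

    def one_step_greedy(x, actions, step, V, gamma):
        """Pick the action whose one-step trajectory maximizes R_T + gamma * V(x_T)."""
        best_action, best_score = None, float("-inf")
        for a in actions:                        # O(|A|) model evaluations
            x_next, reward = step(x, a)          # model predicts where action a takes us
            score = reward + gamma * V(x_next)
            if score > best_score:
                best_action, best_score = a, score
        return best_action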

Local Search

• An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes R_T + γ^d V(x_T).
 – Computational expense: O(|A|^d).

• To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times.
 – Computational expense: O(d^s |A|^(s+1)), considerably cheaper than a full d-step search if s << d.
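A sketch of the limited-switch look-ahead, under the same assumed deterministic step model; the recursion only follows action sequences with at most s switches, which is what yields the O(d^s |A|^(s+1)) bound above:

    def local_search(x0, actions, step, V, gamma, d, s):
        """Best d-step trajectory from x0 with at most s action switches,
        scored by R_T + gamma**d * V(x_T) (R_T = discounted reward along T).
        Returns (score, first action of the best trajectory)."""
        def recurse(x, depth, prev_a, switches_left, discount, reward_so_far):
            if depth == d:
                return reward_so_far + discount * V(x), None
            best_score, best_first = float("-inf"), None
            for a in actions:
                switched = prev_a is not None and a != prev_a
                if switched and switches_left == 0:
                    continue                     # would exceed the switch budget
                x_next, r = step(x, a)
                score, _ = recurse(x_next, depth + 1, a,
                                   switches_left - (1 if switched else 0),
                                   discount * gamma, reward_so_far + discount * r)
                if score > best_score:
                    best_score, best_first = score, a
            return best_score, best_first
        return recurse(x0, 0, None, s, 1.0, 0.0)

With s = d-1 this enumerates every d-step action sequence; the hill-car example that follows uses d = 20, s = 1.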

Local Search: Example

• Two-dimensional state space (position + velocity)

• Car must back up to take a "running start" to make it up the hill

[Figure: hill-car search trajectories in the (position, velocity) state space. Search over 20-step trajectories with at most one switch in actions.]

Using Local Search Online

Repeat:

• From the current state, consider all possible d-step trajectories T in which the action is changed at most s times
• Perform the first action of the trajectory that maximizes R_T + γ^d V(x_T)

Let B denote the "parallel backup operator" such that

    (BV)(x) = max_{a ∈ A} [ R(x, a) + γ V(x_a) ]

where x_a is the state the model predicts after taking action a from x.

If s = (d-1), Local Search is formally equivalent to behaving greedily with respect to the new value function B^(d-1) V.

Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
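A sketch of the parallel backup operator under the same assumed deterministic step model; evaluating it to depth d-1 at the successor states is what "behaving greedily with respect to B^(d-1) V" amounts to:

    def backed_up_value(V, x, actions, step, gamma, k):
        """Evaluate (B^k V)(x): k nested backups of
        (BV)(x) = max_a [ R(x, a) + gamma * V(next state) ]."""
        if k == 0:
            return V(x)
        values = []
        for a in actions:
            x_next, r = step(x, a)    # deterministic model of next state and reward
            values.append(r + gamma * backed_up_value(V, x_next, actions, step, gamma, k - 1))
        return max(values)

Choosing argmax over a of R(x, a) + γ · backed_up_value(V, x_next, actions, step, γ, d-1) then reproduces the full d-step local search with s = d-1.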

Uninformed Global Search

• Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?

• Problem: combinatorial explosion.

• Possible solution:
 – Break the state space into partitions, e.g. a uniform grid. (Can be represented sparsely; a sketch follows below.)
 – Use the previously discussed local search procedure to find trajectories between partitions
 – Prune all but the least-cost trajectory entering any given partition
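One simple way to realize such a sparse uniform partition; the bounds and resolution arguments below are assumptions, not a prescribed implementation:

    def grid_cell(x, lower, upper, cells_per_dim):
        """Map a continuous state vector x to a tuple of integer indices in a
        uniform grid over the per-dimension bounds [lower, upper]."""
        cell = []
        for xi, lo, hi in zip(x, lower, upper):
            idx = int((xi - lo) / (hi - lo) * cells_per_dim)
            cell.append(min(max(idx, 0), cells_per_dim - 1))   # clamp onto the grid
        return tuple(cell)

    # Sparse representation: only partitions the search actually reaches get entries.
    best_entry_into_cell = {}   # cell tuple -> (best trajectory entering that cell, its cost)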

Uninformed Global Search

• Problems:
 – Still computationally expensive
 – Even with fine partitioning of the state space, pruning the wrong trajectories can cause the search to fail


Informed Global Search

• Use approximate value function V to guide the selection of which points to search from next

• Reasonably accurate V will cause search to stay along optimal path to goal: dramatic reduction in search time

• V can help choose effective points within each partition from which to search, thereby improving solution quality

• Uninformed Global Search is the same as "Informed" Global Search with V(x) = 0

Informed Global Search Algorithm

• Let x0 be the current state, and g(x0) the grid element containing x0

• Set g(x0)’s “representative state” to x0, and add g(x0) to priority queue P with priority V(x0)

• Until a goal state is found or P is empty:
 – Remove grid element g from the top of P. Let x denote g's "representative state."

– SEARCH-FROM(g, x)

• If goal found, execute trajectory; otherwise signal failure

Informed Global Search Algorithm, cont’d

SEARCH-FROM(g, x):
• Starting from x, perform the "local search" described earlier, but prune the search wherever it reaches a different grid element g′ ≠ g.
• Each time another grid element g′ is reached at some state x′:
 – If g′ was previously SEARCHED-FROM, do nothing.
 – If g′ was never previously reached, add g′ to P with priority R_T(x0…x′) + γ^|T| V(x′), where T is the trajectory from x0 to x′. Set the "representative state" of g′ to x′. Record the trajectory from x to x′.
 – If g′ was previously reached but its previous priority is lower than R_T(x0…x′) + γ^|T| V(x′), update the priority of g′ to R_T(x0…x′) + γ^|T| V(x′) and set its "representative state" to x′. Record the trajectory from x to x′.
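A compact sketch of the algorithm on the last two slides, using Python's heapq as the priority queue P (priorities are negated because heapq is a min-heap, while the algorithm pops the largest R_T + γ^|T| V value). The interface of search_between, which stands in for SEARCH-FROM's inner local search and yields one entry per neighboring partition reached, is an assumption:

    import heapq

    def informed_global_search(x0, V, gamma, grid_cell, search_between, is_goal):
        """A*-style search over grid partitions, guided by the approximate value function V.
        search_between(x) should run the limited local search from x and yield, for each
        neighboring partition entered, (x_new, segment_reward, segment_steps, segment_traj)."""
        start = grid_cell(x0)
        # cell -> (priority, representative state, discounted reward from x0, steps from x0, trajectory)
        info = {start: (V(x0), x0, 0.0, 0, [])}
        searched = set()
        pq = [(-V(x0), start)]                    # heapq is a min-heap, so negate priorities
        while pq:
            _, g = heapq.heappop(pq)
            if g in searched:                     # already SEARCHED-FROM: do nothing
                continue
            searched.add(g)
            _, x, reward_so_far, steps_so_far, traj = info[g]
            if is_goal(x):
                return traj                       # trajectory from x0 to the goal: execute it
            for x_new, seg_reward, seg_steps, seg_traj in search_between(x):
                g_new = grid_cell(x_new)
                if g_new in searched:
                    continue
                reward = reward_so_far + gamma ** steps_so_far * seg_reward
                steps = steps_so_far + seg_steps
                priority = reward + gamma ** steps * V(x_new)   # R_T(x0...x') + gamma^|T| V(x')
                if g_new not in info or priority > info[g_new][0]:
                    info[g_new] = (priority, x_new, reward, steps, traj + seg_traj)
                    heapq.heappush(pq, (-priority, g_new))
        return None                               # P exhausted: signal failure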

Informed Global Search Examples

[Figures: hill-car search trees, using a 7×7 simplex-interpolated V and a 13×13 simplex-interpolated V.]

Informed Global Search as A*

• Informed Global Search is essentially an A* search using the value function V as a search heuristic

• Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal.

• Uninformed global search effectively uses the trivially optimistic heuristic V(x) = 0. Might we therefore expect better solution quality from uninformed search than from a crude, non-optimistic approximate value function V?

• Not necessarily! A crude, non-optimistic approximate value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree

Hill-car

• Car on a steep hill
• State variables: position and velocity (2-d)
• Actions: accelerate forward or backward
• Goal: park near the top
• Random start states
• Cost: total time to goal

[Figure: the hill-car domain.]

Acrobot

• Two-link planar robot acting in a vertical plane under gravity
• Underactuated: torque applied only at the elbow joint; shoulder unactuated
• Two angular positions & their velocities (4-d)
• Goal: raise the tip at least one link's height above the shoulder
• Two actions: full torque clockwise / counterclockwise
• Random starting positions
• Cost: total time to goal

[Figure: the Acrobot, with its two joint angles and the goal height marked.]

Move-Cart-Pole

• Upright pole attached to cart by unactuated joint

• State: horizontal position of cart, angle of pole, and associated velocities (4-d)

• Actions: accelerate left or right

• Goal configuration: cart moved, pole balanced

• Start with random x; θ = 0
• Per-step cost quadratic in distance from the goal configuration (illustrated below)
• Big penalty if the pole falls over

[Figure: the cart-pole, with cart position x and the goal configuration marked.]
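Purely as an illustration of the shape of this cost function: the goal position, fall threshold, and penalty below are assumed placeholder values, not taken from the slides:

    def step_cost(x, theta, x_goal=1.0, fall_angle=0.5, fall_penalty=1000.0):
        """Hypothetical quadratic per-step cost for Move-Cart-Pole; every constant
        here (goal position, fall threshold, penalty) is an assumed placeholder."""
        if abs(theta) > fall_angle:          # pole has fallen over
            return fall_penalty
        return (x - x_goal) ** 2 + theta ** 2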

Planar Slider

• Puck sliding on bumpy 2-d surface

• Two spatial variables & their velocities (4-d)

• Actions: accelerate NW, NE, SW, or SE

• Goal in the NW corner
• Random start states
• Cost: total time to goal

[Figure: the bumpy planar slider surface.]

Local Search Experiments

• CPU Time and Solution cost vs. search depth d

• No limits imposed on number of action switches (s=d)

• Value function: 13^4 simplex-interpolated grid

[Plots: solution cost vs. search depth and CPU time/trial (sec.) vs. search depth for Move-Cart-Pole.]

Local Search Experiments

• CPU Time and Solution cost vs. search depth d

• Max. number of action switches fixed at 2 (s = 2)

• Value function: 7^2 simplex-interpolated value function

[Plots: solution cost vs. search depth and CPU time/trial (sec.) vs. search depth for Hill-car.]

Comparative experiments: Hill-Car

• Local search: d=6, s=2

• Global searches:
 – Local search between grid elements: d=20, s=1
 – 50^2 search grid resolution
• 7^2 simplex-interpolated value function

Search Method          None    Local   Uninf. Glob.   Inf. Glob.
Solution Cost          187     140     FAIL           151
CPU Time/Trial (sec.)  0.02    0.36    N/A            0.14

Hill-Car results cont’d

• Uninformed Global Search prunes the wrong trajectories
• Increase the search grid to 100^2 so this doesn't happen:
 – Uninformed does near-optimal
 – Informed doesn't: crude value function is not optimistic

Search Method          Uninf. Glob. 1   Inf. Glob. 1   Uninf. Glob. 2   Inf. Glob. 2
Solution Cost          FAIL             151            109              138
CPU Time/Trial (sec.)  N/A              0.14           0.82             0.29

(1 = 50^2 search grid, 2 = 100^2 search grid)

[Failed search trajectory picture goes here]

Comparative Results: Four-d domains

All value functions: 13^4 simplex interpolations
All local searches between global search grid elements: depth 20, with at most 1 action switch (d=20, s=1)

• Acrobot:
 – Local Search: depth 4; no action switch restriction (d=4, s=4)
 – Global: 50^4 search grid
• Move-Cart-Pole: same as Acrobot
• Slider:
 – Local Search: depth 10; max. 1 action switch (d=10, s=1)
 – Global: 20^4 search grid

Acrobot

• Local search significantly improves solution quality, but increases CPU time by an order of magnitude

• Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning

• Informed global search finds much better solutions in relatively little time. Value function drastically reduces search, and better pruning leads to better solutions

                   No search       Local search     Uninf. Global Search     Inf. Global Search
                   cost     time   cost     time    cost    time    #LS      cost    time    #LS
Acrobot            454      0.1    305      1.2     407     5.8     14250    198     0.47    914
Move-Cart-Pole     49993    0.66   10339    1.13    3164    3.45    7605     5073    0.64    1072
Planar Slider      212      1.9    197      52      104     94      23690    54      2       533

#LS: number of local searches performed to find paths between elements of the global search grid

Move-Cart-Pole

• No search: pole often falls, incurring large penalties; overall poor solution quality

• Local search improves things a bit

• Uninformed search finds better solutions than informed
 – Few grid cells in which pruning is required
 – Value function not optimistic, so informed search solutions suboptimal

• Informed search reduces costs by an order of magnitude with no increase in required CPU time

                   No search       Local search     Uninf. Global Search     Inf. Global Search
                   cost     time   cost     time    cost    time    #LS      cost    time    #LS
Acrobot            454      0.1    305      1.2     407     5.8     14250    198     0.47    914
Move-Cart-Pole     49993    0.66   10339    1.13    3164    3.45    7605     5073    0.64    1072
Planar Slider      212      1.9    197      52      104     94      23690    54      2       533

Planar Slider

                   No search       Local search     Uninf. Global Search     Inf. Global Search
                   cost     time   cost     time    cost    time    #LS      cost    time    #LS
Acrobot            454      0.1    305      1.2     407     5.8     14250    198     0.47    914
Move-Cart-Pole     49993    0.66   10339    1.13    3164    3.45    7605     5073    0.64    1072
Planar Slider      212      1.9    197      52      104     94      23690    54      2       533

• Local search almost useless, and incurs massive CPU expense

• Uninformed search decreases solution cost by 50%, but at even greater CPU expense

• Informed search decreases solution cost by factor of 4, at no increase in CPU time

Using Search with Learned Models

• Toy Example: Hill-Car
 – 7^2 simplex-interpolated value function
 – One nearest-neighbor function approximator per possible action, used to learn dx/dt (sketched below)
 – States sufficiently far away from the nearest neighbor are optimistically assumed to be absorbing, to encourage exploration

• Average costs over the first few hundred trials:
 – No search: 212
 – Local search: 127
 – Informed global search: 155
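A sketch of the per-action nearest-neighbor model described above; the treatment of far-away states as optimistically absorbing follows the slide, but the concrete interface and the distance threshold are assumptions:

    import numpy as np

    class NearestNeighborModel:
        """One model per action: predict dx/dt from the nearest stored transition.
        Queries farther than `radius` from every stored sample are reported as
        unknown so the planner can treat them as optimistically absorbing."""
        def __init__(self, radius=0.1):          # radius is an assumed constant
            self.radius = radius
            self.states, self.derivs = [], []

        def add(self, x, dxdt):
            self.states.append(np.asarray(x, dtype=float))
            self.derivs.append(np.asarray(dxdt, dtype=float))

        def predict(self, x):
            """Return (dx/dt estimate, known). known=False means 'too far from data'."""
            if not self.states:
                return None, False
            dists = [np.linalg.norm(np.asarray(x, dtype=float) - s) for s in self.states]
            i = int(np.argmin(dists))
            if dists[i] > self.radius:
                return None, False               # treated as optimistically absorbing
            return self.derivs[i], True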

Using Search with Learned Models

• Problems do arise when using learned models:
 – Inaccuracies in the models may cause global searches to fail. It is then unclear whether the failure should be blamed on model inaccuracies or on insufficiently fine state space partitioning
 – Trajectories found will be inaccurate
  • Need an adaptive closed-loop controller
  • Fortunately, we will get new data with which to increase the accuracy of our model
 – Model approximators must be fast and accurate

Avenues for Future Research

• Extensions to nondeterministic systems?

• Higher-dimensional problems

• Better function approximators for model learning

• Variable-resolution search grids

• Optimistic value function generation?
