
Page 1

Applying Online Search Techniques to Reinforcement Learning

Scott Davies, Andrew Ng, and Andrew Moore

Carnegie Mellon University

Page 2

The Agony of Continuous State Spaces

• Learning useful value functions for continuous-state optimal control problems can be difficult
  – Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
  – Accurate value functions can be very expensive to compute even in relatively low-dimensional spaces with perfectly accurate state transition models

Page 3

Combining Value Functions With Online Search

• Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent’s current position to compensate for value function inaccuracies

• We examine two different types of search:
  – “Local” searches in which the agent performs a finite-depth look-ahead search
  – “Global” searches in which the agent searches for trajectories all the way to goal states

Page 4

Typical One-Step “Search”

Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes

R_T + γ·V(x_T)

where R_T is the reward accumulated along T, x_T is the state at the end of T, and γ is the discount factor.

This takes O(|A|) time, where A is the set of possible actions.

Given a perfect V(x), this would lead to optimal behavior.
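As a concrete illustration, a one-step greedy controller of this kind might be sketched as follows (a minimal Python sketch, not the authors' implementation; `step`, `V`, and `gamma` are assumed placeholders for a deterministic transition model, the approximate value function, and the discount factor):

```python
def one_step_greedy(x, actions, step, V, gamma=0.99):
    """Pick the action whose one-step trajectory T maximizes R_T + gamma * V(x_T)."""
    best_action, best_score = None, float("-inf")
    for a in actions:                       # O(|A|) work
        x_next, reward = step(x, a)         # model prediction for the one-step trajectory
        score = reward + gamma * V(x_next)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```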

Page 5

Local Search

• An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes R_T + γ^d·V(x_T).
  – Computational expense: O(|A|^d).

• To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times.
  – Computational expense: O(d^s·|A|^(s+1)) (considerably cheaper than a full d-step search if s << d)
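One way to read the switch-limited search is as a depth-d enumeration that carries a budget of remaining action switches. Below is a minimal recursive sketch (assuming a hypothetical deterministic model `step(x, a) -> (next_state, reward)` and value function `V`; an illustration rather than the authors' implementation):

```python
def limited_switch_search(x, actions, step, V, d, s, gamma=0.99):
    """Return (best_score, best_first_action) over all d-step trajectories
    from x that switch actions at most s times."""

    def recurse(state, depth, prev_action, switches_left, discount):
        if depth == d:
            return discount * V(state), None          # gamma^d * V(x_T)
        best_score, best_first = float("-inf"), None
        for a in actions:
            if prev_action is not None and a != prev_action:
                if switches_left == 0:
                    continue                          # would exceed the switch budget
                remaining = switches_left - 1
            else:
                remaining = switches_left
            nxt, reward = step(state, a)
            future, _ = recurse(nxt, depth + 1, a, remaining, discount * gamma)
            score = discount * reward + future
            if score > best_score:
                best_score, best_first = score, a
        return best_score, best_first

    return recurse(x, 0, None, s, 1.0)
```

Branches that would exceed the switch budget are cut immediately, which is what brings the cost down from the full O(|A|^d) enumeration.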

Page 6

Local Search: Example

• Two-dimensional state space (position + velocity)

• Car must back up to take a “running start” to make it up the hill

[Figure: hill-car domain, and its state space (Position vs. Velocity) showing a search over 20-step trajectories with at most one switch in actions]

Page 7

Using Local Search Online

Repeat:

• From the current state, consider all possible d-step trajectories T in which the action is changed at most s times

• Perform the first action in the trajectory that maximizes R_T + γ^d·V(x_T).

Let B denote the “parallel backup operator” such that

(BV)(x) = max_{a∈A} [ R(x, a) + γ·V(x′(x, a)) ]

where x′(x, a) is the state reached from x under action a.

If s = (d-1), Local Search is formally equivalent to behaving greedily with respect to the new value function B^(d-1)V.

Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
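For concreteness, one application of this backup operator to a value function represented over a finite set of sample states might look as follows (a sketch under an assumed deterministic model `step(x, a)` and an interpolating evaluator `V_of`; not the authors' code):

```python
def parallel_backup(states, actions, step, V_of, gamma=0.99):
    """Return a table of (BV)(x) = max_a [ R(x, a) + gamma * V(x'(x, a)) ]."""
    new_V = {}
    for x in states:
        best = float("-inf")
        for a in actions:
            x_next, reward = step(x, a)          # deterministic model prediction
            best = max(best, reward + gamma * V_of(x_next))
        new_V[x] = best
    return new_V
```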

Page 8

Uninformed Global Search

• Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?

• Problem: combinatorial explosion.

• Possible solution:
  – Break the state space into partitions, e.g. a uniform grid. (Can be represented sparsely.)
  – Use the previously discussed local search procedure to find trajectories between partitions
  – Prune all but the least-cost trajectory entering any given partition

Page 9

Uninformed Global Search

• Problems:
  – Still computationally expensive
  – Even with fine partitioning of the state space, pruning the wrong trajectories can cause the search to fail


Page 10

Informed Global Search

• Use approximate value function V to guide the selection of which points to search from next

• Reasonably accurate V will cause search to stay along optimal path to goal: dramatic reduction in search time

• V can help choose effective points within each partition from which to search, thereby improving solution quality

• Uninformed Global Search is the same as “Informed” Global Search with V(x) = 0

Page 11

Informed Global Search Algorithm

• Let x_0 be the current state, and g(x_0) be the grid element containing x_0

• Set g(x_0)’s “representative state” to x_0, and add g(x_0) to priority queue P with priority V(x_0)

• Until a goal state is found or P is empty:
  – Remove grid element g from the top of P. Let x denote g’s “representative state.”
  – SEARCH-FROM(g, x)

• If the goal is found, execute the trajectory; otherwise signal failure

Page 12

Informed Global Search Algorithm, cont’d

SEARCH-FROM(g, x):

• Starting from x, perform “local search” as described earlier, but prune the search wherever it reaches a different grid element g′ ≠ g.

• Each time another grid element g′ is reached at a state x′:
  – If g′ was previously SEARCHED-FROM, do nothing.
  – If g′ was never previously reached, add g′ to P with priority R_T(x_0…x′) + γ^|T|·V(x′), where T is the trajectory from x_0 to x′. Set g′’s “representative state” to x′. Record the trajectory from x to x′.
  – If g′ was previously reached but its previous priority is lower than R_T(x_0…x′) + γ^|T|·V(x′), update g′’s priority to R_T(x_0…x′) + γ^|T|·V(x′) and set its “representative state” to x′. Record the trajectory from x to x′.
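Taken together, the last two slides describe a best-first search over grid elements. A minimal Python sketch under assumed interfaces (`grid_of`, `local_search_from`, `is_goal`, and `V` are hypothetical placeholders; this is an illustration, not the authors' implementation):

```python
import heapq
from itertools import count

def informed_global_search(x0, grid_of, local_search_from, is_goal, V, gamma=0.99):
    """Expand one representative state per grid element, in order of
    decreasing priority R_T + gamma^|T| * V, until a goal is reached."""
    tie = count()                                  # tie-breaker so the heap never compares states
    rep = {grid_of(x0): (x0, [], 0.0)}             # grid element -> (rep. state, trajectory from x0, reward so far)
    best_priority = {grid_of(x0): V(x0)}
    searched = set()
    queue = [(-V(x0), next(tie), grid_of(x0))]     # min-heap on negated priority
    while queue:
        _, _, g = heapq.heappop(queue)
        if g in searched:
            continue                               # stale entry for an already-expanded element
        searched.add(g)
        x, traj_to_x, reward_to_x = rep[g]
        if is_goal(x):
            return traj_to_x                       # trajectory from x0 to the goal
        # local_search_from is assumed to yield (new_state, action_segment, segment_reward)
        # for trajectories pruned at the boundaries of other grid elements.
        for x_new, segment, seg_reward in local_search_from(x):
            g_new = grid_of(x_new)
            if g_new in searched:
                continue
            traj = traj_to_x + segment
            reward = reward_to_x + seg_reward
            priority = reward + (gamma ** len(traj)) * V(x_new)
            if priority > best_priority.get(g_new, float("-inf")):
                rep[g_new] = (x_new, traj, reward)
                best_priority[g_new] = priority
                heapq.heappush(queue, (-priority, next(tie), g_new))
    return None                                    # failure: priority queue exhausted
```

Setting V(x) = 0 everywhere in this sketch recovers the uninformed variant described earlier.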

Page 13

Informed Global Search Examples

[Figures: Hill-car search trees, using a 7×7 simplex-interpolated V (left) and a 13×13 simplex-interpolated V (right)]

Page 14

Informed Global Search as A*

• Informed Global Search is essentially an A* search using the value function V as a search heuristic

• Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal.

• Uninformed global search effectively uses the trivially optimistic heuristic V(x) = 0. Might we therefore expect better solution quality with uninformed search than with a crude, non-optimistic approximate value function V?

• Not necessarily! A crude, non-optimistic approximate value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree

Page 15

Hill-car

• Car on steep hill

• State variables: position and velocity (2-d)

• Actions: accelerate forward or backward

• Goal: park near top

• Random start states

• Cost: total time to goal

[Figure: hill-car domain]

Page 16

Acrobot

• Two-link planar robot acting in a vertical plane under gravity

• Torque applied at the elbow joint; shoulder unactuated (the system is underactuated)

• Two angular positions & their velocities (4-d)

• Goal: raise tip at least one link’s height above the shoulder

• Two actions: full torque clockwise / counterclockwise

• Random starting positions

• Cost: total time to goal

[Figure: acrobot, with its two joint angles and the goal height marked]

Page 17

Move-Cart-Pole

• Upright pole attached to cart by unactuated joint

• State: horizontal position of cart, angle of pole, and associated velocities (4-d)

• Actions: accelerate left or right

• Goal configuration: cart moved, pole balanced

• Start with random x; pole angle θ = 0

• Per-step cost quadratic in distance from goal configuration

• Big penalty if pole falls over

[Figure: cart and pole, with the goal configuration at cart position x]

Page 18

Planar Slider

• Puck sliding on bumpy 2-d surface

• Two spatial variables & their velocities (4-d)

• Actions: accelerate NW, NE, SW, or SE

• Goal in NW corner

• Random start states

• Cost: total time to goal


Page 19

Local Search Experiments

• CPU Time and Solution cost vs. search depth d

• No limits imposed on number of action switches (s=d)

• Value function: 13^4 simplex-interpolation grid

[Plots (Move-Cart-Pole): solution cost vs. search depth, and CPU time per trial (sec.) vs. search depth]

Page 20

Local Search Experiments

• CPU Time and Solution cost vs. search depth d

• Max. number of action switches fixed at 2 (s = 2)

• Value function: 7^2 simplex-interpolated value function

[Plots (Hill-car): solution cost vs. search depth, and CPU time per trial (sec.) vs. search depth]

Page 21

Comparative experiments: Hill-Car

• Local search: d=6, s=2

• Global searches:
  – Local search between grid elements: d=20, s=1
  – 50^2 search grid resolution

• 7^2 simplex-interpolated value function

Search Method            None     Local    Uninf. Glob.   Inf. Glob.
Solution Cost            187      140      FAIL           151
CPU Time/Trial (sec.)    0.02     0.36     N/A            0.14

Page 22

Hill-Car results cont’d

• Uninformed Global Search prunes the wrong trajectories

• Increase the search grid to 100^2 so this doesn’t happen:
  – Uninformed search does near-optimally
  – Informed search doesn’t: crude value function not optimistic

Search Method            Uninf. Glob. 1 (50^2)   Inf. Glob. 1 (50^2)   Uninf. Glob. 2 (100^2)   Inf. Glob. 2 (100^2)
Solution Cost            FAIL                    151                   109                      138
CPU Time/Trial (sec.)    N/A                     0.14                  0.82                     0.29

[Failed search trajectory picture goes here]

Page 23

Comparative Results: Four-d domains

All value functions: 13^4 simplex interpolations

All local searches between global search elements: depth 20, with at most 1 action switch (d=20, s=1)

• Acrobot:
  – Local Search: depth 4; no action switch restriction (d=4, s=4)
  – Global: 50^4 search grid

• Move-Cart-Pole: same as Acrobot

• Slider:
  – Local Search: depth 10; at most 1 action switch (d=10, s=1)
  – Global: 20^4 search grid

Page 24

Acrobot

• Local search significantly improves solution quality, but increases CPU time by an order of magnitude

• Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning

• Informed global search finds much better solutions in relatively little time. Value function drastically reduces search, and better pruning leads to better solutions

                  No search        Local search      Uninf. Global Search        Inf. Global Search
                  cost     time    cost      time    cost     time    #LS        cost     time    #LS
Acrobot           454      0.1     305       1.2     407      5.8     14250      198      0.47    914
Move-Cart-Pole    49993    0.66    10339     1.13    3164     3.45    7605       5073     0.64    1072
Planar Slider     212      1.9     197       52      104      94      23690      54       2       533

#LS: number of local searches performed to find paths between elements of the global search grid

Page 25

Move-Cart-Pole

• No search: pole often falls, incurring large penalties; overall poor solution quality

• Local search improves things a bit

• Uninformed search finds better solutions than informed:
  – Few grid cells in which pruning is required
  – Value function not optimistic, so informed search solutions are suboptimal

• Informed search reduces costs by an order of magnitude with no increase in required CPU time

                  No search        Local search      Uninf. Global Search        Inf. Global Search
                  cost     time    cost      time    cost     time    #LS        cost     time    #LS
Acrobot           454      0.1     305       1.2     407      5.8     14250      198      0.47    914
Move-Cart-Pole    49993    0.66    10339     1.13    3164     3.45    7605       5073     0.64    1072
Planar Slider     212      1.9     197       52      104      94      23690      54       2       533

Page 26

Planar Slider

                  No search        Local search      Uninf. Global Search        Inf. Global Search
                  cost     time    cost      time    cost     time    #LS        cost     time    #LS
Acrobot           454      0.1     305       1.2     407      5.8     14250      198      0.47    914
Move-Cart-Pole    49993    0.66    10339     1.13    3164     3.45    7605       5073     0.64    1072
Planar Slider     212      1.9     197       52      104      94      23690      54       2       533

• Local search almost useless, and incurs massive CPU expense

• Uninformed search decreases solution cost by 50%, but at even greater CPU expense

• Informed search decreases solution cost by a factor of 4, with no increase in CPU time

Page 27

Using Search with Learned Models

• Toy Example: Hill-Car
  – 7^2 simplex-interpolated value function
  – One nearest-neighbor function approximator per possible action used to learn dx/dt
  – States sufficiently far away from their nearest neighbor optimistically assumed to be absorbing, to encourage exploration (a minimal model sketch follows below)

• Average costs over first few hundred trials:
  – No search: 212
  – Local search: 127
  – Informed global search: 155
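To make the learned-model setup concrete, here is a minimal sketch of the kind of model described above (an assumption-laden illustration, not the authors' code): one nearest-neighbor approximator per action storing observed (state, dx/dt) pairs, with states far from every stored point reported as absorbing so that a planner can treat them optimistically.

```python
import numpy as np

class NearestNeighborModel:
    """One nearest-neighbor dx/dt approximator per action (illustrative sketch).

    States farther than `optimism_radius` from every stored data point for an
    action are reported as absorbing, which a planner can treat optimistically
    to encourage exploration of unvisited regions."""

    def __init__(self, optimism_radius=0.5):
        self.data = {}                        # action -> list of (state, dxdt) pairs
        self.optimism_radius = optimism_radius

    def observe(self, state, action, dxdt):
        self.data.setdefault(action, []).append((np.asarray(state), np.asarray(dxdt)))

    def predict(self, state, action):
        """Return (dxdt, is_absorbing) using the nearest stored neighbor."""
        pairs = self.data.get(action, [])
        if not pairs:
            return None, True                 # nothing learned yet: treat as absorbing
        state = np.asarray(state)
        dists = [np.linalg.norm(state - s) for s, _ in pairs]
        i = int(np.argmin(dists))
        if dists[i] > self.optimism_radius:
            return None, True                 # too far from data: optimistic absorbing state
        return pairs[i][1], False
```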

Page 28

Using Search with Learned Models

• Problems do arise when using learned models:
  – Inaccuracies in the model may cause global searches to fail. It is then unclear whether the failure should be blamed on model inaccuracies or on insufficiently fine state-space partitioning
  – Trajectories found will be inaccurate
    • Need an adaptive closed-loop controller
    • Fortunately, we will get new data with which to increase the accuracy of our model
  – Model approximators must be fast and accurate

Page 29

Avenues for Future Research

• Extensions to nondeterministic systems?

• Higher-dimensional problems

• Better function approximators for model learning

• Variable-resolution search grids

• Optimistic value function generation?