Applying Online Search Techniques to
Reinforcement Learning
Scott Davies, Andrew Ng, and Andrew Moore
Carnegie Mellon University
The Agony of Continuous State Spaces
• Learning useful value functions for continuous-state optimal control problems can be difficult
– Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
– Accurate value functions can be very expensive to compute even in relatively low-dimensional spaces with perfectly accurate state transition models
Combining Value Functions With Online Search
• Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent’s current position to compensate for value function inaccuracies
• We examine two different types of search:
– “Local” searches in which the agent performs a finite-depth look-ahead search
– “Global” searches in which the agent searches for trajectories all the way to goal states
Typical One-Step “Search”
Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes
$R_T + \gamma V(x_T)$
where $R_T$ is the reward accumulated along T, $x_T$ is the state at the end of T, and $\gamma$ is the discount factor.
This takes O(|A|) time, where A is the set of possible actions.
Given a perfect V(x), this would lead to optimal behavior.
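The one-step rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `model(x, a)` is a hypothetical deterministic model returning the next state and the reward, and the toy model and value function below are made up for the example.

```python
def one_step_greedy(x, actions, model, V, gamma):
    """Return the action maximizing r + gamma * V(x_next) under the model."""
    def score(a):
        x_next, r = model(x, a)          # predicted one-step trajectory
        return r + gamma * V(x_next)     # R_T + gamma * V(x_T)
    return max(actions, key=score)       # one model call per action: O(|A|)


# Hypothetical toy problem: walk along a line toward the origin.
def toy_model(x, a):
    return x + a, -1.0                   # every step costs one unit of reward


def toy_V(x):
    return -abs(x)                       # crude guess: value is minus distance


best = one_step_greedy(3.0, [-1.0, 1.0], toy_model, toy_V, gamma=0.9)
```

With an accurate V this is all the controller needs; the searches discussed next are aimed at the case where V is only approximate.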
Local Search
• An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes $R_T + \gamma^d V(x_T)$.
– Computational expense: $O(|A|^d)$.
• To make deeper searches more computationally tractable, we can limit agent to considering only trajectories in which the action is switched at most s times.
– Computational expense: $O(d^s |A|^{s+1})$ trajectories to evaluate (considerably cheaper than a full d-step search if s << d)
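To make the saving concrete, the sketch below enumerates the action sequences of length d with at most s switches and counts them in closed form. The function names are illustrative, not from the original work. The count follows from choosing k switch points among the d-1 step boundaries, |A| choices for the first action, and |A|-1 choices at each switch.

```python
from math import comb


def limited_switch_sequences(actions, d, s):
    """Yield every length-d action sequence with at most s action changes."""
    def extend(prefix, switches_left):
        if len(prefix) == d:
            yield tuple(prefix)
            return
        for a in actions:
            if not prefix or a == prefix[-1]:
                yield from extend(prefix + [a], switches_left)      # no switch used
            elif switches_left > 0:
                yield from extend(prefix + [a], switches_left - 1)  # spend one switch
    yield from extend([], s)


def count_limited(n_actions, d, s):
    """Closed-form count: sum over k <= s of C(d-1, k) * |A| * (|A|-1)^k."""
    return sum(comb(d - 1, k) * n_actions * (n_actions - 1) ** k
               for k in range(s + 1))


n_enumerated = len(list(limited_switch_sequences([0, 1], d=5, s=1)))
n_limited = count_limited(2, d=20, s=1)   # two actions, depth 20, one switch
n_full = 2 ** 20                          # unrestricted 20-step search
```

For two actions at depth 20 with one switch allowed, the restricted search has 40 candidate sequences against 1,048,576 for the full search. When s = d-1 the closed form reduces to |A|^d, as expected.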
Local Search: Example
• Two-dimensional state space (position + velocity)
• Car must back up to take “running start” to make it
[Figure: hillcar.fig]
[Figure: search over 20-step trajectories with at most one switch in actions; axes: Position, Velocity]
Using Local Search Online
Repeat:
• From current state, consider all possible d-step trajectories T in which the action is changed at most s times
• Perform the first action in the trajectory that maximizes $R_T + \gamma^d V(x_T)$.
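A minimal sketch of one pass of this loop, assuming a deterministic model `model(x, a)` that returns the next state and reward, and assuming rewards inside a trajectory are discounted step by step. All names and the toy problem are hypothetical. The toy value function is deliberately misleading, so that the one-step choice and the deeper search disagree.

```python
def limited_switch_sequences(actions, d, s):
    """Yield every length-d action sequence with at most s action changes."""
    def extend(prefix, switches_left):
        if len(prefix) == d:
            yield tuple(prefix)
            return
        for a in actions:
            if not prefix or a == prefix[-1]:
                yield from extend(prefix + [a], switches_left)
            elif switches_left > 0:
                yield from extend(prefix + [a], switches_left - 1)
    yield from extend([], s)


def trajectory_score(x, seq, model, V, gamma):
    """Discounted reward along the trajectory plus gamma^d * V(end state)."""
    total, discount = 0.0, 1.0
    for a in seq:
        x, r = model(x, a)
        total += discount * r
        discount *= gamma
    return total + discount * V(x)


def local_search_action(x, actions, model, V, gamma, d, s):
    """Return the first action of the best d-step, at-most-s-switch trajectory."""
    best = max(limited_switch_sequences(actions, d, s),
               key=lambda seq: trajectory_score(x, seq, model, V, gamma))
    return best[0]


# Hypothetical toy problem: reward 10 for arriving at x == 3, while the
# (wrong) value function prefers moving left.
def toy_model(x, a):
    x_next = x + a
    return x_next, (10.0 if x_next == 3 else 0.0)


def bad_V(x):
    return -float(x)


greedy = local_search_action(0, [-1, 1], toy_model, bad_V, 0.9, d=1, s=0)
deeper = local_search_action(0, [-1, 1], toy_model, bad_V, 0.9, d=3, s=0)
```

Here the depth-1 choice follows the misleading value function to the left, while the depth-3 search sees the reward and moves right. In the online setting the agent would execute that first action, observe the new state, and search again.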
Let B denote the “parallel backup operator” such that
$BV(x) = \max_{a \in A} \left[ R(x,a) + \gamma V(\delta(x,a)) \right]$
where $\delta(x,a)$ denotes the state reached by taking action a in state x.
If s = (d-1), Local Search is formally equivalent to behaving greedily with respect to the new value function $B^{d-1}V$.
Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
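The equivalence can be checked by hand for d = 2. The derivation below is a sketch under two assumptions that the slides do not spell out: the model is deterministic, with $\delta(x,a)$ standing for the state reached by taking action a in state x, and rewards inside a trajectory are discounted, so that $R_T = R(x,a_0) + \gamma R(x_1,a_1)$ with $x_1 = \delta(x,a_0)$ and $x_2 = \delta(x_1,a_1)$.

```latex
\begin{align*}
\max_{a_0, a_1} \bigl[ R(x, a_0) + \gamma R(x_1, a_1) + \gamma^2 V(x_2) \bigr]
  &= \max_{a_0} \Bigl[ R(x, a_0)
     + \gamma \max_{a_1} \bigl[ R(x_1, a_1) + \gamma V(\delta(x_1, a_1)) \bigr] \Bigr] \\
  &= \max_{a_0} \bigl[ R(x, a_0) + \gamma \, (BV)(x_1) \bigr]
\end{align*}
```

The last line is exactly the one-step greedy rule applied to BV instead of V, that is, to $B^{d-1}V$ with d = 2. Repeating the same step for the innermost action extends the argument to larger d.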
Uninformed Global Search
• Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?
• Problem: combinatorial explosion.
• Possible solution:
– Break state space into partitions, e.g. a uniform grid. (Can be represented sparsely.)
– Use previously discussed local search procedure to find trajectories between partitions
– Prune all but least-cost trajectory entering any given partition
Uninformed Global Search
• Problems:
– Still computationally expensive
– Even with fine partitioning of state space, pruning the wrong trajectories can cause search to fail
Informed Global Search
• Use approximate value function V to guide the selection of which points to search from next
• Reasonably accurate V will cause search to stay along optimal path to goal: dramatic reduction in search time
• V can help choose effective points within each partition from which to search, thereby improving solution quality
• Uninformed Global Search is the same as “Informed” Global Search with V(x) = 0
Informed Global Search Algorithm
• Let x0 be current state, and g(x0) be the grid element containing x0
• Set g(x0)’s “representative state” to x0, and add g(x0) to priority queue P with priority V(x0)
• Until goal state found or P empty:
– Remove grid element g from top of P. Let x denote g’s “representative state.”
– SEARCH-FROM(g, x)
• If goal found, execute trajectory; otherwise signal failure
Informed Global Search Algorithm, cont’d
SEARCH-FROM(g, x):
• Starting from x, perform “local search” as described earlier, but prune the search wherever it reaches a different grid element g' ≠ g.
• Each time another grid element g' is reached at state x':
– If g' previously SEARCHED-FROM, do nothing.
– If g' never previously reached, add g' to P with priority $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$, where T is the trajectory from x0 to x'. Set g'’s “representative state” to x'. Record trajectory from x to x'.
– If g' previously reached but its previous priority is lower than $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$, update g'’s priority to that value and set its “representative state” to x'. Record trajectory from x to x'.
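The procedure can be sketched in Python as below. This is a simplified illustration, not the authors' implementation. The inner “local search” is reduced to holding each action constant (s = 0) for up to d steps, priority updates use lazy deletion on a `heapq` min-heap with negated priorities, and the 2-d toy domain (`model`, `cell_of`, `is_goal`, `V`) is made up for the example.

```python
import heapq
import math


def informed_global_search(x0, actions, model, V, is_goal, cell_of,
                           gamma=1.0, d=20, max_expansions=10_000):
    """Best-first search over grid cells guided by V; returns an action list or None."""
    g0 = cell_of(x0)
    # Per cell: (priority, representative state, return so far, steps, actions from x0)
    best = {g0: (V(x0), x0, 0.0, 0, [])}
    heap = [(-V(x0), g0)]                 # negated: heapq pops the smallest item
    searched = set()
    expansions = 0
    while heap and expansions < max_expansions:
        neg_pri, g = heapq.heappop(heap)
        if g in searched or -neg_pri != best[g][0]:
            continue                      # stale queue entry left by a later update
        searched.add(g)
        expansions += 1
        _, x, ret, steps, path = best[g]
        # SEARCH-FROM(g, x), with the local search simplified to s = 0.
        for a in actions:
            xi, ri, ni, pi = x, ret, steps, list(path)
            for _ in range(d):
                xi, r = model(xi, a)
                ri += (gamma ** ni) * r
                ni += 1
                pi.append(a)
                if is_goal(xi):
                    return pi
                gi = cell_of(xi)
                if gi != g:               # prune on entering a different cell
                    pri = ri + (gamma ** ni) * V(xi)
                    if gi not in searched and (gi not in best or best[gi][0] < pri):
                        best[gi] = (pri, xi, ri, ni, pi)
                        heapq.heappush(heap, (-pri, gi))
                    break
    return None


# Hypothetical toy domain: a point moving on the plane in steps of 0.25.
STEP = 0.25
ACTIONS = [(STEP, 0.0), (-STEP, 0.0), (0.0, STEP), (0.0, -STEP)]


def model(x, a):
    return (x[0] + a[0], x[1] + a[1]), -1.0      # reward of -1 per step


def cell_of(x):
    return (math.floor(x[0]), math.floor(x[1]))  # unit grid, stored sparsely


def is_goal(x):
    return x[0] >= 3.0 and x[1] >= 3.0


def V(x):
    return -(max(0.0, 3.0 - x[0]) + max(0.0, 3.0 - x[1])) / STEP


plan = informed_global_search((0.5, 0.5), ACTIONS, model, V, is_goal, cell_of)
```

Passing `lambda x: 0.0` in place of `V` gives the uninformed variant of the same search. The dictionary `best` holds only the cells that have been reached, which is one way to represent the grid sparsely.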
Informed Global Search Examples
[Figures: hill-car search trees, one with a 7*7 simplex-interpolated V and one with a 13*13 simplex-interpolated V]
Informed Global Search as A*
• Informed Global Search is essentially an A* search using the value function V as a search heuristic
• Using A* with an optimistic heuristic function normally guarantees optimal path to the goal.
• Uninformed global search effectively uses the trivially optimistic heuristic V(x) = 0. Might we expect better solution quality with uninformed search than with a crude, non-optimistic approximate value function V?
• Not necessarily! A crude approximate non-optimistic value function can improve solution quality by helping the algorithm avoid pruning wrong parts of search tree
Hill-car
• Car on steep hill
• State variables: position and velocity (2-d)
• Actions: accelerate forward or backward
• Goal: park near top
• Random start states
• Cost: total time to goal
[Figure: hillcar.fig]
Acrobot
• Two-link planar robot acting in vertical plane under gravity
• Underactuated joint at elbow; unactuated shoulder
• Two angular positions & their velocities (4-d)
• Goal: raise tip at least one link’s height above shoulder
• Two actions: full torque clockwise / counterclockwise
• Random starting positions
• Cost: total time to goal
[Figure: acrobot with goal height marked]
Move-Cart-Pole
• Upright pole attached to cart by unactuated joint
• State: horizontal position of cart, angle of pole, and associated velocities (4-d)
• Actions: accelerate left or right
• Goal configuration: cart moved, pole balanced
• Start with random x; θ = 0
• Per-step cost quadratic in distance from goal configuration
• Big penalty if pole falls over
[Figure: cart and pole, with cart position x and the goal configuration marked]
Planar Slider
• Puck sliding on bumpy 2-d surface
• Two spatial variables & their velocities (4-d)
• Actions: accelerate NW, NE, SW, or SE
• Goal in NW corner
• Random start states
• Cost: total time to goal
Local Search Experiments
• CPU Time and Solution cost vs. search depth d
• No limits imposed on number of action switches (s=d)
• Value function: $13^4$ simplex-interpolation grid
[Plots, Move-Cart-Pole: solution cost vs. search depth (depths 1 to 10); CPU time per trial (sec.) vs. search depth (depths 1 to 10)]
Local Search Experiments
• CPU Time and Solution cost vs. search depth d
• Max. number of action switches fixed at 2 (s = 2)
• Value function: $7^2$ simplex-interpolation grid
[Plots, Hill-car: solution cost vs. search depth (depths 0 to 30); CPU time per trial (sec.) vs. search depth (depths 0 to 30)]
Comparative experiments: Hill-Car
• Local search: d=6, s=2
• Global searches:
– Local search between grid elements: d=20, s=1
– $50^2$ search grid resolution
• $7^2$ simplex-interpolated value function

Search Method     None    Local   Uninf. Glob.   Inf. Glob.
Solution Cost     187     140     FAIL           151
CPU Time/Trial    0.02    0.36    N/A            0.14
Hill-Car results cont’d
• Uninformed Global Search prunes wrong trajectories
• Increase search grid to $100^2$ so this doesn’t happen:
– Uninformed does near-optimal
– Informed doesn’t: crude value function not optimistic

Search Method     Uninf. Glob. 1   Inf. Glob. 1   Uninf. Glob. 2   Inf. Glob. 2
Solution Cost     FAIL             151            109              138
CPU Time/Trial    N/A              0.14           0.82             0.29

[Figure placeholder: failed search trajectory picture goes here]
Comparative Results: Four-d domains
All value functions: $13^4$ simplex interpolations
All local searches between global search elements: depth 20, with at most 1 action switch (d=20, s=1)
• Acrobot:
– Local Search: depth 4; no action switch restriction (d=4, s=4)
– Global: $50^4$ search grid
• Move-Cart-Pole: same as Acrobot
• Slider:
– Local Search: depth 10; max. 1 action switch (d=10, s=1)
– Global: $20^4$ search grid
Acrobot
• Local search significantly improves solution quality, but increases CPU time by order of magnitude
• Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning
• Informed global search finds much better solutions in relatively little time. Value function drastically reduces search, and better pruning leads to better solutions
                  No search       Local search    Uninf. Global Search    Inf. Global Search
                  cost    time    cost    time    cost   time   #LS       cost   time   #LS
Acrobot           454     0.1     305     1.2     407    5.8    14250     198    0.47   914
Move-Cart-Pole    49993   0.66    10339   1.13    3164   3.45   7605      5073   0.64   1072
Planar Slider     212     1.9     197     52      104    94     23690     54     2      533

#LS: number of local searches performed to find paths between elements of global search grid
Move-Cart-Pole
• No search: pole often falls, incurring large penalties; overall poor solution quality
• Local search improves things a bit
• Uninformed search finds better solutions than informed
– Few grid cells in which pruning is required
– Value function not optimistic, so informed search solutions suboptimal
• Informed search reduces costs by order of magnitude with no increase in required CPU time
Planar Slider
• Local search almost useless, and incurs massive CPU expense
• Uninformed search decreases solution cost by 50%, but at even greater CPU expense
• Informed search decreases solution cost by factor of 4, at no increase in CPU time
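The improvement factors quoted on these results slides can be reproduced from the solution costs in the results table. The short script below is only an arithmetic check. The numbers are copied from the table, and the variable and function names are illustrative.

```python
# (no-search cost, uninformed global cost, informed global cost) from the results table
costs = {
    "Acrobot": (454, 407, 198),
    "Move-Cart-Pole": (49993, 3164, 5073),
    "Planar Slider": (212, 104, 54),
}


def improvement_factors(table):
    """Return {domain: (no-search / uninformed, no-search / informed)} cost ratios."""
    return {name: (base / uninf, base / inf)
            for name, (base, uninf, inf) in table.items()}


factors = improvement_factors(costs)
for name, (uninf_factor, inf_factor) in factors.items():
    print(f"{name}: uninformed x{uninf_factor:.2f}, informed x{inf_factor:.2f}")
```

The ratios come out at roughly 3.9 for informed search on Planar Slider (the “factor of 4”), roughly 2.0 for uninformed search on the same domain (the 50% decrease), and roughly 9.9 for informed search on Move-Cart-Pole (the “order of magnitude”).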
Using Search with Learned Models
• Toy Example: Hill-Car
– $7^2$ simplex-interpolated value function
– One nearest-neighbor function approximator per possible action used to learn dx/dt
– States sufficiently far away from nearest neighbor optimistically assumed to be absorbing to encourage exploration
• Average costs over first few hundred trials:
– No search: 212
– Local search: 127
– Informed global search: 155
Using Search with Learned Models
• Problems do arise when using learned models:
– Inaccuracies in models may cause global searches to fail. Not clear then if failure should be blamed on model inaccuracies or on insufficiently fine state space partitioning
– Trajectories found will be inaccurate
  • Need adaptive closed-loop controller
• Fortunately, we will get new data with which to increase the accuracy of our model
– Model approximators must be fast and accurate
Avenues for Future Research
• Extensions to nondeterministic systems?
• Higher-dimensional problems
• Better function approximators for model learning
• Variable-resolution search grids
• Optimistic value function generation?