Generalizing Plans to New Environments in Multiagent
Relational MDPs
Carlos Guestrin
Daphne Koller
Stanford University
Multiagent Coordination Examples
- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control

Common challenges:
- Multiple, simultaneous decisions
- Exponentially-large spaces
- Limited observability
- Limited communication
Real-time Strategy Game
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen
[Figure: Freecraft screenshot with a peasant, a footman, and a building labeled]
Scaling up by Generalization
- Exploit similarities between world elements
- Generalize plans: from a set of worlds to a new, unseen world
- Avoid the need to replan
- Tackle larger problems

Approach: formalize the notion of "similar" elements; compute generalizable plans.
Relational Models and MDPs
- Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy…
- Relations: Collects, Builds, Trains, Attacks…
- Instances: Peasant1, Peasant2, Footman1, Enemy1…
- Value functions defined at the class level: objects of the same class make the same contribution to the value function
- Factored MDP equivalents of PRMs [Koller, Pfeffer '98]
Relational MDPs
- Class-level transition probabilities depend on: attributes, actions, and attributes of related objects
- Class-level reward function
- An instantiation (world) fixes the number of objects and their relations
- Each instantiation yields a well-defined MDP
[Figure: class-level DBN fragment for Peasant (P to P', with action AP) and Gold (G to G'), linked by the Collects relation]
Planning in a World
- Long-term planning by solving the MDP: both the number of states and the number of actions are exponential in the number of objects
- But an RMDP world is a factored MDP, so efficient approximation is possible by exploiting structure!
Roadmap to Generalization
Solve 1 world
Compute generalizable value function
Tackle a new world
World is a Factored MDP
[Figure: DBN for one world. State variables P (peasant), F (footman), E (enemy), G (gold), and health H transition to P', F', E', G'; the decisions are AP and AF; R is the reward node. Example factored transition: P(F' | F, G, H, AF).]
Long-term Utility = Value of MDP
Value computed by linear programming:
minimize: Σx V(x)
subject to: V(x) ≥ Q(x,a), ∀ x, a

- One variable V(x) for each state x
- One constraint for each state x and action a
- Number of states and actions exponential! [Manne '60]
Approximate Value Functions
- Linear combination of restricted-domain functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99,'00] [Guestrin et al. '01]:

V(x) ≈ Σo Vo(x)

- Each Vo depends only on the state of one object and its related objects (e.g., the state of a footman and the status of the barracks)
- Must find Vo giving a good approximate value function
Single LP Solution for Factored MDPs
Approximate LP [Schweitzer and Seidmann '85]:

minimize: Σx α(x) Σo Vo(x)
subject to: Σo Vo(x) ≥ Σo Qo(x,a), ∀ x, a

- Variables: the parameters of each Vo, one set per object, so only polynomially many LP variables
- One constraint for every state and action: exponentially many LP constraints
- But Vo and Qo depend on small sets of variables/actions, so structure can be exploited as in variable elimination [Guestrin, Koller, Parr '01]
Representing Exponentially Many Constraints
Exponentially many linear constraints

0 ≥ Σo Qo(x,a) − Σo Vo(x), ∀ x, a

are equivalent to a single nonlinear constraint:

0 ≥ max_{x,a} [Σo Qo(x,a) − Σo Vo(x)]
Can use variable elimination to maximize over state space: [Bertele & Brioschi ‘72]
Variable Elimination
[Figure: elimination graph over variables A, B, C, D with pairwise factors f1(A,B), f2(A,C), f3(C,D), f4(B,D)]

max_{A,B,C,D} [f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)]
= max_{A,B,C} [f1(A,B) + f2(A,C) + max_D [f3(C,D) + f4(B,D)]]
= max_{A,B,C} [f1(A,B) + f2(A,C) + g1(B,C)]

- As in Bayes nets, maximization is exponential in the tree-width
- Here we need only 23 sum operations instead of 63
Representing the Constraints
Functions are factored, use Variable Elimination to represent constraints:
0 ≥ max_{x,a} [Σo Qo(x,a) − Σo Vo(x)]

On the example, this is:

0 ≥ max_{A,B,C} [f1(A,B) + f2(A,C) + max_D [f3(C,D) + f4(B,D)]]

Introduce new LP variables g1(B,C) for the inner maximization:

0 ≥ f1(A,B) + f2(A,C) + g1(B,C), ∀ A, B, C
g1(B,C) ≥ f3(C,D) + f4(B,D), ∀ B, C, D

The number of constraints is exponentially smaller (see the sketch below).
Roadmap to Generalization
Solve 1 world
Compute generalizable value function
Tackle a new world
Generalization
- Sample a set of worlds
- Solve one linear program for all these worlds to obtain class-level value functions
- When faced with a new problem: use the class value function, no re-planning needed
Worlds and RMDPs
Meta-level MDP: nature picks a world ω according to P(ω); each world then evolves as its own MDP.

Meta-level LP:

minimize: Σω P(ω) Vω(x0)
subject to: Vω(x) ≥ Qω(x,a), ∀ ω, x, a
Class-level Value Functions
- Approximate solution to the meta-level MDP
- Linear approximation
- Value function defined at the class level
- All instances use the same local value function
Class-level LP
minimize: Σω P(ω) Σc Σ_{o∈C} Vc(x0[o])
subject to: Σc Σ_{o∈C} Vc(x[o]) ≥ Σc Σ_{o∈C} Qc(x[o],a), ∀ ω, x, a

- Constraints for each world are represented by a factored LP
- Number of worlds is exponential or infinite: sample worlds from P(ω)
Theorem
Exponentially (infinitely) many worlds! Do we need exponentially many samples? NO!

A number of sampled worlds polynomial in Rmax, 1/ε, and ln(1/δ) suffices: the resulting value function is within ε, with probability at least 1−δ.

- Rmax is the maximum class reward
- Proof method related to [de Farias, Van Roy '02]
LP with sampled worlds
minimize: Σ_{ω∈I} Σc Σ_{o∈C} Vc(x0[o])
subject to: Σc Σ_{o∈C} Vc(x[o]) ≥ Σc Σ_{o∈C} Qc(x[o],a), ∀ ω ∈ I, x, a

- Solve the LP for the sampled worlds I, using the factored LP for each world
- Obtain a class-level value function
- New world: instantiate the value function and act (see the sketch below)
Learning Classes of Objects
Which classes of objects have the same value function?
- Plan for sampled worlds individually
- Use the resulting value functions as "training data"
- Find objects with similar values; include features of the world
- Experiments used decision tree regression (sketch below)
Summary of Generalization Algorithm
1. Model domain as Relational MDPs
2. Pick local object value functions Vo
3. Learn classes by solving some instances
4. Sample set of worlds
5. Factored LP computes a class-level value function
A New World

- When faced with a new world ω, the value function is Vω(x) = Σc Σ_{o∈C} Vc(x[o]); the Q function decomposes the same way
- At each state, choose the action maximizing Q(x,a)
- The number of actions is exponential!
- But each Qc depends only on a few objects, for example:

Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4)
Local Q function Approximation
[Figure: four agents M1 through M4; the local term Q3 of Q(A1,…,A4, X1,…,X4) is associated with Agent 3]

- Each Qi is associated with one agent
- Limited observability: agent i only observes the variables in Qi (Agent 3 observes only X2 and X3)
- Must choose a joint action maximizing Σi Qi
Use variable elimination for maximization: [Bertele & Brioschi ‘72]
Maximizing Σi Qi: Coordination Graph

- Limited communication for optimal action choice
- Communication bandwidth = induced width of the coordination graph

[Figure: coordination graph over agent actions A1 through A4, with local payoffs Q1(A1,A2), Q2(A1,A3), Q3(A3,A4), Q4(A2,A4)]

max_{A1,A2,A3,A4} [Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4)]
= max_{A1,A2,A3} [Q1(A1,A2) + Q2(A1,A3) + max_{A4} [Q3(A3,A4) + Q4(A2,A4)]]
= max_{A1,A2,A3} [Q1(A1,A2) + Q2(A1,A3) + g(A2,A3)]
Eliminating A4 produces the conditional-strategy table g(A2,A3), with entries like: "if A2 attacks and A3 defends, then A4 gets $10."
Summary of Algorithm
1. Model domain as Relational MDPs
2. Factored LP computes a class-level value function
3. Reuse class-level value function in new world
Experimental Results
- SysAdmin problem
[Figure: example network topologies: unidirectional ring, star, ring of rings; one machine is the server]
Generalizing to New Problems

[Chart: estimated policy value per agent (y-axis, roughly 3 to 4.6) on the Ring, Star, and Three legs topologies, comparing the class-based value function, the 'optimal' approximate value function, and the utopic maximum value]
Classes of Objects Discovered
Learned 3 classes: Server, Intermediate, and Leaf.
[Figure: example network with each machine labeled by its discovered class]
Learning Classes of Objects
[Chart: max-norm error of the value function (y-axis, 0 to 1.4) on the Ring, Star, and Three legs topologies, comparing "No class learning" with "Learnt classes"]
Results

- 2 Peasants, Gold, Wood, Barracks, 2 Footmen, Enemy
- Reward for dead enemy
- About 1 million state/action pairs
- Solved with the factored LP (some factors are exponential)
- Coordination graph for action selection
[with Gearhart and Kanodia]
Generalization
- 9 Peasants, Gold, Wood, Barracks, 3 Footmen, Enemy
- Reward for dead enemy
- About 3 trillion state/action pairs
- Instantiate the generalizable value function
- At run time, factors are polynomial
Coordination graph for action selection
The 3 aspects of this talk
- Scaling up collaborative multiagent planning: exploiting structure and generalization
- Factored representations and algorithms: relational MDPs, factored LP, coordination graphs
- Freecraft as a benchmark domain
Conclusions

- RMDP: a compact representation for a set of similar planning problems
- Solve a single instance with factored MDP algorithms
- Tackle sets of problems with class-level value functions: efficient sampling of worlds, learned classes of value functions
- Generalization to new domains: avoid replanning, solve larger and more complex MDPs